JP7108184B2 - Keyword extraction program, keyword extraction method and keyword extraction device

Info

Publication number: JP7108184B2
Application number: JP2018199696A
Authority: JP
Inventors: 典弘覚幸
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2018-10-24
Filing date: 2018-10-24
Publication date: 2022-07-28
Anticipated expiration: 2038-10-24
Also published as: JP2020067790A

Description

The present invention relates to a keyword extraction program, a keyword extraction method, and a keyword extraction device.

It is sometimes desirable to extract, from a conversation between users, the keywords that were important to that conversation. For example, positive keywords that contributed greatly to a customer's empathy could be extracted from a conversation between the customer and a customer service representative and used in later customer service. Likewise, cautionary keywords that were used in unfavorable customer service could be extracted from such a conversation and used in later customer service.

A digital photo frame has been proposed that performs speech recognition on a user's utterances while an image is displayed, extracts keywords, and attaches the extracted keywords to the image as tags. An empathy interpretation estimation device has also been proposed that presents a video of another user to a given user, photographs the head of that user while the video is being viewed, analyzes the actions of the two users, and estimates an empathy interpretation. In addition, an understanding state estimation device has been proposed that collects signals indicating the activity state of each conference participant, determines a communication difficulty level from each participant's utterances, and estimates each participant's level of understanding based on the activity state and the communication difficulty level.

A response quality evaluation device has also been proposed that calculates a speech rate from a user's voice, detects the temporal change of the user region from a moving image of the user captured while speaking, and calculates a response evaluation value for the user based on the speech rate and the temporal change of the user region. In addition, an information processing device has been proposed that detects the state of each of a plurality of users using sensor devices, calculates a degree of synchronization between the users based on the detected states, and changes the information presented to the users according to the degree of synchronization.

JP 2010-224715 A
JP 2015-64827 A
JP 2016-213631 A
JP 2017-162100 A
JP 2018-45676 A

Chiori Hori, Sadaoki Furui, "Speech Summarization Method Based on Word Extraction and Its Evaluation", Transactions of the Institute of Electronics, Information and Communication Engineers, J85-D-II, pp. 200-209, February 2002

However, with conventional techniques for extracting keywords from a conversation, there is room for improvement in the accuracy of extracting keywords that are important from the viewpoint of a user's psychological state. For example, if importance is determined simply from how often a keyword appears, keywords that are closely related to psychological states, such as the customer's degree of empathy or the customer service representative's customer service level, may not be extracted.

In one aspect, an object of the present invention is to provide a keyword extraction program, a keyword extraction method, and a keyword extraction device that improve the accuracy of extracting keywords that reflect a user's psychological state.

In one aspect, a keyword extraction program to be executed by a computer is provided. A keyword is detected from voice data indicating utterances made by at least one of a first user and a second user in a conversation between the first user, who is the provider of a service, and the second user, who is the recipient of the service. The timing of a first action by the first user and the timing of a second action by the second user are detected from action data indicating the actions performed by the first user and the actions performed by the second user in the conversation. An evaluation value indicating the importance of the keyword is calculated based on the relationship between the timing of the first action and the timing of the second action.

In one aspect, a keyword extraction method executed by a computer is also provided. In one aspect, a keyword extraction device having a storage unit and a processing unit is also provided.

In one aspect, the accuracy of extracting keywords that reflect a user's psychological state is improved.

A diagram illustrating an example of the keyword extraction device of the first embodiment.
A diagram showing an example of the information processing system of the second embodiment.
A block diagram showing an example of the hardware of the conversation analysis device.
A diagram showing an example of keyword extraction in the second embodiment.
A diagram showing an example of calculating keyword evaluation values.
A block diagram showing an example of the functions of the conversation analysis device of the second embodiment.
A first diagram showing an example of the tables held by the conversation analysis device.
A second diagram showing an example of the tables held by the conversation analysis device.
A flowchart showing an example of the conversation analysis procedure of the second embodiment.
A diagram showing an example of the information processing system of the third embodiment.
A diagram showing an example of the information processing system of the fourth embodiment.
A diagram showing an example of keyword extraction in the fifth embodiment.
A flowchart showing an example of the conversation analysis procedure of the fifth embodiment.
A block diagram showing an example of the functions of the conversation analysis device of the sixth embodiment.
A flowchart showing an example of the conversation analysis procedure of the sixth embodiment.

Hereinafter, the embodiments will be described with reference to the drawings.
[First embodiment]
A first embodiment will be described.

FIG. 1 is a diagram illustrating an example of the keyword extraction device of the first embodiment.
The keyword extraction device 10 of the first embodiment extracts, from a conversation between users, important keywords that reflect the psychological state of at least one of the users. For example, the keyword extraction device 10 extracts important keywords related to the customer's degree of empathy from a conversation between a customer and a customer service representative. As another example, the keyword extraction device 10 extracts important keywords related to the representative's customer service level from a conversation between a customer and a customer service representative.

The keyword extraction device 10 can also be called a computer or an information processing device. The keyword extraction device 10 may be a client device or a server device. The keyword extraction device 10 may extract important keywords in real time during a conversation between users, or may extract them as a batch process after the conversation ends.

The keyword extraction device 10 has a storage unit 11 and a processing unit 12. The storage unit 11 may be a volatile semiconductor memory such as a RAM (Random Access Memory), or a non-volatile storage such as an HDD (Hard Disk Drive) or flash memory. The processing unit 12 is, for example, a processor such as a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), or a DSP (Digital Signal Processor). However, the processing unit 12 may include special-purpose electronic circuits such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array). A set of multiple processors is sometimes called a "multiprocessor" or simply a "processor."

The storage unit 11 stores voice data 13 and action data 14. The voice data 13 and the action data 14 are records of a conversation between a first user and a second user. The first user is the provider of a service, and the second user is the recipient of the service. Service fields include, for example, work carried out through communication, such as retail, education, and medical care. The first user is a customer service representative who serves customers, such as a product explainer at a store, a lecturer at an educational institution, or a doctor or counselor at a medical institution. The second user is a customer who receives the service, such as a consumer visiting a store, a student at an educational institution, or a patient visiting a medical institution.

The voice data 13 indicates utterances made by at least one of the first user and the second user. For example, the voice data 13 is an audio signal in which the utterances of at least one of the first user and the second user were recorded using a microphone. In that case, the voice data 13 may be a recording of only the first user's utterances, a recording of only the second user's utterances, or a recording of the utterances of both users. The first user's utterances and the second user's utterances may be recorded using the same microphone or using different microphones.

The action data 14 indicates the actions performed by the first user and the actions performed by the second user in the conversation. An action is a physical action that the other user can see, such as a change in facial expression, a gesture of the head, arms, or legs, a change in line of sight, or a change in posture. Changes in facial expression include laughing. Head gestures include nodding. Changes in line of sight include looking at the other user's head. Changes in posture include leaning forward.

For example, the action data 14 is image data, such as a moving image, in which the first user and the second user were captured using an image sensor. However, the action data 14 may be sensor data generated using a sensor device other than an image sensor. For example, the action data 14 may be head gestures detected using an acceleration sensor built into a headset. The action data 14 may also be arm gestures detected using an acceleration sensor built into a wristwatch, or changes in posture detected using a pressure sensor built into a chair.

Within the action data 14, the data relating to the first user and the data relating to the second user may be generated using the same device or using different devices. For example, when the action data 14 is image data, both the first user and the second user may appear in an image captured by a single image sensor, or the two users may appear in separate images captured by different image sensors.

The processing unit 12 detects a keyword 15 from the voice data 13. For example, the processing unit 12 converts the voice data 13 into a character string (text) of the utterances by speech recognition and searches the character string for predetermined search target keywords. The search target keywords are, for example, defined in advance as a keyword list. Alternatively, for example, the processing unit 12 does not convert the entire utterance into a character string but instead uses word spotting, continuously comparing feature values of the speech signal of the utterance with feature values of the speech signals of the search target keywords, to recognize only the search target keywords directly.
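The text-based variant of this step can be sketched as follows. This is a minimal illustration, not the implementation described in the patent: it assumes that a speech recognizer (not shown) has already produced words with start times and speaker labels, and the names `Word` and `find_keywords` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One recognized word (hypothetical output format of a speech recognizer)."""
    text: str
    start: float   # seconds from the start of the conversation
    speaker: str   # e.g. "staff" or "customer"


def find_keywords(words, keyword_list):
    """Return (keyword, start, speaker) for every word found in the keyword list.

    Matching is case-insensitive; the keyword list plays the role of the
    predefined search target keywords.
    """
    targets = {k.lower() for k in keyword_list}
    return [
        (w.text.lower(), w.start, w.speaker)
        for w in words
        if w.text.lower() in targets
    ]
```

Keeping the start time and speaker with each hit lets later steps relate a keyword to nearby actions and, if desired, restrict the search to one user's utterances.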

The keyword 15 may come from the first user's utterances or from the second user's utterances. The processing unit 12 may search for the search target keywords while distinguishing between the first user's utterances and the second user's utterances, or without distinguishing between them. The processing unit 12 may also limit the search to the utterances of only one of the two users.

The processing unit 12 also detects, from the action data 14, the timing of an action 16 (first action) by the first user and the timing of an action 17 (second action) by the second user. The processing of the voice data 13 and the processing of the action data 14 may be executed in either order or in parallel. The first user's action 16 and the second user's action 17 are detected separately. For example, the processing unit 12 uses image recognition to detect, from action data 14 that is image data, physical actions of the first user and of the second user such as changes in facial expression, gestures of the head, arms, or legs, changes in line of sight, and changes in posture. When a sensor device other than an image sensor generated the action data 14, no special recognition processing may be needed.

The processing unit 12 then calculates an evaluation value 18 indicating the importance of the detected keyword 15 based on the relationship between the timing of the action 16 by the first user and the timing of the action 17 by the second user. The processing unit 12 determines, for example, whether to extract the keyword 15 as an important keyword based on the evaluation value 18. When the evaluation value 18 exceeds a predetermined first threshold, the processing unit 12 may extract the keyword 15 as a preferred keyword. When the evaluation value 18 is less than a predetermined second threshold that is smaller than the first threshold, the processing unit 12 may extract the keyword 15 as a cautionary keyword. The keyword 15 is, for example, a keyword uttered within a predetermined range of the actions 16 and 17 on the time axis.
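The two-threshold decision can be sketched as follows. The function name `classify_keyword`, the labels, and the default threshold values are illustrative assumptions; the text only requires that the second threshold be smaller than the first.

```python
def classify_keyword(evaluation_value, first_threshold=0.7, second_threshold=0.3):
    """Classify a keyword from its evaluation value.

    Returns "preferred" above the first threshold, "caution" below the
    second threshold, and None otherwise. The default thresholds are
    placeholders, not values taken from the source.
    """
    if second_threshold >= first_threshold:
        raise ValueError("second_threshold must be smaller than first_threshold")
    if evaluation_value > first_threshold:
        return "preferred"
    if evaluation_value < second_threshold:
        return "caution"
    return None
```

Keywords whose evaluation value falls between the two thresholds are left unclassified, which matches the idea that only clearly good or clearly problematic keywords are extracted.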

As the relationship between the timings of the actions 16 and 17, the processing unit 12 may detect that the actions 16 and 17 are of the same type, that the action 16 was performed first, and that the action 17 was performed within a predetermined time after the action 16. An additional condition may be that the second user performed no action within a predetermined time immediately before the action 16, that is, that the first user initiated the action. This relationship can be said to reflect the psychological state of the second user.
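The timing relationship described above can be sketched as follows. This is an illustration under stated assumptions: actions are already detected and labeled with a user, a type, and a time; the names `Action` and `find_led_sync` and the window lengths are hypothetical; and the optional condition is read as "no action of any type by the follower shortly before the leader's action".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    user: str    # "staff" (first user) or "customer" (second user)
    kind: str    # action type, e.g. "laugh" or "nod"
    time: float  # seconds from the start of the conversation


def find_led_sync(actions, leader="staff", follower="customer",
                  window=2.0, quiet=2.0):
    """Return (leader_action, follower_action) pairs that are synchronized.

    A pair qualifies when the follower performs the same kind of action
    within `window` seconds after the leader's action, and the follower
    performed no action in the `quiet` seconds just before it.
    """
    pairs = []
    for a in actions:
        if a.user != leader:
            continue
        # Optional condition: the leader must have initiated the action.
        if any(b.user == follower and a.time - quiet <= b.time < a.time
               for b in actions):
            continue
        responses = [b for b in actions
                     if b.user == follower and b.kind == a.kind
                     and a.time < b.time <= a.time + window]
        if responses:
            pairs.append((a, min(responses, key=lambda b: b.time)))
    return pairs
```

Swapping the `leader` and `follower` arguments gives the opposite direction, in which the second user's action comes first.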

For example, if the first user is a customer service representative and the second user is a customer, this relationship indicates synchronization in which an action of the same type by the customer, such as laughing or nodding, occurred in conjunction with the representative's action, such as laughing or nodding. It can therefore be inferred that the customer is closely observing the representative's actions and empathizes with what the representative is saying, which indicates that the conversation is lively. Conversely, if, for example, the action 17 is performed first and the action 16 is performed within the predetermined time after the action 17, this indicates synchronization in which an action of the same type by the representative occurred in conjunction with the customer's action. It can therefore be inferred that the representative is closely observing the customer's actions, which indicates good customer service.

If the actions 16 and 17 are of the same type, the action 16 was performed first, and the action 17 was performed within the predetermined time after the action 16, the processing unit 12 may rate a keyword 15 in the vicinity of the actions 16 and 17 highly. Rating the keyword 15 highly corresponds to the second user being in a good psychological state. If the second user is a customer, the keyword 15 may be a preferred keyword that won the customer's empathy.

On the other hand, if the action 17 was not performed within the predetermined time after the action 16, the processing unit 12 may rate a keyword 15 in the vicinity of the actions 16 and 17 low. Rating the keyword 15 low corresponds to the second user not being in a good psychological state. If the second user is a customer, the keyword 15 may be a cautionary keyword that failed to win the customer's empathy.
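One simple way to turn this high/low rating into a number is sketched below. The scoring rule (+1 for each nearby first-user action that was followed by the second user, -1 for each that was not) and the names `score_keywords` and `proximity` are assumptions made for illustration; the source does not specify this formula at this point.

```python
def score_keywords(keyword_times, leader_actions, proximity=5.0):
    """Score each keyword from the first-user actions that occurred near it.

    keyword_times:  dict mapping keyword -> list of utterance times (seconds)
    leader_actions: list of (time, followed) tuples, where `followed` is True
                    if the second user performed the same kind of action
                    within the predetermined time afterwards
    proximity:      how close (seconds) an action must be to count as
                    "in the vicinity" of the keyword (placeholder value)
    """
    scores = {}
    for keyword, times in keyword_times.items():
        score = 0
        for t in times:
            for action_time, followed in leader_actions:
                if abs(t - action_time) <= proximity:
                    score += 1 if followed else -1
        scores[keyword] = score
    return scores
```

A keyword with no action nearby keeps a score of zero, so it is treated as neither preferred nor cautionary.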

The processing unit 12 may output the extracted important keywords. For example, the processing unit 12 may save the important keywords in a storage device of the keyword extraction device 10. The processing unit 12 may also output the important keywords to an output device of the keyword extraction device 10, for example by showing them on a display, or may transmit the important keywords to another information processing device via a network.

According to the keyword extraction device 10 of the first embodiment, the keyword 15 is detected from the voice data 13, and the timing of the first user's action 16 and the timing of the second user's action 17 are detected from the action data 14. The evaluation value 18 of the keyword 15 is then calculated based on the relationship between the timing of the action 16 and the timing of the action 17. This makes it possible to accurately extract important keywords from the viewpoint of the psychological state of at least one of the first user and the second user. The important keywords extracted by the keyword extraction device 10 can therefore be used for a given purpose, such as improving customer service.

[Second embodiment]
Next, a second embodiment will be described.
FIG. 2 is a diagram showing an example of the information processing system of the second embodiment.

The information processing system of the second embodiment analyzes conversations and supports the improvement of customer service in lines of business in which customers and customer service representatives converse. This information processing system can be applied to various lines of business, such as product explanation and health guidance.

The information processing system of the second embodiment includes a management device 41 and a conversation analysis device 100, both connected to a network 40. A camera device 50 is connected to the conversation analysis device 100. The management device 41 is a terminal device used by a manager who supervises customer service representatives, such as a representative's superior. The conversation analysis device 100 is a terminal device installed at the place where the customer and the customer service representative converse. For example, the conversation analysis device 100 is installed on or near the counter at which the customer and the representative face each other. The camera device 50 is a device with a video capture function and an audio recording function. The camera device 50 is installed so that it can film and record the conversation between the customer and the representative.

While the customer and the customer service representative are conversing, the camera device 50 films them so that both fit within the image, and records audio so that the voices of both are included. The conversation analysis device 100 collects image data representing the captured video and voice data representing the recorded audio, and analyzes the conversation between the customer and the representative. Specifically, the conversation analysis device 100 detects keywords uttered by the customer or the representative from the voice data, and detects the actions of the customer and of the representative during the conversation from the image data. The conversation analysis device 100 evaluates the keywords based on the actions of the customer and the representative and extracts important keywords. The conversation analysis device 100 reports the extracted important keywords to the management device 41.

The conversation analysis by the conversation analysis device 100 and the reporting of important keywords from the conversation analysis device 100 to the management device 41 may be performed in real time while the customer service representative is working, or as a batch process after the representative's work has ended. For example, the conversation analysis device 100 analyzes the voice data and image data output by the camera device 50 in real time, determines the important keywords for each conversation segment, and transmits them to the management device 41. A conversation segment may end, for example, when service to one customer ends, when a period without speech has continued for a predetermined time or longer, or when a fixed time has elapsed since the start of the conversation. Alternatively, for example, the conversation analysis device 100 saves the voice data and image data output by the camera device 50, analyzes them together after work has ended, and transmits the important keywords to the management device 41.
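Two of the segment boundaries mentioned above, a long period without speech and a fixed elapsed time, can be sketched as follows. The function name `split_conversation`, the input format, and the default durations are assumptions for illustration; detecting the end of service to one customer is not covered.

```python
def split_conversation(utterances, max_silence=30.0, max_length=600.0):
    """Split utterances into conversation segments.

    utterances:  list of (start, end) times in seconds, sorted by start
    max_silence: a gap without speech of at least this length ends a segment
    max_length:  a segment also ends once this much time has elapsed
                 since its first utterance began
    Returns a list of segments, each a list of (start, end) tuples.
    """
    segments = []
    current = []
    for start, end in utterances:
        if current:
            silence = start - current[-1][1]   # gap since previous utterance
            elapsed = start - current[0][0]    # time since segment began
            if silence >= max_silence or elapsed >= max_length:
                segments.append(current)
                current = []
        current.append((start, end))
    if current:
        segments.append(current)
    return segments
```

In a real-time setting, important keywords would be determined and reported each time a segment is closed rather than after the whole list has been processed.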

At least some of the important keywords transmitted from the conversation analysis device 100 to the management device 41 are shown on the display of the management device 41. The manager may check the important keywords while the customer service representative is working or after work has ended.

Note that the conversation analysis device 100 corresponds to the keyword extraction device 10 of the first embodiment. The image data representing the video captured by the camera device 50 corresponds to the action data 14 of the first embodiment. The voice data representing the audio recorded by the camera device 50 corresponds to the voice data 13 of the first embodiment.

FIG. 3 is a block diagram showing an example of the hardware of the conversation analysis device.
The conversation analysis device 100 has a CPU 101, a RAM 102, an HDD 103, an image signal processing unit 104, input signal processing units 105 and 106, a medium reader 107, and a communication interface 108, all connected to a bus. The CPU 101 corresponds to the processing unit 12 of the first embodiment. The RAM 102 or the HDD 103 corresponds to the storage unit 11 of the first embodiment. The management device 41 can be implemented using similar hardware.

The CPU 101 is a processor that executes program instructions. The CPU 101 loads at least part of the programs and data stored in the HDD 103 into the RAM 102 and executes the programs. The CPU 101 may have a plurality of processor cores, and the conversation analysis device 100 may have a plurality of processors. A set of multiple processors is sometimes called a "multiprocessor" or simply a "processor."

The RAM 102 is a volatile semiconductor memory that temporarily stores the programs executed by the CPU 101 and the data the CPU 101 uses for computation. The conversation analysis device 100 may have a type of memory other than RAM, and may have a plurality of memories.

ＨＤＤ１０３は、ＯＳ（Operating System）やミドルウェアやアプリケーションソフトウェアなどのソフトウェアのプログラム、および、データを記憶する不揮発性ストレージである。なお、会話分析装置１００は、フラッシュメモリやＳＳＤ（Solid State Drive）など他の種類のストレージを備えてもよく、複数のストレージを備えてもよい。 The HDD 103 is a nonvolatile storage that stores an OS (Operating System), software programs such as middleware and application software, and data. Conversation analysis device 100 may include other types of storage such as flash memory and SSD (Solid State Drive), or may include multiple storages.

画像信号処理部１０４は、ＣＰＵ１０１からの命令に従って、会話分析装置１００に接続されたディスプレイ１１１に画像を出力する。ディスプレイ１１１としては、ＣＲＴ（Cathode Ray Tube）ディスプレイ、液晶ディスプレイ（ＬＣＤ：Liquid Crystal Display）、有機ＥＬ（ＯＥＬ：Organic Electro-Luminescence）ディスプレイなど、任意の種類のディスプレイを使用することができる。 Image signal processing unit 104 outputs an image to display 111 connected to conversation analysis device 100 according to a command from CPU 101 . As the display 111, any type of display such as a CRT (Cathode Ray Tube) display, a liquid crystal display (LCD: Liquid Crystal Display), an organic EL (OEL: Organic Electro-Luminescence) display, or the like can be used.

入力信号処理部１０５は、会話分析装置１００に接続された入力デバイス１１２から入力信号を受信する。入力デバイス１１２として、マウス、タッチパネル、タッチパッド、キーボードなど、任意の種類の入力デバイスを使用できる。また、会話分析装置１００に複数の種類の入力デバイスが接続されてもよい。 Input signal processing unit 105 receives an input signal from input device 112 connected to conversation analysis apparatus 100 . Input device 112 can be any type of input device, such as a mouse, touch panel, touchpad, keyboard, or the like. Also, a plurality of types of input devices may be connected to conversation analysis apparatus 100 .

入力信号処理部１０６は、会話分析装置１００に接続されたカメラ装置５０から画像信号および音声信号を受信する。カメラ装置５０は、イメージセンサ５１およびマイクロフォン５２を有する。イメージセンサ５１は、光を電気信号（画像信号）に変換する撮像素子である。イメージセンサ５１として、ＣＣＤ（Charge Coupled Device）イメージセンサやＣＭＯＳ（Complementary Metal Oxide Semiconductor）イメージセンサなど、任意の種類のイメージセンサを使用できる。マイクロフォン５２は、音を電気信号（音声信号）に変換する。マイクロフォン５２として、ダイナミックマイクやコンデンサマイクなど、任意の種類のマイクロフォンを使用できる。 Input signal processing unit 106 receives image signals and audio signals from camera device 50 connected to conversation analysis device 100 . Camera device 50 has image sensor 51 and microphone 52 . The image sensor 51 is an imaging device that converts light into an electrical signal (image signal). Any type of image sensor such as a CCD (Charge Coupled Device) image sensor or a CMOS (Complementary Metal Oxide Semiconductor) image sensor can be used as the image sensor 51 . The microphone 52 converts sound into an electrical signal (audio signal). Any type of microphone can be used as the microphone 52, such as a dynamic microphone or a condenser microphone.

媒体リーダ１０７は、記録媒体１１３に記録されたプログラムやデータを読み取る読み取り装置である。記録媒体１１３として、例えば、フレキシブルディスク（ＦＤ：Flexible Disk）やＨＤＤなどの磁気ディスク、ＣＤ（Compact Disc）やＤＶＤ（Digital Versatile Disc）などの光ディスク、光磁気ディスク（ＭＯ：Magneto-Optical disk）、半導体メモリなどを使用できる。媒体リーダ１０７は、例えば、記録媒体１１３から読み取ったプログラムやデータをＲＡＭ１０２またはＨＤＤ１０３に格納する。 The medium reader 107 is a reading device that reads programs and data recorded on the recording medium 113 . Examples of the recording medium 113 include magnetic disks such as flexible disks (FDs) and HDDs, optical disks such as CDs (Compact Discs) and DVDs (Digital Versatile Discs), magneto-optical disks (MOs), A semiconductor memory or the like can be used. The medium reader 107 stores programs and data read from the recording medium 113 in the RAM 102 or the HDD 103, for example.

通信インタフェース１０８は、ネットワーク４０に接続され、ネットワーク４０を介して管理装置４１と通信を行う。通信インタフェース１０８は、スイッチやルータなどの有線通信装置に接続される有線通信インタフェースでもよいし、基地局やアクセスポイントなどの無線通信装置に接続される無線通信インタフェースでもよい。 The communication interface 108 is connected to the network 40 and communicates with the management device 41 via the network 40 . The communication interface 108 may be a wired communication interface connected to a wired communication device such as a switch or router, or a wireless communication interface connected to a wireless communication device such as a base station or access point.

次に、第２の実施の形態のキーワード抽出方法について説明する。第２の実施の形態の会話分析装置１００は、顧客と接客担当者との間の会話で出現したキーワードの中から、顧客の共感を得られた好ましいキーワードを推定して重要キーワードとして抽出する。また、第２の実施の形態の会話分析装置１００は、顧客と接客担当者との間の会話で出現したキーワードの中から、顧客の共感を得られなかった要注意のキーワードを推定して重要キーワードとして抽出する。これにより、次回以降の接客において顧客の共感がより得られるように、接客担当者の接客スキルを向上させることが可能となる。 Next, a keyword extraction method according to the second embodiment will be described. From among the keywords that appear in a conversation between a customer and a customer service representative, the conversation analysis device 100 of the second embodiment estimates favorable keywords that gained the customer's empathy and extracts them as important keywords. The conversation analysis device 100 also estimates, from among the same keywords, caution-required keywords that failed to gain the customer's empathy and likewise extracts them as important keywords. This makes it possible to improve the representative's customer service skills so that more customer empathy is obtained in subsequent customer service.

図４は、第２の実施の形態のキーワード抽出例を示す図である。
顧客と接客担当者との間で会話が盛り上がっており接客担当者の話に顧客が共感しているか否かを評価するため、第２の実施の形態では、顧客の動作と接客担当者の動作との間の同期を検出する。第２の実施の形態の同期は、接客担当者から動作を開始し、その直後に顧客が同じ種類の動作を行ったという動作の連鎖である。この場合、顧客は接客担当者を注意深く見ており、接客担当者の話をポジティブに聞いていると推定される。よって、このような同期が発生しているときに顧客または接客担当者が発したキーワードは、好ましいキーワードである可能性がある。一方、このような同期が発生していないときに顧客または接客担当者が発したキーワードは、要注意のキーワードである可能性がある。 FIG. 4 is a diagram showing an example of keyword extraction according to the second embodiment.
To evaluate whether the conversation between the customer and the customer service representative is lively and whether the customer empathizes with what the representative says, the second embodiment detects synchronization between the customer's actions and the representative's actions. Synchronization in the second embodiment is a chain of actions in which the representative starts an action and, immediately afterwards, the customer performs the same type of action. In this case, it is presumed that the customer is watching the representative closely and listening positively to what the representative says. A keyword uttered by the customer or the representative while such synchronization is occurring may therefore be a favorable keyword. Conversely, a keyword uttered by the customer or the representative while no such synchronization is occurring may be a caution-required keyword.

例えば、以下に説明するシーン７１～７３を考える。
シーン７１では、接客担当者が笑うという動作を行い、その直後に顧客が笑うという動作を行っている。接客担当者の動作の直前の時間Ｓ１以内に顧客は動作を行っておらず、接客担当者の動作の直後の時間Ｓ２以内に顧客は動作を行っている。シーン７１では、接客担当者から開始して顧客の動作と接客担当者の動作とが同期しているため、顧客の共感度が大きいと判定される。すると、シーン７１の周辺で顧客または接客担当者が発した「レスポンス」というキーワードの評価値は高くなる。 For example, consider the scenes 71-73 described below.
In scene 71, the customer service representative laughs, and immediately afterwards the customer laughs. The customer performs no action within the time S1 immediately before the representative's action, and the customer performs an action within the time S2 immediately after the representative's action. In scene 71, the customer's action and the representative's action are synchronized, starting from the representative, so the customer's degree of empathy is determined to be high. The evaluation value of the keyword "response" uttered by the customer or the representative around scene 71 therefore becomes high.

シーン７２では、接客担当者が笑うという動作を行ったものの、接客担当者の動作の直前の時間Ｓ１以内に顧客は動作を行っておらず、接客担当者の動作の直後の時間Ｓ２以内にも顧客は動作を行っていない。シーン７２では、顧客の動作と接客担当者の動作とが同期していないため、顧客の共感度が小さいと判定される。すると、シーン７２の周辺で顧客または接客担当者が発した「新機能」というキーワードの評価値は低くなる。 In scene 72, the customer service representative laughs, but the customer performs no action within the time S1 immediately before the representative's action, nor within the time S2 immediately after it. In scene 72, the customer's action and the representative's action are not synchronized, so the customer's degree of empathy is determined to be low. The evaluation value of the keyword "new function" uttered by the customer or the representative around scene 72 therefore becomes low.

シーン７３では、顧客がうなずくという動作を行い、その直後に接客担当者がうなずくという動作を行っている。接客担当者の動作の直前の時間Ｓ１以内に顧客は動作を行っており、接客担当者の動作の直後の時間Ｓ２以内に顧客は動作を行っていない。シーン７３では、接客担当者から開始して顧客の動作と接客担当者の動作とが同期しているわけではなく、顧客の共感度が小さいと判定される。すると、シーン７３の周辺で顧客または接客担当者が発した「画質」というキーワードの評価値は低くなる。 In scene 73, the customer nods, and immediately afterwards the customer service representative nods. The customer performs an action within the time S1 immediately before the representative's action, and performs no action within the time S2 immediately after it. In scene 73, the actions are not synchronized starting from the representative, so the customer's degree of empathy is determined to be low. The evaluation value of the keyword "image quality" uttered by the customer or the representative around scene 73 therefore becomes low.

会話分析装置１００は、評価値が閾値Ｔ１より大きいキーワードを好ましいキーワードと推定し、重要キーワードとして抽出する。また、会話分析装置１００は、評価値が閾値Ｔ２より小さいキーワードを要注意のキーワードと推定し、同様に重要キーワードとして抽出する。閾値Ｔ１，Ｔ２は予め決められており、Ｔ１＞Ｔ２である。上記の例では、「レスポンス」が重要キーワードとして抽出される可能性がある。 Conversation analysis device 100 presumes that a keyword whose evaluation value is larger than threshold value T1 is a preferable keyword, and extracts it as an important keyword. In addition, conversation analysis device 100 presumes keywords with evaluation values smaller than threshold value T2 to be caution-required keywords, and similarly extracts them as important keywords. Thresholds T1 and T2 are predetermined, and T1>T2. In the above example, "response" may be extracted as an important keyword.
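The two-threshold rule described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `extract_important_keywords` and the numeric evaluation values are assumptions made for the example, not values given in the description.

```python
from typing import Dict, List, Tuple


def extract_important_keywords(
    scores: Dict[str, float], t1: float, t2: float
) -> Tuple[List[str], List[str]]:
    """Return (favorable, caution_required) keywords for thresholds T1 > T2."""
    if not t1 > t2:
        raise ValueError("T1 must be greater than T2")
    # Evaluation value above T1: presumed favorable keyword.
    favorable = [k for k, v in scores.items() if v > t1]
    # Evaluation value below T2: presumed caution-required keyword.
    caution = [k for k, v in scores.items() if v < t2]
    return favorable, caution


# Hypothetical evaluation values; the description gives no numbers.
example_scores = {"response": 0.9, "new function": 0.2, "image quality": 0.4}
print(extract_important_keywords(example_scores, t1=0.8, t2=0.3))
```

Keywords whose evaluation value lies between T2 and T1 fall into neither list and are not treated as important keywords.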

ここで、会話分析装置１００は、画像データから顧客と接客担当者それぞれの動作を検出することになる。検出すべき動作は、表情の変化、頭部や腕や足のジェスチャ、視線の変更、姿勢の変更など、視認可能な身体的動作である。表情の変化には笑うことが含まれる。頭部のジェスチャにはうなずくことが含まれる。視線の変更には相手の頭部を見ることが含まれる。姿勢の変更には前のめりになることが含まれる。 Here, the conversation analysis device 100 detects the actions of the customer and of the customer service representative from the image data. The actions to be detected are visible physical actions such as changes in facial expression, gestures of the head, arms, or legs, changes in line of sight, and changes in posture. Changes in facial expression include laughing. Head gestures include nodding. Changes in line of sight include looking at the other person's head. Changes in posture include leaning forward.

会話分析装置１００は、画像認識により画像データから顧客と接客担当者を認識する。例えば、会話分析装置１００には接客担当者の容姿の特徴情報が予め登録されており、その特徴情報に基づいて接客担当者が認識される。その場合、接客担当者以外の人物が顧客として認識される。また、会話分析装置１００は、画像認識により画像データから顧客と接客担当者それぞれの動作の種類を認識する。このとき、うなずきの大きさや腕のジェスチャの大きさなど、動作の大きさを併せて認識してもよい。 The conversation analysis device 100 recognizes a customer and a customer service representative from image data by image recognition. For example, feature information of the appearance of the person in charge of customer service is registered in advance in the conversation analysis device 100, and the person in charge of customer service is recognized based on the feature information. In that case, a person other than the person in charge of serving customers is recognized as a customer. Further, the conversation analysis device 100 recognizes the types of actions of each of the customer and the customer service representative from the image data through image recognition. At this time, the size of the motion, such as the size of the nod or the size of the gesture of the arm, may be recognized together.

動作の検出には、特許文献２（特開２０１５－６４８２７号公報）に記載された技術を用いてもよい。例えば、表情について、会話分析装置１００は、画像データの各フレームから目と口の輪郭を抽出し、フレーム間における輪郭の変化から表情の変化を判定する。また、例えば、うなずきについて、会話分析装置１００は、画像データの各フレームから目と鼻と口の位置を抽出し、フレーム間における目と鼻と口の位置の変化からうなずきを判定する。動作の大きさは、その変化量から判定することができる。 The technique described in Patent Document 2 (Japanese Patent Application Laid-Open No. 2015-64827) may be used for action detection. For example, for facial expressions, the conversation analysis device 100 extracts the contours of the eyes and mouth from each frame of the image data and determines a change in facial expression from the change in the contours between frames. Likewise, for nodding, the conversation analysis device 100 extracts the positions of the eyes, nose, and mouth from each frame of the image data and determines a nod from the change in those positions between frames. The magnitude of an action can be determined from the amount of change.

また、会話分析装置１００は、音声認識により音声データからキーワードを認識する。このとき、会話分析装置１００は、顧客の発話と接客担当者の発話とを区別して認識してもよいし、両者を区別せずに認識してもよい。顧客の発話と接客担当者の発話とを区別する方法として、例えば、接客担当者の声質の特徴情報を予め会話分析装置１００に登録しておき、異なる２つの声質の発話のうち接客担当者の発話を先に判定して他方の発話を顧客の発話とみなす方法が考えられる。また、顧客の発話と接客担当者の発話とを区別する方法として、例えば、録音時の音声の到来方向から判定する方法も考えられる。 The conversation analysis device 100 also recognizes keywords from the audio data by speech recognition. In doing so, the conversation analysis device 100 may recognize the customer's utterances and the customer service representative's utterances separately, or may recognize them without distinguishing between the two. One conceivable way to distinguish them is to register feature information on the representative's voice quality in the conversation analysis device 100 in advance, identify the representative's utterances first among utterances of two different voice qualities, and regard the remaining utterances as the customer's. Another conceivable way is to distinguish them by the direction from which the sound arrived during recording.

キーワードの検出には、特許文献１（特開２０１０－２２４７１５号公報）に記載された技術を用いてもよい。例えば、会話分析装置１００には、顧客の共感度に影響を与える可能性がある検索対象キーワードが予め登録されている。会話分析装置１００は、音声データが示す音声波形をフーリエ変換などにより音声特徴情報に変換し、予め用意した音声認識モデルに音声特徴情報を入力して単語列に変換し、単語列の中から検索対象キーワードを検索する。ただし、会話分析装置１００は、発話全体を単語列に変換せずに、ワードスポッティングにより検索対象キーワードのみを直接検出してもよい。 The technique described in Patent Document 1 (Japanese Unexamined Patent Application Publication No. 2010-224715) may be used for keyword detection. For example, in the conversation analysis device 100, search target keywords that may affect the customer's empathy are registered in advance. The conversation analysis device 100 converts the speech waveform indicated by the speech data into speech feature information by Fourier transform or the like, inputs the speech feature information into a speech recognition model prepared in advance, converts it into a word string, and searches from the word string. Search for the target keyword. However, conversation analysis device 100 may directly detect only the search target keyword by word spotting without converting the entire utterance into a word string.

次に、キーワード評価値の算出方法について説明する。
図５は、キーワード評価値の算出例を示す図である。
まず、会話分析装置１００は、顧客と接客担当者を録画した画像データを用いて、所定の時間間隔でシーン評価値を算出する。第２の実施の形態では、シーン評価値は顧客と接客担当者との間のその時点の会話の盛り上がりを示しており、顧客の共感度に相当する。シーン評価値が大きいほど会話の盛り上がりが大きく、共感度が大きいと推定される。シーン評価値が小さいほど会話の盛り上がりが小さく、共感度が小さいと推定される。 Next, a method of calculating the keyword evaluation value will be described.
FIG. 5 is a diagram showing an example of calculating a keyword evaluation value.
First, the conversation analysis device 100 calculates a scene evaluation value at predetermined time intervals using image data in which a customer and a customer service representative are recorded. In the second embodiment, the scene evaluation value indicates the liveliness of the conversation between the customer and the customer service representative at that time, and corresponds to the customer's degree of empathy. It is estimated that the greater the scene evaluation value, the greater the liveliness of the conversation, and the greater the degree of empathy. It is estimated that the smaller the scene evaluation value, the smaller the liveliness of the conversation and the smaller the degree of empathy.

そして、会話分析装置１００は、音声データから抽出されたキーワードについて、当該キーワードが発せられた時刻の周辺のシーン評価値を用いてキーワード評価値を算出する。キーワード評価値が大きいキーワードほど、顧客の共感を得られたキーワードである可能性が高い。キーワード評価値が小さいキーワードほど、顧客の共感を得られなかったキーワードである可能性が高い。会話分析装置１００は、キーワード評価値が閾値Ｔ１を超えるキーワードを重要キーワードとして抽出する。また、会話分析装置１００は、キーワード評価値が閾値Ｔ２未満のキーワードも重要キーワードとして抽出する。 Conversation analysis device 100 then calculates a keyword evaluation value for the keyword extracted from the voice data using the scene evaluation values around the time when the keyword was uttered. A keyword with a higher keyword evaluation value is more likely to be a keyword that has gained customer sympathy. A keyword with a lower keyword evaluation value is more likely to be a keyword that did not gain customer sympathy. Conversation analysis device 100 extracts keywords whose keyword evaluation value exceeds threshold value T1 as important keywords. Conversation analysis device 100 also extracts keywords whose keyword evaluation value is less than threshold value T2 as important keywords.

シーン評価値の算出では、会話分析装置１００は、ある時刻を中心にして時間Ｓ０前から時間Ｓ０後までの区間（前後の時間Ｓ０の区間）をスライディングウィンドウ８１として設定する。スライディングウィンドウ８１の位置は、時間Δｔずつずらしていくことになる。時間Δｔは、例えば、１フレーム時間から１秒程度とする。時間Ｓ０は、例えば、１分から２分程度とする。スライディングウィンドウ８１の中心時刻に対して１つのシーン評価値が算出されるため、時間Δｔ間隔でシーン評価値が算出されることになる。 In calculating the scene evaluation value, conversation analysis device 100 sets a section from before time S0 to after time S0 around a certain time (section of time S0 before and after) as sliding window 81 . The position of the sliding window 81 is shifted by time Δt. The time Δt is, for example, about 1 frame time to 1 second. The time S0 is, for example, approximately 1 to 2 minutes. Since one scene evaluation value is calculated for the center time of the sliding window 81, the scene evaluation value is calculated at intervals of time Δt.
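As a rough illustration of how the sliding window 81 moves, the following Python sketch enumerates window center times spaced Δt apart and returns the interval covered by each window. The function names and the handling of the recording boundaries are assumptions for the example; the description does not specify how windows near the start or end of a recording are treated.

```python
from typing import List, Tuple


def window_centers(start: float, end: float, dt: float) -> List[float]:
    """Center times of the sliding window, spaced dt apart over [start, end]."""
    if dt <= 0:
        raise ValueError("dt must be positive")
    if end < start:
        return []
    # Small epsilon guards against floating-point error in the division.
    count = int((end - start) / dt + 1e-9) + 1
    return [start + i * dt for i in range(count)]


def window_bounds(center: float, s0: float) -> Tuple[float, float]:
    """Interval from S0 before the center time to S0 after it."""
    return (center - s0, center + s0)


# Example: one scene evaluation value per second, with S0 = 60 seconds.
for t in window_centers(0.0, 3.0, 1.0):
    print(t, window_bounds(t, 60.0))
```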

スライディングウィンドウ８１の中で、接客担当者が動作を行った時刻をＦ（ｘ）とし、時刻Ｆ（ｘ）における動作の重みをｗ（ｘ）とする。重みｗ（ｘ）は、時刻Ｆ（ｘ）の直前の時間Ｓ１の間における顧客の動作に基づいて決定される。時間Ｓ１は、例えば、１秒から２秒程度である。直前の時間Ｓ１の間に顧客が動作を行っていない場合は重みｗ（ｘ）＝ｗ１とし、直前の時間Ｓ１の間に顧客が動作を行っている場合は重みｗ（ｘ）＝ｗ２とする。ただし、重みｗ１と重みｗ２の大小関係は、ｗ１＞ｗ２である。 Within the sliding window 81, let F(x) be a time at which the customer service representative performed an action, and let w(x) be the weight of the action at time F(x). The weight w(x) is determined based on the customer's actions during the time S1 immediately preceding time F(x). The time S1 is, for example, about 1 to 2 seconds. If the customer performed no action during the immediately preceding time S1, the weight is w(x) = w1; if the customer performed an action during that time, the weight is w(x) = w2. Here the weights satisfy w1 > w2.

また、スライディングウィンドウ８１に属する各動作を、顧客の同期の有無に応じて集合ｒ１，ｒ２に分類する。時刻Ｆ（ｘ）の直後の時間Ｓ２の間に顧客が同じ種類の動作を行っている場合、すなわち、同期ありの場合、接客担当者の動作は集合ｒ１に分類される。直後の時間Ｓ２の間に顧客が同じ種類の動作を行っていない場合、すなわち、同期なしの場合、接客担当者の動作は集合ｒ２に分類される。時間Ｓ２は、例えば、１秒から２秒程度であり時間Ｓ１と同じでもよい。また、下記で使用する係数ａの値を予め決めておく。係数ａの値は１未満の実数（ａ＜１）であり、負の値であってもよい。 Each action belonging to the sliding window 81 is also classified into set r1 or set r2 according to whether the customer synchronizes with it. If the customer performs the same type of action during the time S2 immediately after time F(x), that is, if there is synchronization, the representative's action is classified into set r1. If the customer does not perform the same type of action during the immediately following time S2, that is, if there is no synchronization, the representative's action is classified into set r2. The time S2 is, for example, about 1 to 2 seconds and may be the same as the time S1. The value of the coefficient a used below is also determined in advance. The coefficient a is a real number less than 1 (a < 1) and may be negative.
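The weighting and classification rules can be sketched as follows. This is a minimal Python sketch under stated assumptions: the `Action` record, the actor labels "staff" and "customer", the function name, and the choice of which interval endpoints are inclusive are all hypothetical, since the description does not fix them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Action:
    time: float  # seconds from the start of the recording
    actor: str   # "staff" or "customer" (labels assumed for this sketch)
    kind: str    # e.g. "laugh", "nod"


def weigh_and_classify(
    actions: List[Action], s1: float, s2: float, w1: float, w2: float
) -> List[Tuple[float, float, bool]]:
    """For each staff action return (F(x), w(x), synced).

    synced=True corresponds to set r1, synced=False to set r2.
    """
    customer = [a for a in actions if a.actor == "customer"]
    result = []
    for a in actions:
        if a.actor != "staff":
            continue
        f = a.time
        # Any customer action within S1 just before F(x) lowers the weight to w2.
        preceded = any(f - s1 <= c.time < f for c in customer)
        # Same kind of customer action within S2 just after F(x) means synchronization.
        synced = any(f < c.time <= f + s2 and c.kind == a.kind for c in customer)
        result.append((f, w2 if preceded else w1, synced))
    return result


# The three scenes of the example, with hypothetical times and weights.
scenes = [
    Action(10.0, "staff", "laugh"), Action(11.0, "customer", "laugh"),  # scene 71
    Action(50.0, "staff", "laugh"),                                     # scene 72
    Action(89.0, "customer", "nod"), Action(90.0, "staff", "nod"),      # scene 73
]
print(weigh_and_classify(scenes, s1=2.0, s2=2.0, w1=1.0, w2=0.5))
```

With these inputs, only the staff action of scene 71 is classified as synchronized, and only the staff action of scene 73 receives the lower weight w2.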

このようなスライディングウィンドウ８１から、例えば、中心時刻のシーン評価値Ｖｔは数式（１）のように算出される。すなわち、集合ｒ１に属する動作の重みと、集合ｒ２に属する動作の重みに係数ａを乗じたものについての平均値が、シーン評価値となる。係数ａの値は１未満であるため、スライディングウィンドウ８１に属する接客担当者の動作のうち、顧客の動作と同期しているものの割合が高いほど、シーン評価値は大きくなる。 From such a sliding window 81, for example, the scene evaluation value Vt at the center time is calculated as shown in Equation (1). That is, the scene evaluation value is the average of the weights of the motions belonging to the set r1 and the weights of the motions belonging to the set r2 multiplied by the coefficient a. Since the value of the coefficient a is less than 1, the scene evaluation value increases as the percentage of the customer service clerk's actions belonging to the sliding window 81 that are synchronized with the customer's actions increases.
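Equation (1) itself is not reproduced in this text. From the verbal description above (the average of the weights of the actions in r1 and of the weights of the actions in r2 multiplied by the coefficient a), it can be reconstructed as follows. This is a reconstruction rather than a copy of the published formula, and |r1| and |r2|, denoting the number of actions in each set, are notation introduced here.

```latex
V_t = \frac{\displaystyle \sum_{x \in r_1} w(x) \;+\; a \sum_{x \in r_2} w(x)}{|r_1| + |r_2|}
\qquad (a < 1)
```

For example, with equal weights w(x) = 1 and a = 0, V_t is simply the fraction of the representative's actions in the window that the customer synchronized with.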

ただし、接客担当者の動作と顧客の動作とが同期している場合に、接客担当者の動作から顧客の動作までの遅延時間を更に考慮してシーン評価値を算出することも可能である。時刻Ｆ（ｘ）からの遅延時間をＥ（ｘ）とし、係数ｂの値を予め決めておく。係数ｂの値は正の実数である。この場合、例えば、中心時刻のシーン評価値Ｖｔは数式（２）のように算出される。数式（２）では、集合ｒ１に属する動作の重みが遅延時間Ｅ（ｘ）と係数ｂによって補正される。スライディングウィンドウ８１において、接客担当者の動作から顧客の動作までの遅延時間が短いほど、シーン評価値は大きくなる。 However, when the customer service representative's action and the customer's action are synchronized, the scene evaluation value can also be calculated by further taking into account the delay time from the representative's action to the customer's action. Let E(x) be the delay time from time F(x), and determine the value of the coefficient b in advance. The coefficient b is a positive real number. In this case, for example, the scene evaluation value Vt at the center time is calculated as shown in Equation (2). In Equation (2), the weights of the actions belonging to set r1 are corrected by the delay time E(x) and the coefficient b. In the sliding window 81, the shorter the delay time from the representative's action to the customer's action, the larger the scene evaluation value.

また、接客担当者の動作と顧客の動作とが同期している場合に、顧客の動作の大きさを更に考慮してシーン評価値を算出することも可能である。接客担当者の動作と同期する顧客の動作の大きさをＤ（ｘ）とし、係数ｃの値を予め決めておく。大きさＤ（ｘ）は、頭部の位置の変化量や腕の移動量など、画像データから認識される単位時間当たりの位置の変化量を示し、変化量が大きいほどＤ（ｘ）も大きい値をとる。係数ｃの値は正の実数である。この場合、例えば、中心時刻のシーン評価値Ｖｔは数式（３）のように算出される。数式（３）では、集合ｒ１に属する動作の重みが大きさＤ（ｘ）と係数ｃによって補正される。スライディングウィンドウ８１において、接客担当者の動作と同期する顧客の動作が大きいほど、シーン評価値は大きくなる。なお、数式（３）では遅延時間Ｅ（ｘ）も考慮しているが、遅延時間Ｅ（ｘ）を考慮しないようにしてもよい。 Likewise, when the customer service representative's action and the customer's action are synchronized, the scene evaluation value can also be calculated by further taking into account the magnitude of the customer's action. Let D(x) be the magnitude of the customer's action that is synchronized with the representative's action, and determine the value of the coefficient c in advance. The magnitude D(x) indicates the amount of positional change per unit time recognized from the image data, such as the change in head position or the amount of arm movement; the larger the change, the larger the value of D(x). The coefficient c is a positive real number. In this case, for example, the scene evaluation value Vt at the center time is calculated as shown in Equation (3). In Equation (3), the weights of the actions belonging to set r1 are corrected by the magnitude D(x) and the coefficient c. In the sliding window 81, the larger the customer's action that is synchronized with the representative's action, the larger the scene evaluation value. Although Equation (3) also takes the delay time E(x) into account, the delay time E(x) may be left out.
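Equations (1) to (3) are not reproduced in this text, so the following Python sketch is only one possible reading. The averaging over sets r1 and r2 follows the verbal description; the specific correction factor `(1 + c*D(x)) / (1 + b*E(x))` is purely an assumption chosen so that a shorter delay and a larger magnitude both increase the value, and it should not be taken as the patent's actual formula. The names `StaffAction` and `scene_value` are likewise hypothetical.

```python
from typing import Iterable, NamedTuple


class StaffAction(NamedTuple):
    weight: float           # w(x): w1 or w2
    synced: bool            # True -> set r1, False -> set r2
    delay: float = 0.0      # E(x) in seconds, used only when synced
    magnitude: float = 0.0  # D(x), used only when synced


def scene_value(actions: Iterable[StaffAction], a: float,
                b: float = 0.0, c: float = 0.0) -> float:
    """Scene evaluation value V_t for the staff actions inside one window."""
    items = list(actions)
    if not items:
        return 0.0  # assumption: a window without staff actions scores zero
    total = 0.0
    for x in items:
        if x.synced:
            # ASSUMED correction form; with b = c = 0 this is the plain weight.
            total += x.weight * (1.0 + c * x.magnitude) / (1.0 + b * x.delay)
        else:
            total += a * x.weight
    return total / len(items)


window = [StaffAction(1.0, True, delay=0.5, magnitude=2.0),
          StaffAction(1.0, False)]
print(scene_value(window, a=0.0))                # delay and magnitude ignored
print(scene_value(window, a=0.0, b=1.0, c=0.5))  # assumed corrections applied
```

With `b = c = 0` the function reduces to the plain weighted average over r1 and r2 described for Equation (1).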

キーワード評価値の算出では、会話分析装置１００は、キーワードが発せられた時刻Ｇ（ｙ）を中心にして時間Ｓ３前から時間Ｓ３後までの区間（前後の時間Ｓ３の区間）をウィンドウ８２として設定する。時間Ｓ３は、例えば、２分から数分程度とする。一区切りの会話（例えば、一人の顧客に対する接客が始まってから終了するまでの一連の会話）の中で、同じキーワードが複数回発せられることがある。ここでは、１つのキーワードに着目し、一区切りの会話の中で当該キーワードがＹ＋１回発せられたとする。ウィンドウ８２は、ｙ＝０，１，…，Ｙそれぞれに対して設定される。 In calculating the keyword evaluation value, the conversation analysis device 100 sets, as a window 82, the interval from the time S3 before to the time S3 after the time G(y) at which the keyword was uttered (the interval of time S3 on either side). The time S3 is, for example, about two minutes to several minutes. The same keyword may be uttered multiple times within one segment of conversation (for example, the series of exchanges from the start to the end of serving one customer). Here, attention is paid to one keyword, and the keyword is assumed to be uttered Y+1 times in one segment of conversation. A window 82 is set for each of y = 0, 1, ..., Y.

着目するキーワードの１回の出現に対して、ウィンドウ８２の範囲内にあるシーン評価値の平均値をＨ（ｙ）とする。すると、例えば、一区切りの会話における当該キーワードのキーワード評価値Ｖｋは数式（４）のように算出される。すなわち、キーワードが発せられた各時刻の周辺のキーワード評価値の平均値を、当該キーワードの複数回の出現の間で平均化したものが、当該キーワードのキーワード評価値となる。 For one occurrence of the keyword of interest, let H(y) be the average of the scene evaluation values within the range of the window 82. Then, for example, the keyword evaluation value Vk of the keyword in one segment of conversation is calculated as shown in Equation (4). That is, the keyword evaluation value of the keyword is obtained by taking the average of the scene evaluation values around each time at which the keyword was uttered and averaging these averages over the multiple occurrences of the keyword.
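Equation (4) is not reproduced in this text. Since the keyword is uttered Y+1 times (y = 0, 1, ..., Y) and the keyword evaluation value is the average of H(y) over those occurrences, the formula can be reconstructed from the description as:

```latex
V_k = \frac{1}{Y+1} \sum_{y=0}^{Y} H(y)
```

For a keyword uttered only once (Y = 0), V_k is simply H(0), the average scene evaluation value within the single window 82.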

キーワード評価値が閾値Ｔ１を超える場合、当該キーワードは重要キーワードである。また、キーワード評価値が閾値Ｔ２未満である場合、当該キーワードは重要キーワードである。会話分析装置１００は、抽出した重要キーワードを管理装置４１に送信する。管理装置４１は、会話分析装置１００から受信した重要キーワードの全部または一部をディスプレイに表示する。例えば、管理装置４１は、受信した重要キーワードのうちキーワード評価値が大きい方からＮ個（上位Ｎ件）の重要キーワードを表示する。また、管理装置４１は、受信した重要キーワードのうちキーワード評価値が小さい方からＮ個（下位Ｎ件）の重要キーワードを表示する。Ｎは予め決めておく１以上の整数である。ただし、会話分析装置１００が、重要キーワードを上位Ｎ件と下位Ｎ件に絞り込んでもよい。 If the keyword evaluation value exceeds the threshold T1, the keyword is an important keyword. Also, if the keyword evaluation value is less than the threshold T2, the keyword is an important keyword. Conversation analysis device 100 transmits the extracted important keywords to management device 41 . Management device 41 displays all or part of the important keywords received from conversation analysis device 100 on the display. For example, the management device 41 displays the N (top N) important keywords having the highest keyword evaluation value among the received important keywords. In addition, the management device 41 displays the N (lowest N) important keywords among the received important keywords having the lowest keyword evaluation values. N is a predetermined integer of 1 or more. However, the conversation analysis device 100 may narrow down the important keywords to the top N items and the bottom N items.
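The top-N and bottom-N display rule can be sketched as below. The function name and the sample values are assumptions for illustration; the input is assumed to contain only keywords already selected as important keywords by the thresholds T1 and T2.

```python
from typing import Dict, List, Tuple


def top_and_bottom(scores: Dict[str, float], n: int) -> Tuple[List[str], List[str]]:
    """Return (top N, bottom N) important keywords by keyword evaluation value."""
    if n < 1:
        raise ValueError("N must be an integer of 1 or more")
    ascending = sorted(scores.items(), key=lambda kv: kv[1])
    bottom = [k for k, _ in ascending[:n]]     # smallest values first
    top = [k for k, _ in ascending[::-1][:n]]  # largest values first
    return top, bottom


# Hypothetical important keywords and their evaluation values.
important = {"response": 0.92, "warranty": 0.85, "new function": 0.15, "price": 0.05}
print(top_and_bottom(important, n=1))
```

If fewer than 2N important keywords exist, the two lists can overlap; the sketch does not deduplicate them.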

次に、会話分析装置１００の機能および処理手順について説明する。
図６は、第２の実施の形態の会話分析装置の機能例を示すブロック図である。
会話分析装置１００は、音声記憶部１２１、画像記憶部１２２、キーワード記憶部１２３および評価結果記憶部１２４を有する。これらの記憶部は、例えば、ＲＡＭ１０２またはＨＤＤ１０３の記憶領域を用いて実現される。また、会話分析装置１００は、キーワード検出部１２５、動作検出部１２６、シーン評価部１２７およびキーワード評価部１２８を有する。これらの処理部は、例えば、プログラムを用いて実現される。 Next, the functions and processing procedures of conversation analysis device 100 will be described.
FIG. 6 is a block diagram showing a functional example of the conversation analysis device of the second embodiment.
Conversation analysis device 100 has voice storage unit 121 , image storage unit 122 , keyword storage unit 123 and evaluation result storage unit 124 . These storage units are implemented using storage areas of the RAM 102 or the HDD 103, for example. Conversation analysis device 100 also has keyword detection section 125 , action detection section 126 , scene evaluation section 127 and keyword evaluation section 128 . These processing units are implemented using, for example, programs.

音声記憶部１２１は、カメラ装置５０から受信した音声信号を含む音声データを記憶する。画像記憶部１２２は、カメラ装置５０から受信した画像信号を含む画像データを記憶する。キーワード記憶部１２３は、検索対象キーワードを記憶する。検索対象キーワードは予め指定されている。管理者が検索対象キーワードを追加または削除できるようにしてもよい。評価結果記憶部１２４は、キーワードの評価結果を記憶する。評価結果は、音声データから抽出された重要キーワードとその順位とを含む。 The audio storage unit 121 stores audio data including audio signals received from the camera device 50 . The image storage unit 122 stores image data including image signals received from the camera device 50 . The keyword storage unit 123 stores search target keywords. A search target keyword is specified in advance. The administrator may be allowed to add or delete search target keywords. The evaluation result storage unit 124 stores keyword evaluation results. The evaluation results include the important keywords extracted from the voice data and their ranks.

キーワード検出部１２５は、音声記憶部１２１に記憶された音声データを、音声認識により単語列に変換する。キーワード検出部１２５は、キーワード記憶部１２３に記憶された検索対象キーワードを単語列の中から検出する。キーワードの検出結果は、検出したキーワードと当該キーワードが出現する時刻とを含む。 The keyword detection unit 125 converts the voice data stored in the voice storage unit 121 into a word string by voice recognition. The keyword detection unit 125 detects the search target keyword stored in the keyword storage unit 123 from the word string. The keyword detection result includes the detected keyword and the time when the keyword appears.
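A minimal Python sketch of the search step follows, assuming that speech recognition has already produced a sequence of (time, word) pairs and that each search target keyword is a single recognized word. Both assumptions, as well as the function name, are made only for this example; the speech recognition itself is outside the sketch.

```python
from typing import Iterable, List, Tuple


def detect_keywords(
    words: Iterable[Tuple[float, str]], search_keywords: Iterable[str]
) -> List[Tuple[float, str]]:
    """Return (time, keyword) for every recognized word that is a registered keyword."""
    registered = set(search_keywords)
    return [(t, w) for t, w in words if w in registered]


# Hypothetical recognizer output: (time in seconds, word).
recognized = [(1.0, "the"), (2.0, "response"), (3.5, "response"), (4.0, "price")]
print(detect_keywords(recognized, ["response", "image quality"]))
```

The returned pairs correspond to the detection result described above: the detected keyword together with the time at which it appears.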

動作検出部１２６は、画像記憶部１２２に記憶された画像データに含まれる各フレームから、画像認識により顧客と接客担当者を認識する。動作検出部１２６は、各フレームから接客担当者が写った領域の特徴情報を抽出し、フレーム間の特徴情報の変化に基づいて接客担当者の動作を検出する。また、動作検出部１２６は、各フレームから顧客が写った領域の特徴情報を抽出し、フレーム間の特徴情報の変化に基づいて顧客の動作を検出する。動作の検出結果は、時刻と動作主体と動作の種類と動作の大きさを含む。 The motion detection unit 126 recognizes the customer and the customer service representative by image recognition from each frame included in the image data stored in the image storage unit 122 . The motion detection unit 126 extracts the characteristic information of the area in which the customer service representative is captured from each frame, and detects the customer service representative's movement based on the change in the characteristic information between the frames. In addition, the motion detection unit 126 extracts the feature information of the area in which the customer is captured from each frame, and detects the motion of the customer based on changes in the feature information between frames. The motion detection result includes the time, the subject of the motion, the type of motion, and the magnitude of the motion.

シーン評価部１２７は、動作検出部１２６による動作の検出結果を用いて、時間Δｔ間隔でシーン評価値を算出する。前述のように、シーン評価部１２７は、接客担当者の動作時刻を基準にして、その直前の時間Ｓ１の間に顧客の動作が生じているか否か、および、その直後の時間Ｓ２の間に同じ種類の顧客の動作が生じているか否かを判定する。シーン評価部１２７は、このような接客担当者の動作と顧客の動作の間の同期状況に基づいてシーン評価値を算出する。同期状況の評価では、接客担当者の動作から顧客の動作までの遅延時間や、顧客の動作の大きさを更に考慮してもよい。シーン評価結果は、複数の時刻と当該複数の時刻に対応する複数のシーン評価値とを含む。 The scene evaluation unit 127 calculates a scene evaluation value at intervals of time Δt using the action detection result produced by the action detection unit 126. As described above, taking the time of the customer service representative's action as the reference, the scene evaluation unit 127 determines whether a customer action occurs during the time S1 immediately before it and whether a customer action of the same type occurs during the time S2 immediately after it. The scene evaluation unit 127 calculates the scene evaluation value based on this state of synchronization between the representative's actions and the customer's actions. The evaluation of the synchronization state may further take into account the delay time from the representative's action to the customer's action and the magnitude of the customer's action. The scene evaluation result includes a plurality of times and a plurality of scene evaluation values corresponding to those times.

キーワード評価部１２８は、キーワード検出部１２５によるキーワードの検出結果とシーン評価部１２７によるシーン評価結果を用いて、検出されたキーワードそれぞれのキーワード評価値を算出する。前述のように、キーワード評価部１２８は、キーワード毎に当該キーワードの１回以上の出現時刻を抽出し、出現時刻毎に周辺時刻のシーン評価値を平均化し、１回以上の出現時刻の間で更に平均化してキーワード評価値とする。 The keyword evaluation unit 128 calculates a keyword evaluation value for each detected keyword using the keyword detection result from the keyword detection unit 125 and the scene evaluation result from the scene evaluation unit 127. As described above, for each keyword, the keyword evaluation unit 128 extracts the one or more times at which the keyword appears, averages the scene evaluation values at the surrounding times for each appearance time, and then averages those results over the appearance times to obtain the keyword evaluation value.
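The two-stage averaging performed by the keyword evaluation unit 128 can be sketched in Python as follows. The function name, the inclusive window boundaries, and the decision to skip an occurrence that has no scene evaluation value nearby are assumptions of this sketch, not details given in the description.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def keyword_value(
    occurrences: Sequence[float],
    scene_values: Sequence[Tuple[float, float]],
    s3: float,
) -> float:
    """Keyword evaluation value V_k from occurrence times G(y) and (time, V_t) pairs."""
    per_occurrence: List[float] = []
    for g in occurrences:
        # Scene evaluation values inside window 82: from S3 before to S3 after G(y).
        in_window = [v for t, v in scene_values if g - s3 <= t <= g + s3]
        if in_window:  # assumption: occurrences with an empty window are skipped
            per_occurrence.append(mean(in_window))  # H(y)
    if not per_occurrence:
        raise ValueError("no scene evaluation values near any occurrence")
    return mean(per_occurrence)


# Hypothetical scene evaluation values sampled once per second.
scene = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5), (10.0, 0.0), (11.0, 0.5)]
print(keyword_value([1.0, 10.0], scene, s3=1.0))
```

In this example the first occurrence averages the values at 0 to 2 seconds and the second averages those at 10 and 11 seconds; the keyword evaluation value is the mean of those two averages.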

キーワード評価部１２８は、キーワード評価値が閾値Ｔ１を超えるキーワードと閾値Ｔ２未満のキーワードを重要キーワードとして抽出する。そして、キーワード評価部１２８は、抽出した重要キーワードとキーワード評価値によって決まる順位（ベスト１、ベスト２、ワースト１、ワースト２など）を評価結果として評価結果記憶部１２４に格納する。キーワード評価部１２８は、評価結果を管理装置４１に送信する。 The keyword evaluation unit 128 extracts keywords whose keyword evaluation values exceed the threshold T1 and keywords whose keyword evaluation values are less than the threshold T2 as important keywords. Then, the keyword evaluation unit 128 stores rankings (best 1, best 2, worst 1, worst 2, etc.) determined by the extracted important keywords and keyword evaluation values in the evaluation result storage unit 124 as evaluation results. The keyword evaluation unit 128 transmits evaluation results to the management device 41 .

FIG. 7 is a first diagram showing an example of tables held by the conversation analysis device.
The keyword table 131 is stored in the keyword storage unit 123. Character strings of the keywords designated as search targets are registered in the keyword table 131. Only keywords registered in the keyword table 131 are extracted from the utterances represented by the voice data; other words are not extracted.

The keyword detection table 132 is generated by the keyword detection unit 125. The keyword detection table 132 may be stored in the RAM 102 or the HDD 103. The keyword detection table 132 includes the items time and keyword. The time item holds the time at which one of the keywords registered in the keyword table 131 was uttered. The keyword item holds the keyword that was uttered. When utterances by the customer and utterances by the customer service representative are recognized separately, the keyword detection table 132 may further include an item indicating the speaker. The speaker is either the customer or the customer service representative.

The motion detection table 133 is generated by the motion detection unit 126. The motion detection table 133 may be stored in the RAM 102 or the HDD 103. The motion detection table 133 includes the items time, actor, type, and magnitude. The time item holds the time at which the action was performed. The actor item holds "customer" or "customer service representative" as the party that performed the action. The type item holds the type of action, such as "laugh" or "nod". The magnitude item holds a numerical value indicating the magnitude of the action.
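As an illustration only, rows of the keyword detection table 132 and the motion detection table 133 could be modeled as small records. The class names, field names, and the use of seconds as the time unit are assumptions made for this sketch; the description only specifies which items each table contains.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KeywordDetection:
    """One row of the keyword detection table 132 (hypothetical layout)."""
    time: float                    # time the keyword was uttered (seconds, assumed unit)
    keyword: str                   # keyword registered in the keyword table 131
    speaker: Optional[str] = None  # optional item: "customer" or "staff"


@dataclass(frozen=True)
class MotionDetection:
    """One row of the motion detection table 133 (hypothetical layout)."""
    time: float       # time the action was performed
    actor: str        # "customer" or "staff" (customer service representative)
    kind: str         # type of action, e.g. "laugh" or "nod"
    magnitude: float  # numerical magnitude of the action


keyword_rows = [KeywordDetection(12.0, "beautiful", "staff")]
motion_rows = [MotionDetection(11.5, "staff", "nod", 0.8)]
```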

FIG. 8 is a second diagram showing an example of tables held by the conversation analysis device.
The scene evaluation table 134 is generated by the scene evaluation unit 127. The scene evaluation table 134 may be stored in the RAM 102 or the HDD 103. The scene evaluation table 134 includes the items time and evaluation value. The time item holds the time at which the liveliness of the conversation, that is, the customer's degree of empathy, was evaluated. The evaluation value item holds the calculated scene evaluation value.

The keyword evaluation table 135 is generated by the keyword evaluation unit 128. The keyword evaluation table 135 may be stored in the RAM 102 or the HDD 103. The keyword evaluation table 135 includes the items keyword and evaluation value. The keyword item holds a keyword that appears in the keyword detection table 132. The evaluation value item holds the calculated keyword evaluation value.

The important keyword table 136 is stored in the evaluation result storage unit 124. The important keyword table 136 includes the items rank and keyword. The rank item holds the rank of the important keyword as determined by its keyword evaluation value, such as best 1, best 2, worst 1, or worst 2. The keyword item holds an important keyword selected, on the basis of the keyword evaluation values, from the keywords registered in the keyword evaluation table 135. The contents of the important keyword table 136 are transmitted to the management device 41.

FIG. 9 is a flowchart showing an example of the conversation analysis procedure according to the second embodiment.
(S10) The motion detection unit 126 reads image data from the image storage unit 122. The image data read is the fixed-length segment that follows the image data already processed. The keyword detection unit 125 likewise reads voice data from the voice storage unit 121. The voice data read is the fixed-length segment that follows the voice data already processed.

(S11) The keyword detection unit 125 converts the voice data read in step S10 into a word string by speech recognition. The keyword detection unit 125 searches the converted word string for the search target keywords registered in the keyword table 131, and generates the keyword detection table 132 indicating each keyword found and the time at which it appeared.
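The keyword search of step S11 can be sketched as a filter over timestamped words. This is a minimal sketch that assumes the speech recognizer has already produced `(time, word)` pairs; the function name and that input format are hypothetical, and the speech recognition itself is not shown.

```python
def detect_keywords(recognized_words, keyword_table):
    """Keep only words registered in the keyword table, with their times.

    recognized_words: iterable of (time, word) pairs (assumed recognizer output).
    keyword_table: iterable of registered search target keywords.
    Returns rows of (time, keyword), like the keyword detection table 132.
    """
    targets = set(keyword_table)  # set lookup keeps the filter O(1) per word
    return [(t, w) for t, w in recognized_words if w in targets]


words = [(3.0, "this"), (3.4, "is"), (3.9, "beautiful"), (8.2, "fast")]
table_132 = detect_keywords(words, ["beautiful", "fast", "interesting"])
```

Words that are not registered, such as "this" and "is" here, never reach the detection table, which matches the statement that no other words are extracted.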

(S12) From each frame included in the image data read in step S10, the motion detection unit 126 recognizes, by image recognition, the region showing the customer and the region showing the customer service representative. From the change in position between frames, the motion detection unit 126 detects the type, time, and magnitude of each customer action. In the same way, it detects the type, time, and magnitude of each action of the customer service representative. The motion detection unit 126 generates the motion detection table 133 containing the detected information.

(S13) The scene evaluation unit 127 extracts the action times of the customer service representative from the motion detection table 133 generated in step S12.
(S14) For each action time of the customer service representative extracted in step S13, the scene evaluation unit 127 searches the motion detection table 133 for an immediately preceding customer action, determines whether such a customer action exists, and determines a weight accordingly. Specifically, the scene evaluation unit 127 selects the weight w1 when there is no customer action during the immediately preceding time S1, and selects the weight w2, which is smaller than w1, when there is a customer action during the immediately preceding time S1.

(S15) For each action time of the customer service representative extracted in step S13, the scene evaluation unit 127 searches the motion detection table 133 for an immediately following customer action, determines whether there is synchronization by a customer action of the same type, and determines a coefficient accordingly. Specifically, when there is synchronization, that is, when a customer action of the same type occurs during the immediately following time S2, the scene evaluation unit 127 selects a coefficient of 1. When there is no synchronization, that is, when no customer action of the same type occurs during the immediately following time S2, the scene evaluation unit 127 selects a coefficient of a. The coefficient is the value by which the weight is multiplied, and a < 1.
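Steps S13 to S15 can be sketched as follows. This is an illustrative reading, not the claimed implementation: the tuple layout `(time, actor, kind)`, the label `"staff"` for the customer service representative, the half-open window boundaries, and the numeric values of S1, S2, w1, w2, and a are all assumptions.

```python
def staff_action_contributions(actions, s1, s2, w1, w2, a):
    """Return (time, weight * coefficient) for each staff action.

    actions: list of (time, actor, kind) with actor "staff" or "customer".
    Weight (S14): w1 if no customer action in [t - s1, t), else w2 (w2 < w1).
    Coefficient (S15): 1 if a same-kind customer action occurs in (t, t + s2], else a (a < 1).
    """
    customers = [(t, k) for t, actor, k in actions if actor == "customer"]
    result = []
    for t, actor, kind in actions:
        if actor != "staff":
            continue  # S13: only staff action times are used as reference points
        preceded = any(t - s1 <= ct < t for ct, _ in customers)
        weight = w2 if preceded else w1
        synced = any(t < ct <= t + s2 and ck == kind for ct, ck in customers)
        coefficient = 1.0 if synced else a
        result.append((t, weight * coefficient))
    return result


actions = [
    (10.0, "staff", "laugh"), (11.0, "customer", "laugh"),  # staff leads, customer follows
    (20.0, "customer", "nod"), (21.0, "staff", "nod"),      # customer acted first
    (30.0, "staff", "nod"),                                 # no customer reaction
]
contribs = staff_action_contributions(actions, s1=2.0, s2=2.0, w1=1.0, w2=0.5, a=0.2)
```

With these assumed values, only the first staff action keeps the full weight, because it is the only one that is both unprompted and followed by a customer action of the same type.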

(S16) The scene evaluation unit 127 sets a sliding window with a time width of S0 × 2. Using the weights and coefficients calculated in steps S14 and S15 for the actions of the customer service representative that fall within the sliding window, the scene evaluation unit 127 calculates the scene evaluation value at the center time of the sliding window. This scene evaluation value represents the customer's degree of empathy. By sliding the window in steps of time Δt, the scene evaluation unit 127 calculates scene evaluation values at intervals of Δt. The scene evaluation unit 127 generates the scene evaluation table 134 indicating the scene evaluation value at each of the plurality of times.
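The sliding window of step S16 might be sketched as below. The passage here does not state how the weighted values inside a window are combined, so the plain sum used in this sketch is an assumption, as are the function name, the inclusive window boundaries, and the example values of S0 and Δt.

```python
def scene_evaluation_table(contributions, s0, dt, start, end):
    """Compute (center time, scene evaluation value) at intervals of dt.

    contributions: list of (time, weight * coefficient) per staff action.
    Each window spans [center - s0, center + s0], i.e. a width of S0 x 2.
    The values inside a window are summed (assumed aggregation rule).
    """
    steps = int(round((end - start) / dt))
    table = []
    for i in range(steps + 1):
        center = start + i * dt
        value = sum(v for t, v in contributions
                    if center - s0 <= t <= center + s0)
        table.append((center, value))
    return table


contribs = [(10.0, 1.0), (21.0, 0.1)]
table_134 = scene_evaluation_table(contribs, s0=5.0, dt=5.0, start=0.0, end=30.0)
```

Because consecutive windows overlap whenever Δt is smaller than 2 × S0, a single action influences several neighboring scene evaluation values.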

(S17) The motion detection unit 126 determines whether the image data has ended. The keyword detection unit 125 determines whether the voice data has ended. For example, the image data and voice data end when service to one customer is finished. If the image data and voice data have ended, the process proceeds to step S18; otherwise, it returns to step S10.

(S18) The keyword evaluation unit 128 extracts the appearance times of the keywords from the keyword detection table 132 generated in step S11. For each appearance time of a keyword, the keyword evaluation unit 128 retrieves, from the scene evaluation table 134 generated in step S16, the surrounding scene evaluation values that fall within the time S3 immediately before and the time S3 immediately after that appearance time. The keyword evaluation unit 128 calculates the average of these surrounding scene evaluation values.

(S19) The keyword evaluation unit 128 groups the averages of the scene evaluation values calculated in step S18 by keyword. For each keyword, the keyword evaluation unit 128 further averages these averages to calculate the keyword evaluation value. The keyword evaluation unit 128 generates the keyword evaluation table 135 indicating the keyword evaluation values.
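The two-stage averaging of steps S18 and S19 can be sketched as follows. The function name, the tuple layouts, the inclusive ±S3 boundaries, and the choice to skip an occurrence that has no nearby scene evaluation value are assumptions of this sketch.

```python
from collections import defaultdict
from statistics import mean


def keyword_evaluation_table(detections, scene_table, s3):
    """Average scene values around each occurrence, then average per keyword.

    detections: list of (time, keyword) rows (keyword detection table 132).
    scene_table: list of (time, value) rows (scene evaluation table 134).
    """
    per_keyword = defaultdict(list)
    for t, keyword in detections:
        # S18: scene evaluation values within S3 before and S3 after the occurrence
        nearby = [v for ts, v in scene_table if t - s3 <= ts <= t + s3]
        if nearby:  # assumed: occurrences with no nearby value are ignored
            per_keyword[keyword].append(mean(nearby))
    # S19: average the per-occurrence averages for each keyword
    return {kw: mean(vals) for kw, vals in per_keyword.items()}


scene_table = [(0.0, 0.0), (5.0, 1.0), (10.0, 1.0), (15.0, 0.0)]
detections = [(5.0, "beautiful"), (12.0, "beautiful"), (14.0, "fast")]
table_135 = keyword_evaluation_table(detections, scene_table, s3=5.0)
```

Averaging per occurrence first means that a keyword uttered many times in one lively scene does not outweigh its occurrences in other scenes simply because more scene values surround it.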

(S20) From the keyword evaluation table 135 generated in step S19, the keyword evaluation unit 128 extracts, as important keywords, keywords whose keyword evaluation value exceeds the threshold T1 and keywords whose keyword evaluation value is below the threshold T2. The keyword evaluation unit 128 generates the important keyword table 136 indicating the extracted important keywords and their ranks, and stores it in the evaluation result storage unit 124. The keyword evaluation unit 128 transmits the contents of the important keyword table 136 to the management device 41. Based on the contents of the important keyword table 136, the management device 41 displays the top N and bottom N important keywords.
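The threshold extraction and ranking of step S20 can be sketched as below. The function name, the rank label strings, and the example threshold values are assumptions; the description only states that keywords above T1 and below T2 are extracted and ranked by keyword evaluation value.

```python
def important_keyword_table(keyword_values, t1, t2):
    """Return (rank, keyword) rows for keywords above T1 or below T2.

    keyword_values: dict mapping keyword -> keyword evaluation value.
    Keywords above t1 are ranked "best 1", "best 2", ... from the highest value;
    keywords below t2 are ranked "worst 1", "worst 2", ... from the lowest value.
    """
    best = sorted((k for k, v in keyword_values.items() if v > t1),
                  key=lambda k: keyword_values[k], reverse=True)
    worst = sorted((k for k, v in keyword_values.items() if v < t2),
                   key=lambda k: keyword_values[k])
    rows = [(f"best {i}", k) for i, k in enumerate(best, start=1)]
    rows += [(f"worst {i}", k) for i, k in enumerate(worst, start=1)]
    return rows


values = {"beautiful": 0.9, "fast": 0.1, "interesting": 0.5, "cheap": 0.8}
table_136 = important_keyword_table(values, t1=0.7, t2=0.3)
```

In this example "interesting" lies between the two thresholds and is therefore not treated as an important keyword in either direction.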

According to the information processing system of the second embodiment, keywords are detected from the voice data, and the actions of the customer and of the customer service representative are detected from the image data. Synchronization, in which the customer service representative performs an action first and the customer performs an action of the same type immediately afterward, is detected; a scene evaluation value indicating the customer's degree of empathy is calculated based on this synchronization of actions; and a keyword evaluation value is calculated from the scene evaluation values at times surrounding each keyword. Preferable keywords with high keyword evaluation values and keywords requiring caution with low keyword evaluation values are then extracted and reported to the administrator.

This makes it possible to estimate important keywords that are likely to have had a positive effect on the customer's psychological state and important keywords that are likely to have had a negative effect on it, which supports improvement of the representative's customer service skills. In addition, because keywords are evaluated from the synchronization between the representative's actions and the customer's actions, important keywords that reflect the customer's psychological state can be extracted more accurately than with methods such as evaluation by the number of keyword occurrences. Furthermore, because the system checks the condition that the customer performed no action immediately before the representative's action and performed an action of the same type immediately after it, the psychological state of the customer being served can be estimated accurately.

[Third Embodiment]
Next, a third embodiment will be described. The description focuses on differences from the second embodiment, and content that is the same as in the second embodiment may be omitted. The information processing system of the third embodiment differs from that of the second embodiment in where the conversation analysis device 100 is placed.

FIG. 10 is a diagram showing an example of the information processing system according to the third embodiment.
As in the second embodiment, the information processing system of the third embodiment includes the management device 41, the camera device 50, and the conversation analysis device 100. In the third embodiment, however, the camera device 50 is connected to the network 40, and the conversation analysis device 100 operates as a server device that communicates with the camera device 50 via the network 40. The camera device 50 transmits the audio signal and the image signal to the conversation analysis device 100 via the network 40. The information processing system of the third embodiment provides the same effects as the second embodiment.

[Fourth Embodiment]
Next, a fourth embodiment will be described. The description focuses on differences from the second embodiment, and content that is the same as in the second embodiment may be omitted. The information processing system of the fourth embodiment is applied to operations in which a customer service representative serves a customer remotely.

FIG. 11 is a diagram showing an example of the information processing system according to the fourth embodiment.
The information processing system of the fourth embodiment includes a user device 42, camera devices 50 and 60, and the conversation analysis device 100. The camera device 60 is connected to the user device 42. The camera device 50 is connected to the conversation analysis device 100. The user device 42 and the conversation analysis device 100 are connected to the network 40.

The user device 42 is a terminal device used by the customer and installed at a location different from that of the conversation analysis device 100, such as the customer's home. The conversation analysis device 100 is a terminal device used by the customer service representative and installed in an office or the like. The camera devices 50 and 60 are devices with a video capture function using an image sensor and an audio recording function using a microphone. The camera device 50 is set up to capture video of the customer service representative and record the representative's speech. The camera device 60 is set up to capture video of the customer and record the customer's speech.

The user device 42 collects image data capturing the customer and voice data recording the customer's voice, and transmits them to the conversation analysis device 100 via the network 40. The user device 42 also receives, from the conversation analysis device 100 via the network 40, image data capturing the customer service representative and voice data recording the representative's voice. The user device 42 displays video of the representative on a display based on the received image data, and plays the representative's speech through a speaker based on the received voice data.

The conversation analysis device 100 collects the image data and voice data of the customer service representative and transmits them to the user device 42 via the network 40. The conversation analysis device 100 also receives the customer's image data and voice data from the user device 42 via the network 40. The conversation analysis device 100 displays video of the customer on a display based on the received image data, and plays the customer's speech through a speaker based on the received voice data. This allows the customer and the customer service representative to converse in a video conference format.

In addition, the conversation analysis device 100 analyzes the conversation between the customer and the customer service representative in the same manner as in the second embodiment, based on the customer's image data, the representative's image data, the customer's voice data, and the representative's voice data. That is, the conversation analysis device 100 detects keywords uttered by the customer or the representative from the voice data, and detects the actions of the customer and of the representative during the conversation from the image data. The conversation analysis device 100 evaluates the keywords from the actions of the customer and the representative and extracts important keywords. The conversation analysis device 100 displays the top N and bottom N of the extracted important keywords on the display.

The conversation analysis device 100 may, however, transmit the important keywords to the management device 41 via the network 40. The terminal device that the customer service representative uses for customer service may also be separate from the conversation analysis device 100. The information processing system of the fourth embodiment provides the same effects as the second embodiment for remote customer service as well.

[Fifth Embodiment]
Next, a fifth embodiment will be described. The description focuses on differences from the second embodiment, and content that is the same as in the second embodiment may be omitted. In the second embodiment, the customer's degree of empathy is estimated from the synchronization between the customer's actions and the customer service representative's actions, and the importance of each keyword is evaluated according to the customer's degree of empathy. In the fifth embodiment, by contrast, the representative's degree of customer service is estimated from the synchronization between the customer's actions and the representative's actions, and the importance of each keyword is evaluated according to the representative's degree of customer service. The degree of customer service represents the representative's attitude toward service, including proactiveness, enthusiasm, and politeness. The scene evaluation value calculated in the fifth embodiment corresponds to the representative's degree of customer service, and the keyword evaluation value calculated in the fifth embodiment reflects the degree of customer service at the time the customer or the representative uttered the keyword. The important keywords extracted in the fifth embodiment are therefore keywords estimated to be strongly associated with good customer service and keywords estimated to be strongly associated with poor customer service. The important keywords extracted in the fifth embodiment can also be said to reflect the psychological state of the customer service representative.

The information processing system of the fifth embodiment can be realized with the same configuration as the information processing system of the second embodiment shown in FIGS. 2, 3, and 6 to 8. The fifth embodiment may therefore be described below using the same reference numerals as in FIGS. 2, 3, and 6 to 8. The information processing system of the fifth embodiment may also have the same system configuration as the information processing system of the third embodiment shown in FIG. 10, or the same system configuration as the information processing system of the fourth embodiment shown in FIG. 11.

FIG. 12 is a diagram showing an example of keyword extraction according to the fifth embodiment.
In the fifth embodiment, synchronization between the customer's actions and the customer service representative's actions is detected. Synchronization in the fifth embodiment is a chain of actions in which the customer starts an action and the representative performs an action of the same type immediately afterward. The synchronization detected in the fifth embodiment differs from that of the second embodiment in the order of the actions. When such synchronization occurs, the representative is presumed to be watching the customer attentively and listening carefully to what the customer says. A keyword uttered by the customer or the representative while such synchronization is occurring may therefore be a keyword associated with good customer service. Conversely, a keyword uttered by the customer or the representative while no such synchronization is occurring may be a keyword associated with poor customer service.

For example, consider scenes 74 to 76 described below.
In scene 74, the customer service representative laughs, and the customer laughs immediately afterward. Scene 74 corresponds to scene 71 in FIG. 4. The representative performs an action within the time S1 immediately before the customer's action, and performs no action within the time S2 immediately after the customer's action. In scene 74, the customer's action and the representative's action are not synchronized starting from the customer, so the degree of customer service is determined to be low. As a result, the evaluation value of the keyword "fast", uttered by the customer or the representative around scene 74, is low.

In scene 75, the customer laughs, but the customer service representative performs no action within the time S1 immediately before the customer's action and none within the time S2 immediately after it. In scene 75, the customer's action and the representative's action are not synchronized, so the degree of customer service is determined to be low. As a result, the evaluation value of the keyword "interesting", uttered by the customer or the representative around scene 75, is low.

In scene 76, the customer nods, and the customer service representative nods immediately afterward. Scene 76 corresponds to scene 73 in FIG. 4. The representative performs no action within the time S1 immediately before the customer's action, and performs an action of the same type within the time S2 immediately after the customer's action. Because the customer's action and the representative's action are synchronized starting from the customer, the degree of customer service is determined to be high. As a result, the evaluation value of the keyword "beautiful", uttered by the customer or the representative around scene 76, is high.

Once the keyword evaluation values have been calculated, the conversation analysis device 100, as in the second embodiment, presumes keywords whose keyword evaluation value is greater than the threshold T1 to be preferable keywords and extracts them as important keywords. The conversation analysis device 100 also presumes keywords whose keyword evaluation value is less than the threshold T2 to be keywords requiring caution and extracts them as important keywords. In the above example, "beautiful" may be extracted as an important keyword.

FIG. 13 is a flowchart showing an example of the conversation analysis procedure according to the fifth embodiment.
(S30) The motion detection unit 126 reads image data from the image storage unit 122. The keyword detection unit 125 reads voice data from the voice storage unit 121.

(S31) The keyword detection unit 125 converts the voice data read in step S30 into a word string by speech recognition. The keyword detection unit 125 searches the converted word string for the search target keywords registered in the keyword table 131, and generates the keyword detection table 132 indicating each keyword found and the time at which it appeared.

(S32) From each frame included in the image data read in step S30, the motion detection unit 126 recognizes, by image recognition, the region showing the customer and the region showing the customer service representative. From the change in position between frames, the motion detection unit 126 detects the type, time, and magnitude of each customer action. In the same way, it detects the type, time, and magnitude of each action of the customer service representative. The motion detection unit 126 generates the motion detection table 133 containing the detected information.

(S33) The scene evaluation unit 127 extracts the customer's action times from the motion detection table 133 generated in step S32.
(S34) For each customer action time extracted in step S33, the scene evaluation unit 127 searches the motion detection table 133 for an immediately preceding action of the customer service representative, determines whether such an action exists, and determines a weight accordingly. Specifically, the scene evaluation unit 127 selects weight w1 if the customer service representative performed no action during the immediately preceding time S1, and selects weight w2 if the representative performed an action during that time.

(S35) For each customer action time extracted in step S33, the scene evaluation unit 127 searches the motion detection table 133 for an immediately following action of the customer service representative, determines whether synchronization by an action of the same type exists, and determines a coefficient accordingly. Specifically, if synchronization exists, that is, if the customer service representative performs an action of the same type during the immediately following time S2, the scene evaluation unit 127 selects coefficient = 1. If synchronization does not exist, that is, if the representative performs no action of the same type during the immediately following time S2, the scene evaluation unit 127 selects coefficient = a. These coefficients are values by which the weight is multiplied, and a < 1.
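Steps S34 and S35 can be sketched as follows. This is only an illustration under stated assumptions: the `Action` record, the actor labels, the treatment of the interval boundaries, and the numeric values of S1, S2, w1, w2, and a in the usage lines are hypothetical; the specification here only fixes that a < 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    actor: str   # "customer" or "staff" (labels are assumptions)
    kind: str    # type of action, e.g. "nod"
    time: float  # action time in seconds


def weight_and_coefficient(customer_action, actions, s1, s2, w1, w2, a):
    """Return (weight, coefficient) for one customer action.

    S34: weight w1 if no staff action occurred in the preceding time S1,
         otherwise weight w2.
    S35: coefficient 1 if a staff action of the same type occurred in the
         following time S2 (synchronization), otherwise coefficient a (< 1).
    """
    t = customer_action.time
    staff = [x for x in actions if x.actor == "staff"]
    # Assumed interval conventions: [t - S1, t) before and (t, t + S2] after.
    preceded = any(t - s1 <= x.time < t for x in staff)
    synced = any(
        t < x.time <= t + s2 and x.kind == customer_action.kind for x in staff
    )
    weight = w2 if preceded else w1
    coefficient = 1.0 if synced else a
    return weight, coefficient


actions = [
    Action("customer", "nod", 10.0),
    Action("staff", "nod", 11.0),
    Action("staff", "smile", 19.0),
    Action("customer", "nod", 20.0),
]
first = weight_and_coefficient(actions[0], actions, 2.0, 3.0, 1.0, 0.5, 0.3)
second = weight_and_coefficient(actions[3], actions, 2.0, 3.0, 1.0, 0.5, 0.3)
```

In this sample, the first customer nod has no preceding staff action and is followed by a staff nod, so it receives w1 and coefficient 1; the second is preceded by a staff action and is not mirrored, so it receives w2 and coefficient a.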

(S36) The scene evaluation unit 127 sets a sliding window with a time width of S0 × 2. Using the weights and coefficients calculated in steps S34 and S35 for the customer actions that fall within the sliding window, the scene evaluation unit 127 calculates the scene evaluation value at the center time of the sliding window. This scene evaluation value represents the degree of customer service of the customer service representative. By sliding the window in steps of time Δt, the scene evaluation unit 127 calculates scene evaluation values at intervals of Δt. The scene evaluation unit 127 generates a scene evaluation table 134 indicating the scene evaluation value at each of a plurality of times.
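The sliding-window computation of step S36 might look like the sketch below. The aggregation formula is an assumption: this passage states only that the weights and coefficients of the actions inside the window are used, so the sketch simply sums weight × coefficient over the window. The function name and the sample values are hypothetical.

```python
def scene_scores(contributions, s0, dt, end_time):
    """Compute (center_time, scene_value) pairs at intervals of dt.

    contributions: list of (action_time, weight, coefficient) tuples, one
        per customer action (outputs of steps S34 and S35).
    s0: half-width of the window, so the window spans S0 * 2.
    dt: slide step (the interval written as delta-t in the text).
    end_time: last center time to evaluate.

    ASSUMPTION: the scene value is the sum of weight * coefficient over the
    actions inside the window; the actual formula may differ.
    """
    table = []
    steps = int(end_time / dt) + 1
    for i in range(steps):
        center = i * dt  # index-based to avoid accumulating float error
        value = sum(
            w * c
            for t, w, c in contributions
            if center - s0 <= t <= center + s0
        )
        table.append((center, value))
    return table


# Two hypothetical customer actions at t = 1.0 s and t = 3.0 s.
table = scene_scores([(1.0, 1.0, 1.0), (3.0, 0.5, 0.3)], s0=1.0, dt=1.0, end_time=4.0)
```

Each row of `table` corresponds to one row of a scene evaluation table: a time and the value evaluated for the window centered on it.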

(S37) The motion detection unit 126 determines whether the image data has ended. The keyword detection unit 125 determines whether the voice data has ended. If both the image data and the voice data have ended, the process proceeds to step S38; otherwise, the process returns to step S30.

(S38) The keyword evaluation unit 128 extracts the appearance times of keywords from the keyword detection table 132 generated in step S31. For each appearance time, the keyword evaluation unit 128 searches the scene evaluation table 134 generated in step S36 for the surrounding scene evaluation values, namely those belonging to the time S3 immediately before and the time S3 immediately after the appearance time. The keyword evaluation unit 128 calculates the average of these surrounding scene evaluation values.

(S39) The keyword evaluation unit 128 groups the averages of the scene evaluation values calculated in step S38 by keyword, so that averages belonging to the same keyword are grouped together. For each keyword, the keyword evaluation unit 128 further averages these averages to calculate the keyword evaluation value. The keyword evaluation unit 128 generates a keyword evaluation table 135 indicating the keyword evaluation values.
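The two-stage averaging of steps S38 and S39 can be sketched as follows. The function name, the data layout, and the sample values are hypothetical; the sketch assumes that the scene evaluation table is a list of (time, value) pairs and that an occurrence with no surrounding scene values is skipped.

```python
from collections import defaultdict
from statistics import mean


def keyword_scores(detections, scene_table, s3):
    """Average scene values around each occurrence, then per keyword.

    detections: list of (keyword, appearance_time) pairs.
    scene_table: list of (time, scene_value) pairs.
    s3: the time S3 taken before and after each appearance time.
    """
    per_keyword = defaultdict(list)
    for keyword, t in detections:
        # S38: scene values within S3 before and after the occurrence.
        nearby = [v for ts, v in scene_table if t - s3 <= ts <= t + s3]
        if nearby:  # assumption: skip occurrences with no nearby values
            per_keyword[keyword].append(mean(nearby))
    # S39: average the per-occurrence averages for each keyword.
    return {k: mean(vals) for k, vals in per_keyword.items()}


scene_table = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
detections = [("beautiful", 1.0), ("beautiful", 3.0), ("cheap", 0.0)]
scores = keyword_scores(detections, scene_table, s3=1.0)
```

Because the second stage averages per-occurrence averages rather than pooling all scene values, each occurrence of a keyword carries equal weight regardless of how many scene values surround it.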

(S40) From the keyword evaluation table 135 generated in step S39, the keyword evaluation unit 128 extracts, as important keywords, the keywords whose keyword evaluation value exceeds threshold T1 and the keywords whose keyword evaluation value is less than threshold T2. The keyword evaluation unit 128 generates an important keyword table 136 indicating the extracted important keywords and their ranks, and stores it in the evaluation result storage unit 124. The keyword evaluation unit 128 transmits the contents of the important keyword table 136 to the management device 41. Based on the contents of the important keyword table 136, the management device 41 displays the top N and bottom N important keywords.

According to the information processing system of the fifth embodiment, keywords are detected from the voice data, and the actions of the customer and of the customer service representative are detected from the image data. Synchronization is detected in which the customer acts first and the customer service representative performs an action of the same type immediately afterward. A scene evaluation value indicating the degree of customer service is calculated based on this synchronization, and a keyword evaluation value is calculated from the scene evaluation values at times around each keyword. Preferable keywords with high keyword evaluation values and caution-required keywords with low keyword evaluation values are then extracted and reported to the administrator.

This makes it possible to estimate the important keywords that are likely to be associated with a good customer service attitude on the part of the customer service representative, as well as those likely to be associated with a poor customer service attitude, and thereby to support improvement of the representative's customer service skills. In addition, because keywords are evaluated from the synchronization between the customer's actions and the representative's actions, important keywords that reflect the representative's psychological state can be extracted more accurately than with methods such as evaluation based on the number of keyword occurrences. Furthermore, because the determined condition is that the representative performed no action immediately before the customer's action and performed an action of the same type immediately after it, the representative's psychological state, as expressed in the customer service attitude, can be estimated accurately.

[Sixth embodiment]
Next, a sixth embodiment will be described. The description focuses on the differences from the second embodiment, and content that is the same as in the second embodiment may be omitted. In the second embodiment, the search target keywords to be detected from the voice data are specified in advance. In the sixth embodiment, by contrast, search target keywords are added automatically through conversation analysis, which reduces the burden of specifying search target keywords manually.

The information processing system of the sixth embodiment can be realized with a system configuration similar to that of the information processing system of the second embodiment shown in FIG. 2. It may instead have a system configuration similar to that of the information processing system of the third embodiment shown in FIG. 10, or to that of the information processing system of the fourth embodiment shown in FIG. 11. However, a conversation analysis device 200, described later, is used in place of the conversation analysis device 100. The conversation analysis device 200 of the sixth embodiment can be realized with a hardware configuration similar to that of the second embodiment shown in FIG. 3. As in the fifth embodiment, it is also possible to extract important keywords that reflect the degree of customer service.

FIG. 14 is a block diagram showing an example of the functions of the conversation analysis device of the sixth embodiment.
The conversation analysis device 200 has a voice storage unit 221, an image storage unit 222, a keyword storage unit 223, and an evaluation result storage unit 224. These storage units are implemented using, for example, storage areas of a RAM or an HDD. The conversation analysis device 200 also has a keyword detection unit 225, a motion detection unit 226, a scene evaluation unit 227, a word extraction unit 228, and a keyword evaluation unit 229. These processing units are implemented using, for example, programs.

The voice storage unit 221 stores voice data including the audio signal received from the camera device 50. The image storage unit 222 stores image data including the image signal received from the camera device 50. The keyword storage unit 223 stores the keyword table 131 shown in FIG. 7. Search target keywords specified in advance by an administrator or the like are registered in the keyword table 131. Search target keywords added automatically by the keyword evaluation unit 229 are also registered in the keyword table 131. The evaluation result storage unit 224 stores the important keyword table 136 shown in FIG. 8.

The keyword detection unit 225 converts the voice data stored in the voice storage unit 221 into a word string by voice recognition. The keyword detection unit 225 detects, in the word string, the search target keywords indicated by the keyword table 131 stored in the keyword storage unit 223, and generates the keyword detection table 132 shown in FIG. 7.

The motion detection unit 226 recognizes the customer and the customer service representative by image recognition in each frame included in the image data stored in the image storage unit 222. The motion detection unit 226 extracts, from each frame, feature information of the area showing the customer service representative, and detects the representative's actions based on changes in the feature information between frames. Likewise, it extracts, from each frame, feature information of the area showing the customer, and detects the customer's actions based on changes in the feature information between frames. The motion detection unit 226 generates the motion detection table 133 shown in FIG. 7.

The scene evaluation unit 227 calculates scene evaluation values at intervals of time Δt based on the motion detection table 133. As described above, taking each action time of the customer service representative as a reference, the scene evaluation unit 227 determines whether a customer action occurred during the immediately preceding time S1 and whether a customer action of the same type occurred during the immediately following time S2. The scene evaluation unit 227 calculates the scene evaluation values based on this synchronization between the representative's actions and the customer's actions, and generates the scene evaluation table 134 shown in FIG. 8.

The word extraction unit 228 converts the voice data stored in the voice storage unit 221 into a word string by voice recognition. The word extraction unit 228 extracts, from the word string, unregistered words that are not registered in the keyword table 131. However, general-purpose words (stop words) that can appear many times in speech, such as Japanese particles and auxiliary verbs, are excluded. The technique described in Non-Patent Document 1 (「単語抽出による音声要約文生成法とその評価」, "Speech summary sentence generation method by word extraction and its evaluation") may be used to extract unregistered words. The word extraction unit 228 notifies the keyword evaluation unit 229 of the extraction result, which includes each extracted unregistered word and the time at which it appears.
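The filtering performed by the word extraction unit can be illustrated with a small sketch. It assumes that voice recognition has already produced a list of (word, time) pairs; the function name, the sample words, and the stop-word set are hypothetical, and the sketch does not implement the technique of Non-Patent Document 1.

```python
def extract_unregistered(word_times, registered, stop_words):
    """Return (word, time) pairs for words that are neither registered
    search target keywords nor stop words.

    word_times: list of (word, appearance_time) pairs from voice
        recognition (assumed to be available already).
    registered: set of search target keywords in the keyword table.
    stop_words: set of general-purpose words to exclude.
    """
    return [
        (word, t)
        for word, t in word_times
        if word not in registered and word not in stop_words
    ]


word_times = [("this", 0.5), ("color", 0.8), ("is", 1.0), ("beautiful", 1.4)]
result = extract_unregistered(
    word_times, registered={"beautiful"}, stop_words={"this", "is"}
)
```

Here only "color" survives, together with its appearance time, which is the information passed on for evaluation.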

The keyword evaluation unit 229 calculates a word evaluation value for each keyword and each unregistered word based on the keyword detection table 132, the scene evaluation table 134, and the extraction result for unregistered words. The word evaluation value is calculated in the same way as the keyword evaluation value of the second embodiment. That is, the word evaluation value of a keyword is identical to its keyword evaluation value, and the word evaluation value of an unregistered word is calculated by the same method as a keyword evaluation value.

Accordingly, for each keyword, the keyword evaluation unit 229 extracts the one or more appearance times of the keyword, averages the scene evaluation values at the surrounding times for each appearance time, and further averages the results over the one or more appearance times to obtain the word evaluation value. Likewise, for each unregistered word, the keyword evaluation unit 229 extracts the one or more appearance times of the unregistered word, averages the scene evaluation values at the surrounding times for each appearance time, and further averages the results over the one or more appearance times to obtain the word evaluation value. The keyword evaluation unit 229 then generates a word evaluation table having the same data structure as the keyword evaluation table 135 shown in FIG. 8. The word evaluation table contains both the word evaluation values of keywords and the word evaluation values of unregistered words.

From among the keywords, the keyword evaluation unit 229 extracts, as important keywords, those whose word evaluation value exceeds threshold T1 and those whose word evaluation value is less than threshold T2. The keyword evaluation unit 229 then generates the important keyword table 136 shown in FIG. 8, stores it in the evaluation result storage unit 224, and transmits the contents of the important keyword table 136 to the management device 41. In addition, from among the unregistered words, the keyword evaluation unit 229 extracts those whose word evaluation value exceeds threshold T1 and those whose word evaluation value is less than threshold T2. The keyword evaluation unit 229 adds the extracted unregistered words to the keyword table 131 as search target keywords.

FIG. 15 is a flowchart showing an example of a conversation analysis procedure according to the sixth embodiment.
(S50) The motion detection unit 226 reads image data from the image storage unit 222. The keyword detection unit 225 reads voice data from the voice storage unit 221. The word extraction unit 228 reads the same voice data from the voice storage unit 221.

(S51) The keyword detection unit 225 converts the voice data read in step S50 into a word string by voice recognition, searches the word string for the search target keywords registered in the keyword table 131, and generates the keyword detection table 132. The word extraction unit 228 likewise converts the voice data read in step S50 into a word string by voice recognition, and extracts from the word string the unregistered words that are not registered in the keyword table 131.

(S52) The motion detection unit 226 recognizes, by image recognition, the area showing the customer and the area showing the customer service representative in each frame included in the image data read in step S50. From the position changes between frames, the motion detection unit 226 detects the type, time, and magnitude of each action of the customer. Likewise, it detects the type, time, and magnitude of each action of the customer service representative. The motion detection unit 226 generates the motion detection table 133 containing the detected information.

(S53) The scene evaluation unit 227 extracts the action times of the customer service representative from the motion detection table 133 generated in step S52.
(S54) For each action time of the customer service representative extracted in step S53, the scene evaluation unit 227 searches the motion detection table 133 for an immediately preceding customer action, determines whether such an action exists, and determines a weight accordingly.

(S55) For each action time of the customer service representative extracted in step S53, the scene evaluation unit 227 searches the motion detection table 133 for an immediately following customer action, determines whether synchronization by a customer action of the same type exists, and determines a coefficient accordingly.

(S56) The scene evaluation unit 227 sets a sliding window with a predetermined time width. Using the weights and coefficients calculated in steps S54 and S55 for the actions of the customer service representative that fall within the sliding window, the scene evaluation unit 227 calculates the scene evaluation value at the center time of the sliding window. The scene evaluation unit 227 slides the window in steps of time Δt to generate the scene evaluation table 134.

(S57) The motion detection unit 226 determines whether the image data has ended. The keyword detection unit 225 and the word extraction unit 228 each determine whether the voice data has ended. If both the image data and the voice data have ended, the process proceeds to step S58; otherwise, the process returns to step S50.

(S58) The keyword evaluation unit 229 extracts the appearance times of keywords from the keyword detection table 132 generated in step S51. For each keyword appearance time, the keyword evaluation unit 229 searches the scene evaluation table 134 generated in step S56 for the scene evaluation values around that time and calculates their average. Similarly, the keyword evaluation unit 229 extracts the appearance times of unregistered words from the extraction result of step S51. For each appearance time of an unregistered word, the keyword evaluation unit 229 searches the scene evaluation table 134 for the scene evaluation values around that time and calculates their average.

(S59) For the keywords, the keyword evaluation unit 229 groups the averages of the scene evaluation values calculated in step S58 by keyword, and further averages them for each keyword to calculate the word evaluation value. Similarly, for the unregistered words, the keyword evaluation unit 229 groups the averages of the scene evaluation values calculated in step S58 by unregistered word, and further averages them for each unregistered word to calculate the word evaluation value. The keyword evaluation unit 229 generates a word evaluation table indicating the word evaluation values of the keywords and the unregistered words.

(S60) From the word evaluation table generated in step S59, the keyword evaluation unit 229 extracts, as important keywords, the keywords whose word evaluation value exceeds threshold T1 and the keywords whose word evaluation value is less than threshold T2. The keyword evaluation unit 229 generates the important keyword table 136 indicating the extracted important keywords and their ranks, and stores it in the evaluation result storage unit 224. The keyword evaluation unit 229 transmits the contents of the important keyword table 136 to the management device 41. Based on the contents of the important keyword table 136, the management device 41 displays the top N and bottom N important keywords.

(S61) From the word evaluation table generated in step S59, the keyword evaluation unit 229 extracts the unregistered words whose word evaluation value exceeds threshold T1 and the unregistered words whose word evaluation value is less than threshold T2. The keyword evaluation unit 229 adds the extracted unregistered words to the keyword table 131 as new search target keywords.
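Step S61 can be sketched as an update of a keyword set. The function name, the use of a Python set for the keyword table, and the sample values are assumptions made for illustration; only the threshold rule follows the text.

```python
def update_keyword_table(keyword_table, word_scores, t1, t2):
    """Add unregistered words with extreme word evaluation values.

    keyword_table: set of search target keywords (modified in place).
    word_scores: dict mapping word -> word evaluation value; may contain
        both registered keywords and unregistered words.
    Returns the sorted list of newly added words.
    """
    added = sorted(
        word
        for word, score in word_scores.items()
        if word not in keyword_table and (score > t1 or score < t2)
    )
    keyword_table.update(added)
    return added


keyword_table = {"beautiful"}
word_scores = {"beautiful": 0.82, "color": 0.75, "wait": 0.05, "today": 0.45}
added = update_keyword_table(keyword_table, word_scores, t1=0.7, t2=0.2)
```

Words added this way become search target keywords in subsequent analyses, so they are detected and evaluated like the keywords specified in advance.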

According to the information processing system of the sixth embodiment, effects similar to those of the second embodiment are obtained. In addition, in the sixth embodiment, search target keywords that may be strongly related to customer service are added automatically through conversation analysis. It is therefore unnecessary to specify search target keywords exhaustively in advance, which reduces the burden of specifying them. Moreover, because search target keywords are learned automatically, the accuracy of the important keywords extracted from conversations improves, and the conversation analysis results become more useful.

REFERENCE SIGNS LIST
10 keyword extraction device
11 storage unit
12 processing unit
13 voice data
14 motion data
15 keyword
16, 17 action
18 evaluation value

Claims

A keyword extraction program that causes a computer to execute a process comprising:
detecting a keyword from audio data representing an utterance made by at least one of a first user and a second user in a conversation between the first user, who is a service provider, and the second user, who is a recipient of the service;
detecting a timing of a first action by the first user and a timing of a second action by the second user from action data indicating actions performed by the first user and actions performed by the second user in the conversation; and
calculating an evaluation value indicating the importance of the keyword based on the relationship between the timing of the first action and the timing of the second action.

The keyword extraction program according to claim 1, wherein the action data is image data obtained by imaging the first user and the second user while the conversation is taking place.

The keyword extraction program according to claim 1, wherein the second action is an action of the same type as the first action and is performed after the first action, and
in calculating the evaluation value, the importance is evaluated higher when the timing of the second action exists within a predetermined time from the timing of the first action than when the timing of the second action does not exist within the predetermined time.

The keyword extraction program according to claim 1, wherein the second action is an action of the same type as the first action and is performed after the first action, and
in calculating the evaluation value, the importance is evaluated higher as the elapsed time from the timing of the first action to the timing of the second action is shorter.

The keyword extraction program according to claim 1, wherein, in calculating the evaluation value, a psychological state of the second user is determined according to whether a predetermined condition under which the timing of the second action is later than the timing of the first action is satisfied, the predetermined condition being that the second user performs no action within a predetermined time immediately before the first action and that the timing of the second action exists within a predetermined time immediately after the first action, and
the evaluation value is calculated based on the result of determining the psychological state.

The keyword extraction program according to claim 1, wherein, in detecting the keyword, a predetermined search target keyword is searched for in the audio data, and
the program further causes the computer to execute a process comprising:
extracting a word other than the search target keyword from the audio data;
calculating another evaluation value indicating the importance of the extracted word based on the relationship; and
adding the extracted word to the search target keywords when the other evaluation value satisfies a predetermined condition.

A keyword extraction method in which a computer:
detects a keyword from audio data representing an utterance made by at least one of a first user and a second user in a conversation between the first user, who is a service provider, and the second user, who is a recipient of the service;
detects a timing of a first action by the first user and a timing of a second action by the second user from action data indicating actions performed by the first user and actions performed by the second user in the conversation; and
calculates an evaluation value indicating the importance of the keyword based on the relationship between the timing of the first action and the timing of the second action.

A keyword extraction device comprising:
a storage unit that stores audio data representing an utterance made by at least one of a first user and a second user in a conversation between the first user, who is a service provider, and the second user, who is a recipient of the service, and action data indicating actions performed by the first user and actions performed by the second user in the conversation; and
a processing unit that detects a keyword from the audio data, detects a timing of a first action by the first user and a timing of a second action by the second user from the action data, and calculates an evaluation value indicating the importance of the keyword based on the relationship between the timing of the first action and the timing of the second action.