JP2020067790A

JP2020067790A - Keyword extracting program, keyword extracting method and keyword extracting apparatus

Info

Publication number: JP2020067790A
Application number: JP2018199696A
Authority: JP
Inventors: 典弘覚幸; Norihiro Kakuko
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2018-10-24
Filing date: 2018-10-24
Publication date: 2020-04-30
Anticipated expiration: 2038-10-24
Also published as: JP7108184B2

Abstract

To improve the extraction accuracy of keywords that reflect the psychological state of a user.
SOLUTION: In a conversation between a first user who is a service provider and a second user who is a service receiver, a keyword is detected from voice data indicating utterances made by at least one of the first user and the second user. The timing of a first action by the first user and the timing of a second action by the second user are detected from motion data indicating the actions taken by the first user and by the second user in the conversation. An evaluation value indicating the importance of the keyword is calculated on the basis of the relationship between the timing of the first action and the timing of the second action.
SELECTED DRAWING: Figure 4

Description

The present invention relates to a keyword extraction program, a keyword extraction method, and a keyword extraction device.

It is sometimes desirable to extract, from a conversation between users, the keywords that were important to that conversation. For example, positive keywords that contributed greatly to a customer's empathy may be extracted from a conversation between the customer and a customer service representative and used in later customer service. Similarly, keywords requiring caution that were used in unfavorable customer service may be extracted from such a conversation and used in later customer service.

A digital photo frame has been proposed that performs voice recognition on a user's utterances while an image is displayed, extracts a keyword, and attaches the extracted keyword to the image as a tag. An empathy interpretation estimation device has also been proposed that presents a video of another user to a user, photographs the head of that user while the video is being viewed, and analyzes the actions of the two users to estimate an empathy interpretation. Further, an understanding state estimation device has been proposed that collects signals indicating the activity state of each conference participant, determines a communication difficulty level from each participant's utterances, and estimates each participant's level of understanding based on the activity state and the communication difficulty level.

A response quality evaluation device has also been proposed that calculates a speech rate from a user's voice, detects the temporal change of a user region from a moving image of the speaking user, and calculates a response evaluation value for the user based on the speech rate and the temporal change of the user region. Further, an information processing device has been proposed that detects the state of each of a plurality of users using sensor devices, calculates the degree of synchronization between the users based on the detected states, and changes the information presented to the users according to the degree of synchronization.

JP 2010-224715 A
JP 2015-64827 A
JP 2016-213631 A
JP 2017-162100 A
JP 2018-45676 A

Chiori Hori and Sadaoki Furui, "A Method for Generating Speech Summaries by Word Extraction and Its Evaluation," IEICE Transactions, J85-D-II, pp. 200-209, February 2002.

However, in conventional techniques for extracting keywords from a conversation, there is room for improvement in the accuracy of extracting keywords that are highly important from the viewpoint of the users' psychological states. For example, if importance is determined simply from how often a keyword appears, keywords closely related to psychological states, such as the customer's degree of empathy or the customer service representative's level of service, may not be extracted.

In one aspect, an object of the present invention is to provide a keyword extraction program, a keyword extraction method, and a keyword extraction device that improve the accuracy of extracting keywords that reflect the psychological state of a user.

In one aspect, a keyword extraction program to be executed by a computer is provided. A keyword is detected from voice data indicating utterances made by at least one of a first user and a second user in a conversation between the first user, who is the service provider, and the second user, who is the service receiver. The timing of a first action by the first user and the timing of a second action by the second user are detected from motion data indicating the actions performed by the first user and the actions performed by the second user in the conversation. An evaluation value indicating the importance of the keyword is calculated based on the relationship between the timing of the first action and the timing of the second action.

In another aspect, a keyword extraction method executed by a computer is provided. In yet another aspect, a keyword extraction device having a storage unit and a processing unit is provided.

In one aspect, the accuracy of extracting keywords that reflect the psychological state of a user is improved.

Diagram illustrating an example of the keyword extraction device of the first embodiment.
Diagram showing an example of the information processing system of the second embodiment.
Block diagram showing a hardware example of the conversation analysis device.
Diagram showing a keyword extraction example of the second embodiment.
Diagram showing an example of calculating a keyword evaluation value.
Block diagram showing a functional example of the conversation analysis device of the second embodiment.
First diagram showing an example of a table held by the conversation analysis device.
Second diagram showing an example of a table held by the conversation analysis device.
Flowchart showing an example procedure of conversation analysis of the second embodiment.
Diagram showing an example of the information processing system of the third embodiment.
Diagram showing an example of the information processing system of the fourth embodiment.
Diagram showing a keyword extraction example of the fifth embodiment.
Flowchart showing an example procedure of conversation analysis of the fifth embodiment.
Block diagram showing a functional example of the conversation analysis device of the sixth embodiment.
Flowchart showing an example procedure of conversation analysis of the sixth embodiment.

Hereinafter, the embodiments will be described with reference to the drawings.
[First Embodiment]
The first embodiment will be described.

FIG. 1 is a diagram illustrating an example of the keyword extraction device of the first embodiment.
The keyword extraction device 10 of the first embodiment extracts, from a conversation between users, important keywords that reflect the psychological state of at least one of the users. For example, the keyword extraction device 10 extracts important keywords related to the customer's degree of empathy from a conversation between a customer and a customer service representative. As another example, the keyword extraction device 10 extracts important keywords related to the customer service representative's level of service from such a conversation.

The keyword extraction device 10 can also be called a computer or an information processing device. The keyword extraction device 10 may be a client device or a server device. The keyword extraction device 10 may extract important keywords in real time during the conversation between the users, or may extract them as a batch process after the conversation ends.

The keyword extraction device 10 has a storage unit 11 and a processing unit 12. The storage unit 11 may be a volatile semiconductor memory such as a RAM (Random Access Memory), or a non-volatile storage such as an HDD (Hard Disk Drive) or a flash memory. The processing unit 12 is, for example, a processor such as a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), or a DSP (Digital Signal Processor). However, the processing unit 12 may include a special-purpose electronic circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array). A set of multiple processors may be referred to as a "multiprocessor" or simply a "processor."

The storage unit 11 stores voice data 13 and motion data 14. The voice data 13 and the motion data 14 are records of a conversation between a first user and a second user. The first user is the person providing a service, and the second user is the person receiving the service. Examples of service fields include work carried out through communication, such as retail, education, and medical care. The first user is a customer service representative, for example, a product explainer in a store, a lecturer at an educational institution, or a doctor or counselor at a medical institution. The second user is a customer being served, for example, a consumer visiting a store, a student at an educational institution, or a patient visiting a medical institution.

The voice data 13 indicates utterances made by at least one of the first user and the second user. For example, the voice data 13 is a voice signal obtained by recording the utterances of at least one of the first user and the second user with a microphone. In that case, the voice data 13 may record only the first user's utterances, only the second user's utterances, or the utterances of both users. The first user's utterances and the second user's utterances may be recorded with the same microphone or with different microphones.

The motion data 14 indicates the actions performed by the first user and the actions performed by the second user in the conversation. An action is a physical movement that the other user can see, such as a change in facial expression, a gesture of the head, arms, or legs, a change in line of sight, or a change in posture. Changes in facial expression include laughing. Head gestures include nodding. Changes in line of sight include looking at the other user's head. Changes in posture include leaning forward.

For example, the motion data 14 is image data, such as a moving image of the first user and the second user captured with an image sensor. However, the motion data 14 may instead be sensor data generated with a sensor device other than an image sensor. For example, the motion data 14 may be obtained by detecting head gestures with an acceleration sensor built into a headset. The motion data 14 may also be obtained by detecting arm gestures with an acceleration sensor built into a wristwatch, or by detecting changes in posture with a pressure sensor built into a chair.

Within the motion data 14, the data on the first user and the data on the second user may be generated with the same device or with different devices. For example, when the motion data 14 is image data, both the first user and the second user may appear in an image captured with a single image sensor, or each user may appear in a separate image captured with a different image sensor.

The processing unit 12 detects a keyword 15 from the voice data 13. For example, the processing unit 12 converts the voice data 13 into a character string (text) of the utterances by voice recognition and searches the character string for predetermined search target keywords. The search target keywords are defined in advance, for example, as a keyword list. Alternatively, without converting the entire utterance into a character string, the processing unit 12 may use word spotting to continuously compare the features of the uttered voice signal with the features of the voice signal of each search target keyword, and thereby directly recognize only the search target keywords.

The keyword 15 may have been uttered by the first user or by the second user. The processing unit 12 may search for the search target keywords while distinguishing the first user's utterances from the second user's utterances, or without distinguishing between them. The processing unit 12 may also restrict the search to the utterances of only one of the two users.
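The text-based keyword search described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: it assumes voice recognition has already produced timestamped, speaker-labeled utterances, and the names `Utterance`, `KeywordHit`, `find_keywords`, and the speaker labels `"staff"` and `"customer"` are hypothetical. The time of each hit is approximated by the start time of the utterance that contains it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    speaker: str   # hypothetical label, e.g. "staff" or "customer"
    start: float   # seconds from the start of the conversation
    text: str      # recognized text of the utterance


@dataclass(frozen=True)
class KeywordHit:
    keyword: str
    speaker: str
    time: float    # approximated by the start time of the utterance


def find_keywords(utterances, keyword_list, speakers=None):
    """Return one hit per occurrence of a listed keyword.

    speakers=None searches both users without distinction; pass a set
    such as {"staff"} to restrict the search to one side.
    """
    hits = []
    for u in utterances:
        if speakers is not None and u.speaker not in speakers:
            continue
        for kw in keyword_list:
            if not kw:
                continue  # ignore empty entries in the keyword list
            pos = u.text.find(kw)
            while pos != -1:
                hits.append(KeywordHit(kw, u.speaker, u.start))
                pos = u.text.find(kw, pos + len(kw))
    return hits
```

Keeping the speaker on each hit supports all three search modes in the text: distinguishing the users, ignoring the distinction, or restricting the search to one user.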

The processing unit 12 also detects, from the motion data 14, the timing of an action 16 by the first user (first action) and the timing of an action 17 by the second user (second action). The processing of the voice data 13 and the processing of the motion data 14 may be executed in either order or in parallel. The action 16 of the first user and the action 17 of the second user are detected separately. For example, when the motion data 14 is image data, the processing unit 12 uses image recognition to detect physical actions of each user, such as changes in facial expression, gestures of the head, arms, or legs, changes in line of sight, and changes in posture. When a sensor device other than an image sensor generated the motion data 14, no special recognition processing may be necessary.

The processing unit 12 then calculates an evaluation value 18 indicating the importance of the detected keyword 15, based on the relationship between the timing of the action 16 by the first user and the timing of the action 17 by the second user. For example, the processing unit 12 determines, based on the evaluation value 18, whether to extract the keyword 15 as an important keyword. When the evaluation value 18 exceeds a predetermined first threshold, the processing unit 12 may extract the keyword 15 as a preferable keyword. When the evaluation value 18 is less than a predetermined second threshold that is smaller than the first threshold, the processing unit 12 may extract the keyword 15 as a keyword requiring caution. The keyword 15 is, for example, a keyword uttered within a predetermined range of the actions 16 and 17 on the time axis.
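The two-threshold decision described above can be sketched as follows. This is only an illustration under assumptions: the function name `classify_keyword` and the labels `"preferable"` and `"caution"` are hypothetical, and the threshold values are left to the caller because the text does not specify them.

```python
def classify_keyword(evaluation_value, first_threshold, second_threshold):
    """Classify a keyword by its evaluation value.

    Returns "preferable" above the first threshold, "caution" below the
    second threshold, and None when the keyword is not extracted.
    """
    if second_threshold >= first_threshold:
        raise ValueError("second threshold must be smaller than the first")
    if evaluation_value > first_threshold:
        return "preferable"
    if evaluation_value < second_threshold:
        return "caution"
    return None
```

Values between the two thresholds, including values equal to either threshold, are treated as unremarkable and are not extracted.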

As the relationship between the timings of the actions 16 and 17, the processing unit 12 may detect that the actions 16 and 17 are of the same type, that the action 16 was performed first, and that the action 17 was performed within a predetermined time after the action 16. An additional condition may be imposed that the second user did not perform an action within a predetermined time immediately before the action 16, that is, that the first user initiated the action. This relationship can be said to reflect the psychological state of the second user.
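The timing relationship described above can be sketched as follows. This is a hedged illustration, not the patent's implementation: the `Action` record, the user labels `"first"` and `"second"`, and the parameter names `window` and `quiet_before` are hypothetical. The sketch also assumes that the optional condition is violated by any action of the second user just before the action 16, regardless of its type, which the text leaves open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    user: str    # "first" (service provider) or "second" (service receiver)
    kind: str    # e.g. "nod" or "laugh"
    time: float  # seconds from the start of the conversation


def find_led_pairs(actions, window, quiet_before=None):
    """Find pairs (action 16, action 17) initiated by the first user.

    A pair is reported when the second user performs the same kind of
    action within `window` seconds after the first user's action. If
    `quiet_before` is given, the first user's action is skipped when the
    second user acted within that many seconds immediately before it.
    """
    firsts = sorted((a for a in actions if a.user == "first"), key=lambda a: a.time)
    seconds = sorted((a for a in actions if a.user == "second"), key=lambda a: a.time)
    pairs = []
    for a16 in firsts:
        if quiet_before is not None and any(
            a16.time - quiet_before <= s.time < a16.time for s in seconds
        ):
            continue  # the second user moved first, so the first user did not initiate
        a17 = next(
            (s for s in seconds
             if s.kind == a16.kind and a16.time < s.time <= a16.time + window),
            None,
        )
        if a17 is not None:
            pairs.append((a16, a17))
    return pairs
```

Swapping the roles of the two user labels gives the opposite direction, in which the second user's action comes first and the first user follows.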

For example, when the first user is a customer service representative and the second user is a customer, this relationship indicates synchronization: an action of the representative, such as laughing or nodding, was followed by the same type of action by the customer. It can therefore be inferred that the customer is closely observing the representative's actions and empathizes with what the representative is saying, which indicates a lively conversation. Conversely, when the action 17 is performed first and the action 16 is performed within a predetermined time after the action 17, the relationship indicates synchronization in which an action of the customer, such as laughing or nodding, was followed by the same type of action by the representative. It can therefore be inferred that the representative is closely observing the customer's actions, which indicates good customer service.

When the actions 16 and 17 are of the same type, the action 16 is performed first, and the action 17 is performed within a predetermined time after the action 16, the processing unit 12 may give a high evaluation to the keyword 15 uttered near the actions 16 and 17. A high evaluation of the keyword 15 corresponds to a good psychological state of the second user. When the second user is a customer, the keyword 15 may be a preferable keyword that won the customer's empathy.

On the other hand, when the action 17 is not performed within the predetermined time after the action 16, the processing unit 12 may give a low evaluation to the keyword 15 uttered near the action 16. A low evaluation of the keyword 15 corresponds to a poor psychological state of the second user. When the second user is a customer, the keyword 15 may be a keyword requiring caution that failed to win the customer's empathy.
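One possible way to turn the high and low evaluations above into a number is sketched below. The scoring rule is an assumption made for illustration only, since this passage does not give a formula: the evaluation value is taken as the fraction of the first user's actions near the keyword that were followed by a synchronized action of the second user. The function name `evaluation_value` and the parameter `near` are hypothetical.

```python
def evaluation_value(keyword_time, lead_actions, near=5.0):
    """Score a keyword from the first user's actions near it in time.

    lead_actions is a list of (time_of_action_16, synchronized) pairs,
    where synchronized is True when the second user performed the same
    kind of action within the predetermined time afterwards.
    Returns a value in [0, 1], or None when no action is near the keyword.
    """
    nearby = [synced for t, synced in lead_actions
              if abs(t - keyword_time) <= near]
    if not nearby:
        return None  # the keyword cannot be evaluated from actions
    return sum(nearby) / len(nearby)
```

Under this rule a keyword surrounded only by synchronized actions scores 1.0 and a keyword surrounded only by unanswered actions scores 0.0, which matches the high and low evaluations in the text.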

The processing unit 12 may output the extracted important keywords. For example, the processing unit 12 may store the important keywords in a storage device of the keyword extraction device 10. The processing unit 12 may also output the important keywords to an output device of the keyword extraction device 10, for example by showing them on a display, or may transmit them to another information processing device via a network.

According to the keyword extraction device 10 of the first embodiment, the keyword 15 is detected from the voice data 13, and the timing of the action 16 of the first user and the timing of the action 17 of the second user are detected from the motion data 14. The evaluation value 18 of the keyword 15 is then calculated based on the relationship between the timing of the action 16 and the timing of the action 17. This makes it possible to accurately extract important keywords from the viewpoint of the psychological state of at least one of the first user and the second user. The important keywords extracted by the keyword extraction device 10 can therefore be used for a given purpose, such as improving customer service.

[Second Embodiment]
Next, a second embodiment will be described.
FIG. 2 is a diagram showing an example of the information processing system of the second embodiment.

The information processing system of the second embodiment analyzes conversations and supports the improvement of customer service in lines of business where customers talk with customer service representatives. The system can be applied to various kinds of work, such as product explanation and health guidance.

The information processing system of the second embodiment includes a management device 41 and a conversation analysis device 100, both connected to a network 40. A camera device 50 is connected to the conversation analysis device 100. The management device 41 is a terminal device used by a manager who supervises customer service representatives, such as a representative's superior. The conversation analysis device 100 is a terminal device installed where the customer and the customer service representative talk. For example, the conversation analysis device 100 is installed on or near the counter at which the customer and the representative face each other. The camera device 50 is a device with a video capture function and an audio recording function. The camera device 50 is installed so that it can film and record the conversation between the customer and the representative.

While the customer and the customer service representative are talking, the camera device 50 films them so that both fit within the image, and records audio so that the voices of both are captured. The conversation analysis device 100 collects the image data representing the captured video and the voice data representing the recorded audio, and analyzes the conversation between the customer and the representative. Specifically, the conversation analysis device 100 detects keywords uttered by the customer or the representative from the voice data, and detects the actions of the customer and of the representative during the conversation from the image data. The conversation analysis device 100 evaluates the keywords based on the actions of the customer and the representative, and extracts important keywords. The conversation analysis device 100 reports the extracted important keywords to the management device 41.

The conversation analysis by the conversation analysis device 100 and the reporting of important keywords from the conversation analysis device 100 to the management device 41 may be performed in real time while the customer service representative is working, or as a batch process after the representative's work has ended. For example, the conversation analysis device 100 analyzes the voice data and image data output by the camera device 50 in real time, determines the important keywords at each break in the conversation, and transmits them to the management device 41. Possible conversation breaks include the point at which service for one customer ends, the point at which silence has continued for a predetermined time or longer, and the point at which a fixed time has elapsed since the start of the conversation. Alternatively, the conversation analysis device 100 stores the voice data and image data output by the camera device 50, analyzes them together after work has ended, and transmits the important keywords to the management device 41.
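Two of the conversation breaks listed above, prolonged silence and elapsed time, can be sketched as follows. This is a minimal illustration under assumptions: utterances are given as (start, end) times in seconds sorted by start time, and the names `split_conversation`, `max_silence`, and `max_length` are hypothetical. Detecting the end of service for one customer would need additional input and is not covered.

```python
def split_conversation(utterance_spans, max_silence, max_length):
    """Split utterance spans into segments at conversation breaks.

    utterance_spans: list of (start, end) times in seconds, sorted by start.
    A new segment begins when the silence since the previous utterance is
    at least max_silence, or when at least max_length seconds have passed
    since the start of the current segment.
    """
    segments = []
    current = []
    for start, end in utterance_spans:
        if current:
            silence = start - current[-1][1]
            elapsed = start - current[0][0]
            if silence >= max_silence or elapsed >= max_length:
                segments.append(current)
                current = []
        current.append((start, end))
    if current:
        segments.append(current)
    return segments
```

In a real-time setting the same checks could be applied as each utterance arrives, with the important keywords determined whenever a segment is closed.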

At least some of the important keywords transmitted from the conversation analysis device 100 to the management device 41 are shown on the display of the management device 41. The manager may check the important keywords while the customer service representative is working or after work has ended.

The conversation analysis device 100 corresponds to the keyword extraction device 10 of the first embodiment. The image data representing the video captured by the camera device 50 corresponds to the motion data 14 of the first embodiment. The voice data representing the audio recorded by the camera device 50 corresponds to the voice data 13 of the first embodiment.

FIG. 3 is a block diagram showing a hardware example of the conversation analysis device.
The conversation analysis device 100 includes a CPU 101, a RAM 102, an HDD 103, an image signal processing unit 104, input signal processing units 105 and 106, a medium reader 107, and a communication interface 108, all connected to a bus. The CPU 101 corresponds to the processing unit 12 of the first embodiment. The RAM 102 or the HDD 103 corresponds to the storage unit 11 of the first embodiment. The management device 41 can be implemented with similar hardware.

ＣＰＵ１０１は、プログラムの命令を実行するプロセッサである。ＣＰＵ１０１は、ＨＤＤ１０３に記憶されたプログラムやデータの少なくとも一部をＲＡＭ１０２にロードし、プログラムを実行する。なお、ＣＰＵ１０１は複数のプロセッサコアを備えてもよく、会話分析装置１００は複数のプロセッサを備えてもよい。複数のプロセッサの集合を「マルチプロセッサ」または単に「プロセッサ」と言うことがある。 The CPU 101 is a processor that executes program instructions. The CPU 101 loads at least a part of the programs and data stored in the HDD 103 into the RAM 102 and executes the programs. The CPU 101 may include a plurality of processor cores, and the conversation analysis device 100 may include a plurality of processors. A set of multiple processors may be referred to as a "multiprocessor" or simply "processor."

ＲＡＭ１０２は、ＣＰＵ１０１が実行するプログラムやＣＰＵ１０１が演算に使用するデータを一時的に記憶する揮発性の半導体メモリである。なお、会話分析装置１００は、ＲＡＭ以外の種類のメモリを備えてもよく、複数のメモリを備えてもよい。 The RAM 102 is a volatile semiconductor memory that temporarily stores a program executed by the CPU 101 and data used by the CPU 101 for calculation. Note that the conversation analysis device 100 may include a memory of a type other than the RAM, or may include a plurality of memories.

ＨＤＤ１０３は、ＯＳ（Operating System）やミドルウェアやアプリケーションソフトウェアなどのソフトウェアのプログラム、および、データを記憶する不揮発性ストレージである。なお、会話分析装置１００は、フラッシュメモリやＳＳＤ（Solid State Drive）など他の種類のストレージを備えてもよく、複数のストレージを備えてもよい。 The HDD 103 is a non-volatile storage that stores an OS (Operating System), software programs such as middleware and application software, and data. The conversation analysis device 100 may include another type of storage such as a flash memory or an SSD (Solid State Drive), or may include a plurality of storages.

画像信号処理部１０４は、ＣＰＵ１０１からの命令に従って、会話分析装置１００に接続されたディスプレイ１１１に画像を出力する。ディスプレイ１１１としては、ＣＲＴ（Cathode Ray Tube）ディスプレイ、液晶ディスプレイ（ＬＣＤ：Liquid Crystal Display）、有機ＥＬ（ＯＥＬ：Organic Electro-Luminescence）ディスプレイなど、任意の種類のディスプレイを使用することができる。 The image signal processing unit 104 outputs an image to the display 111 connected to the conversation analysis device 100 according to an instruction from the CPU 101. As the display 111, a CRT (Cathode Ray Tube) display, a liquid crystal display (LCD: Liquid Crystal Display), an organic EL (OEL: Organic Electro-Luminescence) display, or any other type of display can be used.

入力信号処理部１０５は、会話分析装置１００に接続された入力デバイス１１２から入力信号を受信する。入力デバイス１１２として、マウス、タッチパネル、タッチパッド、キーボードなど、任意の種類の入力デバイスを使用できる。また、会話分析装置１００に複数の種類の入力デバイスが接続されてもよい。 The input signal processing unit 105 receives an input signal from the input device 112 connected to the conversation analysis device 100. As the input device 112, any type of input device such as a mouse, a touch panel, a touch pad, a keyboard can be used. Moreover, a plurality of types of input devices may be connected to the conversation analysis device 100.

入力信号処理部１０６は、会話分析装置１００に接続されたカメラ装置５０から画像信号および音声信号を受信する。カメラ装置５０は、イメージセンサ５１およびマイクロフォン５２を有する。イメージセンサ５１は、光を電気信号（画像信号）に変換する撮像素子である。イメージセンサ５１として、ＣＣＤ（Charge Coupled Device）イメージセンサやＣＭＯＳ（Complementary Metal Oxide Semiconductor）イメージセンサなど、任意の種類のイメージセンサを使用できる。マイクロフォン５２は、音を電気信号（音声信号）に変換する。マイクロフォン５２として、ダイナミックマイクやコンデンサマイクなど、任意の種類のマイクロフォンを使用できる。 The input signal processing unit 106 receives image signals and audio signals from the camera device 50 connected to the conversation analysis device 100. The camera device 50 has an image sensor 51 and a microphone 52. The image sensor 51 is an image sensor that converts light into an electric signal (image signal). As the image sensor 51, any type of image sensor such as a CCD (Charge Coupled Device) image sensor or a CMOS (Complementary Metal Oxide Semiconductor) image sensor can be used. The microphone 52 converts sound into an electric signal (voice signal). Any type of microphone such as a dynamic microphone or a condenser microphone can be used as the microphone 52.

媒体リーダ１０７は、記録媒体１１３に記録されたプログラムやデータを読み取る読み取り装置である。記録媒体１１３として、例えば、フレキシブルディスク（ＦＤ：Flexible Disk）やＨＤＤなどの磁気ディスク、ＣＤ（Compact Disc）やＤＶＤ（Digital Versatile Disc）などの光ディスク、光磁気ディスク（ＭＯ：Magneto-Optical disk）、半導体メモリなどを使用できる。媒体リーダ１０７は、例えば、記録媒体１１３から読み取ったプログラムやデータをＲＡＭ１０２またはＨＤＤ１０３に格納する。 The medium reader 107 is a reading device that reads programs and data recorded on the recording medium 113. As the recording medium 113, for example, a magnetic disk such as a flexible disk (FD: Flexible Disk) or an HDD, an optical disk such as a CD (Compact Disc) or a DVD (Digital Versatile Disc), a magneto-optical disk (MO: Magneto-Optical disk), A semiconductor memory or the like can be used. The medium reader 107 stores, for example, the program or data read from the recording medium 113 in the RAM 102 or the HDD 103.

通信インタフェース１０８は、ネットワーク４０に接続され、ネットワーク４０を介して管理装置４１と通信を行う。通信インタフェース１０８は、スイッチやルータなどの有線通信装置に接続される有線通信インタフェースでもよいし、基地局やアクセスポイントなどの無線通信装置に接続される無線通信インタフェースでもよい。 The communication interface 108 is connected to the network 40 and communicates with the management device 41 via the network 40. The communication interface 108 may be a wired communication interface connected to a wired communication device such as a switch or a router, or a wireless communication interface connected to a wireless communication device such as a base station or an access point.

次に、第２の実施の形態のキーワード抽出方法について説明する。第２の実施の形態の会話分析装置１００は、顧客と接客担当者との間の会話で出現したキーワードの中から、顧客の共感を得られた好ましいキーワードを推定して重要キーワードとして抽出する。また、第２の実施の形態の会話分析装置１００は、顧客と接客担当者との間の会話で出現したキーワードの中から、顧客の共感を得られなかった要注意のキーワードを推定して重要キーワードとして抽出する。これにより、次回以降の接客において顧客の共感がより得られるように、接客担当者の接客スキルを向上させることが可能となる。 Next, the keyword extraction method of the second embodiment will be described. The conversation analysis device 100 of the second embodiment estimates, from among the keywords that appeared in the conversation between the customer and the customer service representative, favorable keywords that won the customer's empathy, and extracts them as important keywords. The conversation analysis device 100 of the second embodiment also estimates, from among the same keywords, cautionary keywords that failed to win the customer's empathy, and likewise extracts them as important keywords. This makes it possible to improve the representative's customer service skills so that greater customer empathy is obtained in subsequent service encounters.

図４は、第２の実施の形態のキーワード抽出例を示す図である。 FIG. 4 is a diagram showing an example of keyword extraction according to the second embodiment.
顧客と接客担当者との間で会話が盛り上がっており接客担当者の話に顧客が共感しているか否かを評価するため、第２の実施の形態では、顧客の動作と接客担当者の動作との間の同期を検出する。第２の実施の形態の同期は、接客担当者から動作を開始し、その直後に顧客が同じ種類の動作を行ったという動作の連鎖である。この場合、顧客は接客担当者を注意深く見ており、接客担当者の話をポジティブに聞いていると推定される。よって、このような同期が発生しているときに顧客または接客担当者が発したキーワードは、好ましいキーワードである可能性がある。一方、このような同期が発生していないときに顧客または接客担当者が発したキーワードは、要注意のキーワードである可能性がある。 To evaluate whether the conversation between the customer and the customer service representative is lively and whether the customer empathizes with what the representative is saying, the second embodiment detects synchronization between the customer's actions and the representative's actions. Synchronization in the second embodiment is a chain of actions in which the representative starts an action and, immediately afterward, the customer performs the same type of action. In this case, it is presumed that the customer is watching the representative attentively and is listening to the representative in a positive frame of mind. Therefore, a keyword uttered by the customer or the representative while such synchronization is occurring may be a favorable keyword. Conversely, a keyword uttered by the customer or the representative while no such synchronization is occurring may be a cautionary keyword.

例えば、以下に説明するシーン７１〜７３を考える。 For example, consider scenes 71 to 73 described below.
シーン７１では、接客担当者が笑うという動作を行い、その直後に顧客が笑うという動作を行っている。接客担当者の動作の直前の時間Ｓ１以内に顧客は動作を行っておらず、接客担当者の動作の直後の時間Ｓ２以内に顧客は動作を行っている。シーン７１では、接客担当者から開始して顧客の動作と接客担当者の動作とが同期しているため、顧客の共感度が大きいと判定される。すると、シーン７１の周辺で顧客または接客担当者が発した「レスポンス」というキーワードの評価値は高くなる。 In scene 71, the customer service representative laughs, and immediately afterward the customer laughs. The customer performs no action within the time S1 immediately before the representative's action, and performs an action within the time S2 immediately after it. In scene 71, the customer's action and the representative's action are synchronized, with the representative acting first, so the customer's degree of empathy is determined to be high. As a result, the evaluation value of the keyword "response" uttered by the customer or the representative around scene 71 becomes high.

シーン７２では、接客担当者が笑うという動作を行ったものの、接客担当者の動作の直前の時間Ｓ１以内に顧客は動作を行っておらず、接客担当者の動作の直後の時間Ｓ２以内にも顧客は動作を行っていない。シーン７２では、顧客の動作と接客担当者の動作とが同期していないため、顧客の共感度が小さいと判定される。すると、シーン７２の周辺で顧客または接客担当者が発した「新機能」というキーワードの評価値は低くなる。 In scene 72, the customer service representative laughs, but the customer performs no action either within the time S1 immediately before the representative's action or within the time S2 immediately after it. In scene 72, the customer's action and the representative's action are not synchronized, so the customer's degree of empathy is determined to be low. As a result, the evaluation value of the keyword "new function" uttered by the customer or the representative around scene 72 becomes low.

シーン７３では、顧客がうなずくという動作を行い、その直後に接客担当者がうなずくという動作を行っている。接客担当者の動作の直前の時間Ｓ１以内に顧客は動作を行っており、接客担当者の動作の直後の時間Ｓ２以内に顧客は動作を行っていない。シーン７３では、接客担当者から開始して顧客の動作と接客担当者の動作とが同期しているわけではなく、顧客の共感度が小さいと判定される。すると、シーン７３の周辺で顧客または接客担当者が発した「画質」というキーワードの評価値は低くなる。 In scene 73, the customer nods, and immediately afterward the customer service representative nods. The customer performs an action within the time S1 immediately before the representative's action, and performs no action within the time S2 immediately after it. In scene 73, the two actions are not synchronized in the sense of the representative acting first, so the customer's degree of empathy is determined to be low. As a result, the evaluation value of the keyword "image quality" uttered by the customer or the representative around scene 73 becomes low.
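The three scenes can be checked mechanically. The following is a minimal sketch, not the patented implementation: the `Action` record, the function name `staff_led_sync`, and the 2-second defaults for S1 and S2 are assumptions made for illustration. It treats a scene as synchronized only when the customer did not act within S1 before the representative's action and did perform the same type of action within S2 after it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    time: float  # seconds from the start of the recording
    actor: str   # "staff" or "customer" (hypothetical labels)
    kind: str    # e.g. "laugh", "nod"


def staff_led_sync(staff_action, actions, s1=2.0, s2=2.0):
    """Return True if the customer followed the staff action without leading it."""
    t = staff_action.time
    customer = [a for a in actions if a.actor == "customer"]
    # Any customer action inside the S1 window just before the staff action.
    led = any(t - s1 <= a.time < t for a in customer)
    # A customer action of the same kind inside the S2 window just after it.
    followed = any(
        t < a.time <= t + s2 and a.kind == staff_action.kind for a in customer
    )
    return followed and not led


# Scenes 71 to 73 with illustrative timestamps.
scene71 = [Action(10.0, "staff", "laugh"), Action(11.0, "customer", "laugh")]
scene72 = [Action(10.0, "staff", "laugh")]
scene73 = [Action(9.0, "customer", "nod"), Action(10.0, "staff", "nod")]
```

Note that this hard yes/no decision is a simplification of FIG. 4; the scene evaluation value described later uses the S1 check as a weight rather than as a filter.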

会話分析装置１００は、評価値が閾値Ｔ１より大きいキーワードを好ましいキーワードと推定し、重要キーワードとして抽出する。また、会話分析装置１００は、評価値が閾値Ｔ２より小さいキーワードを要注意のキーワードと推定し、同様に重要キーワードとして抽出する。閾値Ｔ１，Ｔ２は予め決められており、Ｔ１＞Ｔ２である。上記の例では、「レスポンス」が重要キーワードとして抽出される可能性がある。 The conversation analysis device 100 estimates a keyword having an evaluation value larger than the threshold value T1 as a preferable keyword and extracts it as an important keyword. Also, the conversation analysis device 100 estimates a keyword having an evaluation value smaller than the threshold value T2 as a keyword requiring attention, and similarly extracts it as an important keyword. The threshold values T1 and T2 are predetermined and T1> T2. In the above example, “response” may be extracted as an important keyword.

ここで、会話分析装置１００は、画像データから顧客と接客担当者それぞれの動作を検出することになる。検出すべき動作は、表情の変化、頭部や腕や足のジェスチャ、視線の変更、姿勢の変更など、視認可能な身体的動作である。表情の変化には笑うことが含まれる。頭部のジェスチャにはうなずくことが含まれる。視線の変更には相手の頭部を見ることが含まれる。姿勢の変更には前のめりになることが含まれる。 Here, the conversation analysis device 100 detects the actions of the customer and of the customer service representative from the image data. The actions to be detected are visible physical actions such as changes in facial expression, gestures of the head, arms, or legs, changes in the line of sight, and changes in posture. Changes in facial expression include laughing. Head gestures include nodding. Changes in the line of sight include looking at the other person's head. Changes in posture include leaning forward.

会話分析装置１００は、画像認識により画像データから顧客と接客担当者を認識する。例えば、会話分析装置１００には接客担当者の容姿の特徴情報が予め登録されており、その特徴情報に基づいて接客担当者が認識される。その場合、接客担当者以外の人物が顧客として認識される。また、会話分析装置１００は、画像認識により画像データから顧客と接客担当者それぞれの動作の種類を認識する。このとき、うなずきの大きさや腕のジェスチャの大きさなど、動作の大きさを併せて認識してもよい。 The conversation analysis device 100 recognizes the customer and the customer service representative from the image data by image recognition. For example, in the conversation analysis device 100, the characteristic information of the appearance of the person in charge of customer service is registered in advance, and the person in charge of customer service is recognized based on the characteristic information. In that case, a person other than the person in charge of customer service is recognized as a customer. Further, the conversation analysis device 100 recognizes the types of actions of the customer and the customer service representative from the image data by image recognition. At this time, the size of the motion such as the size of the nod or the size of the arm gesture may be recognized together.

動作の検出には、特許文献２（特開２０１５−６４８２７号公報）に記載された技術を用いてもよい。例えば、表情について、会話分析装置１００は、画像データの各フレームから目と口の輪郭を抽出し、フレーム間における輪郭の変化から表情の変化を判定する。また、例えば、うなずきについて、会話分析装置１００は、画像データの各フレームから目と鼻と口の位置を抽出し、フレーム間における目と鼻と口の位置の変化からうなずきを判定する。動作の大きさは、その変化量から判定することができる。 The technique described in Patent Document 2 (JP-A-2015-64827) may be used to detect the motion. For example, regarding facial expressions, the conversation analysis device 100 extracts the contours of the eyes and mouth from each frame of the image data, and determines the change of facial expressions from the change in the contours between the frames. Further, for example, for nodding, the conversation analysis device 100 extracts the positions of the eyes, the nose, and the mouth from each frame of the image data, and determines the nod from the change in the positions of the eyes, the nose, and the mouth between the frames. The magnitude of the motion can be determined from the amount of change.

また、会話分析装置１００は、音声認識により音声データからキーワードを認識する。このとき、会話分析装置１００は、顧客の発話と接客担当者の発話とを区別して認識してもよいし、両者を区別せずに認識してもよい。顧客の発話と接客担当者の発話とを区別する方法として、例えば、接客担当者の声質の特徴情報を予め会話分析装置１００に登録しておき、異なる２つの声質の発話のうち接客担当者の発話を先に判定して他方の発話を顧客の発話とみなす方法が考えられる。また、顧客の発話と接客担当者の発話とを区別する方法として、例えば、録音時の音声の到来方向から判定する方法も考えられる。 The conversation analysis device 100 also recognizes keywords from the voice data by voice recognition. In doing so, the conversation analysis device 100 may distinguish between the customer's utterances and the customer service representative's utterances, or may recognize them without distinguishing between the two. One conceivable way to distinguish them is to register feature information on the representative's voice quality in the conversation analysis device 100 in advance, identify the representative's utterances first among utterances of two different voice qualities, and regard the remaining utterances as the customer's. Another conceivable way is to decide based on the direction from which the voice arrived at the time of recording.

キーワードの検出には、特許文献１（特開２０１０−２２４７１５号公報）に記載された技術を用いてもよい。例えば、会話分析装置１００には、顧客の共感度に影響を与える可能性がある検索対象キーワードが予め登録されている。会話分析装置１００は、音声データが示す音声波形をフーリエ変換などにより音声特徴情報に変換し、予め用意した音声認識モデルに音声特徴情報を入力して単語列に変換し、単語列の中から検索対象キーワードを検索する。ただし、会話分析装置１００は、発話全体を単語列に変換せずに、ワードスポッティングにより検索対象キーワードのみを直接検出してもよい。 The technique described in Patent Document 1 (Japanese Patent Laid-Open No. 2010-224715) may be used to detect keywords. For example, search target keywords that may affect the customer's degree of empathy are registered in the conversation analysis device 100 in advance. The conversation analysis device 100 converts the voice waveform indicated by the voice data into voice feature information by a Fourier transform or the like, inputs the voice feature information into a voice recognition model prepared in advance to convert it into a word string, and searches the word string for the search target keywords. However, the conversation analysis device 100 may instead detect only the search target keywords directly by word spotting, without converting the entire utterance into a word string.
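The final step of this paragraph, searching a recognized word string for the registered keywords, can be sketched as follows. This is an illustration only: it assumes the recognizer has already produced a list of (time, word) pairs, and the function name `detect_keywords` and the sample data are hypothetical. Voice recognition itself is outside the scope of the sketch.

```python
def detect_keywords(timed_words, registered):
    """Collect the utterance times of each registered keyword.

    timed_words: iterable of (time_in_seconds, word) pairs from a recognizer.
    registered:  iterable of search target keywords.
    Returns a dict mapping each detected keyword to its list of times.
    """
    targets = set(registered)
    hits = {}
    for time, word in timed_words:
        # Words that are not registered are ignored entirely.
        if word in targets:
            hits.setdefault(word, []).append(time)
    return hits


# Illustrative recognizer output and keyword list.
words = [(1.0, "the"), (1.5, "response"), (3.0, "is"), (9.0, "response")]
found = detect_keywords(words, ["response", "new function"])
```

Keeping every occurrence time, rather than only the first, matters later because a keyword that is uttered several times is evaluated once per occurrence.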

次に、キーワード評価値の算出方法について説明する。 Next, a method of calculating the keyword evaluation value will be described.
図５は、キーワード評価値の算出例を示す図である。 FIG. 5 is a diagram illustrating a calculation example of the keyword evaluation value.
まず、会話分析装置１００は、顧客と接客担当者を録画した画像データを用いて、所定の時間間隔でシーン評価値を算出する。第２の実施の形態では、シーン評価値は顧客と接客担当者との間のその時点の会話の盛り上がりを示しており、顧客の共感度に相当する。シーン評価値が大きいほど会話の盛り上がりが大きく、共感度が大きいと推定される。シーン評価値が小さいほど会話の盛り上がりが小さく、共感度が小さいと推定される。 First, the conversation analysis device 100 calculates a scene evaluation value at predetermined time intervals using the image data in which the customer and the customer service representative were recorded. In the second embodiment, the scene evaluation value indicates how lively the conversation between the customer and the representative is at that point in time, and corresponds to the customer's degree of empathy. The larger the scene evaluation value, the livelier the conversation and the higher the degree of empathy are presumed to be. The smaller the scene evaluation value, the less lively the conversation and the lower the degree of empathy are presumed to be.

そして、会話分析装置１００は、音声データから抽出されたキーワードについて、当該キーワードが発せられた時刻の周辺のシーン評価値を用いてキーワード評価値を算出する。キーワード評価値が大きいキーワードほど、顧客の共感を得られたキーワードである可能性が高い。キーワード評価値が小さいキーワードほど、顧客の共感を得られなかったキーワードである可能性が高い。会話分析装置１００は、キーワード評価値が閾値Ｔ１を超えるキーワードを重要キーワードとして抽出する。また、会話分析装置１００は、キーワード評価値が閾値Ｔ２未満のキーワードも重要キーワードとして抽出する。 Then, the conversation analysis device 100 calculates the keyword evaluation value for the keyword extracted from the voice data, using the scene evaluation value around the time when the keyword is issued. A keyword having a larger keyword evaluation value is more likely to be a keyword that has the sympathy of the customer. The smaller the keyword evaluation value, the higher the possibility that the keyword has not gained the customer's sympathy. The conversation analysis device 100 extracts a keyword whose keyword evaluation value exceeds the threshold value T1 as an important keyword. The conversation analysis device 100 also extracts keywords having a keyword evaluation value less than the threshold value T2 as important keywords.

シーン評価値の算出では、会話分析装置１００は、ある時刻を中心にして時間Ｓ０前から時間Ｓ０後までの区間（前後の時間Ｓ０の区間）をスライディングウィンドウ８１として設定する。スライディングウィンドウ８１の位置は、時間Δｔずつずらしていくことになる。時間Δｔは、例えば、１フレーム時間から１秒程度とする。時間Ｓ０は、例えば、１分から２分程度とする。スライディングウィンドウ８１の中心時刻に対して１つのシーン評価値が算出されるため、時間Δｔ間隔でシーン評価値が算出されることになる。 In the calculation of the scene evaluation value, the conversation analysis device 100 sets, as a sliding window 81, a section from before the time S0 to after the time S0 (a section before and after the time S0) around a certain time. The position of the sliding window 81 will be shifted by the time Δt. The time Δt is, for example, about 1 frame time to 1 second. The time S0 is, for example, about 1 minute to 2 minutes. Since one scene evaluation value is calculated with respect to the center time of the sliding window 81, the scene evaluation value is calculated at time Δt intervals.

スライディングウィンドウ８１の中で、接客担当者が動作を行った時刻をＦ（ｘ）とし、時刻Ｆ（ｘ）における動作の重みをｗ（ｘ）とする。重みｗ（ｘ）は、時刻Ｆ（ｘ）の直前の時間Ｓ１の間における顧客の動作に基づいて決定される。時間Ｓ１は、例えば、１秒から２秒程度である。直前の時間Ｓ１の間に顧客が動作を行っていない場合は重みｗ（ｘ）＝ｗ１とし、直前の時間Ｓ１の間に顧客が動作を行っている場合は重みｗ（ｘ）＝ｗ２とする。ただし、重みｗ１と重みｗ２の大小関係は、ｗ１＞ｗ２である。 Within the sliding window 81, let F(x) be a time at which the customer service representative performed an action, and let w(x) be the weight of the action at time F(x). The weight w(x) is determined based on the customer's actions during the time S1 immediately before time F(x). The time S1 is, for example, about 1 to 2 seconds. If the customer performed no action during the immediately preceding time S1, the weight is w(x) = w1; if the customer performed an action during that time, the weight is w(x) = w2. Here, the weights satisfy w1 > w2.

また、スライディングウィンドウ８１に属する各動作を、顧客の同期の有無に応じて集合ｒ１，ｒ２に分類する。時刻Ｆ（ｘ）の直後の時間Ｓ２の間に顧客が同じ種類の動作を行っている場合、すなわち、同期ありの場合、接客担当者の動作は集合ｒ１に分類される。直後の時間Ｓ２の間に顧客が同じ種類の動作を行っていない場合、すなわち、同期なしの場合、接客担当者の動作は集合ｒ２に分類される。時間Ｓ２は、例えば、１秒から２秒程度であり時間Ｓ１と同じでもよい。また、下記で使用する係数ａの値を予め決めておく。係数ａの値は１未満の実数（ａ＜１）であり、負の値であってもよい。 Further, each operation belonging to the sliding window 81 is classified into sets r1 and r2 depending on whether or not the customer is synchronized. When the customer is performing the same type of operation during the time S2 immediately after the time F (x), that is, when there is synchronization, the operation of the customer service representative is classified into the set r1. When the customer does not perform the same type of action during the time S2 immediately after, that is, when there is no synchronization, the action of the customer service representative is classified into the set r2. The time S2 is, for example, about 1 second to 2 seconds and may be the same as the time S1. Further, the value of the coefficient a used below is determined in advance. The value of the coefficient a is a real number less than 1 (a <1), and may be a negative value.

このようなスライディングウィンドウ８１から、例えば、中心時刻のシーン評価値Ｖｔは数式（１）のように算出される。すなわち、集合ｒ１に属する動作の重みと、集合ｒ２に属する動作の重みに係数ａを乗じたものについての平均値が、シーン評価値となる。係数ａの値は１未満であるため、スライディングウィンドウ８１に属する接客担当者の動作のうち、顧客の動作と同期しているものの割合が高いほど、シーン評価値は大きくなる。 From such a sliding window 81, for example, the scene evaluation value Vt at the central time is calculated as in Expression (1). That is, the scene evaluation value is the average value of the weights of the motions belonging to the set r1 and the weights of the motions belonging to the set r2 multiplied by the coefficient a. Since the value of the coefficient a is less than 1, the scene evaluation value increases as the proportion of the actions of the customer service personnel who belong to the sliding window 81 that are synchronized with the action of the customer increases.
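The prose description of Expression (1) can be turned into a short calculation. The sketch below is an interpretation under stated assumptions, since the published formula is not reproduced in this text: it divides by the number of representative actions in the window, returns 0.0 for a window with no actions, and uses illustrative values for S0, S1, S2, w1, w2, and a. The function name `scene_value` and the tuple layout are hypothetical.

```python
def scene_value(center, staff, customer,
                s0=60.0, s1=2.0, s2=2.0, w1=1.0, w2=0.5, a=0.0):
    """Scene evaluation value at one sliding-window centre time.

    staff, customer: lists of (time_in_seconds, action_kind) tuples.
    """
    total, count = 0.0, 0
    for t, kind in staff:
        if not (center - s0 <= t <= center + s0):
            continue  # outside sliding window 81
        # Weight: w1 if the customer was idle during S1 before F(x), else w2.
        led = any(t - s1 <= ct < t for ct, _ in customer)
        w = w2 if led else w1
        # Set r1 (synchronized) if the customer repeats the action within S2.
        followed = any(t < ct <= t + s2 and ck == kind for ct, ck in customer)
        total += w if followed else a * w  # r2 members are scaled by a
        count += 1
    return total / count if count else 0.0  # empty window: assumed 0.0


# Illustrative data: three staff actions, one of them followed by the customer.
staff = [(10.0, "laugh"), (30.0, "laugh"), (50.0, "nod")]
customer = [(11.0, "laugh"), (49.0, "nod")]
v_default = scene_value(30.0, staff, customer)
v_negative = scene_value(30.0, staff, customer, a=-1.0)
```

With a negative coefficient a, unsynchronized actions pull the value below zero, which makes windows with many ignored actions stand out more clearly than with a = 0.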

ただし、接客担当者の動作と顧客の動作とが同期している場合に、接客担当者の動作から顧客の動作までの遅延時間を更に考慮してシーン評価値を算出することも可能である。時刻Ｆ（ｘ）からの遅延時間をＥ（ｘ）とし、係数ｂの値を予め決めておく。係数ｂの値は正の実数である。この場合、例えば、中心時刻のシーン評価値Ｖｔは数式（２）のように算出される。数式（２）では、集合ｒ１に属する動作の重みが遅延時間Ｅ（ｘ）と係数ｂによって補正される。スライディングウィンドウ８１において、接客担当者の動作から顧客の動作までの遅延時間が短いほど、シーン評価値は大きくなる。 However, when the action of the customer service representative and the action of the customer are synchronized, the scene evaluation value can be calculated by further considering the delay time from the action of the customer service representative to the action of the customer. The delay time from time F (x) is E (x), and the value of the coefficient b is predetermined. The value of the coefficient b is a positive real number. In this case, for example, the scene evaluation value Vt at the central time is calculated as in Expression (2). In Expression (2), the weight of the operation belonging to the set r1 is corrected by the delay time E (x) and the coefficient b. In the sliding window 81, the shorter the delay time from the operation of the customer service person to the operation of the customer, the larger the scene evaluation value.

また、接客担当者の動作と顧客の動作とが同期している場合に、顧客の動作の大きさを更に考慮してシーン評価値を算出することも可能である。接客担当者の動作と同期する顧客の動作の大きさをＤ（ｘ）とし、係数ｃの値を予め決めておく。大きさＤ（ｘ）は、頭部の位置の変化量や腕の移動量など、画像データから認識される単位時間当たりの位置の変化量を示し、変化量が大きいほどＤ（ｘ）も大きい値をとる。係数ｃの値は正の実数である。この場合、例えば、中心時刻のシーン評価値Ｖｔは数式（３）のように算出される。数式（３）では、集合ｒ１に属する動作の重みが大きさＤ（ｘ）と係数ｃによって補正される。スライディングウィンドウ８１において、接客担当者の動作と同期する顧客の動作が大きいほど、シーン評価値は大きくなる。なお、数式（３）では遅延時間Ｅ（ｘ）も考慮しているが、遅延時間Ｅ（ｘ）を考慮しないようにしてもよい。 When the customer service representative's action and the customer's action are synchronized, the scene evaluation value can also be calculated by further taking the magnitude of the customer's action into account. Let D(x) be the magnitude of the customer's action that is synchronized with the representative's action, and let the value of a coefficient c be determined in advance. The magnitude D(x) indicates the amount of positional change per unit time recognized from the image data, such as the change in head position or the movement of an arm; the larger the change, the larger the value of D(x). The value of the coefficient c is a positive real number. In this case, for example, the scene evaluation value Vt at the center time is calculated as in Expression (3). In Expression (3), the weight of each action belonging to the set r1 is corrected by the magnitude D(x) and the coefficient c. Within the sliding window 81, the larger the customer's action that is synchronized with the representative's action, the larger the scene evaluation value. Although Expression (3) also takes the delay time E(x) into account, the delay time E(x) may be left out.
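Expressions (1) to (3) are not reproduced in this text, so the block below is a reconstruction rather than a copy of the published formulas. The first line follows the prose description of Expression (1) directly, assuming the average is taken over all representative actions in the window. The second and third lines show only one possible form of the corrections: the text states that a shorter delay E(x) and a larger magnitude D(x) increase the value, but not the functional form, so the factors 1/(1 + b E(x)) and (1 + c D(x)) are assumptions chosen to have that behaviour for positive weights.

```latex
% (1) reconstructed from the prose: average over r1 and r2, with r2 scaled by a
V_t = \frac{1}{|r_1| + |r_2|}
      \left( \sum_{x \in r_1} w(x) + a \sum_{x \in r_2} w(x) \right)

% (2) ASSUMED form: r1 weights decrease as the delay E(x) grows (b > 0)
V_t = \frac{1}{|r_1| + |r_2|}
      \left( \sum_{x \in r_1} \frac{w(x)}{1 + b\,E(x)} + a \sum_{x \in r_2} w(x) \right)

% (3) ASSUMED form: r1 weights also increase with the magnitude D(x) (c > 0)
V_t = \frac{1}{|r_1| + |r_2|}
      \left( \sum_{x \in r_1} \frac{w(x)\,\bigl(1 + c\,D(x)\bigr)}{1 + b\,E(x)}
             + a \sum_{x \in r_2} w(x) \right)
```

Setting b = 0 in the third form gives the variant that ignores the delay time, and setting b = c = 0 reduces both corrected forms to the first.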

キーワード評価値の算出では、会話分析装置１００は、キーワードが発せられた時刻Ｇ（ｙ）を中心にして時間Ｓ３前から時間Ｓ３後までの区間（前後の時間Ｓ３の区間）をウィンドウ８２として設定する。時間Ｓ３は、例えば、２分から数分程度とする。一区切りの会話（例えば、一人の顧客に対する接客が始まってから終了するまでの一連の会話）の中で、同じキーワードが複数回発せられることがある。ここでは、１つのキーワードに着目し、一区切りの会話の中で当該キーワードがＹ＋１回発せられたとする。ウィンドウ８２は、ｙ＝０，１，…，Ｙそれぞれに対して設定される。 In calculating the keyword evaluation value, the conversation analysis device 100 sets, as a window 82, the section from the time S3 before to the time S3 after the time G(y) at which the keyword was uttered (the section extending S3 on either side). The time S3 is, for example, about 2 minutes to several minutes. The same keyword may be uttered multiple times within one conversation segment (for example, the series of exchanges from the start to the end of service for one customer). Here, focusing on one keyword, suppose that the keyword is uttered Y + 1 times within one conversation segment. A window 82 is set for each of y = 0, 1, ..., Y.

着目するキーワードの１回の出現に対して、ウィンドウ８２の範囲内にあるシーン評価値の平均値をＨ（ｙ）とする。すると、例えば、一区切りの会話における当該キーワードのキーワード評価値Ｖｋは数式（４）のように算出される。すなわち、キーワードが発せられた各時刻の周辺のキーワード評価値の平均値を、当該キーワードの複数回の出現の間で平均化したものが、当該キーワードのキーワード評価値となる。 For one occurrence of the keyword of interest, let H(y) be the average of the scene evaluation values that fall within the window 82. Then, for example, the keyword evaluation value Vk of the keyword in one conversation segment is calculated as in Expression (4). That is, the average of the scene evaluation values around each time at which the keyword was uttered is further averaged over the multiple occurrences of the keyword, and the result is the keyword evaluation value of that keyword.
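Read literally, the description of Expression (4) is the mean of H(y) over the Y + 1 occurrences, Vk = (H(0) + ... + H(Y)) / (Y + 1). The sketch below follows that reading; it is not taken from the published formula. The function name `keyword_value`, the (time, value) tuple layout, and the choice to skip an occurrence whose window contains no scene evaluation values are assumptions.

```python
def keyword_value(occurrences, scene_values, s3=120.0):
    """Keyword evaluation value Vk for one keyword in one conversation segment.

    occurrences:  utterance times G(y) of the keyword, in seconds.
    scene_values: list of (time_in_seconds, scene_evaluation_value) pairs.
    """
    per_occurrence = []
    for g in occurrences:
        # Scene evaluation values inside window 82 around this utterance.
        in_window = [v for t, v in scene_values if g - s3 <= t <= g + s3]
        if in_window:
            per_occurrence.append(sum(in_window) / len(in_window))  # H(y)
    if not per_occurrence:
        return None  # no usable window: assumed to yield no value
    return sum(per_occurrence) / len(per_occurrence)


# Illustrative scene evaluation values sampled once per minute.
scenes = [(0.0, 0.2), (60.0, 0.4), (120.0, 0.6), (180.0, 0.8), (240.0, 1.0)]
vk = keyword_value([60.0, 180.0], scenes, s3=60.0)
```

In this example the two windows average to 0.4 and 0.8, so the keyword as a whole scores 0.6; an utterance during a dull stretch is diluted by utterances during lively ones.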

キーワード評価値が閾値Ｔ１を超える場合、当該キーワードは重要キーワードである。また、キーワード評価値が閾値Ｔ２未満である場合、当該キーワードは重要キーワードである。会話分析装置１００は、抽出した重要キーワードを管理装置４１に送信する。管理装置４１は、会話分析装置１００から受信した重要キーワードの全部または一部をディスプレイに表示する。例えば、管理装置４１は、受信した重要キーワードのうちキーワード評価値が大きい方からＮ個（上位Ｎ件）の重要キーワードを表示する。また、管理装置４１は、受信した重要キーワードのうちキーワード評価値が小さい方からＮ個（下位Ｎ件）の重要キーワードを表示する。Ｎは予め決めておく１以上の整数である。ただし、会話分析装置１００が、重要キーワードを上位Ｎ件と下位Ｎ件に絞り込んでもよい。 When the keyword evaluation value exceeds the threshold value T1, the keyword is an important keyword. When the keyword evaluation value is less than the threshold value T2, the keyword is an important keyword. The conversation analysis device 100 transmits the extracted important keyword to the management device 41. The management device 41 displays all or some of the important keywords received from the conversation analysis device 100 on the display. For example, the management device 41 displays N (top N) important keywords from the received important keywords with the largest keyword evaluation value. The management device 41 also displays N (lower N) important keywords from the received important keywords with the smallest keyword evaluation value. N is a predetermined integer of 1 or more. However, the conversation analysis device 100 may narrow down the important keywords to upper N cases and lower N cases.
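The two-threshold selection and the top-N and bottom-N narrowing can be sketched as follows. The function name `select_important`, the threshold values, and the sample scores are illustrative assumptions; only the rules themselves (value above T1, value below T2, T1 > T2, N at least 1) come from the text.

```python
def select_important(values, t1, t2, n):
    """Split keywords into favorable and cautionary lists, N entries each.

    values: dict mapping keyword to keyword evaluation value.
    Returns (favorable, cautionary) as lists of (keyword, value) pairs.
    """
    if not t1 > t2:
        raise ValueError("thresholds must satisfy T1 > T2")
    if n < 1:
        raise ValueError("N must be an integer of 1 or more")
    items = list(values.items())
    # Favorable keywords: above T1, largest value first.
    favorable = sorted((kv for kv in items if kv[1] > t1),
                       key=lambda kv: kv[1], reverse=True)
    # Cautionary keywords: below T2, smallest value first.
    cautionary = sorted((kv for kv in items if kv[1] < t2),
                        key=lambda kv: kv[1])
    return favorable[:n], cautionary[:n]


# Illustrative scores for the keywords used in the scenes above.
scores = {"response": 0.9, "new function": 0.1, "image quality": 0.2, "price": 0.5}
best, worst = select_important(scores, t1=0.7, t2=0.3, n=1)
```

A keyword whose value lies between T2 and T1, such as "price" here, appears in neither list, which is what keeps the report limited to keywords with a clear positive or negative signal.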

次に、会話分析装置１００の機能および処理手順について説明する。 Next, the functions and processing procedure of the conversation analysis device 100 will be described.
図６は、第２の実施の形態の会話分析装置の機能例を示すブロック図である。 FIG. 6 is a block diagram showing a functional example of the conversation analysis device according to the second embodiment.
会話分析装置１００は、音声記憶部１２１、画像記憶部１２２、キーワード記憶部１２３および評価結果記憶部１２４を有する。これらの記憶部は、例えば、ＲＡＭ１０２またはＨＤＤ１０３の記憶領域を用いて実現される。また、会話分析装置１００は、キーワード検出部１２５、動作検出部１２６、シーン評価部１２７およびキーワード評価部１２８を有する。これらの処理部は、例えば、プログラムを用いて実現される。 The conversation analysis device 100 includes a voice storage unit 121, an image storage unit 122, a keyword storage unit 123, and an evaluation result storage unit 124. These storage units are realized by using, for example, a storage area of the RAM 102 or the HDD 103. The conversation analysis device 100 also includes a keyword detection unit 125, a motion detection unit 126, a scene evaluation unit 127, and a keyword evaluation unit 128. These processing units are realized by using, for example, a program.

音声記憶部１２１は、カメラ装置５０から受信した音声信号を含む音声データを記憶する。画像記憶部１２２は、カメラ装置５０から受信した画像信号を含む画像データを記憶する。キーワード記憶部１２３は、検索対象キーワードを記憶する。検索対象キーワードは予め指定されている。管理者が検索対象キーワードを追加または削除できるようにしてもよい。評価結果記憶部１２４は、キーワードの評価結果を記憶する。評価結果は、音声データから抽出された重要キーワードとその順位とを含む。 The voice storage unit 121 stores voice data including a voice signal received from the camera device 50. The image storage unit 122 stores image data including the image signal received from the camera device 50. The keyword storage unit 123 stores a search target keyword. The search target keyword is designated in advance. The administrator may be allowed to add or delete the search target keyword. The evaluation result storage unit 124 stores the evaluation result of the keyword. The evaluation result includes the important keywords extracted from the voice data and their ranks.

キーワード検出部１２５は、音声記憶部１２１に記憶された音声データを、音声認識により単語列に変換する。キーワード検出部１２５は、キーワード記憶部１２３に記憶された検索対象キーワードを単語列の中から検出する。キーワードの検出結果は、検出したキーワードと当該キーワードが出現する時刻とを含む。 The keyword detection unit 125 converts the voice data stored in the voice storage unit 121 into a word string by voice recognition. The keyword detection unit 125 detects the search target keyword stored in the keyword storage unit 123 from the word string. The keyword detection result includes the detected keyword and the time at which the keyword appears.

動作検出部１２６は、画像記憶部１２２に記憶された画像データに含まれる各フレームから、画像認識により顧客と接客担当者を認識する。動作検出部１２６は、各フレームから接客担当者が写った領域の特徴情報を抽出し、フレーム間の特徴情報の変化に基づいて接客担当者の動作を検出する。また、動作検出部１２６は、各フレームから顧客が写った領域の特徴情報を抽出し、フレーム間の特徴情報の変化に基づいて顧客の動作を検出する。動作の検出結果は、時刻と動作主体と動作の種類と動作の大きさを含む。 The motion detection unit 126 recognizes a customer and a person in charge of customer service by image recognition from each frame included in the image data stored in the image storage unit 122. The motion detection unit 126 extracts the characteristic information of the area where the person in charge of the customer is reflected from each frame, and detects the motion of the person in charge of the customer based on the change in the characteristic information between the frames. In addition, the motion detection unit 126 extracts the characteristic information of the area where the customer appears from each frame, and detects the motion of the customer based on the change in the characteristic information between the frames. The motion detection result includes time, motion subject, motion type, and motion size.

シーン評価部１２７は、動作検出部１２６による動作の検出結果を用いて、時間Δｔ間隔でシーン評価値を算出する。前述のように、シーン評価部１２７は、接客担当者の動作時刻を基準にして、その直前の時間Ｓ１の間に顧客の動作が生じているか否か、および、その直後の時間Ｓ２の間に同じ種類の顧客の動作が生じているか否かを判定する。シーン評価部１２７は、このような接客担当者の動作と顧客の動作の間の同期状況に基づいてシーン評価値を算出する。同期状況の評価では、接客担当者の動作から顧客の動作までの遅延時間や、顧客の動作の大きさを更に考慮してもよい。シーン評価結果は、複数の時刻と当該複数の時刻に対応する複数のシーン評価値とを含む。 The scene evaluation unit 127 uses the detection result of the motion by the motion detection unit 126 to calculate the scene evaluation value at time Δt intervals. As described above, the scene evaluation unit 127 determines, based on the operation time of the customer service representative, whether or not the customer's operation is occurring during the time S1 immediately before, and during the time S2 immediately thereafter. It is determined whether the same type of customer action is occurring. The scene evaluation unit 127 calculates a scene evaluation value based on the synchronization status between the operation of the customer service representative and the operation of the customer. In the evaluation of the synchronization status, the delay time from the action of the customer service person to the action of the customer and the magnitude of the action of the customer may be further considered. The scene evaluation result includes a plurality of times and a plurality of scene evaluation values corresponding to the plurality of times.

キーワード評価部１２８は、キーワード検出部１２５によるキーワードの検出結果とシーン評価部１２７によるシーン評価結果を用いて、検出されたキーワードそれぞれのキーワード評価値を算出する。前述のように、キーワード評価部１２８は、キーワード毎に当該キーワードの１回以上の出現時刻を抽出し、出現時刻毎に周辺時刻のシーン評価値を平均化し、１回以上の出現時刻の間で更に平均化してキーワード評価値とする。 The keyword evaluation unit 128 calculates a keyword evaluation value for each detected keyword, using the keyword detection results from the keyword detection unit 125 and the scene evaluation results from the scene evaluation unit 127. As described above, for each keyword the keyword evaluation unit 128 extracts the one or more times at which the keyword appeared, averages the scene evaluation values around each appearance time, and then averages those per-appearance values across all appearance times to obtain the keyword evaluation value.

キーワード評価部１２８は、キーワード評価値が閾値Ｔ１を超えるキーワードと閾値Ｔ２未満のキーワードを重要キーワードとして抽出する。そして、キーワード評価部１２８は、抽出した重要キーワードとキーワード評価値によって決まる順位（ベスト１、ベスト２、ワースト１、ワースト２など）を評価結果として評価結果記憶部１２４に格納する。キーワード評価部１２８は、評価結果を管理装置４１に送信する。 The keyword evaluation unit 128 extracts, as important keywords, the keywords whose keyword evaluation value exceeds the threshold T1 and the keywords whose keyword evaluation value is below the threshold T2. The keyword evaluation unit 128 then stores the extracted important keywords, together with their ranks as determined by the keyword evaluation values (best 1, best 2, worst 1, worst 2, and so on), in the evaluation result storage unit 124 as the evaluation result. The keyword evaluation unit 128 transmits the evaluation result to the management device 41.

図７は、会話分析装置が保持するテーブルの例を示す第１の図である。 FIG. 7 is a first diagram showing examples of tables held by the conversation analysis device.
キーワードテーブル１３１は、キーワード記憶部１２３に記憶される。キーワードテーブル１３１には、検索対象キーワードとして指定されたキーワードの文字列が登録される。音声データが示す発話の中から、キーワードテーブル１３１に登録されたキーワードのみが抽出され、それ以外の単語は抽出されない。 The keyword table 131 is stored in the keyword storage unit 123. In the keyword table 131, the character string of the keyword designated as the search target keyword is registered. Only the keywords registered in the keyword table 131 are extracted from the utterance indicated by the voice data, and the other words are not extracted.

キーワード検出テーブル１３２は、キーワード検出部１２５によって生成される。キーワード検出テーブル１３２は、ＲＡＭ１０２またはＨＤＤ１０３に保存されてもよい。キーワード検出テーブル１３２は、時刻およびキーワードの項目を含む。時刻の項目には、キーワードテーブル１３１に登録された何れかのキーワードが発せられた時刻が登録される。キーワードの項目には、当該発せられたキーワードが登録される。なお、顧客による発話と接客担当者による発話とを区別して認識する場合、キーワード検出テーブル１３２は、話者を示す項目を更に含んでもよい。話者は顧客または接客担当者である。 The keyword detection table 132 is generated by the keyword detection unit 125. The keyword detection table 132 may be stored in the RAM 102 or the HDD 103. The keyword detection table 132 includes items of time and keyword. In the item of time, the time when any of the keywords registered in the keyword table 131 is issued is registered. The issued keyword is registered in the keyword field. In the case of distinguishing between the utterance by the customer and the utterance by the customer service representative, the keyword detection table 132 may further include an item indicating the speaker. The speaker is a customer or a contact person.

動作検出テーブル１３３は、動作検出部１２６によって生成される。動作検出テーブル１３３は、ＲＡＭ１０２またはＨＤＤ１０３に保存されてもよい。動作検出テーブル１３３は、時刻、動作主体、種類および大きさの項目を含む。時刻の項目には、動作が行われた時刻が登録される。動作主体の項目には、動作を行った主体として「顧客」または「接客担当者」が登録される。種類の項目には、「笑う」や「うなずく」などの動作の種類が登録される。大きさの項目には、動作の大きさを示す数値が登録される。 The motion detection table 133 is generated by the motion detection unit 126. The motion detection table 133 may be stored in the RAM 102 or the HDD 103. The motion detection table 133 includes items of time, motion subject, type, and size. The time of operation is registered in the item of time. In the item of the action subject, “customer” or “contact person in charge” is registered as the subject who performed the action. In the type item, the type of action such as “laugh” or “nod” is registered. In the size item, a numerical value indicating the size of the motion is registered.

図８は、会話分析装置が保持するテーブルの例を示す第２の図である。 FIG. 8 is a second diagram showing examples of tables held by the conversation analysis device.
シーン評価テーブル１３４は、シーン評価部１２７によって生成される。シーン評価テーブル１３４は、ＲＡＭ１０２またはＨＤＤ１０３に保存されてもよい。シーン評価テーブル１３４は、時刻および評価値の項目を含む。時刻の項目には、会話の盛り上がりの程度が評価された時刻、すなわち、顧客の共感度が評価された時刻が登録される。評価値の項目には、算出されたシーン評価値が登録される。 The scene evaluation table 134 is generated by the scene evaluation unit 127. The scene evaluation table 134 may be stored in the RAM 102 or the HDD 103. The scene evaluation table 134 includes items of time and evaluation value. In the time item, the time at which the liveliness of the conversation was evaluated, that is, the time at which the customer's degree of empathy was evaluated, is registered. The calculated scene evaluation value is registered in the evaluation value item.

キーワード評価テーブル１３５は、キーワード評価部１２８によって生成される。キーワード評価テーブル１３５は、ＲＡＭ１０２またはＨＤＤ１０３に保存されてもよい。キーワード評価テーブル１３５は、キーワードおよび評価値の項目を含む。キーワードの項目には、キーワード検出テーブル１３２に出現するキーワードが登録される。評価値の項目には、算出されたキーワード評価値が登録される。 The keyword evaluation table 135 is generated by the keyword evaluation unit 128. The keyword evaluation table 135 may be stored in the RAM 102 or the HDD 103. The keyword evaluation table 135 includes keywords and evaluation value items. Keywords appearing in the keyword detection table 132 are registered in the keyword field. The calculated keyword evaluation value is registered in the evaluation value item.

重要キーワードテーブル１３６は、評価結果記憶部１２４に記憶される。重要キーワードテーブル１３６は、順位およびキーワードの項目を含む。順位の項目には、ベスト１、ベスト２、ワースト１、ワースト２など、キーワード評価値によって決まる重要キーワードの順位が登録される。キーワードの項目には、キーワード評価テーブル１３５に登録されたキーワードのうちキーワード評価値に基づいて選択された重要キーワードが登録される。重要キーワードテーブル１３６の内容が管理装置４１に送信される。 The important keyword table 136 is stored in the evaluation result storage unit 124. The important keyword table 136 includes items of ranking and keywords. In the order item, the order of important keywords such as best 1, best 2, worst 1, worst 2, which are determined by the keyword evaluation value, is registered. An important keyword selected based on the keyword evaluation value among the keywords registered in the keyword evaluation table 135 is registered in the keyword field. The contents of the important keyword table 136 are transmitted to the management device 41.
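
The five generated tables described above can be pictured as simple record types. The following is only a minimal sketch in Python: the class names, field names, the use of seconds as the time unit, and the labels "customer" and "staff" are illustrative assumptions, not definitions taken from the specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KeywordDetection:
    """One row of the keyword detection table 132."""
    time: float                    # assumed: seconds from the start of the conversation
    keyword: str
    speaker: Optional[str] = None  # optional item: "customer" or "staff"


@dataclass
class MotionDetection:
    """One row of the motion detection table 133."""
    time: float
    subject: str      # "customer" or "staff" (customer service representative)
    kind: str         # e.g. "laugh", "nod"
    magnitude: float  # numeric size of the motion


@dataclass
class SceneEvaluation:
    """One row of the scene evaluation table 134."""
    time: float
    value: float


@dataclass
class KeywordEvaluation:
    """One row of the keyword evaluation table 135."""
    keyword: str
    value: float


@dataclass
class ImportantKeyword:
    """One row of the important keyword table 136."""
    rank: str  # e.g. "best 1", "worst 1"
    keyword: str
```

The keyword table 131 itself is just a list of search target strings, so it needs no record type in this sketch.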

図９は、第２の実施の形態の会話分析の手順例を示すフローチャートである。 FIG. 9 is a flowchart showing a procedure example of conversation analysis according to the second embodiment.
（Ｓ１０）動作検出部１２６は、画像記憶部１２２から画像データを読み出す。読み出す画像データは、処理済みの画像データの次の一定時間分の画像データである。また、キーワード検出部１２５は、音声記憶部１２１から音声データを読み出す。読み出す音声データは、処理済みの音声データの次の一定時間分の音声データである。 (S10) The motion detection unit 126 reads image data from the image storage unit 122. The image data read is the fixed-duration portion that follows the image data already processed. In addition, the keyword detection unit 125 reads voice data from the voice storage unit 121. The voice data read is the fixed-duration portion that follows the voice data already processed.

（Ｓ１１）キーワード検出部１２５は、ステップＳ１０で読み出した音声データを音声認識により単語列に変換する。キーワード検出部１２５は、変換した単語列から、キーワードテーブル１３１に登録された検索対象キーワードを検索し、検索されたキーワードおよび当該キーワードの出現時刻を示すキーワード検出テーブル１３２を生成する。 (S11) The keyword detection unit 125 converts the voice data read in step S10 into a word string by voice recognition. The keyword detection unit 125 searches the search target keyword registered in the keyword table 131 from the converted word string, and generates the keyword detection table 132 indicating the searched keyword and the appearance time of the keyword.
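
Step S11 can be sketched as a filter over the recognized word sequence. The sketch below assumes that speech recognition has already produced a list of `(time, word)` pairs; the function name `detect_keywords` and that input shape are hypothetical and are not specified in the source.

```python
def detect_keywords(words, keyword_table):
    """Return (time, keyword) rows for recognized words that are search targets.

    words: list of (time, word) pairs from speech recognition (assumed format).
    keyword_table: iterable of registered search target keywords (table 131).
    """
    targets = set(keyword_table)
    # Keep only registered keywords; all other words are discarded.
    return [(t, w) for t, w in words if w in targets]
```

For example, with `keyword_table = ["quick", "beautiful"]`, the words `[(1.0, "this"), (1.4, "is"), (1.8, "beautiful")]` would yield the single row `(1.8, "beautiful")`.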

（Ｓ１２）動作検出部１２６は、ステップＳ１０で読み出した画像データに含まれる各フレームから、画像認識により顧客が写った領域および接客担当者が写った領域を認識する。動作検出部１２６は、フレーム間の位置変化から顧客の動作の種類、動作時刻および動作の大きさを検出する。また、動作検出部１２６は、フレーム間の位置変化から接客担当者の動作の種類、動作時刻および動作の大きさを検出する。動作検出部１２６は、これらの検出した情報を含む動作検出テーブル１３３を生成する。 (S12) The operation detection unit 126 recognizes the area in which the customer is reflected and the area in which the customer service representative is reflected by image recognition from each frame included in the image data read in step S10. The motion detection unit 126 detects the type of motion, the time of motion, and the size of motion of the customer from the position change between frames. In addition, the motion detection unit 126 detects the type of motion, the time of motion, and the size of motion of the person in charge of customer service from the position change between frames. The motion detection unit 126 generates a motion detection table 133 including the detected information.

（Ｓ１３）シーン評価部１２７は、ステップＳ１２で生成された動作検出テーブル１３３から接客担当者の動作時刻を抽出する。 (S13) The scene evaluation unit 127 extracts the motion times of the customer service representative from the motion detection table 133 generated in step S12.
（Ｓ１４）シーン評価部１２７は、ステップＳ１３で抽出した接客担当者の動作時刻それぞれについて、直前の顧客動作を動作検出テーブル１３３から検索して顧客動作の有無を判定し、顧客動作の有無に応じた重みを決定する。具体的には、シーン評価部１２７は、直前の時間Ｓ１の間に顧客の動作がない場合は重みｗ１を選択し、直前の時間Ｓ１の間に顧客の動作がある場合は重みｗ１より小さい重みｗ２を選択する。 (S14) For each motion time of the customer service representative extracted in step S13, the scene evaluation unit 127 searches the motion detection table 133 for an immediately preceding customer motion, determines whether such a customer motion exists, and determines a weight accordingly. Specifically, the scene evaluation unit 127 selects the weight w1 when there is no customer motion during the immediately preceding time S1, and selects a weight w2 smaller than w1 when there is a customer motion during the immediately preceding time S1.

（Ｓ１５）シーン評価部１２７は、ステップＳ１３で抽出した接客担当者の動作時刻それぞれについて、直後の顧客動作を動作検出テーブル１３３から検索して同じ種類の顧客動作による同期の有無を判定し、同期の有無に応じた係数を決定する。具体的には、同期がある場合、すなわち、直後の時間Ｓ２の間に同じ種類の顧客動作がある場合、シーン評価部１２７は係数＝１を選択する。一方、同期がない場合、すなわち、直後の時間Ｓ２の間に同じ種類の顧客動作がない場合、シーン評価部１２７は係数＝ａを選択する。これらの係数は重みに乗じる値であり、ａ＜１である。 (S15) For each motion time of the customer service representative extracted in step S13, the scene evaluation unit 127 searches the motion detection table 133 for an immediately following customer motion, determines whether synchronization by a customer motion of the same type exists, and determines a coefficient accordingly. Specifically, when there is synchronization, that is, when a customer motion of the same type occurs during the immediately following time S2, the scene evaluation unit 127 selects coefficient = 1. When there is no synchronization, that is, when no customer motion of the same type occurs during the immediately following time S2, the scene evaluation unit 127 selects coefficient = a. These coefficients are values by which the weight is multiplied, and a < 1.
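
Steps S13 to S15 can be illustrated with a short Python sketch. The function name `action_scores`, the tuple layout `(time, subject, kind)`, the default values of `w1`, `w2` and `a`, and the choice of which interval endpoints are inclusive are all assumptions made for illustration; the source only requires w2 < w1 and a < 1.

```python
def action_scores(motions, s1, s2, w1=1.0, w2=0.5, a=0.5):
    """Score each representative motion as weight * coefficient.

    motions: list of (time, subject, kind), subject is "staff" or "customer".
    s1, s2: lengths of the windows before and after the representative's motion.
    """
    staff = [(t, k) for t, subj, k in motions if subj == "staff"]        # S13
    customer = [(t, k) for t, subj, k in motions if subj == "customer"]
    scores = []
    for t, kind in staff:
        # S14: any customer motion in [t - s1, t) lowers the weight to w2.
        preceded = any(t - s1 <= ct < t for ct, _ in customer)
        weight = w2 if preceded else w1
        # S15: a same-type customer motion in (t, t + s2] counts as synchronization.
        synced = any(t < ct <= t + s2 and ck == kind for ct, ck in customer)
        coefficient = 1.0 if synced else a
        scores.append((t, weight * coefficient))
    return scores
```

Under these assumed defaults, a representative's laugh followed within S2 by a customer's laugh, with no customer motion just before it, receives the maximum score of 1.0.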

（Ｓ１６）シーン評価部１２７は、時間Ｓ０×２の時間幅をもつスライディングウィンドウを設定する。シーン評価部１２７は、スライディングウィンドウに属する接客担当者の動作に対して算出したステップＳ１４，Ｓ１５の重みおよび係数を用いて、スライディングウィンドウの中心時刻におけるシーン評価値を算出する。このシーン評価値は顧客の共感度を表している。シーン評価部１２７は、スライディングウィンドウを時間Δｔずつスライドさせることで、時間Δｔ間隔でシーン評価値を算出する。シーン評価部１２７は、複数の時刻それぞれのシーン評価値を示すシーン評価テーブル１３４を生成する。 (S16) The scene evaluation unit 127 sets a sliding window with a width of 2 × S0. Using the weights and coefficients determined in steps S14 and S15 for the representative's motions that fall within the sliding window, the scene evaluation unit 127 calculates the scene evaluation value at the center time of the window. This scene evaluation value represents the customer's degree of empathy. By sliding the window in steps of time Δt, the scene evaluation unit 127 calculates scene evaluation values at intervals of Δt. The scene evaluation unit 127 generates a scene evaluation table 134 indicating the scene evaluation value at each of the plurality of times.
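
The sliding window of step S16 can be sketched as follows. This passage does not state how the weights and coefficients inside a window are combined, so the sketch simply assumes that the scene evaluation value is the sum of weight × coefficient over the representative's motions in the window. The function name `scene_values` and the parameter `num_steps` are hypothetical.

```python
def scene_values(scores, s0, dt, num_steps):
    """Compute scene evaluation values at times 0, dt, 2*dt, ...

    scores: list of (time, weight * coefficient) per representative motion.
    s0: half-width of the sliding window (total width 2 * s0).
    dt: slide step; num_steps: number of window positions to evaluate.
    """
    table = []
    for i in range(num_steps):
        center = i * dt
        # Assumption: combine the motions inside [center - s0, center + s0] by summing.
        total = sum(v for t, v in scores if center - s0 <= t <= center + s0)
        table.append((center, total))
    return table
```

A different aggregation, such as a mean or a maximum, would fit the same loop; only the `total` line would change.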

（Ｓ１７）動作検出部１２６は、画像データが終了したか判断する。キーワード検出部１２５は、音声データが終了したか判断する。例えば、一人の顧客に対する接客が終了したときに画像データと音声データが終了する。画像データと音声データが終了した場合はステップＳ１８に進み、終了していない場合はステップＳ１０に進む。 (S17) The operation detection unit 126 determines whether the image data has ended. The keyword detection unit 125 determines whether the audio data has ended. For example, the image data and the audio data are finished when the customer service for one customer is finished. If the image data and the audio data have been completed, the process proceeds to step S18, and if not completed, the process proceeds to step S10.

（Ｓ１８）キーワード評価部１２８は、ステップＳ１１で生成されたキーワード検出テーブル１３２からキーワードの出現時刻を抽出する。キーワード評価部１２８は、キーワードの出現時刻それぞれについて、ステップＳ１６で生成されたシーン評価テーブル１３４から、当該出現時刻の直前の時間Ｓ３および直後の時間Ｓ３に属する周辺のシーン評価値を検索する。キーワード評価部１２８は、周辺のシーン評価値の平均値を算出する。 (S18) The keyword evaluation unit 128 extracts the keyword appearance time from the keyword detection table 132 generated in step S11. For each appearance time of the keyword, the keyword evaluation unit 128 searches the scene evaluation table 134 generated in step S16 for surrounding scene evaluation values belonging to the time S3 immediately before and the time S3 immediately after the appearance time. The keyword evaluation unit 128 calculates the average value of the surrounding scene evaluation values.

（Ｓ１９）キーワード評価部１２８は、ステップＳ１８で算出されたシーン評価値の平均値をキーワードの同一性に応じて分類する。キーワード評価部１２８は、キーワード毎にシーン評価値の平均値を更に平均化してキーワード評価値を算出する。キーワード評価部１２８は、キーワード評価値を示すキーワード評価テーブル１３５を生成する。 (S19) The keyword evaluation unit 128 classifies the average value of the scene evaluation values calculated in step S18 according to the identity of the keywords. The keyword evaluation unit 128 further averages the average value of the scene evaluation values for each keyword to calculate a keyword evaluation value. The keyword evaluation unit 128 generates a keyword evaluation table 135 indicating the keyword evaluation value.
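
The two-stage averaging of steps S18 and S19 can be sketched in Python as below. The function name `keyword_values`, the input layouts, and the decision to skip appearances that have no scene evaluation value nearby are illustrative assumptions.

```python
from collections import defaultdict


def keyword_values(detections, scene_table, s3):
    """Return {keyword: keyword evaluation value}.

    detections: list of (time, keyword) rows (keyword detection table 132).
    scene_table: list of (time, scene value) rows (scene evaluation table 134).
    s3: half-width of the neighborhood around each appearance time.
    """
    per_keyword = defaultdict(list)
    for t, keyword in detections:
        # S18: average the scene values within [t - s3, t + s3].
        nearby = [v for st, v in scene_table if t - s3 <= st <= t + s3]
        if nearby:  # assumption: ignore appearances with no nearby scene value
            per_keyword[keyword].append(sum(nearby) / len(nearby))
    # S19: average the per-appearance means for each keyword.
    return {kw: sum(means) / len(means) for kw, means in per_keyword.items()}
```

Because the second stage averages per-appearance means rather than pooling all scene values, each appearance of a keyword contributes equally regardless of how many scene samples surround it.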

（Ｓ２０）キーワード評価部１２８は、ステップＳ１９で生成されたキーワード評価テーブル１３５から、キーワード評価値が閾値Ｔ１を超えるキーワードおよびキーワード評価値が閾値Ｔ２未満のキーワードを重要キーワードとして抽出する。キーワード評価部１２８は、抽出した重要キーワードとその順位を示す重要キーワードテーブル１３６を生成して評価結果記憶部１２４に格納する。キーワード評価部１２８は、重要キーワードテーブル１３６の内容を管理装置４１に送信する。管理装置４１は、重要キーワードテーブル１３６の内容に基づいて、上位Ｎ件および下位Ｎ件の重要キーワードを表示する。 (S20) The keyword evaluation unit 128 extracts keywords having a keyword evaluation value exceeding the threshold T1 and keywords having a keyword evaluation value less than the threshold T2 as important keywords from the keyword evaluation table 135 generated in step S19. The keyword evaluation unit 128 generates an important keyword table 136 indicating the extracted important keywords and their ranks, and stores the important keyword table 136 in the evaluation result storage unit 124. The keyword evaluation unit 128 transmits the contents of the important keyword table 136 to the management device 41. The management device 41 displays the top N and bottom N important keywords based on the contents of the important keyword table 136.
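
Step S20 amounts to two threshold filters followed by ranking. The sketch below is one possible reading, assuming that "best" ranks are ordered by descending evaluation value and "worst" ranks by ascending value; the function name `important_keywords` and the rank label strings are hypothetical.

```python
def important_keywords(values, t1, t2, n=2):
    """Build rows of (rank, keyword) for the important keyword table.

    values: {keyword: keyword evaluation value}.
    t1, t2: upper and lower thresholds; n: number of keywords listed per side.
    """
    # Keywords strictly above T1, highest value first.
    best = sorted((kw for kw, v in values.items() if v > t1),
                  key=lambda kw: -values[kw])
    # Keywords strictly below T2, lowest value first.
    worst = sorted((kw for kw, v in values.items() if v < t2),
                   key=lambda kw: values[kw])
    table = [(f"best {i}", kw) for i, kw in enumerate(best[:n], start=1)]
    table += [(f"worst {i}", kw) for i, kw in enumerate(worst[:n], start=1)]
    return table
```

Keywords whose value lies between T2 and T1 are treated as unremarkable and do not appear in the table.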

第２の実施の形態の情報処理システムによれば、音声データからキーワードが検出され、画像データから顧客の動作と接客担当者の動作が検出される。接客担当者が先に動作を行い、その直後に顧客が同じ種類の動作を行ったという同期が検出され、動作の同期に基づいて顧客の共感度を示すシーン評価値が算出され、キーワードの周辺時刻のシーン評価値からキーワード評価値が算出される。そして、キーワード評価値が高い好ましいキーワードとキーワード評価値が低い要注意のキーワードが抽出されて管理者に報告される。 According to the information processing system of the second embodiment, keywords are detected from the voice data, and the motions of the customer and of the customer service representative are detected from the image data. Synchronization, in which the representative acts first and the customer performs the same type of motion immediately afterward, is detected; a scene evaluation value indicating the customer's degree of empathy is calculated based on this synchronization; and a keyword evaluation value is calculated from the scene evaluation values at times around each keyword. Preferable keywords with high keyword evaluation values and keywords requiring attention with low keyword evaluation values are then extracted and reported to the administrator.

これにより、顧客の心理状態に対してポジティブな影響を与えた可能性の高い重要キーワードと、顧客の心理状態に対してネガティブな影響を与えた可能性の高い重要キーワードとを推定でき、接客担当者の接客スキルの改善を支援することができる。また、接客担当者の動作と顧客の動作の同期状況からキーワードを評価するため、キーワードの出現回数から評価する方法などと比べて、顧客の心理状態を反映した重要キーワードを精度よく抽出することができる。また、接客担当者の動作の直前に顧客が動作を行っておらず、接客担当者の動作の直後に顧客が同じ種類の動作を行ったという条件を判定するため、接客を受ける顧客の心理状態を精度よく推定することができる。 This makes it possible to estimate important keywords that likely had a positive effect on the customer's psychological state and important keywords that likely had a negative effect on it, which supports improvement of the representative's customer service skills. In addition, because keywords are evaluated from the synchronization status between the representative's motions and the customer's motions, important keywords that reflect the customer's psychological state can be extracted more accurately than with methods that evaluate keywords by, for example, their number of appearances. Furthermore, because the condition checked is that the customer made no motion immediately before the representative's motion and made the same type of motion immediately after it, the psychological state of the customer being served can be estimated accurately.

［第３の実施の形態］ [Third Embodiment]
次に、第３の実施の形態を説明する。第２の実施の形態との違いを中心に説明し、第２の実施の形態と同様の内容については説明を省略することがある。第３の実施の形態の情報処理システムは、会話分析装置１００の配置が第２の実施の形態と異なる。 Next, a third embodiment will be described. The description will focus on the differences from the second embodiment, and the description of the same contents as those of the second embodiment may be omitted. The information processing system of the third embodiment is different from that of the second embodiment in the arrangement of the conversation analysis device 100.

図１０は、第３の実施の形態の情報処理システムの例を示す図である。 FIG. 10 is a diagram illustrating an example of the information processing system according to the third embodiment.
第３の実施の形態の情報処理システムは、第２の実施の形態と同様に、管理装置４１、カメラ装置５０および会話分析装置１００を含む。ただし、第３の実施の形態では、カメラ装置５０はネットワーク４０に接続されており、会話分析装置１００はネットワーク４０経由でカメラ装置５０と通信するサーバ装置として動作する。カメラ装置５０は、音声信号と画像信号をネットワーク４０経由で会話分析装置１００に送信する。第３の実施の形態の情報処理システムによれば、第２の実施の形態と同様の効果が得られる。 The information processing system according to the third embodiment includes a management device 41, a camera device 50, and a conversation analysis device 100, as in the second embodiment. However, in the third embodiment, the camera device 50 is connected to the network 40, and the conversation analysis device 100 operates as a server device that communicates with the camera device 50 via the network 40. The camera device 50 transmits the audio signal and the image signal to the conversation analysis device 100 via the network 40. According to the information processing system of the third embodiment, the same effect as that of the second embodiment can be obtained.

［第４の実施の形態］ [Fourth Embodiment]
次に、第４の実施の形態を説明する。第２の実施の形態との違いを中心に説明し、第２の実施の形態と同様の内容については説明を省略することがある。第４の実施の形態の情報処理システムは、接客担当者が遠隔で顧客を接客する業務に適用される。 Next, a fourth embodiment will be described. The description will focus on the differences from the second embodiment, and the description of the same contents as those of the second embodiment may be omitted. The information processing system according to the fourth embodiment is applied to a business where a customer service representative remotely serves a customer.

図１１は、第４の実施の形態の情報処理システムの例を示す図である。 FIG. 11 is a diagram illustrating an example of the information processing system according to the fourth embodiment.
第４の実施の形態の情報処理システムは、ユーザ装置４２、カメラ装置５０，６０および会話分析装置１００を含む。ユーザ装置４２にはカメラ装置６０が接続されている。会話分析装置１００にはカメラ装置５０が接続されている。ユーザ装置４２および会話分析装置１００はネットワーク４０に接続されている。 The information processing system according to the fourth embodiment includes a user device 42, camera devices 50 and 60, and a conversation analysis device 100. A camera device 60 is connected to the user device 42. A camera device 50 is connected to the conversation analysis device 100. The user device 42 and the conversation analysis device 100 are connected to the network 40.

ユーザ装置４２は、顧客の自宅など会話分析装置１００とは異なる場所に設置され、顧客が使用する端末装置である。会話分析装置１００は、オフィスなどに設置され、接客担当者が使用する端末装置である。カメラ装置５０，６０は、イメージセンサを用いた動画撮影機能およびマイクロフォンを用いた音声録音機能をもつデバイス装置である。カメラ装置５０は、接客担当者を撮影し接客担当者の発話を録音するよう設定されている。カメラ装置６０は、顧客を撮影し顧客の発話を録音するよう設置されている。 The user device 42 is a terminal device installed in a place different from the conversation analysis device 100, such as the customer's home, and used by the customer. The conversation analysis device 100 is a terminal device installed in an office or the like and used by a customer service representative. The camera devices 50 and 60 are devices having a video capture function using an image sensor and an audio recording function using a microphone. The camera device 50 is configured to capture video of the customer service representative and record the representative's speech. The camera device 60 is installed so as to capture video of the customer and record the customer's speech.

ユーザ装置４２は、顧客を撮影した画像データおよび顧客の音声を録音した音声データを収集し、ネットワーク４０を介して会話分析装置１００に送信する。また、ユーザ装置４２は、接客担当者を撮影した画像データおよび接客担当者の音声を録音した音声データを、ネットワーク４０を介して会話分析装置１００から受信する。ユーザ装置４２は、受信した画像データに基づいて接客担当者の映像をディスプレイに表示し、受信した音声データに基づいて接客担当者の発話をスピーカから再生する。 The user device 42 collects the image data of the image of the customer and the voice data of the voice of the customer, and transmits it to the conversation analysis device 100 via the network 40. Further, the user device 42 receives, from the conversation analysis device 100 via the network 40, the image data obtained by photographing the customer service representative and the voice data obtained by recording the voice of the customer service representative. The user device 42 displays the image of the customer service representative on the display based on the received image data, and reproduces the speech of the customer service representative from the speaker based on the received audio data.

会話分析装置１００は、接客担当者の画像データおよび接客担当者の音声データを収集し、ネットワーク４０を介してユーザ装置４２に送信する。また、会話分析装置１００は、顧客の画像データおよび顧客の音声データを、ネットワーク４０を介してユーザ装置４２から受信する。会話分析装置１００は、受信した画像データに基づいて顧客の映像をディスプレイに表示し、受信した音声データに基づいて顧客の発話をスピーカから再生する。これにより、顧客と接客担当者がテレビ会議方式で会話することができる。 The conversation analysis device 100 collects the image data of the customer service representative and the voice data of the customer service representative, and transmits it to the user device 42 via the network 40. Further, the conversation analysis device 100 receives the image data of the customer and the voice data of the customer from the user device 42 via the network 40. The conversation analysis device 100 displays the video of the customer on the display based on the received image data, and reproduces the utterance of the customer from the speaker based on the received voice data. As a result, the customer and the person in charge of hospitality can talk in a video conference system.

また、会話分析装置１００は、顧客の画像データ、接客担当者の画像データ、顧客の音声データおよび接客担当者の音声データに基づいて、第２の実施の形態と同様に顧客と接客担当者との間の会話を分析する。すなわち、会話分析装置１００は、音声データから顧客または接客担当者が発したキーワードを検出し、画像データから会話中の顧客の動作および接客担当者の動作を検出する。会話分析装置１００は、顧客と接客担当者の動作からキーワードを評価して重要キーワードを抽出する。会話分析装置１００は、抽出した重要キーワードのうち上位Ｎ件および下位Ｎ件の重要キーワードをディスプレイに表示する。 Further, based on the customer's image data, the customer service representative's image data, the customer's voice data, and the representative's voice data, the conversation analysis device 100 analyzes the conversation between the customer and the representative in the same way as in the second embodiment. That is, the conversation analysis device 100 detects keywords uttered by the customer or the representative from the voice data, and detects the motions of the customer and of the representative during the conversation from the image data. The conversation analysis device 100 evaluates the keywords based on the motions of the customer and the representative and extracts important keywords. The conversation analysis device 100 displays the top N and bottom N of the extracted important keywords on the display.

ただし、会話分析装置１００は、ネットワーク４０を介して管理装置４１に重要キーワードを送信するようにしてもよい。また、接客担当者が接客に使用する端末装置と会話分析装置１００とを分離するようにしてもよい。第４の実施の形態の情報処理システムによれば、遠隔での接客についても第２の実施の形態と同様の効果が得られる。 However, the conversation analysis device 100 may instead transmit the important keywords to the management device 41 via the network 40. The terminal device that the customer service representative uses for serving customers may also be separated from the conversation analysis device 100. According to the information processing system of the fourth embodiment, the same effect as that of the second embodiment can be obtained for remote customer service as well.

［第５の実施の形態］ [Fifth Embodiment]
次に、第５の実施の形態を説明する。第２の実施の形態との違いを中心に説明し、第２の実施の形態と同様の内容については説明を省略することがある。第２の実施の形態では、顧客の動作と接客担当者の動作の同期状況から顧客の共感度を推定し、顧客の共感度に応じてキーワードの重要度を評価した。これに対して第５の実施の形態では、顧客の動作と接客担当者の動作の同期状況から接客担当者の接客度を推定し、接客担当者の接客度に応じてキーワードの重要度を評価する。接客度は、接客の積極性や熱心さや丁寧さなどを含む接客姿勢を表している。第５の実施の形態で算出されるシーン評価値は接客担当者の接客度に対応し、第５の実施の形態で算出されるキーワード評価値は顧客または接客担当者がキーワードを発したときの接客度を反映している。よって、第５の実施の形態で抽出される重要キーワードは、良い接客との関連が大きいと推定されるキーワードや悪い接客との関連が大きいと推定されるキーワードである。第５の実施の形態で抽出される重要キーワードは、接客担当者の心理状態を反映していると言うこともできる。 Next, a fifth embodiment will be described. The description will focus on the differences from the second embodiment, and the description of the same contents as those of the second embodiment may be omitted. In the second embodiment, the customer's degree of empathy is estimated from the synchronization status between the customer's motions and the customer service representative's motions, and the importance of each keyword is evaluated according to the customer's degree of empathy. In the fifth embodiment, by contrast, the representative's degree of customer service is estimated from the synchronization status between the customer's motions and the representative's motions, and the importance of each keyword is evaluated according to that degree of customer service. The degree of customer service represents the representative's service attitude, including proactiveness, enthusiasm, and politeness. The scene evaluation value calculated in the fifth embodiment corresponds to the representative's degree of customer service, and the keyword evaluation value calculated in the fifth embodiment reflects the degree of customer service at the time the customer or the representative uttered the keyword. Therefore, the important keywords extracted in the fifth embodiment are keywords estimated to be strongly associated with good customer service and keywords estimated to be strongly associated with bad customer service. The important keywords extracted in the fifth embodiment can also be said to reflect the psychological state of the representative.

第５の実施の形態の情報処理システムは、図２，３，６〜８に示した第２の実施の形態の情報処理システムと同様の構成によって実現できる。そこで、以下では第５の実施の形態を、図２，３，６〜８と同様の符号を用いて説明することがある。なお、第５の実施の形態の情報処理システムを、図１０に示した第３の実施の形態の情報処理システムと同様のシステム構成とすることも可能であり、図１１に示した第４の実施の形態の情報処理システムと同様のシステム構成とすることも可能である。 The information processing system of the fifth embodiment can be realized with the same configuration as the information processing system of the second embodiment shown in FIGS. 2, 3, and 6 to 8. Therefore, in the following, the fifth embodiment may be described using the same reference numerals as those in FIGS. 2, 3, and 6 to 8. The information processing system of the fifth embodiment may also have the same system configuration as the information processing system of the third embodiment shown in FIG. 10, or the same system configuration as the information processing system of the fourth embodiment shown in FIG. 11.

図１２は、第５の実施の形態のキーワード抽出例を示す図である。 FIG. 12 is a diagram showing an example of keyword extraction according to the fifth embodiment.
第５の実施の形態では、顧客の動作と接客担当者の動作との間の同期を検出する。第５の実施の形態の同期は、顧客から動作を開始し、その直後に接客担当者が同じ種類の動作を行ったという動作の連鎖である。第５の実施の形態で検出する同期は、動作の順序が異なる点で第２の実施の形態の同期と異なる。この場合、接客担当者は顧客を注意深く見ており、顧客の話を注意深く聞いていると推定される。よって、このような同期が発生しているときに顧客または接客担当者が発したキーワードは、良い接客と関連のあるキーワードである可能性がある。一方、このような同期が発生していないときに顧客または接客担当者が発したキーワードは、悪い接客と関連のあるキーワードである可能性がある。 In the fifth embodiment, synchronization between the customer's motions and the customer service representative's motions is detected. Synchronization in the fifth embodiment is a chain of motions in which the customer acts first and the representative performs the same type of motion immediately afterward. It differs from the synchronization of the second embodiment in the order of the motions. When such synchronization occurs, it is presumed that the representative is watching the customer carefully and listening carefully to what the customer says. Therefore, a keyword uttered by the customer or the representative while such synchronization is occurring may be a keyword associated with good customer service. Conversely, a keyword uttered by the customer or the representative when no such synchronization is occurring may be a keyword associated with bad customer service.

例えば、以下に説明するシーン７４〜７６を考える。 For example, consider scenes 74 to 76 described below.
シーン７４では、接客担当者が笑うという動作を行い、その直後に顧客が笑うという動作を行っている。シーン７４は、図４のシーン７１に対応する。顧客の動作の直前の時間Ｓ１以内に接客担当者は動作を行っており、顧客の動作の直後の時間Ｓ２以内に接客担当者は動作を行っていない。シーン７４では、顧客から開始して顧客の動作と接客担当者の動作とが同期しているわけではなく、接客度が小さいと判定される。すると、シーン７４の周辺で顧客または接客担当者が発した「速い」というキーワードの評価値は低くなる。 In scene 74, the customer service representative laughs, and the customer laughs immediately afterward. Scene 74 corresponds to scene 71 of FIG. 4. The representative acted within the time S1 immediately before the customer's motion, and did not act within the time S2 immediately after the customer's motion. In scene 74, the customer's motion and the representative's motion are not synchronized in a chain that starts from the customer, so the degree of customer service is judged to be low. As a result, the evaluation value of the keyword “quick” uttered by the customer or the representative around scene 74 becomes low.

シーン７５では、顧客が笑うという動作を行ったものの、顧客の動作の直前の時間Ｓ１以内に接客担当者は動作を行っておらず、顧客の動作の直後の時間Ｓ２以内にも接客担当者は動作を行っていない。シーン７５では、顧客の動作と接客担当者の動作とが同期していないため、接客度が小さいと判定される。すると、シーン７５の周辺で顧客または接客担当者が発した「面白い」というキーワードの評価値は低くなる。 In the scene 75, although the customer laughed, the customer service representative did not operate within the time S1 immediately before the customer's operation, and the customer service representative did not perform within the time S2 immediately after the customer's operation. Not working. In the scene 75, the customer's action and the action of the person in charge of customer service are not synchronized, so it is determined that the degree of customer service is small. Then, the evaluation value of the keyword “interesting” issued by the customer or the person in charge of serving the customer around the scene 75 becomes low.

シーン７６では、顧客がうなずくという動作を行い、その直後に接客担当者がうなずくという動作を行っている。シーン７６は、図４のシーン７３に対応する。顧客の動作の直前の時間Ｓ１以内に接客担当者は動作を行っておらず、顧客の動作の直後の時間Ｓ２以内に接客担当者は同じ種類の動作を行っている。顧客から開始して顧客の動作と接客担当者の動作とが同期しているため、接客度が大きいと判定される。すると、シーン７６の周辺で顧客または接客担当者が発した「きれい」というキーワードの評価値は高くなる。 In the scene 76, the customer makes a nod operation, and immediately after that, the customer service person makes a nod operation. Scene 76 corresponds to scene 73 of FIG. The customer service person does not perform the operation within the time S1 immediately before the customer's operation, and the customer service person performs the same type of operation within the time S2 immediately after the customer operation. Since the operation of the customer and the operation of the person in charge of customer service are synchronized with each other starting from the customer, it is determined that the degree of customer service is high. Then, the evaluation value of the keyword "beautiful" issued by the customer or the person in charge of serving the customer around the scene 76 becomes high.

キーワード評価値が算出されると第２の実施の形態と同様に、会話分析装置１００は、キーワード評価値が閾値Ｔ１より大きいキーワードを好ましいキーワードと推定し、重要キーワードとして抽出する。また、会話分析装置１００は、キーワード評価値が閾値Ｔ２より小さいキーワードを要注意のキーワードと推定し、重要キーワードとして抽出する。上記の例では、「きれい」が重要キーワードとして抽出される可能性がある。 When the keyword evaluation value is calculated, as in the second embodiment, the conversation analysis device 100 estimates a keyword having a keyword evaluation value larger than the threshold value T1 as a preferable keyword and extracts it as an important keyword. Further, the conversation analysis device 100 estimates a keyword having a keyword evaluation value smaller than the threshold value T2 as a keyword requiring attention and extracts it as an important keyword. In the above example, “beautiful” may be extracted as an important keyword.
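
The customer-first synchronization of scenes 74 to 76 can be sketched by mirroring the second embodiment's check with the roles swapped. The sketch below assumes that the same weight and coefficient scheme (w1, w2 < w1, and a < 1) applies; the function name `service_scores`, the default parameter values, the tuple layout, and the times used in the example are hypothetical.

```python
def service_scores(motions, s1, s2, w1=1.0, w2=0.5, a=0.5):
    """Score each customer motion by how the representative responded.

    motions: list of (time, subject, kind), subject is "staff" or "customer".
    """
    customer = [(t, k) for t, subj, k in motions if subj == "customer"]
    staff = [(t, k) for t, subj, k in motions if subj == "staff"]
    scores = []
    for t, kind in customer:
        # A representative motion in [t - s1, t) means the customer did not start the chain.
        preceded = any(t - s1 <= st < t for st, _ in staff)
        weight = w2 if preceded else w1
        # A same-type representative motion in (t, t + s2] counts as synchronization.
        synced = any(t < st <= t + s2 and sk == kind for st, sk in staff)
        coefficient = 1.0 if synced else a
        scores.append((t, weight * coefficient))
    return scores


# Illustrative timeline: scene 74 near t=10, scene 75 near t=30, scene 76 near t=50.
timeline = [
    (10.0, "staff", "laugh"), (11.0, "customer", "laugh"),  # scene 74
    (30.0, "customer", "laugh"),                            # scene 75
    (50.0, "customer", "nod"), (51.0, "staff", "nod"),      # scene 76
]
```

With these assumed parameters, `service_scores(timeline, s1=2.0, s2=2.0)` gives low scores for the customer motions in scenes 74 and 75 and the highest score for scene 76, matching the qualitative outcome described above.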

FIG. 13 is a flowchart showing an example procedure of conversation analysis according to the fifth embodiment.
(S30) The motion detection unit 126 reads image data from the image storage unit 122. The keyword detection unit 125 reads voice data from the voice storage unit 121.

(S31) The keyword detection unit 125 converts the voice data read in step S30 into a word string by speech recognition. The keyword detection unit 125 searches the converted word string for the search target keywords registered in the keyword table 131, and generates the keyword detection table 132, which indicates each keyword found and its appearance time.
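The keyword search of step S31 can be sketched as follows. This is a minimal illustration, assuming that speech recognition has already produced a list of (word, time) pairs; the names `KeywordHit` and `detect_keywords` and the sample words are hypothetical and do not appear in the specification.

```python
from typing import NamedTuple


class KeywordHit(NamedTuple):
    """One row of a keyword detection table (hypothetical layout)."""
    keyword: str
    time: float  # appearance time in seconds


def detect_keywords(words, search_keywords):
    """Return a hit for every recognized word that is a search target keyword.

    words: iterable of (word, time) pairs, assumed to come from a recognizer.
    search_keywords: iterable of registered search target keywords.
    """
    targets = set(search_keywords)
    return [KeywordHit(w, t) for w, t in words if w in targets]


# Hypothetical recognizer output and keyword table.
recognized = [("this", 1.0), ("is", 1.2), ("beautiful", 1.5),
              ("and", 2.0), ("interesting", 2.4)]
hits = detect_keywords(recognized, ["beautiful", "interesting"])
```

Exact string matching is used here only for simplicity; the specification does not state how words are matched against the keyword table.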

(S32) From each frame included in the image data read in step S30, the motion detection unit 126 recognizes, by image recognition, the region in which the customer appears and the region in which the customer service representative appears. The motion detection unit 126 detects the type, time, and magnitude of the customer's motions from positional changes between frames. Likewise, the motion detection unit 126 detects the type, time, and magnitude of the customer service representative's motions from positional changes between frames. The motion detection unit 126 generates the motion detection table 133 containing the detected information.

(S33) The scene evaluation unit 127 extracts the customer's motion times from the motion detection table 133 generated in step S32.
(S34) For each customer motion time extracted in step S33, the scene evaluation unit 127 searches the motion detection table 133 for an immediately preceding motion of the customer service representative, determines whether such a motion exists, and determines a weight accordingly. Specifically, the scene evaluation unit 127 selects weight w1 if the representative performs no motion during the immediately preceding time S1, and selects weight w2 if the representative performs a motion during the immediately preceding time S1.

(S35) For each customer motion time extracted in step S33, the scene evaluation unit 127 searches the motion detection table 133 for an immediately following motion of the customer service representative, determines whether there is synchronization by a representative motion of the same type, and determines a coefficient accordingly. Specifically, if there is synchronization, that is, if the representative performs a motion of the same type during the immediately following time S2, the scene evaluation unit 127 selects coefficient = 1. If there is no synchronization, that is, if the representative performs no motion of the same type during the immediately following time S2, the scene evaluation unit 127 selects coefficient = a. These coefficients are multiplied by the weight, and a < 1.

(S36) The scene evaluation unit 127 sets a sliding window with a time width of S0 × 2. Using the weights and coefficients determined in steps S34 and S35 for the customer motions that fall within the sliding window, the scene evaluation unit 127 calculates the scene evaluation value at the center time of the sliding window. This scene evaluation value represents the customer service representative's degree of customer service. By sliding the window in increments of Δt, the scene evaluation unit 127 calculates scene evaluation values at intervals of Δt. The scene evaluation unit 127 generates the scene evaluation table 134, which indicates the scene evaluation value at each of the multiple times.
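Steps S33 to S36 can be sketched as follows. This is only an illustration under several assumptions that this passage does not settle: the per-motion score is taken as weight × coefficient, the scene evaluation value of a window is taken as the mean of the scores inside it (0.0 for an empty window), and w1 is taken to be larger than w2. The names `Motion`, `motion_score`, and `scene_values` and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Motion:
    actor: str   # "customer" or "staff" (the customer service representative)
    kind: str    # motion type, e.g. "nod" or "laugh"
    time: float  # motion time in seconds


def motion_score(m, motions, s1, s2, w1, w2, a):
    """Weight (S34) times coefficient (S35) for one customer motion m."""
    staff = [x for x in motions if x.actor == "staff"]
    # S34: did the staff member move within S1 just before the customer?
    preceded = any(m.time - s1 <= x.time < m.time for x in staff)
    weight = w2 if preceded else w1
    # S35: did the staff member make the same kind of motion within S2 after?
    synced = any(m.time < x.time <= m.time + s2 and x.kind == m.kind
                 for x in staff)
    coeff = 1.0 if synced else a
    return weight * coeff


def scene_values(motions, s0, s1, s2, w1, w2, a, dt, end):
    """S36: scene value at each window center, window width 2*S0, step dt.

    Aggregating by the mean is an assumption of this sketch.
    """
    customer = [m for m in motions if m.actor == "customer"]
    table = {}
    for i in range(int(end / dt) + 1):
        center = i * dt
        scores = [motion_score(m, motions, s1, s2, w1, w2, a)
                  for m in customer if abs(m.time - center) <= s0]
        table[center] = sum(scores) / len(scores) if scores else 0.0
    return table


motions = [
    Motion("customer", "laugh", 10.0),  # like scene 75: no staff motion nearby
    Motion("customer", "nod", 30.0),    # like scene 76: staff nods right after
    Motion("staff", "nod", 31.0),
]
table = scene_values(motions, s0=5.0, s1=3.0, s2=3.0,
                     w1=1.0, w2=0.5, a=0.2, dt=10.0, end=40.0)
```

With these hypothetical parameters, the window centered on the unsynchronized laugh scores lower than the window centered on the synchronized nod, which matches the qualitative behavior described for scenes 75 and 76.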

(S37) The motion detection unit 126 determines whether the image data has ended. The keyword detection unit 125 determines whether the voice data has ended. If both the image data and the voice data have ended, the process proceeds to step S38; otherwise, it returns to step S30.

(S38) The keyword evaluation unit 128 extracts the keyword appearance times from the keyword detection table 132 generated in step S31. For each appearance time, the keyword evaluation unit 128 searches the scene evaluation table 134 generated in step S36 for the surrounding scene evaluation values, namely those falling within time S3 immediately before or time S3 immediately after the appearance time. The keyword evaluation unit 128 calculates the average of the surrounding scene evaluation values.

(S39) The keyword evaluation unit 128 groups the averages of scene evaluation values calculated in step S38 by keyword. For each keyword, the keyword evaluation unit 128 further averages these averages to obtain the keyword evaluation value. The keyword evaluation unit 128 generates the keyword evaluation table 135, which indicates the keyword evaluation values.
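The two-stage averaging of steps S38 and S39 can be sketched as follows. This is a minimal illustration, assuming the scene evaluation table is a mapping from time to scene evaluation value and that an occurrence with no scene value within ±S3 is simply skipped (the passage above does not say how that case is handled). The name `keyword_values` and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def keyword_values(hits, scene_table, s3):
    """Compute a keyword evaluation value per keyword.

    hits: iterable of (keyword, appearance_time) pairs.
    scene_table: dict mapping time -> scene evaluation value.
    s3: half-width of the neighborhood around each appearance time.
    """
    per_keyword = defaultdict(list)
    for keyword, t in hits:
        # S38: average the scene values within S3 before and after time t.
        nearby = [v for ts, v in scene_table.items() if abs(ts - t) <= s3]
        if nearby:  # assumption: ignore occurrences with no nearby scene value
            per_keyword[keyword].append(mean(nearby))
    # S39: average the per-occurrence averages for each keyword.
    return {k: mean(vs) for k, vs in per_keyword.items()}


# Hypothetical scene evaluation table and keyword occurrences.
scene_table = {0.0: 0.0, 10.0: 0.2, 20.0: 0.0, 30.0: 1.0, 40.0: 0.0}
hits = [("interesting", 11.0), ("beautiful", 29.0), ("beautiful", 31.0)]
values = keyword_values(hits, scene_table, s3=5.0)
```

Averaging per occurrence first and then per keyword means that every occurrence contributes equally, regardless of how many scene values fall in its neighborhood.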

(S40) From the keyword evaluation table 135 generated in step S39, the keyword evaluation unit 128 extracts, as important keywords, the keywords whose keyword evaluation value exceeds threshold T1 and the keywords whose keyword evaluation value is below threshold T2. The keyword evaluation unit 128 generates the important keyword table 136, which indicates the extracted important keywords and their ranks, and stores it in the evaluation result storage unit 124. The keyword evaluation unit 128 transmits the contents of the important keyword table 136 to the management device 41. Based on those contents, the management device 41 displays the top N and bottom N important keywords.
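The threshold test and ranking of step S40 can be sketched as follows. This is a minimal illustration; the function name `extract_important`, the tuple layout of the result, and the numeric thresholds are hypothetical, and the actual layout of the important keyword table 136 is not reproduced here.

```python
def extract_important(values, t1, t2, n):
    """Return (top, bottom) lists of important keywords.

    values: dict mapping keyword -> keyword evaluation value.
    top: up to n keywords with value > t1, highest first (preferable keywords).
    bottom: up to n keywords with value < t2, lowest first (caution keywords).
    """
    preferred = sorted(((k, v) for k, v in values.items() if v > t1),
                       key=lambda kv: kv[1], reverse=True)
    caution = sorted(((k, v) for k, v in values.items() if v < t2),
                     key=lambda kv: kv[1])
    return preferred[:n], caution[:n]


# Hypothetical keyword evaluation values and thresholds.
values = {"beautiful": 0.9, "interesting": 0.2, "cheap": 0.5, "slow": 0.1}
top, bottom = extract_important(values, t1=0.8, t2=0.3, n=1)
```

Keywords whose value lies between T2 and T1, such as "cheap" in this sample, are not treated as important in either direction.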

According to the information processing system of the fifth embodiment, keywords are detected from the voice data, and the customer's motions and the customer service representative's motions are detected from the image data. Synchronization in which the customer moves first and the representative performs a motion of the same type immediately afterward is detected, a scene evaluation value indicating the degree of customer service is calculated based on this synchronization, and a keyword evaluation value is calculated from the scene evaluation values at times around the keyword. Preferable keywords with high keyword evaluation values and keywords requiring caution with low keyword evaluation values are then extracted and reported to the administrator.

This makes it possible to estimate important keywords that are likely to be associated with a good customer service attitude on the part of the representative, as well as important keywords that are likely to be associated with a poor one, and thus to support improvement of the representative's customer service skills. In addition, because keywords are evaluated from the state of synchronization between the customer's motions and the representative's motions, important keywords reflecting the representative's psychological state can be extracted more accurately than with methods such as evaluating keywords by their number of occurrences. Furthermore, because the condition determined is that the representative performs no motion immediately before the customer's motion and performs a motion of the same type immediately after it, the representative's psychological state, as expressed in the customer service attitude, can be estimated accurately.

[Sixth Embodiment]
Next, a sixth embodiment will be described. The description focuses on the differences from the second embodiment, and content that is the same as in the second embodiment may be omitted. In the second embodiment, the search target keywords to be detected from the voice data are specified in advance. In the sixth embodiment, by contrast, search target keywords are added automatically through conversation analysis, which reduces the burden of specifying them manually.

The information processing system of the sixth embodiment can be realized with the same system configuration as the information processing system of the second embodiment shown in FIG. 2. It can also have the same system configuration as the information processing system of the third embodiment shown in FIG. 10, or as that of the fourth embodiment shown in FIG. 11. However, a conversation analysis device 200, described below, is used in place of the conversation analysis device 100. The conversation analysis device 200 of the sixth embodiment can be realized with the same hardware configuration as the information processing system of the second embodiment shown in FIG. 3. As in the fifth embodiment, it is also possible to extract important keywords that reflect the degree of customer service.

FIG. 14 is a block diagram showing an example of the functions of the conversation analysis device according to the sixth embodiment.
The conversation analysis device 200 includes a voice storage unit 221, an image storage unit 222, a keyword storage unit 223, and an evaluation result storage unit 224. These storage units are realized using, for example, a storage area of a RAM or an HDD. The conversation analysis device 200 also includes a keyword detection unit 225, a motion detection unit 226, a scene evaluation unit 227, a word extraction unit 228, and a keyword evaluation unit 229. These processing units are realized using, for example, a program.

The voice storage unit 221 stores voice data including the voice signal received from the camera device 50. The image storage unit 222 stores image data including the image signal received from the camera device 50. The keyword storage unit 223 stores the keyword table 131 shown in FIG. 7. The keyword table 131 holds the search target keywords specified in advance by an administrator or the like, as well as the search target keywords added automatically by the keyword evaluation unit 229. The evaluation result storage unit 224 stores the important keyword table 136 shown in FIG. 8.

The keyword detection unit 225 converts the voice data stored in the voice storage unit 221 into a word string by speech recognition. The keyword detection unit 225 detects, in the word string, the search target keywords indicated by the keyword table 131 stored in the keyword storage unit 223, and generates the keyword detection table 132 shown in FIG. 7.

From each frame included in the image data stored in the image storage unit 222, the motion detection unit 226 recognizes the customer and the customer service representative by image recognition. The motion detection unit 226 extracts, from each frame, feature information of the region in which the customer service representative appears, and detects the representative's motions based on changes in the feature information between frames. Likewise, the motion detection unit 226 extracts, from each frame, feature information of the region in which the customer appears, and detects the customer's motions based on changes in the feature information between frames. The motion detection unit 226 generates the motion detection table 133 shown in FIG. 7.

The scene evaluation unit 227 calculates scene evaluation values at intervals of Δt based on the motion detection table 133. As described above, the scene evaluation unit 227 takes each motion time of the customer service representative as a reference and determines whether a customer motion occurs during time S1 immediately before it, and whether a customer motion of the same type occurs during time S2 immediately after it. The scene evaluation unit 227 calculates the scene evaluation values based on this state of synchronization between the representative's motions and the customer's motions, and generates the scene evaluation table 134 shown in FIG. 8.

The word extraction unit 228 converts the voice data stored in the voice storage unit 221 into a word string by speech recognition. The word extraction unit 228 extracts, from the word string, unregistered words that are not registered in the keyword table 131. However, general-purpose words (stop words) that can appear many times in speech, such as Japanese particles and auxiliary verbs, are excluded. The technique described in Non-Patent Document 1 ("Speech summary sentence generation method by word extraction and its evaluation") may be used to extract the unregistered words. The word extraction unit 228 notifies the keyword evaluation unit 229 of the extraction result, which includes each extracted unregistered word and the time at which it appears.
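The filtering performed by the word extraction unit can be sketched as follows. This is a minimal illustration using plain set membership; it is not the technique of Non-Patent Document 1, and the function name `extract_unregistered`, the stop-word list, and the sample words are hypothetical.

```python
def extract_unregistered(words, registered, stop_words):
    """Return (word, time) pairs that are neither registered nor stop words.

    words: iterable of (word, time) pairs, assumed to come from a recognizer.
    registered: search target keywords already in the keyword table.
    stop_words: general-purpose words to be ignored.
    """
    known = set(registered) | set(stop_words)
    return [(w, t) for w, t in words if w not in known]


# Hypothetical recognizer output, keyword table, and stop-word list.
recognized = [("this", 1.0), ("is", 1.2), ("very", 1.4),
              ("stylish", 1.6), ("and", 2.0), ("beautiful", 2.3)]
registered = {"beautiful", "interesting"}
stop_words = {"this", "is", "and", "very"}
unregistered = extract_unregistered(recognized, registered, stop_words)
```

The appearance time is kept alongside each word because the later evaluation step needs it to look up the surrounding scene evaluation values.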

The keyword evaluation unit 229 calculates a word evaluation value for each keyword and each unregistered word based on the keyword detection table 132, the scene evaluation table 134, and the extraction result for unregistered words. The word evaluation value is calculated in the same way as the keyword evaluation value of the second embodiment. That is, the word evaluation value of a keyword is identical to its keyword evaluation value, and the word evaluation value of an unregistered word is calculated by the same method as the keyword evaluation value.

Accordingly, for each keyword, the keyword evaluation unit 229 extracts the one or more appearance times of the keyword, averages the scene evaluation values at the surrounding times for each appearance time, and further averages the results across the appearance times to obtain the word evaluation value. Likewise, for each unregistered word, the keyword evaluation unit 229 extracts the one or more appearance times of the unregistered word, averages the scene evaluation values at the surrounding times for each appearance time, and further averages the results across the appearance times to obtain the word evaluation value. The keyword evaluation unit 229 then generates a word evaluation table having the same data structure as the keyword evaluation table 135 shown in FIG. 8. The word evaluation table contains both the word evaluation values for keywords and the word evaluation values for unregistered words.

The keyword evaluation unit 229 extracts, as important keywords, the keywords whose word evaluation value exceeds threshold T1 and the keywords whose word evaluation value is below threshold T2. The keyword evaluation unit 229 then generates the important keyword table 136 shown in FIG. 8, stores it in the evaluation result storage unit 224, and transmits its contents to the management device 41. The keyword evaluation unit 229 also extracts, from among the unregistered words, those whose word evaluation value exceeds threshold T1 and those whose word evaluation value is below threshold T2, and adds the extracted unregistered words to the keyword table 131 as search target keywords.

FIG. 15 is a flowchart showing an example procedure of conversation analysis according to the sixth embodiment.
(S50) The motion detection unit 226 reads image data from the image storage unit 222. The keyword detection unit 225 reads voice data from the voice storage unit 221. The word extraction unit 228 reads the same voice data from the voice storage unit 221.

(S51) The keyword detection unit 225 converts the voice data read in step S50 into a word string by speech recognition, searches the word string for the search target keywords registered in the keyword table 131, and generates the keyword detection table 132. The word extraction unit 228 likewise converts the voice data read in step S50 into a word string by speech recognition, and extracts from the word string the unregistered words that are not registered in the keyword table 131.

(S52) From each frame included in the image data read in step S50, the motion detection unit 226 recognizes, by image recognition, the region in which the customer appears and the region in which the customer service representative appears. The motion detection unit 226 detects the type, time, and magnitude of the customer's motions from positional changes between frames. Likewise, the motion detection unit 226 detects the type, time, and magnitude of the customer service representative's motions from positional changes between frames. The motion detection unit 226 generates the motion detection table 133 containing the detected information.

(S53) The scene evaluation unit 227 extracts the customer service representative's motion times from the motion detection table 133 generated in step S52.
(S54) For each representative motion time extracted in step S53, the scene evaluation unit 227 searches the motion detection table 133 for an immediately preceding customer motion, determines whether such a motion exists, and determines a weight accordingly.

(S55) For each representative motion time extracted in step S53, the scene evaluation unit 227 searches the motion detection table 133 for an immediately following customer motion, determines whether there is synchronization by a customer motion of the same type, and determines a coefficient accordingly.

(S56) The scene evaluation unit 227 sets a sliding window with a predetermined time width. Using the weights and coefficients determined in steps S54 and S55 for the representative motions that fall within the sliding window, the scene evaluation unit 227 calculates the scene evaluation value at the center time of the sliding window. The scene evaluation unit 227 slides the window in increments of Δt and generates the scene evaluation table 134.

(S57) The motion detection unit 226 determines whether the image data has ended. The keyword detection unit 225 determines whether the voice data has ended. The word extraction unit 228 determines whether the voice data has ended. If both the image data and the voice data have ended, the process proceeds to step S58; otherwise, it returns to step S50.

(S58) The keyword evaluation unit 229 extracts the keyword appearance times from the keyword detection table 132 generated in step S51. For each keyword appearance time, the keyword evaluation unit 229 searches the scene evaluation table 134 generated in step S56 for the scene evaluation values around that time and calculates their average. Similarly, the keyword evaluation unit 229 extracts the appearance times of unregistered words from the extraction result of step S51. For each appearance time of an unregistered word, the keyword evaluation unit 229 searches the scene evaluation table 134 for the scene evaluation values around that time and calculates their average.

(S59) For keywords, the keyword evaluation unit 229 groups the averages of scene evaluation values calculated in step S58 by keyword, and further averages them for each keyword to obtain the word evaluation value. Similarly, for unregistered words, the keyword evaluation unit 229 groups the averages of scene evaluation values calculated in step S58 by unregistered word, and further averages them for each unregistered word to obtain the word evaluation value. The keyword evaluation unit 229 generates a word evaluation table indicating the word evaluation values of the keywords and the unregistered words.

(S60) From the word evaluation table generated in step S59, the keyword evaluation unit 229 extracts, as important keywords, the keywords whose word evaluation value exceeds threshold T1 and the keywords whose word evaluation value is below threshold T2. The keyword evaluation unit 229 generates the important keyword table 136, which indicates the extracted important keywords and their ranks, and stores it in the evaluation result storage unit 224. The keyword evaluation unit 229 transmits the contents of the important keyword table 136 to the management device 41. Based on those contents, the management device 41 displays the top N and bottom N important keywords.

(S61) From the word evaluation table generated in step S59, the keyword evaluation unit 229 extracts the unregistered words whose word evaluation value exceeds threshold T1 and the unregistered words whose word evaluation value is below threshold T2. The keyword evaluation unit 229 adds the extracted unregistered words to the keyword table 131 as new search target keywords.
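The automatic registration of step S61 can be sketched as follows. This is a minimal illustration, assuming the keyword table is held as a set of strings and the word evaluation table as a mapping from word to word evaluation value; the name `update_keyword_table` and all sample values are hypothetical.

```python
def update_keyword_table(keyword_table, word_values, t1, t2):
    """Add unregistered words with extreme evaluation values to the table.

    keyword_table: set of search target keywords, modified in place.
    word_values: dict mapping word -> word evaluation value
                 (registered keywords and unregistered words together).
    Returns the list of newly added words.
    """
    added = [w for w, v in word_values.items()
             if w not in keyword_table and (v > t1 or v < t2)]
    keyword_table.update(added)
    return added


# Hypothetical keyword table and word evaluation table.
keyword_table = {"beautiful", "interesting"}
word_values = {"beautiful": 0.9, "stylish": 0.85,
               "expensive": 0.1, "today": 0.5}
added = update_keyword_table(keyword_table, word_values, t1=0.8, t2=0.3)
```

In this sample, "stylish" and "expensive" become search target keywords for subsequent analyses, while "today", whose value lies between the thresholds, stays unregistered.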

According to the information processing system of the sixth embodiment, the same effects as in the second embodiment are obtained. In addition, in the sixth embodiment, search target keywords that may be strongly related to customer service are added automatically through conversation analysis. It is therefore unnecessary to specify the search target keywords exhaustively in advance, which reduces the work of specifying them. Moreover, because the search target keywords are learned automatically, the accuracy of the important keywords extracted from conversations improves, and the conversation analysis results become more useful.

10 Keyword extraction device
11 Storage unit
12 Processing unit
13 Voice data
14 Motion data
15 Keyword
16, 17 Motion
18 Evaluation value

Claims

A keyword extraction program that causes a computer to execute a process comprising:
detecting a keyword from voice data indicating speech uttered by at least one of a first user and a second user in a conversation between the first user, who is a service provider, and the second user, who is a service recipient;
detecting, from motion data indicating motions performed by the first user and motions performed by the second user in the conversation, a timing of a first motion by the first user and a timing of a second motion by the second user; and
calculating an evaluation value indicating an importance of the keyword based on a relationship between the timing of the first motion and the timing of the second motion.

The keyword extraction program according to claim 1, wherein the motion data is image data obtained by capturing the first user and the second user during the conversation.

The keyword extraction program according to claim 1, wherein
the second motion is a motion of the same type as the first motion and is performed after the first motion, and
in the calculating of the evaluation value, the importance is evaluated as higher when the timing of the second motion is within a predetermined time from the timing of the first motion than when it is not within the predetermined time.

The keyword extraction program according to claim 1, wherein
the second motion is a motion of the same type as the first motion and is performed after the first motion, and
in the calculating of the evaluation value, the importance is evaluated as higher the shorter the elapsed time from the timing of the first motion to the timing of the second motion.

In the calculation of the evaluation value, as a predetermined condition in which the timing of the second operation is later than the timing of the first operation, the second user has a predetermined time immediately before the first operation. The psychological state of the second user is determined according to whether or not the operation is not performed and the timing of the second operation exists within the predetermined time immediately after the first operation. Then, the evaluation value is calculated based on the determination result of the psychological state,
The keyword extraction program according to claim 1.
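
The condition recited above (no operation by the second user immediately before the first operation, and an operation by the second user immediately after it) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the 2-second window, and the binary 1.0/0.0 mapping from the determination result to the evaluation value are hypothetical and are not taken from the claims.

```python
def is_following_response(first_time, second_user_times, window=2.0):
    """Check the before/after condition (hypothetical helper).

    True when the second user performs no operation in the `window`
    seconds immediately before the first operation AND performs an
    operation in the `window` seconds immediately after it.
    """
    quiet_before = not any(
        first_time - window <= t < first_time for t in second_user_times
    )
    responds_after = any(
        first_time < t <= first_time + window for t in second_user_times
    )
    return quiet_before and responds_after


def evaluation_from_state(first_time, second_user_times, window=2.0):
    """Map the determination result to an evaluation value (assumed 1.0 / 0.0)."""
    if is_following_response(first_time, second_user_times, window):
        return 1.0
    return 0.0
```

The check on the preceding window separates a second user who starts moving in response to the first user from one who was already moving beforehand.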

In the detection of the keyword, the voice data is searched for a predetermined search target keyword,
The computer is further caused to execute a process of
Extracting a word other than the search target keyword from the voice data,
Calculating, based on the relationship, another evaluation value indicating the importance of the extracted word, and
Adding the extracted word to the search target keywords when the other evaluation value satisfies a predetermined condition,
The keyword extraction program according to claim 1.
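
The keyword-list update recited above can be sketched as below. The function name, the dictionary of per-word evaluation values, and the choice of "evaluation value at or above a threshold of 0.8" as the predetermined condition are assumptions for illustration only.

```python
def expand_search_keywords(search_keywords, word_scores, threshold=0.8):
    """Add highly evaluated words to the search target keywords.

    `search_keywords` is the current set of search target keywords.
    `word_scores` maps each extracted word (a word other than the search
    target keywords) to its evaluation value. The predetermined condition
    is assumed here to be "value >= threshold".
    """
    updated = set(search_keywords)
    for word, score in word_scores.items():
        if word not in updated and score >= threshold:
            updated.add(word)
    return updated
```

For example, `expand_search_keywords({"discount"}, {"warranty": 0.9, "weather": 0.1})` adds "warranty" but not "weather", so later conversations are also searched for "warranty".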

A computer executes processing in which
A keyword is detected from voice data indicating an utterance made by at least one of a first user and a second user in a conversation between the first user, who is on the service-providing side, and the second user, who is on the service-receiving side,
The timing of a first operation by the first user and the timing of a second operation by the second user are detected from operation data indicating the operation performed by the first user and the operation performed by the second user in the conversation, and
An evaluation value indicating the importance of the keyword is calculated based on the relationship between the timing of the first operation and the timing of the second operation.
Keyword extraction method.

A storage unit that stores voice data indicating an utterance made by at least one of a first user and a second user in a conversation between the first user, who is on the service-providing side, and the second user, who is on the service-receiving side, and operation data indicating an operation performed by the first user and an operation performed by the second user in the conversation,
A processing unit that detects a keyword from the voice data, detects the timing of a first operation by the first user and the timing of a second operation by the second user from the operation data, and calculates an evaluation value indicating the importance of the keyword based on the relationship between the timing of the first operation and the timing of the second operation,
A keyword extraction device having the storage unit and the processing unit.