JP7095414B2

JP7095414B2 - Speech processing program, speech processing method and speech processing device

Info

Publication number: JP7095414B2
Application number: JP2018107778A
Authority: JP
Inventors: 昭二早川
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2018-06-05
Filing date: 2018-06-05
Publication date: 2022-07-05
Anticipated expiration: 2038-06-05
Also published as: JP2019211633A

Description

The present invention relates to a voice processing program and the like.

In recent years, call centers have been recording conversations between operators and customers and accumulating the recorded conversation information. The accumulated information is used, for example, as feedback to operators in order to improve service.

One conventional technique determines whether a call is a nuisance call based on the voice information from the start time to the end time of the call. This technique inputs the duration of the entire call, the ratio of voice sections in the entire call, a stress evaluation value, and the number of times predetermined keywords are detected into a model trained in advance, and thereby identifies how likely the call is to be a nuisance call.

Japanese Unexamined Patent Publication No. 2005-12831
International Publication No. 2008/032787
International Publication No. 2014/069122

However, the conventional technique described above cannot determine whether a conversation situation is a normal conversation situation or an abnormal conversation situation.

In one aspect, an object of the present invention is to provide a voice processing program, a voice processing method, and a voice processing device capable of determining whether a conversation situation is a normal conversation situation or an abnormal conversation situation.

In a first aspect, a computer is caused to execute the following processing. The computer sets set times at predetermined time intervals from the start time of the conversation to be judged that is contained in the voice information, and calculates a plurality of feature amounts from the plurality of pieces of voice information spanning the start time to each set time. The computer inputs the feature amounts calculated for each set time into a model generated from the feature amounts of voice information spanning the start time to the end time of a conversation, and thereby calculates, for each set time, the output values of the model corresponding to those feature amounts. Based on the plurality of output values, the computer determines whether the conversation to be judged is an abnormal conversation situation.

This makes it possible to determine whether a conversation situation is a normal conversation situation or an abnormal conversation situation.

FIG. 1 is a diagram for explaining the processing of the voice processing device according to the first embodiment.
FIG. 2 is a functional block diagram showing the configuration of the voice processing device according to the first embodiment.
FIG. 3 is a schematic diagram for explaining the model information according to the first embodiment.
FIG. 4 is a diagram showing an example of the data structure of the output value storage buffer according to the first embodiment.
FIG. 5 is a functional block diagram showing the configuration of the feature amount calculation unit according to the first embodiment.
FIG. 6 is a diagram showing an example of the data structure of the pitch power storage unit according to the first embodiment.
FIG. 7 is a diagram showing an example of the data structure of the detection count information according to the first embodiment.
FIG. 8 is a diagram for explaining variation 1 of the determination process.
FIG. 9 is a diagram for explaining variation 2 of the determination process.
FIG. 10 is a diagram for explaining variation 3 of the determination process.
FIG. 11 is a diagram for explaining variation 4 of the determination process.
FIG. 12 is a flowchart (1) showing the processing procedure of the voice processing device according to the first embodiment.
FIG. 13 is a flowchart (2) showing the processing procedure of the voice processing device according to the first embodiment.
FIG. 14 is a diagram for explaining the effect of the voice processing device according to the first embodiment.
FIG. 15 is a diagram for explaining other processing of the conversation time management unit.
FIG. 16 is a diagram for explaining the processing of reference technique 2.
FIG. 17 is a diagram for explaining the processing of the voice processing device according to the second embodiment.
FIG. 18 is a functional block diagram showing the configuration of the voice processing device according to the second embodiment.
FIG. 19 is a diagram showing an example of the data structure of the output value storage buffer according to the second embodiment.
FIG. 20 is a functional block diagram showing the configuration of the feature amount calculation unit according to the second embodiment.
FIG. 21 is a diagram showing an example of the data structure of the detection count information according to the second embodiment.
FIG. 22 is a flowchart (1) showing the processing procedure of the voice processing device according to the second embodiment.
FIG. 23 is a flowchart (2) showing the processing procedure of the voice processing device according to the second embodiment.
FIG. 24A is a diagram for explaining the first trajectory.
FIG. 24B is a diagram for explaining the second trajectory.
FIG. 24C is a diagram for explaining the third trajectory.
FIG. 25 is a diagram showing an example of the hardware configuration of a computer that realizes the same functions as the voice processing device.

Hereinafter, embodiments of the voice processing program, the voice processing method, and the voice processing device disclosed in the present application will be described in detail with reference to the drawings. The present invention is not limited by these embodiments.

Before describing the voice processing device according to the first embodiment, reference technique 1, which determines whether a conversation between an operator and a customer is an abnormal conversation situation, will be described. Reference technique 1 is not prior art.

Reference technique 1 determines whether a conversation situation is normal or abnormal based on the voice information from the start time to the end time of the conversation (the voice information of the entire conversation). Here, an "abnormal conversation situation" includes any "unusual situation", such as the customer feeling dissatisfied, becoming angry, or making threats.

Reference technique 1 inputs the duration of the entire conversation, the ratio of voice sections in the entire conversation, a stress evaluation value, and the number of times predetermined keywords are detected into a model trained in advance, and thereby obtains an output value indicating the likelihood of an abnormal conversation situation. Reference technique 1 determines that the conversation situation is abnormal when this output value is equal to or greater than a threshold.

Here, when the customer becomes angry only near the end of the conversation, or when the customer voices dissatisfaction during the conversation but the operator skillfully calms the customer down, it is preferable to judge the conversation to be an "abnormal conversation situation". However, because reference technique 1 judges comprehensively using evaluation values and analysis results for the entire conversation, the output value indicating the likelihood of an abnormal conversation situation may not become large overall even when part of the conversation is abnormal, and the conversation may then not be judged abnormal.

Next, an example of the processing of the voice processing device according to the first embodiment will be described. The voice processing device performs a "process for learning a model" that judges the likelihood of an abnormal conversation situation, and a "process for determining whether a conversation situation is abnormal".

When learning the model, the voice processing device uses evaluation values and analysis results for the voice information of the entire conversation, in the same manner as reference technique 1.

Then, when determining whether a conversation situation is abnormal, the voice processing device inputs the evaluation values and analysis results for the span from the conversation start time to each set time into the model at fixed time intervals, and calculates and accumulates the model's output values. The voice processing device uses the trajectory obtained from the accumulated output values to determine whether the conversation situation is normal or abnormal.

FIG. 1 is a diagram for explaining the processing of the voice processing device according to the first embodiment. In FIG. 1, the vertical axis corresponds to the output value of the model, and the horizontal axis corresponds to the conversation time. Each of the output values 10a to 10f is the output value obtained when the evaluation values and analysis results of the conversation in a section starting at time 0 are input to the model: output value 10a corresponds to the section from time 0 to time t_1, 10b to the section up to time t_2, 10c up to time t_3, 10d up to time t_4, 10e up to time t_5, and 10f up to time t_6.

Likewise, the output values 10g, 10h, 10i, 10j, and 10k are the output values obtained when the evaluation values and analysis results of the conversation in the sections from time 0 to times t_7, t_8, t_9, t_10, and t_11, respectively, are input to the model.

The voice processing device determines whether the conversation situation is normal or abnormal based on the trajectory of the output values 10a to 10k. Because an abnormal conversation situation in part of the conversation changes the trajectory, the device can determine whether the conversation situation is normal or abnormal. In contrast, reference technique 1 makes the determination by comparing against a threshold only the output value 10k, which is obtained by inputting the evaluation values and analysis results for the entire conversation (time 0 to t_11) into the model, so it cannot identify an abnormal conversation situation that occurs in only part of the conversation.
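The prefix-based evaluation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `output_trajectory` and its parameters are hypothetical, and the feature extractor and model are passed in as plain callables standing in for the feature amount calculation and the trained model.

```python
from typing import Callable, List, Sequence, Tuple


def output_trajectory(
    samples: Sequence[float],
    sample_rate: int,
    interval_s: float,
    extract_features: Callable[[Sequence[float]], List[float]],
    model: Callable[[List[float]], float],
) -> List[Tuple[float, float]]:
    """Return (set time in seconds, model output) pairs for growing prefixes."""
    step = int(interval_s * sample_rate)
    if step <= 0:
        raise ValueError("interval_s * sample_rate must be at least one sample")
    trajectory = []
    end = step
    while end <= len(samples):
        prefix = samples[:end]  # every section starts at time 0
        trajectory.append((end / sample_rate, model(extract_features(prefix))))
        end += step
    return trajectory
```

Each entry plays the role of one of the output values 10a to 10k. A trailing section shorter than one interval is ignored in this sketch.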

FIG. 2 is a functional block diagram showing the configuration of the voice processing device according to the first embodiment. As shown in FIG. 2, the voice processing device 100 includes a communication unit 110, a storage unit 120, and a control unit 130.

The communication unit 110 is a processing unit that performs data communication with external devices via a network. For example, the communication unit 110 receives voice information from a server device (not shown) that collects voice information containing conversations between customers and operators. The communication unit 110 outputs the received voice information to the control unit 130. The communication unit 110 is an example of a communication device. Although the first embodiment takes a conversation between a customer and an operator as an example, the conversation is not limited to this and may be a conversation between any users.

The storage unit 120 has a voice buffer 120a, model information 120b, and an output value storage buffer 120c. The storage unit 120 corresponds to a semiconductor memory element such as a RAM (Random Access Memory), a ROM (Read Only Memory), or a flash memory, or to a storage device such as an HDD (Hard Disk Drive).

The voice buffer 120a is a buffer that stores the voice information of a conversation between a customer and an operator. When the voice processing device 100 performs the "process for learning a model", voice information for learning is stored in the voice buffer 120a. When the voice processing device 100 performs the "process for determining the conversation situation", the voice information to be judged is stored in the voice buffer 120a.

The model information 120b is information on a model that outputs an output value indicating the degree to which the conversation contained in the voice information is an abnormal conversation situation. FIG. 3 is a schematic diagram for explaining the model information according to the first embodiment. As shown in FIG. 3, the model information 120b has the structure of a neural network with an input layer 20a, a hidden layer 20b, and an output layer 20c. In the input layer 20a, the hidden layer 20b, and the output layer 20c, a plurality of nodes are connected by edges. The hidden layer 20b and the output layer 20c each have a function called an activation function and a bias value, and each edge has a weight.

When the feature amounts of the voice information are input to the nodes of the input layer 20a, the probability "Ot" that the conversation is an abnormal conversation situation and the probability "On" that the conversation is a normal conversation situation are output from the nodes of the output layer 20c via the hidden layer 20b.

In this embodiment, the output value V output from the model information 120b is defined by equation (1). P(t) in equation (1) is the value defined by equation (2), and P(n) in equation (1) is the value defined by equation (3).

V = log P(t) - log P(n) ... (1)

P(t) = exp(Ot) / {exp(Ot) + exp(On)} ... (2)
P(n) = exp(On) / {exp(Ot) + exp(On)} ... (3)
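As a worked illustration of equations (1) to (3), the following sketch computes V from the two network outputs. The function name `output_value` is hypothetical, and the natural logarithm is assumed because the source does not state the base.

```python
import math


def output_value(o_t: float, o_n: float) -> float:
    """V = log P(t) - log P(n), with P(t) and P(n) from equations (2) and (3)."""
    top = max(o_t, o_n)  # subtracting the maximum avoids overflow in exp()
    denom = math.exp(o_t - top) + math.exp(o_n - top)
    p_t = math.exp(o_t - top) / denom  # equation (2)
    p_n = math.exp(o_n - top) / denom  # equation (3)
    return math.log(p_t) - math.log(p_n)  # equation (1)
```

Because P(t) and P(n) share the same denominator, V reduces algebraically to Ot - On under the natural-logarithm assumption. A positive V therefore means the model favors the abnormal conversation situation.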

The output value storage buffer 120c is a buffer that stores the output values calculated from the model information 120b. FIG. 4 is a diagram showing an example of the data structure of the output value storage buffer according to the first embodiment. As shown in FIG. 4, the output value storage buffer 120c associates a time with an output value. The time indicates the span of voice information from which the feature amounts were extracted (the elapsed time from the conversation start time). The output value indicates the output value V obtained when the feature amounts calculated from the voice information of that span are input to the model information 120b. For example, in FIG. 4, the output value obtained when the feature amounts calculated from the voice information of the time "0 to t_1" are input to the model information 120b is the output value V_1.

The control unit 130 has an acquisition unit 130a, a feature amount calculation unit 130b, a model learning unit 130c, a conversation time management unit 130d, an output value calculation unit 130e, and a determination unit 130f. The control unit 130 can be realized by a CPU (Central Processing Unit), an MPU (Micro Processing Unit), or the like. The control unit 130 can also be realized by hard-wired logic such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).

The acquisition unit 130a is a processing unit that acquires voice information and stores it in the voice buffer 120a. For example, when the voice processing device 100 performs the "process for learning a model", the acquisition unit 130a acquires voice information for learning and stores it in the voice buffer 120a. When the voice processing device 100 performs the "process for determining the conversation situation", the acquisition unit 130a acquires the voice information to be judged and stores it in the voice buffer 120a.

The feature amount calculation unit 130b is a processing unit that calculates feature amounts based on the voice information stored in the voice buffer 120a. For example, the feature amounts calculated by the feature amount calculation unit 130b are a stress evaluation value, the number of times keywords are detected, and the elapsed time from the conversation start time. The stress evaluation value, the detection count, and the elapsed time are described later.

When the voice processing device 100 performs the "process for learning a model", the feature amount calculation unit 130b outputs the feature amounts to the model learning unit 130c. When the voice processing device 100 performs the "process for determining the conversation situation", the feature amount calculation unit 130b outputs the feature amounts to the output value calculation unit 130e.

FIG. 5 is a functional block diagram showing the configuration of the feature amount calculation unit according to the first embodiment. As shown in FIG. 5, the feature amount calculation unit 130b has a voice acquisition unit 131a, a frame processing unit 131b, a pitch extraction unit 132, a power calculation unit 133, a pitch power storage unit 134, and a stress evaluation value calculation unit 135. The feature amount calculation unit 130b also has a voice recognition unit 136, a recognition result storage unit 137, and a conversation time calculation unit 138.

The voice acquisition unit 131a reads the voice information stored in the voice buffer 120a and outputs it to the frame processing unit 131b. In the following description, the digital-signal voice information read by the voice acquisition unit 131a is simply referred to as "voice information".

The frame processing unit 131b extracts the signal time series of the voice information acquired from the voice acquisition unit 131a as "frames" of a predetermined number of samples each, and multiplies each frame by an analysis window such as a Hanning window.

For example, the frame processing unit 131b extracts, as one frame, the N samples in a 32 ms section at a sampling frequency of 8 kHz. For example, N = 256. Let the samples in the frame be "s(0), s(1), s(2), ..., s(N-1)". The frame processing unit 131b multiplies each of these samples by a Hamming window. For example, the Hamming window is given by equation (4).

Let the samples obtained by multiplying each sample by the Hamming window be "x(0), x(1), x(2), ..., x(N-1)". In the following description, these windowed samples are referred to as sample values. The frame processing unit 131b outputs the sample values to the pitch extraction unit 132, the power calculation unit 133, and the voice recognition unit 136. The frame processing unit 131b may output the sample value information frame by frame and assign a frame identification number to each frame.
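The framing and windowing step might look like the sketch below. Equation (4) is not reproduced in this text, so the standard Hamming coefficients 0.54 and 0.46 are an assumption, as are the function names and the choice of non-overlapping frames (the source does not state a frame shift).

```python
import math
from typing import List, Sequence


def hamming_window(n_samples: int) -> List[float]:
    # Standard Hamming window; assumed here because equation (4) is not shown.
    return [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
        for n in range(n_samples)
    ]


def split_into_frames(signal: Sequence[float], frame_len: int = 256) -> List[List[float]]:
    """Cut the signal into frames of frame_len samples and apply the window."""
    window = hamming_window(frame_len)
    frames = []
    # Non-overlapping frames (assumption); leftover samples at the end are dropped.
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames
```

With the default of 256 samples, one frame covers 32 ms at 8 kHz, matching the example values in the text.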

The pitch extraction unit 132 is a processing unit that extracts the fundamental frequency (pitch) of a frame based on the sample values of the frame. The pitch extraction unit 132 stores the pitch information of each frame in the pitch power storage unit 134.

For example, the pitch extraction unit 132 calculates an autocorrelation function using the sample values of the frame. The pitch extraction unit 132 calculates the autocorrelation function φ(m) based on equation (5), where m denotes the delay time.

For equation (5), the pitch extraction unit 132 identifies the delay time m, other than m = 0, at which the autocorrelation function takes a local maximum. This delay time is denoted "delay time m'". After calculating the delay time m', the pitch extraction unit 132 calculates the pitch based on equation (6).

Pitch = 1 / delay time m' ... (6)

The pitch extraction unit 132 repeats the above processing on the sample values of each frame, thereby calculating a pitch for each frame. However, when the local maximum of the autocorrelation function is equal to or less than a predetermined threshold, the frame is treated as a silent section, and its pitch and power are not used in subsequent processing.
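The pitch extraction described above can be sketched as follows. Equation (5) is not reproduced in this text, so the plain sum-of-products autocorrelation is an assumption, as are the function names, the 60 to 400 Hz search range, and the threshold applied to the energy-normalized peak. The delay is measured in samples, so equation (6) becomes sampling rate divided by lag.

```python
from typing import Optional, Sequence


def autocorrelation(x: Sequence[float], m: int) -> float:
    # Assumed form of equation (5): sum of x(n) * x(n + m) over the frame.
    return sum(x[n] * x[n + m] for n in range(len(x) - m))


def estimate_pitch(
    x: Sequence[float],
    sample_rate: int = 8000,
    min_hz: float = 60.0,
    max_hz: float = 400.0,
    threshold: float = 0.3,
) -> Optional[float]:
    """Return the pitch in Hz, or None when the frame is treated as silent."""
    energy = autocorrelation(x, 0)
    if energy <= 0.0:
        return None
    lo = int(sample_rate / max_hz)
    hi = min(int(sample_rate / min_hz), len(x) - 1)
    best_lag, best_val = None, 0.0
    for m in range(lo, hi + 1):  # m = 0 is excluded by construction
        val = autocorrelation(x, m) / energy
        if val > best_val:
            best_lag, best_val = m, val
    if best_lag is None or best_val <= threshold:
        return None  # peak too weak: silent section
    return sample_rate / best_lag  # equation (6) with the delay in samples
```

For a 200 Hz sine sampled at 8 kHz, the strongest peak in the search range falls at a lag of 40 samples, which gives 200 Hz.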

The power calculation unit 133 is a processing unit that calculates the power of a frame based on the sample values of the frame. The power calculation unit 133 stores the power information of each frame in the pitch power storage unit 134.

For example, the power calculation unit 133 calculates the power of a frame by taking the logarithm of the sum of the squares of the sample values "x(0), x(1), x(2), ..., x(N-1)" of the frame. Specifically, the power calculation unit 133 calculates the power of the frame based on equation (7).
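A minimal sketch of the power calculation follows. Equation (7) is not reproduced in this text; the decibel scaling (10 times the base-10 logarithm) and the small floor that guards against the logarithm of zero are assumptions, and `frame_power` is a hypothetical name.

```python
import math
from typing import Sequence


def frame_power(x: Sequence[float], floor: float = 1e-12) -> float:
    """Logarithm of the sum of squared sample values, expressed in dB (assumed)."""
    energy = sum(v * v for v in x)
    return 10.0 * math.log10(max(energy, floor))
```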

The pitch power storage unit 134 is a buffer that stores the pitch information extracted by the pitch extraction unit 132 and the power information calculated by the power calculation unit 133. FIG. 6 is a diagram showing an example of the data structure of the pitch power storage unit according to the first embodiment. As shown in FIG. 6, the pitch power storage unit 134 associates a frame identification number with a pitch and a power. Frames that the pitch extraction unit 132 treated as silent sections are not included in the buffer.

The stress evaluation value calculation unit 135 is a processing unit that calculates a stress evaluation value for each set time based on the pitch and power information stored in the pitch power storage unit 134. For example, the stress evaluation value calculation unit 135 compares the degree of variation of the current pitch-power samples with the degree of variation of samples of the user's pitch and power in a normal state, and makes the stress evaluation value larger as the current variation is larger and smaller as it is smaller.

When the voice processing device 100 performs the "process for determining the conversation situation", the stress evaluation value calculation unit 135 performs the following processing. Using the samples of pitch-power pairs accumulated from the conversation start time to the time at which the output control signal is received, the stress evaluation value calculation unit 135 models a Gaussian mixture distribution by maximum likelihood estimation. It calculates, as the stress evaluation value, the average log-likelihood of the model for the samples used in the estimation multiplied by minus one, and outputs the calculated stress evaluation value to the output value calculation unit 130e. The "output control signal" is a signal output from the conversation time management unit 130d. For example, the stress evaluation value calculation unit 135 performs the maximum likelihood modeling using the EM algorithm (expectation-maximization method) described in Japanese Unexamined Patent Publication No. 2015-082093.
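The stress evaluation value described above can be illustrated with a small, dependency-free EM fit. This is only a sketch under stated assumptions: the source does not give the number of mixture components, the covariance structure, or the initialization, so two components, diagonal covariances, a variance floor, and a deterministic start are assumed, and all names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (pitch, power) of one voiced frame


def _log_gauss(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Log density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - mu) ** 2 / v)
        for xi, mu, v in zip(x, mean, var)
    )


def _logsumexp(values: Sequence[float]) -> float:
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def stress_value(samples: Sequence[Sample], n_components: int = 2,
                 n_iter: int = 50, var_floor: float = 1e-3) -> float:
    """Fit a Gaussian mixture by EM and return minus the average log-likelihood."""
    n, dim = len(samples), len(samples[0])
    ordered = sorted(samples)
    # Deterministic start (assumption): means spread over the sorted data.
    means = [list(ordered[(2 * k + 1) * n // (2 * n_components)])
             for k in range(n_components)]
    centre = [sum(s[d] for s in samples) / n for d in range(dim)]
    spread = [max(sum((s[d] - centre[d]) ** 2 for s in samples) / n, var_floor)
              for d in range(dim)]
    variances = [list(spread) for _ in range(n_components)]
    weights = [1.0 / n_components] * n_components

    def log_terms(x: Sequence[float]) -> List[float]:
        return [math.log(weights[k]) + _log_gauss(x, means[k], variances[k])
                for k in range(n_components)]

    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in samples:
            terms = log_terms(x)
            norm = _logsumexp(terms)
            resp.append([math.exp(t - norm) for t in terms])
        # M-step: re-estimate weights, means, and floored variances.
        for k in range(n_components):
            nk = sum(r[k] for r in resp) + 1e-12
            weights[k] = nk / n
            means[k] = [sum(r[k] * x[d] for r, x in zip(resp, samples)) / nk
                        for d in range(dim)]
            variances[k] = [
                max(sum(r[k] * (x[d] - means[k][d]) ** 2
                        for r, x in zip(resp, samples)) / nk, var_floor)
                for d in range(dim)
            ]

    avg_log_likelihood = sum(_logsumexp(log_terms(x)) for x in samples) / n
    return -avg_log_likelihood
```

Widely scattered pitch-power samples yield a lower average log-likelihood and therefore a larger stress value than tightly clustered ones, which matches the behavior described for the stress evaluation value calculation unit 135.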

The voice recognition unit 136 is a processing unit that detects whether the voice information contains a predetermined keyword by performing, for example, word-spotting voice recognition. Each time the voice recognition unit 136 detects a predetermined keyword in the voice information, it adds 1 to the number of detections corresponding to that keyword. The voice recognition unit 136 stores information associating each predetermined keyword with its number of detections in the recognition result storage unit 137. The predetermined keywords are keywords that customers often utter when they are dissatisfied or angry.

The voice recognition unit 136 also starts voice section detection processing for performing voice recognition and, when it detects a voice section, outputs information on the start time and end time of the voice section to the conversation time calculation unit 138.

The recognition result storage unit 137 holds information on the number of detections of each keyword (predetermined keyword) detected by the voice recognition unit 136 (detection number information). FIG. 7 is a diagram showing an example of the data structure of the detection number information according to the first embodiment. As shown in FIG. 7, the detection number information 137a associates each keyword with its number of detections.
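
As a minimal sketch of the keyword-to-count association described above, the following Python keeps one counter per predetermined keyword. The keyword list and the function name are hypothetical; the source does not enumerate the actual keywords or describe the word-spotting recognizer, so the recognizer output is simply assumed to be a list of spotted words.

```python
from collections import Counter

# Hypothetical keyword list; the source does not enumerate the actual keywords.
KEYWORDS = ("unacceptable", "manager", "cancel")


def update_detection_counts(counts, spotted_words, keywords=KEYWORDS):
    """Add 1 to the count of each predetermined keyword found in spotted_words."""
    for word in spotted_words:
        if word in keywords:
            counts[word] += 1
    return counts


# Two batches of (assumed) word-spotting results from the same conversation.
counts = Counter()
update_detection_counts(counts, ["hello", "cancel", "manager"])
update_detection_counts(counts, ["cancel"])
```

Words outside the keyword list are ignored, so the counter only ever holds the keyword and count pairs that the detection number information is described as containing.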

The recognition result storage unit 137 performs the following processing when the voice processing device 100 performs the "process for learning the model". The recognition result storage unit 137 outputs the detection number information 137a covering the period from the start time of the conversation to the end time of the conversation to the model learning unit 130c.

The recognition result storage unit 137 performs the following processing when the voice processing device 100 performs the "processing for determining the conversation status". The recognition result storage unit 137 outputs the detection number information 137a covering the period from the start time of the conversation to the time at which the output control signal is received to the output value calculation unit 130e.

The conversation time calculation unit 138 is a processing unit that calculates the elapsed time of the conversation from its start time. For example, the conversation time calculation unit 138 acquires time information from a timer (not shown) and measures the elapsed time from the start time of the conversation. The conversation time calculation unit 138 may instead estimate the elapsed time based on the cumulative number of samples contained in the frames. The conversation time calculation unit 138 outputs the start time and the elapsed time from the start time to the conversation time management unit 130d.
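
The sample-count alternative mentioned above amounts to dividing the cumulative sample count by the sampling rate. The sketch below assumes non-overlapping frames and an 8 kHz sampling rate; neither value is stated in this section, and the function name is hypothetical.

```python
def elapsed_seconds(frame_sample_counts, sampling_rate_hz):
    """Estimate elapsed time from the cumulative number of samples in the frames.

    Assumes the frames do not overlap; overlapping frames would be counted twice.
    """
    return sum(frame_sample_counts) / sampling_rate_hz


# 100 frames of 160 samples each at an assumed 8 kHz sampling rate.
elapsed = elapsed_seconds([160] * 100, 8000)
```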

For example, the conversation time calculation unit 138 identifies, as the start time of the conversation, the first start time of a detected voice section that it receives from the voice recognition unit 136. If the conversation time calculation unit 138 receives no new information indicating that a word has been detected even after a predetermined time has elapsed from the end time of the last voice section it received, it determines that the conversation has ended. In that case, the conversation time calculation unit 138 identifies the end time of the last received voice section as the end time of the conversation. The conversation time calculation unit 138 outputs information on the end time of the conversation to the conversation time management unit 130d.
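
The start and end rules above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: voice sections are assumed to arrive as `(start, end)` tuples in chronological order, and the function name, the `now` argument, and the `timeout` value are hypothetical.

```python
def conversation_bounds(voice_sections, now, timeout):
    """Return (start_time, end_time) of the conversation.

    voice_sections: chronologically ordered (start, end) tuples received so far.
    end_time is None while the conversation is still considered ongoing.
    """
    if not voice_sections:
        return None, None
    start_time = voice_sections[0][0]   # first received section start
    last_end = voice_sections[-1][1]    # end of the last received section
    # The conversation is treated as ended once `timeout` has passed
    # without any new voice section.
    end_time = last_end if now - last_end >= timeout else None
    return start_time, end_time


sections = [(1.0, 3.0), (4.0, 9.5)]
ongoing = conversation_bounds(sections, now=12.0, timeout=5.0)
finished = conversation_bounds(sections, now=15.0, timeout=5.0)
```

Note that the reported end time is the end of the last voice section, not the moment at which the timeout expired.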

The conversation time calculation unit 138 performs the following processing when the voice processing device 100 performs the "process for learning the model". The conversation time calculation unit 138 outputs information on the elapsed time from the start time of the conversation to the end time of the conversation to the model learning unit 130c.

The conversation time calculation unit 138 performs the following processing when the voice processing device 100 performs the "processing for determining the conversation status". The conversation time calculation unit 138 outputs information on the elapsed time from the start time of the conversation to the time at which the output control signal is received to the output value calculation unit 130e.

Returning to the description of FIG. 2. The model learning unit 130c is a processing unit that generates (learns) the model information 120b using feature amounts calculated from voice information for learning. When generating the model information 120b, the model learning unit 130c is assumed to hold, in advance, correct answer data corresponding to the voice information for learning. For example, if the voice information for learning contains an "abnormal conversation situation", the value of "Ot (probability of an abnormal conversation situation)" in the correct answer data is larger than the value of "On (probability of a normal conversation situation)". Conversely, if the voice information for learning is voice information of a "normal conversation situation", the value of "Ot" in the correct answer data is smaller than the value of "On".

The model learning unit 130c inputs the feature amounts calculated from the voice information for learning into the input layer 20a of the model information 120b and adjusts the bias values and edge weights of the hidden layer 20b and the output layer 20c so as to reduce the difference between the values output from the output layer 20c and the correct answer data. The model learning unit 130c learns the model information 120b by repeating this processing using each piece of voice information for learning and the correct answer data corresponding to it. For example, the model learning unit 130c may learn the model information 120b using an algorithm such as the back propagation method.

The conversation time management unit 130d acquires the start time of the conversation and the elapsed time from the start time from the conversation time calculation unit 138, and determines whether a time T specified in advance has elapsed. Each time the time T elapses, the conversation time management unit 130d outputs the "output control signal" to the stress evaluation value calculation unit 135, the recognition result storage unit 137, the conversation time calculation unit 138, and the output value calculation unit 130e.
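
A minimal sketch of this periodic signalling is shown below. The class and method names are hypothetical, and the elapsed time is assumed to be polled repeatedly by the caller; the sketch only decides whether another multiple of T has passed since the last signal.

```python
class ConversationTimeManager:
    """Reports when an output control signal is due, once per elapsed interval T."""

    def __init__(self, interval_t):
        self.interval_t = interval_t
        self.signals_sent = 0

    def should_emit(self, elapsed):
        """Return True if another multiple of T has passed since the last signal."""
        due = int(elapsed // self.interval_t)
        if due > self.signals_sent:
            self.signals_sent = due
            return True
        return False


manager = ConversationTimeManager(interval_t=10.0)
polled = (4.0, 9.9, 10.0, 15.0, 20.5, 31.0)
emitted = [t for t in polled if manager.should_emit(t)]
```

In this sketch, a caller that polls too rarely and skips several intervals receives a single signal rather than one per missed interval; that is a simplification chosen here, not something the source specifies.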

When the conversation time management unit 130d receives information on the end time of the conversation from the conversation time calculation unit 138, it outputs that information to the determination unit 130f.

The output value calculation unit 130e is a processing unit that calculates an output value based on the feature amounts acquired from the feature amount calculation unit 130b and the model information 120b. The output value calculation unit 130e stores the calculated output value in the output value storage buffer 120c.

For example, at the timing when the output value calculation unit 130e acquires the output control signal from the conversation time management unit 130d, it acquires the feature amounts from the feature amount calculation unit 130b and inputs them into the input layer 20a of the model information 120b. The output value calculation unit 130e then acquires the values of the probability "Ot" and the probability "On" output from the output layer 20c and calculates the output value V based on equations (1) to (3).

Each time the output value calculation unit 130e acquires the output control signal from the conversation time management unit 130d, it repeats the above processing, thereby sequentially calculating the output value V corresponding to the feature amounts at each elapsed time, and stores the calculated output values V in the output value storage buffer 120c. When storing an output value V, the output value calculation unit 130e associates it with the elapsed time.

The determination unit 130f is a processing unit that determines whether the conversation is in an abnormal conversation situation or a normal conversation situation based on the locus of the output values stored in the output value storage buffer 120c. The determination unit 130f may output the determination result to a display device (not shown) for display, or may notify an external device of the result via the communication unit 110.

The determination process performed by the determination unit 130f has several variations. Variations 1 to 4 of the determination process are described below. Which variation is used for the determination process is set in advance by the user.

FIG. 8 is a diagram for explaining Variation 1 of the determination process. In FIG. 8, the vertical axis corresponds to the output value and the horizontal axis corresponds to the conversation time. The determination unit 130f sets a threshold value 50, which divides the range that the output value can take into a region 50a and a region 50b. When the output value exceeds the threshold value 50, the conversation is likely to be in an abnormal conversation situation. The threshold value 50 is set in advance.

The determination unit 130f compares the locus of the output value with the threshold value 50 and determines that the conversation is in an abnormal conversation situation at the point when the locus exceeds the threshold value 50 and enters the region 50a.

When the determination unit 130f compares the locus 30a of the output value with the threshold value 50, the locus 30a never exceeds the threshold value 50 before the conversation ends. The determination unit 130f therefore determines that the conversation corresponding to the locus 30a is a "normal conversation situation".

When the determination unit 130f compares the locus 30b of the output value with the threshold value 50, the locus 30b exceeds the threshold value 50, temporarily enters the region 50a, and then returns to the region 50b. Although the locus 30b has returned to the region 50b by the end of the conversation, it has exceeded the threshold value 50 once, so the determination unit 130f determines that the conversation corresponding to the locus 30b is an "abnormal conversation situation".

When the determination unit 130f compares the locus 30c of the output value with the threshold value 50, the locus 30c exceeds the threshold value 50 and enters the region 50a. The determination unit 130f determines that the conversation corresponding to the locus 30c is an "abnormal conversation situation".
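
Variation 1 can be sketched as a single pass over the locus. The function name, the representation of the locus as `(elapsed_time, output_value)` tuples, and the numeric values (including a threshold of 0.0) are assumptions for illustration; the source gives no concrete values.

```python
def judge_variation1(locus, threshold_50):
    """Return "abnormal" as soon as any output value exceeds threshold 50."""
    for _elapsed, value in locus:
        if value > threshold_50:
            return "abnormal"
    return "normal"


# Hypothetical loci shaped like 30a (never exceeds) and 30b (exceeds, then returns).
locus_30a = [(10, -2.0), (20, -1.0), (30, -1.5)]
locus_30b = [(10, -1.0), (20, 0.5), (30, -0.8)]
result_30a = judge_variation1(locus_30a, threshold_50=0.0)
result_30b = judge_variation1(locus_30b, threshold_50=0.0)
```

Because the function returns at the first crossing, a later return to the region 50b does not change the result, which matches the treatment of the locus 30b above.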

FIG. 9 is a diagram for explaining Variation 2 of the determination process. In FIG. 9, the vertical axis corresponds to the output value and the horizontal axis corresponds to the conversation time. The determination unit 130f sets threshold values 50 and 51, which define regions 50b, 51a, and 51b. When the output value exceeds the threshold value 50, the conversation is likely to be in an abnormal conversation situation. When the output value exceeds the threshold value 51, the conversation is extremely likely to be in an abnormal conversation situation (it is certainly an abnormal conversation situation). The threshold values 50 and 51 are set in advance.

The determination unit 130f compares the locus of the output value with the threshold values 50 and 51 and determines that the conversation is in an abnormal conversation situation at the point when the locus exceeds the threshold value 51 and enters the region 51b. If the entire locus of the output value is included in the region 51a, the determination unit 130f also determines that the conversation is in an abnormal conversation situation. If a part of the locus of the output value is included in the region 50b, the determination unit 130f determines that the conversation is in a normal conversation situation.

When the determination unit 130f compares the locus 31a of the output value with the threshold values 50 and 51, a part of the locus 31a is included in the region 50b. The determination unit 130f therefore determines that the conversation corresponding to the locus 31a is a "normal conversation situation".

When the determination unit 130f compares the locus 31b of the output value with the threshold values 50 and 51, the locus 31b exceeds the threshold value 51 and enters the region 51b. Although the locus 31b has returned to the region 50b by the end of the conversation, it has exceeded the threshold value 50 once, so the determination unit 130f determines that the conversation corresponding to the locus 31b is an "abnormal conversation situation".

When the determination unit 130f compares the locus 31c of the output value with the threshold values 50 and 51, the entire locus 31c is included in the region 51a. The determination unit 130f therefore determines that the conversation corresponding to the locus 31c is an "abnormal conversation situation".
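
The three rules of Variation 2 can be sketched as follows, assuming that the region 50b lies at or below the threshold 50, the region 51a between the thresholds 50 and 51, and the region 51b above the threshold 51. That layout, the handling of values exactly on a threshold, the function name, and all numeric values are assumptions made for this illustration.

```python
def judge_variation2(locus, threshold_50, threshold_51):
    """Judge a non-empty locus of (elapsed_time, output_value) tuples."""
    if not locus:
        raise ValueError("locus must contain at least one output value")
    values = [value for _elapsed, value in locus]
    # Rule 1: entering region 51b (above threshold 51) is decisive.
    if any(v > threshold_51 for v in values):
        return "abnormal"
    # Rule 2: any part in region 50b (at or below threshold 50) means normal.
    if any(v <= threshold_50 for v in values):
        return "normal"
    # Rule 3: the whole locus stayed in region 51a.
    return "abnormal"


# Hypothetical loci shaped like 31a, 31b, and 31c.
r31a = judge_variation2([(10, -1.0), (20, 0.5), (30, 0.8)], 0.0, 2.0)
r31b = judge_variation2([(10, 0.5), (20, 2.5), (30, -0.5)], 0.0, 2.0)
r31c = judge_variation2([(10, 0.5), (20, 1.0), (30, 1.5)], 0.0, 2.0)
```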

FIG. 10 is a diagram for explaining Variation 3 of the determination process. In FIG. 10, the vertical axis corresponds to the output value and the horizontal axis corresponds to the conversation time. The determination unit 130f sets threshold values 50 and 52, which define regions 50a, 52a, and 52b. When the output value exceeds the threshold value 50, the conversation is likely to be in an abnormal conversation situation. When the output value is equal to or less than the threshold value 52, the conversation is extremely likely to be in a normal conversation situation (it is certainly a normal conversation situation). The threshold values 50 and 52 are set in advance.

The determination unit 130f compares the locus of the output value with the threshold values 50 and 52 and determines that the conversation is in a normal conversation situation at the point when the locus falls below the threshold value 52 and enters the region 52a. If the locus is never included in the region 52a and exceeds the threshold value 50, the determination unit 130f determines that the conversation is in an abnormal conversation situation.

When the determination unit 130f compares the locus 32a of the output value with the threshold values 50 and 52, the locus 32a never falls below the threshold value 52 and a part of it is included in the region 50a. The determination unit 130f therefore determines that the conversation corresponding to the locus 32a is an "abnormal conversation situation".

When the determination unit 130f compares the locus 32b of the output value with the threshold values 50 and 52, a part of the locus 32b is included in the region 50a, but there is a period during which the locus is below the threshold value 52. The determination unit 130f therefore determines that the conversation corresponding to the locus 32b is a "normal conversation situation".
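
Variation 3 can be sketched in the same style. The function name and numeric values are hypothetical. The source does not say what to do when the locus neither reaches the region 52a nor exceeds the threshold 50, so this sketch returns `None` in that case rather than guessing.

```python
def judge_variation3(locus, threshold_50, threshold_52):
    """Judge a locus of (elapsed_time, output_value) tuples; threshold_52 < threshold_50."""
    values = [value for _elapsed, value in locus]
    # Entering region 52a (at or below threshold 52) is decisive for "normal".
    if any(v <= threshold_52 for v in values):
        return "normal"
    # Never in region 52a and threshold 50 exceeded at some point.
    if any(v > threshold_50 for v in values):
        return "abnormal"
    return None  # Case not covered by the description (assumption: undetermined).


# Hypothetical loci shaped like 32a and 32b.
r32a = judge_variation3([(10, -1.0), (20, 0.5), (30, -0.5)], 0.0, -2.0)
r32b = judge_variation3([(10, 0.5), (20, -1.0), (30, -2.5)], 0.0, -2.0)
r_none = judge_variation3([(10, -1.0), (20, -0.5)], 0.0, -2.0)
```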

FIG. 11 is a diagram for explaining Variation 4 of the determination process. In FIG. 11, the vertical axis corresponds to the output value and the horizontal axis corresponds to the conversation time. The determination unit 130f sets threshold values 50, 51, and 52, which define regions 51a, 51b, 52a, and 52b. When the output value exceeds the threshold value 51, the conversation is extremely likely to be in an abnormal conversation situation (it is certainly an abnormal conversation situation). When the output value is equal to or less than the threshold value 52, the conversation is extremely likely to be in a normal conversation situation (it is certainly a normal conversation situation). The threshold values 50, 51, and 52 are set in advance.

The determination unit 130f compares the locus of the output value with the threshold values 50, 51, and 52 and, when parts of the locus are included in the regions 51a and 52b, gives priority to the part closer to the end time of the conversation. For example, if the locus of the output value first exceeds the threshold value 51 and later becomes equal to or less than the threshold value 52, the determination unit 130f gives priority to the fact that the locus became equal to or less than the threshold value 52 and determines that the conversation is a "normal conversation situation". If the locus of the output value first becomes equal to or less than the threshold value 52 and later exceeds the threshold value 51, the determination unit 130f gives priority to the fact that the locus rose to the threshold value 51 or above and determines that the conversation is an "abnormal conversation situation".

When the determination unit 130f compares the locus 33a of the output value with the threshold values 50, 51, and 52, the locus 33a first exceeds the threshold value 51 and later becomes equal to or less than the threshold value 52. The determination unit 130f gives priority to the event closer to the end time of the conversation, namely that the locus 33a became equal to or less than the threshold value 52, and determines that the conversation corresponding to the locus 33a is a "normal conversation situation".

When the determination unit 130f compares the locus 33b of the output value with the threshold values 50, 51, and 52, the locus 33b first becomes equal to or less than the threshold value 52 and later exceeds the threshold value 51. The determination unit 130f gives priority to the event closer to the end time of the conversation, namely that the locus 33b exceeded the threshold value 51, and determines that the conversation corresponding to the locus 33b is an "abnormal conversation situation".
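
The "closest to the end wins" rule of Variation 4 can be sketched by remembering only the most recent decisive event. The function name and numeric values are hypothetical, and returning `None` when the locus never crosses either decisive threshold is an assumption, since the source does not describe that case.

```python
def judge_variation4(locus, threshold_51, threshold_52):
    """Judge by the decisive event nearest the end of the conversation.

    locus: chronologically ordered (elapsed_time, output_value) tuples.
    """
    result = None  # Assumption: undetermined when no decisive event occurs.
    for _elapsed, value in locus:
        if value > threshold_51:
            result = "abnormal"   # later events overwrite earlier ones
        elif value <= threshold_52:
            result = "normal"
    return result


# Hypothetical loci shaped like 33a (high, then low) and 33b (low, then high).
r33a = judge_variation4([(10, 0.5), (20, 2.5), (30, 0.0), (40, -2.5)], 2.0, -2.0)
r33b = judge_variation4([(10, -2.5), (20, 0.0), (30, 2.5), (40, 1.0)], 2.0, -2.0)
r_mid = judge_variation4([(10, 0.0), (20, 1.0)], 2.0, -2.0)
```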

Next, an example of the processing procedure of the voice processing device 100 according to the first embodiment will be described. FIG. 12 is a flowchart showing the processing procedure of the voice processing device according to the first embodiment. As shown in FIG. 12, the feature amount calculation unit 130b of the voice processing device 100 executes frame processing and extracts a frame from the voice information (step S101). The feature amount calculation unit 130b extracts the pitch of the frame (step S102) and calculates its power (step S103).

The feature amount calculation unit 130b accumulates the pitch and power values (step S104) and proceeds to step S107. Meanwhile, the feature amount calculation unit 130b executes voice recognition (step S105), updates the detection number information (step S106), and proceeds to step S107.

The conversation time management unit 130d of the voice processing device 100 determines whether it is time to calculate the output value (step S107). If it is not time to calculate the output value (step S107, No), the conversation time management unit 130d proceeds to step S101.

If it is time to calculate the output value (step S107, Yes), the voice processing device 100 calculates the stress evaluation value (step S108) and proceeds to step S109. The output value calculation unit 130e of the voice processing device 100 calculates the output value of the model and stores it in the output value storage buffer 120c (step S109). The determination unit 130f of the voice processing device 100 calculates the locus of the output value (step S110) and proceeds to step S111 in FIG. 13.

The description now turns to FIG. 13. The determination unit 130f determines whether the conversation is in an abnormal conversation situation based on the locus (step S111). If the determination result has been finalized (step S112, Yes), the determination unit 130f proceeds to step S115.

If the determination result has not been finalized (step S112, No), the determination unit 130f determines whether the conversation has ended (step S113). If the conversation has not ended (step S113, No), the determination unit 130f proceeds to step S101 in FIG. 12.

If the conversation has ended (step S113, Yes), the determination unit 130f determines whether the conversation is in an abnormal conversation situation based on the locus (step S114). The determination unit 130f outputs the determination result (step S115).
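
The control flow of steps S109 to S115 can be sketched as the loop below. It is a simplified illustration under several assumptions: the output values are supplied as an already computed sequence of `(elapsed_time, output_value)` tuples, the end of that sequence stands in for the end of the conversation, and `early_judge` and `final_judge` are hypothetical callbacks standing in for the selected determination variation.

```python
def run_determination(timed_outputs, early_judge, final_judge):
    """Return the determination result for one conversation.

    early_judge(locus) returns a result once it is finalized, otherwise None.
    final_judge(locus) is applied when the conversation ends without one.
    """
    locus = []
    for elapsed, value in timed_outputs:
        locus.append((elapsed, value))   # steps S109 and S110
        result = early_judge(locus)      # step S111
        if result is not None:           # step S112, Yes
            return result                # step S115
    return final_judge(locus)            # step S113 Yes, then S114 and S115


def crossed_upper(locus):
    """Hypothetical early rule: finalize as abnormal once a value exceeds 2.0."""
    return "abnormal" if locus[-1][1] > 2.0 else None


early_result = run_determination(
    [(10, 0.5), (20, 2.5), (30, 0.1)], crossed_upper, lambda locus: "normal")
late_result = run_determination(
    [(10, 0.5), (20, 1.0)], crossed_upper, lambda locus: "normal")
```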

Next, the effects of the voice processing device 100 according to the first embodiment will be described. The voice processing device 100 sets set times at predetermined time intervals from the start time of the conversation contained in the voice information, and calculates a plurality of feature amounts from the voice information between the start time and each set time. The voice processing device 100 inputs each set of feature amounts into the model information 120b and determines whether the conversation is in an abnormal conversation situation based on the locus of the output values obtained from the model information 120b. This makes it possible to determine whether the conversation situation is normal or abnormal.

The voice processing device 100 divides the range that the locus of the output value can take into an abnormal region, which the locus occupies when the conversation situation is abnormal, and a normal region, which it occupies when the conversation situation is normal, and determines whether the conversation is in an abnormal conversation situation based on the locus of the output value, the abnormal region, and the normal region. As a result, even when only a part of the conversation contains an abnormal situation, the device can accurately determine whether the conversation situation is abnormal.

FIG. 14 is a diagram for explaining the effects of the voice processing device according to the first embodiment. FIG. 14 shows graphs 60a, 60b, and 60c. In each of the graphs 60a to 60c, the vertical axis corresponds to the output value and the horizontal axis corresponds to the conversation time. The threshold values 50 and 51 are as described for FIG. 9. The threshold value 55 is the threshold that Reference Technique 1 uses to determine whether the conversation situation is abnormal or normal.

The loci shown in the graph 60a are experimental results showing typical loci for abnormal conversation situations, and each locus corresponds to one conversation. Of the loci shown in the graph 60a, those included in the region 61a can be determined by the determination unit 130f to be abnormal conversation situations. Even for loci not included in the region 61a, most are not included in the region 61b, so the conversations can still be determined to be abnormal conversation situations. For example, the determination process based on Variation 2 described with reference to FIG. 9 allows an accurate determination.

The loci shown in the graph 60b are experimental results showing typical loci for normal conversation situations. Of the loci shown in the graph 60b, most are included in the region 62b, and no locus is included in the region 62a. The conversations can therefore be determined to be normal conversation situations. For example, the determination process based on Variation 2 described with reference to FIG. 9 allows an accurate determination.

The loci shown in the graph 60c are experimental results of loci for abnormal conversation situations. Because every locus is below the threshold value 55 at the end of the conversation, a determination based on Reference Technique 1 cannot identify these conversations as abnormal conversation situations. In contrast, with the determination unit 130f according to the first embodiment, the loci from the start time to the end time of the conversation are not included in the region 63b, so the conversations can be determined to be abnormal conversation situations. For example, the determination process based on Variation 2 described with reference to FIG. 9 allows an accurate determination.

ところで、会話の開始直後は、特徴量の値が安定しないため、モデル情報１２０ｂに特徴量を出力した際に得られる出力値が安定しない場合がある。このため、会話時間管理部１３０ｄは、会話の開始時刻を受け付けたから、所定時間を経過するまで、「出力制御信号」を、ストレス評価値算出部１３５、認識結果蓄積部１３７、会話時間算出部１３８、出力値算出部１３０ｅに出力する処理を抑止してもよい。これによって、判定部１３０ｆは、安定した出力値を用いて、会話状況を判定することができる。 Immediately after a conversation starts, the feature values are not yet stable, so the output value obtained when the features are passed to the model information 120b may also be unstable. For this reason, the conversation time management unit 130d may refrain from outputting the "output control signal" to the stress evaluation value calculation unit 135, the recognition result storage unit 137, the conversation time calculation unit 138, and the output value calculation unit 130e until a predetermined time has elapsed since it received the conversation start time. This allows the determination unit 130f to determine the conversation situation using stable output values.

図１５は、会話時間管理部のその他の処理を説明するための図である。図１５において、縦軸は出力値に対応するものであり、横軸は会話時間に対応するものである。会話時間管理部１３０ｄは、開始時刻０から、所定時間ｔａだけ経過した時点から所定の時間間隔で、「出力制御信号」を、ストレス評価値算出部１３５、認識結果蓄積部１３７、会話時間算出部１３８、出力値算出部１３０ｅに出力する。これにより、判定部１３０ｆは、時刻ｔａ以降の安定した出力値を基にして、会話状況を判定できる。図１５に示す閾値５０，５１、軌跡３１ａ～３１ｃに関する説明は、図９の説明と同様である。 FIG. 15 is a diagram for explaining other processing of the conversation time management unit. In FIG. 15, the vertical axis corresponds to the output value, and the horizontal axis corresponds to the conversation time. Starting from the point at which a predetermined time ta has elapsed since the start time 0, the conversation time management unit 130d outputs the "output control signal" at predetermined time intervals to the stress evaluation value calculation unit 135, the recognition result storage unit 137, the conversation time calculation unit 138, and the output value calculation unit 130e. As a result, the determination unit 130f can determine the conversation situation based on the stable output values obtained after time ta. The description of the threshold values 50 and 51 and the loci 31a to 31c shown in FIG. 15 is the same as that given for FIG. 9.
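As a hedged illustration of the delayed signalling described above, the following Python sketch lists the times at which an output control signal would be issued. The function name, its parameters, and the choice to issue the first signal exactly at time ta are assumptions made for illustration; the description only states that signals start once ta has elapsed and then repeat at a predetermined interval.

```python
def output_control_times(conversation_length, warmup, interval):
    """Return the elapsed times (in seconds) at which a signal is issued.

    warmup corresponds to the predetermined time ta; no signal is issued
    before it. Assumption: the first signal is issued exactly at ta.
    """
    times = []
    t = warmup
    while t <= conversation_length:
        times.append(t)
        t += interval
    return times


# Example: a 120-second conversation, ta = 60 s, one signal every 30 s.
print(output_control_times(120, 60, 30))
```

Because no signal is issued before ta, no output value is computed from the unstable opening part of the conversation.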

本実施例２に係る音声処理装置の説明を行う前に、オペレータと顧客との会話が異常な会話状況であるか否かを判定する参考技術２について説明する。この参考技術２は、従来技術ではない。会話の開始時刻から所定時間間隔で音声情報を区切り、区切った各音声情報から得られる特徴量をモデル情報に入力して、出力値を算出する。 Before explaining the voice processing device according to the second embodiment, the reference technique 2 for determining whether or not the conversation between the operator and the customer is an abnormal conversation situation will be described. This reference technique 2 is not a conventional technique. The voice information is divided at predetermined time intervals from the start time of the conversation, and the feature amount obtained from each divided voice information is input to the model information to calculate the output value.

図１６は、参考技術２の処理を説明するための図である。図１６に示すように、参考技術２は、音声情報を複数の音声情報１２ａ～１２ｋに区切る。参考技術は、各音声情報１２ａ～１２ｋの区間内で算出した各特徴量をそれぞれモデルに入力することで、出力値１１ａ～１１ｋを算出する。特徴量を入力するモデルは、実施例１で説明したモデル情報１２０ｂに対応する。このように、音声情報を所定時間毎に区切って、出力値１１ａ～１１ｋを算出すると、図１６に示すように、各出力値が安定しないため、会話状況を精度よく判定できない場合がある。 FIG. 16 is a diagram for explaining the process of Reference Technique 2. As shown in FIG. 16, the reference technique 2 divides the voice information into a plurality of voice information 12a to 12k. The reference technique calculates the output values 11a to 11k by inputting each feature amount calculated in the section of each voice information 12a to 12k into the model. The model for inputting the feature amount corresponds to the model information 120b described in the first embodiment. In this way, when the voice information is divided into predetermined time intervals and the output values 11a to 11k are calculated, as shown in FIG. 16, since each output value is not stable, it may not be possible to accurately determine the conversation situation.

次に、本実施例２に係る音声処理装置の処理の一例について説明する。図１７は、本実施例２に係る音声処理装置の処理を説明するための図である。図１７の横軸は会話時間に対応する軸であり、縦軸は出力値に対応する軸である。たとえば、音声処理装置は、音声情報を３０秒毎に分割し、分割した各音声情報の特徴量をモデル情報に入力して、各出力値１１ａ～１１ｎを得る。分割した各音声情報は、分割音声情報の一例である。また、音声処理装置は、開始時刻から現在時刻までの音声情報の特徴量をモデル情報に入力して、出力値（図示略）を得る。音声処理装置は、リアルタイムに、会話状況を判定する。現在の時刻を「Ｔｃ」とする。 Next, an example of processing of the voice processing device according to the second embodiment will be described. FIG. 17 is a diagram for explaining the processing of the voice processing device according to the second embodiment. The horizontal axis of FIG. 17 is the axis corresponding to the conversation time, and the vertical axis is the axis corresponding to the output value. For example, the voice processing device divides the voice information every 30 seconds, inputs the feature amount of each divided voice information into the model information, and obtains each output value 11a to 11n. Each divided voice information is an example of the divided voice information. Further, the voice processing device inputs the feature amount of the voice information from the start time to the current time into the model information, and obtains an output value (not shown). The voice processing device determines the conversation status in real time. Let the current time be "Tc".

音声処理装置は、開始時刻から現在時刻Ｔｃまでの各出力値の平均値と、現在時刻Ｔｃから所定時間前（たとえば、５分前）までに含まれる各出力値の最小値と、開始時刻から現在時刻Ｔｃまでの出力値とを基にして、会話状況を判定する。 The voice processing device determines the conversation situation based on three quantities: the average of the output values from the start time to the current time Tc, the minimum of the output values that fall within a predetermined period before the current time Tc (for example, the preceding 5 minutes), and the output value computed over the whole interval from the start time to the current time Tc.

図１７に示す例において、開始時刻から現在時刻Ｔｃまでの各出力値の平均値は、時間帯Ｂ１に含まれる各出力値１１ａ～１１ｎの平均値である。現在時刻Ｔｃから所定時間前までに含まれる各出力値の最小値は、時間帯Ｂ２に含まれる出力値１１ｃ～１１ｎの最小値である。現在時刻Ｔｃの出力値は、時刻０～時刻Ｔｃまでの区間における音声情報の特徴量をモデルに入力することで得られる出力値である。 In the example shown in FIG. 17, the average value of each output value from the start time to the current time Tc is the average value of each output value 11a to 11n included in the time zone B1. The minimum value of each output value included from the current time Tc to a predetermined time before is the minimum value of the output values 11c to 11n included in the time zone B2. The output value of the current time Tc is an output value obtained by inputting the feature amount of the voice information in the section from the time 0 to the time Tc into the model.

本実施例２に係る音声処理装置は、「条件２および条件１を満たす場合」、または、「条件２および条件３を満たす場合」に、会話が異常な会話状況であると判定する。条件１～３に含まれるＴｈ１～Ｔｈ３は予め設定される閾値である。各閾値の大小関係は、Ｔｈ３＞Ｔｈ１＞Ｔｈ２である。 The voice processing device according to the second embodiment determines that the conversation is in an abnormal conversation situation when "condition 2 and condition 1 are satisfied" or "condition 2 and condition 3 are satisfied". Th1 to Th3 included in the conditions 1 to 3 are preset threshold values. The magnitude relationship of each threshold value is Th3> Th1> Th2.

条件１：開始時刻から現在時刻Ｔｃまでの各出力値の平均値＞Ｔｈ１ Condition 1: average of the output values from the start time to the current time Tc > Th1
条件２：現在時刻Ｔｃから所定時間前までに含まれる各出力値の最小値＞Ｔｈ２ Condition 2: minimum of the output values that fall within the predetermined period before the current time Tc > Th2
条件３：開始時刻から現在時刻Ｔｃの出力値＞Ｔｈ３ Condition 3: output value computed over the interval from the start time to the current time Tc > Th3

本実施例２に係る音声処理装置は、出力値が安定しない場合であっても、上記の条件１～３を用いて、会話状況が異常であるか否かを判定することで、会話状況を精度よく判定することができる。 Even when the individual output values are unstable, the voice processing device according to the second embodiment can determine the conversation situation accurately by using conditions 1 to 3 above to decide whether the conversation situation is abnormal.
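The combination of conditions 1 to 3 can be sketched in Python as follows. This is an illustrative sketch, not the implementation of the embodiment: the function name, the representation of each divided segment as an (end time, output value) pair, and the rule that a segment belongs to the recent window when its end time lies in (Tc - window, Tc] are all assumptions.

```python
from statistics import mean


def is_abnormal(segment_outputs, cumulative_output, now, window, th1, th2, th3):
    """Apply 'condition 2 and (condition 1 or condition 3)'.

    segment_outputs: list of (end_time, value), one per divided segment.
    cumulative_output: output value for the interval from start to now.
    window: length of the look-back period used by condition 2.
    """
    values = [v for t, v in segment_outputs if t <= now]
    recent = [v for t, v in segment_outputs if now - window < t <= now]
    if not values or not recent:
        return False  # assumption: no decision before any output exists
    cond1 = mean(values) > th1          # average since the start
    cond2 = min(recent) > th2           # minimum in the recent window
    cond3 = cumulative_output > th3     # output over the whole interval
    return cond2 and (cond1 or cond3)


# Thresholds chosen to respect Th3 > Th1 > Th2 (values are made up).
segments = [(30, 0.9), (60, 0.8), (90, 0.7)]
print(is_abnormal(segments, 0.5, now=90, window=60, th1=0.6, th2=0.4, th3=0.9))
```

Condition 2 acts as a gate: a single low segment output inside the recent window prevents an abnormal determination even when the average or the cumulative output is high.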

図１８は、本実施例２に係る音声処理装置の構成を示す機能ブロック図である。図１８に示すように、この音声処理装置２００は、通信部２１０、記憶部２２０、制御部２３０を有する。 FIG. 18 is a functional block diagram showing the configuration of the voice processing device according to the second embodiment. As shown in FIG. 18, the voice processing device 200 has a communication unit 210, a storage unit 220, and a control unit 230.

通信部２１０は、ネットワークを介して外部の装置とデータ通信を実行する処理部である。たとえば、通信部２１０は、顧客とオペレータとの会話を含む音声情報を収集するサーバ装置（図示略）から、音声情報を受信する。通信部２１０は、受信した音声情報を制御部２３０に出力する。通信部２１０は、通信装置の一例である。 The communication unit 210 is a processing unit that executes data communication with an external device via a network. For example, the communication unit 210 receives voice information from a server device (not shown) that collects voice information including a conversation between a customer and an operator. The communication unit 210 outputs the received voice information to the control unit 230. The communication unit 210 is an example of a communication device.

記憶部２２０は、音声バッファ２２０ａと、モデル情報２２０ｂと、出力値蓄積バッファ２２０ｃとを有する。記憶部２２０は、ＲＡＭ、ＲＯＭ、フラッシュメモリなどの半導体メモリ素子や、ＨＤＤなどの記憶装置に対応する。 The storage unit 220 has a voice buffer 220a, model information 220b, and an output value storage buffer 220c. The storage unit 220 corresponds to a semiconductor memory element such as RAM, ROM, and flash memory, and a storage device such as an HDD.

音声バッファ２２０ａは、顧客とオペレータとの会話の音声情報を記憶するバッファである。「モデルを学習する処理」を音声処理装置２００が行う場合には、音声バッファ２２０ａには、学習用の音声情報が蓄積される。これに対して、「会話状況を判定する処理」を音声処理装置２００が行う場合には、音声バッファ２２０ａには、判定対象となる音声情報が蓄積される。 The voice buffer 220a is a buffer for storing voice information of a conversation between a customer and an operator. When the voice processing device 200 performs the "process for learning the model", the voice information for learning is stored in the voice buffer 220a. On the other hand, when the voice processing device 200 performs the "processing for determining the conversation status", the voice information to be determined is accumulated in the voice buffer 220a.

モデル情報２２０ｂは、音声情報に含まれる会話が、異常な会話状況である度合いを示す出力値を出力するモデルの情報である。モデル情報２２０ｂに関するその他の説明は、実施例１で説明したモデル情報１２０ｂに関する説明と同様である。 The model information 220b is information on a model that outputs an output value indicating the degree to which the conversation included in the voice information is an abnormal conversation situation. The other description of the model information 220b is the same as the description of the model information 120b described in the first embodiment.

出力値蓄積バッファ２２０ｃは、モデル情報２２０ｂを基に出力される出力値を格納するバッファである。図１９は、本実施例２に係る出力値蓄積バッファのデータ構造の一例を示す図である。図１９に示すように出力値蓄積バッファ２２０ｃは、テーブル２２１ａ，２２１ｂを有する。 The output value storage buffer 220c is a buffer for storing the output value output based on the model information 220b. FIG. 19 is a diagram showing an example of the data structure of the output value storage buffer according to the second embodiment. As shown in FIG. 19, the output value storage buffer 220c has tables 221a and 221b.

テーブル２２１ａは、時間と、出力値とを対応付ける。テーブル２２１ａにおける時間は、特徴量を抽出した音声情報の時間を示すものである。出力値は、該当する時間の音声情報から抽出された特徴量をモデル情報２２０ｂに入力した際に得られる出力値Ｖを示す。 Table 221a associates the time with the output value. The time in the table 221a indicates the time of the voice information from which the feature amount is extracted. The output value indicates the output value V obtained when the feature amount extracted from the voice information at the corresponding time is input to the model information 220b.

テーブル２２１ｂは、時間と、出力値とを対応付ける。テーブル２２１ｂにおける時間は、特徴量を抽出した音声情報の時間（会話の開始時刻からの経過時間）を示すものである。出力値は、該当する時間の音声情報から算出された特徴量を、モデル情報２２０ｂに入力した際に得られる出力値Ｖを示す。 Table 221b associates the time with the output value. The time in the table 221b indicates the time of the voice information from which the feature amount is extracted (the elapsed time from the start time of the conversation). The output value indicates the output value V obtained when the feature amount calculated from the voice information at the corresponding time is input to the model information 220b.
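One possible in-memory layout for the output value storage buffer 220c is sketched below in Python. The class and field names are hypothetical; only the pairing of a time with an output value V in each of the two tables comes from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class OutputValueBuffer:
    # Table 221a: ((t_prev, t_now), V) for each divided interval.
    per_interval: list = field(default_factory=list)
    # Table 221b: (elapsed time from the conversation start, V).
    cumulative: list = field(default_factory=list)

    def add(self, t_prev, t_now, v_interval, v_cumulative):
        """Register both output values obtained at one output control signal."""
        self.per_interval.append(((t_prev, t_now), v_interval))
        self.cumulative.append((t_now, v_cumulative))


buf = OutputValueBuffer()
buf.add(0, 30, 0.2, 0.2)
buf.add(30, 60, 0.6, 0.4)
print(buf.per_interval[-1], buf.cumulative[-1])
```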

図１８の説明に戻る。制御部２３０は、取得部２３０ａと、特徴量算出部２３０ｂと、モデル学習部２３０ｃと、会話時間管理部２３０ｄと、出力値算出部２３０ｅと、判定部２３０ｆとを有する。制御部２３０は、ＣＰＵやＭＰＵなどによって実現できる。また、制御部２３０は、ＡＳＩＣやＦＰＧＡなどのハードワイヤードロジックによっても実現できる。 Returning to the description of FIG. 18. The control unit 230 includes an acquisition unit 230a, a feature amount calculation unit 230b, a model learning unit 230c, a conversation time management unit 230d, an output value calculation unit 230e, and a determination unit 230f. The control unit 230 can be realized by a CPU, an MPU, or the like. The control unit 230 can also be realized by hard-wired logic such as an ASIC or an FPGA.

取得部２３０ａは、音声情報を取得し、取得した音声情報を音声バッファ２２０ａに格納する処理部である。たとえば、「モデルを学習する処理」を音声処理装置２００が行う場合には、取得部２３０ａは、学習用の音声情報を取得し、学習用の音声情報を音声バッファ２２０ａに格納する。「会話状況を判定する処理」を音声処理装置２００が行う場合には、取得部２３０ａは、判定対象となる音声情報を取得し、音声情報を音声バッファ２２０ａに格納する。 The acquisition unit 230a is a processing unit that acquires voice information and stores the acquired voice information in the voice buffer 220a. For example, when the voice processing device 200 performs the "process for learning a model", the acquisition unit 230a acquires the voice information for learning and stores the voice information for learning in the voice buffer 220a. When the voice processing device 200 performs the "processing for determining the conversation status", the acquisition unit 230a acquires the voice information to be determined and stores the voice information in the voice buffer 220a.

特徴量算出部２３０ｂは、音声バッファ２２０ａに格納された音声情報を基にして、特徴量を算出する処理部である。たとえば、特徴量算出部２３０ｂが算出する特徴量は、ストレス評価値、キーワードの検出回数、会話の開始時刻からの経過時間である。ストレス評価値、検出回数、経過時間に関する説明は後述する。 The feature amount calculation unit 230b is a processing unit that calculates the feature amount based on the voice information stored in the voice buffer 220a. For example, the feature amount calculated by the feature amount calculation unit 230b is a stress evaluation value, the number of times a keyword is detected, and the elapsed time from the start time of conversation. The stress evaluation value, the number of detections, and the elapsed time will be described later.

「モデルを学習する処理」を音声処理装置２００が行う場合には、特徴量算出部２３０ｂは、特徴量をモデル学習部２３０ｃに出力する。「会話状況を判定する処理」を音声処理装置２００が行う場合には、特徴量算出部２３０ｂは、特徴量を、出力値算出部２３０ｅに出力する。 When the voice processing device 200 performs the "process for learning the model", the feature amount calculation unit 230b outputs the feature amount to the model learning unit 230c. When the voice processing device 200 performs the "processing for determining the conversation status", the feature amount calculation unit 230b outputs the feature amount to the output value calculation unit 230e.

図２０は、本実施例２に係る特徴量算出部の構成を示す機能ブロック図である。図２０に示すように、この特徴量算出部２３０ｂは、音声取得部２３１ａと、フレーム処理部２３１ｂと、ピッチ抽出部２３２と、パワー算出部２３３と、ピッチ・パワー蓄積部２３４と、ストレス評価値算出部２３５とを有する。また、特徴量算出部２３０ｂは、音声認識部２３６と、認識結果蓄積部２３７と、会話時間算出部２３８とを有する。 FIG. 20 is a functional block diagram showing the configuration of the feature amount calculation unit according to the second embodiment. As shown in FIG. 20, the feature amount calculation unit 230b includes a voice acquisition unit 231a, a frame processing unit 231b, a pitch extraction unit 232, a power calculation unit 233, a pitch power storage unit 234, and a stress evaluation value. It has a calculation unit 235. Further, the feature amount calculation unit 230b has a voice recognition unit 236, a recognition result storage unit 237, and a conversation time calculation unit 238.

音声取得部２３１ａは、音声バッファ２２０ａに格納された音声情報を取得し、読み込んだ音声情報を、フレーム処理部２３１ｂに出力する。以下の説明では、音声取得部２３１ａにより読み込まれたデジタル信号の音声情報を、単に、「音声情報」と表記する。 The voice acquisition unit 231a acquires the voice information stored in the voice buffer 220a and outputs the read voice information to the frame processing unit 231b. In the following description, the voice information of the digital signal read by the voice acquisition unit 231a is simply referred to as "voice information".

フレーム処理部２３１ｂは、音声取得部２３１ａから取得する音声情報の信号時系列を、予め定められたサンプル数毎に「フレーム」として取り出し、フレームの情報を、ピッチ抽出部２３２、パワー算出部２３３、音声認識部２３６に出力する。フレーム処理部２３１ｂの処理は、実施例１のフレーム処理部１３１ｂの処理に対応する。 The frame processing unit 231b takes the signal time series of the voice information acquired from the voice acquisition unit 231a, extracts it as "frames" of a predetermined number of samples each, and outputs the frame information to the pitch extraction unit 232, the power calculation unit 233, and the voice recognition unit 236. The processing of the frame processing unit 231b corresponds to the processing of the frame processing unit 131b of the first embodiment.
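The framing step can be illustrated with a short Python sketch. The function name is hypothetical, and two details are assumptions not stated in the description: frames do not overlap, and a trailing partial frame is discarded.

```python
def split_into_frames(samples, frame_length):
    """Cut a sample sequence into consecutive frames of frame_length samples."""
    last_start = len(samples) - frame_length
    return [samples[i:i + frame_length]
            for i in range(0, last_start + 1, frame_length)]


# Ten samples with four samples per frame give two complete frames.
print(split_into_frames(list(range(10)), 4))
```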

ピッチ抽出部２３２は、フレームのサンプル値を基にして、フレームの基本周波数（ピッチ）を抽出する処理部である。ピッチ抽出部２３２は、フレーム毎のピッチの情報を、ピッチ・パワー蓄積部２３４に蓄積する。ピッチ抽出部２３２の処理は、実施例１のピッチ抽出部１３２の処理に対応する。 The pitch extraction unit 232 is a processing unit that extracts the fundamental frequency (pitch) of the frame based on the sample value of the frame. The pitch extraction unit 232 stores pitch information for each frame in the pitch power storage unit 234. The processing of the pitch extraction unit 232 corresponds to the processing of the pitch extraction unit 132 of the first embodiment.

パワー算出部２３３は、フレームのサンプル値を基にして、フレームのパワーを算出する処理部である。パワー算出部２３３は、フレーム毎のパワーの情報を、ピッチ・パワー蓄積部２３４に蓄積する。パワー算出部２３３の処理は、実施例１のパワー算出部１３３の処理に対応する。 The power calculation unit 233 is a processing unit that calculates the power of the frame based on the sample value of the frame. The power calculation unit 233 stores power information for each frame in the pitch power storage unit 234. The processing of the power calculation unit 233 corresponds to the processing of the power calculation unit 133 of the first embodiment.
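This passage does not give the power formula, so the following Python sketch uses a common definition, the mean square of the frame's samples expressed in decibels, purely as an assumption. The function name and the small floor that avoids taking the logarithm of zero for silent frames are likewise illustrative.

```python
import math


def frame_power_db(frame, floor=1e-12):
    """Mean-square power of one frame in dB (assumed definition)."""
    mean_square = sum(s * s for s in frame) / len(frame)
    # The floor keeps log10 defined when every sample is zero.
    return 10.0 * math.log10(max(mean_square, floor))


print(frame_power_db([1.0, -1.0, 1.0, -1.0]))
```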

ピッチ・パワー蓄積部２３４は、ピッチ抽出部２３２により抽出されたピッチの情報およびパワー算出部２３３により算出されたパワーの情報を格納するバッファである。ピッチ・パワー蓄積部２３４のデータ構造は、図６に示したピッチ・パワー蓄積部１３４のデータ構造と同様である。 The pitch/power storage unit 234 is a buffer that stores the pitch information extracted by the pitch extraction unit 232 and the power information calculated by the power calculation unit 233. The data structure of the pitch/power storage unit 234 is the same as that of the pitch/power storage unit 134 shown in FIG. 6.

ストレス評価値算出部２３５は、ピッチ・パワー蓄積部２３４に格納されたピッチおよびパワーの情報を基にして、設定時刻毎にストレス評価値を算出する処理部である。たとえば、ストレス評価値算出部２３５は、ストレス評価値算出部１３５と同様に、ストレス評価値を算出する。 The stress evaluation value calculation unit 235 is a processing unit that calculates the stress evaluation value for each set time based on the pitch and power information stored in the pitch power storage unit 234. For example, the stress evaluation value calculation unit 235 calculates the stress evaluation value in the same manner as the stress evaluation value calculation unit 135.

ストレス評価値算出部２３５は、音声処理装置２００が「会話状況を判定する処理」を行う場合には、次の処理を行う。ストレス評価値算出部２３５は、会話の開始時刻から、出力制御信号を受信した時刻までに蓄積されたピッチ・パワーの組のサンプルを用いて、混合ガウス分布を最尤推定によりモデル化し、推定に用いたサンプルに対するモデルの平均対数尤度にマイナス１を掛けたものを、ストレス評価値として算出し、算出したストレス評価値を、第１ストレス値として、出力値算出部１３０ｅに出力する。「出力制御信号」は、会話時間管理部２３０ｄから出力される信号である。 When the voice processing device 200 performs the "processing for determining the conversation status", the stress evaluation value calculation unit 235 performs the following processing. The stress evaluation value calculation unit 235 models and estimates the mixed Gaussian distribution by maximum likelihood estimation using a sample of a set of pitch powers accumulated from the start time of the conversation to the time when the output control signal is received. The average log-likelihood of the model for the sample used is multiplied by -1 to calculate as a stress evaluation value, and the calculated stress evaluation value is output to the output value calculation unit 130e as the first stress value. The “output control signal” is a signal output from the conversation time management unit 230d.

また、ストレス評価値算出部２３５は、前回出力制御信号を受け付けた時刻から、今回出力制御信号を受け付けた時刻までに蓄積されたピッチ・パワーの組のサンプルを用いて、混合ガウス分布を最尤推定によりモデル化し、推定に用いたサンプルに対するモデルの平均対数尤度にマイナス１を掛けたものをストレス評価値として算出し、算出したストレス評価値を、第２ストレス値として、出力値算出部２３０ｅに出力する。 Further, the stress evaluation value calculation unit 235 uses a sample of a set of pitch powers accumulated from the time when the output control signal was received last time to the time when the output control signal is received this time to maximum likelihood the mixed Gaussian distribution. Modeled by estimation, the average log-likelihood of the model for the sample used for estimation multiplied by -1 is calculated as the stress evaluation value, and the calculated stress evaluation value is used as the second stress value in the output value calculation unit 230e. Output to.
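The stress evaluation value described above, the average log-likelihood of a maximum-likelihood Gaussian mixture multiplied by minus one, can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the number of mixture components, the diagonal covariance, the EM iteration count, the deterministic initialization, and all function names are illustrative choices that the description does not specify.

```python
import math


def _log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def _component_log_density(x, weight, mean, var):
    """log(weight * N(x; mean, diag(var)))."""
    total = math.log(weight)
    for xi, mi, vi in zip(x, mean, var):
        total -= 0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return total


def fit_gmm(samples, k=2, iterations=50, var_floor=1e-6):
    """Fit a k-component diagonal Gaussian mixture with EM."""
    n, dim = len(samples), len(samples[0])
    ordered = sorted(samples)
    # Deterministic start: means taken at evenly spaced ranks.
    means = [list(ordered[(2 * j + 1) * n // (2 * k)]) for j in range(k)]
    centre = [sum(x[i] for x in samples) / n for i in range(dim)]
    spread = [max(sum((x[i] - centre[i]) ** 2 for x in samples) / n, var_floor)
              for i in range(dim)]
    variances = [list(spread) for _ in range(k)]
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in samples:
            logs = [_component_log_density(x, weights[j], means[j], variances[j])
                    for j in range(k)]
            norm = _log_sum_exp(logs)
            resp.append([math.exp(v - norm) for v in logs])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            mass = sum(r[j] for r in resp)
            if mass < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = mass / n
            means[j] = [sum(r[j] * x[i] for r, x in zip(resp, samples)) / mass
                        for i in range(dim)]
            variances[j] = [
                max(sum(r[j] * (x[i] - means[j][i]) ** 2
                        for r, x in zip(resp, samples)) / mass, var_floor)
                for i in range(dim)]
    return weights, means, variances


def stress_evaluation_value(samples, k=2):
    """Minus the average log-likelihood of the samples under the fitted mixture."""
    weights, means, variances = fit_gmm(samples, k)
    total = 0.0
    for x in samples:
        total += _log_sum_exp(
            [_component_log_density(x, weights[j], means[j], variances[j])
             for j in range(k)])
    return -total / len(samples)
```

Under this definition, samples of (pitch, power) pairs that are spread more widely give a lower average log-likelihood and therefore a larger stress evaluation value. The same function would serve for both the first value (samples since the start of the conversation) and the second value (samples since the previous output control signal); only the sample set changes.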

音声認識部２３６は、たとえば、ワードスポッティング型の音声認識を行うことで、音声情報に所定のキーワードが含まれているか否かを検出する処理部である。音声認識部２３６は、音声情報から所定のキーワードを検出する度に、所定のキーワードに対応する検出回数に１を加算する処理を行う。音声認識部２３６は、所定のキーワードと、検出回数とを対応付けた情報を、認識結果蓄積部２３７に蓄積する。所定のキーワードは、顧客が不満を感じた場合や、怒っている際によく発言するキーワードである。 The voice recognition unit 236 is a processing unit that detects whether or not a predetermined keyword is included in the voice information by, for example, performing word spotting type voice recognition. The voice recognition unit 236 performs a process of adding 1 to the number of detections corresponding to the predetermined keyword each time the predetermined keyword is detected from the voice information. The voice recognition unit 236 stores information in which a predetermined keyword is associated with the number of detections in the recognition result storage unit 237. A predetermined keyword is a keyword that is often spoken when a customer is dissatisfied or angry.

たとえば、音声認識部２３６は、「第１検出回数」と、「第２検出回数」とを区別して、認識結果蓄積部２３７に蓄積する。第１検出回数は、会話の開始時刻から、出力制御信号を受信した時刻までの音声区間において検出した所定のキーワードの検出回数を示す。第２検出回数は、前回出力制御信号を受け付けた時刻から、今回出力制御信号を受け付けた時刻までの音声区間において検出した所定のキーワードの検出回数を示す。 For example, the voice recognition unit 236 distinguishes between the "first detection number" and the "second detection number" and stores them in the recognition result storage unit 237. The first detection number indicates the number of detections of a predetermined keyword detected in the voice section from the start time of the conversation to the time when the output control signal is received. The second number of detections indicates the number of detections of a predetermined keyword detected in the voice section from the time when the previous output control signal was received to the time when the output control signal was received this time.

また、音声認識部２３６は、音声認識を行うための音声区間検出処理を開始し、音声区間を検出した際の時間情報を、会話時間算出部２３８に出力する。 Further, the voice recognition unit 236 starts the voice section detection process for performing voice recognition, and outputs the time information when the voice section is detected to the conversation time calculation unit 238.

認識結果蓄積部２３７は、音声認識部２３６により検出された各キーワード（所定のキーワード）の検出回数の情報（第１検出回数、第２検出回数の情報）を保持する。図２１は、本実施例２に係る検出回数情報のデータ構造の一例を示す図である。図２１に示すように、検出回数情報２３７ａは、テーブル２３７ｂとテーブル２３７ｃとを有する。 The recognition result storage unit 237 holds information on the number of detections (information on the number of first detections and the number of second detections) of each keyword (predetermined keyword) detected by the voice recognition unit 236. FIG. 21 is a diagram showing an example of a data structure of detection frequency information according to the second embodiment. As shown in FIG. 21, the detection number information 237a has a table 237b and a table 237c.

テーブル２３７ｂは、キーワードと第１検出回数とを対応付ける。第１検出回数は、会話の開始時刻から、出力制御信号を受信した時刻までの音声区間において検出した所定のキーワードの検出回数を示す。 Table 237b associates the keyword with the number of first detections. The first detection number indicates the number of detections of a predetermined keyword detected in the voice section from the start time of the conversation to the time when the output control signal is received.

テーブル２３７ｃは、時間と、キーワードと、第２検出回数とを対応付ける。時間は、各出力制御信号を受信した時間間隔を示す。第２検出回数は、前回出力制御信号を受け付けた時刻から、今回出力制御信号を受け付けた時刻までの音声区間において検出した所定のキーワードの検出回数を示す。 Table 237c associates the time with the keyword and the second detection count. Time indicates the time interval at which each output control signal is received. The second number of detections indicates the number of detections of a predetermined keyword detected in the voice section from the time when the previous output control signal was received to the time when the output control signal was received this time.
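The two kinds of counts can be kept side by side, as in the following Python sketch. The class name, method names, and example keywords are hypothetical; the sketch only mirrors the distinction between the first detection count (table 237b, cumulative since the conversation start) and the second detection count (table 237c, per interval between consecutive output control signals).

```python
from collections import Counter


class KeywordCounts:
    def __init__(self):
        self.cumulative = Counter()   # table 237b: keyword -> first count
        self.current = Counter()      # counts since the previous signal
        self.per_interval = []        # table 237c: ((t_prev, t_now), counts)
        self._last_signal = 0

    def detect(self, keyword):
        """Record one detection of a predetermined keyword."""
        self.cumulative[keyword] += 1
        self.current[keyword] += 1

    def on_output_control_signal(self, now):
        """Close the current interval; return (first counts, latest record)."""
        record = ((self._last_signal, now), dict(self.current))
        self.per_interval.append(record)
        self.current.clear()
        self._last_signal = now
        return dict(self.cumulative), record


counts = KeywordCounts()
counts.detect("refund")
counts.detect("refund")
print(counts.on_output_control_signal(30))
counts.detect("manager")
print(counts.on_output_control_signal(60))
```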

認識結果蓄積部２３７は、音声処理装置２００が「モデルを学習する処理」を行う場合には、次の処理を行う。認識結果蓄積部２３７は、会話の開始時刻から、会話の終了時刻におけるテーブル２３７ｂの情報を、モデル学習部２３０ｃに出力する。 When the voice processing device 200 performs the "process for learning the model", the recognition result storage unit 237 performs the following processing. The recognition result storage unit 237 outputs the information of table 237b covering the period from the start time to the end time of the conversation to the model learning unit 230c.

認識結果蓄積部２３７は、音声処理装置２００が「会話状況を判定する処理」を行う場合には、次の処理を行う。認識結果蓄積部２３７は、会話の開始時刻から、出力制御信号を受信した時刻までのテーブル２３７ｂの情報を、出力値算出部２３０ｅに出力する。また、認識結果蓄積部２３７は、テーブル２３７ｃのレコードのうち、前回出力制御信号を受け付けた時刻から、今回出力制御信号を受け付けた時刻に対応する時刻に対応する時間のレコードを、出力値算出部２３０ｅに出力する。たとえば、前回出力制御信号を受信した時刻を「ｔ_１」、今回出力制御信号を受信した時刻を「ｔ_２」とすると、認識結果蓄積部２３７は、テーブル２３７ｃのレコードのうち、時間「ｔ_１～ｔ_２」に対応するレコードを、出力値算出部２３０ｅに出力する。 When the voice processing device 200 performs the "processing for determining the conversation status", the recognition result storage unit 237 performs the following processing. The recognition result storage unit 237 outputs the information of table 237b covering the period from the start time of the conversation to the time when the output control signal is received to the output value calculation unit 230e. In addition, among the records of table 237c, the recognition result storage unit 237 outputs to the output value calculation unit 230e the record for the interval from the time the previous output control signal was received to the time the current output control signal is received. For example, if the previous output control signal was received at time "t_1" and the current one at time "t_2", the recognition result storage unit 237 outputs the record of table 237c corresponding to the time "t_1 to t_2" to the output value calculation unit 230e.

会話時間算出部２３８は、会話の開始時刻からの会話の経過時間を計算する処理部である。たとえば、会話時間算出部２３８は、図示しないタイマから時間情報を取得し、会話の開始時刻からの経過時間を計測する。会話時間算出部２３８は、各フレームに含まれるサンプル数の累計を基にして、経過時間を推定してもよい。会話時間算出部２３８は、開始時刻と、開始時刻からの経過時間との情報を、会話時間管理部２３０ｄに出力する。 The conversation time calculation unit 238 is a processing unit that calculates the elapsed time of the conversation from the start time of the conversation. For example, the conversation time calculation unit 238 acquires time information from a timer (not shown) and measures the elapsed time from the start time of the conversation. The conversation time calculation unit 238 may estimate the elapsed time based on the cumulative number of samples included in each frame. The conversation time calculation unit 238 outputs information on the start time and the elapsed time from the start time to the conversation time management unit 230d.
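Estimating the elapsed time from the cumulative sample count amounts to a single division, as the following Python sketch shows. The function name, the 8000 Hz sampling rate, and the 160-sample frame length are assumptions for illustration; the description gives no specific values.

```python
def elapsed_seconds(frame_lengths, sampling_rate_hz):
    """Elapsed time = total number of samples seen so far / sampling rate."""
    return sum(frame_lengths) / sampling_rate_hz


# 100 frames of 160 samples at an assumed 8000 Hz correspond to 2 seconds.
print(elapsed_seconds([160] * 100, 8000))
```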

会話時間算出部２３８は、音声処理装置２００が「モデルを学習する処理」を行う場合には、次の処理を行う。会話時間算出部２３８は、会話の開始時刻から、会話の終了時刻までの経過時間の情報を、モデル学習部２３０ｃに出力する。 When the voice processing device 200 performs the "process for learning the model", the conversation time calculation unit 238 performs the following processing. The conversation time calculation unit 238 outputs information on the elapsed time from the conversation start time to the conversation end time to the model learning unit 230c.

会話時間算出部２３８は、音声処理装置２００が「会話状況を判定する処理」を行う場合には、次の処理を行う。会話時間算出部２３８は、会話の開始時刻から、出力制御信号を受信した時刻までの経過時間の情報を、出力値算出部２３０ｅに出力する。また、会話時間算出部２３８は、前回出力制御信号を受け付けた時刻から、今回出力信号を受け付けた時刻までの時間間隔の情報を、出力値算出部２３０ｅに出力する。 When the voice processing device 200 performs the "processing for determining the conversation status", the conversation time calculation unit 238 performs the following processing. The conversation time calculation unit 238 outputs information on the elapsed time from the start time of the conversation to the time when the output control signal is received to the output value calculation unit 230e. The conversation time calculation unit 238 also outputs information on the time interval from the time the previous output control signal was received to the time the current output control signal is received to the output value calculation unit 230e.

図１８の説明に戻る。モデル学習部２３０ｃは、学習用の音声情報から算出された特徴量を用いて、モデル情報２２０ｂを生成（学習）する処理部である。モデル学習部２３０ｃが、モデル情報２２０ｂを生成する処理は、実施例１で説明したモデル学習部１３０ｃの処理に対応する。 Returning to the description of FIG. 18. The model learning unit 230c is a processing unit that generates (learns) the model information 220b using feature amounts calculated from voice information for learning. The process by which the model learning unit 230c generates the model information 220b corresponds to the process of the model learning unit 130c described in the first embodiment.

会話時間管理部２３０ｄは、会話時間算出部２３８から、会話の開始時刻と、会話の開始時刻からの経過時間とを取得し、予め指定された時間Ｔを経過したか否かを判定する。会話時間管理部２３０ｄは、時間Ｔを経過する度に、「出力制御信号」を、ストレス評価値算出部２３５、音声認識部２３６、認識結果蓄積部２３７、会話時間算出部２３８、出力値算出部２３０ｅに出力する。 The conversation time management unit 230d acquires the conversation start time and the elapsed time from the conversation start time from the conversation time calculation unit 238, and determines whether or not the predetermined time T has elapsed. The conversation time management unit 230d outputs the "output control signal" to the stress evaluation value calculation unit 235, the voice recognition unit 236, the recognition result storage unit 237, the conversation time calculation unit 238, and the output value calculation unit each time the time T elapses. Output to 230e.

会話時間管理部２３０ｄは、会話時間算出部２３８から、会話の終了時刻の情報を受け付けた場合には、会話の終了時刻の情報を、判定部２３０ｆに出力する。 When the conversation time management unit 230d receives the information on the end time of the conversation from the conversation time calculation unit 238, the conversation time management unit 230d outputs the information on the end time of the conversation to the determination unit 230f.

出力値算出部２３０ｅは、特徴量算出部２３０ｂから取得する特徴量と、モデル情報２２０ｂとを基にして、出力値を算出する処理部である。出力値算出部２３０ｅは、算出した出力値を、出力値蓄積バッファ２２０ｃに蓄積する。 The output value calculation unit 230e is a processing unit that calculates an output value based on the feature amount acquired from the feature amount calculation unit 230b and the model information 220b. The output value calculation unit 230e stores the calculated output value in the output value storage buffer 220c.

たとえば、出力値算出部２３０ｅは、会話時間管理部２３０ｄから出力制御信号を取得したタイミングで、特徴量算出部２３０ｂから特徴量を取得する。この特徴量には、第１特徴量と、第２特徴量とが含まれる。 For example, the output value calculation unit 230e acquires the feature amount from the feature amount calculation unit 230b at the timing when the output control signal is acquired from the conversation time management unit 230d. This feature amount includes a first feature amount and a second feature amount.

第１特徴量は、会話の開始時刻から、今回出力制御信号を受信した時刻までの音声情報を基にして抽出される特徴量である。第１特徴量は、第１ストレス評価値、第１検出回数の情報、会話の開始時刻から、今回出力制御信号を受信した時刻までの経過時間の情報を含む。 The first feature amount is a feature amount extracted based on the voice information from the start time of the conversation to the time when the output control signal is received this time. The first feature amount includes the first stress evaluation value, the information of the first detection number, and the information of the elapsed time from the start time of the conversation to the time when the output control signal is received this time.

出力値算出部２３０ｅは、会話時間管理部２３０ｄから出力制御信号を取得したタイミングで、特徴量算出部２３０ｂから第１特徴量を取得し、取得した第１特徴量をモデル情報２２０ｂの入力層２０ａに入力する。出力値算出部２３０ｅは、特徴量をモデル情報２２０ｂの入力層２０ａに入力した際に、出力層２０ｃから出力される確率「Ｏｔ」と、確率「Ｏｎ」との値を取得し、式（１）～式（３）を基にして、出力値Ｖを算出する。出力値算出部２３０ｅは、第１特徴量から算出した出力値Ｖの情報を、テーブル２２１ｂに登録する。 The output value calculation unit 230e acquires the first feature amount from the feature amount calculation unit 230b at the timing when the output control signal is acquired from the conversation time management unit 230d, and the acquired first feature amount is used as the input layer 20a of the model information 220b. Enter in. When the feature amount is input to the input layer 20a of the model information 220b, the output value calculation unit 230e acquires the values of the probability “Ot” and the probability “On” output from the output layer 20c, and obtains the values of the equation (1). )-Equation (3) is used to calculate the output value V. The output value calculation unit 230e registers the information of the output value V calculated from the first feature amount in the table 221b.

出力値算出部２３０ｅは、会話時間管理部２３０ｄから出力制御信号を取得する度に、上記の処理を繰り返し実行することで、各経過時間の第１特徴量に対応する出力値Ｖを順次算出し、算出した出力値Ｖの情報を、テーブル２２１ｂに格納して更新する。 Each time the output value calculation unit 230e acquires an output control signal from the conversation time management unit 230d, the output value calculation unit 230e repeatedly executes the above processing to sequentially calculate the output value V corresponding to the first feature amount of each elapsed time. , The calculated output value V information is stored in the table 221b and updated.

The second feature amount, on the other hand, is extracted from the voice information in the section from the time at which the previous output control signal was received to the time at which the current output control signal is received. It includes the second stress evaluation value, the second detection count information, and the elapsed time from the time at which the previous output control signal was received to the time at which the current output control signal is received.

At the timing when it acquires the output control signal from the conversation time management unit 230d, the output value calculation unit 230e acquires the second feature amount from the feature amount calculation unit 230b and inputs it to the input layer 20a of the model information 220b. The output value calculation unit 230e then acquires the values of the probability "Ot" and the probability "On" output from the output layer 20c, and calculates the output value V based on equations (1) to (3). The output value calculation unit 230e registers the output value V calculated from the second feature amount in the table 221a in association with the corresponding time.

For example, when the output value calculation unit 230e calculates the output value V_2 from the second feature amount extracted from the voice information of the time "t_1 to t_2", it registers the time "t_1 to t_2" and the output value V_2 in the table 221a in association with each other.

Each time the output value calculation unit 230e acquires an output control signal from the conversation time management unit 230d, it repeats the above processing. It thereby sequentially calculates the output value V corresponding to the second feature amount of each time interval, and stores the calculated output value V in the table 221a.
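The difference between the two feature amounts can be illustrated with a minimal Python sketch. It is not the embodiment's implementation: the function name `on_output_signals`, the representation of voice information as a list of numbers, and the `model` callable (standing in for feature extraction, the model information 220b, and equations (1) to (3)) are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def on_output_signals(
    frames: Sequence[float],
    signal_indices: Sequence[int],
    model: Callable[[Sequence[float]], float],
) -> Tuple[List[float], Dict[Tuple[int, int], float]]:
    """Return (table_221b, table_221a) after handling every output control signal.

    frames          -- hypothetical stand-in for the recorded voice information
    signal_indices  -- frame index at which each output control signal arrives
    model           -- hypothetical callable mapping a span of frames to an output value V
    """
    table_221b: List[float] = []                    # V from start .. now (first feature amount)
    table_221a: Dict[Tuple[int, int], float] = {}   # V per interval (second feature amount)
    prev = 0
    for now in signal_indices:
        # First feature amount: everything from the start of the conversation.
        table_221b.append(model(frames[:now]))
        # Second feature amount: only the span since the previous signal.
        table_221a[(prev, now)] = model(frames[prev:now])
        prev = now
    return table_221b, table_221a
```

With a toy model that averages its input, the cumulative table smooths over the whole conversation while the interval table reflects only the most recent section.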

The determination unit 230f is a processing unit that determines, based on the output value information stored in the output value storage buffer 220c, whether the conversation is in an abnormal conversation situation or a normal conversation situation. The determination unit 230f calculates each value used in conditions 1 to 3 described above and determines whether the conversation situation is abnormal.

The process by which the determination unit 230f calculates the average of the output values from the start time to the current time Tc is as follows. The determination unit 230f calculates the average of the output values from the start time to the current time Tc that are stored in the table 221a of FIG. 19.

The process by which the determination unit 230f calculates the minimum of the output values included in the period from a predetermined time before the current time Tc up to the current time Tc is as follows. From the output values stored in the table 221a of FIG. 19, the determination unit 230f extracts the output values included in that period, and takes the smallest of the extracted output values as the minimum value.

The process by which the determination unit 230f identifies the output value at the current time Tc is as follows. The determination unit 230f identifies the latest output value stored in the table 221b of FIG. 19 as the output value at the current time Tc.
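The three values described above can be sketched in Python as follows. The row layout of the table 221a (interval start, interval end, output value), the function name `realtime_statistics`, and the rule that an interval counts as "included" when it lies entirely inside the window are assumptions for illustration only.

```python
from typing import List, Tuple

# Hypothetical row layout for table 221a: (interval start, interval end, output value V)
IntervalRow = Tuple[float, float, float]


def realtime_statistics(
    table_221a: List[IntervalRow],
    table_221b: List[float],
    tc: float,
    window: float,
) -> Tuple[float, float, float]:
    """Return (average, recent minimum, current value) at the current time tc."""
    # Average of all interval output values from the start time up to tc.
    upto_now = [v for (_, end, v) in table_221a if end <= tc]
    if not upto_now or not table_221b:
        raise ValueError("no output values available yet")
    average = sum(upto_now) / len(upto_now)

    # Minimum over intervals lying inside [tc - window, tc] (assumed inclusion rule).
    recent = [v for (start, end, v) in table_221a if start >= tc - window and end <= tc]
    if not recent:
        raise ValueError("no interval falls inside the window")
    recent_min = min(recent)

    # Latest cumulative output value stored in table 221b.
    current = table_221b[-1]
    return average, recent_min, current
```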

The determination unit 230f calculates each value used in conditions 1 to 3, and determines that the conversation is in an abnormal conversation situation when "condition 2 and condition 1 are satisfied" or when "condition 2 and condition 3 are satisfied". The determination unit 230f determines that the conversation is in a normal conversation situation when neither "condition 2 and condition 1" nor "condition 2 and condition 3" is satisfied. The determination unit 230f may output the determination result to a display device (not shown) for display, or may notify an external device of the result via the communication unit 210.
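The way the three conditions combine can be written as a short Python sketch. The function name `is_abnormal` is hypothetical, and the conditions are passed in as already-evaluated booleans because their definitions are given elsewhere in the description.

```python
def is_abnormal(cond1: bool, cond2: bool, cond3: bool) -> bool:
    """Abnormal when (condition 2 and condition 1) or (condition 2 and condition 3) holds."""
    return (cond2 and cond1) or (cond2 and cond3)
```

Written this way, it is easy to see that condition 2 is required in every abnormal case: the expression is logically equivalent to `cond2 and (cond1 or cond3)`.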

Next, an example of the processing procedure of the voice processing device 200 according to the second embodiment will be described. FIGS. 22 and 23 are flowcharts showing the processing procedure of the voice processing device according to the second embodiment. As shown in FIG. 22, the feature amount calculation unit 230b of the voice processing device 200 executes frame processing and extracts a frame from the voice information (step S201). The feature amount calculation unit 230b extracts the pitch of the frame (step S202) and calculates the power (step S203).

The feature amount calculation unit 230b accumulates the pitch and power values (step S204) and proceeds to step S207. Meanwhile, the feature amount calculation unit 230b executes voice recognition (step S205), updates the detection count information (step S206), and proceeds to step S207.

The conversation time management unit 230d of the voice processing device 200 determines whether it is time to calculate the output value (step S207). If it is not time to calculate the output value (step S207, No), the conversation time management unit 230d returns to step S201.

If it is time to calculate the output value (step S207, Yes), the voice processing device 200 calculates the stress evaluation value (step S208) and proceeds to step S209. The output value calculation unit 230e of the voice processing device 200 calculates the output values of the model based on the first feature amount and the second feature amount, accumulates them in the output value storage buffer 220c (step S209), and proceeds to step S210 in FIG. 23.

The description now turns to FIG. 23. The determination unit 230f calculates the values used to evaluate conditions 1 to 3 (step S210). If the conversation is in an abnormal conversation situation (step S211, Yes), the determination unit 230f proceeds to step S214.

On the other hand, if the conversation is not in an abnormal conversation situation (step S211, No), the determination unit 230f determines whether the conversation has ended (step S212). If the conversation has not ended (step S212, No), the determination unit 230f returns to step S201 in FIG. 22.

If the conversation has ended (step S212, Yes), the determination unit 230f determines that the conversation is in a normal conversation situation (step S213). The determination unit 230f outputs the determination result (step S214).
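The control flow of steps S207 to S214 can be summarized in a minimal Python sketch. The function `judge_conversation` and the callback `abnormal_at` are hypothetical; the callback hides steps S208 to S211 (stress evaluation, model output, and condition checks), and the frame-level steps S201 to S206 are omitted.

```python
from typing import Callable, Iterable, Optional, Tuple


def judge_conversation(
    output_times: Iterable[float],
    abnormal_at: Callable[[float], bool],
) -> Tuple[str, Optional[float]]:
    """Return ("abnormal", time) at the first abnormal judgment, else ("normal", None).

    output_times -- the times at which S207 answers Yes, in order
    abnormal_at  -- hypothetical callback wrapping S208 to S211 for a given time
    """
    for t in output_times:
        if abnormal_at(t):              # S211, Yes
            return ("abnormal", t)      # S214: output the result immediately
        # S211, No and S212, No: keep processing the conversation
    # S212, Yes: the conversation ended without an abnormal judgment
    return ("normal", None)             # S213, then S214
```

The sketch makes one property of the flowchart explicit: a conversation is reported as normal only after it has ended, whereas an abnormal judgment is reported as soon as it occurs.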

Next, the effect of the voice processing device 200 according to the second embodiment will be described. The voice processing device 200 determines the conversation situation based on the average of the output values from the start time to the current time Tc, the minimum of the output values included in the period from a predetermined time before the current time Tc up to the current time Tc, and the output value at the current time Tc. As a result, in addition to the output value for the feature amount of the voice information from the start time to the current time Tc, the output values for the feature amounts of the voice information within each time interval can also be used for the determination, so the conversation situation can be determined accurately.

Although the voice processing device 200 according to the second embodiment described above determines the conversation situation in real time, the embodiment is not limited to this. The device may instead execute offline processing when the conversation ends and determine the conversation situation at that point. In the following description, the voice processing device 200 that executes offline processing is simply referred to as the "voice processing device 200".

For example, when the conversation ends at time Te, the voice processing device 200 executes offline processing to obtain the following three loci (the first locus, the second locus, and the third locus).

FIG. 24A is a diagram for explaining the first locus. The horizontal axis of FIG. 24A corresponds to the conversation time, and the vertical axis corresponds to the output value. In the same manner as the voice processing device 100 of the first embodiment, the voice processing device 200 inputs, at each set time from the start time, the feature amount of the voice information in the section up to that set time into the model information 220b and calculates an output value. In the example shown in FIG. 24A, the output values 12a to 12r are calculated by the end time Te. The voice processing device 200 identifies the maximum of the output values 12a to 12r. For example, the maximum value is 12g. The maximum value of the first locus is referred to as the "first maximum value".

FIG. 24B is a diagram for explaining the second locus. The horizontal axis of FIG. 24B corresponds to the conversation time, and the vertical axis corresponds to the output value. The voice processing device 200 inputs the feature amount of the voice information divided at each set time (the feature amount of the voice information between consecutive set times) into the model information 220b and calculates an output value. In the example shown in FIG. 24B, the output values 13a to 13r are calculated by the end time Te. The voice processing device 200 identifies the minimum of the output values 13a to 13r. For example, the minimum value is 13m. The minimum value of the second locus is referred to as the "second minimum value".

FIG. 24C is a diagram for explaining the third locus. The horizontal axis of FIG. 24C corresponds to the conversation time, and the vertical axis corresponds to the output value. In the same manner as in FIG. 24B, the voice processing device 200 inputs the feature amount of the voice information divided at each set time (the feature amount of the voice information between consecutive set times) into the model information 220b and calculates the output values 13a to 13r. Then, for each set time, the voice processing device 200 calculates the average of the output values calculated from the start time to that set time, yielding the average values 14a to 14r. The average values 14a to 14r form the third locus. For example, the average value 14a corresponds to the output value 13a. The average value 14b is the average of the output values 13a and 13b. The average value 14c is the average of the output values 13a to 13c. The average value 14d is the average of the output values 13a to 13d. The average value 14e is the average of the output values 13a to 13e.

Similarly, the average value 14f is the average of the output values 13a to 13f. The average value 14g is the average of the output values 13a to 13g. The average value 14h is the average of the output values 13a to 13h. The average value 14i is the average of the output values 13a to 13i. The average value 14j is the average of the output values 13a to 13j. The average value 14k is the average of the output values 13a to 13k. The average value 14l is the average of the output values 13a to 13l. The average value 14m is the average of the output values 13a to 13m. The average value 14n is the average of the output values 13a to 13n. The average value 14o is the average of the output values 13a to 13o. The average value 14p is the average of the output values 13a to 13p. The average value 14q is the average of the output values 13a to 13q. The average value 14r is the average of the output values 13a to 13r.

The voice processing device 200 identifies the maximum of the average values 14a to 14r. For example, the maximum value is 14d. The maximum value of the third locus is referred to as the "third maximum value".
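The third locus is a running (cumulative) average of the second locus. A minimal Python sketch follows; the function name `running_average` is hypothetical, and the input list plays the role of the output values 13a to 13r.

```python
from typing import List, Sequence


def running_average(values: Sequence[float]) -> List[float]:
    """Return the average of values[0..i] for every index i (the third locus)."""
    averages: List[float] = []
    total = 0.0
    for count, v in enumerate(values, start=1):
        total += v                      # sum of the output values up to this set time
        averages.append(total / count)  # average from the start time to this set time
    return averages
```

The third maximum value is then `max(running_average(second_locus))`, where `second_locus` is the list of interval output values.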

The voice processing device 200 determines that the conversation is in an abnormal conversation situation when "condition 5 and condition 4 are satisfied" or when "condition 5 and condition 6 are satisfied". Th1 to Th3 in conditions 4 to 6 are preset threshold values. The thresholds satisfy Th1 > Th3 > Th2.

Condition 4: the maximum value of the locus of output values from the start time to the end time Te (first maximum value) > Th1
Condition 5: the minimum value of the output values included between the start time and the end time Te (second minimum value) > Th2
Condition 6: the maximum value of the locus obtained by averaging, at each time interval from the start time to the end time Te, the output values up to that point (third maximum value) > Th3
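Conditions 4 to 6 and their combination can be sketched in Python as follows. The function name `offline_is_abnormal` and the representation of the first and second loci as plain lists are assumptions for illustration; the inequalities and the threshold ordering are taken from the conditions above.

```python
from itertools import accumulate
from typing import Sequence


def offline_is_abnormal(
    first_locus: Sequence[float],
    second_locus: Sequence[float],
    th1: float,
    th2: float,
    th3: float,
) -> bool:
    """Apply conditions 4 to 6 to the loci obtained by offline processing."""
    if not (th1 > th3 > th2):
        raise ValueError("thresholds must satisfy Th1 > Th3 > Th2")

    first_max = max(first_locus)      # first maximum value
    second_min = min(second_locus)    # second minimum value
    # Third locus: average of the interval output values up to each set time.
    third_locus = [s / n for n, s in enumerate(accumulate(second_locus), start=1)]
    third_max = max(third_locus)      # third maximum value

    cond4 = first_max > th1
    cond5 = second_min > th2
    cond6 = third_max > th3
    return (cond5 and cond4) or (cond5 and cond6)
```

For example, `offline_is_abnormal([0.1, 0.9, 0.3], [0.5, 0.6, 0.7], th1=0.8, th2=0.4, th3=0.6)` evaluates condition 4 and condition 5 as true and therefore reports an abnormal conversation situation.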

The voice processing device 200 according to the second embodiment can use for the determination not only the output values for the feature amounts of the voice information from the start time to the end time Te but also statistics of the output values for the feature amounts of the voice information in each time interval. Therefore, by determining whether the conversation situation is abnormal using conditions 4 to 6 above, the conversation situation can be determined accurately.

Next, an example of the hardware configuration of a computer that realizes the same functions as the voice processing devices 100 and 200 shown in the above embodiments will be described. FIG. 25 is a diagram showing an example of the hardware configuration of a computer that realizes the same functions as the voice processing device.

As shown in FIG. 25, the computer 300 has a CPU 301 that executes various arithmetic processes, an input device 302 that receives data input from a user, and a display 303. The computer 300 also has a reading device 304 that reads programs and the like from a storage medium, and an interface device 305 that exchanges data with other computers via a wired or wireless network. For example, the interface device 305 is connected to a communication device or the like. The computer 300 further has a RAM 306 that temporarily stores various information, and a hard disk device 307. The devices 301 to 307 are connected to a bus 308.

The hard disk device 307 stores the acquisition program 307a, the feature amount calculation program 307b, the model learning program 307c, the conversation time management program 307d, the output value calculation program 307e, and the determination program 307f. These programs are read out and loaded into the RAM 306.

The acquisition program 307a functions as an acquisition process 306a. The feature amount calculation program 307b functions as a feature amount calculation process 306b. The model learning program 307c functions as a model learning process 306c. The conversation time management program 307d functions as a conversation time management process 306d. The output value calculation program 307e functions as an output value calculation process 306e. The determination program 307f functions as a determination process 306f.

The processing of the acquisition process 306a corresponds to the processing of the acquisition units 130a and 230a. The processing of the feature amount calculation process 306b corresponds to that of the feature amount calculation units 130b and 230b. The processing of the model learning process 306c corresponds to that of the model learning units 130c and 230c. The processing of the conversation time management process 306d corresponds to that of the conversation time management units 130d and 230d. The processing of the output value calculation process 306e corresponds to that of the output value calculation units 130e and 230e. The processing of the determination process 306f corresponds to that of the determination units 130f and 230f.

The programs 307a to 307f do not necessarily have to be stored in the hard disk device 307 from the beginning. For example, each program may be stored on a "portable physical medium" such as a flexible disk (FD), CD-ROM, DVD, magneto-optical disk, or IC card that is inserted into the computer 300, and the computer 300 may read and execute the programs 307a to 307f from that medium.

The following appendices are further disclosed with respect to the embodiments, including each of the examples above.

(Appendix 1) A voice processing program that causes a computer to execute a process comprising:
calculating, based on set times set at predetermined time intervals from the start time of a conversation to be determined that is included in voice information, a plurality of feature amounts from a plurality of pieces of voice information each spanning from the start time to one of the set times;
calculating, for each set time, a plurality of output values of a model corresponding to the plurality of feature amounts by inputting the plurality of feature amounts calculated for each set time into the model, the model having been generated based on feature amounts of voice information from the start time to the end time of a conversation; and
determining, based on the plurality of output values, whether the conversation to be determined is in an abnormal conversation situation.

(Appendix 2) The voice processing program according to Appendix 1, wherein the determining process divides the range that the locus of the plurality of output values can take into an abnormal region, taken when the conversation situation is abnormal, and a normal region, taken when the conversation situation is normal, and determines whether the conversation to be determined is in an abnormal conversation situation based on the locus of the plurality of output values, the abnormal region, and the normal region.

(Appendix 3) The voice processing program according to Appendix 2, wherein the determining process divides the abnormal region into a first region and a second region corresponding to a region with larger output values than the first region, and determines that the conversation to be determined is in an abnormal conversation situation when a part of the locus of the plurality of output values is included in the second region, or when the entire locus of the plurality of output values is included in the first region.

(Appendix 4) The voice processing program according to Appendix 2 or 3, wherein the determining process divides the normal region into a third region and a fourth region corresponding to a region with smaller output values than the third region, and determines that the conversation to be determined is in a normal conversation situation when a part of the locus of the plurality of output values is included in the fourth region.

(Appendix 5) The voice processing program according to Appendix 2, wherein the determining process determines whether the conversation to be determined is in an abnormal conversation situation based on the order in which the locus of the output values passes through the normal region or the abnormal region.

(Appendix 6) The voice processing program according to any one of Appendices 1 to 5, wherein the start time is a predetermined time after the time at which the start of the conversation to be determined, included in the voice information, is detected.

(Appendix 7) The voice processing program according to Appendix 1, wherein
the process of calculating the feature amounts further divides the voice information at the predetermined time intervals and further calculates a plurality of feature amounts from the resulting plurality of pieces of divided voice information,
the process of calculating the output values further calculates a plurality of output values by inputting the plurality of feature amounts calculated from the plurality of pieces of divided voice information into the model, and
the determining process determines whether the conversation to be determined is in an abnormal conversation situation based on: the average, up to the current time, of the plurality of output values obtained from the feature amounts of the plurality of pieces of divided voice information from the start time to the current time; the minimum of the plurality of output values obtained from the feature amounts of the plurality of pieces of divided voice information from a time that is a predetermined time before the current time up to the current time; and the output value obtained from the feature amount of the voice information from the start time to the current time.

(Appendix 8) The voice processing program according to Appendix 1, wherein
the process of calculating the feature amounts divides the voice information at the predetermined time intervals and calculates a plurality of feature amounts from the resulting plurality of pieces of divided voice information,
the process of calculating the output values calculates a plurality of output values by inputting the plurality of feature amounts calculated from the plurality of pieces of divided voice information into the model, and
the determining process determines whether the conversation to be determined is in an abnormal conversation situation based on: the maximum value of the locus obtained by calculating, for each set time, the average from the start time to that set time of the plurality of output values obtained from the feature amounts of the plurality of pieces of divided voice information; the minimum of the plurality of output values obtained from the feature amounts of the plurality of pieces of divided voice information; and the maximum of the output values obtained from the feature amounts of the voice information from the start time to each set time.

(Appendix 9) A voice processing method executed by a computer, the method comprising:
calculating, based on set times set at predetermined time intervals from the start time of a conversation to be determined that is included in voice information, a plurality of feature amounts from a plurality of pieces of voice information each spanning from the start time to one of the set times;
calculating, for each set time, a plurality of output values of a model corresponding to the plurality of feature amounts by inputting the plurality of feature amounts calculated for each set time into the model, the model having been generated based on feature amounts of voice information from the start time to the end time of a conversation; and
determining, based on the plurality of output values, whether the conversation to be determined is in an abnormal conversation situation.

(Appendix 10) The voice processing method according to Appendix 9, wherein the determining process divides the range that the locus of the plurality of output values can take into an abnormal region, taken when the conversation situation is abnormal, and a normal region, taken when the conversation situation is normal, and determines whether the conversation to be determined is in an abnormal conversation situation based on the locus of the plurality of output values, the abnormal region, and the normal region.

(Appendix 11) The voice processing method according to Appendix 10, wherein the determining process divides the abnormal region into a first region and a second region corresponding to a region with larger output values than the first region, and determines that the conversation to be determined is in an abnormal conversation situation when a part of the locus of the plurality of output values is included in the second region, or when the entire locus of the plurality of output values is included in the first region.

(Appendix 12) The voice processing method according to Appendix 10 or 11, wherein the determining process divides the normal region into a third region and a fourth region corresponding to a region with smaller output values than the third region, and determines that the conversation to be determined is in a normal conversation situation when a part of the locus of the plurality of output values is included in the fourth region.

(Appendix 13) The voice processing method according to Appendix 10, wherein the determination process determines whether the conversation to be determined is in an abnormal conversation situation based on the order in which the locus of the output values passes through the normal region and the abnormal region.

(Appendix 14) The voice processing method according to any one of Appendices 9 to 13, wherein the start time is a predetermined time after the time at which the start of the conversation to be determined, included in the voice information, is detected.

(Appendix 15) The voice processing method according to Appendix 9, wherein
the process of calculating the feature amounts further divides the voice information at the predetermined time intervals and calculates a plurality of feature amounts from the resulting plurality of pieces of divided voice information,
the process of calculating the output values further calculates a plurality of output values by inputting the plurality of feature amounts calculated from the plurality of pieces of divided voice information into the model, and
the determination process determines whether the conversation to be determined is in an abnormal conversation situation based on: the average, up to the current time, of the plurality of output values obtained from the feature amounts of the pieces of divided voice information from the start time to the current time; the minimum of the plurality of output values obtained from the feature amounts of the pieces of divided voice information from a time a predetermined time before the current time to the current time; and the output value obtained from the feature amount of the voice information from the start time to the current time.
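
As a rough illustration of the three quantities named in Appendix 15, the following Python sketch computes them from values that are assumed to be already available. The names `Appendix15Stats` and `appendix15_stats`, and the expression of the "predetermined time" as a count of recent segments, are hypothetical. The appendix does not say how the three quantities are combined into a decision, so the sketch stops at computing them.

```python
from statistics import fmean
from typing import NamedTuple, Sequence


class Appendix15Stats(NamedTuple):
    mean_to_now: float     # average of per-segment outputs from the start time to now
    recent_min: float      # minimum per-segment output over the most recent window
    cumulative_now: float  # output for the voice information from the start time to now


def appendix15_stats(
    segment_outputs: Sequence[float],
    cumulative_output: float,
    window_segments: int,
) -> Appendix15Stats:
    """Collect the three quantities that Appendix 15 bases the determination on."""
    if not segment_outputs:
        raise ValueError("at least one per-segment output is required")
    if window_segments < 1:
        raise ValueError("window_segments must be at least 1")
    # Outputs of the segments between (now - predetermined time) and now.
    recent = segment_outputs[-window_segments:]
    return Appendix15Stats(fmean(segment_outputs), min(recent), cumulative_output)
```

For example, `appendix15_stats([0.2, 0.4, 0.9, 0.7], 0.6, 2)` takes the minimum over the last two segments only, while the average covers all four.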

(Appendix 16) The voice processing method according to Appendix 9, wherein
the process of calculating the feature amounts divides the voice information at the predetermined time intervals and calculates a plurality of feature amounts from the resulting plurality of pieces of divided voice information,
the process of calculating the output values calculates a plurality of output values by inputting the plurality of feature amounts calculated from the plurality of pieces of divided voice information into the model, and
the determination process determines whether the conversation to be determined is in an abnormal conversation situation based on: the maximum value of the locus obtained by calculating, at each set time, the average from the start time to that set time of the plurality of output values obtained from the feature amounts of the pieces of divided voice information; the minimum of the plurality of output values obtained from the feature amounts of the pieces of divided voice information; and the maximum of the output values obtained from the feature amounts of the voice information from the start time to each set time.
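
The three quantities of Appendix 16 differ from those of Appendix 15 in that they summarize the whole conversation rather than the state at the current time. A minimal Python sketch, assuming one per-segment output and one cumulative output per set time, is given below; the function name `appendix16_stats` and the tuple layout are hypothetical, and the way the quantities feed the final decision is left open because the appendix does not specify it.

```python
from itertools import accumulate
from typing import Sequence, Tuple


def appendix16_stats(
    segment_outputs: Sequence[float],
    cumulative_outputs: Sequence[float],
) -> Tuple[float, float, float]:
    """Return (max of running-average locus, min segment output, max cumulative output)."""
    if not segment_outputs or not cumulative_outputs:
        raise ValueError("both output sequences must be non-empty")
    # Average of the per-segment outputs from the start time to each set time.
    running_means = [
        total / count
        for count, total in enumerate(accumulate(segment_outputs), start=1)
    ]
    return max(running_means), min(segment_outputs), max(cumulative_outputs)
```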

(Appendix 17) A voice processing device comprising:
a feature amount calculation unit that calculates, based on set times set at predetermined time intervals from the start time of a conversation to be determined that is included in voice information, a plurality of feature amounts from a plurality of pieces of voice information, each extending from the start time to a respective set time;
an output value calculation unit that calculates, for each set time, a plurality of output values of a model corresponding to the plurality of feature amounts by inputting the plurality of feature amounts calculated for each set time into the model, the model having been generated based on feature amounts of voice information from the start time to the end time of a conversation; and
a determination unit that determines, based on the plurality of output values, whether the conversation to be determined is in an abnormal conversation situation.
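
The data flow through the feature amount calculation unit and the output value calculation unit of Appendix 17 can be sketched in Python as follows. This is an illustration under stated assumptions, not the device's implementation: the function name `outputs_per_set_time`, the representation of voice information as a list of samples, the use of a single float as the feature amount, and the toy feature and model functions are all hypothetical stand-ins.

```python
from typing import Callable, List, Sequence


def outputs_per_set_time(
    samples: Sequence[float],
    sample_rate: int,
    interval_sec: float,
    extract_features: Callable[[Sequence[float]], float],
    model: Callable[[float], float],
) -> List[float]:
    """Return one model output per set time, each computed from start..set time."""
    step = int(interval_sec * sample_rate)  # samples between consecutive set times
    if step <= 0:
        raise ValueError("interval_sec * sample_rate must be at least one sample")
    outputs = []
    for end in range(step, len(samples) + 1, step):
        # The window always begins at the start time and grows with each set time.
        features = extract_features(samples[:end])
        outputs.append(model(features))
    return outputs


# Stand-ins for illustration only: mean absolute amplitude as the "feature"
# and a clamp to [0, 1] as the "model".
def toy_features(window: Sequence[float]) -> float:
    return sum(abs(x) for x in window) / len(window)


def toy_model(feature: float) -> float:
    return max(0.0, min(1.0, feature))


demo_outputs = outputs_per_set_time(
    [0.0] * 4 + [1.0] * 4, 2, 2.0, toy_features, toy_model
)
```

The resulting list of output values is what the determination unit would examine. The sketch recomputes features over the growing window at every set time, mirroring the wording of the appendix that each piece of voice information extends from the start time to a set time.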

(Appendix 18) The voice processing device according to Appendix 17, wherein the determination unit divides the range that the locus of the plurality of output values can take into an abnormal region, which the locus occupies when the conversation situation is abnormal, and a normal region, which the locus occupies when the conversation situation is normal, and determines whether the conversation to be determined is in an abnormal conversation situation based on the locus of the plurality of output values, the abnormal region, and the normal region.

(Appendix 19) The voice processing device according to Appendix 18, wherein the determination unit divides the abnormal region into a first region and a second region corresponding to output values larger than those of the first region, and determines that the conversation to be determined is in an abnormal conversation situation when a part of the locus of the plurality of output values is included in the second region, or when the entire locus of the plurality of output values is included in the first region.

(Appendix 20) The voice processing device according to Appendix 18 or 19, wherein the determination unit divides the normal region into a third region and a fourth region corresponding to output values smaller than those of the third region, and determines that the conversation to be determined is in a normal conversation situation when a part of the locus of the plurality of output values is included in the fourth region.

100, 200 Voice processing device
110, 210 Communication unit
120, 220 Storage unit
120a, 220a Voice buffer
120b, 220b Model information
120c, 220c Output value storage buffer
130, 230 Control unit
130a, 230a Acquisition unit
130b, 230b Feature amount calculation unit
130c, 230c Model learning unit
130d, 230d Conversation time management unit
130e, 230e Output value calculation unit
130f, 230f Determination unit

Claims

A voice processing program that causes a computer to execute a process comprising:
calculating, based on set times set at predetermined time intervals from the start time of a conversation to be determined that is included in voice information, a plurality of feature amounts from a plurality of pieces of voice information, each extending from the start time to a respective set time;
calculating, for each set time, a plurality of output values of a model corresponding to the plurality of feature amounts by inputting the plurality of feature amounts calculated for each set time into the model, the model having been generated based on feature amounts of voice information from the start time to the end time of a conversation; and
determining, based on the plurality of output values, whether the conversation to be determined is in an abnormal conversation situation.

The voice processing program according to claim 1, wherein the determination process divides the range that the locus of the plurality of output values can take into an abnormal region, which the locus occupies when the conversation situation is abnormal, and a normal region, which the locus occupies when the conversation situation is normal, and determines whether the conversation to be determined is in an abnormal conversation situation based on the locus of the plurality of output values, the abnormal region, and the normal region.

The voice processing program according to claim 2, wherein the determination process divides the abnormal region into a first region and a second region corresponding to output values larger than those of the first region, and determines that the conversation to be determined is in an abnormal conversation situation when a part of the locus of the plurality of output values is included in the second region, or when the entire locus of the plurality of output values is included in the first region.

The voice processing program according to claim 2 or 3, wherein the determination process divides the normal region into a third region and a fourth region corresponding to output values smaller than those of the third region, and determines that the conversation to be determined is in a normal conversation situation when a part of the locus of the plurality of output values is included in the fourth region.

The voice processing program according to claim 2, wherein the determination process determines whether the conversation to be determined is in an abnormal conversation situation based on the order in which the locus of the output values passes through the normal region and the abnormal region.

The voice processing program according to any one of claims 1 to 5, wherein the start time is a predetermined time after the time at which the start of the conversation to be determined, included in the voice information, is detected.

The voice processing program according to claim 1, wherein
the process of calculating the feature amounts further divides the voice information at the predetermined time intervals and calculates a plurality of feature amounts from the resulting plurality of pieces of divided voice information,
the process of calculating the output values further calculates a plurality of output values by inputting the plurality of feature amounts calculated from the plurality of pieces of divided voice information into the model, and
the determination process determines whether the conversation to be determined is in an abnormal conversation situation based on: the average, up to the current time, of the plurality of output values obtained from the feature amounts of the pieces of divided voice information from the start time to the current time; the minimum of the plurality of output values obtained from the feature amounts of the pieces of divided voice information from a time a predetermined time before the current time to the current time; and the output value obtained from the feature amount of the voice information from the start time to the current time.

The voice processing program according to claim 1, wherein
the process of calculating the feature amounts divides the voice information at the predetermined time intervals and calculates a plurality of feature amounts from the resulting plurality of pieces of divided voice information,
the process of calculating the output values calculates a plurality of output values by inputting the plurality of feature amounts calculated from the plurality of pieces of divided voice information into the model, and
the determination process determines whether the conversation to be determined is in an abnormal conversation situation based on: the maximum value of the locus obtained by calculating, at each set time, the average from the start time to that set time of the plurality of output values obtained from the feature amounts of the pieces of divided voice information; the minimum of the plurality of output values obtained from the feature amounts of the pieces of divided voice information; and the maximum of the output values obtained from the feature amounts of the voice information from the start time to each set time.

A voice processing method executed by a computer, the method comprising:
calculating, based on set times set at predetermined time intervals from the start time of a conversation to be determined that is included in voice information, a plurality of feature amounts from a plurality of pieces of voice information, each extending from the start time to a respective set time;
calculating, for each set time, a plurality of output values of a model corresponding to the plurality of feature amounts by inputting the plurality of feature amounts calculated for each set time into the model, the model having been generated based on feature amounts of voice information from the start time to the end time of a conversation; and
determining, based on the plurality of output values, whether the conversation to be determined is in an abnormal conversation situation.

A voice processing device comprising:
a feature amount calculation unit that calculates, based on set times set at predetermined time intervals from the start time of a conversation to be determined that is included in voice information, a plurality of feature amounts from a plurality of pieces of voice information, each extending from the start time to a respective set time;
an output value calculation unit that calculates, for each set time, a plurality of output values of a model corresponding to the plurality of feature amounts by inputting the plurality of feature amounts calculated for each set time into the model, the model having been generated based on feature amounts of voice information from the start time to the end time of a conversation; and
a determination unit that determines, based on the plurality of output values, whether the conversation to be determined is in an abnormal conversation situation.