JP6268916B2

JP6268916B2 - Abnormal conversation detection apparatus, abnormal conversation detection method, and abnormal conversation detection computer program

Info

Publication number: JP6268916B2
Application number: JP2013221466A
Authority: JP
Inventors: 昭二早川
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2013-10-24
Filing date: 2013-10-24
Publication date: 2018-01-31
Anticipated expiration: 2033-10-24
Also published as: JP2015082093A

Description

The present invention relates to an abnormal conversation detection device, an abnormal conversation detection method, and an abnormal conversation detection computer program that determine whether a conversation is abnormal based on, for example, an audio signal in which a conversation among a plurality of speakers is recorded.

In recent years, fraud and malicious solicitation carried out over telephone lines with the aim of swindling money have become a social problem. Techniques have therefore been proposed for estimating the psychological state of a speaker from the voice during a call over a telephone line (see, for example, Patent Documents 1 and 2).

For example, the utterance state detection device disclosed in Patent Document 1 extracts high-frequency components from the result of frequency analysis of a speaker's utterance data and calculates the degree of variation of those high-frequency components per unit time. The device then detects the utterance state of a specific speaker based on statistics for each predetermined section, which are calculated from a plurality of degrees of variation over a predetermined period obtained from the utterance data of that speaker.

The suppression state detection device disclosed in Patent Document 2 analyzes input voice for each of a plurality of frames and calculates the average value of the analysis results. The device determines a threshold based on the calculated average value and on statistical data, stored in advance for each of a plurality of speakers, concerning the average value of the analysis results and the cumulative frequency distribution of the analysis results. It then computes the frequency of occurrence of analysis results whose value is larger than the threshold, and determines the tension state of the vocal cords producing the voice based on that frequency.

These techniques presuppose that the voice of the transmitting-side speaker and the voice of the receiving-side speaker are obtained separately. To acquire the two voices separately, a call recording adapter is connected, for example, between the telephone body and the handset. The state estimation device then acquires the transmitting-side audio signal and the receiving-side audio signal from the adapter and estimates the state of the speaker. In this case, the audio signals that can be acquired from the call recording adapter are limited to those of calls made with the telephone to which the adapter is connected. Therefore, if a plurality of telephones are connected to one telephone line and the call recording adapter is connected to only one of them, the state estimation device cannot estimate the state of a speaker from a call made with any of the other telephones.
On the other hand, if the call recording adapter is connected between the modular rosette and the distributor, and the state estimation device acquires the audio signal from that adapter, the audio signal of a call on any of the telephones can be acquired even when a plurality of telephones are connected to the distributor. In this case, however, the audio signal obtained from the call recording adapter is a mixture of the voice of the transmitting-side speaker and the voice of the receiving-side speaker. Consequently, it is difficult to obtain sufficient estimation accuracy by applying the above techniques, which presuppose that the two voices are obtained separately, to such an audio signal. This is because the voice of one speaker is superimposed on the voice of the other, so the voice feature quantities used to estimate the state of one speaker also contain features of the other speaker's voice. Meanwhile, a technique has been proposed that separates the sounds from two sound sources by estimating the parameters of a sinusoidal superposition model (see, for example, Patent Document 3).

JP 2011-242755 A
JP 2012-168296 A
JP 2008-304718 A

In the technique described in Patent Document 3, the sinusoidal model contains no term representing noise. In an actual call, however, noise emitted from sound sources around a speaker is superimposed on the speaker's voice, so the technique described in Patent Document 3 may be unable to accurately separate the voice of each speaker from an audio signal in which an actual call is recorded.

Accordingly, in one aspect, an object of this specification is to provide an abnormal conversation detection device capable of determining whether a conversation is abnormal based on an audio signal in which a conversation among a plurality of speakers is recorded.

According to one embodiment, an abnormal conversation detection device is provided. The abnormal conversation detection device includes: a voice input unit that inputs an audio signal containing a conversation among a plurality of speakers; a storage unit; a feature quantity extraction unit that extracts, from the audio signal, at least two feature quantities representing characteristics of a human voice for each frame having a predetermined time length, and stores the set of the at least two feature quantities in the storage unit; a fitting unit that fits the distribution of the feature quantities stored in the storage unit with the same number of probability distributions as the number of speakers; and a determination unit that determines whether the same number of probability distributions as the number of speakers approximates the distribution of the feature quantities and, when it does not, determines that the conversation is an abnormal conversation.

The objects and advantages of the invention are realized and attained by means of the elements and combinations particularly pointed out in the claims.
It should be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and do not restrict the invention as claimed.

The abnormal conversation detection device disclosed in this specification can determine whether a conversation is abnormal based on an audio signal in which a conversation among a plurality of speakers is recorded.

A diagram showing an example of the position at which an abnormal conversation detection device according to one embodiment is connected to a telephone line.
A schematic configuration diagram of the abnormal conversation detection device according to the first embodiment.
A functional block diagram of the processing unit of the abnormal conversation detection device.
(a) is a diagram showing an example of the distribution of voice feature quantities when the conversation between two speakers is a normal conversation, and (b) is a diagram showing an example of the distribution of voice feature quantities when the conversation between two speakers is an abnormal conversation.
An operation flowchart of the abnormal conversation detection process according to the first embodiment.
A functional block diagram of the processing unit according to the second embodiment.
An operation flowchart of the abnormal conversation detection process according to the second embodiment.
A schematic configuration diagram of a mobile phone in which the abnormal conversation detection device according to any of the embodiments or their modifications is implemented.
A schematic configuration diagram of a server-client system in which the abnormal conversation detection device according to any of the embodiments or their modifications is implemented.

The abnormal conversation detection device is described below with reference to the drawings.
The inventor found that, when an audio signal containing the voice of a speaker talking in a normal state is divided into frames and two or more feature quantities representing characteristics of a human voice are extracted from each frame, the distribution of the feature quantities of a single speaker's voice can be approximated by one probability distribution such as a normal distribution. The inventor further found that, when the speaker's psychological state is no longer normal and that state comes to be reflected in the speaker's voice, the distribution of the feature quantities of that speaker's voice can no longer be approximated by one probability distribution.

The abnormal conversation detection device therefore divides an audio signal in which a conversation among a plurality of speakers is recorded into frames and extracts, from each frame, two or more feature quantities representing characteristics of a human voice. When the distribution of the feature quantities can be approximated by the same number of probability distributions as the number of speakers, the device determines that each speaker is in a normal state and that a normal conversation is taking place. When the distribution cannot be approximated by that number of probability distributions, the device determines that the conversation is an abnormal conversation.
An abnormal conversation is a conversation conducted while the psychological state of at least one of the participating speakers is abnormal. A speaker's psychological state is abnormal when the speaker cannot remain calm, for example when the speaker becomes angry, becomes frightened, or cries.

In the first embodiment, the abnormal conversation detection device determines whether a conversation is an abnormal conversation based on an audio signal in which a call between two speakers over a telephone line is recorded. However, the device may instead make this determination based on an audio signal that contains a conversation between two speakers and is recorded by a mobile phone, a video conference system, or a voice recorder.

FIG. 1 is a diagram showing an example of the position at which an abnormal conversation detection device according to one embodiment is connected to a telephone line. In this example, for convenience of explanation, the side on which the abnormal conversation detection device is installed is called the transmitting side, and the other party to the call over the telephone line is called the receiving side. In this embodiment, the abnormal conversation detection device 1 acquires the audio signal from a call recording adapter 5 connected between a modular rosette 4 and a distributor 3 to which two telephones 2-1 and 2-2 are connected. Therefore, whichever of the telephones 2-1 and 2-2 the transmitting-side speaker uses, the audio signal containing that speaker's voice passes through the call recording adapter 5. The audio signal containing the voice of the receiving-side speaker is transmitted from the telephone line 6 to one of the telephones via the modular rosette 4 and the call recording adapter 5. The audio signal output from the call recording adapter 5 to the abnormal conversation detection device 1 is therefore an analog signal in which the voices of the transmitting-side speaker and the receiving-side speaker are mixed.

FIG. 2 is a schematic configuration diagram of the abnormal conversation detection device according to the first embodiment. The abnormal conversation detection device 1 includes an interface unit 11, an analog/digital converter 12, a processing unit 13, a storage unit 14, a digital/analog converter 15, and a speaker 16.

The interface unit 11 is an example of the voice input unit and has an audio interface. The interface unit 11 acquires the analog audio signal from the call recording adapter 5 and outputs it to the analog/digital converter 12 (hereinafter, the A/D converter). The A/D converter 12 digitizes the analog audio signal by sampling it at a predetermined sampling rate and outputs the digitized audio signal to the processing unit 13.

The processing unit 13 has, for example, one or more processors, a memory circuit, and peripheral circuits. The processing unit 13 determines whether the conversation is an abnormal conversation based on the digitized audio signal. The processing performed by the processing unit 13 is described in detail later.

The storage unit 14 has, for example, a readable and writable nonvolatile semiconductor memory and a readable and writable volatile semiconductor memory. The storage unit 14 stores various data used in the abnormal conversation detection process executed on the processing unit 13 and various data generated during that process. The storage unit 14 also stores a warning audio signal that is output from the speaker 16 when the processing unit 13 determines that the conversation is an abnormal conversation.

When the processing unit 13 determines that the conversation is an abnormal conversation, the digital/analog converter 15 (hereinafter, the D/A converter) converts the warning audio signal output from the processing unit 13 into an analog signal and outputs it to the speaker 16. The speaker 16 reproduces the analog warning audio signal.

FIG. 3 is a functional block diagram of the processing unit 13. The processing unit 13 has a spectrum calculation unit 21, a feature quantity extraction unit 22, a fitting unit 23, a determination unit 24, and a warning unit 25.
Each of these units is, for example, a functional module implemented by a computer program running on a processor of the processing unit 13.

The spectrum calculation unit 21 divides the digitized audio signal (hereinafter simply called the audio signal) into frames of a predetermined length. The frame length is set to, for example, 32 msec. The spectrum calculation unit 21 may make two consecutive frames partially overlap. In that case, the frame shift, that is, the amount of signal newly taken into the frame when moving from the current frame to the next, may be set to, for example, 10 msec to 16 msec.

For each frame, the spectrum calculation unit 21 transforms the audio signal from the time domain into a spectrum signal in the frequency domain using a time-frequency transform. As the time-frequency transform, the spectrum calculation unit 21 can use, for example, the fast Fourier transform (FFT) or the modified discrete cosine transform (MDCT). The spectrum calculation unit 21 may multiply each frame by a window function such as a Hamming window or a Hanning window before performing the time-frequency transform.
For example, if the frame length is 32 msec and the sampling rate of the A/D converter 12 is 8 kHz, each frame contains 256 sample points, so the spectrum calculation unit 21 executes a 256-point FFT.
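The framing and transform steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation of the spectrum calculation unit 21; the function names, the 16 msec frame shift, the choice of a Hamming window, and the 500 Hz test tone are all assumptions made for the example.

```python
import cmath
import math


def split_frames(samples, frame_len=256, frame_shift=128):
    """Split a list of samples into overlapping frames of frame_len samples."""
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, frame_shift)]


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def power_spectrum(frame):
    """Windowed power spectrum |S(f)|^2 for frequency bins 0 .. N/2."""
    window = hamming(len(frame))
    spectrum = fft([s * w for s, w in zip(frame, window)])
    return [abs(c) ** 2 for c in spectrum[: len(frame) // 2 + 1]]


# Hypothetical input: one second of a 500 Hz tone sampled at 8 kHz.
fs = 8000
signal = [math.sin(2.0 * math.pi * 500.0 * t / fs) for t in range(fs)]
frames = split_frames(signal)  # 32 msec frames (256 samples), 16 msec shift
ps = power_spectrum(frames[0])
peak_bin = max(range(len(ps)), key=ps.__getitem__)
print(len(frames), peak_bin * fs / 256)
```

With an 8 kHz sampling rate and 256-sample frames, each frequency bin is 31.25 Hz wide, so the test tone falls exactly on one bin.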

Each time the spectrum signal of a frame is obtained, the spectrum calculation unit 21 outputs that spectrum signal to the feature quantity extraction unit 22.

For each frame, the feature quantity extraction unit 22 extracts two or more feature quantities representing characteristics of a human voice from the spectrum signal of that frame. In this embodiment, the feature quantities it extracts are the pitch frequency and the integrated value of the power in the frequency band that contains human voice.

For each frame, the feature quantity extraction unit 22 calculates the integrated value of the power in the frequency band that contains human voice, for example, according to the following equation.
Here, S(f) is the spectrum signal at frequency f, and |S(f)|² is the power spectrum at frequency f. fmin and fmax are the lower and upper limits, respectively, of the frequency band that contains human voice, and P is the integrated value of the power. The feature quantity extraction unit 22 may instead obtain the integrated power value directly from the sum of squares of the sample points of each frame, without performing the time-frequency transform of the frame.
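The equation itself does not appear in this text. A form consistent with the symbol definitions given here would be the sum of the power spectrum over the voice band; this is a reconstruction from those definitions, and the original may differ, for example by expressing P on a logarithmic (dB) scale.

```latex
P = \sum_{f = f_{\min}}^{f_{\max}} \lvert S(f) \rvert^{2}
```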

To calculate the pitch frequency, the feature quantity extraction unit 22 obtains, for each frame, the largest of the peak values of the autocorrelation function or the modified autocorrelation function (excluding the peak at a time difference of 0). A frame corresponding to human voiced sound has a relatively high degree of autocorrelation, whereas a frame corresponding to unvoiced sound or background noise has a low degree of autocorrelation. The feature quantity extraction unit 22 therefore compares the largest peak value with a predetermined threshold and, when it exceeds the threshold, determines that the frame contains voiced sound of a speaker. The feature quantity extraction unit 22 then takes the reciprocal of the time difference corresponding to the largest peak value as the pitch frequency. The autocorrelation function is obtained by applying the inverse Fourier transform to the power spectrum. The modified autocorrelation function is obtained by applying the inverse Fourier transform to the power spectrum after filtering it with a linear predictive coding filter.
The feature quantity extraction unit 22 can also obtain the pitch frequency directly from the time-domain frame by computing the autocorrelation function from the sample points of each frame, without Fourier transforming the frame. When the feature quantity extraction unit 22 calculates the feature quantities directly from the signal values of the sample points of the time-domain frame in this way, without using the spectrum signal, the spectrum calculation unit 21 may be omitted.
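The time-domain variant of the pitch estimation can be sketched as below. This is only an illustration under stated assumptions: the function name, the 60 Hz to 400 Hz search range, the normalization by the zero-lag value, and the voicing threshold of 0.3 are hypothetical choices that the text does not specify. The sketch also takes the largest autocorrelation value within the search range rather than explicitly locating peaks.

```python
import math


def pitch_from_autocorr(frame, fs, fmin=60.0, fmax=400.0, voicing_threshold=0.3):
    """Return the pitch frequency in Hz, or None if the frame looks unvoiced."""
    n = len(frame)
    r0 = sum(s * s for s in frame)  # autocorrelation at lag 0 (frame energy)
    if r0 == 0.0:
        return None
    lag_lo = int(fs / fmax)              # shortest period considered
    lag_hi = min(int(fs / fmin), n - 1)  # longest period considered
    best_lag, best_val = None, 0.0
    for lag in range(lag_lo, lag_hi + 1):
        # Autocorrelation at this lag, normalized by the zero-lag value.
        r = sum(frame[i] * frame[i + lag] for i in range(n - lag)) / r0
        if r > best_val:
            best_lag, best_val = lag, r
    if best_lag is None or best_val <= voicing_threshold:
        return None  # low autocorrelation: unvoiced sound or background noise
    return fs / best_lag  # reciprocal of the time difference of the maximum


# Hypothetical voiced frame: a 200 Hz tone, 256 samples at 8 kHz.
fs = 8000
voiced = [math.sin(2.0 * math.pi * 200.0 * t / fs) for t in range(256)]
pitch = pitch_from_autocorr(voiced, fs)
silent = pitch_from_autocorr([0.0] * 256, fs)
print(pitch, silent)
```

Restricting the lag range keeps the zero-lag peak out of the search, which corresponds to the exclusion of the peak at a time difference of 0 described above.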

The feature quantity extraction unit 22 stores the pair of the pitch frequency and the integrated power value of each frame in the storage unit 14 as a set of feature quantities.

The feature quantity extraction unit 22 may use, as a feature quantity, a normalized pitch frequency obtained by dividing the pitch frequency by a predetermined value. Similarly, it may use a normalized integrated power value obtained by dividing the integrated power value by a predetermined value. The feature quantity extraction unit 22 may also compare the integrated power value with a noise determination threshold Thn, which represents the noise component contained in a frame, and store the pair of the integrated power value and the pitch frequency in the storage unit 14 only when the integrated power value is larger than Thn. Sets of feature quantities extracted from frames in which no speaker is talking are then not used in the fitting of the feature quantity distribution with probability distributions, described later, so the processing unit 13 can determine more accurately whether the conversation is an abnormal conversation.

The noise determination threshold Thn is preferably set adaptively according to the background noise level of the call audio. The feature quantity extraction unit 22 therefore determines that a frame in which neither the transmitting-side speaker nor the receiving-side speaker is talking is a silent frame containing only background noise. For example, the feature quantity extraction unit 22 determines that a frame is a silent frame if the integrated value of its power spectrum over the entire frequency band is less than a predetermined power threshold. The feature quantity extraction unit 22 then estimates the background noise level based on the integrated power value of the silent frames, for example, according to the following equation.
Here, Ps is the integrated power value of the latest silent frame, noiseP is the background noise level before the update, and noiseP' is the background noise level after the update. In this case, the noise determination threshold Thn is set, for example, according to the following equation.
Here, γ is a constant set in advance, for example, to 2 to 3 dB.
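The two equations referred to above do not appear in this text, so the following sketch is an assumption rather than the patented formula. It uses exponential smoothing for the noise level update and an additive offset for the threshold, with all levels expressed in dB. The function names and the smoothing factor alpha are hypothetical.

```python
def update_noise_level(noise_p, ps, alpha=0.99):
    """Assumed update rule: smooth the noise level noiseP toward the power Ps
    of the latest silent frame. The value of alpha is hypothetical."""
    return alpha * noise_p + (1.0 - alpha) * ps


def noise_threshold(noise_p, gamma_db=3.0):
    """Assumed form Thn = noiseP + gamma, with gamma of about 2 to 3 dB."""
    return noise_p + gamma_db


def keep_frame(power_db, noise_p, gamma_db=3.0):
    """Keep a frame's feature set only if its power exceeds Thn."""
    return power_db > noise_threshold(noise_p, gamma_db)


# Hypothetical levels in dB.
noise_p = update_noise_level(-50.0, -40.0, alpha=0.9)
print(noise_p, keep_frame(-44.0, noise_p), keep_frame(-48.0, noise_p))
```

A smoothing factor close to 1 makes the estimate change slowly, so a single unusually loud silent frame has little effect on the threshold.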

The fitting unit 23 determines whether the number of feature quantity sets stored in the storage unit 14 has reached a predetermined number (for example, 100 to 1000) sufficient for fitting the feature quantity distribution with probability distributions. When the number of sets reaches that predetermined number, the fitting unit 23 fits the feature quantity distribution with a mixture distribution containing the same number of probability distributions as the number of speakers. In this embodiment, the mixture distribution is a two-dimensional, two-component Gaussian mixture distribution whose dimensions are the pitch frequency and the integrated power value. The two-component Gaussian mixture distribution is one kind of normal mixture distribution.

Using each pair of pitch frequency and integrated power value obtained from a frame as a training sample, the fitting unit 23 performs maximum likelihood estimation of the parameters representing each Gaussian distribution in the two-component Gaussian mixture distribution. For this purpose, the fitting unit 23 uses, for example, the EM algorithm (also called the expectation-maximization method). For each Gaussian distribution in the mixture, the fitting unit 23 obtains, for example, the maximum likelihood estimates of the weighting coefficient, which is the probability that a training sample was generated by that Gaussian distribution, the mean vector (that is, the set of the mean values of the feature quantities), and the covariance matrix.
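The EM fitting of a two-dimensional, two-component Gaussian mixture can be sketched as follows. This is an illustrative standard-library implementation, not the fitting unit 23 itself. The function names, the initialization (splitting the samples at the median of the first feature), the fixed iteration count, the small regularization term, and the synthetic test data are all assumptions.

```python
import math
import random


def gauss2(x, mu, cov):
    """Density of a 2-D Gaussian with mean mu and 2x2 covariance cov at x."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T cov^-1 (x - mu) via the closed-form 2x2 inverse.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))


def fit_two_gaussians(samples, iters=50, reg=1e-6):
    """EM for a 2-component, 2-D Gaussian mixture: returns (weights, means, covs)."""
    n = len(samples)
    ordered = sorted(samples)  # assumed initialization: split on the first feature
    groups = [ordered[: n // 2], ordered[n // 2:]]
    w = [0.5, 0.5]
    mu = [[sum(p[k] for p in g) / len(g) for k in (0, 1)] for g in groups]
    gm = [sum(p[k] for p in samples) / n for k in (0, 1)]
    gv = [sum((p[k] - gm[k]) ** 2 for p in samples) / n + reg for k in (0, 1)]
    cov = [[[gv[0], 0.0], [0.0, gv[1]]] for _ in range(2)]
    for _ in range(iters):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in samples:
            p = [w[i] * gauss2(x, mu[i], cov[i]) for i in range(2)]
            total = sum(p) or 1e-300
            resp.append([v / total for v in p])
        # M-step: re-estimate weight, mean vector, and covariance matrix.
        for i in range(2):
            ni = sum(r[i] for r in resp) + 1e-12
            w[i] = ni / n
            mu[i] = [sum(r[i] * x[k] for r, x in zip(resp, samples)) / ni
                     for k in (0, 1)]
            cov[i] = [[sum(r[i] * (x[j] - mu[i][j]) * (x[k] - mu[i][k])
                           for r, x in zip(resp, samples)) / ni
                       + (reg if j == k else 0.0)
                       for k in (0, 1)] for j in (0, 1)]
    return w, mu, cov


# Synthetic (pitch, power) pairs for two hypothetical speakers.
rng = random.Random(0)
data = [(rng.gauss(120.0, 10.0), rng.gauss(60.0, 3.0)) for _ in range(200)]
data += [(rng.gauss(220.0, 10.0), rng.gauss(70.0, 3.0)) for _ in range(200)]
weights, means, covs = fit_two_gaussians(data)
print(weights, means)
```

EM converges to a local optimum, so the result depends on the initialization; a practical implementation would typically also stop on a convergence criterion instead of a fixed iteration count.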

The fitting unit 23 may use a log-normal distribution as the probability distribution for fitting the feature quantity distribution. In this case as well, the fitting unit 23 uses the EM algorithm to obtain the maximum likelihood estimates of the weighting coefficient, the mean vector, and the covariance matrix for each of the log-normal distributions, equal in number to the speakers, that make up the log-normal mixture distribution.
Instead of the EM algorithm, the fitting unit 23 may use the Markov chain Monte Carlo method or simulated annealing as the algorithm for obtaining the probability distributions that fit the feature quantity distribution.

The fitting unit 23 notifies the determination unit 24 of the maximum likelihood estimates of the weighting coefficient, the mean vector, and the covariance matrix of each probability distribution fitted to the feature quantity distribution.

The determination unit 24 calculates a degree of fitness representing how well the probability distributions fitted to the feature quantity distribution, equal in number to the speakers, match that distribution. If the degree of fitness is equal to or greater than a fitness determination threshold, the feature quantity distribution is approximated by the same number of probability distributions as the number of speakers, so the determination unit 24 determines that the conversation is a normal conversation. If the degree of fitness is less than the fitness determination threshold, the feature quantity distribution is not approximated by that number of probability distributions, so the determination unit 24 determines that the conversation is an abnormal conversation.
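The threshold decision can be sketched as follows, using the average log-likelihood of the samples under the fitted mixture as the degree of fitness, which is the measure this embodiment adopts. The function names, the density floor, the toy mixture parameters, and the threshold value of -5.0 are hypothetical and chosen only to make the example concrete.

```python
import math


def mixture_density(x, weights, means, covs):
    """Density of a 2-D Gaussian mixture at point x."""
    total = 0.0
    for w, mu, ((a, b), (c, d)) in zip(weights, means, covs):
        det = a * d - b * c
        dx, dy = x[0] - mu[0], x[1] - mu[1]
        q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
        total += w * math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))
    return total


def average_log_likelihood(samples, weights, means, covs, floor=1e-300):
    """Mean of log P(x_n | parameters) over all samples (floored to avoid log(0))."""
    logs = [math.log(max(mixture_density(x, weights, means, covs), floor))
            for x in samples]
    return sum(logs) / len(logs)


def is_abnormal(samples, weights, means, covs, fit_threshold):
    """Abnormal when the degree of fitness is below the determination threshold."""
    return average_log_likelihood(samples, weights, means, covs) < fit_threshold


# Toy mixture: two unit-covariance components at (0, 0) and (10, 10).
weights = [0.5, 0.5]
means = [(0.0, 0.0), (10.0, 10.0)]
covs = [((1.0, 0.0), (0.0, 1.0)), ((1.0, 0.0), (0.0, 1.0))]
well_fitted = [(0.0, 0.0), (10.0, 10.0)]
scattered = [(5.0, 5.0), (0.0, 10.0)]
print(is_abnormal(well_fitted, weights, means, covs, -5.0),
      is_abnormal(scattered, weights, means, covs, -5.0))
```

In practice the threshold would have to be tuned on recorded conversations, because the absolute value of the log-likelihood depends on the scale of the feature quantities.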

FIG. 4(a) shows an example of the distribution of voice feature amounts when two speakers are conversing in a normal state. FIG. 4(b), in contrast, shows an example of the distribution of voice feature amounts when at least one of the two speakers is conversing in an abnormal psychological state.
In FIGS. 4(a) and 4(b), the horizontal axis represents the integrated power value and the vertical axis represents the pitch frequency. Each point 400 represents one set of feature amounts. As shown in FIG. 4(a), when the two speakers are conversing in a normal state, that is, when the conversation is a normal conversation, the two normal distributions fitted to the distribution of the feature amounts, indicated by ellipses 401 and 402, approximate that distribution relatively well. The fitness is therefore high.

On the other hand, when the psychological state of at least one of the two speakers is no longer normal and the conversation becomes an abnormal conversation, the speakers raise their voices, for example, so the characteristics of their voices change from the normal state and the distribution of the feature amounts becomes scattered. As a result, the two normal distributions fitted to the distribution of the feature amounts, indicated by ellipses 403 and 404, do not approximate that distribution well. The fitness is therefore low.

In the present embodiment, the determination unit 24 calculates, as the fitness, the average log likelihood of the two-dimensional vector sequence according to the following equation.
$$\frac{1}{N}\sum_{n=1}^{N}\log P(x_n \mid \Omega), \qquad \Omega = \{w_i, \mu_i, \Sigma_i\}\ (i = 1, 2) \tag{4}$$
Here, P(x_n|Ω) represents the probability that the n-th two-dimensional vector x_n (corresponding to an individual learning sample in this embodiment) is output from the probability distribution with parameters Ω. N represents the total number of learning samples. w_i (i = 1, 2) represents the maximum likelihood estimate of the weighting coefficient of each Gaussian distribution. μ_i represents the maximum likelihood estimate of the mean vector of each Gaussian distribution (that is, the set of mean values of the feature amounts). Σ_i represents the covariance matrix of each Gaussian distribution.

The determination unit 24 compares the average log likelihood with the fitness determination threshold Thf. If the average log likelihood is equal to or greater than Thf, the determination unit 24 determines that the conversation is a normal conversation. The fitness determination threshold Thf is the lower limit of the average log likelihood at which the feature distribution can be regarded as approximated by as many probability distributions as there are speakers, and is, for example, determined experimentally in advance. If, on the other hand, the average log likelihood is less than Thf, the determination unit 24 determines that the conversation is an abnormal conversation and notifies the warning unit 25 of that result.
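The threshold test described above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the mixture parameters are taken as already estimated (for example by the EM algorithm), and the threshold value used in the usage note is arbitrary rather than an experimentally determined Thf.

```python
import math


def gaussian_pdf_2d(x, mean, cov):
    """Density of a two-dimensional normal distribution at point x."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form (x - mean)^T cov^-1 (x - mean), using the closed-form 2x2 inverse.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def average_log_likelihood(samples, weights, means, covs):
    """Average of log P(x_n | parameters) over all feature sets."""
    total = 0.0
    for x in samples:
        p = sum(w * gaussian_pdf_2d(x, m, c)
                for w, m, c in zip(weights, means, covs))
        # Guard against log(0) when a sample lies extremely far from every component.
        total += math.log(max(p, 1e-300))
    return total / len(samples)


def is_normal_conversation(samples, weights, means, covs, thf):
    """True when the fitness reaches the threshold (normal conversation)."""
    return average_log_likelihood(samples, weights, means, covs) >= thf
```

With two well-separated components, samples lying near the component means give a much higher average log likelihood than samples scattered between them, so only the scattered case falls below a threshold such as -5.0 (an arbitrary value chosen for this example).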

When notified by the determination unit 24 of the determination result that the conversation is an abnormal conversation, the warning unit 25 reads a warning sound signal from the storage unit 14. The warning unit 25 then outputs the warning sound signal to the speaker 16 via the D/A converter 15.

Note that the abnormal conversation detection apparatus 1 may have a warning light source. In this case, when the conversation is determined to be an abnormal conversation, the warning unit 25 may warn the speaker on the transmitting side by turning on or blinking the light source.

FIG. 5 is an operation flowchart of the abnormal conversation detection process. The processing unit 13 executes the abnormal conversation detection process for each call according to the following operation flowchart. As initialization, the processing unit 13 erases the pitch frequencies and integrated power values stored in the storage unit 14.

The spectrum calculation unit 21 calculates the spectrum signal of the current frame, which is the latest frame cut out from the audio signal, by applying a time-frequency transform to it (step S101). The spectrum calculation unit 21 outputs the spectrum signal of the current frame to the feature amount extraction unit 22.

Based on the spectrum signal of the current frame, the feature amount extraction unit 22 extracts two or more feature amounts that represent characteristics of a human voice, such as the integrated power value and the pitch frequency (step S102). The feature amount extraction unit 22 then stores the extracted set of feature amounts in the storage unit 14.

The fitting unit 23 determines whether the number of feature amount sets stored in the storage unit 14 has reached a predetermined number (step S103). If it has not (step S103: No), the processing unit 13 sets the next frame as the current frame (step S104) and repeats the processing from step S101.

If, on the other hand, the number of feature amount sets stored in the storage unit 14 has reached the predetermined number (step S103: Yes), the fitting unit 23 fits the distribution of the feature amounts with a mixture distribution containing as many probability distributions as there are speakers (step S105). The fitting unit 23 then notifies the determination unit 24 of the maximum likelihood estimates of the parameters that describe the fitted probability distributions (for example, the weighting coefficient, mean vector, and covariance matrix of each normal distribution in the mixture distribution).

The determination unit 24 calculates the fitness of the probability distributions fitted to the distribution of the feature amounts (step S106). The determination unit 24 then determines whether the fitness is equal to or greater than the fitness determination threshold Thf (step S107). If it is (step S107: Yes), the determination unit 24 determines that the probability distributions approximate the distribution of the feature amounts, that is, that the conversation is a normal conversation (step S108).

If, on the other hand, the fitness is less than the fitness determination threshold Thf (step S107: No), the determination unit 24 determines that the probability distributions do not approximate the distribution of the feature amounts, that is, that the conversation is an abnormal conversation (step S109). The determination unit 24 then notifies the warning unit 25 that the conversation is an abnormal conversation, and the warning unit 25 issues a warning to the speaker on the transmitting side (step S110).
After step S108 or S110, the processing unit 13 ends the abnormal conversation detection process.
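The flow of FIG. 5 can be sketched as a single loop. This is a minimal illustration rather than the patent's implementation: the callables `extract_features`, `fit_mixture`, and `fitness` are hypothetical stand-ins for units 21 to 24, and returning `None` when the call ends before enough feature sets accumulate is an assumption, since the flowchart does not describe that case.

```python
def detect_abnormal_conversation(frames, extract_features, fit_mixture,
                                 fitness, required_count, thf):
    """Return True (abnormal), False (normal), or None (not enough frames)."""
    feature_sets = []  # initialization: start with no stored feature sets
    for frame in frames:  # S104: move on to the next frame
        # S101-S102: compute the spectrum and extract one set of feature amounts.
        feature_sets.append(extract_features(frame))
        # S103: keep collecting until the predetermined number is reached.
        if len(feature_sets) < required_count:
            continue
        params = fit_mixture(feature_sets)        # S105: fit the mixture distribution
        score = fitness(feature_sets, params)     # S106: compute the fitness
        return score < thf                        # S107-S109: below Thf means abnormal
    return None  # assumption: the audio ended before enough feature sets were stored
```

A caller would issue the warning of step S110 when the function returns `True`.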

Table 1 shows the experimental results of determining whether a speaker is in an abnormal state, for the prior art disclosed in JP 2013-011830 A and for the present embodiment. The experiment used 100 audio signals, each a recording of a conversation between two of 25 speakers.

As shown in Table 1, the prior art achieved a correct answer rate of 47% for both abnormal and normal conversations, whereas the present embodiment achieved correct answer rates of 70% for abnormal conversations and 69% for normal conversations. The abnormal conversation detection apparatus according to the present embodiment was thus shown to detect abnormal conversations more accurately than the prior art.

As described above, this abnormal conversation detection apparatus determines whether a conversation is abnormal according to whether the distribution of two or more types of feature amounts, extracted from an audio signal containing the voices of a plurality of speakers, can be approximated by as many probability distributions as there are speakers. The apparatus can therefore determine whether the conversation is abnormal even when the audio signal contains the voices of a plurality of speakers.

When the number of speakers is three or more and is known in advance, the fitting unit 23 may fit the distribution of the feature amounts with a mixture distribution containing as many probability distributions as there are speakers.

Next, an abnormal conversation detection apparatus according to a second embodiment will be described. This apparatus determines whether a conversation is abnormal based on an audio signal containing the conversation of an unspecified number of speakers, two or more.

FIG. 6 is a functional block diagram of the processing unit of the abnormal conversation detection apparatus according to the second embodiment. The processing unit 13' includes a spectrum calculation unit 21, a feature amount extraction unit 22, a fitting unit 23, a determination unit 24, a warning unit 25, and a speaker number estimation unit 26. The processing unit 13' differs from the processing unit 13 of the first embodiment shown in FIG. 3 in that it has the speaker number estimation unit 26 and in the processing performed by the fitting unit 23 and the determination unit 24. The following description therefore covers the speaker number estimation unit 26, the fitting unit 23, and the determination unit 24. For the other components of the abnormal conversation detection apparatus, refer to the description of the corresponding components in the first embodiment.

The speaker number estimation unit 26 estimates the number of speakers participating in the conversation. For example, as disclosed in Daben Liu et al., "Online Speaker Clustering", in Proceedings of ICASSP 2004, vol. I, pp. 333-336, 2004, the speaker number estimation unit 26 clusters the sets of feature amounts extracted from the frames, using a genetic algorithm or the like. The speaker number estimation unit 26 then takes the number of clusters obtained as the number of speakers.
Note that the speaker number estimation unit 26 may estimate the number of speakers by any other method of estimating the number of speakers from an audio signal.
The speaker number estimation unit 26 notifies the determination unit 24 of the estimated number of speakers.

The fitting unit 23 varies the number of probability distributions included in the mixture distribution and, for each number, calculates the value of Akaike's Information Criterion (AIC) as the fitness. The AIC value is calculated by the following equation.
$$\mathrm{AIC} = -2\ln(L) + 2k \tag{5}$$
Here, L is the maximum likelihood (for example, the likelihood, for the samples used in the fitting, obtained after the samples of the feature amount distribution have been fitted with the probability distributions using the EM algorithm), and ln(L) is, for example, the maximum value of the average log likelihood according to equation (4) when the mixture distribution contains the number of probability distributions of interest. k is the number of free parameters and increases with the number of probability distributions in the mixture distribution. For example, when a Gaussian mixture distribution or a lognormal mixture distribution is used, defining one probability distribution requires a weighting coefficient, a mean vector, and a covariance matrix. Each additional probability distribution therefore increases k by the number of those parameters.

Instead of the AIC, the fitting unit 23 may calculate the Bayesian information criterion (BIC). The BIC value is calculated by the following equation.
$$\mathrm{BIC} = -2\ln(L) + k\ln(m) \tag{6}$$
Here, as in equation (5), L is the maximum likelihood (the likelihood, for the samples used in the fitting, obtained after the samples of the feature amount distribution have been fitted with the probability distributions using the EM algorithm), and k is the number of free parameters. m represents the sample size, that is, the number of feature amount sets used as learning samples.

In this case, the number of probability distributions that minimizes the AIC or BIC value is presumed to fit the distribution of the feature amounts best. The fitting unit 23 therefore finds the number of probability distributions at which the AIC or BIC value is smallest. This number corresponds to the number of probability distributions best suited to fitting the distribution of the feature amounts. The fitting unit 23 notifies the determination unit 24 of this number.
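As a rough illustration of this selection step, the sketch below scores each candidate number of distributions with the standard AIC or BIC formula and returns the one with the smallest value. The function names are hypothetical. The parameter count follows the description above (one weighting coefficient, one mean vector, and one symmetric covariance matrix per distribution); some conventions subtract one because the weights sum to 1. The log-likelihood values are supplied by the caller, so the sketch does not decide whether a total or an average log likelihood is used.

```python
import math


def gmm_free_parameters(num_components, dim):
    """Free parameters of a mixture of full-covariance Gaussians (assumed counting)."""
    # Per component: 1 weight + dim means + dim*(dim+1)/2 distinct covariance entries.
    per_component = 1 + dim + dim * (dim + 1) // 2
    return num_components * per_component


def aic(log_likelihood, k):
    """Akaike's Information Criterion."""
    return -2.0 * log_likelihood + 2.0 * k


def bic(log_likelihood, k, m):
    """Bayesian information criterion for m samples."""
    return -2.0 * log_likelihood + k * math.log(m)


def best_component_count(log_likelihoods, dim, m=None):
    """Pick the component count with the smallest AIC (or BIC when m is given).

    log_likelihoods maps each candidate count to its maximized log likelihood.
    """
    def score(item):
        count, log_likelihood = item
        k = gmm_free_parameters(count, dim)
        return aic(log_likelihood, k) if m is None else bic(log_likelihood, k, m)

    return min(log_likelihoods.items(), key=score)[0]
```

For two-dimensional feature sets, going from two to three distributions must raise the log likelihood by more than 6 for the AIC to prefer three, because each added distribution costs six parameters under this counting.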

The determination unit 24 compares the number of speakers notified by the speaker number estimation unit 26 with the number of probability distributions, notified by the fitting unit 23, that is best suited to fitting the distribution of the feature amounts. If the number of probability distributions equals the number of speakers, the distribution of the feature amounts can be regarded as approximated by as many probability distributions as there are speakers, so the determination unit 24 determines that the conversation is a normal conversation. If, on the other hand, the number of probability distributions is greater than the number of speakers, the distribution of the feature amounts is not approximated by that number of probability distributions, so the determination unit 24 determines that the conversation is an abnormal conversation.

FIG. 7 is an operation flowchart of the abnormal conversation detection process according to the second embodiment. The processing unit 13' executes the abnormal conversation detection process according to the following flowchart in place of steps S105 to S110 of the abnormal conversation detection process according to the first embodiment shown in FIG. 5.

When the number of feature amount sets stored in the storage unit 14 has reached the predetermined number in step S103, the speaker number estimation unit 26 estimates the number of speakers participating in the conversation (step S201). The speaker number estimation unit 26 then notifies the determination unit 24 of the estimated number of speakers.
The fitting unit 23 calculates the number of probability distributions best suited to fitting the distribution of the feature amounts (step S202). The fitting unit 23 then notifies the determination unit 24 of that number.

The determination unit 24 determines whether the number of probability distributions is greater than the number of speakers (step S203). If the number of probability distributions equals the number of speakers (step S203: No), the determination unit 24 determines that the distribution of the feature amounts is approximated by as many probability distributions as there are speakers, that is, that the conversation is a normal conversation (step S204).

If, on the other hand, the number of probability distributions is greater than the number of speakers (step S203: Yes), the determination unit 24 determines that the distribution of the feature amounts is not approximated by as many probability distributions as there are speakers, that is, that the conversation is an abnormal conversation (step S205). The determination unit 24 then notifies the warning unit 25 that the conversation is an abnormal conversation, and the warning unit 25 issues a warning to the speaker on the transmitting side (step S206).
After step S204 or S206, the processing unit 13' ends the abnormal conversation detection process. Note that steps S201 and S202 may be performed in the reverse order or in parallel.
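The branch at step S203 reduces to a single comparison, sketched below with a hypothetical function name. The text only describes the cases where the best-fitting number of distributions equals or exceeds the number of speakers; treating a smaller number as normal simply follows the "No" branch of the flowchart and is an assumption of this sketch.

```python
def judge_conversation(best_component_count, num_speakers):
    """Return 'abnormal' when more distributions than speakers are needed (S203: Yes)."""
    if best_component_count > num_speakers:
        return "abnormal"  # S205: the caller then issues the warning of S206
    return "normal"        # S204: taken whenever the count is not greater
```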

According to the second embodiment, the abnormal conversation detection apparatus can appropriately determine whether a conversation is abnormal even when the number of speakers participating in the conversation is an unspecified number of two or more.

According to a modification of the second embodiment, the number of speakers participating in the conversation may be entered via a user interface (not shown) such as a touch panel. In this case, the speaker number estimation unit 26 may be omitted.

According to a modification of each of the above embodiments, the feature amount extraction unit 22 may calculate, for each frame, the norm of the delta cepstrum as a feature representing a human voice, either instead of or together with the integrated power value. The norm of the delta cepstrum is calculated by the following equation.
Here, C_t^(n) represents the n-th order cepstrum of frame t, and ΔC(n) represents the delta cepstrum.
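For orientation only, the sketch below computes a delta cepstrum norm using one widely used definition: a regression over neighbouring frames for each cepstral order, followed by the Euclidean norm over the orders. This definition, the `width` parameter, the edge handling, and the function names are all assumptions introduced here and may differ from the equation of this document.

```python
import math


def delta_cepstrum(cepstra, t, width=2):
    """Regression-based delta of every cepstral order at frame t (assumed definition).

    cepstra is a list of frames, each a list of cepstral coefficients.
    """
    last = len(cepstra) - 1
    denom = 2 * sum(k * k for k in range(1, width + 1))
    deltas = []
    for n in range(len(cepstra[t])):
        # Frames beyond either end are replaced by the nearest existing frame.
        num = sum(k * (cepstra[min(t + k, last)][n] - cepstra[max(t - k, 0)][n])
                  for k in range(1, width + 1))
        deltas.append(num / denom)
    return deltas


def delta_cepstrum_norm(cepstra, t, width=2):
    """Euclidean norm of the delta cepstrum at frame t."""
    return math.sqrt(sum(d * d for d in delta_cepstrum(cepstra, t, width)))
```

A cepstrum that does not change from frame to frame gives a norm of zero, while rapid spectral change gives a large norm.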

The feature amount extraction unit 22 may also calculate, for each frame, the spectral flatness measure, which is expressed as the ratio of the geometric mean to the arithmetic mean of the spectrum as in the following equation, as a feature amount representing a human voice, either instead of or together with the pitch frequency.
Here, f_k is the spectrum signal at frequency k (= 1, ..., N), and N is the total number of frequencies at which the spectrum signal is calculated (that is, half the number of sampling points in a frame). Ξ(f) is the flatness measure. The flatness measure is described, for example, in Hayakawa et al., "Speaker recognition using personality information included in the harmonic structure of the linear prediction residual spectrum", IEICE Journal A, Vol. J80-A, No. 9, pp. 1360-1367, 1997.
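The ratio of geometric mean to arithmetic mean described above can be sketched as follows. The function name is hypothetical, and the sketch assumes that the spectral values passed in are strictly positive (for example magnitudes or powers); the text does not state which form of the spectrum signal is used.

```python
import math


def spectral_flatness(spectrum):
    """Geometric mean of the spectral values divided by their arithmetic mean.

    Assumes every value in `spectrum` is strictly positive.
    """
    n = len(spectrum)
    # Average the logarithms instead of multiplying, to avoid overflow or underflow.
    log_geometric_mean = sum(math.log(v) for v in spectrum) / n
    arithmetic_mean = sum(spectrum) / n
    return math.exp(log_geometric_mean) / arithmetic_mean
```

A perfectly flat spectrum gives a value of 1, and a spectrum dominated by a few strong peaks gives a value close to 0.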

In this case, the fitting unit 23 may fit the distribution of the obtained delta cepstrum norm and flatness measure with as many probability distributions as there are speakers. Alternatively, the fitting unit 23 may fit the distribution of three or more of the obtained power, delta cepstrum norm, pitch frequency, and flatness measure with as many probability distributions as there are speakers.

According to another modification, the processing unit may continue the abnormal conversation detection process until the conversation ends, even after it has once determined that the conversation is a normal conversation. In this case, the feature amount extraction unit 22 continues to extract a set of feature amounts for each frame, and the fitting unit 23 fits the distribution of the feature amounts with as many probability distributions as there are speakers, based on the latest predetermined number (for example, 100 to 1000) of feature amount sets. With this modification, the abnormal conversation detection apparatus can warn the speaker on the transmitting side during the call that the conversation has become abnormal. The warning can prompt that speaker to interrupt the call before suffering some loss while still in an abnormal psychological state, or give the speaker an opportunity to return to a normal state.
In addition, when the condition for determining that the conversation is abnormal is satisfied, the abnormal conversation detection apparatus according to each of the above embodiments or modifications may determine that the psychological state of one of the speakers participating in the conversation is abnormal.
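The continued-monitoring modification, which refits on only the latest predetermined number of feature amount sets, maps naturally onto a fixed-size buffer. The sketch below is one possible arrangement; the class and method names are hypothetical.

```python
from collections import deque


class RecentFeatureBuffer:
    """Keeps only the most recent `window_size` sets of feature amounts."""

    def __init__(self, window_size):
        # A deque with maxlen discards the oldest entry automatically when full.
        self._buffer = deque(maxlen=window_size)

    def add(self, feature_set):
        """Store the feature set extracted from the newest frame."""
        self._buffer.append(feature_set)

    def is_full(self):
        """True once the predetermined number of feature sets is available."""
        return len(self._buffer) == self._buffer.maxlen

    def snapshot(self):
        """Return the stored feature sets, oldest first, for fitting."""
        return list(self._buffer)
```

With a window size in the range mentioned above (100 to 1000), the fitting and determination steps would be rerun on `snapshot()` as new frames arrive, so a warning can be issued while the call is still in progress.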

The abnormal conversation detection apparatus may also be implemented in a mobile phone.
FIG. 8 is a schematic configuration diagram of a mobile phone in which the abnormal conversation detection apparatus according to any of the above embodiments or their modifications is implemented. The mobile phone 30 includes a microphone 31, a communication unit 32, a storage medium access device 33, a storage unit 34, a user interface unit 35, a processing unit 36, and a speaker 37.

The microphone 31 is an example of an audio input unit. It collects the voice uttered by the speaker on the transmitting side near the microphone 31, generates an analog audio signal, and outputs the analog audio signal to an A/D converter (not shown). The A/D converter generates a digital audio signal by sampling the analog audio signal at a predetermined sampling rate and digitizing it. The A/D converter then outputs the digitized audio signal to the processing unit 36.

The communication unit 32 has a wireless communication circuit for connecting the mobile phone 30 to a telephone line via a base station. The communication unit 32 receives, from the telephone line via the base station, a data stream containing a downlink audio signal, which is an electrical signal representing the voice uttered by the speaker on the receiving side. The communication unit 32 extracts the downlink audio signal from the data stream and outputs it to the processing unit 36.

The storage medium access device 33 is a device that accesses a storage medium 38 such as a semiconductor memory card. For example, the storage medium access device 33 reads a computer program that is stored on the storage medium 38 and is to be executed on the processing unit 36, and passes it to the processing unit 36. The storage medium access device 33 may, for example, read the abnormal conversation detection computer program from the storage medium 38 and pass it to the processing unit 36.

The storage unit 34 has, for example, a readable and writable nonvolatile semiconductor memory and a readable and writable volatile semiconductor memory. The storage unit 34 stores various application programs executed on the processing unit 36 and various data. The storage unit 34 may also store a computer program for executing the abnormal conversation detection process according to each of the above embodiments or modifications, together with the various data used in that process. The storage unit 34 may further store an audio signal obtained by combining the audio signal acquired via the microphone 31 with the downlink audio signal acquired via the communication unit 32.

The user interface unit 35 has, for example, an input device such as a plurality of operation keys and a display device such as a liquid crystal display. Alternatively, the user interface unit 35 may have a device in which the input device and the display device are integrated, such as a touch panel display. The user interface unit 35 generates an operation signal corresponding to the operation of the input device by the speaker on the transmitting side and outputs the operation signal to the processing unit 36. The user interface unit 35 also displays various information received from the processing unit 36 on the display device. The user interface unit 35 is furthermore an example of an output unit that outputs a warning: it may receive from the processing unit 36 a warning message issued when the processing unit 36 determines that the conversation is abnormal, and display the warning message on the display device.

The processing unit 36 has one or more processors, a memory circuit, and peripheral circuits. The processing unit 36 is connected to each unit of the mobile phone 30 via signal lines and controls each unit. The processing unit 36 also performs call setup in response to an operation by the speaker or the arrival of a call signal, and executes various processes for maintaining communication. When a call starts, the processing unit 36 acquires an audio signal obtained by combining the audio signal acquired via the microphone 31 with the downlink audio signal acquired via the communication unit 32. The processing unit 36 then determines whether the conversation is abnormal by executing, on the combined audio signal, the abnormal conversation detection process executed by the processing unit of the abnormal conversation detection apparatus in the above embodiments. In this example, the processing unit 36 can learn of the start and end of the conversation via the application programming interface (API) of the telephone application of the mobile phone 30.

In this example, when the processing unit 36 determines that the conversation is an abnormal conversation, it causes the display device of the user interface unit 35 to display a warning message. Alternatively, the processing unit 36 may cause the speaker 37, which is another example of the output unit, to play back an audio signal of the warning message.
Alternatively, when the processing unit 36 determines that the conversation is an abnormal conversation, it may use the e-mail function of the mobile phone 30 to automatically send a warning e-mail, indicating that the conversation has been determined to be an abnormal conversation, to the e-mail address of a person concerned designated in advance.
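
The three warning paths described above (on-screen message, spoken warning, and warning e-mail) can be sketched as follows. This is a minimal illustration in Python and not the handset's actual software: the function names, the message wording, and the addresses are all hypothetical, and the sketch only builds the e-mail message rather than sending it. The text presents the three paths as alternatives; the sketch triggers all of them for brevity.

```python
from email.message import EmailMessage

# Hypothetical wording; the source does not specify the warning text.
WARNING_TEXT = "This call has been judged to be an abnormal conversation."

def build_warning_mail(sender, recipient):
    """Build (but do not send) the warning e-mail for a designated contact."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Abnormal conversation warning"
    msg.set_content(WARNING_TEXT)
    return msg

def issue_warnings(is_abnormal, show, speak, send_mail, sender, recipients):
    """Dispatch the warning to every output path when the call is judged abnormal.

    show, speak and send_mail are callables supplied by the caller; they stand
    in for the display device, the speaker, and the phone's e-mail function.
    Returns the number of e-mails handed to send_mail.
    """
    if not is_abnormal:
        return 0
    show(WARNING_TEXT)
    speak(WARNING_TEXT)
    for recipient in recipients:
        send_mail(build_warning_mail(sender, recipient))
    return len(recipients)
```

Passing the output paths in as callables keeps the decision logic separate from the device-specific display, audio, and mail functions.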

Furthermore, the abnormal conversation detection device according to each of the above embodiments or their modifications may be implemented in a server client system.
FIG. 9 is a schematic configuration diagram of a server client system in which the abnormal conversation detection device according to each of the above embodiments or their modifications is implemented.
The server client system 100 includes a terminal 110 and a server 120, which can communicate with each other via a communication network 130. The server client system 100 may include a plurality of terminals 110. Similarly, the server client system 100 may include a plurality of servers 120.

The terminal 110 includes a voice input unit 111, a storage unit 112, a communication unit 113, a control unit 114, and a speaker 115. The voice input unit 111, the storage unit 112, the communication unit 113, and the speaker 115 are connected to the control unit 114 via, for example, a bus.

The voice input unit 111 includes, for example, an audio interface and an A/D converter. The voice input unit 111 acquires an analog audio signal containing the conversation from a call recording adapter connected between the modular rosette and the telephone, and digitizes the audio signal by sampling it at a predetermined sampling rate. The voice input unit 111 then outputs the digitized audio signal to the control unit 114.

The storage unit 112 includes, for example, a nonvolatile semiconductor memory and a volatile semiconductor memory. The storage unit 112 stores a computer program for controlling the terminal 110, the identification information of the terminal 110, and the various data and computer programs used in the abnormal conversation detection processing.

The communication unit 113 includes an interface circuit for connecting the terminal 110 to the communication network 130. The communication unit 113 transmits the sets of feature amounts received from the control unit 114, together with the identification information of the terminal 110, to the server 120 via the communication network 130. The communication unit 113 also receives from the server 120, via the communication network 130, the determination result as to whether the conversation is an abnormal conversation, and passes it to the control unit 114.

The control unit 114 includes one or more processors and their peripheral circuits. Of the functions of the processing unit according to the above embodiments or modifications, the control unit 114 implements the functions of the spectrum calculation unit 21, the feature amount extraction unit 22, and the warning unit 25. That is, the control unit 114 divides the audio signal into frames and extracts from each frame two or more types of feature amounts representing the characteristics of a human voice. The control unit 114 then transmits the set of feature amounts for each frame, together with the identification information of the terminal 110, to the server 120 via the communication unit 113 and the communication network 130.
When the control unit 114 receives from the server 120, via the communication network 130 and the communication unit 113, a determination result indicating that the conversation is an abnormal conversation, it outputs a warning sound via the speaker 115.
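
The terminal-side steps (split the signal into frames, extract at least two feature amounts per frame, and send them with the terminal's identification information) can be sketched as follows. This is a minimal Python illustration under stated assumptions: log power and zero-crossing rate are stand-in features chosen only because they are easy to compute, not the feature amounts of the embodiments, and the frame length, hop size, JSON layout, and all function names are hypothetical.

```python
import json
import math

def split_frames(samples, frame_len, hop):
    """Cut the sample sequence into fixed-length, possibly overlapping frames."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop)]

def frame_features(frame):
    """Return two stand-in feature amounts for one frame.

    Log power (dB) and zero-crossing rate are illustrative choices only.
    """
    power = sum(x * x for x in frame) / len(frame)
    log_power = 10.0 * math.log10(power + 1e-12)  # small offset avoids log(0)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [log_power, crossings / (len(frame) - 1)]

def build_payload(terminal_id, samples, frame_len=256, hop=128):
    """Bundle the per-frame feature sets with the terminal's identifier."""
    features = [frame_features(f) for f in split_frames(samples, frame_len, hop)]
    return json.dumps({"terminal_id": terminal_id, "features": features})
```

For example, a 1024-sample signal with `frame_len=256` and `hop=128` yields seven frames, so the payload carries seven two-element feature sets. Only features leave the terminal in this arrangement, not the recorded audio itself.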

The server 120 includes a communication unit 121, a storage unit 122, and a processing unit 123. The communication unit 121 and the storage unit 122 are connected to the processing unit 123 via a bus.

The communication unit 121 includes an interface circuit for connecting the server 120 to the communication network 130. The communication unit 121 receives the set of feature amounts for each frame and the identification information of the terminal 110 from the terminal 110 via the communication network 130 and passes them to the processing unit 123. Based on the identification information of the terminal 110, the communication unit 121 also transmits the determination result that the conversation is an abnormal conversation, received from the processing unit 123, to the terminal 110 via the communication network 130.

The storage unit 122 includes, for example, a nonvolatile semiconductor memory and a volatile semiconductor memory. The storage unit 122 stores, among other things, a computer program for controlling the server 120. The storage unit 122 may also store a computer program for executing the abnormal conversation detection processing and the sets of feature amounts for each frame received from each terminal.
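
As a rough illustration of how a server might keep the feature sets received from several terminals apart and address its reply by terminal identifier, consider the following Python sketch. The class name, the JSON field names, and the reply layout are all hypothetical; the source specifies only that feature sets arrive with the terminal's identification information and that the result is returned based on it.

```python
import json
from collections import defaultdict

class FeatureStore:
    """In-memory stand-in for the server-side storage of per-terminal features."""

    def __init__(self):
        self._frames = defaultdict(list)

    def append(self, payload_json):
        """Store the feature sets in one received payload; return the terminal id."""
        payload = json.loads(payload_json)
        terminal_id = payload["terminal_id"]
        self._frames[terminal_id].extend(payload["features"])
        return terminal_id

    def frames(self, terminal_id):
        """Return a copy of all feature sets received so far from one terminal."""
        return list(self._frames.get(terminal_id, []))

def make_reply(terminal_id, abnormal):
    """Build the determination result addressed to the originating terminal."""
    return json.dumps({"terminal_id": terminal_id, "abnormal": bool(abnormal)})
```

Keying the store by terminal identifier lets one server accumulate frames for several concurrent conversations without mixing their feature distributions.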

The processing unit 123 includes one or more processors and their peripheral circuits. Of the functions of the processing unit according to the above embodiments or modifications, the processing unit 123 implements the functions of the fitting unit 23 and the determination unit 24. The processing unit 123 may further implement the function of the speaker number estimation unit 26. That is, from the sets of feature amounts for each frame received from the terminal 110, the processing unit 123 fits the distribution of the feature amounts with the same number of probability distributions as the number of speakers. The processing unit 123 determines that the conversation is a normal conversation if the fitted probability distributions approximate the distribution of the feature amounts, and determines that the conversation is an abnormal conversation if they do not. The processing unit 123 then transmits the determination result to the terminal 110 via the communication unit 121 and the communication network 130.
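
One variant of the determination, described in the supplementary notes, selects the number of probability distributions that minimizes the Akaike or Bayesian information criterion and judges the conversation abnormal when that number exceeds the number of speakers. The Python sketch below illustrates only this selection and decision step using the Bayesian information criterion, BIC = -2 ln L + k ln N. It assumes that the maximized log-likelihood for each candidate number of distributions has already been obtained by some fitting routine, and it assumes Gaussian components with full covariance matrices when counting parameters; both are assumptions of this sketch, and all function names are hypothetical.

```python
import math

def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: -2 ln L + k ln N (smaller is better)."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)

def gmm_param_count(n_components, n_features):
    """Free parameters of a Gaussian mixture with full covariances (assumed model).

    Each component has a mean vector and a symmetric covariance matrix;
    the mixing weights add n_components - 1 parameters.
    """
    per_component = n_features + n_features * (n_features + 1) // 2
    return n_components * per_component + (n_components - 1)

def best_component_count(log_likelihoods, n_features, n_samples):
    """Pick the number of distributions with the smallest BIC.

    log_likelihoods maps a candidate number of distributions to the maximized
    log-likelihood obtained by fitting that many distributions to the data.
    """
    scores = {
        k: bic(ll, gmm_param_count(k, n_features), n_samples)
        for k, ll in log_likelihoods.items()
    }
    return min(scores, key=scores.get), scores

def is_abnormal(best_count, n_speakers):
    """Abnormal when more distributions are needed than there are speakers."""
    return best_count > n_speakers
```

With made-up log-likelihoods of -5200, -4800, -4700, and -4695 for one to four distributions, two feature amounts per frame, and 1000 frames, the BIC is smallest for three distributions, so a two-speaker call would be judged abnormal under this rule.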

According to this embodiment, each terminal 110 can obtain a determination result as to whether a conversation is an abnormal conversation simply by extracting the set of feature amounts for each frame from the audio signal in which the conversation is recorded and transmitting it to the server 120.

A computer program that causes a computer to implement the functions of the processing unit of the abnormal conversation detection device according to each of the above embodiments or modifications may be provided in a form recorded on a computer-readable medium such as a magnetic recording medium or an optical recording medium.

All examples and specific terms given here are intended for instructional purposes, to help the reader understand the present invention and the concepts contributed by the inventor to furthering the art, and should be construed as not being limited to the configuration of any example in this specification, or to such specifically recited examples and conditions, relating to showing the superiority or inferiority of the present invention. Although embodiments of the present invention have been described in detail, it should be understood that various changes, substitutions, and modifications can be made thereto without departing from the spirit and scope of the present invention.

The following supplementary notes are further disclosed regarding the embodiments described above and their modifications.
(Supplementary Note 1)
An abnormal conversation detection device comprising:
a voice input unit that inputs an audio signal including a conversation of a plurality of speakers;
a storage unit;
a feature amount extraction unit that extracts, from the audio signal, at least two feature amounts representing characteristics of a human voice for each frame having a predetermined time length, and stores the set of the at least two feature amounts in the storage unit;
a fitting unit that fits the distribution of the feature amounts stored in the storage unit with the same number of probability distributions as the number of the speakers; and
a determination unit that determines whether the same number of probability distributions as the number of the speakers approximates the distribution of the feature amounts, and determines that the conversation is an abnormal conversation when the same number of probability distributions as the number of the speakers does not approximate the distribution of the feature amounts.
(Supplementary Note 2)
The abnormal conversation detection device according to Supplementary Note 1, wherein
the fitting unit fits the distribution of the sets of the at least two feature amounts with a mixture distribution including the same number of probability distributions as the number of the speakers, and
the determination unit calculates a fitness representing the degree to which each probability distribution included in the mixture distribution fits the distribution of the feature amounts, and determines that the conversation is an abnormal conversation when the fitness is less than a threshold corresponding to the lower limit of the fitness obtained when each probability distribution approximates the distribution of the feature amounts.
(Supplementary Note 3)
The abnormal conversation detection device according to Supplementary Note 1, wherein
the fitting unit calculates the fitness for each number of probability distributions while changing the number of probability distributions included in a mixture distribution, and obtains, based on the fitness, the number of probability distributions that best fits the distribution of the feature amounts, and
the determination unit determines that the conversation is an abnormal conversation when that number of probability distributions is larger than the number of the speakers.
(Supplementary Note 4)
The abnormal conversation detection device according to Supplementary Note 3, wherein the fitting unit calculates the Akaike information criterion or the Bayesian information criterion as the fitness, and obtains the number of probability distributions at which the Akaike information criterion or the Bayesian information criterion is minimized as the number of probability distributions that best fits the distribution of the feature amounts.
(Supplementary Note 5)
The abnormal conversation detection device according to Supplementary Note 3 or 4, further comprising a speaker number estimation unit that estimates the number of the speakers from the audio signal.
(Supplementary Note 6)
An abnormal conversation detection method comprising:
acquiring an audio signal including a conversation of a plurality of speakers;
extracting, from the audio signal, at least two feature amounts representing characteristics of a human voice for each frame having a predetermined time length;
fitting the distribution of the extracted feature amounts with the same number of probability distributions as the number of the speakers; and
determining whether the same number of probability distributions as the number of the speakers approximates the distribution of the feature amounts, and determining that the conversation is an abnormal conversation when the same number of probability distributions as the number of the speakers does not approximate the distribution of the feature amounts.
(Supplementary Note 7)
A computer program for abnormal conversation detection that causes a computer to execute:
acquiring an audio signal including a conversation of a plurality of speakers;
extracting, from the audio signal, at least two feature amounts representing characteristics of a human voice for each frame having a predetermined time length;
fitting the distribution of the extracted feature amounts with the same number of probability distributions as the number of the speakers; and
determining whether the same number of probability distributions as the number of the speakers approximates the distribution of the feature amounts, and determining that the conversation is an abnormal conversation when the same number of probability distributions as the number of the speakers does not approximate the distribution of the feature amounts.

Description of Symbols
1 Abnormal conversation detection device
2-1, 2-2 Telephone
3 Distributor
4 Modular rosette
5 Call recording adapter
6 Telephone line
11 Interface unit
12 A/D converter
13, 13' Processing unit
14 Storage unit
15 D/A converter
16 Speaker
21 Spectrum calculation unit
22 Feature amount extraction unit
23 Fitting unit
24 Determination unit
25 Warning unit
26 Speaker number estimation unit
30 Mobile phone (abnormal conversation detection device)
31 Microphone
32 Communication unit
33 Storage medium access device
34 Storage unit
35 User interface unit
36 Processing unit
37 Speaker
38 Storage medium
100 Server client system
110 Terminal
111 Voice input unit
112 Storage unit
113 Communication unit
114 Control unit
115 Speaker
120 Server
121 Communication unit
122 Storage unit
123 Processing unit
130 Communication network

Claims

1. An abnormal conversation detection device comprising:
a voice input unit that inputs an audio signal including a conversation of a plurality of speakers;
a storage unit;
a feature amount extraction unit that extracts, from the audio signal, at least two feature amounts representing characteristics of a human voice for each frame having a predetermined time length, and stores the set of the at least two feature amounts in the storage unit;
a fitting unit that fits the distribution of the feature amounts stored in the storage unit with the same number of probability distributions as the number of the speakers; and
a determination unit that determines whether the same number of probability distributions as the number of the speakers approximates the distribution of the feature amounts, and determines that the conversation is an abnormal conversation when the same number of probability distributions as the number of the speakers does not approximate the distribution of the feature amounts.

2. The abnormal conversation detection device according to claim 1, wherein
the fitting unit fits the distribution of the sets of the at least two feature amounts with a mixture distribution including the same number of probability distributions as the number of the speakers, and
the determination unit calculates a fitness representing the degree to which each probability distribution included in the mixture distribution fits the distribution of the feature amounts, and determines that the conversation is an abnormal conversation when the fitness is less than a threshold corresponding to the lower limit of the fitness obtained when each probability distribution approximates the distribution of the feature amounts.

3. The abnormal conversation detection device according to claim 1, wherein
the fitting unit calculates, for each number of probability distributions while changing the number of probability distributions included in a mixture distribution, a fitness representing the degree to which each probability distribution included in the mixture distribution fits the distribution of the feature amounts, and obtains, based on the fitness, the number of probability distributions that best fits the distribution of the feature amounts, and
the determination unit determines that the conversation is an abnormal conversation when that number of probability distributions is larger than the number of the speakers.

4. The abnormal conversation detection device according to claim 3, further comprising a speaker number estimation unit that estimates the number of the speakers from the audio signal.

5. An abnormal conversation detection method comprising:
acquiring an audio signal including a conversation of a plurality of speakers;
extracting, from the audio signal, at least two feature amounts representing characteristics of a human voice for each frame having a predetermined time length;
fitting the distribution of the extracted feature amounts with the same number of probability distributions as the number of the speakers; and
determining whether the same number of probability distributions as the number of the speakers approximates the distribution of the feature amounts, and determining that the conversation is an abnormal conversation when the same number of probability distributions as the number of the speakers does not approximate the distribution of the feature amounts.

6. A computer program for abnormal conversation detection that causes a computer to execute:
acquiring an audio signal including a conversation of a plurality of speakers;
extracting, from the audio signal, at least two feature amounts representing characteristics of a human voice for each frame having a predetermined time length;
fitting the distribution of the extracted feature amounts with the same number of probability distributions as the number of the speakers; and
determining whether the same number of probability distributions as the number of the speakers approximates the distribution of the feature amounts, and determining that the conversation is an abnormal conversation when the same number of probability distributions as the number of the speakers does not approximate the distribution of the feature amounts.