JP7304627B2

JP7304627B2 - Answering machine judgment device, method and program

Info

Publication number: JP7304627B2
Application number: JP2019203594A
Authority: JP
Inventors: 心剣李
Original assignee: Hello Inc
Current assignee: Hello Inc
Priority date: 2019-11-08
Filing date: 2019-11-08
Publication date: 2023-07-07
Anticipated expiration: 2039-11-08
Also published as: JP2021078012A

Description

The present invention relates to an answering machine determination device, method, and program for determining whether the called party's response was an answering machine response when a call is placed automatically using an automated voice.

Telephones with an answering machine function have become widespread, as have functions for setting up an answering machine service through a communication network. With such a function, when the user does not go off-hook for a certain period after an incoming call, the telephone answers automatically, plays a preset answering message, and then prompts the caller to leave a message, which it records.

However, when an answering machine answers, the caller can leave a message for the other party but is still charged for the call and cannot achieve the original purpose of the call, which was to convey the desired matter to the other party.

To solve this problem, Patent Document 1 discloses a technique in which the caller's telephone device measures the answering machine response time of the called telephone and stores it together with the called telephone number. When a new call is placed to that number, the device disconnects automatically just before the stored answering machine response time elapses, thereby avoiding wasted call charges.

Patent Document 1: JP 2017-11623 A

However, although the technique of Patent Document 1 can avoid wasted call charges by disconnecting automatically from the second call onward, it cannot prevent the first call to a party from reaching an answering machine. Moreover, when the calling side places calls automatically by a computer program or the like and speaks with an automated voice, it cannot tell whether the response on the called side is a human voice or an answering machine message.

When the called telephone goes off-hook, unless it can be determined whether an answering machine answered or a human answered but did not act on the matter conveyed by the automated voice, it cannot be decided whether to call back immediately or after some time. As a result, the automated voice telephone may repeat calls needlessly, for example by calling back immediately many times even though nobody is there.

Accordingly, an object of the present invention is to acquire call data from the automated voice telephone after a call ends, extract at least a speaker overlap feature and a clustering feature from the call data, and determine with a determination unit generated by machine learning whether the response was an answering machine response, thereby preventing subsequent needless calls and enabling an appropriate follow-up.

The present invention can provide an answering machine response determination device comprising: call data acquisition means for acquiring call data of a telephone call; speaker overlap feature extraction means for calculating the degree of overlap between the two speakers in the call data; clustering feature extraction means for calculating a clustering feature of the call data; and determination means, generated by machine learning using the speaker overlap feature, the clustering feature, and the presence or absence of an answering machine response as training data, for determining whether a response is an answering machine response, wherein the determination means determines whether the call data is an answering machine response based on the speaker overlap feature and the clustering feature extracted from the call data.

The answering machine response determination device further comprises call duration feature extraction means for extracting a call duration feature from the call data, and the determination means further uses the call duration feature as training data.

The call duration feature is an energy statistic of the call audio.

The answering machine response determination device according to the present invention further comprises text feature extraction means having speech recognition means for converting the call data into text data and response detection means for detecting an answering machine response from the text data by machine learning and calculating a text feature, and the determination means further uses the text feature as training data.

The present invention also provides an answering machine response determination device further comprising machine-synthesized speech feature generation means that performs machine learning using machine-synthesized speech data, human speech data, and Gaussian mixture histograms as training data and generates, as a machine-synthesized speech feature, the probability that the answering side's speech in the call data is machine-synthesized, wherein the determination means further uses the machine-synthesized speech feature as training data.

The present invention provides an answering machine response determination method comprising: a call data acquisition step of acquiring call data of a telephone call; a speaker overlap feature extraction step of calculating the degree of overlap between the two speakers in the call data; a clustering feature extraction step of calculating a clustering feature of the call data; and a determination step of determining whether a response is an answering machine response by a determination unit generated by machine learning using the call duration feature, the speaker overlap feature, the clustering feature, and the presence or absence of an answering machine response as training data.

The present invention also provides an answering machine response determination program that causes a computer to operate as the answering machine response determination device of the present invention by executing: a call data acquisition step of acquiring call data of a telephone call; a speaker overlap feature extraction step of calculating the degree of overlap between the two speakers in the call data; a clustering feature extraction step of calculating a clustering feature of the call data; and a determination step of determining whether a response is an answering machine response by a determination unit generated by machine learning using the call duration feature, the speaker overlap feature, the clustering feature, and the presence or absence of an answering machine response as training data.

According to the present invention, predetermined features are obtained from the call data, and it can be determined whether the called telephone was answered by an answering machine. Specifically, when the response is determined to have been an answering machine, the telephone can be set to call again several hours later, at a time when the party is likely to be home or, for a store, during business hours. When it is determined that a human answered but hung up immediately because the call was an automated voice call, the telephone can be set to call back immediately.

FIG. 1 is a block diagram showing an example of the hardware configuration of the answering machine response determination device according to the present invention.
FIG. 2 is a functional block diagram of the answering machine response determination system 1 according to the first embodiment of the present invention.
FIG. 3 is a functional block diagram of the answering machine response determination system 1 according to the second embodiment of the present invention.
FIG. 4 is a graph showing the distribution of call durations in seconds for human responses.
FIG. 5 is a graph showing the distribution of call durations in seconds for machine responses by an answering machine.
FIG. 6 is a flowchart showing the process of generating, by machine learning with training data, a classifier that determines whether an answering machine answered on the called party's side.

Embodiments of the present invention are described below with reference to the drawings. In this specification and the drawings, components having substantially the same function and configuration are given the same reference numerals, and redundant description is omitted.

FIG. 1 is a block diagram showing an example of the hardware configuration of the answering machine response determination device according to the present invention. The answering machine response determination device 10 shown in FIG. 1 can be implemented mainly as a computer device. It extracts various features from the call data received from the automated voice telephone 20 and executes an answering machine response determination program to determine whether the response was an answering machine response.

The answering machine response determination device 10 has an answering machine response determination unit generated by machine learning using call data, various features, and the presence or absence of an answering machine response as training data. When it receives new call data from the automated voice telephone 20, it extracts the various features and determines whether the other party's response was an answering machine response.

As shown in FIG. 1, the computer forming the answering machine response determination device 10 comprises a CPU 11, a communication interface 12, a ROM 13, a RAM 14, a hard disk drive 15, and an input/output interface 16, all connected to a bus, together with a display unit 17, a pointing device 18, and a keyboard 19 connected to the input/output interface 16. An external storage device 20 such as a USB memory can also be connected to the input/output interface 16.

The display unit 17 is, for example, a display device such as a liquid crystal display. The pointing device 18 is, for example, a mouse or a trackball.

When the series of processes is executed by a program, for example, the call data acquisition unit, speaker overlap feature extraction unit, clustering feature extraction unit, call duration feature extraction unit, text feature extraction unit, machine-synthesized speech feature extraction unit, and determination unit are stored in the ROM 13 or the hard disk drive 15 as an answering machine response determination program and are executed by the CPU 11 to perform their functions. The program may be installed by connecting an external storage device 20, such as a USB memory storing the program, to the input/output interface 16, or installed on the computer from the network 12. Alternatively, it may be built into the device in advance, for example in the ROM 13 with the program recorded on it.

FIG. 2 is a functional block diagram of the answering machine response determination system 1 according to the first embodiment of the present invention. The answering machine response determination system 1 comprises an automated voice telephone 20 and an answering machine response determination device 10. The automated voice telephone 20 is a telephone that can place calls without a human present by automatically calling a given telephone number and playing a given message to the other party in an artificially generated voice. For example, when making a reservation call to a restaurant, the necessary reservation message is registered in the automated voice telephone 20 in advance and combined as appropriate with the desired date and time, the number of people, and so on. The telephone then reads out the telephone number of the restaurant, places the call automatically, reads the generated reservation message aloud in an artificial voice, and asks the other party whether the reservation is possible. The automated voice telephone 20 may be implemented by installing, on a computer such as a server, a computer program that places calls by referring to telephone numbers stored in a database, and it may consist of multiple computer devices.

The answering machine response determination device 10 receives the call data recorded by the automated voice telephone 20 via a communication network such as the Internet, extracts various features from the call data, and determines whether the other party responded with an answering message, using a determination unit generated by machine learning on training data.

Although the automated voice telephone 20 and the answering machine response determination device 10 are shown here as separate devices, they may instead be configured as a single device. Alternatively, each of the functions making up the automated voice telephone 20 and the answering machine response determination device 10 may be configured as an independent device.

The answering machine response determination device 10 has a call data acquisition unit 101, a speaker overlap feature extraction unit 111, a clustering feature extraction unit 113, and a determination unit 121.

The call data acquisition unit 101 receives from the automated voice telephone 20 the call data of a call that the automated voice telephone 20 placed automatically without human intervention, and thereby acquires the call data. The call data only needs to be recorded in two channels (dual channel), one carrying the caller's audio and the other the called party's audio. The format is, for example, WAV, but is not limited to any particular data format.

The speaker overlap feature extraction unit 111 calculates the degree of overlap between the two speakers in the call data. Specifically, because the call data is recorded in dual channel, one channel for the caller side and one for the called side, the unit detects the intervals in which speech is present on both channels and determines the degree of overlap. The unit has, for example, a voice activity detector (VAD: voice activity detection), extracts the speech intervals on both the caller channel and the called-party channel, and calculates the degree of overlap. For example, a linear model on MFCCs (mel-frequency cepstral coefficients) may be used as the voice activity detector. The unit calculates the overlap time (overlap) between the intervals in which the caller and the called party are speaking, where S_hello denotes the caller's speech intervals, S_restaurant denotes the called party's speech intervals, and |S| is the length of an interval in seconds.
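The overlap computation described above can be sketched as follows. This is a minimal illustration, not the formula from the original publication, which did not survive in this text. It assumes the voice activity detector has already produced, for each channel, a sorted list of non-overlapping `(start, end)` intervals in seconds; the function name `overlap_seconds` and the interval representation are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec), assumed sorted and disjoint


def overlap_seconds(s_hello: List[Interval], s_restaurant: List[Interval]) -> float:
    """Total time (seconds) during which both channels contain speech."""
    i = j = 0
    total = 0.0
    while i < len(s_hello) and j < len(s_restaurant):
        # Intersection of the two current intervals, if any.
        start = max(s_hello[i][0], s_restaurant[j][0])
        end = min(s_hello[i][1], s_restaurant[j][1])
        if end > start:
            total += end - start
        # Advance whichever interval finishes first.
        if s_hello[i][1] < s_restaurant[j][1]:
            i += 1
        else:
            j += 1
    return total


# Caller speaks during 0-2 s and 5-8 s; called party speaks during 1-6 s.
caller = [(0.0, 2.0), (5.0, 8.0)]
callee = [(1.0, 6.0)]
print(overlap_seconds(caller, callee))
```

Whether the overlap is used as a raw duration or normalized by the call length is not stated in the surviving text, so the sketch returns the raw duration only.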

The clustering feature extraction unit 113 calculates a clustering feature from the audio of the called party's channel only. It performs clustering so that machine responses, such as answering machine responses, and human responses are grouped by the characteristic audio patterns of each kind of response. Specifically, it computes a feature called BoAW (Bag of Audio Words). For the received call data, the unit first computes MFCC features.

Specifically, each frame is 0.025 seconds long and successive frames are shifted by 0.01 seconds, which yields 100 frames per second. The 40-dimensional MFCC feature of each frame is regarded as a point in a high-dimensional space, and k-means clustering is applied to the set of points to generate clusters. k-means generates the clusters by minimizing the following objective:

$$\underset{S}{\arg\min} \sum_{i} \sum_{x \in S_i} \lVert x - \mu_i \rVert^{2}$$

Here $x$ is the point for each frame, $S_i$ is the set of points in the $i$-th cluster, $\mu_i$ is the center of that cluster, and $S$ is the set of all clusters.

For example, if 100 calls of 10 seconds each are prepared, there are 10 × 100 × 100 = 100,000 points, and k-means clustering is applied to them to generate 100 clusters. Next, the MFCC features of the call data to be judged are computed, and for each frame the nearest cluster is found. Assigning each frame to its nearest cluster yields a histogram over the clusters, which is extracted as the clustering feature. In other words, the clustering feature extraction unit 113 generates clusters from the MFCC features, an acoustic feature, of the called party's audio and then generates a histogram over those clusters to extract the clustering feature.
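The two stages described above (learn cluster centers from training frames, then histogram the frames of a new call over those centers) can be sketched with NumPy. This is an illustrative sketch under assumptions: the MFCC frames are taken as given `(n_frames, 40)` arrays, the plain Lloyd-style `kmeans` helper and the normalization of the histogram to sum to 1 are choices made here, and the function names are hypothetical.

```python
import numpy as np


def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd iterations; returns the (k, dim) array of cluster centers."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize centers with k distinct training points (fancy indexing copies).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every center: shape (n, k).
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for i in range(k):
            members = points[labels == i]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[i] = members.mean(axis=0)
    return centers


def boaw_histogram(frames, centers):
    """Assign each frame to its nearest center and count assignments."""
    frames = np.asarray(frames, dtype=float)
    dist = ((frames[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = dist.argmin(axis=1)
    hist = np.bincount(nearest, minlength=len(centers)).astype(float)
    total = hist.sum()
    # Normalizing by frame count is an assumption, so calls of different
    # lengths give comparable vectors.
    return hist / total if total > 0 else hist


# Random stand-ins for 40-dimensional MFCC frames (not real audio).
rng = np.random.default_rng(1)
training_frames = rng.normal(size=(500, 40))
centers = kmeans(training_frames, k=8)
feature = boaw_histogram(rng.normal(size=(120, 40)), centers)
print(feature.shape)
```

The text uses 100 clusters learned from 100,000 frames; the sketch uses smaller numbers only to keep the distance matrix small.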

The determination unit 121 is generated by machine learning using the various features extracted from call data and the presence or absence of an answering machine response as training data. When the speaker overlap feature and the clustering feature extracted from the call data to be judged are input to the determination unit 121 generated in this way, it determines whether that call was answered by an answering machine.

The determination unit 121 consists of a classifier that performs binary classification on the features extracted from the target call data. Here the two classes are machine response, meaning a response by an answering machine, and human response. Logistic regression, random forest, or an SVM (support vector machine), for example, can be used as the classifier. Any of them may be used, but they may be chosen by data size: logistic regression when training data is scarce, for example fewer than 10,000 voice samples, and an SVM or the like when there are more.

For example, logistic regression performs best when training data is scarce. Using the following formula, it computes, under the weight parameters $w$, the conditional probability that an input feature vector $x$ is classified as the machine response $C_1$, that is, an answering machine response:

$$P(C_1 \mid x; w) = \sigma(w^{T} x + w_0)$$
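The conditional probability above can be evaluated directly once the weights are known. The following is a minimal sketch that assumes the weights `w` and bias `w0` have already been learned; the example values are made up for illustration and the function names are hypothetical.

```python
import math
from typing import Sequence


def sigmoid(a: float) -> float:
    """Logistic function, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def machine_response_probability(x: Sequence[float],
                                 w: Sequence[float],
                                 w0: float) -> float:
    """P(C1 | x; w) = sigmoid(w^T x + w0)."""
    a = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return sigmoid(a)


# Hypothetical feature vector and weights, for illustration only.
p = machine_response_probability(x=[0.2, 1.5], w=[-3.0, 0.8], w0=0.1)
print(p > 0.5)
```

A call would then be labeled a machine response when the probability exceeds a chosen threshold; 0.5 is the conventional default, not a value given in the text.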

When a random forest is used as the classifier, the calculation is based on the results $f_k$ of the individual decision trees.

When the amount of data reaches a certain level, for example 10,000 voice samples or more, replacing logistic regression with an SVM allows classification with higher performance. By using the kernel trick, an SVM draws a hyperplane in a space of higher dimension than the original one and thereby achieves nonlinear classification. The decision boundary is determined from the training data, where $(z_i, y_i)$ are the features and label of the $i$-th existing training sample and $w$ and $b$ are the weights to be learned.

FIG. 3 is a functional block diagram of the answering machine response determination system 1 according to the second embodiment of the present invention. In the second embodiment, features other than the speaker overlap feature and the clustering feature are also extracted and used to determine whether the response was an answering machine response. Detailed description of configurations identical to those of the first embodiment is omitted.

The answering machine response determination system 1 comprises an automated voice telephone 20 and an answering machine response determination device 10. The automated voice telephone 20 is a telephone that can place calls without a human present by automatically calling a given telephone number and playing a given message to the other party in an artificially generated voice. The answering machine response determination device 10 receives the call data recorded by the automated voice telephone 20 via a communication network such as the Internet, extracts various features from the call data, and determines whether the other party responded with an answering message, using a determination unit generated by machine learning on training data.

In this embodiment, the answering machine response determination device 10 has a call data acquisition unit 101, a speaker overlap feature extraction unit 111, a clustering feature extraction unit 113, a call duration feature extraction unit 115, a text feature extraction unit 117, a machine-synthesized speech feature extraction unit 119, and a determination unit 121.

The call data acquisition unit 101 receives from the automated voice telephone 20 the call data of a call that the automated voice telephone 20 placed automatically without human intervention, and thereby acquires the call data. The speaker overlap feature extraction unit 111 calculates the degree of overlap between the two speakers in the call data. The clustering feature extraction unit 113 calculates a clustering feature from the audio of the called party's channel only. Because the speaker overlap feature extraction unit 111 and the clustering feature extraction unit 113 are configured as in the first embodiment, detailed description is omitted here.

The call duration feature extraction unit 115 extracts a call duration feature from the call data. For example, the call duration itself may be used as the feature. In another example, the square of the call duration may be used. Answering machine responses often use a recording of a fixed template, so such calls tend to end after similar durations. Using the square of the call duration as a feature lets the classifier express a quadratic function and capture the peak associated with answering machine responses.
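To see why a squared term lets a linear classifier capture a peak, consider the following illustration. It assumes, purely for the sake of the example, that the classifier is the logistic regression described for the first embodiment and that only the call duration $t$ and its square are used as features; the weights $w_0, w_1, w_2$ are hypothetical.

```latex
a(t) = w_0 + w_1 t + w_2 t^{2}, \qquad
P(C_1 \mid t) = \sigma\bigl(a(t)\bigr)

% For w_2 < 0, a(t) is a downward-opening parabola, so P(C_1 | t)
% is largest at the vertex:
\frac{da}{dt} = w_1 + 2 w_2 t = 0
\quad\Longrightarrow\quad
t^{*} = -\frac{w_1}{2 w_2}
```

With $t$ alone, $a(t)$ is monotonic in $t$, so the model could only say that longer (or shorter) calls are more likely to be machine responses. With the $t^{2}$ term it can assign the highest probability to durations near $t^{*}$, which matches the observation that template recordings end after similar durations.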

The call duration feature extraction unit 115 may also extract an energy statistic of each audio frame as the call duration feature. In this case, computing the energy statistic for each frame measures the noise over the whole call. A high energy statistic indicates a lot of background noise, and a low one indicates little. Specifically, the audio signal x(t) is divided into 0.025-second frames, each frame is multiplied by a window function w(t), and the short-time Fourier transform converts it into the signal X[t, f], where t is time and f is frequency. The energy statistic is then obtained by summing the power spectrum up to the Nyquist frequency.
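The framing, windowing, and power summation described above can be sketched with NumPy. The exact formula from the original publication is not reproduced in this text, so this is an illustration under assumptions: a Hann window, a 0.01-second hop (borrowed from the MFCC framing described earlier), and the function name `frame_energies` are all choices made here.

```python
import numpy as np


def frame_energies(x, sample_rate, frame_sec=0.025, hop_sec=0.01):
    """Per-frame energy: sum of |X[t, f]|^2 over bins from 0 to Nyquist."""
    x = np.asarray(x, dtype=float)
    frame_len = int(round(frame_sec * sample_rate))
    hop = int(round(hop_sec * sample_rate))
    if len(x) < frame_len:
        return np.zeros(0)
    window = np.hanning(frame_len)  # window choice is an assumption
    n_frames = 1 + (len(x) - frame_len) // hop
    energies = np.empty(n_frames)
    for i in range(n_frames):
        frame = x[i * hop: i * hop + frame_len] * window
        spectrum = np.fft.rfft(frame)  # real FFT covers 0 .. Nyquist only
        energies[i] = np.sum(np.abs(spectrum) ** 2)
    return energies


# One second of a 440 Hz tone at 8 kHz as a stand-in for call audio.
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
e = frame_energies(tone, sr)
print(len(e), float(e.mean()) > 0.0)
```

How the per-frame energies are summarized into a single feature (for example a mean or a percentile over the call) is not specified in the surviving text and is left to the caller.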

The call duration feature extraction unit 115 may extract only the energy statistic as the call duration feature. In the most preferred embodiment, it extracts two features: the energy statistic and the square of the call duration. Alternatively, it may extract the energy statistic and the call duration itself as the two features.

The text feature extraction unit 117 has a speech recognition unit 1171 that converts speech into text data, and a response detection unit 1172 that uses machine learning to detect an answering machine response in the text data and calculates a text feature quantity. The speech recognition unit 1171 performs speech recognition on the called party's audio in the acquired call data and converts it into text.

The response detection unit 1172 detects whether the text data generated by the speech recognition unit 1171 contains a message typical of an answering machine's automated response. For example, it detects phrases commonly used in answering machine greetings, such as "I am away right now" (ただいま留守にしております), "Please leave a message" (メッセージをお願いします), and "We are outside business hours" (営業時間外です). Specifically, the response detection unit 1172 performs response detection by applying a pre-trained Japanese language model such as BERT (Bidirectional Encoder Representations from Transformers) or XLNet.
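The detection step can be illustrated with a deliberately simplified stand-in. The sketch below does not use BERT or XLNet; it substitutes plain substring matching so that the example stays self-contained. The function name `text_feature`, the constant `TYPICAL_PHRASES`, and the scoring rule (the fraction of phrases found) are assumptions for illustration. Only the three phrases themselves come from the description.

```python
# Example phrases quoted in the description as typical answering machine messages.
TYPICAL_PHRASES = (
    "ただいま留守にしております",
    "メッセージをお願いします",
    "営業時間外です",
)


def text_feature(transcript: str,
                 phrases: tuple[str, ...] = TYPICAL_PHRASES) -> float:
    """Return the fraction of typical phrases that occur in the transcript.

    This is a keyword-matching stand-in for the language model named in the
    description, not an implementation of it.
    """
    if not phrases:
        return 0.0
    hits = sum(1 for phrase in phrases if phrase in transcript)
    return hits / len(phrases)
```

A language model would generalize to paraphrases and recognition errors that exact matching misses, which is presumably why the description prefers one.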

The machine-synthesized speech feature extraction unit 119 extracts a feature quantity for determining whether the called party's audio is machine-synthesized. For example, large amounts of human speech and artificially synthesized speech are collected and fed to an LSTM (Long Short-Term Memory) network, which is trained to classify human speech and synthesized speech. In addition, a histogram of the Gaussian mixture distribution of the audio data is calculated for human speech and for synthesized speech, and a neural network is trained on these histograms as teacher data to generate a machine-synthesized speech classifier in advance.

On acquiring call data, the machine-synthesized speech feature extraction unit 119 extracts the audio of the called party's channel, calculates the histogram of the Gaussian mixture distribution of that audio, has the trained machine-synthesized speech classifier calculate the probability that the audio is synthesized, and outputs that probability as the machine-synthesized speech feature quantity.

The determination unit 121 is generated by machine learning, using as teacher data the feature quantities extracted from call data together with whether an answering machine responded. When the feature quantities extracted from the call data to be judged are input to the determination unit generated in this way, the determination unit 121 determines whether an answering machine responded in that call. In this embodiment, the speaker overlap feature quantity, clustering feature quantity, call duration feature quantity, text feature quantity, and machine-synthesized speech feature quantity are input to make the determination.

In the second embodiment, five feature quantities are input: the speaker overlap feature quantity, clustering feature quantity, call duration feature quantity, text feature quantity, and machine-synthesized speech feature quantity. However, the determination may instead use either the text feature quantity or the machine-synthesized speech feature quantity together with the other three feature quantities. The determination may also use only three feature quantities: the speaker overlap feature quantity, the clustering feature quantity, and the call duration feature quantity. Using only these three speeds up processing.
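The feature combinations described above can be sketched as a small selection helper. The class name `CallFeatures`, the field names, the keys of `FEATURE_SETS`, and the function `to_vector` are all hypothetical names introduced for illustration; the source specifies only which feature quantities may be combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallFeatures:
    """The five feature quantities of the second embodiment (names are illustrative)."""
    speaker_overlap: float
    clustering: float
    call_duration: float
    text: float
    machine_synth: float


_BASE = ("speaker_overlap", "clustering", "call_duration")

# Input combinations mentioned in the description.
FEATURE_SETS = {
    "all_five": _BASE + ("text", "machine_synth"),
    "four_with_text": _BASE + ("text",),
    "four_with_synth": _BASE + ("machine_synth",),
    "three_fast": _BASE,  # skips speech recognition and synthesis detection
}


def to_vector(features: CallFeatures, feature_set: str = "all_five") -> list[float]:
    """Build the input vector for the determination unit from the chosen subset."""
    return [getattr(features, name) for name in FEATURE_SETS[feature_set]]
```

For example, `to_vector(CallFeatures(0.1, 0.2, 55.0, 0.9, 0.8), "three_fast")` yields a three-element vector, so the text and machine-synthesized speech feature quantities never need to be computed for that configuration.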

FIG. 4 is a graph showing the distribution of call durations, in seconds, when a human answers. As the figure shows, when a person answers on the receiving side, the call often ends after a short time. This is probably because, when a call comes from an automated voice, the called party tends to hang up as soon as they realize that the caller is automated (a machine).

FIG. 5 is a graph showing the distribution of call durations, in seconds, when an answering machine responds. The call duration peaks at 50 to 60 seconds. An answering machine plays a fixed greeting, which takes a certain number of seconds. When an automated call reaches an answering machine, the greeting is followed by a period for recording a message, and the called side disconnects after the predetermined recording time. This is thought to be why the peak falls at 50 to 60 seconds.

FIG. 6 is a flowchart of the process that uses teacher data to generate, by machine learning, a classifier that determines whether an answering machine responded on the called party's side. First, the call data acquisition unit 101 acquires call data and response result data indicating whether the called party's response was an answering machine response (step S601). Because the goal here is to train the classifier, the response result data is acquired together with the call data as teacher data.

Next, each feature extraction unit extracts its feature quantity from the call data (step S602). In the first embodiment, the speaker overlap feature extraction unit 111 and the clustering feature extraction unit 113 extract the speaker overlap feature quantity and the clustering feature quantity from the call data. In the second embodiment, the speaker overlap feature extraction unit 111, the clustering feature extraction unit 113, the call duration feature extraction unit 115, the text feature extraction unit 117, and the machine-synthesized speech feature extraction unit 119 extract the speaker overlap feature quantity, clustering feature quantity, call duration feature quantity, text feature quantity, and machine-synthesized speech feature quantity, respectively.

Next, the classifier used in the determination unit is generated by machine learning (step S603). The feature quantities extracted from the call data, together with whether an answering machine responded, are used as teacher data to train a classifier that determines whether the response was an answering machine response. Machine learning methods that may be used include logistic regression, random forests, and support vector machines; any of them may be used. The feature quantities are the speaker overlap feature quantity and the clustering feature quantity in the first embodiment, and the speaker overlap feature quantity, clustering feature quantity, call duration feature quantity, text feature quantity, and machine-synthesized speech feature quantity in the second embodiment. The classifier generated from the teacher data then determines whether the call data to be judged contains an answering machine response.
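Step S603 can be sketched with logistic regression, one of the three methods named above. The following is a minimal standard-library illustration, not the patented implementation: the function names, the learning rate, the number of epochs, and the toy feature values are all assumptions. Each row holds the two feature quantities of the first embodiment (speaker overlap, clustering), and the labels mark whether an answering machine responded. The numbers are made up and carry no claim about how real calls behave.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w: list[float], x: list[float]) -> float:
    """Probability of an answering machine response; w[0] is the bias."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def train_logistic(X: list[list[float]], y: list[int],
                   lr: float = 0.5, epochs: int = 2000) -> list[float]:
    """Fit the weights by batch gradient descent on the log loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = score(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w


def is_answering_machine(w: list[float], x: list[float]) -> bool:
    """Determination step: threshold the predicted probability at 0.5."""
    return score(w, x) >= 0.5


# Toy teacher data (made up): [speaker_overlap, clustering] and response labels.
X_train = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.9], [0.1, 0.2], [0.2, 0.1], [0.3, 0.2]]
y_train = [1, 1, 1, 0, 0, 0]
weights = train_logistic(X_train, y_train)
```

Once trained, `is_answering_machine(weights, features)` plays the role of the determination unit for a new call. For the second embodiment the same code applies with five-element feature vectors.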

1 Answering machine response determination system
10 Answering machine response determination device
20 Automated voice telephone

Claims

1. An answering machine response determination device comprising:
call data acquisition means for acquiring call data of a telephone call;
speaker overlap feature extraction means for calculating the degree of overlap between the two speakers in the call data;
clustering feature extraction means for calculating a clustering feature quantity of the call data; and
determination means, generated by machine learning using the speaker overlap feature quantity, the clustering feature quantity, and the presence or absence of an answering machine response as teacher data, for determining whether an answering machine responded,
wherein the determination means determines whether the call data contains an answering machine response based on the speaker overlap feature quantity and the clustering feature quantity extracted from the call data.

2. The answering machine response determination device according to claim 1, further comprising call duration feature extraction means for extracting a call duration feature quantity of the call data,
wherein the determination means further uses the call duration feature quantity as teacher data.

3. The answering machine response determination device according to claim 2, wherein the call duration feature quantity is an energy statistic of the call audio.

4. The answering machine response determination device according to any one of claims 1 to 3,
further comprising text feature extraction means that includes speech recognition means for converting the call data into text data, and response detection means for detecting an answering machine response in the text data by machine learning and calculating a text feature quantity,
wherein the determination means further uses the text feature quantity as teacher data.

5. The answering machine response determination device according to any one of claims 1 to 4,
further comprising machine-synthesized speech feature generation means that is trained by machine learning using machine-synthesized speech data, human speech data, and Gaussian mixture histograms as teacher data, and that generates, as a machine-synthesized speech feature quantity, the probability that the speech of the answering side in the call data is machine-synthesized speech,
wherein the determination means further uses the machine-synthesized speech feature quantity as teacher data.

6. An answering machine response determination method comprising:
a call data acquisition step of acquiring call data of a telephone call;
a speaker overlap feature extraction step of calculating the degree of overlap between the two speakers in the call data;
a clustering feature extraction step of calculating a clustering feature quantity of the call data; and
a determination step in which a determination unit, generated by machine learning using the speaker overlap feature quantity, the clustering feature quantity, and the presence or absence of an answering machine response as teacher data, determines whether an answering machine responded.

7. An answering machine response determination program that causes a computer serving as an answering machine response determination device to execute:
a call data acquisition step of acquiring call data of a telephone call;
a speaker overlap feature extraction step of calculating the degree of overlap between the two speakers in the call data;
a clustering feature extraction step of calculating a clustering feature quantity of the call data; and
a determination step in which a determination unit, generated by machine learning using the speaker overlap feature quantity, the clustering feature quantity, and the presence or absence of an answering machine response as teacher data, determines whether an answering machine responded.