JP2021078012A

JP2021078012A - Answering machine determination device, method and program

Info

Publication number: JP2021078012A
Application number: JP2019203594A
Authority: JP
Inventors: 心剣李; Xinjian Li
Original assignee: Hello Inc
Current assignee: Hello Inc
Priority date: 2019-11-08
Filing date: 2019-11-08
Publication date: 2021-05-20
Anticipated expiration: 2039-11-08
Also published as: JP7304627B2

Abstract

PROBLEM TO BE SOLVED: To address the problem that, when a call is placed using automated voice, it cannot be determined whether the answer on the receiving side is a human voice or an answering machine message. SOLUTION: An answering machine response determination device comprises: call data acquisition means that acquires call data of a telephone call; speaker overlap feature extraction means that calculates the degree of overlap between the two speakers in the call data; clustering feature extraction means that calculates a clustering feature of the call data; and determination means, generated by machine learning using the speaker overlap feature, the clustering feature, and the presence or absence of an answering machine response as training data, that determines whether a response is an answering machine response. The determination means determines whether the call data contains an answering machine response on the basis of the speaker overlap feature and the clustering feature extracted from the call data. SELECTED DRAWING: Figure 2

Description

The present invention relates to an answering machine determination device, method, and program that determine whether the called party's response was an answering machine response when a call is placed automatically using automated voice.

Telephones with an answering machine function, and answering machine services that can be configured through the communication network, are in widespread use. When the user does not go off-hook within a certain time after an incoming call, the telephone answers automatically, plays a preset greeting, and then prompts the caller to leave a message, which is recorded.

However, when an answering machine answers, the caller can leave a message but is still charged for the call, and cannot achieve the original purpose of the call, which was to convey the desired matter to the other party.

To solve this problem, Patent Document 1 discloses a technique in which the caller's telephone device measures the time until the destination telephone's answering machine responds and stores it together with the destination's telephone number. When the same destination is called again, the device automatically disconnects just before the stored answering machine response time is reached, preventing needless call charges.

Patent Document 1: JP-A-2017-11623

However, although the technique of Patent Document 1 can prevent needless call charges from the second call onward by disconnecting automatically, it cannot prevent the first call to a given party from reaching an answering machine. Moreover, when the calling side places calls automatically by a computer program or the like and speaks with automated voice, it cannot determine whether the response on the receiving side is a human voice or an answering machine message.

When the called telephone goes off-hook, unless it can be determined whether an answering machine answered or a human answered but did not respond to the matter conveyed by the automated voice, it cannot be decided whether to call back immediately or after some time. As a result, the automated voice telephone repeats call attempts needlessly, for example by calling back immediately and repeatedly even though the other party is away.

Accordingly, an object of the present invention is to acquire call data from the automated voice telephone after the call has ended, extract at least a speaker overlap feature and a clustering feature from the call data, and determine with a determination unit generated by machine learning whether the response was an answering machine response, thereby preventing subsequent needless call attempts and enabling appropriate follow-up.

The present invention can provide an answering machine response determination device comprising: call data acquisition means that acquires call data of a telephone call; speaker overlap feature extraction means that calculates the degree of overlap between the two speakers in the call data; clustering feature extraction means that calculates a clustering feature of the call data; and determination means, generated by machine learning using the speaker overlap feature, the clustering feature, and the presence or absence of an answering machine response as training data, that determines whether a response is an answering machine response, wherein the determination means determines whether the call data contains an answering machine response on the basis of the speaker overlap feature and the clustering feature extracted from the call data.

The answering machine response determination device further comprises call duration feature extraction means that extracts a call duration feature from the call data, and the determination means further uses the call duration feature as training data.

The call duration feature is an energy statistic of the call audio.

The answering machine response determination device according to the present invention further comprises text feature extraction means having speech recognition means that converts the call data into text data and response detection means that detects an answering machine response from the text data by machine learning and calculates a text feature, and the determination means further uses the text feature as training data.

Furthermore, there is provided an answering machine response determination device that further comprises machine-synthesized sound feature generation means, which is trained by machine learning using machine-synthesized sound data, human voice data, and Gaussian mixture histograms as training data and which generates, as a machine-synthesized sound feature, the probability that the answering side's audio in the call data is machine-synthesized sound, wherein the determination means further uses the machine-synthesized sound feature as training data.

The answering machine response determination method according to the present invention comprises: a call data acquisition step of acquiring call data of a telephone call; a speaker overlap feature extraction step of calculating the degree of overlap between the two speakers in the call data; a clustering feature extraction step of calculating a clustering feature of the call data; and a determination step of determining whether a response is an answering machine response with a determination unit generated by machine learning using the call duration feature, the speaker overlap feature, the clustering feature, and the presence or absence of an answering machine response as training data.

The answering machine response determination program, which causes a computer to operate as the answering machine response determination device of the present invention, causes the computer to execute: a call data acquisition step of acquiring call data of a telephone call; a speaker overlap feature extraction step of calculating the degree of overlap between the two speakers in the call data; a clustering feature extraction step of calculating a clustering feature of the call data; and a determination step of determining whether a response is an answering machine response with a determination unit generated by machine learning using the call duration feature, the speaker overlap feature, the clustering feature, and the presence or absence of an answering machine response as training data.

According to the present invention, predetermined features are acquired from the call data and it can be determined whether the called telephone answered with an answering machine. Specifically, the telephone can be configured so that, when the response is determined to have been an answering machine, it calls again after several hours, at a time when the party is likely to be home or, for a store, during business hours; and, when it is determined that a human answered but hung up right away because the call was an automated voice call, it calls back immediately.
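The follow-up behaviour described above can be sketched as a small scheduling rule. This is only an illustration: the names `CallOutcome` and `next_call_time` and the three-hour default delay are assumptions, since the text says only "several hours" and does not specify how business hours are looked up.

```python
from datetime import datetime, timedelta
from enum import Enum


class CallOutcome(Enum):
    """Result of the determination for one call (hypothetical names)."""
    ANSWERING_MACHINE = "answering_machine"
    HUMAN_HUNG_UP = "human_hung_up"


def next_call_time(outcome: CallOutcome, now: datetime,
                   retry_delay: timedelta = timedelta(hours=3)) -> datetime:
    """Return when the automated voice telephone should call again.

    Assumption: "several hours" is modelled as a fixed delay; a real system
    would also shift the time into business hours or likely at-home hours.
    """
    if outcome is CallOutcome.ANSWERING_MACHINE:
        return now + retry_delay  # nobody was there, so wait before retrying
    return now  # a human answered but hung up, so call back immediately


now = datetime(2021, 5, 20, 10, 0)
retry_at = next_call_time(CallOutcome.ANSWERING_MACHINE, now)
```

Keeping the delay as a parameter leaves room for per-destination settings such as store opening hours.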

FIG. 1 is a block diagram showing an example of the hardware configuration of the answering machine response determination device according to the present invention.

FIG. 2 is a functional block diagram of the answering machine response determination system 1 according to the first embodiment of the present invention.

FIG. 3 is a functional block diagram of the answering machine response determination system 1 according to the second embodiment of the present invention.

FIG. 4 is a graph showing the distribution of call durations in seconds for human responses.

FIG. 5 is a graph showing the distribution of call durations in seconds for machine responses by answering machines.

FIG. 6 is a flowchart showing the process of generating, by machine learning with training data, a classifier that determines whether an answering machine answered on the callee side.

Embodiments of the present invention are described below with reference to the drawings. In this specification and the drawings, components having substantially the same function and configuration are given the same reference numerals, and redundant description is omitted.

FIG. 1 is a block diagram showing an example of the hardware configuration of the answering machine response determination device according to the present invention. The hardware configuration of the answering machine response determination device 10 shown in FIG. 1 can be realized mainly by a computer device. The answering machine response determination device 10 extracts various features from call data received from the automated voice telephone 20 and determines whether the response was an answering machine response by executing an answering machine response determination program.

The answering machine response determination device 10 has an answering machine response determination unit generated by machine learning using call data, the various features, and the presence or absence of an answering machine response as training data. When it receives new call data from the automated voice telephone 20, it extracts the various features and determines whether the other party's response was an answering machine response.

As shown in FIG. 1, the computer forming the answering machine response determination device 10 is configured by connecting to a bus a CPU 11, a communication interface 12, a ROM 13, a RAM 14, a hard disk drive 15, an input/output interface 16, and a display unit 17, a pointing device 18, and a keyboard 19 that are connected to the input/output interface 16. An external storage device 20 such as a USB memory can also be connected to the input/output interface 16.

The display unit 17 is, for example, a display device such as a liquid crystal display. The pointing device 18 is, for example, a mouse or a trackball.

When the series of processes is executed by a program, the call data acquisition unit, speaker overlap feature extraction unit, clustering feature extraction unit, call duration feature extraction unit, text feature extraction unit, machine-synthesized sound feature extraction unit, and determination unit are, for example, stored in the ROM 13 or the hard disk drive 15 as the answering machine response determination program and executed by the CPU 11 to perform their functions. The program may be installed by connecting an external storage device 20, such as a USB memory storing the program, to the input/output interface 16, or installed on the computer from the network 12. Alternatively, it may be built into the device in advance, for example in the ROM 13 on which the program is recorded.

FIG. 2 is a functional block diagram of the answering machine response determination system 1 according to the first embodiment of the present invention. The answering machine response determination system 1 comprises an automated voice telephone 20 and an answering machine response determination device 10. The automated voice telephone 20 automatically calls a predetermined telephone number and plays a predetermined message to the other party in an artificially generated voice, so that a call can be placed without a human caller. For example, the telephone reservation message needed to make a reservation call to a store is registered in the automated voice telephone 20 in advance and combined as appropriate with details such as the desired date and time and the number of people. The telephone then reads out the telephone number of the store, places the call automatically, reads the generated reservation message aloud in synthesized voice, and asks the other party whether the reservation is possible. The automated voice telephone 20 may be implemented by installing, on a computer such as a server, a computer program that places calls by referring to telephone numbers stored in a database, and may be made up of a plurality of computer devices.

The answering machine response determination device 10 receives the call data recorded by the automated voice telephone 20 via a communication network such as the Internet, extracts various features from the call data, and determines, on the basis of a determination unit generated by machine learning with training data, whether the other party responded with an answering machine message.

Although the automated voice telephone 20 and the answering machine response determination device 10 are illustrated here as separate devices, they may instead be configured as a single device. Alternatively, each function making up the automated voice telephone 20 and the answering machine response determination device 10 may be configured as an independent device.

The answering machine response determination device 10 has a call data acquisition unit 101, a speaker overlap feature extraction unit 111, a clustering feature extraction unit 113, and a determination unit 121.

The call data acquisition unit 101 receives from the automated voice telephone 20 the call data of a call that the telephone placed automatically without human intervention, and thereby acquires the call data. The call data need only be recorded in two channels (dual channel), one carrying the caller-side audio and the other the callee-side audio. It is, for example, in WAV format, but is not limited to any particular data format.

The speaker overlap feature extraction unit 111 calculates the degree of overlap between the two speakers in the call data. Specifically, because the call data is recorded in dual channel, with the caller side and the callee side on separate channels, the unit detects the sections in which speech is present on both channels and determines the degree of overlap. The unit has, for example, a voice activity detector (VAD) that extracts the speech sections on both the caller-side and callee-side channels, from which the degree of overlap is calculated. A linear model over MFCCs (mel-frequency cepstral coefficients) may be used as the voice activity detector, for example. The unit calculates the overlap time (overlap) of the sections in which the caller side and the callee side are speaking, where S_hello is the section in which the caller side is speaking, S_restaurant is the section in which the callee side is speaking, and |S| is the length of a section in seconds.
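As a rough illustration of the overlap calculation, the sketch below takes the speech sections that a voice activity detector would output for each channel and sums the time during which both are active. The function names and the optional normalization by the callee's total speech time are assumptions, not details taken from the text.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec) of one speech section


def total_length(sections: List[Interval]) -> float:
    """|S|: total length in seconds of a list of speech sections."""
    return sum(end - start for start, end in sections)


def overlap_seconds(s_hello: List[Interval], s_restaurant: List[Interval]) -> float:
    """Seconds during which caller and callee speak at the same time.

    Assumes each list is sorted by start time and has no internal overlaps.
    """
    i = j = 0
    overlap = 0.0
    while i < len(s_hello) and j < len(s_restaurant):
        start = max(s_hello[i][0], s_restaurant[j][0])
        end = min(s_hello[i][1], s_restaurant[j][1])
        if end > start:
            overlap += end - start
        # Advance whichever section finishes first.
        if s_hello[i][1] < s_restaurant[j][1]:
            i += 1
        else:
            j += 1
    return overlap


s_hello = [(0.0, 4.0), (6.0, 9.0)]   # caller-side speech sections
s_restaurant = [(3.0, 7.0)]          # callee-side speech sections
overlap = overlap_seconds(s_hello, s_restaurant)
# Assumed normalization: share of the callee's speech that is overlapped.
overlap_ratio = overlap / total_length(s_restaurant)
```

An answering machine keeps playing its greeting regardless of what the caller says, so a recording of that kind would be expected to show more overlap than a conversation between two parties taking turns.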

The clustering feature extraction unit 113 calculates the clustering feature from the audio data of the callee-side channel only. It performs clustering so that machine responses, such as an answering machine response, and human responses are separated by the characteristic audio patterns of each. Specifically, it calculates a feature called BoAW (Bag of Audio Words). For the received call data, the unit first calculates MFCC features.

Specifically, each frame is 0.025 seconds long and is shifted by 0.01 seconds, so that 100 frames are generated per second. The 40-dimensional MFCC feature of each frame is treated as a point in a high-dimensional space, and k-means is applied to the set of points to generate clusters. k-means generates the clusters by minimizing the following expression, where x is the point for a frame, S_i is the set of points in the i-th cluster, μ_i is the center of that cluster, and S is the set of all clusters.

$$\underset{S}{\arg\min} \sum_{i} \sum_{x \in S_i} \lVert x - \mu_i \rVert^{2}$$
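To make the k-means step concrete, here is a minimal pure-Python sketch of Lloyd's algorithm, the usual iterative procedure for reducing the within-cluster sum of squared distances. It runs on toy 2-dimensional points rather than 40-dimensional MFCC vectors, and the naive initialization (first k points) and fixed iteration count are assumptions made for brevity.

```python
from typing import List, Sequence

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Point], k: int, iterations: int = 20) -> List[List[float]]:
    """Lloyd's algorithm: alternate assignment and center update."""
    centers = [list(p) for p in points[:k]]  # assumption: first k points as seeds
    for _ in range(iterations):
        clusters: List[List[Point]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def kmeans_objective(points: List[Point], centers: List[Point]) -> float:
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
centers = kmeans(points, k=2)
cost = kmeans_objective(points, centers)
```

Lloyd's algorithm only finds a local minimum of the objective, which is why practical implementations usually try several initializations.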

For example, if 100 call recordings of 10 seconds each are prepared, there are 10 × 100 × 100 = 100,000 points, and k-means is applied to them to generate 100 clusters. Next, the MFCC features of the call data to be judged are calculated, and for each frame the distance from its MFCC feature to each cluster is computed to find the nearest one. Assigning each frame to its nearest cluster yields a histogram over clusters, which is extracted as the clustering feature. In other words, the clustering feature extraction unit 113 generates clusters from the MFCC features, which are acoustic features of the callee-side audio data, and then builds a histogram over those clusters to extract the clustering feature.
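The histogram step can be sketched as follows, assuming the cluster centers have already been learned. The function names are hypothetical, and the example uses two toy 2-dimensional centers in place of the 100 clusters of 40-dimensional MFCC vectors described in the text.

```python
from typing import List, Sequence

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_cluster(frame: Point, centers: List[Point]) -> int:
    """Index of the cluster center closest to this frame's feature vector."""
    return min(range(len(centers)), key=lambda i: squared_distance(frame, centers[i]))


def boaw_histogram(frames: List[Point], centers: List[Point]) -> List[int]:
    """Bag of Audio Words: count how many frames fall into each cluster."""
    counts = [0] * len(centers)
    for frame in frames:
        counts[nearest_cluster(frame, centers)] += 1
    return counts


# Number of training points in the example from the text:
# 10 seconds x 100 frames per second x 100 recordings.
training_points = 10 * 100 * 100

centers = [(0.0, 0.0), (10.0, 10.0)]                       # toy cluster centers
frames = [(1.0, 0.0), (9.0, 9.0), (0.0, 2.0), (11.0, 10.0)]  # toy per-frame features
histogram = boaw_histogram(frames, centers)
```

The result is a fixed-length vector regardless of call length. Whether the counts are normalized by the number of frames is not stated in the text, so raw counts are returned here.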

The determination unit 121 is generated by machine learning using, as training data, the various features extracted from call data and the presence or absence of an answering machine response. When the speaker overlap feature and the clustering feature extracted from the call data to be judged are input to the determination unit 121 thus generated, it determines whether an answering machine responded in that call.

The determination unit 121 is a classifier that performs binary classification when given the features extracted from the target call data. Here the two classes are machine response, that is, a response by an answering machine, and human response. Logistic regression, random forest, SVM (support vector machine), or the like may be used as the classifier. Any of them may be used, but the choice may depend on the amount of training data: for example, logistic regression when there are fewer than 10,000 audio samples and an SVM when there are 10,000 or more.

For example, logistic regression performs best when training data is scarce. Given the weight parameters w, it computes the conditional probability that the input feature vector x is classified as machine response C_1, that is, a response by an answering machine, using the following formula.

$$P(C_1 \mid x; w) = \sigma(w^{T} x + w_0)$$
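The logistic regression formula translates directly into a few lines of code. The sketch below only evaluates the probability for given weights. The weight values, the feature vector, and the 0.5 decision threshold are made-up illustrations, and training the weights is out of scope here.

```python
import math
from typing import Sequence


def sigmoid(a: float) -> float:
    """Logistic function sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


def machine_response_probability(x: Sequence[float], w: Sequence[float], w0: float) -> float:
    """P(C1 | x; w) = sigma(w^T x + w0), with C1 the answering machine class."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w0)


def is_answering_machine(x: Sequence[float], w: Sequence[float], w0: float,
                         threshold: float = 0.5) -> bool:
    """Binary decision; the 0.5 threshold is an assumption."""
    return machine_response_probability(x, w, w0) >= threshold


# Hypothetical example: x = [speaker overlap ratio, one histogram bin]
x = [0.5, 2.0]
w = [4.0, 1.0]
w0 = -3.0
p = machine_response_probability(x, w, w0)  # sigma(4*0.5 + 1*2 - 3) = sigma(1.0)
```

In practice x would concatenate the speaker overlap feature with the full clustering histogram, and w would come from fitting on labelled calls.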

When a random forest is used as the classifier, the calculation is based on the results f_k of the individual decision trees.

Once the amount of data reaches a certain level, for example 10,000 audio samples or more, replacing logistic regression with an SVM as the classifier allows classification with higher performance. With the kernel trick, the SVM draws a hyperplane in a space of higher dimension than the original feature space, which gives it nonlinear classification ability. In the expression defining the decision boundary, (z_i, y_i) are the feature vector and label of the i-th existing training sample, and w and b are the weights to be learned.

FIG. 3 is a functional block diagram of the answering machine response determination system 1 according to the second embodiment of the present invention. In the second embodiment, features other than the speaker overlap feature and the clustering feature are also extracted and used to determine whether the response was an answering machine response. Detailed description of the configuration shared with the first embodiment is omitted.

The answering machine response determination system 1 comprises an automated voice telephone 20 and an answering machine response determination device 10. The automated voice telephone 20 automatically calls a predetermined telephone number and plays a predetermined message to the other party in an artificially generated voice, so that a call can be placed without a human caller. The answering machine response determination device 10 receives the call data recorded by the automated voice telephone 20 via a communication network such as the Internet, extracts various features from the call data, and determines, on the basis of a determination unit generated by machine learning with training data, whether the other party responded with an answering machine message.

In this embodiment, the answering machine response determination device 10 has a call data acquisition unit 101, a speaker overlap feature extraction unit 111, a clustering feature extraction unit 113, a call duration feature extraction unit 115, a text feature extraction unit 117, a machine-synthesized sound feature extraction unit 119, and a determination unit 121.

The call data acquisition unit 101 receives from the automated voice telephone 20 the call data of a call that the telephone placed automatically without human intervention, and thereby acquires the call data. The speaker overlap feature extraction unit 111 calculates the degree of overlap between the two speakers in the call data. The clustering feature extraction unit 113 calculates the clustering feature from the audio data of the callee-side channel only. Because the speaker overlap feature extraction unit 111 and the clustering feature extraction unit 113 have the same configuration as in the first embodiment, detailed description is omitted here.

The call duration feature extraction unit 115 extracts a call duration feature from the call data. For example, the call duration itself may be used as the feature. In another example, the square of the call duration may be used. Because answering machine responses usually play a fixed template recording, such calls tend to end after similar durations. Using the square of the call duration as a feature therefore lets the model express a quadratic function and capture the peak that appears for answering machine responses.
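The reason a squared duration helps can be shown with a short calculation: a linear model over the two features t and t² can represent the downward parabola −(t − t₀)² + c = 2t₀·t − t² + (c − t₀²), which peaks at duration t₀. The sketch below checks this numerically. The peak position of 30 seconds is a made-up value, not one taken from the figures.

```python
from typing import List


def duration_features(seconds: float) -> List[float]:
    """Call duration and its square, as two input features."""
    return [seconds, seconds ** 2]


def peaked_score(seconds: float, peak: float, height: float = 0.0) -> float:
    """Linear score over [t, t^2] that equals -(t - peak)^2 + height.

    The weights are chosen by hand here to show what a trained linear
    classifier could represent; they are not learned values.
    """
    t, t_squared = duration_features(seconds)
    w1, w2, bias = 2.0 * peak, -1.0, height - peak ** 2
    return w1 * t + w2 * t_squared + bias


# Hypothetical: answering machine calls cluster around 30 seconds.
scores = {t: peaked_score(t, peak=30.0) for t in (20.0, 30.0, 40.0)}
```

With only the raw duration as input, a linear model could express "longer is more machine-like" or the reverse, but not "durations near a typical greeting length are machine-like".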

The talk time feature amount extraction unit 115 may also extract an energy statistic of the voice frames as the talk time feature amount. Calculating the energy of each voice frame measures the noise over the entire talk time: a high energy statistic indicates a lot of background noise, and a low one indicates little. Specifically, the audio signal x(t) is divided into 0.025-second frames, each frame is multiplied by a window function w(t), and a short-time Fourier transform converts it into a signal X[t, f], where t is time and f is frequency.

The energy statistic is then calculated by summing the power spectrum up to the Nyquist frequency.
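The framing, windowing, and summation described above can be sketched as follows. This is a minimal illustration under several assumptions that the description does not state: a Hann window for w(t), non-overlapping frames, 0.025 read as a frame length in seconds, and the mean over frames as the final statistic. A naive discrete Fourier transform keeps the sketch free of dependencies; practical code would use an FFT library instead. All function names are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def hann_window(n: int) -> List[float]:
    """Hann window of length n (an assumed choice for w(t))."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def frame_energies(x: Sequence[float], sample_rate: int,
                   frame_sec: float = 0.025) -> List[float]:
    """Per-frame energy: sum of |X[t, f]|^2 over bins 0..Nyquist.

    Frames do not overlap here, which is a simplifying assumption.
    """
    n = int(round(sample_rate * frame_sec))
    window = hann_window(n)
    energies = []
    for start in range(0, len(x) - n + 1, n):
        frame = [x[start + i] * window[i] for i in range(n)]
        energy = 0.0
        for k in range(n // 2 + 1):  # frequency bins up to the Nyquist frequency
            coeff = sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n)
                        for i in range(n))
            energy += abs(coeff) ** 2
        energies.append(energy)
    return energies


def energy_statistic(energies: Sequence[float]) -> float:
    """Mean frame energy (using the mean is an assumption)."""
    return sum(energies) / len(energies) if energies else 0.0
```

Because the energy is quadratic in the signal amplitude, scaling the input by 0.1 scales the statistic by 0.01, which is what makes it usable as a rough loudness or noise-level indicator.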

The talk time feature amount extraction unit 115 may extract only the energy statistic as the talk time feature amount. In the most preferred embodiment, it extracts two feature amounts: the energy statistic and the square of the talk time. Alternatively, it may extract the energy statistic and the talk time itself.

The text feature amount extraction unit 117 has a voice recognition unit 1171 that converts voice into text data, and a response detection unit 1172 that uses machine learning to detect an answering machine response in the text data and calculates the text feature amount. The voice recognition unit 1171 performs voice recognition on the called party's voice data in the acquired call data and converts it into text.

The response detection unit 1172 detects whether the text data generated by the voice recognition unit 1171 contains a message typical of a machine response by an answering machine. For example, it detects phrases that are often used in answering machine greetings, such as 『ただいま留守にしております』 ("I am away at the moment"), 『メッセージをお願いします』 ("Please leave a message"), and 『営業時間外です』 ("We are currently closed"). Specifically, the response detection unit 1172 performs response detection by applying a pre-trained Japanese language model such as BERT (Bidirectional Encoder Representations from Transformers) or XLNet.
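As a rough illustration of what the response detection unit 1172 looks for, the sketch below swaps the pre-trained language model for a plain phrase match. This is a deliberately simpler technique than BERT or XLNet and is not the method of the patent. The phrase list reuses the three example phrases given in the description, and the function name and the 0.0/1.0 output convention are hypothetical.

```python
from typing import Sequence

# Example phrases quoted in the description as typical answering machine greetings.
TYPICAL_PHRASES = (
    "ただいま留守にしております",
    "メッセージをお願いします",
    "営業時間外です",
)


def text_feature(transcript: str,
                 phrases: Sequence[str] = TYPICAL_PHRASES) -> float:
    """Return 1.0 if any typical phrase occurs in the transcript, else 0.0."""
    # Remove whitespace that a speech recognizer may insert between words.
    normalized = "".join(transcript.split())
    return 1.0 if any(phrase in normalized for phrase in phrases) else 0.0
```

A fixed phrase list misses paraphrases and recognition errors, which is presumably why the description turns to a learned language model for this step.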

The machine-synthesized sound feature amount extraction unit 119 extracts a feature amount for determining whether the called party's voice was synthesized by a machine. For example, large amounts of human speech and of artificially synthesized speech are collected and fed into an LSTM (Long Short-Term Memory) network, which is trained to classify speech as human or synthetic. In addition, histograms of the Gaussian mixture distribution of the voice data are calculated for human speech and for synthetic speech, and a neural network is trained on them as teacher data to generate a machine-synthesized sound classifier in advance.

On acquiring call data, the machine-synthesized sound feature amount extraction unit 119 extracts the voice data of the called party's channel, calculates the histogram of the Gaussian mixture distribution of that voice data, has the trained machine-synthesized sound classifier calculate the probability that the voice is synthetic, and extracts the result as the machine-synthesized sound feature amount.
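The description does not define how the histogram of the Gaussian mixture distribution is computed. One possible reading, sketched below, is that each acoustic frame is assigned to its most likely mixture component and the normalized counts form the histogram. That interpretation, the diagonal covariance, the already-fitted mixture parameters, and all names are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

# One diagonal-covariance Gaussian component: (weight, means, variances).
Component = Tuple[float, Sequence[float], Sequence[float]]


def log_density(x: Sequence[float], comp: Component) -> float:
    """Log of weight * N(x; means, diag(variances)) for one component."""
    weight, means, variances = comp
    log_p = math.log(weight)
    for xi, mu, var in zip(x, means, variances):
        log_p += -0.5 * (math.log(2 * math.pi * var) + (xi - mu) ** 2 / var)
    return log_p


def gmm_histogram(frames: Sequence[Sequence[float]],
                  gmm: Sequence[Component]) -> List[float]:
    """Fraction of frames whose most likely component is each mixture component."""
    counts = [0] * len(gmm)
    for x in frames:
        best = max(range(len(gmm)), key=lambda k: log_density(x, gmm[k]))
        counts[best] += 1
    total = len(frames)
    return [c / total for c in counts] if total else [0.0] * len(gmm)
```

Under this reading, the resulting fixed-length vector is what would be passed to the trained classifier, regardless of how long the call is.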

The determination unit 121 is generated by machine learning that uses, as teacher data, the various feature amounts extracted from call data together with the presence or absence of a response by an answering machine. When the feature amounts extracted from the call data to be judged are input to the determination unit 121 generated in this way, it determines whether that call was answered by an answering machine. In the present embodiment, the speaker overlapping feature amount, the clustering feature amount, the talk time feature amount, the text feature amount, and the machine-synthesized sound feature amount are input to make the determination.

In the second embodiment, five feature amounts are input: the speaker overlapping, clustering, talk time, text, and machine-synthesized sound feature amounts. The device may instead be configured to make the determination from either the text feature amount or the machine-synthesized sound feature amount together with the other three. It may also be configured to use only the three feature amounts of speaker overlapping, clustering, and talk time, which speeds up processing.
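The alternative input configurations can be written down as a small lookup, as in the sketch below. The configuration names, the dictionary keys, and the simplification that each feature amount is a single number are all hypothetical. Only the membership of each set follows the description.

```python
from typing import Dict, List

# The three feature amounts common to every configuration.
BASE = ["speaker_overlap", "clustering", "talk_time"]

# Configuration names are hypothetical labels for the variants in the description.
FEATURE_SETS: Dict[str, List[str]] = {
    "all_five": BASE + ["text", "machine_synthesized"],
    "with_text": BASE + ["text"],
    "with_machine_synthesized": BASE + ["machine_synthesized"],
    "fast_three": BASE,
}


def build_input(features: Dict[str, float], feature_set: str) -> List[float]:
    """Pick the configured feature amounts, in a fixed order, for the classifier."""
    return [features[name] for name in FEATURE_SETS[feature_set]]
```

The three-feature configuration avoids both speech recognition and the synthetic-voice classifier, which is consistent with the stated benefit of faster processing.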

FIG. 4 is a graph showing the distribution of talk time in seconds when a human answers. As the figure shows, when a person answers on the called side, the call often ends after only a few seconds. This is probably because the called party tends to hang up as soon as they realize that the caller is an automated voice (a machine).

FIG. 5 is a graph showing the distribution of talk time in seconds when an answering machine responds. The talk time peaks at 50 to 60 seconds. A machine response by an answering machine takes a fixed number of seconds because a standard message is played. When an automated call reaches an answering machine, the greeting is followed by a period for recording a message, and the called side disconnects after the predetermined recording time, which is thought to be why the peak falls at 50 to 60 seconds.

FIG. 6 is a flowchart showing the process of using teacher data to generate, by machine learning, a classifier that determines whether an answering machine answered on the called side. First, the call data acquisition unit 101 acquires call data together with response result data indicating whether the called side's response was by an answering machine (step S601). Because the purpose here is to train the classifier, the response result data is acquired along with the call data as teacher data.

Next, each feature amount extraction unit extracts its feature amount from the call data (step S602). In the first embodiment, the speaker overlapping feature amount extraction unit 111 and the clustering feature amount extraction unit 113 extract the speaker overlapping feature amount and the clustering feature amount from the call data. In the second embodiment, the speaker overlapping feature amount extraction unit 111, the clustering feature amount extraction unit 113, the talk time feature amount extraction unit 115, the text feature amount extraction unit 117, and the machine-synthesized sound feature amount extraction unit 119 extract the speaker overlapping, clustering, talk time, text, and machine-synthesized sound feature amounts.

Next, the classifier used by the determination unit is generated by machine learning (step S603). Training on the various feature amounts extracted from the call data, with the presence or absence of a response by an answering machine as the label, produces a classifier that determines whether a call was answered by an answering machine. Examples of the machine learning method include logistic regression, random forests, and support vector machines, any of which may be used. The feature amounts are the speaker overlapping and clustering feature amounts in the first embodiment, and the speaker overlapping, clustering, talk time, text, and machine-synthesized sound feature amounts in the second embodiment. The classifier generated from the teacher data then determines whether the call data to be judged was answered by an answering machine.
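Steps S601 to S603 amount to supervised training of a binary classifier. The sketch below uses logistic regression, one of the three methods named above, fitted by plain batch gradient descent. The learning rate, the number of epochs, the 0.5 decision threshold, and all function names are assumptions. The description does not specify how the classifier is fitted.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights: Sequence[float], x: Sequence[float]) -> float:
    """Probability that the call was answered by an answering machine."""
    return sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))


def train_logistic_regression(features: Sequence[Sequence[float]],
                              labels: Sequence[int],
                              lr: float = 0.5,
                              epochs: int = 2000) -> List[float]:
    """Fit [bias, w1, ..., wn] by batch gradient descent on the log loss.

    features: one feature vector per call (output of step S602).
    labels:   1 if an answering machine answered, 0 otherwise (step S601).
    """
    weights = [0.0] * (len(features[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(features, labels):
            error = predict_proba(weights, x) - y
            grad[0] += error
            for j, xj in enumerate(x):
                grad[j + 1] += error * xj
        weights = [w - lr * g / len(features) for w, g in zip(weights, grad)]
    return weights


def is_answering_machine(weights: Sequence[float], x: Sequence[float]) -> bool:
    """Judge a new call using the trained weights (threshold 0.5 is assumed)."""
    return predict_proba(weights, x) >= 0.5
```

In practice the feature amounts have very different scales (seconds, probabilities, energies), so they would normally be standardized before gradient descent. That step is omitted here for brevity.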

1 Answering machine answer determination system
10 Answering machine answer determination device
20 Automatic voice telephone

Claims

An answering machine answer determination device comprising:
call data acquisition means for acquiring call data of a telephone call;
speaker overlapping feature amount extraction means for calculating the degree of overlap of both speakers in the call data;
clustering feature amount extraction means for calculating a clustering feature amount of the call data; and
determination means, generated by machine learning using the speaker overlapping feature amount, the clustering feature amount, and the presence or absence of a response by an answering machine as teacher data, for determining whether or not a response is by an answering machine,
wherein the determination means determines whether or not the call data is a response by an answering machine based on the speaker overlapping feature amount and the clustering feature amount extracted from the call data.

The answering machine answer determination device according to claim 1, further comprising talk time feature amount extraction means for extracting a talk time feature amount of the call data,
wherein the determination means further uses the talk time feature amount as teacher data.

The answering machine answer determination device according to claim 2, wherein the talk time feature amount is an energy statistic of the call voice.

The answering machine answer determination device according to any one of claims 1 to 3,
further comprising text feature amount extraction means having voice recognition means for converting the call data into text data, and response detection means for detecting, by machine learning, a response by an answering machine from the text data and calculating a text feature amount,
wherein the determination means further uses the text feature amount as teacher data.

The answering machine answer determination device according to any one of claims 1 to 4,
further comprising machine-synthesized sound feature amount generation means that is trained by machine learning using machine-synthesized sound data, human voice data, and Gaussian mixture histograms as teacher data, and that generates, as a machine-synthesized sound feature amount, the probability that the answering voice in the call data is machine-synthesized sound,
wherein the determination means further uses the machine-synthesized sound feature amount as teacher data.

An answering machine answer determination method comprising:
a call data acquisition step of acquiring call data of a telephone call;
a speaker overlapping feature amount extraction step of calculating the degree of overlap of both speakers in the call data;
a clustering feature amount extraction step of calculating a clustering feature amount of the call data; and
a determination step of determining whether or not a response is by an answering machine, using a determination unit generated by machine learning with the talk time feature amount, the speaker overlapping feature amount, the clustering feature amount, and the presence or absence of a response by an answering machine as teacher data.

An answering machine answer determination program that causes a computer to function as an answering machine answer determination device, the program causing the computer to execute:
a call data acquisition step of acquiring call data of a telephone call;
a speaker overlapping feature amount extraction step of calculating the degree of overlap of both speakers in the call data;
a clustering feature amount extraction step of calculating a clustering feature amount of the call data; and
a determination step of determining whether or not a response is by an answering machine, using a determination unit generated by machine learning with the talk time feature amount, the speaker overlapping feature amount, the clustering feature amount, and the presence or absence of a response by an answering machine as teacher data.