JPH05316198A

JPH05316198A - Caller recognition telephone set

Info

Publication number: JPH05316198A
Application number: JP4117384A
Authority: JP
Inventors: Shingo Nishimura; 新吾西村; Kazuhiko Okashita; 和彦岡下
Original assignee: Sekisui Chemical Co Ltd
Current assignee: Sekisui Chemical Co Ltd
Priority date: 1992-05-11
Filing date: 1992-05-11
Publication date: 1993-11-26

Abstract

PURPOSE: To eliminate the need to restrict the target speech to utterance content learned in advance when performing speaker recognition. CONSTITUTION: When the speaker recognition section 13 of a caller recognition telephone set 10 uses a neural network to recognize a speaker, it takes as input a series of vectors representing the outline of the short-time spectrum, combines the recognition results given by the individual outputs in the resulting series of network outputs by sum, product, majority decision or the like to obtain a single recognition result, and reduces the number of learning data for the network by clustering.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a caller recognition telephone device that allows calls only from speakers registered in advance, so that the user is not bothered by prank calls or wrong-number calls.

[0002]

[Prior Art] Conventionally, Japanese Patent Laid-Open No. 59-208965, "Calling System for Telephone", Japanese Patent Laid-Open No. 60-239163, "Automatic Answering Telephone System", Japanese Patent Laid-Open No. 2-55448, "Voice Dialing System", Japanese Patent Laid-Open No. 3-35648, "Nuisance Call Prevention Device", and others have been proposed. These conventional techniques contain little description of a specific recognition method, and where one is described, it goes no further than the following.

[0003] A feature quantity relating to the speaker is extracted from the input voice. The distance between this extracted data and dictionary data extracted in advance by the same method is calculated (similarity calculation). From this calculation result, it is determined whether the input voice is the voice of a registered speaker. If this determination shows that the input voice belongs to a person registered in advance, the response control unit rings the bell so that the call can be answered.

[0004] The present applicant has therefore already filed Japanese Patent Laid-Open No. 3-114345, "Caller Recognition Telephone Device". That device used a neural network to improve the recognition rate and the processing time.

[0005]

[Problem to Be Solved by the Invention] However, the caller recognition telephone device already proposed by the present applicant restricted the target speech to utterance content learned in advance when recognizing a speaker. That is, it presupposed that the caller would utter a specific phrase, which made it inconvenient to use. In addition, it could malfunction when any other words were input.

[0006] An object of the present invention is to provide a caller recognition telephone device in which the target speech need not be restricted to utterance content learned in advance when recognizing a speaker.

[0007]

[Means for Solving the Problem] The operation of the speaker recognition unit of the invention according to claim 1 is as follows. First, the neural network is trained using the outline of the short-time spectrum obtained from training speech. At this time, clustering is performed for each speaker to reduce the number of training data. At recognition time, the same short-time spectral outline is obtained from an arbitrary utterance, and the resulting series is input to the network to obtain a series of network outputs. Each output vector of the network indicates the likely speaker for its short-time input, and a comprehensive judgment such as a sum, product, or majority decision is made over the whole series to obtain a single recognition result.

[0008] The operation of the speaker recognition unit of the invention according to claim 2 is as follows. First, the neural network is trained using the outline of the short-time spectrum obtained from training speech. At this time, clustering is performed for each speaker to reduce the number of training data. At recognition time, the same short-time spectral outline is obtained from an arbitrary utterance, and the resulting series is input to the network to obtain a series of network outputs. Each output vector of the network indicates the likely speaker for its short-time input. An output vector selection threshold is provided so that only the highly reliable output vectors are selected, and a comprehensive judgment such as a sum, product, or majority decision is made over all of the selected vectors to obtain a single recognition result.
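The selection step of claim 2 can be sketched as follows. The text does not say how the reliability of an output vector is measured, so this sketch assumes that a vector is reliable when its largest component exceeds the selection threshold. The function name `select_reliable` and the sample values are hypothetical.

```python
def select_reliable(outputs, select_threshold):
    """Keep only the output vectors whose largest component exceeds the threshold.

    Assumption: "reliable" means one output unit responds strongly;
    the patent text does not define the criterion.
    """
    return [x for x in outputs if max(x) > select_threshold]


# Hypothetical network outputs for three frames and three output units.
frames = [
    [0.9, 0.1, 0.0],
    [0.4, 0.3, 0.3],  # ambiguous frame, dropped at a threshold of 0.5
    [0.8, 0.1, 0.1],
]
kept = select_reliable(frames, 0.5)
```

The kept vectors would then be combined by sum, product, or majority decision in the same way as the full series in claim 1.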

[0009] The "neural network" in the present invention is described in (1) to (4) below.

[0010] (1) Neural networks can be broadly divided by structure into two types: the hierarchical network shown in FIG. 3(A) and the mutually connected network shown in FIG. 3(B). The present invention may be configured with either type, but the hierarchical network is the more useful because a simple learning algorithm, described later, has been established for it.

[0011] (2) Network structure
As shown in FIG. 4, the hierarchical network has a layered structure consisting of an input layer, an intermediate layer, and an output layer. Each layer is composed of one or more units. Connections run only forward, from the input layer to the intermediate layer to the output layer, and there are no connections within a layer.

[0012] (3) Unit structure
As shown in FIG. 5, a unit is a model of a neuron in the brain, and its structure is simple. It receives inputs from other units, takes their sum, converts the sum according to a fixed rule (conversion function), and outputs the result. Each connection to another unit carries a variable weight representing the strength of the connection.
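A unit and a forward-only layered network of the kind described in (2) and (3) might look like the sketch below. The sigmoid conversion function and the bias term are assumptions for illustration; the text only speaks of "a fixed rule (conversion function)" and variable weights. All function names are hypothetical.

```python
import math


def sigmoid(u):
    """One common choice of conversion function (an assumption here)."""
    return 1.0 / (1.0 + math.exp(-u))


def unit_output(inputs, weights, bias=0.0, transform=sigmoid):
    """Weighted sum of the inputs, passed through the conversion function."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return transform(total)


def layer_output(inputs, weight_rows, biases):
    """Outputs of all units in one layer; units in a layer are not connected."""
    return [unit_output(inputs, w, b) for w, b in zip(weight_rows, biases)]


def forward(x, layers):
    """Propagate x through the layers in order (forward connections only)."""
    for weight_rows, biases in layers:
        x = layer_output(x, weight_rows, biases)
    return x


# One unit with two inputs: the weighted sum is 0.2 - 0.2 = 0.
y = unit_output([1.0, 0.5], [0.2, -0.4])
```

Each entry of `layers` is a pair of a weight matrix (one row per unit) and a bias list, so a network with one intermediate layer would be passed as a two-element list.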

[0013] (4) Learning (backpropagation)
Learning in a network means bringing the actual output closer to the target value (desired output). In general, learning is performed by changing the conversion function and the weights of each unit shown in FIG. 5.

[0014] As the learning algorithm, for example, the backpropagation described in Rumelhart, D.E., McClelland, J.L. and the PDP Research Group, PARALLEL DISTRIBUTED PROCESSING, the MIT Press, 1986, can be used.
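As a rough illustration of learning by backpropagation, the sketch below trains a network with one intermediate layer on a toy problem (a logical OR, not speech data). The sigmoid units, squared-error loss, learning rate, layer sizes, and per-pattern weight updates are all assumptions chosen for brevity; only the weights (including a bias weight) are adjusted, not the conversion function.

```python
import math
import random


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def init_layer(n_in, n_out, rng):
    # One row of weights per unit; the last entry of each row is the bias weight.
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]


def layer_forward(layer, x):
    return [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in layer]


def train_step(hidden, output, x, target, rate):
    """One backpropagation update for a single pattern; returns the squared error."""
    h = layer_forward(hidden, x)
    y = layer_forward(output, h)
    # Error terms at the output layer (derivative of squared error times sigmoid slope).
    d_out = [(yk - tk) * yk * (1.0 - yk) for yk, tk in zip(y, target)]
    # Error terms propagated back to the intermediate layer (computed before updating).
    d_hid = [hj * (1.0 - hj) * sum(d * row[j] for d, row in zip(d_out, output))
             for j, hj in enumerate(h)]
    for d, row in zip(d_out, output):
        for j, v in enumerate(h + [1.0]):
            row[j] -= rate * d * v
    for d, row in zip(d_hid, hidden):
        for i, v in enumerate(x + [1.0]):
            row[i] -= rate * d * v
    return 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, target))


rng = random.Random(0)
hidden = init_layer(2, 3, rng)
output = init_layer(3, 1, rng)
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]
losses = []
for epoch in range(2000):
    losses.append(sum(train_step(hidden, output, x, t, 0.5) for x, t in data))
first_loss, last_loss = losses[0], losses[-1]
```

Comparing `first_loss` with `last_loss` shows whether the actual outputs have moved toward the target values over the course of training.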

[0015]

[Operation] In the speaker recognition method according to claim 1, using a speaker recognition technique that does not restrict the utterance content removes the need for the caller to utter a specific phrase, which makes the caller recognition telephone device easier to use. The device can also cope with whatever words are input.

[0016] Because clustering provides representative vectors of multiple data items as the training data, the number of training data can be reduced while the learning effect is maintained. As a result, the training time of the neural network can be greatly shortened.

[0017] Furthermore, in the speaker recognition method according to claim 2, selecting the highly reliable output vectors makes the comprehensive judgment more dependable and improves the recognition rate.

[0018]

[Embodiment] FIG. 1 is a block diagram showing an embodiment of the caller recognition telephone device, FIG. 2 is a schematic diagram showing the speaker determination threshold and the output values of the network, FIG. 3 is a schematic diagram showing neural networks, FIG. 4 is a schematic diagram showing a hierarchical neural network, and FIG. 5 is a schematic diagram showing the structure of a unit.

[0019] The caller recognition telephone device 10 recognizes whether a caller is a speaker registered in advance and allows the telephone basic circuit 11 to respond only to registered speakers. As shown in FIG. 1, the telephone device 10 has an automatic response unit 12 that automatically answers an incoming call, a speaker recognition unit 13 that recognizes the speaker from the voice of the caller replying to the automatic response unit 12, and a response control unit 14 that controls the telephone basic circuit 11 according to the recognition result of the speaker recognition unit 13. 1A denotes the other party's telephone device.

[0020] Tables 1 and 2 illustrate how the caller recognition telephone device 10 is used. Table 1 illustrates its use when learning a registrant's voice, and Table 2 illustrates its use when a call is received.

[0021]

[Table 1]

[0022]

[Table 2]

[0023] The speaker recognition method used by the speaker recognition unit 13 is described below. For 5 registrants and 25 non-registrants, a training sentence is Fourier-analyzed at a sampling frequency of 10 kHz, a frame length of 25.6 msec, and a frame period of 12.8 msec, yielding a series of 68-channel (1/12 oct.) power vectors in the 100 to 5000 Hz band.
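The figures in this paragraph can be checked with a little arithmetic. At 10 kHz, a 25.6 msec frame is 256 samples and a 12.8 msec frame period is 128 samples, so consecutive frames overlap by half. The band from 100 Hz to 5000 Hz spans $\log_2 50 \approx 5.64$ octaves, which gives about 67.7 twelfth-octave steps, consistent with 68 channels if the count is rounded up. The sketch below assumes that rounding and a geometric layout of the band edges with the top edge clipped at 5000 Hz; neither detail is stated in the text, and the spectral analysis itself is omitted.

```python
import math

SAMPLE_RATE_HZ = 10_000
FRAME_LENGTH_S = 0.0256
FRAME_PERIOD_S = 0.0128
BAND_LOW_HZ, BAND_HIGH_HZ = 100.0, 5000.0
BANDS_PER_OCTAVE = 12

frame_len = round(FRAME_LENGTH_S * SAMPLE_RATE_HZ)  # samples per frame
hop = round(FRAME_PERIOD_S * SAMPLE_RATE_HZ)        # samples between frame starts
# Assumption: the channel count is the number of 1/12-octave steps, rounded up.
n_channels = math.ceil(BANDS_PER_OCTAVE * math.log2(BAND_HIGH_HZ / BAND_LOW_HZ))


def split_frames(samples):
    """Return overlapping frames of frame_len samples, one every hop samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def band_edges():
    """Lower edge of each channel followed by the final upper edge (clipped)."""
    edges = [BAND_LOW_HZ * 2 ** (k / BANDS_PER_OCTAVE) for k in range(n_channels)]
    return edges + [BAND_HIGH_HZ]


one_second = [0.0] * SAMPLE_RATE_HZ
frames = split_frames(one_second)
```

Each frame would then be Fourier-analyzed and its power summed per channel to produce one 68-dimensional power vector.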

[0024] Hierarchical clustering is performed on these power vectors to obtain about 200 representative vectors per speaker.
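One way to reduce a speaker's power vectors to a fixed number of representatives is bottom-up (agglomerative) clustering, sketched below. The text specifies only "hierarchical clustering"; the Euclidean distance, the centroid-based merging, and the use of cluster centroids as representative vectors are assumptions, and the function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def representatives(vectors, k):
    """Merge the two closest clusters repeatedly until k clusters remain.

    Returns the centroid of each remaining cluster. This naive version
    rescans all pairs at every step, which is fine only for small inputs.
    """
    clusters = [(list(v), 1) for v in vectors]  # (centroid, member count)
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sq_dist(clusters[i][0], clusters[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (ci, ni), (cj, nj) = clusters[i], clusters[j]
        merged = [(ni * p + nj * q) / (ni + nj) for p, q in zip(ci, cj)]
        del clusters[j]  # j > i, so index i is still valid
        clusters[i] = (merged, ni + nj)
    return [c for c, _ in clusters]


# Four 2-dimensional points forming two obvious groups.
reps = representatives([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]], 2)
```

In the embodiment, `k` would be about 200 and each vector would have 68 components.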

[0025] These representative vectors are used as inputs to the neural network (the input layer has 68 units, and the number of input patterns equals the number of speakers × the number of representative vectors after clustering), and the network is trained sufficiently so that the corresponding output unit is activated only for a registrant.
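The training patterns described here could be assembled as in the sketch below. With 5 registrants, 25 non-registrants, and about 200 representatives each, this gives roughly 30 × 200 = 6000 patterns. The target values are assumptions: 1.0 on the registrant's own output unit and 0.0 elsewhere, with all outputs at 0.0 for a non-registrant. The names `build_training_set`, `reg_a`, and so on are hypothetical.

```python
def build_training_set(reps_by_speaker, registrants):
    """Pair every representative vector with a target output vector.

    reps_by_speaker: dict mapping speaker name to a list of representative vectors.
    registrants: ordered list of registered speakers, one output unit each.
    """
    patterns = []
    for speaker, reps in reps_by_speaker.items():
        # Non-registrants match no output unit, so their target is all zeros.
        target = [1.0 if speaker == r else 0.0 for r in registrants]
        patterns.extend((vec, list(target)) for vec in reps)
    return patterns


# Tiny stand-in data: 1-dimensional vectors instead of 68-dimensional ones.
reps_by_speaker = {
    "reg_a": [[0.1], [0.2]],
    "reg_b": [[0.3]],
    "other": [[0.4], [0.5]],
}
patterns = build_training_set(reps_by_speaker, ["reg_a", "reg_b"])
```

Each `(vector, target)` pair would then be presented to the network during training.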

[0026] For an arbitrary utterance, a series of power vectors is obtained in the same way as in paragraph [0023]. This series is input to the network trained in paragraph [0025] to obtain a series of output vectors $\{x^1, x^2, \ldots, x^n\}$, where $x^t = (x^t_1, \ldots, x^t_5)$ and $n$ is the number of frames.

[0027] The following three methods are applied to the above series of output vectors to determine whether the input came from a registrant or a non-registrant.

[0028] (1) If the maximum over $s$ of $\sum_t x^t_s$ ($s = 1, \ldots, 5$) exceeds a preset speaker determination threshold, the caller is a registrant; otherwise, a non-registrant.

[0029] (2) If the maximum over $s$ of $\prod_t x^t_s$ ($s = 1, \ldots, 5$) exceeds a preset speaker determination threshold, the caller is a registrant; otherwise, a non-registrant.

[0030] (3) For each $s$ ($s = 1, \ldots, 5$), the frames $t$ for which $\max\{x^t_1, \ldots, x^t_5\} = x^t_s$ are counted. If the maximum of these counts exceeds a preset speaker determination threshold, the caller is a registrant; otherwise, a non-registrant.
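The three methods above can be sketched as follows. Each function turns the series of output vectors into one score per output unit, and a shared `decide` helper applies the speaker determination threshold. The function names and the sample outputs are hypothetical, and the thresholds would have to be chosen separately for each method.

```python
import math


def sum_scores(outputs):
    """Method (1): per-unit sum of the outputs over all frames."""
    return [sum(x[s] for x in outputs) for s in range(len(outputs[0]))]


def product_scores(outputs):
    """Method (2): per-unit product of the outputs over all frames."""
    return [math.prod(x[s] for x in outputs) for s in range(len(outputs[0]))]


def vote_counts(outputs):
    """Method (3): number of frames in which each unit has the largest output."""
    counts = [0] * len(outputs[0])
    for x in outputs:
        counts[x.index(max(x))] += 1
    return counts


def decide(scores, threshold):
    """Return the index of the winning unit, or None for a non-registrant."""
    best = max(scores)
    return scores.index(best) if best > threshold else None


# Hypothetical outputs for three frames and two output units.
outputs = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.7]]
votes = vote_counts(outputs)
speaker = decide(votes, 1)
```

If the outputs lie between 0 and 1, the product in method (2) shrinks as the number of frames grows while the sum in method (1) grows, so a threshold tuned for one utterance length may not suit another.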

[0031] The following methods may be used instead of the three methods above.
(1) If only the maximum value of $\sum_t x^t_s$ ($s = 1, \ldots, 5$) exceeds a preset first speaker determination threshold $\theta_1$, and the other values fall below a preset second speaker determination threshold $\theta_2$, the caller is a registrant; otherwise, a non-registrant.

[0032] (2) If only the maximum value of $\prod_t x^t_s$ ($s = 1, \ldots, 5$) exceeds a preset first speaker determination threshold $\theta_1$, and the other values fall below a preset second speaker determination threshold $\theta_2$, the caller is a registrant; otherwise, a non-registrant.

[0033] (3) For each $s$ ($s = 1, \ldots, 5$), the frames $t$ for which $\max\{x^t_1, \ldots, x^t_5\} = x^t_s$ are counted. If the maximum of these counts exceeds a preset first speaker determination threshold $\theta_1$, and the other counts fall below a preset second speaker determination threshold $\theta_2$, the caller is a registrant; otherwise, a non-registrant.
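The two-threshold variants share one decision rule, sketched below: the largest score must exceed $\theta_1$ and every other score must stay below $\theta_2$. The rule applies equally to the sums, products, or vote counts. The function name and the sample scores are hypothetical.

```python
def decide_two_thresholds(scores, theta1, theta2):
    """Return the winning unit's index, or None for a non-registrant.

    Accepts only when the largest score exceeds theta1 and all
    remaining scores are below theta2.
    """
    best = max(scores)
    winner = scores.index(best)
    others = [v for s, v in enumerate(scores) if s != winner]
    if best > theta1 and all(v < theta2 for v in others):
        return winner
    return None


clear_case = decide_two_thresholds([5.0, 0.5, 0.2], theta1=3.0, theta2=1.0)
rival_case = decide_two_thresholds([5.0, 2.0, 0.2], theta1=3.0, theta2=1.0)
weak_case = decide_two_thresholds([2.0, 0.5, 0.2], theta1=3.0, theta2=1.0)
```

Compared with a single threshold, this rule also rejects inputs for which a second output unit responds strongly, at the cost of one more parameter to set.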

[0034] As an example of arbitrary utterances, a speaker recognition experiment was carried out using the three words "ただいま" ("I'm home"), "こんにちは" ("hello"), and "おはようございます" ("good morning"), with the network trained on the short sentence "明日は東京に出ますのですみませんが留守にします。" ("I will be going to Tokyo tomorrow, so I am sorry, but I will be away."). The 5 registrants used for training and 26 non-registrants not used for training were all recognized correctly.

[0035]

[Effect of the Invention] As described above, the present invention makes it possible to provide a caller recognition telephone device in which the target speech need not be restricted to utterance content learned in advance when recognizing a speaker.

[Brief description of drawings]

[FIG. 1] FIG. 1 is a block diagram showing an embodiment of the caller recognition telephone device.

[FIG. 2] FIG. 2 is a schematic diagram showing the speaker determination threshold and the output values of the network.

[FIG. 3] FIG. 3 is a schematic diagram showing neural networks.

[FIG. 4] FIG. 4 is a schematic diagram showing a hierarchical neural network.

[FIG. 5] FIG. 5 is a schematic diagram showing the structure of a unit.

[Explanation of symbols]

10 Caller recognition telephone device
11 Telephone basic circuit
12 Automatic response unit
13 Speaker recognition unit
14 Response control unit

Continuation of front page
(51) Int. Cl.⁵: H04M 3/42 P

Claims

[Claims]

1. A caller recognition telephone device that recognizes whether a caller is a speaker registered in advance and allows a telephone basic circuit to respond only to registered speakers, the device comprising: an automatic response unit that automatically answers an incoming call; a speaker recognition unit that recognizes the speaker from the voice of the caller replying to the automatic response unit; and a response control unit that controls the telephone basic circuit according to the recognition result of the speaker recognition unit, wherein the speaker recognition unit, when recognizing a speaker using a neural network, takes as input a series of vectors representing the outline of a short-time spectrum, combines the recognition results given by the individual outputs in the series of network outputs by sum, product, majority decision or the like to obtain a single recognition result, and reduces the number of learning data of the network by clustering.

2. A caller recognition telephone device that recognizes whether a caller is a speaker registered in advance and allows a telephone basic circuit to respond only to registered speakers, the device comprising: an automatic response unit that automatically answers an incoming call; a speaker recognition unit that recognizes the speaker from the voice of the caller replying to the automatic response unit; and a response control unit that controls the telephone basic circuit according to the recognition result of the speaker recognition unit, wherein the speaker recognition unit, when recognizing a speaker using a neural network, takes as input a series of vectors representing the outline of a short-time spectrum, selects output vectors from the series of network outputs using an output vector selection threshold, combines the recognition results given by the individual selected output vectors by sum, product, majority decision or the like to obtain a single recognition result, and reduces the number of learning data of the network by clustering.