JPH06152728A

JPH06152728A - Answer message switching speech equipment

Info

Publication number: JPH06152728A
Application number: JP4291376A
Authority: JP
Inventors: Shingo Nishimura; 新吾西村
Original assignee: Sekisui Chemical Co Ltd
Current assignee: Sekisui Chemical Co Ltd
Priority date: 1992-10-29
Filing date: 1992-10-29
Publication date: 1994-05-31

Abstract

PURPOSE: To provide speech equipment, such as telephone equipment or interphone equipment, in which speaker recognition does not require the target speech to be limited to utterance content learned in advance. CONSTITUTION: When the speaker recognition part of the answer message switching speech equipment recognizes a speaker using a neural network, a sequence of vectors representing the rough shape of the short-time spectrum is input, the sequence of network outputs is integrated by the sum, product, majority vote, or the like of the recognition results of the individual outputs to obtain a single recognition result, and the number of training data for the network is reduced by clustering.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to a response message switching communication device capable of returning a special message addressed to a particular person, depending on who has placed the call. The present invention also relates to a response message switching intercom device capable of returning a special message addressed to a particular person, depending on who is using an intercom installed at a front door or the like.

[0002]

[Prior Art] Conventionally, Japanese Examined Patent Publication No. 3-59619 "Answering machine with artificial intelligence function", Japanese Unexamined Patent Publication No. 61-39756 "Answering telephone system", Japanese Unexamined Patent Publication No. 64-86742 "Message storage and reproducing apparatus", and Japanese Unexamined Patent Publication No. Hei 2-294146 "Recording/playback device" have been proposed. These prior-art documents indicate that speech recognition (or speaker recognition) technology is used, but they either give no description of a concrete recognition method or describe it only to the extent shown below. In addition, most of them presuppose that the caller utters a specific word (a keyword, the caller's name, or the like).

[0003] A feature quantity contained in the input speech (spectrum, pitch, cepstrum, etc.) is extracted; the distance between the extracted data and dictionary data extracted beforehand by the same method is calculated (by DP matching or the like); whose voice the input speech is is determined from the calculation result; and subsequent processing is controlled according to the determination result.

[0004] Also, Japanese Unexamined Patent Publication No. 61-39756 "Answering telephone system" has conventionally been proposed. This prior art indicates that speech recognition technology is used, but the concrete recognition method is described only to the extent shown below. It also presupposes that the intercom user utters a keyword.

[0005] A feature quantity contained in the input speech (pitch, combination of vowels and consonants, etc.) is extracted; the distance between the extracted data and dictionary data extracted beforehand by the same method is calculated; whose voice the input speech is is determined from the calculation result; and subsequent processing is controlled according to the determination result.

[0006]

[Problems to Be Solved by the Invention] The above prior art gives little description of a concrete method of speech recognition (or speaker recognition), and the recognition methods that are described appear to leave room for improvement in recognition rate. Where the caller is assumed to utter a specific word, the device is inconvenient to use and may malfunction if any other word is input. Likewise, because the intercom user is assumed to utter a keyword, the device is inconvenient to use and may malfunction if any other word is input.

[0007] An object of the present invention is to provide a communication device, such as a telephone device or an intercom device, in which speaker recognition does not require the target speech to be limited to utterance content learned in advance.

[0008]

[Means for Solving the Problems] The invention according to claim 1 is a response message switching communication device that recognizes who the caller is and can respond to a pre-registered person with a special message addressed to that person. The device has an automatic response unit that automatically answers an incoming call, a speaker recognition unit that recognizes the speaker from the voice of the caller replying to the automatic response unit, a response control unit that controls the basic call circuit according to the recognition result of the speaker recognition unit, and a message storage unit that stores messages. When the speaker recognition unit recognizes the speaker using a neural network, a sequence of vectors representing the rough shape of the short-time spectrum is input, the sequence of network outputs is integrated by the sum, product, majority vote, or the like of the recognition results of the individual outputs to obtain a single recognition result, and the number of training data for the network is reduced by clustering.

[0009] The invention according to claim 2 is a response message switching communication device that recognizes who the caller is and can respond to a pre-registered person with a special message addressed to that person. The device has an automatic response unit that automatically answers an incoming call, a speaker recognition unit that recognizes the speaker from the voice of the caller replying to the automatic response unit, a response control unit that controls the basic call circuit according to the recognition result of the speaker recognition unit, and a message storage unit that stores messages. When the speaker recognition unit recognizes the speaker using a neural network, a sequence of vectors representing the rough shape of the short-time spectrum is input, output vectors are selected from the sequence of network outputs using an output-vector selection threshold, the selected output vectors are integrated by the sum, product, majority vote, or the like of the recognition results of the individual outputs to obtain a single recognition result, and the number of training data for the network is reduced by clustering.

[0010] The invention according to claim 3 is a response message switching intercom device that recognizes who the intercom user is and can respond to a pre-registered person with a special message addressed to that person. The device has an intercom (slave unit) installed at a front door or the like, an automatic response unit that automatically responds to the intercom user, a speaker recognition unit that recognizes the speaker from the voice of the intercom user replying to the automatic response unit, a response control unit that controls the basic intercom circuit according to the recognition result of the speaker recognition unit, and a message storage unit that stores messages. When the speaker recognition unit recognizes the speaker using a neural network, a sequence of vectors representing the rough shape of the short-time spectrum is input, the sequence of network outputs is integrated by the sum, product, majority vote, or the like of the recognition results of the individual outputs to obtain a single recognition result, and the number of training data for the network is reduced by clustering.

[0011] The invention according to claim 4 is a response message switching intercom device that recognizes who the intercom user is and can respond to a pre-registered person with a special message addressed to that person. The device has an intercom (slave unit) installed at a front door or the like, an automatic response unit that automatically responds to the intercom user, a speaker recognition unit that recognizes the speaker from the voice of the intercom user replying to the automatic response unit, a response control unit that controls the basic intercom circuit according to the recognition result of the speaker recognition unit, and a message storage unit that stores messages. When the speaker recognition unit recognizes the speaker using a neural network, a sequence of vectors representing the rough shape of the short-time spectrum is input, output vectors are selected from the sequence of network outputs using an output-vector selection threshold, the selected output vectors are integrated by the sum, product, majority vote, or the like of the recognition results of the individual outputs to obtain a single recognition result, and the number of training data for the network is reduced by clustering.

[0012] In the invention according to claim 5, the response message switching intercom device according to claim 3 or 4 is combined with a telephone, so that the telephone can be used to record speech when a registrant's voice is recorded and learned.

[0013]

[Operation] The operation of the speaker recognition unit of the invention according to claim 1 is as follows. First, the neural network is trained using the rough shape of the short-time spectrum obtained from training speech. At this stage, clustering is performed for each speaker to reduce the number of training data. At recognition time, the same rough shape of the short-time spectrum is obtained from an arbitrary utterance, the resulting sequence is input to the network, and a sequence of network outputs is obtained. Each output vector of the network suggests the speaker for one short-time input; a single recognition result is obtained by making an overall judgment over the whole sequence using the sum, product, majority vote, or the like.

[0014] Because the speaker recognition method according to claim 1 uses speaker recognition technology that does not restrict the utterance content, the caller no longer needs to utter a specific word, and the response message switching telephone device becomes easier to use. The device can also cope with whatever words are input.

[0015] Because clustering makes representative vectors of multiple data items the training data, the number of training data can be reduced while the learning effect is maintained. As a result, the training time of the neural network can be shortened considerably.

[0016] The operation of the speaker recognition unit of the invention according to claim 2 is as follows. First, the neural network is trained using the rough shape of the short-time spectrum obtained from training speech. At this stage, clustering is performed for each speaker to reduce the number of training data. At recognition time, the same rough shape of the short-time spectrum is obtained from an arbitrary utterance, the resulting sequence is input to the network, and a sequence of network outputs is obtained. Each output vector of the network suggests the speaker for one short-time input, but an output-vector selection threshold is provided so that only the highly reliable output vectors are selected, and a single recognition result is obtained by making an overall judgment over all the selected vectors using the sum, product, majority vote, or the like. A concrete way to select output vectors is, for example, to select a vector if at least one of its components exceeds the preset output-vector selection threshold.
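The selection step can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names, the example output values, and the threshold of 0.8 are hypothetical, and each output vector is assumed to be a list with one component per registered speaker.

```python
def select_confident_frames(outputs, select_threshold):
    """Keep only output vectors in which at least one component
    exceeds the output-vector selection threshold."""
    return [x for x in outputs if max(x) > select_threshold]


def summed_scores(outputs):
    """Integrate the kept output vectors by a per-speaker sum.
    Returns an empty list when no vector was kept."""
    return [sum(column) for column in zip(*outputs)]


# Hypothetical network outputs for three frames and two registered speakers.
frames = [[0.9, 0.1], [0.4, 0.5], [0.2, 0.95]]
kept = select_confident_frames(frames, select_threshold=0.8)
scores = summed_scores(kept)
```

In this example the middle frame is dropped because neither component exceeds 0.8, so only the two confident frames contribute to the integrated scores.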

[0017] In the speaker recognition method according to claim 2, selecting the highly reliable output vectors makes the overall judgment more certain and improves the recognition rate.

[0018] The operation of the speaker recognition unit of the invention according to claim 3 is as follows. First, the neural network is trained using the rough shape of the short-time spectrum obtained from training speech. At this stage, clustering is performed for each speaker to reduce the number of training data. At recognition time, the same rough shape of the short-time spectrum is obtained from an arbitrary utterance, the resulting sequence is input to the network, and a sequence of network outputs is obtained. Each output vector of the network suggests the speaker for one short-time input; a single recognition result is obtained by making an overall judgment over the whole sequence using the sum, product, majority vote, or the like.

[0019] Because the speaker recognition method according to claim 3 uses speaker recognition technology that does not restrict the utterance content, the intercom user no longer needs to utter a keyword, and the response message switching intercom device becomes easier to use. The device can also cope with whatever words are input.

[0020] Because clustering makes representative vectors of multiple data items the training data, the number of training data can be reduced while the learning effect is maintained. As a result, the training time of the neural network can be shortened considerably.

[0021] The operation of the speaker recognition unit of the invention according to claim 4 is as follows. First, the neural network is trained using the rough shape of the short-time spectrum obtained from training speech. At this stage, clustering is performed for each speaker to reduce the number of training data. At recognition time, the same rough shape of the short-time spectrum is obtained from an arbitrary utterance, the resulting sequence is input to the network, and a sequence of network outputs is obtained. Each output vector of the network suggests the speaker for one short-time input, but an output-vector selection threshold is provided so that only the highly reliable output vectors are selected, and a single recognition result is obtained by making an overall judgment over all the selected vectors using the sum, product, majority vote, or the like. A concrete way to select output vectors is, for example, to select a vector if at least one of its components exceeds the preset output-vector selection threshold.

[0022] In the speaker recognition method according to claim 4, selecting the highly reliable output vectors makes the overall judgment more certain and improves the recognition rate.

[0023] The "neural network" in the present invention is explained in (1) to (4) below.

[0024] (1) Neural networks can be broadly classified by structure into two types: the hierarchical network shown in FIG. 5(A) and the mutually connected network shown in FIG. 5(B). The present invention may be configured with either type, but the hierarchical network is more useful because a simple learning algorithm, described later, has been established for it.

[0025] (2) Network structure. As shown in FIG. 6, the hierarchical network has a layered structure consisting of an input layer, an intermediate layer, and an output layer. Each layer consists of one or more units. Connections run only forward, from input layer to intermediate layer to output layer; there are no connections within a layer.

[0026] (3) Unit structure. As shown in FIG. 7, a unit is a model of a neuron in the brain, and its structure is simple. It receives inputs from other units, takes their sum, transforms the sum by a fixed rule (transfer function), and outputs the result. Each connection to another unit carries a variable weight that represents the strength of the connection.
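A unit of this kind, and the forward-only pass through a layered network, can be sketched as follows. The logistic (sigmoid) transfer function, the bias term, and all names and weight values here are assumptions for illustration; the text only states that the weighted inputs are summed and passed through a transfer function.

```python
import math


def unit_output(inputs, weights, bias=0.0):
    """One unit: weighted sum of the inputs, then a transfer function.
    The sigmoid used here is an assumed choice of transfer function."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-total))


def layer_output(inputs, weight_rows):
    """One layer: every unit sees all outputs of the previous layer;
    there are no connections within the layer."""
    return [unit_output(inputs, row) for row in weight_rows]


def forward(inputs, layers):
    """Forward-only pass: input layer -> intermediate layer -> output layer."""
    activations = inputs
    for weight_rows in layers:
        activations = layer_output(activations, weight_rows)
    return activations


# Hypothetical tiny network: 2 inputs, 2 intermediate units, 1 output unit.
hidden_weights = [[0.5, -0.25], [1.0, 1.0]]
output_weights = [[1.0, -1.0]]
result = forward([1.0, 2.0], [hidden_weights, output_weights])
```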

[0027] (4) Learning (backpropagation). Training the network means bringing the actual output closer to the target value (the desired output); in general, training is carried out by changing the transfer function and the weights of each unit shown in FIG. 7.

[0028] As the learning algorithm, for example, the backpropagation described in Rumelhart, D.E., McClelland, J.L. and the PDP Research Group, Parallel Distributed Processing, the MIT Press, 1986, can be used.

[0029]

[Embodiments] FIG. 1 is a schematic diagram showing an example of a response message switching telephone device, FIG. 2 is a schematic diagram showing speaker output values, FIG. 3 is a schematic diagram showing an example of a response message switching intercom device, FIG. 4 is a schematic diagram showing speaker output values, FIG. 5 is a schematic diagram showing neural networks, FIG. 6 is a schematic diagram showing a hierarchical neural network, and FIG. 7 is a schematic diagram showing the structure of a unit.

[0030] (First Embodiment) (see FIGS. 1 and 2) FIG. 1 shows a response message switching telephone device, which is used according to procedures (A) to (C) below.

[0031] (A) Learning a registrant's voice: the procedure is as shown in Table 1.

[0032]

[Table 1]

[0033] (B) Registering a message: pick up the handset; press the registration-mode button; the synthesized voice asks "Who is this for?"; specify Mr. Yamada with the PB buttons (voice input may be used instead of the PB buttons); the synthesized voice says "Please leave your message"; say "Yamada-san, 8 o'clock at the usual place"; replace the handset; the device returns to normal mode.

[0034] (C) When a call comes in: the procedure is as shown in Table 2.

[0035]

[Table 2]

[0036] The speaker recognition method is described in detail below. For 5 registrants and 25 non-registrants, the training sentence is Fourier-analyzed at a sampling frequency of 10 kHz, a frame length of 25.6 msec, and a frame period of 12.8 msec, yielding a sequence of 68-channel (1/12 octave) power vectors over the 100 to 5000 Hz band.
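The analysis parameters can be checked with a short calculation. This is only a sketch: the constant and function names are hypothetical, and counting the final, partially filled 1/12-octave step as a channel (rounding up) is an assumption about how the 68 channels are obtained.

```python
import math

SAMPLE_RATE_HZ = 10_000
FRAME_LENGTH_S = 25.6e-3
FRAME_PERIOD_S = 12.8e-3

frame_len = round(FRAME_LENGTH_S * SAMPLE_RATE_HZ)  # samples per frame
hop = round(FRAME_PERIOD_S * SAMPLE_RATE_HZ)        # samples between frame starts


def channel_count(low_hz, high_hz, steps_per_octave):
    """Number of 1/steps_per_octave-octave steps needed to cover the band,
    rounding the last partial step up (an assumption)."""
    return math.ceil(steps_per_octave * math.log2(high_hz / low_hz))


def frame_starts(n_samples, frame_len, hop):
    """Start indices of all full frames in a signal of n_samples samples."""
    if n_samples < frame_len:
        return []
    return list(range(0, n_samples - frame_len + 1, hop))


n_channels = channel_count(100.0, 5000.0, 12)
starts = frame_starts(1024, frame_len, hop)
```

With these values each frame holds 256 samples, consecutive frames overlap by half (128 samples), and the 100 to 5000 Hz band spans about 67.7 steps of 1/12 octave, which rounds up to the 68 channels stated in the text.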

[0037] Hierarchical clustering is applied to these power vectors to obtain about 200 representative vectors for each speaker.
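The reduction to representative vectors can be sketched as follows. The text names hierarchical clustering but not the linkage or distance; the bottom-up merging of the two clusters with the closest centroids, the squared Euclidean distance, and the use of cluster centroids as representatives are all assumptions made for this illustration, as are the function names and the toy data.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def representative_vectors(vectors, k):
    """Agglomerative clustering: start with one cluster per vector and
    repeatedly merge the two clusters whose centroids are closest until
    k clusters remain. Returns the k cluster centroids."""
    clusters = [[list(v)] for v in vectors]
    while len(clusters) > k:
        centroids = [centroid(c) for c in clusters]
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = squared_distance(centroids[i], centroids[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return [centroid(c) for c in clusters]


# Toy data: four 2-dimensional vectors reduced to two representatives.
data = [[0.0, 0.0], [0.2, 0.0], [10.0, 10.0], [10.0, 10.2]]
reps = sorted(representative_vectors(data, 2))
```

This naive version recomputes all pairwise distances at every merge, so it is only suitable for small examples; for the roughly 200 representatives per speaker described in the text, a library implementation would be the practical choice.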

[0038] These representative vectors are used as the input to the neural network (the input layer has 68 units, and the number of input patterns equals the number of speakers times the number of representative vectors after clustering), and the network is trained sufficiently so that the corresponding output unit is activated only for a registrant.

[0039] For an arbitrary utterance, a sequence of power vectors is obtained in the same manner as in the analysis of the training sentence above.

[0040] This sequence is input to the network trained as described above, and a sequence of output vectors $\{x^1, x^2, \dots, x^n\}$ is obtained, where $x^t = (x^t_1, \dots, x^t_5)$ and $n$ is the number of frames.

[0041] One of the following three methods is applied to the above sequence of output vectors to judge whether the input comes from a registrant or a non-registrant.
(1) If the maximum over $s = 1, \dots, 5$ of $\sum_t x^t_s$ exceeds a preset speaker-determination threshold, the input is from a registrant; otherwise it is from a non-registrant.
(2) If the maximum over $s = 1, \dots, 5$ of $\prod_t x^t_s$ exceeds a preset speaker-determination threshold, the input is from a registrant; otherwise it is from a non-registrant.
(3) If the maximum over $s = 1, \dots, 5$ of the number of frames $t$ for which $\max\{x^t_1, \dots, x^t_5\} = x^t_s$ exceeds a preset speaker-determination threshold, the input is from a registrant; otherwise it is from a non-registrant.
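The three integration rules (sum, product, majority vote) can be sketched as follows. This is an illustration under stated assumptions rather than the patent's implementation: the function names, the example output values, and the thresholds are hypothetical, and returning the index of the largest score as the speaker identity is a convenience of this sketch.

```python
import math


def sum_scores(outputs):
    """Method (1): per-speaker sum of the outputs over all frames."""
    return [sum(column) for column in zip(*outputs)]


def product_scores(outputs):
    """Method (2): per-speaker product of the outputs over all frames."""
    return [math.prod(column) for column in zip(*outputs)]


def vote_counts(outputs):
    """Method (3): number of frames in which each speaker's unit is largest."""
    counts = [0] * len(outputs[0])
    for x in outputs:
        counts[x.index(max(x))] += 1
    return counts


def decide(scores, threshold):
    """Return the index of the best-scoring speaker if its score exceeds
    the speaker-determination threshold, otherwise None (non-registrant)."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] > threshold else None


# Hypothetical outputs for three frames and three registered speakers.
frames = [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1], [0.2, 0.7, 0.1]]
by_sum = decide(sum_scores(frames), threshold=1.5)
by_product = decide(product_scores(frames), threshold=0.1)
by_vote = decide(vote_counts(frames), threshold=1)
```

Each rule needs its own threshold scale: sums grow with the number of frames, products shrink toward zero, and vote counts are bounded by the number of frames.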

[0042] The following methods may be used instead of the three methods above.
(1) If only the maximum over $s = 1, \dots, 5$ of $\sum_t x^t_s$ exceeds a preset first speaker-determination threshold and the other values are below a preset second speaker-determination threshold, the input is from a registrant; otherwise it is from a non-registrant.
(2) If only the maximum over $s = 1, \dots, 5$ of $\prod_t x^t_s$ exceeds a preset first speaker-determination threshold and the other values are below a preset second speaker-determination threshold, the input is from a registrant; otherwise it is from a non-registrant.
(3) If only the maximum over $s = 1, \dots, 5$ of the number of frames $t$ for which $\max\{x^t_1, \dots, x^t_5\} = x^t_s$ exceeds a preset first speaker-determination threshold and the other values are below a preset second speaker-determination threshold, the input is from a registrant; otherwise it is from a non-registrant.
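The two-threshold variant can be sketched as a check on already integrated per-speaker scores (sums, products, or vote counts). The function name, parameter names, and example values are hypothetical; the sketch reads "only the maximum exceeds the first threshold" as requiring every other score to stay at or below the first threshold as well as below the second.

```python
def decide_two_thresholds(scores, first_threshold, second_threshold):
    """Accept the best-scoring speaker only if its score exceeds the first
    threshold while every other score stays at or below the first threshold
    and below the second. Returns the speaker index, or None for a
    non-registrant."""
    best = max(range(len(scores)), key=scores.__getitem__)
    if scores[best] <= first_threshold:
        return None
    for i, s in enumerate(scores):
        if i == best:
            continue
        if s > first_threshold or s >= second_threshold:
            return None
    return best


# Hypothetical integrated scores for three registered speakers.
clear_winner = decide_two_thresholds([1.9, 1.1, 0.2], 1.5, 1.2)
close_runner_up = decide_two_thresholds([1.9, 1.3, 0.2], 1.5, 1.2)
weak_winner = decide_two_thresholds([1.0, 0.2, 0.1], 1.5, 1.2)
```

Compared with a single threshold, this rejects inputs for which a second registered speaker also scores highly, trading some acceptance of genuine registrants for fewer ambiguous decisions.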

[0043] If the above judgment finds that the input is from a registrant, who the speaker is is determined by which unit shows the largest output value.

[0044] As an example of arbitrary utterances, a speaker recognition experiment was carried out using the three words "ただいま" ("I'm home"), "こんにちは" ("Hello"), and "おはようございます" ("Good morning") against a network trained on the short sentence "明日は東京に出ますのですみませんが留守にします。" ("I will be going to Tokyo tomorrow, so I am sorry, but I will be away."). The 5 registrants used in training and 26 non-registrants not used in training were all recognized correctly.

[0045] (Second Embodiment) (see FIGS. 3 and 4) FIG. 3 shows a response message switching intercom device, which is used according to procedures (A) to (C) below.

[0046] (A) Learning a registrant's voice (when the speech is recorded using the telephone): the procedure is as shown in Table 3.

[0047]

[Table 3]

[0048] (B) Registering a message: pick up the handset; press the registration-mode button; the synthesized voice asks "Who is this for?"; specify the 〇× liquor store with the PB buttons (voice input may be used instead of the PB buttons); the synthesized voice says "Please leave your message"; say "One case of beer, please"; replace the handset; the device returns to normal mode.

[0049] (C) When a visitor arrives: the procedure is as shown in Table 4.

[0050]

[Table 4]

[0051] The speaker recognition method is described in detail below. For 5 registrants and 25 non-registrants, the training sentence is Fourier-analyzed at a sampling frequency of 10 kHz, a frame length of 25.6 msec, and a frame period of 12.8 msec, yielding a sequence of 68-channel (1/12 octave) power vectors over the 100 to 5000 Hz band.

[0052] Hierarchical clustering is applied to these power vectors to obtain about 200 representative vectors for each speaker.

[0053] These representative vectors are used as the input to the neural network (the input layer has 68 units, and the number of input patterns equals the number of speakers times the number of representative vectors after clustering), and the network is trained sufficiently so that the corresponding output unit is activated only for a registrant.

[0054] For an arbitrary utterance, a sequence of power vectors is obtained in the same manner as in the analysis of the training sentence above.

[0055] This sequence is input to the network trained as described above, and a sequence of output vectors $\{x^1, x^2, \dots, x^n\}$ is obtained, where $x^t = (x^t_1, \dots, x^t_5)$ and $n$ is the number of frames.

[0056] One of the following three methods is applied to the above sequence of output vectors to judge whether the input comes from a registrant or a non-registrant.
(1) If the maximum over $s = 1, \dots, 5$ of $\sum_t x^t_s$ exceeds a preset speaker-determination threshold, the input is from a registrant; otherwise it is from a non-registrant.
(2) If the maximum over $s = 1, \dots, 5$ of $\prod_t x^t_s$ exceeds a preset speaker-determination threshold, the input is from a registrant; otherwise it is from a non-registrant.
(3) If the maximum over $s = 1, \dots, 5$ of the number of frames $t$ for which $\max\{x^t_1, \dots, x^t_5\} = x^t_s$ exceeds a preset speaker-determination threshold, the input is from a registrant; otherwise it is from a non-registrant.

【００５７】The following methods may be used instead of the above three methods.
(1) If only the maximum of Σ_t x^t_s (s = 1 to 5) exceeds a preset first speaker determination threshold, and the other values fall below a preset second speaker determination threshold, the speaker is a registered person; otherwise, a non-registered person.
(2) If only the maximum of Π_t x^t_s (s = 1 to 5) exceeds a preset first speaker determination threshold, and the other values fall below a preset second speaker determination threshold, the speaker is a registered person; otherwise, a non-registered person.
(3) If the maximum, over s = 1 to 5, of the number of times t at which max{x^t_1, …, x^t_5} = x^t_s exceeds a preset first speaker determination threshold, and the other values fall below a preset second speaker determination threshold, the speaker is a registered person; otherwise, a non-registered person.
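
The two-threshold variant can be sketched in the same spirit. The function name `identify_speaker`, the parameter names `first_threshold` and `second_threshold`, and the choice to return `None` for a non-registered person are hypothetical; the text does not state how the two thresholds relate to each other, so the sketch does not assume any ordering between them.

```python
def identify_speaker(scores, first_threshold, second_threshold):
    """Two-threshold rule applied to integrated per-unit scores.

    scores: one integrated value per output unit (sum, product, or vote count).
    Returns the index of the top-scoring unit when that score exceeds
    first_threshold and every other score is below second_threshold;
    returns None (non-registered person) otherwise.
    """
    best = max(range(len(scores)), key=lambda s: scores[s])
    if scores[best] <= first_threshold:
        return None  # even the strongest unit is not confident enough
    for s, value in enumerate(scores):
        if s != best and value >= second_threshold:
            return None  # a competing unit is too strong, so reject
    return best


if __name__ == "__main__":
    # Hypothetical summed scores for five output units.
    print(identify_speaker([1.8, 1.0, 0.1, 0.05, 0.05], 1.5, 1.2))
```

Compared with the single-threshold rule, this check also rejects inputs for which two units respond strongly at once, which is one plausible reason for adding the second threshold.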

【００５８】If, as a result of the above, the input is judged to come from a registered person, the identity of the speaker is determined by which unit shows the maximum output value.

【００５９】As an example of arbitrary utterances, a speaker recognition experiment was carried out using the three words 「ただいま」 ("I'm home"), 「こんにちは」 ("Hello"), and 「おはようございます」 ("Good morning") against the training sentence 「明日は東京に出ますのですみませんが留守にします。」 ("I am going to Tokyo tomorrow, so I am sorry, but I will be away."). The 5 registered persons used for training and the 26 non-registered persons not used for training were all recognized correctly.

【００６０】Furthermore, the present invention can be applied to various devices that need to switch their response according to the user when a single device is used by multiple people.

【００６１】[0061]

[Effect of the Invention] As described above, according to the present invention, a communication device such as a telephone device or an intercom device can be obtained in which speaker recognition does not require the target speech to be limited to utterance content learned in advance.

[Brief description of drawings]

FIG. 1 is a schematic diagram showing an example of a response message switching telephone device.

FIG. 2 is a schematic diagram showing the output values for a speaker.

FIG. 3 is a schematic diagram showing an example of a response message switching intercom device.

FIG. 4 is a schematic diagram showing the output values for a speaker.

FIG. 5 is a schematic diagram showing a neural network.

FIG. 6 is a schematic diagram showing a hierarchical neural network.

FIG. 7 is a schematic diagram showing the structure of a unit.

Continuation of the front page: (51) Int. Cl.⁵ H04M 9/00; identification code D; office reference number 8523-5K

Claims

[Claims]

1. In a response message switching telephone device capable of recognizing who a calling party is and responding to a person registered in advance with a special message addressed to that person, a response message switching communication device comprising: an automatic response unit that automatically answers an incoming call; a speaker recognition unit that recognizes the speaker from the voice of the caller replying to the automatic response unit; a response control unit that controls the basic communication circuit based on the recognition result of the speaker recognition unit; and a message storage unit that stores messages, wherein, when the speaker recognition unit recognizes a speaker using a neural network, a sequence of vectors representing the outline of a short-time spectrum is input, the sequence of network outputs is integrated by the sum, product, majority vote, or the like of the recognition results from the individual outputs to obtain one recognition result, and the number of learning data of the network is reduced by clustering.

2. In a response message switching telephone device capable of recognizing who a calling party is and responding to a person registered in advance with a special message addressed to that person, a response message switching communication device comprising: an automatic response unit that automatically answers an incoming call; a speaker recognition unit that recognizes the speaker from the voice of the caller replying to the automatic response unit; a response control unit that controls the basic communication circuit based on the recognition result of the speaker recognition unit; and a message storage unit that stores messages, wherein, when the speaker recognition unit recognizes a speaker using a neural network, a sequence of vectors representing the outline of a short-time spectrum is input, output vectors are selected from the sequence of network outputs using an output vector selection threshold, the selected output vectors are integrated by the sum, product, majority vote, or the like of the recognition results from the individual outputs to obtain one recognition result, and the number of learning data of the network is reduced by clustering.

3. In a response message switching intercom device capable of recognizing who an intercom user is and responding to a person registered in advance with a special message addressed to that person, a response message switching intercom device comprising: an intercom device (slave unit) installed at a front door or the like; an automatic response unit that automatically responds to the intercom user; a speaker recognition unit that recognizes the speaker from the voice of the intercom user replying to the automatic response unit; a response control unit that controls the intercom basic circuit according to the recognition result of the speaker recognition unit; and a message storage unit that stores messages, wherein, when the speaker recognition unit recognizes a speaker using a neural network, a sequence of vectors representing the outline of a short-time spectrum is input, the sequence of network outputs is integrated by the sum, product, majority vote, or the like of the recognition results from the individual outputs to obtain one recognition result, and the number of learning data of the network is reduced by clustering.

4. In a response message switching intercom device capable of recognizing who an intercom user is and responding to a person registered in advance with a special message addressed to that person, a response message switching intercom device comprising: an intercom device (slave unit) installed at a front door or the like; an automatic response unit that automatically responds to the intercom user; a speaker recognition unit that recognizes the speaker from the voice of the intercom user replying to the automatic response unit; a response control unit that controls the intercom basic circuit according to the recognition result of the speaker recognition unit; and a message storage unit that stores messages, wherein, when the speaker recognition unit recognizes a speaker using a neural network, a sequence of vectors representing the outline of a short-time spectrum is input, output vectors are selected from the sequence of network outputs using an output vector selection threshold, the selected output vectors are integrated by the sum, product, majority vote, or the like of the recognition results from the individual outputs to obtain one recognition result, and the number of learning data of the network is reduced by clustering.

5. A response message switching intercom device characterized in that the response message switching intercom device according to claim 3 or 4 is combined with a telephone so that, when the voice of a registrant is recorded and learned, the voice can be recorded using the telephone.