JPH06152725A

JPH06152725A - Answer message switching speech equipment

Info

Publication number: JPH06152725A
Application number: JP4291373A
Authority: JP
Inventors: Shingo Nishimura; 新吾西村; Masayuki Unno; 雅幸海野; Toshihiro Koremoto; 敏宏是本
Original assignee: Sekisui Chemical Co Ltd
Current assignee: Sekisui Chemical Co Ltd
Priority date: 1992-10-29
Filing date: 1992-10-29
Publication date: 1994-05-31

Abstract

PURPOSE: To provide a speech communication device, such as a telephone or an intercom, in which the speech used for speaker recognition need not be limited to utterance content learned in advance. CONSTITUTION: In the speaker recognition part of the answer message switching speech equipment, when a speaker is recognized using a neural network, a sequence of vectors representing the rough shape of the short-time spectrum is input, the sequence of network outputs is integrated into a single recognition result by the sum, the product, a majority vote, etc., of the recognition results given by the individual outputs, and phonetically balanced sentences are used as the network's training data.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a response message switching telephone device capable of returning a special message addressed to a particular person, depending on who has placed the call. The present invention also relates to a response message switching intercom device capable of returning a special message addressed to a particular person, depending on who is using an intercom installed at a front door or similar location.

[0002]

[Prior Art] Devices proposed to date include Japanese Examined Patent Publication No. Hei 3-59619 "Answering machine with artificial intelligence function", Japanese Unexamined Patent Publication No. Sho 61-39756 "Answering machine system", Japanese Unexamined Patent Publication No. Sho 64-86742 "Message storage and playback device", and Japanese Unexamined Patent Publication No. Hei 2-294146 "Recording and playback device". These publications indicate that speech recognition (or speaker recognition) is used, but they either give no description of a concrete recognition method or describe it only to the extent outlined below. In addition, most of them presuppose that the caller utters specific words (a keyword, the caller's name, etc.).

[0003] Features contained in the input speech (spectrum, pitch, cepstrum, etc.) are extracted; the distance between the extracted data and dictionary data extracted beforehand by the same method is computed (by DP matching or the like); the speaker of the input speech is determined from the computed distance; and subsequent processing is controlled according to that determination.
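As a rough illustration of the prior-art flow described above, the following Python sketch computes a DP-matching (dynamic time warping) distance between an input feature sequence and stored dictionary sequences and picks the nearest entry. The function names, the Euclidean frame distance, and the toy dictionary are assumptions made for illustration; the cited publications do not specify them.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors (an assumed local cost)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dp_matching_distance(seq_a, seq_b):
    """Accumulated cost of the best monotonic alignment of two sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def nearest_speaker(input_seq, dictionary):
    """Return the dictionary key whose stored sequence is closest to the input."""
    return min(dictionary, key=lambda name: dp_matching_distance(input_seq, dictionary[name]))
```

Because the dictionary holds one template per expected utterance, this style of matching works only when the caller says the words that were stored, which is the limitation the invention sets out to remove.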

[0004] Japanese Unexamined Patent Publication No. Sho 61-39756, "Answering machine system", has also been proposed. This prior art indicates that speech recognition is used, but it describes the concrete recognition method only to the extent outlined below. It also presupposes that the intercom user utters a keyword.

[0005] Features contained in the input speech (pitch, combinations of vowels and consonants, etc.) are extracted; the distance between the extracted data and dictionary data extracted beforehand by the same method is computed; the speaker of the input speech is determined from the computed distance; and subsequent processing is controlled according to that determination.

[0006]

[Problems to Be Solved by the Invention] The prior art described above says little about the concrete method of speech recognition (or speaker recognition), and the methods that are described appear to leave room for improvement in recognition rate. Moreover, a device that presupposes the caller will utter specific words is inconvenient to use and is likely to malfunction when other words are input. Likewise, a device that presupposes the intercom user will utter a keyword is inconvenient to use and is likely to malfunction when other words are input.

[0007] An object of the present invention is to provide a communication device, such as a telephone device or an intercom device, in which the speech used for speaker recognition need not be limited to utterance content learned in advance.

[0008]

[Means for Solving the Problems] The invention according to claim 1 is a response message switching telephone device that recognizes who the caller is and can respond to a pre-registered person with a special message addressed to that person. The device comprises an automatic response unit that automatically answers an incoming call, a speaker recognition unit that recognizes the speaker from the voice of the caller replying to the automatic response unit, a response control unit that controls the basic call circuit according to the recognition result of the speaker recognition unit, and a message storage unit that stores messages. When the speaker recognition unit recognizes the speaker using a neural network, a sequence of vectors representing the rough shape of the short-time spectrum is input, and the sequence of network outputs is integrated into a single recognition result by the sum, the product, a majority vote, or the like of the recognition results given by the individual outputs. Phonetically balanced sentences are used as the training data for the network.

[0009] The invention according to claim 2 is a response message switching telephone device that recognizes who the caller is and can respond to a pre-registered person with a special message addressed to that person. The device comprises an automatic response unit that automatically answers an incoming call, a speaker recognition unit that recognizes the speaker from the voice of the caller replying to the automatic response unit, a response control unit that controls the basic call circuit according to the recognition result of the speaker recognition unit, and a message storage unit that stores messages. When the speaker recognition unit recognizes the speaker using a neural network, a sequence of vectors representing the rough shape of the short-time spectrum is input, output vectors are selected from the sequence of network outputs by means of an output-vector selection threshold, and the selected output vectors are integrated into a single recognition result by the sum, the product, a majority vote, or the like of the recognition results given by the individual outputs. Phonetically balanced sentences are used as the training data for the network.

[0010] The invention according to claim 3 is a response message switching intercom device that recognizes who the intercom user is and can respond to a pre-registered person with a special message addressed to that person. The device comprises an intercom (slave unit) installed at a front door or similar location, an automatic response unit that automatically responds to the intercom user, a speaker recognition unit that recognizes the speaker from the voice of the intercom user replying to the automatic response unit, a response control unit that controls the basic intercom circuit according to the recognition result of the speaker recognition unit, and a message storage unit that stores messages. When the speaker recognition unit recognizes the speaker using a neural network, a sequence of vectors representing the rough shape of the short-time spectrum is input, and the sequence of network outputs is integrated into a single recognition result by the sum, the product, a majority vote, or the like of the recognition results given by the individual outputs. Phonetically balanced sentences are used as the training data for the network.

[0011] The invention according to claim 4 is a response message switching intercom device that recognizes who the intercom user is and can respond to a pre-registered person with a special message addressed to that person. The device comprises an intercom (slave unit) installed at a front door or similar location, an automatic response unit that automatically responds to the intercom user, a speaker recognition unit that recognizes the speaker from the voice of the intercom user replying to the automatic response unit, a response control unit that controls the basic intercom circuit according to the recognition result of the speaker recognition unit, and a message storage unit that stores messages. When the speaker recognition unit recognizes the speaker using a neural network, a sequence of vectors representing the rough shape of the short-time spectrum is input, output vectors are selected from the sequence of network outputs by means of an output-vector selection threshold, and the selected output vectors are integrated into a single recognition result by the sum, the product, a majority vote, or the like of the recognition results given by the individual outputs. Phonetically balanced sentences are used as the training data for the network.

[0012] The invention according to claim 5 combines the response message switching intercom device according to claim 3 or 4 with a telephone, so that the telephone can be used for recording when a registrant's voice is recorded and learned.

[0013]

[Operation] The operation of the speaker recognition unit of the invention according to claim 1 is as follows. First, the neural network is trained using the rough shape of the short-time spectrum obtained from the training speech. Phonetically balanced speech is used for training. At recognition time, the same short-time spectral shape is computed from an arbitrary utterance, the resulting sequence is input to the network, and a sequence of network outputs is obtained. Each output vector indicates the likely speaker for one short-time input; a single recognition result is obtained by making an overall judgment across the whole sequence using the sum, the product, a majority vote, or the like.
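The three integration rules named above can be written compactly. In the notation used later in the embodiments, $x^t_s$ is the output of the unit assigned to registrant $s$ at frame $t$, and $n$ is the number of frames. Taking the arg max as the final decision follows the embodiment's statement that the speaker is given by the unit with the largest value; the exact form below is a sketch, not a formula quoted from the specification.

```latex
\begin{align}
  \text{sum:}      \quad & \hat{s} = \arg\max_{s} \sum_{t=1}^{n} x^{t}_{s} \\
  \text{product:}  \quad & \hat{s} = \arg\max_{s} \prod_{t=1}^{n} x^{t}_{s} \\
  \text{majority:} \quad & \hat{s} = \arg\max_{s}
      \bigl|\{\, t : x^{t}_{s} = \max_{s'} x^{t}_{s'} \,\}\bigr|
\end{align}
```

The sum tolerates a few poor frames, the product penalizes any frame in which a speaker's output is near zero, and the majority vote ignores output magnitudes altogether.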

[0014] In the speaker recognition method according to claim 1, the use of a speaker recognition technique that does not restrict utterance content eliminates the need for the caller to utter specific words, which makes the response message switching telephone device easier to use. The device can also handle whatever words are input.

[0015] Using phonetically balanced speech allows a variety of phonemes to be learned from a short utterance.

[0016] The operation of the speaker recognition unit of the invention according to claim 2 is as follows. First, the neural network is trained using the rough shape of the short-time spectrum obtained from the training speech. Phonetically balanced speech is used for training. At recognition time, the same short-time spectral shape is computed from an arbitrary utterance, the resulting sequence is input to the network, and a sequence of network outputs is obtained. Each output vector indicates the likely speaker for one short-time input; here, however, an output-vector selection threshold is provided, only the highly reliable output vectors are selected, and a single recognition result is obtained by making an overall judgment over all of the selected vectors using the sum, the product, a majority vote, or the like. One concrete selection method is, for example, to select an output vector if it contains at least one value exceeding the preset output-vector selection threshold.
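The selection step can be sketched in a few lines of Python. The sketch keeps a frame when at least one of its values exceeds the selection threshold, as in the example given above, and then integrates the surviving frames by summation. The function names, the choice of the sum rule, and the handling of the case where no frame survives are assumptions made for illustration.

```python
def select_output_vectors(outputs, selection_threshold):
    """Keep only frames that contain at least one value above the threshold."""
    return [frame for frame in outputs if any(v > selection_threshold for v in frame)]


def recognize(outputs, selection_threshold):
    """Index of the best-scoring speaker over the selected frames, or None."""
    selected = select_output_vectors(outputs, selection_threshold)
    if not selected:
        return None  # assumed behaviour: no reliable frame, no decision
    totals = [sum(column) for column in zip(*selected)]
    return totals.index(max(totals))
```

With the frames `[[0.9, 0.1], [0.3, 0.6], [0.3, 0.6], [0.3, 0.6]]`, summing every frame favors the second speaker, whereas a selection threshold of 0.8 keeps only the first, confident frame and favors the first speaker. This shows how discarding ambiguous frames can change the overall judgment.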

[0017] In the speaker recognition method according to claim 2, selecting the highly reliable output vectors makes the overall judgment more dependable and improves the recognition rate.

[0018] The operation of the speaker recognition unit of the invention according to claim 3 is as follows. First, the neural network is trained using the rough shape of the short-time spectrum obtained from the training speech. Phonetically balanced speech is used for training. At recognition time, the same short-time spectral shape is computed from an arbitrary utterance, the resulting sequence is input to the network, and a sequence of network outputs is obtained. Each output vector indicates the likely speaker for one short-time input; a single recognition result is obtained by making an overall judgment across the whole sequence using the sum, the product, a majority vote, or the like.

[0019] In the speaker recognition method according to claim 3, the use of a speaker recognition technique that does not restrict utterance content eliminates the need for the intercom user to utter a keyword, which makes the response message switching intercom device easier to use. The device can also handle whatever words are input.

[0020] Using phonetically balanced speech allows a variety of phonemes to be learned from a short utterance.

[0021] The operation of the speaker recognition unit of the invention according to claim 4 is as follows. First, the neural network is trained using the rough shape of the short-time spectrum obtained from the training speech. Phonetically balanced speech is used for training. At recognition time, the same short-time spectral shape is computed from an arbitrary utterance, the resulting sequence is input to the network, and a sequence of network outputs is obtained. Each output vector indicates the likely speaker for one short-time input; here, however, an output-vector selection threshold is provided, only the highly reliable output vectors are selected, and a single recognition result is obtained by making an overall judgment over all of the selected vectors using the sum, the product, a majority vote, or the like. One concrete selection method is, for example, to select an output vector if it contains at least one value exceeding the preset output-vector selection threshold.

[0022] In the speaker recognition method according to claim 4, selecting the highly reliable output vectors makes the overall judgment more dependable and improves the recognition rate.

[0023] The "neural network" used in the present invention is described in (1) to (4) below.

[0024] (1) Neural networks can be broadly divided by structure into two types: the hierarchical network shown in FIG. 5(A) and the interconnected network shown in FIG. 5(B). The present invention may be built with either type, but the hierarchical network is the more useful because a simple learning algorithm, described later, has been established for it.

[0025] (2) Network structure: As shown in FIG. 6, the hierarchical network has a layered structure consisting of an input layer, an intermediate layer, and an output layer. Each layer consists of one or more units. Connections run only forward, from the input layer to the intermediate layer to the output layer, and there are no connections within a layer.

[0026] (3) Unit structure: As shown in FIG. 7, a unit is a model of a neuron in the brain, and its structure is simple. It receives inputs from other units, sums them, transforms the sum according to a fixed rule (the transfer function), and outputs the result. Each connection to another unit carries a variable weight representing the strength of that connection.

[0027] (4) Learning (backpropagation): Training the network means bringing its actual output closer to the target value (the desired output); in general, training is carried out by changing the transfer function and weights of each unit shown in FIG. 7.
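The layered structure of item (2) and the unit computation of item (3) can be sketched as a forward pass. The logistic sigmoid used as the transfer function and the per-unit bias term are assumptions; the specification says only that the weighted inputs are summed and passed through a fixed transfer function.

```python
import math


def sigmoid(u):
    """Logistic transfer function (an assumed choice)."""
    return 1.0 / (1.0 + math.exp(-u))


def unit_output(inputs, weights, bias=0.0, transfer=sigmoid):
    """One unit: weighted sum of its inputs, then the transfer function."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return transfer(total)


def layer_output(inputs, weight_rows, biases):
    """One layer: every unit sees the same inputs; no connections within the layer."""
    return [unit_output(inputs, w, b) for w, b in zip(weight_rows, biases)]


def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> intermediate layer -> output layer, forward connections only."""
    hidden = layer_output(inputs, hidden_w, hidden_b)
    return layer_output(hidden, output_w, output_b)
```

In the embodiments the input layer would have 68 units and the output layer one unit per registrant; the size of the intermediate layer is not stated in the text.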

[0028] As the learning algorithm, for example, the backpropagation described in Rumelhart, D.E., McClelland, J.L. and the PDP Research Group, PARALLEL DISTRIBUTED PROCESSING, the MIT Press, 1986, can be used.
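A minimal sketch of backpropagation for a network with one intermediate layer follows. It assumes sigmoid units, a squared-error loss, per-pattern weight updates, and a learning rate of 0.5; none of these settings is given in the specification, and all function names are illustrative.

```python
import math
import random


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def init_network(n_in, n_hidden, n_out, seed=0):
    """Small random weights; each row is one unit's weights plus a trailing bias."""
    rng = random.Random(seed)

    def rows(n_rows, n_cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(n_cols + 1)] for _ in range(n_rows)]

    return {"hidden": rows(n_hidden, n_in), "output": rows(n_out, n_hidden)}


def layer(inputs, weight_rows):
    extended = inputs + [1.0]  # constant input that feeds the bias weight
    return [sigmoid(sum(w * x for w, x in zip(row, extended))) for row in weight_rows]


def forward(net, inputs):
    hidden = layer(inputs, net["hidden"])
    return hidden, layer(hidden, net["output"])


def train_step(net, inputs, target, rate=0.5):
    """One gradient-descent update on a single input pattern."""
    hidden, output = forward(net, inputs)
    # Output error terms: (target - output) times the sigmoid derivative.
    delta_out = [(t - o) * o * (1.0 - o) for t, o in zip(target, output)]
    # Hidden error terms: propagate the output errors back through the
    # output weights, using their values from before this update.
    delta_hid = [
        h * (1.0 - h) * sum(delta_out[k] * net["output"][k][j] for k in range(len(delta_out)))
        for j, h in enumerate(hidden)
    ]
    for k, row in enumerate(net["output"]):
        for j, value in enumerate(hidden + [1.0]):
            row[j] += rate * delta_out[k] * value
    for j, row in enumerate(net["hidden"]):
        for i, value in enumerate(inputs + [1.0]):
            row[i] += rate * delta_hid[j] * value


def total_error(net, samples):
    """Sum of squared errors over all (inputs, target) pairs."""
    return sum(
        0.5 * sum((t - o) ** 2 for t, o in zip(target, forward(net, inputs)[1]))
        for inputs, target in samples
    )
```

Repeating `train_step` over every training frame until `total_error` stops falling corresponds to training the network "sufficiently" in the sense used in the embodiments.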

[0029]

[Embodiments] FIG. 1 is a schematic diagram showing an example of a response message switching telephone device, FIG. 2 is a schematic diagram showing speaker output values, FIG. 3 is a schematic diagram showing an example of a response message switching intercom device, FIG. 4 is a schematic diagram showing speaker output values, FIG. 5 is a schematic diagram showing neural networks, FIG. 6 is a schematic diagram showing a hierarchical neural network, and FIG. 7 is a schematic diagram showing the structure of a unit.

[0030] (First embodiment) (see FIGS. 1 and 2) FIG. 1 shows a response message switching telephone device, which is used according to procedures (A) to (C) below.

[0031] (A) Learning a registrant's voice: the procedure is as shown in Table 1.

[0032]

[Table 1]

[0033] (B) Registering a message:
Pick up the handset.
Press the registration-mode button.
Synthesized voice: "Who is the message for?"
Specify Mr. Yamada with the PB (push-button) keys (voice input may be used instead of the PB keys).
Synthesized voice: "Please leave your message."
Speak: "Yamada-san, see you at the usual place at eight."
Hang up the handset.
The device returns to normal mode.

[0034] (C) When a call is received: the procedure is as shown in Table 2.

[0035]

[Table 2]

[0036] The speaker recognition method is described in detail below. For 5 registrants and 25 non-registrants, a short, phonetically balanced training sentence is Fourier-analyzed at a sampling frequency of 10 kHz, a frame length of 25.6 msec, and a frame period of 12.8 msec, yielding a sequence of 68-channel (1/12-octave) power vectors over the 100 to 5000 Hz band.
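The analysis settings above translate into 256-sample frames advanced by 128 samples, and 68 bands whose edges rise by a factor of 2^(1/12) from 100 Hz. The following sketch computes such power vectors with a plain discrete Fourier transform. How spectral bins are pooled into channels, the absence of a window function, and the fallback used for narrow low-frequency bands that contain no bin are all assumptions; the specification gives only the figures quoted above.

```python
import cmath
import math

SAMPLE_RATE = 10_000   # Hz
FRAME_LEN = 256        # 25.6 msec at 10 kHz
FRAME_STEP = 128       # 12.8 msec at 10 kHz
N_CHANNELS = 68        # 1/12-octave bands starting at 100 Hz
F_LOW = 100.0


def band_edges():
    """69 edges; the top edge (about 5080 Hz) lies just above the 5 kHz Nyquist limit."""
    return [F_LOW * 2.0 ** (k / 12.0) for k in range(N_CHANNELS + 1)]


def power_spectrum(frame):
    """Squared magnitude of the DFT for bins 0 .. N/2 (direct, unoptimized)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def power_vector(frame):
    """Pool DFT bin powers into the 68 channels."""
    spectrum = power_spectrum(frame)
    bin_hz = SAMPLE_RATE / len(frame)
    edges = band_edges()
    vector = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        bins = [k for k in range(len(spectrum)) if lo <= k * bin_hz < hi]
        if not bins:
            # Assumed fallback: use the bin nearest the band's geometric centre.
            centre = math.sqrt(lo * hi)
            bins = [min(len(spectrum) - 1, round(centre / bin_hz))]
        vector.append(sum(spectrum[k] for k in bins))
    return vector


def power_vector_sequence(samples):
    """One 68-channel power vector per frame of the utterance."""
    starts = range(0, len(samples) - FRAME_LEN + 1, FRAME_STEP)
    return [power_vector(samples[s:s + FRAME_LEN]) for s in starts]
```

A practical implementation would use an FFT routine and probably a window function, but the frame and channel arithmetic would stay the same.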

[0037] These vectors are used as the input to the neural network (the input layer has 68 units, and each utterance yields as many input patterns as it has frames), and the network is trained sufficiently so that the corresponding output unit is activated only for a registrant.

[0038] For an arbitrary utterance, a sequence of power vectors is obtained in the same manner as in paragraph [0036].

[0039] This sequence is input to the network trained as described in paragraph [0037], giving the sequence of output vectors $\{x^1, x^2, \ldots, x^n\}$, where $x^t = (x^t_1, \ldots, x^t_5)$ and $n$ is the number of frames.

[0040] The following three methods are applied to the output vector sequence of paragraph [0039] to decide whether the input comes from a registrant or a non-registrant.
(1) If the maximum over $s$ of $\sum_t x^t_s$ ($s = 1, \ldots, 5$) exceeds a preset speaker decision threshold, the input is from a registrant; otherwise it is from a non-registrant.
(2) If the maximum over $s$ of $\prod_t x^t_s$ ($s = 1, \ldots, 5$) exceeds a preset speaker decision threshold, the input is from a registrant; otherwise it is from a non-registrant.
(3) If the maximum over $s$ of the number of frames $t$ at which $\max\{x^t_1, \ldots, x^t_5\} = x^t_s$ ($s = 1, \ldots, 5$) exceeds a preset speaker decision threshold, the input is from a registrant; otherwise it is from a non-registrant.
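The three decision rules can be sketched as follows. Each function reduces the frame-by-frame outputs to one score per registrant, and a single comparison then separates registrants from non-registrants. The function names are illustrative, and breaking ties in the vote by taking the lowest unit index is an assumption.

```python
import math


def sum_scores(outputs):
    """Rule (1): per-speaker sum of the outputs over all frames."""
    return [sum(column) for column in zip(*outputs)]


def product_scores(outputs):
    """Rule (2): per-speaker product of the outputs over all frames."""
    return [math.prod(column) for column in zip(*outputs)]


def vote_scores(outputs):
    """Rule (3): per-speaker count of frames in which that unit is the largest."""
    votes = [0] * len(outputs[0])
    for frame in outputs:
        votes[frame.index(max(frame))] += 1
    return votes


def is_registrant(scores, threshold):
    """Registrant if the best score exceeds the speaker decision threshold."""
    return max(scores) > threshold
```

Each rule needs its own threshold scale: sums and vote counts grow with the number of frames, while a product of values below 1 shrinks toward zero as the utterance gets longer.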

[0041] The following methods may be used instead of the three methods above.
(1) If only the maximum of $\sum_t x^t_s$ ($s = 1, \ldots, 5$) exceeds a preset first speaker decision threshold and the other values are below a preset second speaker decision threshold, the input is from a registrant; otherwise it is from a non-registrant.
(2) If only the maximum of $\prod_t x^t_s$ ($s = 1, \ldots, 5$) exceeds a preset first speaker decision threshold and the other values are below a preset second speaker decision threshold, the input is from a registrant; otherwise it is from a non-registrant.
(3) If the maximum over $s$ of the number of frames $t$ at which $\max\{x^t_1, \ldots, x^t_5\} = x^t_s$ ($s = 1, \ldots, 5$) exceeds a preset first speaker decision threshold and the other values are below a preset second speaker decision threshold, the input is from a registrant; otherwise it is from a non-registrant.
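The two-threshold variant adds a requirement that the runner-up scores be low, so that an input exciting several output units at once is rejected. A sketch, taking as input the per-speaker scores already integrated by any of the three rules (the function name and the example values are illustrative):

```python
def is_registrant_two_thresholds(scores, first_threshold, second_threshold):
    """Registrant only if exactly one score clears the first threshold
    and every other score stays below the second threshold."""
    best = scores.index(max(scores))
    others = [value for i, value in enumerate(scores) if i != best]
    return scores[best] > first_threshold and all(
        value <= first_threshold and value < second_threshold for value in others
    )
```

For example, integrated scores of `[2.4, 0.4, 0.4]` pass with thresholds 2.0 and 1.0, while `[2.4, 1.5, 0.4]` is rejected because the second score is not clearly low.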

[0042] When the decision above finds that the input comes from a registrant, the speaker's identity is determined by which unit shows the largest output value.
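Put together with the message storage unit, the identification step amounts to an arg max followed by a lookup. In the sketch below, the registrant names other than Yamada, the default greeting, and the function names are hypothetical; the Yamada message is the one registered in procedure (B) above.

```python
def identify_speaker(scores, registrants):
    """Registrant whose output unit has the largest integrated score."""
    return registrants[scores.index(max(scores))]


def choose_response(scores, accepted, registrants, messages, default_message):
    """Personal message for an accepted registrant, otherwise the default."""
    if not accepted:
        return default_message
    return messages.get(identify_speaker(scores, registrants), default_message)


# Hypothetical registrant list (one name per output unit) and stored messages.
REGISTRANTS = ["Yamada", "Sato", "Suzuki", "Tanaka", "Ito"]
MESSAGES = {"Yamada": "Yamada-san, see you at the usual place at eight."}
DEFAULT_MESSAGE = "We cannot take your call right now."  # hypothetical greeting
```

Here `accepted` is the outcome of the registrant/non-registrant decision, so a non-registrant always hears the default greeting regardless of which unit happens to score highest.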

[0043] As an example of arbitrary utterances, a speaker recognition experiment was carried out in which the network was trained on the short sentence "彼は以前から、科学技術の進歩と人間の勇気が、はるかな宇宙への旅を可能にしたのだと考えていました。" ("He had long believed that advances in science and technology, together with human courage, had made journeys into distant space possible.") and tested on the three words "ただいま" ("I'm home"), "こんにちは" ("Hello"), and "おはようございます" ("Good morning"). The 5 registrants used for training and 26 non-registrants not used for training were all recognized correctly.

[0044] (Second embodiment) (see FIGS. 3 and 4) FIG. 3 shows a response message switching intercom device, which is used according to procedures (A) to (C) below.

[0045] (A) Learning a registrant's voice (when the voice is recorded using the telephone): the procedure is as shown in Table 3.

[0046]

[Table 3]

[0047] (B) Registering a message:
Pick up the handset.
Press the registration-mode button.
Synthesized voice: "Who is the message for?"
Specify the XX liquor store with the PB (push-button) keys (voice input may be used instead of the PB keys).
Synthesized voice: "Please leave your message."
Speak: "One case of beer, please."
Hang up the handset.
The device returns to normal mode.

[0048] (C) When a visitor arrives: the procedure is as shown in Table 4.

[0049]

[Table 4]

[0050] The speaker recognition method is described in detail below. For 5 registrants and 25 non-registrants, a short, phonetically balanced training sentence is Fourier-analyzed at a sampling frequency of 10 kHz, a frame length of 25.6 msec, and a frame period of 12.8 msec, yielding a sequence of 68-channel (1/12-octave) power vectors over the 100 to 5000 Hz band.

[0051] These vectors are used as the input to the neural network (the input layer has 68 units, and each utterance yields as many input patterns as it has frames), and the network is trained sufficiently so that the corresponding output unit is activated only for a registrant.

[0052] For an arbitrary utterance, a sequence of power vectors is obtained in the same manner as in paragraph [0050].

[0053] This sequence is input to the network trained as described in paragraph [0051], giving the sequence of output vectors $\{x^1, x^2, \ldots, x^n\}$, where $x^t = (x^t_1, \ldots, x^t_5)$ and $n$ is the number of frames.

[0054] The following three methods are applied to the output vector sequence of paragraph [0053] to decide whether the input comes from a registrant or a non-registrant.
(1) If the maximum over $s$ of $\sum_t x^t_s$ ($s = 1, \ldots, 5$) exceeds a preset speaker decision threshold, the input is from a registrant; otherwise it is from a non-registrant.
(2) If the maximum over $s$ of $\prod_t x^t_s$ ($s = 1, \ldots, 5$) exceeds a preset speaker decision threshold, the input is from a registrant; otherwise it is from a non-registrant.
(3) If the maximum over $s$ of the number of frames $t$ at which $\max\{x^t_1, \ldots, x^t_5\} = x^t_s$ ($s = 1, \ldots, 5$) exceeds a preset speaker decision threshold, the input is from a registrant; otherwise it is from a non-registrant.

[0055] The following methods may be used instead of the three methods above.
(1) If only the maximum of $\sum_t x^t_s$ ($s = 1, \ldots, 5$) exceeds a preset first speaker decision threshold and the other values are below a preset second speaker decision threshold, the input is from a registrant; otherwise it is from a non-registrant.
(2) If only the maximum of $\prod_t x^t_s$ ($s = 1, \ldots, 5$) exceeds a preset first speaker decision threshold and the other values are below a preset second speaker decision threshold, the input is from a registrant; otherwise it is from a non-registrant.
(3) If the maximum over $s$ of the number of frames $t$ at which $\max\{x^t_1, \ldots, x^t_5\} = x^t_s$ ($s = 1, \ldots, 5$) exceeds a preset first speaker decision threshold and the other values are below a preset second speaker decision threshold, the input is from a registrant; otherwise it is from a non-registrant.

[0056] When the decision above finds that the input comes from a registrant, the speaker's identity is determined by which unit shows the largest output value.

[0057] As an example of arbitrary utterances, a speaker recognition experiment was carried out in which the network was trained on the short sentence "彼は以前から、科学技術の進歩と人間の勇気が、はるかな宇宙への旅を可能にしたのだと考えていました。" ("He had long believed that advances in science and technology, together with human courage, had made journeys into distant space possible.") and tested on the three words "ただいま" ("I'm home"), "こんにちは" ("Hello"), and "おはようございます" ("Good morning"). The 5 registrants used for training and 26 non-registrants not used for training were all recognized correctly.

【００５８】Furthermore, the present invention can be applied to various devices that, when a single device is used by a plurality of persons, need to switch their response according to the user.

【００５９】[0059]

[Effect of the Invention] As described above, according to the present invention, it is possible to obtain a communication device, such as a telephone device or an intercom device, in which the target speech need not be limited to utterance contents learned in advance when recognizing a speaker.

[Brief description of drawings]

【図１】FIG. 1 is a schematic diagram showing an example of a response message switching telephone device.

【図２】FIG. 2 is a schematic diagram showing output values for a speaker.

【図３】FIG. 3 is a schematic diagram showing an example of a response message switching intercom device.

【図４】FIG. 4 is a schematic diagram showing output values for a speaker.

【図５】FIG. 5 is a schematic diagram showing a neural network.

【図６】FIG. 6 is a schematic diagram showing a hierarchical neural network.

【図７】FIG. 7 is a schematic diagram showing the structure of a unit.

Continuation of the front page: (51) Int. Cl.⁵ H04M 9/00; identification symbol D; office reference number 8523-5K

Claims

[Claims]

1. A response message switching telephone device capable of recognizing who a calling party is and responding, to a person registered in advance, with a special message addressed to that person, the device comprising: an automatic response unit that automatically responds to an incoming call; a speaker recognition unit that recognizes the speaker from the voice of the caller replying to the automatic response unit; a response control unit that controls the basic communication circuit based on the recognition result of the speaker recognition unit; and a message storage unit that stores messages; wherein, when the speaker recognition unit recognizes a speaker using a neural network, a sequence of vectors representing the outline of a short-time spectrum is input, the sequence of network outputs is integrated by the sum, product, majority vote, or the like of the recognition results of the individual outputs to obtain a single recognition result, and phonologically balanced sentences are used as the learning data of the network; a response message switching communication device so characterized.

2. A response message switching telephone device capable of recognizing who a calling party is and responding, to a person registered in advance, with a special message addressed to that person, the device comprising: an automatic response unit that automatically responds to an incoming call; a speaker recognition unit that recognizes the speaker from the voice of the caller replying to the automatic response unit; a response control unit that controls the basic communication circuit based on the recognition result of the speaker recognition unit; and a message storage unit that stores messages; wherein, when the speaker recognition unit recognizes a speaker using a neural network, a sequence of vectors representing the outline of a short-time spectrum is input, output vectors are selected from the sequence of network outputs using an output vector selection threshold, a single recognition result is obtained for the selected output vectors by the sum, product, majority vote, or the like of the recognition results of the individual outputs, and phonologically balanced sentences are used as the learning data of the network; a response message switching communication device so characterized.

3. A response message switching intercom device capable of recognizing who an intercom user is and responding, to a person registered in advance, with a special message addressed to that person, the device comprising: an intercom device (slave unit) installed at a front door or the like; an automatic response unit that automatically responds to the intercom user; a speaker recognition unit that recognizes the speaker from the voice of the intercom user replying to the automatic response unit; a response control unit that controls the intercom basic circuit according to the recognition result of the speaker recognition unit; and a message storage unit that stores messages; wherein, when the speaker recognition unit recognizes a speaker using a neural network, a sequence of vectors representing the outline of a short-time spectrum is input, the sequence of network outputs is integrated by the sum, product, majority vote, or the like of the recognition results of the individual outputs to obtain a single recognition result, and phonologically balanced sentences are used as the learning data of the network; a response message switching intercom device so characterized.

4. A response message switching intercom device capable of recognizing who an intercom user is and responding, to a person registered in advance, with a special message addressed to that person, the device comprising: an intercom device (slave unit) installed at a front door or the like; an automatic response unit that automatically responds to the intercom user; a speaker recognition unit that recognizes the speaker from the voice of the intercom user replying to the automatic response unit; a response control unit that controls the intercom basic circuit according to the recognition result of the speaker recognition unit; and a message storage unit that stores messages; wherein, when the speaker recognition unit recognizes a speaker using a neural network, a sequence of vectors representing the outline of a short-time spectrum is input, output vectors are selected from the sequence of network outputs using an output vector selection threshold, a single recognition result is obtained for the selected output vectors by the sum, product, majority vote, or the like of the recognition results of the individual outputs, and phonologically balanced sentences are used as the learning data of the network; a response message switching intercom device so characterized.

5. The response message switching intercom device according to claim 3 or 4, characterized in that it is combined with a telephone so that, when recording and learning the voice of a registrant, the voice can be recorded using the telephone.