JP6377921B2

JP6377921B2 - Speaker recognition device, speaker recognition method, and speaker recognition program

Info

Publication number: JP6377921B2
Application number: JP2014050753A
Authority: JP
Inventors: 学川▲崎▼; 拓明夏見; 康貴田中
Original assignee: SOHGO SECURITY SERVICES CO.,LTD.
Current assignee: SOHGO SECURITY SERVICES CO.,LTD.
Priority date: 2014-03-13
Filing date: 2014-03-13
Publication date: 2018-08-22
Anticipated expiration: 2034-03-13
Also published as: JP2015175915A

Description

The present invention relates to a speaker recognition device, a speaker recognition method, and a speaker recognition program that recognize the speaker of input voice data on the basis of that input voice data.

Techniques for recognizing the speaker of voice data on the basis of that voice data are conventionally known. For example, Patent Document 1 discloses a speaker recognition system that generates and stores registered model data in advance from the voice data of a person to be verified, and then matches voice analysis data, obtained by analyzing input voice data, against the registered model data to determine whether the speaker of the input voice data is the person to be verified.

In such speaker recognition, the registered model data strongly affects recognition accuracy. If the registered model data is built from a single piece of voice data, recognition accuracy drops sharply when that voice data is inappropriate. For this reason, either the registered model data is built from a plurality of pieces of voice data, or a plurality of pieces of voice data are themselves used as the registered model data.

Japanese Patent Laying-Open No. 2005-091758

However, the conventional technique has the problem that the registered model data of the person to be verified is not always in an appropriate state, so recognition accuracy may fall. Specifically, when inappropriate voice data is mixed into the plurality of pieces of voice data obtained at registration, the inappropriate voice data affects the registered model data and lowers recognition accuracy. In addition, even if the data was in an appropriate state at registration, recognition accuracy falls when the speaker's voice changes over time or for other reasons.

For these reasons, an important challenge is to improve the accuracy of speaker recognition by holding the voice data of the person to be verified in an appropriate state and comparing it with the input voice data.

The present invention has been made to solve the above problems of the prior art, and its object is to provide a speaker recognition device, a speaker recognition method, and a speaker recognition program that hold the voice data of the person to be verified in an appropriate state and thereby improve the accuracy of speaker recognition.

To solve the above problems and achieve the object, the invention according to claim 1 is a speaker recognition device that recognizes the speaker of input voice data on the basis of the input voice data, the device comprising: registered voice data receiving means for receiving a plurality of pieces of registered voice data for a speaker to be identified; registered voice data selecting means for selecting a predetermined number of pieces of registered voice data from the plurality of pieces received by the registered voice data receiving means; storage means for storing the predetermined number of pieces of registered voice data selected by the registered voice data selecting means in association with the speaker; similarity calculating means for calculating a similarity between the input voice data and the registered voice data stored in the storage means; and speaker identifying means for identifying the speaker on the basis of the similarity calculated by the similarity calculating means, wherein the registered voice data selecting means calculates, for each of the plurality of pieces of registered voice data received by the registered voice data receiving means, the variance of the distances among the plurality of pieces of registered voice data that results when that piece is excluded from the selection targets, and excludes registered voice data so that the variance is minimized, thereby selecting the predetermined number of pieces of registered voice data from the plurality of pieces received by the registered voice data receiving means.

The invention according to claim 2 is the invention according to claim 1, wherein the storage means treats a predetermined number of pieces of registered voice data of the same speaker as a registered voice set and stores a plurality of registered voice sets in association with the same speaker.

The invention according to claim 3 is a speaker recognition device that recognizes the speaker of input voice data on the basis of the input voice data, the device comprising: registered voice data receiving means for receiving a plurality of pieces of registered voice data for a speaker to be identified; registered voice data selecting means for selecting a predetermined number of pieces of registered voice data from the plurality of pieces received by the registered voice data receiving means; storage means for storing the predetermined number of pieces of registered voice data selected by the registered voice data selecting means in association with the speaker; similarity calculating means for calculating a similarity between the input voice data and the registered voice data stored in the storage means; and speaker identifying means for identifying the speaker on the basis of the similarity calculated by the similarity calculating means, wherein the storage means treats a predetermined number of pieces of registered voice data of the same speaker as a registered voice set and stores a plurality of registered voice sets in association with the same speaker.

The invention according to claim 4 is the invention according to any one of claims 1 to 3, further comprising update processing means that, when identification by the speaker identifying means has been performed, accumulates result data indicating the registered voice set that contributed to the identification and stores the input voice data as registered voice data of an update preparation set, and that deletes a registered voice set that the result data shows to contribute little to identification and adds the update preparation set as a new registered voice set.

The invention according to claim 5 is the invention according to any one of claims 1 to 4, wherein the speaker identifying means takes, as the speaker candidate for the input voice data, the speaker corresponding to the highest of the plurality of similarities calculated by the similarity calculating means, and the device further comprises speaker verifying means for determining that the speaker candidate and the speaker of the input voice data are the same person when the highest of the plurality of similarities calculated by the similarity calculating means exceeds a predetermined verification threshold.

The invention according to claim 6 is the invention according to claim 5, further comprising: monitoring means for performing a monitoring operation on a monitoring target; word determining means for determining a word contained in the input voice data; and control means for controlling the monitoring operation of the monitoring means on the basis of the word determined by the word determining means when the speaker verifying means has obtained a verification result indicating that the speaker of the input voice data is the speaker to be identified.

The invention according to claim 7 is the invention according to any one of claims 1 to 6, wherein the similarity calculating means treats a smaller distance between the registered voice data and the input voice data as a higher similarity.

The invention according to claim 8 is a speaker recognition method for recognizing the speaker of input voice data on the basis of the input voice data, the method including: a registered voice data receiving step of receiving a plurality of pieces of registered voice data for a speaker to be identified; a registered voice data selecting step of selecting a predetermined number of pieces of registered voice data from the plurality of pieces received in the registered voice data receiving step; a storing step of storing the predetermined number of pieces of registered voice data selected in the registered voice data selecting step in a storage unit in association with the speaker; a similarity calculating step of calculating a similarity between the input voice data and the registered voice data stored in the storage unit; and a speaker identifying step of identifying the speaker on the basis of the similarity calculated in the similarity calculating step, wherein the registered voice data selecting step calculates, for each of the plurality of pieces of registered voice data received in the registered voice data receiving step, the variance of the distances among the plurality of pieces of registered voice data that results when that piece is excluded from the selection targets, and excludes registered voice data so that the variance is minimized, thereby selecting the predetermined number of pieces of registered voice data from the plurality of pieces received in the registered voice data receiving step.

The invention according to claim 9 is a speaker recognition method for recognizing the speaker of input voice data on the basis of the input voice data, the method including: a registered voice data receiving step of receiving a plurality of pieces of registered voice data for a speaker to be identified; a registered voice data selecting step of selecting a predetermined number of pieces of registered voice data from the plurality of pieces received in the registered voice data receiving step; a storing step of storing the predetermined number of pieces of registered voice data selected in the registered voice data selecting step in a storage unit in association with the speaker; a similarity calculating step of calculating a similarity between the input voice data and the registered voice data stored in the storage unit; and a speaker identifying step of identifying the speaker on the basis of the similarity calculated in the similarity calculating step, wherein the storing step treats a predetermined number of pieces of registered voice data of the same speaker as a registered voice set and stores a plurality of registered voice sets in association with the same speaker.

The invention according to claim 10 is a speaker recognition program for recognizing the speaker of input voice data on the basis of the input voice data, the program causing a computer to execute: a registered voice data receiving procedure of receiving a plurality of pieces of registered voice data for a speaker to be identified; a registered voice data selecting procedure of selecting a predetermined number of pieces of registered voice data from the plurality of pieces received in the registered voice data receiving procedure; a storing procedure of storing the predetermined number of pieces of registered voice data selected in the registered voice data selecting procedure in a storage unit in association with the speaker; a similarity calculating procedure of calculating a similarity between the input voice data and the registered voice data stored in the storage unit; and a speaker identifying procedure of identifying the speaker on the basis of the similarity calculated in the similarity calculating procedure, wherein the registered voice data selecting procedure calculates, for each of the plurality of pieces of registered voice data received in the registered voice data receiving procedure, the variance of the distances among the plurality of pieces of registered voice data that results when that piece is excluded from the selection targets, and excludes registered voice data so that the variance is minimized, thereby selecting the predetermined number of pieces of registered voice data from the plurality of pieces received in the registered voice data receiving procedure.

The invention according to claim 11 is a speaker recognition program for recognizing the speaker of input voice data on the basis of the input voice data, the program causing a computer to execute: a registered voice data receiving procedure of receiving a plurality of pieces of registered voice data for a speaker to be identified; a registered voice data selecting procedure of selecting a predetermined number of pieces of registered voice data from the plurality of pieces received in the registered voice data receiving procedure; a storing procedure of storing the predetermined number of pieces of registered voice data selected in the registered voice data selecting procedure in a storage unit in association with the speaker; a similarity calculating procedure of calculating a similarity between the input voice data and the registered voice data stored in the storage unit; and a speaker identifying procedure of identifying the speaker on the basis of the similarity calculated in the similarity calculating procedure, wherein the storing procedure treats a predetermined number of pieces of registered voice data of the same speaker as a registered voice set and stores a plurality of registered voice sets in association with the same speaker.

According to the present invention, a plurality of pieces of registered voice data are received for a speaker to be identified, a predetermined number of pieces are selected from them, the selected pieces are stored in association with the speaker, and the speaker is identified on the basis of the similarity between the input voice data and the registered voice data. The voice data of the person to be verified is therefore held in an appropriate state, and the accuracy of speaker recognition can be improved.

FIG. 1 is a system configuration diagram showing the configuration of the home security system according to the embodiment.
FIG. 2 is an internal configuration diagram of the speaker recognition unit shown in FIG. 1.
FIG. 3 is an explanatory diagram of the speaker registration data shown in FIG. 2.
FIG. 4 is an explanatory diagram of the influence of inappropriate registered voice data.
FIG. 5 is an explanatory diagram of the selection of registered voice data.
FIG. 6 is an explanatory diagram of the generation of an update candidate set.
FIG. 7 is an explanatory diagram of the replacement of a registered voice set.
FIG. 8 is a flowchart showing the processing procedure of the speaker recognition unit in the registration mode.
FIG. 9 is a flowchart showing the procedure of the voice data selection processing shown in FIG. 8.
FIG. 10 is a flowchart showing the processing procedure of the speaker recognition unit in the recognition mode.
FIG. 11 is a flowchart showing the procedure of the distance calculation processing shown in FIG. 10.
FIG. 12 is a flowchart showing the procedure of the update processing shown in step S308 of FIG. 10.
FIG. 13 is an explanatory diagram of a specific example of inappropriate registered voice data.
FIG. 14 is an explanatory diagram of the effect of updating the speaker registration data.

Preferred embodiments of the speaker recognition device, speaker recognition method, and speaker recognition program according to the present invention are described in detail below with reference to the accompanying drawings. The embodiment below describes the case where they are applied to a residential home security system.

FIG. 1 is a system configuration diagram showing the configuration of the home security system according to the embodiment. In the home security system shown in FIG. 1, a door monitoring device 11, a window monitoring device 12, a fire detection device 13, and a speaker recognition device 30 are connected to a monitoring device 60, and a microphone 20 is connected to the speaker recognition device 30.

The door monitoring device 11 monitors the doors of the house for unauthorized entry attempts. When it detects an entry attempt such as lock picking, it notifies the monitoring device 60.

The window monitoring device 12 monitors the windows of the house for unauthorized entry attempts. When it detects an impact or the like on a window, it notifies the monitoring device 60.

The fire detection device 13 is installed in a living room or other room of the house and detects the occurrence of a fire. When it detects a fire, it notifies the monitoring device 60.

The microphone 20 is installed at an entrance such as the front door, acquires an acoustic signal, and outputs it to the speaker recognition device 30. The microphone 20 operates continuously, acquiring and outputting acoustic signals. Acquisition of the acoustic signal may instead be switched on and off using a human presence sensor or the like. The speaker recognition device 30 can be installed at any location, and the microphone 20 may be housed inside the casing of the speaker recognition device 30.

The speaker recognition device 30 performs speaker recognition using the acoustic signal acquired by the microphone 20 and outputs the result to the monitoring device 60, which manages the operation of the home security system. The speaker recognition device 30 includes a speaker recognition unit 31 and a text determination unit 32, and the monitoring device 60 includes a monitoring control unit 33 and a monitoring unit 34. The speaker recognition unit 31 cuts speech out of the acoustic signal acquired by the microphone 20, recognizes whether the speech is the voice of a resident, and outputs the recognition result to the monitoring control unit 33 of the monitoring device 60. The text determination unit 32 cuts speech out of the acoustic signal acquired by the microphone 20 and outputs the words in the speech to the monitoring control unit 33 of the monitoring device 60 as text information.

The monitoring control unit 33 is a processing unit that controls the operation of the monitoring unit 34 on the basis of the text information output from the text determination unit 32 when the speaker recognition unit 31 has recognized the speaker as a resident. Specifically, when the text information contains a phrase such as "security on" or "ittekimasu" ("I'm heading out"), it causes the monitoring unit 34 to start the monitoring operation. When the text information contains a phrase such as "security off" or "tadaima" ("I'm home"), it causes the monitoring unit 34 to end the monitoring operation.
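The control rule of the monitoring control unit 33 can be illustrated with a minimal sketch. The function name `decide_monitoring_action` and the two keyword tuples are hypothetical; the phrases themselves are the examples given in the description, and plain substring matching is an assumption made only for illustration.

```python
from typing import Optional

# Example phrases from the description; a real keyword list is not specified.
START_WORDS = ("セキュリティオン", "いってきます")  # "security on", "I'm heading out"
STOP_WORDS = ("セキュリティオフ", "ただいま")      # "security off", "I'm home"


def decide_monitoring_action(is_resident: bool, text: str) -> Optional[str]:
    """Return "start", "stop", or None (leave the monitoring state unchanged)."""
    if not is_resident:
        # Speech not recognized as a resident's never changes the state.
        return None
    if any(word in text for word in START_WORDS):
        return "start"
    if any(word in text for word in STOP_WORDS):
        return "stop"
    return None
```

The speaker check comes first so that text spoken by a non-resident is ignored regardless of its content.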

The monitoring unit 34 is a processing unit that monitors the residence using the outputs of the door monitoring device 11, the window monitoring device 12, and the fire detection device 13. Specifically, the monitoring unit 34 starts the monitoring operation when it receives a start instruction from the monitoring control unit 33. If, during the monitoring operation, it is notified of an abnormality by the door monitoring device 11, the window monitoring device 12, or the fire detection device 13, it performs an alarm operation and reports the abnormality to the center. The monitoring operation ends when an end instruction is received from the monitoring control unit 33.

In this way, the home security system according to the present embodiment recognizes the voice of a resident, so that the monitoring operation can be switched on and off by voice.

Next, the internal configuration of the speaker recognition unit 31 shown in FIG. 1 is described. FIG. 2 is an internal configuration diagram of the speaker recognition unit 31 shown in FIG. 1. As shown in FIG. 2, the speaker recognition unit 31 includes an AD conversion unit 41, a voice segment extraction unit 42, a feature parameter calculation unit 43, a switching unit 44, a registration processing unit 45, a storage unit 46, a distance calculation unit 47, a recognition processing unit 48, and an update processing unit 49.

The AD conversion unit 41 is a processing unit that converts the acoustic signal acquired by the microphone 20 from an analog signal to a digital signal and outputs it to the voice segment extraction unit 42.

The voice segment extraction unit 42 is a processing unit that extracts voice segments from the acoustic signal converted into a digital signal by the AD conversion unit 41. Voice segments can be extracted on the basis of the signal power of the acoustic signal, the number of zero crossings, and the like.
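As an illustration of extraction based on signal power and zero-crossing count, the following is a minimal sketch. The frame length, both thresholds, and the rule itself (a frame counts as speech when its power is high enough and its zero-crossing count is not too high) are assumptions; the description names only the cues, not a decision rule.

```python
from typing import List, Sequence, Tuple


def frame_power(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossings(frame: Sequence[float]) -> int:
    """Number of sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def extract_voice_segments(
    samples: Sequence[float],
    frame_len: int = 160,       # assumed frame length in samples
    power_thresh: float = 1e-3,  # assumed minimum power for speech
    max_zero_cross: int = 80,    # assumed maximum zero crossings for speech
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of consecutive frames judged to be speech."""
    segments: List[Tuple[int, int]] = []
    start = None
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        is_speech = (frame_power(frame) >= power_thresh
                     and zero_crossings(frame) <= max_zero_cross)
        if is_speech and start is None:
            start = i * frame_len                      # a segment opens here
        elif not is_speech and start is not None:
            segments.append((start, i * frame_len))    # the segment closes here
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_len))
    return segments
```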

The feature parameter calculation unit 43 is a processing unit that calculates feature parameters representing the characteristics of the spectral envelope of the voice signal output from the voice segment extraction unit 42. Any method can be used to calculate the feature parameters, such as LPC (linear predictive coding) cepstrum coefficients or MFCC (mel-frequency cepstrum coefficients).

The switching unit 44 is a processing unit that switches the operation mode of the speaker recognition unit 31. The speaker recognition unit 31 has two operation modes, a registration mode and a recognition mode. When the switching unit 44 has set the registration mode, the feature parameters calculated by the feature parameter calculation unit 43 are output to the registration processing unit 45 as registered voice data. When the switching unit 44 has set the recognition mode, the feature parameters calculated by the feature parameter calculation unit 43 are output to the distance calculation unit 47 as input voice data.

The registration processing unit 45 generates speaker registration data by associating a resident to be registered with that resident's registered voice data, and stores it in the storage unit 46. In doing so, the registration processing unit 45 builds a registered voice set containing a predetermined number of pieces of registered voice data of the same speaker and associates the set with the resident.

The registration processing unit 45 also includes a voice data selection processing unit 45a. The voice data selection processing unit 45a is a processing unit that selects, from the plurality of pieces of registered voice data received at registration, the registered voice data to be stored in the registered voice set. Its specific operation is described later.
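Claim 1 characterizes this selection as excluding registered voice data so that the variance of the distances among the remaining data is minimized. A minimal sketch of one way to read that is given below. It assumes each piece of registered voice data is summarized as a fixed-length feature vector, uses Euclidean distance as a stand-in for the device's distance, and removes one piece at a time greedily; the names `distance_variance` and `select_registered_voice_data` are hypothetical.

```python
import math
from itertools import combinations
from statistics import pvariance
from typing import Callable, List, Sequence

Vector = Sequence[float]


def distance_variance(items: List[Vector], dist: Callable[[Vector, Vector], float]) -> float:
    """Population variance of all pairwise distances among items."""
    dists = [dist(a, b) for a, b in combinations(items, 2)]
    return pvariance(dists) if len(dists) > 1 else 0.0


def select_registered_voice_data(
    items: Sequence[Vector],
    target_count: int,
    dist: Callable[[Vector, Vector], float] = math.dist,  # assumed distance
) -> List[Vector]:
    """Greedily drop the item whose exclusion minimizes the distance variance."""
    selected = list(items)
    while len(selected) > target_count:
        # Try excluding each item in turn and keep the exclusion that
        # leaves the smallest variance among the remaining distances.
        drop = min(
            range(len(selected)),
            key=lambda i: distance_variance(selected[:i] + selected[i + 1:], dist),
        )
        del selected[drop]
    return selected
```

With four tightly clustered vectors and one distant vector, reducing five pieces to four drops the distant one, since excluding it leaves the most uniform set of pairwise distances.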

The storage unit 46 is a storage device such as a hard disk device or nonvolatile memory, and stores the speaker registration data. The speaker registration data contains information about a resident, who is a speaker to be identified, a registered voice set storing a predetermined number of pieces of that resident's registered voice data, and an update preparation set storing registered voice data to be used for updating the registered voice set. In FIG. 2, the storage unit 46 stores speaker registration data R1 and speaker registration data R2.
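The layout of the speaker registration data can be pictured with a small data-structure sketch. All class and field names here are hypothetical, the frame-by-frame feature layout is an assumption, and holding several registered voice sets per speaker follows claim 2 rather than this paragraph.

```python
from dataclasses import dataclass, field
from typing import List

# One piece of voice data: a sequence of per-frame feature vectors (assumed layout).
FeatureSequence = List[List[float]]


@dataclass
class RegisteredVoiceSet:
    """A predetermined number of pieces of registered voice data for one speaker."""
    voice_data: List[FeatureSequence]


@dataclass
class SpeakerRegistrationData:
    """Per-resident record held in the storage unit (for example R1 or R2)."""
    resident_info: str
    registered_sets: List[RegisteredVoiceSet] = field(default_factory=list)
    # Voice data accumulated for a later update of the registered voice sets.
    update_preparation_set: List[FeatureSequence] = field(default_factory=list)
```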

The distance calculation unit 47 is a processing unit that treats a smaller distance between the input voice data and the registered voice data as a higher similarity. Specifically, the distance calculation unit 47 cuts out a plurality of pieces of partial registered voice data from the same registered voice data and a plurality of pieces of partial input voice data from the input voice data, calculates a distance for each combination of partial registered voice data and partial input voice data, and takes the smallest of the calculated distances as the distance between that registered voice data and the input voice data. Alternatively, the average of the calculated distances may be used as the distance to the registered voice data.
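The partial-data comparison can be sketched as follows. The sliding-window cut-out, the window and hop sizes, and the frame-wise mean Euclidean distance between two partial pieces are assumptions for illustration; the description states only that partial pieces are cut out and that the minimum (or the average) of the resulting distances is used.

```python
import math
from itertools import product
from statistics import mean
from typing import List, Sequence

Frames = Sequence[Sequence[float]]  # one feature vector per frame


def cut_partials(frames: Frames, window: int, hop: int) -> List[Frames]:
    """Cut fixed-length partial pieces out of a feature sequence."""
    return [frames[s:s + window] for s in range(0, len(frames) - window + 1, hop)]


def partial_distance(a: Frames, b: Frames) -> float:
    """Mean frame-wise Euclidean distance between two equal-length pieces."""
    return mean(math.dist(x, y) for x, y in zip(a, b))


def voice_distance(registered: Frames, input_voice: Frames,
                   window: int = 3, hop: int = 1,
                   use_average: bool = False) -> float:
    """Distance between registered and input voice data over all partial pairs."""
    reg_parts = cut_partials(registered, window, hop)
    in_parts = cut_partials(input_voice, window, hop)
    if not reg_parts or not in_parts:
        raise ValueError("voice data is shorter than the window")
    dists = [partial_distance(r, i) for r, i in product(reg_parts, in_parts)]
    # Minimum by default; the average is the alternative the description mentions.
    return mean(dists) if use_average else min(dists)
```

Taking the minimum means a single well-matching stretch of speech is enough to yield a small distance, while the average reflects the match over every pairing.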

The distance calculation unit 47 outputs the distance between the input voice data and the registered voice data to the recognition processing unit 48. It calculates and outputs this distance for each of the plurality of registered voice data stored in the storage unit 46.

The recognition processing unit 48 includes a speaker identification unit 48a and a speaker verification unit 48b. The speaker identification unit 48a selects the registered voice data for which the distance calculated by the distance calculation unit 47 is smallest. The speaker of this registered voice data becomes the speaker candidate for the input voice data.

The speaker verification unit 48b compares the distance between the registered voice data selected by the speaker identification unit 48a and the input voice data with a verification threshold. If this distance is smaller than the verification threshold, the speaker verification unit 48b determines that the speaker of the registered voice data matches the speaker of the input voice data. Because a smaller distance corresponds to a higher similarity, a distance below the verification threshold means that the similarity is above a predetermined similarity threshold. The speaker verification unit 48b outputs the determination result to the monitoring device 60 and the update processing unit 49.
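
The two-stage decision described above (identification by minimum distance, then verification against a threshold) might be sketched as follows. This is a minimal sketch under the assumption that one distance per registered speaker has already been computed; the function name identify_and_verify and the dictionary layout are hypothetical.

```python
def identify_and_verify(distances, threshold):
    """Return the matching speaker, or None when no speaker is accepted.

    distances: dict mapping a speaker label to the distance between that
               speaker's closest registered voice data and the input.
    threshold: verification threshold; a strictly smaller distance is accepted.
    """
    if not distances:
        return None
    # Identification: the speaker with the smallest distance is the candidate.
    candidate = min(distances, key=distances.get)
    # Verification: accept the candidate only if it is close enough.
    if distances[candidate] < threshold:
        return candidate
    return None
```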

The update processing unit 49 updates the speaker registration data based on recognition results obtained in the recognition mode. Specifically, when the speaker verification unit 48b of the recognition processing unit 48 determines that the speaker of the input voice data matches the speaker of the registered voice data, the update processing unit 49 stores the input voice data in the update preparation set as registered voice data for updating that speaker's registered voice sets. Once enough registered voice data has accumulated in the update preparation set, the voice data selection processing unit 49a selects a predetermined number of registered voice data to form an update candidate set.

Having generated the update candidate set, the update processing unit 49 additionally registers it as a registered voice set if the number of registered voice sets in the speaker registration data has not reached the upper limit. If the number has reached the upper limit, the update processing unit 49 has the voice set selection processing unit 49b select a voice set.

The voice set selection processing unit 49b determines whether the existing registered voice sets include one with no record of contributing to identification. If such a registered voice set exists, the voice set selection processing unit 49b deletes it and additionally registers the update candidate set as a registered voice set. If every existing registered voice set has a record of contributing to identification, the voice set selection processing unit 49b resets the update candidate set and the records of contribution to identification.

FIG. 3 is an explanatory diagram of the speaker registration data shown in FIG. 2. As shown in FIG. 3, the speaker registration data R1 associates speaker data, registered voice sets, and an update preparation set with one another.

The speaker data is information about the speaker of the speaker registration data R1, such as name, sex, and age. A registered voice set is a set containing a predetermined number of the speaker's registered voice data. In FIG. 3, the speaker registration data R1 contains m registered voice sets G1 to Gm, and each of the registered voice sets G1 to Gm contains (N−n) registered voice data v1 to v(N−n).

When the speaker registration data R1 is first registered, at least one registered voice set is generated in the registration mode. Thereafter, registered voice sets are added and replaced based on recognition results in the recognition mode.

The update preparation set G0 is a set of registered voice data held for adding or replacing registered voice sets. When, in the recognition mode, the speaker of the input voice data is determined to match the speaker of the registered voice data, the input voice data is added to the update preparation set G0 as registered voice data.

At this time, matching record data indicating the registered voice set to which the registered voice data selected by the speaker identification unit 48a belongs is generated and associated with the registered voice data added to the update preparation set G0. In other words, the matching record data indicates which registered voice set was used in the identification that led to the input voice data being registered as registered voice data.

In FIG. 3, the update preparation set G0 contains N registered voice data v1 to vN, which are associated with matching record data d1 to dN, respectively.
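
For illustration, the structure of the speaker registration data in FIG. 3 might be modeled as follows. This is a sketch only: the class and field names (PendingEntry, SpeakerRegistration, matched_set_index, and so on) are hypothetical, and representing registered voice data as a list of floats is an assumption.

```python
from dataclasses import dataclass, field
from typing import List

FeatureVector = List[float]  # assumed representation of one registered voice data


@dataclass
class PendingEntry:
    """One entry of the update preparation set G0."""
    voice_data: FeatureVector   # registered voice data v_i
    matched_set_index: int      # matching record data d_i: index of the set that identified it


@dataclass
class SpeakerRegistration:
    """Speaker registration data such as R1."""
    name: str                                                             # speaker data (name only, for brevity)
    voice_sets: List[List[FeatureVector]] = field(default_factory=list)  # G1 to Gm
    pending: List[PendingEntry] = field(default_factory=list)            # G0
```

Storing the index of the contributing set next to each pending entry keeps the matching record data aligned with the voice data it describes.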

Once N registered voice data have accumulated in the update preparation set G0, the voice data selection processing unit 49a selects a predetermined number (N−n) of them to generate an update candidate set.

Thereafter, if the number of registered voice sets has not reached the upper limit m, the update candidate set is additionally registered as a registered voice set. If the number has reached the upper limit m, either a registered voice set is replaced or the update preparation set is reset, based on the matching record data d1 to d(N−n).

Next, the selection of registered voice data is described. FIG. 4 is an explanatory diagram of the influence of inappropriate registered voice data. FIG. 4 shows the frequency distribution of distances between registered voice data and input voice data. The distribution of distances when the speaker of the registered voice data and the speaker of the input voice data are the same is the "genuine distribution", and the distribution when the two speakers differ is the "impostor distribution".

To avoid "false rejection", in which the same person is recognized as someone else, it is effective to raise the verification threshold used for speaker verification so that the genuine distribution lies at or below the threshold. Conversely, to avoid "false acceptance", in which another person is recognized as the same person, it is effective to lower the verification threshold so that the impostor distribution lies at or above the threshold.

As shown in FIG. 4(a), when appropriate registered voice data is used and no inappropriate registered voice data is included, the variance of the genuine distribution is small and the distribution does not overlap the impostor distribution. Setting the verification threshold to a distance that cleanly separates the two distributions therefore keeps both the false rejection rate and the false acceptance rate sufficiently low.

However, as shown in FIG. 4(b), when inappropriate registered voice data is included, the variance of the genuine distribution increases and the distribution overlaps the impostor distribution, so that no appropriate verification threshold can be set. As FIG. 4(b) shows, setting the verification threshold so as to lower the false rejection rate sufficiently raises the false acceptance rate. Likewise, lowering the false acceptance rate raises the false rejection rate.
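
The trade-off described above can be made concrete with a small calculation. The following sketch estimates both error rates from sample distances for a given threshold; the function name error_rates and the sample values in the usage note are hypothetical and are not taken from FIG. 4.

```python
def error_rates(genuine_distances, impostor_distances, threshold):
    """Return (false rejection rate, false acceptance rate) for one threshold.

    A genuine trial is rejected when its distance is not below the threshold;
    an impostor trial is accepted when its distance is below the threshold.
    """
    false_reject = sum(d >= threshold for d in genuine_distances) / len(genuine_distances)
    false_accept = sum(d < threshold for d in impostor_distances) / len(impostor_distances)
    return false_reject, false_accept
```

With well-separated samples such as genuine distances [0.1, 0.2, 0.3] and impostor distances [0.7, 0.8, 0.9], a threshold of 0.5 gives zero for both rates. If a noisy genuine sample of 0.75 is added, a threshold of 0.5 rejects it, while a threshold of 0.8 accepts it at the cost of also accepting the impostor at 0.7.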

Inappropriate registered voice data arises, for example, when noise or another person's speech is superimposed on the speaker's own voice. Consequently, if noise or the like is superimposed during registration and inappropriate registered voice data enters the speaker registration data, all subsequent recognition is affected.

The voice data selection processing unit 45a of the registration processing unit 45 therefore excludes inappropriate registered voice data from the plurality of registered voice data received at registration and selects the registered voice data to be stored in the registered voice set.

FIG. 5 is an explanatory diagram of the selection of registered voice data. First, as shown in FIG. 5(a), the voice data selection processing unit 45a accumulates registered voice data v1 to vN, obtained by calculating feature quantities from each of N utterances by the speaker being registered.

Next, as shown in FIG. 5(b), the voice data selection processing unit 45a calculates a distance for each combination of the registered voice data v1 to vN. This calculation is the same as the distance calculation performed by the distance calculation unit 47. For each registered voice data, the voice data selection processing unit 45a calculates the average of its distances to the other registered voice data. For example, for the registered voice data v1 it averages the distances between v1 and v2 to vN, and for the registered voice data v2 it averages the distances between v2 and v1, v3 to vN. The registered voice data with the largest average distance is then excluded from the selection targets. In FIG. 5(b), the registered voice data v5 is excluded.

The voice data selection processing unit 45a repeats this processing and, when the number of selection targets has fallen to the predetermined number (N−n), selects the remaining registered voice data as the registered voice data to be stored in the registered voice set.

Although the average distance is used here, the sum of the distances may be used instead. Alternatively, a threshold may be set for the average or sum of the distances, and registered voice data exceeding that threshold may be excluded from the selection targets.
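
The elimination procedure based on the average distance might be sketched as follows. This is a minimal sketch, assuming that a pairwise distance function is supplied by the caller and that at least two items remain before each removal; the name select_voice_data is hypothetical.

```python
def select_voice_data(items, n_remove, distance):
    """Repeatedly drop the item whose mean distance to the others is largest.

    items:    list of N registered voice data.
    n_remove: number n of items to exclude, leaving N - n items.
    distance: function returning the distance between two items.
    """
    remaining = list(items)
    for _ in range(n_remove):
        # Mean distance from each remaining item to all other remaining items.
        mean_dist = [
            sum(distance(a, b) for j, b in enumerate(remaining) if j != i)
            / (len(remaining) - 1)
            for i, a in enumerate(remaining)
        ]
        # Exclude the item that is farthest from the rest on average.
        worst = max(range(len(remaining)), key=mean_dist.__getitem__)
        del remaining[worst]
    return remaining
```

The averages are recomputed after every removal, so an outlier that has already been excluded no longer inflates the averages of the remaining items.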

The variance of the distances may also be used when selecting registered voice data. In this case, the registered voice data v1 to vN are first accumulated, some of them are taken as exclusion candidates, distances are calculated for all combinations of the remaining registered voice data, and the variance of those distances is calculated. The variance is calculated in the same way with other data taken as the exclusion candidates. As shown in FIG. 5(c), the variance of the distances is large while inappropriate registered voice data remains, and small when the inappropriate registered voice data is the exclusion candidate. The exclusion candidates that yield the smallest variance are therefore excluded, and the remaining registered voice data is selected for storage in the registered voice set. Here, the number N of repeated utterances for registration is 3 or more, the number n of exclusion candidates is 1 or more, and N−n is 2 or more.
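
The variance-based selection might be sketched as follows. The sketch assumes that every choice of n exclusion candidates is tried exhaustively and that the population variance is used; neither detail is stated in this description, and the name select_by_variance is hypothetical.

```python
from itertools import combinations
from statistics import pvariance


def select_by_variance(items, n_remove, distance):
    """Keep the N - n items whose pairwise distances have the smallest variance."""
    best_keep, best_var = None, None
    for excluded in combinations(range(len(items)), n_remove):
        # Items that remain when this group of exclusion candidates is removed.
        keep = [x for i, x in enumerate(items) if i not in excluded]
        # Distances over all pairs of remaining items, and their variance.
        pair_dists = [distance(a, b) for a, b in combinations(keep, 2)]
        var = pvariance(pair_dists)
        if best_var is None or var < best_var:
            best_keep, best_var = keep, var
    return best_keep
```

When only two items remain there is a single pairwise distance and the variance is always zero, so in this sketch the criterion is informative only when N−n is 3 or more.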

Next, the updating of registered voice sets is described. Even if appropriate registered voice data is obtained at registration, recognition accuracy degrades when the speaker's voice changes over time or for other reasons. The update processing unit 49 therefore updates the registered voice sets based on recognition results in the recognition mode.

Specifically, when the speaker of the input voice data is determined in the recognition mode to match the speaker of the registered voice data, the update processing unit 49 adds the input voice data to the update preparation set G0 as registered voice data of that speaker.

In addition, matching record data indicating the registered voice set to which the registered voice data selected by the speaker identification unit 48a belongs, that is, the registered voice set that contributed to the identification, is generated and associated with the registered voice data added to the update preparation set G0.

As shown in FIG. 6, once N registered voice data have accumulated in the update preparation set G0, the voice data selection processing unit 49a selects a predetermined number (N−n) of registered voice data to generate an update candidate set.

Thereafter, if the number of registered voice sets has not reached the upper limit m, the update candidate set is additionally registered as a registered voice set. If the number has reached the upper limit m, the voice set selection processing unit 49b selects a voice set.

The voice set selection processing unit 49b refers to the matching record data d1 to d(N−n) and determines whether there is a registered voice set with no record of contributing to identification. If such a registered voice set exists, the voice set selection processing unit 49b deletes it and additionally registers the update candidate set as a registered voice set. If every existing registered voice set has a record of contributing to identification, the voice set selection processing unit 49b resets the update candidate set and the records of contribution to identification.

In the example shown in FIG. 7, the matching record data d1 to d4 each indicate one of the registered voice sets G1, G2, and G3, and the registered voice set G2 has no record of identification. The registered voice set G2 is therefore deleted, and the update candidate set is stored as the new registered voice set G2.
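
The addition and replacement rule described above might be sketched as follows. This is an illustrative sketch only: the name apply_update is hypothetical, registered voice sets are referred to by zero-based index, and replacing the first set without a record is an assumption, since this description does not say which set is chosen when several have no record.

```python
def apply_update(voice_sets, candidate_set, matched_indices, max_sets):
    """Return the list of registered voice sets after one update attempt.

    voice_sets:      current registered voice sets (G1 to Gm, zero-based here).
    candidate_set:   update candidate set built from the update preparation set.
    matched_indices: matching record data, one set index per candidate entry.
    max_sets:        upper limit m on the number of registered voice sets.
    """
    if len(voice_sets) < max_sets:
        # Below the upper limit: simply add the candidate as a new set.
        return voice_sets + [candidate_set]
    used = set(matched_indices)
    for idx in range(len(voice_sets)):
        if idx not in used:
            # This set never contributed to identification: replace it.
            updated = list(voice_sets)
            updated[idx] = candidate_set
            return updated
    # Every set contributed: discard the candidate and keep the sets as they are.
    return list(voice_sets)
```

In terms of FIG. 7, a call with three sets, an upper limit of three, and matching records that point only at the first and third sets replaces the second set, which corresponds to G2.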

Next, the processing procedure of the speaker recognition unit 31 is described. FIG. 8 is a flowchart showing the processing procedure of the speaker recognition unit 31 in the registration mode. This procedure is executed with the switching unit 44 set to the registration mode.

First, the microphone 20 acquires an acoustic signal (step S101). The voice section extraction unit 42 extracts a voice section from the acoustic signal acquired by the microphone 20 (step S102).

The feature parameter calculation unit 43 cuts out a plurality of partial voice signals from the voice signal of the voice section and calculates feature parameters representing the spectral envelope of the voice signal (step S103). The registration processing unit 45 accumulates the calculated feature parameters as registered voice data (step S104) and determines whether the number of registered voice data has reached N (step S105). If the number of registered voice data is less than N (step S105; No), the process returns to step S101 and the acoustic signal of the next utterance is acquired.

When the number of registered voice data has reached N (step S105; Yes), the voice data selection processing unit 45a performs voice data selection processing, which selects (N−n) registered voice data from the N registered voice data (step S106). The selected (N−n) registered voice data are then stored in the registered voice set G1 (step S107), and the process ends.

Next, the voice data selection processing shown in FIG. 8 is described. FIG. 9 is a flowchart showing the procedure of this voice data selection processing. The voice data selection processing unit 45a first calculates a distance for each combination of the registered voice data v1 to vN (step S201). It then calculates, for each registered voice data, the average of its distances to the other registered voice data (step S202), and excludes the registered voice data with the largest average distance from the selection targets (step S203).

After step S203, the voice data selection processing unit 45a determines whether the number of removed registered voice data has reached n (step S204). If it has not (step S204; No), the process returns to step S201 and a distance is calculated for each combination of the remaining registered voice data. If the number of removed registered voice data has reached n (step S204; Yes), the remaining (N−n) registered voice data are selected (step S205), and the voice data selection processing ends.

FIG. 10 is a flowchart showing the processing procedure of the speaker recognition unit 31 in the recognition mode. This procedure is executed with the switching unit 44 set to the recognition mode.

First, the microphone 20 acquires an acoustic signal (step S301). The voice section extraction unit 42 extracts a voice section from the acoustic signal acquired by the microphone 20 (step S302).

The feature parameter calculation unit 43 calculates feature parameters representing the spectral envelope of the voice section (step S303).

The distance calculation unit 47 calculates the distance to the input voice data for each item of speaker registration data (step S304). The speaker identification unit 48a identifies the speaker registration data for which the distance calculated by the distance calculation unit 47 is smallest (step S305).

The speaker verification unit 48b compares the distance between the speaker registration data identified by the speaker identification unit 48a and the input voice data, that is, the minimum distance, with the verification threshold (step S306). If the minimum distance is smaller than the verification threshold (step S306; Yes), the speaker verification unit 48b determines that the speaker of that speaker registration data matches the speaker of the input voice data, and outputs the speaker data of that speaker registration data to the monitoring device 60 and the update processing unit 49 as the verification result (step S307). After step S307, the update processing unit 49 performs update processing (step S308), and the process ends.

If, on the other hand, the minimum distance is equal to or greater than the verification threshold (step S306; No), the speaker verification unit 48b outputs to the monitoring device 60, as the verification result, an indication that no registered speaker corresponds to the input voice data (step S309), and the process ends.

Next, the distance calculation processing shown in step S304 of FIG. 10 is described. FIG. 11 is a flowchart showing the procedure of this distance calculation processing. The distance calculation unit 47 first selects an item of speaker registration data (step S401). It then selects registered voice data from the selected speaker registration data (step S402) and calculates the distance between the selected registered voice data and the input voice data (step S403).

After step S403, the distance calculation unit 47 determines whether all registered voice data in the speaker registration data selected in step S401 have been selected (step S404). If unselected registered voice data remain (step S404; No), the process returns to step S402 and the next registered voice data is selected.

If all registered voice data in the same speaker registration data have been selected (step S404; Yes), the distance calculation unit 47 takes the smallest of the distances calculated for all registered voice data in that speaker registration data as the distance of that speaker registration data (step S405).

After step S405, the distance calculation unit 47 determines whether all items of speaker registration data have been selected (step S406). If unselected speaker registration data remain (step S406; No), the process returns to step S401 and the next speaker registration data is selected. If all items of speaker registration data have been selected (step S406; Yes), the distance calculation processing ends.
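
The nested loop of FIG. 11 might be sketched as follows, with comments mapping to the step numbers. This is a sketch under the assumptions that every item of speaker registration data holds at least one registered voice data, that the data of all its registered voice sets are passed as one flat list, and that the distance function is supplied by the caller; the name distances_per_speaker is hypothetical.

```python
def distances_per_speaker(registrations, input_data, distance):
    """Return, for each speaker, the smallest distance to the input voice data.

    registrations: dict mapping a speaker label to a list of that speaker's
                   registered voice data (all registered voice sets combined).
    """
    result = {}
    for speaker, voice_items in registrations.items():   # S401, repeated via S406
        best = None
        for item in voice_items:                          # S402, repeated via S404
            d = distance(item, input_data)                # S403
            if best is None or d < best:
                best = d
        result[speaker] = best                            # S405: minimum for this speaker
    return result
```

The resulting dictionary holds one distance per speaker, which is the input needed for the identification of step S305.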

Next, the update processing shown in step S308 of FIG. 10 is described. FIG. 12 is a flowchart showing the procedure of this update processing. When the speaker of the input voice data has been determined in the recognition mode to match the speaker of the registered voice data, the update processing unit 49 adds the input voice data to the update preparation set G0 as registered voice data of that speaker and accumulates it (step S501).

Matching record data indicating the registered voice set to which the registered voice data selected by the speaker identification unit 48a belongs, that is, the registered voice set that contributed to the identification, is then generated and associated with the registered voice data added to the update preparation set G0 (step S502).

The update processing unit 49 determines whether the number of registered voice data accumulated in the update preparation set G0 has reached N (step S503). If it has not (step S503; No), the update processing ends at that point.

If the number of accumulated registered voice data has reached N (step S503; Yes), the voice data selection processing unit 49a performs voice data selection processing to select a predetermined number (N−n) of registered voice data (step S504). This processing is the same as the voice data selection processing shown in FIG. 9.

The update processing unit 49 takes the (N−n) registered voice data selected in step S504 as the update candidate set (step S505) and determines whether the number of registered voice sets is m (step S506).

If the number of registered voice sets is m (step S506; Yes), the voice set selection processing unit 49b selects a voice set. Specifically, the voice set selection processing unit 49b refers to the matching record data d1 to d(N−n) and determines whether there is a registered voice set with no record of contributing to identification (step S509). If such a registered voice set exists (step S509; Yes), the voice set selection processing unit 49b deletes it (step S510).

After step S510, or when the number of registered voice sets has not reached m (step S506; No), the update processing unit 49 adds the update candidate set as a registered voice set (step S507).

After step S507, or when there is no registered voice set lacking a record of contributing to identification (step S509; No), the update preparation set G0 is reset (step S508) and the update processing ends. Resetting the update preparation set G0 also resets the update candidate set and the matching record data.

Next, a specific example of inappropriate registered voice data is described. FIG. 13 is an explanatory diagram of this example. FIG. 13 shows, for four utterances (N = 4), the average distance from each registered voice data to the other registered voice data.

Data example H0 in FIG. 13 shows the case in which all of the registered voice data v1 to v4 are appropriate and none has noise superimposed on it. Data examples H1, H2, H3, and H4 show the cases in which noise is superimposed on the registered voice data v1, v2, v3, and v4, respectively.

As data examples H0 to H4 show, registered voice data with superimposed noise has a larger distance than the others. Inappropriate registered voice data can therefore be identified from the magnitude of its distance and removed in advance, which improves accuracy at identification time.
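The distance comparison described for FIG. 13 can be illustrated with a minimal Python sketch. The Euclidean distance, the two-dimensional feature vectors, and the function names below are assumptions made for illustration only; the feature parameters and distance measure actually used by the distance calculation unit 47 are not restated in this passage.

```python
import math


def mean_distance_to_others(features):
    """For each feature vector, return its mean distance to all other vectors.

    Euclidean distance is an illustrative assumption, not the patent's measure.
    """
    means = []
    for i, a in enumerate(features):
        dists = [math.dist(a, b) for j, b in enumerate(features) if j != i]
        means.append(sum(dists) / len(dists))
    return means


def most_distant_index(features):
    """Return the index of the item whose mean distance to the others is largest."""
    means = mean_distance_to_others(features)
    return max(range(len(means)), key=means.__getitem__)


# Hypothetical feature vectors for v1..v4 (N = 4); the third stands in for
# an utterance with superimposed noise.
registered = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (0.0, 1.0)]
means = mean_distance_to_others(registered)
outlier = most_distant_index(registered)
```

In this toy data the third item has by far the largest mean distance, which mirrors how the noisy item stands out in data examples H1 to H4.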

Next, the effect of updating the speaker registration data is described. FIG. 14 is an explanatory diagram of this effect. Specifically, it shows a case in which one adult male, on one day a week for 52 weeks (one year), uttered the same phrase seven times in each of three sessions: morning, noon, and evening.

First, a registered voice set is generated using five of the first seven utterances as registered voice data. When only this registered voice set was used, the verification rate over the year was 79.7%. Next, when up to four registered voice sets were added using recognition results from the recognition mode and no replacement was performed afterward, the verification rate over the year was 97.6%. Finally, when replacement based on verification records was performed after the four registered voice sets had been generated, the verification rate over the year improved to 99.1%.

As described above, in this embodiment the speaker recognition unit 31 selects a predetermined number of registered voice data from a plurality of registered voice data to construct a registered voice set, stores the registered voice set in association with the speaker, and identifies the speaker based on the similarity between the input voice data and the registered voice data in the registered voice set. Inappropriate data can thus be removed from the registered voice data, which improves the accuracy of speaker recognition.
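The claims describe the selection as excluding registered voice data so that the variance of the distances among the remaining data is minimized. The following Python sketch shows one way such a selection could work. Removing one item at a time until the predetermined number remains, the Euclidean distance, the population variance, and all names are assumptions for illustration, not the implementation of the voice data selection processing unit 45a.

```python
import itertools
import math
import statistics


def pairwise_distances(features):
    """Distances between every pair of feature vectors (Euclidean, by assumption)."""
    return [math.dist(a, b) for a, b in itertools.combinations(features, 2)]


def select_registered(features, keep):
    """Drop items one at a time until `keep` remain.

    At each step, the item removed is the one whose exclusion leaves the
    remaining pairwise distances with the smallest variance.
    """
    if keep < 2:
        raise ValueError("keep must be at least 2 so that a distance exists")
    selected = list(features)
    while len(selected) > keep:
        drop = min(
            range(len(selected)),
            key=lambda i: statistics.pvariance(
                pairwise_distances(selected[:i] + selected[i + 1:])
            ),
        )
        del selected[drop]
    return selected


# Hypothetical data: three consistent utterances and one noisy one.
candidates = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
chosen = select_registered(candidates, keep=3)
```

Here the noisy fourth item is the one excluded, because leaving it out gives the tightest spread of distances among the remaining three.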

In addition, adding or updating registered voice sets based on the recognition results obtained in the recognition mode suppresses any decline in recognition accuracy even when the characteristics of the voice change, for example over time.

Furthermore, recognition accuracy can be kept high by accumulating verification record data indicating which registered voice sets contributed to identification, deleting a registered voice set that the verification record data shows to have contributed little to identification, and adding a new registered voice set.
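The update flow of steps S506 to S510 and the use of verification record data can be sketched as follows. This Python sketch rests on several assumptions: the class and method names are hypothetical, contribution is tracked as a simple hit count per set, and when several sets have no recorded contribution the oldest is deleted, a choice this passage does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class SpeakerRegistry:
    max_sets: int                              # upper limit m on registered voice sets
    sets: dict = field(default_factory=dict)   # set id -> registered voice data
    hits: dict = field(default_factory=dict)   # set id -> contributions to identification
    next_id: int = 0

    def record_hit(self, set_id):
        """Accumulate verification record data for the set that contributed."""
        self.hits[set_id] += 1

    def update(self, candidate):
        """Try to add `candidate` as a new registered voice set; return True if added."""
        added = False
        if len(self.sets) < self.max_sets:
            added = True                       # S506; No -> S507
        else:
            unused = [sid for sid, n in self.hits.items() if n == 0]
            if unused:                         # S509; Yes -> delete (S510), then S507
                victim = unused[0]             # oldest unused set (assumption)
                del self.sets[victim]
                del self.hits[victim]
                added = True
        if added:
            self.sets[self.next_id] = list(candidate)
            self.hits[self.next_id] = 0
            self.next_id += 1
        for sid in self.hits:                  # S508: reset the verification records
            self.hits[sid] = 0
        return added


reg = SpeakerRegistry(max_sets=2)
first = reg.update(["a1", "a2"])
second = reg.update(["b1", "b2"])
reg.record_hit(0)
reg.record_hit(1)
third = reg.update(["c1", "c2"])    # every set contributed, so nothing is replaced
reg.record_hit(1)
fourth = reg.update(["d1", "d2"])   # set 0 has no contribution and is replaced
```

Resetting the hit counts at the end of every update follows the statement that the verification record data is reset together with the update preparation set G0.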

A registered voice set may also be deleted if it has not been used for verification even once during a predetermined period. Alternatively, distances may be calculated between registered voice sets, and whether to delete a set may be decided according to the magnitude of the distance.

The above embodiment described switching the operation mode of home security by voice operation. However, speaker recognition according to the present invention is not limited to switching operation modes and, through text discrimination, is applicable to a wide variety of operations.

The above embodiment also showed a configuration that switches the security operation mode on the condition that speaker verification succeeds. Alternatively, the voice of a specific speaker may be registered in a blacklist, and operations by a speaker registered in the blacklist may be rejected.
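The combination of identification, verification against a threshold, and blacklist rejection can be sketched in Python as follows. Following the claims, a small distance is treated as a high similarity, so the verification threshold appears here as a maximum distance. The Euclidean distance, the speaker names, the threshold value, and the function names are all hypothetical.

```python
import math


def identify(input_features, registry):
    """Return (speaker, distance) for the registered item closest to the input.

    `registry` maps a speaker name to a list of registered feature vectors.
    """
    return min(
        (
            (name, math.dist(input_features, feats))
            for name, items in registry.items()
            for feats in items
        ),
        key=lambda pair: pair[1],
    )


def decide(input_features, registry, blacklist, max_distance):
    """Accept only a verified speaker who is not on the blacklist."""
    speaker, distance = identify(input_features, registry)
    if distance >= max_distance:
        return ("reject", None)        # verification failed: no sufficiently close match
    if speaker in blacklist:
        return ("reject", speaker)     # verified, but registered as blacklisted
    return ("accept", speaker)


registry = {
    "alice": [(0.0, 0.0), (0.2, 0.1)],
    "mallory": [(5.0, 5.0)],
}
blacklist = {"mallory"}

resident = decide((0.1, 0.0), registry, blacklist, max_distance=1.0)
blocked = decide((5.1, 5.0), registry, blacklist, max_distance=1.0)
unknown = decide((20.0, 20.0), registry, blacklist, max_distance=1.0)
```

Keeping the blacklist check after verification means an unknown voice and a blacklisted voice are both rejected, but only the latter is attributed to a known speaker.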

The present invention is not limited to home security and is applicable to speaker recognition in any device, such as speaker recognition on a mobile phone terminal. It is also applicable to countermeasures against bank transfer fraud through speaker recognition over a telephone line, identity confirmation by voice through an intercom, and the like.

Each illustrated component is a functional schematic and need not be physically configured as illustrated. That is, the form of distribution and integration of each device is not limited to what is illustrated; all or part of it may be functionally or physically distributed or integrated in arbitrary units according to various loads and usage conditions. Furthermore, if the functional units of the speaker recognition unit 31 are implemented in software and executed by a computer, a speaker recognition program that causes the computer to operate as a speaker recognition device is obtained.

As described above, the speaker recognition device, speaker recognition method, and speaker recognition program are suited to holding the voice data of the verification target person in an appropriate state and improving the accuracy of speaker recognition.

Description of Symbols
11 Door monitoring device
12 Window monitoring device
13 Fire detection device
20 Microphone
30 Speaker recognition device
31 Speaker recognition unit
32 Text discrimination unit
33 Monitoring control unit
34 Monitoring unit
41 AD conversion unit
42 Voice section extraction unit
43 Feature parameter calculation unit
44 Switching unit
45 Registration processing unit
45a, 49a Voice data selection processing unit
46 Storage unit
47 Distance calculation unit
48 Recognition processing unit
48a Speaker identification unit
48b Speaker verification unit
49 Update processing unit
49b Voice set selection processing unit
60 Monitoring device

Claims

A speaker recognition device for recognizing the speaker of input voice data based on the input voice data, comprising:
registered voice data receiving means for receiving a plurality of registered voice data for a speaker to be identified;
registered voice data selecting means for selecting a predetermined number of registered voice data from the plurality of registered voice data received by the registered voice data receiving means;
storage means for storing the predetermined number of registered voice data selected by the registered voice data selecting means in association with the speaker;
similarity calculation means for calculating the similarity between the input voice data and the registered voice data stored in the storage means; and
speaker identification means for identifying the speaker based on the similarity calculated by the similarity calculation means,
wherein the registered voice data selecting means calculates, for each of the plurality of registered voice data received by the registered voice data receiving means, the variance of the distances of the plurality of registered voice data when that registered voice data is excluded from the selection targets, and selects the predetermined number of registered voice data from the plurality of registered voice data received by the registered voice data receiving means by excluding registered voice data so as to minimize the variance.

2. The speaker recognition device according to claim 1, wherein the storage means treats the predetermined number of registered voice data of the same speaker as a registered voice set, and stores a plurality of registered voice sets in association with the same speaker.

A speaker recognition device for recognizing the speaker of input voice data based on the input voice data, comprising:
registered voice data receiving means for receiving a plurality of registered voice data for a speaker to be identified;
registered voice data selecting means for selecting a predetermined number of registered voice data from the plurality of registered voice data received by the registered voice data receiving means;
storage means for storing the predetermined number of registered voice data selected by the registered voice data selecting means in association with the speaker;
similarity calculation means for calculating the similarity between the input voice data and the registered voice data stored in the storage means; and
speaker identification means for identifying the speaker based on the similarity calculated by the similarity calculation means,
wherein the storage means treats the predetermined number of registered voice data of the same speaker as a registered voice set, and stores a plurality of registered voice sets in association with the same speaker.

The speaker recognition device according to claim 1, further comprising update processing means that, when identification by the speaker identification means has been performed, accumulates record data indicating the registered voice set that contributed to the identification, stores the input voice data as registered voice data of an update preparation set, deletes a registered voice set that the record data shows to have contributed little to identification, and adds the update preparation set as a new registered voice set.

The speaker recognition device according to claim 1, wherein the speaker identification means takes the speaker corresponding to the highest of the plurality of similarities calculated by the similarity calculation means as the speaker candidate for the input voice data,
the device further comprising speaker verification means for determining that the speaker candidate and the speaker of the input voice data are the same person when the highest of the plurality of similarities calculated by the similarity calculation means exceeds a predetermined verification threshold.

The speaker recognition device according to claim 5, further comprising:
monitoring means for performing a monitoring operation on a monitoring target;
word determination means for determining a word contained in the input voice data; and
control means for controlling the monitoring operation of the monitoring means based on the word determined by the word determination means when the speaker verification means obtains a verification result indicating that the speaker of the input voice data is the speaker to be identified.

The speaker recognition device according to any one of claims 1 to 6, wherein the similarity calculation means calculates the smallness of the distance between the registered voice data and the input voice data as the height of the similarity.

A speaker recognition method for recognizing the speaker of input voice data based on the input voice data, comprising:
a registered voice data receiving step of receiving a plurality of registered voice data for a speaker to be identified;
a registered voice data selecting step of selecting a predetermined number of registered voice data from the plurality of registered voice data received in the registered voice data receiving step;
a storing step of storing the predetermined number of registered voice data selected in the registered voice data selecting step in a storage unit in association with the speaker;
a similarity calculation step of calculating the similarity between the input voice data and the registered voice data stored in the storage unit; and
a speaker identification step of identifying the speaker based on the similarity calculated in the similarity calculation step,
wherein the registered voice data selecting step calculates, for each of the plurality of registered voice data received in the registered voice data receiving step, the variance of the distances of the plurality of registered voice data when that registered voice data is excluded from the selection targets, and selects the predetermined number of registered voice data from the plurality of registered voice data received in the registered voice data receiving step by excluding registered voice data so as to minimize the variance.

A speaker recognition method for recognizing the speaker of input voice data based on the input voice data, comprising:
a registered voice data receiving step of receiving a plurality of registered voice data for a speaker to be identified;
a registered voice data selecting step of selecting a predetermined number of registered voice data from the plurality of registered voice data received in the registered voice data receiving step;
a storing step of storing the predetermined number of registered voice data selected in the registered voice data selecting step in a storage unit in association with the speaker;
a similarity calculation step of calculating the similarity between the input voice data and the registered voice data stored in the storage unit; and
a speaker identification step of identifying the speaker based on the similarity calculated in the similarity calculation step,
wherein the storing step treats the predetermined number of registered voice data of the same speaker as a registered voice set, and stores a plurality of registered voice sets in association with the same speaker.

A speaker recognition program for recognizing the speaker of input voice data based on the input voice data, the program causing a computer to execute:
a registered voice data receiving procedure of receiving a plurality of registered voice data for a speaker to be identified;
a registered voice data selecting procedure of selecting a predetermined number of registered voice data from the plurality of registered voice data received in the registered voice data receiving procedure;
a storing procedure of storing the predetermined number of registered voice data selected in the registered voice data selecting procedure in a storage unit in association with the speaker;
a similarity calculation procedure of calculating the similarity between the input voice data and the registered voice data stored in the storage unit; and
a speaker identification procedure of identifying the speaker based on the similarity calculated in the similarity calculation procedure,
wherein the registered voice data selecting procedure calculates, for each of the plurality of registered voice data received in the registered voice data receiving procedure, the variance of the distances of the plurality of registered voice data when that registered voice data is excluded from the selection targets, and selects the predetermined number of registered voice data from the plurality of registered voice data received in the registered voice data receiving procedure by excluding registered voice data so as to minimize the variance.

A speaker recognition program for recognizing the speaker of input voice data based on the input voice data, the program causing a computer to execute:
a registered voice data receiving procedure of receiving a plurality of registered voice data for a speaker to be identified;
a registered voice data selecting procedure of selecting a predetermined number of registered voice data from the plurality of registered voice data received in the registered voice data receiving procedure;
a storing procedure of storing the predetermined number of registered voice data selected in the registered voice data selecting procedure in a storage unit in association with the speaker;
a similarity calculation procedure of calculating the similarity between the input voice data and the registered voice data stored in the storage unit; and
a speaker identification procedure of identifying the speaker based on the similarity calculated in the similarity calculation procedure,
wherein the storing procedure treats the predetermined number of registered voice data of the same speaker as a registered voice set, and stores a plurality of registered voice sets in association with the same speaker.