JP6087542B2

JP6087542B2 - Speaker recognition device, speaker recognition method, and speaker recognition program

Info

Publication number: JP6087542B2
Application number: JP2012192394A
Authority: JP
Inventors: 康貴田中; 学川▲崎▼; 益巳谷本
Original assignee: Sohgo Security Services Co Ltd
Current assignee: Sohgo Security Services Co Ltd
Priority date: 2012-08-31
Filing date: 2012-08-31
Publication date: 2017-03-01
Anticipated expiration: 2032-08-31
Also published as: JP2014048534A

Description

The present invention relates to a speaker recognition device, a speaker recognition method, and a speaker recognition program that recognize the speaker of voice data based on that voice data.

Techniques for recognizing the speaker of voice data based on that voice data are known. For example, Patent Document 1 discloses a speaker recognition system that generates and stores registered model data in advance from the voice data of a person to be verified, and then determines whether the speaker of input voice data is that person by matching voice analysis data, obtained by analyzing the input voice data, against the registered model data.

Patent Document 1: Japanese Patent Laying-Open No. 2005-091758

However, the conventional technique described above must learn from speech uttered over a long period in order to build the registered model data, and it requires complicated computation both when building the model and when recognizing a speaker with it. High-speed processing at low cost is therefore difficult.

Moreover, the conventional technique builds a separate model for each person to be verified. Doing so requires collecting each person's voice separately, which makes registration cumbersome.

In addition, the conventional technique divides a series of utterances into N frames and uses the average of the feature parameters calculated for each frame as its index. Speaker recognition can therefore be performed only after the series of utterances has finished, so recognition takes time.

Improving the convenience of speaker recognition by achieving inexpensive high-speed processing, simpler registration, and a shorter time to recognition has therefore been an important problem. Consider, for example, operating the guard function of a home security system by the user's voice. Inexpensive, high-speed processing is needed to keep down the cost of introducing the system. To confirm operating authority, it is enough to determine that the voice belongs to one of several residents, so it should be possible to register the voices of several residents together in a simple way. It is also desirable to identify the speaker at an earlier point so that the guard operation can be controlled promptly.

The present invention has been made to solve the above problems of the prior art, and its object is to provide a speaker recognition device, a speaker recognition method, and a speaker recognition program that improve the convenience of registration and recognition.

To solve the above problems and achieve the object, the invention according to claim 1 is a speaker recognition device that recognizes the speaker of voice data based on that voice data, comprising: similarity calculating means for calculating the similarity between the spectral envelope of partial registered voice data cut out from registered voice data containing the voice of a registration target person and the spectral envelope of partial input voice data cut out from input voice data to be recognized; and recognition means for recognizing the speaker of the input voice data based on the similarity calculated by the similarity calculating means. The registered voice data contains the voices of a plurality of registration target persons, and a plurality of pieces of partial registered voice data are cut out so as to include the voice of each registration target person. The similarity calculating means calculates, for the spectral envelope of each of the plurality of pieces of partial registered voice data, its similarity to the spectral envelope of the partial input voice data. The recognition means outputs a recognition result indicating either that the speaker of the input voice data matches one of the plurality of registration target persons or that the speaker matches none of them.

The invention according to claim 2 is a speaker recognition device that recognizes the speaker of voice data based on that voice data, comprising: similarity calculating means for calculating the similarity between the spectral envelope of partial registered voice data cut out from registered voice data containing the voice of a registration target person and the spectral envelope of partial input voice data cut out from input voice data to be recognized; and recognition means for recognizing the speaker of the input voice data based on the similarity calculated by the similarity calculating means. The registered voice data contains the voices of a plurality of registration target persons. For a plurality of pieces of partial registered voice data cut out from the same registered voice data, the similarity calculating means calculates the distance between the feature parameter indicating the spectral envelope of each piece of partial registered voice data and the feature parameter indicating the spectral envelope of the partial input voice data, treating a smaller distance as a higher similarity of the partial input voice data to that piece, and takes the minimum of these distances as the distance of the partial input voice data to the registered voice data. The recognition means uses the distance of the partial input voice data to the registered voice data as the similarity and outputs a recognition result indicating either that the speaker of the input voice data matches one of the plurality of registration target persons or that the speaker matches none of them.

The invention according to claim 3 is the invention according to claim 1 or 2, wherein the similarity calculating means calculates the similarity to the input voice data for each of a plurality of pieces of registered voice data, and the recognition means estimates that the speaker of the input voice data matches one of the plurality of registration target persons contained in the registered voice data that, among the plurality of pieces of registered voice data, has the highest similarity to the input voice data.

The invention according to claim 4 is the invention according to any one of claims 1 to 3, wherein the recognition means determines that the speaker of the input voice data matches one of the plurality of registration target persons contained in the registered voice data when the similarity of the registered voice data to the input voice data is equal to or greater than a similarity threshold.

The invention according to claim 5 is the invention according to any one of claims 1 to 4, further comprising classification means for classifying the spectral envelopes of a plurality of pieces of partial registered voice data, obtained from registered voice data containing the voices of the plurality of registration target persons, based on the similarity of the feature parameters indicating those spectral envelopes, and for calculating a representative value of the feature parameters for each class. The similarity calculating means calculates the distance between the feature parameter indicating the spectral envelope of the partial input voice data and the representative value of each class, and uses the pieces of partial registered voice data belonging to the class whose representative value is at the minimum distance to calculate the similarity.

The invention according to claim 6 is the invention according to any one of claims 1 to 5, further comprising: monitoring means for performing a monitoring operation on a monitoring target; word discrimination means for discriminating a word contained in the input voice data; and control means for controlling the monitoring operation of the monitoring means based on the word discriminated by the word discrimination means when the recognition result from the recognition means satisfies a predetermined condition.

The invention according to claim 7 is a speaker recognition method for recognizing the speaker of voice data based on that voice data, comprising: a registered voice data processing step of cutting out partial registered voice data from registered voice data containing the voices of a plurality of registration target persons so as to include the voice of each registration target person, and obtaining the spectral envelope of the cut-out partial registered voice data; an input voice data processing step of cutting out partial input voice data from input voice data to be recognized and obtaining the spectral envelope of the cut-out partial input voice data; a similarity calculation step of calculating, for the spectral envelope of each of the plurality of pieces of partial registered voice data, its similarity to the spectral envelope of the partial input voice data; and a recognition step of outputting, based on the similarity calculated in the similarity calculation step, a recognition result indicating either that the speaker of the input voice data matches one of the plurality of registration target persons or that the speaker matches none of them.
The invention according to claim 8 is a speaker recognition method for recognizing the speaker of voice data based on that voice data, comprising: a registered voice data processing step of cutting out partial registered voice data from registered voice data containing the voices of a plurality of registration target persons and obtaining the spectral envelope of the cut-out partial registered voice data; an input voice data processing step of cutting out partial input voice data from input voice data to be recognized and obtaining the spectral envelope of the cut-out partial input voice data; a similarity calculation step of calculating, for a plurality of pieces of partial registered voice data cut out from the same registered voice data, the distance between the feature parameter indicating the spectral envelope of each piece of partial registered voice data and the feature parameter indicating the spectral envelope of the partial input voice data, treating a smaller distance as a higher similarity of the partial input voice data to that piece, taking the minimum of these distances as the distance of the partial input voice data to the registered voice data, and outputting that distance as the similarity; and a recognition step of outputting, based on the similarity calculated in the similarity calculation step, a recognition result indicating either that the speaker of the input voice data matches one of the plurality of registration target persons or that the speaker matches none of them.

The invention according to claim 9 is a speaker recognition program for recognizing the speaker of voice data based on that voice data, the program causing a computer to execute: a registered voice data processing procedure of cutting out partial registered voice data from registered voice data containing the voices of a plurality of registration target persons so as to include the voice of each registration target person, and obtaining the spectral envelope of the cut-out partial registered voice data; an input voice data processing procedure of cutting out partial input voice data from input voice data to be recognized and obtaining the spectral envelope of the cut-out partial input voice data; a similarity calculation procedure of calculating, for the spectral envelope of each of the plurality of pieces of partial registered voice data, its similarity to the spectral envelope of the partial input voice data; and a recognition procedure of outputting, based on the similarity calculated in the similarity calculation procedure, a recognition result indicating either that the speaker of the input voice data matches one of the plurality of registration target persons or that the speaker matches none of them.
The invention according to claim 10 is a speaker recognition program for recognizing the speaker of voice data based on that voice data, the program causing a computer to execute: a registered voice data processing procedure of cutting out partial registered voice data from registered voice data containing the voices of a plurality of registration target persons and obtaining the spectral envelope of the cut-out partial registered voice data; an input voice data processing procedure of cutting out partial input voice data from input voice data to be recognized and obtaining the spectral envelope of the cut-out partial input voice data; a similarity calculation procedure of calculating, for a plurality of pieces of partial registered voice data cut out from the same registered voice data, the distance between the feature parameter indicating the spectral envelope of each piece of partial registered voice data and the feature parameter indicating the spectral envelope of the partial input voice data, treating a smaller distance as a higher similarity of the partial input voice data to that piece, taking the minimum of these distances as the distance of the partial input voice data to the registered voice data, and outputting that distance as the similarity; and a recognition procedure of outputting, based on the similarity calculated in the similarity calculation procedure, a recognition result indicating either that the speaker of the input voice data matches one of the plurality of registration target persons or that the speaker matches none of them.

According to the present invention, the similarity between the spectral envelope of partial registered voice data cut out from registered voice data containing the voice of a registration target person and the spectral envelope of partial input voice data cut out from input voice data to be recognized is calculated, and the speaker of the input voice data is recognized based on that similarity. The convenience of registration and recognition can therefore be improved.

FIG. 1 is a system configuration diagram of the home security system according to the first embodiment.
FIG. 2 is an internal configuration diagram of the speaker recognition unit shown in FIG. 1.
FIG. 3 is an explanatory diagram of the concept of distance calculation.
FIG. 4 is an explanatory diagram of the matching threshold used by the speaker verification unit.
FIG. 5 is a flowchart showing the processing procedure of the speaker recognition unit in the registration mode.
FIG. 6 is a flowchart showing the processing procedure of the speaker recognition unit in the recognition mode.
FIG. 7 is an explanatory diagram of the experimental results of speaker recognition according to the first embodiment.
FIG. 8 is an internal configuration diagram of the speaker recognition unit according to the second embodiment.
FIG. 9 is an explanatory diagram of the minimum distance search using clusters.

Preferred embodiments of the speaker recognition device, the speaker recognition method, and the speaker recognition program according to the present invention are described in detail below with reference to the accompanying drawings. The first and second embodiments below describe the case where the speaker recognition device, method, and program according to the present invention are applied to a residential home security system.

FIG. 1 is a system configuration diagram of the home security system according to the first embodiment. In the home security system shown in FIG. 1, a door monitoring device 11, a window monitoring device 12, a fire detection device 13, and a speaker recognition device 30 are connected to a monitoring device 60, and a microphone 20 is connected to the speaker recognition device 30.

The door monitoring device 11 monitors for attempts at unauthorized entry through the door of the house. When it detects an intrusion attempt such as lock picking, the door monitoring device 11 notifies the monitoring device 60.

The window monitoring device 12 monitors for attempts at unauthorized entry through the windows of the house. When it detects an impact or the like on a window, the window monitoring device 12 notifies the monitoring device 60.

The fire detection device 13 is installed in a living room or similar space of the house and detects the occurrence of a fire. When it detects a fire, the fire detection device 13 notifies the monitoring device 60.

The microphone 20 is installed at an entrance such as the front door, acquires an acoustic signal, and outputs it to the speaker recognition device 30. The microphone 20 operates continuously, acquiring and outputting the acoustic signal. Acquisition of the acoustic signal may instead be switched on and off using a human-presence sensor or the like. The speaker recognition device 30 can be installed at any location, and the microphone 20 may be housed inside the casing of the speaker recognition device 30.

The speaker recognition device 30 performs speaker recognition using the acoustic signal acquired by the microphone 20 and outputs the result to the monitoring device 60, which manages the operation of the home security system. The speaker recognition device 30 has a speaker recognition unit 31 and a text discrimination unit 32, and the monitoring device 60 has a monitoring control unit 33 and a monitoring unit 34. The speaker recognition unit 31 cuts out speech from the acoustic signal acquired by the microphone 20, recognizes whether that speech is a resident's voice, and outputs the recognition result to the monitoring control unit 33 of the monitoring device 60. The text discrimination unit 32 cuts out speech from the acoustic signal acquired by the microphone 20 and outputs the words in that speech as text information to the monitoring control unit 33 of the monitoring device 60.

The monitoring control unit 33 is a processing unit that controls the operation of the monitoring unit 34 based on the text information output from the text discrimination unit 32 when the speaker recognition unit 31 has recognized the speaker as a resident. Specifically, when the text information contains a phrase such as "security on" or "ittekimasu" ("I'm leaving"), it starts the monitoring operation of the monitoring unit 34; when the text information contains a phrase such as "security off" or "tadaima" ("I'm home"), it ends the monitoring operation of the monitoring unit 34.
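The control rule just described can be sketched as follows. This is only an illustration: the function name `decide_command`, the keyword tuples, and the simple substring matching are assumptions, not part of the patent text, which does not specify how the text information is matched.

```python
from typing import Optional

# Example phrases taken from the description; the tuples themselves are hypothetical.
START_WORDS = ("security on", "ittekimasu")
STOP_WORDS = ("security off", "tadaima")


def decide_command(is_resident: bool, text: str) -> Optional[str]:
    """Return "start", "stop", or None when no action should be taken."""
    # Text information is acted on only when the speaker was recognized as a resident.
    if not is_resident:
        return None
    lowered = text.lower()
    if any(word in lowered for word in START_WORDS):
        return "start"
    if any(word in lowered for word in STOP_WORDS):
        return "stop"
    return None
```

For example, `decide_command(True, "Security On")` yields `"start"`, while the same phrase from an unrecognized speaker yields `None`.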

The monitoring unit 34 is a processing unit that monitors the dwelling using the outputs of the door monitoring device 11, the window monitoring device 12, and the fire detection device 13. Specifically, the monitoring unit 34 starts the monitoring operation when it receives a start instruction from the monitoring control unit 33. If, during the monitoring operation, it is notified of an abnormality by the door monitoring device 11, the window monitoring device 12, or the fire detection device 13, it performs an alarm operation and notifies the center that an abnormality has occurred. The monitoring operation ends when an end instruction is received from the monitoring control unit 33.

In this way, the home security system according to the first embodiment recognizes the residents' voices, which allows the monitoring operation to be switched on and off by voice.

Next, the internal configuration of the speaker recognition unit 31 shown in FIG. 1 is described. FIG. 2 is an internal configuration diagram of the speaker recognition unit 31 shown in FIG. 1. As shown in FIG. 2, the speaker recognition unit 31 has an AD conversion unit 41, a voice section extraction unit 42, a feature parameter calculation unit 43, a switching unit 44, a storage unit 45, a minimum distance search unit 46, and a recognition processing unit 47.

The AD conversion unit 41 is a processing unit that converts the acoustic signal acquired by the microphone 20 from an analog signal to a digital signal and outputs it to the voice section extraction unit 42.

The voice section extraction unit 42 is a processing unit that extracts voice sections from the acoustic signal converted into a digital signal by the AD conversion unit 41. Voice sections can be extracted based on, for example, the signal power of the acoustic signal and the number of zero crossings.
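As a rough illustration of extraction based on signal power and zero crossings, the sketch below marks frames as voiced when their mean power is high and their zero-crossing rate is low. The frame length, hop size, both thresholds, and the rule for combining the two measures are assumptions chosen for the example; the patent does not specify them.

```python
import numpy as np


def frame_signal(x: np.ndarray, frame_len: int, hop: int) -> np.ndarray:
    """Slice a 1-D signal into overlapping frames of shape (n_frames, frame_len)."""
    n = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    return x[idx]


def detect_voice_frames(x, frame_len=400, hop=160, power_thresh=1e-3, zc_max=0.5):
    """Return a boolean mask, one entry per frame, that is True for voiced frames."""
    frames = frame_signal(np.asarray(x, dtype=float), frame_len, hop)
    # Mean signal power of each frame.
    power = np.mean(frames ** 2, axis=1)
    # Fraction of adjacent sample pairs whose sign differs (zero-crossing rate).
    signs = np.signbit(frames)
    zc_rate = np.mean(signs[:, 1:] != signs[:, :-1], axis=1)
    # Hypothetical decision rule: enough power and not noise-like.
    return (power > power_thresh) & (zc_rate < zc_max)
```

Consecutive True entries in the returned mask would then be merged into voice sections before feature calculation.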

The feature parameter calculation unit 43 is a processing unit that cuts out a plurality of partial voice signals from the voice signal output by the voice section extraction unit 42 and calculates feature parameters indicating the characteristics of the spectral envelope of each partial voice signal. Any method can be used to calculate the feature parameters, such as LPC (Linear Predictive Coding) cepstrum coefficients or MFCC (Mel-Frequency Cepstrum Coefficients).
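To make the idea of one feature vector per partial voice signal concrete, the sketch below computes low-order coefficients of the FFT-based real cepstrum for each frame. This is a simpler stand-in for the LPC cepstrum or MFCC named in the text, not either of those methods; the function name, frame length, hop size, window, and order are all assumptions.

```python
import numpy as np


def cepstral_features(x, frame_len=400, hop=160, order=12):
    """Return an array of shape (n_frames, order) of low-quefrency cepstral coefficients."""
    x = np.asarray(x, dtype=float)
    n = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    # Cut out overlapping frames and apply a Hamming window to each.
    frames = x[idx] * np.hamming(frame_len)
    # Log magnitude spectrum; the small constant avoids log(0).
    log_spec = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-10)
    # Real cepstrum: inverse transform of the log magnitude spectrum.
    cepstrum = np.fft.irfft(log_spec, axis=1)
    # Low-order coefficients describe the smooth spectral envelope; index 0 (overall level) is dropped.
    return cepstrum[:, 1:order + 1]
```

Each row of the result plays the role of an order-K feature parameter for one analysis frame, with K equal to `order`.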

The switching unit 44 is a processing unit that switches the operation mode of the speaker recognition unit 31. The speaker recognition unit 31 has two operation modes, a registration mode and a recognition mode. When the switching unit 44 has set the registration mode, the feature parameters calculated by the feature parameter calculation unit 43 are stored in the storage unit 45 as registration data. When the switching unit 44 has set the recognition mode, the feature parameters calculated by the feature parameter calculation unit 43 are output to the minimum distance search unit 46 as input data.

The storage unit 45 is a storage device such as a hard disk device or a nonvolatile memory, and stores the registration data. Registration data is generated each time registration processing is performed and is stored as separate data. In FIG. 2, the storage unit 45 stores registration data R¹ and registration data R². The feature parameters in a piece of registration data may come from a single speaker only or from a plurality of speakers.

The minimum distance search unit 46 is a processing unit that calculates the distance between the input data and each piece of registration data, treating a smaller distance as a higher similarity. Specifically, R, the feature parameters of the registration data, is expressed as

(equation not reproduced in this text)

Here, an analysis frame is a cut-out range of fixed frame length used to cut partial registered voice data out of the registered voice data. That is, M pieces of partial registered voice data are cut out from each of the N pieces of registered voice data, and the feature parameters of order K calculated for each piece of partial registered voice data constitute the registration data R.

The feature parameters of the input data are expressed as

(equation not reproduced in this text)

That is, L pieces of partial input voice data are cut out from the input voice data, and the feature parameters of order K calculated for each piece of partial input voice data constitute the input data.
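The defining expressions for the registration data and the input data are not reproduced in this text. One notation consistent with the verbal definitions (N pieces of registration data, M analysis frames each, L input frames, feature order K) would be the following. The element symbols r and x, the name X for the input data, and the index letters n, m, l, and k are assumptions made for illustration, not the patent's own notation.

```latex
R^{n} = \left\{\, r^{n}_{m,k} \;\middle|\; 1 \le m \le M,\ 1 \le k \le K \,\right\}, \qquad n = 1, \dots, N

X = \left\{\, x_{l,k} \;\middle|\; 1 \le l \le L,\ 1 \le k \le K \,\right\}
```

Under this reading, each piece of registration data is an M-by-K array and the input data is an L-by-K array of feature parameters.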

入力データと登録データＲとの距離ｄは、

により算出する。図３は、距離算出の概念を説明するための説明図である。図３に示すように、入力データの各フレームについて、登録データの全フレームに対する特徴パラメータとの距離を総当たりで算出し、入力フレーム毎の最小距離の平均値を、入力データと登録データとの距離とする。 The distance d between the input data and the registration data R is

Calculated by FIG. 3 is an explanatory diagram for explaining the concept of distance calculation. As shown in FIG. 3, for each frame of the input data, the distance from the feature parameter to all the frames of the registration data is calculated brute force, and the average value of the minimum distance for each input frame is calculated between the input data and the registration data. Distance.

入力データに対して最も距離が小さい登録データＩとその距離ｄは、

により求められる。最小距離探索部４６は、入力データに対して最も距離が小さい登録データと、その距離を認識処理部４７に出力する。 The registered data I having the smallest distance to the input data and its distance d are

Is required. The minimum distance search unit 46 outputs the registered data having the smallest distance to the input data and the distance to the recognition processing unit 47.
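The distance calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the Euclidean metric (`math.dist`), the function names, and the dictionary layout of the registration data are all assumptions made for the example, since the text does not fix the frame-level distance measure.

```python
import math


def data_distance(input_frames, registered_frames):
    """Average, over input frames, of the minimum distance to any registered frame.

    The Euclidean metric used here is an assumption; the text only speaks of
    a "distance" between feature parameters.
    """
    minima = [
        min(math.dist(x, r) for r in registered_frames)  # exhaustive search
        for x in input_frames
    ]
    return sum(minima) / len(minima)


def nearest_registration(input_frames, registrations):
    """Return (key, distance) of the registration data closest to the input."""
    distances = {
        key: data_distance(input_frames, frames)
        for key, frames in registrations.items()
    }
    best = min(distances, key=distances.get)
    return best, distances[best]


# Hypothetical 2-dimensional feature parameters (real ones would be order K).
registrations = {
    "R1": [[0.0, 0.0], [1.0, 0.0]],
    "R2": [[5.0, 5.0], [6.0, 5.0]],
}
input_frames = [[0.0, 1.0], [1.0, 1.0]]
best, d = nearest_registration(input_frames, registrations)
```

In this toy data each input frame lies at distance 1 from its nearest frame in `R1`, so `R1` is selected with an average minimum distance of 1.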

The recognition processing unit 47 shown in FIG. 2 includes a speaker identification unit 47a and a speaker verification unit 47b. The speaker identification unit 47a estimates that the speaker of the registration data with the smallest distance to the input data is the same as the speaker of the input voice data.

The speaker verification unit 47b compares the distance of the registration data closest to the input data with a verification threshold and, when the distance is equal to or less than the verification threshold, determines that the speaker of that registration data matches the speaker of the input data. Because a smaller distance corresponds to a higher similarity, a distance at or below the verification threshold means that the similarity is at or above a predetermined similarity threshold.

Next, the verification threshold used by the speaker verification unit 47b will be described. FIG. 4 is an explanatory diagram of this verification threshold. To obtain the verification threshold, distances between registration data items are calculated in advance and, as shown in FIG. 4(a), two distributions are obtained: the intra-speaker distance distribution, which is the distribution of distances when the speakers are the same, and the inter-speaker distance distribution, which is the distribution of distances when the speakers differ.

From the intra-speaker and inter-speaker distance distributions, the error rates of speaker verification are obtained, as shown in FIG. 4(b). Lowering the verification threshold, that is, making the decision stricter, reduces the false acceptance rate (the rate at which another person is wrongly accepted) but increases the false rejection rate (the rate at which the genuine speaker is wrongly rejected). It is therefore preferable to use, as the verification threshold, the value at which the false acceptance rate and the false rejection rate are equal. If necessary, the verification threshold may be adjusted, for example to reduce the false acceptance rate.
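The threshold selection can be illustrated as follows. This is a sketch under stated assumptions: the distance samples are invented, the function names are hypothetical, and the threshold is searched only among the observed distances rather than interpolated between them.

```python
def error_rates(threshold, intra, inter):
    """Return (false_rejection_rate, false_acceptance_rate) at a distance threshold.

    A pair is accepted when its distance is at or below the threshold.
    """
    frr = sum(d > threshold for d in intra) / len(intra)    # same speaker, rejected
    far = sum(d <= threshold for d in inter) / len(inter)   # different speaker, accepted
    return frr, far


def equal_error_threshold(intra, inter):
    """Pick the observed distance at which the two error rates are closest."""
    def gap(t):
        frr, far = error_rates(t, intra, inter)
        return abs(frr - far)

    return min(sorted(set(intra) | set(inter)), key=gap)


# Hypothetical distances between registration data items.
intra = [0.2, 0.3, 0.4, 0.9]   # same speaker
inter = [0.5, 1.0, 1.2, 1.5]   # different speakers
threshold = equal_error_threshold(intra, inter)
frr, far = error_rates(threshold, intra, inter)
```

With these samples the two rates meet at a threshold of 0.5, where both equal 0.25. Shifting the threshold lower from that point trades a lower false acceptance rate for a higher false rejection rate, which is the adjustment the text mentions.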

Next, the processing procedure of the speaker recognition unit 31 will be described. FIG. 5 is a flowchart showing the processing procedure of the speaker recognition unit 31 in the registration mode. This procedure is executed with the registration mode set by the switching unit 44.

First, the microphone 20 acquires an acoustic signal (step S101). The voice segment extraction unit 42 extracts a voice segment from the acoustic signal acquired by the microphone 20 (step S102).

The feature parameter calculation unit 43 cuts a plurality of partial speech signals out of the speech signal in the voice segment and calculates feature parameters representing the characteristics of the spectral envelope of each partial speech signal (step S103). The calculated feature parameters are then accumulated in the storage unit 45 as registration data (step S104), and the registration process ends.

FIG. 6 is a flowchart showing the processing procedure of the speaker recognition unit 31 in the recognition mode. This procedure is executed with the recognition mode set by the switching unit 44.

First, the microphone 20 acquires an acoustic signal (step S201). The voice segment extraction unit 42 extracts a voice segment from the acoustic signal acquired by the microphone 20 (step S202).

The feature parameter calculation unit 43 cuts a plurality of partial speech signals out of the speech signal in the voice segment and calculates feature parameters representing the characteristics of the spectral envelope of each partial speech signal (step S203).

The minimum distance search unit 46 calculates the distance between the input data and each registration data item, and finds the registration data with the smallest distance to the input data together with that distance (step S204). The recognition processing unit 47 estimates that the speaker of the registration data closest to the input data is the same as the speaker of the input voice data and, when the distance is equal to or less than the verification threshold, determines that the speaker of the registration data matches the speaker of the input data (step S205). It then outputs the estimation and determination results to the monitoring control unit 33 (step S206), and the recognition process ends.

Next, experimental results of speaker recognition according to the present embodiment will be described. FIG. 7 is an explanatory diagram of the experimental results of speaker recognition according to the first embodiment. As shown in FIG. 7, speaker identification and speaker verification experiments were conducted using four lengths of registration data (registered speech), from 5 seconds to 20 seconds in 5-second steps, and 15 lengths of input data (input speech), from 0.1 seconds to 1.5 seconds in 0.1-second steps. LPC cepstrum coefficients (analysis frame length 32 ms, analysis frame shift 16 ms, order 32) were used as the speech feature parameters, and the verification threshold for speaker verification was set to the distance at which the false rejection rate equals the false acceptance rate.
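To give a sense of how few frames the shortest inputs contain, the number of analysis frames can be estimated from the frame length and frame shift stated above. The sketch assumes that only complete frames are kept and that no padding is applied, which the text does not specify; `frame_count` is a hypothetical helper.

```python
def frame_count(duration_ms, frame_ms=32, shift_ms=16):
    """Number of complete analysis frames in a signal of the given duration.

    Assumes no padding: a trailing partial frame is discarded.
    """
    if duration_ms < frame_ms:
        return 0
    return (duration_ms - frame_ms) // shift_ms + 1


short_input = frame_count(100)      # 0.1 s input speech
longer_input = frame_count(700)     # 0.7 s input speech
registered = frame_count(5000)      # 5 s registered speech
```

Under these assumptions a 0.1-second input yields 5 frames, a 0.7-second input yields 42 frames, and 5 seconds of registered speech yields 311 frames.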

FIG. 7(a) shows the experimental results for speaker identification. As shown in FIG. 7(a), when the utterance duration of the input speech is 0.1 seconds, the average speaker identification rate is 82% for a registered speech duration of 5 seconds, 86% for 10 seconds, 89% for 15 seconds, and 91% for 20 seconds.

These average speaker identification rates improve as the utterance duration of the input speech increases. When the input speech is 0.7 seconds or longer, the average speaker identification rate is 99% or higher for every registered speech duration tested.

FIG. 7(b) shows the experimental results for speaker verification. As shown in FIG. 7(b), when the utterance duration of the input speech is 0.1 seconds, the average speaker verification rate is 93.5% for a registered speech duration of 5 seconds, 94% for 10 seconds, 95% for 15 seconds, and 95% for 20 seconds.

These average speaker verification rates also improve as the utterance duration of the input speech increases. When the input speech is 0.7 seconds or longer, the average speaker verification rate is 98% or higher for every registered speech duration tested.

Thus, when the utterance duration of the input speech is 0.7 seconds or longer, highly accurate recognition is possible in both speaker identification and speaker verification. Even for short utterances of 0.1 to 0.7 seconds, sufficient recognition accuracy is obtained.

As described above, in the first embodiment the speaker recognition unit 31 stores the feature parameters calculated frame by frame from the registered speech, and uses the smallness of the minimum distance to the feature parameters of the input speech as the measure of similarity. There is therefore no need to build a statistical model or the like in advance, and speaker recognition can be performed with simple calculations.

Moreover, even when the registered speech data contains the voices of a plurality of registered speakers, the distance to the frame closest to the input speech, among the plurality of frames cut out from the registered speech, is adopted as the distance to the registered speech. Between those two frames, the distance obtained is therefore a distance to a single registered speaker. This makes it possible to recognize that the speaker of the input speech is one of the plurality of persons contained in the registered speech.

In the first embodiment, the minimum distance between one frame of the input speech and all frames of the registered speech is obtained, and the minima are averaged as input frames accumulate. Speaker recognition is therefore possible even when the input speech has few frames, that is, when the input speech is short. As the number of input frames increases, speaker recognition becomes more accurate.
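The frame-by-frame averaging described here lends itself to an incremental update, so that a distance estimate is available after the very first input frame. The following is a minimal sketch; the `RunningDistance` class and the Euclidean metric are assumptions made for the example and are not taken from the text.

```python
import math


class RunningDistance:
    """Running average of per-frame minimum distances to one registration data item."""

    def __init__(self, registered_frames):
        self.registered_frames = registered_frames
        self.count = 0
        self.mean = 0.0

    def update(self, frame):
        """Fold one more input frame into the average and return the new average."""
        nearest = min(math.dist(frame, r) for r in self.registered_frames)
        self.count += 1
        self.mean += (nearest - self.mean) / self.count
        return self.mean


# Hypothetical 2-dimensional feature parameters.
tracker = RunningDistance([[0.0, 0.0]])
after_one = tracker.update([3.0, 4.0])   # single frame at distance 5
after_two = tracker.update([0.0, 0.0])   # second frame at distance 0
```

After the first frame the estimate is 5, and after the second it settles at the average of 5 and 0, namely 2.5.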

The first embodiment described the case in which all analysis frames of the registration data are used. If the registration data contains enough analysis frames, however, processing can be sped up by using only some of them. The second embodiment therefore describes speaker recognition that achieves efficient processing by selecting the analysis frames to be used.

FIG. 8 is a diagram showing the internal configuration of the speaker recognition unit 131 according to the second embodiment. The storage unit 45 shown in FIG. 8 stores registration data that may contain feature parameters of the voices of a plurality of speakers as a group registration data group, and stores registration data that contains the voice of a single speaker only as a personal registration data group.

The speaker recognition unit 131 further includes a registration processing unit 52 and a cluster setting unit 53, and its minimum distance search unit 51 operates differently from the minimum distance search unit 46 of the first embodiment. The other configurations and operations are the same as in the first embodiment, so identical components are given the same reference numerals and their description is omitted.

The registration processing unit 52 is a processing unit that, when the speaker verification unit 47b of the recognition processing unit 47 determines that the speaker of the input data matches the speaker of a registration data item, registers the input data as registration data belonging to the personal registration data group.

When the speaker of the input data matches the speaker of a registration data item belonging to the group registration data group, the registration processing unit 52 registers the input data as a new registration data item. A new item is created because registration data in the group registration data group may contain the voices of a plurality of speakers, and it cannot be determined which of those speakers was matched. The input data used in the recognition process, on the other hand, can be presumed to come from a single speaker, so it is registered as registration data belonging to the personal registration data group.

When the speaker of the input data matches the speaker of a registration data item belonging to the personal registration data group, the registration processing unit 52 appends the input data to the matched registration data item. This is possible because registration data in the personal registration data group consists of the voice of a single speaker. In this way, both the number of registration data items in the personal registration data group and the number of analysis frames in those items grow through recognition processing, enabling more accurate recognition.
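The two registration cases can be summarized in a short sketch. The storage layout (a dictionary with `"group"` and `"personal"` entries), the key naming scheme, and the function name are hypothetical and chosen only to make the branching explicit.

```python
def register_matched_input(store, matched_key, input_frames):
    """Update the personal registration data group after a successful verification.

    store is a hypothetical layout: {"group": {key: frames}, "personal": {key: frames}}.
    Returns the key of the personal registration data that received the frames.
    """
    if matched_key in store["personal"]:
        # Single-speaker data: append the new frames to the matched item.
        store["personal"][matched_key].extend(input_frames)
        return matched_key
    if matched_key in store["group"]:
        # Group data may hold several speakers, so create a new personal item.
        new_key = f"personal-{len(store['personal']) + 1}"
        store["personal"][new_key] = list(input_frames)
        return new_key
    raise KeyError(matched_key)


store = {"group": {"family": [[0.0, 0.0], [5.0, 5.0]]}, "personal": {}}
first = register_matched_input(store, "family", [[0.1, 0.0]])
second = register_matched_input(store, first, [[0.0, 0.2]])
```

The first call matches group data and therefore creates a new personal item; the second call matches that personal item and appends to it, leaving the group data untouched.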

The cluster setting unit 53 is a processing unit that clusters the registration data. Specifically, for a registration data item in which a sufficient number of analysis frames have accumulated, it classifies the analysis frames into a plurality of clusters based on the similarity of their feature parameters. The number of clusters can be set freely, for example according to the amount of registration data. For each cluster, it also calculates a representative value of the feature parameters of the analysis frames belonging to that cluster. Any value, such as the mean, can be used as the representative value.

For each analysis frame of a registration data item, the cluster setting unit 53 associates the frame with the cluster to which it belongs, and it associates the representative value of each cluster with the registration data item.
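One way to realize this clustering is k-means with the cluster mean as the representative value. The text does not name a clustering algorithm, so k-means, the deterministic initialization from the first k frames, and the fixed iteration count are all assumptions of this sketch.

```python
import math


def assign(frames, centroids):
    """Index of the nearest centroid for every frame."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(f, centroids[c]))
        for f in frames
    ]


def kmeans(frames, k, iterations=10):
    """Cluster frames and return (labels, representative values)."""
    centroids = [list(f) for f in frames[:k]]  # simple deterministic start
    for _ in range(iterations):
        labels = assign(frames, centroids)
        for c in range(k):
            members = [f for f, lab in zip(frames, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign(frames, centroids), centroids


# Hypothetical 2-dimensional feature parameters of four analysis frames.
frames = [[0, 0], [10, 10], [0, 1], [10, 11]]
labels, representatives = kmeans(frames, k=2)
```

Here `labels` plays the role of the frame-to-cluster association and `representatives` the per-cluster representative values attached to the registration data item.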

The processing by the cluster setting unit 53 can be performed at any time. For example, when a registration data item has been changed, it is preferable to run the processing on the changed item.

The minimum distance search unit 51 calculates the distance between the input data and each registration data item as the similarity. All registration data items are used, regardless of whether they belong to the group registration data group or the personal registration data group.

When calculating the distance between a registration data item and the input data, the minimum distance search unit 51 first calculates the distance between a frame of the input data and the representative value of each cluster. It then calculates, exhaustively, the distance to each analysis frame belonging to the cluster with the smallest distance, and takes the smallest of these as the minimum distance to the registration data item.

Restricting the clusters used when calculating the distance to the registration data in this way speeds up the minimum distance search. The clusters may be restricted in any manner: for example, only the cluster with the smallest distance may be used, or the cluster with the largest distance may be excluded.

FIG. 9 is an explanatory diagram of the minimum distance search using clusters. In practice the feature parameters of an analysis frame have an order such as 32, but FIG. 9 uses two dimensions to keep the explanation simple.

In FIG. 9, the values of the feature parameters (X, Y) of the analysis frames of the registration data are plotted on the XY plane. The XY plane is divided into three clusters, A1 to A3, and a representative value of the analysis frames belonging to each of the clusters A1 to A3 is obtained.

To obtain the distance to the input data, the distances between the feature parameters of an analysis frame of the input data and the representative values of the clusters A1 to A3 are calculated. In FIG. 9, the distance to cluster A3 is the smallest. The distances to the feature parameters of the analysis frames belonging to cluster A3 are therefore calculated exhaustively, and the smallest of them becomes the minimum distance to the registration data.
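The two-stage search can be sketched as follows, using three two-dimensional clusters in the spirit of FIG. 9. The coordinates are invented, the Euclidean metric is an assumption, and `clustered_min_distance` is a hypothetical name; only the nearest cluster is searched, which is one of the restriction options the text allows.

```python
import math


def clustered_min_distance(frame, clusters):
    """Minimum distance from an input frame to the frames of its nearest cluster.

    clusters is a list of (representative, member_frames) pairs.
    """
    # Stage 1: compare the frame with each cluster's representative value.
    _, members = min(clusters, key=lambda c: math.dist(frame, c[0]))
    # Stage 2: exhaustive search restricted to the selected cluster.
    return min(math.dist(frame, m) for m in members)


# Hypothetical clusters A1 to A3; each representative is the mean of its members.
clusters = [
    ([0.0, 0.0], [[0.0, 1.0], [0.0, -1.0]]),    # A1
    ([10.0, 0.0], [[9.0, 0.0], [11.0, 0.0]]),   # A2
    ([5.0, 8.0], [[5.0, 7.0], [5.0, 9.0]]),     # A3
]
distance = clustered_min_distance([5.0, 6.0], clusters)
```

The input frame is closest to the representative of A3, so only the two frames of A3 are compared and the result is 1. The saving comes from skipping the frames of A1 and A2; the price is that a closer frame in a skipped cluster would go unnoticed, so the result is an approximation of the exhaustive minimum.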

As described above, in the second embodiment the registration data is classified into clusters, and the clusters are used to limit the analysis frames used for speaker recognition, which makes processing more efficient. In addition, because single-speaker registration data is generated from the results of recognition processing, the most recent characteristics of a speaker's voice are retained and the accuracy of speaker recognition improves.

The above embodiments describe a configuration in which feature parameters are calculated from the voice data at registration time and stored in the storage unit 45. Alternatively, the voice data itself may be stored in the storage unit 45 and the feature parameters calculated as needed at recognition time.

The above embodiments describe the case in which the operation mode of a home security system is switched by voice operation. Speaker recognition according to the present invention is not limited to switching operation modes, however, and can be applied to a variety of operations through text discrimination.

The above embodiments also show a configuration in which the security operation mode is switched on condition that speaker verification succeeds. Alternatively, the voices of specific speakers may be registered as a blacklist, and operations by blacklisted speakers may be rejected.

The present invention is not limited to home security and can be applied to speaker recognition in any device, such as speaker recognition on a mobile phone terminal. It is particularly useful when speaker recognition is performed on a terminal with limited computing power, because it achieves high recognition accuracy with a small processing load.

The illustrated configurations are functional and schematic, and need not be physically configured as illustrated. That is, the form in which the devices are distributed or integrated is not limited to that shown, and all or part of them may be functionally or physically distributed or integrated in arbitrary units according to various loads and usage conditions. Furthermore, if the functional units of the speaker recognition units 31 and 131 are implemented in software and executed by a computer, a speaker recognition program that causes the computer to operate as a speaker recognition device is obtained.

As described above, the speaker recognition device, speaker recognition method, and speaker recognition program are suitable for improving the convenience of speaker recognition.

11 Door monitoring device
12 Window monitoring device
13 Fire detection device
20 Microphone
30 Speaker recognition device
31, 131 Speaker recognition unit
32 Text discrimination unit
33 Monitoring control unit
34 Monitoring unit
41 AD conversion unit
42 Voice segment extraction unit
43 Feature parameter calculation unit
44 Switching unit
45 Storage unit
46, 51 Minimum distance search unit
47 Recognition processing unit
47a Speaker identification unit
47b Speaker verification unit
52 Registration processing unit
53 Cluster setting unit
60 Monitoring device

Claims

A speaker recognition device for recognizing the speaker of voice data based on the voice data, comprising:
similarity calculation means for calculating the similarity between the spectral envelope of partial registered voice data cut out from registered voice data containing the voice of a registration subject and the spectral envelope of partial input voice data cut out from input voice data to be recognized; and
recognition means for recognizing the speaker of the input voice data based on the similarity calculated by the similarity calculation means, wherein
the registered voice data contains the voices of a plurality of registration subjects,
a plurality of pieces of the partial registered voice data are cut out so as to include the voice of each registration subject,
the similarity calculation means calculates, for the spectral envelope of each of the plurality of pieces of partial registered voice data, the similarity to the spectral envelope of the partial input voice data, and
the recognition means outputs a recognition result indicating either that the speaker of the input voice data matches one of the plurality of registration subjects or that the speaker of the input voice data matches none of the plurality of registration subjects.

A speaker recognition device for recognizing the speaker of voice data based on the voice data, comprising:
similarity calculation means for calculating the similarity between the spectral envelope of partial registered voice data cut out from registered voice data containing the voice of a registration subject and the spectral envelope of partial input voice data cut out from input voice data to be recognized; and
recognition means for recognizing the speaker of the input voice data based on the similarity calculated by the similarity calculation means, wherein
the registered voice data contains the voices of a plurality of registration subjects,
the similarity calculation means calculates, for a plurality of pieces of partial registered voice data cut out from the same registered voice data, the distance between the feature parameter representing the spectral envelope of each piece of partial registered voice data and the feature parameter representing the spectral envelope of the partial input voice data, a smaller distance representing a higher similarity of the partial input voice data to that piece of partial registered voice data, and takes the minimum of the distances of the partial input voice data to the respective pieces of partial registered voice data as the distance of the partial input voice data to the registered voice data, and
the recognition means uses the distance of the partial input voice data to the registered voice data as the similarity, and outputs a recognition result indicating either that the speaker of the input voice data matches one of the plurality of registration subjects or that the speaker of the input voice data matches none of the plurality of registration subjects.

The speaker recognition device according to claim 1, wherein
the similarity calculation means calculates the similarity to the input voice data for each of a plurality of pieces of the registered voice data, and the recognition means estimates that the speaker of the input voice data matches one of the plurality of registration subjects contained in the registered voice data that, among the plurality of pieces of registered voice data, has the highest similarity to the input voice data.

The speaker recognition device according to claim 1, wherein, when the similarity of the registered voice data to the input voice data is equal to or greater than a similarity threshold, the recognition means determines that the speaker of the input voice data matches one of the plurality of registration subjects contained in the registered voice data.

The speaker recognition device according to claim 1, further comprising classification means for classifying the spectral envelopes of a plurality of pieces of partial registered voice data obtained from the registered voice data containing the voices of the plurality of registration subjects based on the similarity of the feature parameters representing the spectral envelopes, and for calculating a representative value of the feature parameters for each class, wherein
the similarity calculation means calculates the distance between the feature parameter representing the spectral envelope of the partial input voice data and the representative value of each class, and uses the pieces of partial registered voice data belonging to the class whose representative value is at the minimum distance for calculating the similarity.

The speaker recognition device according to any one of claims 1 to 5, further comprising:
monitoring means for performing a monitoring operation on a monitoring target;
word discrimination means for discriminating a word contained in the input voice data; and control means for controlling the monitoring operation of the monitoring means based on the word discriminated by the word discrimination means when the recognition result of the recognition means satisfies a predetermined condition.

A speaker recognition method for recognizing the speaker of voice data based on the voice data, comprising:
a registered voice data processing step of cutting out, from registered voice data containing the voices of a plurality of registration subjects, pieces of partial registered voice data so as to include the voice of each registration subject, and obtaining the spectral envelope of each cut-out piece of partial registered voice data;
an input voice data processing step of cutting out partial input voice data from input voice data to be recognized, and obtaining the spectral envelope of the cut-out partial input voice data;
a similarity calculation step of calculating, for the spectral envelope of each of the plurality of pieces of partial registered voice data, the similarity to the spectral envelope of the partial input voice data; and
a recognition step of outputting, based on the similarity calculated in the similarity calculation step, a recognition result indicating either that the speaker of the input voice data matches one of the plurality of registration subjects or that the speaker of the input voice data matches none of the plurality of registration subjects.

A speaker recognition method for recognizing a speaker of voice data based on the voice data, comprising:
a registration voice data processing step of cutting out partial registration voice data from registration voice data containing the voices of a plurality of registration target persons, and obtaining a spectrum envelope of the cut-out partial registration voice data;
an input voice data processing step of cutting out partial input voice data from input voice data to be recognized, and obtaining a spectrum envelope of the cut-out partial input voice data;
a similarity calculation step of, for a plurality of pieces of partial registration voice data cut out from the same registration voice data, calculating the distance between a feature parameter indicating the spectrum envelope of each piece of partial registration voice data and a feature parameter indicating the spectrum envelope of the partial input voice data as the similarity of the partial input voice data to that piece of partial registration voice data, a smaller distance indicating a higher similarity, taking the minimum of the distances to the respective pieces of partial registration voice data as the distance of the partial input voice data to the registration voice data, and outputting that distance as the similarity; and
a recognition step of outputting, based on the similarity calculated in the similarity calculation step, a recognition result indicating either that the speaker of the input voice data matches one of the plurality of registration target persons or that the speaker of the input voice data matches none of the plurality of registration target persons.
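The minimum-distance rule in the similarity calculation step can be illustrated as below. The sketch assumes Euclidean distance between feature vectors and a hypothetical acceptance threshold `max_distance`. The claim specifies neither, and the function names are illustrative only.

```python
import math

def feature_distance(a, b):
    # Assumed distance between two feature parameter vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_to_registration(partial_reg_features, input_feature):
    """Distance of the partial input data to one registration recording.

    The minimum over all partial registration data cut out from the same
    recording is used, so one close segment is enough.
    """
    return min(feature_distance(f, input_feature)
               for f in partial_reg_features)

def recognize_by_distance(registrations, input_feature, max_distance):
    # registrations: list of recordings, each a list of feature vectors.
    best = min(distance_to_registration(feats, input_feature)
               for feats in registrations)
    return "match" if best <= max_distance else "no_match"
```

Taking the minimum means that a recording containing several speakers is judged by its best-matching segment, which fits registration voice data in which the registration target persons are not recorded separately.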

A speaker recognition program for recognizing a speaker of voice data based on the voice data, the program causing a computer to execute:
a registration voice data processing procedure of cutting out partial registration voice data from registration voice data containing the voices of a plurality of registration target persons so that the voice of each registration target person is included, and obtaining a spectrum envelope of the cut-out partial registration voice data;
an input voice data processing procedure of cutting out partial input voice data from input voice data to be recognized, and obtaining a spectrum envelope of the cut-out partial input voice data;
a similarity calculation procedure of calculating, for each of the spectrum envelopes of the plurality of pieces of partial registration voice data, a similarity to the spectrum envelope of the partial input voice data; and
a recognition procedure of outputting, based on the similarity calculated by the similarity calculation procedure, a recognition result indicating either that the speaker of the input voice data matches one of the plurality of registration target persons or that the speaker of the input voice data matches none of the plurality of registration target persons.

A speaker recognition program for recognizing a speaker of voice data based on the voice data, the program causing a computer to execute:
a registration voice data processing procedure of cutting out partial registration voice data from registration voice data containing the voices of a plurality of registration target persons, and obtaining a spectrum envelope of the cut-out partial registration voice data;
an input voice data processing procedure of cutting out partial input voice data from input voice data to be recognized, and obtaining a spectrum envelope of the cut-out partial input voice data;
a similarity calculation procedure of, for a plurality of pieces of partial registration voice data cut out from the same registration voice data, calculating the distance between a feature parameter indicating the spectrum envelope of each piece of partial registration voice data and a feature parameter indicating the spectrum envelope of the partial input voice data as the similarity of the partial input voice data to that piece of partial registration voice data, a smaller distance indicating a higher similarity, taking the minimum of the distances to the respective pieces of partial registration voice data as the distance of the partial input voice data to the registration voice data, and outputting that distance as the similarity; and
a recognition procedure of outputting, based on the similarity calculated by the similarity calculation procedure, a recognition result indicating either that the speaker of the input voice data matches one of the plurality of registration target persons or that the speaker of the input voice data matches none of the plurality of registration target persons.