JP2017223848A - Speaker recognition device - Google Patents

Speaker recognition device

Info

Publication number
JP2017223848A
JP2017223848A (application number JP2016119448A)
Authority
JP
Japan
Prior art keywords
speaker
speech
similarity
registered
speakers
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP2016119448A
Other languages
Japanese (ja)
Inventor
辻川 美沙貴 (Misaki Tsujikawa)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Panasonic Intellectual Property Corp of America
Original Assignee
Panasonic Intellectual Property Corp of America
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Panasonic Intellectual Property Corp of America
Priority to JP2016119448A
Publication of JP2017223848A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

PROBLEM TO BE SOLVED: To recognize a speaker with higher accuracy even when the speech is disturbed by noise or the utterance is short.

SOLUTION: For speech received by a speech input unit 11, an analysis unit 13 extracts a feature quantity called an i-vector using a large-scale speech database 12 that holds the speech or speech models of many unspecified speakers and of registered speakers. A similarity calculation unit 14 computes the similarity between the feature quantity of the input speech and the speech models of the many unspecified speakers and of the registered speakers in the large-scale speech database 12. An order calculation unit 15 computes the rank, among all speaker models, of the similarity between the input speech and the registered speaker model that the input speaker claims to be. A determination unit 16 accepts the speaker as the claimed person when the rank computed in the order calculation process is within a predetermined rank.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a speaker recognition device that recognizes a speaker from an acquired speech signal.

In conventional speaker recognition, speech for registration is typically collected in advance, feature quantities are extracted by analyzing the collected speech, and whether an unknown speaker is the claimed person is judged from the similarity between the feature quantities of newly acquired speech and of the registered speech. There is also a technique that performs speaker recognition by ranking the similarities to the speech of multiple speakers.

The speaker recognition device described in Patent Document 1 analyzes the input speaker's speech, extracts feature quantities, computes the similarity to every registered speaker using a tree structure, ranks those similarities, and accepts the input speaker as the claimed person when the similarity to the claimed registered speaker is within a predetermined rank.

Patent Document 1 also describes a conventional speaker recognition method and speaker identification device based only on similarity, and states that the ranking-based method is more robust against various disturbances than the similarity-only method.

Non-Patent Document 1 proposes a speaker-specific feature quantity called the i-vector, and a method for obtaining it, as a highly accurate feature quantity for speaker recognition.

Patent Document 1: Japanese Patent No. 2991288

Non-Patent Document 1: Dehak, Najim, et al. "Front-end factor analysis for speaker verification." IEEE Transactions on Audio, Speech, and Language Processing 19.4 (2011): 788-798.

In conventional speaker recognition, accuracy degrades when disturbances such as noise are added to the speech or when the target speech is extremely short.

Patent Document 1 shows robustness against disturbance by ranking the similarity to each speaker, but it does not specify a particular feature-extraction method and leaves the detailed determination of the threshold open. It therefore provides neither grounds for the ranking-based method outperforming the similarity-only method under disturbance nor a more reliable procedure. Moreover, while robustness to disturbance is discussed as a cause of accuracy degradation, short word-level utterances are not addressed.

An object of the present invention is to provide a speaker recognition method and speaker recognition device that extract a feature quantity named the i-vector from speech as a speaker-specific model and achieve higher accuracy, supported by experimental results.

A speaker recognition method according to one aspect of the present invention comprises: a speech input process in which the speech of an unknown speaker is input, using a large-scale speech database that holds in advance the speech or speech models of many unspecified speakers and of registered speakers; an analysis process that extracts from the input speech a feature quantity called an i-vector using the large-scale database; a similarity calculation process that computes the similarity between the feature quantity of the input speech and the speech models of the unspecified speakers and of the registered speakers in the large-scale speech database; a rank calculation process that computes where, among all speaker models, the similarity between the feature quantity of the input speech and the registered speaker model claimed by the input speaker ranks; and a determination process that accepts the speaker as the claimed person when the rank obtained in the rank calculation process is within a predetermined rank.

With this configuration, the speech signal of an unknown speaker is acquired, and an i-vector, a speaker-specific feature quantity, is extracted from it. Extracting the i-vector requires a general distribution of speech features obtained from the speech of many speakers, for which the information in the large-scale speech database can be used. Because an i-vector is a numerical vector of only a few hundred dimensions, similarity is easy to compute; it is also robust to disturbance and little affected by noise or by differences in recording equipment. The similarity between the extracted feature quantity of the unknown speaker and the speech models of the many unspecified speakers and registered speakers stored in advance in the large-scale speech database is then computed. The similarities are ranked in descending order, and the speaker is accepted as the claimed person if the similarity to the claimed registered speaker is within a predetermined rank. The speaker speech in the large-scale database can be selected in advance by recording conditions such as presence of noise, utterance length, and utterance content. By using comparison speech with little noise and sufficient utterance length, a stable ranking can be expected even when the unknown speaker's speech is disturbed or is an extremely short word-level utterance.

Therefore, by using a large-scale speech database to rank the many unspecified speakers and the registered speakers by similarity computed on disturbance-robust features, more accurate speaker recognition is possible.

In the above speaker recognition method, the same judgment may be carried out in advance with development speakers acting as registered speakers, the probability that the true speaker is rejected and the probability that an impostor is accepted may be computed for each candidate threshold rank, and the rank with the lowest error rate may be set as the rank for accepting the speaker.

With this configuration, two kinds of recognition error rate are computed for the development speakers at each candidate rank: the probability that the true speaker is rejected and the probability that an impostor is accepted. The rank that yields the highest speaker recognition accuracy on the development speakers is chosen as the threshold.

Therefore, since the rank determined from the development speakers' speech can be used when judging unknown speakers, speakers can be recognized with higher accuracy.

According to the present invention, speakers can be recognized with higher accuracy even under conditions adverse to speaker recognition, such as disturbance by noise or extremely short utterances.

FIG. 1 shows the configuration of the speaker identification device in Embodiment 1 of the present invention.
FIG. 2 shows the configuration of the speaker identification device in Embodiment 2 of the present invention.
FIG. 3 shows a graph for determining the threshold rank in Embodiment 3 of the present invention.

Embodiments of the present invention will be described below with reference to the accompanying drawings. The following embodiments are examples embodying the present invention and do not limit its technical scope.

(Embodiment 1)
FIG. 1 shows the configuration of the speaker recognition device according to Embodiment 1. The speaker recognition device is built into, for example, a television, a smartphone, or a car navigation device.

The speaker recognition device shown in FIG. 1 comprises a speech input unit 11, a large-scale speech database 12, an analysis unit 13, a similarity calculation unit 14, a rank calculation unit 15, and a determination unit 16.

The speech input unit 11 is composed of, for example, a microphone; it captures the unknown speaker's speech, converts it into a speech signal, and outputs the signal.

The large-scale speech database 12 is a storage device located, for example, in the cloud, and holds speech data or speech models. It includes an unspecified-speaker speech database holding the speech or speech models of many speakers who are not registered, and a registered-speaker speech database holding the speech or speech models of the registered speakers against which input unknown speakers are judged; however, the database is not limited to this configuration and may hold the speech of many other speakers.

The analysis unit 13 analyzes the speech signal received from the speech input unit 11 and computes the feature quantity of the speech uttered by the unknown speaker. Here the speaker-specific feature quantity w, called the i-vector, is obtained from the equation M = m + Tw. In this equation, M is a feature quantity representing the individual input speaker; for example, a GMM (Gaussian Mixture Model), which represents as a mixture of normal distributions the numerical sequence obtained by analyzing the frequency spectrum of speech with MFCCs (Mel-Frequency Cepstral Coefficients), or a GMM supervector, is used. For m, a feature quantity obtained in the same way as M from the speech of many speakers is used; the GMM used for m is called the UBM (Universal Background Model). T is a set of basis vectors that spans the feature space of general speakers obtained from M. The quantity w is the feature used in the present invention. The detailed extraction procedures are described in Non-Patent Document 1 and related literature and are omitted here. The speaker speech used to build the UBM is considered more accurate the more varied and plentiful it is in recording environment, speaker characteristics, and utterance content, since it must represent the general feature distribution of speech data. The analysis unit 13 therefore extracts the feature quantity using the many speaker voices in the large-scale speech database 12.
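The relation M = m + Tw can be illustrated with a small numerical sketch. This is not the estimator of Non-Patent Document 1, which computes w as the posterior mean of a latent factor given Baum-Welch statistics; it only shows, with toy dimensions and made-up random data, that the speaker factor w can be recovered from a supervector once the UBM mean m and the total variability matrix T are fixed:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 512   # supervector dimension (e.g., stacked GMM mean vectors)
R = 40    # i-vector dimension (a few hundred in practice)

m = rng.normal(size=D)          # UBM mean supervector (from many speakers)
T = rng.normal(size=(D, R))     # total variability matrix (trained offline)
w_true = rng.normal(size=R)     # hypothetical speaker factor
M = m + T @ w_true              # utterance-level supervector

# Recover the speaker factor: w = argmin ||M - m - T w||^2
w, *_ = np.linalg.lstsq(T, M - m, rcond=None)

print(np.allclose(w, w_true, atol=1e-6))
```

Because the synthetic M lies exactly in the column space of T, the least-squares solution recovers w_true up to numerical error; real extraction additionally weights the statistics by frame posteriors.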

The similarity calculation unit 14 compares the unknown speaker's feature quantity w computed by the analysis unit 13 with all or some of the speech models in the large-scale speech database 12 and computes their similarity. Because the feature quantity and the speech models are numerical vectors of only a few hundred dimensions, similarity can be computed simply, for example by the cosine distance scoring described in Non-Patent Document 1: the score approaches 1 when the similarity is high and -1 when it is low. The similarity computation is not limited to this method.
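Cosine distance scoring between two feature vectors can be sketched as follows; the vectors here are made-up toy values, not i-vectors extracted from real speech:

```python
import numpy as np

def cosine_score(w_test, w_model):
    """Cosine distance scoring between two i-vectors: +1 for identical
    directions, -1 for opposite directions, 0 for orthogonal vectors."""
    return float(np.dot(w_test, w_model) /
                 (np.linalg.norm(w_test) * np.linalg.norm(w_model)))

a = np.array([1.0, 2.0, 3.0])
assert abs(cosine_score(a, 2 * a) - 1.0) < 1e-12   # same direction -> ~1
assert abs(cosine_score(a, -a) + 1.0) < 1e-12      # opposite direction -> ~-1
```

Since the score depends only on direction, the magnitude of the i-vectors (which varies with utterance length and channel) does not affect the comparison.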

The rank calculation unit 15 sorts the similarities obtained by the similarity calculation unit 14 in descending order and determines where, among all the computed similarities, the similarity between the input unknown speaker's feature quantity and the registered speaker model that the unknown speaker claims to be ranks.

The determination unit 16 judges from the rank computed by the rank calculation unit 15 whether the unknown speaker is the claimed registered speaker. If the computed rank is higher than a predetermined rank, the speaker is accepted as the claimed registered speaker.
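The rank computation of the rank calculation unit 15 and the judgment of the determination unit 16 can be sketched together; the speaker identifiers, scores, and threshold below are hypothetical illustrations:

```python
def accept_by_rank(scores, claimed_id, rank_threshold):
    """scores: mapping of speaker-model id -> similarity score with the input
    utterance (registered speakers plus unspecified background speakers).
    Accept iff the claimed model's score ranks within rank_threshold."""
    ranked = sorted(scores, key=scores.get, reverse=True)  # descending score
    rank = ranked.index(claimed_id) + 1                    # 1-based rank
    return rank <= rank_threshold, rank

scores = {"registered:alice": 0.62, "bg:0001": 0.71, "bg:0002": 0.35,
          "bg:0003": 0.58, "bg:0004": 0.12}
accepted, rank = accept_by_rank(scores, "registered:alice", rank_threshold=3)
print(accepted, rank)  # alice has the 2nd-highest score -> accepted
```

Note that the decision uses only the position of the claimed model among all models, not the absolute score, which is what gives the method its robustness to score shifts caused by noise.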

(Embodiment 2)
FIG. 2 shows the configuration of the speaker recognition device according to Embodiment 2. In FIG. 2, the same reference numerals are used for the same components as in FIG. 1, and their description is omitted.

In the speaker recognition device of Embodiment 2, a threshold determination process is carried out before the speaker recognition process. The process of Embodiment 1 serves as the speaker recognition process.

In the threshold determination process of Embodiment 2, the speech input unit 11 receives development speaker speech, converts it into a speech signal, and outputs it. The speakers of the development speech are known; they may differ from the registered speakers or overlap with them.

The analysis unit 13, similarity calculation unit 14, and rank calculation unit 15 apply the processing described in Embodiment 1 to the development speaker speech and the models in the large-scale speech database 12, and compute the rank.

The threshold determination unit 17 computes the recognition error rate of the development speaker speech from the ranks computed by the rank calculation unit 15 and thereby determines the rank that serves as an appropriate threshold. For example, the recognition error rate of the development speech is computed with the threshold rank set to 100. There are two error rates: the probability of judging speech that should be the true speaker's as someone else's (false rejection rate) and the probability of judging an impostor's speech as the true speaker's (false acceptance rate). One development speaker A is selected, and the rest are treated as impostors. A's utterances are input and, as in Embodiment 1, the rank of the similarity to speaker A's model among all computed similarities is obtained; if it falls below the 100th rank, the result is a false rejection. Likewise, the utterances of the impostors other than A are input and ranked by similarity; if the similarity to speaker A's model is within the 100th rank, the impostor would be judged to be A, a false acceptance. The two error rates are computed in this way, for example in steps of 10 up to the 100th rank and in steps of 100 above it, and the rank at which the two error-rate curves cross, where the error rate is lowest, is chosen as the appropriate threshold. FIG. 3 graphs the two error rates computed by this procedure when short utterances from ten female development speakers were input. The vertical axis is the error rate and the horizontal axis is the threshold rank; the broken line shows the false rejection rate and the solid line the false acceptance rate. In FIG. 3, a rank of roughly 200 is determined to be an appropriate threshold. The rank chosen as the threshold is used for judgment by the determination unit 16 during the speaker recognition process.
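The threshold search can be sketched as follows; the trial ranks and the candidate grid below are made-up illustrations, not the data behind FIG. 3:

```python
# For each candidate rank threshold, estimate the false rejection rate
# (genuine trials whose rank falls outside the threshold) and the false
# acceptance rate (impostor trials whose rank falls inside it), then keep
# the threshold with the lowest combined error.

genuine_ranks = [1, 2, 5, 40, 120, 3, 250, 8, 15, 60]       # true-speaker trials
impostor_ranks = [30, 500, 90, 800, 150, 400, 1200, 70, 950, 600]

def error_rates(threshold):
    frr = sum(r > threshold for r in genuine_ranks) / len(genuine_ranks)
    far = sum(r <= threshold for r in impostor_ranks) / len(impostor_ranks)
    return frr, far

# Coarse grid as in the text: steps of 10 up to 100, steps of 100 above it.
candidates = list(range(10, 101, 10)) + list(range(200, 1001, 100))
best = min(candidates, key=lambda t: sum(error_rates(t)))
print(best, error_rates(best))
```

With real development data the two curves cross as in FIG. 3, and the crossing rank (around 200 in the patent's experiment) is taken as the operating threshold.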

In the speaker recognition process of Embodiment 2, the determination unit 16 uses the threshold rank computed by the threshold determination unit 17 to judge whether the input unknown speaker is the claimed speaker.

By using speech data in a large-scale database, the speaker recognition method and speaker recognition device according to the present invention can identify speakers with higher accuracy even under disturbances such as noise or with insufficient utterance length, and are therefore useful as a speaker recognition method and device that recognize a speaker from an acquired speech signal.

DESCRIPTION OF SYMBOLS
11 Speech input unit
12 Large-scale speech database
13 Analysis unit
14 Similarity calculation unit
15 Rank calculation unit
16 Determination unit
17 Threshold determination unit

Claims (2)

1. A speaker recognition device that performs speaker recognition using a large-scale speech database holding the speech or speech models of many unspecified speakers and of registered speakers, the device comprising:
a speech input unit to which speech is input;
an analysis unit that extracts from the input speech a feature quantity called an i-vector using the large-scale speech database;
a similarity calculation unit that computes the similarity between the feature quantity of the input speech and the speech models of the unspecified speakers and of the registered speakers in the large-scale speech database;
a rank calculation unit that computes where, among all speaker models, the similarity between the feature quantity of the input speech and the registered speaker model claimed by the input speaker ranks; and
a determination unit that accepts the speaker as the claimed person when the rank obtained by the rank calculation is within a predetermined rank.

2. The speaker recognition device according to claim 1, further comprising a threshold determination unit that carries out the same judgment in advance using development speakers as registered speakers, computes for each candidate threshold rank the probability that the true speaker is rejected and the probability that an impostor is accepted, and sets the rank with the lowest error rate as the rank for accepting the speaker.
JP2016119448A 2016-06-16 2016-06-16 Speaker recognition device Pending JP2017223848A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2016119448A JP2017223848A (en) 2016-06-16 2016-06-16 Speaker recognition device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2016119448A JP2017223848A (en) 2016-06-16 2016-06-16 Speaker recognition device

Publications (1)

Publication Number Publication Date
JP2017223848A true JP2017223848A (en) 2017-12-21

Family

ID=60688113

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2016119448A Pending JP2017223848A (en) 2016-06-16 2016-06-16 Speaker recognition device

Country Status (1)

Country Link
JP (1) JP2017223848A (en)


Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102018125989A1 (en) 2017-10-24 2019-04-25 Shimano Inc. BRAKE SYSTEM
DE102018125987A1 (en) 2017-10-24 2019-04-25 Shimano Inc. BRAKE SYSTEM
DE102018125988A1 (en) 2017-10-24 2019-04-25 Shimano, Inc. CONTROLLER, MANUFACTURED VEHICLE SYSTEM AND CONTROL PROCEDURE
KR101888058B1 (en) * 2018-02-09 2018-08-13 주식회사 공훈 The method and apparatus for identifying speaker based on spoken word
WO2019156427A1 (en) * 2018-02-09 2019-08-15 주식회사 공훈 Method for identifying utterer on basis of uttered word and apparatus therefor, and apparatus for managing voice model on basis of context and method thereof
KR102113879B1 (en) * 2018-12-19 2020-05-26 주식회사 공훈 The method and apparatus for recognizing speaker's voice by using reference database
JP2020173381A (en) * 2019-04-12 2020-10-22 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカPanasonic Intellectual Property Corporation of America Speaker recognition method, speaker recognition device, speaker recognition program, database creation method, database creation device, and database creation program
CN111816184A (en) * 2019-04-12 2020-10-23 松下电器(美国)知识产权公司 Speaker recognition method, speaker recognition device, recording medium, database generation method, database generation device, and recording medium
US11315573B2 (en) 2019-04-12 2022-04-26 Panasonic Intellectual Property Corporation Of America Speaker recognizing method, speaker recognizing apparatus, recording medium recording speaker recognizing program, database making method, database making apparatus, and recording medium recording database making program
JP7266448B2 (en) 2019-04-12 2023-04-28 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカ Speaker recognition method, speaker recognition device, and speaker recognition program
CN111816184B (en) * 2019-04-12 2024-02-23 松下电器(美国)知识产权公司 Speaker recognition method, speaker recognition device, and recording medium
JP7473910B2 (en) 2020-03-27 2024-04-24 株式会社フュートレック SPEAKER RECOGNITION DEVICE, SPEAKER RECOGNITION METHOD, AND PROGRAM
CN112735437A (en) * 2020-12-15 2021-04-30 厦门快商通科技股份有限公司 Voiceprint comparison method, system and device and storage mechanism
WO2023089731A1 (en) 2021-11-18 2023-05-25 エヴィクサー株式会社 Determination system, information processing device, method, and program
