JP6738867B2 - Speaker authentication method and voice recognition system

Info

Publication number: JP6738867B2
Application number: JP2018140621A
Authority: JP
Inventors: 奉眞李; 明祐呉; 益 ▲祥▼ 韓; 五赫權; 丙烈金; 燦奎李; 貞姫任; 丁牙崔; 秀桓金; 漢容姜; ▲みん▼ 碩崔; 智須崔
Original assignee: Naver Corp
Current assignee: Naver Corp
Priority date: 2017-07-26
Filing date: 2018-07-26
Publication date: 2020-08-12
Anticipated expiration: 2038-07-26
Also published as: KR102002903B1; KR20190012066A; JP2019028464A

Description

The present invention relates to a speaker authentication method and a speaker recognition system, and more particularly to a method of authenticating a speaker in a voice recognition system that includes a voice recognition device and a voice recognition server.

Artificial intelligence speaker devices equipped with a voice recognition function are on the market. Such a device acts like an artificial intelligence secretary: it recognizes a user's voice, extracts the command contained in the voice, executes an operation according to the command, and outputs the result as voice. For an artificial intelligence speaker device to go beyond simply answering spoken questions and outputting the answers as voice, and to be used in fields that require security, such as financial transactions and shopping, it must recognize and identify the speaker accurately. However, because an artificial intelligence speaker device has no choice but to identify the user by voice, its accuracy is lower than that of user identification or user authentication methods that use biometric information, such as fingerprint recognition or iris recognition.

The problem to be solved by the present invention is to address the issue described above by providing a method that, after accurately recognizing both the content of the voice and the speaker from the speaker's voice, can additionally authenticate the speaker.

As a technical means for achieving the above technical problem, a first aspect of the present disclosure provides a speaker authentication method in a voice recognition system including a voice recognition device and a voice recognition server. The speaker authentication method includes: a step in which the voice recognition server receives, from the voice recognition device, a voice signal including a voice of a first speaker; a step in which the voice recognition server performs voice recognition on the voice signal and generates a first voice recognition result; a step in which the voice recognition server extracts a first speaker feature vector from the voice signal and calculates a similarity between the first speaker feature vector and a registered speaker feature vector; a step in which, when the similarity is equal to or greater than a first reference value, the voice recognition server determines that the speaker of the voice signal is a registered second speaker; a step in which the voice recognition server requests an authentication voice from the first speaker or the second speaker; a step in which the voice recognition server receives an authentication voice signal from the first speaker or the second speaker; a step in which the voice recognition server authenticates, based on the authentication voice signal, that the second speaker and the first speaker are the same person; and a step in which, when this identity is authenticated, the voice recognition server executes an operation corresponding to the first voice recognition result.
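The sequence of steps in the first aspect can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the text above does not specify how the similarity is computed, so cosine similarity is used here as an assumption, and the names `cosine_similarity`, `identify_registered_speaker`, `handle_voice_signal`, and the `authenticate` and `execute` callbacks are hypothetical. Voice recognition and feature extraction are assumed to have already produced `recognition_result` and `first_vec`.

```python
import math
from typing import Callable, Dict, List, Optional


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Assumed similarity measure; the source only says 'similarity'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_registered_speaker(
    first_vec: List[float],
    registered: Dict[str, List[float]],
    first_reference_value: float,
) -> Optional[str]:
    """Return the best-matching registered user if the similarity is
    equal to or greater than the first reference value, else None."""
    best_id: Optional[str] = None
    best_sim = -1.0
    for user_id, vec in registered.items():
        sim = cosine_similarity(first_vec, vec)
        if sim > best_sim:
            best_id, best_sim = user_id, sim
    if best_id is not None and best_sim >= first_reference_value:
        return best_id
    return None


def handle_voice_signal(
    first_vec: List[float],
    recognition_result: str,
    registered: Dict[str, List[float]],
    first_reference_value: float,
    authenticate: Callable[[str], bool],
    execute: Callable[[str], None],
) -> str:
    """Identify the registered second speaker, run the additional
    authentication, and execute the operation only if it succeeds."""
    second_speaker = identify_registered_speaker(
        first_vec, registered, first_reference_value
    )
    if second_speaker is None:
        return "speaker_not_registered"
    # Hypothetical callback: requests an authentication voice and checks it.
    if not authenticate(second_speaker):
        return "authentication_failed"
    execute(recognition_result)
    return "executed"
```

The operation is executed only after both checks pass, which mirrors the ordering of the claimed steps: identification by similarity first, additional authentication second.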

A second aspect of the present disclosure provides a voice recognition server including a communication module that communicates with a voice recognition device, and a processor. The processor is configured to: receive, from the voice recognition device by using the communication module, a voice signal including a voice of a first speaker; perform voice recognition on the voice signal and generate a first voice recognition result; extract a first speaker feature vector from the voice signal and calculate a similarity between the first speaker feature vector and a registered speaker feature vector; when the similarity is equal to or greater than a first reference value, determine that the speaker of the voice signal is a registered second speaker; request an authentication voice from the first speaker or the second speaker; receive an authentication voice signal from the first speaker or the second speaker; authenticate, based on the authentication voice signal, that the second speaker and the first speaker are the same person; and, when this identity is authenticated, execute an operation corresponding to the first voice recognition result.

A third aspect of the present disclosure provides a voice recognition device including: a communication module that communicates with the voice recognition server according to the second aspect; a microphone that generates an audio signal; a processor configured to detect, from the audio signal, a voice signal including a voice of a first speaker, transmit the voice signal to the voice recognition server, and receive a synthesized sound signal from the voice recognition server; and a speaker that reproduces a synthesized sound corresponding to the synthesized sound signal.

A fourth aspect of the present disclosure provides a voice recognition system including a voice recognition server and a voice recognition device. The voice recognition device includes: a first communication module that communicates with the voice recognition server; a microphone that generates an audio signal; a first processor configured to detect, from the audio signal, a voice signal including a voice of a first speaker, transmit the voice signal to the voice recognition server, and receive a synthesized sound signal from the voice recognition server; and a speaker that reproduces a synthesized sound corresponding to the synthesized sound signal. The voice recognition server includes a second processor and a second communication module that communicates with the voice recognition device. The second processor is configured to: receive the voice signal from the voice recognition device; perform voice recognition on the voice signal and generate a first voice recognition result; extract a first speaker feature vector from the voice signal and calculate a similarity between the first speaker feature vector and a registered speaker feature vector; when the similarity is equal to or greater than a first reference value, determine that the speaker of the voice signal is a registered second speaker; request an authentication voice from the first speaker or the second speaker; receive an authentication voice signal from the first speaker or the second speaker; authenticate, based on the authentication voice signal, that the second speaker and the first speaker are the same person; and, when this identity is authenticated, execute an operation corresponding to the first voice recognition result.

A fifth aspect of the present disclosure provides a program that causes a processor of a voice recognition server of a voice recognition system to execute the speaker authentication method according to the second aspect.

A sixth aspect of the present disclosure provides a computer-readable recording medium on which the program according to the fifth aspect is recorded.

According to embodiments of the present invention, the speaker can be accurately identified through the speaker authentication procedure, so the voice recognition system can execute the speaker's commands safely and accurately without concern about malfunction caused by speaker misrecognition or voice theft.

The drawings show, in order: an exemplary network configuration diagram of a voice recognition system according to an embodiment; a block diagram for explaining the internal configuration of a voice recognition speaker device according to an embodiment; a block diagram for explaining the internal configuration of a voice recognition server according to an embodiment; a block diagram for explaining the internal configuration of a processor of a voice recognition server according to an embodiment; a block diagram for explaining the internal configuration of a processor of a voice recognition server according to another embodiment; an exemplary flowchart for explaining a speaker authentication method of a voice recognition system according to an embodiment; an exemplary flowchart for explaining a speaker authentication method of a voice recognition system according to another embodiment; and an exemplary flowchart for explaining a speaker authentication method of a voice recognition system according to another embodiment.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those skilled in the art to which the present invention belongs can easily implement them. However, the present invention may be embodied in many different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted in order to describe the present invention clearly, and similar parts are denoted by similar reference numerals throughout.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element in between. In addition, when a part is said to "include" a component, this means that, unless specifically stated otherwise, other components are not excluded and may be further included.

Phrases such as "in some embodiments" or "in one embodiment" that appear in various places in this specification do not necessarily all refer to the same embodiment.

An embodiment may be described in terms of functional block configurations and various processing steps. Some or all of such functional blocks may be implemented by any number of hardware and/or software components that perform specific functions. For example, the functional blocks of the present disclosure may be implemented by one or more microprocessors, or by circuit configurations for predetermined functions. The functional blocks of the present disclosure may also be implemented in various programming or scripting languages, or as algorithms executed by one or more processors. In addition, the present disclosure may employ conventional techniques for electronic configuration, signal processing, and/or data processing. Terms such as "module" and "configuration" are used broadly and are not limited to mechanical or physical configurations.

In addition, the connecting lines or connecting members between the components shown in the drawings merely illustrate functional connections and/or physical or circuit connections. In an actual device, the connections between components may be realized by various functional, physical, or circuit connections that can be replaced or added.

In the present disclosure, the voice recognition function refers to converting a voice signal that includes a user's voice into a character string (or text). The character string (or text) obtained by converting a voice signal with the voice recognition function is also called a voice recognition result. A user's voice signal may include a voice command, and the voice recognition result may likewise include a command corresponding to the voice command. Such a voice command can cause a specific function of the voice recognition speaker device or the voice recognition server to be executed. Conversely, the voice synthesis function in the present disclosure refers to converting a character string (or text) into a voice signal. The voice signal obtained by converting a character string (or text) with the voice synthesis function is also called a synthesized sound signal.

In the present disclosure, the expression "registered" means registered in the voice recognition system as a user or as information related to a user. A "registered user" is a user who has completed user registration in the voice recognition system. A person can register as a user of the voice recognition system according to the present disclosure and, when registering, can input his or her own voice by uttering a sentence presented by the voice recognition system. The voice recognition system can extract a speaker feature vector from the voice signal of the voice input at the time of user registration and store it as information related to the registered user. A speaker feature vector stored in the voice recognition system in this way may be called a registered speaker feature vector. The identification number of a mobile device owned by the user can also be stored at the time of user registration.

The user-related information stored in the voice recognition system may include a passphrase used for user authentication. At the time of user registration, the user can input to the voice recognition system a passphrase voice, that is, the user's own passphrase spoken aloud. The voice recognition system can store the passphrase voice signal of the passphrase voice, and can store the voice recognition result of the passphrase voice signal (the passphrase character string) or a speaker feature vector extracted from the passphrase voice signal. The passphrase voice signal, the passphrase character string, and the speaker feature vector extracted from the passphrase voice signal that are stored in the voice recognition system are also called the registered passphrase voice signal, the registered passphrase character string, and the registered speaker feature vector, respectively.
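The information stored at user registration, as described in the two paragraphs above, could be organized roughly as in the following sketch. The record layout and every name in it (`RegisteredUser`, `UserRegistry`, and the field names) are hypothetical illustrations; the source does not prescribe a storage format, and which passphrase fields are actually stored is left optional here because the text says the system "can" store them.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RegisteredUser:
    """Hypothetical record of the information kept for one registered user."""
    user_id: str
    speaker_vector: List[float]                    # registered speaker feature vector
    mobile_device_id: Optional[str] = None         # identification number of the user's mobile device
    passphrase_text: Optional[str] = None          # registered passphrase character string
    passphrase_vector: Optional[List[float]] = None  # vector extracted from the passphrase voice signal


class UserRegistry:
    """In-memory stand-in for the system's store of registered users."""

    def __init__(self) -> None:
        self._users: Dict[str, RegisteredUser] = {}

    def register(self, user: RegisteredUser) -> None:
        if user.user_id in self._users:
            raise ValueError(f"user already registered: {user.user_id}")
        self._users[user.user_id] = user

    def get(self, user_id: str) -> Optional[RegisteredUser]:
        return self._users.get(user_id)

    def speaker_vectors(self) -> Dict[str, List[float]]:
        """Map each user ID to its registered speaker feature vector."""
        return {uid: u.speaker_vector for uid, u in self._users.items()}
```

Keeping the speaker feature vector separate from the passphrase-derived data reflects that the former is used to identify the speaker, while the latter is kept for user authentication.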

A plurality of users may be registered in the voice recognition system. In the present disclosure, the first speaker means the person who actually uttered the voice in the voice signal, and the registered second speaker means the user, among the plurality of users registered in the voice recognition system, whom the voice recognition system has recognized or determined to have uttered the voice in the voice signal. The registered second speaker is normally the same person as the first speaker, but when the voice recognition system misrecognizes the speaker or when voice theft occurs, the registered second speaker differs from the first speaker.

In the present disclosure, a keyword may take the form of a word or a phrase. A voice command uttered after the wake-up keyword may take the form of a natural-language sentence, a word, or a phrase.

Hereinafter, the present disclosure is described in detail with reference to the accompanying drawings.

FIG. 1 is an exemplary network configuration diagram of a voice recognition system according to an embodiment. Referring to FIG. 1, the network environment of the voice recognition system is illustrated, by way of example, as including a voice recognition speaker device 100, a voice recognition server 200, a mobile device 300, and a network 400. The voice recognition system includes the voice recognition speaker device 100 and the voice recognition server 200.

The voice recognition speaker device 100 is an example of a voice recognition device: a speaker device equipped with a voice control function that executes specific functions. The voice recognition speaker device 100 is also called a smart speaker device or an artificial intelligence speaker device. On receiving a speaker's voice, the voice recognition speaker device 100 can recognize the voice and the speaker, extract the command contained in the voice, execute an operation according to the command, and output the result as voice. The specific functions that the voice recognition speaker device 100 can execute may include, for example, providing voice information, playing music, Internet shopping, financial transactions, placing telephone calls, sending messages, setting alarms, and controlling electronic or mechanical devices connected to the voice recognition speaker device over a network.

For example, when the voice recognition speaker device 100 is connected to a smart TV over a network, the specific functions may include channel viewing, channel search, video playback, and program search. When the voice recognition speaker device 100 is connected to a home appliance such as a smart refrigerator, the specific functions may include checking the refrigeration and freezing states and setting the temperature. However, the specific functions in the present disclosure are not limited to those listed above.

The voice recognition speaker device 100 can communicate with the voice recognition server 200 through the network 400 by wireless or wired communication.

The communication method of the network 400 is not limited, and may include not only communication methods that use the communication networks included in the network 400 (for example, a mobile communication network, the wired Internet, the wireless Internet, or a broadcasting network) but also short-range wireless communication with the voice recognition speaker device 100. For example, the network 400 may include any one or more of a PAN (personal area network), a LAN (local area network), a CAN (campus area network), a MAN (metropolitan area network), a WAN (wide area network), a BBN (broadband network), the Internet, and the like. The network 400 may include any one or more network topologies, including a bus network, a star network, a ring network, a mesh network, a star-bus network, and a tree or hierarchical network, but is not limited to these.

The voice recognition server 200 communicates with the voice recognition speaker device 100 through the network 400 and may be implemented by at least one computer device. The voice recognition server 200 may be distributed in cloud form and can provide instructions, code, files, content, and the like.

The voice recognition server 200 can convert a voice signal received from the voice recognition speaker device 100 into a character string (or text) and generate a voice recognition result. The voice recognition server 200 can also synthesize the voice to be reproduced by the voice recognition speaker device 100, generate a synthesized sound signal, and transmit the synthesized sound signal to the voice recognition speaker device 100.

The voice recognition server 200 can be the component that actually executes the specific functions that the voice recognition speaker device 100 offers. For example, for the voice information providing function, the voice recognition server 200 can recognize the information request contained in the voice signal received from the voice recognition speaker device 100, generate a result in response to the information request, and transmit it to the voice recognition speaker device 100 in the form of a synthesized sound signal. For the telephone call function, the voice recognition server 200 can recognize the call request contained in the voice signal received from the voice recognition speaker device 100, place the call as requested, and relay the transmitted and received signals during the call. The voice recognition server 200 may also be connected to a home appliance through the network 400, and can control the home appliance according to a control command contained in the voice signal received from the voice recognition speaker device 100.
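The way the server maps a voice recognition result to one of these functions can be illustrated with a small dispatcher. This is a sketch under stated assumptions only: the source does not say how a command is matched to a function, so simple substring matching is used here, and `dispatch`, the handler table, and the fallback reply are all hypothetical. Each handler returns text that the server would then pass to voice synthesis.

```python
from typing import Callable, Dict

# A handler takes the recognized text and returns the reply text
# that would be converted to a synthesized sound signal.
Handler = Callable[[str], str]


def dispatch(recognition_result: str, handlers: Dict[str, Handler]) -> str:
    """Run the first handler whose trigger word appears in the
    recognized text (assumed matching rule), or return a fallback."""
    text = recognition_result.lower()
    for trigger, handler in handlers.items():
        if trigger in text:
            return handler(recognition_result)
    return "Sorry, I could not find a matching function."


# Hypothetical handlers standing in for the functions named in the text.
example_handlers: Dict[str, Handler] = {
    "weather": lambda text: "Here is the weather information you asked for.",
    "call": lambda text: "Placing the call now.",
    "temperature": lambda text: "Setting the appliance temperature.",
}
```

For instance, `dispatch("What is the weather today?", example_handlers)` selects the information handler, whereas text with no known trigger word yields the fallback reply.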

The voice recognition server 200 may also be connected to the mobile device 300 through the network 400. The network connecting the voice recognition server 200 to the voice recognition speaker device 100 and the network connecting the voice recognition server 200 to the mobile device 300 may be of different types. For example, the network connecting the voice recognition server 200 to the voice recognition speaker device 100 may be a LAN or the Internet, and the network connecting the voice recognition server 200 to the mobile device 300 may be a mobile communication network. According to one embodiment, the voice recognition server 200 is not connected to the mobile device 300.

The mobile device 300 is an electronic device that supports wireless communication and that a user can carry around. For example, the mobile device 300 may be a mobile phone, a smartphone, a tablet PC (personal computer), or a notebook PC. The mobile device 300 may have a telephone function, a message function, or a messenger function. The mobile device 300 can convert a user's voice into a voice signal and provide the voice signal to the voice recognition server 200. The mobile device 300 can also reproduce an audio signal or a video signal received from the voice recognition server 200. The mobile device 300 is generally an electronic device used by a single individual.

In FIG. 1, the voice recognition speaker device 100 is shown connected through the network 400 to the voice recognition server 200, which executes the voice recognition function. This is only an example, and the voice recognition speaker device 100 can also execute the voice recognition function or the voice synthesis function independently.

FIG. 2 is a block diagram for explaining the internal configuration of the voice recognition speaker device 100 according to an embodiment. Referring to FIG. 2, the voice recognition speaker device 100 may include a processor 110, a microphone 120, a speaker 130, and a communication module 140. The voice recognition speaker device 100 may include more components than those shown in FIG. 2. For example, the voice recognition speaker device 100 may further include a memory. The voice recognition speaker device 100 is connected to the network 400 of FIG. 1 through the communication module 140 and can communicate with the voice recognition server 200.

The microphone 120 can generate an audio signal directly by converting surrounding audio into electrical acoustic data. The voice recognition speaker device 100 may include a plurality of microphones 120 and can use them to determine the direction from which the audio signal arrives. According to another example, the voice recognition speaker device 100 can receive an audio signal transmitted from an external device through the communication module 140. The speaker 130 can convert an audio signal into sound and output it.

The processor 110 is configured to process the instructions of a computer program by performing basic arithmetic, logic, and input/output operations. Such instructions may be provided to the processor 110 from the memory, or may be received through the communication module 140 and provided to the processor 110. For example, the processor 110 may be configured to execute instructions according to program code stored in a recording device such as a memory.

The processor 110 can detect, from the audio signal generated by the microphone 120, a voice signal corresponding to the speaker's voice, and can transmit the detected voice signal to the voice recognition server 200 through the communication module 140. The processor 110 can use a keyword to detect the voice signal in the audio signal: by extracting from the audio signal the keyword voice signal corresponding to the keyword, the processor 110 can identify the voice signal that is received after the keyword voice signal.
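The keyword-based detection described above can be sketched as a small state machine over audio frames. This is a simplified illustration, not the device's actual algorithm: it assumes that some upstream detector (not shown, and not described in the source) has already labeled each frame as `"keyword"`, `"speech"`, `"silence"`, or `"noise"`, and the function name `voice_signal_after_keyword` is hypothetical.

```python
from typing import List, Sequence, Tuple

# (label, raw audio bytes); the labels come from an assumed upstream detector.
Frame = Tuple[str, bytes]


def voice_signal_after_keyword(frames: Sequence[Frame]) -> List[bytes]:
    """Return the audio of the utterance that follows the first keyword.

    Frames before the keyword are ignored, the keyword frames themselves
    are skipped, and collection stops at the first silence frame after
    the command speech has started.
    """
    collected: List[bytes] = []
    state = "waiting"  # waiting -> keyword -> command
    for label, audio in frames:
        if state == "waiting":
            if label == "keyword":
                state = "keyword"
        elif state == "keyword":
            if label == "speech":
                state = "command"
                collected.append(audio)
            # Further keyword frames and any pause before the command are skipped.
        else:  # state == "command"
            if label == "silence":
                break
            collected.append(audio)
    return collected
```

Only the frames returned by this function would be transmitted to the voice recognition server; audio captured before the keyword never leaves the device in this sketch.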

プロセッサ１１０は、音声認識サーバ２００から合成音信号を受信し、スピーカ１３０を介して、合成音信号に対応する合成音を再生することができる。 The processor 110 can receive a synthetic sound signal from the voice recognition server 200 and reproduce a synthetic sound corresponding to the synthetic sound signal via the speaker 130.

図３は、一実施形態による音声認識サーバ２００の内部構成について説明するためのブロック図である。図３を参照すると、音声認識サーバ２００は、プロセッサ２１０、メモリ２２０及び通信モジュール２３０を含む。音声認識サーバ２００は、図３に図示されている構成要素より多くの構成要素を含んでもよい。例えば、音声認識サーバ２００は、入出力装置をさらに含んでもよい。 FIG. 3 is a block diagram for explaining the internal configuration of the voice recognition server 200 according to an embodiment. Referring to FIG. 3, the voice recognition server 200 includes a processor 210, a memory 220, and a communication module 230. The voice recognition server 200 may include more components than those illustrated in FIG. 3. For example, the voice recognition server 200 may further include an input/output device.

通信モジュール２３０は、ネットワーク４００を介して音声認識サーバ２００が音声認識スピーカ装置１００と通信するための機能を提供することができる。音声認識サーバ２００は、通信モジュール２３０を介して、図１のネットワーク４００に接続され、音声認識スピーカ装置１００と通信することができる。一実施形態によれば、音声認識サーバ２００は、通信モジュール２３０を介して、携帯装置３００とも通信することができる。 The communication module 230 can provide a function for the voice recognition server 200 to communicate with the voice recognition speaker device 100 via the network 400. The voice recognition server 200 is connected to the network 400 of FIG. 1 via the communication module 230 and can communicate with the voice recognition speaker device 100. According to one embodiment, the voice recognition server 200 may also communicate with the mobile device 300 via the communication module 230.

メモリ２２０は、コンピュータ読み取り可能な記録媒体であり、ＲＡＭ（random access memory）、ＲＯＭ（read-only memory）及びディスクドライブのような永続的大容量記録装置を含んでよい。メモリ２２０には、オペレーティングシステムと、少なくとも１つのプログラムコード（例えば、音声認識サーバ２００においてインストールされて実行される音声認識アプリケーション、音声合成アプリケーションなどのためのコード）と、が保存される。そのようなソフトウェアコンポーネントは、通信モジュール２３０を利用し、通信を介して、メモリ２２０にロードされ得る。例えば、少なくとも１つのプログラムは、開発者またはアプリケーションのインストールファイルを配布するファイル配布システムが、ネットワーク４００を介して提供するファイルによってインストールされるプログラムに基づき、メモリ２２０にロードされる。 The memory 220 is a computer-readable recording medium and may include a random access memory (RAM), a read-only memory (ROM), and a permanent mass storage device such as a disk drive. The memory 220 stores an operating system and at least one program code (e.g., code for a voice recognition application, a voice synthesis application, and the like installed and executed in the voice recognition server 200). Such software components may be loaded into the memory 220 through communication using the communication module 230. For example, at least one program may be loaded into the memory 220 based on a program installed from files that a developer, or a file distribution system that distributes application installation files, provides via the network 400.

プロセッサ２１０は、基本的な算術、ロジック及び入出力演算を実行するものであり、コンピュータプログラムの命令を処理するように構成され得る。プロセッサ２１０は、メモリ２２０に保存されたプログラムコードによって命令を実行するように構成され得る。 The processor 210 performs basic arithmetic, logic and input/output operations and may be configured to process computer program instructions. Processor 210 may be configured to execute instructions according to program code stored in memory 220.

プロセッサ２１０は、音声認識スピーカ装置１００から、第１話者の音声を含む音声信号を受信し、音声信号に対して音声認識を実行し、第１音声認識結果を生成するように構成され得る。例えば、プロセッサ２１０は、音声信号に対する音声認識を実行するために、音声信号の周波数特性を抽出し、音響モデル及び言語モデルを利用し、音声認識を実行することができる。かような周波数特性は、音響入力の周波数スペクトルを分析して抽出される音響入力の周波数成分の分布を意味する。音響モデル及び言語モデルは、メモリ２２０に保存される。ただし、音声認識方法は、これに限定されるものではなく、音声信号を文字列（または、テキスト）に変換する多様な技術が使用される。本開示において、第１音声認識結果は、第１話者の音声を含む音声信号に対して音声認識を実行した結果を意味する。 The processor 210 may be configured to receive a voice signal including the voice of the first speaker from the voice recognition speaker device 100, perform voice recognition on the voice signal, and generate a first voice recognition result. For example, to perform voice recognition on the voice signal, the processor 210 may extract frequency characteristics of the voice signal and perform voice recognition using an acoustic model and a language model. Here, the frequency characteristics mean the distribution of frequency components of an acoustic input, extracted by analyzing the frequency spectrum of the acoustic input. The acoustic model and the language model may be stored in the memory 220. However, the voice recognition method is not limited to this, and various techniques for converting a voice signal into a character string (or text) may be used. In the present disclosure, the first voice recognition result means the result of performing voice recognition on the voice signal including the voice of the first speaker.

プロセッサ２１０は、音声信号を分析し、音声信号に含まれている音声を発話した話者がだれであるかを決定することができる。プロセッサ２１０は、音声信号から、第１話者特徴ベクトルを抽出し、第１話者特徴ベクトルを登録された話者特徴ベクトルと比較し、この比較の結果により、音声信号の話者が登録された第２話者であると決定するように構成され得る。例えば、プロセッサ２１０は、第１話者特徴ベクトルと登録された話者特徴ベクトルとの類似度を計算し、計算された類似度を基準値と比較することにより、音声信号の話者を識別することができる。本明細書において、第１話者特徴ベクトルは、第１話者の音声を含む音声信号から抽出された話者特徴ベクトルを意味する。登録された第２話者は、音声認識システムに登録されたユーザのうちの一人であり、音声認識スピーカ装置１００を正常に使用するように事前に登録された者を意味する。 The processor 210 can analyze the voice signal and determine who uttered the voice contained in the voice signal. The processor 210 may be configured to extract a first speaker feature vector from the voice signal, compare the first speaker feature vector with a registered speaker feature vector, and determine, according to the result of the comparison, that the speaker of the voice signal is a registered second speaker. For example, the processor 210 can identify the speaker of the voice signal by calculating the similarity between the first speaker feature vector and the registered speaker feature vector and comparing the calculated similarity with a reference value. In the present specification, the first speaker feature vector means a speaker feature vector extracted from the voice signal including the voice of the first speaker. The registered second speaker is one of the users registered in the voice recognition system, that is, a person registered in advance as a legitimate user of the voice recognition speaker device 100.

登録された話者特徴ベクトルは、ユーザ登録時、第２話者の関連情報として、メモリ２２０に事前に保存される。音声認識サーバ２００には、複数の話者が登録され、その場合、メモリ２２０には、複数の登録された話者特徴ベクトルが保存される。登録された話者特徴ベクトルは、登録された話者の関連情報であり、登録された話者にそれぞれ対応する。第２話者は、音声認識サーバ２００に事前に登録されたユーザのうちの一人である。 The registered speaker feature vector is stored in the memory 220 in advance as related information of the second speaker at the time of user registration. A plurality of speakers are registered in the voice recognition server 200, and in that case, a plurality of registered speaker feature vectors are stored in the memory 220. The registered speaker feature vector is related information of the registered speaker and corresponds to each registered speaker. The second speaker is one of the users registered in advance in the voice recognition server 200.

プロセッサ２１０は、音声信号の話者を決定するために、音響モデルから抽出された事後情報（states posteriors）、一般的背景モデル及び全体変異性変換情報のうち少なくとも一つを利用し、音声信号の周波数特性から、話者特徴ベクトルを生成することができる。メモリ２２０には、事後情報、一般的背景モデル、全体変異性変換情報、及び登録された話者情報のうち少なくとも一つが保存される。 To determine the speaker of the voice signal, the processor 210 can generate a speaker feature vector from the frequency characteristics of the voice signal by using at least one of posterior information (state posteriors) extracted from the acoustic model, a universal background model, and total variability transformation information. The memory 220 may store at least one of the posterior information, the universal background model, the total variability transformation information, and registered speaker information.

プロセッサ２１０は、第１話者特徴ベクトルとメモリ２２０に保存された登録された話者特徴ベクトルとに基づいて、音声信号の話者が登録された話者であるか否かを決定することができる。プロセッサ２１０は、第１話者特徴ベクトルを、登録された話者特徴ベクトルそれぞれと比較することができる。プロセッサ２１０は、第１話者特徴ベクトルと最も類似度が高い登録された話者特徴ベクトルを、登録された第２話者特徴ベクトルとして選択することができる。最も高い類似度が、第１基準値以上である場合、プロセッサ２１０は、登録された第２話者特徴ベクトルの登録された第２話者を、音声信号の話者であると決定することができる。最も高い類似度が第１基準値未満である場合、プロセッサ２１０は、音声信号の話者が登録されていない話者であると決定することができる。 The processor 210 can determine whether the speaker of the voice signal is a registered speaker based on the first speaker feature vector and the registered speaker feature vectors stored in the memory 220. The processor 210 can compare the first speaker feature vector with each of the registered speaker feature vectors. The processor 210 can select the registered speaker feature vector having the highest similarity to the first speaker feature vector as the registered second speaker feature vector. If the highest similarity is greater than or equal to a first reference value, the processor 210 can determine that the registered second speaker corresponding to the registered second speaker feature vector is the speaker of the voice signal. If the highest similarity is less than the first reference value, the processor 210 can determine that the speaker of the voice signal is an unregistered speaker.
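
The selection of the most similar registered speaker feature vector and the comparison with the first reference value can be sketched as follows. This is only an illustrative sketch: the description does not specify the similarity measure, so cosine similarity is assumed here, and the names `cosine_similarity`, `identify_speaker`, `registered`, and the sample vectors and threshold are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_speaker(first_vector, registered_vectors, first_reference):
    """Return (speaker_id, similarity); speaker_id is None for an unregistered speaker."""
    best_id, best_score = None, float("-inf")
    # Compare the first speaker feature vector with every registered vector.
    for speaker_id, vector in registered_vectors.items():
        score = cosine_similarity(first_vector, vector)
        if score > best_score:
            best_id, best_score = speaker_id, score
    # Below the first reference value the speaker is treated as unregistered.
    if best_id is None or best_score < first_reference:
        return None, best_score
    return best_id, best_score


# Hypothetical registered speaker feature vectors (assumed values).
registered = {"alice": [1.0, 0.0, 0.2], "bob": [0.0, 1.0, 0.1]}
speaker, score = identify_speaker([0.9, 0.1, 0.2], registered, first_reference=0.7)
```

Returning the similarity together with the identifier lets a later step compare the same score with the second reference value without recomputing it.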

プロセッサ２１０は、第１話者または第２話者に認証音声を要求し、第１話者または第２話者から、認証音声信号を受信するように構成され得る。プロセッサ２１０は、受信された認証音声信号を基に、第２話者が第１話者と同一であるか否かを追加で確認することにより、第１話者と第２話者との同一性を認証することができる。 The processor 210 may be configured to request an authentication voice from the first speaker or the second speaker and to receive an authentication voice signal from the first speaker or the second speaker. Based on the received authentication voice signal, the processor 210 can additionally check whether the second speaker is the same person as the first speaker, thereby authenticating the identity of the first speaker and the second speaker.

プロセッサ２１０は、第１話者と第２話者との同一性が認証された場合、第１音声認識結果に対応する動作を実行するように構成され得る。プロセッサ２１０は、第１話者特徴ベクトルと登録された第２話者特徴ベクトルとの類似度が、第１基準値より高い第２基準値以上である場合、登録された第２話者を、音声信号の話者と見なし、話者認証過程を省略し、第１音声認識結果に対応する動作を即座に実行するように構成され得る。 The processor 210 may be configured to execute an operation corresponding to the first voice recognition result when the identity of the first speaker and the second speaker is authenticated. When the similarity between the first speaker feature vector and the registered second speaker feature vector is greater than or equal to a second reference value that is higher than the first reference value, the processor 210 may be configured to regard the registered second speaker as the speaker of the voice signal, omit the speaker authentication process, and immediately execute the operation corresponding to the first voice recognition result.

プロセッサ２１０は、第１音声認識結果に対応する機能を決定し、この機能を実行することができる。プロセッサ２１０は、動作の実行結果を報告するための合成音信号を生成するように構成され得る。プロセッサ２１０は、合成音信号を音声認識スピーカ装置１００に送信するように構成され得る。 The processor 210 can determine the function corresponding to the first speech recognition result and execute the function. The processor 210 may be configured to generate a synthetic sound signal for reporting the performance result of the operation. The processor 210 may be configured to send the synthetic sound signal to the voice recognition speaker device 100.

音声認識サーバ２００は、入出力装置であるマイクロフォンまたはスピーカをさらに含んでもよい。音声認識サーバ２００は、音声信号を直接生成し、合成音を直接再生することもできる。 The voice recognition server 200 may further include a microphone or a speaker that is an input/output device. The voice recognition server 200 can also directly generate a voice signal and directly reproduce a synthetic voice.

図４Ａは、一実施形態による音声認識サーバのプロセッサの内部構成について説明するためのブロック図である。図４Ａを参照すると、音声認識サーバ２００のプロセッサ２１０は、音声信号受信部２１１、音声認識部２１２、話者認識部２１３、話者認証部２１４、機能部２１５、及び合成音信号生成部２１６を含む。話者認識部２１３は、話者特徴ベクトル抽出部２１３ａ、話者特徴ベクトル比較部２１３ｂ及び登録話者決定部２１３ｃを含む。 FIG. 4A is a block diagram for explaining the internal configuration of the processor of the voice recognition server according to an embodiment. Referring to FIG. 4A, the processor 210 of the voice recognition server 200 includes a voice signal receiving unit 211, a voice recognition unit 212, a speaker recognition unit 213, a speaker authentication unit 214, a functional unit 215, and a synthetic sound signal generation unit 216. The speaker recognition unit 213 includes a speaker feature vector extraction unit 213a, a speaker feature vector comparison unit 213b, and a registered speaker determination unit 213c.

音声信号受信部２１１は、音声認識スピーカ装置１００から、第１話者の音声を含む音声信号を受信する。 The voice signal receiving unit 211 receives a voice signal including the voice of the first speaker from the voice recognition speaker device 100.

音声認識部２１２は、音声信号受信部２１１によって受信された音声信号に対して音声認識を実行し、第１音声認識結果を生成する。音声認識部２１２は、音声信号に対して音声認識を実行し、話者の音声を文字列（または、テキスト）に変換することができる。音声認識部２１２は、変換された文字列（または、テキスト）を自然言語処理し、音声信号に含まれている話者の命令を抽出することができる。第１音声認識結果は、第１話者の命令を含み、音声認識結果に対応する動作は、第１話者の命令による動作を意味する。 The voice recognition unit 212 performs voice recognition on the voice signal received by the voice signal reception unit 211 and generates a first voice recognition result. The voice recognition unit 212 can perform voice recognition on the voice signal and convert the voice of the speaker into a character string (or text). The voice recognition unit 212 can perform natural language processing on the converted character string (or text) to extract a speaker command included in the voice signal. The first voice recognition result includes the command of the first speaker, and the action corresponding to the voice recognition result means the action according to the command of the first speaker.

話者認識部２１３は、音声信号受信部２１１によって受信された音声信号の話者が、第２話者であると決定する。例えば、話者特徴ベクトル抽出部２１３ａは、音声信号受信部２１１によって受信された音声信号から話者特徴ベクトルを抽出する。話者特徴ベクトル抽出部２１３ａは、時間領域ベースの音声信号を、周波数領域の信号に変換し、変換された信号の周波数エネルギーが互いに異なるように変形することにより、話者特徴ベクトルを抽出することができる。例えば、かような話者特徴ベクトルは、メル周波数ケプストラム係数またはフィルタバンクエネルギーを基に抽出されるが、それらに限定されるものではなく、多様な方式で、オーディオデータから話者特徴ベクトルを抽出することができる。第１話者の音声を含む音声信号から抽出された話者特徴ベクトルは、第１話者特徴ベクトルと呼ばれる。 The speaker recognition unit 213 determines that the speaker of the voice signal received by the voice signal receiving unit 211 is the second speaker. For example, the speaker feature vector extraction unit 213a extracts a speaker feature vector from the voice signal received by the voice signal receiving unit 211. The speaker feature vector extraction unit 213a can extract the speaker feature vector by converting the time-domain voice signal into a frequency-domain signal and transforming the converted signal so that its frequency energies differ from one another. For example, the speaker feature vector may be extracted based on mel-frequency cepstral coefficients or filter bank energies, but is not limited thereto, and the speaker feature vector can be extracted from audio data in various ways. The speaker feature vector extracted from the voice signal including the voice of the first speaker is called the first speaker feature vector.
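
The conversion from the time domain to per-band frequency energies can be illustrated with a small sketch. It is a deliberate simplification and not the extraction method of the embodiment: it uses a naive discrete Fourier transform and equal-width bands rather than mel-spaced filters, and the function names, the frame length of 64 samples, and the band count are assumptions made for the example.

```python
import cmath
import math


def power_spectrum(frame):
    """Naive DFT power spectrum of a real-valued frame (bins 0 .. N/2)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def log_band_energies(frame, num_bands=4):
    """Sum the power spectrum into equal-width bands and take the log of each sum."""
    spectrum = power_spectrum(frame)
    band_size = max(1, len(spectrum) // num_bands)
    energies = []
    for b in range(num_bands):
        start = b * band_size
        # The last band absorbs any leftover bins.
        end = len(spectrum) if b == num_bands - 1 else start + band_size
        # A small constant avoids log(0) for empty or silent bands.
        energies.append(math.log(sum(spectrum[start:end]) + 1e-10))
    return energies


# A test tone that completes 20 cycles within a 64-sample frame (assumed input).
frame = [math.sin(2 * math.pi * 20 * t / 64) for t in range(64)]
energies = log_band_energies(frame)
```

A practical extractor would typically apply windowing, mel-scaled filters, and further modelling on top of such energies; the sketch only shows the time-to-frequency step and the per-band energy summary.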

話者特徴ベクトル比較部２１３ｂは、話者特徴ベクトル抽出部２１３ａによって抽出された第１話者特徴ベクトルを、メモリ２２０に保存されている登録された話者特徴ベクトルと比較する。例えば、話者特徴ベクトル比較部２１３ｂは、第１話者特徴ベクトルと登録された話者特徴ベクトルとの類似度を計算する。 The speaker feature vector comparison unit 213b compares the first speaker feature vector extracted by the speaker feature vector extraction unit 213a with the registered speaker feature vector stored in the memory 220. For example, the speaker feature vector comparison unit 213b calculates the similarity between the first speaker feature vector and the registered speaker feature vector.

メモリ２２０には、複数の登録された話者特徴ベクトルが存在し、話者特徴ベクトル比較部２１３ｂは、第１話者特徴ベクトルを、複数の登録された話者特徴ベクトルそれぞれと比較し、最も類似度が高い登録された話者特徴ベクトルを決定する。最も類似度が高い登録された話者特徴ベクトルは、第２話者特徴ベクトルと呼ばれる。 A plurality of registered speaker feature vectors may exist in the memory 220. The speaker feature vector comparison unit 213b compares the first speaker feature vector with each of the plurality of registered speaker feature vectors and determines the registered speaker feature vector with the highest similarity. The registered speaker feature vector with the highest similarity is called the second speaker feature vector.

登録話者決定部２１３ｃは、話者特徴ベクトル比較部２１３ｂの比較の結果により、音声信号の話者が、登録された第２話者であると決定する。例えば、登録話者決定部２１３ｃは、第１話者特徴ベクトルと第２話者特徴ベクトルとの類似度が、第１基準値以上であり、第２基準値未満である場合、音声信号の話者が、第２話者特徴ベクトルに対応する第２話者であると決定することができる。第２基準値は、第１基準値より高い。第２話者は、音声認識システムまたは音声認識サーバ２００に登録されたユーザのうちの１ユーザである。そのような側面において、第２話者は、登録された第２話者とも呼ばれる。 The registered speaker determination unit 213c determines, based on the result of the comparison by the speaker feature vector comparison unit 213b, that the speaker of the voice signal is the registered second speaker. For example, when the similarity between the first speaker feature vector and the second speaker feature vector is greater than or equal to the first reference value and less than the second reference value, the registered speaker determination unit 213c can determine that the speaker of the voice signal is the second speaker corresponding to the second speaker feature vector. The second reference value is higher than the first reference value. The second speaker is one of the users registered in the voice recognition system or the voice recognition server 200. In this respect, the second speaker is also referred to as the registered second speaker.

登録話者決定部２１３ｃは、第１話者特徴ベクトルと第２話者特徴ベクトルとの類似度が、第２基準値以上である場合、音声信号の話者が第２話者であると見なすことができる。その場合、第１話者特徴ベクトルと第２話者特徴ベクトルとの類似度が非常に高いために、第２話者に対する認証が省略される。 When the similarity between the first speaker feature vector and the second speaker feature vector is greater than or equal to the second reference value, the registered speaker determination unit 213c can regard the speaker of the voice signal as the second speaker. In that case, because the similarity between the first speaker feature vector and the second speaker feature vector is very high, authentication of the second speaker may be omitted.

第１話者特徴ベクトルと登録された話者特徴ベクトルそれぞれとの類似度が、いずれも事前に設定された基準値を超えない場合、すなわち、第１話者特徴ベクトルと第２話者特徴ベクトルとの類似度が第１基準値未満である場合、登録話者決定部２１３ｃは、音声信号の話者が、登録されたユーザの中にいない、すなわち、音声信号の話者が、登録されていないユーザであると決定することができる。その場合、プロセッサ２１０は、音声認識結果に対応する動作を実行しないか、あるいは、音声認識結果に対応する動作が、だれもが実行することができる動作に設定されている場合に限り、かような動作を実行することができる。 When none of the similarities between the first speaker feature vector and the registered speaker feature vectors exceeds a preset reference value, that is, when the similarity between the first speaker feature vector and the second speaker feature vector is less than the first reference value, the registered speaker determination unit 213c can determine that the speaker of the voice signal is not among the registered users, that is, that the speaker of the voice signal is an unregistered user. In that case, the processor 210 either does not execute the operation corresponding to the voice recognition result, or executes it only if that operation is set as one that anyone may perform.
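
The three outcomes described for the registered speaker determination unit 213c (unregistered, identified but requiring authentication, identified with authentication omitted) can be summarized in a short sketch. The enum, the function name, and the numeric reference values used in the usage line are hypothetical and serve only to illustrate the two-threshold comparison.

```python
from enum import Enum


class Decision(Enum):
    UNREGISTERED = "unregistered"            # similarity < first reference value
    NEEDS_AUTHENTICATION = "needs_auth"      # first <= similarity < second
    AUTHENTICATION_OMITTED = "auth_omitted"  # similarity >= second reference value


def classify_similarity(similarity, first_reference, second_reference):
    """Map the highest similarity onto one of the three outcomes."""
    if second_reference <= first_reference:
        raise ValueError("second reference value must be higher than the first")
    if similarity < first_reference:
        return Decision.UNREGISTERED
    if similarity < second_reference:
        return Decision.NEEDS_AUTHENTICATION
    return Decision.AUTHENTICATION_OMITTED


# Assumed reference values, chosen only for the example.
decision = classify_similarity(0.82, first_reference=0.7, second_reference=0.9)
```

Both boundaries are inclusive on the upper band, matching the "greater than or equal to" wording used for the first and second reference values.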

話者認証部２１４は、第１話者または第２話者に認証音声を要求し、第１話者または第２話者から、認証音声信号を受信するように構成され得る。一例によれば、話者認証部２１４は、音声認識スピーカ装置１００に認証音声要求を送信することができる。音声認識スピーカ装置１００を使用している第１話者は、認証音声要求を受信することができる。第１話者は、音声認識スピーカ装置１００に認証音声を発話し、音声認識スピーカ装置１００は、認証音声に対応する認証音声信号をプロセッサ２１０に送信することができる。 The speaker authentication unit 214 may be configured to request an authentication voice from the first speaker or the second speaker and to receive an authentication voice signal from the first speaker or the second speaker. According to an example, the speaker authentication unit 214 can transmit an authentication voice request to the voice recognition speaker device 100. The first speaker using the voice recognition speaker device 100 can receive the authentication voice request. The first speaker utters the authentication voice to the voice recognition speaker device 100, and the voice recognition speaker device 100 can transmit an authentication voice signal corresponding to the authentication voice to the processor 210.

他の例によれば、話者認証部２１４は、第２話者の携帯装置３００に認証音声を要求し、第２話者は、認証音声要求に応じて、認証音声を発話することができる。第２話者は、認証音声を、音声認識スピーカ装置１００に発話するか、あるいは第２話者の携帯装置３００に発話することもできる。第２話者の携帯装置３００の識別情報は、第２話者が音声認識システムにユーザとして登録するときに入力され、メモリ２２０に事前に保存されることが可能である。ここで、話者認証部２１４は、第２話者の携帯装置３００に、認証音声の要求と共に、音声認識結果及び音声信号を送信することができる。第２話者の携帯装置３００は、第２話者の登録された携帯装置３００とも呼ばれる。 According to another example, the speaker authentication unit 214 requests an authentication voice from the mobile device 300 of the second speaker, and the second speaker can utter the authentication voice in response to the authentication voice request. The second speaker can utter the authentication voice either to the voice recognition speaker device 100 or to the mobile device 300 of the second speaker. Identification information of the mobile device 300 of the second speaker can be entered when the second speaker registers as a user in the voice recognition system and can be stored in the memory 220 in advance. Here, the speaker authentication unit 214 can transmit the voice recognition result and the voice signal to the mobile device 300 of the second speaker together with the request for the authentication voice. The mobile device 300 of the second speaker is also referred to as the registered mobile device 300 of the second speaker.

話者認証部２１４は、受信された認証音声信号を基に、第２話者が第１話者と同一であるか否かを追加で確認し、第１話者と第２話者との同一性を認証することができる。 Based on the received authentication voice signal, the speaker authentication unit 214 can additionally check whether the second speaker is the same person as the first speaker and authenticate the identity of the first speaker and the second speaker.

機能部２１５は、話者認証部２１４において、第１話者と第２話者との同一性が認証された場合、音声認識部２１２によって生成された第１音声認識結果に対応する動作を実行する。機能部２１５は、第１話者特徴ベクトルと第２話者特徴ベクトルとの類似度が、第２基準値以上である場合、話者認証手続きなしに、音声認識部２１２によって生成された第１音声認識結果に対応する動作を実行することができる。 When the speaker authentication unit 214 authenticates the identity of the first speaker and the second speaker, the functional unit 215 executes the operation corresponding to the first voice recognition result generated by the voice recognition unit 212. When the similarity between the first speaker feature vector and the second speaker feature vector is greater than or equal to the second reference value, the functional unit 215 can execute the operation corresponding to the first voice recognition result generated by the voice recognition unit 212 without the speaker authentication procedure.

機能部２１５は、登録話者決定部２１３ｃにおいて、音声信号の話者が登録されていない話者であると決定されるか、あるいは、話者認証部２１４において、第１話者と第２話者との同一性が認証されなかった場合、第１音声認識結果に対応する動作を実行しない。 The functional unit 215 does not execute the operation corresponding to the first voice recognition result when the registered speaker determination unit 213c determines that the speaker of the voice signal is an unregistered speaker, or when the speaker authentication unit 214 fails to authenticate the identity of the first speaker and the second speaker.

合成音信号生成部２１６は、機能部２１５が動作を実行した場合、動作の実行結果を報告するための合成音信号を生成する。合成音信号生成部２１６は、音声信号の話者が、登録されていないユーザであると決定され、第１音声認識結果に対応する動作が実行されていない場合、または、話者認証部２１４において、同一性が認証されず、第１音声認識結果に対応する動作が実行されていない場合、動作が実行されていないということを報告するための合成音信号を生成することができる。 When the functional unit 215 executes the operation, the synthetic sound signal generation unit 216 generates a synthetic sound signal for reporting the execution result of the operation. When the operation corresponding to the first voice recognition result has not been executed, either because the speaker of the voice signal was determined to be an unregistered user or because the speaker authentication unit 214 did not authenticate the identity, the synthetic sound signal generation unit 216 can generate a synthetic sound signal for reporting that the operation has not been executed.

他の実施形態によれば、プロセッサ２１０は、話者ベクトル改善部をさらに含んでもよい。第１話者特徴ベクトルと第２話者特徴ベクトルとの類似度が、第２基準値以上であるか、あるいは、話者認証部２１４において、第１話者と第２話者との同一性が認証された場合、音声信号の話者が、第２話者であるか否かということが確認されたわけであるので、かような話者ベクトル改善部は、第１話者特徴ベクトルを利用し、メモリ２２０に保存された第２話者の第２話者特徴ベクトルを改善させることができる。かような話者特徴ベクトル改善部は、音声信号から抽出された第１話者特徴ベクトルを利用した適応訓練方式を介して、第２話者の第２話者特徴ベクトルを生成し、新たに生成された第２話者特徴ベクトルが、適応訓練以前の第２話者特徴ベクトルに比べ、適応訓練性能が上昇した場合、新たに生成された第２話者特徴ベクトルを、メモリ２２０に保存することにより、第２話者特徴ベクトルを改善させることができる。 According to another embodiment, the processor 210 may further include a speaker vector improving unit. When the similarity between the first speaker feature vector and the second speaker feature vector is greater than or equal to the second reference value, or when the speaker authentication unit 214 has authenticated the identity of the first speaker and the second speaker, it has been confirmed that the speaker of the voice signal is the second speaker. The speaker vector improving unit can therefore use the first speaker feature vector to improve the second speaker feature vector of the second speaker stored in the memory 220. The speaker feature vector improving unit generates a new second speaker feature vector for the second speaker through adaptive training that uses the first speaker feature vector extracted from the voice signal. If the newly generated second speaker feature vector shows better adaptive training performance than the second speaker feature vector before the adaptive training, the speaker feature vector improving unit stores the newly generated vector in the memory 220, thereby improving the second speaker feature vector.
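
The update-only-if-improved behaviour of the speaker vector improving unit can be sketched as follows. The description does not define the adaptive training method or how its performance is measured, so this sketch substitutes two assumptions: the candidate vector is a simple interpolation between the stored vector and the new one, and "performance" is the mean cosine similarity to a hypothetical set of previously verified vectors. All names and values are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_similarity(candidate, verified_vectors):
    """Assumed performance score: average similarity to verified utterance vectors."""
    total = sum(cosine_similarity(candidate, v) for v in verified_vectors)
    return total / len(verified_vectors)


def improve_registered_vector(registered, first_vector, verified_vectors, rate=0.2):
    """Return the updated vector if it scores better, otherwise the original."""
    # Assumed update rule: move the stored vector a fraction `rate` toward the new one.
    candidate = [(1 - rate) * r + rate * f for r, f in zip(registered, first_vector)]
    if mean_similarity(candidate, verified_vectors) > mean_similarity(registered, verified_vectors):
        return candidate
    return registered


# Hypothetical stored vector, newly extracted vector, and verified vectors.
updated = improve_registered_vector(
    [1.0, 0.0], [0.8, 0.6], verified_vectors=[[0.8, 0.6], [0.9, 0.4]]
)
```

Keeping the stored vector whenever the candidate does not score better mirrors the condition that the new vector is saved only when performance has increased.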

図４Ｂは、他の実施形態による音声認識サーバのプロセッサの内部構成について説明するためのブロック図である。図４Ｂを参照すると、音声認識サーバ２００のプロセッサ２１０ａは、音声信号受信部２１１、音声認識部２１２、話者認識部２１３、認証いかん決定部２１７、話者認証部２１４、機能部２１５及び合成音信号生成部２１６を含む。 FIG. 4B is a block diagram for explaining the internal configuration of the processor of the voice recognition server according to another embodiment. Referring to FIG. 4B, the processor 210a of the voice recognition server 200 includes a voice signal receiving unit 211, a voice recognition unit 212, a speaker recognition unit 213, an authentication determination unit 217, a speaker authentication unit 214, a functional unit 215, and a synthetic sound signal generation unit 216.

第１話者特徴ベクトルと第２話者特徴ベクトルとの類似度が、第１基準値以上であり、第２基準値未満であるため、登録話者決定部２１３ｃにおいて、音声信号の話者が、第２話者であると決定された場合、認証いかん決定部２１７は、第１音声認識結果に対応する動作、及び第２話者の設定のうち少なくとも一つを基に、話者認証部２１４の動作を実行するか否かを決定する。 When the registered speaker determination unit 213c has determined that the speaker of the voice signal is the second speaker because the similarity between the first speaker feature vector and the second speaker feature vector is greater than or equal to the first reference value and less than the second reference value, the authentication determination unit 217 determines whether to execute the operation of the speaker authentication unit 214 based on at least one of the operation corresponding to the first voice recognition result and the settings of the second speaker.

一例によれば、第１音声認識結果による動作が、第２話者が事前に設定した事前承認動作リストに含まれる場合、認証いかん決定部２１７は、話者認証部２１４の動作を実行するように決定することができる。事前承認動作リストは、メモリ２２０に保存され、音声認識スピーカ装置１００または音声認識サーバ２００が実行することができる動作のうち一部の動作が、事前に設定した事前承認動作リストに含まれてよい。例えば、金融取り引きやインターネットショッピング、メッセージ送信のような動作が、事前に設定した事前承認動作リストに含まれてよい。かような事前承認動作リストに含まれる動作は、登録された話者ごとに異なるようにも設定される。 According to an example, when the operation according to the first voice recognition result is included in a pre-approval operation list preset by the second speaker, the authentication determination unit 217 can determine that the operation of the speaker authentication unit 214 is to be executed. The pre-approval operation list is stored in the memory 220, and some of the operations that the voice recognition speaker device 100 or the voice recognition server 200 can execute may be included in the preset pre-approval operation list. For example, operations such as financial transactions, Internet shopping, and message transmission may be included in the preset pre-approval operation list. The operations included in the pre-approval operation list may be set differently for each registered speaker.

他の例によれば、第１音声認識結果による動作が、第２話者が事前に設定した事後通知動作リストに含まれる場合、認証いかん決定部２１７は、機能部２１５により第１音声認識結果に対応する動作がまず実行され、第２話者の携帯装置３００に音声認識結果及び音声信号を送信するように決定することができる。かような事後通知動作リストは、メモリ２２０に保存され、音声認識スピーカ装置１００が実行することができる動作のうち一部の動作が、事前に設定した事後通知動作リストに含まれてよい。例えば、電話かけ、設定変更のような動作が、事前に設定した事後通知動作リストに含まれてよい。かような事後通知動作リストに含まれる動作は、登録された話者ごとに異なるようにも設定される。 According to another example, when the operation according to the first voice recognition result is included in a post-notification operation list preset by the second speaker, the authentication determination unit 217 can determine that the functional unit 215 first executes the operation corresponding to the first voice recognition result and that the voice recognition result and the voice signal are then transmitted to the mobile device 300 of the second speaker. The post-notification operation list is stored in the memory 220, and some of the operations that the voice recognition speaker device 100 can execute may be included in the preset post-notification operation list. For example, operations such as making a phone call and changing settings may be included in the preset post-notification operation list. The operations included in the post-notification operation list may be set differently for each registered speaker.
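
The list-based decisions of the authentication determination unit 217 can be sketched as a lookup against two per-speaker sets. The step labels, the `settings` structure, and the action names are hypothetical. The description does not say what happens to an operation that appears in neither list, so the fallback to authentication below is an assumption.

```python
def plan_steps(action, pre_approval_actions, post_notification_actions):
    """Return the ordered steps for an action requested by an identified speaker."""
    if action in pre_approval_actions:
        # Listed by the speaker as requiring speaker authentication before execution.
        return ["authenticate_speaker", "execute_action"]
    if action in post_notification_actions:
        # Executed first; the speaker's registered mobile device is informed afterwards.
        return ["execute_action", "notify_mobile_device"]
    # Assumption: an action in neither list falls back to speaker authentication.
    return ["authenticate_speaker", "execute_action"]


# Hypothetical per-speaker settings, since the lists may differ for each registered speaker.
settings = {
    "alice": {
        "pre_approval": {"financial_transaction", "internet_shopping", "send_message"},
        "post_notification": {"make_call", "change_settings"},
    }
}
steps = plan_steps(
    "make_call",
    settings["alice"]["pre_approval"],
    settings["alice"]["post_notification"],
)
```

Storing the lists per speaker keeps the policy data separate from the decision logic, so each registered speaker can change the lists without touching the code path.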

さらに他の例によれば、認証いかん決定部２１７は、第２話者の携帯装置３００の位置及び現在時間のうち少なくとも一つが、事前承認条件に符合する場合、話者認証部２１４の動作を実行するように決定することができる。例えば、第２話者の携帯装置３００が、音声認識スピーカ装置１００の位置の近くに位置する場合、例えば、第２話者の携帯装置３００と、音声認識スピーカ装置１００とが同一無線Ｗｉ−Ｆｉ（登録商標）アクセスポイントに接続される場合や、第２話者の携帯装置３００のＧＰＳ（global positioning system）位置または無線網接続位置が、音声認識スピーカ装置１００の位置と実質的に一致する場合、第２話者が、音声信号受信部２１１によって受信された音声信号に含まれている音声を実際に発話した可能性が高いので、認証いかん決定部２１７は、話者認証部２１４の動作を省略することができる。登録された話者は、かような話者認証部２１４の動作の省略いかんをそれぞれ設定することができる。 According to still another example, the authentication determination unit 217 can determine that the operation of the speaker authentication unit 214 is to be executed when at least one of the position of the mobile device 300 of the second speaker and the current time satisfies a pre-approval condition. Conversely, when the mobile device 300 of the second speaker is located near the voice recognition speaker device 100, for example, when the mobile device 300 of the second speaker and the voice recognition speaker device 100 are connected to the same wireless Wi-Fi (registered trademark) access point, or when the GPS (global positioning system) position or the wireless network connection position of the mobile device 300 of the second speaker substantially matches the position of the voice recognition speaker device 100, it is highly likely that the second speaker actually uttered the voice included in the voice signal received by the voice signal receiving unit 211, and the authentication determination unit 217 can therefore omit the operation of the speaker authentication unit 214. Each registered speaker can set whether the operation of the speaker authentication unit 214 is to be omitted in this way.

認証いかん決定部２１７は、第２話者が設定した時間、例えば、平日昼時間には、話者認証部２１４の動作を実行するように決定することができる。例えば、会社員である第２話者は、平日昼時間には、家にいない可能性が高いので、家に位置する音声認識スピーカ装置１００が、第２話者の音声を受信する可能性は低い。認証いかん決定部２１７は、そのような場合、話者認証部２１４の動作を実行するように決定することができる。登録された話者は、時間を基に、話者認証部２１４の動作を実行するか否かをそれぞれ設定することができる。 The authentication determination unit 217 can determine that the operation of the speaker authentication unit 214 is to be executed during a time set by the second speaker, for example, weekday daytime. For example, a second speaker who is an office worker is likely to be away from home during the daytime on weekdays, so the voice recognition speaker device 100 located at home is unlikely to receive the voice of the second speaker. In such a case, the authentication determination unit 217 can determine that the operation of the speaker authentication unit 214 is to be executed. Each registered speaker can set, based on time, whether the operation of the speaker authentication unit 214 is to be executed.

事前承認条件は、第２話者によって事前に設定され、メモリ２２０にも保存される。また、現在時間を基に、話者認証部２１４の動作を実行するように決定する場合、事前承認条件は、第２話者の行動パターンに基づいても決定される。第２話者の行動パターンは、第２話者の携帯装置３００の位置や実行動作を基に生成される。例えば、認証いかん決定部２１７は、第２話者の携帯装置３００の位置を、長時間の間収集することができる。認証いかん決定部２１７は、携帯装置３００の位置を分析し、第２話者が音声認識スピーカ装置１００の近くに位置しない時間帯を決定することができる。認証いかん決定部２１７は、現在時間が、この時間帯に該当する場合、話者認証部２１４の動作を実行するように自動的に決定することができる。 The pre-approval condition may be preset by the second speaker and stored in the memory 220. When the decision to execute the operation of the speaker authentication unit 214 is based on the current time, the pre-approval condition may also be determined from the behavior pattern of the second speaker. The behavior pattern of the second speaker is generated based on the position of the mobile device 300 of the second speaker and the operations it executes. For example, the authentication determination unit 217 can collect the position of the mobile device 300 of the second speaker over a long period. The authentication determination unit 217 can analyze the collected positions and determine the time periods during which the second speaker is not located near the voice recognition speaker device 100. When the current time falls within such a time period, the authentication determination unit 217 can automatically determine that the operation of the speaker authentication unit 214 is to be executed.
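
Deriving the time periods in which the second speaker is usually away from the voice recognition speaker device 100 can be sketched as a simple per-hour tally of collected location samples. The sample format, the function names, and the 80% ratio are assumptions; the sketch also ignores the weekday and weekend distinction for brevity.

```python
from collections import defaultdict
from datetime import datetime


def away_hours(location_samples, min_away_ratio=0.8):
    """Return the hours of day in which the device was mostly away from the speaker.

    location_samples: iterable of (datetime, is_near_speaker) pairs (assumed format).
    """
    totals = defaultdict(int)
    away = defaultdict(int)
    for timestamp, is_near in location_samples:
        totals[timestamp.hour] += 1
        if not is_near:
            away[timestamp.hour] += 1
    return {hour for hour, count in totals.items() if away[hour] / count >= min_away_ratio}


def should_authenticate(now, hours_away):
    """Require speaker authentication when the current hour is a usual away hour."""
    return now.hour in hours_away


# Hypothetical history: away at 10:00 and near the speaker at 20:00 on five days.
samples = []
for day in range(1, 6):
    samples.append((datetime(2018, 7, day, 10, 0), False))
    samples.append((datetime(2018, 7, day, 20, 0), True))
hours = away_hours(samples)
needs_auth = should_authenticate(datetime(2018, 7, 26, 10, 30), hours)
```

A threshold ratio rather than a strict "always away" rule tolerates occasional days on which the speaker happens to be at home during an otherwise typical away period.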

図５は、一実施形態による音声認識システムの話者認証方法について説明するための例示的なフローチャートである。図５を参照すると、音声認識システムは、音声認識スピーカ装置１００及び音声認識サーバ２００を含む。 FIG. 5 is an exemplary flowchart for explaining a speaker authentication method of a voice recognition system according to an exemplary embodiment. Referring to FIG. 5, the voice recognition system includes a voice recognition speaker device 100 and a voice recognition server 200.

音声認識スピーカ装置１００は、マイクロフォン１２０（図２Ａ）を利用し、周辺の音を電気的に変換し、オーディオ信号を生成することができる（Ｓ１０１）。 The voice recognition speaker device 100 can use the microphone 120 (FIG. 2A) to electrically convert surrounding sounds and generate an audio signal (S101).

音声認識スピーカ装置１００は、オーディオ信号から音声信号を検出することができる（Ｓ１０２）。かような音声信号は、ユーザの音声を含み得る。ここで、ユーザを第１話者とする。かような音声は、第１話者の音声命令を含み得る。かような音声命令には、音声情報検索、電話かけ、メッセージ送信、金融取り引き、インターネットショッピング、食べ物配達、周辺家電機器制御、スマートホーム制御などが含まれてよい。本例においては、音声命令が金融取り引きに係わるものであり、第１話者の音声が、「Ｂに１００万ウォン送金せよ」というものであると仮定する。第１話者の音声には、音声認識スピーカ装置１００をウェークアップするためのトリガキーワードが含まれてもよい。音声認識スピーカ装置１００は、トリガキーワードを認識することにより、オーディオ信号から音声信号を検出することができる。 The voice recognition speaker device 100 can detect a voice signal from the audio signal (S102). Such a voice signal may include the voice of the user. Here, the user is the first speaker. Such voice may include a voice command of the first speaker. Such voice commands may include voice information retrieval, phone calls, message transmissions, financial transactions, internet shopping, food delivery, peripheral appliance controls, smart home controls, and the like. In this example, it is assumed that the voice command relates to a financial transaction, and the voice of the first speaker is "Send 1 million won to B". The voice of the first speaker may include a trigger keyword for waking up the voice recognition speaker device 100. The voice recognition speaker device 100 can detect the voice signal from the audio signal by recognizing the trigger keyword.

音声認識スピーカ装置１００は、音声信号を音声認識サーバ２００に送信し、音声認識サーバ２００は、音声認識スピーカ装置１００から音声信号を受信する（Ｓ１０３）。 The voice recognition speaker device 100 transmits a voice signal to the voice recognition server 200, and the voice recognition server 200 receives the voice signal from the voice recognition speaker device 100 (S103).

音声認識サーバ２００は、音声信号に対して音声認識を実行し、第１音声認識結果を生成する（Ｓ１０４）。音声認識サーバ２００は、音声信号の周波数特性を抽出し、音響モデル及び言語モデルを利用し、音声認識を実行することができる。音声認識サーバ２００は、音声信号を文字列に変換し、文字列を自然言語処理することにより、音声認識結果を生成することができる。かような音声認識結果は、第１話者の音声命令を含み得る。 The voice recognition server 200 performs voice recognition on the voice signal and generates a first voice recognition result (S104). The voice recognition server 200 can perform the voice recognition by extracting the frequency characteristic of the voice signal and using the acoustic model and the language model. The voice recognition server 200 can generate a voice recognition result by converting a voice signal into a character string and subjecting the character string to natural language processing. The voice recognition result may include the voice command of the first speaker.

音声認識サーバ２００は、音声信号から、第１話者特徴ベクトルを抽出する（Ｓ１０５）。音声認識サーバ２００は、音響モデルから抽出された事後情報、一般的背景モデル、及び全体変異性変換情報のうち少なくとも一つを利用し、音声信号の周波数特性から、第１話者特徴ベクトルを生成することができる。 The voice recognition server 200 extracts the first speaker feature vector from the voice signal (S105). The voice recognition server 200 can generate the first speaker feature vector from the frequency characteristics of the voice signal by using at least one of posterior information extracted from the acoustic model, a universal background model, and total variability transformation information.

音声認識サーバ２００は、第１話者特徴ベクトルと登録された話者特徴ベクトルとを比較する（Ｓ１０６）。登録された話者特徴ベクトルは、メモリ２２０（図３）に保存され、ユーザが音声認識システムに登録するときに入力されるユーザの音声を基に、事前に生成される。音声認識サーバ２００は、第１話者特徴ベクトルと登録された話者特徴ベクトルとの類似度を計算することができる。メモリ２２０には、音声認識スピーカ装置１００の正当なユーザにそれぞれ対応する複数の話者特徴ベクトルが保存され、音声認識サーバ２００は、第１話者特徴ベクトルと登録された話者特徴ベクトルそれぞれとの類似度を計算し、計算された類似度のうち最も高い類似度を決定することができる。ここで、登録された話者特徴ベクトルのうち第１話者特徴ベクトルと最も高い類似度を有する登録された話者特徴ベクトルは、第２話者特徴ベクトルと呼ばれ、最も高い類似度は、第１類似度と呼ばれる。 The voice recognition server 200 compares the first speaker feature vector with the registered speaker feature vectors (S106). A registered speaker feature vector is stored in the memory 220 (FIG. 3) and is generated in advance from the user's voice that is input when the user registers with the voice recognition system. The voice recognition server 200 can calculate the similarity between the first speaker feature vector and a registered speaker feature vector. The memory 220 may store a plurality of speaker feature vectors respectively corresponding to the legitimate users of the voice recognition speaker device 100. The voice recognition server 200 can calculate the similarity between the first speaker feature vector and each of the registered speaker feature vectors and determine the highest of the calculated similarities. Here, the registered speaker feature vector that has the highest similarity to the first speaker feature vector is called the second speaker feature vector, and that highest similarity is called the first similarity.
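
The comparison in step S106 can be sketched as below. The description does not name a similarity measure, so cosine similarity is used here purely as an assumption; `cosine_similarity`, `best_match`, and the speaker identifiers are hypothetical names introduced for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (an assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_match(first_vector, registered):
    """Return (speaker_id, similarity) for the registered vector that is
    most similar to `first_vector`. `registered` maps speaker ids to
    vectors and must not be empty."""
    scored = ((sid, cosine_similarity(first_vector, vec))
              for sid, vec in registered.items())
    return max(scored, key=lambda pair: pair[1])

# Toy two-dimensional vectors; real speaker vectors have many more dimensions.
registered = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
second_speaker, first_similarity = best_match([0.9, 0.1], registered)
```

The returned pair corresponds to the second speaker and the first similarity in the terminology of this description.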

音声認識サーバ２００は、第１類似度を第１基準値ｒｅｆ１と比較する（Ｓ１０７）。第１基準値ｒｅｆ１は、音声認識サーバ２００の話者認識性能によっても決定される。第１類似度が、第１基準値ｒｅｆ１以上である場合、音声認識サーバ２００は、音声信号の話者が、第２話者特徴ベクトルに対応する第２話者であると決定する（Ｓ１０８）。ここで、第２話者は、音声認識システムに登録された音声認識スピーカ装置１００のユーザのうちの一人であり、登録された第２話者とも呼ばれる。 The voice recognition server 200 compares the first similarity with a first reference value ref1 (S107). The first reference value ref1 may be determined according to the speaker recognition performance of the voice recognition server 200. When the first similarity is equal to or higher than the first reference value ref1, the voice recognition server 200 determines that the speaker of the voice signal is the second speaker corresponding to the second speaker feature vector (S108). Here, the second speaker is one of the users of the voice recognition speaker device 100 registered in the voice recognition system, and is also called the registered second speaker.

第１類似度が、第１基準値ｒｅｆ１未満である場合、音声認識サーバ２００は、音声信号の話者が登録されていないユーザであると決定し、第１音声認識結果に対応する動作を実行しない（Ｓ１１５）。その場合、音声認識サーバ２００は、例えば、「音声の話者が識別されず、動作を実行しませんでした。音声を再び入力してください」という合成音に対応する合成音信号を生成することができる（Ｓ１１６）。 When the first similarity is less than the first reference value ref1, the voice recognition server 200 determines that the speaker of the voice signal is an unregistered user and does not execute the operation corresponding to the first voice recognition result (S115). In that case, the voice recognition server 200 can generate a synthetic sound signal corresponding to a synthetic sound such as "The speaker of the voice could not be identified, so the operation was not executed. Please input the voice again" (S116).

音声信号の話者が第２話者であると決定された場合、音声認識サーバ２００は、第１類似度を第２基準値ｒｅｆ２と比較する（Ｓ１０９）。第２基準値ｒｅｆ２は、第１基準値ｒｅｆ１より高く、音声認識サーバ２００の話者認識性能によっても決定される。図５に図示されている段階（Ｓ１０７ないしＳ１０９）の順序は、例示的なものであり、それらの順序は可変的である。 When it is determined that the speaker of the voice signal is the second speaker, the voice recognition server 200 compares the first similarity with the second reference value ref2 (S109). The second reference value ref2 is higher than the first reference value ref1 and is also determined by the speaker recognition performance of the voice recognition server 200. The order of the steps (S107 to S109) illustrated in FIG. 5 is exemplary, and the order is variable.

第１類似度が、第２基準値ｒｅｆ２以上である場合、第１話者特徴ベクトルと第２話者特徴ベクトルとの類似度が非常に高いので、音声認識サーバ２００は、音声信号の話者が第２話者であると見なすことができる。その場合、音声認識サーバ２００は、さらなる認証手続きなしに、第１音声認識結果に対応する動作を実行することができる（Ｓ１１４）。 When the first similarity is equal to or higher than the second reference value ref2, the similarity between the first speaker feature vector and the second speaker feature vector is very high, so the voice recognition server 200 can regard the speaker of the voice signal as the second speaker. In that case, the voice recognition server 200 can execute the operation corresponding to the first voice recognition result without any further authentication procedure (S114).

第１類似度が、第２基準値ｒｅｆ２未満である場合、音声認識サーバ２００は、第１話者が第２話者と一致するか否かをさらに確認するために、認証手続きを実行することができる。第２話者は、一般的に第１話者と同一である。しかしながら、音声認識サーバ２００の話者認識機能の誤謬により、第２話者は、第１話者と異なってしまう。 When the first similarity is less than the second reference value ref2, the voice recognition server 200 can perform an authentication procedure to further confirm whether the first speaker matches the second speaker. The second speaker is generally the same as the first speaker. However, an error in the speaker recognition function of the voice recognition server 200 may cause the second speaker to differ from the first speaker.
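
The two-threshold branching of steps S107 to S109 can be summarized in a short sketch. The enum `Decision` and the function `decide` are hypothetical names; the thresholds follow the description, with ref2 required to be higher than ref1.

```python
from enum import Enum

class Decision(Enum):
    REJECT = "reject"              # S115: speaker treated as unregistered
    AUTHENTICATE = "authenticate"  # S110 onward: additional authentication
    EXECUTE = "execute"            # S114: execute without further authentication

def decide(first_similarity, ref1, ref2):
    """Map the first similarity to one of the three branches."""
    if ref2 <= ref1:
        raise ValueError("ref2 must be higher than ref1")
    if first_similarity < ref1:
        return Decision.REJECT
    if first_similarity >= ref2:
        return Decision.EXECUTE
    return Decision.AUTHENTICATE

low = decide(0.50, ref1=0.60, ref2=0.80)
middle = decide(0.70, ref1=0.60, ref2=0.80)
high = decide(0.85, ref1=0.60, ref2=0.80)
```

Only the middle band, at or above ref1 but below ref2, leads to the additional authentication procedure described next. The threshold values 0.60 and 0.80 are illustrative only.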

例えば、第１話者が音声命令を発話したが、音声認識サーバ２００は、話者認識機能の誤謬により、音声命令を第１話者と異なる第２話者が発話したものであると認識してしまう。その場合、音声認識サーバ２００は、第２話者が、「Ｂに１００万ウォン送金せよ」と発話したと認識し、音声認識サーバ２００は、第２話者の口座からＢに１００万ウォンを送金するという問題が発生する。 For example, the first speaker utters a voice command, but because of an error in the speaker recognition function, the voice recognition server 200 may recognize the voice command as having been uttered by a second speaker who is different from the first speaker. In that case, the voice recognition server 200 recognizes that the second speaker said "Send 1 million won to B", and a problem arises in that the voice recognition server 200 remits 1 million won to B from the second speaker's account.

他の例として、第１話者が第２話者の声を真似て音声命令を発話し、音声認識サーバ２００は、この音声命令を第２話者が発話したものであると認識してしまう。その場合は、声盗用による場合である。その場合にも、音声認識サーバ２００は、第２話者が、「Ｂに１００万ウォン送金せよ」と発話したと認識し、音声認識サーバ２００は、第２話者の口座からＢに１００万ウォンを送金する問題が発生する。 As another example, the first speaker may utter a voice command while imitating the voice of the second speaker, and the voice recognition server 200 may recognize the voice command as having been uttered by the second speaker. This is a case of voice theft, that is, impersonation. In that case as well, the voice recognition server 200 recognizes that the second speaker said "Send 1 million won to B", and a problem arises in that the voice recognition server 200 remits 1 million won to B from the second speaker's account.

そのような問題を防止するために、本実施形態によれば、音声認識サーバ２００は、音声認識スピーカ装置１００に、暗号発声を要求することができる。かような暗号音声要求は、音声認識サーバ２００から音声認識スピーカ装置１００に送信される（Ｓ１１０）。例えば、音声認識サーバ２００は、例えば、「暗号を言ってください」という合成音に対応する合成音信号を生成し、音声認識スピーカ装置１００に送信することができる。音声認識スピーカ装置１００は、合成音信号を受信し、スピーカ１３０（図２）を利用し、合成音を再生することができる。 To prevent such problems, according to the present embodiment, the voice recognition server 200 can request the voice recognition speaker device 100 to have a cipher, that is, a password registered in advance, spoken. Such a cipher voice request is transmitted from the voice recognition server 200 to the voice recognition speaker device 100 (S110). For example, the voice recognition server 200 can generate a synthetic sound signal corresponding to a synthetic sound such as "Please say the cipher" and transmit it to the voice recognition speaker device 100. The voice recognition speaker device 100 can receive the synthetic sound signal and reproduce the synthetic sound through the speaker 130 (FIG. 2).

音声認識スピーカ装置１００を使用している第１話者は、暗号音声要求に応じて、暗号を発話することができる。音声認識スピーカ装置１００は、第１話者が発話した暗号音声を含む暗号音声信号を検出することができる（Ｓ１１１）。暗号音声信号は、音声認識スピーカ装置１００から音声認識サーバ２００にも送信される（Ｓ１１２）。暗号音声信号は、認証音声信号とも呼ばれる。 The first speaker using the voice recognition speaker device 100 can speak a cipher in response to the cipher voice request. The voice recognition speaker device 100 can detect the encrypted voice signal including the encrypted voice uttered by the first speaker (S111). The encrypted voice signal is also transmitted from the voice recognition speaker device 100 to the voice recognition server 200 (S112). The encrypted voice signal is also called an authentication voice signal.

音声認識サーバ２００は、受信された暗号音声信号を基に、第２話者と第１話者との同一性を認証することができる（Ｓ１１３）。音声認識サーバ２００が、音声認識スピーカ装置１００から、事前に設定された時間内に暗号音声信号を受信できなかった場合、音声信号の話者が、第２話者ではないと決定し、同一性を否定することができる。 The voice recognition server 200 can authenticate the identity of the second speaker and the first speaker based on the received encrypted voice signal (S113). If the voice recognition server 200 cannot receive the encrypted voice signal from the voice recognition speaker device 100 within the preset time, it is determined that the speaker of the voice signal is not the second speaker, and the identity is determined. Can be denied.

段階（Ｓ１１３）の一例によれば、音声認識サーバ２００は、受信された暗号音声信号に対して音声認識を実行し、第２音声認識結果を生成することができる。暗号音声信号に対する音声認識と、第１話者の音声信号に対する音声認識は、互いに同一方式によっても実行される。第２音声認識結果は、第１話者が発話した暗号音声を含み得る。音声認識サーバ２００は、第２音声認識結果から、暗号に該当する部分を検出することができる。 According to an example of the step (S113), the voice recognition server 200 may perform voice recognition on the received encrypted voice signal to generate a second voice recognition result. The voice recognition for the encrypted voice signal and the voice recognition for the voice signal of the first speaker are performed by the same method. The second voice recognition result may include the encrypted voice uttered by the first speaker. The voice recognition server 200 can detect the part corresponding to the encryption from the second voice recognition result.

音声認識サーバ２００は、第２音声認識結果を、第２話者の登録された暗号と比較することができる。第２話者の登録された暗号は、第２話者が音声認識システムに音声認識スピーカ装置１００のユーザとして登録するとき、第２話者によって事前に登録されたものであり、メモリ２２０に保存されている。 The voice recognition server 200 can compare the second voice recognition result with the registered cipher of the second speaker. The registered cipher of the second speaker was registered in advance by the second speaker when the second speaker registered with the voice recognition system as a user of the voice recognition speaker device 100, and is stored in the memory 220.

音声認識サーバ２００は、第２音声認識結果と第２話者の暗号とが実質的に同一である場合、例えば、第２音声認識結果に、第２話者の暗号が含まれている場合、第１話者と第２話者とが互いに同一であると判定し、第１話者と第２話者との同一性が認証される。その場合、音声認識サーバ２００は、第１音声認識結果による動作を実行することができる（Ｓ１１４）。その場合、音声認識サーバ２００は、例えば、「第２話者の口座にＢに１００万ウォンを送金しました」という合成音に対応する合成音信号を生成することができる（Ｓ１１６）。 When the second voice recognition result and the cipher of the second speaker are substantially the same, for example, when the second voice recognition result contains the cipher of the second speaker, the voice recognition server 200 determines that the first speaker and the second speaker are the same person, and the identity of the first speaker and the second speaker is authenticated. In that case, the voice recognition server 200 can execute the operation according to the first voice recognition result (S114). The voice recognition server 200 can then generate a synthetic sound signal corresponding to a synthetic sound such as "1 million won has been sent to B from the second speaker's account" (S116).

音声認識サーバ２００は、第２音声認識結果と第２話者の暗号とが実質的に同一ではない場合、第１話者と第２話者とが互いに異なると判定し、第１話者と第２話者との同一性認証に失敗したと判定することができる。その場合、音声認識サーバ２００は、第１音声認識結果に対応する動作を実行しない（Ｓ１１５）。その場合、音声認識サーバ２００は、例えば、「暗号が一致せず、動作を実行しませんでした」という合成音に対応する合成音信号を生成することができる（Ｓ１１６）。 The voice recognition server 200 determines that the first speaker and the second speaker are different from each other when the second voice recognition result and the cipher of the second speaker are not substantially the same, and determines that the first speaker is the same as the first speaker. It can be determined that the identity authentication with the second speaker has failed. In that case, the voice recognition server 200 does not execute the operation corresponding to the first voice recognition result (S115). In that case, the voice recognition server 200 can generate a synthetic voice signal corresponding to the synthetic voice, for example, “the code does not match and the operation was not executed” (S116).
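
The text-based variant of step S113 can be sketched as a containment check. How "substantially the same" is evaluated is not specified in this description; the normalization below (lower-casing and collapsing whitespace) and the names `normalize` and `cipher_matches` are assumptions made for illustration.

```python
def normalize(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())

def cipher_matches(second_result, registered_cipher):
    """Return True if the recognized text contains the registered cipher.

    An empty registered cipher never matches, so that a missing
    registration cannot be satisfied by arbitrary speech.
    """
    cipher = normalize(registered_cipher)
    return bool(cipher) and cipher in normalize(second_result)

accepted = cipher_matches("My cipher is  Blue Tiger", "blue tiger")
rejected = cipher_matches("red tiger", "blue tiger")
```

A plain substring test can also match inside longer words, so a production system would likely compare at the word level.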

Ｓ１１３段階の他の例によれば、音声認識サーバ２００は、受信された暗号音声信号から、第３話者特徴ベクトルを抽出することができる。音声認識サーバ２００は、第３話者特徴ベクトルを、第２話者の登録された暗号音声信号から抽出された第４話者特徴ベクトルと比較することができる。第２話者の登録された暗号音声信号は、第２話者が音声認識システムに音声認識スピーカ装置１００のユーザとして登録するとき、第２話者が発話した暗号音声を基に事前に生成され、メモリ２２０に保存されている。また、第４話者特徴ベクトルも、やはり第２話者の登録された暗号音声信号が生成されるときに登録された暗号音声信号から抽出され、メモリ２２０に保存されている。メモリ２２０には、第４話者特徴ベクトルだけが保存され、第２話者の登録された暗号音声信号は保存されない。 According to another example of step S113, the voice recognition server 200 can extract a third speaker feature vector from the received encrypted voice signal. The voice recognition server 200 can compare the third speaker feature vector with a fourth speaker feature vector extracted from the registered encrypted voice signal of the second speaker. The registered encrypted voice signal of the second speaker is generated in advance from the cipher voice uttered by the second speaker when the second speaker registers with the voice recognition system as a user of the voice recognition speaker device 100, and is stored in the memory 220. The fourth speaker feature vector is likewise extracted from the registered encrypted voice signal when that signal is generated, and is stored in the memory 220. Alternatively, the memory 220 may store only the fourth speaker feature vector, without storing the registered encrypted voice signal of the second speaker.

音声認識サーバ２００は、第３話者特徴ベクトルと第４話者特徴ベクトルとの類似度が、事前に設定された基準値より高い場合、第１話者と第２話者とが互いに同一であると判定し、第１話者と第２話者との同一性が認証される。その場合、音声認識サーバ２００は、第１音声認識結果による動作を実行することができる（Ｓ１１４）。第３話者特徴ベクトル及び第４話者特徴ベクトルは、同一暗号を発話した音声を含む音声信号から抽出されたものであるので、類似度結果の信頼度が高い。 When the similarity between the third speaker feature vector and the fourth speaker feature vector is higher than a preset reference value, the voice recognition server 200 determines that the first speaker and the second speaker are the same person, and the identity of the first speaker and the second speaker is authenticated. In that case, the voice recognition server 200 can execute the operation according to the first voice recognition result (S114). Because the third speaker feature vector and the fourth speaker feature vector are both extracted from voice signals containing utterances of the same cipher, the similarity result is highly reliable.

音声認識サーバ２００は、第３話者特徴ベクトルと第４話者特徴ベクトルとの類似度が、事前に設定された基準値より低い場合、第１話者と第２話者とが互いに異なると判定し、第１話者と第２話者との同一性認証に失敗したと判定し、第１音声認識結果に対応する動作を実行しない（Ｓ１１５）。 The voice recognition server 200 determines that the first speaker and the second speaker are different from each other when the similarity between the third speaker feature vector and the fourth speaker feature vector is lower than a preset reference value. It is determined that the identity authentication between the first speaker and the second speaker has failed, and the operation corresponding to the first voice recognition result is not executed (S115).

Ｓ１１３段階のさらに他の例によれば、音声認識サーバ２００は、受信された暗号音声信号から、第３話者特徴ベクトルを抽出することができる。音声認識サーバ２００は、第３話者特徴ベクトルを、第２話者の登録された暗号音声信号から抽出された第４話者特徴ベクトルと比較することができる。 According to yet another example of operation S113, the voice recognition server 200 may extract the third speaker feature vector from the received encrypted voice signal. The voice recognition server 200 can compare the third speaker feature vector with the fourth speaker feature vector extracted from the registered encrypted voice signal of the second speaker.

音声認識サーバ２００は、受信された暗号音声信号に対して音声認識を実行し、第２音声認識結果を生成することができる。第２音声認識結果は、第１話者が発話した暗号音声を含み得、音声認識サーバ２００は、第２音声認識結果から、暗号に該当する部分を検出することができる。音声認識サーバ２００は、第２音声認識結果を、第２話者の登録された暗号音声信号の第３音声認識結果と比較することができる。第３音声認識結果は、第２話者の登録された暗号音声信号が生成されるとき、第２話者の登録された暗号音声信号に対して音声認識が実行された結果であり、メモリ２２０に事前に保存されている。第３音声認識結果も、やはり第２話者が発話した暗号音声を含み得、音声認識サーバ２００は、第３音声認識結果から、暗号に該当する部分を検出することができる。 The voice recognition server 200 can perform voice recognition on the received encrypted voice signal and generate a second voice recognition result. The second voice recognition result may include the cipher voice uttered by the first speaker, and the voice recognition server 200 can detect the portion corresponding to the cipher from the second voice recognition result. The voice recognition server 200 can compare the second voice recognition result with a third voice recognition result of the registered encrypted voice signal of the second speaker. The third voice recognition result is the result of voice recognition performed on the registered encrypted voice signal of the second speaker when that signal was generated, and is stored in the memory 220 in advance. The third voice recognition result may likewise include the cipher voice uttered by the second speaker, and the voice recognition server 200 can detect the portion corresponding to the cipher from the third voice recognition result.

音声認識サーバ２００は、第３話者特徴ベクトルと第４話者特徴ベクトルとの類似度が、事前に設定された基準値より高く、第２音声認識結果と第３音声認識結果とが実質的に同一である場合、第１話者と第２話者との同一性が認証される。その場合、音声認識サーバ２００は、第１音声認識結果による動作を実行することができる（Ｓ１１４）。 When the similarity between the third speaker feature vector and the fourth speaker feature vector is higher than a preset reference value and the second voice recognition result and the third voice recognition result are substantially the same, the voice recognition server 200 authenticates the identity of the first speaker and the second speaker. In that case, the voice recognition server 200 can execute the operation according to the first voice recognition result (S114).

音声認識サーバ２００は、第３話者特徴ベクトルと第４話者特徴ベクトルとの類似度が、事前に設定された基準値より低いか、あるいは、第２音声認識結果と第３音声認識結果とが実質的に同一ではない場合、第１話者と第２話者とが互いに異なると判定し、第１話者と第２話者との同一性認証に失敗したと判定し、第１音声認識結果に対応する動作を実行しないこともある（Ｓ１１５）。 The voice recognition server 200 determines whether the degree of similarity between the third speaker feature vector and the fourth speaker feature vector is lower than a preset reference value, or the second voice recognition result and the third voice recognition result. Are not substantially the same, it is determined that the first speaker and the second speaker are different from each other, it is determined that the identity authentication between the first speaker and the second speaker has failed, and the first voice The operation corresponding to the recognition result may not be executed (S115).
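
The combined variant, in which both the voice and the spoken text must agree, can be sketched as follows. The function name `identity_verified` is hypothetical, the text comparison is a simple normalized equality chosen as an assumption, and a similarity exactly equal to the reference value is treated as a failure because the description only covers the higher and lower cases.

```python
def identity_verified(similarity, reference, second_result, third_result):
    """Return True only if the speaker vectors are similar enough AND the
    recognized cipher text matches the text recognized at registration."""
    voice_ok = similarity > reference
    text_ok = (" ".join(second_result.lower().split())
               == " ".join(third_result.lower().split()))
    return voice_ok and text_ok

both_pass = identity_verified(0.91, 0.85, "Blue  tiger", "blue tiger")
voice_fails = identity_verified(0.70, 0.85, "blue tiger", "blue tiger")
text_fails = identity_verified(0.91, 0.85, "red tiger", "blue tiger")
```

Requiring both conditions means an impersonator must both sound like the second speaker and know the cipher.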

Ｓ１１３段階において、第１話者と第２話者との同一性が認証されるか、あるいは、Ｓ１０９段階において、第１類似度が、第２基準値ｒｅｆ２以上である場合、音声認識サーバ２００は、第１音声認識結果に対応する動作を実行することができる（Ｓ１１４）。音声認識サーバ２００は、第１音声認識結果に対応する動作を実行した結果を報告するための合成音信号を生成することができる（Ｓ１１６）。 If the identity between the first speaker and the second speaker is verified in step S113, or if the first similarity is greater than or equal to the second reference value ref2 in step S109, the voice recognition server 200 , The operation corresponding to the first voice recognition result can be executed (S114). The voice recognition server 200 can generate a synthetic voice signal for reporting the result of executing the operation corresponding to the first voice recognition result (S116).

Ｓ１１３段階において、第１話者と第２話者との同一性が認証されないか、あるいは、Ｓ１０７段階において、第１類似度が、第１基準値ｒｅｆ１未満である場合、音声認識サーバ２００は、第１音声認識結果に対応する動作を実行しない（Ｓ１１５）。音声認識サーバ２００は、第１音声認識結果に対応する動作を実行しなかったことを報告するための合成音信号を生成することができる（Ｓ１１６）。 If the identity between the first speaker and the second speaker is not authenticated in step S113, or if the first similarity is less than the first reference value ref1 in step S107, the voice recognition server 200 determines that The operation corresponding to the first voice recognition result is not executed (S115). The voice recognition server 200 may generate a synthesized voice signal for reporting that the operation corresponding to the first voice recognition result has not been executed (S116).

音声認識サーバ２００は、生成された合成音信号を音声認識スピーカ装置１００に送信することができる（Ｓ１１７）。音声認識スピーカ装置１００は、合成音信号に対応する合成音を再生することができる（Ｓ１１８）。従って、音声信号の音声を発話した第１話者は、自分の音声命令の実行結果を直接確認することができる。 The voice recognition server 200 can transmit the generated synthetic voice signal to the voice recognition speaker device 100 (S117). The voice recognition speaker device 100 can reproduce the synthetic sound corresponding to the synthetic sound signal (S118). Therefore, the first speaker who speaks the voice of the voice signal can directly confirm the execution result of his voice command.

他の実施形態により、Ｓ１１４段階において、第１話者と第２話者との同一性が確認されたので、音声認識サーバ２００は、第１音声認識結果に対応する動作を実行しつつ、第１話者特徴ベクトルを利用し、第２話者の登録された第２話者特徴ベクトルを改善させることができる。 According to another embodiment, because the identity of the first speaker and the second speaker has been confirmed, in step S114 the voice recognition server 200 can, while executing the operation corresponding to the first voice recognition result, use the first speaker feature vector to improve the registered second speaker feature vector of the second speaker.
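
The description does not say how the registered vector is improved. One simple possibility, shown here only as an assumption, is to move the registered vector slightly toward the newly verified vector; `adapt_vector` and its `weight` parameter are hypothetical.

```python
def adapt_vector(registered, observed, weight=0.1):
    """Blend a verified observation into the registered vector.

    weight=0 keeps the registered vector unchanged; weight=1 replaces it
    with the observation.
    """
    if len(registered) != len(observed):
        raise ValueError("vectors must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return [(1 - weight) * r + weight * o for r, o in zip(registered, observed)]

updated = adapt_vector([1.0, 0.0], [0.0, 1.0], weight=0.5)
```

Updating only after a successful authentication, as the embodiment describes, keeps unverified voices from drifting the registered vector.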

図６は、他の実施形態による音声認識システムの話者認証方法について説明するための例示的なフローチャートである。図６を参照すると、音声認識システムは、音声認識スピーカ装置１００及び音声認識サーバ２００を含む。第２話者の携帯装置３００は、ネットワークを介して音声認識サーバ２００に接続される。 FIG. 6 is an exemplary flowchart illustrating a speaker authentication method of a voice recognition system according to another exemplary embodiment. Referring to FIG. 6, the voice recognition system includes a voice recognition speaker device 100 and a voice recognition server 200. The portable device 300 of the second speaker is connected to the voice recognition server 200 via the network.

図６に図示されている段階（Ｓ２０１−Ｓ２０９）及び段階（Ｓ２１６−Ｓ２２０）は、図５を参照して説明した段階（Ｓ１０１−Ｓ１０９）及び段階（Ｓ１１４−Ｓ１１８）とそれぞれ実質的に同一であるので、それらについては、反復して説明しない。以下では、図５の実施形態と違いがある段階（Ｓ２１０ないしＳ２１５）を中心に説明する。 The steps (S201-S209) and the steps (S216-S220) illustrated in FIG. 6 are substantially the same as the steps (S101-S109) and the steps (S114-S118) described with reference to FIG. 5, respectively. As such, they will not be repeated. Hereinafter, the steps (S210 to S215) different from the embodiment of FIG. 5 will be mainly described.

Ｓ２０７段階ないしＳ２０９段階において、第１類似度が、第１基準値ｒｅｆ１以上であり、第２基準値ｒｅｆ２未満である場合、音声認識サーバ２００は、音声信号の話者が第２話者であると決定しつつ、第１話者が第２話者と一致するか否かをさらに確認するために、認証手続きを実行することができる。 In steps S207 to S209, when the first similarity is greater than or equal to the first reference value ref1 and less than the second reference value ref2, the voice recognition server 200 determines that the speaker of the voice signal is the second speaker. While deciding that, the authentication procedure can be performed to further confirm whether the first speaker matches the second speaker.

音声認識サーバ２００は、認証手続きのために、ワンタイムパスワード（ＯＴＰ）を生成し（Ｓ２１０）、生成されたワンタイムパスワードを、第２話者の携帯装置３００に送信することができる（Ｓ２１１）。前述のように、携帯装置３００の識別番号は、第２話者が音声認識システムにユーザとして登録するときに共に入力され、メモリ２２０に保存されている。ワンタイムパスワードは、文字メッセージ、チャットアプリケーションのテキストメッセージ、ワンタイムパスワードを含むイメージなどの方式で、音声認識サーバ２００から携帯装置３００に送信される。例えば、かようなワンタイムパスワードは、一桁または複数桁の数字であり得る。他の例によれば、かようなワンタイムパスワードは、テキスト単語、テキスト文章または事物イメージであり得る。 The voice recognition server 200 may generate a one-time password (OTP) for the authentication procedure (S210), and may transmit the generated one-time password to the mobile device 300 of the second speaker (S211). .. As described above, the identification number of the mobile device 300 is also input when the second speaker registers as a user in the voice recognition system and is stored in the memory 220. The one-time password is transmitted from the voice recognition server 200 to the mobile device 300 by a method such as a text message, a text message of a chat application, and an image including the one-time password. For example, such a one-time password can be a single or multiple digit number. According to another example, such a one-time password may be a text word, a text sentence or an object image.
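
For the numeric case, generating the one-time password in step S210 can be sketched with Python's standard `secrets` module. The function name `generate_otp` and the default of six digits are assumptions; the description only says the password may be one or more digits.

```python
import secrets

def generate_otp(digits=6):
    """Return a random numeric one-time password as a string.

    The value is kept as a string so that leading zeros are preserved.
    """
    if digits < 1:
        raise ValueError("digits must be at least 1")
    return "".join(secrets.choice("0123456789") for _ in range(digits))

otp = generate_otp()
```

`secrets` is used rather than `random` because the value serves as an authentication factor and should come from a cryptographically strong source.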

携帯装置３００は、ワンタイムパスワードを受信し、それをディスプレイウィンドウに表示することができる（Ｓ２１２）。携帯装置３００の所有者である第２話者は、ディスプレイウィンドウに表示されたワンタイムパスワードを確認することができる（Ｓ２１２ａ）。例えば、携帯装置３００のディスプレイウィンドウには、「認証番号は、ＸＸＸです。認証ボタンを押した後、音声認識スピーカ装置の前で認証番号を言ってください」というメッセージを含む通知ウィンドウがアクティブ化される。通知ウィンドウには、認証ボタンと共に、残り時間が表示される。 The mobile device 300 can receive the one-time password and display it in a display window (S212). The second speaker, who is the owner of the mobile device 300, can check the one-time password shown in the display window (S212a). For example, a notification window containing the message "The authentication number is XXX. After pressing the authentication button, please say the authentication number in front of the voice recognition speaker device" may be activated in the display window of the mobile device 300. The notification window may display the remaining time together with the authentication button.

第２話者は、通知ウィンドウのメッセージを確認し、認証ボタンを押した後、音声認識スピーカ装置の前で、ワンタイムパスワードに指定された認証番号を発話することができる（Ｓ２１２ａ）。音声認識スピーカ装置１００は、第２話者が発話した認証番号の音声を含む認証音声信号を検出することができる（Ｓ２１３）。認証音声信号は、音声認識スピーカ装置１００から音声認識サーバ２００に送信されもする（Ｓ２１４）。 After confirming the message in the notification window and pressing the authentication button, the second speaker can speak the authentication number specified in the one-time password in front of the voice recognition speaker device (S212a). The voice recognition speaker device 100 can detect the authentication voice signal including the voice of the authentication number uttered by the second speaker (S213). The authentication voice signal is also transmitted from the voice recognition speaker device 100 to the voice recognition server 200 (S214).

音声認識サーバ２００は、受信された認証音声信号を基に、第２話者と第１話者との同一性を認証することができる（Ｓ２１５）。音声認識サーバ２００が、音声認識スピーカ装置１００から、事前に設定された時間内に認証音声信号を受信できなかった場合、音声信号の話者が、第２話者ではないと決定し、同一性を否定することができる。 The voice recognition server 200 can authenticate the identity between the second speaker and the first speaker based on the received authentication voice signal (S215). When the voice recognition server 200 cannot receive the authentication voice signal from the voice recognition speaker device 100 within the preset time, it is determined that the speaker of the voice signal is not the second speaker, and the identity is determined. Can be denied.

Ｓ２１５段階の一例によれば、音声認識サーバ２００は、受信された認証音声信号に対して音声認識を実行し、第２音声認識結果を生成することができる。第２音声認識結果は、第１話者が発話したワンタイムパスワードまたは認証番号の音声を含み得る。音声認識サーバ２００は、第２音声認識結果から、ワンタイムパスワードまたは認証番号に該当する部分を検出することができる。認証音声信号に対する音声認識と、第１話者の音声信号に対する音声認識は、互いに同一方式によっても実行される。他の例によれば、ワンタイムパスワードは、一桁または複数桁の数字であり得る。その場合、認証音声信号は、数字を発話した音声を含み、認証音声信号に対して音声認識を実行するとき、数字に特化された言語モデルが使用される。これとは対照的に、第１話者の音声信号に対する音声認識は、文字に特化された言語モデルを使用しても実行される。 According to an example of step S215, the voice recognition server 200 may perform voice recognition on the received authentication voice signal and generate a second voice recognition result. The second voice recognition result may include the voice of the one-time password or the authentication number uttered by the first speaker. The voice recognition server 200 can detect the part corresponding to the one-time password or the authentication number from the second voice recognition result. The voice recognition for the authentication voice signal and the voice recognition for the voice signal of the first speaker are performed by the same method. According to another example, the one-time password can be a single digit or multiple digit number. In that case, the authentication voice signal includes a voice that speaks a number, and when performing voice recognition on the authentication voice signal, a language model specialized for the number is used. In contrast, speech recognition for the first speaker's speech signal is also performed using a character-specific language model.

音声認識サーバ２００は、第２音声認識結果を、Ｓ２１０段階で音声認識サーバ２００が生成したワンタイムパスワードまたは指定番号と比較することができる。 The voice recognition server 200 may compare the second voice recognition result with the one-time password or the designated number generated by the voice recognition server 200 in step S210.

音声認識サーバ２００は、第２音声認識結果が、ワンタイムパスワードまたは指定番号と実質的に同一である場合、例えば、第２音声認識結果に、ワンタイムパスワードまたは指定番号が含まれている場合、第１話者と第２話者とが互いに同一であると判定し、第１話者と第２話者との同一性が認証される。その場合、音声認識サーバ２００は、第１音声認識結果による動作を実行することができる（Ｓ２１６）。その場合、音声認識サーバ２００は、動作の実行を報告するための合成音信号を生成することができる（Ｓ２１８）。 When the second voice recognition result is substantially the same as the one-time password or the designated number, for example, when the second voice recognition result includes the one-time password or the designated number, the voice recognition server 200 It is determined that the first speaker and the second speaker are the same, and the identity between the first speaker and the second speaker is verified. In that case, the voice recognition server 200 can execute the operation based on the first voice recognition result (S216). In that case, the voice recognition server 200 can generate a synthetic voice signal for reporting the execution of the operation (S218).

音声認識サーバ２００は、第２音声認識結果とワンタイムパスワードとが実質的に同一ではない場合、例えば、第２音声認識結果に、ワンタイムパスワードまたは指定番号が含まれていない場合、第１話者と第２話者とが互いに異なると判定し、第１話者と第２話者との同一性認証に失敗したと判定することができる。その場合、音声認識サーバ２００は、第１音声認識結果に対応する動作を実行しない（Ｓ２１７）。その場合、音声認識サーバ２００は、動作の不実行を報告するための合成音信号を生成することができる（Ｓ２１８）。 When the second voice recognition result and the one-time password are not substantially the same, for example, when the second voice recognition result does not include the one-time password or the designated number, the voice recognition server 200 determines the first talk. It can be determined that the speaker and the second speaker are different from each other, and it can be determined that the identity authentication between the first speaker and the second speaker has failed. In that case, the voice recognition server 200 does not execute the operation corresponding to the first voice recognition result (S217). In that case, the voice recognition server 200 can generate a synthetic voice signal for reporting the non-execution of the operation (S218).
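
The check in step S215, together with the preset time limit mentioned above, can be sketched as follows. This assumes a numeric one-time password and a recognizer that outputs digits as numerals; `otp_verified`, `issued_at`, and `timeout_s` are hypothetical names.

```python
import time

def otp_verified(second_result, otp, issued_at, timeout_s, now=None):
    """Return True if the recognized text contains the one-time password
    and the authentication voice arrived within the time limit.

    Times are in seconds on the same clock; `now` defaults to
    time.monotonic() and is injectable for testing.
    """
    now = time.monotonic() if now is None else now
    if now - issued_at > timeout_s:
        return False  # too late: identity is denied
    # Keep only the digits so that "4 8 1 5" and "4815" compare equal.
    spoken_digits = "".join(ch for ch in second_result if ch.isdigit())
    return bool(otp) and otp in spoken_digits

in_time = otp_verified("the number is 4 8 1 5", "4815", 0.0, 60.0, now=10.0)
wrong = otp_verified("the number is 4 8 1 6", "4815", 0.0, 60.0, now=10.0)
late = otp_verified("the number is 4 8 1 5", "4815", 0.0, 60.0, now=100.0)
```

The containment test mirrors the description's example, in which authentication succeeds when the second voice recognition result includes the one-time password.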

According to another example of step S215, the voice recognition server 200 may perform voice recognition on the received authentication voice signal to generate a second voice recognition result, and may compare the second voice recognition result with the one-time password.

The voice recognition server 200 may extract a third speaker feature vector from the received authentication voice signal and compare it with the registered speaker feature vector of the second speaker, that is, the second speaker feature vector. In one example, when the one-time password consists of digits, the voice recognition server 200 may compare the second speaker feature vector and the third speaker feature vector using feature vectors specialized for digits. In addition, when the second speaker registers as a user, the second speaker utters specific sentences presented by the voice recognition system so that the second speaker feature vector can be generated, and these sentences may be chosen so that digits are recognized well.

When the second voice recognition result is substantially identical to the one-time password and the similarity between the third speaker feature vector and the second speaker feature vector is higher than a preset reference value, the voice recognition server 200 determines that the first speaker and the second speaker are the same person, and the identity between the first speaker and the second speaker is authenticated. In that case, the voice recognition server 200 may execute the operation corresponding to the first voice recognition result (S216).

When the second voice recognition result is not substantially identical to the one-time password, or the similarity between the third speaker feature vector and the second speaker feature vector is lower than the preset reference value, the voice recognition server 200 determines that the first speaker and the second speaker are different, determines that authentication of the identity between them has failed, and does not execute the operation corresponding to the first voice recognition result (S217).
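
The two-condition decision described above can be sketched as follows. The specification does not fix a similarity measure, so cosine similarity is used here purely as an assumption; the function names, the plain substring test for the one-time password, and the strict "greater than" comparison at the reference value are likewise choices made only for this illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def authenticate_with_otp(
    second_recognition_result: str,
    otp: str,
    third_vector: Sequence[float],
    second_vector: Sequence[float],
    reference_value: float,
) -> bool:
    """Authenticate only if the password matches AND the voices are similar."""
    # Condition 1: the recognized text contains the one-time password.
    otp_ok = bool(otp) and otp in second_recognition_result
    # Condition 2: the vector from the authentication utterance is close
    # enough to the enrolled vector of the second speaker.
    voice_ok = cosine_similarity(third_vector, second_vector) > reference_value
    return otp_ok and voice_ok
```

A true return value would correspond to S216 and a false one to S217. Requiring both conditions means that a person who merely reads the one-time password from the second speaker's portable device is still rejected unless the voice also matches.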

When the identity between the first speaker and the second speaker is authenticated in step S215, or when the first similarity is equal to or greater than the second reference value ref2 in step S209, the voice recognition server 200 may execute the operation corresponding to the first voice recognition result (S216). When the identity between the first speaker and the second speaker is not authenticated in step S215, or when the first similarity is less than the first reference value ref1 in step S207, the voice recognition server 200 does not execute the operation corresponding to the first voice recognition result (S217).

FIG. 7 is an exemplary flowchart illustrating a speaker authentication method of a voice recognition system according to another embodiment. Referring to FIG. 7, the voice recognition system includes the voice recognition speaker device 100 and the voice recognition server 200. The portable device 300 of the second speaker is connected to the voice recognition server 200 via a network.

Steps S301 to S309 and S314 to S318 illustrated in FIG. 7 are substantially the same as steps S101 to S109 and S114 to S118 described with reference to FIG. 5, respectively, so their description is not repeated. The following description focuses on the steps that differ from the embodiment of FIG. 5, namely S310 to S313.

In steps S307 to S309, when the first similarity is equal to or greater than the first reference value ref1 and less than the second reference value ref2, the voice recognition server 200 determines that the speaker of the voice signal is the second speaker, but may carry out an authentication procedure to further confirm whether the first speaker matches the second speaker.
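
The role of the two reference values can be summarized in a short sketch. The names `Decision` and `decide` are hypothetical; the boundary handling follows the text (less than ref1 rejects, ref2 or more executes without further authentication, and the band in between triggers the authentication procedure).

```python
from enum import Enum


class Decision(Enum):
    REJECT = "do not execute the operation"
    AUTHENTICATE = "run the additional authentication procedure"
    EXECUTE = "execute the operation without further authentication"


def decide(first_similarity: float, ref1: float, ref2: float) -> Decision:
    """Map the first similarity onto the three outcomes of steps S307 to S309."""
    if ref2 <= ref1:
        # The second reference value is defined as higher than the first.
        raise ValueError("ref2 must be greater than ref1")
    if first_similarity < ref1:
        return Decision.REJECT
    if first_similarity >= ref2:
        return Decision.EXECUTE
    return Decision.AUTHENTICATE
```

With illustrative values ref1 = 0.6 and ref2 = 0.9, a similarity of 0.7 falls in the intermediate band and yields `Decision.AUTHENTICATE`, which is the case handled by steps S310 to S313.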

For the authentication procedure, the voice recognition server 200 may request the portable device 300 of the second speaker, that is, the second speaker, to utter the same content as the voice of the first speaker (S310). Here, the voice of the first speaker means the voice contained in the voice signal detected by the voice recognition speaker device in step S302. If the second speaker and the first speaker are the same person, the second speaker knows the content of the voice signal received in step S302. Because only the first speaker, that is, the second speaker, knows that content, it is confidential.

The portable device 300 may receive the request to utter the same content and present it externally, for example, in a display window. For example, a notification window containing the message "A command was executed from the voice recognition speaker device in the name of the second speaker. Please say this command once more." may be activated in the display window of the portable device 300. The second speaker, who owns the portable device 300, may utter a voice of the same content in response to the request. The portable device 300 may detect an authentication voice signal containing the voice of the same content (S311). The authentication voice signal is transmitted from the portable device 300 to the voice recognition server 200 (S312).

The voice recognition server 200 may authenticate the identity between the second speaker and the first speaker based on the received authentication voice signal (S313). If the voice recognition server 200 does not receive the authentication voice signal from the portable device 300 within a preset time, it may determine that the speaker of the voice signal is not the second speaker and deny the identity.
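
The preset-time behavior described above might be sketched as follows. The use of an in-process `queue.Queue` as the delivery channel and the name `wait_for_authentication_signal` are assumptions for illustration only; the specification does not describe how the server receives the signal internally.

```python
import queue
from typing import Optional


def wait_for_authentication_signal(
    inbox: queue.Queue, preset_time_s: float
) -> Optional[bytes]:
    """Wait up to preset_time_s seconds for the authentication voice signal.

    Returns the signal, or None if nothing arrived in time, in which case
    the caller would deny the identity.
    """
    try:
        # Blocks until an item is available or the timeout elapses.
        return inbox.get(timeout=preset_time_s)
    except queue.Empty:
        return None
```

A caller would treat a `None` result as a failed authentication, so that a request left unanswered on the portable device never leads to execution of the operation.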

According to one example of step S313, the voice recognition server 200 may compare the received authentication voice signal with the voice signal of the first speaker received in step S303. Because the authentication voice signal and the voice signal of the first speaker are generated at about the same time and contain speech of substantially the same content, they may have similar waveforms. The comparison between the authentication voice signal and the voice signal of the first speaker may be performed on waveforms, frequency spectra, and the like, and the comparison method is not limited to these. As a result of the comparison, a similarity between the authentication voice signal and the voice signal is calculated.

When the similarity calculated as a result of the comparison exceeds a preset reference value, the voice recognition server 200 determines that the first speaker and the second speaker are the same person, and the identity between the first speaker and the second speaker is authenticated. In that case, the voice recognition server 200 may execute the operation corresponding to the first voice recognition result (S314). When the similarity is lower than the preset reference value, the voice recognition server 200 may determine that the first speaker and the second speaker are different and that authentication of the identity between them has failed. In that case, the voice recognition server 200 does not execute the operation corresponding to the first voice recognition result (S315).
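
Since the specification leaves the comparison method open, the following is only one possible waveform-level sketch: the peak of the normalized cross-correlation over a small range of time offsets. The function names, the `max_lag` parameter, and the choice of this measure are assumptions; signals captured by different microphones would in practice probably require a more robust, feature-based comparison.

```python
import math
from typing import Sequence


def peak_normalized_correlation(
    a: Sequence[float], b: Sequence[float], max_lag: int
) -> float:
    """Largest normalized cross-correlation of a and b for lags in [-max_lag, max_lag]."""
    energy = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    if energy == 0.0:
        return 0.0  # a silent signal is not similar to anything
    best = -1.0
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for i, x in enumerate(a):
            j = i + lag
            if 0 <= j < len(b):
                total += x * b[j]
        best = max(best, total / energy)
    return best


def signals_match(
    voice_signal: Sequence[float],
    authentication_signal: Sequence[float],
    reference_value: float,
    max_lag: int,
) -> bool:
    """True if the similarity exceeds the preset reference value."""
    similarity = peak_normalized_correlation(
        voice_signal, authentication_signal, max_lag
    )
    return similarity > reference_value
```

Searching over several lags tolerates a small time offset between the two recordings; the result lies between -1 and 1, so the reference value would be chosen in that range.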

According to another example of step S313, the voice recognition server 200 may perform voice recognition on the received authentication voice signal to generate a second voice recognition result. Voice recognition on the authentication voice signal and voice recognition on the voice signal of the first speaker may be performed in the same manner. The voice recognition server 200 may compare the second voice recognition result with the first voice recognition result obtained for the voice signal of the first speaker.

When the second voice recognition result and the first voice recognition result are substantially identical, for example, when they are semantically identical, the voice recognition server 200 determines that the first speaker and the second speaker are the same person, and the identity between the first speaker and the second speaker is authenticated. In that case, the voice recognition server 200 may execute the operation corresponding to the first voice recognition result (S314).

When the second voice recognition result and the first voice recognition result are not substantially identical, the voice recognition server 200 may determine that the first speaker and the second speaker are different and that authentication of the identity between them has failed. In that case, the voice recognition server 200 does not execute the operation corresponding to the first voice recognition result (S315).
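
The specification does not define how semantic identity is judged. As a deliberately crude stand-in, the sketch below treats two recognition results as the same command when their lowercase word sequences match after punctuation and a few filler words are removed. The filler list, the function names, and the restriction to ASCII letters and digits are all assumptions for illustration.

```python
import re
from typing import Tuple

# Hypothetical filler words that do not change the meaning of a command.
FILLERS = {"please", "um", "uh"}


def normalize_command(text: str) -> Tuple[str, ...]:
    """Lowercase the text, keep alphanumeric tokens, and drop filler words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return tuple(t for t in tokens if t not in FILLERS)


def same_command(first_result: str, second_result: str) -> bool:
    """True if both recognition results normalize to the same non-empty command."""
    first = normalize_command(first_result)
    return bool(first) and first == normalize_command(second_result)
```

For instance, "Send 100 dollars to Alice." and "please send 100 dollars to alice" normalize to the same token sequence, which would correspond to S314, while any difference in recipient or amount would correspond to S315.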

According to still another example of step S313, the voice recognition server 200 may perform voice recognition on the received authentication voice signal to generate a second voice recognition result, and may compare the second voice recognition result with the first voice recognition result obtained for the voice signal of the first speaker.

The voice recognition server 200 may extract a third speaker feature vector from the received authentication voice signal and compare it with the first speaker feature vector. Because the authentication voice signal of the second speaker and the voice signal of the first speaker contain the same content and are generated at about the same time, the third speaker feature vector and the first speaker feature vector can readily be compared with each other, and a high similarity is calculated when the second speaker and the first speaker are the same person.

When the second voice recognition result and the first voice recognition result are substantially identical and the similarity between the third speaker feature vector and the first speaker feature vector is higher than a preset reference value, the voice recognition server 200 determines that the first speaker and the second speaker are the same person, and the identity between the first speaker and the second speaker is authenticated. In that case, the voice recognition server 200 may execute the operation corresponding to the first voice recognition result (S314).

When the second voice recognition result and the first voice recognition result are not substantially identical, or the similarity between the third speaker feature vector and the first speaker feature vector is lower than the preset reference value, the voice recognition server 200 determines that the first speaker and the second speaker are different, determines that authentication of the identity between them has failed, and does not execute the operation corresponding to the first voice recognition result (S315).

When the identity between the first speaker and the second speaker is authenticated in step S313, or when the first similarity is equal to or greater than the second reference value ref2 in step S309, the voice recognition server 200 may execute the operation corresponding to the first voice recognition result (S314). When the identity between the first speaker and the second speaker is not authenticated in step S313, or when the first similarity is less than the first reference value ref1 in step S307, the voice recognition server 200 does not execute the operation corresponding to the first voice recognition result (S315).

The embodiments of the present invention described above may be implemented in the form of a computer program executable on a computer through various components, and such a computer program may be recorded on a computer-readable medium. The medium may store the computer-executable program persistently, or may store it temporarily for execution or download. The medium may be any of various recording or storage means in the form of a single piece of hardware or a combination of several pieces of hardware; it is not limited to a medium directly connected to a particular computer system and may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy (registered trademark) disks, and magnetic tape; optical recording media such as CD-ROM (compact disc read-only memory) and DVD (digital versatile disc); magneto-optical media such as floptical disks; and ROM, RAM, flash memory, and the like, configured to store program instructions. Other examples include recording or storage media managed by application stores that distribute applications and by sites and servers that supply or distribute various other software.

In this specification, a "unit", "module", or the like may be a hardware component such as a processor or a circuit, and/or a software component executed by a hardware component such as a processor. For example, a "unit", "module", or the like may be implemented by components such as software components, object-oriented software components, class components, and task components, as well as by processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables.

The foregoing description of the present invention is illustrative, and a person of ordinary skill in the art to which the present invention pertains will understand that it can readily be modified into other specific forms without changing the technical idea or essential features of the invention. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in combined form.

The scope of the present invention is indicated by the claims rather than by the foregoing detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

The speaker authentication method and voice recognition system according to embodiments of the present invention can be applied effectively to, for example, security-related technical fields.

100 Voice recognition speaker device
110 Processor
120 Microphone
130 Speaker
140 Communication module
200 Voice recognition server
210, 210a Processor
211 Voice signal receiving unit
212 Voice recognition unit
213 Speaker recognition unit
213a Speaker feature vector extraction unit
213b Speaker feature vector comparison unit
213c Registered speaker determination unit
214 Speaker authentication unit
215 Functional unit
216 Synthetic voice signal generation unit
217 Authentication decision unit
220 Memory
230 Communication module
300 Portable device
400 Network

Claims

A speaker authentication method in a voice recognition system including a voice recognition device and a voice recognition server, the method comprising:
the voice recognition server receiving a voice signal including a voice of a first speaker from the voice recognition device;
the voice recognition server performing voice recognition on the voice signal to generate a first voice recognition result;
the voice recognition server extracting a first speaker feature vector from the voice signal and calculating a similarity between the first speaker feature vector and a registered speaker feature vector;
when the similarity is equal to or greater than a first reference value, the voice recognition server determining that the first speaker of the voice signal is a second speaker who has completed user registration with the voice recognition server;
the voice recognition server requesting the portable device of the second speaker to utter the same content as the voice of the first speaker;
the voice recognition server receiving an authentication voice signal including an authentication voice from the portable device of the second speaker;
the voice recognition server authenticating the identity between the second speaker and the first speaker based on the voice signal received from the voice recognition device and the authentication voice signal received from the portable device of the second speaker; and
when the identity is authenticated, the voice recognition server performing an operation corresponding to the first voice recognition result.

The speaker authentication method according to claim 1, further comprising:
the voice recognition server requesting the voice recognition device to utter a password;
the voice recognition server receiving a password voice signal including a password voice from the voice recognition device; and
the voice recognition server comparing the password voice signal received from the voice recognition device with a registered password voice signal of the second speaker to determine the identity between the first speaker and the second speaker.

The speaker authentication method according to claim 2, wherein the step of determining the identity includes:
the voice recognition server extracting a second speaker feature vector from the password voice signal;
the voice recognition server comparing the second speaker feature vector with a speaker feature vector extracted from the registered password voice signal of the second speaker; and
the voice recognition server determining the identity between the first speaker and the second speaker based on at least the result of the comparison.

The speaker authentication method, further comprising:
the voice recognition server generating a one-time password;
the voice recognition server transmitting the one-time password to the portable device of the second speaker; and
the voice recognition server receiving, from the voice recognition device, a one-time password voice signal including a voice uttering the one-time password.

The speaker authentication method according to claim 4, further comprising:
the voice recognition server performing voice recognition on the one-time password voice signal to generate a second voice recognition result;
the voice recognition server comparing the second voice recognition result with the one-time password; and
the voice recognition server determining the identity between the first speaker and the second speaker based on at least the result of the comparison.

The speaker authentication method according to claim 4, further comprising:
the voice recognition server performing voice recognition on the one-time password voice signal to generate a second voice recognition result;
the voice recognition server extracting a second speaker feature vector from the one-time password voice signal; and
the voice recognition server determining the identity between the first speaker and the second speaker based on whether the second voice recognition result is identical to the one-time password and on the similarity between the second speaker feature vector and the registered speaker feature vector of the second speaker.

The speaker authentication method according to claim 1, wherein the step of authenticating the identity includes:
the voice recognition server comparing the voice signal with the authentication voice signal; and
the voice recognition server determining the identity between the first speaker and the second speaker based on at least the result of the comparison.

The speaker authentication method according to claim 1, wherein the step of authenticating the identity includes:
the voice recognition server performing voice recognition on the authentication voice signal to generate a second voice recognition result;
the voice recognition server comparing the second voice recognition result with the first voice recognition result; and
the voice recognition server determining the identity between the first speaker and the second speaker based on at least the result of the comparison.

The speaker authentication method according to claim 1, further comprising: when the similarity is equal to or greater than a second reference value that is higher than the first reference value, the voice recognition server determining that the first speaker of the voice signal is the registered second speaker and then performing the operation corresponding to the first voice recognition result without performing the steps of requesting utterance of the same content as the voice of the first speaker, receiving the authentication voice signal, and authenticating the identity.

The speaker authentication method according to any one of claims 1 to 9, further comprising: when the similarity is less than the first reference value, the voice recognition server determining that the first speaker of the voice signal is an unregistered user, and performing neither the steps of requesting utterance of the same content as the voice of the first speaker, receiving the authentication voice signal, and authenticating the identity, nor the operation corresponding to the first voice recognition result.

The speaker authentication method according to any one of claims 1 to 10, wherein the registered second speaker is one of a plurality of users registered in the voice recognition system.

A program for causing a processor of a voice recognition server of a voice recognition system to execute the speaker authentication method according to any one of claims 1 to 11.

A voice recognition server comprising:
a communication module for communicating with a voice recognition device and a portable device; and
a processor configured to: receive, using the communication module, a voice signal including a voice of a first speaker from the voice recognition device; perform voice recognition on the voice signal to generate a first voice recognition result; extract a first speaker feature vector from the voice signal; calculate a similarity between the first speaker feature vector and a registered speaker feature vector; when the similarity is equal to or greater than a first reference value, determine that the first speaker of the voice signal is a second speaker who has completed user registration with the voice recognition server; request the portable device of the second speaker to utter the same content as the voice of the first speaker; receive an authentication voice signal including an authentication voice from the portable device of the second speaker; authenticate the identity between the second speaker and the first speaker based on the voice signal received from the voice recognition device and the authentication voice signal received from the portable device of the second speaker; and, when the identity is authenticated, execute an operation corresponding to the first voice recognition result.

The voice recognition server according to claim 13, wherein the processor is configured to:
perform voice recognition on the authentication voice signal to generate a second voice recognition result;
compare the second voice recognition result with the first voice recognition result; and
determine the identity between the first speaker and the second speaker based on at least the result of the comparison.

A voice recognition system including a voice recognition server and a voice recognition device, wherein
the voice recognition device includes: a first communication module that communicates with the voice recognition server; a microphone that generates an audio signal; a first processor configured to detect, from the audio signal, a voice signal including a voice of a first speaker, transmit the voice signal to the voice recognition server, and receive a synthetic voice signal from the voice recognition server; and a speaker that reproduces a synthetic voice corresponding to the synthetic voice signal,
the voice recognition server includes a second processor and a second communication module that communicates with the voice recognition device and a portable device, and
the second processor is configured to:
receive the voice signal from the voice recognition device;
perform voice recognition on the voice signal to generate a first voice recognition result;
extract a first speaker feature vector from the voice signal and calculate a similarity between the first speaker feature vector and a registered speaker feature vector;
when the similarity is equal to or greater than a first reference value, determine that the first speaker of the voice signal is a second speaker who has completed user registration with the voice recognition server;
request the portable device of the second speaker to utter the same content as the voice of the first speaker;
receive an authentication voice signal including an authentication voice from the portable device of the second speaker;
authenticate the identity between the second speaker and the first speaker based on the voice signal received from the voice recognition device and the authentication voice signal received from the portable device of the second speaker; and
when the identity is authenticated, perform an operation corresponding to the first voice recognition result.

The voice recognition system according to claim 15, wherein the second processor is configured to:
perform voice recognition on the authentication voice signal to generate a second voice recognition result;
compare the second voice recognition result with the first voice recognition result; and
determine the identity between the first speaker and the second speaker based on at least the result of the comparison.