JP2019028465A - Speaker verification method and speech recognition system

Info

Publication number: JP2019028465A
Application number: JP2018140622A
Authority: JP
Inventors: 奉眞李; Bong Jin Lee; ▲みん▼ 碩崔; Min Seok Choi; 益 ▲祥▼ 韓; Ick Sang Han; 五赫權; Oh Hyeok Kwon; 丙烈金; Byeong Yeol Kim; 明祐呉; Myung Woo Oh; 燦奎李; Chan Kyu Lee; 貞姫任; Jung Hui Im; 丁牙崔; Jung A Choi; 秀桓金; Suhwan Kim
Original assignee: Line Corp; Naver Corp
Current assignee: Z Intermediate Global Corp; Naver Corp
Priority date: 2017-07-26
Filing date: 2018-07-26
Publication date: 2019-02-21
Anticipated expiration: 2038-07-26
Also published as: KR20190012065A; KR101995443B1; JP6662962B2

Abstract

To provide a speaker verification method and a speech recognition system. SOLUTION: In a speech recognition system capable of accurately recognizing a speaker, a speaker verification method includes: a step in which a speech recognition server receives, from a speech recognition device, a speech signal including the voice of a first speaker; a step in which the speech recognition server performs speech recognition on the speech signal and generates a speech recognition result; a step in which the speech recognition server extracts a speaker feature vector from the speech signal, compares the speaker feature vector with a registered speaker feature vector, and determines, based on the result of the comparison, that the speaker of the speech signal is a registered second speaker; a step in which the speech recognition server transmits the speech recognition result and the speech signal to a portable device of the second speaker; a step in which the speech recognition server receives an approval input from the portable device; and a step in which the speech recognition server executes an operation corresponding to the speech recognition result. SELECTED DRAWING: Figure 4A

Description

The present invention relates to a speaker verification method and a speech recognition system, and more particularly to a method for verifying a speaker in a speech recognition system that includes a speech recognition device and a speech recognition server.

Artificial intelligence speaker devices equipped with a speech recognition function are on the market. Such a device recognizes the user's voice, extracts the command contained in the voice, executes an operation according to the command, and outputs the result as speech, and can thereby play a role like that of an artificial intelligence secretary. For such a device to go beyond simply answering spoken questions and to be used in fields that require security, such as financial transactions and shopping, it must be able to recognize and identify the speaker accurately. However, an artificial intelligence speaker device identifies a user by voice, which is less accurate than user identification or authentication methods that use biometric information such as fingerprint or iris recognition.

The present invention is intended to solve the problem described above, and its object is to provide a method capable of accurately recognizing and identifying a speaker.

As technical means for achieving the above technical object, a first aspect of the present disclosure provides a method for verifying a speaker in a speech recognition system that includes a speech recognition device and a speech recognition server. The speaker verification method includes: a step in which the speech recognition server receives, from the speech recognition device, a speech signal including the voice of a first speaker; a step in which the speech recognition server performs speech recognition on the speech signal and generates a speech recognition result; a step in which the speech recognition server extracts a speaker feature vector from the speech signal, compares the speaker feature vector with a registered speaker feature vector, and determines, based on the result of the comparison, that the speaker of the speech signal is a registered second speaker; a step in which the speech recognition server transmits the speech recognition result and the speech signal to a portable device of the second speaker; a step in which the speech recognition server receives an approval input from the portable device; and a step in which the speech recognition server executes an operation corresponding to the speech recognition result.
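The steps of the first aspect can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the names `RegisteredUser`, `identify_speaker`, and `verify_and_execute` are hypothetical, the recognizer, feature extractor, approval channel, and executor are passed in as placeholder callables, and cosine similarity with a fixed threshold is an assumed comparison method that the source does not specify.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class RegisteredUser:
    name: str
    feature_vector: List[float]  # registered speaker feature vector
    device_id: str               # portable device stored at registration


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    # Assumed comparison measure; the source only says the vectors are "compared".
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_speaker(vector: Sequence[float],
                     users: Sequence[RegisteredUser],
                     threshold: float = 0.8) -> Optional[RegisteredUser]:
    """Return the most similar registered user, or None if below the threshold."""
    best = max(users,
               key=lambda u: cosine_similarity(vector, u.feature_vector),
               default=None)
    if best is None or cosine_similarity(vector, best.feature_vector) < threshold:
        return None
    return best


def verify_and_execute(speech_signal: bytes,
                       users: Sequence[RegisteredUser],
                       recognize: Callable[[bytes], str],
                       extract_vector: Callable[[bytes], Sequence[float]],
                       ask_approval: Callable[[str, str, bytes], bool],
                       execute: Callable[[str], str]) -> str:
    # Generate the speech recognition result from the received speech signal.
    result = recognize(speech_signal)
    # Determine which registered user (the "second speaker") the voice matches.
    second_speaker = identify_speaker(extract_vector(speech_signal), users)
    if second_speaker is None:
        return "rejected: no registered speaker matched"
    # Send the result and the speech signal to the second speaker's portable
    # device and wait for the approval input.
    if not ask_approval(second_speaker.device_id, result, speech_signal):
        return "rejected: approval not received"
    # Execute the operation corresponding to the speech recognition result.
    return execute(result)
```

Passing the collaborators in as callables keeps the sketch independent of any particular recognizer or messaging channel; a real server would replace them with its own components.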

A second aspect of the present disclosure provides a speech recognition server that includes a processor and a communication module that communicates with a speech recognition device and a portable device. The processor is configured to: receive, using the communication module, a speech signal including the voice of a first speaker from the speech recognition device; perform speech recognition on the speech signal and generate a speech recognition result; extract a speaker feature vector from the speech signal, compare the speaker feature vector with a registered speaker feature vector, and determine, based on the result of the comparison, that the speaker of the speech signal is a registered second speaker; transmit, using the communication module, the speech recognition result and the speech signal to a portable device of the second speaker; receive, using the communication module, an approval input from the portable device; and execute an operation corresponding to the speech recognition result.

A third aspect of the present disclosure provides a speech recognition system that includes a speech recognition server and a speech recognition device. The speech recognition device includes: a first communication module that communicates with the speech recognition server; a microphone that generates an audio signal; a first processor that detects a speech signal from the audio signal, transmits the speech signal to the speech recognition server, and receives a synthesized sound signal from the speech recognition server; and a speaker that reproduces a synthesized sound corresponding to the synthesized sound signal. The speech recognition server includes a second processor and a second communication module capable of communicating with the speech recognition device and a portable device. The second processor is configured to: receive a speech signal including the voice of a first speaker from the speech recognition device; perform speech recognition on the speech signal and generate a speech recognition result; extract a speaker feature vector from the speech signal, compare the speaker feature vector with a registered speaker feature vector, and recognize, based on the result of the comparison, that the speaker of the speech signal is a registered second speaker; transmit the speech recognition result and the speech signal to a portable device of the second speaker; receive an approval input from the portable device; and execute an operation corresponding to the speech recognition result.

A fourth aspect of the present disclosure can provide a program that causes a processor of a speech recognition server of a speech recognition system to execute the speaker verification method according to the first aspect.

A fifth aspect of the present disclosure can provide a computer-readable recording medium on which the program according to the fourth aspect is recorded.

According to various embodiments of the present disclosure, a speaker can be accurately identified and recognized through the speaker verification procedure, so the speaker's command can be executed without concern about voice misappropriation. Furthermore, even if a speaker verification error or voice misappropriation occurs, the speaker concerned can become aware of it.

FIG. 1 is an exemplary network configuration diagram of a speech recognition system according to an embodiment.
FIG. 2A is a block diagram for explaining the internal configuration of a speech recognition speaker device according to an embodiment.
FIG. 2B is a block diagram for explaining the internal configuration of a speech recognition speaker device according to another embodiment.
FIG. 3 is a block diagram for explaining the internal configuration of a speech recognition server according to an embodiment.
A block diagram for explaining the internal configuration of the processor of the speech recognition server according to an embodiment.
A block diagram for explaining the internal configuration of the processor of the speech recognition server according to an embodiment.
An exemplary flowchart for explaining a speaker verification method of the speech recognition system according to an embodiment.
An exemplary flowchart for explaining a speaker verification method of the speech recognition system according to an embodiment.
A drawing illustrating an exemplary screen of the second speaker's portable device connected to the speech recognition system, according to an embodiment.
An exemplary flowchart for explaining a speaker verification method of a speech recognition system according to another embodiment.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings so that those skilled in the art to which the present invention belongs can easily implement them. However, the present invention may be embodied in various different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted in order to describe the present invention clearly, and similar parts are denoted by similar reference numerals throughout.

Throughout the specification, when a part is said to be “coupled/connected” to another part, this includes not only the case where it is “directly coupled/connected” but also the case where it is “electrically coupled/connected” with another element in between. When a part is said to “include” a component, this means that, unless specifically stated otherwise, it does not exclude other components and may further include other components.

In this specification, phrases such as “in some embodiments” or “in one embodiment” appearing in various places do not necessarily refer to the same embodiment.

Some embodiments may be described in terms of functional block configurations and various processing steps. Some or all of these functional blocks may be implemented by any number of hardware and/or software components that execute specific functions. For example, the functional blocks of the present disclosure may be implemented by one or more microprocessors, or by circuit configurations for a given function. The functional blocks of the present disclosure may also be implemented in various programming or scripting languages, or as algorithms executed on one or more processors. In addition, the present disclosure can employ conventional techniques for electronic configuration, signal processing, and/or data processing. Terms such as “module” and “configuration” are used broadly and are not limited to mechanical and physical configurations.

The connection lines or connection members between the components illustrated in the drawings merely exemplify functional connections and/or physical or circuit connections. In an actual device, the connections between components may be realized by various functional, physical, or circuit connections that can be substituted or added.

In the present disclosure, the speech recognition function refers to converting a speech signal that contains a user's voice into a character string (or text). The character string (or text) obtained by converting the speech signal with the speech recognition function is also called a speech recognition result. The user's speech signal includes a voice command, and the speech recognition result may likewise include a command corresponding to the voice command. Such a voice command can cause a specific function of the speech recognition speaker device or the speech recognition server to be executed. The speech synthesis function, by contrast, refers to converting a character string (or text) into a speech signal, the reverse of the speech recognition function. The speech signal obtained by converting a character string (or text) with the speech synthesis function is also called a synthesized sound signal.

In the present disclosure, the expression “registered” means registered in the speech recognition system as a user or as information related to a user. A “registered user” is a user who has completed user registration in the speech recognition system. A person can register as a user in the speech recognition system according to the present disclosure and can input his or her own voice when registering. The speech recognition system can extract a speaker feature vector from the speech signal of the voice input at user registration and store it as information related to the registered user. A speaker feature vector stored in the speech recognition system in this way is called a registered speaker feature vector. At user registration, the identification number of the user's own portable device can also be stored.
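The registration described above can be sketched as a small in-memory store. This is a minimal sketch, not the patent's implementation: `Registration`, `UserRegistry`, and `toy_feature_vector` are hypothetical names, and the toy extractor (mean and mean absolute amplitude) merely stands in for a real speaker feature extractor, which the source does not describe.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence


@dataclass(frozen=True)
class Registration:
    user_id: str
    feature_vector: List[float]  # the "registered speaker feature vector"
    device_id: str               # identification number of the user's portable device


def toy_feature_vector(samples: Sequence[float]) -> List[float]:
    # Placeholder only: a real system would use a trained speaker model.
    if not samples:
        raise ValueError("empty enrollment signal")
    n = len(samples)
    return [sum(samples) / n, sum(abs(s) for s in samples) / n]


class UserRegistry:
    """Hypothetical in-memory store of registered users."""

    def __init__(self,
                 extract_vector: Callable[[Sequence[float]], List[float]] = toy_feature_vector):
        self._extract = extract_vector
        self._users: Dict[str, Registration] = {}

    def register(self, user_id: str, enrollment_signal: Sequence[float],
                 device_id: str) -> Registration:
        # Extract the feature vector from the voice input at registration and
        # store it together with the portable device's identification number.
        record = Registration(user_id, self._extract(enrollment_signal), device_id)
        self._users[user_id] = record
        return record

    def lookup(self, user_id: str) -> Optional[Registration]:
        return self._users.get(user_id)


registry = UserRegistry()
record = registry.register("alice", [0.5, -0.5, 1.0, -1.0], "device-001")
```

Storing the device identifier next to the feature vector is what later lets the server route the recognition result to the matched user's portable device.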

A plurality of users may be registered in such a speech recognition system. In the present disclosure, the first speaker is the person who actually uttered the voice in the speech signal, and the registered second speaker is the user, among the plurality of users registered in the speech recognition system, whom the speech recognition system has recognized or determined to have uttered the voice in the speech signal. The registered second speaker is generally the same person as the first speaker, but when the speech recognition system misrecognizes the speaker or when voice misappropriation occurs, the registered second speaker may differ from the first speaker.

In the present disclosure, a keyword may take the form of a word or a phrase. A voice command uttered after the wake-up keyword may take the form of a natural-language sentence, a word, or a phrase.

Hereinafter, the present disclosure will be described in detail with reference to the accompanying drawings.

FIG. 1 is an exemplary network configuration diagram of a speech recognition system according to an embodiment.

Referring to FIG. 1, the network environment of the speech recognition system is illustrated, by way of example, as including a speech recognition speaker device 100, a speech recognition server 200, a portable device 300, and a network 400. The speech recognition system includes the speech recognition speaker device 100 and the speech recognition server 200.

The speech recognition speaker device 100 is an example of a speech recognition device: a speaker device that is equipped with a voice control function and executes specific functions. The speech recognition speaker device 100 is also called a smart speaker device or an artificial intelligence speaker device. On receiving a speaker's voice, the speech recognition speaker device 100 can recognize the voice and the speaker, extract the command contained in the voice, execute an operation according to the command, and output the result as speech. The specific functions that the speech recognition speaker device 100 can execute may include, for example, providing information by voice, playing music, Internet shopping, financial transactions, placing telephone calls, sending messages, setting alarms, and controlling electronic or mechanical devices connected to the speech recognition speaker device via a network.

For example, when the speech recognition speaker device 100 is connected to a smart TV via a network, the specific functions may include channel viewing, channel search, video playback, and program search. When the speech recognition speaker device 100 is connected to a home appliance such as a smart refrigerator, the specific functions may include checking the refrigeration and freezing states and setting the temperature. However, in the present disclosure, the specific functions are not limited to those described above.

The speech recognition speaker device 100 can communicate with the speech recognition server 200 over the network 400 by wireless or wired communication.

The communication method of the network 400 is not limited, and may include not only communication methods that use the communication networks included in the network 400 (for example, a mobile communication network, the wired Internet, the wireless Internet, or a broadcast network) but also short-range wireless communication with the speech recognition speaker device 100. For example, the network 400 may include any one or more of networks such as a PAN (personal area network), a LAN (local area network), a CAN (campus area network), a MAN (metropolitan area network), a WAN (wide area network), a BBN (broadband network), and the Internet. The network 400 may include, but is not limited to, any one or more network topologies, including a bus network, a star network, a ring network, a mesh network, a star-bus network, and a tree or hierarchical network.

The speech recognition server 200 communicates with the speech recognition speaker device 100 via the network 400 and may be implemented by at least one computer device. The speech recognition server 200 may be distributed in cloud form and can provide instructions, code, files, content, and the like.

The speech recognition server 200 can convert a speech signal received from the speech recognition speaker device 100 into a character string (or text) and generate a speech recognition result. The speech recognition server 200 can synthesize the speech to be reproduced by the speech recognition speaker device 100, generate a synthesized sound signal, and transmit the synthesized sound signal to the speech recognition speaker device 100.

The speech recognition server 200 can actually carry out the specific functions that the speech recognition speaker device 100 is able to execute. For example, for the voice information providing function, the speech recognition server 200 can recognize the information request contained in the speech signal received from the speech recognition speaker device 100, generate a result for the request, and transmit the result to the speech recognition speaker device 100 in the form of a synthesized sound signal. For the telephone call function, the speech recognition server 200 can recognize the call request contained in the speech signal received from the speech recognition speaker device 100, place the call in response to the request, and relay the transmitted and received signals during the call. The speech recognition server 200 may also be connected to home appliances via the network 400, and can control a home appliance according to a control command contained in the speech signal received from the speech recognition speaker device 100.

The speech recognition server 200 can also be connected to the portable device 300 via the network 400. The network that connects the speech recognition server 200 and the speech recognition speaker device 100 and the network that connects the speech recognition server 200 and the portable device 300 may be of different types. For example, the network that connects the speech recognition server 200 and the speech recognition speaker device 100 may be a LAN or the Internet, and the network that connects the speech recognition server 200 and the portable device 300 may be a mobile communication network.

The portable device 300 is an electronic device that supports wireless communication and that a user can carry around. For example, the portable device 300 may be a mobile phone, a smartphone, a tablet PC (personal computer), or a notebook PC. The portable device 300 may have a telephone function, a message function, or a messenger function, and can reproduce an audio signal or a video signal received from the speech recognition server 200. The portable device 300 can also provide a speech signal to the speech recognition server 200. The portable device 300 is generally an electronic device used by a single individual.

In FIG. 1, the speech recognition speaker device 100 is illustrated as being connected via the network 400 to the speech recognition server 200, which performs the speech recognition function. This is only an example, and the speech recognition speaker device 100 can also perform the speech recognition function or the speech synthesis function independently.

FIG. 2A is a block diagram for explaining the internal configuration of the speech recognition speaker device 100 according to an embodiment.

Referring to FIG. 2A, the speech recognition speaker device 100 may include a processor 110, a microphone 120, a speaker 130, and a communication module 140. The speech recognition speaker device 100 may include more components than those illustrated in FIG. 2A. For example, the speech recognition speaker device 100 may further include a memory. The speech recognition speaker device 100 is connected to the network 400 of FIG. 1 via the communication module 140 and can communicate with the speech recognition server 200.

The microphone 120 can directly generate an audio signal by converting ambient audio into electrical acoustic data. The speech recognition speaker device 100 may include a plurality of microphones 120 and can use them to determine the direction from which the audio signal arrives.

According to another example, the speech recognition speaker device 100 can also receive, via the communication module 140, an audio signal transmitted from an external device. The speaker 130 can convert an audio signal into sound and output it.

The processor 110 may be configured to process the instructions of a computer program by performing basic arithmetic, logic, and input/output operations. Such instructions may be provided to the processor 110 from memory, or received via the communication module 140 and provided to the processor 110. For example, the processor 110 may be configured to execute instructions according to program code stored in a storage device such as a memory.

The processor 110 can detect, from the audio signal generated by the microphone 120, a speech signal corresponding to the speaker's voice, and transmit the detected speech signal to the speech recognition server 200 via the communication module 140. The processor 110 can use a keyword to detect the speech signal in the audio signal: by extracting from the audio signal the keyword speech signal corresponding to the keyword, the processor 110 can identify the speech signal received following the keyword speech signal.
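The keyword-based detection above can be sketched as follows. This is an illustrative sketch only: the audio is assumed to be already split into frames, `is_keyword` and `is_silence` are hypothetical predicates standing in for a keyword spotter and a silence detector, and ending the command at the next silence is an assumption, since the source does not say how the end of the speech signal is found.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def speech_after_keyword(frames: Sequence[Frame],
                         is_keyword: Callable[[Frame], bool],
                         is_silence: Callable[[Frame], bool]) -> List[Frame]:
    """Return the frames that follow the first keyword, up to the next silence."""
    i, n = 0, len(frames)
    # Skip everything before the keyword speech signal.
    while i < n and not is_keyword(frames[i]):
        i += 1
    if i == n:
        return []  # no keyword found, so nothing is sent to the server
    # Skip the keyword speech signal itself.
    while i < n and is_keyword(frames[i]):
        i += 1
    # Skip any pause between the keyword and the command.
    while i < n and is_silence(frames[i]):
        i += 1
    # Collect the speech signal that follows, until silence (assumed end marker).
    start = i
    while i < n and not is_silence(frames[i]):
        i += 1
    return list(frames[start:i])


# Frames are shown as labels for readability; real frames would be audio buffers.
frames = ["noise", "KW", "KW", "sil", "play", "music", "sil", "later"]
command = speech_after_keyword(frames, lambda f: f == "KW", lambda f: f == "sil")
```

In this example `command` holds the two frames between the pause after the keyword and the next silence.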

The processor 110 can receive a synthesized sound signal from the speech recognition server 200 and reproduce, through the speaker 130, the synthesized sound corresponding to the synthesized sound signal.

FIG. 2B is a block diagram for explaining the internal configuration of a speech recognition speaker device 100a according to another embodiment.

Referring to FIG. 2B, the speech recognition speaker device 100a further includes a camera 150 in addition to the components of the speech recognition speaker device 100 of FIG. 2A.

The camera 150 is controlled by the processor 110 and can generate a video signal corresponding to the video at the time when a speech signal is detected in the audio signal received from the microphone 120. The processor 110 can transmit the video signal to the speech recognition server 200 via the communication module 140. The camera 150 may be, for example, a 360° camera capable of capturing the full 360° surroundings.

他の例によれば、音声認識スピーカ装置１００ａは、カメラ１５０の撮影方向を調節することができる。音声認識スピーカ装置１００ａは、音声が発生した方向を感知することができるセンサを含んでもよい。プロセッサ１１０は、音声信号が検出された方向を感知し、カメラ１５０の撮影方向を、音声信号が検出された方向に調節することができる。このとき、プロセッサ１１０から音声認識サーバ２００に送信される映像信号は、話者の音声が発生した方向の映像を含んでよい。 According to another example, the voice recognition speaker device 100a can adjust the shooting direction of the camera 150. The voice recognition speaker device 100a may include a sensor that can sense the direction from which the voice originated. The processor 110 can sense the direction in which the voice signal is detected and adjust the shooting direction of the camera 150 to that direction. In this case, the video signal transmitted from the processor 110 to the voice recognition server 200 may include video of the direction from which the speaker's voice originated.

さらに他の例によれば、音声認識スピーカ装置１００ａは、周辺３６０°を撮影するように、一定間隔に配置される複数のカメラ１５０を含んでもよい。音声認識スピーカ装置１００ａは、音声が発生した方向を感知することができるセンサを含んでもよい。プロセッサ１１０は、音声信号が検出された方向を感知し、複数のカメラ１５０のうち、音声信号が検出された方向を撮影することができるカメラ１５０を利用し、話者の音声が発生した方向の映像に対応する映像信号を取得することができる。 According to yet another example, the voice recognition speaker device 100a may include a plurality of cameras 150 arranged at regular intervals so as to capture the surrounding 360°. The voice recognition speaker device 100a may include a sensor that can sense the direction from which the voice originated. The processor 110 can sense the direction in which the voice signal is detected and, among the plurality of cameras 150, use the camera 150 that can capture that direction to acquire a video signal corresponding to video of the direction from which the speaker's voice originated.
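
The camera selection described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that camera i of N evenly spaced cameras is centred at i × 360/N degrees and that the direction sensor reports an angle in degrees; the function name `select_camera` is hypothetical and does not appear in the specification.

```python
def select_camera(sound_direction_deg: float, num_cameras: int) -> int:
    """Return the index of the camera whose centre direction is nearest the sound.

    Assumption for illustration: camera i is centred at i * 360 / num_cameras degrees.
    """
    if num_cameras <= 0:
        raise ValueError("num_cameras must be positive")
    spacing = 360.0 / num_cameras
    # Normalize the sensed direction into [0, 360) before rounding.
    normalized = sound_direction_deg % 360.0
    # Round to the nearest camera centre and wrap around past the last camera.
    return int(round(normalized / spacing)) % num_cameras
```

With four cameras, a voice sensed at 100° would be captured by the camera centred at 90° (index 1), and a voice at 350° wraps around to the camera centred at 0°.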

図３は、一実施形態による音声認識サーバ２００の内部構成について説明するためのブロック図である。 FIG. 3 is a block diagram for explaining the internal configuration of the speech recognition server 200 according to an embodiment.

図３を参照すると、音声認識サーバ２００は、プロセッサ２１０、メモリ２２０及び通信モジュール２３０を含む。音声認識サーバ２００は、図３に図示されている構成要素より多くの構成要素を含んでもよい。例えば、音声認識サーバ２００は、入出力装置をさらに含んでもよい。 Referring to FIG. 3, the voice recognition server 200 includes a processor 210, a memory 220, and a communication module 230. The voice recognition server 200 may include more components than those illustrated in FIG. 3. For example, the voice recognition server 200 may further include an input/output device.

通信モジュール２３０は、ネットワーク４００を介して音声認識サーバ２００が音声認識スピーカ装置１００及び携帯装置３００と通信するための機能を提供することができる。音声認識サーバ２００は、通信モジュール２３０を介して、図１のネットワーク４００に接続され、音声認識スピーカ装置１００及び携帯装置３００と通信することができる。 The communication module 230 can provide a function for the voice recognition server 200 to communicate with the voice recognition speaker device 100 and the portable device 300 via the network 400. The voice recognition server 200 is connected to the network 400 of FIG. 1 via the communication module 230 and can communicate with the voice recognition speaker device 100 and the portable device 300.

メモリ２２０は、コンピュータ読み取り可能な記録媒体であり、ＲＡＭ（random access memory）、ＲＯＭ（read-only memory）及びディスクドライブのような永続的大容量記録装置を含んでよい。メモリ２２０には、オペレーティングシステムと、少なくとも１つのプログラムコード（例えば、音声認識サーバ２００においてインストールされて実行される音声認識アプリケーション、音声合成アプリケーションなどのためのコード）と、が保存される。かようなソフトウェアコンポーネントは、通信モジュール２３０を利用し、通信を介して、メモリ２２０にロードされる。例えば、少なくとも１つのプログラムは、開発者、またはアプリケーションのインストールファイルを配布するファイル配布システムが、ネットワーク４００を介して提供するファイルによってインストールされるプログラムに基づき、メモリ２２０にロードされる。 The memory 220 is a computer-readable recording medium and may include a random access memory (RAM), a read-only memory (ROM), and a permanent mass storage device such as a disk drive. The memory 220 stores an operating system and at least one program code (for example, code for a voice recognition application, a voice synthesis application, or the like that is installed and executed on the voice recognition server 200). Such software components may be loaded into the memory 220 through communication using the communication module 230. For example, at least one program may be loaded into the memory 220 based on a program installed from files that a developer, or a file distribution system that distributes application installation files, provides via the network 400.

プロセッサ２１０は、基本的な算術、ロジック及び入出力演算を実行することにより、コンピュータプログラムの命令を処理するように構成され得る。プロセッサ２１０は、メモリ２２０に保存されたプログラムコードによって命令を実行するように構成され得る。 The processor 210 may be configured to process computer program instructions by performing basic arithmetic, logic and input / output operations. The processor 210 may be configured to execute instructions according to program code stored in the memory 220.

プロセッサ２１０は、音声認識スピーカ装置１００から、第１話者の音声を含む音声信号を受信し、音声信号に対して音声認識を実行し、音声認識結果を生成するように構成され得る。例えば、プロセッサ２１０は、音声信号に対する音声認識を実行するために、音声信号の周波数特性を抽出し、音響モデル及び言語モデルを利用して音声認識を実行することができる。かような周波数特性は、音響入力の周波数スペクトルを分析して抽出される音響入力の周波数成分の分布を意味する。音響モデル及び言語モデルは、メモリ２２０にも保存される。ただし、音声認識方法は、これに限定されるものではなく、音声信号を文字列（または、テキスト）に変換する多様な技術が使用される。 The processor 210 may be configured to receive a voice signal including the voice of the first speaker from the voice recognition speaker device 100, perform voice recognition on the voice signal, and generate a voice recognition result. For example, the processor 210 may extract a frequency characteristic of the voice signal and perform voice recognition using an acoustic model and a language model in order to perform voice recognition on the voice signal. Such a frequency characteristic means a distribution of frequency components of the sound input extracted by analyzing the frequency spectrum of the sound input. The acoustic model and the language model are also stored in the memory 220. However, the voice recognition method is not limited to this, and various techniques for converting a voice signal into a character string (or text) are used.

プロセッサ２１０は、音声信号を分析し、音声信号に含まれている音声を発話した話者がだれであるかを決定することができる。例えば、かような音声信号から、話者特徴ベクトルを抽出し、話者特徴ベクトルを登録された話者特徴ベクトルと比較し、この比較の結果により、音声信号の話者が、登録された第２話者であると決定するように構成され得る。登録された話者特徴ベクトルは、メモリ２２０に事前に保存されている。音声認識サーバ２００には、複数の話者が、音声認識スピーカ装置１００のユーザとして登録されることもあり、その場合、メモリ２２０には、複数の登録された話者特徴ベクトルが保存される。登録された話者特徴ベクトルは、登録された話者にそれぞれ対応する。 The processor 210 can analyze the voice signal and determine who uttered the voice contained in it. For example, the processor 210 may be configured to extract a speaker feature vector from the voice signal, compare the speaker feature vector with the registered speaker feature vectors, and determine, according to the result of this comparison, that the speaker of the voice signal is the registered second speaker. The registered speaker feature vectors are stored in the memory 220 in advance. A plurality of speakers may be registered in the voice recognition server 200 as users of the voice recognition speaker device 100, in which case a plurality of registered speaker feature vectors are stored in the memory 220. Each registered speaker feature vector corresponds to one registered speaker.

プロセッサ２１０は、音声信号の話者が第２話者であると決定するために、音響モデルから抽出された事後情報（states posteriors）、一般的背景モデル及び全体変異性変換情報のうち少なくとも一つを利用し、音声信号の周波数特性から話者特徴ベクトルを生成することができる。プロセッサ２１０は、生成された話者特徴情報と、メモリ２２０に保存された登録された話者特徴ベクトルと、に基づいて、音声信号の話者が、登録された話者であるか否かを決定することができる。メモリ２２０には、事後情報、一般的背景モデル、全体変異性変換情報、及び登録された話者情報のうち少なくとも一つが保存される。 To determine that the speaker of the voice signal is the second speaker, the processor 210 can generate a speaker feature vector from the frequency characteristics of the voice signal by using at least one of posterior information (state posteriors) extracted from the acoustic model, a general background model, and global variability conversion information. Based on the generated speaker feature information and the registered speaker feature vectors stored in the memory 220, the processor 210 can determine whether the speaker of the voice signal is a registered speaker. The memory 220 may store at least one of the posterior information, the general background model, the global variability conversion information, and registered speaker information.

プロセッサ２１０は、第２話者の携帯装置３００に音声認識結果及び音声信号を送信し、第２話者の携帯装置３００から承認入力を受信するように構成され得る。第２話者の登録された携帯装置３００についての情報は、メモリ２２０に事前に保存されている。第２話者が、音声認識スピーカ装置１００を介して、音声認識サーバ２００に、音声認識スピーカ装置１００のユーザとして登録するとき、自身の携帯装置３００についての情報（例えば、携帯装置３００の識別番号）を共に登録することができる。 The processor 210 may be configured to transmit the voice recognition result and the voice signal to the second speaker's portable device 300 and to receive an approval input from the second speaker's portable device 300. Information about the second speaker's registered portable device 300 is stored in the memory 220 in advance. When registering with the voice recognition server 200 as a user of the voice recognition speaker device 100 via the voice recognition speaker device 100, the second speaker can also register information about his or her own portable device 300 (for example, an identification number of the portable device 300).

プロセッサ２１０は、音声認識結果に対応する動作を実行するように構成され得る。プロセッサ２１０は、音声認識結果に対応する機能を決定し、かような機能を実行することができる。プロセッサ２１０は、動作の実行結果を報告するための合成音信号を生成するように構成され得る。プロセッサ２１０は、合成音信号を音声認識スピーカ装置１００に送信するように構成され得る。 The processor 210 may be configured to perform an operation corresponding to the speech recognition result. The processor 210 can determine a function corresponding to the speech recognition result and execute such a function. The processor 210 may be configured to generate a synthesized sound signal for reporting the execution result of the operation. The processor 210 may be configured to transmit the synthesized sound signal to the speech recognition speaker device 100.

音声認識サーバ２００は、入出力装置として、マイクロフォンまたはスピーカをさらに含んでもよい。音声認識サーバ２００は、音声信号を直接生成し、合成音を直接再生することもできる。 The voice recognition server 200 may further include a microphone or a speaker as an input / output device. The voice recognition server 200 can directly generate a voice signal and directly reproduce the synthesized sound.

図４Ａは、一実施形態による音声認識サーバのプロセッサの内部構成について説明するためのブロック図である。 FIG. 4A is a block diagram for explaining the internal configuration of the processor of the speech recognition server according to the embodiment.

図４Ａを参照すると、音声認識サーバ２００のプロセッサ２１０は、音声信号受信部２１１、音声認識部２１２、話者認識部２１３、話者検証部２１４及び機能部２１５を含む。音声認識サーバ２００は、合成音信号生成部２１６をさらに含んでもよい。話者認識部２１３は、話者特徴ベクトル抽出部２１３ａ、話者特徴ベクトル比較部２１３ｂ及び登録話者決定部２１３ｃを含む。 Referring to FIG. 4A, the processor 210 of the voice recognition server 200 includes a voice signal receiving unit 211, a voice recognition unit 212, a speaker recognition unit 213, a speaker verification unit 214, and a function unit 215. The speech recognition server 200 may further include a synthesized sound signal generation unit 216. The speaker recognition unit 213 includes a speaker feature vector extraction unit 213a, a speaker feature vector comparison unit 213b, and a registered speaker determination unit 213c.

音声信号受信部２１１は、音声認識スピーカ装置１００から、第１話者の音声を含む音声信号を受信する。 The voice signal receiving unit 211 receives a voice signal including the voice of the first speaker from the voice recognition speaker device 100.

音声認識部２１２は、音声信号受信部２１１によって受信された音声信号に対して音声認識を実行し、音声認識結果を生成する。音声認識部２１２は、音声信号に対して音声認識を実行し、話者の音声を文字列（または、テキスト）に変換することができる。音声認識部２１２は、変換された文字列（または、テキスト）を自然言語処理し、音声信号に含まれている話者の命令を抽出することができる。音声認識結果は、話者の命令を含み、かような音声認識結果に対応する動作は、話者の命令による動作を意味する。 The voice recognition unit 212 performs voice recognition on the voice signal received by the voice signal reception unit 211 and generates a voice recognition result. The voice recognition unit 212 can perform voice recognition on the voice signal and convert the voice of the speaker into a character string (or text). The speech recognition unit 212 can perform natural language processing on the converted character string (or text) and extract a speaker's command included in the speech signal. The speech recognition result includes a speaker's command, and the operation corresponding to such a speech recognition result means an operation based on the speaker's command.

話者認識部２１３は、音声信号受信部２１１によって受信された音声信号の話者が、第２話者であると決定する。第２話者は、音声認識システムに、音声認識スピーカ装置１００のユーザとして登録された話者である。例えば、話者特徴ベクトル抽出部２１３ａは、音声信号受信部２１１によって受信された音声信号から、話者特徴ベクトルを抽出する。話者特徴ベクトル抽出部２１３ａは、時間領域ベースの音声信号を周波数領域の信号に変換し、変換された信号の周波数エネルギーを、互いに異なるように変形することにより、話者特徴ベクトルを抽出することができる。例えば、かような話者特徴ベクトルは、メル周波数ケプストラム係数またはフィルタバンクエネルギーを基に抽出されるが、それらに限定されるものではなく、多様な方式で、オーディオデータから話者特徴ベクトルを抽出することができる。 The speaker recognition unit 213 determines that the speaker of the voice signal received by the voice signal receiving unit 211 is the second speaker. The second speaker is a speaker registered in the voice recognition system as a user of the voice recognition speaker device 100. For example, the speaker feature vector extraction unit 213a extracts a speaker feature vector from the voice signal received by the voice signal receiving unit 211. The speaker feature vector extraction unit 213a can extract the speaker feature vector by converting the time-domain voice signal into a frequency-domain signal and modifying the frequency energies of the converted signal differently from one another. For example, the speaker feature vector may be extracted based on mel-frequency cepstral coefficients or filter bank energies, but the extraction is not limited thereto, and the speaker feature vector can be extracted from audio data in various ways.
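
The time-domain to frequency-domain conversion followed by per-band energies can be sketched as follows. This is a deliberately simplified illustration and not the extraction method of the specification: it uses a naive discrete Fourier transform, equal-width bands instead of mel-spaced filters, and a plain average over frames. The function names, `frame_len`, and `num_bands` are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def power_spectrum(frame: Sequence[float]) -> List[float]:
    """Naive DFT power spectrum for bins 0..n/2 (sufficient for short frames)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def log_band_energies(frame: Sequence[float], num_bands: int) -> List[float]:
    """Sum the power spectrum into equal-width bands and take the logarithm."""
    spectrum = power_spectrum(frame)
    width = len(spectrum) / num_bands
    energies = []
    for b in range(num_bands):
        lo, hi = int(b * width), int((b + 1) * width)
        # A small floor avoids log(0) for silent bands.
        energies.append(math.log(sum(spectrum[lo:hi]) + 1e-10))
    return energies


def speaker_feature_vector(signal: Sequence[float], frame_len: int = 64,
                           num_bands: int = 4) -> List[float]:
    """Average the log band energies over non-overlapping frames."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    per_frame = [log_band_energies(f, num_bands) for f in frames]
    # Returns an empty list when the signal is shorter than one frame.
    return [sum(column) / len(per_frame) for column in zip(*per_frame)]
```

A practical system would more likely use mel-spaced filters and a trained model on top of such features; the sketch only shows how a fixed-length vector can be derived from the frequency energy of a voice signal.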

話者特徴ベクトル比較部２１３ｂは、話者特徴ベクトル抽出部２１３ａによって抽出された話者特徴ベクトルを、メモリ２２０に保存された登録された話者特徴ベクトルと比較する。メモリ２２０には、複数の登録された話者特徴ベクトルが存在し、話者特徴ベクトル比較部２１３ｂは、抽出された話者特徴ベクトルを、複数の登録された話者特徴ベクトルと比較し、最も類似度が高い登録された話者特徴ベクトルを決定する。登録話者決定部２１３ｃは、話者特徴ベクトル比較部２１３ｂの比較の結果により、音声信号の話者が登録された第２話者であると決定する。登録話者決定部２１３ｃは、最も類似度が高い登録された話者特徴ベクトルの話者を、音声信号の話者に決定する。登録された話者特徴ベクトルが、いずれも事前に設定された基準値を超える類似度を有していない場合、登録話者決定部２１３ｃは、音声信号の話者が登録された話者ではないと決定することができる。このとき、プロセッサ２１０は、音声認識結果に対応する動作を実行しないか、あるいは音声認識結果に対応する動作が、だれもが実行することができる動作に設定されている場合に限り、当該動作を実行することができる。 The speaker feature vector comparison unit 213b compares the speaker feature vector extracted by the speaker feature vector extraction unit 213a with the registered speaker feature vectors stored in the memory 220. A plurality of registered speaker feature vectors exist in the memory 220, and the speaker feature vector comparison unit 213b compares the extracted speaker feature vector with each of them and determines the registered speaker feature vector with the highest similarity. The registered speaker determination unit 213c determines, according to the comparison result of the speaker feature vector comparison unit 213b, that the speaker of the voice signal is the registered second speaker. The registered speaker determination unit 213c determines the speaker of the registered speaker feature vector with the highest similarity to be the speaker of the voice signal. If none of the registered speaker feature vectors has a similarity exceeding a preset reference value, the registered speaker determination unit 213c can determine that the speaker of the voice signal is not a registered speaker. In this case, the processor 210 either does not execute the operation corresponding to the voice recognition result, or executes it only if that operation is set as one that anyone may execute.
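
The comparison and determination steps above can be sketched as a highest-similarity search with a reference value. The specification does not name a similarity measure; cosine similarity is assumed here for illustration, and `identify_speaker`, `registered`, and the default `threshold` of 0.7 are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_speaker(extracted: Sequence[float],
                     registered: Dict[str, Sequence[float]],
                     threshold: float = 0.7) -> Optional[str]:
    """Return the registered speaker with the highest similarity, or None
    when no registered vector exceeds the preset reference value."""
    best_id: Optional[str] = None
    best_score = -1.0
    for speaker_id, vector in registered.items():
        score = cosine_similarity(extracted, vector)
        if score > best_score:
            best_id, best_score = speaker_id, score
    return best_id if best_score > threshold else None
```

Returning `None` corresponds to the case in which the speaker is determined not to be a registered speaker, so that only operations open to anyone may proceed.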

話者検証部２１４は、話者認識部２１３によって決定された第２話者の携帯装置に、音声認識部２１２によって生成された音声認識結果と、音声信号受信部２１１によって受信された音声信号と、を送信する。第２話者の携帯装置の識別番号は、メモリ２２０に事前に保存されている。話者検証部２１４は、第２話者の携帯装置から承認入力を受信することができる。 The speaker verification unit 214 transmits the voice recognition result generated by the voice recognition unit 212 and the voice signal received by the voice signal receiving unit 211 to the portable device of the second speaker determined by the speaker recognition unit 213. The identification number of the second speaker's portable device is stored in the memory 220 in advance. The speaker verification unit 214 can receive an approval input from the second speaker's portable device.

機能部２１５は、話者検証部２１４から承認入力を受信すると、音声認識部２１２によって生成された音声認識結果に対応する動作を実行する。機能部２１５は、音声信号の話者が登録された話者ではないと決定された場合、音声認識結果に対応する動作を実行しない。機能部２１５は、話者検証部２１４から拒絶入力を受信すると、音声認識結果に対応する動作を実行しない。 Upon receiving the approval input from the speaker verification unit 214, the function unit 215 executes the operation corresponding to the voice recognition result generated by the voice recognition unit 212. If the speaker of the voice signal is determined not to be a registered speaker, the function unit 215 does not execute the operation corresponding to the voice recognition result. Upon receiving a rejection input from the speaker verification unit 214, the function unit 215 likewise does not execute the operation corresponding to the voice recognition result.

合成音信号生成部２１６は、機能部２１５が動作を実行した場合、動作の実行結果を報告するための合成音信号を生成する。合成音信号生成部２１６は、音声信号の話者が、登録された話者ではないと決定され、音声認識結果に対応する動作が実行されない場合や、話者検証部２１４で承認入力が受信されず、音声認識結果に対応する動作が実行されない場合には、動作が実行されていないということを報告するための合成音信号を生成することができる。 When the function unit 215 has executed the operation, the synthesized sound signal generation unit 216 generates a synthesized sound signal for reporting the execution result of the operation. When the speaker of the voice signal is determined not to be a registered speaker and the operation corresponding to the voice recognition result is therefore not executed, or when no approval input is received by the speaker verification unit 214 and the operation is therefore not executed, the synthesized sound signal generation unit 216 can generate a synthesized sound signal for reporting that the operation was not executed.
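
The interplay of the speaker verification unit 214, the function unit 215, and the synthesized sound signal generation unit 216 can be sketched as below. The callbacks `request_approval` and `execute`, and the report strings, are hypothetical stand-ins for the portable-device round trip, the actual operation, and the text that would be synthesized.

```python
from typing import Callable, Optional, Tuple


def handle_command(speaker_id: Optional[str],
                   action: str,
                   request_approval: Callable[[str, str], Optional[bool]],
                   execute: Callable[[str], str]) -> Tuple[bool, str]:
    """Return (executed, report_text) for a recognized voice command.

    request_approval returns True (approved), False (rejected), or
    None (no response); these conventions are assumptions of this sketch.
    """
    if speaker_id is None:
        # The speaker was determined not to be a registered speaker.
        return False, "not executed: speaker is not registered"
    approved = request_approval(speaker_id, action)
    if approved is not True:
        # Covers both a rejection input and a missing approval input.
        return False, "not executed: approval was not received"
    return True, execute(action)
```

The returned text plays the role of the content reported through the synthesized sound signal, whether or not the operation was executed.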

他の実施形態によれば、プロセッサ２１０は、音声認識スピーカ装置１００から映像信号を受信する映像信号受信部をさらに含んでもよい。このとき、話者検証部２１４は、映像信号受信部によって受信された映像信号を、第２話者の携帯装置に送信することができる。 According to another embodiment, the processor 210 may further include a video signal receiving unit that receives a video signal from the voice recognition speaker device 100. At this time, the speaker verification unit 214 can transmit the video signal received by the video signal receiving unit to the mobile device of the second speaker.

他の実施形態によれば、プロセッサ２１０は、話者ベクトル改善部をさらに含むように構成される。話者検証部２１４が、第２話者の携帯装置から承認入力を受信すると、受信された音声信号の話者が、第２話者であるか否かということが確認されたわけであるので、話者ベクトル改善部は、受信された音声信号から抽出された話者特徴ベクトルを利用し、メモリ２２０に保存された第２話者の登録された話者特徴ベクトルを改善することができる。かような話者特徴ベクトル改善部は、音声信号で抽出された話者特徴ベクトルを利用した適応訓練方式を介して、第２話者の登録された話者特徴ベクトルを生成し、新たに生成された話者特徴ベクトルが、適応訓練以前の登録された話者特徴ベクトルに比べ、適応訓練性能が上昇した場合、新たに生成された話者特徴ベクトルを、メモリ２２０に保存することにより、登録された話者特徴ベクトルを改善することができる。 According to another embodiment, the processor 210 is configured to further include a speaker vector improvement unit. When the speaker verification unit 214 receives an approval input from the second speaker's portable device, it has been confirmed that the speaker of the received voice signal is the second speaker, so the speaker vector improvement unit can use the speaker feature vector extracted from the received voice signal to improve the second speaker's registered speaker feature vector stored in the memory 220. The speaker vector improvement unit generates a new registered speaker feature vector for the second speaker through an adaptive training scheme that uses the speaker feature vector extracted from the voice signal and, if the newly generated speaker feature vector shows improved adaptive training performance compared with the registered speaker feature vector from before the adaptive training, stores the newly generated speaker feature vector in the memory 220, thereby improving the registered speaker feature vector.
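
The specification does not define the adaptive training scheme or how its performance is measured. The following sketch therefore rests on two loudly stated assumptions: the candidate vector is a weighted blend of the stored vector and the newly approved one, and "performance" is the mean cosine similarity to previously verified utterance vectors. All names (`refine_registered_vector`, `verified`, `weight`) are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 0.0 if norm_a == 0.0 or norm_b == 0.0 else dot / (norm_a * norm_b)


def mean_similarity(vector: Sequence[float],
                    verified: Sequence[Sequence[float]]) -> float:
    """Assumed performance measure: average similarity to verified utterances."""
    return sum(cosine_similarity(vector, v) for v in verified) / len(verified)


def refine_registered_vector(registered: Sequence[float],
                             approved_vec: Sequence[float],
                             verified: Sequence[Sequence[float]],
                             weight: float = 0.1) -> List[float]:
    """Return the vector to keep in storage after an approved utterance."""
    current = list(registered)
    if not verified:
        return current  # nothing to evaluate against, keep the stored vector
    # Assumed adaptation step: move the stored vector slightly toward the new one.
    candidate = [(1.0 - weight) * r + weight * a
                 for r, a in zip(registered, approved_vec)]
    # Store the candidate only if it scores better than the current vector.
    if mean_similarity(candidate, verified) > mean_similarity(current, verified):
        return candidate
    return current
```

The key point carried over from the text is the guard: the stored vector is replaced only when the newly generated vector performs better than the one it would replace.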

図４Ｂは、一実施形態による音声認識サーバのプロセッサの内部構成について説明するためのブロック図である。 FIG. 4B is a block diagram for explaining an internal configuration of a processor of the speech recognition server according to the embodiment.

図４Ｂを参照すると、音声認識サーバ２００のプロセッサ２１０ａは、音声信号受信部２１１、音声認識部２１２、話者認識部２１３、検証いかん決定部２１７、話者検証部２１４、機能部２１５及び合成音信号生成部２１６を含む。 Referring to FIG. 4B, the processor 210a of the voice recognition server 200 includes a voice signal reception unit 211, a voice recognition unit 212, a speaker recognition unit 213, a verification determination unit 217, a speaker verification unit 214, a function unit 215, and a synthesized sound. A signal generation unit 216 is included.

検証いかん決定部２１７は、音声認識結果に対応する動作、及び第２話者の設定のうち少なくとも一つを基に、話者検証部２１４の動作を実行するか否かを決定する。 The verification determination unit 217 determines whether to execute the operation of the speaker verification unit 214 based on at least one of the operation corresponding to the speech recognition result and the setting of the second speaker.

一例によれば、検証いかん決定部２１７は、音声認識結果による動作が、第２話者が事前に設定した事前承認動作リストに含まれる場合、話者検証部２１４の動作を実行するように決定することができる。かような事前承認動作リストは、メモリ２２０に保存され、音声認識スピーカ装置１００を介して実行することができる動作または機能のうち一部が、事前に設定した事前承認動作リストにも含まれる。例えば、金融取り引きやインターネットショッピング、メッセージ送信のような動作が、事前に設定した事前承認動作リストに含まれる。かような事前承認動作リストに含まれる動作は、登録された話者ごとに異なるようにも設定される。 According to one example, the verification determination unit 217 can determine to execute the operation of the speaker verification unit 214 when the operation according to the voice recognition result is included in a pre-approval operation list set in advance by the second speaker. The pre-approval operation list may be stored in the memory 220, and some of the operations or functions that can be executed via the voice recognition speaker device 100 may be included in it. For example, operations such as financial transactions, Internet shopping, and message transmission may be included in the pre-approval operation list. The operations included in the pre-approval operation list may be set differently for each registered speaker.

他の例によれば、音声認識結果による動作が、第２話者が事前に設定した事後通知動作リストに含まれる場合、検証いかん決定部２１７は、機能部２１５により音声認識結果に対応する動作がまず実行され、第２話者の携帯装置に音声認識結果及び音声信号を送信するように決定することができる。かような事後通知動作リストは、メモリ２２０に保存され、音声認識スピーカ装置１００が実行することができる動作のうち一部の動作が、事前に設定した事後通知動作リストにも含まれる。例えば、電話かけ、設定変更のような動作が、事前に設定した事後通知動作リストにも含まれる。かような事後通知動作リストに含まれる動作は、登録された話者ごとに異なるようにも設定される。 According to another example, when the operation according to the voice recognition result is included in a post-notification operation list set in advance by the second speaker, the verification determination unit 217 can determine that the function unit 215 first executes the operation corresponding to the voice recognition result and that the voice recognition result and the voice signal are then transmitted to the second speaker's portable device. The post-notification operation list may be stored in the memory 220, and some of the operations that the voice recognition speaker device 100 can execute may be included in it. For example, operations such as placing a phone call and changing settings may be included in the post-notification operation list. The operations included in the post-notification operation list may be set differently for each registered speaker.
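
The two lists can be sketched as a per-speaker lookup that yields one of three modes. The enum `VerificationMode`, the function `decide_verification`, and the action strings are hypothetical; the case in which an operation is on neither list is an assumption of this sketch, since the text above does not spell it out.

```python
from enum import Enum
from typing import Set


class VerificationMode(Enum):
    VERIFY_BEFORE = "verify_before"   # ask the portable device, then execute
    NOTIFY_AFTER = "notify_after"     # execute first, then notify the device
    NONE = "none"                     # assumed default for unlisted operations


def decide_verification(action: str,
                        pre_approval_list: Set[str],
                        post_notification_list: Set[str]) -> VerificationMode:
    """Pick the verification mode for one speaker's recognized operation."""
    if action in pre_approval_list:
        return VerificationMode.VERIFY_BEFORE
    if action in post_notification_list:
        return VerificationMode.NOTIFY_AFTER
    return VerificationMode.NONE
```

Because both sets are passed in per call, each registered speaker can have different lists, as the description allows.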

さらに他の例によれば、検証いかん決定部２１７は、第２話者の携帯装置の位置及び現在時間のうち少なくとも一つが、事前承認条件に符合する場合、話者検証部２１４の動作を実行するように決定することができる。例えば、第２話者の携帯装置が、音声認識スピーカ装置１００の位置の近くに位置する場合、例えば、第２話者の携帯装置と、音声認識スピーカ装置１００と同一無線Ｗｉ−Ｆｉ（登録商標）アクセスポイントに接続される場合や、第２話者の携帯装置のＧＰＳ（global positioning system）位置または無線網接続位置が、音声認識スピーカ装置１００の位置と実質的に一致する場合、第２話者が、音声信号受信部２１１によって受信された音声信号に対応する音声を実際に発話した可能性が高いので、検証いかん決定部２１７は、話者検証部２１４の動作を省略することができる。登録された話者は、かような話者検証部２１４の動作の省略いかんをそれぞれ設定することができる。 According to yet another example, the verification determination unit 217 can determine to execute the operation of the speaker verification unit 214 when at least one of the position of the second speaker's portable device and the current time satisfies a pre-approval condition. For example, when the second speaker's portable device is located near the voice recognition speaker device 100, such as when the second speaker's portable device is connected to the same Wi-Fi (registered trademark) access point as the voice recognition speaker device 100, or when the GPS (global positioning system) position or wireless network connection position of the second speaker's portable device substantially matches the position of the voice recognition speaker device 100, it is highly likely that the second speaker actually uttered the voice corresponding to the voice signal received by the voice signal receiving unit 211, so the verification determination unit 217 can omit the operation of the speaker verification unit 214. Each registered speaker can set whether the operation of the speaker verification unit 214 is omitted in this way.

検証いかん決定部２１７は、第２話者が設定した時間、例えば、平日の昼時間には、話者検証部２１４の動作を実行するように決定することができる。例えば、会社員である第２話者は、平日の昼時間には、家にいない可能性が高いので、家に位置する音声認識スピーカ装置１００が、第２話者の音声を受信する可能性は低い。検証いかん決定部２１７は、かような場合、話者検証部２１４の動作を実行するように決定することができる。登録された話者は、時間を基に、話者検証部２１４の動作を実行するか否かをそれぞれ設定することができる。 The verification determination unit 217 can determine to execute the operation of the speaker verification unit 214 during a time set by the second speaker, for example, weekday daytime. For example, a second speaker who is an office worker is unlikely to be at home during weekday daytime, so the voice recognition speaker device 100 located at home is unlikely to receive the second speaker's voice. In such a case, the verification determination unit 217 can determine to execute the operation of the speaker verification unit 214. Each registered speaker can set, on the basis of time, whether the operation of the speaker verification unit 214 is executed.

事前承認条件は、登録された話者によって事前に設定され、メモリ２２０にも保存される。また、かような事前承認条件は、登録された話者の行動パターンに基づいても決定される。登録された話者の行動パターンは、登録された話者の携帯装置の位置を基にも生成される。例えば、検証いかん決定部２１７は、登録された話者の携帯装置の位置を、長時間の間収集することができる。検証いかん決定部２１７は、収集された携帯装置の位置を分析し、登録された話者が、音声認識スピーカ装置１００の近くに位置しない時間帯を決定することができる。検証いかん決定部２１７は、現在時間が、この時間帯に該当する場合、話者検証部２１４の動作を実行するように自動的に決定することができる。 The pre-approval condition may be set in advance by the registered speaker and stored in the memory 220. The pre-approval condition may also be determined based on the registered speaker's behavior pattern. The registered speaker's behavior pattern may be generated based on the position of the registered speaker's portable device. For example, the verification determination unit 217 can collect the position of the registered speaker's portable device over a long period. The verification determination unit 217 can analyze the collected positions and determine a time period during which the registered speaker is not near the voice recognition speaker device 100. When the current time falls within this time period, the verification determination unit 217 can automatically determine to execute the operation of the speaker verification unit 214.
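
The position-based and time-based conditions can be sketched together. The representation is an assumption of this sketch: collected positions are reduced to `(hour, is_near)` samples, an hour counts as an "away" period when the device was near the speaker device in at most 20% of samples, and verification runs when the device is not nearby or the current hour is an away hour. The names `away_hours`, `should_verify`, and `max_near_ratio` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def away_hours(samples: Iterable[Tuple[int, bool]],
               max_near_ratio: float = 0.2) -> Set[int]:
    """Derive the hours of day in which the speaker is usually not nearby.

    Each sample is (hour_of_day, device_was_near_speaker_device).
    """
    counts: Dict[int, List[int]] = defaultdict(lambda: [0, 0])  # hour -> [near, total]
    for hour, near in samples:
        counts[hour][1] += 1
        if near:
            counts[hour][0] += 1
    return {hour for hour, (near, total) in counts.items()
            if near / total <= max_near_ratio}


def should_verify(device_is_near: bool, current_hour: int, away: Set[int]) -> bool:
    """Run speaker verification if the device is not nearby or the current
    hour falls in a period when the speaker is usually away."""
    return (not device_is_near) or (current_hour in away)
```

For example, if nine of ten samples collected at 13:00 show the device away from home, hour 13 becomes an away hour and a command received at that time is sent for verification even when the device happens to be nearby.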

図５Ａ及び図５Ｂは、一実施形態による音声認識システムの話者検証方法について説明するための例示的なフローチャートである。 5A and 5B are exemplary flowcharts for explaining a speaker verification method of a speech recognition system according to an embodiment.

図５Ａ及び図５Ｂを参照すると、音声認識システムは、音声認識スピーカ装置１００及び音声認識サーバ２００を含む。第２話者の携帯装置３００は、ネットワークを介して音声認識サーバ２００に接続される。 Referring to FIGS. 5A and 5B, the voice recognition system includes a voice recognition speaker device 100 and a voice recognition server 200. The second speaker's portable device 300 is connected to the voice recognition server 200 via a network.

音声認識スピーカ装置１００は、マイクロフォン１２０（図２Ａ）を利用し、周辺の音を電気的に変換し、オーディオ信号を生成することができる（Ｓ１０１）。 The voice recognition speaker device 100 can use the microphone 120 (FIG. 2A) to electrically convert surrounding sounds and generate an audio signal (S101).

音声認識スピーカ装置１００は、オーディオ信号から音声信号を検出することができる（Ｓ１０２）。かような音声信号は、ユーザの音声を含み得る。ここで、ユーザを第１話者とする。かような音声は、ユーザの音声命令を含み得る。かような音声命令には、音声情報検索、電話かけ、メッセージ送信、金融取り引き、インターネットショッピング、食べ物配達、周辺家電機器制御、スマートホーム制御などが含まれてよい。本例においては、音声命令が金融取り引きに係わるものであり、第１話者の音声が、「Ｂに１００万ウォンを送金せよ」であると仮定する。第１話者の音声には、音声認識スピーカ装置１００をウェークアップさせるためのトリガキーワードが含まれてもよい。音声認識スピーカ装置１００は、トリガキーワードを認識することにより、オーディオ信号から音声信号を検出することができる。 The voice recognition speaker device 100 can detect the voice signal from the audio signal (S102). Such an audio signal may include the user's voice. Here, the user is the first speaker. Such voice may include user voice commands. Such voice commands may include voice information retrieval, telephone call, message transmission, financial transaction, Internet shopping, food delivery, peripheral home appliance control, smart home control, and the like. In this example, it is assumed that the voice command is related to a financial transaction, and the voice of the first speaker is “send 1 million won to B”. The voice of the first speaker may include a trigger keyword for causing the voice recognition speaker device 100 to wake up. The voice recognition speaker device 100 can detect the voice signal from the audio signal by recognizing the trigger keyword.

音声認識スピーカ装置１００は、音声信号を音声認識サーバ２００に送信し（Ｓ１０３）、音声認識サーバ２００は、音声認識スピーカ装置１００から音声信号を受信する（Ｓ２０１）。 The voice recognition speaker device 100 transmits a voice signal to the voice recognition server 200 (S103), and the voice recognition server 200 receives the voice signal from the voice recognition speaker device 100 (S201).

音声認識サーバ２００は、音声信号に対して音声認識を実行し、音声認識結果を生成する（Ｓ２０２）。音声認識サーバ２００は、音声信号の周波数特性を抽出し、音響モデル及び言語モデルを利用し、音声認識を実行することができる。音声認識サーバ２００は、音声信号を文字列に変換し、文字列を自然言語処理することにより、音声認識結果を生成することができる。かような音声認識結果は、音声命令を含み得る。 The voice recognition server 200 performs voice recognition on the voice signal and generates a voice recognition result (S202). The speech recognition server 200 can extract speech signal frequency characteristics and execute speech recognition using an acoustic model and a language model. The voice recognition server 200 can generate a voice recognition result by converting a voice signal into a character string and subjecting the character string to natural language processing. Such a speech recognition result may include a speech command.

音声認識サーバ２００は、音声信号から、話者特徴ベクトルを抽出する（Ｓ２０３）。音声認識サーバ２００は、音響モデルから抽出された事後情報、一般的背景モデル及び全体変異性変換情報のうち少なくとも一つを利用し、音声信号の周波数特性から話者特徴ベクトルを生成することができる。 The speech recognition server 200 extracts speaker feature vectors from the speech signal (S203). The speech recognition server 200 can generate a speaker feature vector from the frequency characteristics of the speech signal by using at least one of the posterior information extracted from the acoustic model, the general background model, and the global variability conversion information. .

音声認識サーバ２００は、抽出された話者特徴ベクトルと、登録された話者特徴ベクトルと、を比較する（Ｓ２０４）。登録された話者特徴ベクトルは、メモリ２２０（図３）にも保存され、ユーザが音声認識システムに登録するとき、入力されるユーザの音声を基に事前に生成される。 The speech recognition server 200 compares the extracted speaker feature vector with the registered speaker feature vector (S204). The registered speaker feature vectors are also stored in the memory 220 (FIG. 3), and are generated in advance based on the input user's voice when the user registers in the voice recognition system.

音声認識サーバ２００は、Ｓ２０４段階の比較の結果により、音声信号の話者が登録された第２話者であると決定する（Ｓ２０５）。Ｓ２０４段階において、音声認識サーバ２００は、抽出された話者特徴ベクトルを、登録された話者特徴ベクトルそれぞれと比較することができる。この比較の結果、登録された話者特徴ベクトルのうち、抽出された話者特徴ベクトルと最も類似度が高い登録された話者特徴ベクトルが決定される。音声認識サーバ２００は、最も類似度が高い登録された話者特徴ベクトルのユーザが、音声信号の話者であると決定し、ここで、最も類似度が高い登録された話者特徴ベクトルに対応するユーザを第２話者とする。かような第２話者は、一般的に第１話者と同一である。しかしながら、音声認識サーバ２００の話者認識機能の誤謬により、第２話者は、第１話者と異なることもある。 The voice recognition server 200 determines, according to the result of the comparison in step S204, that the speaker of the voice signal is the registered second speaker (S205). In step S204, the voice recognition server 200 can compare the extracted speaker feature vector with each of the registered speaker feature vectors. As a result of this comparison, the registered speaker feature vector with the highest similarity to the extracted speaker feature vector is determined. The voice recognition server 200 determines that the user of the registered speaker feature vector with the highest similarity is the speaker of the voice signal; the user corresponding to that registered speaker feature vector is referred to here as the second speaker. The second speaker is usually the same person as the first speaker. However, because of errors in the speaker recognition function of the voice recognition server 200, the second speaker may differ from the first speaker.

例えば、第１話者が音声命令を発話したが、音声認識サーバ２００は、話者認識機能の誤謬により、この音声命令を、第１話者と異なる第２話者が発話したと認識する。その場合、音声認識サーバ２００は、第２話者が、「Ｂに１００万ウォン送金せよ」と発話したと認識することになり、音声認識サーバ２００は、第２話者の口座からＢに１００万ウォンを送金する問題が発生する。 For example, the first speaker utters a voice command, but because of an error in its speaker recognition function the voice recognition server 200 recognizes the command as having been uttered by a second speaker who is different from the first speaker. In that case, the voice recognition server 200 recognizes that the second speaker uttered “Send 1 million won to B”, and the problem arises that the voice recognition server 200 remits 1 million won to B from the second speaker's account.

As another example, the first speaker utters a voice command while imitating the voice of the second speaker, and the speech recognition server 200 recognizes the command as having been uttered by the second speaker. This is a case of voice theft. In this case too, the speech recognition server 200 recognizes that the second speaker uttered “Remit 1 million won to B”, and the problem arises that the speech recognition server 200 remits 1 million won to B from the second speaker's account.

To resolve such problems, the speech recognition server 200 can execute an additional speaker verification procedure. The speech recognition server 200 can transmit the speech recognition result and the speech signal to the portable device 300 of the second speaker. When the second speaker registers with the speech recognition system as a user of the speech recognition speaker device 100, the second speaker can enter the identification number of his or her own portable device 300, and that identification number is stored in the memory 220.

Meanwhile, the speech recognition server 200 can determine, based on the result of the comparison in step S204, that the speaker of the speech signal is not a pre-registered user. If the similarity between the extracted speaker feature vector and the registered speaker feature vector does not exceed a preset reference similarity, the speech recognition server 200 determines that the speaker of the speech signal is not registered in advance and does not execute the operation indicated by the voice command. In that case, the speech recognition server 200 can proceed immediately to step S209 and generate a synthesized sound reporting that the operation was not executed. Even so, if the voice command is one that has been set to be available to anyone, the speech recognition server 200 can execute the operation indicated by the voice command even though the speaker of the speech signal is not registered in advance. In that case, the speech recognition server 200 can skip the speaker verification procedure (S206, S301-S305, S207) and proceed immediately to step S208 to execute the operation. According to one embodiment, when the operation indicated by the voice command is available to anyone, the speech recognition server 200 does not execute steps S203 to S205.
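
The branching described in this paragraph can be sketched as follows. This is only an illustrative Python sketch: the threshold value, the `Route` names, and the `open_to_anyone` flag are hypothetical and are not taken from the specification.

```python
from enum import Enum


class Route(Enum):
    EXECUTE = "execute without verification (S208)"
    VERIFY = "run the verification procedure (S206 onward)"
    REFUSE = "do not execute; report by synthesized sound (S209)"


# Hypothetical preset reference similarity.
REFERENCE_SIMILARITY = 0.7


def route_command(best_similarity, open_to_anyone):
    """Choose the next step for a recognized voice command."""
    if open_to_anyone:
        # Commands available to anyone skip speaker verification entirely.
        return Route.EXECUTE
    if best_similarity <= REFERENCE_SIMILARITY:
        # Similarity does not exceed the reference: treat as unregistered.
        return Route.REFUSE
    return Route.VERIFY
```

Checking `open_to_anyone` first mirrors the embodiment in which steps S203 to S205 are not executed at all for such commands.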

The portable device 300 of the second speaker receives the speech recognition result and the speech signal from the speech recognition server 200 (S301). As described above, the second speaker may or may not be the same person as the first speaker who actually uttered the speech contained in the speech signal.

The portable device 300 can display the speech recognition result (S302). FIG. 6 illustrates an exemplary screen of the second speaker's portable device connected to the speech recognition system, according to one embodiment.

As illustrated in FIG. 6, the text 301 “The following command has been executed by the speech recognition speaker device” is displayed on the display window 310 of the portable device 300. The received speech recognition result, or the voice command contained in it, is displayed in an area 302 below the text 301. In this example, the area 302 displays the text “Remit 1 million won to B”.

The display window 310 displays a play button 303 for playing back the speech signal detected by the speech recognition speaker device 100. When the second speaker touches the play button 303, the speech signal received from the speech recognition server 200 is played back.

The display window 310 displays the text 304 “Do you approve?”. An approval button 305 and a rejection button 306 are displayed below the text 304. After checking the received speech recognition result or the played-back speech, the second speaker can touch the approval button 305 to approve the speech recognition server 200 executing the operation indicated by the voice command in the speech recognition result. If the second speaker did not utter the command, or if the received speech recognition result differs from the voice command he or she uttered, the second speaker can touch the rejection button 306 so that the speech recognition server 200 does not execute the operation corresponding to the speech recognition result. When the speaker recognition function of the speech recognition server 200 has erred, or when the second speaker's voice has been stolen, the second speaker can prevent harm by using the portable device 300 to stop the operation corresponding to the speech recognition result from being executed.

The display window 310 displays a report button 307. If, after touching the play button 303 and listening to the played-back speech, the second speaker judges that his or her voice has been stolen, the second speaker can touch the report button 307 to report the voice theft to an external organization such as the relevant financial company, a government office, or the manufacturer of the speech recognition system. In that case, the portable device 300 can transmit the speech signal received from the speech recognition server 200 to the external organization together with information on the voice theft.

The display window 310 displays a voice feedback button 308. When the second speaker touches the voice feedback button 308, the second speaker can input speech to be output from the speech recognition speaker device 100. After touching the voice feedback button 308, the second speaker can utter a response voice addressed to, for example, the person who stole the voice. The portable device 300 can generate a response voice signal corresponding to the response voice and transmit it to the speech recognition server 200. The speech recognition server 200 can receive the response voice signal and forward it to the speech recognition speaker device 100. The speech recognition speaker device 100 can receive the response voice signal transmitted from the speech recognition server 200 and output the second speaker's response voice corresponding to that signal.

According to another embodiment, the display window 310 displays a voice communication button, and when the second speaker touches it, a session is established through which voice can be exchanged between the speech recognition speaker device 100 and the portable device 300. Through the established session, the second speaker can talk with a person located near the speech recognition speaker device 100. For example, when the first speaker is a family member of the second speaker, the first speaker can explain to the second speaker why the operation indicated by the speech recognition result is needed, and the second speaker can approve the operation after hearing the explanation. The portable device 300 may be connected to the speech recognition speaker device 100 directly, or via relay by the speech recognition server 200.

Referring again to FIGS. 5A and 5B, the portable device 300 can display the speech recognition result in the area 302 of the display window 310 (S302). The portable device 300 can also play back the speech signal when the second speaker touches the play button 303 (S303).

The portable device 300 can receive an approval input when the second speaker touches the approval button 305, or a rejection input when the second speaker touches the rejection button 306 (S304). The portable device 300 can transmit the approval input or the rejection input to the speech recognition server 200 (S305). If no approval input is received within a preset time, the portable device 300 can regard a rejection input as having been received and transmit a rejection input to the speech recognition server 200.
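
The timeout rule of step S304 (no approval within a preset time counts as a rejection) can be sketched as follows. This is a minimal Python sketch; modelling button touches as items on a `queue.Queue`, and the names `APPROVE`, `REJECT`, and `wait_for_decision`, are assumptions made for illustration.

```python
import queue

APPROVE = "approve"
REJECT = "reject"


def wait_for_decision(button_events, timeout_seconds):
    """Return the first button decision, or REJECT if none arrives in time.

    `button_events` is a queue.Queue onto which a hypothetical UI layer puts
    APPROVE or REJECT when the corresponding button is touched.
    """
    try:
        return button_events.get(timeout=timeout_seconds)
    except queue.Empty:
        # No input within the preset time is regarded as a rejection.
        return REJECT
```

Defaulting to rejection keeps the procedure safe when the second speaker never sees the request, at the cost of refusing some legitimate commands.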

The speech recognition server 200 can receive the approval input or the rejection input from the portable device 300 (S207). When it receives the approval input, the speech recognition server 200 executes the operation corresponding to the speech recognition result; when it receives the rejection input, it does not execute that operation (S208). In this example, the speech recognition server 200 can remit 1 million won to B from the second speaker's account when it receives the approval input, and does not remit 1 million won to B when it receives the rejection input.

After executing the operation corresponding to the speech recognition result, the speech recognition server 200 generates a synthesized sound signal for reporting the result of the operation (S209). Such a synthesized sound signal corresponds, for example, to the synthesized sound “1 million won has been remitted to B from the second speaker's account”. When the speech recognition server 200 does not execute the operation corresponding to the speech recognition result, it generates a synthesized sound signal for reporting that the operation was not executed (S209). Such a synthesized sound signal corresponds, for example, to the synthesized sound “1 million won was not remitted to B because the second speaker did not approve”. The speech recognition server 200 can transmit the generated synthesized sound signal to the speech recognition speaker device 100 (S210).
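
Steps S207 to S209 can be summarized in a short sketch: execute only on approval, then build the text to be synthesized. The function name, the string values for the decision, and the message wording below are hypothetical; actual speech synthesis is outside the scope of the sketch.

```python
def handle_decision(decision, command_text, execute):
    """Run `execute` only on approval and return the report text (S208, S209).

    `decision` is "approve" or "reject" (assumed encoding of the input
    received from the portable device); `execute` is a callable that
    performs the operation indicated by the voice command.
    """
    if decision == "approve":
        execute()
        return f"Executed: {command_text}"
    return f"Not executed because the second speaker did not approve: {command_text}"
```

For example, `handle_decision("reject", "Remit 1 million won to B", transfer)` would leave the hypothetical `transfer` callable uncalled and return the non-execution message.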

The speech recognition speaker device 100 can receive the synthesized sound signal (S104) and play back the synthesized sound corresponding to it (S105). The first speaker, who uttered the speech in the speech signal, can therefore directly confirm the result of executing his or her voice command.

According to another embodiment, when the speech recognition speaker device 100 includes a camera 150 (FIG. 2B), the speech recognition speaker device 100 can, in step S102, use the camera 150 to generate a video signal containing video from the time at which the speech signal was detected. In step S103, the speech recognition speaker device 100 can transmit the video signal together with the speech signal to the speech recognition server 200.

In step S201, the speech recognition server 200 can receive the video signal from the speech recognition speaker device 100, and in step S206 it can transmit the video signal, together with the speech recognition result and the speech signal, to the portable device 300 of the second speaker.

In that case, the portable device 300 can have an interface for displaying the video signal, for example a video display button. When the second speaker touches the video display button, the portable device 300 displays the video of the video signal, so that a second speaker who has received a speech recognition result for speech he or she did not utter can check video showing the person who uttered it. The video signal may include video of the direction from which the first speaker's voice came. The video signal may also be a moving-image signal. The portable device 300 can have an interface for playing back the moving-image signal, for example a video playback button. When the second speaker touches the video playback button, the portable device 300 can play back the moving image contained in the moving-image signal.

According to another embodiment, when the speech recognition server 200 receives an approval input from the portable device 300 in step S207, the identity of the first speaker and the second speaker has been confirmed, so the speech recognition server 200 can use the extracted speaker feature vector to improve the registered speaker feature vector of the second speaker.
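
The description does not say how the registered vector is improved. One possible approach, shown here purely as an assumption, is to blend the confirmed extracted vector into the registered one with a small weight (an exponential moving average). The function name and the default weight are hypothetical.

```python
def refine_registered_vector(registered, extracted, weight=0.1):
    """Blend a confirmed extracted vector into the registered vector.

    Assumed update rule (not from the specification):
        new = (1 - weight) * registered + weight * extracted
    """
    if len(registered) != len(extracted):
        raise ValueError("feature vectors must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return [(1.0 - weight) * r + weight * e for r, e in zip(registered, extracted)]
```

Applying such an update only after an approval input means that utterances rejected as voice theft never influence the registered vector.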

FIG. 7 is an exemplary flowchart for explaining a speaker verification method of a speech recognition system according to another embodiment.

Referring to FIG. 7, in the speaker verification method according to another embodiment, after step S205 of FIG. 5 is executed, the speech recognition server 200 can decide, based on at least one of the operation corresponding to the speech recognition result and the settings of the second speaker, whether to execute the verification procedure (S206, S301-S305, S207) that uses the second speaker's portable device 300 (S211).

If it decides in step S211 to execute the verification procedure, the speech recognition server 200 can proceed to step S206 and transmit the speech recognition result and the speech signal. If it decides in step S211 not to execute the verification procedure, the speech recognition server 200 can omit the verification procedure, proceed to step S208, and execute the operation corresponding to the speech recognition result.

According to one example, when the operation indicated by the speech recognition result is included in a pre-approval operation list set in advance by the second speaker, that is, a list of operations that require the second speaker's approval before execution, the speech recognition server 200 can decide in step S211 to execute the verification procedure (S206, S301-S305, S207). The pre-approval operation list can be stored in the memory 220 (FIG. 3). The list contains those operations, among the operations executable via the speech recognition speaker device 100, that the second speaker has selected in advance. For example, the second speaker can select operations such as financial transactions, Internet shopping, and message sending and include them in the list.

According to another example, in step S211 the speech recognition server 200 can decide to execute the verification procedure (S206, S301-S305, S207) when at least one of the position of the second speaker's portable device 300 and the current time matches a pre-approval condition. For example, when the second speaker's portable device 300 is located far from the speech recognition speaker device 100, it is unlikely that the second speaker actually uttered the speech corresponding to the speech signal, so the speech recognition server 200 can decide in step S211 to execute the verification procedure (S206, S301-S305, S207). However, when the second speaker's portable device 300 is located near the speech recognition speaker device 100, the speech recognition server 200 can omit the verification procedure (S206, S301-S305, S207) in step S211. Besides the position of the portable device 300, the pre-approval condition can be determined based on a time period set by the second speaker.

For example, in step S211 the speech recognition server 200 can decide to execute the verification procedure (S206, S301-S305, S207) when the current time falls within a time period set by the second speaker, for example weekday daytime. The second speaker can, for instance, set in advance the time periods during which he or she expects not to be at home. The pre-approval condition can be set differently for each registered speaker.
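
The examples given for step S211 (operation list, device position, time period) can be combined into one decision function, as in the Python sketch below. Combining them with a logical OR, the distance threshold, and all names (`SpeakerSettings`, `needs_verification`, and so on) are assumptions for illustration; the description presents the criteria as separate examples, and this sketch ignores the day of the week.

```python
from dataclasses import dataclass, field
from datetime import time


@dataclass
class SpeakerSettings:
    """Hypothetical per-speaker settings, as might be stored in the memory 220."""
    pre_approval_operations: set = field(default_factory=set)
    away_distance_m: float = 100.0
    # (start, end) times of day during which the speaker expects to be away.
    away_windows: list = field(default_factory=list)


def needs_verification(operation, device_distance_m, now, settings):
    """Decide whether to run the verification procedure (sketch of S211)."""
    if operation in settings.pre_approval_operations:
        return True
    if device_distance_m > settings.away_distance_m:
        # Portable device is far from the speaker device.
        return True
    return any(start <= now < end for start, end in settings.away_windows)


settings = SpeakerSettings(
    pre_approval_operations={"remit"},
    away_windows=[(time(9, 0), time(18, 0))],
)
```

With these settings, a request to play music at 20:00 from a device one metre away would skip verification, while any remittance request would trigger it.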

The pre-approval condition can be set in advance by a registered speaker and stored in the memory 220. The pre-approval condition can also be determined from a behavior pattern of the registered speaker that is generated from the positions of the registered speaker's portable device 300. The speech recognition server 200 can collect the positions of the registered speaker's portable device 300. By analyzing the collected positions, the speech recognition server 200 can determine the time periods during which the registered speaker is located far from the speech recognition speaker device 100. When the current time falls within such a time period, the speech recognition server 200 can automatically decide to execute the verification procedure (S206, S301-S305, S207).
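
One simple way to derive such time periods from collected positions is to count, per hour of day, how often the portable device was far from the speaker device. The following Python sketch assumes that positions have already been reduced to (hour, distance) samples; the thresholds and the function name are hypothetical, and the specification does not prescribe this analysis.

```python
from collections import defaultdict


def away_hours(samples, away_distance_m=100.0, min_fraction=0.8):
    """Return the sorted hours of day in which the device is usually far away.

    `samples` is an iterable of (hour_of_day, distance_m) pairs. An hour is
    reported when at least `min_fraction` of its samples exceed
    `away_distance_m`.
    """
    totals = defaultdict(int)
    away = defaultdict(int)
    for hour, distance_m in samples:
        totals[hour] += 1
        if distance_m > away_distance_m:
            away[hour] += 1
    return sorted(h for h in totals if away[h] / totals[h] >= min_fraction)


# Hypothetical history: away at 09:00, mostly at home at 20:00.
history = [(9, 5000.0), (9, 4000.0), (20, 3.0), (20, 5000.0), (20, 2.0)]
usual_away = away_hours(history)
```

Hours returned by such an analysis could then serve as the time periods in which verification is required automatically.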

According to yet another example, when the operation indicated by the speech recognition result is included in a post-notification operation list set in advance by the second speaker, the speech recognition server 200 can decide to execute step S209 first and to execute the post-notification procedure (S206, S301-S303) afterwards. The post-notification operation list is stored in the memory 220 and contains those operations, among the operations that the speech recognition speaker device 100 can execute, that the second speaker has selected. For example, operations such as making a phone call or changing settings can be included in the post-notification operation list. The operations included in the list can be set differently for each registered speaker.
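
The difference between post-notification and prior verification is one of ordering, which the following Python sketch illustrates. The callables `execute`, `notify`, and `verify` are hypothetical stand-ins for executing the operation, sending the recognition result and speech signal to the portable device, and running the approval procedure; the sketch follows the ordering stated in the claims (operation first, notification afterwards).

```python
def run_command(operation, post_notification_operations, execute, notify, verify):
    """Execute an operation either with post-notification or after approval.

    Returns True if the operation was executed.
    """
    if operation in post_notification_operations:
        execute()   # run the operation first
        notify()    # then inform the second speaker's portable device
        return True
    if verify():    # otherwise ask the second speaker for approval first
        execute()
        return True
    return False
```

Post-notification suits low-risk operations: the second speaker still learns what was done, but the first speaker does not have to wait for approval.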

The embodiments of the present invention described above can be implemented in the form of a computer program executed on a computer through various components, and such a computer program can be recorded on a computer-readable medium. The medium may store a computer-executable program persistently, or may store it temporarily for execution or download. The medium may be any of various recording or storage means in the form of a single piece of hardware or several pieces of hardware combined; it is not limited to a medium directly connected to a particular computer system and may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy (registered trademark) disks, and magnetic tape; optical recording media such as CD-ROM (compact disc read only memory) and DVD (digital versatile disc); magneto-optical media such as floptical disks; and ROM, RAM, flash memory, and the like, configured to store program instructions. Other examples of the medium include recording media managed by an application store that distributes applications, or by a site or server that supplies or distributes various other software.

In this specification, a “unit”, a “module”, and the like may be a hardware component such as a processor or a circuit, and/or a software component executed by a hardware component such as a processor. For example, a “unit” or “module” may be implemented by components such as software components, object-oriented software components, class components, and task components, and by processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables.

The foregoing description of the present invention is given by way of illustration, and a person skilled in the art to which the present invention pertains will understand that the invention can readily be modified into other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present invention is indicated by the claims rather than by the foregoing detailed description, and all changes and modifications derived from the meaning and scope of the claims and from their equivalents should be construed as falling within the scope of the present invention.

The speaker verification method and the speech recognition system according to the embodiments of the present invention can be applied effectively to, for example, security-related technical fields.

100, 100a Speech recognition speaker device
110 Processor
120 Microphone
130 Speaker
140 Communication module
150 Camera
200 Speech recognition server
210, 210a Processor
211 Speech signal receiving unit
212 Speech recognition unit
213 Speaker recognition unit
213a Speaker feature vector extraction unit
213b Speaker feature vector comparison unit
213c Registered speaker determination unit
214 Speaker authentication unit
215 Function unit
216 Synthesized sound signal generation unit
217 Verification determination unit
220 Memory
230 Communication module
300 Portable device
400 Network

Claims

A speaker verification method in a voice recognition system including a voice recognition device and a voice recognition server, the method comprising:
the voice recognition server receiving a voice signal including the voice of a first speaker from the voice recognition device;
the voice recognition server performing voice recognition on the voice signal and generating a voice recognition result;
the voice recognition server extracting a speaker feature vector from the voice signal, comparing the speaker feature vector with a registered speaker feature vector, and determining, according to the comparison result, that the speaker of the voice signal is a registered second speaker;
the voice recognition server transmitting the voice recognition result and the voice signal to a portable device of the second speaker;
the voice recognition server receiving an approval input from the portable device; and
the voice recognition server performing an operation corresponding to the voice recognition result.

The speaker verification method according to claim 1, further comprising:
the voice recognition server determining whether to execute a verification procedure using the portable device, based on at least one of the operation corresponding to the voice recognition result and a setting of the second speaker,
wherein the verification procedure comprises:
the voice recognition server transmitting the voice recognition result and the voice signal to the portable device; and
the voice recognition server receiving an approval input from the portable device.

The speaker verification method according to claim 2, wherein determining whether to execute the verification procedure using the portable device comprises the voice recognition server determining to execute the verification procedure when at least one of the position of the portable device and the current time matches a pre-approval condition.

The speaker verification method according to claim 3, wherein the pre-approval condition is set in advance by the second speaker or is determined based on an action pattern of the second speaker.

The speaker verification method according to claim 2, wherein determining whether to execute the verification procedure using the portable device comprises determining to execute the verification procedure when the operation based on the voice recognition result is included in a pre-approval operation list set in advance by the second speaker.

The speaker verification method according to claim 2, further comprising: when the operation based on the voice recognition result is included in a post-notification operation list set in advance by the second speaker, the voice recognition server transmitting the voice recognition result and the voice signal to the portable device after executing the operation corresponding to the voice recognition result.

The speaker verification method according to claim 1, further comprising:
the portable device receiving the voice recognition result and the voice signal from the voice recognition server;
the portable device displaying the voice recognition result;
the portable device playing back the voice signal;
the portable device receiving an approval input or a rejection input of the second speaker for approving or rejecting execution of the operation corresponding to the voice recognition result; and
the portable device transmitting the approval input or the rejection input to the voice recognition server.

The speaker verification method according to claim 1, further comprising:
the voice recognition server receiving a rejection input from the portable device; and
the voice recognition server not performing the operation corresponding to the voice recognition result.

The speaker verification method according to claim 8, further comprising:
the voice recognition server receiving a response voice signal including a response voice of the second speaker from the portable device; and
the voice recognition server transmitting the response voice signal to the voice recognition device such that the voice recognition device reproduces the response voice of the second speaker.

The speaker verification method according to any one of claims 1 to 9, further comprising a step in which, when an approval input is received from the portable device, the speech recognition server improves the registered speaker feature vector of the second speaker using the speaker feature vector extracted from the speech signal.
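
One simple way to improve a registered speaker feature vector with a newly approved sample is an exponential moving average followed by renormalization. This is an assumption for illustration: the claim does not say how the update is performed, and the `update_registered_vector` function and its `weight` parameter are hypothetical.

```python
import math
from typing import List


def update_registered_vector(
    registered: List[float], new_sample: List[float], weight: float = 0.1
) -> List[float]:
    """Blend an approved sample into the registered vector (hypothetical rule).

    `weight` is the share given to the new sample; the result is scaled to
    unit length so that later similarity comparisons stay on the same scale.
    """
    if len(registered) != len(new_sample):
        raise ValueError("feature vectors must have the same dimension")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    blended = [(1.0 - weight) * r + weight * s for r, s in zip(registered, new_sample)]
    norm = math.sqrt(sum(x * x for x in blended))
    if norm == 0.0:
        return blended  # degenerate case: nothing to normalize
    return [x / norm for x in blended]
```

Because the update is applied only after the second speaker approves the request on the portable device, samples from other speakers would not be blended into the registered vector under this scheme.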

The speaker verification method according to claim 1, further comprising:
the voice recognition device generating, using a camera, a video signal including video captured at the time the voice signal is generated, and transmitting the video signal to the voice recognition server; and
the voice recognition server transmitting the video signal together with the voice recognition result and the voice signal to the portable device.

The speaker verification method according to claim 11, wherein the video signal includes a video in a direction in which the voice of the first speaker is generated.

The speaker verification method according to claim 1, wherein the registered second speaker is one of a plurality of users registered in the voice recognition system.

The speaker verification method according to any one of claims 1 to 13, further comprising a step in which the speech recognition server does not perform an operation corresponding to the speech recognition result when the speech recognition server determines, based on the comparison result, that the speaker of the speech signal is an unregistered user.
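
The comparison between an extracted speaker feature vector and the registered vectors, including the unregistered-user case, can be sketched as below. Cosine similarity and a fixed threshold are assumptions chosen for illustration; the claims only state that the vectors are compared. The names `identify_speaker` and `threshold` are hypothetical.

```python
import math
from typing import Dict, List, Optional


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_speaker(
    vector: List[float],
    registered: Dict[str, List[float]],
    threshold: float = 0.7,
) -> Optional[str]:
    """Return the best-matching registered user, or None if nobody reaches the threshold."""
    best_user: Optional[str] = None
    best_score = threshold
    for user, reference in registered.items():
        score = cosine_similarity(vector, reference)
        if score >= best_score:
            best_user, best_score = user, score
    return best_user
```

A `None` result corresponds to the unregistered-user case, in which the server would not perform the operation.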

A program that causes a processor of a speech recognition server of a speech recognition system to execute the speaker verification method according to any one of claims 1 to 6.

A voice recognition server including:
a communication module for communicating with a voice recognition device and a portable device; and
a processor configured to receive, using the communication module, a voice signal including the voice of a first speaker from the voice recognition device, perform voice recognition on the voice signal to generate a voice recognition result, extract a speaker feature vector from the voice signal, compare the speaker feature vector with a registered speaker feature vector, determine based on the comparison result that the speaker of the voice signal is a registered second speaker, transmit, using the communication module, the voice recognition result and the voice signal to the portable device of the second speaker, receive, using the communication module, an approval input from the portable device, and perform an operation corresponding to the voice recognition result.

The voice recognition server according to claim 16, wherein the processor is configured to:
receive a response voice signal including a response voice of the second speaker together with a rejection input from the portable device; and
transmit the response voice signal to the voice recognition device using the communication module such that the voice recognition device reproduces the response voice of the second speaker.

A voice recognition device including:
a communication module for communicating with the voice recognition server according to claim 16 or 17;
a microphone that generates an audio signal;
a processor configured to detect a voice signal including the voice of the first speaker from the audio signal, transmit the voice signal to the voice recognition server, and receive a synthesized sound signal from the voice recognition server; and
a speaker for reproducing a synthesized sound corresponding to the synthesized sound signal.

A speech recognition system including a speech recognition server and a speech recognition device, wherein
the voice recognition device includes a first communication module that communicates with the voice recognition server, a microphone that generates an audio signal, a first processor that detects a voice signal from the audio signal, transmits the voice signal to the voice recognition server, and receives a synthesized sound signal from the voice recognition server, and a speaker that reproduces a synthesized sound corresponding to the synthesized sound signal,
the voice recognition server includes a second processor and a second communication module that communicates with the voice recognition device and a portable device, and
the second processor is configured to:
receive the voice signal including the voice of a first speaker from the voice recognition device;
perform voice recognition on the voice signal to generate a voice recognition result;
extract a speaker feature vector from the voice signal, compare the speaker feature vector with a registered speaker feature vector, and determine, based on a result of the comparison, that the speaker of the voice signal is a registered second speaker;
transmit the voice recognition result and the voice signal to the portable device of the second speaker; and
perform an operation corresponding to the voice recognition result when an approval input is received from the portable device.
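
The server-side sequence recited in the system claim can be summarized in a short Python sketch. All names here (`verify_and_execute`, `recognize`, `identify`, `ask_device`, `execute`, `VerificationOutcome`) are hypothetical placeholders passed in as callables; the sketch shows only the order of the steps, not how recognition, feature extraction, or device communication are implemented. Returning early for an unidentified speaker follows the unregistered-user claim of the method and is included here as an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VerificationOutcome:
    speaker: Optional[str]  # registered speaker the voice was attributed to, if any
    executed: bool          # whether the operation was performed


def verify_and_execute(
    voice_signal: bytes,
    recognize: Callable[[bytes], str],
    identify: Callable[[bytes], Optional[str]],
    ask_device: Callable[[str, str, bytes], bool],
    execute: Callable[[str], None],
) -> VerificationOutcome:
    result = recognize(voice_signal)   # generate the voice recognition result
    speaker = identify(voice_signal)   # extract and compare the feature vector
    if speaker is None:
        # No registered speaker matched: do not perform the operation.
        return VerificationOutcome(None, False)
    # Send the result and the voice signal to that speaker's portable device
    # and wait for an approval (True) or rejection (False) input.
    approved = ask_device(speaker, result, voice_signal)
    if approved:
        execute(result)
    return VerificationOutcome(speaker, approved)
```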