JP2019133182A

JP2019133182A - Speech control apparatus, speech control method, computer program, and recording medium

Info

Publication number: JP2019133182A
Application number: JP2019071410A
Authority: JP
Inventors: 丙烈金; Byeong Yeol Kim; 益 ▲祥▼ 韓; Ick Sang Han; 五赫權; Oh Hyeok Kwon; 奉眞李; Bong Jin Lee; 明祐呉; Myung Woo Oh; ▲みん▼ 碩崔; Min Seok Choi; 燦奎李; Chan Kyu Lee; 貞姫任; Jung Hui Im; 智須崔; Ji Su Choi; 漢容姜; Han Yong Kang
Original assignee: Line Corp; Naver Corp
Current assignee: Z Intermediate Global Corp; Naver Corp
Priority date: 2017-05-19
Filing date: 2019-04-03
Publication date: 2019-08-08
Also published as: JP2022033258A; KR20180127065A; KR101986354B1; JP6510117B2; JP2018194844A

Abstract

To provide a speech control apparatus for preventing false detection of a keyword and a method of operating the same. SOLUTION: The method according to the present disclosure includes the steps of: receiving an audio signal corresponding to a surrounding sound; generating audio stream data; determining a first interval in which a candidate keyword corresponding to a predetermined keyword is detected, from the audio stream data; extracting a first speaker feature vector for identifying a speaker in the first interval and a second speaker feature vector for identifying a speaker in a second interval adjacent to the first interval from the audio stream data; and determining whether or not the predetermined keyword is included in the first interval based on similarity between the first speaker feature vector and the second speaker feature vector. SELECTED DRAWING: Figure 2

Description

本発明は、音声制御装置に関し、さらに詳細には、キーワード誤認識防止が可能な音声制御装置、音声制御装置の動作方法、コンピュータプログラム及び記録媒体等に関する。 The present invention relates to a voice control device, and more particularly to a voice control device capable of preventing erroneous keyword recognition, a method for operating the voice control device, a computer program, a recording medium, and the like.

携帯用通信装置、デスクトップＰＣ（personal computer）、タブレットＰＣ、及びエンターテイメントシステムのようなコンピュータ装置の性能が高度化しつつ、操作性を向上させるために、音声認識機能が搭載され、音声によって制御される電子機器が市場に出回っている。該音声認識機能は、別途のボタン操作、またはタッチモジュールの接触によらず、ユーザの音声を認識することにより、装置を手軽に制御することができる長所を有する。 As the performance of computer devices such as portable communication devices, desktop PCs (personal computers), tablet PCs, and entertainment systems has advanced, electronic devices that are equipped with a voice recognition function and controlled by voice have come onto the market in order to improve operability. The voice recognition function has the advantage that the device can be controlled easily by recognizing the user's voice, without a separate button operation or contact with a touch module.

かような音声認識機能によれば、例えば、スマートフォンのような携帯用通信装置においては、別途のボタンを押す操作なしに、通話機能を遂行したり、文字メッセージを作成したりすることができ、道案内、インターネット検索、アラーム設定等のような多様な機能を手軽に設定することができる。しかし、かような音声制御装置が、ユーザの音声を誤認識すると、不本意な動作を遂行してしまう問題が発生しうる。 According to such a voice recognition function, for example, in a portable communication device such as a smartphone, a call function can be performed or a text message can be created without pressing a separate button. Various functions such as route guidance, Internet search, and alarm setting can be easily set. However, if such a voice control device misrecognizes the user's voice, there may occur a problem of performing an unintentional operation.

韓国特許公開第１０−２０１７−００２８６２８号公報 Korean Patent Publication No. 10-2017-0028628

本発明が解決しようとする課題は、キーワード誤認識を防止することができる音声制御装置、音声制御装置の動作方法、コンピュータプログラム及び記録媒体等を提供することである。 The problem to be solved by the present invention is to provide a voice control device, an operation method of the voice control device, a computer program, a recording medium, and the like that can prevent erroneous keyword recognition.

前述の技術的課題を達成するための技術的手段として、本開示の第１側面は、周辺音に対応するオーディオ信号を受信し、オーディオストリームデータを生成するオーディオ処理部と、前記オーディオストリームデータから、所定のキーワードに対応する候補キーワードを検出し、前記オーディオストリームデータにおいて、前記候補キーワードが検出された第１オーディオデータに該当する第１区間の始点及び終点を決定するキーワード検出部と、前記第１オーディオデータに係わる第１話者特徴ベクトルを抽出し、前記オーディオストリームデータにおいて、前記第１区間の始点を終点にする第２区間に該当する第２オーディオデータに係わる第２話者特徴ベクトルを抽出する話者特徴ベクトル抽出部と、前記第１話者特徴ベクトルと前記第２話者特徴ベクトルとの類似度を基に、前記第１オーディオデータに、前記キーワードが含まれていたか否かを判断するウェークアップ判断部と、を含む音声制御装置を提供することができる。 As technical means for achieving the above technical problem, a first aspect of the present disclosure can provide a voice control device including: an audio processing unit that receives an audio signal corresponding to ambient sound and generates audio stream data; a keyword detection unit that detects, from the audio stream data, a candidate keyword corresponding to a predetermined keyword and determines the start point and end point of a first section of the audio stream data corresponding to first audio data in which the candidate keyword was detected; a speaker feature vector extraction unit that extracts a first speaker feature vector for the first audio data and extracts a second speaker feature vector for second audio data corresponding to a second section of the audio stream data whose end point is the start point of the first section; and a wake-up determination unit that determines, based on the similarity between the first speaker feature vector and the second speaker feature vector, whether the keyword was included in the first audio data.
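
The relationship between the first and second sections can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `Interval` type, the `preceding_interval` helper, and the 1.0-second default length of the second section are assumptions introduced here. The text above only states that the second section ends at the start point of the first section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A section of the audio stream, in seconds from the start of the stream."""
    start: float
    end: float


def preceding_interval(first: Interval, length: float = 1.0) -> Interval:
    """Return the second section for a given first section.

    The second section ends at the start point of the first section and
    reaches back `length` seconds (an assumed value), clamped so that it
    never begins before the stream itself.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    return Interval(start=max(0.0, first.start - length), end=first.start)


first = Interval(start=3.2, end=3.9)   # section where the candidate keyword was detected
second = preceding_interval(first)     # audio immediately before the candidate keyword
```

Clamping at zero covers the case where the candidate keyword appears very early in the stream, so the second section may be shorter than requested.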

また、本開示の第２側面は、周辺音に対応するオーディオ信号を受信し、オーディオストリームデータを生成する段階と、前記オーディオストリームデータから、所定のキーワードに対応する候補キーワードを検出し、前記オーディオストリームデータにおいて、前記候補キーワードが検出された第１オーディオデータに該当する第１区間の始点及び終点を決定する段階と、前記第１オーディオデータに係わる第１話者特徴ベクトルを抽出する段階と、前記オーディオストリームデータにおいて、前記第１区間の始点を終点にする第２区間に該当する第２オーディオデータに係わる第２話者特徴ベクトルを抽出する段階と、前記第１話者特徴ベクトルと前記第２話者特徴ベクトルとの類似度を基に、前記第１オーディオデータに、前記キーワードが含まれていたか否かを判断し、ウェークアップさせるか否かを決定する段階と、を含む音声制御装置の動作方法を提供することができる。 A second aspect of the present disclosure can provide a method of operating a voice control device, the method including: receiving an audio signal corresponding to ambient sound and generating audio stream data; detecting, from the audio stream data, a candidate keyword corresponding to a predetermined keyword and determining the start point and end point of a first section of the audio stream data corresponding to first audio data in which the candidate keyword was detected; extracting a first speaker feature vector for the first audio data; extracting a second speaker feature vector for second audio data corresponding to a second section of the audio stream data whose end point is the start point of the first section; and determining, based on the similarity between the first speaker feature vector and the second speaker feature vector, whether the keyword was included in the first audio data, and thereby deciding whether to wake up.
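
The final step of this method, the similarity-based decision, can be illustrated with a short sketch. Several points are assumptions made for illustration and are not stated in the passage above: cosine similarity as the similarity measure, the threshold value of 0.8, the rule that a high similarity (the same speaker was already talking just before the candidate keyword) suppresses the wake-up, and treating a zero vector as "no similarity". The function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors.

    Returns 0.0 when either vector has zero norm (an assumed convention,
    e.g. for a silent second section).
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def should_wake_up(first_vec: Sequence[float],
                   second_vec: Sequence[float],
                   threshold: float = 0.8) -> bool:
    """Decide whether the candidate keyword should wake the device.

    Assumed rule: if the speaker vectors of the keyword section and the
    section right before it are very similar, the candidate keyword is
    treated as part of ongoing speech and the wake-up is suppressed.
    """
    return cosine_similarity(first_vec, second_vec) <= threshold
```

With these assumptions, `should_wake_up([1.0, 0.0], [0.0, 1.0])` accepts the keyword because the two sections look like different speakers, while two identical vectors lead to rejection.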

また、本開示の第３側面は、音声制御装置のプロセッサに、第２側面による動作方法を実行させる命令語を含むコンピュータプログラムを提供することができる。 The third aspect of the present disclosure can provide a computer program including an instruction word that causes a processor of the voice control device to execute the operation method according to the second aspect.

また、本開示の第４側面は、第３側面によるコンピュータプログラムが記録されたコンピュータで読み取り可能な記録媒体を提供することができる。 The fourth aspect of the present disclosure can provide a computer-readable recording medium in which the computer program according to the third aspect is recorded.

本発明の多様な実施形態によれば、キーワードを誤認識する可能性が低下するので、音声制御装置の誤動作が防止される。 According to various embodiments of the present invention, since the possibility of erroneously recognizing a keyword is reduced, malfunction of the voice control device is prevented.

一実施形態によるネットワーク環境の例を図示した図面である。 A diagram illustrating an example of a network environment according to an embodiment.
一実施形態によって、電子機器及びサーバの内部構成について説明するためのブロック図である。 A block diagram for explaining the internal configuration of an electronic device and a server according to an embodiment.
一実施形態による音声制御装置のプロセッサが含みうる機能ブロックの例を図示した図面である。 A diagram illustrating an example of functional blocks that can be included in the processor of a voice control device according to an embodiment.
一実施形態によって、音声制御装置が遂行することができる動作方法の例を図示したフローチャートである。 A flowchart illustrating an example of an operation method that can be performed by a voice control device according to an embodiment.
他の実施形態によって、音声制御装置が遂行することができる動作方法の例を図示したフローチャートである。 A flowchart illustrating an example of an operation method that can be performed by a voice control device according to another embodiment.
一実施形態による音声制御装置が、図５の動作方法を実行する場合、単独命令キーワードが発話される例を図示する図面である。 A diagram illustrating an example in which a single command keyword is uttered when the voice control device according to an embodiment executes the operation method of FIG. 5.
一実施形態による音声制御装置が、図６の動作方法を実行する場合、一般対話音声が発話される例を図示する図面である。 A diagram illustrating an example in which general conversational speech is uttered when the voice control device according to an embodiment executes the operation method of FIG. 6.
さらに他の実施形態によって、音声制御装置が遂行することができる動作方法の例を図示したフローチャートである。 A flowchart illustrating an example of an operation method that can be performed by a voice control device according to yet another embodiment.
一実施形態による音声制御装置が、図７の動作方法を実行する場合、ウェークアップキーワード及び自然語音声命令が発話される例を図示する図面である。 A diagram illustrating an example in which a wake-up keyword and a natural language voice command are uttered when the voice control device according to an embodiment executes the operation method of FIG. 7.
一実施形態による音声制御装置が、図７の動作方法を実行する場合、一般対話音声が発話される例を図示する図面である。 A diagram illustrating an example in which general conversational speech is uttered when the voice control device according to an embodiment executes the operation method of FIG. 7.

以下、添付した図面を参照し、本発明が属する技術分野において当業者が容易に実施することができるように、本発明の実施形態について詳細に説明する。しかし、本発明は、さまざまに異なる形態に具現化され、ここで説明する実施形態に限定されるものではない。そして、図面において、本発明について明確に説明するために、説明と関係ない部分は省略し、明細書全体を通じて、類似した部分については、類似した図面符号を付した。 Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings so that those skilled in the art can easily implement them in the technical field to which the present invention belongs. However, the present invention may be embodied in various different forms and is not limited to the embodiments described herein. In the drawings, in order to clearly describe the present invention, portions not related to the description are omitted, and similar portions are denoted by similar drawing symbols throughout the specification.

明細書全体において、ある部分が他の部分と「連結」されているとするとき、それは、「直接に連結」されている場合だけではなく、その中間に、他の素子を挟み、「電気的に連結」されている場合も含む。また、ある部分がある構成要素を「含む」とするとき、それは、特に別意の記載がない限り、他の構成要素を除くものではなく、他の構成要素をさらに含みうるということを意味する。 Throughout the specification, when a part is said to be “connected” to another part, this includes not only the case where it is “directly connected” but also the case where it is “electrically connected” with another element interposed between them. Also, when a part is said to “include” a component, this means that, unless otherwise stated, it does not exclude other components and may further include other components.

本明細書において、様々な箇所に登場する「一部実施形態において」または「一実施形態において」というような語句は、必ずしもいずれも同一実施形態を示すものではない。 In this specification, phrases such as “in some embodiments” or “in one embodiment” appearing in various places do not necessarily indicate the same embodiment.

一部実施形態は、機能的なブロック構成、及び多様な処理段階で示される。かような機能ブロックの一部または全部は、特定機能を行う多様な個数のハードウェア構成及び／またはソフトウェア構成によっても具現化される。例えば、本開示の機能ブロックは、１以上のマイクロプロセッサによって具現化されるか、あるいは所定機能のための回路構成によっても具現化される。また、例えば、本開示の機能ブロックは、多様なプログラミング言語またはスクリプティング言語によっても具現化される。該機能ブロックは、１以上のプロセッサで実行されるアルゴリズムによっても具現化される。また、本開示は、電子的な環境設定、信号処理、及び／またはデータ処理などのために、従来技術を採用することができる。「モジュール」及び「構成」のような用語は、汎用され、機械的であって物理的な構成として限定されるものではない。 Some embodiments are described in terms of functional block configurations and various processing steps. Some or all of these functional blocks may be implemented by any number of hardware and/or software configurations that perform specific functions. For example, the functional blocks of the present disclosure may be implemented by one or more microprocessors or by circuit configurations for a predetermined function. The functional blocks of the present disclosure may also be implemented in various programming or scripting languages, and may be implemented as algorithms executed on one or more processors. In addition, the present disclosure may employ conventional techniques for electronic environment setting, signal processing, and/or data processing. Terms such as “module” and “configuration” are used broadly and are not limited to mechanical and physical configurations.

また、図面に図示された構成要素間の連結線または連結部材は、機能的な連結、及び／または物理的または回路的な連結を例示的に示しただけである。実際の装置においては、代替可能であったり、追加されたりする多様な機能的な連結、物理的な連結または回路連結により、構成要素間の連結が示される。 Also, the connecting lines or connecting members between the components shown in the drawings are merely illustrative of functional connections and / or physical or circuit connections. In actual devices, the connections between components are indicated by various functional, physical or circuit connections that may be substituted or added.

本開示においてキーワードは、音声制御装置の特定機能をウェークアップさせることができる音声情報をいう。該キーワードは、ユーザの音声信号に基づいて、単独命令キーワードでもあり、ウェークアップキーワードでもある。ウェークアップキーワードは、スリープモード状態の音声制御装置をウェークアップモードに転換することができる音声に基づくキーワードであり、例えば、「クローバ」、「ハイコンピュータ」のような音声キーワードでもある。ユーザは、ウェークアップキーワードを発話した後、音声制御装置が遂行することを願う機能や動作を指示するための命令を自然語形態で発話することができる。なお、以下の説明でウェークアップキーワードの単なる一例として登場する「クローバ」（Ｃｌｏｖａ）は登録商標であり、「四葉のクローバー」（ｆｏｕｒ−ｌｅａｆｃｌｏｖｅｒ）における「クローバー」とは異なる点に留意を要する。その場合、該音声制御装置は、自然語形態の音声命令を音声認識し、音声認識された結果に対応する機能または動作を遂行することができる。単独命令キーワードは、例えば、音楽が再生中である場合、「中止」のように、音声制御装置の動作を直接制御することができる音声キーワードでもある。本開示で言及されるウェークアップキーワードは、ウェークアップワード、ホットワード、トリガーワードのような用語で呼ばれる。 In the present disclosure, a keyword refers to voice information that can wake up a specific function of the voice control device. The keyword is based on the user's voice signal and may be either a single command keyword or a wake-up keyword. A wake-up keyword is a voice-based keyword that can switch a voice control device from sleep mode to wake-up mode, for example “Clova” or “Hi Computer”. After uttering the wake-up keyword, the user can utter, in natural language, a command indicating the function or operation that the user wants the voice control device to perform. Note that “Clova”, which appears in the following description merely as an example of a wake-up keyword, is a registered trademark and is distinct from the “clover” in “four-leaf clover”. When the user utters such a command, the voice control device can perform voice recognition on the natural-language voice command and perform the function or operation corresponding to the recognition result. A single command keyword is a voice keyword that can directly control the operation of the voice control device, such as “stop” while music is playing. The wake-up keyword referred to in the present disclosure may also be called a wake-up word, hot word, or trigger word.
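
The difference between the two keyword types can be sketched as a small state machine. This is only an illustration under stated assumptions: the `Mode` enum, the `VoiceController` class, the keyword sets, the action log, and the choice to return to sleep mode after one natural-language command are all hypothetical and are not taken from the disclosure.

```python
from enum import Enum, auto


class Mode(Enum):
    SLEEP = auto()
    AWAKE = auto()


# Example keywords from the text; lower-cased for matching (an assumption).
WAKE_UP_KEYWORDS = {"clova", "hi computer"}
SINGLE_COMMAND_KEYWORDS = {"stop"}


class VoiceController:
    """Hypothetical controller that reacts to already-verified keywords."""

    def __init__(self) -> None:
        self.mode = Mode.SLEEP
        self.log: list[str] = []

    def on_keyword(self, keyword: str) -> None:
        kw = keyword.lower()
        if kw in WAKE_UP_KEYWORDS:
            # A wake-up keyword only changes the mode; a command follows later.
            self.mode = Mode.AWAKE
        elif kw in SINGLE_COMMAND_KEYWORDS:
            # A single command keyword directly controls the device.
            self.log.append(f"execute:{kw}")

    def on_command(self, text: str) -> bool:
        """Handle a natural-language command; ignored unless awake."""
        if self.mode is not Mode.AWAKE:
            return False
        self.log.append(f"recognize:{text}")
        self.mode = Mode.SLEEP  # assumed: go back to sleep after one command
        return True
```

In this sketch a natural-language command is ignored while the device is in sleep mode, whereas a single command keyword such as “stop” takes effect without a preceding wake-up keyword.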

本開示において候補キーワードは、キーワードと発音が類似したワードを含む。例えば、キーワードが「クローバ」である場合、該候補キーワードは、「クローバー」、「グローバル」、「クラブ」などでもある。該候補キーワードは、音声制御装置のキーワード検出部が、オーディオデータからキーワードとして検出したものと定義される。該候補キーワードは、キーワードと同一でもあるが、該キーワードと類似した発音を有する他のワードでもある。一般的には、該音声制御装置は、ユーザが候補キーワードに該当する用語が含まれている文章を発話する場合にも、当該キーワードと誤認識してウェークアップさせることがある。本開示による音声制御装置は、音声信号から、前述のような候補キーワードが検出される場合にも反応するが、候補キーワードによってウェークアップさせることを防止することができる。 In the present disclosure, candidate keywords include words whose pronunciation is similar to that of the keyword. For example, when the keyword is “Clova”, candidate keywords may be “clover”, “global”, “club”, and the like. A candidate keyword is defined as whatever the keyword detection unit of the voice control device detects as the keyword in the audio data. A candidate keyword may be identical to the keyword, or it may be a different word with a similar pronunciation. In general, a voice control device may misrecognize the keyword and wake up even when the user utters a sentence containing a term that corresponds to a candidate keyword. The voice control device according to the present disclosure also responds when such a candidate keyword is detected in the voice signal, but it can prevent a wake-up caused by the candidate keyword.

本開示において音声認識機能は、ユーザの音声信号を、文字列（または、テキスト）に変換することをいう。ユーザの音声信号は、音声命令を含みうる。該音声命令は、音声制御装置の特定機能を行うことができる。 In the present disclosure, the voice recognition function refers to converting a user's voice signal into a character string (or text). The user's voice signal may include a voice command. The voice command can perform a specific function of the voice control device.

本開示において音声制御装置は、音声制御機能が搭載された電子機器をいう。音声制御機能が搭載された電子機器は、スマートスピーカまたは人工知能スピーカのような独立した電子機器でもある。また、音声制御機能が搭載された電子機器は、音声制御機能が搭載されたコンピュータ装置、例えば、デスクトップＰＣ（personal computer）、ノート型パソコンなどであるだけでなく、携帯が可能なコンピュータ装置、例えば、スマートフォンなどでもある。その場合、該コンピュータ装置には、音声制御機能を行うためのプログラムまたはアプリケーションがインストールされる。また、該音声制御機能が搭載された電子機器は、特定機能を主に遂行する電子製品、例えば、スマートテレビ、スマート冷蔵庫、スマートエアコン、スマートナビゲーションなどでもあり、自動車のインフォテーンメントシステムでもある。それだけではなく、音声によって制御される事物インターネット装置も、それに該当する。 In the present disclosure, a voice control device refers to an electronic device equipped with a voice control function. Such an electronic device may be a standalone electronic device such as a smart speaker or an artificial intelligence speaker. It may also be a computer device equipped with a voice control function, for example a desktop PC (personal computer) or a notebook computer, or a portable computer device such as a smartphone. In that case, a program or application for performing the voice control function may be installed on the computer device. The electronic device may also be an electronic product that mainly performs a specific function, for example a smart TV, a smart refrigerator, a smart air conditioner, or a smart navigation device, or it may be an automobile infotainment system. Internet of Things (IoT) devices controlled by voice also fall into this category.

本開示において、音声制御装置の特定機能は、例えば、該音声制御装置にインストールされたアプリケーションを実行することを含みうるが、それに制限されるものではない。例えば、該音声制御装置がスマートスピーカである場合、該音声制御装置の特定機能は、音楽再生、インターネットショッピング、音声情報提供、スマートスピーカに接続された電子装置または機械装置の制御などを含みうる。例えば、該音声制御装置がスマートフォンである場合、該アプリケーション実行は、電話かけること、道探し、インターネット検索またはアラーム設定などを含みうる。例えば、該音声制御装置がスマートテレビである場合、該アプリケーション実行は、プログラム検索またはチャネル検索などを含みうる。該音声制御装置がスマートオーブンである場合、該アプリケーション実行は、料理方法検索などを含みうる。該音声制御装置がスマート冷蔵庫である場合、該アプリケーション実行は、冷蔵状態及び冷凍状態の点検、または温度設定などを含みうる。該音声制御装置がスマート自動車である場合、該アプリケーション実行は、自動始動、自律走行、自動駐車などを含みうる。本開示でアプリケーション実行は、前述のところに制限されるものではない。 In the present disclosure, the specific function of the voice control device may include, for example, executing an application installed in the voice control device, but is not limited thereto. For example, when the voice control device is a smart speaker, specific functions of the voice control device may include music playback, Internet shopping, voice information provision, control of an electronic device or a mechanical device connected to the smart speaker, and the like. For example, when the voice control device is a smartphone, the application execution may include making a call, searching for a way, searching the Internet, or setting an alarm. For example, when the voice control device is a smart TV, the application execution may include a program search or a channel search. When the voice control device is a smart oven, the application execution may include cooking method search and the like. When the voice control device is a smart refrigerator, the application execution may include checking a refrigerated state and a frozen state, or setting a temperature. When the voice control device is a smart vehicle, the application execution may include automatic start, autonomous driving, automatic parking, and the like. Application execution in the present disclosure is not limited to the foregoing.

本開示においてキーワードは、ワード形態を有するか、あるいは球形態を有することができる。本開示において、ウェークアップキーワード後に発話される音声命令は、自然語形態の文章形態、ワード形態または球形態を有することができる。 In the present disclosure, a keyword can take the form of a word or a phrase. In the present disclosure, a voice command uttered after the wake-up keyword can take the form of a natural-language sentence, a word, or a phrase.

以下、添付された図面を参照し、本開示について詳細に説明する。 Hereinafter, the present disclosure will be described in detail with reference to the accompanying drawings.

図１は、一実施形態によるネットワーク環境の例を図示した図面である。図１に図示されたネットワーク環境は、複数の電子機器１００ａないし１００ｆ、サーバ２００及びネットワーク３００を含むように例示的に図示される。 FIG. 1 is a diagram illustrating an example of a network environment according to an embodiment. The network environment illustrated in FIG. 1 is exemplarily illustrated to include a plurality of electronic devices 100a to 100f, a server 200, and a network 300.

電子機器１００ａないし１００ｆは、音声で制御される例示的な電子機器である。電子機器１００ａないし１００ｆそれぞれは、音声認識機能以外に、特定機能を行うことができる。電子機器１００ａないし１００ｆの例を挙げれば、スマートスピーカまたは人工知能スピーカ、スマートフォン、携帯電話、ナビゲーション、コンピュータ、ノート型パソコン、デジタル放送用端末、ＰＤＡ（personal digital assistants）、ＰＭＰ（portable multimedia player）、タブレットＰＣ、スマート電子製品などがある。電子機器１００ａないし１００ｆは、無線または有線の通信方式を利用し、ネットワーク３００を介して、サーバ２００、及び／または他の電子機器１００ａないし１００ｆと通信することができる。しかし、それに限定されるものではなく、電子機器１００ａないし１００ｆそれぞれは、ネットワーク３００に連結されず、独立して動作することもできる。電子機器１００ａないし１００ｆは、電子機器１００とも総称される。 The electronic devices 100a to 100f are exemplary electronic devices controlled by voice. Each of the electronic devices 100a to 100f can perform a specific function in addition to the voice recognition function. Examples of the electronic devices 100a to 100f include smart speakers or artificial intelligence speakers, smartphones, mobile phones, navigation, computers, notebook computers, digital broadcasting terminals, PDA (personal digital assistants), PMP (portable multimedia players), There are tablet PCs and smart electronic products. The electronic devices 100a to 100f can communicate with the server 200 and / or other electronic devices 100a to 100f via the network 300 using a wireless or wired communication method. However, the present invention is not limited to this, and each of the electronic devices 100a to 100f is not connected to the network 300 and can operate independently. The electronic devices 100a to 100f are also collectively referred to as the electronic device 100.

ネットワーク３００の通信方式は、制限されるものではなく、ネットワーク３００が含みうる通信網（一例として、移動通信網、有線インターネット、無線インターネット、放送網）を活用する通信方式だけではなく、電子機器１００ａないし１００ｆ間の近距離無線通信が含まれてもよい。例えば、ネットワーク３００は、ＰＡＮ（personal area network）、ＬＡＮ（local area network）、ＣＡＮ（campus area network）、ＭＡＮ（metropolitan area network）、ＷＡＮ（wide area network）、ＢＢＮ（broadband network）、インターネットなどのネットワークのうち１以上の任意のネットワークを含みうる。また、ネットワーク３００は、バスネットワーク、スターネットワーク、リングネットワーク、メッシュネットワーク、スター・バスネットワーク、ツリーネットワークまたは階層的（hierarchical）ネットワークなどを含むネットワークトポロジーのうち、任意の１以上を含みうるが、それらに制限されるものではない。 The communication method of the network 300 is not limited. It may include not only communication methods that use the communication networks the network 300 can include (for example, a mobile communication network, the wired Internet, the wireless Internet, or a broadcast network) but also short-range wireless communication between the electronic devices 100a to 100f. For example, the network 300 may include any one or more of networks such as a PAN (personal area network), a LAN (local area network), a CAN (campus area network), a MAN (metropolitan area network), a WAN (wide area network), a BBN (broadband network), and the Internet. The network 300 may also include any one or more network topologies, including a bus network, a star network, a ring network, a mesh network, a star-bus network, a tree network, and a hierarchical network, but is not limited to these.

サーバ２００は、ネットワーク３００を介し、て電子機器１００ａないし１００ｆと通信し、音声認識機能を遂行するコンピュータ装置、または複数のコンピュータ装置によっても具現化される。サーバ２００は、クラウド形態に分散され、命令、コード、ファイル、コンテンツなどを提供することができる。 The server 200 is also embodied by a computer device or a plurality of computer devices that communicate with the electronic devices 100a to 100f via the network 300 and perform a voice recognition function. The server 200 is distributed in a cloud form and can provide instructions, codes, files, contents, and the like.

例えば、サーバ２００は、電子機器１００ａないし１００ｆから提供されるオーディオファイルを受信し、オーディオファイル内の音声信号を文字列（または、テキスト）に変換し、変換された文字列（または、テキスト）を、電子機器１００ａないし１００ｆに提供することができる。また、サーバ２００は、ネットワーク３００を介して接続した電子機器１００ａないし１００ｆに、音声制御機能を遂行するためのアプリケーションインストールのためのファイルを提供することができる。例えば、第２電子機器１００ｂは、サーバ２００から提供されたファイルを利用し、アプリケーションをインストールすることができる。第２電子機器１００ｂは、インストールされた運用体制（ＯＳ）、及び／または少なくとも１つのプログラム（例えば、インストールされた音声制御アプリケーション）の制御によってサーバ２００に接続し、サーバ２００が提供する音声認識サービスを提供される。 For example, the server 200 can receive an audio file provided by the electronic devices 100a to 100f, convert the voice signal in the audio file into a character string (or text), and provide the converted character string (or text) to the electronic devices 100a to 100f. The server 200 can also provide the electronic devices 100a to 100f connected via the network 300 with a file for installing an application that performs the voice control function. For example, the second electronic device 100b can install the application using the file provided by the server 200. The second electronic device 100b can then connect to the server 200 under the control of the installed operating system (OS) and/or at least one program (for example, the installed voice control application) and receive the voice recognition service provided by the server 200.

図２は、一実施形態によって、電子機器及びサーバの内部構成について説明するためのブロック図である。 FIG. 2 is a block diagram for explaining an internal configuration of an electronic device and a server according to an embodiment.

電子機器１００は、図１の電子機器１００ａないし１００ｆのうち一つであり、電子機器１００ａないし１００ｆは、少なくとも図２に図示された内部構成を有することができる。電子機器１００は、ネットワーク３００を介して音声認識機能を遂行するサーバ２００に接続されるように図示されているが、それは例示的なものであり、電子機器１００は、独立して音声認識機能を遂行することもできる。電子機器１００は、音声によって制御される電子機器であり、音声制御装置１００とも呼ばれる。音声制御装置１００は、スマートスピーカまたは人工知能スピーカ、コンピュータ装置、携帯用コンピュータ装置、スマート家電製品などに含まれたり、それらに、有線及び／または無線で連結されたりして具現化される。 The electronic device 100 is one of the electronic devices 100a to 100f of FIG. 1, and each of the electronic devices 100a to 100f may have at least the internal configuration illustrated in FIG. 2. Although the electronic device 100 is illustrated as being connected via the network 300 to the server 200 that performs the voice recognition function, this is only an example, and the electronic device 100 can also perform the voice recognition function independently. The electronic device 100 is an electronic device controlled by voice and is also referred to as the voice control device 100. The voice control device 100 may be implemented by being included in a smart speaker or artificial intelligence speaker, a computer device, a portable computer device, a smart home appliance, or the like, or by being connected to one of them by wire and/or wirelessly.

電子機器１００とサーバ２００は、メモリ１１０，２１０、プロセッサ１２０，２２０、通信モジュール１３０，２３０、及び入出力インターフェース１４０，２４０を含みうる。メモリ１１０，２１０は、コンピュータで読み取り可能な記録媒体であり、ＲＡＭ（random access memory）、ＲＯＭ（read-only memory）及びディスクドライブのような非消滅性大容量記録装置（permanent mass storage device）を含みうる。また、メモリ１１０，２１０には、運用体制と、少なくとも１つのプログラムコード（例えば、電子機器１００にインストールされて駆動される音声制御アプリケーション、音声認識アプリケーションなどのためのコード）とが保存される。かようなソフトウェア構成要素は、コンピュータで読み取り可能な記録媒体ではない通信モジュール１３０，２３０を介して、メモリ１１０，２１０にローディングされる。例えば、少なくとも１つのプログラムは、開発者、またはアプリケーションのインストールファイルを配布するファイル配布システムが、ネットワーク３００を介して提供するファイルによってインストールされるプログラムに基づいて、メモリ１１０，２１０にローディングされる。 The electronic device 100 and the server 200 may include memories 110 and 210, processors 120 and 220, communication modules 130 and 230, and input/output interfaces 140 and 240. The memories 110 and 210 are computer-readable recording media and may include RAM (random access memory), ROM (read-only memory), and a permanent mass storage device such as a disk drive. The memories 110 and 210 may also store an operating system and at least one program code (for example, code for a voice control application, a voice recognition application, or the like that is installed and run on the electronic device 100). Such software components may be loaded into the memories 110 and 210 through the communication modules 130 and 230 rather than from a computer-readable recording medium. For example, at least one program may be loaded into the memories 110 and 210 based on a program installed from files that a developer, or a file distribution system that distributes application installation files, provides via the network 300.

プロセッサ１２０，２２０は、基本的な算術、ロジック及び入出力演算を行うことにより、コンピュータプログラムの命令を処理するように構成される。該命令は、メモリ１１０，２１０または通信モジュール１３０，２３０によって、プロセッサ１２０，２２０にも提供される。例えば、プロセッサ１２０，２２０は、メモリ１１０，２１０のような記録装置に保存されたプログラムコードによって受信される命令を実行するようにも構成される。 The processors 120 and 220 are configured to process computer program instructions by performing basic arithmetic, logic and input / output operations. The instructions are also provided to the processors 120 and 220 by the memories 110 and 210 or the communication modules 130 and 230. For example, the processors 120 and 220 are also configured to execute instructions received by program code stored in a recording device such as the memories 110 and 210.

通信モジュール１３０，２３０は、ネットワーク３００を介して、電子機器１００とサーバ２００とが互いに通信するための機能を提供することができ、他の電子機器１００ｂないし１００ｆと通信するための機能を提供することができる。一例として、電子機器１００のプロセッサ１２０が、メモリ１１０のような記録装置に保存されたプログラムコードによって生成した要請（一例として、音声認識サービス要請）が、通信モジュール１３０の制御により、ネットワーク３００を介してサーバ２００に伝達される。反対に、サーバ２００のプロセッサ２２０の制御によって提供される音声認識結果である文字列（テキスト）などが、通信モジュール２３０及びネットワーク３００を経て、電子機器１００の通信モジュール１３０を介して、電子機器１００に受信される。例えば、通信モジュール１３０を介して受信されたサーバ２００の音声認識結果は、プロセッサ１２０やメモリ１１０に伝達される。サーバ２００は、制御信号や命令、コンテンツ、ファイルなどを電子機器１００に送信することができ、通信モジュール１３０を介して受信された制御信号や命令などは、プロセッサ１２０やメモリ１１０に伝達し、コンテンツやファイルなどは、電子機器１００がさらに含みうる別途の記録媒体にも保存される。 The communication modules 130 and 230 can provide a function for the electronic device 100 and the server 200 to communicate with each other via the network 300, and can provide a function for communicating with the other electronic devices 100b to 100f. As an example, a request (for example, a voice recognition service request) that the processor 120 of the electronic device 100 generates according to program code stored in a recording device such as the memory 110 may be transmitted to the server 200 via the network 300 under the control of the communication module 130. Conversely, a character string (text) or the like that is a voice recognition result provided under the control of the processor 220 of the server 200 may be received by the electronic device 100 through the communication module 230, the network 300, and the communication module 130 of the electronic device 100. For example, the voice recognition result of the server 200 received via the communication module 130 may be passed to the processor 120 or the memory 110. The server 200 can transmit control signals, commands, content, files, and the like to the electronic device 100. Control signals and commands received via the communication module 130 may be passed to the processor 120 or the memory 110, and content and files may be stored in a separate recording medium that the electronic device 100 may further include.

入出力インターフェース１４０，２４０は、入出力装置１５０とのインターフェースのための手段でもある。例えば、入力装置はマイク１５１だけではなく、キーボードまたはマウスなどの装置を含み、出力装置は、スピーカ１５２だけではなく、状態を示す状態表示ＬＥＤ（light emitting diode）、アプリケーションの通信セッションを表示するためのディスプレイのような装置を含みうる。他の例として、入出力装置１５０は、タッチスクリーンのように、入力及び出力のための機能が一つに統合された装置を含みうる。 The input/output interfaces 140 and 240 may be means for interfacing with the input/output device 150. For example, the input device may include not only the microphone 151 but also devices such as a keyboard or a mouse, and the output device may include not only the speaker 152 but also devices such as a status display LED (light emitting diode) that indicates status and a display for showing a communication session of an application. As another example, the input/output device 150 may include a device in which input and output functions are integrated, such as a touch screen.

マイク１５１は、周辺音を電気的なオーディオ信号に変換することができる。マイク１５１は、電子機器１００内に直接装着されず、通信可能に連結される外部装置（例えば、スマート時計）に装着され、生成された外部信号は、通信によって電子機器１００に伝送される。図２には、マイク１５１が電子機器１００の内部に含まれるように図示されているが、他の一実施形態によれば、マイク１５１は、別途の装置内に含まれ、電子機器１００とは、有線通信または無線通信で連結される形態にも具現化される。 The microphone 151 can convert ambient sound into an electrical audio signal. The microphone 151 may be mounted not directly in the electronic device 100 but in a communicably connected external device (for example, a smart watch), in which case the generated external signal is transmitted to the electronic device 100 by communication. Although FIG. 2 shows the microphone 151 inside the electronic device 100, according to another embodiment the microphone 151 may be included in a separate device and connected to the electronic device 100 by wired or wireless communication.

他の実施形態において、電子機器１００及びサーバ２００は、図２の構成要素よりさらに多くの構成要素を含んでもよい。例えば、電子機器１００は、前述の入出力装置１５０のうち少なくとも一部を含むように構成されるか、あるいはトランシーバ（transceiver）、ＧＰＳ（global position system）モジュール、カメラ、各種センサ、データベースのような他の構成要素をさらに含んでもよい。 In other embodiments, the electronic device 100 and the server 200 may include more components than those shown in FIG. 2. For example, the electronic device 100 may be configured to include at least a part of the input/output device 150 described above, or may further include other components such as a transceiver, a GPS (global positioning system) module, a camera, various sensors, and a database.

図３は、一実施形態による音声制御装置のプロセッサが含みうる機能ブロックの例を図示した図面であり、図４は、一実施形態によって、音声制御装置が遂行することができる動作方法の例を図示したフローチャートである。 FIG. 3 is a diagram illustrating an example of functional blocks that a processor of a voice control device according to an embodiment may include, and FIG. 4 is a flowchart illustrating an example of an operation method that the voice control device according to an embodiment can perform.

図３に図示されているように、音声制御装置１００のプロセッサ１２０は、オーディオ処理部１２１、キーワード検出部１２２、話者特徴ベクトル抽出部１２３、ウェークアップ判断部１２４、音声認識部１２５及び機能部１２６を含みうる。かようなプロセッサ１２０及び機能ブロック１２１ないし１２６のうち少なくとも一部は、図４に図示された動作方法が含む段階（Ｓ１１０ないしＳ１９０）を遂行するように、音声制御装置１００を制御することができる。例えば、プロセッサ１２０、及びプロセッサ１２０の機能ブロック１２１ないし１２６のうち少なくとも一部は、音声制御装置１００のメモリ１１０が含む運用体制のコードと、少なくとも１つのプログラムコードによる命令と、を実行するようにも具現化される。 As illustrated in FIG. 3, the processor 120 of the voice control device 100 may include an audio processing unit 121, a keyword detection unit 122, a speaker feature vector extraction unit 123, a wake-up determination unit 124, a voice recognition unit 125, and a function unit 126. The processor 120 and at least some of the functional blocks 121 to 126 can control the voice control device 100 to perform the steps (S110 to S190) included in the operation method illustrated in FIG. 4. For example, the processor 120 and at least some of its functional blocks 121 to 126 may be implemented to execute instructions according to operating system code and at least one program code included in the memory 110 of the voice control device 100.

図３に図示された機能ブロック１２１ないし１２６の一部または全部は、特定機能を行うハードウェア構成及び／またはソフトウェア構成にも具現化される。図３に図示された機能ブロック１２１ないし１２６が遂行する機能は、１以上のマイクロプロセッサによって具現化されるか、あるいは当該機能のための回路構成によっても具現化される。図３に図示された機能ブロック１２１ないし１２６の一部または全部は、プロセッサ１２０で実行される多様なプログラミング言語またはスクリプト言語で構成されたソフトウェアモジュールでもある。例えば、オーディオ処理部１２１とキーワード検出部１２２は、デジタル信号処理器（ＤＳＰ）によって具現化され、話者特徴ベクトル抽出部１２３、ウェークアップ判断部１２４及び音声認識部１２５は、ソフトウェアモジュールによっても具現化される。 Some or all of the functional blocks 121 to 126 illustrated in FIG. 3 may be implemented as hardware and/or software configurations that perform specific functions. The functions performed by the functional blocks 121 to 126 may be implemented by one or more microprocessors or by circuit configurations for those functions. Some or all of the functional blocks 121 to 126 may be software modules written in various programming or scripting languages and executed by the processor 120. For example, the audio processing unit 121 and the keyword detection unit 122 may be implemented by a digital signal processor (DSP), and the speaker feature vector extraction unit 123, the wake-up determination unit 124, and the voice recognition unit 125 may be implemented as software modules.

オーディオ処理部１２１は、周辺音に対応するオーディオ信号を受信し、オーディオストリームデータを生成する。オーディオ処理部１２１は、マイク１５１のような入力装置から、周辺音に対応するオーディオ信号を受信することができる。マイク１５１は、音声制御装置１００に通信で連結される周辺装置に含まれ、オーディオ処理部１２１は、マイク１５１で生成されたオーディオ信号を通信で受信することができる。該周辺音は、ユーザが発話した音声だけではなく、背景音を含む。従って、オーディオ信号には、音声信号だけではなく、背景音信号も含まれる。該背景音信号は、キーワード検出及び音声認識において、ノイズに該当する。 The audio processing unit 121 receives an audio signal corresponding to ambient sound and generates audio stream data. The audio processing unit 121 can receive the audio signal from an input device such as the microphone 151. The microphone 151 may be included in a peripheral device communicably connected to the voice control device 100, in which case the audio processing unit 121 can receive the audio signal generated by the microphone 151 by communication. The ambient sound includes not only the voice uttered by the user but also background sound. Accordingly, the audio signal includes not only a voice signal but also a background sound signal. The background sound signal corresponds to noise in keyword detection and voice recognition.

オーディオ処理部１２１は、連続的に受信されるオーディオ信号に対応するオーディオストリームデータを生成することができる。オーディオ処理部１２１は、オーディオ信号をフィルタリングしてデジタル化し、オーディオストリームデータを生成することができる。オーディオ処理部１２１は、オーディオ信号をフィルタリングしてノイズ信号を除去し、背景音信号に比べ、音声信号を増幅することができる。また、オーディオ処理部１２１は、オーディオ信号から音声信号のエコーを除去することもできる。 The audio processing unit 121 can generate audio stream data corresponding to continuously received audio signals. The audio processing unit 121 can filter and digitize the audio signal to generate the audio stream data. The audio processing unit 121 can filter the audio signal to remove noise signals and amplify the voice signal relative to the background sound signal. The audio processing unit 121 can also remove echoes of the voice signal from the audio signal.

オーディオ処理部１２１は、音声制御装置１００がスリープモードで動作するときにも、オーディオ信号を受信するために、常時動作することができる。オーディオ処理部１２１は、音声制御装置１００がスリープモードで動作するとき、低い動作周波数で動作し、音声制御装置１００が正常モードで動作するときには、高い動作周波数で動作することができる。 The audio processing unit 121 can operate at all times so as to receive the audio signal even when the voice control device 100 operates in the sleep mode. The audio processing unit 121 can operate at a low operating frequency when the voice control device 100 operates in the sleep mode and at a high operating frequency when the voice control device 100 operates in the normal mode.

メモリ１１０は、オーディオ処理部１２１で生成されたオーディオストリームデータを一時的に保存することができる。オーディオ処理部１２１は、メモリ１１０を利用して、オーディオストリームデータをバッファリングすることができる。メモリ１１０には、キーワードを含むオーディオデータだけではなく、キーワードが検出される前のオーディオデータが共に保存される。最近のオーディオデータをメモリ１１０に保存するために、メモリ１１０に最も前に保存されたオーディオデータが削除される。メモリ１１０に割り当てられた大きさが同一であるならば、常時同一期間のオーディオデータが保存される。メモリ１１０に保存されたオーディオデータに該当する前記期間は、キーワードを発声する時間より長いことが望ましい。 The memory 110 can temporarily store the audio stream data generated by the audio processing unit 121. The audio processing unit 121 can buffer the audio stream data using the memory 110. The memory 110 stores not only the audio data that includes the keyword but also the audio data from before the keyword was detected. To store the most recent audio data in the memory 110, the audio data that was stored earliest in the memory 110 is deleted. As long as the size allocated in the memory 110 stays the same, audio data covering the same length of time is always stored. The period covered by the audio data stored in the memory 110 is preferably longer than the time it takes to utter the keyword.
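The buffering behavior described above can be sketched as a fixed-capacity ring buffer. This is only an illustration under stated assumptions: the class name `AudioRingBuffer`, the frame-based capacity, and the use of `bytes` frames are hypothetical and do not come from the source.

```python
from collections import deque
from typing import Deque, List


class AudioRingBuffer:
    """Hypothetical fixed-capacity buffer that keeps only the newest frames."""

    def __init__(self, capacity_frames: int) -> None:
        # A deque with maxlen silently drops the oldest item when a new
        # item is appended to a full buffer.
        self._frames: Deque[bytes] = deque(maxlen=capacity_frames)

    def push(self, frame: bytes) -> None:
        """Store the newest frame, evicting the oldest one if full."""
        self._frames.append(frame)

    def snapshot(self) -> List[bytes]:
        """Return the buffered frames, oldest first."""
        return list(self._frames)


# Push five frames into a buffer that holds three.
buf = AudioRingBuffer(capacity_frames=3)
for i in range(5):
    buf.push(bytes([i]))
```

Because the capacity is fixed, the buffer always spans the same duration of audio, which is why the capacity should exceed the time needed to utter the keyword plus the preceding section that is examined later.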

本発明の他の実施形態によれば、メモリ１１０は、オーディオ処理部１２１で生成されたオーディオストリームに係わる話者特徴ベクトルを抽出して保存することができる。そのとき、該話者特徴ベクトルは、特定長のオーディオストリームに対して抽出して保存される。前述のように、最近生成されたオーディオストリームに係わる話者特徴ベクトルを保存するために、最も前に保存された話者特徴ベクトルが削除される。 According to another embodiment of the present invention, the memory 110 can store speaker feature vectors extracted for the audio stream generated by the audio processing unit 121. In that case, each speaker feature vector is extracted and stored for an audio stream of a specific length. As described above, to store the speaker feature vectors for the most recently generated audio stream, the speaker feature vector that was stored earliest is deleted.

キーワード検出部１２２は、オーディオ処理部１２１で生成されたオーディオストリームデータから、既定義の（即ち、所定の）キーワードに対応する候補キーワードを検出する。キーワード検出部１２２は、メモリ１１０に一時的に保存されたオーディオストリームデータから、既定義のキーワードに対応する候補キーワードを検出することができる。既定義のキーワードは、複数個存在することも可能であり、複数の既定義のキーワードは、キーワード保存所１１０ａに保存される。キーワード保存所１１０ａは、メモリ１１０に含まれてもよい。 The keyword detection unit 122 detects candidate keywords corresponding to predefined (that is, predetermined) keywords from the audio stream data generated by the audio processing unit 121. The keyword detection unit 122 can detect candidate keywords corresponding to the predefined keywords from the audio stream data temporarily stored in the memory 110. There may be a plurality of predefined keywords, and the plurality of predefined keywords are stored in the keyword storage 110a. The keyword storage 110 a may be included in the memory 110.

候補キーワードは、キーワード検出部１２２から、オーディオストリームデータのうちキーワードとして検出したものを意味する。候補キーワードは、キーワードと同一であっても良いし、該キーワードと類似して発音される他の単語であっても良い。例えば、該キーワードが「クローバ」である場合、候補キーワードは、「グローバル」であっても良い。すなわち、ユーザが「グローバル」を含んだ文章を発声した場合、キーワード検出部１２２は、オーディオストリームデータから、「グローバル」を「クローバ」と誤認して検出するかもしれないからである。かように検出された「グローバル」は、候補キーワードに該当する。 A candidate keyword is a portion of the audio stream data that the keyword detection unit 122 has detected as the keyword. The candidate keyword may be identical to the keyword, or may be another word that is pronounced similarly to the keyword. For example, when the keyword is “clover”, the candidate keyword may be “global”. That is, when the user utters a sentence containing “global”, the keyword detection unit 122 may mistake “global” in the audio stream data for “clover” and detect it. The “global” detected in this way is a candidate keyword.

キーワード検出部１２２は、オーディオストリームデータを、既知のキーワードデータと比較し、オーディオストリームデータ内に、キーワードに対応する音声が含まれる可能性を計算することができる。キーワード検出部１２２は、オーディオストリームデータから、フィルタバンクエネルギー（filter bank energy）またはメル周波数ケプストラム係数（ＭＦＣＣ：Mel−frequency cepstral coefficients）のようなオーディオ特徴を抽出することができる。キーワード検出部１２２は、分類ウィンドウ（classifying window）を利用して、例えば、サポートベクトルマシン（support vector machine）または神経網（neural network）を利用して、かようなオーディオ特徴を処理することができる。該オーディオ特徴の処理に基づいて、キーワード検出部１２２は、オーディオストリームデータ内にキーワードが含まれる可能性を計算することができる。キーワード検出部１２２は、前記可能性が、既設定基準値（即ち、所定の基準値）より高い場合、オーディオストリームデータ内にキーワードが含まれていると判断することにより、候補キーワードを検出することができる。 The keyword detection unit 122 can compare the audio stream data with known keyword data and calculate the likelihood that the audio stream data contains speech corresponding to the keyword. The keyword detection unit 122 can extract audio features such as filter bank energies or Mel-frequency cepstral coefficients (MFCC) from the audio stream data. The keyword detection unit 122 can process these audio features using classifying windows, for example with a support vector machine or a neural network. Based on this processing of the audio features, the keyword detection unit 122 can calculate the likelihood that the keyword is included in the audio stream data. When that likelihood is higher than a preset reference value (that is, a predetermined reference value), the keyword detection unit 122 determines that the keyword is included in the audio stream data and thereby detects a candidate keyword.

キーワード検出部１２２は、キーワードデータに対応する音声サンプルを利用して人工神経網（例えば、人工知能のためのニューラルネットワーク）を生成し、生成された神経網を利用して、オーディオストリームデータからキーワードを検出するように、トレーニングされる。キーワード検出部１２２は、オーディオストリームデータ内のフレームごとに、それぞれキーワードを構成する音素の確率、またはキーワードの全体的な確率を計算することができる。キーワード検出部１２２は、オーディオストリームデータから、各音素に該当する確率シーケンス、またはキーワード自体の確率を出力することができる。そのシーケンスまたは確率を基に、キーワード検出部１２２は、オーディオストリームデータ内にキーワードが含まれる可能性を計算することができ、その可能性が既設定基準値以上である場合、候補キーワードが検出されたと判断することができる。前述の方式は、例示的なものであり、キーワード検出部１２２の動作は、多様な方式を介しても具現化される。 The keyword detection unit 122 may generate an artificial neural network using voice samples corresponding to the keyword data and be trained to detect the keyword from the audio stream data using the generated network. For each frame in the audio stream data, the keyword detection unit 122 can calculate the probability of each phoneme that makes up the keyword, or the overall probability of the keyword. The keyword detection unit 122 can output, from the audio stream data, a probability sequence corresponding to each phoneme or the probability of the keyword itself. Based on that sequence or probability, the keyword detection unit 122 can calculate the likelihood that the keyword is included in the audio stream data, and when the likelihood is equal to or greater than a preset reference value, it can determine that a candidate keyword has been detected. The scheme described above is exemplary, and the operation of the keyword detection unit 122 may be implemented in various other ways.
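As a rough illustration of turning per-frame keyword probabilities into a candidate detection, the sketch below averages the probabilities over a sliding window and compares the average with a reference value. The source does not say how the per-frame outputs are combined, so the sliding-window mean, the function name `detect_candidate`, and the parameters `window` and `threshold` are all assumptions made for this example.

```python
from typing import Optional, Sequence, Tuple


def detect_candidate(
    frame_probs: Sequence[float], window: int, threshold: float
) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices of the first window whose mean
    keyword probability reaches the threshold, or None if there is none.

    The end index is exclusive. Averaging over a window is an assumed
    combination rule, not one stated in the source.
    """
    if window <= 0 or len(frame_probs) < window:
        return None
    for start in range(len(frame_probs) - window + 1):
        mean = sum(frame_probs[start:start + window]) / window
        if mean >= threshold:
            return (start, start + window)
    return None


# Hypothetical per-frame keyword probabilities from a classifier.
probs = [0.1, 0.2, 0.9, 0.95, 0.9, 0.1]
section = detect_candidate(probs, window=3, threshold=0.8)
```

The returned pair plays the role of the start and end points of the section in which the candidate keyword was detected.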

また、キーワード検出部１２２は、オーディオストリームデータ内のフレームごとに、オーディオ特徴を抽出することにより、当該フレームのオーディオデータが、人の音声に該当する可能性と、背景音に該当する可能性とを算出することができる。キーワード検出部１２２は、人の音声に該当する可能性と、背景音に該当する可能性とを比較し、当該フレームのオーディオデータが人の音声に該当すると判断することができる。例えば、キーワード検出部１２２は、当該フレームのオーディオデータが人の音声に該当する可能性が、背景音に該当する可能性より、既設定基準値を超えて高い場合、当該フレームのオーディオデータが人の音声に対応すると判断することができる。 In addition, by extracting audio features for each frame in the audio stream data, the keyword detection unit 122 can calculate the likelihood that the audio data of the frame is human voice and the likelihood that it is background sound. The keyword detection unit 122 can compare the two likelihoods and determine whether the audio data of the frame is human voice. For example, when the likelihood that the audio data of the frame is human voice exceeds the likelihood that it is background sound by more than a preset reference value, the keyword detection unit 122 can determine that the audio data of the frame is human voice.
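The per-frame voice decision described above amounts to a margin test between two scores. The following minimal sketch assumes the two likelihoods are already available as floats per frame; the names `is_voice_frame`, `flag_voice_frames`, and `margin` are hypothetical.

```python
from typing import List, Sequence, Tuple


def is_voice_frame(p_voice: float, p_background: float, margin: float) -> bool:
    """A frame counts as voice when the voice likelihood exceeds the
    background likelihood by more than the preset margin."""
    return (p_voice - p_background) > margin


def flag_voice_frames(
    scores: Sequence[Tuple[float, float]], margin: float
) -> List[bool]:
    """Return one flag per frame, given (p_voice, p_background) pairs."""
    return [is_voice_frame(pv, pb, margin) for pv, pb in scores]


# Hypothetical (voice, background) likelihoods for three frames.
flags = flag_voice_frames([(0.9, 0.1), (0.5, 0.45), (0.2, 0.8)], margin=0.2)
```

Flags of this kind could be stored next to each buffered frame so that later stages can skip non-voice frames.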

キーワード検出部１２２は、オーディオストリームデータから候補キーワードが検出された区間を特定することができ、候補キーワードが検出された区間の始点及び終点を決定することができる。オーディオストリームデータから候補キーワードが検出された区間は、キーワード検出区間、現在区間または第１区間とされる。オーディオストリームデータにおいて第１区間に該当するオーディオデータは、第１オーディオデータとする。キーワード検出部１２２は、候補キーワードが検出された区間の終りを終点と決定することができる。他の例によれば、キーワード検出部１２２は、候補キーワードが検出された後、既設定時間（例えば、０．５秒）の黙音が発生するまで待った後、第１区間に黙音区間が含まれるように、第１区間の終点を決定するか、あるいは黙音期間が含まれないように、第１区間の終点を決定することができる。 The keyword detection unit 122 can specify a section in which the candidate keyword is detected from the audio stream data, and can determine a start point and an end point of the section in which the candidate keyword is detected. The section in which the candidate keyword is detected from the audio stream data is the keyword detection section, the current section, or the first section. The audio data corresponding to the first section in the audio stream data is the first audio data. The keyword detection unit 122 can determine the end of the section in which the candidate keyword is detected as the end point. According to another example, the keyword detection unit 122 waits until a silent sound is generated for a preset time (for example, 0.5 seconds) after a candidate keyword is detected, and then the silent section is included in the first section. The end point of the first section can be determined so as to be included, or the end point of the first section can be determined so that the silent period is not included.

話者特徴ベクトル抽出部１２３は、メモリ１１０に一時的に保存されたオーディオストリームデータにおいて、第２区間に該当する第２オーディオデータを、メモリ１１０から読み取る。第２区間は、第１区間の以前区間であり、第２区間の終点は、第１区間の始点と同一でもある。第２区間は、以前区間とされる。第２区間の長さは、検出された候補キーワードに対応するキーワードによって可変的にも設定される。他の例によれば、第２区間の長さは、固定的にも設定される。さらに他の例によれば、第２区間の長さは、キーワード検出性能が最適化されるように、適応的に可変される。例えば、マイク１５１が出力するオーディオ信号が、「四葉のクローバー」であり、候補キーワードが「クローバー」である場合、第２オーディオデータは、「四葉の」という音声に対応する。 The speaker feature vector extraction unit 123 reads the second audio data corresponding to the second section from the memory 110 in the audio stream data temporarily stored in the memory 110. The second section is a section before the first section, and the end point of the second section is the same as the start point of the first section. The second section is the previous section. The length of the second section is also variably set depending on the keyword corresponding to the detected candidate keyword. According to another example, the length of the second section is also fixedly set. According to another example, the length of the second interval is adaptively varied so that the keyword detection performance is optimized. For example, when the audio signal output from the microphone 151 is “four-leaf clover” and the candidate keyword is “clover”, the second audio data corresponds to the voice “four-leaf”.
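The relationship between the two sections can be sketched with simple index arithmetic. The example assumes sections are expressed as sample (or frame) offsets into the buffered stream with an exclusive end index; the function name `previous_section` and the clamping at the start of the buffer are assumptions for illustration.

```python
from typing import Tuple


def previous_section(first_start: int, length: int) -> Tuple[int, int]:
    """Return (start, end) of the section immediately before the first
    section. The end is exclusive and equals the first section's start.
    The start is clamped to 0 when the buffer holds less than `length`
    of earlier audio.
    """
    start = max(0, first_start - length)
    return (start, first_start)


# Candidate keyword starts at offset 16000; look back 8000 samples.
second = previous_section(first_start=16000, length=8000)
```

A fixed `length` corresponds to the fixed-length variant described above; a keyword-dependent or adaptive variant would compute `length` before calling the function.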

話者特徴ベクトル抽出部１２３は、第１区間に該当する第１オーディオデータの第１話者特徴ベクトルと、第２区間に該当する第２オーディオデータの第２話者特徴ベクトルと、を抽出する。話者特徴ベクトル抽出部１２３は、話者認識にロバストな話者特徴ベクトルをオーディオデータから抽出することができる。話者特徴ベクトル抽出部１２３は、時間ドメイン（time domain）の音声信号を、周波数ドメイン（frequency domain）の信号に変換し、変換された信号の周波数エネルギーを、互いに異なるように変形することにより、話者特徴ベクトルを抽出することができる。例えば、該話者特徴ベクトルは、メル周波数ケプストラム係数（ＭＦＣＣ）またはフィルタバンクエネルギーを基に抽出される、それらに限定されるものではなく、多様な方式で、オーディオデータから話者特徴ベクトルを抽出することができる。 The speaker feature vector extraction unit 123 extracts a first speaker feature vector from the first audio data corresponding to the first section and a second speaker feature vector from the second audio data corresponding to the second section. The speaker feature vector extraction unit 123 can extract from audio data a speaker feature vector that is robust for speaker recognition. The speaker feature vector extraction unit 123 can extract a speaker feature vector by converting a time-domain voice signal into a frequency-domain signal and transforming the frequency energies of the converted signal in mutually different ways. For example, the speaker feature vector may be extracted based on Mel-frequency cepstral coefficients (MFCC) or filter bank energies, but the extraction is not limited to these, and the speaker feature vector can be extracted from the audio data in various ways.

話者特徴ベクトル抽出部１２３は、一般的には、スリープモードで動作することができる。キーワード検出部１２２は、オーディオストリームデータから候補キーワードを検出すると、話者特徴ベクトル抽出部１２３をウェークアップさせることができる。キーワード検出部１２２は、オーディオストリームデータから候補キーワードを検出すると、話者特徴ベクトル抽出部１２３にウェークアップ信号を送信することができる。話者特徴ベクトル抽出部１２３は、キーワード検出部１２２において、候補キーワードが検出されたということを示すウェークアップ信号に応答してウェークアップされる。 The speaker feature vector extraction unit 123 can normally operate in a sleep mode. When the keyword detection unit 122 detects a candidate keyword in the audio stream data, it can wake up the speaker feature vector extraction unit 123. To do so, the keyword detection unit 122 can transmit a wake-up signal to the speaker feature vector extraction unit 123 upon detecting the candidate keyword. The speaker feature vector extraction unit 123 wakes up in response to the wake-up signal, which indicates that the keyword detection unit 122 has detected a candidate keyword.

一実施形態によれば、話者特徴ベクトル抽出部１２３は、オーディオデータの各フレームごとに、フレーム特徴ベクトルを抽出し、抽出されたフレーム特徴ベクトルを正規化及び平均化し、オーディオデータを代表する話者特徴ベクトルを抽出することができる。抽出されたフレーム特徴ベクトルの正規化に、Ｌ２正規化が使用される。抽出されたフレーム特徴ベクトルの平均化は、オーディオデータ内の全フレームそれぞれに対して抽出されたフレーム特徴ベクトルを正規化して生成される正規化されたフレーム特徴ベクトルの平均を算出することによって達成される。 According to an embodiment, the speaker feature vector extraction unit 123 can extract a frame feature vector for each frame of the audio data, then normalize and average the extracted frame feature vectors to obtain a speaker feature vector that represents the audio data. L2 normalization is used to normalize the extracted frame feature vectors. The averaging is performed by computing the mean of the normalized frame feature vectors obtained by normalizing the frame feature vector extracted for each of the frames in the audio data.

例えば、話者特徴ベクトル抽出部１２３は、第１オーディオデータの各フレームごとに、第１フレーム特徴ベクトルを抽出し、抽出された第１フレーム特徴ベクトルを正規化及び平均化し、第１オーディオデータを代表する前記第１話者特徴ベクトルを抽出することができる。また、話者特徴ベクトル抽出部１２３は、第２オーディオデータの各フレームごとに、第２フレーム特徴ベクトルを抽出し、抽出された第２フレーム特徴ベクトルを正規化及び平均化し、第２オーディオデータを代表する第２話者特徴ベクトルを抽出することができる。 For example, the speaker feature vector extraction unit 123 can extract a first frame feature vector for each frame of the first audio data, then normalize and average the extracted first frame feature vectors to obtain the first speaker feature vector, which represents the first audio data. Likewise, the speaker feature vector extraction unit 123 can extract a second frame feature vector for each frame of the second audio data, then normalize and average the extracted second frame feature vectors to obtain the second speaker feature vector, which represents the second audio data.
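The normalize-then-average step can be written out in a few lines. This sketch assumes the frame feature vectors have already been extracted and are given as lists of floats of equal length; how they are extracted is outside its scope, and the names `l2_normalize` and `speaker_vector` are hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length (zero vectors are kept as is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def speaker_vector(frame_vectors: Sequence[Sequence[float]]) -> List[float]:
    """L2-normalize each frame feature vector, then average them
    element-wise to get one vector representing the audio data."""
    if not frame_vectors:
        raise ValueError("need at least one frame feature vector")
    normalized = [l2_normalize(v) for v in frame_vectors]
    dim = len(normalized[0])
    count = len(normalized)
    return [sum(v[i] for v in normalized) / count for i in range(dim)]


# Two hypothetical 2-dimensional frame feature vectors.
vec = speaker_vector([[3.0, 4.0], [0.0, 2.0]])
```

Normalizing before averaging keeps loud frames from dominating the result, since every frame contributes a unit-length vector.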

他の実施形態によれば、話者特徴ベクトル抽出部１２３は、オーディオデータ内の全フレームについて、フレーム特徴ベクトルをそれぞれ抽出するのではなく、オーディオデータ内の一部フレームについて、フレーム特徴ベクトルをそれぞれ抽出することができる。前記一部フレームは、当該フレームのオーディオデータが、ユーザの音声データである可能性が高いフレームにおいて、音声フレームとして選択される。かような音声フレームの選択は、キーワード検出部１２２によってなされる。キーワード検出部１２２は、オーディオストリームデータの各フレームごとに、人音声である第１確率と、背景音である第２確率とを計算することができる。キーワード検出部１２２は、各フレームのオーディオデータが人音声である第１確率が、背景音である第２確率より、既設定基準値を超えて高いフレームを、音声フレームと決定することができる。キーワード検出部１２２は、当該フレームが、音声フレームであるか否かということを示すフラグまたはビットをオーディオストリームデータの各フレームに関連づけてメモリ１１０に保存することができる。 According to another embodiment, the speaker feature vector extraction unit 123 can extract frame feature vectors for only some of the frames in the audio data rather than for all of them. These frames are selected as voice frames from among the frames whose audio data is likely to be the user's voice data. The voice frames are selected by the keyword detection unit 122. For each frame of the audio stream data, the keyword detection unit 122 can calculate a first probability that the frame is human voice and a second probability that it is background sound. The keyword detection unit 122 can determine as a voice frame any frame whose first probability exceeds its second probability by more than a preset reference value. The keyword detection unit 122 can store in the memory 110 a flag or bit indicating whether each frame is a voice frame, in association with that frame of the audio stream data.

話者特徴ベクトル抽出部１２３は、第１オーディオデータ及び第２オーディオデータをメモリ１１０から読み取るとき、フラグまたはビットを共に読み取ることにより、当該フレームが音声フレームであるか否かということを知ることができる。 When the speaker feature vector extraction unit 123 reads the first audio data and the second audio data from the memory 110, it also reads the associated flag or bit and can thereby tell whether each frame is a voice frame.

話者特徴ベクトル抽出部１２３は、オーディオデータ内のフレーム中、音声フレームと決定されたフレームそれぞれについてフレーム特徴ベクトルを抽出し、抽出されたフレーム特徴ベクトルを正規化及び平均化し、オーディオデータを代表する話者特徴ベクトルを抽出することができる。例えば、話者特徴ベクトル抽出部１２３は、第１オーディオデータ内のフレーム中、音声フレームと決定されたフレームそれぞれについて、第１フレーム特徴ベクトルを抽出し、抽出された第１フレーム特徴ベクトルを正規化及び平均化し、第１オーディオデータを代表する前記第１話者特徴ベクトルを抽出することができる。また、話者特徴ベクトル抽出部１２３は、第２オーディオデータ内のフレーム中、音声フレームと決定されたフレームそれぞれについて、第２フレーム特徴ベクトルを抽出し、抽出された第２フレーム特徴ベクトルを正規化及び平均化し、第２オーディオデータを代表する第２話者特徴ベクトルを抽出することができる。 The speaker feature vector extraction unit 123 can extract a frame feature vector for each frame in the audio data that was determined to be a voice frame, then normalize and average the extracted frame feature vectors to obtain a speaker feature vector that represents the audio data. For example, the speaker feature vector extraction unit 123 can extract a first frame feature vector for each frame in the first audio data that was determined to be a voice frame, then normalize and average the extracted first frame feature vectors to obtain the first speaker feature vector, which represents the first audio data. Likewise, the speaker feature vector extraction unit 123 can extract a second frame feature vector for each frame in the second audio data that was determined to be a voice frame, then normalize and average the extracted second frame feature vectors to obtain the second speaker feature vector, which represents the second audio data.

ウェークアップ判断部１２４は、話者特徴ベクトル抽出部１２３で抽出された第１話者特徴ベクトルと第２話者特徴ベクトルとの類似度を基に、第１オーディオデータに当該キーワードが含まれていたか否かということ、すなわち、第１区間のオーディオ信号に当該キーワードが含まれていたか否かということを判断する。ウェークアップ判断部１２４は、第１話者特徴ベクトルと第２話者特徴ベクトルとの類似度を、既設定基準値と比較し、類似度が基準値以下である場合、第１区間の第１オーディオデータに当該キーワードが含まれていると判断することができる。 Whether the wake-up determination unit 124 includes the keyword in the first audio data based on the similarity between the first speaker feature vector and the second speaker feature vector extracted by the speaker feature vector extraction unit 123. Whether or not the keyword is included in the audio signal of the first section is determined. The wake-up determination unit 124 compares the similarity between the first speaker feature vector and the second speaker feature vector with a preset reference value, and when the similarity is equal to or less than the reference value, the first audio in the first section It can be determined that the keyword is included in the data.

音声制御装置１００がキーワードを誤認識する代表的な場合は、ユーザの音声中に、キーワードと類似した発音の単語が、音声中間に位置する場合である。例えば、キーワードが「クローバ」である場合、ユーザが他者に「四葉のクローバーをどうやって見つけられるの」という場合にも、音声制御装置１００は、「クローバー」に反応してウェークアップされ、ユーザが意図していない動作を遂行してしまうかもしれない。さらには、テレビニュースにおいてアナウンサーが、「ＪＮグローバルの時価総額は、…」という場合にも、音声制御装置１００は、「グローバル」に反応してウェークアップされてしまうかもしれない。そのようなキーワードの誤認識が発生してしまうことを防止するために、一実施形態によれば、キーワードと類似した発音の単語は、音声の最も先に位置する場合にのみ音声制御装置１００が反応する。また、周辺背景騒音が多い環境や、他の人々が話し合っている環境では、ユーザがキーワードに該当する音声を最も先に発声しても、周辺背景騒音や、他の人々の対話により、ユーザがキーワードに該当する音声を最も先に発声したということが感知されないこともある。一実施形態によれば、音声制御装置１００は、候補キーワードが検出された区間の第１話者特徴ベクトルと、以前区間の第２話者特徴ベクトルとを抽出し、第１話者特徴ベクトルと第２話者特徴ベクトルとが互いに異なる場合には、ユーザがキーワードに該当する音声を最も先に発声したと判断することができる。 A typical case in which the voice control device 100 misrecognizes a keyword is when a word pronounced similarly to the keyword appears in the middle of the user's speech. For example, when the keyword is “clover”, the voice control device 100 may wake up in response to “clover” and perform an operation the user did not intend even when the user merely asks another person, “How can I find a four-leaf clover?” Likewise, when a television news announcer says, “The market capitalization of JN Global is ...”, the voice control device 100 may wake up in response to “global”. To prevent such keyword misrecognition, according to an embodiment the voice control device 100 reacts to a word pronounced similarly to the keyword only when that word is at the very beginning of an utterance. In addition, in an environment with heavy background noise or where other people are talking, even if the user utters the keyword at the very beginning of speech, the background noise or the other people's conversation may prevent this from being sensed. According to an embodiment, the voice control device 100 extracts a first speaker feature vector of the section in which the candidate keyword was detected and a second speaker feature vector of the preceding section, and when the first and second speaker feature vectors differ from each other, it can determine that the user uttered the keyword at the very beginning of speech.

かような判断のために、ウェークアップ判断部１２４は、第１話者特徴ベクトルと第２話者特徴ベクトルとの類似度が既設定基準値以下である場合には、ユーザがキーワードに該当する音声を最も先に発声したと判断することができる。すなわち、ウェークアップ判断部１２４は、第１区間の第１オーディオデータに当該キーワードが含まれていると判断することができ、音声制御装置１００の一部機能をウェークアップさせることができる。第１話者特徴ベクトルと第２話者特徴ベクトルとの類似度が高いということは、第１オーディオデータに対応する音声を放った者と、第２オーディオデータに対応する音声を放った者とが同一である可能性が高いというのである。 For this determination, when the similarity between the first speaker feature vector and the second speaker feature vector is equal to or less than a preset reference value, the wake-up determination unit 124 can determine that the user uttered the keyword at the very beginning of speech. That is, the wake-up determination unit 124 can determine that the keyword is included in the first audio data of the first section and can wake up some functions of the voice control device 100. A high similarity between the first and second speaker feature vectors means that the person who produced the sound corresponding to the first audio data and the person who produced the sound corresponding to the second audio data are likely to be the same person.
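The decision rule above can be sketched as a similarity test against a reference value. The source does not name a similarity measure, so cosine similarity is used here purely as an assumption; `cosine_similarity`, `should_wake_up`, and `threshold` are hypothetical names.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def should_wake_up(
    first_vec: Sequence[float], second_vec: Sequence[float], threshold: float
) -> bool:
    """Wake up only when the keyword section and the preceding section
    look dissimilar, i.e. similarity is at or below the reference value."""
    return cosine_similarity(first_vec, second_vec) <= threshold


# Dissimilar sections (e.g. silence before the keyword): wake up.
wake_a = should_wake_up([1.0, 0.0], [0.0, 1.0], threshold=0.5)
# Very similar sections (same speaker kept talking): do not wake up.
wake_b = should_wake_up([1.0, 0.0], [0.9, 0.1], threshold=0.5)
```

Note the direction of the test: a low similarity suggests the keyword opened the utterance, while a high similarity suggests the same speaker was already talking and the candidate keyword sat mid-sentence.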

第２オーディオデータが黙音に該当する場合、話者特徴ベクトル抽出部１２３は、第２オーディオデータから、黙音に該当する第２話者特徴ベクトルを抽出することができる。話者特徴ベクトル抽出部１２３は、第１オーディオデータから、ユーザの音声に該当する第１話者特徴ベクトルを抽出するので、第１話者特徴ベクトルと第２話者特徴ベクトルとの類似度は、低い。 When the second audio data corresponds to silence, the speaker feature vector extraction unit 123 can extract from the second audio data a second speaker feature vector that corresponds to silence. Because the speaker feature vector extraction unit 123 extracts from the first audio data a first speaker feature vector that corresponds to the user's voice, the similarity between the first and second speaker feature vectors is low.

音声認識部１２５は、オーディオ処理部１２１で生成されたオーディオストリームデータにおいて第３区間に該当する第３オーディオデータを受信し、第３オーディオデータを音声認識することができる。他の例によれば、音声認識部１２５は、第３オーディオデータが、外部（例えば、サーバ２００）で音声認識されるように、第３オーディオデータを外部に伝送し、音声認識結果を受信することができる。 The voice recognition unit 125 can receive the third audio data corresponding to the third section in the audio stream data generated by the audio processing unit 121, and can recognize the third audio data. According to another example, the voice recognition unit 125 transmits the third audio data to the outside and receives the voice recognition result so that the third audio data is recognized by the outside (for example, the server 200). be able to.

機能部１２６は、キーワードに対応する機能を遂行することができる。例えば、音声制御装置１００がスマートスピーカである場合、機能部１２６は、音楽再生部、音声情報提供部、周辺機器制御部などを含み、検出されたキーワードに対応する機能を遂行することができる。音声制御装置１００がスマートフォンである場合、機能部１２６は、電話連結部、文字送受信部、インターネット検索部などを含み、検出されたキーワードに対応する機能を遂行することができる。機能部１２６は、音声制御装置１００の種類によって多様に構成される。機能部１２６は、音声制御装置１００が行うことができる多様な機能を遂行するための機能ブロックを包括的に示したものである。 The function unit 126 can perform a function corresponding to the keyword. For example, when the voice control device 100 is a smart speaker, the function unit 126 includes a music playback unit, a voice information providing unit, a peripheral device control unit, and the like, and can perform a function corresponding to the detected keyword. When the voice control device 100 is a smartphone, the function unit 126 includes a telephone connection unit, a character transmission / reception unit, an Internet search unit, and the like, and can perform a function corresponding to the detected keyword. The functional unit 126 is variously configured depending on the type of the voice control device 100. The functional unit 126 comprehensively shows functional blocks for performing various functions that the voice control apparatus 100 can perform.

図３に図示された音声制御装置１００は、音声認識部１２５を含むように図示されているが、それは例示的なものであり、音声制御装置１００は、音声認識部１２５を含まず、図２に図示されたサーバ２００が、音声認識機能を代わりに遂行することができる。その場合、図１に図示されているように、音声制御装置１００は、ネットワーク３００を介して、音声認識機能を遂行するサーバ２００に接続される。音声制御装置１００は、音声認識が必要な音声信号を含む音声ファイルをサーバ２００に提供することができ、サーバ２００は、音声ファイル内の音声信号に対して音声認識を行い、音声信号に対応する文字列を生成することができる。サーバ２００は、生成された文字列を、ネットワーク３００を介して、音声制御装置１００に送信することができる。しかし、以下では、音声制御装置１００が音声認識機能を遂行する音声認識部１２５を含むと仮定して説明する。 Although the voice control device 100 illustrated in FIG. 3 is shown as including the voice recognition unit 125, this is only an example. The voice control device 100 may omit the voice recognition unit 125, and the server 200 illustrated in FIG. 2 may perform the voice recognition function instead. In that case, as illustrated in FIG. 1, the voice control device 100 is connected via the network 300 to the server 200 that performs the voice recognition function. The voice control device 100 can provide the server 200 with a voice file containing a voice signal that requires voice recognition, and the server 200 can perform voice recognition on the voice signal in the voice file and generate a character string corresponding to the voice signal. The server 200 can transmit the generated character string to the voice control device 100 via the network 300. The following description, however, assumes that the voice control device 100 includes the voice recognition unit 125 that performs the voice recognition function.

プロセッサ１２０は、動作方法のためのプログラムファイルに保存されたプログラムコードをメモリ１１０にローディングすることができる。例えば、音声制御装置１００には、プログラムファイルによって、プログラムがインストール（install）される。そのとき、音声制御装置１００にインストールされたプログラムが実行される場合、プロセッサ１２０は、プログラムコードをメモリ１１０にローディングすることができる。そのとき、プロセッサ１２０が含むオーディオ処理部１２１、キーワード検出部１２２、話者特徴ベクトル抽出部１２３、ウェークアップ判断部１２４、音声認識部１２５及び機能部１２６のうち少なくとも一部のそれぞれは、メモリ１１０にローディングされたプログラムコードのうち対応するコードによる命令を実行し、図４の段階（Ｓ１１０ないしＳ１９０）を実行するようにも具現化される。 The processor 120 can load the program code stored in the program file for the operation method into the memory 110. For example, a program is installed in the voice control device 100 by a program file. At that time, when the program installed in the voice control device 100 is executed, the processor 120 can load the program code into the memory 110. At that time, at least some of the audio processing unit 121, the keyword detection unit 122, the speaker feature vector extraction unit 123, the wakeup determination unit 124, the speech recognition unit 125, and the function unit 126 included in the processor 120 are stored in the memory 110. The present invention may be realized by executing an instruction using a corresponding code among the loaded program codes and executing the steps (S110 to S190) of FIG.

Hereinafter, a statement that the functional blocks 121 to 126 of the processor 120 control the voice control device 100 is to be understood as meaning that the processor 120 controls the other components of the voice control device 100. For example, the processor 120 can control the communication module 130 included in the voice control device 100 so that the voice control device 100 communicates with, for example, the server 200.

In step (S110), the processor 120, for example the audio processing unit 121, receives an audio signal corresponding to ambient sound. The audio processing unit 121 can receive this audio signal continuously. The audio signal may be an electrical signal that an input device such as the microphone 151 generates in response to the ambient sound.

In step (S120), the processor 120, for example the audio processing unit 121, generates audio stream data based on the audio signal from the microphone 151. The audio stream data corresponds to the continuously received audio signal and may be data generated by filtering and digitizing the audio signal.

In step (S130), the processor 120, for example the audio processing unit 121, temporarily stores the audio stream data generated in step (S120) in the memory 110. Because the memory 110 has a limited size, only the portion of the audio stream data that corresponds to the audio signal from the most recent fixed period up to the present is temporarily stored in the memory 110. When new audio stream data is generated, the oldest audio stream data stored in the memory 110 is deleted, and the new audio stream data is stored in the space freed by the deletion.
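The buffering behavior of step (S130) can be sketched as a fixed-capacity ring buffer. The following is a minimal illustration, not the disclosed implementation: the class name `AudioRingBuffer`, its methods, and the use of integer samples are assumptions made for the example.

```python
from collections import deque


class AudioRingBuffer:
    """Keeps only the most recent `capacity` samples of the audio stream.

    Hypothetical helper for illustration; the source does not name such a class.
    """

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # A deque with maxlen silently drops the oldest item on overflow.
        self._samples = deque(maxlen=capacity)

    def write(self, chunk) -> None:
        """Append new samples; the oldest ones are discarded when full."""
        self._samples.extend(chunk)

    def snapshot(self) -> list:
        """Return the buffered samples, oldest first."""
        return list(self._samples)


buf = AudioRingBuffer(capacity=5)
buf.write([1, 2, 3])
buf.write([4, 5, 6, 7])
print(buf.snapshot())  # [3, 4, 5, 6, 7]
```

A bounded deque keeps the overwrite logic implicit: writing past the capacity removes the oldest samples, which matches the description of freeing space for new audio stream data.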

In step (S140), the processor 120, for example the keyword detection unit 122, detects a candidate keyword corresponding to a predefined keyword from the audio stream data generated in step (S120). A candidate keyword is a word whose pronunciation is similar to that of the predefined keyword, that is, a word that the keyword detection unit 122 detects as the keyword in step (S140).

In step (S150), the processor 120, for example the keyword detection unit 122, identifies the keyword detection section in which the candidate keyword was detected in the audio stream data and determines the start point and end point of that section. The keyword detection section is referred to as the current section, and the portion of the audio stream data corresponding to the current section is referred to as the first audio data.

In step (S160), the processor 120, for example the speaker feature vector extraction unit 123, reads second audio data corresponding to a previous section from the memory 110. The previous section is the section immediately preceding the current section, and the end point of the previous section may coincide with the start point of the current section. The speaker feature vector extraction unit 123 can also read the first audio data from the memory 110.

In step (S170), the processor 120, for example the speaker feature vector extraction unit 123, extracts a first speaker feature vector from the first audio data and a second speaker feature vector from the second audio data. The first speaker feature vector is an index for identifying the speaker of the voice corresponding to the first audio data, and the second speaker feature vector is an index for identifying the speaker of the voice corresponding to the second audio data. The processor 120, for example the wake-up determination unit 124, can determine whether the first audio data contains the keyword based on the similarity between the first speaker feature vector and the second speaker feature vector. When the wake-up determination unit 124 determines that the first audio data contains the keyword, it can wake up some components of the voice control device 100.

In step (S180), the processor 120, for example the wake-up determination unit 124, compares the similarity between the first speaker feature vector and the second speaker feature vector with a preset reference value.

When the similarity between the first speaker feature vector and the second speaker feature vector is less than or equal to the preset reference value, the speaker of the first audio data in the current section differs from the speaker of the second audio data in the previous section, so the wake-up determination unit 124 can determine that the first audio data contains the keyword. In that case, as in step (S190), the processor 120, for example the wake-up determination unit 124, can wake up some components of the voice control device 100.

However, when the similarity between the first speaker feature vector and the second speaker feature vector is higher than the preset reference value, the speaker of the first audio data in the current section is the same as the speaker of the second audio data in the previous section, so the wake-up determination unit 124 determines that the first audio data does not contain the keyword and does not proceed with the wake-up. In that case, the process returns to step (S110) and receives an audio signal corresponding to ambient sound. Reception of the audio signal in step (S110) continues even while steps (S120) to (S190) are being performed.
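The comparison in steps (S180) and (S190) can be sketched as follows. The source does not state which similarity measure or reference value is used, so cosine similarity and the default threshold of 0.5 are assumptions chosen only for illustration, as are the function names.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine similarity of two equal-length vectors (assumed measure).

    Returns 0.0 when either vector is all zeros.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def should_wake_up(first_vec, second_vec, threshold: float = 0.5) -> bool:
    """Wake up only if the similarity is at or below the reference value.

    A low similarity means the previous section came from a different
    speaker (or from silence), so the candidate is treated as a real keyword.
    """
    return cosine_similarity(first_vec, second_vec) <= threshold


# Toy 2-D vectors standing in for speaker feature vectors.
print(should_wake_up([1.0, 0.0], [0.0, 1.0]))  # True: dissimilar speakers
print(should_wake_up([1.0, 0.0], [1.0, 0.0]))  # False: same speaker
```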

The keyword storage 110a of FIG. 3 stores a plurality of predefined keywords. Each such keyword may be a wake-up keyword or a single command keyword. A wake-up keyword is used to wake up some functions of the voice control device 100. Typically, the user utters the wake-up keyword and then utters the desired natural language voice command. The voice control device 100 can perform voice recognition on the natural language voice command and carry out the operation or function corresponding to it.

A single command keyword causes the voice control device 100 to directly perform a specific operation or function, and may be a simple predefined word such as "play" or "stop". When a single command keyword is received, the voice control device 100 can wake up the function corresponding to the single command keyword and perform that function.
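One way to picture the keyword storage and its two keyword categories is a simple lookup table. This is only a sketch: the dictionary layout, the category labels, and the function name `keyword_kind` are assumptions, and the entries are example keywords rather than a definitive list.

```python
from typing import Optional

# Illustrative keyword store; layout and category labels are assumptions.
KEYWORD_STORE = {
    "hi computer": "wake_up",
    "play": "single_command",
    "stop": "single_command",
}


def keyword_kind(candidate: str) -> Optional[str]:
    """Return the category of a candidate keyword, or None if it is not stored."""
    return KEYWORD_STORE.get(candidate.strip().lower())


print(keyword_kind("Stop"))         # single_command
print(keyword_kind("hi computer"))  # wake_up
print(keyword_kind("hello"))        # None
```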

The following describes, in turn, the case where a candidate keyword corresponding to a single command keyword is detected from the audio stream data and the case where a candidate keyword corresponding to a wake-up keyword is detected from the audio stream data.

FIG. 5 is a flowchart illustrating an example of an operation method that the voice control device can perform according to another embodiment.

FIG. 6A illustrates an example in which a single command keyword is uttered while the voice control device according to an embodiment executes the operation method of FIG. 5, and FIG. 6B illustrates an example in which ordinary conversational speech is uttered while the voice control device according to an embodiment executes the operation method of FIG. 5.

The operation method of FIG. 5 includes steps that are substantially the same as those of the operation method of FIG. 4, and those steps are not described in detail again. FIGS. 6A and 6B show an audio signal corresponding to the audio stream data together with the user's speech corresponding to that audio signal. FIG. 6A shows an audio signal corresponding to the utterance "stop" (中止, "chūshi"), and FIG. 6B shows an audio signal corresponding to the utterance "halt here" (ここで停止して, "koko de teishi shite").

Referring to FIG. 5 together with FIGS. 6A and 6B, in step (S210), the processor 120, for example the audio processing unit 121, receives an audio signal corresponding to ambient sound.

In step (S220), the processor 120, for example the audio processing unit 121, generates audio stream data based on the audio signal from the microphone 151.

In step (S230), the processor 120, for example the audio processing unit 121, temporarily stores the audio stream data generated in step (S220) in the memory 110.

In step (S240), the processor 120, for example the keyword detection unit 122, detects a candidate keyword corresponding to a predefined single command keyword from the audio stream data generated in step (S220). A single command keyword may be a voice keyword that can directly control the operation of the voice control device 100. For example, the single command keyword may be a word such as "stop" (中止), as shown in FIG. 6A. In that case, the voice control device 100 may be, for example, playing music or a video.

In the example of FIG. 6A, the keyword detection unit 122 can detect the candidate keyword "stop" (中止, "chūshi") from the audio signal. In the example of FIG. 6B, the keyword detection unit 122 can detect from the audio signal the candidate keyword "halt" (停止, "teishi"), a word whose pronunciation is similar to that of the keyword "stop" (中止).

In step (S250), the processor 120, for example the keyword detection unit 122, identifies the keyword detection section in which the candidate keyword was detected in the audio stream data and determines the start point and end point of that section. The keyword detection section is referred to as the current section, and the portion of the audio stream data corresponding to the current section is referred to as the first audio data.

In the example of FIG. 6A, the keyword detection unit 122 can identify the section in which the candidate keyword "stop" (中止) was detected as the current section and determine its start point and end point. The audio data corresponding to the current section is the first audio data AD1.

In the example of FIG. 6B, the keyword detection unit 122 can identify the section in which the candidate keyword "halt" (停止) was detected as the current section and determine its start point and end point. The audio data corresponding to the current section is the first audio data AD1.

Also in step (S250), the processor 120, for example the keyword detection unit 122, can determine whether the detected candidate keyword corresponds to a wake-up keyword or to a single command keyword. In the examples of FIGS. 6A and 6B, the keyword detection unit 122 can determine that the detected candidate keywords, namely "stop" (中止) and "halt" (停止), are candidate keywords corresponding to a single command keyword.

In step (S260), the processor 120, for example the speaker feature vector extraction unit 123, reads second audio data corresponding to a previous section from the memory 110. The previous section is the section immediately preceding the current section, and the end point of the previous section may coincide with the start point of the current section. The speaker feature vector extraction unit 123 can also read the first audio data from the memory 110.

In the examples of FIGS. 6A and 6B alike, the speaker feature vector extraction unit 123 can read from the memory 110 the second audio data AD2 corresponding to the previous section, which immediately precedes the current section. In the example of FIG. 6B, the second audio data AD2 corresponds to the sound "こで" ("ko de"), the tail of "ここで" ("koko de", "here"). The length of the previous section may be set variably according to the detected candidate keyword.

In step (S270), the processor 120, for example the speaker feature vector extraction unit 123, receives from the audio processing unit 121 third audio data corresponding to a next section that follows the current section. The next section is the section immediately following the current section, and the start point of the next section may coincide with the end point of the current section.

In the examples of FIGS. 6A and 6B alike, the speaker feature vector extraction unit 123 can receive from the audio processing unit 121 the third audio data AD3 corresponding to the next section, which immediately follows the current section. In the example of FIG. 6B, the third audio data AD3 corresponds to the sound "して" ("shite"). The length of the next section may be set variably according to the detected candidate keyword.
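The relationship between the current, previous, and next sections can be sketched with sample indices. In this illustration the `Section` type, the function `adjacent_sections`, and the idea of passing the section lengths as sample counts are assumptions; the source only states that the sections adjoin the current section and that their lengths may vary with the candidate keyword.

```python
from typing import NamedTuple, Tuple


class Section(NamedTuple):
    start: int  # inclusive sample index
    end: int    # exclusive sample index


def adjacent_sections(
    current: Section, prev_len: int, next_len: int
) -> Tuple[Section, Section]:
    """Return the (previous, next) sections that adjoin `current`.

    The previous section ends where the current one starts, and the next
    section starts where the current one ends. The previous section is
    clamped at index 0 in case the keyword occurs near the stream start.
    """
    prev = Section(max(0, current.start - prev_len), current.start)
    nxt = Section(current.end, current.end + next_len)
    return prev, nxt


# Hypothetical indices: keyword detected in samples 100..180.
prev, nxt = adjacent_sections(Section(100, 180), prev_len=50, next_len=40)
print(prev)  # Section(start=50, end=100)
print(nxt)   # Section(start=180, end=220)
```

Note that the previous section can be cut from already buffered data, whereas the next section only becomes available once the following samples have been received.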

In step (S280), the processor 120, for example the speaker feature vector extraction unit 123, extracts first to third speaker feature vectors from the first to third audio data, respectively. Each of the first to third speaker feature vectors is an index for identifying the speaker of the voice corresponding to the first to third audio data, respectively. The processor 120, for example the wake-up determination unit 124, can determine whether the first audio data contains a single command keyword based on the similarity between the first and second speaker feature vectors and the similarity between the first and third speaker feature vectors. When the wake-up determination unit 124 determines that the first audio data contains a single command keyword, it can wake up some components of the voice control device 100.

In the example of FIG. 6A, the first speaker feature vector corresponding to the first audio data AD1 is an index for identifying the speaker who uttered "stop" (中止). Because the second audio data AD2 and the third audio data AD3 are substantially silent, the second and third speaker feature vectors can be vectors corresponding to silence. Accordingly, the similarity between the first speaker feature vector and each of the second and third speaker feature vectors is low.

As another example, when a person other than the speaker who uttered "stop" (中止) speaks in the previous section and the next section, the second and third speaker feature vectors are vectors corresponding to that other person, so the similarity between the first speaker feature vector and each of the second and third speaker feature vectors is low.

In the example of FIG. 6B, a single person uttered "halt here" (ここで停止して). Accordingly, the first speaker feature vector extracted from the first audio data AD1 corresponding to "停止" ("teishi"), the second speaker feature vector extracted from the second audio data AD2 corresponding to "こで" ("ko de"), and the third speaker feature vector extracted from the third audio data AD3 corresponding to "して" ("shite") are all vectors identifying substantially the same speaker, so the similarity among the first to third speaker feature vectors is high.

In step (S290), the processor 120, for example the wake-up determination unit 124, compares the similarity between the first and second speaker feature vectors with the preset reference value, and compares the similarity between the first and third speaker feature vectors with the preset reference value. When both similarities are less than or equal to the preset reference value, the speaker of the first audio data in the current section differs from both the speaker of the second audio data in the previous section and the speaker of the third audio data in the next section, so the wake-up determination unit 124 can determine that the first audio data contains a single command keyword. In that case, as in step (S300), the processor 120, for example the wake-up determination unit 124, provides the single command keyword to the function unit 126, and the function unit 126 can perform the function corresponding to the single command keyword in response to the determination by the wake-up determination unit 124 that the first audio data contains the single command keyword.

In the example of FIG. 6A, the first speaker feature vector corresponds to the speaker who uttered "stop" (中止), and the second and third speaker feature vectors correspond to silence, so the similarity between the first speaker feature vector and each of the second and third speaker feature vectors is lower than the preset reference value. In that case, the wake-up determination unit 124 can determine that the first audio data AD1 contains the single command keyword "stop". The function unit 126 is then woken up in response to this determination and can perform the operation or function corresponding to the single command keyword "stop". For example, if the voice control device 100 is playing music, the function unit 126 can stop the music playback in response to the single command keyword "stop".

However, when the similarity between the first and second speaker feature vectors is higher than the preset reference value, or the similarity between the first and third speaker feature vectors is higher than the preset reference value, the speaker of the first audio data in the current section is the same as the speaker of the second audio data in the previous section or the speaker of the third audio data in the next section, so the wake-up determination unit 124 determines that the first audio data does not contain the keyword and does not proceed with the wake-up. In that case, the process returns to step (S210) and receives an audio signal corresponding to ambient sound.

In the example of FIG. 6B, a single person uttered "halt here" (ここで停止して), so the similarity among the first to third speaker feature vectors is high. Because the utterance "halt here" in FIG. 6B contains no keyword for controlling or waking up the voice control device, the wake-up determination unit 124 determines that the first audio data AD1 does not contain a single command keyword and prevents the function unit 126 from performing the function or operation corresponding to "halt" (停止) or "stop" (中止).
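The two-sided check of steps (S290) and (S300) can be summarized in a few lines. This sketch takes already computed similarity values as input; the function name and the numeric values are hypothetical and serve only to mirror the FIG. 6A and FIG. 6B cases.

```python
def contains_single_command(
    sim_prev: float, sim_next: float, threshold: float
) -> bool:
    """Accept the candidate only if it is dissimilar to BOTH neighbours.

    sim_prev: similarity between the first and second speaker feature vectors
    sim_next: similarity between the first and third speaker feature vectors
    """
    return sim_prev <= threshold and sim_next <= threshold


# Hypothetical similarity values, for illustration only.
# FIG. 6A-like case: keyword surrounded by silence, so both values are low.
print(contains_single_command(0.1, 0.2, threshold=0.5))  # True
# FIG. 6B-like case: one speaker throughout, so both values are high.
print(contains_single_command(0.9, 0.8, threshold=0.5))  # False
```

Requiring both comparisons to pass means that a match on either side (the same speaker before or after the candidate) is enough to reject the candidate as part of ordinary speech.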

With conventional technology, a voice control device could detect the word "halt" (停止) within the utterance "halt here" (ここで停止して) and perform the corresponding function or operation. Such a function or operation is not what the user intended, and the user finds the voice control device inconvenient to use. According to an embodiment, however, the voice control device 100 can accurately recognize a single command keyword in the user's speech and therefore, unlike conventional technology, does not malfunction in this way.

FIG. 7 is a flowchart illustrating an example of an operation method that the voice control device can perform according to yet another embodiment.

FIG. 8A illustrates an example in which a wake-up keyword and a natural language voice command are uttered while the voice control device according to an embodiment executes the operation method of FIG. 7, and FIG. 8B illustrates an example in which ordinary conversational speech is uttered while the voice control device according to an embodiment executes the operation method of FIG. 7.

The operation method of FIG. 7 includes steps that are substantially the same as those of the operation method of FIG. 4, and those steps are not described in detail again. FIGS. 8A and 8B show an audio signal corresponding to the audio stream data together with the user's speech corresponding to that audio signal. FIG. 8A shows an audio signal corresponding to the wake-up keyword "Clova" (クローバ, "kurōba") and the natural language voice command "Tell me tomorrow's weather", and FIG. 8B shows an audio signal corresponding to the conversational utterance "How can I find a four-leaf clover?" (四葉のクローバーをどうやって見つけられるの).

Referring to FIG. 7 together with FIGS. 8A and 8B, in step (S410), the processor 120, for example the audio processing unit 121, receives an audio signal corresponding to ambient sound. In step (S420), the processor 120, for example the audio processing unit 121, generates audio stream data based on the audio signal from the microphone 151. In step (S430), the processor 120, for example the audio processing unit 121, temporarily stores the audio stream data generated in step (S420) in the memory 110.

In step (S440), the processor 120, for example the keyword detection unit 122, detects a candidate keyword corresponding to a predefined wake-up keyword from the audio stream data generated in step (S420). A wake-up keyword is a voice-based keyword that can switch the voice control device from a sleep mode to a wake-up mode. For example, the wake-up keyword may be a voice keyword such as "Clova" (クローバ) or "Hi Computer".

In the example of FIG. 8A, the keyword detection unit 122 can detect the candidate keyword "Clova" (クローバ, "kurōba") from the audio signal. In the example of FIG. 8B, the keyword detection unit 122 can detect from the audio signal the candidate keyword "clover" (クローバー, "kurōbā"), a word whose pronunciation is similar to that of the keyword "Clova".

In step (S450), the processor 120, for example the keyword detection unit 122, identifies the keyword detection section in which the candidate keyword was detected in the audio stream data and determines the start point and end point of that section. The keyword detection section is referred to as the current section, and the portion of the audio stream data corresponding to the current section is referred to as the first audio data.

In the example of FIG. 8A, the keyword detection unit 122 can identify the section in which the candidate keyword "Clova" (クローバ) was detected as the current section and determine its start point and end point. In the example of FIG. 8B, the keyword detection unit 122 can identify the section in which the candidate keyword "clover" (クローバー) was detected as the current section and determine its start point and end point. In both examples, the audio data corresponding to the current section is the first audio data AD1.

Also in step (S450), the processor 120, for example the keyword detection unit 122, can determine whether the detected candidate keyword corresponds to a wake-up keyword or to a single command keyword. In the examples of FIGS. 8A and 8B, the keyword detection unit 122 can determine that the detected candidate keywords, namely "Clova" (クローバ) and "clover" (クローバー), are candidate keywords corresponding to a wake-up keyword.

In step S460, the processor 120, for example the speaker feature vector extraction unit 123, reads second audio data corresponding to the previous section from the memory 110. The previous section is the section immediately preceding the current section, and the end point of the previous section may coincide with the start point of the current section. The speaker feature vector extraction unit 123 can also read the first audio data from the memory 110.

In the example of FIG. 8A, the speaker feature vector extraction unit 123 can read from the memory 110 the second audio data AD2 corresponding to the previous section, that is, the section immediately preceding the current section. The same applies in the example of FIG. 8B, where the second audio data AD2 corresponds to the speech “four-leaf” (四葉の). The length of the previous section may be set variably depending on the detected candidate keyword.
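The relationship between the current section and the previous section can be sketched as follows. This is a minimal illustration, assuming the buffered audio stream is a list of samples and that the previous-section length is looked up per candidate keyword. The function name `split_sections`, the length table, and the default length are hypothetical and do not come from the specification.

```python
def split_sections(stream, start, end, keyword, prev_len_by_keyword, default_prev_len):
    """Return (first audio data AD1, second audio data AD2) for a detected candidate keyword.

    stream  : buffered samples of the audio stream (hypothetical layout)
    start   : index of the first sample of the current section
    end     : index one past the last sample of the current section
    keyword : detected candidate keyword, used to pick the previous-section length
    """
    if not 0 <= start <= end <= len(stream):
        raise ValueError("current section must lie inside the buffered stream")
    # Assumption: the previous-section length depends on the candidate keyword.
    prev_len = prev_len_by_keyword.get(keyword, default_prev_len)
    # The previous section ends exactly where the current section starts.
    prev_start = max(0, start - prev_len)
    first_audio = stream[start:end]          # AD1: current section
    second_audio = stream[prev_start:start]  # AD2: previous section
    return first_audio, second_audio


samples = list(range(20))
ad1, ad2 = split_sections(samples, 10, 15, "keyword_a", {"keyword_a": 4}, 6)
print(ad1)  # [10, 11, 12, 13, 14]
print(ad2)  # [6, 7, 8, 9]
```

Clamping `prev_start` at 0 simply returns a shorter previous section when the buffer does not reach far enough back; a real device might handle that case differently.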

In step S470, the processor 120, for example the speaker feature vector extraction unit 123, extracts a first speaker feature vector and a second speaker feature vector from the first audio data and the second audio data, respectively. The processor 120, for example the wakeup determination unit 124, can determine whether the first audio data contains the wakeup keyword based on the similarity between the first speaker feature vector and the second speaker feature vector. When the wakeup determination unit 124 determines that the first audio data contains the wakeup keyword, it can wake up some components of the voice control device 100.

In the example of FIG. 8A, the first speaker feature vector corresponding to the first audio data AD1 is an index for identifying the speaker who uttered “Clova”. Because the second audio data AD2 is substantially silence, the second speaker feature vector is a vector corresponding to silence. The similarity between the first speaker feature vector and the second speaker feature vector is therefore low.

As another example, when someone other than the speaker who uttered “Clova” speaks in the previous section, the second speaker feature vector is a vector corresponding to that other person, so the similarity between the first speaker feature vector and the second speaker feature vector is low.

In the example of FIG. 8B, a single person uttered “How can I find a four-leaf clover?” The first speaker feature vector extracted from the first audio data AD1 corresponding to “clover” and the second speaker feature vector extracted from the second audio data AD2 corresponding to “four-leaf” are therefore both vectors identifying substantially the same speaker, so the similarity between them is high.
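The specification does not state here how the similarity between the two speaker feature vectors is computed. One common choice is cosine similarity, sketched below purely as an assumption; the function name and the example vectors are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up vectors, for illustration only.
keyword_speaker = [0.9, 0.1, 0.2]  # first speaker feature vector
same_speaker = [0.8, 0.2, 0.1]     # previous section spoken by the same person
other_source = [0.1, 0.9, 0.3]     # previous section from another person or background

print(cosine_similarity(keyword_speaker, same_speaker) >
      cosine_similarity(keyword_speaker, other_source))  # True
```

With this measure, a previous section spoken by the same person scores close to 1, while a different speaker scores lower, which matches the high and low similarity cases described for FIGS. 8B and 8A.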

In step S480, the processor 120, for example the wakeup determination unit 124, compares the similarity between the first speaker feature vector and the second speaker feature vector with a preset reference value. When the similarity is higher than the preset reference value, the speaker of the first audio data in the current section and the speaker of the second audio data in the previous section are taken to be the same person, so the wakeup determination unit 124 determines that the first audio data does not contain the keyword and does not proceed with the wakeup. In that case, the process returns to step S410, and the processor 120, for example the audio processing unit 121, receives an audio signal corresponding to the ambient sound.

In the example of FIG. 8B, a single person uttered “four-leaf clover...”, so the similarity between the first speaker feature vector and the second speaker feature vector is high. The wakeup determination unit 124 judges that the person who uttered “four-leaf clover” had no intention of waking up the voice control device 100, determines that the first audio data AD1 does not contain the wakeup keyword, and does not wake up the voice control device 100.

When the similarity between the first speaker feature vector and the second speaker feature vector is equal to or less than the preset reference value, the speaker of the first audio data in the current section and the speaker of the second audio data in the previous section are taken to be different, so the wakeup determination unit 124 can determine that the first audio data contains the keyword. In that case, the wakeup determination unit 124 can wake up some components of the voice control device 100, for example the voice recognition unit 125.

In the example of FIG. 8A, the first speaker feature vector corresponds to the speaker who uttered “Clova” and the second speaker feature vector corresponds to silence, so the similarity between them is lower than the preset reference value. The wakeup determination unit 124 can therefore determine that the first audio data AD1 contains the wakeup keyword “Clova”, and the voice recognition unit 125 is woken up to recognize a natural-language voice command.
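The comparison in step S480 reduces to a single threshold test, sketched below. The function name, the reference value of 0.5, and the similarity values are hypothetical; only the direction of the comparison (equal to or less than the reference value means the keyword is accepted) comes from the description above.

```python
def contains_wakeup_keyword(similarity, reference_value):
    """Return True if the candidate is accepted as the wakeup keyword.

    Accepted when the similarity is equal to or less than the reference
    value (speakers judged different); rejected when it is higher.
    """
    return similarity <= reference_value


REFERENCE_VALUE = 0.5  # hypothetical preset reference value

print(contains_wakeup_keyword(0.1, REFERENCE_VALUE))  # True: silence before the keyword, as in FIG. 8A
print(contains_wakeup_keyword(0.9, REFERENCE_VALUE))  # False: same speaker before the keyword, as in FIG. 8B
```

Note that a similarity exactly equal to the reference value still counts as a detection, since the rejection condition is "higher than" the reference value.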

In step S490, the processor 120, for example the voice recognition unit 125, receives from the audio processing unit 121 third audio data corresponding to the next section following the current section. The next section is the section immediately after the current section, and its start point may coincide with the end point of the current section.

The voice recognition unit 125 can determine the end point of the next section when silence of a preset length is detected in the third audio data. The voice recognition unit 125 can perform speech recognition on the third audio data, using any of various methods. As another example, to obtain a speech recognition result for the third audio data, the voice recognition unit 125 can transmit the third audio data to an external device, for example the server 200 with a speech recognition function shown in FIG. 2. The server 200 receives the third audio data, performs speech recognition on it to generate a character string (text) corresponding to the third audio data, and transmits the generated text to the voice recognition unit 125 as the speech recognition result.

In the example of FIG. 8A, the third audio data of the next section is a natural-language voice command such as “Tell me tomorrow's weather”. The voice recognition unit 125 can either perform speech recognition on the third audio data itself to generate a speech recognition result, or transmit the third audio data to an external device (for example, the server 200) so that it is recognized there.
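Determining the end point of the next section from a run of silence can be sketched as follows. This is an illustrative assumption rather than the specification's method: it supposes that each frame of the third audio data has already been labeled as speech or silence, and it places the end point at the first frame of the silent run. The function name and parameters are hypothetical.

```python
def find_section_end(speech_flags, min_silence_frames):
    """Return the index of the frame where the section ends, or None.

    speech_flags       : per-frame booleans, True for speech, False for silence
    min_silence_frames : number of consecutive silent frames that closes the section
    """
    if min_silence_frames < 1:
        raise ValueError("min_silence_frames must be at least 1")
    silent_run = 0
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            silent_run = 0
        else:
            silent_run += 1
            if silent_run == min_silence_frames:
                # Assumption: the section ends where the silent run began.
                return i - min_silence_frames + 1
    return None  # no sufficiently long silence yet; keep receiving audio


flags = [True, True, False, True, False, False, False, True]
print(find_section_end(flags, 3))  # 4
```

The short pause at index 2 does not close the section because it is shorter than the preset length; only the three-frame run starting at index 4 does.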

In step S500, the processor 120, for example the function unit 126, can perform a function corresponding to the speech recognition result of the third audio data. In the example of FIG. 8A, the function unit 126 may be a voice information providing unit that searches for tomorrow's weather and provides the result: it can search for tomorrow's weather over the Internet and provide the result to the user, for instance as speech through the speaker 152. The function unit 126 is woken up in response to the speech recognition result of the third audio data.

The embodiments of the present invention described above may be implemented in the form of a computer program executed on a computer through various components, and such a computer program may be recorded on a computer-readable medium. The medium may store a computer-executable program persistently, or store it temporarily for execution or download. The medium may be any of various recording or storage means in the form of a single piece of hardware or several pieces of hardware combined; it is not limited to a medium directly connected to a particular computer system and may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROM (compact disc read-only memory) and DVD (digital versatile disc); magneto-optical media such as floptical disks; and ROM (read-only memory), RAM (random access memory), flash memory, and the like, configured to store program instructions. Other examples include recording or storage media managed by app stores that distribute applications, or by sites and servers that supply or distribute various other software.

In this specification, a “unit”, “module”, or the like may be a hardware component such as a processor or a circuit, and/or a software component executed by a hardware component such as a processor. For example, a “unit”, “module”, or the like may be implemented by components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables.

The foregoing description of the present invention is for illustration only, and a person skilled in the art to which the present invention pertains will understand that it can readily be modified into other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in combined form.

The scope of the present invention is defined by the claims rather than by the foregoing detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

The voice control device that prevents erroneous keyword recognition and its operating method according to the present invention can be effectively applied, for example, to technical fields related to speech recognition.

(Appendix 1)
A voice control device comprising:
an audio processing unit that receives an audio signal corresponding to ambient sound and generates audio stream data;
a keyword detection unit that detects, in the audio stream data, a candidate keyword corresponding to a predetermined keyword, and determines a start point and an end point of a first section corresponding to first audio data in which the candidate keyword was detected;
a speaker feature vector extraction unit that extracts a first speaker feature vector for the first audio data and a second speaker feature vector for second audio data corresponding to a second section of the audio stream data whose end point is the start point of the first section; and
a wakeup determination unit that determines, based on the similarity between the first speaker feature vector and the second speaker feature vector, whether the first audio data contains the keyword.
(Appendix 2)
The voice control device according to Appendix 1, wherein the wakeup determination unit determines that the first audio data contains the keyword when the similarity between the first speaker feature vector and the second speaker feature vector is equal to or less than a preset reference value.
(Appendix 3)
The voice control device according to Appendix 1, further comprising a keyword storage that stores a plurality of keywords including the predetermined keyword,
wherein each of the keywords is a wakeup keyword or a single command keyword.
(Appendix 4)
The voice control device according to Appendix 3, wherein, when the keyword detection unit detects in the audio stream data the candidate keyword corresponding to the single command keyword:
the speaker feature vector extraction unit receives third audio data corresponding to a third section of the audio stream data whose start point is the end point of the first section, and extracts a third speaker feature vector of the third audio data; and
the wakeup determination unit determines whether the first audio data contains the single command keyword based on the similarity between the first speaker feature vector and the second speaker feature vector and the similarity between the first speaker feature vector and the third speaker feature vector.
(Appendix 5)
The voice control device according to Appendix 4, wherein the wakeup determination unit determines that the first audio data contains the single command keyword when the similarity between the first speaker feature vector and the second speaker feature vector is equal to or less than a predetermined reference value and the similarity between the first speaker feature vector and the third speaker feature vector is equal to or less than a predetermined reference value.
(Appendix 6)
The voice control device according to Appendix 3, further comprising a voice recognition unit that, when the keyword detection unit detects in the audio stream data the candidate keyword corresponding to the wakeup keyword, is woken up in response to a determination by the wakeup determination unit that the first audio data contains the wakeup keyword, receives third audio data corresponding to a third section of the audio stream data whose start point is the end point of the first section, and either performs speech recognition on the third audio data or transmits the third audio data to an external device so that it is recognized there.
(Appendix 7)
The voice control device according to Appendix 6, wherein the second section is determined variably depending on the wakeup keyword.
(Appendix 8)
The voice control device according to Appendix 1, wherein the speaker feature vector extraction unit:
extracts a first frame feature vector for each frame of the first audio data, and normalizes and averages the extracted first frame feature vectors to extract the first speaker feature vector representing the first audio data; and
extracts a second frame feature vector for each frame of the second audio data, and normalizes and averages the extracted second frame feature vectors to extract the second speaker feature vector representing the second audio data.
(Appendix 9)
The voice control device according to Appendix 1, wherein:
the keyword detection unit calculates, for each frame of the audio stream data, a first probability that the frame is human speech and a second probability that the frame is background sound, and determines as a speech frame any frame for which the first probability is higher than the second probability by more than a predetermined reference value; and
the speaker feature vector extraction unit:
extracts a first frame feature vector for each frame in the first audio data determined to be a speech frame, and normalizes and averages the extracted first frame feature vectors to extract the first speaker feature vector representing the first audio data; and
extracts a second frame feature vector for each frame in the second audio data determined to be a speech frame, and normalizes and averages the extracted second frame feature vectors to extract the second speaker feature vector representing the second audio data.
(Appendix 10)
The voice control device according to Appendix 1, wherein the speaker feature vector extraction unit is woken up in response to detection of the candidate keyword by the keyword detection unit.
(Appendix 11)
A method of operating a voice control device, the method comprising:
receiving an audio signal corresponding to ambient sound and generating audio stream data;
detecting, in the audio stream data, a candidate keyword corresponding to a predetermined keyword, and determining a start point and an end point of a first section corresponding to first audio data in which the candidate keyword was detected;
extracting a first speaker feature vector for the first audio data;
extracting a second speaker feature vector for second audio data corresponding to a second section of the audio stream data whose end point is the start point of the first section; and
determining, based on the similarity between the first speaker feature vector and the second speaker feature vector, whether the first audio data contains the keyword, and deciding whether to wake up.
(Appendix 12)
The method of operating a voice control device according to Appendix 11, wherein deciding whether to wake up comprises:
comparing the similarity between the first speaker feature vector and the second speaker feature vector with a predetermined reference value;
when the similarity is equal to or less than the predetermined reference value, determining that the first audio data contains the keyword and waking up; and
when the similarity exceeds the predetermined reference value, determining that the first audio data does not contain the keyword and not waking up.
(Appendix 13)
The method of operating a voice control device according to Appendix 11, further comprising, when the detected candidate keyword is the candidate keyword corresponding to a single command keyword:
receiving third audio data corresponding to a third section of the audio stream data whose start point is the end point of the first section;
extracting a third speaker feature vector of the third audio data; and
determining that the first audio data contains the single command keyword when the similarity between the first speaker feature vector and the second speaker feature vector is equal to or less than a predetermined reference value and the similarity between the first speaker feature vector and the third speaker feature vector is equal to or less than a predetermined reference value.
(Appendix 14)
The method of operating a voice control device according to Appendix 13, further comprising performing a function corresponding to the single command keyword in response to a determination that the first audio data contains the single command keyword.
(Appendix 15)
The method of operating a voice control device according to Appendix 11, further comprising, when the detected keyword is the candidate keyword corresponding to a wakeup keyword:
receiving, in response to a determination that the first audio data contains the wakeup keyword, third audio data corresponding to a third section of the audio stream data whose start point is the end point of the first section; and
performing speech recognition on the third audio data, or transmitting the third audio data to an external device so that it is recognized there.
(Appendix 16)
The method of operating a voice control device according to Appendix 11, wherein extracting the first speaker feature vector and the second speaker feature vector comprises:
extracting a first frame feature vector for each frame of the first audio data;
normalizing and averaging the extracted first frame feature vectors to extract the first speaker feature vector representing the first audio data;
extracting a second frame feature vector for each frame of the second audio data; and
normalizing and averaging the extracted second frame feature vectors to extract the second speaker feature vector representing the second audio data.
(Appendix 17)
The method of operating a voice control device according to Appendix 11, further comprising calculating, for each frame of the audio stream data, a first probability that the frame is human speech and a second probability that the frame is background sound, and determining as a speech frame any frame for which the first probability is higher than the second probability by more than a predetermined reference value,
wherein extracting the first speaker feature vector and the second speaker feature vector comprises:
extracting a first frame feature vector for each frame in the first audio data determined to be a speech frame;
normalizing and averaging the extracted first frame feature vectors to extract the first speaker feature vector representing the first audio data;
extracting a second frame feature vector for each frame in the second audio data determined to be a speech frame; and
normalizing and averaging the extracted second frame feature vectors to extract the second speaker feature vector representing the second audio data.
(Appendix 18)
A computer program including instructions that cause a processor of a voice control device to execute the operating method according to any one of Appendices 11 to 17.
(Appendix 19)
A recording medium on which the computer program according to Appendix 18 is recorded.
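The speaker feature vector extraction described in Appendices 8, 9, 16, and 17 (select speech frames, normalize each frame feature vector, then average) can be sketched as below. This is only an illustration under stated assumptions: the appendices do not say which normalization is used, so L2 normalization is assumed here, and the per-frame feature vectors and probabilities are taken as given inputs. The function name `speaker_feature_vector` and the parameter `margin` (standing in for the predetermined reference value) are hypothetical.

```python
import math


def speaker_feature_vector(frame_vectors, p_speech, p_background, margin):
    """Average the normalized feature vectors of the frames judged to be speech.

    frame_vectors : one feature vector (list of floats) per frame
    p_speech      : per-frame probability that the frame is human speech
    p_background  : per-frame probability that the frame is background sound
    margin        : a frame counts as speech when p_speech exceeds p_background by more than this
    Returns None when no frame qualifies as a speech frame.
    """
    speech_frames = [
        vec
        for vec, ps, pb in zip(frame_vectors, p_speech, p_background)
        if ps - pb > margin
    ]
    if not speech_frames:
        return None
    normalized = []
    for vec in speech_frames:
        norm = math.sqrt(sum(x * x for x in vec))
        # Assumption: L2 normalization; an all-zero frame vector is left unchanged.
        normalized.append([x / norm for x in vec] if norm > 0.0 else list(vec))
    dim = len(normalized[0])
    return [sum(vec[i] for vec in normalized) / len(normalized) for i in range(dim)]


frames = [[3.0, 4.0], [0.0, 2.0], [10.0, 0.0]]
vector = speaker_feature_vector(frames, [0.9, 0.8, 0.3], [0.1, 0.1, 0.7], 0.5)
print(vector)  # approximately [0.3, 0.9]; the third frame is dropped as background
```

Passing a margin low enough that every frame qualifies gives the behavior of Appendices 8 and 16, where every frame of the audio data contributes to the average.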

DESCRIPTION OF SYMBOLS
100 Voice control device (electronic device)
110 Memory
120 Processor
121 Audio processing unit
122 Keyword detection unit
123 Speaker feature vector extraction unit
124 Wakeup determination unit
125 Voice recognition unit
126 Function unit

Claims

A voice control device comprising:
an audio processing unit that receives an audio signal corresponding to ambient sound and generates audio stream data;
a keyword detection unit that determines, from the audio stream data, a first section in which a candidate keyword corresponding to a predetermined keyword is detected;
a speaker feature vector extraction unit that extracts, from the audio stream data, a first speaker feature vector for identifying a speaker in the first section and a second speaker feature vector for identifying a speaker in a second section adjacent to the first section; and
a wakeup determination unit that determines whether the predetermined keyword is included in the first section based on the similarity between the first speaker feature vector and the second speaker feature vector.

A voice control method executed by a voice control device, the method comprising:
receiving an audio signal corresponding to ambient sound and generating audio stream data;
determining, from the audio stream data, a first section in which a candidate keyword corresponding to a predetermined keyword is detected;
extracting, from the audio stream data, a first speaker feature vector for identifying a speaker in the first section and a second speaker feature vector for identifying a speaker in a second section adjacent to the first section; and
determining whether the predetermined keyword is included in the first section based on the similarity between the first speaker feature vector and the second speaker feature vector.

A computer program for causing a computer to execute the voice control method according to claim 2.

A recording medium for storing the computer program according to claim 3.