JP2010511958A - Gesture / voice integrated recognition system and method

Info

Publication number: JP2010511958A
Application number: JP2009540141A
Authority: JP
Inventors: ヨンジユジョン; ムンソンハン; ジェソンイ; ジュンソクパク
Original assignee: Electronics and Telecommunications Research Institute ETRI
Current assignee: Electronics and Telecommunications Research Institute ETRI
Priority date: 2006-12-04
Filing date: 2007-12-03
Publication date: 2010-04-15
Also published as: KR100948600B1; KR20080050994A

Abstract

The present invention relates to a gesture/voice integrated recognition system and method. To improve command recognition performance by integrating voice and gesture in a noisy environment, the system comprises: a voice feature extraction unit that detects the start point and end point of a command word in the input voice and extracts voice feature information; a gesture feature extraction unit that uses information on the detected start and end points to detect a command section in the gestures of a captured video and extracts gesture feature information; and an integrated recognition unit that outputs the extracted voice feature information and gesture feature information as integrated recognition data using preset learning parameters. The system can thereby recognize user commands simply and accurately.

Description

The present invention relates to integrated recognition technology, and in particular to a gesture/voice integrated recognition system and method that, in order to recognize user commands with high performance in a real noisy environment, uses the EPD value of the voice to extract gesture feature information, integrates it with the voice feature information, and recognizes the user's command.

The present invention is derived from research conducted as part of the IT New Growth Engine Core Technology Development Project of the Ministry of Information and Communication and the Institute for Information Technology Advancement [Project management number: 2006-S-031-01, Project title: Development of five-senses information processing technology for network-based realistic services].

Recently, with the development of multimedia and interface technology, multimodal recognition using facial expression and orientation, lip shape, gaze tracking, hand gestures, voice and the like has been actively studied in order to realize human-machine interfaces easily and simply.

In particular, among current man-machine interface technologies, voice recognition and gesture recognition are used as the most convenient. However, although both show high recognition rates in restricted environments, they cannot deliver that performance in a real noisy environment. For voice recognition, environmental noise has the greatest effect on performance; for camera-based gesture recognition, performance varies widely with changes in lighting and with the type of gesture. Voice recognition therefore needs technology that recognizes speech with a noise-robust algorithm, and gesture recognition needs technology that can extract the specific section of a gesture that carries the recognition information. Moreover, when ordinary gestures are used, the specific section of the gesture cannot be easily delimited, which makes recognition difficult.

Furthermore, when voice and gesture are integrated and recognized, the processing speed of an audio frame is about 10 ms/frame while that of a video frame is about 66.7 ms/frame, so the per-frame processing speeds differ. In addition, a gesture section generally takes longer than the corresponding voice section, so the lengths of the voice section and the gesture section differ, which makes it difficult to synchronize voice and gesture.
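The frame-rate mismatch can be made concrete with a little arithmetic. The sketch below assumes the nominal figures from the text (10 ms per audio frame, and 15 video frames per second, which gives about 66.7 ms per video frame); the function names are illustrative and do not come from the patent.

```python
AUDIO_FRAME_MS = 10   # nominal audio frame period from the text
VIDEO_FPS = 15        # 1000 / 15 is about 66.7 ms per video frame


def audio_to_video_frame(audio_frame_index: int) -> int:
    """Index of the video frame that is on screen when an audio frame starts."""
    # Integer arithmetic avoids rounding trouble with the 66.7 ms period.
    return (audio_frame_index * AUDIO_FRAME_MS * VIDEO_FPS) // 1000


def frames_in_interval(duration_ms: int) -> tuple:
    """Number of whole (audio, video) frames that fit in an interval."""
    return duration_ms // AUDIO_FRAME_MS, (duration_ms * VIDEO_FPS) // 1000


if __name__ == "__main__":
    # One second of input yields far more audio frames than video frames.
    print(frames_in_interval(1000))
```

Under these assumptions one second of input produces 100 audio frames but only 15 video frames, which is why the two feature streams cannot simply be paired frame by frame.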

Therefore, to solve the above problems, a means is needed that uses an algorithm robust to environmental noise to search the user's voice for the command word section and extract feature information, and that uses information on the start point of the voice command word to detect the feature section of the gesture, so that commands can be recognized easily even from gestures that are not clearly delimited.

A means is also needed to resolve the synchronization mismatch that arises in integrated voice and gesture recognition, by applying a preset optimal frame count to the gesture command section detected with the voice EPD value so that the two are synchronized.

To solve the above problems, the gesture/voice integrated recognition system of the present invention comprises: a voice feature extraction unit that detects the start point and end point of a command word in the input voice and extracts voice feature information; a gesture feature extraction unit that uses information on the detected start and end points to detect a command section in the gestures of a captured video and extracts gesture feature information; and an integrated recognition unit that outputs the extracted voice feature information and gesture feature information as integrated recognition data using preset learning parameters.

The gesture/voice integrated recognition system further comprises a synchronization module that includes a gesture start point detection module, which uses the detected start point to detect the start point of the gesture in the captured video, and an optimal frame application module, which applies a preset optimal number of frames from the gesture start point to calculate and extract the optimal video frames. Here, the gesture start point detection module detects the start point of the gesture by checking the detected voice start point (EPD: End Point Detection) flag against the captured video.

The voice feature extraction unit comprises an EPD (End Point Detection) detection module that detects the start point and end point of the command word in the input voice, and an auditory-model-based voice feature extraction module that uses an algorithm based on an auditory model to extract the voice feature information contained in the detected command word. The unit further removes noise from the extracted voice feature information.

The gesture feature extraction unit comprises a hand tracking module that tracks hand movement in the video captured by the camera and transmits it to the synchronization module, and a gesture feature extraction module that extracts gesture feature information using the optimal video frames extracted by the synchronization module.

The integrated recognition unit comprises an integrated learning DB control module that generates learning parameters based on a preset integrated learning model and an integrated learning database, an integrated feature control module that controls the extracted voice feature information and gesture feature information using the generated learning parameters, and an integrated recognition module that generates the result controlled by the integrated feature control module as the recognition result. Here, the integrated feature control module controls the feature vectors of the extracted voice feature information and gesture feature information by expanding and reducing the number of nodes of the input vector.

To achieve the above object, the gesture/voice integrated recognition method of the present invention comprises: step 1 of detecting the start point (EPD value) and end point of a command word in the input voice and extracting voice feature information; step 2 of using the detected start point of the command word to detect a command section in the gestures of the video input by a camera and extracting gesture feature information; and step 3 of outputting the extracted voice feature information and gesture feature information as integrated recognition data using preset learning parameters.

Here, step 1 extracts the voice feature information, based on an auditory model, from the command word section delimited by the start point and end point of the command word.

Step 2 comprises: step A of tracking the hand gesture in the camera's input video; step B of detecting the command section of the hand gesture using the transmitted EPD value; step C of applying a preset optimal frame count to determine the optimal frames within the gesture command section; and step D of extracting gesture feature information from the determined optimal frames.

As described above, the gesture/voice integrated recognition system and method according to the present invention detect the gesture command section using the EPD value, which is the start point of the voice command word section, and can therefore raise the recognition rate even for gestures that cannot be clearly delimited. In addition, by applying optimal frames to the gesture command section and synchronizing voice and gesture, integrated recognition of voice and gesture can be realized.

FIG. 1 is a diagram illustrating the concept of the gesture/voice integrated recognition system according to the present invention.
FIG. 2 is a diagram illustrating the configuration of the gesture/voice integrated recognition system according to the present invention.
FIG. 3 is a flowchart illustrating the gesture/voice integrated recognition method according to the present invention.

Hereinafter, preferred embodiments that a person of ordinary skill in the art to which the present invention pertains can readily practice are described in detail with reference to the accompanying drawings. In describing the operating principles of the preferred embodiments, however, detailed descriptions of related known functions or configurations are omitted where they would unnecessarily obscure the gist of the present invention.

FIG. 1 is a diagram illustrating the concept of the gesture/voice integrated recognition system according to the present invention.

Referring to FIG. 1, the gesture/voice integrated recognition technology recognizes a person's voice and gesture commands in an integrated manner, and uses the control command generated from the recognition result to control devices that express the five senses.

Specifically, a person 100 issues a command by voice 110 and gesture 120. As an example of such a command, when a person shopping in cyberspace wants to select a particular bread from the displayed items, the person can say "corn bread" while pointing at the corn bread.

When the person 100 issues a command by voice 110 and gesture 120, the feature information of the voice command is recognized through voice recognition 111, and the feature information of the gesture is recognized through gesture recognition 121. To raise the recognition rate for voice, which is vulnerable to environmental noise, and for gestures that cannot be clearly delimited, the voice and gesture feature information recognized in this way is recognized as a single user command by integrated recognition 130.

The present invention thus concerns the integrated recognition of a person's voice and gestures. The recognized command is delivered by a control unit to the output devices for the individual senses, namely a speaker 170, a display device 171, a scent generator 172, a tactile device 173 and a taste device 174, and controls each of them. The recognition result can also be transmitted to a network, and the five-senses data for that result can be delivered to control each output device. Because the present invention concerns integrated recognition and the configuration after recognition can take many forms, a description of it is omitted.

FIG. 2 is a diagram illustrating the configuration of the gesture/voice integrated recognition system according to the present invention.

Referring to FIG. 2, the gesture/voice integrated recognition system comprises: a voice feature extraction unit 210 that detects the start point and end point of a command word in the voice input through a microphone 211 and extracts voice feature information; a gesture feature extraction unit 220 that uses information on the start and end points detected by the voice feature extraction unit 210 to detect a command section in the gestures of the video captured by a camera and extracts gesture feature information; a synchronization module 230 that uses the start point detected by the voice feature extraction unit 210 to detect the start point of the gesture in the captured video and applies a preset optimal number of frames from that gesture start point to calculate the optimal video frames; and an integrated recognition unit 240 that outputs the extracted voice feature information and gesture feature information as integrated recognition data using preset learning parameters. Each component is described in detail below.

The voice feature extraction unit 210 consists of the microphone 211 through which the user inputs voice, an EPD (End Point Detection) detection module 212 that detects the start point and end point of the command word section in the user's voice, and an auditory-model-based voice feature extraction module 213 that extracts voice feature information, based on an auditory model, from the voice command word section detected by the EPD detection module 212. It may also include a channel noise removal module (not shown) that removes noise contained in the extracted voice feature information.

The EPD detection module 212 analyzes the voice input through a wired or wireless microphone and detects the start point and end point of the command word.

Specifically, the EPD detection module 212 acquires the voice signal, calculates the energy values needed for endpoint detection, determines which section of the input voice signal should be treated as the command word, and detects the start point and end point of the command word.

The EPD detection module 212 first acquires the voice signal from the microphone and converts the acquired voice into a form suitable for frame-based calculation. When the voice is input wirelessly, problems such as data loss and signal distortion caused by interference can occur, so these must be handled when the signal is acquired.

In the EPD detection module 212, the energy value needed for endpoint detection is calculated, for example, as follows. The frame size for analyzing the voice signal is 160 samples, and the frame energy is calculated by the following equation.

S(n): vocal cord signal sample; N: number of samples in one frame

The frame energy obtained in this way is used as a parameter for the endpoint detection performed afterwards.
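The equation itself does not appear in this text, so the sketch below assumes the common short-time energy definition, the sum of squared samples over one frame, E = Σ S(n)² for n = 0 … N−1 with N = 160. The original formula may differ (for example, it could be a log energy), and the function name is illustrative only.

```python
FRAME_SIZE = 160  # samples per frame, as stated in the text


def frame_energies(samples, frame_size=FRAME_SIZE):
    """Return the energy of each whole, non-overlapping frame.

    Energy is assumed to be the sum of squared samples in the frame;
    a trailing partial frame is ignored.
    """
    energies = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        energies.append(sum(s * s for s in frame))
    return energies


if __name__ == "__main__":
    # Two frames: a quiet one followed by a louder one.
    signal = [1] * FRAME_SIZE + [2] * FRAME_SIZE
    print(frame_energies(signal))
```

At the 16 kHz sampling rate mentioned later in the description, a 160-sample frame covers 10 ms, which matches the audio frame period quoted earlier.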

After calculating the frame energy values, the EPD detection module 212 determines the section that should actually be treated as the command word. For example, the start point and end point of the voice signal are determined by four energy thresholds derived from the frame energy and ten conditions. The four thresholds and ten conditions can be set in various ways; preferably, those best suited to finding the command word section are selected by experiment. The endpoint detection algorithm applies the four thresholds frame by frame to determine the start point and end point.
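The four thresholds and ten conditions are not disclosed in this text, so the following is a deliberately simplified sketch of energy-based endpoint detection that uses only two thresholds and a minimum run length. All parameter names and values are assumptions for illustration, not the patent's algorithm.

```python
def detect_endpoints(energies, start_threshold, end_threshold, min_frames=3):
    """Return (start_frame, end_frame) of the first command section, or None.

    The start is the first frame of the first run of `min_frames` consecutive
    frames with energy >= start_threshold. The end is the last frame before
    the energy stays below end_threshold for `min_frames` consecutive frames,
    or the last non-quiet frame if the input ends first.
    """
    start = None
    run = 0  # length of the current qualifying run
    for i, energy in enumerate(energies):
        if start is None:
            run = run + 1 if energy >= start_threshold else 0
            if run >= min_frames:
                start = i - min_frames + 1
                run = 0
        else:
            run = run + 1 if energy < end_threshold else 0
            if run >= min_frames:
                return start, i - min_frames
    if start is None:
        return None
    return start, len(energies) - 1 - run


if __name__ == "__main__":
    energies = [0, 0, 5, 5, 5, 5, 1, 0, 0, 0, 0]
    print(detect_endpoints(energies, start_threshold=4, end_threshold=2))
```

Requiring several consecutive frames before committing to a start or an end is one simple way to keep short noise bursts and brief pauses from splitting a command; a real detector with more thresholds and conditions would refine this further.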

The EPD detection module 212 transmits information on the start point of the command word detected in this way (hereinafter, the "EPD value") to the gesture start point detection module 231 of the synchronization module 230.

The EPD detection module 212 also transmits information on the command word section of the input voice to the auditory-model-based voice feature extraction module 213 so that voice feature information can be extracted.

On receiving the information on the voice command word section, the auditory-model-based voice feature extraction module 213 extracts feature information, based on an auditory model, from the command word section detected by the EPD detection module 212. Algorithms used to extract voice feature information based on an auditory model include the EIH algorithm and the ZCPA algorithm.

The voice feature information extracted by the auditory-model-based voice feature extraction module 213 has its noise removed by the channel noise removal module (not shown) and is then delivered to the integrated recognition unit 240.

The gesture feature extraction unit 220 consists of a face and hand detection module 222 that detects the face and hands in the video captured by a camera 221, a hand tracking module 223 that tracks the movement of the detected hand and delivers it to the synchronization module 230, and a gesture feature extraction module 224 that extracts gesture feature information using the optimal frames calculated by the synchronization module 230.

The face and hand detection module 222 detects the face and hands that are the subject of the gesture in the video, and the hand tracking module 223 continuously tracks the hand movement in the video. Although the hand tracking module 223 is described here in terms of hands only, it can track any body part that a person skilled in the art would recognize as capable of gesturing.

The hand tracking module 223 continuously stores the hand movement as time passes. The portion of the hand movement that can be recognized as a gesture command is detected by the synchronization module 230 using the EPD value delivered from the voice feature extraction unit 210. The synchronization module 230, which uses the EPD value to detect the section of hand movement recognized as a gesture command and applies the optimal frames to synchronize voice and gesture, is described below.

The synchronization module 230 comprises a gesture start point detection module 231 that detects the start point of the gesture using the EPD value and the video of the hand movement, and an optimal frame application module 232 that uses the gesture start frame calculated from the detected gesture start point to calculate the optimal video frames needed for integrated recognition.

While the voice signal and the video signal are being input in real time, when the EPD detection module 212 detects the EPD value of the voice, the synchronization module 230 checks the voice EPD flag against the video signal. In this way the gesture start point detection module 231 calculates the start frame of the gesture. Using the calculated gesture start frame, the optimal frame application module 232 then calculates the optimal video frames needed for integrated recognition and delivers them to the gesture feature extraction module 224. For this purpose, the number of frames judged to give the highest gesture recognition rate is set in advance; once the gesture start point detection module 231 has calculated the gesture start frame, the optimal video frames are determined from it.
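A minimal sketch of this synchronization step follows. It assumes that the EPD value arrives as an audio frame index, that audio frames are 10 ms long and video runs at 15 frames per second (figures taken from elsewhere in the description), and that the tracked hand positions are held in a list indexed by video frame. The frame count of 20 is a placeholder; the text only says that the value is chosen in advance.

```python
AUDIO_FRAME_MS = 10          # assumed audio frame period
VIDEO_FPS = 15               # assumed video frame rate
OPTIMAL_FRAME_COUNT = 20     # hypothetical preset; the patent gives no value


def gesture_start_frame(epd_audio_frame: int) -> int:
    """Map the voice start point (audio frame index) to a video frame index."""
    return (epd_audio_frame * AUDIO_FRAME_MS * VIDEO_FPS) // 1000


def select_optimal_frames(hand_track, epd_audio_frame,
                          frame_count=OPTIMAL_FRAME_COUNT):
    """Return the fixed-length window of tracked frames that begins at the
    video frame corresponding to the voice start point."""
    start = gesture_start_frame(epd_audio_frame)
    return hand_track[start:start + frame_count]


if __name__ == "__main__":
    hand_track = list(range(100))   # stand-in for per-frame hand positions
    print(select_optimal_frames(hand_track, epd_audio_frame=200, frame_count=5))
```

Taking a fixed number of frames from the voice-derived start point sidesteps the need to find where the gesture ends, which is the part the description calls hard to delimit.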

The integrated recognition unit 240 consists of: an integrated model generation module 242 that generates, based on a learning model, an integrated model for efficiently integrating voice feature information and gesture feature information; an integrated learning DB 244 built in a form suited to developing integrated recognition algorithms based on statistical models; an integrated learning DB control module 243 that controls the learning performed with the integrated model generation module 242 and the integrated learning DB 244 and the resulting learning parameters; an integrated feature control module 241 that controls the learning parameters and the feature vectors of the input voice feature information and gesture feature information; and an integrated recognition module 245 that generates the recognition result and provides various functions.

The integrated model generation module 242 generates a high-performance integrated model for efficiently integrating voice feature information and gesture feature information. To select such a model, various established learning algorithms (Hidden Markov Model (HMM), Neural Network (NN), Dynamic Time Warping (DTW) and the like) can be implemented and compared experimentally. In particular, the present invention can adopt an NN-based integrated model and optimize the NN parameters so that integrated recognition performs well. One of the biggest problems in generating a high-performance integrated model, however, is how to synchronize two modalities with different numbers of frames inside the learning model.

The synchronization problem inside the learning model is the same as the problem of optimizing the learning model. The present invention provides an integration layer and optimizes, within that layer, the way voice and gesture are connected. To do so, the length over which voice and gesture overlap on the time axis is calculated, and the two are synchronized on that basis. The overlap length is chosen through recognition rate experiments, by searching for the connection method that gives the highest recognition rate.
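The text does not say how the overlap length is computed. One straightforward reading, shown here purely as an assumption, is the length of the intersection of the voice interval and the gesture interval on a shared time axis, measured in milliseconds.

```python
def overlap_length_ms(voice_start, voice_end, gesture_start, gesture_end):
    """Length of the time interval shared by the voice and gesture sections.

    All arguments are times in milliseconds on a common clock.
    Returns 0 when the two sections do not overlap.
    """
    return max(0, min(voice_end, gesture_end) - max(voice_start, gesture_start))


if __name__ == "__main__":
    # Voice spans 0-500 ms, gesture spans 200-900 ms.
    print(overlap_length_ms(0, 500, 200, 900))
```

In an experiment of the kind the description mentions, a quantity like this could be swept over candidate values to see which gives the best recognition rate.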

The integrated learning DB 244 holds an integrated recognition database built in a form suited to developing integrated recognition algorithms based on statistical models.

For example, synchronized data for ten words are collected from people of various age groups using a stereo camera and a wireless microphone. Table 1 lists the command set defined for the integration of gesture and voice. The defined commands target natural gestures that people can generally understand without much training.

The voice was recorded as a Pulse Code Modulation (PCM) waveform at a sampling rate of 16 kHz with 16 bits per sample and one channel (mono). The video was recorded with an STH-DCSG-C stereo camera as 24-bit bitmap images of 320x240 pixels at 15 frames per second, against a blue-screen background under lighting from four fluorescent light boxes. Because the stereo camera has no audio interface, the voice collection module and the video collection module were written separately, and the data were collected with a video/voice synchronization program in which the voice recording program controls the video collection process through IPC (Inter-Process Communication). The video collection module was built using the OpenCV (Computer Vision) library and SVS (Small Vision System).
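For readers who want to reproduce the audio format, the sketch below writes a 16 kHz, 16-bit, mono PCM file in memory with Python's standard `wave` module and reads the header back. It only illustrates the format parameters; the recording program described in the text is not available here, and the function name is an assumption.

```python
import io
import wave

SAMPLE_RATE_HZ = 16000   # 16 kHz, as stated in the text
SAMPLE_WIDTH_BYTES = 2   # 16 bits per sample
CHANNELS = 1             # mono


def make_silent_wav(seconds: int = 1) -> bytes:
    """Return the bytes of a WAV file containing `seconds` of silence."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(b"\x00\x00" * (SAMPLE_RATE_HZ * seconds))
    return buffer.getvalue()


def describe_wav(data: bytes) -> tuple:
    """Return (channels, sample width in bytes, sample rate, frame count)."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return (wav.getnchannels(), wav.getsampwidth(),
                wav.getframerate(), wav.getnframes())


if __name__ == "__main__":
    print(describe_wav(make_silent_wav(1)))
```

At these settings the audio stream carries 32,000 bytes per second, while uncompressed 320x240 24-bit video at 15 frames per second carries 3,456,000 bytes per second, roughly a hundred times more.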

The stereo camera video had to be adapted to the actual recording environment through a separate calibration process. To obtain the best video, the relevant gain, exposure, brightness, red and blue parameter values were modified to adjust the color, exposure and white balance (WB). The calibration and parameter information was saved in a separate ini file, which the video saving module loads and refers to.

The integrated learning DB control module 243 works with the integrated model generation module 242 to generate learning parameters based on the integrated learning DB 244, which is generated and stored in advance.

The integrated feature control module 241 uses the learning parameters generated by the integrated learning DB control module 243 to control the feature vectors of the voice feature information and gesture feature information extracted by the voice feature extraction unit 210 and the gesture feature extraction unit 220. This control concerns expanding and reducing the number of nodes in the input vector. The integrated feature control module 241 is characterized by having an integration layer, which is designed to efficiently integrate voice and gesture features of different lengths and to present a single recognition rate.
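The text does not specify how the number of input nodes is expanded or reduced. One possible reading, shown here purely as an assumption, is to resample each variable-length feature vector to a fixed number of nodes by linear interpolation and then concatenate the two results into a single input vector. The function names `resize_vector` and `integrate` are hypothetical.

```python
def resize_vector(values, target_len):
    """Expand or reduce a feature vector to target_len nodes by linear interpolation."""
    if target_len < 1 or not values:
        raise ValueError("need a non-empty vector and target_len >= 1")
    if len(values) == 1 or target_len == 1:
        return [float(values[0])] * target_len
    # Distance, in source positions, between consecutive output nodes.
    step = (len(values) - 1) / (target_len - 1)
    resized = []
    for i in range(target_len):
        pos = i * step
        lo = min(int(pos), len(values) - 2)  # left neighbour, kept in range
        frac = pos - lo
        resized.append(values[lo] * (1 - frac) + values[lo + 1] * frac)
    return resized


def integrate(voice_features, gesture_features, voice_nodes, gesture_nodes):
    """Build one fixed-size input vector from two variable-length vectors."""
    return (resize_vector(voice_features, voice_nodes)
            + resize_vector(gesture_features, gesture_nodes))
```

For example, `resize_vector([0, 10], 3)` expands two nodes to three, and `resize_vector([0, 1, 2, 3, 4], 3)` reduces five nodes to three, so inputs of any length reach the recognizer with the same dimensionality.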

The integrated recognition module 245 generates a recognition result from the output of the control performed by the integrated feature control module 241. It also provides various functions for interacting with an integrated representation unit, a network, and the like.

FIG. 3 is a flowchart of the gesture/voice integrated recognition method according to the present invention.

Referring to FIG. 3, the gesture/voice integrated recognition method runs as three threads: a voice feature extraction thread 10 that extracts voice features, a gesture feature extraction thread 20 that extracts gesture features, and an integrated recognition thread 30 that performs integrated recognition of voice and gesture. The three threads 10, 20, and 30 are created when the learning parameters are loaded and cooperate closely by means of thread flags. The gesture/voice integrated recognition method is described below in terms of the interaction of the three threads 10, 20, and 30.

When a user issues a command using voice and gesture, the voice feature extraction thread 10 continuously receives voice through a wired or wireless microphone (S311), and the gesture feature extraction thread 20 continuously receives video containing the gesture through a camera (S320). While voice frames are computed for the voice continuously input through the microphone (S312), the EPD detection module 212 detects the start point and end point of the command word contained in the voice (the voice EPD value) (S313). When the voice EPD value is detected, it is passed to the synchronization step 40 of the gesture feature extraction thread. Once the start point and end point determine the command word section of the voice, the auditory-model-based voice feature extraction module 213 extracts voice features from the command word section based on the auditory model (S314) and passes them to the integrated recognition thread 30.
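The text does not state which end point detection algorithm the EPD detection module 212 uses. As a stand-in only, the sketch below uses a simple short-time energy threshold to find the start and end of the active region in a sample sequence. The frame length of 160 samples (10 ms at 16 kHz), the threshold, and the function names are all assumptions.

```python
FRAME_LEN = 160  # hypothetical: 10 ms frames at 16 kHz


def frame_energies(samples, frame_len=FRAME_LEN):
    """Mean squared amplitude of each full, non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_endpoints(samples, threshold, frame_len=FRAME_LEN):
    """Return (start_sample, end_sample) of the active region, or None.

    start_sample is the first sample of the first frame whose energy exceeds
    the threshold; end_sample is one past the last sample of the last such
    frame.
    """
    active = [i for i, energy in enumerate(frame_energies(samples, frame_len))
              if energy > threshold]
    if not active:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len


# Hypothetical input: silence, a burst of amplitude 1000, then silence.
signal = [0] * 320 + [1000] * 480 + [0] * 160
endpoints = detect_endpoints(signal, threshold=1000)
```

A fixed energy threshold is fragile in noise, which is the setting the invention targets, so a practical detector would need something more robust than this sketch.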

The gesture feature extraction thread 20 detects the hand and face in the video continuously input through the camera (S321). Once the hand and face are detected, it tracks the user's gesture (S322). Because the user's gesture keeps changing, gestures of a fixed length are stored in a buffer (S323).

If the voice EPD value is detected and delivered while gestures are being stored in the buffer, the voice EPD flag is checked against the gesture video stored in the buffer (S324). Using the voice EPD flag, the start point and end point of the gesture containing the video feature information are located (S325), and the gesture features found in this way are stored (S326). Because the stored gesture features are not synchronized with the voice, a preset optimal number of frames is applied, and the optimal frames are calculated starting from the start frame of the gesture. The gesture feature extraction module 224 then extracts gesture feature information from the calculated optimal frames and passes it to the integrated recognition thread.
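The buffering and frame selection described above can be sketched as follows, assuming the voice EPD value arrives as a pair of audio sample indices and that frame timing follows the 16 kHz and 15 frames-per-second figures given for data collection. The buffer size, the optimal frame count of 10, and the function names are hypothetical.

```python
from collections import deque

SAMPLE_RATE_HZ = 16_000  # audio sampling rate from the text
VIDEO_FPS = 15           # video frame rate from the text


def sample_to_frame(sample_index):
    """Index of the video frame being captured at a given audio sample."""
    return sample_index * VIDEO_FPS // SAMPLE_RATE_HZ


def select_command_frames(buffered, epd_start, epd_end, optimal_count):
    """Pick up to optimal_count buffered frames starting at the voice start point.

    buffered holds (frame_index, frame_data) pairs in capture order;
    epd_start and epd_end are audio sample indices from the voice thread.
    """
    first = sample_to_frame(epd_start)
    last = sample_to_frame(epd_end)
    section = [data for index, data in buffered if first <= index <= last]
    return section[:optimal_count]


# Hypothetical ring buffer holding the most recent 45 frames (3 s at 15 fps).
gesture_buffer = deque(maxlen=45)
for index in range(60):
    gesture_buffer.append((index, f"frame-{index}"))

frames = select_command_frames(gesture_buffer, epd_start=32_000,
                               epd_end=48_000, optimal_count=10)
```

Taking a fixed number of frames from the start point gives the recognizer a gesture input of constant length even when the spoken command and the gesture differ in duration.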

When the voice feature extraction thread 10 and the gesture feature extraction thread 20 have successfully extracted the voice and gesture feature information, both threads enter a sleep state (S315, S328) while the integrated recognition thread 30 determines the recognition result.

Before it receives the voice feature information and gesture feature information, the integrated recognition thread 30 has the integrated model generation module 242 generate a high-performance integrated model in advance. The integrated learning DB control module 243 then controls this integrated model and the integrated learning DB 244 to generate and load the learning parameters (S331). Once the learning parameters are loaded, the integrated recognition thread 30 remains in a stopped state until the voice and gesture feature information is delivered (S332).

When extraction of the voice and gesture feature information is complete (S333) and the stopped integrated recognition thread 30 receives the corresponding signal, it loads each set of features into memory (S334). Once the voice and gesture feature information is loaded, it calculates the recognition result using the preset, optimized integrated learning model and the learning parameters (S335).

When the integrated recognition unit 240 has calculated the recognition result, the voice feature extraction thread 10 and the gesture feature extraction thread 20, which were in the stopped state, resume extracting feature information from the incoming voice and video.
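The sleep and wake cycle of the three threads can be sketched with Python's standard `threading` and `queue` modules. This is an illustration under assumptions, not the embodiment's implementation: feature extraction and recognition are replaced by string placeholders, and the names `extractor`, `recognizer`, `resume_voice`, and `resume_gesture` are hypothetical.

```python
import queue
import threading

voice_out = queue.Queue()      # voice features handed to the recognizer
gesture_out = queue.Queue()    # gesture features handed to the recognizer
resume_voice = threading.Event()
resume_gesture = threading.Event()
results = []


def extractor(name, inputs, out_queue, resume_flag):
    """Extract one feature per input, then sleep until recognition finishes."""
    for item in inputs:
        resume_flag.clear()
        out_queue.put(f"{name}-feature({item})")  # stand-in for real extraction
        resume_flag.wait()                        # sleep state (S315 / S328)


def recognizer(num_commands):
    """Wait for both features, recognize, then wake the extractor threads."""
    for _ in range(num_commands):
        voice = voice_out.get()       # blocks while stopped (S332)
        gesture = gesture_out.get()
        results.append((voice, gesture))  # stand-in for the calculation in S335
        resume_voice.set()
        resume_gesture.set()


commands = ["go", "stop"]  # hypothetical command words
threads = [
    threading.Thread(target=extractor,
                     args=("voice", commands, voice_out, resume_voice)),
    threading.Thread(target=extractor,
                     args=("gesture", commands, gesture_out, resume_gesture)),
    threading.Thread(target=recognizer, args=(len(commands),)),
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Each extractor clears its own event before publishing a feature and waits on it afterwards, so neither extractor can run ahead of the recognizer by more than one command.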

The present invention described above is not limited by the foregoing embodiments or the accompanying drawings. It will be apparent to those of ordinary skill in the art to which the present invention pertains that various substitutions, modifications, and changes can be made without departing from the technical idea of the present invention.

Claims

A gesture/voice integrated recognition system comprising:
a voice feature extraction unit that detects the start point and end point of a command word in input voice and extracts voice feature information;
a gesture feature extraction unit that uses information on the detected start point and end point to detect a command section from gestures in captured video, and extracts gesture feature information; and
an integrated recognition unit that outputs the extracted voice feature information and gesture feature information as integrated recognition data using preset learning parameters.

The gesture/voice integrated recognition system according to claim 1, further comprising a synchronization module that includes:
a gesture start point detection module that detects the start point of a gesture in the captured video using the detected start point; and
an optimal frame application module that applies a preset optimal number of frames from the start point of the gesture to calculate and extract optimal video frames.

The gesture/voice integrated recognition system according to claim 2, wherein the gesture start point detection module detects the start point of the gesture by checking, in the captured video, the flag of the detected start point (EPD: End Point Detection).

The gesture/voice integrated recognition system according to any one of claims 1 to 3, wherein the voice feature extraction unit includes:
an EPD (End Point Detection) detection module that detects the start point and end point of a command word in the input voice; and
an auditory-model-based voice feature extraction module that uses an algorithm based on an auditory model to extract, from the detected command word, the voice feature information it contains.

The gesture/voice integrated recognition system according to claim 4, wherein the voice feature extraction unit removes noise from the extracted voice feature information.

The gesture/voice integrated recognition system according to claim 3, wherein the gesture feature extraction unit includes:
a hand tracking module that tracks hand movement in video captured by a camera and passes it to the synchronization module; and
a gesture feature extraction module that extracts gesture feature information using the optimal video frames extracted by the synchronization module.

The gesture/voice integrated recognition system according to claim 1, wherein the integrated recognition unit includes:
an integrated learning DB control module that generates learning parameters based on a preset integrated learning model and an integrated learning database;
an integrated feature control module that controls the extracted voice feature information and gesture feature information using the generated learning parameters; and
an integrated recognition module that generates, as a recognition result, the result of the control performed by the integrated feature control module.

The gesture/voice integrated recognition system according to claim 7, wherein the integrated learning model is generated based on a neural network (NN) learning algorithm.

The gesture/voice integrated recognition system, wherein the integrated learning database integrates voice and gesture feature information collected from various age groups using a stereo camera and a wireless microphone, and is constructed in a form applicable to an integrated recognition algorithm based on a statistical model.

The gesture / voice integrated recognition system according to claim 7, wherein the integrated recognition module includes an integration layer that integrates the extracted voice feature information and gesture feature information.

The gesture/voice integrated recognition system according to claim 7, wherein the integrated feature control module controls the feature vectors of the extracted voice feature information and gesture feature information by expanding and reducing the number of nodes of the input vector.

A gesture/voice integrated recognition method comprising:
a first step of detecting the start point (EPD value) and end point of a command word in input voice and extracting voice feature information;
a second step of using the detected start point of the command word to detect a command section from gestures in video input by a camera, and extracting gesture feature information; and
a third step of outputting the extracted voice feature information and gesture feature information as integrated recognition data using preset learning parameters.

The gesture/voice integrated recognition method according to claim 12, wherein the first step extracts voice feature information, based on an auditory model, from the command word section defined by the start point and end point of the command word.

The gesture/voice integrated recognition method according to claim 12, wherein the second step includes:
a step A of tracking hand gestures in the video input from the camera;
a step B of detecting the command section of the hand gesture using the transmitted EPD value;
a step C of applying a preset optimal number of frames to determine optimal frames from the command section of the gesture; and
a step D of extracting gesture feature information from the determined optimal frames.

The gesture/voice integrated recognition method according to claim 12, wherein the first step further includes removing noise from the extracted voice feature information.