JP2021015189A

JP2021015189A - Multi-modal voice recognition device and multi-modal voice recognition method

Info

Publication number: JP2021015189A
Application number: JP2019129656A
Authority: JP
Inventors: Osamu Segawa (瀬川 修); Tomoki Hayashi (林 知樹); Kazuya Takeda (武田 一哉)
Original assignee: Chubu Electric Power Co Inc; Tokai National Higher Education and Research System NUC
Current assignee: Chubu Electric Power Co Inc; Tokai National Higher Education and Research System NUC
Priority date: 2019-07-11
Filing date: 2019-07-11
Publication date: 2021-02-12
Anticipated expiration: 2039-07-11
Also published as: JP7414231B2

Abstract

To provide multi-modal voice recognition technology for improving voice recognition performance. SOLUTION: A time-series voice feature amount (voice feature amount series) extracted from voice by voice feature amount extraction means 110 is encoded by voice feature amount encoding means 120 and is weighted by voice code weighting means 130. A time-series gazing point image feature amount (gazing point image feature amount series) extracted from a gazing point image by gazing point image feature amount extraction means 140 is encoded by gazing point image feature amount encoding means 150 and is weighted by gazing point code weighting means 160. Decoding means 180 forms a text corresponding to an integrated weighted code series obtained by integrating a weighted voice code series and a weighted gazing point code series by a character string selected from character strings stored in storage means 50. The text decoded by decoding means is displayed on display means 60 in association with a position of a gazing point. SELECTED DRAWING: Figure 1

Description

The present invention relates to a multimodal speech recognition technique that recognizes speech using the speech and a gazing point image obtained at the time of utterance.

With the development of sensing technology, it has become possible to acquire various signals at the same time. Against this background, multimodal speech recognition techniques, which recognize speech using both speech and information other than speech, have been proposed in the field of speech recognition in order to improve speech recognition performance.
For example, Non-Patent Document 1 discloses a voice recognition technique using a voice and a lip image showing the movement of the mouth when the voice is uttered.
Further, Non-Patent Document 2 discloses an end-to-end speech recognition technique based on deep learning (deep learning of a neural network).

Non-Patent Document 1: "Multimodal Speech Recognition Using Lip Depth Images", Shohei Oshio et al., IPSJ Research Report, Vol. 2014-SLP-102, No. 2, 2014/7/24
Non-Patent Document 2: "Transition and Cutting Edge of Speech Recognition Technology", Tatsuya Kawahara, Journal of the Acoustical Society of Japan, Vol. 74, No. 7 (2018), pp. 381-386

Non-Patent Document 1 discloses that speech recognition performance is improved by additionally using lip images, but does not disclose the use of information other than the lips.
The end-to-end speech recognition technology based on deep learning disclosed in Non-Patent Document 2 is built on a sequence conversion model (Encoder-Decoder) that directly maps a feature sequence (feature vectors) acquired from speech to a character string. In recent years, attempts have been made to improve speech recognition performance by combining an attention mechanism (Attention) with the sequence conversion model and assigning weights to the feature vectors. However, Non-Patent Document 2 does not disclose combining an attention mechanism with multimodal speech recognition.
As a result of various studies on techniques for improving speech recognition performance, the present inventors have found that, when a person works while speaking, the speech and the gazing point are interrelated, and that speech recognition performance can therefore be improved by estimating the interrelationship between the speech and the gazing point.
The present invention has been devised in view of these points, and its object is to provide a multimodal speech recognition technique in which speech recognition performance is improved by recognizing speech using the speech and a gazing point image around the gazing point.

The first invention relates to a multimodal speech recognition device.
The first invention includes a voice information input means, a voice feature information extraction means, a gaze point image information input means, a gaze point feature information extraction means, a storage means, and a conversion means.
The voice information input means inputs voice information indicating the voice of the speaker. As the voice information input means, various voice information input means capable of inputting voice information can be used. Preferably, a voice information input means including a microphone for converting voice into an electric signal is used. It should be noted that a voice information input means including a storage medium in which voice information is stored in advance can also be used.
The voice feature information extracting means extracts voice feature information in chronological order from the voice information input from the voice information input means. As the voice feature information extraction means, a convolutional neural network (CNN) having a convolution layer and a pooling layer is preferably used.
The gazing point image information input means inputs gazing point image information indicating a gazing point image around the gazing point at which the speaker is gazing when uttering the voice. As the gazing point image information input means, various gazing point image information input means capable of inputting a gazing point image can be used. Preferably, a gazing point image information input means including a line-of-sight measuring device is used. Preferably, the line-of-sight measuring device is one that can output gazing point position information indicating the position of the gazing point in a subjective image captured by a camera built into the device.
The gazing point feature information extracting means extracts gazing point feature information in chronological order from the gazing point image information input from the gazing point image information input means. Preferably, the gazing point image information is extracted in synchronization with the voice feature information. As the gazing point feature information extraction means, a convolutional neural network (CNN) is used, as in the voice feature information extraction means.
Character string information is stored in the storage means. The character string information includes hiragana, katakana, numbers, common kanji, and the like that form text information such as documents.
The conversion means forms text information corresponding to the time-series voice feature information extracted by the voice feature information extraction means and the time-series gazing point feature information extracted by the gazing point feature information extraction means, using character string information selected from the character string information stored in the storage means. Preferably, the text information is output from an output means such as a display means.
As a method of forming text information corresponding to the time-series voice feature information and the time-series gazing point feature information, an appropriate method can be used.
The voice feature information extraction means, the gazing point feature information extraction means, and the conversion means can be implemented on a single computer or on separate computers. They can also be located remotely and connected via a communication line such as the Internet.
According to the first invention, since the voice is recognized by using the time-series voice feature information and the time-series gazing point image information, the voice recognition performance can be improved.
In a different form of the first invention, the conversion means includes a voice feature information coding means, a voice code weighting means, a gazing point feature information coding means, a gazing point code weighting means, and a decoding means.
The voice feature information coding means encodes the time-series voice feature information extracted by the voice feature information extraction means and outputs the time-series voice code. A recurrent neural network (RNN) is preferably used as the voice feature information coding means. For example, bi-directional long short term memory (BLSTM), which is a form of recurrent neural network (RNN), is used.
The voice code weighting means weights the time series voice code output from the voice feature information coding means and outputs the time series weighted voice code. As the voice code weighting means, it is possible to use voice code weighting means having various configurations capable of appropriately weighting the voice code of the time series.
The gazing point feature information coding means encodes the time-series gazing point feature information extracted by the gazing point feature information extracting means and outputs a time-series gazing point code. As the gazing point feature information coding means, bi-directional long short-term memory (BLSTM), which is a form of recurrent neural network (RNN), or the like is used, as in the voice feature information coding means.
The gazing point code weighting means weights the time-series gazing point code output from the gazing point feature information coding means and outputs a time-series weighted gazing point code. As the gazing point code weighting means, gazing point code weighting means of various configurations capable of appropriately weighting the time-series gazing point code can be used.
The decoding means forms text information corresponding to an integrated code (weighted integrated code), obtained by integrating the time-series weighted voice code output from the voice code weighting means and the time-series weighted gazing point code output from the gazing point code weighting means, using character string information selected from the character string information stored in the storage means.
A recurrent neural network (RNN) is preferably used as the decoding means. For example, long short-term memory (LSTM), which is a form of recurrent neural network (RNN), is used.
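As a rough illustration of how text can be formed from stored character string information, the sketch below picks, at each output step, the stored character with the highest score and stops at an end marker. The score values, the `<eos>` marker and the function name `greedy_decode` are assumptions made for this example; the description does not specify the selection rule, and in a real decoder the scores would come from the trained network.

```python
from typing import List, Sequence

EOS = "<eos>"  # hypothetical end-of-text marker, not named in the source


def greedy_decode(step_scores: Sequence[Sequence[float]],
                  stored_chars: Sequence[str]) -> str:
    """Form text by choosing the highest-scoring stored character per step."""
    text: List[str] = []
    for scores in step_scores:
        if len(scores) != len(stored_chars):
            raise ValueError("one score is required per stored character")
        best = max(range(len(scores)), key=lambda i: scores[i])
        if stored_chars[best] == EOS:
            break  # the decoder signalled the end of the text
        text.append(stored_chars[best])
    return "".join(text)


# Illustrative stored characters and made-up scores for three output steps.
chars = ["あ", "い", "1", EOS]
scores = [[0.1, 0.7, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1], [0.0, 0.1, 0.1, 0.8]]
print(greedy_decode(scores, chars))
```

Greedy selection is used here only because it is the shortest rule to show; other search strategies would fit the same interface.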
The voice feature information coding means, the voice code weighting means, the gazing point feature information coding means, the gazing point code weighting means, and the decoding means can be implemented on a single computer or on separate computers. They can also be located remotely and connected via a communication line such as the Internet.
In this embodiment, the correlation between the voice information and the gazing point image information can be accurately estimated.
In a different form of the first invention, an attention mechanism (Attention) of a sequence conversion model composed of a neural network is used as the voice code weighting means and the gazing point code weighting means.
In this embodiment, appropriate weights can be given to the time-series voice code and the time-series gazing point code, and the speech recognition performance can be reliably improved.
In a different form of the first invention, the gazing point image information input means can input gazing point position information indicating the position of the gazing point in the subjective image of the speaker (the imaging area of the gazing point image information input means). The text information is displayed on the display means in association with the position of the speaker's gazing point indicated by the gazing point position information. For example, the text information is displayed near the position of the gazing point in the subjective image that was captured by the gazing point image information input means and is shown on the display means.
In this embodiment, since the text information is displayed on the display means in association with the position of the gazing point, the voice uttered by the speaker and the gazing point of the speaker can be easily confirmed.
In a different form of the first invention, when text information displayed on the display means is selected, the voice information corresponding to the selected text information (the voice information that was input when the text information was recognized) is output.
In this embodiment, the voice uttered by the speaker can be easily confirmed.
The second invention relates to a multimodal speech recognition method.
The second invention has first to third steps.
In the first step, voice feature information is extracted in chronological order from the voice information indicating the voice of the speaker. The process of the first step is executed by, for example, the voice feature information extraction means of the first invention.
In the second step, the gazing point feature information is extracted in chronological order from the gazing point image information showing the gazing point image around the gazing point that the speaker is gazing at. The process of the second step is executed by, for example, the gazing point feature information extracting means of the first invention.
In the third step, text information corresponding to the extracted time-series voice feature information and the extracted time-series gazing point feature information is formed using character string information selected from the character string information stored in the storage means. The process of the third step is executed by, for example, the conversion means of the first invention. Preferably, the text information is output from an output means such as a display means.
The second invention has the same effect as the first invention.
In a different form of the second invention, the third step has fourth to eighth steps.
In the fourth step, the extracted time-series voice feature information is encoded and the time-series voice code is output. The process of the fourth step is executed by, for example, the voice feature information coding means of the first invention.
In the fifth step, the time-series voice code is weighted and the time-series weighted voice code is output. The process of the fifth step is executed by, for example, the voice code weighting means of the first invention.
In the sixth step, the time-series gazing point feature information is encoded and the time-series gazing point code is output. The process of the sixth step is executed by, for example, the gazing point feature information coding means of the first invention.
In the seventh step, the time-series gaze point code is weighted and the time-series weighted gaze point code is output. The process of the seventh step is executed by, for example, the gaze point code weighting means of the first invention.
In the eighth step, text information corresponding to an integrated code (weighted integrated code), in which the time-series weighted voice code and the time-series weighted gazing point code are integrated, is formed using character string information selected from the character string information stored in the storage means. The process of the eighth step is executed by, for example, the decoding means of the first invention.
Preferably, the voice code weighting process of the fifth step and the gazing point code weighting process of the seventh step are executed by the attention mechanism of a sequence conversion model composed of a neural network.
This embodiment has the same effect as that of the first invention.
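The order of the fourth to eighth steps can be sketched as a simple data flow. This is only an outline under stated assumptions: the function `recognize` and the stand-in callables (`identity`, `uniform_weight`, `concatenate`, `count_codes`) are hypothetical placeholders for trained components, and concatenation is just one possible way to integrate the two weighted series.

```python
from typing import Callable, List

Vector = List[float]
Series = List[Vector]


def recognize(voice_features: Series,
              gaze_features: Series,
              encode_voice: Callable[[Series], Series],
              weight_voice: Callable[[Series], Series],
              encode_gaze: Callable[[Series], Series],
              weight_gaze: Callable[[Series], Series],
              integrate: Callable[[Series, Series], Series],
              decode: Callable[[Series], str]) -> str:
    """Run the steps in the order listed in the description."""
    voice_codes = encode_voice(voice_features)      # fourth step
    weighted_voice = weight_voice(voice_codes)      # fifth step
    gaze_codes = encode_gaze(gaze_features)         # sixth step
    weighted_gaze = weight_gaze(gaze_codes)         # seventh step
    integrated = integrate(weighted_voice, weighted_gaze)
    return decode(integrated)                       # eighth step


# Hypothetical stand-ins so that the data flow can be executed.
def identity(series: Series) -> Series:
    return [list(v) for v in series]


def uniform_weight(series: Series) -> Series:
    n = len(series)
    return [[x / n for x in v] for v in series]


def concatenate(a: Series, b: Series) -> Series:
    return a + b


def count_codes(series: Series) -> str:
    return f"{len(series)} codes"


print(recognize([[1.0], [2.0], [3.0]], [[4.0], [5.0]],
                identity, uniform_weight, identity, uniform_weight,
                concatenate, count_codes))
```

Passing the components in as callables keeps the outline independent of any particular network library.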

The multimodal speech recognition device and the multimodal speech recognition method of the present invention can improve speech recognition performance by recognizing speech using the speech and a gazing point image around the gazing point.

FIG. 1 is a block diagram of one embodiment of the multimodal speech recognition device of the present invention.
FIG. 2 is a diagram explaining the function of the voice feature amount extraction means of the multimodal speech recognition device of one embodiment.
FIG. 3 is a diagram explaining the function of the voice feature amount coding means of the multimodal speech recognition device of one embodiment.
FIG. 4 is a diagram explaining the function of the voice code weighting means of the multimodal speech recognition device of one embodiment.
FIG. 5 is a diagram explaining the function of the gazing point image feature amount extraction means of the multimodal speech recognition device of one embodiment.
FIG. 6 is a diagram explaining the function of the gazing point image feature amount coding means of the multimodal speech recognition device of one embodiment.
FIG. 7 is a diagram explaining the function of the gazing point code weighting means of the multimodal speech recognition device of one embodiment.
FIG. 8 is a diagram explaining the function of the decoding means of the multimodal speech recognition device of one embodiment.
FIG. 9 is a diagram explaining the configuration of the multimodal speech recognition device of one embodiment.
FIG. 10 is a diagram explaining the operation of the voice code weighting means of the multimodal speech recognition device of one embodiment.
FIG. 11 is a diagram explaining the operation of the gazing point code weighting means of the multimodal speech recognition device of one embodiment.
FIG. 12 is a diagram showing a display example of the display means.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
A block diagram of one embodiment of the multimodal speech recognition device of the present invention is shown in FIG. 1.
The multimodal speech recognition device of the present embodiment uses an end-to-end speech recognition framework based on deep learning. Using a sequence conversion model having a plurality of attention mechanisms (Attention), it integrates voice information indicating the speaker's voice with a gazing point image around the gazing point of the speaker who is uttering the voice, and, based on the correlation between the two, converts the voice information into text information formed from character string information.

The multimodal speech recognition device of the present embodiment includes a processing means 10, a voice information input means 30, a gazing point image information input means 40, a storage means 50, a display means 60, and the like.

The voice information input means 30 inputs voice information indicating the voice uttered by the speaker. The voice information may be a voice waveform or a spectrum (frequency information). Preferably, the voice information input means 30 is composed of a microphone and an A/D conversion means. Of course, various voice information input means capable of inputting voice information can be used as the voice information input means 30.
The gazing point image information input means 40 inputs gazing point image information indicating a gazing point image around the gazing point at which the speaker is gazing while uttering the voice. As the gazing point image information input means 40, for example, a line-of-sight measuring device that can be worn by the speaker can be used. As the gazing point image around the gazing point, an image of the area around the gazing point (for example, a rectangular region of a predetermined pixel size centered on the gazing point) in the subjective image captured by the line-of-sight measuring device can be used. In this case, an image information extraction means that extracts the image information around the gazing point from the subjective image captured by the line-of-sight measuring device is preferably provided.
Preferably, the line-of-sight measuring device worn by the speaker is one that can output gazing point position information indicating the position of the speaker's gazing point in the subjective image it captures.
The position of the gazing point in the subjective image taken by the line-of-sight measuring device corresponds to the "position of the gazing point of the speaker" of the present invention.
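The rectangular region of a predetermined pixel size centered on the gazing point can be illustrated with a small cropping routine. This is a minimal sketch, assuming the subjective image is a list of pixel rows and the gazing point is given in pixel coordinates; the function name `crop_gaze_patch` and the choice to shift the window inward near the image border are assumptions, since the description does not say how borders are handled.

```python
from typing import List, Sequence

Image = Sequence[Sequence[int]]  # rows of pixel values


def crop_gaze_patch(image: Image, gaze_x: int, gaze_y: int,
                    size: int) -> List[List[int]]:
    """Return a size x size patch centred on (gaze_x, gaze_y).

    The window is shifted inward when the gazing point is near an edge,
    so the patch always has the requested size.
    """
    height, width = len(image), len(image[0])
    if size > height or size > width:
        raise ValueError("patch is larger than the image")
    half = size // 2
    left = min(max(gaze_x - half, 0), width - size)
    top = min(max(gaze_y - half, 0), height - size)
    return [list(row[left:left + size]) for row in image[top:top + size]]


# A 5 x 5 test image whose pixel value encodes its position (row * 10 + column).
image = [[r * 10 + c for c in range(5)] for r in range(5)]
print(crop_gaze_patch(image, gaze_x=2, gaze_y=2, size=3))
```

Keeping the patch size fixed, even at the border, means every gazing point image fed to the feature extraction stage has the same shape.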
Character string information is stored in the storage means 50. The character string information includes hiragana, katakana, numbers, common kanji, and the like that form text information such as documents.
The display means 60 is used when displaying a voice recognition result or the like.

The processing means 10 includes a conversion means 20, a voice feature amount extracting means 110, and a gazing point image feature amount extracting means 140.
As shown in FIG. 2, the voice feature amount extracting means 110 extracts voice feature amounts (voice feature amount vectors) X1 to Xn in time series from the voice information input from the voice information input means 30. A voice section of the voice information (a section presumed to be a meaningful word or sentence unit) can be identified by the silent sections that precede and follow it. The voice feature amount extraction means 110 outputs a voice feature amount series (voice feature amount vector series) {X1, X2, ..., Xn}.
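One simple way to find voice sections bounded by silence is to threshold a per-frame energy value. The sketch below is an illustration only: the energy threshold, the `min_silence` parameter and the function name `find_voice_sections` are assumptions, as the description states only that silent sections before and after a voice section are used.

```python
from typing import List, Optional, Sequence, Tuple


def find_voice_sections(frame_energy: Sequence[float],
                        threshold: float,
                        min_silence: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame index pairs, end exclusive, for voiced runs.

    A run is closed only after `min_silence` consecutive frames fall below
    `threshold`, so short pauses inside an utterance do not split it.
    """
    sections: List[Tuple[int, int]] = []
    start: Optional[int] = None  # first frame of the current voiced run
    silent = 0                   # consecutive silent frames inside the run
    for i, energy in enumerate(frame_energy):
        if energy >= threshold:
            if start is None:
                start = i
            silent = 0
        elif start is not None:
            silent += 1
            if silent >= min_silence:
                sections.append((start, i - silent + 1))
                start, silent = None, 0
    if start is not None:
        # The signal ended inside a run; drop any trailing silent frames.
        sections.append((start, len(frame_energy) - silent))
    return sections


energies = [0, 0, 5, 6, 0, 7, 0, 0, 0, 4, 4, 0]
print(find_voice_sections(energies, threshold=1, min_silence=2))
```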
In the present embodiment, a convolutional neural network (CNN) having a convolution layer and a pooling layer is used as the voice feature extraction means 110.
The voice feature amount extraction means 110 corresponds to the "voice feature information extraction means" of the present invention, and the voice feature amount series (voice feature amount vector series) {X1, X2, ..., Xn} corresponds to the "time-series voice feature information" of the present invention.
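The convolution and pooling layers mentioned for the CNN can be illustrated in one dimension. This is a simplified sketch with a hand-picked kernel: a trained network would learn its kernels and would typically operate on two-dimensional inputs, and the function names `conv1d` and `max_pool1d` are chosen for this example only.

```python
from typing import List, Sequence


def conv1d(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Valid-mode sliding dot product (cross-correlation, as used in CNN layers)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def max_pool1d(values: Sequence[float], width: int) -> List[float]:
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + width])
            for i in range(0, len(values) - width + 1, width)]


feature_map = conv1d([1, 2, 3, 4, 5, 6], [1, 1])
print(feature_map)                 # convolution output
print(max_pool1d(feature_map, 2))  # pooled output
```

The convolution produces a local feature at each position, and pooling then shortens the series while keeping the strongest response in each window.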

As shown in FIG. 5, the gazing point image feature amount extracting means 140 extracts gazing point image feature amounts (gazing point image feature amount vectors) Y1 to Ym in time series from the gazing point image information input from the gazing point image information input means 40. The extraction of the time-series gazing point image information by the gazing point image feature amount extraction means 140 is performed in synchronization with the extraction of the time-series voice feature information by the voice feature amount extraction means 110. The speaker's utterance and gazing action may be offset in time, but because the correspondence is estimated by the processing of the conversion means 20, they need not be completely synchronized. The gazing point image feature amount extraction means 140 outputs a gazing point image feature amount series (gazing point image feature amount vector series) {Y1, Y2, ..., Ym}.
In the present embodiment, a convolutional neural network (CNN) is used as the gazing point image feature amount extracting means 140, as in the voice feature amount extracting means 110.
The gazing point image feature amount extraction means 140 corresponds to the "gazing point feature information extraction means" of the present invention, and the gazing point image feature amount series (gazing point image feature amount vector series) {Y1, Y2, ..., Ym} corresponds to the "time-series gazing point feature information" of the present invention.

The conversion means 20 forms text information, which is the speech recognition result, based on the voice feature amount series {X1, X2, ..., Xn} (time-series voice feature amounts) and the gazing point image feature amount series {Y1, Y2, ..., Ym} (time-series gazing point image feature amounts), using character strings selected from the character strings stored in the storage means 50.
In the present embodiment, the conversion means 20 includes a voice feature amount coding means 120, a voice code weighting means 130, a gazing point image feature amount coding means 150, a gazing point code weighting means 160, an integrating means 170, and a decoding means 180.

As shown in FIG. 3, the voice feature amount coding means 120 encodes the time-series voice feature amounts X1 to Xn (voice feature amount series {X1, X2, ..., Xn}) extracted by the voice feature amount extraction means 110 and outputs time-series voice codes (voice code vectors) h1 to hn. That is, the voice feature amount coding means 120 outputs a voice code series (voice code vector series) {h1, h2, ..., hn}.
In the present embodiment, as shown in FIG. 9, bi-directional long short-term memory (BLSTM), which is a form of recurrent neural network (RNN), is used as the voice feature amount coding means 120.
The voice feature amount coding means 120 corresponds to the "voice feature information coding means" of the present invention, and the voice code series (voice code vector series) {h1, h2, ..., hn} corresponds to the "time-series voice code" of the present invention.

The voice code weighting means 130 assigns weights to the time-series voice codes h1 to hn (voice code series {h1, h2, ..., hn}) output from the voice feature amount coding means 120 and outputs time-series weighted voice codes (weighted voice code vectors) (a1 * h1) to (an * hn). That is, the voice code weighting means 130 outputs a weighted voice code series (weighted voice code vector series) {a1 * h1, a2 * h2, ..., an * hn}. The weights a1 to an are set so that their sum is 1.
In the present embodiment, a bidirectional long short-term memory (BLSTM), which is one form of recurrent neural network (RNN), is used as the voice code weighting means 130.
The operation of the voice code weighting means 130 is described later.

As shown in FIG. 6, the gazing point image feature amount coding means 150 encodes the time-series gazing point image feature amounts Y1 to Ym (gazing point image feature amount series {Y1, Y2, ..., Ym}) extracted by the gazing point image feature amount extraction means 140 and outputs time-series gazing point codes (gazing point code vectors) s1 to sm. That is, the gazing point image feature amount coding means 150 outputs a gazing point code series (gazing point code vector series) {s1, s2, ..., sm}.
In the present embodiment, a bidirectional long short-term memory (BLSTM), which is one form of recurrent neural network (RNN), is used as the gazing point image feature amount coding means 150.
The gazing point image feature amount coding means 150 corresponds to the "gazing point feature information coding means" of the present invention, and the gazing point code series (gazing point code vector series) {s1, s2, ..., sm} corresponds to the "time-series gazing point code" of the present invention.

The gazing point code weighting means 160 assigns weights to the time-series gazing point codes s1 to sm output from the gazing point image feature amount coding means 150 and outputs time-series weighted gazing point codes (weighted gazing point code vectors) (b1 * s1) to (bm * sm). That is, the gazing point code weighting means 160 outputs a weighted gazing point code series (weighted gazing point code vector series) {b1 * s1, b2 * s2, ..., bm * sm}. The weights b1 to bm are set so that their sum is 1.
In the present embodiment, the attention mechanism (Attention) of a sequence conversion model composed of a neural network is used as the gazing point code weighting means 160.
The operation of the gazing point code weighting means 160 is described later.

The integration means 170 integrates the time-series weighted voice codes (a1 * h1) to (an * hn) (weighted voice code series {a1 * h1, a2 * h2, ..., an * hn}) output from the voice code weighting means 130 with the weighted gazing point codes (b1 * s1) to (bm * sm) (weighted gazing point code series {b1 * s1, b2 * s2, ..., bm * sm}) output from the gazing point code weighting means 160, and outputs time-series integrated weighted codes (a1 * h1 + b1 * s1) to (an * hn + bm * sm) (weighted code series {a1 * h1 + b1 * s1, a2 * h2 + b2 * s2, ..., an * hn + bm * sm}).

As shown in FIG. 9, the decoding means 180 forms the text information corresponding to the time-series integrated weighted codes (a1 * h1 + b1 * s1) to (an * hn + bm * sm) (weighted code series {a1 * h1 + b1 * s1, a2 * h2 + b2 * s2, ..., an * hn + bm * sm}) output from the integration means 170, using character string information C1 to Ci selected from the character string information stored in the storage means 50.
One method of selecting the character string information C1 to Ci is, for example, to convert the code output from the hidden layer of each LSTM at each time into an output score (probability value) for each character string with the Softmax function, and then to select a character string with a high output score.
The text information (speech recognition result) decoded by the decoding means 180 is displayed on the display means 60.
In the present embodiment, a long short-term memory (LSTM), which is one form of recurrent neural network (RNN), is used as the decoding means 180.
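
The selection step described above can be sketched as follows. This is a minimal illustration, assuming that the single highest-scoring string is taken at each time step (greedy selection); the source only states that a string with a high output score is selected. The names `softmax`, `select_string`, `vocabulary`, and `scores` are hypothetical and do not come from the source.

```python
import math


def softmax(scores):
    """Convert raw scores into probability values that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_string(scores, vocabulary):
    """Return the stored string with the highest output score and that score."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return vocabulary[best], probs[best]


# Hypothetical stored character strings and decoder scores for one time step.
vocabulary = ["circuit breaker", "line switch", "745", "select"]
scores = [0.2, 2.5, 0.1, -1.0]
choice, prob = select_string(scores, vocabulary)
```

Repeating this selection at each decoder time step yields the sequence C1 to Ci that forms the text information.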

The present embodiment has a voice channel that processes voice information and a gazing point image channel that processes gazing point images. The voice channel consists of the voice information input means 30, the voice feature amount extraction means 110, the voice feature amount coding means 120, and the voice code weighting means 130. The gazing point image channel consists of the gazing point image information input means 40, the gazing point image feature amount extraction means 140, the gazing point image feature amount coding means 150, and the gazing point code weighting means 160.
The voice feature amount extraction means 110, the voice feature amount coding means 120, the voice code weighting means 130, the gazing point image feature amount extraction means 140, the gazing point image feature amount coding means 150, the gazing point code weighting means 160, the integration means 170, and the decoding means 180 can be implemented on a common computer or on separate computers.
It is also possible to place one means at a location remote from another and to exchange information between them via a communication line such as the Internet.

Next, the learning operation of the multimodal speech recognition device of the present embodiment will be described.
In learning, the various weight parameters of the neural networks constituting the conversion means 20 (sequence conversion model) are learned iteratively by error backpropagation, using teacher information created in advance and stored in the storage means 50 (pairs of voice information and gazing point image information as input and text information as output). For example, the voice information "circuit breaker 745 selection" and the "gazing point image information series when selecting circuit breaker 745" are input. The various weight parameters of the conversion means 20 are then learned so that the error between the text information output from the decoding means 180 and the input voice information "circuit breaker 745 selection" is minimized.

Next, the speech recognition operation of the multimodal speech recognition device of the present embodiment will be described.
When voice information and gazing point image information are input, processing by the processing means 10 of the present invention starts. The processing operation of the processing means 10 is as described above.
Here, the weighting operation of the voice code weighting means 130 will be described with reference to FIG. 10.
End-to-end speech recognition based on deep learning has an encoder and a decoder that decodes the voice feature amount series (voice feature amount vectors) obtained from the voice into character strings. The encoder converts the voice feature amount series into hidden state vectors, and the decoder converts the code series, via the hidden state vectors, into text information, which is the recognition result.
As shown in FIG. 9, the present embodiment has a voice encoder that encodes the voice feature amount series, a gazing point encoder that encodes the gazing point image feature amount series (gazing point image feature amount vectors), and a decoder that integrates the voice code series (voice code vectors) and the gazing point code series (gazing point code vectors) and maps them to character strings. Furthermore, in the present embodiment, the decoder assigns weights (Attention) to each of the voice code series and the gazing point code series to generate the integrated weighted code series. The voice encoder is constituted by the voice feature amount coding means 120, and the gazing point encoder is constituted by the gazing point image feature amount coding means 150. The decoder is constituted by the voice code weighting means 130, the gazing point code weighting means 160, the integration means 170, and the decoding means 180.
The voice code weights (voice code weight vector) in the time section at an arbitrary time t of the decoding means 180, shown by the dash-dot line in FIG. 10, are assigned dynamically based on the hidden state vector series {h(1), ..., h(n)} output from the hidden layer of the BLSTM of the voice feature amount coding means 120 and the hidden state vector u(t-1) of the LSTM of the decoding means 180 at the immediately preceding time (t-1). For example, the similarity a(i) (i = 1, ..., n) between the hidden state vector u(t-1) and each element of the hidden state vector series {h(1), ..., h(n)} can be obtained numerically as the inner product a(i) = u(t-1) · h(i) (i = 1, ..., n). The voice code weights a(i) are normalized so that they sum to 1. The input from the voice encoder to the decoder at time t is expressed, using the voice code weights a(i) (i = 1, ..., n) and the hidden state vector series {h(1), ..., h(n)}, as [a(1) * h(1) + ... + a(n) * h(n)].
Various evaluation measures can be used to calculate the similarity.
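
The weighting described above can be sketched as follows. This is an illustrative example only: the source says the weights are normalized to sum to 1 but does not name the normalization, so the Softmax used here is an assumption, as are the function names and the toy vectors.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def attention_weights(u_prev, states):
    """Weights a(i) from the decoder state u(t-1) and encoder states h(1..n).

    Similarity is the inner product u(t-1) . h(i); the Softmax normalization
    is an assumption made for this sketch.
    """
    sims = [dot(u_prev, h) for h in states]
    peak = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(weights, states):
    """Weighted sum a(1)*h(1) + ... + a(n)*h(n)."""
    dim = len(states[0])
    return [sum(w * h[k] for w, h in zip(weights, states)) for k in range(dim)]


# Toy encoder hidden states h(1..3) and previous decoder state u(t-1).
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
u_prev = [2.0, 0.0]
a = attention_weights(u_prev, h)
c = context_vector(a, h)
```

Because the weights depend on u(t-1), they are recomputed at every decoder time step, which is what makes the weighting dynamic.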

The weighting operation of the gazing point code weighting means 160 will be described with reference to FIG. 11.
The gazing point code weights (gazing point code weight vector) in the time section at an arbitrary time t of the decoding means 180, shown by the dash-dot line in FIG. 11, are assigned dynamically based on the hidden state vector series {s(1), ..., s(m)} output from the hidden layer of the BLSTM of the gazing point image feature amount coding means 150 and the hidden state vector u(t-1) of the LSTM of the decoding means 180 at the immediately preceding time (t-1). For example, the similarity b(j) (j = 1, ..., m) between the hidden state vector u(t-1) and each element of the hidden state vector series {s(1), ..., s(m)} can be obtained numerically as the inner product b(j) = u(t-1) · s(j) (j = 1, ..., m). The gazing point code weights b(j) are normalized so that they sum to 1. The input from the gazing point encoder to the decoder at time t is expressed, using the gazing point code weights b(j) (j = 1, ..., m) and the hidden state vector series {s(1), ..., s(m)}, as [b(1) * s(1) + ... + b(m) * s(m)].
Various evaluation measures can be used to calculate the similarity.

In this way, weights can be assigned dynamically to the output codes from the voice encoder and the gazing point encoder.
Next, the integration means 170 integrates the weighted voice code series and gazing point code series obtained by the above method as r(t) = [a(1) * h(1) + ... + a(n) * h(n)] + [b(1) * s(1) + ... + b(m) * s(m)], and this r(t) is used as the input to the LSTM of the decoding means 180 (decoder) at time t.
Then, as described above, the decoding means 180 converts the character strings output from each LSTM at each time into output scores (probability values) for those character strings with the Softmax function, and selects character strings with high output scores to form the text information (speech recognition result).
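
The integration r(t) described above can be sketched as follows. This is a minimal illustration, assuming that the voice and gazing point hidden state vectors have the same dimension so that the two weighted sums can be added element by element; the function names and the toy values are hypothetical.

```python
def weighted_sum(weights, states):
    """Return w(1)*x(1) + ... + w(k)*x(k) for a list of vectors x."""
    dim = len(states[0])
    return [sum(w * x[d] for w, x in zip(weights, states)) for d in range(dim)]


def integrate(a, h, b, s):
    """r(t) = [a(1)*h(1)+...+a(n)*h(n)] + [b(1)*s(1)+...+b(m)*s(m)]."""
    voice = weighted_sum(a, h)  # contribution of the voice channel
    gaze = weighted_sum(b, s)   # contribution of the gazing point channel
    return [v + g for v, g in zip(voice, gaze)]


# Toy example with n = 2 voice codes and m = 3 gazing point codes.
a = [0.5, 0.5]
h = [[1.0, 0.0], [0.0, 1.0]]
b = [0.25, 0.25, 0.5]
s = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
r = integrate(a, h, b, s)
```

The two series may have different lengths (n and m), because each is first reduced to a single vector before the sum is taken.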

As described above, by weighting the voice code series with the voice code weighting means 130 (code weighting in the voice channel) and weighting the gazing point code series with the gazing point code weighting means 160 (weighting in the gazing point image channel), the text information (the character string information constituting the text information) corresponding to the integrated weighted code series input to the decoding means 180 can be formed while estimating the correlation between the voice information and the gazing point image information.
The voice uttered by a speaker and the speaker's gazing point at the time of utterance are related to each other.
Therefore, in the present embodiment, speech recognition performance can be improved by performing speech recognition while estimating the interrelationship between the speaker's voice and the gazing point.

To confirm the effect of the present embodiment, a comparative experiment was conducted between (Model 1), which uses only voice information, and (Model 2), which uses voice information and gazing point image information, and the character error rate (CER), the error rate in character units, was obtained. CER is expressed as [CER = (S + D + I) * 100 / N], where S is the number of substitution errors, D is the number of deletion errors, I is the number of insertion errors, and N is the number of characters in the reference text.
In the experiment, the CER was 7.2% for (Model 1) and fell to 6.9% for (Model 2), confirming the effect of applying the configuration of the present invention.
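
The CER formula above can be illustrated with a short sketch. It assumes that S + D + I is taken as the minimum number of character edits (the Levenshtein distance) between the reference text and the recognition result, which is the usual convention but is not stated in the source; the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions, and insertions (S + D + I)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, c in enumerate(hyp, start=1):
            cost = 0 if r == c else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent: (S + D + I) * 100 / N."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) * 100 / len(ref)


# One substituted character in a four-character reference gives 25 %.
rate = cer("abcd", "abxd")
```

Because insertions are counted in the numerator but N is the reference length, the CER can exceed 100% when the recognition result is much longer than the reference.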

In the above embodiment, an integrated code series (integrated code vector) r ([r = a * h + b * s]) that integrates the voice code series (voice code vector) h and the gazing point code series (gazing point code vector) s in equal proportion is used, but the fusion ratio of the voice code series h and the gazing point code series s can also be changed. For example, an integrated code series r expressed as [r = a * h + g * (b * s)] can be used, where g is a fusion weight (fusion weight vector) indicating the fusion ratio of the gazing point code series. The fusion weight vector may be fixed or assigned dynamically.
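
The fusion weight can be sketched as follows. This is an illustration under stated assumptions: `voice_context` stands for the already weighted voice term a * h, `gaze_context` for the weighted gazing point term b * s, and a vector-valued g is applied element by element. The source does not specify how g is applied, so these choices and the function name `fuse` are hypothetical.

```python
def fuse(voice_context, gaze_context, g=1.0):
    """r = a*h + g*(b*s), where g is a scalar or a per-dimension vector."""
    if isinstance(g, (int, float)):
        g = [g] * len(gaze_context)  # broadcast a scalar fusion weight
    return [v + gk * s for v, gk, s in zip(voice_context, g, gaze_context)]


voice_context = [0.5, 0.5]  # stands for a(1)*h(1) + ... + a(n)*h(n)
gaze_context = [1.0, 1.0]   # stands for b(1)*s(1) + ... + b(m)*s(m)

equal = fuse(voice_context, gaze_context)                     # g = 1
halved = fuse(voice_context, gaze_context, g=0.5)             # fixed scalar
per_dim = fuse(voice_context, gaze_context, g=[0.0, 2.0])     # vector g
```

With g = 1 this reduces to the equal-proportion integration of the embodiment; a dynamically assigned g would be recomputed at each time step rather than passed as a constant.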

Next, the method of outputting the speech recognition result will be described.
In the present embodiment, the text information decoded by the decoding means 180 is displayed on the display means 60.
FIG. 12 shows an example of a display screen 200 that displays the text information.
The display screen 200 shown in FIG. 12 displays an operation panel 300 for turning circuit breakers and line switches on and off. The operation panel 300 is provided with circuit breaker selection buttons 311 to 313 operated to select circuit breakers 740, 742, and 745, line switch selection buttons 314 to 316 operated to select line switches 740, 742, and 745, an on button 317 operated when turning on, and an off button 318 operated when shutting off.
Suppose that the speaker utters "line switch 745 selection operation" while gazing at the line switch selection button 316 of the operation panel 300, and that the text information "line switch 745 selection operation" is recognized and output from the decoding means 180. In the present embodiment, the text information "line switch button 745 operation selection" is displayed on the display screen 200 showing the operation panel 300 at a position related to the gazing point, which in FIG. 12 is the location corresponding to the line switch selection button 316.
This makes it easy to identify the speaker's voice and the position of the gazing point.
FIG. 12 also shows that, after uttering "line switch 745 selection operation", the speaker utters "turn on" while gazing at the on button 317 of the operation panel 300, so that the text information "turn on" is displayed at a position related to the gazing point, which in FIG. 12 is the location corresponding to the on button 317.
With the text information displayed on the display screen 200, the device can also be configured so that selecting (for example, touching) the displayed text information outputs the voice information corresponding to it, for example the voice information that was input when that text information was recognized, from a voice output means such as a speaker.
The processing of displaying the text information on the display means 60, the processing of outputting the voice information corresponding to the text information from the voice output means, and the like are executed, for example, by the processing means 10.

Although the multimodal speech recognition device has been described above, the present invention can also be configured as a multimodal speech recognition method.
(Aspect 1)
A multimodal speech recognition method comprising:
a first step of extracting voice feature information in time series from voice information indicating a speaker's voice;
a second step of extracting gazing point feature information in time series from gazing point image information indicating a gazing point image around the gazing point at which the speaker is gazing; and
a third step of forming text information corresponding to the extracted time-series voice feature information and the extracted time-series gazing point feature information from character string information selected from the character string information stored in a storage means.
(Aspect 2)
The multimodal speech recognition method of Aspect 1, wherein the third step has:
a fourth step of encoding the extracted time-series voice feature information and outputting time-series voice codes;
a fifth step of assigning weights to the time-series voice codes and outputting time-series weighted voice codes;
a sixth step of encoding the time-series gazing point feature information and outputting time-series gazing point codes;
a seventh step of assigning weights to the time-series gazing point codes and outputting time-series weighted gazing point codes; and
an eighth step of forming text information corresponding to an integrated code, in which the time-series weighted voice codes and the time-series weighted gazing point codes are integrated, from character string information selected from the character string information stored in the storage means.
Such a multimodal speech recognition method also has the same effects as the multimodal speech recognition device described above.

The present invention is not limited to the configuration described in the embodiment, and various changes, additions, and deletions are possible.
In the embodiment, voice information and gazing point image information are used as the multimodal information, but three or more kinds of information can also be used. For example, voice information, gazing point image information, and gesture information (body and hand gestures) can be used as the multimodal information.
Although the gazing point image information is input using the visible-light image sensor of a line-of-sight measuring device, it can also be input using various sensors such as infrared sensors and ultraviolet sensors.
The multimodal speech recognition device and multimodal speech recognition method of the present invention are not limited to checking workers' operations, and can be used in various fields such as creating subtitles for and searching videos with audio, and skill transfer and education and training using video.
As the voice information input means, voice information input means of various configurations capable of inputting voice information can be used. A storage means or the like in which voice information is stored in advance can also be used as the voice information input means.
As the gazing point image information input means, gazing point image information input means of various configurations capable of inputting gazing point image information around the gazing point can be used. A storage means or the like in which gazing point image information is stored in advance can also be used as the gazing point image information input means.
The configurations of the voice feature amount extraction means (voice feature information extraction means), the voice feature amount coding means (voice feature information coding means), the voice code weighting means, the gazing point image feature amount extraction means (gazing point feature information extraction means), the gazing point image feature amount coding means (gazing point feature information coding means), the gazing point code weighting means, the integration means, and the decoding means are not limited to those described in the embodiment.
The method of displaying the speech recognition result and the like on the display means is not limited to the method described in the embodiment.
The method of outputting the speech recognition result and the like is not limited to displaying it on the display means. For example, a method of transmitting it to a remote management device via a communication line can also be used.

10 Processing means
20 Conversion means
30 Voice information input means
40 Gazing point image information input means
50 Storage means
60 Display means
110 Voice feature amount extraction means (voice feature information extraction means)
120 Voice feature amount coding means (voice feature information coding means)
130 Voice code weighting means
140 Gazing point image feature amount extraction means (gazing point feature information extraction means)
150 Gazing point image feature amount coding means (gazing point feature information coding means)
160 Gazing point code weighting means
170 Integration means
180 Decoding means
200 Display screen
300 Operation panel
311 to 313 Circuit breaker selection buttons
314 to 316 Line switch selection buttons
317 On button
318 Off button
321, 322 Text information display units

Claims

A multimodal speech recognition device comprising:
voice information input means for inputting voice information indicating a speaker's voice;
gazing point image information input means for inputting gazing point image information indicating a gazing point image around the gazing point at which the speaker is gazing;
storage means for storing character string information;
voice feature information extraction means for extracting voice feature information in time series from the voice information input from the voice information input means;
gazing point feature information extraction means for extracting gazing point feature information in time series from the gazing point image information input from the gazing point image information input means; and
conversion means for forming text information corresponding to the time-series voice feature information extracted by the voice feature information extraction means and the time-series gazing point feature information extracted by the gazing point feature information extraction means, from character string information selected from the character string information stored in the storage means.

The multimodal speech recognition device according to claim 1, wherein the conversion means has:
voice feature information coding means for encoding each piece of the time-series voice feature information extracted by the voice feature information extraction means and outputting time-series voice codes;
voice code weighting means for assigning a weight to each of the time-series voice codes output from the voice feature information coding means and outputting time-series weighted voice codes;
gazing point feature information coding means for encoding each piece of the time-series gazing point feature information extracted by the gazing point feature information extraction means and outputting time-series gazing point codes;
gazing point code weighting means for assigning a weight to each of the time-series gazing point codes output from the gazing point feature information coding means and outputting time-series weighted gazing point codes; and
decoding means for forming text information corresponding to an integrated code, in which the time-series weighted voice codes output from the voice code weighting means and the time-series weighted gazing point codes output from the gazing point code weighting means are integrated, from character string information selected from the character string information stored in the storage means.

The multimodal speech recognition device according to claim 2,
Wherein an attention mechanism of a sequence-to-sequence (series conversion) model composed of a neural network is used as the voice code weighting means and the gazing point code weighting means.
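
As an illustration only, the weighting, integration, and decoding recited above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation of the claimed device: the function names, the dot-product scoring, the use of a query vector standing in for a decoder state, the collapse of each weighted code series into one summed vector, integration by concatenation, and the toy vectors attached to the stored character strings are all hypothetical choices made for the example.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw scores into non-negative weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_weights(codes: Sequence[Vector], query: Vector) -> List[float]:
    """One weight per time step (assumption: dot-product scoring against a query)."""
    return softmax([dot(code, query) for code in codes])


def weighted_context(codes: Sequence[Vector], query: Vector) -> Vector:
    """Weight every time-series code and sum the weighted codes into one vector."""
    weights = attention_weights(codes, query)
    dim = len(codes[0])
    return [sum(w * code[d] for w, code in zip(weights, codes)) for d in range(dim)]


def integrate(voice_context: Vector, gaze_context: Vector) -> Vector:
    """Integrate the two modalities (assumption: simple concatenation)."""
    return list(voice_context) + list(gaze_context)


def decode_step(integrated: Vector, stored_strings: Dict[str, Vector]) -> str:
    """Select the stored character string whose (hypothetical) vector scores highest."""
    return max(stored_strings, key=lambda s: dot(stored_strings[s], integrated))
```

In this sketch the voice codes and the gazing point codes are weighted separately, each with its own query, before they are combined, which mirrors the separate voice code weighting means and gazing point code weighting means. A trained model would learn the encoders, the scoring function, and the decoder jointly rather than use fixed vectors.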

The multimodal speech recognition device according to any one of claims 1 to 3,
Further comprising display means,
Wherein the gazing point image information input means is capable of inputting gazing point position information indicating the position of the gazing point in a subjective image of the speaker, and
The text information is displayed on the display means in association with the position of the speaker's gazing point indicated by the gazing point position information.
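
One way to picture the association between recognized text and the gazing point position is the following minimal Python sketch. It is an assumption-laden illustration, not the claimed display means: the `TextLabel` type, the `place_text` function, the linear scaling from subjective-image coordinates to display coordinates, and the clamping to the display bounds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TextLabel:
    """Recognized text together with the display position it is tied to."""
    text: str
    x: int
    y: int


def place_text(
    text: str,
    gaze_xy: Tuple[float, float],
    image_size: Tuple[int, int],
    display_size: Tuple[int, int],
) -> TextLabel:
    """Map a gazing point in the subjective image to display coordinates.

    Assumption: the display shows the whole subjective image, so a linear
    scale is enough. Out-of-range positions are clamped to the display edge.
    """
    gaze_x, gaze_y = gaze_xy
    image_w, image_h = image_size
    display_w, display_h = display_size
    x = round(gaze_x / image_w * display_w)
    y = round(gaze_y / image_h * display_h)
    x = min(max(x, 0), display_w - 1)
    y = min(max(y, 0), display_h - 1)
    return TextLabel(text=text, x=x, y=y)
```

For example, `place_text("turn on", (320, 240), (640, 480), (1280, 960))` ties the text to the center of a display that is twice the size of the subjective image.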

The multimodal speech recognition device according to claim 4,
Wherein, when text information displayed on the display means is selected, the voice information corresponding to the selected text information is output.

A multimodal speech recognition method comprising:
A first step of extracting voice feature information in time series from voice information indicating a speaker's voice;
A second step of extracting gazing point feature information in time series from gazing point image information indicating a gazing point image around the gazing point at which the speaker is gazing; and
A third step of forming text information, corresponding to the extracted time-series voice feature information and the extracted time-series gazing point feature information, from character string information selected from character string information stored in storage means.

The multimodal speech recognition method according to claim 6, wherein the third step comprises:
A fourth step of encoding each item of the extracted time-series voice feature information and outputting time-series voice codes;
A fifth step of assigning a weight to each of the time-series voice codes and outputting time-series weighted voice codes;
A sixth step of encoding each item of the time-series gazing point feature information and outputting time-series gazing point codes;
A seventh step of assigning a weight to each of the time-series gazing point codes and outputting time-series weighted gazing point codes; and
An eighth step of forming text information, corresponding to an integrated code obtained by integrating the time-series weighted voice codes and the time-series weighted gazing point codes, from character string information selected from the character string information stored in the storage means.