JP7414231B2

JP7414231B2 - Multimodal speech recognition device and multimodal speech recognition method

Info

Publication number: JP7414231B2
Application number: JP2019129656A
Authority: JP
Inventors: 修瀬川; 知樹林; 一哉武田
Original assignee: Chubu Electric Power Co Inc; Tokai National Higher Education and Research System NUC
Current assignee: Chubu Electric Power Co Inc; Tokai National Higher Education and Research System NUC
Priority date: 2019-07-11
Filing date: 2019-07-11
Publication date: 2024-01-16
Anticipated expiration: 2039-07-11
Also published as: JP2021015189A

Description

The present invention relates to multimodal speech recognition technology that recognizes speech using the speech itself and a gaze point image captured at the time of utterance.

With advances in sensing technology, it has become possible to acquire various signals simultaneously. Against this background, multimodal speech recognition technology, which recognizes speech using both speech and information other than speech, has been proposed in the field of speech recognition in order to improve recognition performance.
For example, Non-Patent Document 1 discloses a speech recognition technology that uses speech together with lip images showing the movement of the mouth during utterance.
Non-Patent Document 2 discloses end-to-end speech recognition technology based on deep learning (deep training of neural networks).

Non-Patent Document 1: Shohei Oshio et al., "Multimodal speech recognition using depth images of lips", Information Processing Society of Japan Research Report, Vol. 2014-SLP-102, No. 2, 2014/7/24
Non-Patent Document 2: Tatsuya Kawahara, "Transition and cutting edge of speech recognition technology", Journal of the Acoustical Society of Japan, Vol. 74, No. 7 (2018), pp. 381-386

Non-Patent Document 1 discloses that speech recognition performance improves when lip images are used together with speech, but it does not disclose the use of information other than the lips.
The end-to-end speech recognition technology based on deep learning disclosed in Non-Patent Document 2 is built on a sequence conversion model (Encoder-Decoder) that directly maps a feature sequence (feature vectors) obtained from speech to a character string. In recent years, attempts have been made to improve speech recognition performance by combining the sequence conversion model with an attention mechanism (Attention) that assigns weights to the feature vectors. However, Non-Patent Document 2 does not disclose combining an attention mechanism with multimodal speech recognition.
As a result of various studies on techniques for improving speech recognition performance, the inventors found that when a person works while speaking, the speech and the gaze point are mutually related; that is, speech recognition performance can be improved by estimating the correlation between the speech and the gaze point.
The present invention was devised in view of these points, and its object is to provide a multimodal speech recognition technology that improves speech recognition performance by recognizing speech using the speech and a gaze point image of the area around the gaze point.

The first invention relates to a multimodal speech recognition device.
The first invention includes a voice information input means, a voice feature information extraction means, a gaze point image information input means, a gaze point feature information extraction means, a storage means, and a conversion means.
The voice information input means inputs voice information indicating the voice of the speaker. Any of various voice information input means capable of inputting voice information can be used. Preferably, a voice information input means including a microphone that converts voice into an electrical signal is used. A voice information input means including a storage medium in which voice information is stored in advance can also be used.
The voice feature information extraction means extracts voice feature information in time series from the voice information input from the voice information input means. A convolutional neural network (CNN) having a convolution layer and a pooling layer is preferably used as the voice feature information extraction means.
The gaze point image information input means inputs gaze point image information indicating a gaze point image of the area around the gaze point at which the speaker is gazing while speaking. Any of various gaze point image information input means capable of inputting gaze point images can be used. Preferably, a gaze point image information input means including a line-of-sight measuring device is used, in particular one that can output gaze point position information indicating the position of the gaze point in the subjective image captured by the camera built into the device.
The gaze point feature information extraction means extracts gaze point feature information in time series from the gaze point image information input from the gaze point image information input means. Preferably, the gaze point image information is extracted in synchronization with the voice feature information. As with the voice feature information extraction means, a convolutional neural network (CNN) is used as the gaze point feature information extraction means.
The storage means stores character string information. The character string information includes hiragana, katakana, numbers, commonly used kanji, etc. that form text information such as documents.
The conversion means forms text information corresponding to the time-series voice feature information extracted by the voice feature information extraction means and the time-series gaze point feature information extracted by the gaze point feature information extraction means, using character string information selected from the character string information stored in the storage means. Preferably, the text information is output from an output means such as a display means.
Any appropriate method can be used to form text information corresponding to time-series voice feature information and time-series gaze point feature information.
The voice feature information extraction means, the gaze point feature information extraction means, and the conversion means can be implemented on a single computer or on separate computers. They can also be placed at remote locations and connected via a communication line such as the Internet.
In the first invention, since speech is recognized using time-series voice feature information and time-series gaze point image information, it is possible to improve speech recognition performance.
In a different form of the first invention, the conversion means includes a voice feature information encoding means, a voice code weighting means, a gaze point feature information encoding means, a gaze point code weighting means, and a decoding means.
The voice feature information encoding means encodes the time-series voice feature information extracted by the voice feature information extraction means and outputs time-series voice codes. A recurrent neural network (RNN) is preferably used as the voice feature information encoding means. For example, bidirectional long short-term memory (BLSTM: Bi-directional Long Short Term Memory), which is a form of RNN, is used.
The voice code weighting means assigns weights to the time-series voice codes output from the voice feature information encoding means and outputs time-series weighted voice codes. Voice code weighting means of various configurations can be used, provided they can assign appropriate weights to the time-series voice codes.
The gaze point feature information encoding means encodes the time-series gaze point feature information extracted by the gaze point feature information extraction means and outputs time-series gaze point codes. As with the voice feature information encoding means, bidirectional long short-term memory (BLSTM), which is a form of recurrent neural network (RNN), or the like is used as the gaze point feature information encoding means.
The gaze point code weighting means assigns weights to the time-series gaze point codes output from the gaze point feature information encoding means and outputs time-series weighted gaze point codes. Gaze point code weighting means of various configurations can be used, provided they can assign appropriate weights to the time-series gaze point codes.
The decoding means forms text information corresponding to an integrated code (weighted integrated code), which integrates the time-series weighted voice codes output from the voice code weighting means and the time-series weighted gaze point codes output from the gaze point code weighting means, using character string information selected from the character string information stored in the storage means.
A recurrent neural network (RNN) is preferably used as the decoding means. For example, long short-term memory (LSTM), which is a form of recurrent neural network (RNN), is used.
The voice feature information encoding means, the voice code weighting means, the gaze point feature information encoding means, the gaze point code weighting means, and the decoding means can be implemented on a single computer or on separate computers. They can also be placed at remote locations and connected via a communication line such as the Internet.
In this form, the correlation between the voice information and the gaze point image information can be estimated accurately.
In a different form of the first invention, an attention mechanism (Attention) of a sequence conversion model constituted by a neural network is used as the voice code weighting means and the gaze point code weighting means.
In this form, appropriate weights can be assigned to the time-series voice codes and the time-series gaze point codes, and speech recognition performance can be reliably improved.
In a different form of the first invention, the gaze point image information input means can input gaze point position information indicating the position of the gaze point in the speaker's subjective image (the imaging area of the gaze point image information input means). The text information is displayed on the display means in association with the position of the speaker's gaze point indicated by the gaze point position information. For example, the text information is displayed near the position of the gaze point in the subjective image that was captured by the gaze point image information input means and is shown on the display means.
In this form, because the text information is displayed on the display means in association with the position of the gaze point, both the voice uttered by the speaker and the speaker's gaze point can be easily confirmed.
In a different form of the first invention, when text information displayed on the display means is selected, the voice information corresponding to the selected text information (the voice information that was input when that text information was recognized) is output.
In this form, the voice uttered by the speaker can be easily confirmed.
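The two display-related forms above can be pictured as a small record that keeps each recognized text together with the gaze point position and the input voice it came from. The following Python sketch is only an illustration under assumptions: the names `RecognizedUtterance`, `label_position`, and `audio_for_selection`, the pixel offset, and the use of raw bytes for the voice information are all hypothetical and are not taken from this description.

```python
from dataclasses import dataclass


@dataclass
class RecognizedUtterance:
    """One recognition result (hypothetical record layout)."""
    text: str        # text information formed by the conversion means
    gaze_xy: tuple   # gaze point position (x, y) in the subjective image, in pixels
    audio: bytes     # voice information that was input when this text was recognized


def label_position(utterance, offset=(10, -10)):
    """Return where to draw the text: near the gaze point.

    The offset is an assumed value; the description only says "near".
    """
    x, y = utterance.gaze_xy
    return (x + offset[0], y + offset[1])


def audio_for_selection(utterances, selected_text):
    """Return the voice information for the selected text, or None if not found."""
    for utterance in utterances:
        if utterance.text == selected_text:
            return utterance.audio
    return None
```

Keeping the three items in one record means that selecting a displayed text is enough to recover both the place the speaker was looking at and the original voice for playback.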
The second invention relates to a multimodal speech recognition method.
The second invention has first to third steps.
In the first step, voice feature information is extracted in time series from voice information indicating the voice of the speaker. The process of the first step is executed, for example, by the voice feature information extraction means of the first invention.
In the second step, gaze point feature information is extracted in time series from gaze point image information indicating gaze point images around the gaze point that the speaker is gazing at. The process of the second step is executed, for example, by the gaze point feature information extraction means of the first invention.
In the third step, text information corresponding to the extracted time-series voice feature information and the extracted time-series gaze point feature information is formed using character string information selected from the character string information stored in the storage means. The process of the third step is executed, for example, by the conversion means of the first invention. Preferably, the text information is output from an output means such as a display means.
The second invention has the same effects as the first invention.
In a different form of the second invention, the third step includes fourth to eighth steps.
In the fourth step, the extracted time-series voice feature information is encoded and time-series voice codes are output. The process of the fourth step is executed, for example, by the voice feature information encoding means of the first invention.
In the fifth step, weights are assigned to the time-series voice codes and time-series weighted voice codes are output. The process of the fifth step is executed, for example, by the voice code weighting means of the first invention.
In the sixth step, the time-series gaze point feature information is encoded and a time-series gaze point code is output. The process of the sixth step is executed, for example, by the gaze point feature information encoding means of the first invention.
In the seventh step, weights are assigned to the time-series gaze point codes and time-series weighted gaze point codes are output. The process of the seventh step is executed, for example, by the gaze point code weighting means of the first invention.
In the eighth step, text information corresponding to an integrated code (weighted integrated code), which integrates the time-series weighted voice codes and the time-series weighted gaze point codes, is formed using character string information selected from the character string information stored in the storage means. The process of the eighth step is executed, for example, by the decoding means of the first invention.
Preferably, the voice code weighting process in the fifth step and the gaze point code weighting process in the seventh step are performed by an attention mechanism of a sequence conversion model configured with a neural network.
This form has the same effects as the first invention.
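The order of the first to eighth steps can be sketched as a plain function that passes data from stage to stage. The Python sketch below shows only the data flow. Every stage in `TOY_STAGES` is a trivial stand-in invented for illustration (identity encoders, uniform weights, a two-entry "stored" string list), and the way the two weighted sequences are combined into one integrated code is an assumption, since the description leaves the integration method open.

```python
def uniform_weighting(codes):
    """Toy weighting: every code gets the same weight, and the weights sum to 1."""
    weight = 1.0 / len(codes)
    return [[weight * value for value in code] for code in codes]


def recognize(voice_info, gaze_image_info, stages):
    """Run the steps of the method in order; `stages` supplies each stage."""
    voice_features = stages["extract_voice"](voice_info)       # first step
    gaze_features = stages["extract_gaze"](gaze_image_info)    # second step
    # The third step consists of the fourth to eighth steps.
    voice_codes = stages["encode_voice"](voice_features)       # fourth step
    weighted_voice = stages["weight_voice"](voice_codes)       # fifth step
    gaze_codes = stages["encode_gaze"](gaze_features)          # sixth step
    weighted_gaze = stages["weight_gaze"](gaze_codes)          # seventh step
    integrated = stages["integrate"](weighted_voice, weighted_gaze)
    return stages["decode"](integrated)                        # eighth step


# Hypothetical stand-ins so the sketch runs; none of them is a real model.
STORED_STRINGS = ["A", "B"]  # plays the role of the stored character strings
TOY_STAGES = {
    "extract_voice": lambda samples: [[float(s)] for s in samples],
    "extract_gaze": lambda pixels: [[float(p)] for p in pixels],
    "encode_voice": lambda features: features,
    "encode_gaze": lambda features: features,
    "weight_voice": uniform_weighting,
    "weight_gaze": uniform_weighting,
    # Assumed integration: sum each weighted sequence, then place the sums side by side.
    "integrate": lambda v, g: [sum(c[0] for c in v), sum(c[0] for c in g)],
    # Toy decoder: pick one stored string depending on which modality dominates.
    "decode": lambda code: STORED_STRINGS[0] if code[0] >= code[1] else STORED_STRINGS[1],
}
```

Passing the stages in as a dictionary mirrors the statement that each means can run on one computer or on separate computers: the control flow stays the same whichever implementation sits behind each key.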

The multimodal speech recognition device and multimodal speech recognition method of the present invention can improve speech recognition performance by recognizing speech using the speech and a gaze point image of the area around the gaze point.

FIG. 1 is a block diagram of an embodiment of the multimodal speech recognition device of the present invention.
FIG. 2 is a diagram illustrating the function of the voice feature extraction means of the multimodal speech recognition device of the embodiment.
FIG. 3 is a diagram illustrating the function of the voice feature encoding means of the multimodal speech recognition device of the embodiment.
FIG. 4 is a diagram illustrating the function of the voice code weighting means of the multimodal speech recognition device of the embodiment.
FIG. 5 is a diagram illustrating the function of the gaze point image feature extraction means of the multimodal speech recognition device of the embodiment.
FIG. 6 is a diagram illustrating the function of the gaze point image feature encoding means of the multimodal speech recognition device of the embodiment.
FIG. 7 is a diagram illustrating the function of the gaze point code weighting means of the multimodal speech recognition device of the embodiment.
FIG. 8 is a diagram illustrating the function of the decoding means of the multimodal speech recognition device of the embodiment.
FIG. 9 is a diagram illustrating the configuration of the multimodal speech recognition device of the embodiment.
FIG. 10 is a diagram illustrating the operation of the voice code weighting means of the multimodal speech recognition device of the embodiment.
FIG. 11 is a diagram illustrating the operation of the gaze point code weighting means of the multimodal speech recognition device of the embodiment.
FIG. 12 is a diagram showing a display example of the display means.

Embodiments of the present invention will be described below with reference to the drawings.
A block diagram of one embodiment of the multimodal speech recognition device of the present invention is shown in FIG. 1.
The multimodal speech recognition device of this embodiment uses an end-to-end speech recognition framework based on deep learning. Using a sequence conversion model that has multiple attention mechanisms (Attention), it integrates voice information indicating the speaker's voice with a gaze point image of the area around the gaze point of the speaker who is uttering that voice and, based on the correlation between the two, converts the voice information into text information formed from character string information.

The multimodal speech recognition device of this embodiment includes a processing means 10, a voice information input means 30, a gaze point image information input means 40, a storage means 50, a display means 60, and the like.

The voice information input means 30 inputs voice information indicating the voice uttered by the speaker. The voice information may be a speech waveform or a spectrum (frequency information). Preferably, the voice information input means 30 consists of a microphone and an A-D conversion means. Of course, any of various voice information input means capable of inputting voice information can be used as the voice information input means 30.
The gaze point image information input means 40 inputs gaze point image information indicating a gaze point image of the area around the gaze point at which the speaker is gazing while speaking. For example, a line-of-sight measuring device that can be worn by the speaker can be used as the gaze point image information input means 40. As the gaze point image, an image of the area around the gaze point (for example, a rectangular area of a predetermined pixel size centered on the gaze point) within the subjective image captured by the line-of-sight measuring device can be used. In this case, an image information extraction means that extracts the image information around the gaze point from the subjective image captured by the line-of-sight measuring device is preferably provided.
Preferably, the line-of-sight measuring device is one that can output gaze point position information indicating the position of the speaker's gaze point in the subjective image captured by the device worn by the speaker.
The position of the gaze point in the subjective image captured by the line-of-sight measuring device corresponds to the "position of the speaker's gaze point" of the present invention.
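The "rectangular area of a predetermined pixel size centered on the gaze point" mentioned above amounts to a crop of the subjective image. The following Python sketch shows one way such an image information extraction could work. The function name `crop_gaze_patch`, the nested-list image layout, and the choice to shift the window inward at the image border are assumptions made for illustration; the description does not say how borders are handled.

```python
def crop_gaze_patch(frame, gaze_xy, size):
    """Return a size x size patch of `frame` centered on the gaze point.

    `frame` is a list of rows, each row a list of pixel values, and
    `gaze_xy` is the (x, y) pixel position of the gaze point. Near the
    border the window is shifted inward so that the patch always has the
    requested size (an assumed policy).
    """
    height, width = len(frame), len(frame[0])
    if size > height or size > width:
        raise ValueError("patch size exceeds frame size")
    x, y = gaze_xy
    half = size // 2
    # Clamp the top-left corner so the window stays inside the frame.
    left = min(max(x - half, 0), width - size)
    top = min(max(y - half, 0), height - size)
    return [row[left:left + size] for row in frame[top:top + size]]
```

Shifting the window instead of padding keeps every patch the same size, which is convenient when the patches are later fed to a fixed-input feature extractor.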
The storage means 50 stores character string information. The character string information includes hiragana, katakana, numbers, commonly used kanji, etc. that form text information such as documents.
The display means 60 is used to display speech recognition results and the like.

The processing means 10 includes a conversion means 20, a voice feature extraction means 110, and a gaze point image feature extraction means 140.
As shown in FIG. 2, the voice feature extraction means 110 extracts voice features (voice feature vectors) X1 to Xn in time series from the voice information input from the voice information input means 30. A speech section of the voice information (a section presumed to be a meaningful word or sentence unit) can be identified by the silent sections that precede and follow it. The voice feature extraction means 110 outputs a voice feature sequence (voice feature vector sequence) {X1, X2, ..., Xn}.
In this embodiment, a convolutional neural network (CNN: Convolutional Neural Network) having a convolution layer and a pooling layer is used as the voice feature extraction means 110.
The voice feature extraction means 110 corresponds to the "voice feature information extraction means" of the present invention, and the voice feature sequence (voice feature vector sequence) {X1, X2, ..., Xn} corresponds to the "time-series voice feature information" of the present invention.
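The remark above that a speech section can be identified by the silent sections around it can be sketched with a simple energy threshold. The Python below is only one possible reading: the per-frame energy input, the `threshold`, and the `min_silence_frames` parameter are assumptions for illustration, and the description does not specify how silence is detected.

```python
def find_speech_sections(frame_energies, threshold, min_silence_frames):
    """Return (start, end) frame index pairs, end exclusive, of speech sections.

    A frame counts as silent when its energy is below `threshold`. An open
    section is closed once `min_silence_frames` consecutive silent frames
    have been seen; shorter pauses stay inside the section.
    """
    sections = []
    start = None       # index of the first frame of the currently open section
    silence_run = 0    # consecutive silent frames seen inside the open section
    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            if start is None:
                start = i
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_silence_frames:
                # The section ends just before the silent run began.
                sections.append((start, i - silence_run + 1))
                start = None
                silence_run = 0
    if start is not None:
        # Input ended while a section was open; drop any trailing silence.
        sections.append((start, len(frame_energies) - silence_run))
    return sections
```

Requiring several silent frames before closing a section keeps short pauses inside a word or sentence from splitting it into separate units.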

As shown in FIG. 5, the gaze point image feature extraction means 140 extracts gaze point image features (gaze point image feature vectors) Y1 to Ym in time series from the gaze point image information input from the gaze point image information input means 40. The extraction of the time-series gaze point image information by the gaze point image feature extraction means 140 is performed in synchronization with the extraction of the time-series voice feature information by the voice feature extraction means 110. The speaker's utterance and gaze movement may be offset from each other, but because the correspondence between them is estimated by the processing of the conversion means 20, they do not need to be completely synchronized. The gaze point image feature extraction means 140 outputs a gaze point image feature sequence (gaze point image feature vector sequence) {Y1, Y2, ..., Ym}.
In this embodiment, as with the voice feature extraction means 110, a convolutional neural network (CNN) is used as the gaze point image feature extraction means 140.
The gaze point image feature extraction means 140 corresponds to the "gaze point feature information extraction means" of the present invention, and the gaze point image feature sequence (gaze point image feature vector sequence) {Y1, Y2, ..., Ym} corresponds to the "time-series gaze point feature information" of the present invention.

The conversion means 20 forms text information, which is the speech recognition result, from character strings selected from those stored in the storage means 50, based on the voice feature sequence {X1, X2, ..., Xn} (time-series voice features) and the gaze point image feature sequence {Y1, Y2, ..., Ym} (time-series gaze point image features).
In this embodiment, the conversion means 20 includes a voice feature encoding means 120, a voice code weighting means 130, a gaze point image feature encoding means 150, a gaze point code weighting means 160, an integration means 170, and a decoding means 180.

As shown in FIG. 3, the voice feature encoding means 120 encodes the time-series voice features X1 to Xn (voice feature sequence {X1, X2, ..., Xn}) extracted by the voice feature extraction means 110 and outputs time-series voice codes (voice code vectors) h1 to hn. That is, the voice feature encoding means 120 outputs a voice code sequence (voice code vector sequence) {h1, h2, ..., hn}.
In this embodiment, as shown in FIG. 9, bidirectional long short-term memory (BLSTM), which is a form of recurrent neural network (RNN), is used as the voice feature encoding means 120.
The voice feature encoding means 120 corresponds to the "voice feature information encoding means" of the present invention, and the voice code sequence (voice code vector sequence) {h1, h2, ..., hn} corresponds to the "time-series voice codes" of the present invention.
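For reference, a bidirectional LSTM is commonly described as one recurrent pass running forward and one running backward over the input, with the two hidden states joined at each time step. The block below writes that textbook formulation for the sequence X1 to Xn. The arrow notation, the concatenation, and the omission of the LSTM cell state are simplifying conventions assumed here, not details stated in this description.

```latex
\overrightarrow{h}_t = \mathrm{LSTM}_{\mathrm{fwd}}\bigl(X_t,\ \overrightarrow{h}_{t-1}\bigr), \qquad
\overleftarrow{h}_t = \mathrm{LSTM}_{\mathrm{bwd}}\bigl(X_t,\ \overleftarrow{h}_{t+1}\bigr), \qquad
h_t = \bigl[\, \overrightarrow{h}_t \,;\, \overleftarrow{h}_t \,\bigr], \quad t = 1, \dots, n
```

Under this reading, each code h_t reflects context from both before and after time t, which is why one code is produced per input feature.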

The voice code weighting means 130 assigns weights to the time-series voice codes h1 to hn (voice code sequence {h1, h2, ..., hn}) output from the voice feature encoding means 120 and outputs time-series weighted voice codes (weighted voice code vectors) (a1*h1) to (an*hn). That is, the voice code weighting means 130 outputs a weighted voice code sequence (weighted voice code vector sequence) {a1*h1, a2*h2, ..., an*hn}. The weights a1 to an are set so that their sum is 1.
In this embodiment, bidirectional long-term short-term memory (BLSTM), which is a form of recurrent neural network (RNN), is used as the speech code weighting means 130.
The operation of the voice code weighting means 130 will be described later.

As shown in FIG. 6, the gaze point image feature encoding means 150 encodes the time-series gaze point image features Y1 to Ym (gaze point image feature sequence {Y1, Y2, ..., Ym}) extracted by the gaze point image feature extraction means 140 and outputs time-series gaze point codes (gaze point code vectors) s1 to sm. That is, the gaze point image feature encoding means 150 outputs a gaze point code sequence (gaze point code vector sequence) {s1, s2, ..., sm}.
In this embodiment, a bidirectional long short-term memory (BLSTM), which is a form of recurrent neural network (RNN), is used as the gaze point image feature encoding means 150.
The gaze point image feature encoding means 150 corresponds to the "gaze point feature information encoding means" of the present invention, and the gaze point code sequence (gaze point code vector sequence) {s1, s2, ..., sm} corresponds to the "time-series gaze point code" of the present invention.

The gaze point code weighting means 160 assigns weights to the time-series gaze point codes s1 to sm output from the gaze point image feature encoding means 150 and outputs time-series weighted gaze point codes (weighted gaze point code vectors) (b1*s1) to (bm*sm). That is, the gaze point code weighting means 160 outputs a weighted gaze point code sequence (weighted gaze point code vector sequence) {b1*s1, b2*s2, ..., bm*sm}. The weights b1 to bm are set so that their sum is 1.
In this embodiment, the attention mechanism (Attention) of a sequence conversion model constituted by a neural network is used as the gaze point code weighting means 160.
The operation of the gaze point code weighting means 160 will be described later.

The integrating means 170 integrates the time-series weighted audio codes (a1*h1) to (an*hn) (weighted audio code sequence {a1*h1, a2*h2, ..., an*hn}) output from the audio code weighting means 130 with the weighted gaze point codes (b1*s1) to (bm*sm) (weighted gaze point code sequence {b1*s1, b2*s2, ..., bm*sm}) output from the gaze point code weighting means 160, and outputs time-series integrated weighted codes (a1*h1+b1*s1) to (an*hn+bm*sm) (weighted code sequence {a1*h1+b1*s1, a2*h2+b2*s2, ..., an*hn+bm*sm}).

As shown in FIG. 9, the decoding means 180 forms text information corresponding to the time-series integrated weighted codes (a1*h1+b1*s1) to (an*hn+bm*sm) (weighted code sequence {a1*h1+b1*s1, a2*h2+b2*s2, ..., an*hn+bm*sm}) output from the integrating means 170, using character string information C1 to Ci selected from the character string information stored in the storage means 50.
As a method of selecting the character string information C1 to Ci, for example, the code output from the hidden layer of each LSTM at each time is converted by the Softmax function into an output score (probability value) for each character string, and a character string with a high output score is selected.
The text information (speech recognition result) decoded by the decoding means 180 is displayed on the display means 60.
In this embodiment, a long short-term memory (LSTM), which is a form of recurrent neural network (RNN), is used as the decoding means 180.
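The selection step described above can be illustrated with a minimal sketch. It assumes a single decoder time step, a hypothetical four-entry vocabulary standing in for the character string information in the storage means 50, and made-up raw scores; it picks the single highest-scoring entry (greedy selection), which is only one way to read "a character string with a high output score".

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_string(scores, vocabulary):
    """Return the vocabulary entry with the highest probability, and that probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return vocabulary[best], probs[best]

# Hypothetical vocabulary and decoder scores for one time step (illustrative only).
vocabulary = ["circuit breaker", "line switch", "745", "select"]
scores = [0.2, 2.5, 0.1, -1.0]
choice, prob = select_string(scores, vocabulary)
```

Repeating this selection at each decoder time step yields the sequence C1 to Ci that makes up the text information.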

This embodiment has an audio channel that processes audio information and a gaze point image channel that processes gaze point images. The audio channel consists of the audio information input means 30, the audio feature extraction means 110, the audio feature encoding means 120, and the audio code weighting means 130. The gaze point image channel consists of the gaze point image information input means 40, the gaze point image feature extraction means 140, the gaze point image feature encoding means 150, and the gaze point code weighting means 160.
The audio feature extraction means 110, audio feature encoding means 120, audio code weighting means 130, gaze point image feature extraction means 140, gaze point image feature encoding means 150, gaze point code weighting means 160, integrating means 170, and decoding means 180 can be implemented on a common computer or on separate computers.
It is also possible to place one means at a location remote from another and to exchange information between the two over a communication line such as the Internet.

Next, the learning operation of the multimodal speech recognition device of this embodiment will be described.
In learning, the various weight parameters of the neural network constituting the conversion means 20 (sequence conversion model) are learned iteratively by error backpropagation, using teacher information created in advance and stored in the storage means 50 (pairs of audio information and gaze point image information as input and text information as output). For example, the audio information "Select circuit breaker 745" and the "gaze point image information sequence at the time of selecting circuit breaker 745" are input. The weight parameters of the conversion means 20 are then learned so as to minimize the error between the text information output from the decoding means 180 and the text "Select circuit breaker 745" that corresponds to the input audio information.

Next, the speech recognition operation of the multimodal speech recognition device of this embodiment will be described.
When the audio information and the gaze point image information are input, the processing of the processing means 10 of the present invention starts. The processing operation of the processing means 10 is as described above.
Here, the weighting operation by the audio code weighting means 130 will be described with reference to FIG. 10.
End-to-end speech recognition based on deep learning has an encoder that encodes the audio feature sequence (audio feature vectors) obtained from speech and a decoder that decodes it into a character string. The encoder converts the audio feature sequence into hidden state vectors, and the decoder converts the code sequence, via hidden state vectors, into the text information that is the recognition result.
As shown in FIG. 9, this embodiment has an audio encoder that encodes the audio feature sequence, a gaze point encoder that encodes the gaze point image feature sequence (gaze point image feature vectors), and a decoder that integrates the audio code sequence (audio code vectors) and the gaze point code sequence (gaze point code vectors) and maps them to a character string. In addition, the decoder assigns weights (attention) to each of the audio code sequence and the gaze point code sequence to generate an integrated weighted code sequence. The audio encoder is constituted by the audio feature encoding means 120, and the gaze point encoder by the gaze point image feature encoding means 150. The decoder is constituted by the audio code weighting means 130, the gaze point code weighting means 160, the integrating means 170, and the decoding means 180.
The audio code weights (audio code weight vector) in the time slice at an arbitrary time t of the decoding means 180, indicated by the dash-dotted line in FIG. 10, are assigned dynamically based on the hidden state vector sequence {h(1), ..., h(n)} output from the hidden layer of the BLSTM of the audio feature encoding means 120 and the hidden state vector u(t-1) of the LSTM of the decoding means 180 at the previous time (t-1). For example, the similarity a(i) (i = 1, ..., n) between the hidden state vector u(t-1) and each element of the hidden state vector sequence {h(1), ..., h(n)} can be obtained numerically as the inner product a(i) = u(t-1)·h(i) (i = 1, ..., n). The audio code weights a(i) are normalized so that they sum to 1. The input from the audio encoder to the decoder at time t is expressed, using the audio code weights a(i) (i = 1, ..., n) and the hidden state vector sequence {h(1), ..., h(n)}, as [a(1)*h(1) + ... + a(n)*h(n)].
Various measures can be used to compute the similarity.
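The weighting on the audio side can be sketched in a few lines. This is an illustration under assumptions rather than the embodiment's implementation: the vectors are toy values, and softmax is used as the normalization that makes the weights sum to 1, which is a common choice that the text does not specify.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def attention_weights(u_prev, states):
    """Similarity a(i) = u(t-1) . h(i), normalized so the weights sum to 1.

    Softmax normalization is an assumption of this sketch.
    """
    sims = [dot(u_prev, h) for h in states]
    peak = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(weights, states):
    """Weighted sum a(1)*h(1) + ... + a(n)*h(n)."""
    dim = len(states[0])
    return [sum(w * h[k] for w, h in zip(weights, states)) for k in range(dim)]

# Toy decoder state u(t-1) and encoder hidden states h(1)..h(3) (illustrative only).
u_prev = [1.0, 0.0]
h = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
a = attention_weights(u_prev, h)
c_audio = context_vector(a, h)
```

The encoder state most similar to u(t-1) receives the largest weight, so the decoder input at time t is dominated by the audio frames most relevant to the current output step.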

The weighting operation by the gaze point code weighting means 160 will be described with reference to FIG. 11.
The gaze point code weights (gaze point code weight vector) in the time slice at an arbitrary time t of the decoding means 180, indicated by the dash-dotted line in FIG. 11, are assigned dynamically based on the hidden state vector sequence {s(1), ..., s(m)} output from the hidden layer of the BLSTM of the gaze point image feature encoding means 150 and the hidden state vector u(t-1) of the LSTM of the decoding means 180 at the previous time (t-1). For example, the similarity b(j) (j = 1, ..., m) between the hidden state vector u(t-1) and each element of the hidden state vector sequence {s(1), ..., s(m)} can be obtained numerically as the inner product b(j) = u(t-1)·s(j) (j = 1, ..., m). The gaze point code weights b(j) are normalized so that they sum to 1. The input from the gaze point encoder to the decoder at time t is expressed, using the gaze point code weights b(j) (j = 1, ..., m) and the hidden state vector sequence {s(1), ..., s(m)}, as [b(1)*s(1) + ... + b(m)*s(m)].
Various measures can be used to compute the similarity.

In this way, weights can be assigned dynamically to the codes output from the audio encoder and the gaze point encoder.
Next, the integrating means 170 integrates the weighted audio code sequence and the weighted gaze point code sequence obtained by the above method as r(t) = [a(1)*h(1) + ... + a(n)*h(n)] + [b(1)*s(1) + ... + b(m)*s(m)], and this r(t) is used as the input to the LSTM of the decoding means 180 (decoder) at time t.
Then, as described above, the decoding means 180 converts the output of each LSTM at each time into an output score (probability value) for each character string using the Softmax function, and selects character strings with high output scores to form the text information (speech recognition result).
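The integration r(t) can be checked numerically with a small sketch. The weights and hidden states below are made-up values, and the sketch assumes that the audio codes and gaze point codes have the same dimensionality so that the two weighted sums can be added; the sequence lengths n and m may differ.

```python
def weighted_sum(weights, states):
    """Compute w(1)*v(1) + ... + w(k)*v(k) for a list of vectors."""
    dim = len(states[0])
    return [sum(w * v[d] for w, v in zip(weights, states)) for d in range(dim)]

def integrate(a, h, b, s):
    """r(t) = [a(1)*h(1)+...+a(n)*h(n)] + [b(1)*s(1)+...+b(m)*s(m)]."""
    c_audio = weighted_sum(a, h)
    c_gaze = weighted_sum(b, s)
    if len(c_audio) != len(c_gaze):
        # Assumption of this sketch: both code vectors share one dimensionality.
        raise ValueError("audio and gaze point codes must have the same dimension")
    return [x + y for x, y in zip(c_audio, c_gaze)]

# Illustrative values: n = 3 audio codes, m = 2 gaze point codes, dimension 2.
a = [0.5, 0.25, 0.25]                      # audio weights, sum to 1
h = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]   # audio codes h(1)..h(3)
b = [0.5, 0.5]                             # gaze point weights, sum to 1
s = [[4.0, 0.0], [0.0, 4.0]]               # gaze point codes s(1), s(2)
r_t = integrate(a, h, b, s)                # [1.0 + 2.0, 0.75 + 2.0]
```

Here the audio term evaluates to [1.0, 0.75] and the gaze point term to [2.0, 2.0], so r(t) is [3.0, 2.75].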

As described above, by weighting the audio code sequence with the audio code weighting means 130 (code weighting in the audio channel) and weighting the gaze point code sequence with the gaze point code weighting means 160 (weighting in the gaze point image channel), the text information (the character string information constituting the text information) corresponding to the integrated weighted code sequence input to the decoding means 180 can be formed while estimating the correlation between the audio information and the gaze point image information.
The voice uttered by a speaker and the speaker's gaze point at the time of utterance are related to each other.
Therefore, in this embodiment, speech recognition performance can be improved by estimating the relationship between the speaker's voice and gaze point when performing speech recognition.

To confirm the effect of this embodiment, a comparative experiment was conducted between a model using only audio information (Model 1) and a model using audio information and gaze point image information (Model 2), and the character error rate (CER) was obtained for each. CER is expressed as CER = (S + D + I) * 100 / N, where S is the number of substitution errors, D is the number of deletion errors, I is the number of insertion errors, and N is the number of characters in the reference sentence.
In the experiment, the CER was 7.2% for Model 1 and fell to 6.9% for Model 2, confirming the effect of applying the configuration of the present invention.
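For reference, the CER formula can be computed as follows. This is a minimal sketch that assumes S + D + I is taken from the minimum-cost character alignment (the Levenshtein edit distance), which is the usual convention but is not stated in the text; the example strings are hypothetical and are not data from the experiment.

```python
def edit_distance(reference, hypothesis):
    """Minimum number of substitutions, deletions and insertions (S + D + I)."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, 1):
        cur = [i]
        for j, hyp_char in enumerate(hypothesis, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_char != hyp_char),  # substitution or match
            ))
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    """CER = (S + D + I) * 100 / N, with N the reference length in characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) * 100 / len(reference)

# Hypothetical 10-character reference with one substituted character.
example_cer = cer("breaker745", "breaker746")
```

With one substitution in a 10-character reference, the sketch gives a CER of 10.0%.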

In the embodiment above, an integrated code sequence (integrated code vector) r = a*h + b*s was used, in which the audio code sequence (audio code vector) h and the gaze point code sequence (gaze point code vector) s are integrated in equal proportion, but the fusion ratio of the audio code sequence h and the gaze point code sequence s can also be changed. For example, an integrated code sequence r expressed as r = a*h + g*(b*s) can be used. Here, g is a fusion weight (fusion weight vector) indicating the fusion ratio of the gaze point code sequence. The fusion weight vector may be fixed or assigned dynamically.
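Written with the per-step sums used for r(t) elsewhere in this description, the variable-ratio integration can be read as follows. This is one interpretation that assumes a scalar g; if g is a vector, the product with the gaze point term would presumably be taken element-wise, which the text does not specify.

```latex
r(t) = \sum_{i=1}^{n} a(i)\,h(i) \;+\; g \sum_{j=1}^{m} b(j)\,s(j),
\qquad \sum_{i=1}^{n} a(i) = 1, \quad \sum_{j=1}^{m} b(j) = 1
```

Under this reading, g = 1 recovers the equal-proportion integration, and g = 0 reduces the input to the audio term alone.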

Next, the method of outputting the speech recognition result will be described.
In this embodiment, the text information decoded by the decoding means 180 is displayed on the display means 60.
FIG. 12 shows an example of a display screen 200 that displays text information.
The display screen 200 shown in FIG. 12 displays an operation panel 300 for closing and opening circuit breakers and line switches. The operation panel 300 has circuit breaker selection buttons 311 to 313, operated to select circuit breakers 740, 742, and 745; line switch selection buttons 314 to 316, operated to select line switches 740, 742, and 745; an ON button 317, operated to close; and an OFF button 318, operated to open.
Suppose that the speaker utters "Line switch 745 selection operation" while gazing at the line switch selection button 316 on the operation panel 300, and that the text information "Line switch 745 selection operation" is recognized and output from the decoding means 180. In this embodiment, on the display screen 200 showing the operation panel 300, the text information "Line switch button 745 operation selection" is displayed at a position related to the gaze point, which in FIG. 12 is the location corresponding to the line switch selection button 316.
This makes it easy to identify the speaker's utterance and the position of the gaze point.
FIG. 12 also shows the case where, after uttering "Select operation of line switch 745", the speaker utters "Turn it on" while gazing at the ON button 317 on the operation panel 300; the text information "Turn it on" is displayed at a position related to the gaze point, which in FIG. 12 is the location corresponding to the ON button 317.
While text information is displayed on the display screen 200, selecting (for example, touching) the displayed text information can also be configured to output the corresponding audio information, for example the audio information that was input when that text information was recognized, from an audio output means such as a speaker.
The processing for displaying text information on the display means 60, the processing for outputting the audio information corresponding to the text information from the audio output means, and so on are executed by, for example, the processing means 10.
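One way to associate recognized text with the on-screen element under the gaze point is a simple hit test. The sketch below is hypothetical throughout: the pixel coordinates, the rectangle layout, and the function names are illustrative assumptions and do not come from FIG. 12.

```python
from typing import Dict, Optional, Tuple

Rect = Tuple[int, int, int, int]  # (left, top, right, bottom) in screen pixels

# Hypothetical layout for two buttons of the operation panel 300.
BUTTONS: Dict[int, Rect] = {
    316: (200, 100, 260, 130),  # line switch selection button
    317: (300, 100, 360, 130),  # ON button
}

def button_at(x: int, y: int, buttons: Dict[int, Rect] = BUTTONS) -> Optional[int]:
    """Return the reference number of the button containing the gaze point, if any."""
    for number, (left, top, right, bottom) in buttons.items():
        if left <= x <= right and top <= y <= bottom:
            return number
    return None

def label_for(text: str, x: int, y: int) -> Tuple[Optional[int], str]:
    """Pair the recognized text with the button under the gaze point."""
    return button_at(x, y), text

# The speaker says "Turn it on" while looking at pixel (320, 115).
placed = label_for("Turn it on", 320, 115)
```

In this example the gaze point falls inside the rectangle assumed for the ON button 317, so the text would be drawn next to that button.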

Although the multimodal speech recognition device has been described above, the present invention can also be configured as a multimodal speech recognition method.
(Aspect 1)
A multimodal speech recognition method, comprising:
a first step of extracting audio feature information in time series from audio information indicating the voice of a speaker;
a second step of extracting gaze point feature information in time series from gaze point image information indicating a gaze point image around the gaze point at which the speaker is gazing; and
a third step of forming text information corresponding to the extracted time-series audio feature information and the extracted time-series gaze point feature information, using character string information selected from character string information stored in a storage means.
(Aspect 2)
The multimodal speech recognition method according to Aspect 1, wherein the third step comprises:
a fourth step of encoding the extracted time-series audio feature information and outputting a time-series audio code;
a fifth step of assigning weights to the time-series audio code and outputting a time-series weighted audio code;
a sixth step of encoding the time-series gaze point feature information and outputting a time-series gaze point code;
a seventh step of assigning weights to the time-series gaze point code and outputting a time-series weighted gaze point code; and
an eighth step of forming text information corresponding to an integrated code, obtained by integrating the time-series weighted audio code and the time-series weighted gaze point code, using character string information selected from the character string information stored in the storage means.
Such a multimodal speech recognition method has the same effects as the multimodal speech recognition device described above.

The present invention is not limited to the configuration described in the embodiment, and various changes, additions, and deletions are possible.
In the embodiment, audio information and gaze point image information are used as multimodal information, but three or more kinds of information can also be used. For example, audio information, gaze point image information, and gesture information (body and hand gestures) can be used as multimodal information.
Although the gaze point image information is input using the visible-light image sensor of a gaze measurement device, it can also be input using various other sensors, such as an infrared sensor or an ultraviolet sensor.
The multimodal speech recognition device and multimodal speech recognition method of the present invention are not limited to confirming operations by workers and can be used in various fields, such as subtitling videos with audio, video search, and skill transfer and education and training using video.
As the audio information input means, audio information input means of various configurations capable of inputting audio information can be used. A storage means or the like in which audio information is stored in advance can also be used as the audio information input means.
As the gaze point image information input means, gaze point image information input means of various configurations capable of inputting gaze point image information around the gaze point can be used. A storage means or the like in which gaze point image information is stored in advance can also be used as the gaze point image information input means.
The configurations of the audio feature extraction means (audio feature information extraction means), audio feature encoding means (audio feature information encoding means), audio code weighting means, gaze point image feature extraction means (gaze point feature information extraction means), gaze point image feature encoding means (gaze point feature information encoding means), gaze point code weighting means, integrating means, and decoding means are not limited to those described in the embodiment.
The method of displaying the speech recognition result and the like on the display means is not limited to the method described in the embodiment.
The method of outputting the speech recognition result and the like is not limited to displaying it on the display means. For example, it can also be transmitted to a remote management device via a communication line.

10 Processing means
20 Conversion means
30 Audio information input means
40 Gaze point image information input means
50 Storage means
60 Display means
110 Audio feature extraction means (audio feature information extraction means)
120 Audio feature encoding means (audio feature information encoding means)
130 Audio code weighting means
140 Gaze point image feature extraction means (gaze point feature information extraction means)
150 Gaze point image feature encoding means (gaze point feature information encoding means)
160 Gaze point code weighting means
170 Integrating means
180 Decoding means
200 Display screen
300 Operation panel
311 to 313 Circuit breaker selection buttons
314 to 316 Line switch selection buttons
317 ON button
318 OFF button
321, 322 Text information display sections

Claims

A multimodal speech recognition device,
voice information input means for inputting voice information indicating the voice of the speaker;
Gaze point image information input means for inputting gaze point image information indicating a gaze point image around the gaze point that the speaker is gazing at;
a storage means for storing character string information;
audio feature information extraction means for extracting audio feature information in chronological order from the audio information input from the audio information input means;
Gaze point feature information extraction means for extracting gaze point feature information in chronological order from the gaze point image information input from the gaze point image information input means;
and conversion means for forming text information corresponding to the time-series audio feature information extracted by the audio feature information extraction means and the time-series gaze point feature information extracted by the gaze point feature information extraction means, using character string information selected from the character string information stored in the storage means.

The multimodal speech recognition device according to claim 1,
The conversion means is
audio feature information encoding means for encoding each of the time-series audio feature information extracted by the audio feature information extraction means and outputting a time-series audio code;
audio code weighting means for weighting each of the time-series audio codes output from the audio feature information encoding means to output a time-series weighted audio code;
gaze point feature information encoding means for encoding each of the time series gaze point feature information extracted by the gaze point feature information extraction means and outputting a time series gaze point code;
gaze point code weighting means for weighting each of the time-series gaze point codes output from the gaze point feature information encoding means to output a time-series weighted gaze point code;
and decoding means for forming text information corresponding to an integrated code, obtained by integrating the time-series weighted audio code output from the audio code weighting means and the time-series weighted gaze point code output from the gaze point code weighting means, using character string information selected from the character string information stored in the storage means.

The multimodal speech recognition device according to claim 2,
A multimodal speech recognition device characterized in that an attention mechanism of a sequence conversion model constituted by a neural network is used as the audio code weighting means and the gaze point code weighting means.

A multimodal speech recognition device according to any one of claims 1 to 3, comprising:
Equipped with display means,
The gaze point image information input means is capable of inputting gaze point position information indicating the position of the gaze point in the subjective image of the speaker;
A multimodal speech recognition device characterized in that the text information is displayed on the display means in association with the position of the speaker's gaze point indicated by the gaze point position information.

The multimodal speech recognition device according to claim 4,
A multimodal speech recognition device characterized in that, when the text information displayed on the display means is selected, the multimodal speech recognition device is configured to output speech information corresponding to the selected text information.

A multimodal speech recognition method, comprising:
A first step of extracting voice feature information in time series from voice information indicating the voice of the speaker;
a second step of extracting gaze point feature information in time series from gaze point image information indicating gaze point images around the gaze point that the speaker is gazing at;
and a third step of forming text information corresponding to the extracted time-series voice feature information and the extracted time-series gaze point feature information, using character string information selected from character string information stored in a storage means.

The multimodal speech recognition method according to claim 6,
The third step is
a fourth step of encoding each of the extracted time-series audio feature information and outputting a time-series audio code;
a fifth step of assigning a weight to each of the time-series audio codes and outputting a time-series weighted audio code;
a sixth step of encoding each of the time-series gaze point feature information and outputting a time-series gaze point code;
a seventh step of assigning a weight to each of the time-series gazing point codes and outputting a time-series weighted gazing point code;
and an eighth step of forming text information corresponding to an integrated code, obtained by integrating the time-series weighted audio code and the time-series weighted gaze point code, using character string information selected from the character string information stored in the storage means.