JP2016206960A - Voice video input/output device

Info

Publication number: JP2016206960A
Application number: JP2015088098A
Authority: JP
Inventors: 翔一郎齊藤; Shoichiro Saito; 尚植松; Hisashi Uematsu; 一成森内; Kazunari Moriuchi
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-04-23
Filing date: 2015-04-23
Publication date: 2016-12-08

Abstract

PROBLEM TO BE SOLVED: To allow smooth operation even in a high-noise environment or in a usage situation where touch operation is difficult. SOLUTION: A voice recognition part 2 performs voice recognition on an acoustic signal obtained by picking up surrounding sounds, including the voice of a user, using a sound acquisition part M, and creates a voice recognition result. A video acquisition part 5 acquires a video signal obtained by imaging an area corresponding to the visual field of the user using an imaging part V. A voice creation part 4 outputs an output acoustic signal to a sound release part S. A video creation part 6 outputs an output video signal created using the video signal to a video display part G. A function control part 3 creates a control signal for controlling the functions of the video acquisition part 5, the voice recognition part 2, the voice creation part 4, and the video creation part 6 based on a dial operation signal from a dial operation part D, a button operation signal from a button operation part B, and the voice recognition result. SELECTED DRAWING: Figure 2

Description

The present invention relates to an audio/video input/output device that can be operated by voice command input.

In recent years, there has been a growing need to raise the work efficiency of factory workers and others who perform difficult or unfamiliar tasks by making full use of multimedia technologies such as audio and video together with information and communication technologies. However, carrying an information and communication device such as a notebook computer or tablet terminal in addition to the tools needed for the work is a heavy burden on the worker. Moreover, such devices are operated mainly through a keyboard or a screen, but workers are not always able to use both hands freely, and intuitive operability is often lacking. As a result, even when a worker tries to work in coordination with audio and video, it is difficult to carry out the operation the worker intends. In addition, such environments are often noisy, so communication frequently cannot proceed smoothly.

To meet these needs, eyeglass-type wearable devices that display necessary information superimposed on the real field of view have been developed. One example is Google Glass (registered trademark), described in Non-Patent Documents 1 and 2. Google Glass can be operated using voice command input (for example, uttering "OK glass."; see Non-Patent Document 1 for details) and gesture command input (for example, finger touches or the tilt of the device body; see Non-Patent Document 2 for details).

[Non-Patent Document 1] Google, Inc., "Google Glass - Help - Voice actions", [online], [retrieved April 3, 2015], Internet <URL: https://support.google.com/glass/answer/3079305?hl=en>
[Non-Patent Document 2] Google, Inc., "Google Glass - Help - Glass gestures", [online], [retrieved April 3, 2015], Internet <URL: https://support.google.com/glass/answer/3064184?hl=en>

However, conventional eyeglass-type wearable devices are not designed for use in high-noise environments. In an environment with loud ambient noise, such as inside a factory, the voice is buried in noise and voice command input is prone to errors. In addition, factory workers often wear thick gloves, and fine operation is difficult on the touchpads found on conventional eyeglass-type wearable devices, whether capacitive or piezoelectric. Headgear such as a helmet is also often required in a factory, and the device may physically interfere with such headgear so that it cannot be worn in the proper way. Furthermore, a fall-prevention strap or the like must be attached to prevent the device from being dropped, so preparing to wear the device takes time and effort.

An object of the present invention is to provide an audio/video input/output device that allows smooth operation even in a high-noise environment or in a usage situation where touch operation is difficult.

To solve the above problem, the audio/video input/output device of the present invention includes: a plurality of sound pickup units that pick up surrounding sounds including the user's voice; an imaging unit that captures an area corresponding to the user's field of view; a video display unit whose screen is placed at a position within the user's field of view; a dial operation unit that outputs a dial operation signal indicating a rotation direction and a rotation angle in response to a rotation operation; a button operation unit that is placed at the position of the rotation axis on the surface of the dial operation unit and outputs a button operation signal indicating a pressed state; a speech recognition unit that performs speech recognition on the acoustic signals acquired using the sound pickup units and generates a speech recognition result; a video acquisition unit that acquires a video signal using the imaging unit; a video generation unit that outputs, to the video display unit, an output video signal generated using the video signal; and a function control unit that generates control signals for controlling the functions of the video acquisition unit, the speech recognition unit, and the video generation unit based on the dial operation signal, the button operation signal, and the speech recognition result.

Because the audio/video input/output device of the present invention performs speech recognition using acoustic signals picked up by a plurality of sound pickup units, voice command input operates stably even in a high-noise environment. In addition, because physical operation with a dial and a button is possible, the device is easy to operate even in situations where touch operation is difficult. Smooth operation is therefore possible even in high-noise environments and in usage situations where touch operation is difficult.

FIG. 1 is a diagram illustrating the functional configuration of the audio/video input/output device according to the first embodiment.
FIG. 2 is a diagram illustrating the functional configuration of the audio/video input/output device according to the first embodiment.
FIG. 3 is a diagram illustrating how the environment setting function is used.
FIG. 4 is a diagram illustrating how the volume setting function is used.
FIG. 5 is a diagram illustrating how the video zoom function is used.
FIG. 6 is a diagram illustrating how the video tracking function is used.
FIG. 7 is a diagram illustrating the functional configuration of the audio/video input/output device according to the second embodiment.
FIG. 8 is a diagram illustrating the functional configuration of the audio/video input/output device according to the second embodiment.
FIG. 9 is a diagram illustrating how the audio zoom function is used.
FIG. 10 is a diagram illustrating how the audio tracking function is used.

Embodiments of the present invention are described in detail below. In the drawings, components having the same function are given the same reference numerals, and duplicate description is omitted.

[First Embodiment]
The first embodiment of the present invention is a helmet-integrated audio/video input/output device intended to be worn on the user's head at a high-noise work site such as a factory. As illustrated in FIG. 1, the audio/video input/output device of this embodiment includes n (≥ 2) sound pickup units M_1, …, M_n, a sound emitting unit S, an imaging unit V, a video display unit G, a dial operation unit D, a button operation unit B, and a communication recording unit C.

The sound pickup units M_1, …, M_n are microphones that pick up sounds around the user when the device is worn. The example in FIG. 1 shows a configuration in which one sound pickup unit M_1 is placed near the user's mouth and the n-1 sound pickup units M_2, …, M_n are arranged horizontally at the back of the head, but any arrangement may be used as long as surrounding sounds including the user's speech can be picked up.

The sound emitting unit S is a headphone placed at a position corresponding to the user's ear when the device is worn. Headphones are generally provided for both ears, but the sound emitting unit S of this embodiment only needs to be provided for at least one ear; it may also be provided for both ears as usual.

The imaging unit V is a video camera whose angle of view is set to the area corresponding to the user's field of view when the device is worn, and which captures that area. The example in FIG. 1 shows the camera placed at the user's forehead, but it may be placed anywhere as long as the area corresponding to the user's field of view can be captured. However, the imaging unit V is preferably placed where it does not block the user's field of view.

The video display unit G is a small display using a transmissive screen, with the screen placed at a position within the user's field of view when the device is worn. Because the video display unit G is transmissive, it does not block the user's view when no video is being output, and when video is being output the user sees the video superimposed on the scene ahead. Since use at work sites such as factories is assumed, the screen also serves to protect the user's eyes from flying objects such as dust. The video display unit G is movable so that it can be retracted out of the user's field of view. This is convenient when checking with the naked eye is necessary, for example to confirm an exact color. When retracted, the video display unit G is preferably stored at a position outside the angle of view of the imaging unit V.

The dial operation unit D is a large dial placed on the surface of the sound emitting unit S that faces away from the user's ear when the device is worn, and it rotates freely about a central axis at the position of the user's ear. The side of the circular dial has multiple ridges and grooves so that it can be operated easily with the fingertips even when, for example, the user is wearing thick gloves. When a rotation operation is performed, the dial operation unit D outputs a dial operation signal d indicating the direction and angle of the rotation. The dial operation unit D does not have to be on the surface of the sound emitting unit S; it may be placed anywhere the user can operate it.

The button operation unit B is a button placed at the position of the rotation axis on the surface of the dial operation unit D, and it can be pressed in the direction of the dial's central axis. Setting the button cap so that it sits slightly recessed from the surface of the dial operation unit D even when not pressed prevents malfunctions caused by unintended button presses. When a press operation is performed, the button operation unit B outputs a button operation signal b indicating the pressed state.

The communication recording unit C establishes a wireless or wired communication path with a remote communication partner and transmits and receives audio and video signals. It also records, on a recording medium or the like, the audio and video signals that were transmitted and received, the audio signals acquired using the sound pickup units M_1, …, M_n, and the video signals acquired using the imaging unit V.

The operation of the audio/video input/output device of the first embodiment is described with reference to FIG. 2. In addition to the sound pickup units M_1, …, M_n, the sound emitting unit S, the imaging unit V, the video display unit G, the dial operation unit D, the button operation unit B, and the communication recording unit C, the device of this embodiment includes an input speech enhancement unit 1, a speech recognition unit 2, a function control unit 3, an audio generation unit 4, a video acquisition unit 5, and a video generation unit 6. As shown in FIG. 2, the communication recording unit C includes an audio output unit C1, an audio input unit C2, a video output unit C3, and a video input unit C4.

The input speech enhancement unit 1 performs target sound enhancement on the acoustic signals x_1, …, x_n picked up by the sound pickup units M_1, …, M_n, respectively, and outputs a speech enhancement signal a_o in which the speech contained in x_1, …, x_n is enhanced. The speech enhancement signal a_o is sent to the speech recognition unit 2, the audio generation unit 4, and the audio output unit C1. The acoustic signal x_1 picked up by the sound pickup unit M_1 near the user's mouth contains the user's voice and environmental noise, whereas the acoustic signals x_2, …, x_n picked up by the sound pickup units M_2, …, M_n, which are placed where the user's voice does not easily reach, can be expected to contain only environmental noise. Therefore, by suppressing in x_1 the environmental noise contained in x_2, …, x_n, an acoustic signal in which the user's voice is enhanced can be obtained. The target sound enhancement is not limited to this method, and any known method may be applied. For example, the acoustic signal enhancement technique described in Reference Document 1 below can be used.
[Reference Document 1] JP 2013-179388 A
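
As an illustration of suppressing in x_1 the noise observed in x_2, …, x_n, the following is a minimal sketch using a plain LMS adaptive noise canceller with one weight per reference microphone. LMS is chosen here only for illustration and is not the technique of Reference Document 1; the function name `lms_enhance`, the step size `mu`, and the synthetic signals are all assumptions.

```python
import math
import random


def lms_enhance(primary, references, mu=0.005):
    """Suppress noise in `primary` using noise-only `references` (plain LMS).

    primary    -- samples of x_1 (voice plus noise, mouth microphone)
    references -- list of sample lists x_2..x_n (assumed to be noise only)
    Returns (enhanced samples standing in for a_o, final weights).
    """
    weights = [0.0] * len(references)
    enhanced = []
    for t, sample in enumerate(primary):
        ref_now = [ref[t] for ref in references]
        noise_estimate = sum(w * r for w, r in zip(weights, ref_now))
        error = sample - noise_estimate  # residual is the voice estimate
        # LMS update: move each weight along its reference input.
        weights = [w + mu * error * r for w, r in zip(weights, ref_now)]
        enhanced.append(error)
    return enhanced, weights


# Synthetic demonstration (hypothetical data): a sine "voice" plus noise
# leaking from two independent sources into the mouth microphone.
rng = random.Random(0)
n_samples = 4000
voice = [0.5 * math.sin(2 * math.pi * 0.01 * t) for t in range(n_samples)]
noise_a = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
noise_b = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
x1 = [v + 0.8 * a + 0.4 * b for v, a, b in zip(voice, noise_a, noise_b)]

a_o, w = lms_enhance(x1, [noise_a, noise_b])
```

In this toy setup the weights drift toward the leakage gains used to build `x1`, so the later part of `a_o` carries much less noise than `x1`. Real microphones would need multi-tap filters to account for delay and reverberation.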

The speech recognition unit 2 performs speech recognition on the speech enhancement signal a_o and outputs a speech recognition result op. Any known method may be applied for the speech recognition processing. The speech recognition result op is sent to the function control unit 3. The speech recognition unit 2 does not have to execute the speech recognition processing itself; it may instead transmit the speech enhancement signal a_o to a remotely installed speech recognition device and output the result returned from that device as the speech recognition result op. In that case, communication with the speech recognition device may be carried out using the communication recording unit C.

The function control unit 3 generates control signals c_1, …, c_5 for controlling the functions of the audio/video input/output device based on the dial operation signal d from the dial operation unit D, the button operation signal b from the button operation unit B, and the speech recognition result op from the speech recognition unit 2. The content of a control signal is determined by what the user selects through dial operation, button operation, and voice command input while following the display on the video display unit G. Operation examples for individual functions are given later, but how dial operation, button operation, and voice command input are combined to form the operation interface is arbitrary. For example, a basic flow is to call a function by voice command input, choose among options by dial operation, and confirm the choice by button operation. A single voice command can also be used to perform everything from specifying an option to confirming it. A control signal is generated separately for each component to be controlled. For example, the control signal c_1 controls the input speech enhancement unit 1, the control signal c_2 controls the audio generation unit 4, the control signal c_3 controls the video acquisition unit 5, the control signal c_4 controls the video generation unit 6, and the control signal c_5 controls the speech recognition unit 2.

The audio output unit C1 transmits the speech enhancement signal a_o output by the input speech enhancement unit 1 to the remote communication partner, or stores the speech enhancement signal a_o on a recording medium (not shown).

The audio input unit C2 receives a remote audio signal a_i from the remote communication partner. The received remote audio signal a_i is sent to the audio generation unit 4. The remote audio signal a_i is, for example, speech in which the remote communication partner instructs the user on the work to be performed.

The audio generation unit 4 generates an output acoustic signal a_s using the speech enhancement signal a_o output by the input speech enhancement unit 1 and, if there is one, the remote audio signal a_i output by the audio input unit C2, and outputs the output acoustic signal a_s to the sound emitting unit S. The speech enhancement signal a_o is used to feed the user's own voice back to the user, for example when the device is used in a high-noise environment where users have difficulty hearing their own voice. Audio recorded in advance may also be stored in the communication recording unit C or the like and played back for use as the remote audio signal a_i.
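
The generation of a_s from a_o and an optional a_i can be sketched as a simple mix. This is only an illustration: the source does not specify how the two signals are combined, so the function name `generate_output`, the `sidetone_gain` applied to the user's own voice, the `volume` factor, and the clipping to [-1, 1] are all assumptions.

```python
def generate_output(a_o, a_i=None, sidetone_gain=0.3, volume=1.0):
    """Mix the user's enhanced voice a_o with the remote voice a_i, if any.

    Both inputs are equal-length lists of samples in the range [-1, 1].
    Returns the samples of the output acoustic signal a_s.
    """
    if a_i is None:
        # No remote signal: only the user's own voice is fed back.
        a_i = [0.0] * len(a_o)
    mixed = [volume * (sidetone_gain * own + remote)
             for own, remote in zip(a_o, a_i)]
    # Clip so that the headphone signal stays within the valid range.
    return [max(-1.0, min(1.0, s)) for s in mixed]
```

For example, `generate_output(a_o)` returns only the attenuated feedback of the user's voice, while `generate_output(a_o, a_i)` adds the remote partner's voice on top of it.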

The video acquisition unit 5 acquires a video signal v_o captured using the imaging unit V. The acquired video signal v_o is sent to the video output unit C3.

The video output unit C3 transmits the video signal v_o output by the video acquisition unit 5 to the remote communication partner, or stores the video signal v_o on a recording medium (not shown).

The video input unit C4 receives a remote video signal v_i from the remote communication partner. The received remote video signal v_i is sent to the video generation unit 6. The remote video signal v_i is, for example, reference information needed for the work the user is to perform.

The video generation unit 6 generates an output video signal v_s using the video signal v_o output by the video acquisition unit 5 and, if there is one, the remote video signal v_i output by the video input unit C4, and outputs the output video signal v_s to the video display unit G.

FIG. 3 illustrates how environment setting is performed on the audio/video input/output device of this embodiment. First, the user says "environment" as a voice command. Based on the speech recognition result op "environment", the function control unit 3 sends a control signal c_4 for calling the environment setting function to the video generation unit 6. The video generation unit 6 displays the current environment setting on the transmissive screen of the video display unit G. In the example of FIG. 3, the current environment setting is "construction site", and other candidates such as "tunnel" and "server room" are displayed. These options and the operating parameters associated with them are set in advance. When the user rotates the dial operation unit D, the dial operation signal d is input to the function control unit 3, and a control signal c_4 for changing the environment setting candidate is sent to the video generation unit 6. When the user selects the desired candidate by dial operation and then presses the button operation unit B, the button operation signal b is input to the function control unit 3, and control signals c_1 and c_5 for confirming the change of environment setting are sent to the input speech enhancement unit 1 and the speech recognition unit 2. The input speech enhancement unit 1 and the speech recognition unit 2 change their operating parameters according to the control signals c_1 and c_5. Because the optimal operating parameters for target sound enhancement and speech recognition generally differ from one environment to another, setting the usage environment correctly can be expected to yield more accurate processing results.
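
The environment setting sequence (voice command to open the menu, dial to move between candidates, button to confirm) can be sketched as a small state machine. This is a minimal sketch under assumptions: the class name `EnvironmentSetting`, the tuple format used for control signals, and the reduction of the dial operation signal d to signed steps are all hypothetical, and only the three candidates named in the text are listed.

```python
from dataclasses import dataclass

# Candidates named in the example of FIG. 3.
ENVIRONMENTS = ["construction site", "tunnel", "server room"]


@dataclass
class EnvironmentSetting:
    current: str = "construction site"  # setting currently in effect
    candidate: int = 0                  # index highlighted on the display
    active: bool = False                # True while the menu is shown

    def on_voice(self, op):
        """Handle a speech recognition result op; 'environment' opens the menu."""
        if op != "environment":
            return []
        self.active = True
        self.candidate = ENVIRONMENTS.index(self.current)
        return [("c4", "show", ENVIRONMENTS[self.candidate])]

    def on_dial(self, steps):
        """Handle dial rotation, given as signed steps (+ one way, - the other)."""
        if not self.active:
            return []
        self.candidate = (self.candidate + steps) % len(ENVIRONMENTS)
        return [("c4", "highlight", ENVIRONMENTS[self.candidate])]

    def on_button(self, pressed):
        """Handle the button; a press confirms the highlighted candidate."""
        if not (self.active and pressed):
            return []
        self.current = ENVIRONMENTS[self.candidate]
        self.active = False
        # c1 goes to the input speech enhancement unit, c5 to speech recognition.
        return [("c1", "set", self.current), ("c5", "set", self.current)]
```

Each handler returns the control signals that the function control unit 3 would emit for that event, so the display-only signal c_4 and the parameter-changing signals c_1 and c_5 are kept separate, as in the description.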

FIG. 4 illustrates how volume setting is performed on the audio/video input/output device of this embodiment. First, the user says "volume" as a voice command. Based on the speech recognition result op "volume", the function control unit 3 sends a control signal c_4 for calling the volume setting function to the video generation unit 6. The video generation unit 6 displays the current volume setting on the transmissive screen of the video display unit G. In the example of FIG. 4, the current volume setting is "50". When the user rotates the dial operation unit D, the dial operation signal d is input to the function control unit 3, a control signal c_2 for changing the volume setting is sent to the audio generation unit 4, and a control signal c_4 for changing the volume display is sent to the video generation unit 6. The audio generation unit 4 raises or lowers the volume of the output acoustic signal a_s output to the sound emitting unit S according to the control signal c_2. The video generation unit 6 raises or lowers the displayed volume according to the control signal c_4.
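
How a dial operation signal d (rotation direction and angle) might be turned into a new volume value can be sketched as follows. The source does not state the mapping, so the 15 degrees per volume step, the 0 to 100 range, and the convention that direction +1 raises the volume are assumptions, as is the function name `update_volume`.

```python
def update_volume(volume, direction, angle_deg, degrees_per_step=15,
                  lowest=0, highest=100):
    """Return the new volume after a dial rotation.

    direction -- +1 or -1, the rotation direction carried by signal d
    angle_deg -- non-negative rotation angle carried by signal d
    """
    steps = int(angle_deg // degrees_per_step)  # whole steps only
    return max(lowest, min(highest, volume + direction * steps))
```

Starting from the setting "50" shown in FIG. 4, a 45-degree turn in the positive direction would give 53 under these assumptions, and the result never leaves the 0 to 100 range.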

FIG. 5 illustrates how video zoom is performed on the audio/video input/output device of this embodiment. First, the user says "video zoom" as a voice command. Based on the speech recognition result op "video zoom", the function control unit 3 sends a control signal c_4 for calling the video zoom function to the video generation unit 6. The video generation unit 6 displays the video currently being captured by the imaging unit V on the transmissive screen of the video display unit G. When the user rotates the dial operation unit D, the dial operation signal d is input to the function control unit 3, and a control signal c_3 for changing the video zoom magnification is sent to the video acquisition unit 5. The video acquisition unit 5 changes the magnification of the imaging unit V and acquires the video signal v_o. When shooting video hands-free, a camera mounted on the head or the like does not by itself let the person shooting know exactly what is being captured, and operations that involve touching the imaging unit V directly are also difficult. If video zoom can be performed through the operations described above, the captured range can be checked and the zoom can be adjusted smoothly.
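
The description leaves open whether the magnification of the imaging unit V is changed optically or digitally. Assuming a digital zoom for illustration only, the region of the frame to keep for a given magnification can be computed as a centered crop; the function name `zoom_crop` and the pixel-based frame size are hypothetical.

```python
def zoom_crop(width, height, magnification):
    """Return (left, top, crop_w, crop_h) of a centered crop for digital zoom."""
    if magnification < 1.0:
        raise ValueError("magnification must be at least 1.0")
    crop_w = round(width / magnification)
    crop_h = round(height / magnification)
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, crop_w, crop_h
```

For a 1920 x 1080 frame, a magnification of 2.0 keeps the central 960 x 540 region, which would then be scaled back up to the display size.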

FIG. 6 shows a usage image of video tracking in the audio/video input/output device of this embodiment. First, the user inputs "video tracking" as a voice command. Based on the voice recognition result op "video tracking", the function control unit 3 sends a control signal c₄ for calling the video tracking function to the video generation unit 6. The video generation unit 6 displays the video currently being captured by the imaging unit V on the transmissive screen of the video display unit G while highlighting a tracking target candidate in the video. In the example of FIG. 6, three objects are displayed as tracking target candidates, and the object at the upper left of the transmissive screen is highlighted as the tracking target candidate. When the user rotates the dial operation unit D here, the dial operation signal d is input to the function control unit 3, and a control signal c₄ for switching the tracking target candidate is sent to the video generation unit 6. When the user selects the desired tracking target candidate by dial operation and presses the button operation unit B in that state, a button operation signal b is input to the function control unit 3, and a control signal c₃ for determining the tracking target is sent to the video acquisition unit 5. The video acquisition unit 5 starts video tracking of the tracking target determined in accordance with the control signal c₃. Thereafter, as long as the tracking target remains within the angle of view of the imaging unit V, a video signal v_o centered on the tracking target is captured. By combining video tracking and video zoom, any object in the field of view can be tracked while zoomed. In this case, the imaging unit V must support pan, tilt, and zoom functions. Assuming situations in which dial operation is difficult for the user, the tracking target can also be selected by voice command input. Possible methods include switching the tracking target candidate with voice commands such as "right" and "left", or selecting a candidate by directly speaking coordinates on the screen. Selecting the tracking target by voice command input allows completely hands-free operation.
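The candidate selection described above (dial or "right"/"left" voice commands to switch, button to decide) can be sketched as follows. The class, its method names, and the wrap-around behavior at the ends of the candidate list are assumptions made for illustration.

```python
class TrackingSelector:
    """Hypothetical selector for tracking target candidates."""

    def __init__(self, candidates):
        if not candidates:
            raise ValueError("at least one candidate is required")
        self.candidates = list(candidates)
        self.index = 0        # currently highlighted candidate
        self.target = None    # decided tracking target, if any

    def on_dial(self, direction: int):
        # direction is +1 or -1; switching is ignored once a target is decided.
        if self.target is None:
            self.index = (self.index + direction) % len(self.candidates)
        return self.candidates[self.index]

    def on_voice(self, word: str):
        # "right" and "left" act like one dial step in each direction.
        steps = {"right": 1, "left": -1}
        if word in steps:
            return self.on_dial(steps[word])
        return self.candidates[self.index]

    def on_button(self):
        # Pressing the button decides the highlighted candidate as the target.
        self.target = self.candidates[self.index]
        return self.target
```

Routing the voice commands through the same step function as the dial keeps both input paths consistent, which matches the idea that voice input is a substitute for the dial when the hands are occupied.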

[Second Embodiment]
As illustrated in FIG. 7, the audio/video input/output device of the second embodiment includes, as in the first embodiment, n (≧2) sound collection units M₁, ..., M_n, a sound emitting unit S, an imaging unit V, a video display unit G, a dial operation unit D, a button operation unit B, and a communication recording unit C, and further includes m-n front sound collection units M_{n+1}, ..., M_m (m ≧ 4). In addition, as illustrated in FIG. 8, the audio/video input/output device of this embodiment further includes a target sound enhancement unit 7 in addition to the components of the audio/video input/output device of the first embodiment.

The front sound collection units M_{n+1}, ..., M_m are microphones that, when the device is worn by the user, collect sound arriving from the direction corresponding to the user's field of view. FIG. 7 shows an example in which the front sound collection units M_{n+1}, ..., M_m are arranged horizontally near the imaging unit V at the forehead, but any arrangement may be used as long as sound arriving from the direction corresponding to the user's field of view can be collected.

The target sound enhancement unit 7 performs target sound enhancement processing on the front acoustic signals x_{n+1}, ..., x_m collected by the front sound collection units M_{n+1}, ..., M_m, respectively, and outputs a target sound enhancement signal a_o2 in which a specific sound is enhanced. The target sound enhancement signal a_o2 is sent to the voice generation unit 4 and the audio output unit C1. The user specifies the sound to be enhanced using voice command input, dial operation, and button operation. The specific operation is described later. Any known method may be applied to the target sound enhancement processing; for example, the acoustic signal enhancement technique described in Reference 1 above can be used.
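As one example of a known enhancement method, a delay-and-sum beamformer is sketched below. It is not the technique of Reference 1; it is swapped in here only because it is simple. The sketch assumes a horizontal linear array, a far-field (plane wave) source, integer sample delays, and a speed of sound of 343 m/s; the function names are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value


def steering_delays(mic_x, azimuth_deg, sample_rate):
    """Integer sample delays aligning a plane wave from azimuth_deg.

    mic_x holds microphone positions in meters along the array axis;
    azimuth 0 is straight ahead and positive angles point toward +x.
    """
    s = math.sin(math.radians(azimuth_deg))
    # A microphone closer to the source hears the wave earlier, so it is
    # delayed more to line up with the others.
    raw = [x * s / SPEED_OF_SOUND * sample_rate for x in mic_x]
    base = min(raw)
    return [round(r - base) for r in raw]


def delay_and_sum(signals, delays):
    """Average the channels after delaying channel i by delays[i] samples."""
    length = len(signals[0])
    out = [0.0] * length
    for sig, d in zip(signals, delays):
        for t in range(d, length):
            out[t] += sig[t - d]
    return [v / len(signals) for v in out]
```

Sound from the steered direction adds coherently after alignment, while sound from other directions is spread over time and therefore attenuated relative to the target.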

FIG. 9 shows a usage image of audio zoom in the audio/video input/output device of this embodiment. Audio zoom is a function that collects sound arriving from a specific sound source in a concentrated manner. First, the user inputs "audio zoom" as a voice command. Based on the voice recognition result op "audio zoom", the function control unit 3 sends a control signal c₄ for calling the audio zoom function to the video generation unit 6. The video generation unit 6 displays the current audio zoom settings (center position and magnification) on the transmissive screen of the video display unit G. In the example of FIG. 9, the center of the thick circle indicates the zoom position, and the radius of the circle indicates the zoom magnification. The user changes the audio zoom settings by dial operation and button operation. First, rotating the dial operation unit D moves the horizontal coordinate of the zoom position on the transmissive screen. Pressing the button operation unit B at the desired position fixes the horizontal coordinate. Next, rotating the dial operation unit D moves the vertical coordinate of the zoom position on the transmissive screen. Pressing the button operation unit B at the desired position fixes the vertical coordinate. After that, rotating the dial operation unit D changes the zoom magnification. Finally, when the button operation unit B is pressed, a control signal c₁ for confirming the change to the audio zoom settings is sent to the target sound enhancement unit 7. The target sound enhancement unit 7 outputs a target sound enhancement signal a_o2 in which the sound arriving from the specified direction is enhanced at the specified magnification. Audio zoom can be used, for example, to record video and audio from the wearer's viewpoint at construction sites or disaster sites. In particular, when selectively listening to the sound of a distant person or object under noise, being able to set the audio zoom in conjunction with the transmissive screen allows intuitive and efficient operation. This operation method is possible because the device is helmet-integrated, so that the positional relationship between the wearer's eyes and the front sound collection units M_{n+1}, ..., M_m is close and constant.
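The three-stage setting sequence (horizontal coordinate, vertical coordinate, magnification, each fixed by a button press) can be sketched as a small state machine. The class name, the integer units for the dial amount, and the lower bound of 1 on the magnification are assumptions for illustration.

```python
class AudioZoomSetting:
    """Hypothetical state machine for the audio zoom setting sequence."""

    STEPS = ("x", "y", "magnification")

    def __init__(self, x=0, y=0, magnification=1):
        self.values = {"x": x, "y": y, "magnification": magnification}
        self.step = 0          # index into STEPS of the value being edited
        self.confirmed = None  # settings sent with control signal c1, once decided

    def on_dial(self, amount: int) -> None:
        # The dial edits whichever value the sequence is currently on.
        if self.step < len(self.STEPS):
            key = self.STEPS[self.step]
            self.values[key] += amount
            if key == "magnification":
                self.values[key] = max(1, self.values[key])

    def on_button(self):
        # Each press fixes the current value; the third press confirms all.
        if self.step < len(self.STEPS):
            self.step += 1
            if self.step == len(self.STEPS):
                self.confirmed = dict(self.values)
        return self.confirmed
```

Reusing one dial and one button for all three parameters is what lets the whole setting be made with a single hand, at the cost of a fixed editing order.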

FIG. 10 shows a usage image of audio tracking in the audio/video input/output device of this embodiment. First, the user inputs "audio tracking" as a voice command. Based on the voice recognition result op "audio tracking", the function control unit 3 sends a control signal c₄ for calling the audio tracking function to the video generation unit 6. The video generation unit 6 displays the video currently being captured by the imaging unit V on the transmissive screen of the video display unit G while highlighting a tracking target candidate in the video. In the example of FIG. 10, three objects are displayed as tracking target candidates, and the object at the upper left of the transmissive screen is highlighted as the tracking target candidate. When the user rotates the dial operation unit D, the dial operation signal d is input to the function control unit 3, and a control signal c₄ for switching the tracking target candidate is sent to the video generation unit 6. When the user selects the desired tracking target candidate by dial operation and presses the button operation unit B in that state, a button operation signal b is input to the function control unit 3, and a control signal c₁ for determining the tracking target is sent to the target sound enhancement unit 7. The target sound enhancement unit 7 starts audio tracking of the tracking target determined in accordance with the control signal c₁. Thereafter, as long as the tracking target remains within the angle of view of the imaging unit V, a target sound enhancement signal a_o2 in which the sound arriving from the direction of the tracking target is enhanced is output. By combining audio tracking and audio zoom, the magnification of the audio enhancement can also be specified. Assuming situations in which dial operation is difficult for the user, the tracking target can also be selected by voice command input. Possible methods include switching the tracking target candidate with voice commands such as "right" and "left", or selecting a candidate by directly speaking coordinates on the screen. Selecting the tracking target by voice command input allows completely hands-free operation.
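To enhance sound from the direction of a tracked object, its position in the image has to be converted into an arrival direction. The sketch below shows one way this could be done. It assumes a pinhole camera model and that the optical axis of the imaging unit coincides with the forward direction of the front microphone array; the function name and parameters are hypothetical.

```python
import math


def pixel_to_azimuth(px, image_width, horizontal_fov_deg):
    """Convert a horizontal pixel position to an azimuth angle in degrees.

    0 degrees is the image center; positive angles are to the right.
    """
    half_width = image_width / 2.0
    # Focal length in pixels implied by the horizontal field of view.
    focal_px = half_width / math.tan(math.radians(horizontal_fov_deg) / 2.0)
    return math.degrees(math.atan((px - half_width) / focal_px))
```

For example, with a 1920-pixel-wide image and a 90-degree horizontal field of view, the image center maps to 0 degrees and the right edge to about 45 degrees. A fixed mapping like this is only meaningful because the camera and microphones are mounted rigidly on the same helmet.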

[Third Embodiment]
In the first and second embodiments, the voice generation unit 4 may be configured to have an external noise suppression function. In this case, the voice generation unit 4 acquires the environmental noise around the sound emitting unit S using an existing sound collection unit, a front sound collection unit, or a dedicated sound collection unit, generates a signal that suppresses that environmental noise, and adds it to the output acoustic signal a_s, thereby suppressing the environmental noise. With this configuration, the output acoustic signal a_s output from the sound emitting unit S becomes easier for the user to hear when the device is used in noisy conditions. Any known method may be applied to the external noise suppression processing; for example, the noise suppression technique described in Reference 2 below can be used.
[Reference 2] Japanese Patent Laid-Open No. 7-303135
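The idea of adding a suppressing signal to the output acoustic signal a_s can be illustrated with an idealized sketch. It is not the technique of Reference 2: it assumes the measured noise equals the noise at the ear exactly and ignores acoustic path delay and adaptive filtering, which a practical system would need. Function names and the `gain` parameter are hypothetical.

```python
def add_noise_cancelling(output_signal, ambient_noise, gain=1.0):
    """Return a_s plus an anti-phase copy of the measured ambient noise."""
    if len(output_signal) != len(ambient_noise):
        raise ValueError("signals must have the same length")
    return [s - gain * n for s, n in zip(output_signal, ambient_noise)]


def heard_at_ear(emitted, ambient_noise):
    """Idealized sum of the emitted sound and the ambient noise at the ear."""
    return [e + n for e, n in zip(emitted, ambient_noise)]
```

Under these idealized assumptions the anti-phase component cancels the ambient noise at the ear, leaving only the intended output signal.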

The key points of the audio/video input/output device of the present invention are as follows. First, a plurality of microphones for target sound enhancement are installed in an integrated helmet so that the wearer's speech and surrounding sounds can be collected easily. Second, a transmissive screen, together with voice command input, dial operation, and button operation functions, is provided as an intuitive interface for controlling the mounted camera and the plurality of microphones, so that video and audio can be controlled intuitively and efficiently.

Because the audio/video input/output device of the present invention has a target sound enhancement function, the wearer's voice can be picked up clearly even in a high-noise environment. As a result, voice command input can be used reliably as a means of conveying the wearer's intention to the device. In addition, by making the device a helmet-integrated device equipped with a dial and a transmissive screen, parameters such as direction, enhancement rate, and magnification for microphone sound collection and camera shooting can be operated intuitively. Furthermore, the integrated design leaves both of the wearer's hands free while the wearer shares on-site video with a communication partner and receives instructions remotely. In particular, where work must be done while communicating with a remote instructor or checking electronic data in a high-noise place or a dangerous place such as a construction site, using the audio/video input/output device of the present invention makes it possible to work safely, accurately, and efficiently.

It goes without saying that the present invention is not limited to the embodiments described above and can be modified as appropriate without departing from the spirit of the present invention. The various processes described in the above embodiments may be executed not only in time series in the order described, but also in parallel or individually according to the processing capability of the device that executes the processes or as necessary.

[Program, Recording Medium]
When the various processing functions of each device described in the above embodiments are implemented by a computer, the processing content of the functions that each device should have is described by a program. By executing this program on a computer, the various processing functions of each device are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example, a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.

A computer that executes such a program first, for example, temporarily stores the program recorded on the portable recording medium, or the program transferred from the server computer, in its own storage device. When executing processing, the computer reads the program stored in its own recording medium and executes processing in accordance with the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing in accordance with it, or, each time a program is transferred from the server computer to the computer, the computer may sequentially execute processing in accordance with the received program. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and result acquisition. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the device is configured by executing a predetermined program on a computer, but at least part of the processing content may be implemented in hardware.

M Sound collection unit
S Sound emitting unit
V Imaging unit
G Video display unit
C Communication recording unit
D Dial operation unit
B Button operation unit
1 Input voice enhancement unit
2 Voice recognition unit
3 Function control unit
4 Voice generation unit
5 Video acquisition unit
6 Video generation unit
7 Target sound enhancement unit
C1 Audio output unit
C2 Audio input unit
C3 Video output unit
C4 Video input unit

Claims

An audio/video input/output device comprising:
a plurality of sound collection units that collect surrounding sounds including a user's voice;
an imaging unit that images an area corresponding to the user's field of view;
a video display unit having a screen arranged at a position within the user's field of view;
a dial operation unit that outputs a dial operation signal indicating a rotation direction and a rotation angle in accordance with a rotation operation;
a button operation unit that is disposed at the position of the rotation axis on the surface of the dial operation unit and outputs a button operation signal indicating a pressed state;
a voice recognition unit that performs voice recognition on an acoustic signal acquired using the sound collection units and generates a voice recognition result;
a video acquisition unit that acquires a video signal using the imaging unit;
a video generation unit that outputs, to the video display unit, an output video signal generated using the video signal; and
a function control unit that generates control signals for controlling the functions of the video acquisition unit, the voice recognition unit, and the video generation unit based on the dial operation signal, the button operation signal, and the voice recognition result.

The audio/video input/output device according to claim 1, further comprising:
an input voice enhancement unit that enhances the voice included in the acoustic signal and outputs it to the voice recognition unit.

The audio/video input/output device according to claim 1 or 2,
wherein the video acquisition unit acquires, based on the control signal, the video signal in which a region designated within the angle of view of the imaging unit is enlarged and/or tracked.

The audio/video input/output device according to any one of claims 1 to 3, further comprising:
a sound emitting unit disposed at a position corresponding to the user's ear;
a plurality of front sound collection units that collect sound arriving from a direction corresponding to the user's field of view;
a target sound enhancement unit that enhances a specific sound included in a front acoustic signal acquired using the front sound collection units; and
a voice generation unit that outputs, to the sound emitting unit, an output acoustic signal generated using the acoustic signal and the front acoustic signal.

The audio/video input/output device according to claim 4,
wherein the function control unit generates a control signal for controlling the function of the target sound enhancement unit, and
the target sound enhancement unit enhances, based on the control signal, sound arriving from a sound source located in a direction designated within the angle of view of the imaging unit.

The audio/video input/output device according to claim 4 or 5,
wherein the voice generation unit acquires environmental noise around the sound emitting unit using the sound collection units, and adds a signal that suppresses the environmental noise to the output acoustic signal.

The audio/video input/output device according to any one of claims 1 to 6,
wherein the video display unit is movable to a position outside the user's field of view and outside the angle of view of the imaging unit.

The audio/video input/output device according to any one of claims 1 to 7,
wherein the voice recognition unit transmits the acoustic signal to a voice recognition device using a communication unit, and receives, using the communication unit, a voice recognition result obtained by the voice recognition device performing voice recognition on the acoustic signal.