JP2019197550A

JP2019197550A - Sound input/output device

Info

Publication number: JP2019197550A
Application number: JP2019105601A
Authority: JP
Inventors: Masato Fujino (藤野 真人)
Original assignee: Fairy Devices Inc
Current assignee: Fairy Devices Inc
Priority date: 2017-06-26
Filing date: 2019-06-05
Publication date: 2019-11-14
Also published as: JP6675527B2; JP2019009770A

Abstract

To provide a sound input/output device that correctly recognizes sound in accordance with its usage environment and responds according to speaker identification and emotional state. SOLUTION: A sound input/output device according to one embodiment of the present invention includes: a sound receiving section having a plurality of microphones arranged three-dimensionally that can receive non-audible as well as audible sounds; a sound generation section for generating audible and/or non-audible sounds through one or more speakers; a signal processing section for processing and controlling signals from the microphones; a display section for performing a display on the basis of a processing result of the signal processing section; and a recording section for recording information on the sound collected by the sound receiving section. SELECTED DRAWING: Figure 1A

Description

The present invention relates to, for example, a voice input/output device, and more particularly to a voice input/output device suited to the way its user uses it.

In recent years, improvements in the performance of computers and communication devices have made terminal devices more capable and have also enabled advanced information processing over a network, known as the cloud. In particular, voice input/output devices called AI speakers have become widespread. These devices have a voice input function that accepts voice input from a microphone (hereinafter also "mic") and a voice output function that outputs sound from a speaker. Such a device is required to correctly recognize voice input from the microphone under various usage environments, to respond without delay by voice output, display, or the like, and to correctly record the input voice.

In this regard, Patent Document 1 discloses a technical idea for clearly recognizing the voice uttered by a user in a usage environment where sound from a speaker, ambient noise, and the user's voice are present simultaneously.

Patent Document 2 discloses a technical idea for improving the accuracy of voice recognition when the user's voice and the audio output from the speaker overlap in time.

However, neither Patent Document 1 nor Patent Document 2 discloses a technical idea that leads to more reliable voice recognition.

In addition, neither document discusses voice recording in detail. In particular, identifying speech and recording it as language is very difficult. Here, recording as language refers to the language the user normally uses; for example, if the user is Japanese, it means recording the speech as Japanese text.

Meanwhile, Patent Document 3 discloses a technical idea in which a voice input/output device begins operating only in response to an activation word input from the microphone. The operation of the voice input/output device in that document remains passive. Moreover, because anyone who utters the activation word (hotword) could use the device, the document discloses a technique for ensuring security by storing the user's hotword audio fingerprint in advance and activating the device only when the input hotword matches it. However, it is difficult to determine whether an input hotword matches the stored hotword audio fingerprint, and a more reliable means of ensuring security is needed.

Patent Document 1: JP 2001-94370 A
Patent Document 2: JP 2015-184530 A
Patent Document 3: JP 2017-76117 A

In view of the conventional problems described above, an object of the present application is to provide a voice input/output device that can reliably recognize the user's voice even in a usage environment where mechanical noise or specific sounds such as laughter or alarm sounds are present.

Another object is to provide a voice input/output device embodying high-speed voice recognition processing technology that reduces user stress. A further object is to provide a voice input/output device that actively probes the state of its usage environment and applies the most suitable voice recognition technology. In the following description, the voice uttered by the user, sound emitted from the speaker, and sound arising around the voice input/output device of the present invention may be referred to collectively as "sound".

In addition to the above, further objects are to provide a voice input/output device that improves voice recognition accuracy by also identifying who the user is and the user's gender and emotional state, and a voice input/output device that optimizes its response to the user's voice instructions. Yet another object is to provide a voice input/output device that proactively speaks to the user and incorporates security measures.

To solve the problems described above, the voice input/output device of the present application measures its usage environment using non-audible sound, performs processing optimized for the measured environment, and performs speaker identification and emotional-state identification, thereby serving as a proactive man-machine interface device. More specifically, a voice input/output device according to one aspect of the present application can be configured as a device comprising: a voice reception unit in which a plurality of microphones capable of receiving sound from audible to non-audible are arranged three-dimensionally; a sound generation unit that emits audible and/or non-audible sound through one or more speakers; a signal processing unit that processes and controls signals from the microphones; a display unit that performs display based on the processing result of the signal processing unit; and a recording unit that records the voice information collected by the voice reception unit.

In more detail, a voice input/output device according to one aspect of the present application may comprise: a voice reception unit in which a plurality of microphones capable of receiving sound from audible to non-audible are arranged three-dimensionally; a sound generation unit that emits audible and/or non-audible sound through one or more speakers; a sound diffusion unit that diffuses the sound emitted by the sound generation unit; a signal processing unit that processes and controls signals from the microphones; a display unit that performs display based on the processing result of the signal processing unit; a recording unit that records the voice information collected by the voice reception unit; an interface unit that exchanges information with external devices by wire; a communication unit that exchanges information wirelessly; a power supply unit that supplies power to the voice reception unit, the sound generation unit, the sound diffusion unit, the signal processing unit, the display unit, the recording unit, the interface unit, and the communication unit; and a housing that accommodates these units.

In the above, audible sound generally means sound from 20 Hz to 20 kHz, and non-audible sound means sound at any other frequency. As the non-audible sound used to probe the surroundings of the voice input/output device, described later, it is desirable to use so-called ultrasound at around 30 kHz, in view of the ease of generating and collecting it and of its resolution.
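To make the frequency boundaries above concrete, the following minimal Python sketch classifies a frequency as audible or non-audible and computes the wavelength of a probing tone. The constant and function names, and the speed of sound of 343 m/s (air at roughly 20 degrees C), are assumptions introduced for illustration and do not come from this application.

```python
AUDIBLE_MIN_HZ = 20.0        # lower bound of audible sound, per the description
AUDIBLE_MAX_HZ = 20_000.0    # upper bound of audible sound, per the description
PROBE_HZ = 30_000.0          # ultrasonic probing frequency suggested in the text
SPEED_OF_SOUND_M_S = 343.0   # assumption: air at about 20 degrees C


def is_audible(freq_hz: float) -> bool:
    """Return True if the frequency lies in the 20 Hz to 20 kHz audible band."""
    return AUDIBLE_MIN_HZ <= freq_hz <= AUDIBLE_MAX_HZ


def wavelength_m(freq_hz: float, c: float = SPEED_OF_SOUND_M_S) -> float:
    """Wavelength in metres of a tone at freq_hz, given sound speed c."""
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    return c / freq_hz
```

Under these assumptions a 30 kHz tone has a wavelength of about 11.4 mm, which gives a rough sense of the spatial scale such a probe can resolve.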

In addition to the configuration of the above aspect, the present application may further include a display unit composed of a plurality of light-emitting indicators and/or image displays. In this case, the way the light-emitting indicators or image displays present information can be varied according to ambient environmental sound, speaker identification, or the result of identifying the speaker's emotion.

In the above aspect, the device may have a sound arrival information acquisition function that emits the non-audible sound intermittently, receives the sound reflected from the device's surroundings with the plurality of microphones, and obtains sound arrival information for understanding the environment around the device in terms of two-dimensional direction and distance.
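The distance part of this function can be illustrated with ordinary pulse-echo ranging: a pulse is emitted, the round-trip time to the first reflection is measured, and the distance is half the round-trip path. The sketch below is a simplified illustration, not the method of this application. It assumes that sample index 0 coincides with the moment the pulse is emitted, that a simple amplitude threshold is enough to detect the echo, and that sound travels at 343 m/s; all function and parameter names are hypothetical.

```python
from typing import Optional, Sequence


def first_echo_delay_s(
    samples: Sequence[float],
    sample_rate_hz: float,
    threshold: float,
    skip_samples: int = 0,
) -> Optional[float]:
    """Time of the first sample whose magnitude reaches the threshold.

    skip_samples lets the caller ignore the direct (non-reflected) pulse.
    Returns None when no sample reaches the threshold.
    """
    for i in range(skip_samples, len(samples)):
        if abs(samples[i]) >= threshold:
            return i / sample_rate_hz
    return None


def echo_distance_m(round_trip_s: float, c: float = 343.0) -> float:
    """Distance to a reflector: the pulse travels out and back, so halve it."""
    if round_trip_s < 0:
        raise ValueError("round-trip time must be non-negative")
    return c * round_trip_s / 2.0
```

For example, an echo detected 1 ms after emission corresponds to a reflector roughly 0.17 m away under the assumed sound speed.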

In the above aspect, the device may further have an environmental sound identification function capable of acquiring environmental sound identification information, which is information for identifying environmental sounds.

In the above aspect, the device may further have a speaker identification function capable of acquiring speaker identification information, which is information for identifying a speaker.

In the above aspect, the device may further have a speaker emotion identification function capable of acquiring speaker emotion information, which is information for identifying a speaker's emotional state.

In the above aspect, the device may further include a speaker identification function capable of acquiring speaker identification information, which is information for identifying a speaker, and a speaker emotion identification function capable of acquiring speaker emotion information, which is information for identifying a speaker's emotional state. When sound information input from the microphones is recorded in the recording unit, one or more of the sound arrival information, speaker identification information, speaker emotion information, and external information linked to that sound information may be recorded at substantially the same time.

In the above aspect, at least one of the light emission interval, emission color, and emission order of the plurality of light-emitting display units may be changed based on at least one of the sound arrival information, the speaker identification information, the speaker emotion information, and external information.

In the above aspect, the device may further include a mechanism for rotating the entire device and a vibration mechanism.

In the above aspect, the device may further include an imaging unit.

In the above aspect, the device may further include a personal authentication unit.

In the above aspect, the device may further include a projector unit.

In the above aspect, the device may further include an infrared communication unit.

In addition to the configuration of the above aspect, and in addition to passive activation by an activation word, the present application may detect an intruder by emitting non-audible sound or by monitoring with a TV camera, have the voice input/output device actively activate itself, and further provide identification functions such as exchanging a passphrase, face recognition by the TV camera, and fingerprint matching. In this case, personal identification can be performed more reliably in addition to the speaker identification described above, ensuring security.

The technical idea of the present application includes, for example, recording what emotion a speaker felt in response to which utterance in order to improve customer satisfaction and, when the client-side voice input/output device is used in a call center, alerting the operator or reporting to a supervisor. It also includes, when the client-side device is used in a meeting and an attendee becomes emotional, calling a break to calm things down or uttering speech that encourages calm.

Overall, the present application makes it possible to actively probe the usage environment and apply the voice recognition technology best suited to the result, to recognize environmental sounds in the usage environment and block input from a noise source located in a specific direction, and to identify the user's voice characteristics. When the voices of multiple speakers are recorded, it becomes possible to identify which speaker each recording belongs to. Furthermore, the device can act proactively, for example by automatically determining that the owner has come home and saying "Welcome home!"

Using multiple microphones, beamforming reveals the speaker's two-dimensional direction, so the speaker's words can be reliably identified by separating them from ambient noise. Recording this direction identification information, together with the speaker identification information, emotion identification information, and external information, alongside the received voice information is very useful for organizing the voice information later.
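The application does not specify how direction is derived from the microphone signals. As one common illustration, the sketch below estimates a time difference of arrival (TDOA) for a single microphone pair by brute-force cross-correlation and converts it to an angle under a far-field assumption. This is a building block that delay-and-sum beamforming also relies on, not a full beamformer. The function names, the 343 m/s sound speed, and the far-field model are assumptions.

```python
import math
from typing import Sequence


def best_lag(a: Sequence[float], b: Sequence[float], max_lag: int) -> int:
    """Lag in samples (positive when b trails a) maximizing cross-correlation."""
    best, best_score = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, x in enumerate(a):
            j = i + lag
            if 0 <= j < len(b):
                score += x * b[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def arrival_angle_deg(
    lag_samples: int,
    sample_rate_hz: float,
    mic_spacing_m: float,
    c: float = 343.0,
) -> float:
    """Far-field angle from broadside for one microphone pair.

    Uses sin(theta) = c * tau / d, clamped to [-1, 1] to absorb
    rounding and measurement error.
    """
    tau = lag_samples / sample_rate_hz
    s = max(-1.0, min(1.0, c * tau / mic_spacing_m))
    return math.degrees(math.asin(s))
```

With several pairs on the two microphone rings, per-pair angles of this kind could in principle be combined into the two-dimensional direction mentioned above; how the device actually does so is not stated in the source.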

When voice information is converted into language and recorded, it is very important to identify who uttered it, but this cannot be determined from a record that has merely been converted into language. If the direction identification information, speaker identification information, emotion identification information, and external information are recorded as described above, reliable speaker identification becomes possible.

As described above, by emitting non-audible sound intermittently in pulses and receiving the reflected sound with the plurality of microphones, the acoustic environment, such as surrounding reflectors, can be checked. This minimizes the effect of multipath sound propagation and further improves the accuracy of voice identification. Furthermore, when a reflector around the voice input/output device moves over time, the device can judge that an intruder is present and achieve a proactive function not previously available, such as saying "Welcome" or "Welcome home".
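The moving-reflector check can be pictured as a comparison between a baseline set of reflector distances and a later measurement. The following sketch is a deliberately simple illustration and not the detection logic of this application: it assumes each probing pulse has already been reduced to a list of reflector distances in metres, and the function name and the 5 cm tolerance are hypothetical.

```python
from typing import Sequence


def reflector_moved(
    baseline_m: Sequence[float],
    current_m: Sequence[float],
    tolerance_m: float = 0.05,
) -> bool:
    """Return True if the set of reflectors differs from the baseline.

    A change in the number of reflectors, or any distance shifting by
    more than tolerance_m, counts as movement. Distances are compared
    in sorted order, which is a simplification.
    """
    if len(baseline_m) != len(current_m):
        return True
    pairs = zip(sorted(baseline_m), sorted(current_m))
    return any(abs(before - after) > tolerance_m for before, after in pairs)
```

A device could run such a check after each intermittent pulse and trigger a greeting or a further identity check when it returns True.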

In addition, by analyzing features of the voice, for example by frequency analysis, the device can identify the speaker and determine the speaker's emotional state, and can adjust the display accordingly, for example by presenting a display that calms an excited state. This is a very useful effect for a man-machine interface.

Furthermore, according to the present application, customer satisfaction can be improved by, for example, recording what emotion a speaker felt in response to which utterance and, when the client-side voice input/output device is used in a call center, alerting the operator or reporting to a supervisor. Likewise, when the client-side device is used in a meeting and an attendee becomes emotional, calling a break to calm things down or uttering speech that encourages calm makes it possible to provide an audio environment suited to the situation and atmosphere.

In addition to passive activation by an activation word, the device detects an intruder by emitting non-audible sound or by monitoring with a TV camera and actively activates itself. By exchanging a passphrase for personal identification, performing face recognition with the TV camera, matching fingerprints, and the like, it performs personal identification more reliably in addition to the speaker identification described above, thereby ensuring security.

FIG. 1A is a perspective view of a voice input/output device according to one embodiment of the present invention.
FIG. 1B is a perspective view of a voice input/output device according to another embodiment of the present invention.
FIG. 2A is a schematic diagram of the internal structure of a voice input/output device according to one embodiment of the present invention.
FIG. 2B is a perspective view conceptually illustrating the operation of a projector mounted in a voice input/output device according to one embodiment of the present invention.
FIG. 2C is a conceptual perspective view showing an example of the microphone arrangement of a voice input/output device according to one embodiment of the present invention.
FIG. 2D is a conceptual perspective view showing another example of the microphone arrangement of a voice input/output device according to one embodiment of the present invention.
FIG. 3 is an example block diagram of a voice input/output device according to one embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings. The following shows schematically only what is needed to explain how the object of the present invention is achieved and mainly describes what is needed to explain the relevant parts of the invention; parts whose description is omitted are taken to follow known techniques.

FIGS. 1A and 1B show two implementations of a voice input/output device according to one embodiment of the present invention. FIG. 1A shows an example with a simple design: all the electrical circuits and other components described later are housed in a cylindrical housing 10 covered with an exterior material 12, made of punched metal or the like, through which sound passes freely, and a light-emitting display unit 11 such as multicolor LEDs is provided at the top. The exterior material 12 is not limited to the material mentioned; any material that lets sound pass freely can be used. Likewise, the housing 10 is not limited to a cylinder; various shapes such as rectangular or polygonal-prism shapes are conceivable, and all of them fall within the technical idea of the present application.

FIG. 1B shows an example in which, in addition to the form shown in FIG. 1A, an image display unit 13 is incorporated, a light-emitting display unit 15 is incorporated at the top, and the housing base 16 is rotatable. The housing base 16 incorporates a rotation mechanism 31, described later, driven by a motor or the like, which can rotate the entire housing, so that an imaging unit 33 such as a TV camera and the image display unit 24 (see FIG. 2A) can be turned toward the speaker. Furthermore, the motor used for the rotation mechanism can be used to vibrate the entire housing, for example to acknowledge voice input or to emphasize the sound being output.

Similarly, an infrared human-presence sensor, a fingerprint sensor, and a TV camera may be installed at or around the top. In FIGS. 1A and 1B the individual multicolor LEDs are arranged continuously in a circle, but many variations, such as a polygonal or heart-shaped arrangement, are conceivable, and the lighting interval, lighting color, and lighting sequence of the individual LEDs can be chosen to suit each variation. The lighting sequence can also be used in various ways, for example to indicate the direction from which sound arrives, or to show a color that reflects the speaker's emotion or identifies the speaker; all of these fall within the technical idea of the present application.
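As an illustration of how a circular LED arrangement might indicate arrival direction and emotion, the sketch below maps an azimuth to the nearest LED on a ring and looks up a color for an emotion label. The LED numbering (index 0 at 0 degrees, increasing with azimuth), the emotion labels, and the RGB values are all hypothetical choices made for this example; the application does not define them.

```python
from typing import Dict, Tuple

RGB = Tuple[int, int, int]

# Hypothetical emotion-to-color table; the source does not define one.
EMOTION_COLORS: Dict[str, RGB] = {
    "neutral": (255, 255, 255),
    "calm": (0, 255, 0),
    "excited": (0, 0, 255),
}


def led_index_for_azimuth(azimuth_deg: float, num_leds: int) -> int:
    """Index of the LED closest to the given azimuth on an evenly spaced ring."""
    if num_leds <= 0:
        raise ValueError("num_leds must be positive")
    step = 360.0 / num_leds
    return int(round((azimuth_deg % 360.0) / step)) % num_leds


def color_for_emotion(emotion: str) -> RGB:
    """Color for an emotion label, falling back to the neutral color."""
    return EMOTION_COLORS.get(emotion, EMOTION_COLORS["neutral"])
```

On a 12-LED ring, for instance, a sound arriving from 90 degrees would light LED 3 under this numbering.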

FIG. 2A is an example internal structure diagram of the voice input/output device shown in FIG. 1B according to one embodiment of the present invention, and FIG. 2B is a perspective view conceptually illustrating the operation of the projector mounted in the device. As shown in FIG. 2A, a light-emitting display unit 21 is arranged at the top of the cylindrical housing 20. Near the top is a microphone unit 22 consisting of a plurality of microphones 220 arranged at substantially equal intervals, and below it is a microphone unit 23 in which a plurality of microphones 230 are likewise arranged at substantially equal intervals. The image display unit 24 and electrical circuits such as the signal processing unit described later are housed between microphone unit 22 and microphone unit 23.

FIG. 2C is a conceptual perspective view showing an example of the microphone arrangement of a voice input/output device according to one embodiment of the present invention, and FIG. 2D is a conceptual perspective view showing another example. In FIG. 2C, in addition to a microphone unit in which a plurality of microphones are arranged at equal intervals in a horizontal plane, a similar microphone unit is placed at a separate position along the vertical axis. This three-dimensional arrangement makes it possible to measure the two-dimensional arrival direction of the sound reaching the microphones. The microphone positions are not limited to the example of FIG. 2C; various arrangements are conceivable, such as placing the microphones at positions corresponding to the corners of a polygonal prism inscribed in the cylindrical housing, as in FIG. 2D, and all of them fall within the technical idea of the present application.
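The two-ring arrangement described for FIG. 2C can be written down as a set of 3D coordinates, which is the form a direction-finding routine would typically need. The sketch below generates such coordinates. The number of microphones per ring, the radius, and the ring heights are hypothetical parameters, since the application gives no dimensions.

```python
import math
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def ring_positions(n: int, radius_m: float, height_m: float) -> List[Point3D]:
    """n microphones equally spaced on a horizontal circle at the given height."""
    if n <= 0:
        raise ValueError("n must be positive")
    positions = []
    for k in range(n):
        angle = 2.0 * math.pi * k / n
        positions.append(
            (radius_m * math.cos(angle), radius_m * math.sin(angle), height_m)
        )
    return positions


def two_ring_array(
    n_per_ring: int, radius_m: float, z_top_m: float, z_bottom_m: float
) -> List[Point3D]:
    """Upper ring followed by lower ring, separated along the vertical axis."""
    return ring_positions(n_per_ring, radius_m, z_top_m) + ring_positions(
        n_per_ring, radius_m, z_bottom_m
    )
```

Separating the rings vertically is what lets such an array distinguish elevation as well as horizontal bearing, which a single horizontal ring cannot do on its own.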

Similarly, in FIG. 2A a plurality of speakers are arranged coaxially facing downward, and a roughly conical sound diffusion unit 30 is arranged below them on the same axis, so that the sound emitted from speakers 25 and 26 is diffused isotropically into the surroundings. Of course, speakers 25 and 26 and the sound diffusion unit 30 may be arranged upside down, and several other variations of the arrangement are conceivable; all of them fall within the technical idea of the present application.

In the forms shown in FIGS. 1B and 2A, the above configuration allows the image display unit 13 to be turned toward the speaker, providing a more effective man-machine interface. In the form shown in FIG. 1A, a similar configuration (not shown) allows the speaker's direction, determined with the plurality of microphones, to be indicated on the light-emitting display unit.

As described later, in addition to detecting an intruder from reflections of non-audible sound, an infrared human-presence sensor may be mounted at the top of the housing 10 or elsewhere. A fingerprint sensor for reliable personal identification, or an imaging device such as a TV camera, may likewise be installed at the top. Furthermore, as shown in FIG. 2B, equipping the device with a projector 34 allows explanatory diagrams and related images to be projected at an enlarged size in synchronization with the audio output. One conceivable application is using the projector 34 of this embodiment to project a map or an agenda onto an indoor whiteboard, wall, or screen for a meeting or a travel briefing.

FIG. 3 is a block diagram of the voice input/output device shown in FIG. 1B according to one embodiment of the present invention. As shown in the figure, signals from the microphone units 40 arranged in the upper and lower horizontal planes of the cylindrical case are input to a signal processing unit 42, built mainly around a micro-CPU (μCPU), via a microphone control unit 41 that performs AGC, beamforming, and the like. AGC (automatic gain control) means control that keeps the output level constant at a target value even when the system's input level changes; the same applies hereinafter. The microphone control unit 41 can also perform noise removal and echo cancellation via the interface unit 50.
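The AGC behavior defined above (hold the output level at a target despite input level changes) can be sketched in a few lines. The following is a generic frame-based illustration, not the circuit or algorithm used by the microphone control unit 41. The target RMS level, the gain ceiling, the smoothing factor, and the function name are all assumptions.

```python
import math
from typing import List, Sequence


def agc(
    frames: Sequence[Sequence[float]],
    target_rms: float = 0.1,
    max_gain: float = 20.0,
    smoothing: float = 0.9,
) -> List[List[float]]:
    """Scale each frame so that its RMS level approaches target_rms.

    smoothing in [0, 1) controls how slowly the gain follows the input;
    0 means the gain jumps to the desired value on every frame.
    """
    gain = 1.0
    out = []
    for frame in frames:
        rms = math.sqrt(sum(x * x for x in frame) / len(frame)) if frame else 0.0
        # For silent or empty frames, hold the previous gain.
        desired = min(max_gain, target_rms / rms) if rms > 0 else gain
        gain = smoothing * gain + (1.0 - smoothing) * desired
        out.append([x * gain for x in frame])
    return out
```

The gain ceiling keeps near-silent frames from being amplified without limit, and the smoothing avoids audible pumping when the input level jumps; both are common design choices rather than requirements stated in the source.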

The signal processing unit 42 applies preprocessing, such as ambient noise removal, to the audio signal from the microphones in order to improve identification accuracy. It extracts arrival direction information from the processed audio signal and also transmits the signal externally via the communication unit 43 or the interface unit 50, where advanced information processing such as speaker identification and emotion identification is performed, for example in the cloud. The result is recorded in the recording unit 47 as voice information together with the arrival direction information. At the same time, a display suited to the result of that processing can be shown on the display unit 46.

Furthermore, using the direction-of-arrival information, the signal processing unit 42 can exclude audio from a noise source located in a specific direction or, conversely, record only audio arriving from a specific direction.
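
As a rough illustration of this direction-based selection, the sketch below filters already-segmented audio by azimuth. The segment representation, the sector parameters, and the function names are hypothetical; the patent does not describe the selection logic at this level.

```python
def angular_distance(a, b):
    """Smallest absolute difference between two azimuths, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def select_segments(segments, center_deg, half_width_deg, mode="keep"):
    """Filter (azimuth_deg, payload) pairs by a sector around center_deg.

    mode="keep" returns only segments inside the sector (record one direction);
    mode="reject" returns only segments outside it (ignore a noise source).
    """
    if mode not in ("keep", "reject"):
        raise ValueError("mode must be 'keep' or 'reject'")
    result = []
    for azimuth, payload in segments:
        inside = angular_distance(azimuth, center_deg) <= half_width_deg
        if inside == (mode == "keep"):
            result.append((azimuth, payload))
    return result
```

For example, a television at roughly 180 degrees could be excluded with `select_segments(segs, 180, 20, mode="reject")`, while a talker near 0 degrees could be isolated with `mode="keep"`; the wrap-around at 360 degrees is handled by `angular_distance`.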

The recording unit 47 has a multi-layer structure. Related information such as the direction of arrival, speaker identification, and emotion identification of the audio to be recorded is linked to the audio information and recorded in a layer separate from it, which makes the recorded audio information much easier to organize.
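
One way to picture this layered arrangement is a store that keeps audio and its related information in separate maps joined by a shared segment ID. The class below is only a sketch under that assumption; the class name, field names, and metadata keys are invented for illustration and do not describe the actual recording unit 47.

```python
from dataclasses import dataclass, field


@dataclass
class LayeredRecorder:
    """Audio in one layer, linked metadata in another (hypothetical layout)."""

    audio_layer: dict = field(default_factory=dict)
    metadata_layer: dict = field(default_factory=dict)
    _next_id: int = 0

    def record(self, samples, **metadata):
        """Store samples and their related information under one segment ID."""
        seg_id = self._next_id
        self._next_id += 1
        self.audio_layer[seg_id] = list(samples)
        self.metadata_layer[seg_id] = dict(metadata)
        return seg_id

    def find(self, **criteria):
        """Return IDs of segments whose metadata matches every criterion."""
        return [
            seg_id
            for seg_id, meta in self.metadata_layer.items()
            if all(meta.get(key) == value for key, value in criteria.items())
        ]
```

Because searches touch only the metadata layer, recordings can be sorted by speaker, emotion, or direction without reading the audio itself, which is the organizational benefit the text describes.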

The signal processing unit 42 has a communication unit 43 for wireless communication with the outside via Wi-Fi, Bluetooth (registered trademark), or the like, and an interface unit 50 for hard-wired connection to external devices. Ambient noise picked up by an external microphone can therefore be input through the expansion port to reduce its influence, and external devices can be communicated with via the USB port.
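
The patent does not name the algorithm that uses the external microphone's signal. A standard technique for this kind of reference-based noise reduction is an adaptive filter such as normalized LMS (NLMS), sketched below as one possibility. The function names, tap count, and step size are assumptions, and the demo signals are synthetic.

```python
import random


def nlms_cancel(primary, reference, taps=4, mu=0.5, eps=1e-8):
    """Remove from `primary` the component predictable from `reference`.

    primary:   main microphone signal (wanted sound plus filtered noise)
    reference: external microphone signal (noise only, by assumption)
    Returns the error signal, which serves as the cleaned output.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    cleaned = []
    for d, x in zip(primary, reference):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate
        norm = eps + sum(h * h for h in history)
        weights = [w + mu * error * h / norm for w, h in zip(weights, history)]
        cleaned.append(error)
    return cleaned


def demo_signals(n=2000, seed=0):
    """Synthetic noise reference and a primary signal containing only filtered noise."""
    rng = random.Random(seed)
    ref = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    primary = [0.8 * ref[i] - 0.3 * (ref[i - 1] if i else 0.0) for i in range(n)]
    return primary, ref
```

In the demo, the primary signal is purely filtered noise, so a working canceller should drive the output toward zero after it adapts. In real use the output would retain the speech, since speech is not predictable from the noise reference.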

Furthermore, whereas voice commands have conventionally changed TV channels or switched lighting on and off via the Internet, equipping the device with an infrared communication unit (IR transceiver) 35 allows a voice command to control a TV, lighting, or other external equipment directly.

According to the present invention, the device not only recognizes the input audio signal correctly but can also actively sense its surroundings. It is therefore expected to be widely used in home appliances, entertainment, and various industrial fields as a push-type man-machine interface through which the voice input/output device itself actively speaks to the speaker.

DESCRIPTION OF SYMBOLS: 10 … housing, 11 … light-emitting display unit, 12 … exterior material, 13 … image display unit, 14 … rotating unit, 15 … light-emitting display unit, 16 … housing base, 20 … housing, 21 … light-emitting display unit, 22 … microphone unit, 23 … microphone unit, 24 … image display unit, 25 … speaker (audible sound generation unit), 26 … speaker (non-audible sound generation unit), 27 … audible sound, 28 … non-audible sound, 29 … base, 30 … sound diffusion unit, 31 … rotation mechanism, 32 … personal authentication unit, 33 … imaging unit, 34 … projector, 35 … infrared communication unit, 40 … microphone unit, 41 … microphone control unit, 42 … signal processing unit, 43 … communication unit, 44 … sound generation unit, 45 … non-audible sound generation unit, 46 … display unit, 47 … recording unit, 48 … rotation drive unit, 49 … power supply unit, 50 … interface unit

Claims

A voice input/output device comprising:
a sound receiving unit in which a plurality of microphones capable of receiving sounds from audible to non-audible are arranged three-dimensionally;
a sound generation unit that generates audible and/or non-audible sound with one or more speakers;
a signal processing unit that processes and controls signals from the microphones based on an external command;
an interface unit that exchanges information with external devices by wire;
a wireless unit that exchanges information with external devices wirelessly;
a display unit that performs display based on a processing result of the signal processing unit; and
a recording unit that records information on the sound collected by the sound receiving unit.

The voice input/output device according to claim 1, wherein the display unit includes a plurality of individual light emitters and/or image displays.

The voice input/output device according to claim 1 or 2, further comprising a sound arrival information acquisition function that emits the non-audible sound intermittently, receives the sound reflected from the surroundings of the device with the plurality of microphones, and acquires sound arrival information for determining the environment around the device in terms of two-dimensional direction and distance.

The voice input/output device according to any one of claims 1 to 3, further comprising a sound environment identification function for acquiring environmental information by sound.

The voice input/output device according to claim 1, further comprising a speaker identification function for acquiring speaker identification information.

The voice input/output device according to claim 1, further comprising a speaker emotion identification function for acquiring speaker emotion information.

The voice input/output device according to claim 3, further comprising:
a speaker identification function capable of acquiring speaker identification information, which is information for identifying a speaker; and
a speaker emotion identification function capable of acquiring speaker emotion information, which is information for identifying the emotional state of a speaker,
wherein, when the sound information input from the microphones is recorded in the recording unit, at least one of the sound arrival information, the speaker identification information, the speaker emotion information, and external information associated with the sound information is recorded substantially simultaneously.

The voice input/output device according to claim 7, wherein one or more of the light emission interval, the light emission color, and the light emission order of the plurality of light-emitting display units can be changed based on at least one of the sound arrival information, the speaker identification information, the speaker emotion information, and the external information.

The voice input/output device according to claim 1, further comprising a mechanism for rotating the entire device and a vibration mechanism.

The voice input/output device according to claim 1, further comprising an imaging unit.

The voice input/output device according to claim 1, further comprising a personal authentication unit.

The voice input/output device according to claim 1, further comprising a projector unit.

The voice input/output device according to claim 1, further comprising an infrared communication unit.