JP2019137167A

JP2019137167A - Voice output control device, and voice output control program

Info

Publication number: JP2019137167A
Application number: JP2018021071A
Authority: JP
Inventors: 瑞貴川瀬; Mizuki Kawase
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2018-02-08
Filing date: 2018-02-08
Publication date: 2019-08-22
Anticipated expiration: 2038-02-08
Also published as: JP7023131B2

Abstract

To provide a voice output control device capable of controlling the volume of sound output from a sound generator at the point in time when an occupant starts speaking. SOLUTION: A voice output control device 1 includes: an image acquisition unit 11 for acquiring images captured by imaging devices 31, 32, 33, 34 that photograph the faces of occupants 301, 302, 303, 304 in a vehicle 300; a conversation start estimation unit 12 for estimating the start of a conversation by an occupant from the degree of opening of the occupant's mouth, on the basis of the images acquired by the image acquisition unit 11; a target position estimation unit 13 for estimating the position of a conversation target from the face direction or line-of-sight direction of the occupant whose conversation start has been estimated, on the basis of the images acquired by the image acquisition unit 11; and an output control unit 14 for generating control information to control the volume of the sound output from whichever of sound generators 36, 37, 38, 39 is closest to the position of the conversation target estimated by the target position estimation unit 13, when the conversation start estimation unit 12 estimates the start of a conversation by the occupant. SELECTED DRAWING: Figure 1

Description

The present invention relates to an audio output control device and an audio output control program.

A known technique automatically adjusts the volume of an in-vehicle acoustic device so that it does not interfere with conversation between the occupants of a vehicle.
For example, Patent Document 1 discloses a technique that continuously monitors the audio signal output by the acoustic device and the audio signal received from an in-vehicle microphone, determines from a comparison of the two whether an occupant is speaking, and, when it determines that an occupant is speaking, identifies the position of the utterance from an image captured by an in-vehicle camera and controls the individual amplification factors of the audio output unit so that only the loudspeakers near the utterance position are turned down.

Patent Document 1: JP 2012-25270 A

The technique disclosed in Patent Document 1 identifies the utterance position from the in-vehicle camera image and lowers the loudspeaker volume only after the audio signal received from the in-vehicle microphone indicates that an occupant is speaking. The loudspeaker volume has therefore not yet been lowered at the moment the occupant begins to speak, so the first part of the utterance is hard for the other occupant to hear.

The present invention has been made to solve the above problem, and its object is to provide an audio output control device that can control the volume output from a sound generator at the time an occupant starts speaking.

An audio output control device according to the present invention includes: an image acquisition unit that acquires an image captured by an imaging device that photographs the face of an occupant riding in a vehicle; a conversation start estimation unit that, based on the image acquired by the image acquisition unit, estimates the start of a conversation by the occupant from the degree of opening of the occupant's mouth; a target position estimation unit that, based on the image acquired by the image acquisition unit, estimates the position of a conversation target from the face direction or line-of-sight direction of the occupant whose conversation start has been estimated; and an output control unit that, when the conversation start estimation unit estimates the start of a conversation by the occupant, generates control information for controlling the volume of the sound output from the sound generator closest to the position of the conversation target estimated by the target position estimation unit.

According to the present invention, the volume output from the sound generator can be controlled at the time an occupant starts speaking.

FIG. 1 is a block diagram showing the configuration of an acoustic system to which the audio output control device according to Embodiment 1 is applied.
FIG. 2A is a diagram showing an example of the hardware configuration of the audio output control device according to Embodiment 1. FIG. 2B is a diagram showing an example of the hardware configuration of the audio output control device according to Embodiment 1.
FIG. 3 is a diagram showing a configuration example of the interior, viewed from above, of a vehicle equipped with the acoustic system to which the audio output control device according to Embodiment 1 is applied.
FIG. 4 is a flowchart explaining the operation of the audio output control device 1 according to Embodiment 1.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
Embodiment 1.

The audio output control device 1 according to Embodiment 1 is described below, by way of example, as applied to an acoustic system 3 mounted on a vehicle.

FIG. 1 is a block diagram showing the configuration of the acoustic system 3 to which the audio output control device 1 according to Embodiment 1 is applied.
The acoustic system 3 includes imaging devices 31, 32, 33, and 34, a plurality of sound generators 36, 37, 38, and 39, the audio output control device 1, and an acoustic device 2.
The imaging devices 31, 32, 33, and 34 are in-vehicle cameras that photograph the occupants riding in the vehicle. It is sufficient that the imaging devices 31, 32, 33, and 34 can photograph at least the face of each occupant.
The sound generators 36, 37, 38, and 39 are loudspeakers that output sound.

The acoustic device 2 is a so-called audio device and includes a sound source acquisition unit 21, a sound source reproduction unit 22, a volume control unit 23, an audio output unit 24, and an operation unit 25.
The sound source acquisition unit 21 acquires sound source data such as music data, broadcast data, or audio data, or sound source data contained in video data.
The sound source reproduction unit 22 is a so-called audio decoder that reproduces the sound source data acquired by the sound source acquisition unit 21 and generates an audio signal.
The volume control unit 23 is a so-called control amplifier that can control, for the audio signal generated by the sound source reproduction unit 22, the gain of the audio signal output to each of the sound generators 36, 37, 38, and 39. The gain is determined by input from the operation unit 25 or from the audio output control device 1. The operation unit 25 and the audio output control device 1 are described later.
The audio output unit 24 outputs the audio signals amplified by the volume control unit 23 to the sound generators 36, 37, 38, and 39, respectively.
The operation unit 25 is operation input means with which a user, that is, an occupant of the vehicle, performs desired operations such as selecting the sound source data to reproduce, selecting how the sound source data is reproduced, and changing the output volume.

The audio output control device 1 includes an image acquisition unit 11, a conversation start estimation unit 12, a target position estimation unit 13, an output control unit 14, and a control information transmission unit 15.
The image acquisition unit 11 acquires images from each of the imaging devices 31, 32, 33, and 34.
The conversation start estimation unit 12 applies a well-known image analysis technique to the images acquired by the image acquisition unit 11 to determine how far each occupant's mouth is open, and estimates from the determined degree of opening whether the occupant is about to start a conversation. For example, the conversation start estimation unit 12 estimates that an occupant is starting a conversation when a condition is satisfied such as the ratio of the size of the opened mouth to the size of the occupant's mouth exceeding a preset value.
The target position estimation unit 13 applies a well-known image analysis technique to the images acquired by the image acquisition unit 11 to determine the face direction or line-of-sight direction of the occupant whose conversation start has been estimated (hereinafter, the "speaker"), and estimates the position of the conversation target from the determined direction. The position of the conversation target is, for example, the seat located in the speaker's face direction or line-of-sight direction.
When the conversation start estimation unit 12 estimates the start of a conversation by an occupant, the output control unit 14 generates control information for controlling the gain of the audio signal output by the acoustic device 2 so as to lower the volume of the sound output from the sound generator closest to the position of the conversation target estimated by the target position estimation unit 13.
The control information transmission unit 15 transmits the control information generated by the output control unit 14 to the acoustic device 2.
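The threshold condition given as an example for the conversation start estimation unit 12 might be sketched as follows. This is only an illustration under stated assumptions, not the method of the publication: the names `MouthObservation`, `opening_ratio`, and `conversation_starts` are hypothetical, the ratio is assumed to be lip gap divided by mouth width, and the threshold of 0.3 is an arbitrary placeholder for the "preset value".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MouthObservation:
    """Mouth measurements for one occupant in one frame (hypothetical format)."""
    mouth_width_px: float     # distance between the mouth corners
    opening_height_px: float  # gap between the upper and lower lip


def opening_ratio(obs: MouthObservation) -> float:
    """Opened-mouth size relative to mouth size (assumed here: height / width)."""
    if obs.mouth_width_px <= 0:
        return 0.0  # no usable mouth detection in this frame
    return obs.opening_height_px / obs.mouth_width_px


def conversation_starts(obs: MouthObservation, threshold: float = 0.3) -> bool:
    """True when the ratio exceeds the preset value (0.3 is an assumed value)."""
    return opening_ratio(obs) > threshold
```

A practical system would probably smooth the ratio over several frames to avoid reacting to yawns or single noisy detections; the source does not specify this.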

FIG. 2 is a diagram showing an example of the hardware configuration of the audio output control device 1 according to Embodiment 1.
In Embodiment 1, the functions of the image acquisition unit 11, the conversation start estimation unit 12, the target position estimation unit 13, the output control unit 14, and the control information transmission unit 15 are realized by a processing circuit 201. That is, the audio output control device 1 includes the processing circuit 201 for transmitting, through the control information transmission unit 15, control information generated based on the images acquired by the image acquisition unit 11.
The processing circuit 201 may be dedicated hardware as shown in FIG. 2A, or a CPU (Central Processing Unit) 206 that executes a program stored in a memory 205 as shown in FIG. 2B.

When the processing circuit 201 is dedicated hardware, it is, for example, a single circuit, a composite circuit, a programmed processor, a parallel-programmed processor, an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a combination of these.

When the processing circuit 201 is the CPU 206, the functions of the image acquisition unit 11, the conversation start estimation unit 12, the target position estimation unit 13, the output control unit 14, and the control information transmission unit 15 are realized by software, firmware, or a combination of software and firmware. That is, these units are realized by a processing circuit such as the CPU 206, which executes programs stored in an HDD (Hard Disk Drive) 202, the memory 205, or the like, or a system LSI (Large-Scale Integration). The programs stored in the HDD 202, the memory 205, or the like can also be said to cause a computer to execute the procedures of the image acquisition unit 11, the conversation start estimation unit 12, the target position estimation unit 13, the output control unit 14, and the control information transmission unit 15, namely an image acquisition procedure, a conversation start estimation procedure, a target position estimation procedure, an output control procedure, and a control information transmission procedure.
Here, the memory 205 is, for example, a nonvolatile or volatile semiconductor memory such as a RAM (Random Access Memory), a ROM (Read Only Memory), a flash memory, an EPROM (Erasable Programmable Read Only Memory), or an EEPROM (Electrically Erasable Programmable Read-Only Memory), or a magnetic disk, a flexible disk, an optical disc, a compact disc, a mini disc, a DVD (Digital Versatile Disc), or the like.

Some of the functions of the image acquisition unit 11, the conversation start estimation unit 12, the target position estimation unit 13, the output control unit 14, and the control information transmission unit 15 may be realized by dedicated hardware and others by software or firmware. For example, the functions of the image acquisition unit 11, the conversation start estimation unit 12, and the target position estimation unit 13 can be realized by the processing circuit 201 as dedicated hardware, while the functions of the output control unit 14 and the control information transmission unit 15 can be realized by a processing circuit reading and executing a program stored in the memory 205.
The audio output control device 1 also has an input interface device 203 and an output interface device 204 for communicating with the imaging devices 31, 32, 33, and 34 and with the acoustic device 2.
Although the hardware configuration of the audio output control device 1 has been described above as using the HDD 202 as shown in FIG. 2B, an SSD (Solid State Drive) may be used instead of the HDD 202.

FIG. 3 is a diagram showing a configuration example of the interior, viewed from above, of a vehicle 300 equipped with the acoustic system 3 to which the audio output control device 1 according to Embodiment 1 is applied.
As shown in FIG. 3, the imaging devices 31, 32, 33, and 34 are installed in correspondence with seats 311, 312, 313, and 314, in which occupants 301, 302, 303, and 304 respectively sit. The imaging devices 31, 32, 33, and 34 photograph the occupants 301, 302, 303, and 304, respectively, and in particular their faces.
The sound generators 36, 37, 38, and 39 are installed in correspondence with the seats 311, 312, 313, and 314, respectively.
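With a seat layout like the one in FIG. 3, the seat lying in the direction of the speaker's face or line of sight could be picked geometrically. The sketch below is a hypothetical illustration: the seat coordinates, the left/right placement of seats 312, 313, and 314, the angle convention, the 30-degree tolerance, and all function names are assumptions. The description states only that seat 311 is the front right seat and implies that seats 313 and 314 are rear seats.

```python
import math

# Assumed top-view seat coordinates in metres (x: to the right, y: forward).
SEATS = {
    311: (0.5, 1.0),   # front right (stated in the description)
    312: (-0.5, 1.0),  # front left (assumed)
    313: (0.5, 0.0),   # rear right (assumed)
    314: (-0.5, 0.0),  # rear left (assumed)
}


def bearing_deg(src, dst):
    """Bearing from src to dst: 0 = straight ahead, positive = to the right."""
    return math.degrees(math.atan2(dst[0] - src[0], dst[1] - src[1]))


def angular_gap(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def estimate_target_seat(speaker_seat, gaze_deg, max_gap_deg=30.0):
    """Return the seat whose bearing best matches the gaze, or None."""
    src = SEATS[speaker_seat]
    best_seat, best_gap = None, max_gap_deg
    for seat, pos in SEATS.items():
        if seat == speaker_seat:
            continue
        gap = angular_gap(bearing_deg(src, pos), gaze_deg)
        if gap <= best_gap:
            best_seat, best_gap = seat, gap
    return best_seat
```

Under these assumed coordinates, an occupant in seat 314 looking about 40 degrees to the right of straight ahead is matched to seat 311, which mirrors the example used in the description of the operation.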

The operation will now be described.
FIG. 4 is a flowchart explaining the operation of the audio output control device 1 according to Embodiment 1. The operation of the audio output control device 1 according to Embodiment 1 is described below with reference to this flowchart, taking as an example the case shown in FIG. 3, in which the speaker is the occupant 304 and the position of the conversation target is the seat 311.
The audio output control device 1 repeatedly executes the processing shown in this flowchart, so that while the audio output control device 1 is operating, the acoustic device 2 controls the gain of the audio signals output to the sound generators 36, 37, 38, and 39 based on the control information received from the audio output control device 1.

When the audio output control device 1 starts operating, the image acquisition unit 11 acquires images from each of the imaging devices 31, 32, 33, and 34 (step ST1).
The conversation start estimation unit 12 estimates, based on the images acquired by the image acquisition unit 11, whether an occupant is about to start a conversation (step ST2).
When the conversation start estimation unit 12 estimates in step ST2 that an occupant is starting a conversation ("YES" in step ST2), the target position estimation unit 13 estimates the position of the conversation target from the speaker's face direction or line-of-sight direction, based on the images acquired by the image acquisition unit 11 (step ST3). In the example, when the conversation start estimation unit 12 estimates, from the image captured by the imaging device 34 and acquired by the image acquisition unit 11, that the occupant 304 is starting a conversation, the target position estimation unit 13 estimates, from the face direction or line-of-sight direction of the occupant 304 in that image, that the position of the conversation target is the front right seat 311.
Based on the position of the conversation target estimated by the target position estimation unit 13, the output control unit 14 generates control information for controlling the gain of the audio signal output by the acoustic device 2 so as to lower the volume of the sound output from the sound generator closest to that position (step ST4). In the example, the output control unit 14 generates control information for controlling the gain of the audio signal output by the acoustic device 2 so as to lower the volume of the sound generator 36, which is closest to the seat 311 estimated as the position of the conversation target.
The control information transmission unit 15 transmits the control information generated by the output control unit 14 to the acoustic device 2 (step ST5), and the audio output control device 1 ends the operation.
When the conversation start estimation unit 12 estimates in step ST2 that no occupant is starting a conversation ("NO" in step ST2), the audio output control device 1 ends the operation.
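One pass through steps ST1 to ST5 could be wired together roughly as follows. This is a sketch under assumptions rather than the implementation of the publication: the function `run_once`, the callback signatures, the dictionary format of the control information, and the reduced gain of 0.3 are all hypothetical, and the early return when no target seat is found is an added assumption that the source does not discuss.

```python
from typing import Callable, Dict, Optional

Frame = dict  # placeholder for a camera image (hypothetical type)


def run_once(
    acquire: Callable[[], Dict[int, Frame]],                    # ST1
    find_speaker: Callable[[Dict[int, Frame]], Optional[int]],  # ST2
    find_target_seat: Callable[[Frame], Optional[int]],         # ST3
    nearest_generator: Dict[int, int],  # seat number -> sound generator
    send: Callable[[dict], None],                               # ST5
    reduced_gain: float = 0.3,          # assumed value
) -> Optional[dict]:
    """Run the flow once; return the control information sent, or None."""
    frames = acquire()                             # ST1: one frame per camera
    speaker_cam = find_speaker(frames)             # ST2: camera showing the speaker
    if speaker_cam is None:
        return None                                # ST2 "NO": end
    seat = find_target_seat(frames[speaker_cam])   # ST3
    if seat is None or seat not in nearest_generator:
        return None                                # assumed handling, not in the source
    info = {"generator": nearest_generator[seat], "gain": reduced_gain}  # ST4
    send(info)                                     # ST5
    return info
```

Calling `run_once` in a loop corresponds to the repeated execution of the flowchart; the estimators are passed in as callbacks so that any image analysis technique can be substituted.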

As described above, the audio output control device 1 includes: the image acquisition unit 11, which acquires images captured by the imaging devices 31, 32, 33, and 34 that photograph the faces of the occupants 301, 302, 303, and 304 riding in the vehicle 300; the conversation start estimation unit 12, which estimates the start of a conversation by an occupant from the degree of opening of the occupant's mouth, based on the images acquired by the image acquisition unit 11; the target position estimation unit 13, which estimates the position of the conversation target from the face direction or line-of-sight direction of the occupant whose conversation start has been estimated, based on the images acquired by the image acquisition unit 11; and the output control unit 14, which, when the conversation start estimation unit 12 estimates the start of a conversation by an occupant, generates control information for controlling the volume of the sound output from the sound generator closest to the position of the conversation target estimated by the target position estimation unit 13. With this configuration, the volume output from the sound generators 36, 37, 38, and 39 can be controlled at the time an occupant starts speaking.

Embodiment 1 as described so far shows an example in which the position of the conversation target is the seat located in the speaker's face direction or line-of-sight direction, but this is not a limitation. For example, when the speaker is the occupant 301 sitting in the seat 311 and is turning his or her face or line of sight toward the rear-view mirror 305, the conversation target can be estimated to be in the seat 313 or the seat 314. That is, when the speaker whose conversation start was estimated by the conversation start estimation unit 12 is a front-seat occupant and the target position estimation unit 13 determines that the speaker's face direction or line-of-sight direction is toward the rear-view mirror 305, the position of the conversation target may be estimated to be the rear seats. With this configuration, the audio output control device 1 can improve the accuracy of estimating the position of the conversation target.
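The rear-view mirror rule can be expressed as a small special case on top of the ordinary seat estimate. The sketch below is hypothetical: the function name `target_seats`, the string label `"mirror"`, and the treatment of seat 312 as a front seat are assumptions made for illustration.

```python
FRONT_SEATS = {311, 312}  # 312 is assumed to be the other front seat
REAR_SEATS = (313, 314)


def target_seats(speaker_seat, gaze_target):
    """Return the seats estimated as the conversation target.

    gaze_target is either a seat number or the string "mirror"
    (a hypothetical label for the rear-view mirror 305).
    """
    if gaze_target == "mirror":
        # A front-seat speaker looking at the mirror is taken to address the rear seats.
        return REAR_SEATS if speaker_seat in FRONT_SEATS else ()
    if gaze_target == speaker_seat:
        return ()  # a speaker cannot be their own conversation target
    return (gaze_target,)
```

Returning both rear seats reflects the wording of the description, which does not say how to choose between seat 313 and seat 314.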

Embodiment 1 as described so far shows, as in FIG. 3, an example with four imaging devices 31, 32, 33, and 34 corresponding to the seats 311, 312, 313, and 314, respectively, but it is sufficient that the imaging devices can photograph the faces of the occupants in the vehicle 300. That is, the number and installation positions of the imaging devices are not limited to this example; for instance, a single imaging device may be installed at a position from which the faces of a plurality of occupants can be photographed. In that case, image analysis of the image captured by that imaging device can determine, from the position of each face in the image and similar cues, which seat in the vehicle 300 the occupant with that face is sitting in.
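For the single-camera variant, one conceivable way to assign a detected face to a seat is to use where the face appears in the image and how large it is. Everything in the sketch below is an assumption made for illustration: the camera is imagined near the rear-view mirror facing the cabin, so that the vehicle's right side appears on the image's left; front-row faces are assumed to appear larger than rear-row faces; and the function name, the 0.25 size threshold, and the side assigned to each seat are hypothetical.

```python
def seat_from_face(cx_norm, face_height_norm, front_min_height=0.25):
    """Map a detected face to a seat number.

    cx_norm: horizontal face centre in [0, 1] (0 = left edge of the image).
    face_height_norm: face box height divided by image height.
    """
    # Larger faces are assumed to belong to the front row (closer to the camera).
    row = "front" if face_height_norm >= front_min_height else "rear"
    side = "image_left" if cx_norm < 0.5 else "image_right"
    layout = {  # assumed layout for a cabin-facing camera
        ("front", "image_left"): 311,
        ("front", "image_right"): 312,
        ("rear", "image_left"): 313,
        ("rear", "image_right"): 314,
    }
    return layout[(row, side)]
```

A real installation would calibrate these regions for the actual camera position and vehicle.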

Embodiment 1 as described so far shows, as in FIG. 3, an example with four sound generators 36, 37, 38, and 39 corresponding to the seats 311, 312, 313, and 314, respectively, but the number and installation positions of the sound generators are not limited to this example. For instance, the sound generators need not correspond to individual seats: there may be fewer sound generators than seats, such as one each at the front and rear or one each at the left and right of the vehicle 300, or conversely there may be more sound generators than seats.
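When the sound generators do not map one-to-one onto seats, "closest to the position of the conversation target" can be resolved by distance. The sketch below assumes hypothetical two-dimensional coordinates for the target position and for each sound generator; the function name and the example layout are not from the source.

```python
import math


def nearest_generator(target_pos, generators):
    """Return the id of the sound generator closest to target_pos.

    generators: dict mapping a generator id to its (x, y) position.
    """
    if not generators:
        raise ValueError("at least one sound generator is required")
    return min(generators, key=lambda g: math.dist(target_pos, generators[g]))
```

For example, with only a front and a rear loudspeaker, `nearest_generator((0.5, 1.0), {"front": (0.0, 1.5), "rear": (0.0, -0.5)})` selects `"front"`.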

Embodiment 1 as described so far shows an example in which the output control unit 14 generates control information for controlling the gain of the audio signal output by the acoustic device 2 so as to control the volume of the sound output from the sound generator closest to the position of the conversation target, but this is not a limitation. For example, the output control unit 14 may generate control information for controlling the gain of the audio signal output by the acoustic device 2 so that the volume of the sound output from the sound generator closest to the seat in which the speaker sits is also lowered. Furthermore, the control information generated by the output control unit 14 may be any information that enables the acoustic device 2 to control the volume of the sound output from the sound generators 36, 37, 38, and 39, and is not limited to information for controlling the gain of the audio signal.
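The variation in which the sound generator nearest the speaker's own seat is lowered as well might be represented as control information that names more than one sound generator. The mapping format, the function name `build_control_info`, and the reduced gain of 0.3 are assumptions for illustration; the source leaves the content of the control information open.

```python
def build_control_info(target_generator, speaker_side_generator=None, reduced_gain=0.3):
    """Return a mapping of sound generator id -> gain to apply.

    speaker_side_generator is the generator nearest the speaker's seat;
    pass None to lower only the generator nearest the conversation target.
    """
    info = {target_generator: reduced_gain}
    if speaker_side_generator is not None:
        info[speaker_side_generator] = reduced_gain
    return info
```

In the FIG. 3 example, `build_control_info(36, 39)` would lower both the sound generator 36 near the seat 311 and the sound generator 39 near the seat 314.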

In Embodiment 1 as described so far, the conversation start estimation unit 12 may also estimate the end of the conversation, and when the conversation start estimation unit 12 estimates the end of the conversation, the output control unit 14 may generate control information for controlling the gain of the audio signal output by the acoustic device 2 so as to restore the volume of the sound generator that had been turned down. With this configuration, the volume of the sound output from the sound generators 36, 37, 38, and 39 can be lowered only while a conversation is taking place.
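Lowering the volume at the start of a conversation and restoring it at the end implies remembering the original gain. A minimal sketch, assuming a hypothetical `DuckingController` class, a per-generator gain table, and a reduced gain of 0.3, none of which come from the source:

```python
class DuckingController:
    """Tracks per-generator gains and restores them when a conversation ends."""

    def __init__(self, gains):
        self.gains = dict(gains)  # current gain per sound generator
        self._saved = {}          # original gains of generators turned down

    def on_conversation_start(self, generator, reduced_gain=0.3):
        # Remember the original gain only once, so repeated calls do not lose it.
        if generator not in self._saved:
            self._saved[generator] = self.gains[generator]
        # Never raise the volume as a side effect of lowering it.
        self.gains[generator] = min(self.gains[generator], reduced_gain)

    def on_conversation_end(self):
        # Put every lowered generator back to its original gain.
        self.gains.update(self._saved)
        self._saved.clear()
```

Keeping the saved gains inside the controller means the restore step does not depend on the user's volume setting being known elsewhere.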

Embodiment 1 as described so far shows the audio output control device 1 as a device that functions inside the vehicle 300, but some or all of its functions need not operate inside the vehicle 300. For example, the audio output control device 1 may be installed outside the vehicle 300, with the image acquisition unit 11 acquiring the images captured by the imaging devices 31, 32, 33, and 34 via a network such as the Internet, and the control information transmission unit 15 transmitting the generated control information to the acoustic device 2 via a network such as the Internet. As another example, some of the functions of the audio output control device 1 may operate inside the vehicle 300 and the rest outside the vehicle 300.

Within the scope of the invention, the embodiments may be freely combined, any component of any embodiment may be modified, and any component of any embodiment may be omitted.

The audio output control device according to the present invention can be applied to audio output equipment, including acoustic systems.

Description of reference signs: 1 audio output control device; 2 acoustic device; 3 acoustic system; 11 image acquisition unit; 12 conversation start estimation unit; 13 target position estimation unit; 14 output control unit; 15 control information transmission unit; 21 sound source acquisition unit; 22 sound source reproduction unit; 23 volume control unit; 24 audio output unit; 25 operation unit; 31, 32, 33, 34 imaging devices; 36, 37, 38, 39 sound generators; 201 processing circuit; 202 HDD; 203 input interface device; 204 output interface device; 205 memory; 206 CPU; 300 vehicle; 301, 302, 303, 304 occupants; 305 rear-view mirror; 311, 312, 313, 314 seats.

Claims

An audio output control device comprising:
an image acquisition unit that acquires an image captured by an imaging device that photographs the face of an occupant riding in a vehicle;
a conversation start estimation unit that, based on the image acquired by the image acquisition unit, estimates the start of a conversation by the occupant from the degree of opening of the occupant's mouth;
a target position estimation unit that, based on the image acquired by the image acquisition unit, estimates the position of a conversation target from the face direction or line-of-sight direction of the occupant whose conversation start has been estimated; and
an output control unit that, when the conversation start estimation unit estimates the start of a conversation by the occupant, generates control information for controlling the volume of the sound output from the sound generator closest to the position of the conversation target estimated by the target position estimation unit.

The audio output control device according to claim 1, wherein the target position estimation unit estimates the position of the conversation target to be a rear seat when the face direction or line-of-sight direction of a front-seat occupant whose conversation start has been estimated is toward the rear-view mirror.

The audio output control device according to claim 1, wherein, after estimating the start of the conversation by the occupant, the conversation start estimation unit estimates, based on the image acquired by the image acquisition unit, the end of the conversation by the occupant from the degree of opening of the mouth of the occupant whose conversation start was estimated, and
when the conversation start estimation unit estimates the end of the conversation by the occupant, the output control unit generates control information for restoring the volume of the sound output from the sound generator closest to the position of the conversation target estimated by the target position estimation unit.

An audio output control program that causes a computer to execute:
an image acquisition procedure of acquiring an image captured by an imaging device that photographs the face of an occupant riding in a vehicle;
a conversation start estimation procedure of estimating, based on the image acquired in the image acquisition procedure, the start of a conversation by the occupant from the degree of opening of the occupant's mouth;
a target position estimation procedure of estimating, based on the image acquired in the image acquisition procedure, the position of a conversation target from the face direction or line-of-sight direction of the occupant whose conversation start has been estimated; and
an output control procedure of generating, when the start of a conversation by the occupant is estimated in the conversation start estimation procedure, control information for controlling the volume of the sound output from the sound generator closest to the position of the conversation target estimated in the target position estimation procedure.