JP2016139972A

JP2016139972A - Information processing unit, voice output method, and computer program

Info

Publication number: JP2016139972A
Application number: JP2015014460A
Authority: JP
Inventors: 彬貴水嶋; Akitaka Mizushima
Original assignee: NTT Communications Corp
Current assignee: NTT Communications Corp
Priority date: 2015-01-28
Filing date: 2015-01-28
Publication date: 2016-08-04
Anticipated expiration: 2035-01-28
Also published as: JP6456163B2

Abstract

PROBLEM TO BE SOLVED: To make a user's judgment easier by means of audio output. SOLUTION: The information processing unit includes: a main control section for obtaining an audio output parameter in accordance with the display position of an image associated with the audio to be output; and a voice control section for causing an audio output apparatus to output the audio in accordance with the output parameter. The output parameter is a parameter for changing the position of the output source of the audio as perceived by a user listening to the audio. SELECTED DRAWING: Figure 1

Description

The present invention relates to a technique for outputting audio.

Conventionally, techniques for outputting audio have been used in various fields. For example, a technique for outputting audio in accordance with image output has been proposed (see, for example, Patent Document 1). More specifically, there are voice chat techniques that enable a conversation, accompanied by images, with other users at remote locations.

Patent Document 1: Japanese Patent No. 5478357

However, voice chat has a problem: when the user tries to talk with several other users at the same time, their voices are mixed together and become difficult to make out. Thus, when audio is simply output as it is, a judgment that the user should make on hearing it (for example, judging which of several people displayed on the screen is speaking) may be delayed or may become difficult.

In view of the above circumstances, an object of the present invention is to provide a technique that makes a user's judgment easier by means of audio output.

One aspect of the present invention is an information processing apparatus including: a main control unit that acquires an audio output parameter according to the display position of an image related to the audio to be output; and an audio control unit that causes an audio output device to output the audio to be output according to the output parameter, wherein the output parameter is a parameter for changing the position of the output source of the audio as perceived by a user listening to the audio.

One aspect of the present invention is the above information processing apparatus, wherein the audio is audio representing the utterance of a speaker other than the user, and the image is an image showing the speaker.

One aspect of the present invention is the above information processing apparatus, wherein the audio is audio uttered by a speaker who converses with the user via a network.

One aspect of the present invention is the above information processing apparatus, wherein the main control unit acquires a different output parameter for each of a plurality of sounds, and the audio control unit causes the plurality of sounds to be output simultaneously.

One aspect of the present invention is the above information processing apparatus, wherein the main control unit acquires an audio output parameter according to the display position of an image of a monitoring target in which an abnormality has occurred.

One aspect of the present invention is the above information processing apparatus, wherein the main control unit acquires an audio output parameter according to the display position of an image to which the user's attention should be drawn.

One aspect of the present invention is an audio output method including: an acquisition step of acquiring an audio output parameter according to the display position of an image related to the audio to be output; and an output audio control step of causing an audio output device to output the audio to be output according to the output parameter, wherein the output parameter is a parameter for changing the position of the output source of the audio as perceived by a user listening to the audio.

One aspect of the present invention is a computer program for causing a computer to function as an information processing apparatus including: a main control unit that acquires an audio output parameter according to the display position of an image related to the audio to be output; and an audio control unit that causes an audio output device to output the audio to be output according to the output parameter, wherein the output parameter is a parameter for changing the position of the output source of the audio as perceived by a user listening to the audio.

According to the present invention, a user's judgment can be made easier by means of audio output.

FIG. 1 is a system configuration diagram showing the system configuration of the audio output system 1.
FIG. 2 is a flowchart showing a first specific example of the flow of audio data output processing by the main control unit 301.
FIG. 3 is a flowchart showing a second specific example of the flow of audio data output processing by the main control unit 301.
FIG. 4 is a diagram showing a specific example of the screen of the image output device 20 in the television conversation system.
FIG. 5 is a diagram showing a specific example of the output position information table.
FIG. 6 is a diagram showing a specific example of the output parameter table.
FIG. 7 is a diagram showing a specific example of the screen of the image output device 20 in the monitoring system.
FIG. 8 is a diagram showing a specific example of the output parameter table.
FIG. 9 is a diagram showing a specific example of the screen of the image output device 20 in the voice monitoring system.
FIG. 10 is a diagram showing a specific example of the output parameter table.
FIG. 11 is a diagram showing a specific example of a WEB site displayed in the WEB system.

FIG. 1 is a system configuration diagram showing the system configuration of the audio output system 1. The audio output system 1 includes an audio output device 10, an image output device 20, and a control device 30.
The audio output device 10 is a device that outputs audio, such as a speaker.
The image output device 20 is a device that outputs images, such as a CRT (Cathode Ray Tube) display, a liquid crystal display, or an organic EL (Electro Luminescence) display.

The control device 30 acquires audio data representing the audio to be output and image data representing the image to be output. The control device 30 causes the audio output device 10 to output the acquired audio data, and causes the image output device 20 to output the acquired image data.
When causing the audio output device 10 to output the audio data, the control device 30 gives the audio directivity according to a predetermined condition. Here, directivity of audio means that the position of the audio output source as perceived by a user listening to the audio (hereinafter referred to as the "virtual output source") can be changed according to a predetermined condition. Directivity can be produced, for example, by having a plurality of audio output devices located apart from each other output the same audio at different timings (that is, with a delay between them). For example, directivity can be produced by having a plurality of speakers connected to (or built into) one personal computer output the same audio at different timings. The control device 30 changes the directivity of the audio according to a predetermined condition. In other words, the control device 30 changes the position of the virtual output source according to a predetermined condition.
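The delay-based approach described above can be illustrated with a minimal Python sketch. It assumes two speakers (left and right), a monaural signal held as a list of samples, and a delay expressed in whole samples; the function name `apply_interchannel_delay` is hypothetical and does not appear in the specification.

```python
def apply_interchannel_delay(mono, delay_samples):
    """Build (left, right) channels from one mono signal.

    A positive delay_samples delays the right channel, so the sound tends
    to be perceived toward the left; a negative value delays the left one.
    """
    pad = [0.0] * abs(delay_samples)
    delayed = pad + list(mono)    # late copy: silence first, then the signal
    on_time = list(mono) + pad    # early copy, padded to the same length
    if delay_samples >= 0:
        return on_time, delayed
    return delayed, on_time


left, right = apply_interchannel_delay([1.0, 0.5, 0.25], 2)
print(left)   # [1.0, 0.5, 0.25, 0.0, 0.0]
print(right)  # [0.0, 0.0, 1.0, 0.5, 0.25]
```

In this sketch the output parameter reduces to a single signed delay. A real system would more likely work on buffered audio frames and might combine timing with level differences, which the specification does not detail.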

The control device 30 will now be described in detail.
The control device 30 includes a CPU (Central Processing Unit), a memory, an auxiliary storage device, and the like connected by a bus, and executes a control program. By executing the control program, the control device 30 functions as a device including a main control unit 301, an audio control unit 302, and a display control unit 303. All or some of the functions of the control device 30 may be implemented using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array).

The control program may be recorded on a computer-readable recording medium. Examples of a computer-readable recording medium include portable media such as a flexible disk, a magneto-optical disk, a ROM, a CD-ROM, and a semiconductor storage device (for example, an SSD: Solid State Drive), as well as storage devices such as a hard disk or semiconductor storage device built into a computer system. The control program may also be transmitted via a telecommunication line.

The main control unit 301 acquires audio data and image data. The main control unit 301 also acquires attribute information for the acquired audio data. Attribute information is information indicating an attribute of the audio. Examples of attribute information include a user ID indicating the speaker of the audio, a device ID indicating the device with which the audio is associated, an item ID indicating the item in a WEB page with which the audio is associated, and a relative position within a window with which the audio is associated.

The main control unit 301 acquires output position information based on the attribute information. Output position information is information related to the virtual output source. Examples of output position information include information indicating two-dimensional coordinates within the screen of the image output device 20 and information indicating three-dimensional coordinates within a three-dimensional image displayed on the image output device 20.

The main control unit 301 acquires an output parameter based on the output position information. The output parameter is information for controlling the audio output device 10 so that the user hears the audio as coming from the virtual output source corresponding to the output position information. Output parameters and output position information may be stored in advance by the main control unit 301 in one-to-one correspondence.

The audio control unit 302 causes the audio output device 10 to output the audio data acquired by the main control unit 301, using the output parameter acquired by the main control unit 301.
The display control unit 303 causes the image output device 20 to output the image data acquired by the main control unit 301.

FIG. 2 is a flowchart showing a first specific example of the flow of audio data output processing by the main control unit 301. The main control unit 301 acquires the attribute information of the audio data to be output (step S101). Next, the main control unit 301 acquires an output parameter based on the attribute information acquired in step S101 (step S102). The main control unit 301 then instructs the audio control unit 302 to output the audio data using the output parameter acquired in step S102 (step S103).

FIG. 3 is a flowchart showing a second specific example of the flow of audio data output processing by the main control unit 301. The main control unit 301 acquires the attribute information of the audio data to be output (step S111). Next, the main control unit 301 acquires output position information based on the attribute information acquired in step S111 (step S112). Next, the main control unit 301 acquires an output parameter based on the output position information acquired in step S112 (step S113). The main control unit 301 then instructs the audio control unit 302 to output the audio data using the output parameter acquired in step S113 (step S114).

In the first specific example, attribute information and output parameters are stored in advance in the main control unit 301 in association with each other. The first specific example is the processing used when the directivity (output source) of the audio for each attribute is fixed.

In the second specific example, attribute information and output position information are stored in the main control unit 301 in association with each other, and output position information and output parameters are likewise stored in association with each other. In the second specific example, the relationship between attribute information and output parameters can be changed easily by changing the association between attribute information and output position information. The second specific example is the processing used when the directivity (output source) of the audio for each attribute changes.
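The two processing flows can be contrasted in a short Python sketch. It assumes the stored associations are plain dictionaries; the function names, keys, and parameter labels are placeholders chosen for illustration, not taken from the specification.

```python
def first_flow(attribute, attribute_to_parameter):
    """Steps S101 to S103: look the output parameter up directly."""
    return attribute_to_parameter[attribute]                  # S102


def second_flow(attribute, attribute_to_position, position_to_parameter):
    """Steps S111 to S114: go through the output position first."""
    position = attribute_to_position[attribute]               # S112
    return position_to_parameter[position]                    # S113


# Placeholder tables for illustration only.
attribute_to_position = {"A1": "upper left", "A2": "upper right"}
position_to_parameter = {"upper left": "Pa1", "upper right": "Pa2"}

before = second_flow("A1", attribute_to_position, position_to_parameter)

# Moving A1's image only requires changing the attribute-to-position table.
attribute_to_position["A1"] = "upper right"
after = second_flow("A1", attribute_to_position, position_to_parameter)
print(before, after)  # Pa1 Pa2
```

The extra lookup in the second flow is what lets the position-to-parameter table stay untouched when the on-screen layout changes.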
Next, specific application examples of the audio output system 1 will be described.

[First application example: TV conversation system]
The audio output system 1 as applied to a television conversation system will be described. The audio data that is output is speech uttered by an interlocutor who converses, via a network, with the user listening to the audio output by the television conversation system. The image data that is output is an image showing the interlocutor.

FIG. 4 is a diagram showing a specific example of the screen of the image output device 20 in the television conversation system. In the example of FIG. 4, the user is conversing with four interlocutors. Images 401 to 404, each showing one of the interlocutors, are displayed on the screen. In the example of FIG. 4, video obtained by capturing each interlocutor's face with a camera is displayed as the images 401 to 404. What is displayed as the images 401 to 404 is not limited to video of each interlocutor's face. For example, it may be a still image of each interlocutor's face, or text identifying each interlocutor.

FIG. 5 is a diagram showing a specific example of the output position information table. The output position information table shown in FIG. 5 is stored by the main control unit 301. The output position information table has a plurality of records, each associating a user ID with output position information. The user ID is identification information for an interlocutor. In the television conversation system, the audio data representing each interlocutor's utterances has that interlocutor's user ID as its attribute information. The output position information indicates the position on the screen of the image output device 20 at which the image of the interlocutor indicated by the user ID is displayed.

For example, the top record in FIG. 5 indicates that the video of the face of the interlocutor whose user ID is "A1" is displayed at the upper left of the screen (the position of the image 401 in FIG. 4). The second record from the top indicates that the video of the face of the interlocutor whose user ID is "A2" is displayed at the upper right of the screen (the position of the image 402 in FIG. 4). The third record from the top indicates that the video of the face of the interlocutor whose user ID is "A3" is displayed at the lower left of the screen (the position of the image 403 in FIG. 4). The fourth record from the top indicates that the video of the face of the interlocutor whose user ID is "A4" is displayed at the lower right of the screen (the position of the image 404 in FIG. 4).
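If the output position information were instead held as two-dimensional screen coordinates, which the description also allows, the four positions used in FIG. 5 could be derived from where each image is drawn. The following Python sketch assumes pixel coordinates with the origin at the top-left corner and the y axis pointing downward; the function name `screen_position` and the screen size in the example are hypothetical.

```python
def screen_position(x, y, width, height):
    """Map the centre of an image, in pixels, to one of four screen positions.

    Assumes the origin is the top-left corner and y grows downward.
    """
    vertical = "upper" if y < height / 2 else "lower"
    horizontal = "left" if x < width / 2 else "right"
    return f"{vertical} {horizontal}"


# Centres of four equal tiles on a hypothetical 1920x1080 screen.
print(screen_position(480, 270, 1920, 1080))   # upper left
print(screen_position(1440, 810, 1920, 1080))  # lower right
```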

The main control unit 301 acquires the audio data representing each interlocutor's utterances by receiving it via the network. The audio data is received together with identification information indicating the interlocutor. The identification information received together with the audio data may be, for example, the IP address assigned to the device the interlocutor is operating, or the MAC address of that device. The main control unit 301 acquires the interlocutor's user ID (attribute information) based on the acquired identification information. The main control unit 301 then acquires the output position information corresponding to the interlocutor's user ID by referring to the output position information table.

FIG. 6 is a diagram showing a specific example of the output parameter table. The output parameter table shown in FIG. 6 is stored by the main control unit 301. The output parameter table has a plurality of records, each associating output position information with an output parameter. The output parameter is a parameter for controlling the position of the audio output source as perceived by the user listening to the audio. That is, the output parameter is a parameter for giving the audio a directivity such that the position indicated by the output position information becomes the output source of the audio. The output parameter may be defined, for example, as the output timings of audio output devices 10 arranged on the left and right.

For example, the top record in FIG. 6 indicates that the output parameter is Pa1 when the output position information of the audio data to be output is upper left. The second record from the top indicates that the output parameter is Pa2 when the output position information is upper right. The third record from the top indicates that the output parameter is Pa3 when the output position information is lower left. The fourth record from the top indicates that the output parameter is Pa4 when the output position information is lower right.

The main control unit 301 acquires the output parameter corresponding to the output position information of the audio data to be output by referring to the output parameter table.
In the television conversation system configured in this way, audio data is output with a directivity corresponding to the position at which the interlocutor's image is displayed. For example, on the screen of FIG. 4, the audio representing the utterances of the interlocutor in the image 401 is output as if its output source were toward the upper left (for example, at the position of the image 401 on the screen). Likewise, the audio representing the utterances of the interlocutor in the image 402 is output as if its output source were toward the upper right (for example, at the position of the image 402 on the screen). Outputting the audio in this way provides the following effects.

In general, when the voices of several people speaking at the same time are output from a single output source (that is, output as monaural audio), it is difficult for a listener to tell them apart. For this reason, particularly when the audio data is generated as monaural audio, conversations in television conversation systems are often held one-to-one. On the other hand, as the cocktail party effect describes, even when several people speak at the same time, a listener can pick out the voice he or she wants to hear if each voice comes from a different output source. In the television conversation system described above, the voices of the interlocutors are output so that each is heard from a different output source. Therefore, even if monaural audio is used, the user can tell apart the voices of several people speaking at the same time.

In addition, even if the user cannot recognize whose voice it is from the voice alone, the audio is output from an output source corresponding to the position of the image displayed on the screen, so the user can judge which interlocutor is speaking based on the output source.
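The simultaneous output of several voices, each with its own output parameter, can be sketched in Python as follows. The sketch assumes two speakers, monaural voices held as lists of samples, and an output parameter reduced to a signed delay in samples (positive delays the right channel); the function names `place` and `mix` are hypothetical.

```python
def place(mono, delay_samples):
    """Return (left, right) with the right channel delayed by delay_samples,
    or the left channel delayed if the value is negative."""
    pad = [0.0] * abs(delay_samples)
    late, early = pad + list(mono), list(mono) + pad
    return (early, late) if delay_samples >= 0 else (late, early)


def mix(voices):
    """Sum several (mono, delay_samples) voices into one stereo pair."""
    placed = [place(mono, delay) for mono, delay in voices]
    length = max((len(left) for left, _ in placed), default=0)
    left_mix = [0.0] * length
    right_mix = [0.0] * length
    for left, right in placed:
        for i, sample in enumerate(left):
            left_mix[i] += sample
        for i, sample in enumerate(right):
            right_mix[i] += sample
    return left_mix, right_mix


# Two monaural voices, one placed toward the left and one toward the right.
left_mix, right_mix = mix([([1.0, 1.0], 1), ([0.5, 0.5], -1)])
print(left_mix)   # [1.0, 1.5, 0.5]
print(right_mix)  # [0.5, 1.5, 1.0]
```

Each voice keeps its own delay inside the shared stereo output, which is what gives the listener a separate apparent source per interlocutor even though every input is monaural.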

<Modification>
The identification information received together with the audio data may be a user ID indicating the interlocutor.
The number of interlocutors is not limited to four. For example, there may be three, or five or more.
The system may also be applied as a conversation system that does not use the image output device 20 and involves no image output. Even with such a configuration, the effect that the user can tell apart the voices of several people speaking at the same time is still obtained.

[Second application example: Monitoring system]
The audio output system 1 as applied to a monitoring system will be described. The audio data that is output is an alarm sound indicating that an abnormality has occurred. There may be one type of audio data or several types. The image data that is output is an image showing a monitoring target. A monitoring target may be a device or a facility. The monitoring system acquires a state value from a sensor provided on the monitoring target and determines, based on the state value, whether an abnormality has occurred. When an abnormality occurs, audio corresponding to the monitoring target in which the abnormality occurred is output.

FIG. 7 is a diagram showing a specific example of the screen of the image output device 20 in the monitoring system. In the example of FIG. 7, eight devices are monitored, and each device is connected to one or more other devices by communication lines. Images 501 to 508, each showing one of the monitoring targets, are displayed on the screen. For each of the images 501 to 508, the image displayed when the target is judged to be normal may differ from the image displayed when an abnormality is judged to have occurred.

FIG. 8 is a diagram showing a specific example of the output parameter table. The output parameter table shown in FIG. 8 is stored by the main control unit 301. The output parameter table has a plurality of records, each associating a monitoring target ID with an output parameter. The monitoring target ID is identification information for a monitoring target. In the monitoring system, the audio data output when an abnormality occurs has the monitoring target ID of the corresponding monitoring target as its attribute information. Note that when the same sound is output for abnormalities in several monitoring targets, a plurality of pieces of attribute information may be assigned to a single piece of audio data.

In the monitoring system of the present embodiment, the position on the screen at which the image of each monitoring target is displayed is fixed. Consequently, the position of the output source of the audio output when an abnormality occurs in a given monitoring target does not change. The monitoring system of the present embodiment can therefore be implemented with the processing of the first specific example, without using output position information.
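A minimal Python sketch of this arrangement follows, using the first processing flow (monitoring target ID directly to output parameter). The specification does not say how an abnormality is judged from a state value, so the simple upper-limit comparison, the thresholds, the target IDs, and the parameter labels below are all assumptions made for illustration.

```python
# Output parameter table keyed by monitoring target ID (in the manner of
# FIG. 8). The IDs and parameter labels are placeholders.
OUTPUT_PARAMETER_TABLE = {"M1": "Pb1", "M2": "Pb2", "M3": "Pb3"}

# Hypothetical upper limits for each target's state value.
THRESHOLDS = {"M1": 80.0, "M2": 80.0, "M3": 60.0}


def alarms(state_values):
    """Return (target ID, output parameter) for every target judged abnormal."""
    return [
        (target_id, OUTPUT_PARAMETER_TABLE[target_id])
        for target_id, value in state_values.items()
        if value > THRESHOLDS[target_id]
    ]


print(alarms({"M1": 75.0, "M2": 91.5, "M3": 60.0}))  # [('M2', 'Pb2')]
```

The alarm sound itself would then be output with the returned parameter, so that it appears to come from the on-screen position of the affected target.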

For example, when there are so many monitoring targets that it is difficult to find the image that has changed, or when the screen of the image output device 20 is so large that it is difficult to see all of it at once, the user cannot easily judge in which monitoring target an abnormality has occurred, even if audio indicating the abnormality is output. To address this problem, in the monitoring system of the present embodiment, the audio indicating the abnormality is output from an output source corresponding to the position of the monitoring target's image on the screen. The user can therefore judge more easily and more quickly in which monitoring target the abnormality has occurred.

<Modification>
The system may be configured so that the position on the screen at which each monitoring target's image is displayed changes. In this case, a table associating the identification information of each monitoring target with output position information is required. Such a table may be acquired from another information processing apparatus, may be set by the user, or may be acquired by some other method.

[Third application example: Voice monitoring system]
The audio output system 1 as applied to a voice monitoring system will be described. The output voice data is the voice uttered by a speaker who is being monitored. The output image data is an image representing the speaker.
FIG. 9 is a diagram showing a specific example of the screen of the image output device 20 in the voice monitoring system. In the example of FIG. 9, 16 speakers are being monitored. Images 601 to 616, each representing one of the speakers, are displayed on the screen. In the example of FIG. 9, the images 601 to 616 are moving images obtained by capturing each speaker's face with a camera. The images 601 to 616 are not limited to moving images of the speakers' faces. For example, each may be a still image of the speaker's face or text identifying the speaker.

FIG. 10 is a diagram showing a specific example of the output parameter table. The output parameter table shown in FIG. 10 is stored by the main control unit 301. The output parameter table has a plurality of records, each associating an operator ID with an output parameter. The operator ID is identification information of an operator. In the voice monitoring system, the operator ID of each operator serves as the attribute information of the output voice data.
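The lookup described for the output parameter table can be sketched as follows. This is only an illustration under stated assumptions: the contents of FIG. 10 are not reproduced in the text, so the operator IDs, the `OutputParameter` fields (an azimuth and an elevation in degrees), and the function name `get_output_parameter` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputParameter:
    """Hypothetical output parameter: where the listener should perceive the voice."""
    azimuth_deg: float    # negative = left of the listener, positive = right
    elevation_deg: float  # negative = below ear level, positive = above


# Hypothetical records; the actual table of FIG. 10 is not reproduced here.
OUTPUT_PARAMETER_TABLE = {
    "OP001": OutputParameter(azimuth_deg=-45.0, elevation_deg=15.0),
    "OP002": OutputParameter(azimuth_deg=-15.0, elevation_deg=15.0),
    "OP003": OutputParameter(azimuth_deg=15.0, elevation_deg=15.0),
    "OP004": OutputParameter(azimuth_deg=45.0, elevation_deg=15.0),
}


def get_output_parameter(operator_id: str) -> OutputParameter:
    """Look up the output parameter for the operator ID attached to voice data."""
    try:
        return OUTPUT_PARAMETER_TABLE[operator_id]
    except KeyError:
        raise KeyError(
            f"no output parameter registered for operator {operator_id!r}"
        ) from None
```

Because the table is keyed by operator ID alone, this sketch matches the fixed-layout case, in which no output position information is needed at lookup time.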

In the voice monitoring system of the present embodiment, the position within the screen at which the image of each operator is displayed is fixed by definition. Consequently, the position of the output source of each operator's voice does not change. The voice monitoring system of the present embodiment can therefore be implemented with the processing of the first specific example, without using output position information.

The main control unit 301 simultaneously outputs the voices of a predetermined number of speakers, giving each voice a different directivity. The main control unit 301 changes the speakers whose voices are output at a predetermined timing (for example, 10 seconds after output of the voices begins). For example, at the first timing, the main control unit 301 simultaneously outputs the voice data of the utterances of the first combination of speakers (for example, the speakers of the images 601 to 604). When a predetermined time (for example, 10 seconds) has elapsed from the start of output of that voice data, the main control unit 301 simultaneously outputs the voice data of the utterances of the next combination of speakers (for example, the speakers of the images 605 to 608). When the predetermined time has again elapsed from the start of output, the main control unit 301 simultaneously outputs the voice data of the utterances of the next combination of speakers (for example, the speakers of the images 609 to 612). By having the main control unit 301 repeat this processing, the utterances of all the speakers can be monitored.
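The rotation through combinations of speakers can be sketched as a simple schedule. The sketch below assumes groups of four consecutive images and a 10-second interval, as in the examples above; the fourth group (images 613 to 616) and the wrap-around to the first group are inferred from the statement that the processing repeats, and the function names are hypothetical.

```python
from itertools import cycle, islice


def speaker_groups(image_ids, group_size):
    """Split the monitored speakers into consecutive groups of `group_size`."""
    return [image_ids[i:i + group_size] for i in range(0, len(image_ids), group_size)]


def rotation_schedule(image_ids, group_size, interval_s, slots):
    """Return (start_time_s, group) pairs for the first `slots` time slots.

    The groups repeat in order, so every speaker is eventually heard.
    """
    groups = cycle(speaker_groups(image_ids, group_size))
    return [(slot * interval_s, group) for slot, group in enumerate(islice(groups, slots))]


images = list(range(601, 617))  # images 601 to 616 from FIG. 9
schedule = rotation_schedule(images, group_size=4, interval_s=10, slots=5)
for start, group in schedule:
    print(f"t={start:>2}s: output voices for images {group[0]}-{group[-1]}")
```

With 16 speakers in groups of four, one full pass takes 40 seconds under these assumptions, compared with 160 seconds if each speaker were monitored alone for 10 seconds.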

Note that the main control unit 301 may display the image of a speaker whose voice data is being output in a manner different from the images of speakers whose voice data is not being output. For example, the main control unit 301 may display the image of a speaker whose voice data is being output with a thick frame around it.

For example, when there are a very large number of operators, monitoring their voices one at a time takes a long time to cover everyone. To address this problem, the voice monitoring system of the present embodiment outputs the voices of a plurality of speakers simultaneously. Each voice is output so that it is heard as coming from a different output source, according to the position of the corresponding image. Thanks to the cocktail party effect described above, the user can therefore monitor while distinguishing a plurality of voices heard at the same time. The time required to finish monitoring everyone can thus be shortened.

<Modification>
The system may be configured so that the position within the screen at which the image of each operator is displayed changes. In this case, a table associating the identification information of each operator with output position information is required. Such a table may be acquired from another information processing apparatus, set by the user, or acquired by some other method.

[Fourth application example: WEB system]
The audio output system 1 as applied to a WEB system will be described. The output audio data is audio data defined in advance on a WEB site. The output image data is an image defined in advance on the WEB site.

The main control unit 301 acquires, as the attribute information, the position within the screen of the image associated with the sound to be output. The main control unit 301 then acquires an output parameter according to the acquired position. The relationship between the position within the screen and the value of the output parameter may be defined in advance by a mathematical expression. In this case, the main control unit 301 stores in advance the expression representing that relationship.
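One possible form of such an expression is a linear mapping from pixel coordinates to a normalized pan and tilt. The text does not specify the expression, so the sketch below is purely an assumption: the function name, the (pan, tilt) representation, and the value ranges are hypothetical.

```python
def position_to_output_parameter(x, y, screen_width, screen_height):
    """Map a pixel position (origin at the top-left of the screen) to (pan, tilt).

    pan:  -1.0 at the left edge, +1.0 at the right edge.
    tilt: +1.0 at the top edge, -1.0 at the bottom edge.
    """
    if screen_width <= 0 or screen_height <= 0:
        raise ValueError("screen dimensions must be positive")
    pan = 2.0 * x / screen_width - 1.0
    tilt = 1.0 - 2.0 * y / screen_height
    return pan, tilt
```

Under this mapping, an image at the center of a 1920 x 1080 screen yields (0.0, 0.0), so its sound would be perceived straight ahead. A real system would likely convert these normalized values into whatever its audio renderer expects.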

FIG. 11 is a diagram showing a specific example of a WEB site displayed in the WEB system. This WEB site displays an input frame 701 for entering a name, an input frame 702 for entering an address, and an input frame 703 for entering an occupation. When an entry contains an error, the main control unit 301 acquires position information indicating the position on the screen of the input frame whose entry is in error. The main control unit 301 acquires the position information of the input frame as follows, for example. First, the main control unit 301 acquires the position of the input frame within the window 70 that displays the WEB site (for example, the position of the upper left corner of the input frame, with the upper left corner of the window 70 as the origin). Next, the main control unit 301 acquires the position of the window 70 within the screen of the image output device 20 (for example, the screen coordinates of the upper left corner of the window 70). The main control unit 301 then obtains the position of the input frame on the screen from these two pieces of information.
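Combining the two pieces of position information amounts to adding two offsets. The sketch below assumes that both positions use the same units and axis directions and that no scaling or scrolling is involved; the `Point` type, the function name, and the coordinate values are hypothetical.

```python
from typing import NamedTuple


class Point(NamedTuple):
    x: int
    y: int


def frame_position_on_screen(frame_in_window: Point, window_on_screen: Point) -> Point:
    """Add the window's screen offset to the frame's offset inside the window."""
    return Point(frame_in_window.x + window_on_screen.x,
                 frame_in_window.y + window_on_screen.y)


# Hypothetical values: input frame 702 sits at (40, 120) inside window 70,
# and the upper-left corner of window 70 is at (300, 200) on the screen.
print(frame_position_on_screen(Point(40, 120), Point(300, 200)))
```

Because the window position is read at the time of the error, the result follows the window if the user has moved it, which is consistent with the sound coming from wherever the input frame is displayed at that moment.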

The main control unit 301 acquires the value of the output parameter based on the acquired position information (attribute information). If there is an error in the input frame 701 for entering a name, an error sound is output so that the place where the input frame 701 is displayed at that time becomes the output source. If there is an error in the input frame 702 for entering an address, an error sound is output so that the place where the input frame 702 is displayed at that time becomes the output source. If there is an error in the input frame 703 for entering an occupation, an error sound is output so that the place where the input frame 703 is displayed at that time becomes the output source.

An error sound associated with an input frame was given as an example of the output sound, but the output sound is not limited to this example. For example, a sound associated with a still image of an advertisement (for example, an advertisement banner) or with a moving image of an advertisement may be output. In this case, the advertisement's sound is output so that it is heard as coming from the position where the advertisement's image is displayed. As another example, a sound associated with an image such as a button included in a tutorial may be output. In this case, the sound associated with the button is output so that it is heard as coming from, for example, the position where the image of the button to be operated next is displayed.

In the WEB system configured in this way, even when the WEB site displays a large amount of information and many images, the sound is output with the position to which the user's attention should be drawn as its output source. The user's gaze can therefore be guided naturally to that position.

<Modification>
The output sound and image are not limited to the sound and screen of a WEB site. For example, the system may be applied to a digital signage system to output sounds and images related to advertisements. As another example, the sound and image output described above may be performed in a tutorial created as an application rather than as a WEB site.

The embodiment of the present invention has been described above in detail with reference to the drawings. The specific configuration is not limited to this embodiment, however, and designs and the like that do not depart from the gist of the present invention are also included.

DESCRIPTION OF SYMBOLS 1 ... Audio output system, 10 ... Audio output device, 20 ... Image output device, 30 ... Control device, 301 ... Main control unit, 302 ... Voice control unit, 303 ... Display control unit

Claims

An information processing apparatus comprising:
a main control unit that acquires an output parameter of a voice to be output according to a display position of an image related to the voice; and
a voice control unit that causes a voice output device to output the voice to be output according to the output parameter,
wherein the output parameter is a parameter for changing a position of an output source of the voice as recognized by a user who listens to the voice.

The information processing apparatus according to claim 1, wherein the voice is a voice representing the utterance content of a speaker different from the user, and
the image is an image showing the speaker.

The information processing apparatus according to claim 2, wherein the voice is a voice uttered by a speaker who interacts with the user via a network.

The information processing apparatus according to claim 1, wherein the main control unit acquires a different output parameter for each of a plurality of voices, and
the voice control unit outputs the plurality of voices simultaneously.

The information processing apparatus according to claim 1, wherein the main control unit acquires an audio output parameter corresponding to a display position of an image to be monitored in which an abnormality has occurred.

The information processing apparatus according to claim 1, wherein the main control unit acquires an audio output parameter in accordance with a display position of an image to which the user's attention should be directed.

An audio output method comprising:
an acquisition step of acquiring an audio output parameter according to a display position of an image related to audio to be output; and
an output audio control step of causing an audio output device to output the audio to be output according to the output parameter,
wherein the output parameter is a parameter for changing a position of an output source of the audio as recognized by a user who listens to the audio.

A computer program for causing a computer to function as an information processing apparatus comprising:
a main control unit that acquires an output parameter of a voice to be output according to a display position of an image related to the voice; and
a voice control unit that causes a voice output device to output the voice to be output according to the output parameter,
wherein the output parameter is a parameter for changing a position of an output source of the voice as recognized by a user who listens to the voice.