JP5707346B2

JP5707346B2 - Information providing apparatus, program thereof, and information providing system

Info

Publication number: JP5707346B2
Application number: JP2012012069A
Authority: JP
Inventors: 悠紀松本; 雅法三部
Original assignee: Toshiba TEC Corp
Current assignee: Toshiba TEC Corp
Priority date: 2012-01-24
Filing date: 2012-01-24
Publication date: 2015-04-30
Anticipated expiration: 2032-01-24
Also published as: JP2013152277A

Description

Embodiments of the present invention relate to an information providing apparatus that outputs images and voice for advertising products, a program therefor, and an information providing system.

Digital signage, which displays advertisement images on a display to inform the public, is known.

JP 2011-210238 A

Digital signage of this kind targets an unspecified number of people. As a result, the information may not be conveyed accurately to the person viewing the display, in which case a sufficient advertising effect is not obtained.

An object of the embodiments of the present invention is to provide an information providing apparatus, a program therefor, and an information providing system that can accurately convey information to a person viewing the display and thereby achieve a high advertising effect.

An information providing apparatus according to an embodiment of the present invention includes: display means for displaying an advertisement image; a microphone for sound collection; voice recognition means for recognizing voices from the sound collected by the microphone; detection means for detecting, from the voices recognized by the voice recognition means, the number of people in front of the display means, and for detecting an attribute of a person in front of the display means based on the voice with the highest input volume among the voices recognized by the voice recognition means within a fixed time; voice generation means for generating voice whose content relates to the image displayed by the display means and whose characteristics correspond to the number and attribute detected by the detection means; and control means for outputting the voice generated by the voice generation means and variably setting the volume of that voice output according to the number detected by the detection means.

FIG. 1 is a block diagram showing the configuration of an embodiment.
FIG. 2 is a flowchart showing the operation of the embodiment.
FIG. 3 is a diagram showing a modification of the embodiment.

An embodiment is described below with reference to the drawings.
In FIG. 1, reference numeral 1 denotes a display unit that has, on its front face, a display (display means, display device) 2 for displaying advertisement images, a microphone 3 for sound collection, and a speaker 4 for sound output. The display unit 1 is installed on, for example, the wall of a building or the cargo bed of a vehicle to advertise products. Alternatively, the customer-facing display of a merchandise sales processing apparatus placed on a store checkout counter, such as a POS terminal, is used as the display unit 1. In that case, the POS terminal includes the information providing apparatus.

Connected to a control unit 10 are an image generation unit 11 that generates advertisement images, a voice recognition unit 12 that recognizes voices from the sound collected by the microphone 3, a voice synthesis unit 13 that generates advertisement voice, a data input unit 14 for inputting data needed to generate images and voice, an operation unit 15 for setting operating conditions, and a network interface 16. The network interface 16 is connected to a server 21 via a communication network 20.

The voice recognition unit 12 analyzes the sound collected by the microphone 3 to extract and recognize the voices contained in it. The voice synthesis unit 13 synthesizes voice with the content specified by the control unit 10. The server 21 transfers the image data needed to generate advertisement images and the voice data needed to generate advertisement voice to the control unit 10 via the communication network 20.

The display unit 1, the image generation unit 11, the voice recognition unit 12, the voice synthesis unit 13, the data input unit 14, the operation unit 15, and the network interface 16 constitute an information providing apparatus for digital signage, which generates advertisement images and voice and presents them to the public. The control unit 10, the image generation unit 11, the voice recognition unit 12, the voice synthesis unit 13, the data input unit 14, the operation unit 15, and the network interface 16 are implemented by a computer.

The control unit 10 has the following means (1) to (4) as its main functions, based on a program stored in advance in an internal memory.
(1) Detection means for detecting, from the voices recognized by the voice recognition unit 12, the number and attribute of people in front of the display unit 1 (including people passing in front of it). Specifically, the voices recognized by the voice recognition unit 12 are monitored in fixed time periods, and the number of voices with mutually different characteristics present within a period is detected as the number of people in front of the display unit 1. Then, the voice with the highest input volume among the voices recognized within that period is matched against multiple types of voice model data stored in advance in the internal memory, organized by age, sex, and foreign language (English, Chinese, Korean, and so on). In addition, the frequency and speed of the recognized voice are analyzed, and the analysis result is factored into the matching result as needed. In this way, the attribute of the person in front of the display unit 1 is detected as elderly person, child, general woman (a woman who is not elderly), general man (a man who is not elderly), or foreigner (a speaker of English, Chinese, Korean, or the like).

(2) Image generation means for generating, together with the image generation unit 11, the advertisement image to be displayed on the display 2 of the display unit 1 in accordance with the attribute detected by the detection means.

(3) Voice generation means for generating, together with the voice synthesis unit 13, advertisement voice (guidance voice) whose content relates to the image displayed on the display 2 of the display unit 1 and whose characteristics match the number and attribute detected by the detection means. Specifically, multiple types of voice generation data, organized by age, sex, and foreign language (English, Chinese, Korean, and so on), are stored in the internal memory. Based on this data, when the number detected by the detection means is equal to or greater than a fixed number (for example, five people), voice with predetermined characteristics intended for an unspecified number of listeners is generated; when the detected number is less than the fixed number, voice with characteristics matched to the detected attribute is generated. That is, when the number is less than the fixed number, slow speech is generated for an elderly person, a voice imitating a popular character (a favorite from comics, television programs, and the like) is generated for a child, and voice using the vocabulary and expressions of the language the person uses daily is generated for a foreigner. These generated voices differ from one another only in characteristics such as speed and voice quality; their content (the content of the advertisement) is the same.

(4) Control means for outputting the voice generated by the voice generation means from the speaker and variably setting the volume of that output according to the number detected by the detection means (the number of people in front). Specifically, the volume is set to the maximum level when the detected number is equal to or greater than the fixed number, and to the normal level when it is less than the fixed number.
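As an illustration only, means (3) and (4) can be sketched in Python as a single selection step. The names `OutputPlan`, `plan_output`, and all style labels are hypothetical and do not appear in the embodiment; the threshold of five is the example value given in the text, and the styles for general women and men follow the operation described for steps 114 and 116.

```python
from dataclasses import dataclass
from typing import Optional

CROWD_THRESHOLD = 5  # the "fixed number"; five is only the example from the text

STANDARD_STYLE = "standard voice for an unspecified audience"  # hypothetical label

# Hypothetical labels for the attribute-specific voice characteristics.
STYLE_BY_ATTRIBUTE = {
    "elderly": "slow speech",
    "child": "popular-character voice",
    "general_woman": "intonation and speed aimed at general women",
    "general_man": "intonation and speed aimed at general men",
    "foreigner": "listener's own language",
}


@dataclass(frozen=True)
class OutputPlan:
    voice_style: str  # which voice characteristics to synthesize
    volume: str       # "max" or "normal"


def plan_output(people_count: int, attribute: Optional[str]) -> OutputPlan:
    """Choose voice characteristics (means 3) and volume (means 4)."""
    if people_count >= CROWD_THRESHOLD:
        # Crowd: predetermined voice for everyone, maximum volume.
        return OutputPlan(STANDARD_STYLE, "max")
    # Few people: match the detected attribute, normal volume.
    # An undetected attribute falls back to the standard voice.
    style = STYLE_BY_ATTRIBUTE.get(attribute, STANDARD_STYLE)
    return OutputPlan(style, "normal")
```

Under these assumptions, `plan_output(4, "child")` selects the character voice at normal volume, while `plan_output(5, "child")` ignores the attribute and selects the standard voice at maximum volume.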

Next, the operation is described with reference to the flowchart of FIG. 2.
When a person standing in front of the display unit 1 or passing in front of it speaks, the voice is collected by the microphone 3, analyzed, and recognized (step 101). Based on the recognized voices, the number and attribute of the people in front of the display unit 1 are detected (step 102). That is, the number of voices with mutually different characteristics present within the fixed time is detected as the number of people, and it is detected whether the loudest of the recognized voices belongs to an elderly person, a child, a general woman, a general man, or a foreigner.
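The detection in step 102 can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: it assumes that the voice recognition stage has already labeled each recognized voice with a signature (`voice_id`), a matched attribute, an input level, and a timestamp. The `RecognizedVoice` type, its fields, and the `detect` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class RecognizedVoice:
    voice_id: str       # hypothetical label for one distinct voice signature
    attribute: str      # result of matching against the stored voice models
    input_level: float  # input volume measured at the microphone
    time_s: float       # time at which the voice was recognized


def detect(
    voices: Iterable[RecognizedVoice],
    window_start_s: float,
    window_length_s: float,
) -> Tuple[int, Optional[str]]:
    """Return (people_count, attribute) for one monitoring window."""
    window_end_s = window_start_s + window_length_s
    in_window = [v for v in voices if window_start_s <= v.time_s < window_end_s]
    if not in_window:
        return 0, None
    # The number of mutually distinct voices is taken as the number of people.
    people_count = len({v.voice_id for v in in_window})
    # The attribute is taken from the loudest voice only.
    loudest = max(in_window, key=lambda v: v.input_level)
    return people_count, loudest.attribute
```

Taking the attribute from the loudest voice alone means that a quiet child standing next to a louder elderly speaker yields the attribute of the elderly speaker, which matches the rule stated for step 102.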

If the detected number of people is equal to or greater than the fixed number (YES in step 103), the volume of the voice output from the speaker 4 is set to the maximum level (step 104). An advertisement image for a product aimed at the general public and voice whose content relates to that image are then generated; the image is displayed on the display 2 and the voice is emitted from the speaker 4 at maximum volume (step 105). In this case, voice at a standard speed using standard vocabulary and expressions, such as those used on Japanese television and radio, is generated so that anyone in the unspecified audience can easily follow it. Because the voice is emitted at maximum volume, the content of the advertisement image can be conveyed accurately even in a noisy situation, such as when many people are talking to one another in front of the display unit 1.

If the detected number of people is less than the fixed number (NO in step 103), the volume of the voice output from the speaker 4 is set to the normal level so that the people in front of the display unit 1 do not find it annoying (step 106). It is then determined whether the detected attribute is elderly person, child, general woman, general man, or foreigner (steps 107, 108, 111, 113, 115, and 117).

For an elderly woman (YES in step 107, YES in step 108), an advertisement image for a product aimed at elderly women and voice whose content relates to that image are generated; the image is displayed on the display 2 and the voice is emitted from the speaker 4 (step 109). In this case, voice is generated that uses standard vocabulary and expressions but is slow and contains as little high-frequency content as possible, so that elderly listeners can easily follow it. The elderly woman in front of the display unit 1 can therefore grasp the content of the advertisement image easily and accurately while viewing an advertisement for a product that interests her.

For an elderly man (YES in step 107, NO in step 108), an advertisement image for a product aimed at elderly men and voice whose content relates to that image are generated; the image is displayed on the display 2 and the voice is emitted from the speaker 4 (step 110). In this case, voice is generated that uses standard vocabulary and expressions but is slow and contains as little high-frequency content as possible, so that elderly listeners can easily follow it. The elderly man in front of the display unit 1 can therefore grasp the content of the advertisement image easily and accurately while viewing an advertisement for a product that interests him.

For a child (YES in step 111), an advertisement image for a product aimed at children and families and voice whose content relates to that image are generated; the image is displayed on the display 2 and the voice is emitted from the speaker 4 (step 112). In this case, voice is generated that uses simple vocabulary and expressions and imitates a popular character (a favorite from comics, television programs, and the like), so that a child can easily understand it and take an interest. The child or family in front of the display unit 1 can therefore listen to the content of the advertisement image with interest and enjoyment while viewing an advertisement for a product that interests them.

For a general woman (YES in step 113), an advertisement image for a product aimed at general women and voice whose content relates to that image are generated; the image is displayed on the display 2 and the voice is emitted from the speaker 4 (step 114). In this case, voice is generated that uses standard vocabulary and expressions, with intonation and speed that appeal strongly to general women. The woman in front of the display unit 1 can grasp the content of the advertisement image easily and accurately while viewing an advertisement for a product that interests her.

For a general man (YES in step 115), an advertisement image for a product aimed at general men and voice whose content relates to that image are generated; the image is displayed on the display 2 and the voice is emitted from the speaker 4 (step 116). In this case, voice is generated that uses standard vocabulary and expressions, with intonation and speed that arouse the interest of general men. The man in front of the display unit 1 can grasp the content of the advertisement image easily and accurately while viewing an advertisement for a product that interests him.

For a foreigner (YES in step 117), an advertisement image for a product aimed at foreigners and voice whose content relates to that image are generated; the image is displayed on the display 2 and the voice is emitted from the speaker 4 (step 118). In this case, if the language of the recognized voice is English, voice using English vocabulary and expressions is generated. If it is Chinese, voice using Chinese vocabulary and expressions is generated. If it is Korean, voice using Korean vocabulary and expressions is generated. The foreigner in front of the display unit 1 can therefore grasp the content of the advertisement image easily and accurately while viewing an advertisement for a product that interests them.

If the detected attribute is none of elderly person, child, general woman, general man, or foreigner (NO in each of steps 107, 108, 111, 113, 115, and 117), then, as in the case where the detected number of people is equal to or greater than the fixed number, voice at a standard speed using standard vocabulary and expressions, such as those used on Japanese television and radio, is generated and emitted so that anyone in the unspecified audience can easily follow it (step 105).
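The branching of FIG. 2 can be traced with a short Python sketch. It rests on two assumptions that the text does not state outright: that the attribute checks run in the order of their step numbers (107, 111, 113, 115, 117), and that a match at each check leads to the output step that follows it. The function name `trace_steps` and the attribute labels are hypothetical.

```python
from typing import List, Optional

CROWD_THRESHOLD = 5  # the "fixed number"; five is only the example from the text


def trace_steps(people_count: int, attribute: Optional[str]) -> List[int]:
    """Return the flowchart steps visited for one advertisement cycle."""
    steps = [101, 102, 103]          # recognize, detect, compare with threshold
    if people_count >= CROWD_THRESHOLD:
        return steps + [104, 105]    # maximum volume, standard voice
    steps += [106, 107]              # normal volume, then "elderly?"
    if attribute in ("elderly_woman", "elderly_man"):
        steps.append(108)            # "woman?"
        steps.append(109 if attribute == "elderly_woman" else 110)
        return steps
    # Remaining checks, in the order their step numbers suggest (an assumption).
    for check_step, label, output_step in (
        (111, "child", 112),
        (113, "general_woman", 114),
        (115, "general_man", 116),
        (117, "foreigner", 118),
    ):
        steps.append(check_step)
        if attribute == label:
            steps.append(output_step)
            return steps
    steps.append(105)                # no attribute matched: standard voice
    return steps
```

In this reading, the fallback path reaches step 105 by way of step 106, so the standard voice is emitted at the normal volume set there rather than at the maximum volume used for a crowd.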

When output of the image and voice for one advertisement ends, the processing from step 101 is repeated, and the image and voice of the next advertisement, aimed at whoever is then standing in or passing in front of the display unit 1, are output.

In this way, generating and emitting voice matched to the attribute of the person in front of the display unit 1 allows the content of the advertisement to be conveyed accurately to that person. A high advertising effect as digital signage is thus obtained.

Although a display is used as the display means in the above embodiment, a projector for video display may be used instead. The embodiment detects speakers of English, Chinese, and Korean as foreigners, but speakers of other languages can of course be detected as well.

Some of the functions and components of the information providing apparatus can also be provided in an external server. Cloud computing, for example, can be used to build such a system. More specifically, the software delivery model known as SaaS (software as a service) is suitable. FIG. 3 shows the configuration when such a cloud system is used. An information providing system 200 has a cloud 201, a plurality of terminals 202, a plurality of communication networks 203, and a plurality of servers 204 communicatively connected to one another. There may be only one each of the terminal 202, the communication network 203, and the server 204. The terminal 202 can communicate with the cloud 201 via the communication network 203. As the terminal 202, the information providing apparatus, various computers such as desktop and notebook computers, a mobile phone, a personal digital assistant (PDA), a smartphone, or the like can be used as appropriate. As the communication network 203, the Internet, a private network, a next generation network (NGN), a mobile network, or the like can be used as appropriate.

In the information providing system 200, at least one of the functions and components of the information providing apparatus is provided in the server 204, and the remaining functions and components not provided in the server 204 are provided in the terminal 202. The functions and components provided in the server 204 may be placed in a single server 204 or distributed across a plurality of servers 204.
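The division of functions between the server 204 and the terminal 202 can be illustrated with a small Python sketch. The function names are hypothetical identifiers chosen for the five means listed for the information providing system; nothing here prescribes which functions belong on which side.

```python
from typing import Dict, FrozenSet, Iterable

# Hypothetical identifiers for the means of the information providing system.
FUNCTIONS: FrozenSet[str] = frozenset({
    "sound_collection",
    "voice_recognition",
    "detection",
    "voice_generation",
    "control",
})


def split_functions(on_server: Iterable[str]) -> Dict[str, FrozenSet[str]]:
    """Place the chosen functions on the server and the rest on the terminal."""
    server = frozenset(on_server)
    unknown = server - FUNCTIONS
    if unknown:
        raise ValueError(f"unknown functions: {sorted(unknown)}")
    if not server:
        # The system requires at least one function on the server side.
        raise ValueError("at least one function must be placed on the server")
    return {"server": server, "terminal": FUNCTIONS - server}
```

For example, `split_functions({"voice_recognition", "detection"})` leaves sound collection, voice generation, and control on the terminal.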

The above embodiment and modification are presented as examples and are not intended to limit the scope of the invention. These novel embodiments and modifications can be implemented in various other forms, and various omissions, rewrites, and changes can be made without departing from the gist of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and in the invention described in the claims and its equivalents.
The inventions described in the claims as originally filed are appended below.
[1]
An information providing apparatus comprising:
display means for displaying an advertisement image;
a microphone for sound collection;
voice recognition means for recognizing voices from the sound collected by the microphone;
detection means for detecting, from the voices recognized by the voice recognition means, the number of people in front of the display means;
voice generation means for generating voice relating to the image displayed by the display means; and
control means for outputting the voice generated by the voice generation means and variably setting the volume of that voice output according to the number detected by the detection means.
[2]
The information providing apparatus according to [1], wherein
the detection means detects, from the voices recognized by the voice recognition means, the number and attribute of people in front of the display means,
the voice generation means generates voice whose content relates to the image displayed by the display means and whose characteristics correspond to the number and attribute detected by the detection means, and
the control means sets the volume of the voice output to a maximum level when the number detected by the detection means is equal to or greater than a fixed number, and to a normal level when the number is less than the fixed number.
[3]
The information providing apparatus according to [2], wherein
the voice generation means generates, when the number detected by the detection means is equal to or greater than the fixed number, voice whose content relates to the image displayed by the display means and which is intended for an unspecified number of listeners, and generates, when the number detected by the detection means is less than the fixed number, voice whose content relates to the image displayed by the display means and whose characteristics match the attribute detected by the detection means.
[4]
The information providing apparatus according to [3], wherein
the detection means detects an elderly person, a child, or a foreigner as the attribute, and
the voice generation means, when the number detected by the detection means is less than the fixed number, generates slow speech for an elderly person if the detected attribute is elderly person, generates voice by a popular character for a child if the detected attribute is child, and generates voice in a foreign language for a foreigner if the detected attribute is foreigner.
[5]
A program for an information providing apparatus that includes display means for displaying an advertisement image, a microphone for sound collection, and a computer, the program causing the computer to implement:
a voice recognition function of recognizing voices from the sound collected by the microphone;
a detection function of detecting, from the recognized voices, the number of people in front of the display means;
a voice generation function of generating voice relating to the image displayed by the display means; and
a control function of outputting the generated voice and variably setting the volume of that voice output according to the detected number.
[6]
An information providing system that includes a server and an information providing apparatus for displaying an advertisement image, the system comprising:
sound collection means for collecting sound;
voice recognition means for recognizing voices from the sound collected by the sound collection means;
detection means for detecting, from the voices recognized by the voice recognition means, the number of people in front of the information providing apparatus;
voice generation means for generating voice relating to the image displayed by the information providing apparatus; and
control means for outputting the voice generated by the voice generation means and variably setting the volume of that voice output according to the number detected by the detection means,
wherein the server includes at least one of the above means, and the information providing apparatus includes the remaining means other than those included in the server.

DESCRIPTION OF SYMBOLS: 1 ... display unit, 2 ... display, 3 ... microphone, 4 ... speaker, 10 ... control unit, 11 ... image generation unit, 12 ... voice recognition unit, 13 ... voice synthesis unit, 14 ... data input unit, 15 ... operation unit, 16 ... network interface, 20 ... communication network, 21 ... server

Claims

An information providing apparatus comprising:
display means for displaying an advertisement image;
a microphone for sound collection;
voice recognition means for recognizing voices from the sound collected by the microphone;
detection means for detecting, from the voices recognized by the voice recognition means, the number of people in front of the display means, and for detecting an attribute of a person in front of the display means based on the voice with the highest input volume among the voices recognized by the voice recognition means within a fixed time;
voice generation means for generating voice whose content relates to the image displayed by the display means and whose characteristics correspond to the number and attribute detected by the detection means; and
control means for outputting the voice generated by the voice generation means and variably setting the volume of that voice output according to the number detected by the detection means.

The information providing apparatus according to claim 1, wherein the control means sets the volume of the voice output to a maximum level when the number detected by the detection means is equal to or greater than a fixed number, and to a normal level when the number is less than the fixed number.

The information providing apparatus according to claim 2, wherein the voice generation means generates, when the number detected by the detection means is equal to or greater than the fixed number, voice whose content relates to the image displayed by the display means and which is intended for an unspecified number of listeners, and generates, when the number detected by the detection means is less than the fixed number, voice whose content relates to the image displayed by the display means and whose characteristics match the attribute detected by the detection means.

The information providing apparatus according to claim 2, wherein
the detection means detects an elderly person, a child, or a foreigner as the attribute, and
the voice generation means, when the number detected by the detection means is less than the fixed number, generates slow speech for an elderly person if the detected attribute is elderly person, generates voice by a popular character for a child if the detected attribute is child, and generates voice in a foreign language for a foreigner if the detected attribute is foreigner.

A program for an information providing apparatus that includes display means for displaying an advertisement image, a microphone for sound collection, and a computer, the program causing the computer to implement:
a voice recognition function of recognizing voices from the sound collected by the microphone;
a detection function of detecting, from the recognized voices, the number of people in front of the display means, and of detecting an attribute of a person in front of the display means based on the voice with the highest input volume among the voices recognized by the voice recognition function within a fixed time;
a voice generation function of generating voice whose content relates to the image displayed by the display means and whose characteristics correspond to the number and attribute detected by the detection function; and
a control function of outputting the generated voice and variably setting the volume of that voice output according to the detected number.

An information providing system that includes a server and an information providing apparatus for displaying an advertisement image, the system comprising:
sound collection means for collecting sound;
voice recognition means for recognizing voices from the sound collected by the sound collection means;
detection means for detecting, from the voices recognized by the voice recognition means, the number of people in front of the information providing apparatus, and for detecting an attribute of a person in front of the information providing apparatus based on the voice with the highest input volume among the voices recognized by the voice recognition means within a fixed time;
voice generation means for generating voice whose content relates to the image displayed by the information providing apparatus and whose characteristics correspond to the number and attribute detected by the detection means; and
control means for outputting the voice generated by the voice generation means and variably setting the volume of that voice output according to the number detected by the detection means,
wherein the server includes at least one of the above means, and the information providing apparatus includes the remaining means other than those included in the server.