JP2019200598A - server

Info

Publication number: JP2019200598A
Application number: JP2018094722A
Authority: JP
Inventors: Tatsuo Tanaka (田中 達雄); Masashi Suzaki (須崎 正士); Katsunori Arai (新井 克典); Yuichiro Toyosaki (豊崎 祐一郎)
Original assignee: Nomura Research Institute Ltd
Current assignee: Nomura Research Institute Ltd
Priority date: 2018-05-16
Filing date: 2018-05-16
Publication date: 2019-11-21
Anticipated expiration: 2038-05-16
Also published as: JP7504978B2; JP7235946B2; JP2023024713A; JP2024107317A

Abstract

PROBLEM TO BE SOLVED: To provide a technique that allows a smart speaker to be used as an effective advertising medium, or to further improve a smart speaker system. SOLUTION: A server includes: receiving means for receiving a distribution request, via a network, from a speaker having a microphone and a communication function; acquisition means for acquiring audio content without an image in response to the received distribution request; selection means for selecting a voice advertisement without an image from voice advertisement holding means; and transmitting means for transmitting the acquired audio content together with the selected voice advertisement to the speaker via the network. SELECTED DRAWING: Figure 1

Description

The present invention relates to a server that communicates with a speaker having a microphone.

Smart speakers, which have a microphone and a communication function and allow voice operation and information retrieval, are beginning to spread (see, for example, Non-Patent Document 1).

Non-Patent Document 1: https://www.is.nri.co.jp/report/short-research/2017/000213.html, retrieved May 9, 2018

Current systems that include a smart speaker can receive a spoken request from a user and process it. Under these circumstances, a more useful smart speaker system is desired.

The present invention has been made in view of this problem, and its object is to provide a technique that allows a smart speaker to be used as an effective advertising medium, or to further improve a smart speaker system.

One aspect of the present invention relates to a server. The server includes: receiving means for receiving a distribution request, via a network, from a speaker having a microphone and a communication function; acquisition means for acquiring audio content without an image in response to the received distribution request; selection means for selecting a voice advertisement without an image from voice advertisement holding means; and transmitting means for transmitting the acquired audio content together with the selected voice advertisement to the speaker via the network.

Any combination of the above components, and any mutual substitution of the components and expressions of the present invention among apparatuses, methods, systems, computer programs, recording media storing computer programs, and the like, are also effective as aspects of the present invention.

According to the present invention, it is possible to provide a technique that allows a smart speaker to be used as an effective advertising medium, or to further improve a smart speaker system.

FIG. 1 is a schematic diagram showing the configuration of a voice advertisement distribution system according to a first embodiment.
FIG. 2 is a block diagram showing the functions and configuration of the smart speaker of FIG. 1.
FIG. 3 is a hardware configuration diagram of the management server of FIG. 1.
FIG. 4 is a block diagram showing the functions and configuration of the management server of FIG. 1.
FIG. 5 is a data structure diagram showing an example of the audio content holding unit of FIG. 4.
FIG. 6 is a data structure diagram showing an example of the voice advertisement holding unit of FIG. 4.
FIG. 7 is a data structure diagram showing an example of the voice information holding unit of FIG. 4.
FIG. 8 is a data structure diagram showing an example of the user information holding unit of FIG. 4.
FIG. 9 is a data structure diagram showing an example of the session information holding unit of FIG. 4.
FIG. 10 is a flowchart showing the flow of a series of processes in the management server of FIG. 1.
FIG. 11 is a schematic top view of a user's room.
FIG. 12 is a schematic diagram showing the configuration of a voice operation system according to a third embodiment.
FIG. 13 is a schematic diagram showing the configuration of a voice operation system according to a fourth embodiment.
FIG. 14 is a block diagram showing the functions and configuration of the management server of FIG. 13.
FIG. 15 is a data structure diagram showing an example of the user information holding unit of FIG. 14.

In the following, identical or equivalent components, members, and processes shown in the drawings are denoted by the same reference numerals, and duplicate description is omitted as appropriate. Some members that are not important to the explanation are omitted from the drawings.

(First embodiment)
In the voice advertisement distribution system according to the first embodiment, a user can use a smart speaker to perform tasks such as the following.
・Simple lookups
・Checking the weather forecast
・Listening to the news
・Setting an alarm
・Checking a schedule
・Performing calculations
・Playing music
・Controlling smart home appliances
The voice advertisement distribution system acquires the user's utterance through the microphone of the smart speaker and, by performing voice recognition on it, understands that the user is requesting distribution of audio content (for example, a search result, weather forecast, news, schedule, calculation result, or music). The system prepares the requested audio content and distributes it to the smart speaker. In doing so, it inserts a voice advertisement into the audio content to be distributed so that the advertisement is played on the smart speaker before the audio content.

This voice advertisement may be, for example, one matched to the audio content to be distributed, one based on the content of the dialogue so far, or one based on the sounds around the smart speaker that it picked up immediately before the audio content is distributed.

The length of the voice advertisement may be adjusted to suit the state of the dialogue with the user and the situation around the smart speaker. For example, the length may be adjusted according to the audio content to be distributed, or the important part of the voice advertisement may be extracted.

The playback timing of a voice advertisement may be determined in consideration of two points: the advertisement is more effective when the user is near the smart speaker, and the user may find it unpleasant if an advertisement is output while the user is talking to the smart speaker or to another user. For example, the voice advertisement may be played only when it is determined that a user is near the smart speaker. The voice advertisement also need not be played while the user is conversing with another user or speaking to the smart speaker; in that case, output of the advertisement may start or resume once the user stops speaking. The voice advertisement may also be output in coordination with another electronic device, for example a television (hereinafter, TV). For example, after an advertisement is shown on the TV, a follow-up may be output as audio from the smart speaker, in which case the next TV advertisement may be muted. Alternatively, a related advertisement may be shown on the TV after the voice advertisement has been played on the smart speaker.
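For illustration only, the two timing considerations described above (play only when a user is nearby; hold the advertisement while the user is talking) could be sketched in Python as follows. The names `RoomStatus` and `may_play_ad` are hypothetical and do not appear in the specification, and how presence and talking are detected is left outside the sketch.

```python
from dataclasses import dataclass


@dataclass
class RoomStatus:
    """Hypothetical snapshot of what the system knows about the room."""
    user_present: bool     # a user is judged to be around the smart speaker
    user_speaking: bool    # the user is speaking to the smart speaker
    user_conversing: bool  # the user is talking with another user


def may_play_ad(status: RoomStatus) -> bool:
    """Return True only when playing a voice advertisement is acceptable."""
    if not status.user_present:
        return False  # nobody is around to hear the advertisement
    # Hold the advertisement back while the user is talking.
    return not (status.user_speaking or status.user_conversing)
```

Under this sketch, an advertisement that was held back would simply be retried once `may_play_ad` returns True, which corresponds to starting or resuming output when the user stops speaking.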

The voice advertisement distribution system has a function of obtaining a voiceprint from the user's utterance acquired via the smart speaker and authenticating the user by voiceprint authentication. The system also cooperates with other services such as web services and social networking services (SNS), and can associate an authenticated user of the voice advertisement distribution system with that user's account in another service. In this case, the system may select, for the authenticated user, a voice advertisement tied to that user's account. For example, it may select a voice advertisement based on information collected by the smart speaker and on account attributes. It may also update the account attributes with information collected by the smart speaker.

FIG. 1 is a schematic diagram showing the configuration of a voice advertisement distribution system 2 according to the first embodiment. The voice advertisement distribution system 2 includes a management server 4, a smart speaker 10, and a TV 12. The management server 4, the smart speaker 10, and the TV 12 are communicably connected via a network 6 such as the Internet. The smart speaker 10 and the TV 12 are both installed in a room 14 of a user 8. The smart speaker 10 is a speaker having a microphone and a communication function; it is connected to the network 6 as described above and is also configured to be capable of P2P (peer-to-peer) communication 16 with the TV 12. Although FIG. 1 shows an example in which the smart speaker 10 and the management server 4 communicate, there is no limit on the number of smart speakers 10 or on the number of users 8.

The user 8 speaks a sentence expressing an audio content distribution request to the smart speaker 10, such as “I want to hear a song by Taro Komura”, “Tell me today's news”, “What's the weather tonight?”, or “Tell me about Izumo Taisha”. The microphone of the smart speaker 10 converts the voice uttered by the user 8 into an electrical signal, and the smart speaker 10 transmits that signal as an audio signal to the management server 4 via the network 6. The management server 4 performs voice recognition processing on the received audio signal to understand what audio content the user is requesting. The management server 4 generates distribution information in which a voice advertisement is attached to the requested audio content and transmits it to the smart speaker 10 via the network 6. On receiving the distribution information, the smart speaker 10 first outputs the voice advertisement and then outputs the audio content.
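As a minimal sketch of the distribution information described above, the voice advertisement and the audio content can be modelled as an ordered list of audio segments whose order is the playback order. The type `Segment` and the functions `build_distribution_info` and `playback_order` are assumptions made for this example; the specification does not define a concrete format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    kind: str     # "ad" or "content" (hypothetical labels)
    audio: bytes  # image-free audio data


def build_distribution_info(ad: Segment, content: Segment) -> List[Segment]:
    """Attach the advertisement to the content; list order is playback order."""
    return [ad, content]


def playback_order(distribution_info: List[Segment]) -> List[str]:
    """Kinds of the segments in the order the speaker would output them."""
    return [segment.kind for segment in distribution_info]


info = build_distribution_info(
    Segment("ad", b"<ad audio>"),
    Segment("content", b"<news audio>"),
)
print(playback_order(info))  # ['ad', 'content']
```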

The smart speaker 10 may or may not have a display. Either way, the content distributed from the management server 4 is not content in which images (still or moving) and sound are integrated, but audio content without an image (or content consisting only of audio). Likewise, the voice advertisement is a voice advertisement without an image.

FIG. 2 is a block diagram showing the functions and configuration of the smart speaker 10 of FIG. 1. Each block shown here can be implemented in hardware by elements such as a computer's CPU and by mechanical devices, and in software by a computer program or the like; what is drawn here are functional blocks realized by their cooperation. Those skilled in the art who have read this specification will therefore understand that these functional blocks can be realized in various forms by combinations of hardware and software.

The smart speaker 10 includes a speaker 102, a microphone 104, a communication unit 106, an input unit 108, and a processing unit 110. The communication unit 106 functions as an interface for communication with the network 6 and also as an interface for the P2P communication 16. The input unit 108 includes physical input mechanisms such as a power button and volume buttons. The processing unit 110 controls the speaker 102, the microphone 104, the communication unit 106, and the input unit 108 to realize the various functions of the smart speaker 10.

The present embodiment assumes that the microphone 104 converts the user's utterance into an audio signal, the communication unit 106 transmits the audio signal to the management server 4, and the management server 4 performs voice recognition processing on the audio signal. However, the technical idea of the present embodiment is also applicable when at least part of the voice recognition processing is performed in the smart speaker, when the audio content acquisition processing or voice advertisement selection processing described later is performed in the smart speaker, or when the smart speaker is stand-alone. Sending the result of voice recognition performed in the smart speaker to the management server and sending the audio signal from the smart speaker to the management server as is can both be described as sending voice information corresponding to the user's utterance to the management server.

FIG. 3 is a hardware configuration diagram of the management server 4 of FIG. 1. The management server 4 includes a memory 130, a processor 132, a communication interface 134, a display 136, and an input interface 138. These elements are each connected to a bus 140 and communicate with one another via the bus 140.

The memory 130 is a storage area for storing data and programs. Data and programs may be stored in the memory 130 permanently or temporarily. The processor 132 realizes the various functions of the management server 4 by executing programs stored in the memory 130. The communication interface 134 is an interface for sending and receiving data to and from the outside of the management server 4; for example, it includes an interface for accessing the network 6. The display 136 is a device for displaying various kinds of information, for example a liquid crystal display or an organic EL (electroluminescence) display. The input interface 138 is a device for accepting input from a user, and includes, for example, a mouse, a keyboard, and a touch panel provided on the display 136.

FIG. 4 is a block diagram showing the functions and configuration of the management server 4 of FIG. 1. Each block shown here can be implemented in hardware by elements such as a computer's CPU and by mechanical devices, and in software by a computer program or the like; what is drawn here are functional blocks realized by their cooperation. Those skilled in the art who have read this specification will therefore understand that these functional blocks can be realized in various forms by combinations of hardware and software.

The management server 4 includes an audio content holding unit 402, a voice advertisement holding unit 404, a voice information holding unit 406, a user information holding unit 408, a session information holding unit 410, an audio signal reception unit 412, a voice recognition unit 414, a user authentication unit 416, a session management unit 418, a content acquisition unit 420, an advertisement selection unit 422, an advertisement adjustment unit 424, a transmission information generation unit 426, a transmission unit 428, a timing control unit 430, and an attribute update unit 432.

FIG. 5 is a data structure diagram showing an example of the audio content holding unit 402 of FIG. 4. The audio content holding unit 402 holds, in association with one another, a content ID that identifies audio content, keywords that characterize the audio content, and the data of the audio content. Other metadata such as tags may be used in addition to or instead of the keywords.

The data held in the audio content holding unit 402 may be data generated and registered by the management server 4 in advance or on request. A known speech synthesis technique may be used to create the audio content data. Alternatively, the data may be data that the management server 4 has acquired, in advance or on request, from a server of another service.
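The association of content ID, keywords, and data held by the audio content holding unit 402 might be sketched as a small in-memory store, as below. This is an illustrative assumption only: the class names `AudioContent` and `AudioContentStore`, the method names, and the use of an in-memory dictionary are not taken from the specification, which does not state how the unit is implemented.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class AudioContent:
    content_id: str           # identifies the audio content
    keywords: FrozenSet[str]  # characterize the audio content
    data: bytes               # the audio data itself


class AudioContentStore:
    """Hypothetical stand-in for the audio content holding unit 402."""

    def __init__(self) -> None:
        self._by_id: Dict[str, AudioContent] = {}

    def register(self, content: AudioContent) -> None:
        # Registering the same content ID again overwrites the old entry.
        self._by_id[content.content_id] = content

    def find_by_keyword(self, keyword: str) -> List[AudioContent]:
        return [c for c in self._by_id.values() if keyword in c.keywords]
```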

FIG. 6 is a data structure diagram showing an example of the voice advertisement holding unit 404 of FIG. 4. The voice advertisement holding unit 404 holds, in association with one another, an advertisement ID that identifies a voice advertisement, keywords that characterize the voice advertisement, an attribute of the voice advertisement, and the data of the voice advertisement. Other metadata such as tags may be used in addition to or instead of the keywords. The data held in the voice advertisement holding unit 404 may be data that the entity operating the management server 4 has received from advertisers.
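One way the keywords of the two holding units could be used together is to pick the voice advertisement whose keywords overlap most with those of the requested audio content. The sketch below shows that idea. Keyword overlap as the selection rule is an assumption made for this example and is not stated in this passage, and `VoiceAd` and `select_ad` are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class VoiceAd:
    ad_id: str                # identifies the voice advertisement
    keywords: FrozenSet[str]  # characterize the voice advertisement
    attribute: str            # attribute of the voice advertisement
    data: bytes               # image-free advertisement audio


def select_ad(ads: Iterable[VoiceAd],
              content_keywords: FrozenSet[str]) -> Optional[VoiceAd]:
    """Return the ad sharing the most keywords with the content, if any."""
    best: Optional[VoiceAd] = None
    best_score = 0
    for ad in ads:
        score = len(ad.keywords & content_keywords)
        if score > best_score:  # ties keep the earlier advertisement
            best, best_score = ad, score
    return best
```

Returning `None` when nothing overlaps leaves the caller free to fall back to some other rule, which this sketch does not attempt to define.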

FIG. 7 is a data structure diagram showing an example of the voice information holding unit 406 of FIG. 4. The voice information holding unit 406 holds voice information acquired via the microphone 104 of the smart speaker 10. The voice information includes the content of the user's utterance obtained when the voice recognition unit 414, described later, performs voice recognition on the audio signal. The voice information holding unit 406 holds, in association with one another, a user ID that identifies the user, a session ID of the dialogue session between the smart speaker 10 and the user, and the content of an utterance by the user or the system in that dialogue session. For utterances by the system, the voice information holding unit 406 holds a system-specific ID as the user ID.
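The records described for the voice information holding unit 406 can be pictured as an append-only log of utterances, with a reserved ID for utterances made by the system. In the sketch below, the value `"SYSTEM"` and the names `UtteranceRecord` and `VoiceInfoStore` are illustrative assumptions and are not given in the specification.

```python
from dataclasses import dataclass
from typing import List

SYSTEM_USER_ID = "SYSTEM"  # hypothetical system-specific ID


@dataclass(frozen=True)
class UtteranceRecord:
    user_id: str     # user ID, or SYSTEM_USER_ID for system utterances
    session_id: str  # dialogue session the utterance belongs to
    text: str        # recognized (or generated) utterance content


class VoiceInfoStore:
    """Hypothetical stand-in for the voice information holding unit 406."""

    def __init__(self) -> None:
        self._records: List[UtteranceRecord] = []

    def add(self, user_id: str, session_id: str, text: str) -> None:
        self._records.append(UtteranceRecord(user_id, session_id, text))

    def transcript(self, session_id: str) -> List[UtteranceRecord]:
        """All utterances of one dialogue session, in the order recorded."""
        return [r for r in self._records if r.session_id == session_id]
```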

FIG. 8 is a data structure diagram showing an example of the user information holding unit 408 of FIG. 4. The user information holding unit 408 holds, in association with one another, a user ID, the user's voiceprint data, an account ID that identifies the user's account in another service, attributes of the account, and an NG advertisement ID that identifies an advertisement the user found unpleasant. The voiceprint data may be acquired when the user first registers with the voice advertisement distribution system 2.

FIG. 9 is a data structure diagram showing an example of the session information holding unit 410 of FIG. 4. The session information holding unit 410 holds the state of the current dialogue session between the smart speaker 10 and the user. It holds, in association with one another, the user ID of the user involved in a dialogue session that currently exists or is being maintained, the session ID of that dialogue session, and the state of that dialogue session. The state of the dialogue session is selected from three: “speaking”, indicating that the user is speaking; “conversing”, indicating that the user is conversing with another user; and “waiting for utterance”, indicating that the user is not speaking and the next utterance by the user or by the system is awaited. When it is determined that a dialogue session has ended, the entry for that dialogue session is deleted from the session information holding unit 410. In other words, if the session ID of a dialogue session is registered in the session information holding unit 410, that dialogue session is determined to be ongoing and the user involved in it is determined to be near the smart speaker 10.
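The behaviour described for the session information holding unit 410 (three states, deletion when a session ends, and presence inferred from a registered session) could be sketched as follows. The names `SessionState` and `SessionInfoStore` and the dictionary-based storage are assumptions for this example only.

```python
from enum import Enum
from typing import Dict, Optional, Tuple


class SessionState(Enum):
    SPEAKING = "speaking"
    CONVERSING = "conversing"
    WAITING = "waiting for utterance"


class SessionInfoStore:
    """Hypothetical stand-in for the session information holding unit 410."""

    def __init__(self) -> None:
        # session ID -> (user ID, state of the dialogue session)
        self._entries: Dict[str, Tuple[str, SessionState]] = {}

    def upsert(self, session_id: str, user_id: str, state: SessionState) -> None:
        self._entries[session_id] = (user_id, state)

    def end(self, session_id: str) -> None:
        # Deleting the entry marks the dialogue session as finished.
        self._entries.pop(session_id, None)

    def state_of(self, session_id: str) -> Optional[SessionState]:
        entry = self._entries.get(session_id)
        return entry[1] if entry else None

    def is_user_nearby(self, user_id: str) -> bool:
        # A registered session implies the user is around the smart speaker.
        return any(uid == user_id for uid, _ in self._entries.values())
```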

Returning to FIG. 4, the audio signal reception unit 412 receives, from the smart speaker 10 via the network 6, an audio signal representing the content of the user's utterance. As described above, the audio signal is an electrical signal obtained by converting the user's spoken voice with the microphone 104, specifically an electrical signal representing the waveform of the voice. The utterance content includes questions and responses to the smart speaker 10 (or the voice advertisement distribution system 2), talking to oneself, and conversation with other users.

The voice recognition unit 414 performs predetermined voice recognition processing on the audio signal received by the audio signal reception unit 412, and thereby derives the content of the user's utterance from the audio signal. The voice recognition processing may be implemented with known voice recognition techniques that use n-grams or hidden Markov models.

The user authentication unit 416 extracts or obtains a voiceprint from the audio signal received by the audio signal reception unit 412, and performs user authentication based on the extracted voiceprint (that is, voiceprint authentication). The user authentication unit 416 refers to the user information holding unit 408 and determines whether any of the voiceprints held there matches the extracted voiceprint. If there is a matching voiceprint, the user authentication unit 416 identifies the user ID corresponding to that voiceprint and associates the identified user ID with the utterance content derived by the voice recognition unit 414. In this case, the user who made the utterance corresponding to the received audio signal has been voiceprint-authenticated by the management server 4. If there is no matching voiceprint, the user authentication unit 416 generates an output indicating no match or an unknown user. The management server 4 may start new user registration in response to this output.
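The matching step can be illustrated with a sketch in which voiceprints are represented as numeric feature vectors and two voiceprints "match" when their cosine similarity reaches a threshold. Both the vector representation and the similarity rule are assumptions for this example, since the specification only says that a matching voiceprint is searched for. The function names and the threshold value 0.9 are likewise hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def authenticate(voiceprint: Sequence[float],
                 enrolled: Dict[str, Sequence[float]],
                 threshold: float = 0.9) -> Optional[str]:
    """Return the best-matching user ID, or None for 'no match / unknown'."""
    best_user: Optional[str] = None
    best_score = threshold
    for user_id, reference in enrolled.items():
        score = cosine_similarity(voiceprint, reference)
        if score >= best_score:
            best_user, best_score = user_id, score
    return best_user
```

A `None` result corresponds to the "no match or unknown user" output, on which new user registration may be started.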

The session management unit 418 manages dialogue sessions between the smart speaker 10 and users. It manages the voice information holding unit 406 and the session information holding unit 410. The session management unit 418 associates the user ID and utterance content associated by the user authentication unit 416 with a session ID that identifies the dialogue session between the smart speaker 10 and that user, and registers them in the voice information holding unit 406.

The session management unit 418 determines the state of the current dialogue session between the smart speaker 10 and the user based on the user ID and utterance content associated by the user authentication unit 416, and updates the session information holding unit 410 with the determined state. For example, when analysis of the utterance content indicates that the user is in the middle of a statement, the session management unit 418 sets the state of the current dialogue session to “speaking”. When the analysis indicates the end of a statement, it sets the state to “waiting for utterance”. When the analysis indicates the end of the dialogue session (for example, when the utterance is a phrase that signals the end of a dialogue session, such as “See you” or “Bye-bye”), it deletes all entries having the session ID of that dialogue session from the session information holding unit 410. The session management unit 418 may also delete from the session information holding unit 410 a dialogue session that has remained in the “waiting for utterance” state for a predetermined period.
Although the state of the dialogue session is determined here from the utterance content, it may instead, or in addition, be determined from image information captured by a separately provided camera. The camera may be built into the smart speaker 10 itself, or a stand-alone camera with a communication function, a television with a camera function, or a computer with a camera function may be used separately.
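The state decisions described for the session management unit 418 can be summarized in a short sketch. The farewell phrases, the 60-second timeout, and the function names `next_state` and `is_expired` are hypothetical choices for illustration, and the flag `utterance_complete` stands in for the result of analysing the utterance content, which the sketch does not implement.

```python
from typing import Optional

FAREWELLS = {"see you", "bye-bye"}  # illustrative end-of-session phrases
WAIT_TIMEOUT_SECONDS = 60.0         # hypothetical "predetermined period"


def next_state(utterance: str, utterance_complete: bool) -> Optional[str]:
    """Return the new session state, or None if the session should be deleted."""
    if utterance.strip().lower() in FAREWELLS:
        return None  # the utterance signals the end of the dialogue session
    if utterance_complete:
        return "waiting for utterance"
    return "speaking"


def is_expired(state: str, seconds_in_state: float) -> bool:
    """True when a waiting session has been idle for the timeout or longer."""
    return (state == "waiting for utterance"
            and seconds_in_state >= WAIT_TIMEOUT_SECONDS)
```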

When the utterance content derived by the speech recognition unit 414 includes a distribution request for audio content, the content acquisition unit 420 acquires the requested audio content from the audio content holding unit 402. For example, when the utterance content is a distribution request for music content such as “I want to listen to some song by Taro Komura”, the content acquisition unit 420 acquires the requested music content from the audio content holding unit 402. Alternatively, the content acquisition unit 420 may access a server of a music distribution service and acquire the requested music content from that server together with its metadata. In this case, the content acquisition unit 420 may register the acquired music content and metadata in the audio content holding unit 402.

When the utterance content is a distribution request for information content such as “Tell me today's news” or “What's the weather tonight?”, the content acquisition unit 420 acquires the requested information content from the audio content holding unit 402. Alternatively, the content acquisition unit 420 may access a server of an information distribution service and acquire the requested information content from that server in text format together with its metadata. In this case, the content acquisition unit 420 may convert the acquired text-format information content into audio data by applying a predetermined speech synthesis process. The content acquisition unit 420 may register the information content converted into audio data, together with the metadata, in the audio content holding unit 402. The speech synthesis process may be realized using a known speech synthesis technique.

When the utterance content is a distribution request for a search result such as “Tell me about Izumo Taisha”, the content acquisition unit 420 acquires the requested search result from the audio content holding unit 402. Alternatively, the content acquisition unit 420 may access a server of a search service and acquire the requested search result from that server in text format together with its metadata. In this case, the content acquisition unit 420 may convert the acquired text-format search result into audio data by applying a predetermined speech synthesis process. The content acquisition unit 420 may register the search result converted into audio data, together with the metadata, in the audio content holding unit 402.

The advertisement selection unit 422 selects, from the voice advertisement holding unit 404, a voice advertisement to be attached to the audio content acquired by the content acquisition unit 420. The advertisement selection unit 422 selects the voice advertisement according to any one, or any combination, of the following criteria: (1) relevance to the audio content acquired by the content acquisition unit 420; (2) relevance to the user's utterance content in the current conversation session between the smart speaker 10 and the user, as held in the voice information holding unit 406; and (3) relevance to the attributes of the authenticated user's account.

For example, regarding criterion (1), in response to the search result distribution request “Tell me about Izumo Taisha”, the content acquisition unit 420 acquires the audio data of the audio content “Izumo Taisha, in ancient times, ...”. The advertisement selection unit 422 acquires, from the audio content holding unit 402, the keywords “Izumo Taisha, god, matchmaking” corresponding to that audio content. The advertisement selection unit 422 then refers to the voice advertisement holding unit 404 and selects the audio data of the voice advertisement “Want to travel to Izumo Taisha? Then consult ABC Travel Agency”, which has the keywords “Izumo Taisha, matchmaking” corresponding to the acquired keywords. In this way, by comparing the keywords of the audio content with the keywords of the voice advertisements, the advertisement selection unit 422 selects a voice advertisement that corresponds to the audio content acquired by the content acquisition unit 420.
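The keyword comparison described above can be sketched as follows. The description only says that keywords “correspond”; scoring by the number of shared keywords and returning the best-scoring advertisement is an assumption made for this example, as are the function name `select_ad_by_keywords` and the advertisement IDs.

```python
from typing import Dict, Optional, Set

# Hypothetical contents of the voice advertisement holding unit 404:
# advertisement ID -> keywords registered for that advertisement.
ADS: Dict[str, Set[str]] = {
    "ad-travel": {"Izumo Taisha", "matchmaking"},
    "ad-umbrella": {"umbrella", "rain"},
    "ad-taxi": {"taxi", "dispatch"},
}


def select_ad_by_keywords(
    source_keywords: Set[str], ads: Dict[str, Set[str]]
) -> Optional[str]:
    """Return the ID of the advertisement sharing the most keywords, if any."""
    best_id, best_overlap = None, 0
    for ad_id, ad_keywords in ads.items():
        overlap = len(source_keywords & ad_keywords)
        if overlap > best_overlap:
            best_id, best_overlap = ad_id, overlap
    return best_id
```

The same function would serve criterion (2) if `source_keywords` were extracted from the user's earlier utterances rather than from the acquired audio content.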

For example, regarding criterion (2), suppose that the following dialogue takes place between the smart speaker 10 and the user:
(User) “Can I make it to the station in time by taxi?”
(Smart speaker 10) “Yes, you can make it.”
(User) “What's the weather tonight?”
In response to the information content distribution request “What's the weather tonight?”, the content acquisition unit 420 acquires the audio data of the audio content “Tonight's weather in the C region is showers, and the temperature is ...”. The advertisement selection unit 422 refers to the voice information holding unit 406 and identifies “Can I make it to the station in time by taxi?” as the user's utterance content in the current conversation session between the smart speaker 10 and the user. The advertisement selection unit 422 extracts the keywords “station, taxi” from the identified utterance content. The advertisement selection unit 422 then refers to the voice advertisement holding unit 404 and selects the audio data of the voice advertisement “ZZZ taxi dispatch service, arriving right away”, which has the keywords “taxi, dispatch” corresponding to the extracted keywords “station, taxi”. In this way, by referring to the voice information holding unit 406, the advertisement selection unit 422 can select a voice advertisement based on the user's utterance content in the current conversation session between the smart speaker 10 and the user.

Note that when criterion (1) is used instead of criterion (2) in the above example, the advertisement selection unit 422 acquires, from the audio content holding unit 402, the keywords “C region, rain, low temperature” corresponding to the audio content “Tonight's weather in the C region is showers, and the temperature is ...” acquired by the content acquisition unit 420. The advertisement selection unit 422 refers to the voice advertisement holding unit 404 and selects the audio data of the voice advertisement “CB Company's Hyper Umbrella won't break for 10 years!”, which has the keywords “umbrella, rain” corresponding to the acquired keywords “C region, rain, low temperature”. Thus, even when the content of the dialogue between the smart speaker 10 and the user is the same, the selected voice advertisement may differ depending on the criterion used.

For example, regarding criterion (3), in response to the information content distribution request “Tell me today's news”, the content acquisition unit 420 acquires the audio data of the audio content “Around 6 o'clock this morning, there was a fire in B City, A Prefecture, ...”. In addition, the user who uttered “Tell me today's news” is authenticated by voiceprint authentication in the user authentication unit 416, and that user's user ID “B102” is identified. The advertisement selection unit 422 acquires the attributes “child, male, single” of the account corresponding to the identified user ID “B102” from the user information holding unit 408. The advertisement selection unit 422 refers to the voice advertisement holding unit 404 and selects the audio data of the voice advertisement “Come to F City and you can ride a steam locomotive (SL)”, which has the attributes “child, male” corresponding to the acquired attributes “child, male, single”. If instead the attributes of the account corresponding to the identified user ID “B102” were “adult, female, single”, the advertisement selection unit 422 would refer to the voice advertisement holding unit 404 and select the audio data of the voice advertisement “Want to travel to Izumo Taisha? Then consult ABC Travel Agency”, which has the attributes “single, adult” corresponding to those attributes. In this way, by comparing the attributes of the authenticated user's account with the attributes of the voice advertisements, the advertisement selection unit 422 selects a voice advertisement that corresponds to the attributes of the authenticated user's account.

Alternatively, if the attributes of the account corresponding to the identified user ID “B102” were “adult, male, married”, the advertisement selection unit 422 would first select, as candidates, the four voice advertisements corresponding to those attributes: “For a Christmas present, we recommend a ring from XX Precious Metals”, “For fire insurance, leave it to XYZ Fire and Marine Insurance”, “ZZZ taxi dispatch service, arriving right away”, and “Want to travel to Izumo Taisha? Then consult ABC Travel Agency”. The advertisement selection unit 422 further acquires, from the audio content holding unit 402, the keywords “A Prefecture, B City, fire” corresponding to the audio content “Around 6 o'clock this morning, there was a fire in B City, A Prefecture, ...” acquired by the content acquisition unit 420. From the four selected candidates, the advertisement selection unit 422 selects the audio data of the voice advertisement “For fire insurance, leave it to XYZ Fire and Marine Insurance”, which has the keywords “fire, fire damage, insurance” corresponding to the acquired keywords “A Prefecture, B City, fire”. In this way, criteria (1) and (3) can also be combined by selecting candidates under criterion (3) and narrowing them down under criterion (1).

For example, suppose that the following dialogue takes place between the smart speaker 10 and the user:
(User) “I want to listen to some song by Taro Komura”
(Smart speaker 10) “How about a Christmas song?”
(User) “OK, that one then”
In response to the music content distribution request “I want to listen to some song by Taro Komura”, the content acquisition unit 420 acquires the audio data of a Christmas song by Taro Komura. The management server 4 asks the user, via the smart speaker 10, whether the Christmas song is acceptable. On receiving the user's affirmative response “OK, that one then”, the management server 4 attaches a voice advertisement to the acquired audio data of Taro Komura's Christmas song and transmits it to the smart speaker 10. Here, the advertisement selection unit 422 acquires, from the audio content holding unit 402, the keywords “Taro Komura (lyrics and composition), Otsu Anime (theme song), Hei Movie (insert song), Christmas song, ring” corresponding to Taro Komura's Christmas song acquired by the content acquisition unit 420. The advertisement selection unit 422 refers to the voice advertisement holding unit 404 and selects, as candidates, two voice advertisements having keywords that correspond to the acquired keywords: “Otsu Anime, airing Fridays from 6 p.m.!” (keywords: “Otsu Anime, Friday, 6 p.m.”) and “For a Christmas present, we recommend a ring from XX Precious Metals” (keywords: “Christmas, present, ring”). The advertisement selection unit 422 further acquires, from the user information holding unit 408, the attributes “adult, male, married” of the account corresponding to the user ID “A101” of the user authenticated by voiceprint authentication. From the two selected candidates, the advertisement selection unit 422 selects the audio data of the voice advertisement “For a Christmas present, we recommend a ring from XX Precious Metals”, which has the attribute “adult” corresponding to the acquired attributes “adult, male, married”. If instead the user ID of the user authenticated by voiceprint authentication were “A105”, the advertisement selection unit 422 would select, from the two candidates, the audio data of the voice advertisement “Otsu Anime, airing Fridays from 6 p.m.!”, which has the attributes “female, child” corresponding to the acquired attributes “child, female, single”. In this way, criteria (1) and (3) can also be combined by selecting candidates under criterion (1) and narrowing them down under criterion (3).
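The two-stage combination of criteria (1) and (3) can be illustrated with the following sketch. The `Ad` record and both function names are hypothetical. Treating an advertisement as matching when all of its attributes appear among the user's attributes is an assumption; it is consistent with the examples in the description but is not stated there as a rule.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Ad:
    ad_id: str
    keywords: FrozenSet[str]
    attributes: FrozenSet[str]


def candidates_by_keywords(content_keywords: FrozenSet[str], ads: List[Ad]) -> List[Ad]:
    """Criterion (1): keep advertisements sharing at least one keyword."""
    return [ad for ad in ads if ad.keywords & content_keywords]


def narrow_by_attributes(user_attributes: FrozenSet[str], candidates: List[Ad]) -> Optional[Ad]:
    """Criterion (3): first candidate whose attributes are all held by the user."""
    for ad in candidates:
        if ad.attributes and ad.attributes <= user_attributes:
            return ad
    return None


ADS = [
    Ad("anime", frozenset({"Otsu Anime", "Friday", "6 p.m."}), frozenset({"female", "child"})),
    Ad("ring", frozenset({"Christmas", "present", "ring"}), frozenset({"adult"})),
    Ad("taxi", frozenset({"taxi", "dispatch"}), frozenset({"adult"})),
]
SONG_KEYWORDS = frozenset({"Taro Komura", "Otsu Anime", "Hei Movie", "Christmas song", "ring"})
```

Swapping the order of the two functions gives the other combination described earlier, in which candidates are chosen by attributes and then narrowed by keywords.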

A combination of criteria (1) and (2), a combination of criteria (2) and (3), and a combination of all three criteria (1), (2), and (3) are also possible.
Alternatively, in addition to criteria (1), (2), and (3), a voice advertisement may be selected based on sounds around the smart speaker 10 or on conversations between users picked up by the smart speaker 10. For example, when the smart speaker 10 has picked up a conversation between the user and another user such as “We're out of tissue paper” and “Right, shall we order some from an e-commerce site?”, the advertisement selection unit 422 may extract the keyword “tissue paper” from the conversation and select a voice advertisement that promotes tissue paper. As another example, when the smart speaker 10 has picked up the barking of a dog or the meowing of a cat, the advertisement selection unit 422 may identify the keywords “dog, cat” from those sounds and select a voice advertisement that promotes dog food or cat food related to the identified keywords.

The advertisement adjustment unit 424 determines whether to adjust the length of the voice advertisement selected by the advertisement selection unit 422. When it determines that adjustment is needed, the advertisement adjustment unit 424 extracts a portion (for example, a relatively important portion) of the selected voice advertisement by applying a predetermined extraction algorithm to it. The advertisement adjustment unit 424 may determine whether to adjust the length of the voice advertisement based on the audio content acquired by the content acquisition unit 420 and/or the state of the current conversation session between the smart speaker 10 and the user.

For example, the advertisement adjustment unit 424 may refer to the session information holding unit 410 and determine that adjustment is needed when the state of the session corresponding to the selected voice advertisement is “in conversation”, and that adjustment is not needed when the state is “waiting for utterance”. Alternatively, the advertisement adjustment unit 424 may determine the length of the voice advertisement based on the audio content acquired by the content acquisition unit 420. For example, the advertisement adjustment unit 424 may set the length of the voice advertisement according to the playback time of the audio content. For relatively long audio content, the voice advertisement may be played multiple times. As a further example, when the audio content is information content such as news or a weather forecast, the user is likely to want the desired information quickly, so the advertisement adjustment unit 424 may determine that adjustment is needed.
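A minimal sketch of this decision is given below. The description presents the session-state rule and the content-type rule as alternatives; combining them with a logical OR, and the content type labels `"news"` and `"weather"`, are assumptions made only for illustration.

```python
# Assumed labels for information content for which the user likely wants a quick answer.
QUICK_CONTENT_TYPES = {"news", "weather"}


def should_shorten_ad(session_state: str, content_type: str) -> bool:
    """Decide whether the selected voice advertisement should be shortened."""
    if session_state == "in conversation":
        # The user is mid-dialogue, so a long advertisement would be intrusive.
        return True
    if content_type in QUICK_CONTENT_TYPES:
        # The user probably wants the requested information without delay.
        return True
    return False
```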

Techniques are known for extracting, as important portions of a voice advertisement, portions in which a person is speaking, portions in which a person is speaking loudly, portions in which the background sound gradually grows louder, and the like; such techniques are used in hard disk recorders and similar devices. The predetermined extraction algorithm may be built using this known technique. Note that the advertisement adjustment unit 424 may adjust the volume of the voice advertisement in addition to, or instead of, its length.

The transmission information generation unit 426 generates a single piece of transmission information by combining the audio content acquired by the content acquisition unit 420 with the voice advertisement selected by the advertisement selection unit 422. When the length of the voice advertisement has been adjusted by the advertisement adjustment unit 424, the length-adjusted voice advertisement is used in place of the voice advertisement selected by the advertisement selection unit 422. The transmission information generation unit 426 structures the transmission information so that, when it is received and played back by the smart speaker 10, the voice advertisement is played before the audio content. For example, when the transmission information includes a header, the audio content, and the voice advertisement, the transmission information generation unit 426 may generate the transmission information so that these are arranged in the order of header, voice advertisement, audio content.
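One possible layout that satisfies the header, voice advertisement, audio content ordering is sketched below. The description does not specify a wire format; the 4-byte length prefix, the JSON header, and its field names `ad_len` and `content_len` are all assumptions of this example.

```python
import json
import struct
from typing import Tuple


def build_transmission(ad: bytes, content: bytes) -> bytes:
    """Pack header, advertisement, and content in that order."""
    header = json.dumps({"ad_len": len(ad), "content_len": len(content)}).encode("utf-8")
    # A big-endian 4-byte length prefix tells the receiver where the header ends.
    return struct.pack(">I", len(header)) + header + ad + content


def parse_transmission(payload: bytes) -> Tuple[bytes, bytes]:
    """Recover (advertisement, content) in playback order."""
    (header_len,) = struct.unpack(">I", payload[:4])
    header = json.loads(payload[4:4 + header_len].decode("utf-8"))
    ad_start = 4 + header_len
    content_start = ad_start + header["ad_len"]
    ad = payload[ad_start:content_start]
    content = payload[content_start:content_start + header["content_len"]]
    return ad, content
```

Because the advertisement bytes precede the content bytes, a receiver that plays the payload sequentially would output the advertisement first.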

The transmission unit 428 transmits the transmission information generated by the transmission information generation unit 426 to the smart speaker 10 via the network 6. On receiving the transmission information via the network 6, the smart speaker 10 first plays the voice advertisement included in the transmission information and then plays the audio content included in it. Alternatively, the timing control unit 430 described later may control the audio output from the smart speaker 10 via the network 6. In this case, the timing control unit 430 causes the smart speaker 10 to first output the voice advertisement included in the transmission information and then output the audio content included in it. In either case, the voice advertisement and the audio content are played back consecutively; that is, no other audio is present between them. In particular, the voice advertisement is played immediately before the audio content.

Alternatively, the voice advertisement may be embedded partway through the audio content, or the voice advertisement may be output after the audio content has been output.

The timing control unit 430 communicates with the smart speaker 10 via the network 6 and controls the timing of audio output from the smart speaker 10. The timing control unit 430 refers to the session information holding unit 410 and determines whether a user is present around the smart speaker 10. When the session ID of a conversation session between the smart speaker 10 and the user is held in the session information holding unit 410, the timing control unit 430 determines that a user is present around the smart speaker 10, that is, it detects the presence of the user. When no such session ID is held in the session information holding unit 410, the timing control unit 430 determines that no user is present around the smart speaker 10.

The timing control unit 430 restricts the output of the voice advertisement from the smart speaker 10 when the presence of a user is not detected around the smart speaker 10, or when the state of the conversation session between the smart speaker 10 and the user held in the session information holding unit 410 is “speaking” or “in conversation”. When the presence of a user is detected, the timing control unit 430 permits the output of the voice advertisement from the smart speaker 10. When the state of the conversation session changes to “waiting for utterance”, the timing control unit 430 permits the output of the voice advertisement from the smart speaker 10.
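The gating rule above reduces to a small predicate, sketched here. Representing "no session ID held" as `None` is an assumption of this example, as is the function name.

```python
from typing import Optional

# Session states during which advertisement output is restricted.
BLOCKING_STATES = {"speaking", "in conversation"}


def ad_output_allowed(session_state: Optional[str]) -> bool:
    """Return True when the voice advertisement may be output.

    session_state is None when no session is held for the smart speaker,
    which is taken to mean that no user is present.
    """
    if session_state is None:
        return False
    return session_state not in BLOCKING_STATES
```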

The timing control unit 430 controls the output timing of the voice advertisement so that the output of the TV 12 associated with the smart speaker 10 and the voice advertisement output from the smart speaker 10 work together. For example, the timing control unit 430 controls the smart speaker 10 so that, at the moment the advertisement “To be continued on the speaker!” finishes playing on the TV 12, the smart speaker 10 starts outputting the voice advertisement “The product just introduced on TV is ...”. In this case, the timing control unit 430 acquires, from the TV 12 via the network 6, the number of the channel currently being shown. The timing control unit 430 acquires the broadcast schedule in advance from a server of another service. From the acquired channel number and the broadcast schedule, the timing control unit 430 can identify the content, start timing, and end timing of the advertisement played on the TV 12. The timing control unit 430 selects a voice advertisement related to the identified content from the voice advertisement holding unit 404 and transmits it to the smart speaker 10 (with or without accompanying audio content). The timing control unit 430 controls the smart speaker 10 so that output of the transmitted voice advertisement starts at the end timing of the advertisement played on the TV 12.
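The schedule lookup described above might be sketched as follows. The `TvAdSlot` record, the `topic` field used to relate a TV advertisement to a voice advertisement, and the use of plain numeric timestamps are assumptions of this example; the description does not define the format of the broadcast schedule.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple


@dataclass(frozen=True)
class TvAdSlot:
    channel: int
    start: float  # start time of the TV advertisement
    end: float    # end time of the TV advertisement
    topic: str    # what the TV advertisement is about


def find_current_tv_ad(schedule: Sequence[TvAdSlot], channel: int, now: float) -> Optional[TvAdSlot]:
    """Find the TV advertisement airing on the given channel at time now."""
    for slot in schedule:
        if slot.channel == channel and slot.start <= now < slot.end:
            return slot
    return None


def plan_speaker_ad(
    schedule: Sequence[TvAdSlot], channel: int, now: float, ads_by_topic: Dict[str, str]
) -> Optional[Tuple[str, float]]:
    """Return (voice advertisement ID, time to start output), or None."""
    slot = find_current_tv_ad(schedule, channel, now)
    if slot is None or slot.topic not in ads_by_topic:
        return None
    # Output on the smart speaker starts when the TV advertisement ends.
    return ads_by_topic[slot.topic], slot.end
```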

The attribute update unit 432 updates the attributes of the user's account with the voice information collected by the smart speaker 10. For example, when the account ID shown in the user information holding unit 408 of FIG. 8 belongs to an account of a search service, this account ID is assigned to a user who visits the search service's site. The attributes of the user's account are derived from information on what the user searches for and are registered as the attributes in the user information holding unit 408 of FIG. 8.

Meanwhile, the management server 4 can learn the user's preferences by analyzing the content of the dialogue between the smart speaker 10 and the user and the content of conversations between the user and other users. The attribute update unit 432 updates the attributes in the user information holding unit 408 with the preferences learned through the smart speaker 10 in this way. The attribute update unit 432 may provide the updated content to the server of the search service in association with the account ID. This allows the search service to obtain user preferences that it could not previously obtain on its own, and to use them to optimize the advertisements output in the browser. Note that the preferences and attributes obtained via the smart speaker 10 and those learned by the search service may be held so that the two can be distinguished. This makes it possible to select a voice advertisement using only one of the two sets of preferences and attributes.

In addition, when a voice advertisement has been output from the smart speaker 10, the attribute update unit 432 determines whether a direct or indirect expression of lack of interest has been received from the user via the smart speaker 10. If so, the attribute update unit 432 registers the advertisement ID of the voice advertisement that was output in the user information holding unit 408 as an NG advertisement ID, in association with the user's user ID. When selecting a voice advertisement, the advertisement selection unit 422 refers to the user information holding unit 408 and excludes from selection any voice advertisement identified by an NG advertisement ID held for the authenticated user's user ID. The advertisement selection unit 422 may also exclude from selection any successor to a voice advertisement identified by an NG advertisement ID. The advertisement selection unit 422 may further exclude from selection any voice advertisement whose attributes or keywords correspond to, or are the same as, those of a voice advertisement identified by an NG advertisement ID.
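The exclusion of NG advertisements can be sketched as a filter applied before selection. The function name `eligible_ads` and the `exclude_similar` flag are hypothetical, and treating "corresponding keywords" as "sharing at least one keyword" is an assumption.

```python
from typing import Dict, List, Set


def eligible_ads(
    ads: Dict[str, Set[str]], ng_ad_ids: Set[str], exclude_similar: bool = False
) -> List[str]:
    """Return IDs of advertisements that remain selectable for a user.

    ads maps advertisement ID to its keywords; ng_ad_ids holds the NG
    advertisement IDs registered for the user.
    """
    blocked_keywords: Set[str] = set()
    if exclude_similar:
        # Optionally also block advertisements sharing a keyword with an NG one.
        for ad_id in ng_ad_ids:
            blocked_keywords |= ads.get(ad_id, set())
    result = []
    for ad_id, keywords in ads.items():
        if ad_id in ng_ad_ids:
            continue
        if keywords & blocked_keywords:
            continue
        result.append(ad_id)
    return result
```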

The operation of the management server 4 having the above configuration will now be described.
FIG. 10 is a flowchart showing a series of processes in the management server 4 of FIG. 1. The management server 4 receives, from the smart speaker 10 via the network 6, an audio signal representing a distribution request for audio content (S302). The management server 4 performs voice recognition processing on the received audio signal (S304), thereby identifying the requested audio content. The management server 4 acquires the requested audio content from the audio content holding unit 402 or from an external source (S306). The management server 4 selects a voice advertisement from the voice advertisement holding unit 404 (S308). The management server 4 determines whether the length of the selected voice advertisement needs to be adjusted (S310). If adjustment is determined to be necessary (YES in S310), the management server 4 adjusts the length of the voice advertisement (S312). The management server 4 generates transmission information based on the audio content acquired in step S306 and either the voice advertisement selected in step S308 (in the case of NO in step S310) or the voice advertisement whose length was adjusted in step S312 (in the case of YES in step S310) (S314). The management server 4 transmits the generated transmission information to the smart speaker 10 via the network 6 (S316). The management server 4 determines whether the present time is a suitable timing for outputting the voice advertisement (S318). If the timing is suitable (YES in S318), the management server 4 first causes the smart speaker 10 to output the voice advertisement (S320) and then causes it to output the audio content (S322).
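Steps S310 to S322 of the flow above can be sketched as follows. This is an illustration under assumptions, not the implementation of the specification: the dictionary layout, the `max_ad_seconds` criterion for deciding that adjustment is needed, and the function names are all hypothetical, and the length adjustment itself is reduced to a placeholder.

```python
def build_transmission(content, ad, max_ad_seconds):
    """S310 to S314: adjust the ad length if needed, then bundle ad and content."""
    # S310: assumed criterion, the ad is longer than an allowed maximum.
    if ad["seconds"] > max_ad_seconds:
        # S312: placeholder for the actual length adjustment.
        ad = {**ad, "seconds": max_ad_seconds, "adjusted": True}
    # S314: the transmission information carries both items together.
    return {"content": content, "ad": ad}


def playback_order(transmission, timing_ok):
    """S318 to S322: when the timing is suitable, the ad plays before the content."""
    if not timing_ok:
        # The NO branch of S318 is not described in this passage.
        return None
    return [transmission["ad"]["id"], transmission["content"]["id"]]
```

Returning the order as a list keeps the ad-before-content rule of S320 and S322 explicit and easy to check.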

In the embodiment described above, examples of the holding units are a hard disk and a semiconductor memory. A person skilled in the art who has read this specification will understand that, based on the description herein, each unit can be realized by a CPU (not shown), a module of an installed application program, a module of a system program, a semiconductor memory that temporarily stores data read from a hard disk, or the like.

According to the management server 4 of the present embodiment, a voice advertisement is played in conjunction with the playback of audio content on the smart speaker 10. This makes it possible to provide voice advertisements in step with the distribution of audio content. In the present embodiment, the voice advertisement is played before the audio content. The likelihood that the user will listen to the voice advertisement in this case is higher than when the voice advertisement is played after the audio content. More effective advertising can therefore be provided.

Further, in the management server 4 of the present embodiment, the voice advertisement is selected based on the substance of the audio content, audio information obtained via the smart speaker 10, and attributes of the authenticated user's account. A voice advertisement selected in this way is highly likely to match the user's preferences and needs. A voice advertisement with greater appeal to the user can therefore be provided.

Further, in the management server 4 of the present embodiment, the timing at which the voice advertisement is output from the smart speaker 10 is controlled as appropriate. A voice advertisement can therefore be output without interrupting the user's conversation or speech. Alternatively, voice advertisements can be provided in coordination with other electronic devices such as the TV 12.

In the present embodiment, in connection with the ability of the smart speaker 10 to pick up ambient sounds and conversations between users, the management server 4 may determine whether child abuse is taking place by understanding the audio context. The management server 4 may edit the audio data relating to the child abuse and provide it to a predetermined investigative agency. The investigative agency can listen to the entire audio data.

In the present embodiment, a user ID is identified through voiceprint authentication by the user authentication unit 416, and the attributes corresponding to that user ID are identified by referring to the user information holding unit 408. In this case, the management server 4 may change the manner in which audio content or a voice advertisement is output according to the attributes of the user. For example, when the user's attribute is "child", the management server 4 may use words that are as simple as possible in the audio content or voice advertisement, and may delete or rephrase vulgar language. Alternatively, when the user's attribute is "elderly", the management server 4 may increase the volume or make the pronunciation clearer in the audio content or voice advertisement.

In the present embodiment, the case where the user is authenticated by the user authentication unit 416 has been described; however, the present invention is not limited to this, and user authentication may be omitted. In that case, the attributes of the user are not reflected in the selection of the voice advertisement. Also, while the present embodiment has described the operation of outputting the voice advertisement in S320, the management server 4 may additionally store the output status of that voice advertisement. Examples of the output status include "the target advertisement was played to the end", "the target advertisement was stopped partway", and "in addition to playing the target advertisement, additional information on the advertised product was output". The status "the target advertisement was played to the end" can be obtained by having the smart speaker 10 report to the management server 4 when it has output the target audio data to the end. Stopping partway and outputting additional information are both controlled by the management server 4, so the management server 4 can track them as a matter of course.
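As a rough illustration of storing the output status on the server side, the three example statuses can be modeled as an enumeration and tallied per advertisement. The class names and the in-memory counter are assumptions made for this sketch; the specification does not say how the status is stored.

```python
from collections import Counter
from enum import Enum


class AdOutputStatus(Enum):
    PLAYED_TO_END = "played_to_end"                     # reported by the smart speaker
    STOPPED_MIDWAY = "stopped_midway"                   # known to the server, which controls it
    ADDITIONAL_INFO_OUTPUT = "additional_info_output"   # likewise controlled by the server


class AdOutputLog:
    """Hypothetical in-memory tally of output statuses per advertisement ID."""

    def __init__(self):
        self._counts = Counter()

    def record(self, ad_id, status):
        self._counts[(ad_id, status)] += 1

    def count(self, ad_id, status):
        return self._counts[(ad_id, status)]
```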

The technical idea of the present embodiment may be expressed by the following items.
(Item 1)
A computer program for causing a server to realize:
a function of receiving a distribution request via a network from a speaker having a microphone and a communication function;
a function of acquiring audio content without an image in response to the received distribution request;
a function of selecting a voice advertisement without an image from voice advertisement holding means; and
a function of transmitting the acquired audio content together with the selected voice advertisement to the speaker via the network.
(Item 2)
A method comprising:
receiving a distribution request via a network from a speaker having a microphone and a communication function;
acquiring audio content without an image in response to the received distribution request;
selecting a voice advertisement without an image from voice advertisement holding means; and
transmitting the acquired audio content together with the selected voice advertisement to the speaker via the network.

(Second Embodiment)
In the second embodiment, a plurality of smart speakers are arranged at different positions within a real space, and each of them is connected via a network to a management server similar to the management server 4 of the first embodiment.

FIG. 11 is a schematic top view of a room 202 of a user 204. Arranged in the room 202 are a fixed first smart speaker 208, a fixed second smart speaker 210, a fixed third smart speaker 212, a fixed fourth smart speaker 214, a movable fifth smart speaker 216, and a TV 206. Each smart speaker communicates with the management server via the network. Although FIG. 11 shows five smart speakers, the number of smart speakers is not limited. Each smart speaker may be installed on a wall, the floor, or the ceiling of the room 202.

(1) Automatic determination of smart speaker positions
The management server records and manages the position of each smart speaker in the room 202. These positions may be registered with the management server by the user 204. Alternatively, the management server may determine the position of each smart speaker automatically, using the microphones and speakers of the five smart speakers.

The management server determines the relative positions of the smart speakers by having other smart speakers detect the sound output by a given smart speaker. For example, when the positions of the second smart speaker 210, the third smart speaker 212, and the fourth smart speaker 214 are known and the position of the first smart speaker 208 is to be determined, the management server causes the speaker of the first smart speaker 208 to output a sound pulse of a predetermined wavelength. The management server acquires, from each of the second smart speaker 210, the third smart speaker 212, and the fourth smart speaker 214, the time at which that smart speaker received the pulse. The management server calculates the propagation time of the pulse from the acquired times, and calculates each distance from the calculated propagation time and the speed of sound. The management server then calculates the position of the first smart speaker 208 from the calculated distances and the known positions of the second, third, and fourth smart speakers 210, 212, and 214.
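The position calculation described above amounts to trilateration. The following is a minimal two-dimensional sketch under stated assumptions: the emission time of the pulse is known, the clocks of the smart speakers are synchronized, the three known speakers are not collinear, and measurement noise is ignored. The function names and the constant for the speed of sound are illustrative and do not appear in the specification.

```python
SPEED_OF_SOUND_M_S = 343.0  # approximate value for air at about 20 degrees Celsius


def distances_from_times(emit_time, receive_times, speed=SPEED_OF_SOUND_M_S):
    """Distance = propagation time x speed of sound, for each receiving speaker."""
    return [(t - emit_time) * speed for t in receive_times]


def trilaterate_2d(anchors, distances):
    """Locate a point from its distances to three known, non-collinear positions."""
    (x1, y1), (x2, y2), (x3, y3) = anchors
    d1, d2, d3 = distances
    # Subtracting the circle equation of anchor 1 from those of anchors 2 and 3
    # removes the quadratic terms and leaves two linear equations in (x, y).
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2
    b2 = d1**2 - d3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("the known positions are collinear; the position is ambiguous")
    # Solve the 2x2 system by Cramer's rule.
    x = (b1 * a22 - a12 * b2) / det
    y = (a11 * b2 - a21 * b1) / det
    return x, y
```

With noisy timestamps or more than three known speakers, a least-squares fit would be a more robust choice than this exact solution.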

The fifth smart speaker 216 is, for example, a smart speaker mounted on a robot, and can move by itself. When the positions of the first, second, third, and fourth smart speakers 208, 210, 212, and 214 are known, the management server can track the position of the fifth smart speaker 216 by the position calculation process described above. In addition, when the positions of other smart speakers are to be determined relative to its own position, the fifth smart speaker 216 may move to a position where it can easily receive the sound emitted by the smart speaker in question.

The components of the system shown in FIG. 11 are preferably smart speakers or smartphones equipped with both a microphone and a speaker. However, televisions, radios, and other electrical devices that generally have only a speaker can still assist with audio playback, precisely because they have a speaker. Some electrical devices also have a communication function. This enables speaker output from a plurality of positions. The position at which an electrical device is placed may be notified by the user setting it in the management server, or the management server may determine it automatically, either by the position calculation process described above or by means of the radio waves of wireless communication.

(2) Uses of the movable smart speaker
In audio output, how the sound is heard by the target user may vary with the position of the speaker. The management server may therefore perform control to move the fifth smart speaker 216 to a position from which the sound can be heard more suitably.

In addition, the TV 206 and similar devices generally have no microphone function and therefore cannot, as they are, take part in the position calculation process described above. However, if the fifth smart speaker 216 moves to the position of the TV 206 and acts in place of a microphone for the TV 206, the TV 206 can also take part in the position calculation process.

(3) Audio output according to the position of the user
When the position of each smart speaker is known, the smart speaker closest to the user 204 can be identified once the position of the user 204 is known. When the user 204 utters an audio output request such as "Turn on the TV", the five smart speakers convert the utterance into audio signals and transmit them to the management server. The management server performs voice recognition processing on the audio signals and understands the audio output request of the user 204. The management server identifies the smart speaker corresponding to the position of the user 204 in the room 202. Specifically, the management server identifies the second smart speaker 210, which is closest to the position of the user 204. To do so, the management server compares the volumes at which the microphones of the smart speakers received the user's utterance, and identifies the second smart speaker 210, for which the volume is greatest, as the smart speaker closest to the position of the user 204. Alternatively, when the user 204 is using a smartphone, the management server can identify the position of the user 204 by acquiring the current position of the smartphone.
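The volume comparison can be sketched in a few lines. The function name and the mapping from speaker identifiers to received volume levels are assumptions made for illustration; the specification does not say in what unit the volume is measured.

```python
def nearest_speaker(volume_by_speaker):
    """Return the ID of the speaker whose microphone picked up the utterance loudest."""
    if not volume_by_speaker:
        raise ValueError("no speaker reported a volume for this utterance")
    # The loudest reception is taken as a proxy for the shortest distance.
    return max(volume_by_speaker, key=volume_by_speaker.get)
```

For example, `nearest_speaker({"208": 0.21, "210": 0.87, "212": 0.35})` selects `"210"`. Loudness is only a proxy for distance, since obstacles and microphone sensitivity also affect it.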

The management server transmits the audio accompanying the video shown on the TV 206 to the second smart speaker 210 identified as described above. In this way, the user 204 can hear the audio of the TV 206 from the second smart speaker 210, which is the closest to the user.

Alternatively, when the speaker of a smart speaker is directional, the management server may control its directivity. For example, when the speakers of the first, second, third, and fourth smart speakers 208, 210, 212, and 214 have directional output, the management server controls each smart speaker so that the audio output of its speaker is directed toward the position of the user 204. The management server transmits the audio accompanying the video shown on the TV 206 to each smart speaker. In this case, each smart speaker outputs the audio toward the user 204.
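One way to picture the directivity control is as a bearing computed from the known position of a smart speaker to the position of the user. The sketch below assumes two-dimensional room coordinates and an angle convention (degrees counterclockwise from the positive x axis) that the specification does not define.

```python
import math


def steering_angle_deg(speaker_pos, user_pos):
    """Bearing from a speaker to the user, in degrees counterclockwise from the +x axis."""
    dx = user_pos[0] - speaker_pos[0]
    dy = user_pos[1] - speaker_pos[1]
    if dx == 0 and dy == 0:
        raise ValueError("the speaker and the user are at the same position")
    return math.degrees(math.atan2(dy, dx))
```

The server would compute one such angle per directional smart speaker and send it as part of the control signal.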

A directional smart speaker may indicate the current direction of its audio output to the user in a visible manner. For example, an arrow indicating the direction may be displayed on the top surface of the smart speaker by LEDs or the like.

According to this example, when a user who has instructed a smart speaker to play content is in the room 202 together with other users, the audio can be output specifically toward the user who gave the instruction. By assigning a server process to each user, the first smart speaker may be controlled to face the position of a first user and the second smart speaker to face the position of a second user, so that the first and second smart speakers output audio simultaneously. Alternatively, by providing a single smart speaker with a plurality of audio output devices and assigning a server process to each user on the server, that single smart speaker may be controlled to output audio directed toward the first user and audio directed toward the second user simultaneously.

The technical idea of the present embodiment may be expressed by the following items.
(Item 3)
A system comprising a plurality of speakers each having a microphone and a communication function,
the system being configured to determine the relative positions of the speakers by having another speaker detect the sound output by a given speaker.
(Item 4)
A server that communicates via a network with a plurality of speakers each having a microphone and a communication function, the plurality of speakers being arranged at different positions in the same real space,
the server comprising:
means for receiving an audio output request from a user in the real space via any of the plurality of speakers;
means for identifying the speaker corresponding to the position of the user; and
means for transmitting audio content to the identified speaker.
(Item 5)
A server that communicates via a network with a plurality of speakers each having a microphone and a communication function, the plurality of speakers being arranged at different positions in the same real space,
the server comprising:
means for receiving an audio output request from a user in the real space via any of the plurality of speakers; and
means for controlling the directivity of at least one of the plurality of speakers so that audio is output toward the position of the user.

(Third Embodiment)
FIG. 12 is a schematic diagram showing the configuration of a voice operation system 232 according to the third embodiment. The voice operation system 232 includes a management server 234, a smart speaker 240, a TV 242, and a smartphone 248. The management server 234, the smart speaker 240, the TV 242, and the smartphone 248 are communicably connected via a network 236 such as the Internet. The smart speaker 240 and the TV 242 are both installed in a room 244 of a user 238. The smart speaker 240 is configured to be capable of P2P communication 246 with the smartphone 248.

To operate the TV 242 by voice, the user 238 speaks a sentence expressing an operation instruction, such as "Turn on the TV", toward the smart speaker 240. The microphone of the smart speaker 240 converts the speech of the user 238 into an electrical signal, and the smart speaker 240 transmits the resulting electrical signal as an audio signal to the management server 234 via the network 236. By performing voice recognition processing on the received audio signal, the management server 234 understands that the user 238 is requesting that the TV 242 be turned on. The management server 234 generates an instruction signal for carrying out the requested operation, that is, for turning on the TV 242, and transmits it to the TV 242 via the network 236. On receiving the instruction signal via the network 236, the TV 242 transitions from the power-off state to the power-on state.

As described above, control and operation via the smart speaker 240 are basically performed by voice. However, some users 238 are reluctant to use voice control when other users are present in the room 244. Even when the user 238 is alone in the room 244, the user may wish to avoid voice control depending on what is being controlled. For such cases, the voice operation system 232 allows the user to control the system side of the smart speaker 240 (the management server 234) by text via the smartphone 248.

To enable operation from the smartphone 248, the smartphone 248 downloads and installs an application dedicated to the management server 234. When the application is launched on the smartphone 248, it acquires the URL of the management server 234 from the smart speaker 240 via the P2P communication 246 or a local network. The application establishes a connection with the management server 234 using the acquired URL. Once the connection between the smartphone 248 and the management server 234 is established, operations entered on the smartphone 248 are transmitted to the management server 234 over that connection. The management server 234 generates and transmits an instruction signal to carry out the received operation.

For example, on receiving the text string "Turn on the TV" from the smartphone 248, the management server 234 analyzes the received text string and understands that the user 238 is requesting that the TV 242 be turned on. The management server 234 generates an instruction signal for carrying out the requested operation, that is, for turning on the TV 242, and transmits it to the TV 242 via the network 236. On receiving the instruction signal via the network 236, the TV 242 transitions from the power-off state to the power-on state.
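The text path can be illustrated with a deliberately simple sketch in which a normalized text string is looked up in a rule table and mapped to an instruction signal. The rule table, the device label `"TV242"`, and the shape of the instruction signal are hypothetical; the specification does not describe how the text string is analyzed, and a real system would likely use more capable language understanding.

```python
# Hypothetical rule table: normalized command text -> (target device, action).
COMMAND_RULES = {
    "turn on the tv": ("TV242", "power_on"),
    "turn off the tv": ("TV242", "power_off"),
}


def interpret_command(text):
    """Map a text command to an instruction signal, or None if it is not understood."""
    # Normalize case, surrounding whitespace, and a trailing full stop.
    normalized = " ".join(text.lower().split()).rstrip(".")
    rule = COMMAND_RULES.get(normalized)
    if rule is None:
        return None
    device, action = rule
    return {"device": device, "action": action}
```

Because the voice path also ends in text after voice recognition, the same function could in principle serve both the smart speaker 240 and the smartphone 248.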

According to the voice operation system 232 of the present embodiment, the user 238 can choose between voice operation and operation via the smartphone 248 as the situation requires.

In the present embodiment, the case where the management server 234 receives an operation instruction from the user 238 via the smart speaker 240 or the smartphone 248 has been described. However, the present invention is not limited to this; for example, as in the first embodiment, the management server 234 may receive a distribution request for audio content from the user 238 via the smart speaker 240 or the smartphone 248.

The technical idea of the present embodiment may be expressed by the following items.
(Item 6)
A server that communicates via a network with a speaker having a microphone and a communication function, the server comprising:
means for processing a request received from a user via the microphone of the speaker; and
means for processing a request received from the user via another electronic device that communicates with the speaker.

(Fourth Embodiment)
The fourth embodiment relates to layering in the management server. In the present embodiment, a different server process (or server) is used for each speaking user. The server can recognize who is speaking through voiceprint authentication. Processing is performed by the server process (or server) that the user in question normally uses.

FIG. 13 is a schematic diagram showing the configuration of a voice operation system 252 according to the fourth embodiment. The voice operation system 252 includes a management server 254, a smart speaker 260, and a TV 262. The management server 254, the smart speaker 260, and the TV 262 are communicably connected via a network 256 such as the Internet. The smart speaker 260 and the TV 262 are both installed in a room 264, in which there are three users (a first user 266, a second user 268, and a third user 270).

To operate the TV 262 by voice, the first user 266 speaks a sentence expressing an operation instruction, such as "Turn on the TV", toward the smart speaker 260. The microphone of the smart speaker 260 converts the speech of the first user 266 into an electrical signal, and the smart speaker 260 transmits the resulting electrical signal as an audio signal to the management server 254 via the network 256. The management server 254 performs voiceprint authentication on the received audio signal and identifies the first user 266. The management server 254 selects the server process corresponding to the identified first user 266, and the selected server process handles the request from that point on. On receiving an instruction signal via the network 256 from the selected server process of the management server 254, the TV 262 transitions from the power-off state to the power-on state. In the management server 254, utterances by the second user 268 or the third user 270, who are different from the first user 266, are assigned server processes different from the one corresponding to the first user 266.

FIG. 14 is a block diagram showing the functions and configuration of the management server 254 of FIG. 13. Each block shown here can be realized in hardware by elements such as the CPU of a computer and by mechanical devices, and in software by a computer program or the like; what is drawn here are functional blocks realized by their cooperation. A person skilled in the art who has read this specification will therefore understand that these functional blocks can be realized in various forms by combinations of hardware and software.

The management server 254 includes a user information holding unit 272, an audio signal receiving unit 274, a user authentication unit 276, and a server process group 278. The server process group 278 includes a plurality of server processes SP1, SP2, SP3, and so on, each assigned to a specific user. The following describes the case where the server process differs for each user; in other embodiments, the server itself may differ for each user. If the servers differ, the server processes naturally differ as well.

FIG. 15 is a data structure diagram showing an example of the user information holding unit 272 of FIG. 14. The user information holding unit 272 holds, in association with one another, a user ID, the voiceprint data of the user, and the ID of the server process assigned to the user.

Returning to FIG. 14, the audio signal receiving unit 274 receives, from the smart speaker 260 via the network 256, an audio signal representing the utterance content of any one of the three users 266, 268, and 270.

The user authentication unit 276 extracts or acquires a voiceprint from the audio signal received by the audio signal receiving unit 274 and performs voiceprint authentication based on the extracted voiceprint. The user authentication unit 276 refers to the user information holding unit 272 and determines whether any of the voiceprints held there matches the extracted voiceprint. If there is a matching voiceprint, the user authentication unit 276 identifies the user ID and the server process ID corresponding to that voiceprint. If there is no matching voiceprint, the user authentication unit 276 generates an output indicating no match or an unknown user.

When the user authentication unit 276 identifies a server process ID, the server process having that ID among the server processes included in the server process group 278 is started. The started server process performs the subsequent processing of the audio signal received by the audio signal receiving unit 274.
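The flow from voiceprint matching to server process selection can be sketched as follows. The specification does not state how voiceprints are compared, so the cosine-similarity comparison, the threshold of 0.9, the feature vectors, and the names `REGISTERED`, `authenticate`, and `dispatch` are all assumptions made for illustration.

```python
import math
from typing import Callable, Dict, Optional, Sequence, Tuple

# Hypothetical registry: user ID -> (voiceprint vector, server process ID).
REGISTERED: Dict[str, Tuple[Tuple[float, ...], str]] = {
    "user266": ((1.0, 0.0, 0.0), "SP1"),
    "user268": ((0.0, 1.0, 0.0), "SP2"),
    "user270": ((0.0, 0.0, 1.0), "SP3"),
}


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def authenticate(voiceprint: Sequence[float],
                 threshold: float = 0.9) -> Optional[Tuple[str, str]]:
    """Return (user ID, server process ID) of the best match, or None."""
    best: Optional[Tuple[str, str]] = None
    best_score = threshold  # assumed cut-off for treating voiceprints as matching
    for user_id, (registered, sp_id) in REGISTERED.items():
        score = cosine_similarity(voiceprint, registered)
        if score >= best_score:
            best, best_score = (user_id, sp_id), score
    return best


def dispatch(voiceprint: Sequence[float],
             server_processes: Dict[str, Callable[[str], str]]) -> str:
    """Hand the request to the server process assigned to the identified user."""
    match = authenticate(voiceprint)
    if match is None:
        return "unknown user"  # corresponds to the no-match output
    user_id, sp_id = match
    return server_processes[sp_id](user_id)
```

Here each server process is modeled as a plain callable; in the described system it would be a separate process that takes over the remaining handling of the audio signal.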

Each server process included in the server process group 278 realizes the electronic device operation function described in the third embodiment. In other embodiments, a server process may instead realize, for example, the audio content distribution function described in the first embodiment.

For example, when the first user 266 is a resident of the room 264, the server process of the first user 266 is granted the authority to control the TV 262. When the second user 268 and the third user 270 are visitors to the room 264 of the first user 266, their server processes are not granted the authority to control the TV 262. Therefore, when the second user 268 or the third user 270 operates the TV 262 by voice, the server process of that user transmits an operation request for the TV 262 to the server process of the first user 266. The server process that receives the operation request asks the first user 266 whether the operation may be performed, and executes the requested operation if the first user 266 consents.
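The delegation described above, in which a visitor's server process asks the resident's server process to carry out an operation, can be sketched as follows. The class `ServerProcess`, its methods, and the `ask_owner` callback standing in for the inquiry to the first user 266 are hypothetical names introduced only for this illustration.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Callback signature assumed for illustration: (requester ID, device, operation) -> consent
ConsentFn = Callable[[str, str, str], bool]


class ServerProcess:
    """Hypothetical per-user server process with its own control authority."""

    def __init__(self, user_id: str, controllable: Iterable[str] = ()) -> None:
        self.user_id = user_id
        self.controllable = set(controllable)
        self.executed: List[Tuple[str, str]] = []

    def operate(self, device: str, operation: str,
                owner_process: Optional["ServerProcess"] = None,
                ask_owner: Optional[ConsentFn] = None) -> bool:
        if device in self.controllable:
            # This user holds the authority, so the operation runs directly.
            self.executed.append((device, operation))
            return True
        if owner_process is None or ask_owner is None:
            return False
        # No authority: send an operation request to the owner's server process.
        return owner_process.handle_request(self.user_id, device, operation, ask_owner)

    def handle_request(self, requester_id: str, device: str, operation: str,
                       ask_owner: ConsentFn) -> bool:
        if device not in self.controllable:
            return False
        # Execute only if the owner consents to the requested operation.
        if not ask_owner(requester_id, device, operation):
            return False
        self.executed.append((device, operation))
        return True
```

In this sketch the operation is always executed by the process that holds the authority, so the visitor's process never gains control of the device itself.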

Alternatively, the first user 266, who is the owner of the smart speaker 260, may set the authority granted to the second user 268 and the third user 270, who are guests (visitors). For example, control of the lights via the smart speaker 260 may be permitted while purchases at an EC site via the smart speaker 260 are prohibited.
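Owner-defined guest authority of this kind amounts to a lookup table consulted before each action. The sketch below assumes two roles, "owner" and "guest", and the action names `light_control` and `ec_purchase`; none of these identifiers come from the specification.

```python
from typing import Dict

# Hypothetical settings chosen by the owner of the smart speaker.
GUEST_PERMISSIONS: Dict[str, bool] = {
    "light_control": True,   # guests may control the lights
    "ec_purchase": False,    # guests may not purchase at the EC site
}


def is_allowed(role: str, action: str,
               guest_permissions: Dict[str, bool] = GUEST_PERMISSIONS) -> bool:
    """Owners may do anything; guests only what the owner has enabled."""
    if role == "owner":
        return True
    # Actions the owner has not configured are denied by default (an assumption).
    return guest_permissions.get(action, False)
```

Denying unconfigured actions by default is a design choice of this sketch, not something the specification prescribes.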

According to the voice operation system 252 of the present embodiment, assigning a server process to each user makes it possible to vary the executable operations, the accessible information, the authority, and the like from user to user.

The present embodiment has described the case where the management server 254 has a plurality of server processes, receives the audio signal, performs voiceprint authentication, and identifies the server process to be used, but the configuration is not limited to this. For example, when there are a plurality of servers, the audio signal from the smart speaker 260 may be transmitted to all of the servers, and voiceprint authentication may be performed at each server. Alternatively, the server or server process of any one user, or any one server or server process, may receive the audio signal, extract only the audio signal of the target user, and forward it to the target server or server process.

The present embodiment has described the case where the first user 266 issuing the operation instruction and the TV 262 to be operated are in the same room 264, but the configuration is not limited to this, and remote operation of the target electronic device may also be enabled. For example, suppose that the second user 268, while visiting the room 264 of the first user 266, wants to turn on the air conditioner in his or her own room (which is different from the room 264). The second user 268 utters a sentence expressing the operation instruction "Turn on the air conditioner in my room" toward the smart speaker 260. The management server 254 understands the request of the second user 268 through voiceprint authentication and voice recognition. The management server 254 forwards the request of the second user 268 to another management server connected to the smart speaker in the room of the second user 268. At this time, the audio signal is attached as authentication data.
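The forwarding step, in which the audio signal accompanies the request as authentication data, can be sketched as follows. The specification does not say what the receiving management server does with the attached audio; this sketch assumes that it re-identifies the talker from the audio and accepts the request only if the result matches the claimed user. `ForwardedRequest`, `RemoteManagementServer`, and `forward` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class ForwardedRequest:
    user_id: str         # user identified by the forwarding management server
    command: str         # recognized request, e.g. "air_conditioner_on"
    audio_signal: bytes  # original audio, attached as authentication data


class RemoteManagementServer:
    """Hypothetical management server connected to the user's own smart speaker."""

    def __init__(self, identify: Callable[[bytes], Optional[str]]) -> None:
        self._identify = identify  # assumed voiceprint-based identification function
        self.accepted: List[str] = []

    def receive(self, request: ForwardedRequest) -> bool:
        # Assumption: the audio is re-checked before the command is accepted.
        if self._identify(request.audio_signal) != request.user_id:
            return False
        self.accepted.append(request.command)
        return True


def forward(user_id: str, command: str, audio_signal: bytes,
            remote: RemoteManagementServer) -> bool:
    """Forward the request together with the audio signal."""
    return remote.receive(ForwardedRequest(user_id, command, audio_signal))
```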

The technical idea according to the present embodiment may be expressed by the following item.
(Item 7)
A server that communicates via a network with a speaker having a microphone and a communication function, the server comprising:
means for identifying the person who spoke by analyzing an audio signal acquired via the microphone of the speaker; and
means for performing processing relating to the audio signal using a server process assigned to the identified person.

In each of the embodiments above, operation in cooperation with a network-connected television or computer, in addition to the smart speaker, has also been described. After the smart speaker outputs a voice advertisement, additional information may also be displayed on the television or computer in response to the user's voice control. This is realized by the management server 4 transmitting the URL to be displayed to the television or computer, and information indicating that the request for additional information was made via the smart speaker or the management server 4 may be added to the parameters of this URL. This allows the system receiving the request to recognize that the request was made using the management server 4 or the smart speaker. URL parameters are used here as one example of a transmission method, but the notification may be made by other methods.
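Adding such an indication to the URL parameters can be sketched with Python's standard `urllib.parse` module. The parameter name `request_source`, the value `smart_speaker`, and the example URLs are assumptions chosen for illustration; the specification does not name them.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def tag_url(url: str, source: str, param: str = "request_source") -> str:
    """Append a query parameter recording where the request originated."""
    parts = urlparse(url)
    # Keep any parameters already present in the URL, in their original order.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((param, source))
    return urlunparse(parts._replace(query=urlencode(query)))
```

For example, `tag_url("https://example.com/product?id=42", "smart_speaker")` yields `https://example.com/product?id=42&request_source=smart_speaker`, which the receiving system can inspect to tell that the request came by way of the smart speaker.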

The configuration and operation of the systems according to the embodiments have been described above. Those skilled in the art will understand that these embodiments are examples, that various modifications are possible in the combinations of their components and processes, and that such modifications are also within the scope of the present invention. The embodiments may also be combined with one another.

2 voice advertisement distribution system, 4 management server, 6 network, 8 user, 10 smart speaker.

Claims

A server comprising:
receiving means for receiving a distribution request via a network from a speaker having a microphone and a communication function;
acquisition means for acquiring audio content without an image in response to the received distribution request;
selection means for selecting a voice advertisement without an image from voice advertisement holding means; and
transmission means for transmitting the acquired audio content and the selected voice advertisement together to the speaker via the network.

The server according to claim 1, wherein the selection means selects a voice advertisement corresponding to the acquired audio content from the voice advertisement holding means.

The server according to claim 1, wherein, upon receiving the acquired audio content and the selected voice advertisement via the network, the speaker reproduces the audio content after reproducing the voice advertisement.

The server according to any one of claims 1 to 3, further comprising voice information holding means for holding voice information acquired through the microphone of the speaker,
wherein the selection means selects a voice advertisement based on the voice information held in the voice information holding means.

The server according to claim 4, wherein the voice information held in the voice information holding means includes the user's utterance content in a current dialogue session between the speaker and the user.

The server according to any one of claims 1 to 5, further comprising determination means for determining whether to adjust the length of the selected voice advertisement,
wherein, when it is determined that the length is to be adjusted, the transmission means transmits the voice advertisement with its length adjusted.

The server according to claim 6, wherein the length-adjusted voice advertisement is a part extracted from the voice advertisement by applying a predetermined extraction algorithm to the selected voice advertisement.

Control means for controlling the timing of audio output from the speaker via the network,
The control means restricts the output of the voice advertisement from the speaker when it determines that a user's presence is not detected or that a user's conversation is continuing. Server described in.

The server according to any one of claims 1 to 8, further comprising control means for controlling the timing of audio output from the speaker via the network,
wherein the control means controls the output timing of the voice advertisement so that the output of another electronic device associated with the speaker and the voice advertisement output from the speaker are coordinated with each other.

The server according to claim 1, further comprising authentication means for performing user authentication based on voice information acquired through the microphone of the speaker,
wherein the selection means selects, from the voice advertisement holding means, a voice advertisement corresponding to an attribute of the authenticated user's account.