JP2020060898A - Information processor and program

Info

Publication number: JP2020060898A
Application number: JP2018190843A
Authority: JP
Inventors: Ryosuke Akiyama
Original assignee: Spiralmind; Spiralmind Co Ltd
Current assignee: Spiralmind; Spiralmind Co Ltd
Priority date: 2018-10-09
Filing date: 2018-10-09
Publication date: 2020-04-16
Anticipated expiration: 2038-10-09
Also published as: JP7195861B2

Abstract

To provide an information processing device capable of reproducing a character's movement and voice in synchronization with each other with high accuracy. SOLUTION: An information processing device 2 comprises: a storage unit 34 that stores an avatar model; a receiving unit 31 that receives a plurality of first packets, each including a feature amount of a subject and a first time stamp, and a plurality of second packets, each including voice data and a second time stamp; an avatar output processing unit 33 that extracts the feature amount from each of the first packets and causes a display unit 43 to display an avatar obtained by inputting the feature amount into the avatar model; and a voice output processing unit 35 that extracts the voice data from each of the second packets and causes a voice output unit 44 to output a voice based on the voice data. The avatar output processing unit 33 acquires the second time stamp included in each of the second packets and, based on the first time stamp and the second time stamp, causes the display unit 43 to display the avatar so that the voice and the avatar are synchronized. SELECTED DRAWING: Figure 4

Description

The present invention relates to an information processing device and a program.

Conventionally, a feature amount extracted from image data or the like obtained by imaging an individual's face has been applied to an avatar model to display an avatar that reflects the individual's facial expression. For example, Patent Document 1 discloses a system including a face recognition component that extracts data indicating an individual's facial expression from a motion data stream, and a render component that animates a character so as to reflect the individual's facial expression based on that data. With this system, a character reflecting the individual's facial expression can be displayed on a display.

Patent Document 1: Japanese Translation of PCT International Application Publication No. 2013-535051

In the system described in Patent Document 1, it is conceivable, in order to improve the sense of realism, to additionally include voice data uttered by the individual in the data stream and to output a voice based on that voice data while displaying the character. However, it has been difficult to reproduce the movement of the character and the voice in synchronization with high accuracy.

Therefore, an object of the present invention is to provide an information processing device capable of reproducing the movement of a character and the voice in synchronization with high accuracy.

An information processing device according to one aspect of the present invention includes: a storage unit that stores an avatar model; a receiving unit that receives a plurality of first packets, each including a feature amount of a subject and a first time stamp, and a plurality of second packets, each including voice data and a second time stamp; an avatar output processing unit that extracts the feature amount from each of the plurality of first packets and causes a display unit to display an avatar obtained by inputting the feature amount into the avatar model; and a voice output processing unit that extracts the voice data from each of the plurality of second packets and causes a voice output unit to output a voice based on the voice data. The avatar output processing unit acquires the second time stamp included in each of the plurality of second packets and, based on the first time stamp and the second time stamp, causes the display unit to display the avatar so that the voice and the avatar are synchronized.

According to this aspect, the movement of the character and the voice can be reproduced in synchronization with high accuracy.

According to the present invention, it is possible to provide an information processing device capable of reproducing the movement of a character and the voice in synchronization with high accuracy.

A network configuration diagram showing an example of the avatar operation system 1 according to the first embodiment.
A block diagram showing the functional configuration of the user terminal 2A according to the first embodiment.
A schematic configuration diagram showing the functional modules of the transmission processing unit.
A schematic configuration diagram showing the functional modules of the reception processing unit.
A schematic configuration diagram showing the data structure of a facial expression packet.
A schematic configuration diagram showing the data structure of a voice packet.
A block diagram showing the functional configuration of the server device 3.
An operation sequence diagram for explaining the initial setting process.
An operation sequence diagram for explaining the avatar operation process.
An operation flow diagram for explaining the facial expression packet generation process.
An operation flow diagram for explaining the voice packet generation process.
An operation flow diagram for explaining the voice output process.
An operation flow diagram for explaining the avatar display process.

A preferred embodiment of the present invention will be described with reference to the accompanying drawings. (In the drawings, elements denoted by the same reference numerals have the same or similar configurations.)

[First Embodiment]
(1) Configuration
(1-1) Avatar operation system 1
FIG. 1 is a network configuration diagram showing an example of an avatar operation system 1 according to the first embodiment.

As shown in FIG. 1, the avatar operation system 1 includes a user terminal 2A, a user terminal 2B, and a server device 3. The server device 3 is communicably connected to each of the user terminals 2A and 2B via a communication network such as the Internet. Hereinafter, the user terminals 2A, 2B, and so on may be collectively referred to as the "user terminal 2".

In the avatar operation system 1, one user terminal 2 transmits a feature amount of a subject to another user terminal 2, and the other user terminal 2 applies the received feature amount to a predetermined avatar model, whereby an avatar (character) is output at the other user terminal 2. Hereinafter, a device or function that transmits the feature amount of the subject may be referred to as a "tracker", and a device or function that receives the feature amount of the subject and outputs an avatar may be referred to as a "viewer".

(1-2) User terminal
FIG. 2 is a block diagram showing the functional configuration of the user terminal 2A according to the first embodiment. In the following, the user terminal 2A is described as having both the tracker and viewer functions, but it may have only one of them. The user terminal 2B has the same configuration as the user terminal 2A, so its description is omitted.

The user terminal 2A is an example of an information processing device and is configured by, for example, a computer including a memory such as a ROM or RAM and a processor such as a CPU. Computer programs, data, and the like are stored in the memory of the user terminal 2A. A computer program may be installed in the memory of the user terminal 2A from a computer-readable portable recording medium such as a CD-ROM, or may be installed by being downloaded from another information processing device via a communication network. The processor of the user terminal 2A centrally controls the operation of each unit of the user terminal 2A based on the computer programs and the like stored in the memory of the user terminal 2A.

The user terminal 2A includes, for example, a user terminal setting processing unit 10, a transmission processing unit 20, a reception processing unit 30, an imaging unit 41, a voice input unit 42, a display unit 43, and a voice output unit 44. The user terminal setting processing unit 10, the transmission processing unit 20, and the reception processing unit 30 are functional modules implemented by the processor of the user terminal 2A based on programs and the like stored in the memory of the user terminal 2A. The imaging unit 41, the voice input unit 42, the display unit 43, and the voice output unit 44 may each be provided outside the user terminal 2A.

As described later, the user terminal setting processing unit 10 communicates with the server device 3 and performs various setting processes relating to the room ID, the client ID, and the like.

The transmission processing unit 20 is an example of a functional module that implements the tracker function. It generates a facial expression packet (first packet) based on the moving image data generated by the imaging unit 41 and transmits the facial expression packet to another information processing device. The transmission processing unit 20 also generates a voice packet (second packet) based on the voice data generated by the voice input unit 42 and transmits the voice packet to another information processing device.

The reception processing unit 30 is an example of a functional module that implements the viewer function. It receives a facial expression packet from another information processing device and displays an avatar on the display unit 43 based on the facial expression packet. The reception processing unit 30 also receives a voice packet from another information processing device and outputs a voice from the voice output unit 44 based on the voice packet.

The imaging unit 41 is configured by, for example, a lens and an imaging element, and generates moving image data by imaging the subject and/or the environment surrounding the subject. The voice input unit 42 is configured by, for example, a microphone, and generates voice data based on the sound emitted by the subject and/or the environment surrounding the subject. The display unit 43 is configured by, for example, a liquid crystal display or an organic EL (Electro-Luminescence) display, and displays images or moving images based on image display data or moving image display data. The voice output unit 44 is configured by, for example, a speaker, and outputs a voice based on voice data.

(1-2-1) Transmission processing unit
FIG. 3 is a schematic configuration diagram showing the functional modules of the transmission processing unit 20.

As described above, the transmission processing unit 20 is an example of a functional module that implements the tracker function. As shown in FIG. 3, the transmission processing unit 20 includes, for example, a facial expression packet processing unit 21, a voice packet processing unit 22, a moving image data buffer 23, and a voice data buffer 24.

The facial expression packet processing unit 21 includes, for example, a partial moving image data extraction unit 211, a feature amount extraction unit 212, a first time stamp unit 213, and a facial expression packet generation unit 214.

The partial moving image data extraction unit 211 extracts partial moving image data from the moving image data generated by the imaging unit 41 at every predetermined first cycle. The length of the first cycle is not particularly limited and may be set arbitrarily according to the size of the facial expression packet; it may be, for example, 1 ms, 2 ms, or 5 ms.

The feature amount extraction unit 212 extracts a feature amount of the subject from the partial moving image data. Here, the feature amount is information indicating a feature of the subject. The feature amount may be a feature amount of at least one part of the subject, and may be a feature amount of a part of the subject's face. The feature amount may be, for example, a feature amount of the subject's forehead, eyebrows (right eyebrow, left eyebrow), eyes (left eye, right eye), nose, mouth, ears, hair, or the like.

The first time stamp unit 213 generates a first time stamp. The first time stamp is, for example, a time stamp corresponding to the time at which the partial moving image data was extracted by the partial moving image data extraction unit 211. The first time stamp may instead be, for example, a time stamp corresponding to any time between the start time and the end time of the partial moving image data, inclusive.

The facial expression packet generation unit 214 generates a facial expression packet (first packet) including the feature amount extracted by the feature amount extraction unit 212 and the first time stamp generated by the first time stamp unit 213. The facial expression packet generation unit 214 also transmits the generated facial expression packet to another information processing device via the communication network.

The voice packet processing unit 22 includes, for example, a partial voice data extraction unit 221, a second time stamp unit 222, and a voice packet generation unit 223.

The partial voice data extraction unit 221 extracts partial voice data from the voice data generated by the voice input unit 42 at every predetermined second cycle. The length of the second cycle is not particularly limited and may be set arbitrarily according to the size of the voice packet; it may be, for example, 30 ms, 60 ms, or 100 ms. The length of the first cycle and the length of the second cycle may differ.

The second time stamp unit 222 generates a second time stamp. The second time stamp is, for example, a time stamp corresponding to the time at which the partial voice data was extracted by the partial voice data extraction unit 221.

The voice packet generation unit 223 generates a voice packet (second packet) including the partial voice data extracted by the partial voice data extraction unit 221 and the second time stamp generated by the second time stamp unit 222.
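
The extraction of partial voice data at every second cycle, with a second time stamp attached to each piece, could be sketched roughly as follows. This is only an illustration under stated assumptions: the names `VoicePacket` and `make_voice_packets`, the representation of voice data as a list of integer samples, and the use of the chunk's start offset in milliseconds as the time stamp are all hypothetical and do not come from the specification.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class VoicePacket:
    """Hypothetical container for one voice packet (second packet)."""
    timestamp_ms: int   # second time stamp (assumed: chunk start, in ms)
    samples: List[int]  # partial voice data


def make_voice_packets(samples: Sequence[int], sample_rate_hz: int,
                       period_ms: int, start_ms: int = 0) -> List[VoicePacket]:
    """Split voice samples into chunks of one second cycle each."""
    samples_per_period = sample_rate_hz * period_ms // 1000
    if samples_per_period <= 0:
        raise ValueError("period too short for this sample rate")
    packets = []
    for offset in range(0, len(samples), samples_per_period):
        chunk = list(samples[offset:offset + samples_per_period])
        # Time stamp of the first sample in the chunk.
        timestamp_ms = start_ms + offset * 1000 // sample_rate_hz
        packets.append(VoicePacket(timestamp_ms, chunk))
    return packets
```

With a 16 kHz sample rate and a 30 ms second cycle, each packet in this sketch carries 480 samples, and the final packet carries whatever remains.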

(1-2-2) Reception processing unit
FIG. 4 is a schematic configuration diagram showing the functional modules of the reception processing unit 30.

As described above, the reception processing unit 30 is an example of a functional module that implements the viewer function. As shown in FIG. 4, the reception processing unit 30 includes, for example, a receiving unit 31, a voice buffer processing unit 32A, a facial expression buffer processing unit 32B, an avatar output processing unit 33, an avatar model storage unit 34, and a voice output processing unit 35.

The receiving unit 31 receives packets from another information processing device such as the server device 3 and reorders the received packets based on, for example, their sequence IDs.

The voice buffer processing unit 32A and the facial expression buffer processing unit 32B are processing units configured by part of the memory and the processor of the user terminal 2A. The voice buffer processing unit 32A buffers the voice packets reordered by the receiving unit 31 by storing them sequentially, and discards delayed voice packets. The facial expression buffer processing unit 32B buffers the facial expression packets reordered by the receiving unit 31 by storing them sequentially, and discards delayed facial expression packets.
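
The reordering by sequence ID and the discarding of delayed packets could be combined in a small buffer such as the one below. This is a minimal sketch, not the implementation described in the specification: the class name `PacketBuffer` is hypothetical, and "delayed" is assumed here to mean a packet whose sequence ID is not greater than one that has already been released for output.

```python
class PacketBuffer:
    """Hypothetical buffer that releases packets in sequence-ID order."""

    def __init__(self):
        self._pending = {}        # sequence ID -> payload
        self._last_released = -1  # highest sequence ID already released

    def push(self, seq_id, payload):
        """Store a packet; return False if it is discarded."""
        # Assumed rule: a packet is delayed if a packet with the same or a
        # later sequence ID has already been released. Duplicates are dropped.
        if seq_id <= self._last_released or seq_id in self._pending:
            return False
        self._pending[seq_id] = payload
        return True

    def pop(self):
        """Release the pending packet with the smallest sequence ID."""
        if not self._pending:
            return None
        seq_id = min(self._pending)
        self._last_released = seq_id
        return seq_id, self._pending.pop(seq_id)
```

One instance would serve the voice packets and another the facial expression packets, mirroring the two buffer processing units.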

The avatar output processing unit 33 sets a playback motion by extracting the feature amount from a predetermined facial expression packet among those stored in the facial expression buffer processing unit 32B, and generates avatar display data by applying the playback motion to the avatar model stored in the avatar model storage unit 34. At this time, the avatar output processing unit 33 sets the playback motion based on the second time stamp received from the voice output processing unit 35, so that the avatar can be moved in synchronization with the voice. The method of setting the playback motion is described later. The avatar output processing unit 33 then outputs the display data to the display unit 43 and causes the display unit 43 to display the avatar.

The avatar model storage unit 34 is part of the memory of the user terminal 2A and stores the avatar model. When, for example, the feature amount of the subject extracted by the feature amount extraction unit 212 is input, the avatar model outputs display data for displaying an avatar in a form corresponding to that feature amount. The avatar may be a character modeled on a human, an animal, a fictitious creature, or the like, an inanimate object, or any other form.

The voice output processing unit 35 extracts the voice data from the voice packets stored in the voice buffer processing unit 32A and outputs a voice from the voice output unit 44 based on the voice data. The voice output processing unit 35 also extracts the time stamp from the voice packet whose voice has been output, and feeds that time stamp back to the facial expression buffer processing unit 32B.
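
One plausible way to use the fed-back second time stamp is to pick, from the buffered facial expression packets, the latest one whose first time stamp does not exceed the time stamp of the voice currently being output. The specification defers the actual playback motion setting method to a later passage, so the selection rule below is an assumption made purely for illustration, and the function name `select_expression` is hypothetical.

```python
import bisect


def select_expression(expression_packets, audio_timestamp_ms):
    """Pick the feature amount to display for the current voice position.

    expression_packets: list of (first_timestamp_ms, features) tuples,
        sorted by time stamp in ascending order.
    audio_timestamp_ms: second time stamp of the voice packet being output.
    Returns the features of the latest packet at or before the voice
    time stamp, or None if no such packet is buffered yet.
    """
    timestamps = [ts for ts, _ in expression_packets]
    index = bisect.bisect_right(timestamps, audio_timestamp_ms) - 1
    if index < 0:
        return None
    return expression_packets[index][1]
```

Because the first cycle (for example 1 ms to 5 ms) is much shorter than the second cycle (for example 30 ms to 100 ms), several facial expression packets typically fall within one voice packet, which is why a lookup by time stamp rather than by packet count is used in this sketch.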

(1-2-3) Data structure of packets
<Facial expression packet>
FIG. 5A is a schematic configuration diagram showing the data structure of the facial expression packet.

The facial expression packet preferably has a data structure conforming to UDP (User Datagram Protocol), which places little load on communication processing, particularly in the avatar operation system 1, where timely operation of the avatar is required. However, the facial expression packet may have a data structure conforming to any other protocol, such as TCP (Transmission Control Protocol).

As shown in FIG. 5A, the facial expression packet includes, for example, a sequence ID, a packet type, a first time stamp, a client ID, a room ID, a data length, and facial expression data.

The sequence ID is identification information for identifying the packet. The packet type is information indicating the type of the packet, such as "facial expression packet" or "voice packet". The first time stamp is the first time stamp generated by the first time stamp unit 213. The client ID is data provided by the server device 3 in the initial setting process and is identification information unique to the user terminal 2A as a client. The room ID is identification information for identifying a room set by the server device 3. The data length is information indicating the data length of the packet. The facial expression data is data indicating the feature amounts, extracted by the feature amount extraction unit 212, of the subject's forehead, eyebrows (right eyebrow, left eyebrow), eyes (left eye, right eye), nose, mouth, ears, hair, and the like.
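
The field list above could be serialized into a UDP payload along the following lines. The specification names the fields but not their widths or encodings, so every size here is an assumption: a 32-bit sequence ID, an 8-bit packet type, a 64-bit time stamp in milliseconds, 32-bit client and room IDs, a 16-bit data length taken to mean the length of the facial expression data in bytes, and feature amounts encoded as 32-bit floats. The names `HEADER`, `pack_expression`, `unpack_packet`, and `decode_features` are hypothetical.

```python
import struct

# Assumed header layout, network byte order, no padding:
# sequence ID, packet type, time stamp, client ID, room ID, data length.
HEADER = struct.Struct("!IBQIIH")
TYPE_EXPRESSION = 1  # assumed code for "facial expression packet"
TYPE_VOICE = 2       # assumed code for "voice packet"


def pack_expression(seq_id, timestamp_ms, client_id, room_id, features):
    """Build the bytes of one facial expression packet."""
    payload = struct.pack(f"!{len(features)}f", *features)
    header = HEADER.pack(seq_id, TYPE_EXPRESSION, timestamp_ms,
                         client_id, room_id, len(payload))
    return header + payload


def unpack_packet(data):
    """Split received bytes into header fields and raw payload."""
    seq_id, ptype, ts, client_id, room_id, length = HEADER.unpack_from(data)
    payload = data[HEADER.size:HEADER.size + length]
    return {"sequence_id": seq_id, "packet_type": ptype,
            "timestamp_ms": ts, "client_id": client_id,
            "room_id": room_id, "payload": payload}


def decode_features(payload):
    """Decode facial expression data as a list of 32-bit floats."""
    return list(struct.unpack(f"!{len(payload) // 4}f", payload))
```

A voice packet could reuse the same header with `TYPE_VOICE` and carry the partial voice data as its payload, since the two packet types share every field except the last.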

<Voice packet>
FIG. 5B is a schematic configuration diagram showing the data structure of the voice packet.

Like the facial expression packet, the voice packet preferably has a data structure conforming to UDP, but it may instead have a data structure conforming to any other protocol, such as TCP.

As shown in FIG. 5B, the voice packet includes, for example, a sequence ID, a packet type, a second time stamp, a client ID, a room ID, a data length, and voice data. The sequence ID, packet type, client ID, room ID, and data length are the same as those in the facial expression packet, so their description is omitted.

The second time stamp is the second time stamp generated by the second time stamp unit 222. The voice data is the partial voice data extracted by the partial voice data extraction unit 221.

(1-3) Server device
FIG. 6 is a block diagram showing the functional configuration of the server device 3.

The server device 3 is configured by, for example, a computer including a memory such as a ROM or RAM and a processor such as a CPU. Computer programs, data, and the like are stored in the memory of the server device 3. A computer program may be installed in the memory of the server device 3 from a computer-readable portable recording medium such as a CD-ROM, or may be installed by being downloaded from another information processing device via a communication network. The processor of the server device 3 centrally controls the operation of each unit of the server device 3 based on the computer programs and the like stored in the memory of the server device 3.

As shown in FIG. 6, the server device 3 includes, for example, a server device setting processing unit 50 and a packet processing unit 60. The server device setting processing unit 50 and the packet processing unit 60 are functional modules implemented by the processor of the server device 3 based on programs and the like stored in the memory of the server device 3.

The server device setting processing unit 50 performs setting processes relating to the room ID and the client ID with the user terminal 2.

The packet processing unit 60 analyzes packets received from the user terminal 2 (including facial expression packets and voice packets) and transmits them to the intended user terminal 2.

(2) Processing of the avatar operation system 1
(2-1) Initial setting process
FIG. 7 is an operation sequence diagram for explaining the initial setting process.

(S101)
First, the user terminal setting processing unit 10 of the user terminal 2 transmits a room ID list request to the server device 3.

(S102)
Next, the server device setting processing unit 50 of the server device 3 acquires the room ID list from a memory inside or outside the server device 3 and transmits it to the user terminal 2.

(S103)
Next, the user terminal setting processing unit 10 transmits a client ID generation request to the server device 3. At this time, the user terminal setting processing unit 10 includes in the client ID generation request, for example, the room ID selected from the room ID list.

(S104)
Next, the server device setting processing unit 50 generates a client ID based on the client ID generation request. At this time, the server device setting processing unit 50 stores the generated client ID in a memory inside or outside the server device 3 in association with the selected room ID included in the client ID generation request.

(S105)
Next, the server device setting processing unit 50 transmits the generated client ID to the user terminal 2.

(S106)
Next, the user terminal setting processing unit 10 stores the client ID received from the server device 3 in the memory of the user terminal 2. This completes the initial setting process.
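
The server side of steps S101 to S106 could be modeled in memory as shown below. This is a rough sketch under assumptions: the class name `SetupServer` and its methods are hypothetical, network transport is omitted, and the client ID is generated as a random UUID string, which the specification does not prescribe.

```python
import uuid


class SetupServer:
    """Hypothetical in-memory model of the server device setting process."""

    def __init__(self, room_ids):
        # room ID -> set of client IDs associated with that room
        self._rooms = {room_id: set() for room_id in room_ids}

    def room_id_list(self):
        """S102: return the room ID list."""
        return sorted(self._rooms)

    def create_client_id(self, room_id):
        """S104 and S105: generate a client ID tied to the selected room."""
        if room_id not in self._rooms:
            raise KeyError(f"unknown room ID: {room_id}")
        client_id = uuid.uuid4().hex  # assumed ID format
        self._rooms[room_id].add(client_id)
        return client_id

    def members(self, room_id):
        """Return a copy of the client IDs associated with a room."""
        return set(self._rooms[room_id])
```

A user terminal would call `room_id_list()`, choose one entry, call `create_client_id()` with it, and keep the returned value for use in every later packet.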

(2-2) Avatar operation process
FIG. 8 is an operation sequence diagram for explaining the avatar operation process.

In the avatar operation process, one information processing device (the tracker) transmits data including the feature amount of the subject to another information processing device (the viewer), and the avatar is operated by causing the viewer to display an avatar corresponding to the feature amount. In the following, the case where the user terminal 2A functions as the tracker and the user terminal 2B functions as the viewer is described as an example.

(S201)
First, the transmission processing unit 20 of the user terminal 2A, which is the tracker, performs packet generation processing to generate a packet (a facial expression packet or a voice packet). Details of the packet generation processing are described later.

(S202)
Next, the transmission processing unit 20 of the user terminal 2A transmits the generated packet to the server device 3 via the communication network.

(S203)
Next, the packet processing unit 60 of the server device 3 analyzes the packet received from the user terminal 2A and identifies the transmission destination of the packet. Specifically, the packet processing unit 60 acquires the room ID and the client ID included in the packet, refers to the room ID list stored in a memory inside or outside the server device 3, and identifies the transmission destination of the packet.

(S204)
Next, the packet processing unit 60 of the server device 3 transmits the packet to the identified transmission destination via the communication network in accordance with the result of the packet analysis processing. Here, it is assumed that the user terminal 2B, which is the viewer, is identified as the transmission destination.
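
The destination lookup in S203 and S204 can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Packet` fields, the `resolve_destinations` function, and the dictionary layout of the room ID list are hypothetical, and the rule that the sender is excluded from the destinations is an assumption that the text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    room_id: str    # room the sender belongs to
    client_id: str  # client ID of the sending user terminal
    payload: bytes  # facial expression or voice data, opaque to the server


def resolve_destinations(packet, room_id_list):
    """Return the client IDs that should receive `packet`.

    `room_id_list` maps a room ID to the client IDs registered in that room.
    Assumption: the sender itself is not a destination.
    """
    members = room_id_list.get(packet.room_id, [])
    return [cid for cid in members if cid != packet.client_id]


# Hypothetical room ID list: terminals 2A and 2B share one room.
room_id_list = {"room-1": ["2A", "2B"], "room-2": ["2C"]}
packet = Packet(room_id="room-1", client_id="2A", payload=b"")
destinations = resolve_destinations(packet, room_id_list)
```

With this room ID list, a packet from the user terminal 2A resolves to the user terminal 2B only, which matches the tracker and viewer roles used in this example.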

(S205)
Next, the reception processing unit 30 of the user terminal 2B, which is the viewer, performs output processing based on the packet received from the server device 3. Details of the output processing are described later. This completes the avatar operation processing.

(2-2-1) Packet Generation Processing
The packet generation processing by the user terminal 2A is described below. The packet generation processing includes facial expression packet generation processing for generating a facial expression packet and voice packet generation processing for generating a voice packet.

<Facial expression packet generation processing>
FIG. 9 is an operation flow diagram for explaining the facial expression packet generation processing.

(S301)
First, the imaging unit 41 images a subject to generate moving image data (a plurality of image data items that are continuous in time series) and stores (buffers) the moving image data in the moving image data buffer 23.

(S302)
Next, the partial moving image data extraction unit 211 of the facial expression packet processing unit 21 determines whether the first period has elapsed. If it is determined that the first period has not elapsed (S302; No), the processing returns to S301.

(S303)
If it is determined that the first period has elapsed (S302; Yes), the partial moving image data extraction unit 211 extracts, from the moving image data stored in the moving image data buffer 23, the partial moving image data included in one first period.

(S304)
Next, the feature amount extraction unit 212 of the facial expression packet processing unit 21 extracts, based on the extracted partial moving image data, the feature amount of the subject included in the partial moving image data.

(S305)
Next, the first time stamp unit 213 of the facial expression packet processing unit 21 generates, for the extracted partial moving image data, a first time stamp according to the time when the partial moving image data was extracted by the partial moving image data extraction unit 211. Here, the first time stamp may be, for example, any time between the start time and the end time of the partial moving image data (inclusive of both).

(S306)
Next, the facial expression packet generation unit 214 of the facial expression packet processing unit 21 generates a facial expression packet including the feature amount extracted by the feature amount extraction unit 212 and the first time stamp. This completes the facial expression packet generation processing.
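
Steps S303 to S306 can be sketched in Python as follows, assuming the buffered frames are available as (capture time in ms, image) pairs. The names `ExpressionPacket`, `make_expression_packets`, and `mean_brightness` are hypothetical, each image is reduced to a single number as a stand-in for real image data, and the feature extractor is a placeholder. Using the window start as the first time stamp is one of the choices the text permits, not the only one.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class ExpressionPacket:
    features: List[float]  # feature amount of the subject
    timestamp_ms: int      # first time stamp


def make_expression_packets(
    frames: Sequence[Tuple[int, float]],
    first_period_ms: int,
    extract_features: Callable[[List[float]], List[float]],
) -> List[ExpressionPacket]:
    """Group buffered frames into windows of one first period each and
    build one facial expression packet per non-empty window."""
    if not frames:
        return []
    t0 = frames[0][0]
    windows = {}
    for captured_ms, image in frames:
        index = (captured_ms - t0) // first_period_ms
        windows.setdefault(index, []).append(image)
    return [
        # Assumption: the first time stamp is the start time of the window.
        ExpressionPacket(extract_features(windows[i]), t0 + i * first_period_ms)
        for i in sorted(windows)
    ]


def mean_brightness(images):
    """Placeholder feature extractor (hypothetical)."""
    return [sum(images) / len(images)]


frames = [(0, 1.0), (10, 3.0), (40, 5.0), (50, 7.0)]
packets = make_expression_packets(frames, 33, mean_brightness)
```

With a first period of 33 ms, the four frames fall into two windows, so two facial expression packets are produced.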

<Voice packet generation processing>
FIG. 10 is an operation flow for explaining the voice packet generation processing.

(S401)
First, the voice input unit 42 generates voice data based on the voice emitted from the subject or from the environment surrounding the subject, and stores (buffers) the voice data in the voice data buffer 24.

(S402)
Next, the partial voice data extraction unit 221 of the voice packet processing unit 22 determines whether the second period has elapsed. If it is determined that the second period has not elapsed (S402; No), the processing returns to S401.

(S403)
If it is determined that the second period has elapsed (S402; Yes), the partial voice data extraction unit 221 extracts, from the voice data stored in the voice data buffer 24, the partial voice data included in one second period.

(S404)
Next, the second time stamp unit 222 of the voice packet processing unit 22 generates, for the extracted partial voice data, a second time stamp according to the time when the partial voice data was extracted by the partial voice data extraction unit 221. Here, the second time stamp may be, for example, any time between the start time and the end time of the partial voice data (inclusive of both).

(S405)
Next, the voice packet generation unit 223 of the voice packet processing unit 22 generates a voice packet including the partial voice data extracted by the partial voice data extraction unit 221 and the second time stamp. This completes the voice packet generation processing.
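
The voice side differs from the facial expression side in that the voice data itself is carried in the packet. A minimal Python sketch follows, assuming the buffered voice data is a list of PCM samples at a known sample rate. The names `VoicePacket` and `make_voice_packets` are hypothetical, and numbering packets with a sequence ID at generation time is an assumption made here for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VoicePacket:
    sequence_id: int    # assumption: assigned in generation order
    samples: List[int]  # partial voice data
    timestamp_ms: int   # second time stamp


def make_voice_packets(samples, sample_rate_hz, second_period_ms, start_ms=0):
    """Cut buffered samples into chunks of one second period each."""
    chunk = sample_rate_hz * second_period_ms // 1000
    if chunk <= 0:
        raise ValueError("second period is shorter than one sample")
    packets = []
    for seq, offset in enumerate(range(0, len(samples), chunk)):
        packets.append(
            VoicePacket(
                sequence_id=seq,
                samples=list(samples[offset:offset + chunk]),
                # Assumption: the second time stamp is the chunk start time.
                timestamp_ms=start_ms + offset * 1000 // sample_rate_hz,
            )
        )
    return packets


# Ten samples at 1000 Hz with a 4 ms second period give chunks of 4, 4, and 2.
voice_packets = make_voice_packets(list(range(10)), 1000, 4)
```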

(2-2-2) Output Processing
The output processing by the user terminal 2B is described below. The output processing includes voice output processing and avatar display processing.

<Voice output processing>
FIG. 11 is an operation flow for explaining the voice output processing.

(S501)
First, the reception unit 31 of the user terminal 2B receives the voice packets transmitted from the server device 3.

(S502)
Next, the reception unit 31 rearranges the received voice packets based on the sequence ID.

(S503)
Next, the voice buffer processing unit 32A buffers the voice packets rearranged by the reception unit 31.

(S504)
Next, the voice buffer processing unit 32A determines whether a predetermined waiting time has elapsed. The length of the predetermined waiting time is not particularly limited and is, for example, 50 ms. If it is determined that the predetermined waiting time has not elapsed (S504; No), the processing returns to S503 described above.

(S505)
If the voice buffer processing unit 32A determines that the predetermined waiting time has elapsed (S504; Yes), it extracts the voice data from the buffered voice packets and outputs the voice data to the voice output unit 44, thereby causing the voice to be output.

(S506)
In addition, the voice output processing unit 35 extracts the second time stamp included in the voice packet that was played back and transmits the second time stamp to the facial expression buffer processing unit 32B. This completes the voice output processing.
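
The reordering, waiting, and time stamp hand-off in S502 to S506 resemble a small jitter buffer. The Python sketch below is one possible reading, not the embodiment's implementation: the class `VoiceBuffer` and its methods are hypothetical, the waiting time is applied per packet from its arrival, and the current time is passed in explicitly so that the behavior is easy to check.

```python
import heapq


class VoiceBuffer:
    """Orders voice packets by sequence ID and releases them after a wait."""

    def __init__(self, wait_ms=50):
        self.wait_ms = wait_ms
        self._heap = []  # entries: (sequence_id, arrival_ms, timestamp_ms, samples)

    def push(self, sequence_id, timestamp_ms, samples, now_ms):
        # The heap keeps the lowest sequence ID at the front (S502, S503).
        heapq.heappush(self._heap, (sequence_id, now_ms, timestamp_ms, samples))

    def pop_ready(self, now_ms):
        """Return (samples, second time stamp) pairs whose wait has elapsed.

        The samples would go to the voice output unit (S505) and the second
        time stamp to the facial expression buffer (S506).
        """
        ready = []
        while self._heap and now_ms - self._heap[0][1] >= self.wait_ms:
            _, _, timestamp_ms, samples = heapq.heappop(self._heap)
            ready.append((samples, timestamp_ms))
        return ready


buf = VoiceBuffer(wait_ms=50)
buf.push(sequence_id=2, timestamp_ms=40, samples=[3], now_ms=0)   # arrives first
buf.push(sequence_id=1, timestamp_ms=20, samples=[2], now_ms=10)  # arrives late
too_early = buf.pop_ready(now_ms=30)
released = buf.pop_ready(now_ms=60)
```

In this run nothing is released at 30 ms, and at 60 ms both packets are released in sequence ID order even though they arrived out of order.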

<Avatar display processing>
FIG. 12 is an operation flow for explaining the avatar display processing. In the following operation flow, it is assumed that the facial expression buffer processing unit 32B intermittently receives the second time stamp from the voice output processing unit 35.

(S601)
First, the reception unit 31 of the user terminal 2B receives the facial expression packets transmitted from the server device 3.

(S602)
Next, the reception unit 31 rearranges the received facial expression packets based on the sequence ID.

(S603)
Next, the facial expression buffer processing unit 32B buffers the facial expression packets rearranged by the reception unit 31. At this time, if a facial expression packet including a first time stamp older than the received second time stamp is buffered, that packet is discarded as a delayed packet.

(S604)
Next, the avatar output processing unit 33 sets a playback motion. This processing is executed based on the first time stamp and the second time stamp so that the avatar and the voice are synchronized. Specifically, the playback motion is set so that it starts from the facial expression data corresponding to the first time stamp of the nearest past time relative to the time when the voice was output (the time indicated by the second time stamp) and ends with the facial expression data corresponding to the first time stamp of the nearest future time relative to the time at which the playback time elapses.

(S605)
Next, the avatar output processing unit 33 generates display data for the avatar by inputting the facial expression data (feature amount) included in the playback motion into the avatar model stored in the avatar model storage unit 34, and causes the display unit 43 to display the avatar by outputting the display data to the display unit 43. As described above, because the avatar output processing unit 33 sets the playback motion based on the first time stamp and the second time stamp, the voice and the facial expression of the avatar are played back in synchronization. This completes the avatar display processing.
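
The selection rule in S604 can be sketched in Python with a binary search over the buffered first time stamps. The function name `select_playback_motion` and the parameter `playback_ms` are hypothetical, and treating a first time stamp equal to the second time stamp as "past" (and one equal to the end of playback as "future") is an assumption.

```python
from bisect import bisect_left, bisect_right


def select_playback_motion(first_timestamps, second_timestamp, playback_ms):
    """Pick the first time stamps that make up the playback motion.

    `first_timestamps` must be sorted in ascending order. Returns the time
    stamps from the nearest one at or before `second_timestamp` through the
    nearest one at or after `second_timestamp + playback_ms`, or None when
    the buffer does not yet cover that interval.
    """
    start = bisect_right(first_timestamps, second_timestamp) - 1
    end = bisect_left(first_timestamps, second_timestamp + playback_ms)
    if start < 0 or end >= len(first_timestamps):
        return None
    return first_timestamps[start:end + 1]


buffered = [0, 33, 66, 100, 133]
motion = select_playback_motion(buffered, second_timestamp=40, playback_ms=50)
not_covered = select_playback_motion(buffered, second_timestamp=200, playback_ms=50)
```

For a voice chunk stamped 40 ms that plays for 50 ms, the motion runs from the expression stamped 33 ms to the one stamped 100 ms. The facial expression data for each selected time stamp would then be fed to the avatar model as in S605.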

[Modification]
When voice packets generated by a plurality of user terminals relate to the same room ID, mixing processing may be executed in which the voices included in the voice packets are combined to generate a synthesized voice. The mixing processing may be executed by, for example, the server device 3 or the user terminal 2 that receives the packets. When the server device 3 executes the mixing processing, it, for example, extracts the voice included in each voice packet and combines the voices in synchronization based on the time stamp included in each voice packet. The server device 3 may then generate a new voice packet including the synthesized voice and transmit the new voice packet, in place of the individual voice packets, to the desired user terminal 2. When the user terminal 2 that receives the packets executes the mixing processing, it, for example, extracts the voice included in each received voice packet and outputs the voices in synchronization based on the time stamp included in each voice packet.
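
A time stamp based mix of this kind can be sketched in Python as follows. The function `mix_voices`, the (time stamp, samples) input shape, the shared sample rate, and the clipping to a signed 16-bit range are all assumptions made for illustration; the text does not specify how the voices are combined.

```python
def mix_voices(packets, sample_rate_hz):
    """Mix voice chunks from several terminals on a common timeline.

    `packets` is a list of (timestamp_ms, samples) pairs. Each chunk is
    placed at the sample offset implied by its time stamp, and overlapping
    samples are summed. Returns (start_ms, mixed_samples).
    """
    if not packets:
        return 0, []
    start_ms = min(ts for ts, _ in packets)
    offsets = [(ts - start_ms) * sample_rate_hz // 1000 for ts, _ in packets]
    length = max(off + len(s) for off, (_, s) in zip(offsets, packets))
    mixed = [0] * length
    for off, (_, samples) in zip(offsets, packets):
        for i, value in enumerate(samples):
            mixed[off + i] += value
    # Assumption: samples are signed 16-bit, so the sum is clipped to that range.
    return start_ms, [max(-32768, min(32767, v)) for v in mixed]


# Two terminals in the same room; the second chunk starts 1 ms later.
mixed_start, mixed_samples = mix_voices([(100, [1, 2, 3]), (101, [10, 10, 10])], 1000)
```

The mixed samples and `mixed_start` could then be wrapped in a new voice packet, as described for the case where the server device 3 performs the mixing.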

In the embodiment described above, the feature amount based on the moving image data and the voice data are stored in separate packets (the facial expression packet and the voice packet). However, the user terminal 2 may store the feature amount based on the moving image data and the voice data together in a single packet, generate a single time stamp, and include that time stamp in the single packet.

In the embodiment described above, a facial expression packet whose first time stamp is older than the second time stamp is discarded. In this regard, the avatar motion may be complemented based on the facial expression packets before and after the discarded facial expression packet.
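
One way to complement the avatar motion across a discarded packet is linear interpolation between the neighboring feature amounts. The Python sketch below assumes this reading; the function `interpolate_features` is hypothetical, and the text does not say which complementing method is used.

```python
def interpolate_features(before, after, timestamp_ms):
    """Estimate the feature amount at `timestamp_ms` from its two neighbors.

    `before` and `after` are (timestamp_ms, features) pairs taken from the
    facial expression packets on either side of the discarded one.
    """
    t0, f0 = before
    t1, f1 = after
    if not t0 < t1:
        raise ValueError("`before` must be earlier than `after`")
    ratio = (timestamp_ms - t0) / (t1 - t0)
    return [a + (b - a) * ratio for a, b in zip(f0, f1)]


# The packet stamped 50 ms was discarded; its neighbors are at 0 ms and 100 ms.
estimated = interpolate_features((0, [0.0, 1.0]), (100, [1.0, 0.0]), 50)
```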

The embodiments described above are intended to facilitate understanding of the present invention and are not intended to limit its interpretation. The elements of the embodiments and their arrangement, materials, conditions, shapes, sizes, and the like are not limited to those illustrated and can be changed as appropriate. In addition, configurations shown in different embodiments can be partially replaced or combined.

DESCRIPTION OF SYMBOLS 1 ... Avatar operation system, 2, 2A, 2B ... User terminal, 10 ... User terminal setting processing unit, 20 ... Transmission processing unit, 21 ... Facial expression packet processing unit, 211 ... Partial moving image data extraction unit, 212 ... Feature amount extraction unit, 213 ... First time stamp unit, 214 ... Facial expression packet generation unit, 22 ... Voice packet processing unit, 221 ... Partial voice data extraction unit, 222 ... Second time stamp unit, 223 ... Voice packet generation unit, 23 ... Moving image data buffer, 24 ... Voice data buffer, 30 ... Reception processing unit, 31 ... Reception unit, 32A ... Voice buffer processing unit, 32B ... Facial expression buffer processing unit, 33 ... Avatar output processing unit, 34 ... Avatar model storage unit, 35 ... Voice output processing unit, 41 ... Imaging unit, 42 ... Voice input unit, 43 ... Display unit, 44 ... Voice output unit, 3 ... Server device, 50 ... Server device setting processing unit, 60 ... Packet processing unit

Claims

An information processing apparatus comprising:
a partial moving image data extraction unit that extracts partial moving image data for each predetermined first period from moving image data generated by an imaging unit;
a feature amount extraction unit that extracts a feature amount of a subject included in the partial moving image data;
a first time stamp unit that generates a first time stamp according to the time when the partial moving image data is extracted by the partial moving image data extraction unit;
a first packet generation unit that generates a first packet including the feature amount and the first time stamp and transmits the first packet;
a partial voice data extraction unit that extracts partial voice data for each predetermined second period from voice data generated by a voice input unit;
a second time stamp unit that generates a second time stamp according to the time when the partial voice data is extracted by the partial voice data extraction unit; and
a second packet generation unit that generates a second packet including the partial voice data and the second time stamp and transmits the second packet.

The information processing apparatus according to claim 1, wherein the feature amount is a feature amount of at least one part included in the subject.

The information processing apparatus according to claim 2, wherein the at least one part is a part included in the face of the subject.

The information processing apparatus according to claim 1, wherein the voice data includes data of a voice emitted from the subject.

A program for causing a computer including a storage unit to realize:
a partial moving image data extraction unit that extracts partial moving image data for each predetermined first period from moving image data generated by an imaging unit;
a feature amount extraction unit that extracts a feature amount of a subject included in the partial moving image data;
a first time stamp unit that generates a first time stamp according to the time when the partial moving image data is extracted by the partial moving image data extraction unit;
a first packet generation unit that generates a first packet including the feature amount and the first time stamp and transmits the first packet;
a partial voice data extraction unit that extracts partial voice data for each predetermined second period from voice data generated by a voice input unit;
a second time stamp unit that generates a second time stamp according to the time when the partial voice data is extracted by the partial voice data extraction unit; and
a second packet generation unit that generates a second packet including the partial voice data and the second time stamp and transmits the second packet.

An information processing apparatus comprising:
a storage unit that stores an avatar model;
a receiving unit that receives a plurality of first packets each including a feature amount of a subject and a first time stamp, and a plurality of second packets each including voice data and a second time stamp;
an avatar output processing unit that extracts the feature amount from each of the plurality of first packets and causes a display unit to display an avatar obtained by inputting the feature amount into the avatar model; and
a voice output processing unit that extracts the voice data from each of the plurality of second packets and causes a voice output unit to output a voice based on the voice data,
wherein the avatar output processing unit acquires the second time stamp included in each of the plurality of second packets and causes the display unit to display the avatar so that the voice and the avatar are synchronized based on the first time stamp and the second time stamp.

A program for causing a computer including a storage unit to realize:
a receiving unit that receives a plurality of first packets each including a feature amount of a subject and a first time stamp, and a plurality of second packets each including voice data and a second time stamp;
an avatar output processing unit that extracts the feature amount from each of the plurality of first packets and causes a display unit to display an avatar obtained by inputting the feature amount into an avatar model; and
a voice output processing unit that extracts the voice data from each of the plurality of second packets and causes a voice output unit to output a voice based on the voice data,
wherein the avatar output processing unit acquires the second time stamp included in each of the plurality of second packets and causes the display unit to display the avatar so that the voice and the avatar are synchronized based on the first time stamp and the second time stamp.