JP2019207509A

JP2019207509A - Moving image creation system, method, and program

Info

Publication number: JP2019207509A
Application number: JP2018101927A
Authority: JP
Inventors: 康伸佐々木; Yasunobu Sasaki
Original assignee: DeNA Co Ltd
Current assignee: DeNA Co Ltd
Priority date: 2018-05-28
Filing date: 2018-05-28
Publication date: 2019-12-05
Anticipated expiration: 2038-05-28
Also published as: JP2022095625A; JP7038602B2; JP7373599B2

Abstract

To improve the entertainment value related to the output of sound included in a moving image. SOLUTION: A moving image creation apparatus 10 has a function for creating a moving image. The apparatus 10 creates a moving image including an image corresponding to a virtual space and an input sound and, in response to a touch operation on a predetermined area that displays the image, converts the input sound into text and arranges the corresponding text object in the virtual space. This makes it possible to easily create a moving image in which an object corresponding to the input sound is arranged in the virtual space. SELECTED DRAWING: Figure 1

Description

The present invention relates to a system, a method, and a program for creating a moving image.

Conventionally, systems that allow a user to distribute moving images have been provided (see, for example, Patent Document 1). For example, a user can shoot a moving image that includes images input through a camera of a user terminal, such as a smartphone or a personal computer, and sound input through a microphone of the same user terminal, and can distribute the shot moving image to a plurality of viewers.

Patent Document 1: JP 2017-121036 A

However, in the conventional system described above, the sound included in a moving image is merely output together with the images included in the same moving image, and the result can be lacking in interest. There is thus room for improvement in the entertainment value of the output of sound included in a moving image.

One object of the embodiments of the present invention is to improve the entertainment value related to the output of sound included in a moving image. Other objects of the embodiments of the present invention will become apparent by referring to the entire specification.

A system according to an embodiment of the present invention is a system for creating a moving image that includes one or more computer processors, wherein the one or more computer processors, in response to executing readable instructions, execute: a process of presenting to a user a screen having a predetermined area that displays an image corresponding to a virtual space; a process of creating a moving image that includes the image corresponding to the virtual space and sound input by the user; and a process of, in response to a touch operation by the user on the predetermined area, converting the input sound into text and arranging a text object corresponding to the converted text at a position in the virtual space based on the position in the predetermined area at which the touch operation was performed.

A method according to an embodiment of the present invention is a method for creating a moving image that is executed by one or more computers, the method including: a step of presenting to a user a screen having a predetermined area that displays an image corresponding to a virtual space; a step of creating a moving image that includes the image corresponding to the virtual space and sound input by the user; and a step of, in response to a touch operation by the user on the predetermined area, converting the input sound into text and arranging a text object corresponding to the converted text at a position in the virtual space based on the position in the predetermined area at which the touch operation was performed.

A program according to an embodiment of the present invention is a program for creating a moving image that, in response to being executed on one or more computers, causes the one or more computers to execute: a process of presenting to a user a screen having a predetermined area that displays an image corresponding to a virtual space; a process of creating a moving image that includes the image corresponding to the virtual space and sound input by the user; and a process of, in response to a touch operation by the user on the predetermined area, converting the input sound into text and arranging a text object corresponding to the converted text at a position in the virtual space based on the position in the predetermined area at which the touch operation was performed.

Various embodiments of the present invention improve the entertainment value related to the output of sound included in a moving image.

FIG. 1 is a configuration diagram schematically showing the configuration of a moving image creation apparatus 10 according to an embodiment of the present invention.
FIG. 2 is a block diagram schematically showing the functions of the moving image creation apparatus 10.
FIG. 3 is a diagram illustrating a moving image creation screen 60.
FIG. 4 is a flowchart illustrating a process executed by the moving image creation apparatus 10 to control the movement of an avatar 102.
FIG. 5 is a diagram schematically illustrating an input image 50 input via the in-camera.
FIG. 6 is a diagram for explaining how the user's face and both hands included in the input image 50 are recognized.
FIG. 7 is a diagram illustrating the input image 50.
FIG. 8 is a diagram illustrating the moving image creation screen 60.
FIG. 9 is a flowchart illustrating a process executed by the moving image creation apparatus 10 in response to detection of a touch operation on an image display area 62.
FIG. 10 is a diagram illustrating the moving image creation screen 60.
FIG. 11 is a diagram illustrating the moving image creation screen 60.
FIG. 12 is a diagram illustrating the moving image creation screen 60.
FIG. 13 is a diagram illustrating a distributor screen 80.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

FIG. 1 is a configuration diagram schematically showing the configuration of a moving image creation apparatus 10 according to an embodiment of the present invention. The moving image creation apparatus 10 has a function for creating a moving image and is an example of an apparatus that implements part or all of the system of the present invention.

The moving image creation apparatus 10 is configured as a general-purpose computer and, as shown in FIG. 1, includes a computer processor 11 such as a CPU or a GPU, a main memory 12, a user I/F 13, a communication I/F 14, and a storage (storage device) 15. These components are electrically connected via a bus or the like (not shown).

The computer processor 11 reads various programs stored in the storage 15 or the like into the main memory 12 and executes the various instructions included in those programs. The main memory 12 is configured by, for example, a DRAM.

The user I/F 13 includes various input/output devices for exchanging information with the user. The user I/F 13 includes, for example, information input devices such as a keyboard and a pointing device (for example, a mouse or a touch panel), a sound input device such as a microphone, and an image input device such as a camera. The user I/F 13 also includes an image output device such as a display and a sound output device such as a speaker.

The communication I/F 14 is implemented as hardware such as a network adapter, various types of communication software, or a combination of these, and is configured to enable wired or wireless communication.

The storage 15 is configured by, for example, a magnetic disk or a flash memory. The storage 15 stores various programs, including an operating system, as well as various data. The programs stored in the storage 15 may include an application program for realizing the function of creating a moving image (hereinafter sometimes referred to as the "moving image creation application").

In the present embodiment, the moving image creation apparatus 10 can be configured as a smartphone, a tablet terminal, a personal computer, a wearable device, or the like.

Next, the functions of the moving image creation apparatus 10 of the present embodiment will be described. FIG. 2 is a block diagram schematically showing the functions of the moving image creation apparatus 10. As illustrated, the moving image creation apparatus 10 has an information storage management unit 41 that stores and manages various information, a moving image creation unit 43 that creates a moving image, and a virtual space control unit 45 that controls a virtual space. These functions are realized by cooperation between hardware, such as the computer processor 11 and the main memory 12, and the various programs and data stored in the storage 15 or the like. For example, they are realized by the computer processor 11 executing instructions included in a program read into the main memory 12.

The information storage management unit 41 stores and manages various information in the storage 15 or the like. The moving image creation unit 43 executes various processes related to the creation of a moving image. In the present embodiment, the moving image creation unit 43 is configured to present to the user a screen having a predetermined area that displays an image corresponding to the virtual space. For example, the moving image creation unit 43 is configured to display the screen having the predetermined area on a display or the like.

The moving image creation unit 43 is also configured to create a moving image that includes the image corresponding to the virtual space and the input sound. For example, the moving image creation unit 43 is configured to create (record) a moving image that includes the image of the virtual space displayed in the predetermined area and the sound input via a microphone. The created moving image is stored in, for example, the storage 15 or the like.

The virtual space control unit 45 executes various processes related to the control of the virtual space. In the present embodiment, the virtual space control unit 45 is configured to convert the input sound into text in response to a touch operation by the user on the predetermined area, and to arrange a text object corresponding to the converted text in the virtual space. The text object is arranged at a position in the virtual space based on the position in the predetermined area at which the touch operation was performed.

In this way, the moving image creation apparatus 10 of the present embodiment creates a moving image that includes an image corresponding to the virtual space and the input sound and, in response to a touch operation on the predetermined area that displays the image, converts the input sound into text and arranges the corresponding text object in the virtual space. This makes it possible to easily create a moving image in which an object corresponding to the input sound is arranged in the virtual space. In other words, the moving image creation apparatus 10 of the present embodiment improves the entertainment value related to the output of sound included in a moving image.

In the present embodiment, the virtual space control unit 45 may be configured to convert into text the sound that is input during the period from when the touch state on the predetermined area starts until it is released. For example, the virtual space control unit 45 is configured to start recording the input sound when the touch state starts and, when the touch state is released, to convert the recorded sound into text and arrange a text object corresponding to the converted text. Such a configuration allows a text object to be arranged with a simple operation.

The virtual space control unit 45 may also be configured to arrange the text object corresponding to the converted text so that it is given a visual effect based on the direction of a flick operation and/or slide operation performed after the touch state on the predetermined area has started. For example, the virtual space control unit 45 is configured so that, when the direction of the flick operation or slide operation performed as the touch state on the predetermined area is released is a first direction (for example, rightward), a first visual effect (for example, a fade-in effect) is given to the text object, whereas when the direction of the flick operation or slide operation is a second direction (for example, leftward), a second visual effect (for example, a fade-out effect) is given to the text object. Such a configuration allows a visual effect to be given to a text object with a simple operation.

In the present embodiment, the virtual space is configured to include, for example, an object that displays video input (captured) via a camera. In this case, the created moving image is configured as, for example, a moving image in which the real user appears. Alternatively, the virtual space is configured to include, for example, an avatar operated by the user. In this case, the created moving image is configured as a moving image in which the avatar appears in place of the real user, and the virtual space control unit 45 is configured to control the movement of the avatar in the virtual space. The virtual space can then be configured so that the avatar can touch the arranged text object. Because such a configuration makes it possible to touch the text object via the avatar, the entertainment value of the created moving image can be improved.

When the virtual space includes the user's avatar, the virtual space control unit 45 may be configured to control the movement of the avatar in response to, for example, operations by the user on a touch panel, a physical controller, or the like. The virtual space control unit 45 may also be configured to control the movement of the avatar based at least on the posture of the user included in an image input via a camera (for example, so that the avatar moves in conjunction with the user's posture). Detection of the posture (bones) of the user included in the image can be realized by applying, for example, known human pose estimation technology. The virtual space control unit 45 may further be configured to control the movement of the avatar based at least on the arrangement, in the input image, of one or more predetermined parts of the user's body (for example, the face and both hands) included in that image (for example, so that the avatar moves in conjunction with the arrangement of the predetermined parts). Such a configuration makes it possible to move the avatar based on the movement of the real user.

Next, a specific example of the moving image creation apparatus 10 of the present embodiment having these functions will be described. The moving image creation apparatus 10 in this example is configured as a smartphone, a tablet terminal, a personal computer, or the like, and has the moving image creation application installed. The moving image creation apparatus 10 in this example is configured to create a moving image that includes an avatar.

FIG. 3 illustrates a moving image creation screen 60 displayed on the display or the like of the moving image creation apparatus 10. The screen 60 is a screen for creating a moving image and, as illustrated, has an image display area (predetermined area) 62 that displays the image included in the moving image being created, and an instruction button 64 for instructing the start and end of creation (recording) of the moving image.

The image display area 62 displays an image of the virtual space 100 viewed with a specific field of view (via a virtual camera at a specific position). The virtual space 100 is configured as a three-dimensional virtual space and includes a humanoid avatar 102 and a desk object 104 positioned in front of the avatar 102.

The processing related to controlling the movement of the avatar 102 included in the virtual space 100 will now be described. FIG. 4 is a flowchart illustrating the process that the moving image creation apparatus 10 executes in this example to control the movement of the avatar 102. For example, the apparatus 10 executes the process illustrated in FIG. 4 in response to the moving image creation screen 60 being displayed.

As shown in FIG. 4, the moving image creation apparatus 10 first recognizes the user's face and both hands included in an input image input via the in-camera (front-facing camera) (step S100). The in-camera is configured so that its field of view includes the user looking at the screen displayed on the apparatus 10. The user moves his or her body in front of the in-camera to move the avatar 102 while watching the image of the virtual space 100 shown in the image display area 62 of the moving image creation screen 60.

FIG. 5 schematically illustrates an input image 50 input via the in-camera. As illustrated, in this example a circular marker MK1 of a first color (for example, red) is provided on the palm of the user's right hand RH, and a circular marker MK2 of a second color (for example, yellow) is provided on the palm of the user's left hand LH. The markers MK1 and MK2 are configured as, for example, stickers to be attached to the palms, and the stickers are provided to the user by, for example, the provider of the moving image creation application. Alternatively, the markers MK1 and MK2 are drawn directly on the palms with ink or the like. In that case, for example, the user draws the markers MK1 and MK2 on the palms of both hands according to instructions distributed by the provider of the moving image creation application.

FIG. 6 is a diagram for explaining how the user's face and both hands included in the input image 50 illustrated in FIG. 5 are recognized. As illustrated, in this example the user's face FC is detected and recognized as a rectangular detection area DA1 that encloses the outline of the face FC. The user's hands RH and LH are detected and recognized as rectangular detection areas DA2 and DA3 that enclose the outlines of the markers MK1 and MK2 provided on the palms of the hands RH and LH, respectively. The recognition (and subsequent tracking) of the face FC and both hands RH and LH (markers MK1 and MK2) is realized using known object tracking technology, for example, using a trained model generated through machine learning.
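To make the idea of a rectangular detection area around a colored marker concrete, the following is a minimal sketch. It deliberately swaps in plain color thresholding on an in-memory pixel grid, which is a much simpler technique than the object tracking and trained model that the text refers to. The function names, the pixel format, and the threshold values are all assumptions made for illustration.

```python
def marker_bounding_box(pixels, is_marker_color):
    """Return (x_min, y_min, x_max, y_max) enclosing all matching pixels.

    `pixels` is a list of rows, each row a list of (r, g, b) tuples.
    Returns None when no pixel matches. This is a stand-in for the
    detection areas DA2/DA3, not the tracking method of the source.
    """
    xs, ys = [], []
    for y, row in enumerate(pixels):
        for x, rgb in enumerate(row):
            if is_marker_color(rgb):
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


def is_red(rgb):
    """Hypothetical threshold for a red marker such as MK1."""
    r, g, b = rgb
    return r > 200 and g < 80 and b < 80


def is_yellow(rgb):
    """Hypothetical threshold for a yellow marker such as MK2."""
    r, g, b = rgb
    return r > 200 and g > 200 and b < 80
```

Using a different color per hand, as the example describes, is what lets a sketch like this tell the right hand from the left without any further analysis.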

Returning to the flowchart of FIG. 4, once the user's face and both hands included in the input image have been recognized, the moving image creation apparatus 10 next controls the movement of the avatar based on the arrangement of the user's face and both hands in the input image (step S110). This control of the avatar's movement based on the arrangement of the user's face and both hands in the input image is repeated until creation of the moving image ends (for example, until display of the moving image creation screen 60 ends) (NO in step S120).

In this example, the movement of the avatar 102 is controlled so as to reproduce the positions of both hands RH and LH relative to the position of the user's face FC in the input image 50 input via the in-camera (that is, the positional relationship between the real user's face and hands). For example, as illustrated in FIG. 7, when the real user takes a "banzai" pose (both arms raised) and the user's hands RH and LH (markers MK1 and MK2) in the input image 50 move diagonally above the user's face FC, the avatar 102 in the virtual space 100 likewise takes a "banzai" pose (moves both hands diagonally above its face), as illustrated in FIG. 8.
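One way to reproduce the hands' positions relative to the face is to express each hand's offset in units of the face width and then scale that offset by the avatar's head width. The sketch below assumes this normalization; the source does not state how the relative positions are computed. `Box`, `hand_offset_in_face_units`, and `avatar_hand_target` are hypothetical names, the sketch is two-dimensional, and it ignores any mirroring of the in-camera image.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Rectangular detection area in image pixels (x grows right, y grows down)."""
    x: float
    y: float
    w: float
    h: float

    @property
    def center(self):
        return (self.x + self.w / 2, self.y + self.h / 2)


def hand_offset_in_face_units(face, hand):
    """Offset of the hand from the face, measured in face widths.

    The y axis is flipped so that a positive y means "above the face".
    """
    fx, fy = face.center
    hx, hy = hand.center
    return ((hx - fx) / face.w, (fy - hy) / face.w)


def avatar_hand_target(head_pos, head_width, offset):
    """Scale a normalized offset into avatar space (x right, y up)."""
    return (head_pos[0] + offset[0] * head_width,
            head_pos[1] + offset[1] * head_width)


# Example: a hand raised diagonally above the face, as in a "banzai" pose.
face = Box(x=300, y=200, w=100, h=120)
right_hand = Box(x=480, y=40, w=40, h=40)
offset = hand_offset_in_face_units(face, right_hand)
target = avatar_hand_target(head_pos=(0.0, 1.6), head_width=0.2, offset=offset)
```

Normalizing by the face width keeps the avatar's pose roughly stable when the user moves closer to or farther from the camera, since the face and the hand offsets scale together.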

When the user selects the instruction button 64, recording of the moving image starts. Specifically, a moving image that includes the image displayed in the image display area 62 and the sound input via the microphone is recorded. When the user selects the instruction button 64 again, recording of the moving image stops. The created moving image is stored in a predetermined area of the storage 15 or the like. Thus, in this example, by speaking while moving his or her body in front of the in-camera, the user can easily create a moving image that includes the user's own voice and an image corresponding to the virtual space 100 containing the avatar 102, which moves following the movement of the user's body.

The operation related to the arrangement of text objects in this example will now be described. FIG. 9 is a flowchart illustrating the process that the moving image creation apparatus 10 executes in response to detecting a touch operation by the user on the image display area 62 of the moving image creation screen 60. As illustrated, when a touch operation on the image display area 62 is detected, the apparatus 10 first records the input sound that is input via the microphone until the touch state on the image display area 62 is released (step S200; NO in step S210).

When the touch state on the image display area 62 is released (YES in step S210), the apparatus 10 stops recording (step S220) and converts the recorded sound into text (step S230). The conversion of the recorded sound into text is realized by applying known speech recognition technology. In this example, the input sound during the period from when the touch state on the image display area 62 starts until it is released is also included in the created moving image (it is not muted). Alternatively, the sound in the created moving image may be muted during that period.

The moving image creation apparatus 10 then arranges a text object corresponding to the converted text in the virtual space 100 (step S240). The text object is the converted text configured as a three-dimensional object in the virtual space 100, and it is arranged at a position in the virtual space 100 based on the position at which the touch operation on the image display area 62 was performed.
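The flow of steps S200 to S240 can be sketched as a small event handler. This is only an illustration under stated assumptions: `HoldToDictate`, `transcribe`, and `place_text` are hypothetical names, audio is modeled as byte chunks delivered by some capture loop, and the speech recognizer is injected as a callback because the source only says that known speech recognition technology is applied.

```python
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]


class HoldToDictate:
    """Record sound while the display area is touched; place text on release."""

    def __init__(self,
                 transcribe: Callable[[bytes], str],
                 place_text: Callable[[str, Point], None]) -> None:
        self._transcribe = transcribe    # hypothetical speech-to-text callback
        self._place_text = place_text    # hypothetical scene callback
        self._chunks: List[bytes] = []
        self._touch_pos: Optional[Point] = None

    def on_touch_start(self, pos: Point) -> None:
        # A touch on the image display area starts a fresh recording.
        self._touch_pos = pos
        self._chunks = []

    def on_audio_chunk(self, chunk: bytes) -> None:
        # S200: keep recording only while the touch state continues.
        if self._touch_pos is not None:
            self._chunks.append(chunk)

    def on_touch_end(self) -> None:
        if self._touch_pos is None:
            return
        pos, self._touch_pos = self._touch_pos, None      # S220: stop recording
        text = self._transcribe(b"".join(self._chunks))   # S230: sound to text
        if text:  # assumption: place nothing when recognition returns no text
            self._place_text(text, pos)                   # S240: place the text
```

Keeping the recognizer and the scene behind callbacks means the same handler works whether or not the moving image itself is being recorded at the time, which matches the description that the held-down sound may or may not be muted in the output.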

FIG. 10 illustrates the moving image creation screen 60 in a state where a touch operation has been performed on the image display area 62 and the touch state is continuing. In this case, a circular touch position display object 70 centered on the position being touched is displayed in the image display area 62. In this state, the user inputs (speaks) the sound to be converted into a text object.

FIG. 11 illustrates the moving image creation screen 60 in a state where the touch state on the image display area 62 has been released from the state of FIG. 10 and a text object 106 has been arranged in the virtual space 100. In the example of FIG. 11, the spoken phrase "こんにちは" ("hello") was input while the touch state on the image display area 62 continued, and the text object 106 is configured as an object corresponding to the text "こんにちは". In this example, the text object 106 is arranged so that its beginning (in the example of FIG. 11, the object corresponding to the character "こ") is at the position in the virtual space 100 that corresponds to the position where the touch operation was performed (the display position of the touch position display object 70). The position at which the text object 106 is arranged is not limited to this. For example, the text object 106 can be arranged at any of various positions in the virtual space 100 that can be identified based on the position in the image display area 62 at which the touch operation was performed.
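The source does not say how a touch position is mapped to a position in the virtual space. One plausible approach, sketched here purely as an assumption, is to unproject the touch point through a pinhole virtual camera onto a plane at a fixed depth in front of it, then lay out one object per character starting from that point. The function names, the camera model (camera at the origin looking down the negative z axis), and the default field of view, depth, and spacing are all hypothetical.

```python
import math


def touch_to_world(px, py, width, height, fov_y_deg=60.0, depth=2.0):
    """Map a touch at pixel (px, py) to a point on a plane `depth` units ahead.

    Pixel origin is the top-left corner of the image display area.
    The virtual camera sits at the origin and looks down -z (assumption).
    """
    nx = 2.0 * px / width - 1.0     # -1 at the left edge, +1 at the right edge
    ny = 1.0 - 2.0 * py / height    # +1 at the top edge, -1 at the bottom edge
    half_h = depth * math.tan(math.radians(fov_y_deg) / 2.0)
    half_w = half_h * width / height
    return (nx * half_w, ny * half_h, -depth)


def layout_characters(text, origin, spacing=0.25):
    """Place one object per character, the first one at `origin`."""
    x0, y0, z0 = origin
    return [(ch, (x0 + i * spacing, y0, z0)) for i, ch in enumerate(text)]


# The first character lands where the user touched; the rest extend rightward.
origin = touch_to_world(100, 120, width=400, height=300)
characters = layout_characters("こんにちは", origin)
```

Anchoring the first character at the unprojected point mirrors the example in FIG. 11; anchoring the center of the string instead would be an equally valid choice under the text's statement that other positions are possible.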

In this example, the arranged text object 106 disappears after a predetermined time (for example, 5 seconds) has elapsed. The avatar 102 can also touch the text object 106. For example, the text object 106 is configured so that each character moves independently (each character is configured as a separate object). As shown in FIG. 12, for example, when the right hand of the avatar 102 touches the object for the character "こ" at the beginning of the text object 106, only that object can be moved. In this way, the user can arrange the text object 106 in the virtual space 100 by a touch operation on the image display area 62 and also move the arranged text object 106 via the avatar 102. The behavior of the arranged text object 106 is not limited to this. For example, the text object 106 may be controlled so that it moves downward (falls) at a predetermined speed after being arranged. Alternatively, the entire text object 106, or the object corresponding to each character, may disappear when touched by the avatar 102, or a predetermined effect (a change in shape, light emission, or the like) may occur when the avatar 102 touches it.
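The timed disappearance and the optional falling behavior can be sketched as a per-frame update over per-character objects. This is a minimal illustration, not the source's implementation: `CharObject` and `step` are hypothetical names, only the vertical coordinate is modeled, and collision with the avatar is left out.

```python
from dataclasses import dataclass


@dataclass
class CharObject:
    """One character of a text object, tracked independently."""
    char: str
    y: float           # height in the virtual space
    age: float = 0.0   # seconds since the object was placed


def step(objects, dt, lifetime=5.0, fall_speed=0.0):
    """Advance every object by `dt` seconds and drop the expired ones.

    `lifetime` follows the 5-second example in the text; `fall_speed`
    models the optional downward movement (0.0 means no falling).
    """
    survivors = []
    for obj in objects:
        obj.age += dt
        obj.y -= fall_speed * dt
        if obj.age < lifetime:
            survivors.append(obj)
    return survivors
```

Because each character carries its own state, extending this sketch so that a character touched by the avatar is removed or moved on its own would only require changing that one element of the list.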

In this example, when a flick operation is performed upon releasing the touch state on the image display area 62, a visual effect based on the direction of the flick operation is applied to the text object to be arranged. For example, when a rightward flick is performed upon releasing the touch state, the text object 106 is arranged (displayed) with a fade-in effect, whereas when a leftward flick is performed upon releasing the touch state, the text object 106 is erased with a fade-out effect after being arranged.
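The mapping from flick direction to visual effect can be illustrated with a small Python function. The function name, the effect labels, and the `min_distance` threshold used to tell a flick from a plain release are hypothetical; the specification names only the right-flick (fade-in) and left-flick (fade-out) examples.

```python
def flick_effect(start, end, min_distance=30.0):
    """Choose a visual effect from the movement between the point where the
    flick began (start) and the point where the touch was released (end).

    Returns "fade_in" for a rightward flick, "fade_out" for a leftward
    flick, and "none" when the movement is too short or mostly vertical."""
    dx = end[0] - start[0]
    dy = end[1] - start[1]
    if abs(dx) < min_distance or abs(dx) <= abs(dy):
        return "none"
    return "fade_in" if dx > 0 else "fade_out"


effect = flick_effect(start=(100, 100), end=(180, 105))  # rightward flick
```

Other directions could be mapped to further effects in the same way, since the specification ties the effect to the flick direction in general.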

In the examples described above, the created moving image may be distributed live. In this case, for example, the moving image creation apparatus 10 transmits the created moving image in a streaming format to a moving image distribution server, and the moving image distribution server distributes the moving image in a streaming format to the user terminals (smartphones and the like) of a plurality of viewers. FIG. 13 illustrates a distributor screen 80 displayed on a display or the like of the moving image creation apparatus 10 configured as, for example, the user terminal of a distributor who performs live distribution of a moving image. The screen 80 displays a three-dimensional virtual space 200 across the entire screen; in the virtual space 200, the distributor's avatar 202 is arranged on a stage 204, and the avatars 208 of the respective viewers are arranged in an audience area 206. The avatar 202 is operated by the distributor (for example, controlled based on the arrangement of the distributor's face and both hands included in an image input via the camera). In response to a touch operation on the distributor screen 80 by the distributor, the input voice is converted into text and the corresponding text object is arranged in the virtual space 200. The avatar 202 can touch the arranged text object in the virtual space 200. The avatar 202 may also be able to touch, in the same manner as the text object, other objects arranged in the virtual space 200 (for example, items (gifts) thrown in by the viewers' avatars 208).

In the examples described above, the virtual spaces 100 and 200 include the user's avatars 102 and 202. In another example of the present embodiment, however, the virtual space does not include the user's avatar, and an object or the like that displays an image input via the camera may be arranged in it.

The moving image creation apparatus 10 according to the present embodiment described above creates a moving image including an image corresponding to a virtual space and an input voice and, in response to a touch operation on a predetermined area that displays the image (for example, the image display area 62 of the moving image creation screen 60), converts the input voice into text and arranges the corresponding text object in the virtual space. This makes it possible to easily create a moving image in which an object corresponding to the input voice is arranged in the virtual space. In other words, the moving image creation apparatus 10 of the present embodiment improves the entertainment value of the output of the voice included in a moving image.

In another embodiment of the present invention, at least some of the above-described functions of the moving image creation apparatus 10 are realized through cooperation between the moving image creation apparatus 10 and a server (for example, the above-described moving image distribution server) communicably connected to the apparatus 10 via a communication network such as the Internet. For example, analysis of the image input via the camera of the moving image creation apparatus 10, text conversion of the voice input via the microphone, and creation (and distribution) of the moving image may be performed on the server side.

The processes and procedures described in this specification are realized by software, hardware, or any combination thereof, in addition to the means explicitly described. For example, the processes and procedures described in this specification are realized by implementing logic corresponding to them on a medium such as an integrated circuit, a volatile memory, a non-volatile memory, or a magnetic disk. The processes and procedures described in this specification can also be implemented as a computer program corresponding to them and executed by various computers.

Even where the processes and procedures described in this specification are described as being executed by a single device, piece of software, component, or module, such processes or procedures may be executed by multiple devices, multiple pieces of software, multiple components, and/or multiple modules. The software and hardware elements described in this specification can also be realized by integrating them into fewer components or by dividing them into more components.

In this specification, even where a constituent element of the invention is described as being either singular or plural, or is described without being limited to singular or plural, the constituent element may be singular or plural unless the context requires otherwise.

DESCRIPTION OF SYMBOLS
10 Moving image creation apparatus
11 Computer processor
41 Information storage management unit
43 Moving image creation unit
45 Virtual space control unit
50 Input image
60 Moving image creation screen
62 Image display area (predetermined area)
80 Distributor screen
100, 200 Virtual space
102, 202 Avatar
106 Text object

Claims

A system for creating a moving image, comprising one or more computer processors,
wherein the one or more computer processors, in response to execution of readable instructions, execute:
a process of presenting to a user a screen having a predetermined area that displays an image corresponding to a virtual space;
a process of creating a moving image including the image corresponding to the virtual space and a voice input by the user; and
a process of, in response to a touch operation on the predetermined area by the user, converting the input voice into text and arranging a text object corresponding to the converted text at a position in the virtual space based on the position in the predetermined area where the touch operation was performed.

The system of claim 1,
wherein the arranging process includes converting into text the voice input during the period from the start to the release of a touch state on the predetermined area.

The system according to claim 1 or 2,
wherein the arranging process includes arranging the corresponding text object so that a visual effect is applied based on the direction of a flick operation and/or a slide operation performed after a touch state on the predetermined area is started.

The system according to claim 1,
wherein the one or more computer processors further execute a process of controlling the movement of an avatar operated by the user in the virtual space, and
the virtual space is configured so that the avatar can touch the arranged text object.

The system of claim 4,
wherein the process of controlling the movement of the avatar includes controlling the movement of the avatar based at least on the posture of the user included in an image input via a camera.

The system according to any one of claims 1 to 5,
wherein the one or more computer processors further execute a process of distributing the created moving image in real time.

A method for creating a moving image, executed by one or more computers, the method comprising:
presenting to a user a screen having a predetermined area that displays an image corresponding to a virtual space;
creating a moving image including the image corresponding to the virtual space and a voice input by the user; and
in response to a touch operation on the predetermined area by the user, converting the input voice into text and arranging a text object corresponding to the converted text at a position in the virtual space based on the position in the predetermined area where the touch operation was performed.

A program for creating a moving image,
the program, in response to execution on one or more computers, causing the one or more computers to execute:
a process of presenting to a user a screen having a predetermined area that displays an image corresponding to a virtual space;
a process of creating a moving image including the image corresponding to the virtual space and a voice input by the user; and
a process of, in response to a touch operation on the predetermined area by the user, converting the input voice into text and arranging a text object corresponding to the converted text at a position in the virtual space based on the position in the predetermined area where the touch operation was performed.