JP7038602B2

JP7038602B2 - Systems, methods, and programs for creating videos

Info

Publication number: JP7038602B2
Application number: JP2018101927A
Authority: JP
Inventors: 康伸佐々木
Original assignee: DeNA Co Ltd
Current assignee: DeNA Co Ltd
Priority date: 2018-05-28
Filing date: 2018-05-28
Publication date: 2022-03-18
Anticipated expiration: 2038-05-28
Also published as: JP2019207509A; JP7373599B2; JP2022095625A

Description

The present invention relates to a system, a method, and a program for creating a video.

Conventionally, systems that allow users to distribute videos have been provided (see, for example, Patent Document 1). For example, a user can shoot a video that includes images input through a camera of a user terminal, such as a smartphone or a personal computer, and audio input through a microphone of the same user terminal, and can distribute the shot video to multiple viewers.

Patent Document 1: Japanese Unexamined Patent Publication No. 2017-121036

However, in the conventional system described above, the audio included in a video is merely output together with the images included in the same video, which can be uninteresting. There is therefore room for improvement in the entertainment value of how the audio included in a video is output.

One object of embodiments of the present invention is to improve the entertainment value of the output of audio included in a video. Other objects of embodiments of the present invention will become apparent by reference to the entire specification.

A system according to an embodiment of the present invention is a system for creating a video and comprises one or more computer processors, wherein the one or more computer processors, in response to execution of readable instructions, execute: a process of presenting to a user a screen having a predetermined area that displays an image corresponding to a virtual space; a process of creating a video including the image corresponding to the virtual space and audio input by the user; and a process of, in response to a touch operation by the user on the predetermined area, converting the input audio into text and placing a text object corresponding to the converted text at a position in the virtual space that is based on the position in the predetermined area where the touch operation was performed.

A method according to an embodiment of the present invention is a method for creating a video that is executed by one or more computers, the method comprising: a step of presenting to a user a screen having a predetermined area that displays an image corresponding to a virtual space; a step of creating a video including the image corresponding to the virtual space and audio input by the user; and a step of, in response to a touch operation by the user on the predetermined area, converting the input audio into text and placing a text object corresponding to the converted text at a position in the virtual space that is based on the position in the predetermined area where the touch operation was performed.

A program according to an embodiment of the present invention is a program for creating a video that, in response to execution on one or more computers, causes the one or more computers to execute: a process of presenting to a user a screen having a predetermined area that displays an image corresponding to a virtual space; a process of creating a video including the image corresponding to the virtual space and audio input by the user; and a process of, in response to a touch operation by the user on the predetermined area, converting the input audio into text and placing a text object corresponding to the converted text at a position in the virtual space that is based on the position in the predetermined area where the touch operation was performed.

Various embodiments of the present invention improve the entertainment value of the output of audio included in a video.

FIG. 1 is a configuration diagram schematically showing the configuration of a video creation device 10 according to an embodiment of the present invention.
FIG. 2 is a block diagram schematically showing the functions of the video creation device 10.
FIG. 3 is a diagram illustrating a video creation screen 60.
FIG. 4 is a flow diagram illustrating the processing executed by the video creation device 10 to control the motion of an avatar 102.
FIG. 5 is a diagram schematically illustrating an input image 50 input via the front camera.
FIG. 6 is a diagram for explaining how the user's face and both hands included in the input image 50 are recognized.
FIG. 7 is a diagram illustrating the input image 50.
FIG. 8 is a diagram illustrating the video creation screen 60.
FIG. 9 is a flow diagram illustrating the processing executed by the video creation device 10 in response to detection of a touch operation on an image display area 62.
FIG. 10 is a diagram illustrating the video creation screen 60.
FIG. 11 is a diagram illustrating the video creation screen 60.
FIG. 12 is a diagram illustrating the video creation screen 60.
FIG. 13 is a diagram illustrating a distributor screen 80.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 is a configuration diagram schematically showing the configuration of a video creation device 10 according to an embodiment of the present invention. The video creation device 10 has functions for creating videos and is an example of a device that implements part or all of the system of the present invention.

The video creation device 10 is configured as a general-purpose computer and, as shown in FIG. 1, includes a computer processor 11 such as a CPU or GPU, a main memory 12, a user I/F 13, a communication I/F 14, and a storage (storage device) 15. These components are electrically connected via a bus or the like (not shown).

The computer processor 11 loads various programs stored in the storage 15 or the like into the main memory 12 and executes the instructions contained in those programs. The main memory 12 is composed of, for example, DRAM.

The user I/F 13 includes various input/output devices for exchanging information with the user. The user I/F 13 includes, for example, information input devices such as a keyboard and a pointing device (for example, a mouse or a touch panel), an audio input device such as a microphone, and an image input device such as a camera. The user I/F 13 also includes an image output device such as a display and an audio output device such as a speaker.

The communication I/F 14 is implemented as hardware such as a network adapter, various communication software, and combinations of these, and is configured to enable wired or wireless communication.

The storage 15 is composed of, for example, a magnetic disk or flash memory. The storage 15 stores various programs, including an operating system, and various data. The programs stored in the storage 15 may include an application program that implements the functions for creating videos (hereinafter sometimes referred to as the "video creation app").

In the present embodiment, the video creation device 10 may be configured as a smartphone, a tablet terminal, a personal computer, a wearable device, or the like.

Next, the functions of the video creation device 10 of the present embodiment are described. FIG. 2 is a block diagram schematically showing the functions of the video creation device 10. As illustrated, the video creation device 10 has an information storage management unit 41 that stores and manages various information, a video creation unit 43 that creates videos, and a virtual space control unit 45 that controls a virtual space. These functions are realized by hardware such as the computer processor 11 and the main memory 12 operating in cooperation with the various programs and data stored in the storage 15 or the like. For example, they are realized by the computer processor 11 executing instructions contained in a program loaded into the main memory 12.

The information storage management unit 41 stores and manages various information in the storage 15 or the like. The video creation unit 43 executes various processes related to creating videos. In the present embodiment, the video creation unit 43 is configured to present to the user a screen having a predetermined area that displays an image corresponding to the virtual space. For example, the video creation unit 43 is configured to display the screen having the predetermined area on a display or the like.

The video creation unit 43 is also configured to create a video that includes the image corresponding to the virtual space and the input audio. For example, the video creation unit 43 is configured to create (record) a video that includes the image of the virtual space displayed in the predetermined area and the audio input via a microphone. The created video is stored, for example, in the storage 15 or the like.

The virtual space control unit 45 executes various processes related to controlling the virtual space. In the present embodiment, the virtual space control unit 45 is configured to, in response to a touch operation by the user on the predetermined area, convert the input audio into text and place a text object corresponding to the converted text in the virtual space. The text object is placed at a position in the virtual space that is based on the position in the predetermined area where the touch operation was performed.

In this way, the video creation device 10 of the present embodiment creates a video that includes an image corresponding to the virtual space and the input audio and, in response to a touch operation on the predetermined area displaying that image, converts the input audio into text and places the corresponding text object in the virtual space. This makes it easy to create a video in which an object corresponding to the input audio is placed in the virtual space. In other words, the video creation device 10 of the present embodiment improves the entertainment value of the output of audio included in a video.

In the present embodiment, the virtual space control unit 45 may be configured to convert into text the audio that is input during the period from when a touch state on the predetermined area starts until it ends. For example, the virtual space control unit 45 is configured to start recording the input audio when the touch state starts and, when the touch state ends, to convert the recorded audio into text and place a text object corresponding to the converted text. With this configuration, a text object can be placed with a simple operation.
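The touch-to-text behaviour described above can be sketched as a small state holder. This is only an illustrative sketch, not the implementation of the embodiment: the class name `TouchToTextSession`, its method names, and the injected `transcribe` callable (standing in for whatever speech recognition is used) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class TouchToTextSession:
    """Buffers audio between touch start and touch end (hypothetical helper)."""

    # Hypothetical hook: turns recorded audio bytes into text.
    transcribe: Callable[[bytes], str]
    _touch_pos: Optional[Tuple[float, float]] = field(default=None, init=False)
    _chunks: List[bytes] = field(default_factory=list, init=False)

    def on_touch_start(self, x: float, y: float) -> None:
        # The touch state starts: remember where and begin recording.
        self._touch_pos = (x, y)
        self._chunks = []

    def on_audio_chunk(self, chunk: bytes) -> None:
        # Audio is buffered only while the touch state continues.
        if self._touch_pos is not None:
            self._chunks.append(chunk)

    def on_touch_end(self) -> Optional[Tuple[str, Tuple[float, float]]]:
        # The touch state ends: stop recording, convert to text, and report
        # the text together with the touched position for placement.
        if self._touch_pos is None:
            return None
        pos, self._touch_pos = self._touch_pos, None
        text = self.transcribe(b"".join(self._chunks))
        self._chunks = []
        return (text, pos) if text else None
```

A caller would feed microphone chunks to `on_audio_chunk` continuously and use the pair returned by `on_touch_end` to decide what text object to place and where.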

The virtual space control unit 45 may also be configured to place the text object corresponding to the converted text so that it is given a visual effect based on the direction of a flick operation and/or slide operation performed after the touch state on the predetermined area has started. For example, the virtual space control unit 45 is configured so that, when the direction of the flick or slide operation performed as the touch state on the predetermined area ends is a first direction (for example, rightward), a first visual effect (for example, a fade-in effect) is given to the text object, whereas when the direction of the flick or slide operation is a second direction (for example, leftward), a second visual effect (for example, a fade-out effect) is given to the text object. With this configuration, a visual effect can be given to a text object with a simple operation.
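As a rough illustration of the direction-to-effect mapping, the sketch below classifies the movement between the touch start point and the release point. The `Effect` enum, the function name, and the 30-pixel threshold are assumptions made for this example; the text only specifies that a first direction (for example, rightward) and a second direction (for example, leftward) select different effects.

```python
from enum import Enum
from typing import Tuple


class Effect(Enum):
    NONE = "none"
    FADE_IN = "fade_in"
    FADE_OUT = "fade_out"


def effect_for_release(
    start: Tuple[float, float],
    end: Tuple[float, float],
    min_distance: float = 30.0,  # assumed threshold in pixels
) -> Effect:
    """Pick a visual effect from the horizontal direction of the gesture."""
    dx = end[0] - start[0]
    dy = end[1] - start[1]
    # Treat short or mostly vertical movements as a plain release.
    if abs(dx) < min_distance or abs(dx) <= abs(dy):
        return Effect.NONE
    # Rightward maps to the first effect, leftward to the second.
    return Effect.FADE_IN if dx > 0 else Effect.FADE_OUT
```

For instance, releasing 80 pixels to the right of the touch start point would select `Effect.FADE_IN` under these assumptions.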

In the present embodiment, the virtual space is configured to include, for example, an object that displays video input (captured) via a camera. In this case, the created video is, for example, a video in which the real user appears. The virtual space may also be configured to include, for example, an avatar operated by the user. In this case, the created video is a video in which the avatar appears in place of the real user, and the virtual space control unit 45 is configured to control the motion of the avatar in the virtual space. The virtual space may then be configured so that the avatar can touch the placed text object. Because this configuration allows the text object to be touched through the avatar, it can improve the entertainment value of the created video.

When the virtual space includes the user's avatar, the virtual space control unit 45 may be configured to control the motion of the avatar in response to the user's operation of, for example, a touch panel or a physical controller. The virtual space control unit 45 may also be configured to control the motion of the avatar based at least on the posture of the user included in an image input via a camera (for example, so that the avatar follows the user's posture). Detection of the posture (bones) of the user included in the image can be realized, for example, by applying a known human pose estimation technique. The virtual space control unit 45 may also be configured to control the motion of the avatar based at least on the arrangement, in the input image, of one or more predetermined parts of the user's body (for example, the face and both hands) included in that image (for example, so that the avatar follows the arrangement of those parts). These configurations make it possible to move the avatar based on the movements of the real user.

Next, a specific example of the video creation device 10 of the present embodiment having these functions is described. The video creation device 10 in this example is configured as a smartphone, a tablet terminal, a personal computer, or the like, and has the video creation app installed. The video creation device 10 in this example is configured to create videos that include an avatar.

FIG. 3 illustrates a video creation screen 60 displayed on the display or the like of the video creation device 10. The screen 60 is a screen for creating a video and, as illustrated, has an image display area (the predetermined area) 62 that displays the image to be included in the video being created, and an instruction button 64 for instructing the start and end of video creation (recording).

The image display area 62 displays an image of a virtual space 100 as seen from a specific field of view (through a virtual camera at a specific position). The virtual space 100 is configured as a three-dimensional virtual space and includes a humanoid avatar 102 and a desk object 104 located in front of the avatar 102.

The processing for controlling the motion of the avatar 102 included in the virtual space 100 is described here. FIG. 4 is a flow diagram illustrating the processing that the video creation device 10 executes in this example to control the motion of the avatar 102. For example, the device 10 executes the processing illustrated in FIG. 4 when the video creation screen 60 is displayed.

As shown in FIG. 4, the video creation device 10 first recognizes the user's face and both hands included in an input image input via the front camera (step S100). The front camera is configured so that its field of view includes the user looking at the screen displayed on the device 10. The user moves his or her body in front of the front camera to move the avatar 102 while watching the image of the virtual space 100 shown in the image display area 62 of the video creation screen 60.

FIG. 5 schematically illustrates an input image 50 input via the front camera. As illustrated, in this example a circular marker MK1 of a first color (for example, red) is provided on the palm of the user's right hand RH, and a circular marker MK2 of a second color (for example, yellow) is provided on the palm of the user's left hand LH. The markers MK1 and MK2 are configured, for example, as stickers to be attached to the palms, and the stickers are provided to the user by, for example, the provider of the video creation app. Alternatively, the markers MK1 and MK2 may be drawn directly on the palms with ink or the like. In that case, for example, the user draws the markers MK1 and MK2 on the palms of the respective hands according to instructions distributed by the provider of the video creation app.

FIG. 6 is a diagram for explaining how the user's face and both hands included in the input image 50 illustrated in FIG. 5 are recognized. As illustrated, in this example the user's face FC is detected and recognized as a rectangular detection area DA1 enclosing the outline of the face FC. The user's hands RH and LH are detected and recognized as rectangular detection areas DA2 and DA3, respectively, each enclosing the outline of the marker MK1 or MK2 provided on the palm of the corresponding hand. The recognition (and subsequent tracking) of the face FC and the hands RH and LH (markers MK1 and MK2) is realized using a known object tracking technique, for example using a trained model generated through machine learning.

Returning to the flow diagram of FIG. 4, after recognizing the user's face and both hands included in the input image, the video creation device 10 next controls the motion of the avatar based on the arrangement of the user's face and both hands in the input image (step S110). This control of the avatar's motion based on the arrangement of the user's face and both hands in the input image is repeated until video creation ends (for example, until the video creation screen 60 is no longer displayed) (NO in step S120).

In this example, the motion of the avatar 102 is controlled so as to reproduce the positions of the hands RH and LH relative to the position of the user's face FC in the input image 50 input via the front camera (that is, the positional relationship between the real user's face and hands). For example, when the real user raises both arms in a "banzai" pose as illustrated in FIG. 7, so that the user's hands RH and LH (markers MK1 and MK2) in the input image 50 move diagonally above the user's face FC, the avatar 102 in the virtual space 100 likewise takes a "banzai" pose (moves both hands diagonally above its face), as illustrated in FIG. 8.
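One way to picture "reproducing the relative position" is to express each hand's detection-area centre as an offset from the face's detection-area centre, normalized by the face width, and to re-apply that offset around the avatar's head. This normalization, the `Box` type, and both function names are assumptions for illustration only; the text does not specify how the relative positions are computed or whether the image is mirrored.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Box:
    """Rectangular detection area in image pixels (origin at top-left)."""
    x: float
    y: float
    w: float
    h: float

    @property
    def center(self) -> Tuple[float, float]:
        return (self.x + self.w / 2.0, self.y + self.h / 2.0)


def hand_offset(face: Box, hand: Box) -> Tuple[float, float]:
    """Hand position relative to the face, in units of face width, y pointing up."""
    fx, fy = face.center
    hx, hy = hand.center
    return ((hx - fx) / face.w, (fy - hy) / face.w)


def avatar_hand_position(
    head_pos: Tuple[float, float],
    head_width: float,
    offset: Tuple[float, float],
) -> Tuple[float, float]:
    """Apply the normalized offset around the avatar's head (assumed 2D plane)."""
    return (head_pos[0] + offset[0] * head_width,
            head_pos[1] + offset[1] * head_width)
```

With a hand detected above and to the side of the face, the offset has a positive vertical component, so the avatar's hand target ends up above its head, which matches the "banzai" example.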

When the user selects the instruction button 64, recording of the video starts. Specifically, a video including the image displayed in the image display area 62 and the audio input via the microphone is recorded. When the user selects the instruction button 64 again, recording of the video stops. The created video is stored in a predetermined area of the storage 15 or the like. Thus, in this example, by talking while moving in front of the front camera, the user can easily create a video that includes the user's own voice and an image corresponding to the virtual space 100 containing the avatar 102, which moves in step with the user's body movements.

The operation related to placing text objects in this example is described here. FIG. 9 is a flow diagram illustrating the processing that the video creation device 10 executes in response to detecting a touch operation by the user on the image display area 62 of the video creation screen 60. Upon detecting a touch operation on the image display area 62, the device 10 first, as illustrated, records the input audio input via the microphone until the touch state on the image display area 62 ends (step S200, NO in step S210).

When the touch state on the image display area 62 ends (YES in step S210), the device 10 stops recording (step S220) and converts the recorded audio into text (step S230). The conversion of the recorded audio into text is realized by applying a known speech recognition technique. In this example, the audio input during the period from the start to the end of the touch state on the image display area 62 is also included in the created video (it is not muted). Alternatively, the audio in the created video may be muted during that period.

Next, the video creation device 10 places a text object corresponding to the converted text in the virtual space 100 (step S240). The text object is the converted text configured as a three-dimensional object in the virtual space 100, and it is placed at a position in the virtual space 100 that is based on the position at which the touch operation on the image display area 62 was performed.
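The text does not say how a touched position is turned into a position in the virtual space. One common approach, shown here purely as an assumption, is to cast a ray from the virtual camera through the touched point and take the point at a fixed depth in front of the camera. The function name, the pinhole camera at the origin looking along the negative z axis, the 60-degree vertical field of view, and the depth of 2.0 are all hypothetical.

```python
import math
from typing import Tuple


def touch_to_world(
    touch_x: float,
    touch_y: float,
    area_w: float,
    area_h: float,
    fov_y_deg: float = 60.0,  # assumed vertical field of view
    depth: float = 2.0,       # assumed distance of the placement plane
) -> Tuple[float, float, float]:
    """Map a touch in the display area to a point on a plane facing the camera."""
    # Normalized device coordinates in [-1, 1], with y pointing up.
    ndc_x = 2.0 * touch_x / area_w - 1.0
    ndc_y = 1.0 - 2.0 * touch_y / area_h
    # Half extents of the visible rectangle at the chosen depth.
    half_h = depth * math.tan(math.radians(fov_y_deg) / 2.0)
    half_w = half_h * (area_w / area_h)
    return (ndc_x * half_w, ndc_y * half_h, -depth)
```

Under these assumptions, a touch at the centre of the area maps to the point directly in front of the camera, and touches toward the edges map proportionally toward the edges of the visible rectangle at that depth.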

FIG. 10 illustrates the video creation screen 60 in a state where a touch operation has been performed on the image display area 62 and the touch state is continuing. In this case, a circular touch position display object 70 centered on the position being touched is displayed in the image display area 62. In this state, the user inputs (speaks) the audio to be converted into a text object.

FIG. 11 illustrates the video creation screen 60 in a state where, starting from the state of FIG. 10, the touch state on the image display area 62 has ended and a text object 106 has been placed in the virtual space 100. In the example of FIG. 11, the utterance "こんにちは" ("hello") was input while the touch state on the image display area 62 continued, and the text object 106 is configured as an object corresponding to the text "こんにちは". In this example, the text object 106 is placed so that its first character (in the example of FIG. 11, the object corresponding to the character "こ") is at the position in the virtual space 100 corresponding to the position that was being touched (the display position of the touch position display object 70). The position at which the text object 106 is placed is not limited to this. For example, the text object 106 may be placed at any of various positions in the virtual space 100 that can be determined based on the position in the image display area 62 where the touch operation was performed.

In this example, the placed text object 106 disappears after a predetermined time (for example, 5 seconds) has elapsed. The avatar 102 can also touch the text object 106. For example, the text object 106 is configured so that each character moves independently (each character is a separate object). As shown in FIG. 12, for example, when the right hand of the avatar 102 touches the object for the first character "こ" of the text object 106, that object alone can be moved. In this way, the user can place the text object 106 in the virtual space 100 by a touch operation on the image display area 62 and can also move the placed text object 106 through the avatar 102. The behavior of the placed text object 106 is not limited to this. For example, the text object 106 may be controlled so that it moves downward (falls) at a predetermined speed after being placed. In addition, the entire text object 106, or the object corresponding to each character, may disappear when the avatar 102 touches it, or a predetermined effect (such as a change of shape or light emission) may occur when the avatar 102 touches it.
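The per-character structure and the timed disappearance can be sketched as follows. This is a minimal illustration under stated assumptions: the `CharObject` type, the function names, the horizontal layout with a fixed `spacing`, and the use of plain timestamps are hypothetical, and only the 5-second lifetime and the "one object per character, first character at the anchor" behaviour come from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class CharObject:
    """One character of a text object, movable on its own."""
    char: str
    position: Vec3
    spawned_at: float  # time of placement, in seconds


def make_text_object(text: str, anchor: Vec3, now: float,
                     spacing: float = 0.3) -> List[CharObject]:
    """Lay characters out to the right of the anchor; the first one sits on it."""
    x, y, z = anchor
    return [CharObject(ch, (x + i * spacing, y, z), now)
            for i, ch in enumerate(text)]


def prune_expired(objects: List[CharObject], now: float,
                  lifetime: float = 5.0) -> List[CharObject]:
    """Keep only the characters younger than the lifetime (5 s in the example)."""
    return [o for o in objects if now - o.spawned_at < lifetime]
```

Because each character is its own object, moving the first character (for example when the avatar's hand touches it) leaves the positions of the other characters unchanged.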

In this example, when a flick operation is performed at the time the touch state on the image display area 62 is released, a visual effect based on the direction of the flick operation is applied to the text object being placed. For example, when a rightward flick is performed on release, the text object 106 is placed (displayed) with a fade-in effect, whereas when a leftward flick is performed on release, the text object 106 is placed and then erased with a fade-out effect.
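The mapping from flick direction to visual effect can be sketched as a small function. The function name, the pixel threshold, and the rule that a mostly vertical movement yields no effect are assumptions for illustration; the text specifies only the rightward (fade-in) and leftward (fade-out) cases.

```python
from typing import Optional


def effect_for_flick(dx: float, dy: float,
                     threshold: float = 30.0) -> Optional[str]:
    """Return the visual effect for a flick with displacement (dx, dy).

    dx > 0 is a rightward flick. Movements shorter than the threshold,
    or movements that are mostly vertical, are treated as no flick.
    """
    if abs(dx) < threshold or abs(dx) <= abs(dy):
        return None
    return "fade_in" if dx > 0 else "fade_out"
```

A caller would evaluate this once, at the moment the touch state is released, and pass the result to whatever routine places the text object.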

In the above example, the created moving image may be distributed live. In this case, for example, the moving image creation device 10 transmits the created moving image in a streaming format to a moving image distribution server, and the moving image distribution server distributes the moving image in a streaming format to the user terminals (smartphones, etc.) of a plurality of viewers. FIG. 13 illustrates a distributor screen 80 displayed on a display or the like of the moving image creation device 10 configured as, for example, the user terminal of a distributor who performs live distribution of a moving image. The screen 80 displays a three-dimensional virtual space 200 over the entire screen. In the virtual space 200, the distributor's avatar 202 is placed on a stage 204, and the avatars 208 of the respective viewers are placed in an audience area 206. The avatar 202 is operated by the distributor (for example, it is controlled based on the positions of the distributor's face and both hands included in the image input via the camera). In response to a touch operation by the distributor on the distributor screen 80, the input voice is converted into text and the corresponding text object is placed in the virtual space 200. The avatar 202 can touch the placed text object in the virtual space 200. The avatar 202 may also be able to touch, in the same way as the text object, other objects placed in the virtual space 200 (for example, items (gifts) thrown in by the viewers' avatars 208).

In the above examples, the virtual spaces 100 and 200 include the user's avatars 102 and 202. In another example of the present embodiment, however, the virtual space does not include the user's avatar, and an object or the like that displays the image input via the camera may be placed in it instead.

The moving image creation device 10 according to the present embodiment described above creates a moving image including an image corresponding to a virtual space and input voice and, in response to a touch operation on a predetermined area in which the image is displayed (for example, the image display area 62 of the moving image creation screen 60), converts the input voice into text and places the corresponding text object in the virtual space. This makes it easy to create a moving image in which an object corresponding to the input voice is placed in the virtual space. In other words, the moving image creation device 10 of the present embodiment improves the entertainment value of the output of the voice included in a moving image.

In another embodiment of the present invention, at least some of the above-described functions of the moving image creation device 10 are realized through cooperation between the moving image creation device 10 and a server (for example, the moving image distribution server described above) that is communicably connected to the device 10 via a communication network such as the Internet. For example, the analysis of the image input via the camera of the moving image creation device 10, the conversion into text of the voice input via the microphone, and the creation (and distribution) of the moving image may be performed on the server side.

The processes and procedures described herein may be realized by software, hardware, or any combination thereof, including by means other than those explicitly described. For example, the processes and procedures described herein are realized by implementing logic corresponding to them on a medium such as an integrated circuit, a volatile memory, a non-volatile memory, or a magnetic disk. The processes and procedures described herein can also be implemented as a computer program corresponding to them and executed by various computers.

Even where the processes and procedures described herein are described as being executed by a single device, piece of software, component, or module, such processes or procedures may be executed by multiple devices, multiple pieces of software, multiple components, and/or multiple modules. The software and hardware elements described herein can also be realized by integrating them into fewer components or by dividing them into more components.

In the present specification, even where a component of the invention is described as either singular or plural, or is described without being limited to either, the component may be singular or plural unless the context requires otherwise.

10 Moving image creation device
11 Computer processor
41 Information storage management unit
43 Moving image creation unit
45 Virtual space control unit
50 Input image
60 Moving image creation screen
62 Image display area (predetermined area)
80 Distributor screen
100, 200 Virtual space
102, 202 Avatar
106 Text object

Claims

A system for creating a moving image, comprising one or more computer processors, wherein
the one or more computer processors, in response to execution of readable instructions, execute:
a process of presenting to a user a screen having a predetermined area for displaying an image corresponding to a virtual space;
a process of creating a moving image including the image corresponding to the virtual space and a voice input by the user; and
a process of, in response to a touch operation on the predetermined area by the user, converting the input voice into text and arranging a text object corresponding to the converted text at a position in the virtual space based on the position in the predetermined area at which the touch operation was performed.

The system of claim 1, wherein
the arranging process includes converting into text the voice that is input during the period from the start of a touch state on the predetermined area until the touch state is released.

The system of claim 1 or 2, wherein
the arranging process includes arranging the corresponding text object so that a visual effect is applied based on the direction of a flick operation and/or a slide operation performed after the touch state on the predetermined area is started.

The system according to any one of claims 1 to 3, wherein
the one or more computer processors further execute a process of controlling the movement of an avatar operated by the user in the virtual space, and
the virtual space is configured so that the avatar can touch the arranged text object.

The system of claim 4, wherein
the process of controlling the movement of the avatar includes controlling the movement of the avatar based on at least the posture of the user included in an image input via a camera.

The system according to any one of claims 1 to 5, wherein
the one or more computer processors further execute a process of distributing the created moving image in real time.

A method for creating a moving image, executed by one or more computers, the method comprising:
a step of presenting to a user a screen having a predetermined area for displaying an image corresponding to a virtual space;
a step of creating a moving image including the image corresponding to the virtual space and a voice input by the user; and
a step of, in response to a touch operation on the predetermined area by the user, converting the input voice into text and arranging a text object corresponding to the converted text at a position in the virtual space based on the position in the predetermined area at which the touch operation was performed.

A program for creating a moving image, the program, in response to being executed on one or more computers, causing the one or more computers to execute:
a process of presenting to a user a screen having a predetermined area for displaying an image corresponding to a virtual space;
a process of creating a moving image including the image corresponding to the virtual space and a voice input by the user; and
a process of, in response to a touch operation on the predetermined area by the user, converting the input voice into text and arranging a text object corresponding to the converted text at a position in the virtual space based on the position in the predetermined area at which the touch operation was performed.