JP7449519B2

JP7449519B2 - Systems, methods, and computer-readable media for video processing

Info

Publication number: JP7449519B2
Application number: JP2022528663A
Authority: JP
Inventors: ユアンウー，シャオ; チェン，ミン－チュ
Original assignee: 17Live Japan Inc
Current assignee: 17Live Japan Inc
Priority date: 2021-12-30
Filing date: 2021-12-30
Publication date: 2024-03-14
Anticipated expiration: 2041-12-30
Also published as: JP2024501091A; WO2023129182A1

Description

The disclosure herein relates to video processing in video streaming.

Various techniques are known that allow users to participate in online communication with one another. Applications include live streaming and live conference calls. As these applications have spread, users increasingly demand more efficient communication and a better understanding of each other's messages.

A method according to an embodiment of the present invention is a live video processing method that includes the steps of receiving a message from a user and enlarging an area near a predetermined object in the live video.

A system according to an embodiment of the present invention is a system for live video processing that includes one or more processors, the one or more processors executing machine-readable instructions to perform the steps of receiving a message from a user and enlarging an area near a predetermined object in the live video.

A computer-readable medium according to an embodiment of the present invention is a non-transitory computer-readable medium that includes a program for live video processing, the program causing one or more computers to perform the steps of receiving a message from a user and enlarging an area near a predetermined object in the live video.

FIG. 1 is a schematic diagram showing an example of live streaming.
FIG. 2A is a schematic diagram illustrating exemplary streaming according to some embodiments of the present invention.
FIG. 2B is a schematic diagram illustrating exemplary streaming according to some embodiments of the present invention.
FIG. 2C is a schematic diagram illustrating exemplary streaming according to some embodiments of the present invention.
FIG. 2D is a schematic diagram illustrating exemplary streaming according to some embodiments of the present invention.
FIG. 3 is a schematic diagram illustrating exemplary streaming according to some embodiments of the present invention.
FIG. 4 is a schematic diagram illustrating the configuration of a communication system according to some embodiments of the present invention.
FIG. 5 is a block diagram of a user terminal according to some embodiments of the present invention.
FIG. 6 is an exemplary lookup table according to some embodiments of the present invention.

Conventionally, online communication has had the disadvantages of lower communication efficiency and a greater risk of misunderstanding compared with face-to-face communication. For example, in live video or live streaming communication, it is difficult to stay focused on the correct area when distractions such as comments or special effects appear on the display showing the live video. As another example, in live video or live streaming communication, it is difficult to see the details of the video content because the display size or video resolution is limited.

FIG. 1 is a schematic diagram showing an example of live streaming. S1 is the screen of a user terminal that displays the live stream. RA is a display area within the screen S1 that displays user A's live video. The live video of user A may be captured and provided by a video capturing device, such as a camera, placed near user A. In this example, user A may be a streamer or broadcaster delivering live video that teaches how to cook.

User A wants viewers of this live video to focus on the correct area of the video and see the details of that area, so that they can gain correct knowledge of the cooking steps, ingredients, and so on. Conventionally, user A may have to bring the object of interest (such as a pot or cutting board) closer to the camera so that viewers can see it clearly. Alternatively, user A may have to adjust the direction, position, or focus of the camera to show viewers the details that user A wants to emphasize. Such actions are inconvenient for user A and interrupt the cooking.

Therefore, a method is desired that allows a user to indicate a region of interest in a live video and present the details of that region without stopping the ongoing process. A method is also desired that allows viewers to focus on the correct area of the live video and see the details of that area. The present invention can facilitate presentation and focusing in live video.

FIGS. 2A, 2B, 2C, and 2D are schematic diagrams of exemplary streaming according to some embodiments of the present invention.

As shown in FIG. 2A, user A sends a message or signal M1. In this embodiment, the message M1 is a voice message indicating "zoom in." In another embodiment, the message M1 may be a gesture message expressed by user A. For example, user A may form a gesture message using a body part (such as a hand). In some embodiments, the message M1 may be a facial expression message expressed by user A. The message M1 is part of user A's video (which includes audio data).

The message M1 may be received by the user terminal used to capture user A's video, such as a smartphone, tablet, laptop, or any device with a video capture function. In some embodiments, the message M1 is recognized by the user terminal used to produce or distribute user A's video. In some embodiments, the message M1 is recognized by a system that provides the streaming service. In some embodiments, the message M1 is recognized by a server that supports the streaming service. In some embodiments, the message M1 is recognized by an application that supports the streaming service. In some embodiments, the message M1 is recognized by a voice recognition process, a gesture recognition process, and/or a facial expression recognition process. In some embodiments, the message M1 may be an electrical signal and may be sent and received over a wireless connection.

As shown in FIG. 2B, an object O1 is recognized and a region R1 is determined. The object O1 is recognized according to the message M1. In some embodiments, recognition of the object O1 occurs after the message M1 is received. In some embodiments, receipt of the message M1 triggers recognition of the object O1. In some embodiments, recognition of the message M1 occurs before recognition of the object O1.

In this embodiment, the object O1 is set, taught, or determined to be a body part (the hand) of user A. In other embodiments, the object O1 may be determined to be an object other than a body part, such as a cutting board or a pot. In some embodiments, the object O1 may be determined to be a wearable object worn by user A, such as a watch, a bracelet, or a sticker. The object O1 may be determined or set in advance to be any object in user A's video.

The region R1 is determined to be a region near the object O1. For example, the region R1 may be set as a region that surrounds all of the objects O1, so that user A can conveniently control the size of the region R1 by controlling the position of the object O1 (in this case, user A's own hand). The distance between the edge of the region R1 and the object O1 can be chosen according to the actual situation.
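As a non-limiting illustration, the determination of a region surrounding all detected objects can be sketched as below. This is a minimal Python sketch under stated assumptions, not the disclosed implementation: the `Box` type, the `region_around` function, the pixel `margin` parameter, and the sample detections are all hypothetical, and the object detections are assumed to be already available as axis-aligned bounding boxes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned pixel rectangle; right and bottom are exclusive."""
    left: int
    top: int
    right: int
    bottom: int


def region_around(objects, margin, frame_width, frame_height):
    """Return the smallest box enclosing all object boxes, grown by
    `margin` pixels on each side and clamped to the frame bounds."""
    if not objects:
        raise ValueError("at least one object box is required")
    left = min(b.left for b in objects) - margin
    top = min(b.top for b in objects) - margin
    right = max(b.right for b in objects) + margin
    bottom = max(b.bottom for b in objects) + margin
    return Box(max(left, 0), max(top, 0),
               min(right, frame_width), min(bottom, frame_height))


# Two hypothetical hand detections in a 640x480 frame.
hands = [Box(100, 200, 180, 260), Box(300, 220, 380, 300)]
region = region_around(hands, margin=20, frame_width=640, frame_height=480)
```

In this sketch, moving the two detections apart widens the returned region and moving them together narrows it, which mirrors how the position of the object controls the size of the region. The margin stands in for the distance between the edge of the region and the object.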

In some embodiments, different messages M1 may correspond to different predetermined objects O1. For example, user A can select the object to be recognized and the region to be determined simply by sending the corresponding message. For example, if user A says "pot," the pot (the predetermined object corresponding to the message "pot") is recognized, and the region R1 is determined to be the area near the pot.

In some embodiments, the object O1 is recognized by the user terminal used to capture user A's live video. In some embodiments, the object O1 is recognized by the user terminal used to produce or distribute user A's video. In some embodiments, the object O1 is recognized by a system that provides the streaming service. In some embodiments, the object O1 is recognized by a server that supports the streaming service. In some embodiments, the object O1 is recognized by an application that supports the streaming service.

In some embodiments, the region R1 is determined by the user terminal used to capture user A's live video. In some embodiments, the region R1 is determined by the user terminal used to produce or distribute user A's video. In some embodiments, the region R1 is determined by a system that provides the streaming service. In some embodiments, the region R1 is determined by a server that supports the streaming service. In some embodiments, the region R1 is determined by an application that supports the streaming service.

As shown in FIG. 2C, the region R1 is enlarged so that the details of the video content within the region R1 can be seen clearly. The enlarged region R1 may cover or overlap the portion of user A's video that lies outside the region R1. The enlarged region R1 may be displayed on any area of the screen S1.
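One simple way to picture the enlargement is cropping the region and upscaling it by pixel repetition. The sketch below is an assumption made for illustration only: the disclosure does not specify a scaling algorithm, and the function name `enlarge_region`, the integer `scale` factor, and the representation of a frame as a list of pixel rows are hypothetical.

```python
def enlarge_region(frame, left, top, right, bottom, scale):
    """Crop the rectangle [left, right) x [top, bottom) from `frame`
    and upscale it by an integer factor using nearest-neighbour
    pixel repetition."""
    if not isinstance(scale, int) or scale < 1:
        raise ValueError("scale must be a positive integer")
    crop = [row[left:right] for row in frame[top:bottom]]
    enlarged = []
    for row in crop:
        # Repeat every pixel `scale` times horizontally.
        wide_row = [pixel for pixel in row for _ in range(scale)]
        # Repeat the widened row `scale` times vertically (as copies).
        enlarged.extend(list(wide_row) for _ in range(scale))
    return enlarged


# A tiny 3x3 "frame" whose pixels are plain integers.
frame = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
zoomed = enlarge_region(frame, left=1, top=1, right=3, bottom=3, scale=2)
```

The enlarged crop can then be composited over any part of the displayed frame. A production system would more likely use an interpolating resampler from an image library; pixel repetition is used here only to keep the example dependency-free.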

In some embodiments, the enlargement process is performed by the user terminal used to capture user A's live video. In some embodiments, the enlargement process is performed by the user terminal used to produce or distribute user A's video. In some embodiments, the enlargement process is performed by a system that provides the streaming service. In some embodiments, the enlargement process is performed by a server that supports the streaming service. In some embodiments, the enlargement process is performed by an application that supports the streaming service. In some embodiments, the enlargement process is performed by a user terminal that displays user A's video, such as a viewer's user terminal.

In an embodiment in which the enlargement process is performed by the user terminal that captures user A's video, the user terminal can be configured to capture the region R1 (which may move as the object O1 moves) at a higher resolution than regions other than the region R1. The enlarged area of the live video therefore has a higher resolution than other, non-enlarged areas of the live video. As a result, the emphasized area carries more information, and viewers can make out its details.

As shown in FIG. 2D, in some embodiments, the areas of the display area RA other than the enlarged region R1 may be processed so that the enlarged region R1 stands out. For example, the other areas can be darkened or blurred to make it easier for viewers to concentrate on the region R1.
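Darkening everything outside the region can be sketched as follows. This is an illustrative Python sketch only: the function name `dim_outside`, the `factor` parameter, and the use of single-channel integer brightness values are assumptions, and blurring (the other option mentioned above) is not shown.

```python
def dim_outside(frame, left, top, right, bottom, factor=0.4):
    """Return a copy of `frame` in which every pixel outside the
    rectangle [left, right) x [top, bottom) is multiplied by `factor`.
    Pixels are assumed to be grayscale brightness integers."""
    if not 0.0 <= factor <= 1.0:
        raise ValueError("factor must be between 0 and 1")
    dimmed = []
    for y, row in enumerate(frame):
        new_row = []
        for x, pixel in enumerate(row):
            inside = top <= y < bottom and left <= x < right
            new_row.append(pixel if inside else int(pixel * factor))
        dimmed.append(new_row)
    return dimmed


# A uniform 3x3 frame; only the centre pixel is inside the region.
frame = [[100, 100, 100] for _ in range(3)]
result = dim_outside(frame, left=1, top=1, right=2, bottom=2, factor=0.5)
```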

FIG. 3 is a schematic diagram illustrating exemplary streaming according to some embodiments of the present invention.

As shown in FIG. 3, the object O1 is determined to be a wearable terminal or wearable object of user A. The object O1 moves in synchronization with user A's movement, and the enlarged area of the live video moves in synchronization with the movement of the object O1. User A can therefore conveniently decide which area to enlarge or emphasize simply by controlling the position of the object O1. In some embodiments, enlarging an area of the live video and/or moving the enlarged area is done by video processing performed by a user terminal, a server, or an application. Therefore, the orientation of the video capturing device used to capture the live video can remain fixed while the enlarged area of the live video moves in synchronization with the movement of the predetermined object.
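The idea that the enlarged area follows the object while the camera stays fixed can be illustrated with a crop window that is recomputed for each frame from the object's detected position. The sketch below is hypothetical: the function `follow_window`, its fixed window size, and the sample object positions are assumptions, and object tracking itself is taken as given.

```python
def follow_window(center_x, center_y, win_w, win_h, frame_w, frame_h):
    """Return (left, top, right, bottom) of a win_w x win_h crop window
    centred on the object where possible and kept inside the frame."""
    if win_w > frame_w or win_h > frame_h:
        raise ValueError("window must fit inside the frame")
    left = min(max(center_x - win_w // 2, 0), frame_w - win_w)
    top = min(max(center_y - win_h // 2, 0), frame_h - win_h)
    return left, top, left + win_w, top + win_h


# Hypothetical object centres over three frames of a fixed 640x480 camera.
positions = [(320, 240), (400, 260), (10, 470)]
windows = [follow_window(x, y, 200, 100, 640, 480) for x, y in positions]
```

Only the crop window changes from frame to frame; the full captured frame always covers the same scene, which is why the capturing device does not need to be re-aimed.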

In some embodiments, the user can send a first message to start the message recognition process and then send a second message indicating the object to be recognized. The area to be enlarged is then determined by that object. The first message and/or the second message may be or include a voice message, a gesture message, or a facial expression message. In some embodiments, the first message may be called a trigger message.

For example, by saying "focus" or "zoom in," user A can indicate that the next message is meant to identify the object O1. User A can then say "pot" so that the pot in the video is recognized as the object O1. The area near the pot is then enlarged.

In some embodiments, the above configuration can save resources used for message recognition. For example, the message recognition process that runs continuously (which may include comparing video information against a message table) needs to look only for the first message, which may be a single voice message. The second message may have many more variations, each corresponding to a different object in the video. The message recognition process for the second message can be started only when the first message has been received or detected.
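The two-stage arrangement can be pictured as a small state machine in which the larger object vocabulary is consulted only after a trigger has been heard. The sketch below is an assumption-laden illustration: the class `TwoStageRecognizer`, the word sets, and the choice to disarm after a single follow-up utterance are hypothetical, and speech-to-text is assumed to have already produced the utterance strings.

```python
# Hypothetical vocabularies; the trigger set is deliberately small.
TRIGGER_WORDS = {"focus", "zoom in"}
OBJECT_WORDS = {"pot": "pot", "board please": "cutting board"}


class TwoStageRecognizer:
    """Checks only trigger words until armed, then checks object words."""

    def __init__(self):
        self.armed = False

    def feed(self, utterance):
        """Process one utterance; return an object name or None."""
        text = utterance.strip().lower()
        if not self.armed:
            # Stage 1: cheap, always-on check against the trigger set.
            if text in TRIGGER_WORDS:
                self.armed = True
            return None
        # Stage 2: runs once per trigger, then disarms again.
        self.armed = False
        return OBJECT_WORDS.get(text)


recognizer = TwoStageRecognizer()
results = [recognizer.feed(u) for u in ["pot", "Zoom in", "pot", "pot"]]
```

In this sketch, "pot" is ignored until a trigger has been heard, and the recognizer returns to the cheap first stage after one follow-up utterance. A real system might instead keep the second stage active for a time window.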

FIG. 4 is a schematic diagram showing the configuration of a communication system according to some embodiments of the present invention. The communication system 1 can provide a live streaming service with interaction through content. "Content" here refers to digital content that can be played back on a computer device. The communication system 1 allows users to participate online in real-time interaction with other users. The communication system 1 includes a plurality of user terminals 10, a backend server 30, and a streaming server 40. The user terminals 10, the backend server 30, and the streaming server 40 are connected via a network 90 (which may be, for example, the Internet). The backend server 30 may be a server that synchronizes interactions between the user terminals and/or the streaming server 40. In some embodiments, the backend server 30 may be the origin server of an application (APP) provider. The streaming server 40 is a server for handling or providing streaming data or video data. In some embodiments, the backend server 30 and the streaming server 40 may be independent servers. In some embodiments, the backend server 30 and the streaming server 40 may be integrated into one server. In some embodiments, the user terminal 10 is a client device for live streaming. In some embodiments, the user terminal 10 may be referred to as a viewer, streamer, anchor, podcaster, audience, listener, or the like. Each of the user terminals 10, the backend server 30, and the streaming server 40 is an example of an information processing device. In some embodiments, the streaming can be live streaming or video playback. In some embodiments, the streaming can be audio streaming and/or video streaming. In some embodiments, the streaming can include content such as online shopping, talk shows, talent shows, entertainment events, sporting events, music videos, movies, comedy, concerts, group calls, and conference calls.

FIG. 5 is a block diagram of a user terminal according to some embodiments of the present invention.

The user terminal 10S is the user terminal of a streamer or broadcaster. The user terminal 10S includes a live video capturing unit 12, a message receiving unit 13, an object identification unit 14, a region determination unit 15, an enlargement unit 16, and a transmission unit 17.

The live video capturing unit 12 includes a camera 122 and a microphone 124 and is configured to capture live video data (including audio data) of the streamer.

The message receiving unit 13 is configured to monitor the audio stream (in some embodiments, the image stream) of the live video and to recognize predetermined words (for example, "focus" or "zoom in") in the audio stream.

The object identification unit 14 is configured to identify one or more predetermined objects in the live video and to recognize the identified objects in an image or in the live video. The object may be identified using a lookup table, described later, together with the predetermined word recognized by the message receiving unit 13. In another embodiment, the object may be identified by the message receiving unit 13.

The region determination unit 15 is configured to determine the region of the live video to be enlarged. The region to be enlarged is a region near the identified or recognized object.

The enlargement unit 16 is configured to perform the video processing involved in enlarging a region of the live video. In embodiments where the region to be enlarged is captured at a higher resolution, the camera 122 may take part in the enlargement process.

The transmission unit 17 is configured to transmit the enlarged live video (that is, the live video in which a region has been enlarged) to a server (such as a streaming server) when the enlargement process has been performed. When the enlargement process has not been performed, the transmission unit 17 transmits the live video as captured by the live video capturing unit 12.

FIG. 6 shows an exemplary lookup table, according to some embodiments of the present invention, that may be used by the object identification unit 14 of FIG. 5.

The "Predetermined word" column lists the words to be identified in the audio stream of the live video. The "Object" column lists the object to be recognized for each predetermined word. In this example, identifying "zoom in" leads to recognition of the streamer's hand in the live video, identifying "pot" leads to recognition of the pot in the live video, and identifying "board please" leads to recognition of the cutting board in the live video.
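The lookup table described above maps naturally onto a dictionary keyed by predetermined word. The Python sketch below uses the three example entries from the text; the names `LOOKUP_TABLE` and `object_for`, and the naive substring matching on a transcript, are assumptions made for illustration rather than the disclosed implementation.

```python
# Predetermined word -> object to recognize (entries from the example).
LOOKUP_TABLE = {
    "zoom in": "streamer's hand",
    "pot": "pot",
    "board please": "cutting board",
}


def object_for(transcript):
    """Return the object for the first predetermined word found in the
    transcript, or None if no predetermined word is present."""
    text = transcript.lower()
    for word, obj in LOOKUP_TABLE.items():
        # Naive substring match; a real recognizer would match on
        # word boundaries so that "spot" does not match "pot".
        if word in text:
            return obj
    return None


target = object_for("OK, zoom in now")
```

Because the table is plain data, entries preset by a user or generated automatically could be added without changing the lookup function.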

In some embodiments, the predetermined words or objects are set in advance by the user. In some embodiments, the predetermined words or objects may be generated automatically through AI or machine learning.

The processes and procedures described in the present invention can be realized, in addition to the ways explicitly described, by software, hardware, or any combination thereof. For example, the processes and procedures described herein can be realized by implementing the corresponding logic in a medium such as an integrated circuit, volatile memory, non-volatile memory, a non-transitory computer-readable medium, or a magnetic disk. Furthermore, the processes and procedures described herein can be realized as a computer program corresponding to those processes and procedures and can be executed by various kinds of computers.

The systems or methods described in the above embodiments may be integrated into a program stored on a non-transitory computer-readable medium such as a solid-state storage device, an optical disk storage device, or a magnetic disk storage device. Alternatively, the program may be downloaded from a server via the Internet and executed by a processor.

Although the technical content and features of the present invention have been described above, those with ordinary knowledge in the technical field to which the present invention pertains can still make many variations and modifications without departing from the teachings and disclosure of the present invention. Therefore, the scope of the present invention is not limited to the embodiments disclosed above; it includes other variations and modifications that do not depart from the present invention and is defined by the scope of the claims.

S1 Screen
RA Area
O1 Object
R1 Region
1 System
10 User terminal
10S User terminal
12 Live video capturing unit
122 Camera
124 Microphone
13 Message receiving unit
14 Object identification unit
15 Region determination unit
16 Enlargement unit
17 Transmission unit
30 Backend server
40 Streaming server
90 Network

Claims

A live video processing method, comprising:
receiving a message from a user while live video generated by the user is being broadcast; and
enlarging an area near a predetermined object in the live video based on the message,
wherein the live video includes an image of the user.

The live video processing method according to claim 1, further comprising the step of recognizing the predetermined object in the live video based on the message.

The live video processing method according to claim 2, further comprising the step of receiving a trigger message from the user that triggers recognition of the predetermined object in the live video based on the message.

The live video processing method according to claim 1, wherein the message includes a voice message, a gesture message, or a facial expression message.

The live video processing method according to claim 1, further comprising the step of recognizing the message from the user.

The live video processing method according to claim 5, wherein the step of recognizing the message from the user further includes a voice recognition process, a gesture recognition process, or a facial expression recognition process.

The live video processing method according to claim 1, wherein the predetermined object includes a body part of the user or a wearable object of the user.

The live video processing method according to claim 1, wherein the predetermined object moves in synchronization with the user's movement.

The live video processing method according to claim 1, wherein the message corresponds to the predetermined object.

The live video processing method according to claim 1, wherein the region to be enlarged in the live video is captured by a video capturing device at a higher resolution than another region of the live video that is captured by the video capturing device at the same time as the region and is not enlarged.

The live video processing method according to claim 1, wherein the area to be enlarged in the live video moves in synchronization with the movement of the predetermined object.

The live video processing method according to claim 11, wherein the live video is generated by a video capturing device near the user, and the direction of the video capturing device is fixed when the area to be enlarged in the live video moves in synchronization with the movement of the predetermined object.

A live video processing system comprising one or more processors, wherein the one or more processors execute machine-readable instructions to perform:
receiving a message from a user while live video generated by the user is being broadcast; and
enlarging an area near a predetermined object in the live video based on the message,
wherein the live video includes an image of the user.

A non-transitory computer-readable medium containing a program for live video processing, wherein the program causes one or more computers to execute:
receiving a message from a user while live video generated by the user is being broadcast; and
enlarging an area near a predetermined object in the live video based on the message,
wherein the live video includes an image of the user.