JP2023137704A

JP2023137704A - Highlight moving image generation system, highlight moving image generation method, and program

Info

Publication number: JP2023137704A
Application number: JP2022044022A
Authority: JP
Inventors: 功大橋; Isao Ohashi
Original assignee: Shabelab Inc
Current assignee: Shabelab Inc
Priority date: 2022-03-18
Filing date: 2022-03-18
Publication date: 2023-09-29
Anticipated expiration: 2042-03-18
Also published as: JP7179387B1

Abstract

To generate a highlight moving image desired by a user by extracting a plurality of short moving images desired by the user from moving image content. SOLUTION: A highlight moving image generation system 1 comprises an acquisition section which acquires an original moving image. Based on a result of performing speech recognition on the acquired original moving image, the sound of the original moving image is extracted as timestamped text holding a timestamp for each word. On an editing screen where the extracted timestamped text is displayed, the user is made to select a plurality of groups of one or more consecutive words from the timestamped text in such a manner that the timestamps of the groups are discontinuous. The selections of the user are received, the time range of each selected group of one or more consecutive words is identified based on the timestamped text, the portions of the original moving image corresponding to the identified time ranges are cut out as short moving images, and the short moving images are combined to generate a highlight moving image. SELECTED DRAWING: Figure 1

Description

The present invention relates to a highlight video generation system, a highlight video generation method, and a program.

Amid the COVID-19 pandemic, the use of video content such as web seminars and YouTube has increased. However, some video content is long. From the viewer's perspective, this creates the problem that much video content is inefficient to watch: playback takes too long, or the viewer becomes bored partway through.

To solve the above problem, technologies have been provided that let a user quickly search for a specific part of a video that the user wants to view (see Patent Document 1) and that let a viewer easily and accurately reach the desired playback start position of video and audio during playback (see Patent Document 2). In addition, for creating highlight videos, a technology has been proposed that extracts from video content highlight portions identified on the basis of frame evaluation values calculated from feature quantities such as frame luminance and specific subjects (see Patent Document 3).

Patent Document 1: Japanese Patent Application Publication No. 2019-66785
Patent Document 2: Japanese Patent Application Publication No. 2018-168508
Patent Document 3: Japanese Patent Application Publication No. 2019-216364

However, with the technologies described in Patent Documents 1 and 2, although the user can easily reach the position in the video content from which the user wants to start viewing, it is not possible to cut out multiple short videos from the video content and combine them to generate a highlight video. Further, with the technology described in Patent Document 3, it is difficult to extract highlight portions from video content such as a web seminar, in which highlight portions such as conspicuous or exciting parts are hard to identify, and the extracted highlight portions are not necessarily the portions the user wants.

In view of these problems, an object of the present invention is to provide a highlight video generation system, a highlight video generation method, and a program that extract a plurality of short videos desired by a user from video content and generate a highlight video desired by the user.

The present invention provides a highlight video generation system that generates a highlight video by cutting out and combining a plurality of short videos from an original video to be edited, the system comprising: an acquisition unit that acquires the original video; a speech recognition unit that performs speech recognition on the acquired original video; an extraction unit that, based on the result of the speech recognition, extracts the audio of the original video as time-stamped text in which a timestamp is added to each word; a display unit that displays the extracted time-stamped text on an editing screen; a selection unit that, on the editing screen, has the user make a plurality of selections from the time-stamped text, each consisting of one word or a plurality of consecutive words, such that the timestamps of the selections are discontinuous, and that accepts the user's selections; a specifying unit that specifies, based on the time-stamped text, the time range of each selected word or run of consecutive words; a cutting unit that cuts out from the original video, as respective short videos, the portions corresponding to the respective specified time ranges; and a generation unit that combines the cut-out short videos to generate a highlight video.

The present invention also provides a highlight video generation system in which the generation unit further joins a video prepared in advance to the beginning and/or the end of the highlight video.

The present invention also provides a highlight video generation system in which the selection unit accepts a selection of the order in which the selected words or runs of consecutive words are to be combined, and the generation unit combines the cut-out short videos in that order to generate the highlight video.

The present invention also provides a highlight video generation system in which, when the extracted time-stamped text contains a filler, the display unit either deletes the filler or displays it on the editing screen so that it can be distinguished from words other than the filler.

The present invention also provides a highlight video generation system in which, when a selected word or run of consecutive words contains a filler, the cutting unit cuts out from the original video, as each short video, the portion corresponding to the specified time range excluding the part corresponding to the filler.

The present invention also provides a highlight video generation system comprising a captioning unit that adds captions to the highlight video.

The present invention also provides a highlight video generation system comprising an insertion unit that adds sound data to the highlight video.

The present invention also provides a method executed by a highlight video generation system that generates a highlight video by cutting out and combining a plurality of short videos from an original video to be edited, the method comprising the steps of: acquiring the original video; performing speech recognition on the acquired original video; extracting, based on the result of the speech recognition, the audio of the original video as time-stamped text in which a timestamp is added to each word; displaying the extracted time-stamped text on an editing screen; having the user make, on the editing screen, a plurality of selections from the time-stamped text, each consisting of one word or a plurality of consecutive words, such that the timestamps of the selections are discontinuous, and accepting the user's selections; specifying, based on the time-stamped text, the time range of each selected word or run of consecutive words; cutting out from the original video, as respective short videos, the portions corresponding to the respective specified time ranges; and combining the cut-out short videos to generate a highlight video.

The present invention also provides a program that causes a highlight video generation system, which generates a highlight video by cutting out and combining a plurality of short videos from an original video to be edited, to function as: an acquisition unit that acquires the original video; a speech recognition unit that performs speech recognition on the acquired original video; an extraction unit that, based on the result of the speech recognition, extracts the audio of the original video as time-stamped text in which a timestamp is added to each word; a display unit that displays the extracted time-stamped text on an editing screen; a selection unit that, on the editing screen, has the user make a plurality of selections from the time-stamped text, each consisting of one word or a plurality of consecutive words, such that the timestamps of the selections are discontinuous, and that accepts the user's selections; a specifying unit that specifies, based on the time-stamped text, the time range of each selected word or run of consecutive words; a cutting unit that cuts out from the original video, as respective short videos, the portions corresponding to the respective specified time ranges; and a generation unit that combines the cut-out short videos to generate a highlight video.

According to the present invention, a plurality of short videos desired by a user can be extracted from video content to generate a highlight video desired by the user.

FIG. 1 is a diagram illustrating an overview of a highlight video generation system according to an embodiment of the present invention.
FIG. 2 is a diagram showing the functional configuration of the highlight video generation system according to the embodiment of the present invention.
FIG. 3 is a diagram showing the flow of the highlight video generation process executed by the highlight video generation system according to the embodiment of the present invention.

Hereinafter, modes for carrying out the present invention (hereinafter referred to as embodiments) will be described in detail with reference to the accompanying drawings. In the following figures, the same elements are given the same numbers or reference signs throughout the description of the embodiments.

[Basic concept/basic configuration]
FIG. 1 is a diagram for explaining an overview of a highlight video generation system 1 according to an embodiment of the present invention. The highlight video generation system 1 is a system in which the user selects desired parts of a video by using text generated through speech recognition of the video to be edited (hereinafter referred to as the original video), and which generates the highlight video the user wants from the original video. In this embodiment, a highlight video is a video extracted from the original video that collects the parts the user wants to show to viewers: for example, a video that collects the exciting, conspicuous, or interesting parts of the original video, or a video that collects the important parts of the original video to summarize its content.

The highlight video generation system 1 includes a highlight video generation device 10 and a user terminal 20. The highlight video generation device 10 is connected to the user terminal 20 via a network and generates a highlight video from an original video according to the user's instructions. The highlight video generation device 10 may be either an on-premises server or a cloud server; in this embodiment, it is a cloud server. Note that the highlight video generation system 1 may also be connected via the network to a viewer terminal (not shown) for viewing the highlight video generated by the highlight video generation device 10, to a server or the like (not shown) to which the highlight video is uploaded, and so on.

The user terminal 20 is the terminal of the user who gives instructions when a highlight video is generated from an original video, and is, for example, a smartphone, a tablet terminal, or a personal computer. Only one user terminal is shown in this embodiment, but there may be more than one.

The highlight video generation device 10 first acquires the original video from the user terminal 20, or from the network based on an instruction from the user terminal 20. Next, the highlight video generation device 10 performs speech recognition on the acquired original video and generates time-stamped text that holds a timestamp for each word (S1).

Here, a timestamp is the elapsed time from the start of the original video, and the time-stamped text is text in which the elapsed time from the start of the original video is inserted at each word, in other words at the boundaries between words. In the time-stamped text generated from the original video shown in FIG. 1, the timestamp of each word is indicated by a black triangle (▲); more precisely, as shown in the balloon, each black triangle stands for the elapsed time from the start of the original video. Also, in the time-stamped text generated from the original video shown in FIG. 1, the words of each frame of the original video are linked to the timestamp of the first word of that frame.
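As a minimal sketch of the data described here, time-stamped text can be modeled as a list of words, each carrying the elapsed time at which it starts. The `Word` class, the `parse_timestamp` and `build_timestamped_text` helpers, the English glosses used as words, and every timestamp other than the two quoted for FIG. 1 later in the description (0:00:17:01 and 0:00:30:25) are hypothetical. The text does not state what the last timestamp field represents, so it is kept as an uninterpreted integer.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

Timestamp = Tuple[int, ...]


def parse_timestamp(stamp: str) -> Timestamp:
    """Split a colon-separated timestamp into integer fields.

    The fields are compared as a tuple, so ordering works without
    deciding what the last field means (assumption: fields are ordered
    from most to least significant).
    """
    return tuple(int(field) for field in stamp.split(":"))


@dataclass(frozen=True)
class Word:
    text: str          # recognized word
    start: Timestamp   # elapsed time from the start of the original video


def build_timestamped_text(pairs: Iterable[Tuple[str, str]]) -> List[Word]:
    """Build time-stamped text from (word, timestamp string) pairs."""
    return [Word(text, parse_timestamp(stamp)) for text, stamp in pairs]


# English glosses in the original Japanese word order; only the
# timestamps of "recently" and "er" come from the FIG. 1 example.
timestamped_text = build_timestamped_text([
    ("hello", "0:00:15:10"),             # hypothetical timestamp
    ("recently", "0:00:17:01"),          # from the FIG. 1 example
    ("talked-about", "0:00:19:12"),      # hypothetical timestamp
    ("mail-order items", "0:00:22:03"),  # hypothetical timestamp
    ("introduce", "0:00:27:20"),         # hypothetical timestamp
    ("er", "0:00:30:25"),                # from the FIG. 1 example
])
```

Keeping one timestamp per word is what later lets a selection of words be mapped back to a time range in the original video.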

The highlight video generation device 10 causes the user terminal 20 to display an editing screen for editing the time-stamped text generated in S1 (S2). In the time-stamped text on the editing screen shown in FIG. 1, to make it easy for the user to select words, the timestamps are not displayed and each position where a timestamp is inserted is shown as a space; however, these positions may be shown with something other than a space, or the timestamps may be displayed.

On the user terminal 20, the user selects, from the time-stamped text displayed on the editing screen, one or more portions to include in the highlight video, each portion being one word or a plurality of consecutive words. The selection may be made by any means, such as dragging, clicking, or tapping, and the selected portion may be indicated by any display method, such as enclosing it in a rectangle as shown in FIG. 1 or highlighting it. Hereinafter, one word or a plurality of consecutive words is referred to as a sentence, and a sentence that has been selected is referred to as a selected sentence.

When a plurality of selected sentences are selected, the highlight video generation device 10 performs control so that the timestamps of the selected sentences are discontinuous, in other words so that the timestamps of one selected sentence do not run on directly into those of another. For example, on the editing screen shown in FIG. 1, if "I will introduce recently talked-about mail-order items" is selected, "Hello" cannot be selected as a separate selected sentence. However, "Hello" and "I will introduce recently talked-about mail-order items" can be selected together as a single selected sentence. The selected sentences chosen on the editing screen are transmitted from the user terminal 20 to the highlight video generation device 10 (S3).

The highlight video generation device 10 specifies the time range of each of the one or more selected sentences received from the user terminal 20 based on the time-stamped text (S4). Here, the time range is the time range that the selected sentence occupies in the original video. Specifically, the time range runs from the timestamp of the first word of the selected sentence to the timestamp of the word that follows the selected sentence.

For example, when "I will introduce recently talked-about mail-order items" in the time-stamped text of FIG. 1 is the selected sentence, its time range runs from the timestamp "0:00:17:01" of "recently" (最近, the first word of the selected sentence in the original Japanese) to the timestamp "0:00:30:25" of the word "er" that follows the selected sentence. That is, the time range of the selected sentence is "0:00:17:01 to 0:00:30:25".
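The rule in this example can be sketched as a small function. This is an illustration under assumptions, not the implementation in the publication: the `sentence_time_range` name, the list-of-tuples representation, the English glosses, and all timestamps except 0:00:17:01 and 0:00:30:25 are hypothetical. The text does not say what happens when the selected sentence ends with the last word of the transcript, so the sketch returns `None` for the end in that case, meaning "until the end of the video".

```python
from typing import List, Optional, Tuple

TimestampedWord = Tuple[str, str]  # (word, timestamp string)


def sentence_time_range(
    words: List[TimestampedWord], first: int, last: int
) -> Tuple[str, Optional[str]]:
    """Return (start, end) for the selected sentence words[first..last].

    start: timestamp of the first selected word.
    end:   timestamp of the word following the selection, or None if the
           selection reaches the end of the text (assumption).
    """
    start = words[first][1]
    end = words[last + 1][1] if last + 1 < len(words) else None
    return start, end


# English glosses in the original Japanese word order.
words = [
    ("hello", "0:00:15:10"),             # hypothetical timestamp
    ("recently", "0:00:17:01"),          # from the FIG. 1 example
    ("talked-about", "0:00:19:12"),      # hypothetical timestamp
    ("mail-order items", "0:00:22:03"),  # hypothetical timestamp
    ("introduce", "0:00:27:20"),         # hypothetical timestamp
    ("er", "0:00:30:25"),                # from the FIG. 1 example
]

# The selected sentence spans "recently" (index 1) to "introduce" (index 4).
print(sentence_time_range(words, 1, 4))  # ('0:00:17:01', '0:00:30:25')
```

Using the next word's timestamp as the end means no separate end time has to be stored per word.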

The highlight video generation device 10 cuts out the portions corresponding to the specified time ranges from the original video as short videos, and combines the cut-out short videos to generate a highlight video (S5).

With such a highlight video generation system, the user selects sentences from the time-stamped text generated from the original video, the portions the user wants are identified from those sentences, and a highlight video is generated. This makes it easy for the user to select the desired portions and to generate the desired highlight video. As a result, a highlight video that brings together only the parts the user wants to show can be created, and a high marketing effect can be expected from having viewers watch it.

[Functional configuration of highlight video generation system]
FIG. 2 is a diagram showing the functional configuration of the highlight video generation system 1 according to the embodiment of the present invention. The highlight video generation system 1 includes a highlight video generation device 10 and a user terminal 20 connected to the highlight video generation device 10 via a network.

[Functional configuration of highlight video generation device]
The highlight video generation device 10 includes a transmitting/receiving unit 11 that transmits and receives data to and from the user terminal 20, an acquisition unit 12, a speech recognition unit 13, an extraction unit 14, a display control unit 15, a selection unit 16, a specifying unit 17, a cutting unit 18, a generation unit 19, and a storage unit 100.

The storage unit 100 stores the original video acquired by the acquisition unit 12, the time-stamped text extracted by the extraction unit 14, and the highlight video generated by the generation unit 19, each of which is described later. The original video and the time-stamped text may be deleted once the highlight video has been generated. In this embodiment, since the highlight video generation device 10 is a cloud server, the storage unit 100 is preferably implemented as cloud storage or a distributed ledger.

The acquisition unit 12 acquires the original video and stores it in the storage unit 100. Specifically, the acquisition unit 12 acquires the original video from the user terminal 20 via the transmitting/receiving unit 11, or acquires it via the transmitting/receiving unit 11 from a server or web page that the user terminal 20 specifies by URL or the like. The acquisition unit 12 then stores the acquired original video in the storage unit 100.

The speech recognition unit 13 recognizes the audio data of the original video acquired by the acquisition unit 12 and converts the speech into text data. For example, the speech recognition unit 13 converts audio data into text data by combining an acoustic model with a language model that expresses linguistic constraints.

The extraction unit 14 extracts time-stamped text, which is the text data obtained by the speech recognition unit 13 with a timestamp inserted for each word. Specifically, for each word of the text obtained by the speech recognition unit 13, the extraction unit 14 refers to the original video and acquires a timestamp. The extraction unit 14 then inserts the timestamp acquired for each word at the corresponding position in the text data and extracts the time-stamped text.
After inserting the timestamp acquired for each word at the corresponding position in the text data, the extraction unit 14 may extract time-stamped text in which the words of each frame of the original video are linked to the timestamp of the first word of that frame.

When the time-stamped text contains fillers, the extraction unit 14 may delete them, that is, it may extract time-stamped text from which both the fillers and their timestamps have been removed. The fillers to be deleted are set in advance and held in the storage unit 100. Here, a filler is an utterance such as "ah" or "er" that fills a gap in speech. Because fillers are superfluous words unrelated to the content of the utterance, deleting them from the time-stamped text makes it easier for the user to select the desired portion.
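A minimal sketch of this filler deletion, assuming the time-stamped text is a list of (word, timestamp) pairs and the preset fillers are a simple set of strings. The `FILLERS` set, the `remove_fillers` name, the exact-match rule, and the sample data are hypothetical; the publication does not specify how fillers are matched.

```python
from typing import FrozenSet, List, Tuple

TimestampedWord = Tuple[str, str]  # (word, timestamp string)

# Hypothetical preset filler list; the description only says the fillers
# are set in advance and held in the storage unit.
FILLERS: FrozenSet[str] = frozenset({"ah", "er"})


def remove_fillers(
    words: List[TimestampedWord], fillers: FrozenSet[str] = FILLERS
) -> List[TimestampedWord]:
    """Drop each filler together with its timestamp (exact, case-insensitive match)."""
    return [(text, stamp) for text, stamp in words if text.lower() not in fillers]


words = [
    ("introduce", "0:00:27:20"),  # hypothetical timestamp
    ("er", "0:00:30:25"),         # filler, from the FIG. 1 example
    ("first", "0:00:31:10"),      # hypothetical word and timestamp
]

print(remove_fillers(words))
# [('introduce', '0:00:27:20'), ('first', '0:00:31:10')]
```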

The display control unit 15 generates, based on the time-stamped text extracted by the extraction unit 14, editing screen data for issuing editing instructions for the original video, and transmits it to the user terminal 20. The editing screen data is data that enables the display unit (not shown) of the user terminal 20 to display a screen on which the user selects one word or a plurality of consecutive words from the time-stamped text in order to create a highlight video.

The editing screen data may also be data that enables display of a screen that additionally has a function for searching for words in the time-stamped text or a function for playing back the original video. This lets the user easily find the words to select and compare the time-stamped text with the original video on the editing screen.

Furthermore, the editing screen data may also have a function that, when the time-stamped text contains fillers, changes the fillers to a format different from the other characters (for example, a different character size, color, or font) when the time-stamped text is displayed on the display unit. The format used for fillers may be selectable by the user on the editing screen.

On the editing screen displayed on the display unit of the user terminal 20, the selection unit 16 has the user select a plurality of sentences from the time-stamped text such that their timestamps are discontinuous, and accepts the plurality of sentences selected by the user. Specifically, when the user selects a plurality of sentences on the editing screen, the selection unit 16 refers to the time-stamped text and determines whether the timestamps of the selected sentences are discontinuous.

If the timestamps of the selected sentences are not discontinuous, the selection unit 16 instructs the editing screen to display a message to that effect or to output an error sound, in order to prompt the user to reselect sentences. If the timestamps of the selected sentences are discontinuous, the selection unit 16 accepts the selected sentences. Having the user select sentences with discontinuous timestamps in this way encourages the generation of an appropriate highlight video, and also prevents a redundant highlight video from being generated because the user included unnecessary parts when selecting the desired portions.
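The check performed by the selection unit can be sketched as follows. This is one possible reading, stated as an assumption: each selected sentence is represented by the inclusive word-index range it covers in the time-stamped text, and two selected sentences count as discontinuous when at least one unselected word lies between them. The function names and the message text are hypothetical.

```python
from typing import List, Tuple

Selection = Tuple[int, int]  # (index of first word, index of last word), inclusive


def timestamps_are_discontinuous(selections: List[Selection]) -> bool:
    """True if no two selected sentences overlap or are directly adjacent."""
    ordered = sorted(selections)
    return all(nxt[0] > prev[1] + 1 for prev, nxt in zip(ordered, ordered[1:]))


def accept_selections(selections: List[Selection]) -> Tuple[bool, str]:
    """Accept the selections, or return a message prompting reselection."""
    if not timestamps_are_discontinuous(selections):
        return False, "Selected sentences must not be adjacent; please reselect."
    return True, ""


# Word 0 is "hello"; words 1 to 4 form the sentence from the FIG. 1 example.
print(accept_selections([(0, 0), (1, 4)])[0])  # False: "hello" directly precedes it
print(accept_selections([(0, 4)])[0])          # True: one combined selected sentence
print(accept_selections([(0, 0), (2, 4)])[0])  # True: word 1 separates the two
```

Working on word indices rather than raw timestamps keeps the adjacency test independent of the timestamp format.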

In addition, if the user also specifies the order in which the selected sentences are to be combined when selecting a plurality of sentences on the editing screen, the selection unit 16 accepts the combination order together with the selected sentences. Furthermore, if the user rewrites or deletes part of the time-stamped text on the editing screen and then selects a sentence that reflects the rewriting or deletion, the selection unit 16 accepts the rewritten sentence.

The specifying unit 17 specifies the time range of each of the plurality of selected sentences based on the time-stamped text. Specifically, using the time-stamped text, the specifying unit 17 specifies as the time range of a selected sentence the span from the timestamp of the first word of the selected sentence to the timestamp of the word that follows the selected sentence.

The cutting unit 18 cuts short videos out of the original video in the storage unit 100 based on the time ranges specified by the specifying unit 17. Specifically, the cutting unit 18 cuts out of the original video, as separate short videos, the portions of the original video corresponding to the respective time ranges.

If a selected sentence contains a filler, the cutting unit 18 may delete the filler part from the portion of the original video corresponding to the time range specified by the specifying unit 17 before cutting that portion out of the original video as a short video. Also, if the user has rewritten or deleted part of the time-stamped text on the editing screen so that the selected sentence differs from the original video, the cutting unit 18 may edit the cut-out short video to match the changed selected sentence.
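One way to picture the filler handling is as interval subtraction: the time range of the selected sentence is split into the sub-ranges that remain once the filler ranges are removed. The sketch below is an illustration under assumptions, not the method in the publication. Times are plain seconds as floats for simplicity, and the `subtract_fillers` name and the sample values are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start, end) in seconds, an assumed representation


def subtract_fillers(span: Span, filler_spans: List[Span]) -> List[Span]:
    """Return the parts of span that remain after removing the filler spans."""
    start, end = span
    kept: List[Span] = []
    cursor = start
    for f_start, f_end in sorted(filler_spans):
        if f_end <= cursor or f_start >= end:
            continue  # filler lies entirely outside the remaining range
        if f_start > cursor:
            kept.append((cursor, f_start))  # keep the part before the filler
        cursor = max(cursor, f_end)         # skip over the filler
    if cursor < end:
        kept.append((cursor, end))          # keep the tail after the last filler
    return kept


# A selected sentence from 17.0 s to 30.0 s with one filler from 20.0 s to 21.0 s.
print(subtract_fillers((17.0, 30.0), [(20.0, 21.0)]))
# [(17.0, 20.0), (21.0, 30.0)]
```

The resulting sub-ranges can then be cut and joined back to back to form the short video without the filler.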

The generation unit 19 combines the short videos cut out by the cutting unit 18 to generate a highlight video. Specifically, the generation unit 19 combines the short videos either in timestamp order or in the combination order specified for the selected sentences and accepted by the selection unit 16, and thereby generates the highlight video.
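The two ordering options can be sketched as follows. Clips are represented here as (start, end) pairs, and the user-specified combination order as a list of indices into the clip list; both representations, and the name `order_clips`, are assumptions for illustration.

```python
def order_clips(clips, combine_order=None):
    """Order clips by timestamp, or by a user-specified combination order.

    clips: list of (start, end) time ranges, one per selected sentence.
    combine_order: optional list of indices into clips (hypothetical format).
    """
    if combine_order is None:
        return sorted(clips)  # default: timestamp order
    if sorted(combine_order) != list(range(len(clips))):
        raise ValueError("combine_order must list every clip exactly once")
    return [clips[i] for i in combine_order]


clips = [(30.0, 35.0), (5.0, 9.0), (12.0, 14.5)]
by_time = order_clips(clips)
by_user = order_clips(clips, combine_order=[2, 0, 1])
```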

Furthermore, when combining the short videos, the generation unit 19 may generate the highlight video by adding a video or still image prepared in advance before the first short video and/or after the last short video. This makes it possible, for example, to add a title or an ending to the highlight video.

The generation unit 19 provides the generated highlight video to the user terminal 20 via the transmission/reception unit 11. The generation unit 19 also uploads the generated highlight video to a video site or the like on the Internet via the transmission/reception unit 11, and provides viewers with the highlight video in the form of a URL.

Furthermore, when the highlight video is to have a file format different from that of the original video, the generation unit 19 converts the file format. The file format after conversion may be set by the user on the editing screen and accepted by the selection unit 16 together with the selected sentences, or may be determined according to the video site or the like to which the highlight video is uploaded.

The functional configuration of the present system described above is merely an example; one functional block (a database or a functional processing unit) may be divided, or a plurality of functional blocks may be combined into a single functional block. Each functional processing unit is realized by a CPU (Central Processing Unit) built into the device or terminal reading a computer program (for example, core software, or an application that causes the CPU to execute the various processes described above) stored in a storage device (storage unit) such as a ROM (Read Only Memory), flash memory, SSD (Solid State Drive), or hard disk, and executing that program. In other words, each functional processing unit is realized by this computer program reading and writing necessary data such as tables from and to a database (DB) stored in the storage device or a storage area in memory and, where necessary, controlling related hardware (for example, input/output devices, display devices, and communication interface devices).

[Processing flow]
FIG. 3 is a diagram showing the flow of the highlight video generation process executed by the highlight video generation system according to the embodiment of the present invention. In this embodiment, the highlight video generation process is executed by the highlight video generation device.

First, the acquisition unit 12 acquires the original video via the transmission/reception unit 11, either from the user terminal 20 or from the network in accordance with an instruction from the user terminal 20 (S11). Next, the voice recognition unit 13 performs voice recognition on the original video acquired in S11 and converts the audio into text data (S12). Next, the extraction unit 14 extracts time-stamped text, in which a timestamp for each word is inserted into the text data obtained in S12 (S13).

Next, the display control unit 15 generates, based on the time-stamped text extracted in S13, editing screen data for issuing editing instructions for the original video, and transmits the data to the user terminal 20; the display unit of the user terminal 20 displays the editing screen based on the editing screen data (S14). Next, on the editing screen displayed on the display unit of the user terminal 20 in S14, the selection unit 16 has the user select a plurality of sentences from the time-stamped text such that their timestamps are discontinuous, and accepts the plurality of selected sentences (S15).

Next, the specifying unit 17 specifies the time range of each of the plurality of selected sentences accepted in S15, based on the time-stamped text (S16). Next, the cutting unit 18 cuts out short videos from the original video based on the time ranges specified in S16 (S17). The generation unit 19 then combines the short videos cut out in S17 to generate a highlight video (S18).
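Steps S17 and S18 could, for instance, be carried out with the ffmpeg command-line tool; the embodiment does not name any particular tool, so this choice is an assumption. The sketch below only builds the command lines and the list file for ffmpeg's concat demuxer, without executing anything. The function name `build_commands` and the clip and list file names are hypothetical.

```python
def build_commands(source, ranges, output, list_file="clips.txt"):
    """Build ffmpeg commands that cut each range (S17) and join the clips (S18).

    Assumption: ffmpeg is the tool used; nothing is executed here.
    """
    cut_cmds, clip_names = [], []
    for i, (start, end) in enumerate(ranges):
        clip = f"clip_{i:03d}.mp4"  # hypothetical naming scheme
        clip_names.append(clip)
        cut_cmds.append([
            "ffmpeg", "-ss", f"{start:.3f}", "-to", f"{end:.3f}",
            "-i", source, "-c", "copy", clip,
        ])
    # Contents of the list file read by ffmpeg's concat demuxer.
    list_text = "".join(f"file '{name}'\n" for name in clip_names)
    concat_cmd = [
        "ffmpeg", "-f", "concat", "-safe", "0",
        "-i", list_file, "-c", "copy", output,
    ]
    return cut_cmds, list_text, concat_cmd


cut_cmds, list_text, concat_cmd = build_commands(
    "talk.mp4", [(0.0, 1.1), (12.5, 14.0)], "highlight.mp4"
)
```

Stream copy (`-c copy`) avoids re-encoding but cuts only at keyframes, so clip boundaries may drift from the word timestamps; re-encoding would give more exact boundaries at the cost of processing time.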

Having the user select sentences such that their timestamps are discontinuous encourages the generation of an appropriate highlight video, and also prevents a redundant highlight video from being generated because the user included unnecessary portions when selecting the desired portions.
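A check of this kind could be sketched as follows. Each selection is represented as a (first, last) pair of word indices in the time-stamped text, and "discontinuous" is interpreted as requiring at least one unselected word between any two selections; both the representation and that interpretation are assumptions, since the embodiment does not define the test precisely.

```python
def selections_are_discontinuous(selections):
    """Return True if no two selections overlap or sit directly next to each other.

    selections: list of (first, last) word-index pairs, inclusive.
    """
    ordered = sorted(selections)
    for (_, prev_last), (next_first, _) in zip(ordered, ordered[1:]):
        # Adjacent or overlapping selections have continuous timestamps.
        if next_first <= prev_last + 1:
            return False
    return True


ok = selections_are_discontinuous([(5, 7), (0, 2)])       # gap between word 2 and 5
rejected = selections_are_discontinuous([(0, 2), (3, 4)])  # word 3 follows word 2
```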

[Modified examples]
(1) For example, the highlight video generation system may include a captioning unit that adds captions to the highlight video. The captions may be generated by the captioning unit based on the original video and/or the time-stamped text, or captions that the user has entered for a selected sentence, or has chosen from within the selected sentence, on the editing screen may be received from the user terminal 20. Adding captions to the highlight video makes it possible to clearly display what the user wants to convey or wants viewers to keep in mind.

(2) For example, the highlight video generation system may include an insertion unit that inserts sound data such as background music (BGM) and sound effects into the highlight video. This allows the highlight video to be presented to viewers more effectively, and a greater marketing effect can be expected.

Although the present invention has been described above using the embodiment, it goes without saying that the technical scope of the present invention is not limited to the scope described in the embodiment. It will be apparent to those skilled in the art that various changes or improvements can be made to the embodiment. It is also clear from the claims that forms with such changes or improvements may be included within the technical scope of the present invention. Note that, although the embodiment describes the present invention as a product invention, namely a highlight video generation system, the present invention can also be regarded as an invention of the method executed by the highlight video generation system, or of a program that causes the highlight video generation system to function as the various means.

1 Highlight video generation system
10 Highlight video generation device
11 Transmission/reception unit
12 Acquisition unit
13 Voice recognition unit
14 Extraction unit
15 Display control unit
16 Selection unit
17 Specifying unit
18 Cutting unit
19 Generation unit
100 Storage unit
20 User terminal

The present invention provides a highlight video generation system that generates a highlight video by cutting out a plurality of short videos from an original video to be edited and combining them, the system comprising: an acquisition unit that acquires the original video; a voice recognition unit that performs voice recognition on the acquired original video; an extraction unit that extracts, based on the result of the voice recognition, the audio of the original video as time-stamped text in which a timestamp is added to each word; a display unit that displays the extracted time-stamped text on an editing screen; a selection unit that, on the editing screen, has a user make a plurality of selections from the time-stamped text, each selection being one word or a plurality of consecutive words, and accepts the user's selections when the timestamps of the selections are not continuous with one another; a specifying unit that specifies the time range of each of the selections based on the time-stamped text; a cutting unit that cuts out, from the original video, the portion corresponding to each specified time range as a short video; and a generation unit that combines the cut-out short videos to generate a highlight video.

The present invention also provides a highlight video generation method executed by a computer, the method including: a step of acquiring an original video; a step of performing voice recognition on the acquired original video; a step of extracting, based on the result of the voice recognition, the audio of the original video as time-stamped text in which a timestamp is added to each word; a step of displaying the extracted time-stamped text on an editing screen; a step of having a user make, on the editing screen, a plurality of selections from the time-stamped text, each selection being one word or a plurality of consecutive words, and accepting the user's selections when the timestamps of the selections are not continuous with one another; a step of specifying the time range of each of the selections based on the time-stamped text; a step of cutting out, from the original video, the portion corresponding to each specified time range as a short video; and a step of combining the cut-out short videos to generate a highlight video.

The present invention also provides a program that causes a highlight video generation system, which generates a highlight video by cutting out a plurality of short videos from an original video to be edited and combining them, to function as: an acquisition unit that acquires the original video; a voice recognition unit that performs voice recognition on the acquired original video; an extraction unit that extracts, based on the result of the voice recognition, the audio of the original video as time-stamped text in which a timestamp is added to each word; a display unit that displays the extracted time-stamped text on an editing screen; a selection unit that, on the editing screen, has a user make a plurality of selections from the time-stamped text, each selection being one word or a plurality of consecutive words, and accepts the user's selections when the timestamps of the selections are not continuous with one another; a specifying unit that specifies the time range of each of the selections based on the time-stamped text; a cutting unit that cuts out, from the original video, the portion corresponding to each specified time range as a short video; and a generation unit that combines the cut-out short videos to generate a highlight video.

Claims

1. A highlight video generation system that generates a highlight video by cutting out a plurality of short videos from an original video to be edited and combining them, the system comprising:
an acquisition unit that acquires the original video;
a voice recognition unit that performs voice recognition on the acquired original video;
an extraction unit that extracts, based on the result of the voice recognition, the audio of the original video as time-stamped text that holds a timestamp for each word;
a display unit that displays the extracted time-stamped text on an editing screen;
a selection unit that, on the editing screen, has a user make a plurality of selections from the time-stamped text, each selection being one word or a plurality of consecutive words, such that the timestamps are discontinuous, and accepts the user's selections;
a specifying unit that specifies the time range of each of the selections based on the time-stamped text;
a cutting unit that cuts out, from the original video, the portion corresponding to each specified time range as a short video; and
a generation unit that generates a highlight video by combining the cut-out short videos.

2. The highlight video generation system according to claim 1, wherein the generation unit further combines a video prepared in advance at the beginning and/or the end of the highlight video.

3. The highlight video generation system according to claim 1, wherein the selection unit accepts a selection of a combination order for the plurality of selections, each being one word or a plurality of consecutive words, and the generation unit generates the highlight video by combining the cut-out short videos in accordance with the combination order.

4. The highlight video generation system according to any one of claims 1 to 3, wherein, when the extracted time-stamped text includes a filler, the display unit either deletes the filler or displays it on the editing screen so that it can be distinguished from words other than the filler.

5. The highlight video generation system according to claim 1, wherein, when the selections, each being one word or a plurality of consecutive words, include a filler, the cutting unit removes the portion corresponding to the filler from the portions of the original video corresponding to the respective specified time ranges, and cuts out each resulting portion as a short video.

6. The highlight video generation system according to any one of claims 1 to 5, further comprising a captioning unit that adds a caption to the highlight video.

7. The highlight video generation system according to any one of claims 1 to 6, further comprising a music insertion unit that adds music to the highlight video.

8. A highlight video generation method executed by a highlight video generation system that generates a highlight video by cutting out a plurality of short videos from an original video to be edited and combining them, the method comprising:
acquiring the original video;
performing voice recognition on the acquired original video;
extracting, based on the result of the voice recognition, the audio of the original video as time-stamped text that holds a timestamp for each word;
displaying the extracted time-stamped text on an editing screen;
having a user make, on the editing screen, a plurality of selections from the time-stamped text, each selection being one word or a plurality of consecutive words, such that the timestamps are discontinuous, and accepting the user's selections;
specifying the time range of each of the selections based on the time-stamped text;
cutting out, from the original video, the portion corresponding to each specified time range as a short video; and
combining the cut-out short videos to generate a highlight video.

9. A program that causes a highlight video generation system, which generates a highlight video by cutting out a plurality of short videos from an original video to be edited and combining them, to function as:
an acquisition unit that acquires the original video;
a voice recognition unit that performs voice recognition on the acquired original video;
an extraction unit that extracts, based on the result of the voice recognition, the audio of the original video as time-stamped text that holds a timestamp for each word;
a display unit that displays the extracted time-stamped text on an editing screen;
a selection unit that, on the editing screen, has a user make a plurality of selections from the time-stamped text, each selection being one word or a plurality of consecutive words, such that the timestamps are discontinuous, and accepts the user's selections;
a specifying unit that specifies the time range of each of the selections based on the time-stamped text;
a cutting unit that cuts out, from the original video, the portion corresponding to each specified time range as a short video; and
a generation unit that generates a highlight video by combining the cut-out short videos.