JP2023005038A - Moving image summarization device, moving image summarization method, and program

Info

Publication number: JP2023005038A
Application number: JP2021106710A
Authority: JP
Inventors: Takeshi Ishii (石井 健); Takahiro Matsumoto (松本 貴宏); Yoshimune Okumura (奥村 宜宗)
Original assignee: NTT Communications Corp
Current assignee: NTT Communications Corp
Priority date: 2021-06-28
Filing date: 2021-06-28
Publication date: 2023-01-18
Anticipated expiration: 2041-06-28
Also published as: JP7369739B2

Abstract

PROBLEM TO BE SOLVED: To provide a moving image summarization device, a moving image summarization method, and a program that are highly versatile and not limited to a specific genre.

SOLUTION: In a system in which a moving image summarization device and a terminal are connected to a network, a moving image summarization device 100 for generating a summary of a moving image includes: a speech-to-text conversion unit 120 that acquires text by performing speech recognition on speech contained in the moving image; a text summarization unit 130 that summarizes a passage obtained from the text into a plurality of sentences; and a summary moving image generation unit 150 that acquires the time segment corresponding to each of the plurality of sentences, extracts the partial moving image corresponding to each time segment from the moving image, and combines the extracted partial moving images to generate a summary moving image.

SELECTED DRAWING: Figure 2

Description

The present invention relates to technology for summarizing videos.

There has long been a demand to grasp the content of a long video (also called a moving image) in a short time, and various video summarization techniques have been proposed (for example, Patent Documents 1 to 3).

For example, when the best presentations must be selected from a large number of recorded presentation videos, it is difficult to spend the time needed to watch every video in full. Video summarization technology makes it possible to grasp the content of each video in a short time and to evaluate the videos efficiently.

Patent Document 1: JP 2010-039877 A
Patent Document 2: JP 2011-061263 A
Patent Document 3: JP 2015-099958 A

Conventional video summarization techniques generally extract image features associated with a specific genre from a video and use those features to create a summary of the video. However, videos span many genres, and genres and forms of expression that do not exist today may emerge in the future.

Consequently, conventional video summarization techniques can summarize only videos of a specific genre and cannot summarize videos in a general-purpose manner.

The present invention has been made in view of the above, and its object is to provide a highly versatile video summarization technique that is not limited to a specific genre.

According to the disclosed technology, there is provided a video summarization device for creating a summary of a video, comprising:
a speech-to-text conversion unit that acquires text by performing speech recognition on audio contained in the video;
a text summarization unit that summarizes a passage obtained from the text into a plurality of sentences; and
a summary video generation unit that acquires the time segment corresponding to each of the plurality of sentences, extracts from the video the partial video corresponding to each time segment, and combines the extracted partial videos to generate a summary video.

According to the disclosed technology, a highly versatile video summarization technique that is not limited to a specific genre can be realized.

FIG. 1 is an overall configuration diagram of a system according to an embodiment of the present invention.
FIG. 2 is a functional configuration diagram of the video summarization device.
FIG. 3 is a sequence diagram for explaining the operation of the system.
FIG. 4 is a diagram showing an example of a display screen.
FIG. 5 is a diagram showing an example of a display screen.
FIG. 6 is a diagram for explaining an example of a method for extracting video segments.
FIG. 7 is a diagram showing an example of the hardware configuration of a device.

An embodiment of the present invention (the present embodiment) is described below with reference to the drawings. The embodiment described below is merely an example, and the embodiments to which the present invention can be applied are not limited to it.

In the following description, unless otherwise noted, a "video" is a video with audio, and the audio and the video are synchronized. The audio may be included in the video in any format. For example, the video and the audio may be provided as separate files.

(System configuration example)
FIG. 1 shows an example of the overall configuration of the system according to the present embodiment. As shown in FIG. 1, the system has a configuration in which a video summarization device 100 and a terminal 200 are connected to a network 300.

The video summarization device 100 is a device that summarizes videos using the technology according to the present invention. The terminal 200 is a general-purpose terminal such as a smartphone or a PC. The network 300 is, for example, the Internet. The network 300 may also be a small-scale network such as a LAN.

(Configuration example of the video summarization device 100)
FIG. 2 shows an example of the functional configuration of the video summarization device 100. As shown in FIG. 2, the video summarization device 100 has a video data acquisition unit 110, a speech-to-text processing unit 120, a text summarization unit 130, a video segment extraction unit 140, a summary video generation unit 150, and a data storage unit 160. The function of the video segment extraction unit 140 may be included in the summary video generation unit 150. The function of each unit is outlined below.

The video data acquisition unit 110 acquires video data from the terminal 200 or the like. The speech-to-text processing unit 120 converts the audio in the video into a passage of text. The text summarization unit 130 summarizes the passage converted from the audio by the speech-to-text processing unit 120. The video segment extraction unit 140 extracts time segments of the video based on the time positions, obtained by the speech-to-text processing unit 120, of the summary sentences (extracted sentences). The summary video generation unit 150 generates a summary video using the video of the time segments extracted by the video segment extraction unit 140. The data storage unit 160 stores various data.
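The division of labor among these units can be pictured as a small pipeline. The following Python sketch is illustrative only: the function and parameter names are hypothetical, and each stage is passed in as a callable because the specification leaves the concrete speech recognition, summarization, and video editing techniques open.

```python
from typing import Callable, List, Tuple

# (text, start_sec, end_sec): a sentence with its time position in the video.
TimedSentence = Tuple[str, float, float]
Segment = Tuple[float, float]


def summarize_video(
    video_path: str,
    speech_to_text: Callable[[str], List[TimedSentence]],                  # role of unit 120
    summarize_text: Callable[[List[TimedSentence]], List[TimedSentence]],  # role of unit 130
    cut_and_join: Callable[[str, List[Segment]], str],                     # role of units 140 and 150
) -> str:
    """Run the stages in order and return the path of the summary video."""
    sentences = speech_to_text(video_path)
    selected = summarize_text(sentences)
    # Only the time positions of the selected sentences are needed for cutting.
    segments = [(start, end) for _, start, end in selected]
    return cut_and_join(video_path, segments)
```

Keeping the stages as interchangeable callables mirrors the point that no stage depends on genre-specific image features.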

(Example of system operation)
Next, an example of the system operation is described following the sequence chart of FIG. 3. It is assumed here that the terminal 200 holds data for various videos (also called video files) and that each video contains audio of human speech.

Suppose the user of the terminal 200 wants to view a summary of a certain video. In S101, the user specifies, on the terminal 200, the video whose summary the user wishes to view.

In S102, the terminal 200 uploads the data of the specified video to the video summarization device 100. The video data acquisition unit 110 of the video summarization device 100 receives the video data and stores it in the data storage unit 160.

The speech-to-text processing unit 120 reads the video data from the data storage unit 160, acquires the audio (here, human speech) from the data, executes speech recognition, and converts the audio into text (S103, S104).

In S105, the speech-to-text processing unit 120 generates, from the text converted from the audio, a passage consisting of a plurality of sentences. The passage may be generated as a list of sentences. In S106, the speech-to-text processing unit 120 acquires time information (segment and length) for each sentence. The generated passage (list of sentences) is stored in the data storage unit 160 together with the time information of each sentence. The technique of generating sentences from audio by speech recognition is itself existing technology.
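One way to hold the list of sentences and their time information is a small record type. This is a minimal sketch, not the format used by the device: the class name `TimedSentence`, its field names, and the `M:SS` listing format are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TimedSentence:
    index: int        # position of the sentence in the full list (1-based)
    text: str
    start_sec: float  # start time within the video
    end_sec: float    # end time within the video

    @property
    def length_sec(self) -> float:
        # The length is derived from the segment rather than stored separately.
        return self.end_sec - self.start_sec


def format_clock(seconds: float) -> str:
    """Render seconds as M:SS for a listing of sentences with their times."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"


def render_listing(sentences: List[TimedSentence]) -> List[str]:
    return [
        f"{format_clock(s.start_sec)}-{format_clock(s.end_sec)}  {s.text}"
        for s in sentences
    ]
```

Deriving the length from the start and end times keeps the three values of the time information consistent with each other.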

In S107, the speech-to-text processing unit 120 transmits the generated list of sentences (with the time information of each sentence) to the terminal 200. In S108, the list of sentences is displayed on the terminal 200.

FIG. 4 shows an example of the screen displayed on the terminal 200 in S108. In this example, the duration of the target video and the list of sentences obtained by speech recognition, each with its time, are displayed on the screen. The desired length (duration) of the summary video can be specified in the "summary duration" field, and summarization can be started with the "start summarization" button.

The desired length of the summary video may also be specified as a number of sentences. For example, if there are 100 sentences in total, specifying "20" as the number of sentences produces a summary video covering 20 sentences.

Alternatively, the desired length may be specified as the ratio of the summary length to the total length (called the summarization rate). For example, when "1/6" is specified, a 10-minute summary video is generated from a 60-minute video, and a 1-minute summary video is generated from a 6-minute video.

The summary video may also be generated at a predetermined summarization rate without the desired length being specified.
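The duration-based and rate-based ways of specifying the length can be reduced to a single time budget in seconds. The sketch below is one possible arrangement under stated assumptions: the function name, its parameters, and the default rate of 1/6 are hypothetical, and a length given as a number of sentences is not a time budget and is therefore not handled here.

```python
from fractions import Fraction
from typing import Optional

DEFAULT_RATE = Fraction(1, 6)  # hypothetical predetermined summarization rate


def target_seconds(total_sec: float,
                   duration_sec: Optional[float] = None,
                   rate: Optional[Fraction] = None) -> float:
    """Return the time budget for the summary video in seconds."""
    if duration_sec is not None:
        # An explicit duration wins, but cannot exceed the video itself.
        return min(duration_sec, total_sec)
    # Otherwise use the given rate, falling back to the predetermined one.
    return float(total_sec * (rate if rate is not None else DEFAULT_RATE))
```

With a rate of 1/6, a 60-minute video (3600 s) yields a budget of 600 s and a 6-minute video (360 s) yields 60 s, matching the example in the text.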

In S109 of FIG. 3, the user specifies the desired summary duration on the terminal 200 and instructs the start of summarization. In S110, the duration is transmitted from the terminal 200 to the video summarization device 100 together with a summarization command. The text summarization unit 130 of the video summarization device 100 receives the summarization command and the duration.

In S111, the text summarization unit 130 reads the passage (list of sentences) obtained by the speech-to-text processing unit 120 and the time information from the data storage unit 160, and creates a summary of the passage that fits the specified duration. Text summarization itself can be realized with existing technology.

Any existing technique may be used to summarize the passage. In the present embodiment, as an example, the summary is made by extracting a plurality of sentences from the passage. Suppose, for example, that the whole passage contains 60 sentences, sentence 1 to sentence 60. Each sentence is associated with a time segment in the video (start time, end time, and duration).

For example, if the specified duration is 10 minutes, the text summarization unit 130 extracts a plurality of sentences considered important so that the durations of the sentences included in the summary total 10 minutes, such as "sentence 1 (1 minute), sentence 20 (1 minute), sentence 21 (2 minutes), sentence 53 (3 minutes), sentence 54 (1 minute), and sentence 60 (2 minutes)". The extracted sentences and their time segment information are stored in the data storage unit 160.

As described above, the length of the summary may be specified as a number of sentences or a summarization rate instead of a duration. When a number of sentences is specified, that number of sentences is extracted. When a summarization rate is specified, sentences are extracted so that the total duration corresponds to that rate.
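The extraction step can be sketched as a greedy selection under a time budget or a sentence-count limit. This is an assumption-laden illustration, not the method of the specification: the text only says that important sentences are extracted with existing technology, so the importance scores here are a hypothetical input, and the greedy strategy is merely one simple choice.

```python
from typing import Dict, List, Optional, Tuple

# (sentence number, length in seconds)
Sentence = Tuple[int, float]


def select_sentences(sentences: List[Sentence],
                     scores: Dict[int, float],
                     budget_sec: Optional[float] = None,
                     max_count: Optional[int] = None) -> List[int]:
    """Pick high-scoring sentences until the time budget or count limit is reached."""
    ranked = sorted(sentences, key=lambda s: scores[s[0]], reverse=True)
    chosen: List[int] = []
    used = 0.0
    for number, length in ranked:
        if max_count is not None and len(chosen) >= max_count:
            break
        if budget_sec is not None and used + length > budget_sec:
            continue  # too long for the remaining budget; try the next sentence
        chosen.append(number)
        used += length
    return sorted(chosen)  # restore the original chronological order
```

Returning the sentence numbers in chronological order means the partial videos are later joined in the order in which they appear in the original video.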

In the example of FIG. 3, S111 is followed by S112. However, this is only an example: after S111 and before S112, the text summarization unit 130 may transmit the plurality of sentences resulting from the summarization (a list of sentences) to the terminal 200. That is, the sentences resulting from the summarization may be output to the user.

In this case, for example, the screen shown in FIG. 5 is displayed on the terminal 200. As shown in FIG. 5, the summary of the passage is added to the screen of FIG. 4. On this screen, the user can specify, from among all the sentences, sentences whose time segments should be added to the summary video, and can also specify sentences whose time segments should be removed from it.

When sentences are specified for addition, the video summarization device 100 can generate the summary video from the sentences obtained by adding the user-specified sentences to the sentences resulting from the summarization. Likewise, when sentences are specified for deletion, the video summarization device 100 can generate the summary video from the sentences that remain after the user-specified sentences are deleted from the sentences resulting from the summarization.
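The addition and deletion processing amounts to set operations on sentence numbers. The following is a minimal sketch with a hypothetical function name; the rule that a sentence listed for both addition and deletion is dropped is an assumption, since the text does not cover that case.

```python
from typing import Iterable, List


def apply_user_edits(summary: List[int],
                     add: Iterable[int] = (),
                     remove: Iterable[int] = ()) -> List[int]:
    """Return the sentence numbers to use for the summary video after user edits."""
    # Union with the additions, then drop the deletions (deletion wins on conflict).
    merged = (set(summary) | set(add)) - set(remove)
    return sorted(merged)  # keep chronological order for later joining
```

For instance, adding sentence 5 and removing sentence 54 from the example summary leaves sentences 1, 5, 20, 21, 53, and 60.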

In S112, the video segment extraction unit 140 acquires the information on the corresponding time segment for each sentence extracted in S111.

In the above example, (start time: 0 min 0 s, end time: 1 min 0 s) is acquired for sentence 1 and, for example, (start time: 20 min 30 s, end time: 21 min 30 s) is acquired for sentence 20. The same applies to the other sentences.

In S113, for each time segment acquired in S112, the video segment extraction unit 140 extracts from the whole video the video segment corresponding to that time segment (the video at the time position of the time segment, which may also be called a partial video).

Continuing the above example, from the 60-minute video, video 1 is extracted for the time segment corresponding to sentence 1 (start time: 0 min 0 s, end time: 1 min 0 s), and video 20 is extracted for the time segment corresponding to sentence 20 (start time: 20 min 30 s, end time: 21 min 30 s). The same applies to the other time segments.
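The specification does not name a tool for cutting out partial videos. As one possible realization, the sketch below builds command lines for the ffmpeg command-line tool, using its `-ss` and `-to` options with stream copy. The function names and output file names are hypothetical, and the commands are only constructed here, not executed.

```python
from typing import List, Tuple


def to_seconds(minutes: int, seconds: int) -> int:
    return minutes * 60 + seconds


def cut_commands(source: str, segments: List[Tuple[int, int]]) -> List[List[str]]:
    """Build one ffmpeg invocation per (start_sec, end_sec) segment."""
    commands = []
    for i, (start, end) in enumerate(segments, start=1):
        commands.append([
            "ffmpeg", "-i", source,
            "-ss", str(start), "-to", str(end),  # keep the span from start to end
            "-c", "copy",                         # stream copy: fast, but cut points may not be frame-accurate
            f"part_{i:03d}.mp4",                  # hypothetical output name
        ])
    return commands
```

For sentence 20 in the example, the segment from 20 min 30 s to 21 min 30 s becomes `-ss 1230 -to 1290`. Re-encoding instead of stream copy would be the alternative if frame-accurate cut points matter.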

FIG. 6 illustrates the video extraction. As shown in FIG. 6, the video of the segment corresponding to the time segment of each extracted sentence is extracted.

In S114 of FIG. 3, the summary video generation unit 150 generates the summary video by combining the videos extracted in S113. For example, if video 1, video 20, video 21, video 53, video 54, and video 60 were extracted in S113, they are combined and "video 1 + video 20 + video 21 + video 53 + video 54 + video 60" is generated as the summary video.
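Joining the partial videos could likewise be done with ffmpeg's concat demuxer, which reads a list file of `file '...'` lines. This is a sketch under assumptions: the tool choice and the function names are not from the specification, and the file names are assumed to contain no single quotes (which would need escaping).

```python
from typing import List


def concat_list(parts: List[str]) -> str:
    """Return the text of a list file for ffmpeg's concat demuxer, in playback order."""
    return "".join(f"file '{p}'\n" for p in parts)


def concat_command(list_path: str, output: str) -> List[str]:
    # -safe 0 lets the list file reference paths the demuxer would otherwise reject.
    return ["ffmpeg", "-f", "concat", "-safe", "0", "-i", list_path, "-c", "copy", output]
```

The order of the entries in the list file determines the order of the partial videos in the summary video, so the parts should be listed chronologically.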

In S115, the summary video generation unit 150 transmits the generated summary video to the terminal 200. In S116, the summary video is displayed on the terminal 200, and the user views it.

(Other examples)
The sequence shown in FIG. 3 is an example, and the processing may follow a different procedure. For example, the processing of S107 to S110 may be omitted. In that case, a predetermined summarization rate may be used, or the summary duration or the like may be specified from the terminal 200 to the video summarization device 100 in S101 and S102.

The configuration shown in FIG. 1 is also an example and is not limiting. For example, the functions of the video summarization device 100 may be included in the terminal 200. In that case, the terminal 200 can generate the summary video from the original video and display it within the terminal itself. A terminal that includes the functions of the video summarization device 100 may itself be called a "video summarization device". When the functions of the video summarization device 100 are included in the terminal 200, transmitting information to the terminal 200 in FIG. 3 corresponds to displaying the information on the display of the terminal 200.

In the examples described so far, text is acquired by speech recognition from the audio contained in the video, and the summary video is generated using that text. For a video that contains text such as subtitles, however, the summary video may be generated using that text without speech recognition.

In this case, for example, the video data (video file) contains the video and the text. The video and the text are synchronized by timestamps or the like but are stored separately, so the text can be acquired without analyzing the video.

For example, the speech-to-text processing unit 120 generates, from this text, a passage consisting of a plurality of sentences. The passage is generated as a list of sentences, and each sentence carries information on the time segment in the video that corresponds to it. The subsequent processing is the same as described above. Even when the text contained in the video is used, video extraction proceeds as illustrated in FIG. 6.
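To make the subtitle-based variant concrete, the sketch below turns timestamped text into (text, start, end) entries. It assumes, purely for illustration, that the text is available in SubRip (SRT) form with `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing lines; the specification does not name a subtitle format, and a subtitle cue does not necessarily coincide with one sentence.

```python
import re
from typing import List, Tuple

TIME = r"(\d{2}):(\d{2}):(\d{2}),(\d{3})"
CUE = re.compile(TIME + r" --> " + TIME)


def parse_cues(srt_text: str) -> List[Tuple[str, float, float]]:
    """Return (text, start_sec, end_sec) for each cue in SRT-formatted text."""
    cues = []
    for block in srt_text.strip().split("\n\n"):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            match = CUE.match(line.strip())
            if match:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
                start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
                end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
                # Everything after the timing line is the cue text.
                cues.append((" ".join(lines[i + 1:]).strip(), start, end))
                break
    return cues
```

The resulting entries have the same shape as the output of speech recognition with time information, which is why the later steps can stay unchanged.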

(Hardware configuration example)
Both the video summarization device 100 and the terminal 200 can be realized, for example, by causing a computer to execute a program describing the processing explained in the present embodiment. The computer may be a physical machine or a virtual machine in the cloud. The video summarization device 100 and the terminal 200 are collectively referred to below as the "device".

That is, the device can be realized by executing a program corresponding to the processing performed by the device, using hardware resources such as a CPU and memory built into the computer. The program can be recorded on a computer-readable recording medium (such as portable memory) for storage or distribution. The program can also be provided over a network, for example via the Internet or e-mail.

FIG. 7 shows an example of the hardware configuration of the computer in the present embodiment. The computer of FIG. 7 has a drive device 1000, an auxiliary storage device 1002, a memory device 1003, a CPU 1004, an interface device 1005, a display device 1006, an input device 1007, an output device 1008, and the like, which are interconnected via a bus B.

The program that realizes the processing on the computer is provided, for example, on a recording medium 1001 such as a CD-ROM or a memory card. When the recording medium 1001 storing the program is set in the drive device 1000, the program is installed from the recording medium 1001 into the auxiliary storage device 1002 via the drive device 1000. However, the program need not be installed from the recording medium 1001 and may instead be downloaded from another computer via a network. The auxiliary storage device 1002 stores the installed program as well as necessary files, data, and the like.

When an instruction to start the program is given, the memory device 1003 reads the program from the auxiliary storage device 1002 and stores it. The CPU 1004 realizes the functions of the device in accordance with the program stored in the memory device 1003.

The interface device 1005 is used as an interface for connecting to a network. The display device 1006 displays a GUI (Graphical User Interface) or the like provided by the program. The input device 1007 consists of a keyboard and mouse, buttons, a touch panel, or the like, and is used to input various operation instructions. The output device 1008 outputs computation results.

(Effects of the embodiment)
With the technology according to the present embodiment, a video can be summarized using the audio or text contained in it, so the features of videos need not be defined in advance, and a highly versatile video summarization technique can be realized. More specific effects are as follows.

Demand for video content is growing as 5G moves into full-scale deployment. In addition, because of the COVID-19 pandemic, almost all events such as seminars and training sessions are now held online, and recording them as video has increased sharply. Under these circumstances, the technology according to the present embodiment can automatically generate summary videos, which can be watched easily on a smartphone, tablet, or the like in spare moments. It is also very effective as promotion that encourages viewing of recorded seminars and lectures and of training videos, and there is strong demand for this technology.

（実施の形態のまとめ）
本明細書には、少なくとも下記各項の動画要約装置、動画要約方法、及びプログラムが開示されている。
（第１項）
動画の要約を作成する動画要約装置であって、
前記動画に含まれる音声に対して音声認識を行うことにより、テキストを取得する音声テキスト化部と、
前記テキストから得られた文章を複数の文に要約する文章要約部と、
前記複数の文におけるそれぞれの文に対応する時間区間を取得し、前記動画から各時間区間に対応する部分動画を抽出し、抽出された部分動画を結合することにより要約動画を生成する要約動画生成部と
を備える動画要約装置。
（第２項）
動画の要約を作成する動画要約装置であって、
前記動画に含まれるテキストから得られた文章を複数の文に要約する文章要約部と、
前記複数の文におけるそれぞれの文に対応する時間区間を取得し、前記動画から各時間区間に対応する部分動画を抽出し、抽出された部分動画を結合することにより要約動画を生成する要約動画生成部と
を備える動画要約装置。
（第３項）
前記文章要約部は、ユーザから指定された時間長、ユーザから指定された文の数、ユーザから指定された要約率、又は、予め定めた要約率に基づいて、前記要約を実行する
第１項又は第２項に記載の動画要約装置。
（第４項）
前記文章要約部は、前記文章の要約結果である前記複数の文をユーザに対して出力し、
前記要約動画生成部は、ユーザから指定された文を前記複数の文に追加する追加処理、又は、ユーザから指定された文を前記複数の文から削除する削除処理を実行し、前記追加処理又は前記削除処理がなされた複数の文から要約動画を生成する
第１項ないし第３項のうちいずれか１項に記載の動画要約装置。
（第５項）
動画の要約を作成する動画要約装置が実行する動画要約方法であって、
前記動画に含まれる音声に対して音声認識を行うことにより、テキストを取得する音声テキスト化ステップと、
前記テキストから得られた文章を複数の文に要約する文章要約ステップと、
前記複数の文におけるそれぞれの文に対応する時間区間を取得し、前記動画から各時間区間に対応する部分動画を抽出し、抽出された部分動画を結合することにより要約動画を生成する要約動画生成ステップと
を備える動画要約方法。
（第６項）
動画の要約を作成する動画要約装置が実行する動画要約方法であって、
前記動画に含まれるテキストから得られた文章を複数の文に要約する文章要約ステップと、
前記複数の文におけるそれぞれの文に対応する時間区間を取得し、前記動画から各時間区間に対応する部分動画を抽出し、抽出された部分動画を結合することにより要約動画を生成する要約動画生成ステップと
を備える動画要約方法。
（第７項）
コンピュータを、第１項ないし第４項のうちいずれか１項に記載の動画要約装置における各部として機能させるためのプログラム。 (Summary of embodiment)
This specification discloses at least the moving image summarizing device, moving image summarizing method, and program described below.
(Section 1)
A video summarizing device for creating a video summary,
a speech-to-text conversion unit that acquires text by performing speech recognition on speech contained in the moving image;
a sentence summarization unit that summarizes sentences obtained from the text into a plurality of sentences;
Generating a summary video by obtaining a time segment corresponding to each sentence in the plurality of sentences, extracting partial videos corresponding to each time segment from the video, and combining the extracted partial videos to generate a summary video. A moving picture summarization device comprising a section and .
(Section 2)
A video summarizing device for creating a video summary,
a sentence summarizing unit that summarizes sentences obtained from the text included in the moving image into a plurality of sentences;
Generating a summary video by obtaining a time segment corresponding to each sentence in the plurality of sentences, extracting partial videos corresponding to each time segment from the video, and combining the extracted partial videos to generate a summary video. A moving picture summarization device comprising a section and .
(Section 3)
The text summarization unit executes the summarization based on the length of time specified by the user, the number of sentences specified by the user, the summarization rate specified by the user, or a predetermined summarization rate. Or the moving picture summarization device according to item 2.
(Section 4)
The sentence summarization unit outputs the plurality of sentences, which are the results of summarizing the sentences, to the user;
The summary video generation unit performs an addition process of adding a sentence specified by the user to the plurality of sentences, or a deletion process of deleting the sentence specified by the user from the plurality of sentences, and performs the addition process or 3. The moving image summarizing device according to any one of items 1 to 3, wherein a summarized moving image is generated from the plurality of sentences subjected to the deletion processing.
(Section 5)
A video summarizing method executed by a video summarizing device that creates a video summary,
a speech-to-text conversion step of obtaining text by performing speech recognition on speech contained in the moving image;
a sentence summarization step of summarizing sentences obtained from the text into a plurality of sentences;
Generating a summary video by obtaining a time segment corresponding to each sentence in the plurality of sentences, extracting partial videos corresponding to each time segment from the video, and combining the extracted partial videos to generate a summary video. A video summarization method comprising steps and .
(Section 6)
A video summarization method executed by a video summarization device that generates a video summary, the method comprising:
a sentence summarization step of summarizing sentences obtained from text included in the video into a plurality of sentences; and
a summary video generation step of obtaining a time segment corresponding to each of the plurality of sentences, extracting from the video a partial video corresponding to each time segment, and combining the extracted partial videos to generate a summary video.
(Section 7)
A program for causing a computer to function as each unit of the video summarization device according to any one of Sections 1 to 4.

Although the present embodiment has been described above, the present invention is not limited to this specific embodiment, and various modifications and changes are possible within the scope of the gist of the present invention as set forth in the claims.

100 Video summarization device
110 Video data acquisition unit
120 Speech-to-text processing unit
130 Sentence summarization unit
140 Video section extraction unit
150 Summary video generation unit
160 Data storage unit
1000 Drive device
1002 Auxiliary storage device
1003 Memory device
1004 CPU
1005 Interface device
1006 Display device
1007 Input device
1008 Output device

Claims

A video summarization device for generating a video summary, comprising:
a speech-to-text conversion unit that obtains text by performing speech recognition on speech contained in the video;
a sentence summarization unit that summarizes sentences obtained from the text into a plurality of sentences; and
a summary video generation unit that obtains a time segment corresponding to each of the plurality of sentences, extracts from the video a partial video corresponding to each time segment, and combines the extracted partial videos to generate a summary video.

A video summarization device for generating a video summary, comprising:
a sentence summarization unit that summarizes sentences obtained from text included in the video into a plurality of sentences; and
a summary video generation unit that obtains a time segment corresponding to each of the plurality of sentences, extracts from the video a partial video corresponding to each time segment, and combines the extracted partial videos to generate a summary video.

The video summarization device according to claim 1 or 2, wherein the sentence summarization unit performs the summarization based on a length of time specified by the user, a number of sentences specified by the user, a summarization rate specified by the user, or a predetermined summarization rate.

The video summarization device according to any one of claims 1 to 3, wherein the sentence summarization unit outputs to the user the plurality of sentences resulting from the summarization, and
the summary video generation unit performs an addition process of adding a sentence specified by the user to the plurality of sentences, or a deletion process of deleting a sentence specified by the user from the plurality of sentences, and generates the summary video from the plurality of sentences resulting from the addition process or the deletion process.

A video summarization method executed by a video summarization device that generates a video summary, the method comprising:
a speech-to-text conversion step of obtaining text by performing speech recognition on speech contained in the video;
a sentence summarization step of summarizing sentences obtained from the text into a plurality of sentences; and
a summary video generation step of obtaining a time segment corresponding to each of the plurality of sentences, extracting from the video a partial video corresponding to each time segment, and combining the extracted partial videos to generate a summary video.

A video summarization method executed by a video summarization device that generates a video summary, the method comprising:
a sentence summarization step of summarizing sentences obtained from text included in the video into a plurality of sentences; and
a summary video generation step of obtaining a time segment corresponding to each of the plurality of sentences, extracting from the video a partial video corresponding to each time segment, and combining the extracted partial videos to generate a summary video.

A program for causing a computer to function as each unit of the video summarization device according to any one of claims 1 to 4.