JP2019213160A

JP2019213160A - Video editing apparatus, video editing method, and video editing program

Info

Publication number: JP2019213160A
Application number: JP2018110423A
Authority: JP
Inventors: 秀輝衣斐; Hideki Ibi; 純一石垣; Junichi Ishigaki; 司堀ノ内; Tsukasa Horinouchi; 浩之田中; Hiroyuki Tanaka; 大輔山口; Daisuke Yamaguchi; 博教小川; Hironori Ogawa; 貴文松留; Takafumi Matsutome
Original assignee: Individual
Current assignee: Individual
Priority date: 2018-06-08
Filing date: 2018-06-08
Publication date: 2019-12-12
Anticipated expiration: 2038-06-08
Also published as: JP7133367B2

Abstract

To provide a video editing apparatus, a video editing method, and a video editing program that enable editing reflecting the intentions of a video creator while reducing the burden of video editing. SOLUTION: A video editing apparatus 1 includes: a moving image recognition unit 22 for calculating a moving image recognition result with time section from moving image data included in video data; a voice recognition unit 23 for calculating a voice recognition result with time section from audio data included in the video data; a recognition result combining unit 24 for calculating a composite recognition result obtained by combining the moving image recognition result with time section and the voice recognition result with time section at each playback time; and an editing information allocation unit 25 for determining editing information to be applied on the basis of the composite recognition result. SELECTED DRAWING: Figure 1

Description

The present invention relates to a video editing apparatus, a video editing method, and a video editing program.

Conventionally, when creating a video, the footage that serves as editing material is shot first and then edited, for example by cutting unnecessary parts or by adding telops (on-screen captions), effects, moving image content, or audio content. Such editing used to require a high-performance computer and dedicated software intended for engineers, and operating them demanded advanced skills and specialized knowledge. In recent years, with more capable personal computers and the spread of video cameras, software that makes video editing simple has been developed, so that even general users without specialized skills or knowledge can easily edit videos.

For example, Patent Document 1 discloses a video editing support apparatus that generates editing information by learning from edited sample videos and supports the user's editing based on the generated editing information. With such an apparatus, a general user can easily perform video editing that imitates the sample videos, including cuts of unnecessary parts and added telops, effects, moving image content, and audio content.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2013-080989 (JP2013-080989A)

However, although techniques exist that detect scenes similar to a sample video and automatically apply the same editing as in the sample video, editing that reflects the video creator's intention requires a sufficient quantity of sample videos containing similar scenes. Techniques that automatically cut silent portions or automatically insert telops are also known, but they cannot perform editing that reflects the video creator's intention.

In view of the above problems, an object of the present invention is to provide a video editing apparatus, a video editing method, and a video editing program that reduce the burden of video editing while enabling editing that reflects the video creator's intention.

The problem to be solved by the present invention is as described above. Means for solving the problem are described next.

That is, the present invention comprises: a moving image recognition result storage unit that stores a moving image recognition result with time section calculated from moving image data included in video data; a voice recognition result storage unit that stores a voice recognition result with time section calculated from audio data included in the video data; a recognition result combining unit that calculates a composite recognition result by combining the moving image recognition result with time section and the voice recognition result with time section at each playback time; and an editing information allocation unit that determines, based on the composite recognition result, the editing information with time section to be applied.

In the recognition result combining unit, the moving image recognition result with time section and the voice recognition result with time section are combined by matching them against a predetermined combination pattern, and the predetermined combination pattern can be registered.

In the editing information allocation unit, the editing information with time section to be applied is determined by matching the composite recognition result against a predetermined combination pattern that pairs composite recognition results with editing information with time section, and this predetermined combination pattern can be registered.

The editing information with time section determined by the editing information allocation unit is output as an editable data group.

The present invention provides the following effects.

The present invention can link compound events that are recognizable from video data to specific editing operations. This enables automatic editing such as the following: adding a telop, generated by voice recognition from a performer's utterance, at a position that depends on the performer's gesture recognized by moving image recognition; adding moving image content, selected according to a keyword recognized by voice recognition, near the performer's face position detected by moving image recognition; adding audio content, selected according to the performer's gesture recognized by moving image recognition, at a timing that does not overlap the utterance sections detected by voice recognition; and cutting, out of a silent section detected by voice recognition, only the part that precedes the time at which a specific object recognized by moving image recognition starts to move.
In addition, by using compound events linked to specific editing operations, the video creator can designate, while shooting, the intended editing for any playback time. This reduces the burden of video editing while enabling editing that reflects the video creator's intention.

FIG. 1 is a block diagram showing a video editing apparatus according to the first embodiment of the present invention.
FIG. 2 is a flowchart showing the recognition result acquisition method within the video editing method according to the first embodiment of the present invention.
FIG. 3 is a flowchart showing the editing information acquisition method within the video editing method according to the first embodiment of the present invention.
FIG. 4 is a flowchart showing the automatic video editing method within the video editing method according to the first embodiment of the present invention.
FIG. 5 is a flowchart showing automatic video cut editing within the video editing method according to the first embodiment of the present invention.

Next, embodiments of the invention will be described.

<First embodiment>
A video editing apparatus 1 according to an embodiment of the present invention will be described with reference to FIG. 1.
The video editing apparatus 1 is an apparatus for editing captured video data. It includes a communication unit 11 that communicates with an external video shooting device, a storage unit 12 that stores captured video data, edited video data, and editing information, and a control unit 13 that edits the captured video data.
The communication unit 11 communicates with the external video shooting device 2 through a wired or wireless communication line.
The storage unit 12 stores captured video data, edited video data, and editing information, and is composed of, for example, RAM or ROM.
The control unit 13 edits the captured video data and is composed of, for example, an arithmetic device.
The control unit 13 also includes: a video input unit 21 that inputs video data; a moving image recognition unit 22, serving as the moving image recognition result storage unit, that stores the moving image recognition result with time section calculated from the moving image data included in the video data; a voice recognition unit 23, serving as the voice recognition result storage unit, that stores the voice recognition result with time section calculated from the audio data included in the video data; a recognition result combining unit 24 that calculates a composite recognition result by combining the moving image recognition result with time section and the voice recognition result with time section at each playback time; and an editing information allocation unit 25 that determines, based on the composite recognition result, the editing information with time section to be applied. The video input unit 21, the moving image recognition unit 22, the voice recognition unit 23, the recognition result combining unit 24, and the editing information allocation unit 25 are realized by, for example, a general-purpose information processing apparatus on which general web browser software is installed.

The video input unit 21 receives video data from the external video shooting device 2 via the communication unit 11. Video data is a data group comprising at least moving image data, audio data, and time data. In addition to these, the video data may include subtitle (caption) data, multiplexed audio data, secondary-title data, chapter data, metadata (tags), and the like.

The moving image recognition unit 22 calculates the moving image recognition result with time section from the moving image data and time data included in the video data. This result is obtained, for example, by decomposing the moving image data into frame images and performing recognition frame by frame. The recognition method is not limited to per-frame decomposition; for example, a method based on multi-frame recognition can also be adopted. Nor is recognition limited to this embodiment: an external tool or a service such as a web API can be used for moving image recognition. In other words, the video editing apparatus 1 need only include at least a part that stores the calculated moving image recognition result with time section.

The moving image recognition unit 22 can also use object recognition and motion recognition.
Object recognition is a recognition technique that takes frame image data as input and whose output includes at least the region in which a pre-registered object appears and the type of that object. There may be more than one region in which objects appear. The output may also include the recognition confidence.

Adopting multi-frame recognition also makes it possible to target a specific movement of an object. For example, when a specific movement of a person is targeted, the output includes at least the region in which the person appears, the type of the person's specific gesture, and the time section in which the person was recognized.

The moving image recognition unit 22 has a frame image extraction unit 31, a frames-per-second extraction unit 32, and a playback time conversion unit 33. The frame image extraction unit 31 extracts frame images with frame indexes from the video data. The frames-per-second extraction unit 32 extracts the number of frames per second from the video data. The playback time conversion unit 33 calculates the playback time from the frame images with frame indexes extracted by the frame image extraction unit 31 and the number of frames per second extracted by the frames-per-second extraction unit 32.
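
The conversion performed by the playback time conversion unit 33 can be sketched as follows. The text does not give a formula, so this is a minimal sketch that assumes a constant frame rate and a 0-based frame index; the function names are hypothetical and exact fractions are used only to avoid rounding drift with rates such as 30000/1001.

```python
from fractions import Fraction


def frame_index_to_playback_time(frame_index: int, frames_per_second: Fraction) -> Fraction:
    """Return the playback time (seconds) at which the given frame starts.

    Assumption: constant frame rate and 0-based frame indexes.
    """
    if frame_index < 0:
        raise ValueError("frame_index must be non-negative")
    if frames_per_second <= 0:
        raise ValueError("frames_per_second must be positive")
    return Fraction(frame_index) / frames_per_second


def frame_time_section(frame_index: int, frames_per_second: Fraction):
    """Return the (start, end) time section covered by one frame."""
    start = frame_index_to_playback_time(frame_index, frames_per_second)
    end = frame_index_to_playback_time(frame_index + 1, frames_per_second)
    return start, end
```

For instance, under these assumptions frame 60 of a 30 frames-per-second video starts at a playback time of 2 seconds.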

In this way, the moving image recognition result with time section, whose section starts at the playback time, is obtained from the indexed frame images and playback times acquired by the frame image extraction unit 31, the frames-per-second extraction unit 32, and the playback time conversion unit 33.

Embodiments of moving image recognition include, for example, gesture recognition that detects a specific gesture using the moving image recognition result with time section, position recognition that detects the position or movement (difference) of a specific object, and facial expression recognition that recognizes a performer's facial expression.

The voice recognition unit 23 calculates the voice recognition result with time section from the audio data and time data included in the video data. This result is, for example, the outcome of recognizing the text of an utterance, specific keywords, silent sections, and the like when utterance speech recognition is used. The recognition method is not limited to utterance speech recognition; it can be combined with waveform recognition that recognizes silent sections and specific non-speech sounds, acoustic signal recognition for music and the like, and word-sequence recognition by multi-pass search. Nor is recognition limited to this embodiment: an external tool or a service such as a web API can be used for voice recognition. In other words, the video editing apparatus 1 need only include at least a part that stores the calculated voice recognition result with time section.

For a section starting at the playback time, the voice recognition unit 23 calculates the voice recognition result with time section by utterance speech recognition from the audio data and time data included in the video data. The voice recognition unit 23 may additionally include in its output the voice volume, the recognition confidence, speaker identification results, and the like obtained from the audio data and time data.
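
One of the outputs mentioned above, the silent section, can be illustrated with a simple amplitude threshold. This is not the recognition method of the embodiment, which leaves the technique open; it is a stand-in sketch that assumes normalized samples in the range -1.0 to 1.0, and the function name, `threshold`, and `min_duration` are hypothetical.

```python
def find_silent_sections(samples, sample_rate, threshold=0.01, min_duration=0.5):
    """Return (start, end) time sections, in seconds, where |sample| < threshold.

    Assumption: a run of quiet samples counts as silence only if it lasts
    at least min_duration seconds.
    """
    sections = []
    run_start = None  # index of the first sample of the current quiet run

    def close_run(end_index):
        # Record the run if it is long enough to count as a silent section.
        if (end_index - run_start) / sample_rate >= min_duration:
            sections.append((run_start / sample_rate, end_index / sample_rate))

    for i, sample in enumerate(samples):
        if abs(sample) < threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            close_run(i)
            run_start = None
    if run_start is not None:
        close_run(len(samples))
    return sections
```

Each returned pair has the same shape as a recognition result with time section: a start time and an end time measured from the beginning of the audio.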

The recognition result combining unit 24 calculates the composite recognition result from the moving image recognition result with time section and the voice recognition result with time section. The composite recognition result is composite data in which these two kinds of results are grouped by playback time. Other data, such as metadata (tags), may also be given a playback time and combined into the composite data.

In the recognition result combining unit 24, the moving image recognition result with time section and the voice recognition result with time section are combined by matching them against predetermined combination patterns. A combination pattern consists of an arbitrary element (for example, an object) recognized by a moving image recognition method available to the video editing apparatus 1 and an arbitrary element (for example, a keyword) recognized by a voice recognition method available to the video editing apparatus 1. The video creator, as the user of the video editing apparatus 1, can freely select such elements and designate their combination as a pattern.
For moving image recognition, when for example the face of a specific person is to be used, the video creator registers in advance a trained model that recognizes that person's face so that it is available to the moving image recognition unit, which then uses the model to recognize the element.
For voice recognition, when for example a specific keyword is to be used, the video creator registers the keyword in advance so that it is available to the voice recognition unit, which then uses the keyword to recognize the element.
The recognition result combining unit 24 includes an input unit 35 for combination patterns. The input unit 35 is the input means through which the video creator designates combination patterns.
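
The matching against combination patterns can be sketched as follows. The text does not specify how results are aligned in time, so this sketch assumes that a pattern matches wherever the time sections of its two elements overlap and that the composite result covers the overlap; all class names, field names, and labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recognition:
    start: float  # section start, seconds of playback time
    end: float    # section end, seconds of playback time
    label: str    # recognized element, e.g. a gesture or a keyword


@dataclass(frozen=True)
class CombinationPattern:
    name: str
    image_label: str  # element from moving image recognition
    audio_label: str  # element from voice recognition


@dataclass(frozen=True)
class Composite:
    start: float
    end: float
    pattern: str


def combine(image_results, audio_results, patterns):
    """Return composite results for every registered pattern whose two
    elements were recognized in overlapping time sections (assumption)."""
    composites = []
    for pattern in patterns:
        for img in image_results:
            if img.label != pattern.image_label:
                continue
            for aud in audio_results:
                if aud.label != pattern.audio_label:
                    continue
                start = max(img.start, aud.start)
                end = min(img.end, aud.end)
                if start < end:  # the two sections overlap
                    composites.append(Composite(start, end, pattern.name))
    return sorted(composites, key=lambda c: (c.start, c.end))
```

With a gesture recognized from 1.0 s to 4.0 s and a keyword recognized from 3.0 s to 5.0 s, a pattern pairing the two yields one composite result covering 3.0 s to 4.0 s.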

The editing information allocation unit 25 assigns editing information to the composite recognition result to obtain editing information with time section. Editing method combination patterns, which associate composite recognition results with editing information with time section, are registered in the storage unit 12 in advance. For example, editing method combination patterns such as those shown in Table 1 are registered in the storage unit 12.

The editing method combination patterns shown in Table 1 can be designated by the video creator. An editing method combination pattern consists of an arbitrary combination pattern designated in the recognition result combining unit 24 and an arbitrary editing method. The editing method must be available either in another video editing apparatus, which reads the editing information with time section output by the video editing apparatus 1 together with the video data and actually performs the editing, or in the video editing apparatus 1 itself. The video creator selects a combination pattern designated in the recognition result combining unit 24 and an editing method, and can designate the pair as an editing method combination pattern.
The editing information allocation unit 25 includes an input unit 36 for editing method combination patterns. The input unit 36 is the input means through which the video creator designates editing method combination patterns.
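
The allocation of editing information can be pictured as a table lookup keyed by the matched combination pattern. The contents of Table 1 are not reproduced in this text, so the entries used with this sketch are invented placeholders, and the names `EditInfo` and `allocate_edit_info` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditInfo:
    start: float       # section start, seconds of playback time
    end: float         # section end, seconds of playback time
    edit_method: str   # editing method to apply in this section


def allocate_edit_info(composites, edit_patterns):
    """Map composite recognition results to editing information.

    composites: iterable of (start, end, pattern_name) tuples.
    edit_patterns: dict from pattern_name to an editing method, standing in
    for the registered editing method combination patterns (assumption).
    Composite results with no registered pattern produce no editing.
    """
    edits = []
    for start, end, pattern_name in composites:
        method = edit_patterns.get(pattern_name)
        if method is not None:
            edits.append(EditInfo(start, end, method))
    return edits
```

Keeping the table as plain data mirrors the description: the video creator changes the result of automatic editing by registering different patterns rather than by changing the procedure.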

The editing information with time section is output as an irreversible container file. The output is not limited to a container file, however; it can also be output, for example, as an editable data group.
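
As one possible form of an editable data group, the editing information could be written out as JSON text that a person or another tool can modify before the edits are applied. The text does not name a file format, so JSON and the field names below are assumptions made for illustration.

```python
import json


def export_edit_info(edits):
    """Serialize a list of edit dictionaries to human-editable JSON text."""
    return json.dumps({"edits": edits}, ensure_ascii=False, indent=2)


def import_edit_info(text):
    """Read back a (possibly hand-edited) list of edit dictionaries."""
    return json.loads(text)["edits"]


# Hypothetical example entry: a telop shown from 3.0 s to 4.0 s.
example_edits = [
    {"start": 3.0, "end": 4.0, "edit_method": "add_telop", "text": "Hello"},
]
restored = import_edit_info(export_edit_info(example_edits))
```

Because the exported text round-trips through `import_edit_info`, an entry can be corrected by hand without rerunning recognition.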

The video editing apparatus 1 may further include an automatic video editing unit 26 that performs automatic video editing according to the editing information with time section. In this case, the automatic video editing unit 26 includes an image effect addition editing unit 51, a telop addition editing unit 52, a moving image content addition editing unit 53, an audio effect addition editing unit 54, and an audio content addition editing unit 55.

First, the playback time conversion unit 33 obtains frame indexes from the playback time attached to the editing information with time section and the number of frames per second, and the indexed frame images bearing those frame indexes are selected.
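
Selecting the frames that an edit applies to requires going from playback time back to frame indexes. A minimal sketch, assuming a constant frame rate, 0-based frame indexes, and half-open time sections that include their start but not their end; the function names are hypothetical.

```python
import math
from fractions import Fraction


def playback_time_to_frame_index(playback_time, frames_per_second):
    """Return the index of the frame being displayed at playback_time."""
    return math.floor(Fraction(playback_time) * Fraction(frames_per_second))


def frame_indexes_in_section(start, end, frames_per_second):
    """Return the indexes of all frames whose start time lies in [start, end)."""
    fps = Fraction(frames_per_second)
    first = math.ceil(Fraction(start) * fps)
    stop = math.ceil(Fraction(end) * fps)
    return range(first, stop)
```

Under these assumptions, an edit covering 1 s to 2 s of a 30 frames-per-second video selects frames 30 through 59.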

The image effect addition editing unit 51 determines an image effect based on the editing information with time section and the playback time, and applies it to the indexed frame images. An image effect is a method of digitally processing the picture and includes, for example, image processing such as black-and-white two-level conversion (binarization) and particle processing.
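
Black-and-white two-level conversion, named above as an example of an image effect, can be sketched on a single frame as follows. The frame representation (rows of 8-bit grayscale values) and the fixed threshold are assumptions chosen to keep the example self-contained.

```python
def binarize(frame, threshold=128):
    """Return a two-level copy of a grayscale frame.

    frame: list of rows, each a list of pixel values in 0..255 (assumption).
    Pixels at or above the threshold become white (255), others black (0).
    """
    return [[255 if pixel >= threshold else 0 for pixel in row] for row in frame]
```

Applying the effect to a time section then amounts to calling `binarize` on each indexed frame image selected for that section.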

The telop addition editing unit 52 determines a telop consisting of characters or symbols based on the editing information with time section and the playback time, and adds it to the indexed frame images. A telop is a layer containing characters or symbols that is superimposed on the picture.

The moving image content addition editing unit 53 determines moving image content based on the editing information with time section and the playback time, and adds it to the indexed frame images. Moving image content is a separate moving image that is added on top of the frame images.

The audio effect addition editing unit 54 determines an audio effect from the editing information with time section and applies it to the audio data. An audio effect is a method of digitally processing sound and includes high-pass processing, echo processing, and the like.
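
Echo processing, one of the audio effects named above, can be illustrated with a single delayed and attenuated copy of the signal mixed into the original. This is a generic single-tap echo rather than a method taken from the text; the sample representation (a list of floats) and the parameter names are assumptions.

```python
def add_echo(samples, delay, decay=0.5):
    """Mix one delayed, attenuated copy of the signal into itself.

    samples: list of float samples; delay: echo delay in samples;
    decay: gain applied to the delayed copy. The output is `delay`
    samples longer than the input so the echo tail is not cut off.
    """
    if delay <= 0:
        raise ValueError("delay must be a positive number of samples")
    out = list(samples) + [0.0] * delay
    for i, sample in enumerate(samples):
        out[i + delay] += decay * sample
    return out
```

To apply the effect only within a time section, the section's start and end times would first be converted to sample positions using the audio sample rate.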

The audio content addition editing unit 55 determines audio content from the editing information with time section and adds it to the audio data. Audio content is audio, such as sound effects or music, that is added to the audio data.

The video editing apparatus 1 can also calculate a cut section from the editing information with time section. In this case, the control unit 13 of the video editing apparatus 1 further has a cut section calculation unit 27.

When the cut section calculation unit 27 is provided, the editing information with time section is configured to include cut editing information.
When cut editing information is included, the cut section calculation unit 27 calculates the cut section from the editing information with time section.
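
The effects described earlier mention cutting, out of a silent section, only the part before a recognized object starts to move. A cut section for that case could be calculated as sketched here; the function name and the tuple representation of a time section are hypothetical.

```python
def cut_section_before_motion(silent_section, motion_start):
    """Return the part of a silent section that precedes motion_start.

    silent_section: (start, end) in seconds of playback time.
    motion_start: time at which the recognized object starts to move.
    Returns None when the object is already moving before the silence begins.
    """
    start, end = silent_section
    cut_end = min(end, motion_start)
    if start < cut_end:
        return (start, cut_end)
    return None
```

For a silent section from 2.0 s to 6.0 s and an object that starts moving at 4.5 s, only 2.0 s to 4.5 s is cut, so the silent footage of the moving object is kept.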

The cut section calculation unit 27 has a moving image cut editing unit 61 and an audio cut editing unit 62.

The moving image cut editing unit 61 deletes the indexed frame images included in the cut section. The audio cut editing unit 62 deletes the audio data of the cut section from the audio data.
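
Deleting a cut section from both streams can be sketched as two slicing operations driven by the same start and end times. The sketch assumes a constant frame rate, a constant audio sample rate, in-memory lists for frames and samples, and a half-open cut section; the function names are hypothetical.

```python
import math


def cut_frames(frames, frames_per_second, cut_start, cut_end):
    """Remove the frames whose start time lies in [cut_start, cut_end)."""
    first = math.ceil(cut_start * frames_per_second)
    stop = math.ceil(cut_end * frames_per_second)
    return frames[:first] + frames[stop:]


def cut_audio(samples, sample_rate, cut_start, cut_end):
    """Remove the audio samples between cut_start and cut_end (seconds)."""
    first = round(cut_start * sample_rate)
    stop = round(cut_end * sample_rate)
    return samples[:first] + samples[stop:]
```

Using the same time section for both functions keeps picture and sound aligned after the cut, up to the rounding of sample and frame boundaries.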

The video combining unit 63 generates edited video data from the edited indexed frame images and the edited audio data.

Next, a video editing method according to an embodiment of the present invention will be described with reference to FIGS. 2 to 5.
First, within the video editing method, the method for acquiring recognition results will be described with reference to FIG. 2.

First, video data is read using the video input unit 21 (step S10). The video data read in step S10 is sent to the control unit 13.

Next, the first process, which calculates the moving image recognition result with time section from the moving image data included in the video data, will be described.

In the first process, frame images with frame indexes are first acquired from the video data using the frame image extraction unit 31 (step S20).

Next, the number of frames per second is acquired from the video data using the frames-per-second extraction unit 32 (step S30).

Next, using the playback time conversion unit 33, the playback time is acquired from the frame images with frame indexes acquired by the frame image extraction unit 31 and the number of frames per second acquired by the frames-per-second extraction unit 32 (step S40).

Next, from the indexed frame images and playback times acquired by the frame image extraction unit 31, the frames-per-second extraction unit 32, and the playback time conversion unit 33, the moving image recognition result with time section is acquired for the section starting at the playback time (step S50).

Next, the second process, which calculates the voice recognition result with time section from the audio data included in the video data, will be described. The first process and the second process are executed in parallel.

In the second process, audio data is first acquired from the video data using the voice recognition unit 23 (step S60).

Next, using the voice recognition unit 23, the voice recognition result with time section is acquired, for the section starting at the playback time, from the audio data and time data included in the video data (step S70).
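
Running the first and second processes in parallel could look like the following sketch. The text states only that the two processes run in parallel; the use of a thread pool and the two recognizer callables passed in as arguments are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def run_recognition_in_parallel(video_data, recognize_images, recognize_audio):
    """Run moving image recognition and voice recognition concurrently.

    recognize_images and recognize_audio are caller-supplied functions
    (hypothetical) that each take the video data and return a list of
    recognition results with time section.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        image_future = pool.submit(recognize_images, video_data)
        audio_future = pool.submit(recognize_audio, video_data)
        # result() blocks until each process has finished.
        return image_future.result(), audio_future.result()
```

Threads suit recognizers that wait on an external tool or web API, as the description allows; for CPU-bound recognizers implemented in pure Python, a process pool may be a better fit.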

Next, the third step, in which a composite recognition result is calculated by combining the moving image recognition result with a time interval and the voice recognition result with a time interval at each playback time, will be described with reference to FIG. 3.
The recognition result combining unit 24 calculates the composite recognition result from the moving image recognition result with a time interval whose time interval includes the playback time and from the voice recognition result with a time interval (step S110).

Next, the fourth step, in which the editing information to be applied is determined based on the composite recognition result, will be described.
The editing information allocating unit 25 obtains editing information with a time interval from the composite recognition result (step S120). The relationship between composite recognition results and editing information is recorded in the storage unit 12 as the table shown in Table 1. When a composite recognition result is input, the editing information allocating unit 25 generates editing information with a time interval based on this table.
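The table lookup in step S120 can be sketched as follows. The contents of Table 1 are not reproduced in this section, so the entries in `EDIT_TABLE`, the label strings, and the function name `allocate_edits` are all hypothetical placeholders chosen for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for Table 1: composite recognition result -> editing information.
EDIT_TABLE: Dict[Tuple[str, str], str] = {
    ("gesture:hand_sign", "keyword:telop"): "add_telop",
    ("gesture:hand_sign", "silence"): "cut",
}

# (start time, end time, (video label, audio label)); the layout is an assumption.
Composite = Tuple[float, float, Tuple[str, str]]


def allocate_edits(
    composites: List[Composite], table: Dict[Tuple[str, str], str]
) -> List[Tuple[float, float, str]]:
    """Look up each composite result in the table; unmatched results yield no edit."""
    return [(start, end, table[key]) for start, end, key in composites if key in table]
```

Keeping the mapping as plain data rather than code reflects the description's point that the table is stored and can be changed without altering the allocation logic.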

Next, the fifth step, in which the video is automatically edited according to the editing information with a time interval, will be described with reference to FIG. 4.
In automatic video editing, the moving image data and the audio data are edited in parallel.
In editing the moving image data, the playback time conversion unit 33 obtains a frame index from the playback time given to the editing information with a time interval and the number of frames per second, and the indexed frame image bearing that frame index is selected (step S150).

Next, the image effect addition editing unit 51 determines an image effect from the editing information with a time interval and the playback time, and adds it to the indexed frame image (step S160).
Next, the telop addition editing unit 52 generates a telop image composed of characters or symbols based on the editing information with a time interval and the playback time, and adds it to the indexed frame image (step S170).
Next, the moving image content addition editing unit 53 determines moving image content based on the editing information with a time interval and the playback time, and adds it to the indexed frame image (step S180).

In editing the audio data, the audio effect addition editing unit 54 first determines an audio effect from the editing information with a time interval and adds it to the audio data (step S220).
Next, the audio content addition editing unit 55 determines audio content from the editing information with a time interval and adds it to the audio data (step S230).

With this configuration, image effects, telop images, and moving image content can be added to the moving image, and audio effects and audio content can be added to the audio data.

Furthermore, the sixth step, in which cut editing is performed, will be described with reference to FIG. 5.
First, it is determined whether the editing information with a time interval includes cut editing information (step S250). If it does not, no cut editing is needed, and cut editing ends.
If it is determined in step S250 that cut editing information is included, the cut section calculation unit calculates the cut section from the editing information with a time interval (step S260).

In cut editing, cut editing of the moving image and cut editing of the audio data are performed in parallel.

First, for each playback time included in the cut section, the playback time conversion unit 33 obtains a frame index from the playback time given to the editing information with a time interval and the number of frames per second, and the indexed frame image bearing that frame index is selected (step S270).

Next, the moving image cut editing unit 61 deletes the selected indexed frame images (step S280). The audio cut editing unit 62 deletes the audio data of the cut section from the audio data (step S290).
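Steps S270 to S290 can be sketched in Python as below. This is an illustration under stated assumptions, not the patented implementation: frames and audio samples are held in plain lists, the frame rate and sample rate are constant, the cut section is half-open (start inclusive, end exclusive), and the names `frames_in_section` and `cut` are hypothetical.

```python
import math
from typing import List, Tuple


def frames_in_section(start: float, end: float, fps: float, n_frames: int) -> range:
    """Zero-based indices of frames whose start time lies in [start, end)."""
    first = max(0, math.ceil(start * fps))
    last = min(n_frames, math.ceil(end * fps))
    return range(first, max(first, last))


def cut(
    frames: List, samples: List, start: float, end: float, fps: float, sample_rate: int
) -> Tuple[List, List]:
    """Delete the cut section from both the frame list and the audio samples."""
    drop = frames_in_section(start, end, fps, len(frames))
    kept_frames = frames[: drop.start] + frames[drop.stop :]
    a0 = max(0, round(start * sample_rate))
    a1 = min(len(samples), round(end * sample_rate))
    kept_samples = samples[:a0] + samples[max(a0, a1) :]
    return kept_frames, kept_samples
```

Deleting the same time span from both streams keeps picture and sound aligned after the cut, which is why the two deletions are driven by one shared cut section.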

With this configuration, cut editing that removes unnecessary portions from the video data is performed.

After the first to sixth steps have been performed, the video combining unit 63 generates edited video data (step S310). The edited video data is then output (step S320).

Next, a specific example of the video editing method will be described.
In the first step, one specific example of the moving image recognition result with a time interval calculated from the moving image data included in the video data is a specific gesture (hand sign) recognized by object recognition. A trained model that recognizes the specific gesture is registered in advance. Object recognition is executed with frame image data extracted from the video data as input, and the region of the frame image data in which the gesture appears is detected and taken as the recognition result. These recognition results are associated with time intervals.
In the second step, specific examples of the voice recognition result with a time interval calculated from the audio data included in the video data include text information based on what a video performer says, information on whether a specific keyword was uttered, and information on silent sections. Keywords that may appear in the utterance text output by speech recognition are registered in advance. Voice recognition is executed with audio data extracted from the video data as input; for example, the utterance of a specific keyword is detected and taken as the recognition result. These recognition results are associated with time intervals.

The recognition result combining unit 24 obtains a composite recognition result from the moving image recognition result with a time interval and the voice recognition result with a time interval, each of which is associated with a time interval. Composite recognition results include combinations of the recognition information described above. For example, the utterance time interval obtained by speech recognition and the time interval obtained by object recognition are combined in time; if they share a common section, a time interval is calculated in which all of the relevant speech recognition and object recognition results are present, or in which at least one of them is present, and this is taken as a composite recognition result associated with that time interval.
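The temporal combination described above can be sketched as follows. The sketch assumes each recognition result is a single labeled interval; the `Labeled` type, the `mode` parameter, and the `"+"`-joined label are hypothetical conventions introduced for illustration. The "all" mode returns the section where both results hold, and the "any" mode returns the section where at least one holds.

```python
from typing import NamedTuple, Optional


class Labeled(NamedTuple):
    start: float
    end: float
    label: str


def combine(video: Labeled, audio: Labeled, mode: str = "all") -> Optional[Labeled]:
    """Combine two recognition results in time; return None if they do not overlap."""
    lo, hi = max(video.start, audio.start), min(video.end, audio.end)
    if lo >= hi:
        return None  # no common section, so no composite result
    if mode == "all":  # section in which both recognitions are present
        start, end = lo, hi
    elif mode == "any":  # section in which at least one recognition is present
        start, end = min(video.start, audio.start), max(video.end, audio.end)
    else:
        raise ValueError("mode must be 'all' or 'any'")
    return Labeled(start, end, f"{video.label}+{audio.label}")
```

For instance, a gesture seen from 1 s to 4 s and a keyword heard from 3 s to 6 s would give a composite interval of 3 s to 4 s in "all" mode and 1 s to 6 s in "any" mode.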

The editing information allocating unit 25 obtains editing information with a time interval from the composite recognition result obtained by the recognition result combining unit 24, based on the editing method combination patterns shown in Table 1. For example, the earliest or the latest time of the time interval is calculated as an edit point. Instead of registering an edit point, editing content such as adding a screen effect or a sound effect can also be associated as editing information with a time interval.

An editing method combination pattern can be specified for the editing information allocating unit 25 using the input unit 36. An editing method combination pattern associates a combination pattern specified in the recognition result combining unit 24 with any available editing method, and the editing method is registered in advance by the video creator. An editing method can be expressed, for example, by pre-registered gestures and keywords, a combination pattern that combines them in time, and the editing information with a time interval corresponding to that combination pattern. These editing methods may be selectable from preset procedures, and multiple preset procedures may be grouped so that they can be selected together.

As another specific example of the video editing method, a method of transcribing utterances together with speaker identification will be described.
The video creator registers in advance a trained model that recognizes the face of each performer to be recognized by object recognition. A trained model that recognizes the facial components to be recognized by object recognition is also registered in advance. A facial component is, for example, the mouth.

The voice recognition unit 23 outputs a voice recognition result with a time interval from the audio data and time data included in the video data. The moving image recognition unit 22 executes object recognition with frame image data extracted from the video data as input, detects the frame image data in which each performer's face appears, and obtains the face region at each time.

Object recognition is likewise executed with frame image data extracted from the video data as input, the frame image data in which a mouth, which is a facial component, appears is detected, and the mouth region at each time is obtained.

Motion recognition is also performed with frame image data extracted from the video data as input, and a motion vector is obtained for each time and each region.

The results of object recognition and motion recognition are combined in time and in space. That is, when a performer's face region and a mouth region overlap, the mouth region is associated with that performer; if the motion vectors in the mouth region indicate movement, the performer is judged to be speaking, and an utterance time interval based on moving image recognition is obtained.
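The per-frame speaking judgment described above can be sketched as follows. The sketch assumes axis-aligned bounding boxes, a single scalar motion magnitude per mouth region, and a fixed threshold; the specification does not state how movement is decided, so the threshold and the names `overlaps` and `speaking_performers` are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), an assumed layout


def overlaps(a: Box, b: Box) -> bool:
    """True when two axis-aligned boxes share a region of non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def speaking_performers(
    faces: Dict[str, Box],
    mouths: List[Tuple[Box, float]],
    threshold: float = 1.0,
) -> List[str]:
    """Return the performers whose face overlaps a mouth region that is moving.

    faces maps a performer name to a face box; mouths pairs each mouth box
    with the mean motion-vector magnitude measured inside it.
    """
    result = []
    for name, face in faces.items():
        if any(overlaps(face, box) and motion > threshold for box, motion in mouths):
            result.append(name)
    return result
```

Running this judgment on every frame and merging consecutive frames in which the same performer is speaking would yield the utterance time intervals referred to in the text.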

The utterance time interval from speech recognition and the utterance time interval from moving image recognition are combined in time; if they share a common section, the speaker identified by moving image recognition is combined with the utterance text from speech recognition and calculated as editing information with a time interval.

This editing method can be expressed by the utterance text, the motion vectors, the pre-registered faces of the performers and facial components, a combination pattern that combines them in time and in space, and the corresponding editing information with a time interval; the editing method is registered in advance by the video creator.

As another specific example of the video editing method, a method of performing cut editing will be described.
The video creator registers in advance a trained model that recognizes the faces of the performers to be recognized by object recognition.

The voice recognition unit 23 calculates a voice recognition result with a time interval from the audio data and time data included in the video data. The moving image recognition unit 22 executes object recognition with frame image data extracted from the video data as input, detects the frame image data in which a performer's face appears, and obtains a time interval based on those frames.

The voice recognition result with a time interval obtained by speech recognition and the time interval obtained by object recognition are combined in time, and any time interval covered by neither is calculated, as editing information with a time interval, as a section to be deleted by cut editing.
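The cut sections in this example are the parts of the timeline covered by neither a speech interval nor a face interval. A minimal sketch, assuming intervals are `(start, end)` pairs in seconds that lie within the video duration, and using the hypothetical name `cut_sections`:

```python
from typing import List, Tuple

Interval = Tuple[float, float]


def cut_sections(
    duration: float, speech: List[Interval], faces: List[Interval]
) -> List[Interval]:
    """Return the gaps in [0, duration] not covered by any speech or face interval."""
    cuts: List[Interval] = []
    cursor = 0.0  # end of the region known to be covered so far
    for start, end in sorted(speech + faces):
        if start > cursor:
            cuts.append((cursor, start))  # uncovered gap before this interval
        cursor = max(cursor, end)
    if cursor < duration:
        cuts.append((cursor, duration))  # uncovered tail of the video
    return cuts
```

Sorting the merged list once and sweeping a cursor over it avoids having to compute the union of the two interval sets explicitly.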

This editing method can be expressed by the utterance text and the pre-registered faces of the performers, a combination pattern that combines them in time, and the corresponding editing information with a time interval; the editing method is registered in advance by the video creator.

The moving image data and audio data that have been edited by the means described above using the editing information with a time interval are combined to generate edited moving image data. The edited moving image data is output as an irreversible container file. Note that the editing information with a time interval is not limited to output as a container file; it can also be output, for example, as an editable data group. When it is generated as an editable data group, an editor can make further manual edits using general video editing software.

As described above, the video editing apparatus 1 includes: the moving image recognition unit 22, serving as a moving image recognition result storage unit that stores a moving image recognition result with a time interval calculated from the moving image data included in the video data; the voice recognition unit 23, serving as a voice recognition result storage unit that stores a voice recognition result with a time interval calculated from the audio data included in the video data; the recognition result combining unit 24, which calculates a composite recognition result by combining the moving image recognition result with a time interval and the voice recognition result with a time interval at each playback time; and the editing information allocating unit 25, which determines the editing information with a time interval to be applied based on the composite recognition result.
With this configuration, compound events that can be recognized from the video data can be linked to specific editing content in video editing. For example, a telop generated by voice recognition from what a video performer says can be added at a position corresponding to the performer's gesture recognized by moving image recognition. Likewise, moving image content corresponding to a keyword recognized by voice recognition can be added around the position of the performer's face detected by moving image recognition.
In addition, audio content corresponding to a performer's gesture recognized by moving image recognition can be added at a timing that does not overlap the utterance sections detected by voice recognition. Furthermore, within a silent section detected by voice recognition, only the part before the time at which a specific object recognized by moving image recognition starts to move can be cut.

With this configuration, the video creator can, while shooting, use compound events linked to specific editing content to specify the intended editing for any playback time of the video. This reduces the burden of video editing while still allowing edits that reflect the video creator's intentions.

The embodiment described above is merely a representative form and can be modified without departing from the gist of the embodiment. It can of course be implemented in various further forms; the scope of the present invention is indicated by the claims and includes the meaning of equivalents of the claims and all modifications within their scope.

DESCRIPTION OF SYMBOLS
1 Video editing apparatus
2 Video shooting apparatus
11 Communication unit
12 Storage unit
13 Control unit
21 Video input unit
22 Moving image recognition unit
23 Voice recognition unit
24 Recognition result combining unit
25 Editing information allocating unit
26 Automatic video editing unit
27 Cut section calculation unit
31 Frame image extraction unit
32 Frames-per-second extraction unit
33 Playback time conversion unit

Claims

1. A video editing apparatus comprising:
a moving image recognition result storage unit that stores a moving image recognition result with a time interval calculated from moving image data included in video data;
a voice recognition result storage unit that stores a voice recognition result with a time interval calculated from audio data included in the video data;
a recognition result combining unit that calculates a composite recognition result by combining the moving image recognition result with a time interval and the voice recognition result with a time interval at each playback time; and
an editing information allocating unit that determines editing information with a time interval to be applied based on the composite recognition result.

2. The video editing apparatus according to claim 1, wherein the recognition result combining unit combines the moving image recognition result with a time interval and the voice recognition result with a time interval by collating them against a predetermined combination pattern, and the predetermined combination pattern can be registered.

3. The video editing apparatus according to claim 1, wherein the editing information allocating unit determines the editing information with a time interval to be applied by collating the composite recognition result against a predetermined combination pattern of composite recognition results and editing information with a time interval, and the predetermined combination pattern can be registered.

4. The video editing apparatus according to claim 1, wherein the editing information with a time interval determined by the editing information allocating unit is output as an editable data group.

5. A video editing method comprising:
a first step of storing a moving image recognition result with a time interval calculated from moving image data included in video data;
a second step of storing a voice recognition result with a time interval calculated from audio data included in the video data;
a third step of calculating a composite recognition result by combining the moving image recognition result with a time interval and the voice recognition result with a time interval at each playback time; and
a fourth step of determining editing information with a time interval to be applied based on the composite recognition result.

6. The video editing method according to claim 5, wherein, in the second step, the moving image recognition result with a time interval and the voice recognition result with a time interval are combined by collating them against a predetermined combination pattern, and the predetermined combination pattern can be registered.

7. The video editing method according to claim 5 or 6, wherein, in the third step, the editing information with a time interval to be applied is determined by collating the composite recognition result against a predetermined combination pattern of composite recognition results and editing information with a time interval, and the predetermined combination pattern can be registered.

8. The video editing method according to claim 5, wherein the editing information with a time interval determined by the editing information allocating unit is output as an editable data group.

9. A video editing program that causes an information processing apparatus to function as a video editing apparatus, the program causing the information processing apparatus to execute:
a moving image recognition step of storing a moving image recognition result with a time interval calculated from moving image data included in video data;
a voice recognition step of storing a voice recognition result with a time interval calculated from audio data included in the video data;
a recognition result combining step of calculating a composite recognition result by combining the moving image recognition result with a time interval and the voice recognition result with a time interval at each playback time; and
an editing information allocating step of determining editing information with a time interval to be applied based on the composite recognition result.

10. The video editing program according to claim 9, wherein, in the recognition result combining step, the moving image recognition result with a time interval and the voice recognition result with a time interval are combined by collating them against a predetermined combination pattern, and the predetermined combination pattern can be registered.

11. The video editing program according to claim 9 or 10, wherein, in the editing information allocating step, the editing information with a time interval to be applied is determined by collating the composite recognition result against a predetermined combination pattern of composite recognition results and editing information with a time interval, and the predetermined combination pattern can be registered.

12. The video editing program according to any one of claims 9 to 11, wherein the editing information with a time interval determined in the editing information allocating step is output as an editable data group.