JP7424801B2

JP7424801B2 - Video editing output control device using text data, video editing output method using text data, and program

Info

Publication number: JP7424801B2
Application number: JP2019204328A
Authority: JP
Inventors: 英史安田; 六郎永田; 吾郎高橋
Original assignee: Tokyo Broadcasting System Television Inc
Current assignee: Tokyo Broadcasting System Television Inc
Priority date: 2019-11-12
Filing date: 2019-11-12
Publication date: 2024-01-30
Anticipated expiration: 2039-11-12
Also published as: JP2021077432A

Description

The present invention relates to editing control technology using text data.

Since television broadcasting began, a variety of video editing output control devices, used as video editing machines, have been developed and sold.

In recent years, speech recognition technology has matured with the rise of AI, and the accuracy of converting audio data and video data to text by speech recognition has increased. Speech recognition is now standard on smartphones and similar devices, and speech input has become established as an input means on a par with keypad text input.

Alongside speech recognition technology, many speech synthesis systems that generate speech from text have been developed, and the relationship between video, audio, and text is becoming closer. Accordingly, editing systems have been built in which text information obtained using speech recognition or speech synthesis technology is attached to audio information as metadata.

Patent Document 1: Japanese Unexamined Patent Publication No. 2019-061428
Patent Document 2: Domestic Re-publication of PCT International Publication No. 2017/072915

However, even though speech recognition and speech synthesis technologies have improved dramatically, no editing device has been provided that actively uses their output results or the text data from which speech is generated.

The invention of Patent Document 1 is a video editing system that extracts text information generated from video as metadata, but the text is not linked to time, and the metadata and playlists are merely information used for editing.

The invention of Patent Document 2 converts speech to text using a speech recognition system and generates metadata, but its main aim is translation and synchronization with a second language. It only manages video based on that metadata and cannot specify edit points.

Some aspects of the present invention have been made in view of these circumstances, and an object thereof is to edit video using speech recognition text data output by a speech recognition system, or text data used for speech synthesis.

To solve the above problem, the invention according to claim 1 is a video editing output device using text data, comprising: video data input means for receiving video data; audio data input means for receiving audio data; external text data input means for receiving text data; reference time generation means for generating time reference data based on an external clock or an internal clock; first time assigning means for assigning, when the video data is received, the time reference data to each still image data item constituting the video data; second time assigning means for assigning, when the audio data is received, the time reference data to each voice section detection data item constituting the audio data; third time assigning means for assigning the time reference data to the text data when the text data is input through the external text data input means; and data output means capable of outputting, based on the time reference data, the still image data constituting the video data, the voice section detection data constituting the audio data, and a part of the text data.

According to the present invention, a reference time is assigned when video and audio are captured into the device, and a reference time is likewise assigned when the editing system captures, in advance, the external text data used for speech synthesis or the like. This allows the external text data to be treated as the reference axis for editing, making it possible to build an editing system that is visually easy to understand.
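As a rough illustration of using the reference time as a common editing axis, the following Python sketch stamps three tracks (still images, voice section detection data, and text) with the same clock and reads out whichever item of each track is current at a given time. The `Stamped` class, the `at_time` and `output_at` helpers, the use of seconds as the unit, and the sample data are all assumptions made for illustration; the patent does not specify a data format.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Stamped:
    t: float       # reference time in seconds (assumed unit)
    payload: str   # stand-in for a still image, an audio segment, or text


def at_time(track, t):
    """Return the latest item in a time-sorted track with timestamp <= t, or None."""
    i = bisect_right([item.t for item in track], t)
    return track[i - 1] if i else None


# Hypothetical sample data: all three tracks share one reference clock.
frames = [Stamped(0.0, "frame0"), Stamped(0.5, "frame1"), Stamped(1.0, "frame2")]
segments = [Stamped(0.0, "seg0"), Stamped(0.8, "seg1")]
text = [Stamped(0.0, "Hello"), Stamped(0.8, "world")]


def output_at(t):
    """Read out the frame, audio segment, and text item current at time t."""
    return at_time(frames, t), at_time(segments, t), at_time(text, t)
```

Because every track is indexed by the same time value, selecting a position in the text is enough to find the matching still image and audio segment.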

The invention according to claim 2 is a video editing output device using text data, comprising: video data input means for receiving video data; speech recognition means for performing speech recognition processing and generating speech recognition text data from audio; reference time generation means for generating time reference data based on an external clock or an internal clock; first time assigning means for assigning, when the video data is received, the time reference data to each still image data item constituting the video data; second time assigning means for assigning the time reference data within the speech recognition text data generated by the speech recognition processing; and data output means capable of outputting, based on the time reference data, the still image data constituting the video data and a part of the speech recognition text data.

According to the present invention, a reference time is assigned when video and audio are captured into the device, and a reference time is likewise assigned to the speech recognition text data obtained by speech recognition processing. This allows the speech recognition text data to be treated as the reference axis for editing, making it possible to build an editing system that is visually easy to understand.

FIG. 1 is a diagram showing an example of a schematic configuration (system configuration) of an information processing system 100 according to Embodiment 1 of the present invention.
FIG. 2 is a schematic configuration diagram (block diagram) showing an example of a video editing output control server according to Embodiment 1 of the present invention.
FIG. 3 is a flowchart showing the process of editing using the speech synthesis device according to Embodiment 1 of the present invention.
FIG. 4 is a diagram showing an example of a schematic configuration (system configuration) of an information processing system 200 according to Embodiment 2 of the present invention.
FIG. 5 is a flowchart showing the process of editing using the speech recognition device according to Embodiment 2 of the present invention.
FIG. 6 is a diagram showing an example of a screen displayed on the information processing device according to Embodiment 1 and Embodiment 2 of the present invention.
FIG. 7 is a diagram showing an example of a screen displayed on the information processing device according to Embodiment 1 and Embodiment 2 of the present invention.

Embodiments of the present invention will be described below with reference to the accompanying drawings. The following embodiments are examples for explaining the present invention and are not intended to limit the present invention to those embodiments alone. The present invention can be modified in various ways without departing from its gist. Furthermore, in the drawings, the same components are given the same reference numerals wherever possible, and duplicate explanations are omitted.

<Embodiment 1>
FIG. 1 is a schematic configuration diagram (system configuration diagram) showing an information processing system 100 according to Embodiment 1 of the present invention. As shown in FIG. 1, the information processing system 100 includes, by way of example, a video output device 1, a video editing output control server 2, a video receiving device 3, an information processing device 4, and a speech synthesis device 5, which are connected to a predetermined network.

The video output device 1 is connected to the video editing output control server 2 and outputs the video to be edited. Examples include commercially available VTR, DVD, HD-CAM, and XD-CAM devices and general-purpose video editing servers.

The video editing output control server 2 performs video editing based on video data, audio data, and external text data received from the speech synthesis device 5. It controls the editing and output of video while partially deleting or superimposing video data and audio data and adding other video data and audio data. A more specific configuration and the operation of the video editing output control server 2 are described later.

The video receiving device 3 is connected to the video editing output control server 2 and receives the edited video. Examples include commercially available VTR, DVD, HD-CAM, and XD-CAM devices and general-purpose video editing servers. The video output device 1 may be used in its place.

The information processing device 4 is connected to the video editing output control server 2 and can perform editing operations using a general-purpose browser or a dedicated application. The information processing device 4 is a general-purpose computer device, for example a mobile phone such as a smartphone connected to a predetermined network, a tablet terminal, a laptop/notebook computer, or a desktop computer.

The speech synthesis device 5 generates audio data from text data. It converts text data prepared in advance into audio data on the device's reference time. It may have a function for setting the speed at which text is converted to audio, and a function for measuring the offset from the reference time at conversion and embedding that value in the text data.

The predetermined network is a communication line or communication network for information processing, including, for example, the Internet. It is not particularly limited as long as various information and data can be transmitted and received between the speech synthesis device 5 and the video editing output control server 2, and between the video editing output control server 2 and the information processing device 4. The predetermined network is realized by, for example, a broadband network such as the Internet, a core network such as a mobile phone network, a LAN (Local Area Network), or a narrowband network combining these.

In Embodiment 1, the information processing system 100 includes one each of the speech synthesis device 5, the video editing output control server 2, the video output device 1, the video receiving device 3, and the information processing device 4, but the number of each need not be one. For example, the speech synthesis device 5 may be omitted if the video editing output control server 2 has a speech synthesis function. The video output device 1 and the video receiving device 3 may be the same device, and multiple information processing devices 4 may be installed and communicate with the video editing output control server 2 simultaneously. Furthermore, the functions of the video editing output control server 2 and the information processing device 4 may be implemented on the same server, or may be hosted on a server that has other functions.

FIG. 2 is a schematic configuration diagram (block diagram) showing an example of the video editing output control server 2 according to Embodiment 1 of the present invention. As shown in FIG. 2, the video editing output control server 2 includes, by way of example, a transmitting/receiving unit 21 that transmits and receives various data and information, an information processing unit 22 that executes various processes for controlling the input and output of data, and a storage unit 23 that records various information and data. The information processing unit 22 can be realized, for example, by a CPU or the like (not shown) executing a program stored in the storage unit 23.

The transmitting/receiving unit 21 functionally includes a video data receiving unit 211, an audio data receiving unit 212, an external text data receiving unit 213, and an edited video data transmitting unit 214. It also includes a transmitting unit (not shown) that transmits various data and information and a receiving unit (not shown) that receives various data and information.

The video data receiving unit 211 receives video data from the video output device 1. The video data may be in a stream format such as a transport stream (TS), or in a file format such as AVI, QuickTime, WFM, or FLV. It may also be in SDI (Serial Digital Interface) format, which carries uncompressed video.

The audio data receiving unit 212 receives audio data from the speech synthesis device 5. The audio data may be in any of various streaming formats, or in a file format such as mp3, wma, AAC, or Vorbis. It may also be in the AES/EBU transmission format.

The external text data receiving unit 213 receives text data from the speech synthesis device 5. The text data is the text the speech synthesis device 5 needs in order to generate audio data, and it is synchronized so that its reference time matches that of the audio data received by the audio data receiving unit 212. Alternatively, the video editing output control server 2 may first receive the text data from another system and then pass it to the speech synthesis device 5.

The edited video data transmitting unit 214 transmits the edited video data to an external system. The edited video data may be in a stream format such as a transport stream (TS), or in a file format such as AVI, QuickTime, WFM, or FLV. It may also be in SDI (Serial Digital Interface) format, which carries uncompressed video.

The information processing unit 22 functionally includes a time reference data generation unit 221, a first time assigning unit 222, a second time assigning unit 223, a third time assigning unit 224, and a data output unit 225.

The time reference data generation unit 221 generates a reference time based on time information received from outside or time information generated internally. All data in the video editing output control server 2 is managed on the basis of this reference time.
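A minimal Python sketch of a reference time generator that accepts either an external time source or falls back to an internal one is given here. The class name `ReferenceClock`, the callable-based interface, and the choice of `time.monotonic` as the internal clock are assumptions for illustration only, not the patent's design.

```python
import time
from typing import Callable, Optional


class ReferenceClock:
    """Produces reference times measured from the moment the clock is created."""

    def __init__(self, external: Optional[Callable[[], float]] = None):
        # Use the supplied external source if given, otherwise an internal clock.
        self._source = external if external is not None else time.monotonic
        self._origin = self._source()

    def now(self) -> float:
        """Seconds elapsed on the chosen source since the clock was created."""
        return self._source() - self._origin
```

Passing a single `ReferenceClock` instance to every component that stamps data is one simple way to keep video, audio, and text on the same time axis.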

The first time assigning unit 222 assigns the reference time generated by the time reference data generation unit 221 to the video reference data of the video data received by the video data receiving unit 211. In the case of MPEG video, the video reference data corresponds to I pictures, which serve as frame data. The reference time may be assigned directly to an I picture, or a time relative to an I picture may be assigned.
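The two options for stamping video can be sketched in Python as follows: I pictures receive an absolute reference time, and every other frame receives an offset relative to the most recent I picture. The function name `stamp_frames`, the frame-type strings, the constant frame rate, and the tuple layout are assumptions for illustration; reading "relative time" as an offset for non-I frames is one possible interpretation of the paragraph above.

```python
def stamp_frames(frame_types, start_time, fps):
    """Return (frame_type, anchor_time, offset) for each frame.

    anchor_time is the reference time of the most recent I picture
    (None if no I picture has been seen yet); offset is the frame's
    time relative to that anchor (0.0 for the I picture itself).
    """
    stamped = []
    anchor = None
    for n, ftype in enumerate(frame_types):
        t = start_time + n / fps  # assumes a constant frame rate
        if ftype == "I":
            anchor = t
            stamped.append((ftype, t, 0.0))
        else:
            offset = None if anchor is None else t - anchor
            stamped.append((ftype, anchor, offset))
    return stamped
```

For example, `stamp_frames(["I", "P", "B"], 10.0, 2)` anchors all three frames to the I picture at 10.0 and gives the P and B frames offsets of 0.5 and 1.0 seconds.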

The second time assigning unit 223 assigns the reference time generated by the time reference data generation unit 221 to each voice section detection data item of the audio data received by the audio data receiving unit 212. Voice section detection data refers to audio data divided by VAD (Voice Activity Detection) technology. VAD is a technology that distinguishes, in a signal containing speech and noise, the sections where speech is present from the other sections. The audio may be divided by detecting silent sections, or at fixed time intervals.
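Both ways of dividing audio mentioned above can be sketched in Python. The amplitude-threshold detector here is a deliberately simple stand-in for a real VAD algorithm, and the function names, the threshold, and the `min_silence` parameter are assumptions for illustration.

```python
def split_by_silence(samples, threshold=0.1, min_silence=2):
    """Return (start, end) sample-index pairs (end exclusive) of voiced runs.

    A run is closed once `min_silence` consecutive samples fall below
    `threshold`; shorter quiet gaps stay inside the run.
    """
    segments, start, quiet = [], None, 0
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_silence:
                segments.append((start, i - quiet + 1))
                start, quiet = None, 0
    if start is not None:
        segments.append((start, len(samples) - quiet))
    return segments


def split_fixed(n_samples, chunk):
    """Return (start, end) index pairs that cut the audio every `chunk` samples."""
    return [(i, min(i + chunk, n_samples)) for i in range(0, n_samples, chunk)]


def stamp_segments(segments, rate, t0):
    """Convert sample indices to reference times, given the sample rate and start time t0."""
    return [(t0 + a / rate, t0 + b / rate) for a, b in segments]
```

Either splitter produces index pairs that `stamp_segments` can place on the reference time axis, so the rest of the system does not need to know which method was used.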

The third time assigning unit 224 assigns the reference time generated by the time reference data generation unit 221 to the text data received by the external text data receiving unit 213. The reference time may be assigned to each character of the text data, or to each word obtained by morphological analysis.
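The per-character and per-word options can be illustrated with a short Python sketch. The patent does not say how individual times are derived, so spreading them evenly between a known start and end time is purely an assumption, as is the function name `stamp_units`. Whitespace splitting stands in for morphological analysis, which for Japanese text would need a dedicated analyzer.

```python
def stamp_units(units, start, end):
    """Assign evenly spaced reference times in [start, end) to text units.

    `units` is a list of characters or words; returns (time, unit) pairs.
    """
    if not units:
        return []
    step = (end - start) / len(units)
    return [(start + i * step, u) for i, u in enumerate(units)]


# Per character: pass a list of characters.
per_char = stamp_units(list("abcd"), 0.0, 2.0)

# Per word: whitespace split used here in place of morphological analysis.
per_word = stamp_units("good evening".split(), 10.0, 11.0)
```

Per-character stamping gives finer edit points, while per-word stamping keeps the number of time stamps smaller and matches how an editor is likely to select text.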

The storage unit 23 records and stores the video data received by the video data receiving unit 211, the audio data received by the audio data receiving unit 212, and the text data received by the external text data receiving unit 213.

<Example 1>
With reference to FIG. 3, a video editing output control device that performs editing with external text data input to the speech synthesis device 5 is described as Example 1. FIG. 3 is a flowchart showing the process of editing using the speech synthesis device 5 according to Embodiment 1 of the present invention.

(Step S1)
Video data to be edited is input from the video output device 1 to the video data receiving unit 211 in the video editing output control server 2.

(Step S2)
The first time assigning unit 222 assigns the reference time generated by the time reference data generation unit 221 in the video editing output control server 2 to the video data, which is then stored in the storage unit 23 as video data VD.

(Step S3)
External text data, from which speech is to be generated, is input to the speech synthesis device 5.

(Step S4)
The speech synthesis device 5 generates audio data based on the input external text data.

(Step S5)
The speech synthesis device 5 transfers the generated audio data to the video editing output control server 2. The transferred audio data is input to the audio data receiving unit 212 in the video editing output control server 2.

(Step S6)
The second time assigning unit 223 assigns the reference time generated by the time reference data generation unit 221 in the video editing output control server 2 to the audio data, which is then stored in the storage unit 23 as audio data AD. The audio data may be stored divided into voice section detection data.

(Step S7)
The speech synthesis device 5 also transfers the input external text data to the video editing output control server 2. The transferred external text data is input to the external text data receiving unit 213 in the video editing output control server 2.

(Step S8)
The third time assigning unit 224 assigns the reference time generated by the time reference data generation unit 221 in the video editing output control server 2 to the external text data, which is then stored in the storage unit 23 as external text data OTD.

(Step S9)
Next, editing is performed on the information processing device 4. The details of editing are described later. In outline, the operator specifies a reference time on the editing screen of the information processing device 4, and in response the still image at that reference time and the voice section detection data constituting the audio data are read out, so that editing can be performed efficiently.

(Step S10)
The edited video data is converted by the data output unit 225 into an outputtable format and transmitted from the edited video data transmitting unit 214 to an external system.
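To make the idea of editing along the text axis concrete, the following Python sketch removes a selected span of words together with the frames that fall in the same reference time range. This is only one possible edit operation (the description mentions partial deletion among others). The function `cut_by_text`, the `(time, value)` tuple layout, and the rule that a word lasts until the next word begins are all assumptions for illustration.

```python
def cut_by_text(words, frames, first, last):
    """Delete words[first..last] (inclusive) and the frames in their time range.

    `words` and `frames` are time-sorted lists of (reference_time, value).
    A word is assumed to last until the next word starts; the final word
    is assumed to last until the end of the material.
    """
    t_start = words[first][0]
    t_end = words[last + 1][0] if last + 1 < len(words) else float("inf")
    kept_words = words[:first] + words[last + 1:]
    kept_frames = [f for f in frames if not (t_start <= f[0] < t_end)]
    return kept_words, kept_frames


# Hypothetical sample data sharing one reference clock.
words = [(0.0, "a"), (1.0, "b"), (2.0, "c")]
frames = [(0.0, "f0"), (0.5, "f1"), (1.0, "f2"), (1.5, "f3"), (2.0, "f4")]
kept_words, kept_frames = cut_by_text(words, frames, 1, 1)
```

Here deleting the word "b" also drops the frames stamped 1.0 and 1.5, which is the sense in which the text serves as the reference axis for the edit.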

<Embodiment 2>
FIG. 4 is a schematic configuration diagram (system configuration diagram) showing an information processing system 200 according to Embodiment 2 of the present invention. As shown in FIG. 4, the information processing system 200 includes, by way of example, a video output device 1, a video editing output control server 2, a video receiving device 3, an information processing device 4, and a speech recognition device 6, which are connected to a predetermined network.

The video output device 1, the video editing output control server 2, the video receiving device 3, and the information processing device 4 have the same functions and operate in the same way as in Embodiment 1.

The speech recognition device 6 recognizes language from audio data and generates speech recognition text. It may take in only the audio portion of the video data output from the video output device 1, or a separate audio output device may be provided. It has a function for receiving a reference time (reference signal) from the video editing output control server 2, or an external reference time, and embedding the value of that reference time in the input audio data and the output text data, so that the whole information processing system 200 operates on the same reference time.
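One simple way to embed reference time values in text data is shown in the following Python sketch, which writes each word with a bracketed time prefix and parses the result back. The `[seconds]word` notation, the millisecond precision, and the function names `embed` and `extract` are assumptions for illustration; the patent does not define an embedding format.

```python
import re


def embed(tokens):
    """Serialize (time, word) pairs as '[t]word' items separated by spaces."""
    return " ".join(f"[{t:.3f}]{w}" for t, w in tokens)


# Matches a bracketed time with three decimals followed by a non-space word.
_TOKEN = re.compile(r"\[(\d+\.\d{3})\](\S+)")


def extract(text):
    """Recover (time, word) pairs from text produced by embed()."""
    return [(float(t), w) for t, w in _TOKEN.findall(text)]


encoded = embed([(0.0, "hello"), (1.25, "world")])
decoded = extract(encoded)
```

With such a format the receiving server can recover the time of every word without a separate side channel, at the cost of having to strip the markers before displaying the text.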

The predetermined network is a communication line or communication network for information processing, including, for example, the Internet. It is not particularly limited as long as various information and data can be transmitted and received between the speech recognition device 6 and the video editing output control server 2, and between the video editing output control server 2 and the information processing device 4. The predetermined network is realized by, for example, a broadband network such as the Internet, a core network such as a mobile phone network, a LAN (Local Area Network), or a narrowband network combining these.

In Embodiment 2, the information processing system 200 includes one each of the speech recognition device 6, the video editing output control server 2, the video output device 1, the video receiving device 3, and the information processing device 4, but the number of each need not be one. For example, the speech recognition device 6 may be omitted if the video editing output control server 2 has a speech recognition function. The video output device 1 and the video receiving device 3 may be the same device, and multiple information processing devices 4 may be installed and communicate with the video editing output control server 2 simultaneously. Furthermore, the functions of the video editing output control server 2 and the information processing device 4 may be implemented on the same server, or may be hosted on a server that has other functions.

＜実施例２＞
図５を参照して、音声認識装置６に外部テキストデータを入力して編集を行う映像編集出力制御装置を実施例１として説明する。図５は、本発明の実施形態２に係る音声認識装置６を用いて編集作業を行う過程を示すフローチャートである。 <Example 2>
Referring to FIG. 5, a video editing output control device that inputs external text data to the voice recognition device 6 and performs editing will be described as a first embodiment. FIG. 5 is a flowchart showing the process of performing editing work using the speech recognition device 6 according to the second embodiment of the present invention.

（ステップＳ２１）
映像出力装置１から編集対象となる映像データが映像編集出力制御サーバ２内の映像データ受信部２１１に入力される。 (Step S21)
Video data to be edited is input from the video output device 1 to the video data receiving section 211 in the video editing output control server 2 .

（ステップＳ２２）
映像編集出力制御サーバ２内の時刻基準データ生成部２２１で生成された基準時刻を第１時刻付与部２２２にて映像データへ付与され、記憶部２３へ映像データＶＤとして記憶される。 (Step S22)
The reference time generated by the time reference data generation section 221 in the video editing output control server 2 is added to the video data by the first time adding section 222, and is stored in the storage section 23 as video data VD.

（ステップＳ２３）
ステップＳ２１で入力した編集対象となる映像データの音声データ部分を、映像出力装置１から音声認識装置６に入力する。また映像出力装置１から直接入力せず、別の装置を経由して入力しても良い。この音声データはこの際、音声データには映像データの基準時刻データを重畳する。この映像データの基準時刻データを元に映像編集出力制御サーバ２内の時刻データと同期させる。 (Step S23)
The audio data portion of the video data to be edited that was input in step S21 is input from the video output device 1 to the audio recognition device 6. Further, instead of being input directly from the video output device 1, the signal may be input via another device. At this time, reference time data of the video data is superimposed on the audio data. Based on the reference time data of this video data, it is synchronized with the time data in the video editing output control server 2.

（ステップＳ２４）
音声認識装置６は入力された音声データを基に音声認識テキストデータを生成する。この音声認識テキストデータには、前述の映像データの基準時刻データを元に生成された時刻データを埋め込む。 (Step S24)
The speech recognition device 6 generates speech recognition text data based on the input speech data. Time data generated based on the reference time data of the video data described above is embedded in this voice recognition text data.

（ステップＳ２５）
音声認識装置６は生成した音声認識テキストデータを映像編集出力制御サーバ２へ転送する。転送された音声認識テキストデータは映像編集出力制御サーバ２は内部の外部テキストデータ受信部２１３に入力される。 (Step S25)
The speech recognition device 6 transfers the generated speech recognition text data to the video editing output control server 2. The transferred voice recognition text data is input to the external text data receiving section 213 inside the video editing output control server 2.

(Step S26)
The third time adding section 224 adds the reference time generated by the time reference data generation section 221 in the video editing output control server 2 to the external text data, and the result is stored in the storage section 23 as external text data OTD.

(Step S27)
Next, the information processing device 4 performs editing processing. The details of editing are described later. In outline, an operation that specifies a reference time is performed on the editing screen of the information processing device 4, and in response to that operation the still image at that reference time, or the audio section detection data constituting the audio data, is read out so that editing can be performed efficiently.

(Step S28)
The data output section 225 converts the edited video data into a format that can be output, and the edited video data transmitting section 214 transmits the video data to an external system.

<Screen Example 1>
An example of a screen displayed on the display section (not shown) of the information processing device 4 shown in FIG. 1 or FIG. 4 will be described. FIG. 6 is a diagram showing Screen Example 1 displayed on the information processing device 4 according to Embodiment 1 and Embodiment 2 of the present invention. The screen consists of a reference time data display area 41, a video data display area 42, a superimpose display area 43, and an external text data display area 44. It is not necessary to provide all of these areas; areas may be added or removed according to the functions required.

The text selection display 411 in the external text data display area 44 (in FIG. 6, the character "お" is selected) is selected by, for example, a "mouse-over" operation in which the mouse cursor is placed over the text, or by pressing "Shift + cursor key" on the keyboard.

The information processing device 4 reads the reference time assigned to the character "お" of the text selection display 411 from the external text data OTD in the video editing output control server 2 and takes in that reference time. The reference time is then highlighted at the corresponding position on the time axis in the reference time data display area 41 as the target reference time display 412. Editing with this reference time as the target makes efficient editing possible.

Subsequently, the information processing device 4 reads the reference time assigned to the character "お" of the text selection display 411 from the external text data OTD in the video editing output control server 2 and takes in that reference time. It then reads out the still image in the video data VD that is linked to that reference time. The still image is highlighted at the corresponding position on the time axis as the target image display 413. Editing with this highlighted still image as the target makes efficient editing possible.
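
The lookup chain described for Screen Example 1 (selected character, then reference time, then still image) can be sketched as follows. The in-memory lists `otd` and `vd`, the function names, and the "latest frame at or before the time" rule are assumptions for illustration; how the server actually stores and matches the data is not specified here.

```python
from bisect import bisect_right
from typing import List, Tuple

# (reference_time, character) pairs standing in for external text data OTD.
otd: List[Tuple[float, str]] = [(12.0, "o"), (12.5, "d"), (13.0, "e"), (13.5, "n")]
# (reference_time, frame label) pairs standing in for video data VD, sorted by time.
vd: List[Tuple[float, str]] = [
    (12.0, "frame-0"), (12.4, "frame-1"), (12.8, "frame-2"), (13.2, "frame-3"),
]


def reference_time_of(char_index: int) -> float:
    """Reference time assigned to the selected character."""
    return otd[char_index][0]


def frame_at(reference_time: float) -> str:
    """Return the latest frame whose reference time is at or before the given time."""
    times = [t for t, _ in vd]
    pos = bisect_right(times, reference_time) - 1
    if pos < 0:
        raise LookupError("no frame at or before this reference time")
    return vd[pos][1]


# Selecting the second character yields its reference time and the matching still image.
selected_time = reference_time_of(1)
selected_frame = frame_at(selected_time)
```

A binary search is used because the frames are already ordered by reference time, so the lookup stays fast even for long recordings.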

<Screen Example 2>
Next, another example of a screen displayed on the display section (not shown) of the information processing device 4 shown in FIG. 1 or FIG. 4 will be described. FIG. 7 is a diagram showing Screen Example 2 displayed on the information processing device 4 according to Embodiment 1 and Embodiment 2 of the present invention. As in Screen Example 1, the screen consists of a reference time data display area 41, a video data display area 42, a superimpose display area 43, and an external text data display area 44. It is not necessary to provide all of these areas; areas may be added or removed according to the functions required.

The superimpose display 421 in the superimpose display area 43 (in FIG. 7, the superimposed caption material "Piping-hot Chinese steamed buns and oden now on sale" is selected) is moved by a "mouse drag" operation, in which the mouse cursor is moved up and down while the mouse button is held down (as indicated by the dashed arrow). The reference time display 422 on the time axis of the reference time data display area 41 that corresponds to the mouse drag position is highlighted.

The information processing device 4 reads the value of the highlighted reference time display 422 from the external text data OTD in the video editing output control server 2 and takes in that reference time. The text corresponding to that reference time is then shown as the text highlight display 423 in the external text data display area 44. When the superimpose display 421 is moved by a mouse drag as indicated by the dashed line, the highlighted reference time display 422 and the text highlight display 423 both move as indicated by the dashed lines. This operation makes it possible to confirm the start time of the audio, so that editing can be performed efficiently.

Subsequently, the information processing device 4 reads out, from the video data VD in the video editing output control server 2, the still image linked to the value of the highlighted reference time display 422, and takes in that still image. The still image is shown as the still image display 424 in the video data display area 42. When the superimpose display 421 is moved by a mouse drag as indicated by the dashed line, the highlighted still image display 424 also moves as indicated by the dashed line. This operation makes it possible to confirm the start time of the image, so that editing can be performed efficiently.
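
The drag behaviour of Screen Example 2 can be sketched as a mapping from pointer position to reference time, followed by lookups of the text and still image at that time. Everything named here (`position_to_time`, `latest_at_or_before`, the pixel geometry, and the linear time scale) is a hypothetical simplification and not taken from the specification.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def position_to_time(y: float, scale_top: float, scale_height: float,
                     t_start: float, t_end: float) -> float:
    """Linearly map a pointer position on the time scale to a reference time."""
    ratio = min(max((y - scale_top) / scale_height, 0.0), 1.0)  # clamp to the scale
    return t_start + ratio * (t_end - t_start)


def latest_at_or_before(items: List[Tuple[float, str]], t: float) -> Optional[str]:
    """Return the item with the latest reference time <= t (items sorted by time)."""
    pos = bisect_right([time for time, _ in items], t) - 1
    return items[pos][1] if pos >= 0 else None


text_items = [(0.0, "Start"), (4.0, "selling"), (8.0, "oden")]
frame_items = [(0.0, "frame-0"), (5.0, "frame-1"), (10.0, "frame-2")]

# Dragging the superimpose display to the middle of a 100-pixel scale covering 0 to 10 s.
drag_time = position_to_time(y=150.0, scale_top=100.0, scale_height=100.0,
                             t_start=0.0, t_end=10.0)
highlighted_text = latest_at_or_before(text_items, drag_time)
shown_frame = latest_at_or_before(frame_items, drag_time)
```

Recomputing both lookups on every drag event is what would keep the text highlight and the still image moving together with the dragged caption.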

1 Video output device
2 Video editing output control server
3 Video receiving device
4 Information processing device
5 Speech synthesis device
6 Speech recognition device
21 Transmission/reception section of the video editing output control server
22 Information processing section of the video editing output control server
23 Storage section of the video editing output control server
41 Reference time data display area of the information processing device display section
42 Video data display area of the information processing device display section
43 Superimpose display area of the information processing device display section
44 External text data display area of the information processing device display section
100 Information processing system
200 Information processing system
211 Video data receiving section
212 Audio data receiving section
213 External text data receiving section
214 Edited video data transmitting section
221 Time reference data generation section
222 First time adding section
223 Second time adding section
224 Third time adding section
411 Text selection display
412 Target reference time display
413 Target image display
421 Superimpose display
422 Reference time display
423 Text highlight display
424 Still image display
VD Video data
AD Audio data
OTD External text data

Claims

A video editing output device using text data, comprising:
video data input means for receiving video data;
audio data input means for receiving audio data;
external text data input means for receiving text data;
reference time generating means for generating time reference data based on an external clock or an internal clock;
first time assigning means for assigning the time reference data to each still image data constituting the video data when the video data is received;
second time assigning means for assigning the time reference data to each audio section detection data constituting the audio data when the audio data is received;
third time assigning means for assigning the time reference data to the text data when the text data is input by the external text data input means; and
data output means capable of outputting, based on the time reference data, still image data constituting the video data, audio section detection data constituting the audio data, and a part of the text data.

A video editing output device using text data, comprising:
video data input means for receiving video data;
voice recognition means for performing voice recognition processing and generating voice recognition text data from voice;
reference time generating means for generating time reference data based on an external clock or an internal clock;
first time assigning means for assigning the time reference data to each still image data constituting the video data when the video data is received;
second time adding means for adding the time reference data to the voice recognition text data generated by the voice recognition processing; and
data output means capable of outputting, based on the time reference data, still image data constituting the video data and a part of the voice recognition text data.

The video editing output device according to claim 1, wherein,
when a mouse cursor or a selection area is placed over a specific character at a location where the text data is displayed, the time reference data read by the data output means is displayed and output.

The video editing output device according to claim 2, wherein,
when a mouse cursor or a selection area is placed over a specific character at a location where the voice recognition text data is displayed, the time reference data read by the data output means is displayed and output.

The video editing output device according to claim 1, wherein,
when the mouse cursor is moved to a location where the text data is displayed, or when a specific character within the text data is selected, a still image in the video data associated with the time reference data read by the data output means is displayed and output.

The video editing output device according to claim 2, wherein,
when the mouse cursor is moved to a location where the voice recognition text data is displayed, or when a specific character within the voice recognition text data is selected, a still image in the video data associated with the time reference data read by the data output means is displayed and output.

The video editing output device according to claim 1, wherein,
when video, an image, or text is superimposed on video, the time reference data is displayed on an editing screen scale, and when a video display portion, an image display portion, or a text display portion to be superimposed is selected on the editing screen scale and moved by dragging with a mouse operation or by moving a cursor with a keyboard operation, the characters of the text data associated with the time reference data on the editing screen scale are displayed and output differently from other characters.

The video editing output device according to claim 2, wherein,
when video, an image, or text is superimposed on video, the time reference data is displayed on an editing screen scale, and when a video display portion, an image display portion, or a text display portion to be superimposed is selected on the editing screen scale and moved by dragging with a mouse operation or by moving a cursor with a keyboard operation, the characters of the voice recognition text data associated with the time reference data on the editing screen scale are displayed and output differently from other characters.

The video editing output device according to claim 1 or claim 2, wherein,
when video, an image, or text is superimposed on video, the time reference data is displayed on an editing screen scale, and when a video display portion, an image display portion, or a text display portion to be superimposed is selected on the editing screen scale and moved by dragging with a mouse operation or by moving a cursor with a keyboard operation, a still image in the video data associated with the time reference data on the editing screen scale is displayed and output.

A video editing output method using text data, the method comprising:
a video data input step of receiving video data;
an audio data input step of receiving audio data;
an external text data input step of receiving text data;
a reference time generation step of generating time reference data based on an external clock or an internal clock;
a first time assigning step of assigning the time reference data to each still image data constituting the video data when the video data is received;
a second time assigning step of assigning the time reference data to each audio section detection data constituting the audio data when the audio data is received;
a third time assigning step of assigning the time reference data to the text data when the text data is input in the external text data input step; and
a data output step capable of outputting, based on the time reference data, still image data constituting the video data, audio section detection data constituting the audio data, and a part of the text data.

A video editing output program that causes a computer that edits and outputs video using text data to function as:
video data input means for receiving video data;
audio data input means for receiving audio data;
external text data input means for receiving text data;
reference time generating means for generating time reference data based on an external clock or an internal clock;
first time assigning means for assigning the time reference data to each still image data constituting the video data when the video data is received;
second time assigning means for assigning the time reference data to each audio section detection data constituting the audio data when the audio data is received;
third time assigning means for assigning the time reference data to the text data when the text data is input by the external text data input means; and
data output means capable of outputting, based on the time reference data, still image data constituting the video data, audio section detection data constituting the audio data, and a part of the text data.

A video editing output method using text data, the method comprising:
a video data input step of receiving video data;
a voice recognition step of performing voice recognition processing and generating voice recognition text data from voice;
a reference time generation step of generating time reference data based on an external clock or an internal clock;
a first time assigning step of assigning the time reference data to each still image data constituting the video data when the video data is received;
a second time adding step of adding the time reference data to the voice recognition text data generated by the voice recognition processing; and
a data output step capable of outputting, based on the time reference data, still image data constituting the video data and a part of the voice recognition text data.

A video editing output program that causes a computer that edits and outputs video using text data to function as:
video data input means for receiving video data;
voice recognition means for performing voice recognition processing and generating voice recognition text data from voice;
reference time generating means for generating time reference data based on an external clock or an internal clock;
first time assigning means for assigning the time reference data to each still image data constituting the video data when the video data is received;
second time adding means for adding the time reference data to the voice recognition text data generated by the voice recognition processing; and
data output means capable of outputting, based on the time reference data, still image data constituting the video data and a part of the voice recognition text data.