JP2007295218A - Nonlinear editing apparatus, and program therefor

Info

Publication number: JP2007295218A
Application number: JP2006120126A
Authority: JP
Inventors: Masami Fujita (藤田昌巳); Akira Nakamura (中村章); Yasuyuki Kondo (近藤康之); Tomoyasu Sugiura (杉浦智保); Masahiro Shibata (柴田正啓)
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2006-04-25
Filing date: 2006-04-25
Publication date: 2007-11-08
Anticipated expiration: 2026-04-25
Also published as: JP4741406B2

Abstract

PROBLEM TO BE SOLVED: To provide a nonlinear editing apparatus that associates video and audio with text data without manual work and generates editing data for editing the video and audio in conjunction with edits to the text data.
SOLUTION: The nonlinear editing apparatus includes: text association means 32 that performs speech recognition on the audio and generates text data associated with time information; display means 50 that displays the text data and the video in association with the time information; text editing means 41 that edits the text data; and editing data generation means 44 that generates editing data based on the time information corresponding to the edited text data.
COPYRIGHT: (C)2008, JPO & INPIT

Description

The present invention relates to a nonlinear editing apparatus that generates editing data for editing video and audio in conjunction with text data, and to a program therefor.

In recent years, video and audio editing has been shifting from linear editing, in which a tape carrying the recorded video or the like is played back and the desired portions are copied to another tape, to nonlinear editing, in which the video or the like is first written as digital data to a storage device such as a hard disk and then edited by computer.
In a nonlinear editing system that performs such nonlinear editing, the operator applies operations such as "copy", "cut", and "paste" on the GUI (Graphical User Interface) of a computer terminal to the video or the like held in the storage device, thereby creating editing data consisting of start points and end points in the video or the like. When playing back the edited video or the like, the nonlinear editing system reads the material between the desired start and end points from the storage device based on this editing data and outputs it. Nonlinear editing can therefore reduce the time required for editing compared with linear editing.
Even with nonlinear editing, however, the operator still has to play back the video or the like both while editing and while checking the result, so editing currently takes a great deal of time.

A technique has therefore been disclosed in which sentences, as text data, are associated with video frames so that editing the text edits the corresponding video (see Patent Document 1). In this technique, each sentence is associated in advance with the start frame number and end frame number of the corresponding video, and the access order of the video can be changed arbitrarily by selecting sentences or reordering them.
This lets the operator edit video easily by editing text data, which carries little information, rather than editing the video itself, which carries a large amount of information and is difficult to edit.
Patent Document 1: JP-A-9-237486 (paragraphs 0086 to 0089, FIG. 5)

With the conventional technique described above, however, the sentences and the video frames must be associated in advance, and the input work for that association takes a long time.
In addition, because the conventional technique associates multiple video frames with a single sentence, the unit of video editing is the sentence, and fine-grained video editing is not possible.

The present invention has been made to solve the problems described above. Its object is to provide a text-linked nonlinear editing apparatus, and a program therefor, that associate video and audio with text data without human intervention and can generate editing data for editing the video and audio in conjunction with character-level editing of the text data.

The present invention was devised to achieve the above object. First, the nonlinear editing apparatus according to claim 1 is a nonlinear editing apparatus that, based on video and audio to which time information has been added, generates editing data for editing the video and audio, and comprises text association means, display means, text editing means, and editing data generation means.

In this configuration, the text association means of the nonlinear editing apparatus performs speech recognition on the audio, to which time information corresponding to the video has been added, and generates text data associated with that time information. The text data is thereby associated with the video and audio through the time information (time code). The text data can be associated with the time code in units as small as a single character.
The display means divides the display area into a text display area, which shows the text data, and a video display area, which shows the video, and displays the text data and the video on a display device in association with the time information. The audio corresponding to the video is thus shown as text data in step with the display (playback) of the video.

The text editing means then edits the text data according to the operator's operations. Because of the association through the time information, an edit to the text data corresponds directly to an edit of the associated video and audio. The editing data generation means generates the editing data based on the time information corresponding to the text data edited by the text editing means.
Since the text data and the video are associated through the time information, it is also possible to edit the text data by editing the video.

The nonlinear editing apparatus according to claim 2 is the nonlinear editing apparatus according to claim 1, wherein the text association means generates the text data by associating the time information with each boundary between words recognized by speech recognition, and the text editing means edits the text data in units of words.

In this configuration, when the text association means converts the audio into text data by speech recognition, it generates text data in which time information is attached at each boundary between the words of the recognition result.
The text editing means then edits the text data in units of words. Making the word the minimum unit of editing prevents the audio from being cut partway through a word, where the fragment would be meaningless as speech.

The nonlinear editing apparatus according to claim 3 is a nonlinear editing apparatus that, based on video to which time information has been added, generates editing data for editing the video, and comprises text data input means, text association means, display means, text editing means, and editing data generation means.

In this configuration, the text data input means receives text data corresponding to the video from outside. This text data is, for example, a script in which the audio corresponding to the video has been put into electronic form in advance.
The text association means associates the time information added to the video with the text data received by the text data input means, at each word boundary, according to the operator's instructions.
The display means divides the display area into a text display area, which shows the text data, and a video display area, which shows the video, and displays the text data and the video on a display device in association with the time information.

The text editing means then edits the text data according to the operator's operations, and the editing data generation means generates the editing data based on the time information corresponding to the text data edited by the text editing means.

The nonlinear editing apparatus according to claim 4 is the nonlinear editing apparatus according to any one of claims 1 to 3, wherein the editing data generation means recognizes an editing start point and end point that are written in the text data and identified by unique control characters, and generates the editing data based on the time information corresponding to the start point and end point.

In this configuration, unique control characters indicating the editing start point (in point) and end point (out point) are written into the text data in the text editing means, so that the times of the start point and end point can be recognized from the time information associated with the text data.
The editing data generation means then generates editing data that uses the times of the start point and end point as cut points for the video and audio.

The nonlinear editing apparatus according to claim 5 is the nonlinear editing apparatus according to any one of claims 1 to 4, wherein the editing data generation means recognizes a superimposed-caption character string that is written in the text data and identified by unique control characters, and adds playback information for the caption string to the editing data based on the time information corresponding to the position at which the caption string was inserted.

In this configuration, a superimposed-caption character string identified by unique control characters is written into the text data in the text editing means, so that the time at which the caption is to be played back can be recognized from the time information associated with the text data.
The editing data generation means then adds the caption's character string and the time at which the caption is to be played back to the editing data.

The nonlinear editing apparatus according to claim 6 is the nonlinear editing apparatus according to any one of claims 1 to 5, further comprising video editing means that edits the video, wherein the text editing means edits the text data and the time information associated with the text data based on the time information corresponding to the video edited by the video editing means.

In this configuration, when the video is edited by the video editing means, the text editing means can edit the text data based on the time information associated with the video.

The nonlinear editing program according to claim 7 causes a computer to function as text association means, display means, text editing means, and editing data generation means in order to generate, based on video and audio to which time information has been added, editing data for editing the video and audio.

In this configuration, the text association means of the nonlinear editing program performs speech recognition on the audio, to which time information corresponding to the video has been added, and generates text data associated with that time information. The text data is thereby associated with the video and audio through the time information (time code).
The display means divides the display area into a text display area, which shows the text data, and a video display area, which shows the video, and displays the text data and the video on a display device in association with the time information. The audio corresponding to the video is thus shown as text data in step with the display (playback) of the video.
The text editing means then edits the text data according to the operator's operations.
The editing data generation means generates the editing data based on the time information corresponding to the text data edited by the text editing means.

The present invention provides the following advantageous effects.
According to the inventions of claims 1 and 7, video and audio can be associated with text data without human intervention, and the video and audio can be edited at any position in the text data. Because the editing result can be checked as text without watching or listening to the video and audio, the time needed to edit them can be shortened.

According to the invention of claim 2, editing is performed in units of words, so the audio is never cut partway through a word, where the fragment would be meaningless as speech, and editing operations become more efficient.

According to the invention of claim 3, the video can be edited at any position in the text data. Because the editing result can be checked as text, the time needed to edit the video can be shortened.

According to the invention of claim 4, the start and end points for editing the video and audio can be set simply by inserting unique characters into the text data, so editing can be done in less time than when the start and end points are set while playing back the video and audio.

According to the invention of claim 5, editing data that inserts a superimposed caption can be generated as part of the same sequence of text editing operations. Since there is no need to play back the video to find where the caption should be inserted, editing data with the caption set can be generated in a short time.

According to the invention of claim 6, because the video and the text data are associated through the time information, a user interface that generates editing data by editing the video in the conventional manner can coexist with a user interface that generates editing data by editing the text data, which improves convenience for the operator.

Embodiments of the present invention will be described below with reference to the drawings.
[Overview of the nonlinear editing apparatus]
First, an overview of the nonlinear editing apparatus according to the present invention is given with reference to FIG. 1. FIG. 1 is a diagram for explaining the overview of the nonlinear editing apparatus according to the present invention and shows a display screen displayed by the apparatus.
The nonlinear editing apparatus 1 generates editing data for editing video and audio based on audio to which a time code (time information) corresponding to the video has been added.
The nonlinear editing apparatus 1 receives video and audio associated by a time code and displays the text data obtained by speech recognition of the input audio in the text display area T of the display screen D. It also displays the input video in the video display area M of the display screen D.

When the operator edits the text data displayed in the text display area T, the nonlinear editing apparatus 1 generates editing data for editing the video and audio in conjunction with the edits to the text data.
Because the nonlinear editing apparatus 1 associates the text data with the video and audio through the same time code, nonlinear editing can be performed by editing the text data.
The configuration and operation of the nonlinear editing apparatus according to the present invention are described below.

<<First Embodiment>>
[Configuration of the nonlinear editing apparatus]
First, the configuration of the nonlinear editing apparatus according to the first embodiment of the present invention is described with reference to FIG. 2. FIG. 2 is a block diagram showing the configuration of the nonlinear editing apparatus according to the first embodiment of the present invention.
As shown in FIG. 2, the nonlinear editing apparatus 1 generates, based on video and audio to which time information has been added, editing data for editing that video and audio.
Here, the nonlinear editing apparatus 1 includes control means 10, storage means 20, input means 30, editing means 40, display means 50, and output means 60.

The control means 10 controls the nonlinear editing apparatus 1 as a whole. It displays a menu screen or the like (not shown) and executes the operation selected by the operator. Here, the control means 10 controls the input means 30, the editing means 40, the display means 50, and the output means 60: it activates the input means 30 when the video and audio to be edited are input, the editing means 40 and the display means 50 when editing work is performed, and the output means 60 when editing results such as the editing data are output.

The storage means 20 stores video and audio input from outside, the various data used in the nonlinear editing apparatus 1, and editing results. Here, the storage means 20 includes speech recognition data storage means 21 and editing data storage means 22.

The speech recognition data storage means 21 stores the various data used for speech recognition, such as the language model and the acoustic model used by the speech recognition means 321 of the text association means 32 described later, and is a general storage device such as a hard disk.
The language model models the occurrence frequencies, connection probabilities, and the like of output sequences (words, morphemes, phonemes, etc.) learned from a large amount of speech data. A general N-gram language model, for example, can be used as this language model.
The acoustic model models, by means of a hidden Markov model, the features of each phoneme learned in advance from a large amount of speech data. A single acoustic model may be used, or multiple models may be used, one for each type of sound (for example, one per speaker).
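As a rough illustration of the "occurrence frequencies and connection probabilities" that an N-gram language model captures, the following minimal sketch estimates bigram (N = 2) probabilities from a toy corpus. The function names, the sentence boundary markers `<s>` and `</s>`, and the unsmoothed maximum-likelihood estimate are assumptions made for this example; the patent does not specify how its language model is trained.

```python
from collections import Counter

def train_bigram(sentences):
    """Count context words and word pairs over tokenized sentences."""
    contexts, pairs = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + list(words) + ["</s>"]  # hypothetical boundary markers
        contexts.update(padded[:-1])               # every token that has a successor
        pairs.update(zip(padded, padded[1:]))
    return contexts, pairs

def bigram_prob(contexts, pairs, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 for an unseen context."""
    if contexts[prev] == 0:
        return 0.0
    return pairs[(prev, word)] / contexts[prev]

corpus = [
    ["the", "news", "starts"],
    ["the", "news", "ends"],
    ["the", "weather", "starts"],
]
contexts, pairs = train_bigram(corpus)
p_news_given_the = bigram_prob(contexts, pairs, "the", "news")
```

A practical recognizer would add smoothing so that unseen word pairs do not receive zero probability; that step is omitted here for brevity.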

The editing data storage means 22 stores the video and audio material data to be edited and the editing results, and is a general storage device such as a hard disk. The video and audio stored in the editing data storage means 22 are assumed to have a time code added. The time code added to the audio is assumed to be associated with the frames of the video (video frames) in the form "hours:minutes:seconds:video frame number".
Here, the speech recognition data storage means 21 and the editing data storage means 22 of the storage means 20 are implemented on separate hard disks or the like, but they may be implemented on the same hard disk or the like.
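The "hours:minutes:seconds:video frame number" time code can be handled as a small value type that converts to and from an absolute frame count, which makes comparisons and arithmetic on cut points straightforward. The sketch below is illustrative only: the class name `Timecode`, the fixed rate of 30 frames per second, and the non-drop-frame counting are assumptions, since the text does not state a frame rate.

```python
from dataclasses import dataclass

FPS = 30  # assumed frame rate (non-drop-frame); not specified in the source

@dataclass(frozen=True, order=True)
class Timecode:
    hours: int
    minutes: int
    seconds: int
    frames: int

    @classmethod
    def parse(cls, text):
        """Parse 'HH:MM:SS:FF' into a Timecode."""
        h, m, s, f = (int(part) for part in text.split(":"))
        return cls(h, m, s, f)

    def to_frames(self, fps=FPS):
        """Absolute frame count from the start of the material."""
        return ((self.hours * 60 + self.minutes) * 60 + self.seconds) * fps + self.frames

    @classmethod
    def from_frames(cls, total, fps=FPS):
        """Inverse of to_frames."""
        seconds, frames = divmod(total, fps)
        minutes, seconds = divmod(seconds, 60)
        hours, minutes = divmod(minutes, 60)
        return cls(hours, minutes, seconds, frames)

    def __str__(self):
        return f"{self.hours:02d}:{self.minutes:02d}:{self.seconds:02d}:{self.frames:02d}"

tc = Timecode.parse("00:01:02:15")
```

Because the dataclass is declared with `order=True` and its fields run from most to least significant, two time codes can be compared directly with `<` and `>`.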

The input means 30 receives the video and audio under the control of the control means 10 and generates text data associated with the time code. Here, the input means 30 includes video/audio input means 31 and text association means 32.

The video/audio input means 31 receives the video and audio to be edited from outside and stores them in the editing data storage means 22.

The text association means 32 performs speech recognition on the audio and generates text data associated with the time code. Here, the text association means 32 consists of speech recognition means 321.

In addition to the general speech recognition function of converting audio into text data, the speech recognition means 321 attaches the time code of the video and audio to the text data obtained as the recognition result. The speech recognition means 321 performs speech recognition using the language model and the acoustic model stored in the speech recognition data storage means 21. Here, the speech recognition means 321 includes analysis means 321a, similarity calculation means 321b, and search means 321c.

The analysis means 321a extracts framed waveforms by applying a window function (such as a Hamming window) to the audio waveform and extracts various features by frequency analysis of those waveforms. For example, cepstrum coefficients, which are obtained by taking the inverse Fourier transform of the logarithm of the power spectrum of a framed waveform, are used as features. Besides cepstrum coefficients, general speech features such as mel-frequency cepstrum coefficients (MFCC: Mel Frequency Cepstrum Coefficient), LPC (Linear Predictive Coding) coefficients, and log power can also be used. The analysis means 321a attaches to each feature the time code that was added to the video and audio at the point where the feature was extracted.
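The cepstrum computation described above (window, power spectrum, logarithm, inverse Fourier transform) can be sketched for a single frame as follows. This is a minimal pure-Python illustration using a naive O(n²) discrete Fourier transform; the function names and the small constant `eps` that guards against taking the logarithm of zero are assumptions for the example, and a real analyzer would use an FFT library.

```python
import cmath
import math

def hamming(n):
    """Symmetric Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def dft(x):
    """Naive discrete Fourier transform; adequate for short illustrative frames."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def real_cepstrum(frame, eps=1e-12):
    """Cepstrum of one frame: inverse DFT of the log power spectrum."""
    windowed = [s * w for s, w in zip(frame, hamming(len(frame)))]
    log_power = [math.log(abs(c) ** 2 + eps) for c in dft(windowed)]
    n = len(log_power)
    # The log power spectrum of a real signal is real and symmetric,
    # so the inverse transform is real; the imaginary part is rounding noise.
    return [
        sum(log_power[k] * cmath.exp(2j * math.pi * k * q / n) for k in range(n)).real / n
        for q in range(n)
    ]
```

In practice only the first few low-order coefficients of each frame would typically be kept as the feature vector, paired with the time code of the frame they came from.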

The similarity calculation means 321b calculates the similarity (probability value) between the features that are analyzed by the analysis means 321a and input in time series and the phonemes modeled by the acoustic model stored in the speech recognition data storage means 21. The similarity calculation means 321b outputs the time code attached to each feature to the search means 321c together with the phoneme and the similarity (probability value).

The search means 321c searches the language model stored in the speech recognition data storage means 21 for candidate output sequences that can be connected, and outputs the output sequence with the highest probability value as the recognition result (text data) for the input audio. Here, the search means 321c stores the recognition result text data in the editing data storage means 22. When outputting the text data sequentially, the search means 321c attaches a time code at each word boundary based on the time codes attached by the analysis means 321a.
With the speech recognition means 321 configured in this way, each word of the recognized text data carries a time code that corresponds to the video on a frame-by-frame basis, as shown in FIG. 3. This text data is output by the display means 50 to a display device (not shown).
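The last two steps, choosing the most probable output sequence and attaching a time code at each word boundary, can be sketched as below. The data layout is hypothetical: each hypothesis is a pair of a log probability and a list of `(word, first_feature_index, last_feature_index)` spans, and `feature_timecodes[i]` holds the time code that the analysis stage attached to the i-th feature. The actual search procedure of a recognizer is far more involved and is not reproduced here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TimedWord:
    text: str
    start: str  # time code at the word's leading boundary
    end: str    # time code of the last feature frame inside the word

def pick_best(hypotheses):
    """Return the word spans of the hypothesis with the highest log probability."""
    return max(hypotheses, key=lambda h: h[0])[1]

def attach_timecodes(word_spans, feature_timecodes):
    """Map each word's feature index range to the time codes at its boundaries."""
    return [
        TimedWord(word, feature_timecodes[first], feature_timecodes[last])
        for word, first, last in word_spans
    ]

# Two analysis frames per video frame in this toy example.
feature_timecodes = [
    "00:00:00:00", "00:00:00:00",
    "00:00:00:01", "00:00:00:01",
    "00:00:00:02", "00:00:00:02",
]
hypotheses = [
    (-12.5, [("good", 0, 2), ("morning", 3, 5)]),
    (-15.1, [("goo", 0, 1), ("mourning", 2, 5)]),
]
words = attach_timecodes(pick_best(hypotheses), feature_timecodes)
```

Several analysis frames usually fall within one video frame, so adjacent words can share a boundary time code, as "good" and "morning" do here.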

Under the control of the control means 10, the editing means 40 generates editing data by editing the video and audio stored in the editing data storage means 22 in conjunction with the text data. The editing data is one or more pairs of a time code start point and end point, each indicating at least a section of the video or audio being edited that is to be extracted for reassembly. Here, the editing means 40 includes text editing means 41, video editing means 42, audio editing means 43, and editing data generation means 44.

The text editing means 41 edits the text data stored in the editing data storage means 22. Here, the text editing means 41 edits the text data when the operator performs editing operations on the text data displayed by the display means 50 through an input device (not shown) such as a mouse or keyboard.
That is, the text editing means 41 rearranges or deletes parts of the text data when the operator performs "copy", "cut", and "paste" operations. The text editing means 41 edits the text data in units of words, so that editing is performed in units of the time codes associated with the text data.
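To make the link between word-level text edits and editing data concrete, the following sketch turns an edited word sequence into a list of (start, end) pairs. It assumes, for illustration only, that each word is a tuple `(text, start_frame, end_frame)` with frame counts as integers and an exclusive end frame, and that words adjacent in the source material are merged into one segment; the function name `build_edit_data` is hypothetical.

```python
def build_edit_data(edited_words):
    """Collapse an edited word sequence into (start_frame, end_frame) segments.

    Consecutive words that were also contiguous in the source material are
    merged, so each segment is one section to read from storage on playback.
    """
    segments = []
    for _, start, end in edited_words:
        if segments and segments[-1][1] == start:
            segments[-1] = (segments[-1][0], end)  # extend the current segment
        else:
            segments.append((start, end))          # a cut: begin a new segment
    return segments

original = [
    ("today", 0, 15), ("the", 15, 20), ("weather", 20, 40),
    ("is", 40, 46), ("fine", 46, 60),
]

# "Cut" the words "the weather": two segments remain.
after_cut = [w for w in original if w[0] not in ("the", "weather")]
cut_segments = build_edit_data(after_cut)

# "Cut" and "paste" the word "today" to the end: the segment order follows the text order.
after_move = original[1:] + original[:1]
move_segments = build_edit_data(after_move)
```

Under these assumptions, deleting words yields a gap in the source ranges, and moving words changes the order in which the ranges are played, without the operator touching the video itself.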

The text editing means 41 also displays on the screen, together with the text data, a cursor indicating the editing position. When the operator moves the cursor with the keyboard or the like, the text editing means 41 notifies the video editing means 42 of the time code associated with the character at the cursor position, so that the corresponding scene (frame) is displayed.

Further, the text editing means 41 sets editing information in the text data when the operator enters predetermined control characters or the like into the text data. The result of editing the text data in the text editing means 41 is output to the editing data generation means 44.

Next, the control characters that serve as editing information are described with reference to FIG. 4 (and FIG. 2 as appropriate). FIG. 4 is an explanatory diagram of the control characters and shows text data displayed on the display screen of the nonlinear editing apparatus.

For example, as shown in FIG. 4, enclosing an arbitrary character string of the text data in a predetermined, unique control character (an edit identification character, for example “／”) sets edit points (a start point [in point] and an end point [out point]) within the text data for organizing the video and audio.
A predetermined, unique control character that specifies a presentation effect (an effect identification character) may additionally be attached at an edit point. That is, characters identifying a video effect such as “fade”, “wipe”, or “dissolve” are added together with its duration. For example, characters specifying “fade” (for example, “ＦＯ”) and a number indicating the duration (for example, “３”) are enclosed in effect identification characters (for example, “《” and “》”), so that the video effect at the edit point is set in the text data as a character string (for example, “／《ＦＯ３》”). The example in FIG. 4 indicates a three-second fade-out at the end point.
Effects such as “wipe” and “dissolve” can be set in the same way by defining other control characters for them in advance.
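As a rough illustration of how such markers might be read back out of the text, the sketch below splits a string at the edit identification character and parses an effect marker of the form “《ＦＯ３》”. The function names, the use of NFKC normalization to fold full-width characters to ASCII, and the regular expression are assumptions, not the method described in the source.

```python
import re
import unicodedata

# After NFKC normalization, full-width "ＦＯ３" becomes "FO3"; "《" and "》" are unchanged.
EFFECT = re.compile(r"《([A-Z]+)(\d+)》")


def split_edit_region(text: str):
    """Split at the first two edit identification characters: (before, inside, after)."""
    before, inside, after = unicodedata.normalize("NFKC", text).split("/", 2)
    return before, inside, after


def parse_effect(marker: str):
    """Return (effect name, seconds) for a marker such as '《FO3》', or None."""
    m = EFFECT.search(unicodedata.normalize("NFKC", marker))
    return (m.group(1), int(m.group(2))) if m else None


before, inside, after = split_edit_region("intro／main part／《ＦＯ３》outro")
effect = parse_effect(after)
```

This sketch raises `ValueError` if the text contains fewer than two edit identification characters; a real editor would need to handle unpaired markers.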

Alternatively, predetermined, unique control characters (superimposed-caption section identification characters and superimposed-caption identification characters) and an arbitrary character string (the superimposed-caption string) may be inserted into the text data to set a superimposed caption to be inserted into the video, together with the caption start point and caption end point between which it is displayed. For example, enclosing an arbitrary character string of the text data in caption section identification characters (for example, “〔” and “〕”) sets, within the text data, the time section during which the caption is displayed. Then, inserting into that section a character string enclosed in caption identification characters (for example, “『” and “』”) sets the text of the caption within the text data.
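A minimal sketch of reading such caption markup is shown below. It assumes the example characters “〔”, “〕”, “『”, and “』” and uses hypothetical names; nesting and unpaired brackets are not handled.

```python
import re

SECTION = re.compile(r"〔(.*?)〕")  # caption display section
CAPTION = re.compile(r"『(.*?)』")  # caption string inside a section


def extract_captions(text: str):
    """Return (caption string, spoken text covered by the section) pairs."""
    results = []
    for section in SECTION.finditer(text):
        body = section.group(1)
        caption = CAPTION.search(body)
        if caption:
            spoken = CAPTION.sub("", body)  # words whose time codes bound the display period
            results.append((caption.group(1), spoken))
    return results


captions = extract_captions("hello 〔『Tokyo Station』we are now at the station〕 bye")
```

The spoken text left inside the section is what would supply the start and end time codes for displaying the caption.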

A silent section may also be set by entering a predetermined, unique control character (a silence designation character, for example “＿”). In this case, each silence designation character represents a silent period of a predetermined length.
Also, for example, designating a region over an arbitrary character string of the text data (a video/audio separation designation) may specify that, within the section covered by that region, the video and audio are separated and only the video is used when editing.
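The silence designation character described above maps a run of characters to a duration. The sketch below assumes 0.5 seconds per character purely for illustration; the source says only that the length is predetermined.

```python
import re

SILENCE_RUN = re.compile("＿+")
SECONDS_PER_CHAR = 0.5  # assumed value; the source only says "a predetermined length"


def silent_sections(text: str):
    """Return (character offset, duration in seconds) for each run of silence characters."""
    return [(m.start(), len(m.group()) * SECONDS_PER_CHAR)
            for m in SILENCE_RUN.finditer(text)]


sections = silent_sections("abc＿＿＿def＿")
```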

In this way, the text editing means 41 allows various kinds of editing content to be set by inserting predetermined control characters or the like into the text data. The control characters shown in FIG. 4 are only examples, and other characters may be used. In addition to the editing information described with FIG. 4, a comment unrelated to the video and audio may be set by inserting an arbitrary character string between predetermined control characters (comment designation characters, for example “（” and “）”). The operator can then use the comment as a memo and refer to it when editing.
Returning to FIG. 2, the description of the text editing means 41 continues.

To assist in editing the text data, the text editing means 41 further includes a time display means 411 and a keyword search means 412.
The time display means 411 visualizes the time codes associated with the text data. For example, when an arbitrary position in the text data is indicated with an input device such as a mouse (not shown), the time display means 411 displays the time code corresponding to that position. In doing so, the time display means 411 may display both the time code of the video and audio material data being edited and the time code corresponding to the edited video and audio.

The time display means 411 may also, when an arbitrary character string of the text data is dragged with the mouse or the like, calculate the playback time of the video and audio corresponding to the character string in the dragged region from the time codes associated with the text data, and display it. This lets the operator check the duration of the edited video and audio from the text data.
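The playback time of a dragged selection can be sketched as a difference of word start frames. The sketch assumes 30 fps, that each word lasts until the next word starts, and that the selected words are contiguous in the source material; none of these details are given in the source.

```python
def selection_duration(word_frames, first, last, end_frame, fps=30):
    """Seconds covered by words first..last (inclusive).

    word_frames: start frame of each word; end_frame: frame at which the last word ends.
    """
    stop = word_frames[last + 1] if last + 1 < len(word_frames) else end_frame
    return (stop - word_frames[first]) / fps


frames = [0, 15, 45, 90]  # hypothetical start frames of four words
middle = selection_duration(frames, 1, 2, end_frame=120)
tail = selection_duration(frames, 2, 3, end_frame=120)
```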

The keyword search means 412 searches the text data for an arbitrary character string (keyword). That is, the keyword search means 412 displays an input screen for entering a keyword and, when the operator enters a keyword, searches for it in the text data stored in the editing data storage means 22.
Thus, the operator can locate the part to be edited by entering a keyword, without actually playing back the video and audio to look for it.
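One way to picture keyword search over time-coded text is to search the concatenated text and map the hit back to the word that contains it, which yields a time code to jump to. The function name and the (token, frame) representation are assumptions for illustration.

```python
from bisect import bisect_right


def find_keyword(words, keyword):
    """Return (word index, frame) of the word where the keyword starts, or None.

    words: list of (token, start frame); tokens are concatenated as-is.
    """
    offsets, text = [], ""
    for token, _ in words:
        offsets.append(len(text))  # character offset at which this token starts
        text += token
    pos = text.find(keyword)
    if pos < 0:
        return None
    index = bisect_right(offsets, pos) - 1
    return index, words[index][1]


words = [("good ", 0), ("morning ", 15), ("every", 40), ("one", 52)]
hit = find_keyword(words, "everyone")
```

Searching the joined text lets a keyword match even when the recognizer split it across two word units, as in this example.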

The video editing means 42 edits the video stored in the editing data storage means 22. Here, the video editing means 42 edits the video when the operator performs editing operations on the video displayed by the display means 50, using an operating device such as a jog/shuttle controller (not shown). For example, the operator uses the jog/shuttle controller to play and stop the video frame by frame, and the video editing means 42 sets the edit points (start point and end point). Alternatively, video playback, stopping, and edit-point setting are performed when the operator presses operation buttons on the display screen with the mouse or the like.

At this time, the video editing means 42 inserts an edit identification character (see FIG. 4) into the text data at the position corresponding to the time code of the edit point. Thus, even when an edit point is set on the video, the text data is edited in conjunction with the video.
When the operator wants to display a particular scene of the video, the scene may be displayed by the operator moving a cursor on a timeline associated with the time codes.

The audio editing means 43 edits the audio stored in the editing data storage means 22. The audio editing means 43 edits the audio in correspondence with the time codes of the video edited by the video editing means 42. It also edits the audio according to the time codes of the text data edited by the text editing means 41.

The editing data generation means 44 generates editing data for editing the video and audio, based on the editing control characters and the like inserted in the text data (see FIG. 4) and the time codes associated with the text data.
That is, the editing data generation means 44 searches for the various control characters described with FIG. 4 and generates, as editing data, the editing content corresponding to each control character.

For example, the editing data generation means 44 searches the text data for the edit identification characters that indicate the edit points (start point and end point), and takes the time code set for the word immediately after the first edit identification character as the time code of the start point. It takes, as the time code of the end point, the time code of the video frame immediately preceding the time code set for the word immediately after the second edit identification character.
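The start-point and end-point rule just described can be sketched as follows, with time codes simplified to integer frame indices. The token representation and function name are assumptions; the case where no word follows the second marker is not handled.

```python
MARK = "/"  # stands in for the edit identification character


def edit_points(tokens):
    """Return (start frame, end frame) for the first marked region.

    tokens: a list whose items are either MARK or (word, start frame).
    """
    marks = [i for i, t in enumerate(tokens) if t == MARK]
    first, second = marks[0], marks[1]
    start = tokens[first + 1][1]      # word right after the first marker
    end = tokens[second + 1][1] - 1   # one frame before the word after the second marker
    return start, end


tokens = [("intro", 0), MARK, ("main", 30), ("part", 48), MARK, ("outro", 90)]
points = edit_points(tokens)
```

Taking the end point one frame before the next word makes the extracted section cover the last marked word in full without overlapping the following one.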

Further, when the text data contains an effect identification character specifying a presentation effect, the editing data generation means 44 writes into the edit code a code indicating that the specified effect is to be applied from the time code corresponding to the position at which the effect identification character is inserted.
When the text data contains superimposed-caption section identification characters, the editing data generation means 44 takes the time code set for the word immediately after the first caption section identification character as the time code of the caption start point. It takes, as the time code of the caption end point, the time code of the video frame immediately preceding the time code set for the word immediately after the second caption section identification character. The editing data generation means 44 then takes the character string enclosed in caption identification characters between the caption section identification characters as the caption string.

When the text data contains silence designation characters, the editing data generation means 44 writes into the editing data that, for just the silent section indicated by those characters, the video is to be edited.
The generated editing data is stored in the editing data storage means 22.

A specific example of the editing data generated by the editing data generation means 44 is now described with reference to FIG. 5. FIG. 5 is a data structure diagram of the editing data generated by the editing data generation means. Here, “number” is the serial number of an editing data entry, assigned consecutively from the beginning of the editing data.
“Edit target” is information identifying what is edited. Here, “VA” indicates that both video and audio are edited, and “V” indicates that only video is edited. “Edit content” is information identifying the editing applied to the edit target. Here, “C” indicates the operation of extracting (cutting) the video between the start point and end point of the edit points. As another operation, fade-out is indicated by “FO”. The start point and end point give the starting and ending time codes of the edit target.
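The fields described above suggest a simple record type. The sketch below is an assumption about how such entries might be held in memory; the field names, the string time codes, and the sample values are hypothetical and are not taken from FIG. 5.

```python
from dataclasses import dataclass


@dataclass
class EditRecord:
    number: int   # serial number, assigned from the top
    target: str   # "VA" = video and audio, "V" = video only
    content: str  # "C" = cut (extract), "FO" = fade-out
    start: str    # start time code
    end: str      # end time code


def build_edit_data(segments):
    """Number (target, content, start, end) tuples consecutively from 1."""
    return [EditRecord(n, *seg) for n, seg in enumerate(segments, start=1)]


records = build_edit_data([
    ("VA", "C", "00:00:10:00", "00:00:25:14"),   # hypothetical values
    ("V", "FO", "00:00:22:15", "00:00:25:14"),
])
```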

To add superimposed-caption information to the editing data, as shown for “number 015”, the “edit target” is set to video “V”, the “edit content” holds the identification character “S”, which indicates the addition of a caption, followed by the caption string, and the display period of the caption is written as the start point and end point.
In this way, the editing data generation means 44 can generate the editing data shown in FIG. 5 solely from the text data edited by the operator.
Returning to FIG. 2, the description of the configuration of the nonlinear editing apparatus 1 continues.

The display means 50 divides the display area of a display device (not shown) into at least a text display area for displaying the text data and a video display area for displaying the video, and displays (outputs) the text data and the video in association with the time codes. Here, the display means 50 includes a text display means 51, a video/audio display means 52, and an editing time axis display means 53.

The text display means 51 displays the text data in the text display area of the display device. The text display means 51 displays the text data in step with the time code of the video played by the video/audio display means 52 and, while the video is playing, scrolls the text data according to the time code. The text display means 51 also displays a cursor C_T (see FIG. 1) at the position in the text data that indicates the current editing position.

The video/audio display means 52 displays the video in the video display area of the display device. It also outputs the audio to an audio output device such as a speaker (not shown). While video and audio are being played, the video/audio display means 52 notifies the text display means 51 of the time code.

In addition to displaying the video, the video/audio display means 52 displays operation buttons for video playback as icons. For example, as shown in FIG. 1, it displays operation buttons such as “rewind”, “play”, “fast forward”, and “stop”, as well as setting buttons for setting the start point and end point; when one of these buttons is pressed with the mouse or the like, the video editing means 42 edits the video.
The video/audio display means 52 also displays a timeline associated with the time codes of the video, and displays a cursor C_M (see FIG. 1) at the position on the time axis corresponding to the scene (frame) of the currently displayed video.

The editing time axis display means 53 displays, in the editing time axis display area of the display device, a timeline that visualizes along the time axis the relationship between time and the video and audio material data being edited, the edited video and audio, and any added superimposed captions. The editing time axis display means 53 refers to the video, audio, and editing data stored in the editing data storage means 22 and calculates the positions on the time axis of each start point and end point of the video and audio relative to the entire time range of the video and audio, and thereby displays the video and audio timelines in the editing time axis display area L, as shown in FIG. 1.
The editing time axis display means 53 also displays a cursor C_L (see FIG. 1) at the position on the time axis corresponding to the scene (frame) of the currently displayed video.

The output means 60 outputs the editing data and other data that result from editing the video and audio. Here, the output means 60 includes an editing data output means 61 and a text data output means 62.

The editing data output means 61 outputs the editing data stored in the editing data storage means 22 as the editing result. This editing data is used as the data for actually editing the video and audio.

The text data output means 62 outputs the text data stored in the editing data storage means 22 as the editing result. This text data may include the time codes, or it may consist of the character strings alone without time codes. The operator can thus check the editing result as text data in addition to video and audio, which simplifies the work of verifying the edit.
This text data can also be put to other uses, such as metadata or subtitle data for the video and audio.

With the nonlinear editing apparatus 1 configured as described above, editing data for editing video and audio can be generated in conjunction with the text data. Editing work that used to take a long time because it was done while viewing the video and listening to the audio can thus be carried out through the simple task of editing text data.
The nonlinear editing apparatus 1 can be realized by running, on a general-purpose computer, a nonlinear editing program that causes the computer to function as each of the means described above.

[Operation of the nonlinear editing apparatus]
Next, the operation of the nonlinear editing apparatus 1 is described with reference to FIGS. 6 to 8. The operation is described in three parts: the video/audio input operation, the editing operation, and the output operation. FIG. 6 is a flowchart showing the video/audio input operation of the nonlinear editing apparatus. FIG. 7 is a flowchart showing the editing operation of the nonlinear editing apparatus. FIG. 8 is a flowchart showing the output operation of the nonlinear editing apparatus.

(Input operation)
First, the input operation of the nonlinear editing apparatus 1 is described with reference to FIG. 6 (and FIG. 2 as appropriate).
The nonlinear editing apparatus 1 first receives the video and audio to be edited from outside through the video/audio input means 31 of the input means 30 and stores them in the editing data storage means 22 (step S1).

The nonlinear editing apparatus 1 then uses the speech recognition means 321 of the text association means 32 to convert the audio into text data associated with the time codes of the video and audio.
That is, the analysis means 321a applies a window function to the waveform of the audio stored in the editing data storage means 22 to extract framed waveforms, and performs frequency analysis on each waveform to extract feature quantities (step S2). The analysis means 321a then attaches to each feature quantity the time code carried by the audio at the point where the feature quantity was extracted, thereby associating the feature quantity with a time code (step S3).
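Steps S2 and S3 can be sketched as windowed framing with a timestamp kept for every frame. The sketch uses a Hamming window and log energy as a simple stand-in feature; the source describes frequency analysis but does not name the window, the feature, or the frame sizes, so all of these are assumptions.

```python
import math


def frame_features(samples, sample_rate, frame_len, hop):
    """Return (time in seconds, log energy) for each Hamming-windowed frame."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    features = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [s * w for s, w in zip(samples[start:start + frame_len], window)]
        energy = sum(x * x for x in frame)
        # Keep the frame's start time with the feature, as in step S3.
        features.append((start / sample_rate, math.log(energy + 1e-12)))
    return features


# 0.1 s of a 440 Hz tone at 8 kHz; 25 ms frames with a 10 ms hop (illustrative values).
tone = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(800)]
feats = frame_features(tone, sample_rate=8000, frame_len=200, hop=80)
```

A real recognizer would replace the log energy with spectral features, but the pairing of each feature with its time is the part that later allows time codes to be attached to words.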

The nonlinear editing apparatus 1 then uses the similarity calculation means 321b to calculate the similarity (probability value) between the feature quantities extracted in step S3 and the phonemes modeled by the acoustic model stored in the speech recognition data storage means 21 (step S4).

Further, the nonlinear editing apparatus 1 uses the search means 321c to search the language model stored in the speech recognition data storage means 21 for candidate output sequences that can be connected, takes the output sequence with the maximum probability value as the recognition result (text data), and associates a time code with each word boundary based on the time codes associated in step S3 (step S5).

The nonlinear editing apparatus 1 then uses the text association means 32 to store the text data, with a time code attached at each word boundary, in the editing data storage means 22 (step S6).
Through the above operation, the nonlinear editing apparatus 1 generates text data in which each word is associated with a time code of the video and audio.

(Editing operation)
Next, the editing operation of the nonlinear editing apparatus 1 is described with reference to FIG. 7 (and FIGS. 1 and 2 as appropriate). The case described here is one in which the operator edits the text data in the text display area T.

First, the nonlinear editing apparatus 1 uses the text editing means 41 to analyze the operator's operation entered through an input device such as a mouse or keyboard (not shown) (step S11).

If the operation performed by the operator moves the cursor C_T in the text display area T (Yes in step S12), the nonlinear editing apparatus 1 moves the playback position of the video and audio to follow the movement of the cursor C_T (step S13).
At this time, the text editing means 41 reads from the editing data storage means 22 the time code of the text data at the new position of the cursor C_T, and the video editing means 42 moves the video displayed in the video display area M to the scene corresponding to that time code. The editing time axis display means 53 also moves the cursor C_L displayed in the editing time axis display area L to the position of the scene (frame) corresponding to that time code.

If the operation performed by the operator is an editing operation on the text data in the text display area T, that is, a “copy”, “cut”, or “paste” operation or the setting of edit points (start point and end point) (Yes in step S14), the nonlinear editing apparatus 1 edits the video and audio in accordance with the rearrangement or deletion of parts of the text data (step S15).
At this time, the text editing means 41 reads the time codes of the edited text data from the editing data storage means 22, and the video editing means 42 edits the video according to those time codes and then displays the edited video in the video display area M. The audio editing means 43 likewise edits the audio according to the time codes of the text data. The editing data generation means 44 then adds to the editing data, as edit points (start point and end point), the time codes corresponding to the positions at which edit identification characters are inserted in the text data.

If the operation performed by the operator changes a transition, that is, sets in the text data the characters identifying “fade”, “wipe”, “dissolve”, or the like together with a duration (Yes in step S16), the nonlinear editing apparatus 1 adds the type and duration of the transition to the editing data (step S17).
At this time, when characters indicating a video effect (effect identification characters) are entered into the text data in the text editing means 41, the editing data generation means 44 adds to the edit code a code indicating that the specified effect is to be applied from the time code corresponding to the position at which the effect identification characters are inserted.

If the operation performed by the operator adds a superimposed caption, that is, inserts into the text data a character string enclosed in caption identification characters (a caption string) (Yes in step S18), the nonlinear editing apparatus 1 adds the caption string and its display time to the editing data (step S19).
At this time, when a caption string enclosed in caption identification characters is entered into the text data in the text editing means 41, the editing data generation means 44 adds to the edit code a code indicating that the caption string is to be displayed during the time section defined by the time codes corresponding to the two positions at which the caption identification characters are inserted.

If the operation performed by the operator requests display of time information for the text data (Yes in step S20), the nonlinear editing apparatus 1 displays the corresponding time on the display screen based on the time codes (step S21).
At this time, when an arbitrary position in the text data is indicated with an input device such as a mouse, the time display means 411 displays the time code corresponding to that position on the display screen. Alternatively, when an arbitrary character string of the text data is dragged with the mouse or the like, the time display means 411 calculates the time corresponding to the character string in the dragged region from the time codes associated with the text data, and displays the calculated time, that is, the playback time, on the display screen.

When the operation performed by the operator is an operation for searching for a keyword (Yes in step S22), the nonlinear editing apparatus 1 searches for the keyword in the text data (step S23).
At this time, the keyword search means 412 of the nonlinear editing apparatus 1 searches for the keyword in the text data stored in the editing data storage means 22.
If the keyword search succeeds (Yes in step S24), the keyword search means 412 sets the destination of the cursor C_T to the position of the keyword that was found, and the process returns to step S12, so that the keyword is displayed in the text display area T of the display screen D.
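Steps S23 and S24 can be sketched as follows, assuming that the stored text data is available as a plain string and that the cursor C_T is represented by a character offset. The function names and the offset representation are assumptions made for this illustration.

```python
def find_keyword(text, keyword, start=0):
    """Return the offset of the first match at or after start, or None if absent."""
    if not keyword:
        return None
    pos = text.find(keyword, start)
    return pos if pos >= 0 else None


def search_and_move_cursor(text, keyword, cursor):
    """Return (new_cursor, found); the cursor is unchanged when the search fails."""
    hit = find_keyword(text, keyword, cursor)
    if hit is None:
        return cursor, False
    return hit, True


# Hypothetical sample text.
print(search_and_move_cursor("the flower cell divides", "cell", 0))
```

On success the caller would redraw the text display area around the new cursor position, which corresponds to the return to step S12.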

Then, the nonlinear editing apparatus 1 determines whether the end of the editing operation has been instructed (step S25). If the end has not been instructed (No in step S25), the process returns to step S11 and the operation continues. If the end has been instructed (Yes in step S25), the editing ends. The instruction to end the editing operation is given when the control means 10 displays a menu screen (not shown) and the operator selects the end of editing.

With the above operation, the nonlinear editing apparatus 1 can generate editing data in conjunction with the text data.
Although the operation of generating editing data in conjunction with the text data has been described here, the video/audio and the text data are associated with each other by time codes. Editing the video/audio therefore causes the text data to be edited accordingly and the editing data to be generated.

(Output operation)
Next, the output operation of the nonlinear editing apparatus 1 will be described with reference to FIG. 8 (and FIG. 2 as appropriate). Here, it is assumed that the control means 10 displays a menu screen (not shown) and the operator selects the desired data.

First, the nonlinear editing apparatus 1 analyzes, by the output means 60, the instruction selected by the operator (step S31).
If the instruction selected by the operator is the output of editing data (Yes in step S32), the nonlinear editing apparatus 1 reads out, by the editing data output means 61, the editing data stored in the editing data storage means 22 and outputs it (step S33).

If the instruction selected by the operator is the output of text data without time information (Yes in step S34), the nonlinear editing apparatus 1 extracts, by the text data output means 62, only the character information from the text data stored in the editing data storage means 22 and outputs it (step S35).
If the instruction selected by the operator is the output of text data with time information (Yes in step S36), the nonlinear editing apparatus 1 reads out, by the text data output means 62, the text data stored in the editing data storage means 22 as it is and outputs it (step S37).
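The two text output modes of steps S35 and S37 can be sketched as below. The sketch assumes that the stored text data is a sequence of (time code, word) pairs and uses an angle-bracket notation for the serialized time codes; both are hypothetical choices made for illustration and are not the storage format defined in the specification.

```python
def strip_time_codes(entries):
    """Step S35 (sketch): keep only the character information."""
    return "".join(word for _, word in entries)


def with_time_codes(entries):
    """Step S37 (sketch): emit each word preceded by its time code."""
    return "".join(f"<{tc}>{word}" for tc, word in entries)


# Hypothetical sample data with made-up time codes.
entries = [("00:00:01:00", "flower"), ("00:00:01:12", " cell")]
print(strip_time_codes(entries))
print(with_time_codes(entries))
```

The stripped form suits manuscript checking, while the time-coded form is the one that could be reused as metadata or subtitle data.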

Then, the nonlinear editing apparatus 1 determines whether the end of the output operation has been instructed (step S38). If the end has not been instructed (No in step S38), the process returns to step S31 and the operation continues. If the end has been instructed (Yes in step S38), the output operation ends. The instruction to end the output operation is given when the control means 10 displays a menu screen (not shown) and the operator selects the end of output.

With the above operation, the nonlinear editing apparatus 1 can output text data in addition to editing data. Because the operator can select whether or not time codes are added when the text data is output, the text data can be used not only for checking the editing data but also as metadata or subtitle data.
The configuration and operation of the nonlinear editing apparatus 1 have been described above, but the present invention is not limited to this. The configurations of other nonlinear editing apparatuses are described below.

<< Second Embodiment >>
The configuration of a nonlinear editing apparatus according to the second embodiment of the present invention will be described with reference to FIG. 9. FIG. 9 is a block diagram showing the configuration of the nonlinear editing apparatus according to the second embodiment of the present invention.
The nonlinear editing apparatus 1 described with reference to FIG. 2 generates text data with time codes added at the time video/audio is input and speech recognition is performed. In contrast, the nonlinear editing apparatus 1B sets the time codes after the text data has been generated by speech recognition.
That is, the nonlinear editing apparatus 1B differs from the nonlinear editing apparatus 1 described with reference to FIG. 2 only in that it has the input means 30B in place of the input means 30; the other components are the same. The components other than the input means 30B are therefore denoted by the same reference numerals as in the nonlinear editing apparatus 1 of FIG. 2, and their description is omitted.

The input means 30B includes a video/audio input means 31 and a text association means 32B. The video/audio input means 31 has the same configuration as the video/audio input means 31 of the nonlinear editing apparatus 1 (FIG. 2), so its description is omitted.

The text association means 32B generates text data by performing speech recognition on the audio and associates the video/audio with the text data by setting time codes for that text data. Here, the text association means 32B consists of a speech recognition means 321B and a time allocation means 322.

The speech recognition means 321B converts speech into text data and performs speech recognition using the language model and acoustic model stored in the speech recognition data storage means 21. Unlike the speech recognition means 321 described with reference to FIG. 2, the speech recognition means 321B is a general-purpose speech recognition means that has no function for adding time codes to the text data during speech recognition.

The time allocation means 322 sets time information (a time code) at each word break in the text data generated by the speech recognition means 321B. Here, the time allocation means 322 receives a designated position in the text data, which is delimited in units of words, together with the time code to be allocated to that position, and sets the time code at the designated position. Once time codes have been set at two or more positions in the text data, the time allocation means 322 sets time codes at the other word-delimited positions as well.

The operation by which the time allocation means 322 sets time codes in the text data will now be described with reference to FIG. 10 (and FIG. 9 as appropriate). FIG. 10 is an explanatory diagram for explaining the time code setting operation performed by the time allocation means.

FIG. 10(a) shows the content of the text data after speech recognition by the speech recognition means 321B. The text data is delimited word by word (T1 to T5).
The time allocation means 322 receives a designated position in the text data and a time code, and sets the time code at the designated position. In the example of FIG. 10, when the first time code is set between "花" ("flower") and "の" ("no") (T1) in the text data of FIG. 10(a), the time allocation means 322 adds the time code (TA) at position T1, as shown in FIG. 10(b).

Further, when the second time code is set between "細胞" ("cell") and "から" ("from") (T3) in the text data, the time allocation means 322 adds the time code (TB) at position T3, as shown in FIG. 10(c).
At this time, based on the time codes (TA and TB) set at T1 and T3, the time allocation means 322 adds (automatically allocates) time codes (TC, TD and TE) at the other positions between words (T2, T4 and T5). For example, the time allocation means 322 sets the time codes (TC, TD and TE) by linearly interpolating the time codes that have already been set (TA and TB).

The linear interpolation performed by the time allocation means 322 may, as a simple approach, use the number of characters in each word as its basis. If the number of phonemes in each word is known from the speech recognition means 321B, the interpolation may instead be based on the number of phonemes in each word. If the time length of each word is known from the speech recognition means 321B, the interpolation may be based on that time length.
When a time code is set again at a position where a time code has already been set, the time allocation means 322 reallocates the time codes. This makes it possible to improve the accuracy of the time codes.
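The automatic allocation described above can be sketched as a weighted linear interpolation. The sketch below rests on several assumptions: time codes are plain seconds, boundary `i` denotes the word break after word `i` (so T1 is boundary 0), the weights may be character counts, phoneme counts or durations, and positions lying outside the operator-set time codes (T4 and T5 in FIG. 10) are filled by extending the nearest segment at the same rate. The function name and all numeric values are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def allocate_time_codes(weights, anchors):
    """Fill in a time code for every word break.

    weights[i] -- weight of word i (e.g. its character count); must be positive
    anchors    -- {boundary_index: seconds} set by the operator, at least two
    """
    if len(anchors) < 2:
        raise ValueError("at least two time codes must be set")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    if any(not 0 <= k < len(weights) for k in anchors):
        raise ValueError("anchor index out of range")
    pos = list(accumulate(weights))  # weighted position of each boundary
    keys = sorted(anchors)
    result = []
    for i in range(len(weights)):
        if i in anchors:
            result.append(anchors[i])
            continue
        # Choose the anchor pair around boundary i; the first or last pair
        # is reused for boundaries outside the anchored range (assumption).
        j = min(max(bisect_right(keys, i), 1), len(keys) - 1)
        a, b = keys[j - 1], keys[j]
        rate = (anchors[b] - anchors[a]) / (pos[b] - pos[a])
        result.append(anchors[a] + (pos[i] - pos[a]) * rate)
    return result


# Five words with made-up character counts; TA = 10 s at T1, TB = 13 s at T3.
print(allocate_time_codes([1, 1, 2, 2, 3], {0: 10.0, 2: 13.0}))
```

Setting a further time code simply adds an entry to `anchors` and the function is run again, which mirrors the reallocation that improves accuracy as more positions are fixed.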

The configuration and operation of the nonlinear editing apparatus 1B have been described above. The time allocation means 322 may also be incorporated into the nonlinear editing apparatus 1 described with reference to FIG. 2, so that the apparatus can be switched as appropriate between generating text data with time codes added by speech recognition and adding time codes set by the operator to the text data.

<< Third Embodiment >>
Next, the configuration of a nonlinear editing apparatus according to the third embodiment of the present invention will be described with reference to FIG. 11. FIG. 11 is a block diagram showing the configuration of the nonlinear editing apparatus according to the third embodiment of the present invention.
In the nonlinear editing apparatus 1 (see FIG. 2) and the nonlinear editing apparatus 1B (see FIG. 9), the text data is generated by performing speech recognition on the input audio. Instead, a manuscript or the like that has been digitized in advance may be input.
That is, the nonlinear editing apparatus 1C differs from the nonlinear editing apparatuses 1 and 1B described with reference to FIGS. 2 and 9 only in that it has the input means 30C in place of the input means 30 and 30B; the other components are the same. The components other than the input means 30C are therefore denoted by the same reference numerals as in the nonlinear editing apparatuses 1 and 1B of FIGS. 2 and 9, and their description is omitted.

The text data input means 33 receives, from outside, text data such as a manuscript that has been digitized in advance. The text data input means 33 stores the input text data in the editing data storage means 22.
The text association means 32C associates the video with the text data by setting time codes for the text data input by the text data input means 33. Here, the text association means 32C includes the time allocation means 322. The time allocation means 322 is the same as that already described with reference to FIG. 9, so its description is omitted.
Configuring the nonlinear editing apparatus 1C in this way also allows an inexpensive configuration that does not perform speech recognition.

FIG. 1 is an explanatory diagram for explaining the outline of the nonlinear editing apparatus according to the present invention.
FIG. 2 is a block diagram showing the configuration of the nonlinear editing apparatus according to the first embodiment of the present invention.
FIG. 3 is a data structure diagram showing the structure of text data to which time codes have been added.
FIG. 4 is an explanatory diagram for explaining the control characters.
FIG. 5 is a diagram showing an example of the editing data generated by the editing data generation means.
FIG. 6 is a flowchart showing the video/audio input operation of the nonlinear editing apparatus.
FIG. 7 is a flowchart showing the editing operation of the nonlinear editing apparatus.
FIG. 8 is a flowchart showing the output operation of the nonlinear editing apparatus.
FIG. 9 is a block diagram showing the configuration of the nonlinear editing apparatus according to the second embodiment of the present invention.
FIG. 10 is an explanatory diagram for explaining the time code setting operation performed by the time allocation means.
FIG. 11 is a block diagram showing the configuration of the nonlinear editing apparatus according to the third embodiment of the present invention.

Explanation of symbols

1, 1B, 1C Nonlinear editing apparatus
10 Control means
20 Storage means
30 Input means
31 Video/audio input means
32 Text association means
321 Speech recognition means
33 Text data input means
40 Editing means
41 Text editing means
42 Video editing means
43 Audio editing means
44 Editing data generation means
50 Display means
60 Output means

Claims

A nonlinear editing apparatus that generates, from video and audio to which time information has been added, editing data for editing the video and audio, the apparatus comprising:
text association means for performing speech recognition on the audio and generating text data associated with the time information;
display means for dividing a display area into a text display area for displaying the text data and a video display area for displaying the video, and for displaying the text data and the video on a display device in association with each other by the time information;
text editing means for editing the text data based on an operator's instruction; and
editing data generation means for generating the editing data based on the time information corresponding to the text data edited by the text editing means.

The nonlinear editing apparatus according to claim 1, wherein the text association means generates the text data by associating the time information with each break between words recognized by the speech recognition, and
the text editing means edits the text data in units of words.

A nonlinear editing apparatus that generates, from video to which time information has been added, editing data for editing the video, the apparatus comprising:
text data input means for inputting text data corresponding to the video;
text association means for associating the time information with each word break in the text data input by the text data input means, based on an operator's instruction;
display means for dividing a display area into a text display area for displaying the text data and a video display area for displaying the video, and for displaying the text data and the video on a display device in association with each other by the time information;
text editing means for editing the text data based on an instruction from the operator; and
editing data generation means for generating the editing data based on the time information corresponding to the text data edited by the text editing means.

The nonlinear editing apparatus according to claim 1, wherein the editing data generation means recognizes a start point and an end point of editing that are written in the text data and identified by unique control characters, and generates the editing data based on the time information corresponding to the start point and the end point.

The nonlinear editing apparatus according to claim 1, wherein the editing data generation means recognizes a superimposed-caption character string that is written in the text data and identified by unique control characters, and adds playback information for the caption character string to the editing data based on the time information corresponding to the insertion position of the caption character string.

The nonlinear editing apparatus according to claim 5, further comprising video editing means for editing the video,
wherein the text editing means edits the text data and the time information associated with the text data based on the time information corresponding to the video edited by the video editing means.

A nonlinear editing program that, in order to generate editing data for editing video and audio from the video and audio to which time information has been added, causes a computer to function as:
text association means for performing speech recognition on the audio and generating text data associated with the time information;
display means for dividing a display area into a text display area for displaying the text data and a video display area for displaying the video, and for displaying the text data and the video on a display device in association with each other by the time information;
text editing means for editing the text data based on an instruction from the operator; and
editing data generation means for generating editing data for editing the video based on the time information corresponding to the text data edited by the text editing means.