JP2014222290A

JP2014222290A - Minute recording device, minute recording method, and program

Info

Publication number: JP2014222290A
Application number: JP2013101711A
Authority: JP
Inventors: 武士松村; Takeshi Matsumura
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2013-05-13
Filing date: 2013-05-13
Publication date: 2014-11-27
Anticipated expiration: 2033-05-13
Also published as: JP6280312B2

Abstract

PROBLEM TO BE SOLVED: To associate conference audio or conference video more accurately with memos recorded as minutes.
SOLUTION: Text and audio are associated with each other on the basis of the input time of each memo taken during a conference, the utterance time of audio recorded in synchronization with video during the conference, and the dissimilarity between the phonemes obtained by converting the memo text and the phonemes obtained by speech recognition of the recorded audio. When a text recorded as a conference memo is selected, the recorded video or audio is played back from the utterance time of the audio associated with that text.

Description

The present invention relates to a minutes recording apparatus, a minutes recording method, and a program, and is characterized in particular by being able to correct the time lag between the audio contained in a moving image and the text information entered through a text input means.

In general, the content of discussions held in meetings is an important asset for an organization, and minutes are often required after a meeting. Traditionally a note-taker attended the meeting and recorded its content by hand, which placed a burden on the person preparing the minutes. The following technique has been proposed to support the creation of minutes.
In the device described in Patent Document 1, a memo editing unit associates each input memo describing the progress of the conference with the conference video, using the date and time at which the memo was recorded and the date and time information held in association with the conference video. The association between the progress of the conference and the conference video is retained on the basis of the progress of the conference and the audio recorded during it. The minutes creation unit 3 creates the minutes in this associated state, and the created minutes can be viewed on the minutes display unit 5. In this way, links are established between the content written in the minutes and the conference video, so that a reader can later confirm which scene of the conference a given passage corresponds to.

Patent Document 1: JP 2006-268800 A

However, if the memo editing unit merely associates date and time information with the conference video, as in the device described in Patent Document 1, it is difficult to associate the text recorded as a memo with the video accurately. This is because the date and time at which the text was recorded by the memo editing unit differs from the date and time of the part of the conference video that contains the remark corresponding to that text. When the user follows a link from a memo to view the associated conference video, this difference means that clicking the memo of interest displays video from a shifted time that does not correspond to the memo.

The present invention has been made in view of the above conventional example, and provides a minutes recording apparatus and a minutes recording method that can associate text and video with high accuracy by associating them on the basis of their respective contents.

The present invention has the following configuration. That is, a minutes recording apparatus comprising:
holding means for holding recorded audio data segment by segment;
editing means for storing input text information segment by segment; and
association means for associating each segment of the text information with a segment of the audio data on the basis of the degree of difference between their respective phonemes, and for storing the association.
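As an illustration of associating a text segment with an audio segment by "degree of difference between phonemes", the following is a minimal sketch. This part of the description does not say how the difference is measured, so the use of Levenshtein (edit) distance, the length normalization, and the names `phoneme_distance` and `best_match` are all assumptions made for illustration only.

```python
from typing import List, Sequence


def phoneme_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences (assumed measure)."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete a phoneme from a
                           cur[j - 1] + 1,              # insert a phoneme from b
                           prev[j - 1] + (pa != pb)))   # substitute or match
        prev = cur
    return prev[-1]


def best_match(text_phonemes: Sequence[str],
               audio_segments: List[Sequence[str]]) -> int:
    """Index of the audio segment whose phonemes differ least from the text."""
    def normalized(k: int) -> float:
        seg = audio_segments[k]
        return phoneme_distance(text_phonemes, seg) / max(len(text_phonemes), len(seg), 1)
    return min(range(len(audio_segments)), key=normalized)


memo = ["k", "a", "i", "g", "i"]
segments = [["y", "o", "s", "a", "n"], ["k", "a", "i", "g", "i"], ["k", "a", "g", "i"]]
print(best_match(memo, segments))
```

Normalizing by length keeps long segments from being penalized merely for being long; other dissimilarity measures could be substituted without changing the overall idea.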

According to the present invention, text and video can be associated with high accuracy by associating them on the basis of their respective contents. Furthermore, this accurate association allows the text and the video to be displayed in synchronization.

A diagram showing an example device configuration of the system.
A diagram showing the layout of the text input screen of the system.
A diagram showing the data management structure of the system and a configuration example.
A diagram showing the layout of the minutes display screen of the system.
A diagram showing the overall configuration of the system.
A diagram showing the configuration of the synchronization control means of the system.
A block diagram showing a configuration example of the speech recognition unit.
A diagram showing examples of language models.
A schematic diagram showing the matching process in the speech matching unit.
A schematic diagram showing the process of generating a text-converted phoneme string in the text-converted phoneme string generation unit.
A diagram showing the processing flow of the synchronization control means of the system.
A diagram showing an example of a processing result of the synchronization control means.
A diagram showing the processing flow of the matching unit of the synchronization control means of the system.
A diagram showing a configuration example of the display unit of the system.
A diagram showing an example of video playback on the display unit of the system.
A flowchart of the procedure for viewing the minutes.

DESCRIPTION OF EMBODIMENTS: Embodiments for carrying out the present invention are described in detail below with reference to FIGS. 1 to 15.

[First Embodiment]
The first embodiment of the present invention is a minutes creation and playback apparatus that records input character data with the current time information added, records audio at the same time, and displays the recorded audio and character data in chronological correspondence on the basis of their time information. The time information is information for identifying a date and time, such as date and time information.

Memos taken during a meeting by the person in charge of the minutes and by the participants are the best material for following the topics discussed. In the present embodiment, the entry time (TK) at which each sentence of a memo was input is recorded sentence by sentence. Completion of the input of one sentence can be determined, for example, by the input of a line feed code. Before the entry time at which a memo was recorded, the topic corresponding to that memo should have been under discussion in the meeting. The minutes creation and playback apparatus of the present embodiment also records the audio and video of the meeting with time information attached, so that by designating the part of a memo of interest, the user can easily find the audio or video of the relevant passage, linked through the recorded times. In the present embodiment, the information in which video data and memos are associated through their respective recording times is called the digital minutes.

<Configuration of the minutes creation and playback apparatus>
FIG. 1 is a conceptual diagram showing the configuration of the minutes creation and playback apparatus according to the present embodiment. The minutes recording and playback apparatus is configured by connecting a client terminal to input devices for video and the like. In FIG. 1, the client terminal 11 is a personal computer, a tablet terminal, or the like, and is a terminal for creating and playing back minutes. In the present embodiment, the client terminal 11 includes a CPU, a RAM, a ROM, an HDD, and so on. The procedures described in the present embodiment and shown in the flowcharts are stored in one of the storage means of the client terminal 11 (RAM, ROM, or HDD), loaded into the RAM, and executed by the CPU. The microphone 12 is connected to the client terminal 11, converts sound into an audio signal, and inputs it to the client terminal 11. The still camera 13 is a camera for still image capture connected to the client terminal 11. The video camera 14 is a camera for moving image capture connected to the client terminal 11. These cameras input the digital data of captured still images and moving images, respectively, to the client terminal 11. The digital data of still images and of moving images are referred to as image data and video data, respectively. Video data contains audio data, obtained by digitizing the audio signal, in synchronization with the video. Audio data can also be integrated into still image data; when such still image data is played back, the still image is displayed and the audio is played. The video camera 14 may also be a digital camera, a portable terminal, or the like that has a video capture function. Time information is recorded in the video data, and it must be possible to identify the date and time of, for example, any given frame. For example, if the date and time at which shooting started is recorded in the video data and a frame number is recorded for each frame, the date and time of any frame of the video can be identified. Conversely, a frame of the video can be identified from a date and time. Audio and text can be associated in the same way as video and text. In that case, the audio data must contain information from which the date and time of any point in the audio can be identified.
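The conversion between frame numbers and dates and times described above can be sketched as follows. This is a minimal illustration that assumes a constant frame rate and 0-based frame numbers; the function names and the 30 fps value are hypothetical and do not come from the description.

```python
from datetime import datetime, timedelta


def frame_to_datetime(start: datetime, frame_number: int, fps: float) -> datetime:
    """Date and time at which the given frame (0-based) was captured."""
    return start + timedelta(seconds=frame_number / fps)


def datetime_to_frame(start: datetime, when: datetime, fps: float) -> int:
    """Frame number (0-based) that corresponds to the given date and time."""
    elapsed = (when - start).total_seconds()
    if elapsed < 0:
        raise ValueError("time precedes the start of the recording")
    return int(elapsed * fps)


# Assumed example values: shooting started at 12:34:56, 30 frames per second.
shooting_start = datetime(2008, 12, 24, 12, 34, 56)
print(frame_to_datetime(shooting_start, 900, 30.0))
print(datetime_to_frame(shooting_start, datetime(2008, 12, 24, 12, 35, 26), 30.0))
```

With a variable frame rate, a per-frame timestamp table would be needed instead of this arithmetic.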

In the configuration shown in FIG. 1(a), character data (text) is input from the keyboard of the client terminal 11. The text entry time (TK), which is the time information at the moment of input, is added to the input character data, and the result is recorded as text. The moment of input here may be taken per input character, or may be the moment at which a line feed is detected. The microphone 12 inputs sound to the client terminal 11 as audio data. Time information at the moment of input is added to the input audio data, which is then recorded in a storage unit such as the HDD. Audio data can also be added later with time information attached. When the minutes are displayed, the recorded text is displayed in chronological order and the recorded audio is played back. When a part of the displayed text is designated, the audio corresponding to that part is played.
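A minimal sketch of stamping each memo line with its entry time (TK) when a line feed is detected is shown below. The class name `MemoRecorder`, its methods, and the injectable clock are hypothetical. Skipping blank lines is a design choice of this sketch, as is stamping at the line feed rather than per character.

```python
from datetime import datetime
from typing import Callable, List, Tuple


class MemoRecorder:
    """Collects typed characters and stamps each completed line with TK."""

    def __init__(self, clock: Callable[[], datetime] = datetime.now) -> None:
        self._clock = clock            # injectable so the sketch can be tested
        self._buffer = ""
        self.entries: List[Tuple[datetime, str]] = []

    def type_text(self, chars: str) -> None:
        for ch in chars:
            if ch == "\n":             # a line feed completes one memo line
                self._commit()
            else:
                self._buffer += ch

    def _commit(self) -> None:
        if self._buffer:               # blank lines receive no entry time
            self.entries.append((self._clock(), self._buffer))
        self._buffer = ""


# Usage with a fake clock that returns two fixed times.
times = iter([datetime(2008, 12, 24, 12, 40, 0), datetime(2008, 12, 24, 12, 45, 30)])
recorder = MemoRecorder(clock=lambda: next(times))
recorder.type_text("Budget approved\n\nNext review in May\n")
print(recorder.entries)
```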

In the configuration shown in FIG. 1(b), moving image data, that is, video data, is input from the video camera 14 to the client terminal 11. Time information at the time of recording is added to the input video data, which is then recorded in a storage unit such as the HDD. As described above, the time information may be added frame by frame, for example. During playback, the recorded moving image data is played back and output. The video data contains audio data synchronized with the video.

In the configuration shown in FIG. 1(c), a plurality of client terminals 11 are connected to the server 1 via the network 10. Data input from the separate client terminals is processed by the server 1, and the moving image data is played back and output on each client. This allows conference video and minutes to be shared among multiple users.

The present embodiment shows an example in which processing is performed by the CPU and the recording unit of the client terminal 11, but in a configuration that includes a server, part of the processing may be moved to the CPU and recording unit of the server. The recording unit is a functional unit that records text in association with video data or audio data. It can be realized, for example, by a dedicated processor separate from the main CPU, or by the CPU executing a program for recording.

<Minutes creation process>
FIG. 5 is a functional block diagram of the minutes recording and playback apparatus of the present embodiment. The user inputs a memo from the text input unit 51, such as the keyboard of the client terminal 11. The entry time (TK) is added to the input memo by the timer 53, and the memo is recorded by the text recording unit 52. Audio, still images, and moving images input from the audio/video input unit 54, such as the microphone 12, the still camera 13, or the video camera 14, have time information added by the timer and are recorded by the audio/video recording unit 55. During playback, the synchronous display control unit 56 synchronizes the data, and the corresponding parts are displayed on the display unit 57. The synchronization control method used by the synchronous display control unit 56 is described later.

Next, the operation during minutes creation is described in detail with reference to FIGS. 2 and 3. FIG. 2 shows the display screen of the minutes creation and playback apparatus during minutes creation. This screen is displayed on the display of the client terminal 11. FIG. 3 shows an example of the minutes data recorded by the minutes creation and playback apparatus.

First, the microphone 12 is connected to the client terminal 11. If the client terminal 11 has a built-in microphone, that may of course be used as the microphone 12. If the audio is recorded with a device unrelated to the client terminal 11, such as an IC recorder, and the audio data is added afterwards, a microphone 12 connected to the client terminal 11 need not be used. When the operator starts the minutes recording program, an indicator unit 21, a video operation unit 22, a format setting unit 23, and a text input unit 24 are displayed on the screen, as shown in FIG. 2(a). Recording starts when the note-taker presses the recording start button on the screen. An internal timer starts at the same time as the recording, and the recording start time and the current time are measured. For the purposes of the present embodiment, the timer 53 may be a real-time clock that keeps actual time, in which case the timer does not need to be started.

In this state, characters are input into the text input unit 24. When a key is pressed with the cursor placed in the text input unit, the input start position is displayed at the left of the text input unit 24. When a line feed is input during entry, the cursor moves to a new line and the entry time (TK) at that moment is recorded. When further text is entered in that state, the time of that input is recorded in turn. FIG. 2(b) shows the screen after a certain amount of text has been entered; in this example, five sentences have been input. Pressing the stop button of the video operation unit 22 ends the recording and also stops the timer. If "Save" is then selected from the menu, the user is prompted for a name for the minutes, and the recorded video data file and text file are saved. The files are saved, for example, in the structure shown in FIG. 3(a). FIG. 3(a) is an example of minutes data recorded on December 24, 2008 and saved under the name "Conference1". The data is recorded in a per-minutes folder created inside a folder named Data. In this example, each minutes folder is named by concatenating the year, month, and day of recording. When several sets of minutes are recorded on the same day, a serial number or a time may be appended after the date, for example. Each minutes folder contains the following files.
project.dat is a file that records, among other things, the file structure of the minutes. Its format may be a proprietary binary format or a text format (XML, JSON, a proprietary format, and so on). The mp3 file "20081224123456.mp3" is the recorded audio. In this example the file name consists of the year, month, day, hour, minute, and second at which recording started, but this convention is not mandatory. The format may also be something other than mp3, such as wav. The recording start time may instead be recorded as a file attribute, for example. The text file "memo.txt" holds the text entered in the input field and is saved as the minutes memo data 302 in the format shown in FIG. 3(c). That is, a text ID 330 is assigned to each text, and for each text ID the entry time 331 is recorded in association with the entered text 332 or text 333. The text 332 is a summary memo, and the text 333 is a detailed memo belonging to a summary memo (referred to simply as a memo). The distinction between summary memos and detailed memos is based, for example, on who entered them. Here, memos written by the person in charge of preparing the minutes during the meeting are summary memos, and memos written by other attendees are memos.
Saving the memos in plain text in this way allows them to be edited later with another editor. If the portability of the data is not a concern, another format such as binary may be used.
The above file structure is only an example, and all the files in a minutes folder may be combined into a single file. A single memo may also be divided into multiple files, for example by treating a full stop rather than a line feed code as the end of a text entry, with an entry time associated with each text file. Doing so reduces the time difference between the start and the completion of input for long memos. The text may also take other forms.
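The memo record described above (text ID, entry time, and either a summary memo or a detailed memo) can be sketched as a small data structure with a plain-text serialization. The description only says that memo.txt is plain text in the format of FIG. 3(c), so the tab-separated layout, the "S"/"M" kind marker, and all names below are assumptions made for illustration.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class MemoEntry:
    text_id: int            # text ID 330
    entry_time: datetime    # entry time 331 (TK)
    text: str               # summary memo 332 or detailed memo 333
    is_summary: bool        # True for a summary memo, False for a memo


def dump_memos(entries: List[MemoEntry]) -> str:
    """Serialize memos as tab-separated plain text (assumed layout)."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for e in entries:
        writer.writerow([e.text_id, e.entry_time.isoformat(),
                         "S" if e.is_summary else "M", e.text])
    return out.getvalue()


def load_memos(data: str) -> List[MemoEntry]:
    """Parse the tab-separated text produced by dump_memos."""
    reader = csv.reader(io.StringIO(data), delimiter="\t")
    return [MemoEntry(int(i), datetime.fromisoformat(t), text, kind == "S")
            for i, t, kind, text in reader]


memos = [
    MemoEntry(1, datetime(2008, 12, 24, 12, 40, 0), "Schedule review", True),
    MemoEntry(2, datetime(2008, 12, 24, 12, 41, 15), "Release slips by one week", False),
]
print(dump_memos(memos))
```

Because the result is ordinary text, it can still be opened in another editor, which is the property the description asks of memo.txt.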

By connecting the video camera 14 as shown in FIG. 1(b), video can be recorded at the same time as audio. The screen in this case is the same as that shown in FIG. 2(b). Note, however, that once a character has been typed on a new line and the time has been recorded, the recorded time is retained even if the character is deleted. If nothing is typed on a new line and another line feed is entered, no time is recorded for the memo consisting only of a line feed. This behavior can also be applied to audio-only minutes.

When a video file shot with a video camera or the like during minutes creation is included in the minutes files, the file structure of the minutes data 301 is as shown in FIG. 3(b). Minutes that include a video file have the structure of FIG. 3(a) with two additions: a file named "movie.meta", which holds the metadata of the video data, and a folder named "movie", which holds the video file. The shooting time can be extracted from, for example, the information recorded in the video file. The shooting time information is used by the synchronization control unit 56, which synchronizes the video file and the text file. The video file of the present embodiment contains audio data as well as video. The video data and the audio data are synchronized, so the time at which the audio was recorded, as well as that of the video, can be identified from the video file.

<Synchronization control>
The synchronization control unit 56 is now described in detail with reference to FIGS. 6 and 7. FIG. 6 is a block diagram showing an embodiment of the synchronization control means 56, which associates the input audio with the text. The synchronization control means 56 of the present embodiment includes a speech recognition unit 61, a recognition result phoneme string storage unit 62, a text-converted phoneme string generation unit 63, a text-converted phoneme string storage unit 64, and a combining unit 65.

The purpose of this synchronization control unit is as follows. For the text information stored in the format of FIG. 3(c) and its associated entry time information (TK), the unit performs speech recognition on the input audio (including audio synchronized with video data), detects the time at which the content of the text was spoken, that is, the time at which the corresponding audio was recorded (called the utterance time), and stores it together with the text information as new utterance time information (TH). The utterance time information (TH) may be recorded as an addition to the format of FIG. 3(c), or it may overwrite the entry time (TK). When it is recorded as an addition, the association between the added utterance time information (TH) and the text must be made explicit. In a tabular format such as that of FIG. 3(c), for example, this can be done by recording it on the same row as the text of the corresponding memo.

FIG. 6 shows the configuration of the synchronization control unit 56. Input of audio data to the speech recognition unit 61 starts when the audio or video recorded in the audio/video recording unit 55 is played back. When audio data is input directly, it is input as it is; when the input is video data, the audio data uttered by the participants is input together with the conference video. The video and the audio are therefore acquired without any time lag between them. The audio data acquired in this way is input to the speech recognition unit 61. The speech recognition unit 61 performs recognition processing on the input audio data and outputs the recognition result phoneme string corresponding to the audio data together with the utterance time (TH). The output data is input to the combining unit 65. Here, the utterance time (TH) is the time at which the recognition result phoneme string was uttered, derived from the time information attached to the audio (or video).

<Speech recognition unit>
FIG. 7 is a block diagram showing a configuration example of the speech recognition unit 61. The speech recognition unit 61 includes a speech detection unit 71, an acoustic analysis unit 72, an acoustic model storage unit 73, a language model storage unit 74, and a speech matching unit 75. This configuration is common in speech recognition and is described only briefly.

The speech detection unit 71 cuts out the sections of the input audio data that contain a human voice and sends them to the acoustic analysis unit 72. To cut out the speech, the speech detection unit 71 can use, for example, a speech detection method based on the magnitude of the input power. In this method, the input power is computed successively. The point at which the input power has stayed above a predetermined threshold continuously for a certain time is judged to be the start of a speech section, and conversely the point at which the input power has stayed below the predetermined threshold continuously for a certain time is judged to be the end of that section. The speech cut out by the speech detection unit 71 is sent successively to the acoustic analysis unit 72, from the speech start point to the speech end point.
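The power-based detection described above can be sketched as follows. This is a minimal illustration, not the implementation of the speech detection unit 71: the frame length, threshold, and required run length are assumed parameters, and placing each boundary at the beginning of the qualifying run (rather than at the moment the run completes) is a choice made for this sketch.

```python
from typing import List, Sequence, Tuple


def frame_powers(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each non-overlapping frame."""
    return [sum(s * s for s in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def detect_speech(powers: Sequence[float], threshold: float,
                  min_frames: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices of speech sections, end exclusive.

    A section starts once the power has exceeded the threshold for
    min_frames consecutive frames, and ends once it has stayed at or
    below the threshold for min_frames consecutive frames.
    """
    segments = []
    in_speech = False
    run = 0      # consecutive frames contradicting the current state
    start = 0
    for i, p in enumerate(powers):
        above = p > threshold
        if not in_speech:
            run = run + 1 if above else 0
            if run == min_frames:
                in_speech, start, run = True, i - min_frames + 1, 0
        else:
            run = run + 1 if not above else 0
            if run == min_frames:
                segments.append((start, i - min_frames + 1))
                in_speech, run = False, 0
    if in_speech:                      # audio ended while still in speech
        segments.append((start, len(powers)))
    return segments


powers = [0, 0, 5, 5, 5, 5, 0, 5, 0, 0, 0, 0]
print(detect_speech(powers, threshold=1, min_frames=3))
```

Requiring several consecutive frames on both edges keeps a single quiet frame inside a word (index 6 in the example) from splitting the section.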

The acoustic analysis unit 72 performs acoustic analysis of the speech cut out by the speech detection unit 71 and sends a sequence of acoustic features that characterize the speech, such as mel-frequency cepstral coefficients (MFCC), to the speech matching unit 75.

The acoustic model storage unit 73 stores standard patterns, such as hidden Markov models (HMM), prepared for each phoneme, the unit of which Japanese speech is composed. By concatenating these standard patterns according to the phoneme string that makes up a Japanese word or sentence, a standard pattern corresponding to any Japanese word or sentence can be created.

The language model storage unit 74 stores language models that define the connection relationships between Japanese words, between phonemes, and so on. These language models include (1) a continuous syllable recognition grammar that defines the connection relationships between syllables, (2) grammar rules that define the connection relationships between words, (3) a statistical language model that defines the probability that a given set of N phonemes occurs in sequence, and (4) a statistical language model that defines the probability that a given set of N words occurs in sequence.
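As an illustration of item (3), a statistical language model over phonemes, the following sketch estimates the probability of one phoneme following another from counts, which is the N = 2 (bigram) case. The maximum-likelihood estimate without smoothing, the function name `train_bigram`, and the tiny training set are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def train_bigram(sequences: List[Sequence[str]]) -> Dict[str, Dict[str, float]]:
    """Estimate P(next phoneme | previous phoneme) from phoneme sequences."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    return {prev: {cur: n / sum(c.values()) for cur, n in c.items()}
            for prev, c in counts.items()}


# Two hypothetical training sequences.
model = train_bigram([["k", "a", "i", "g", "i"], ["k", "a", "g", "i"]])
print(model["a"])
```

A practical model would add smoothing so that phoneme pairs unseen in training do not receive zero probability.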

FIG. 8 shows examples of language models. FIG. 8(a) is a continuous syllable recognition grammar that defines the connection relationships between syllables; it defines how the consonants /b/, /d/, ... connect to the vowels /a/, /i/, .... FIG. 8(b) is a set of grammar rules that defines the connection relationships between words; it defines how /word 1/, /word 2/, ... connect.

FIG. 9 is a schematic diagram of the collation process in the voice collation unit 75. It shows that the acoustic feature sequence sent from the acoustic analysis unit 72 is collated with the standard pattern in the voice collation unit 75, that /sh/ /i/ ... /u/ is obtained as the collation result, and that the start and end times of the speech section corresponding to each phoneme are acquired. The utterance time (TH) at which the utterance started is determined from this per-phoneme time information.

Returning to FIG. 7, the voice collation unit 75 generates a standard pattern by connecting acoustic models according to the connection rules in the language model, and uses the Viterbi algorithm to collate the acoustic feature sequence sent from the acoustic analysis unit 72 with the standard pattern. This collation yields the correspondence between speech sections and the standard pattern that maximizes the collation score. As the recognition result of the speech recognition unit 61, the recognition result phoneme string and the start and end times of the speech section corresponding to each phoneme of the standard pattern are acquired. The recognition result phoneme string is stored in the recognition result phoneme string storage unit 62. Speech collation is described in Seiichi Nakagawa et al., "Speech Recognition by Stochastic Models" (The Institute of Electronics, Information and Communication Engineers). A phoneme string can be represented, for example, as a sequence of codes each indicating a phoneme. The time corresponding to each phoneme is recorded in association with the phoneme's code.
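The following is a heavily simplified sketch of how a Viterbi search can yield a start time for each phoneme. It assumes one state per phoneme, no transition probabilities, and precomputed per-frame log-scores; the function name `align_phonemes` and the score values are hypothetical and are not taken from the description, which uses HMM-based standard patterns.

```python
def align_phonemes(frame_scores):
    """Return the start frame of each phoneme on the best left-to-right path.

    frame_scores[t][p] is the log-score of frame t under phoneme p of the
    standard pattern. Assumes at least as many frames as phonemes.
    """
    n_frames = len(frame_scores)
    n_phonemes = len(frame_scores[0])
    neg_inf = float("-inf")
    best = [[neg_inf] * n_phonemes for _ in range(n_frames)]
    back = [[0] * n_phonemes for _ in range(n_frames)]
    best[0][0] = frame_scores[0][0]  # the path must start in the first phoneme
    for t in range(1, n_frames):
        for p in range(n_phonemes):
            stay = best[t - 1][p]
            advance = best[t - 1][p - 1] if p > 0 else neg_inf
            if advance > stay:
                best[t][p] = advance + frame_scores[t][p]
                back[t][p] = p - 1
            else:
                best[t][p] = stay + frame_scores[t][p]
                back[t][p] = p
    # Trace back from the last phoneme at the last frame.
    path = [n_phonemes - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return [path.index(p) for p in range(n_phonemes)]


# Hypothetical scores: frames 0-1 fit phoneme 0, frames 2-3 fit phoneme 1.
starts = align_phonemes([[0, -5], [0, -5], [-5, 0], [-5, 0]])
```

Multiplying a start frame by the frame period would give the start time of that phoneme, and the start of the first phoneme would correspond to the utterance time (TH).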

The speech recognition unit 61 acquires its recognition result when instructed by the combining unit 65. Until such an instruction is given, the recognition result phoneme string storage unit 62 holds the intermediate collation results (partial collation results) obtained during collation, namely the phonemes and their start times (and optionally their end times). In this example, the start time corresponds to the utterance time of the phoneme. Then, in response to an instruction from the combining unit 65, collation either continues from the intermediate collation result held at the previous acquisition, or restarts from the initial state after the intermediate result used in the previous collation is discarded.

Meanwhile, the text converted phoneme string generation unit 63 generates a text converted phoneme string corresponding to the text stored in the text recording unit 52. The unit of input text is the unit entered from the text input unit 51 and may be a single character or a sentence; in this embodiment, the unit runs from the first character, or the character following a line feed code, up to the next line feed code. The text converted phoneme string is stored in the text converted phoneme string storage unit 64.

FIG. 10 is a schematic diagram of the process by which the text converted phoneme string generation unit 63 generates a text converted phoneme string. The unit performs morphological analysis on text written in mixed kanji and kana, divides it into parts of speech, and converts it into a kana string representing its reading. It then refers to a conversion table describing the rules for converting kana characters to phonetic symbols and converts the kana string into a phoneme string, producing the text converted phoneme string. The generated phoneme string has the same format as the phoneme string output by the speech recognition unit 61.

For example, when the text string written in mixed kanji and kana is 「７時のニュースです」 ("It is the 7 o'clock news"), the text converted phoneme string generation unit 63 first divides it by morphological analysis into the parts of speech 「７」, 「時」, 「の」, 「ニュース」, and 「です」. It then converts these into the kana strings representing their readings, 「しち」, 「じ」, 「の」, 「にゅーす」, and 「です」, and, referring to the conversion table describing the rules for converting kana to phonetic symbols, converts the kana strings into the phoneme string /sh/ /i/ /ch/ /i/ /j/ /i/ /n/ /o/ /ny/ /uu/ /s/ /u/ /d/ /e/ /s/ /u/. These phonetic symbols correspond to the codes indicating phonemes.
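The table lookup in the last step can be sketched as follows. The table `KANA_TO_PHONEMES` holds only the entries needed for this one example and is an assumption, as is the treatment of the long-vowel mark 「ー」 as doubling the preceding vowel; morphological analysis and reading assignment are taken as already done.

```python
# Hypothetical conversion table, limited to the kana used in the example.
KANA_TO_PHONEMES = {
    "にゅ": ["ny", "u"],
    "し": ["sh", "i"],
    "ち": ["ch", "i"],
    "じ": ["j", "i"],
    "の": ["n", "o"],
    "す": ["s", "u"],
    "で": ["d", "e"],
}


def kana_to_phonemes(kana):
    """Convert a kana reading into a list of phoneme codes."""
    phonemes = []
    i = 0
    while i < len(kana):
        if kana[i] == "ー":  # long-vowel mark: lengthen the previous vowel
            phonemes[-1] = phonemes[-1] * 2  # e.g. "u" becomes "uu"
            i += 1
            continue
        for length in (2, 1):  # prefer two-character kana such as "にゅ"
            chunk = kana[i:i + length]
            if chunk in KANA_TO_PHONEMES:
                phonemes.extend(KANA_TO_PHONEMES[chunk])
                i += len(chunk)
                break
        else:
            raise KeyError(f"no conversion rule for {kana[i]!r}")
    return phonemes


result = kana_to_phonemes("しちじのにゅーすです")
```

Trying the two-character entry first keeps 「にゅ」 from being split into 「に」 and 「ゅ」, neither of which is in this table.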

The combining unit 65 collates the recognition result phoneme string stored in the recognition result phoneme string storage unit 62 with the text converted phoneme string stored in the text converted phoneme string storage unit 64, associates speech with text, and combines them as a result. This process is described in detail with reference to FIG. 11. In outline, a recognition result phoneme string and a text converted phoneme string are combined based on the closeness of the text entry time to the speech utterance time and on how well the phoneme strings match.

The phoneme string collation result storage unit 66 holds the intermediate collation results (partial edit distances) obtained by the combining unit 65 during collation. When collation finds that several texts correspond to one utterance, all of these results are stored. Storing them makes it possible to form associations correctly even when the same word is spoken more than once or the same text is entered more than once.

＜Combining process＞
The processing performed when the combining unit 65 is implemented as a program is described in detail below with reference to the flowchart of FIG. 11. When the combining unit 65 starts processing, the CPU of the client terminal 11 (hereinafter, the CPU) first retrieves the list of all text converted phoneme strings stored in the text converted phoneme string storage unit 64 (S1101). Next, to generate phoneme strings, the CPU starts playing back the conference audio (video) stored in the audio/video recording unit 55 (S1102). As the audio is played back, the CPU retrieves a recognition result phoneme string generated by the speech recognition unit 61 (S1103) and, treating it as the target recognition result phoneme string, compares it with the text converted phoneme strings (S1104). As described for FIG. 9, a recognition result phoneme string contains the phoneme string and the utterance time of each phoneme.

The text converted phoneme strings to be compared are those whose entry time (TK) is later than the utterance time (TH) of the recognition result phoneme string. In the example of FIG. 12(a), text converted phoneme strings 3 and 4, whose entry times TK are later than the utterance time TH = 0:40 of the target recognition result phoneme string "shinkai", are the comparison targets. The result of the comparison is output as a dissimilarity defined by the number of phoneme insertions, deletions, and substitutions needed to convert the first phoneme string into the second. In step S1104, the dissimilarity between the target recognition result phoneme string and each text converted phoneme string being compared is obtained. In FIG. 12(a), the dissimilarity is 1 for text converted phoneme string 3 and 0 for text converted phoneme string 4. Next, the CPU determines whether any of the text converted phoneme strings compared with the target recognition result phoneme string has a dissimilarity at or below a predetermined value (S1105).
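The selection of comparison targets and the threshold test can be sketched as below. This is a minimal illustration, not the flowchart itself: the names `combine` and `exact_match`, the dictionary layout, and the sample times (minutes from the start of the meeting) are assumptions, and the dissimilarity measure is passed in as a function so that any measure can be substituted.

```python
def combine(utterances, texts, dissimilarity, threshold):
    """Collect, per memo text, the utterances that may correspond to it."""
    links = {text["id"]: [] for text in texts}
    for utt in utterances:  # S1103: take the next recognition result
        for text in texts:  # S1104: compare with each candidate text
            if text["tk"] <= utt["th"]:
                continue  # memo was entered before the utterance: not a target
            score = dissimilarity(utt["phonemes"], text["phonemes"])
            if score <= threshold:  # S1105
                links[text["id"]].append((utt["id"], utt["th"]))  # S1106
    return links


def exact_match(a, b):
    """Stand-in dissimilarity for the demo: 0 if identical, otherwise 1."""
    return 0 if a == b else 1


shinkai = ["sh", "i", "N", "k", "a", "i"]
utterances = [{"id": "N1", "th": 40, "phonemes": shinkai}]
texts = [
    {"id": "T2", "tk": 30, "phonemes": shinkai},      # entered before TH
    {"id": "T3", "tk": 50, "phonemes": shinkai[:5]},  # differs from the utterance
    {"id": "T4", "tk": 60, "phonemes": shinkai},
]
links = combine(utterances, texts, exact_match, threshold=0)
```

With these hypothetical inputs only T4 is linked to the utterance: T2 is skipped because its entry time precedes the utterance time, and T3 exceeds the threshold.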

If there is such a text converted phoneme string, the utterance time (TH) of the target recognition result phoneme string is associated with it (S1106). The aim is to detect the utterance time of the speech (or the time in the video) whose content is close to that of the text; this is an example of detecting the closely matching portion by comparing text with speech. Because the text converted phoneme string is detected here using the recognition result phoneme string as the reference, the target recognition result phoneme string does not necessarily coincide with a memo unit. The utterance time associated with the text in S1106 is therefore the utterance time of the phoneme at the beginning of the text (delimited by line feed codes in this example), and it is recorded as the utterance time of that text.

When the comparison is complete, the CPU determines whether the audio (video) has been played back to the end (S1107). If not, processing branches to S1103 and moves on to the next recognition result; this is repeated until playback ends. If the next recognition result phoneme string becomes available before the processing of S1104 to S1106 has finished, playback of the audio (video) may be paused and resumed after S1106 completes.

After these associations are complete, the combining unit 65 also determines whether multiple text converted phoneme strings are associated with one recognition result phoneme string. When multiple recognition result phoneme strings are associated with one text converted phoneme string, it determines which recognition result phoneme string's utterance time is given first priority. For example, the phoneme string with the most recent utterance time is given first priority. Alternatively, among the candidate phoneme strings whose dissimilarity is at or below the threshold, the one with the smallest dissimilarity may be given first priority, with utterance time deciding first priority only when there are several such strings.

＜Calculation of dissimilarity＞
To calculate the dissimilarity in step S1104, the combining unit 65 compares the recognition result phoneme string stored in the recognition result phoneme string storage unit 62 with the text converted phoneme string stored in the text converted phoneme string storage unit 64 and, as the result of the comparison, calculates a dissimilarity indicating the degree to which the two differ. For example, transforming the first phoneme string "/sh/ /i/ /N/ /k/ /a/ /i/" into the second phoneme string "/t/ /o/ /k/ /a/ /i/" requires at least three operations, as shown below, so the dissimilarity is 3.

1. /sh/ /i/ /N/ /k/ /a/ /i/
2. /t/ /i/ /N/ /k/ /a/ /i/ (replace "/sh/" with "/t/")
3. /t/ /o/ /N/ /k/ /a/ /i/ (replace "/i/" with "/o/")
4. /t/ /o/ /k/ /a/ /i/ (delete "/N/"; done)
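A dissimilarity defined by the minimum number of insertions, deletions, and substitutions is commonly computed as the Levenshtein (edit) distance. The sketch below applies the standard dynamic-programming formulation to lists of phoneme codes; the function name `edit_distance` is an assumption, and the description does not state that this exact formulation is used.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions from a to b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute x with y, or keep a match
            ))
        prev = curr
    return prev[-1]


first = ["sh", "i", "N", "k", "a", "i"]
second = ["t", "o", "k", "a", "i"]
distance = edit_distance(first, second)
```

For the two phoneme strings of the worked example this yields 3, matching the two substitutions and one deletion listed above.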

The simplest procedure for obtaining the dissimilarity is to pair the phonemes in order from the beginning and count the number of differing phonemes up to the end of the shorter phoneme string. If one phoneme string is longer, the difference in the number of phonemes is added to that count. The result is the dissimilarity. This is only one example, and other methods may be applied.
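The position-by-position count described here can be sketched in a few lines. The function name `positional_dissimilarity` is an assumption for this example.

```python
def positional_dissimilarity(a, b):
    """Count differing phonemes position by position, plus the length difference."""
    mismatches = sum(x != y for x, y in zip(a, b))  # zip stops at the shorter string
    return mismatches + abs(len(a) - len(b))


first = ["sh", "i", "N", "k", "a", "i"]
second = ["t", "o", "k", "a", "i"]
simple = positional_dissimilarity(first, second)
```

Because this count does not realign the strings after an insertion or deletion, it can exceed the minimum number of edit operations: for the pair above it gives 6 (five mismatched positions plus one extra phoneme), whereas three operations suffice.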

＜Text-speech correspondence list＞
When the processing of FIG. 11 ends, the combining unit 65 creates, in the phoneme string collation result storage unit 66, a list 1202 such as that shown in FIG. 12(b), in which each text converted phoneme string is listed with its associated recognition result phoneme strings and their utterance times. Each input text is assigned a text ID, and the list stores in association with one another the text converted phoneme string, the text entry time (TK), the associated recognition result phoneme string and its ID, and the utterance time of that recognition result phoneme string. The flag OnClick, which indicates that a recognition result phoneme string is associated with the text converted phoneme string with first priority, and the flag IsSame, which indicates that a recognition result phoneme string is also associated with an earlier text, are explained together with the procedure of FIG. 15. The text-speech correspondence list 1202 can also be regarded as an association storage unit that stores the associations between text segments and speech segments.

＜Determination of first priority＞
The procedure for determining first priority is described with reference to FIG. 15. It is executed after the procedure of FIG. 11. First, a comparison result is retrieved from the text-speech correspondence list of FIG. 12(b) (S1501). Next, it is determined whether only one recognition result phoneme string is associated with the retrieved text converted phoneme string and whether that recognition result phoneme string has already appeared (S1502). In the example of FIG. 12(b), the only recognition result phoneme string associated with text T5 is N1, so the flag OnClick, which indicates first priority, is set to TRUE. Because recognition result phoneme string N1 does not appear above the current row in list 1202, the flag IsSame, which indicates that the same recognition result phoneme string ID appears above the current row in list 1202, is set to FALSE. In other words, IsSame is a flag indicating that the recognition result phoneme string is also associated with a text that has already been retrieved from list 1202 and examined.

Processing then proceeds to S1504, where it is determined whether all pairs of a text converted phoneme string and a recognition result phoneme string have been examined. Such a pair is referred to as a text-speech pair and, when referring to FIG. 12, is written Tx-Ny using the two identifiers. If unprocessed text-speech pairs remain, S1501 and S1502 are performed again. The next text retrieved from list 1202, text T6, is associated with recognition result phoneme strings N2 and N3, so the one with the most recent time information is given first priority (S1503). Here the utterance time of N3 is later than that of N2, so OnClick is set to TRUE for N3 and FALSE for N2. Because neither appears higher in the list, IsSame is FALSE for both N2 and N3. Applying the same process to text T7 sets OnClick to FALSE for N2 and N3 and TRUE for N4. Because N4 appears for the first time, its IsSame flag is FALSE, whereas N2 and N3 have already appeared in association with the earlier text T6, so their IsSame flags are TRUE.
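The flag assignment for texts T5 to T7 can be sketched as follows. The `Link` record, the function `set_first_priority`, and the utterance times in seconds are assumptions made for illustration; only the text and recognition result IDs and the resulting flag values follow the description.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """One text-speech pair (a row of the correspondence list)."""
    text_id: str
    recog_id: str
    th: int                 # utterance time, here in seconds (hypothetical)
    on_click: bool = False  # first priority for this text
    is_same: bool = False   # recognition result already seen for an earlier text


def set_first_priority(links):
    """Set OnClick to the latest utterance per text and mark repeated results."""
    by_text = {}
    for link in links:  # dicts keep insertion order, so texts stay in list order
        by_text.setdefault(link.text_id, []).append(link)
    seen = set()
    for rows in by_text.values():
        latest = max(rows, key=lambda r: r.th)
        for row in rows:
            row.on_click = row is latest
            row.is_same = row.recog_id in seen
        seen.update(r.recog_id for r in rows)


links = [
    Link("T5", "N1", 10),
    Link("T6", "N2", 20), Link("T6", "N3", 30),
    Link("T7", "N2", 20), Link("T7", "N3", 30), Link("T7", "N4", 40),
]
set_first_priority(links)
```

After the call, OnClick is TRUE for T5-N1, T6-N3, and T7-N4, and IsSame is TRUE only for T7-N2 and T7-N3, which repeat recognition results already linked to T6.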

When all text-speech pairs in the list have been processed, processing branches at S1504 and it is determined whether one recognition result phoneme string has been given first priority by multiple text converted phoneme strings. That is, it is determined whether the list contains a recognition result phoneme string that is associated with multiple text converted phoneme strings and whose OnClick flag is TRUE for more than one of them.

To do so, the collation results obtained so far are first referred to (S1505), and the OnClick value is read for each recognition result phoneme string ID (S1506). The OnClick flags of the text-speech pairs are examined starting from the text converted phoneme string with the largest ID and, within it, from the recognition result phoneme string with the largest ID, that is, the latest utterance time. In the example of FIG. 12(b), the recognition result phoneme strings N2 and N3 associated with text T7 are also associated with text T6. However, neither has OnClick set to TRUE for more than one text converted phoneme string, so processing moves to S1509 and the same processing is repeated until all text IDs have been examined.

In contrast, when the list contents are as in the example of FIG. 12(c), the recognition result phoneme string N2 being examined in S1506 has OnClick set to TRUE for text converted phoneme strings T13, T12, and T11. That is, N2 has been selected as first priority for T11, T12, and T13. Processing therefore moves to S1507, where it is determined whether list 1202 contains, after the recognition result phoneme string being examined, the same recognition result phoneme string with OnClick set to TRUE. If so, the OnClick value of the recognition result phoneme string being examined is rewritten to FALSE (S1508). At the same time, among the recognition result phoneme strings associated with that text, the OnClick value of the recognition result phoneme string ID immediately preceding the one just rewritten to FALSE is changed to TRUE. These two rewrites are together called an update of the OnClick flag.

In S1507, it may additionally be required that, in the text-speech pair under consideration, multiple recognition result phoneme strings are associated with the one text. If only one recognition result phoneme string is associated with the text, removing it from first priority would leave no other candidate to take first priority. In this variant, processing branches to S1508 and updates the OnClick flag of the text-speech pair and of the preceding text-speech pair only if multiple recognition result phoneme strings are associated with the text.

If the condition of S1507 is not satisfied, processing branches to S1509. In the example of FIG. 12(c), for the recognition result phoneme string N2 associated with text converted phoneme strings T11 and T12, OnClick is also TRUE for the later text converted phoneme string T13, so processing branches from S1507 to S1508 and the OnClick values are updated. Because more than one recognition result phoneme string is associated with each of these texts, processing branches to S1508 even if that condition is combined by logical AND. The corresponding OnClick flags are thus rewritten to FALSE, and the OnClick flags of the recognition result phoneme string N1 associated with each text (that is, T11-N1 and T12-N1) are rewritten to TRUE.

For the recognition result phoneme string N2 associated with text converted phoneme string T13, on the other hand, no later N2 with OnClick set to TRUE appears, so S1508 is skipped and its OnClick value is left unchanged.

FIG. 12(d) shows the state in which the first-priority text for recognition result phoneme string N2 has been determined by the procedure of FIG. 13. In this state, S1509 and S1505 move on from T13 to compare the OnClick values of the recognition result IDs of T12 with those of the same recognition result IDs in T11. Here too, OnClick is TRUE for both T12-N1 and T11-N1, so processing moves to S1507; however, no ID smaller than N1 exists, so processing moves to S1509, proceeds to the recognition result IDs of T11, and ends.
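One possible reading of the OnClick update pass (S1505 to S1509) is sketched below. The row layout, the integer recognition IDs, and the function name `resolve_shared_priority` are assumptions, and the sample rows only imitate the situation described for T11 to T13; the actual contents of FIG. 12(c) are not reproduced here.

```python
def resolve_shared_priority(rows):
    """Demote a first-priority result that is also first priority for a later text.

    rows are ordered by text ID and then by recognition ID, as in the list.
    """
    for idx in range(len(rows) - 1, -1, -1):  # start from the last text
        row = rows[idx]
        if not row["on_click"]:
            continue
        later_duplicate = any(
            r["recog"] == row["recog"] and r["on_click"] for r in rows[idx + 1:]
        )
        if not later_duplicate:
            continue  # S1507 not satisfied: keep the flag as it is
        earlier = [
            r for r in rows
            if r["text"] == row["text"] and r["recog"] < row["recog"]
        ]
        if not earlier:
            continue  # no smaller ID to fall back to
        row["on_click"] = False  # S1508: demote this pair
        max(earlier, key=lambda r: r["recog"])["on_click"] = True  # promote the previous ID


rows = [
    {"text": "T11", "recog": 1, "on_click": False},
    {"text": "T11", "recog": 2, "on_click": True},
    {"text": "T12", "recog": 1, "on_click": False},
    {"text": "T12", "recog": 2, "on_click": True},
    {"text": "T13", "recog": 2, "on_click": True},
]
resolve_shared_priority(rows)
flags = [(r["text"], r["recog"], r["on_click"]) for r in rows]
```

In this sketch T13-N2 keeps first priority, T11 and T12 fall back to N1, and the remaining shared first priority of N1 is left alone because no smaller recognition ID exists.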

As FIG. 12(c) shows, the same recognition result phoneme string may still be selected as first priority by multiple text converted phoneme strings even after these processes are executed, but this does not affect the operation of the system. The list associating texts with recognition result phoneme strings is thus completed. This list also associates the audio or video minutes on which the recognition result phoneme strings are based with the text entered as memos. Because a text-speech correspondence list 1202 is created for each set of minutes, when multiple sets of minutes are stored, each is given its own folder or unique identifying name so that they can be told apart.

＜Playback of digital minutes＞
Next, playback of the digital minutes, that is, operation during viewing, is described with reference to FIGS. 4, 14, and 15. When the operator instructs the playback program to start, the CPU launches the viewing program. In practice, playback and recording may be provided in a single program, with the mode selected at startup; alternatively, the program may start in playback mode and be switched to recording mode when needed. Once started, the minutes playback program refers to the text-speech correspondence list 1202 and to the minutes data 301 and minutes memo data 302 of FIG. 3 and displays a user interface screen, as shown in FIG. 4, that includes a time display unit 41, a material display unit 42 (if there is a material file), a text display unit 43, a video scene display unit 44 (if there is a video file), and a summary display unit 45. Note that the sample contents of the text-speech correspondence list 1202 in FIG. 12, of the minutes memo data 302 in FIG. 3, and of FIGS. 4 and 14 are separate examples and are unrelated to one another.

When the user interface is displayed, times are first shown in chronological order in the time display unit 41, and for each displayed time the material associated with that time is shown in chronological order in the material display unit 42. To make this possible, time information can be attached, for example, to each page of the conference material so that the page can be associated with video frames captured, or memos recorded, at the same time. Then, referring to the minutes memo data 302, the summary memo 332 and the memo 333 are displayed against the displayed timeline according to their entry times. The summary memo 332 is also displayed in the summary display unit 45 in order of entry time. In addition, the video scene display unit 44 displays frames by capture time, aligned with the times shown in the time display unit 41. If, for example, the time display unit 41 shows a time scale, the frames to display can be chosen according to the times of the scale marks.

Once the user interface 401 is displayed in this way, the user can operate it to play back the video corresponding to a memo.

When a line in the summary display unit 45 is clicked, the CPU refers to the text-speech correspondence list 1202 to find the recognition result phoneme strings associated with the text ID of the clicked text, reads the utterance time (TH) of the one whose OnClick flag is TRUE, and highlights the row of the user interface 401 that contains that utterance time. For example, the text in the text display unit 43 corresponding to the clicked summary text and the corresponding time in the time display unit 41 are highlighted with a different color, blinking, or the like. The corresponding material or frame may of course be highlighted as well.

When a material in the material display unit 42 is clicked, the clicked material is enlarged on the screen of the client terminal 1. When text in the text display unit 43 is clicked, the CPU plays back the audio or video from the utterance time (TH) of the recognition result phoneme string that is associated with that text and has OnClick set to TRUE, as in FIG. 15(a). The association is resolved as follows. Among the recognition result phoneme strings corresponding to the text ID of the clicked text, the one whose OnClick flag is TRUE is found in the text-speech correspondence list 1202. This identifies its utterance time (TH), and the video data in the minutes data 301 is played back starting from the frame recorded at that utterance time. Likewise, when a video (frame) in the video scene display unit is clicked, the video is played back starting from that frame. FIG. 15(a) shows an example of this screen. Below the video screen 1501 are a timeline 1502 indicating the playback position within the whole video, a rewind button 1503, a pause button 1504, and a fast-forward button 1505. While playback is paused, a play button is shown in place of the pause button 1504. This display appears, for example, in a window separate from the user interface 401.
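The lookup from a clicked text to its playback position can be sketched as follows. The row layout, the function name `playback_start`, and the times in seconds are assumptions for this example; the actual seek into the video data is outside the sketch.

```python
def playback_start(rows, text_id):
    """Return the utterance time (TH) of the first-priority pair for a text."""
    for row in rows:
        if row["text"] == text_id and row["on_click"]:
            return row["th"]
    return None  # no associated utterance: nothing to seek to


# Hypothetical excerpt of a text-speech correspondence list.
rows = [
    {"text": "T5", "recog": "N1", "th": 10, "on_click": True},
    {"text": "T6", "recog": "N2", "th": 20, "on_click": False},
    {"text": "T6", "recog": "N3", "th": 30, "on_click": True},
]
start = playback_start(rows, "T6")
```

Returning `None` for a text with no first-priority pair lets the caller decide whether to ignore the click or fall back to the entry time.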

During video playback, marks indicating the utterance times of the other associated recognition-result phoneme strings may be displayed on the timeline 1502 as playback candidates. To obtain the playback candidates, the text-speech correspondence list 1202 is referenced to select the recognition-result phoneme strings whose OnClick flag is TRUE, and for each selected string a mark 1506 (a diamond in the example of FIG. 15(b)) is displayed on the timeline 1502 at the position corresponding to its utterance time. When a mark is clicked, video playback starts from the utterance time of the phoneme string associated with that mark. When the cursor is placed over a mark, the text associated with the mark's phoneme string may be looked up in the text-speech correspondence list and displayed in a balloon or the like. When displaying playback candidates, it is also possible to omit recognition-result phoneme strings whose IsSame flag is TRUE. This avoids displaying recognition-result phoneme strings that are associated with the same text-converted phonemes entered at an earlier entry time TK.
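As a rough sketch of how the playback-candidate marks could be placed, the function below filters the list by the OnClick and IsSame flags and converts each utterance time to a horizontal position on the timeline. The record layout, the pixel-based timeline, and all names are assumptions made for illustration; the description only specifies which flags are consulted and that the mark sits at the position corresponding to the utterance time.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    text_id: int
    utterance_time: float  # TH, seconds from the start of the video
    on_click: bool         # OnClick flag
    is_same: bool          # IsSame flag


class Mark(NamedTuple):
    x_px: int              # horizontal offset of the mark on the timeline
    utterance_time: float  # playback starts here when the mark is clicked
    text_id: int           # text to show in a balloon on hover


def timeline_marks(entries: List[Entry], duration_s: float, width_px: int,
                   hide_same: bool = True) -> List[Mark]:
    """Build marks for entries with OnClick TRUE, optionally skipping IsSame."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    marks = []
    for e in entries:
        if not e.on_click:
            continue
        if hide_same and e.is_same:
            continue
        if not 0 <= e.utterance_time <= duration_s:
            continue  # ignore times outside the recorded video
        x = int(e.utterance_time * width_px / duration_s)
        marks.append(Mark(x, e.utterance_time, e.text_id))
    return marks


entries = [
    Entry(1, 120.0, True, False),
    Entry(2, 450.0, True, True),
    Entry(3, 300.0, False, False),
]
marks = timeline_marks(entries, duration_s=600.0, width_px=300)
```

Passing `hide_same=False` shows every OnClick entry, which corresponds to the case where the IsSame filtering is not used.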

A search function is also provided. Selecting the search window 46 in the user interface 401 makes it possible to enter a search term. When the search button is clicked, the places where matching text exists are highlighted.
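The description does not state how matching is performed. A minimal sketch, assuming a simple case-insensitive substring match over the memo text segments (the function name and the `(text_id, text)` row layout are hypothetical), might look like this:

```python
from typing import List, Tuple


def find_matching_rows(rows: List[Tuple[int, str]], query: str) -> List[int]:
    """Return the text IDs of rows whose text contains the query.

    Matching is a case-insensitive substring test, which is an assumption;
    an empty query matches nothing.
    """
    needle = query.strip().casefold()
    if not needle:
        return []
    return [text_id for text_id, text in rows if needle in text.casefold()]


rows = [
    (1, "Budget review for Q3"),
    (2, "Schedule of the next meeting"),
    (3, "Revised budget approved"),
]
to_highlight = find_matching_rows(rows, "budget")
```

The returned IDs identify the rows that the user interface would then highlight.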

Next, an outline of the procedure for viewing the minutes described above is explained with reference to FIG. 16. This procedure presents the operations described above as a flowchart. First, the user operates the minutes recording and playback apparatus and enters, on the displayed user interface 401, an instruction to play back the digital minutes. When viewing the minutes, the user moves the operation cursor to the location of interest and presses the operation button. The position of the operation cursor at the time the button is pressed is determined (S1601). If the cursor is in the summary display unit 45, the display row corresponding to the selected line of the memo in the summary display unit 45 is highlighted (S1602). If the cursor is in the text display unit 43, the video is played back from the utterance time of the recognition-result phoneme string associated with that text, limited to strings with OnClick = TRUE (S1603). If the cursor is in the video display unit, the video is played back from the designated frame (S1604).
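The branching in steps S1601 to S1604 can be sketched as a small dispatcher. The `Region` enumeration, the function name, and the returned action tuples are hypothetical and chosen for illustration; the step numbers and the three regions come from the description.

```python
from enum import Enum
from typing import Optional, Tuple


class Region(Enum):
    SUMMARY = "summary display unit 45"
    TEXT = "text display unit 43"
    VIDEO = "video display unit"


def handle_click(region: Region, *,
                 utterance_time: Optional[float] = None,
                 frame_time: Optional[float] = None) -> Tuple[str, Optional[float]]:
    """Decide the action for a click, given the cursor region (S1601).

    utterance_time is TH of the OnClick = TRUE phoneme string for the clicked
    text; frame_time is the time of the clicked frame.
    """
    if region is Region.SUMMARY:
        # S1602: highlight the row that contains the utterance time.
        return ("highlight_row", utterance_time)
    if region is Region.TEXT:
        # S1603: play the video from the associated utterance time.
        return ("play_from", utterance_time)
    if region is Region.VIDEO:
        # S1604: play the video from the clicked frame.
        return ("play_from", frame_time)
    raise ValueError(f"unknown region: {region!r}")


action = handle_click(Region.TEXT, utterance_time=12.5)
```

Returning an action description rather than performing the playback directly keeps the sketch independent of any particular user interface toolkit.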

In this way, the memo recorded by the person in charge of the minutes can be viewed while it is automatically matched with the video and audio of the meeting, which makes it easier to follow the topic under discussion at each point from the video or the like. By designating part of the memo created by the person in charge of the minutes, the corresponding portion of the meeting's video or audio can be played back, which makes it easy to find the portion the user wants to hear. Audio recorded by other means can also be used, which helps, for example, when a recording device fails. Not only the audio but also the video of the meeting can be searched, so the content is easier to understand even for a meeting in which explanations are given while drawing on a whiteboard. In addition to the memo created by the person in charge of the minutes, data such as the materials used in the meeting can be displayed, which makes it easier to convey the content of the meeting.

As described above, in the embodiment of the present invention, the minutes creation and playback apparatus records input character data together with time information indicating when it was input, and associates the text, converted into phonetic data, with the recorded audio or video. As a result, text and speech can be associated not only by the input time of the text but also by their phonemes. When a memo is designated, the speech that is phonetically associated with the text of that memo can be played back, so that text can be associated with audio or video with higher accuracy. In addition, because the corresponding position in the character data is displayed during playback, minutes can be created in which the desired part is easy to look up.
[Other Embodiments]
The present invention can also be realized by executing the following processing. That is, software (a program) that implements the functions of the embodiment described above is supplied to a system or apparatus via a network or various storage media, and a computer (or a CPU, MPU, or the like) of that system or apparatus reads out and executes the program.

Claims

1. A minutes recording apparatus comprising:
holding means for holding recorded audio data for each segment;
editing means for storing input text information for each segment; and
association means for associating each segment of the text information with a segment of the audio data based on the degree of dissimilarity between their phonemes, and storing the association.

2. The minutes recording apparatus according to claim 1, wherein
the holding means holds the audio data in association with date and time information indicating when each segment was uttered,
the editing means stores the text information in association with date and time information indicating when each segment was input, and
the association means associates a text information segment with an audio data segment selected, based on the degree of phoneme dissimilarity, from among the audio data segments whose associated date and time information is earlier than the date and time information associated with that text information segment.

3. The minutes recording apparatus according to claim 1, wherein, when a plurality of audio data segments can be associated with one text information segment, the association means stores, in association with that text information segment, the audio data segment whose associated date and time information is the newest.

4. The minutes recording apparatus according to claim 1, wherein each text information segment is a character string delimited by line feed codes, and each audio data segment is a phoneme string recognized from the audio data.

5. The minutes recording apparatus according to claim 1, wherein the audio data is audio data recorded in synchronization with video data.

6. The minutes recording apparatus according to claim 1, further comprising user interface means for displaying the text information in chronological order for each segment.

7. The minutes recording apparatus according to claim 6, wherein
the audio data is audio data recorded in synchronization with video data, and
the user interface means further displays, in time series, some of the frames constituting the video data.

8. The minutes recording apparatus according to claim 2, wherein, when one of the displayed segments of the text information is selected and playback is instructed, the user interface means further plays back the audio data from the date and time information further associated with the audio data segment that is associated with the selected text information segment.

9. The minutes recording apparatus according to claim 1, wherein
the audio data is audio data recorded in synchronization with video data, and
the user interface means further plays back the video data from the point in time indicated by the date and time information further associated with the audio data segment that is associated with the selected text information segment.

10. The minutes recording apparatus according to claim 9, wherein, when playing back the video data, the user interface means further displays a timeline for specifying the playback position within the video data and, on the timeline, marks indicating the date and time information associated with each audio data segment that is associated with each of the text information segments, and, when one of the marks is selected, plays back the video data from the point in time indicated by the date and time information associated with the selected mark.

11. A minutes recording method comprising:
a holding step of holding recorded audio data for each segment;
a holding step of storing input text information for each segment; and
an associating step of associating each segment of the text information with a segment of the audio data based on the degree of dissimilarity between their phonemes, and storing the association.

12. A program for causing a computer to execute:
a holding step of holding recorded audio data for each segment;
a holding step of storing input text information for each segment; and
an associating step of associating each segment of the text information with a segment of the audio data based on the degree of dissimilarity between their phonemes, and storing the association.