JP2003274345A

JP2003274345A - Multimedia recording device, multimedia editing device, recording medium therefor, multimedia reproducing device, speech record generating device

Info

Publication number: JP2003274345A
Application number: JP2002071079A
Authority: JP
Inventors: Toshihiko Umeda; 俊彦楳田
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2002-03-14
Filing date: 2002-03-14
Publication date: 2003-09-26
Anticipated expiration: 2022-03-14
Also published as: JP4662228B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech record generating device that automatically generates a speech record of a meeting attended by a plurality of people.

SOLUTION: The device has a means (AA) for selectively reading one source from a medium on which audio and video from a plurality of inputs are digitally recorded together with time information, a means for restoring the audio data, a means for detecting speech break positions, a means (B1) for dividing the voiced parts into phrase units, a speech recognition means (C1) for converting each phrase-unit voice into text data, and a means (D1) for adding the time information to the phrase-unit text data. For another source selectively read from the medium by the means (AA), it likewise has a means for restoring the audio data, a means for detecting speech break positions, a means (B2) for dividing the voiced parts into phrase units, a speech recognition means (C2) for converting each phrase-unit voice into text data, and a means (D2) for adding the time information to the phrase-unit text data. It further has a means (F) for interleaving the text data generated by (D1) and (D2) in time order, and an output means (G).

COPYRIGHT: (C)2003,JPO

Description

Detailed Description of the Invention

[0001]

[Technical field of the invention] The present invention relates to a multimedia recording device, a multimedia editing device, recording media for these devices, a multimedia reproducing device, and a speech record generating device, all for automatically creating a speech record of a meeting attended by a plurality of people.

[0002]

[Prior art] No publicly known material has been found in Japan that transcribes the speech of a meeting in which several people speak. The only related report is a prototype presented at HSC2001 (International Workshop of Hands-Free Speech Communication), held at the ATR Spoken Language Communication Research Laboratories in May 2001: "The meeting browser developed by Carnegie Mellon University is a system that can automatically create simple minutes" (http://www.is.cs.smu.edu/js/meeting.html). In this method, a single omnidirectional (360-degree) camera is placed at the center of the conference table. Image processing determines that a participant whose mouth is moving is speaking, that participant's speech is picked out of the audio input in common for the whole room, and speech recognition produces English text.

[0003] What is new in this approach is the combination of "selecting the person who is moving from among several people in the 360-degree camera input" with "speech recognition". Processing of 360-degree camera images, however, had already been carried out by Intel Corporation of the United States around 1988, and it did not become popular because it had no applications other than some military uses. As for selecting the speaker from among several people, a method of identifying mouths in multiple video images is known from the "Image transmission device" of Japanese Patent Laid-Open No. 8-317363 and from the documents cited therein, so it is not novel. Because the system is a prototype, it may be too early to evaluate its performance. Still, the microphone is not a "close-talking type" dedicated to each individual speaker but an open type shared by several people, or a "microphone array type", so little can be expected of its speech recognition accuracy with current technology.

[0004] As an approach that takes video memos of a meeting rather than creating minutes, the "Dialogue recording system" of Japanese Patent Laid-Open No. 2001-211440 is known. A camera is mounted on a person's head to record what is said in a corridor, so that the content can be viewed again later and turned into ideas, or so that only the moments of inspiration during a conversation are selected and recorded. According to that publication, the feeling that a dialogue is being recorded can affect human communication, which is why it takes the approach of providing a portable dialogue recording device.

[0005] An approach that accurately creates minutes text from conversational speech was finally put to practical use last year. It transcribes NHK news programs "for the hearing impaired", and the method is described in the "Transcribed text automatic generation device, speech recognition device, and recording medium" of Japanese Patent Laid-Open No. 2001-166790. The method is applied to announcers, who pronounce words comparatively clearly, and its recognition accuracy is extremely good. Its only drawback is that it does not handle the "ambiguity" of words spoken by ordinary people, which is not a serious one.

[0006] Another excellent continuous speech recognition technique is the "Language modeling system and method for forming a language model" of Japanese Patent Laid-Open No. 6-318096. It improves on the method of using a language (syntax) model to raise the decision probability when recognizing words from spoken audio, and it reduces the amount of computer memory that conventional language models require. However, as IBM itself calls it a "brute-force method", it is an approach that competes on the size of the pattern-matching dictionary and the number of language constructs rather than on the algorithm. It has been put to practical use in speech recognition software for PCs and in its syntactic language model.

[0007] Regarding simultaneous multi-channel recording of the video and audio of a meeting, the "Program information providing system, program information providing device, and recording/playback control device" of Japanese Patent Laid-Open No. 2000-217063 proposes a method of setting the content bit rate when several digital broadcast programs in the same time slot are recorded simultaneously. It makes effective use of a recording device with insufficient performance and is applied to recording digital broadcast programs with a large amount of data. The documents cited in that publication examine simultaneous recording with several VTRs (Japanese Patent Laid-Open Nos. 10-243303 and 7-21619) and a method of sharing a single VTR tape (Japanese Patent Laid-Open No. 9-307846). As a technique applicable to recording media such as disks, they also examine the application of the "video encoding technique that re-compresses video that has already been encoded and compressed" proposed in Japanese Patent Laid-Open Nos. 7-107461 and 11-98478.

[0008] As the most practical method of recording video from several sources, the "Video device" of Japanese Patent Laid-Open No. 2001-8144 proposes using an HDD to convert an NTSC signal to an MPEG2 signal and record it, to play back at the same time, and to record two sources simultaneously. The configuration itself has little novelty: at the time of that application (1999), it had already been implemented in the United States as a PC-based recording method, the "VIVO recording system", using an ATI video card with a built-in TV tuner. Stripe recording on HDDs (that is, RAID level 2) likewise has little novelty. The configuration is nevertheless highly practical, and products of this kind have been on the Japanese market as HDD recorders since around spring 2001.

[0009]

[Problems to be solved by the invention] As described above, there have been very few active approaches in recent years to transcribing what is said at ordinary meetings, and few approaches to recording a meeting with video. However, CPU, memory circuit, and large-capacity storage technologies have advanced so that devices can be made smaller, and communication infrastructure has recently become rapidly faster, cheaper, and IP-based. A lack of space for installing equipment is no longer an excuse. With video conferencing enjoying its first boom in ten years, the argument that being recorded affects the conversation must also be rejected.

[0010] The proposed configurations using VTRs and HDDs were all studied for the purpose of recording and playing back video entertainment: recording the input source as it is, with as little loss of quality as possible, so that the recording can be played back and enjoyed, or recording for a long time. In other words, they assume that there is no correlation between the content of the different sources, so they contain no device for processing the correlation between two sources.

[0011] The present invention has been made in view of the above circumstances, and its object is to provide a speech record generating device that automatically creates a speech record of a meeting attended by a plurality of people.

[0012] A further object is to allow the participants in the meeting, or third parties, to watch and listen to a reproduction of the meeting together with the text of the transcribed speech record.

[Means for solving the problems] To achieve these objects, the invention according to claim 1 is characterized by having a means for managing date and time, a plurality of input channel means for simultaneously inputting a plurality of streams of digital video data synchronized with audio, and a means for digitally storing the digital data of each channel together with date and time information on a recording medium.

[0013] The invention according to claim 2 is a recording medium characterized by having been recorded by a multimedia recording device that has a means for managing date and time, a plurality of input channel means for simultaneously inputting a plurality of streams of digital video data synchronized with audio, and a means for digitally storing the digital data of each channel together with date and time information on the recording medium.

[0014] The invention according to claim 3 is characterized by having a means (A) for reading data from a medium on which audio, video, and date and time information are digitally recorded, a means for restoring the audio data, a means for detecting speech break positions, a means (B) for dividing the voiced parts into phrase units, a speech recognition means (C) for converting each phrase-unit voice into text data, a means (D) for adding the date and time information to the phrase-unit text data, a means (V) for restoring the video data, and an output means (G).

[0015] The invention according to claim 4 is characterized by having a means (A) for reading data from a medium on which audio, images, and date and time information are digitally recorded, a means for restoring the audio data, a means for detecting speech break positions, a means (B) for dividing the voiced parts into phrase units, a speech recognition means (C) for converting each phrase-unit voice into text data, a means (D) for adding the date and time information to the phrase-unit text data, a means (E) for processing into phrases the video data synchronized with the timing at which the voiced parts are divided into phrase units, and a means for digitally recording the result on a recording medium.

[0016] The invention according to claim 5 is a recording medium characterized by having been recorded by a multimedia editing device that has a means (A) for reading data from a medium on which audio, images, and date and time information are digitally recorded, a means for restoring the audio data, a means for detecting speech break positions, a means (B) for dividing the voiced parts into phrase units, a speech recognition means (C) for converting each phrase-unit voice into text data, a means (D) for adding the date and time information to the phrase-unit text data, a means (E) for processing into phrases the video data synchronized with the timing at which the voiced parts are divided into phrase units, and a means for digitally recording the result on the recording medium.

[0017] The invention according to claim 6 is characterized by having a means (AA) for selectively reading one source from a medium on which audio and video from a plurality of inputs are digitally recorded together with date and time information, a means for restoring the audio data, a means for detecting speech break positions, a means (B1) for dividing the voiced parts into phrase units, a speech recognition means (C1) for converting each phrase-unit voice into text data, and a means (D1) for adding the date and time information to the phrase-unit text data; by having, for another source selectively read from the medium by the means (AA), a means for restoring the audio data, a means for detecting speech break positions, a means (B2) for dividing the voiced parts into phrase units, a speech recognition means (C2) for converting each phrase-unit voice into text data, and a means (D2) for adding the date and time information to the phrase-unit text data; and by having a means (F) for interleaving the text data created by (D1) and (D2) in time order, and an output means (G).

[0018]

[Embodiments of the invention] Embodiments of the present invention are described in detail below with reference to the accompanying drawings.

[0019] In the present invention, each participant's speech in a multi-person meeting is recorded continuously with a microphone and camera. For a three-person meeting, for example, the microphone and camera inputs are recorded simultaneously on three separate channels (the two-person case is described here). Some utterances in a meeting may overlap, but participants basically take turns speaking, so the meeting consists of a combination of each person's spoken phrases. Each person's recording is re-edited into phrase units, and the content of the speech is converted into text.

[0020] What is new in the present invention is that, as shown in the block configuration examples of FIG. 1 and FIG. 2, time information is added to a plurality of input sources for each unit recording time and the result is recorded in the format of FIG. 3. FIGS. 10 and 11 show conventional examples of recording on a VCR.

[0021] FIG. 1 shows a configuration example in which time information is added to the video and audio of a plurality of digital sources and recorded, and FIG. 2 shows a configuration example in which time information is added to the video and audio of analog sources and recorded. The two figures differ only in the input sources. FIG. 1 covers the case of several input sources, each a single stream in which video and audio are already digitized, such as DV format or a camcorder. FIG. 2 covers the case of several conventional input sources in which the video signal (S-video, composite, or component; NTSC, PAL, or SECAM) and the audio signal are separate. Both configurations consist of input channel 1, date/time management block 2, shaping unit 3, switch 4, writing unit 5, and recording medium 6.

[0022] The signal from each input source is received by input channel 1 and, using time information from date/time management block 2, is divided at a predefined unit time and shaped into the format (3-1) of FIG. 3. The recording means writes these units continuously to the recording medium, as shown in (3-2). The unit recording time ranges from several hundred milliseconds to about 10 seconds. Switch 4, at the point in FIGS. 1 and 2 where the information of the two channels is written, represents simultaneous writing by time division.
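The shaping and time-division writing described in [0022] can be sketched as follows. This is only an illustration under stated assumptions: the names `Unit`, `shape`, and `time_division_write` are hypothetical, each unit is assumed to hold a fixed number of bytes (real compressed units would vary in size), and the exact field layout of FIG. 3 is not reproduced.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import zip_longest


@dataclass
class Unit:
    channel: int      # input channel number
    start: datetime   # time stamp supplied by the date/time management block
    payload: bytes    # video/audio data for one unit recording time


def shape(channel, data, t0, unit_seconds, bytes_per_unit):
    """Cut one channel's stream into time-stamped units (role of shaping unit 3).

    Assumption: a fixed number of bytes corresponds to one unit recording time.
    """
    units = []
    for i in range(0, len(data), bytes_per_unit):
        start = t0 + timedelta(seconds=unit_seconds * (i // bytes_per_unit))
        units.append(Unit(channel, start, data[i:i + bytes_per_unit]))
    return units


def time_division_write(*channels):
    """Alternate between channels, one unit at a time (role of switch 4)."""
    medium = []
    for group in zip_longest(*channels):
        medium.extend(u for u in group if u is not None)
    return medium


t0 = datetime(2002, 3, 14, 10, 0, 0)
ch1 = shape(1, b"A" * 8, t0, unit_seconds=1.0, bytes_per_unit=4)
ch2 = shape(2, b"B" * 8, t0, unit_seconds=1.0, bytes_per_unit=4)
medium = time_division_write(ch1, ch2)
print([(u.channel, u.start.second) for u in medium])
# [(1, 0), (2, 0), (1, 1), (2, 1)]
```

The written order alternates between the channels, while each unit keeps its own time stamp, so the streams can later be separated again by channel number.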

[0023] Video signals, whether DV input or S-video, composite, or component signals, are digitally compressed. Any known compression method may be used: MPEG, Motion JPEG, or a sequence of JPEG 2000 frames. Audio is likewise digitized or re-digitized. A relatively wide band of about 96 kHz to 192 kHz is used, but the input is basically monaural, so the amount of audio data is far smaller than the amount of video data.

[0024] The recording medium 6 of FIG. 3 may be an HDD pack device, a large-capacity optical disc such as DVD-ROM, DVD-RW, or DVD-RAM, flash memory, or the like. Each unit of time information is given an "input channel number", a "session number" indicating which recording on the same medium it belongs to, and a "sequence number" indicating its position among the unit times. These are registered as "raw recording data" in the "Project management section", a part that has a directory management function for managing the entire recorded content of the medium.
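One way to picture the per-unit identifiers of [0024] is as a small fixed-size header in front of each unit. The byte layout below is an assumption made for illustration (the publication does not specify field widths), and `pack_unit` and `unpack_unit` are hypothetical names.

```python
import struct
from datetime import datetime, timezone

# Hypothetical header layout (big-endian, no padding):
#   channel  : 1 byte   input channel number
#   session  : 2 bytes  which recording on this medium
#   sequence : 4 bytes  position of the unit within the session
#   start_ms : 8 bytes  unit start time, milliseconds since the Unix epoch
HEADER = struct.Struct(">BHIQ")


def pack_unit(channel, session, sequence, start, payload):
    """Prefix one unit of compressed data with its identifying header."""
    start_ms = int(start.timestamp() * 1000)
    return HEADER.pack(channel, session, sequence, start_ms) + payload


def unpack_unit(record):
    """Recover the header fields and payload from a packed unit."""
    channel, session, sequence, start_ms = HEADER.unpack_from(record)
    start = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc)
    return channel, session, sequence, start, record[HEADER.size:]


t = datetime(2002, 3, 14, 1, 0, 0, tzinfo=timezone.utc)
record = pack_unit(1, 3, 42, t, b"payload")
print(unpack_unit(record)[:3])
# (1, 3, 42)
```

With channel, session, and sequence numbers carried by every unit, a reader can select one channel of one session and replay it in sequence order, which is what the playback description relies on.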

[0025] The start of the unit recording time m may be delimited at the same instant on all input channels, with the writes then staggered by a buffer that delays writing. Here, instead, a block (not shown) is provided that staggers the sequence boundary of each input channel by m/n according to the number of input channels n, which makes efficient use of the overall memory buffer.
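The m/n staggering of [0025] amounts to simple arithmetic, sketched here with a hypothetical helper `sequence_boundaries`. It assumes channel k (counting from zero) is shifted by k·m/n, which is one natural reading of the text.

```python
def sequence_boundaries(m, n, count):
    """Boundary times of the first `count` units on each of n channels.

    m is the unit recording time in seconds. Channel k (0-based) is shifted
    by k * m / n, so no two channels complete a unit at the same instant.
    """
    return {
        k + 1: [k * m / n + j * m for j in range(count)]
        for k in range(n)
    }


print(sequence_boundaries(m=2.0, n=2, count=3))
# {1: [0.0, 2.0, 4.0], 2: [1.0, 3.0, 5.0]}
```

Because the units of different channels become ready at evenly spaced instants, the writer can take them one at a time instead of receiving n completed units at once.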

[0026] FIG. 4 shows a configuration example that displays speech as text in synchronization with the video and audio of the speaker. While playing back the raw recording data of one channel from a medium recorded according to claims 1 and 2, it converts the speech to text, adds the actual time of the speech, and outputs the result. The configuration of FIG. 4 comprises recording medium 6, data reading unit A, audio decoder B, text data conversion unit C, phrase storage unit D, input/output I/F unit G, and video decoder V.

[0027] For playback, a channel number and a session number are specified, and data reading unit A reads the recording medium in sequence-number order. The video data is decoded by video decoder V. The audio is decoded by audio decoder B, which separates the audio from the time information. The video and audio signals are sent from input/output I/F unit G to an external TV or the like. At the same time, the time information for each sequence is received from audio decoder B and a timer is started. When audio decoder B reports that it has detected a voiced part of the audio signal, the start time of that voiced part is recalculated. Each such voiced part is called a "phrase". Phrase storage unit D temporarily stores the phrase start time. Text data conversion unit C converts the voiced part into speech-recognized character codes (using the known techniques of Japanese Patent Laid-Open No. 6-318096 or No. 2001-166790) and sends them to phrase storage unit D. Phrase storage unit D adds the stored phrase start time and a phrase number to the character codes and arranges them in the output format of FIG. 5.
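The phrase detection step of [0027] can be illustrated with a simple energy-threshold sketch. This is not the detection method of the publication, which is not specified. The function `split_phrases`, the per-frame level input, and the threshold and silence parameters are all assumptions.

```python
from datetime import datetime, timedelta


def split_phrases(levels, frame_s, t0, threshold, min_silence_frames):
    """Split per-frame audio levels into voiced runs ("phrases").

    levels             : one loudness value per audio frame
    frame_s            : frame duration in seconds
    t0                 : time stamp of frame 0 (from the recorded time information)
    threshold          : level at or above which a frame counts as voiced
    min_silence_frames : consecutive quiet frames that end a phrase

    Returns a list of (start_time, first_frame, end_frame_exclusive).
    """
    phrases = []
    start = None        # first voiced frame of the open phrase, if any
    last_voiced = None  # most recent voiced frame of the open phrase
    for i, level in enumerate(levels):
        if level >= threshold:
            if start is None:
                start = i
            last_voiced = i
        elif start is not None and i - last_voiced >= min_silence_frames:
            phrases.append((t0 + timedelta(seconds=start * frame_s), start, last_voiced + 1))
            start = None
    if start is not None:  # close a phrase still open at the end of the data
        phrases.append((t0 + timedelta(seconds=start * frame_s), start, last_voiced + 1))
    return phrases


t0 = datetime(2002, 3, 14, 10, 0, 0)
levels = [0, 5, 6, 0, 7, 0, 0, 0, 8, 9]
phrases = split_phrases(levels, frame_s=0.1, t0=t0, threshold=3, min_silence_frames=2)
print([(first, end) for _, first, end in phrases])
# [(1, 5), (8, 10)]
```

A single quiet frame (frame 3) does not end the first phrase, while two or more in a row do. Each phrase carries its recalculated start time, which is what would be attached to the recognized text.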

[0028] The character codes are output from input/output I/F unit G to an external text monitor, which renders them with its internal character font and displays them on screen.

[0029] In claim 3, the speech-recognized text is output to an external display device at the same time as the video and audio are played back. Next, for the invention of claim 4, a method is described for recording the text encoded for each phrase together with the video and audio reconstructed for each spoken phrase.

[0030] FIG. 6 shows a configuration example of a device that converts speech to text in synchronization with the video and audio and re-records the result. FIG. 7 shows an example of the format of recording medium 7 in FIG. 6. The configuration comprises recording medium 6, recording medium 7, reading unit A, audio decoder B, text data conversion unit C, phrase storage unit D, video decoder V, and input/output I/F unit G. As shown in FIG. 6, the video data is temporarily stored in phrase processing unit E. On receiving a phrase detection notification from text data conversion unit C, it is reconstructed as phrase-unit video data in the format 7-1 of FIG. 7. That format also contains the audio data from text data conversion unit C and the speech-recognized character codes (text), with phrase start time, from phrase storage unit D. The audio data may be thinned out and compressed relative to the original data to save medium capacity (processing not shown), and the video data may be thinned out and compressed in the same way.

[0031] As indicated by "Project management" in recording medium 7 of FIG. 7, data in this format is marked in the recorded part as "sound-phrase (audio phrase) format", which distinguishes it from "raw recording".

[0032] 7-2 shows an example of a recording layout divided into audio phrases. Because the duration differs from phrase to phrase, the phrase data is recorded in variable-length form, and the figure shows the lengths differing. A phrase that exceeds the predefined maximum phrase data length is split as needed using the "sub-sequence number" of 7-1. The split-off phrase data does not contain the "text" of the speech recognition output; it is padded with "NULL" data instead.
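The splitting rule of [0032] might look like the following sketch. It assumes that the first piece keeps the recognized text and that only the continuation pieces are padded with NULL, which is one reading of the paragraph. The dictionary layout and the name `split_phrase` are hypothetical.

```python
def split_phrase(payload, text, max_len):
    """Split one phrase into sub-sequences of at most max_len bytes.

    Assumption: sub-sequence 0 carries the recognized text; later
    sub-sequences carry None, standing in for the NULL padding.
    """
    pieces = []
    for sub, offset in enumerate(range(0, len(payload), max_len)):
        pieces.append({
            "subsequence": sub,
            "data": payload[offset:offset + max_len],
            "text": text if sub == 0 else None,
        })
    return pieces


pieces = split_phrase(b"x" * 10, "hello", max_len=4)
print([(p["subsequence"], len(p["data"]), p["text"]) for p in pieces])
# [(0, 4, 'hello'), (1, 4, None), (2, 2, None)]
```

A phrase that fits within the maximum length comes back as a single piece with sub-sequence number 0, so short and long phrases share the same record shape.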

[0033] In addition to the start time of each phrase, the date and time information of 7-2 may also record the end time or the duration of each phrase. Alternatively, the blank time from the end of the previous phrase to the start of the current phrase may be recorded in the following phrase.

[0034] FIG. 8 shows a configuration example of a device that displays the speech of several speakers as text in synchronization with their video and audio, and FIG. 9 shows an example of the display. The configuration comprises recording medium 6, data selection and reading unit AA, audio decoders B1 and B2, text data conversion units C1 and C2, phrase storage units D1 and D2, phrase rearrangement unit F, input/output I/F unit G, and video decoders V1 and V2. As shown in FIG. 8, the "raw recording" data of the individual channels (FIG. 3) is read alternately, channel by channel, by the reading block AA. The character codes with phrase times, processed per channel through the audio decoding and speech recognition blocks "B1, C1, D1" and "B2, C2, D2", are sorted into time order by phrase rearrangement unit F, given a channel number, and arranged in the output format of FIG. 9. The character codes are output from input/output I/F unit G to an external text monitor, which renders them with its internal character font and displays them on screen.
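The time-order rearrangement performed by unit F in [0034] is essentially a merge of per-channel lists that are each already in time order. A minimal sketch, assuming phrase start times are expressed as seconds from the start of the meeting and using hypothetical names `Phrase` and `merge_channels`:

```python
import heapq
from typing import NamedTuple


class Phrase(NamedTuple):
    start: float   # seconds from the start of the meeting (assumed representation)
    channel: int   # input channel number, identifying the speaker
    text: str      # speech-recognized text of the phrase


def merge_channels(*channels):
    """Merge per-channel phrase lists, each sorted by start time, into one list."""
    return list(heapq.merge(*channels, key=lambda p: p.start))


ch1 = [Phrase(0.0, 1, "Shall we begin?"), Phrase(5.2, 1, "Agreed.")]
ch2 = [Phrase(2.1, 2, "Yes, go ahead.")]
for p in merge_channels(ch1, ch2):
    print(f"{p.start:5.1f}s  ch{p.channel}  {p.text}")
```

Since each channel's phrases are produced in time order, a merge is sufficient and a full sort is not needed. The channel number travels with each phrase so the display can show who spoke.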

[0035] The phrase rearrangement unit F may also evaluate the time between phrases on the same channel and join multiple phrases into a single output unit. This reconciles the phrase segmentation needed for speech recognition with the readability of the resulting text when it is displayed. The sentence-composition phrase time is set to about 10 times the silence detection time used for determining voiced sections.
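The joining of same-channel phrases can be sketched as follows. The sketch assumes one reading of the text: two consecutive phrases on a channel are joined when the gap between them is shorter than the sentence-composition phrase time, taken here as 10 times the silence detection time. The 0.3-second silence detection value, the tuple layout, and the space used as a separator are hypothetical choices made for illustration.

```python
from typing import List, Tuple

SILENCE_DETECT_S = 0.3                    # hypothetical silence detection time
JOIN_THRESHOLD_S = 10 * SILENCE_DETECT_S  # "about 10 times" that value

# (start, end, text) for one channel, in time order
Phrase = Tuple[float, float, str]


def join_phrases(
    phrases: List[Phrase], threshold: float = JOIN_THRESHOLD_S
) -> List[Phrase]:
    """Join consecutive phrases whose inter-phrase gap is below the threshold."""
    joined: List[Phrase] = []
    for start, end, text in phrases:
        if joined and start - joined[-1][1] < threshold:
            # Short pause: treat as a continuation of the same sentence.
            prev_start, _, prev_text = joined[-1]
            joined[-1] = (prev_start, end, prev_text + " " + text)
        else:
            # Long pause (or first phrase): start a new output unit.
            joined.append((start, end, text))
    return joined
```

Joining is applied per channel in this sketch; how joined units interleave with phrases from other channels is not specified in the text and is left out.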

[0036]

[Effects of the Invention] As is apparent from the above description, according to the present invention, speakers captured from a plurality of input sources can be recorded independently with a simple configuration.

[0037] Further, according to the present invention, a speech record annotated with utterance times can be obtained from the voice of a speaker on an independent input source.

[0038] Further, according to the present invention, a speech record annotated with utterance times can be obtained from the voice of a speaker on an independent input source, the video and audio data can be compressed and re-recorded so that recording media are saved, and the speech record can be put to secondary use. In addition, the voices of speakers from a plurality of input sources can be batch-processed with a simple configuration, so the invention is applicable to multi-source input processing.

[0039] Further, according to the present invention, the speech records of a plurality of speakers can be displayed as text in the order of utterance, and the speech records can be put to secondary use.

[Brief description of drawings]

FIG. 1 is a diagram showing a configuration example of a device that adds time information to the video and audio of a plurality of digital sources and records them.

FIG. 2 is a diagram showing a configuration example of a device that adds time information to the video and audio of a plurality of analog sources and records them.

FIG. 3 is a diagram showing a format example of a medium on which the video and audio of a plurality of sources are recorded with time information added.

FIG. 4 is a diagram showing a configuration example of a device that displays utterances as text in synchronization with the video and audio of the speech.

FIG. 5 is a diagram showing a display example of a device that displays utterances as text in synchronization with the video and audio of the speech.

FIG. 6 is a diagram showing a configuration example of a device that converts utterances into text in synchronization with the video and audio of the speech and re-records them.

FIG. 7 is a diagram showing a format example of a medium recorded by a device that converts utterances into text in synchronization with the video and audio of the speech and re-records them.

FIG. 8 is a diagram showing a configuration example of a device that displays utterances as text in synchronization with the video and audio of a plurality of speakers.

FIG. 9 is a diagram showing a display example of a device that displays utterances as text in synchronization with the video and audio of a plurality of speakers.

FIG. 10 is a diagram showing a conventional example in which a plurality of video/audio sources are recorded on two VCRs.

FIG. 11 is a diagram showing, as a conventional example, a recording format of a professional-use VCR tape.

[Explanation of symbols]

1 Input channel
2 Date and time management block
3 Shaping unit
4 Switch
5 Writing unit
6, 7 Recording medium
A Data reading unit
B Audio decoder
C Text data conversion unit
D Phrase storage unit
E Phrase processing unit
F Phrase rearrangement unit
G Input/output I/F unit
H Reconstruction unit
V Video decoder

Continuation of front page
(51) Int.Cl.⁷, identification code, FI, theme code (reference): H04N 5/781 510J 510L

Claims

[Claims]

1. A multimedia recording device comprising: means for managing date and time; a plurality of input channel means for simultaneously inputting a plurality of digital data streams of video synchronized with audio; and means for digitally storing the digital data of each channel and the date and time information on a recording medium.

2. A recording medium recorded by a multimedia recording device comprising: means for managing date and time; a plurality of input channel means for simultaneously inputting a plurality of digital data streams of video synchronized with audio; and means for digitally storing the digital data of each channel and the date and time information on the recording medium.

3. A multimedia reproducing device comprising: means (A) for reading data from a medium on which audio/video and date and time information are digitally recorded; means for restoring audio data; means for detecting audio break positions; means for dividing the voiced audio portion into phrase units; speech recognition means (C) for converting phrase-unit speech into text data; means (D) for adding the date and time information to the phrase-unit text data; means (V) for restoring video data; and output means (G).

4. A multimedia editing device comprising: means (A) for reading data from a medium on which audio/video and date and time information are digitally recorded; means for restoring audio data; means for detecting audio break positions; means (B) for dividing the voiced audio portion into phrase units; speech recognition means (C) for converting phrase-unit speech into text data; means (D) for adding the date and time information to the phrase-unit text data; means (E) for phrase-processing the video data in synchronization with the timing at which the voiced audio portion is divided into phrase units; and means for digitally recording the data on a recording medium.

5. A recording medium recorded by a multimedia editing device comprising: means (A) for reading data from a medium on which audio/video and date and time information are digitally recorded; means for restoring audio data; means for detecting audio break positions; means (B) for dividing the voiced audio portion into phrase units; speech recognition means (C) for converting phrase-unit speech into text data; means (D) for adding the date and time information to the phrase-unit text data; means (E) for phrase-processing the video data in synchronization with the timing at which the voiced audio portion is divided into phrase units; and means for digitally recording the data on the recording medium.

6. A means (AA) for selectively reading data from a medium in which audio / video and date and time information from a plurality of input data are digitally recorded, a means for restoring audio data, and a means for detecting an audio interruption position. A means (B1) for converting the voiced part into phrase units; a voice recognition means (C1) for converting phrase units into text data; and a means (D1) for additionally creating the date / time information to the phrase-based text data. Means for selectively reading another source from the medium by means (AA), means for restoring voice data, means for detecting a voice discontinuity position, means for phraseizing the voiced sound portion into phrases (B2), voice recognition means (C2) for converting phrase unit voice into text data, and the date information is added to the phrase unit text data. (D1), a means (F) for alternately arranging the text data created in (D1) and (D2) in chronological order, and an output means (G) apparatus.