JPH11308565A

JPH11308565A - Method and system for displaying voice recording aid into video image and recording medium recording this method

Info

Publication number: JPH11308565A
Application number: JP10111733A
Authority: JP
Inventors: Yasumasa Niikura; 康巨新倉; Kenichi Minami; 憲一南; Yoshinobu Tonomura; 佳伸外村
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1998-04-22
Filing date: 1998-04-22
Publication date: 1999-11-05
Anticipated expiration: 2018-04-22
Also published as: JP3426957B2

Abstract

PROBLEM TO BE SOLVED: To provide an interface with which, when a voice is recorded into a video, an editor can easily instruct a speaker when and how to speak, and the speaker can easily speak according to that instruction. SOLUTION: First, a video section cutout means 10 cuts out the video section into which a voice is to be inserted. A text addition means 11 adds text information for the voice to be recorded to the cut-out video section. A voice synthesis means 12 synthesizes a voice from the added text information. A synthesized voice processing means 13 adjusts the length, pitch, intonation, timbre, utterance timing and the like of the synthesized voice. A text and voice instruction means 14 reproduces the processed synthesized voice and displays the utterance timing. A voice recording means 15 records the voice uttered by the speaker in response to this instruction. A comparison means 16 compares the degree of coincidence between the recorded voice and the processed synthesized voice to decide whether to adopt the recorded voice, and a recorded voice that is adopted is combined with the video section.

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to an interface that indicates the timing for inputting audio when audio information is added to a video, and in particular to a voice recording support display method and apparatus that make audio insertion easy and are useful in supporting video production.

【０００２】[0002]

2. Description of the Related Art With the rapid development of digital video processing technology, it has become possible to edit and create high-quality video easily with inexpensive equipment.

[0003] A prominent example of the use of digital video technology is the non-linear video editing device. By laying out the contents of a video, which is time-series information, spatially and allowing random access to each video section, it lets video be edited freely without being bound to the time sequence.

[0004] Audio information, however, is one-dimensional time-series information that is difficult to lay out spatially, and this has made it a major problem in editing.

[0005] For example, inserting narration or dialogue, an important step in video editing, has been done by cutting voice segments out of the recorded original audio, adjusting as needed the time at which each segment is placed in the video, and then placing it. When the duration or timing of the audio did not fit the video, the audio had to be re-recorded, often many times, until it fit.

[0006] As an alternative to recording such audio, methods have been proposed in which the text to be spoken is entered and audio is added by speech synthesis. One example is "A speech creation tool capable of realizing various speech expressions -Speed97-" (Spoken Language Information Processing 17-12, 1997/7/19; Masanobu Abe, Hideyuki Mizuno, Shinya Nakatsuta).

【０００７】[0007]

PROBLEMS TO BE SOLVED BY THE INVENTION However, with the above conventional method of entering text in place of audio and adding audio by speech synthesis, the quality of the synthesized voice is low and the individuality of a speaker is not reflected.

[0008] Therefore, to edit and produce a video work with high-quality audio, it has remained preferable to record and use live original sound, the human voice above all, in terms of both quality and the individual character of the voice.

[0009] As described above, however, recording original sound is difficult to carry out.

[0010] To solve the above problems, an object of the present invention is to provide an interface for video editing with which the editor can easily instruct the speaker when, and with what kind of utterance, the audio to be recorded should be spoken, and with which the speaker can easily understand that instruction and speak accordingly.

【００１１】[0011]

SUMMARY OF THE INVENTION The present invention solves the above problems by means of the inventions listed below.

[0012] Invention (1) is a method for supporting and displaying voice recording into a video, which displays the editor's requirements to the speaker for uttering a voice when audio information is inserted into the video, the method comprising: cutting out a video section by delimiting the section for which audio is to be input; adding, to the cut-out video section, text information corresponding to the audio to be inserted, thereby generating a video section with text; synthesizing speech from the text information of the video section with text to generate a synthesized voice; generating a processed synthesized voice by adjusting the length, pitch, intonation, timbre, utterance timing and the like of the synthesized voice; giving the speaker who utters the audio to be inserted concrete instructions, using the processed synthesized voice and the text information, as to what kind of audio should be added to the video section; recording what the speaker utters to obtain a recorded voice; comparing the degree of coincidence between the recorded voice and the processed synthesized voice; and, depending on the result of the comparison, generating a video section with audio from the recorded voice and the video section.

[0013] Invention (2) is a method for supporting and displaying voice recording into a video, which displays the editor's requirements to the speaker for uttering a voice when audio information is inserted into the video, the method comprising: cutting out a video section by delimiting the section for which audio is to be input; adding, to the cut-out video section, text information corresponding to the audio to be inserted, thereby generating a video section with text; synthesizing speech from the text information of the video section with text to generate a synthesized voice; applying processing to the synthesized voice to generate a processed synthesized voice; synchronizing the video section with text information and the processed synthesized voice and, as necessary, displaying them to the editor; extracting all three, or a subset, of the synchronized video section, processed synthesized voice and text information and reproducing them in synchronization, thereby indicating to the speaker who utters the audio to be inserted the requirements such as the voice to be uttered and the utterance timing; recording what the speaker utters to obtain a recorded voice; comparing the degree of coincidence between the recorded voice and the processed synthesized voice; and, depending on the result of the comparison, generating a video section with audio from the recorded voice and the video section.

[0014] Invention (3) is an apparatus for supporting and displaying voice recording into a video, which displays the editor's requirements to the speaker for uttering a voice when audio information is inserted into the video, the apparatus comprising: a video section cutout unit that delimits and cuts out the video section for which audio is to be input; a text addition unit that adds the text information of the audio to be recorded to the cut-out video section to generate a video section with text; a voice synthesis unit that generates a synthesized voice from the text information of the video section with text; a synthesized voice processing unit that processes the synthesized voice to generate a processed synthesized voice; a text/audio/video synchronization unit that synchronizes the video section with text information and the processed synthesized voice and, as necessary, displays them to the editor; a text synchronous playback unit that extracts all three, or a subset, of the synchronized video section, processed synthesized voice and text information, reproduces them in synchronization, and indicates to the speaker the requirements such as the voice to be uttered and the utterance timing; a voice recording unit that records the voice uttered by the speaker; a comparison unit that compares the degree of coincidence between the recorded voice and the processed synthesized voice; and a video/audio addition unit that, depending on the result of the comparison, adds the recorded voice to the video section.

[0015] Invention (4) is a recording medium on which the method for supporting and displaying voice recording into a video is recorded, characterized in that a program is recorded on a computer-readable recording medium, the program causing a computer to execute, for the case where the editor's requirements to the speaker for uttering a voice are displayed when audio information is inserted into a video: a procedure of delimiting and cutting out the video section for which audio is to be input; a procedure of adding, to the cut-out video section, text information corresponding to the audio to be inserted, thereby generating a video section with text; a procedure of synthesizing speech from the text information of the video section with text to generate a synthesized voice; a procedure of applying processing to the synthesized voice to generate a processed synthesized voice; a procedure of giving the speaker who utters the audio to be inserted concrete instructions, using the processed synthesized voice and the text information, as to what kind of audio should be added to the video section; a procedure of recording what the speaker utters to obtain a recorded voice; a procedure of comparing the degree of coincidence between the recorded voice and the processed synthesized voice; and a procedure of generating, depending on the result of the comparison, a video section with audio from the recorded voice and the video section.

[0016] Invention (5) is a recording medium on which the method for supporting and displaying voice recording into a video is recorded, characterized in that a program is recorded on a computer-readable recording medium, the program causing a computer to realize, for the case where the editor's requirements to the speaker for uttering a voice are displayed when audio information is inserted into a video: a function of delimiting and cutting out the video section for which audio is to be input; a function of adding the text information of the audio to be recorded to the cut-out video section to generate a video section with text information; a function of generating a synthesized voice from the text information of the video section with text information; a function of processing the synthesized voice to generate a processed synthesized voice; a function of synchronizing the video section with text information and the processed synthesized voice and, as necessary, displaying them to the editor; a function of extracting all three, or a subset, of the synchronized video section, processed synthesized voice and text information, reproducing them in synchronization, and indicating to the speaker the requirements such as the voice to be uttered and the utterance timing; a function of recording the voice uttered by the speaker; a function of comparing the degree of coincidence between the recorded voice and the processed synthesized voice; and a function of adding, depending on the result of the comparison, the recorded voice to the video section.

[0017] In the present invention, the editor enters as text the speech the speaker is to utter and manipulates the resulting synthesized voice, and can thereby easily express in advance when and how the text should be spoken. The speaker, in turn, using this text information and synthesized voice as a model, can grasp intuitively when, with what timing and in what mood to speak. Furthermore, because the synthesized voice and the speaker's voice can be compared, the speaker finds it easier to produce a voice close to the ideal the editor has in mind, the editor finds it easier to select and use takes from among the uttered voices, and the audio can then be added to the video.

【００１８】[0018]

DESCRIPTION OF THE PREFERRED EMBODIMENTS Embodiments of the present invention are described in detail below with reference to the drawings.

[0019] FIG. 1 shows the processing flow of a method for supporting voice recording into a video according to one embodiment of the present invention.

[0020] First, input video data 0, the material to be edited, is supplied. From it, the video section 1 to which audio is to be added is cut out using the video section cutout means 10. To the cut-out video section 1, the editor uses the text addition means 11 to add text information 2 corresponding to the audio to be added. This text information 2 is preferably written in phonetic symbols as far as possible.

[0021] For the video section with text 3 thus obtained, the voice synthesis means 12 generates a synthesized voice 4 from the text portion. The synthesized voice 4 is then processed in finer detail by the synthesized voice processing means 13 to obtain a processed synthesized voice 5. The processing by the synthesized voice processing means 13 may involve manual work or may be fully automatic.

[0022] Guided by the processed synthesized voice 5, which is presented through the text and voice instruction means 14, the insertion voice 6 to be inserted is obtained. This insertion voice 6 is recorded by the utterance recording means 15, yielding a recorded voice 7.

[0023] The comparison means 16 compares how closely the recorded voice 7 matches the processed synthesized voice 5. If it is decided to use the recorded voice 7, it is adopted and combined with the video section to generate a video section with audio 8.
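The flow of FIG. 1 can be sketched in code as follows. This is only an illustrative outline under stated assumptions, not the implementation of the patent: the names `VideoSection`, `cut_section` and `run_flow` are hypothetical, audio is represented as a plain list of numbers, and the synthesis, processing, recording and scoring steps are passed in as callables because the text leaves their internals to existing technology.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Audio = List[float]  # hypothetical stand-in for audio samples or a level contour


@dataclass
class VideoSection:
    start_s: float                 # start of the section in the input video
    end_s: float                   # end of the section in the input video
    text: str = ""                 # text information 2
    audio: Optional[Audio] = None  # recorded voice 7, once adopted


def cut_section(total_s: float, start_s: float, end_s: float) -> VideoSection:
    """Means 10: cut out the section that should receive audio."""
    if not 0.0 <= start_s < end_s <= total_s:
        raise ValueError("section must lie inside the input video")
    return VideoSection(start_s, end_s)


def run_flow(
    section: VideoSection,
    text: str,
    synthesize: Callable[[str], Audio],
    process: Callable[[Audio, VideoSection], Audio],
    record: Callable[[Audio, str], Audio],
    score: Callable[[Audio, Audio], float],
    threshold: float,
) -> bool:
    """Run means 11 to 16 once; return True if the take was adopted."""
    section.text = text                         # means 11: add text
    guide = process(synthesize(text), section)  # means 12 and 13
    take = record(guide, text)                  # means 14 and 15
    if score(take, guide) >= threshold:         # means 16: compare
        section.audio = take                    # video section with audio 8
        return True
    return False
```

If the take is rejected, the caller would simply invoke `run_flow` again for another recording attempt; the section keeps its text and remains without audio until a take is adopted.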

[0024] FIG. 2 is a block diagram showing a configuration example of an apparatus for supporting voice recording into a video according to one embodiment of the present invention.

[0025] The apparatus has a recording medium 100 that stores the video and the results of processing the target video; the input video 20 is first stored on the recording medium 100. The input video 20 need only be digitized and may be any kind of video data: uncompressed video data, data compressed by JPEG or the like, or MPEG-encoded data. The file format may likewise be any of various formats such as AVI or QuickTime. The recording medium 100, which records the input video and the various processing results, may be magnetic, optical or otherwise, for example tape, CD-R, DVD or memory, but is preferably a medium that can be read and written at sufficiently high speed.

[0026] From the input video 20, the editor 30 issues instructions through the editor instruction unit 50 to the video section cutout unit 40 to cut out an arbitrary video section 21. The video section cutout unit 40 may perform the cut entirely manually or automatically. For automatic cutting, conventional techniques such as that of Japanese Unexamined Patent Publication No. H5-37853, "Method for automatically dividing moving images into cuts", can be used.

[0027] Likewise through the editor instruction unit 50, the editor 30 adds text information 31 describing the textual content of the audio to be added to the video. This text information 31 is preferably written in phonetic symbols, but ideographic symbols or the like are also acceptable as long as their pronunciation is uniquely determined. The symbols need not be characters at all, provided that how they are to be pronounced can be uniquely determined. The text information addition unit 41 adds the text information 31 to the video section 21.

[0028] The text information 31 is not only added to the video; the voice synthesis unit 51 also synthesizes speech from it to generate a synthesized voice 32. Its length, pitch, intonation, timbre and so on may be arbitrary. Because the synthesized voice 32 generally sounds mechanical and its intonation and the like may not be ideal, the synthesized voice processing unit 52 specifies and adjusts information such as the intonation and utterance timing of the synthesized voice 32. Since the synthesized voice 32 is intended to be attached to the video section 21, the synthesized voice processing unit 52 can perform this processing with reference to the video section with text information 22 so that the result fits the video section 21.

[0029] That is, the processed synthesized voice 33 is processed down to such details as where along the timeline of the video the synthesized voice begins and where it ends. The processed synthesized voice 33 thus obtained is synchronized with the video section with text information 22 by the text/audio/video synchronization unit 42, producing the video and audio-corresponding text 23, which can be reproduced with the video, the synthesized voice and the text fully synchronized. The text here is the text information from which the synthesized voice was generated, and its pronunciation must be uniquely determined; when the video and audio are played, it is shown explicitly which part of the text information is currently being reproduced. How this is shown is described later.
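One way to picture the synchronized text is as a list of phonetic symbols, each tagged with the time span it occupies inside the video section. The sketch below is an assumption made for illustration: the function name `place_text` is hypothetical, and the symbols are spread evenly between the chosen start and end times, whereas a real system would presumably take per-symbol timing from the processed synthesized voice.

```python
from typing import List, Tuple

TimedSymbol = Tuple[str, float, float]  # (symbol, start time, end time) in seconds


def place_text(symbols: str, start_s: float, end_s: float,
               section_len_s: float) -> List[TimedSymbol]:
    """Spread the symbols evenly over [start_s, end_s) of a video section."""
    if not symbols:
        raise ValueError("no symbols to place")
    if not 0.0 <= start_s < end_s <= section_len_s:
        raise ValueError("speech must start and end inside the video section")
    step = (end_s - start_s) / len(symbols)
    return [(ch, start_s + i * step, start_s + (i + 1) * step)
            for i, ch in enumerate(symbols)]
```

For instance, `place_text("abcd", 1.0, 3.0, 5.0)` assigns each of the four symbols a half-second slot beginning one second into a five-second section.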

[0030] When the resulting video and audio-corresponding text 23 is played by the text synchronous playback unit 44, the speaker 35, who actually voices the audio to be added, is shown explicitly, while the video plays, when and with what timing to utter which sounds.

[0031] The speaker 35 speaks into the utterance recording unit 45, and a recorded voice 36 is obtained. The voice comparison unit 46 compares the recorded voice 36 with the processed synthesized voice 33 to see whether they match; if the difference is small and the take is usable for editing, a recorded voice for use 37 is selected and extracted from the recorded voice 36. The video/audio addition unit 47 combines the recorded voice for use 37 with the video and audio-corresponding text 23 to generate a video with audio 38 carrying the speaker's voice. The video with audio 38 is stored on the recording medium 100.

[0032] The intermediate data generated in the blocks, namely the video section 21, the video section with text information 22, the text information 31, the synthesized voice 32, the processed synthesized voice 33, the recorded voice 36, the video and audio-corresponding text 23, the recorded voice for use 37 and so on, are stored on the recording medium 100 as appropriate. The block diagram of FIG. 2 explicitly shows only some of them being stored, but extensions such as storing all of this information are readily conceivable and fall within the scope of the embodiments of the present invention.

[0033] Some of the blocks in FIG. 2 are described further below.

[0034] The voice synthesis unit 51 generates the synthesized voice 32 from text information 31 consisting of phonetic symbols whose pronunciation is uniquely determined. Various tools for generating synthesized speech have already been proposed, including the example cited under the related art. Likewise, for the synthesized voice processing unit 52, a number of means for arbitrarily changing the start point, intonation and duration of the synthesized voice 32 have been proposed, beginning with Speed97 cited under the related art. In short, generating and processing synthesized speech can be achieved with existing technology.

[0035] Next, in the text synchronous playback unit 44, it is desirable that the concrete intent of the editor 30, namely at what timing to start, what to pronounce and where to finish, be easy for the speaker 35 who actually speaks to understand. Audio and video are both time-series information, so the most effective approach is to display the text information in phonetic symbols while keeping the time axis visible. FIG. 3 shows two concrete examples of the text synchronous playback unit 44.

[0036] FIG. 3(a) shows the first example, in which the time-series information is laid out spatially: some or all of the frames of the video are displayed. Some or all of the information on the audio track is likewise displayed, together with the corresponding text information. As the video plays and the displayed frame and audio track advance along the timeline, the corresponding text information advances with them.

[0037] FIG. 3(b) shows the second example, in which the video is shown on a video display means such as a monitor and the audio is played through an audio output means such as a speaker, while the text information is displayed in part of the monitor or on a dedicated text display. The text advances in synchronization with the video: characters that have already been played change color, for example, to show the progress through the text.
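The second display style amounts to splitting the text at the current playback position. The following is a minimal sketch under assumptions not found in the text: the function name `render` is hypothetical, each symbol carries its own start and end time, a symbol counts as played once its start time has been reached, and square brackets stand in for the color change.

```python
from typing import List, Tuple

TimedSymbol = Tuple[str, float, float]  # (symbol, start time, end time) in seconds


def render(timed: List[TimedSymbol], playhead_s: float) -> str:
    """Mark symbols whose start time has been reached; brackets stand for a color change."""
    text = "".join(symbol for symbol, _, _ in timed)
    played = sum(1 for _, start, _ in timed if start <= playhead_s)
    return f"[{text[:played]}]{text[played:]}"
```

Calling `render` once per displayed frame with the current playback time would move the bracketed portion forward through the text as the video plays.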

[0038] Next, the utterance recording unit 45 records the voice of the speaker 35. The recorded voice 36 may be recorded in mono or stereo, may be encoded in some way, and may use any bandwidth. Whether a recorded voice 36 is adopted as the recorded voice for use 37 is decided by comparing it with the processed synthesized voice 33 in the voice comparison unit 46; based on the result, the decision is made automatically and by the editor. The voice comparison unit 46 has the function of checking whether the time-series information of the synthesized voice matches that of the uttered voice. This can be done, for example, by comparing the level of the synthesized voice with the level of the recorded voice and evaluating how well the relative changes in level agree.

[0039] Besides this comparison of audio levels, other approaches are possible, such as comparing changes in waveform or in frequency distribution. FIG. 4 shows several comparison methods. In the volume comparison shown there, the volume contour of the processed synthesized voice is compared with that of the recorded voice. Where the difference stays close to 0, the two volume contours do not diverge and the voices are judged similar; where the difference is large, the contours diverge and the voices are judged different. The volumes are normalized before the comparison. In addition to comparing volume, the evaluation can also be made in the frequency domain by extracting pitch information from each voice and comparing the frequency components.
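The volume comparison can be sketched as follows. The details are assumptions chosen for illustration, since the text does not specify them: the level is measured as the root-mean-square value of fixed-length, non-overlapping frames, normalization divides each contour by its own peak, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def rms_envelope(samples: Sequence[float], frame: int) -> List[float]:
    """Root-mean-square level of each complete, non-overlapping frame."""
    if frame <= 0:
        raise ValueError("frame length must be positive")
    envelope = []
    for i in range(0, len(samples) - frame + 1, frame):
        chunk = samples[i:i + frame]
        envelope.append(math.sqrt(sum(x * x for x in chunk) / frame))
    return envelope


def normalize(envelope: Sequence[float]) -> List[float]:
    """Scale a contour so that its peak is 1; a silent contour is left unchanged."""
    peak = max(envelope, default=0.0)
    return [v / peak for v in envelope] if peak > 0 else list(envelope)


def level_difference(guide: Sequence[float], take: Sequence[float],
                     frame: int) -> List[float]:
    """Per-frame absolute difference between the two normalized volume contours."""
    a = normalize(rms_envelope(guide, frame))
    b = normalize(rms_envelope(take, frame))
    return [abs(x - y) for x, y in zip(a, b)]
```

Because both contours are normalized first, a take that follows the same loud and quiet pattern as the guide at a lower overall volume yields differences near 0, while a take whose loud and quiet parts are swapped yields large differences.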

[0040] The result from the voice comparison unit 46 may be treated purely as reference information, with the editor 30 making the final decision by adding a qualitative assessment, or the comparison result may be evaluated quantitatively to make the decision.
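The two decision modes can be illustrated with a small sketch. It assumes that the comparison has already produced a list of per-frame differences, that their mean serves as the score, and that the editor is modeled as an optional callback; the function name `adopt_take`, the threshold parameter and the callback are hypothetical.

```python
from statistics import mean
from typing import Callable, Optional, Sequence


def adopt_take(diffs: Sequence[float], max_mean_diff: float,
               editor_review: Optional[Callable[[float], bool]] = None) -> bool:
    """Decide whether a recorded take is adopted.

    With editor_review given, the score is advisory and the editor decides;
    otherwise the take is adopted when the mean difference is small enough.
    """
    score = mean(diffs) if diffs else float("inf")  # no frames: never auto-adopt
    if editor_review is not None:
        return editor_review(score)
    return score <= max_mean_diff
```

In the advisory mode the editor can still adopt a take whose score is poor, which matches the role of the comparison result as reference information only.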

【００４１】以上、映像中への音声録音支援方法及び装
置について説明を加えてきた。上述の実施形態例に対し
て、さまざまなアレンジを加えることは容易に類推で
き、それらも本発明の範囲内である。The method and apparatus for supporting audio recording in a video have been described above. Various modifications to the embodiment described above can easily be conceived, and they are also within the scope of the present invention.

【００４２】なお、図１を用いて説明した処理の手順を
記載したコンピュータプログラム、あるいは図２で説明
した各ブロックの機能を実現するコンピュータプログラ
ムを記録した記録メディアと、映像等の保存メディア
と、記録メディアからプログラムを読み取り、実行する
ことが可能なコンピュータが存在すれば、本発明を実施
することが可能である。記録メディアとしては、例えば
ＲＯＭやＦＤ、メモリカード、ＭＯ、ＣＤ、ＤＶＤなど
や、それらに類するものであって、コンピュータが読み
取り可能なものであれば、何でも構わない。Note that the present invention can be implemented as long as there are a recording medium on which is recorded a computer program describing the processing procedure explained with reference to FIG. 1, or a computer program realizing the function of each block explained with reference to FIG. 2, a storage medium for video and the like, and a computer that can read the program from the recording medium and execute it. The recording medium may be, for example, a ROM, FD, memory card, MO, CD, DVD, or the like, or anything similar, as long as it is computer-readable.

【００４３】[0043]

【発明の効果】本発明によれば、映像に音声情報を付加
する場合に、編集者が実際に録音する声を発声する人へ
指示しやすく、かつ、発声者はいつ、どのタイミング
で、どの内容をどのようなイントネーションで発音すれ
ば良いのかという編集者からの指示を具体的に受け取る
ことを可能とするインタフェースを提供することができ
る。これにより、録音する音声の発声者の負担を軽減す
ることが可能となり、音声付き映像作品を簡単に制作す
ることが可能となる。よって、映像制作を支援する効果
がある。According to the present invention, when audio information is added to a video, it is possible to provide an interface that makes it easy for the editor to instruct the person who actually utters the voice to be recorded, and that lets the speaker receive concrete instructions from the editor as to when, at what timing, what content, and with what intonation to speak. This reduces the burden on the speaker of the voice to be recorded and makes it easy to produce a video work with audio. The invention therefore has the effect of supporting video production.

[Brief description of the drawings]

【図１】本発明の映像中の音声録音支援方法の一実施形
態例を説明する処理フロー図である。FIG. 1 is a process flowchart illustrating an embodiment of a method for supporting audio recording in a video according to the present invention.

【図２】本発明の映像中の音声録音支援装置の一実施形
態例を説明するブロック図である。FIG. 2 is a block diagram illustrating an embodiment of an apparatus for supporting audio recording in a video according to the present invention.

【図３】（ａ），（ｂ）は、上記映像中の音声録音支援
装置の一実施形態例におけるテキスト同期再生部の実現
イメージを示す図である。FIGS. 3A and 3B are diagrams showing an image of a text synchronous reproduction unit in the audio recording support apparatus for video in the embodiment of the present invention.

【図４】上記映像中の音声録音支援装置の一実施形態例
における音声比較部での音声比較方法の一例を説明する
図である。FIG. 4 is a diagram illustrating an example of a sound comparison method in a sound comparison unit in the embodiment of the sound recording support apparatus for a video.

[Explanation of symbols]

１０…映像区間切り出し手段１１…テキスト付加手段１２…音声合成手段１３…合成音声加工手段１４…テキスト及び音声指示手段１５…発生録音手段１６…比較手段４０…映像区間切り出し部４１…テキスト情報付加部４２…テキスト音声同期部４４…テキスト同期再生部４５…音声録音部４６…音声比較部４７…映像音声付加部５０…編集者指示部５１…音声合成部５２…合成音声加工部１００…記録メディア
DESCRIPTION OF SYMBOLS
10: Video section cutout means
11: Text addition means
12: Voice synthesis means
13: Synthesized voice processing means
14: Text and voice instruction means
15: Utterance recording means
16: Comparison means
40: Video section cutout unit
41: Text information addition unit
42: Text/voice synchronization unit
44: Text synchronous reproduction unit
45: Voice recording unit
46: Voice comparison unit
47: Video/audio addition unit
50: Editor instruction unit
51: Voice synthesis unit
52: Synthesized voice processing unit
100: Recording medium

Claims

[Claims]

1. A method for supporting and displaying audio recording in a video, which, when audio information is inserted into a video, displays the requirements from an editor to a speaker for uttering the voice, the method comprising: cutting out a video section in which audio is to be input; adding text information corresponding to the voice to be inserted to the cut-out video section to generate a video section with text; synthesizing voice from the text information of the video section with text to generate a synthesized voice; processing the synthesized voice to generate a processed synthesized voice; giving the speaker who utters the voice to be inserted, by means of the processed synthesized voice and the text information, specific instructions as to what kind of voice should be added to the video section; recording the content uttered by the speaker to obtain a recorded voice; comparing the degree of coincidence between the recorded voice and the processed synthesized voice; and, based on the result of the comparison, generating a video section with audio from the recorded voice and the video section.

2. A method according to claim 1 for supporting and displaying audio recording in a video, which, when audio information is inserted into a video, displays the requirements from an editor to a speaker for uttering the voice, the method comprising: cutting out a video section in which audio is to be input; adding text information corresponding to the voice to be inserted to the cut-out video section to generate a video section with text; synthesizing voice from the text information of the video section with text to generate a synthesized voice; processing the synthesized voice to generate a processed synthesized voice; synchronizing the video section with text information with the processed synthesized voice and displaying them to the editor as necessary; taking out all three or some of the synchronized video section, processed synthesized voice, and text information, reproducing them in synchronization, and instructing the speaker who utters the voice to be inserted on requirements such as the voice to be uttered and the utterance timing; recording the content uttered by the speaker to obtain a recorded voice; comparing the degree of coincidence between the recorded voice and the processed synthesized voice; and, based on the result of the comparison, generating a video section with audio from the recorded voice and the video section.

3. A display apparatus for supporting audio recording in a video, which, when audio information is inserted into a video, displays the requirements from an editor to a speaker for uttering the voice, the apparatus comprising: a video section cutout unit that delimits and cuts out a video section in which audio is to be input; a text addition unit that adds, to the cut-out video section, text information of the voice to be recorded into the video, to generate a video section with text; a voice synthesis unit that generates a synthesized voice from the text information of the video section with text; a synthesized voice processing unit that processes the synthesized voice to generate a processed synthesized voice; a text/audio/video synchronization unit that synchronizes the video section with text information with the processed synthesized voice and displays them to the editor as necessary; a text synchronous reproduction unit that takes out all three or some of the synchronized video section, processed synthesized voice, and text information, reproduces them in synchronization, and instructs the speaker on requirements such as the voice to be uttered and the utterance timing; a voice recording unit that records the voice uttered by the speaker; a comparison unit that compares the degree of coincidence between the recorded voice and the processed synthesized voice; and a video/audio addition unit that adds the recorded voice to the video section based on the comparison result.

4. A recording medium recording a method for supporting and displaying audio recording in a video, characterized in that a program is recorded on a computer-readable recording medium, the program causing a computer to execute, when audio information is inserted into a video and the requirements from an editor to a speaker for uttering the voice are displayed: a procedure for cutting out a video section in which audio is to be input; a procedure for adding text information corresponding to the voice to be inserted to the cut-out video section to generate a video section with text; a procedure for synthesizing voice from the text information of the video section with text to generate a synthesized voice; a procedure for processing the synthesized voice to generate a processed synthesized voice; a procedure for giving the speaker who utters the voice to be inserted, by means of the processed synthesized voice and the text information, specific instructions as to what kind of voice should be added to the video section; a procedure for recording the content uttered by the speaker to obtain a recorded voice; a procedure for comparing the degree of coincidence between the recorded voice and the processed synthesized voice; and a procedure for generating a video section with audio from the recorded voice and the video section based on the comparison result.

5. A recording medium recording a method for supporting and displaying audio recording in a video, characterized in that a program is recorded on a computer-readable recording medium, the program causing a computer to realize, when audio information is inserted into a video and the requirements from an editor to a speaker for uttering the voice are displayed: a function of cutting out a video section in which audio is to be input; a function of adding, to the cut-out video section, text information of the voice to be recorded into the video, to generate a video section with text information; a function of generating a synthesized voice from the text information of the video section with text information; a function of processing the synthesized voice to generate a processed synthesized voice; a function of synchronizing the video section with text information with the processed synthesized voice and displaying them to the editor as necessary; a function of taking out all three or some of the synchronized video section, processed synthesized voice, and text information, reproducing them in synchronization, and instructing the speaker on requirements such as the voice to be uttered and the utterance timing; a function of recording the voice uttered by the speaker; a function of comparing the degree of coincidence between the recorded voice and the processed synthesized voice; and a function of adding the recorded voice to the video section based on the comparison result.