JP4594908B2 - Commentary-added audio generation device and commentary-added audio generation program

Info

Publication number: JP4594908B2
Application number: JP2006210121A
Authority: JP
Inventors: 徹都木; 信正清山; 寛之世木; 礼子齋藤
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2006-08-01
Filing date: 2006-08-01
Publication date: 2010-12-08
Anticipated expiration: 2026-08-01
Also published as: JP2008039845A

Description

The present invention relates to a commentary-added audio generation device and a commentary-added audio generation program that add, to a video, commentary audio describing the content of that video.

Conventionally, in producing a commentary broadcast program for television, the program is first completed by the normal procedure, and a specialist then writes a commentary manuscript, separate from the program's script or screenplay, that describes scenes and the content of captions for visually impaired viewers. In a studio, under the direction of a director for the commentary broadcast program, a narrator reads the commentary manuscript with precise timing in the sections of silence or background sound only (pause sections) between one speaking section and the next, so that the commentary does not overlap the sections of the video audio that contain speech such as dialogue or narration (speaking sections). The commentary broadcast program is produced by adding commentary audio in this way.

Such a commentary broadcast program uses the "main audio" and "sub audio" modes of television audio: the video audio is carried on the "main audio" channel, and the video audio with commentary audio added is carried on the "sub audio" channel. In analog broadcasting both the main audio and the sub audio are monaural, whereas in digital broadcasting both can be stereo, in terrestrial and satellite broadcasting alike.

There are also commentary broadcasts for stage performances such as classical performing arts and foreign-language plays, in which the audience receives, through wireless earphone receivers, background on the performance and supplementary explanations of difficult lines. Techniques for supporting this commentary work have been disclosed (see Patent Document 1 and Patent Document 2). In these techniques, commentary audio is recorded in advance and an identification number is assigned to each commentary unit. The playback order and timing are either set in advance based on these identification numbers or specified by an operator, so that the commentary is broadcast in step with the progress of the performance.
Patent Document 1: JP 2002-26830 A (paragraphs 0011 to 0042)
Patent Document 2: JP 2002-26840 A (paragraphs 0010 to 0033)

However, in conventional production of commentary broadcast programs for television, the writer of the commentary manuscript (scenario writer) needed knowledge and skill, for example to choose wording that is effective within the limited time of a pause section and does not spoil the atmosphere of the program. Recording the commentary audio required not only a professional narrator but also a recording studio. Furthermore, the timing of the start of speech and the speaking rate had to be adjusted during recording, so a great deal of time and expense was needed, including rehearsals. As a result, commentary broadcast programs are not widespread: according to press material released by the Ministry of Internal Affairs and Communications in August 2005, commentary broadcasts accounted for only 3.2% of total broadcast time in fiscal 2004 on NHK (General), 7.9% on NHK (Educational), and 0.5% on the five key commercial stations.

In addition, the technique described in Patent Document 1 is specialized for work related to stage performances and for the theater environment. It cannot be applied to commentary broadcasts such as television broadcasts, in which commentary audio must be added within time-limited pause sections so as not to overlap the speaking sections.

The present invention has been made to solve the above problems of the prior art. Its object is to provide a commentary-added audio generation device and a commentary-added audio generation program that can produce the audio of a commentary broadcast program in a short time and at low cost.

To solve the above problem, the commentary-added audio generation device according to claim 1 receives, from outside, video audio, which is the audio of a video, and text data related to the content of the video, and generates commentary-added audio in which commentary audio obtained by converting the text data into speech is added to the video audio. The device comprises speech synthesis means, section detection means, speech-rate conversion means, and audio addition means.

With this configuration, the commentary-added audio generation device uses the speech synthesis means to synthesize commentary audio from the text data, and uses the section detection means to detect, on the time axis of the playback time of the video audio, utterance sections, which are audio sections containing speech, and pause sections, which are audio sections of silence or background sound only. The device then uses the speech-rate conversion means to convert the speech rate of the commentary audio based on the length of the pause section, and uses the audio addition means to add the rate-converted commentary audio to the detected pause section of the video audio, thereby generating the commentary-added audio.

The commentary-added audio generation device can thus generate commentary audio from text data and add it to the pause sections of the video audio to generate commentary-added audio.

The commentary-added audio generation device according to claim 2 receives, from outside, video audio, which is the audio of a video, and text data related to the content of the video, and generates commentary-added audio in which commentary audio obtained by converting the text data into speech is added to the video audio. The device comprises speech synthesis means, section detection means, speech-rate conversion means, and audio addition means.

With this configuration, the commentary-added audio generation device uses the speech synthesis means to synthesize commentary audio from the text data, and uses the section detection means to detect, on the time axis of the playback time of the video audio, utterance sections, which are audio sections containing speech, and pause sections, which are audio sections of silence or background sound only. The speech-rate conversion means then converts the speech rate of the video audio corresponding to the utterance section, and also converts the speech rate of the commentary audio based on the length by which the utterance section was expanded or contracted by that conversion and on the length of the pause section. The audio addition means adds the rate-converted commentary audio to the pause section, that is, the audio section of silence or background sound only, of the rate-converted video audio.

The commentary-added audio generation device can thus convert the speech rate of both the video audio in the utterance sections and the commentary audio, add the commentary audio to the video audio, and generate commentary-added audio.

The commentary-added audio generation device according to claim 3 is the device according to claim 2, further comprising video speed conversion means that expands or contracts the length of the section of the externally input video corresponding to the utterance section, based on information indicating the expansion or contraction caused by speech-rate conversion in the utterance section of the video audio converted by the speech-rate conversion means.

By expanding or contracting the video corresponding to the video audio in step with the expansion or contraction of that audio caused by speech-rate conversion, the device can generate, together with the commentary-added audio, video synchronized with the commentary-added audio.

The commentary-added audio generation program according to claim 4 causes a computer to function as speech synthesis means, section detection means, speech-rate conversion means, and audio addition means in order to receive, from outside, video audio, which is the audio of a video, and text data related to the content of the video, and to generate commentary-added audio in which commentary audio obtained by converting the text data into speech is added to the video audio.

With this configuration, the commentary-added audio generation program uses the speech synthesis means to synthesize commentary audio from the text data, and uses the section detection means to detect, on the time axis of the playback time of the video audio, utterance sections, which are audio sections containing speech, and pause sections, which are audio sections of silence or background sound only. The program then uses the speech-rate conversion means to convert the speech rate of the commentary audio based on the length of the pause section, and uses the audio addition means to add the rate-converted commentary audio to the detected pause section of the video audio, thereby generating the commentary-added audio.

The commentary-added audio generation program can thus generate commentary audio from text data and add it to the pause sections of the video audio to generate commentary-added audio.

The commentary-added audio generation program according to claim 5 causes a computer to function as speech synthesis means, section detection means, speech-rate conversion means, and audio addition means in order to receive, from outside, video audio, which is the audio of a video, and text data related to the content of the video, and to generate commentary-added audio in which commentary audio obtained by converting the text data into speech is added to the video audio.

With this configuration, the commentary-added audio generation program uses the speech synthesis means to synthesize commentary audio from the text data, and uses the section detection means to detect, on the time axis of the playback time of the video audio, utterance sections, which are audio sections containing speech, and pause sections, which are audio sections of silence or background sound only. The speech-rate conversion means then converts the speech rate of the video audio corresponding to the utterance section, and also converts the speech rate of the commentary audio based on the length by which the utterance section was expanded or contracted by that conversion and on the length of the pause section. The audio addition means adds the rate-converted commentary audio to the pause section, that is, the audio section of silence or background sound only, of the rate-converted video audio.

The commentary-added audio generation device and commentary-added audio generation program according to the present invention provide the following advantageous effects. According to the inventions of claims 1 and 4, commentary audio is generated from text data by speech synthesis, so a professional narrator, a recording studio, recording rehearsals, and the like are unnecessary, which reduces cost and shortens production time.

In addition, because the commentary audio is rate-converted to match the length of the pause section and its duration is thereby adjusted, even a verbose text can be turned into commentary audio that fits the pause section. Writing the manuscript therefore does not require the level of skill needed conventionally, and commentary-added audio can be generated using commentary audio produced by speech synthesis. Furthermore, manuscripts become easier to write, so scenario writers become easier to recruit, and the cost and time of producing commentary-added audio can be reduced. This encourages the production of commentary broadcast programs and contributes to the spread of commentary broadcasting.

According to the inventions of claims 2 and 5, even when the pause section is short relative to the commentary audio, the video audio is also rate-converted to lengthen the pause section, which prevents the speech rate of the commentary audio from becoming extremely fast. Moreover, since both the speech rate of the video audio and that of the commentary audio can be adjusted, the two rates can be made roughly equal, which gives an overall balanced speech rate and commentary-added audio that is easy to listen to.
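As a rough illustration of how one shared rate could balance the two, the sketch below assumes that a single utterance section and the pause section that follows it must keep their combined duration, and that the video audio and the commentary audio are both sped up by the same factor. The function name, the parameters, and the formula itself are assumptions made for illustration; the text above does not state how the two rates are actually chosen.

```python
def common_speech_rate(utterance_sec: float, pause_sec: float, commentary_sec: float) -> float:
    """Return one speed-up factor r applied to both the video audio and the commentary.

    Assumption (not taken from the patent): the converted utterance
    (utterance_sec / r) plus the converted commentary (commentary_sec / r)
    must exactly fill the original utterance + pause time, which gives
    r = (utterance + commentary) / (utterance + pause).
    """
    if utterance_sec < 0 or commentary_sec < 0 or utterance_sec + pause_sec <= 0:
        raise ValueError("durations must be non-negative and the time slot non-empty")
    return (utterance_sec + commentary_sec) / (utterance_sec + pause_sec)


# Hypothetical example: 8 s of dialogue, a 2 s pause, 4 s of synthesized commentary.
rate = common_speech_rate(8.0, 2.0, 4.0)
new_utterance = 8.0 / rate    # dialogue after conversion
new_commentary = 4.0 / rate   # commentary after conversion
```

In this hypothetical case, squeezing the 4 s commentary into the 2 s pause on its own would require doubling its speed, whereas sharing the compression with the dialogue lowers the factor for both.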

According to the invention of claim 3, video synchronized with the commentary-added audio can be generated, so a commentary broadcast program can be produced that is usable not only by visually impaired viewers but also by viewers with low vision and by sighted viewers.

Embodiments of the present invention are described below with reference to the drawings.
[Configuration of the commentary broadcast program generation device]
First, the configuration of the commentary broadcast program generation device 1 according to the present invention is described with reference to FIG. 1. FIG. 1 is a schematic diagram showing the configuration of the commentary broadcast program generation device according to the present invention.

The commentary broadcast program generation device (commentary-added audio generation device) 1 receives a video, the video audio that is the audio of this video, and a commentary manuscript, which is text data describing the content of each scene of the video. It generates sub audio (commentary-added audio), in which commentary audio synthesized from the commentary manuscript is added to the pause sections of the video audio, that is, the sections of silence or background sound only, together with sub-audio-synchronized video, which is the video corresponding to that sub audio. The commentary broadcast program generation device 1 comprises commentary manuscript input means 11, video speed conversion means 12, audio analysis means 13, speech synthesis means 14, speech-rate conversion means 15, audio connection means 16, video/audio output means 17, original video output means 18, sub-audio-synchronized video output means 19, main audio output means 20, and sub audio output means 21.

A program video storage device 3 and a commentary broadcast program storage device 7 are externally connected to the commentary broadcast program generation device 1. The program video storage device 3 stores the videos of a plurality of television programs and the video audio of those videos, and consists of general storage means such as a hard disk. Video and video audio are input to the program video storage device 3 by input means (not shown) from, for example, a VTR (Video Tape Recorder), an optical disc, or the like (not shown) on which the video and video audio of a television program produced by ordinary methods are recorded. If the video and the video audio are multiplexed and compression-coded, they are separated before being stored in the program video storage device 3. Although the program video storage device 3 is described here as storing the video and video audio of television programs, it may also store video and video audio other than television programs.

The commentary broadcast program storage device 7 stores the video (original video), the video audio (main audio), the sub audio, and the sub-audio-synchronized video, and consists of general storage means such as a hard disk. The commentary broadcast program generation device 1 receives video and video audio from the program video storage device 3, generates the sub audio and the sub-audio-synchronized video, and stores them in the commentary broadcast program storage device 7 together with the video (original video) and video audio (main audio) read from the program video storage device 3. From the original video, main audio, sub audio, and sub-audio-synchronized video stored in the commentary broadcast program storage device 7, an operator selects the required combination, which is recorded as a commentary broadcast program on, for example, a VTR, an optical disc, or the like (not shown).

An output device 5 is also externally connected to the commentary broadcast program generation device 1. The output device 5 displays images or video output from the commentary broadcast program generation device 1 and outputs audio, and consists of, for example, a display such as a liquid crystal display panel and a speaker. The configuration of the commentary broadcast program generation device 1 is described in detail below.

The commentary manuscript input means 11 receives from outside the commentary manuscript, which is the script for the commentary audio to be added to the video audio input from the program video storage device 3. The commentary manuscript is output to the speech synthesis means 14. The text of the commentary manuscript is assumed to be associated in advance, for example by time code, with the respective pause sections of the video audio into which it is to be inserted. For example, an operator producing the commentary broadcast may enter the commentary manuscript via a keyboard or an automatic speech recognition device (not shown) while referring, via the output device 5, to the detection results for the speaking sections (utterance sections) and pause sections of the video audio analyzed by the audio analysis means 13 described later. Alternatively, commentary audio spoken by an operator experienced in speaking commentary may be input and used in place of the commentary audio generated by the speech synthesis means 14 described later.

The video speed conversion means 12 expands or contracts the video input from the program video storage device 3, based on the information on the expansion and contraction of the video audio and commentary audio input from the speech-rate conversion means 15 described later. The video speed conversion means 12 receives information on how the lengths of the speaking sections and pause sections changed when the speech-rate conversion means 15 converted the video audio and commentary audio. In step with these changes in section length, it expands or contracts the video by thinning out or repeating frames of the video in the sections corresponding to those speaking sections and pause sections. In this way the video speed conversion means 12 can generate video synchronized with the sub audio generated by the audio connection means 16 (sub-audio-synchronized video). The sub-audio-synchronized video is output to the video/audio output means 17 and the sub-audio-synchronized video output means 19. The information on the expansion and contraction of the video audio and commentary audio consists of, for example, the start time and end time, or the time codes, of each rate-converted speaking section or pause section mapped onto the playback time of the video and video audio, and information indicating the expansion/contraction rate of the section length.
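The frame-level thinning and repetition described above can be sketched as an index mapping. This is a minimal illustration rather than the device's actual method: the function name and the uniform mapping are assumptions, and `ratio` stands for the expansion/contraction rate of the section length (greater than 1 lengthens the section, less than 1 shortens it).

```python
def resample_frame_indices(n_frames: int, ratio: float) -> list:
    """Map output frames to source frames by repeating or dropping frames.

    Assumption: frames are spread uniformly over the new length; a real
    system might prefer to drop or repeat frames where motion is small.
    """
    if n_frames <= 0 or ratio <= 0:
        return []
    n_out = max(1, round(n_frames * ratio))
    # Integer arithmetic avoids floating-point drift in the index mapping.
    return [i * n_frames // n_out for i in range(n_out)]


stretched = resample_frame_indices(4, 2.0)  # each frame shown twice
shortened = resample_frame_indices(4, 0.5)  # every other frame dropped
```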

The audio analysis means (section detection means) 13 analyzes the video audio input from the program video storage device 3 and detects speaking sections, which are sections of dialogue, narration, and the like, and pause sections, which are sections of silence or background sound only. The audio analysis means 13 can be realized by various audio analysis techniques, such as that described in Japanese Patent No. 3160228. Here, the audio analysis means 13 records the start time and end time of each detected section against the playback time of the video audio. The section information obtained by this analysis and the video audio are output to the speech-rate conversion means 15 and the video/audio output means 17.
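To make the recorded section information concrete, the following sketch labels runs of audio frames and records their start and end times. It is only a stand-in for the analysis techniques cited above: it assumes a simple per-frame energy threshold, which cannot tell background sound from speech, and the function name, the labels, and the input format are hypothetical.

```python
from itertools import groupby


def detect_sections(frame_energy, threshold, frame_sec):
    """Label runs of frames as "speaking" or "pause"; return (label, start, end) tuples.

    Assumption: a frame whose energy is at or above `threshold` counts as speech.
    Loud background sound would be misclassified, so this is illustrative only.
    """
    sections = []
    index = 0
    for is_speech, run in groupby(frame_energy, key=lambda e: e >= threshold):
        count = len(list(run))
        label = "speaking" if is_speech else "pause"
        sections.append((label, index * frame_sec, (index + count) * frame_sec))
        index += count
    return sections


# Hypothetical energies for six 0.5 s frames.
sections = detect_sections([0.9, 0.8, 0.1, 0.1, 0.1, 0.7], threshold=0.5, frame_sec=0.5)
```

Each tuple carries the start and end time on the playback time axis, which is the form of information the later conversion steps rely on.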

The speech synthesis means 14 synthesizes speech from the commentary manuscript input from the commentary manuscript input means 11 to generate commentary audio. The synthesized commentary audio is output to the speech-rate conversion means 15 and the video/audio output means 17.

The operator can listen to the commentary audio output from the output device 5 via the video/audio output means 17 described later. The speech synthesis means 14 has a function for correcting readings and accents, and can correct reading or accent errors in the commentary audio based on commands entered by the operator via input means (not shown). The speech synthesis means 14 can also synthesize speech in the voice qualities of a plurality of speakers, male and female, based on commands entered by the operator via input means (not shown), and has functions for adjusting voice pitch and for emphasizing or suppressing intonation. When the operator enters commands to select a voice quality suited to the video, to set the pitch and intonation according to the scene, or to change speakers partway through the video, the speech synthesis means 14 synthesizes commentary audio in accordance with those commands. The speech synthesis means 14 can be realized by various speech synthesis techniques, such as that described in JP 2004-139033 A.

The speech-rate conversion means 15 performs speech-rate conversion on the video audio input from the audio analysis means 13 and on the commentary audio input from the speech synthesis means 14. Speech-rate conversion here means expanding or contracting the duration of speech while preserving the pitch and quality of the voice. The speech-rate conversion means 15 can be realized by various speech-rate conversion techniques, such as those described in Japanese Patent No. 2955247 and Japanese Patent No. 3220043. Based on commands from the operator entered via input means (not shown), the speech-rate conversion means 15 converts either the commentary audio alone or both the video audio and the commentary audio. The rate-converted video audio and commentary audio are output to the audio connection means 16. Information on the expansion and contraction of the video audio and commentary audio caused by the conversion is output to the video speed conversion means 12 and the video/audio output means 17.

Speech-rate conversion by the speech-rate conversion means 15 is now described using a concrete example with reference to FIG. 2 (and FIG. 1 as appropriate). FIG. 2 is an explanatory diagram of speech-rate conversion by the speech-rate conversion means: (a) schematically shows an example of speaking sections and pause sections detected by the audio analysis means; (b) schematically shows an example of a speech-rate conversion pattern; (c) schematically shows an example of another conversion pattern; and (d) schematically shows an example of yet another conversion pattern. In FIG. 2, the broken lines indicate the start and end times of the speaking sections A1, A2, A3, ... and the pause sections B1, B2, ..., and the dash-dot lines indicate the center times of the speaking sections A1, A2, A3, ... and the pause sections B1, B2, ....

Suppose that, as shown in FIG. 2(a), the audio analysis means 13 has detected speaking sections A1, A2, A3, ... and pause sections B1, B2, .... FIG. 2(a) shows only some of the detected sections (speaking sections A1, A2, and A3 and pause sections B1 and B2). The speech rate conversion means 15 converts either the commentary audio alone or both the video audio and the commentary audio so that the duration of the video audio for one entire video equals the sum of the durations of the converted video audio and the converted commentary audio. Here, three conversion patterns, Pattern 1 to Pattern 3 shown in FIGS. 2(b) to 2(d), are assumed to be preset in the speech rate conversion means 15.

Examples of the patterns set in the speech rate conversion means 15 are as follows. In Pattern 1, shown in FIG. 2(b), the speech rate conversion means 15 leaves the video audio C1, C2, C3, ... of the speaking sections A1, A2, A3, ... unconverted, and converts the speech rate of the commentary audio input from the speech synthesis means 14 to match the section lengths of the pause sections B1, B2, ..., yielding commentary audio D1, D2, .... If the conversion makes each of the commentary audio D1, D2, ... fit within a length about 200 milliseconds shorter than the pause section, the audio connection means 16 described later can fit the commentary audio into the pause section while leaving a short gap between it and the adjacent speaking sections, producing sub audio that is easy for viewers to listen to.

It has been reported that visually impaired people can adequately understand speech two to three times faster than sighted people can (a speech rate of about 800 to 1200 kana characters per minute); see "Optimum / Maximum Speed for Presenting Speech to the Visually Impaired", Journal of Human Interface Society, Vol. 7, No. 1, pp. 105-111, 2005. Therefore, even if the pause sections B1, B2, ... are short and the commentary audio D1, D2, ... becomes fast, visually impaired viewers can still follow it. Moreover, because Pattern 1 does not convert the speech rate of the video audio, the audio of the speaking sections in the sub audio generated by the audio connection means 16 described later coincides with the video, so the sub audio does not drift from the video even when viewed by people with low vision or by sighted people. When a pause section B1, B2, ... is longer than the commentary audio before conversion, the speech rate conversion means 15 does not convert the commentary audio input from the speech synthesis means 14.
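
The Pattern 1 rule can be sketched as a small calculation of the speed-up factor for one commentary clip. This is a minimal sketch under stated assumptions, not the patented implementation: the function name `pattern1_rate` and its parameters are hypothetical, and the handling of a clip that is shorter than the pause but longer than the pause minus the margin (it is compressed slightly here) is an assumption, since the text does not say which rule applies in that case.

```python
def pattern1_rate(commentary_sec, pause_sec, margin_sec=0.2):
    """Return the speed-up factor for one commentary clip under Pattern 1.

    A factor of 1.0 leaves the clip unconverted; 2.0 plays it in half
    its original duration. All names here are illustrative assumptions.
    """
    # Leave roughly 200 ms of the pause free so the commentary does not
    # butt up against the neighbouring speaking sections.
    target_sec = pause_sec - margin_sec
    if target_sec <= 0:
        raise ValueError("pause section is too short to hold any commentary")
    if commentary_sec <= target_sec:
        return 1.0  # the clip already fits, so it is not converted
    return commentary_sec / target_sec
```

For example, a 4.8-second clip placed in a 2.6-second pause would have to be played about twice as fast, which is within the range the cited report describes as understandable for visually impaired listeners.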

In Pattern 2, shown in FIG. 2(c), the speech rate conversion means 15 keeps the start time of the video audio in each speaking section A1, A2, A3, ... unchanged, sets the speech rates of the video audio and the commentary audio to about the same value, and converts the video audio input from the audio analysis means 13 and the commentary audio input from the speech synthesis means 14, yielding video audio C1', C2', C3', ... and commentary audio D1', D2', .... Because the video audio C1', C2', C3', ... and the commentary audio D1', D2', ... have matching speech rates in Pattern 2, the sub audio generated by the audio connection means 16 described later is easier to listen to than a combination of audio with differing speech rates. In addition, because the moment a performer in the video begins to speak coincides with the start time of the video audio C1', C2', C3', ..., the sense of incongruity caused by a mismatch between the performer's lip movements and the sub audio is reduced when people with low vision or sighted people watch the video with the sub audio.
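
One possible reading of Pattern 2 can be sketched as follows. The sketch assumes that each speaking section and the pause that follows it form one slot ending where the next speaking section starts, that a single rate is shared within the slot, and that the commentary is placed immediately after the converted video audio. The function name `pattern2_slot`, its parameters, and these layout choices are illustrative assumptions that go beyond what the description states.

```python
def pattern2_slot(speak_start, speak_sec, pause_sec, commentary_sec):
    """Lay out one speaking section and the commentary that follows it.

    Returns (rate, speech_interval, commentary_interval); each interval
    is a (start, end) pair in seconds on the programme timeline.
    """
    # Time available before the next speaking section must start.
    slot_sec = speak_sec + pause_sec
    # One shared rate for both clips; never slow down below normal speed.
    rate = max(1.0, (speak_sec + commentary_sec) / slot_sec)
    # The video audio keeps its original start time.
    speech_end = speak_start + speak_sec / rate
    commentary_end = speech_end + commentary_sec / rate
    return rate, (speak_start, speech_end), (speech_end, commentary_end)
```

With a 6-second speaking section starting at 10 s, a 2-second pause, and 6 seconds of commentary, the shared rate is 1.5: the video audio occupies 10 s to 14 s and the commentary 14 s to 18 s, so the next speaking section can still start at 18 s.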

In Pattern 3, shown in FIG. 2(d), the speech rate conversion means 15 sets the speech rates of the video audio and the commentary audio to about the same value and converts both so that the center time of each speaking section A1, A2, A3, ... and each pause section B1, B2, ... coincides with the center time of the corresponding converted video audio C1", C2", C3", ... or commentary audio D1", D2", .... As in Pattern 2, the video audio C1", C2", C3", ... and the commentary audio D1", D2", ... have matching speech rates, so the sub audio generated by the audio connection means 16 described later is easier to listen to than a combination of audio with differing speech rates. Furthermore, the time during which the commentary audio D1", D2", ... overlaps the speaking sections A1, A2, A3, ... is shorter than in Pattern 2, so Pattern 3 suits programs that consist mainly of narrated explanation and seldom show the face of a performer who is speaking.
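
The centering rule of Pattern 3 can be illustrated with a short layout calculation. This is a sketch under assumptions: it uses one global rate for the whole program, chosen as the smallest rate at which neighboring centered clips do not overlap, whereas the description only says the rates are set to about the same value. The function name `pattern3_layout` and its arguments are hypothetical.

```python
def pattern3_layout(sections, durations):
    """Centre each converted clip on the centre of its original section.

    sections  -- list of (start, end) pairs in seconds, alternating
                 speaking and pause sections in time order
    durations -- unconverted clip length in seconds for each section
                 (video audio for speaking sections, commentary for pauses)
    Returns (rate, intervals) with one (start, end) pair per section.
    """
    centres = [(start + end) / 2 for start, end in sections]
    rate = 1.0
    # Two neighbouring clips centred at c1 < c2 do not overlap when half
    # of each converted length fits between the two centres.
    for i in range(len(sections) - 1):
        gap = centres[i + 1] - centres[i]
        needed = (durations[i] + durations[i + 1]) / (2 * gap)
        rate = max(rate, needed)
    intervals = [(c - d / (2 * rate), c + d / (2 * rate))
                 for c, d in zip(centres, durations)]
    return rate, intervals
```

For sections of 0 to 6 s (speaking), 6 to 8 s (pause), and 8 to 14 s (speaking), each with a 6-second clip, the rate is 1.5 and the commentary occupies 5 s to 9 s, extending one second into each neighboring speaking section.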

Based on the section information input from the audio analysis means 13, the speech rate conversion means 15 then sets, for each of the preset Patterns 1 to 3, the stretching or compression and the speech rate of the video audio and commentary audio sections corresponding to each speaking section and pause section. The speech rate conversion means 15 also outputs information indicating the stretching or compression of the video audio and the commentary audio in Patterns 1 to 3 to the video/audio output means 17 described later. This information is passed on to the output device 5. When the operator, having consulted the information via the output device 5, inputs a command designating a pattern, the speech rate conversion means 15 converts either the commentary audio alone or both the video audio and the commentary audio according to the command, and outputs the converted video audio and commentary audio to the audio connection means 16. The speech rate conversion means 15 also outputs information on the stretching or compression of the converted video audio and commentary audio to the video speed conversion means 12.

The speech rate conversion patterns are not limited to the examples above. Although the patterns described here are assumed to be preset, the operator may modify or add patterns in the speech rate conversion means 15 as appropriate.

Returning to FIG. 1, the description continues. The audio connection means (audio adding means) 16 connects the video audio and the commentary audio converted by the speech rate conversion means 15 to generate the sub audio. The generated sub audio is output to the video/audio output means 17 and the sub audio output means 21.
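
The connection step can be pictured as placing the converted clips on a single track in time order. The following is a simplified sketch, not the apparatus itself: clips are plain lists of sample values, gaps are filled with silence (a real program would keep any background sound there), and the name `connect_clips` and its arguments are hypothetical.

```python
def connect_clips(clips, total_samples):
    """Place (start_sample, samples) clips on a silent track of fixed length.

    clips -- iterable of (start_sample, list_of_samples) pairs holding the
             converted video audio and commentary audio
    """
    track = [0.0] * total_samples
    previous_end = 0
    for start, samples in sorted(clips, key=lambda clip: clip[0]):
        if start < previous_end:
            raise ValueError("clips overlap; adjust the speech rate conversion first")
        end = start + len(samples)
        if end > total_samples:
            raise ValueError("clip runs past the end of the programme")
        track[start:end] = samples  # same length on both sides, so the track size is unchanged
        previous_end = end
    return track
```

Rejecting overlapping clips rather than mixing them reflects the aim stated earlier that commentary should not collide with speech; a design that allowed deliberate overlap would need a mixing step instead.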

The video/audio output means 17 converts the video audio input from the audio analysis means 13, the commentary audio input from the speech synthesis means 14, the video input from the program video storage device 3, and the sub-audio-synchronized video input from the video speed conversion means 12 into an outputtable format and outputs them to the output device 5, based on a command from the operator input through input means (not shown). The video/audio output means 17 likewise converts the speaking section and pause section detection results input from the audio analysis means 13 and the speech rate conversion patterns input from the speech rate conversion means 15 into a displayable output format and outputs them to the output device 5, based on an operator command. In addition, based on an operator command, it converts the sub audio input from the audio connection means 16 into a displayable output format and outputs it to the output device 5.

An example of an image generated by the video/audio output means 17 and displayed on the display screen of the output device 5 is now described with reference to FIG. 3 (and FIG. 1 as appropriate). FIG. 3 is a schematic diagram showing an example of an image generated by the video/audio output means.

As shown in FIG. 3, the image W consists mainly of an audio analysis result presentation area E0, pattern presentation areas E1, E2, ..., a video playback area F, and a commentary script presentation area G.

The audio analysis result presentation area E0 visualizes the sections of the video audio on the time axis of the video playback time. Based on the speaking section and pause section information input from the audio analysis means 13, the video/audio output means 17 presents the speaking sections A1, A2, ... and the pause sections B1, B2, ... in the audio analysis result presentation area E0.

The pattern presentation areas E1, E2, ... present the speech rate conversion patterns for the video audio and the commentary audio. Based on the information indicating the stretching or compression of the video audio and the commentary audio input from the speech rate conversion means 15, the video/audio output means 17 visualizes, in each of the pattern presentation areas E1, E2, ..., the video audio and commentary audio sections of one speech rate conversion pattern on the time axis of the video playback time. For example, in the pattern presentation area E1, the video audio C1, C2, ... is presented in the same sections as the speaking sections A1, A2, ... shown in the audio analysis result presentation area E0, and the commentary audio D1, D2, ... is presented in the same sections as the pause sections B1, B2, ....

The video playback area F presents video based on a command input by the operator through input means (not shown). When the operator drags a mouse or the like (not shown) to select any range within the audio analysis result presentation area E0 or the pattern presentation areas E1, E2, ..., the video/audio output means 17 presents the video and the sub-audio-synchronized video for that range in the video playback area F. At the same time, the video/audio output means 17 outputs the video audio and the commentary audio for the selected range to a speaker or the like (not shown) provided in the output device 5. This allows the operator to select the most suitable of the speech rate conversion patterns shown in the pattern presentation areas E1, E2, ....

The commentary script presentation area G is used to enter and present the commentary scripts corresponding to the pause sections B1, B2, .... The commentary script presentation area G has a commentary script input area G1. When the operator designates a pause section through input means (not shown) and enters the text of the commentary script, the video/audio output means 17 displays the script in the commentary script input area G1. When the operator confirms the script, the commentary broadcast program generation apparatus 1 can receive the commentary script corresponding to each pause section B1, B2, ... through the commentary script input means 11. Further, when the operator uses input means (not shown) to designate and play back a pause section for which a commentary script has already been entered, the video/audio output means 17 outputs the commentary audio for that pause section, synthesized by the speech synthesis means 14, to a speaker or the like (not shown) provided in the output device 5.

Returning to FIG. 1, the description continues. The original video output means 18 outputs the video (original video) input from the program video storage device 3. The original video is the video produced by the program's creator and stored in the program video storage device 3. The output original video is stored in the commentary broadcast program storage device 7.

The sub-audio-synchronized video output means 19 outputs the sub-audio-synchronized video input from the video speed conversion means 12. The output sub-audio-synchronized video is stored in the commentary broadcast program storage device 7 in association with the original video output from the original video output means 18.

The main audio output means 20 outputs the video audio (main audio) input from the program video storage device 3. The main audio is the audio of the video produced by the program's creator and stored in the program video storage device 3. The output main audio is stored in the commentary broadcast program storage device 7 in association with the original video output from the original video output means 18.

The sub audio output means 21 outputs the sub audio input from the audio connection means 16. The output sub audio is stored in the commentary broadcast program storage device 7 in association with the sub-audio-synchronized video output from the sub-audio-synchronized video output means 19.

The commentary broadcast program generation apparatus 1 can also be realized by implementing each means as a functional program on a computer, and these functional programs can be combined to operate as a commentary broadcast program generation program (commentary-added sound generation program).

[Operation of the commentary broadcast program generation apparatus]
Next, the operation of the commentary broadcast program generation apparatus 1 is described with reference to FIG. 4. FIG. 4 is a flowchart showing the operation by which the commentary broadcast program generation apparatus generates a commentary broadcast program.

Using the audio analysis means 13, the commentary broadcast program generation apparatus 1 analyzes the video audio stored in the externally connected program video storage device 3 and detects speaking sections and pause sections (step S11). In this embodiment, the apparatus then uses the video/audio output means 17 to output the result of the audio analysis to the externally connected output device 5.
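
This passage does not say how the audio analysis means 13 detects the sections, so the following is only an illustrative stand-in for step S11: it thresholds a per-frame energy value and groups consecutive frames with the same label. A pause in the sense used here may contain background sound, which a plain energy threshold cannot separate from speech, so a practical system would need a real voice activity detector. The function name `detect_sections`, the threshold, and the frame length are all assumptions.

```python
def detect_sections(frame_energies, threshold, frame_sec=0.01):
    """Split a sequence of per-frame energies into labelled sections.

    Returns a list of (label, start_sec, end_sec) tuples, where label is
    "speaking" for frames at or above the threshold and "pause" otherwise.
    """
    sections = []
    run_start = 0
    for i in range(1, len(frame_energies) + 1):
        at_end = i == len(frame_energies)
        # Close the current run at the end of the data or when the label flips.
        if at_end or ((frame_energies[i] >= threshold)
                      != (frame_energies[run_start] >= threshold)):
            is_speech = frame_energies[run_start] >= threshold
            label = "speaking" if is_speech else "pause"
            sections.append((label, run_start * frame_sec, i * frame_sec))
            run_start = i
    return sections
```

The resulting list of labelled intervals is the kind of section information that the later steps, and the presentation area E0, operate on.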

Next, the commentary broadcast program generation apparatus 1 receives, through the commentary script input means 11, the commentary script written by the operator (step S12). The operator has associated the commentary script input here, in advance, with each of the pause sections detected in step S11. The apparatus then uses the speech synthesis means 14 to synthesize speech from the commentary script input in step S12, generating the commentary audio (step S13).

Next, the commentary broadcast program generation apparatus 1 uses the speech rate conversion means 15 to set the stretching or compression and the speech rate of the video audio and commentary audio sections (step S14). In this embodiment, a plurality of speech rate conversion patterns are preset; the apparatus uses the speech rate conversion means 15 to set the stretching or compression and the speech rate of the video audio and commentary audio sections for each pattern, and uses the video/audio output means 17 to output an image showing the stretching or compression of the video audio and the commentary audio to the externally connected output device 5.

The commentary broadcast program generation apparatus 1 then uses the speech rate conversion means 15 to convert the video audio and the commentary audio to the speech rate set in step S14 (step S15). Here, the apparatus converts the video audio and the commentary audio to the speech rate of the pattern designated by an externally input operator command.

The commentary broadcast program generation apparatus 1 then uses the audio connection means 16 to connect the video audio and the commentary audio converted in step S15, generating the sub audio (step S16). Using the video speed conversion means 12, the apparatus also stretches or compresses the video that corresponds to the video audio and is stored in the program video storage device 3, in accordance with the stretching or compression of the video audio and the commentary audio converted in step S15, generating the sub-audio-synchronized video (step S17).
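
Step S17 can be pictured as a piecewise-linear time mapping: for each instant of the sub-audio-synchronized video, find which instant of the original video should be shown. The following is a minimal sketch under assumptions; the segment representation, the linear interpolation within a segment, and the name `source_time` are illustrative and not taken from the description.

```python
def source_time(output_sec, segments):
    """Map a time in the sub-audio-synchronised video to the original video.

    segments -- list of (src_start, src_end, out_start, out_end) tuples in
                seconds, in time order, covering the output timeline
    """
    for src_start, src_end, out_start, out_end in segments:
        if out_start <= output_sec <= out_end:
            if out_end == out_start:
                return src_start  # zero-length segment: nothing to interpolate
            fraction = (output_sec - out_start) / (out_end - out_start)
            return src_start + fraction * (src_end - src_start)
    raise ValueError("time lies outside every segment")
```

For instance, if a 6-second speaking section is compressed into 4 seconds and the following 2-second pause is stretched to 4 seconds, the segments are `(0, 6, 0, 4)` and `(6, 8, 4, 8)`; output time 2 s then shows original time 3 s, and output time 6 s shows original time 7 s.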

The commentary broadcast program generation apparatus 1 then outputs the video input from the program video storage device 3 as the original video from the original video output means 18 to the commentary broadcast program storage device 7, and uses the sub-audio-synchronized video output means 19 to output the sub-audio-synchronized video generated in step S17 to the commentary broadcast program storage device 7. The apparatus also uses the main audio output means 20 to receive the video audio from the program video storage device 3 and output it as the main audio to the commentary broadcast program storage device 7, and uses the sub audio output means 21 to output the sub audio generated in step S16 to the commentary broadcast program storage device 7 (step S18), after which the operation ends. The original video, sub-audio-synchronized video, main audio, and sub audio output here are stored in the commentary broadcast program storage device 7 in association with one another.

Through the above operation, the commentary broadcast program generation apparatus 1 can generate the sub audio and the sub-audio-synchronized video for a plurality of videos stored in the program video storage device 3 and store them in the commentary broadcast program storage device 7 together with the original video and the main audio. The operator can then record the required video and sub audio on an optical disc such as a DVD (Digital Versatile Disc) or on a VTR.

FIG. 1 is a schematic diagram showing the configuration of the commentary broadcast program generation apparatus according to the present invention.
FIG. 2 is an explanatory diagram of speech rate conversion by the speech rate conversion means of the commentary broadcast program generation apparatus according to the present invention: (a) schematically shows an example of the speaking sections and pause sections detected by the audio analysis means, (b) schematically shows an example of a speech rate conversion pattern, and (c) and (d) each schematically show an example of another speech rate conversion pattern.
FIG. 3 is a schematic diagram showing an example of an image generated by the video/audio output means of the commentary broadcast program generation apparatus according to the present invention.
FIG. 4 is a flowchart showing the operation by which the commentary broadcast program generation apparatus according to the present invention generates a commentary broadcast program.

Explanation of symbols

1 Commentary broadcast program generation apparatus (commentary-added sound generation apparatus)
12 Video speed conversion means
13 Audio analysis means (section detection means)
14 Speech synthesis means
15 Speech rate conversion means
16 Audio connection means (audio adding means)
3 Program video storage device
5 Output device
7 Commentary broadcast program storage device

Claims

A commentary-added sound generation apparatus that receives, from outside, video audio that is the audio of a video and text data relating to the content of the video, and generates commentary-added sound in which commentary audio obtained by converting the text data into speech is added to the video audio, the apparatus comprising:
speech synthesis means for generating the commentary audio from the text data by speech synthesis;
section detection means for detecting, from the video audio, on the time axis of the playback time of the video audio, an utterance sound section that is an audio section of uttered sound and a pause section that is an audio section of only silence or background sound;
speech rate conversion means for converting the speech rate of the commentary audio based on the section length of the pause section; and
audio adding means for adding the commentary audio converted by the speech rate conversion means to the detected pause section of the video audio to generate the commentary-added sound.

A commentary-added sound generation apparatus that receives, from outside, video audio that is the audio of a video and text data relating to the content of the video, and generates commentary-added sound in which commentary audio obtained by converting the text data into speech is added to the video audio, the apparatus comprising:
speech synthesis means for generating the commentary audio from the text data by speech synthesis;
section detection means for detecting, from the video audio, on the time axis of the playback time of the video audio, an utterance sound section that is an audio section of uttered sound and a pause section that is an audio section of only silence or background sound;
speech rate conversion means for converting the speech rate of the video audio corresponding to the utterance sound section, and for converting the speech rate of the commentary audio based on the section length of the utterance sound section as stretched or compressed by that conversion and on the section length of the pause section; and
audio adding means for adding the commentary audio converted by the speech rate conversion means to a pause section, which is an audio section of only silence or background sound, in the video audio converted by the speech rate conversion means.

The commentary-added sound generation apparatus according to claim 2, further comprising video speed conversion means for stretching or compressing the section length of the section of the externally input video that corresponds to the utterance sound section, based on information indicating the stretching or compression caused by the speech rate conversion in the utterance sound section of the video audio converted by the speech rate conversion means.

A commentary-added sound generation program that, in order to receive from outside video audio that is the audio of a video and text data relating to the content of the video and to generate commentary-added sound in which commentary audio obtained by converting the text data into speech is added to the video audio, causes a computer to function as:
speech synthesis means for generating the commentary audio from the text data by speech synthesis;
section detection means for detecting, from the video audio, on the time axis of the playback time of the video audio, an utterance sound section that is an audio section of uttered sound and a pause section that is an audio section of only silence or background sound;
speech rate conversion means for converting the speech rate of the commentary audio based on the section length of the pause section; and
audio adding means for adding the commentary audio converted by the speech rate conversion means to the detected pause section of the video audio to generate the commentary-added sound.

A commentary-added sound generation program that, in order to receive from outside video audio that is the audio of a video and text data relating to the content of the video and to generate commentary-added sound in which commentary audio obtained by converting the text data into speech is added to the video audio, causes a computer to function as:
speech synthesis means for generating the commentary audio from the text data by speech synthesis;
section detection means for detecting, from the video audio, on the time axis of the playback time of the video audio, an utterance sound section that is an audio section of uttered sound and a pause section that is an audio section of only silence or background sound;
speech rate conversion means for converting the speech rate of the video audio corresponding to the utterance sound section, and for converting the speech rate of the commentary audio based on the section length of the utterance sound section as stretched or compressed by that conversion and on the section length of the pause section; and
audio adding means for adding the commentary audio converted by the speech rate conversion means to a pause section, which is an audio section of only silence or background sound, in the video audio converted by the speech rate conversion means.