JP4086886B2

JP4086886B2 - Movie playback apparatus, movie playback method and computer program thereof

Info

Publication number: JP4086886B2
Application number: JP2007117564A
Authority: JP
Inventors: 弘隆椎山
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2007-04-26
Filing date: 2007-04-26
Publication date: 2008-05-14
Anticipated expiration: 2022-04-16
Also published as: JP2007228624A

Description

The present invention relates to the field of moving image playback accompanied by audio playback.

Conventionally, moving image playback apparatuses with audio playback, such as video tape recorders, have provided functions such as double-speed playback and high-speed fast-forward so that the user can view an entire moving image in a short time.

In addition, for video tape recorders, which are a typical moving image playback apparatus, a technique has been proposed in recent years that, during double-speed playback of a recording medium, detects first audio sections in which the sound energy is at or above a predetermined threshold and second audio sections in which it is below that threshold, and plays continuously while applying pitch conversion to the audio signal in the first audio sections. Playback eats into the second audio sections and the reproduced speech is somewhat fast for the user, but the medium can be played at double speed with audio whose content remains understandable.

However, when partial pitch conversion of the audio signal is performed as described above, synchronization between audio and video is not necessarily maintained during playback. For example, the image of a person speaking in the reproduced video may be out of sync with the reproduced audio, so playback appears unnatural to human perception and the user may feel uncomfortable.

Further, JP-A-10-32776, JP-A-9-243351, and others propose techniques for summarizing a moving image by detecting silent states based on audio energy and regarding all sound other than the detected silence as sections of human speech. However, although detecting human speech sections from audio energy is possible to some extent for moving images in which human speech dominates the entire program, such as news programs, this method is not practical in environments where background noise or background music is present.

Furthermore, many techniques predating the above publications have been proposed that detect audio and play back a moving image taking the detected audio into account, and most of them detect audio by thresholding sound energy. Behind this lies an ambiguity in the Japanese language: the same word is used both for the human voice and for sound in general that includes the human voice. It is therefore inappropriate to lump such thresholding of sound energy in the prior art together with true "voice detection."

JP-A-9-247617 proposes a technique that detects "feature points of audio information and the like" by computing the FFT (fast Fourier transform) spectrum of an audio signal to find singular points, and then analyzes their volume. However, with a method that relies on the FFT spectrum, it is difficult to detect a human voice when the audio signal to be played contains so-called background music with a broadband spectral distribution.

Thus, in conventional moving image playback with audio, detection of audio sections is makeshift and inaccurate, as described above. Moreover, when the detection result is used to create a summary of a moving image or to perform double-speed playback, synchronization between video and audio cannot be maintained during playback.

In recent years, media have appeared in which information on the spoken content, in the form of so-called subtitles or closed captions, is multiplexed with the moving image data and audio signal or inserted into a separate band. Even when such media are played, creating a summary or performing double-speed playback based on audio section detection results still has the problem that synchronization between video and audio cannot be maintained during playback.

In general, it is not easy for users such as elderly people and children to master various devices, and speech delivered at high speed is known to be difficult for them to follow. For such users, therefore, the optimum playback conditions for quick viewing (shortened playback), such as double-speed playback on a moving image playback apparatus like the tape recorder described above, differ from those for general users.

Likewise, for users with weak dynamic visual acuity, users who have difficulty hearing fast speech, and foreign users whose native language is not the language of the played audio, the optimum playback conditions for quick viewing (shortened playback), such as double-speed playback on such an apparatus, differ from those for general users.

Accordingly, an object of the present invention is to provide a moving image playback apparatus, a moving image playback method, and a computer program therefor that accurately detect sections of human speech and, while faithfully maintaining synchronization between video and audio according to the detected sections, greatly shorten the time the user needs for viewing.

According to one aspect of the present invention, there is provided a moving image playback apparatus capable of high-speed playback of moving image data including an audio signal and sub-information, comprising: determination means for determining, based on the sub-information included in the moving image data, first audio sections representing periods of human utterance and second audio sections other than the first audio sections; quick-view playback means for performing, based on the moving image data, high-speed moving image playback with reproduced audio at a predetermined speed at which the user can grasp the content for the first audio sections, and high-speed moving image playback at a speed higher than the predetermined speed for the second audio sections; and presentation means for calculating the time required for the high-speed moving image playback based on the length and playback speed of the first audio sections and the length and playback speed of the second audio sections, and presenting the calculated required time to the user.

According to another aspect of the present invention, there is provided a moving image playback method capable of high-speed playback of moving image data including an audio signal and sub-information, comprising: a determination step of determining, based on the sub-information included in the moving image data, first audio sections representing periods of human utterance and second audio sections other than the first audio sections; a quick-view playback step of performing, based on the moving image data, high-speed moving image playback with reproduced audio at a predetermined speed at which the user can grasp the content for the first audio sections, and high-speed moving image playback at a speed higher than the predetermined speed for the second audio sections; and a presentation step of calculating the time required for the high-speed moving image playback based on the length and playback speed of the first audio sections and the length and playback speed of the second audio sections, and presenting the calculated required time to the user.

According to the present invention described above, it is possible to provide a moving image playback apparatus, a moving image playback method, and a computer program therefor that accurately detect sections of human speech and, while faithfully maintaining synchronization between video and audio according to the detected sections, greatly shorten the time the user needs for viewing.

Hereinafter, an embodiment of the moving image playback apparatus according to the present invention will be described in detail with reference to the drawings.

First, an outline of the operation of the moving image playback apparatus in the present embodiment will be described with reference to FIG. 1.

FIG. 1 is a conceptual diagram of the moving image quick-view algorithm in the moving image playback apparatus according to the present embodiment.

As shown in FIG. 1, the moving image playback apparatus according to the present embodiment broadly consists of a moving image quick-view index creation unit 100 and a moving image quick-view playback unit 200.

<Moving image quick-view index creation unit 100>
In the moving image quick-view index creation unit 100, moving image data read from the moving image data storage unit 10 is separated into video data (video signal), audio data (audio signal), and sub-information by the video/audio/sub-information separation process (step S101).

The audio signal then undergoes the audio section reading process (step S102) and the audio section correction process (step S103). If the sub-information does not include scene change point information, the video signal undergoes the video change degree calculation process (step S106) and the scene change point detection process (step S107). If the sub-information does include scene change point information, the scene change point reading process (step S105) is applied to the sub-information. The quick-view playback section correction process (step S104) then generates quick-view playback section information, which is stored in the moving image quick-view index storage unit 11.

Specifically, in the audio section reading process (step S102), "information on the content of human utterances" and "display timing information" are read from the moving image data as the audio section reading result, based on the information separated in the video/audio/sub-information separation process (step S101). The display timing information includes a display start timing, a display end timing, and a section length.

In the audio section correction process (step S103), based on the above audio section reading result, neighboring audio sections are merged so that the listener (user) is not made uncomfortable during audio playback. This corrects the audio sections representing periods of human utterance (hereinafter referred to as "human voice sections" or sections A) and yields corrected audio section information.

As an example of what can go wrong in high-speed moving image playback, suppose the gap between two neighboring sections A is narrow. If the sections A are played at a multiple speed with audio (for example, double speed) at which a person can follow the content by ear, while the sections that are not human voice sections (hereinafter referred to as sections B) are played at a high multiple at which the content can still be grasped by watching the video, the playback changes abruptly and frequently, and the result is hard for an ordinary user to listen to.

In the present embodiment, therefore, the audio section correction process (step S103) takes the gaps between human voice sections into account and merges a group of human voice sections when the gaps satisfy a predetermined condition, which removes this difficulty in listening. The simplest such condition is that the gap between human voice sections is at or below a predetermined threshold.
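The threshold-based merging described for step S103 could be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `merge_close_sections`, the representation of a section as a `(start, end)` pair in seconds, and the parameter `max_gap` are all assumptions introduced here.

```python
from typing import List, Tuple

# A human voice section (section A) as (start_sec, end_sec); an assumed representation.
Section = Tuple[float, float]


def merge_close_sections(sections: List[Section], max_gap: float) -> List[Section]:
    """Merge neighboring sections whose gap is at or below max_gap seconds."""
    merged: List[Section] = []
    for start, end in sorted(sections):
        if merged and start - merged[-1][1] <= max_gap:
            # Gap to the previous section is small enough: extend that section.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged
```

For instance, with `max_gap=1.0`, two sections separated by half a second would be joined into one, while a section that begins several seconds later would stay separate.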

In the video change degree calculation process (step S106), the inter-frame similarity comparison described in JP-A-2000-235639 is applied to the video data obtained in the video/audio/sub-information separation process (step S101) to compute the inter-frame similarity, and video change information is thereby generated.

In general, when moving image data with an audio signal contains a scene transition and an audio section starts immediately afterward, the beginning of the scene is played at high speed for only an instant before playback switches to multiple-speed playback with audio at a speed a person can follow by ear. To the user the video appears to flicker, which feels unnatural.

In the present embodiment, therefore, if the sub-information includes scene change point information, the scene change point reading process (step S105) reads the group of scene change points (scene change point information) from the sub-information. If it does not, the scene change point detection process (step S107) detects the group of scene change points based on the video change information obtained in the video change degree calculation process (step S106), for example by using the scene change point detection technique disclosed in the applicant's earlier JP-A-2000-235639.

Then, in the quick-view playback section correction process (step S104), if the scene change point that is earlier than and nearest to the start of an audio section corrected in step S103 lies within a predetermined threshold distance of that start, the start of the audio section is replaced with that scene change point, obtained in step S105 or step S107. This removes the unnatural impression for the user.
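The start-point replacement in step S104 might look like the sketch below. The function name `snap_start_to_scene_change`, the `(start, end)` pair representation in seconds, and the parameter `max_distance` are assumptions made for illustration; the sketch also does not handle the case where a moved start would overlap the preceding section, which the text does not address.

```python
import bisect
from typing import List, Tuple


def snap_start_to_scene_change(
    sections: List[Tuple[float, float]],
    scene_changes: List[float],
    max_distance: float,
) -> List[Tuple[float, float]]:
    """Move each section start back to the nearest preceding scene change
    point when that point lies within max_distance seconds of the start."""
    changes = sorted(scene_changes)
    result = []
    for start, end in sections:
        # Index of the first scene change strictly after `start`;
        # changes[i - 1] is then the nearest one at or before `start`.
        i = bisect.bisect_right(changes, start)
        if i > 0 and start - changes[i - 1] <= max_distance:
            start = changes[i - 1]
        result.append((start, end))
    return result
```

A scene change that coincides exactly with the section start leaves the section unchanged, which matches the intent of avoiding a brief high-speed burst before the audio begins.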

Because each of the above steps can be processed extremely quickly, in the present embodiment the moving image data with audio and sub-information read from the moving image data storage unit 10 is temporarily held in a memory buffer (not shown) when the moving image quick-view playback unit 200 performs playback, and the above "information on the content of human utterances" is obtained ahead of the actual playback. Without analyzing the content of the moving image data in advance, the generation of quick-view playback section information by the moving image quick-view index creation unit 100 and the playback by the moving image quick-view playback unit 200 using that information and the moving image data can thus be executed in pseudo real time (that is, by pseudo-parallel processing) over the entire moving image (that is, the whole of the content to be played). The user can therefore quick-view the whole of the desired moving image content efficiently and in a short time.

<Moving image quick-view playback unit 200>
Next, in the moving image quick-view playback unit 200, the quick-view playback process (step S107) plays the video on the display 12 and the audio through the speaker 13. For this playback, the time required for playback is displayed in step S108 based on the quick-view playback section information read from the moving image quick-view index storage unit 11. The playback conditions the user sets in step S109 in response to that display are fed back and judged together with the playback conditions based on the user profile 14 to make the final setting of the quick-view playback conditions, and the moving image data read from the moving image data storage unit 10 is played under those conditions.

In doing so, in the present embodiment:
・Sections A are played at a multiple speed with audio, at a speed at which the user can grasp the content by listening to the reproduced audio.
・Sections B are played at a high multiple speed, within the range in which the user can grasp the content by watching the reproduced video.

Experiments by the applicant show that the multiple-speed playback of sections A, that is, playback at a speed at which a person can follow the content by ear, should be at most 2x and preferably about 1.5x. Sections B, on the other hand, are played at a high multiple within the range in which a person can grasp the content by watching the video; according to the applicant's experiments, this should empirically be at most 10x and preferably 5x or more.

Playing sections B at a high multiple is known to generally produce a squealing sound. In step S107, therefore, if the user does not want to hear such a sound while sections B are played at high speed, the audio may be muted or the playback volume may be reduced.
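The per-section playback policy described above (sections A at an audible multiple, sections B at a high multiple with muted or reduced audio) can be pictured as a playback schedule. The following is only a sketch under stated assumptions: `build_schedule` and its tuple layout are hypothetical, the sections A are assumed to be non-overlapping, and the defaults of 1.5x and 5x are simply the values the text names as preferable.

```python
from typing import List, Tuple

# (start_sec, end_sec, speed_multiple, muted); an assumed layout for illustration.
Segment = Tuple[float, float, float, bool]


def build_schedule(
    sections_a: List[Tuple[float, float]],
    total_duration: float,
    speed_a: float = 1.5,
    speed_b: float = 5.0,
    mute_b: bool = True,
) -> List[Segment]:
    """Cover the whole timeline: sections A at speed_a with audio,
    everything else (sections B) at speed_b, optionally muted."""
    schedule: List[Segment] = []
    cursor = 0.0
    for start, end in sorted(sections_a):
        if start > cursor:
            # The gap before this voice section is a section B.
            schedule.append((cursor, start, speed_b, mute_b))
        schedule.append((start, end, speed_a, False))
        cursor = end
    if cursor < total_duration:
        schedule.append((cursor, total_duration, speed_b, mute_b))
    return schedule
```

A player loop could then walk this list and switch speed and audio state at each boundary; reducing the volume instead of muting would only require replacing the boolean with a gain value.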

As for the playback speed of sections A, the playback speed of sections B, and the volume during playback of sections B, the simplest implementation is to fix in advance how the audio is handled in the quick-view playback process (step S107). Another approach is to let the user set the playback speeds variably.

In general, however, it is not easy for users such as elderly people and children to master various devices, and they are known to have difficulty understanding audio played at high speed. For them it is preferable to obtain playback at a somewhat lower multiple simply, without troublesome speed adjustment. Similarly, regardless of age, users with weak eyesight (visually impaired users), particularly weak dynamic visual acuity, users with weak hearing (hearing impaired users), particularly for fast speech, and foreign users whose native language is not that of the played audio are known to have difficulty understanding audio played at high speed, and there are playback speeds best suited to these users as well.

In the present embodiment, therefore, attribute information about each user, such as age, languages used and understood, eyesight, and hearing, as well as the baseline playback conditions that the user prefers, is stored in advance in the user profile 14. By referring to the profile 14 in the quick-view playback process (step S107), the playback speeds of the audio sections representing periods of human utterance (sections A) and of the remaining sections (sections B) are each determined according to the target user, which enables quick-view playback whose content is easy for that individual to understand.

Also, as described above, when the audio is to be muted or the volume reduced during high-multiple playback of sections B, writing that setting in the profile 14 in advance enables comfortable quick-view playback suited to each user.

Furthermore, for elderly users and users with impaired dynamic visual acuity, the playback speed of sections A may be set slower than normal speed while the playback speed of sections B is set to normal speed or faster, although this may depart from quick viewing in the strict sense. Such users can then view the moving image (that is, the moving image data stored in the moving image data storage unit 10) in a shorter time than if every section were played slowly, while sections A are still played slowly enough for them to grasp the audio content.

Similarly, for users who have difficulty understanding fast speech and users who are not proficient in the language of the audio, the playback speed of sections A may be set slower than normal speed while sections B are played at up to 10x, preferably 5x or more, although this too may depart from quick viewing in the strict sense. Such users can then view the moving image (that is, the moving image data stored in the moving image data storage unit 10) in a shorter time than if every section were played slowly, while sections A are still played slowly enough for them to grasp the audio content. Whether the user is proficient in the language of the audio can be judged by comparing identification information stored in advance in the profile 14 (the proficient language entry in Table 4, described later) with the language type information of the audio included in the moving image to be played.
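One way to picture the profile-driven speed selection described above is the sketch below. Everything in it is an assumption made for illustration: the `UserProfile` fields, the function `choose_speeds`, and the concrete slow-playback values (0.8x for sections A, 2.0x for sections B) are hypothetical and are not taken from the profile table of the embodiment. Only the 1.5x and 5x defaults and the language comparison follow the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class UserProfile:
    """Hypothetical subset of the attributes held in user profile 14."""
    fluent_languages: Tuple[str, ...] = ()
    weak_dynamic_vision: bool = False      # elderly or impaired dynamic visual acuity
    needs_slow_speech: bool = False        # difficulty following fast speech
    preferred_speed_a: Optional[float] = None
    preferred_speed_b: Optional[float] = None


def choose_speeds(profile: UserProfile, content_language: str) -> Tuple[float, float]:
    """Return (speed for sections A, speed for sections B)."""
    speed_a, speed_b = 1.5, 5.0            # values the text names as preferable
    if profile.weak_dynamic_vision:
        # Sections A below normal speed, sections B at normal speed or faster.
        speed_a, speed_b = 0.8, 2.0        # illustrative values only
    elif profile.needs_slow_speech or content_language not in profile.fluent_languages:
        # Sections A below normal speed, sections B still at a high multiple.
        speed_a, speed_b = 0.8, 5.0        # illustrative values only
    # Explicit per-user preferences, if stored, override the derived values.
    if profile.preferred_speed_a is not None:
        speed_a = profile.preferred_speed_a
    if profile.preferred_speed_b is not None:
        speed_b = profile.preferred_speed_b
    return speed_a, speed_b
```

The language check mirrors the comparison between the profile's proficient language and the language type of the audio; a real profile would likely carry more attributes, such as age and volume handling for sections B.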

One procedure for selecting the user profile 14 is to show a list of user profiles on a profile selection screen on the display 12 and let the user choose from it by operating a remote control terminal (not shown). An automatic profile selection method using personal recognition technology such as fingerprint, voiceprint, or face recognition may also be adopted.

When quick-view playback is tailored to each user in this way, how long the original moving image is and how much time quick viewing will take are important information for a user who intends to use spare time for quick viewing.

In the present embodiment, therefore, step S108 calculates the playback time of sections A by dividing their total length by their playback speed, calculates the playback time of sections B likewise by dividing their total length by their playback speed, and takes the sum of these two values as the time required for quick viewing. This is presented to the user together with the time required to play the original moving image at normal speed. Having seen these times, the user can then specify the playback speeds of sections A and sections B so that the required time comes close to a desired playback time.
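The required-time calculation of step S108 amounts to (total length of sections A / speed of A) + (total length of sections B / speed of B). A minimal sketch follows; the function name `quick_view_time` and its arguments are assumptions, and it presumes the sections A are non-overlapping and lie within the total duration, so that sections B are simply the remainder.

```python
from typing import List, Tuple


def quick_view_time(
    sections_a: List[Tuple[float, float]],
    total_duration: float,
    speed_a: float,
    speed_b: float,
) -> float:
    """Estimated quick-view time in seconds for the given section speeds."""
    total_a = sum(end - start for start, end in sections_a)
    total_b = total_duration - total_a     # everything that is not a section A
    return total_a / speed_a + total_b / speed_b
```

As a worked example, a 3600-second program containing 1200 seconds of sections A, played at 1.5x for sections A and 10x for sections B, gives 1200 / 1.5 + 2400 / 10 = 800 + 240 = 1040 seconds, which would be shown next to the 3600 seconds needed at normal speed.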

Regarding the relationship between the preset user profile 14 and the playback speed specified by the user: suppose that the user, having seen the quick-view playback time calculated automatically from the profile 14 in step S108 as described above, specifies the playback speeds of sections A and sections B in step S109 via a predetermined man-machine interface, thereby setting the time required for the desired quick-view playback (playback speed information) so that playback fits within the intended time. In that case, the playback speed information thus set is newly stored in the profile, either automatically or after confirmation by the user. This enables easy-to-understand quick-view playback that reflects the previous operation and suits each user's preferences.

In addition, how the volume is to be handled during playback of section B may be specified in advance in the user profile described above, or specified by the user via a predetermined man-machine interface. Quick-view playback is then performed with the specified volume information reflected, again in a manner suited to the preferences of each individual user.

<Details of the Operation of the Moving Image Playback Apparatus>
The operation of the moving image playback apparatus according to this embodiment, outlined above, is described in detail below. The description takes as an example the case where quick-view playback section information is created as an index for quick-view playback of recorded moving image data (moving image data including an audio signal and sub-information) stored in the moving image data storage unit 10, and that information is then used to perform quick-view playback of the moving image data.

As described above, the processing that follows the video/audio separation process of step S101 is broadly divided into the quick-view index creation process performed by the moving image quick-view index creation unit 100 and the quick-view playback process performed by the moving image quick-view playback unit 200.

The moving image data with audio and sub-information in this embodiment is content in which video information, audio information, and sub-information are multiplexed. Examples of media in this form include DVDs and digital television broadcasts.

In this embodiment, sub-information refers to various kinds of information other than the video and audio themselves, such as segment information of the moving image, scene change information, subtitle information, and time information.

The following description uses "information on the content of human utterances" as this sub-information. Other usable forms include, for example, subtitles and closed captions, as well as a phoneme string notation obtained from the result of recognizing human speech.

Subtitles and closed captions are overlaid on the screen in synchronization with the video signal, in correspondence with what is spoken, so that hearing-impaired viewers and viewers who do not understand the spoken language can still enjoy the content. For content that includes such sub-information, the voice sections (sections A) representing the periods of human utterance are generally determined before the content is provided, either manually, automatically, or semi-automatically. The utterance content in each determined voice section is then generally described as information added to the content ("sub-information" in this embodiment), manually and/or by speech recognition processing.

For such content, subtitles or closed captions in a language different from that of the originally spoken audio are generally added after the sub-information described in the content as above has been translated into the target language manually and/or by automatic translation.

Sub-information such as subtitles and closed captions is generally accompanied by section information indicating the display period during playback, and this section information can be regarded as representing the human voice sections (sections A).

This embodiment therefore targets moving image data with audio and sub-information that includes sub-information in the above form, and detects the human voice sections (sections A) contained in that moving image data.

<Moving Image Quick-View Index Creation Unit 100>
(Detection of Human Voice Sections)
FIG. 2 is a flowchart outlining the human voice section detection process performed by the moving image quick-view index creation unit 100 in this embodiment, and shows the detailed procedure of the voice section reading process (step S102) described above.

In the figure, step S201 determines whether any of the moving image streams subjected to the video/audio/sub-information separation process of step S101 has not yet been read. If all have been read, this process ends.

In step S202, since step S201 has determined that an unread moving image stream exists, that stream is read into a buffer (not shown). In step S203, the sub-information contained in the read stream is searched for subtitles, closed captions, a phoneme string notation obtained by speech recognition, or speech detection result information as "information on the content of human utterances", and the information found is set as the human voice sections (voice section information).

Which of these is selected as the voice section information in step S203 can be decided by choosing, among those present, the one whose content is most reliable, for example in the order subtitles > closed captions > phoneme string notation > speech detection result information.
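The priority-based choice of step S203 might look like the following sketch. The dictionary keys and the function name are hypothetical; only the priority order itself comes from the description.

```python
# Priority order from the description: the most reliable source comes first.
PRIORITY = ["subtitle", "closed_caption", "phoneme_string", "speech_detection"]


def select_voice_sections(sub_info):
    """Pick voice section information from the most reliable source present.

    sub_info maps a source name to a list of (start_ms, end_ms) pairs.
    Returns (source_name, sections), or (None, []) if nothing usable exists.
    """
    for source in PRIORITY:
        sections = sub_info.get(source)
        if sections:  # skip sources that are missing or empty
            return source, sections
    return None, []


# Hypothetical stream carrying closed captions and phoneme strings only.
chosen = select_voice_sections({
    "phoneme_string": [(0, 900)],
    "closed_caption": [(100, 1000)],
})
print(chosen)
```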

Table 1 illustrates sub-information read as voice section information. In this example, the start time (start point) and end time (end point) of each of utterance sections 0 to 2 are read as paired information for section A.

(Correction of Human Voice Sections)
The voice section correction process (step S103) described above is now explained in detail. In step S103, multiple voice sections located close to one another on the time axis are merged into a single voice section so that the user does not find the reproduced audio unpleasant during quick-view playback.

The voice section information obtained by the voice section detection process (FIG. 2) is corrected for the following reason. Suppose two sections A that are close on the time axis are separated by only a short gap. If section A is played with audio at a speed at which a person can follow the speech, while section B is played at a much higher speed, within the range at which a person can follow the video, the playback mode changes abruptly and the result is hard for the user to listen to.

From the standpoint of the video decoder and playback processing as well, speed changes over short sections incur a large processing overhead, so that playback stalls momentarily and becomes jerky. This was observed, as one example, in experiments by the applicant using Microsoft's DirectShow, and a similar phenomenon is seen with many other moving image playback means.

In this embodiment, therefore, when the gap between two adjacent voice sections (sections A) on the time axis is less than or equal to a threshold (Th3 in FIG. 3), the two voice sections are merged. The threshold can be determined, for example, by assuming a conversation scene, experimentally finding the length of pause over which the conversation still holds together, and using that value. The procedure is described with reference to FIG. 3.

FIG. 3 is a flowchart showing the merging correction process applied to voice sections separated by short gaps in this embodiment.

In the figure, step S301 reads, from the previously detected sections A, the one located first on the time axis as the voice section of interest. If there is no voice section to read, this process ends.

Step S302 determines whether a next voice section (section A) exists. If none exists, this process ends; if one does, the processing of steps S303 to S307 described below is repeated.

In step S303, since step S302 has determined that a next voice section exists, the voice section information representing that voice section (section A) is read. Voice section information is a pair consisting of the start point and end point of a voice section.

Step S304 obtains the gap between the two sections A, that is, the distance (time interval) on the time axis between the end point of the earlier voice section (the voice section currently of interest) and the start point of the next voice section, and determines whether this distance is less than or equal to the predetermined threshold Th3.

In step S305, since step S304 has determined that the gap between the two sections A is less than or equal to the threshold Th3, the two voice sections are merged into one. Specifically, the voice section information of the merged section is given the start point of the earlier voice section and the end point of the next voice section.

In step S306, the merged voice section is set as the voice section (section A) currently of interest, and the process returns to step S302.

In step S307, since step S304 has determined that the gap between the two sections A is greater than the threshold Th3, the voice section currently of interest is stored as is, as one item of corrected voice section information. In step S308, the next voice section is set as the voice section of interest to be processed, and the process returns to step S302.

This merging process is repeated until no voice sections (sections A) remain to be handled.
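The merging loop of FIG. 3 can be sketched as below, under the assumption that the sections arrive sorted by start time as (start_ms, end_ms) pairs. The function name and sample values are illustrative, and taking the later of the two end points is a defensive choice for overlapping input rather than something the description states.

```python
def merge_close_sections(sections, th3_ms):
    """Merge voice sections whose gap is at most th3_ms.

    sections: list of (start_ms, end_ms) pairs sorted by start time.
    Returns a new list of corrected (start_ms, end_ms) pairs.
    """
    merged = []
    current = None  # the voice section currently of interest
    for start, end in sections:
        if current is None:
            current = (start, end)                        # cf. S301
        elif start - current[1] <= th3_ms:                # cf. S304
            # cf. S305/S306: keep the earlier start, extend the end.
            current = (current[0], max(current[1], end))
        else:
            merged.append(current)                        # cf. S307
            current = (start, end)                        # cf. S308
    if current is not None:
        merged.append(current)  # flush the last section of interest
    return merged


# Hypothetical input: the first two sections are only 500 ms apart.
result = merge_close_sections(
    [(0, 1000), (1500, 3000), (10000, 12000)], th3_ms=1000)
print(result)
```

Because a merged section becomes the new section of interest, a chain of several closely spaced utterances collapses into one section in a single pass.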

(Correction of Human Voice Sections Using Scene Change Point Information)
In general, when moving image data containing an audio signal has a scene transition immediately followed by the start of a section A, the opening of the new scene is played at high speed for just an instant, after which playback continues with audio at a speed at which a person can follow the speech. The user perceives this as a flicker in the video, which feels unnatural.

In this embodiment, therefore, the scene change points used are either those detected in step S107, for example by the scene change point detection technique disclosed in the applicant's earlier Japanese Patent Laid-Open No. 2000-235639, or those read from the sub-information in step S105. If, among these, there is a scene change point that is earlier in time than the start of a voice section obtained by the voice section correction process, is the nearest such point, and lies within a certain threshold distance, the start of that voice section is replaced with the time of that scene change point. This removes the unnatural impression during quick-view playback. The threshold for this proximity test is a value chosen according to the overhead incurred when switching from high-speed playback to playback with audio at a speed at which a person can follow the speech.

FIG. 4 is a flowchart showing the voice section integration correction process performed using scene change points in this embodiment, and shows the details of the quick-view playback section correction process (step S104).

In the figure, step S401 first reads, from the group of scene change points (scene change point information, or scene change position information) detected by the scene change point detection process (step S107), the scene change point (A) that comes first on the time axis.

Scene change point information is normally described in units of frames, so in this step it is converted to time information based on the frame rate before being compared with the voice section information. The algorithm of this embodiment uses two consecutive items of scene change point information to find the scene change point nearest to the start point of a voice section. For convenience, the earlier scene change point is called A and the next one B; in step S401, the time of the scene change point is stored in A.
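The frame-to-time conversion mentioned here is simple arithmetic; a minimal sketch follows. The function name and the sample frame indices are assumptions, and rounding to the nearest millisecond is one possible convention.

```python
def frame_to_msec(frame_index, frames_per_sec=30.0):
    """Convert a frame index to a time in milliseconds (nearest integer)."""
    if frames_per_sec <= 0:
        raise ValueError("frame rate must be positive")
    return round(frame_index * 1000.0 / frames_per_sec)


# Hypothetical scene change positions expressed as frame indices at 30 fps.
scene_change_frames = [3000, 3030]
scene_change_msec = [frame_to_msec(f) for f in scene_change_frames]
print(scene_change_msec)
```

Once both kinds of information share the same time unit, scene change points and voice section boundaries can be compared directly.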

Step S402 determines whether any voice section information remains unread. If none remains, the process ends; if some remains, one item of voice section information is read in step S403.

Step S404 determines whether any scene change point information remains unread. If none remains, the voice section information already read in step S403 is stored unchanged as corrected voice section information in step S405.

In step S406, since step S404 has determined that unread scene change point information exists, that information is read as scene change point B.

Step S407 determines whether scene change point A is located, on the time axis, before the start point of the voice section of interest read in step S403. If it is not located before the start point, no correction is needed, and the voice section information is stored unchanged as corrected voice section information in step S405.

In step S408, since step S407 has determined that scene change point A is located before the start point of the voice section of interest, it is determined whether scene change point A lies within the threshold distance Th4 of that start point. If it does not, the information of scene change point B is copied to scene change point A in step S409 in preparation for examining the next scene change point.

In step S410, since step S408 has determined that scene change point A lies within the threshold distance Th4 of the start point of the voice section of interest, it is determined whether scene change point B is located after that start point. If it is not, the process proceeds to step S409.

If, on the other hand, step S410 determines that scene change point B is located after the start point of the voice section, step S411 stores, as corrected voice section information, the section whose start point is scene change point A and whose end point is the end point of the voice section. In step S412, the information of scene change point B is copied to scene change point A in preparation for examining the next scene change point.

In other words, the processing of steps S411 and S412 is performed only after the determinations in steps S407, S408, and S410 have confirmed that scene change point A is located before the start point of the voice section of interest, lies within the threshold Th4 of it, and is the scene change point nearest to that start point.

If step S410 determines that scene change point B is not after the start point of the voice section, scene change point B is a better candidate than the current scene change point A for the start point of the corrected voice section. In step S409, therefore, the information of scene change point B is copied to become the new scene change point A in preparation for examining the next scene change point, and the process returns to step S404. Since scene change point A in this case already satisfies the conditions of steps S407 and S408, those two steps may be skipped and the determination of step S410 made directly.

The corrected voice section information obtained by the voice section integration correction process (FIG. 4) is stored in the moving image quick-view index storage unit 11 as quick-view playback section information, using a schema such as that illustrated in Table 2.

Table 2 illustrates scene change detection results in this embodiment. As an example, it stores the frames at which scene change points were detected, converted to seconds based on the frame rate (30 frames/sec).

Table 3 illustrates the corrected voice section detection results in this embodiment. It shows the result of applying the voice section integration correction process using scene change points (FIG. 4) to the results in Table 2 and Table 1 with the threshold Th4 = 2000 mSec.

Referring to Table 1 and Table 2, for voice sections 0 and 2 there is no scene change before the respective start points (60000 mSec and 400000 mSec) within the threshold Th4 of 2000 mSec. For voice section 1, 1500 mSec before the start point of 102000 mSec and within 2000 mSec, two scene change points exist: scene change ID = 2 (start time 100000 mSec) and scene change ID = 3 (start time 101000 mSec). Because the algorithm of FIG. 4 selects the nearest one, scene change ID = 3 at 101000 mSec is chosen, and this is reflected in Table 3.
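The worked example can be reproduced with the sketch below. It is a simplification that uses a binary search rather than the two-pointer A/B flow of FIG. 4, though it applies the same rule: take the nearest scene change strictly before the start point, provided it lies within Th4. The section start times and the two scene changes at 100000 and 101000 mSec come from the text. The section end times and the remaining scene change times are hypothetical, chosen only so that no scene change falls within 2000 mSec before sections 0 and 2.

```python
from bisect import bisect_left


def snap_starts_to_scene_changes(sections, scene_changes_ms, th4_ms):
    """Move each section start back to the nearest preceding scene change,
    if that scene change lies within th4_ms of the start."""
    changes = sorted(scene_changes_ms)
    corrected = []
    for start, end in sections:
        # changes[:i] are the scene changes strictly earlier than start.
        i = bisect_left(changes, start)
        if i > 0 and start - changes[i - 1] <= th4_ms:
            start = changes[i - 1]
        corrected.append((start, end))
    return corrected


# Start times are from the description; end times are hypothetical.
voice_sections = [(60000, 80000), (102000, 120000), (400000, 410000)]
# 100000 and 101000 are from the description; the others are hypothetical.
scene_changes = [0, 50000, 100000, 101000, 300000]

corrected_sections = snap_starts_to_scene_changes(
    voice_sections, scene_changes, th4_ms=2000)
print(corrected_sections)
```

Only voice section 1 changes: its start moves from 102000 to 101000 mSec, the nearer of the two candidate scene changes.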

<Moving Image Quick-View Playback Unit 200>
The quick-view playback process (step S107) performed by the moving image quick-view playback unit 200 plays human voice sections (sections A) with audio at a speed at which a person can follow the speech, and plays sections that are not human voice sections (sections B) at a higher speed, within the range at which a person can follow the content by watching the video.
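One way to picture this is as a playback schedule that alternates between the two section types. The sketch below is illustrative only: the function name, the tuple layout, and the sample values are assumptions, and the corrected sections A are assumed to be sorted, non-overlapping, and within the total length.

```python
def build_schedule(sections_a, total_ms, speed_a, speed_b):
    """Split [0, total_ms) into alternating section B / section A entries.

    Returns a list of (start_ms, end_ms, kind, speed) tuples, where kind
    is "A" for human voice sections and "B" for everything else.
    """
    schedule = []
    pos = 0
    for start, end in sections_a:
        if start > pos:
            # The gap before this voice section is a section B.
            schedule.append((pos, start, "B", speed_b))
        schedule.append((start, end, "A", speed_a))
        pos = end
    if pos < total_ms:
        schedule.append((pos, total_ms, "B", speed_b))  # trailing section B
    return schedule


# Hypothetical 5-second clip with one voice section from 1 s to 2 s.
plan = build_schedule([(1000, 2000)], total_ms=5000, speed_a=1.5, speed_b=10.0)
for entry in plan:
    print(entry)
```

A player would then step through the entries in order, switching the playback rate at each boundary.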

Moving image playback environments have matured in recent years; with Microsoft's DirectShow module, for example, continuous playback is possible with a speed specified for each arbitrary section. A module with this capability makes it relatively easy to vary the playback speed section by section. What matters then is the criterion by which the speed is varied.

FIG. 5 is a flowchart showing the quick-view playback process in this embodiment.

In the figure, in step S601 the user selects the desired profile from the user profiles 14 described above. As a specific procedure, for example, a display screen containing a user profile list such as that illustrated in FIG. 8 is shown on the display 12, and the user selects the desired profile from it using a remote control terminal or the like.

To specify the desired profile in the user profile list shown in FIG. 7, for example, the remote control terminal is provided with an operation button for profile selection. When the user presses it, a menu display screen such as that illustrated in FIG. 9 appears, and the user specifies the desired profile with the profile selection button while viewing that screen. The user profile could of course also be selected automatically using personal recognition techniques such as fingerprint, voiceprint, or face recognition. Because this approach always identifies the correct profile, it prevents problems such as specifying the wrong profile, altering another person's profile, or viewing its contents.

To register a new user profile, the user selects the "New Registration" button on the display screen of FIG. 7 with a pointing device, whereupon the display screen illustrated in FIG. 8 appears for entering the profile name and other attributes.

FIG. 8 illustrates the display screen for user profile registration. In the initial state, all items other than the identification name and age are filled with reference values. The user enters a unique identification name and the age and changes only those items that need changing. After the entries pass a validity check against predetermined value ranges, the profile is newly added to the user profiles 14 when the user presses the "OK" button.

To change the contents of a profile, the user presses the "Change" button on the display screen of FIG. 7 and selects the desired profile on the display screen of FIG. 9. On the display screen of FIG. 8 that then appears, the user changes the items concerned and presses the "OK" button.

To delete a profile, the user presses the "Delete" button on the display screen of FIG. 7, selects the desired profile on the display screen of FIG. 9, and then presses the "OK" button.

If the "Cancel" button is pressed on the display screens of FIG. 7 and FIG. 8, the process ends without performing the processing (profile registration, change, or deletion) corresponding to the selections and entries made up to that point.

Next, step S602 determines whether the profile selected in step S601 exists in the user profiles 14. If it exists, the profile is read from the user profiles 14 in step S603. If it does not, the playback speeds of section A and section B and the volume during playback of section B, which are preset as reference values, are read in step S606. Table 4 shows an example of the data schema of a user profile.

Table 4 illustrates user profiles in this embodiment. The reference values may be stored as shown for profile ID = 0; in this case the playback speed of section A is 1.5x, the playback speed of section B is 10.0x, and the reference volume during playback of section B is 0 (that is, audio muted). These values are used as the reference values when a new user profile is registered as described above.

In the user profile data schema of Table 4, None indicates that no value is set; conversely, when a value is set, playback gives that value the highest priority. The Good and Poor entries in the visual acuity and hearing columns of Table 4 represent the person's dynamic visual acuity and ability to follow fast speech, independent of age.

In general, hearing tends to decline with age, and the speed at which speech is understood often decreases as well. Children, whose language ability is still developing, often cannot understand speech played back at high speed.

With this in mind, templates of the playback speeds of section A and section B suited to each age for people without impairments are prepared in advance, and these speeds are determined based on the age stored in the user profile 14.

There may, however, be causes unrelated to age: a young adult may nonetheless have weak dynamic visual acuity or difficulty following fast speech, or a non-native speaker may be unable to keep up when speech in a language other than their mother tongue (Japanese, for example) is played back at high speed. In this embodiment, therefore, the visual and hearing characteristics are described in the user profile as illustrated in Table 4, and when these are set they take priority, so that the playback speeds of section A and section B are set lower.

In such cases, for elderly users and users with weak dynamic visual acuity, the playback speed of human voice sections (section A) is set lower than normal (1x) speed, and the playback speed of sections that are not human voice sections (section B) is set to normal speed or higher. Although this may depart from the original purpose of quick-view playback, such users can follow the audio content of section A through slow playback while still viewing the whole video in less time than if every section were played back slowly.

Similarly, for users who have difficulty following fast speech, and for non-native speakers who cannot keep up with rapidly spoken Japanese or the like, the playback speed of section A is set lower than normal speed while the playback speed of section B is kept the same as for a healthy person of that age. This allows slow playback in which the audio content of section A can be followed, while the whole video is still viewed in less time than if every section were played back slowly.

Thus, in the present embodiment, the speed determination process for a user profile makes a comprehensive judgment that takes into account the template of section A and section B playback speeds suited to healthy persons of each age, any weakness in dynamic visual acuity or in following fast speech, and whether the user is a non-native speaker who cannot keep up with rapidly spoken Japanese.

In the present embodiment, whether the user is proficient in the language of the audio content is determined by comparing the information stored in the user profile 14, namely whether the user is proficient or language type information identifying the user's mother tongue, with the language type information of the audio content in the video to be played back. In recent years, digital content such as DVDs and digital media such as digital BS broadcasts store language type information identifying the language of the audio content, and program information has also become available electronically from an EPG (electronic program guide) or the like, so using this information is practical. Even when this information is not available, terrestrial TV programs are in the native language by default, and in bilingual broadcasts the main audio is usually the native language and the sub audio a foreign language, so the language may be estimated from these rules of thumb.

In step S604, the playback speed of section A and the playback speed of section B are determined based on the user's desired profile read in step S603. The details of this step are described with reference to FIG. 6.

FIG. 6 is a flowchart showing the details of step S604 (FIG. 5) within the flowchart of the video quick-view playback process in the present embodiment.

In the figure, first, in step S601, the profile previously selected by the user is read from the user profile 14. In step S602, the section A and section B playback speeds for that user are provisionally determined by looking up the user's age, obtained from the read profile, in a template that holds the optimal section A and section B playback speeds for healthy persons of each age.

In step S603, it is determined whether the profile read in step S601 states that the user's dynamic visual acuity is weak. If so, in step S604 the playback speeds of section A and section B are both updated to values lower than the reference values. It is therefore desirable to store these values in the profile in advance.

In step S605, which is reached when step S603 determined that the profile does not state weak dynamic visual acuity, it is determined whether the profile states that the user has difficulty following fast speech. If so, in step S606 only the playback speed of section A is updated to a lower value. It is therefore desirable to store this value in the profile in advance as well.

In step S607, which is reached when step S605 determined that the profile does not state difficulty following fast speech, it is determined whether language type information for the audio content in the video data to be played back is available. If it is available, the process proceeds to step S608; if not, the process ends.

In step S608, the language type information of the audio content in the video data to be played back is obtained and compared with the proficient-language information recorded in the currently selected profile. If the two match, the process ends; if they do not match, in step S609 only the playback speed of section A is updated to a lower value. It is therefore desirable to store this value in the profile in advance as well.

That is, in the series of processes shown in FIG. 6, when none of the conditions checked in steps S603, S605, and S608 applies, the section A and section B playback speeds provisionally determined in step S602 are adopted as they are.
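
The decision flow of FIG. 6 can be sketched roughly as follows. This is an illustrative sketch only, not the implementation of the embodiment: the age template `AGE_TEMPLATE`, the lowered speeds `slow_a` and `slow_b`, and all function and parameter names are hypothetical, since the description gives no concrete template values and says only that the lowered values should be stored in the profile in advance.

```python
from typing import Optional, Tuple

# Hypothetical template: (maximum age, section A speed, section B speed).
# The numbers are illustrative and do not come from the description.
AGE_TEMPLATE = [(12, 1.2, 6.0), (59, 1.5, 10.0), (200, 1.0, 4.0)]


def template_speeds(age: int) -> Tuple[float, float]:
    """S602: look up provisional speeds for a healthy person of this age."""
    for max_age, speed_a, speed_b in AGE_TEMPLATE:
        if age <= max_age:
            return speed_a, speed_b
    return AGE_TEMPLATE[-1][1], AGE_TEMPLATE[-1][2]


def decide_speeds(
    age: int,
    weak_dynamic_vision: bool,
    weak_fast_hearing: bool,
    proficient_language: Optional[str],
    content_language: Optional[str],
    slow_a: float = 0.8,  # assumed lowered section A speed (below 1x)
    slow_b: float = 2.0,  # assumed lowered section B speed (1x or higher)
) -> Tuple[float, float]:
    speed_a, speed_b = template_speeds(age)                # S602
    if weak_dynamic_vision:                                # S603
        return min(speed_a, slow_a), min(speed_b, slow_b)  # S604: lower both
    if weak_fast_hearing:                                  # S605
        return min(speed_a, slow_a), speed_b               # S606: lower A only
    if content_language is None:                           # S607: no language info
        return speed_a, speed_b
    if content_language != proficient_language:            # S608
        return min(speed_a, slow_a), speed_b               # S609: lower A only
    return speed_a, speed_b                                # provisional values kept
```

Using `min` here is a design choice of the sketch so that an update never raises a speed above the provisional template value.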

If a user's dynamic visual acuity or ability to follow fast speech is better, or conversely worse, than would be expected for an elderly or young user, it is preferable to provide a menu for changing the playback speeds of section A and section B so that these values can be changed. In this case, the user changes the section A and section B playback speeds as appropriate while watching the playback video, and the resulting playback speed information is stored in the profile corresponding to that user, either automatically or after asking the user for confirmation. This enables easy-to-understand quick-view playback suited to each user that reflects the user's previous operations.

If the process is to be simplified without using the profile described above, an embodiment is conceivable in which, for example, instead of the processing in steps S601 to S604 and step S606, the user can use an operation menu to set the playback speed of section A between 0.5x and 2x and the playback speed of section B between 2x and 10x.
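
A minimal sketch of such a manual setting, assuming that out-of-range requests are clamped to the nearest permitted value (the description states only the ranges; the clamping behavior and all names below are assumptions):

```python
from typing import Tuple

SPEED_A_RANGE = (0.5, 2.0)   # section A: 0.5x to 2x, as stated in the description
SPEED_B_RANGE = (2.0, 10.0)  # section B: 2x to 10x, as stated in the description


def clamp(value: float, bounds: Tuple[float, float]) -> float:
    """Limit value to the inclusive range given by bounds."""
    low, high = bounds
    return max(low, min(high, value))


def set_manual_speeds(requested_a: float, requested_b: float) -> Tuple[float, float]:
    """Return the section A and section B speeds actually applied."""
    return clamp(requested_a, SPEED_A_RANGE), clamp(requested_b, SPEED_B_RANGE)
```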

Incidentally, when section B is played back at a high speed multiple, a high-pitched squealing sound is produced. If the user does not wish to hear it, an embodiment is conceivable in which, during playback of section B, the audio is muted so that no sound is output, or is changed to a low volume. Such a setting is also recorded in advance in the profile read in step S603, and that profile is given the highest priority during quick-view playback; when it is determined in step S602 that no profile exists, a preset reference volume is adopted in step S606. Of course, for a still simpler arrangement, an embodiment is conceivable in which the quick-view playback process fixes in advance how the audio playback level of section B is handled.

With the above configuration, in the present embodiment the playback speed of section A, the playback speed of section B, or both, as well as the audio level of section B, are specified through the user profile, so that playback optimal for each user can be realized easily.

Next, in step S605, the corrected voice section information is read from the video quick-view index storage unit 11. In step S607, the playback time of section A is calculated by dividing the total length of section A by its playback speed, the playback time of section B is calculated in the same way, and the two values are added to obtain the time the user needs for quick viewing. The calculated time is presented to the user using the display 23 or the like.
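
The calculation in step S607 can be illustrated as follows. This is a sketch under stated assumptions: voice sections are given as non-overlapping (start, end) pairs in seconds, everything outside them is treated as section B, and the function and parameter names are hypothetical.

```python
from typing import List, Tuple


def quick_view_seconds(
    voice_sections: List[Tuple[float, float]],  # corrected section A (start, end) pairs
    total_seconds: float,                       # full length of the video
    speed_a: float,
    speed_b: float,
) -> float:
    """Time needed for quick viewing: length_A / speed_A + length_B / speed_B."""
    length_a = sum(end - start for start, end in voice_sections)
    length_b = total_seconds - length_a  # assumed: the rest of the video is section B
    return length_a / speed_a + length_b / speed_b
```

For a 600-second video with 240 seconds of speech, played at 1.5x for section A and 10.0x for section B, this gives 240 / 1.5 + 360 / 10.0 = 196 seconds.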

In step S608, it is determined, by an input operation on a remote control terminal or the like, whether the user who saw the quick-view playback time in step S607 is satisfied with that time. If the user is satisfied, in step S610 the video to be played back, stored in the video data storage unit 10, is played back in accordance with the section A and section B playback speeds and the section B audio playback level set by the processing described above.

In step S609, which is reached when step S608 determined that the user is not satisfied, a man-machine interface is provided through which the section A and section B playback speeds and the section B audio playback level can be changed so that playback fits within the user's desired time. A user who is not satisfied with the profile or the standard settings thereby adjusts the settings to approach the desired playback time, and the process returns to step S607.

As another embodiment corresponding to step S609, a configuration is also conceivable in which the user can change the playback speed of each section while watching video playback at the currently set section A and section B speeds, and the time required for quick viewing is recalculated and presented accordingly, so that a user who is not satisfied with the profile or the standard settings can adjust toward the desired playback time.

Regarding the relationship between the user profile and the user's desired speed instruction: when a user who has seen the time required for quick-view playback in step S607 adjusts the settings through the man-machine interface for changing the section A and section B playback speeds in order to fit the desired time, the user may wish to adopt the adjusted values as the reference values. In such a case, the playback speed information adjusted by the user is stored in the profile corresponding to that user, either automatically or when "Yes" is selected on the confirmation screen illustrated in FIG. 10. Subsequent playback can then provide easy-to-understand quick-view playback suited to that user, reflecting the previous operations.

In the embodiment described above, a configuration example was given in which the user checks the calculated time required for quick-view playback and then changes the section A and section B playback speeds so that playback fits within the desired time, allowing a user who is not satisfied with the profile or the standard settings to adjust toward the desired playback time. However, the invention is not limited to this configuration. For example, in another embodiment the user can change the section A and section B playback speeds individually while watching the playback video, and the time required for quick viewing under those settings is recalculated and presented to the user, so that the user can adjust toward the desired playback time.

In the present embodiment, voice section information has been described as a pair of a start point and an end point, but it may instead consist of a start point and the section length, or an end point and the section length.
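
The three representations carry the same information, as the following small sketch shows by converting the alternatives back to a (start, end) pair. The function names are hypothetical, and the unit of the values (for example seconds or frame numbers) is left open here.

```python
from typing import Tuple


def from_start_and_length(start: float, length: float) -> Tuple[float, float]:
    """(start, length) -> (start, end)"""
    return start, start + length


def from_end_and_length(end: float, length: float) -> Tuple[float, float]:
    """(end, length) -> (start, end)"""
    return end - length, end
```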

Thus, according to the present embodiment described above, without breaking the synchronization between video and audio, during quick-view playback all human speech is played back at a speed at which its content can be understood, while sections containing no human speech (section B) are played back at a higher speed. The total viewing time during quick-view playback can thereby be reduced sensibly compared with normal-speed playback.

Further, according to the present embodiment, by using the user profile 14, the section A and section B playback speeds can easily be set to speeds suited to each user, and the volume during playback of section B can also be set to suit the user.

Furthermore, according to the present embodiment, by displaying the time required for quick-view playback in advance or during video playback, a user who is not satisfied with it can specify the section A and section B playback speeds to adjust the required time to one that best suits the user. Because the information set by this adjustment can be stored as an update to the profile corresponding to that user, appropriate video playback can be performed at the next quick-view playback.

(Other embodiments)
The present invention, described above using the embodiments as examples, may be applied to a system composed of a plurality of devices or to an apparatus consisting of a single device.

The present invention also covers the case where a software program that implements the functions of the flowcharts described in the embodiments is supplied, directly or remotely, to a system or apparatus operating as the video playback apparatus described above, and a computer of that system or apparatus reads out and executes the supplied program code. In that case, the form need not be a program as long as it has the functions of the program.

Accordingly, the program code itself that is installed in a computer in order to implement the functional processing of the present invention on that computer also implements the present invention. That is, the claims of the present invention also cover the computer program itself for implementing the functional processing of the present invention.

In that case, as long as it has the functions of the program, the program may take any form, such as object code, a program executed by an interpreter, or script data supplied to an OS.

Recording media for supplying the program include, for example, floppy (registered trademark) disks, hard disks, optical disks, magneto-optical disks, MO, CD-ROM, CD-R, CD-RW, magnetic tape, nonvolatile memory cards, ROM, and DVD (DVD-ROM, DVD-R).

As another method of supplying the program, a browser on a client computer may be used to connect to a website on the Internet, and the computer program of the present invention itself, or a compressed file including an automatic installation function, may be downloaded from the website to a recording medium such as a hard disk. The program may also be supplied by dividing the program code constituting the program of the present invention into a plurality of files and downloading the files from different websites. That is, a WWW (World Wide Web) server that allows a plurality of users to download the program files for implementing the functional processing of the present invention on a computer is also covered by the claims of the present invention.

The program of the present invention may also be encrypted, stored on a storage medium such as a CD-ROM, and distributed to users. Users who satisfy predetermined conditions are then allowed to download key information for decryption from a website via the Internet, and by using that key information the encrypted program is executed and installed on a computer.

The functions of the embodiments described above are implemented when the computer executes the read program. In addition, an OS or the like running on the computer may perform part or all of the actual processing based on the instructions of the program, and the functions of the embodiments may also be implemented by that processing.

Furthermore, after the program read from the recording medium is written into a memory provided on a function expansion board inserted into the computer or in a function expansion unit connected to the computer, a CPU or the like provided on the function expansion board or function expansion unit performs part or all of the actual processing based on the instructions of the program, and the functions of the embodiments described above are also implemented by that processing.

FIG. 1 is a conceptual diagram of the video quick-view algorithm in the video playback apparatus according to the present embodiment.
FIG. 2 is a flowchart outlining the human voice section detection process performed by the video quick-view index creation unit 100 in the present embodiment.
FIG. 3 is a flowchart showing the integration correction process performed on voice sections separated by short intervals in the present embodiment.
FIG. 4 is a flowchart showing the voice section integration correction process performed using scene change points in the present embodiment.
FIG. 5 is a flowchart showing the video quick-view playback process in the present embodiment.
FIG. 6 is a flowchart showing the details of step S604 (FIG. 5) within the flowchart of the video quick-view playback process in the present embodiment.
FIG. 7 illustrates a display screen for user profile selection.
FIG. 8 illustrates a display screen for user profile registration.
FIG. 9 shows an example of a user profile in the present embodiment.
FIG. 10 illustrates a display screen that asks the user to confirm whether the adjusted values are to be used as reference values for subsequent video playback when a user who is not satisfied with the presented time required for quick-view playback has changed the settings.

Claims

A video playback apparatus capable of high-speed playback of video data including an audio signal and sub-information, comprising:
determination means for determining, based on the sub-information included in the video data, a first voice section representing a period of human utterance and a second voice section other than the first voice section;
quick-view playback means for, based on the video data, playing back the first voice section at high speed with playback sound at a predetermined speed at which the user can grasp the content, while playing back the second voice section at a speed higher than the predetermined speed; and
presentation means for calculating the time required for the high-speed video playback based on the length and playback speed of the first voice section and the length and playback speed of the second voice section, and presenting the calculated required time to the user,
wherein the quick-view playback means further comprises adjustment means for, when the user performs an operation to change the playback speeds of the first and second voice sections in response to the presentation of the required time by the presentation means, adjusting the required time based on the changed playback speeds.

A video playback method capable of high-speed playback of video data including an audio signal and sub-information, comprising:
a determination step of determining, based on the sub-information included in the video data, a first voice section representing a period of human utterance and a second voice section other than the first voice section;
a quick-view playback step of, based on the video data, playing back the first voice section at high speed with playback sound at a predetermined speed at which the user can grasp the content, while playing back the second voice section at a speed higher than the predetermined speed; and
a presentation step of calculating the time required for the high-speed video playback based on the length and playback speed of the first voice section and the length and playback speed of the second voice section, and presenting the calculated required time to the user,
wherein the quick-view playback step includes an adjustment step of, when the user performs an operation to change the playback speeds of the first and second voice sections in response to the presentation of the required time in the presentation step, adjusting the required time based on the changed playback speeds.

A computer program for causing a computer to execute the video playback method according to claim 2.