JP2003309814A

JP2003309814A - Moving picture reproducing apparatus, moving picture reproducing method, and its computer program

Info

Publication number: JP2003309814A
Application number: JP2002113624A
Authority: JP
Inventors: Hirotaka Shiiyama; 弘隆椎山
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2002-04-16
Filing date: 2002-04-16
Publication date: 2003-10-31
Anticipated expiration: 2022-04-16
Also published as: JP4086532B2

Abstract

PROBLEM TO BE SOLVED: To provide a moving picture reproducing apparatus that accurately detects the periods in which a person is speaking in an audio signal and, on the basis of the detected periods, greatly reduces the time a user needs for browsing while faithfully maintaining synchronization between video and audio.

SOLUTION: The moving picture reproducing apparatus discriminates, on the basis of the audio signal included in moving picture data, a period A denoting a period of human speech from periods B other than the period A. For the period A, it reproduces the moving picture together with its audio at a prescribed speed at which a user can easily grasp the content (e.g. 1.5 times or double the normal, 1x playback speed). For the periods B, it reproduces the moving picture at a speed higher than the prescribed speed (e.g. 5 to 10 times the normal speed), with the audio at low volume or muted. The playback speeds can be adjusted according to attribute information of the user registered in a user profile 14.

COPYRIGHT: (C)2004,JPO

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to the field of moving picture reproduction technology involving audio reproduction.

[0002]

[Description of the Related Art] Conventionally, moving picture reproducing apparatuses that reproduce audio, such as video tape recorders, have been provided with functions such as double-speed playback and high-speed fast-forward so that the user can view an entire moving picture (that is, the whole of the content to be reproduced) in a short time.

[0003] For video tape recorders, which are typical moving picture reproducing apparatuses, a technique has also been proposed in recent years in which, during double-speed playback of a recording medium, a first audio section whose sound energy is equal to or higher than a predetermined threshold and a second audio section whose sound energy is below that threshold are detected, and the audio signal of the first audio section is reproduced continuously while being pitch-converted. The reproduced audio thereby encroaches on the second audio section, and although the speech sounds somewhat fast to the user, the storage medium can be played at double speed with audio whose content remains intelligible.

[0004]

[Problems to be Solved by the Invention] However, when the audio signal is partially pitch-converted as described above, synchronization between audio and video cannot always be maintained during moving picture playback (quick-view playback). For example, the video of a person speaking and the reproduced audio fall out of sync, so the playback seems unnatural to human perception and the user may feel uncomfortable.

[0005] Further, for example, Japanese Patent Laid-Open No. 10-32776 and Japanese Patent Laid-Open No. 9-243351 propose techniques that summarize a moving picture by detecting silent states on the basis of audio energy and regarding all sound other than the detected silent states as sections of human speech. In a moving picture in which human speech is dominant throughout, such as a news program, sections of human speech can be detected from audio energy to some extent. In an environment containing background noise or background music, however, this method is not practical.

[0006] Furthermore, the prior art predating the above patent publications also includes many proposals for detecting voice and reproducing a moving picture in consideration of the detected voice, and most of them detect voice by thresholding sound energy. Behind this lies an ambiguity in Japanese: the word "onsei" (音声) is used both for the human voice and for sound in general, including the human voice. It is therefore inappropriate to lump the sound energy thresholding of such prior art together with true "voice detection".

[0007] Japanese Patent Laid-Open No. 9-247617 proposes a technique that detects "feature points of audio information and the like" by calculating the FFT (fast Fourier transform) spectrum of an audio signal and finding singular points, and that then analyzes the volume. With a method based on the FFT spectrum, however, when the audio signal to be reproduced contains so-called background music or the like, which has a broadband spectral distribution, it is difficult to detect the human voice within it.

[0008] As described above, conventional moving picture reproduction involving audio has the problem that the detection of voice sections is makeshift and inaccurate. Moreover, when the detection result is used to create a summary of the moving picture or to perform multiple-speed playback, there is the further problem that synchronization between video and audio cannot be maintained during playback.

[0009] In general, it is not easy for users such as the elderly or children to master various devices, and it is known that they have difficulty following the content of speech uttered at high speed. For such users, therefore, the optimum playback conditions for quick viewing (shortened playback) such as multiple-speed playback on a moving picture reproducing apparatus like the tape recorder described above differ from those for general users.

[0010] Likewise, for users with weak dynamic visual acuity, users whose hearing of fast speech is weak, and foreign users whose native language is not the language of the reproduced audio, the optimum playback conditions for quick viewing (shortened playback) such as multiple-speed playback on such a moving picture reproducing apparatus differ from those for general users.

[0011] Accordingly, an object of the present invention is to provide a moving picture reproducing apparatus, a moving picture reproducing method, and a computer program therefor that accurately detect sections of human speech and, in accordance with the detected sections, greatly shorten the time the user needs for browsing while faithfully maintaining synchronization between video and audio.

[0012]

[Means for Solving the Problems] To achieve the above object, a moving picture reproducing apparatus according to the present invention is characterized by the following configuration.

[0013] That is, a moving picture reproducing apparatus capable of reproducing, at high speed, moving picture information that includes an audio signal comprises: voice section determination means for determining, on the basis of the audio signal included in the moving picture information, a first voice section representing a period of human speech and a second voice section other than the first voice section; and quick-view playback means for performing, on the basis of the moving picture information, high-speed moving picture playback accompanied by reproduced audio at a predetermined speed at which the user can grasp the content for the first voice section, while performing at least high-speed moving picture playback at a speed higher than the predetermined speed for the second voice section.

[0014] In a preferred embodiment, the quick-view playback means performs, in the second voice section, moving picture playback at a speed higher than the predetermined speed, accompanied at least by reproduced audio at a low volume.

[0015] Alternatively, in another preferred embodiment, the quick-view playback means performs, in the second voice section, moving picture playback at a speed higher than the predetermined speed with no audio.

[0016] In any of the above apparatus configurations, the voice section determination means preferably extracts, on the basis of the audio signal, a voice pitch corresponding to vocal cord vibration, and determines the first voice section on the basis of the extracted voice pitch.

[0017] Also, in any of the above apparatus configurations, the voice section determination means detects the vowel portions, which dominate the human voice, by extracting pitches within the range of possible vocal cord vibration frequencies from a signal obtained by filtering the audio signal to the frequency band of the human voice, and determines the first voice section by integrating the detected vowel portions.

[0018] Further, for example, the voice section determination means includes correction means for integrating and correcting a plurality of first voice sections that are close to one another on the time axis when determining the first voice section on the basis of the audio signal. In this case, the correction means preferably detects the scene change points included in the moving picture information and, when the time interval (that is, the distance on the time axis) between the start point of a first voice section of interest and the nearby scene change point, namely the detected scene change point that is earlier than and closest to that start point, is equal to or less than a predetermined threshold, corrects the start point of that first voice section by replacing it with information corresponding to the nearby scene change point.

[0019] Further, for example, the quick-view playback means calculates the time required for the high-speed moving picture playback on the basis of the length of the first voice section and the playback speed of that section, together with the length of the second voice section, and presents the calculated required time to the user. In this case, the quick-view playback means preferably includes adjustment means for adjusting the required time on the basis of the changed playback speeds when, in response to the presentation of the required time, the user performs an operation that changes the playback speeds of the first and second voice sections.
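The required-time calculation described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `required_time`, the `(start, end, kind)` tuple layout, and the use of seconds are hypothetical, and the sketch assumes the second voice section is also divided by its own playback speed, which the text implies but does not spell out.

```python
from typing import List, Tuple

# (start_seconds, end_seconds, kind), where kind is "A" (human speech)
# or "B" (everything else). This layout is a hypothetical choice.
Section = Tuple[float, float, str]


def required_time(sections: List[Section], speed_a: float, speed_b: float) -> float:
    """Estimate quick-view playback time: each section's length divided by its speed."""
    total = 0.0
    for start, end, kind in sections:
        speed = speed_a if kind == "A" else speed_b
        total += (end - start) / speed
    return total


if __name__ == "__main__":
    timeline = [(0.0, 60.0, "A"), (60.0, 160.0, "B")]
    # Recomputing after the user changes the speeds only needs another call.
    print(required_time(timeline, speed_a=1.5, speed_b=10.0))
    print(required_time(timeline, speed_a=2.0, speed_b=5.0))
```

Calling the function again with the changed speeds corresponds to the adjustment of the presented time after a user operation.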

[0020] In a preferred embodiment, the apparatus further comprises a user profile in which attribute information about individual users who can use the moving picture reproducing apparatus (for example, age, language used, dynamic visual acuity, and hearing of fast speech) is registered, and the quick-view playback means automatically determines the playback speeds of the first and second voice sections in accordance with the attribute information about a specific user registered in the user profile.

[0021] The same object is also achieved by a moving picture reproducing method corresponding to the moving picture reproducing apparatus having each of the above configurations.

[0022] The same object is also achieved by program code that causes a computer to realize the moving picture reproducing apparatus and method having each of the above configurations, and by a computer-readable storage medium in which that program code is stored.

[0023]

[Embodiments of the Invention] An embodiment of a moving picture reproducing apparatus according to the present invention will now be described in detail with reference to the drawings.

[0024] First, an outline of the operation of the moving picture reproducing apparatus according to the present embodiment will be described with reference to FIG. 1.

[0025] FIG. 1 is a conceptual diagram of the moving picture quick-view algorithm in the moving picture reproducing apparatus according to the present embodiment.

[0026] As shown in FIG. 1, the moving picture reproducing apparatus according to the present embodiment broadly consists of a moving picture quick-view index creation unit 100 and a moving picture quick-view playback unit 200.

[0027] <Moving picture quick-view index creation unit 100> In the moving picture quick-view index creation unit 100, moving picture data read from the moving picture data storage unit 10 is separated into video data and audio data (an audio signal) by video/audio separation processing (step S101). The audio signal undergoes voice section estimation processing (step S102) and voice section correction processing (step S103). The video data undergoes video change degree calculation processing (step S105) and scene change point detection processing (step S106). Quick-view playback section correction processing (step S104) then generates quick-view playback section information, which is stored in the moving picture quick-view index storage unit 11.

[0028] That is, in the voice section estimation processing (step S102), the audio signal obtained by the video/audio separation processing (step S101) is filtered with a low-pass filter, the zero-crossing points of the filtered signal are found, and a group of small segments is formed, each having zero-crossing points as its start point and end point. Further, when the signal energy of an adjacent small segment is small, that small segment is combined with the immediately preceding small segment, and each small segment is thereby determined. Here, a zero-crossing point is a point at which the waveform of the filtered audio signal crosses the zero level, which is the reference signal level.
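The zero-crossing segmentation described above can be illustrated with a minimal sketch. Several details are assumptions, not taken from the text: a centered moving average stands in for the unspecified low-pass filter, energy is the sum of squared samples, and the names `moving_average`, `zero_crossings`, `small_segments`, and the parameter `energy_floor` are hypothetical.

```python
from typing import List, Tuple


def moving_average(signal: List[float], width: int) -> List[float]:
    """Crude low-pass filter (assumed): centered moving average, window clipped at the edges."""
    half = width // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def zero_crossings(signal: List[float]) -> List[int]:
    """Indices i where the sign differs between sample i-1 and sample i."""
    return [i for i in range(1, len(signal))
            if (signal[i - 1] < 0) != (signal[i] < 0)]


def small_segments(signal: List[float], energy_floor: float) -> List[Tuple[int, int]]:
    """Split at zero crossings; fold a low-energy segment into the one before it."""
    points = zero_crossings(signal)
    segments: List[Tuple[int, int]] = []
    for start, end in zip(points, points[1:]):
        energy = sum(x * x for x in signal[start:end])
        if segments and energy < energy_floor:
            # Low energy: extend the immediately preceding segment.
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((start, end))
    return segments


if __name__ == "__main__":
    filtered = [1.0, -1.0, -1.0, 1.0, 1.0, -0.1, 0.1, 1.0]
    print(small_segments(filtered, energy_floor=0.5))
```

In practice the input to `small_segments` would be the output of the low-pass filter; the demo passes an already-filtered toy signal so the merge step is easy to follow.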

[0029] Each small segment determined in this way is given, as a label, attribute information representing its nature. These labels always include a label for voice pitch, and a voice pitch segment also carries voice pitch period information (described in detail later).

[0030] In the present embodiment, voice detection relies on the group of segments having such voice pitch labels. The distance between adjacent segments having voice pitch labels (that is, the time interval; the same applies hereinafter) is thresholded, and the continuity of the voice pitch period information contained in the individual voice pitch segments is also used. By considering both the continuity of the voice pitch period and the distance between segments, a plurality of discrete small segments having voice pitch labels are integrated into one segment, so that voice sections are detected accurately.

[0031] In the voice section correction processing (step S103), on the basis of the result of the voice section estimation processing (step S102), that is, the voice sections detected in the audio signal, a plurality of neighboring voice sections are integrated so that the person (user) is not made uncomfortable during audio playback. The voice sections representing periods of human speech that are to be reproduced (hereinafter referred to as "human voice sections" or sections A) are thereby corrected, and corrected voice section information is obtained.

[0032] For example, as an undesirable case in high-speed moving picture playback, suppose the interval between two neighboring sections A is short. If those voice sections are played at a multiple speed with audio (for example, double speed) that lets a person grasp the content by listening, while the section between them that is not a human voice section (hereinafter referred to as section B) is played at a high multiple speed that still lets a person grasp the content by watching the video, the playback changes abruptly and is unpleasant to listen to for ordinary users.

[0033] Therefore, in the present embodiment, the voice section correction processing (step S103) takes the interval between human voice sections into account and, when the interval satisfies a predetermined condition, integrates the human voice sections concerned, thereby eliminating the unpleasantness described above. The simplest predetermined condition is, for example, that the interval between human voice sections is equal to or less than a predetermined threshold.
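The simplest condition mentioned above, merging human voice sections whose gap is at or below a threshold, can be sketched as follows. The function name `merge_close_sections`, the `(start, end)` tuples in seconds, and the example threshold are hypothetical illustration choices, not values from the text.

```python
from typing import List, Tuple


def merge_close_sections(sections: List[Tuple[float, float]],
                         max_gap: float) -> List[Tuple[float, float]]:
    """Merge voice sections whose gap to the previous section is <= max_gap."""
    merged: List[Tuple[float, float]] = []
    for start, end in sorted(sections):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough: extend the previous section instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    sections_a = [(0.0, 2.0), (2.3, 4.0), (9.0, 10.0)]
    print(merge_close_sections(sections_a, max_gap=0.5))
```

After merging, the short gap between the first two sections is played at the section A speed, which avoids the abrupt speed change that the text describes as unpleasant.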

[0034] In the video change degree calculation processing (step S105), inter-frame similarity is calculated for the video data obtained by the video/audio separation processing (step S101) by performing the inter-frame similarity comparison processing described in Japanese Patent Laid-Open No. 2000-235639, and video change information is thereby generated.

[0035] In general, when moving picture data containing an audio signal has a video transition immediately followed by the start of a voice section, the video at the head of the scene is played at high speed for just an instant, after which playback continues at the multiple speed with audio that a person can follow by listening. The user therefore perceives an unnatural flicker in the video.

[0036] Therefore, in the present embodiment, the scene change point detection processing (step S106) detects a group of scene change points (scene change point information) on the basis of the video change information obtained in the video change degree calculation processing (step S105), for example by adopting the scene change point detection technique disclosed in the present applicant's earlier Japanese Patent Laid-Open No. 2000-235639.

[0037] Then, in the quick-view playback section correction processing (step S104), when a scene change point is earlier in time than, and closest to, the head of a voice section resulting from the voice section correction processing in step S103, and its distance from that head is equal to or less than a predetermined threshold, the head of the voice section is replaced with information corresponding to that scene change point, as detected in step S106. This removes the user's sense of unnaturalness.
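The start-point correction described above can be sketched as a small lookup. The function name `snap_start_to_scene_change`, the use of seconds, and the assumption that the scene change times are already sorted in ascending order are all hypothetical choices for illustration.

```python
import bisect
from typing import List


def snap_start_to_scene_change(section_start: float,
                               scene_changes: List[float],
                               max_distance: float) -> float:
    """Return the corrected start time of a voice section.

    scene_changes must be sorted ascending. The start is moved back to the
    closest scene change strictly before it, but only when that scene change
    is within max_distance; otherwise the start is returned unchanged.
    """
    i = bisect.bisect_left(scene_changes, section_start)
    if i == 0:
        return section_start  # no scene change precedes this section
    nearest_before = scene_changes[i - 1]
    if section_start - nearest_before <= max_distance:
        return nearest_before
    return section_start


if __name__ == "__main__":
    cuts = [5.0, 12.0, 30.0]
    print(snap_start_to_scene_change(12.4, cuts, max_distance=0.5))
    print(snap_start_to_scene_change(20.0, cuts, max_distance=0.5))
```

Moving the start back to the cut means the whole scene head is played at the section A speed, so the brief high-speed flash before the speech no longer occurs.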

[0038] <Moving picture quick-view playback unit 200> Next, in the moving picture quick-view playback unit 200, the moving picture quick-view playback processing (step S107) reproduces the video on the display 12 and the audio through the speaker 13. For playback by this processing, the time required for playback is displayed in step S108 on the basis of the quick-view playback section information read from the moving picture quick-view index storage unit 11. The user's desired playback conditions, set in step S109 in response to that display and fed back, are then judged together with the playback conditions based on the user profile 14 to make the final setting of the quick-view playback conditions. The moving picture data read from the moving picture data storage unit 10 is played on the basis of the quick-view playback conditions thus set.

[0039] At that time, in the present embodiment, section A is played at a multiple speed with audio, at a speed at which the user can grasp the content by listening to the reproduced audio, and section B is played at a high multiple speed within the range in which the user can grasp the content by watching the reproduced video.

[0040] Here, experiments by the present applicant have shown that the multiple-speed playback of section A, that is, playback at a speed at which a person can grasp the content by listening, should be up to double speed, and preferably about 1.5 times speed. Section B, on the other hand, is played at a high multiple speed within the range in which a person can grasp the content by watching the reproduced video. According to the applicant's experiments, it has been found empirically that this speed should be up to 10 times speed, and preferably 5 times speed or more.

[0041] It is known that playing section B at a high multiple speed generally produces a squealing ("kyuru-kyuru") noise. In step S107, therefore, when section B is played at high speed and the user does not want to hear such noise, the audio can be muted to silence it, or the playback volume can be lowered.

[0042] Regarding the playback speed of section A, the playback speed of section B, and the volume during that playback, the simplest implementation is to decide in advance how the audio is handled in the moving picture quick-view playback processing (step S107). Alternatively, the playback speeds can be made variable so that the user can set them.

[0043] In general, however, it is not easy for users such as the elderly or children to master various devices, and it is known that they have difficulty understanding the content when audio is played at high speed. For them it is preferable to play at a somewhat lower multiple speed, simply and without troublesome speed adjustment. Similarly, regardless of age, users with weak eyesight (visually impaired users), in particular weak dynamic visual acuity, users with weak hearing (hearing impaired users), in particular for fast speech, and foreign users whose native language is not the language of the reproduced audio are also known to have difficulty understanding the content when audio is played at high speed, and there is an optimum playback speed for these users as well.

[0044] Therefore, in the present embodiment, attribute information about each user, such as the user's age, language, understandable languages, eyesight, and hearing, as well as the baseline playback conditions that the individual user prefers, is stored in advance in the user profile 14. In the moving picture quick-view playback processing (step S107), the profile 14 is referred to in order to determine, for the target user, the playback speed of the sections of human speech (sections A) and the playback speed of the sections other than human speech (sections B). This enables quick-view playback whose content is easy to understand for each individual.

[0045] Also, as described above, when the audio is to be muted or its volume lowered during the high multiple-speed playback of section B, describing that setting in the profile 14 in advance enables comfortable quick-view playback suited to each individual user.

[0046] Furthermore, for elderly users and users with a handicap in dynamic visual acuity, the playback speed of section A is set slower than normal (1x) speed while the playback speed of section B is set to normal speed or faster, although this may depart from quick-view playback in the original sense. Such users can then browse the moving picture (that is, the moving picture data stored in the moving picture data storage unit 10) with slow playback that lets them grasp the audio content of section A, yet in a shorter total time than if every section were played slowly.

【００４７】また、早い音声の内容理解にハンディキャ
ップのあるユーザおよび音声内容の言語に堪能でないユ
ーザに関しては、本来の早見再生という観点からは外れ
るかもしれないが、区間Ａの再生速度を等倍速度より遅
く設定すると共に、区間Ｂの再生速度は10倍速まで、望
ましくは5倍速以上とし、係るユーザが区間Ａの音声内
容を把握可能な低速再生を行いながらも、全体としては
全ての区間を低速再生する場合と比較して短い所要時間
で、動画（即ち、動画データ記憶部１０に格納されてい
る動画データ）を閲覧することが可能となる。ここで、
音声内容の言語に堪能か否かの判断は、上述したプロフ
ァイル１４に予め記憶した識別情報（後述する表４では
得意言語）と、再生対象の動画に含まれる音声の言語種
類情報とを比較することによって行なえば良い。Further, for users who have difficulty understanding fast speech and users who are not fluent in the language of the audio, although this may depart from the original purpose of quick-view playback, the playback speed of section A is set slower than 1x while the playback speed of section B is set up to 10x, preferably 5x or higher. Such a user can then view the moving image (that is, the moving image data stored in the moving image data storage unit 10) with section A played slowly enough to follow the speech, yet in less total time than if every section were played slowly. Whether the user is fluent in the language of the audio can be determined by comparing the identification information stored in advance in the profile 14 (the preferred language in Table 4, described later) with the language type information of the audio contained in the moving image to be played.

【００４８】ユーザ・プロファイル１４を選択する手順
としては、例えば、ディスプレイ１２に表示されたプロ
ファイル選択画面にユーザ・プロファイルリストを表示
し、その中から、ユーザによるリモコン端末（不図示）
の操作に応じて選択することが考えられ、更に指紋や声
紋や顔認識等の個人認識技術を用いた自動的なプロファ
イル選択方法を採用しても良い。As a procedure for selecting the user profile 14, for example, a list of user profiles may be shown on a profile selection screen on the display 12, and one of them selected in response to the user's operation of a remote control terminal (not shown). An automatic profile selection method using a personal recognition technique such as fingerprint, voiceprint, or face recognition may also be adopted.

【００４９】ところで、上記の如く個々のユーザにとっ
て最適な早見再生を行う場合に、果たして元々どの長さ
の動画がどの位の時間で早見できるかは、空き時間を活
用して早見を行おうとしているユーザにとって重要な情
報である。Incidentally, when quick-view playback optimized for each user is performed as described above, how long a moving image of a given original length will take to view is important information for a user who intends to use spare time for quick viewing.

【００５０】そこで、本実施形態では、ステップＳ108
において、区間Ａのトータル長を再生速度で割ることに
よって区間Ａの再生時間を計算すると共に、区間Ｂにつ
いては、当該トータル長を再生速度で割ることによって
区間Ｂの再生速度を計算し、早見に要する時間として、
算出したこれら２つの値の和を求め、元の動画を等倍再
生する場合の所要時間と共にユーザに提示する。更に、
これらの再生時間をユーザが見た上で、所望の再生時間
内に収まるように、区間Ａの再生速度や区間Ｂの再生速
度を指定することにより、ユーザ所望の再生時間に近く
なるように調節することが可能である。Therefore, in this embodiment, in step S108, the playback time of section A is calculated by dividing the total length of section A by its playback speed, and the playback time of section B is likewise calculated by dividing the total length of section B by its playback speed. The sum of these two values is taken as the time required for quick viewing and is presented to the user together with the time required to play the original moving image at 1x speed. Having seen these times, the user can specify the playback speeds of section A and section B so that the total approaches the desired playback time.
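The calculation of step S108 can be sketched as follows. This is a minimal illustration, not the patented implementation; the function name `quick_view_time` and its parameters are hypothetical, and section lengths are assumed to be given in seconds.

```python
def quick_view_time(len_a, len_b, speed_a, speed_b):
    """Return (quick_view_seconds, normal_seconds).

    len_a, len_b     : total lengths of section A and section B in seconds
    speed_a, speed_b : playback speed multipliers (1.0 means 1x speed)
    All names are illustrative assumptions, not taken from the patent.
    """
    if speed_a <= 0 or speed_b <= 0:
        raise ValueError("playback speeds must be positive")
    time_a = len_a / speed_a  # playback time of section A
    time_b = len_b / speed_b  # playback time of section B
    return time_a + time_b, len_a + len_b


# 10 minutes of speech at 1.5x plus 20 minutes of non-speech at 10x
quick, normal = quick_view_time(600, 1200, 1.5, 10)
print(quick, normal)
```

Presenting both values side by side lets the user see the saving before adjusting either speed.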

【００５１】ところで、予め設定されたユーザのプロフ
ァイル１４と、ユーザが指示した所望の再生速度との関
連であるが、上記の如くステップＳ108においてプロフ
ァイル１４を用いて自動的に算出された動画早見再生に
要する時間を見たユーザが、所定のマンマシン・インタ
フェースを介して、ステップＳ109において、更に、区
間Ａおよび区間Ｂの再生速度を指定することにより、所
望の動画早見再生に要する時間（再生速度情報）を設定
した場合には、設定された所要時間内に納めるべく、自
動的、或いはユーザに確認を行った上で、係る設定され
た再生速度情報を新たにプロファイルに記憶することに
より、前回の操作情報を反映しつつ個々のユーザの好み
に応じた理解の容易な動画早見再生を行うことが可能と
なる。The relationship between the preset user profile 14 and the playback speed requested by the user is as follows. Suppose the user, having seen the quick-view time automatically calculated from the profile 14 in step S108, further specifies the playback speeds of section A and section B in step S109 through a predetermined man-machine interface and thereby sets the desired quick-view time (playback speed information). In that case, so that playback fits within the set time, the playback speed information thus set is newly stored in the profile, either automatically or after confirmation with the user. This makes it possible to perform easy-to-follow quick-view playback that matches each user's preferences while reflecting the previous operation.

【００５２】また、上述したユーザ・プロファイルに、
更に、区間Ｂの再生時の音量をどう処理するかを予め指
定しておく、或いは所定のマンマシン・インタフェース
を介してユーザが指定した場合には、その指定された音
量情報を反映しつつ個々のユーザの好みに応じた理解の
容易な動画早見再生を行うことが可能となる。In addition, if how the volume is to be handled during playback of section B is specified in advance in the user profile described above, or is specified by the user through a predetermined man-machine interface, easy-to-follow quick-view playback that matches each user's preferences can be performed while reflecting the specified volume information.

【００５３】＜動画再生装置の動作の詳細＞以下、上記
の如く概説した本実施形態に係る動画再生装置の動作の
詳細について説明する。以下の説明では、動画データ記
憶部１０に記憶された録画済の動画データに対して早見
再生を行うためのインデックスとして早見再生区間情報
を作成し、作成したその情報を利用して、当該動画デー
タの早見再生を行う場合を例に説明する。<Details of Operation of Moving Image Playback Apparatus> The operation of the moving image playback apparatus according to the present embodiment, outlined above, is described in detail below. The following description takes as an example the case where quick-view playback section information is created as an index for quick-view playback of recorded moving image data stored in the moving image data storage unit 10, and that information is then used to perform quick-view playback of the moving image data.

【００５４】本実施形態では、上述したように、ステッ
プＳ101の映像／音声分離処理を経た後処理として、大
別して、動画早見インデックス作成部１００による動画
早見インデックス作成処理と、動画早見再生部２００に
よる動画早見再生処理とがある。In the present embodiment, as described above, the post-processing after the video / audio separation processing of step S101 is roughly divided into the video quick-view index creation processing by the video quick-view index creation section 100 and the video quick-view playback section 200. There is a movie quick-view playback process.

【００５５】＜動画早見インデックス作成部１００＞図
２は、動画早見インデックス作成部１００において行わ
れる人の発声期間を表わす音声区間（区間Ａ）検出のた
めのアルゴリズムを表わすブロック図であり、ＡＧＣ
（オートゲインコントロール）21、ローパスフィルタ2
2、零交差検出部23a,23b、音声セグメント化部24、音声
ピッチ検出部25、音声ラベリング部26、音声エネルギ計
算部27、並びに音声区間推定部28から成る。<Moving Image Quick-View Index Creation Unit 100> FIG. 2 is a block diagram of the algorithm used by the moving image quick-view index creation unit 100 to detect voice sections (section A) representing human utterance periods. It comprises an AGC (automatic gain control) 21, a low-pass filter 22, zero-crossing detection units 23a and 23b, a voice segmentation unit 24, a voice pitch detection unit 25, a voice labeling unit 26, a voice energy calculation unit 27, and a voice section estimation unit 28.

【００５６】図３は、図２に示すアルゴリズムに基づく
処理の概略を示すフローチャートであり、このフローチ
ャートを参照して区間Ａ検出の手順を説明すると、まず
ステップＳ301にて音声信号を複数の小セグメントに分
割し、ステップＳ302では、それらの小セグメントの音
響的な特徴を表す音声ラベリングを行なう。その際、ス
テップＳ303では、音声ピッチを検出することによって
ロバストな母音候補の検出を行い、最後に、ステップＳ
304において、音声ピッチ検出結果に基づいて人の音声
区間（区間Ａ）の推定を行う。FIG. 3 is a flowchart outlining the processing based on the algorithm shown in FIG. 2. The procedure for detecting section A is as follows. First, in step S301, the voice signal is divided into a plurality of small segments, and in step S302, voice labeling is performed to express the acoustic features of those small segments. In doing so, in step S303, robust vowel candidates are detected by detecting the voice pitch. Finally, in step S304, the human voice sections (section A) are estimated based on the voice pitch detection result.

【００５７】即ち、映像／音声分離処理（ステップＳ10
1）によって動画データから分離された音声信号は、Ａ
ＧＣ（オートゲインコントロール）21によって音声エネ
ルギが正規化される。ＡＧＣ21の構成に関しては公知の
ものを採用すれば良く、登録済みの音声信号に対して、
その全体を通して信号レベルが最大となる音を基準とし
て、正規化を行う構成を採用することができる。That is, the voice energy of the voice signal separated from the moving image data by the video/audio separation processing (step S101) is normalized by the AGC (automatic gain control) 21. A known configuration may be used for the AGC 21; for example, for a recorded voice signal, normalization may be performed with reference to the sound having the maximum signal level over the entire signal.
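One way to normalize a recorded signal against its maximum level is simple peak normalization, sketched below. The function name `normalize_by_peak` and the `target_peak` parameter are assumptions for illustration; the source only says that a known AGC configuration may be used.

```python
def normalize_by_peak(samples, target_peak=1.0):
    """Scale a recorded signal so that its largest absolute sample equals target_peak.

    This is plain peak normalization, used here as a stand-in for the
    "known" AGC configuration; it is not taken from the patent itself.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        return list(samples)  # an all-zero (or empty) signal is returned unchanged
    gain = target_peak / peak
    return [s * gain for s in samples]


print(normalize_by_peak([0.5, -0.25, 0.1]))
```

Because the whole recording is available in advance, a single gain can be applied to the entire signal rather than adapting it over time.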

【００５８】正規化された音声信号は、ローパスフィル
タ22においてフィルタリングを施すことにより、後段で
行われる解析処理に適した帯域の音声信号成分と、無声
子音認識に必要な帯域を持つ元の音声信号に分岐する。The normalized voice signal is then branched in two: one branch is filtered by the low-pass filter 22 to yield a voice signal component in a band suited to the analysis performed at later stages, and the other is the original voice signal, which retains the band needed for unvoiced consonant recognition.

【００５９】（音声セグメント化）まず、ローパスフィ
ルタ22を通過した音声信号は、零交差点検出部23aにて
零交差点が求められた後、その零交差点を基準として、
音声セグメント化部24において、「小セグメント」と呼
ぶ小部分に暫定的に分割される。この処理は、図３のス
テップＳ301に相当する。(Voice Segmentation) First, the zero-crossing detection unit 23a obtains the zero crossings of the voice signal that has passed through the low-pass filter 22. Using those zero crossings as references, the voice segmentation unit 24 then tentatively divides the signal into small portions called "small segments." This processing corresponds to step S301 in FIG. 3.

【００６０】ここで、ローパスフィルタ22をセグメント
分割に用いる理由は、小セグメントの基準が無声子音、
有声子音、並びに音声ピッチ等の単位であり、高周波の
影響があると無声子音等に悪影響が生じるからである。The low-pass filter 22 is used for segmentation because small segments are based on units such as unvoiced consonants, voiced consonants, and voice pitches, and high-frequency components would adversely affect the segmentation of unvoiced consonants and the like.

【００６１】さて、音声セグメント化部24は、音声信号
に対して暫定的に求められた零交差点を基準として、そ
の音声信号を小セグメントに分割するが、その小セグメ
ントは、以下の２条件ルール１：小セグメントの始点と終点は零交差点である
こと、ルール２：小セグメントのエネルギが小さい場合には、
直前の小セグメントと結合する。The voice segmentation unit 24 divides the voice signal into small segments with reference to the zero crossings tentatively obtained for it. Each small segment must satisfy the following two rules.
・Rule 1: The start point and the end point of a small segment are zero crossings.
・Rule 2: If the energy of a small segment is small, it is combined with the immediately preceding small segment.

【００６２】X1を始点としX２を終点とする小セグメン
トｆ（ｘ）に対して音声エネルギＰを、For a small segment f(x) that starts at X1 and ends at X2, the voice energy P is defined by Equation (1):

【００６３】[0063]

【数１】なる数式（１）を満たすものと定義する。[Equation 1]

$$P = \sum_{x = X_1}^{X_2} \lvert f(x) \rvert \tag{1}$$

【００６４】そして、算出した音声エネルギＰが、所定
のしきい値Ｅｔｈ１以下の場合には、現在対象としてい
る小セグメントｆ（ｘ）を、その直前の小セグメントに
統合する。尚、音声エネルギＰは、数式（１）による小
セグメントｆ（ｘ）の絶対値の累積でなく、ｆ（ｘ）の
２乗エネルギを用いて計算しても良い。If the calculated voice energy P is less than or equal to a predetermined threshold Eth1, the small segment f(x) currently under consideration is merged into the immediately preceding small segment. The voice energy P may also be calculated from the squared values of f(x) instead of the accumulated absolute values of f(x) given by Equation (1).
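Rules 1 and 2 together with the threshold test can be sketched as below. This is a minimal sketch under stated assumptions: a zero crossing is taken to be any sign change between adjacent samples, segment ends are exclusive indices, and the names `zero_crossings`, `split_into_small_segments`, and `eth1` are illustrative rather than taken from the source.

```python
def zero_crossings(samples):
    """Indices i at which the sign changes between samples[i-1] and samples[i].

    Treating 0 as non-negative is an assumption made for this sketch.
    """
    return [i for i in range(1, len(samples))
            if (samples[i - 1] < 0) != (samples[i] < 0)]


def split_into_small_segments(samples, eth1):
    """Split at zero crossings (rule 1) and merge low-energy segments backwards (rule 2).

    Returns a list of (start, end) index pairs with end exclusive.
    """
    bounds = zero_crossings(samples)
    segments = []
    for start, end in zip(bounds, bounds[1:]):
        # Energy as the accumulated absolute value of the samples in the segment
        energy = sum(abs(s) for s in samples[start:end])
        if energy <= eth1 and segments:
            # Rule 2: fold a weak segment into the immediately preceding one
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((start, end))
    return segments


signal = [1, 2, -3, -1, 0.1, -0.1, 4, 5, -2]
print(split_into_small_segments(signal, eth1=0.5))
```

In this example the two tiny excursions around index 4 and 5 fall below the threshold and are absorbed into the preceding segment.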

【００６５】図４は、本実施形態において行われる小セ
グメントの結合処理を説明する図である。FIG. 4 is a diagram for explaining the small segment joining process performed in this embodiment.

【００６６】同図において、図４（ａ）は、零交差点検
出部23aにて複数の零交差点(Zero cross points)が求め
られた音声信号レベルを例示している。また、図４
（ｂ）では、検出された零交差点、上述したルール１お
よびルール２が適用されることによって設定された複数
の小セグメントが、個々の縦線によって示されており、
矢印で指し示された２つの小セグメントは、上述したル
ール２によって、１つの小セグメントに統合されたこと
を示している。FIG. 4(a) illustrates the level of a voice signal for which a plurality of zero crossings (zero cross points) have been obtained by the zero-crossing detection unit 23a. In FIG. 4(b), the small segments set by applying Rule 1 and Rule 2 to the detected zero crossings are delimited by vertical lines, and the two small segments indicated by arrows have been merged into one small segment under Rule 2.

【００６７】（音声ラベリング処理）零交差検出部23ｂ
では、ＡＧＣ21によって音声エネルギが正規化された音
声信号波形が、基準となるゼロレベルと交差する平均零
交差数を求め、更に、音声エネルギ計算部27において平
均エネルギを求めた後、個々の小セグメントに対して、
音声ラベリング部26において、始点、終点、平均零交差
数および平均エネルギを算出し、算出したこれらの値
を、小セグメントの特徴量として記憶する。この処理
は、図３のステップＳ302に相当する。(Voice Labeling Processing) The zero-crossing detection unit 23b obtains the average number of zero crossings, that is, how often the voice signal waveform whose energy has been normalized by the AGC 21 crosses the reference zero level, and the voice energy calculation unit 27 obtains the average energy. Then, for each small segment, the voice labeling unit 26 calculates the start point, end point, average number of zero crossings, and average energy, and stores these values as the feature quantities of the small segment. This processing corresponds to step S302 in FIG. 3.

【００６８】但し、平均零交差数および平均エネルギ
は、セグメント長SegLenを用いて、以下の式により計算
される。However, the average number of zero crossings and the average energy are calculated by the following formulas using the segment length SegLen.

【００６９】
・（平均零交差数）＝（小セグメントに含まれる元の音声信号の零交差点数）／ SegLen，
・（平均エネルギ）＝（小セグメントに含まれるローパスフィルタが施された音声信号のエネルギ）／ SegLen である。
・(Average number of zero crossings) = (number of zero crossings of the original voice signal within the small segment) / SegLen
・(Average energy) = (energy of the low-pass-filtered voice signal within the small segment) / SegLen
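The two feature quantities can be computed per small segment as in the following sketch. It assumes SegLen is measured in samples, that energy is the accumulated absolute value, and that a zero crossing is any sign change between adjacent samples; the function name `segment_features` is hypothetical.

```python
def segment_features(original, lowpassed, start, end):
    """Return (average zero-crossing count, average energy) for samples[start:end].

    original  : the AGC-normalized signal before low-pass filtering
    lowpassed : the same signal after low-pass filtering
    end is exclusive; SegLen is taken as end - start (an assumption).
    """
    seg_len = end - start
    if seg_len <= 0:
        raise ValueError("segment must contain at least one sample")
    # Zero crossings are counted on the original (unfiltered) signal
    crossings = sum(1 for i in range(start + 1, end)
                    if (original[i - 1] < 0) != (original[i] < 0))
    # Energy is measured on the low-pass-filtered signal
    energy = sum(abs(s) for s in lowpassed[start:end])
    return crossings / seg_len, energy / seg_len


print(segment_features([1, -1, 1, -1], [2, 2, 2, 2], 0, 4))
```

Counting crossings on the unfiltered branch preserves the high-frequency activity that distinguishes unvoiced consonants.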

【００７０】更に、小セグメントを５種類のカテゴリに
分類し、そのカテゴリを表すラベルを付与する。本実施
形態において個々の小セグメントに付与可能なラベルの
種類としては、無音、無声子音、有声子音、音声ピッ
チ、雑音がある。Further, the small segments are classified into five types of categories, and labels showing the categories are given. In the present embodiment, the types of labels that can be given to individual small segments include silence, unvoiced consonants, voiced consonants, voice pitch, and noise.

【００７１】次に、現在着目している小セグメントがど
のラベルに相当するかを、図５に示す手順によって決定
する。Next, it is determined by the procedure shown in FIG. 5 which label the small segment of interest corresponds to.

【００７２】図５は、本実施形態において行われる音声
ラベリングの処理を示すフローチャートであり、音声ラ
ベリング部26にて行われる処理の手順を示す。FIG. 5 is a flow chart showing the process of voice labeling performed in this embodiment, and shows the procedure of the process performed by the voice labeling unit 26.

【００７３】同図において、ステップＳ501では、着目
する小セグメント（処理対象とする小セグメント）の特
徴量として、平均零交差数AveZeroCrossRate および平
均エネルギAveEnergyを読み込む。In step S501 of the figure, the average number of zero crossings AveZeroCrossRate and the average energy AveEnergy are read in as feature quantities of the small segment of interest (small segment to be processed).

【００７４】本実施形態では、ラベル判断条件として、
以下のしきい値を設けるが、これらのしきい値は全て定
数である。In this embodiment, the following thresholds are used as label determination conditions; all of them are constants.

【００７５】・無音の最大エネルギを表わすしきい値： SileceEnergyMax，・無声子音の最小のエネルギしきい値： ConHEnergyLow，・無声子音の最大のエネルギしきい値： ConHEnergyMax，・有声子音の最小のエネルギしきい値： ConLEnergyLow，・有声子音の最大のエネルギしきい値： ConLEnergyMax，・無声子音の最小の零交差しきい値： ConHZeroCrossRateLow，・有声子音の最大の零交差しきい値： ConLZeroCrossRateMax，但し、 SileceEnergyMax ＞ ConHEnergyLow を満た
すこととする。
・Threshold representing the maximum energy of silence: SileceEnergyMax
・Minimum energy threshold for unvoiced consonants: ConHEnergyLow
・Maximum energy threshold for unvoiced consonants: ConHEnergyMax
・Minimum energy threshold for voiced consonants: ConLEnergyLow
・Maximum energy threshold for voiced consonants: ConLEnergyMax
・Minimum zero-crossing threshold for unvoiced consonants: ConHZeroCrossRateLow
・Maximum zero-crossing threshold for voiced consonants: ConLZeroCrossRateMax
Here, SileceEnergyMax > ConHEnergyLow must be satisfied.

【００７６】ステップＳ502では、ステップＳ501にて読
み込んだ特徴量が、所定の無音条件を満足するかを判断
する。ここで、無音ラベル条件は、・((AveEnergy ＜ SileceEnergyMax) AND (AveZeroCros
sRate ＜ ConHZeroCrossRateLow))，または・((AveEnergy ＜ ConHEnergyLow) AND (AveZeroCrossR
ate ＞ ConHZeroCrossRateLow))，とする。そして、ステップＳ503では、上記の無音ラベ
ル条件を満たす場合に、当該着目する小セグメントに対
して、無音ラベルを関連付けして記憶する。In step S502, it is determined whether the feature quantities read in step S501 satisfy a predetermined silence condition. The silence label condition is:
・((AveEnergy < SileceEnergyMax) AND (AveZeroCrossRate < ConHZeroCrossRateLow)), or
・((AveEnergy < ConHEnergyLow) AND (AveZeroCrossRate > ConHZeroCrossRateLow)).
If this condition is satisfied, a silence label is stored in association with the small segment of interest in step S503.

【００７７】一方、ステップＳ502において無音ラベル
条件を満たさない場合に、ステップＳ501にて読み込ん
だ特徴量が、所定の無声子音ラベル条件を満足するか
を、ステップＳ504において判断する。ここで、無声子
音ラベル条件は、・(ConHEnergyLow ＜ AveEnergy ＜ ConHEnergyMax)
並びに、・(AveZeroCrossRate ＞ ConHZeroCrossRateLow) とする。そして、ステップＳ505では、上記の無声子音
ラベル条件を満たす場合に、当該着目する小セグメント
に対して、無声子音ラベルを関連付けして記憶する。On the other hand, if the silence label condition is not satisfied in step S502, it is determined in step S504 whether the feature quantities read in step S501 satisfy a predetermined unvoiced consonant label condition. The unvoiced consonant label condition is:
・(ConHEnergyLow < AveEnergy < ConHEnergyMax), and
・(AveZeroCrossRate > ConHZeroCrossRateLow).
If this condition is satisfied, an unvoiced consonant label is stored in association with the small segment of interest in step S505.

【００７８】ステップＳ506では、ステップＳ501にて読
み込んだ特徴量が、上述した無音ラベル条件及び無声子
音ラベル条件を満足しない場合であるので、音声ピッチ
の検出を試み、検出できた場合には音声ピッチラベルを
該当する小セグメント群に付与する（ステップＳ50
7）。尚、ピッチ検出に関しては詳しく後述する。Step S506 is reached when the feature quantities read in step S501 satisfy neither the silence label condition nor the unvoiced consonant label condition. Detection of a voice pitch is therefore attempted, and if one is detected, a voice pitch label is assigned to the corresponding group of small segments (step S507). Pitch detection is described in detail later.

【００７９】ここで、音声ピッチラベルの付与対象を小
セグメント群としたのは、後述するピッチ検出では、小
セグメントの統合が行われる可能性があり、その場合、
着目する小セグメント以降の複数の小セグメントをステ
ップＳ508において１つに統合し、これに対してピッチ
ラベルを与えるからである。このとき、音声ピッチが検
出されるセグメントは、主に声帯振動を伴う母音であ
る。The voice pitch label is assigned to a group of small segments because the pitch detection described later may merge small segments; in that case, the small segment of interest and the small segments following it are merged into one in step S508, and the pitch label is given to the merged segment. Segments in which a voice pitch is detected are mainly vowels, which involve vocal cord vibration.

【００８０】また、ステップＳ506において音声ピッチ
を検出できない場合には、ステップＳ509において有声
子音ラベル条件判定を行う。このとき、有声子音ラベル
条件は、・(ConLEnergyLow ＜ AveEnergy ＜ ConLEnergyMax)
並びに、・(AveZeroCrossRate ＜ ConLZeroCrossRateMax) とする。そして、ステップＳ510では、上記の有声子音
ラベル条件を満たす場合に、当該着目する小セグメント
に対して、有声子音ラベルを関連付けして記憶する。If no voice pitch can be detected in step S506, the voiced consonant label condition is evaluated in step S509. The voiced consonant label condition is:
・(ConLEnergyLow < AveEnergy < ConLEnergyMax), and
・(AveZeroCrossRate < ConLZeroCrossRateMax).
If this condition is satisfied, a voiced consonant label is stored in association with the small segment of interest in step S510.

【００８１】そして、ステップＳ511では、上述した各
条件を満たさない場合であるため、着目する小セグメン
トに対して、雑音ラベルを関連付けして記憶する。Step S511 is reached only when none of the above conditions is satisfied, so a noise label is stored in association with the small segment of interest.
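The decision order of steps S502 to S511 can be sketched as a single function. The threshold names follow the source (including its spelling SileceEnergyMax), while the `Thresholds` container, the function name `label_segment`, the label strings, and the `pitch_detected` flag (standing in for the pitch detection of step S506) are assumptions for illustration. The numeric values in the usage example are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    # Field names follow the patent text; the values are supplied by the caller.
    SileceEnergyMax: float
    ConHEnergyLow: float
    ConHEnergyMax: float
    ConLEnergyLow: float
    ConLEnergyMax: float
    ConHZeroCrossRateLow: float
    ConLZeroCrossRateMax: float


def label_segment(ave_energy, ave_zcr, th, pitch_detected):
    """Classify one small segment in the order S502 -> S504 -> S506 -> S509 -> S511."""
    silent = ((ave_energy < th.SileceEnergyMax and ave_zcr < th.ConHZeroCrossRateLow)
              or (ave_energy < th.ConHEnergyLow and ave_zcr > th.ConHZeroCrossRateLow))
    if silent:
        return "silence"
    if (th.ConHEnergyLow < ave_energy < th.ConHEnergyMax
            and ave_zcr > th.ConHZeroCrossRateLow):
        return "unvoiced_consonant"
    if pitch_detected:  # result of the pitch detection attempted in step S506
        return "pitch"
    if (th.ConLEnergyLow < ave_energy < th.ConLEnergyMax
            and ave_zcr < th.ConLZeroCrossRateMax):
        return "voiced_consonant"
    return "noise"


# Arbitrary example values that satisfy SileceEnergyMax > ConHEnergyLow
th = Thresholds(0.05, 0.02, 0.3, 0.05, 0.5, 0.3, 0.2)
print(label_segment(0.1, 0.5, th, pitch_detected=False))
```

The order matters: the pitch test comes before the voiced consonant test, so a segment with a detectable pitch is never labeled as a voiced consonant.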

【００８２】ここで、音声信号波形のセグメント化から
ラベリングに至るまでの処理過程を、図６に示す例を参
照して説明する。Now, the processing steps from segmentation of the audio signal waveform to labeling will be described with reference to the example shown in FIG.

【００８３】図６は、本実施形態における音声信号波形
のセグメント化からラベリングに至るまでの処理過程を
説明する図である。FIG. 6 is a diagram for explaining the processing steps from segmentation of the audio signal waveform to labeling in this embodiment.

【００８４】より具体的に、図６（ａ）は、ローパスフ
ィルタ後の音声信号波形を表わす。図６（ｂ）は、図６
（ａ）に示す音声信号波形の零交差点を基準に小セグメ
ント化した状態を表わし、同図に示す太い縦線は小セグ
メントの区切りを表わす。More specifically, FIG. 6(a) shows the voice signal waveform after low-pass filtering. FIG. 6(b) shows the waveform of FIG. 6(a) divided into small segments at its zero crossings; the thick vertical lines in the figure mark the small-segment boundaries.

【００８５】そして、図６（ｃ）は、音声ラベリングと
セグメント化とを行った結果を表わし、同図に示す細長
い縦線はセグメントの区切りを表し、太い縦線は統合さ
れた小セグメントの名残を示している。図６（ｃ）で
は、図６（ｂ）に示す如く区切られた一部の複数小セグ
メントが、１つのピッチセグメントに統合されている様
子が判り、それぞれのセグメントには、付与されたラベ
ルが示されている。FIG. 6(c) shows the result of voice labeling and segmentation. The thin vertical lines represent segment boundaries, and the thick vertical lines are the remnants of small segments that have been merged. FIG. 6(c) shows how several of the small segments delimited in FIG. 6(b) have been merged into single pitch segments, and each segment is shown with its assigned label.

【００８６】（音声ピッチ検出）次に、音声ピッチ検出
部25の動作について、図９および図１０を参照して説明
する。この処理は、図３のステップＳ303に相当する。(Voice Pitch Detection) Next, the operation of the voice pitch detector 25 will be described with reference to FIGS. 9 and 10. This process corresponds to step S303 in FIG.

【００８７】図９は、本実施形態における音声ピッチ検
出処理を示すフローチャートであり、音声ピッチ検出部
25が行なう処理手順を示す。FIG. 9 is a flowchart of the voice pitch detection processing in this embodiment, showing the procedure performed by the voice pitch detection unit 25.

【００８８】同図において、ステップＳ901では、ロー
パスフィルタ後の音声信号波形の零交差点情報を入手す
る。そして、零交差点を基準として、波形の類似性を検
証することにより、音声ピッチを求める。In step S901 shown in the figure, the zero crossing point information of the audio signal waveform after the low pass filter is obtained. Then, the voice pitch is obtained by verifying the similarity of the waveforms with the zero crossing point as a reference.

【００８９】図７は、本実施形態における音声ピッチ検
出処理の説明のための音声信号波形を例示する図であ
る。FIG. 7 is a diagram exemplifying a voice signal waveform for explaining the voice pitch detection processing in this embodiment.

【００９０】本実施形態において、基準とする零交差点
は、時間方向に見て正の値を持つ波形の始点であって、
図７の例では、基準とする零交差点は、X1, X2, X3であ
る。In this embodiment, a reference zero crossing is the start point of a portion of the waveform that takes positive values when viewed along the time axis; in the example of FIG. 7, the reference zero crossings are X1, X2, and X3.

【００９１】そして、ステップＳＳ902では、図７に例
示する場合において、零交差点X1を始点とし、零交差点
X2を終点とする部分波形をｆ（ｘ）、零交差点X2を始点
とし、零交差点X3を終点とする部分波形をｇ（ｘ）を、
初期基準として決定する。Then, in step S902, for the case illustrated in FIG. 7, the partial waveform f(x), which starts at zero crossing X1 and ends at zero crossing X2, and the partial waveform g(x), which starts at zero crossing X2 and ends at zero crossing X3, are determined as the initial references.

【００９２】そして、ステップＳ903では、未処理の音
声区間（音声セグメント）が存在するかを判断し、存在
する場合にはステップＳ904に進み、存在しない場合に
は処理を終了する。Then, in step S903, it is determined whether or not there is an unprocessed voice section (voice segment). If it exists, the process proceeds to step S904, and if it does not exist, the process ends.

【００９３】ステップＳ904では、音声ピッチの有無お
よびそのセグメント範囲を報告するピッチ抽出処理を行
なう。ここで、報告するタイミングは、音声ピッチセグ
メントが途切れたタイミング、或いは部分波形ｆ（ｘ）
に対するピッチが見つからなかった場合である。尚、ス
テップＳ904におけるピッチ抽出処理については、図１
０を参照して詳しく後述する。In step S904, pitch extraction processing is performed, which reports whether a voice pitch is present and, if so, its segment range. A report is made when a voice pitch segment is interrupted or when no pitch is found for the partial waveform f(x). The pitch extraction processing of step S904 is described in detail later with reference to FIG. 10.

【００９４】そして、ステップＳ905では、音声ピッチ
が存在するかを判断し、存在すると判断した場合には、
ステップＳ906において音声ピッチセグメント情報を、
着目する音声区間（音声セグメント）に関連付けして記
憶する。一方、音声ピッチが存在しない場合にはステッ
プＳ903に戻る。Then, in step S905, it is determined whether a voice pitch is present. If so, the voice pitch segment information is stored in step S906 in association with the voice section (voice segment) of interest. If no voice pitch is present, the process returns to step S903.

【００９５】ここで、ステップＳ904にて行われるピッ
チ抽出処理について、図１０を参照して詳しく説明す
る。Here, the pitch extraction processing performed in step S904 will be described in detail with reference to FIG.

【００９６】図１０は、本実施形態における音声ピッチ
検出処理を示すフローチャートのうち、ステップＳ904
（図９）の処理の詳細を示すフローチャートである。FIG. 10 is a flowchart showing the details of step S904 (FIG. 9) within the voice pitch detection processing of this embodiment.

【００９７】同図において、ステップＳ1001では、設定
されたｆ（ｘ）に対するｇ（ｘ）を設定する。そして、
ステップＳ1002では、設定されたf（ｘ）の長さをチェ
ックし、ピッチとして存在し得ない位に長い場合には、
当該ｆ（ｘ）に対応する音声ピッチは無いと判断し、ス
テップＳ1003では、当該f(x)の終点を始点として有し、
時間方向に見て負の値を持つ波形の終点となる零交差点
のうち、当該始点に最も近傍のものを終点とする新たな
部分音声セグメントf(x)を設定し、今まで着目していた
f(x)のセグメントはピッチセグメントでないとレポート
する。In the figure, in step S1001, g(x) is set for the f(x) that has been set. In step S1002, the length of f(x) is checked. If it is too long to exist as a pitch, it is determined that there is no voice pitch corresponding to that f(x). In that case, in step S1003, a new partial voice segment f(x) is set that starts at the end point of the current f(x) and ends at the zero crossing nearest to that start point among those that end a negative-valued portion of the waveform when viewed along the time axis, and the segment f(x) considered so far is reported as not being a pitch segment.

【００９８】更に、ステップＳ1004では、着目するf
（ｘ）の長さをチェックし、ピッチとして存在し得ない
位に短い場合には、ステップＳ1005において、着目する
f(x)の終点を始点として有し、且つ時間方向に見て負の
値を持つ波形の終点となる零交差点のうち、その始点
（f(x)の終点）に最も近傍のものを終点として有する部
分音声セグメントを、当該着目するf(x)の末尾に統合す
ることによって新たなf(x)として、ステップＳ1001に戻
る。Furthermore, in step S1004, the length of the f(x) of interest is checked. If it is too short to exist as a pitch, then in step S1005 a partial voice segment is appended to the end of f(x) to form a new f(x), and the process returns to step S1001. The appended segment starts at the end point of f(x) and ends at the zero crossing nearest to that start point among those that end a negative-valued portion of the waveform when viewed along the time axis.

【００９９】一方、ステップＳ1006では、ステップＳ10
02およびステップＳ1004におけるチェックを通過したと
ころの、着目するf（ｘ）に対して、g（ｘ）との非類似
度演算を行う。本ステップにおいて行われる非類似度演
算は、以下の非類似度評価関数を用いて算出する。On the other hand, in step S1006, the dissimilarity between g(x) and the f(x) of interest, which has passed the checks in steps S1002 and S1004, is calculated using the following dissimilarity evaluation function.

【０１００】即ち、部分音声セグメントｆ（ｘ）の、時
間 Xf におけるｆ（ｘ）とｇ（ｘ）との差の絶対値をΔ
（ Xf ）とすると、X1 ≦ Xf ≦ X2 且つ Xg ＝ X2 +
（ Xf−X1 ）として、 Δ（ Xf ）＝｜ｆ（ Xf ）−ｇ（ Xg ）｜と表される。この場合においても、ｆ（ｘ）とｇ（ｘ）
の差の絶対値ではなく差の二乗に基づいて、 Δ（ Xf ）＝ [ｆ（ Xf ）−ｇ（ Xg ）] × [ｆ（ Xf
）−ｇ（ Xg ）] としても良い。That is, let Δ(Xf) be the absolute value of the difference between f(x) and g(x) at time Xf within the partial voice segment f(x). With X1 ≤ Xf ≤ X2 and Xg = X2 + (Xf − X1), it is expressed as
Δ(Xf) = |f(Xf) − g(Xg)|.
Here too, the square of the difference may be used instead of its absolute value:
Δ(Xf) = [f(Xf) − g(Xg)] × [f(Xf) − g(Xg)].

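【０１０１】そして更に、Furthermore, the dissimilarity DiffSum can be expressed as Equation (2):

【０１０２】[0102]

【数２】と表すことができる。[Equation 2]

$$\mathrm{DiffSum} = \sum_{X_f = X_1}^{X_2} \Delta(X_f) \tag{2}$$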
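The dissimilarity between f(x) and the immediately following waveform g(x) can be sketched as below. The sketch assumes that DiffSum is the plain accumulation of Δ(Xf) over the span of f(x), that both waveforms live in one sample array, and that the span is half-open (x2 exclusive), which differs slightly from the inclusive range X1 ≤ Xf ≤ X2 in the text. The function name `diff_sum` and the `squared` flag are illustrative.

```python
def diff_sum(signal, x1, x2, squared=False):
    """Accumulate the per-sample difference between f and the waveform g that follows it.

    f covers signal[x1:x2]; g is the equally long span starting at x2, so each
    sample Xf in f is compared with Xg = x2 + (Xf - x1).
    """
    length = x2 - x1
    if length <= 0 or x2 + length > len(signal):
        raise ValueError("g(x) would run past the end of the signal")
    total = 0.0
    for xf in range(x1, x2):
        xg = x2 + (xf - x1)
        d = signal[xf] - signal[xg]
        total += d * d if squared else abs(d)
    return total


periodic = [0, 1, 0, -1, 0, 1, 0, -1]
print(diff_sum(periodic, 0, 4))  # identical periods give zero dissimilarity
```

A small value means the two consecutive spans have nearly the same shape, which is what the threshold comparison in step S1007 tests for.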

【０１０３】そしてステップＳ1007では、上記の如く算
出した非類似度がしきい値ＥＴｈ以上であるかを判断
し、DiffSum≧ＥＴｈの場合にはステップＳ1005に戻
り、DiffSum＜ＥＴｈの場合には、より精密な音声ピッ
チ検出を行うべく、ステップＳ1008において、最もエネ
ルギの大きな小区間がピッチセグメントの最後になるよ
うに、f(x)および g(x)の位置を補正する。Then, in step S1007, it is determined whether the dissimilarity calculated as described above is equal to or greater than the threshold value ETh. If DiffSum ≧ ETh, the process returns to step S1005. If DiffSum <ETh, the process proceeds to step S1005. In order to perform a precise voice pitch detection, in step S1008, the positions of f (x) and g (x) are corrected so that the small section having the highest energy is at the end of the pitch segment.

【０１０４】図８は、本実施形態における音声ピッチ検
出処理で行われるピッチ検出基準の更新手順を説明する
図である。最もエネルギの大きな小区間でピッチの基準
を補正することは、その小区間が、声帯振動の直後のタ
イミングで生成される波形であることからも合理的であ
る。FIG. 8 is a view for explaining the pitch detection reference update procedure performed in the voice pitch detection processing in this embodiment. Correcting the pitch reference in the small section having the largest energy is also rational because the small section is a waveform generated immediately after the vocal cord vibration.

【０１０５】次にステップＳ1009では、ピッチ検出カウ
ンタを0にリセットし、ステップＳ1010では、上述した
ステップＳ1006と同様に非類似度演算を行い、ステップ
Ｓ1011では、算出した非類似度としきい値ＥＴｈとの比
較処理を、上述したステップＳ1007と同様に行なう。Next, in step S1009, the pitch detection counter is reset to 0, in step S1010, dissimilarity calculation is performed in the same manner as in step S1006 described above, and in step S1011, the calculated dissimilarity and threshold ETh are set. The comparison process is performed in the same manner as step S1007 described above.

【０１０６】そして、ステップＳ1011における比較の結
果、算出された非類似度がしきい値ＥＴｈ以上の場合に
はステップＳ1013に進み、非類似度がしきい値ＥＴｈよ
り小さい場合にはステップＳ1014に進む。If the comparison in step S1011 shows that the calculated dissimilarity is greater than or equal to the threshold ETh, the process proceeds to step S1013; if the dissimilarity is smaller than the threshold ETh, the process proceeds to step S1014.

【０１０７】ステップＳ1013では、音声ピッチを2回以
上検出しているかを判断し、2回未満の場合には上述し
たステップＳ1005において音声セグメントの統合を行な
い、2回以上検出してる場合には、音声ピッチセグメン
トを検出したと判断できるので、ステップＳ1015におい
て、g(x)の終点を始点に持ち、時間方向に見て負の値を
持つ波形の終点となる始点に最も近傍の零交差点を終点
とする新たなセグメントf(x)を設定し、ピッチセグメン
トを検出した旨を表わすピッチセグメント範囲を報告す
る。In step S1013, it is determined whether the voice pitch has been detected two or more times. If it has been detected fewer than two times, the voice segments are merged in step S1005 described above. If it has been detected two or more times, a voice pitch segment can be judged to have been detected, so in step S1015 a new segment f(x) is set that starts at the end point of g(x) and ends at the zero crossing nearest to that start point among those that end a negative-valued portion of the waveform when viewed along the time axis, and the pitch segment range is reported to indicate that a pitch segment has been detected.

【０１０８】ステップＳ1014では、ピッチ検出回数をイ
ンクリメントし、現在のg(x)の終点を始点として有し、
時間方向に見て負の値を持つ波形の終点のうち、当該始
点に最も近傍の零交差点を終点として有する新たな部分
音声セグメントf(x)を設定すると共に、この部分音声セ
グメントf(x)に最も近傍の、時間方向に見て負の値を持
つ波形の終点となる零交差点を終点とする新たなｇ
（ｘ）を設定し、ステップＳ1010に戻る。In step S1014, the pitch detection count is incremented. A new partial voice segment f(x) is set that starts at the end point of the current g(x) and ends at the zero crossing nearest to that start point among those that end a negative-valued portion of the waveform when viewed along the time axis. A new g(x) is also set whose end point is the zero crossing nearest to this f(x) among those that end a negative-valued portion of the waveform. The process then returns to step S1010.

【０１０９】上述した音声ピッチ検出処理（図９及び図
１０）によって取得した音声ピッチセグメントは、後段
の音声区間判定部28にて利用するために、不図示のメモ
リに記憶される。The voice pitch segment obtained by the voice pitch detection process (FIGS. 9 and 10) described above is stored in a memory (not shown) for use by the voice section determination unit 28 in the subsequent stage.

【０１１０】（音声区間判定）次に、音声区間判定部28
では、上記の音声ピッチ検出処理によって取得した音声
ピッチセグメントを用いて、人の音声区間（区間Ａ）の
判定が行われる。この処理は、図３のステップＳ304に
相当する。(Voice Section Determination) Next, the voice section determination unit 28 determines the human voice sections (section A) using the voice pitch segments obtained by the voice pitch detection processing described above. This processing corresponds to step S304 in FIG. 3.

【０１１１】一般に、純粋な人の声であれば、その音声
区間の大半を母音が占め、従ってピッチの存在するセグ
メントが長く安定して現れる。他方、ＢＧＭのある場合
には、その音律による影響を受けるるものの、人の音声
エネルギがＢＧＭのエネルギよりもある程度大きい場合
には、さほど影響を受けないことが実験的に判ってい
る。また、ある部分区間内において音声エネルギがＢＧ
Ｍのエネルギよりも十分大きくない場合には、その部分
区間において正確なピッチは現れない。In general, in a pure human voice, vowels occupy most of the voice section, so segments in which a pitch is present appear stably and for long durations. When BGM is present, the pitch is affected by its musical scale, but it has been found experimentally that the effect is small as long as the energy of the human voice is somewhat larger than that of the BGM. Conversely, in a partial section where the voice energy is not sufficiently larger than the BGM energy, an accurate pitch does not appear.

【０１１２】また、多くの場合、母音の直前には子音が
伴われるが、声帯の振動を伴わない子音の場合にもピッ
チは現れず、しかもその時間は持続時間が10ｍｓ以下と
いう短い破裂音であり、最も長い摩擦音でも数10ｍｓの
オーダーである。また、破裂音等の発生直前に無音が生
じるものもある。In many cases a vowel is immediately preceded by a consonant. No pitch appears for a consonant that does not involve vocal cord vibration either, but such consonants are short: a plosive lasts 10 ms or less, and even the longest fricative lasts on the order of several tens of ms. Some sounds, such as plosives, are also immediately preceded by silence.

【０１１３】従って、装置外部の要因だけでなく、人の
音声自身の要因によって音声ピッチが求まるセグメント
が離散的なものになるが、そのような場合であっても、
前後或いは全体のピッチ周期を考慮することにより、部
分区間の音声ピッチ周期の演算結果を統合して、更に音
声の特徴を活用して人の音声区間（区間Ａ）を判断する
必要がある。Therefore, the segments for which a voice pitch is obtained become discontinuous, not only because of factors external to the apparatus but also because of the nature of the human voice itself. Even in such cases, the human voice section (section A) must be determined by integrating the pitch period results computed for the partial sections, taking into account the pitch periods of the neighboring segments or of the whole, and by further exploiting the characteristics of speech.

【０１１４】図１１は、本実施形態における音声区間判
定処理を示すフローチャートであり、音声区間判定部28
が行なう処理手順を示す。FIG. 11 is a flowchart showing the voice section determination process in this embodiment, that is, the processing procedure performed by the voice section determination unit 28.

【０１１５】同図において、まず、ステップＳ1101で
は、連続する無音、無声子音ラベル、有声子音ラベル、
または雑音ラベルを持つセグメント群を、１つのセグメ
ントに結合する。In the figure, first, in step S1101, each run of consecutive segments carrying silence, unvoiced consonant, voiced consonant, or noise labels is combined into one segment.

【０１１６】更にステップＳ1102では、連続するピッチ
ラベルセグメントを求め、これを結合することにより、
それら複数セグメントの平均ピッチ周期を求める。この
統合したピッチセグメントを「統合ピッチセグメント」
と呼ぶこととする。Further, in step S1102, consecutive pitch-labeled segments are found and combined, and the average pitch period of the combined segments is calculated. The resulting merged pitch segment is called an "integrated pitch segment".

【０１１７】ステップＳ1103では、統合ピッチセグメン
トに挟まれたとろこの、雑音ラベルが関連付けされてい
るセグメントを求め、ステップＳ1104では、そのセグメ
ントの両端の統合ピッチセグメントの平均ピッチ周期変
動率があるしきい値Ｔｈ１以下であるかを判断し、この
条件を満たす場合には、ステップＳ1105においてこれら
のセグメントを１つの統合ピッチセグメントに統合す
る。この処理により、ピッチセグメント、即ち母音の一
部にエネルギの大きなＢＧＭが重なったとしても補正可
能である。In step S1103, a noise-labeled segment sandwiched between integrated pitch segments is found. In step S1104, it is determined whether the average pitch period variation rate of the integrated pitch segments on both sides of that segment is less than or equal to a threshold Th1. If this condition is satisfied, these segments are merged into one integrated pitch segment in step S1105. This process makes correction possible even when high-energy BGM overlaps a pitch segment, that is, part of a vowel.
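As an illustration of steps S1103 to S1105, the following is a minimal Python sketch, not the implementation of this embodiment. The `Segment` type and the `merge_across_noise` name are hypothetical. The definition of the variation rate (relative difference of the two average pitch periods) and the duration-weighted average used after merging are assumptions, since the text does not define them.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start_ms: float
    end_ms: float
    label: str                  # "pitch", "noise", "silence", "unvoiced", "voiced"
    pitch_period: float = 0.0   # average pitch period; meaningful only for "pitch"

def variation_rate(p1, p2):
    # Assumed definition: relative difference of two average pitch periods.
    return abs(p1 - p2) / max(p1, p2)

def merge_across_noise(segments, th1):
    """Merge pitch / noise / pitch triples when the pitch periods on both
    sides of the noise segment differ by no more than th1 (S1103-S1105)."""
    out = []
    i = 0
    while i < len(segments):
        seg = segments[i]
        bridgeable = (
            seg.label == "noise"
            and out and out[-1].label == "pitch"
            and i + 1 < len(segments) and segments[i + 1].label == "pitch"
            and variation_rate(out[-1].pitch_period,
                               segments[i + 1].pitch_period) <= th1
        )
        if bridgeable:
            left, right = out.pop(), segments[i + 1]
            d_left = left.end_ms - left.start_ms
            d_right = right.end_ms - right.start_ms
            # Assumed: duration-weighted average pitch period of the merged span.
            avg = (left.pitch_period * d_left
                   + right.pitch_period * d_right) / (d_left + d_right)
            out.append(Segment(left.start_ms, right.end_ms, "pitch", avg))
            i += 2  # the noise segment and the right pitch segment are consumed
        else:
            out.append(seg)
            i += 1
    return out
```

Because the merged segment is pushed back onto the output list, a chain of several pitch segments separated by short noise segments collapses into one integrated pitch segment in a single pass.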

【０１１８】ところで、殆どの場合、単独の子音は存在
しないので、通常、後方或いは前方に子音を伴うことが
多い。これはＣＶＣ（Consonant Vowel Consonant ）モ
デルと呼ばれている。Consonants almost never occur in isolation; a vowel is usually accompanied by a consonant before or after it. This is called the CVC (Consonant Vowel Consonant) model.

【０１１９】そこでステップＳ1106では、このＣＶＣモ
デルに基づいて、無声子音セグメント、有声子音セグメ
ント、並びにピッチセグメントを統合し、音声区間を求
める。ここで、ステップＳ1106の処理の詳細を、図１２
を参照して説明する。Therefore, in step S1106, unvoiced consonant segments, voiced consonant segments, and pitch segments are integrated on the basis of this CVC model to obtain voice sections. The details of step S1106 are described with reference to FIG. 12.

【０１２０】図１２は、本実施形態における音声区間判
定処理を示すフローチャートのうち、ステップＳ1106
（図１１）の処理の詳細を示すフローチャートである。FIG. 12 is a flowchart showing the details of step S1106 (FIG. 11) of the voice section determination process in this embodiment.

【０１２１】同図において、ステップＳ1201では、最も
先頭の統合ピッチセグメントを、基準となる統合ピッチ
セグメントとする。次に、ステップＳ1202にでは、基準
となる統合ピッチセグメントの次の統合ピッチセグメン
トを求める。In the figure, in step S1201, the leading integrated pitch segment is set as a reference integrated pitch segment. Next, in step S1202, an integrated pitch segment next to the reference integrated pitch segment is obtained.

【０１２２】更に、ステップＳ1203では、２つの統合ピ
ッチセグメントの間に、有声子音セグメントあるいは無
声子音セグメントが存在するかを判断し、存在しなけれ
ばステップＳ1206において基準となる統合ピッチセグメ
ントの次の統合ピッチセグメントが存在するかを判断
し、存在しない場合は処理を終了し、存在する場合に
は、基準となる統合ピッチセグメントを、ステップＳ12
07において更新する。Further, in step S1203, it is determined whether a voiced consonant segment or an unvoiced consonant segment exists between the two integrated pitch segments. If none exists, it is determined in step S1206 whether there is an integrated pitch segment following the reference integrated pitch segment. If there is none, the process ends; if there is one, the reference integrated pitch segment is updated in step S1207.

【０１２３】一方、ステップＳ1203において２つの統合
ピッチセグメントの間に有声子音セグメントあるいは無
声子音セグメントが存在すると判断した場合には、２つ
の統合ピッチセグメントの間の間隔Distがしきい値Pima
x1以下であるかを、ステップＳ1204において判断する。
そして、間隔Distがしきい値Pimax1以下である場合に
は、ステップＳ1205において当該２つの統合ピッチセグ
メントの端点を終点と始点とする人の音声区間として記
憶する。On the other hand, if it is determined in step S1203 that a voiced consonant segment or an unvoiced consonant segment exists between the two integrated pitch segments, it is determined in step S1204 whether the interval Dist between the two integrated pitch segments is less than or equal to a threshold Pimax1. If the interval Dist is less than or equal to the threshold Pimax1, the span whose start point and end point are the outer end points of the two integrated pitch segments is stored as a human voice section in step S1205.

【０１２４】ここで、しきい値Pimax1には、通常の最も
長い持続時間を持つ子音、例えば無声摩擦音/Ｓ/等の持
続時間よりも十分長いものを用いると良く、その際、２
つの統合ピッチセグメントの間に子音セグメントだけで
なく、無音セグメントが存在しても良い。その理由は、
無声子音のうち破裂音や破擦音では、発声の前に短い無
音が生じることがあるからである。Here, the threshold Pimax1 should be sufficiently longer than the duration of the consonants that normally last longest, for example the unvoiced fricative /S/. In this case, silent segments as well as consonant segments may exist between the two integrated pitch segments. This is because, among unvoiced consonants, plosives and affricates may be preceded by a short silence.

【０１２５】ステップＳ1205における音声区間記憶の
後、ステップＳ1206では、基準となる統合ピッチセグメ
ントの次の統合ピッチセグメントが存在するかを判断
し、存在しない場合には処理を終了し、存在する場合に
は、ステップＳ1207において基準となる統合ピッチセグ
メントを更新し、ステップＳ1206の終了条件を満足する
まで上述した各ステップの処理を繰り返し行う。但し、
統合ピッチセグメント情報およびその平均ピッチ情報
は、次の処理のために破棄せずに保存しておく。After the voice section is stored in step S1205, it is determined in step S1206 whether there is an integrated pitch segment following the reference integrated pitch segment. If there is none, the process ends. If there is one, the reference integrated pitch segment is updated in step S1207, and the steps described above are repeated until the end condition of step S1206 is satisfied. The integrated pitch segment information and its average pitch information are kept, not discarded, for use in the next process.

【０１２６】一方、ステップＳ1204において２つの統合
ピッチセグメントの平均ピッチ周期を比較した結果、周
期変動率があるしきい値Pimax1より大きい場合には、上
述したステップＳ1206以降の処理を行なう。On the other hand, as a result of comparing the average pitch periods of the two integrated pitch segments in step S1204, when the period variation rate is larger than a certain threshold value Pimax1, the above-mentioned steps S1206 and subsequent steps are performed.
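The loop of FIG. 12 (steps S1201 to S1207) can be sketched in Python as follows. This is a hedged illustration under assumptions: segments are represented as hypothetical `(start_ms, end_ms, label)` tuples in time order, the label strings are invented for the example, and the interval Dist is taken to be the gap between the end of one integrated pitch segment and the start of the next.

```python
def cvc_voice_sections(segments, pimax1_ms):
    """Return voice sections found with the CVC model.

    segments: time-ordered (start_ms, end_ms, label) tuples, where label is
    one of "pitch", "voiced", "unvoiced", "silence", "noise" (assumed names).
    """
    pitch_idx = [i for i, seg in enumerate(segments) if seg[2] == "pitch"]
    sections = []
    # Walk over each reference pitch segment and its successor (S1201, S1202, S1207).
    for a, b in zip(pitch_idx, pitch_idx[1:]):
        between = segments[a + 1:b]
        # S1203: is there a voiced or unvoiced consonant segment in between?
        has_consonant = any(seg[2] in ("voiced", "unvoiced") for seg in between)
        # S1204: gap between the two integrated pitch segments.
        dist = segments[b][0] - segments[a][1]
        if has_consonant and dist <= pimax1_ms:
            # S1205: store the span covering both pitch segments.
            sections.append((segments[a][0], segments[b][1]))
    return sections
```

Silent segments between the two pitch segments do not prevent a match as long as a consonant segment is also present and the gap stays within `pimax1_ms`. Consecutive matches produce overlapping spans, which a later merging pass can join.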

【０１２７】ここで再び図１１のフローチャートの説明
に戻る。ステップＳ1107では、ＣＶＣ構造を取らない、
例えば「あお」のようなＶＶ（Vowel-Vowel）構造の場
合を考慮すべく、ＶＶモデルに基づいて、隣接あるいは
間に無音セグメントまたは雑音セグメントを持つ２つの
ピッチセグメントを統合することによって音声区間を求
める。Returning to the flowchart of FIG. 11, step S1107 handles VV (Vowel-Vowel) structures that do not follow the CVC structure, such as "あお" ("ao"). Based on the VV model, a voice section is obtained by integrating two pitch segments that are adjacent or that have a silent segment or a noise segment between them.

【０１２８】ここで、ステップＳ1107にて行われる音声
区間の検出処理について、図１３を参照して詳細に説明
する。The voice section detection process performed in step S1107 will now be described in detail with reference to FIG. 13.

【０１２９】図１３は、本実施形態における音声区間判
定処理を示すフローチャートのうち、ステップＳ1107
（図１１）の処理の詳細を示すフローチャートである。FIG. 13 is a flowchart showing the details of step S1107 (FIG. 11) of the voice section determination process in this embodiment.

【０１３０】同図において、ステップＳ1301では、最も
先頭の統合ピッチセグメントを、基準となる統合ピッチ
セグメントとする。次に、ステップＳ1302では、基準と
なる統合ピッチセグメントの次の統合ピッチセグメント
を求める。In the figure, in step S1301, the first integrated pitch segment is set as a reference integrated pitch segment. Next, in step S1302, an integrated pitch segment next to the reference integrated pitch segment is obtained.

【０１３１】更に、ステップＳ1303では、２つの統合ピ
ッチセグメントの間隔Distがあるしきい値Pimax2以下で
あるかを判断し、間隔Distがしきい値Pimax2より大きい
場合にはステップＳ1306に進み、感覚Distがしきい値Pi
max2以下の場合にステップＳ1304に進む。Further, in step S1303, it is determined whether the interval Dist between the two integrated pitch segments is less than or equal to a threshold Pimax2. If the interval Dist is greater than the threshold Pimax2, the process proceeds to step S1306; if the interval Dist is less than or equal to the threshold Pimax2, the process proceeds to step S1304.

【０１３２】ステップＳ1304では、２つの統合ピッチセ
グメントの平均ピッチ周期変動率があるしきい値Ｔｈ２
以下である場合には、ステップＳ1305において、２つの
統合ピッチセグメントと挟まれるセグメントを音声区間
として記憶する。その際、外乱に対する耐性を上げるた
めに、２つの統合ピッチセグメントの間に無音セグメン
トや雑音セグメントが存在しても良い。In step S1304, if the average pitch period variation rate of the two integrated pitch segments is less than or equal to a threshold Th2, then in step S1305 the two integrated pitch segments and the segment sandwiched between them are stored as a voice section. To increase robustness against disturbance, a silent segment or a noise segment may exist between the two integrated pitch segments.

【０１３３】そして、ステップＳ1305における音声区間
の記憶の後、ステップＳ1306では、基準となる統合ピッ
チセグメントの次の統合ピッチセグメントが存在するか
を判断し、存在する場合は処理を終了し、存在する場合
は、ステップＳ1307において基準となる統合ピッチセグ
メントを更新し、ステップＳ1306の終了条件を満足する
まで繰り返し処理を行う。After the voice section is stored in step S1305, it is determined in step S1306 whether there is an integrated pitch segment following the reference integrated pitch segment. If there is none, the process ends. If there is one, the reference integrated pitch segment is updated in step S1307, and the process is repeated until the end condition of step S1306 is satisfied.

【０１３４】一方、ステップＳ1304において２つの統合
ピッチセグメントの平均ピッチ周期を比較した結果、周
期変動率がしきい値Ｔｈ２より大きい場合には、上述し
たステップＳ1306に進んで同様な処理を行なう。On the other hand, as a result of comparing the average pitch periods of the two integrated pitch segments in step S1304, when the period variation rate is larger than the threshold value Th2, the process proceeds to step S1306 described above and the same process is performed.
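The VV-model loop of FIG. 13 (steps S1301 to S1307) admits a similar sketch. It is illustrative only: integrated pitch segments are represented as hypothetical `(start_ms, end_ms, avg_pitch_period)` tuples, and the variation rate is again assumed to be the relative difference of the two average pitch periods, which the text does not define.

```python
def vv_voice_sections(pitch_segments, pimax2_ms, th2):
    """Return voice sections found with the VV model.

    pitch_segments: time-ordered (start_ms, end_ms, avg_pitch_period) tuples
    for the integrated pitch segments only.
    """
    sections = []
    for (s1, e1, p1), (s2, e2, p2) in zip(pitch_segments, pitch_segments[1:]):
        dist = s2 - e1                      # S1303: gap between the segments
        if dist > pimax2_ms:
            continue                        # too far apart, go to S1306
        rate = abs(p1 - p2) / max(p1, p2)   # assumed variation-rate definition
        if rate <= th2:                     # S1304
            sections.append((s1, e2))       # S1305: both segments plus the gap
    return sections
```

Whatever lies in the gap (silence or noise) is absorbed into the stored section, which matches the statement that such segments may exist between the two integrated pitch segments.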

【０１３５】このようにしてピッチを検出したセグメン
トを基準として、音声中に含まれるＢＧＭ等によって雑
音ラベルが生じても、本実施形態では、上述した音声区
間判定処理において、雑音ラベルが付与された場合であ
っても、その前後の統合ピッチセグメントの平均ピッチ
周期の連続性を考慮することによって統合ピッチセグメ
ントの統合を行い、更に、ＣＶＣモデルを導入すること
によって無声子音セグメントや有声子音セグメントが間
に存在する統合ピッチセグメントをまとめて音声区間と
し、更にＶＶモデルを考慮して２つの統合ピッチセグメ
ントをまとめて音声区間を決定することにより、音声の
特徴を利用した外乱に強い音声区間抽出が可能となる。As described above, with the segments in which a pitch was detected as the basis, this embodiment handles noise labels caused by BGM or the like contained in the audio as follows. In the voice section determination process, even when a noise label has been assigned, integrated pitch segments are merged by considering the continuity of the average pitch periods of the integrated pitch segments before and after it. Next, by introducing the CVC model, integrated pitch segments with an unvoiced or voiced consonant segment between them are grouped into a voice section. Finally, taking the VV model into account, two integrated pitch segments are grouped to determine a voice section. This makes it possible to extract voice sections in a way that exploits the characteristics of speech and is robust against disturbance.

【０１３６】（人の音声区間の補正）上述したように、
人の音声区間（区間Ａ）を検出した後に行われる音声区
間補正処理では、この処理結果に基づく再生音声を聴い
た際に人が不快感を抱かないように、時間軸上で近傍に
位置する複数の音声区間を１つの音声区間として統合す
ることによる補正が行われる。その理由は、例えば、時
間軸上で近傍に位置する２つの区間Ａの間隔が狭い場合
に、区間Ａを聞いて人が内容を把握できる速度で音声を
伴う倍速再生を行なう一方で、区間Ｂに対しては、再生
映像を見て人が内容を把握できる範囲で高倍率な倍速で
再生を行うと、再生態様の変化が激しく、ユーザにとっ
て聞き苦しいものとなるからである。(Correction of human voice section) After the human voice sections (section A) are detected as described above, the voice section correction process merges voice sections that lie close together on the time axis into one voice section, so that a person listening to the reproduced audio based on the result does not feel discomfort. The reason is as follows. Suppose the interval between two sections A that are close together on the time axis is narrow. If section A is reproduced at a multiple speed with audio at which a person can follow the content by listening, while section B is reproduced at a high multiple speed within the range in which a person can still follow the content by watching the reproduced video, the reproduction mode changes abruptly and frequently, which is unpleasant for the user to listen to.

【０１３７】また、動画デコーダおよび再生処理の面か
らも、短い区間での速度の変化は、処理のオーバーヘッ
ドが大きく、再生動作が一時的に停止状態になり、ギク
シャクした再生になることが、一例として、マイクロソ
フト社のDirectShowを用いた本願出願人による実験にお
いて観察されている他、他の多くの動画再生手段で同様
の現象が見られる。Also, from the standpoint of the video decoder and reproduction processing, changing the speed over short sections incurs a large processing overhead: reproduction stalls momentarily and becomes jerky. This was observed, as one example, in experiments by the present applicant using Microsoft DirectShow, and the same phenomenon is seen with many other moving picture reproduction means.

【０１３８】そこで、本実施形態では、時間軸上で最も
近傍に位置する２つの音声区間（区間Ａ）の間隔がある
しきい値（図１４ではＴｈ３）以下である場合には、こ
れらの音声区間を統合することによる補正を行う。この
しきい値を決めるに当たっては、例えば、会話を行うシ
ーンを想定し、会話が成り立つ程度の間を実験的に求
め、それをしきい値に用いる。この場合の処理の手順
を、図１４を参照して説明する。Therefore, in this embodiment, when the interval between two voice sections (section A) that are nearest neighbors on the time axis is less than or equal to a threshold (Th3 in FIG. 14), the voice sections are corrected by merging them. To decide this threshold, for example, a conversation scene is assumed, the length of pause within which a conversation still holds together is determined experimentally, and that value is used as the threshold. The processing procedure in this case will be described with reference to FIG. 14.

【０１３９】図１４は、本実施形態において間隔の短い
音声区間に対して行われる統合補正処理を示すフローチ
ャートである。この処理は、音声区間判定部28にて行わ
れる処理であって、上述した音声区間補正処理（ステッ
プＳ103）の詳細を表わす。FIG. 14 is a flow chart showing an integrated correction process performed for a voice section having a short interval in this embodiment. This process is a process performed by the voice section determination unit 28, and represents the details of the voice section correction process (step S103) described above.

【０１４０】同図において、ステップＳ1401では、先に
検出された複数の区間Ａのうち、時間軸上で最初に位置
する区間Ａを、着目する音声区間として読み込むが、着
目すべき音声区間が無ければ本処理は終了する。In the figure, in step S1401, among the sections A detected earlier, the section A located first on the time axis is read as the voice section of interest. If there is no voice section of interest, the process ends.

【０１４１】ステップＳ1402では、次に着目する音声区
間（区間Ａ）が存在するかを判断し、着目すべき音声区
間が無ければ本処理を終了し、一方、まだ存在する場合
には、以下に説明するステップＳ1403乃至ステップＳ14
07の処理を繰り返す。In step S1402, it is determined whether there is a next voice section (section A) to examine. If there is none, the process ends; if one still exists, the processing of steps S1403 to S1407 described below is repeated.

【０１４２】ステップＳ1403では、ステップＳ1402にて
次に着目する音声区間が存在すると判断されたので、そ
の音声区間（区間Ａ）を表わす音声区間情報を読み込
む。ここで、音声区間情報とは、音声区間の開始点と終
点とが対となった情報である。In step S1403, since it is determined in step S1402 that the next voice section of interest exists, the voice section information representing the voice section (section A) is read. Here, the voice section information is information in which the start point and the end point of the voice section are paired.

【０１４３】ステップＳ1404では、２つの区間Ａの間
隔、即ち、時間軸上で先の音声区間（現在着目している
音声区間）の終点と、次の音声区間の開始点との間の距
離（時間間隔）を求め、この距離が所定のしきい値Ｔｈ
３以下であるかを判断する。In step S1404, the interval between the two sections A, that is, the distance (time interval) between the end point of the earlier voice section (the voice section currently of interest) and the start point of the next voice section on the time axis, is obtained, and it is determined whether this distance is less than or equal to a predetermined threshold Th3.

【０１４４】ステップＳ1405では、ステップＳ1402にて
２つの区間Ａの間隔が所定のしきい値Ｔｈ３以下である
と判断されたので、これら２つの音声区間を、１つの音
声区間に統合する。より具体的に、統合された音声区間
の音声区間情報には、本ステップにおける処理によっ
て、先の音声区間の開始点が設定されると共に、次の音
声区間の終点が設定される。In step S1405, since it is determined in step S1402 that the interval between the two sections A is less than or equal to the predetermined threshold Th3, these two speech sections are integrated into one speech section. More specifically, in the voice section information of the integrated voice section, the start point of the previous voice section is set and the end point of the next voice section is set by the process in this step.

【０１４５】ステップＳ1406では、統合された音声区間
を、現在着目する音声区間（区間Ａ）として設定し、ス
テップＳ1402に戻る。In step S1406, the integrated voice section is set as the current voice section (section A), and the process returns to step S1402.

【０１４６】ステップＳ1407では、ステップＳ1402にて
２つの区間Ａの間隔が所定のしきい値Ｔｈ３より大きい
と判断されたので、現在着目する音声区間を、そのまま
１つの補正した音声区間情報として記憶すると共に、ス
テップＳ1408では、次の音声区間を、処理対象として着
目すべき音声区間として設定し、ステップＳ1402に戻
る。In step S1407, since it is determined in step S1402 that the interval between the two sections A is larger than the predetermined threshold value Th3, the current voice section is stored as it is as one corrected voice section information. At the same time, in step S1408, the next voice section is set as the voice section to be focused on as the processing target, and the process returns to step S1402.

【０１４７】このような統合処理が、扱うべき音声区間
（区間Ａ）がなくなるまで繰り返される。Such integration processing is repeated until there is no speech section (section A) to be handled.
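The merging loop of FIG. 14 amounts to a single pass over the time-ordered voice sections. The sketch below is a minimal Python illustration, assuming each voice section is a `(start_ms, end_ms)` pair sorted by start time; the function name is hypothetical. Taking the maximum of the two end points is a small defensive addition for overlapping inputs, whereas the text simply uses the end point of the next section.

```python
def merge_close_sections(sections, th3_ms):
    """Merge neighbouring voice sections whose gap is <= th3_ms.

    sections: list of (start_ms, end_ms) pairs, sorted by start time.
    """
    merged = []
    for start, end in sections:
        if merged and start - merged[-1][1] <= th3_ms:
            # S1405/S1406: keep the earlier start, extend to the later end,
            # and keep the merged section as the section of interest.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            # S1407/S1408: the previous section is final; start a new one.
            merged.append((start, end))
    return merged
```

Because the merged section stays the section of interest, a run of several closely spaced sections collapses into one, which is the behaviour described for step S1406.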

【０１４８】（シーンチェンジ点情報を利用した人の音
声区間の補正）また、一般に、音声信号を含む動画デー
タに映像の変わり目が存在し、その直後に区間Ａが始ま
る場合には、動画再生に際して、ほんの一瞬高速でシー
ンの先頭部分の映像が再生された後で、音声を伴う倍速
再生による再生映像が、人が聞いて把握できる速度で行
われるため、ユーザにとって映像がちらついたような違
和感が生じる。(Correction of human voice section using scene change point information) In general, when moving picture data including an audio signal contains a scene transition and a section A begins immediately after it, the opening part of the scene is reproduced at high speed for just an instant, after which reproduction switches to multiple-speed reproduction with audio at a speed a person can follow by listening. The user therefore perceives an unnatural flicker in the video.

【０１４９】そこで、本実施形態では、例えば、本願出
願人による先行する特開２０００−２３５６３９号公報
に開示されたシーンチェンジ点の検出技術を採用するこ
とにより、検出したシーンチェンジ点群のうち、音声区
間補正処理後の音声区間の先頭よりも時間的に早く、最
も近傍で、且つその距離があるしきい値以下であるシー
ンチェンジ点が存在する場合には、その音声区間の先頭
を、該シーンチェンジ点に対応する情報に置き換える補
正を行なうことにより、早見再生時のユーザの違和感を
取り除く。その際、近傍判定のためのしきい値は、高速
再生の状態から人が聞いて内容が把握できる程度の速度
で音声を伴う倍速再生へ移行する際のオーバーヘッドに
応じた値である。Therefore, this embodiment adopts, for example, the scene change point detection technique disclosed in the present applicant's earlier Japanese Patent Laid-Open No. 2000-235639. Among the detected scene change points, if there is one that is earlier in time than the start of a voice section (after the voice section correction process), is the nearest to it, and lies within a certain threshold distance of it, the start of that voice section is replaced with the information corresponding to that scene change point. This correction removes the user's sense of unnaturalness during quick-view reproduction. The threshold for the proximity judgment is a value that depends on the overhead of switching from high-speed reproduction to multiple-speed reproduction with audio at a speed at which a person can follow the content by listening.

【０１５０】図１５は、本実施形態においてシーンチェ
ンジ点を用いて行われる音声区間の統合補正処理を示す
フローチャートである。この処理は、音声区間判定部28
にて行われる処理であって、上述した早見再生区間補正
処理（ステップＳ104）の詳細を表わす。FIG. 15 is a flowchart showing the integrated correction process for voice sections performed using scene change points in this embodiment. This process is performed by the voice section determination unit 28 and represents the details of the quick-view reproduction section correction process (step S104) described above.

【０１５１】同図において、まずステップＳ1501では、
シーンチェンジ点検出処理（ステップＳ106）にて検出
されたシーンチェンジ点群（シーンチェンジ点情報また
はシーンチェンジ位置情報）から、時間軸上で先頭とな
るシーンチェンジ点（Ａ）を読み込む。In the figure, first, in step S1501, the scene change point (A) that comes first on the time axis is read from the group of scene change points (scene change point information, or scene change position information) detected in the scene change point detection process (step S106).

【０１５２】シーンチェンジ点情報は、通常はフレーム
単位で記述されるが、本ステップでは、フレームレート
に基づいて時間情報に変換した後、音声区間情報と比較
することになる。即ち、本実施形態のアルゴリズムで
は、音声区間の開始点から最も近傍のシーンチェンジ点
を求めるために、連続する２つのシーンチェンジ点情報
を用いることにし、ここでは、説明の便宜上、先のシー
ンチェンジ点をＡ、次のシーンチェンジ点をＢとして、
ステップＳ1501では、Ａの方へシーンチェンジ点の時間
を記憶する。Scene change point information is usually described in units of frames, but in this step it is converted into time information based on the frame rate and then compared with the voice section information. The algorithm of this embodiment uses two consecutive pieces of scene change point information in order to find the scene change point nearest to the start point of a voice section. For convenience of explanation, the earlier scene change point is called A and the next scene change point is called B. In step S1501, the time of the scene change point is stored in A.
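The frame-to-time conversion mentioned here is simple arithmetic. The Python sketch below is an illustration with a hypothetical function name; the default of 30 frames per second is the rate this document cites for its example tables, and rounding to whole milliseconds is an assumption.

```python
def frame_to_ms(frame_index, fps=30):
    """Convert a frame index to a time in milliseconds (rounded)."""
    return round(frame_index * 1000 / fps)
```

For example, at 30 frames per second, frame 3000 corresponds to 100000 ms, so a scene change detected at that frame can be compared directly with voice section boundaries expressed in milliseconds.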

【０１５３】ステップＳ1502では、読み込んでない音声
区間情報があるかどうかを判断し、無い場合には処理を
終了し、読み込んでない音声区間情報がある場合にはス
テップＳ1503において音声区間情報を１つ読み込む。In step S1502, it is determined whether there is any voice section information that has not yet been read. If there is none, the process ends; if there is, one piece of voice section information is read in step S1503.

【０１５４】ステップＳ1504では、未だ読み込んでない
シーンチェンジ点情報があるかどうかを判断し、無い場
合には、ステップＳ1503にて既に読み込んである音声区
間情報を、ステップＳ1505において、そのまま補正済の
音声区間情報として更新記憶する。In step S1504, it is determined whether there is any scene change point information that has not yet been read. If there is none, the voice section information already read in step S1503 is stored as-is, in step S1505, as the corrected voice section information.

【０１５５】ステップＳ1506では、ステップＳ1504にて
読み込んでないシーンチェンジ点情報があると判断され
たので、そのシーンチェンジ点情報を、シーンチェンジ
点情報Ｂとして読み込む。In step S1506, since it is determined that there is scene change point information that was not read in step S1504, that scene change point information is read as scene change point information B.

【０１５６】ステップＳ1507では、シーンチェンジ点Ａ
が、時間軸上において、ステップＳ1503にて読み込んだ
現在着目する音声区間の始点より前に位置するかどうか
判断し、前に位置する場合には、ステップＳ1505におい
て、補正の必要は無いとして音声区間情報をそのまま補
正済音声区間情報として更新記憶する。In step S1507, it is determined whether the scene change point A is located, on the time axis, before the start point of the voice section currently of interest that was read in step S1503. If it is not located before the start point, no correction is needed, and in step S1505 the voice section information is stored as-is as the corrected voice section information.

【０１５７】ステップＳ1508では、ステップＳ1507にて
シーンチェンジ点Ａが現在着目する音声区間の始点より
前に位置すると判断されたので、そのシーンチェンジ点
Ａが当該音声区間の始点としきい値Ｔｈ４以内の距離に
存在するかどうかを判断し、当該しきい値Ｔｈ４以内で
はない場合には、ステップＳ1509において、シーンチェ
ンジ点Ｂの情報を、シーンチェンジ点Ａへコピーするこ
とにより、次のシーンチェンジ点を判断対象とする準備
を行う。In step S1508, since it was determined in step S1507 that the scene change point A is located before the start point of the voice section currently of interest, it is determined whether the scene change point A lies within the threshold Th4 of the start point of that voice section. If it does not lie within the threshold Th4, the information of scene change point B is copied to scene change point A in step S1509, in preparation for examining the next scene change point.

【０１５８】ステップＳ1510では、ステップＳ1508にて
シーンチェンジ点Ａが現在着目する音声区間の始点と当
該しきい値Ｔｈ４以内の距離に存在すると判断されたの
で、シーンチェンジ点Ｂが当該音声区間の始点よりも後
ろに位置するかを判断し、後ろに位置しない場合にはス
テップＳ1509に進む。In step S1510, since it was determined in step S1508 that the scene change point A lies within the threshold Th4 of the start point of the voice section currently of interest, it is determined whether the scene change point B is located after the start point of that voice section. If it is not located after the start point, the process proceeds to step S1509.

【０１５９】一方、ステップＳ1510にてシーンチェンジ
点Ｂが当該音声区間の始点よりも後ろに位置すると判断
された場合には、ステップＳ1511において、シーンチェ
ンジ点Ａが開始点であり、当該音声区間の終点が終点で
ある部分区間を、補正済の音声区間情報として更新記憶
し、ステップＳ1512では、シーンチェンジ点Ｂの情報
を、シーンチェンジ点Ａにコピーすることにより、次の
シーンチェンジ点を判断対象とする準備を行う。On the other hand, if it is determined in step S1510 that the scene change point B is located after the start point of the voice section, then in step S1511 the span whose start point is the scene change point A and whose end point is the end point of the voice section is stored as the corrected voice section information. In step S1512, the information of scene change point B is copied to scene change point A, in preparation for examining the next scene change point.

【０１６０】即ち、上述したステップＳ1507、ステップ
Ｓ1508、並びにステップＳ1510の判断によって、シーン
チェンジ点Ａが現在着目する音声区間の始点の前に位置
すると共に、当該しきい値Ｔｈ４以下の近傍であり且
つ、最も音声区間の始点に近い点であることが確かめら
れて初めて、上記のステップＳ1511及びステップＳ1512
の処理が行われる。That is, the processing of steps S1511 and S1512 is performed only after the judgments of steps S1507, S1508, and S1510 have confirmed that the scene change point A is located before the start point of the voice section currently of interest, lies within the threshold Th4 of it, and is the point nearest to the start point of the voice section.

【０１６１】また、ステップＳ1510にてシーンチェンジ
点Ｂが当該音声区間の始点よりも後ろではないと判断さ
れた場合、当該シーンチェンジ点Ｂは、現在設定されて
いるシーンチェンジ点Ａよりも補正済音声区間の始点候
補として更にふさわしいと判断できるので、ステップＳ
1509において、当該シーンチェンジ点Ｂの情報を、新た
なシーンチェンジ点Ａとしてコピーすることにより、次
のシーンチェンジ点を判断対象とする準備を行ない、そ
の後でステップＳ1504の処理に戻る。但し、この場合の
シーンチェンジ点Ａは、既にステップＳ1507およびステ
ップＳ1508の要件を満たしているので、ステップＳ1507
とステップＳ1508とをパスしてステップＳ1510の判断を
いきなり行っても構わない。If it is determined in step S1510 that the scene change point B is not after the start point of the voice section, the scene change point B can be judged a better candidate for the start point of the corrected voice section than the currently set scene change point A. Therefore, in step S1509, the information of scene change point B is copied as the new scene change point A in preparation for examining the next scene change point, and the process then returns to step S1504. In this case, however, the scene change point A already satisfies the requirements of steps S1507 and S1508, so steps S1507 and S1508 may be skipped and the judgment of step S1510 made directly.
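The net effect of the procedure of FIG. 15 can be sketched compactly in Python. This is not a step-by-step transcription of the flowchart: instead of walking the two cursors A and B, it uses a binary search (`bisect_left` from the standard library) to find the last scene change point strictly before each start point, which is a swapped-in technique. The function name and the data layout are assumptions.

```python
from bisect import bisect_left

def snap_starts_to_scene_changes(sections, scene_changes_ms, th4_ms):
    """Move each voice section start back to the nearest preceding scene
    change point, if that point lies within th4_ms of the start.

    sections: list of (start_ms, end_ms) pairs.
    scene_changes_ms: scene change times in ms, sorted ascending.
    """
    corrected = []
    for start, end in sections:
        # Index of the first scene change at or after the start, so the
        # entry just before it is the nearest one strictly before the start.
        i = bisect_left(scene_changes_ms, start)
        if i > 0 and start - scene_changes_ms[i - 1] <= th4_ms:
            start = scene_changes_ms[i - 1]
        corrected.append((start, end))
    return corrected
```

With scene change points at 100000 ms and 101000 ms, a voice section starting at 102000 ms, and a threshold of 2000 ms, the start moves to 101000 ms, the nearer of the two candidates.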

【０１６２】上述した音声区間統合補正処理（図１５）
の手順によって取得した補正済の音声区間情報は、早見
再生区間情報として、表１に例示するようなスキーマ
で、動画早見インデックス記憶部１１に記憶される。The corrected voice section information acquired by the voice section integrated correction process described above (FIG. 15) is stored in the moving picture quick-view index storage unit 11 as quick-view reproduction section information, using a schema such as the one illustrated in Table 1.

【０１６３】表１は、本実施形態におけるシーンチェン
ジ検出結果を例示する表であり、一例として、シーンチ
ェンジ点の検出を行ったフレームを、フレームレート
（30枚/Sec）を元に秒換算した結果が格納されている。Table 1 illustrates the scene change detection results in this embodiment. As an example, it stores the frames at which scene change points were detected, converted into time values based on the frame rate (30 frames/sec).

【０１６４】[0164]

【表１】 [Table 1]

【０１６５】次に表２は、本実施形態における音声区間
の検出結果を例示する表であり、１つの音声区間は、開
始点と終了点とで表現されている。Next, Table 2 is a table exemplifying the detection result of the voice section in the present embodiment, and one voice section is represented by a start point and an end point.

【０１６６】[0166]

【表２】 [Table 2]

【０１６７】そして、表３は、本実施形態における補正
済の音声区間検出結果を例示する表であり、表１に示す
結果と表２に示す結果とに基づいて、シーンチェンジ点
を用いた音声区間の統合補正処理（図１５）を、しきい
値Ｔｈ４＝ 2000 mSecで施した場合の処理結果を示す。Table 3 illustrates the corrected voice section detection results in this embodiment. It shows the result of applying the integrated correction process for voice sections using scene change points (FIG. 15), with the threshold Th4 = 2000 mSec, to the results shown in Table 1 and Table 2.

【０１６８】[0168]

【表３】 [Table 3]

【０１６９】表１及び表２を参照すると、音声区間０お
よび音声区間２に対しては、それぞれの音声区間の開始
点60000 mSec、400000 mSecの前で且つしきい値Ｔｈ４
である2000 mSec以内の期間にはシーンチェンジは存在
しない。また、音声区間１に対しては、開始点102000 m
Secの1500 mSecの前で且つ2000 mSec以内には、シーン
チェンジ点として、シーンチェンジＩＤ＝２（開始時間
100000 mSec）と、シーンチェンジＩＤ＝３（開始時間1
01000mSec）の２点が存在するが、図１５で示したアル
ゴリズムに従って最も近傍のものを選ぶことから、結果
として、シーンチェンジＩＤ＝３の101000mSecが選ば
れ、これが表３に反映されている。Referring to Tables 1 and 2, for voice section 0 and voice section 2, no scene change exists before the respective start points of 60000 mSec and 400000 mSec and within the threshold Th4 of 2000 mSec. For voice section 1, two scene change points exist before the start point of 102000 mSec and within 2000 mSec of it: scene change ID = 2 (start time 100000 mSec) and scene change ID = 3 (start time 101000 mSec). Because the algorithm shown in FIG. 15 selects the nearest one, scene change ID = 3 at 101000 mSec is chosen, and this is reflected in Table 3.

【０１７０】＜動画早見再生部２００＞動画早見再生部
２００にて行われる動画早見再生処理（ステップＳ10
7）は、人の音声区間（区間Ａ）に対しては人が聞いて
内容を把握できる速度で音声を伴う倍速再生を行なう一
方で、人の音声区間ではない区間（区間Ｂ）に対して
は、再生映像を人が見て内容が把握できる範囲で高い倍
率の倍速で再生を行う。<Moving picture quick-view reproduction unit 200> In the moving picture quick-view reproduction process (step S107) performed by the moving picture quick-view reproduction unit 200, human voice sections (section A) are reproduced at a multiple speed with audio at which a person can follow the content by listening, while sections that are not human voice sections (section B) are reproduced at a high multiple speed within the range in which a person can still follow the content by watching the reproduced video.

【０１７１】近年、動画再生環境が整い、例えばマイク
ロソフト社の DirectShowモジュールを用いると、任意
区間の速度を指定して連続再生することが可能である。
このような機能を持つモジュールを用いることで、比較
的簡易に任意区間の再生速度の変化を実現することが可
能であり、その際、重要なのは、何の観点で速度を変化
させるかである。In recent years, moving picture reproduction environments have matured. For example, with Microsoft's DirectShow module, it is possible to reproduce continuously while specifying the speed of arbitrary sections. With a module that has such a function, changing the reproduction speed of arbitrary sections can be realized relatively easily; what matters is the criterion by which the speed is changed.
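To make the speed-switching idea concrete, the following Python sketch turns a list of voice sections into a reproduction schedule and estimates the resulting viewing time. It is independent of any particular player API and makes no DirectShow calls. The function names are hypothetical, and the default speeds (1.5x for section A, 10x for section B) are taken from the example ranges in the abstract, not from a fixed setting of the embodiment.

```python
def build_schedule(voice_sections, total_ms, speed_a=1.5, speed_b=10.0):
    """Return (start_ms, end_ms, speed) entries covering 0..total_ms.

    voice_sections: sorted, non-overlapping (start_ms, end_ms) pairs
    (section A); everything else is treated as section B.
    """
    schedule = []
    pos = 0
    for start, end in voice_sections:
        if start > pos:
            schedule.append((pos, start, speed_b))   # section B before the voice
        schedule.append((start, end, speed_a))       # section A with audio
        pos = end
    if pos < total_ms:
        schedule.append((pos, total_ms, speed_b))    # trailing section B
    return schedule

def playback_time_ms(schedule):
    """Wall-clock time needed to reproduce the whole schedule."""
    return sum((end - start) / speed for start, end, speed in schedule)
```

For a 60-second clip with one voice section from 10 s to 40 s, the schedule has three entries and takes 23 seconds to view at the default speeds.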

【０１７２】図１６は、本実施形態における動画早見再
生処理を示すフローチャートである。FIG. 16 is a flow chart showing the moving image quick view reproducing process according to the present embodiment.

【０１７３】同図において、ステップＳ1701では、先に
上述したユーザ・プロファイル１４の中からユーザが所
望のものを選択するが、その具体的な手順としては、例
えば、ディスプレイ１２に図１８に例示するようなユー
ザ・プロファイルリストを含む表示画面を表示し、その
中からユーザがリモコン端末等を利用して、所望のプロ
ファイルを選択すれば良い。In the figure, in step S1701, the user selects the desired profile from the user profiles 14 described above. As a specific procedure, for example, a display screen including a user profile list such as the one illustrated in FIG. 18 is shown on the display 12, and the user selects the desired profile from it using a remote control terminal or the like.

【０１７４】That is, to designate the desired profile in the user profile list shown in FIG. 18, an operation button for profile selection is provided, for example, on the remote control terminal. When the user presses this button, a menu display screen such as that illustrated in FIG. 20 is displayed, and the user designates the desired profile with the profile-selection button while viewing that screen. Of course, automatic profile selection using personal recognition techniques such as fingerprint, voiceprint, or face recognition is also conceivable. Because such a method always designates the correct profile, it prevents problems such as designating the wrong profile, or changing or viewing the contents of another person's profile.

【０１７５】When a user profile is to be newly registered, designating the "New registration" button on the display screen of FIG. 18 with a pointer device brings up the display screen illustrated in FIG. 19 for entering the profile name and other attributes.

【０１７６】That is, FIG. 19 illustrates a display screen for user profile registration. In the initial state, all items other than the identification name and the age are filled with reference values. The user enters a unique identification name and an age, and changes only those other items that need changing. After the entered values pass a validity check against predetermined input ranges, the profile is newly added to the user profiles 14 when the user presses the "OK" button.

【０１７７】When the user wishes to change the contents of a profile, the user presses the "Change" button on the display screen shown in FIG. 18 and selects the desired profile on the display screen shown in FIG. 20. On the display screen of FIG. 19, which is displayed in response, the user changes the items to be changed and then presses the "OK" button.

【０１７８】Further, when the user wishes to delete a profile, the user presses the "Delete" button on the display screen shown in FIG. 18, selects the desired profile on the display screen shown in FIG. 20, and then presses the "OK" button.

【０１７９】When the "Cancel" button is pressed on the display screens shown in FIGS. 18 and 19, the process ends without performing the processing (registration, change, or deletion of a profile) corresponding to the selection and input operations made up to that point.

【０１８０】Next, in step S1702, it is determined whether the profile selected in step S1701 exists in the user profiles 14. If it exists, the target profile is read from the user profiles 14 in step S1703. If it does not exist, the reproduction speeds of section A and section B and the volume during reproduction of section B, which are preset as reference values, are read in step S1706. An example of the data schema of the user profile is shown in Table 4.

【０１８１】[0181]

【表４】 [Table 4]

【０１８２】Table 4 illustrates the user profiles in this embodiment. The reference values may be stored as shown for profile ID = 0. In this case, the reference reproduction speed of section A is 1.5x, the reference reproduction speed of section B is 10.0x, and the reference volume during reproduction of section B is 0 (that is, audio muted). These values are used as the reference values when a user profile is newly registered as described above.

【０１８３】In the data schema of the user profile in Table 4, None indicates that no value is set. Conversely, when a value is set, reproduction is performed with that value given top priority. Further, in Table 4, Good and Poor in the eyesight and hearing columns indicate the person's dynamic visual acuity and ability to hear fast speech, independent of the person's age.
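The profile record and the fallback to reference values can be sketched as follows. Table 4 itself is not reproduced in this text, so every field name below is an assumption made for illustration; only the reference values (1.5x, 10.0x, volume 0 for profile ID 0), the None convention, and the Good/Poor values come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserProfile:
    profile_id: int
    name: str
    age: Optional[int] = None
    speed_a: Optional[float] = None   # None means "no value set"
    speed_b: Optional[float] = None
    volume_b: Optional[float] = None
    eyesight: Optional[str] = None    # "Good" or "Poor" (dynamic visual acuity)
    hearing: Optional[str] = None     # "Good" or "Poor" (hearing of fast speech)
    language: Optional[str] = None    # language the user is fluent in


# Reference values stored as profile ID 0.
REFERENCE = UserProfile(profile_id=0, name="reference",
                        speed_a=1.5, speed_b=10.0, volume_b=0.0)


def load_profile(store, profile_id):
    """Steps S1702/S1703/S1706: return the stored profile if it exists,
    otherwise fall back to the reference values."""
    return store.get(profile_id, REFERENCE)
```

Here `store` is assumed to be a plain dictionary keyed by profile ID; a real apparatus would read from the user profile storage.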

【０１８４】In general, as people age they become harder of hearing and often also understand speech more slowly. Children, whose language ability is still developing, often cannot understand audio that is reproduced at high speed.

【０１８５】In view of these circumstances, templates of the reproduction speed of section A and the reproduction speed of section B suited to the age of a healthy person are prepared in advance, and these speeds are determined based on the age stored in the user profile 14.

【０１８６】However, there are also cases with causes unrelated to age: for example, a young person whose dynamic visual acuity or hearing of fast speech is weak, or a foreign user who cannot keep up when audio in a language other than their native language (for example, Japanese) is reproduced at high speed. For this reason, in this embodiment the eyesight and hearing characteristics are described in the user profile, as illustrated in Table 4. If these settings are present they are given priority, and the reproduction speeds of section A and section B are set lower.

【０１８７】In such cases, for elderly users and users with weak dynamic visual acuity, the reproduction speed of human voice sections (section A) is set slower than normal (1x) speed and the reproduction speed of sections that are not human voice sections (section B) is set to normal speed or higher, although this may depart from the original aim of quick-view reproduction. Such users can then follow the audio content of section A at a low speed they can understand, while still viewing the whole moving picture in less time than if every section were reproduced at low speed.

【０１８８】For users whose hearing of fast speech is weak, and for foreign users who cannot keep up with rapid Japanese or the like, the reproduction speed of section A is set slower than normal speed, while the reproduction speed of section B is set to the same speed as for a healthy person of that age. Such users can then follow the audio content of section A at low speed, while still viewing the whole moving picture in less time than if every section were reproduced at low speed.

【０１８９】As described above, in this embodiment the speed determination process for a user profile makes an overall judgment that takes into account the templates, prepared in advance, of section A and section B reproduction speeds suited to the age of a healthy person; weak dynamic visual acuity or weak hearing of fast speech; and the situation in which a foreign user cannot keep up with rapid Japanese.

【０１９０】In this embodiment, whether the user is fluent in the language of the audio content is determined by comparing the language type information stored in the user profile 14, which indicates the languages in which the user is fluent or identifies the user's native language, with the language type information of the audio content included in the moving picture to be reproduced. In recent years, digital content such as DVDs and digital media such as digital BS broadcasts store language type information that identifies the language of the audio content, and program content has also become available electronically from an EPG (electronic program guide) or the like, so using this information is realistic. Even when this information cannot be obtained, terrestrial TV programs use the native language by default, and in bilingual broadcasts the main audio is normally in the native language and the sub audio in a foreign language, so the language may be estimated from these rules of thumb.
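The language determination described above can be sketched as follows. The metadata key, the channel names, and the `FOREIGN` marker are hypothetical; the fallback order (tagged language first, then the rule of thumb that main audio is in the native language and sub audio is in a foreign language) follows the description.

```python
FOREIGN = "foreign"  # marker for "some language other than the native one"


def estimate_audio_language(metadata, native="ja", channel="main"):
    """Return the language of the audio content.

    Uses language type information from the medium or EPG when present;
    otherwise falls back to the rule of thumb for broadcast audio.
    """
    tagged = metadata.get("language")
    if tagged:
        return tagged
    if channel == "sub":
        return FOREIGN
    return native


def is_fluent(profile_languages, audio_language):
    """True if the audio language is among the user's fluent languages."""
    return audio_language in profile_languages
```

For example, a viewer whose profile lists only Japanese would be treated as not fluent for the sub audio of a bilingual broadcast that carries no language tag.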

【０１９１】In step S1704, the reproduction speed of section A and the reproduction speed of section B are determined based on the user's desired profile read in step S1703. The details of this step are described with reference to FIG. 17.

【０１９２】FIG. 17 is a flowchart showing the details of step S1704 (FIG. 16) in the flowchart of the moving-picture quick-view reproduction process in this embodiment.

【０１９３】In FIG. 17, first, in step S1801, the profile previously selected by the user is read from the user profiles 14. In step S1802, the reproduction speeds of section A and section B for that user are provisionally determined according to the user's age obtained from the read profile, by referring to a template in which the optimum reproduction speeds of section A and section B are set according to the age of a healthy person.

【０１９４】In step S1803, it is determined whether the profile read in step S1801 describes the user's dynamic visual acuity as weak. If so, in step S1804 both the reproduction speed of section A and the reproduction speed of section B are updated to values lower than the reference values. It is therefore desirable to store these lowered values in the profile in advance as well.

【０１９５】Step S1805 is reached when step S1803 has determined that the profile does not describe the dynamic visual acuity as weak. In step S1805, it is determined whether the profile describes the user's hearing of fast speech as weak. If so, in step S1806 only the reproduction speed of section A is updated to a lower value. It is desirable to store this value in the profile in advance as well.

【０１９６】Step S1807 is reached when step S1805 has determined that the profile does not describe the hearing of fast speech as weak. In step S1807, it is determined whether language type information for the audio content included in the moving-picture data to be reproduced is available. If it is available, the process proceeds to step S1808; if not, the process ends.

【０１９７】In step S1808, the language type information of the audio content included in the moving-picture data to be reproduced is obtained and compared with the fluent-language information described in the currently selected profile. If the two match, the process ends. If they do not match, in step S1809 only the reproduction speed of section A is updated to a lower value. It is desirable to store this value in the profile in advance as well.

【０１９８】That is, in the series of processes shown in FIG. 17, if none of the conditions of step S1803, step S1805, and step S1808 applies, the reproduction speeds of section A and section B provisionally determined in step S1802 are adopted as they are.
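The decision flow of FIG. 17 can be sketched as a short function. The age template and the lowered values `slow_a` and `slow_b` are illustrative assumptions (the publication says only that such values should be stored in advance), and the dictionary keys are hypothetical names.

```python
def template_speeds(age):
    """Hypothetical age template (S1802): returns (speed_a, speed_b)."""
    if age is not None and (age < 10 or age >= 65):
        return 1.0, 5.0
    return 1.5, 10.0


def decide_speeds(profile, audio_language=None, slow_a=0.8, slow_b=2.0):
    """Follow steps S1802 to S1809 and return (speed_a, speed_b)."""
    speed_a, speed_b = template_speeds(profile.get("age"))        # S1802
    if profile.get("eyesight") == "Poor":                         # S1803
        return min(speed_a, slow_a), min(speed_b, slow_b)         # S1804
    if profile.get("hearing") == "Poor":                          # S1805
        return min(speed_a, slow_a), speed_b                      # S1806
    if audio_language is None:                                    # S1807
        return speed_a, speed_b
    if audio_language != profile.get("language"):                 # S1808
        return min(speed_a, slow_a), speed_b                      # S1809
    return speed_a, speed_b
```

Note that the checks are evaluated in order and the first matching condition decides the result, which mirrors the branch structure of the flowchart rather than combining the conditions.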

【０１９９】If the user's dynamic visual acuity or hearing of fast speech is better, or worse, than the user's age (old or young) would suggest, it is advisable to allow these values to be changed through a menu for changing the reproduction speeds of section A and section B. In that case, the user changes the two speeds as appropriate while watching the reproduced video, and the resulting speed information is stored in the profile corresponding to the user, either automatically or after asking the user for confirmation. This makes it possible to perform easy-to-understand quick-view reproduction suited to each user while reflecting the previous operations.

【０２００】If the process is to be performed more simply without using the profiles described above, an embodiment is conceivable in which, instead of the processing in steps S1701 to S1704 and step S1706, the user can variably set the reproduction speed of section A between 0.5x and 2x and the reproduction speed of section B between 2x and 10x through an operation menu.
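The simplified variant amounts to clamping two user-supplied values to fixed ranges. A minimal sketch, with hypothetical function names and the ranges taken from the paragraph above:

```python
def clamp(value, low, high):
    """Limit value to the closed range [low, high]."""
    return max(low, min(high, value))


def set_manual_speeds(speed_a, speed_b):
    """Clamp section A to 0.5x-2x and section B to 2x-10x."""
    return clamp(speed_a, 0.5, 2.0), clamp(speed_b, 2.0, 10.0)
```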

【０２０１】When section B is reproduced at a high multiple of normal speed, a high-pitched squealing sound is produced. For users who do not want to hear it, an embodiment is conceivable in which the audio is muted, or lowered to a small volume, during reproduction of section B. Such a setting is also described in advance in the profile read in step S1703 and is given top priority during quick-view reproduction. If step S1702 determines that no profile exists, the preset reference volume is adopted in step S1706. For an even simpler arrangement, an embodiment is conceivable in which the quick-view reproduction process fixes in advance how the audio reproduction level of section B is handled.

【０２０２】With the above configuration, in this embodiment the reproduction speed of section A, the reproduction speed of section B, or both, as well as the audio level of section B, are specified through the user profile, so that reproduction best suited to each user can be realized easily.

【０２０３】Next, in step S1705, the quick-view reproduction section information, that is, the voice section information corrected in the quick-view reproduction section correction process (step S104), is read from the moving-picture quick-view index storage unit 11. In step S1707, the reproduction time of section A is calculated by dividing the total length of section A by its reproduction speed, the reproduction time of section B is calculated in the same way, and the two values are added to obtain the time the user needs for quick viewing. The calculated time is then presented to the user using the display 23 or the like.
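The calculation in step S1707 is a sum of two quotients. A minimal sketch, with a hypothetical function name and with lengths expressed in seconds as an assumption:

```python
def quick_view_time(total_a, total_b, speed_a, speed_b):
    """Time needed for quick viewing (S1707).

    total_a, total_b: total lengths of sections A and B in seconds.
    speed_a, speed_b: reproduction speed multipliers for each section.
    """
    return total_a / speed_a + total_b / speed_b
```

For instance, with 600 seconds of section A at 1.5x and 3000 seconds of section B at 10.0x, section A takes 400 seconds and section B takes 300 seconds, for a total of 700 seconds instead of the 3600 seconds required at normal speed.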

【０２０４】In step S1708, whether the user, having seen the quick-view reproduction time in step S1707, is satisfied with that time is determined from an input operation on the remote control terminal or the like. If the user is satisfied, in step S1710 the moving picture to be reproduced, stored in the moving-picture data storage unit 10, is reproduced according to the reproduction speeds of section A and section B and the audio reproduction level of section B set by the processing described above.

【０２０５】Step S1709 is reached when step S1708 has determined that the user is not satisfied. In step S1709, a man-machine interface is provided through which the reproduction speeds of section A and section B and the audio reproduction level of section B can be changed so as to fit within the reproduction time the user desires. A user who is not satisfied with the profile or the standard settings adjusts the values so that the time approaches the desired reproduction time, and the process returns to step S1707.
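The publication leaves the adjustment in step S1709 to the user. As one hypothetical aid that such an interface could offer, the section B speed needed to reach a target time can be obtained by rearranging the step S1707 formula, holding the section A speed fixed. The function below is an illustration under that assumption, not part of the described apparatus.

```python
def required_speed_b(total_a, total_b, speed_a, target_time):
    """Section B speed that makes the quick-view time equal target_time.

    Returns None when section A alone already uses up the target time,
    in which case no section B speed can meet the target.
    """
    remaining = target_time - total_a / speed_a
    if remaining <= 0:
        return None
    return total_b / remaining
```

A result outside the range the apparatus supports would indicate that the section A speed also has to be raised, or that the target time is not attainable.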

【０２０６】As another embodiment corresponding to step S1709, a configuration is also conceivable in which the user, while watching the moving picture reproduced at the currently set section A and section B speeds, can change the reproduction speed of each section as desired, and the time required for quick viewing is recalculated and presented accordingly, so that a user who is not satisfied with the profile or the standard settings can adjust the values toward the desired reproduction time.

【０２０７】Regarding the relationship between the user profile and the speed specified by the user: when a user who has seen the time required for quick-view reproduction in step S1707 adjusts the reproduction speeds of section A and section B through the man-machine interface in order to fit the desired time, the user may wish to adopt the adjusted values as the reference values from then on. In such a case, the reproduction speed information adjusted by the user is stored in the profile corresponding to that user, either automatically or after the user is prompted by the confirmation screen illustrated in FIG. 21 and selects "Yes". In subsequent reproduction, easy-to-understand quick-view reproduction suited to that user can then be performed while reflecting the previous operations.

【０２０８】In the embodiment described above, the number of zero crossings and the audio energy are used in the voice labeling process. However, the specific processing procedure is not necessarily limited to the above algorithm; other known features or a different label determination algorithm may be used.
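The two features named above can be computed per frame as follows. This is a generic sketch of zero-crossing count and short-time energy, not the labeling algorithm of the embodiment; frame length, thresholds, and the label decision itself are outside its scope, and the function names are hypothetical.

```python
def zero_crossings(frame):
    """Count sign changes between consecutive samples in a frame."""
    count = 0
    for prev, cur in zip(frame, frame[1:]):
        if (prev < 0) != (cur < 0):
            count += 1
    return count


def energy(frame):
    """Short-time energy: sum of squared sample values in a frame."""
    return sum(x * x for x in frame)
```

Broadly, frames with low energy tend to be silence, while frames with a high zero-crossing count and modest energy tend to be noise-like sounds such as unvoiced consonants; a labeling stage would apply thresholds to these values.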

【０２０９】That is, the gist of the voice detection process according to the embodiment described above is as follows. Using zero-crossing information of the low-pass-filtered audio signal, the signal is divided into a reasonable set of audio segments (voice sections). At that time, the voice pitch is detected by waveform processing and voice labeling is performed. The audio segments are then integrated using features of speech such as the CVC speech model, on the basis of the predetermined voice pitch that always accompanies vowels, which make up most of the human voice. The process thereby includes recovery from disturbances even when the audio signal contains a disturbance such as BGM.

【０２１０】Accordingly, there are no restrictions on how the AGC 21 and the low-pass filter 22 are implemented, and the voice labeling is not necessarily limited to the algorithm of this embodiment; a different label determination algorithm may be used.

【０２１１】Also, in the determination process (FIG. 11) performed by the voice section determination unit 28, the order of the process in step S1106, which obtains a voice section by integrating an unvoiced or voiced consonant segment with a pitch segment, and the process in step S1107, which obtains a voice section by integrating two pitch segments that are adjacent or have a silent or noise segment between them, is not limited to that of the embodiment described above. An algorithm that performs these processes in parallel may also be used.

【０２１２】In the embodiment described above, a configuration was described in which, to select a user profile, the user calls up the profile selection screen with the remote control terminal and selects his or her own profile from the user profile list displayed on the display 12. The invention is not limited to this configuration. For example, a configuration may be adopted in which a password prevents operations such as changing or deleting another person's user profile.

【０２１３】Further, automatic profile selection using personal recognition techniques such as fingerprint, voiceprint, or face recognition is naturally also conceivable. In these cases there is no need for a password to prevent operations such as changing or deleting another person's user profile, which is convenient.

【０２１４】In the embodiment described above, an example was given in which the user checks the calculated time required for quick-view reproduction and then changes the reproduction speeds of section A and section B so as to fit within the desired reproduction time, so that a user who is not satisfied with the profile or the standard settings can adjust toward the desired reproduction time. The invention is not limited to this configuration. For example, there is also an embodiment in which the user can change the reproduction speeds of section A and section B individually while watching the reproduced video, and the time required for quick viewing is recalculated for those settings and presented to the user, so that the user can adjust toward the desired reproduction time.

【０２１５】As described above, this embodiment relies on the fact that the basis of the human speech production mechanism is vocal cord vibration, the so-called voice pitch. By extracting the voice pitch from the audio signal, useful voice sections are obtained and true human voice sections are detected. Using those sections, during quick-view reproduction all human speech is reproduced at a speed at which its content can be understood, without breaking the synchronization between video and audio, while sections that contain no human speech (section B) are reproduced at a higher speed. The total viewing time during quick-view reproduction can thus be reasonably reduced compared with normal-speed reproduction.

【０２１６】Also, according to this embodiment, by using the user profile 14 the reproduction speeds of section A and section B can easily be set to speeds suited to each user, and the volume during reproduction of section B can also be set to suit the user.

【０２１７】Further, according to this embodiment, the time required for quick-view reproduction is displayed in advance or during reproduction of the moving picture. A user who is not satisfied with it can specify the reproduction speeds of section A and section B to adjust the time to what suits that user best. The information set by this adjustment can be stored as an update to the profile corresponding to the user, so that appropriate reproduction is performed at the next quick-view reproduction.

【０２１８】[0218]

【Other Embodiments】The present invention, described above using the embodiments as examples, may be applied to a system composed of a plurality of devices or to an apparatus consisting of a single device.

【０２１９】The present invention also includes the case where it is achieved by supplying a software program that realizes the functions of the flowcharts described in the above embodiments, directly or remotely, to a system or apparatus operating as the moving-picture reproducing apparatus described above, and having a computer of that system or apparatus read and execute the supplied program code. In that case, the form need not be a program as long as it has the functions of the program.

【０２２０】Accordingly, the program code itself that is installed in a computer in order to realize the functional processing of the present invention on that computer also realizes the present invention. That is, the claims of the present invention also cover the computer program itself for realizing the functional processing of the present invention.

【０２２１】In that case, the program may take any form, such as object code, a program executed by an interpreter, or script data supplied to an OS, as long as it has the functions of the program.

【０２２２】Recording media for supplying the program include, for example, a floppy (registered trademark) disk, hard disk, optical disk, magneto-optical disk, MO, CD-ROM, CD-R, CD-RW, magnetic tape, non-volatile memory card, ROM, and DVD (DVD-ROM, DVD-R).

【０２２３】As another method of supplying the program, the program can be supplied by connecting to a homepage on the Internet using a browser on a client computer and downloading from that homepage, to a recording medium such as a hard disk, either the computer program of the present invention itself or a compressed file that includes an automatic installation function. It can also be realized by dividing the program code constituting the program of the present invention into a plurality of files and downloading each file from a different homepage. That is, a WWW (World Wide Web) server that allows a plurality of users to download the program files for realizing the functional processing of the present invention on a computer is also covered by the claims of the present invention.

[0224] The program of the present invention may also be encrypted, stored on a storage medium such as a CD-ROM, and distributed to users. A user who satisfies predetermined conditions is then allowed to download key information for decryption from a website via the Internet, and uses that key information to execute the encrypted program and install it on a computer.

[0225] The functions of the above embodiments are realized when the computer executes the program it has read. In addition, an OS or the like running on the computer may perform part or all of the actual processing based on the instructions of the program, and the functions of the above embodiments can also be realized by that processing.

[0226] Furthermore, after the program read from the recording medium is written into a memory provided on a function expansion board inserted into the computer or in a function expansion unit connected to the computer, a CPU or the like provided on that function expansion board or function expansion unit may perform part or all of the actual processing based on the instructions of the program, and the functions of the above embodiments are also realized by that processing.

【０２２７】[0227]

[Effects of the Invention] As described above, the present invention provides a moving picture reproducing apparatus, a moving picture reproducing method, and a computer program therefor that accurately detect voice sections uttered by a person and, while faithfully maintaining the synchronization between video and audio in accordance with the detected voice sections, significantly reduce the time the user needs for browsing.

[Brief description of drawings]

FIG. 1 is a conceptual diagram of the moving picture quick-view algorithm in the moving picture reproducing apparatus according to the present embodiment.

FIG. 2 is a block diagram showing the algorithm, executed in the moving picture quick-view index creation unit 100, for detecting voice sections (sections A) representing a person's utterance periods.

FIG. 3 is a flowchart outlining the processing based on the algorithm shown in FIG. 2.

FIG. 4 is a diagram illustrating the small-segment merging process performed in the present embodiment.

FIG. 5 is a flowchart showing the voice labeling process performed in the present embodiment.

FIG. 6 is a diagram illustrating the processing sequence from segmentation of the audio signal waveform through labeling in the present embodiment.

FIG. 7 is a diagram showing an example audio signal waveform used to explain the voice pitch detection process in the present embodiment.

FIG. 8 is a diagram illustrating the procedure for updating the pitch detection reference in the voice pitch detection process of the present embodiment.

FIG. 9 is a flowchart showing the voice pitch detection process in the present embodiment.

FIG. 10 is a flowchart showing the details of step S904 (FIG. 9) of the voice pitch detection process in the present embodiment.

FIG. 11 is a flowchart showing the voice section determination process in the present embodiment.

FIG. 12 is a flowchart showing the details of step S1106 (FIG. 11) of the voice section determination process in the present embodiment.

FIG. 13 is a flowchart showing the details of step S1107 (FIG. 11) of the voice section determination process in the present embodiment.

FIG. 14 is a flowchart showing the integration correction process performed on voice sections separated by short intervals in the present embodiment.

FIG. 15 is a flowchart showing the voice section integration correction process performed using scene change points in the present embodiment.

FIG. 16 is a flowchart showing the moving picture quick-view reproduction process in the present embodiment.

FIG. 17 is a flowchart showing the details of step S1704 (FIG. 16) of the moving picture quick-view reproduction process in the present embodiment.

FIG. 18 is a diagram showing an example display screen for user profile selection.

FIG. 19 is a diagram showing an example display screen for user profile registration.

FIG. 20 is a diagram showing an example of a user profile in the present embodiment.

FIG. 21 is a diagram showing an example display screen that, when a user dissatisfied with the presented time required for quick-view reproduction changes the settings, prompts the user to confirm whether the adjusted values should be used as reference values for subsequent moving picture reproduction.

Continuation of front page: (51) Int. Cl.⁷ / Identification code / FI / Theme code (reference): H04N 5/76

Claims

1. A moving picture reproducing apparatus capable of reproducing moving picture information including a voice signal at high speed, comprising: voice section determination means for determining, based on the voice signal included in the moving picture information, a first voice section representing a human utterance period and a second voice section other than the first voice section; and quick-view reproducing means for performing, based on the moving picture information, high-speed moving picture reproduction with reproduced sound in the first voice section at a predetermined speed at which a user can grasp the content, while performing at least high-speed moving picture reproduction in the second voice section at a speed higher than the predetermined speed.

2. The moving picture reproducing apparatus according to claim 1, wherein the quick-view reproducing means performs, in the second voice section, moving picture reproduction at a speed higher than the predetermined speed with reproduced sound at least at a low volume.

3. The moving picture reproducing apparatus according to claim 1, wherein the quick-view reproducing means reproduces a moving picture without voice at a speed higher than the predetermined speed in the second sound section.

4. The moving picture reproducing apparatus according to claim 1, wherein the voice section determination means extracts a voice pitch corresponding to vocal cord vibration from the voice signal and determines the first voice section based on the extracted voice pitch.

5. The moving picture reproducing apparatus according to claim 3, wherein the volume at which the quick-view reproducing means reproduces sound in the second voice section is predetermined or can be designated by the user.

6. The moving picture reproducing apparatus according to claim 1, wherein the voice section determination means extracts a pitch within the frequency range that vocal cords can produce from a signal obtained by filtering the voice signal to the band of human speech, thereby detects the dominant vowel parts of the human voice, and determines the first voice section by integrating the detected vowel parts.

7. The moving picture reproducing apparatus according to any one of claims 1 to 3, wherein the voice section determination means includes correcting means for, when determining the first voice section based on the voice signal, integrating and correcting a plurality of first voice sections that are close to each other on the time axis.

8. The described moving picture reproducing apparatus, wherein the correcting means detects scene change points included in the moving picture information and, when the time interval between the start point of a first voice section of interest and the nearest detected scene change point located earlier in time than that start point is less than or equal to a predetermined threshold, performs the correction by replacing the start point of the first voice section of interest with the information corresponding to that neighboring scene change point.

9. The moving picture reproducing apparatus according to any one of claims 1 to 3, wherein the quick-view reproducing means calculates the time required for the high-speed moving picture reproduction based on the length of the first voice section, the reproduction speed of that section, and the length of the second voice section, and presents the calculated required time to the user.

10. The moving picture reproducing apparatus according to claim 9, wherein the quick-view reproducing means includes adjusting means for, when the user performs an operation to change the reproduction speeds of the first and second voice sections in response to the presentation of the required time, adjusting the required time based on the changed reproduction speeds.

11. The moving picture reproducing apparatus according to claim 1, further comprising a user profile in which attribute information on each user able to use the moving picture reproducing apparatus is registered, wherein the quick-view reproducing means automatically determines the reproduction speeds of the first and second voice sections according to the attribute information on a specific user registered in the user profile.

12. The described moving picture reproducing apparatus, wherein the user profile includes, as attribute information on each user, at least one of age, language used, visual acuity, and ability to follow fast speech.

13. The moving picture reproducing apparatus according to claim 11 or 12, wherein the quick-view reproducing means calculates the time required for the high-speed moving picture reproduction based on the length of the first voice section, the reproduction speed of that section automatically determined according to the attribute information on the specific user, and the length of the second voice section, presents the calculated required time to the user, and includes adjusting means for, when the user performs an operation to change the reproduction speeds of the first and second voice sections in response to the presentation of the required time, adjusting the required time based on the changed reproduction speeds.

14. The moving picture reproducing apparatus according to claim 13, wherein the adjusting means stores the changed reproduction speeds of the first and second voice sections in the user profile in association with the attribute information on the specific user, and the quick-view reproducing means applies the changed reproduction speeds stored in the user profile when performing the high-speed moving picture reproduction.

15. The moving picture reproducing apparatus according to claim 11, wherein, when the user designates information on the manner of reproducing the second voice section, the quick-view reproducing means stores that information in the user profile in association with the attribute information on the user, and applies the stored information when performing the high-speed moving picture reproduction.

16. The moving picture reproducing apparatus according to claim 1, wherein the predetermined speed during high speed moving picture reproduction in the first audio section is 1.5 to 2 times as fast as constant speed reproduction.

17. The moving picture reproducing apparatus according to claim 11, wherein, when the attribute information on the user registered in the user profile includes identification information indicating that the user is an elderly person, a visually impaired person, or a hearing impaired person, the quick-view reproducing means, in performing the high-speed moving picture reproduction for that user, sets the reproduction speed of the first voice section slower than the normal speed and the reproduction speed of the second voice section faster than the normal speed.

18. The moving picture reproducing apparatus according to claim 11, wherein, when the attribute information on the user registered in the user profile includes identification information indicating the language used by the user and that identification information does not match the language type information included in the moving picture information, the quick-view reproducing means, in performing the high-speed moving picture reproduction for that user, sets the reproduction speed of the first voice section slower than constant speed and the reproduction speed of the second voice section to 5 to 10 times speed.

19. The moving picture reproducing apparatus according to claim 11, wherein attribute information on each of a plurality of users able to use the moving picture reproducing apparatus is registered in the user profile, and the quick-view reproducing means, in accordance with a selection operation for a specific user, acquires the attribute information on that specific user from the user profile based on a personal authentication technique.

20. The moving picture reproducing apparatus according to claim 11, further comprising attribute information changing means capable of changing attribute information relating to a specific user registered in the user profile by the specific user himself.

21. A moving picture reproducing method for reproducing moving picture information including a voice signal at high speed, comprising: a voice section determination step of determining, based on the voice signal included in the moving picture information, a first voice section representing a human utterance period and a second voice section other than the first voice section; and a quick-view reproducing step of performing, based on the moving picture information, high-speed moving picture reproduction with reproduced sound in the first voice section at a predetermined speed at which a user can grasp the content, while performing at least high-speed moving picture reproduction in the second voice section at a speed higher than the predetermined speed.

22. The moving picture reproducing method according to claim 21, wherein, in the quick-view reproducing step, moving picture reproduction in the second voice section is performed at a speed higher than the predetermined speed with reproduced sound at least at a low volume.

23. The moving picture reproducing method according to claim 21, wherein, in the fast-viewing reproducing step, the moving picture is reproduced in the second voice section at a speed higher than the predetermined speed and without voice.

24. The moving picture reproducing method according to any one of claims 21 to 23, wherein, in the voice section determination step, a voice pitch corresponding to vocal cord vibration is extracted from the voice signal and the first voice section is determined based on the extracted voice pitch.

25. The moving picture reproducing method according to claim 21, wherein, in the voice section determination step, a pitch within the frequency range that vocal cords can produce is extracted from a signal obtained by filtering the voice signal to the band of human speech, the dominant vowel parts of the human voice are thereby detected, and the first voice section is determined by integrating the detected vowel parts.

26. The moving picture reproducing method according to any one of claims 21 to 23, wherein, in the voice section determination step, when the first voice section is determined based on the voice signal, a plurality of first voice sections that are close to each other on the time axis are integrated and corrected.

27. The moving picture reproducing method according to claim 26, wherein, in the voice section determination step, scene change points included in the moving picture information are detected at the time of the correction and, when the time interval between the start point of a first voice section of interest and the nearest detected scene change point located earlier in time than that start point is less than or equal to a predetermined threshold, the correction is performed by replacing the start point of the first voice section of interest with the information corresponding to that neighboring scene change point.

28. The described moving picture reproducing method, further comprising a registration step of registering, as a user profile, attribute information on each user able to use the moving picture reproducing apparatus, wherein, in the quick-view reproducing step, the reproduction speeds of the first and second voice sections are automatically determined according to the attribute information on a specific user registered in the user profile.

29. A computer program comprising operation instructions by which a computer can implement the moving picture reproducing apparatus according to claim 1.

30. A computer program comprising operation instructions by which a computer can carry out the moving picture reproducing method according to any one of claims 21 to 28.
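
As an informal illustration only, and not part of the claims, the integration of neighboring first voice sections (claim 7) and the required-time calculation (claim 9) might be sketched in Python as follows. All names here (`merge_close_sections`, `required_time`, `max_gap`, `speed_a`, `speed_b`) are hypothetical. Dividing the second-section length by its own reproduction speed is an assumption about how the total would be formed, and the merge threshold value is likewise assumed.

```python
from typing import List, Tuple

# A first voice section (section A) as (start_sec, end_sec); hypothetical representation.
Section = Tuple[float, float]


def merge_close_sections(sections: List[Section], max_gap: float) -> List[Section]:
    """Integrate first voice sections whose gap on the time axis is <= max_gap seconds."""
    merged: List[Section] = []
    for start, end in sorted(sections):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous section: extend it instead of adding a new one.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def required_time(total_length: float, sections: List[Section],
                  speed_a: float, speed_b: float) -> float:
    """Estimate quick-view duration from section lengths and reproduction speeds.

    Assumes the sections do not overlap and that everything outside them
    is a second voice section (section B) played at speed_b.
    """
    length_a = sum(end - start for start, end in sections)
    length_b = total_length - length_a
    return length_a / speed_a + length_b / speed_b


if __name__ == "__main__":
    # 600 s clip; sections A played at 2x, sections B at 10x (assumed example values).
    merged = merge_close_sections([(10, 40), (41, 100), (300, 400)], max_gap=2.0)
    print(merged)
    print(required_time(600, merged, speed_a=2.0, speed_b=10.0))
```

In this sketch the short gap between two merged sections becomes part of the first voice section, so it is played at the slower, intelligible speed rather than skipped through.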