JPH1188844A

JPH1188844A - Speech speed/picture speed simultaneous conversion system, method therefor and storage medium recorded with speech speed/picture speed simultaneous conversion control program

Info

Publication number: JPH1188844A
Application number: JP9249136A
Authority: JP
Inventors: Akira Nakamura; 章中村; Hajime Sonehara; 源曽根原; Kazuhisa Iguchi; 和久井口; Yuji Nojiri; 裕司野尻
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 1997-09-12
Filing date: 1997-09-12
Publication date: 1999-03-30
Anticipated expiration: 2017-09-12
Also published as: JP3730764B2

Abstract

PROBLEM TO BE SOLVED: To vary the speed of video in lip sync with speech-speed-converted audio without causing unnatural motion or jitter. SOLUTION: A speech speed conversion part 100 divides the input voice into silent, unvoiced, and voiced sections in a section division part 2 according to a speech speed conversion algorithm, and converts the speaker's speech speed to a speed suited to the listener's hearing ability by means of a silent section extension part 3 and a basic period section repeating part 7, based on the speech speed magnification set by the listener in a speech speed setting part 9. A picture speed conversion part 200 detects motion vectors of the input image and, based on a conversion algorithm that interpolates according to those motion vectors, interpolates an image (an arbitrary number of fields) at an arbitrary time position within a variable time length, in synchronization with the speech-speed-converted audio and in accordance with a speech speed extension/shortening code transferred from a synthesis part 8 of the conversion part 100, so that the video speed is varied from moment to moment.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to technology for compensating for declines in audiovisual ability caused by aging or some impairment, such as a decline in the critical speech identification speed (the maximum speech speed at which speech can be identified accurately) and in the video identification speed (the speed at which motion in video can be identified accurately). In particular, it relates to a simultaneous speech speed / image speed conversion system and method, and to a recording medium storing a simultaneous speech speed / image speed conversion control program, for media carrying both video and audio (for example, television, video/audio recording media such as VTR and DVD, and video playback on a computer), medical equipment, and the like. The video speed is converted with high quality and natural appearance in synchronization with speech-speed-converted audio, that is, speech whose speed has been converted with high quality while preserving the speaker's voice pitch, individuality, and phonemic character, so that playback can be fitted to the viewer's audiovisual characteristics and viewing is assisted.

[0002]

[Prior Art] For viewers whose critical speech identification speed or video identification speed has declined because of aging or some impairment, the intelligibility of speech and video can be improved by slowing both down simultaneously, in synchronization and with high quality. The prior art, however, offers no method for slowing both speech and video in synchronization with high quality.

[0003] In other words, no method has been developed that converts the video speed with high quality and natural appearance in synchronization with speech-speed-converted audio (speech whose speed has been converted with high quality while preserving the speaker's voice pitch, individuality, and phonemic character), fits playback to the viewer's audiovisual characteristics, and assists viewing. The reason is that converting the video speed with high quality and natural appearance in synchronization with such audio has been difficult.

[0004] Several methods have been developed that perform only speech speed conversion with high quality. One example is disclosed in Japanese Patent Application Laid-Open No. Hei 5-080796, filed by the present applicant.

[0005] A simple way to vary the video speed in synchronization (lip sync) with this speech-speed-converted audio is, for example, the variable-speed playback of a VTR.

[0006]

[Problems to Be Solved by the Invention] However, with the prior art described above, even if the variable-speed playback of a VTR or the like is roughly synchronized with the speech-speed-converted audio as a simple measure, such playback works in field units of 1/60 second or 1/50 second (for example, 1/60 second for Hi-Vision and NTSC, 1/50 second for PAL), presenting the same field two or more times or dropping consecutive fields. As a result, motion becomes unnatural and the picture jitters vertically, so high-quality conversion is not possible.

[0007] In particular, viewers whose hearing has deteriorated through aging or some other cause also rely on lip reading. If the speech-speed-converted audio and the video are out of sync, or if motion becomes unnatural because of the common method described above (variable-speed playback on a VTR or the like), lip reading can no longer be used alongside listening, and viewing becomes difficult.

[0008] Moreover, with conventional variable-speed playback such as that of a VTR, the speed magnification is constant, so the video cannot be finely synchronized with the speech-speed-converted audio.

[0009] Furthermore, conventional variable-speed playback equipment (for example, a VTR) performs no operation such as field interpolation. It therefore cannot vary the speed accurately from moment to moment, and cannot be synchronized (lip sync) with audio whose speech speed changes from moment to moment, as speech-speed-converted audio does.

[0010] The present invention has been made in view of the above points. Its object is to provide a simultaneous speech speed / image speed conversion system and method, and a recording medium storing a simultaneous speech speed / image speed conversion control program, that can vary the speed of video and audio simultaneously in synchronization (lip sync) with speech-speed-converted audio, without the unnatural motion or vertical jitter that arises when, as in VTR variable-speed playback, the same field is presented two or more times or consecutive fields are dropped, and that can thereby assist viewing by fitting playback to each viewer's audiovisual characteristics without making the content harder to watch.

[0011]

[Means for Solving the Problems] To achieve the above object, the system invention of claim 1 is characterized by comprising: speech speed conversion means that extends or shortens the input speech based on a speech speed magnification set by the listener, while preserving voice pitch and individuality, outputs the result as speech-speed-converted audio, and detects, for each variable-length block on which speech speed conversion is performed, how large a time difference has arisen between the speech-speed-converted audio and the input speech; and image speed conversion means that detects motion vectors between fields of the input video and, when the time difference transferred from the speech speed conversion means exceeds the unit field time of the video, interpolates a field at an arbitrary time position corresponding to the time difference based on the motion vectors of the video, or, for video with little motion, linearly interpolates a field at an arbitrary time position, and outputs the result as image-speed-converted video.

[0012] Here, the speech speed conversion means may comprise: speech speed setting means for setting the speech speed magnification; section dividing means for dividing the input speech into silent sections, unvoiced sections, and voiced sections; silent section extension/shortening means for extending or shortening the silent sections produced by the section dividing means, based on the speech speed magnification set by the speech speed setting means; voiced section extension/shortening means for extending or shortening the voiced sections produced by the section dividing means in units of the basic period (the fundamental, or pitch, period), based on that magnification; synthesis means for combining the silent sections extended or shortened by the silent section extension/shortening means, the voiced sections extended or shortened by the voiced section extension/shortening means, and the unvoiced sections produced by the section dividing means, and outputting the result as speech-speed-converted audio; and speech speed extension/shortening detection means for detecting, for each variable-length block on which speech speed conversion is performed, how large a time difference has arisen between the speech-speed-converted audio and the input speech (the original speech), and outputting a speech speed extension/shortening code representing that time difference.

[0013] The image speed conversion means may comprise: motion detection means for detecting motion vectors between fields of the input video; and field interpolation means that, when the amount of speech speed extension or shortening indicated by the speech speed extension/shortening code transferred from the speech speed extension/shortening detection means of the speech speed conversion means exceeds the unit field time of the video, interpolates a field at an arbitrary time position corresponding to that amount based on the motion vectors detected by the motion detection means, or, for video with little motion, linearly interpolates a field at an arbitrary time position, and outputs the result as image-speed-converted video.
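The decision rule stated for the field interpolation means can be sketched as follows. This is only an illustration under stated assumptions: the function name `plan_interpolation`, the scalar `motion_magnitude`, and the `small_motion_threshold` value are hypothetical, because the text does not say how "little motion" is measured.

```python
def plan_interpolation(time_diff_s, field_time_s, motion_magnitude,
                       small_motion_threshold=1.0):
    """Pick an interpolation strategy for one variable-length block.

    time_diff_s: extension (+) or shortening (-) reported by the speech
        speed extension/shortening code, in seconds.
    field_time_s: unit field time, e.g. 1/60 or 1/50 second.
    motion_magnitude, small_motion_threshold: hypothetical motion measure
        and cut-off; the source does not define either.
    """
    if abs(time_diff_s) <= field_time_s:
        return "none"  # difference has not exceeded one field time
    if motion_magnitude < small_motion_threshold:
        return "linear"  # little motion: linear interpolation is used
    return "motion_compensated"  # interpolate along the motion vectors
```

For example, a 50 ms extension against a 1/60 s field time with a small motion measure would select `"linear"` under these assumed thresholds.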

[0014] Furthermore, the field interpolation means may operate in units of the variable block length, synchronizing with the speech-speed-converted audio whose speed changes from moment to moment, determining the video interpolation positions independently for each variable-length block, and interpolating an arbitrary number of fields.
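One way to picture the per-block choice of interpolation positions is the sketch below. It assumes, purely for illustration, that the output fields of a block are spaced uniformly over the input fields of that block; the source does not specify the spacing, and the function names are hypothetical.

```python
def fields_for_block(block_s, time_diff_s, field_time_s):
    """Input and output field counts for one block.

    block_s: duration of the original block in seconds.
    time_diff_s: extension (+) or shortening (-) of the converted audio.
    """
    n_in = round(block_s / field_time_s)
    n_out = round((block_s + time_diff_s) / field_time_s)
    return n_in, n_out


def interpolation_positions(n_in_fields, n_out_fields):
    """Time position of each output field, in units of input field periods.

    Uniform spacing is an assumption of this sketch.
    """
    step = n_in_fields / n_out_fields
    return [k * step for k in range(n_out_fields)]


# A 150 ms block extended by 50 ms at a 1/60 s field time:
n_in, n_out = fields_for_block(0.15, 0.05, 1 / 60)
positions = interpolation_positions(n_in, n_out)
```

Here 9 input fields become 12 output fields, so each output field sits 0.75 input field periods after the previous one. Fractional positions are the ones that require an interpolated field.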

[0015] Furthermore, the voiced section extension/shortening means may comprise: basic period extraction means for extracting the basic period of each voiced section produced by the section dividing means; basic period section dividing means for dividing the voiced section into individual basic periods according to the basic period extracted by the basic period extraction means; and basic period section repetition means for repeating the basic period sections produced by the basic period section dividing means according to the voiced section extension magnification from the speech speed setting means, thereby lengthening the voiced section.
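The repetition of basic period sections can be sketched as below. This is a minimal illustration, not the patented procedure: the function name and the rule for deciding which periods to repeat are assumptions, and any smoothing at the joins between repeated periods is left out.

```python
def repeat_periods(periods, ratio):
    """Repeat (or drop) whole basic-period sections to approach `ratio`.

    periods: list of basic-period sections (each e.g. a list of samples).
    ratio: extension magnification; 1.5 lengthens, 0.5 shortens.
    The small epsilon guards against floating-point round-off.
    """
    out = []
    emitted = 0
    for i, period in enumerate(periods):
        target = int((i + 1) * ratio + 1e-9)  # sections owed so far
        out.extend([period] * (target - emitted))
        emitted = target
    return out
```

Because only whole periods are repeated, the pitch of the voice is unchanged and only the duration of the voiced section changes, which matches the stated aim of preserving voice pitch.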

[0016] The method invention of claim 6 is characterized by: extending or shortening the input speech based on a speech speed magnification set by the listener, while preserving voice pitch and individuality, and outputting the result as speech-speed-converted audio; detecting, for each variable-length block on which speech speed conversion is performed, how large a time difference has arisen between the speech-speed-converted audio and the input speech; detecting motion vectors between fields of the input video; and, when the time difference exceeds the unit field time of the video, interpolating a field at an arbitrary time position corresponding to the time difference based on the motion vectors of the video, or, for video with little motion, linearly interpolating a field at an arbitrary time position, and outputting the result as image-speed-converted video.

[0017] The method invention of claim 7 is characterized by comprising: a speech speed setting step of setting a speech speed magnification; a section dividing step of dividing the input speech into silent sections, unvoiced sections, and voiced sections; a silent section extension/shortening step of extending or shortening the silent sections produced in the section dividing step, based on the speech speed magnification set in the speech speed setting step; a voiced section extension/shortening step of extending or shortening the voiced sections produced in the section dividing step in units of the basic period, based on that magnification; a synthesis step of combining the silent sections extended or shortened in the silent section extension/shortening step, the voiced sections extended or shortened in the voiced section extension/shortening step, and the unvoiced sections produced in the section dividing step, and outputting the result as speech-speed-converted audio; a speech speed extension/shortening detection step of detecting, for each variable-length block on which speech speed conversion is performed, how large a time difference has arisen between the speech-speed-converted audio and the input speech (the original speech), and outputting a speech speed extension/shortening code representing that time difference; a motion detection step of detecting motion vectors between fields of the input video; and a field interpolation step of, when the amount of speech speed extension or shortening indicated by the speech speed extension/shortening code transferred from the speech speed extension/shortening detection step exceeds the unit field time of the video, interpolating a field at an arbitrary time position corresponding to that amount based on the motion vectors detected in the motion detection step, or, for video with little motion, linearly interpolating a field at an arbitrary time position, and outputting the result as image-speed-converted video.

[0018] The recording medium invention of claim 10 is a recording medium storing a control program for causing a computer to convert speech speed and image speed simultaneously, characterized in that the control program causes the computer to: extend or shorten the input speech based on a speech speed magnification set by the listener, while preserving voice pitch and individuality, and output the result as speech-speed-converted audio; detect, for each variable-length block on which speech speed conversion is performed, how large a time difference has arisen between the speech-speed-converted audio and the input speech; detect motion vectors between fields of the input video; and, when the time difference exceeds the unit field time of the video, interpolate a field at an arbitrary time position corresponding to the time difference based on the motion vectors of the video, or, for video with little motion, linearly interpolate a field at an arbitrary time position, and output the result as image-speed-converted video.

[0019] The recording medium invention of claim 11 is a recording medium storing a control program for causing a computer to convert speech speed and image speed simultaneously, characterized in that the control program causes the computer to: set a speech speed magnification; divide the input speech into silent sections, unvoiced sections, and voiced sections; extend or shorten the silent sections based on the speech speed magnification; extend or shorten the voiced sections in units of the basic period based on the speech speed magnification; combine the extended or shortened silent and voiced sections with the unvoiced sections, which are left unprocessed, and output the result as speech-speed-converted audio; detect, for each variable-length block on which speech speed conversion is performed, how large a time difference has arisen between the speech-speed-converted audio and the input speech (the original speech), and output a speech speed extension/shortening code representing that time difference; detect motion vectors between fields of the input video; and, when the amount of speech speed extension or shortening indicated by the code exceeds the unit field time of the video, interpolate a field at an arbitrary time position corresponding to that amount based on the detected motion vectors, or, for video with little motion, linearly interpolate a field at an arbitrary time position, and output the result as image-speed-converted video.

[0020] With the above configuration, the present invention can synchronize speech-speed-converted audio and video, compensate synergistically for deterioration in both auditory and visual characteristics, and make content easier to watch by fitting it to each viewer's audiovisual characteristics. The video speed can be varied with high quality and natural appearance in synchronization with speech-speed-converted audio, whose utterance duration differs from that of the original speech. This compensates for declines, caused by aging or some impairment, in the critical speech identification speed (the maximum speech speed at which speech can be identified accurately) and in the video identification speed (the speed at which motion in video can be identified accurately). Each viewer can fit playback on the receiving side to his or her own audiovisual characteristics, which vary widely from person to person, obtaining the optimal speech speed with the video speed varied in synchronization, so that viewing is assisted.

[0021] The present invention also interpolates video as appropriate in synchronization with the speech-speed-converted audio, based on the speech speed extension/shortening code, so that the video speed can be varied from moment to moment. This makes it possible to synchronize (lip sync) the speech-speed-converted audio and the video.

[0022] The present invention also determines the video interpolation positions as appropriate, in synchronization with speech-speed-converted audio whose speech speed changes block by block, and interpolates the video accordingly, so that the video speed can be varied from moment to moment. This makes it possible to synchronize (lip sync) the speech-speed-converted audio and the video.

[0023] To enable high-quality speed conversion, the present invention is based on an algorithm that detects the motion vectors of the video and interpolates accordingly. Using the speech speed extension/shortening code (information on how much the speech was extended or shortened) obtained from the speech-speed-converted audio, it interpolates video at an arbitrary time position within a variable time length, enabling synchronization (lip sync) between the speech-speed-converted audio and the video.

[0024] As described above, according to the present invention, for media accompanied by video (for example, television, video/audio recording media such as VTR and DVD, and video playback on a computer), medical equipment, and the like, the video speed can be converted with high quality and natural appearance in synchronization with speech-speed-converted audio (speech whose speed has been converted with high quality while preserving the speaker's voice pitch, individuality, and phonemic character), playback can be fitted to the viewer's audiovisual characteristics, and viewing can be assisted.

[0025]

[Embodiments of the Invention] The present invention is described in detail below based on the embodiments shown in the drawings.

[0026] The present invention may be implemented in a system made up of multiple devices or in an apparatus consisting of a single device. It goes without saying that it also applies where the invention is achieved by supplying a program to a system or apparatus. For example, it can be realized not only with dedicated equipment but also with a personal computer or the like. The recording medium that stores, in program form, the control procedures implementing the speech speed conversion algorithm and image speed conversion algorithm of the present invention may be, besides an FD (floppy disk), a CD-ROM, an IC memory card, or the like. The program may also be recorded in ROM, configured to form part of the memory map, and executed directly by the CPU.

[0027] FIG. 1 is a functional block diagram showing the configuration of a simultaneous speech speed / image speed conversion system according to one embodiment of the present invention. The system has a speech speed conversion unit 100 that converts the speech speed and an image speed conversion unit 200 that converts the image speed. The image speed conversion unit 200 can interpolate video at an arbitrary time position in accordance with the time extension information for the speech-speed-converted audio supplied by the speech speed conversion unit 100, which enables synchronization (lip sync) between the speech-speed-converted audio and the video. Image speed conversion is applicable to various color television systems such as Hi-Vision, NTSC, and PAL; FIG. 1 shows the case of Hi-Vision (high-definition television) as a representative example.

[0028] [1] Speech speed conversion unit 100
The speech speed conversion unit 100 converts the speaker's speech speed to a speed suited to the listener's hearing ability, according to the ratio determined by the speech speed setting unit 9 described later, so that the listener can follow it. That is, following the speech speed conversion algorithm, the unit divides the input speech into silent sections, unvoiced sections, and voiced sections, and performs high-quality speech speed conversion based on the speech speed magnification set by the listener while preserving voice pitch and individuality. The magnification can be changed in units of a variable block length of, for example, about 150 ms. Speech speed conversion algorithms include a method that stretches the speech uniformly and a method that adaptively varies the stretch ratio of each part to limit the overall amount of stretching; the present invention is applicable to either.

[0029] The speech speed conversion unit 100 comprises an A/D (analog-to-digital) conversion unit 1, a section dividing unit 2, a silent section extension unit 3, a basic period extraction unit 5, a basic period section dividing unit 6, a basic period section repetition unit 7, a synthesis unit 8, a speech speed setting unit 9, and a D/A (digital-to-analog) conversion unit 10. The A/D conversion unit 1 converts the analog input speech signal (speech input) to digital form (16 bits, 16 kHz sampling). The section dividing unit 2 divides the speech input digitized by the A/D conversion unit 1 into silent sections, unvoiced sections, and voiced sections.
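The text does not say how the section dividing unit tells the three kinds of section apart. A common textbook heuristic, shown here only as an assumed stand-in, uses short-time energy to find silence and the zero-crossing rate to separate unvoiced from voiced frames. The function name and both thresholds are hypothetical.

```python
def classify_frame(samples, energy_floor=1e-4, zcr_split=0.25):
    """Label one frame as 'silent', 'unvoiced', or 'voiced'.

    samples: frame of at least two samples scaled to roughly [-1, 1].
    energy_floor, zcr_split: illustrative thresholds, not from the source.
    """
    n = len(samples)
    energy = sum(s * s for s in samples) / n
    if energy < energy_floor:
        return "silent"
    # Count sign changes between neighbouring samples.
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a < 0) != (b < 0))
    zcr = crossings / (n - 1)
    # Noise-like (unvoiced) speech changes sign far more often.
    return "unvoiced" if zcr > zcr_split else "voiced"


# 10 ms frames at the 16 kHz rate mentioned in the text (160 samples).
silence = [0.0] * 160
slow_wave = [((i % 80) - 40) / 40 for i in range(160)]  # 200 Hz sawtooth
fast_wave = [0.1 if i % 2 == 0 else -0.1 for i in range(160)]
labels = [classify_frame(f) for f in (silence, slow_wave, fast_wave)]
```

Real speech would need smoothing across frames and tuned thresholds; the point of the sketch is only the three-way split that the later stages rely on.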

[0030] The silent section extension unit 3 extends the silent sections produced by the section dividing unit 2 according to the silent section extension magnification supplied by the speech speed setting unit 9, which the user sets, and thereby controls the pauses in the speech.

[0031] To preserve the speaker's individuality and phonemic character, the unvoiced sections 4 are not processed.

[0032] The basic period extraction unit 5 extracts the basic period of each voiced section produced by the section dividing unit 2. The basic period section dividing unit 6 divides the voiced section into individual basic periods according to the basic period extracted by the basic period extraction unit 5. The basic period section repetition unit 7 repeats the basic period sections produced by the basic period section dividing unit 6 according to the voiced section extension magnification supplied by the speech speed setting unit 9, which the user sets, and thereby lengthens the voiced section.
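The extraction and division steps can be illustrated with a plain autocorrelation search. The source does not name the extraction method, so autocorrelation is an assumption here, as are the function names and the lag range.

```python
import math


def estimate_period(samples, min_lag, max_lag):
    """Return the lag (in samples) with the largest raw autocorrelation.

    The sum is not normalized, which slightly favours shorter lags and
    helps avoid picking a multiple of the true period.
    """
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def split_into_periods(samples, period):
    """Cut a voiced section into consecutive basic-period sections."""
    return [samples[i:i + period] for i in range(0, len(samples), period)]


# A 160 Hz tone sampled at 16 kHz has a period of 100 samples.
tone = [math.sin(2 * math.pi * i / 100) for i in range(640)]
period = estimate_period(tone, 40, 200)
sections = split_into_periods(tone, period)
```

The resulting sections are what a repetition stage would duplicate to lengthen the voiced section. Real speech drifts in pitch, so a practical version would re-estimate the period along the section rather than once.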

[0033] The speech speed setting unit 9 is an operating unit used to set the extension magnification for silent sections and the extension magnification for the basic period sections of voiced sections, according to the speaker's speaking speed and the listener's hearing ability.

[0034] The synthesis unit 8 combines the silent sections extended by the silent section extension unit 3, the unprocessed unvoiced sections 4, and the voiced sections extended by the basic period section repetition unit 7. At the same time, for each variable-length block on which speech speed conversion is performed, the synthesis unit 8 detects how much the converted speech has been extended (or shortened) relative to the original speech. Specifically, the synthesis unit 8 sequentially stores in its internal memory the analysis section and time length of the original speech (the original speech time code) transferred from the section dividing unit 2, compares this with the time length of the converted speech for the same analysis section (the speech speed converted speech time code), and determines the time difference between the two (the speech speed extension/shortening code). The synthesis unit 8 then transfers this speech speed extension/shortening code to the field interpolation unit 15 of the image speed conversion unit 200, described later.
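The bookkeeping performed by the synthesis unit 8 can be sketched as follows. The `TimeCode` record and the `extension_code` function are hypothetical names introduced only for illustration. The sketch assumes that a time code is simply a block index paired with a duration in seconds, which is one plausible reading of "analysis section and time length".

```python
from dataclasses import dataclass


@dataclass
class TimeCode:
    """A block index paired with a duration in seconds (assumed layout)."""
    block_index: int
    duration: float


def extension_code(original, converted):
    """Return the extension/shortening code for one block.

    The result carries the same block index and the signed difference
    (converted - original): positive means the speech was extended,
    negative means it was shortened.
    """
    if original.block_index != converted.block_index:
        raise ValueError("time codes refer to different blocks")
    return TimeCode(original.block_index,
                    converted.duration - original.duration)
```

For example, a block lasting 0.10 s in the original speech and 0.15 s after conversion yields a code of about +0.05 s for that block.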

[0035] The speech signal synthesized by the synthesis unit 8 is converted into an analog signal by the D/A conversion unit 10 and output to a speaker (not shown) as speech speed converted speech, which is played back at the listener's preferred rate, matched to the listener's listening ability.

[0036] [2] Image Speed Conversion Unit 200

To enable high-quality speed conversion, the image speed conversion unit 200 is built on a conversion algorithm that detects the motion vectors of the video and interpolates accordingly. Using the speech speed extension/shortening code transferred from the synthesis unit 8 of the speech speed conversion unit 100, it can interpolate video at arbitrary time positions (an arbitrary number of fields) within a variable time length, in synchronization with the speech speed converted speech, so that the video speed varies from moment to moment. This image speed conversion makes it possible to synchronize the speech speed converted speech with the video (lip sync).

[0037] As an example, the image speed conversion unit 200 comprises an A/D conversion unit 11, a preprocessing unit 12, a motion detection unit 13, a vector detection and allocation unit 14, a field interpolation unit 15, and a D/A conversion unit 16. The preprocessing unit 12, the motion detection unit 13, and the vector detection and allocation unit 14 together constitute the motion vector detection system, and the field interpolation unit 15 constitutes the interpolation system.

[0038] The input video (an analog video signal) is A/D converted by the A/D conversion unit 11, for example in RGB 4:4:4 format. The preprocessing unit 12 then applies preprocessing for motion vector detection and allocation to the digitized video signal. For example, the preprocessing unit 12 converts the video signal from interlaced 1125/60/2:1 to non-interlaced 652/60/1:1. The preprocessed video signal is sent to the motion detection unit 13 and to the vector detection and allocation unit 14.

[0039] The motion detection unit 13 detects motion vectors between fields of the video by, for example, an iterative gradient method (maximum number of iterations: 2; block size: 8 pixels by 8 lines) that uses initial displacement vectors based on the gradient method (8 candidate vectors; block size: 8 pixels by 8 lines).

[0040] In order to newly interpolate fields whose timing differs from that of the input images, the vector detection and allocation unit 14 assigns the motion vectors detected from the input video signal by the motion detection unit 13 to the fields that are to be newly interpolated.

[0041] The field interpolation unit 15 interpolates fields at arbitrary time positions into the input video sent from the A/D conversion unit 11, based on the speech speed extension/shortening code transferred from the synthesis unit 8 of the speech speed conversion unit 100. Specifically, when the amount of extension (or shortening) indicated by the speech speed extension/shortening code exceeds, for example, 1/60 second (one field), the field interpolation unit 15 interpolates fields at arbitrary time positions corresponding to that amount, based on the motion vectors of the video.

[0042] The video signal interpolated by the field interpolation unit 15 is output to the D/A conversion unit 16. By synchronizing the D/A conversion unit 10 of the speech speed conversion unit 100 with the D/A conversion unit 16 of the image speed conversion unit 200, the speech speed converted speech from the D/A conversion unit 10 and the image speed converted video from the D/A conversion unit 16 are output in synchronization.

[0043] FIG. 2 shows an example of the simultaneous speech speed/image speed conversion operation according to the present invention, as performed by the simultaneous speech speed/image speed conversion system of FIG. 1.

[0044] When the original speech (speech input) shown in FIG. 2(a) is extended by the speech speed conversion unit 100 of FIG. 1, the result is speech in which the duration of each phoneme has changed, as shown in FIG. 2(b). That is, the unvoiced sections of the original speech are unchanged, while the durations of the voiced sections and the silent sections are each extended.
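The effect on section durations can be sketched at the level of labeled sections. The representation of a section as a `(kind, seconds)` pair and the function name `convert_durations` are assumptions made for illustration. The only behavior taken from the text is that silent and voiced sections are scaled by their own ratios while unvoiced sections are left untouched.

```python
def convert_durations(sections, silent_ratio, voiced_ratio):
    """Scale section durations the way the text describes.

    `sections` is a list of (kind, seconds) pairs where kind is one of
    "silent", "unvoiced", or "voiced". Unvoiced sections keep their
    original duration.
    """
    ratios = {
        "silent": silent_ratio,
        "voiced": voiced_ratio,
        "unvoiced": 1.0,  # never processed
    }
    return [(kind, seconds * ratios[kind]) for kind, seconds in sections]


def total_duration(sections):
    """Sum of the durations of all sections, in seconds."""
    return sum(seconds for _, seconds in sections)
```

Calling `total_duration` on the input and on the output of `convert_durations` gives the two time lengths whose difference indicates how far the video has to be stretched for the same block.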

[0045] An enlarged view of one voiced section of the speech speed converted speech in FIG. 2(b) gives the waveform shown in FIG. 2(d). The waveform of the corresponding section of the original speech is shown in FIG. 2(c). The waveform section illustrated here is treated as one block (the block length is variable and changes from moment to moment). FIGS. 2(c) and 2(d) show that the original speech is extended into the converted speech while its basic period is preserved.

[0046] Let T1 be the time length of this block of the original speech. T1, together with the section information indicating the ordinal position of the block, is called the original speech time code. Likewise, let T2 be the time length of the speech speed converted speech corresponding to this block. T2, together with the section information indicating the ordinal position of the block, is called the speech speed converted speech time code.

[0047] Here, (T2 − T1) is the time length by which the speech was extended or shortened by speech speed conversion. This time length, together with the section information above, is called the speech speed extension/shortening code. The synthesis unit 8 calculates (T2 − T1) to obtain the speech speed extension/shortening code. The field interpolation unit 15 quantizes this code in units of 1/60 second (one field) and calculates the number of video fields corresponding to the original speech shown in FIG. 2(c) and the number of video fields corresponding to the speech speed converted speech shown in FIG. 2(d). These results correspond to the number of original fields shown in FIG. 2(e) and the number of fields that must be created by image speed conversion shown in FIG. 2(g). FIG. 2(f) shows the interpolation positions, which are described in detail below with reference to FIG. 3.
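One way to carry out this quantization is sketched below. It assumes that block lengths are measured in audio samples at 48 kHz, so that one 1/60-second field corresponds to 800 samples, and it counts field boundaries on running totals so that fractions of a field carry over from one block to the next. The sample rate, the carry-over rule, and the name `field_counts` are assumptions made for illustration. The text itself only states that the code is quantized in units of one field.

```python
def field_counts(blocks, sample_rate=48000, field_rate=60):
    """Return (M, N) for each block.

    `blocks` is a sequence of (t1, t2) pairs: the block length in samples
    before and after speech speed conversion. M is the number of original
    fields spanned by the block and N the number of fields to create.
    """
    samples_per_field = sample_rate // field_rate
    counts = []
    orig_pos = 0  # running position in the original timeline (samples)
    conv_pos = 0  # running position in the converted timeline (samples)
    for t1, t2 in blocks:
        # Count how many field boundaries each timeline crosses in this block.
        m = (orig_pos + t1) // samples_per_field - orig_pos // samples_per_field
        n = (conv_pos + t2) // samples_per_field - conv_pos // samples_per_field
        counts.append((m, n))
        orig_pos += t1
        conv_pos += t2
    return counts
```

With blocks of (4800, 5600) and (2400, 3200) samples this gives (6, 7) followed by (3, 4). Counting on running totals means a block extended by less than one field adds no field by itself, and an extra field appears only once the accumulated extension exceeds one field time.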

[0048] FIG. 3 shows how the field interpolation positions are determined within one variable-length block according to the present invention.

[0049] Let M be the number of fields in the original video and N the number of fields to be created by image speed conversion. FIG. 3(a) shows the case M ≥ N, and FIG. 3(b) shows the case M < N. The interpolation positions are determined in the same way in both cases, by the following procedure.

[0050] 1) Let the fields of the original video be $O_1, O_2, \ldots, O_i, O_{i+1}, \ldots, O_M$.

[0051] 2) Let the fields to be created by image speed conversion be $C_1, C_2, \ldots, C_j, C_{j+1}, \ldots, C_N$.

[0052] 3) For each frame $C_j$, let $i_{opt}$ be the value of $i$ that satisfies the following Equation (1).

[0053]

(Equation 1)

[0054] The frame $C_j$ is then regarded as lying between the input frames $O_{i_{opt}-1}$ and $O_{i_{opt}}$.

[0055] The interpolation position $P_j$, measured from $O_{i_{opt}-1}$, is then calculated by the following Equation (2).

[0056]

(Equation 2)

[0057] 4) The above interpolation processing is performed for each variable block (the time length differs from block to block) to carry out the image speed conversion.
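Equations (1) and (2) are not reproduced in this text, so the following is only a sketch of one plausible placement rule, not the patent's formulas. It assumes that the N created fields are spaced uniformly across the time span of the M original fields, so that $C_j$ falls at time $(j-1)M/N$ measured in original field periods. The function names are hypothetical. The linear blend is included because the claims mention linear interpolation for video with little motion, and it does not model motion-compensated interpolation.

```python
from fractions import Fraction
import math


def interpolation_positions(m, n):
    """Place n created fields over m original fields (assumed uniform spacing).

    Returns one (i_prev, p) pair per created field C_j, j = 1..n, where
    i_prev is the 1-based index of the original field at or just before
    C_j and p in [0, 1) is the offset from that field toward the next one.
    In the notation of the text, i_prev would play the role of i_opt - 1.
    """
    positions = []
    for j in range(1, n + 1):
        t = Fraction((j - 1) * m, n)  # time of C_j in original field periods
        base = math.floor(t)
        positions.append((base + 1, t - base))
    return positions


def blend_fields(prev_field, next_field, p):
    """Linearly interpolate two fields given as flat lists of pixel values."""
    p = float(p)
    return [(1 - p) * a + p * b for a, b in zip(prev_field, next_field)]
```

For M = 6 and N = 7 the second created field lands 6/7 of the way from $O_1$ to $O_2$. Under this convention the last created fields of a block can fall after $O_M$, so their following neighbor would be the first original field of the next block.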

[0058] FIG. 4 shows an example of image speed conversion in which the speed changes from moment to moment in units of variable block length. In this example, the original video is first converted from 6 fields to 7 fields and output, and then converted from 3 fields to 4 fields and output.

[0059]

[Effects of the Invention] As described above, according to the present invention, for media accompanied by video (for example, television, video and audio recording media such as VTR and DVD, and video playback on a computer) and for medical equipment and the like, the speed of the video can also be converted with high quality while preserving its naturalness, in synchronization with speech speed converted speech whose speaking rate has been converted with high quality while preserving the pitch, individuality, and phonemic characteristics of the speaker's voice. The result can be fitted to the audiovisual characteristics of the viewer, which makes the content easier to watch and listen to.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a simultaneous speech speed/image speed conversion system according to an embodiment of the present invention.

FIG. 2 is a timing chart showing an example of the simultaneous speech speed/image speed conversion operation according to an embodiment of the present invention.

FIG. 3 is a timing chart showing the interpolation positions of video fields according to an embodiment of the present invention.

FIG. 4 is a timing chart showing an example of image speed conversion in which the speed changes from moment to moment in units of variable block length, according to an embodiment of the present invention.

[Explanation of symbols]

1, 11: A/D conversion unit
2: Section dividing unit
3: Silent section extension unit
5: Basic period extraction unit
6: Basic period section dividing unit
7: Basic period section repetition unit
8: Synthesis unit
9: Speech speed setting unit
10, 16: D/A conversion unit
12: Preprocessing unit
13: Motion detection unit
14: Vector detection and allocation unit
15: Field interpolation unit
100: Speech speed conversion unit
200: Image speed conversion unit

Continued from the front page: (72) Inventor: Yuji Nojiri, c/o Broadcasting Technology Research Laboratories, Japan Broadcasting Corporation, 1-10-11 Kinuta, Setagaya-ku, Tokyo

Claims

[Claims]

1. A simultaneous speech speed/image speed conversion system comprising: speech speed conversion means for extending or shortening input speech based on a speech speed magnification set by a listener while maintaining the pitch and individuality of the voice, outputting the result as speech speed converted speech, and detecting, in units of the variable-length blocks on which speech speed conversion is performed, how much time difference has arisen between the speech speed converted speech and the input speech; and image speed conversion means for detecting motion vectors between fields of input video and, when the time difference transferred from the speech speed conversion means exceeds the unit field time of the video, interpolating fields at arbitrary time positions corresponding to the time difference based on the motion vectors of the video, or, for video with little motion, linearly interpolating fields at arbitrary time positions, and outputting the result as image speed converted video.

2. The simultaneous speech speed/image speed conversion system according to claim 1, wherein the speech speed conversion means comprises: speech speed setting means for setting the speech speed magnification; section dividing means for dividing the input speech into silent sections, unvoiced sections, and voiced sections; silent section extension/shortening means for extending or shortening the silent sections divided by the section dividing means, based on the speech speed magnification set by the speech speed setting means; voiced section extension/shortening means for extending or shortening, in units of the basic period, the voiced sections divided by the section dividing means, based on the speech speed magnification set by the speech speed setting means; synthesis means for combining the silent sections extended or shortened by the silent section extension/shortening means, the voiced sections extended or shortened by the voiced section extension/shortening means, and the unvoiced sections divided by the section dividing means, and outputting the result as speech speed converted speech; and speech speed extension/shortening detection means for detecting, in units of the variable-length blocks on which speech speed conversion is performed, how much time difference has arisen between the speech speed converted speech and the input speech, which is the original speech, and outputting a speech speed extension/shortening code representing the time difference.

3. The simultaneous speech speed/image speed conversion system, wherein the image speed conversion means comprises: motion detection means for detecting motion vectors between fields of the input video; and field interpolation means for, when the amount of speech speed extension or shortening indicated by the speech speed extension/shortening code transferred from the speech speed extension/shortening detection means of the speech speed conversion means exceeds the unit field time of the video, interpolating fields at arbitrary time positions corresponding to that amount based on the motion vectors of the video detected by the motion detection means, or, for video with little motion, linearly interpolating fields at arbitrary time positions, and outputting the result as image speed converted video.

4. The simultaneous speech speed/image speed conversion system according to claim 3, wherein the field interpolation means, in synchronization with the speech speed converted speech whose speed changes from moment to moment in units of variable block length, independently determines the interpolation positions of the video for each variable block length unit and interpolates an arbitrary number of fields.

5. The simultaneous speech speed/image speed conversion system according to claim 2, wherein the voiced section extension/shortening means comprises: basic period extraction means for extracting the basic period of the voiced sections divided by the section dividing means; basic period section dividing means for dividing the voiced sections into individual basic periods according to the basic period extracted by the basic period extraction means; and basic period section repetition means for repeating the basic period sections divided by the basic period section dividing means according to the voiced-section extension ratio from the speech speed setting means, thereby extending the voiced sections.

6. A simultaneous speech speed/image speed conversion method comprising: extending or shortening input speech based on a speech speed magnification set by a listener while maintaining the pitch and individuality of the voice, and outputting the result as speech speed converted speech; detecting, in units of the variable-length blocks on which speech speed conversion is performed, how much time difference has arisen between the speech speed converted speech and the input speech; detecting motion vectors between fields of input video; and, when the time difference exceeds the unit field time of the video, interpolating fields at arbitrary time positions corresponding to the time difference based on the motion vectors of the video, or, for video with little motion, linearly interpolating fields at arbitrary time positions, and outputting the result as image speed converted video.

7. A simultaneous speech speed/image speed conversion method comprising: a speech speed setting step of setting a speech speed magnification; a section dividing step of dividing input speech into silent sections, unvoiced sections, and voiced sections; a silent section extension/shortening step of extending or shortening the silent sections divided in the section dividing step, based on the speech speed magnification set in the speech speed setting step; a voiced section extension/shortening step of extending or shortening, in units of the basic period, the voiced sections divided in the section dividing step, based on the speech speed magnification set in the speech speed setting step; a synthesis step of combining the silent sections extended or shortened in the silent section extension/shortening step, the voiced sections extended or shortened in the voiced section extension/shortening step, and the unvoiced sections divided in the section dividing step, and outputting the result as speech speed converted speech; a speech speed extension/shortening detection step of detecting, in units of the variable-length blocks on which speech speed conversion is performed, how much time difference has arisen between the speech speed converted speech and the input speech, which is the original speech, and outputting a speech speed extension/shortening code representing the time difference; a motion detection step of detecting motion vectors between fields of input video; and a field interpolation step of, when the amount of speech speed extension or shortening indicated by the speech speed extension/shortening code transferred from the speech speed extension/shortening detection step exceeds the unit field time of the video, interpolating fields at arbitrary time positions corresponding to that amount based on the motion vectors of the video detected in the motion detection step, or, for video with little motion, linearly interpolating fields at arbitrary time positions, and outputting the result as image speed converted video.

8. The simultaneous speech speed/image speed conversion method according to claim 7, wherein the field interpolation step, in synchronization with the speech speed converted speech whose speed changes from moment to moment in units of variable block length, independently determines the interpolation positions of the video for each variable block length unit and interpolates an arbitrary number of fields.

9. The simultaneous speech speed/image speed conversion method according to claim 7, wherein the voiced section extension/shortening step comprises: a basic period extraction step of extracting the basic period of the voiced sections divided in the section dividing step; a basic period section dividing step of dividing the voiced sections into individual basic periods according to the basic period extracted in the basic period extraction step; and a basic period section repetition step of repeating the basic period sections divided in the basic period section dividing step according to the voiced-section extension ratio from the speech speed setting step, thereby extending the voiced sections.

10. A recording medium on which is recorded a control program for performing simultaneous speech speed/image speed conversion by a computer, wherein the control program causes the computer to: extend or shorten input speech based on a speech speed magnification set by a listener while maintaining the pitch and individuality of the voice, and output the result as speech speed converted speech; detect, in units of the variable-length blocks on which speech speed conversion is performed, how much time difference has arisen between the speech speed converted speech and the input speech; detect motion vectors between fields of input video; and, when the time difference exceeds the unit field time of the video, interpolate fields at arbitrary time positions corresponding to the time difference based on the motion vectors of the video, or, for video with little motion, linearly interpolate fields at arbitrary time positions, and output the result as image speed converted video.

11. A recording medium on which is recorded a control program for performing simultaneous speech speed/image speed conversion by a computer, wherein the control program causes the computer to: set a speech speed magnification; divide input speech into silent sections, unvoiced sections, and voiced sections; extend or shorten the silent sections based on the speech speed magnification; extend or shorten the voiced sections in units of the basic period based on the speech speed magnification; combine the extended or shortened silent sections and voiced sections with the unprocessed unvoiced sections and output the result as speech speed converted speech; detect, in units of the variable-length blocks on which speech speed conversion is performed, how much time difference has arisen between the speech speed converted speech and the input speech, which is the original speech, and output a speech speed extension/shortening code representing the time difference; detect motion vectors between fields of input video; and, when the amount of speech speed extension or shortening indicated by the speech speed extension/shortening code exceeds the unit field time of the video, interpolate fields at arbitrary time positions corresponding to that amount based on the detected motion vectors of the video, or, for video with little motion, linearly interpolate fields at arbitrary time positions, and output the result as image speed converted video.