JPH08212190A

JPH08212190A - Support device for production of multimedia data

Info

Publication number: JPH08212190A
Application number: JP1791495A
Authority: JP
Inventors: Genichiro Kikui; 玄一郎菊井; Masanobu Higashida; 正信東田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1995-02-06
Filing date: 1995-02-06
Publication date: 1996-08-20

Abstract

PURPOSE: To provide a multimedia data creation support apparatus that automatically and efficiently associates a text with a speech signal in units of, for example, sentences or phrases delimited by pauses. CONSTITUTION: A speech feature prediction unit 1 performs language processing on the text data and predicts the sequence of times at which silent sections of at least a fixed length alternate with other sections in the speech that would be obtained if the text data were read aloud. A speech signal analysis unit 2 processes the speech signal and extracts from it the sequence of times at which silent sections of at least a fixed length alternate with other sections. A matching unit 3 associates the two output sequences, and the resulting sequence matching result data is output through an output unit 4.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a multimedia data creation support apparatus that supports the creation of multimedia data integrating speech information and text information. More particularly, it relates to a multimedia data creation support apparatus that automatically synchronizes moving image data synchronized with speech, or speech-only data, with the corresponding text.

[0002]

[Prior Art] Conventionally, no process for automatically synchronizing audio-video data with scenario text data has been put into practical use. This synchronization therefore had to be performed manually.

[0003] At the research level, an apparatus has been proposed that associates the audio-video data of a news program with the program's item table (news item table). This method uses two techniques, "image recognition" and "speaker identification", to divide a continuous stretch of audio-video data into as many parts as there are items in the item table, and associates each part with the corresponding item in the table.

[0004] There is also a technique called "automatic labeling of speech data", in which a speech signal and its corresponding phoneme sequence are given and the correspondence between them is determined automatically.

[0005]

[Problems to Be Solved by the Invention] As described above, manually synchronizing audio-video data with scenario text means comparing the video data and the scenario text piece by piece and visually identifying which part of the latter corresponds to which part of the former. This requires a great deal of labor and time.

[0006] An apparatus using "image recognition" and "speaker identification", on the other hand, depends on those two techniques, and the former in particular is still under development; problems remain to be solved before it can be made general-purpose. Even if these problems were solved, the smallest unit of association is the news item, so no finer-grained association is possible.

[0007] Furthermore, because the unit of association in automatic labeling of speech data is basically the phoneme, techniques have been studied only for data up to the length of a word or, at most, a sentence. For texts of several to several tens of sentences, such techniques may not work or may exceed a practical amount of computation. Moreover, phoneme-level association is finer than necessary for supporting the creation of multimedia data.

[0008] The present invention has been made in view of the above. Its object is to provide a multimedia data creation support apparatus that automatically and efficiently associates a text with a speech signal in units of, for example, sentences or phrases delimited by pauses.

[0009]

[Means for Solving the Problems] To achieve the above object, the multimedia data creation support apparatus of the present invention synchronizes and associates a text with a speech signal for that text, and comprises: prediction means that performs language processing on the text to predict the sequence of times at which silent sections of at least a certain length alternate with other sections in the speech that would be obtained if the text were read aloud; extraction means that processes the speech signal to extract the sequence of times at which silent sections of at least a certain length alternate with other sections in the speech signal; and matching means that associates the output sequence data of the prediction means with the output sequence data of the extraction means and creates sequence matching result data.

[0010] In the multimedia data creation support apparatus of the present invention, the matching means further includes means for associating the silent sections and other sections in the sequence predicted by the prediction means with the silent sections and other sections in the sequence extracted by the extraction means such that an evaluation value is maximized.

[0011] Furthermore, in the multimedia data creation support apparatus of the present invention, the evaluation value is larger when sections of the same type correspond to each other than when sections of different types correspond, and is larger the smaller the variation in the ratio of the lengths of corresponding sections.

[0012] In the multimedia data creation support apparatus of the present invention, the audio signal in an audio-video signal serves as the speech signal, and the scenario text for the audio-video serves as the text.

[0013]

[Operation] In the multimedia data creation support apparatus of the present invention, the prediction means performs language processing on the text and predicts the sequence of times at which silent sections of at least a certain length alternate with other sections in the speech that would be obtained if the text were read aloud. The extraction means processes the speech signal and extracts the sequence of times at which silent sections of at least a certain length alternate with other sections in the speech signal. The matching means associates the two output sequences and creates sequence matching result data.

[0014] In the multimedia data creation support apparatus of the present invention, the matching means associates each section in the sequence predicted by the prediction means with the sections in the sequence extracted by the extraction means such that the evaluation value is maximized.

[0015] Furthermore, in the multimedia data creation support apparatus of the present invention, the evaluation value is larger when sections of the same type correspond to each other than when sections of different types correspond, and is larger the smaller the variation in the ratio of the lengths of corresponding sections.

[0016] In the multimedia data creation support apparatus of the present invention, the audio signal in an audio-video signal serves as the speech signal, and the scenario text for the audio-video serves as the text.

[0017]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings.

[0018] FIG. 1 is a block diagram showing the configuration of one embodiment of the multimedia data creation support apparatus of the present invention. In the figure, the speech feature prediction unit 1 constitutes the prediction means of the present invention and consists of a pronunciation information creation unit 1a and a sequence data creation unit 1b. By analyzing the input text, which is written in mixed kanji and kana, it determines the phoneme sequence, pauses, and so on that would result from speaking the text, and uses these to create the sequence of alternations between "silent sections of at least a certain length" and "other sections".

[0019] The pronunciation information creation unit 1a analyzes mixed kanji-kana sentences and attaches the information needed for pronunciation, such as "reading", "pause", and "accent". Because such a unit is always included in a speech synthesis system that takes text as input, a detailed description is omitted. In this embodiment, only the "reading" and "pause" information from its output is passed to the sequence data creation unit 1b.

[0020] Based on the "reading" and "pause" information, the sequence data creation unit 1b outputs the sequence of times at which voiced and silent sections alternate. This sequence is called sequence A below. The result is output as two arrays, Ab(i) and Ae(i) (1 ≤ i ≤ n), where n is the number of predicted voiced sections, Ab(i) holds the start time of the i-th voiced section (Ab(1) = 0), and Ae(i) holds the end time of the i-th voiced section. Although this embodiment uses seconds as the unit of time, another unit such as the number of morae may be used instead. In addition, the sequence data creation unit 1b extracts the portion of the input text corresponding to the i-th section and writes it to a predetermined storage area D(i).

[0021] The speech signal analysis unit 2 constitutes the extraction means of the present invention. It receives a digitized speech signal as input and, using signal processing techniques, identifies m = n + l silent sections in the signal in descending order of duration (l being an arbitrary number equal to at least 20% of n). It thereby outputs the sequence of times at which voiced and silent sections alternate. This sequence is called sequence B below.
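The selection of the longest silent sections can be sketched as follows. This is a minimal illustration under stated assumptions, not the apparatus's actual procedure: the function names `num_silences` and `voiced_sections` are hypothetical, the candidate silences are assumed to be non-overlapping and to lie strictly inside the recording, and l is taken at its minimum of 20% of n, rounded up. Shorter silences that are not selected simply become part of the surrounding voiced section.

```python
from typing import List, Tuple


def num_silences(n: int) -> int:
    """Smallest m = n + l with l at least 20% of n (integer ceiling of n / 5)."""
    return n + (n + 4) // 5


def voiced_sections(
    silences: List[Tuple[float, float]], total: float, k: int
) -> Tuple[List[float], List[float]]:
    """Keep the k longest silences and return (Bb, Be) for the sections between them.

    Assumption: silences are (start, end) pairs, non-overlapping, inside (0, total).
    With k internal silences kept, k + 1 voiced sections are returned.
    """
    longest = sorted(silences, key=lambda s: s[1] - s[0], reverse=True)[:k]
    longest.sort()  # restore time order
    bb: List[float] = []
    be: List[float] = []
    start = 0.0  # the first voiced section starts at time 0
    for s_begin, s_end in longest:
        bb.append(start)
        be.append(s_begin)
        start = s_end
    bb.append(start)
    be.append(total)
    return bb, be
```

For example, with candidate silences at (1.0, 1.6), (2.0, 2.1), (3.0, 3.9), and (5.0, 5.3) in a 6-second signal, keeping the two longest leaves three voiced sections.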

[0022] Like sequence A, sequence B is output as two arrays, Bb(i) and Be(i) (1 ≤ i ≤ m), where m is the number of extracted voiced sections and Bb(i) and Be(i) hold the start time and end time of the i-th voiced section, respectively. The start time of the first voiced section is taken to be 0, that is, Bb(1) = 0.

[0023] Creating such a sequence from a speech signal is easy to implement using a process that identifies voiced sections in digitized speech signal data from the power value and the zero-crossing count (see, for example, Reference 1: Yasunaga Niimi, "Speech Recognition", Information Science Course, published 1979).
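A frame-by-frame computation of power and zero-crossing count might look like the sketch below. It is a generic illustration and not the procedure of the cited reference: the function names, the fixed frame length, and the two thresholds are all assumptions chosen for the example.

```python
from typing import List, Sequence, Tuple


def frame_features(samples: Sequence[float], frame_len: int) -> List[Tuple[float, int]]:
    """Return (mean power, zero-crossing count) for each complete frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # A zero crossing is a sign change between neighbouring samples.
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        feats.append((power, crossings))
    return feats


def is_sound(power: float, crossings: int, power_thresh: float, zc_thresh: int) -> bool:
    """Classify a frame as sound if it is loud or has many zero crossings.

    Assumption: low-power frames with many crossings are treated as
    unvoiced consonants rather than silence; both thresholds are illustrative.
    """
    return power >= power_thresh or crossings >= zc_thresh
```

Runs of consecutive frames classified as silence would then be merged into candidate silent sections, with their durations given by the frame count times the frame length.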

[0024] The matching unit 3 constitutes the matching means. It determines which of the voiced sections output by the speech signal analysis unit 2 each voiced section output by the speech feature prediction unit 1 corresponds to, such that the "evaluation value for goodness of correspondence" described next is maximized.

[0025] The "evaluation value for goodness of correspondence" is a value that combines the following two evaluation criteria.

[0026] (1) A correspondence between sections of the same type, that is, voiced with voiced or silent with silent, is better than a correspondence between sections of different types.

[0027] (2) The ratio between the length of a section and the length of its corresponding section should be constant whichever section is taken; that is, the smaller its variation, the better.

[0028] In this embodiment, the correspondence between the voiced sections of sequence A and those of sequence B is expressed by a function q(i). As shown in FIG. 11, the value of q(i) is the number of the first of the voiced sections of sequence B that correspond to the i-th voiced section of sequence A. The number of the last voiced section of sequence B corresponding to the i-th voiced section of sequence A is q(i+1) - 1. For example, when q(i) = j, the start time Ab(i) of the i-th voiced section of sequence A corresponds to the start time Bb(j) of the j-th voiced section of sequence B. Likewise, when q(i+1) = k, the end time Ae(i) of the i-th voiced section of sequence A corresponds to Be(k-1).

[0029] Note that q(i) must satisfy the following constraint, which ensures that the order of corresponding elements is never reversed.

[0030] if i < j then q(i) ≤ q(j)   ... (1)

The evaluation value in this embodiment is as follows.

[0031]

[Equation 1]

Each of the above equations is explained below.

[0032] The first term of equation (2) is the evaluation value for the degree of correspondence over the span from the start of sequence A to the start time of the n-th voiced section. It is computed by treating one voiced section of sequence A together with the silent section immediately following it as a unit section. The i-th unit section runs from Ab(i) to Ab(i+1), and the corresponding section of sequence B runs from Bb(q(i)) to Bb(q(i+1)). The evaluation value for one unit section is defined by equation (3), which takes i, q(i), and q(i+1) as arguments.

[0033] Equation (3) defines the evaluation value between the section of sequence A from Ab(i) to Ab(i+1) and the corresponding section of sequence B, that is, the section from Bb(j1) to Bb(j2) with j1 = q(i) and j2 = q(i+1). Its first term is the evaluation value for the length of the part of sequence B that corresponds to the i-th silent section of sequence A, that is, the span from Be(j2-1) to Bb(j2); as equation (5) shows, longer is better. Its second term is the degree of correspondence for the i-th voiced section of sequence A. Here, the "degree of correspondence for the i-th section of sequence A" is a measure of how well the length of the i-th section of sequence A matches the length of the corresponding section of sequence B. With x the length of section i in sequence A and y the length of the corresponding section of sequence B, it is defined by equation (4). This measure computes how far the ratio of the two section lengths is from the ratio of the total lengths of sequences A and B, and attaches a negative sign to the result.

[0034] The second term of equation (2) computes the evaluation value for the correspondence of the n-th voiced section of sequence A. Unlike the preceding voiced sections, the silent section immediately after the n-th voiced section is not a pause between sentences, so, as the equation shows, only the voiced part is evaluated as an exception.
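The descriptions in paragraphs [0032] to [0034] suggest formulas of roughly the following shape. This is a hedged reconstruction from the prose alone, not the published Equation 1. The names E, g, c, L_A, and L_B are introduced here for illustration; the use of an absolute value in (4), the linear form of (5), and the use of Be(m) as the B-side end of the final section are assumptions.

```latex
\begin{align}
E(q) &= \sum_{i=1}^{n-1} h\bigl(i,\, q(i),\, q(i+1)\bigr)
        + f\bigl(A_e(n) - A_b(n),\; B_e(m) - B_b(q(n))\bigr) \tag{2} \\
h(i, j_1, j_2) &= g\bigl(B_b(j_2) - B_e(j_2 - 1)\bigr)
        + f\bigl(A_e(i) - A_b(i),\; B_e(j_2 - 1) - B_b(j_1)\bigr) \tag{3} \\
f(x, y) &= -\left| \frac{y}{x} - \frac{L_B}{L_A} \right| \tag{4} \\
g(z) &= c\, z, \qquad c > 0 \tag{5}
\end{align}
```

Here L_A and L_B stand for the total lengths of sequences A and B, so that f is zero when a pair of sections has the same length ratio as the sequences as a whole and negative otherwise, while g rewards matching a predicted long pause to a long observed silence.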

[0035] The output unit 4 combines the correspondence q(i) between the two sequences obtained by the matching unit 3 with the "sequence-to-text correspondence data" passed from the speech feature prediction unit 1, and outputs the "correspondence between the text and the speech signal".

[0036] A concrete example follows.

[0037] Suppose that scenario text such as that shown in FIG. 2 is input. FIG. 3 shows the result of processing this text with the pronunciation information creation unit 1a and retaining, in addition to the original text, only the "reading" and "pause" information. This data is a sequence of records, each consisting of a "notation", a "reading (phoneme sequence)", and one of the letters L, M, S, or N representing a pause. The pause field gives the type of pause to be inserted immediately after the "reading" is uttered; L, M, S, and N stand for long pause, medium pause, short pause, and no pause, respectively.

[0038] Next, the processing of the sequence data creation unit 1b is described using the flowchart in FIG. 4. First, in step S40, variables are initialized. i is a counter holding the number of the voiced section to which the current input string belongs; its initial value is 1. t is a variable holding the time taken to utter the strings given so far. LP, MP, and SP are constants specifying how many morae the long, medium, and short pauses correspond to, respectively, and ML is a constant specifying how many time units one mora corresponds to. Note that the values of LP, MP, SP, and ML in the figure are only examples.

[0039] After initialization, step S41 checks whether the pronunciation information creation unit 1a has supplied a next record. If not, the text is judged to have ended and processing terminates. If a next record exists, it is read (step S42).

[0040] Step S43 determines whether a new voiced section is starting by checking the value of the variable new. If so, the value of t is assigned to Ab(i), the array element holding the start time of the i-th voiced section (step S44). The value of new is cleared at the same time.

[0041] In step S45, the number of morae is calculated from the reading field of the input record. The number of morae is the number of characters in this field, excluding "゛" (voiced sound mark) and "゜" (semi-voiced and nasal sound mark). The characters in the notation field are also appended to element D(i).
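The mora-counting rule of step S45 can be sketched as below. The function name `count_morae` is hypothetical, and the sketch assumes the reading field stores the voicing marks as separate characters, as the rule implies; the combining forms of the marks are skipped as well, which is an addition not stated in the text.

```python
# Standalone voiced / semi-voiced sound marks, plus their combining forms
# (skipping the combining forms is an assumption beyond the stated rule).
VOICING_MARKS = {"\u309B", "\u309C", "\u3099", "\u309A"}


def count_morae(reading: str) -> int:
    """Count the characters of a kana reading, ignoring voicing marks."""
    return sum(1 for ch in reading if ch not in VOICING_MARKS)
```

Under this rule a reading written as five characters, one of which is a voicing mark, counts as four morae. Small kana are counted as characters in their own right, which follows the stated rule literally.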

[0042] Step S46 checks whether the input character is the letter L; if so, processing moves to step S47. In step S47, the value of t is written to Ae(i), which represents the end time of the i-th voiced section; then i is incremented and the time required for a long pause is added to t. Because the i-th voiced section ends at this point and a new voiced section begins with the next record, new is set to 1.

[0043] If the input character is M or S (step S48 or S50), only the duration of a medium pause or a short pause, respectively, is added to t (step S49 or S51).

[0044] Running this process on a record sequence such as that in FIG. 3 yields the values of arrays Ab(i), Ae(i), and D(i) shown in FIG. 5.
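Steps S40 to S51 can be sketched as the loop below. It is a minimal illustration under stated assumptions: the values of LP, MP, SP, and ML are invented for the example (the values in FIG. 4 are not reproduced here), the function name `build_sequence_a` and the 0-based lists are choices of this sketch, and closing a section that is still open when the records run out is an added safeguard the flowchart description does not mention.

```python
from typing import List, Tuple

LP, MP, SP = 4, 2, 1  # pause lengths in morae (illustrative values only)
ML = 0.125            # seconds per mora (illustrative value only)
VOICING_MARKS = {"\u309B", "\u309C"}

Record = Tuple[str, str, str]  # (notation, reading, pause code L/M/S/N)


def build_sequence_a(records: List[Record]) -> Tuple[List[float], List[float], List[str]]:
    """Return (Ab, Ae, D): start times, end times, and text of each voiced section."""
    ab: List[float] = []
    ae: List[float] = []
    d: List[str] = []
    t = 0.0             # S40: elapsed utterance time
    new_section = True  # S40: the first record opens the first section
    for notation, reading, pause in records:  # S41/S42: read the next record
        if new_section:                        # S43/S44: record the start time
            ab.append(t)
            d.append("")
            new_section = False
        morae = sum(1 for ch in reading if ch not in VOICING_MARKS)  # S45
        t += morae * ML
        d[-1] += notation
        if pause == "L":                       # S46/S47: long pause ends the section
            ae.append(t)
            t += LP * ML
            new_section = True
        elif pause == "M":                     # S48/S49
            t += MP * ML
        elif pause == "S":                     # S50/S51
            t += SP * ML
    if len(ae) < len(ab):  # assumption: close a section left open at end of text
        ae.append(t)
    return ab, ae, d
```

Only the long pause closes a voiced section; medium and short pauses advance the clock but stay inside the current section, which matches the treatment of "silent sections of at least a certain length" in the text.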

[0045] The speech signal analysis unit 2 of FIG. 1 takes a speech signal such as that in FIG. 6 as input, identifies voiced sections by simple signal analysis, and outputs Bb(j) and Be(j) (0 ≤ j ≤ m) as shown in FIG. 7.

[0046] FIG. 8 is a flowchart of the processing performed by the matching unit 3 of FIG. 1. The method shown applies a matching algorithm based on dynamic programming and can efficiently find the q(i) (1 ≤ i ≤ n) that represents the optimal correspondence.

[0047] The basic idea is that "the correspondence obtained by taking the first i sections of sequence A and the first j sections of sequence B and finding their optimal correspondence is a part of the correspondence that optimally matches sequences A and B as a whole". This idea holds when the formula for the evaluation value of a correspondence between two sequences satisfies a certain condition, and the evaluation value of this embodiment satisfies that condition.

[0048] First, various variables are initialized in step S80. p(i, j) is the evaluation value obtained when the span from the start up to Ab(i) of the sequence from the speech feature prediction unit 1 and the span from the start up to Bb(j) of the sequence from the speech signal analysis unit 2 are optimally matched.

[0049] Next, in step S81, the evaluation value is computed for the correspondence between the span of sequence A from its start (Ab(1)) to Ab(2) and the span of sequence B from its start to Bb(j). This calculation is performed for each j with 2 ≤ j ≤ m. Here, h(i, j, k) in the figure is equation (3) described above.

[0050] In step S82, p(i, j) is computed while varying i and j. By the principle of dynamic programming, p(i, j) is the maximum, as k ranges over q(i-1) ≤ k ≤ j, of the sum of (a) the value of p(i-1, k) and (b) the evaluation value between the section from Ab(i-1) to Ab(i) and the section from Bb(k) to Bb(j). The value of k at which this maximum is attained becomes q(i-1).

[0051] Step S83 is exception handling for the final section of the sequences. Step S82 treats the span from the start time of one voiced section to the start time of the next voiced section as one unit, but the last voiced section is not followed by a silent section, so the calculation here leaves the silent section out. The f that appears in this step is equation (4) described above.
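The dynamic-programming search of steps S80 to S83 can be sketched as follows. This is an illustration under stated assumptions rather than the flowchart of FIG. 8: the length-ratio term is taken to be the negated absolute deviation from the overall ratio, the silence term is taken to be linear with a hypothetical weight `silence_weight`, indices are 0-based, q is required to be strictly increasing so that every section of sequence A receives at least one section of sequence B, and m ≥ n is assumed.

```python
from typing import List


def align(ab: List[float], ae: List[float], bb: List[float], be: List[float],
          silence_weight: float = 1.0) -> List[int]:
    """Return q, where q[i] is the first B section matched to A section i (0-based)."""
    n, m = len(ab), len(bb)
    ratio = (be[-1] - bb[0]) / (ae[-1] - ab[0])  # overall length ratio B / A

    def f(x: float, y: float) -> float:
        # Assumed form of the length-ratio term: zero is best, negative otherwise.
        return -abs(y / x - ratio)

    def h(i: int, j1: int, j2: int) -> float:
        # Unit section i of A (voiced part plus following pause) against B[j1 .. j2-1].
        silence = bb[j2] - be[j2 - 1]
        return silence_weight * silence + f(ae[i] - ab[i], be[j2 - 1] - bb[j1])

    neg = float("-inf")
    p = [[neg] * m for _ in range(n)]    # p[i][j]: best score with q[i] = j
    back = [[-1] * m for _ in range(n)]  # back[i][j]: the q[i-1] that achieved it
    p[0][0] = 0.0
    for i in range(1, n):
        for j in range(i, m):
            for k in range(i - 1, j):
                if p[i - 1][k] == neg:
                    continue
                score = p[i - 1][k] + h(i - 1, k, j)
                if score > p[i][j]:
                    p[i][j], back[i][j] = score, k
    # Final section: only the voiced part is scored (the exception of step S83).
    best_j = max(
        (j for j in range(n - 1, m) if p[n - 1][j] > neg),
        key=lambda j: p[n - 1][j] + f(ae[-1] - ab[-1], be[-1] - bb[j]),
    )
    q = [0] * n
    q[-1] = best_j
    for i in range(n - 1, 0, -1):
        q[i - 1] = back[i][q[i]]
    return q
```

The triple loop costs on the order of n·m² evaluations of h, which stays small when the number of long silent sections per minute of speech is in the tens.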

[0052] FIG. 9 shows the result of this processing.

[0053] Using q(i) shown in FIG. 9, i and D(i) of FIG. 5, and Bb(i) and Be(i) of FIG. 7, the output unit 4 determines the start time and end time in the speech signal for each record of FIG. 5 as follows.

[0054] - For the i-th record of FIG. 5, other than the last record, the start time is Bb(q(i)) and the end time is Be(q(i+1)-1).

[0055] - For the last record of FIG. 9, the start time is Bb(q(n)) and the end time is Bb(m).
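The two rules above can be sketched as a small function. The name `section_times` and the 0-based indexing are choices of this sketch. One deliberate deviation is labeled in the code: the text gives Bb(m) as the end time of the last record, whereas the sketch uses the end of the last voiced section, Be(m), on the assumption that this is what is intended.

```python
from typing import List, Tuple


def section_times(q: List[int], bb: List[float], be: List[float],
                  d: List[str]) -> List[Tuple[str, float, float]]:
    """Return (text, start, end) in the speech signal for each section of sequence A."""
    n = len(q)
    out = []
    for i in range(n):
        start = bb[q[i]]
        if i < n - 1:
            end = be[q[i + 1] - 1]  # last B section before the next A section starts
        else:
            end = be[-1]            # assumption: Be(m) rather than the stated Bb(m)
        out.append((d[i], start, end))
    return out
```

Each tuple pairs one pause-delimited stretch of text with the interval of the speech signal in which it is spoken.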

[0056] FIG. 10 shows the result. It gives the text delimited by silent sections together with the start time and end time of the corresponding section of the speech signal, and shows that the speech signal and the text data have been efficiently associated.

[0057] Although this embodiment has been described for the case where the input data are a speech signal and text data, replacing the speech signal with the audio signal in an audio-video signal, and the text data with scenario text data for the audio-video, makes it equally possible to realize a multimedia data creation support apparatus that associates audio-video with its scenario text.

[0058]

[Effects of the Invention] As described above, according to the present invention, the sequence of times at which silent sections of at least a certain length alternate with other sections is predicted for the speech that would be obtained if the text were read aloud, the corresponding sequence of times is extracted from the speech signal, and the matching means associates the two output sequences. The matching means makes this association so as to maximize an evaluation value, which is set to be larger when sections of the same type correspond than when sections of different types correspond, and larger the smaller the variation in the ratio of the lengths of corresponding sections. The speech signal and the text data can therefore be associated automatically and efficiently, without manual work. The method is also efficient in terms of computation, because the number of silent sections of at least a certain length is only on the order of several tens per minute of speech.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of a multimedia data creation support apparatus according to an embodiment of the present invention.

FIG. 2 is a diagram showing an example of input text.

FIG. 3 is a diagram showing the result of processing the text of FIG. 2 with the pronunciation information creation unit used in the apparatus of FIG. 1 and retaining, in addition to the original text, only the reading and pause information.

FIG. 4 is a flowchart showing the processing of the sequence data creation unit used in the apparatus of FIG. 1.

FIG. 5 is a diagram showing the output of the speech feature prediction unit obtained from the processing shown in FIG. 4.

FIG. 6 is a diagram showing the speech signal input to the speech signal analysis unit used in the apparatus of FIG. 1.

FIG. 7 is a diagram showing the output from the speech signal analysis unit.

FIG. 8 is a flowchart showing the processing of the matching unit used in the apparatus of FIG. 1.

FIG. 9 is a diagram showing the output from the matching unit used in the apparatus of FIG. 1.

FIG. 10 is a diagram showing the output of the apparatus of FIG. 1.

FIG. 11 is an explanatory diagram showing the relationship between the start times and end times in sequences A and B.

[Explanation of symbols]

1: speech feature prediction unit
1a: pronunciation information creation unit
1b: sequence data creation unit
2: speech signal analysis unit
3: matching unit
4: output unit

Claims

[Claims]

1. A multimedia data creation support apparatus that synchronizes and associates a text with a speech signal for the text, comprising: prediction means for performing language processing on the text to predict the sequence of times at which silent sections of at least a certain length alternate with other sections in the speech that would be obtained if the text were read aloud; extraction means for processing the speech signal to extract the sequence of times at which silent sections of at least a certain length alternate with other sections in the speech signal; and matching means for associating the output sequence data of the prediction means with the output sequence data of the extraction means and creating sequence matching result data.

2. The multimedia data creation support apparatus according to claim 1, wherein the matching means includes means for associating the silent sections and other sections in the sequence predicted by the prediction means with the silent sections and other sections in the sequence extracted by the extraction means such that an evaluation value is maximized.

3. The multimedia data creation support apparatus according to claim 2, wherein the evaluation value is larger when sections of the same type correspond to each other than when sections of different types correspond, and is larger the smaller the variation in the ratio of the lengths of corresponding sections.

4. The multimedia data creation support apparatus, wherein the audio signal in an audio-video signal serves as the speech signal, and the scenario text for the audio-video serves as the text.