JPH02254500A

JPH02254500A - Vocalization speed estimating device

Info

Publication number: JPH02254500A
Application number: JP1077535A
Authority: JP
Inventors: Shin Kamiya; 伸神谷
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1989-03-29
Filing date: 1989-03-29
Publication date: 1990-10-15

Abstract

PURPOSE: To obtain a stable estimate of the average speech rate even when the input contains multiple stationary parts, as in a word, a phrase, or continuous speech, by calculating the estimate from a signal related to the average speech rate that is output by a speech-rate estimation neural network. CONSTITUTION: An input speech signal is A/D converted in a speech analysis section 11, and 16th-order cepstrum coefficients are derived for every frame. The cepstrum coefficients are input to a stationary-part identification neural network 13 through a delay section 12. The network 13 identifies whether the speech signal over a prescribed number of frames is a stationary part or a non-stationary part. Then, from the identification signals output by the network 13 in a number corresponding to a prescribed time, a speech-rate estimation neural network 15 and a speech rate calculation section 16 calculate an estimate of the average speech rate of the input speech at every prescribed time. In this way, even if multiple stationary parts exist, a stable estimate is obtained without being influenced by any individual stationary part.

Description

DETAILED DESCRIPTION OF THE INVENTION <Industrial Application Field> The present invention relates to an improvement of a speech rate estimation device used in speech recognition devices and the like.

<Prior Art> In a speech recognition device, the multiple word candidates obtained from word recognition are narrowed down to the candidates most likely to be correct as follows. When a word (or phrase) is uttered, the length of the speech interval can be determined relatively easily from parameters such as power and spectral change. If the average speech rate is known, dividing the speech interval length by the average syllable length, which is the reciprocal of the average speech rate, gives the number of syllables contained in that speech interval.

Once the number of syllables is known, the candidates can be narrowed down by selecting, from the multiple word candidates obtained, those whose syllable count equals the estimated count. Estimating the average speech rate is therefore important in speech recognition.
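The narrowing step described above can be sketched as follows. This is a minimal illustration, not the device's implementation: the function names, the candidate list, and the choice to round the quotient to the nearest integer are all assumptions made for the example.

```python
def estimate_syllable_count(interval_sec, rate_syllables_per_sec):
    """Divide the speech interval length by the average syllable length (1 / rate)."""
    average_syllable_length = 1.0 / rate_syllables_per_sec
    # Rounding to the nearest integer is an assumption of this sketch.
    return round(interval_sec / average_syllable_length)


def narrow_candidates(candidates, interval_sec, rate_syllables_per_sec):
    """Keep only the candidates whose syllable count matches the estimate."""
    target = estimate_syllable_count(interval_sec, rate_syllables_per_sec)
    return [word for word, count in candidates if count == target]


# Hypothetical word candidates as (word, syllable count) pairs.
candidates = [("sakura", 3), ("sakana", 3), ("san", 2), ("sayonara", 4)]
selected = narrow_candidates(candidates, interval_sec=0.6, rate_syllables_per_sec=5.0)
```

With a 0.6 s interval and a rate of 5 syllables per second, the estimate is three syllables, so only the three-syllable candidates remain.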

Previously, the present inventor proposed the method of estimating the average speech rate described below (Japanese Patent Laid-Open No. 59-61900). FIG. 5 is a schematic block diagram of the average speech rate estimation device used in that method. The speech signal input to the speech analysis section 1 is A/D converted, and feature parameters such as power and cepstrum coefficients are determined at a fixed frame length.

The stationary part detection section 2 takes a frame at which the spectral change (the difference in cepstrum coefficient values between frames several frames apart) reaches a local minimum and, within the interval of several frames before and after it, counts the consecutive similar frames (that is, the frames whose cepstrum coefficient values differ from those of the local-minimum frame by no more than a threshold). This number of consecutive similar frames is taken as the stationary part length. The speech rate estimation section 3 then refers to a pre-stored table relating stationary part length to speech rate and obtains the average speech rate from the stationary part length thus determined.
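As a rough sketch of this prior-art procedure, the code below counts similar frames around a given local-minimum frame and looks the result up in a table. The distance measure (sum of absolute differences), the search width, the table values, and all names are assumptions for illustration; finding the local-minimum frame itself is left out.

```python
def cepstral_distance(a, b):
    """Sum of absolute differences between two cepstrum vectors (assumed measure)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def stationary_length(frames, center, threshold, search=5):
    """Count consecutive frames around `center` that stay within `threshold` of it."""
    length = 1  # the center frame itself
    # Extend to the left while frames remain similar to the center frame.
    i = center - 1
    while i >= max(0, center - search) and cepstral_distance(frames[i], frames[center]) <= threshold:
        length += 1
        i -= 1
    # Extend to the right in the same way.
    i = center + 1
    while i <= min(len(frames) - 1, center + search) and cepstral_distance(frames[i], frames[center]) <= threshold:
        length += 1
        i += 1
    return length


def lookup_rate(length, table):
    """Return the rate of the first table row whose length bound covers `length`."""
    for max_length, rate in table:
        if length <= max_length:
            return rate
    return table[-1][1]


# Toy one-dimensional "cepstrum" frames and a hypothetical length-to-rate table.
frames = [[0.0], [1.0], [1.1], [1.05], [1.0], [3.0]]
table = [(3, 8.0), (6, 5.0), (10, 3.0)]  # (max stationary length in frames, rate)
length = stationary_length(frames, center=3, threshold=0.2)
rate = lookup_rate(length, table)
```

Because each stationary part yields its own table lookup, neighboring stationary parts can give different rates, which is the weakness the invention addresses.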

<Problems to be Solved by the Invention> As described above, in the average speech rate estimation device described above, the speech rate estimation section 3 outputs an estimated average speech rate for each stationary part.

However, a word consists of multiple syllables, and stationary parts exist in the vowels, nasals, and other sounds that form each syllable. The average speech rates estimated for adjacent stationary parts therefore vary, and there is the problem that it is difficult to estimate the average speech rate of the word as a whole.

Accordingly, an object of the present invention is to provide a speech rate estimation device that can stably estimate the average speech rate even when multiple stationary parts exist, as in a word, a phrase, or continuous speech.

<Means for Solving the Problems> To achieve the above object, the speech rate estimation device of the present invention comprises: a stationary-part identification neural network that receives a signal representing the feature parameters of a predetermined number of frames of an input speech signal, identifies whether the speech signal of the predetermined number of frames is a stationary part or a non-stationary part, and outputs an identification signal; a speech-rate estimation neural network that receives the identification signals output from the stationary-part identification neural network, in a number corresponding to a predetermined time, and outputs a signal related to the average speech rate; and a speech rate calculation section that calculates an estimate of the average speech rate on the basis of the signal related to the average speech rate output from the speech-rate estimation neural network.

<Operation> When a signal representing the feature parameters of a predetermined number of frames of the speech signal is input to the stationary-part identification neural network, the network identifies whether the speech signal of the predetermined number of frames is a stationary part or a non-stationary part and outputs an identification signal.

When the identification signals output from the stationary-part identification neural network are input to the speech-rate estimation neural network in a number corresponding to a predetermined time, a signal related to the average speech rate is output.

The speech rate calculation section then calculates and outputs an estimate of the average speech rate on the basis of the signal related to the average speech rate output from the speech-rate estimation neural network. The average speech rate is thus estimated once every predetermined time.

<Embodiment> The present invention is described in detail below with reference to the illustrated embodiment.

The present invention uses neural networks both to identify the stationary parts of an input speech signal and to estimate the average speech rate. In identification by a neural network, the network itself acquires, through training, a rule for identifying the category to which input data belongs, and then identifies the category of input data according to that rule.

Accordingly, a neural network that has been correctly trained in advance on appropriate training data can correctly identify the category to which input data belongs with simple processing.

FIG. 1 is a block diagram of one embodiment of the speech rate estimation device of the present invention. The input speech signal is A/D converted in the speech analysis section 11 at a sampling frequency of 12 kHz, and 16th-order cepstrum coefficients are obtained for each frame (one frame is about 8 ms).

The 16th-order cepstrum coefficients output from the speech analysis section 11 are input to the stationary-part identification neural network 13 through the delay section 12, which is described in detail later. The stationary-part identification neural network 13 identifies, in the manner described later, whether T frames of the speech signal are a stationary part or a non-stationary part and outputs identification data. The identification data from the stationary-part identification neural network 13 is then input to the speech-rate estimation neural network 15 through the delay section 14, which is also described in detail later. The speech-rate estimation neural network 15 estimates the average speech rate in the manner described later and outputs data representing the estimated average speech rate. The speech rate calculation section 16 then calculates and outputs an estimate of the average speech rate on the basis of the data representing the estimated average speech rate from the speech-rate estimation neural network 15.

FIG. 2 is a schematic diagram of the structure of the stationary-part identification neural network 13. This network is a three-layer perceptron-type neural network consisting, in order from the bottom of the figure, of an input layer 21, an intermediate layer 22, and an output layer 23. The input layer 21 has 16×T units, the intermediate layer 22 has 8 units, and the output layer 23 has two units 24 and 25. Each of the 16×T units of the input layer 21 receives the data of one order of the cepstrum coefficients of one frame, out of the 16th-order cepstrum coefficients of T frames of the input speech signal. The category "stationary part" is assigned to the unit 24 of the output layer 23, and the category "non-stationary part" is assigned to the unit 25. Each unit of the input layer 21 is connected to all units of the intermediate layer 22, and each unit of the intermediate layer 22 is connected to both units 24 and 25 of the output layer 23.

However, units within the same layer are not connected to each other.
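The layer sizes and connectivity of this network can be sketched as a plain forward pass. The text does not state the unit activation function, bias terms, or weight values, so the sigmoid, the bias per unit, and the random weights below are assumptions, and all names are illustrative.

```python
import math
import random

T = 5            # frames per window (the embodiment later uses T = 5)
N_CEPSTRUM = 16  # cepstrum order
N_HIDDEN = 8     # units in the intermediate layer 22
N_OUTPUT = 2     # index 0: "stationary part" (unit 24), index 1: "non-stationary part" (unit 25)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_layer(n_in, n_out, rng):
    """One fully connected layer: n_out rows of n_in weights plus one bias (assumed)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]


def forward_layer(weights, inputs):
    """Each unit sees every unit of the layer below; there are no links within a layer."""
    return [sigmoid(sum(w * x for w, x in zip(row[:-1], inputs)) + row[-1]) for row in weights]


def classify(hidden_w, output_w, window):
    """Return (index of the larger output, output values) for one 16 x T window."""
    hidden = forward_layer(hidden_w, window)
    outputs = forward_layer(output_w, hidden)
    return outputs.index(max(outputs)), outputs


rng = random.Random(0)
hidden_w = make_layer(N_CEPSTRUM * T, N_HIDDEN, rng)
output_w = make_layer(N_HIDDEN, N_OUTPUT, rng)
window = [rng.uniform(-1.0, 1.0) for _ in range(N_CEPSTRUM * T)]  # stand-in input
label, outputs = classify(hidden_w, output_w, window)
```

With untrained random weights the label is meaningless; the sketch only shows the 80-8-2 shape for T = 5 and how the category is read off the larger of the two outputs.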

The 16th-order cepstrum coefficients of T frames are input to the 16×T units of the input layer 21 as follows. FIG. 4 is a detailed block diagram of the delay sections 12 and 14.

The 16×T units of the input layer 21 of the stationary-part identification neural network 13 are divided, starting from the unit at one end, into 16 blocks of T units each. The first unit of the i-th block (1 ≤ i ≤ 16) receives the i-th order cepstrum coefficient. The next unit receives the i-th order cepstrum coefficient delayed by one frame by a delay element 121. The unit after that receives the i-th order cepstrum coefficient delayed by two frames by two delay elements 121. In the same way, the last unit receives the i-th order cepstrum coefficient delayed by (T-1) frames by (T-1) delay elements 121. The 16th-order cepstrum coefficients are thus delayed in steps from 0 frames to (T-1) frames, and the 16th-order cepstrum coefficients of T frames are input to the units of the input layer 21.
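The block layout described above, with block i holding the i-th order coefficient at delays 0 to T-1, can be sketched with a small tapped delay line. The class name, the zero-filled initial state, and the use of `collections.deque` are assumptions of this illustration rather than anything specified for the delay section 12.

```python
from collections import deque

T = 5
N_CEPSTRUM = 16


class TappedDelayLine:
    """Keeps the latest T frames and lays them out block by block."""

    def __init__(self, n_coeffs=N_CEPSTRUM, n_taps=T):
        self.n_coeffs = n_coeffs
        # Index 0 is the current frame; index k is the frame delayed by k frames.
        # Starting from all-zero frames is an assumption of this sketch.
        self.frames = deque([[0.0] * n_coeffs for _ in range(n_taps)], maxlen=n_taps)

    def push(self, frame):
        """Shift in one new frame of cepstrum coefficients; the oldest frame drops out."""
        self.frames.appendleft(list(frame))

    def input_vector(self):
        """Block i holds coefficient i at delays 0, 1, ..., T-1 (16 x T values in total)."""
        return [frame[i] for i in range(self.n_coeffs) for frame in self.frames]


line = TappedDelayLine()
for t in range(1, 6):
    # Toy frames: coefficient i of frame t has the value t * 10 + i.
    line.push([t * 10 + i for i in range(N_CEPSTRUM)])
vector = line.input_vector()
```

After five frames, the first block reads 50, 40, 30, 20, 10: the first-order coefficient of the current frame followed by its progressively older copies.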

The stationary-part identification neural network 13 is trained by the error backpropagation method as follows.

For speech signals from many speakers, each T-frame segment is judged by visual inspection to be a stationary part or a non-stationary part. The 16th-order cepstrum coefficients of these T-frame segments, whose class is thus known, are obtained and used as training data.

When training data consisting of the 16th-order cepstrum coefficients of T frames that form a stationary part is input to the 16×T units of the input layer 21, teacher data is input to the output layer 23 in which the input value to the unit 24, to which the category "stationary part" is assigned, is "1" and the input value to the unit 25, to which the category "non-stationary part" is assigned, is "0". Conversely, when training data consisting of the 16th-order cepstrum coefficients of T frames that form a non-stationary part is input to the units of the input layer 21, teacher data is input to the output layer 23 in which the input value to the unit 24 is "0" and the input value to the unit 25 is "1". The stationary-part identification neural network 13 then readjusts its weights so that the output values from the units 24 and 25 of the output layer 23 match the teacher data, and the network structure is thereby determined.
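One weight update of this kind can be sketched as below. The text names the error backpropagation method but gives no activation function, error measure, or learning rate, so the sigmoid units, squared error, bias terms, learning rate of 0.1, and random stand-in input are all assumptions of this example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(w_hidden, w_output, x):
    """Forward pass; the last weight in each row is a bias term (assumed)."""
    h = [sigmoid(sum(w * v for w, v in zip(row[:-1], x)) + row[-1]) for row in w_hidden]
    y = [sigmoid(sum(w * v for w, v in zip(row[:-1], h)) + row[-1]) for row in w_output]
    return h, y


def squared_error(y, target):
    return 0.5 * sum((a - t) ** 2 for a, t in zip(y, target))


def backprop_step(w_hidden, w_output, x, target, lr=0.1):
    """Update both weight matrices in place for one training example."""
    h, y = forward(w_hidden, w_output, x)
    # Error signal at each output unit (squared error through the sigmoid).
    delta_out = [(a - t) * a * (1.0 - a) for a, t in zip(y, target)]
    # Propagate the error back to the hidden units before changing the output weights.
    delta_hid = [
        h[j] * (1.0 - h[j]) * sum(delta_out[k] * w_output[k][j] for k in range(len(w_output)))
        for j in range(len(h))
    ]
    for k, row in enumerate(w_output):
        for j in range(len(h)):
            row[j] -= lr * delta_out[k] * h[j]
        row[-1] -= lr * delta_out[k]
    for j, row in enumerate(w_hidden):
        for i in range(len(x)):
            row[i] -= lr * delta_hid[j] * x[i]
        row[-1] -= lr * delta_hid[j]


STATIONARY = [1.0, 0.0]      # teacher data: unit 24 = 1, unit 25 = 0
NON_STATIONARY = [0.0, 1.0]  # teacher data: unit 24 = 0, unit 25 = 1

rng = random.Random(0)
n_in, n_hidden = 16 * 5, 8
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
w_output = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(2)]
x = [rng.uniform(-1.0, 1.0) for _ in range(n_in)]  # stand-in for one labeled window

error_before = squared_error(forward(w_hidden, w_output, x)[1], STATIONARY)
for _ in range(50):
    backprop_step(w_hidden, w_output, x, STATIONARY)
error_after = squared_error(forward(w_hidden, w_output, x)[1], STATIONARY)
```

Real training would cycle over many labeled windows from many speakers; repeating one example here only shows the outputs moving toward the teacher data.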

The stationary-part identification neural network 13 trained as described above identifies the category to which T frames of the input speech signal belong as follows. The 16th-order cepstrum coefficients of T frames from the speech analysis section 11 are input to the input layer 21 of the stationary-part identification neural network 13. As a result, identification data is output in which the output value is largest at the unit of the output layer 23 assigned the category to which the 16th-order cepstrum coefficients of those T frames belong.

Accordingly, when the input speech signal is a stationary part, identification data is output in which the output value from the unit 24, assigned to "stationary part", is the largest. When the input speech signal is a non-stationary part, identification data is output in which the output value from the unit 25, assigned to "non-stationary part", is the largest.

FIG. 3 is a schematic diagram of the structure of the speech-rate estimation neural network 15. Like the stationary-part identification neural network 13, this network is a three-layer perceptron-type neural network, and it operates on the identification signals from the output layer 23 of the stationary-part identification neural network 13, trained as described above, as its input data.

The input layer 31 of the speech-rate estimation neural network 15 has 2×25 = 50 units, the intermediate layer 32 has 20 units, and the output layer 33 has 11 units. The 50 units of the input layer 31 are divided into 25 groups of two units each.

One unit of each group, namely the units 34, 35, ..., 36, receives the output signal from the unit 24 of the output layer 23 of the stationary-part identification neural network 13 (that is, the signal corresponding to the category "stationary part"). The other unit of each group, namely the units 37, 38, ..., 39, receives the output signal from the unit 25 of the output layer 23 of the stationary-part identification neural network 13 (that is, the signal corresponding to the category "non-stationary part"). Of the 25 groups, the group consisting of the units 34 and 37 receives the first identification data from the stationary-part identification neural network 13, the group consisting of the units 35 and 38 receives the second identification data, and so on, until the group consisting of the units 36 and 39 receives the 25th identification data. In other words, the input layer 31 of the speech-rate estimation neural network 15 receives the 25 consecutive identification data that the stationary-part identification neural network 13 outputs on the basis of the 16th-order cepstrum coefficients of 25×T frames of the input speech signal.

The reason for inputting 25 consecutive identification data from the stationary-part identification neural network 13 to the input layer 31 is as follows.

Since one frame in this embodiment is 8 ms, one second corresponds to 125 frames. If the number of frames T used by the stationary-part identification neural network 13 to identify a stationary part is set to 5, then 25×T frames is 125 frames, that is, one second. Inputting 25 consecutive identification data to the speech-rate estimation neural network 15 therefore amounts to inputting the identification data for one second of the input speech signal. In short, the input layer 31 of the speech-rate estimation neural network 15 need only receive identification data from the stationary-part identification neural network 13 in a number corresponding to the predetermined time.
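The arithmetic behind the choice of 25 identification data can be checked directly. The constants come from the embodiment (8 ms frames, T = 5, 12 kHz sampling); the helper `windows_for` and its treatment of windows as non-overlapping runs of T frames are illustrative assumptions.

```python
FRAME_MS = 8             # one frame is about 8 ms in the embodiment
T = 5                    # frames per identification window
N_WINDOWS = 25           # identification data fed to network 15 at once
SAMPLE_RATE_HZ = 12_000  # A/D conversion rate in the speech analysis section 11

frames_per_second = 1000 // FRAME_MS                   # frames in one second
frames_covered = N_WINDOWS * T                         # frames spanned by 25 windows
seconds_covered = frames_covered * FRAME_MS / 1000     # duration spanned, in seconds
samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000  # A/D samples per frame


def windows_for(duration_ms, frame_ms=FRAME_MS, frames_per_window=T):
    """Number of identification data needed to span `duration_ms` (hypothetical helper)."""
    return duration_ms // (frame_ms * frames_per_window)
```

Changing the predetermined time or T changes only the width of the input layer 31; for example, `windows_for(2000)` gives the count needed to span two seconds.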

In this case, the 25 consecutive identification data from the stationary-part identification neural network 13 can be input to the input layer 31 of the speech-rate estimation neural network 15 by, for example, the following method. As shown in FIG. 4, the delay section 14 consists of two delay sections 141 and 142. In FIGS. 3 and 4, the output signal from the unit 24 of the output layer 23 of the stationary-part identification neural network 13, assigned to "stationary part", is input directly to the unit 34 of the input layer 31, and the output signal from the unit 25, assigned to "non-stationary part", is input directly to the unit 37.

The unit 35 receives the output signal from the unit 24 of the stationary-part identification neural network 13 delayed by T frames by a delay element 143 of the delay section 141, and the unit 38 receives the output signal from the unit 25 of the stationary-part identification neural network 13 delayed by T frames by a delay element 143 of the delay section 142. In the same way, the unit 36 receives the output signal from the unit 24 of the stationary-part identification neural network 13 delayed by 24×T frames by 24 delay elements 143, and the unit 39 receives the output signal from the unit 25 of the stationary-part identification neural network 13 delayed by 24×T frames by 24 delay elements 143.
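The delay arrangement above amounts to a 25-stage shift register over identification data, stepped once every T frames. The sketch below models it with a bounded deque; the class name, the zero-filled initial state, and the flattening order of the 50 values are assumptions for illustration.

```python
from collections import deque

N_WINDOWS = 25


class IdentificationDelay:
    """Holds the latest 25 identification data as (stationary, non_stationary) pairs."""

    def __init__(self, n_windows=N_WINDOWS):
        # Position 0 is the newest pair (units 34 and 37); position 24 is the oldest
        # (units 36 and 39). Unfilled positions start at (0.0, 0.0) in this sketch.
        self.pairs = deque([(0.0, 0.0)] * n_windows, maxlen=n_windows)

    def push(self, stationary_out, non_stationary_out):
        """Called once every T frames with the outputs of units 24 and 25."""
        self.pairs.appendleft((stationary_out, non_stationary_out))

    def input_vector(self):
        """Flatten the 25 groups into the 2 x 25 = 50 inputs of the input layer 31."""
        return [value for pair in self.pairs for value in pair]


delay = IdentificationDelay()
for k in range(1, 27):
    # Toy identification data; real values would come from network 13.
    delay.push(k / 100, 1 - k / 100)
inputs = delay.input_vector()
```

After 26 pushes the first pair has already fallen off the far end, so the buffer always covers exactly the most recent 25×T frames.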

Of the 11 units of the output layer 33, the unit 41 is assigned the category "silence", the unit 42 is assigned the category "1" (1 mora/s), the unit 43 is assigned the category "2" (2 mora/s), and so on, with the unit 46 assigned the category "10" (10 mora/s). Each unit of the input layer 31 is connected to all units of the intermediate layer 32, and each unit of the intermediate layer 32 is connected to all units of the output layer 33. However, units within the same layer are not connected to each other.

The speech-rate estimation neural network 15 is trained by the error backpropagation method as follows, while connected to the stationary-part identification neural network 13 as described above. The time series of 16th-order cepstrum coefficients for each frame in one second (25×T frames) of speech signals from many speakers is used as training data. The average speech rate of each item of training data is calculated by visual inspection, and data representing the calculated average speech rate is used as teacher data. For example, when training data from a silent interval is input to the units of the input layer 21 of the already trained stationary-part identification neural network 13, teacher data is input in which the input value to the unit 41 of the output layer 33 of the speech-rate estimation neural network 15, to which "silence" is assigned, is "1" and the input values to the other units are "0". When training data from a voiced interval whose average speech rate, calculated by visual inspection, is N mora/s is input to the input layer 21 of the stationary-part identification neural network 13, teacher data is input in which the input value to the unit 45 of the output layer 33 of the speech-rate estimation neural network 15 is "1" and the input values to the other units are "0". In that case, if the calculated average speech rate is, for example, 2.5 mora/s, teacher data is input in which the input value to the unit 43 of the output layer 33, to which "2" is assigned, is "0.5", the input value to the unit 44, to which "3" is assigned, is "0.5", and the input values to the other units are "0". In other words, the network learns to identify the number of morae contained in one second of the input speech signal when the 16th-order cepstrum coefficients of one second (that is, 25×T frames) of the speech signal are input to the input layer 21 of the stationary-part identification neural network 13. The speech-rate estimation neural network 15 then readjusts its weights so that the output values from the units 41, ..., 46 of the output layer 33 match the teacher data, and the network structure is thereby determined. During this training, the weights of the stationary-part identification neural network 13 are kept unchanged.
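The teacher data for the 11 output units can be sketched as below. The silence case, the integer-rate case, and the 2.5 mora/s case follow the text; splitting other fractional rates linearly between the two neighboring categories, and rejecting rates outside 1 to 10 mora/s, are assumptions of this example, as are the names.

```python
import math

N_CATEGORIES = 11  # index 0: "silence" (unit 41), index k: "k" mora/s for k = 1..10


def teacher_vector(rate_mora_per_sec=None):
    """Teacher data for network 15; pass None for a silent interval."""
    target = [0.0] * N_CATEGORIES
    if rate_mora_per_sec is None:
        target[0] = 1.0
        return target
    if not 1.0 <= rate_mora_per_sec <= 10.0:
        raise ValueError("rate must lie between 1 and 10 mora/s")
    lower = math.floor(rate_mora_per_sec)
    fraction = rate_mora_per_sec - lower
    # A whole-number rate puts 1 on a single unit; a fractional rate is shared
    # between the two neighboring units (linear split assumed beyond the 2.5 case).
    target[lower] = 1.0 - fraction
    if fraction > 0.0:
        target[lower + 1] = fraction
    return target


silent = teacher_vector(None)
halfway = teacher_vector(2.5)  # 0.5 on category "2" and 0.5 on category "3"
```

Encoding a fractional rate across two units lets the 11 discrete categories represent rates between the integer values.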

The number of morae contained in one second of the input speech signal (that is, the estimated speech rate) is identified as follows by the stationary-part identification neural network 13 and the speech-rate estimation neural network 15 trained as described above.

First, the 16th-order cepstrum coefficients of the first T frames of the input speech signal from the speech analysis section 11 are input to the input layer 21 of the stationary-part identification neural network 13 through the delay section 12 as described above. As a result, identification data representing the category to which the 16th-order cepstrum coefficients of those T frames belong is output from the output layer 23 and input to the units 34 and 37 of the input layer 31 of the speech-rate estimation neural network 15.

Next, the 16th-order cepstrum coefficients of the second T frames of the input speech signal are input to the input layer 21 of the stationary-part identification neural network 13.

Identification data representing the category to which the 16th-order cepstrum coefficients of the second T frames belong is then output from the output layer 23 and input to the units 34 and 37 of the input layer 31 of the speech-rate estimation neural network 15. At the same time, the identification data for the 16th-order cepstrum coefficients of the first T frames, delayed by a time corresponding to T frames by the delay elements 143 of the delay section 14, is input to the units 35 and 38 of the input layer 31 of the speech-rate estimation neural network 15.

In the same manner, when the 16th-order cepstral coefficients for the 25th T frames are input to the input layer 21 of the stationary-part identification neural network 13, the identification data for those coefficients is output from the output layer 23 and input to units 34 and 37 of the input layer 31 of the speech-rate estimation neural network 15. At the same time, the identification data for the 16th-order cepstral coefficients of the 24th T frames, delayed by a time corresponding to T frames by a delay element 143 of the delay unit 14, is input to units 35 and 38 of the input layer 31, and so on, until the identification data for the 16th-order cepstral coefficients of the first T frames, delayed by a time corresponding to 24 × T frames by the 24 delay elements 143, is input to units 36 and 39 of the input layer 31. In this way, the 25 pieces of identification data from the stationary-part identification neural network 13, based on the 16th-order cepstral coefficients of one second (that is, 25 × T frames) of the input speech signal, are input to the input layer 31 of the speech-rate estimation neural network 15.
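The chain of delay elements described above behaves like a fixed-length shift register over identification data. The following is a minimal sketch of that behavior, not the circuit of the embodiment: the class name `IdentificationDelayLine` and its methods are hypothetical, and each identification datum is treated as an opaque value because its exact format is not restated here.

```python
from collections import deque

WINDOW_BLOCKS = 25  # one second = 25 blocks of T frames, as in the embodiment


class IdentificationDelayLine:
    """Hypothetical model of the delay unit: keeps the newest data first."""

    def __init__(self, size=WINDOW_BLOCKS):
        # A deque with maxlen drops the oldest entry automatically when full.
        self._taps = deque(maxlen=size)

    def push(self, identification):
        # The newest value enters at tap 0; older values move one tap back,
        # like delay elements that each delay by the duration of T frames.
        self._taps.appendleft(identification)

    def is_full(self):
        return len(self._taps) == self._taps.maxlen

    def snapshot(self):
        # Tap 0 is the current block; tap k is the block from k * T frames ago.
        return list(self._taps)
```

Once `is_full()` returns true, `snapshot()` holds the 25 values that would be presented together to the input layer of the speech-rate estimation network.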

Each of the units 41, ..., 46 of the output layer 33 of the speech-rate estimation neural network 15 then outputs a value between 0 and 1 that reflects the degree to which the 16th-order cepstral coefficients of the one second of input speech signal are identified as belonging to a category.

Thus, in this embodiment, the stationary-part identification neural network 13 and the speech-rate estimation neural network 15 identify the number of moras contained in one second of the input speech signal (that is, estimate the average speech rate) on the basis of the 16th-order cepstral coefficients for that one second. Therefore, even if one second of the input speech signal contains a plurality of stationary parts, the average speech rate can be estimated stably without being tied to any individual stationary part.

The output data representing the estimated average speech rate, output as described above from the output layer 33 of the speech-rate estimation neural network 15, is input to the speech rate calculation unit 16. On the basis of this output data, an estimated value of the average speech rate is calculated as follows.

Specifically, from among the output values of all the units 41, ..., 46 of the output layer 33 of the speech-rate estimation neural network 15, the unit outputting the largest value (denoted U1) and the unit outputting the second-largest value (denoted U2) are selected. If unit U1 is the unit 41 assigned to the category "silence", the corresponding one second of the input speech signal is judged to be a silent interval, and an average speech rate estimate of "0" is output.

Otherwise, the average speech rate estimate is calculated according to the following equation.

$$\text{average speech rate estimate} = \frac{U_1 \times V(U_1) + U_2 \times V(U_2)}{V(U_1) + V(U_2)}$$

where V(Un) is the output value of unit Un and, in the numerator, Un stands for the number of moras assigned to that unit. That is, the speech rate estimate is calculated from the output value of each unit weighted by the number of moras assigned to that unit.
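The selection of U1 and U2, the silence rule, and the weighted average can be sketched together as follows. This is an illustrative sketch only: the function name, the argument layout, and the mora counts used in the usage note are hypothetical, and the early return for an all-zero denominator is a defensive assumption that the text does not describe.

```python
def estimate_average_rate(outputs, moras, silence_index=0):
    """Return an average speech rate estimate in moras per window.

    outputs: output-layer values, each between 0 and 1.
    moras:   mora count assigned to each output unit (hypothetical values).
    silence_index: position of the unit assigned to "silence" (assumed 0).
    """
    # Rank unit indices by output value, largest first.
    ranked = sorted(range(len(outputs)), key=lambda i: outputs[i], reverse=True)
    u1, u2 = ranked[0], ranked[1]

    # If the strongest unit is the silence unit, the window is a silent interval.
    if u1 == silence_index:
        return 0.0

    v1, v2 = outputs[u1], outputs[u2]
    if v1 + v2 == 0:
        # Assumption: avoid dividing by zero when both outputs are zero.
        return 0.0

    # Weighted average of the two mora counts, weighted by the unit outputs.
    return (moras[u1] * v1 + moras[u2] * v2) / (v1 + v2)
```

For example, with hypothetical mora counts `[0, 2, 4, 6, 8, 10]` and outputs `[0.1, 0.2, 0.6, 0.3, 0.0, 0.0]`, the two strongest units carry 4 and 6 moras, so the estimate falls between those two values and closer to 4.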

In this way, an estimated value of the average speech rate is calculated and output on the basis of the output data representing the estimated speech rate that is output from the speech-rate estimation neural network 15.

As described above, in the average speech rate estimating device of this embodiment, when the 16th-order cepstral coefficients for one second of the input speech signal are input to the input layer 21 of the stationary-part identification neural network 13, that network identifies, for every T frames, whether the input speech signal is a stationary part or a non-stationary part, and outputs the identification data sequentially. The speech-rate estimation neural network 15 then identifies the number of moras contained in one second of the input speech signal (that is, estimates the average speech rate) on the basis of the identification data for that one second, and outputs data representing the estimated average speech rate. The speech rate calculation unit 16 then calculates the average speech rate estimate on the basis of that output data. In other words, the average speech rate is estimated on the basis of the 16th-order cepstral coefficients for one second of the input speech signal.
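The three stages summarized above can be wired together as in the sketch below. The two networks and the calculation unit are passed in as plain callables, so nothing here reproduces the trained networks of the embodiment; the function name and parameters are hypothetical. The sketch also assumes that an estimate is emitted for every new block once the window is full; emitting only once per window would be an equally plausible reading of "every predetermined time".

```python
from collections import deque


def run_pipeline(blocks, stationary_net, rate_net, rate_calc, window=25):
    """Yield average speech rate estimates from a stream of T-frame blocks.

    blocks:         iterable of feature blocks (T frames of coefficients each).
    stationary_net: callable mapping one block to identification data.
    rate_net:       callable mapping a list of `window` identification data
                    (newest first) to output-layer values.
    rate_calc:      callable mapping output-layer values to a rate estimate.
    """
    taps = deque(maxlen=window)  # plays the role of the delay unit
    for block in blocks:
        taps.appendleft(stationary_net(block))
        if len(taps) == window:
            yield rate_calc(rate_net(list(taps)))
```

Stub callables are enough to exercise the data flow, for instance a `stationary_net` that thresholds the block energy and a `rate_net` that counts stationary blocks.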

Therefore, according to this embodiment, the average speech rate of the input speech signal can be estimated by using a neural network that generates by itself, through learning, the rule for identifying whether T frames of the input speech signal are a stationary part or a non-stationary part, and a neural network that generates by itself, through learning, the rule for identifying the number of moras contained in one second (25 × T frames) of the input speech signal. Even if one second of the input speech signal contains a plurality of stationary parts, the average speech rate can be estimated stably without being tied to any individual stationary part.

In the above embodiment, a number of pieces of identification data corresponding to one second of the input speech signal is input to the speech-rate estimation neural network 15. However, the present invention is not limited to this; the number of pieces of identification data may correspond to, for example, 0.5 seconds.

In the above embodiment, the input layer 31 of the speech-rate estimation neural network 15 has 50 units. However, the present invention is not limited to this; the number of units may be set optimally according to the duration of one frame, the number of frames used by the stationary-part identification neural network 13 to identify a stationary part, and the like.

In the above embodiment, 16th-order cepstral coefficients are used as the feature parameters. However, the present invention is not limited to this; a spectrum, autocorrelation coefficients, the outputs of a band-pass filter bank, or the like may be used instead.

Also, in the above embodiment, three-layer perceptron-type neural networks are used, but perceptron-type neural networks with four or more layers may be used instead.

<Effects of the Invention> As is clear from the above, the speech rate estimating device of the present invention comprises a stationary-part identification neural network, a speech-rate estimation neural network, and a speech rate calculation unit. On the basis of the feature parameters of a predetermined number of frames of the input speech signal, the stationary-part identification neural network identifies whether the speech signal of that predetermined number of frames is a stationary part or a non-stationary part. On the basis of a number of identification signals from the stationary-part identification neural network corresponding to a predetermined time, the speech-rate estimation neural network outputs a signal related to the average speech rate of the input speech signal. On the basis of this signal, the speech rate calculation unit calculates an estimated value of the average speech rate for every predetermined time. Consequently, even if the input speech signal of the predetermined time contains a plurality of stationary parts, the average speech rate can be estimated stably without being tied to any individual stationary part.

[Brief explanation of drawings]

FIG. 1 is a block diagram of an embodiment of the speech rate estimating device of the present invention. FIG. 2 is a schematic configuration diagram of the stationary-part identification neural network in FIG. 1. FIG. 3 is a schematic configuration diagram of the speech-rate estimation neural network in FIG. 1. FIG. 4 is a detailed block diagram of the delay unit in FIG. 1. FIG. 5 is a block diagram of a conventional speech rate estimating device.

11: speech analysis unit; 12, 14: delay units; 13: stationary-part identification neural network; 15: speech-rate estimation neural network; 16: speech rate calculation unit; 21, 31: input layers; 22, 32: intermediate layers; 23, 33: output layers; 143: delay element.

Claims

[Claims]

(1) A speech rate estimating device comprising: a stationary-part identification neural network that receives a signal representing the feature parameters of a predetermined number of frames of an input speech signal, identifies whether the speech signal of the predetermined number of frames is a stationary part or a non-stationary part, and outputs an identification signal; a speech-rate estimation neural network that receives a number of the identification signals output from the stationary-part identification neural network corresponding to a predetermined time, and outputs a signal related to the average speech rate; and a speech rate calculation unit that calculates an estimated value of the average speech rate on the basis of the signal related to the average speech rate output from the speech-rate estimation neural network.