JPH01243169A

JPH01243169A - System for learning and preparing pattern

Info

Publication number: JPH01243169A
Application number: JP63070759A
Authority: JP
Inventors: Mitsuo Furumura; 古村　光夫; Hiroo Tanaka; 田中　啓夫
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1988-03-24
Filing date: 1988-03-24
Publication date: 1989-09-27
Anticipated expiration: 2012-08-06
Also published as: JP2637760B2

Abstract

PURPOSE: To make the learning of analog information converge both accurately and quickly by multiplexing a neural network, taking a weighted mean of the multiplexed outputs as the final output, and randomizing the weight coefficients either between the input layer and the intermediate layer or between the intermediate layer and the output layer. CONSTITUTION: The neural network is composed of an input layer 10, an intermediate layer 12, an output layer 14, and a final output layer 16, and the intermediate layer 12 and the output layer 14 are multi-staged. The weight coefficients between the input layer 10 and the intermediate layer 12, or between the intermediate layer 12 and the output layer 14, are randomized as necessary; learning is executed independently in each stage of the multiplexed intermediate layer 12 and output layer 14; and the final output layer 16 takes the weighted mean of the outputs of the stages of the output layer 14 as the final output OT. Real-world data can thus be learned efficiently, and convergence is faster.

Description

[Detailed Description of the Invention] [Summary of the Invention] The invention relates to a pattern learning and generation method and aims to enable efficient learning of real-world data with fast convergence. It uses a neural network consisting of an input layer, a multi-stage intermediate layer, a multi-stage output layer, and a final output layer. An input sequence is applied to the input layer; the final output layer takes a weighted average of the outputs of the stages of the output layer as the final output; either the weight coefficients between the input layer and the intermediate layer or those between the intermediate layer and the output layer are randomized; and each stage is trained independently.

[Industrial application field]

The present invention relates to a pattern learning and generation method. It is effective for synthetic speech generation, speech recognition, and the like, and is broadly applicable to general patterns, including time-series data.

For registered messages, synthesized speech can be generated by the PARCOR method, which extracts the features of the speech, compresses the information using the extracted parameters, and outputs speech. For arbitrary sentences (character strings) that have not been registered, the so-called rule synthesis method is used. Rule synthesis codifies how humans speak as rules and uses these rules to convert character strings into synthesized speech. Although it can generate synthesized speech even on a relatively small system, the rules become complex as soon as somewhat higher quality is wanted. Finding general rules is also not easy, because it is closely tied to fundamental problems of speech. As a result, natural synthesized speech is difficult to produce, and the output sounds mechanical and non-human.

By introducing a learning system based on a neural network (a network that imitates the human brain), it is possible to produce synthesized speech that is more natural than with rule synthesis. A neural network distributes parameters in space and has them memorize the data. It raises accuracy by arranging many low-accuracy elements in parallel, and it acquires rules by learning rather than by having them specified. Such learning maps one string of 1/0 symbols to another, and each string can be learned according to whether or not a mark is set at a predetermined position, so even fairly coarse methods often succeed.

A system that uses a neural network to automatically generate natural synthesized speech from text (a character string) consists of a part that converts the text into a phoneme sequence and a part that generates synthesized speech from the phoneme sequence.

The former part, which converts a character string into a phoneme sequence, only requires learning of character strings (1/0 states). The latter part, which generates synthesized speech from the phoneme sequence, requires learning of high-precision analog data, which has been difficult until now. The present invention concerns a pattern learning and generation method that is effective even when used for the part that outputs sound source and vocal tract parameters from a phoneme sequence.

[Conventional technology]

As shown in FIG. 5, the analog neuron element used in a neural network consists of resistors for the weight coefficients $W_{ij}$, an adder Σ, and an output function $f_j$. Given inputs $x_i$ and a threshold $\theta_j$, it produces the internal variable $y_j$ and the output $z_j$ given by the following equations.

$$y_j = \sum_{i=1}^{I} W_{ij}\, x_i + \theta_j \qquad (1)$$

$$z_j = f_j(y_j) \quad (j = 1, 2, \ldots, J) \qquad (2)$$

Here I is the number of input elements and J is the number of output elements. The neural network model used is the three-layer model shown in FIG. 6, consisting of an input layer, an intermediate layer (hidden layer), and an output layer. Weighting is applied between the input layer and the intermediate layer, and between the intermediate layer and the output layer.
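The neuron of equations (1) and (2) can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the logistic curve is assumed as the output function because the text adopts it later in equation (15).

```python
import math

def logistic(y):
    """Logistic output function, one possible choice of f_j (assumed here)."""
    return 1.0 / (1.0 + math.exp(-y))

def neuron_output(weights, inputs, theta, f=logistic):
    """Return (z_j, y_j) for one neuron.

    Equation (1): y_j = sum_i W_ij * x_i + theta_j
    Equation (2): z_j = f_j(y_j)
    """
    y = sum(w * x for w, x in zip(weights, inputs)) + theta
    return f(y), y

# Two inputs whose weighted sum cancels, so y_j is 0 and the logistic output is 1/2.
z, y = neuron_output([0.5, -0.25], [1.0, 2.0], theta=0.0)
```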

Let an arbitrary point of the input layer be $x^h_{is,ip}$ ($1 \le is \le IS$, $1 \le ip \le IP$, where IS is the number of sequence points in the input layer and IP is the number of elements held by one sequence point), and let an arbitrary point of the intermediate layer be $z^h_j$ ($1 \le j \le J$, where J is the number of elements in the intermediate layer). The input/output relation is then

$$z^h_j = f^h_j\Bigl(\sum_{is=1}^{IS}\sum_{ip=1}^{IP} W^h_{is,ip,j}\, x^h_{is,ip} + \theta^h_j\Bigr) \qquad (3)$$

Similarly, the input/output relation from the intermediate layer to the output layer is

$$z^o_k = f^o_k\Bigl(\sum_{j=1}^{J} W^o_{jk}\, x^o_j + \theta^o_k\Bigr) \qquad (4)$$

where $x^o_j = z^h_j$ and $1 \le k \le K$ (K is the number of elements in the output layer).
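A minimal sketch of the forward pass of equations (3) and (4) follows. The IS × IP input grid is flattened into one vector before the first layer; the function names and the use of the logistic curve as output function are assumptions made for illustration.

```python
import math

def logistic(y):
    return 1.0 / (1.0 + math.exp(-y))

def layer_forward(x, W, theta):
    """One fully connected layer: W[i][j] links input i to unit j, theta[j] is the threshold."""
    return [
        logistic(sum(x[i] * W[i][j] for i in range(len(x))) + theta[j])
        for j in range(len(theta))
    ]

def three_layer_forward(x_seq, W_h, theta_h, W_o, theta_o):
    """x_seq holds IS columns of IP values each.

    Returns (z_h, z_o): intermediate-layer and output-layer values.
    """
    # Flatten the IS x IP grid so that the double sum of equation (3) becomes a single sum.
    x_h = [value for column in x_seq for value in column]
    z_h = layer_forward(x_h, W_h, theta_h)   # equation (3)
    z_o = layer_forward(z_h, W_o, theta_o)   # equation (4), with x^o_j = z^h_j
    return z_h, z_o
```

With IS = 2, IP = 3, J = 2 and K = 1, `W_h` has 6 rows of 2 weights and `W_o` has 2 rows of 1 weight.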

The backpropagation learning algorithm used in the prior art is described below. For brevity, $z^h_j$ and $z^o_k$ are written collectively as $z_j$, $x^h_{is,ip}$ and $x^o_j$ as $x_i$, and $f^h_j$ and $f^o_k$ as $f$.

Let the target (the desired output) be $t_j$. The weights $W_{ij}$ are corrected (by an amount $\Delta W_{ij}$) so that the sum of squared errors between the target values $t_j$ and the actual outputs $z_j$ is minimized. For simplicity, the thresholds $\theta_j$ are set to 0.

That is, the output error E is defined as

$$E = \frac{1}{2}\sum_j (t_j - z_j)^2 \qquad (5)$$

(the factor 1/2 cancels the 2 that appears on differentiation), and learning follows the steepest descent method, in which the weight correction is chosen along the steepest slope of the error:

$$\Delta W_{ij} \propto -\frac{\partial E}{\partial W_{ij}} \qquad (6)$$

From equation (5),

$$\frac{\partial E}{\partial z_j} = -(t_j - z_j) \qquad (7)$$

Since

$$\frac{\partial E}{\partial W_{ij}} = \frac{\partial E}{\partial y_j}\cdot\frac{\partial y_j}{\partial W_{ij}} \qquad (8)$$

equation (1) gives

$$\frac{\partial y_j}{\partial W_{ij}} = \frac{\partial}{\partial W_{ij}}\sum_i W_{ij}\, x_i = x_i \qquad (9)$$

Defining

$$\delta_j = -\frac{\partial E}{\partial y_j} \qquad (10)$$

equations (8) and (9) give

$$-\frac{\partial E}{\partial W_{ij}} = \delta_j\, x_i \qquad (11)$$

and, together with the assumption of equation (6),

$$\Delta W_{ij} = \alpha\, \delta_j\, x_i \qquad (12)$$

Next, $\delta_j$ is computed. From equation (10), $\delta_j = -\dfrac{\partial E}{\partial z_j}\cdot\dfrac{\partial z_j}{\partial y_j}$, so taking equations (7) and (2) into account, the backward-propagated error at the output layer, $\delta^o_k$, is

$$\delta^o_k = (t_k - z^o_k)\, {f^o_k}'(y^o_k) \qquad (13)$$

The backward-propagated error at layers other than the output layer (the intermediate layer), $\delta^h_j$, is

$$\delta^h_j = {f^h_j}'(y^h_j) \sum_{k=1}^{K} \delta^o_k\, W^o_{jk} \qquad (14)$$

where K is the number of elements in the output layer. In particular, if the output function $f(\cdot)$ is the logistic curve

$$z_j = \frac{1}{1 + \exp(-y_j)} \qquad (15)$$

($z_j$ is 1/2 when $y_j$ is 0, approaches 1 along a saturation curve as $y_j$ grows in the positive direction, and approaches 0 as $y_j$ grows in the negative direction), then

$$f_j'(y_j) = z_j (1 - z_j) \qquad (16)$$

so equations (13) and (14) become, respectively,

$$\delta^o_k = (t_k - z^o_k)\, z^o_k (1 - z^o_k) \qquad (17)$$

$$\delta^h_j = z^h_j (1 - z^h_j) \sum_{k=1}^{K} \delta^o_k\, W^o_{jk} \qquad (18)$$

From equation (12), the correction $\Delta W^o_{jk}$ of the weight $W^o_{jk}$ between the intermediate layer and the output layer is

$$\Delta W^o_{jk}(n+1) = \alpha\, \delta^o_k\, x^o_j \qquad (19)$$

or, adding a term proportional to the previous correction,

$$\Delta W^o_{jk}(n+1) = \alpha\, \delta^o_k\, x^o_j + \beta\, \Delta W^o_{jk}(n) \qquad (20)$$

Likewise, the correction $\Delta W^h_{ij}$ of the weight $W^h_{ij}$ between the input layer and the intermediate layer is

$$\Delta W^h_{ij}(n+1) = \alpha\, \delta^h_j\, x^h_i \qquad (21)$$

or

$$\Delta W^h_{ij}(n+1) = \alpha\, \delta^h_j\, x^h_i + \beta\, \Delta W^h_{ij}(n) \qquad (22)$$

From the above, learning in the conventional method computes the output values from the input layer through the intermediate layer to the output layer using the model shown in the figure, and then corrects the weights using equations (17) and (18) together with equations (20) and (22), thereby learning the pattern. In other words, learning by backpropagation inputs the training data and outputs a result (forward; feedforward), changes the connection strengths so as to reduce the error in the result (backward; feedback), inputs the training data again, and repeats this until convergence.
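One iteration of this conventional procedure might look as follows. It is a minimal sketch, assuming logistic units, thresholds fixed at 0 as in the text, and illustrative values for α and β; all function and variable names are hypothetical.

```python
import math

def logistic(y):
    return 1.0 / (1.0 + math.exp(-y))

def forward(x, W_h, W_o):
    """Feedforward pass with thresholds fixed at 0."""
    z_h = [logistic(sum(x[i] * W_h[i][j] for i in range(len(x))))
           for j in range(len(W_h[0]))]
    z_o = [logistic(sum(z_h[j] * W_o[j][k] for j in range(len(z_h))))
           for k in range(len(W_o[0]))]
    return z_h, z_o

def backprop_step(x, t, W_h, W_o, dW_h, dW_o, alpha=0.5, beta=0.9):
    """Update W_h and W_o in place for one (input, target) pair.

    dW_h and dW_o hold the previous corrections. Returns the error E of
    equation (5), measured before the update.
    """
    z_h, z_o = forward(x, W_h, W_o)
    # Equation (17): backward-propagated error at the output layer.
    d_o = [(t[k] - z_o[k]) * z_o[k] * (1.0 - z_o[k]) for k in range(len(z_o))]
    # Equation (18): backward-propagated error at the intermediate layer
    # (computed with the weights as they were before this update).
    d_h = [z_h[j] * (1.0 - z_h[j]) * sum(d_o[k] * W_o[j][k] for k in range(len(d_o)))
           for j in range(len(z_h))]
    # Equation (20): correction of the intermediate-to-output weights.
    for j in range(len(z_h)):
        for k in range(len(d_o)):
            dW_o[j][k] = alpha * d_o[k] * z_h[j] + beta * dW_o[j][k]
            W_o[j][k] += dW_o[j][k]
    # Equation (22): correction of the input-to-intermediate weights.
    for i in range(len(x)):
        for j in range(len(d_h)):
            dW_h[i][j] = alpha * d_h[j] * x[i] + beta * dW_h[i][j]
            W_h[i][j] += dW_h[i][j]
    return 0.5 * sum((t[k] - z_o[k]) ** 2 for k in range(len(t)))
```

Calling `backprop_step` repeatedly on the training pairs until the returned error stops decreasing corresponds to the repeat-until-convergence loop described above.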

[Problem to be solved by the invention]

An automatic synthetic speech generation system using a neural network can generate more natural synthesized speech than rule synthesis. Rule synthesis must describe every feature of phonological change as rules, which is difficult. Introducing a learning method based on a neural network, by contrast, makes it possible to learn inputs together with their phonological environment and the target outputs obtained from real speech as a set, so that the natural phonological environment is captured in the network. However, with the neural network learning methods proposed so far, it is difficult to learn anything other than specific phonological environments. This is because, with the prior art, it is difficult to learn data other than mutually orthogonal data, what has already been learned is often destroyed partway through learning, and convergence is extremely poor.

The object of the present invention is to remedy these points, to enable efficient learning of real-world data (data that are not necessarily orthogonal), and to enable fast convergence.

[Means to solve the problem]

As shown in FIG. 1, in the present invention the neural network consists of an input layer 10, an intermediate layer 12, an output layer 14, and a final output layer 16, and the intermediate layer and the output layer are multi-staged (multiplexed).

The input layer 10 has one stage and contains I = IS × IP points (elements). Here IS is the number of points in the input sequence, and IP is the number of elements in the column (vector) held by one point of the sequence.

The intermediate layer 12 has M stages; here the center stage is labeled $H_0$, the top stage $H_{-(M-1)/2}$, and the bottom stage $H_{(M-1)/2}$. The output layer 14 also has M stages and is labeled in the same way. The final output layer 16 has one stage.

Every point of the input layer is connected to every point of every stage of the intermediate layer. From the intermediate layer to the output layer, connections are made within each stage: every point of that stage's intermediate layer is connected to every point of that stage's output layer, with no connections to other stages. From the output layer to the final output layer, connections are made so as to take a weighted average according to a given rule.
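The connectivity just described can be sketched as M independent two-layer stages that share one input, followed by a weighted mean. This is a simplified illustration, not the patent's implementation: thresholds are omitted, logistic units are assumed, the stage weights are arbitrary placeholders for the unspecified "given rule", and all names are hypothetical.

```python
import math
import random

def logistic(y):
    return 1.0 / (1.0 + math.exp(-y))

def dense(x, W):
    """Fully connected layer, thresholds omitted for brevity."""
    return [logistic(sum(x[i] * W[i][j] for i in range(len(x))))
            for j in range(len(W[0]))]

def make_stage(n_in, n_hidden, n_out, rng):
    """One stage owns its input-to-intermediate and intermediate-to-output weights."""
    W_h = [[rng.gauss(0.0, 1.0) for _ in range(n_hidden)] for _ in range(n_in)]
    W_o = [[rng.gauss(0.0, 1.0) for _ in range(n_out)] for _ in range(n_hidden)]
    return W_h, W_o

def multistage_forward(x, stages, stage_weights):
    """Every stage sees the same input; stages are not connected to one another."""
    outputs = [dense(dense(x, W_h), W_o) for (W_h, W_o) in stages]
    total = sum(stage_weights)
    final = [
        sum(w * out[k] for w, out in zip(stage_weights, outputs)) / total
        for k in range(len(outputs[0]))
    ]
    return outputs, final

rng = random.Random(0)
stages = [make_stage(6, 4, 3, rng) for _ in range(5)]   # M = 5 stages
outputs, final = multistage_forward([1.0, 0.0, 0.0, 0.0, 1.0, 0.0], stages, [1.0] * 5)
```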

[Effect]

In this neural network, the weight coefficients between the input layer 10 and the intermediate layer 12, or between the intermediate layer 12 and the output layer 14, are randomized as necessary. Each stage of the multiplexed intermediate layer 12 and output layer 14 is trained independently, and the final output layer 16 takes a weighted average of the outputs of the stages of the output layer 14 as the final output OT. The learning rules are listed below.

I) In the conventional method, the input layer holds a sequence of a certain length (IS), the output layer outputs the data column (feature vector) corresponding to one point (a time instant, for a time series) of the input sequence (a sequence of feature vectors), and the input sequence and the output are learned as a set. In the present invention, by contrast, the network is multi-staged and the number of output columns is increased by the number of stages. The points of the input sequence that correspond to the output column of each stage are chosen by a selection such as (1) taking mutually adjacent points centered on a certain point, or (2) taking arbitrary points of the input sequence.

II) When learning is performed, either the weight coefficients between the input layer and the intermediate layer or those between the intermediate layer and the output layer are randomized as necessary (for example, normal random values are assigned to the weight coefficients), and the stages are trained independently of one another. The randomized weight coefficients are not the same set from stage to stage; they are random across stages as well. The final output layer takes a weighted average of the output layers of the stages.
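The randomization and per-stage independence described above can be illustrated as follows. The sketch assumes that the intermediate-to-output weights are the randomized set and that they are then held fixed while the input-to-intermediate weights are learned with the conventional rules (17), (18) and (21); the zero initialisation of the learned weights, the unit variance of the random values, and all names are assumptions.

```python
import math
import random

def logistic(y):
    return 1.0 / (1.0 + math.exp(-y))

def make_stage(n_in, n_hidden, n_out, rng):
    """Each stage draws its own normal random W_o; only W_h will be learned."""
    W_h = [[0.0] * n_hidden for _ in range(n_in)]
    W_o = [[rng.gauss(0.0, 1.0) for _ in range(n_out)] for _ in range(n_hidden)]
    return {"W_h": W_h, "W_o": W_o}

def train_stage(stage, x, t, alpha=0.5):
    """One learning step for a single stage; other stages are never touched.

    Returns the squared error of this stage before the update.
    """
    W_h, W_o = stage["W_h"], stage["W_o"]
    z_h = [logistic(sum(x[i] * W_h[i][j] for i in range(len(x))))
           for j in range(len(W_h[0]))]
    z_o = [logistic(sum(z_h[j] * W_o[j][k] for j in range(len(z_h))))
           for k in range(len(W_o[0]))]
    d_o = [(t[k] - z_o[k]) * z_o[k] * (1.0 - z_o[k]) for k in range(len(z_o))]
    d_h = [z_h[j] * (1.0 - z_h[j]) * sum(d_o[k] * W_o[j][k] for k in range(len(d_o)))
           for j in range(len(z_h))]
    for i in range(len(x)):
        for j in range(len(d_h)):
            W_h[i][j] += alpha * d_h[j] * x[i]   # W_o is left as drawn
    return 0.5 * sum((t[k] - z_o[k]) ** 2 for k in range(len(t)))

rng = random.Random(0)
stages = [make_stage(3, 4, 2, rng) for _ in range(3)]
for stage in stages:                 # stages learn independently
    for _ in range(100):
        train_stage(stage, [1.0, 0.0, 1.0], [1.0, 0.0], alpha=0.1)
```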

III) In the learning process, the backward-propagated error at the output layer, $\delta^o_k$, is set to

$$\delta^o_k = (t_k - z^o_k)\, K(\cdot) \qquad (23)$$

and the backward-propagated error at the intermediate layer, $\delta^h_j$, is set to

$$\delta^h_j = K(\cdot) \sum_{k} \delta^o_k\, W^o_{jk} \qquad (24)$$

Here K(·) is a predetermined function. As is clear from equations (17) and (18), δ has singular points at z = 0 and z = 1, where its value becomes 0. Once δ falls to 0 it cannot recover, and no further correction is made. The function K(·) remedies this. Furthermore, for the weight corrections, the correction $\Delta W^o_{jk}$ of the weight $W^o_{jk}$ between the intermediate layer and the output layer is set to

$$\Delta W^o_{jk}(n+1) = \alpha\, \delta^o_k\, L(\cdot) + \beta\, \Delta W^o_{jk}(n) + M(\cdot) \qquad (25)$$

and the correction $\Delta W^h_{ij}$ of the weight $W^h_{ij}$ between the input layer and the intermediate layer is set to

$$\Delta W^h_{ij}(n+1) = \alpha\, \delta^h_j\, L(\cdot) + \beta\, \Delta W^h_{ij}(n) + M(\cdot) \qquad (26)$$

Here L(·) and M(·) are predetermined functions. (When the coefficient randomization of item II) is applied, either $\Delta W^o_{jk}$ or $\Delta W^h_{ij}$ is 0.) However, when item I) or item II) above is applied, the error propagation rule and the weight correction are not limited to these.
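Equations (23) to (26) leave K(·), L(·) and M(·) open, so any concrete form is a guess. The sketch below passes them in as parameters. `k_standard` reproduces the conventional factor z(1 − z); `k_offset`, which adds a small constant so that δ does not stick at 0 when z saturates, is purely a hypothetical example and is not taken from the patent, as is the choice of z as the argument of K(·).

```python
def k_standard(z):
    """K(.) matching equations (17) and (18): vanishes at z = 0 and z = 1."""
    return z * (1.0 - z)

def k_offset(z, c=0.1):
    """Hypothetical K(.): the same curve plus a constant floor c."""
    return z * (1.0 - z) + c

def output_delta(t_k, z_k, K):
    """Equation (23): delta^o_k = (t_k - z^o_k) * K(.)."""
    return (t_k - z_k) * K(z_k)

def hidden_delta(z_j, deltas_o, w_o_row, K):
    """Equation (24): delta^h_j = K(.) * sum_k delta^o_k * W^o_jk."""
    return K(z_j) * sum(d * w for d, w in zip(deltas_o, w_o_row))

def weight_update(delta, l_value, prev_dw, m_value=0.0, alpha=0.5, beta=0.9):
    """Equations (25)/(26): alpha * delta * L(.) + beta * previous correction + M(.)."""
    return alpha * delta * l_value + beta * prev_dw + m_value

# A saturated, wrong output (z = 0 while t = 1) gets no correction with k_standard,
# but still gets one with the hypothetical k_offset.
stuck = output_delta(1.0, 0.0, k_standard)
rescued = output_delta(1.0, 0.0, k_offset)
```

Setting `l_value` to the layer input and `m_value` to 0 recovers the conventional corrections (20) and (22).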

Multiplexing the network and taking the final weighted average improve spatial resolution without sacrificing temporal resolution. Randomization evens out the statistical bias in the weight coefficients that arises during learning (a bias that can destroy what has been learned so far), and therefore evens out the learning accuracy. Moreover, randomizing one set of weight coefficients spreads the converged values of the other set over the whole space of values that the weight coefficients can take, which enables efficient learning of data that are not necessarily orthogonal.

〔Example〕

Embodiments of the neural network of the present invention are described for speech synthesis and speech recognition.

FIG. 2 shows a speech synthesis system comprising a phoneme generation unit 22 and a speech parameter generation unit 24, each of which is provided with a neural network NNW. Input speech 26 is applied to and analyzed by an automatic speech parameter extraction system 20 (disclosed in Japanese Patent Laid-Open Nos. 59-152496 and 59-152497) to obtain speech parameters: sound source parameters such as source power, voiced/unvoiced parameters, and pitch; vocal tract parameters such as vocal tract cross-sectional area and PARCOR coefficients; or AR (all-pole) parameters, AR/MA (pole-zero) parameters, spectra, and other parameters obtained by analyzing speech. These are used as the learning input (target output) of the speech parameter generation unit.

The phonemes of the input speech are also determined, from the speech parameters obtained by the automatic extraction system 20 or from the original waveform, and are used as the learning input (target output) of the phoneme generation unit 22.

The input to the phoneme generation unit 22 is the text (character string) TX underlying the uttered speech (input speech 26). The character string TX, for example "asa hayaku ..." ("early in the morning ..."), is converted into the phoneme sequence IP "A, S, A, H, A, Y, A, K, U, ..." and enters the input layer 10 of the neural network NNW of the phoneme generation unit 22. (Phonemes are input one at a time and shifted out one at a time, so that the input layer always holds a predetermined number of phonemes.) This conversion is performed using the average syllable length, rule synthesis, or phonological knowledge. A character string has none of the temporal element present in speech; that element is added in the conversion from TX to IP. Also, since a phoneme is not determined by a single character, several characters are referred to, and the phonemes are determined one after another at frame intervals. A temporal element is added in this way, but the speaking rate is an average one; it is corrected to the actual rate by learning in the NNW.
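The way phonemes are shifted through the input layer one at a time can be pictured with a fixed-length buffer. The sketch below is illustrative only: the one-hot encoding, the all-zero padding for positions not yet filled, and the names `one_hot` and `sliding_windows` are assumptions rather than details given in the patent.

```python
from collections import deque

def one_hot(symbol, alphabet):
    """Encode one phoneme as a column of 1/0 values, one position per symbol."""
    return [1.0 if s == symbol else 0.0 for s in alphabet]

def sliding_windows(phonemes, alphabet, window_len):
    """Yield the input-layer contents after each phoneme is shifted in.

    The buffer always holds window_len columns; the oldest column drops out
    when a new one arrives.
    """
    blank = [0.0] * len(alphabet)
    buf = deque([blank] * window_len, maxlen=window_len)
    for p in phonemes:
        buf.append(one_hot(p, alphabet))
        yield [list(column) for column in buf]

alphabet = ["A", "S", "H", "Y", "K", "U"]
windows = list(sliding_windows(["A", "S", "A", "H", "A", "Y", "A", "K", "U"], alphabet, 3))
```

Each element of `windows` corresponds to one presentation of the input layer, with `window_len` playing the role of IS and `len(alphabet)` the role of IP.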

As described above, the learning input is phoneme sequence data determined from real speech. The neural network NNW of the phoneme generation unit 22 corrects the input phoneme sequence IP toward the learning-input phoneme sequence and outputs it, and this output OP becomes the input I of the speech parameter generation unit 24. The learning input takes the values 1 and 0, whereas the phoneme output takes values between 0 and 1. If necessary, a threshold may be applied so that the phoneme output takes only the values 0 and 1.

The speech parameter generation unit 24 receives this output, converts it into the speech parameters of the learning input, and outputs them. This output OT is applied to a speech synthesis circuit 28, which outputs the synthesized speech, in this example "early in the morning ...".

FIG. 3 shows the speech parameter generation unit in detail.

The input layer 10 and the final output layer 16 each have one stage; the intermediate layer 12 and the output layer 14 have M stages. The input to the input layer 10 is the phoneme sequence I described above, and IS columns (phonemes), each containing IP points (data), are input as the object of one processing pass. The columns are input one after another; the center column is labeled $t_0$, the columns below it $t_{0+1}$ to $t_{0+i}$, and those above it $t_{0-1}$ to $t_{0-i}$. The arrow indicates the direction in which time t advances.

The number of stages M of the intermediate layer 12 and the output layer 14 should be reasonably large, because with too few stages the randomization and weighted averaging lose their point. For example, M = 9 for IS = 29.

The numbers of elements $J_1, \ldots, J_M$ in the stages of the intermediate layer may differ, but here, without loss of generality, $J_1 = \cdots = J_m = \cdots = J_M = J$. The same holds for the numbers of elements in the stages of the output layer 14; here $K_1 = \cdots = K_m = \cdots = K_M = K$. The points of the input sequence (time instants, in the time series of this example) assigned to the stages of the output layer 14 may be arbitrary points (times) of the input sequence, but again without loss of generality, the M output stages are assumed to take the values at mutually adjacent points centered on a certain point (time) of the sequence. Then:

I) Learning randomly selects IS mutually contiguous columns of data from the input sequence and assigns to the stages, in order, the output data columns (vectors) corresponding to the sequence points extending (M−1)/2 to either side of the center. Each stage is trained on the set consisting of that value (target value) and the input sequence. This randomly selected learning is repeated as many times as necessary. In the final output layer 16, M data are given for each point (time instant) of the sequence, so they are averaged with appropriate weights (for example, given by a Rectangular, Hamming, Hanning, or other window function).
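The window-weighted averaging in the final output layer can be sketched as below. It assumes scalar stage outputs, an odd number of stages M of at least 3, and the common textbook form of the Hamming window; the data layout `stage_preds[c][m]` and the function names are hypothetical.

```python
import math

def rectangular(M):
    return [1.0] * M

def hamming(M):
    """Common textbook Hamming window (requires M >= 2); its peak is at the centre stage."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (M - 1)) for n in range(M)]

def combine_stage_outputs(stage_preds, window):
    """Average the M estimates that different stages give for the same time point.

    stage_preds[c][m] is the output of stage m when the input window is centred
    on time c; stage m is responsible for time c + (m - (M - 1) // 2).
    """
    T = len(stage_preds)
    M = len(window)
    half = (M - 1) // 2
    final = []
    for t in range(T):
        num = 0.0
        den = 0.0
        for m in range(M):
            c = t - (m - half)          # window centre at which stage m targets time t
            if 0 <= c < T:              # near the sequence ends fewer estimates exist
                num += window[m] * stage_preds[c][m]
                den += window[m]
        final.append(num / den)
    return final
```

With M = 9, as in the example above, each interior time point is the weighted mean of nine estimates, and the centre stage carries the largest Hamming weight.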

II) When learning is performed, one of the two sets of weight coefficients, those between the input layer and the intermediate layer or those between the intermediate layer and the output layer, is randomized as necessary (here, without loss of generality, the weight coefficients between the intermediate layer and the output layer), and learning is carried out.

III) In the learning process, the weight coefficients between the intermediate layer and the output layer are randomized, and learning between the input layer and the intermediate layer is performed according to equations (23), (24), and (26). The functions L(·) and M(·) in equations (25) and (26) are either (1) chosen so that the equations take the same form as equations (20) and (22), or (2) set to constants in the initial phase of learning, after which weighting functions are determined experimentally by judging the learning results and taking into account the nature of the input patterns and the convergence of learning.

FIG. 4 outlines an embodiment for speech recognition. Speech recognition is the reverse process of speech synthesis: speech parameters are input sequentially to the neural network NNW, a phoneme sequence is obtained at the output, and a character string is obtained from this phoneme sequence.

That is, the automatic extraction system is applied to the input speech to obtain speech parameters, namely sound source parameters and vocal tract parameters (other parameters, such as AR parameters, spectral parameters, or Walsh-Hadamard or Haar parameters, may also be used). With these parameters as input, the multi-stage neural network NNW is applied to obtain the final output. Conversely to speech synthesis, the final output is a sequence of phoneme parameters taking values between 0 and 1. Because these phoneme parameters have already been weighted and averaged, each output value indicates the likelihood that the input is that phoneme.

Accordingly, concrete methods for finally deciding which phoneme the input speech is include the following. (1) At a given point of the sequence (a time instant, for a time series), take the element with the largest output value among the elements firing simultaneously. (2) Select candidates in descending order of output value and decide on the phonologically most plausible one, for example by an island-driven method. (3) Codify the knowledge gained by applying this system to a large amount of speech data as rules, build an expert system, and use it to decide the phoneme. Once the phoneme sequence has been obtained, it is converted into a character string.
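The first two decision methods amount to picking the largest output or ranking the outputs. A minimal sketch, with hypothetical function names and made-up output values, is shown below; the island-driven search and the expert system of the remaining steps are not modelled.

```python
def best_phoneme(outputs, labels):
    """Method (1): the element with the largest output value at one time point."""
    k = max(range(len(outputs)), key=lambda i: outputs[i])
    return labels[k], outputs[k]

def top_candidates(outputs, labels, n=3):
    """First half of method (2): candidates in descending order of output value."""
    ranked = sorted(zip(labels, outputs), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

labels = ["A", "S", "H"]
outputs = [0.1, 0.7, 0.2]          # illustrative final-output values for one time point
winner = best_phoneme(outputs, labels)
shortlist = top_candidates(outputs, labels, n=2)
```

The shortlist would then be handed to whatever phonological scoring is used to choose among the candidates.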

In this case, the character string can also be determined by constructing a neural network that performs the reverse process of the phoneme generation unit of FIG. 2 (that is, one that takes a phoneme sequence as input and outputs a character string) and following the same procedure as for deciding the phonemes (that is, regarding each output value as the likelihood of that character and applying methods (1), (2), and (3) above).

〔Effect of the invention〕

As described above, the present invention multiplexes the neural network, takes a weighted average as the final output, and randomizes the weight coefficients between the input layer and the intermediate layer or between the intermediate layer and the output layer. Learning of analog information can therefore converge with very high accuracy and quickly, and when the invention is applied to automatic speech synthesis, more natural, higher-quality synthesized speech is obtained.

[Brief explanation of the drawing]

FIG. 1 is a diagram explaining the principle of the present invention. FIGS. 2 to 4 show embodiments of the present invention: FIG. 2 is a diagram of the speech synthesis system, FIG. 3 is a diagram of the speech parameter generation unit, and FIG. 4 is a diagram of the speech recognition system. FIGS. 5 and 6 show the prior art: FIG. 5 is a diagram of an analog neuron element, and FIG. 6 is a diagram of the neural network model. In FIG. 1, 10 is the input layer, 12 is the intermediate layer, 14 is the output layer, 16 is the final output layer, I is the input sequence, and OT is the final output.

Claims

[Claims] 1. A pattern learning and generation method characterized in that: a neural network consisting of an input layer (10), a multi-stage intermediate layer (12), a multi-stage output layer (14), and a final output layer (16) is used; an input sequence (I) is applied to the input layer; the final output layer takes a weighted average of the outputs of the stages of the multi-stage output layer as the final output (O^T); either the weight coefficients between the input layer and the intermediate layer or the weight coefficients between the intermediate layer and the output layer are randomized; and each stage is trained independently.