JPH04247554A

JPH04247554A - Method for reallocating data for multiprocessor and control mechanism therefor

Info

Publication number: JPH04247554A
Application number: JP3013578A
Authority: JP
Inventors: Junichi Takahashi; 淳一高橋
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1991-02-04
Filing date: 1991-02-04
Publication date: 1992-09-03

Abstract

PURPOSE: To provide a data reallocation method and a control mechanism that allocate data on a multiprocessor efficiently and at high speed. CONSTITUTION: In all processing elements, data are read from a first address of the storage means by the reading means (10), and the read data are transferred between the processing elements a prescribed number of times (11). The first-address data transferred to each processing element is then exchanged with the data at a second address and stored in the storage means (13), and the data taken out of the second address is transferred between the processing elements simultaneously a prescribed number of times (14). The second-address data transferred to each processing element is then stored at the first address of the storage means (15). This procedure is repeated a number of times determined by the total number of processing elements.

Description

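[Detailed Description of the Invention]
[0001] [Industrial Field of Application] The present invention relates to a data reallocation method for a multiprocessor made up of a plurality of processing units, and to a control mechanism for it. In particular, it relates to a method and control mechanism that rearrange the data stored in each processing unit by exchanging data between processing units.
[0002] [Prior Art] In a multiprocessor that performs many kinds of parallel processing, when the result of one process is used as input data for another, the results obtained in each processing unit must be reassigned to the processing units so as to suit the next parallel process. This reassignment of data to processing units is called data relocation.
[0003] Conventionally, two main methods have been used for data relocation in such a multiprocessor. In the first, all data in the processing units are collected in the host processor that manages the multiprocessor. The collected data are reordered on the host processor and stored back into the processing units.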
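[0004] The other method does not go through the host processor. It uses the data transfer paths between processing units and relocates the data of each processing unit one item at a time.
[0005] [Problems to Be Solved by the Invention] However, the method that reorders the data on the host processor requires a large volume of data exchange between the processing units and the host, and it also places a heavy reordering load on the host processor. The method that uses the data transfer paths takes a transfer time proportional to the amount of data to be relocated, which increases the overhead on the multiprocessor's processing. In addition, the relocation transfers for each processing unit must be managed individually, which complicates control.
[0006] The present invention has been made in view of the above. Its object is to provide a data reallocation method and control mechanism that carry out data reallocation on a multiprocessor efficiently and at high speed.
[0007] [Means for Solving the Problems] FIG. 1 illustrates the principle of the invention. In a multiprocessor in which a plurality of processing elements are connected in a ring, each processing element is connected to its neighbors through a data transfer path for exchanging data. Each processing element has arithmetic means for performing the desired operations, data transfer means for transferring data between processing elements, storage means for storing addresses and data, and control means for controlling the reading means that reads the stored data. The processing elements are numbered in the order in which they are arranged in the multiprocessor. In all processing elements, the following procedure is executed: data are read from a first address of the storage means by the reading means (10); the read data are transferred between processing elements a prescribed number of times (11); in each processing element, the first-address data that has been transferred in is exchanged with the data at a second address and stored in the storage means (13); the data taken out of the second address is transferred between processing elements simultaneously a prescribed number of times (14); and the second-address data transferred to each processing element is stored at the first address of the storage means. This procedure is executed N/2 times (rounded down, that is, [N/2] times) when the total number N of processing elements is odd, and N/2 - 1 times when N is even. When N is even and the repetition count r equals N/2, all processing elements simultaneously take out the data at the first address, transfer it between processing elements simultaneously for the prescribed count, and each processing element stores the first-address data transferred to it at the second address of the storage means (15).
[0008] The number of inter-element transfers for the data at the first address (h + r) is set to N - r, the value obtained by subtracting h from the second address (h + N - r) that holds the data to be exchanged with it. The number of transfers for the data at the second address (h + N - r) is set to r, the value obtained by subtracting h from the first address (h + r) that holds the data to be exchanged with it. For r = 1, 2, ..., [N/2], the number of simultaneous inter-element transfers of the data taken out of the storage means is counted, and this count controls the transfer of the data to the destination processing element.
[0009] The control mechanism comprises: a first counter that holds the first address; a second counter that holds the second address; a third counter that counts the number of simultaneous inter-element transfers of the data taken out of the storage means; a selector that switches between the output of the first counter and the output of the second counter, the selector output being connected to the input of the third counter; first flag generating means that detects that the value of the third counter equals h and generates a first flag; switching means that makes the selector switch between the outputs of the first and second counters according to the first flag; a first register that holds the control parameter for the number of repetitions; a second register that holds the parameter for determining the parity of the total number of processing elements; second flag generating means that detects a match between the contents of the first register and the first counter and generates a second flag; parity determining means that determines the parity of the total number of processing elements from the least significant bit of the second register; and end detecting means that detects the end of the relocation from the second flag and the least significant bit of the second register.
[0010] [Operation] A plurality (N) of processing elements are connected in a ring and numbered in their order on the multiprocessor. In a multiprocessor having data transfer paths between adjacent processing elements, starting from a state in which first data are stored at consecutive addresses of the storage means of each processing element, the first data and second data are relocated across all N processing elements at once, so that the second data come to be stored at the same consecutive addresses.
[0011] [Embodiments] To simplify understanding of the invention, the learning process of the hidden Markov model method is described first, namely the learning processing in the Forward-Backward Procedure and in the Baum-Welch Re-estimation Formulas.
[0012] An example is described in which the learning process of the HMM (Hidden Markov Models) method, used for pattern recognition of speech, characters and the like, is executed on an array processor, which is one form of multiprocessor.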
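[0013] In pattern recognition with the HMM method, a state transition probability model is assumed, and the occurrence of a speech or character pattern is modeled as the symbol sequence observed through transitions between the states of that model. Learning means estimating the probability parameters of the model from the data of several sample patterns. The learning process of the HMM method uses two algorithms: the Forward-Backward Procedure and the Baum-Welch Re-estimation Formulas. Each algorithm uses the other's results, and the two are repeated until the estimate of the probability model converges.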
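The two algorithms are as follows.
[0014] The Forward-Backward Procedure consists of two algorithms, a forward pass and a backward pass.
[0015] (1) Forward pass algorithm
Initialization: for 1 ≤ i ≤ N,
  α(i, 0) = π(i)   ... (1)
Recurrence: for 1 ≤ i ≤ N, t = 1, 2, ..., T,
[0016] [Math 1]
In the above algorithm, N is the number of processing elements, π(i) is the initial state probability, and α(i, t) is a probability parameter.
[0017] (2) Backward pass algorithm
Initialization: for 1 ≤ i ≤ N,
  β(i, T) = 1 for i ∈ ET, and 0 otherwise   ... (3)
Recurrence: for 1 ≤ i ≤ N, t = T-1, T-2, ..., 0,
[0018] [Math 2]
where c(i, j; t) ≡ a(i, j)·b(i, j; O_t) and
[0019] [Math 3]
[0020] In the above algorithm, β(i, t) is a probability parameter, a(i, j) is the state transition probability, and b(i, j; k) is the symbol output probability.
[0021] The Baum-Welch Re-estimation Formulas consist of three calculations: re-estimation of the initial state probability π(i), of the state transition probability a(i, j), and of the symbol output probability b(i, j; k). Each is shown below. In the notation that follows, π+(i), a+(i, j) and b+(i, j; k) denote the re-estimated values of π(i), a(i, j) and b(i, j; k).
[0022] (1) Re-estimation of the initial state probability
[0023] [Math 4]
(2) Re-estimation of the state transition probability a+(i, j)
[0024] [Math 5]
(3) Re-estimation of the symbol output probability b+(i, j; k)
[0025] [Math 6]
where c(i, j; t) ≡ a(i, j)·b(i, j; O_t),
[0026] [Math 7]
[0027] [Math 8]
[0028] The algorithms of the Forward-Backward Procedure and the Baum-Welch Re-estimation Formulas can be processed in parallel on an array processor in which processing elements (hereinafter PEs) with the required functions are connected in a ring (hereinafter a ring array processor). The ring array processor considered here is described next. FIG. 2 shows the configuration of a ring array processor that executes, by parallel processing, the learning process in HMM-based pattern recognition. The ring array processor consists of PEs 100a, 100b, ..., 100c, 100d, data transfer paths 101 between the PEs, memories 102a, 102b, ..., 102c, 102d managed by the respective PEs, and data input/output paths 103a, 103b, ..., 103c, 103d through which the PEs input and output data. The parallel processing of each of the above algorithms on the ring array processor of FIG. 2 is described below.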
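[0029] (a) Forward-Backward Procedure
[a-1] Forward pass algorithm
FIG. 3 shows the data flow when the forward pass algorithm of the Forward-Backward Procedure is processed in parallel on the ring array processor. The ring array processor in the figure consists of PEs 200a, 200b, ..., 200c, 200d; data transfer paths 201 between the PEs; data input/output paths 202a, 202b, ..., 202c, 202d between each PE and the memory it manages; and the data sequence 203, {α(1, t-1), α(2, t-1), ..., α(i, t-1), ..., α(N, t-1)}, circulated among the PEs. The data sequence 204 is input to each PE from the memory it manages, in synchronization with the circulation of data sequence 203.
[0030] For example, in PE i (1 ≤ i ≤ N) the data sequence CF(i, t) is CF(i, t) = {c(i, i; t), c(mod(i+1|N), i; t), ..., c(mod(i-1|N), i; t)}.
[0031] Here mod(m|N) (m an integer) denotes N when m is an integer multiple of N, and otherwise the remainder of m divided by N. The data sequences CF(i, t) (1 ≤ t ≤ T) are stored in the memory managed by PE i (1 ≤ i ≤ N). This memory is also assumed to hold the initial value α(i, 0) = π(i) of the data α(i, t).
[0032] PE i (1 ≤ i ≤ N) first reads the initial value α(i, 0) = π(i) of equation (1) from its memory through data input/output path 202. Next it inputs the first item c(i, i; 1) of data sequence CF(i, t) from memory, multiplies it by the initial value α(i, 0), and temporarily holds the product α(i, 0)·c(i, i; 1) in a storage area inside the PE. In the next stage, PE i sends the initial value α(i, 0) to the next PE and at the same time receives the initial value α(mod(i+1|N), 0) from the preceding PE, while also inputting the second item c(mod(i+1|N), i; 1) of CF(i, t) from memory. It then multiplies the transferred data α(mod(i+1|N), 0) by c(mod(i+1|N), i; 1) and adds the product to the earlier product held in the PE (a multiply-accumulate):
  α(mod(i+1|N), 0)·c(mod(i+1|N), i; 1) + α(i, 0)·c(i, i; 1)
The sum is again held temporarily in the storage area inside the PE.
[0033] This is repeated until the data α(i, 0) transferred between the PEs have made one full circuit of the ring array processor, with the multiply-accumulate executed each time and its result held in the PE. When the circulated data α(i, 0) have gone once around the ring, PE i (1 ≤ i ≤ N) has obtained α(i, 1) for time t = 1. The calculation of α(i, 2) for t = 2 then proceeds in exactly the same way, with α(i, 1) replacing α(i, 0) as the circulated data and the items of CF(i, t) for t = 2 input from memory in step. The same applies to t = 3, ..., T. The results α(i, t) for t = 1, 2, ..., T are stored successively in the memory managed by PE i (1 ≤ i ≤ N).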
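The circulating multiply-accumulate described for the forward pass can be illustrated with a small simulation. This is only a sketch under stated assumptions: the ring is modelled as a Python list, the recurrence is taken to be α(i, t) = Σ_j α(j, t-1)·c(j, i; t) as implied by the multiply-accumulate description (the equation image itself is not reproduced in this text), and the names `mod1` and `ring_forward_step` are hypothetical helpers, not part of the patent.

```python
def mod1(m, n):
    """mod(m|N) as defined in the text: N if m is a multiple of N, else m mod N."""
    r = m % n
    return n if r == 0 else r


def ring_forward_step(alpha_prev, c, n):
    """Simulate one time step of the forward pass on a ring of n PEs.

    alpha_prev[i] is alpha(i, t-1) held by PE i (1-based, index 0 unused).
    c[j][i] stands for c(j, i; t). Returns acc with acc[i] = alpha(i, t).
    """
    acc = [0.0] * (n + 1)      # accumulator inside each PE
    held = list(alpha_prev)    # value currently held by each PE
    for k in range(n):
        for i in range(1, n + 1):
            src = mod1(i + k, n)            # PE i now holds alpha(src, t-1)
            acc[i] += held[i] * c[src][i]   # k-th item of CF(i, t)
        # Simultaneous transfer: PE i receives what PE mod(i+1|N) was holding.
        held = [0.0] + [held[mod1(i + 1, n)] for i in range(1, n + 1)]
    return acc
```

After n transfers every value has made one full circuit, and each accumulator holds the complete sum for its own PE, which is the property the parallel method relies on.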
[0034] [a-2] Backward pass algorithm
FIG. 4 shows the data flow when the backward pass algorithm of the Forward-Backward Procedure is processed in parallel on the ring array processor. The ring array processor in the figure consists of PEs 300a, 300b, ..., 300c, 300d; data transfer paths 301 between the PEs; data input/output paths 302a, 302b, ..., 302c, 302d between each PE and the memory it manages; and the data sequence 303, {β(1, t+1), β(2, t+1), ..., β(i, t+1), ..., β(N, t+1)}, circulated among the PEs. The data sequence 304 is input to each of the PEs 300a, 300b, ..., 300c, 300d from its memory in synchronization with the circulation of data sequence 303. In PE i (1 ≤ i ≤ N), the data sequence CB(i, t+1) (0 ≤ t ≤ T-1) is CB(i, t+1) = {c(i, i; t+1), c(i, mod(i+1|N); t+1), ..., c(i, mod(i-1|N); t+1)}. These sequences are stored in the memory managed by PE i, which is also assumed to hold the initial value β(i, T) of the data β(i, t).
[0035] The parallel processing of the backward pass is exactly the same as that of the forward pass, with β(i, t+1) as the circulated data and CB(i, t+1) as the sequence input from memory to PE i (1 ≤ i ≤ N). That is, PE i first reads the initial value β(i, T) of equation (3) from its memory through data input/output path 302. It then inputs the first item c(i, i; T) of CB(i, t+1) from memory, multiplies it by β(i, T), and temporarily holds the product β(i, T)·c(i, i; T) in a storage area inside the PE. Next, PE i sends β(i, T) to the next PE while receiving β(mod(i+1|N), T) from the preceding PE, and at the same time inputs the second item c(i, mod(i+1|N); T) of CB(i, t+1) from memory. It multiplies β(mod(i+1|N), T) by c(i, mod(i+1|N); T) and adds the product to the earlier product held in the PE (a multiply-accumulate):
[0036] β(mod(i+1|N), T)·c(i, mod(i+1|N); T) + β(i, T)·c(i, i; T)
The sum is held temporarily in the storage area inside the PE.
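Thereafter all PEs repeat this simultaneous inter-PE transfer and memory input until the data β(i, T) have made one full circuit of the PEs of the ring array processor, executing the multiply-accumulate each time and holding the result inside the PE. When the circulated data β(i, T) have gone once around the ring, PE i (1 ≤ i ≤ N) has obtained β(i, T-1) for time t = T-1. The calculation of β(i, T-2) for t = T-2 then circulates β(i, T-1) in place of β(i, T) while inputting the items of CB(i, t+1) for t = T-2 from memory, in exactly the same way as for t = T-1. The same applies to t = T-3, ..., 0. The results β(i, t) for t = T-1, T-2, ..., 0 are stored successively in the memory managed by PE i (1 ≤ i ≤ N).
[0037] When the forward and backward pass algorithms of the Forward-Backward Procedure are processed in parallel as described, the following results α(i, t) and β(i, t) are obtained in the memory of each PE.
[0038] Results stored in the memory managed by PE i (1 ≤ i ≤ N):
  α(i, 0), α(i, 1), ..., α(i, t), ..., α(i, T); β(i, T), β(i, T-1), ..., β(i, t), ..., β(i, 0)
The results are listed in the order in which they are obtained.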
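[0039] (b) Baum-Welch Re-estimation Formulas
[b-1] Parallel processing of the re-estimation of the initial state probability
First, the parallel processing of the re-estimation of the initial state probability π(i) in the Baum-Welch Re-estimation Formulas is described.
[0040] FIG. 5 shows the data flow when this re-estimation is processed in parallel on the ring array processor. The ring array processor consists of PEs 400a, 400b, ..., 400c, 400d; data transfer paths 401 between the PEs; data input/output paths 402 between each PE and the memory it manages; the data sequence 403, {α(1, 0)·β(1, 0), α(2, 0)·β(2, 0), ..., α(i, 0)·β(i, 0), ..., α(N, 0)·β(N, 0)}, circulated among the PEs; and the data sequence 404 input from memory through the data input/output paths 402. In PE i (1 ≤ i ≤ N), data sequence 404 is D(i, 0) = {α(i, 0), β(i, 0)}, which is assumed to be stored in the memory managed by PE i.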
[0041] In this parallel processing, PE i (1 ≤ i ≤ N) first inputs from memory, through data input/output path 402, the data sequence D(i, 0) = {α(i, 0), β(i, 0)} needed to compute the numerator of equation (5). Using the two items α(i, 0) and β(i, 0), the PEs compute the numerator product α(i, 0)·β(i, 0) in parallel. The denominator P(O|λ), when expressed in terms of α(i, 0) and β(i, 0), equals the sum of the numerators computed in parallel by the PEs. The denominator is therefore obtained by circulating the products α(i, 0)·β(i, 0) among the PEs until they have made one full circuit of the ring array processor, with every PE accumulating the transferred data in parallel. Dividing the numerator obtained in each PE by the denominator then yields the re-estimated initial state probabilities π+(i) (1 ≤ i ≤ N) simultaneously in the PEs i (1 ≤ i ≤ N). Each π+(i) is stored in the memory managed by PE i.
[0042] [b-2] Parallel processing of the re-estimation of the state transition probability
Next, the parallel processing of the re-estimation of the state transition probability a(i, j) on the ring array processor is described. FIG. 6 shows the data flow. Data transfer paths 501 are provided between the PEs 500a, 500b, ..., 500c, 500d. Data input/output paths 502a, 502b, ..., 502c, 502d carry data between each PE and the memory it manages. Data sequence 503 is the sequence circulated among the PEs; in the example of FIG. 6 it is {β(1, t), β(2, t), ..., β(i, t), ..., β(N, t)}. The first data sequence 504 is input from memory through data input/output path 502; in PE i (1 ≤ i ≤ N) it is D(i, t) = {α(i, t), β(i, t)} (0 ≤ t ≤ T). The second data sequence 505 is likewise input from memory; in PE i it is CB(i, t) = {c(i, i; t), c(i, mod(i+1|N); t), ..., c(i, mod(i-1|N); t)}. Both sequences are assumed to be stored in the memory managed by the PE.
[0043] In this parallel processing, to compute the denominator of equation (6), the data sequence D(i, t) = {α(i, t), β(i, t)} (0 ≤ t ≤ T) is first input from memory to PE i (1 ≤ i ≤ N). Using its two items α(i, t) and β(i, t), PE i executes the sum of products over time t,
  Σ α(i, t)·β(i, t)
to obtain the denominator. All PEs do this in parallel. For the numerator, PE i circulates the data β(i, t) of D(i, t) until it has passed through all PEs, inputs the sequence CB(i, t) from memory in step, and at each step multiplies three terms in parallel: the circulated data arriving at PE i, the current item of CB(i, t), and the data α(i, t-1) of the previously input sequence D(i, t). This gives PE i the time-t terms to be accumulated for the numerators of the combinations (i, j), j = 1, 2, ..., N. Updating t, repeating the three-term multiplication, and accumulating the results for each (i, j) over t yields the numerators. All of this is done in parallel by the PEs. Dividing numerator by denominator in parallel then gives PE i the re-estimated values a+(i, j) of the state transition probability a(i, j) for j = 1, 2, ..., N.
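[0044] As this description shows, the numerator can instead be computed with α(i, t-1) as the data transferred between PEs and with the second data sequence 505 input to PE i (1 ≤ i ≤ N) set to CF(i, t) = {c(i, i; t), c(mod(i+1|N), i; t), c(mod(i+2|N), i; t), ..., c(mod(i-1|N), i; t)}. If β(i, t) is then taken from D(i, t) = {α(i, t), β(i, t)} (0 ≤ t ≤ T) as the third factor of the multiplication, PE i obtains the numerators for the combinations (j, i), j = 1, 2, ..., N.
[0045] In this case the re-estimated values a+(j, i) of the state transition probability a(j, i) for j = 1, 2, ..., N are obtained in PE i by circulating the denominators among the PEs until they have passed through all PEs, and dividing each numerator by the denominator that arrives at the corresponding step.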
[0046] [b-3] Parallel processing of the re-estimation of the symbol output probability
Next, the parallel processing of the re-estimation of the symbol output probability b(i, j; k) on the ring array processor is described. FIG. 7 shows the data flow. Data transfer paths 601 are provided between the PEs 600a, 600b, ..., 600c, 600d. Data input/output paths 602a, 602b, ..., 602c, 602d carry data between each PE and the memory it manages. Data sequence 603 is the sequence circulated among the PEs; in the example of FIG. 7 it is {β(1, t), β(2, t), ..., β(i, t), ..., β(N, t)}. The first data sequence 604 is input from memory through the data input/output paths; in PE i (1 ≤ i ≤ N) it is D(i, t) = {α(i, t), β(i, t)} (0 ≤ t ≤ T). The second data sequence 605 is also input from memory through the data input/output paths; in PE i the items of
  CB(i, t) = {c(i, i; t), c(i, mod(i+1|N); t), ..., c(i, mod(i-1|N); t)}
and
  GB(i, t) = {g(i, i; t), g(i, mod(i+1|N); t), ..., g(i, mod(i-1|N); t)}
are input in pairs, one from each. That is, PE i receives {c(i, i; t), g(i, i; t)}, {c(i, mod(i+1|N); t), g(i, mod(i+1|N); t)}, ..., {c(i, mod(i-1|N); t), g(i, mod(i-1|N); t)} in this order. The first data sequence 604 and the sequences CB(i, t) and GB(i, t) that make up the second data sequence 605 are assumed to be stored in the memory managed by the PE.
[0047] The items of the sequence GB(i, t) are defined, using the parameter u(t; k) that expresses the similarity between symbol O_t and reference symbol k, as
  g(i, j; t) = c(i, j; t)·u(t; k)
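[0048] As equation (7) shows, the denominator and numerator in the re-estimation of the symbol output probability b(i, j; k) are almost identical. The only difference is that the numerator carries the additional condition O_t = k on the symbol O_t. Both are also exactly equivalent to the numerator in the re-estimation of the state transition probability a(i, j). The parallel method described above for that numerator can therefore be applied as it is.
[0049] The parallel computation of the denominator and numerator is described with reference to FIG. 7. First, PE i (1 ≤ i ≤ N) inputs the first data sequence D(i, t) = {α(i, t), β(i, t)} (0 ≤ t ≤ T) from memory through the data input/output paths 602a, 602b, ..., 602c, 602d. Then, while the data β(i, t) are circulated among the PEs until they have gone once around, PE i inputs from memory, in step, the second data sequence
  {c(i, i; t), g(i, i; t)}, {c(i, mod(i+1|N); t), g(i, mod(i+1|N); t)}, {c(i, mod(i+2|N); t), g(i, mod(i+2|N); t)}, ..., {c(i, mod(i-1|N); t), g(i, mod(i-1|N); t)}.
At each step PE i uses the circulated data, the two items of the second data sequence, and the data α(i, t-1) of the first data sequence: for the denominator of equation (7) it multiplies the three terms α, c, β, and for the numerator the three terms α, g, β. All PEs do this in parallel, so PE i obtains the time-t terms to be accumulated for the denominator and numerator of the combinations (i, j), j = 1, 2, ..., N. Updating t, repeating the three-term multiplications, and accumulating the results over t yields the denominators and numerators. These operations run in parallel across the PEs. Dividing numerator by denominator in parallel then gives PE i the re-estimated symbol output probabilities b+(i, j; k) for j = 1, 2, ..., N.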
[0050] As this description shows, the computation can instead use α(i, t-1) as the data transferred between the PEs, with the second data sequence input to PE i (1 ≤ i ≤ N) formed by pairing the items of
  CF(i, t) = {c(i, i; t), c(mod(i+1|N), i; t), c(mod(i+2|N), i; t), ..., c(mod(i-1|N), i; t)}
and
  GF(i, t) = {g(i, i; t), g(mod(i+1|N), i; t), g(mod(i+2|N), i; t), ..., g(mod(i-1|N), i; t)},
that is, {c(i, i; t), g(i, i; t)}, {c(mod(i+1|N), i; t), g(mod(i+1|N), i; t)}, {c(mod(i+2|N), i; t), g(mod(i+2|N), i; t)}, ..., {c(mod(i-1|N), i; t), g(mod(i-1|N), i; t)}. If β(i, t) is then taken from D(i, t) as the third factor of the multiplication, PE i obtains the re-estimated symbol output probabilities b+(j, i; k) for the combinations (j, i), j = 1, 2, ..., N.
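[0051] Once PE i (1 ≤ i ≤ N) has obtained, for j = 1, 2, ..., N, the re-estimated state transition probabilities a+(i, j) (or a+(j, i)) and symbol output probabilities b+(i, j; k) (or b+(j, i; k)), it uses the similarity parameter u(t; k) between symbol O_t and reference symbol k to execute the sum of products over k,
  Σ u(t; k)·b+(i, j; k) or Σ u(t; k)·b+(j, i; k),
which gives the re-estimated symbol output probability b+(i, j; O_t) (or b+(j, i; O_t)) for symbol O_t. Multiplying this by a+(i, j) (or a+(j, i)) gives the re-estimated value c+(i, j; t) (or c+(j, i; t)) of the data c(i, j; t) (or c(j, i; t)).
[0052] The flow for obtaining the re-estimated values is:
  b+(i, j; O_t) = Σ u(t; k)·b+(i, j; k), or b+(j, i; O_t) = Σ u(t; k)·b+(j, i; k)
then
  c+(i, j; t) = b+(i, j; O_t)·a+(i, j), or c+(j, i; t) = b+(j, i; O_t)·a+(j, i)
The results are stored in the memory managed by the PE.
[0053] The re-estimated value g+(i, j; t) (or g+(j, i; t)) of the data g(i, j; t) (or g(j, i; t)) is obtained by multiplying the product u(t; k)·b+(i, j; O_t) (or u(t; k)·b+(j, i; O_t)) by the re-estimated state transition probability a+(i, j) (or a+(j, i)):
  g+(i, j; t) = {u(t; k)·b+(i, j; O_t)}·a+(i, j)
  g+(j, i; t) = {u(t; k)·b+(j, i; O_t)}·a+(j, i)
The results are stored in the memory managed by the PE.
[0054] When the three re-estimation calculations of the Baum-Welch Re-estimation Formulas are processed in parallel as described, the results are distributed over the PEs of the ring array processor as follows.
[0055] Results stored in the memory managed by PE i (1 ≤ i ≤ N):
(a) When the data transferred between PEs is α(i, t):
  π+(i);
  for 1 ≤ t ≤ T: c+(i, i; t), c+(mod(i+1|N), i; t), ..., c+(N, i; t), c+(1, i; t), c+(2, i; t), ..., c+(mod(i-1|N), i; t);
  for 1 ≤ t ≤ T: g+(i, i; t), g+(mod(i+1|N), i; t), ..., g+(N, i; t), g+(1, i; t), g+(2, i; t), ..., g+(mod(i-1|N), i; t).
(b) When the data transferred between PEs is β(i, t):
  π+(i);
  for 1 ≤ t ≤ T: c+(i, i; t), c+(i, mod(i+1|N); t), ..., c+(i, N; t), c+(i, 1; t), c+(i, 2; t), ..., c+(i, mod(i-1|N); t);
  for 1 ≤ t ≤ T: g+(i, i; t), g+(i, mod(i+1|N); t), ..., g+(i, N; t), g+(i, 1; t), g+(i, 2; t), ..., g+(i, mod(i-1|N); t).
In (a) and (b) the results are listed in the order in which they are obtained.
[0056] (C) Data relocation required for the learning process
The data relocation required for the learning process follows from the parallel methods described so far for the Forward-Backward Procedure and the Baum-Welch Re-estimation Formulas on the ring array processor, and from the distribution of results they leave in the memory of each PE.
[0057] The learning process, which repeats the Forward-Backward Procedure and the Baum-Welch Re-estimation Formulas using each other's results, amounts to the following. First, initial values of the initial state probability π(i), the state transition probability a(i, j) and the symbol output probability b(i, j; k) are chosen appropriately, and the Forward-Backward Procedure computes the probability parameters α(i, t) and β(i, t). Then, from these three initial values and the two parameters α(i, t) and β(i, t), the Baum-Welch Re-estimation Formulas re-estimate π(i), a(i, j) and b(i, j; k), giving π+(i), a+(i, j) and b+(i, j; k). If the re-estimates differ from the initial values, the initial values are replaced by the re-estimates and both procedures are run again. This is repeated until the re-estimates obtained by the Baum-Welch Re-estimation Formulas agree with the probability values used in the Forward-Backward Procedure.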
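[0058] From this it follows that, during the iteration, the data π(i), a(i, j) and b(i, j; k) used by the Forward-Backward Procedure are the values π+(i), a+(i, j) and b+(i, j; k) produced by the Baum-Welch Re-estimation Formulas, while the data used by the Baum-Welch Re-estimation Formulas are the π(i), a(i, j) and b(i, j; k) used by the Forward-Backward Procedure together with the α(i, t) and β(i, t) it produces.
[0059] The purpose of data relocation is therefore twofold: to make the data distribution left in the PE memories after the parallel Forward-Backward Procedure suit the initial data distribution of the parallel Baum-Welch Re-estimation Formulas, and to make the distribution left after the parallel Baum-Welch Re-estimation Formulas suit the initial data distribution of the parallel Forward-Backward Procedure.
[0060] Based on the descriptions above, the data distribution required in the memory managed by PE i (1 ≤ i ≤ N) for each parallel process is summarized below.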
[0061] [C-1] Data distribution for the forward pass algorithm of the Forward-Backward Procedure
Data used: α(i, 0) = π(i); CF(i, t) = {c(i, i; t), c(mod(i+1|N), i; t), ..., c(mod(i-1|N), i; t)} (1 ≤ t ≤ T)
Data obtained: {α(i, 1), ..., α(i, t), ..., α(i, T)}
[C-2] Data distribution for the backward pass algorithm of the Forward-Backward Procedure
Data used: β(i, T); CB(i, t) = {c(i, i; t), c(i, mod(i+1|N); t), ..., c(i, mod(i-1|N); t)} (1 ≤ t ≤ T)
Data obtained: {β(i, T-1), ..., β(i, t), ..., β(i, 0)}
[C-3] Data distribution for the re-estimation calculations of the Baum-Welch Re-estimation Formulas
(1) When the data circulated between PEs is α(i, t)
Data used: {α(i, 0), α(i, 1), ..., α(i, t), ..., α(i, T)}; {β(i, 0), β(i, 1), ..., β(i, t), ..., β(i, T)};
  CF(i, t) = {c(i, i; t), c(mod(i+1|N), i; t), ..., c(mod(i-1|N), i; t)} (1 ≤ t ≤ T);
  GF(i, t) = {g(i, i; t), g(mod(i+1|N), i; t), ..., g(mod(i-1|N), i; t)} (1 ≤ t ≤ T)
Data obtained: π+(i);
  CF+(i, t) = {c+(i, i; t), c+(mod(i+1|N), i; t), ..., c+(mod(i-1|N), i; t)} (1 ≤ t ≤ T);
  GF+(i, t) = {g+(i, i; t), g+(mod(i+1|N), i; t), ..., g+(mod(i-1|N), i; t)} (1 ≤ t ≤ T)
(2) When the data circulated between PEs is β(i, t)
Data used: {α(i, 0), α(i, 1), ..., α(i, t), ..., α(i, T)}; {β(i, 0), β(i, 1), ..., β(i, t), ..., β(i, T)};
  CB(i, t) = {c(i, i; t), c(i, mod(i+1|N); t), ..., c(i, mod(i-1|N); t)} (1 ≤ t ≤ T);
  GB(i, t) = {g(i, i; t), g(i, mod(i+1|N); t), ..., g(i, mod(i-1|N); t)} (1 ≤ t ≤ T)
Data obtained: π+(i);
  CB+(i, t) = {c+(i, i; t), c+(i, mod(i+1|N); t), ..., c+(i, mod(i-1|N); t)} (1 ≤ t ≤ T);
  GB+(i, t) = {g+(i, i; t), g+(i, mod(i+1|N); t), ..., g+(i, mod(i-1|N); t)} (1 ≤ t ≤ T)
These summaries of the data distributions show the following.
[0062] (1) The combination of the data distributions obtained from the parallel forward and backward passes of the Forward-Backward Procedure suits the data distribution used by the parallel Baum-Welch Re-estimation Formulas.
[0063] (2) The sequence CF(i, t) used in the parallel forward pass of the Forward-Backward Procedure is the same as the sequence CF(i, t) used by the parallel Baum-Welch Re-estimation Formulas when the circulated data is α(i, t).
[0064] (3) The sequence CB(i, t) used in the parallel backward pass of the Forward-Backward Procedure is the same as the sequence CB(i, t) used by the parallel Baum-Welch Re-estimation Formulas when the circulated data is β(i, t).
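[0065] (4) The sequence CF+(i, t) obtained from the parallel Baum-Welch Re-estimation Formulas when the inter-PE transfer data is α(i, t) is equivalent to the sequence CF(i, t) used in the parallel forward pass of the Forward-Backward Procedure. Relative to the sequence CB(i, t) used in the parallel backward pass, however, the (x, y) indices of its items are transposed.
[0066] (5) The sequence CB+(i, t) obtained from the parallel Baum-Welch Re-estimation Formulas when the inter-PE transfer data is β(i, t) is equivalent to the sequence CB(i, t) used in the parallel backward pass of the Forward-Backward Procedure. Relative to the sequence CF(i, t) used in the parallel forward pass, however, the (x, y) indices of its items are transposed.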
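The transposition relation stated in (4) and (5) can be made concrete by listing the (x, y) indices each PE holds under the two orderings. The sketch below is illustrative only: `mod1`, `cb_indices` and `cf_indices` are hypothetical helper names, and the time index t is dropped because it is the same for every item.

```python
def mod1(m, n):
    """mod(m|N) as defined in the text: N if m is a multiple of N, else m mod N."""
    r = m % n
    return n if r == 0 else r


def cb_indices(i, n):
    """(x, y) indices of CB(i, t) in storage order: c(i, i), c(i, i+1), ..."""
    return [(i, mod1(i + k, n)) for k in range(n)]


def cf_indices(i, n):
    """(x, y) indices of CF(i, t) in storage order: c(i, i), c(i+1, i), ..."""
    return [(mod1(i + k, n), i) for k in range(n)]


if __name__ == "__main__":
    n = 4
    for i in range(1, n + 1):
        print(i, cb_indices(i, n), cf_indices(i, n))
```

For every PE and every position, the CF entry is the CB entry with x and y swapped, so converting one distribution into the other is a transposition spread across the PEs, which is why elements have to move between PEs.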
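[0067] Thus, whichever of α(i, t) or β(i, t) is chosen as the inter-PE transfer data, the sequences produced by the parallel Baum-Welch Re-estimation Formulas match the sequences needed by only one of the forward pass and the backward pass of the Forward-Backward Procedure. To execute the other, the data of the sequences obtained in each PE must be relocated. As (4) and (5) show, this relocation is the mutual conversion between the sequences CF(i, t) (or CF+(i, t)) (1 ≤ i ≤ N) and CB(i, t) (or CB+(i, t)) (1 ≤ i ≤ N), whose items have transposed (x, y) indices.
[0068] FIG. 8 shows this conversion, that is, the data relocation required for the learning process. It lists the (x, y) indices of the items of the sequences CF(i, t) (or CF+(i, t)) and CB(i, t) (or CB+(i, t)) stored in the memories of all PEs. The time index t is the same for all items and is omitted. Data distribution P is made up of the items of CB(i, t) (or CB+(i, t)), and data distribution Q of the items of CF(i, t) (or CF+(i, t)). The items of these sequences for a given time t are assumed to be stored at consecutive addresses in each memory (addresses h to h + N - 1 in the example of FIG. 8).
[0069] The relocation method is described with reference to FIG. 8. Treat each of the distributions P and Q as a matrix, and treat the position of an item within a PE as its address. The following relates, for the same element in P and in Q, the number of the PE that holds it and its address in that PE.
[0070] For matrix P, the elements held in PE i (1 ≤ i ≤ N) and their addresses are:
  Address      Element of matrix P
  h            (i, i)
  h + 1        (i, mod(i+1|N))
  h + 2        (i, mod(i+2|N))
  ...          ...
  h + N - 1    (i, mod(i-1|N))
[0071] Let the index of an element of P be (i, j) and its address be adr(PE i; P), and express adr(PE i; P) in terms of the element's x-index and y-index. The y-indices of the elements held in PE i increase one at a time from i until they reach N, and then from 1 until they reach i - 1. For i ≤ j (≤ N), j belongs to the first run. Since the element with y-index i has address h, the element with y-index j has address j - i + h. For 1 ≤ j < i, j belongs to the run that increases from 1 to i - 1. The element with y-index N has address (N - i + h), so the element with y-index 1 has address (N - i + h + 1). Hence the element with y-index j in the range 1 ≤ j < i has address (N - i + h + j).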
[0072] The address of element (i, j) of matrix P can therefore be written
  adr(PE i; P) = j - i + h        for i ≤ j ≤ N
               = N - i + h + j    for 1 ≤ j < i    ... (8)
[0073] Next, for matrix Q, the PE number and address that hold the same element are derived, to establish how the PE numbers and addresses of identical elements in P and Q are related.
[0074] For matrix Q, the elements held in PE i (1 ≤ i ≤ N) and their addresses are:
  Address      Element of matrix Q
  h            (i, i)
  h + 1        (mod(i+1|N), i)
  h + 2        (mod(i+2|N), i)
  ...          ...
  h + N - 1    (mod(i-1|N), i)
[0075] In Q the PE number equals the y-index of the elements the PE holds, so the element of P with y-index j is held in PE j in Q. Replacing i by j in the table above, the x-indices of the elements held in PE j start at j and increase one at a time until they reach N, and then from 1 to j - 1. For j ≤ i (≤ N), the x-index i belongs to the run from j to N, so the element with x-index i has address i - j + h. For 1 ≤ i < j, i belongs to the run from 1 to j - 1, so the element with x-index i has address N - j + h + i.
[0076] The address adr(PE j; Q) of element (i, j) of matrix Q can therefore be written
  adr(PE j; Q) = i - j + h        for j ≤ i ≤ N
               = N - j + h + i    for 1 ≤ i < j    ... (9)
[0077] Summarizing, equations (8) and (9) give, for the same element (i, j) of P and Q, the PE that holds it and its address in that PE, as set out in Table 1 below.
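Equations (8) and (9) translate directly into two small functions, which makes it easy to check them against the address tables for P and Q. This is a sketch only; `adr_p`, `adr_q` and `mod1` are hypothetical names, and the base address `h` defaults to 0 purely for convenience.

```python
def mod1(m, n):
    """mod(m|N) as defined in the text: N if m is a multiple of N, else m mod N."""
    r = m % n
    return n if r == 0 else r


def adr_p(i, j, n, h=0):
    """Equation (8): address of element (i, j) inside PE i under distribution P."""
    return j - i + h if i <= j else n - i + h + j


def adr_q(i, j, n, h=0):
    """Equation (9): address of element (i, j) inside PE j under distribution Q."""
    return i - j + h if j <= i else n - j + h + i


if __name__ == "__main__":
    n, h = 6, 100
    for i, j in [(2, 5), (5, 2), (3, 3)]:
        print((i, j), "P: PE", i, "addr", adr_p(i, j, n, h),
              "| Q: PE", j, "addr", adr_q(i, j, n, h))
```

For i ≠ j the two address offsets always add up to N, which is the pairing of address (k + h) with address (N - k + h) that the relocation exploits.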
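[0078] [Table 1]
Next, the results of Table 1 are used to derive the method of relocating elements from matrix P to matrix Q, or from Q to P.
[0079] As Table 1 shows, the PE number and address that hold element (i, j) depend on which of its x-index and y-index is larger. The relation between the index ordering and the address is described for the elements of P. FIG. 9 is a graph of the source address K before relocation, the destination address K' and the inter-PE distance L against the difference (j - i) between the y-index and x-index of an element. The horizontal axis is the index difference (j - i) and the vertical axis is the address. Line 81 is the source address K, line 82 the destination address K', and line 83 the inter-PE distance L. Depending on the ordering of i and j, the domain of (j - i) splits into three: {-(N - 1), ..., -1}, {0} and {1, ..., N - 1}, where (j - i) takes integer values in these ranges. The relocation conditions for elements in the three domains are as follows.
[0080] (1) Case i = j:
Element (i, j) of P is held in PE i at address h. Element (i, j) of Q is held in PE i (since j = i), also at address h. The elements at address h of P and Q are thus held in the same PE, and no relocation is needed.
[0081] (2) Case i < j:
Element (i, j) of P is held in PE i at address (j - i + h). Element (i, j) of Q is held in PE j at address {N - (j - i) + h}.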
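For elements in this domain, the relocation from P to Q must therefore move the element at address (j - i + h) of PE i to address {N - (j - i) + h} of PE j.
[0082] (3) Case i > j:
Element (i, j) of P is held in PE i at address {N - (i - j) + h}. Element (i, j) of Q is held in PE j at address (i - j + h). For elements in this domain, the relocation from P to Q must therefore move the element at address {N - (i - j) + h} of PE i to address (i - j + h) of PE j.
[0083] These conditions show that data relocation requires converting both the PE number and the address at which each element is held. On the ring array processor, relocation must therefore proceed as: (take the element out of its source address) → (transfer it to the destination PE) → (store it at the destination address). The transfer to the destination PE differs for each condition and is described next. Since no relocation is needed when i = j, only i ≠ j is considered. The number of PEs traversed in the inter-PE transfer during relocation is defined as the inter-PE distance L.
[0084] (1) Case i < j:
Data on the ring array processor move from higher to lower PE numbers, so a transfer from PE i to PE j must follow the route PE i → PE 1 → PE N → PE j. The distance from PE i to PE 1 is (i - 1), from PE 1 to PE N is 1, and from PE N to PE j is (N - j), so L = (i - 1) + 1 + (N - j) = N - (j - i).
[0085] (2) Case i > j:
The transfer goes directly from PE i to PE j, so L = i - j.
[0086] Putting these results together with the graph of FIG. 9, which relates the inter-PE distance L and the destination address K' to the index difference (j - i), gives the following.
[0087] (1) The element group {(i, j) | j - i = k} and the element group {(i, j) | j - i = k - N} have the same address, k + h (1 ≤ k ≤ N - 1).
[0088] The relation between the x- and y-indices of the elements held at address (k + h) is as follows. FIG. 10 shows this relation. The group {(i, j) | j - i = k} corresponds to the lattice points on line 90, whose j-intercept is k, and the group {(i, j) | j - i = k - N} to the lattice points on line 91, whose i-intercept is N - k. Given the domain 1 ≤ i, j ≤ N, the former line has (N - k) lattice points and the latter has k. The number of elements at the same address (k + h) is therefore N. This holds for every k, and the elements for different k are mutually exclusive.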
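The lattice-point count above is easy to confirm by enumeration. The following sketch uses a hypothetical helper, `groups_at_address`, that lists the two element groups for a given k; it assumes distribution P, where the x-index of an element is the number of the PE that holds it.

```python
def groups_at_address(n, k):
    """Elements (i, j), 1 <= i, j <= n, stored at address k + h under P.

    Returns the group with j - i = k and the group with j - i = k - n.
    """
    upper = [(i, i + k) for i in range(1, n - k + 1)]          # j - i = k
    lower = [(i, i + k - n) for i in range(n - k + 1, n + 1)]  # j - i = k - n
    return upper, lower


if __name__ == "__main__":
    n = 6
    for k in range(1, n):
        upper, lower = groups_at_address(n, k)
        print(k, len(upper), len(lower), upper + lower)
```

For each k the two groups together contain exactly one element per PE, which is what allows all of them to be read out and shifted around the ring at the same time.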
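[0089] In other words, there are N elements in {(i, j) | j - i = k or j - i = k - N}; they are held one each in the N PEs, all at address (k + h).
[0090] (2) The inter-PE distance L of the elements with the same address (k + h) is N - k. That is, elements at the same address need the same number of inter-PE transfers. The elements at the same address in all PEs can therefore be transferred in parallel on the ring array processor.
[0091] (3) The destination address K' of the elements with the same address (k + h) equals their inter-PE distance L plus h, that is, (N - k + h). The elements transferred in parallel as in (2) can therefore simply be placed at the address obtained by adding h to their transfer count, (N - k + h).
[0092] From (1) to (3), data relocation can be carried out by the following parallel transfers. The elements at address (k + h) are transferred (N - k) times between PEs and placed at address (N - k + h). The elements at address (N - k + h) are transferred k times between PEs and placed at address (k + h). In other words, the elements at address (k + h) and those at address (N - k + h) are exchanged while being transferred in parallel between PEs. Because the transfer count and the destination address are determined by the address (k + h) of the elements being relocated and the number of PEs N, every PE can handle the transfer count and the destination address with exactly the same control.
[0093] The description so far has used the relocation from P to Q as the example, but the relocation from Q to P is exactly the same.
[0094] The parallel method for relocating the elements is summarized below.
(D) Parallel processing method for relocation
On a ring array processor of N PEs (where the data transfer direction is PE i → PE (mod(i - 1|N)) for every PE i (1 ≤ i ≤ N); mod(m|N) denotes N if m is an integer multiple of N, and otherwise the remainder of m divided by N):
Step 1: Execute steps 2 to 7 for r = 1, 2, ..., [N/2] (r is the repetition count). Here [x] denotes the largest integer not exceeding x.
Step 2: In every PE i (1 ≤ i ≤ N), take out the element at address (r + h).
Step 3: In every PE i (1 ≤ i ≤ N), send the element taken out to the next PE and at the same time receive the element taken out by the preceding PE. Repeat this data transfer (N - r) times.
Step 4: If N is even and r = [N/2], store the transferred element at address (N - r + h) in every PE i (1 ≤ i ≤ N) and end the processing.
Step 5: In every PE i (1 ≤ i ≤ N), take out the element at address (N - r + h) and store the transferred element at address (N - r + h).
Step 6: In every PE i (1 ≤ i ≤ N), send the element taken out of address (N - r + h) to the next PE and at the same time receive the element taken out by the preceding PE. Repeat this data transfer r times.
Step 7: In every PE i (1 ≤ i ≤ N), store the transferred element at address (r + h).
[0095] In this method the number of repetitions in step 1 is fixed at [N/2], and the special handling of step 4 is provided for even N, for the following reason.
[0096] When N is odd, [N/2] equals (N - 1)/2, so the addresses read in step 2 are {h + 1, h + 2, ..., h + (N - 1)/2}, and the elements from these addresses are stored at {h + (N - 1), h + (N - 2), ..., h + (N + 1)/2}. The addresses read in step 5 are {h + (N - 1), h + (N - 2), ..., h + (N + 1)/2}, and those elements are stored at {h + 1, h + 2, ..., h + (N - 1)/2}. Steps 1 to 7 can therefore be executed without any overlap.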
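Steps 1 to 7 above can be tried out on a simulated ring. The sketch below is an illustration under assumptions, not the patented control mechanism: each PE memory is a Python list whose index k stands for address h + k, the ring transfer PE i → PE mod(i - 1|N) is modelled by rebuilding a buffer list, and the names `p_layout`, `q_layout`, `shift` and `relocate` are hypothetical.

```python
def mod1(m, n):
    """mod(m|N) as defined in the text: N if m is a multiple of N, else m mod N."""
    r = m % n
    return n if r == 0 else r


def p_layout(n):
    """Distribution P: PE i holds (i, i), (i, i+1), ... at offsets 0..n-1."""
    return [None] + [[(i, mod1(i + k, n)) for k in range(n)] for i in range(1, n + 1)]


def q_layout(n):
    """Distribution Q: PE i holds (i, i), (i+1, i), ... at offsets 0..n-1."""
    return [None] + [[(mod1(i + k, n), i) for k in range(n)] for i in range(1, n + 1)]


def shift(buf, n, times):
    """Transfer PE i -> PE mod(i-1|N) simultaneously in all PEs, `times` times."""
    for _ in range(times):
        buf = [None] + [buf[mod1(i + 1, n)] for i in range(1, n + 1)]
    return buf


def relocate(mem):
    """Apply steps 1 to 7 in place; mem[i][k] is address h + k of PE i."""
    n = len(mem) - 1
    for r in range(1, n // 2 + 1):                              # step 1
        buf = [None] + [mem[i][r] for i in range(1, n + 1)]     # step 2
        buf = shift(buf, n, n - r)                              # step 3
        if n % 2 == 0 and r == n // 2:                          # step 4
            for i in range(1, n + 1):
                mem[i][n - r] = buf[i]
            break
        for i in range(1, n + 1):                               # step 5: exchange
            buf[i], mem[i][n - r] = mem[i][n - r], buf[i]
        buf = shift(buf, n, r)                                  # step 6
        for i in range(1, n + 1):                               # step 7
            mem[i][r] = buf[i]
    return mem
```

Running `relocate(p_layout(n))` for small odd and even n and comparing the result with `q_layout(n)` is a quick way to see that every PE ends up holding the transposed indices.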
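[0097] When N is even, [N/2] equals N/2, so the addresses read in step 2 are {h + 1, h + 2, ..., h + N/2}, and the elements from these addresses are stored at {h + (N - 1), h + (N - 2), ..., h + N/2}. Without step 4, the addresses read in step 5 would be {h + (N - 1), h + (N - 2), ..., h + N/2} and those elements would be stored at {h + 1, h + 2, ..., h + N/2}, so the transfer for relocating the element at address h + N/2 would be executed twice. With step 4, the relocation of the element at address h + N/2 is completed by steps 2 to 4, the relocation for this address is not duplicated, and wasted processing steps are eliminated.
[0098] [D-1] Parallel relocation when N is even
The case of even N is described next, using N = 6 as the example: the data distribution before relocation, the parallel relocation process, and the data distribution after relocation. FIG. 11 shows the data distribution of each PE before relocation, FIG. 12 the relocation process of one embodiment of the invention, and FIG. 13 the data distribution of each PE after relocation. In the example of FIG. 11, the x-index of the elements assigned to a PE equals the PE number. The parallel process shown in FIG. 12 is described, together with FIG. 13, following the parallel relocation method above.
[0099] In step 1 of FIG. 12, the element at address (h + 1) is first taken out in all PEs. The number of inter-PE transfers for these elements is N - 1 = 6 - 1 = 5, so steps 2 to 6 show them being circulated between the PEs.
[0100] At step 6 the transfer of these elements to their destination PEs is complete.
[0101] In step 7 the transferred elements are stored by exchanging them with the elements at the destination address. That is, the element at the destination address (h + 5) is taken out, and the transferred element is stored at address (h + 5).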
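[0102] In step 8 the elements taken out of address (h + 5) in the exchange are transferred between PEs N - 5 = 6 - 5 = 1 time, which completes their transfer to the destination PEs.
[0103] In step 9 these elements are stored at the destination address (h + 1). The elements originally at address (h + 1) have already been relocated, so the transferred elements are simply stored at address (h + 1).
[0104] Steps 1 to 9 thus complete the exchange of the elements at address (h + 1) with those at address (h + 5) through circulating transfers between the PEs. In step 9 the elements at address (h + 2), the target of the next exchange, are also taken out. As steps 10 to 13 show, these elements reach their destination PEs after N - 2 = 6 - 2 = 4 circulating transfers and are stored at address (h + 4), while the elements that were at address (h + 4) are taken out. Step 14 shows this.
[0105] In steps 15 and 16 the elements from address (h + 4) reach their destination PEs after N - 4 = 6 - 4 = 2 circulating transfers and are stored at address (h + 2) of those PEs.
[0106] For the elements at address (h + 3), the destination address is also (h + 3), so the elements taken out of this address reach their destination PEs after N - 3 = 6 - 3 = 3 circulating transfers and are stored at the same address (h + 3) they were taken from. Steps 18 to 21 show this, and step 21 completes the whole relocation.
[0107] The data distribution of FIG. 13 is the result of the relocation process of FIG. 12. In it, the indices of the element at each address are transposed, which shows that the relocation was carried out correctly.
[0108] [D-2] Parallel relocation when N is odd
The case of odd N is described next, using N = 5 as the example: the data distribution before relocation, the parallel relocation process, and the data distribution after relocation. FIG. 14 shows the data distribution of each PE before relocation, FIG. 15 the relocation process of another embodiment of the invention, and FIG. 16 the data distribution of each PE after relocation. The relocation proceeds exactly as for even N. In steps 1 to 6 of FIG. 15, the elements at address (h + 1) are relocated to address (h + 4) after N - 1 = 5 - 1 = 4 circulating transfers between PEs.
[0109] In steps 6 to 8, the elements at address (h + 4) are relocated to address (h + 1) after N - 4 = 5 - 4 = 1 circulating transfer between PEs.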
る。これにより、アドレス（ｈ＋１）の要素とアドレス
（ｈ＋４）の要素に関する再配置処理が完了する。以後
、ステップ８以降についても、ＰＥ間循環データ転送を
介したアドレス（ｈ＋２）の要素とアドレス（ｈ＋３）
の要素との交換が行われ、ステップ１５において、全て
の要素に対する再配置処理が完了する。【０１１０】図１４、図１６から、再配置処理後のデー
タ分布を比較すると、ＰＥの各アドレスの要素に対する
インデックスは互いに転置の関係になっており、再配置
処理が終了したことがわかる。【０１１１】（Ｅ）再配置処理の並列処理方法における処理時間［Ｅ
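-1] Total processing time
Next, the processing time required when the relocation is executed with the parallel processing method above is estimated. First, the time required for data transfer is estimated from the total number of data transfers between PEs. The elements at the same address in every PE all undergo the same number of inter-PE transfers, and they can be transferred in parallel between the PEs of the ring array processor.
[0112] Because the parallel processing method above handles both the even-N and the odd-N case, the total number of transfers is obtained by summing the transfer counts of the elements at each address of a single PE.
[0113] Since the number of inter-PE data transfers for address (h+r) is (N-r), the total number of transfers Ttr is
[0114] [Equation 9]
Ttr = Σ (r = 1 to N-1) (N-r) = (1/2)·N·(N-1) ... (10)
[0115] Therefore, letting S(tr) be the time for one data transfer, the total time S(tr-all) required for inter-PE data transfer follows from equation (10):
S(tr-all) = (1/2)·N·(N-1)·S(tr) ... (11)
[0116] [E-2] Element read and re-store time
Next, the time required to read the elements stored in each PE and to store them again is estimated. Let S(R) be the time to read one element and S(W) the time to store one. In the parallel processing method above, the elements at the same address of every PE are read simultaneously, transferred in parallel between PEs on the ring array processor, and then stored simultaneously at the destination address. The number of reads or re-stores therefore only needs to be counted for a single PE. That number equals the number of addresses subject to relocation, which is (N-1): the element at address 0 needs no relocation, so (N-1) elements require it. The total time S(R/W) for reading and re-storing elements in each PE is therefore
S(R/W) = (N-1)·{S(R) + S(W)} ... (12)
[0117] From equations (11) and (12), the total time S(arng) required for relocation is
S(arng) = S(tr-all) + S(R/W)
= (1/2)·N·(N-1)·S(tr) + (N-1)·{S(R) + S(W)}
= (N-1)·{(1/2)·N·S(tr) + S(R) + S(W)} ... (13)
[0118] If the second and third terms inside { } in equation (13) are negligible compared with the first (the first term is proportional to N, while the second and third are constant regardless of N), the total time S(arng) can be approximated as
S(arng) ≒ (1/2)·N·(N-1)·S(tr) ... (14)
That is, the total time S(arng) is dominated by the time required for inter-PE data transfer. Therefore, regardless of whether N (the number of PEs) is even or odd, the relocation can be executed in O(N^2/2) processing time.
[0119] (F) Control method for relocation by the parallel processing method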
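The control method used when the relocation is executed with the parallel processing method described above is explained here. In that method, the elements transferred between PEs in parallel all come from the same address and all undergo the same number of transfers. The address setting for reading an element, the counting of transfers, and the address setting for re-storing are therefore exactly the same in every PE. Consequently, the relocation can be realized by performing exactly the same control in each individual PE.
[0120] FIG. 17 shows the control flowchart of the data relocation of the present invention. This flowchart corresponds to the parallel processing method for relocation described above. The distinguishing feature of the control method of the present invention is that the address values of the elements to be exchanged with each other are used to count the number of inter-PE data transfers. The flowchart is explained below with reference to FIG. 17, making clear its relationship to the parallel relocation process described earlier.
[0121] Step 110: The element read address /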
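re-store address counters CA and CB are set to their initial values. Counter CA is set to "h+1" and counter CB to "h+N-1". The address values held in these two counters are the addresses whose elements are exchanged with each other in the parallel relocation. In addition, parameter BR1, which specifies the loop count, is set to [N/2]+h, and parameter BR2, which is used to determine whether N is even or odd, is set to N.
[0122] Step 111: In all PEs simultaneously, the element at the address indicated by counter CA is read out, and the content of counter CB, which indicates the re-store address, is loaded into counter CT, which counts the number of transfers.
[0123] Step 112: In all PEs, the element that was read out is sent to the next-stage PE while the element from the preceding-stage PE is received. At this time, counter CT is decremented in all PEs.
[0124] Step 113: The transfer of elements between PEs and the decrement of counter CT (step 112) are repeated until the value of counter CT becomes "h". When the value of counter CT becomes "h", the transferred element is in its destination PE, so the inter-PE data transfer ends and the transferred element is stored at the address indicated by counter CB.
[0125] Step 114: When the value of counter CT is "h", the transferred element is in its destination PE, so the inter-PE data transfer ends and the transferred element is stored at the address indicated by counter CB. At this time, the address indicated by counter CB still holds another element that has not yet been relocated, so the transferred element is stored there in exchange for that element. If parameter BR2 is even and counter CA equals parameter BR1, the processing ends.
[0126] Step 115: The content of counter CA is loaded into counter CT. Then, in all PEs, the element taken from the address indicated by counter CB is sent to the next-stage PE while the element from the preceding-stage PE is received.
[0127] Step 116: Counter CT is decremented in all PEs.
[0128] Step 117: This processing is repeated until the value of counter CT becomes "h".
[0129] Step 118: When the value of counter CT becomes "h", the transferred element is stored at the address indicated by counter CA. Because the element that was at the address indicated by counter CA has already been relocated, the transferred element is simply stored at that address. At this time, if counter CA equals parameter BR1, the processing ends.
[0130] Step 119: When the relocation of the elements at the two addresses has finished in the form of a mutual exchange, counter CA is incremented and counter CB is decremented. The elements at the addresses indicated by counters CA and CB are then relocated via inter-PE data transfer by the same process as above.
[0131] As described above, the process of exchanging and relocating the elements of two addresses is repeated until the value of counter CA matches "[N/2]+h" (steps 114 and 118). In this control flowchart, whether the value of counter CA has reached "[N/2]+h" (steps 114 and 118) is determined by detecting a match between the value of counter CA and the content "[N/2]+h" of parameter BR1. The content of parameter BR1 is set when the relocation starts (step 110). When N is even, the relocation of the element at address ([N/2]+h) does not take the form of an exchange between two addresses, so after the element has been transferred to its destination PE by inter-PE data transfer, it is stored directly at the same address. Whether the total number N of PEs is even or odd (step 114) is determined from the content "N" of parameter BR2. The content of BR2 is also set when the relocation starts (step 110).
[0132] As explained so far, the relocation is completed by regular processing based on the exchange of the elements of two addresses via parallel inter-PE data transfer. In this control method, counters CA and CB not only supply the read and re-store addresses of the elements; they also supply the number of inter-PE data transfers for the element at the address indicated by counter CB and for the element at the address indicated by counter CA, respectively.
[0133] (G) Relocation by the parallel processing method: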
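control hardware configuration
The control hardware configuration that realizes the control method shown in (F) is described here. FIG. 18 is a block diagram of the control hardware for the data relocation of the present invention. In the figure, incrementer CA 120, decrementer CB 121, and decrementer CT 124 correspond to counters CA, CB, and CT shown in FIG. 17, respectively. Register BR1 122 corresponds to BR1 in FIG. 17, which specifies the control parameter for the loop count (the number of times the transfers are repeated), and register BR2 123 corresponds to BR2, which specifies the parameter for determining whether N is even or odd.
[0134] Selector 127 switches between the outputs of incrementer CA 120 and decrementer CB 121 for loading into decrementer CT 124.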
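The counter-driven control of steps 110 to 119, using the counters CA, CB, CT and the parameters BR1, BR2 named above, can be sketched in software as follows. This is a minimal Python model written for illustration, not the hardware of FIG. 18. The function name `relocate`, the list-of-lists memory model, and the transfer direction (PE i receives from PE i+1, as in the forward pass description) are assumptions. The function also returns the number of parallel transfer steps so that it can be compared with (1/2)·N·(N-1) from equation (11).

```python
def relocate(mem, h=0):
    """Relocate elements in place on a simulated ring of N PEs.

    mem[i][r] models the element at address h+r of PE i (hypothetical layout).
    Returns the number of parallel inter-PE transfer steps performed.
    """
    n = len(mem)
    if n < 2:
        return 0                                # a single PE has nothing to exchange
    ca, cb = h + 1, h + n - 1                   # step 110: counters CA and CB
    br1, br2 = n // 2 + h, n                    # step 110: parameters BR1 and BR2
    transfers = 0

    def circulate(buf, ct):
        """Steps 112-113 and 116-117: shift around the ring until CT equals h."""
        nonlocal transfers
        while ct != h:
            buf = buf[1:] + buf[:1]             # assumed direction: PE i receives from PE i+1
            ct -= 1
            transfers += 1
        return buf

    while True:
        buf = [pe[ca - h] for pe in mem]        # step 111: read address CA in all PEs
        buf = circulate(buf, cb)                # CT is loaded from CB
        for i, pe in enumerate(mem):            # step 114: exchange with address CB
            buf[i], pe[cb - h] = pe[cb - h], buf[i]
        if br2 % 2 == 0 and ca == br1:          # N even and middle address handled
            break
        buf = circulate(buf, ca)                # steps 115-117: CT is loaded from CA
        for i, pe in enumerate(mem):            # step 118: store at address CA
            pe[ca - h] = buf[i]
        if ca == br1:
            break
        ca, cb = ca + 1, cb - 1                 # step 119
    return transfers
```

As a usage illustration under the same assumptions, if PE i initially holds the index pair ((i+r) mod N, i) at address h+r, then after `relocate` it holds (i, (i+r) mod N) there, which is the transposed arrangement, and the returned step count equals N·(N-1)/2 for both even and odd N.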
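[0135] Flag generation circuit 125 detects the end of the parallel inter-PE data transfer, that is, that the value of decrementer CT 124 is "h", and generates the corresponding flag (hereinafter called the "h" flag). The "h" flag is also used as the control signal that switches the source from which decrementer CT 124 is loaded. In other words, each time the "h" flag turns on, selector 127 switches the load source between incrementer CA 120 and decrementer CB 121.
[0136] Register BR2 123 is set to N. Whether N is even or odd is determined from whether the LSB (least significant bit) of register BR2 123 is "0" or "1": "0" means even and "1" means odd. The loop is also controlled by the match between [N/2]+h and the value of incrementer CA 120. As shown in the control method above, incrementer CA 120 is incremented from "h+1" to "[N/2]+h", so detecting a match between the value of incrementer CA 120 and the content of register BR1 122 is effectively equivalent to counting [N/2] iterations.
[0137] For this reason, the control hardware configuration includes match detection circuit 126, which detects a match between the value of incrementer CA 120 and the content of register BR1 122.
[0138] The end of the relocation is determined from the match flag output by match detection circuit 126. Because the end condition of the relocation depends on the parity of N, the LSB of register BR2 123 is also input to match detection circuit 126.
[0139] Thus, by detecting specific values of incrementer CA 120, decrementer CB 121, decrementer CT 124, and register BR1 122, the control method shown in (F) can be realized.
[0140] [Effects of the Invention] As described above, according to the data relocation method of the present invention, a plurality (N) of elements can be relocated simultaneously in a single batch, so the relocation is N times faster than relocating the elements one at a time. In addition, because the method relocates data by exchanging the elements of two addresses, the processing is regular and efficient.
[0141] According to the control method of the present invention, the address values of the elements are used not only as read and re-store addresses but also to count the number of transfers when an element is sent to its destination PE by inter-PE data transfer. The control therefore has a double DO-loop structure, and the relocation can be controlled regularly and efficiently. Furthermore, every PE executes exactly the same control, the PEs do not need to track each other's control state, and no complex control configuration that controls each PE of the ring array processor individually is required, so control is simple.
[0142] According to the control hardware configuration of the present invention, the double DO-loop control described above can be realized with three counters and two detection circuits that detect specific counter values and generate the corresponding flags, so the hardware configuration is simplified and the hardware scale is reduced.
Detailed Description of the Invention [0001] [Industrial Field of Application] The present invention relates to a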
data reallocation method and control mechanism for a multiprocessor composed of a plurality of processing units, and in particular to a data reallocation method and control mechanism for a multiprocessor in which the data stored in each processing unit is relocated by exchanging data between the processing units.
[0002] [Prior Art] In a multiprocessor that performs many kinds of parallel processing, when the result of one process is used as data for executing another process, the results obtained in each processing unit must be reassigned to the processing units in a way that suits the other parallel process. Reassigning data to the processing units in this way is called data relocation.
[0003] Conventionally, there have been two main methods for data relocation in such a multiprocessor. In one method, all the data in each processing unit is collected in the host processor that manages the multiprocessor. The collected data is rearranged on the host processor and stored back into the processing units.
[0004] The other method does not go through the host processor. It uses the data transfer paths between processing units and relocates the data of each processing unit one item at a time.
[0005] [Problems to Be Solved by the Invention] However, the method of rearranging the data on the host processor and storing it back requires a large amount of data to be exchanged between the processing units and the host processor, and it also adds the extra work of rearranging the data. The method using the data transfer paths, on the other hand, has a data transfer time proportional to the amount of data to be relocated, which increases the overhead on multiprocessor processing. In addition, the data transfers must be managed individually, which makes control complicated.
[0006] The present invention has been made in view of the above points, and its object is to provide a data reallocation method and control mechanism that can execute data reallocation on a multiprocessor efficiently and at high speed.
[Means for Solving the Problems] FIG. 1 is a diagram illustrating the principle of the present invention.
In a multiprocessor in which a plurality of processing elements are connected in a ring, each processing element is connected to its adjacent processing elements via a data transfer path for exchanging data. Each processing element has calculation means for performing the desired calculations, data transfer means for transferring data between processing elements, storage means for storing addresses and data, and control means for controlling the reading means that reads the stored data. The processing elements are numbered in the order in which they are arranged in the multiprocessor. In all processing elements, data is read from a first address of the storage means by the reading means (10), and the read data is transferred between the processing elements a predetermined number of times (11). The data from the first address that has been transferred to each processing element is exchanged with the data at a second address and stored in the storage means (13). The exchanged data from the second address is transferred simultaneously between the processing elements a predetermined number of times (14), and the data from the second address that has been transferred to each processing element is stored at the first address of the storage means. This procedure is repeated a number of times determined from the total number of processing elements (N) divided by 2, both when N is odd and when N is even. When N is even and the repetition count (r) of the data transfer equals N/2, the data at the first address of the storage means is read simultaneously in all processing elements, the read data is transferred simultaneously between the processing elements a predetermined number of times, and the data transferred to each processing element is stored at the second address of the storage means (15).
[0008] Further, the number of transfers between processing elements for the data at the first address (h+r) is set to (N-r), the value obtained by subtracting h from the address value of the second address (h+N-r) whose data is to be exchanged with it. The number of transfers for the data at the second address (h+N-r) is set to r, the value obtained by subtracting h from the address value of the first address (h+r) whose data is to be exchanged with it. For r = 1, 2, ..., [N/2], the number of simultaneous transfers between processing elements of the data read from the storage means is counted in this way, and the transfer of the read data to its destination processing element is controlled.
[0009] Furthermore, the control mechanism has: a first counter that holds the first address; a second counter that holds the second address; a third counter that counts the number of transfers when the data read from the storage means of the processing elements is transferred simultaneously between processing elements; a selector that switches between the output of the first counter and the output of the second counter, the output of the selector being connected to the input of the third counter; first flag generation means that detects that the value of the third counter equals h and generates a first flag; switching means that switches the selector between the output of the first counter and the output of the second counter according to the content of the first flag; a first register that holds the control parameter for the number of repetitions; a second register that holds the parameter for determining whether the total number of processing elements is even or odd; second flag generation means that detects a match between the content of the first register and the content of the first counter and generates a second flag; parity determining means that determines the parity of the total number of processing elements from the content of the least significant bit of the second register; and end detection means that detects the end of the data relocation from the content of the second flag and the content of the least significant bit of the second register.
[Operation] In a multiprocessor in which a plurality (N) of processing elements are connected in a ring, the processing elements are numbered in the order in which they are arranged, and adjacent processing elements are joined by data transfer paths, the state in which first data is stored at consecutive addresses of the storage means of the processing elements is changed to one in which second data is stored at the same consecutive addresses, by relocating the data of the N processing elements in a single batch.
[Example] To make the present invention easier to understand, the learning process of the Hidden Markov Model method, and the learning processes in the forward-backward procedure and in the Baum-Welch re-estimation formulas, are explained below.
[0012] For pattern recognition processing of speech, characters, and the like, the
Hidden Markov Model (HMM) method is used; an example is described here in which its learning process is executed on a ring array processor, one form of multiprocessor. In pattern recognition using the HMM method, speech and character patterns are modeled, on the basis of a state transition probability model, as sequences of symbols observed as events occur through transitions between the states of the model. The learning process aims to estimate the probability parameters of the probability model accurately from the data of multiple sample patterns. Two algorithms are used in the learning process of the HMM method: the forward-backward procedure and the Baum-Welch re-estimation formulas. The learning process of the HMM method repeats these algorithms, each using the processing results of the other, until the estimate of the probability model converges. The contents of these algorithms are shown below. The forward-backward procedure
consists of two algorithms: a forward pass algorithm and a backward pass algorithm.
(1) Forward pass algorithm
Initial setting: for 1≦i≦N,
α(i, 0) = π(i) ... (1)
Recurrence formula: for 1≦i≦N, t = 1, 2, ..., T,
[0016] [Equation 1]
In the above algorithm, N is the number of processing elements, π(i) is the initial state probability, and α(i, t) is a probability parameter.
(2) Backward pass algorithm
Initial setting: for 1≦i≦N,
β(i, T) = 1 for i∈ET, and 0 otherwise ... (3)
Recurrence formula: for 1≦i≦N, t = T-1, T-2, ..., 0,
[Equation 2]
Here, c(i, j; t) ≡ a(i, j)·b(i, j; Ot) [Equation 3]. In the above algorithm, β(i, t) is a probability parameter, a(i, j) is the state transition probability, and b(i, j; k) is the symbol output probability.
[0021] The Baum-Welch re-estimation formulas consist of three kinds of calculation: re-estimation of the initial state probability π(i), re-estimation of the state transition probability a(i, j), and re-estimation of the symbol output probability b(i, j; k). The details of each re-estimation calculation are shown below. In the following notation, π+(i), a+(i, j), and b+(i, j; k) denote the re-estimated values of π(i), a(i, j), and b(i, j; k).
(1) Re-estimation of the initial state probability
[0023] [Equation 4]
(2) Re-estimation of the state transition probability a+(i, j)
[Equation 5]
(3) Re-estimation of the symbol output probability b+(i, j; k)
[Equation 6]
Here, c(i, j; t) ≡ a(i, j)·b(i, j; Ot), [Equation 7] [Equation 8].
Each algorithm of the above forward-backward procedure
and Baum-Welch re-estimation formulas can be processed in parallel using an array processor in which processing elements (hereinafter, PEs) with the required functions are connected in a ring (hereinafter, a ring array processor). The ring array processor considered here is described next. FIG. 2 shows the configuration of the ring array processor when the learning process in HMM pattern recognition is executed by parallel processing. The ring array processor consists of PEs 100a, 100b, ..., 100c, 100d; a data transfer path 101 between the PEs; memories 102a, 102b, ..., 102c, 102d, each managed by one PE; and data input/output paths 103a, 103b, ..., 103c, 103d for inputting and outputting data between the PEs and their memories.
The parallel processing method for each of the above algorithms, using the ring array processor configuration of FIG. 2, is described below.
(a) Forward-backward procedure
[a-1] Forward pass algorithm
FIG. 3 shows the data flow when the forward pass algorithm of the forward-backward procedure in the learning process is processed in parallel on the ring array processor configuration. The ring array processor in the figure consists of PEs 200a, 200b, ..., 200c, 200d; a data transfer path 201 between the PEs; data input/output paths 202a, 202b, ..., 202c, 202d for inputting and outputting data between each PE and the memory under its management; a data string 203 {α(1, t-1), α(2, t-1), ..., α(i, t-1), ..., α(N, t-1)} that is transferred circularly between the PEs; and a data string 204. The data string 204 is input from the memory under each PE's management in synchronization with the circular transfer of the data string 203.
For example, in PEi (1≦i≦N), the data string CF(i, t) is CF(i, t) = {c(i, i; t), c(mod(i+1|N), i; t), ..., c(mod(i-1|N), i; t)}. Here, mod(m|N) (m an integer) denotes N when m is an integer multiple of N, and the remainder of m divided by N otherwise. The data string CF(i, t) (1≦t≦T) is stored in the memory under the management of PEi (1≦i≦N). This memory is also assumed to hold π(i) as the initial value α(i, 0) of the data α(i, t).
In PEi (1≦i≦N), the initial value α(i, 0) = π(i) of α(i, t), corresponding to equation (1), is first read from the memory under its management and input via the data input/output path 202. Next, PEi inputs the first datum c(i, i; 1) of the data string CF(i, t) from its memory, multiplies it by the initial value α(i, 0) input earlier, and temporarily holds the product α(i, 0)·c(i, i; 1) in a storage area within the PE. As the next step, PEi sends the initial value α(i, 0) to the next-stage PE while receiving the initial value α(mod(i+1|N), 0) from the preceding-stage PE. At the same time, it inputs the second datum c(mod(i+1|N), i; 1) of the data string CF(i, t) from its memory. The PE then multiplies the transferred data α(mod(i+1|N), 0) by the datum c(mod(i+1|N), i; 1) input at the same time, and adds the product to the previous product α(i, 0)·c(i, i; 1) held in the storage area (that is, a product-sum calculation):
α(mod(i+1|N), 0)·c(mod(i+1|N), i; 1) + α(i, 0)·c(i, i; 1)
The sum is again held temporarily in the storage area within the PE.
[0033] From then on, all PEs repeat this simultaneous inter-PE transfer and memory input until the data α(i, 0) transferred between the PEs has made one full circuit of all the PEs of the ring array processor, performing the product-sum calculation described above each time. The result of the product-sum calculation is held in the storage area within the PE. When the circularly transferred data α(i, 0) has gone once around the ring array processor, PEi (1≦i≦N) has obtained the data α(i, 1) for time t = 1. The calculation of α(i, 2) for t = 2 then replaces α(i, 0) with the α(i, 1) just obtained, transfers it circularly between the PEs while inputting the data of the data string CF(i, t) for time t = 2 from memory, and proceeds in exactly the same way as for t = 1. The same applies to t = 3, ..., T. The results α(i, t) for each time t = 1, 2, ..., T are stored sequentially in the memory under the management of PEi (1≦i≦N).
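The product-sum that each PE performs during one circuit of the ring can be sketched as follows. This is a small Python simulation for illustration only. It uses 0-based PE numbers, and the function name `forward_step_on_ring` and the nested-list representation of c(j, i; t) are assumptions. It computes α(i, t) as the sum over j of α(j, t-1)·c(j, i; t), which is the recurrence implied by the description above.

```python
def forward_step_on_ring(alpha_prev, c):
    """One time step of the forward pass on a simulated N-PE ring.

    alpha_prev[i] models alpha(i, t-1) held by PE i (0-based).
    c[j][i] models c(j, i; t).
    Returns the list of alpha(i, t) for every PE i.
    """
    n = len(alpha_prev)
    # CF(i, t): the local stream of PE i, i.e. c(i,i), c(i+1,i), ..., c(i-1,i)
    cf = [[c[(i + m) % n][i] for m in range(n)] for i in range(n)]
    held = list(alpha_prev)            # value currently sitting in each PE
    acc = [0] * n                      # per-PE product-sum register
    for m in range(n):
        for i in range(n):             # conceptually simultaneous in all PEs
            acc[i] += held[i] * cf[i][m]
        held = held[1:] + held[:1]     # PE i receives the value held by PE i+1
    return acc
```

After N steps every circulating value has visited every PE once, so each PE has accumulated all N products without any PE needing to address another PE's memory.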
[a-2] Backward pass algorithm
FIG. 4 shows the data flow when the backward pass algorithm of the forward-backward procedure in the learning process is processed in parallel on the ring array processor configuration. The ring array processor in the figure consists of PEs 300a, 300b, ..., 300c, 300d; a data transfer path 301 between the PEs; data input/output paths 302a, 302b, ..., 302c, 302d for inputting and outputting data between each PE and the memory under its management; a data string 303 {β(1, t+1), β(2, t+1), ..., β(i, t+1), ..., β(N, t+1)} that is transferred circularly between the PEs; and a data string 304. The data string 304 is input in each PE 300a, 300b, ..., 300c, 300d from the memory under its management, in synchronization with the circular transfer of the data string 303. In PEi (1≦i≦N), the data string CB(i, t+1) (0≦t≦T-1) is CB(i, t+1) = {c(i, i; t+1), c(i, mod(i+1|N); t+1), ..., c(i, mod(i-1|N); t+1)}. This data string CB(i, t+1) (0≦t≦T-1) is stored in the memory under the management of PEi (1≦i≦N). This memory is also assumed to hold the initial value β(i, T) of the data β(i, t).
The parallel processing of the backward pass algorithm proceeds in exactly the same way as that of the forward pass algorithm, with β(i, t+1) as the data transferred circularly between the PEs and CB(i, t+1) as the data string input to PEi (1≦i≦N) from memory. That is, in PEi (1≦i≦N), the initial value β(i, T) of β(i, t), corresponding to equation (3), is first read from the memory under its management and input via the data input/output path 302. Next, PEi inputs the first datum c(i, i; T) of the data string CB from its memory, multiplies it by the initial value β(i, T), and temporarily holds the product β(i, T)·c(i, i; T) in a storage area within the PE. PEi then sends the initial value β(i, T) to the next-stage PE while receiving β(mod(i+1|N), T) from the preceding-stage PE, and at the same time inputs the second datum c(i, mod(i+1|N); T) of the data string CB from its memory. It multiplies the transferred data β(mod(i+1|N), T) by the datum c(i, mod(i+1|N); T) input at the same time, and adds the product to the previous product β(i, T)·c(i, i; T) held in the storage area (that is, a product-sum calculation):
β(mod(i+1|N), T)·c(i, mod(i+1|N); T) + β(i, T)·c(i, i; T)
The sum is again held temporarily in the storage area within the PE. From then on, all PEs repeat this simultaneous inter-PE transfer and memory input until the data β(i, T) transferred between the PEs has made one full circuit of all the PEs, performing the product-sum calculation each time. The result of the product-sum calculation is held in the storage area within the PE. When the data β(i, T) has gone once around the ring array processor, PEi (1≦i≦N) has obtained the data β(i, T-1) for time t = T-1. The calculation of β(i, T-2) for t = T-2 then replaces β(i, T) with the β(i, T-1) just obtained, transfers it circularly between the PEs while inputting the data of the data string CB(i, t+1) for t = T-2 from memory, and proceeds in exactly the same way as for t = T-1. The same applies to t = T-3, ..., 0. The results β(i, t) for each time t = T-1, T-2, ..., 0 are stored sequentially in the memory under the management of PEi (1≦i≦N).
When the forward pass algorithm and the backward pass algorithm of the forward-backward procedure are each processed in parallel as described above, the following α(i, t) and β(i, t) are obtained in the memory of each PE.
[0038] Calculation results stored in the memory of PEi (1≦i≦N):
α(i, 0), α(i, 1), ..., α(i, t), ..., α(i, T); β(i, T), β(i, T-1), ..., β(i, t), ..., β(i, 0)
The order of the results shown above is the same as the order in which they are obtained.
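The backward pass over all time steps, including the order in which each PE stores its results, can be sketched in the same style. This is an illustrative Python simulation under assumptions: PE numbers are 0-based, `backward_pass_on_ring` is a hypothetical name, and `c_by_time[t-1][i][j]` stands for c(i, j; t). Each step computes β(i, t) as the sum over j of c(i, j; t+1)·β(j, t+1), reading the local stream CB(i, t+1) in order while the β values circulate.

```python
def backward_pass_on_ring(beta_T, c_by_time):
    """Backward pass on a simulated N-PE ring.

    beta_T[i] models beta(i, T); c_by_time[t-1][i][j] models c(i, j; t)
    for t = 1..T. Returns stored[i], the memory of PE i, holding
    beta(i, T), beta(i, T-1), ..., beta(i, 0) in the order obtained.
    """
    n = len(beta_T)
    stored = [[b] for b in beta_T]             # each PE starts with beta(i, T)
    beta = list(beta_T)
    for c in reversed(c_by_time):              # c is c(., .; t+1) for t = T-1 .. 0
        held, acc = list(beta), [0] * n
        for m in range(n):
            for i in range(n):                 # conceptually simultaneous in all PEs
                acc[i] += held[i] * c[i][(i + m) % n]   # m-th datum of CB(i, t+1)
            held = held[1:] + held[:1]         # PE i receives the value held by PE i+1
        beta = acc
        for i in range(n):
            stored[i].append(beta[i])
    return stored
```

Note that the local stream here runs along a row of c (fixed first index), whereas the forward pass reads along a column (fixed second index). That difference in layout is what makes a relocation step between the two kinds of processing necessary.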
(b) Baum-Welch re-estimation formulas
[b-1] Parallel processing method for re-estimating the initial state probabilities
First, the parallel processing method for the re-estimation of the initial state probability π(i) in the Baum-Welch re-estimation formulas is described. FIG. 5 shows the data flow when the re-estimation of the initial state probabilities in the learning process is processed in parallel on the ring array processor configuration. This ring array processor configuration consists of PEs 400a, 400b, ..., 400c, 400d; a data transfer path 401 between the PEs; a data input/output path 402 between each PE and the memory it manages; a data string 403 {α(1, 0)·β(1, 0), α(2, 0)·β(2, 0), ..., α(i, 0)·β(i, 0), ..., α(N, 0)·β(N, 0)}; and a data string 404 that is input from the memory via the data input/output path 402. As the data string 404, PEi (1≦i≦N) receives D(i, 0) = {α(i, 0), β(i, 0)}. This data string D(i, 0) is assumed to be stored in the memory under the management of PEi (1≦i≦N).
In this parallel processing, PEi (1≦i≦N) first inputs the data string D(i, 0) = {α(i, 0), β(i, 0)} from the memory via the data input/output path 402 in order to calculate the numerator of equation (5). Using the two data α(i, 0) and β(i, 0) of the input data string, the PEs compute the numerator product α(i, 0)·β(i, 0) in parallel. The denominator P(O|λ) equals the sum of the numerator results α(i, 0)·β(i, 0) computed in parallel in each PE. The denominator is therefore obtained by transferring the product results α(i, 0)·β(i, 0) circularly between the PEs until they have visited every PE of the ring array processor, with every PE cumulatively adding the transferred data in parallel. Each PE then divides its numerator result by the denominator result, so the re-estimated initial state probability π+(i) (1≦i≦N) is obtained simultaneously in PEi (1≦i≦N). The re-estimated value π+(i) is stored in the memory managed by PEi (1≦i≦N).
[b-2] Re-estimation of the state transition probabilities:
parallel processing method
Next, the parallel processing of the re-estimation of the state transition probability a(i, j) using the ring array processor configuration is described. FIG. 6 shows the data flow when the re-estimation of the state transition probabilities in the Baum-Welch re-estimation formulas of the learning process is processed in parallel on the ring array processor configuration. The data transfer path 501 is provided between the PEs 500a, 500b, ..., 500c, 500d. The data input/output paths 502a, 502b, ..., 502c, 502d are paths for inputting and outputting data between each PE 500a, 500b, ..., 500c, 500d and the memory under its management. The data string 503 is the data string transferred circularly between the PEs 500a, 500b, ..., 500c, 500d. In the example of FIG. 6, the data string 503 is {β(1, t), β(2, t), ..., β(i, t), ..., β(N, t)}. The first data string 504 is input from the memory via the data input/output path 502; in PEi (1≦i≦N), the data string D(i, t) = {α(i, t), β(i, t)} (0≦t≦T) is input. The second data string 505 is input from the memory via the data input/output path 502; in PEi (1≦i≦N), the data string CB(i, t) = {c(i, i; t), c(i, mod(i+1|N); t), ..., c(i, mod(i-1|N); t)} is input. Both the first data string 504 and the second data string 505 are assumed to be stored in the memory managed by the PE.
In this parallel processing, the data string D(i, t) = {α(i, t), β(i, t)} (0≦t≦T) is first input to PEi (1≦i≦N) from the memory in order to calculate the denominator of equation (6). Using the data α(i, t) and β(i, t) of the input data string, PEi (1≦i≦N) executes the product-sum Σα(i, t)·β(i, t) over time t to obtain the denominator. This product-sum is executed in parallel in all PEs. For the numerator, on the other hand, PEi (1≦i≦N) transfers the data β(i, t) of the previously input data string D(i, t) circularly until it has passed through all PEs. At the same time, PEi (1≦i≦N) inputs the data string CB(i, t) from the memory and multiplies three terms in parallel: the circularly transferred data arriving at PEi at each moment, the data of the data string CB(i, t), and the data α(i, t-1) of the previously input data string D(i, t). Through this process, PEi (1≦i≦N) obtains, for the combinations (i, j) with j = 1, 2, ..., N, the term at time t of the numerator's cumulative sum. The numerator is therefore obtained by updating t, repeating the three-term multiplication described above, and cumulatively adding the results obtained for each combination (i, j) at each time t. All of this processing is executed in parallel on the PEs. Using the denominator and numerator obtained in this way, the PEs divide the numerator by the denominator in parallel, so PEi (1≦i≦N) obtains the re-estimated value a+(i, j) of the state transition probability a(i, j) for the combinations (i, j) with j = 1, 2, ..., N.
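The three-term multiplication and accumulation for the numerator can be sketched as follows. This Python simulation is illustrative only. The name `transition_numerators_on_ring`, the 0-based indices, and the array layouts `alpha[t][i]`, `beta[t][i]`, and `c_by_time[t-1][i][j]` are assumptions. Only the numerator is modeled, because the exact summation range of the denominator is given by equation (6), which is not reproduced in this text.

```python
def transition_numerators_on_ring(alpha, beta, c_by_time):
    """Accumulate the numerator terms for a+(i, j) on a simulated N-PE ring.

    alpha[t][i] and beta[t][i] model alpha(i, t) and beta(i, t) for t = 0..T.
    c_by_time[t-1][i][j] models c(i, j; t) for t = 1..T.
    Returns num[i][j], the sum over t of alpha(i, t-1) * c(i, j; t) * beta(j, t).
    """
    n, big_t = len(alpha[0]), len(c_by_time)
    num = [[0] * n for _ in range(n)]
    for t in range(1, big_t + 1):
        held = list(beta[t])               # beta(., t) starts circulating
        c = c_by_time[t - 1]
        for m in range(n):
            for i in range(n):             # conceptually simultaneous in all PEs
                j = (i + m) % n            # PE i currently holds beta(j, t)
                num[i][j] += alpha[t - 1][i] * c[i][j] * held[i]
            held = held[1:] + held[:1]     # PE i receives the value held by PE i+1
    return num
```

In this sketch PE i ends up holding row i of the numerator, that is, the values for the pairs (i, j) with j ranging over all PEs, which matches the statement above that PEi obtains the terms for the combinations (i, j).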
[0044] It also follows from the parallel processing method above that, in the parallel processing of the numerator, if the data transferred between the PEs is α(i, t-1), the data string input to PEi (1≦i≦N) is CF(i, t) = {c(i, i; t), c(mod(i+1|N), i; t), c(mod(i+2|N), i; t), ..., c(mod(i-1|N), i; t)}, and the data β(i, t) of the data string D(i, t) = {α(i, t), β(i, t)} (0≦t≦T) is used in the three-term multiplication of the numerator, then PEi (1≦i≦N) obtains the numerator results for the combinations (j, i) with j = 1, 2, ..., N.
Therefore, in PEi (1≦i≦N), the re-estimated value a+(j, i) of the state transition probability a(j, i) for the combinations (j, i) with j = 1, 2, ..., N is obtained by transferring the denominator results circularly between the PEs until they have passed through all PEs, and dividing the numerator results already obtained by the denominator result that has been transferred in.
[b-3] Parallel processing of the re-estimation of the symbol output probabilities
Next, the parallel processing of the re-estimation of the symbol output probability b(i, j; k) using the ring array processor configuration is described. FIG. 7 shows the data flow when the re-estimation of the symbol output probabilities in the Baum-Welch re-estimation formulas of the learning process is processed in parallel on the ring array processor configuration. The data transfer path 601 is provided between the PEs 600a, 600b, ..., 600c, 600d. The data input/output paths 602a, 602b, ..., 602c, 602d are paths for inputting and outputting data between each PE 600a, 600b, 600c, 600d and the memory under its management. The data string 603 is the data string transferred circularly between the PEs 600a, 600b, ..., 600c, 600d. In the example shown in FIG. 7, the data string 603 is {β(1, t), β(2, t), ..., β(i, t), ..., β(N, t)}. The first data string 604 is input from the memory via the data input/output paths 602a, 602b, 602c, 602d; in PEi (1≦i≦N), the data string D(i, t) = {α(i, t), β(i, t)} (0≦t≦T) is input.
The second data string 605 is also input from the memory via the data input/output paths 602a, 602b, 602c, 602d. In PEi (1≦i≦N), the data of the data string CB(i, t) = {c(i, i; t), c(i, mod(i+1|N); t), ..., c(i, mod(i-1|N); t)} and of the data string GB(i, t) = {g(i, i; t), g(i, mod(i+1|N); t), ..., g(i, mod(i-1|N); t)} are input one pair at a time. That is, PEi (1≦i≦N) receives {c(i, i; t), g(i, i; t)}, {c(i, mod(i+1|N); t), g(i, mod(i+1|N); t)}, ..., {c(i, mod(i-1|N); t), g(i, mod(i-1|N); t)}. These data, namely the first data string 604 and the data strings CB(i, t) and GB(i, t) that form the second data string 605, are assumed to be stored in the memory managed by the PE. Here, the data of the data string GB(i, t) are defined, using the parameter u(t; k) that represents the similarity between the symbol Ot and the reference symbol k, as g(i, j; t) = c(i, j; t)·u(t; k).
Define. Re-estimation of symbol output probability b(i,j;k)
As can be seen from equation (7), the constant calculation requires the calculation of the denominator and numerator.
The calculation contents are almost the same, and the symbol Ot is used to calculate the numerator.
The point where the symbol Ot = k is added as a condition for
Only that is different. Also, the calculation of this denominator and numerator is the state transition
Exactly equivalent to the calculation of the numerator of the re-estimation calculation of probability a(i, j)
It is. Therefore, this symbol output probability b(i,j;k
) The calculation process for the denominator and numerator of the re-estimation calculation is as described above.
Parallel calculation of the numerator for re-estimation calculation of state transition probability a(i, j)
The processing method can be applied and executed as is. Next, parallel calculation processing of the denominator and numerator is performed according to FIG.
First, PEi (1≦i≦N) reads the first data string D(i, t) = {α(i, t), β(i, t)} (0≦t≦T) from the memory via the data input/output paths 602a, 602b, ..., 602c, 602d. Then, in the parallel calculation of the denominator and numerator, while the data β(i, t) is cyclically transferred between the PEs until the transferred data completes one full cycle, PEi (1≦i≦N) reads from the memory the second data string {c(i, i; t), g(i, i; t)}, {c(i, mod(i+1|N); t), g(i, mod(i+1|N); t)}, {c(i, mod(i+2|N); t), g(i, mod(i+2|N); t)}, ..., {c(i, mod(i-1|N); t), g(i, mod(i-1|N); t)}. Using the cyclically transferred data that arrives at each transfer, the two kinds of data of the second data string, and the data α(i, t-1) of the first data string, PEi (1≦i≦N) multiplies the three terms α, c, β for the denominator of equation (7) and the three terms α, g, β for the numerator. This multiplication for the denominator and numerator is executed in parallel on all PEs. By this process, PEi (1≦i≦N) calculates the terms at time t of the cumulative sums of the denominator and numerator for the combinations (i, j), j=1, 2, ..., N. Therefore, if the three-term multiplication described above is executed again for each time t and the results are accumulated over t, the calculation results of the denominator and numerator are obtained. These processes are executed in parallel among the PEs. By dividing the numerator by the denominator in parallel using the results obtained above, PEi (1≦i≦N) obtains b+(i, j; k) as the re-estimated value of the symbol output probability for the combinations (i, j), j=1, 2, ..., N.
[0050] Also, as can be seen from the above parallel processing method, this calculation can equally be processed in parallel with α(i, t-1) as the data transferred between the PEs. In that case the second data string input to PEi (1≦i≦N) from the memory is formed from the data string
CF(i, t) = {c(i, i; t), c(mod(i+1|N), i; t), c(mod(i+2|N), i; t), ..., c(mod(i-1|N), i; t)}
and the data string
GF(i, t) = {g(i, i; t), g(mod(i+1|N), i; t), g(mod(i+2|N), i; t), ..., g(mod(i-1|N), i; t)},
that is, it consists of the pairs {c(i, i; t), g(i, i; t)}, {c(mod(i+1|N), i; t), g(mod(i+1|N), i; t)}, {c(mod(i+2|N), i; t), g(mod(i+2|N), i; t)}, ..., {c(mod(i-1|N), i; t), g(mod(i-1|N), i; t)}, and the data β(i, t) of the first data string is used in the three-term multiplication. PEi (1≦i≦N) then obtains the re-estimated value b+(j, i; k) of the symbol output probability for the combinations (j, i), j=1, 2, ..., N.
Once PEi (1≦i≦N) has obtained, for the combinations (i, j) (or (j, i)), j=1, 2, ..., N, the re-estimated value a+(i, j) (or a+(j, i)) of the state transition probability and the re-estimated value b+(i, j; k) (or b+(j, i; k)) of the symbol output probability, it executes the following sum of products over k using the parameter u(t; k) that represents the similarity between the symbol Ot and the reference symbol k,
Σu(t; k)・b+(i, j; k) or Σu(t; k)・b+(j, i; k),
to obtain the re-estimated value b+(i, j; Ot) or b+(j, i; Ot) of the symbol output probability for the symbol Ot. From this result and the re-estimated value a+(i, j) (or a+(j, i)) of the state transition probability, it obtains the re-estimated value c+(i, j; t) (or c+(j, i; t)) of the data c(i, j; t) (or c(j, i; t)). The flow for obtaining these re-estimated values is:
b+(i, j; Ot) = Σu(t; k)・b+(i, j; k) or b+(j, i; Ot) = Σu(t; k)・b+(j, i; k)
and then
c+(i, j; t) = b+(i, j; Ot)・a+(i, j) or c+(j, i; t) = b+(j, i; Ot)・a+(j, i).
This result is stored in the memory under the control of the PE.
[0053] Also, the re-estimated value g+(i, j; t) (or g+(j, i; t)) of the data g(i, j; t) (or g(j, i; t)) is obtained from u(t; k)・b+(i, j; Ot) (or u(t; k)・b+(j, i; Ot)) and the re-estimated value a+(i, j) (or a+(j, i)) of the state transition probability as
g+(i, j; t) = {u(t; k)・b+(i, j; Ot)}・a+(i, j)
g+(j, i; t) = {u(t; k)・b+(j, i; Ot)}・a+(j, i).
The result is stored in the memory under the control of the PE.
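As a rough illustration of the flow above, the following Python sketch computes b+(i, j; Ot), c+(i, j; t) and g+(i, j; t) for a single pair (i, j) at a single time t. The function name, argument names and toy numbers are hypothetical and do not come from the patent. Because the text leaves the index k of u(t; k) free in the definition of g+, the sketch returns one g+ value per reference symbol k; that reading is an assumption.

```python
def reestimate_c_and_g(u_t, b_plus_ij, a_plus_ij):
    """Toy sketch for one (i, j) pair at one time t (names are hypothetical).

    u_t        -- list of u(t; k) over the reference symbols k
    b_plus_ij  -- list of b+(i, j; k) over the same k
    a_plus_ij  -- re-estimated state transition probability a+(i, j)
    """
    # b+(i, j; Ot) = sum over k of u(t; k) * b+(i, j; k)
    b_plus_ot = sum(u * b for u, b in zip(u_t, b_plus_ij))
    # c+(i, j; t) = b+(i, j; Ot) * a+(i, j)
    c_plus = b_plus_ot * a_plus_ij
    # g+(i, j; t) = {u(t; k) * b+(i, j; Ot)} * a+(i, j), kept per k (assumption)
    g_plus = [u * b_plus_ot * a_plus_ij for u in u_t]
    return b_plus_ot, c_plus, g_plus
```

With u_t = [1.0, 0.0], i.e. the observed symbol matches reference symbol 0 exactly, the sum of products simply selects b+(i, j; 0).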
[0054] By performing the parallel calculation described above for the three types of re-estimation calculations of the Baum-Welch re-estimation formulas, the calculation results are distributed over the PEs of the ring array processor as follows.
[0055] Calculation results stored in the memory managed by PEi (1≦i≦N):
(a) When the parallel processing method uses α(i, t) as the data transferred between PEs:
π+(i);
c+(i, i; t), c+(mod(i+1|N), i; t), ..., c+(N, i; t), c+(1, i; t), c+(2, i; t), ..., c+(mod(i-1|N), i; t) for 1≦t≦T;
g+(i, i; t), g+(mod(i+1|N), i; t), ..., g+(N, i; t), g+(1, i; t), g+(2, i; t), ..., g+(mod(i-1|N), i; t) for 1≦t≦T.
(b) When the parallel processing method uses β(i, t) as the data transferred between PEs:
π+(i);
c+(i, i; t), c+(i, mod(i+1|N); t), ..., c+(i, N; t), c+(i, 1; t), c+(i, 2; t), ..., c+(i, mod(i-1|N); t) for 1≦t≦T;
g+(i, i; t), g+(i, mod(i+1|N); t), ..., g+(i, N; t), g+(i, 1; t), g+(i, 2; t), ..., g+(i, mod(i-1|N); t) for 1≦t≦T.
The calculation results in (a) and (b) above are listed in the order in which they are obtained.
(C) Contents of the data relocation processing required for the learning process
Based on the parallel processing methods using the ring array processor configuration explained so far for the forward-backward procedure and the Baum-Welch re-estimation formulas, and on the resulting distribution of the processing results, the contents of the data relocation processing required for the learning process are explained.
The learning process repeatedly executes the forward-backward procedure and the Baum-Welch re-estimation formulas, each using the results of the other. Specifically, it consists of the following processing. First, initial values of the initial state probability π(i), the state transition probability a(i, j) and the symbol output probability b(i, j; k) are set appropriately, and the probability parameters α(i, t) and β(i, t) are calculated by the forward-backward procedure. Then, using the initial values of these three kinds of probabilities and the two kinds of probability parameters α(i, t) and β(i, t) obtained by the forward-backward procedure, the initial state probability π(i), the state transition probability a(i, j) and the symbol output probability b(i, j; k) are re-estimated with the Baum-Welch re-estimation formulas, and the results are denoted π+(i), a+(i, j), b+(i, j; k). If the re-estimation results differ from the initial values, the initial values are replaced with the re-estimated values and the forward-backward procedure and the Baum-Welch re-estimation formulas are processed again. This is repeated until the re-estimated values obtained by the Baum-Welch re-estimation formulas match the values of the probabilities used in the calculation of the forward-backward procedure.
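The iteration described above is a fixed-point loop, and the sketch below shows only its control flow. `forward_backward` and `reestimate` are hypothetical callables standing in for the forward-backward procedure and the Baum-Welch re-estimation formulas; the flat parameter list, the tolerance `tol` and the iteration cap `max_iter` are assumptions added so that the loop terminates with floating-point values, and none of them come from the patent.

```python
def learn(params, forward_backward, reestimate, tol=1e-9, max_iter=1000):
    """Alternate the two procedures until the parameters stop changing.

    params is a flat list of probabilities standing in for
    pi(i), a(i, j) and b(i, j; k).
    """
    for iteration in range(1, max_iter + 1):
        alpha_beta = forward_backward(params)        # alpha(i, t), beta(i, t)
        new_params = reestimate(params, alpha_beta)  # pi+, a+, b+
        if all(abs(n - p) <= tol for n, p in zip(new_params, params)):
            return new_params, iteration             # re-estimates match the inputs
        params = new_params                          # replace the values and repeat
    return params, max_iter
```

In the parallel setting of this document, each pass through the loop is where the data relocation between the two procedures would take place.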
[0058] From the contents of the above iterative processing, the data π(i), a(i, j), b(i, j; k) used in the forward-backward procedure are the values π+(i), a+(i, j), b+(i, j; k) obtained by processing the Baum-Welch re-estimation formulas, and the data used in the Baum-Welch re-estimation formulas are the data π(i), a(i, j), b(i, j; k) used in the forward-backward procedure together with the data α(i, t), β(i, t) obtained by the forward-backward procedure.
Therefore, the purpose of the data relocation processing is twofold: to make the distribution of the data held in the PE memories after parallel processing of the forward-backward procedure match the initial data distribution for parallel processing of the Baum-Welch re-estimation formulas, and to make the distribution of the data held in the PE memories after parallel processing of the Baum-Welch re-estimation formulas match the initial data distribution for parallel processing of the forward-backward procedure.
From the explanation of each parallel processing method above, the data distribution of the memory managed by PEi (1≦i≦N) required in each parallel processing is organized as follows.
[C-1] Data distribution for the forward pass algorithm of the forward-backward procedure
Data used: α(i, 0) = π(i), CF(i, t) = {c(i, i; t), c(mod(i+1|N), i; t), ..., c(mod(i-1|N), i; t)} (1≦t≦T)
Obtained data: {α(i, 1), ..., α(i, t), ..., α(i, T)}
[C-2] Data distribution for the backward pass algorithm of the forward-backward procedure
Data used: β(i, T), CB(i, t) = {c(i, i; t), c(i, mod(i+1|N); t), ..., c(i, mod(i-1|N); t)} (1≦t≦T)
Obtained data: {β(i, T-1), ..., β(i, t), ..., β(i, 0)}
[C-3] Data distribution for the re-estimation calculation of the Baum-Welch re-estimation formulas
(1) When the data cyclically transferred between PEs is α(i, t)
Data used: {α(i, 0), α(i, 1), ..., α(i, t), ..., α(i, T)}, {β(i, 0), β(i, 1), ..., β(i, t), ..., β(i, T)},
CF(i, t) = {c(i, i; t), c(mod(i+1|N), i; t), ..., c(mod(i-1|N), i; t)} (1≦t≦T),
GF(i, t) = {g(i, i; t), g(mod(i+1|N), i; t), ..., g(mod(i-1|N), i; t)} (1≦t≦T)
Obtained data: π+(i),
CF+(i, t) = {c+(i, i; t), c+(mod(i+1|N), i; t), ..., c+(mod(i-1|N), i; t)} (1≦t≦T),
GF+(i, t) = {g+(i, i; t), g+(mod(i+1|N), i; t), ..., g+(mod(i-1|N), i; t)} (1≦t≦T)
(2) When the data cyclically transferred between PEs is β(i, t)
Data used: {α(i, 0), α(i, 1), ..., α(i, t), ..., α(i, T)}, {β(i, 0), β(i, 1), ..., β(i, t), ..., β(i, T)},
CB(i, t) = {c(i, i; t), c(i, mod(i+1|N); t), ..., c(i, mod(i-1|N); t)} (1≦t≦T),
GB(i, t) = {g(i, i; t), g(i, mod(i+1|N); t), ..., g(i, mod(i-1|N); t)} (1≦t≦T)
Obtained data: π+(i),
CB+(i, t) = {c+(i, i; t), c+(i, mod(i+1|N); t), ..., c+(i, mod(i-1|N); t)} (1≦t≦T),
GB+(i, t) = {g+(i, i; t), g+(i, mod(i+1|N); t), ..., g+(i, mod(i-1|N); t)} (1≦t≦T)
From the data distributions for each parallel processing method listed above, the following can be seen.
(1) The combined data distribution obtained by parallel processing of the forward pass algorithm and the backward pass algorithm of the forward-backward procedure matches the data distribution used in parallel processing of the Baum-Welch re-estimation formulas.
(2) The data string CF(i, t) used in parallel processing of the forward pass algorithm of the forward-backward procedure is the same as the data string CF(i, t) used in parallel processing of the Baum-Welch re-estimation formulas when the data cyclically transferred between PEs is α(i, t).
(3) The data string CB(i, t) used in parallel processing of the backward pass algorithm of the forward-backward procedure is the same as the data string CB(i, t) used in parallel processing of the Baum-Welch re-estimation formulas when the data cyclically transferred between PEs is β(i, t).
(4) The data string CF+(i, t) obtained from parallel processing of the Baum-Welch re-estimation formulas when the data transferred between PEs is α(i, t) is equivalent to the data string CF(i, t) used in parallel processing of the forward pass algorithm of the forward-backward procedure. With respect to the data string CB(i, t) used in parallel processing of the backward pass algorithm, however, the (x, y)-indices of the data composing the two data strings are transposed.
(5) The data string CB+(i, t) obtained from parallel processing of the Baum-Welch re-estimation formulas when the data transferred between PEs is β(i, t) is equivalent to the data string CB(i, t) used in parallel processing of the backward pass algorithm of the forward-backward procedure. With respect to the data string CF(i, t) used in parallel processing of the forward pass algorithm, however, the (x, y)-indices of the data composing the two data strings are transposed.
From the above, whichever of α(i, t) and β(i, t) is selected as the data transferred between PEs, the data strings of the results obtained by parallel processing of the Baum-Welch re-estimation formulas are equivalent to the data strings of only one of the forward pass algorithm and the backward pass algorithm of the forward-backward procedure. To perform the other one, the data of the data strings obtained in each PE must be relocated. As shown in (4) and (5), this relocation is a mutual conversion between the data string CF(i, t) (or CF+(i, t)) (1≦i≦N) and the data string CB(i, t) (or CB+(i, t)) (1≦i≦N), whose constituent data have transposed (x, y)-indices.
FIG. 8 shows the contents of the data relocation processing necessary for the learning process.
The figure shows the content of this mutual conversion. It lists the (x, y)-indices of the data that make up the data strings CF(i, t) (or CF+(i, t)) and CB(i, t) (or CB+(i, t)) stored in the memories of all PEs. The index for the time t is the same for all data and is therefore omitted. The data distribution P is composed of the data constituting the data string CB(i, t) (or CB+(i, t)), and the data distribution Q is composed of the data constituting the data string CF(i, t) (or CF+(i, t)). For a given time t, the data of these data strings are assumed to be held at consecutive addresses in each memory (in the example of FIG. 8, the range from address h to address (h+N-1)).
[0069] Below, the data relocation processing method is explained with reference to FIG. 8.
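The two layouts described for FIG. 8 can be sketched in a few lines of Python. This is only an illustration: the names `mod1`, `layout_p`, `layout_q` and `is_transposed` are hypothetical, each list position k stands for address h+k, and `mod1` assumes that mod(m|N) wraps into the range 1..N, as section (D) of this description defines it.

```python
def mod1(m, n):
    """mod(m|N) with values in 1..N (N itself for multiples of N)."""
    return (m - 1) % n + 1

def layout_p(n):
    """Distribution P: PEi holds (i, i), (i, i+1), ... at addresses h, h+1, ..."""
    return {i: [(i, mod1(i + k, n)) for k in range(n)] for i in range(1, n + 1)}

def layout_q(n):
    """Distribution Q: PEi holds (i, i), (i+1, i), ... at addresses h, h+1, ..."""
    return {i: [(mod1(i + k, n), i) for k in range(n)] for i in range(1, n + 1)}

def is_transposed(n):
    """True if each Q entry is the index-swapped P entry at the same PE and address."""
    p, q = layout_p(n), layout_q(n)
    return all(q[i][k] == (p[i][k][1], p[i][k][0]) for i in p for k in range(n))
```

For N=4, `layout_p(4)[3]` lists the third row starting at the diagonal, and `layout_q(4)[3]` lists the third column starting at the diagonal.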
The data distributions P and Q shown in the figure are each regarded as a matrix, and the following explains, for an identical element of the matrices P and Q, the relationship between the numbers of the PEs that hold it and the addresses at which it is held in those PEs.
Among the elements of the matrix P, the elements held in PEi (1≦i≦N) and their addresses are related as follows:
address      element of matrix P
h            (i, i)
h+1          (i, mod(i+1|N))
h+2          (i, mod(i+2|N))
...          ...
h+N-1        (i, mod(i-1|N))
Let the index of an element of the matrix P be (i, j), and consider expressing its address adr(PEi; P) in terms of the x-index and y-index of the element.
The y-index of the elements held in PEi starts at i and increases by 1 until it equals N, and after that increases by 1 from 1 until it equals i-1. Therefore, if i≦j (≦N), the y-index j is one of the indices in the sequence that increases from i to N. Since the address of the element with y-index i is h, the address of the element with y-index j is j-i+h. If, on the other hand, 1≦j<i, the y-index j is one of the indices in the sequence that increases by 1 from 1 to i-1. Since the address of the element whose y-index is N is (N-i+h), the address of the element whose y-index is 1 is (N-i+h+1). Therefore, the address of the element with y-index j in the range 1≦j<i is (N-i+h+j). From the above, the address adr(PEi; P) of the element (i, j) of the matrix P can be expressed as
$$\mathrm{adr}(PE_i; P) = \begin{cases} j - i + h & \text{for } i \le j \le N \\ N - i + h + j & \text{for } 1 \le j < i \end{cases} \qquad (8)$$
Next, for the elements of the matrix Q, the PE number and the address at which the same element as an element of the matrix P is held are determined, which clarifies the relationship between the PE numbers and addresses of identical elements of the matrices P and Q.
Among the elements of the matrix Q, the elements held in PEi (1≦i≦N) and their addresses are related as follows:
address      element of matrix Q
h            (i, i)
h+1          (mod(i+1|N), i)
h+2          (mod(i+2|N), i)
...          ...
h+N-1        (mod(i-1|N), i)
[0075] From the above relationship, the number of a PE is equal to the y-index of the elements it holds, so the element of the matrix P whose y-index is j is held in PEj in the matrix Q. Replacing i with j in the above relationship between addresses and elements, the x-index of the elements held in PEj starts at j and increases by 1 until it equals N, and after that increases by 1 from 1 to j-1. If j≦i (≦N), the x-index i of the element is one of the indices in the sequence that increases by 1 from j to N, so the address of the element with x-index i is i-j+h. If, on the other hand, 1≦i<j, the x-index i is one of the indices in the sequence that increases by 1 from 1 to j-1. Therefore, the address of the element with x-index i in the range 1≦i<j is N-j+h+i. The address adr(PEj; Q) of the element (i, j) of the matrix Q can therefore be expressed as
$$\mathrm{adr}(PE_j; Q) = \begin{cases} i - j + h & \text{for } j \le i \le N \\ N - j + h + i & \text{for } 1 \le i < j \end{cases} \qquad (9)$$
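Equations (8) and (9) translate directly into two small functions. The sketch below is illustrative only; the names `adr_p`, `adr_q` and `mod1` are hypothetical, and the base address h defaults to 0 purely for convenience.

```python
def mod1(m, n):
    """mod(m|N) with values in 1..N (N itself for multiples of N)."""
    return (m - 1) % n + 1

def adr_p(i, j, n, h=0):
    """Equation (8): address of element (i, j) of matrix P inside PEi."""
    return j - i + h if i <= j else n - i + h + j

def adr_q(i, j, n, h=0):
    """Equation (9): address of element (i, j) of matrix Q inside PEj."""
    return i - j + h if j <= i else n - j + h + i

def formulas_match_layout(n, h=0):
    """Check both formulas against the address tables for P and Q."""
    for pe in range(1, n + 1):
        for k in range(n):
            other = mod1(pe + k, n)
            if adr_p(pe, other, n, h) != h + k:   # P: PEi holds (i, mod(i+k|N)) at h+k
                return False
            if adr_q(other, pe, n, h) != h + k:   # Q: PEi holds (mod(i+k|N), i) at h+k
                return False
    return True
```

Note that `adr_q(i, j, ...)` equals `adr_p(j, i, ...)`, which is the transposition between the two distributions expressed at the level of addresses.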
[0077] Summarizing the above results, from equations (8) and (9), the relationship between the number of the PE that holds an identical element (i, j) of the matrices P and Q and its address within that PE can be organized as shown in Table 1 below.
[Table 1]
Next, using the results of Table 1, the method of relocating the elements for matrix P → matrix Q or matrix Q → matrix P is clarified. As can be seen from Table 1, the PE number and address at which the element (i, j) is held must be considered separately according to the magnitude relationship between its x-index and y-index. The relationship between the magnitude relationship of the indices of an element of the matrix P and its address is explained next. FIG. 9 is a graph showing the relationship between the index difference of an element, its address before relocation, its relocation destination address, and the distance between PEs. It plots the address K before relocation, the relocation destination address K' and the inter-PE distance L against the difference (j-i) of the x- and y-indices of the element. The horizontal axis is the index difference (j-i) of the element and the vertical axis is the address of the element. Reference numeral 81 denotes the address K before relocation, 82 the relocation destination address K', and 83 the inter-PE distance L. The domain of (j-i) determined by the magnitude relationship of the element indices i and j is divided into {-(N-1), ..., -1}, {0} and {1, ..., (N-1)}, where (j-i) is an integer within these ranges. The conditions for relocating the elements in these three domains are as follows.
(1) When i=j;
The number of the PE that holds the element (i, j) of the matrix P is i, and its address is h. The number of the PE that holds the element (i, j) of the matrix Q is also i (since j=i), and its address is h. That is, the elements at address h of the matrices P and Q are held in the PE with the same number, so no relocation is needed.
(2) When i<j;
The element (i, j) of the matrix P is held at address (j-i+h) of PEi, whereas the element (i, j) of the matrix Q is held at address {N-(j-i)+h} of PEj. Therefore, in the relocation matrix P → matrix Q, an element in this domain must be relocated from address (j-i+h) of PEi to address {N-(j-i)+h} of PEj.
(3) When i>j;
The element (i, j) of the matrix P is held at address {N-(i-j)+h} of PEi, whereas the element (i, j) of the matrix Q is held at address (i-j+h) of PEj. Therefore, in the relocation matrix P → matrix Q, an element in this domain must be relocated from address {N-(i-j)+h} of PEi to address (i-j+h) of PEj.
Data relocation based on these conditions requires changing both the number of the PE that holds each element and its address. In the ring array processor configuration, the relocation must therefore follow the procedure (extract the element from its address before relocation) → (transfer the element to the relocation destination PE) → (store the transferred element at the relocation destination address). The transfer of an element to the relocation destination PE in this procedure differs for each relocation condition, so the details are described below. Since no relocation is needed when i=j, only the case i≠j is considered. The number of PEs passed through in the data transfer between PEs that accompanies the relocation is defined as the inter-PE distance L.
(1) When i<j;
The data transfer direction in the ring array processor configuration is from the larger PE number to the smaller, so the data transfer PEi → PEj must follow the route PEi → PE1 → PEN → PEj. The distance is (i-1) for PEi → PE1, 1 for PE1 → PEN and (N-j) for PEN → PEj, so the inter-PE distance is L = (i-1)+1+(N-j) = N-(j-i).
(2) When i>j;
The data transfer can be executed directly as PEi → PEj, so the inter-PE distance is L = i-j.
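The relocation conditions and the inter-PE distance can be collected into one function and cross-checked by walking the ring. This is a minimal sketch under the stated transfer direction (PEi sends to PE(i-1), and PE1 sends to PEN); the function names `relocation` and `hops` are hypothetical.

```python
def relocation(i, j, n, h=0):
    """For an off-diagonal element (i, j) of matrix P, return
    (source PE, source address, destination PE, destination address, distance L)
    for the relocation matrix P -> matrix Q."""
    if i == j:
        raise ValueError("diagonal elements stay at address h of PEi")
    if i < j:
        src, dst, dist = j - i + h, n - (j - i) + h, n - (j - i)
    else:
        src, dst, dist = n - (i - j) + h, i - j + h, i - j
    return i, src, j, dst, dist

def hops(i, j, n):
    """Count transfers from PEi to PEj by walking the ring one PE at a time."""
    count, pe = 0, i
    while pe != j:
        pe = pe - 1 if pe > 1 else n   # PE1 wraps around to PEN
        count += 1
    return count
```

For example, with N=6 the element (1, 2) must travel PE1 → PE6 → PE5 → PE4 → PE3 → PE2, which is N-(j-i) = 5 transfers, while (5, 2) needs only i-j = 3.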
[0086] Summarizing the above results, the following relationships among the inter-PE distance L, the relocation destination address K' and the index difference (j-i) of an element can be seen from the graph of FIG. 9.
(1) The element group {(i, j) | j-i=k} and the element group {(i, j) | j-i=k-N} have the same address, whose value is k+h (1≦k≦N-1).
The relationship between the x- and y-indices of the element groups held at this address (k+h) is explained next. FIG. 10 shows the relationship between the x-index and the y-index of the element groups held at the address. The element group {(i, j) | j-i=k} corresponds to the lattice points on the straight line 90 in the figure, whose j-intercept is k, and the element group {(i, j) | j-i=k-N} corresponds to the lattice points on the straight line 91, whose i-intercept is N-k. Since the domain of i and j is 1≦i, j≦N, the number of lattice points is (N-k) on the former line and k on the latter. The number of elements at the same address (k+h) is therefore N. This holds for every k, and the element groups for different k are mutually exclusive. That is, the element group {(i, j) | j-i=k, j-i=k-N} has N elements, each of which is held in one of the N PEs at address (k+h).
(2) The inter-PE distance L of the element group with the same address value (k+h) is N-k. In other words, elements at the same address require the same number of transfers between PEs. The elements at the same address in each PE can therefore be transferred in parallel in the ring array processor configuration.
(3) The relocation destination address K' of the element group with the same address value (k+h) is equal to the inter-PE distance L of that element group plus h, that is, (N-k+h). The element group subject to the parallel data transfer shown in (2) can therefore be placed at the address (N-k+h) obtained by adding h to the number of data transfers between PEs.
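Observations (1) to (3) can be checked by brute force for a small N. The following sketch groups the off-diagonal elements of the matrix P by address and verifies each claim; the function names `group_by_address` and `check_groups` are hypothetical, and the base address h defaults to 0.

```python
def group_by_address(n, h=0):
    """Map each address k+h (1 <= k <= N-1) to the elements of P stored there."""
    groups = {}
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            if i != j:
                k = j - i if i < j else j - i + n   # j-i = k or j-i = k-N
                groups.setdefault(k + h, []).append((i, j))
    return groups

def check_groups(n, h=0):
    """Verify observations (1) to (3) for every address group."""
    for address, elements in group_by_address(n, h).items():
        k = address - h
        assert len(elements) == n                                  # (1) N elements
        assert sorted(i for i, _ in elements) == list(range(1, n + 1))  # one per PE
        for i, j in elements:
            dist = n - (j - i) if i < j else i - j
            assert dist == n - k                                   # (2) L = N-k
            dest = n - (j - i) + h if i < j else i - j + h
            assert dest == n - k + h                               # (3) K' = L+h
    return True
```

For N=4 the group at address h+1 is (1, 2), (2, 3), (3, 4), (4, 1): three elements with j-i=1 and one with j-i=1-N.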
Based on (1) to (3) above, the data relocation can be performed by the following parallel data transfer processing.
The element group at address (k+h) is transferred between PEs (N-k) times and placed at address (N-k+h). Likewise, the element group at address (N-k+h) is transferred between PEs k times and placed at address (k+h). In other words, the processing exchanges the element group at address (k+h) with the element group at address (N-k+h) while performing parallel data transfer between the PEs. For the PEs subject to relocation, the number of transfers between PEs and the relocation destination address are determined solely by the address (k+h) of the element group and the number of PEs N, so the relocation can be realized by controlling every PE with exactly the same number of transfers and the same relocation destination address.
The explanation so far has used the relocation matrix P → matrix Q as an example, but the same applies to the relocation matrix Q → matrix P.
[0094] Summarizing the explanation so far, the parallel processing method for relocating the element groups is shown below.
(D) Parallel processing method for relocation processing
In a ring array processor configuration consisting of N PEs (where the data transfer direction is PEi → PE(mod(i-1|N)) for all PEi (1≦i≦N); here mod(m|N) is N if m is an integer multiple of N and otherwise the remainder when m is divided by N):
Step 1; Execute Step 2 to Step 7 for r=1, 2, ..., [N/2] (r is the repetition count), where [x] is the largest integer not exceeding x.
Step 2; In all PEi (1≦i≦N), extract the element at address (r+h).
Step 3; In all PEi (1≦i≦N), send the extracted element to the next PE and at the same time receive the element extracted by the previous PE. Repeat this data transfer (N-r) times.
Step 4; When N is even and r=[N/2], in all PEi (1≦i≦N), store the transferred element at address (N-r+h) and end the processing.
Step 5; In all PEi (1≦i≦N), extract the element at address (N-r+h) and store the transferred element at address (N-r+h).
Step 6; In all PEi (1≦i≦N), send the element extracted from address (N-r+h) to the next PE and at the same time receive the element extracted by the previous PE. Repeat this data transfer r times.
Step 7; In all PEi (1≦i≦N), store the transferred element at address (r+h).
In this parallel processing method the repetition count of Step 1 is defined as [N/2], so the special processing of Step 4 is provided for the case where N is even. The reason is as follows.
If N is odd, [N/2] is (N-1)/2, so the addresses of the elements extracted in Step 2 are {h+1, h+2, ..., h+(N-1)/2}, and the addresses at which these elements are stored are {h+(N-1), h+(N-2), ..., h+(N+1)/2}. The addresses of the elements extracted in Step 5 are {h+(N-1), h+(N-2), ..., h+(N+1)/2}, and the addresses at which they are stored are {h+1, h+2, ..., h+(N-1)/2}. Therefore, Step 1 to Step 7 can be executed without any overlap.
[0097] If N is even, [N/2] is equal to N/2, so the addresses of the elements extracted in Step 2 are {h+1, h+2, ..., h+N/2}, and the addresses at which these elements are stored are {h+(N-1), h+(N-2), ..., h+N/2}. If Step 4 did not exist, the addresses of the elements extracted in Step 5 would be {h+(N-1), h+(N-2), ..., h+N/2} and the addresses at which they are stored would be {h+1, h+2, ..., h+N/2}. The data transfer associated with relocating the element at address (h+N/2) would therefore be executed twice. With Step 4 in place, the relocation of the element at address (h+N/2) is completed by Step 2 to Step 4, which eliminates the redundant relocation for the element at this address and reduces unnecessary processing steps.
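Steps 1 to 7 above can be simulated sequentially to check that they turn one distribution into the other. The sketch below is a plain Python model, not the control mechanism of the patent: the names `mod1`, `ring_shift` and `relocate` are hypothetical, `mem[i][k]` models address h+k of PEi, and each call to `ring_shift` stands for one transfer that all PEs would perform simultaneously in hardware.

```python
def mod1(m, n):
    """mod(m|N) with values in 1..N (N itself for multiples of N)."""
    return (m - 1) % n + 1

def ring_shift(values, n):
    """One parallel transfer: PEi sends to PE(i-1), and PE1 sends to PEN."""
    return {mod1(i - 1, n): values[i] for i in range(1, n + 1)}

def relocate(mem, n):
    """Relocate in place; mem[i][k] is the content of address h+k in PEi.
    Returns the number of parallel transfer operations performed."""
    transfers = 0
    for r in range(1, n // 2 + 1):                    # Step 1
        moving = {i: mem[i][r] for i in mem}          # Step 2
        for _ in range(n - r):                        # Step 3
            moving = ring_shift(moving, n)
            transfers += 1
        if n % 2 == 0 and r == n // 2:                # Step 4
            for i in mem:
                mem[i][n - r] = moving[i]
            break
        back = {i: mem[i][n - r] for i in mem}        # Step 5: extract ...
        for i in mem:
            mem[i][n - r] = moving[i]                 # ... and store
        for _ in range(r):                            # Step 6
            back = ring_shift(back, n)
            transfers += 1
        for i in mem:
            mem[i][r] = back[i]                       # Step 7
    return transfers
```

Starting from distribution P (PEi holding (i, mod(i+k|N)) at address h+k), the loop leaves distribution Q in memory. For N=6 it performs 5+1+4+2+3 = 15 parallel transfers, and for N=5 it performs 4+1+3+2 = 10.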
Parallel processing method for placement processing Next, the case where N is an even number will be explained. N=6 (even number)
For the example of relocation processing in the case of , the data before relocation processing is
Distribution, parallel processing process of data relocation, and after relocation processing
Describe data distribution. Figure 11 shows the state before relocation processing.
The data distribution of each PE is shown. Moreover, FIG. 12 shows one of the aspects of the present invention.
The relocation processing process of the embodiment is shown. Figure 13 is an example of the present invention.
Each PE data distribution after the relocation process of the example is shown. Figure 11
In the example, the x-index of the element assigned to the PE
is equal to the PE number. Parallel processing of the relocation process mentioned earlier
According to the method, the parallel processing process shown in Fig. 12 is transformed into Fig. 13.
I will explain it together. [0099] In step 1 in Fig. 12, first, add
The element of response (h+1) is extracted from all PEs. take
The number of inter-PE transfers for all issued elements is N-1=
Since 6-1=5, steps 2 to 6 are as follows.
This shows the process of circularly transferring elements between PEs.
. [0100] Then, step 6 is for these elements.
The transfer to the relocation destination PE is completed. [0101] In step 7, these transferred elements
is stored in a format that replaces the element at the address to which it is to be relocated.
It will be done. In other words, take the element at the relocation destination address (h+5).
and store the element transferred to this address (h+5).
pay. [0102] In step 8, extract the data in the format to be exchanged.
The element of address (h+5) is transferred the number of times N-5=6.
-5=1 is transferred between PEs and transferred to the relocation destination PE
complete. [0103] In step 9, these elements are relocated to the destination address.
It is stored in address (h+1). At this time, the address (
h+1) has already been rearranged, so it cannot be transferred.
Store the received element as is at this address (h+1)
do. [0104] As described above, steps 1 to 9
address () via circular data transfer between PEs.
The exchange of the element at address (h+1) and the element at address (h+5) is
Complete. In step 9, further processing of the next exchange is performed.
The element at the target address (h+2) is extracted. These elements are then added to steps 10 to 13.
As shown, N-2 = 6-2 = 4 times of circular data between PEs.
After data transfer, the address is allocated to the relocation destination PE.
(h+4). Also, at this time, the address (
The element stored in h+4) is retrieved. Like this
Step 14 shows the child. [0105] In steps 15 and 16, this application
Elements of dress (h+4) are N-4=6-4=2 PEs
After inter-circular transfer, it is allocated to the relocation destination PE, and that PE
is stored at address (h+2). [0106] For the element at address (h+3),
the relocation destination address is also (h+3). The elements extracted from this address therefore undergo N-3 = 6-3 = 3 circular data transfers between PEs, arrive at the relocation destination PE, and are stored at address (h+3), the same address from which they were extracted. These steps are shown as steps 18 to 21, and all relocation processing is completed at step 21. FIG. 13 shows the data distribution after relocation, obtained through the process of FIG. 12. In this distribution, the indexes of the elements are in a transposed relationship, which shows that the relocation process was executed correctly.
[D-2] Parallel processing method for relocation processing when N is an odd number
Next, the case where N is an odd number is explained. For an example with N=5 (odd), the data distribution before relocation, the parallel processing steps of the relocation, and the data distribution after relocation are described. FIG. 14 shows the data distribution of each PE before relocation processing. FIG. 15 shows the relocation process according to an embodiment of the present invention. FIG. 16 shows the data distribution of each PE after relocation processing in that embodiment. When N is odd, relocation can be performed by exactly the same procedure as when N is even.
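The statement that one procedure serves both even and odd N can be checked with a small simulation. The following is a minimal sketch, not the patented implementation: the function names `rotate`, `relocate`, and `initial_layout` are hypothetical, and the initial layout (address (h+r) of PE p holds the element with index pair (p, (p-r) mod N)) is an assumption chosen so that (N-r) transfers toward the next PE realize the transposition. The figures may use a different but equivalent convention.

```python
def rotate(buf, hops):
    """Circular transfer: each hop moves the buffer of PE p to PE (p+1) mod N."""
    n = len(buf)
    out = [None] * n
    for p in range(n):
        out[(p + hops) % n] = buf[p]
    return out


def relocate(mem, h=0):
    """Relocate in place; mem[p][h + r] models address (h+r) of PE p.

    Returns the number of parallel inter-PE transfer steps performed.
    """
    n = len(mem)
    transfers = 0
    for r in range(1, n // 2 + 1):
        first, second = h + r, h + n - r
        # All PEs take out the element at the first address and pass it N-r times.
        buf = rotate([mem[p][first] for p in range(n)], n - r)
        transfers += n - r
        if first == second:
            # Even N, r = N/2: no partner address, store at the same address.
            for p in range(n):
                mem[p][first] = buf[p]
            break
        # Exchange with the element held at the second address ...
        for p in range(n):
            buf[p], mem[p][second] = mem[p][second], buf[p]
        # ... then pass the exchanged element r times and store it at the first address.
        buf = rotate(buf, r)
        transfers += r
        for p in range(n):
            mem[p][first] = buf[p]
    return transfers


def initial_layout(n, h=0):
    # ASSUMED layout (not taken from the figures): address (h+r) of PE p holds
    # the element with index pair (p, (p - r) mod n). Addresses below h are unused.
    return [[None] * h + [(p, (p - r) % n) for r in range(n)] for p in range(n)]


for n in (6, 5):
    mem = initial_layout(n)
    steps = relocate(mem)
    transposed = all(mem[p][r] == ((p - r) % n, p) for p in range(n) for r in range(n))
    print(n, steps, transposed)
```

Under these assumptions the sketch performs 15 parallel transfer steps for N = 6 and 10 for N = 5, and in both cases every PE ends up holding the transposed index pairs.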
Steps 1 to 6 in FIG. 15 show that, after N-1 = 5-1 = 4 circular data transfers between PEs, the element at address (h+1) is relocated to address (h+4).
[0109] In steps 6 to 8, after N-4 = 5-4 = 1 circular data transfer between PEs, the element at address (h+4) is relocated to address (h+1). This completes the relocation of the elements at address (h+1) and address (h+4). From step 8 onward, the elements at address (h+2) and address (h+3) are likewise relocated via circular data transfer between PEs, and at step 15 the relocation of all elements is completed. Comparing the data distributions before and after relocation in FIG. 14 and FIG. 16 shows that, for the elements at each address of each PE, the indexes are in a transposed relationship, so the relocation has completed correctly.
(E) Processing time
[E-1] Total processing time
Next, the processing time is estimated for the case where relocation is executed by the above parallel processing method. First, the processing time required for data transfer is estimated from the total number of data transfers between PEs. All elements at the same address in each PE require the same number of data transfers, and in the ring array processor configuration they can be transferred between PEs in parallel.
[0112] Because the above parallel processing method handles both even and odd N, the total number of transfers is found by summing the number of transfers for the element at each address of a single PE.
[0113] Since the number of inter-PE data transfers for the element at address (h+r) is (N-r), letting the total number of transfers be Ttr gives [Equation 9].
[0115] Therefore, if the time for one data transfer is S(tr), the total processing time S(tr-all) required for inter-PE data transfer is, from equation (10),
S(tr-all) = (1/2)・N・(N-1)・S(tr) ...(11).
[E-2] Element retrieval and restorage processing time
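Before turning to retrieval and storage, the step from the per-address count (N-r) to equation (11) can be written out. This is a reconstruction, assuming that the sum runs over the N-1 relocated addresses r = 1, ..., N-1 (the element at address h itself is not moved). The symbol T_tr stands for Ttr, and the original [Equation 9] and equation (10) are not reproduced in this text.

```latex
T_{tr} \;=\; \sum_{r=1}^{N-1} (N - r) \;=\; \sum_{k=1}^{N-1} k \;=\; \frac{N(N-1)}{2},
\qquad
S(\mathrm{tr\text{-}all}) \;=\; T_{tr} \cdot S(\mathrm{tr}) \;=\; \frac{1}{2}\, N\,(N-1)\, S(\mathrm{tr}).
```

For N = 6 the sum is 5+4+3+2+1 = 15 transfers, and for N = 5 it is 4+3+2+1 = 10, which matches the step counts in the two worked examples.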
The processing time required to retrieve and re-store the elements held in each PE is estimated as follows. Let S(R) be the time required to retrieve one element and S(W) the time required to store one. In the above parallel processing method, the elements at the same address are retrieved from every PE simultaneously, transferred between PEs in parallel on the ring array processor configuration, and then stored simultaneously at the relocation destination address. The number of element retrievals or re-stores therefore only needs to be counted for a single PE. That number equals the number of addresses subject to relocation, which is (N-1), because the element at address (h+0) needs no relocation. The total processing time S(R/W) for retrieval and re-storage in each PE is therefore
S(R/W) = (N-1)・{S(R) + S(W)} ...(12).
From equations (11) and (12), the total processing time S(arng) required for the relocation process is
S(arng) = S(tr-all) + S(R/W)
= (1/2)・N・(N-1)・S(tr) + (N-1)・{S(R) + S(W)}
= (N-1)・{(1/2)・N・S(tr) + S(R) + S(W)} ...(13).
Assuming that the second and third terms inside { } in equation (13) are negligible compared with the first term (the first term is proportional to N, while the second and third are constant regardless of N), the total processing time S(arng) can be approximated as
S(arng) ≒ (1/2)・N・(N-1)・S(tr) ...(14).
That is, the total processing time S(arng) is dominated by the time required for data transfer between PEs. Therefore, regardless of whether N (the number of PEs) is even or odd, the relocation process can be executed in a processing time of O(N^2/2).
(F) Control method for relocation processing using parallel processing method
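Before the control method is described, the estimates of section (E) can be evaluated numerically. The following is a minimal sketch of equations (11) to (14). The function name `relocation_time` and the unit costs passed to it are hypothetical placeholders, not values from the text.

```python
def relocation_time(n, s_tr, s_r, s_w):
    """Return (exact, approx) relocation time for n PEs.

    s_tr, s_r, s_w are the per-transfer, per-retrieval and per-store times
    S(tr), S(R), S(W); their values here are illustrative assumptions.
    """
    transfer = 0.5 * n * (n - 1) * s_tr    # equation (11): S(tr-all)
    read_write = (n - 1) * (s_r + s_w)     # equation (12): S(R/W)
    exact = transfer + read_write          # equation (13): S(arng)
    approx = transfer                      # equation (14): transfer term only
    return exact, approx


for n in (5, 6, 64):
    exact, approx = relocation_time(n, s_tr=1.0, s_r=1.0, s_w=1.0)
    print(n, exact, approx, round(approx / exact, 3))
```

With all three unit costs set to 1, the transfer term accounts for 10 of 18 time units at N = 5 but 2016 of 2142 (about 94%) at N = 64, which illustrates why approximation (14) improves as N grows.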
The control method used when relocation is performed by the parallel processing method described above is explained next. In that method, all elements transferred between PEs in parallel come from the same address and require the same number of transfers, so the address setting for retrieving an element, the counting of the number of transfers, and the address setting for re-storing are exactly the same in every PE. The relocation process can therefore be realized by applying exactly the same control to each PE. FIG. 17 shows the control flowchart of the data relocation processing according to the present invention. This flowchart corresponds to the parallel processing method for relocation described above. The characteristic feature of the control method of the present invention is that the address values of the elements to be exchanged with each other are used to count the number of transfers between PEs. The explanation below follows FIG. 17 and clarifies its relationship to the parallel processing steps described above.
Step 110; Initial values are set in counters CA and CB, which hold the element retrieval and restorage addresses. Counter CA is set to "h+1" and counter CB to "h+N-1". The address values held in these two counters are the addresses of the pair whose elements are exchanged with each other in the parallel processing. In addition, parameter BR1, which specifies the number of loop iterations, is set to [N/2]+h, and parameter BR2, which is used to determine whether N is even or odd, is set to N.
[0122] Step 111; In all PEs, the element at the address indicated by counter CA is retrieved. At the same time, the content of counter CB, which indicates the restorage address, is loaded into counter CT, which counts the number of transfers.
Step 112; In all PEs, the retrieved element is sent to the next PE and, at the same time, an element is received from the previous PE. At this time, every PE decrements counter CT.
Step 113; The transfer of elements between PEs and the decrement of counter CT (step 112) are repeated until the value of counter CT becomes "h".
Step 114; When the value of counter CT becomes "h", the transferred element is in the relocation destination PE, so the inter-PE data transfer ends and the transferred element is stored at the address indicated by counter CB. At this time, the address indicated by counter CB still holds another element that has not yet been relocated, so the transferred element is stored at this address in exchange for it. If parameter BR2 is an even number and counter CA is equal to parameter BR1, the process ends.
Step 115; The content of counter CA is loaded into counter CT. Then, in all PEs, the element retrieved from the address indicated by counter CB is sent to the next PE and, at the same time, an element is received from the previous PE.
Step 116; In all PEs, counter CT is decremented.
[0128] Step 117; This processing is repeated until the value of counter CT becomes "h".
Step 118; When the value of counter CT becomes "h", the transferred element is stored at the address indicated by counter CA. Because the element originally at the address indicated by counter CA has already been relocated, the transferred element is stored at that address as is. At this time, if counter CA is equal to parameter BR1, the process ends.
[0130] Step 119; When the relocation of the elements of the two addresses by mutual exchange is finished, counter CA is incremented and counter CB is decremented. Then, by the same process as above, the elements at the addresses indicated by counter CA and counter CB are relocated via inter-PE data transfer.
[0131] As described above, the elements of two addresses are mutually
exchanged, and this exchange is repeated until the value of counter CA matches "[N/2]+h" (steps 114 and 118). In this control flowchart, whether the value of counter CA has reached "[N/2]+h" is determined (steps 114 and 118) by detecting a match between the value of counter CA and the content of parameter BR1, "[N/2]+h". The content of parameter BR1 is set at the start of the relocation process (step 110). If N is an even number, the element at address ([N/2]+h) is not relocated by exchanging the elements of two addresses. Instead, after being transferred to the relocation destination PE by inter-PE data transfer, it is re-stored directly at the same address. Whether the total number N of PEs is even or odd is determined (step 114) from the content of parameter BR2, which is N. The content of BR2 is also set at the start of the relocation process (step 110).
[0132] As explained so far, the relocation process is carried out through the basic, regular operation of exchanging the elements of two addresses via parallel data transfer between PEs. In this control method, counter CA and counter CB not only give the retrieval and restorage addresses of the elements but also give the number of inter-PE data transfers for the element at the address indicated by counter CB and for the element at the address indicated by counter CA, respectively.
(G) Control hardware configuration for relocation processing using parallel processing method
This section explains the control hardware configuration that realizes the control method shown in (F). FIG. 18 shows the configuration of the control hardware for data relocation processing according to the present invention.
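As a software reference for the hardware described in this section, the control flow of FIG. 17 (steps 110 to 119) can be sketched with plain integer counters. This is an illustrative model under stated assumptions, not the circuit itself: the function name `run_control` and the list-of-lists memory are hypothetical, the test layout is the same assumed one (address (h+r) of PE p holds index pair (p, (p-r) mod N)), and ending at step 118 when counter CA equals BR1 follows the reading of paragraph [0131] that the loop stops when CA matches [N/2]+h.

```python
def run_control(mem, h=0):
    """Counter-driven relocation; mem[p][a] models address a of PE p."""
    n = len(mem)
    if n < 2:
        return                                   # nothing to relocate
    ca, cb = h + 1, h + n - 1                    # step 110: retrieval / restorage addresses
    br1, br2 = n // 2 + h, n                     # step 110: loop bound, parity parameter
    while True:
        buf = [mem[p][ca] for p in range(n)]     # step 111: take out element at CA
        ct = cb                                  # step 111: load CT from CB
        while ct != h:                           # steps 112-113: N-r transfers
            buf = buf[-1:] + buf[:-1]            # send to next PE, receive from previous
            ct -= 1
        for p in range(n):                       # step 114: store at CB in exchange
            buf[p], mem[p][cb] = mem[p][cb], buf[p]
        if br2 & 1 == 0 and ca == br1:           # step 114: even N (LSB 0), middle address
            return
        ct = ca                                  # step 115: load CT from CA
        while ct != h:                           # steps 115-117: r transfers
            buf = buf[-1:] + buf[:-1]
            ct -= 1
        for p in range(n):                       # step 118: store at CA
            mem[p][ca] = buf[p]
        if ca == br1:                            # step 118: last pair done (odd N)
            return
        ca, cb = ca + 1, cb - 1                  # step 119: next address pair


n = 5
mem = [[(p, (p - r) % n) for r in range(n)] for p in range(n)]   # assumed layout
run_control(mem)
print(all(mem[p][r] == ((p - r) % n, p) for p in range(n) for r in range(n)))
```

The sketch makes the dual role of the counters visible: CB is both the store address and the initial value of CT for the element taken from CA, and CA plays the same two roles for the element taken from CB.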
In FIG. 18, incrementer CA 120, decrementer CB 121, and decrementer CT 124 correspond to counter CA, counter CB, and counter CT shown in FIG. 17, respectively. Register BR1 122 corresponds to BR1 in FIG. 17, which specifies the control parameter for the number of loop iterations, and register BR2 123 corresponds to BR2, which specifies the parameter for determining whether N is even or odd. Selector 127 switches between the outputs of incrementer CA 120 and decrementer CB 121 and loads the selected output into decrementer CT 124.
[0135] The flag generation circuit 125 detects the end of parallel inter-PE data transfer by detecting that the value of decrementer CT 124 has become "h", and generates a flag indicating this (hereinafter called the "h"-flag). The "h"-flag is also used as the control signal that switches the data load source for decrementer CT 124. That is, every time the "h"-flag turns on, selector 127 switches the data load source between incrementer CA 120 and decrementer CB 121.
[0136] N is set in register BR2 123. Whether N is even or odd is determined by whether the LSB (least significant bit) of register BR2 123 is "0" or "1": "0" indicates an even number and "1" an odd number. The repetition is controlled by a match between [N/2]+h and the value of incrementer CA 120. As shown in the control method above, incrementer CA 120 counts from "h+1" to "[N/2]+h", so detecting that the value of incrementer CA 120 matches the content of register BR1 122 is equivalent to counting [N/2] iterations.
[0137] For this purpose, the control hardware configuration is provided with a match detection circuit 126 that detects a match between the value of incrementer CA 120 and the content of register BR1 122. The end of the relocation process is determined from the match flag output by this match detection circuit 126. Because the parity of N affects the termination condition of the relocation process, the LSB of register BR2 123 is input to the match detection circuit 126.
[0139] As a result, the control method shown in (F) can be realized by using incrementer CA 120, decrementer CB 121, decrementer CT 124, and register BR1 122 and detecting their specific values.
[Effects of the Invention]
As described above, the data relocation processing method of the present invention
enables multiple (N) processing elements to relocate their elements simultaneously, so the relocation process can be made N times faster than relocating the elements one at a time. Moreover, because the present invention performs relocation by exchanging the elements of two addresses, the processing is regular and efficient.
[0141] Furthermore, according to the relocation processing control method of the present invention, the address value of an element is used not only as the retrieval address and the storage address but also to count the number of transfers when the element is transferred to the relocation destination PE by inter-PE data transfer. With this control method, which has the structure of a double DO loop, the relocation process can be controlled regularly and realized efficiently. In addition, exactly the same control is applied to every PE, so there is no need to build a complicated control configuration in which the PEs mutually manage each other's control state or in which each PE of the ring array processor configuration is controlled individually. This makes control easier.
[0142] Furthermore, according to the control hardware configuration of the present invention, the control can be realized with counters and two types of detection circuits that detect specific counter values and generate flags, so the hardware configuration can be simplified and the scale of the hardware reduced.

[Brief description of the drawings]

FIG. 1 is a diagram explaining the principle of the present invention.

FIG. 2 is a configuration diagram of a ring array processor when learning processing in pattern recognition is executed by parallel processing.

FIG. 3 is a diagram showing parallel processing of the forward pass algorithm (Forward-Backward Procedure) using a ring array processor configuration.

FIG. 4 is a diagram showing parallel processing of the backward pass algorithm (Forward-Backward Procedure) using a ring array processor configuration.

FIG. 5 is a diagram for explaining parallel processing of the re-estimation calculation of initial state probabilities (Baum-Welch Reestimation Formulas) using a ring array processor configuration.

FIG. 6 is a data flow for parallel processing, in a ring array processor configuration, of the re-estimation calculation of state transition probabilities in the Baum-Welch Reestimation Formulas.

FIG. 7 is a data flow for parallel processing, in a ring array configuration, of the re-estimation calculation of symbol output probabilities in the Baum-Welch Reestimation Formulas used in learning processing.

FIG. 8 is a diagram showing the contents of the data relocation processing required for learning processing.

FIG. 9 is a diagram showing the relationship between an element's index difference and its pre-relocation address, relocation destination address, and inter-PE distance.

FIG. 10 is a diagram showing the relationship between the x-index and y-index of the element group held at an address.

FIG. 11 is a diagram showing the data distribution of each PE before relocation processing.

FIG. 12 is a diagram showing the relocation process according to an embodiment of the present invention.

FIG. 13 is a diagram showing the data distribution of each PE after relocation processing according to an embodiment of the present invention.

FIG. 14 is a diagram showing the data distribution of each PE before relocation processing.

FIG. 15 is a diagram showing the relocation process according to another embodiment of the present invention.

FIG. 16 is a diagram showing the data distribution of each PE after relocation processing according to another embodiment of the present invention.

FIG. 17 is a control flowchart of the data relocation processing according to the present invention.

FIG. 18 is a configuration diagram of the control hardware for data relocation processing according to the present invention.

[Explanation of symbols]

120 Incrementer CA
121 Decrementer CB
122 Register BR1
123 Register BR2
124 Decrementer CT
125 Flag generation circuit
126 Match detection circuit
127 Selector

Claims

[Claims]

1. A data reallocation method in a multiprocessor in which a plurality of processing elements are connected in a ring, each processing element being connected to the other processing elements via a data transfer path for exchanging data and having arithmetic means for performing a desired operation, data transfer means for transferring data between the processing elements, storage means for storing addresses and data, reading means for reading out the stored data, and control means for controlling these means, numbers being assigned to the plurality of processing elements in the order in which they are arranged in the multiprocessor, the method comprising a procedure in which, in all the processing elements, the reading means reads data from a first address of the storage means; the read data is transferred between the processing elements a predetermined number of times; the first-address data transferred to each processing element is exchanged with the data at a second address and stored in the storage means; the second-address data obtained by the exchange is simultaneously transferred between the processing elements a predetermined number of times; and the second-address data transferred to each processing element is stored at the first address of the storage means; wherein, if the total number of processing elements (N) is an odd number, the procedure is executed the number of times given by the value (N/2) obtained by dividing the total number by 2; if the total number of processing elements is an even number, the procedure is executed (N/2-1) times; and, if the total number of processing elements is an even number and the value (N/2) is equal to the data transfer repeat count (r), all the processing elements simultaneously retrieve the data at the first address of the storage means, the retrieved first-address data is simultaneously transferred between the processing elements a predetermined number of times, and in each processing element the transferred first-address data is stored at the second address of the storage means.

2. A method for controlling data reallocation in a multiprocessor, characterized in that the number of transfers between the processing elements for the data at the first address (h+r) is given by the value (N-r) obtained by subtracting h from the address value of the second address (h+N-r) of the data to be exchanged with it; the number of transfers between the processing elements for the data at the second address (h+N-r) is given by the value r obtained by subtracting h from the address value of the first address (h+r) of the data to be exchanged with it; and, for r = 1, 2, ..., [N/2], the number of transfers in the simultaneous transfer between the processing elements of the data retrieved from the storage means of the processing elements is counted, thereby controlling the process of transferring the retrieved data to the processing element to which it is to be relocated.

3. A data reallocation control mechanism in a multiprocessor, comprising: a first counter that holds the first address; a second counter that holds the second address; a third counter that counts the number of transfers in the simultaneous transfer between the processing elements of the data retrieved from the storage means of the processing elements; a selector that switches between the output of the first counter and the output of the second counter, the output of the selector being connected to the input of the third counter; first flag generating means that detects that the value of the third counter is equal to h and generates a first flag, the selector switching between the output of the first counter and the output of the second counter according to the content of the first flag; a first register that holds a control parameter for the number of repetitions; a second register that holds a parameter for determining the parity of the number of all the processing elements; second flag generating means that detects a match between the content of the first register and the content of the first counter and generates a second flag; parity determining means that determines whether the number of all the processing elements is even or odd based on the content of the least significant bit of the second register; and end detection means that detects the end of the data reallocation.