JP3691538B2

JP3691538B2 - Vector data addition method and vector data multiplication method

Info

Publication number: JP3691538B2
Application number: JP04676395A
Authority: JP
Inventors: 浩二黒田
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1995-03-07
Filing date: 1995-03-07
Publication date: 2005-09-07
Anticipated expiration: 2020-09-07
Also published as: JPH08241302A

Description

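[0001]
[Industrial Field of Application]
The present invention relates to a vector data addition method that enables vector addition to be executed at high speed, and to a vector data multiplication method that uses that addition method to enable vector multiplication to be executed at high speed.
[0002]
A vector processing device executes vector addition and vector multiplication. Such vector operations need to be executed at high speed.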
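[0003]
[Prior Art]
When a conventional vector processing device performs vector addition, it allows for possible carries by interleaving vector shift instructions with the vector add instructions.
[0004]
This prior art is described in detail below, taking the addition of a 160-bit augend and a 160-bit addend as an example.
When the adder performs 64-bit by 64-bit addition, the conventional approach, as shown in FIG. 13, prepares an augend register group consisting of three 64-bit registers (vr00, vr01, vr02) and an addend register group consisting of three 64-bit registers (vr03, vr04, vr05). The 160-bit augend is stored in the augend registers and the 160-bit addend in the addend registers, for example in the format shown in FIG. 14, which is illustrated schematically in FIG. 15.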
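[0005]
The vector instruction sequence shown in FIG. 16 is then issued to add the 160-bit augend and the 160-bit addend. Here,
"VA vr1, vr2, vr3"
is a vector add instruction that stores the sum of vector register vr1 and vector register vr2 in vector register vr3,
"VSR vr1, SC, vr3"
is a vector shift instruction that shifts the data in vector register vr1 right by SC bits and stores it in vector register vr3, and
"VSL vr1, SC, vr3"
is a vector shift instruction that shifts the data in vector register vr1 left by SC bits and stores it in vector register vr3.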
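[0006]
Following the vector instruction sequence shown in FIG. 16, the vector add instruction VA of (1) first adds the augend part in vector register vr02 and the addend part in vector register vr05 and stores the sum in vector register vr10. Because a carry may occur here, the vector shift instruction VSR of (2) then shifts the data in vector register vr10 right by 60 bits to extract the carry value (the carry-out data) and stores it in vector register vr15.
[0007]
Next, the vector add instruction VA of (3) adds the augend part in vector register vr01 and the addend part in vector register vr04 and stores the sum in vector register vr20.
[0008]
Next, to include the carry produced by the lower-order addition, the vector add instruction VA of (4) adds the carry-out data held in vector register vr15 to the data in vector register vr20 and stores the sum in vector register vr20. Because a carry may occur here as well, the vector shift instruction VSR of (5) then shifts the data in vector register vr20 right by 60 bits to extract the carry-out data and stores it in vector register vr25.
[0009]
Next, the vector add instruction VA of (6) adds the augend part in vector register vr00 and the addend part in vector register vr03 and stores the sum in vector register vr30.
[0010]
Next, to include the carry produced by the lower-order addition, the vector add instruction VA of (7) adds the carry-out data held in vector register vr25 to the data in vector register vr30 and stores the sum in vector register vr6.
[0011]
Next, to extract the 60 bits of valid data held in vector register vr10, the vector shift instruction VSL of (8) shifts the data in vector register vr10 left by 4 bits and stores it back in vector register vr10, and the vector shift instruction VSR of (9) shifts that data right by 4 bits and stores it in vector register vr8. Vector register vr8 thus holds the 60 bits of valid data with zeros in its upper 4 bits.
[0012]
Next, to extract the 60 bits of valid data held in vector register vr20, the vector shift instruction VSL of (10) shifts the data in vector register vr20 left by 4 bits and stores it back in vector register vr20, and the vector shift instruction VSR of (11) shifts that data right by 4 bits and stores it in vector register vr7. Vector register vr7 thus holds the 60 bits of valid data with zeros in its upper 4 bits.
[0013]
In this way, when a conventional vector processing device performs vector addition, it allows for possible carries by interleaving vector shift instructions with the vector add instructions.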
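The eleven-instruction conventional sequence of paragraphs [0006] to [0013] can be traced with a small Python model. This is only a sketch: each Python integer stands for one element of a vector register, the helper names `va`, `vsr`, `vsl`, and `split_160` are invented for illustration, and the 40/60/60-bit layout of the 160-bit operands is an assumption inferred from the 60-bit shifts in the text (FIG. 14 and FIG. 15 are not reproduced here).

```python
MASK64 = (1 << 64) - 1
MASK60 = (1 << 60) - 1

def va(x, y):
    """VA: 64-bit add; any carry out of bit 63 is simply lost."""
    return (x + y) & MASK64

def vsr(x, sc):
    """VSR: logical right shift by sc bits."""
    return x >> sc

def vsl(x, sc):
    """VSL: left shift by sc bits within a 64-bit register."""
    return (x << sc) & MASK64

def split_160(n):
    """Assumed layout: top 40 bits, then two 60-bit parts."""
    return n >> 120, (n >> 60) & MASK60, n & MASK60

def add160_conventional(a, b):
    vr00, vr01, vr02 = split_160(a)   # augend
    vr03, vr04, vr05 = split_160(b)   # addend
    vr10 = va(vr02, vr05)             # (1)
    vr15 = vsr(vr10, 60)              # (2) carry out of the low part
    vr20 = va(vr01, vr04)             # (3)
    vr20 = va(vr15, vr20)             # (4)
    vr25 = vsr(vr20, 60)              # (5) carry out of the middle part
    vr30 = va(vr00, vr03)             # (6)
    vr6 = va(vr25, vr30)              # (7)
    vr10 = vsl(vr10, 4)               # (8)
    vr8 = vsr(vr10, 4)                # (9) low 60 bits, upper 4 bits zero
    vr20 = vsl(vr20, 4)               # (10)
    vr7 = vsr(vr20, 4)                # (11) middle 60 bits, upper 4 bits zero
    return vr6, vr7, vr8
```

Recombining the three result registers as `(vr6 << 120) | (vr7 << 60) | vr8` gives the full sum under the assumed layout; seven of the eleven instructions exist only to handle carries and masking.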
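[0014]
The multiplier in a conventional vector processing device, on the other hand, even when its input specification is 64 bits × 64 bits, implements only a narrower multiplication function such as 64 bits × 16 bits in order to reduce the amount of hardware.
[0015]
To realize the 64-bit by 64-bit multiplication of the input specification, the multiplier operand is divided into four 16-bit parts. The 64-bit × 16-bit multiplication function multiplies each 16-bit multiplier part by the 64-bit multiplicand to obtain partial products, the partial products are added while being shifted by 16 bits, and the 64-bit portion indicated by the sum is output as the multiplication result.
[0016]
For example, when the upper 64 bits of a 64-bit by 64-bit product are required, then in accordance with the instruction, as shown in FIG. 17, the four multiplier parts are selected in order from the low-order side to obtain partial products, and the partial products are added while being shifted left by 16 bits to obtain and output the upper 64 bits of the product. When the lower 64 bits of a 64-bit by 64-bit product are required, then in accordance with the instruction, the four multiplier parts are selected in order from the high-order side to obtain partial products, and the partial products are added while being shifted right by 16 bits to obtain and output the lower 64 bits of the product.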
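The arithmetic behind this partial-product scheme can be sketched in Python as follows. The sketch is a simplification made for illustration: it accumulates all four 64 × 16-bit partial products at full width and selects the requested half at the end, whereas the hardware described above keeps only a 64-bit window while looping. The function name `mul64_conventional` is hypothetical.

```python
MASK64 = (1 << 64) - 1

def mul64_conventional(multiplicand, multiplier, upper):
    """Model of a 64x64 multiply built from four 64x16 partial products."""
    acc = 0
    for i in range(4):
        chunk = (multiplier >> (16 * i)) & 0xFFFF   # one 16-bit multiplier part
        partial = multiplicand * chunk              # 64 x 16 partial product
        acc += partial << (16 * i)                  # shift by 16 bits per step, then add
    return acc >> 64 if upper else acc & MASK64
```

Four passes through the narrow multiplier plus a shift-and-add loop are needed for every 64-bit result, which is the cost paragraph [0018] objects to.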
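[0017]
[Problems to Be Solved by the Invention]
However, interleaving vector shift instructions with vector add instructions to allow for carries, as in the prior art, increases the number of instructions, so vector addition cannot be executed at high speed.
[0018]
In addition, the prior-art multiplier performs many multiplications internally and must contain a function for shifting and adding partial products (implemented with a loop configuration).
[0019]
The present invention has been made in view of these circumstances. Its object is to provide a vector data addition method that enables vector addition to be executed at high speed, and a vector data multiplication method that uses that addition method to enable vector multiplication to be executed at high speed.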
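[0020]
[Means for Solving the Problems]
FIG. 1 shows the principle configuration of the present invention.
In the figure, 1 is a vector processing device that carries out the present invention. It comprises a CPU 10, a vector instruction control mechanism 11, a vector register 12, a mask register 13, an adder 14, and a multiplier 15.
[0021]
The CPU 10 issues vector instructions. The vector instruction control mechanism 11 controls the execution of vector instructions. The vector register 12 stores vector data. The mask register 13 stores mask data used in vector processing. The adder 14 executes vector add instructions. The multiplier 15 executes vector multiply instructions.
[0022]
To realize the present invention, the adder 14 takes as input, in addition to the addend and augend vector operands, the carry-out data stored in the mask register 13, and outputs the carry-out data of its result to the mask register 13. If the additional mask operands increase the number of registers an instruction must specify, it is preferable to use the same mask register for input and output so as to keep that number down.
[0023]
Likewise, to realize the present invention, the multiplier 15 has a function of computing the 2m-bit product of two m-bit inputs, and has a selector that, in response to the instruction, selects and outputs either the upper m bits or the lower m bits of the product.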
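[0026]
[Operation]
The adder 14 used in the present invention performs addition taking as input, in addition to the augend and addend vector operands, the carry-out data written in the mask register 13, and writes the carry-out data produced by that addition to the mask register 13.
[0027]
Using the adder 14 configured in this way, the vector processing device 1 that carries out the present invention operates as follows.
(i) Let
"VAC vr1, vr2, mr1, vr3, mr2"
denote a vector add instruction that stores in vector register vr3 the sum of the m-bit data in vector register vr1, the m-bit data in vector register vr2, and the carry-out data in mask register mr1, and stores the carry-out data produced by that addition in mask register mr2. The device receives an instruction to add an augend and an addend that is expressed as a sequence of such vector add instructions.
(ii) Using the adder 14 configured as described above, the device executes the vector add instructions of that sequence in order, thereby adding the augend and the addend.
Because the carry-out data is kept in the mask register 13, the present invention removes the need to interleave vector shift instructions with vector add instructions as in the prior art, and vector addition can be executed at high speed with a small number of instructions.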
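[0028]
The multiplier 15 used in the present invention has a function of computing the 2m-bit product of two m-bit inputs. Whereas a conventional multiplier obtains partial products and adds them while shifting to compute m-bit data as the product of two m-bit inputs, the multiplier 15 computes the 2m-bit product directly, without forming partial products. For example, it multiplies 64-bit data by 64-bit data to compute a 128-bit result.
[0029]
A conventional multiplier performs many multiplications internally and must contain a function for shifting and adding partial products; the multiplier 15 used in the present invention resolves these problems.
[0030]
However, when the data input to the multiplier 15 is m bits wide, the data input to the adder 14 is also m bits wide, so a multiplier that output 2m-bit data would not be consistent with the adder. The multiplier 15 addresses this with a selector that, in response to the instruction, selects and outputs either the upper m bits or the lower m bits of the product.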
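Using the multiplier 15 and adder 14 configured in this way, the vector processing device 1 that carries out the present invention operates as follows.
(i) Let
"VML vr1, vr2, vr3"
denote a vector multiply instruction that stores in vector register vr3 the lower m bits of the product of the m-bit data in vector register vr1 and the m-bit data in vector register vr2; let
"VMU vr1, vr2, vr3"
denote a vector multiply instruction that stores in vector register vr3 the upper m bits of that product; and let
"VAC vr1, vr2, mr1, vr3, mr2"
denote a vector add instruction that stores in vector register vr3 the sum of the m-bit data in vector register vr1, the m-bit data in vector register vr2, and the carry-out data in mask register mr1, and stores the carry-out data produced by that addition in mask register mr2. The device receives an instruction to multiply a multiplicand by a multiplier that is expressed as the following instruction sequence: VML, VMU, VML, VAC, VML, VAC, VMU, VMU, VAC, VML, VAC, VMU, and finally VAC.
(ii) The device executes the instructions of this sequence in order. It executes each VML and VMU instruction with the multiplier 15 configured as described above, and each VAC instruction with the adder 14 configured as described above.
Because the vector add instructions needed to carry out a vector multiplication keep their carry-out data in the mask register 13, the present invention removes the need to interleave vector shift instructions with vector add instructions as in the prior art. Vector addition can therefore be executed at high speed with a small number of instructions, and so vector multiplication can also be executed at high speed with a small number of instructions.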
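[0031]
[Embodiments]
The present invention is described in detail below with reference to embodiments.
As explained with FIG. 1, the adder 14 used in the present invention performs addition taking as input, in addition to the augend and addend vector operands, the carry-out data written in the mask register 13, and writes the carry-out data produced by that addition to the mask register 13.
[0032]
That is, as shown in FIG. 2, the adder takes as input the augend read from the vector register 12, the addend read from the vector register 12, and the carry-out data read from the mask register 13, performs the addition, writes the sum to the vector register 12, and writes the carry-out data produced by the addition to the mask register 13.
[0033]
Because the carry-out data read from the mask register 13 is 1-bit data, this addition can be realized with a simple hardware configuration.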
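The behavior of the adder 14 on a VAC instruction can be sketched element by element in Python. This is an illustrative model, not the hardware: vector and mask registers are represented as Python lists, m is taken to be 64, and the function name `vac` simply mirrors the instruction mnemonic.

```python
MASK64 = (1 << 64) - 1

def vac(vr1, vr2, mr1):
    """VAC vr1, vr2, mr1, vr3, mr2: returns (vr3, mr2).

    Each element adds a 64-bit augend, a 64-bit addend, and a 1-bit
    carry-in from the mask register; the 1-bit carry-out goes to the
    output mask register.
    """
    vr3, mr2 = [], []
    for x, y, carry_in in zip(vr1, vr2, mr1):
        total = x + y + carry_in
        vr3.append(total & MASK64)   # 64-bit sum to the vector register
        mr2.append(total >> 64)      # 0 or 1 to the mask register
    return vr3, mr2
```

For example, `vac([MASK64, 1], [1, 2], [0, 1])` wraps the first element to 0 with a carry-out of 1 and gives 4 with no carry for the second.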
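[0034]
As explained with FIG. 1, the multiplier 15 used in the present invention has a function of computing the 2m-bit product of two m-bit inputs, and has a selector that, in response to the instruction, selects and outputs either the upper m bits or the lower m bits of the product.
[0035]
That is, as shown in FIG. 3, the multiplier takes as input the m-bit multiplicand read from the vector register 12 and the m-bit multiplier read from the vector register 12, performs the multiplication, latches the 2m-bit product, and has a selector that, in response to the instruction, selects and outputs either the upper m bits or the lower m bits of the product.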
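A minimal Python sketch of the multiplier 15 with its output selector follows, assuming m = 64 and list-valued vector registers. The helper `multiply_select` and its `upper` flag are hypothetical names standing in for the selector control derived from the instruction.

```python
M = 64
MASK_M = (1 << M) - 1

def multiply_select(x, y, upper):
    """Compute the full 2m-bit product, then select one m-bit half."""
    product = x * y                      # latched 2m-bit product
    return product >> M if upper else product & MASK_M

def vml(vr1, vr2):
    """VML: lower m bits of each element-wise product."""
    return [multiply_select(x, y, upper=False) for x, y in zip(vr1, vr2)]

def vmu(vr1, vr2):
    """VMU: upper m bits of each element-wise product."""
    return [multiply_select(x, y, upper=True) for x, y in zip(vr1, vr2)]
```

Issuing VML and VMU on the same operands yields both halves of the 128-bit product while keeping every register transfer 64 bits wide, which is what lets the results feed the 64-bit adder directly.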
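[0036]
Next, vector addition according to the present invention, executed with the adder 14 configured as described above, is explained using the addition of a 160-bit augend and a 160-bit addend as an example.
When the adder 14 performs 64-bit by 64-bit addition, then as shown in FIG. 13, an augend register group consisting of three 64-bit registers (vr00, vr01, vr02) and an addend register group consisting of three 64-bit registers (vr03, vr04, vr05) are prepared. The 160-bit augend is stored in the augend registers and the 160-bit addend in the addend registers, in the format shown in FIG. 4, which is illustrated schematically in FIG. 5.
[0037]
The vector instruction sequence shown in FIG. 6 is then issued to add the 160-bit augend and the 160-bit addend. Here,
"VAC vr1, vr2, mr1, vr3, mr2"
is a vector add instruction that stores the sum of vector register vr1, vector register vr2, and mask register mr1 in vector register vr3, and stores the carry-out data produced by that addition in mask register mr2.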
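[0038]
Following the vector instruction sequence shown in FIG. 6, the vector add instruction VAC of (1) first adds the augend part in vector register vr02, the addend part in vector register vr05, and the data in mask register mr00, which holds zero as its initial value, stores the sum in vector register vr08, and stores the resulting carry-out data in mask register mr01.
[0039]
Next, the vector add instruction VAC of (2) adds the augend part in vector register vr01, the addend part in vector register vr04, and the carry-out data held in mask register mr01, stores the sum in vector register vr07, and stores the resulting carry-out data in mask register mr02.
[0040]
Finally, the vector add instruction VAC of (3) adds the augend part in vector register vr00, the addend part in vector register vr03, and the carry-out data held in mask register mr02, stores the sum in vector register vr06, and stores the resulting carry-out data in mask register mr00.
[0041]
In this way, vector addition according to the present invention, executed with the adder 14 configured as described above, computes the sum of a 160-bit augend and a 160-bit addend by issuing three vector add instructions while using mask registers mr00, mr01, and mr02, as shown in FIG. 7. Under the prior art, by contrast, eleven vector add and vector shift instructions must be issued, as explained with FIG. 13.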
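The three-instruction sequence of paragraphs [0038] to [0040] can be traced with the following Python sketch. Each integer stands for one vector element, the helper names are illustrative, and the 32/64/64-bit layout of the 160-bit operands is an assumption (FIG. 4 and FIG. 5 are not reproduced here).

```python
MASK64 = (1 << 64) - 1

def vac(x, y, carry_in):
    """One element of VAC: returns (64-bit sum, 1-bit carry-out)."""
    total = x + y + carry_in
    return total & MASK64, total >> 64

def split_160(n):
    """Assumed layout: top 32 bits, then two full 64-bit parts."""
    return n >> 128, (n >> 64) & MASK64, n & MASK64

def add160(a, b):
    vr00, vr01, vr02 = split_160(a)       # augend
    vr03, vr04, vr05 = split_160(b)       # addend
    mr00 = 0                              # zero initial carry-in
    vr08, mr01 = vac(vr02, vr05, mr00)    # (1) low parts
    vr07, mr02 = vac(vr01, vr04, mr01)    # (2) middle parts plus carry
    vr06, mr00 = vac(vr00, vr03, mr02)    # (3) high parts plus carry
    return vr06, vr07, vr08
```

Recombining as `(vr06 << 128) | (vr07 << 64) | vr08` gives the sum under the assumed layout, with no shift instructions and with all 64 bits of the lower registers carrying data.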
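[0042]
Next, vector multiplication according to the present invention, executed with the multiplier 15 configured as described above, is explained.
As shown in FIG. 8, quadruple-precision data has a 112-bit mantissa. Quadruple-precision multiplication therefore requires the operand multiplication shown in FIG. 9 in order to obtain the mantissa of the product.
[0043]
Accordingly, vector multiplication according to the present invention, executed with the multiplier 15 configured as described above, is explained using the multiplication of a 128-bit multiplicand by a 128-bit multiplier as an example.
[0044]
When the multiplier 15 performs 64-bit by 64-bit multiplication, then as shown in FIG. 10, a 128-bit multiplicand register (its upper 64-bit part is labeled "01" and its lower 64-bit part "02") and a 128-bit multiplier register (its upper 64-bit part is labeled "03" and its lower 64-bit part "04") are prepared. The 128-bit multiplicand (upper 64 bits A1, lower 64 bits A2) is stored in the multiplicand register, and the 128-bit multiplier (upper 64 bits B1, lower 64 bits B2) is stored in the multiplier register.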
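[0045]
The vector instruction sequence shown in FIG. 11 is then issued to multiply the 128-bit multiplicand by the 128-bit multiplier, following the multiplication process illustrated schematically in FIG. 12. Here,
"VML vr1, vr2, vr3"
is a vector multiply instruction that stores the lower 64 bits of the product of vector register vr1 and vector register vr2 in vector register vr3,
"VMU vr1, vr2, vr3"
is a vector multiply instruction that stores the upper 64 bits of the product of vector register vr1 and vector register vr2 in vector register vr3, and
"VAC vr1, vr2, mr1, vr3, mr2"
is a vector add instruction that stores the sum of vector register vr1, vector register vr2, and mask register mr1 in vector register vr3, and stores the carry-out data produced by that addition in mask register mr2.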
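[0046]
Following the vector instruction sequence shown in FIG. 11, the vector multiply instruction VML of (1) first multiplies the multiplicand part A2 in vector register vr02 by the multiplier part B2 in vector register vr04 and, by controlling the selector of the multiplier 15, stores the lower 64 bits of the product, A2B2L, in vector register vr23.
[0047]
Next, the vector multiply instruction VMU of (2) multiplies the multiplicand part A2 in vector register vr02 by the multiplier part B2 in vector register vr04 and, by controlling the selector of the multiplier 15, stores the upper 64 bits of the product, A2B2U, in vector register vr05.
[0048]
Next, the vector multiply instruction VML of (3) multiplies the multiplicand part A1 in vector register vr01 by the multiplier part B2 in vector register vr04 and, by controlling the selector of the multiplier 15, stores the lower 64 bits of the product, A1B2L, in vector register vr06.
[0049]
Next, the vector add instruction VAC of (4) adds A2B2U held in vector register vr05, A1B2L held in vector register vr06, and the data in mask register mr00, which holds zero as its initial value, stores the sum in vector register vr07, and stores the resulting carry-out data in mask register mr01.
[0050]
Next, the vector multiply instruction VML of (5) multiplies the multiplicand part A2 in vector register vr02 by the multiplier part B1 in vector register vr03 and, by controlling the selector of the multiplier 15, stores the lower 64 bits of the product, A2B1L, in vector register vr08.
[0051]
Next, the vector add instruction VAC of (6) adds the data in vector register vr07, A2B1L held in vector register vr08, and the data in mask register mr00, which holds zero as its initial value, stores the sum in vector register vr22, and stores the resulting carry-out data in mask register mr02.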
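[0052]
Next, the vector multiply instruction VMU of (7) multiplies the multiplicand part A1 in vector register vr01 by the multiplier part B2 in vector register vr04 and, by controlling the selector of the multiplier 15, stores the upper 64 bits of the product, A1B2U, in vector register vr10.
[0053]
Next, the vector multiply instruction VMU of (8) multiplies the multiplicand part A2 in vector register vr02 by the multiplier part B1 in vector register vr03 and, by controlling the selector of the multiplier 15, stores the upper 64 bits of the product, A2B1U, in vector register vr11.
[0054]
Next, the vector add instruction VAC of (9) adds A1B2U held in vector register vr10, A2B1U held in vector register vr11, and the carry-out data held in mask register mr01, stores the sum in vector register vr12, and stores the resulting carry-out data in mask register mr00.
[0055]
Next, the vector multiply instruction VML of (10) multiplies the multiplicand part A1 in vector register vr01 by the multiplier part B1 in vector register vr03 and, by controlling the selector of the multiplier 15, stores the lower 64 bits of the product, A1B1L, in vector register vr13.
[0056]
Next, the vector add instruction VAC of (11) adds the data in vector register vr12, A1B1L held in vector register vr13, and the carry-out data held in mask register mr02, stores the sum in vector register vr21, and stores the resulting carry-out data in mask register mr03.
[0057]
Next, the vector multiply instruction VMU of (12) multiplies the multiplicand part A1 in vector register vr01 by the multiplier part B1 in vector register vr03 and, by controlling the selector of the multiplier 15, stores the upper 64 bits of the product, A1B1U, in vector register vr15.
[0058]
Finally, the vector add instruction VAC of (13) adds the data in vector register vr00, which holds zero as its initial value, A1B1U held in vector register vr15, and the carry-out data held in mask register mr03, stores the sum in vector register vr20, and stores the resulting carry-out data in mask register mr00.
[0059]
In this way, vector multiplication according to the present invention, executed with the multiplier 15 configured as described above, extracts either the upper 64 bits or the lower 64 bits of each 128-bit product of a 64-bit by 64-bit multiplication and, using vector addition according to the present invention executed with the adder 14 configured as described above, computes the product of the 128-bit multiplicand and the 128-bit multiplier.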
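The thirteen-instruction sequence of paragraphs [0046] to [0058] can be traced with the Python sketch below, with one integer per vector element and illustrative helper names. Two points are assumptions of this sketch rather than statements from the text: that the result is read from vr20 (most significant) through vr23 (least significant), and that the operands are mantissa-sized (below 2^113). The second matters because step (9) sends its carry-out to mr00, where it is not used afterward; for operands below 2^113 that carry is always zero.

```python
MASK64 = (1 << 64) - 1

def vml(x, y):
    return (x * y) & MASK64          # lower 64 bits of the product

def vmu(x, y):
    return (x * y) >> 64             # upper 64 bits of the product

def vac(x, y, carry_in):
    total = x + y + carry_in
    return total & MASK64, total >> 64

def mul128(a, b):
    vr00, mr00 = 0, 0                     # zero-valued inputs
    vr01, vr02 = a >> 64, a & MASK64      # A1, A2
    vr03, vr04 = b >> 64, b & MASK64      # B1, B2
    vr23 = vml(vr02, vr04)                # (1)  A2B2L
    vr05 = vmu(vr02, vr04)                # (2)  A2B2U
    vr06 = vml(vr01, vr04)                # (3)  A1B2L
    vr07, mr01 = vac(vr05, vr06, mr00)    # (4)
    vr08 = vml(vr02, vr03)                # (5)  A2B1L
    vr22, mr02 = vac(vr07, vr08, mr00)    # (6)
    vr10 = vmu(vr01, vr04)                # (7)  A1B2U
    vr11 = vmu(vr02, vr03)                # (8)  A2B1U
    vr12, _ = vac(vr10, vr11, mr01)       # (9)  carry-out to mr00, unused
    vr13 = vml(vr01, vr03)                # (10) A1B1L
    vr21, mr03 = vac(vr12, vr13, mr02)    # (11)
    vr15 = vmu(vr01, vr03)                # (12) A1B1U
    vr20, _ = vac(vr00, vr15, mr03)       # (13) carry-out to mr00, unused
    return vr20, vr21, vr22, vr23
```

Concatenating the four returned values, most significant first, reproduces the full product for operands in the assumed range.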
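[0060]
In this configuration, mask register mr00 and vector register vr00 need not actually hold zero; the device may instead treat the specification of those register numbers as a request for a zero-valued input. Also, the carry-out data written to mr00 is not actually used afterward. The device may therefore skip the actual write when such a register number is specified, so that the original data is not destroyed. Finally, the vector add instruction VAC must specify five registers, but if the same mask register is used for input and output, four suffice.
[0061]
[Effects of the Invention]
As explained above, the vector data addition method of the present invention keeps the carry-out data in a mask register. This removes the need to interleave vector shift instructions with vector add instructions as in the prior art, so vector addition can be executed at high speed with a small number of instructions.
[0062]
Likewise, in the vector data multiplication method of the present invention, the vector add instructions needed to carry out a vector multiplication keep their carry-out data in a mask register. This removes the need to interleave vector shift instructions with vector add instructions as in the prior art, so vector addition, and therefore vector multiplication, can be executed at high speed with a small number of instructions.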
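[Brief Description of the Drawings]
[FIG. 1] Configuration diagram of a vector processing device that carries out the present invention.
[FIG. 2] Explanatory diagram of the adder used in the present invention.
[FIG. 3] Explanatory diagram of the multiplier used in the present invention.
[FIG. 4] Explanatory diagram of how the augend and addend are stored.
[FIG. 5] Explanatory diagram of how the augend and addend are stored.
[FIG. 6] Explanatory diagram of the vector add instructions issued in the present invention.
[FIG. 7] Explanatory diagram of the addition process of the present invention.
[FIG. 8] Explanatory diagram of the quadruple-precision data format.
[FIG. 9] Explanatory diagram of the operands of quadruple-precision multiplication.
[FIG. 10] Explanatory diagram of how the multiplicand and multiplier are stored.
[FIG. 11] Explanatory diagram of the vector multiply instructions issued in the present invention.
[FIG. 12] Explanatory diagram of the multiplication process of the present invention.
[FIG. 13] Explanatory diagram of the prior art.
[FIG. 14] Explanatory diagram of the prior art.
[FIG. 15] Explanatory diagram of the prior art.
[FIG. 16] Explanatory diagram of the prior art.
[FIG. 17] Explanatory diagram of the prior art.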
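[Explanation of Reference Numerals]
1 Vector processing device
10 CPU
11 Vector instruction control mechanism
12 Vector register
13 Mask register
14 Adder
15 Multiplier

[0001]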
[Industrial Field of Application]
The present invention relates to a vector data addition method that enables vector addition to be executed at high speed, and to a vector data multiplication method that uses that addition method to enable vector multiplication to be executed at high speed.
[0002]
The vector processing device executes vector addition processing and vector multiplication processing. Such vector operation processing needs to be executed at high speed.
[0003]
[Prior art]
In the conventional vector processing apparatus, when performing the vector addition process, the vector addition instruction is executed while executing the vector shift instruction in consideration of the occurrence of carry.
[0004]
This prior art is described in detail below, taking the addition of a 160-bit augend and a 160-bit addend as an example.
When the adder performs 64-bit by 64-bit addition, the conventional approach, as shown in FIG. 13, prepares an augend register group consisting of three 64-bit registers (vr00, vr01, vr02) and an addend register group consisting of three 64-bit registers (vr03, vr04, vr05). The 160-bit augend is stored in the augend registers and the 160-bit addend in the addend registers, for example in the format shown in FIG. 14, which is illustrated schematically in FIG. 15.
[0005]
The vector instruction sequence shown in FIG. 16 is then issued to add the 160-bit augend and the 160-bit addend. Here,
"VA vr1, vr2, vr3"
is a vector add instruction that stores the sum of vector register vr1 and vector register vr2 in vector register vr3,
"VSR vr1, SC, vr3"
is a vector shift instruction that shifts the data in vector register vr1 right by SC bits and stores it in vector register vr3, and
"VSL vr1, SC, vr3"
is a vector shift instruction that shifts the data in vector register vr1 left by SC bits and stores it in vector register vr3.
[0006]
Following the vector instruction sequence shown in FIG. 16, the vector add instruction VA of (1) first adds the augend part in vector register vr02 and the addend part in vector register vr05 and stores the sum in vector register vr10. Because a carry may occur here, the vector shift instruction VSR of (2) then shifts the data in vector register vr10 right by 60 bits to extract the carry value (the carry-out data) and stores it in vector register vr15.
[0007]
Next, the vector add instruction VA of (3) adds the augend part in vector register vr01 and the addend part in vector register vr04 and stores the sum in vector register vr20.
[0008]
Next, to include the carry produced by the lower-order addition, the vector add instruction VA of (4) adds the carry-out data held in vector register vr15 to the data in vector register vr20 and stores the sum in vector register vr20. Because a carry may occur here as well, the vector shift instruction VSR of (5) then shifts the data in vector register vr20 right by 60 bits to extract the carry-out data and stores it in vector register vr25.
[0009]
Next, the vector add instruction VA of (6) adds the augend part in vector register vr00 and the addend part in vector register vr03 and stores the sum in vector register vr30.
[0010]
Next, to include the carry produced by the lower-order addition, the vector add instruction VA of (7) adds the carry-out data held in vector register vr25 to the data in vector register vr30 and stores the sum in vector register vr6.
[0011]
Next, to extract the 60 bits of valid data held in vector register vr10, the vector shift instruction VSL of (8) shifts the data in vector register vr10 left by 4 bits and stores it back in vector register vr10, and the vector shift instruction VSR of (9) shifts that data right by 4 bits and stores it in vector register vr8. Vector register vr8 thus holds the 60 bits of valid data with zeros in its upper 4 bits.
[0012]
Next, to extract the 60 bits of valid data held in vector register vr20, the vector shift instruction VSL of (10) shifts the data in vector register vr20 left by 4 bits and stores it back in vector register vr20, and the vector shift instruction VSR of (11) shifts that data right by 4 bits and stores it in vector register vr7. Vector register vr7 thus holds the 60 bits of valid data with zeros in its upper 4 bits.
[0013]
In this way, when a conventional vector processing device performs vector addition, it allows for possible carries by interleaving vector shift instructions with the vector add instructions.
[0014]
The multiplier in a conventional vector processing device, on the other hand, even when its input specification is 64 bits × 64 bits, implements only a narrower multiplication function such as 64 bits × 16 bits in order to reduce the amount of hardware.
[0015]
To realize the 64-bit by 64-bit multiplication of the input specification, the multiplier operand is divided into four 16-bit parts. The 64-bit × 16-bit multiplication function multiplies each 16-bit multiplier part by the 64-bit multiplicand to obtain partial products, the partial products are added while being shifted by 16 bits, and the 64-bit portion indicated by the sum is output as the multiplication result.
[0016]
For example, when the upper 64 bits of a 64-bit by 64-bit product are required, then in accordance with the instruction, as shown in FIG. 17, the four multiplier parts are selected in order from the low-order side to obtain partial products, and the partial products are added while being shifted left by 16 bits to obtain and output the upper 64 bits of the product. When the lower 64 bits are required, the four multiplier parts are selected in order from the high-order side, and the partial products are added while being shifted right by 16 bits to obtain and output the lower 64 bits of the product.
[0017]
[Problems to be solved by the invention]
However, interleaving vector shift instructions with vector add instructions to allow for carries, as in the prior art, increases the number of instructions, so vector addition cannot be executed at high speed.
[0018]
In addition, the prior-art multiplier performs many multiplications internally and must contain a function for shifting and adding partial products (implemented with a loop configuration).
[0019]
The present invention has been made in view of these circumstances. Its object is to provide a vector data addition method that enables vector addition to be executed at high speed, and a vector data multiplication method that uses that addition method to enable vector multiplication to be executed at high speed.
[0020]
[Means for Solving the Problems]
  FIG. 1 illustrates the principle configuration of the present invention.
In the figure, 1 is a vector processing device that carries out the present invention. It comprises a CPU 10, a vector instruction control mechanism 11, a vector register 12, a mask register 13, an adder 14, and a multiplier 15.
[0021]
The CPU 10 issues a vector instruction. The vector instruction control mechanism 11 controls execution of vector instructions. The vector register 12 stores vector data. The mask register 13 stores mask data used in vector processing. The adder 14 executes a vector addition instruction. The multiplier 15 executes a vector multiplication instruction.
[0022]
To realize the present invention, the adder 14 takes as input, in addition to the addend and augend vector operands, the carry-out data stored in the mask register 13, and outputs the carry-out data of its result to the mask register 13. If the additional mask operands increase the number of registers an instruction must specify, it is preferable to use the same mask register for input and output so as to keep that number down.
[0023]
Likewise, to realize the present invention, the multiplier 15 has a function of computing the 2m-bit product of two m-bit inputs, and has a selector that, in response to the instruction, selects and outputs either the upper m bits or the lower m bits of the product.
[0026]
[Action]
The adder 14 used in the present invention performs addition taking as input, in addition to the augend and addend vector operands, the carry-out data written in the mask register 13, and writes the carry-out data produced by that addition to the mask register 13.
[0027]
Using the adder 14 configured in this way, the vector processing device 1 that carries out the present invention operates as follows.
(i) Let
"VAC vr1, vr2, mr1, vr3, mr2"
denote a vector add instruction that stores in vector register vr3 the sum of the m-bit data in vector register vr1, the m-bit data in vector register vr2, and the carry-out data in mask register mr1, and stores the carry-out data produced by that addition in mask register mr2. The device receives an instruction to add an augend and an addend that is expressed as a sequence of such vector add instructions.
(ii) Using the adder 14 configured as described above, the device executes the vector add instructions of that sequence in order, thereby adding the augend and the addend.
Because the carry-out data is kept in the mask register 13, the present invention removes the need to interleave vector shift instructions with vector add instructions as in the prior art, and vector addition can be executed at high speed with a small number of instructions.
[0028]
The multiplier 15 used in the present invention has a function of computing the 2m-bit product of two m-bit inputs. Whereas a conventional multiplier obtains partial products and adds them while shifting to compute m-bit data as the product of two m-bit inputs, the multiplier 15 computes the 2m-bit product directly, without forming partial products. For example, it multiplies 64-bit data by 64-bit data to compute a 128-bit result.
[0029]
A conventional multiplier performs many multiplications internally and must contain a function for shifting and adding partial products; the multiplier 15 used in the present invention resolves these problems.
[0030]
However, when the data input to the multiplier 15 is m bits wide, the data input to the adder 14 is also m bits wide, so a multiplier that output 2m-bit data would not be consistent with the adder. The multiplier 15 addresses this with a selector that, in response to the instruction, selects and outputs either the upper m bits or the lower m bits of the product.
Using the multiplier 15 and adder 14 configured in this way, the vector processing device 1 that carries out the present invention operates as follows.
(i) Let
"VML vr1, vr2, vr3"
denote a vector multiply instruction that stores in vector register vr3 the lower m bits of the product of the m-bit data in vector register vr1 and the m-bit data in vector register vr2; let
"VMU vr1, vr2, vr3"
denote a vector multiply instruction that stores in vector register vr3 the upper m bits of that product; and let
"VAC vr1, vr2, mr1, vr3, mr2"
denote a vector add instruction that stores in vector register vr3 the sum of the m-bit data in vector register vr1, the m-bit data in vector register vr2, and the carry-out data in mask register mr1, and stores the carry-out data produced by that addition in mask register mr2. The device receives an instruction to multiply a multiplicand by a multiplier that is expressed as the following instruction sequence: VML, VMU, VML, VAC, VML, VAC, VMU, VMU, VAC, VML, VAC, VMU, and finally VAC.
(ii) The device executes the instructions of this sequence in order. It executes each VML and VMU instruction with the multiplier 15 configured as described above, and each VAC instruction with the adder 14 configured as described above.
Because the vector add instructions needed to carry out a vector multiplication keep their carry-out data in the mask register 13, the present invention removes the need to interleave vector shift instructions with vector add instructions as in the prior art. Vector addition can therefore be executed at high speed with a small number of instructions, and so vector multiplication can also be executed at high speed with a small number of instructions.
[0031]
[Embodiments]
The present invention is described in detail below with reference to embodiments.
As explained with FIG. 1, the adder 14 used in the present invention performs addition taking as input, in addition to the augend and addend vector operands, the carry-out data written in the mask register 13, and writes the carry-out data produced by that addition to the mask register 13.
[0032]
That is, as shown in FIG. 2, the adder takes as input the augend read from the vector register 12, the addend read from the vector register 12, and the carry-out data read from the mask register 13, performs the addition, writes the sum to the vector register 12, and writes the carry-out data produced by the addition to the mask register 13.
[0033]
Because the carry-out data read from the mask register 13 is 1-bit data, this addition can be realized with a simple hardware configuration.
[0034]
As explained with FIG. 1, the multiplier 15 used in the present invention has a function of computing the 2m-bit product of two m-bit inputs, and has a selector that, in response to the instruction, selects and outputs either the upper m bits or the lower m bits of the product.
[0035]
That is, as shown in FIG. 3, the multiplier takes as input the m-bit multiplicand read from the vector register 12 and the m-bit multiplier read from the vector register 12, performs the multiplication, latches the 2m-bit product, and has a selector that, in response to the instruction, selects and outputs either the upper m bits or the lower m bits of the product.
[0036]
Next, vector addition according to the present invention, executed with the adder 14 configured as described above, is explained using the addition of a 160-bit augend and a 160-bit addend as an example.
When the adder 14 performs 64-bit by 64-bit addition, then as shown in FIG. 13, an augend register group consisting of three 64-bit registers (vr00, vr01, vr02) and an addend register group consisting of three 64-bit registers (vr03, vr04, vr05) are prepared. The 160-bit augend is stored in the augend registers and the 160-bit addend in the addend registers, in the format shown in FIG. 4, which is illustrated schematically in FIG. 5.
[0037]
The vector instruction sequence shown in FIG. 6 is then issued to add the 160-bit augend and the 160-bit addend. Here,
"VAC vr1, vr2, mr1, vr3, mr2"
is a vector add instruction that stores the sum of vector register vr1, vector register vr2, and mask register mr1 in vector register vr3, and stores the carry-out data produced by that addition in mask register mr2.
[0038]
That is, according to the vector instruction sequence shown in FIG. 6, first, in accordance with the vector addition instruction VAC of (1), the addend part of the vector register vr02, the addend part of the vector register vr05, and a zero value as an initial value are set. The stored data of the mask register mr00 to be stored is added and stored in the vector register vr08, and the carry-out data of the carry value generated at this time is stored in the mask register mr01.
[0039]
Subsequently, the addend part of the vector register vr01, the addend part of the vector register vr04, and the carry-out data stored in the mask register mr01 are added according to the vector addition instruction VAC of (2) to obtain the vector register vr07. The carry-out data of the carry value generated at this time is stored in the mask register mr02.
[0040]
Finally, according to the vector addition instruction VAC of (3), the addend part of the vector register vr00, the addend part of the vector register vr03, and the carry-out data stored in the mask register mr02 are added to the vector register vr06. The carry-out data of the carry value generated at this time is stored in the mask register mr00.
[0041]
In this way, the vector addition process according to the present invention, executed using the adder 14 configured as described above, calculates the sum of the 160-bit augend and the 160-bit addend by issuing only three vector addition instructions while using the mask registers mr00, mr01, and mr02, as shown in FIG. 7. By contrast, the prior art described earlier requires 11 vector addition and vector shift instructions to be issued.
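The three-instruction sequence of (1) to (3) can be sketched in Python to check the carry chain. This is a minimal model under stated assumptions, not the hardware of the patent: the helper names `vac`, `split160`, and `add160` are hypothetical, vector registers are modeled as lists of 64-bit integers and mask registers as lists of carry bits, and the storage layout (vr02 and vr05 hold the least significant 64 bits, vr00 and vr03 the top 32 bits) is assumed because the format of FIG. 4 is not reproduced in the text.

```python
MASK64 = (1 << 64) - 1

def vac(vr1, vr2, mr1):
    """Model of "VAC vr1, vr2, mr1, vr3, mr2"; returns (vr3, mr2).

    Each vector register is a list of 64-bit words, one per element;
    each mask register is a list of carry bits, one per element.
    """
    vr3, mr2 = [], []
    for a, b, c in zip(vr1, vr2, mr1):
        s = a + b + c
        vr3.append(s & MASK64)  # low 64 bits of the sum
        mr2.append(s >> 64)     # carry-out bit (0 or 1)
    return vr3, mr2

def split160(x):
    """Assumed layout: (top 32 bits, middle 64 bits, low 64 bits)."""
    return (x >> 128) & 0xFFFFFFFF, (x >> 64) & MASK64, x & MASK64

def add160(augends, addends):
    """Element-wise 160-bit addition using three VAC steps."""
    vr00, vr01, vr02 = map(list, zip(*(split160(x) for x in augends)))
    vr03, vr04, vr05 = map(list, zip(*(split160(x) for x in addends)))
    mr00 = [0] * len(augends)           # zero as the initial carry-in
    vr08, mr01 = vac(vr02, vr05, mr00)  # (1) low words
    vr07, mr02 = vac(vr01, vr04, mr01)  # (2) middle words plus carry
    vr06, mr00 = vac(vr00, vr03, mr02)  # (3) top words plus carry
    # Reassemble each element from its three result words.
    return [(h << 128) | (m << 64) | l for h, m, l in zip(vr06, vr07, vr08)]
```

Because the carry travels through the mask registers, no shift instruction is needed between the additions, which is the point the paragraph above makes about the instruction count.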
[0042]
Next, the vector multiplication process according to the present invention, performed using the multiplier 15 configured as described above, will be described.
  As shown in FIG. 8, the quadruple precision data has a 112-bit mantissa. Thus, in quadruple precision multiplication processing, it is necessary to execute operand multiplication processing shown in FIG. 9 in order to obtain the mantissa of the multiplication result.
[0043]
Accordingly, the vector multiplication process according to the present invention, performed using the multiplier 15 configured as described above, will be described taking as an example the multiplication of a 128-bit multiplicand by a 128-bit multiplier.
[0044]
When the multiplier 15 performs 64-bit multiplication, as shown in FIG. 10, a 128-bit multiplicand register (whose upper 64-bit portion is denoted "01" and whose lower 64-bit portion is denoted "02") and a 128-bit multiplier register (whose upper 64-bit portion is denoted "03" and whose lower 64-bit portion is denoted "04") are prepared. The 128-bit multiplicand (upper 64-bit portion A1, lower 64-bit portion A2) is stored in the multiplicand register, and the 128-bit multiplier (upper 64-bit portion B1, lower 64-bit portion B2) is stored in the multiplier register.
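The partial products that the instruction sequence has to combine follow from writing both operands in terms of their 64-bit halves. The following is a worked identity from ordinary arithmetic rather than a quotation from the source; the subscripts U and L mirror the names A2B2U, A2B2L, and so on used in the text.

```latex
\begin{aligned}
A &= A_1 \cdot 2^{64} + A_2, \qquad B = B_1 \cdot 2^{64} + B_2,\\
A \cdot B &= A_1 B_1 \cdot 2^{128} + (A_1 B_2 + A_2 B_1) \cdot 2^{64} + A_2 B_2,\\
A_i B_j &= (A_i B_j)_U \cdot 2^{64} + (A_i B_j)_L .
\end{aligned}
```

Collecting terms by weight, the lowest 64-bit word of the product is A2B2L, the next word sums A2B2U, A1B2L, and A2B1L, the third word sums A1B2U, A2B1U, and A1B1L plus carries, and the top word is A1B1U plus a carry.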
[0045]
Then, by issuing the vector instruction sequence shown in FIG. 11, the multiplication of the 128-bit multiplicand by the 128-bit multiplier is executed following the multiplication process shown schematically in FIG. 12. Here,
“VML vr1, vr2, vr3”
is a vector multiplication instruction that stores the lower 64 bits of the product of the vector register vr1 and the vector register vr2 in the vector register vr3,
“VMU vr1, vr2, vr3”
is a vector multiplication instruction that stores the upper 64 bits of the product of the vector register vr1 and the vector register vr2 in the vector register vr3, and
"VAC vr1, vr2, mr1, vr3, mr2"
is a vector addition instruction that stores the sum of the vector register vr1, the vector register vr2, and the mask register mr1 in the vector register vr3, and stores the carry-out data generated at that time in the mask register mr2.
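The two multiplication instructions can be illustrated with a short Python model. It is only a sketch: the function names `vml` and `vmu` and the list-of-integers representation of a vector register are assumptions made for illustration, and the selector of the multiplier 15 is modeled simply as taking the low or high half of the 128-bit product.

```python
MASK64 = (1 << 64) - 1

def vml(vr1, vr2):
    """Model of "VML vr1, vr2, vr3": lower 64 bits of each 128-bit product."""
    return [(a * b) & MASK64 for a, b in zip(vr1, vr2)]

def vmu(vr1, vr2):
    """Model of "VMU vr1, vr2, vr3": upper 64 bits of each 128-bit product."""
    return [(a * b) >> 64 for a, b in zip(vr1, vr2)]

# Usage: the two halves together reconstruct the full product per element.
x = [MASK64, 3]
y = [MASK64, 5]
full = [(hi << 64) | lo for hi, lo in zip(vmu(x, y), vml(x, y))]
```

Issuing both instructions on the same operands therefore yields the complete 128-bit product in two 64-bit registers, which is how the pairs such as A2B2L and A2B2U arise in the sequence.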
[0046]
That is, according to the vector instruction sequence shown in FIG. 11, first, in accordance with the vector multiplication instruction VML of (1), the multiplicand portion A2 in the vector register vr02 and the multiplier portion B2 in the vector register vr04 are multiplied, and the lower 64 bits A2B2L of the product, output by controlling the selector of the multiplier 15, are stored in the vector register vr23.
[0047]
Subsequently, in accordance with the vector multiplication instruction VMU of (2), the multiplicand portion A2 in the vector register vr02 and the multiplier portion B2 in the vector register vr04 are multiplied, and the upper 64 bits A2B2U of the product, output by controlling the selector of the multiplier 15, are stored in the vector register vr05.
[0048]
Subsequently, in accordance with the vector multiplication instruction VML of (3), the multiplicand portion A1 in the vector register vr01 and the multiplier portion B2 in the vector register vr04 are multiplied, and the lower 64 bits A1B2L of the product, output by controlling the selector of the multiplier 15, are stored in the vector register vr06.
[0049]
Subsequently, in accordance with the vector addition instruction VAC of (4), A2B2U stored in the vector register vr05, A1B2L stored in the vector register vr06, and the data in the mask register mr00, which holds a zero value as its initial value, are added, and the sum is stored in the vector register vr07 that instruction (6) reads. The carry-out data generated at this time is stored in the mask register mr01.
[0050]
Subsequently, in accordance with the vector multiplication instruction VML of (5), the multiplicand portion A2 in the vector register vr02 and the multiplier portion B1 in the vector register vr03 are multiplied, and the lower 64 bits A2B1L of the product, output by controlling the selector of the multiplier 15, are stored in the vector register vr08.
[0051]
Subsequently, in accordance with the vector addition instruction VAC of (6), the data stored in the vector register vr07, A2B1L stored in the vector register vr08, and the data in the mask register mr00, which holds a zero value as its initial value, are added, and the carry-out data generated at this time is stored in the mask register mr02.
[0052]
Subsequently, in accordance with the vector multiplication instruction VMU of (7), the multiplicand portion A1 in the vector register vr01 and the multiplier portion B2 in the vector register vr04 are multiplied, and the upper 64 bits A1B2U of the product, output by controlling the selector of the multiplier 15, are stored in the vector register vr10.
[0053]
Subsequently, in accordance with the vector multiplication instruction VMU of (8), the multiplicand portion A2 in the vector register vr02 and the multiplier portion B1 in the vector register vr03 are multiplied, and the upper 64 bits A2B1U of the product, output by controlling the selector of the multiplier 15, are stored in the vector register vr11.
[0054]
Subsequently, in accordance with the vector addition instruction VAC of (9), A1B2U stored in the vector register vr10, A2B1U stored in the vector register vr11, and the carry-out data stored in the mask register mr01 are added, and the sum is stored in the vector register vr12 that instruction (11) reads. The carry-out data generated at this time is stored in the mask register mr00.
[0055]
Subsequently, in accordance with the vector multiplication instruction VML of (10), the multiplicand portion A1 in the vector register vr01 and the multiplier portion B1 in the vector register vr03 are multiplied, and the lower 64 bits A1B1L of the product, output by controlling the selector of the multiplier 15, are stored in the vector register vr13.
[0056]
Subsequently, in accordance with the vector addition instruction VAC of (11), the data stored in the vector register vr12, A1B1L stored in the vector register vr13, and the carry-out data stored in the mask register mr02 are added and the sum is stored in the vector register vr21. The carry-out data generated at this time is stored in the mask register mr03.
[0057]
Subsequently, in accordance with the vector multiplication instruction VMU of (12), the multiplicand portion A1 in the vector register vr01 and the multiplier portion B1 in the vector register vr03 are multiplied, and the upper 64 bits A1B1U of the product, output by controlling the selector of the multiplier 15, are stored in the vector register vr15.
[0058]
Finally, in accordance with the vector addition instruction VAC of (13), the data in the vector register vr00, which holds a zero value as its initial value, A1B1U stored in the vector register vr15, and the carry-out data stored in the mask register mr03 are added, and the carry-out data generated at this time is stored in the mask register mr00.
[0059]
In this way, the vector multiplication process according to the present invention, executed using the multiplier 15 configured as described above, extracts either the upper 64 bits or the lower 64 bits of the 128-bit product obtained by each 64-bit multiplication and, while using the vector addition process according to the present invention executed using the adder 14 configured as described above, calculates the product of the 128-bit multiplicand and the 128-bit multiplier.
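The thirteen instructions (1) to (13) can be traced with a small Python model. This is a sketch under assumptions, not the patented hardware: the helpers `vml`, `vmu`, `vac`, `mul128`, and `mul128_int` are hypothetical names, registers are modeled as Python lists, and the destination names `vr22` and `vr20` for instructions (6) and (13) are placeholders because the text does not name them. As in the text, the carry-out of instruction (9) goes to mr00 and is not read again, so the check below is limited to operands of at most 113 bits, for which that addition cannot overflow.

```python
MASK64 = (1 << 64) - 1

def vml(x, y):
    """Lower 64 bits of each element-wise product."""
    return [(a * b) & MASK64 for a, b in zip(x, y)]

def vmu(x, y):
    """Upper 64 bits of each element-wise product."""
    return [(a * b) >> 64 for a, b in zip(x, y)]

def vac(x, y, m):
    """Element-wise sum with carry-in; returns (sum words, carry-out bits)."""
    sums = [a + b + c for a, b, c in zip(x, y, m)]
    return [s & MASK64 for s in sums], [s >> 64 for s in sums]

def mul128(vr01, vr02, vr03, vr04):
    """vr01/vr02 hold A1/A2 and vr03/vr04 hold B1/B2, one element per entry."""
    n = len(vr01)
    vr00 = [0] * n                      # zero-valued vector register
    mr00 = [0] * n                      # zero-valued mask register
    vr23 = vml(vr02, vr04)              # (1)  A2B2L, lowest result word
    vr05 = vmu(vr02, vr04)              # (2)  A2B2U
    vr06 = vml(vr01, vr04)              # (3)  A1B2L
    vr07, mr01 = vac(vr05, vr06, mr00)  # (4)
    vr08 = vml(vr02, vr03)              # (5)  A2B1L
    vr22, mr02 = vac(vr07, vr08, mr00)  # (6)  vr22 is an assumed name
    vr10 = vmu(vr01, vr04)              # (7)  A1B2U
    vr11 = vmu(vr02, vr03)              # (8)  A2B1U
    vr12, _ = vac(vr10, vr11, mr01)     # (9)  carry-out goes to mr00, unused
    vr13 = vml(vr01, vr03)              # (10) A1B1L
    vr21, mr03 = vac(vr12, vr13, mr02)  # (11)
    vr15 = vmu(vr01, vr03)              # (12) A1B1U
    vr20, _ = vac(vr00, vr15, mr03)     # (13) vr20 is an assumed name
    return vr20, vr21, vr22, vr23

def mul128_int(a, b):
    """Multiply two integers of at most 113 bits through the model above."""
    w3, w2, w1, w0 = mul128([a >> 64], [a & MASK64], [b >> 64], [b & MASK64])
    return (w3[0] << 192) | (w2[0] << 128) | (w1[0] << 64) | w0[0]
```

For example, `mul128_int(3, 5)` walks the whole sequence with A1 = B1 = 0 and reassembles the four result words into a single integer.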
[0060]
In this configuration, a zero value need not actually be stored in the mask register mr00 and the vector register vr00; a configuration may instead be adopted in which designating such a register number is treated as designating a zero value. Also, the carry-out data written to mr00 is not actually used afterwards. Accordingly, a configuration may be adopted in which, when such a register number is designated as a destination, the actual write is not performed, so that the original data is not destroyed. In addition, the vector addition instruction VAC requires five registers to be specified, but if the same mask register is used for both input and output, only four registers need to be specified.
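The register-number convention described above could be modeled as follows. This is a hypothetical illustration only: the class `RegisterFile` and its methods are invented for this sketch, and the source does not specify how such a convention would be implemented.

```python
class RegisterFile:
    """Hypothetical register file: number 0 reads as zero and ignores writes."""

    def __init__(self, count, length):
        self._length = length
        self._regs = [[0] * length for _ in range(count)]

    def read(self, number):
        if number == 0:
            return [0] * self._length  # designating 0 means a zero value
        return list(self._regs[number])

    def write(self, number, values):
        if number == 0:
            return                     # the write is skipped, nothing is destroyed
        self._regs[number] = list(values)

# Usage: a carry-out directed at register 0 is discarded.
masks = RegisterFile(count=4, length=2)
masks.write(0, [1, 1])
masks.write(1, [1, 0])
```

With such a convention the instruction sequences above would not need an explicit step to clear mr00 or vr00 before use.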
[0061]
[Effects of the invention]
As explained above, according to the vector data addition method of the present invention, carry-out data is stored in a mask register, so it is no longer necessary to execute vector addition instructions interleaved with vector shift instructions as in the prior art, and vector addition can be executed at high speed with a small number of instructions.
[0062]
In addition, according to the vector data multiplication method of the present invention, the carry-out data is stored in a mask register when executing the vector addition instructions that are required as part of vector multiplication. As described above, this removes the need to execute vector addition instructions interleaved with vector shift instructions, so the vector addition, and hence the vector multiplication, can be executed at high speed with a small number of instructions.
[Brief description of the drawings]
FIG. 1 is a configuration diagram of the vector processing device of the present invention.
FIG. 2 is an explanatory diagram of the adder used in the present invention.
FIG. 3 is an explanatory diagram of the multiplier used in the present invention.
FIG. 4 is an explanatory diagram of processing for storing an augend and an addend.
FIG. 5 is an explanatory diagram of processing for storing an augend and an addend.
FIG. 6 is an explanatory diagram of a vector addition instruction issued in the present invention.
FIG. 7 is an explanatory diagram of addition processing according to the present invention.
FIG. 8 is an explanatory diagram of a data format of quadruple precision data.
FIG. 9 is an explanatory diagram of operands of quadruple precision multiplication processing.
FIG. 10 is an explanatory diagram of multiplicand and multiplier storage processing.
FIG. 11 is an explanatory diagram of a vector multiply instruction issued in the present invention.
FIG. 12 is an explanatory diagram of multiplication processing according to the present invention.
FIG. 13 is an explanatory diagram of a conventional technique.
FIG. 14 is an explanatory diagram of the prior art.
FIG. 15 is an explanatory diagram of a prior art.
FIG. 16 is an explanatory diagram of a conventional technique.
FIG. 17 is an explanatory diagram of the prior art.
[Explanation of symbols]
1 Vector processing device
10 CPU
11 Vector instruction control mechanism
12 Vector register
13 Mask register
14 Adder
15 Multiplier

Claims

A vector data addition method for adding an augend and an addend using an adder that takes as input m-bit data stored in a vector register, m-bit data stored in another vector register, and carry-out data stored in a mask register, calculates the m-bit sum and stores it in a vector register for the addition result, and stores the carry-out data produced by that addition in another mask register, the method comprising:
(i) where a vector addition instruction that stores in the vector register vr3 the sum of the m-bit data stored in the vector register vr1, the m-bit data stored in the vector register vr2, and the carry-out data stored in the mask register mr1, and stores the carry-out data generated at that time in the mask register mr2, is denoted
"VAC vr1, vr2, mr1, vr3, mr2",
receiving an instruction to add the augend and the addend, expressed as an instruction sequence of this vector addition instruction; and
(ii) executing the vector addition instructions of the instruction sequence, in the order of the instruction sequence, using the adder, thereby adding the augend and the addend.

A vector data multiplication method for multiplying a multiplicand by a multiplier using a multiplier unit that takes as input m-bit data stored in a vector register and m-bit data stored in another vector register, calculates the 2m-bit product, selects either the upper m bits or the lower m bits of the product, and stores the selected data in a vector register for the multiplication result, and an adder that takes as input m-bit data stored in a vector register, m-bit data stored in another vector register, and carry-out data stored in a mask register, calculates the m-bit sum and stores it in a vector register for the addition result, and stores the carry-out data produced by that addition in another mask register, the method comprising:
(i) where a vector multiplication instruction that stores in the vector register vr3 the lower m bits of the product of the m-bit data stored in the vector register vr1 and the m-bit data stored in the vector register vr2 is denoted
“VML vr1, vr2, vr3”,
a vector multiplication instruction that stores in the vector register vr3 the upper m bits of the product of the m-bit data stored in the vector register vr1 and the m-bit data stored in the vector register vr2 is denoted
“VMU vr1, vr2, vr3”,
and a vector addition instruction that stores in the vector register vr3 the sum of the m-bit data stored in the vector register vr1, the m-bit data stored in the vector register vr2, and the carry-out data stored in the mask register mr1, and stores the carry-out data generated at that time in the mask register mr2, is denoted
"VAC vr1, vr2, mr1, vr3, mr2",
receiving an instruction to multiply the multiplicand by the multiplier, expressed as an instruction sequence consisting of, in order, a VML instruction, a VMU instruction, a VML instruction, a VAC instruction, a VML instruction, a VAC instruction, a VMU instruction, a VMU instruction, a VAC instruction, a VML instruction, a VAC instruction, a VMU instruction, and finally a VAC instruction; and
(ii) executing the instructions of the instruction sequence in the order of the instruction sequence, using the multiplier unit to execute each VML instruction and each VMU instruction and using the adder to execute each VAC instruction, thereby multiplying the multiplicand by the multiplier.