JP2018156119A

JP2018156119A - SIMD type parallel arithmetic apparatus, SIMD type parallel arithmetic semiconductor chip, SIMD type parallel arithmetic method, and apparatus including SIMD type parallel arithmetic apparatus and semiconductor chip

Info

Publication number: JP2018156119A
Application number: JP2015139577A
Authority: JP
Inventors: Katsumi Inoue (井上 克己)
Original assignee: Individual
Current assignee: Individual
Priority date: 2015-07-13
Filing date: 2015-07-13
Publication date: 2018-10-04
Also published as: WO2017010524A1

Abstract

PROBLEM TO BE SOLVED: To provide a SIMD (single instruction/multiple data) type parallel arithmetic algorithm that maximizes the efficiency of SIMD type parallel operation and raises integration density; that raises calculation speed by allowing the arithmetic cores to be driven at 100% while significantly reducing power consumption; that realizes an arbitrary degree of parallelism and computation time; and that can easily be implemented in a semiconductor ASIC or FPGA.
SOLUTION: An arithmetic apparatus 203, configured from a memory cell group 202 consisting of a large number of memory cells 104 and N arithmetic units 109, collectively accesses, through an address line 102 for accessing the memory cells 104 of the memory cell group 202, N sets of data in which each datum consists of two or more memory cells 104, reads them, inputs them in parallel to the operation inputs of the N arithmetic units 109, and collectively writes the N sets of operation result data from the operation outputs of the N arithmetic units into the N sets of memory cells 104.
SELECTED DRAWING: Figure 2

Description

The present invention relates to a SIMD type parallel arithmetic device, a SIMD type parallel arithmetic semiconductor chip, a SIMD type parallel arithmetic method, and an apparatus including the SIMD type parallel arithmetic device or semiconductor chip.

To clarify the object of the present invention, the problems of CPUs and GPUs are described first.
FIG. 1 is an example of feature data collation.
The database holds 8-bit data (0 to 255) for feature 1 through feature N, registered for each of subject A through subject Z, and query data to be collated against this database is given. The lower part of the figure shows how the differences between corresponding feature values are obtained, how those differences are summed (the difference-sum operation), and how the subject with the smallest sum is taken to be the most similar; here subject C is determined as the similarity matching result.
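The difference-sum collation described for FIG. 1 can be sketched in a few lines of Python. This is only an illustration: the feature values and the function names `difference_sum` and `most_similar` are hypothetical, and taking the absolute value of each difference is an assumption, since the text says only that the differences are summed.

```python
def difference_sum(query, entry):
    """Sum of absolute differences between two equal-length feature vectors.

    Using abs() is an assumption; the text only says the differences are summed.
    """
    return sum(abs(q - e) for q, e in zip(query, entry))


def most_similar(database, query):
    """Return the name of the subject whose difference sum is smallest."""
    return min(database, key=lambda name: difference_sum(query, database[name]))


# Hypothetical 8-bit feature data (0 to 255); not the values shown in FIG. 1.
database = {
    "A": [12, 200, 34, 90],
    "B": [250, 3, 77, 140],
    "C": [101, 57, 180, 22],
}
query = [100, 60, 175, 25]

best = most_similar(database, query)
print(best, difference_sum(query, database[best]))
```

On a single CPU this loop visits every feature of every subject one after another, which is the cost the following paragraphs estimate.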

If the collation target is the faces of terrorists and criminals entering or leaving an international airport, the number of faces (persons) from subject A to subject Z may reach one million (1M); with 1000 (1K) feature types per face, the difference-sum operation must be repeated 1K * 1M = 1G times.
If one difference-sum operation 119 takes 10 ns on a single CPU, this takes as long as 10 seconds and cannot be used in real time.

If the collation target is handwritten characters, and subjects A to Z are the roughly 3000 (3K) characters of Japanese with 256 feature types each, the difference-sum operation 119 must be repeated 256 * 3K = 768K times.

As before, if one difference-sum operation takes 10 ns, 7.68 ms is required per character, so only about 130 characters can be read per second, and reading one manuscript sheet's worth of characters takes nearly 3 seconds.
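The timing estimates above follow from simple arithmetic, reproduced here as a small Python sketch. The 10 ns operation time and the problem sizes come from the text; the figure of 400 characters per manuscript sheet is an assumption (it is not stated in the text), and the names are hypothetical.

```python
OP_TIME_NS = 10  # time per difference-sum operation, as assumed in the text


def sequential_time_ns(features, subjects, op_time_ns=OP_TIME_NS):
    """Time for one CPU to run features * subjects operations back to back."""
    return features * subjects * op_time_ns


face_ns = sequential_time_ns(1_000, 1_000_000)  # face collation: 1K * 1M operations
char_ns = sequential_time_ns(256, 3_000)        # one handwritten character: 256 * 3K

chars_per_second = 1_000_000_000 // char_ns

SHEET_CHARS = 400  # assumption: characters on one manuscript sheet
sheet_seconds = SHEET_CHARS * char_ns / 1e9

print(f"face collation: {face_ns / 1e9} s")
print(f"one character:  {char_ns / 1e6} ms")
print(f"characters per second: {chars_per_second}")
print(f"one sheet: {sheet_seconds} s")
```

With these inputs the face collation comes to 10,000,000,000 ns (10 s) and one character to 7,680,000 ns (7.68 ms), or 130 whole characters per second, matching the figures in the text.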

The above used the difference-sum operation 119 for evaluating similarity, a convenient example for explaining the intent and purpose of this invention, but the same applies to the product-sum operation 120 and other matrix (vector) operations, whose applications, such as biometric authentication by fingerprints and veins and the collation of seal impressions, are too numerous to list.
Matrix operations are also indispensable for simulations that handle enormous amounts of data, such as weather and the motion of fluid molecules.
Repeated operations on large amounts of data, such as matrix operations, are extremely burdensome for a general CPU.
A CPU is a general-purpose processor that handles every kind of information processing, but because it is based on sequential processing, various problems remain for workloads in which repeated operations occur frequently.

A GPU, which is used to alleviate these problems of the CPU, attempts to solve them by placing a large number of arithmetic cores on one chip and processing in parallel.
The GPU was created to perform image processing, which requires a large amount of computation, at high speed, but it basically follows the same information processing architecture as the CPU.

Recently, as GPGPU, it has also been used beyond image processing for workloads that require large amounts of matrix and vector calculation, such as protein structure analysis, fluid analysis, and vibration analysis.
GPUs are mostly used for SIMD type information processing, but because they follow the same information processing architecture as the CPU, a GPU has a large number of independent arithmetic units, each with its own dedicated memory, and each arithmetic unit operates independently based on its own program and data.

Because each arithmetic unit operates independently as described above, each must separately have its own circuits, for example a circuit for decoding the program, a circuit for controlling arithmetic tasks, a memory address decoder, and memory for running the arithmetic core, so circuits and memory end up duplicated.

Also, because each unit operates independently, the GPU's OS is normally started under the control of the CPU and software controls the GPU so that the load on its arithmetic units stays appropriate and even. However, parallelizing programs is difficult and it is hard to give every arithmetic core an equal share of work, so arithmetic cores sit idle. Having many arithmetic cores is pointless if many of them are idle.

Also, when a GPU has many arithmetic cores, for example several thousand, it consumes a large amount of power, for example over 300 watts, and generates considerable heat, so it cannot be used as the brain of portable equipment or robots.

The limit of semiconductor miniaturization is drawing near, and an era in which conventional architectures can no longer be expected to improve in performance will soon arrive, yet expectations for higher computing performance and lower power consumption are rising in many fields.

Neural networks, one of the artificial intelligence technologies that have recently attracted attention, also require extremely large systems, which is a major obstacle both to development and to practical use.
To give one example, a neural network must be trained repeatedly under various conditions to obtain optimal behavior, and information published on the net indicates that for a large-scale network, training takes from several days to about a week even when, for example, 16,000 CPUs are used.

Needless to say, it is difficult to obtain optimal behavior from a single round of training, and tuning must be repeated again and again until optimal behavior is obtained.
The fact that training takes so long even with such enormous hardware resources is hindering the growth of this technology.

As described later, a neural network must execute a large number of product-sum operations 120. If computing performance could be raised without building a large-scale system, a compact, power-saving, low-heat device realized, and training time shortened, the evolution of this technology would accelerate greatly.

As described above, demand is growing for efficient parallel processing that does not require a large-scale system.

The feature of this invention is that, in performing SIMD type (single instruction/multiple data) parallel arithmetic processing, whatever can be shared is shared and wasteful circuits and functions are removed. This improves integration density, keeps power consumption and heat generation to a minimum, and provides a parallel arithmetic device and a parallel semiconductor arithmetic chip that obtain the greatest benefit from SIMD type parallel arithmetic processing.

Optimizing the memory access method is indispensable for raising the computing power of a GPU, and various techniques have been adopted for that purpose. However, because GPUs are based on SIMD type operation, it is self-evident that higher speed becomes possible if the GPU is drastically slimmed down to raise integration density and the operating efficiency of the arithmetic units is raised.

The present inventor has proposed that memory-type processors based on memory-type computing can solve various problems of von Neumann computers, and has filed various patent applications and pursued their practical use. Representative patent documents are listed below.

Japanese Patent No. 4588114, "Memory having an information narrowing detection function", is a memory-type processor that realizes pattern matching of images and sound at ultra-high speed.
It has been demonstrated to be tens of thousands of times faster, or more, than conventional software pattern matching.

Japanese Patent Application No. 2013-264763, "Memory having an information search function", is a memory-type processor that searches database records at ultra-high speed.
It has been demonstrated to be tens of thousands of times faster, or more, than conventional software search, and this technology prompted the present invention.

Japanese Patent Application No. 2008-123479 by another inventor, "SIMD and memory array structure therefor", consists of a SIMD type processor and memory, but it aims to avoid data collisions; both its purpose and its method are entirely different.

Japanese Patent Application No. 2011-023037, "Parallel data processing apparatus", has a SIMD array and performs the operations of each block independently, but its method is entirely different.

Although the details are not clear, a case has been published on the net in which Micron's automaton arithmetic chip reads a DRAM array of 256 rows × 49512 columns in parallel to realize ultra-high-speed automaton operations, but its purpose differs from that of the present invention. Nor does any other prior application show a SIMD type arithmetic scheme that directly drives the memory address lines as the present invention does.

Japanese Patent No. 4588114
Japanese Patent Application No. 2013-264763
Japanese Patent Application No. 2008-123479
Japanese Patent Application No. 2011-023037

Conventional SIMD type parallel computation, as in a GPU, consists of independent arithmetic cores and their memory, so circuit scale grows and integration density does not rise. In addition, operation speed is sacrificed and power consumption tends to be large because of overhead such as preparation for driving the GPU through the CPU and GPU OS, data transfer to memory, and the accompanying allocation of arithmetic units and control and management of task allocation inside the GPU, and because the arithmetic units themselves sit idle.

The present invention maximizes the efficiency of SIMD type parallel operation to raise integration density, and not only makes operation at the hardware speed limit achievable but also allows an appropriate operation speed and an appropriate power consumption to be selected. It provides a SIMD type parallel operation algorithm that realizes an arbitrary degree of parallelism and operation time by using multiple units, and that can easily be implemented in a semiconductor ASIC or FPGA.

Claim 1 is an arithmetic device comprising a memory cell group consisting of a large number of memory cells and N arithmetic units, characterized by comprising:
multiple-data access means by which an address line for accessing the memory cells of the memory cell group can collectively access N sets of data, each datum consisting of two or more memory cells;
means for collectively reading the N sets of data on the collectively accessed address line and inputting them in parallel to the operation inputs of the N arithmetic units; and
means for collectively writing the N sets of operation result data from the operation outputs of the N arithmetic units into the N sets of memory cells on the collectively accessed address line.
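The read, operate, and write-back cycle of claim 1 can be modelled in software as follows. This is a minimal behavioural sketch, not the claimed circuit: the class `SimdMemoryArray` and its methods are hypothetical names, and the Python loop merely stands in for N hardware arithmetic units that would all act in the same access.

```python
class SimdMemoryArray:
    """Toy model in which one address selects one data word in each of N groups."""

    def __init__(self, n_groups, n_addresses):
        # memory[address][group] holds one multi-bit data word
        self.n_groups = n_groups
        self.memory = [[0] * n_groups for _ in range(n_addresses)]

    def read_all(self, address):
        """A single address access returns N words, one per group."""
        return list(self.memory[address])

    def write_all(self, address, words):
        """A single address access stores N words, one per group."""
        if len(words) != self.n_groups:
            raise ValueError("expected one word per group")
        self.memory[address] = list(words)

    def operate(self, address, op):
        """Read N words, apply the same operation in every unit, write N results back."""
        results = [op(word) for word in self.read_all(address)]
        self.write_all(address, results)
        return results


array = SimdMemoryArray(n_groups=4, n_addresses=2)
array.write_all(0, [1, 2, 3, 4])
array.operate(0, lambda word: word * 2)  # same instruction applied to all four groups
print(array.read_all(0))
```

The point of the model is that the address, not a per-unit program, selects the operands: one access to address 0 touches all four groups, and address 1 is left untouched.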

Claim 2 is characterized in that the arithmetic unit executes any of the following operations (1) to (6):
(1) the four arithmetic operations
(2) floating-point operations
(3) comparison operations
(4) logical operations
(5) shift operations
(6) multi-stage operations combining the above

Claim 3 is characterized by comprising operation means that masks some of the arithmetic units and some of the input bits of the arithmetic units, so that those arithmetic units and those input bits of the arithmetic units 109 have no effect on the operation.

Claim 4 is characterized by comprising means for outputting the operation results of the arithmetic units externally.

Claim 5 is characterized by comprising means for inputting external data in parallel to the operation inputs of the N arithmetic units.

Claim 6 is characterized in that the parallel arithmetic device according to claim 1 is configured within one semiconductor chip.

Claim 7 is characterized in that the device is combined with another LSI such as a CPU or GPU and configured within one semiconductor chip.

Claim 8 is characterized in that the parallel arithmetic device according to claim 1 is implemented in an FPGA.

Claim 9 is characterized in that the memory cell group and the arithmetic units are separated, and the memory cells form an independent chip.

Claim 10 is characterized in that the memory cell group and the arithmetic units are separated, and the arithmetic units form an independent chip.

Claim 11 is characterized in that a plurality of the semiconductor chips according to claims 6, 7, and 8 are provided and operated in parallel.

Claim 12 is characterized in that a plurality of the memory chips according to claim 9 and one arithmetic chip according to claim 10 are combined and operated in parallel.

Claim 13 is characterized in that one memory chip according to claim 9 and a plurality of the arithmetic chips according to claim 10 are combined and operated in parallel.

Claim 14 is characterized in that data at a plurality of addresses are combined and operated on in parallel as one datum.

Claim 15 is a system including any of the following (1) to (4) according to claims 1 to 10:
(1) a SIMD type parallel arithmetic device
(2) a SIMD type parallel arithmetic semiconductor chip
(3) a memory chip of a SIMD type parallel arithmetic semiconductor
(4) an arithmetic chip of a SIMD type parallel arithmetic semiconductor

FIG. 1 is an example of data collation (feature data collation). (Example 2)
FIG. 2 is an example of the overall configuration of a parallel arithmetic device or semiconductor parallel arithmetic chip.
FIG. 3 is an example of the detailed configuration of a parallel arithmetic device or semiconductor parallel arithmetic chip. (Example 1)
FIG. 4 is a configuration example of a neural network. (Example 3)
FIG. 5 is an example of a neural network unit.

FIG. 2 is an overall configuration diagram of the parallel arithmetic device or parallel arithmetic semiconductor chip 201.
This figure omits the detailed circuit configuration of the memory and arithmetic functions and serves only to explain the concept of the present invention. The upper part of the figure is the memory part 202, and the lower part is the arithmetic part 203.

As described later, the type of memory cell and the type of arithmetic unit are arbitrary. The device may be built from a combination of several LSIs, implemented on one semiconductor chip, or made into a semiconductor chip that incorporates other functions.

This parallel arithmetic device or parallel arithmetic semiconductor chip 201 is configured so that N operation groups, from operation group 1 to operation group N, can operate fully in parallel.

The memory 103 is connected to a single address line 102 so that data consisting of a plurality of memory cells 104 can be accessed in all N groups with one address 101, and any address 101 can be selected (accessed).

In this example, each datum at address X through address X+n consists of 9+9 bits of memory cells 104, and each datum at address Y through address Y+m consists of 17+17 bits, and the data are applied to one or both of the operation input data A 123 side and the operation input data B 124 side of the arithmetic unit 109.

The allocation of memory cells may be decided by considering the required data width, sign, carry, and so on; needless to say, the data width affects the precision of the operation.
The memory cells 104 may also be provided on only one of the operation input data A 123 side and the B 124 side.
The number of addresses is arbitrary, as is the number of operation groups.
Addresses with various data widths may be provided, and various operations may be mixed.

Each bit line (data line) 105 of the arithmetic part 203 has an R/W changeover switch 106 for switching between reading data from the memory cell group and feeding it to the operation input data 123 and 124 sides of the arithmetic unit 109, and writing the operation result 110 of the arithmetic unit 109 into the memory cell group.

The arithmetic units 109 are arranged in a row of N, one per operation group. Their inputs receive the N data read out through the bit lines (data lines) 105 of the memory cells 104 accessed by specifying an address. External input data is not strictly required, but in this example external input data 125 (9 bits in this example) can be input to the operation input data A 123 side of the arithmetic units 109.

It is also possible to perform batch operations using only the data stored in the memory cells, without external input data.

The operation result 110 of the arithmetic unit 109 is connected to the input/output interface 113, and the result can be output as the operation output 108 in any output form, for example PCI-e.
Memory storage data 108, that is, the data to be stored in the memory cells 104, can be input from outside through the input/output interface 113.

As described above, this operation result 110 can be written through the bit lines (data lines) 105 into the memory 103 at the specified and accessed address.
As just one example, multiplying signed 8-bit data produces carries, so the result may be written to an address with a 17-bit data width, from address Y to address Y+m shown in the figure.
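The 17-bit width for a product of two signed 8-bit values can be checked with a short calculation. The sketch below assumes a sign-and-magnitude layout, that is, an 8-bit magnitude plus a separate sign bit, which is how the 9-bit and 17-bit words are described for FIG. 3; the function name `multiply_sign_magnitude` is hypothetical.

```python
MAGNITUDE_BITS = 8

max_magnitude = (1 << MAGNITUDE_BITS) - 1           # largest 8-bit magnitude
max_product = max_magnitude * max_magnitude         # largest product magnitude
product_magnitude_bits = max_product.bit_length()   # bits needed for that magnitude
total_bits = product_magnitude_bits + 1             # plus one sign bit


def multiply_sign_magnitude(a_sign, a_mag, b_sign, b_mag):
    """Multiply two sign-and-magnitude values; sign is 0 (plus) or 1 (minus)."""
    return a_sign ^ b_sign, a_mag * b_mag


sign, magnitude = multiply_sign_magnitude(1, 255, 0, 255)
print(total_bits, sign, magnitude)
```

Under that assumption the largest product magnitude is 255 * 255 = 65025, which needs 16 bits, so 16 magnitude bits plus one sign bit give the 17-bit word.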

Therefore, in this example, N data in the memory part and N externally supplied input data can be operated on directly as an N-way parallel SIMD operation, and the operation results can be output or stored in memory.

FIG. 3 shows the details of one operation group.
This figure shows the details of the memory 103, arithmetic unit 109, input data 125, and input/output interface 113 of one group among operation group 1 through operation group N, which are connected in parallel.
In the memory cells 104, address X through address X+n each hold 9 bits in total, consisting of 8 data bits plus 1 sign bit, and address Y through address Y+m each hold 17 bits in total, consisting of two sets of 8 data bits plus 1 sign bit.

As described above, the data length and the allocation of the data are arbitrary.
In this example, memory cells 104 are attached to both the operation input data A 123 side and the operation input data B 124 side of the arithmetic unit 109, and it is possible to choose freely whether to read both sets of data or only one set of memory cells, and conversely whether to write to both or to only one set of memory cells.

For such processing, some of the arithmetic units 109, some of the input bits of the arithmetic units 109, and some of the operation outputs can be masked. The operation conditions can thus be set so that some arithmetic units 109 or some of their inputs have no effect on the operation, or so that part of the operation result is ignored (masked) when it is stored in the memory cells.
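The three kinds of masking mentioned here (whole arithmetic units, input bits, and output bits) can be illustrated as follows. This is a sketch under assumptions: the text does not say what a masked unit produces, so the model simply leaves that group's stored value unchanged, and `masked_step` and its parameters are hypothetical names.

```python
def masked_step(stored, operand, op, unit_enabled, input_bit_mask, output_bit_mask):
    """Apply op(stored, operand) in every enabled unit.

    A disabled unit keeps its stored word (an assumption of this sketch).
    Input bits outside input_bit_mask are forced to 0 before the operation,
    and result bits outside output_bit_mask are dropped before storing.
    """
    results = []
    for old, enabled in zip(stored, unit_enabled):
        if not enabled:
            results.append(old)
            continue
        value = op(old & input_bit_mask, operand & input_bit_mask)
        results.append(value & output_bit_mask)
    return results


stored = [0x12, 0x34, 0x56, 0x78]
out = masked_step(
    stored,
    operand=0x0F,
    op=lambda a, b: a + b,
    unit_enabled=[True, False, True, True],  # unit 2 is masked out
    input_bit_mask=0x0F,                     # only the low 4 input bits take part
    output_bit_mask=0xFF,                    # keep 8 result bits
)
print([hex(v) for v in out])
```

In hardware the masks would presumably be control signals shared by all groups, so applying them costs no extra memory accesses; that is an inference rather than something the text states.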

The arrangement and use of these memory cells, including the type and length of the data to be used, can be determined freely.

When the R/W changeover switch 106 is set to R (read), data from the memory cells 104 at the accessed address is input to the arithmetic unit 109 through the bit lines (data lines) 105.
When the R/W changeover switch 106 is set to W (write), the operation result of the arithmetic unit 109 can be written into the memory cells 104 at the accessed address.

External input data 125 (9 bits in this example) is applied to the operation input data A 123 input of the arithmetic unit 109 through an OR gate shared with the memory cell read bit line described above.
This input data 125 is given in common (in parallel) to arithmetic unit 1 through arithmetic unit N.
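The OR-gate merge of the external input and the memory read line on the A side can be modelled as below. The sketch assumes that whichever source is not in use drives all zeros, so the OR gate passes the other source through unchanged; that behaviour and the names `operand_a` and `broadcast` are assumptions, not details given in the text.

```python
WORD_MASK = 0x1FF  # 9-bit words, as in this example


def operand_a(memory_bits, external_bits):
    """A-side operand: bitwise OR of the memory read line and the external input."""
    return (memory_bits | external_bits) & WORD_MASK


def broadcast(external_bits, n_units):
    """Give every unit the same A operand while the A-side memory is idle (all zeros)."""
    return [operand_a(0, external_bits) for _ in range(n_units)]


print(broadcast(0x0A5, 3))     # the same external word reaches all three units
print(operand_a(0x1F0, 0))     # with no external input, the memory word passes through
```

Because the external word is shared, a single input reaches arithmetic unit 1 through arithmetic unit N at once, which is what allows one query value to be compared against N stored values in one access.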

This example shows the case where the external input data 125 is given in common to all arithmetic units, but different data can also be input to each arithmetic unit 109.

The arithmetic units 109 can also be connected in multiple stages. In that case the operation result 110 no longer needs to be stored temporarily in memory each time, so the operation is extremely efficient and correspondingly faster. Details are described later.

Compared with circuits such as GPUs, which are realized with independent memories and independent arithmetic units, such a SIMD type parallel circuit has the following features:
(1) In the memory part, program storage memory for each arithmetic unit becomes unnecessary and only memory for operation data is needed; in addition, a single address selection circuit (including the address decoder) suffices.
(2) In the arithmetic part, a program decoder circuit for each arithmetic unit and circuits for controlling and managing the arithmetic tasks of each unit become unnecessary.
Common circuitry such as this can be largely omitted, which raises integration density and improves economy.

A further distinctive point is that accessing the address line 102 directly executes the SIMD type parallel operation, so every operation group that has been given a role can operate extremely efficiently and at high speed without sitting idle even for a moment.

A typical operation needs at least two cycles: reading data from memory, and operating on that data.
With this method it can be done in one cycle; that is, the fastest operation becomes possible if memory latency and arithmetic latency are balanced.
Memory latency is generally the larger of the two, so if the configuration directly drives memory such as the registers inside an arithmetic unit or high-speed cache memory, parallel operation at the ultimate speed of current semiconductor technology becomes possible.

Therefore, unlike with an ordinary GPU, there is no need to worry about the utilization of each arithmetic unit, and parallel operation performance does not depend on the performance of the CPU or GPU OS or on the skill of the programmer; fast and reliable operation results can always be produced.

The following describes how the handwritten character matching shown in FIG. 1 is realized with the parallel arithmetic device or parallel arithmetic semiconductor chip 201 of the present invention.
Since about 3000 characters are used in everyday Japanese, the number of operation groups N is set to 3000 (3K); in this example each character has 256 kinds of features, and 3000 sets (groups) of memory and arithmetic units are assumed to be provided.

Since the feature data of these handwritten characters is unsigned 8-bit data (0 to 255), it is registered (written) in order, unsigned, on the operation input data B 124 side from address X to address X+255 shown in FIG. 2, with one character per operation group.
This completes the preparation of the database and of the operation.
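The layout described above can be pictured as a two-dimensional array in which each address selects one row and each row holds one 8-bit value for every operation group. The following Python sketch is illustrative only: `build_database`, `NUM_FEATURES`, and the choice of base address 0 for X are assumptions and do not come from the specification.

```python
NUM_FEATURES = 256   # features per character (addresses X .. X+255)

def build_database(characters, base_address=0):
    """Return memory[address][group]: one row per address, one column per group.

    `characters` is a list of N feature vectors, one per registered character;
    each vector holds NUM_FEATURES unsigned 8-bit values (0..255).
    """
    for features in characters:
        if len(features) != NUM_FEATURES:
            raise ValueError("each character needs exactly 256 features")
        if any(not 0 <= v <= 255 for v in features):
            raise ValueError("features must be unsigned 8-bit values")
    memory = {}
    for f in range(NUM_FEATURES):
        # Row X+f holds feature f+1 of every character, side by side.
        memory[base_address + f] = [features[f] for features in characters]
    return memory

# Tiny demonstration with 3 groups instead of 3000.
chars = [[(g + f) % 256 for f in range(NUM_FEATURES)] for g in range(3)]
memory = build_database(chars)
```

Storing one feature per row is what lets a single address access deliver that feature for all N characters at once.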

In this state, to obtain the difference between feature 1 of the query data and feature 1 of the database, the R/W selector switch 106 is first set to R, that is, read mode, and a subtraction command is given to arithmetic unit 1 through arithmetic unit N from the external operation condition 114 input.

During matching, the query data is applied in parallel (simultaneously) to input 7 through input 0 of the input data 125, and is thereby applied in parallel (simultaneously) to the operation input data A 123 side of the arithmetic units of operation group 1 through operation group N.

By accessing and reading address X, where feature 1 of the database is stored, and feeding it to the operation input data B 124 side, both operands A and B are applied in parallel (simultaneously) to the inputs of arithmetic unit 1 through arithmetic unit N.

When the operation is carried out with these inputs and the subtraction condition, the difference data for feature 1 appears in parallel (simultaneously) at the outputs of all 3K arithmetic units 109.

Next, the R/W selector switch 106 is set to W, that is, write mode, address Y is accessed, and the result is temporarily stored in the memory cells on the operation input data B 124 side.
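The read, subtract, and write sequence for feature 1 can be sketched in Python as follows. This is a behavioral model under stated assumptions, not the circuit itself: `simd_step`, the addresses `X = 0` and `Y = 1000`, and the use of an absolute difference (so that the stored value stays unsigned) are all hypothetical choices made for illustration.

```python
def simd_step(memory, read_address, write_address, query_value, op):
    """One address access = one parallel operation across all N groups.

    Read phase (R/W switch = R): the row at `read_address` feeds input B of
    every arithmetic unit while `query_value` is broadcast to input A.
    Write phase (R/W switch = W): the N results are stored at `write_address`.
    """
    row_b = memory[read_address]                     # N values read at once
    results = [op(query_value, b) for b in row_b]    # conceptually simultaneous
    memory[write_address] = results
    return results

# Assumed operation: absolute difference, so that the value stays unsigned.
abs_diff = lambda a, b: abs(a - b)

X, Y = 0, 1000                      # hypothetical addresses
memory = {X: [10, 200, 37, 255]}    # feature 1 of four registered characters
out = simd_step(memory, X, Y, query_value=40, op=abs_diff)
```

The Python loop runs sequentially, whereas in the described hardware all N subtractions would take place in the same access.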

Similarly, to obtain the difference for feature 2 of the database, the R/W selector switch 106 is set to R, a subtraction command is given to arithmetic unit 1 through arithmetic unit N, the feature 2 value of the external query data is applied to the operation input data A 123 side, and address X+1, where feature 2 of the database is stored, is accessed and read on the operation input data B 124 side, so that both operands A and B are applied in parallel to arithmetic unit 1 through arithmetic unit N.

Carrying out the operation with these inputs and operation conditions outputs the difference data in parallel at the outputs of all arithmetic units 109.
The R/W selector switch 106 is set to W, and this result is temporarily stored in parallel on the operation input data A 123 side of address Y.

Next, the R/W selector switch 106 is set to W, the operation condition of the arithmetic units 109 is set to addition, and the previously stored address Y is read and input to the arithmetic units 109; the two differences are then added, so the difference-sum operation 119 for feature 1 and feature 2 is carried out in parallel.

This result is again temporarily stored in parallel on the operation input data B 124 side of address Y, so the operation input data B 124 side of address Y holds the accumulated difference sum.
Repeating the above up to feature 256 completes the difference-sum operation 119 for the 3000 characters.
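The whole accumulation loop can be summarized in a few lines of Python. The sketch below is an assumption-laden model: `difference_sum` is a hypothetical name, the difference is taken as an absolute value, and the running total stands in for the value kept at address Y. The demonstration uses four features and three characters rather than 256 and 3000.

```python
def difference_sum(database, query):
    """Accumulate |query[f] - database[f][g]| over all features f for every group g.

    `database[f]` is the row stored at address X+f (one value per group).
    The running total plays the role of the value kept at address Y.
    """
    num_groups = len(database[0])
    accumulated = [0] * num_groups              # contents of address Y
    for f, row in enumerate(database):          # one address access per feature
        diffs = [abs(query[f] - b) for b in row]                        # subtraction pass
        accumulated = [acc + d for acc, d in zip(accumulated, diffs)]   # addition pass
    return accumulated

# Three registered characters with four features each (256 in the text).
database = [
    [10, 50, 90],   # feature 1 of characters 0, 1, 2
    [20, 60, 80],   # feature 2
    [30, 70, 70],   # feature 3
    [40, 80, 60],   # feature 4
]
query = [12, 58, 69, 81]
sums = difference_sum(database, query)   # [120, 42, 122]
```

The number of loop iterations depends only on the number of features, not on the number of registered characters, which is the point of the scheme.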

In this example each group has a single arithmetic stage, but in a configuration where both a difference unit and a sum unit are provided and connected in multiple stages, the feature difference data no longer needs to be stored temporarily at address Y each time, so the operation becomes even more efficient.

This method can raise performance up to the hardware speed limit. As an example, if the difference-sum operation for one feature takes 1 ns, the total matching time for 256 features is 256 ns; if it takes 10 ns, the total is 2.56 µs. Even at 10 ns, it is reliably 3K times faster than the processing by a single CPU described earlier.
Ordinarily, it is often impossible to know what throughput will be achieved until the processing is actually run, but this method always delivers its rated operation speed.
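The timing figures above follow from simple multiplication, as the Python sketch below shows. It assumes that one feature costs exactly one parallel step and that a single sequential processor would need the same step time for each feature of each character; `total_matching_time_ns` is a hypothetical helper name.

```python
def total_matching_time_ns(num_features, step_time_ns):
    """Total time when every feature costs one parallel step, independent of N."""
    return num_features * step_time_ns

fast = total_matching_time_ns(256, 1)    # 256 ns at 1 ns per feature
slow = total_matching_time_ns(256, 10)   # 2560 ns = 2.56 us at 10 ns per feature

# Assumed baseline: one sequential processor, 10 ns per feature per character.
sequential = 3000 * total_matching_time_ns(256, 10)
speedup = sequential // slow             # 3000, i.e. the number of groups N
```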

The results of this difference-sum operation 119 can be output through an interface such as PCI-e, and the minimum value can then be found by an ordinary CPU or the like.
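On the host side, finding the best match is an ordinary minimum search over the N difference sums. A minimal Python sketch follows; `best_match` and the labels are hypothetical, and how the results actually arrive over the interface is outside its scope.

```python
def best_match(difference_sums, labels):
    """Return (label, score) of the entry with the smallest difference sum."""
    if len(difference_sums) != len(labels):
        raise ValueError("one label is needed per operation group")
    index = min(range(len(difference_sums)), key=difference_sums.__getitem__)
    return labels[index], difference_sums[index]

label, score = best_match([120, 42, 122], ["A", "B", "C"])
```

With ties, `min` returns the first smallest entry; a real system would need its own tie-breaking rule.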

The following describes an application to neural networks, which have recently attracted attention.
Neural networks take various forms, so only the most basic points relevant to the present invention are described.

FIG. 4 shows a configuration example of a neural network.
As shown in the figure, a typical neural network consists of several layers, such as an input layer, an intermediate layer, and an output layer, each made up of many neuron units, and is wired so that the output of one layer becomes the input of the next layer.
The number of units in a neural network varies, but this example assumes that the input, intermediate, and output layers each have 1000 (1K) units, for a total of 3000 (3K).

FIG. 5 is a conceptual diagram of one unit of the intermediate layer of the neural network.
One unit of the intermediate layer receives 1K inputs in parallel from the input layer, and the results computed on those parallel inputs are aggregated and output as a single output.
When this unit receives the analog output data from the input neuron units of the 1K input layer, it multiplies each analog input value by the connection weight set for that input, from input 1 to input n (1K in this example), executes the product-sum operation 120 over all input data and connection weights, and, once all product-sum operations are finished, applies a predetermined operation such as a threshold operation or a sigmoid function and outputs the result.
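The behavior of one such unit can be written compactly in Python. The sketch below is one common formulation and not necessarily the one intended in the specification: `neuron_output` is a hypothetical name, and subtracting the threshold before applying the sigmoid is an assumption.

```python
import math

def neuron_output(inputs, weights, threshold=0.0):
    """Product-sum of inputs and connection weights, then a sigmoid."""
    if len(inputs) != len(weights):
        raise ValueError("one weight is needed per input")
    total = sum(x * w for x, w in zip(inputs, weights))   # product-sum operation
    return 1.0 / (1.0 + math.exp(-(total - threshold)))   # sigmoid activation

# The product-sum here is 0.2 + 0.2 - 0.4 = 0, so the sigmoid returns 0.5.
y = neuron_output([1.0, 0.5, -1.0], [0.2, 0.4, 0.4])
```

Nearly all of the work is in the `sum(...)` line, which is the part the text proposes to spread across the parallel arithmetic units.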

Needless to say, the heaviest part of this processing is that each of the 1000 (1K) neuron units must repeat the product-sum operation 120 1000 (1K) times, for a total of one million (1M) operations, which accounts for most of the neural network computation time.
The neuron units of the output layer must perform the same kind of product-sum processing, so the network as a whole requires a total of two million (2M) operations.

The description above is an example of forward propagation, the normal operation of a neural network from the input layer to the intermediate layer and from the intermediate layer to the output layer.
For forward propagation, if a single CPU performs one product-sum operation in 10 ns, the computation time is 10 ns × 2M = 20 ms, which is not a particular problem.
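The operation count and the 20 ms figure can be checked with a few lines of Python. The variable names are illustrative, and the 10 ns per product-sum is the assumption stated in the text.

```python
UNITS_PER_LAYER = 1000          # input, intermediate and output layers alike
MAC_TIME_NS = 10                # assumed time for one product-sum on one CPU

hidden_macs = UNITS_PER_LAYER * UNITS_PER_LAYER   # 1K units x 1K inputs each
output_macs = UNITS_PER_LAYER * UNITS_PER_LAYER   # same again for the output layer
forward_macs = hidden_macs + output_macs          # 2M product-sums per forward pass
forward_time_ms = forward_macs * MAC_TIME_NS / 1_000_000
```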

A neural network is meant to produce the desired result after the network has been trained appropriately.
Training normally repeats a backward computation called backpropagation, from the output layer to the intermediate layer and from the intermediate layer to the input layer, and must be repeated until the error of the evaluation function at each training step falls to or below a predetermined value.

In the case of handwritten characters, for example, to learn the character "あ", handwritten samples written by, say, 100 people are read and training is repeated until the output for "あ" is produced no matter whose handwriting is presented; the backpropagation computation for training must be repeated until the connection weights and thresholds described above become optimal.

This computation usually has to be repeated several thousand times per character, and the same processing has to be repeated for 3000 characters, so at least about 10M (10 million) training iterations are required.

The details of backpropagation are omitted, but it too is a repetition of product-sum operations in each unit; in a neural network like this example, performing the 2M product-sum operations described above for 10M backpropagation training iterations amounts to as many as 20T product-sum operations.
Even if a single CPU continuously executed one product-sum operation every 10 ns, repeating it 20T times would take 200,000 seconds, or about 55.5 hours, for the product-sum operations alone, and all of that is waiting time.
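The 20T and 55.5 hour figures can be reproduced directly, as in the Python sketch below. It simply multiplies out the numbers given in the text; the variable names are illustrative.

```python
forward_macs = 2_000_000            # product-sums per pass (2M)
learning_iterations = 10_000_000    # about 10M training iterations
mac_time_ns = 10                    # one product-sum every 10 ns on one CPU

total_macs = forward_macs * learning_iterations              # 20T product-sums
total_seconds = total_macs * mac_time_ns // 1_000_000_000    # 200,000 s
total_hours = total_seconds / 3600                           # about 55.5 h
```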

Training of this kind is rarely completed in a single run; the connection weights and thresholds described above must be tuned while observing the training results.
This is the biggest problem of neural network technology, and once the number of network units per layer exceeds 10,000, as in image recognition, the key question is how far the computation time can be shortened using GPUs.
However, pursuing higher speed with conventional GPUs generates a great deal of heat, enlarges the system, and wastes a large amount of power.

Many FPGAs now on the market come with a thousand or more arithmetic units and SRAM as standard, and the present invention can easily be realized on an FPGA by combining these arithmetic units and memory. A typical FPGA consumes from a few watts to a dozen or so watts, so a low-power chip with a high degree of parallelism can easily be realized.

For example, if 3K parallel arithmetic units are implemented on one FPGA chip and one product-sum operation 120 takes 10 ns, the product-sum time for the training described above is reduced to 1/3K, that is, to about 66 seconds.
Needless to say, using several such chips yields a system that is even more massively parallel, faster, and low in power consumption.
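The reduction to about 66 seconds, and the effect of adding chips, can be sketched in Python under the idealized assumption that the product-sums divide evenly across all arithmetic units with no communication overhead. `parallel_time_seconds` is a hypothetical helper, and the four-chip case is an illustration, not a figure from the text.

```python
def parallel_time_seconds(sequential_seconds, units_per_chip, chips=1):
    """Ideal time when the product-sums are spread evenly over all units."""
    if units_per_chip <= 0 or chips <= 0:
        raise ValueError("units_per_chip and chips must be positive")
    return sequential_seconds / (units_per_chip * chips)

one_chip = parallel_time_seconds(200_000, 3000)             # about 66.7 s
four_chips = parallel_time_seconds(200_000, 3000, chips=4)  # about 16.7 s
```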

The present invention implemented on an FPGA, whose circuit configuration can be freely modified, is well suited to searching for the best circuit by trial and error, as is the case with neural networks.

When the present invention is implemented as a dedicated semiconductor chip, computation time can be shortened by about an order of magnitude compared with an FPGA, and because the wasteful circuitry of conventional SIMD circuits such as GPUs is eliminated, the parallelism of memory and arithmetic units can be increased.
In addition, overhead such as the preprocessing needed to drive each GPU core becomes unnecessary and idle arithmetic cores are eliminated, so performance at the hardware limit can be pursued.
This technology therefore amounts to a slim, ultra-high-speed new type of GPU dedicated to SIMD.

Notes on this technology and its applications are given below.

An important point in this technology is the drive capability of the address line.
Also, by dividing the operation groups into several banks and reading and writing data with slight time offsets, the inrush current caused by driving a large number of memory cells and arithmetic units can be limited.
By switching the clock frequency, the time per operation can be controlled freely, for example from 1 ns to 10 ns, so that an operation time can be chosen according to whether performance or power consumption has priority; this makes it possible to realize an arithmetic device with high computing capacity per chip and per watt.
This will be of great value in the near future, as semiconductor miniaturization approaches its limits and performance gains can no longer be expected from conventional architectures.
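The bank division mentioned above can be illustrated with a small scheduling sketch in Python. Everything in it is an assumption made for illustration: the round-robin assignment, the function name `bank_schedule`, and the 100 ps offset are not taken from the specification.

```python
def bank_schedule(num_groups, num_banks, stagger_ps):
    """Assign each operation group to a bank and give each bank a start offset.

    Returns a list of (bank_index, start_offset_ps, group_indices) tuples.
    """
    if num_banks <= 0 or num_groups < 0:
        raise ValueError("num_banks must be positive and num_groups non-negative")
    schedule = []
    for bank in range(num_banks):
        groups = list(range(bank, num_groups, num_banks))   # round-robin split
        schedule.append((bank, bank * stagger_ps, groups))
    return schedule

schedule = bank_schedule(num_groups=3000, num_banks=4, stagger_ps=100)
# At most this many groups start switching at the same instant.
peak_groups_switching = max(len(groups) for _, _, groups in schedule)
```

With four banks, at most 750 of the 3000 groups begin switching at the same instant, at the cost of a 300 ps spread in start times.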

The operations of the arithmetic unit 109 have so far been described mainly in terms of the four arithmetic operations 115 on real numbers, but the invention applies equally to floating-point operations 127, comparison operations 116 such as match, magnitude, and range, logical operations 117 such as AND, OR, and NOT, shift operations on data within an arithmetic unit or across arithmetic units, and SIMD operations that combine these in multiple stages.
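One way to picture a selectable operation condition is a lookup table that maps a condition code to a function applied identically in every group. In the Python sketch below, the string codes, the 8-bit mask on the shift, and the name `simd_apply` are all hypothetical; the specification does not define an encoding for the operation condition 114.

```python
import operator

# Hypothetical operation-condition codes; the specification defines no encoding.
OPERATIONS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "eq": lambda a, b: int(a == b),        # comparison: match
    "gt": lambda a, b: int(a > b),         # comparison: magnitude
    "and": operator.and_,
    "or": operator.or_,
    "shl": lambda a, b: (a << b) & 0xFF,   # shift within an 8-bit word
}

def simd_apply(condition, input_a, row_b):
    """Apply one operation condition to every group: A is broadcast, B is per group."""
    op = OPERATIONS[condition]
    return [op(input_a, b) for b in row_b]

row = [1, 2, 3, 4]
differences = simd_apply("sub", 10, row)   # [9, 8, 7, 6]
```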

As one example, for floating-point operations the data can be allocated to the memory cells to suit the capability of the arithmetic unit 109.

For long data with a large data length, data at several memory addresses can be read repeatedly, and the data gathered over a predetermined number of reads can be operated on as a single long data item.
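Combining several addresses into one long value per group can be sketched as follows. The most-significant-word-first ordering, the 8-bit word width, and the name `read_long_data` are assumptions; the text does not fix any of them.

```python
def read_long_data(memory, base_address, num_words, word_bits=8):
    """Combine `num_words` consecutive rows into one wide value per group.

    The row at `base_address` is treated as the most significant word
    (an assumption; the text does not fix the word order).
    """
    num_groups = len(memory[base_address])
    combined = [0] * num_groups
    for offset in range(num_words):
        row = memory[base_address + offset]     # one address access per word
        combined = [(c << word_bits) | w for c, w in zip(combined, row)]
    return combined

memory = {0: [0x12, 0xFF], 1: [0x34, 0x00]}
wide = read_long_data(memory, base_address=0, num_words=2)   # [0x1234, 0xFF00]
```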

The memory cell 104 of the present invention can be any kind of memory cell, including SRAM, DRAM, and FLASH memory as well as resistive memory, magnetic memory, and other types that come to market in the future.
Taking operation performance and cost into account, different kinds of memory can also be mixed, address by address.

If the memory section 202 and the arithmetic section 203 are separated into independent devices or semiconductor chips that are then used in combination, memory resources and arithmetic resources can be used efficiently without waste.

101 Address
102 Address line
103 Memory
104 Memory cell
105 Bit line (data line)
106 R/W selector switch
107 Operation input
108 Operation output and memory storage data
109 Arithmetic unit
110 Operation result
111 Operation result register
112 Logical OR (OR) gate
113 Input/output interface
114 Operation condition
115 Four arithmetic operations
116 Comparison operation
117 Logical operation
119 Difference-sum operation
120 Product-sum operation
123 Operation input data A
124 Operation input data B
125 Input data
126 Sign
127 Floating-point operation
201 Parallel arithmetic device and parallel arithmetic semiconductor chip
202 Memory section
203 Arithmetic section

The SIMD type parallel arithmetic unit according to this invention has a wide range of applications: it raises the integration density of parallel computation, allows the operation speed to be set freely from ordinary to ultra-high speed, and can provide the best operating environment for a given system according to whether performance or current consumption has priority.
Since it can easily be realized even on an FPGA, it is well suited not only to general data operations but also to authentication functions in mobile devices and to robot brains, and it can replace GPUs in many of their uses.

Recently, GPUs have been used as GPGPUs for information processing other than image processing that requires large amounts of matrix and vector computation, such as protein structure analysis, fluid analysis, and vibration analysis.
GPUs are mostly used for SIMD type information processing, but because they follow the same information processing architecture as CPUs, they consist of many independent arithmetic units or operation groups, each with its own dedicated memory, and each arithmetic unit computes independently based on its own program and data.

Because each arithmetic unit or operation group operates independently in this way, each one must have its own circuits, such as a circuit for decoding the program, a circuit for controlling operation tasks, a memory address decoder, and memory for running the arithmetic core, with the result that circuits and memory are duplicated.

Conventional SIMD type parallel computation, such as on GPUs, is built from independent arithmetic cores or operation groups and their memories, so the circuit scale grows and the degree of integration does not rise. In addition, operation speed is sacrificed and power consumption tends to increase because of overhead, such as preparatory processing for driving the GPU through the CPU and GPU OS, data transfer to memory, and the accompanying allocation of arithmetic units and task allocation control and management inside the GPU, and because arithmetic units sit idle.

When the present invention is implemented as a dedicated semiconductor chip, computation time can be shortened by about an order of magnitude compared with an FPGA, and because the wasteful circuitry of conventional SIMD circuits such as GPUs is eliminated, the parallelism of memory and arithmetic units can be increased.
In addition, overhead such as the preprocessing needed to drive each arithmetic core or operation group of a GPU becomes unnecessary and idle arithmetic cores are eliminated, so performance at the hardware limit can be pursued.
This technology therefore amounts to a slim, ultra-high-speed new type of GPU dedicated to SIMD.

Claims

A SIMD type parallel arithmetic device comprising a memory cell group consisting of a large number of memory cells and N arithmetic units, wherein an address line for accessing the memory cells of the memory cell group is provided with batch data access means capable of collectively accessing N sets of data, each data item being composed of two or more memory cells, and wherein means are provided for reading out at once the N sets of data on the collectively accessed address line and inputting them in parallel to the operation inputs of the N arithmetic units, and for writing at once the N sets of operation result data from the operation outputs of the N arithmetic units to the N sets of memory cells on the collectively accessed address line.

The SIMD type parallel arithmetic device according to claim 1, wherein the arithmetic units are configured to execute at least one of (1) four arithmetic operations, (2) floating-point operations, (3) comparison operations, (4) logical operations, (5) shift operations, and (6) a combination of the above.

The SIMD type parallel arithmetic device according to claim 1, further comprising operation means for masking some of the arithmetic units and some of the input bits of the arithmetic units 109, so that the masked arithmetic units and input bits have no effect on the operation.

The SIMD type parallel arithmetic device according to claim 1, further comprising means for outputting the operation results of the arithmetic units to the outside.

The SIMD type parallel arithmetic device according to claim 1, further comprising means for inputting external data to the operation inputs of the N arithmetic units.

A SIMD type parallel arithmetic semiconductor chip, wherein the parallel arithmetic device according to claim 1 is configured in one semiconductor chip.

A SIMD type parallel operation semiconductor chip characterized by being combined with a CPU and other LSIs and configured in one semiconductor chip.

A SIMD type parallel arithmetic semiconductor chip, wherein the parallel arithmetic device according to claim 1 is implemented on an FPGA.

The SIMD type parallel arithmetic semiconductor memory chip according to claim 5, wherein the memory cell group and the arithmetic units are separated and the memory cell group forms an independent chip.

The SIMD type parallel arithmetic semiconductor arithmetic chip according to claim 5, wherein the memory cell group and the arithmetic units are separated and the arithmetic units form an independent chip.

A SIMD type parallel arithmetic method comprising preparing a plurality of the semiconductor chips according to claims 6, 7 and 8 and performing parallel operations.

A SIMD type parallel arithmetic method, wherein a plurality of memory chips according to claim 9 and one arithmetic chip according to claim 10 are combined and operated in parallel.

A SIMD type parallel arithmetic method, wherein one memory chip according to claim 9 and a plurality of arithmetic chips according to claim 10 are combined and operated in parallel.

A SIMD type parallel operation method comprising combining data of a plurality of addresses and performing parallel operation as one data.

A system including any of (1) the SIMD type parallel arithmetic device, (2) the SIMD type parallel arithmetic semiconductor chip, (3) the SIMD type parallel arithmetic semiconductor memory chip, or (4) the SIMD type parallel arithmetic semiconductor arithmetic chip according to claim 1.