CN115907005B

CN115907005B - Large-scale fully connected Ising model annealing processing circuit based on a network-on-chip

Info

Publication number: CN115907005B
Application number: CN202310010051.1A
Authority: CN
Inventors: 姚恩义; 蒋东; 汪祥瑞; 黄展鸿
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2023-01-05
Filing date: 2023-01-05
Publication date: 2023-05-12
Anticipated expiration: 2043-01-05
Also published as: CN115907005A

Abstract

The invention discloses a large-scale fully connected Ising model annealing processing circuit based on a network-on-chip. It relates to the technical field of Ising models and addresses the small processing scale, low scalability, slow convergence and low parallel processing capability of prior-art Ising model circuits. The circuit comprises a global controller, a control bus, a spin processing array and a merging router array. The global controller controls all spin processing units in parallel; all spin processing units share the annealing temperature and the random numbers, and communicate and compute through the merging router array. The circuit offers fast convergence, high parallelism, high scalability, low design complexity and low hardware resource cost, and can perform high-speed, highly parallel annealing on fully connected Ising models.

Description

Large-scale fully connected Ising model annealing processing circuit based on a network-on-chip

Technical Field

The invention relates to the technical field of Ising models, and in particular to a large-scale fully connected Ising model annealing processing circuit based on a network-on-chip.

Background

A combinatorial optimization problem is the problem of finding an optimal object from a finite set of discrete objects; examples include the traveling salesman problem, the max-cut problem, the graph coloring problem and the flight scheduling problem. Such problems are typical non-deterministic polynomial (NP) problems. The Ising model describes the stochastic process of phase transitions in matter by means of a set of interconnected spins. Most combinatorial optimization problems can be mapped onto the Ising model, i.e., the optimal solution of a combinatorial optimization problem can be found by solving the corresponding Ising model. The annealing algorithm, derived from the principle of annealing solid matter, is a general-purpose optimization algorithm that can solve Ising model problems effectively. Processors with a von Neumann architecture have difficulty solving combinatorial optimization problems quickly, because the solution space grows explosively with the number of variables and such processors operate in an inherently serial manner. Quantum annealing processors solve combinatorial optimization problems using superconducting flux qubits and offer excellent accuracy and extremely high solving speed. However, they require an extremely low-temperature operating environment and are costly, which limits their practical use. With the development of semiconductor manufacturing technology, annealing processors based on CMOS processes have been developed to overcome these problems. Such a processor stores spin states in SRAM cells, implements the interactions among spins with logic circuits, and uses a random number generator to escape local minima. Compared with general-purpose CPUs and quantum annealing processors, these processors offer significant improvements in execution speed, cost and power consumption, and can operate at room temperature.

However, most current CMOS annealing processors support only sparse spin interconnections, such as lattice graphs, King's graphs and hexagonal graphs, which greatly limits the variety of combinatorial optimization problems they can solve. Several CMOS annealing processors that support fully connected spins have been developed to solve a wider range of combinatorial optimization problems, such as the traveling salesman problem and the max-cut problem; however, these processors consume a large amount of resources, implement only a small number of fully connected spins, and can select at most one spin for a state update in each iteration step. They can therefore solve only small-scale combinatorial optimization problems, and they suffer from low scalability, slow convergence and low parallel processing capability. In short, no satisfactory design currently exists for an annealing architecture that supports large-scale fully connected spins and parallel state updates while offering high scalability, fast convergence and high parallelism.
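To make the mapping from a combinatorial optimization problem to the Ising model concrete, the following minimal sketch encodes a tiny max-cut instance. The energy function uses the common textbook form H = -Σ J_ij σ_i σ_j - Σ h_i σ_i, which is an assumption here because this text does not state the Hamiltonian; the graph, the function names and the brute-force search are purely illustrative.

```python
from itertools import product

def ising_energy(J, h, sigma):
    """Textbook Ising energy (assumed form): -sum_{i<j} J_ij s_i s_j - sum_i h_i s_i."""
    n = len(sigma)
    pair = sum(J[i][j] * sigma[i] * sigma[j]
               for i in range(n) for j in range(i + 1, n))
    field = sum(h[i] * sigma[i] for i in range(n))
    return -pair - field

def cut_size(edges, sigma):
    """Number of edges whose endpoints carry opposite spins."""
    return sum(1 for i, j in edges if sigma[i] != sigma[j])

# Hypothetical instance: a 4-node cycle graph.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
n = 4
J = [[0] * n for _ in range(n)]
for i, j in edges:
    J[i][j] = J[j][i] = -1  # antiferromagnetic coupling: cutting an edge lowers the energy
h = [0] * n

# Brute force over all 2^n spin configurations (only feasible for tiny n).
best = min(product((-1, 1), repeat=n), key=lambda s: ising_energy(J, h, s))
print(best, ising_energy(J, h, best), cut_size(edges, best))
```

With this encoding the energy equals the number of edges minus twice the cut size, so the lowest-energy spin configuration is a maximum cut. An annealer replaces the brute-force search, which is only usable for very small instances.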

Disclosure of Invention

The invention aims to provide a large-scale fully connected Ising model annealing processing circuit based on a network-on-chip that solves the above problems of the prior art.

The large-scale fully connected Ising model annealing processing circuit based on a network-on-chip according to the invention comprises a global controller, a control bus, a spin processing array and a merging router array; the global controller controls all spin processing units in parallel; all spin processing units share the annealing temperature and the random numbers, and communicate and compute through the merging router array.

The global controller comprises an I/O module, a control logic module, a temperature scheduling module and a random number generating module; the I/O module is responsible for exchanging information between a user and the processing circuit; the control logic module is responsible for generating corresponding control signals; the temperature scheduling module and the random number generating module are respectively responsible for generating annealing temperature and random numbers; the control bus sends control signals, annealing temperatures, and random numbers to all spin processing units.

The spin processing array includes a plurality of spin processing units; each spin processing unit contains 256 spins and updates one spin per processing pass.

The spin processing unit comprises a control unit, a state updating unit and a production unit;

the control unit comprises control logic and a counter; the control logic receives instructions from the global controller and generates the corresponding control signals for the computing elements in the spin processing unit; the counter records how many of the 256 spins are in state -1;

the state updating unit consists of a partial-sum register, a […] register, a […] register, an absolute value unit, three adders, a comparator, a multiplier and a flipper; the absolute value unit and an adder receive and accumulate the correlation coefficients […] and store the accumulated result in the […] register; the partial-sum register holds the basic partial sums […], and the […] register holds the accumulated result […]; the comparator compares […] and […] to determine whether to update the single spin processed by the spin processing unit; if the update condition is met, […] is selected to update the spin state, otherwise the state remains unchanged;

the production unit is a seven-stage pipeline comprising an access controller, 16 J memories, an h memory, 16 σ memories, 16 comparators, a plurality of adders, a plurality of multiplexers and a plurality of NOT gates; it generates the basic partial sums of […] and the correlation coefficients of […] and […], and sends them to other spin processing units; the access controller is dedicated to controlling the storing and reading of the interaction coefficients $J_{ij}$ and the external magnetic coefficients $h_i$; before the first pipeline stage, 16 coefficients $J_{ij}$ are read from the 16 J memories at a time, and the 16 comparators determine whether the coefficients relate to the spins selected for updating; if so, the coefficients are passed to the first pipeline stage; the second pipeline stage produces the 16 basic partial sums of […] from the 16 coefficients $J_{ij}$ passed from the previous stage and the 16 spins $\sigma_j$ stored in the σ memories, i.e., when $\sigma_j = +1$, $J_{ij}$ is passed directly to the next stage, and when $\sigma_j = -1$, the NOT gates are activated to invert $J_{ij}$ bitwise before the result is passed to the next stage; the third pipeline stage accumulates the basic partial sums through an adder tree with 16 inputs; the fourth pipeline stage adds the external magnetic coefficient $h_i$ to the result of the previous stage; the fifth pipeline stage adds the number of spins in state -1, as recorded by the counter in the control unit, to the result of the previous stage; the sixth pipeline stage sends the accumulated partial sum directly to the next stage if it relates to a spin processed in the current spin processing unit, and otherwise waits for the other 240 partial sums and merges them before passing them to the next stage; the seventh pipeline stage packs the calculated basic partial sums of […] or correlation coefficients of […] into partial-sum or coefficient packets and forwards them to other spin processing units.
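The bit inversion in the second pipeline stage together with the counter addition in the fifth stage can be read as a two's-complement negation split into two steps: -J equals (NOT J) + 1, so inverting every coefficient paired with a -1 spin and later adding the count of -1 spins once yields the same sum as multiplying each coefficient by its spin. The sketch below checks this arithmetic under that reading; the 16-bit word width, the function names and the sample data are assumptions, not taken from the patent.

```python
def to_signed(word, width):
    """Interpret a width-bit word as a two's-complement signed integer."""
    word &= (1 << width) - 1
    return word - (1 << width) if word >> (width - 1) else word

def partial_sum_inverting(coeffs, spins, h, width=16):
    """Compute sum(J * sigma) + h using only inversion, addition and a counter."""
    mask = (1 << width) - 1
    # Counter in the control unit: number of spins in state -1.
    minus_count = sum(1 for s in spins if s == -1)
    # Stage 2: pass J through for sigma = +1, invert it bitwise for sigma = -1.
    words = [(j & mask) if s == 1 else (~j & mask) for j, s in zip(coeffs, spins)]
    # Stage 3: adder tree over the basic partial sums.
    acc = sum(words) & mask
    # Stage 4: add the external magnetic coefficient.
    acc = (acc + (h & mask)) & mask
    # Stage 5: add the counter value, supplying the "+1" of every negation at once.
    acc = (acc + minus_count) & mask
    return to_signed(acc, width)

# Hypothetical sample data: 16 coefficients and 16 spins.
coeffs = [3, -5, 7, 0, -1, 2, -8, 4, 6, -3, 1, -2, 5, -7, 8, -4]
spins = [1, -1] * 8
reference = sum(j * s for j, s in zip(coeffs, spins)) + 3
print(partial_sum_inverting(coeffs, spins, 3), reference)
```

The sketch uses 16 spins for brevity; in the circuit described above the counter covers all 256 spins of a unit, and the same identity holds as long as the accumulator is wide enough to avoid overflow.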

The merging router comprises a merging module and a routing module which are respectively responsible for merging and forwarding information packets.

The large-scale fully connected Ising model annealing processing circuit based on a network-on-chip according to the invention has the following advantages:

(1) Fast convergence and high parallelism: the circuit performs high-speed, highly parallel annealing on fully connected Ising models. Concurrent updating of multiple spins is supported from the algorithm level down to the hardware level. In the algorithm, the dynamic multi-thread parallel-update annealing algorithm dynamically adjusts the number of threads K and the number M of spins updated in parallel by a single thread, which accelerates convergence under limited hardware resources, while a temperature-return strategy preserves accuracy. In hardware, the network-on-chip architecture carries out the parallel updates and thereby realizes the algorithm.

(2) High scalability: each spin processing unit processes 256 spins, so larger combinatorial optimization problems can be handled simply by increasing the number of spin processing units.

(3) Low design complexity and low hardware resource cost, owing to the dedicated design of the structures in the circuit. In the global controller, the reciprocal of the temperature is computed with a multiplier, which avoids a costly divider; in addition, all spin processing units share the temperature and the random numbers, which greatly reduces hardware cost and power consumption. In the spin processing array, distributed storage and a near-memory computing structure simplify the spin processing unit, reduce the communication load and the computational complexity, and improve computing efficiency. The spin processing unit is also fully pipelined and, combined with its particular multiply-accumulate operation, uses one adder with a counter in place of sixteen adders, which greatly reduces hardware cost. The merging router adopts a merging and deflection scheme with a fully pipelined design; it can merge multiple partial-sum packets into one, which reduces the communication load and the computation time of each iteration and simplifies the design.
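One way to obtain the temperature reciprocal with a multiplier instead of a divider is sketched below. It assumes a geometric cooling schedule T ← αT built from the initial temperature, decay factor and temperature threshold named in the description, and it assumes that 1/T_init and 1/α are supplied as configuration constants; the patent text does not spell out either assumption, and the function name is hypothetical.

```python
def reciprocal_schedule(t_init, decay, t_threshold):
    """Yield (T, 1/T) per iteration, updating 1/T by multiplication only."""
    # Assumption: these two reciprocals are computed once at configuration time,
    # so the annealing loop itself never divides.
    inv_decay = 1.0 / decay
    beta = 1.0 / t_init
    t = t_init
    while t >= t_threshold:      # annealing stops once T drops below the threshold
        yield t, beta
        t *= decay               # geometric cooling (assumed schedule)
        beta *= inv_decay        # 1/(decay * T) = (1/T) * (1/decay)

steps = list(reciprocal_schedule(8.0, 0.5, 1.0))
for t, beta in steps:
    print(t, beta)
```

Each spin processing unit can then scale an energy difference by multiplying it with the shared reciprocal rather than dividing by the temperature.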

Drawings

Fig. 1 is a schematic diagram of the overall architecture of the large-scale fully connected Ising model annealing processing circuit according to the present invention.

FIG. 2 is a schematic diagram of the operation state of the global controller according to the present invention.

Fig. 3 is a schematic diagram of the global controller according to the present invention.

Fig. 4 is a schematic diagram of a spin processing unit according to the present invention.

Fig. 5 is a schematic diagram of a merging router according to the present invention.

Detailed Description

As shown in FIG. 1, the large-scale fully connected Ising model annealing processing circuit based on a network-on-chip according to the invention comprises a global controller, a control bus, a spin processing array and a merging router array.

The dynamic multi-thread parallel-update annealing algorithm uses […] to judge whether the update condition is satisfied, dynamically adjusts the threads according to system feedback and […], and performs a temperature-return operation in the final stage. Here N is the total number of spins, V is the total number of spins updated in parallel, K is the number of threads, and M is the number of spins updated in parallel by a single thread; all are integers.

The global controller consists of four modules, including an I/O module, a control logic module, a temperature scheduling module and a random number generating module. The I/O module is responsible for exchanging information between the user and the processing circuitry. The control logic module is responsible for generating corresponding control signals. The temperature scheduling module and the random number generating module are respectively responsible for generating annealing temperature and random numbers. The control bus sends control signals, annealing temperatures, and random numbers to all spin processing units.

As shown in fig. 2, the global controller has five working states: an idle state, an S1 static parameter configuration state, an S2 dynamic parameter configuration state, an S3 iteration state and an S4 result return state. In the idle state all components are on standby; on receiving a start signal from the user, the controller switches to the static parameter configuration state. In the static parameter configuration state, the initial annealing temperature, the temperature decay factor, the temperature threshold, the successive iteration threshold, the initial spin states and the random seed are written into registers, and the controller then switches to the dynamic parameter configuration state. In the dynamic parameter configuration state, the number of threads and the basic annealing temperature for the next iteration are determined from the feedback signals of the spin processing array and sent to all spin processing units, and the controller switches to the iteration state. In the iteration state, all spin processing units are activated to compute the spin states; if the temperature is below the temperature threshold, the computation is complete and the controller switches to the result return state; otherwise it switches back to the dynamic parameter configuration state. In the result return state, the final spin states are returned to the user as the solution to the original problem.
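The five working states can be sketched as a small state machine. The transition conditions follow the paragraph above; the return to the idle state after S4, the point at which the temperature is lowered, and all identifiers are assumptions made for illustration.

```python
from enum import Enum

class State(Enum):
    IDLE = 0
    S1_STATIC_CONFIG = 1
    S2_DYNAMIC_CONFIG = 2
    S3_ITERATE = 3
    S4_RETURN = 4

def next_state(state, start=False, temperature=None, threshold=None):
    """Transition function of the global controller as described in the text."""
    if state is State.IDLE:
        return State.S1_STATIC_CONFIG if start else State.IDLE
    if state is State.S1_STATIC_CONFIG:
        return State.S2_DYNAMIC_CONFIG
    if state is State.S2_DYNAMIC_CONFIG:
        return State.S3_ITERATE
    if state is State.S3_ITERATE:
        done = temperature < threshold
        return State.S4_RETURN if done else State.S2_DYNAMIC_CONFIG
    return State.IDLE  # assumption: the controller idles again after returning the result

def run(t_init, decay, threshold):
    """Trace the visited states for a geometric cooling schedule (assumed)."""
    state, t = State.IDLE, t_init
    trace = [state]
    state = next_state(state, start=True)
    trace.append(state)
    while state is not State.S4_RETURN:
        if state is State.S3_ITERATE:
            t *= decay  # assumption: the temperature is lowered once per iteration
        state = next_state(state, temperature=t, threshold=threshold)
        trace.append(state)
    return trace

print([s.name for s in run(4.0, 0.5, 1.5)])
```

The loop between S2 and S3 is where the thread count and temperature would be refreshed before every iteration.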

As shown in fig. 3, in the global controller a monitor identifies the data packets from the user; the relevant static parameters are stored in the configuration register file, and the other parameters are passed to the router. After each iteration, the flip signals from the different threads are summed by separate counters, the accumulated results are saved in a log register file, and a comparator array finds the maximum number of flipped spins and selects the corresponding thread. If the accumulated result is 0 for several consecutive iterations, the number of threads is increased by a left-shift operation. A linear feedback shift register generates the random numbers that determine whether a spin is flipped. The reciprocal of the temperature is computed in advance in the global controller and sent to all spin processing units, so that a multiplier replaces a divider. When the annealing temperature is too low and the system is trapped in a local minimum, the global controller raises the annealing temperature.
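Three of the mechanisms in this paragraph lend themselves to a short sketch: random numbers from a linear feedback shift register, selection of the thread with the most flipped spins, and doubling of the thread count by a left shift. The 16-bit register width and the tap positions (the widely used maximal-length polynomial x^16 + x^14 + x^13 + x^11 + 1) are assumptions, as are the function names, the streak threshold parameter and the upper bound `k_max`; the patent text does not specify them.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR by one step (taps 16, 14, 13, 11; assumed)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

def select_thread(flip_counts):
    """Return (index, count) of the thread that flipped the most spins."""
    best = max(range(len(flip_counts)), key=lambda k: flip_counts[k])
    return best, flip_counts[best]

def adjust_threads(k, zero_streak, streak_threshold, k_max):
    """Double the thread count by a left shift after a run of iterations without flips."""
    if zero_streak >= streak_threshold and k < k_max:
        return k << 1
    return k

seed = 0xACE1  # any non-zero seed works for this register
state = lfsr16_step(seed)
print(hex(state), select_thread([2, 5, 3]), adjust_threads(2, 3, 3, 16))
```

A left shift keeps the thread count a power of two, which a hardware implementation can realize without an adder or multiplier.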

The spin processing array updates the states of the spins in parallel. It includes a plurality of spin processing units; each spin processing unit contains 256 spins and updates one spin per processing pass. As shown in fig. 4, the spin processing unit includes a control unit, a state updating unit and a production unit. The control unit includes control logic and a counter. The control logic receives instructions from the global controller and generates the corresponding control signals for the computing elements in the spin processing unit. The counter records how many of the 256 spins are in state -1. The state updating unit consists of a partial-sum register, a […] register, a […] register, an absolute value unit, three adders, a comparator, a multiplier and a flipper. The absolute value unit and an adder receive and accumulate the correlation coefficients […] and store the accumulated result in the […] register. The partial-sum register holds the basic partial sums […], and the […] register holds the accumulated result […]. The comparator compares […] and […] to determine whether to update the single spin processed by the spin processing unit; if the update condition is met, […] is selected to update the spin state, otherwise the state remains unchanged.

The production unit is a seven-stage pipeline comprising an access controller, 16 J memories, an h memory, 16 σ memories, 16 comparators, a plurality of adders, a plurality of multiplexers and a plurality of NOT gates. It generates the basic partial sums of […] and the correlation coefficients of […] and […], and sends them to other spin processing units. The access controller is dedicated to controlling the storing and reading of the interaction coefficients $J_{ij}$ and the external magnetic coefficients $h_i$. Before the first pipeline stage, 16 coefficients $J_{ij}$ are read from the 16 J memories at a time, and the 16 comparators determine whether the coefficients relate to the spins selected for updating; if so, the coefficients are passed to the first pipeline stage. The second pipeline stage produces the 16 basic partial sums of […] from the 16 coefficients $J_{ij}$ passed from the previous stage and the 16 spins $\sigma_j$ stored in the σ memories: when $\sigma_j = +1$, $J_{ij}$ is passed directly to the next stage, and when $\sigma_j = -1$, the NOT gates are activated to invert $J_{ij}$ bitwise before the result is passed to the next stage. The third pipeline stage accumulates the basic partial sums through an adder tree with 16 inputs. The fourth pipeline stage adds the external magnetic coefficient $h_i$ to the result of the previous stage. The fifth pipeline stage adds the number of spins in state -1, as recorded by the counter in the control unit, to the result of the previous stage. The sixth pipeline stage sends the accumulated partial sum directly to the next stage if it relates to a spin processed in the current spin processing unit; otherwise it waits for the other 240 partial sums and merges them before passing them to the next stage. The seventh pipeline stage packs the calculated basic partial sums of […] or correlation coefficients of […] into partial-sum or coefficient packets and forwards them to other spin processing units.

As shown in fig. 5, the merging router array provides the communication links that support packet exchange between different spin processing units, and it can merge multiple partial-sum packets into one. The merging router mainly comprises a merging stage and a routing stage. It consists of four input ports (east, south, west and north), four corresponding output ports, six comparators, six selectors, three adders, four register groups, an ejector, a crossbar switch and an arbiter. In the merging stage, the six comparators compare the types and destinations of the data packets received at the input ports; packets that satisfy the merging condition are merged into one data packet by an adder. In the routing stage, the ejector, the crossbar switch and the arbiter send each data packet either to the current spin processing unit or, via an output port, to another router.
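The merging stage can be illustrated with a software analogue. The sketch assumes that the merging condition is "both packets are partial-sum packets with the same destination" and that merging adds their payloads; the packet fields, the type labels and the function name are hypothetical, and deflection routing is not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Packet:
    kind: str      # "partial_sum" or "coefficient" (labels assumed)
    dest: int      # index of the destination spin processing unit
    payload: int

def merge_stage(inputs: List[Optional[Packet]]) -> List[Packet]:
    """Merge partial-sum packets that share a destination by adding their payloads."""
    out: List[Packet] = []
    slot = {}  # destination -> position of its partial-sum packet in `out`
    for p in inputs:
        if p is None:          # idle input port
            continue
        if p.kind == "partial_sum" and p.dest in slot:
            i = slot[p.dest]
            out[i] = Packet(p.kind, p.dest, out[i].payload + p.payload)
        else:
            if p.kind == "partial_sum":
                slot[p.dest] = len(out)
            out.append(p)
    return out

# Hypothetical traffic on the east, south, west and north input ports.
ports = [Packet("partial_sum", 3, 5), Packet("partial_sum", 3, -2),
         Packet("coefficient", 3, 7), None]
print(merge_stage(ports))
```

Because addition is associative, partial sums bound for the same unit can be combined at any router along the way, so fewer packets reach the destination.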

It will be apparent to those skilled in the art from this disclosure that various other changes and modifications can be made which are within the scope of the invention as defined in the appended claims.

Claims

1. A large-scale fully connected Ising model annealing processing circuit based on a network-on-chip, comprising: a global controller, a control bus, a spin processing array and a merging router array; the global controller controls all spin processing units in parallel; all spin processing units share the annealing temperature and the random numbers, and communicate and compute through the merging router array;

the global controller comprises an I/O module, a control logic module, a temperature scheduling module and a random number generating module; the I/O module is responsible for exchanging information between a user and the processing circuit; the control logic module is responsible for generating corresponding control signals; the temperature scheduling module and the random number generating module are respectively responsible for generating annealing temperature and random numbers; the control bus sends control signals, annealing temperature and random numbers to all spin processing units;

the spin processing array comprises a plurality of spin processing units, each spin processing unit contains 256 spins, and one spin is updated per processing pass;

the state updating unit consists of a partial-sum register, a […] register, a […] register, an absolute value unit, three adders, a comparator, a multiplier and a flipper; the absolute value unit and an adder are used for receiving and accumulating the correlation coefficients […] and storing the accumulated result in the […] register; the partial-sum register is used for registering the basic partial sums […]; the […] register is used for registering the accumulated […]; the comparator is used for comparing […] and […] to determine whether to update the single spin processed by the spin processing unit, and if the update condition is met, […] is selected to update the spin state, otherwise the state remains unchanged;

wherein […] is used to judge whether the update condition is satisfied, the threads are dynamically adjusted according to system feedback and […], and a temperature-return operation is performed in the final stage; N is the total number of spins, V is the total number of spins updated in parallel, K is the number of threads, and M is the number of spins updated in parallel by a single thread, all being integers; $h_i$ is the external magnetic coefficient; $\sigma_j$ is the value of the spin with index j, and j takes values from 1 to 16;

the production unit is a seven-stage pipeline comprising an access controller, 16 J memories, an h memory, 16 σ memories, 16 comparators, a plurality of adders, a plurality of multiplexers and a plurality of NOT gates, and is used for generating the basic partial sums of […] and the correlation coefficients […] of […] and […], which are sent to other spin processing units; the access controller is dedicated to controlling the storing and reading of the correlation coefficients $J_{ij}$ and the external magnetic coefficients $h_i$; before the first pipeline stage, 16 correlation coefficients $J_{ij}$ are read from the 16 J memories at a time, and the 16 comparators determine whether the coefficients relate to the spins selected for updating, and if so, the coefficients are passed to the first pipeline stage; the second pipeline stage produces the 16 basic partial sums of […] from the 16 correlation coefficients $J_{ij}$ passed from the previous stage and the 16 spins $\sigma_j$, i.e., when $\sigma_j = +1$, $J_{ij}$ is passed directly to the next stage, and when $\sigma_j = -1$, the NOT gates are activated to invert $J_{ij}$ bitwise before the result is passed to the next stage; the third pipeline stage accumulates the basic partial sums through an adder tree with 16 inputs; the fourth pipeline stage adds the external magnetic coefficient $h_i$ to the result of the previous stage; the fifth pipeline stage adds the number of spins in state -1, as recorded by the counter in the control unit, to the result of the previous stage; the sixth pipeline stage sends the accumulated partial sum directly to the next stage if it relates to a spin processed in the current spin processing unit, and otherwise waits for the other 240 partial sums and merges them before passing them to the next stage; the seventh pipeline stage packs the calculated basic partial sums of […] or correlation coefficients of […] into partial-sum or coefficient packets and forwards them to other spin processing units.

2. The large-scale fully connected Ising model annealing processing circuit based on a network-on-chip according to claim 1, wherein the merging router comprises a merging module and a routing module, which are responsible for merging and forwarding packets, respectively.