CN114996205A - On-chip data scheduling controller and method for auxiliary 3D architecture near memory computing system - Google Patents

Info

Publication number
CN114996205A
CN114996205A
Authority
CN
China
Prior art keywords
instruction
data
information
unit
state
Prior art date
Legal status
Granted
Application number
CN202210856427.6A
Other languages
Chinese (zh)
Other versions
CN114996205B (en)
Inventor
曹玥 (Cao Yue)
杨建国 (Yang Jianguo)
张文君 (Zhang Wenjun)
Current Assignee
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202210856427.6A priority Critical patent/CN114996205B/en
Publication of CN114996205A publication Critical patent/CN114996205A/en
Application granted granted Critical
Publication of CN114996205B publication Critical patent/CN114996205B/en
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00: Digital computers in general; Data processing equipment in general
    • G06F 15/76: Architectures of general purpose stored program computers
    • G06F 15/78: Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7807: System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F 15/7825: Globally asynchronous, locally synchronous, e.g. network on chip
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44: Arrangements for executing specific programs
    • G06F 9/445: Program loading or initiating
    • G06F 9/44505: Configuring for program initiating, e.g. using registry, configuration files
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Advance Control (AREA)

Abstract

The invention discloses an on-chip data scheduling controller and method for assisting a 3D-architecture near-memory computing system. The scheduling controller is attached to the system bus as a memory-mapped IO device, so that the processor can write preset instructions to the corresponding memory-mapped addresses to realize scheduling control. The scheduling controller is connected to the host's external interrupt receiving module, allowing it to send an execution-completion interrupt signal to the host and to receive accelerator interrupt signals for judging accelerator state; it can also take over the host's memory-access path and access memory directly. It can receive pre-written data scheduling instructions from the host and take over the host access port, so that all on-chip memory addresses become accessible. According to the preset instructions, the scheduling controller executes data scheduling in sequence, sends a completion signal to the host at the preset node, and returns control of the access port so that the host can read the final data.

Description

On-chip data scheduling controller and method for auxiliary 3D architecture near memory computing system
Technical Field
The invention relates to the technical field of data transmission, and in particular to an on-chip data scheduling controller and an on-chip data scheduling method for assisting a 3D-architecture near-memory computing system.
Background
A 3D-architecture near-memory computing system stacks an accelerator die fabricated in a conventional process on a DRAM die and connects the upper and lower signal ports using Through-Silicon Via (TSV) or Hybrid Bonding (HB) technology to complete data interaction. Compared with the traditional processor/memory structure, such a system greatly shortens the distance between the computing units and the memory units and reduces memory-access latency; at the same time, the TSV/HB connections allow data to be fetched directly from a memory bank without passing through the DRAM system bus, greatly increasing memory-access bandwidth. The system can effectively alleviate the memory-wall problem and thereby improve processor-system performance, so it has great development potential.
However, because the memory connection of the 3D architecture bypasses the DRAM system bus, the access range of a single accelerator is limited: it can only access the bank directly connected beneath its own die. If the data processed in a single pass of an application exceeds the storage capacity of that single bank, data-transfer scheduling between accelerators must still be performed through the host. Since the host and the DRAM are still connected through the conventional structure, this path remains a system performance bottleneck if no optimization is performed.
Disclosure of Invention
In order to overcome the defects of the prior art, further reduce the amount of data transferred through the traditional host-DRAM memory port, convert the interaction between on-chip accelerators into on-chip communication, raise the upper limit on the data size of purely on-chip near-memory computation, greatly reduce the number of host memory accesses, and effectively improve the data-handling efficiency and energy-efficiency ratio of the system, the invention adopts the following technical scheme:
an on-chip data scheduling controller of an assisted 3D architecture near memory computing system, comprising: presetting an instruction storage module, a data handling module and a state controller;
the preset instruction storage module is used for storing a preset data scheduling instruction sent by the host and respectively sending the carrying information and the state information to the data carrying module and the state controller;
the data carrying module carries data from one accelerator to another accelerator through carrying information of the preset instruction storage module according to a data carrying starting instruction of the state controller, and sends an instruction completion signal to the state controller;
the state controller enters a chip data scheduling state according to a chip carrying takeover request of the host, judges an executable data carrying instruction according to state information of a preset instruction storage module and an interrupt signal of an accelerator, sends the executable data carrying starting instruction to the data carrying module, acquires an instruction completion signal, judges whether to send an execution completion interrupt signal to the host or not according to the state information after carrying is completed, and exits the chip data scheduling state.
Further, the preset instruction storage module comprises: an instruction decoder, a preset instruction queue, and an instruction information register;
the instruction decoder receives a preset data scheduling instruction and judges whether the preset instruction queue is full; if full, it feeds back write-failure information to the host; if not full, it decodes the instruction and judges whether the decoded instruction is correct; if wrong, it feeds back write-failure information to the host; if correct, it sends the handling information and the state information in the instruction to the preset instruction queue and the state controller, respectively;
the preset instruction queue writes in the handling information and updates the queue according to the update information acquired from the state controller;
the instruction information register reads the corresponding handling information from the preset instruction queue according to the read request of the state controller, for the data handling module to read.
Further, judging whether the decoded preset data scheduling instruction is correct comprises calculating the source and target memory address ranges of the preset accelerators with an adder, and comparing the source memory start address and the target memory start address in the handling information against those ranges for consistency; the instruction is correct if both addresses fall within their ranges, and erroneous otherwise.
Further, before the consistency comparison, a validity comparison is performed: the instruction is valid when the source memory start address and the target memory start address in the handling information are not both empty, and invalid otherwise.
Further, updating the queue according to the update information acquired from the state controller means that the preset instruction queue writes the handling information to the tail of the queue and updates the tail-unit information, and updates the head-unit information according to the head update request transmitted by the state controller.
Further, the data handling module comprises a data handling controller and a temporary data cache. The data handling controller receives the data-handling start instruction from the state controller, sequentially generates memory-access instructions according to the handling information provided by the preset instruction storage module, reads data from the corresponding addresses of the source accelerator into the temporary data cache, then reads the data from the temporary data cache and writes it to the corresponding addresses of the target accelerator, repeats this data-handling operation until all data has been carried, and sends an instruction-completion signal to the state controller.
Further, the state controller comprises a state information storage queue and a judging module;
the state information storage queue is used for storing the state information, updating the queue information according to the existing queue information, the accelerator interrupt signals, and the completion signal of the data handling module, and clearing completed queue units according to the head update instruction of the judging module;
the judging module judges whether an executable data-handling instruction exists according to the on-chip handling takeover request sent by the host and the state information in the state information storage queue. If not, it waits for the next cycle and judges again; if so, it determines the next instruction to be executed according to the state information, initiates a data read request to the preset instruction storage module, and sends a state-confirmation instruction to the target accelerator. If the target accelerator feeds back that it can receive data writes, the judging module transmits the data-handling start instruction to the data handling module; if the target accelerator feeds back that it cannot be written, the judging module re-enters the selection of the instruction to be executed and confirms the accelerator state in the next round. After receiving the handling-completion signal of the data handling module, it judges, according to the signal and its own state information, whether to update the head unit of the queue, whether to send an execution-completion interrupt signal to the host, and whether to exit the on-chip data scheduling state.
Further, the state information storage queue contains state information and additional information. The state information comprises: source accelerator id, target accelerator id, and whether to exit the on-chip scheduling state after completion. The additional information comprises: whether the source data is valid, whether the unit is completed, whether read/write dependencies exist, and the ids of the related dependency units.
The state information storage queue is updated according to the following rules:
The additional information is first generated when a unit is written to the tail of the queue:
the source-data-valid flag is set to 0;
if a unit ahead in the queue carries the exit-on-chip-scheduling-after-completion flag, or its source accelerator id is consistent with this unit's target accelerator id, the read dependency is set to 1 (a read dependency exists), and the read-dependency unit id is set to the id of the nearest queue unit satisfying the condition; if no such unit exists, the read dependency and the read-dependency unit id are both set to 0;
if a unit ahead in the queue has a target accelerator id consistent with this unit's source accelerator id, the write dependency is set to 1 (a write dependency exists), and the write-dependency unit id is set to the id of the nearest queue unit satisfying the condition; if no such unit exists, the write dependency and the write-dependency unit id are both set to 0;
when an accelerator execution-completion interrupt signal occurs, the additional information is updated as follows:
check whether any existing unit of the queue has a source accelerator id consistent with the completed accelerator's id and no read dependency; if so, set the source-data-valid flag of all units satisfying the condition to 1 (the source data is valid); if not, perform no update;
when a completion signal of the data handling module occurs, the additional information is updated as follows:
set the completion flag of the unit whose id corresponds to the completed handling information to 1;
if that unit's exit-takeover-after-completion flag is 0 and units in the queue hold read/write dependencies on it, update the corresponding read/write dependencies of all units satisfying the condition to 0; if that unit's exit-takeover-after-completion flag is 1 and it is not the head unit, the read dependencies on it are not cleared;
if the unit whose id corresponds to the completed handling information is the head unit and, when completed units are cleared, a unit whose exit-takeover-after-completion flag is 1 is among them, the read dependencies corresponding to that unit are cleared;
meanwhile, the state information storage queue clears completed queue units according to the head update instruction of the judging module, i.e. the head is updated to the requested unit id.
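Under one reading of the enqueue-time rules above (a read dependency points at the nearest earlier unit that carries the exit-after flag or still reads the accelerator this unit targets; a write dependency points at the nearest earlier unit that writes the accelerator this unit reads from), the generation of the additional information can be sketched as follows. All field and function names are illustrative, not taken from the patent.

```python
# Sketch of the extra-information generation when a unit is written to the
# tail of the state information storage queue. The dependency conditions
# follow one reading of the translated update rules; names are hypothetical.

def enqueue_unit(queue, uid, src_id, dst_id, exit_after):
    unit = {"id": uid, "src": src_id, "dst": dst_id, "exit_after": exit_after,
            "src_valid": 0, "done": 0,            # source data valid / completed
            "read_dep": 0, "read_dep_id": 0,      # read dependency + its unit id
            "write_dep": 0, "write_dep_id": 0}    # write dependency + its unit id
    for prev in reversed(queue):  # nearest matching earlier unit wins
        # Read dependency: an earlier unit has the exit-after flag, or still
        # reads (as source) the accelerator this unit writes to.
        if not unit["read_dep"] and (prev["exit_after"] or prev["src"] == dst_id):
            unit["read_dep"], unit["read_dep_id"] = 1, prev["id"]
        # Write dependency: an earlier unit writes (as target) the
        # accelerator this unit reads from.
        if not unit["write_dep"] and prev["dst"] == src_id:
            unit["write_dep"], unit["write_dep_id"] = 1, prev["id"]
        if unit["read_dep"] and unit["write_dep"]:
            break
    queue.append(unit)
    return unit
```

For example, after enqueueing (src 0, dst 1) then (src 1, dst 2), the second unit gains a write dependency on the first, since the first writes accelerator 1 which the second reads.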
Further, after judging that an executable data-handling instruction exists, the judging module inputs the state information into the arbitration unit, determines the id of the next unit to be executed, sends a data read request to the preset instruction storage module, and reads the corresponding information of that unit. Judging whether to update the head unit of the queue means judging whether the unit corresponding to the completion signal is the head unit; if not, the head is not updated. If so, the head is updated to the first uncompleted unit behind it, and the judging module confirms whether the cleared units include a unit whose exit-takeover-after-completion flag is 1; if not, it proceeds to the next round of instruction selection; if so, it sends an execution-completion interrupt signal to the host, resets the handling-takeover register to 0, and exits the on-chip data scheduling state.
An on-chip data scheduling control method for assisting a 3D-architecture near-memory computing system comprises the following steps:
Step S1: before the source accelerators are started, the data scheduling controller obtains the preset data scheduling instructions from the host, ensuring that it can correctly detect the execution-completion information of the source accelerators. Besides the information needed for data scheduling, each instruction contains the flag indicating whether to exit on-chip scheduling after completion. The controller judges whether the preset instruction queue is full; if full, it feeds back write-failure information to the host; if not full, it decodes the preset data scheduling instruction and judges whether the decoded instruction is correct; if wrong, it feeds back write-failure information; if correct, it stores the handling information in the instruction and judges, according to the state information in the instruction, whether the instruction has dependency relationships with existing instructions;
Step S2: acquire the on-chip handling takeover request sent by the host, enter the on-chip data scheduling state, and judge which data handling is executable according to the state information in the preset data scheduling instructions and the accelerator interrupt signals;
Step S3: carry data from one accelerator to another according to the handling information in the preset data scheduling instruction, and generate an instruction-completion signal;
Step S4: after handling is completed, clear the dependency relationships related to the completed instruction and update the queue; judge, according to the flag indicating whether to exit the on-chip scheduling state after completion, whether to send an execution-completion interrupt signal to the host and exit the on-chip data scheduling state.
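The four steps can be condensed into a minimal scheduling loop. The sketch below is a software simulation only, assuming instructions are plain records and that `accel_ready` holds the ids of accelerators whose execution-completion interrupt has been received; all names are illustrative.

```python
# Minimal simulation of steps S1-S4: repeatedly pick an executable
# instruction whose source accelerator has finished, "carry" its data, and
# exit the on-chip scheduling state when an instruction flagged
# exit-after-completion finishes. Names are hypothetical.

def run_schedule(instructions, accel_ready):
    """instructions: dicts with 'src', 'dst', 'exit_after';
    accel_ready: set of accelerator ids whose interrupt has been seen."""
    completed = []
    pending = list(instructions)
    while pending:
        # S2: judge executable handling: the source accelerator must be done
        ready = [i for i in pending if i["src"] in accel_ready]
        if not ready:
            break  # nothing executable yet; wait for accelerator interrupts
        ins = ready[0]
        # S3: carry data src -> dst (the data movement itself is omitted);
        # afterwards the target accelerator's data is considered ready
        accel_ready.add(ins["dst"])
        pending.remove(ins)
        completed.append(ins)
        # S4: clear the finished instruction; if flagged, interrupt the
        # host and exit the on-chip data scheduling state
        if ins["exit_after"]:
            break
    return completed
```

With a two-instruction chain 0 to 1, then 1 to 2, only the first is initially executable; completing it makes accelerator 1's data ready and unlocks the second, matching the dependency behavior described above.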
The invention has the advantages and beneficial effects that:
the scheduling controller and the scheduling control method can receive the pre-written data scheduling instruction from the host and can manage the access port of the host so as to access all memory addresses on a chip. According to the preset instruction, the controller sequentially executes data transferring and sends a completion signal to the host at the preset node, returns the control right of the access port and allows the host to read the final data. The invention can convert the interaction among all accelerators on the chip into on-chip communication, and the upper limit of the data size of pure on-chip near memory calculation is promoted to a single memory chip from a single memory bank, thereby greatly reducing the access frequency of a host and effectively improving the data handling efficiency and the energy efficiency ratio of a system. Meanwhile, the invention can multiplex the access path of the original host, and the additionally generated hardware expense is also smaller.
Drawings
FIG. 1 is a schematic diagram of a 3D architecture near memory computing accelerator system including an on-chip data scheduling controller according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of an on-chip data scheduling controller according to an embodiment of the present invention.
FIG. 3 is a diagram of the structure and stored information of a state information store queue according to an embodiment of the present invention.
Fig. 4 is a flowchart of a method for controlling on-chip data scheduling according to an embodiment of the present invention.
Fig. 5a is a flowchart illustrating a process from obtaining a preset command to obtaining a transportation takeover request in the control method according to the embodiment of the invention.
Fig. 5b is a flowchart illustrating the process of acquiring the accelerator interrupt signal and sending the data transfer request according to the control method of the embodiment of the invention.
Fig. 5c is a flowchart illustrating data transfer to completion of scheduling in the control method according to the embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.
An on-chip data scheduling controller for assisting a 3D-architecture near-memory computing system can be used to assist a 3D-architecture near-memory computing accelerator system. As shown in fig. 1, the scheduling controller is attached to the system bus as a memory-mapped IO device, so that the processor can write preset instructions to the corresponding memory-mapped addresses to realize scheduling control. The scheduling controller is connected to the host's external interrupt receiving module, allowing it to send an execution-completion interrupt signal to the host and to receive accelerator interrupt signals for judging accelerator state; it can also take over the host's memory-access path and access memory directly. As shown in fig. 2, the scheduling controller comprises: a preset instruction storage module, a data handling module, and a state controller;
the preset instruction storage module is used for storing the preset data scheduling instructions sent by the host and sending the handling information and the state information to the data handling module and the state controller, respectively;
the data handling module, according to a data-handling start instruction from the state controller, carries data from one accelerator to another using the handling information in the preset instruction storage module, and sends an instruction-completion signal to the state controller;
the state controller enters the on-chip data scheduling state according to an on-chip handling takeover request from the host, judges which data-handling instructions are executable according to the state information in the preset instruction storage module and the accelerator interrupt signals, sends the start instruction of an executable data-handling instruction to the data handling module, acquires the instruction-completion signal, and, after handling is completed, judges from the state information whether to send an execution-completion interrupt signal to the host and exit the on-chip data scheduling state.
The preset instruction storage module receives the preset data scheduling instruction from the host, decodes it and stores it in the preset instruction queue, transmits the required state information to the state controller, updates the queue according to the information sent by the state controller, and provides the required information to the data handling module.
In the embodiment of the invention, the accelerator system addresses are 32 bits wide, there are 16 accelerators, each accelerator corresponds to a 32MB address range, and the host memory-access port has a bit width of 64 bits.
The preset instruction storage module comprises: the system comprises an instruction decoder, a preset instruction queue and an instruction information register;
the instruction decoder receives a preset data scheduling instruction and judges whether the preset instruction queue is full; if full, it feeds back write-failure information to the host; if not full, it decodes the instruction and judges whether the decoded instruction is correct; if wrong, it feeds back write-failure information to the host; if correct, it sends the handling information and the state information in the instruction to the preset instruction queue and the state controller, respectively;
the preset instruction queue writes in the handling information and updates the queue according to the update information acquired from the state controller;
the instruction information register reads the corresponding handling information from the preset instruction queue according to the read request of the state controller, for the data handling module to read.
The preset scheduling instruction comprises handling information and state information. The handling information comprises the source memory start address, the target memory start address, and the data size; the state information comprises the source accelerator id, the target accelerator id, and the flag indicating whether to exit the on-chip scheduling state after completion;
in the embodiment of the present invention, the instruction decoder includes a memory-mapped register for receiving the preset scheduling instruction from the host, and the upper limit of a single data-handling operation is set to 16MB. One possible ordering of the instruction fields is shown in Table 1:
TABLE 1. Field layout of the preset scheduling instruction

Field                                              Bits
Source accelerator id                              95-92
Source memory start address                        91-60
Target accelerator id                              59-56
Target memory start address                        55-24
Data size                                          23-1
Exit on-chip scheduling state after completion     0
The instruction receiving memory-mapped register has a bit width of 96 bits.
When the preset instruction queue is full, the judging module feeds back write failure to the host; when the preset instruction queue is not full, the preset scheduling instruction can be successfully written into the preset instruction queue.
The instruction decoding module decodes the whole preset scheduling instruction into the format of table 1.
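The 96-bit layout of Table 1 can be illustrated with plain shift-and-mask field packing. The function names below are illustrative sketches, not part of the patent; only the bit positions come from Table 1.

```python
# Illustrative encoder/decoder for the 96-bit preset scheduling instruction,
# laid out as in Table 1. Field and function names are hypothetical.

FIELDS = [                    # (name, low bit, width in bits)
    ("exit_after", 0, 1),     # exit on-chip scheduling state after completion
    ("size",       1, 23),    # data size
    ("dst_addr",  24, 32),    # target memory start address
    ("dst_id",    56, 4),     # target accelerator id
    ("src_addr",  60, 32),    # source memory start address
    ("src_id",    92, 4),     # source accelerator id
]

def encode_instruction(**vals):
    """Pack the named fields into one 96-bit integer."""
    word = 0
    for name, low, width in FIELDS:
        v = vals[name]
        assert 0 <= v < (1 << width), f"{name} out of range"
        word |= v << low
    return word

def decode_instruction(word):
    """Unpack a 96-bit instruction word back into its named fields."""
    return {name: (word >> low) & ((1 << width) - 1)
            for name, low, width in FIELDS}
```

A round trip through these two functions recovers every field, which is a convenient way to sanity-check the bit positions in Table 1.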
Judging whether the decoded preset data scheduling instruction is correct begins with a validity comparison: the instruction is valid if the source memory start address and the target memory start address in the handling information are not both empty, and invalid otherwise;
then the source and target memory address ranges of the preset accelerators are calculated with an adder, and the source memory start address and the target memory start address in the handling information are compared against those ranges for consistency; the instruction is correct if both addresses fall within their ranges, and erroneous otherwise.
In the embodiment of the invention, validity is judged by checking whether the source memory start address and the target memory start address are both 0; the data is considered valid when the two addresses are not both 0.
For the consistency judgment, the preset accelerator memory base address is 0x80000000 and the accelerator address ranges are arranged in sequence, i.e. 0x80000000-0x803fffff, 0x80400000-0x807fffff, and so on. The decoder includes an adder to calculate the source and target memory address ranges and judges whether the addresses in the preset scheduling instruction fall within the ranges of the preset accelerators. If not, an instruction-error signal is sent to the host; if consistent, the source memory start address, the target memory start address, and the data size are transmitted to the preset instruction queue, while the source accelerator id, the target accelerator id, and the exit-after-completion flag are transmitted to the state controller, and the valid signals of both transmission interfaces are pulled up. After completion, the module resets the instruction-receiving memory-mapped register to 0.
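The validity and consistency checks can be sketched as below. The base address and the 0x400000 range stride follow the ranges quoted in the text (0x80000000-0x803fffff, 0x80400000-0x807fffff, ...); the function and parameter names are illustrative assumptions.

```python
# Sketch of the decoder's validity and consistency checks. Base address and
# per-accelerator stride follow the ranges listed in the embodiment; all
# names are hypothetical.

BASE = 0x80000000    # memory base address of accelerator 0
STRIDE = 0x400000    # size of each accelerator's address range, per the text

def check_instruction(src_id, src_addr, dst_id, dst_addr,
                      base=BASE, stride=STRIDE):
    # Validity comparison: the two start addresses must not both be 0.
    if src_addr == 0 and dst_addr == 0:
        return False

    def in_range(acc_id, addr):
        # The adder computes the accelerator's range from its id.
        lo = base + acc_id * stride
        return lo <= addr < lo + stride

    # Consistency comparison: each address must lie inside the range of
    # its own accelerator.
    return in_range(src_id, src_addr) and in_range(dst_id, dst_addr)
```

For example, a source address of 0x80400000 paired with source accelerator id 0 fails the check, since that address belongs to accelerator 1's range.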
Updating the queue according to the update information acquired from the state controller means that the preset instruction queue writes the handling information to the tail of the queue and updates the tail-unit information, and updates the head-unit information according to the head update request transmitted by the state controller.
In the embodiment of the invention, the preset instruction queue receives the information when the valid signal of the decoder's transmission interface is pulled up, writes the information to the tail of the queue, and updates the tail-unit information; meanwhile, it updates the head-unit information according to the head update request transmitted by the state controller, updates its empty/full status according to the changes in each cycle, and transmits that status to the instruction decoder. In this example, the queue length is assumed to be 16.
When the state controller sends a valid read request, the instruction information register reads the corresponding source memory start address, target memory start address, and data size from the preset instruction queue according to the unit id provided by the state controller, and transmits this information to the data handling module.
The data carrying module comprises a data carrying controller and a temporary data cache, wherein the data carrying controller receives a data carrying starting instruction of the state controller, sequentially generates an access instruction according to carrying information provided by the preset instruction storage module, reads a data corresponding address from the source accelerator, stores the data into the temporary data cache, reads data from the temporary data cache, writes the data into a target accelerator corresponding address, and circularly carries out data carrying operation until all data are carried completely, and sends an instruction finishing signal to the state controller.
In the embodiment of the invention, the data handling module obtains the data-handling start instruction from the state controller, obtains the source memory start address, the target memory start address, and the data size from the instruction information register, moves the data from the source memory start address in the source accelerator to the target memory start address in the target accelerator, and sends an instruction-completion signal to the state controller on completion;
the data handling module comprises a data handling controller and a temporary data cache unit; the data handling controller receives the data-handling start instruction from the state controller, takes the source memory start address, target memory start address, and data size provided by the instruction information register in the preset instruction storage module as the handling information, sequentially generates memory-access instructions, reads data from the source memory start address into the temporary data cache unit, then reads data from the temporary data cache unit and writes it to the target memory start address, loops through this data-handling operation until all data has been moved, and sends an instruction-completion signal to the state controller;
the size of the temporary data cache unit may be set as an upper limit of single data transmission allowed in the memory access protocol, for example: the memory support protocol is DDR4 (Burst upper limit of 8), in which case the temporary data cache unit size may be set to 64B.
The state controller enters the on-chip data scheduling state according to the host's on-chip handling takeover request, judges which data-handling instructions are executable according to the state information in the preset instruction storage module and the accelerators' interrupt signals, sends data-handling requests to the data handling module, reads the necessary information from the preset instruction storage module, judges after each transfer completes whether to send an execution-completion interrupt signal to the host, and exits the on-chip data scheduling state. The state controller comprises a state information storage queue and a judging module;
the state information storage queue stores the state information (the source accelerator id, the target accelerator id, and whether to exit the on-chip scheduling state after completion), updates the queue information according to the existing queue information, the accelerator interrupt signals, and the completion signal of the data handling module, and clears completed queue units according to the head-update instruction of the judging module.
The state information storage queue comprises state information and additional information; the state information comprises: the source accelerator id, the target accelerator id, and whether to exit the on-chip scheduling state after completion; the additional information comprises: whether the source data is valid, whether the unit is complete, whether read/write dependencies exist, and the ids of the units depended on; one possible queue state is shown in fig. 3;
the state information storage queue update rule is as follows:
the additional information is generated for the first time when a unit is written to the queue tail:
the source-data-valid flag is set to 0 (not yet valid);
if a unit ahead of it in the queue has the exit-after-completion flag set, or has a target accelerator id matching this unit's source accelerator id, the read-dependency flag is set to 1 (i.e. a read dependency exists) and the read-dependency unit id is set to the id of the nearest queue unit meeting the condition; if no unit qualifies, the read-dependency flag is set to 0 and the read-dependency unit id is set to 0;
if a unit ahead of it in the queue has a source accelerator id matching this unit's target accelerator id, the write-dependency flag is set to 1 (i.e. a write dependency exists) and the write-dependency unit id is set to the id of the nearest queue unit meeting the condition; if no unit qualifies, the write-dependency flag is set to 0 and the write-dependency unit id is set to 0;
when an accelerator execution-completion interrupt signal occurs, the additional information is updated as follows:
check the existing queue units for those whose source accelerator id matches the completed accelerator's id and which have no outstanding read dependency; if such units exist, set the source-data-valid flag of all qualifying units to 1 (i.e. the source data is valid); otherwise, do not update;
when the completion signal of the data handling module occurs, the additional information is updated as follows:
set the completion state of the unit whose id corresponds to the finished transfer to 1;
if that unit's exit-after-completion flag is 0 and other units in the queue hold read or write dependencies on it, update the corresponding read and write dependencies of all qualifying units to 0; if its exit-after-completion flag is 1 and it is not the head unit, read dependencies on it are not cleared;
if the unit corresponding to the finished transfer is the head unit and a unit with the exit-after-completion flag set to 1 is among the completed units being cleared, the read dependencies corresponding to that unit are cleared;
meanwhile, the state information storage queue clears the completed queue units according to the head-update instruction of the judging module, i.e. the head is updated to the requested unit id.
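The tail-write rules above (initial flags, nearest-match read and write dependencies) can be modeled as follows; the field names are illustrative, and the direction of each id comparison follows one plausible reading of the rules (an earlier unit's target feeding this unit's source creates a read dependency, and vice versa for writes):

```python
# Sketch of the extra-information generation when a new unit is written
# to the tail of the state information storage queue.

def new_unit(src_id, dst_id, exit_after, queue):
    unit = {
        "src": src_id, "dst": dst_id, "exit_after": exit_after,
        "src_valid": 0, "done": 0,
        "rdep": 0, "rdep_id": 0,  # read-dependency flag / unit index
        "wdep": 0, "wdep_id": 0,  # write-dependency flag / unit index
    }
    # Scan earlier units from nearest to oldest for the closest match.
    for i in range(len(queue) - 1, -1, -1):
        prev = queue[i]
        # Read dependency: an earlier exit-after unit, or an earlier unit
        # whose target accelerator matches this unit's source.
        if unit["rdep"] == 0 and (prev["exit_after"] or prev["dst"] == unit["src"]):
            unit["rdep"], unit["rdep_id"] = 1, i
        # Write dependency: an earlier unit whose source accelerator
        # matches this unit's target.
        if unit["wdep"] == 0 and prev["src"] == unit["dst"]:
            unit["wdep"], unit["wdep_id"] = 1, i
        if unit["rdep"] and unit["wdep"]:
            break  # both nearest matches found
    queue.append(unit)
    return unit
```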
The judging module judges whether an executable data-handling instruction exists according to the on-chip handling takeover request sent by the host and the state information in the state information storage queue; if not, it waits and judges again in the next cycle; if so, it determines the next instruction to be executed according to the state information, initiates a data read request to the preset instruction storage module, and sends a state-confirmation instruction to the target accelerator; if the target accelerator feeds back that it can accept data writes, a data-handling start instruction is sent to the data handling module; if the target accelerator feeds back that it cannot accept writes, the module re-enters the to-be-executed instruction judgment and confirms the accelerator state in the next round; after receiving the handling-completion signal from the data handling module, it judges from the signal and its own state information whether the queue head unit is updated, judges whether to send an execution-completion interrupt signal to the host, and whether to exit the on-chip data scheduling state.
After judging that an executable data-handling instruction exists, the state information is input into the arbitration unit, which determines the id of the unit to be executed next; a data read request is initiated to the preset instruction storage module and the unit's corresponding information is read; whether the queue head unit is updated is judged, i.e. whether the unit corresponding to the signal is the head unit; if not, the head is not updated; if so, the head is updated to the first uncompleted unit after it, and it is confirmed whether the cleared units include a unit with the exit-after-completion flag set to 1; if not, the next round of to-be-executed instruction judgment proceeds; if so, an execution-completion interrupt signal is sent to the host, the handling takeover register is reset to 0, and the on-chip data scheduling state is exited.
In the embodiment of the invention, the judging module comprises a single-byte handling-takeover memory-mapped register; when the host writes 1 to this register, the controller enters the on-chip data scheduling state and takes over the chip's memory access path. The module judges whether an executable data-handling instruction exists according to the state information in the state information storage queue, i.e. a queue unit whose source data is valid and which is not yet complete; this information is input into the arbitration unit, which determines the id of the unit to be executed next (no unit is selected if none meets the condition); a data read request is initiated to the preset instruction storage module and the corresponding unit information is read; a state-confirmation instruction is sent to the target accelerator of the unit to be executed; if the target accelerator feeds back that it can accept data writes, a data-handling start request is sent to the data handling module; if the target accelerator feeds back that it cannot accept writes, the module re-enters the to-be-executed instruction judgment and confirms the accelerator state in the next round. The arbitration unit should include a variable-weight mechanism to prevent deadlock, such as a round-robin design, but the highest weight needs to be updated back to the head unit after a data-handling request is successfully initiated. After receiving the handling-completion signal from the data handling module, the module judges from the signal and its own state information whether the queue head unit is updated, i.e. whether the unit corresponding to the signal is the head unit; if not, the head is not updated; if so, the head is updated to the first uncompleted unit after it, and it is confirmed whether the cleared units include a unit with the exit-after-completion flag set to 1; if not, the next round of to-be-executed instruction judgment proceeds; if so, an execution-completion interrupt signal is sent to the host, the handling takeover register is reset to 0, and the on-chip data scheduling state is exited.
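The variable-weight arbitration described above, with the highest weight returning to the head unit after a successful request, might be modeled as a rotating-priority picker; this is a sketch under illustrative names, not the patent's circuit:

```python
# Rotating-priority (round-robin style) arbiter: among ready queue
# units, pick the one closest to the current highest-priority index.
# After a data-handling request is successfully initiated, the highest
# priority is returned to the head unit, as the text requires, which
# prevents any ready unit from being starved indefinitely.

class RoundRobinArbiter:
    def __init__(self, depth):
        self.depth = depth
        self.priority = 0  # index currently holding the highest weight

    def pick(self, ready):
        """ready: set of unit ids meeting the execute condition
        (source data valid, not complete). Returns the chosen id,
        or None if no unit qualifies this cycle."""
        for off in range(self.depth):
            cand = (self.priority + off) % self.depth
            if cand in ready:
                return cand
        return None

    def request_initiated(self, head):
        """Called after a data-handling request is successfully
        initiated: highest weight goes back to the head unit."""
        self.priority = head
```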
As shown in fig. 4, a method for controlling on-chip data scheduling of an auxiliary 3D architecture near memory computing system includes the following steps:
step S1: before the source accelerator is started, the host's preset data scheduling instruction is obtained, ensuring that the data scheduling controller correctly detects the source accelerator's execution-completion information; in addition to the information needed for data scheduling, the instruction contains a flag indicating whether to exit the on-chip scheduling state after completion; whether the preset instruction queue is full is judged: if full, write-failure information is fed back to the host; if not full, the preset data scheduling instruction is decoded and its correctness judged; if incorrect, write-failure information is fed back; if correct, the handling information in the preset data scheduling instruction is stored, and whether the instruction has dependencies on existing instructions is judged according to the state information in the preset data scheduling instruction;
step S2: the on-chip handling takeover request sent by the host is obtained, the on-chip data scheduling state is entered, and which data transfers are executable is judged according to the state information in the preset data scheduling instructions and the accelerators' interrupt signals;
step S3: data is moved from one accelerator to another according to the handling information in the preset data scheduling instruction, and an instruction-completion signal is generated;
step S4: after handling is finished, the dependencies related to the completed instruction are cleared and the queue is updated; whether to send an execution-completion interrupt signal to the host is judged according to the completed instruction's exit-on-chip-scheduling-state flag, and the on-chip data scheduling state is exited.
In an embodiment of the present invention, a method for host control of the on-chip data scheduling controller and for the controller's scheduling decisions is provided, taking the flow of a single instruction as an example; as shown in figs. 5a, 5b, and 5c, the method includes the following steps:
the host needs to write a preset data scheduling instruction into the instruction receiving memory mapping register before starting the source accelerator so as to ensure that the data scheduling controller correctly detects the execution completion information of the source accelerator;
after receiving a preset data scheduling instruction, the data scheduling controller passes it to the preset instruction storage module, which judges whether the instruction queue is full; if full, write failure is fed back to the host; if not full, the module decodes the instruction and judges its consistency; if the instruction is erroneous, an instruction error is fed back to the host; if correct, the required information is stored in the preset instruction queue and the state information storage queue respectively, the instruction-receiving memory-mapped register is reset to 0, and the state controller generates the additional state information for the first time;
after finishing writing all initial data and preset data scheduling instructions, the host writes 1 to the handling-takeover memory-mapped register, thereby sending an on-chip handling takeover request that puts the controller into the on-chip data scheduling state and takes over the chip's memory access path;
the data scheduling controller detects the execution-completion status of the accelerators involved in all preset instructions written to the controller, regardless of whether it has entered the on-chip data scheduling state; after receiving execution-completion information from an accelerator, it sets the source-data-valid flag of all corresponding units in the state information storage queue (those without preceding uncompleted dependent instructions) to 1;
after the data scheduling controller enters the on-chip data scheduling state, the state controller checks every cycle, whenever no data handling has been initiated, for units whose source data is valid; if no unit meets the condition, it waits and repeats the check in the next cycle; if one does, it determines the next handling instruction to execute through arbitration, reads the corresponding information from the preset instruction storage module, and sends a read request to the target accelerator's state memory-mapped register to confirm the target accelerator's state; if the accelerator is busy, arbitration is performed again in the next cycle; if the accelerator is idle, the corresponding data-handling instruction is issued to the data handling module;
after receiving a data-handling instruction, the data handling module obtains the data-handling information from the preset instruction storage module, sequentially generates memory-access instructions, reads data from the source memory address into the temporary data cache, then reads data from the temporary data cache and writes it to the target memory address, loops through this data-handling operation until all data has been moved, and sends an instruction-completion signal to the state controller;
the state controller marks the queue unit corresponding to the instruction as complete and clears the dependencies on it according to the update rules; if that unit is the head unit, the first uncompleted unit after it is marked as the new head unit, and this information is passed to the preset instruction storage module; meanwhile, if a cleared unit carries the exit-on-chip-scheduling-state information, the related dependencies are cleared, the handling takeover register is reset to 0, and the data scheduling controller sends an execution-completion interrupt to the host and exits the on-chip scheduling state.
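The completion handling in this last step (mark the unit done, release dependencies on it, advance the head, decide whether to interrupt the host) can be condensed into one function; the field names follow no fixed convention in the patent and are illustrative:

```python
# Sketch of the state controller's completion handling. The queue is a
# list of unit dicts; `head` indexes the current head unit.

def on_transfer_complete(queue, unit_id, head):
    """Mark the finished unit, clear dependencies per the update rules,
    advance the head past completed units, and decide whether to exit
    the on-chip scheduling state. Returns (new_head, exit_scheduling)."""
    queue[unit_id]["done"] = 1
    exit_scheduling = False
    # Dependencies on an exit-after unit are kept until that unit is
    # cleared at the head; otherwise they are released immediately.
    if not queue[unit_id]["exit_after"] or unit_id == head:
        for u in queue:
            if u["rdep"] and u["rdep_id"] == unit_id:
                u["rdep"] = 0
            if u["wdep"] and u["wdep_id"] == unit_id:
                u["wdep"] = 0
    if unit_id == head:
        # Advance the head past completed units; a cleared exit-after
        # unit triggers the execution-complete interrupt and ends the
        # takeover (takeover register reset to 0 by the controller).
        while head < len(queue) and queue[head]["done"]:
            if queue[head]["exit_after"]:
                exit_scheduling = True
                for u in queue:  # release read deps deferred earlier
                    if u["rdep"] and u["rdep_id"] == head:
                        u["rdep"] = 0
            head += 1
    return head, exit_scheduling
```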
The above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and these modifications or substitutions do not depart from the scope of the embodiments of the present invention in nature.

Claims (10)

1. An on-chip data scheduling controller of an assisted 3D architecture near memory computing system, comprising: the device comprises a preset instruction storage module, a data handling module and a state controller, and is characterized in that:
the preset instruction storage module stores a preset data scheduling instruction sent by the host computer and sends the carrying information and the state information to the data carrying module and the state controller respectively;
the data carrying module carries data from one accelerator to another accelerator through carrying information of the preset instruction storage module according to a data carrying starting instruction of the state controller, and sends an instruction completion signal to the state controller;
the state controller enters a chip data scheduling state according to a chip carrying takeover request of the host, judges an executable data carrying instruction according to state information of a preset instruction storage module and an interrupt signal of an accelerator, sends the executable data carrying starting instruction to the data carrying module, acquires an instruction completion signal, judges whether to send an execution completion interrupt signal to the host or not according to the state information after carrying is completed, and exits the chip data scheduling state.
2. The on-chip data scheduling controller of an assisted 3D architecture near memory computing system of claim 1, wherein: the preset instruction storage module comprises: the system comprises an instruction decoder, a preset instruction queue and an instruction information register;
the instruction decoder receives the preset data scheduling instruction and judges whether the preset instruction queue is full; if full, it feeds back write-failure information to the host; if not full, it decodes the preset data scheduling instruction and judges whether the decoded instruction is correct; if incorrect, it feeds back write-failure information to the host; if correct, it sends the handling information and the state information in the preset data scheduling instruction to the preset instruction queue and the state controller respectively;
the preset instruction queue writes in the carrying information and updates the queue according to the updating information acquired from the state controller;
and the instruction information register reads corresponding carrying information from a preset instruction queue according to the reading request of the state controller so as to enable the data carrying module to read.
3. The on-chip data scheduling controller of an auxiliary 3D architecture near memory computing system according to claim 2, wherein: judging whether the decoded preset data scheduling instruction is correct comprises calculating a source memory address range and a target memory address range of the preset accelerator through an adder, and comparing the source memory start address and the target memory start address in the handling information against the source memory address range and the target memory address range of the preset accelerator for consistency; the result is correct if the addresses fall within the ranges, and erroneous otherwise.
4. The on-chip data scheduling controller of an auxiliary 3D architecture near memory computing system of claim 3, wherein: a validity comparison is performed before the consistency comparison; the source memory start address and the target memory start address in the handling information are valid when both are non-empty, and invalid otherwise.
5. The on-chip data scheduling controller of an auxiliary 3D architecture near memory computing system according to claim 2, wherein: the queue is updated according to update information obtained from the state controller, i.e. the preset instruction queue writes the handling information to the queue tail, updates the tail-unit information, and updates the head-unit information according to head-update requests sent by the state controller.
6. The on-chip data scheduling controller of an assisted 3D architecture near memory computing system of claim 1, wherein: the data handling module comprises a data handling controller and a temporary data cache; the data handling controller receives the data-handling start instruction from the state controller, sequentially generates memory-access instructions according to the handling information provided by the preset instruction storage module, reads data from the corresponding address in the source accelerator into the temporary data cache, then reads data from the temporary data cache and writes it to the corresponding address in the target accelerator, loops through this data-handling operation until all data has been moved, and sends an instruction-completion signal to the state controller.
7. The on-chip data scheduling controller of an assisted 3D architecture near memory computing system of claim 1, wherein: the state controller comprises a state information storage queue and a judgment module;
the state information storage queue is used for storing the state information, updating the queue information according to the existing queue information, the accelerator interrupt signal and the completion signal of the data handling module, and clearing the completed queue unit according to the head updating instruction of the judging module;
the judging module judges whether an executable data-handling instruction exists according to the on-chip handling takeover request sent by the host and the state information in the state information storage queue; if not, it waits and judges again in the next cycle; if so, it determines the next instruction to be executed according to the state information, initiates a data read request to the preset instruction storage module, and sends a state-confirmation instruction to the target accelerator; if the target accelerator feeds back that it can accept data writes, a data-handling start instruction is sent to the data handling module; if the target accelerator feeds back that it cannot accept writes, the module re-enters the to-be-executed instruction judgment and confirms the accelerator state in the next round; after receiving the handling-completion signal from the data handling module, it judges from the signal and its own state information whether the queue head unit is updated, judges whether to send an execution-completion interrupt signal to the host, and whether to exit the on-chip data scheduling state.
8. The on-chip data scheduling controller of an auxiliary 3D architecture near memory computing system according to claim 7, wherein: the state information storage queue comprises state information and additional information, wherein the state information comprises: the source accelerator id, the target accelerator id, and whether to exit the on-chip scheduling state after completion; the additional information comprises: whether the source data is valid, whether the unit is complete, whether read/write dependencies exist, and the ids of the units depended on;
the state information storage queue update rule is as follows:
the additional information is generated for the first time when a unit is written to the queue tail:
the source-data-valid flag is set to 0 (not yet valid);
if a unit ahead of it in the queue has the exit-after-completion flag set, or has a target accelerator id matching this unit's source accelerator id, the read-dependency flag is set to 1 and the read-dependency unit id is set to the id of the nearest queue unit meeting the condition; if no unit qualifies, the read-dependency flag is set to 0 and the read-dependency unit id is set to 0;
if a unit ahead of it in the queue has a source accelerator id matching this unit's target accelerator id, the write-dependency flag is set to 1 and the write-dependency unit id is set to the id of the nearest queue unit meeting the condition; if no unit qualifies, the write-dependency flag is set to 0 and the write-dependency unit id is set to 0;
when the accelerator execution-completion interrupt signal occurs, the additional information is updated as follows:
check the existing queue units for those whose source accelerator id matches the completed accelerator's id and which have no outstanding read dependency; if such units exist, update the source-data-valid flag of all qualifying units to 1; if not, do not update;
when the completion signal of the data handling module occurs, the additional information is updated as follows:
set the completion state of the unit whose id corresponds to the finished transfer to 1;
if that unit's exit-after-completion flag is 0 and other units in the queue hold read or write dependencies on it, update the corresponding read and write dependencies of all qualifying units to 0; if its exit-after-completion flag is 1 and it is not the head unit, read dependencies on it are not cleared;
if the unit corresponding to the finished transfer is the head unit and a unit with the exit-after-completion flag set to 1 is among the completed units being cleared, the read dependencies corresponding to that unit are cleared;
meanwhile, the state information storage queue clears the completed queue units according to the head-update instruction of the judging module, i.e. the head is updated to the requested unit id.
9. The on-chip data scheduling controller of an auxiliary 3D architecture near memory computing system of claim 7, wherein: after judging that an executable data-handling instruction exists, the judging module inputs the state information into the arbitration unit, determines the id of the unit to be executed next, initiates a data read request to the preset instruction storage module, and reads the unit's corresponding information; it judges whether the queue head unit is updated, i.e. whether the unit corresponding to the signal is the head unit; if not, the head is not updated; if so, the head unit is updated to the first uncompleted unit after it, and it is confirmed whether the cleared units include a unit with the exit-after-completion flag set to 1; if not, the next round of to-be-executed instruction judgment proceeds; if so, an execution-completion interrupt signal is sent to the host, the handling takeover register is reset to 0, and the on-chip data scheduling state is exited.
10. A method for controlling on-chip data scheduling of an auxiliary 3D architecture near memory computing system is characterized by comprising the following steps:
step S1: before the source accelerator is started, obtaining the host's preset data scheduling instruction, wherein the instruction contains a flag indicating whether to exit the on-chip scheduling state after completion; judging whether the preset instruction queue is full: if full, feeding back write-failure information to the host; if not full, decoding the preset data scheduling instruction and judging whether the decoded instruction is correct; if incorrect, feeding back write-failure information; if correct, storing the handling information in the preset data scheduling instruction, and judging whether the instruction has dependencies on existing instructions according to the state information in the preset data scheduling instruction;
step S2: obtaining the on-chip handling takeover request sent by the host, entering the on-chip data scheduling state, and judging which data transfers are executable according to the state information in the preset data scheduling instructions and the accelerators' interrupt signals;
step S3: moving data from one accelerator to another according to the handling information in the preset data scheduling instruction, and generating an instruction-completion signal;
step S4: after handling is finished, clearing the dependencies related to the completed instruction, updating the queue, judging whether to send an execution-completion interrupt signal to the host according to the completed instruction's exit-on-chip-scheduling-state flag, and exiting the on-chip data scheduling state.
CN202210856427.6A 2022-07-21 2022-07-21 On-chip data scheduling controller and method for auxiliary 3D architecture near memory computing system Active CN114996205B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210856427.6A CN114996205B (en) 2022-07-21 2022-07-21 On-chip data scheduling controller and method for auxiliary 3D architecture near memory computing system


Publications (2)

Publication Number Publication Date
CN114996205A true CN114996205A (en) 2022-09-02
CN114996205B CN114996205B (en) 2022-12-06

Family

ID=83022523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210856427.6A Active CN114996205B (en) 2022-07-21 2022-07-21 On-chip data scheduling controller and method for auxiliary 3D architecture near memory computing system

Country Status (1)

Country Link
CN (1) CN114996205B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116680230A (en) * 2023-05-22 2023-09-01 无锡麟聚半导体科技有限公司 Hardware acceleration circuit and chip

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104820657A (en) * 2015-05-14 2015-08-05 西安电子科技大学 Inter-core communication method and parallel programming model based on embedded heterogeneous multi-core processor
CN105843775A (en) * 2016-04-06 2016-08-10 中国科学院计算技术研究所 On-chip data partitioning read-write method, system and device
US9740235B1 (en) * 2015-03-05 2017-08-22 Liming Xiu Circuits and methods of TAF-DPS based interface adapter for heterogeneously clocked Network-on-Chip system
US10003554B1 (en) * 2015-12-22 2018-06-19 Amazon Technologies, Inc. Assisted sideband traffic management
EP3576328A1 (en) * 2017-03-24 2019-12-04 Huawei Technologies Co., Ltd. Data transmission method and apparatus
US10755772B1 (en) * 2019-07-31 2020-08-25 Shanghai Cambricon Information Technology Co., Ltd Storage device and methods with fault tolerance capability for neural networks
WO2021195949A1 (en) * 2020-03-31 2021-10-07 华为技术有限公司 Method for scheduling hardware accelerator, and task scheduler
CN114297101A (en) * 2021-12-31 2022-04-08 海光信息技术股份有限公司 Method and system for recording memory access source
CN114328322A (en) * 2022-03-17 2022-04-12 之江实验室 DMA controller operation method capable of configuring function mode
CN114398308A (en) * 2022-01-18 2022-04-26 上海交通大学 Near memory computing system based on data-driven coarse-grained reconfigurable array
CN114399035A (en) * 2021-12-30 2022-04-26 北京奕斯伟计算技术有限公司 Method for transferring data, direct memory access device and computer system
CN114610394A (en) * 2022-03-14 2022-06-10 海飞科(南京)信息技术有限公司 Instruction scheduling method, processing circuit and electronic equipment
CN114661644A (en) * 2022-02-17 2022-06-24 之江实验室 Pre-stored DMA device of auxiliary 3D architecture near memory computing accelerator system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Filip Adamec et al.: "Introduction to the new Packet Triggered Architecture for pipelined and parallel data processing", Proceedings of the 21st International Conference Radioelektronika 2011 *
Liu Yipeng et al.: "Design of a Zynq-based subcutaneous fingerprint OCT data acquisition system", Journal of Zhejiang University of Technology *
Zhang Lei et al.: "Reshapable processor: a user-definable processor architecture in accelerators", Network New Media Technology *
Zeng Sitao et al.: "Design of a convolutional neural network accelerator based on an eFLASH computing-in-memory architecture", Wanfang *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116680230A (en) * 2023-05-22 2023-09-01 无锡麟聚半导体科技有限公司 Hardware acceleration circuit and chip
CN116680230B (en) * 2023-05-22 2024-04-09 无锡麟聚半导体科技有限公司 Hardware acceleration circuit and chip

Also Published As

Publication number Publication date
CN114996205B (en) 2022-12-06

Similar Documents

Publication Publication Date Title
US8185713B2 (en) Flexible sequencer design architecture for solid state memory controller
CN114328322B (en) DMA controller operation method capable of configuring function mode
CN114996205B (en) On-chip data scheduling controller and method for auxiliary 3D architecture near memory computing system
US20080147969A1 (en) Separate Handling of Read and Write of Read-Modify-Write
US7337260B2 (en) Bus system and information processing system including bus system
CN114661644B (en) Pre-storage DMA device for auxiliary 3D architecture near-memory computing accelerator system
WO2011155096A1 (en) Data transfer control device, integrated circuit of same, data transfer control method of same, data transfer completion notification device, integrated circuit of same, data transfer completion notification method of same, and data transfer control system
US20150033090A1 (en) Memory system capable of increasing data transfer efficiency
CN111858141B (en) System-on-chip memory control device and system-on-chip
US20050114556A1 (en) System for improving PCI write performance
CN112286852B (en) Data communication method and data communication device based on IIC bus
WO2021068850A1 (en) Transaction management method and system, network device and readable storage medium
CN114968863A (en) Data transmission method based on DMA controller
WO2022028223A1 (en) Method and system for controlling data transmission by data flow architecture neural network chip
CN112825028B (en) Method for writing in a volatile memory and corresponding integrated circuit
US7987301B1 (en) DMA controller executing multiple transactions at non-contiguous system locations
CN111338998B (en) FLASH access processing method and device based on AMP system
CN108234147B (en) DMA broadcast data transmission method based on host counting in GPDSP
CN114490106A (en) Information exchange system and method
US6799293B2 (en) Sparse byte enable indicator for high speed memory access arbitration method and apparatus
CN112799974B (en) Control method and system of memory card
KR101041838B1 (en) Mobile storage control device and method
CN1234550B (en) Input/output bus system
US20230315573A1 (en) Memory controller, information processing apparatus, and information processing method
JP2004126911A (en) Control unit

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant