CN115525599A - High-efficiency computing device - Google Patents

High-efficiency computing device

Info

Publication number
CN115525599A
Authority
CN
China
Prior art keywords
data
computing device
task
external
memory
Prior art date
Legal status
Pending
Application number
CN202211131027.5A
Other languages
Chinese (zh)
Inventor
李树青
王江
孙华锦
王明明
Current Assignee
Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd
Original Assignee
Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd
Priority date
Filing date
Publication date
Application filed by Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd
Priority to CN202211131027.5A
Publication of CN115525599A
Legal status: Pending


Classifications

    • G06F 15/7814 — System on chip specially adapted for real time processing, e.g. comprising hardware timers
    • G06F 13/1678 — Details of memory controller using bus width
    • G06F 13/4068 — Device-to-bus coupling; electrical coupling
    • G06F 15/781 — On-chip cache; off-chip memory
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides an efficient computing device that receives data transmitted directly by external equipment, reads the data participating in the operation from an external memory, and writes the operation result data into the external memory. The on-chip interconnection bus is used for data stream interaction between the computing device and the storage controller and between the computing device and the peripheral controller; the storage controller is connected with an external storage, and the peripheral controller is connected with external equipment. The efficient computing device can significantly reduce the access bandwidth to the memory and can effectively break through the performance bottleneck caused by the shortage of external memory bandwidth in traditional compute acceleration equipment, achieving higher acceleration performance. The invention avoids the cost increase caused by using expensive devices such as high-capacity on-chip cache or HBM, uses only conventional technology and conventional external devices, and has obvious advantages in technical maturity and cost.

Description

High-efficiency computing device
Technical Field
The invention relates to the technical field of computers, in particular to an efficient computing device.
Background
A compute accelerator is a device in a computer system that provides high computational power in certain application domains. When a compute accelerator is used, computing tasks originally borne by the CPU can be offloaded to the accelerator, freeing the CPU's relatively expensive general-purpose computing capacity. In general, the CPU computing power available at the same cost is much less than the computing power of these specialized compute accelerators, which is why this type of device can function as an "accelerator".
The growth of CPU computing power has slowed and even stagnated, while the demands of networks, storage systems, and the like, driven by application fields such as cloud computing and AI, are still growing rapidly. CPUs have increasingly become the computing bottleneck in computer systems, and performing the computing tasks of specific application domains in dedicated hardware has become a trend.
A compute accelerator generally consists of a compute acceleration chip, board-level memory (e.g., DRAM), and an interface connecting external devices. Referring to fig. 1, the compute acceleration chip is generally an SoC system whose core is a special computing device that takes over the computation function; for example, in an AI chip the computing device is generally a vector multiplier, matrix multiplier, etc., while in a storage chip it is generally a RAID computing device, an encryption/decryption computing device, and the like.
In addition, since the computing process generally requires a large amount of buffer space, and on-chip buffers cannot reach large capacities because of their high price, the SoC system also needs a memory controller to connect the board-level memory.
The data participating in the computation may originate from an external device or from host memory, but the ultimate source is typically an external device such as a network or a disk. An efficient compute acceleration device is connected directly to external devices or communicates with them in P2P mode, so the accelerator chip's SoC system needs to include a peripheral controller. Some compute acceleration devices lack a peripheral controller and can obtain data only from host memory; since data from external devices must first enter host memory before being read by the acceleration device, computer systems using such acceleration devices are more prone to a memory bandwidth bottleneck.
The data flow of a conventional compute acceleration chip is shown in fig. 2. In a typical scenario, the external memory stores data 1, which will participate in an operation, and the other operand (data 2) is provided by an external device. In the conventional approach, data 2 is first written into the external memory by the external device through the memory controller; the computing device then reads data 1 and data 2 from the external memory and, after the computing process is executed, writes the result into the external memory.
The above conventional method has the disadvantage that each computation requires multiple accesses to the external memory. In the above scenario, for example, 4 memory accesses are required (excluding the preparation of data 1): writing data 2, reading data 1, reading data 2, and writing the result. This drawback was not obvious in conventional computer systems, because external memory tends to provide relatively large bandwidth; but with the advent of compute-intensive scenarios such as cloud computing and AI, external memory bandwidth has become the bottleneck of the compute accelerator. That is, the acceleration chip can often provide sufficiently high computational power, but the limitation of external memory bandwidth prevents that power from being fully exploited.
The above problems can be solved to some extent by using more external memory channels, or by using higher bandwidth HBM techniques, but these approaches all involve very high costs.
Disclosure of Invention
In view of the above, an object of the present invention is to provide an efficient computing apparatus, which can be used inside a computing acceleration chip to greatly reduce the number of accesses to an external memory and reduce the requirement for memory bandwidth.
In view of the above, in one aspect the present invention provides an efficient computing device, comprising:
the computing device, used for receiving data directly transmitted by external equipment, reading the data participating in the operation from the external memory, and writing the operation result data into the external memory;
the on-chip interconnection bus, used for data stream interaction between the computing device and the storage controller and between the computing device and the peripheral controller; the storage controller is connected with an external storage, and the peripheral controller is connected with external equipment.
As a further aspect of the present invention, the computing apparatus is further configured to support multiple external devices, or multiple tasks of one external device, and data-read tasks are allowed to be issued to the external devices in an order different from that in which the data is transmitted to the computing apparatus.
As a further aspect of the present invention, the computing apparatus further includes an internal cache module whose size is much smaller than the address space presented for receiving data; it tolerates jitter in the delay with which the external device returns data, as well as latency and jitter in external-storage read/write operations.
As a further aspect of the present invention, the external device is configured to write data directly to the computing apparatus, and the operation result data is written into the external memory after the data has been read and operated on by the computing apparatus.
As a further aspect of the present invention, the computing device reads data from the external memory in an active scheduling manner.
As a further solution of the present invention, at least one task context storage unit is arranged inside the computing device; each storage unit corresponds to one task, so as to support multiple concurrent tasks each sharing a set of task context information, where the behavior corresponding to one command, or a series of associated commands, issued to the external device is treated as one task.
As a further aspect of the present invention, the computing device is provided with a bus interface communicating with the on-chip interconnection bus; the bus interface comprises a master bus interface and a slave bus interface and is used to provide the configuration registers and a large address range for receiving data written by an external device, where the address range is a virtual address space.
As a further aspect of the invention, the virtual address space is divided into N parts, where N equals the maximum number of parallel tasks supported by the computing device; each task occupies an independent address space whose size is the maximum data size of a task, and the virtual address space occupies a continuous address space on the bus.
As a further solution of the present invention, when the external device writes a segment of data into the address space, the computing device is configured to: find the task corresponding to the address according to the address range hit by the data; acquire the task parameters from the task context storage module and set the calculation module; acquire the data mapping table from the task context storage module; acquire the offset of the data address within the address space corresponding to the current task; use the offset to find, at the corresponding position in the data mapping table, the location in the memory of the other data participating in the operation; and read the participating data from that location, which is then operated on together with the data from the external device.
As a further aspect of the present invention, a peripheral data cache is provided in the computing device; data written into the computing device is stored in the peripheral data cache, and the corresponding data read from the memory is stored in an external-storage data cache.
As a further scheme of the present invention, when peripheral data enters the data cache, a data number is generated for it; when a read command is sent to the memory, the number is sent along with the command and returned along with the returned data, so that the corresponding data can be quickly found in the peripheral data cache.
As a further aspect of the present invention, the minimum size of the peripheral cache is determined by the data latency of the peripheral and the bandwidth of the computing device:
buffer size = peripheral data latency × computing device bandwidth.
As a further aspect of the present invention, the computing device further includes a counter whose initial value is the cache size; when a task reserves a certain amount of cache, the corresponding size is subtracted from the counter, and when the computing device consumes a certain amount of data from the peripheral cache, the corresponding size is added back to the counter.
As a further scheme of the invention, the configuration of the task context is completed by a scheduler which is a hardware logic circuit or a CPU; the method for configuring the task context comprises the following steps:
the scheduler acquires a currently idle task context; when task contexts are managed and allocated in the scheduler, it obtains an idle resource from a resource pool it manages locally, and when task contexts are managed and allocated in the computing device, it applies to the computing device for the resource;
the scheduler determines a virtual address space range used by the task according to the acquired task context number;
the scheduler allocates the task parameters to a specified task context module, and allocates the position of the memory where another data participating in calculation is located to a data mapping table;
after the configuration is completed, the scheduler takes the acquired virtual address space range as a data receiving address and sends a reading command to external equipment;
the external equipment receives the read command and writes the data to the receiving address specified in the command; the data hits the data receiving address of the computing device;
the computing device acquires information from the task context according to the hit address, reads data from the memory into the external-storage data cache, performs the computation through the computing unit, and writes the result into the external memory;
when the address range of the current task receives an amount of data equal to the total size of the task, the current task ends and the computing device or scheduler reclaims the task context for allocation to the next task.
In yet another aspect of the present invention, there is also provided a computer readable storage medium storing computer program instructions which, when executed, implement any of the above methods for configuring task contexts in an efficient computing device according to the present invention.
In yet another aspect of the present invention, there is also provided a computer apparatus comprising a memory and a processor, the memory having stored therein a computer program, which when executed by the processor performs any of the above-described methods for configuration of task contexts in an efficient computing device according to the present invention.
Compared with the existing scheme, the efficient computing device designed by the invention has the following advantages:
the invention provides an efficient computing device, which can obviously reduce the access bandwidth to a memory, effectively break through the performance bottleneck caused by the shortage of the external storage bandwidth in the traditional computing acceleration equipment and achieve higher acceleration performance; the invention avoids the cost increase caused by using expensive devices such as high-capacity on-chip cache or HBM, and the like, only uses the conventional technology and the conventional external devices, and has obvious technical maturity and cost advantages.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other embodiments can be obtained by using the drawings without creative efforts.
In the figure:
FIG. 1 is a block diagram illustrating the internal components of a typical compute acceleration chip;
FIG. 2 illustrates a flow chart of a data flow application of a conventional compute acceleration chip;
FIG. 3 illustrates a flow diagram of a data flow implementation in an efficient computing device in accordance with the present invention;
FIG. 4 is a block diagram illustrating the basic framework and key modules of an efficient computing device, according to the present invention;
FIG. 5 illustrates a schematic diagram of the allocation of virtual address space in an efficient computing device according to the present invention;
FIG. 6 illustrates a flow diagram for querying task information through input data in an efficient computing device according to the present invention.
FIG. 7 illustrates a schematic diagram of an embodiment of a computer-readable storage medium to implement a method of configuration of task contexts in an efficient computing device, in accordance with the present invention;
FIG. 8 illustrates a hardware architecture diagram of an embodiment of a computer apparatus for implementing a method for configuration of task contexts in an efficient computing device, in accordance with the present invention;
FIG. 9 shows a schematic diagram of a chip according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two non-identical entities with the same name or different parameters; "first" and "second" are used only for convenience of expression and should not be construed as limiting the embodiments of the present invention. Furthermore, the terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to those steps or elements but may include other steps or elements not expressly listed or inherent thereto.
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not restrictive of its broad application.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flowcharts shown in the figures are illustrative only and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Because a traditional compute acceleration chip needs to access the external memory multiple times for each computation, the chip often cannot fully exploit the computational power it provides, owing to the limitation of external memory bandwidth. Using more external memory channels or higher-bandwidth HBM technology can alleviate the problem to a certain extent, but these methods carry very high cost.
In view of this, the present invention designs an implementation manner of an efficient computing device, which can be used inside a computing acceleration chip to greatly reduce the number of accesses to an external memory and reduce the requirement for memory bandwidth.
To this end, referring to fig. 3 and 4, in a first aspect of the present invention there is provided an efficient computing device comprising a computing device, an on-chip interconnect bus, a memory controller, and a peripheral controller; the computing device is used for receiving data directly transmitted by external equipment, reading the data participating in the operation from the external memory, and writing the operation result data into the external memory; the on-chip interconnection bus is used for data stream interaction between the computing device and the storage controller and between the computing device and the peripheral controller; the storage controller is connected with an external storage, and the peripheral controller is connected with external equipment.
In the embodiment of the invention, the data of the external equipment is directly transmitted to the computing device, and is not firstly transmitted to the memory. After receiving the data, the computing device reads another data participating in the operation from the external memory, and then writes the operation result data into the external memory.
In some embodiments of the present invention, the computing apparatus is further configured to support multiple external devices or multiple tasks of one external device, and the data reading task is allowed to be issued to the external devices in a different order than the data is transmitted to the computing apparatus.
In embodiments of the invention, a computing device may support multiple concurrent tasks, i.e., at any time, the computing device may have multiple tasks in process, rather than having to start the next after one task is completed. This feature can reduce the idle latency of the device in case of a certain delay of the data of the external device.
The computing device can support a plurality of external devices or a plurality of tasks of one external device, and the tasks are allowed to be out of order, namely the data reading tasks are allowed to be issued to the external devices in a different order from the data transmission to the computing device.
In some embodiments of the present invention, the computing apparatus further includes an internal cache module whose size is much smaller than the address space presented for data reception; it tolerates jitter in the delay with which the external device returns data, as well as latency and jitter in external-storage read/write operations.
In embodiments of the present invention, a small internal cache is provided within the computing device, the cache being characterized by a size that is much smaller than the size of the address space presented externally by the computing device for data reception, hereinafter referred to as the "virtual address" space. The cache is used for allowing the delay of the data returned by the external device to have certain jitter, and the read-write operation of the external storage has certain delay and jitter, so that the requirements on the external device and the external storage are reduced.
Therefore, compared with a traditional compute acceleration device, the high-efficiency computing device can significantly reduce the access bandwidth to the memory, effectively break through the performance bottleneck caused by the shortage of external storage bandwidth in traditional compute acceleration equipment, achieve higher acceleration performance, and avoid the cost increase of expensive devices such as high-capacity on-chip cache or HBM (high-bandwidth memory).
In some embodiments of the present invention, in the system's data flow, the external device is configured to write data directly to the computing device, and the operation result data is written into the external memory after the data has been read and operated on by the computing device.
In this embodiment, the data is written directly to the computing device by the device, rather than being written to the external memory before being read by the computing device. Further, if the external device is a device that can actively transmit data, it is "written" by the external device; if the external device is a passive device, data may be generally retrieved from the external device and "written" to the computing device by a DMA controller, which may also be integrated within the computing device.
Because of possible out-of-order and multi-IO concurrency, the computing device cannot know in advance to which task the data belongs, which is a distinguishing feature of the present invention relative to conventional computing devices. Therefore, after data is written into the computing device, the computing device needs to look up the task parameters according to additional information carried by the written data (generally, the data address), acquire the other data participating in the operation from the external memory, and perform the computing process. How to realize this process efficiently is also one of the key points of the present invention.
As can be seen from fig. 3, the present invention significantly reduces the number of accesses to the external memory compared to the conventional method (in the simple scenario in the above figure, the conventional method requires 4 accesses, and the present invention requires only 2 accesses).
The basic framework and the key modules of the computing device designed by the invention are shown in fig. 4, at least one task context storage unit is arranged in the computing device, each storage unit corresponds to one task and is used for supporting a plurality of concurrent tasks and sharing a group of task context information, wherein a behavior corresponding to one command or a series of associated commands issued to external equipment is taken as one task.
In this embodiment, the computing device has a plurality of task context storage units, and each storage unit corresponds to one task. The concept of a task is not strictly defined: anything that can share a set of task context information can be regarded as one task, and generally the behavior corresponding to one command, or a series of associated commands, issued to an external device can be regarded as one task. Because a traditional computing device actively schedules tasks and reads data, it can compute data according to plan and needs to store only a small amount of context, or only the context of the current task; the computing device designed by the invention passively receives data, so the design with multiple task contexts more effectively supports out-of-order execution among tasks and reduces the idle waiting time of the device, which is one of the characteristics of the invention.
The task context contains two types of key information: task parameters and data mapping. The task parameters are conventional information, including whatever is necessary for executing the computing task, such as the task size and data coefficients. Since data from different tasks may be interleaved, the device may update the task parameters after each computation over a small block of data. The data mapping is a table: because the other operand required by the computing unit must be actively read from the memory, after the computing device receives data from the external device it must find in the mapping table, according to the data information (in the present invention, the data address), the address of the other operand corresponding to that data, and then read it.
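To make the two kinds of context information concrete, the following is a minimal C sketch of a task context store; all structure names, field names, and sizes (MAX_TASKS, MAP_ENTRIES, CHUNK_BYTES) are illustrative assumptions, not the patent's actual layout.

```c
#include <stdint.h>
#include <string.h>

#define MAX_TASKS   16    /* assumed maximum number of concurrent tasks (N) */
#define MAP_ENTRIES 64    /* assumed number of data-mapping entries per task */
#define CHUNK_BYTES 4096  /* assumed data granularity of one mapping entry */

/* One data-mapping entry: for a given offset inside the task's virtual
 * address window, where the other operand lives in external memory and
 * where the result should be written. */
struct map_entry {
    uint64_t operand_addr;  /* external-memory address of the other operand */
    uint64_t result_addr;   /* external-memory address for the result */
};

/* Task context: task parameters plus the data mapping table. */
struct task_ctx {
    uint32_t task_size;   /* total bytes expected for this task */
    uint32_t bytes_seen;  /* bytes received so far; task ends at task_size */
    uint32_t coeff;       /* example task parameter, e.g. a data coefficient */
    uint8_t  in_use;      /* context currently allocated to a live task */
    struct map_entry map[MAP_ENTRIES];
};

static struct task_ctx ctx_table[MAX_TASKS];  /* one storage unit per task */
```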
Referring to fig. 4, in some embodiments of the present invention the computing apparatus is provided with a bus interface communicating with the on-chip interconnect bus; the bus interface comprises a master bus interface and a slave bus interface and is configured to provide the configuration registers and a large address range for receiving data written by an external device, where the address range is a virtual address space.
The bus interface (slave) is used to receive task configuration information and data written by external devices, and may be implemented by using a plurality of different protocols, such as AXI, AHB, etc., and the interface is characterized by providing an address range, and a read or write request falling within the address range is captured by the interface and converted into read and write behaviors for internal registers or logic of the module.
One of the features of the bus interface in the present invention is that in addition to providing addresses such as configuration registers, a large address range is provided for receiving data written by an external device, the size of the address range far exceeds the size of the cache inside the module, and the address range is referred to as a virtual address space.
In some embodiments of the present invention, referring to fig. 5, the virtual address space is divided into N parts, where N is equal to the maximum number of parallel tasks supported by the computing device, each task occupies an independent address space, and the space size is the maximum data size of one task, where the virtual address space occupies a continuous address space on the bus.
Namely: each task occupies an independent address space whose size is the maximum data size of one task; when the data size of a task is smaller than the space size, only part of the address space is used. As a simple implementation, the virtual address space of the present invention occupies a continuous segment of address space on the bus, so when data from the external device is written to a certain address, it is easy to determine to which task the data belongs. Of course, the virtual address space need not be continuous, as long as there is a mapping relationship between addresses and tasks.
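Under this contiguous layout, decoding which task an incoming write belongs to is a single divide and modulo. The following continues the C sketch above; VIRT_BASE and TASK_SPACE are assumed example values.

```c
#define VIRT_BASE  0x800000000ULL  /* assumed bus address of the virtual window */
#define TASK_SPACE 0x100000ULL     /* assumed per-task slot: 1 MiB, the max task size */

/* Map a bus address hit by an incoming write to (task id, offset in task).
 * Returns the task id, or -1 if the address is outside the virtual window. */
static inline int addr_to_task(uint64_t addr, uint64_t *offset)
{
    if (addr < VIRT_BASE ||
        addr >= VIRT_BASE + (uint64_t)MAX_TASKS * TASK_SPACE)
        return -1;
    *offset = (addr - VIRT_BASE) % TASK_SPACE;
    return (int)((addr - VIRT_BASE) / TASK_SPACE);
}
```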
In some embodiments of the present invention, when the external device writes a piece of data into the address space, the computing device is configured to: find the task corresponding to the address according to the address range hit by the data; acquire the task parameters from the task context storage module and set the calculation module; acquire the data mapping table from the task context storage module; obtain the offset of the data address within the address space corresponding to the current task; use the offset to find, at the corresponding position in the data mapping table, the location in the memory of the other data participating in the operation; and read the participating data from that location, which is then operated on together with the data from the external device.
Therefore, referring to fig. 6, when the external device writes a piece of data into the above address space (a C sketch of this receive path follows the steps below):
(1) The computing device firstly finds out the task corresponding to the address according to the address range hit by the data;
(2) Then the computing device obtains the task parameters from the task context storage module and sets a computing module;
(3) Then the computing device acquires a data mapping table from the task context storage module;
(4) Then the computing device further obtains the offset of the data address in the address space corresponding to the current task;
(5) The computing device finds the position of the other data participating in the operation in the memory by using the offset at the corresponding position in the data mapping table;
(6) The computing device reads another data from the location that will ultimately participate in the operation with the data from the external device.
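Putting steps (1) to (6) together, the receive path can be sketched as follows, continuing the C example; configure_compute_unit, issue_mem_read, and buffer_peripheral_data are hypothetical stand-ins for the hardware's compute setup, memory read command, and peripheral data cache.

```c
/* Hypothetical hardware-facing stubs, so the sketch is self-contained. */
static void configure_compute_unit(uint32_t coeff) { (void)coeff; }
static void issue_mem_read(uint64_t addr, uint32_t len, int tag)
{ (void)addr; (void)len; (void)tag; }
static void buffer_peripheral_data(int task, const void *d, uint32_t len)
{ (void)task; (void)d; (void)len; }

/* Called when the external device writes a chunk into the virtual window. */
static void on_peripheral_write(uint64_t addr, const void *data, uint32_t len)
{
    uint64_t off;
    int tid = addr_to_task(addr, &off);        /* (1) task from hit address   */
    if (tid < 0 || !ctx_table[tid].in_use)
        return;                                /* stray write: ignore         */

    struct task_ctx *ctx = &ctx_table[tid];
    configure_compute_unit(ctx->coeff);        /* (2) set the compute module  */

    /* (3)-(5): index the mapping table by the offset within the task slot. */
    const struct map_entry *e = &ctx->map[off / CHUNK_BYTES];

    buffer_peripheral_data(tid, data, len);    /* park data until operand     */
    issue_mem_read(e->operand_addr, len, tid); /* (6) fetch the other operand */
    ctx->bytes_seen += len;                    /* task ends at task_size      */
}
```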
In some embodiments of the present invention, a peripheral data cache is provided in the computing device, data written to the computing device is stored in the peripheral data cache in the computing device, and corresponding data read from the memory is stored in the external storage data cache.
In some embodiments of the present invention, when peripheral data enters the data cache, a data number is generated for it; when a read command is sent to the memory, this number is sent along with the command and returned with the returned data, so that the corresponding data can be found quickly in the peripheral data cache.
Data written to the computing device is stored in the peripheral data cache in the computing device, and the corresponding data read from the memory is stored in the external-storage data cache. When data arrives in the memory data cache, it is sent to the computing unit for computation together with the corresponding peripheral data. When peripheral data enters the data cache, it can be marked with a data number; when a read command is sent to the memory, the number is sent with the command and returned with the returned data. Therefore, when the computing device receives the data returned by the memory, the corresponding data can be quickly found in the peripheral data cache. In most cases, read commands to the external memory complete and return data in order, so the data number can be omitted and both the peripheral data cache and the memory data cache can be designed as simple FIFOs.
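A sketch of the tag matching described above, continuing the C example; TAGS and the compute() kernel are assumptions. In the common in-order case the same structure degenerates to a plain FIFO.

```c
#define TAGS 32  /* assumed maximum number of memory reads in flight */

struct chunk {
    uint8_t  buf[CHUNK_BYTES];
    uint32_t len;
    int      task;
    uint8_t  valid;
};
static struct chunk pcache[TAGS];  /* peripheral data cache, indexed by tag */
static uint32_t next_tag;

static void compute(const void *a, const void *b, uint32_t len)
{ (void)a; (void)b; (void)len; /* hypothetical compute kernel */ }

/* Store a peripheral chunk and return the data number to attach to the
 * memory read command. */
static uint32_t stash_peripheral_chunk(int task, const void *d, uint32_t len)
{
    uint32_t tag = next_tag++ % TAGS;  /* assumes at most TAGS reads in flight */
    memcpy(pcache[tag].buf, d, len);
    pcache[tag].len = len;
    pcache[tag].task = task;
    pcache[tag].valid = 1;
    return tag;
}

/* The memory returns the same tag with its data, so the matching peripheral
 * chunk is found in O(1) and both operands go to the compute unit. */
static void on_mem_read_return(uint32_t tag, const void *operand, uint32_t len)
{
    if (tag >= TAGS || !pcache[tag].valid)
        return;
    compute(pcache[tag].buf, operand, len);
    pcache[tag].valid = 0;  /* free the cache slot */
}
```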
In some embodiments of the invention, the minimum value of the size of the peripheral cache is determined by the data delay of the peripheral and the bandwidth of the computing device:
buffer size = peripheral data latency × computing device bandwidth.
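As a worked example under assumed numbers: with a peripheral data latency of 1 µs and a computing-device bandwidth of 16 GB/s, the minimum peripheral cache would be 1 µs × 16 GB/s = 16 KB.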
In some embodiments of the present invention, the computing device further includes a counter whose initial value is the cache size; when a task reserves a certain amount of cache, the corresponding size is subtracted from the counter, and when the computing device consumes a certain amount of data from the peripheral cache, the corresponding size is added back to the counter.
Therefore, to prevent buffer overflow, the computing device needs to maintain a counter, the initial value of which is the buffer size. When a task needs a certain amount of cache, the size is subtracted from the counter. The counter is incremented by the size each time the computing device consumes a certain amount of data from the peripheral cache. And when the cache size required by one task is larger than the count value, waiting is required, or the task is split.
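A minimal sketch of this credit counter, continuing the C example; PCACHE_BYTES is an assumed cache size (e.g. the 16 KB from the worked example above).

```c
#define PCACHE_BYTES 16384  /* assumed peripheral cache size */

static uint32_t credits = PCACHE_BYTES;  /* initial value: the cache size */

/* Reserve cache for a task; on failure the caller waits or splits the task. */
static int reserve_cache(uint32_t need)
{
    if (need > credits)
        return -1;
    credits -= need;
    return 0;
}

/* Called when the computing device has consumed 'done' bytes from the cache. */
static void consume_cache(uint32_t done)
{
    credits += done;
}
```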
The external memory cache size and allocation principle are the same as the above-mentioned peripheral cache.
In some embodiments of the present invention, the task context is configured by a scheduler, which is a hardware logic circuit or a CPU; the method for configuring the task context comprises the following steps (a C sketch of the scheduler side follows the list):
the scheduler acquires a currently idle task context; when task contexts are managed and allocated in the scheduler, it obtains an idle resource from a resource pool it manages locally, and when task contexts are managed and allocated in the computing device, it applies to the computing device for the resource;
the scheduler determines a virtual address space range used by the task according to the acquired task context number;
the scheduler allocates the task parameters to a designated task context module and allocates the position of the memory where another data participating in the operation is located to a data mapping table;
after the configuration is completed, the scheduler takes the acquired virtual address space range as a data receiving address and sends a reading command to external equipment; it should be noted that the specified size of the cache needs to be retrieved from the computing device before sending the command to the external device. If the size of the current residual cache is smaller than the size of the data to be read of the whole task, the scheduler needs to consider waiting, or split the task, and firstly sends a reading command which does not exceed the size of the current residual cache to the external device.
the external equipment receives the read command and writes the data to the receiving address specified in the command; the data hits the data receiving address of the computing device;
the computing device acquires information from the task context according to the hit address, reads data from the memory into the external-storage data cache, performs the computation through the computing unit, and writes the result into the external memory (the result address is also recorded in the data mapping table);
when the address range of the current task receives an amount of data equal to the total size of the task, the current task ends and the computing device or scheduler reclaims the task context for allocation to the next task.
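The scheduler side of this flow, continuing the C sketch; alloc_idle_context and send_read_cmd_to_device are hypothetical helpers for context allocation and for issuing the read command to the external device.

```c
static int alloc_idle_context(void)
{
    for (int i = 0; i < MAX_TASKS; i++)   /* step 1: find a free context */
        if (!ctx_table[i].in_use)
            return i;
    return -1;
}

static void send_read_cmd_to_device(uint64_t recv_addr, uint32_t len)
{ (void)recv_addr; (void)len; /* hypothetical peripheral read command */ }

static int schedule_task(uint32_t size, uint32_t coeff,
                         const struct map_entry *map, uint32_t entries)
{
    int tid = alloc_idle_context();       /* step 1 */
    if (tid < 0)
        return -1;

    /* step 2: the context number selects the virtual address range. */
    uint64_t recv = VIRT_BASE + (uint64_t)tid * TASK_SPACE;

    /* step 3: fill task parameters and the data mapping table. */
    struct task_ctx *ctx = &ctx_table[tid];
    ctx->task_size  = size;
    ctx->bytes_seen = 0;
    ctx->coeff      = coeff;
    memcpy(ctx->map, map, entries * sizeof(*map));
    ctx->in_use = 1;

    if (reserve_cache(size) < 0) {        /* not enough cache: wait or split */
        ctx->in_use = 0;
        return -1;
    }

    send_read_cmd_to_device(recv, size);  /* step 4: issue the read command */
    return tid;  /* steps 5-7 proceed as data arrives; the context is
                    reclaimed when bytes_seen reaches task_size */
}
```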
It should be understood that although the steps are described above in a certain order, they are not necessarily performed in that order; unless explicitly stated herein, they may be performed in other orders. Moreover, some steps of the present embodiment may comprise multiple sub-steps or stages, which need not be completed at the same time and may be performed at different times; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with other steps, or with at least part of the sub-steps or stages of other steps.
In a second aspect of the embodiments of the present invention, a computer-readable storage medium is further provided, and fig. 7 is a schematic diagram of a computer-readable storage medium illustrating a method for configuring a task context in an efficient computing device according to an embodiment of the present invention. As shown in fig. 7, the computer-readable storage medium 300 stores computer program instructions 310, the computer program instructions 310 being executable by a processor. The computer program instructions 310 when executed implement a method for configuring a task context in an efficient computing device according to any of the embodiments described above, comprising the steps of:
the scheduler acquires a currently idle task context; when task contexts are managed and allocated in the scheduler, it obtains an idle resource from a resource pool it manages locally, and when task contexts are managed and allocated in the computing device, it applies to the computing device for the resource;
the scheduler determines a virtual address space range used by the task according to the acquired task context number;
the scheduler allocates the task parameters to a specified task context module, and allocates the position of the memory where another data participating in calculation is located to a data mapping table;
after the configuration is completed, the scheduler takes the acquired virtual address space range as a data receiving address and sends a reading command to external equipment;
the external equipment receives the read command and writes the data to the receiving address specified in the command; the data hits the data receiving address of the computing device;
the computing device acquires information from the task context according to the hit address, reads data from the memory into the external-storage data cache, performs the computation through the computing unit, and writes the result into the external memory;
when the address range of the current task receives an amount of data equal to the total size of the task, the current task ends and the computing device or scheduler reclaims the task context for allocation to the next task.
It will be appreciated that all of the embodiments, features and advantages set forth above with respect to the method of configuration of task contexts in an efficient computing device according to the present invention apply equally, without conflict with one another, to an efficient computing device and storage medium according to the present invention.
In a third aspect of the embodiments of the present invention, there is further provided a computer apparatus 400, comprising a memory 420 and a processor 410, the memory storing a computer program which, when executed by the processor, implements the method for configuring task contexts in an efficient computing device according to any one of the above embodiments, comprising the following steps:
the scheduler acquires a currently idle task context; when task contexts are managed and allocated in the scheduler, it obtains an idle resource from a resource pool it manages locally, and when task contexts are managed and allocated in the computing device, it applies to the computing device for the resource;
the scheduler determines the virtual address space range used by the task according to the acquired task context number;
the scheduler allocates the task parameters to a specified task context module, and allocates the position of the memory where another data participating in calculation is located to a data mapping table;
after the configuration is completed, the scheduler takes the acquired virtual address space range as a data receiving address and sends a reading command to external equipment;
the external equipment receives the read command and writes the data to the receiving address specified in the command; the data hits the data receiving address of the computing device;
the computing device acquires information from the task context according to the hit address, reads data from the memory into the external-storage data cache, performs the computation through the computing unit, and writes the result into the external memory;
when the address range of the current task receives an amount of data equal to the total size of the task, the current task ends and the computing device or scheduler reclaims the task context for allocation to the next task.
Fig. 8 is a schematic hardware configuration diagram of an embodiment of a computer apparatus for performing a method for configuring a task context in an efficient computing device according to the present invention. Taking the computer device 400 shown in fig. 8 as an example, the computer device includes a processor 410 and a memory 420, and may further include: an input device 430 and an output device 440. The processor 410, the memory 420, the input device 430, and the output device 440 may be connected by a bus or other means, as exemplified by the bus connection in fig. 8. Input device 430 may receive entered numeric or character information and generate signal inputs related to the implementation of an efficient computing device. The output device 440 may include a display device such as a display screen.
The memory 420, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the method for configuring a task context in the embodiments of the present application. The memory 420 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created by use of a method of configuration of a task context, and the like. Further, the memory 420 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 420 may optionally include memory located remotely from processor 410, which may be connected to local modules via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor 410 executes various functional applications of the server and data processing, i.e., a method of implementing configuration of task context of the above-described method embodiments, by executing nonvolatile software programs, instructions, and modules stored in the memory 420.
In a fourth aspect of an embodiment of the present invention, there is also provided a chip 500 implementing any of the above methods for configuration of task contexts in an efficient computing device according to the present invention. Fig. 9 shows a schematic diagram of a frame of the chip 500 according to the invention. As shown in FIG. 9, in this embodiment the chip 500 has in its architecture a CPU reset vector register 510, a CPU release control register 520, a CPU release control pin 530, and a debug interface 540, wherein
The CPU reset vector register 510 is used to control the address of an instruction that is read and executed after the CPU is released;
the CPU release control register 520 is used to control CPU release when the chip 500 is powered on;
the CPU release control pin 530 is used to control the validity of the CPU release control register 520;
the debug interface 540 is used to read and write the on-chip RAM and the registers, thereby performing reads and writes of the chip.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.
Finally, it should be noted that the computer-readable storage medium (e.g., memory) herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. By way of example, and not limitation, nonvolatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM), which can act as external cache memory. By way of example and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The storage devices of the disclosed aspects are intended to comprise, without being limited to, these and other suitable types of memory.
The various illustrative logical blocks, modules, and circuits described in connection with the disclosure herein may be implemented or performed with the following components designed to perform the functions herein: a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination of these components. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP, and/or any other such configuration.
The invention provides an efficient computing device, which can obviously reduce the access bandwidth to a memory, effectively break through the performance bottleneck caused by the shortage of the external storage bandwidth in the traditional computing acceleration equipment and achieve higher acceleration performance; the invention avoids the cost increase caused by using expensive devices such as high-capacity on-chip cache or HBM, and the like, only uses the conventional technology and the conventional external devices, and has obvious technical maturity and cost advantages.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in the configuration of task contexts in an efficient computing device according to the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items. The numbers of the embodiments disclosed in the above embodiments of the present invention are merely for description, and do not represent the advantages or disadvantages of the embodiments.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is meant only to be exemplary and is not intended to suggest that the scope of the disclosure, including the claims, is limited to these examples. Within the spirit of the embodiments of the invention, technical features of the above embodiment or of different embodiments may also be combined, and many other variations of different aspects of the embodiments exist as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (10)

1. An efficient computing device, comprising:
the computing device, used for receiving data directly transmitted by external equipment, reading the data participating in the operation from the external memory, and writing the operation result data into the external memory;
the on-chip interconnection bus, used for data stream interaction between the computing device and the storage controller and between the computing device and the peripheral controller; the storage controller is connected with an external storage, and the peripheral controller is connected with external equipment.
2. The efficient computing apparatus of claim 1, wherein the computing apparatus is further configured to support multiple external devices or multiple tasks of one external device, and wherein the data reading tasks are allowed to be issued to the external devices in a different order than the data is transmitted to the computing apparatus.
3. The efficient computing device of claim 2, further comprising an internal cache module inside the computing device, wherein the internal cache is much smaller than the address space presented for data reception and tolerates jitter in the delay with which the external device returns data, as well as latency and jitter in external-storage read/write operations.
4. The efficient computing device of claim 1, wherein the external device is configured to write data directly to the computing device, and the operation result data is written into the external memory after the data has been read and operated on by the computing device.
5. The efficient computing device of claim 2, wherein the computing device employs proactive scheduling to read data from an external memory.
6. The efficient computing device according to claim 5, wherein at least one task context storage unit is disposed inside the computing device, each storage unit corresponds to one task and is used for supporting multiple concurrent tasks and sharing a set of task context information, and a behavior corresponding to one command or a series of associated commands issued to the external device serves as one task.
7. The efficient computing device of claim 6, wherein the computing device is configured with a bus interface in communication with the on-chip interconnect bus, the bus interface comprising a master bus interface and a slave bus interface, for providing the configuration registers and a large address range for receiving data written by an external device, wherein the address range is a virtual address space.
8. A computing device as claimed in claim 7, wherein the virtual address space is divided into N parts, N being equal to the maximum number of parallel tasks supported by the computing device, each task occupying an independent address space, the size of the space being the maximum data size of a task, and wherein the virtual address space occupies a contiguous address space on the bus.
9. The efficient computing device according to claim 8, wherein, when an external device writes a piece of data into the address space, the computing device is configured to: find the task corresponding to the address according to the address range hit by the data; acquire the task parameters from the task context storage module and set the calculation module; acquire the data mapping table from the task context storage module; acquire the offset of the data address within the address space corresponding to the current task; use the offset to find, at the corresponding position in the data mapping table, the location in the memory of the other data participating in the operation; and read the participating data from that location, which is then operated on together with the data from the external device.
10. The efficient computing apparatus of claim 9, wherein the task context is configured by a scheduler, the scheduler being a hardware logic circuit or a CPU; the method for configuring the task context comprises the following steps:
the scheduler acquires a currently idle task context; when task contexts are managed and allocated in the scheduler, it obtains an idle resource from a resource pool it manages locally, and when task contexts are managed and allocated in the computing device, it applies to the computing device for the resource;
the scheduler determines a virtual address space range used by the task according to the acquired task context number;
the scheduler allocates the task parameters to a specified task context module, and allocates the position of the memory where another data participating in calculation is located to a data mapping table;
after the configuration is completed, the scheduler takes the acquired virtual address space range as a data receiving address and sends a reading command to external equipment;
the external equipment receives the read command and writes the data to the receiving address specified in the command; the data hits the data receiving address of the computing device;
the computing device acquires information from the task context according to the hit address, reads data from the memory into the external-storage data cache, performs the computation through the computing unit, and writes the result into the external memory;
when the address range of the current task receives an amount of data equal to the total size of the task, the current task ends and the computing device or scheduler reclaims the task context for allocation to the next task.
CN202211131027.5A 2022-09-16 2022-09-16 High-efficiency computing device Pending CN115525599A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211131027.5A CN115525599A (en) 2022-09-16 High-efficiency computing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211131027.5A CN115525599A (en) 2022-09-16 High-efficiency computing device

Publications (1)

Publication Number Publication Date
CN115525599A (en) 2022-12-27

Family

ID=84696887

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211131027.5A Pending CN115525599A (en) 2022-09-16 2022-09-16 High-efficient computing device

Country Status (1)

Country Link
CN (1) CN115525599A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination