CN117312330B

CN117312330B - Vector data aggregation method and device based on note storage and computer equipment

Info

Publication number: CN117312330B
Application number: CN202311614079.2A
Authority: CN
Inventors: 方建滨; 张鹏; 黄春; 唐滔; 彭林; 崔英博; 姜浩; 沈洁; 范小康; 于恒彪; 苏醒; 易昕
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2023-11-29
Filing date: 2023-11-29
Publication date: 2024-02-09
Anticipated expiration: 2043-11-29
Also published as: CN117312330A

Abstract

The application relates to a vector data aggregation method and device based on scratch pad storage, and to computer equipment. The method comprises the following steps: reading the indices of target data in an off-chip memory, and writing the indices into a scratch pad memory through an execution unit to obtain a first index base address; calculating the address of the target data in the off-chip memory to obtain an off-chip memory address, the off-chip memory address comprising a base address of the target data and an offset index of the target data; acquiring the off-chip memory address through the execution unit, and writing it into the scratch pad memory to obtain an aggregation parameter packet; and storing the aggregation parameter packet in sequence according to the offset index and the first index base address, and transferring the aggregation parameter packet to a vector register. With this method, vector computation and vector data aggregation can be performed concurrently, which improves both the efficiency of aggregating vector data from on-chip memory into vector registers and the performance of application programs.

Description

Vector data aggregation method and device based on scratch pad storage, and computer equipment

Technical Field

The present disclosure relates to the field of high-performance memory access technology, and in particular to a method and an apparatus for aggregating vector data based on scratch pad storage, and to a computer device.

Background

To meet the growing demand for computing power under the power-consumption constraints of modern processors, current multi-core and many-core processors adopt vector unit technology and scratch pad memory (SPM) technology. On the one hand, a vector unit uses a single vector instruction to operate on a set of data elements. Before the vector unit can perform a computation, the required data must be loaded from off-chip memory into vector registers close to the computing unit; making full use of the vector unit can multiply the performance of an application program. On the other hand, a scratch pad memory is a fast-access memory bank located on the processor chip. Practical applications often need to load data elements scattered across different memory locations into a single vector register, which is known as a vector data aggregation (Vector Data Gather) operation. In conventional computer systems, vector data aggregation typically traverses the data in a loop and processes the elements one by one. Because of the large amount of computation involved, this approach is inefficient and struggles to meet the demands of large-scale data processing. For example, the non-zero elements of a sparse matrix are scattered; a conventional vector data aggregation method must access them in memory one at a time, often reading elements from different memory locations into the same vector register, and therefore offers low efficiency and limited parallelism when aggregating and accessing large volumes of data.
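By way of illustration only, the conventional element-by-element gather described above can be sketched as follows. The sketch assumes a word-addressed memory modeled as a Python list, so that the address of each element is simply the base address plus its index; the function name scalar_gather is hypothetical and does not come from the text.

```python
def scalar_gather(memory: list[int], base: int, indices: list[int]) -> list[int]:
    """Conventional gather: one separate memory access per scattered element."""
    result = []
    for idx in indices:                     # loop over the elements one by one
        result.append(memory[base + idx])   # address = base + index (word-addressed model)
    return result


memory = list(range(1000))                  # memory[a] == a, so each value equals its address
print(scalar_gather(memory, 100, [7, 2, 50, 3]))   # [107, 102, 150, 103]
```

Each iteration issues its own access, which is the serial pattern the method below aims to replace.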

Disclosure of Invention

In view of the foregoing, it is desirable to provide a method, an apparatus, and a computer device for aggregating vector data based on scratch pad storage, which can improve the efficiency of vector data aggregation processing.

A vector data aggregation method based on scratch pad storage is applied to a computer system comprising an execution unit, a vector register, a scratch pad memory, and an off-chip memory.

The method comprises the following steps:

Reading the indices of target data in the off-chip memory, and writing the indices into the scratch pad memory through the execution unit to obtain a first index base address.

Calculating the address of the target data in the off-chip memory to obtain an off-chip memory address, the off-chip memory address comprising a base address of the target data and an offset index of the target data.

Acquiring the off-chip memory address through the execution unit, and writing the off-chip memory address into the scratch pad memory to obtain an aggregation parameter packet.

Storing the aggregation parameter packet in sequence according to the offset index and the first index base address, and transferring the aggregation parameter packet to the vector register.
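For illustration only, the four steps above can be modeled in software as the following minimal sketch. It is not the claimed hardware implementation. It assumes word-addressed off-chip and scratch pad memories modeled as Python lists, an index array stored contiguously in off-chip memory, indices that are added directly to the base address, and no more elements than fit in one vector register. The names spm_gather, Packet, idx_addr, src, and spm_base are hypothetical.

```python
from dataclasses import dataclass

VR_BITS = 1024  # vector register width stated in the description


@dataclass
class Packet:
    spm_addr: int      # scratch pad slot holding the element
    offchip_addr: int  # source address of the element in off-chip memory
    element: int       # gathered data element


def spm_gather(offchip, idx_addr, n, src, spm, spm_base, elem_bits=64):
    """Word-addressed model of the four steps; returns (dst_vr, packets)."""
    lanes = VR_BITS // elem_bits
    if n > lanes:
        raise ValueError("this sketch fills a single vector register")
    # Step 1: stage the n indices in the scratch pad; their first slot is the first index base address.
    first_index_base = spm_base
    spm[first_index_base:first_index_base + n] = offchip[idx_addr:idx_addr + n]
    # Step 2: off-chip address of each element = base address (src) + index.
    addrs = [src + spm[first_index_base + i] for i in range(n)]
    # Step 3: fetch each element into consecutive scratch pad slots after the indices.
    data_base = first_index_base + n
    packets = []
    for i, addr in enumerate(addrs):
        spm[data_base + i] = offchip[addr]
        packets.append(Packet(data_base + i, addr, offchip[addr]))
    # Step 4: move the elements, in index order, into the destination vector register.
    dst_vr = [p.element for p in packets] + [0] * (lanes - n)
    return dst_vr, packets


offchip = list(range(1024))
offchip[0:4] = [7, 2, 50, 3]          # index array placed at off-chip address 0
spm = [0] * 64
dst_vr, packets = spm_gather(offchip, 0, 4, 200, spm, 0)
print(dst_vr[:4])                     # [207, 202, 250, 203]
```

Unused register lanes are zero-filled here purely for readability; the text does not say how partially filled registers are handled.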

In one embodiment, the target data comprise a target data element, an index of the target data, and a base address of the target data.

In one embodiment, the method further comprises: an application program reads a plurality of target data discretely distributed in the off-chip memory; the execution unit receives an aggregation instruction from the computer system and performs an extraction operation, extracting the indices of the target data from the off-chip memory and writing them into the scratch pad memory in sequence to obtain the first index base address.

In one embodiment, the method further comprises: adding the base address of the target data to each index of the target data elements in the off-chip memory, one by one, to obtain the off-chip memory addresses of the target data.

In one embodiment, the method further comprises: acquiring, through the execution unit, the off-chip memory address from the off-chip memory, and writing the offset index into the scratch pad memory to obtain a scratch pad memory address of the target data; and storing the target data elements in sequence according to the scratch pad memory addresses to obtain the aggregation parameter packet. The aggregation parameter packet comprises the scratch pad memory address, the off-chip memory address, and the target data element.

In one embodiment, the scratch pad memory transfers the aggregation parameter packet to the vector register in 1024-bit data units.

A vector data aggregation device based on scratch pad storage, the device comprising:

a first index base address acquisition module, configured to read the indices of target data in the off-chip memory and write them into the scratch pad memory through the execution unit to obtain a first index base address;

a source address acquisition module, configured to calculate the address of each target data element in the off-chip memory to obtain an off-chip memory address, the off-chip memory address comprising a base address of the target data and an offset index of the target data;

an aggregation parameter packet acquisition module, configured to acquire the off-chip memory address through the execution unit and write it into the scratch pad memory to obtain an aggregation parameter packet; and

an aggregation module, configured to store the aggregation parameter packets in sequence according to the offset index and the first index base address and to transfer them to the vector register.

A computer device comprises a memory storing a computer program and a processor which, when executing the computer program, performs the steps of the above vector data aggregation method.

In the above vector data aggregation method, device, and computer equipment based on scratch pad storage, the scratch pad memory stores the indices of the target data and the first index base address, which improves data access efficiency. When a given target data element needs to be accessed, only the corresponding entry needs to be looked up in the scratch pad memory through the index of the target data; the source address calculation then yields the actual address of the element in off-chip memory, so the element can be accessed directly. For large-batch data access, this effectively improves access and processing efficiency as well as parallelism.

Drawings

FIG. 1 is an application scenario diagram of a method of vector data aggregation based on scratch pad storage in one embodiment;

FIG. 2 is a flow diagram of a method of vector data aggregation based on scratch pad storage in one embodiment;

FIG. 3 is a diagram of vector data types in one embodiment, wherein FIG. 3 (a) is single word vector data and FIG. 3 (b) is half word vector data;

FIG. 4 is a block diagram of a vector data aggregation device based on scratch pad storage in one embodiment;

FIG. 5 is an internal structural diagram of a computer device in one embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.

The vector data aggregation method based on scratch pad storage provided in the present application can be applied to the computer system shown in FIG. 1. The system comprises a vector processor, an off-chip memory, a data bus, and other components; the vector processor comprises a vector control register, a vector execution unit, a vector register file, a scratch pad memory, and other components.

In one embodiment, as shown in FIG. 2, a vector data aggregation method based on scratch pad storage is provided. Taking its application to the system in FIG. 1 as an example, the method comprises the following steps:

Step 202: read the indices of target data in the off-chip memory, and write them into the scratch pad memory through the execution unit to obtain a first index base address.

The execution unit performs instruction operations according to the operation instructions received by the processor. The instruction operations include write, read, store, load, and transfer operations, among others. The target data include a data base address and the index addresses corresponding to that base address.

Specifically, the execution unit executes the received aggregation instruction and invokes the DMA (Direct Memory Access) component to read the indices of the plurality of target data discretely distributed in the off-chip memory and transfer them to the scratch pad memory, writing the indices into the scratch pad memory in sequence to obtain a plurality of consecutively arranged first index addresses.
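A minimal sketch of step 202 follows, under illustrative assumptions only: memories are modeled as word-addressed Python lists, the index array is assumed to be stored contiguously in off-chip memory (the text states only that the target data themselves are scattered), and dma_copy is a hypothetical stand-in for the DMA component rather than any real DMA interface.

```python
def dma_copy(src_mem, src_addr, dst_mem, dst_addr, n):
    """Hypothetical DMA transfer: copy n contiguous words from src_mem to dst_mem."""
    dst_mem[dst_addr:dst_addr + n] = src_mem[src_addr:src_addr + n]


def stage_indices(offchip, index_array_addr, n, spm, spm_base):
    """Step 202 sketch: write n indices into consecutive scratch pad slots.

    Returns the first index base address and the consecutive index addresses.
    """
    dma_copy(offchip, index_array_addr, spm, spm_base, n)
    first_index_base = spm_base
    index_addrs = list(range(first_index_base, first_index_base + n))
    return first_index_base, index_addrs


offchip = [0] * 16
offchip[8:12] = [5, 1, 9, 3]          # index array at off-chip address 8
spm = [0] * 8
base, addrs = stage_indices(offchip, 8, 4, spm, 2)
print(base, addrs, spm)               # 2 [2, 3, 4, 5] [0, 0, 5, 1, 9, 3, 0, 0]
```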

Step 204: calculate the address of the target data in the off-chip memory to obtain the off-chip memory address.

Specifically, the off-chip memory addresses of all target data elements are obtained from the offset of each element relative to the base address in the off-chip memory: the off-chip base address (src) of the target data is added, one by one, to each index of the target data elements, giving the off-chip memory address of each target data element. The off-chip memory address thus comprises the base address of the target data and the offset index of the target data.
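The address calculation of step 204 is an element-wise addition, which can be sketched as follows. The function name offchip_addresses and the elem_bytes parameter are hypothetical. The text specifies only the addition of src and each index; the optional scaling for byte-addressed memory with element-position indices is an assumption, included to show where the element size would enter if the indices were not already offsets.

```python
def offchip_addresses(src: int, indices: list[int], elem_bytes: int = 1) -> list[int]:
    """Off-chip address of each target element: base address (src) plus its index.

    elem_bytes=1 reproduces the plain addition described in the text.
    elem_bytes=8 (for example) treats each index as an element position in
    byte-addressed memory; that scaling is an illustrative assumption.
    """
    return [src + idx * elem_bytes for idx in indices]


print(offchip_addresses(0x1000, [7, 2, 50, 3]))   # [4103, 4098, 4146, 4099]
```

Because every address depends only on src and its own index, the additions are independent of one another and can be performed in parallel.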

Step 206: acquire the off-chip memory address through the execution unit, and write it into the scratch pad memory to obtain the aggregation parameter packet.

Specifically, after the execution unit receives an extraction instruction from the computer system, the DMA (Direct Memory Access) component performs the extraction operation: it obtains the off-chip memory addresses from the off-chip memory and writes the offset indices into the scratch pad memory, yielding the scratch pad memory addresses of the target data. The target data elements are then stored contiguously, in order, at these scratch pad memory addresses to form the aggregation parameter packet, which comprises the scratch pad memory address, the off-chip memory address, and the target data element.
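A minimal sketch of how the aggregation parameter packets of step 206 might be assembled is given below, assuming word-addressed memories modeled as Python lists and elements placed in consecutive scratch pad slots starting at a hypothetical data_base. The AggregationPacket fields mirror the three components named above; the class and function names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AggregationPacket:
    spm_addr: int       # scratch pad slot that holds the element
    offchip_addr: int   # off-chip address the element was read from
    element: int        # the target data element itself


def build_packets(offchip, offchip_addrs, spm, data_base):
    """Fetch each scattered element into consecutive scratch pad slots."""
    packets = []
    for i, addr in enumerate(offchip_addrs):
        spm_addr = data_base + i            # consecutive slots, in index order
        spm[spm_addr] = offchip[addr]       # read the scattered element
        packets.append(AggregationPacket(spm_addr, addr, offchip[addr]))
    return packets


offchip = list(range(100, 200))             # offchip[a] == 100 + a
spm = [0] * 16
packets = build_packets(offchip, [7, 2, 50], spm, data_base=4)
print(packets[0])   # AggregationPacket(spm_addr=4, offchip_addr=7, element=107)
```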

Step 208: store the aggregation parameter packet in sequence according to the offset index and the first index base address, and transfer the aggregation parameter packet to the vector register.

It should be noted that the execution unit executes the received aggregation instruction, loads the aggregation parameter packet from the scratch pad memory, and performs a transfer operation that moves the packet to the destination vector register (dst_vr).
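The transfer to dst_vr can be sketched as copying a contiguous scratch pad region into 1024-bit (128-byte) register images. This is an illustrative model only: the scratch pad is modeled as a byte buffer, the function name move_to_vector_registers is hypothetical, and zero-padding of a final partial unit is an assumption not stated in the text.

```python
VR_BYTES = 1024 // 8   # a 1024-bit vector register holds 128 bytes


def move_to_vector_registers(spm: bytes, start: int, nbytes: int) -> list[bytes]:
    """Copy nbytes of scratch pad data, one 1024-bit unit at a time.

    A final partial unit is zero-padded to 128 bytes (assumed behavior).
    """
    regs = []
    for off in range(0, nbytes, VR_BYTES):
        end = min(off + VR_BYTES, nbytes)
        chunk = bytes(spm[start + off:start + end])
        regs.append(chunk.ljust(VR_BYTES, b"\x00"))
    return regs


spm = bytearray(range(256))
regs = move_to_vector_registers(spm, 0, 200)
print(len(regs), [len(r) for r in regs])   # 2 [128, 128]
```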

In the above vector data aggregation method, device, and computer equipment based on scratch pad storage, the scratch pad memory stores the indices of the target data and the first index base address, which improves data access efficiency. When a given target data element needs to be accessed, only the corresponding entry needs to be looked up in the scratch pad memory through the index of the target data; the source address calculation then yields the actual address of the element in off-chip memory, so the element can be accessed directly. For large-batch data access, this effectively improves access and processing efficiency as well as parallelism.

In one embodiment, an application program reads a plurality of target data discretely distributed in the off-chip memory; the execution unit receives an aggregation instruction from the computer system and performs an extraction operation, extracting the indices of the target data from the off-chip memory and writing them into the scratch pad memory in sequence to obtain the first index base address.

It should be noted that extracting the indices of the target data from the off-chip memory and writing them into the scratch pad memory in sequence yields the first index base address. During the subsequent vector data aggregation operation, the actual address of each target data element can be calculated directly from its index and the first index base address, so the data in the off-chip memory can be accessed directly without repeated off-chip lookups. This reduces the time and power consumed by data access and improves access efficiency. In addition, the extracted target data are stored in a vector register, enabling efficient vector data aggregation: a vector register holds multiple data elements at once and supports parallel operation, which improves computational efficiency. The improved access efficiency further reduces computation latency and power consumption.

In one embodiment, the base address of the target data is added, one by one, to each index of the target data elements in the off-chip memory to obtain the off-chip memory addresses of the target data.

It should be noted that the off-chip memory address of each target data element is obtained by summing its offset index and the base address of the target data. Consequently, locating a target data element does not require a separate off-chip memory access to obtain its address: the address is computed, and the element is then accessed directly. This reduces the number of off-chip memory accesses, which increases access speed and reduces power consumption. Calculating the off-chip memory addresses of the target data and storing the gathered data in the vector register thus enables efficient vector data aggregation.

In one embodiment, the execution unit obtains the off-chip memory address from the off-chip memory and writes the offset index into the scratch pad memory to obtain the scratch pad memory address of the target data. The target data elements are stored in sequence according to the scratch pad memory addresses to obtain the aggregation parameter packet, which comprises the scratch pad memory address, the off-chip memory address, and the target data element.

It should be noted that extracting the off-chip memory addresses of the target data from the off-chip memory reduces the number of memory accesses. Once the scratch pad memory addresses are obtained, the target data elements are stored sequentially in the scratch pad memory; following the principle of locality, adjacent data are placed in adjacent storage units, which improves memory access efficiency and reduces access latency and power consumption. Storing the scratch pad memory address, the off-chip memory address, and the target data element together in the aggregation parameter packet therefore makes retrieving, moving, and aggregating target data elements between memories smoother and more efficient. Because the vector register holds multiple data elements at once and supports parallel operation, and because data access is more efficient, computation latency and power consumption are further reduced and the performance of the whole computing system is improved.

It should be noted that, as shown in FIG. 3, the vector data may consist of 16 single-word elements or 32 half-word elements: in FIG. 3 (a), each of the 16 single-word elements occupies 64 bits, and in FIG. 3 (b), each of the 32 half-word elements occupies 32 bits. The vector register is 1024 bits wide and can hold either 16 single-word elements (16 × 64 = 1024 bits) or 32 half-word elements (32 × 32 = 1024 bits), so the scratch pad memory transfers the aggregation parameter packet to the vector register in 1024-bit data units. Because vector registers are accessed much faster than off-chip memory, transferring the aggregation parameter packet directly into the vector register greatly reduces data access latency and memory bandwidth consumption, improving the efficiency and performance of the aggregation operation. Moreover, the vector register supports vector instructions, allowing the vector execution unit to perform multiple identical or similar operations in parallel, which further improves the efficiency of the aggregation operation.
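The two lane layouts of FIG. 3 can be checked with the following sketch, which packs 16 64-bit values or 32 32-bit values into the same 128-byte (1024-bit) register image using Python's standard struct module. Little-endian lane order and the helper names lane_format, pack_register, and unpack_register are assumptions made for illustration; the text does not specify byte order.

```python
import struct

VR_BITS = 1024                      # vector register width from the description
LANE_CODES = {64: "Q", 32: "I"}     # 64-bit single word, 32-bit half word


def lane_format(lane_bits: int) -> str:
    lanes = VR_BITS // lane_bits    # 16 lanes of 64 bits, or 32 lanes of 32 bits
    return f"<{lanes}{LANE_CODES[lane_bits]}"   # little-endian lanes (assumed)


def pack_register(values, lane_bits):
    """Pack exactly one register's worth of lanes (struct.error if the count is wrong)."""
    return struct.pack(lane_format(lane_bits), *values)


def unpack_register(reg, lane_bits):
    return list(struct.unpack(lane_format(lane_bits), reg))


single = pack_register(list(range(16)), 64)   # FIG. 3 (a): 16 x 64-bit
half = pack_register(list(range(32)), 32)     # FIG. 3 (b): 32 x 32-bit
print(len(single), len(half))                 # 128 128
print(unpack_register(single, 32)[:4])        # [0, 0, 1, 0]
```

Reinterpreting the 16-lane image as 32 lanes shows that both layouts occupy exactly the same 1024 bits, which is why a single 1024-bit transfer unit serves both data types.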

It should be understood that, although the steps in the flowcharts of FIGS. 1-2 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in FIGS. 1-2 may include multiple sub-steps or stages that are not necessarily completed at the same time but may be performed at different times; these sub-steps or stages are not necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in FIG. 4, a vector data aggregation device based on scratch pad storage is provided, comprising a first index base address acquisition module 402, a source address acquisition module 404, an aggregation parameter packet acquisition module 406, and an aggregation module 408, wherein:

the first index base address acquisition module 402 is configured to read the indices of target data in the off-chip memory and write them into the scratch pad memory through the execution unit to obtain a first index base address;

the source address acquisition module 404 is configured to calculate the address of each target data element in the off-chip memory to obtain an off-chip memory address, the off-chip memory address comprising a base address of the target data and an offset index of the target data;

the aggregation parameter packet acquisition module 406 is configured to acquire the off-chip memory address through the execution unit and write it into the scratch pad memory to obtain an aggregation parameter packet; and

the aggregation module 408 is configured to store the aggregation parameter packets in sequence according to the offset index and the first index base address and to transfer them to the vector register.

For specific limitations on the vector data aggregation device based on scratch pad storage, reference may be made to the limitations on the vector data aggregation method based on scratch pad storage above, which are not repeated here. The modules of the device may be implemented wholly or partly in software, hardware, or a combination thereof. Each module may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in software form in a memory of the computer device, so that the processor can invoke it and perform the corresponding operations.

In one embodiment, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in FIG. 5. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a vector data aggregation method based on scratch pad storage. The display screen of the computer device may be a liquid crystal display or an electronic ink display, and the input device may be a touch layer covering the display screen, a key, trackball, or touch pad on the housing of the computer device, or an external keyboard, touch pad, or mouse.

It will be appreciated by those skilled in the art that the structures shown in fig. 4-5 are block diagrams of only some of the structures associated with the present application and are not intended to limit the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is provided comprising a memory storing a computer program and a processor that when executing the computer program performs the steps of:

Calculating the address of the target data element in the off-chip memory to obtain the off-chip memory address, the off-chip memory address comprising a base address of the target data and an offset index of the target data.

In one embodiment, the processor, when executing the computer program, further performs the following steps: an application program reads a plurality of target data discretely distributed in the off-chip memory; the execution unit receives an aggregation instruction from the computer system and performs an extraction operation, extracting the indices of the target data from the off-chip memory and writing them into the scratch pad memory in sequence to obtain the first index base address.

In one embodiment, the processor, when executing the computer program, further performs the following step: adding the base address of the target data to each index of the target data elements in the off-chip memory, one by one, to obtain the off-chip memory addresses of the target data.

In one embodiment, the processor, when executing the computer program, further performs the following steps: acquiring, through the execution unit, the off-chip memory address from the off-chip memory, and writing the offset index into the scratch pad memory to obtain the scratch pad memory address of the target data; and storing the target data elements in sequence according to the scratch pad memory addresses to obtain the aggregation parameter packet, which comprises the scratch pad memory address, the off-chip memory address, and the target data element.

Those skilled in the art will appreciate that all or part of the above methods may be implemented by a computer program instructing the relevant hardware; the computer program may be stored on a non-volatile computer-readable storage medium and, when executed, may include the steps of the method embodiments described above. Any reference to memory, storage, a database, or another medium in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include Read-Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory may include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), SyncLink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to be within the scope of this specification.

The foregoing embodiments represent only several implementations of the present application and are described specifically and in detail, but they should not be construed as limiting the scope of the invention. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present application, and these all fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims

1. A vector data aggregation method based on scratch pad storage, characterized in that it is applied to a computer system, the computer system comprising an execution unit, a vector register, a scratch pad memory, and an off-chip memory;

the method comprises the following steps:

reading an index of target data in the off-chip memory, and writing the index into the scratch pad memory through the execution unit to obtain a first index base address;

calculating the address of the target data in the off-chip memory to obtain an off-chip memory address, the off-chip memory address comprising a base address of the target data and an offset index of the target data;

adding the base address of the target data to each index of the target data elements in the off-chip memory one by one to obtain the off-chip memory address of the target data;

acquiring the off-chip memory address through the execution unit, and writing the off-chip memory address into the scratch pad memory to obtain an aggregation parameter packet;

acquiring the off-chip memory address from the off-chip memory through the execution unit, and writing the offset index into the scratch pad memory to obtain a scratch pad memory address of the target data;

sequentially storing the target data elements according to the scratch pad memory addresses to obtain the aggregation parameter packet, the aggregation parameter packet comprising the scratch pad memory address, the off-chip memory address, and the target data element;

2. The method of claim 1, wherein the target data comprises: a target data element, an index of the target data, and a base address of the target data.

3. The method of claim 2, wherein reading an index of target data in the off-chip memory and writing the index into the scratch pad memory through the execution unit to obtain a first index base address comprises:

an application program reading a plurality of target data discretely distributed in the off-chip memory, and the execution unit receiving an aggregation instruction from the computer system and performing an extraction operation, whereby the indices of the target data are extracted from the off-chip memory and written into the scratch pad memory in sequence to obtain the first index base address.

4. The method according to any one of claims 1 to 3, wherein the scratch pad memory transfers the aggregation parameter packet to the vector register in 1024-bit data units.

5. A vector data aggregation device based on scratch pad storage, characterized in that it comprises:

a first index base address acquisition module, configured to read an index of target data in the off-chip memory and write the index into the scratch pad memory through the execution unit to obtain a first index base address;

a source address acquisition module, configured to calculate the address of the target data in the off-chip memory to obtain an off-chip memory address, the off-chip memory address comprising a base address of the target data and an offset index of the target data, by adding the base address of the target data to each index of the target data elements in the off-chip memory one by one to obtain the off-chip memory address of the target data;

an aggregation parameter packet acquisition module, configured to acquire the off-chip memory address through the execution unit and write it into the scratch pad memory to obtain an aggregation parameter packet; and

an aggregation module, configured to store the aggregation parameter packets in sequence according to the offset index and the first index base address and to transfer the aggregation parameter packets to a vector register.

6. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 4 when the computer program is executed.