CN109919826A - Graph data compression method for a graph computation accelerator, and graph computation accelerator - Google Patents

Graph data compression method for a graph computation accelerator, and graph computation accelerator

Info

Publication number
CN109919826A
CN109919826A
Authority
CN
China
Prior art keywords
data
index
column
vertex
buffer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910107925.9A
Other languages
Chinese (zh)
Other versions
CN109919826B (en)
Inventor
邓军勇 (Deng Junyong)
莉兹·K·约翰 (Lizy K. John)
宋爽 (Song Shuang)
邬沁哲 (Wu Qinzhe)
杨博文 (Yang Bowen)
田璞 (Tian Pu)
赵一迪 (Zhao Yidi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Posts and Telecommunications
University of Texas System
Original Assignee
Xian University of Posts and Telecommunications
University of Texas at Austin
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Posts and Telecommunications, University of Texas at Austin filed Critical Xian University of Posts and Telecommunications
Priority to CN201910107925.9A priority Critical patent/CN109919826B/en
Publication of CN109919826A publication Critical patent/CN109919826A/en
Application granted granted Critical
Publication of CN109919826B publication Critical patent/CN109919826B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Complex Calculations (AREA)

Abstract

The present invention discloses a graph data compression method for a graph computation accelerator, and a graph computation accelerator. The method comprises: S1, the preprocessing circuit of the graph computation accelerator converts the graph data to be processed, represented as a sparse adjacency matrix, into graph data in the Compressed Sparse Column Independently (CSCI) format; the graph data of each independently compressed column comprises a column-identifier data pair and nonzero-element data pairs, each data pair comprising an index and a value, with the two highest bits of the index indicating the meaning of the remaining index bits and of the value; S2, the preprocessing circuit of the graph computation accelerator stores the converted CSCI-format graph data in the memory of the graph computation accelerator. The compression method of the invention improves the parallelism and efficiency of the graph computation accelerator.

Description

Graph data compression method for a graph computation accelerator, and graph computation accelerator
Technical field
The present invention relates to data compression techniques, and in particular to a graph data compression method for a graph computation accelerator, and to a graph computation accelerator.
Background technique
With the rise of novel Internet applications such as social networks and the popularization of various electronic devices, graph computation, especially large-scale graph computation, has become a research hotspot in academia and industry. From the perspectives of technology, application, and independent intellectual property, the research and development of graph computation accelerators is imperative.
At present, many graph computation accelerators have been designed in industry. These accelerators must adopt an efficient compression format for graph data in order to improve the parallelism and efficiency of computation.
Summary of the invention
The object of the present invention is to provide a graph data compression method for a graph computation accelerator, and a graph computation accelerator, that improve the parallelism and efficiency of the accelerator.
To achieve the above object, the main technical scheme adopted by the present invention is as follows:
The graph data compression method for a graph computation accelerator provided by the present invention comprises:
S1. The preprocessing circuit of the graph computation accelerator converts the graph data to be processed, represented as a sparse adjacency matrix, into graph data in the Compressed Sparse Column Independently (CSCI) format; the graph data of each independently compressed column comprises a column-identifier data pair followed by nonzero-element data pairs, each data pair comprising an index and a value, with the two highest bits of the index indicating the meaning of the remaining index bits and of the value;
S2. The preprocessing circuit of the graph computation accelerator stores the converted CSCI-format graph data in the memory of the graph computation accelerator.
As a further improvement of the present invention, step S1 comprises:
compressing the graph data represented by the sparse adjacency matrix column by column, independently, into data pairs;
each data pair having the structure: an index and a value;
a data pair whose two highest index bits are "01" or "10" being a column identifier (ioc);
the data pairs following a column-identifier data pair being the data pairs corresponding to the nonzero elements of all rows of that column.
As a further improvement of the present invention, when the two highest index bits are "01", the remaining index bits indicate the column index, and the value indicates the number of nonzero elements of that column in the sparse adjacency matrix;
when the two highest index bits are "10", the remaining index bits indicate the column index, the column is the last column of the sparse adjacency matrix, and the value indicates the number of nonzero elements of that column;
when the two highest index bits are "00", the remaining index bits indicate the row index, and the value indicates the corresponding nonzero element value in the sparse adjacency matrix.
As a further improvement of the present invention, the bit widths of the index and the value are determined according to the amount of data in the sparse adjacency matrix.
In another aspect, the present invention also provides a graph computation accelerator, comprising a preprocessing circuit and a memory;
the preprocessing circuit performs conversion processing on the sparse adjacency matrix data according to the compression method of any of claims 1 to 4.
As a further improvement of the present invention, the accelerator further comprises:
a control circuit, a data access unit, a scheduler, a mixed-granularity processing unit, and a result generating unit;
wherein the preprocessing circuit is further configured to store a copy of the column identifiers of the CSCI data in the memory;
the control circuit is configured to receive the conversion-ready indication signal sent by the preprocessing circuit after it finishes storing data in the memory, to control the operation of the data access unit, the mixed-granularity processing unit, and the result generating unit according to the application type of the graph computation sent by the host, and to send to the data access unit the root vertex index of application type one or the source vertex index of application type two sent by the host;
the data access unit is configured to read the CSCI graph data and column identifiers from the memory, to compute, according to the root vertex index, the source vertex index, or the active vertex index sent by the result generating unit, the physical address in the memory of the data of the specified vertex for data access, and to transfer the read data to the scheduler;
the scheduler is configured to buffer data according to the number of nonzero elements indicated by the column identifiers in the CSCI data, and, according to the status signals of the processing elements in the mixed-granularity processing unit, to assign the buffered data to processing elements in the mixed-granularity processing unit for processing;
the mixed-granularity processing unit is configured to process in parallel the data buffered in the scheduler, according to the application type from the control circuit and the active vertex data from the result generating unit, and to transmit the processed intermediate data to the result generating unit;
the result generating unit is configured to process the intermediate data according to the application type from the control circuit, to send the active vertex indices produced during processing to the data access unit, and to store the processed final result.
As a further improvement of the present invention, the control circuit comprises a host interface component and a control logic component;
the host interface component is configured to receive the application type, the root vertex index of application type one, and the source vertex index of application type two sent by the host;
the control logic component is configured to receive the conversion-ready indication signal sent by the preprocessing circuit, to send the root vertex index or source vertex index to the data access unit, to send the application type to the mixed-granularity processing unit and the result generating unit, and to start each module in the graph computation accelerator;
wherein application type one is the breadth-first search (BFS) application type, and application type two is the single-source shortest path (SSSP) application type.
As a further improvement of the present invention, the data access unit comprises: a user logic component, an address calculation module, and a column identifier buffer;
the column identifier buffer is configured to store the column identifiers of the CSCI graph data;
the address calculation module is configured to compute, according to the vertex index input by the control circuit or the result generating unit, combined with the number of nonzero-element data of each column held in the column identifier buffer and the amount of data stored per memory row, the physical address in the memory of the data corresponding to the current active vertex i;
the user logic component is configured to read the column identifiers from the memory and buffer them in the column identifier buffer, to read from the memory, at the address computed by the address calculation module, the data corresponding to the active vertex, and to send the read data to the scheduler;
to stop reading data from the memory after receiving a pause-read signal sent by the scheduler;
and to resume reading data after the pause-read signal from the scheduler is deasserted.
As a further improvement of the present invention, the scheduler comprises: a buffer allocation module, a task scheduling module, and a double-buffer module;
the buffer allocation module is configured to analyze the data pairs corresponding to the column identifier of the column data transferred from the data access unit, to send the column identifier together with the corresponding graph data to the double-buffer module according to the buffer status information transmitted by the double-buffer module, and to send a stop-read signal to the data access unit when all buffers in the double-buffer module are occupied;
the task scheduling module is configured, according to the processing-element status signals transmitted by the mixed-granularity processing unit and the buffer status information transmitted by the double-buffer module, to dispatch the unscheduled data in all buffers to idle processing elements whose computing capacity meets the requirement;
the double-buffer module comprises multiple groups of front/back double buffers of different capacities;
the double-buffer module is configured to notify the task scheduling module to dispatch the graph data cached in a buffer when that buffer's state is set to "full", to set the state of a buffer whose data has been dispatched to "empty", and to send the buffer states to the buffer allocation module.
As a further improvement of the present invention, the mixed-granularity processing unit comprises: an auxiliary circuit module and a processor array;
the auxiliary circuit module is configured, according to the state of each processing element in the processor array, to transfer the active vertex data pair input by the result generating unit and the corresponding CSCI data input by the scheduler to a corresponding idle processing element in the processor array;
the processor array is composed of processing elements (PEs) of multiple different capacities, with multiple processing elements working concurrently;
after receiving the active vertex data pair and the CSCI data input by the auxiliary circuit module, each processing element computes on the active vertex data pair and the CSCI data according to the application type transmitted by the control circuit.
As a further improvement of the present invention, each processing element is specifically configured to:
when the application type transmitted by the control circuit is breadth-first search, add 1 to the value of the active vertex data pair and assign the result to the value of each data pair in the CSCI data;
when the application type is single-source shortest path, add the value of the active vertex data pair to the value of each data pair in the CSCI data and use the sum to update the value of each data pair;
the computation results of the processing elements are output to the result generating unit, and the maximum number of data pairs contained in a result equals the number of data pairs processed simultaneously by each processing element.
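The two per-element operations above can be sketched as a behavioral model of one processing element (a software sketch under stated assumptions: the function name and the tuple-based input format are illustrative, and the hardware processes the pairs in parallel rather than in a loop):

```python
def pe_compute(app_type, active_pair, csci_pairs):
    """Behavioral sketch of one processing element (PE).
    active_pair: (row_index, value) of the active vertex.
    csci_pairs: the nonzero-element (row_index, value) pairs of its column."""
    _, active_value = active_pair
    if app_type == 'BFS':
        # BFS: active vertex's value + 1, assigned to every pair in the column
        return [(r, active_value + 1) for r, _ in csci_pairs]
    if app_type == 'SSSP':
        # SSSP: active vertex's value + edge weight, per pair
        return [(r, active_value + w) for r, w in csci_pairs]
    raise ValueError('unknown application type')
```

For example, with an active vertex at depth 0 and a column containing rows 3 and 4, the BFS result marks both rows with candidate depth 1.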
As a further improvement of the present invention, the result generating unit comprises: an operation module, a comparator, and an on-chip result buffer;
the operation module comprises an 8-way, 4-stage pipelined tree composed of 15 operating units;
the computing capacity of each operating unit equals the maximum number of data pairs contained in its input data;
each operating unit computes on the intermediate data input by the mixed-granularity processing unit according to the application type input by the control circuit;
the comparator is configured, for the data input by the operation module, to read one by one from the on-chip result buffer the previous value corresponding to the row index of each data pair processed by the operating units, and to compare it with the current input value; if the current value is not smaller than the previous value, no operation is performed and the next row index is processed; if the current value is smaller, the value of that row index in the on-chip result buffer is updated, the vertex corresponding to the row index is marked as an active vertex, the row index is output to the data access unit, and the row index and value data pair is output to the mixed-granularity processing unit;
the on-chip result buffer is configured to buffer the depth/distance of each vertex.
As a further improvement of the present invention, each operating unit is specifically configured such that, for both application type one and application type two, the computation performed is a comparison operation: the values of the two input data pairs with identical row indices are compared, and the smaller value is output to the next stage as the new value for that row index, until output from the last-stage op_cell, which is then input to the comparator.
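The stage-by-stage comparison behaves like a pairwise min-merge tree. A list-based sketch (not cycle-accurate; the 15 units presumably form levels of 8, 4, 2, and 1, and the first level, one unit per lane, is folded into the inputs here):

```python
def op_cell(a, b):
    """One operating unit: merge two lanes of (row_index, value) pairs,
    keeping the smaller value for any row index present in both."""
    merged = dict(a)
    for r, v in b:
        merged[r] = min(merged.get(r, v), v)
    return sorted(merged.items())

def op_tree(lanes):
    """Reduce 8 input lanes through the tree levels (8 -> 4 -> 2 -> 1)."""
    while len(lanes) > 1:
        lanes = [op_cell(lanes[i], lanes[i + 1]) for i in range(0, len(lanes), 2)]
    return lanes[0]
```

The single output lane is what the comparator then checks against the on-chip result buffer.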
As a further improvement of the present invention, the address calculation module is further configured to determine the physical address PhyAddr according to formula one, from the base address BaseAddr of the CSCI data in the memory;
Formula one: PhyAddr = BaseAddr + (nnz_c0 + nnz_c1 + ... + nnz_ci) / RowSize;
nnz_ci denotes the number of nonzero elements of the corresponding column in the sparse adjacency matrix, and i is the column index;
RowSize denotes the number of bytes of data stored per memory row.
The beneficial effects of the present invention are:
The graph data compressed by the graph data compression method of the invention is applied to a graph computation accelerator, whereby the accelerator efficiently realizes the two graph computation applications BFS and SSSP, improves effective bandwidth and parallelism, and accelerates processing.
Detailed description of the invention
Figure 1A is a structural schematic diagram of a graph computation accelerator provided by an embodiment of the invention;
Figure 1B is a schematic diagram of the compression process of graph data for the graph computation accelerator in the present invention;
Fig. 2 is a structural schematic diagram of the control circuit in Figure 1A;
Fig. 3 is a circuit structure diagram of the data access unit in Figure 1A;
Fig. 4 is a circuit structure diagram of the scheduler in Figure 1A;
Fig. 5 is a circuit structure diagram of the mixed-granularity processing unit in Figure 1A;
Fig. 6 is a circuit structure diagram of the result generating unit in Figure 1A;
Fig. 7 is a circuit structure diagram of the operation module in Fig. 6.
Specific embodiment
For a better explanation and understanding of the present invention, the invention is described in detail below through specific embodiments with reference to the accompanying drawings.
At present, the management of large-scale graph data can adopt a variety of data models, which can be divided into simple graph models and hypergraph models according to the number of vertices an edge can connect. The present invention is oriented to the simple graph model, i.e., an edge can connect only two vertices, and loops may exist. Graphs in the real world usually have an average degree (the ratio of the number of edges to the number of vertices) of only several to several hundred, which appears extremely sparse compared with vertex scales that easily reach tens of millions or even hundreds of millions, and the degrees follow a power-law distribution.
A simple graph model can be expressed in sparse adjacency matrix form. Because of their large scale, graph data are mostly stored in memory in a compressed format; compressed formats include CSC (Compressed Sparse Column), CSR (Compressed Sparse Row), COO (Coordinate List), DCSC (Doubly Compressed Sparse Column), and the CSCI (Compressed Sparse Column Independently) format referred to in the present invention.
Breadth-first search (BFS) is a basic graph search algorithm and the basis of many important graph algorithms. BFS starts from a given vertex, called the root, and iteratively searches all reachable vertices in the graph, computing the depth from the root to every reachable vertex, i.e., the minimum number of edges. At initialization, the depth of the root vertex is set to 0 and the root is marked active; the depth of every other vertex is set to infinity. In the t-th iteration, the depth of a vertex v adjacent to an active vertex is computed by the following formula. If the depth of a vertex is updated from infinity to t+1, that vertex is marked active for the next iteration. This repeats until the search terminates.
depth(v) = min(depth(v), t + 1)
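The level-synchronous iteration just described can be sketched in Python (a software sketch of the algorithm, not of the accelerator circuit; the adjacency-list input format and function name are illustrative):

```python
from math import inf

def bfs_depth(adj, root):
    """Level-synchronous BFS. adj[u] is the list of vertices adjacent to u."""
    depth = {v: inf for v in adj}
    depth[root] = 0
    active = {root}            # the root is the only active vertex initially
    t = 0
    while active:
        nxt = set()
        for u in active:
            for v in adj[u]:
                # depth(v) = min(depth(v), t + 1)
                if depth[v] > t + 1:
                    depth[v] = t + 1
                    nxt.add(v)  # updated vertices are active next round
        active = nxt
        t += 1
    return depth
```

Run on a small diamond-shaped graph rooted at A, the vertices at one hop get depth 1 and the far vertex gets depth 2.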
Single-source shortest path (SSSP) is used to compute the shortest-path distance from a specified source vertex to all reachable vertices in a given graph. At initialization, the distance of the source vertex is set to 0 and the source is marked active; the distance of every other vertex is set to infinity. In the t-th iteration, assuming the weight of the edge from vertex u to vertex v is w(u, v), the shortest-path distance from the source vertex to vertex v is computed by the following formula. If the distance of a vertex is updated, that vertex is marked active for the next iteration. This repeats until all reachable vertices are completed.
distance(v) = min(distance(v), distance(u) + w(u, v)).
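The same active-vertex iteration applies to SSSP; a sketch (again a software model of the relaxation rule, with an illustrative weighted-adjacency-list input format):

```python
from math import inf

def sssp_distance(wadj, src):
    """Iterative SSSP. wadj[u] is a list of (v, w) pairs for edges u -> v."""
    dist = {v: inf for v in wadj}
    dist[src] = 0
    active = {src}             # the source is the only active vertex initially
    while active:
        nxt = set()
        for u in active:
            for v, w in wadj[u]:
                # distance(v) = min(distance(v), distance(u) + w(u, v))
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
                    nxt.add(v)  # updated vertices are active next round
        active = nxt
    return dist
```

With edges A→B (weight 2), A→C (weight 5), B→C (weight 1), the distance to C is relaxed from 5 down to 3 in the second round.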
The graph computation accelerator of the invention is a circuit structure for graph computation that uses independent sparse column compression and a mixed-granularity processing unit for parallel acceleration, and can run the two graph computation applications BFS and SSSP. The structure and working principle of the graph computation accelerator of the invention are described in detail below with reference to Figures 1A to 7.
As shown in Figure 1A, the graph computation accelerator structure of the embodiment of the present invention comprises a preprocessing circuit (CSCIU, Compressed Sparse Column Independently Unit), a control circuit (CTR, ConTRoller), a data access unit (DAU, Data Accessing Unit), a scheduler (SCD, SCheDuler), a mixed-granularity processing unit (MGP, Mixed-Granularity Processing unit), and a result generating unit (RGU, Result Generating Unit).
The memory in the graph computation accelerator structure can be regarded as a general-purpose memory, used to store graph-computation-related data.
The preprocessing circuit (CSCIU) converts the input sparse-adjacency-matrix graph data into the Compressed Sparse Column Independently (CSCI) format and stores it in the memory, while also storing in the memory a copy of the column identifiers of the CSCI data, that is, a copy of the number of nonzero elements of each column carried in the column identifiers (ioc).
The input of the preprocessing circuit is the original input of the external structure. In addition, a simple graph has many possible representations; for this reason, the preprocessing circuit in the present application converts graph data in adjacency matrix form.
In the present embodiment, in the compressed CSCI format of the graph data, the index of a data pair (index, value) can be represented with 32 bits and the value with 16 bits. The concrete meanings of index and value are shown in Table 1, where a data pair whose index[31:30] is "01" or "10" is a column identifier (ioc). The copy of the number of nonzero elements of each column carried in the column identifiers is stored column by column in the memory.
Table 1. Meanings of data pairs in the CSCI format

index[31:30]  index[29:0]                      value[15:0]
"01"          column index (not last column)   number of nonzero elements of the column
"10"          column index (last column)       number of nonzero elements of the column
"00"          row index of a nonzero element   value of the nonzero element
After the storage is completed, the preprocessing circuit issues a conversion-ready indication signal to the control circuit.
In the present embodiment, the sparse adjacency matrix representing the graph data is compressed by the CSCI format column by column, independently, into data pairs (index, value).
For convenience of explanation, it is assumed here that index is represented by 32 bits and value by 16 bits; a concrete application can determine the bit widths of index and value according to the actual scale of the graph data. The concrete meanings of index and value are shown in Table 1, where a data pair whose index[31:30] is "01" or "10" is a column identifier (ioc).
A data pair whose two highest index bits are "01" or "10" is a column identifier, ioc (indicator of column); the data pairs following each column-identifier data pair are the data pairs corresponding to the nonzero elements of all rows of that column;
when the two highest index bits are "01", the remaining index bits indicate the column index, and the value indicates the number of nonzero elements of that column in the sparse adjacency matrix;
when the two highest index bits are "10", the remaining index bits indicate the column index, the column is the last column of the sparse adjacency matrix, and the value indicates the number of nonzero elements of that column; when the two highest index bits are "00", the remaining index bits indicate the row index, and the value indicates the corresponding nonzero element value in the sparse adjacency matrix.
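These three cases can be illustrated by a small decoder (a sketch assuming the 32-bit index / 16-bit value widths of the embodiment; the function name and returned tags are illustrative):

```python
def decode_pair(index, value):
    """Classify a CSCI (index, value) pair by its two highest index bits."""
    tag = (index >> 30) & 0b11
    rest = index & 0x3FFFFFFF              # index[29:0]
    if tag == 0b01:
        return ('column', rest, value)       # column index, nonzero count
    if tag == 0b10:
        return ('last_column', rest, value)  # last column of the matrix
    if tag == 0b00:
        return ('element', rest, value)      # row index, nonzero value
    raise ValueError('tag "11" is not defined in the format')
```

For instance, the pair (0b01 << 30 | 1, 2) decodes as "column 1 with 2 nonzero elements".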
The CSCI compressed format is illustrated below. For convenience of explanation, consider the simple graph shown in Figure 1B, which contains six vertices A, B, C, D, E, F; the weight of each edge is marked in Figure 1B. The sparse adjacency matrix M corresponding to Figure 1B is expressed as follows: since Figure 1B contains six vertices, the adjacency matrix is a 6 × 6 matrix; the row and column indices of the matrix start from 1, and a dash in the matrix indicates that the corresponding two vertices are not connected, with weight 0;
Compress the 1st column: the column is not the last column of the matrix and has 2 nonzero elements, located at the 3rd row and the 4th row; therefore the column is compressed as:
(0100_0000_0000_0000_0000_0000_0000_0001,0000_0000_0000_0010)
(0000_0000_0000_0000_0000_0000_0000_0011,0000_0000_0000_0011)
(0000_0000_0000_0000_0000_0000_0000_0100,0000_0000_0000_0010)
Compress the 2nd column: the column is not the last column of the matrix and has 1 nonzero element, located at the 1st row; therefore the column is compressed as:
(0100_0000_0000_0000_0000_0000_0000_0010,0000_0000_0000_0001)
(0000_0000_0000_0000_0000_0000_0000_0001,0000_0000_0000_0001)
Compress the 3rd column: the column is not the last column of the matrix and has 1 nonzero element, located at the 5th row; therefore the column is compressed as:
(0100_0000_0000_0000_0000_0000_0000_0011,0000_0000_0000_0001)
(0000_0000_0000_0000_0000_0000_0000_0101,0000_0000_0000_0001)
Compress the 4th column: the column is not the last column of the matrix and has 1 nonzero element, located at the 2nd row; therefore the column is compressed as:
(0100_0000_0000_0000_0000_0000_0000_0100,0000_0000_0000_0001)
(0000_0000_0000_0000_0000_0000_0000_0010,0000_0000_0000_0011)
Compress the 5th column: the column is not the last column of the matrix and has 2 nonzero elements, located at the 1st row and the 4th row; therefore the column is compressed as:
(0100_0000_0000_0000_0000_0000_0000_0101,0000_0000_0000_0010)
(0000_0000_0000_0000_0000_0000_0000_0001,0000_0000_0000_0010)
(0000_0000_0000_0000_0000_0000_0000_0100,0000_0000_0000_0100)
Compress the 6th column: the column is the last column of the matrix and has 2 nonzero elements, located at the 3rd row and the 5th row; therefore the column is compressed as:
(1000_0000_0000_0000_0000_0000_0000_0110,0000_0000_0000_0010)
(0000_0000_0000_0000_0000_0000_0000_0011,0000_0000_0000_0001)
(0000_0000_0000_0000_0000_0000_0000_0101,0000_0000_0000_0011)
After each column is compressed, the results are stored sequentially in the memory in column-major order.
The compression process can be described as follows:
The sparse adjacency matrix is processed column by column, independently, starting from the first column. When each column is compressed:
1) count the number of nonzero elements of the column and generate the column-identifier data pair; if the column is not the last column, the column identifier's index[31:30] is "01", otherwise "10"; the column identifier's index[29:0] indicates the column index, and the column identifier's value[15:0] indicates the number of nonzero elements of the column;
2) generate data pairs for all nonzero elements of the column in row order: index[31:30] is "00", index[29:0] indicates the row index of each nonzero element, and value[15:0] indicates the numerical value of each nonzero element.
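Steps 1) and 2) can be sketched as a behavioral model of the preprocessing circuit (using the 32-bit index / 16-bit value layout of the embodiment; the function name and the dict-of-columns input format are illustrative, and the nonzeros of the 6 × 6 matrix M below are read off the per-column example above, since the matrix figure is not reproduced in the text):

```python
def csci_compress(cols, num_cols):
    """cols: {col: [(row, val), ...]}, 1-based indices, rows ascending.
    Returns a flat list of (index, value) pairs in column-major order."""
    pairs = []
    for c in range(1, num_cols + 1):
        entries = cols.get(c, [])
        # 1) column identifier: index[31:30] = "01", or "10" for the last column
        tag = 0b10 if c == num_cols else 0b01
        pairs.append((tag << 30 | c, len(entries)))
        # 2) one data pair per nonzero element, index[31:30] = "00"
        for row, val in entries:
            pairs.append((row, val))
    return pairs

# Nonzeros of the 6x6 matrix M of Figure 1B, per the column-by-column example:
M = {1: [(3, 3), (4, 2)], 2: [(1, 1)], 3: [(5, 1)],
     4: [(2, 3)], 5: [(1, 2), (4, 4)], 6: [(3, 1), (5, 3)]}
```

Applied to M, the first pair is (0x40000001, 2), matching the binary listing of the 1st column, and the ioc of the 6th column carries tag "10" as in the listing.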
Referring to Fig. 2, the control circuit (CTR) is composed of two parts: a host interface component (host_IF) and a control logic component (Ctr_Logic).
The host interface (host_IF) receives and buffers the application type sent by the host and the vertex index corresponding to the application type.
The application types in the present embodiment include: application type one, the breadth-first search (BFS) application, and application type two, the single-source shortest path (SSSP) application.
The vertex index of the breadth-first search (BFS) application is the root vertex index, and the vertex index of the single-source shortest path (SSSP) application is the source vertex index.
After the control logic component (Ctr_Logic) receives the conversion-ready indication signal sent by the preprocessing circuit, it sends the vertex index to the data access unit (DAU), sends the application type to the mixed-granularity processing unit (MGP) and the result generating unit (RGU), and starts the accelerator.
Referring to Fig. 3, the data access unit (DAU) is composed of a user logic component (UI), a column identifier buffer (ioc_ram), and an address calculation module (addr_cal).
User logic component mainly completes three functions:
1) reading the column identifiers from the memory and sending them to the column identifier buffer (ioc_ram) for buffering;
2) reading from the memory, at the address calculated by the address calculation module (addr_cal), the data corresponding to the active vertex, and determining the number of data to read according to the value of the column-identifier data pair of that vertex;
The address calculation module calculates according to formula one below; the active vertex index participates in the address calculation, so the address is the one calculated for the active vertex, and the "data corresponding to the active vertex" in 2) is the data of the vertex to which the address belongs.
An active vertex is a vertex updated by the algorithm in an iteration round, and serves as an active vertex for the next round. The root/source vertex designated at initialization is the active vertex of the first round; the active vertices of each subsequent round are produced as the computation proceeds from the root/source vertex.
3) sending the read vertex data to the scheduler (SCD), and stopping reading data from the memory upon the pause-read signal sent by the scheduler (SCD) while saving the current state, so that reading can resume after the pause-read signal from the scheduler is deasserted.
The column identifier buffer (ioc_ram) is used to buffer the column identifiers of the CSCI-format graph data.
The address calculation module (addr_cal), according to the vertex index input by the control circuit or the result generating unit, combined with the number of nonzero-element data of each column provided by the column identifier buffer and the number RowSize of bytes of data stored per memory row, calculates the physical address PhyAddr in the memory of the data corresponding to the current active vertex i. Assuming the base address of the CSCI-format graph data in the memory is BaseAddr, PhyAddr may be expressed as:
Formula one: PhyAddr = BaseAddr + (nnz_c0 + nnz_c1 + ... + nnz_ci) / RowSize.
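Formula one can be expressed directly (a sketch under stated assumptions: the text does not specify how the quotient is rounded or how BaseAddr is scaled, so integer division and row-granular addresses are assumed here):

```python
def phy_addr(base_addr, nnz, i, row_size):
    """Formula one: PhyAddr = BaseAddr + (nnz_c0 + ... + nnz_ci) / RowSize.
    nnz[k]: nonzero count of column k, taken from the ioc copy in memory.
    i: column index of the current active vertex.
    row_size: bytes of data stored per memory row."""
    return base_addr + sum(nnz[: i + 1]) // row_size
```

For the six-column example, with nonzero counts [2, 1, 1, 1, 2, 2], a base address of 100, and a row size of 2, the address for column index 2 is 100 + 4 // 2 = 102.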
Referring to Fig. 4, the scheduler (SCD) is composed of a buffer allocation module (buf_assign), a task scheduling module (task_sch), and a double-buffer module (double_buffer).
The buffer allocation module (buf_assign) analyzes the column identifier of each column of CSCI-format graph data sent in from the data access unit to learn the number of data pairs to be processed in that column, i.e., the value of the column-identifier data pair, and, according to the buffer status information sent by the double-buffer module (double_buffer), sends the column data to be processed to the double-buffer module; when all buffers are occupied, i.e., "full", it sends a stop-read signal to the data access unit (DAU);
The task scheduling module (task_sch), according to the processing-element status signals sent by the mixed-granularity processing unit (MGP) and the buffer status information sent by the double-buffer module (double_buffer), sends the unscheduled data in all buffers (data not yet sent to the mixed-granularity processing unit) to idle processing elements whose computing capacity meets the requirement for processing;
The double buffering module (double_buffer) consists of 16 groups of front/back double buffers of different temporary capacities; the capacity of each buffer area is described below.
In a buffer's name, f and b denote the front and back buffer respectively; 0~7 denote buffer No. 0 through buffer No. 7, and 8~11, 12~13 have analogous meanings.
When a buffer area receives CSCI-format graph data, the buffer area is chosen according to the number of data pairs in the column. Each buffer area is provided with a front and a back buffer that are used alternately in ping-pong fashion. Initially all buffer states are "empty" and data are stored in the front buffers, i.e. buf*_f; when all front buffers of the matching capacity are occupied, newly received data are stored in the back buffers of that capacity range, i.e. buf*_b. If both the front and back buffers of a smaller capacity are occupied while a larger-capacity buffer is idle, small columns of data may be stored in the larger-capacity buffer.
For example, columns with no more than 64 data pairs are stored in buf0_f through buf7_f in turn: if buf0_f already holds data that has not yet been read out, the column is stored in buf1_f; if buf1_f is occupied, it goes to buf2_f, and so on. When buf0_f through buf7_f are all occupied, columns are stored sequentially in buf0_b through buf7_b; if buf0_f~buf7_f and buf0_b~buf7_b are all occupied, columns with no more than 64 data pairs may be temporarily stored in buf8_f through buf11_f, and so on.
Since the memory data bandwidth is limited, when one column of CSCI-format graph data exceeds the memory data bandwidth, it must be written into a buffer area over several transfers. Once the column is completely buffered, the buffer is set to "full" and the task scheduling module (task_sch) is notified that the column can be dispatched; after data dispatch completes, the buffer is set to "empty" and its state is sent to the buffer allocation module (buf_assign). When a column of graph data contains more than 1024 data pairs, i.e. exceeds the maximum buffer capacity, it is processed in batches.
buf0~7_f, buf0~7_b: 64 data pairs;
buf8~11_f, buf8~11_b: 128 data pairs;
buf12~13_f, buf12~13_b: 256 data pairs;
buf14_f, buf14_b: 512 data pairs;
buf15_f, buf15_b: 1024 data pairs.
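The capacity-matched ping-pong allocation described above can be sketched in software as follows (an illustrative Python model under stated assumptions, not the hardware implementation; the DoubleBuffer class and assign method are hypothetical names):

```python
# Buffer indices grouped by capacity class, mirroring the list above:
# eight 64-pair buffers, four 128-pair, two 256-pair, one 512-pair and
# one 1024-pair buffer, each with a front ("f") and back ("b") copy.
CLASSES = [(64, range(0, 8)), (128, range(8, 12)),
           (256, range(12, 14)), (512, range(14, 15)), (1024, range(15, 16))]

class DoubleBuffer:
    def __init__(self) -> None:
        # slots[(index, side)] holds the occupying column, or None if empty
        self.slots = {(i, s): None for i in range(16) for s in ("f", "b")}

    def assign(self, n_pairs: int):
        """Place a column of n_pairs data pairs.  Within a capacity class
        every front buffer is tried before any back buffer; a larger class
        is used only when the matching class is completely occupied."""
        for cap, idxs in CLASSES:
            if n_pairs > cap:
                continue  # this class is too small for the column
            for side in ("f", "b"):  # front buffers first, then back
                for i in idxs:
                    if self.slots[(i, side)] is None:
                        self.slots[(i, side)] = n_pairs
                        return (i, side)
        return None  # all buffers occupied: assert the stop-read signal

buf = DoubleBuffer()
places = [buf.assign(64) for _ in range(17)]
# buf0_f..buf7_f fill first, then buf0_b..buf7_b, then spill to buf8_f.
assert places[0] == (0, "f") and places[8] == (0, "b") and places[16] == (8, "f")
```

Batching of columns larger than 1024 pairs is outside this sketch; in the hardware it is handled by the scheduler splitting the column across transfers.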
Referring to Fig. 5, the mixed-granularity processing unit (MGP) consists of an auxiliary circuit module (aux_cell) and a processing-element array (PEA).
The auxiliary circuit module (aux_cell) sends, according to the processing-element status, the active vertex data pair input from the result generation unit (RGU) together with the corresponding CSCI-format graph data input from the scheduler (SCD) to a corresponding idle processing element of the array.
The above active vertex data pair can be understood as follows: the result of BFS/SSSP is a value generated for each vertex, representing that vertex's depth or distance in the graph. Intermediate results take the same form, except that the value may be updated during subsequent iterations; hence the name active vertex data pair.
The processing-element array (PEA) consists of 16 processing elements (PE) of different capacities, and the 16 processing elements can work concurrently. The processing capacity of each element is listed below (the data pairs handled here exclude column-identifier data pairs).
A processing element (PE) receives the active vertex data pair and the CSCI-format graph data input by the auxiliary circuit module (aux_cell), and computes on the input according to the application type supplied by the control circuit (CTR).
When the application type is breadth-first search (BFS), the processing element (PE) adds 1 to the value of the active vertex data pair and assigns the result to the value of each data pair of the CSCI graph data;
When the application type is single-source shortest path (SSSP), the processing element (PE) adds the value of the active vertex data pair to the value of each data pair of the CSCI graph data, and the sums are used to update the values of the corresponding data pairs of the CSCI-format graph data;
The computed results are output to the result generation unit (RGU); the maximum number of data pairs contained in a result equals the number of data pairs the corresponding processing element (PE) can handle simultaneously.
PE0~7: can handle 64 data pairs simultaneously;
PE8~11: can handle 128 data pairs simultaneously;
PE12~13: can handle 256 data pairs simultaneously;
PE14: can handle 512 data pairs simultaneously;
PE15: can handle 1024 data pairs simultaneously.
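The per-element BFS and SSSP update rules described above can be sketched as follows (illustrative Python, assuming each data pair is represented as a (row_index, value) tuple; the function name is hypothetical):

```python
def pe_compute(app_type: str, active_pair, column_pairs):
    """One processing-element step on an active vertex's column.

    active_pair  : (vertex_index, value) for the active vertex, where
                   value is its current depth (BFS) or distance (SSSP).
    column_pairs : (row_index, value) nonzero-element data pairs of the
                   CSCI column for that vertex (column identifier excluded).
    """
    _, v = active_pair
    if app_type == "BFS":
        # The candidate depth of every neighbour is active depth + 1.
        return [(row, v + 1) for row, _ in column_pairs]
    if app_type == "SSSP":
        # The candidate distance is active distance + edge weight.
        return [(row, v + w) for row, w in column_pairs]
    raise ValueError(f"unknown application type: {app_type}")

# BFS: an active vertex at depth 2 proposes depth 3 for its neighbours.
assert pe_compute("BFS", (4, 2), [(1, 1), (7, 1)]) == [(1, 3), (7, 3)]
# SSSP: distance 5 plus edge weights 2 and 9.
assert pe_compute("SSSP", (4, 5), [(1, 2), (7, 9)]) == [(1, 7), (7, 14)]
```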
Referring to Fig. 6, the result generation unit (RGU) consists of an operation module (OPC), a comparator (CMP) and an on-chip result buffer (cur_rlt).
With reference to Fig. 7 and Fig. 6, the operation module (OPC) is an 8-way, 4-stage pipeline tree composed of 15 operation cells (op_cell).
The computing capacity of each op_cell equals the maximum number of data pairs contained in its input data;
Each op_cell computes on the intermediate data input from the MGP according to the application type supplied by the control circuit (CTR). For both breadth-first search (BFS) and single-source shortest path (SSSP) the computation is a comparison: the values of the two input data pairs with the same row index are compared, and the smaller value is output as the new value for that row index to the next pipeline stage, until the last-stage op_cell outputs to the comparator (CMP);
For each data pair input from the operation module (OPC), the comparator (CMP) reads the previous value of its row index from the on-chip result buffer (cur_rlt) and compares it with the current input value. If the current value is not smaller than the previous value, no operation is performed and the next row index is processed directly; if the current value is smaller, the value of that row index in the on-chip result buffer (cur_rlt) is updated, the vertex corresponding to that row index is marked as an active vertex, the vertex index is output to the data access unit (DAU), and the row index and value data pair are output to the mixed-granularity processing unit (MGP). The on-chip result buffer (cur_rlt) temporarily stores the depth of each vertex (for breadth-first search, BFS) or its distance (for single-source shortest path, SSSP).
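The comparator's update of the on-chip result buffer amounts to a relaxation step. A minimal Python sketch (cur_rlt is modeled here as a dictionary purely for illustration; the function name is hypothetical, and in hardware this is a buffer lookup, not a dict):

```python
def compare_and_update(cur_rlt: dict, pairs):
    """Relaxation step of the comparator (CMP).

    cur_rlt maps row index -> last depth/distance value; pairs are
    (row_index, value) results from the operation module.  Returns the
    row indices whose vertices became active (i.e. were improved)."""
    active = []
    for row, value in pairs:
        # float("inf") stands in for an untouched result-buffer entry.
        if value < cur_rlt.get(row, float("inf")):
            cur_rlt[row] = value   # update the on-chip result buffer
            active.append(row)     # this vertex becomes an active vertex
        # otherwise: no operation, move on to the next row index
    return active

rlt = {1: 4, 7: 2}
# Row 1 improves (3 < 4); row 7 does not (5 >= 2); row 9 is new.
assert compare_and_update(rlt, [(1, 3), (7, 5), (9, 6)]) == [1, 9]
assert rlt == {1: 3, 7: 2, 9: 6}
```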
The present invention has been used in the "mixed-granularity parallel graph computation accelerator" project. Actual verification shows that the function of the circuit meets the target and that it works reliably, so the object of the invention can be achieved.
It should be understood that the above description of specific embodiments of the present invention merely illustrates the technical approach and features of the invention, and is intended to enable those skilled in the art to understand and implement the invention; the present invention is not limited to the above specific embodiments. All changes and modifications made within the scope of the claims shall fall within the protection scope of the present invention.

Claims (10)

1. A graph data compression method for a graph computation accelerator, characterized by comprising:
S1, the preprocessing circuit of the graph computation accelerator converts the graph data, represented by the sparse adjacency matrix to be processed, into graph data in independent sparse column compression (CSCI) format; the graph data after independent compression of each column comprises a column-identifier data pair and nonzero-element data pairs, each data pair comprising an index (index) and a value (value), the two highest bits of the index indicating the meaning of the remaining index bits and of the value;
S2, the preprocessing circuit of the graph computation accelerator stores the converted CSCI-format graph data in the memory of the graph computation accelerator.
2. The method according to claim 1, characterized in that the step S1 comprises:
compressing the graph data represented by the sparse adjacency matrix column by column, each column independently, into data pairs;
the structure of each data pair comprising: an index (index) and a value (value);
data pairs whose two highest index bits are "01" or "10" being column identifiers (ioc);
the data pairs following a column-identifier data pair being the data pairs corresponding to the nonzero elements of all rows of that column.
3. The method according to claim 2, characterized in that:
when the two highest bits of the index are "01", the remaining index bits indicate the column index, and the value indicates the number of nonzero elements in that column of the sparse adjacency matrix;
when the two highest bits of the index are "10", the remaining index bits indicate the column index, the column being the last column of the sparse adjacency matrix, and the value indicates the number of nonzero elements in that column of the sparse adjacency matrix;
when the two highest bits of the index are "00", the remaining index bits indicate the row index, and the value indicates the corresponding nonzero-element value in the sparse adjacency matrix.
4. The method according to claim 2, characterized in that the bit widths of the index and the value are determined according to the data volume of the sparse adjacency matrix data.
5. A graph computation accelerator, characterized by comprising a preprocessing circuit and a memory;
the preprocessing circuit converting sparse adjacency matrix data according to the compression method of any one of claims 1 to 4.
6. The graph computation accelerator according to claim 5, characterized by further comprising:
a control circuit, a data access unit, a scheduler, a mixed-granularity processing unit and a result generation unit;
wherein the preprocessing circuit is further configured to store a copy of the CSCI column identifiers in the memory;
the control circuit is configured to receive the conversion-ready indication signal that the preprocessing circuit sends after storage in the memory is complete, control the operation of the data access unit, the mixed-granularity processing unit and the result generation unit according to the graph computation application type sent by the host, and send to the data access unit the root vertex index of application type one or the source vertex index of application type two sent by the host;
the data access unit is configured to read the CSCI graph data and the column identifiers from the memory, compute the physical address of a specified vertex in the memory according to the root vertex index, the source vertex index, or the active vertex index sent by the result generation unit so as to access the data, and transfer the read data to the scheduler;
the scheduler is configured to temporarily store the nonzero-element data pairs indicated by the column identifiers in the CSCI data and, according to the status signals of the processing elements in the mixed-granularity processing unit, assign the temporarily stored data to processing elements in the mixed-granularity processing unit for processing;
the mixed-granularity processing unit is configured to process the data temporarily stored in the scheduler in parallel according to the application type from the control circuit and the active vertex data from the result generation unit, and transfer the processed intermediate data to the result generation unit;
the result generation unit is configured to process the intermediate data according to the application type from the control circuit, send the active vertex indices produced during processing to the data access unit, and store the processed final result.
7. The graph computation accelerator according to claim 6, characterized in that the control circuit comprises: a host interface component and a control logic component;
the host interface component is configured to receive the application type, the root vertex index of application type one and the source vertex index of application type two sent by the host;
the control logic component is configured to receive the conversion-ready indication sent by the preprocessing circuit, send the root vertex index or the source vertex index to the data access unit, send the application type to the mixed-granularity processing unit and the result generation unit, and start each module in the graph computation accelerator;
wherein application type one is the breadth-first search (BFS) application type, and application type two is the single-source shortest path (SSSP) application type.
8. The graph computation accelerator according to claim 7, characterized in that the data access unit comprises: a user logic component, an address calculation module and a column identifier buffer;
the column identifier buffer is configured to store the column identifiers of the CSCI graph data;
the address calculation module is configured to compute the physical address in the memory of the data corresponding to the current active vertex i, according to the vertex index sent by the control circuit and the result generation unit, combined with the per-column nonzero-element data in the column identifier buffer and the number of data items stored per memory row;
the user logic component is configured to read the column identifiers from the memory and temporarily store them in the column identifier buffer, read the data corresponding to the active vertex from the memory according to the address computed by the address calculation module, and send the read data to the scheduler;
the user logic component is further configured to stop reading data from the memory after receiving a pause-read signal sent by the scheduler;
the user logic component is further configured to resume reading data after the pause-read signal sent by the scheduler is deasserted.
9. The graph computation accelerator according to claim 8, characterized in that the scheduler comprises: a buffer allocation module, a task scheduling module and a double buffering module;
the buffer allocation module is configured to analyze the data pairs corresponding to the column identifier of the column data transferred from the data access unit, send the graph data corresponding to the column identifier to the double buffering module according to the buffer status information sent by the double buffering module, and send a stop-read signal to the data access unit when all buffer areas in the double buffering module are occupied;
the task scheduling module is configured to dispatch the unscheduled data in all buffer areas to an idle processing element whose computing capacity meets the requirement, according to the processing-element status signals transferred by the mixed-granularity processing unit and the buffer status information transferred by the double buffering module;
the double buffering module comprises multiple groups of front/back double buffers of different temporary capacities;
the double buffering module is configured to notify the task scheduling module to dispatch the graph data cached in a buffer area when that buffer's state is set to "full", set the state of a buffer whose data dispatch has completed to "empty", and send the buffer states to the buffer allocation module.
10. The graph computation accelerator according to claim 9, characterized in that the mixed-granularity processing unit comprises: an auxiliary circuit module and a processing-element array;
the auxiliary circuit module is configured to transfer the active vertex data pair input from the result generation unit, together with the corresponding CSCI data input from the scheduler, to a corresponding idle processing element in the array according to the state of each processing element in the processing-element array;
the processing-element array consists of a plurality of processing elements (PE) of different capacities, the plurality of processing elements working concurrently;
after receiving the active vertex data pair and the CSCI data input by the auxiliary circuit module, each processing element computes on the active vertex data pair and the CSCI data according to the application type transferred by the control circuit.
CN201910107925.9A 2019-02-02 2019-02-02 Graph data compression method for graph computation accelerator and graph computation accelerator Active CN109919826B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910107925.9A CN109919826B (en) 2019-02-02 2019-02-02 Graph data compression method for graph computation accelerator and graph computation accelerator

Publications (2)

Publication Number Publication Date
CN109919826A true CN109919826A (en) 2019-06-21
CN109919826B CN109919826B (en) 2023-02-17

Family

ID=66961445


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598175A (en) * 2019-09-17 2019-12-20 西安邮电大学 Sparse matrix column vector comparison device based on graph computation accelerator
CN111309976A (en) * 2020-02-24 2020-06-19 北京工业大学 GraphX data caching method for convergence graph application
CN113326125A (en) * 2021-05-20 2021-08-31 清华大学 Large-scale distributed graph calculation end-to-end acceleration method and device

Citations (11)

Publication number Priority date Publication date Assignee Title
CN103336758A (en) * 2013-06-29 2013-10-02 中国科学院软件研究所 Sparse matrix storage method CSRL (Compressed Sparse Row with Local Information) and SpMV (Sparse Matrix Vector Multiplication) realization method based on same
US20140040530A1 (en) * 2012-08-02 2014-02-06 Lsi Corporation Mixed granularity higher-level redundancy for non-volatile memory
CN104636273A (en) * 2015-02-28 2015-05-20 中国科学技术大学 Storage method of sparse matrix on SIMD multi-core processor with multi-level cache
CN106157339A (en) * 2016-07-05 2016-11-23 华南理工大学 The animated Mesh sequence compaction algorithm extracted based on low-rank vertex trajectories subspace
CN106951961A (en) * 2017-02-24 2017-07-14 清华大学 The convolutional neural networks accelerator and system of a kind of coarseness restructural
CN107229967A (en) * 2016-08-22 2017-10-03 北京深鉴智能科技有限公司 A kind of hardware accelerator and method that rarefaction GRU neutral nets are realized based on FPGA
CN107239823A (en) * 2016-08-12 2017-10-10 北京深鉴科技有限公司 A kind of apparatus and method for realizing sparse neural network
CN107704916A (en) * 2016-08-12 2018-02-16 北京深鉴科技有限公司 A kind of hardware accelerator and method that RNN neutral nets are realized based on FPGA
CN108234863A (en) * 2016-12-12 2018-06-29 汤姆逊许可公司 For rebuilding the method and apparatus that the signal of sparse matrix is encoded to transmission data
EP3343392A1 (en) * 2016-12-31 2018-07-04 INTEL Corporation Hardware accelerator architecture and template for web-scale k-means clustering
EP3343391A1 (en) * 2016-12-31 2018-07-04 INTEL Corporation Heterogeneous hardware accelerator architecture for processing sparse matrix data with skewed non-zero distributions

Non-Patent Citations (2)

Title
Ji Guoliang: "Research on storage methods for large sparse matrices in engineering computation", Journal on Numerical Methods and Computer Applications *
Lai Zhichao: "Large-scale structural computation based on supernodal LDL decomposition", Computer Aided Engineering *




Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20190618
Address after: West Chang'an Street, Chang'an District, Xi'an City, Shaanxi Province
Applicant after: XI'AN University OF POSTS & TELECOMMUNICATIONS
Applicant after: BOARD OF REGENTS THE University OF TEXAS SYSTEM
Address before: 710121 West Chang'an Street, Chang'an District, Xi'an City, Shaanxi Province
Applicant before: Xi'an University of Posts & Telecommunications
Applicant before: University of Texas at Austin
CI02 Correction of invention patent application
Correction item: Applicant | Address | Applicant
Correct: Xi'an University of Posts and Telecommunications: 710121 West Chang'an Street, Chang'an District, Xi'an City, Shaanxi Province | University of Texas at Austin
False: Xi'an University of Posts & Telecommunications | 710121 Xi'an, Shaanxi, Changan District West Chang'an Avenue | Univ. Texas
Number: 27-02
Volume: 35
GR01 Patent grant