CN108932135A - Design method of an FPGA-based acceleration platform for classification algorithms - Google Patents

Design method of an FPGA-based acceleration platform for classification algorithms Download PDF

Info

Publication number
CN108932135A
CN108932135A (application CN201810698823.4A)
Authority
CN
China
Prior art keywords
accelerator
fpga
algorithm
vector
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810698823.4A
Other languages
Chinese (zh)
Inventor
李曦
王超
程玉明
周学海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Institute for Advanced Study USTC
Original Assignee
Suzhou Institute for Advanced Study USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Institute for Advanced Study USTC
Priority to CN201810698823.4A
Publication of CN108932135A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3867Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

The invention discloses a design method for an FPGA-based acceleration platform for classification algorithms. The method comprises: analyzing the class center vector algorithm, the K-nearest neighbor algorithm, and the naive Bayes algorithm with profiling techniques to obtain their hotspot code; analyzing the hotspot code of the three classification algorithms and modifying it appropriately to extract the logic they share; analyzing the resources and characteristics of the FPGA platform, optimizing the accelerator arithmetic units with pipelining and parallelization, designing the overall hardware accelerator framework, and generating IP cores; designing an instruction set with extended semantics and implementing the functional logic blocks corresponding to the instructions, so that the key code is executed through instruction fetch, decode, and execute operations; and porting an operating system to the development board and writing drivers for each hardware device, so that hardware and software cooperate under the operating system. The present invention supports multiple classification algorithms, improving the scalability and flexibility of the system, and programmers can easily obtain good performance from existing FPGA resources.

Description

Design method of an FPGA-based acceleration platform for classification algorithms
Technical field
The present invention relates to hardware acceleration platforms for algorithms, and in particular to an FPGA-based acceleration platform for classification algorithms with high scalability and flexibility, and to its design method.
Background art
With the popularization of personal computers, the Internet has developed rapidly, and the resulting flood of electronic information has become difficult to handle. One of the current focuses of the information science and technology field is to organize and manage this electronic information effectively, and to find the information users need quickly, accurately, and comprehensively. As the key technology for processing and classifying massive data, classification algorithms can largely solve the problem of information clutter, helping users locate the information they need while filtering out the rest. Classification techniques are also widely applied as the technical foundation of fields such as information filtering, information retrieval, search engines, text databases, and digital libraries.
Among current classification approaches, ensemble learning is a research hotspot for scholars at home and abroad; it combines single classifiers according to certain rules to solve a problem, with algorithms such as Bagging and Boosting. Among single classification algorithms, different algorithms have their own characteristics. For example, support vector machines (SVM) achieve very high accuracy and perform well even on datasets without background information, whereas decision trees produce models that are easy to interpret. Under different data, backgrounds, and requirements, different classification algorithms are therefore needed to achieve the best results.
In today's big data era, massive high-dimensional data has significantly slowed down classification algorithms, severely restricting the development of many industries. With the surge of data and people's urgent need for key information, extracting and classifying information quickly and efficiently has become particularly important, so high-performance implementations of classification algorithms have become an important research topic. Compared with traditional computer systems, multi-core heterogeneous computing platforms integrate reconfigurable logic units, combining the characteristics of heterogeneous multi-core platforms and reconfigurable technology: the hardware platform can be reconfigured for the dynamic execution process of an application, offering higher flexibility and easier scalability, which meets the demands of the big data era. Heterogeneous computing systems based on GPUs and FPGAs have therefore become an effective framework for processing big data applications.
Compared with the difficulty of optimizing algorithms at the algorithm level, researchers have made effective progress at the hardware level. The main means of accelerating classification algorithms today are cloud computing platforms and hardware acceleration platforms. A cloud computing platform consists of a large number of homogeneous CPU-based single-node servers in which multiple nodes coordinate and cooperate. Cloud programming models can generally be divided into Map-Reduce-based and graph-based computation models; the essence of both is to divide tasks using task-level and data-level parallelism, distribute the divided tasks and data to the distributed computers in the cloud, and return the results to the host in the cloud computing platform after these computers complete their computations. Although cloud computing platforms can achieve good acceleration, the computational efficiency of any single node is relatively low, the acceleration is limited by network bandwidth, and the cost and energy overhead of acceleration are high. Hardware acceleration platforms include general-purpose graphics processing units (GPGPU, General Purpose Graphics Processing Unit), application-specific integrated circuits (ASIC, Application Specific Integrated Circuit), and field programmable gate arrays (FPGA, Field Programmable Gate Array). GPUs possess a large number of hardware-level threads and parallel processing elements, so they are often applied to computationally complex, parallelizable graphics-processing workloads, accelerating applications through data-level parallelism. Programming standards proposed and implemented for the GPGPU platform, such as CUDA, OpenCL, and OpenACC, have greatly lowered the development threshold of GPGPU-based applications and made GPGPU a widely used parallel acceleration platform at present. The GPGPU platform achieves good acceleration but cannot avoid high energy overhead; moreover, to enhance generality, GPU chips integrate functional components that many algorithms never use, which consumes chip resources and brings considerable additional overhead. To reduce power consumption while maintaining high computational efficiency, the best approach is to design a dedicated hardware acceleration structure. ASIC and FPGA platforms are dedicated hardware acceleration platforms. An ASIC is an integrated circuit built for a specific purpose: its circuit structure cannot be modified once completed, so it suits only specific applications. Compared with ASICs, FPGAs are reconfigurable: users can dynamically reconfigure the functional modules on an FPGA, which offers better flexibility, lower demands on developers, and a short development cycle. We therefore choose the FPGA platform as the carrier of the classification algorithm hardware accelerator.
At present, little attention is paid at home and abroad to the generality and flexibility of FPGA accelerators; research tends to concentrate on the acceleration of a single specific algorithm. We therefore aim to design a general FPGA-based accelerator for classification algorithms.
We chose three representative classification algorithms for study: naive Bayes, class center vector, and K-nearest neighbor. These three algorithms have different characteristics and application fields. The basic idea of the naive Bayes algorithm is to first compute prior and conditional probabilities to obtain the probability that a datum belongs to each class, then compare these probabilities across all classes, and finally assign the result to the class with the maximum probability. The basic idea of the class center vector algorithm is to compute a center vector for each class of data, compare the similarity between the datum and each class center vector, and finally assign the result to the class with the maximum similarity. The K-nearest neighbor algorithm computes the similarity between a given test datum and each datum in the training set, finds the K training data most similar to the example among the computed similarities, and assigns the example to the class to which the majority of those K training data belong. A minimal sketch of the shared kernel follows below.
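As an illustration of the shared computational kernel described above, the following is a minimal C sketch of class-center-vector classification using cosine similarity. The function names and the choice of cosine similarity are illustrative assumptions, not the patent's implementation.

```c
/* Minimal sketch (assumptions, not the patent's RTL) of the kernel shared
 * by the class center vector and K-nearest neighbor algorithms: compute a
 * similarity between two vectors and pick the best-scoring candidate. */
#include <math.h>
#include <stddef.h>

static double cosine_similarity(const double *x, const double *y, size_t n) {
    double sxy = 0.0, sx2 = 0.0, sy2 = 0.0;
    for (size_t i = 0; i < n; i++) {        /* one pass over both vectors */
        sxy += x[i] * y[i];
        sx2 += x[i] * x[i];
        sy2 += y[i] * y[i];
    }
    return sxy / (sqrt(sx2) * sqrt(sy2) + 1e-12);  /* guard against zero */
}

/* Class center vector classification: return the index of the class whose
 * center is most similar to the test vector. centers is num_classes x n. */
int classify_by_center(const double *x, const double *centers,
                       int num_classes, size_t n) {
    int best = 0;
    double best_sim = -2.0;                 /* below any cosine value */
    for (int c = 0; c < num_classes; c++) {
        double s = cosine_similarity(x, centers + (size_t)c * n, n);
        if (s > best_sim) { best_sim = s; best = c; }
    }
    return best;
}
```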
Summary of the invention
In view of the above technical problems, the object of the present invention is to provide an FPGA-based acceleration platform for classification algorithms that supports multiple classification algorithms and improves the scalability and flexibility of the system, so that programmers without hardware knowledge can easily obtain good performance from existing FPGA resources.
The technical scheme of the present invention is as follows:
An FPGA-based acceleration platform for classification algorithms comprises a general-purpose processor, a storage module, and an FPGA. The general-purpose processor is responsible for transmitting data and controlling the accelerator; the storage module temporarily stores data for the general-purpose processor and the accelerator; the FPGA is mainly responsible for computing the hotspot code. The design method comprises the following steps:
S01: analyze the class center vector algorithm, the K-nearest neighbor algorithm, and the naive Bayes algorithm using profiling techniques, and obtain their hotspot code;
S02: analyze the hotspot code of the three classification algorithms and modify it appropriately to extract the logic they share;
S03: analyze the resources and characteristics of the FPGA platform, optimize the accelerator arithmetic units using pipelining and parallelization, design the overall hardware accelerator framework, and generate IP cores;
S04: design an instruction set with extended semantics and implement the functional logic blocks corresponding to the instructions, completing the function of the key code through instruction fetch, decode, and execute operations;
S05: port an operating system to the development board and write drivers for each hardware device, so that hardware and software cooperate under the operating system.
In a preferred technical scheme, in step S01, the hardware resources of the FPGA, including logic units and storage units, are all limited. To make full use of them, before designing the general accelerator for the class center vector, K-nearest neighbor, and naive Bayes classification algorithms, we must first profile the three algorithms and extract the time-consuming key code to implement on the FPGA (a sketch of the timing measurement follows the steps below):
Analyze, with profiling techniques, the share of execution time of each function of the three classification algorithms under different datasets;
Aggregate the test result sets and compute the average execution-time share of each function;
Take the functions whose time share exceeds a given threshold as the hotspot code.
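To make the profiling steps concrete, the following is a hedged C sketch of measuring per-function time shares and flagging hotspots. In practice a profiler such as gprof would be used; the function table, the capacity limit, and the threshold here are assumptions.

```c
/* Hedged sketch of the profiling step: time each candidate function's
 * wall-clock share of a run and flag those above a threshold. */
#include <stdio.h>
#include <time.h>

typedef void (*kernel_fn)(void);

static double timed_run(kernel_fn f) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    f();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

/* Report functions whose time share exceeds `threshold` (e.g. 0.8).
 * Assumes n <= 64 for this sketch. */
void find_hotspots(const char **names, kernel_fn *fns, int n, double threshold) {
    double total = 0.0, t[64];
    for (int i = 0; i < n; i++) { t[i] = timed_run(fns[i]); total += t[i]; }
    for (int i = 0; i < n; i++) {
        double share = t[i] / total;
        if (share > threshold)
            printf("hotspot: %s (%.1f%% of run time)\n", names[i], share * 100);
    }
}
```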
In a preferred technical scheme, in step S02, to further reduce FPGA hardware resource consumption, we parse the key code of the three classification algorithms and find the logic functions shared between the algorithms. Implementing only these shared logic functions in hardware allows part of the arithmetic to be multiplexed, avoiding duplication and waste of logic units on the FPGA and saving its hardware resources. The profiling of the three classification algorithms shows that they all share some identical logic, especially the computation of similarity between vectors and the search for the minimum value, but with some differences: for example, the KNN algorithm searches for the K smallest values rather than the minimum, and the Bayes algorithm performs probability computation rather than similarity computation. To use the FPGA hardware resources more efficiently and increase the flexibility of the accelerator, we modify the logic at these two points of difference so that it can be multiplexed.
In a preferred technical scheme, in step S03, the hotspot analysis shows that the similarity computation between vectors is the key code, so the accelerator design mainly optimizes and accelerates this key code. The similarity measures between vectors share some logic: all five similarity measures — Euclidean distance, Manhattan distance, Jaccard similarity coefficient, cosine similarity, and Pearson correlation coefficient — need only the computation of the intermediate variables Sx, Sy, S(x-y), Sx², Sy², S(x-y)², Sxy, and Nxy, and all of these scalars are obtained by accumulation after element-wise operations on the two vectors. We can therefore implement a single accumulator in the accelerator design to compute the similarity intermediate variables, which simplifies the accelerator structure, makes it easier to pipeline, and improves accelerator performance. Since there are no dependences between vectors when computing similarity, we can compute vector similarities using multiple parallel IP cores. A sketch of how the five measures derive from the shared accumulations follows below.
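The following C sketch illustrates the shared-accumulator idea: a single pass accumulates the intermediate scalars, from which all five similarity measures are derived. The struct and function names are assumptions for illustration; the patent realizes this as pipelined hardware, not software.

```c
/* Hedged sketch of the shared accumulator: one pass accumulates Sx, Sy,
 * S(x-y), S|x-y|, Sx^2, Sy^2, S(x-y)^2, Sxy and the count N, from which
 * all five similarity measures are derived as cheap scalar functions. */
#include <math.h>
#include <stddef.h>

typedef struct {
    double sx, sy, sxy, sx2, sy2, sdiff, sabsdiff, sdiff2;
    size_t n;  /* Nxy: number of accumulated element pairs */
} SimAccum;

static SimAccum accumulate(const double *x, const double *y, size_t n) {
    SimAccum a = {0};
    for (size_t i = 0; i < n; i++) {        /* single pipelined pass */
        double d = x[i] - y[i];
        a.sx += x[i];  a.sy += y[i];  a.sxy += x[i] * y[i];
        a.sx2 += x[i] * x[i];  a.sy2 += y[i] * y[i];
        a.sdiff += d;  a.sabsdiff += fabs(d);  a.sdiff2 += d * d;
    }
    a.n = n;
    return a;
}

static double euclidean(const SimAccum *a) { return sqrt(a->sdiff2); }
static double manhattan(const SimAccum *a) { return a->sabsdiff; }
static double jaccard(const SimAccum *a)   /* Tanimoto form for real vectors */
    { return a->sxy / (a->sx2 + a->sy2 - a->sxy); }
static double cosine(const SimAccum *a)
    { return a->sxy / (sqrt(a->sx2) * sqrt(a->sy2)); }
static double pearson(const SimAccum *a) {
    double n = (double)a->n;
    return (n * a->sxy - a->sx * a->sy) /
           (sqrt(n * a->sx2 - a->sx * a->sx) * sqrt(n * a->sy2 - a->sy * a->sy));
}
```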
In a preferred technical scheme, in step S04, the accelerator instructions are divided into three categories: input/output instructions, compute instructions, and control instructions.
Input/output instructions mainly read from and write back to the DMA, and mainly comprise input/output of scalars, input/output of vectors, and input/output of vector groups;
Compute instructions mainly invoke the accelerator components to perform computation;
Control instructions steer the instruction stream and implement complex loop and jump logic.
In a preferred technical scheme, step S05 comprises the following steps:
When writing the drivers, each hardware device is accessed in the manner of a Linux character device;
When writing the DMA driver, data filling is performed using a mapping mechanism;
The mapping mechanism reserves a segment of contiguous physical memory, maps it to a segment of addresses in kernel space, and then maps that segment of kernel-space addresses to user space.
Compared with the prior art, the present invention has the following advantages:
The present invention profiles three different algorithms, extracts the key code, finds the logic shared among the key code segments, implements that logic in hardware, and programs it onto the FPGA platform. At the interface level, to improve the applicability of the accelerator, the relevant accelerator device drivers and user interfaces are written for upper layers to call. Meanwhile, the accelerator is compared and analyzed against other platforms, yielding metrics such as performance, power, and energy consumption. Experimental results show that the accelerator achieves good acceleration with low power overhead.
Brief description of the drawings
The invention is further described below with reference to the accompanying drawings and embodiments:
Fig. 1 is the design flow chart of the acceleration system platform of the embodiment of the present invention;
Fig. 2 is the hotspot profiling diagram of the class center vector algorithm of the acceleration system platform of the embodiment of the present invention;
Fig. 3 is the hotspot profiling diagram of the K-nearest neighbor algorithm of the acceleration system platform of the embodiment of the present invention;
Fig. 4 is the hotspot profiling diagram of the naive Bayes algorithm of the acceleration system platform of the embodiment of the present invention;
Fig. 5 is the accelerator system structure diagram of the acceleration system platform of the embodiment of the present invention;
Fig. 6 is the accelerator structure diagram of the acceleration system platform of the embodiment of the present invention;
Fig. 7 is the similarity computation IP design diagram of the acceleration system platform of the embodiment of the present invention;
Fig. 8 is the IP core structure diagram of the acceleration system platform of the embodiment of the present invention;
Fig. 9 is the diagram of the accumulation unit of the acceleration system platform of the embodiment of the present invention;
Fig. 10 is the instruction set design diagram of the acceleration system platform of the embodiment of the present invention;
Fig. 11 is the DMA driver encapsulation diagram of the acceleration system platform of the embodiment of the present invention;
Fig. 12 is the interface specification diagram of the acceleration system platform of the embodiment of the present invention;
Fig. 13 is the data flow chart of DMA transmission under the mapping mechanism of the acceleration system platform of the embodiment of the present invention.
Detailed description of the embodiments
The above scheme is further described below in conjunction with specific embodiments. It should be understood that these embodiments illustrate the present invention and do not limit its scope. The implementation conditions used in the examples may be further adjusted according to the conditions of a specific manufacturer; unspecified implementation conditions are usually those of routine experiments.
Embodiment:
The acceleration platform in the embodiment of the present invention includes a general-purpose processor, a field programmable gate array, and a storage module, wherein the data path between the FPGA and the general-purpose processor may use the PCI-E bus protocol, the AXI bus protocol, and so on. The accompanying drawings of the embodiment illustrate a data path using the AXI bus protocol, but the present invention is not limited thereto.
Fig. 1 is the design flow chart of the acceleration system platform of the embodiment of the present invention, with the following steps:
Analyze the class center vector algorithm, the K-nearest neighbor algorithm, and the naive Bayes algorithm using profiling techniques, and obtain their hotspot code;
Analyze the hotspot code of the three classification algorithms and modify it appropriately to extract the logic they share;
Analyze the resources and characteristics of the FPGA platform, optimize the accelerator arithmetic units using pipelining and parallelization, design the overall hardware accelerator framework, and generate IP cores;
Design an instruction set with extended semantics and implement the functional logic blocks corresponding to the instructions, completing the function of the key code through instruction fetch, decode, and execute operations;
Port an operating system to the development board and write drivers for each hardware device, so that hardware and software cooperate under the operating system.
Fig. 2 is the hotspot profiling diagram of the class center vector algorithm of the acceleration system platform of the embodiment of the present invention. Vector similarity computation occupies the largest share of time, 82.37%–96.74%, while class center vector computation and minimum search occupy 2.13%–11.41% and 1.13%–6.22% respectively. Vector similarity computation is therefore the computational hotspot and the key code to implement as hardware acceleration. It can also be seen that the similarity computation time share of the Manhattan distance metric is lower than that of the other similarity metrics, because it involves only vector subtraction, absolute values, and scalar summation, giving it a lower computational complexity than the other similarity metrics.
Fig. 3 is the hotspot profiling diagram of the K-nearest neighbor algorithm of the acceleration system platform of the embodiment of the present invention. The K-nearest neighbor computation consists mainly of two stages: vector similarity computation and searching the similarities for minima. The first stage compares the test datum against all data in the training set; the second stage searches for the K training data with the smallest similarity distances and finally assigns the test datum to the class most common among those K training data. We profiled the hotspots of the K-nearest neighbor algorithm for its different similarity computation methods, as shown in the figure: Similarity Calculation is the similarity computation between vectors, and KMin is the search for the K smallest similarity values. Vector similarity computation occupies the largest share of time, 93.42%–98.91%, while the minimum search occupies 1.09%–6.58%. Therefore, as with the class center vector algorithm, vector similarity computation is the computational hotspot and the key code to implement as hardware acceleration.
Fig. 4 is the hotspot profiling diagram of the naive Bayes algorithm of the acceleration system platform of the embodiment of the present invention. The overall naive Bayes computation consists mainly of two parts: class probability computation and extremum search. The first stage computes the prior and conditional probabilities; the second stage finds the class with the maximum probability in order to classify the test datum. We profiled the hotspots of the naive Bayes algorithm, with results as shown in the figure: Probability Calculation is the vector probability computation, and Min is the extremum search. Probability computation occupies the largest share of time, 98.86%, while the extremum search occupies 1.14%. Likewise, probability computation is the computational hotspot and the key code to implement as hardware acceleration.
Fig. 5 is the accelerator system structure diagram of the acceleration system platform of the embodiment of the present invention. The system we designed is a general hardware acceleration system for classification algorithms on FPGA, whose hierarchy is shown in the figure. The whole system design is ultimately user-oriented: upper-layer users accelerate their algorithms by calling the accelerator interfaces from their applications. Our initial design is the bottom hardware acceleration layer, comprising the DMA and the IP cores: the DMA completes data transmission, and the IP cores complete the computation. To meet users' needs, we extend upward layer by layer. Taking Linux as an example: at the kernel-space layer, to connect the operating system with the hardware accelerator, the relevant device drivers, such as the accelerator kernel driver and the DMA kernel driver, must be designed and written so that the operating system can control the operation of the accelerator through them. One layer further up is the user-space layer, where we must provide corresponding interfaces for applications to call. The hierarchy in the figure is thus divided into three layers: the user interface layer, the kernel driver layer, and the hardware acceleration layer. The hardware acceleration layer contains the IP cores, the DMA, and related devices; its main function is to control the IP cores to read the data transmitted by the DMA, perform the computation, and finally return the results to the DMA. The kernel driver layer contains the DMA kernel driver and the IP core kernel driver in kernel space, controlling the computation of the IP cores and the reads and writes of the DMA. Finally, the user interface layer provides the accelerator interfaces, IP core interfaces, and DMA interfaces, and encapsulates a runtime library to make calls convenient for users.
Fig. 6 is the accelerator structure diagram of the acceleration system platform of the embodiment of the present invention. The overall acceleration platform consists of the host PC, DDR RAM, and the accelerator. The host PC manages, through the control bus and the control unit in the accelerator, the operation of the multiple IP cores and the DMA data transmission. The DMA can manage the data transmission between the DDR RAM and the IP cores in place of the host CPU, so the host PC can execute other tasks during DMA transfers, saving CPU resources. Inside the accelerator, each IP core is equipped with its own DMA, ensuring that the IP cores can transmit data in parallel while also computing in parallel. When designing the accelerator, we found that each IP core executes the same computation task, differing only in its input vectors, and that differences in the data do not much affect the running time of each IP core. We therefore designed the accelerator in the single instruction, multiple data (SIMD) style, and the IP cores execute the whole computation task in SIMD mode. This simplifies the hardware logic design while ensuring the operating efficiency of each IP core.
Fig. 7 is the similarity computation IP design diagram of the acceleration system platform of the embodiment of the present invention, mainly comprising:
an input/output module;
a storage unit;
an arithmetic unit;
an accumulation unit.
The input/output module is responsible for the input and output of data in the IP core, and the storage unit stores the input vectors, intermediate variables, training data, and final classification results. The controller regulates, through the control bus, the input and output of the I/O module and the computation of the arithmetic unit and the accumulation unit. The accumulation unit and the arithmetic unit can read the data in the storage unit, perform their operations, and return the results to the storage unit. The final results are written back to the DMA by the output module.
The control unit (Controller) inside the accelerator controls the operation of the IP cores and the data transmission of the DMA within the accelerator, and it achieves this through the instruction set. The instruction set, written by the user to control and manage the accelerator, is filled by the host PC into the instruction buffer in the control unit. To guarantee that instructions execute one by one, a program counter (Program Counter, PC) and register groups storing important information, such as the register group storing the vector length, are also designed and implemented inside the control unit. The IP cores inside the accelerator operate on the data transferred by the DMA from the DDR RAM and return the computed results to the DMA, which finally transfers the results back into the DDR RAM. The IP cores thus undertake the greater part of the computation; their structure is shown in Fig. 8.
The accumulation unit and the arithmetic unit can read the data in the storage unit, perform their operations, and return the results to the storage unit; the final results are written back to the DMA by the output module. The overall design of the accumulation unit is shown in Fig. 9. As can be seen from the structure, the accumulation unit mainly computes the intermediate scalars that are key to the similarity computation between vectors, including Sx, Sy, S(x-y), Sx², Sy², S(x-y)², Sxy, and Nxy, and stores them in the cache. Note that S(x+y) is not an intermediate scalar of any similarity computation; it is used here by the class center vector algorithm to compute class centers, so the center computation shares the hardware logic of the similarity computation, and no hardware resources need be wasted on a separate center-computation design while efficiency is also ensured. The structure of the accumulation unit comprises three parts: the multi-function unit, the cache, and the adder tree, as shown in Fig. 9.
Figure 10 is the instruction set design diagram of the acceleration system platform of the embodiment of the present invention. The accelerator instructions are divided into three categories: input/output instructions, compute instructions, and control instructions. Input/output instructions mainly read from and write back to the DMA, and mainly comprise input/output of scalars, input/output of vectors, and input/output of vector groups. Compute instructions mainly invoke the accelerator components to perform computation; control instructions steer the instruction stream and implement complex loop and jump logic. An input instruction is a one-operand instruction, mainly used to store data from the DMA into the accelerator cache, where the operand denotes the corresponding scalar in the cache. The instruction specifies no source operand because the FPGA's data bus is the AXI4-Stream bus, a point-to-point data bus carrying no address information, used for data transmission between the DMA and the accelerator. Likewise, an output instruction is a one-operand instruction, used to write data back from the accelerator cache to the DMA. Compute instructions are the instructions of the computing modules: AC-class instructions represent computation with the accelerator's accumulator module, and AR-class instructions represent computation with the accelerator's arithmetic module. Being coarse-grained instructions, they carry no operands and are used only to enable module computation. Control instructions implement code loops and jumps; they also carry no operands, because the values they need all reside at fixed addresses and can be read directly. For example, the cmp instruction, which selects the similarity measure, compares the variable case with a constant value: if they are equal, the next instruction executes; if not, execution falls through to the instruction after it. The value of case is stored at a fixed address. A hedged encoding sketch follows below.
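The following is a hedged C sketch of one possible encoding of the three instruction classes. The patent publishes no bit layouts or mnemonics beyond the AC/AR classes and cmp, so the opcodes, field widths, and field names here are illustrative assumptions.

```c
/* Hedged sketch of an encoding for the three instruction classes:
 * I/O instructions carry one operand (a cache slot); compute and control
 * instructions are coarse-grained and operand-free, as described above. */
#include <stdint.h>

typedef enum {
    /* input/output instructions: one operand (a cache slot) */
    OP_IN_SCALAR, OP_IN_VECTOR, OP_IN_VGROUP,
    OP_OUT_SCALAR, OP_OUT_VECTOR, OP_OUT_VGROUP,
    /* compute instructions: coarse-grained, no operand */
    OP_AC,   /* enable the accumulator module */
    OP_AR,   /* enable the arithmetic module */
    /* control instructions: operand-free, values read from fixed addresses */
    OP_CMP,  /* compare `case` with a constant to pick the similarity measure */
    OP_JMP   /* loop / jump */
} Opcode;

typedef struct {
    uint8_t opcode;   /* one of Opcode */
    uint8_t operand;  /* cache slot for I/O instructions; unused otherwise */
} Instruction;

/* Example program: load two vectors, accumulate, write one scalar back. */
static const Instruction demo_program[] = {
    { OP_IN_VECTOR,  0 },
    { OP_IN_VECTOR,  1 },
    { OP_AC,         0 },
    { OP_OUT_SCALAR, 2 },
};
```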
Figure 11 is the DMA driver encapsulation diagram of the acceleration system platform of the embodiment of the present invention. The whole acceleration system includes multiple IP cores, among them the DMA, the Timer, and the accelerator IP core of our own design. The FPGA development board we use is developed by Xilinx, so Xilinx provides a support package containing the corresponding IP core drivers, such as those for the DMA and the Timer, and we can use the drivers in this support package directly. The encapsulation of the DMA driver is shown in the figure; the main encapsulated functions are the initialization function, the reset function, the state query function, and the data transmission control functions. The driver of our IP core is a character device driver; the operating system kernel module can be loaded or removed with the insmod and rmmod commands. User programs control and read or write the character device in the form of a file; the corresponding interfaces are open(), read(), write(), ioctl(), close(), and so on, and calling these interfaces completes the opening, closing, reading, writing, and control of the character device. A hedged user-space sketch follows below.
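The following user-space C sketch shows the character-device access pattern named above (open/read/write/ioctl/close). The device path and the ioctl request code are assumptions, not values from the patent.

```c
/* Hedged sketch of driving the accelerator character device from user
 * space with the standard file interfaces. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define MYACC_IOCTL_RESET 0  /* assumed request code, not from the patent */

int main(void) {
    int fd = open("/dev/myacc0", O_RDWR);   /* assumed device node */
    if (fd < 0) { perror("open"); return 1; }

    ioctl(fd, MYACC_IOCTL_RESET, 0);        /* control: reset the device */

    float vec[256] = {0};
    write(fd, vec, sizeof vec);             /* send input data */

    float result;
    read(fd, &result, sizeof result);       /* read back the computation */
    printf("result = %f\n", result);

    close(fd);
    return 0;
}
```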
Figure 12 is the interface specification diagram of the acceleration system platform of the embodiment of the present invention. The calling interfaces of the accelerator are MyAc_start, MyAc_reset, MyAc_busy, MyAc_close, MyAc_send_instructions, MyAc_send_data, and MyAc_get_data, where the parameter dev_fd refers to the device file corresponding to the accelerator. The first four interfaces provide the basic functions of the accelerator: starting, resetting, status query, and closing. The status query interface MyAc_busy returns the accelerator's current state: a return value of 1 indicates that the accelerator is busy processing the data transferred by the DMA, while a return value of 0 indicates that the accelerator is idle, ready and waiting for data from the DMA. The MyAc_send_instructions interface transfers an instruction array of length length through the DMA into the instruction buffer inside the control unit; the control unit then executes the instructions in the buffer one by one, controlling the computation and the input/output of the arithmetic units inside the accelerator. Similarly, the MyAc_send_data and MyAc_get_data interfaces control the DMA data transfer. With these interfaces complete, a user can call them to perform the computation of the relevant application task. The user first initializes the accelerator used for computation and the DMA used for data transmission, then obtains the physical base addresses where the initial data and the computation results are stored and maps them into user-space virtual addresses, and then transfers the initial data into the physical addresses obtained earlier. When the DMA starts, it reads the data from the physical addresses storing the initial data and transmits it to the arithmetic units in the accelerator for computation. Since the DMA is full duplex and the accelerator is internally pipelined, while the DMA transmits data to the accelerator, the accelerator returns computed results to the DMA, which writes them back to the physical addresses. Finally, the user waits for the DMA write-back to finish and closes all device files once they are idle. This is the flow of user interface calls; a hedged sketch of the sequence follows below.
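The following C sketch illustrates that calling sequence using the interface names from the patent. The exact prototypes are not published, so the signatures, buffer types, and the instruction word width here are assumptions; the definitions would come from the accelerator's runtime library.

```c
/* Hedged sketch of the user calling sequence: reset, load instructions,
 * start, stream data in and results out, wait, close. */
#include <stdint.h>
#include <stddef.h>

/* Assumed prototypes for the accelerator runtime library. */
int MyAc_start(int dev_fd);
int MyAc_reset(int dev_fd);
int MyAc_busy(int dev_fd);                 /* 1 = busy, 0 = waiting */
int MyAc_close(int dev_fd);
int MyAc_send_instructions(int dev_fd, const uint32_t *instr, size_t length);
int MyAc_send_data(int dev_fd, const float *data, size_t length);
int MyAc_get_data(int dev_fd, float *out, size_t length);

int run_task(int dev_fd, const uint32_t *prog, size_t prog_len,
             const float *input, size_t in_len, float *output, size_t out_len) {
    MyAc_reset(dev_fd);
    /* Fill the control unit's instruction buffer via DMA. */
    MyAc_send_instructions(dev_fd, prog, prog_len);
    MyAc_start(dev_fd);
    /* The DMA is full duplex and the accelerator is pipelined, so results
     * stream back while input is still being transmitted. */
    MyAc_send_data(dev_fd, input, in_len);
    MyAc_get_data(dev_fd, output, out_len);
    while (MyAc_busy(dev_fd)) { /* wait for write-back to finish */ }
    return MyAc_close(dev_fd);
}
```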
Figure 13 is the data flow chart of DMA transmission under the mapping mechanism of the acceleration system platform of the embodiment of the present invention. To shorten the data copy time, we adopt memory mapping: user-space addresses are mapped to kernel-space addresses. That is, a segment of physical address space is mapped to a segment of virtual addresses in kernel space, and that segment of virtual addresses is then mapped into an address space of the user process. This completes the mapping from physical addresses to the logical addresses of user space, so that operating on the data in user space is equivalent to operating on the data at the physical addresses. Thus, the required data need only be copied into this segment of physical space, and the DMA can move data in and out of the actual physical addresses corresponding to the user process's virtual addresses; no further copies between Linux kernel space and the user process space are needed, and a single copy suffices. This mapping mechanism is extremely effective in improving the performance of the accelerator under big data. This segment of contiguous physical address space is reserved by the Linux kernel exclusively for accelerator data transmission and cannot be used for other tasks. A hedged sketch follows below.
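The following user-space C sketch illustrates the single-copy mapping: mmap() maps the driver's reserved contiguous buffer into the process, so filling it is the only copy before the DMA operates on it directly. The device path and buffer size are assumptions.

```c
/* Hedged sketch of the zero-extra-copy mapping: mmap the reserved
 * contiguous physical buffer (exposed by the driver through its device
 * file) into the user process and fill it once. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define DMA_BUF_SIZE (1 << 20)  /* assumed 1 MiB reserved region */

int main(void) {
    int fd = open("/dev/myacc_dma", O_RDWR);       /* assumed device node */
    if (fd < 0) return 1;

    /* Map the driver's reserved physical memory into user space. */
    void *buf = mmap(NULL, DMA_BUF_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { close(fd); return 1; }

    /* One copy: writing here writes the physical DMA buffer directly. */
    float input[256] = {0};
    memcpy(buf, input, sizeof input);

    /* ... start the DMA / accelerator via the driver, then read results
     * back from `buf` without any further kernel-to-user copy ... */

    munmap(buf, DMA_BUF_SIZE);
    close(fd);
    return 0;
}
```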
The foregoing examples merely illustrate the technical concept and features of the invention; their purpose is to allow those skilled in the art to understand the content of the present invention and implement it accordingly, not to limit the scope of the present invention. All equivalent transformations or modifications made according to the spirit and essence of the present invention shall be covered by the protection scope of the present invention.

Claims (6)

1. A design method for an FPGA-based acceleration platform for classification algorithms, characterized in that the acceleration platform comprises a general-purpose processor, a storage module, and an FPGA, wherein the general-purpose processor is responsible for transmitting data and controlling the accelerator, the storage module temporarily stores data for the general-purpose processor and the accelerator, and the FPGA is responsible for computing the hotspot code; the design method comprises the following steps:
S01: analyze the class center vector algorithm, the K-nearest neighbor algorithm, and the naive Bayes algorithm using profiling techniques, and obtain their hotspot code;
S02: analyze the hotspot code of the three classification algorithms and modify it appropriately to extract the logic they share;
S03: analyze the resources and characteristics of the FPGA platform, optimize the accelerator arithmetic units using pipelining and parallelization, design the overall hardware accelerator framework, and generate IP cores;
S04: design an instruction set with extended semantics and implement the functional logic blocks corresponding to the instructions, completing the function of the key code through instruction fetch, decode, and execute operations;
S05: port an operating system to the development board and write drivers for each hardware device, so that hardware and software cooperate under the operating system.
2. The design method for an FPGA-based acceleration platform for classification algorithms according to claim 1, characterized in that in step S01, before the general accelerator design for the three classification algorithms (class center vector, K-nearest neighbor, and naive Bayes) is carried out, the three classification algorithms are first profiled, and the time-consuming key code of the algorithms to be implemented on the FPGA is extracted:
Analyze, with profiling techniques, the share of execution time of each function of the three classification algorithms under different datasets;
Aggregate the test result sets and compute the average execution-time share of each function;
Take the functions whose time share exceeds a given threshold as the hotspot code.
3. The design method for an FPGA-based acceleration platform for classification algorithms according to claim 1, characterized in that in step S02, the key code of the three classification algorithms is parsed to find the logic functions shared between the algorithms: the computation of similarity between vectors and the search for the minimum value; the shared logic functions are implemented in hardware, completing the multiplexing of part of the arithmetic; the differing logic functions are that the KNN algorithm searches for the K smallest values rather than the minimum, and the Bayes algorithm performs probability computation rather than similarity computation; the logic at these two points of difference is modified so that it can be multiplexed.
4. The design method for an FPGA-based acceleration platform for classification algorithms according to claim 3, characterized in that in step S03, hotspot analysis shows that the similarity computation between vectors is the key code, and the accelerator design optimizes and accelerates this key code; the similarity measures between vectors share some logic: all five similarity measures (Euclidean distance, Manhattan distance, Jaccard similarity coefficient, cosine similarity, and Pearson correlation coefficient) need only the computation of the intermediate variables Sx, Sy, S(x-y), Sx², Sy², S(x-y)², Sxy, and Nxy, and these scalars are all obtained by accumulation after element-wise operations on the two vectors; a single accumulator is implemented in the accelerator design to compute the similarity intermediate variables, which simplifies the accelerator structure, makes it easier to pipeline, and improves accelerator performance; since there are no dependences between vectors when computing similarity, multiple parallel IP cores are used to compute vector similarities.
5. The design method for an FPGA-based acceleration platform for classification algorithms according to claim 1, characterized in that in step S04, the accelerator instructions are divided into three categories: input/output instructions, compute instructions, and control instructions:
Input/output instructions read from and write back to the DMA, and mainly comprise input/output of scalars, input/output of vectors, and input/output of vector groups;
Compute instructions invoke the accelerator components to perform computation;
Control instructions steer the instruction stream and implement complex loop and jump logic.
6. The design method for an FPGA-based acceleration platform for classification algorithms according to claim 1, characterized in that step S05 comprises the following steps:
When writing the drivers, each hardware device is accessed in the manner of a Linux character device;
When writing the DMA driver, data filling is performed using a mapping mechanism;
The mapping mechanism reserves a segment of contiguous physical memory, maps it to a segment of addresses in kernel space, and then maps that segment of kernel-space addresses to user space.
CN201810698823.4A 2018-06-29 2018-06-29 Design method of an FPGA-based acceleration platform for classification algorithms Pending CN108932135A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810698823.4A CN108932135A (en) 2018-06-29 2018-06-29 Design method of an FPGA-based acceleration platform for classification algorithms

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810698823.4A CN108932135A (en) 2018-06-29 2018-06-29 Design method of an FPGA-based acceleration platform for classification algorithms

Publications (1)

Publication Number Publication Date
CN108932135A true CN108932135A (en) 2018-12-04

Family

ID=64447358

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810698823.4A Pending CN108932135A (en) 2018-06-29 2018-06-29 Design method of an FPGA-based acceleration platform for classification algorithms

Country Status (1)

Country Link
CN (1) CN108932135A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7565631B1 (en) * 2004-07-02 2009-07-21 Northwestern University Method and system for translating software binaries and assembly code onto hardware
CN104866286A (en) * 2015-06-02 2015-08-26 电子科技大学 OpenCL and SoC-FPGA-Based K neighbor sorting accelerating method
CN106383695A (en) * 2016-09-14 2017-02-08 中国科学技术大学苏州研究院 FPGA-based clustering algorithm acceleration system and design method thereof
CN108090560A (en) * 2018-01-05 2018-05-29 中国科学技术大学苏州研究院 The design method of LSTM recurrent neural network hardware accelerators based on FPGA

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HE, Yajun et al.: "Practical Course of SPSS Tourism Statistics", 28 February 2010 *
ZHANG, Huizhen et al.: "Research on code optimization and generation algorithms for reconfigurable instruction set processors", Journal of Computer Research and Development *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110175107A (en) * 2019-05-13 2019-08-27 华中科技大学 A kind of test method and test macro of FPGA Cloud Server performance
CN110175107B (en) * 2019-05-13 2020-07-28 华中科技大学 FPGA cloud server performance test method and test system
CN110197219A (en) * 2019-05-25 2019-09-03 天津大学 A kind of hardware implementation method of Bayes classifier that supporting data classification
CN110197219B (en) * 2019-05-25 2023-04-18 天津大学 Hardware implementation method of Bayes classifier supporting data classification
US11573705B2 (en) 2019-08-28 2023-02-07 Micron Technology, Inc. Artificial intelligence accelerator
WO2021041587A1 (en) * 2019-08-28 2021-03-04 Micron Technology, Inc. Artificial intelligence accelerator
CN110717587A (en) * 2019-10-11 2020-01-21 北京大学深圳研究生院 Performance semantic acceleration mechanism based on parallel acceleration loop body and application thereof
CN111090607A (en) * 2019-11-15 2020-05-01 安徽中骄智能科技有限公司 FPGA (field programmable Gate array) automatic development platform based on intelligent block diagram packaging
CN113268270B (en) * 2021-06-07 2022-10-21 中科计算技术西部研究院 Acceleration method, system and device for paired hidden Markov models
CN113268270A (en) * 2021-06-07 2021-08-17 中科计算技术西部研究院 Acceleration method, system and device for paired hidden Markov models
CN113627490A (en) * 2021-07-15 2021-11-09 上海齐网网络科技有限公司 Operation and maintenance multi-mode decision method and system based on multi-core heterogeneous processor
CN113627490B (en) * 2021-07-15 2024-05-28 上海齐网网络科技有限公司 Operation and maintenance multi-mode decision method and system based on multi-core heterogeneous processor
CN117149442A (en) * 2023-10-30 2023-12-01 山东浪潮数据库技术有限公司 Hardware acceleration method and system based on distributed architecture database
CN117149442B (en) * 2023-10-30 2024-02-20 山东浪潮数据库技术有限公司 Hardware acceleration method and system based on distributed architecture database

Similar Documents

Publication Publication Date Title
CN108932135A (en) The acceleration platform designing method of sorting algorithm based on FPGA
Mailthody et al. Deepstore: In-storage acceleration for intelligent queries
Kaeli et al. Heterogeneous computing with OpenCL 2.0
Cano et al. High performance evaluation of evolutionary-mined association rules on GPUs
US20080250227A1 (en) General Purpose Multiprocessor Programming Apparatus And Method
Kruliš et al. Combining CPU and GPU architectures for fast similarity search
Mishra et al. Fine-grained accelerators for sparse machine learning workloads
Liang et al. Design and evaluation of a parallel k-nearest neighbor algorithm on CUDA-enabled GPU
US20220114270A1 (en) Hardware offload circuitry
Eldridge et al. Towards general-purpose neural network computing
Rubin et al. Maps: Optimizing massively parallel applications using device-level memory abstraction
Gadiyar et al. Artificial intelligence software and hardware platforms
Choudhary et al. Accelerating data mining workloads: current approaches and future challenges in system architecture design
Kim et al. Accelerating large-scale graph-based nearest neighbor search on a computational storage platform
Vokorokos et al. A multicore architecture focused on accelerating computer vision computations
Song et al. Accelerating kNN search in high dimensional datasets on FPGA by reducing external memory access
Liu Yolov2 acceleration using embedded gpu and fpgas: pros, cons, and a hybrid method
Zhou et al. FASTCF: FPGA-based accelerator for stochastic-gradient-descent-based collaborative filtering
US11461662B1 (en) Compilation time reduction for memory and compute bound neural networks
Chen et al. GCIM: Towards Efficient Processing of Graph Convolutional Networks in 3D-Stacked Memory
Gan et al. High performance reconfigurable computing for numerical simulation and deep learning
CN114761920A (en) Hardware accelerator with reconfigurable instruction set
Ikhlasse et al. Recent implications towards sustainable and energy efficient AI and big data implementations in cloud-fog systems: A newsworthy inquiry
Clemons et al. EVA: An efficient vision architecture for mobile systems
Jin et al. Accelerating graph convolutional networks through a pim-accelerated approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20181204)