Network data processing method and system based on a GPU and a buffer
Technical Field
The present invention relates to parallel processing of network data, and in particular to a method that, through the cooperation of a buffer and multiple threads, makes full use of the massively parallel computing capability of a GPU to process network data at high throughput and low latency.
Background Art
The development of integrated chips roughly follows Moore's Law: performance doubles about every 18 months. Gilder's Law, however, holds that over the next 25 years the bandwidth of the backbone network will double every 6 months. A single CPU, network processor, or application-specific integrated chip can therefore hardly meet the demands of backbone-network data processing, and network applications such as routing, firewalls, intrusion detection, and anti-virus all face the problem of how to sustain high throughput and low latency.
Academia and industry generally believe that parallel computation on multi-core platforms is the effective way to address this problem. According to the literature (He Peng, Jiang Haiyang, Xie Gaogang, "pSnort: a parallel intrusion detection system based on multi-core processors", Information Technology Letters, 2010, Vol. 8, pp. 49-59), there are at present two basic multi-threaded execution models: Run-To-Complete (RTC) and Software Pipeline (SPL). RTC runs the complete processing task in a thread on each core and assigns tasks to different threads, achieving coarse-grained parallelism; it still depends on the integration capability of multi-core chips. Software pipelining splits the processing task into several stages, each of which runs as an independent thread on a different core, improving performance through pipelining; it depends on a reasonable decomposition of the task and is harder to implement. The cited work points out that in several attempts in the intrusion-detection field, the performance gain was small because of the inter-thread communication and synchronization overhead introduced, and performance sometimes even degraded.
Both of the above models are implemented on CPUs. One may consider introducing GPU devices and combining them with these two multi-threading models.
A GPU (Graphics Processing Unit) is a kind of graphics processor that has emerged in recent years. Compared with a CPU, a network processor, or an application-specific integrated chip, it has powerful parallel computing capability and is suitable for all kinds of high-performance general-purpose computation. The principle behind GPU acceleration is that the concurrent execution of a large number of GPU threads offsets the overhead of data copying and the slowness of any single computing unit, yielding an overall performance gain. For example, an NVIDIA GTX580 graphics card has 512 SPs (Stream Processors), each clocked lower than a typical CPU core; but 512 tasks can be computed on these SPs concurrently, and even counting the round-trip copy of data between main memory and device memory, the total time is much smaller than computing those 512 tasks one by one on a CPU.

Because of this hardware design, a GPU device is not suited to single-threaded computation with complex logic. Therefore, in GPU computing, a task is generally divided into a serial part and a parallel part: the CPU executes the serial part and launches the GPU, while the GPU executes the parallel part. The usual steps of a GPU computation are: copy the data to be computed from main memory to the device memory of the GPU device; launch a GPU kernel function to begin the parallel computation; and copy the result from device memory back to main memory.
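The three-step flow described above can be modeled in plain C as follows. This is only an illustrative sketch, not real CUDA: a second heap buffer stands in for device memory, memcpy stands in for cudaMemcpy, and a loop that squares each value stands in for the kernel function (all names here are ours, not part of the invention).

```c
#include <stdlib.h>
#include <string.h>
#include <assert.h>

/* Stand-in for the kernel function: in real CUDA this would be a
 * __global__ function executed by many threads in parallel. */
static void fake_kernel(int *dev, size_t n) {
    for (size_t i = 0; i < n; i++)
        dev[i] *= dev[i];
}

/* Model of the three-step GPU flow. */
static void gpu_compute(const int *host_in, int *host_out, size_t n) {
    int *dev = malloc(n * sizeof(int));      /* "device memory"            */
    memcpy(dev, host_in, n * sizeof(int));   /* step 1: copy host->device  */
    fake_kernel(dev, n);                     /* step 2: launch kernel      */
    memcpy(host_out, dev, n * sizeof(int));  /* step 3: copy device->host  */
    free(dev);
}
```

The serial part (allocation, copies, launch) is what the CPU does; only the body of fake_kernel corresponds to work the GPU would parallelize.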
Across different network applications, the processing of network data contains similar stages. After a packet is received, processing can be divided into three steps: pre-processing, key computation, and post-processing. Pre-processing includes protocol identification, protocol parsing, fragment reassembly, flow reassembly, and so on; key computation refers to application-specific work such as rule matching, pattern matching, hash computation, and table lookup, whose algorithms are often complex; post-processing performs, according to the computation result, operations on the packet or flow such as forwarding, dropping, logging, alerting, or terminating. For example, in intrusion detection, protocol parsing and flow reassembly are performed first, then pattern matching is applied to the protocol fields and payload data, and finally whether to raise an alert is decided according to the matching result.
It can be seen that the GPU is not suited to pre-processing or post-processing, so network data cannot be processed entirely in RTC fashion. But the key-computation step can be handed to the GPU while the other steps are executed by the CPU. In fact, academia has already produced GPU-accelerated work on problems such as routing-table lookup, pattern matching, and string matching, for example: S. Han et al., "PacketShader: A GPU-accelerated Software Router", Proceedings of ACM SIGCOMM 2010, Sep. 2010; G. Vasiliadis et al., "Gnort: High Performance Network Intrusion Detection Using Graphics Processors", in Proceedings of the 11th International Symposium on Recent Advances in Intrusion Detection, Sep. 2008; Jiangfeng Peng, Hu Chen, "CUgrep: A GPU-based high performance multi-string matching system", International Conference on Future Computer and Communication, 2010, Vol. 1, Issue 21-24, pp. 77-81. However, these works all focus on parallelizing the algorithm itself and on optimizations such as caching of device-memory accesses while the algorithm runs on the GPU; they do not consider the task-scheduling problem under network scenarios.
Using the GPU to process network data raises two main problems on the task-scheduling side.

First, data usually arrives at the network device packet by packet, while the GPU is weak at processing a single task and is only suited to the parallel computation of a considerable number of tasks. To make full use of the GPU, a certain number of packets must be buffered, but buffering before computing increases the processing delay. On the other hand, while the data is being buffered the GPU device sits idle, so its utilization remains low.

Second, the copy of data from main memory to device memory becomes a performance bottleneck. At present, the other fields in which GPU acceleration works well are basically compute-intensive, i.e., the kernel-function running time far exceeds the data-copy time, so much research has concentrated on improving the parallel speed of the algorithm. In network applications, however, the volume of data to be processed is large, the copying is time-consuming, and the computation itself is by comparison no longer the bottleneck. Even applications with complex computation still run into the copy bottleneck after optimizations such as those of the aforementioned references. How to improve the data-copy speed is therefore also an unavoidable problem.
Summary of the Invention
The present invention solves the above two problems by a buffer-based task-scheduling method. The method and system are independent of the specific packet-receiving, pre-processing, key-computation, and post-processing methods, and are therefore applicable to all network applications that follow this type of processing, including but not limited to routing, firewalls, intrusion detection, and anti-virus; they can also be combined with the results of the existing literature to further improve performance.
In the present invention, in order to improve the data-copy speed and the system performance, techniques such as "batched copy" and "pipelined scheduling" are adopted; their technical principles are as follows.

The speedup from batched copying rests on the following facts: one data copy involves bus request, bus allocation, data transfer, and bus release, and merging many copies into a single one reduces the number of bus requests, allocations, and releases; in addition, a transfer on the bus is a process that accelerates gradually, so copying a small amount of data rarely achieves high bus utilization. Merging many pieces of data and copying them in one operation is therefore distinctly faster than copying them separately. Experimental results show that between main memory and device memory, the achieved copy rate grows with the amount of data copied at once.
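The gathering half of a batched copy can be sketched as follows; the names and layout are illustrative only. The point is that after gather_tasks() runs, a single bulk transfer of the returned byte count replaces n small transfers, each of which would otherwise pay its own bus request, allocation, and release.

```c
#include <string.h>
#include <stddef.h>
#include <assert.h>

typedef struct {
    const char *data;  /* one task's data to be computed */
    size_t len;
} task_t;

/* Packs the tasks back-to-back into one contiguous staging region and
 * records each task's offset, so the kernel can later find its slice.
 * Returns the total number of bytes packed. */
static size_t gather_tasks(const task_t *tasks, size_t n,
                           char *staging, size_t *offsets) {
    size_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        offsets[i] = pos;
        memcpy(staging + pos, tasks[i].data, tasks[i].len);
        pos += tasks[i].len;
    }
    return pos;  /* one copy of 'pos' bytes replaces n small copies */
}
```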
Pipelined scheduling refers to the following. Within one GPU computation, the three steps — copying data from main memory to device memory, launching the GPU kernel function, and copying the result from device memory back to main memory — must execute in order. But multiple threads can start multiple GPU computations at the same time, so that these steps form a pipeline. For example, while the GPU kernel function of a first thread is executing, a second thread copies its data from main memory to device memory. Because the two steps use different system components, they can proceed simultaneously. GPU compute time and data-copy time therefore overlap to some degree, which reduces the overall time overhead and increases the processing speed.
To solve the above technical problems, the present invention provides a network data processing method based on a GPU and a buffer, comprising the following steps:

a pre-processing thread in a pre-processing thread group continuously pre-processes received network packets, forms calculation tasks, and puts them into a buffer;

a computation thread in a computation thread group continuously either takes one calculation task from the buffer and computes it on the CPU, or takes multiple calculation tasks and computes them on the GPU, and sends the computation results to the post-processing thread group;

a post-processing thread in a post-processing thread group continuously post-processes the computation results passed on by the computation threads after they finish the calculation tasks.
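The buffer connecting these thread groups can be sketched as a bounded queue guarded by a POSIX mutex and condition variables; this is an assumed structure for illustration, not the patent's literal code.

```c
#include <pthread.h>
#include <assert.h>

#define CAP 8  /* illustrative capacity */

typedef struct {
    int items[CAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} buffer_t;

static void buf_init(buffer_t *b) {
    b->head = b->tail = b->count = 0;
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->not_full, NULL);
    pthread_cond_init(&b->not_empty, NULL);
}

/* Pre-processing side: blocks while the buffer is full. */
static void buf_put(buffer_t *b, int task) {
    pthread_mutex_lock(&b->lock);
    while (b->count == CAP)
        pthread_cond_wait(&b->not_full, &b->lock);
    b->items[b->tail] = task;
    b->tail = (b->tail + 1) % CAP;
    b->count++;
    pthread_cond_signal(&b->not_empty);
    pthread_mutex_unlock(&b->lock);
}

/* Computation side: blocks while the buffer is empty. */
static int buf_get(buffer_t *b) {
    pthread_mutex_lock(&b->lock);
    while (b->count == 0)
        pthread_cond_wait(&b->not_empty, &b->lock);
    int task = b->items[b->head];
    b->head = (b->head + 1) % CAP;
    b->count--;
    pthread_cond_signal(&b->not_full);
    pthread_mutex_unlock(&b->lock);
    return task;
}
```

A pre-processing thread calls buf_put() and a computation thread calls buf_get(); because each side waits on its own condition variable, the groups run concurrently and form the pipeline described above.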
Further, the pre-processing thread group comprises one or more pre-processing threads; the computation thread group comprises one or more computation threads; and the post-processing thread group comprises one or more post-processing threads.

Further, the pre-processing is the pre-processing required by the network application, including protocol identification, protocol parsing, fragment reassembly, and flow reassembly.

Further, a calculation task comprises the data to be computed, extracted from the network packets by pre-processing, and the relevant configuration information.

Further, each unit in the buffer holds one calculation task.
Further, if the number of calculation tasks in the buffer has not reached a specified threshold, a computation thread in the computation thread group takes a single calculation task from the buffer and computes it directly on the CPU.

Further, if the number of calculation tasks in the buffer has not reached the specified threshold, the computation threads in the computation thread group suspend work until the number of calculation tasks in the buffer reaches the specified threshold.

Further, if the number of calculation tasks in the buffer reaches the specified threshold, a plurality of computation threads in the computation thread group take a threshold number of calculation tasks from the buffer and compute them on the GPU.
Further, the plurality of computation threads in the computation thread group taking a threshold number of calculation tasks from the buffer and computing them on the GPU comprises:

copying the data to be computed of the threshold number of calculation tasks into main memory, and then copying it in one operation into the device memory of the GPU device;

launching a GPU kernel function to perform one parallel computation on the data in device memory;

copying the results of the parallel computation back from device memory to main memory, and sending the results to a post-processing thread in the post-processing thread group for post-processing.
Specifically, when the number of calculation tasks in the buffer reaches a preset threshold N, N calculation tasks are taken out; the data to be computed of these tasks is copied into one contiguous memory region; that region is then copied in one operation into the device memory of the GPU device; next a GPU kernel function is launched to perform one parallel computation; after the computation finishes, the results are copied back from device memory to main memory; finally the results are sent to the post-processing threads.
When the number of calculation tasks in the buffer has not reached the threshold N, there are two different approaches: in the first, a computation thread immediately takes one calculation task from the buffer, computes it directly on the CPU, and on finishing sends the result to a post-processing thread; in the second, the computation threads suspend work until the number of calculation tasks in the buffer reaches N.
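The choice between the two paths can be sketched as a small dispatch function; the names here are illustrative, and this models the first approach (compute a single task on the CPU whenever the threshold has not been reached).

```c
#include <assert.h>

typedef enum { DISPATCH_CPU_SINGLE, DISPATCH_GPU_BATCH } dispatch_t;

/* Below the threshold N, a thread computes one task on the CPU;
 * at or above N, it drains N tasks for one batched GPU computation. */
static dispatch_t choose_dispatch(int tasks_in_buffer, int threshold_n,
                                  int *tasks_to_take) {
    if (tasks_in_buffer >= threshold_n) {
        *tasks_to_take = threshold_n;   /* batch for the GPU */
        return DISPATCH_GPU_BATCH;
    }
    *tasks_to_take = 1;                 /* single task on the CPU */
    return DISPATCH_CPU_SINGLE;
}
```

The second approach would simply wait instead of returning DISPATCH_CPU_SINGLE.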
Further, each post-processing thread in the post-processing thread group receives the computation results passed by the computation threads and, according to the results, performs on the corresponding packets the post-processing required by the network application.

Further, the post-processing required by the network application and performed on the corresponding packets according to the computation results includes forwarding, dropping, logging, alerting, and terminating.
The present invention also provides a network data processing system based on a GPU and a buffer, which comprises:

a pre-processing unit, configured to continuously pre-process received network packets, form calculation tasks, and put them into a buffer unit;

the buffer unit, configured to store the calculation tasks sent by the pre-processing unit;

a computation unit, configured to continuously either take one calculation task from the buffer unit and compute it on the CPU, or take multiple calculation tasks and compute them on the GPU, and send the computation results to a post-processing unit;

the post-processing unit, configured to continuously post-process the computation results passed on by the computation threads in the computation unit after they finish the calculation tasks.
Further, the pre-processing unit comprises one or more pre-processing threads; the computation unit comprises one or more computation threads; and the post-processing unit comprises one or more post-processing threads.

Further, the pre-processing is the pre-processing required by the network application, including protocol identification, protocol parsing, fragment reassembly, and flow reassembly.

Further, a calculation task comprises the data to be computed, extracted from the network packets by pre-processing, and the relevant configuration information.

Further, each unit in the buffer holds one calculation task.

Further, if the number of calculation tasks in the buffer has not reached a specified threshold, a computation thread in the computation unit takes a single calculation task from the buffer and computes it directly on the CPU.

Further, if the number of calculation tasks in the buffer has not reached the specified threshold, the computation threads in the computation unit suspend work until the number of calculation tasks in the buffer reaches the specified threshold.

Further, if the number of calculation tasks in the buffer reaches the specified threshold, a plurality of computation threads in the computation unit take a threshold number of calculation tasks from the buffer and compute them on the GPU.

Further, the plurality of computation threads in the computation unit taking a threshold number of calculation tasks from the buffer and computing them on the GPU comprises:

copying the data to be computed of the threshold number of calculation tasks into main memory, and then copying it in one operation into the device memory of the GPU device;

launching a GPU kernel function to perform one parallel computation on the data in device memory;

copying the results of the parallel computation back from device memory to main memory, and sending the results to a post-processing thread in the post-processing thread group for post-processing.

Further, each post-processing thread in the post-processing unit receives the computation results passed by the computation threads and, according to the results, performs on the corresponding packets the post-processing required by the network application.

Further, the post-processing required by the network application and performed on the corresponding packets according to the computation results includes forwarding, dropping, logging, alerting, and terminating.
The beneficial effects of the present invention are analyzed as follows.

The present invention provides a buffer for caching calculation tasks. Only after a certain number of tasks have accumulated does a computation thread take them out in a batch and hand them to the GPU. Given the characteristics of GPU computing described earlier, a considerable number of tasks forms a large number of concurrent GPU threads, which makes full use of the parallel computing capability of the GPU.

On the other hand, after a computation thread obtains a considerable number of tasks from the buffer, it first copies the data of all these tasks into one contiguous memory region and then copies that region to the device memory of the GPU device in a single operation. The traditional way is to copy the data of each calculation task to device memory separately. As stated above, batched copying reduces the overhead of bus requests and increases the copy speed. Experimental results show that the copy time saved by this acceleration exceeds the extra overhead of gathering the data; that is, viewed as a whole, gathering the data and copying once is faster than copying separately. The buffer designed in the method of the invention thus obtains a further speedup from batched copying.
In addition, the buffer in the present invention sits after pre-processing and caches calculation tasks rather than buffering the raw received network packets. This design brings benefits in two respects. First, in practice not all packets need computation; for example, in routing or a firewall, some packets can be forwarded or dropped right after protocol identification without forming a calculation task for the next step, which increases processing speed and reduces processing delay. Second, it is not necessarily the case that each packet forms one calculation task; for example, in intrusion detection or anti-virus, a calculation task is often associated with a network flow, and only after the pre-processing stage has reassembled a flow from multiple packets is partial data taken from it to produce one calculation task. The number of calculation tasks is therefore smaller than the number of network packets, which reduces the operations on the buffer and also reduces the memory overhead the buffer requires.
Three groups of threads are provided in the present invention, for pre-processing, key computation, and post-processing respectively. This is the software-pipeline mode described above. The traditional single-threaded way of processing network data executes these three stages serially for each packet, but the three stages can be separated: after multi-threading, each stage loops independently and the stages are connected by buffers or messages. On a multi-core computer system, multiple threads can execute truly concurrently, forming a pipeline and increasing the throughput of the system.

In addition, multiple computation threads are provided in the present invention. These threads are responsible for the key-computation stage. When they call the GPU to compute multiple tasks, the work divides into three steps: i) copy the data to be computed of these tasks into a contiguous memory region, and copy it in one operation into the device memory of the GPU device; ii) launch the GPU kernel function and compute; iii) copy the computation results from device memory to main memory, and send the results to the post-processing threads. On a multi-core computer system this produces the "pipelined scheduling" effect described above: the concurrent execution of different steps increases the processing speed. With many computation threads, the computing capability can also be extended to multiple GPU devices.
In the present invention, when the number of tasks in the buffer has not reached the specified threshold N, a worker thread can behave in either of two ways; the two strategies each have their merits and suit different scenarios.

In the first, the worker thread takes one task at once and computes it directly on the CPU. This mode suits network environments whose load varies widely. When there are few packets in the network, the number of tasks in the buffer grows slowly; the computation threads are then not idle but take out individual tasks to compute, so the tasks in the buffer need not wait until their number reaches N. This keeps the processing delay of the corresponding data within a controlled range, and the processing speed is no lower than the CPU computation rate. When there are many packets, the number of tasks in the buffer quickly grows to the threshold, the computation threads call the GPU, and a high processing speed is reached. This mode therefore adapts well to changes in network load, guaranteeing a bounded processing delay in the worst case and a speed no lower than the CPU processing speed of the traditional approach.

In the second, the worker thread waits until the number of tasks reaches the threshold N. This mode is unsuitable for networks whose load varies widely, but suits relatively stable high-speed networks. In a high-speed network, the worker threads spend all their time on the data copies and kernel launches needed for GPU computation, and never fall back to the (relatively slow) CPU to compute tasks; the GPU can therefore be used at full speed and a high processing rate is maintained.
Description of the Drawings

To explain the technical solutions of the present invention or of the prior art more clearly, the accompanying drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments recorded in the present invention; a person of ordinary skill in the art can derive other drawings from them without creative work.
Fig. 1 is a flow chart of the network data processing method based on a GPU and a buffer according to the present invention;

Fig. 2 is a flow chart of an embodiment of the pre-processing thread of the method;

Fig. 3 is a flow chart of embodiment one of the computation thread of the method;

Fig. 4 is a flow chart of embodiment two of the computation thread of the method;

Fig. 5 is a flow chart of an embodiment of the post-processing thread of the method;

Fig. 6 is a schematic diagram of the network data processing system based on a GPU and a buffer according to the present invention.
Detailed Description of the Embodiments

To enable those skilled in the art to better understand the technical solutions in the embodiments of the present invention, and to make the above objects, features, and advantages of the present invention more apparent, the technical solutions of the present invention are described in further detail below with reference to the accompanying drawings.

The present invention provides a network data processing method and system based on a GPU and a buffer. With multiple computation threads working simultaneously, data copying and GPU kernel execution are pipelined, so that the computing capability of the GPU is brought into full play and network data is processed at high speed. The present invention also has the advantages of dynamically adapting to the network load and of low processing delay. It is independent of the specific packet-receiving, pre-processing, and computation methods, and applies to network application fields such as anti-virus, intrusion detection, routing, and firewalls.
The network data processing method based on a GPU and a buffer provided by the invention is introduced first; its concrete steps are shown in Fig. 1.

S101: a pre-processing thread in the pre-processing thread group continuously pre-processes received network packets, forms calculation tasks, and puts them into the buffer.

The pre-processing thread group comprises one or more pre-processing threads.

The pre-processing is the pre-processing required by the network application, including protocol identification, protocol parsing, fragment reassembly, and flow reassembly.

Each unit in the buffer is one calculation task. A calculation task is composed of the data to be computed and the relevant configuration information, according to the concrete computational requirements of the network application.
S102: a computation thread in the computation thread group continuously either takes one calculation task from the buffer and computes it on the CPU, or takes multiple calculation tasks and computes them on the GPU, and sends the computation results to the post-processing thread group.

The computation thread group comprises one or more computation threads.

When the number of calculation tasks in the buffer reaches a preset threshold N, N calculation tasks are taken out; the data to be computed of these tasks is copied into one contiguous memory region, which is then copied in one operation into the device memory of the GPU device; next a GPU kernel function is launched to perform one parallel computation; after the computation finishes, the results are copied back from device memory to main memory; finally the results are sent to the post-processing threads of step S103.

When the number of calculation tasks in the buffer has not reached the threshold N, there are two different approaches: in the first, a computation thread immediately takes one calculation task from the buffer, computes it directly on the CPU, and on finishing sends the result to a post-processing thread of step S103; in the second, the computation threads suspend work until the number of calculation tasks in the buffer reaches N.

S103: a post-processing thread in the post-processing thread group continuously post-processes the computation results passed on by the computation threads after they finish the calculation tasks.

The post-processing thread group comprises one or more post-processing threads. The function of each post-processing thread is to receive the computation results passed by the computation threads and, according to the results, perform on the corresponding packets the post-processing required by the network application (forwarding, dropping, logging, alerting, terminating, etc.).
An implementation of the present invention in an intrusion detection system (IDS) is given below. For convenience, the key computation of the IDS is simplified to regular-expression matching against a number of detection rules.

The GPU computation uses the CUDA development framework and language, and the GPU device is an NVIDIA GTX-series compute-capable graphics card.

A calculation task is composed of four parts: a network-flow index, an application-layer data pointer, the application-layer data length, and the matching result. The buffer is implemented as a queue with a capacity of 8192 units, each unit being a pointer to a calculation task. Because multiple threads access the queue simultaneously, a lock is needed to enforce exclusive access; the locking function is implemented with a POSIX mutex (pthread_mutex_t).
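Under these assumptions, the task structure and the queue described here might be declared as follows; the field names are ours, chosen only to illustrate the four parts and the pointer-per-unit layout.

```c
#include <pthread.h>
#include <stddef.h>
#include <assert.h>

#define QUEUE_CAP 8192  /* queue capacity stated in the embodiment */

/* The four parts of a calculation task. */
typedef struct {
    unsigned flow_index;    /* network-flow index                 */
    const char *app_data;   /* application-layer data pointer     */
    size_t app_len;         /* application-layer data length      */
    int match_result;       /* filled in after matching           */
} calc_task_t;

/* The buffer queue: each unit is a pointer to a calculation task,
 * and a POSIX mutex enforces exclusive access. */
typedef struct {
    calc_task_t *slots[QUEUE_CAP];
    size_t head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} task_queue_t;
```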
The pre-processing thread group comprises one pre-processing thread. Its main flow is shown in Fig. 2, where S201 to S206 form a loop; each step is described as follows.

S201: receive a packet from the network interface card and perform flow reassembly on it. The implementation of flow reassembly can refer to the source code of an open-source IDS, for example Snort (http://www.snort.org/). After flow reassembly, go to S202.

S202: judge whether the flow contains an application-layer protocol and data. If so, go to S203; if not, stop processing this flow and go to S201 to begin the next iteration.

S203: assemble the application-layer data in the flow and the index number of the flow into a calculation task. Only the first 1 KB of the application-layer data is taken here; if less than 1 KB is available, the actual length is taken. Then go to S204.
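The truncation rule of S203 amounts to the following one-liner; the function name is illustrative.

```c
#include <stddef.h>
#include <assert.h>

#define APP_DATA_CAP 1024  /* at most the first 1 KB goes into a task */

/* Length of application-layer data actually placed in a calculation task. */
static size_t task_data_len(size_t app_len) {
    return app_len < APP_DATA_CAP ? app_len : APP_DATA_CAP;
}
```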
S204: acquire the access lock of buffer queue A, then judge whether queue A is full. If full, go to S205; otherwise go to S206.

S205: release the access lock of queue A and wait until the queue is no longer full, i.e., there is an empty slot. The waiting function can be implemented with the POSIX condition variable pthread_cond_t and its wait operation pthread_cond_wait(). When the queue becomes not-full, reacquire the access lock of A and go to S206.

S206: add the calculation task to buffer queue A, then release the access lock. This finishes the iteration; jump to S201 to start the next one.
The computation thread group comprises 4 computation threads, and there are two implementations, described in turn below.

The first implementation of the computation thread uses both GPU and CPU computation, as shown in Fig. 3, where:
S301: initialize the GPU computing environment, which comprises: selecting the GPU device with cudaSetDevice(); allocating contiguous page-locked host memory with cudaMallocHost() for task data and computation results; allocating device memory with cudaMalloc() to hold task data and computation results; and creating a CUDA stream with cudaStreamCreate() to enable asynchronous copies and kernel execution, so that the computation threads can be pipelined. The host memory and device memory used for the task data are each 8 MB in size, i.e., the queue length 8192 multiplied by the 1 KB maximum of data per task; the host memory and device memory used for the results are each 64 KB in size, i.e., the queue length 8192 multiplied by the 8-byte result per task. Finally, the signature database used for regular-expression matching is copied to a fixed location in device memory with the cudaMemcpy() function. After initialization, go to S302 to begin the main loop.
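The buffer sizing in S301 is simple arithmetic, sketched below as compile-time constants (names are ours): 8192 queue slots times 1 KB of task data give 8 MB, and 8192 slots times an 8-byte result give 64 KB.

```c
#include <assert.h>

#define QUEUE_LEN     8192   /* queue capacity       */
#define TASK_DATA_MAX 1024   /* 1 KB of data per task */
#define RESULT_BYTES  8      /* 8 B of result per task */

enum {
    DATA_REGION_BYTES   = QUEUE_LEN * TASK_DATA_MAX,  /* 8 MB region  */
    RESULT_REGION_BYTES = QUEUE_LEN * RESULT_BYTES    /* 64 KB region */
};
```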
S302: Acquire the access lock of the buffer queue, then check whether it is empty. If the queue is empty, go to S303; otherwise it is non-empty, go to S304.
S303: Release the access lock of the buffer queue and wait until the queue is non-empty, i.e., until a computation task is available. The wait can be implemented with the POSIX condition variable pthread_cond_t and its wait operation pthread_cond_wait(). When the queue becomes non-empty, reacquire the access lock and go to S304.
S304: Check whether the queue is full. If it is full, go to S305; otherwise it is not full, go to S309.
S305: Take all computation tasks out of the queue, then release the queue's access lock. Go to S306.
S306: Copy the application-layer data of all the computation tasks into the page-locked host memory allocated in step S301, then call CUDA's cudaMemcpyAsync() with the stream created in step S301 to copy that data to the pre-allocated space in device memory. Go to S307.
S307: Launch the CUDA compute kernel, which performs the regular-expression matching. The kernel launch parameters specify 32 threads per block and 256 blocks, i.e., 8192 tasks computed in parallel, on the stream created in step S301. After the launch, go to S308.
S308: Copy the computation results from device memory back to host memory with cudaMemcpyAsync(), where both the device and host addresses use the space allocated in step S301. Then insert the results one by one into the corresponding computation task structures. Go to S311.
S309: Take a single computation task out of the queue, then release the queue's access lock. Go to S310.
S310: Perform the regular-expression matching directly on the task's data on the CPU, and record the matching result in the computation task structure. Go to S311.
S311: Send the computation task, now containing its result, to the post-processing thread group. Go to S302 to begin the next iteration.
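The branch at S304, which sends a full queue to the GPU batch path (S305–S308) and anything less to the CPU single-task path (S309–S310), can be sketched as the following dispatch decision; the queue-state type and the small capacity of 8 are illustrative stand-ins for the real 8192-slot queue:

```c
/* Sketch of the S304 dispatch decision of the first implementation:
 * a full queue is drained as one GPU batch, otherwise one task goes
 * to the CPU. Names and capacity are illustrative. */
#include <string.h>

#define CAP 8 /* illustrative; the described system uses 8192 */

typedef struct { int tasks[CAP]; int count; } queue_state;

/* With the queue's access lock held, returns the number of tasks
 * written to `out`: all of them when full (GPU batch, S305), exactly
 * one otherwise (CPU path, S309), zero when empty (wait, S303). */
int take_for_dispatch(queue_state *q, int out[CAP]) {
    if (q->count == CAP) {            /* S304: full -> GPU batch */
        memcpy(out, q->tasks, sizeof(int) * CAP);
        q->count = 0;
        return CAP;
    }
    if (q->count > 0) {               /* S309: one task for the CPU */
        out[0] = q->tasks[0];
        memmove(q->tasks, q->tasks + 1,
                sizeof(int) * (size_t)(q->count - 1));
        q->count--;
        return 1;
    }
    return 0;                         /* empty: the thread waits (S303) */
}
```

The design point is that GPU launch overhead only pays off over a full batch, so a partially filled queue is served task-by-task on the CPU rather than left waiting.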
The second implementation of the compute thread uses only GPU computation, as shown in Figure 4, wherein:
S401: Initialize the GPU computing environment: select the GPU device with cudaSetDevice(); allocate page-locked host memory with cudaMallocHost() for task data and computation results; allocate device memory with cudaMalloc() to hold task data and computation results; and create a CUDA stream with cudaStreamCreate() so that asynchronous copies and kernel launches overlap, pipelining the compute thread. The host and device buffers for task data are each 8 MB, i.e., the queue length of 8192 multiplied by at most 1 KB of data per task; the host and device buffers for computation results are each 64 KB, i.e., the queue length of 8192 multiplied by the 8-byte result per task. Finally, copy the signature database used for regular-expression matching to a fixed location in device memory with cudaMemcpy(). After initialization completes, go to S402 to begin the main loop.
S402: Acquire the access lock of the buffer queue, then check whether it is full. If the queue is not full, go to S403; otherwise it is full, go to S404.
S403: Release the access lock of the buffer queue and wait until the queue is full. The wait can be implemented with the POSIX condition variable pthread_cond_t and its wait operation pthread_cond_wait(). When the queue becomes full, reacquire the access lock of the buffer queue and go to S404.
S404: Take all computation tasks out of the queue, then release the queue's access lock. Go to S405.
S405: Copy the application-layer data of all the computation tasks into the page-locked host memory allocated in step S401, then call CUDA's cudaMemcpyAsync() with the stream created in step S401 to copy that data to the pre-allocated space in device memory. Go to S406.
S406: Launch the CUDA compute kernel, which performs the regular-expression matching. The kernel launch parameters specify 32 threads per block and 256 blocks, i.e., 8192 tasks computed in parallel, on the stream created in step S401. Go to S407.
S407: Copy the computation results from device memory back to host memory with cudaMemcpyAsync(), where both the device and host addresses use the space allocated in step S401. Then insert the results one by one into the corresponding computation task structures. Go to S408.
S408: Send the computation tasks, now containing their results, to the post-processing thread group. Go to S402 to begin the next iteration.
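The wait-until-full logic of steps S402–S404 differs from the first implementation only in the condition being waited on; a minimal sketch follows, again with illustrative names, and assuming the producer signals the `full` condition when it enqueues the 8192nd task:

```c
/* Sketch of S402-S404 in the GPU-only implementation: the compute
 * thread waits until the buffer queue is full, then drains the whole
 * batch under the lock. Names are illustrative. */
#include <pthread.h>
#include <string.h>

#define BATCH_CAP 8192

typedef struct {
    void *tasks[BATCH_CAP];
    int count;
    pthread_mutex_t lock;
    pthread_cond_t full; /* assumed signalled by the producer when full */
} batch_queue;

/* S402: lock and test for full; S403: pthread_cond_wait() releases the
 * lock while waiting and reacquires it on wakeup; S404: drain every
 * task and unlock. Returns the batch size. */
int take_full_batch(batch_queue *q, void *out[BATCH_CAP]) {
    pthread_mutex_lock(&q->lock);
    while (q->count < BATCH_CAP)
        pthread_cond_wait(&q->full, &q->lock);
    memcpy(out, q->tasks, sizeof(void *) * BATCH_CAP);
    q->count = 0;
    pthread_mutex_unlock(&q->lock);
    return BATCH_CAP;
}
```

This variant maximizes GPU batch size at the cost of latency: tasks wait until the queue fills, whereas the first implementation drains partial queues through the CPU.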
The post-processing thread group contains one post-processing thread, whose main function is to receive computation results and decide, according to the match outcome, whether to alert the user. Its main workflow is shown in Figure 5, wherein:
S501: Wait for a computation task containing a result to be sent by a compute thread in the compute thread group. Once one arrives, go to S502.
S502: Examine the computation result. If the result shows that the task matched no IDS rule, no further handling is needed; go to S501 and continue with the next task. Otherwise, the result shows that the task matched some IDS rule; go to S503.
S503: Alert the user with the flow index, the flow-related information, and the matching result, for example by outputting them to the operator terminal and recording them in a log file. When done, go to S501 and continue with the next task.
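The decision in steps S502–S503 can be sketched as follows, assuming a result value of -1 means "no IDS rule matched" and any other value is the matched rule's index; the result encoding and the alarm format are illustrative assumptions, not specified by the invention:

```c
/* Sketch of the post-processing decision of S502-S503. The encoding
 * (-1 = no match, otherwise a rule index) and the alarm text are
 * illustrative assumptions. */
#include <stdio.h>

#define NO_MATCH (-1)

/* S502: nothing further to do when no rule matched. */
int needs_alarm(int matched_rule) {
    return matched_rule != NO_MATCH;
}

/* S503: compose the alarm line from the flow index and the matching
 * result; a real system would print it to the operator terminal and
 * append it to a log file. Returns the formatted length. */
int format_alarm(char *buf, size_t n, int flow_index, int matched_rule) {
    return snprintf(buf, n, "flow %d matched IDS rule %d",
                    flow_index, matched_rule);
}
```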
The present invention also provides a network data processing system based on a GPU and a buffer, as shown in Figure 6, comprising:
a preprocessing unit 601, which continuously preprocesses received network packets, forms computation tasks, and feeds them into the buffer unit 602;
a buffer 602, which stores the computation tasks sent by the preprocessing unit 601;
a computing unit 603, which continuously either takes a single computation task from the buffer 602 for CPU computation or takes multiple computation tasks for GPU computation, and sends the computation results to the post-processing unit 604; and
a post-processing unit 604, which continuously post-processes the computation results delivered by the compute threads of the computing unit 603 upon completing their computation tasks.
The preprocessing unit 601 comprises one or more preprocessing threads; the computing unit 603 comprises one or more compute threads; and the post-processing unit 604 comprises one or more post-processing threads.
The preprocessing is whatever preprocessing the network application requires, including protocol identification, protocol parsing, fragment reassembly, and flow reassembly.
A computation task comprises the to-be-computed data extracted from network packets by the preprocessing, together with the relevant configuration information.
Each slot of the buffer 602 holds one computation task.
If the number of computation tasks in the buffer 602 has not reached the specified threshold, a compute thread in the computing unit 603 takes a single computation task from the buffer 602 and computes it directly on the CPU.
Alternatively, if the number of computation tasks in the buffer 602 has not reached the specified threshold, the compute threads in the computing unit 603 suspend their work until the number of tasks in the buffer 602 reaches the specified threshold.
If the number of computation tasks in the buffer 602 reaches the specified threshold, multiple compute threads in the computing unit 603 take the threshold number of computation tasks from the buffer 602 for GPU computation.
Taking the threshold number of computation tasks from the buffer 602 for GPU computation, by the compute threads in the computing unit 603, comprises:
copying the to-be-computed data of the threshold number of computation tasks into host memory, then copying it to the device memory of the GPU in a single transfer;
launching the GPU kernel to perform one parallel computation over the data in device memory; and
copying the results of the parallel computation back from device memory to host memory, and sending them to a post-processing thread in the post-processing unit 604 for post-processing.
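These three steps can be simulated on the CPU to make the data flow concrete; in the real system the two copies are cudaMemcpyAsync() transfers into and out of device memory and the middle step is the CUDA matching kernel, for which a per-task byte sum substitutes here purely so the sketch runs without a GPU:

```c
/* CPU stand-in for the three-step batch pipeline (copy in, one parallel
 * computation, copy out). A byte sum substitutes for the real
 * regular-expression match; sizes are illustrative. */
#include <stdint.h>
#include <string.h>

#define N_TASKS 4
#define TASK_BYTES 8

/* Step 1: gather all task data into one contiguous staging buffer
 * (the role of the page-locked host buffer and the host-to-device copy). */
void stage_tasks(const uint8_t src[N_TASKS][TASK_BYTES],
                 uint8_t staging[N_TASKS * TASK_BYTES]) {
    for (int i = 0; i < N_TASKS; i++)
        memcpy(staging + i * TASK_BYTES, src[i], TASK_BYTES);
}

/* Step 2: one computation over the whole batch (kernel stand-in);
 * "thread" i produces the 8-byte result for task i. */
void batch_kernel(const uint8_t staging[N_TASKS * TASK_BYTES],
                  uint64_t results[N_TASKS]) {
    for (int i = 0; i < N_TASKS; i++) {
        uint64_t sum = 0;
        for (int j = 0; j < TASK_BYTES; j++)
            sum += staging[i * TASK_BYTES + j];
        results[i] = sum;
    }
}

/* Step 3: scatter the results back into per-task storage
 * (the role of the device-to-host result copy). */
void scatter_results(const uint64_t results[N_TASKS],
                     uint64_t per_task[N_TASKS]) {
    memcpy(per_task, results, sizeof(uint64_t) * N_TASKS);
}
```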
Each post-processing thread in the post-processing unit 604 receives the computation results delivered by the compute threads and, according to those results, performs the post-processing required by the network application on the corresponding packets, including forwarding, dropping, logging, alerting, and termination.
Although the present invention has been described by way of embodiments, those of ordinary skill in the art will recognize that the present invention admits many variations and modifications without departing from its spirit, and it is intended that the appended claims cover such variations and modifications without departing from the spirit of the present invention.