CN112883245A - GPU (graphics processing Unit) stream-based rapid parallel character string matching method and system - Google Patents

Info

Publication number: CN112883245A
Application number: CN202110222110.2A
Authority: CN (China)
Prior art keywords: character, string, module, node, gpu
Legal status: Granted; currently Active (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN112883245B (granted publication)
Inventors: 陈海军, 唐卓, 曹嵘晖, 刘妮, 叶晖
Current assignee: Hunan University of Technology
Original assignee: Hunan University of Technology
Application filed by Hunan University of Technology; priority to CN202110222110.2A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/903 - Querying
    • G06F16/90335 - Query processing
    • G06F16/90344 - Query processing by using string matching techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 - Querying
    • G06F16/245 - Query processing
    • G06F16/24569 - Query processing with adaptation to specific hardware, e.g. adapted for using GPUs or SSDs

Abstract

The invention discloses a rapid parallel character string matching method based on GPU (graphics processing unit) streams, which achieves kernel-level task parallelism by accelerating optimized parallel character string matching on GPU streams. The method first divides a big data task into small data tasks without dependency relationships, and then dispatches the small data tasks to each GPU device to run. The character string data set is stored in the low-speed global memory, while the pattern string, which has a higher access frequency, is stored in the high-speed shared memory. All tasks can be executed asynchronously and concurrently by starting an appropriate number of CUDA streams according to application requirements and resource states. The method can solve the technical problems that the calculation process of the conventional BF algorithm performs many meaningless matching calculations because it retrieves all characters by brute-force traversal, that the calculation process of the conventional RK algorithm has high time complexity, and that the conventional KMP algorithm has a suboptimal shift strategy and is slow.

Description

GPU (graphics processing Unit) stream-based rapid parallel character string matching method and system
Technical Field
The invention belongs to the technical field of internet, and particularly relates to a rapid parallel character string matching method and system based on GPU (graphics processing unit) streams.
Background
As a basis for many fields of scientific computing, the problem of string matching is currently being widely and intensively studied. String matching is applied in many problems such as intrusion detection, molecular biology, information filtering, virus detection, spell checking, language translation, data compression, search engines, etc.
The existing string matching algorithms mainly comprise: the Brute Force (BF) algorithm, the hash-based Rabin-Karp (RK) algorithm, the Knuth-Morris-Pratt (KMP) algorithm and the Boyer-Moore (BM) algorithm. The BF algorithm searches all character alignments by brute force until the matching succeeds or the text ends; the RK algorithm is an improvement on the BF algorithm, mainly screening substrings by comparing their hash values and then running the BF comparison on the screened substrings; the KMP algorithm is greatly improved compared with the BF algorithm, raising efficiency mainly by eliminating backtracking of the main string pointer; the BM algorithm accelerates character movement mainly through the bad character and good suffix rules, and is 3-5 times faster than KMP.
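For concreteness, the baseline BF behavior described above can be sketched as follows. This is an illustrative sketch, not code from the patent; the function name is ours. At every alignment the full pattern is re-compared, which is exactly the source of the "meaningless matching calculations" the invention targets.

```c
#include <string.h>

/* Brute-force (BF) string search: try every alignment of pat in text.
 * Returns the first match position, or -1 if there is no match. */
static int bf_match(const char *text, const char *pat) {
    int n = (int)strlen(text), m = (int)strlen(pat);
    for (int i = 0; i + m <= n; i++) {        /* try every alignment */
        int j = 0;
        while (j < m && text[i + j] == pat[j])
            j++;                               /* compare left to right */
        if (j == m)
            return i;                          /* all m characters matched */
    }
    return -1;
}
```

Because the loop advances by only one position per mismatch, the worst case is O(n*m) comparisons; the tables introduced below exist to skip most of these alignments.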
However, the above existing string matching methods all have some non-negligible drawbacks: firstly, the BF algorithm retrieves all characters by brute-force traversal, so many meaningless matching calculations exist in the calculation process; secondly, the RK algorithm first traverses the hash values of all possibly matching substrings, so calculation on a large data set has high time complexity; thirdly, the KMP algorithm accelerates the movement of the pattern string with a shift strategy, but the strategy is not optimal and is slow; fourthly, the BM algorithm cannot realize data division and parallel computation on large data sets based on highly concurrent GPU devices.
Disclosure of Invention
In view of the above-identified deficiencies in the art or needs for improvement, the present invention provides a method and system for fast parallel string matching based on GPU streams. The method aims to solve the technical problems that the calculation process of the existing BF algorithm performs many meaningless matching calculations because it retrieves all characters by brute-force traversal, that the calculation process of the existing RK algorithm has high time complexity, that the existing KMP algorithm has a suboptimal shift strategy and is slow, and that the existing BM algorithm cannot realize data division and parallel calculation on large data sets based on highly concurrent GPU devices.
To achieve the above object, according to one aspect of the present invention, there is provided a GPU stream-based fast parallel string matching method applied in a distributed computing system including a master node and a plurality of slave nodes, the method including the steps of:
(1) the method comprises the steps that a main node receives an application program submitted by a user and analyzes the application program to obtain a DAG graph;
(2) the main node divides the data corresponding to the tasks in the DAG graph in the step (1) to obtain a plurality of divided data blocks;
(3) the master node sends the segmented data blocks obtained in step (2) to the slave nodes;
(4) the slave node determines whether multiple partitioning points exist in each data block; if yes, go to step (5), otherwise go to step (6);
(5) dividing each data block obtained in the step (2) by the slave node according to the dividing points to obtain a plurality of divided data blocks, creating k GPU execution streams, and averagely distributing the divided data blocks to the k GPU execution streams for processing to obtain k task execution streams executed in parallel, wherein k is an integer less than or equal to 64;
(6) the slave node divides each data block obtained in step (2) into the first 55% and the last 55% to obtain two data blocks, and distributes the two data blocks to two GPU streams for processing to obtain 2 task execution streams executed in parallel;
(7) the slave node configures s threads for the task execution streams obtained in steps (5) and (6) to obtain a control flow group containing s parallel control flows, where s ranges from 128 to 512;
(8) the slave node uses the mth control flow in the control flow group obtained in step (7) to match the pattern string against the character string PSm lying between the mth element Fm and the (m+1)th element Fm+1 of the first character table F, obtaining a shifted pattern string P, where m ∈ [1, s];
(9) the slave node judges whether the pattern string P shifted in step (8) matches the character string PSm successfully; if so, the process ends, otherwise go to step (10);
(10) the slave node judges whether the (m+s)th element Fm+s exists in the first character table F; if so, go to step (11), otherwise the process ends;
(11) the slave node sets the mth element Fm of the first character table F equal to Fm+s and returns to step (8).
Preferably, step (2) specifically comprises the following sub-steps:
(2-1) the master node scans the data corresponding to the task to obtain a bad character table;
(2-2) the master node scans the data corresponding to the task to obtain a first character table F;
(2-3) the master node acquires the total number of GPUs in the slave nodes and determines the average data block size L processed by each GPU according to that total;
(2-4) the master node segments the data corresponding to the task according to the average data block size L obtained in step (2-3) to obtain a plurality of segmented data blocks.
preferably, the step (2-4) is specifically that the first segmentation point of the segmentation process is the position of the character closest to the distance L in the character string corresponding to the data corresponding to the master node task, where the starting point of the L is the starting point of the character string, the second segmentation point of the segmentation process is the position of the character closest to the distance 2L in the character string corresponding to the data corresponding to the master node task, the third segmentation point of the segmentation process is the position of the character closest to the distance 3L in the character string corresponding to the data corresponding to the master node task, …, and so on, and then the data is segmented by the determined segmentation points to obtain a plurality of segmented data blocks.
Preferably, the bad character table is constructed by the following sub-steps:
(2-1-1) set a counter i to 1, acquire the pattern string P corresponding to the master node task, and acquire an array skip for recording bad characters, where the size of the array skip is 256;
(2-1-2) judging whether i is equal to 256, if so, ending the process, and taking the obtained array skip as a final bad character table, otherwise, entering (2-1-3);
(2-1-3) judging whether the character Ci corresponding to the ith element in the array skip is located in the pattern string P, if so, entering the step (2-1-4), otherwise, setting the ith element in the array skip as Plen, wherein the Plen represents the length of the pattern string P corresponding to the main node task;
(2-1-4) set the ith element in the array skip to Plen - 1 - max(PCi), where PCi denotes a position of the character Ci corresponding to the ith element of the array skip in the pattern string P, and max(PCi) denotes the position of the rightmost occurrence of Ci in the pattern string P; then proceed to step (2-1-5);
(2-1-5) set i = i + 1 and return to step (2-1-2).
Preferably, step (2-2) comprises the sub-steps of:
(2-2-1) set a counter j to 1 and k to 1, acquire the pattern string P corresponding to the master node task and the character string S corresponding to the data of the task, and acquire a variable-length array str;
(2-2-2) judging whether j is equal to the length of the character string S, if so, ending the process, and taking the obtained array str as a final first character table F, otherwise, entering the step (2-2-3);
(2-2-3) judging whether the character Sj corresponding to the j-th element in the character string S is equal to the first character P1 in the pattern string P, if so, setting the k-th element in the variable length array str as j, setting k as k +1, and then entering the step (2-2-4); otherwise, directly entering the step (2-2-4);
(2-2-4) set j = j + 1 and return to step (2-2-2).
Preferably, if a bad character C exists in the data block and the bad character C does not belong to any character in the pattern string P corresponding to the master node task, the position of the bad character C in the data block is the splitting point.
Preferably, step (8) comprises the sub-steps of:
(8-1) set a counter p equal to the length of the pattern string P, and acquire the pth character Cp of the pattern string P and the pth character CSp of the character string PSm;
(8-2) judge whether p is larger than 0; if so, go to step (8-3), otherwise the process ends;
(8-3) judge whether Cp and CSp match; if so, go to step (8-4), otherwise go to step (8-5);
(8-4) set p = p - 1 and return to step (8-2);
(8-5) the slave node shifts the pattern string P with respect to the character CSp to obtain a shifted pattern string P, and returns to step (8-1).
Preferably, the shifting rule in step (8-5) is as follows:
Slide(CSp)=max(Skip(CSp),First(CSp))
where Slide(CSp) denotes the distance the pattern string P moves for character CSp, Skip(CSp) is the value of element CSp in the bad character table, and First(CSp) is the value of element CSp in the first character table F.
According to another aspect of the present invention, there is provided a GPU stream-based fast parallel string matching system for use in a distributed computing system including a master node and a plurality of slave nodes, the system comprising:
the first module, arranged on the master node, is used for receiving an application program submitted by a user and analyzing the application program to obtain a DAG (directed acyclic graph);
the second module is arranged on the main node and used for segmenting data corresponding to the tasks in the DAG graph in the first module to obtain a plurality of segmented data blocks;
the third module, arranged on the master node, is used for sending the segmented data blocks obtained by the second module to the slave nodes;
the fourth module, arranged at the slave node, is used for judging whether multiple division points exist in each data block; if yes, control passes to the fifth module, otherwise to the sixth module;
the fifth module, arranged at the slave node, is used for dividing each data block obtained by the second module according to the division points to obtain multiple divided data blocks, creating k GPU execution streams, and evenly allocating the divided data blocks to the k GPU execution streams for processing, so as to obtain k task execution streams executed in parallel, where k is an integer less than or equal to 64;
the sixth module, arranged at the slave node, is used for dividing each data block obtained by the second module into the first 55% and the last 55% to obtain two data blocks, and allocating the two data blocks to two GPU streams for processing, so as to obtain 2 task execution streams executed in parallel;
the seventh module, arranged at the slave node, is used for configuring s threads for the task execution streams obtained by the fifth and sixth modules, so as to obtain a control flow group containing s parallel control flows, where s ranges from 128 to 512;
the eighth module, arranged at the slave node, is used for matching, with the mth control flow of the control flow group obtained by the seventh module, the pattern string against the character string PSm lying between the mth element Fm and the (m+1)th element Fm+1 of the first character table F, obtaining a shifted pattern string P, where m ∈ [1, s];
the ninth module, arranged at the slave node, is used for judging whether the pattern string P shifted by the eighth module matches the character string PSm successfully; if so, the process ends, otherwise control passes to the tenth module;
the tenth module, arranged at the slave node, is used for judging whether the (m+s)th element Fm+s exists in the first character table F; if so, control passes to the eleventh module, otherwise the process ends;
the eleventh module, arranged at the slave node, is used for setting the mth element Fm of the first character table F equal to Fm+s and returning control to the eighth module.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
(1) because steps (2-1) and (2-2) are adopted, the moving distance of characters during pattern string matching is determined mainly through the first character table and the bad character table, so the defect that brute-force retrieval in the existing BF algorithm performs a large amount of meaningless calculation can be overcome;
(2) because step (2-2) is adopted, positions that may match are recorded through the first character table, so the defect that the existing RK algorithm has high time complexity on large data sets can be solved;
(3) because step (2-1) is adopted, the optimal moving distance of characters during pattern string matching is recorded through the bad character table, so the defect that the character shift strategy of the existing KMP algorithm is slow can be solved;
(4) because steps (2-4), (5) and (6) are adopted, data blocks are divided based on bad characters and computed concurrently on multiple GPU streams, overcoming the defect that the existing BM algorithm cannot realize data division and GPU-based highly concurrent parallel calculation for large data sets;
(5) because the time overhead of character string matching and the energy consumption of the computing equipment are conflicting objectives, the method is well suited to balancing them and has relatively low computing cost;
(6) the method does not depend on a functional model, its optimization result is independent of initial conditions, and it has a wide application range.
Drawings
FIG. 1 is a schematic diagram of a typical CUDA flow scheduling process;
FIG. 2 is a schematic diagram of a CPU-GPU heterogeneous computing environment;
FIG. 3 is an example of a bad character table in the present invention;
FIG. 4 is an example of a first character table in the present invention;
FIG. 5 is a flow chart of the fast parallel string matching method based on GPU streams according to the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The basic idea of the invention is to realize kernel-level task parallelism by accelerating optimized parallel string matching on GPU streams. The method first divides a big data task into small data tasks without dependency relationships, and then dispatches the small data tasks to each GPU device to run. The character string data set is stored in the low-speed global memory, while the pattern string, which has a higher access frequency, is stored in the high-speed shared memory. All tasks can be executed asynchronously and concurrently by starting an appropriate number of CUDA streams according to application requirements and resource states.
FIG. 1 shows a typical CUDA stream scheduling process, which includes: first, data is copied from the host (CPU) to the device (GPU); then thread-level parallel computation is executed on the device; finally, the computation result is copied from the device (GPU) back to the host (CPU). Normally, the device starts only a default stream, and all tasks are serialized in that single execution stream. Dispersing tasks into different asynchronous execution streams with multiple streams can greatly increase execution speed.
Fig. 2 shows a CPU-GPU heterogeneous computing environment, in which multiple GPU devices are connected to a CPU host via a PCIe bus, and each GPU contains a large-capacity global memory and a small-capacity shared memory. Multiple GPU streams run on the device-side hardware to execute tasks asynchronously and in parallel. The task load is distributed and communicated by the data manager via the NVSwitch high-speed interconnect.
As shown in fig. 5, the present invention provides a fast parallel string matching method based on GPU streams, which is applied in a distributed computing system including a master node and a plurality of slave nodes, and the method includes the following steps:
(1) the master node receives an application program submitted by a user and analyzes the application program to obtain a directed acyclic graph (DAG);
(2) the main node divides the data corresponding to the tasks in the DAG graph in the step (1) to obtain a plurality of divided data blocks;
the step has the advantages that the bad characters which do not exist in the pattern strings are used as the basis for data block segmentation, a large data set can be divided into a plurality of independent data blocks, and the division points do not have the possibility of matching, so that the plurality of data blocks can be independently and parallelly calculated.
As shown in fig. 5, this step specifically includes the following sub-steps:
and (2-1) the main node scans the data corresponding to the task to obtain a bad character table.
The bad character table is used to define the offset in the pattern string matching process and the reference boundary for data block segmentation, as shown in fig. 3.
The bad character table is implemented as follows:
(2-1-1) set a counter i to 1, acquire the pattern string P corresponding to the master node task, and acquire an array skip for recording bad characters (the size of the array skip is 256);
specifically, the array in this step is obtained with a declaration or allocation statement, for example "int skip[256];" in the C language or "new int[256]" in C++.
(2-1-2) judging whether i is equal to 256, if so, ending the process, and taking the obtained array skip as a final bad character table, otherwise, entering (2-1-3);
(2-1-3) judging whether the character Ci corresponding to the ith element in the array skip is located in the pattern string P, if so, entering the step (2-1-4), otherwise, setting the ith element in the array skip as Plen, wherein the Plen represents the length of the pattern string P corresponding to the main node task;
(2-1-4) set the ith element in the array skip to Plen - 1 - max(PCi), where PCi denotes the position of the character Ci corresponding to the ith element of the array skip in the pattern string P, and max(PCi) denotes the position of the rightmost occurrence of Ci in the pattern string P; then proceed to step (2-1-5).
(2-1-5) set i = i + 1 and return to step (2-1-2).
The above-described steps (2-1-1) to (2-1-5) have an advantage in that the scanning records the positions of all bad characters in the data set that are not present in the pattern string, so that the data can be easily divided into a plurality of completely independent data blocks.
Specifically, the invention records the sliding distance for each bad character of the pattern string. Meanwhile, for a bad character that exists in the character string but not in the pattern string, all of its positions in the character string are recorded. These positions can serve as the basis for data block segmentation, so the segmented blocks have no data dependence and no possibility of a successful match straddling a cut.
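The table construction of steps (2-1-1) to (2-1-5) can be sketched as follows; this is a minimal illustration, with function and variable names of our choosing rather than from the patent. Characters absent from P keep the full pattern length Plen, and characters present in P get Plen - 1 - (rightmost position of the character in P):

```c
#include <string.h>

/* Build the 256-entry bad character table skip[] for pattern P.
 * Later writes in the second loop keep the RIGHTMOST occurrence,
 * matching the max(PCi) in the patent's formula. */
static void build_bad_char_table(const char *P, int skip[256]) {
    int Plen = (int)strlen(P);
    for (int i = 0; i < 256; i++)
        skip[i] = Plen;                        /* character Ci not in P */
    for (int p = 0; p < Plen; p++)
        skip[(unsigned char)P[p]] = Plen - 1 - p;
}
```

For example, with P = "example" (Plen = 7), skip['e'] is 0 (rightmost 'e' at position 6), skip['x'] is 5, and any character absent from P gets 7.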
(2-2) the master node scans the data corresponding to the task to obtain a first character table F.
The first character table F is used to control the distance moved by the pattern string matching process, and as shown in fig. 4, it records all the positions of the first characters P1 in the character string to which the data corresponds.
The method comprises the following substeps:
(2-2-1) set a counter j to 1 and k to 1, acquire the pattern string P corresponding to the master node task and the character string S corresponding to the data of the task, and acquire a variable-length array str;
specifically, the array in this step is obtained with an allocation statement, for example "malloc(sizeof(int) * size)" in the C language.
(2-2-2) judging whether j is equal to the length of the character string S, if so, ending the process, and taking the obtained array str as a final first character table F, otherwise, entering the step (2-2-3);
(2-2-3) judging whether the character Sj corresponding to the j-th element in the character string S is equal to the first character P1 in the pattern string P, if so, setting the k-th element in the variable length array str as j, setting k as k +1, and then entering the step (2-2-4); otherwise, directly entering the step (2-2-4);
(2-2-4) set j = j + 1 and return to step (2-2-2).
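Steps (2-2-1) to (2-2-4) can be sketched as follows; this is an illustrative version, with names of our choosing. It scans S once and records every position where the first character of P occurs, using the patent's 1-based counters:

```c
#include <stdlib.h>
#include <string.h>

/* Build the first character table: positions in S (1-based) where the
 * first character P1 of the pattern occurs. Writes the count to *out_k
 * and returns a malloc'd array the caller must free. */
static int *build_first_char_table(const char *S, const char *P, int *out_k) {
    int n = (int)strlen(S), k = 0;
    int *str = malloc((n > 0 ? n : 1) * sizeof(int));
    for (int j = 0; j < n; j++)
        if (S[j] == P[0])                      /* Sj equals first character P1 */
            str[k++] = j + 1;                  /* record 1-based position */
    *out_k = k;
    return str;
}
```

The resulting table is what the control flows later index as F1, F2, ..., with the mth control flow responsible for the text between Fm and Fm+1.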
And (2-3) the master node acquires the total number of the GPUs in the slave nodes and determines the average data block size L processed by each GPU according to the total number of the GPUs.
Specifically, the average data block size is equal to the total amount of data corresponding to the master node task divided by the total number of GPUs.
(2-4) the main node divides the data corresponding to the main node task according to the average data block size L obtained in the step (2-3) to obtain a plurality of divided data blocks;
specifically, the first division point of the division process is the position of the character closest to distance L in the character string corresponding to the master node task's data (where the starting point of L is the start of the character string), the second division point is the position of the character closest to distance 2L, the third division point is the position of the character closest to distance 3L, and so on; the data is then divided at the determined division points to obtain a plurality of divided data blocks.
In order to avoid division points that could fall inside a match, the invention takes the position of the bad character c that is not in the pattern string and is closest to position nL as the division point, so there is no need to worry about splitting a possibly matching string.
The step (2-4) has the advantage that the load balancing data block dividing strategy is added, so that the sizes of the divided independent data blocks are similar, and the task amount allocated to each computing device is balanced.
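The combination of steps (2-3) and (2-4) can be sketched as follows: the average block size is L = n / num_gpus, and the i-th split point is the bad character (one not occurring in the pattern P) nearest to position i*L, so no potential match straddles a cut. Names are illustrative, not from the patent:

```c
#include <string.h>

static int is_bad(char c, const char *P) {     /* c absent from pattern P */
    return strchr(P, c) == NULL;
}

/* Choose load-balanced split points: for each target i*L, scan outward
 * until the nearest bad character is found. Returns the number of split
 * points written to out[]. */
static int choose_split_points(const char *S, const char *P,
                               int num_gpus, int *out) {
    int n = (int)strlen(S);
    int L = n / num_gpus, found = 0;
    for (int i = 1; i < num_gpus; i++) {
        int target = i * L, best = -1;
        for (int d = 0; d < n; d++) {          /* widen the search radius */
            int lo = target - d, hi = target + d;
            if (lo >= 0 && is_bad(S[lo], P)) { best = lo; break; }
            if (hi <  n && is_bad(S[hi], P)) { best = hi; break; }
        }
        if (best >= 0)
            out[found++] = best;
    }
    return found;
}
```

When a region around i*L contains no bad character at all, no split point is emitted there, which corresponds to the fallback of step (6) below.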
(3) the master node sends the segmented data blocks obtained in step (2) to the slave nodes.
(4) The slave node determines whether multiple partitioning points exist in each data block; if yes, go to step (5), otherwise go to step (6).
specifically, the existence of a division point in a data block means that a bad character C exists in the data block, and the bad character C does not belong to any character in the pattern string P corresponding to the master node task, and the position of the bad character C in the data block is the division point.
(5) Dividing each data block obtained in the step (2) by the slave node according to the dividing points to obtain a plurality of divided data blocks, creating k GPU execution streams, and averagely distributing the divided data blocks to the k GPU execution streams for processing to obtain k task execution streams executed in parallel, wherein k is an integer less than or equal to 64;
(6) the slave node divides each data block obtained in step (2) into the first 55% and the last 55% to obtain two data blocks, and distributes them to two GPU streams for processing to obtain 2 task execution streams executed in parallel;
the first 55% and the last 55% are chosen so that the two pieces overlap, which excludes the case where the cut point falls inside a match.
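The fallback split of step (6) can be sketched as follows (an illustrative sketch; the struct and function names are ours). Cutting into the first 55% and the last 55% gives a 10% overlap around the midpoint, so a match crossing the midpoint appears whole in at least one piece, assuming the pattern is shorter than the overlap:

```c
/* Half-open ranges [start, end) over one data block. */
typedef struct { int start, end; } range_t;

/* Split a block of block_len characters into two overlapping pieces:
 * the first 55% and the last 55%. */
static void split_overlap(int block_len, range_t *a, range_t *b) {
    int cut = (block_len * 55) / 100;           /* 55% of the block */
    a->start = 0;               a->end = cut;            /* first 55% */
    b->start = block_len - cut; b->end = block_len;      /* last 55%  */
}
```

For a 100-character block this yields [0, 55) and [45, 100), overlapping on [45, 55).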
(7) The slave node configures s threads for the task execution streams obtained in steps (5) and (6), obtaining a control flow group containing s parallel control flows;
specifically, s ranges from 128 to 512.
The advantage of the foregoing steps (5) to (7) is that a GPU stream parallel optimization mechanism is added, so that multiple execution streams simultaneously and respectively process multiple different data blocks, thereby increasing the parallelism of computation, speeding up the execution of computation, and simultaneously controlling the selection of stream groups to improve the parallel granularity inside the execution streams as much as possible, thereby further improving the overall parallel efficiency of the task.
(8) The slave node uses the mth control flow in the control flow group obtained in step (7) to match the pattern string against the character string PSm lying between the mth element Fm and the (m+1)th element Fm+1 of the first character table F, obtaining a shifted pattern string P, where m ∈ [1, s];
The method comprises the following substeps:
(8-1) set a counter p equal to the length of the pattern string P, and acquire the pth character Cp of the pattern string P and the pth character CSp of the character string PSm;
(8-2) judge whether p is larger than 0; if so, go to step (8-3), otherwise the process ends;
(8-3) judge whether Cp and CSp match; if so, go to step (8-4), otherwise go to step (8-5);
(8-4) set p = p - 1 and return to step (8-2);
(8-5) the slave node shifts the pattern string P with respect to the character CSp to obtain a shifted pattern string P, and returns to step (8-1);
sequential shifting of the pattern string can result in unnecessary and meaningless matching calculations, skipping these calculations is the correct shift operation. Selecting characters CS by performing a shift operationpThe maximum shift distance in the bad character table and the first character table F.
Specifically, the shift rule is as follows:
Slide(CSp)=max(Skip(CSp),First(CSp))
wherein, slide (CS)p) Indicating character CSpDistance of movement in Pattern string P, Skip (CS)p) Is an element CS in the bad character tablepValue of (c), First (CS)p) Is the element CS in the first character table FpThe value of (c). By comparing the two, the maximum value of the two is selected as the movement distance of the pattern string P. The selection of a larger value does not produce matching omission, unnecessary easy matching processes are reduced, and the execution of matching calculation is accelerated.
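Steps (8-1) to (8-5) can be sketched in Python as follows. The sketch is ours, not the patented implementation: it applies only the Skip() term of the Slide rule (a Horspool-style bad-character shift) and scans windows sequentially rather than s windows in parallel on the GPU:

```python
def match_with_bad_char_shift(text, pattern):
    # Build the bad character table as in claim 4: absent characters get
    # len(pattern); present characters get len(pattern) - 1 - rightmost index.
    m = len(pattern)
    skip = [m] * 256
    for i, ch in enumerate(pattern):      # later occurrences overwrite earlier
        skip[ord(ch)] = m - 1 - i
    hits, pos = [], 0
    while pos + m <= len(text):
        p = m - 1                         # (8-1): start from the last character
        while p >= 0 and pattern[p] == text[pos + p]:
            p -= 1                        # (8-4): match, move one position left
        if p < 0:
            hits.append(pos)              # all characters matched
            pos += 1
        else:                             # (8-5): shift by the table value
            pos += max(1, skip[ord(text[pos + p])])
    return hits
```

For example, match_with_bad_char_shift("abcabcab", "abc") reports matches at positions 0 and 3.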
(9) The slave node judges whether the pattern string P shifted in step (8) matches the character string PS_m successfully; if so, the process ends, otherwise go to step (10);
(10) The slave node judges whether the (m+s)-th element F_{m+s} of the first character table F exists in the first character table F; if so, go to step (11), otherwise the process ends;
(11) The slave node sets the m-th element F_m of the first character table F equal to F_{m+s} and returns to step (8).
In one embodiment, a computer device is provided, which may be a gateway. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device stores the recorded IP addresses of terminals in the local area network and the corresponding MAC address data. The network interface of the computer device communicates with external terminals through a network connection. The computer program, when executed by the processor, implements the GPU stream-based fast parallel string matching method.
In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement all the steps of the GPU stream based fast parallel string matching method of the present invention.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, performs all the steps of the GPU-stream-based fast parallel string matching method of the present invention.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), direct Rambus RAM (RDRAM), and direct Rambus dynamic RAM (DRDRAM).
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (9)

1. A rapid parallel character string matching method based on GPU streams, applied to a distributed computing system comprising a master node and a plurality of slave nodes, characterized by comprising the following steps:
(1) the master node receives an application program submitted by a user and parses the application program to obtain a DAG graph;
(2) the master node divides the data corresponding to the tasks in the DAG graph of step (1) to obtain a plurality of divided data blocks;
(3) the master node sends the divided data blocks obtained in step (2) to the slave nodes;
(4) the slave node judges whether there are multiple division points in each data block; if so, go to step (5), otherwise go to step (6);
(5) the slave node divides each data block obtained in step (2) at its division points to obtain multiple divided data blocks, creates k GPU execution streams, and distributes the divided data blocks evenly across the k GPU execution streams for processing, obtaining k task execution streams executed in parallel, where k is an integer less than or equal to 64;
(6) the slave node divides each data block obtained in step (2) into its first 55% and its last 55% to obtain two independent data blocks, and assigns the two divided data blocks to two GPU execution streams for processing, obtaining 2 task execution streams executed in parallel;
(7) the slave node configures s threads for the task execution streams obtained in steps (5) and (6), obtaining a control flow group containing s parallel control flows, where s ranges from 128 to 512;
(8) the slave node uses the m-th control flow in the control flow group obtained in step (7) to match the character string PS_m lying between the m-th element F_m and the (m+1)-th element F_{m+1} of the first character table F, obtaining a shifted pattern string P, where m ∈ [1, s];
(9) the slave node judges whether the pattern string P shifted in step (8) matches the character string PS_m successfully; if so, the process ends, otherwise go to step (10);
(10) the slave node judges whether the (m+s)-th element F_{m+s} of the first character table F exists in the first character table F; if so, go to step (11), otherwise the process ends;
(11) the slave node sets the m-th element F_m of the first character table F equal to F_{m+s} and returns to step (8).
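Steps (8) to (11) make control flow m responsible for the candidate windows starting at F_m, F_{m+s}, F_{m+2s}, and so on. The stride-s partition of the first character table among the s control flows can be sketched as follows (an illustration of the indexing only, written by us, not of the GPU kernel itself):

```python
def windows_for_control_flow(F, s, m):
    """Candidate start positions handled by the m-th of s control flows:
    element m, then m+s, then m+2s, ... of the first character table F
    (m is 1-based as in the claim; Python list indexing is 0-based)."""
    return F[m - 1::s]
```

With s = 3 control flows over a ten-entry table, flow 1 takes entries 1, 4, 7, 10, flow 2 takes 2, 5, 8, and flow 3 takes 3, 6, 9, so the flows cover the table without overlap.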
2. The GPU-stream-based fast parallel string matching method according to claim 1, wherein step (2) comprises the following sub-steps:
(2-1) the master node scans the data corresponding to the task to obtain a bad character table;
(2-2) the master node scans the data corresponding to the task to obtain a first character table F;
(2-3) the master node obtains the total number of GPUs in the slave nodes and, according to this total, determines the average data block size L to be processed by each GPU;
(2-4) the master node divides the data corresponding to the master node's task according to the average data block size L obtained in step (2-3) to obtain a plurality of divided data blocks.
3. The method as claimed in claim 2, wherein step (2-4) specifically divides the data at determined cut points to obtain the divided data blocks: taking the start of the character string corresponding to the master node's task data as the origin, the first cut point is the position of the division-point character closest to L, the second cut point is the position of the division-point character closest to 2L, the third cut point is the position of the division-point character closest to 3L, and so on.
4. The GPU-stream-based fast parallel string matching method according to any of claims 1-3, characterized in that the bad character table is constructed through the following sub-steps:
(2-1-1) set a counter i = 1, obtain the pattern string P corresponding to the master node's task, and allocate an array skip of size 256 for recording bad characters;
(2-1-2) judge whether i is greater than 256; if so, the process ends and the resulting array skip is the final bad character table, otherwise go to step (2-1-3);
(2-1-3) judge whether the character C_i corresponding to the i-th element of the array skip occurs in the pattern string P; if so, go to step (2-1-4), otherwise set the i-th element of the array skip to Plen, where Plen denotes the length of the pattern string P corresponding to the master node's task, and go to step (2-1-5);
(2-1-4) set the i-th element of the array skip to Plen - 1 - max(P_Ci), where P_Ci denotes the positions of the character C_i in the pattern string P and max(P_Ci) is the position of the rightmost occurrence of C_i in P; then go to step (2-1-5);
(2-1-5) set i = i + 1 and return to step (2-1-2).
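Sub-steps (2-1-1) to (2-1-5) amount to the following Python sketch (ours, for illustration; it uses 0-based byte values and positions, whereas the claim counts entries from 1):

```python
def build_bad_char_table(pattern):
    """Bad character table of claim 4: one entry per possible byte value.

    A character absent from the pattern gets Plen; a character present in
    the pattern gets Plen - 1 - (position of its rightmost occurrence).
    """
    plen = len(pattern)
    skip = [plen] * 256
    for i, ch in enumerate(pattern):   # the rightmost occurrence wins
        skip[ord(ch)] = plen - 1 - i
    return skip
```

For the pattern "abcab" the table holds skip['a'] = 1, skip['b'] = 0, skip['c'] = 2, and 5 for every character that does not occur in the pattern.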
5. The GPU-stream-based fast parallel string matching method according to claim 4, characterized in that step (2-2) comprises the following sub-steps:
(2-2-1) set counters j = 1 and k = 1, obtain the pattern string P corresponding to the master node's task and the character string S corresponding to the task's data, and allocate a variable-length array str;
(2-2-2) judge whether j is greater than the length of the character string S; if so, the process ends and the resulting array str is the final first character table F, otherwise go to step (2-2-3);
(2-2-3) judge whether the character S_j corresponding to the j-th element of the character string S equals the first character P_1 of the pattern string P; if so, set the k-th element of the variable-length array str to j, set k = k + 1, and go to step (2-2-4); otherwise go directly to step (2-2-4);
(2-2-4) set j = j + 1 and return to step (2-2-2).
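Sub-steps (2-2-1) to (2-2-4) simply record every text position that holds the pattern's first character; a 0-based Python sketch (ours, for illustration):

```python
def build_first_char_table(text, pattern):
    """First character table F of claim 5: the positions j in the text
    whose character equals the first character of the pattern (0-based
    here; the claim counts positions from 1)."""
    first = pattern[0]
    return [j for j, ch in enumerate(text) if ch == first]
```

build_first_char_table("abcaba", "ab") yields [0, 3, 5], the three places where an occurrence of the pattern could begin.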
6. The GPU-stream-based fast parallel string matching method according to any of claims 1-5, characterized in that if a data block contains a bad character C that does not match any character in the pattern string P corresponding to the master node's task, the position of the bad character C in the data block is a division point.
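The division-point test of claim 6 is a membership check; the following Python sketch (the function name is ours) lists every safe cut position in a block:

```python
def find_division_points(block, pattern):
    """Positions in the block holding a character that occurs nowhere in
    the pattern (claim 6): no match can contain such a character, so the
    block can be cut there without losing any occurrence."""
    pat_chars = set(pattern)
    return [i for i, ch in enumerate(block) if ch not in pat_chars]
```

For the block "abXcdYab" and pattern "abcd", the characters X and Y never appear in the pattern, so positions 2 and 5 are division points; a block built only from pattern characters has none.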
7. The GPU-stream-based fast parallel string matching method according to any of claims 1-6, characterized in that step (8) comprises the following sub-steps:
(8-1) set a counter p equal to the length of the pattern string P, and obtain the character C_p at the p-th position of the pattern string P and the p-th character CS_p of the character string PS_m;
(8-2) judge whether p is greater than 0; if so, go to step (8-3), otherwise the process ends;
(8-3) judge whether C_p and CS_p match; if so, go to step (8-4), otherwise go to step (8-5);
(8-4) set p = p - 1 and return to step (8-1);
(8-5) the slave node shifts the pattern string P with respect to the character CS_p to obtain a shifted pattern string P, and returns to step (8-1).
8. The GPU-stream-based fast parallel string matching method according to claim 7, characterized in that the shift rule in step (8-5) is as follows:
Slide(CS_p) = max(Skip(CS_p), First(CS_p))
where Slide(CS_p) is the distance the pattern string P moves for the character CS_p, Skip(CS_p) is the value of element CS_p in the bad character table, and First(CS_p) is the value of element CS_p in the first character table F.
9. A GPU stream-based fast parallel string matching system, applied to a distributed computing system comprising a master node and a plurality of slave nodes, characterized by comprising:
a first module, arranged on the master node, for receiving an application program submitted by a user and parsing it to obtain a DAG graph;
a second module, arranged on the master node, for dividing the data corresponding to the tasks in the DAG graph of the first module to obtain a plurality of divided data blocks;
a third module, arranged on the master node, for sending the divided data blocks obtained by the second module to the slave nodes;
a fourth module, arranged at the slave node, for judging whether there are multiple division points in each data block; if so, go to the fifth module, otherwise go to the sixth module;
a fifth module, arranged at the slave node, for dividing each data block obtained by the second module at its division points to obtain multiple divided data blocks, creating k GPU execution streams, and distributing the divided data blocks evenly across the k GPU execution streams for processing, so as to obtain k task execution streams executed in parallel, where k is an integer less than or equal to 64;
a sixth module, arranged at the slave node, for dividing each data block obtained by the second module into its first 55% and its last 55% to obtain two independent data blocks, and assigning the two divided data blocks to two GPU execution streams for processing, so as to obtain 2 task execution streams executed in parallel;
a seventh module, arranged at the slave node, for configuring s threads for the task execution streams obtained by the fifth and sixth modules, so as to obtain a control flow group containing s parallel control flows, where s ranges from 128 to 512;
an eighth module, arranged at the slave node, for using the m-th control flow in the control flow group obtained by the seventh module to match the character string PS_m lying between the m-th element F_m and the (m+1)-th element F_{m+1} of the first character table F, so as to obtain a shifted pattern string P, where m ∈ [1, s];
a ninth module, arranged at the slave node, for judging whether the pattern string P shifted by the eighth module matches the character string PS_m successfully; if so, the process ends, otherwise go to the tenth module;
a tenth module, arranged at the slave node, for judging whether the (m+s)-th element F_{m+s} of the first character table F exists in the first character table F; if so, go to the eleventh module, otherwise the process ends;
an eleventh module, arranged at the slave node, for setting the m-th element F_m of the first character table F equal to F_{m+s} and returning to the eighth module.
CN202110222110.2A 2021-02-28 2021-02-28 GPU (graphics processing Unit) stream-based quick parallel character string matching method and system Active CN112883245B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110222110.2A CN112883245B (en) 2021-02-28 2021-02-28 GPU (graphics processing Unit) stream-based quick parallel character string matching method and system

Publications (2)

Publication Number Publication Date
CN112883245A true CN112883245A (en) 2021-06-01
CN112883245B CN112883245B (en) 2022-05-10

Family

ID=76054914

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110222110.2A Active CN112883245B (en) 2021-02-28 2021-02-28 GPU (graphics processing Unit) stream-based quick parallel character string matching method and system

Country Status (1)

Country Link
CN (1) CN112883245B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110252046A1 (en) * 2008-12-16 2011-10-13 Geza Szabo String matching method and apparatus
CN103559018A (en) * 2013-10-23 2014-02-05 东软集团股份有限公司 String matching method and system based on graphics processing unit (GPU) calculation
US20140095834A1 (en) * 2012-09-30 2014-04-03 Shih J. Kuo Instruction and logic for boyer-moore search of text strings
US20150026194A1 (en) * 2012-03-01 2015-01-22 International Business Machines Corporation Finding a best matching string among a set of strings
WO2015088314A1 (en) * 2013-12-09 2015-06-18 Mimos Berhad An apparatus and method for parallel moving adaptive windo filtering edit distance computation
US9830369B1 (en) * 2013-05-14 2017-11-28 Jsonar, Inc. Processor for database analytics processing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XU SHIQIANG (续士强) et al.: "基于GPU加速的快速字符串匹配算法" [A fast string matching algorithm based on GPU acceleration], Software Guide (《软件导刊》) *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant