CN117971355B - Heterogeneous acceleration method, device, equipment and storage medium based on self-supervision learning - Google Patents

Publication number: CN117971355B (granted; earlier publication CN117971355A)
Application number: CN202410376329.1A
Authority: CN (China)
Legal status: Active (application granted)
Inventors: 童浩南, 任智新, 张闯
Applicant/Assignee: Suzhou Metabrain Intelligent Technology Co Ltd

Abstract

The invention provides a heterogeneous acceleration method, device, equipment and storage medium based on self-supervised learning, relating to the technical field of computers. A data control flow is acquired through a local hardware device, a non-deterministic finite automaton is generated according to a generated regular expression, and the non-deterministic finite automaton is used to parse and filter the data control flow; the non-deterministic finite automaton is analyzed based on a self-supervised learning model in a heterogeneous device, and when the non-deterministic finite automaton forms a matching relation with the regular expression it represents, it is configured onto a corresponding regular engine and the data control flow is parsed and filtered in parallel; the self-supervised learning model is trained on data containing center words, background words and noise words, and can more effectively verify whether the non-deterministic finite automaton represents the regular rule, thereby realizing more efficient regular expression matching on an FPGA.

Description

Heterogeneous acceleration method, device, equipment and storage medium based on self-supervision learning
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a heterogeneous acceleration method, apparatus, device, and storage medium based on self-supervised learning.
Background
At present, software solutions on a central processing unit (Central Processing Unit, CPU) or a graphics processing unit (Graphics Processing Unit, GPU) quickly become compute-limited as expression complexity increases, so heterogeneous acceleration using hardware acceleration devices is widely adopted: on the one hand, big-data workloads can be offloaded onto a hardware acceleration card to obtain a higher speed-up ratio, and on the other hand, part of the CPU or GPU load can be released. The field-programmable gate array (Field Programmable Gate Array, FPGA) is a common heterogeneous acceleration card widely used in data centers. An FPGA can compile a regular expression directly into a non-deterministic finite automaton (Nondeterministic Finite Automata, NFA), whose matching paths are built immediately as the input character string is encountered, so that software applications on the CPU or GPU can be accelerated. However, as the complexity of the regular expression increases, the corresponding non-deterministic finite automaton may become very large and complex. Because it cannot be verified whether the non-deterministic finite automaton accurately expresses the content of the regular expression, a regular expression of high complexity may yield an unreasonable automaton, so that matching efficiency is low when the automaton processes large-scale text data, which affects the heterogeneous device's execution of acceleration tasks.
Disclosure of Invention
The invention provides a heterogeneous acceleration method, device, equipment and storage medium based on self-supervised learning, which are used for overcoming the following defects in the related art: because it cannot be verified whether a non-deterministic finite automaton accurately expresses the content of a regular expression, a regular expression of high complexity may yield an unreasonable automaton, matching efficiency is low when the automaton processes large-scale text data, and the heterogeneous device is hindered in executing acceleration tasks.
The invention provides a heterogeneous acceleration method based on self-supervision learning, which comprises the following steps:
Acquiring a data control flow through a local hardware device, and generating a non-deterministic finite automaton according to a generated regular expression, wherein the non-deterministic finite automaton is used for representing the regular expression so as to analyze and filter the data control flow;
Receiving the data control flow and the non-deterministic finite automaton through heterogeneous equipment, analyzing the non-deterministic finite automaton based on a self-supervision learning model, configuring the non-deterministic finite automaton to a corresponding regular engine when the non-deterministic finite automaton is in a matching relation with a regular expression represented by the non-deterministic finite automaton, and carrying out parallel analysis and filtering on the data control flow;
the self-supervision learning model is obtained by training based on training data comprising a center word, a background word and a noise word.
According to the heterogeneous acceleration method based on self-supervised learning, the self-supervised learning model is used for analyzing the non-deterministic finite automaton, and the heterogeneous acceleration method comprises the following steps:
Acquiring a state topological structure of the non-deterministic finite automaton;
Constructing a training sample set based on the state topological structure, wherein training data in the training sample set comprises a center word, a background word and a noise word;
training the self-supervision learning model based on the training sample set;
Inputting the non-deterministic finite automaton into a trained self-supervision learning model, and obtaining a characterization vector corresponding to the non-deterministic finite automaton;
Calculating the similarity between each characterization vector, and acquiring the character/character string with the strongest correlation in the non-deterministic finite automaton;
Acquiring characters/character strings with strongest correlation in the regular expression corresponding to the non-deterministic finite automaton;
And when the character/character string with the strongest correlation in the non-deterministic finite automaton is consistent with the character/character string with the strongest correlation in the regular expression, judging that the non-deterministic finite automaton forms a matching relation with the represented regular expression.
According to the heterogeneous acceleration method based on self-supervised learning provided by the invention, the training sample set is constructed based on the state topological structure, and the heterogeneous acceleration method comprises the following steps:
Generating a plurality of character strings based on the state topology;
Generating a corpus based on the plurality of character strings;
and extracting the center word, the background word and the noise word from the corpus to obtain a training sample set.
According to the heterogeneous acceleration method based on self-supervised learning provided by the invention, the method for generating a plurality of character strings based on the state topological structure comprises the following steps:
Creating an initial character string, starting with a special character, and initializing a current state as a starting state of the state topological structure;
establishing a conversion character dictionary based on the state topological structure, wherein the conversion character dictionary is used for representing the next state to which each state can be transferred and corresponding conversion characters of the next state;
Simulating state conversion based on the converted characters until a final state is reached;
a string is generated based on all characters traversed in the state transition process, the string comprising a sequence of transition characters from a starting state to a final state.
According to the heterogeneous acceleration method based on self-supervised learning provided by the invention, the simulating of state transitions based on the transition characters until a final state is reached comprises the following steps:
When the current state is not the final state, acquiring a possible conversion path of the current state according to the conversion character dictionary;
randomly selecting one path from possible conversion paths, and adding conversion characters corresponding to the selected path into the initialized character string;
The current state is updated to the next state in the selected transition character dictionary until the final state is reached.
According to the heterogeneous acceleration method based on self-supervised learning, each character string in the corpus is regarded as a sentence, each character in the sentence is regarded as a word, and the central word and the background word are extracted from the corpus, and the heterogeneous acceleration method comprises the following steps:
calculating the occurrence frequency of each word in the corpus;
filtering out words with low occurrence frequency based on the occurrence frequency of each word and a preset high-frequency word threshold value, and constructing a vocabulary based on the residual words;
Traversing each word in the vocabulary, taking each word as a central word, randomly selecting a window size, determining the number of background words of each central word according to the window size, selecting words around each central word as the background words corresponding to each central word according to the number of the background words, and taking the central words and the corresponding background words as training positive samples.
According to the heterogeneous acceleration method based on self-supervised learning provided by the invention, after the vocabulary is constructed based on the residual words, the heterogeneous acceleration method further comprises the following steps:
and carrying out secondary random sampling on the vocabulary, obtaining the occurrence frequency and the total word number of each word in the vocabulary, and screening the words in the vocabulary according to the preset occurrence frequency requirement and the total word number requirement to obtain a final vocabulary.
According to the heterogeneous acceleration method based on self-supervised learning provided by the invention, noise words are extracted from the corpus, and the heterogeneous acceleration method comprises the following steps:
selecting words with occurrence frequency lower than a preset low-frequency word threshold from the corpus, wherein the low-frequency word threshold is three-fourths of the high-frequency word threshold;
Constructing a noise distribution based on the words with the occurrence frequency lower than a preset low-frequency word threshold value;
Normalizing the noise distribution, and unifying words with different occurrence frequencies to the same occurrence frequency;
Randomly extracting a plurality of words meeting the noise quantity requirement by using the normalized noise distribution, and adjusting the number of the extracted words according to the model calculation requirement and calculation resources;
judging whether the extracted word is a background word corresponding to the central word, if so, discarding the extracted word; otherwise, taking the extracted words as noise words corresponding to the central words until the number of the noise words meets the noise number requirement;
And traversing all the central words in the corpus, and generating a group of noise words for each central word as a training negative sample.
According to the heterogeneous acceleration method based on self-supervised learning provided by the invention, the center word, the background word and the noise word are extracted from the corpus to obtain a training sample set, and the heterogeneous acceleration method comprises the following steps:
Receiving the center word, the background word and the noise word through a batch processing function;
Adding positive sample labels for the center words and the corresponding background words thereof and storing the positive sample labels in a positive sample data list, and adding negative sample labels for the center words and the corresponding noise words thereof and storing the negative sample labels in a negative sample data list;
Unifying the list lengths of the positive sample data list and the negative sample data list by appending padding elements to the shorter data list;
Assigning different mask values to normal elements and padding elements in each data list;
A training sample set is generated based on the positive sample data list and the negative sample data list.
According to the heterogeneous acceleration method based on self-supervised learning provided by the invention, the self-supervised learning model is trained based on the training sample set, and the heterogeneous acceleration method comprises the following steps:
Reading the positive sample data of the positive sample data list and the negative sample data in the negative sample data list in batches;
for each read lot, inputting the positive or negative sample data into a self-supervised learning model to perform forward computation;
calculating a predicted loss based on the forward calculation output result and the loss function;
And carrying out back propagation through the prediction loss, and updating model parameters until the training ending condition is met, so as to obtain a trained self-supervision learning model.
According to the heterogeneous acceleration method based on self-supervised learning provided by the invention, the batch reading of the positive sample data in the positive sample data list and the negative sample data in the negative sample data list comprises the following steps:
And transmitting the batch processing function as a parameter to a data loader, and reading the positive sample data of the positive sample data list and the negative sample data in the negative sample data list in batches by the data loader.
According to the heterogeneous acceleration method based on self-supervised learning provided by the invention, the self-supervised learning model comprises a word-hopping (skip-gram) model, wherein the word-hopping model comprises a first key embedding layer and a second key embedding layer; the self-supervised learning model performs forward computation, including:
converting the center word into a center word vector through the first key embedding layer, and converting the background word and the noise word into a background word vector and a noise word vector through the second key embedding layer;
estimating the similarity between the central word vector and the background word vector or the noise word vector;
and outputting positive and negative sample classification results corresponding to the currently input training data based on the similarity.
According to the heterogeneous acceleration method based on self-supervised learning provided by the invention, the estimating of the similarity between the central word vector and the background word vector or the noise word vector comprises the following steps:
And respectively carrying out dot product operation on the central word vector and the background word vector or the noise word vector, and taking a dot product operation result as the similarity between the central word vector and the background word vector or the noise word vector.
According to the heterogeneous acceleration method based on self-supervision learning, the loss function is a binary cross entropy loss function, the predictive loss is calculated based on a forward calculation output result and the loss function, and the heterogeneous acceleration method comprises the following steps:
and calculating the prediction loss between the positive and negative sample classification result corresponding to the input training data and the positive and negative sample label corresponding to the input training data based on the binary cross entropy loss function.
According to the heterogeneous acceleration method based on self-supervised learning provided by the invention, the similarity between each characterization vector is calculated, and the character/character string with the strongest correlation in the non-deterministic finite automaton is obtained, which comprises the following steps:
acquiring cosine similarity between characterization vectors of each node in the non-deterministic finite automaton;
The characters corresponding to the two characterization vectors with the maximum cosine similarity are used as the characters with the strongest correlation in the non-deterministic finite automaton;
Or taking the characters corresponding to the plurality of characterization vectors with the cosine similarity larger than a preset threshold as the character string with the strongest correlation in the non-deterministic finite automaton.
According to the heterogeneous acceleration method based on self-supervised learning provided by the invention, the acquiring of the character/character string with the strongest correlation in the regular expression corresponding to the non-deterministic finite automaton comprises the following steps:
Calculating a first conditional probability of surrounding characters generated by characters in the regular expression, and acquiring word vectors of the regular expression corresponding to the non-deterministic finite automaton according to the first conditional probability; or calculating a second conditional probability of a corresponding character generated by surrounding characters of a certain character in the regular expression, and acquiring a word vector of the regular expression corresponding to the non-deterministic finite automaton according to the second conditional probability;
and calculating the correlation among a plurality of word vectors based on cosine similarity, and screening out the character/character string with the strongest correlation in the regular expression corresponding to the non-deterministic finite automaton based on the correlation.
According to the heterogeneous acceleration method based on self-supervised learning provided by the invention, the local hardware equipment comprises: a CPU or GPU; the heterogeneous device comprises an FPGA, further comprising:
the CPU or GPU sends control instructions to the FPGA through a register, wherein the control instructions comprise control start, reset and address offset.
The invention also provides a heterogeneous acceleration device based on self-supervision learning, which comprises:
The generation module is used for acquiring a data control flow through the local hardware equipment and generating a non-deterministic finite automaton according to the generated regular expression, wherein the non-deterministic finite automaton is used for representing the regular expression so as to analyze and filter the data control flow;
the analysis module is used for receiving the data control flow and the non-deterministic finite automaton through heterogeneous equipment, analyzing the non-deterministic finite automaton based on a self-supervision learning model, and configuring the non-deterministic finite automaton to a corresponding regular engine when the non-deterministic finite automaton is in a matching relation with a regular expression represented by the non-deterministic finite automaton, so as to analyze and filter the data control flow in parallel;
the self-supervision learning model is obtained by training based on training data comprising a center word, a background word and a noise word.
The invention also provides a terminal device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor realizes the heterogeneous acceleration method based on self-supervised learning when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the self-supervised learning based heterogeneous acceleration method of any of the above.
According to the heterogeneous acceleration method, the heterogeneous acceleration device, the heterogeneous acceleration equipment and the storage medium based on self-supervision learning, the data control flow is acquired through the local hardware equipment, and the non-deterministic finite automaton is generated according to the generated regular expression and is used for representing the regular expression so as to analyze and filter the data control flow; receiving the data control flow and the non-deterministic finite automaton through heterogeneous equipment, analyzing the non-deterministic finite automaton based on a self-supervision learning model, configuring the non-deterministic finite automaton to a corresponding regular engine when the non-deterministic finite automaton is in a matching relation with a regular expression represented by the non-deterministic finite automaton, and carrying out parallel analysis and filtering on the data control flow; the self-supervision learning model is obtained by training based on training data containing a center word, a background word and a noise word, and can be used for effectively verifying whether the non-deterministic finite automaton represents a regular rule, so that more efficient regular expression matching is realized on the FPGA, more developers can easily apply the FPGA to develop acceleration application, and development of hardware acceleration technology field based on regular expression matching is promoted.
Drawings
In order to more clearly illustrate the invention or the technical solutions in the related art, the following description will briefly explain the drawings used in the embodiments or the related art description, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for those skilled in the art.
FIG. 1 is a schematic flow chart of a heterogeneous acceleration method based on self-supervised learning provided by an embodiment of the invention;
FIG. 2 is a schematic diagram of a device deployment provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of a non-deterministic finite automaton topology provided by an embodiment of the present invention;
Fig. 4 is a schematic functional structure diagram of a heterogeneous acceleration device based on self-supervised learning according to an embodiment of the present invention;
fig. 5 is a schematic functional structure of a terminal device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 is a flowchart of a heterogeneous acceleration method based on self-supervised learning according to an embodiment of the present invention, as shown in fig. 1, where the heterogeneous acceleration method based on self-supervised learning according to an embodiment of the present invention includes:
Step 101, acquiring a data control flow through a local hardware device, and generating a non-deterministic finite automaton according to a generated regular expression, wherein the non-deterministic finite automaton is used for representing the regular expression so as to analyze and filter the data control flow;
102, receiving the data control flow and the non-deterministic finite automaton through heterogeneous equipment, analyzing the non-deterministic finite automaton based on a self-supervision learning model, configuring the non-deterministic finite automaton to a corresponding regular engine when the non-deterministic finite automaton is in a matching relation with a regular expression represented by the non-deterministic finite automaton, and carrying out parallel analysis and filtering on the data control flow;
the self-supervision learning model is obtained by training based on training data comprising a center word, a background word and a noise word.
In an embodiment of the present invention, a local hardware device includes: a CPU or GPU; the heterogeneous device comprises an FPGA, further comprising: the CPU or GPU sends control instructions into the FPGA through a register, wherein the control instructions comprise control start, reset and address offset.
In the embodiment of the invention, a CPU or a GPU is deployed on a storage server of model NF5266M6, and heterogeneous acceleration is realized in cooperation with an acceleration card of model F37X. The workflow is as follows: first, the CPU transfers the database data to the DDR of the FPGA board by direct memory access (Direct Memory Access, DMA). Meanwhile, the CPU generates the NFAs of the regular expressions, packs the NFA information into frames (each frame can contain a plurality of regular expressions), and then transmits the frames to the double data rate (Double Data Rate, DDR) memory of the FPGA board. The CPU feeds the necessary control information into the FPGA through registers, including control start, reset, address offset, etc. Then, the frame data is parsed, and different NFAs are configured onto different regular engines according to the configuration information. Once configuration of the regular engines is complete, the system starts to parse and filter the data frames in parallel and finally gathers the results; hardware acceleration is realized by processing the data frames in parallel.
As shown in fig. 2, the key hardware components include a CPU, DDR memory of the FPGA board, the FPGA board itself, and a regularization engine. The data and control flow starts from the CPU and is transmitted to the DDR of the FPGA through the DMA, then the CPU sends control information to the FPGA through the register, and finally the regular engine inside the FPGA processes the data. Wherein the roles of each hardware component include:
The CPU is used as a central processing unit of the system and is responsible for generating the NFA of the regular expression and controlling the flow direction of data and control information. The CPU uses DMA to transfer data from the database to DDR memory of the FPGA board card. The DDR memory is used for storing data transmitted from a database and frames containing regular expressions NFA generated by a CPU. The FPGA board card is used for receiving control information from the CPU through the register, analyzing frame data, and configuring the NFA to the regular engine for data analysis and filtering. The regular engine is a component in the FPGA and is used for processing and analyzing the data frames in parallel and executing regular expression matching.
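The sequence above can be summarized in a short host-side sketch. All helper functions, register offsets, and DDR addresses below are hypothetical placeholders for illustration only; the text does not describe the actual driver interface of the board:

```python
DATA_BASE = 0x0000_0000   # assumed DDR offset for database data
NFA_BASE = 0x4000_0000    # assumed DDR offset for NFA frames
REG_RESET, REG_ADDR_OFFSET, REG_START = 0x00, 0x08, 0x10  # assumed register map

def dma_write(ddr_offset: int, payload: bytes) -> None:
    """Placeholder for the DMA transfer into the FPGA board's DDR."""
    print(f"DMA: {len(payload)} bytes -> DDR+0x{ddr_offset:08x}")

def reg_write(reg: int, value: int) -> None:
    """Placeholder for a control-register write over the register interface."""
    print(f"REG: 0x{reg:02x} <- 0x{value:x}")

def offload(db_pages: bytes, nfa_frames: bytes) -> None:
    dma_write(DATA_BASE, db_pages)        # 1. database data -> board DDR via DMA
    dma_write(NFA_BASE, nfa_frames)       # 2. frames of compiled NFAs -> board DDR
    reg_write(REG_RESET, 1)               # 3. control information via registers:
    reg_write(REG_ADDR_OFFSET, NFA_BASE)  #    reset and address offset
    reg_write(REG_START, 1)               # 4. engines parse frames, match in parallel
```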
Since regular expressions tend to be long, numerous, and frequently modified, even if state machine generation is automated by a program (which improves the efficiency of the present architecture), a large number of development cases still need to be validated during program development and deployment. However, given the large and flexible volume of On-Line Analytical Processing (OLAP) business data traffic and limited development resources, traversing all possible regular expression queries is impractical, and there is no efficient way to verify whether a non-deterministic finite automaton generated from a regular expression is reasonable.
According to the heterogeneous acceleration method based on self-supervision learning, a local hardware device is used for acquiring a data control flow, and a non-deterministic finite automaton is generated according to a generated regular expression, and the non-deterministic finite automaton is used for representing the regular expression so as to analyze and filter the data control flow; receiving the data control flow and the non-deterministic finite automaton through heterogeneous equipment, analyzing the non-deterministic finite automaton based on a self-supervision learning model, configuring the non-deterministic finite automaton to a corresponding regular engine when the non-deterministic finite automaton is in a matching relation with a regular expression represented by the non-deterministic finite automaton, and carrying out parallel analysis and filtering on the data control flow; the self-supervision learning model is obtained by training based on training data containing a center word, a background word and a noise word, and can be used for effectively verifying whether the non-deterministic finite automaton represents a regular rule, so that more efficient regular expression matching is realized on the FPGA, more developers can easily apply the FPGA to develop acceleration application, and development of hardware acceleration technology field based on regular expression matching is promoted.
Based on any of the above embodiments, analyzing the non-deterministic finite automaton based on a self-supervised learning model includes:
Step 201, acquiring a state topological structure of the non-deterministic finite automaton;
Taking the expression p(at)*(r|n) as an example, the state machine topology of the converted NFA is shown in FIG. 3, where Si is the initial state, S0 to S4 are intermediate states, Sf is the final accepting state, and ε represents the empty character. The NFA is automatically generated by a software program whose input is the regular expression and whose output is the NFA stack.
The state topology node comprises:
Edges with empty strings (ε) from Si to S0;
From S0 to S1, edges with labels 'p';
from S1 to S2, edges with labels 'a';
from S2 to S3, edges with labels't';
from S3 to Sf, edges with labels 'r';
from S0 to S3, the edge with tag't' (one jump, meaning that it can be transferred directly from S_0 to S3 without going through S1 and S2);
From S3 to S4, edges with labels 'n';
from S4 to Sf, edges with labels 'n';
From S0 to S1, S1 to S0, S1 to S1, S2 to S2, and S2 to S0; these are edges with labels 'p', indicating that multiple transition paths exist between these states.
In this NFA, the ε-transition (i.e., empty-string transition) allows the automaton to move from Si to S0 without consuming any input. The characters (p, a, t, r, n) on the other edges represent the possible state transitions when the corresponding input is received. Furthermore, the loops in FIG. 3 indicate that the automaton may stay in the same state when receiving a particular input in a given state.
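For later reference, the topology above can be transcribed directly into the conversion character dictionary used by the string-generation procedure. A minimal sketch in Python, where the edge set is an illustrative reading of FIG. 3 as described:

```python
# Transition dictionary for the NFA of p(at)*(r|n) in FIG. 3. Each state
# maps to a list of (next_state, transition_character) tuples; "eps"
# denotes the empty-string transition.
nfa_transitions = {
    "Si": [("S0", "eps")],                           # epsilon edge Si -> S0
    "S0": [("S1", "p"), ("S3", "t")],                # 't' is the one-jump edge
    "S1": [("S2", "a"), ("S0", "p"), ("S1", "p")],   # extra 'p' edges as described
    "S2": [("S3", "t"), ("S2", "p"), ("S0", "p")],
    "S3": [("Sf", "r"), ("S4", "n")],
    "S4": [("Sf", "n")],
    "Sf": [],                                        # final accepting state
}
```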
Step 202, constructing a training sample set based on the state topological structure, wherein training data in the training sample set comprises a center word, a background word and a noise word;
Step 203, training the self-supervision learning model based on the training sample set;
step 204, inputting the non-deterministic finite automaton into a trained self-supervision learning model, and obtaining a characterization vector corresponding to the non-deterministic finite automaton;
step 205, calculating the similarity between each characterization vector, and obtaining the character/character string with the strongest correlation in the non-deterministic finite automaton;
step 206, acquiring the character/character string with the strongest correlation in the regular expression corresponding to the non-deterministic finite automaton;
And 207, judging that the non-deterministic finite automaton forms a matching relation with the regular expression represented by the non-deterministic finite automaton when the character/character string with the strongest correlation in the non-deterministic finite automaton is consistent with the character/character string with the strongest correlation in the regular expression.
Based on any of the above embodiments, constructing a training sample set based on the state topology includes:
step 2021, generating a plurality of character strings based on the state topology;
In an embodiment of the present invention, generating a plurality of character strings based on the state topology includes:
Creating an initial character string, starting with a special character, and initializing a current state as a starting state of the state topological structure;
establishing a conversion character dictionary based on the state topological structure, wherein the conversion character dictionary is used for representing the next state to which each state can be transferred and corresponding conversion characters of the next state;
Simulating state conversion based on the converted characters until a final state is reached;
a string is generated based on all characters traversed in the state transition process, the string comprising a sequence of transition characters from a starting state to a final state.
In an embodiment of the present invention, generating a plurality of character strings includes a plurality of stages:
an initialization stage:
Creating an initial character string that begins with the special character ε, indicating a null transition from the initial state. The current state is set to s0, which is the initial state of the NFA.
Defining a state transition phase:
A dictionary is built to represent the next state to which each state can be transferred and its corresponding transition character.
For example, state s0 may transition to either s0 or s1 on the character p, or to s3 on the character t.
Analog state transition phase:
When the current state is not the final state sf, the following steps are performed:
looking at the possible transitions of the current state, this is a list containing (next state, transition character) tuples.
If the current state does not have any transition paths (list is empty), the algorithm ends.
A tuple is randomly selected from the possible transitions, this selection simulating the non-deterministic characteristics of the NFA. The selected conversion character is added to the initialized character string. The current state is updated to be the next state of the selected tuple.
A character string generation stage:
repeating the above steps until the final state sf is reached.
Returning to the generated string, wherein the string comprises a transition sequence from the initial state to the final state.
Generating a plurality of samples:
The above steps are run multiple times using one cycle to generate multiple string samples.
Each string starts with the null character (ε) and corresponds to one path relationship on the NFA graph, characterizing that sequence of state transitions.
In the embodiment of the invention, based on the conversion character, the state conversion is simulated until the final state is reached, and the method comprises the following steps:
When the current state is not the final state, acquiring a possible conversion path of the current state according to the conversion character dictionary;
randomly selecting one path from possible conversion paths, and adding conversion characters corresponding to the selected path into the initialized character string;
The current state is updated to the next state in the selected transition character dictionary until the final state is reached.
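A minimal sketch of this random-walk generation, assuming the `nfa_transitions` dictionary from the sketch above; the `max_steps` safety bound is an added assumption:

```python
import random

def generate_string(transitions, start="Si", final="Sf", max_steps=100):
    """Simulate NFA state transitions and collect the traversed
    transition characters, as in the stages described above."""
    chars = ["ε"]                # the string starts with the special character
    state = start
    for _ in range(max_steps):   # safety bound (assumption)
        if state == final:
            break
        paths = transitions.get(state, [])
        if not paths:            # no outgoing transition: the walk ends
            break
        state, char = random.choice(paths)   # random pick simulates non-determinism
        if char != "eps":
            chars.append(char)
    return "".join(chars)

# Run the walk repeatedly to build a corpus of string samples.
corpus = [generate_string(nfa_transitions) for _ in range(1000)]
```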
Step 2022, generating a corpus based on the plurality of character strings;
And 2023, extracting the center word, the background word and the noise word from the corpus to obtain a training sample set.
In the embodiment of the invention, each character string in the corpus is regarded as a sentence, each character in the sentence is regarded as a word, and the extraction of the center word and the background word from the corpus comprises the following steps:
calculating the occurrence frequency of each word in the corpus;
In the embodiment of the invention, the occurrence frequency of different characters is counted from the regular expression sample data; tokens whose occurrence frequency is higher than 70% are marked as high-frequency words and yield positive samples, and the others are marked as negative samples. Valuable information is thus extracted from the large-scale test information and the state machine model, so regular-expression acceleration development and testing can proceed more quickly.
Filtering out words with low occurrence frequency based on the occurrence frequency of each word and a preset high-frequency word threshold value, and constructing a vocabulary based on the residual words;
Traversing each word in the vocabulary, taking each word as a central word, randomly selecting a window size, determining the number of background words of each central word according to the window size, selecting words around each central word as the background words corresponding to each central word according to the number of the background words, and taking the central words and the corresponding background words as training positive samples.
The embodiment of the invention ensures that each center word forms an effective correspondence with its background words, while following the patterns of the original regular expression.
And, a corresponding index map, i.e., a mapping relationship from word to index and from index to word, is established for the vocabulary.
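A minimal sketch of the vocabulary construction and positive-sample extraction, treating each generated string as a sentence of single-character words; the `min_freq` and `max_window` parameters and function names are illustrative assumptions:

```python
import collections
import random

def build_vocab(corpus, min_freq=2):
    """Count each word's frequency, filter out rare words, and build the
    word->index and index->word maps mentioned above."""
    counter = collections.Counter(ch for sentence in corpus for ch in sentence)
    words = [w for w, c in counter.items() if c >= min_freq]
    word2idx = {w: i for i, w in enumerate(words)}
    idx2word = {i: w for w, i in word2idx.items()}
    return words, word2idx, idx2word, counter

def get_centers_and_contexts(sentences, max_window=2):
    """Traverse each word as a center word, pick a random window size, and
    collect the surrounding words as its background words (positive samples)."""
    centers, contexts = [], []
    for sentence in sentences:
        if len(sentence) < 2:            # need at least one context word
            continue
        for i in range(len(sentence)):
            window = random.randint(1, max_window)
            indices = [j for j in range(max(0, i - window),
                                        min(len(sentence), i + 1 + window))
                       if j != i]
            centers.append(sentence[i])
            contexts.append([sentence[j] for j in indices])
    return centers, contexts
```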
In some embodiments of the present invention, after constructing the vocabulary based on the remaining words, further comprising:
and carrying out secondary random sampling on the vocabulary, obtaining the occurrence frequency and the total word number of each word in the vocabulary, and screening the words in the vocabulary according to the preset occurrence frequency requirement and the total word number requirement to obtain a final vocabulary.
Each sentence in the corpus is converted into an index list form. To better train the word embedding model, the dataset is sub-sampled by a random process that considers the frequency of words and the total number of words to decide whether to retain a particular word.
The method provided by the embodiment of the invention not only balances the high-frequency words and the low-frequency words, but also can effectively construct a training set for further natural language processing tasks from the data generated by the NFA.
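A minimal sketch of this secondary random sampling. The text specifies only that word frequency and total word count are considered; the keep probability min(1, sqrt(t / f(w))) below is the common word2vec rule, used here as an assumption:

```python
import math
import random

def subsample(sentences, counter, t=1e-4):
    """Secondary random sampling: decide per occurrence whether to keep a
    word, based on its frequency and the total word count. The keep rule
    is the word2vec-style heuristic (an assumption, not from the text)."""
    total = sum(counter.values())
    def keep(word):
        freq = counter[word] / total
        return random.random() < min(1.0, math.sqrt(t / freq))
    return [[w for w in sentence if keep(w)] for sentence in sentences]
```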
In an embodiment of the present invention, extracting noise words from the corpus includes:
selecting words with occurrence frequency lower than a preset low-frequency word threshold from the corpus, wherein the low-frequency word threshold is three-fourths of the high-frequency word threshold;
Constructing a noise distribution based on the words with the occurrence frequency lower than a preset low-frequency word threshold value;
Normalizing the noise distribution, and unifying words with different occurrence frequencies to the same occurrence frequency;
Randomly extracting a plurality of words meeting the noise quantity requirement by using the normalized noise distribution, and adjusting the number of the extracted words according to the model calculation requirement and calculation resources;
judging whether the extracted word is a background word corresponding to the central word, if so, discarding the extracted word; otherwise, taking the extracted words as noise words corresponding to the central words until the number of the noise words meets the noise number requirement;
And traversing all the central words in the corpus, and generating a group of noise words for each central word as a training negative sample.
In constructing the negative sampling method for word embedding training, a set of noise words is randomly selected for each center word while ensuring that the noise words are not identical to that center word's background words. This can improve the quality of the word embedding model, because the model learns not only which words usually appear together (positive samples), but also which words are unlikely to appear together (negative samples).
In this training optimization method, which extracts noise words for each center word (a noise word cannot be a background word), introducing negative sampling reduces the training cost and widens the application range of the model.
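A minimal sketch of this noise-word extraction. The 0.75 exponent used to flatten the distribution is the usual word2vec choice and is an assumption; the text requires only a normalized noise distribution and the rejection of background words:

```python
import random

def get_negatives(all_contexts, words, counter, k=5, power=0.75):
    """Draw k noise words per center word from a normalized noise
    distribution, discarding any draw that is a background word of that
    center word."""
    weights = [counter[w] ** power for w in words]   # flattened frequencies
    all_negatives = []
    for contexts in all_contexts:
        negatives = []
        while len(negatives) < k:
            cand = random.choices(words, weights=weights, k=1)[0]
            if cand not in contexts:   # a noise word must not be a background word
                negatives.append(cand)
        all_negatives.append(negatives)
    return all_negatives
```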
In the embodiment of the invention, extracting a center word, a background word and a noise word from the corpus to obtain a training sample set comprises the following steps:
Receiving the center word, the background word and the noise word through a batch processing function;
In PyTorch, a custom dataset is typically created by inheriting torch.utils.data.Dataset and implementing the __init__, __getitem__, and __len__ methods; this dataset encapsulates the center words, background words, and negative-sample (noise) words.
Adding positive sample labels for the center words and the corresponding background words thereof and storing the positive sample labels in a positive sample data list, and adding negative sample labels for the center words and the corresponding noise words thereof and storing the negative sample labels in a negative sample data list;
Unifying list lengths of the positive sample data list and the negative sample data list, and filling elements after a shorter data list;
in the embodiment of the invention, padding elements (such as 0) are appended to the shorter list;
Assigning different mask values to normal elements and padding elements in each data list;
In the valid data position the mask value is 1 and in the fill position the mask value is 0.
A training sample set is generated based on the positive sample data list and the negative sample data list.
In this way it is ensured that the data of each batch is correctly processed and formatted, providing a suitable input for model training. This flexibility is an important advantage of PyTorch, allowing customization of the data processing flow to specific application requirements.
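A minimal sketch of such a batch-processing (collate) function, assuming words have already been mapped to integer indices and index 0 is reserved for padding:

```python
import torch

def batchify(data):
    """Collate function: receives (center, contexts, negatives) triples of
    integer word indices, pads the shorter lists to a common length, and
    assigns mask 1 to valid positions and 0 to padding, as described above."""
    max_len = max(len(c) + len(n) for _, c, n in data)
    centers, contexts_negatives, masks, labels = [], [], [], []
    for center, contexts, negatives in data:
        cur_len = len(contexts) + len(negatives)
        centers.append(center)
        contexts_negatives.append(contexts + negatives + [0] * (max_len - cur_len))
        masks.append([1] * cur_len + [0] * (max_len - cur_len))        # mask padding
        labels.append([1] * len(contexts) + [0] * (max_len - len(contexts)))  # 1 = positive
    return (torch.tensor(centers).reshape(-1, 1),
            torch.tensor(contexts_negatives),
            torch.tensor(masks),
            torch.tensor(labels))
```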
Based on any of the above embodiments, the training the self-supervised learning model based on the training sample set includes:
Step 2031, reading the positive sample data of the positive sample data list and the negative sample data of the negative sample data list in batches;
Step 2032, for each read lot, inputting the positive or negative sample data into a self-supervised learning model to perform forward computation;
Step 2033, calculating a predicted loss based on the forward calculation output result and the loss function;
and 2034, carrying out back propagation through the prediction loss, and updating model parameters until the training ending condition is met, so as to obtain a trained self-supervision learning model.
In the embodiment of the present invention, reading the positive sample data of the positive sample data list and the negative sample data of the negative sample data list in batches includes:
And transmitting the batch processing function as a parameter to a data loader, and reading the positive sample data of the positive sample data list and the negative sample data in the negative sample data list in batches by the data loader.
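A minimal sketch of the custom dataset and data loader described above; the class name and batch size are illustrative, and `centers_idx`/`contexts_idx`/`negatives_idx` are assumed to be the index-mapped outputs of the earlier sketches:

```python
from torch.utils.data import Dataset, DataLoader

class NfaWordDataset(Dataset):
    """Custom dataset encapsulating center, background, and noise words
    (as integer indices); the class name is illustrative."""
    def __init__(self, centers, contexts, negatives):
        self.centers = centers
        self.contexts = contexts
        self.negatives = negatives

    def __getitem__(self, index):
        return self.centers[index], self.contexts[index], self.negatives[index]

    def __len__(self):
        return len(self.centers)

# The batch-processing function is passed to the data loader as collate_fn.
dataset = NfaWordDataset(centers_idx, contexts_idx, negatives_idx)
loader = DataLoader(dataset, batch_size=512, shuffle=True, collate_fn=batchify)
```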
In the embodiment of the invention, the self-supervised learning model comprises a word-hopping (skip-gram) model, wherein the word-hopping model comprises a first key embedding layer and a second key embedding layer; the self-supervised learning model performs forward computation, including:
converting the center word into a center word vector through the first key embedding layer, and converting the background word and the noise word into a background word vector and a noise word vector through the second key embedding layer;
estimating the similarity between the central word vector and the background word vector or the noise word vector;
and outputting positive and negative sample classification results corresponding to the currently input training data based on the similarity.
In the embodiment of the invention, a word-hopping (skip-gram) model class is created, inheriting from torch.nn.Module. Forward computation is performed with the skip-gram model on training data generated by the non-deterministic finite automaton, using a binary cross entropy loss function, so as to obtain a vector representation of each center word. The specific method is as follows: two key embedding layers are defined, one specific to the center word and the other for processing the background and noise words; these two embedding layers are responsible for converting words into dense vector representations. The forward propagation method of the model is designed to receive the center word, background words, and noise words as inputs and compute the corresponding embedded vectors. In the forward computation, vector representations of the center word and the background/noise words are first obtained through the previously defined embedding layers. These vectors are high-dimensional representations of words that capture word-to-word semantic and grammatical relations. Then, the dot product between these vectors is computed to estimate the similarity between them. This step is critical because it allows the model to understand the relationships between different words. For training, a binary cross entropy loss function is used, which is well suited to the two-class problem of accurately distinguishing positive samples (combinations of a center word and a background word) from negative samples (combinations of a center word and a noise word). By minimizing this loss function, the model learns to distinguish related from unrelated word pairs and thereby better understand the relationships between words. During the training phase, the previously defined data loader provides training data in batches, making the training process more efficient.
For each data batch, a forward calculation of the model is performed, and the loss calculated later is used to guide the learning of the model. By back-propagation of the loss, the parameters of the model are updated, enabling the model to more accurately represent and distinguish relationships between words.
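A minimal sketch of such a word-hopping (skip-gram) module under the description above; the class and parameter names (e.g., `WordHoppingModel`, `embed_dim`) are illustrative assumptions:

```python
import torch
from torch import nn

class WordHoppingModel(nn.Module):
    """Skip-gram model with two key embedding layers: one for center
    words, one shared by background and noise words."""
    def __init__(self, vocab_size, embed_dim=64):
        super().__init__()
        self.center_embed = nn.Embedding(vocab_size, embed_dim)    # first embedding layer
        self.context_embed = nn.Embedding(vocab_size, embed_dim)   # second embedding layer

    def forward(self, centers, contexts_negatives):
        v = self.center_embed(centers)                 # (batch, 1, dim)
        u = self.context_embed(contexts_negatives)     # (batch, len, dim)
        # Dot product between the center vector and each context/noise
        # vector serves as the similarity estimate (classification logits).
        return torch.bmm(v, u.permute(0, 2, 1)).squeeze(1)   # (batch, len)
```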
In the embodiment of the invention, estimating the similarity between the center word vector and the background word vector or the noise word vector comprises the following steps:
And respectively carrying out dot product operation on the central word vector and the background word vector or the noise word vector, and taking a dot product operation result as the similarity between the central word vector and the background word vector or the noise word vector.
In the embodiment of the present invention, the loss function is a binary cross entropy loss function, and the calculating the predicted loss based on the forward calculation output result and the loss function includes:
and calculating the prediction loss between the positive and negative sample classification result corresponding to the input training data and the positive and negative sample label corresponding to the input training data based on the binary cross entropy loss function.
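A minimal training-step sketch combining the data loader, the forward computation, and the masked binary cross entropy loss; the optimizer choice, learning rate, and epoch count are illustrative assumptions, and `words`/`loader` come from the earlier sketches:

```python
import torch
from torch import nn

model = WordHoppingModel(vocab_size=len(words))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
bce = nn.BCEWithLogitsLoss(reduction="none")    # binary cross entropy on logits

for epoch in range(5):                          # illustrative epoch count
    for centers, contexts_negatives, masks, labels in loader:
        logits = model(centers, contexts_negatives)
        # Per-element loss; padding positions are zeroed by the mask and
        # each row is normalized by its number of valid positions.
        loss = (bce(logits, labels.float()) * masks).sum(dim=1) / masks.sum(dim=1)
        optimizer.zero_grad()
        loss.mean().backward()                  # back-propagate the prediction loss
        optimizer.step()                        # update model parameters
```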
Based on any of the above embodiments, the calculating the similarity between each of the token vectors, and obtaining the character/character string with the strongest correlation in the non-deterministic finite automaton, includes:
Step 2051, obtaining cosine similarity between characterization vectors of each node in the non-deterministic finite automaton;
Step 2052, using the characters corresponding to the two characterization vectors with the maximum cosine similarity as the character with the strongest correlation in the non-deterministic finite automaton; or taking the characters corresponding to the plurality of characterization vectors with the cosine similarity larger than a preset threshold as the character string with the strongest correlation in the non-deterministic finite automaton.
This is based on calculating the cosine similarity between different word embedding vectors. Cosine similarity measures the included angle between two vectors in high-dimensional space and thereby judges how similar the two vectors are, in order to find the word most similar to a given word. By judging which words are most related to the center word, in combination with the learning method over character strings generated from the regular expression, it can be determined whether the generation of the strongly correlated state machine is accurate and whether the state machine reasoning is problematic.
In the embodiment of the invention, this correlation calculation method maps the correlation of the NFA edge structure onto the correlation of node representations, and therefore does not depend on the information of a specific state machine.
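A minimal sketch of this similarity search over the trained characterization vectors, assuming the `model` and `idx2word` objects from the earlier sketches:

```python
import torch

def strongest_correlation(embed_weight, idx2word):
    """Pairwise cosine similarity over the trained characterization
    vectors; returns the character pair with the largest similarity."""
    x = embed_weight / embed_weight.norm(dim=1, keepdim=True)  # unit vectors
    sim = x @ x.T                                              # cosine matrix
    sim.fill_diagonal_(-1.0)                                   # exclude self-pairs
    i, j = divmod(int(sim.argmax()), sim.shape[1])
    return idx2word[i], idx2word[j], float(sim[i, j])

pair = strongest_correlation(model.center_embed.weight.data, idx2word)
```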
Based on any one of the above embodiments, the obtaining the character/character string with the strongest correlation in the regular expression corresponding to the non-deterministic finite automaton includes:
Step 2061, calculating a first conditional probability of surrounding characters generated by characters in the regular expression, and acquiring word vectors of the regular expression corresponding to the non-deterministic finite automaton according to the first conditional probability; or calculating a second conditional probability of a corresponding character generated by surrounding characters of a certain character in the regular expression, and acquiring a word vector of the regular expression corresponding to the non-deterministic finite automaton according to the second conditional probability;
Step 2062, calculating the correlation between a plurality of word vectors based on cosine similarity, and screening out the character/character string with the strongest correlation in the regular expression corresponding to the non-deterministic finite automaton based on the correlation.
Based on any one of the above embodiments, the obtaining the center character vector of the regular expression corresponding to the non-deterministic finite automaton may be implemented by the following two schemes:
Scheme one: calculating a first conditional probability of surrounding characters generated by characters in the regular expression, and generating a center character vector of the regular expression corresponding to the non-deterministic finite automaton according to the first conditional probability;
Character strings meeting the requirements are generated based on the regular expression, such as "patr". Taking an arbitrary character as the center, the conditional probability of generating the peripheral characters in scheme one is, for example, P('p','a','r' | 't'). The probability distribution satisfies conditional independence, so the probability density can be rewritten as P('p' | 't') · P('a' | 't') · P('r' | 't'). In this model, each character has two vector representations for computing the corresponding probabilities. Specifically, the model uses vectors $s$ and $t$ to represent the two roles of one character as a center character and as a peripheral character, thereby estimating the corresponding probability distribution

$$P(o \mid c) = \frac{\exp(t_o^{\top} s_c)}{\sum_{i \in V} \exp(t_i^{\top} s_c)},$$

where the regular character index set $V = \{0, 1, \ldots, |V|-1\}$ corresponds to the characters and patterns of interest to the regular matching process. For a character sequence of length $T$ and window size $m$, the index corresponds to a probability density of

$$\prod_{t=1}^{T} \prod_{-m \le j \le m,\ j \neq 0} P\!\left(w^{(t+j)} \mid w^{(t)}\right).$$
Model parameters are estimated based on the likelihood function, using stochastic gradient descent to optimize the logarithm of the maximum-likelihood probability density; the gradient with respect to the center character vector is

$$\frac{\partial \log P(o \mid c)}{\partial s_c} = t_o - \sum_{j \in V} P(j \mid c)\, t_j,$$

that is, optimization computes the gradient of the loss function through the predicted probabilities of all upstream and downstream characters of the center character.
After training, the vector $s$ of each output character is used for correlation analysis: when the correlation between two characters is greater than a certain threshold (for example, greater than 0.95), it is judged whether the two characters connect to the same state in the state machine; if they do, the generation of the strongly correlated state machine is accurate, and otherwise the state machine reasoning is problematic.
Scheme II: calculating second conditional probability of a corresponding character generated by surrounding characters of a certain character in the regular expression, and acquiring a central character vector of the regular expression corresponding to the non-deterministic finite automaton according to the second conditional probability.
In the embodiment of the present invention, the central character is predicted from the surrounding upstream and downstream characters. Taking "patr" as an example with an arbitrary character as the center, the conditional probability of 't' given 'p', 'a', 'r' is P('t' | 'p', 'a', 'r'); in general, the conditional probability of the central character given the upstream and downstream regular characters can be written as

$$P(c \mid o_1, \ldots, o_{2m}) = \frac{\exp\!\left(\frac{1}{2m}\, s_c^{\top} (t_{o_1} + \cdots + t_{o_{2m}})\right)}{\sum_{i \in V} \exp\!\left(\frac{1}{2m}\, s_i^{\top} (t_{o_1} + \cdots + t_{o_{2m}})\right)}.$$
Similar to scheme one, the vectors $s$ and $t$ represent the two roles of one character in the regular expression as a center character and as a peripheral character. $W_o = \{o_1, \ldots, o_K\}$ is a subset of characters of size $K$ whose members are contextual characters of the center character $c$, and the character index set $V$ corresponds to the characters or patterns in the regular matching relation. Based on this, the maximum likelihood probability for model two is

$$\prod_{t=1}^{T} P\!\left(w^{(t)} \mid w^{(t-m)}, \ldots, w^{(t-1)}, w^{(t+1)}, \ldots, w^{(t+m)}\right).$$
Model parameters are estimated based on the likelihood function. After training finishes, the vector $t$ of each output character is used for correlation analysis. When the correlation between two characters is greater than a certain threshold (e.g., greater than 0.95), it is judged whether the two characters connect to the same state in the state machine. For example, the characters 't' and 'n' are connected through the state S4, which proves that the generation of the strongly correlated state machine is accurate; otherwise the state machine reasoning is problematic.
In the embodiment of the invention, the character/character string with the strongest correlation in the non-deterministic finite automaton being consistent with the character/character string with the strongest correlation in the regular expression means that two words whose correlation is greater than 0.95 can be judged to be identical.
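A minimal sketch of this consistency judgment, assuming each side supplies its strongest-correlated pair as `(char_a, char_b, correlation)` (the tuple layout is an assumption matching the earlier similarity sketch):

```python
def matching_relation(nfa_pair, regex_pair, threshold=0.95):
    """Judge whether the NFA forms a matching relation with the regular
    expression it represents: the strongest-correlated pairs from both
    sides must coincide and both correlations must exceed the threshold."""
    a1, b1, corr1 = nfa_pair
    a2, b2, corr2 = regex_pair
    return corr1 > threshold and corr2 > threshold and {a1, b1} == {a2, b2}
```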
According to the heterogeneous acceleration method based on self-supervised learning provided by the embodiment of the invention, by introducing a method based on self-supervised learning, a regular acceleration scheme based on the FPGA can be rapidly developed and tested, which improves computational performance, resolves performance bottlenecks, and raises the processing efficiency of flexible service cases. By using the state machine graph model, the regular rule can be represented more effectively, so that more efficient regular expression matching is realized on the FPGA. This addresses the computational-performance problem the related art may face when processing a large number of regular expressions, and is also expected to provide a more powerful tool for fields such as network security, improving the real-time detection of threats. Secondly, by adopting the self-supervised learning method, the system can more intelligently judge whether the state machine accurately represents the regular rule, thereby improving the development efficiency of the regular acceleration scheme. This not only reduces the developer's burden in FPGA programming, but is also expected to promote wider application of regular expression matching technology in various fields. By providing two correlation-based schemes, efficient means of verification and development are provided while maintaining flexibility. This is of great significance for fields such as OLAP business data services, where a large number of flexible cases exist.
The self-supervised learning-based heterogeneous acceleration device provided by the invention is described below; the device described below and the self-supervised learning-based heterogeneous acceleration method described above may be referred to in correspondence with each other.
Fig. 4 is a functional structural schematic diagram of a heterogeneous acceleration device based on self-supervised learning according to an embodiment of the present invention, as shown in fig. 4, where the heterogeneous acceleration device based on self-supervised learning according to an embodiment of the present invention includes:
a generating module 401, configured to obtain a data control flow through a local hardware device, and generate a non-deterministic finite automaton according to a generated regular expression, where the non-deterministic finite automaton is used to characterize the regular expression, so as to analyze and filter the data control flow;
an analysis module 402, configured to receive, through a heterogeneous device, the data control flow and the non-deterministic finite automaton, analyze the non-deterministic finite automaton based on a self-supervised learning model, and configure the non-deterministic finite automaton to a corresponding regular engine when the non-deterministic finite automaton is in a matching relationship with a regular expression represented by the non-deterministic finite automaton, and perform parallel analysis and filtering on the data control flow;
the self-supervision learning model is obtained by training based on training data comprising a center word, a background word and a noise word.
According to the heterogeneous acceleration device based on self-supervision learning, a local hardware device is used for acquiring a data control flow, and a non-deterministic finite automaton is generated according to a generated regular expression, the non-deterministic finite automaton being used for representing the regular expression so as to analyze and filter the data control flow; the data control flow and the non-deterministic finite automaton are received through a heterogeneous device, the non-deterministic finite automaton is analyzed based on a self-supervision learning model, and when the non-deterministic finite automaton is in a matching relation with the regular expression it represents, it is configured to a corresponding regular engine to analyze and filter the data control flow in parallel; the self-supervision learning model is obtained by training based on training data containing a center word, a background word and a noise word, and can effectively verify whether the non-deterministic finite automaton represents the regular rule, so that more efficient regular expression matching is realized on the FPGA, more developers can easily use the FPGA to develop acceleration applications, and the development of the field of hardware acceleration based on regular expression matching is promoted.
In an embodiment of the invention, the analysis module 402 is configured to:
Acquiring a state topological structure of the non-deterministic finite automaton;
Constructing a training sample set based on the state topological structure, wherein training data in the training sample set comprises a center word, a background word and a noise word;
training the self-supervision learning model based on the training sample set;
Inputting the non-deterministic finite automaton into a trained self-supervision learning model, and obtaining a characterization vector corresponding to the non-deterministic finite automaton;
Calculating the similarity between each characterization vector, and acquiring the character/character string with the strongest correlation in the non-deterministic finite automaton;
Acquiring characters/character strings with strongest correlation in the regular expression corresponding to the non-deterministic finite automaton;
And when the character/character string with the strongest correlation in the non-deterministic finite automaton is consistent with the character/character string with the strongest correlation in the regular expression, judging that the non-deterministic finite automaton forms a matching relation with the represented regular expression.
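By way of illustration, the verification flow above can be condensed into a short sketch; `nfa_vectors` and `regex_vectors` are assumed to be dictionaries mapping characters to the characterization vectors produced by the trained models, and the function names are hypothetical (the 0.95 threshold follows the embodiment above):

```python
import numpy as np

def strongest_pair(vectors):
    """Return (char_a, char_b, cosine) for the most strongly correlated pair of vectors."""
    keys = list(vectors)
    best = (None, None, -1.0)
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            va, vb = vectors[a], vectors[b]
            sim = float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
            if sim > best[2]:
                best = (a, b, sim)
    return best

def nfa_matches_regex(nfa_vectors, regex_vectors, threshold=0.95):
    """Judge the matching relation: both sides must agree on the strongest pair."""
    na, nb, nsim = strongest_pair(nfa_vectors)
    ra, rb, rsim = strongest_pair(regex_vectors)
    return {na, nb} == {ra, rb} and min(nsim, rsim) > threshold
```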
In an embodiment of the present invention, the constructing a training sample set based on the state topology includes:
Generating a plurality of character strings based on the state topology;
Generating a corpus based on the plurality of character strings;
and extracting the center word, the background word and the noise word from the corpus to obtain a training sample set.
In an embodiment of the present invention, the generating a plurality of character strings based on the state topology includes:
Creating an initial character string, starting with a special character, and initializing a current state as a starting state of the state topological structure;
establishing a conversion character dictionary based on the state topological structure, wherein the conversion character dictionary is used for representing the next state to which each state can be transferred and corresponding conversion characters of the next state;
Simulating state conversion based on the converted characters until a final state is reached;
a string is generated based on all characters traversed in the state transition process, the string comprising a sequence of transition characters from a starting state to a final state.
In the embodiment of the invention, the simulating state transitions based on the transition characters until the final state is reached comprises the following steps:
When the current state is not the final state, acquiring a possible conversion path of the current state according to the conversion character dictionary;
randomly selecting one path from possible conversion paths, and adding conversion characters corresponding to the selected path into the initialized character string;
The current state is updated to the next state in the selected transition character dictionary until the final state is reached.
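A minimal sketch of this string-generation procedure, assuming the state topology has already been flattened into a transition dictionary mapping each state to its outgoing (next state, transition character) pairs; the start symbol '^', the state names and the example topology (for a pattern like 'ab*c') are illustrative:

```python
import random

def generate_string(transitions, start_state, final_states, start_char="^"):
    """Simulate one random walk over the state topology and record the characters."""
    chars = [start_char]                   # the initial string starts with a special character
    state = start_state                    # current state initialized to the start state
    while state not in final_states:       # simulate transitions until a final state
        paths = transitions[state]         # possible (next_state, char) transition paths
        state, ch = random.choice(paths)   # randomly select one path
        chars.append(ch)                   # append its transition character
    return "".join(chars)

# Hypothetical topology: S0 -a-> S1, S1 -b-> S1, S1 -c-> S2 (final)
transitions = {"S0": [("S1", "a")], "S1": [("S1", "b"), ("S2", "c")]}
corpus = [generate_string(transitions, "S0", {"S2"}) for _ in range(1000)]
```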
In the embodiment of the present invention, each character string in the corpus is regarded as a sentence, each character in the sentence is regarded as a word, and the extracting the center word and the background word from the corpus includes:
calculating the occurrence frequency of each word in the corpus;
filtering out words with low occurrence frequency based on the occurrence frequency of each word and a preset high-frequency word threshold, and constructing a vocabulary based on the remaining words;
Traversing each word in the vocabulary, taking each word as a central word, randomly selecting a window size, determining the number of background words of each central word according to the window size, selecting words around each central word as the background words corresponding to each central word according to the number of the background words, and taking the central words and the corresponding background words as training positive samples.
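A sketch of the positive-sample extraction under these steps; the corpus is assumed to be the list of generated strings, and `min_count` (standing in for the high-frequency word threshold) and `max_window` are hypothetical parameters:

```python
import random
from collections import Counter

def positive_samples(corpus, min_count=5, max_window=5):
    """Pair each center character with the background characters in a random window."""
    freq = Counter(ch for sentence in corpus for ch in sentence)
    vocab = {ch for ch, n in freq.items() if n >= min_count}   # filter low-frequency words
    samples = []
    for sentence in corpus:
        chars = [ch for ch in sentence if ch in vocab]
        for i, center in enumerate(chars):
            w = random.randint(1, max_window)                  # randomly selected window size
            background = [chars[j]
                          for j in range(max(0, i - w), min(len(chars), i + w + 1))
                          if j != i]                           # surrounding background words
            samples.append((center, background))               # one training positive sample
    return samples, freq
```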
In the embodiment of the invention, after the vocabulary is constructed based on the remaining words, the method further comprises the following steps:
and performing secondary random sampling on the vocabulary, obtaining the occurrence frequency of each word in the vocabulary and the total word count, and screening the words in the vocabulary according to a preset occurrence-frequency requirement and a total-word-count requirement to obtain the final vocabulary.
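The secondary random sampling can be realized in the style of word2vec subsampling; the keep rule below (retain a word with probability min(1, sqrt(t / f)), where f is the word's relative frequency) is an assumption, since the embodiment fixes only the inputs, namely the per-word occurrence frequency and the total word count:

```python
import math
import random

def subsample(samples, freq, total_count, t=1e-4):
    """Randomly drop very frequent words; rare words are always kept."""
    def keep(word):
        f = freq[word] / total_count                        # relative occurrence frequency
        return random.random() < min(1.0, math.sqrt(t / f)) # assumed keep probability
    return [(c, [b for b in bg if keep(b)])
            for c, bg in samples if keep(c)]
```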
In an embodiment of the present invention, extracting noise words from the corpus includes:
selecting words with occurrence frequency lower than a preset low-frequency word threshold from the corpus, wherein the low-frequency word threshold is three-fourths of the high-frequency word threshold;
Constructing a noise distribution based on the words with the occurrence frequency lower than a preset low-frequency word threshold value;
Normalizing the noise distribution, and unifying words with different occurrence frequencies to the same occurrence frequency;
Randomly extracting a plurality of words meeting the noise quantity requirement by using the normalized noise distribution, and adjusting the number of the extracted words according to the model calculation requirement and calculation resources;
judging whether the extracted word is a background word corresponding to the central word, if so, discarding the extracted word; otherwise, taking the extracted words as noise words corresponding to the central words until the number of the noise words meets the noise number requirement;
And traversing all the central words in the corpus, and generating a group of noise words for each central word as a training negative sample.
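A sketch of the noise-word generation; reading "unifying words with different occurrence frequencies to the same occurrence frequency" as drawing every low-frequency candidate with equal probability is our assumption, and `high_thresh` and `k` are hypothetical parameters:

```python
import random

def noise_words(center, background, freq, high_thresh, k=5):
    """Draw k noise words for one center word, rejecting its background words."""
    low_thresh = 0.75 * high_thresh        # three-fourths of the high-frequency threshold
    candidates = [w for w, n in freq.items() if n < low_thresh]
    noise = []
    while len(noise) < k:                  # until the noise-count requirement is met
        w = random.choice(candidates)      # normalized draw: all candidates equally likely
        if w == center or w in background:
            continue                       # discard hits on the center/background words
        noise.append(w)
    return noise
```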
In the embodiment of the present invention, the extracting the center word, the background word and the noise word from the corpus to obtain the training sample set includes:
Receiving the center word, the background word and the noise word through a batch processing function;
Adding positive sample labels for the center words and the corresponding background words thereof and storing the positive sample labels in a positive sample data list, and adding negative sample labels for the center words and the corresponding noise words thereof and storing the negative sample labels in a negative sample data list;
Unifying the list lengths of the positive sample data list and the negative sample data list by padding elements at the end of the shorter data list;
Assigning different mask values to normal elements and padding elements in each data list;
A training sample set is generated based on the positive sample data list and the negative sample data list.
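One possible form of the batch-processing function, assuming words have already been mapped to integer indices; it pads every row to the longest background-plus-noise length, assigns mask value 1 to normal elements and 0 to padding elements, and labels positive (background) versus negative (noise) positions. The pad value 0 and the tensor layout are assumptions:

```python
import torch

def batchify(batch):
    """Collate (center, backgrounds, noises) triples into padded tensors with masks."""
    max_len = max(len(bg) + len(ns) for _, bg, ns in batch)
    centers, contexts, masks, labels = [], [], [], []
    for center, bg, ns in batch:
        row = bg + ns
        pad = max_len - len(row)
        centers.append(center)
        contexts.append(row + [0] * pad)                       # fill after the shorter list
        masks.append([1] * len(row) + [0] * pad)               # 1 = normal, 0 = padding
        labels.append([1] * len(bg) + [0] * (len(ns) + pad))   # positive vs negative labels
    return (torch.tensor(centers), torch.tensor(contexts),
            torch.tensor(masks), torch.tensor(labels))
```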
In an embodiment of the present invention, the training the self-supervised learning model based on the training sample set includes:
Reading the positive sample data of the positive sample data list and the negative sample data in the negative sample data list in batches;
for each read lot, inputting the positive or negative sample data into a self-supervised learning model to perform forward computation;
calculating a predicted loss based on the forward calculation output result and the loss function;
And carrying out back propagation through the prediction loss, and updating model parameters until the training ending condition is met, so as to obtain a trained self-supervision learning model.
In the embodiment of the present invention, the batch reading of the positive sample data in the positive sample data list and the negative sample data in the negative sample data list comprises:
And transmitting the batch processing function as a parameter to a data loader, and reading the positive sample data of the positive sample data list and the negative sample data in the negative sample data list in batches by the data loader.
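A sketch of the batched training loop; the batch-processing function `batchify` from the earlier sketch is passed to a PyTorch `DataLoader` as its `collate_fn` parameter, and a model with the signature `model(centers, contexts)` (such as the skip-gram sketch below) is assumed; the epoch count, learning rate and batch size are illustrative:

```python
import torch
from torch.utils.data import DataLoader

def train(model, dataset, epochs=5, lr=0.01, batch_size=256):
    """Mini-batch training with masked binary cross-entropy."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True,
                        collate_fn=batchify)           # batch function passed as a parameter
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.BCEWithLogitsLoss(reduction="none")
    for _ in range(epochs):
        for centers, contexts, masks, labels in loader:
            logits = model(centers, contexts)          # forward computation
            loss = loss_fn(logits, labels.float())     # per-element prediction loss
            loss = (loss * masks).sum() / masks.sum()  # padding elements are masked out
            optim.zero_grad()
            loss.backward()                            # back propagation
            optim.step()                               # update model parameters
    return model
```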
In the embodiment of the invention, the self-supervision learning model comprises a word-hopping (skip-gram) model, wherein the word-hopping model comprises a first key embedding layer and a second key embedding layer; the self-supervised learning model performs forward computation, including:
converting the center word into a center word vector through the first key embedding layer, and converting the background word and the noise word into a background word vector and a noise word vector through the second key embedding layer;
estimating the similarity between the central word vector and the background word vector or the noise word vector;
and outputting positive and negative sample classification results corresponding to the currently input training data based on the similarity.
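A minimal sketch of the word-hopping (skip-gram) model with its two key embedding layers; the class name and `embed_dim` are assumptions:

```python
import torch
import torch.nn as nn

class SkipGramModel(nn.Module):
    """Two embedding tables: one for center words, one for background/noise words."""

    def __init__(self, vocab_size: int, embed_dim: int = 100):
        super().__init__()
        self.center_embed = nn.Embedding(vocab_size, embed_dim)   # first key embedding layer
        self.context_embed = nn.Embedding(vocab_size, embed_dim)  # second key embedding layer

    def forward(self, centers: torch.Tensor, contexts: torch.Tensor) -> torch.Tensor:
        # centers: (batch,); contexts: (batch, L) background and noise word indices
        c = self.center_embed(centers).unsqueeze(1)    # (batch, 1, dim) center word vectors
        o = self.context_embed(contexts)               # (batch, L, dim) context/noise vectors
        return (c * o).sum(dim=-1)                     # dot products as similarity estimates
```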
In an embodiment of the present invention, the estimating the similarity between the center word vector and the background word vector or the noise word vector includes:
And respectively carrying out dot product operation on the central word vector and the background word vector or the noise word vector, and taking a dot product operation result as the similarity between the central word vector and the background word vector or the noise word vector.
In an embodiment of the present invention, the loss function is a binary cross entropy loss function, and the calculating the predicted loss based on the forward calculation output result and the loss function includes:
and calculating the prediction loss between the positive and negative sample classification result corresponding to the input training data and the positive and negative sample label corresponding to the input training data based on the binary cross entropy loss function.
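Written out, with $x_i$ the similarity logit, $\sigma$ the sigmoid function, $y_i \in \{0, 1\}$ the positive/negative sample label and $m_i$ the mask value (1 for normal elements, 0 for padding), the masked binary cross-entropy loss takes the standard form:

$$\mathcal{L} = -\frac{1}{\sum_i m_i} \sum_i m_i \left[ y_i \log \sigma(x_i) + (1 - y_i) \log\bigl(1 - \sigma(x_i)\bigr) \right]$$

Normalizing by $\sum_i m_i$ (rather than the raw element count) is one common convention and is assumed here.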
In the embodiment of the present invention, the calculating the similarity between each of the token vectors, and obtaining the character/character string with the strongest correlation in the non-deterministic finite automaton, includes:
acquiring cosine similarity between characterization vectors of each node in the non-deterministic finite automaton;
The characters corresponding to the two characterization vectors with the maximum cosine similarity are used as the characters with the strongest correlation in the non-deterministic finite automaton;
Or taking the characters corresponding to the plurality of characterization vectors with the cosine similarity larger than a preset threshold as the character string with the strongest correlation in the non-deterministic finite automaton.
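For the character-string variant, a grouping sketch under the assumption that the "strongest string" is the set of characters whose pairwise cosine similarity exceeds the preset threshold; the threshold value is illustrative:

```python
import numpy as np

def strongest_string(vectors, threshold=0.9):
    """Collect characters whose characterization vectors are mutually similar."""
    keys = list(vectors)
    unit = {k: vectors[k] / np.linalg.norm(vectors[k]) for k in keys}
    group = []
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            if float(unit[a] @ unit[b]) > threshold:   # cosine similarity above threshold
                for ch in (a, b):
                    if ch not in group:
                        group.append(ch)
    return "".join(group)
```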
According to the heterogeneous acceleration method based on self-supervised learning provided by the invention, the acquiring of the character/character string with the strongest correlation in the regular expression corresponding to the non-deterministic finite automaton comprises the following steps:
Calculating a first conditional probability of surrounding characters generated by characters in the regular expression, and acquiring word vectors of the regular expression corresponding to the non-deterministic finite automaton according to the first conditional probability; or calculating a second conditional probability of a corresponding character generated by surrounding characters of a certain character in the regular expression, and acquiring a word vector of the regular expression corresponding to the non-deterministic finite automaton according to the second conditional probability;
and calculating the correlation among a plurality of word vectors based on cosine similarity, and screening out the character/character string with the strongest correlation in the regular expression corresponding to the non-deterministic finite automaton based on the correlation.
According to the heterogeneous acceleration device based on self-supervision learning provided by the embodiment of the invention, introducing the self-supervised learning approach allows an FPGA-based regular acceleration scheme to be developed and tested rapidly, which improves computational performance, resolves performance bottlenecks, and raises the processing efficiency of flexible business cases. By using the state machine graph model, regular rules can be represented more effectively, so that more efficient regular expression matching is realized on the FPGA. The device solves the computational performance problem that the related art may face when processing a large number of regular expressions, and is also expected to provide a more powerful tool for fields such as network security, improving the real-time detection of threats. Secondly, by adopting the self-supervised learning method, the system can judge more intelligently whether the state machine accurately represents the regular rule, improving the development efficiency of the regular acceleration scheme. This not only reduces the burden on developers in FPGA programming, but is also expected to promote wider application of regular expression matching technology in various fields. By providing two correlation-based schemes, efficient means of verification and development are offered while flexibility is maintained. This is of great significance for fields such as OLAP business data services, where a large number of flexible cases exist.
Fig. 5 illustrates a schematic diagram of the physical structure of a terminal device. As shown in Fig. 5, the terminal device may include: a processor 510, a communication interface (Communications Interface) 520, a memory 530 and a communication bus 540, wherein the processor 510, the communication interface 520 and the memory 530 communicate with each other through the communication bus 540. The memory 530 stores a computer program, an operating system and retrieved graph structure data, and the processor 510 may invoke logic instructions in the memory 530 to perform the self-supervised learning based heterogeneous acceleration method, the method comprising: acquiring a data control flow through a local hardware device, and generating a non-deterministic finite automaton according to a generated regular expression, wherein the non-deterministic finite automaton is used for representing the regular expression so as to analyze and filter the data control flow; receiving the data control flow and the non-deterministic finite automaton through a heterogeneous device, analyzing the non-deterministic finite automaton based on a self-supervision learning model, configuring the non-deterministic finite automaton to a corresponding regular engine when the non-deterministic finite automaton is in a matching relation with the regular expression it represents, and carrying out parallel analysis and filtering on the data control flow; the self-supervision learning model is obtained by training based on training data comprising a center word, a background word and a noise word.
Further, the logic instructions in the memory 530 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention, essentially or the part contributing to the related art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
In another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the self-supervised learning based heterogeneous acceleration method provided by the methods above, the method comprising: acquiring a data control flow through a local hardware device, and generating a non-deterministic finite automaton according to a generated regular expression, wherein the non-deterministic finite automaton is used for representing the regular expression so as to analyze and filter the data control flow; receiving the data control flow and the non-deterministic finite automaton through a heterogeneous device, analyzing the non-deterministic finite automaton based on a self-supervision learning model, configuring the non-deterministic finite automaton to a corresponding regular engine when the non-deterministic finite automaton is in a matching relation with the regular expression it represents, and carrying out parallel analysis and filtering on the data control flow; the self-supervision learning model is obtained by training based on training data comprising a center word, a background word and a noise word.
The apparatus embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on such understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the related art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (18)

1. A heterogeneous acceleration method based on self-supervised learning, characterized by comprising the following steps:
Acquiring a data control flow through a local hardware device, and generating a non-deterministic finite automaton according to a generated regular expression, wherein the non-deterministic finite automaton is used for representing the regular expression so as to analyze and filter the data control flow;
Receiving the data control flow and the non-deterministic finite automaton through heterogeneous equipment, analyzing the non-deterministic finite automaton based on a self-supervision learning model, configuring the non-deterministic finite automaton to a corresponding regular engine when the non-deterministic finite automaton is in a matching relation with a regular expression represented by the non-deterministic finite automaton, and carrying out parallel analysis and filtering on the data control flow;
the self-supervision learning model is obtained by training based on training data comprising a center word, a background word and a noise word;
The analyzing the non-deterministic finite automaton based on the self-supervised learning model comprises the following steps:
Acquiring a state topological structure of the non-deterministic finite automaton;
Constructing a training sample set based on the state topological structure, wherein training data in the training sample set comprises a center word, a background word and a noise word;
training the self-supervision learning model based on the training sample set;
Inputting the non-deterministic finite automaton into a trained self-supervision learning model, and obtaining a characterization vector corresponding to the non-deterministic finite automaton;
Calculating the similarity between each characterization vector, and acquiring the character/character string with the strongest correlation in the non-deterministic finite automaton;
Acquiring characters/character strings with strongest correlation in the regular expression corresponding to the non-deterministic finite automaton;
When the character/character string with the strongest correlation in the non-deterministic finite automaton is consistent with the character/character string with the strongest correlation in the regular expression, judging that the non-deterministic finite automaton forms a matching relationship with the represented regular expression;
The self-supervision learning model comprises a word hopping model, wherein the word hopping model comprises a first key embedding layer and a second key embedding layer; the self-supervised learning model performs forward computation, including:
converting the center word into a center word vector through the first key embedding layer, and converting the background word and the noise word into a background word vector and a noise word vector through the second key embedding layer;
estimating the similarity between the central word vector and the background word vector or the noise word vector;
and outputting positive and negative sample classification results corresponding to the currently input training data based on the similarity.
2. The heterogeneous acceleration method based on self-supervised learning of claim 1, wherein the constructing a training sample set based on the state topology comprises:
Generating a plurality of character strings based on the state topology;
Generating a corpus based on the plurality of character strings;
and extracting the center word, the background word and the noise word from the corpus to obtain a training sample set.
3. The self-supervised learning based heterogeneous acceleration method of claim 2, wherein the generating a plurality of character strings based on the state topology comprises:
Creating an initial character string, starting with a special character, and initializing a current state as a starting state of the state topological structure;
Establishing a conversion character dictionary based on the state topological structure, wherein the conversion character dictionary is used for representing the next state to which each state is transferred and corresponding conversion characters of the next state;
Simulating state conversion based on the converted characters until a final state is reached;
a string is generated based on all characters traversed in the state transition process, the string comprising a sequence of transition characters from a starting state to a final state.
4. A heterogeneous acceleration method based on self-supervised learning as set forth in claim 3, wherein the simulating state transitions based on the transition characters until reaching a final state, comprises:
When the current state is not the final state, acquiring a possible conversion path of the current state according to the conversion character dictionary;
randomly selecting one path from possible conversion paths, and adding conversion characters corresponding to the selected path into the initialized character string;
The current state is updated to the next state in the selected transition character dictionary until the final state is reached.
5. The self-supervised learning based heterogeneous acceleration method of claim 2, wherein each character string in the corpus is considered a sentence, each character in a sentence is considered a word, and extracting the center word and the background word from the corpus comprises:
calculating the occurrence frequency of each word in the corpus;
filtering out words with low occurrence frequency based on the occurrence frequency of each word and a preset high-frequency word threshold, and constructing a vocabulary based on the remaining words;
Traversing each word in the vocabulary, taking each word as a central word, randomly selecting a window size, determining the number of background words of each central word according to the window size, selecting words around each central word as the background words corresponding to each central word according to the number of the background words, and taking the central words and the corresponding background words as training positive samples.
6. The self-supervised learning based heterogeneous acceleration method of claim 5, further comprising, after constructing a vocabulary based on the remaining words:
and performing secondary random sampling on the vocabulary, obtaining the occurrence frequency of each word in the vocabulary and the total word count, and screening the words in the vocabulary according to a preset occurrence-frequency requirement and a total-word-count requirement to obtain the final vocabulary.
7. The self-supervised learning based heterogeneous acceleration method of claim 5, wherein extracting noise words from the corpus comprises:
selecting words with occurrence frequency lower than a preset low-frequency word threshold from the corpus, wherein the low-frequency word threshold is three-fourths of the high-frequency word threshold;
Constructing a noise distribution based on the words with the occurrence frequency lower than a preset low-frequency word threshold value;
Normalizing the noise distribution, and unifying words with different occurrence frequencies to the same occurrence frequency;
Randomly extracting a plurality of words meeting the noise quantity requirement by using the normalized noise distribution, and adjusting the number of the extracted words according to the model calculation requirement and calculation resources;
judging whether the extracted word is a background word corresponding to the central word, if so, discarding the extracted word; otherwise, taking the extracted words as noise words corresponding to the central words until the number of the noise words meets the noise number requirement;
And traversing all the central words in the corpus, and generating a group of noise words for each central word as a training negative sample.
8. The heterogeneous acceleration method based on self-supervised learning of claim 2, wherein the extracting the center word, the background word, and the noise word from the corpus to obtain the training sample set comprises:
Receiving the center word, the background word and the noise word through a batch processing function;
Adding positive sample labels for the center words and the corresponding background words thereof and storing the positive sample labels in a positive sample data list, and adding negative sample labels for the center words and the corresponding noise words thereof and storing the negative sample labels in a negative sample data list;
Unifying the list lengths of the positive sample data list and the negative sample data list by padding elements at the end of the shorter data list;
Assigning different mask values to normal elements and padding elements in each data list;
A training sample set is generated based on the positive sample data list and the negative sample data list.
9. The heterogeneous acceleration method based on self-supervised learning of claim 8, wherein the training of the self-supervised learning model based on the training sample set comprises:
Reading the positive sample data of the positive sample data list and the negative sample data in the negative sample data list in batches;
for each read lot, inputting the positive or negative sample data into a self-supervised learning model to perform forward computation;
calculating a predicted loss based on the forward calculation output result and the loss function;
And carrying out back propagation through the prediction loss, and updating model parameters until the training ending condition is met, so as to obtain a trained self-supervision learning model.
10. The heterogeneous acceleration method of claim 9, wherein the batch reading of the positive sample data in the positive sample data list and the negative sample data in the negative sample data list comprises:
And transmitting the batch processing function as a parameter to a data loader, and reading the positive sample data of the positive sample data list and the negative sample data in the negative sample data list in batches by the data loader.
11. The self-supervised learning based heterogeneous acceleration method of claim 9, wherein the estimating similarity between the center word vector and the background word vector or noise word vector comprises:
And respectively carrying out dot product operation on the central word vector and the background word vector or the noise word vector, and taking a dot product operation result as the similarity between the central word vector and the background word vector or the noise word vector.
12. The self-supervised learning based heterogeneous acceleration method of claim 11, wherein the loss function is a binary cross entropy loss function, the calculating a predicted loss based on the forward calculated output result and the loss function, comprising:
and calculating the prediction loss between the positive and negative sample classification result corresponding to the input training data and the positive and negative sample label corresponding to the input training data based on the binary cross entropy loss function.
13. The heterogeneous acceleration method of claim 1, wherein the calculating the similarity between each token vector to obtain the most relevant character/string in the non-deterministic finite automaton comprises:
acquiring cosine similarity between characterization vectors of each node in the non-deterministic finite automaton;
The characters corresponding to the two characterization vectors with the maximum cosine similarity are used as the characters with the strongest correlation in the non-deterministic finite automaton;
Or taking the characters corresponding to the plurality of characterization vectors with the cosine similarity larger than a preset threshold as the character string with the strongest correlation in the non-deterministic finite automaton.
14. The heterogeneous acceleration method based on self-supervised learning according to claim 1, wherein the obtaining the character/character string with the strongest correlation in the regular expression corresponding to the non-deterministic finite automaton comprises:
Calculating a first conditional probability of surrounding characters generated by characters in the regular expression, and acquiring word vectors of the regular expression corresponding to the non-deterministic finite automaton according to the first conditional probability; or calculating a second conditional probability of a corresponding character generated by surrounding characters of a certain character in the regular expression, and acquiring a word vector of the regular expression corresponding to the non-deterministic finite automaton according to the second conditional probability;
and calculating the correlation among a plurality of word vectors based on cosine similarity, and screening out the character/character string with the strongest correlation in the regular expression corresponding to the non-deterministic finite automaton based on the correlation.
15. The self-supervised learning based heterogeneous acceleration method of claim 1, wherein the local hardware device comprises: a CPU or GPU; the heterogeneous device comprises an FPGA, further comprising:
the CPU or GPU sends control instructions to the FPGA through a register, wherein the control instructions comprise control start, reset and address offset.
16. A heterogeneous acceleration device based on self-supervised learning, characterized by comprising:
The generation module is used for acquiring a data control flow through the local hardware equipment and generating a non-deterministic finite automaton according to the generated regular expression, wherein the non-deterministic finite automaton is used for representing the regular expression so as to analyze and filter the data control flow;
the analysis module is used for receiving the data control flow and the non-deterministic finite automaton through heterogeneous equipment, analyzing the non-deterministic finite automaton based on a self-supervision learning model, and configuring the non-deterministic finite automaton to a corresponding regular engine when the non-deterministic finite automaton is in a matching relation with a regular expression represented by the non-deterministic finite automaton, so as to analyze and filter the data control flow in parallel;
the self-supervision learning model is obtained by training based on training data comprising a center word, a background word and a noise word;
The analyzing the non-deterministic finite automaton based on the self-supervised learning model comprises the following steps:
Acquiring a state topological structure of the non-deterministic finite automaton;
Constructing a training sample set based on the state topological structure, wherein training data in the training sample set comprises a center word, a background word and a noise word;
training the self-supervision learning model based on the training sample set;
Inputting the non-deterministic finite automaton into a trained self-supervision learning model, and obtaining a characterization vector corresponding to the non-deterministic finite automaton;
Calculating the similarity between each characterization vector, and acquiring the character/character string with the strongest correlation in the non-deterministic finite automaton;
Acquiring characters/character strings with strongest correlation in the regular expression corresponding to the non-deterministic finite automaton;
When the character/character string with the strongest correlation in the non-deterministic finite automaton is consistent with the character/character string with the strongest correlation in the regular expression, judging that the non-deterministic finite automaton forms a matching relationship with the represented regular expression;
The self-supervision learning model comprises a word hopping model, wherein the word hopping model comprises a first key embedding layer and a second key embedding layer; the self-supervised learning model performs forward computation, including:
converting the center word into a center word vector through the first key embedding layer, and converting the background word and the noise word into a background word vector and a noise word vector through the second key embedding layer;
estimating the similarity between the central word vector and the background word vector or the noise word vector;
and outputting positive and negative sample classification results corresponding to the currently input training data based on the similarity.
17. A terminal device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the self-supervised learning based heterogeneous acceleration method according to any one of claims 1 to 15.
18. A non-transitory readable storage medium having stored thereon a computer program which, when executed by a processor, implements the self-supervised learning based heterogeneous acceleration method of any of claims 1 to 15.
CN202410376329.1A 2024-03-29 Heterogeneous acceleration method, device, equipment and storage medium based on self-supervision learning Active CN117971355B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410376329.1A CN117971355B (en) 2024-03-29 Heterogeneous acceleration method, device, equipment and storage medium based on self-supervision learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410376329.1A CN117971355B (en) 2024-03-29 Heterogeneous acceleration method, device, equipment and storage medium based on self-supervision learning

Publications (2)

Publication Number Publication Date
CN117971355A (en) 2024-05-03
CN117971355B (en) 2024-06-07


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163233A (en) * 2018-02-11 2019-08-23 陕西爱尚物联科技有限公司 A method for enabling a machine to be competent at more complex work
CA3207044A1 (en) * 2021-03-18 2022-09-01 Joy MACKAY Automated classification of emotio-cognition
CN115688779A (en) * 2022-10-11 2023-02-03 杭州瑞成信息技术股份有限公司 Address recognition method based on self-supervision deep learning
CN116049371A (en) * 2023-01-18 2023-05-02 之江实验室 Visual question-answering method and device based on regularization and dual learning
CN116304367A (en) * 2023-02-24 2023-06-23 河北师范大学 Algorithm and device for obtaining communities based on graph self-encoder self-supervision training
CN116303881A (en) * 2022-12-13 2023-06-23 浙江邦盛科技股份有限公司 Enterprise organization address matching method and device based on self-supervision representation learning
CN117349870A (en) * 2023-12-05 2024-01-05 苏州元脑智能科技有限公司 Transparent encryption and decryption computing system, method, equipment and medium based on heterogeneous computing
CN117729047A (en) * 2023-12-29 2024-03-19 北京网藤科技有限公司 Intelligent learning engine method and system for industrial control network flow audit

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Sentiment analysis method based on a weakly supervised pre-trained CNN model; Zhang Yue; Xia Hongbin; Computer Engineering and Applications; 2018-07-01 (No. 13); full text *

Similar Documents

Publication Publication Date Title
US10983761B2 (en) Deep learning enhanced code completion system
Russell et al. Automated vulnerability detection in source code using deep representation learning
Sun et al. The neural network pushdown automaton: Model, stack and learning simulations
JP7178513B2 (en) Chinese word segmentation method, device, storage medium and computer equipment based on deep learning
CN114896395A (en) Language model fine-tuning method, text classification method, device and equipment
CN112420125A (en) Molecular attribute prediction method and device, intelligent equipment and terminal
CN115374845A (en) Commodity information reasoning method and device
CN115756475A (en) Sequence generation countermeasure network-based code annotation generation method and device
CN116702157B (en) Intelligent contract vulnerability detection method based on neural network
CN117971355B (en) Heterogeneous acceleration method, device, equipment and storage medium based on self-supervision learning
CN117520142A (en) Automatic test assertion statement generation method based on code pre-training model
Sekiyama et al. Automated proof synthesis for the minimal propositional logic with deep neural networks
CN116595537A (en) Vulnerability detection method of generated intelligent contract based on multi-mode features
CN116361788A (en) Binary software vulnerability prediction method based on machine learning
CN117971355A (en) Heterogeneous acceleration method, device, equipment and storage medium based on self-supervision learning
CN114819163A (en) Quantum generation countermeasure network training method, device, medium, and electronic device
CN115017987A (en) Language model fine-tuning method, text classification method, device and equipment
Paduraru et al. Automatic test data generation for a given set of applications using recurrent neural networks
CN116029261A (en) Chinese text grammar error correction method and related equipment
CN117971356A (en) Heterogeneous acceleration method, device, equipment and storage medium based on semi-supervised learning
CN117971357B (en) Finite state automaton verification method and device, electronic equipment and storage medium
CN117971354A (en) Heterogeneous acceleration method, device, equipment and storage medium based on end-to-end learning
CN116527411B (en) Data security intelligent protection model construction method and device and collaboration platform
CN113190657B (en) NLP data preprocessing method, jvm and spark end server
CN117971358B (en) Finite state automaton verification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant