CN107038161B - Equipment and method for filtering data - Google Patents

Publication number
CN107038161B
Authority
CN
China
Prior art keywords
rule
data
filtered
filtering
abstract syntax
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510408180.1A
Other languages
Chinese (zh)
Other versions
CN107038161A (en
Inventor
丁崔灿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201510408180.1A (CN107038161B)
Priority to PCT/CN2016/088302 (WO2017008650A1)
Publication of CN107038161A
Application granted
Publication of CN107038161B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G06F16/24578 Query processing with adaptation to user needs using ranking

Abstract

The application provides equipment and a method for filtering data. Each time initial data to be filtered is obtained, it is converted into structured data to be filtered and matched in real time against the corresponding filtering rules, so that the filtering result is obtained immediately, which solves the real-time problem. Arithmetic, string, relational, logical, regular expression and set operations are supported, and an extension interface is reserved. Because each filtering rule takes the form of a simple operation expression with variables, the problems of filtering rules being complex to describe, hard to extend and hard to manage are solved.

Description

Equipment and method for filtering data
Technical Field
The application relates to the field of computers, and in particular to a technique for filtering, in real time and according to configured filtering rules, the data that satisfies those rules out of massive data.
Background
With the explosive growth of information technology, the data volume is increasing day by day, and the requirements of numerous fields on the processing of mass data are increasing continuously.
In the prior art, there are several methods for filtering, out of massive data, the data that satisfies a configured filtering rule:
filtering valid data with SQL (Structured Query Language) statements on an in-memory relational database; however, this method must cache the massive data in logical data tables of the in-memory database, which occupies a large amount of memory, and periodic execution of SQL statements can hardly meet real-time requirements;
filtering valid data with a Map-Reduce algorithm (a programming model for parallel computation over large data sets) on top of a massive data storage scheme based on HBase (a distributed, column-oriented open source database); however, a Map-Reduce task is an after-the-fact, batch-like computation that can only be executed periodically over the data stored in HBase, so real-time behavior is hard to guarantee, and complex Map-Reduce tasks have to be implemented by writing extension code, which makes it hard to support large numbers of filtering rules that change in real time and require diverse computations;
filtering valid data with a pattern matching algorithm on a CEP (Complex Event Processing) engine, which is better suited to monitoring and decision control in enterprise application systems; however, most mature CEP engines are commercial software with a high cost of use, and each engine has its own way of describing pattern rules (for example, Drools uses an XML format and Esper uses an EPL format), so a large amount of adaptation code must be written for different system requirements and non-standard matching algorithms must be implemented as extensions, which increases implementation difficulty.
Disclosure of Invention
The technical problem to be solved by the application is how to filter, in real time and without occupying a large amount of memory, the data that satisfies configured filtering rules out of massive data, while supporting a large number of filtering rules that change in real time and require diverse computations.
To achieve the above object, the present application provides a method for filtering data, wherein the method comprises:
acquiring initial data to be filtered, and converting the initial data to be filtered into structured data to be filtered, wherein the structured data to be filtered comprises a data field identifier and a data body in a key-value pair format;
loading filtering rules, wherein each filtering rule comprises a rule field identifier, a rule name and a rule operational expression, and establishing a first rule list of the filtering rules by taking the field identifier of the filtering rule as an index;
acquiring the structured data to be filtered, and acquiring a plurality of filtering rules with rule field identifications corresponding to the data field identifications from the first rule list according to the data field identifications;
and performing parallel matching operation on the structured data to be filtered by using the acquired plurality of filtering rules.
Further, the acquiring initial data to be filtered includes:
and acquiring the initial data to be filtered from the distributed message middleware.
Further, converting the initial data to be filtered into structured data to be filtered further comprises:
sending the structured data to be filtered to a blocking queue;
acquiring the structured data to be filtered comprises:
and acquiring the structured data to be filtered from the blocking queue.
Further, the performing parallel matching operation on the structured data to be filtered by using the obtained plurality of filtering rules includes:
performing rule compiling on the acquired filtering rules to establish an executable abstract syntax tree;
and traversing a plurality of runnable abstract syntax trees by taking the data body of the structured data to be filtered as an input parameter, and performing parallel matching calculation by using the plurality of runnable abstract syntax trees.
Further, the rule compiling the obtained filtering rule to establish the runnable abstract syntax tree includes:
analyzing the rule expression of the acquired filtering rule to convert the rule expression into an abstract syntax tree;
pre-computing the abstract syntax tree to obtain the runnable abstract syntax tree;
wherein pre-computing the abstract syntax tree once comprises:
creating a run stack according to the abstract syntax tree, and pushing the elements of the abstract syntax tree onto the run stack;
when the element is an operator, popping the two operands corresponding to the operator off the run stack and calculating a result;
and when the element is a special element, converting the special element into a programming language data structure element and then pushing it onto the run stack.
Further, performing parallel matching calculations using a number of the runnable abstract syntax trees comprises:
replacing variables of the runnable abstract syntax tree with parameters in the data body;
and performing matching calculation on the runnable abstract syntax tree by utilizing the running stack.
Further, the method further comprises:
adding filter rules, deleting filter rules or modifying and compiling the existing filter rules.
Further, establishing the first rule list of the filtering rule with the domain identifier of the filtering rule as an index further includes:
establishing a second rule list of the filtering rules indexed according to rule names of the filtering rules;
the adding, deleting or modifying and compiling of the filter rule comprises at least any one of the following steps:
adding the newly added filtering rule into the second rule list;
deleting the corresponding filtering rule from the second rule list;
and searching the filtering rule from the second rule list, and modifying and compiling the searched filtering rule.
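The two rule lists and the add, delete and modify operations described above can be sketched as a small in-memory registry. This is a minimal illustration under stated assumptions, not the patent's implementation: the class names FilterRule and RuleRegistry, the method names, and the choice to keep both lists in sync inside one object are all hypothetical.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical filtering rule: field identifier, rule name, rule expression. */
final class FilterRule {
    final String domainId;
    final String name;
    final String expression;

    FilterRule(String domainId, String name, String expression) {
        this.domainId = domainId;
        this.name = name;
        this.expression = expression;
    }
}

/** Hypothetical registry holding both rule lists. */
final class RuleRegistry {
    // "First rule list": rules indexed by rule field identifier.
    private final Map<String, List<FilterRule>> byDomain = new ConcurrentHashMap<>();
    // "Second rule list": rules indexed by rule name.
    private final Map<String, FilterRule> byName = new ConcurrentHashMap<>();

    /** Adds a rule; a rule with the same name is replaced (this models "modify"). */
    synchronized void add(FilterRule rule) {
        remove(rule.name);
        byName.put(rule.name, rule);
        byDomain.computeIfAbsent(rule.domainId, k -> new CopyOnWriteArrayList<>()).add(rule);
    }

    /** Deletes a rule by name from both lists; returns false if it was unknown. */
    synchronized boolean remove(String name) {
        FilterRule old = byName.remove(name);
        if (old == null) {
            return false;
        }
        List<FilterRule> rules = byDomain.get(old.domainId);
        if (rules != null) {
            rules.remove(old);
        }
        return true;
    }

    /** Looks a rule up in the second rule list. */
    FilterRule findByName(String name) {
        return byName.get(name);
    }

    /** Returns all rules whose field identifier equals the given data field identifier. */
    List<FilterRule> rulesFor(String domainId) {
        List<FilterRule> rules = byDomain.get(domainId);
        return rules == null ? Collections.<FilterRule>emptyList() : rules;
    }
}
```

In this sketch a lookup by data field identifier touches only the first list, while management operations go through the name index, which mirrors the division of labor between the two lists in the text.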
Further, each of the filtering rules further includes: information of the notifier to which the filtering rule is bound;
the method further comprises the following steps:
and sending the structured data to be filtered meeting the corresponding filtering rule to the notifier bound by the filtering rule for transmission.
There is also provided according to another aspect of the present application, an apparatus for filtering data, wherein the apparatus includes:
a first device for acquiring initial data to be filtered and converting the initial data to be filtered into structured data to be filtered, wherein the structured data to be filtered comprises a data field identifier and a data body in a key-value pair format;
the second device is used for loading filtering rules, wherein each filtering rule comprises a rule field identifier, a rule name and a rule operational expression, and a first rule list of the filtering rule with the field identifier of the filtering rule as an index is established;
the third device is used for acquiring the structured data to be filtered and acquiring a plurality of filtering rules with rule field identifications corresponding to the data field identifications from the first rule list according to the data field identifications;
and the fourth device is used for performing parallel matching operation on the structured data to be filtered by utilizing the acquired plurality of filtering rules.
Further, the first apparatus includes:
a unit for acquiring the initial data to be filtered from the distributed message middleware.
Further, the first apparatus includes:
means for sending the structured data to be filtered to a blocking queue;
the third means comprises:
a unit for acquiring the structured data to be filtered from the blocking queue.
Further, the fourth apparatus includes:
means for performing rule compilation on the obtained filter rules to create a runnable abstract syntax tree;
and the unit is used for traversing the plurality of runnable abstract syntax trees by taking the data body of the structured data to be filtered as an input parameter and performing parallel matching calculation by utilizing the plurality of runnable abstract syntax trees.
Further, the means for performing rule compilation on the obtained filter rules to build a runnable abstract syntax tree includes:
a module for analyzing the rule expression of the obtained filtering rule to convert into an abstract syntax tree;
means for pre-computing the abstract syntax tree to obtain the runnable abstract syntax tree, wherein the means is configured to:
creating a run stack from the abstract syntax tree and pushing the elements of the abstract syntax tree onto the run stack,
when the element is an operator, popping the two operands corresponding to the operator off the run stack and calculating a result,
and when the element is a special element, converting the special element into a programming language data structure element and then pushing it onto the run stack.
Further, the unit for traversing the plurality of runnable abstract syntax trees by using the data body of the structured data to be filtered as an input parameter, and performing parallel matching calculation by using the plurality of runnable abstract syntax trees includes:
means for replacing variables of the runnable abstract syntax tree with parameters in the data body;
means for performing a match calculation on the runnable abstract syntax tree using the run stack.
Further, the apparatus further comprises:
and the fifth device is used for adding the filtering rules, deleting the filtering rules or modifying and compiling the existing filtering rules.
Further, the second apparatus further includes:
a unit that creates a second rule list of the filter rule indexed according to a rule name of the filter rule;
the fifth means includes:
means for adding the newly added filter rule to the second rule list;
means for deleting the corresponding filter rule from the second rule list;
means for searching for a filter rule from the second rule list, and performing modification compilation on the searched filter rule.
Further, each of the filtering rules further includes: information of the notifier to which the filtering rule is bound;
the apparatus further comprises:
and the sixth device is used for sending the structured data to be filtered meeting the corresponding filtering rule to the notifier bound by the filtering rule for transmission.
Compared with the prior art, the equipment and the method for filtering data provided by the embodiments of the application operate in a streaming manner: data is not cached or persisted in memory. Each time initial data to be filtered is obtained, it is converted into structured data to be filtered and matched in real time against the corresponding filtering rules, so the filtering result is obtained immediately, which solves the real-time problem of filtering massive streaming data;
furthermore, the equipment and the method support arithmetic, string, relational, logical, regular expression and set operations and reserve an extension interface, and each filtering rule takes the form of a simple operation expression with variables, which solves the problems of filtering rules being complex to describe, hard to extend and hard to manage;
in addition, the equipment and the method are designed and developed independently, are relatively low in cost, and can be monitored and optimized on any code path.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 illustrates an apparatus schematic diagram of an apparatus for filtering data provided in accordance with an aspect of the present application;
FIG. 2 illustrates an apparatus diagram of an apparatus for filtering data provided in accordance with a preferred embodiment of the present application;
FIG. 3 illustrates an apparatus diagram of an apparatus for filtering data according to another preferred embodiment of the present application;
FIG. 4 illustrates a flow diagram of a method for filtering data provided in accordance with an aspect of the present application;
FIG. 5 illustrates a flow chart of a method for filtering data provided in accordance with a preferred embodiment of the present application;
FIG. 6 illustrates a flow chart of a method for filtering data provided in accordance with another preferred embodiment of the present application;
FIG. 7 illustrates a schematic diagram of a system including the device for filtering data provided in accordance with a preferred embodiment of the present application;
FIGS. 8 to 10 are schematic diagrams illustrating parallel matching operations performed on the structured data to be filtered by using the obtained filtering rules according to a specific scenario of the present application.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present application is described in further detail below with reference to the attached figures.
Fig. 1 shows a schematic apparatus diagram of an apparatus for filtering data according to an aspect of the present application, where the apparatus 1 includes: a first device 11, a second device 12, a third device 13 and a fourth device 14.
Specifically, the first device 11 is configured to obtain initial data to be filtered, and convert the initial data to be filtered into structured data to be filtered, where the structured data to be filtered includes a data body in a key-value pair format and a data field identifier; the second device 12 is configured to load filtering rules, where each filtering rule includes a rule domain identifier, a rule name, and a rule operation expression, and establish a first rule list of the filtering rule using the domain identifier of the filtering rule as an index; the third device 13 is configured to obtain the structured data to be filtered, and obtain, according to the data field identifier, a plurality of filtering rules having rule field identifiers corresponding to the data field identifiers from the first rule list; the fourth device 14 is configured to perform a parallel matching operation on the structured data to be filtered by using the obtained filtering rules.
Further, the structured data to be filtered that the first device 11 produces from the initial data to be filtered includes a data field identifier and a data body in a key-value pair format.
The data field identifier indicates the category of the structured data to be filtered, where the category is, for example and without limitation, the CPU occupancy of a host or the access delay of a certain website. The data field identifier may be expressed as data or characters; any identifier that a computer can recognize may serve as an embodiment of the data field identifier and is included herein by reference. The data body records the detailed information of the structured data to be filtered in key-value pair format (Key-Value format). By way of example only, and without limitation, the data body may be: instanceId=123456, clusterId=Hangzhou, Value=92, bizTime=1427041923825, unit=Percent, where the left side of each equal sign is a key (Key) and the right side is a value (Value), so that each pair forms one key-value pair of the data body. The data body may include one or more key-value pairs, and their number is not limited.
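As an illustration of the structure just described, the following sketch models one piece of structured data as a data field identifier plus a key-value data body. The class name StructuredData, the comma-separated "key=value" input format and the sample field identifier are assumptions made for the example; the patent does not specify a wire format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical container for one piece of structured data to be filtered. */
final class StructuredData {
    final String domainId;           // data field identifier (category of the data)
    final Map<String, String> body;  // data body in key-value pair format

    StructuredData(String domainId, Map<String, String> body) {
        this.domainId = domainId;
        this.body = body;
    }

    /** Parses "k1=v1, k2=v2" into a data body; this input format is assumed. */
    static StructuredData parse(String domainId, String raw) {
        Map<String, String> body = new LinkedHashMap<>();
        for (String pair : raw.split(",")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // skip fragments that have no key
            }
            body.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
        }
        return new StructuredData(domainId, body);
    }
}
```

For instance, `StructuredData.parse("host.cpu", "instanceId=123456, clusterId=Hangzhou, Value=92")` would yield a body with three key-value pairs, where "host.cpu" is a made-up field identifier.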
Preferably, the initial data to be filtered is obtained from massive data, and the first device 11 further includes a unit for acquiring the initial data to be filtered from distributed message middleware. Preferably, the distributed message middleware is MetaQ, a distributed message middleware based on a queue model with the following characteristics: it can guarantee strict message ordering, and it provides rich message pulling modes, efficient horizontal scaling of subscribers, a real-time message subscription mechanism and the capacity to accumulate messages on the order of hundreds of millions. By exploiting the clustered data sharding of MetaQ, a plurality of devices 1 can form a cluster of peer nodes with identical functions, which gives the cluster load balancing capability and meets the requirements of scalability, high availability and performance against a background of massive data.
Preferably, the first device 11 may further include a unit for sending the structured data to be filtered to a blocking queue; accordingly, the third device 13 includes a unit for retrieving the structured data to be filtered from the blocking queue.
Here, when the blocking queue is full, it blocks further enqueue operations until the queue is no longer full. Specifically, the first device 11 sends the structured data to be filtered to the blocking queue, where it waits; the third device 13 takes the structured data to be filtered from the blocking queue in waiting order, and each item is removed from the queue once it has been taken. When the blocking queue is full of waiting structured data, it blocks the first device 11 from sending further structured data to be filtered. This prevents excessive memory usage when processing capacity is insufficient, smooths peaks and troughs while filtering massive data, and avoids processing faults.
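The back-pressure behavior of a bounded blocking queue can be sketched with the standard java.util.concurrent.ArrayBlockingQueue. The wrapper class BoundedHandoff and its method names are hypothetical, and the sketch uses timed offer and poll calls so that it cannot hang; an implementation that blocks indefinitely, as the text describes, would use put and take instead.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical hand-off between the converting side and the matching side. */
final class BoundedHandoff {
    private final BlockingQueue<String> queue;

    BoundedHandoff(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Producer side: waits up to timeoutMs for free space; false means the queue stayed full. */
    boolean submit(String item, long timeoutMs) {
        try {
            return queue.offer(item, timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    /** Consumer side: removes and returns the oldest item, or null if none arrives in time. */
    String next(long timeoutMs) {
        try {
            return queue.poll(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }
}
```

Because the capacity is fixed, the memory held by waiting items is bounded no matter how fast the producer runs, which is the peak-smoothing effect the paragraph above refers to.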
Further, the second device 12 is configured to load filtering rules, where each filtering rule includes a rule field identifier, a rule name and a rule operation expression, and to establish a first rule list of the filtering rules indexed by the rule field identifier.
Here, the rule field identifier indicates the category of the filtering rule, where the category is, for example and without limitation, the CPU occupancy of a host or the access delay of a certain website. The rule field identifier may be expressed as data or characters; any identifier that a computer can recognize may serve as an embodiment of the rule field identifier and is included herein by reference. Preferably, the content of the rule field identifier is the same or substantially the same as that of the data field identifier, so that the third device 13 can obtain from the first rule list, according to a data field identifier, the filtering rules whose rule field identifiers correspond to it. The rule name may be globally unique, which facilitates management and maintenance of the filtering rules. The rule operation expression may be an expression composed of numbers and character strings, for example (by way of example only, and not limited thereto): instanceId=='AY123456' || clusterId && value>80. The rule operation expression may further include data set types, that is, types other than native types such as numbers and character strings, for example (by way of example only, and not limited thereto) arrays and hash sets.
Further, the second device 12 establishes a first rule list of the filtering rules indexed by the domain identifier of the filtering rule, and the first rule list is used for providing support for the third device 13 to obtain the filtering rule.
Further, the third device 13 obtains the structured data to be filtered and, according to its data field identifier, obtains from the first rule list the plurality of filtering rules whose rule field identifiers correspond to that data field identifier.
Further, the fourth device 14 performs a parallel matching operation on the structured data to be filtered by using the obtained filtering rules.
Preferably, for each structured data to be filtered, the third device 13 obtains a plurality of filtering rules having corresponding same rule field identifiers according to the data field identifiers thereof, the fourth device 14 performs a matching operation on the structured data to be filtered by using each obtained filtering rule, and the fourth device 14 performs a parallel matching operation on the plurality of obtained filtering rules, so as to fully utilize the performance of the multi-core central processing unit and improve the filtering efficiency.
In particular, the fourth means 14 comprise: means for performing rule compilation on the obtained filter rules to create a runnable abstract syntax tree; and the unit is used for traversing the plurality of runnable abstract syntax trees by taking the data body of the structured data to be filtered as an input parameter and performing parallel matching calculation by utilizing the plurality of runnable abstract syntax trees.
The fourth device 14 implements the function of abstract syntax tree, and can support arithmetic operation, character string operation, relational operation, logical operation, regular expression operation, set operation, and the like, and reserves an extension interface, and can support user-defined operation, and the like.
Further, the fourth device 14 performs rule compiling on the obtained filtering rules to establish a runnable Abstract Syntax Tree (AST), which is a tree representation of the abstract syntactic structure of the rule expression.
Specifically, the unit for performing rule compiling on the acquired filtering rules to establish a runnable abstract syntax tree includes: a module for analyzing the rule expression of the obtained filtering rule to convert into an abstract syntax tree; and means for pre-computing the abstract syntax tree to obtain the runnable abstract syntax tree.
Specifically, parsing the rule expression of an obtained filtering rule into an abstract syntax tree can be implemented with ANTLR (ANother Tool for Language Recognition), which allows a user-defined filtering rule expression to be converted into an abstract syntax tree. Lexical analysis of the rule expression yields the token stream of the AST; the tokens are the elements recognized in the rule string, including but not limited to operators, numbers, strings, variables and regular expressions.
Among these, arithmetic operators include, for example, the following example code:
[Code listing: reproduced only as an image in the original publication]
In a specific application scenario, for example, the rule expression of the filtering rule is the following character string:
CPU > 90/100 and clusterId in ['hz','qd'] and instanceId like 'AK47\w+'
FIGS. 8 to 10 illustrate the parallel matching operations performed on the structured data to be filtered with the obtained filtering rules in this specific scenario. By writing ANTLR lexical analysis rules, the AST token stream shown in FIG. 9 is obtained. Inside the system, the operation expression is stored in suffix (postfix) notation, which resolves the operator precedence problem; as shown in FIG. 8, the stored element types are: OP (operator), Num (number), Var (variable), Regex (regular expression) and StrArray (string array).
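To make the suffix representation concrete, the sketch below converts an already tokenized infix rule into postfix order with the classic shunting-yard algorithm. It is a hand-written stand-in and does not reproduce the ANTLR-based approach of the patent; the class name PostfixConverter, the operator set and the precedence levels are assumptions chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical infix-to-postfix converter over pre-split tokens. */
final class PostfixConverter {
    // Assumed precedence: logical < relational/membership < additive < multiplicative.
    private static final Map<String, Integer> PRECEDENCE = new HashMap<>();
    static {
        PRECEDENCE.put("or", 1);
        PRECEDENCE.put("and", 2);
        for (String op : new String[] {">", "<", "==", "in", "like"}) {
            PRECEDENCE.put(op, 3);
        }
        PRECEDENCE.put("+", 4);
        PRECEDENCE.put("-", 4);
        PRECEDENCE.put("*", 5);
        PRECEDENCE.put("/", 5);
    }

    static List<String> toPostfix(List<String> infix) {
        List<String> out = new ArrayList<>();
        Deque<String> ops = new ArrayDeque<>();
        for (String tok : infix) {
            if (tok.equals("(")) {
                ops.push(tok);
            } else if (tok.equals(")")) {
                // Emit operators until the matching "(" is found.
                while (!ops.isEmpty() && !ops.peek().equals("(")) {
                    out.add(ops.pop());
                }
                if (ops.isEmpty()) {
                    throw new IllegalArgumentException("unbalanced ')'");
                }
                ops.pop();
            } else if (PRECEDENCE.containsKey(tok)) {
                // All operators are treated as left-associative.
                while (!ops.isEmpty() && !ops.peek().equals("(")
                        && PRECEDENCE.get(ops.peek()) >= PRECEDENCE.get(tok)) {
                    out.add(ops.pop());
                }
                ops.push(tok);
            } else {
                out.add(tok); // operand: number, variable, string or array literal
            }
        }
        while (!ops.isEmpty()) {
            String op = ops.pop();
            if (op.equals("(")) {
                throw new IllegalArgumentException("unbalanced '('");
            }
            out.add(op);
        }
        return out;
    }
}
```

With these assumed precedences, the tokens of `CPU > 90/100 and clusterId in ['hz','qd']` come out as `CPU 90 100 / > clusterId ['hz','qd'] in and`, so no parentheses or precedence lookups are needed at evaluation time.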
The abstract syntax tree is then pre-computed once to obtain the runnable abstract syntax tree. Pre-computation evaluates the constant expressions in the AST token stream in advance, determining for each sub-expression whether it can already be calculated. It also checks whether each element of the abstract syntax tree is of a special type and converts elements of special types into programming language data structure elements; for example, and without limitation, the operand of a Like operation is translated into a regular expression and the operand of an In operation is translated into a set. Evaluating constant expressions ahead of time speeds up runtime processing. Here, elements of special types are elements whose types are not native types such as numbers and character strings, for example, and without limitation, data set types such as arrays, hash maps and hash sets.
In this specific scenario, the fourth device 14 pre-computes the abstract syntax tree shown in FIG. 9 once, and the result is the runnable Abstract Syntax Tree (AST), whose token stream is shown in FIG. 10, where "0.9", "java.util.HashSet ['hz', 'qd']" and "java.util.regex.Pattern 'AK47\w+'" are the pre-computed results.
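The pre-computation in this scenario can be illustrated with a small pass over a postfix token list that folds constant arithmetic and converts the operands of In and Like. This is a sketch under assumptions, not the patent's code: the classes Op, Var and Precomputer are hypothetical, numbers are modeled as Double, string arrays as String[], and only the four arithmetic operators are folded.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical operator element of the postfix stream. */
final class Op {
    final String symbol;
    Op(String symbol) { this.symbol = symbol; }
}

/** Hypothetical variable element of the postfix stream. */
final class Var {
    final String name;
    Var(String name) { this.name = name; }
}

final class Precomputer {
    /** Returns a new postfix list with constants folded and special operands converted. */
    static List<Object> precompute(List<Object> postfix) {
        List<Object> out = new ArrayList<>();
        for (Object el : postfix) {
            if (!(el instanceof Op)) {
                out.add(el); // operands are copied through unchanged
                continue;
            }
            String op = ((Op) el).symbol;
            int n = out.size();
            Object right = n >= 1 ? out.get(n - 1) : null;
            Object left = n >= 2 ? out.get(n - 2) : null;
            if (isArithmetic(op) && left instanceof Double && right instanceof Double) {
                // Both operands are constants: replace them with the computed value.
                out.subList(n - 2, n).clear();
                out.add(apply(op, (Double) left, (Double) right));
                continue;
            }
            if (op.equals("in") && right instanceof String[]) {
                out.set(n - 1, new HashSet<String>(Arrays.asList((String[]) right)));
            } else if (op.equals("like") && right instanceof String) {
                out.set(n - 1, Pattern.compile((String) right));
            }
            out.add(el); // the operator itself stays for evaluation at match time
        }
        return out;
    }

    private static boolean isArithmetic(String op) {
        return op.equals("+") || op.equals("-") || op.equals("*") || op.equals("/");
    }

    private static Double apply(String op, double a, double b) {
        switch (op) {
            case "+": return a + b;
            case "-": return a - b;
            case "*": return a * b;
            default:  return a / b;
        }
    }
}
```

Applied to the postfix form of the scenario's rule, the sub-expression 90/100 collapses to the single number 0.9, the string array becomes a HashSet and the Like operand becomes a compiled Pattern, while the comparison and logical operators remain because their operands depend on variables.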
In an alternative embodiment, example code for performing the pre-calculation is as follows:
[Code listing: reproduced only as an image in the original publication]
Of course, those skilled in the art should understand that the above example code is only an example; other existing or future forms of pre-computation and code, if applicable to the present application, are also included in the protection scope of the present application by reference.
In particular, the module for pre-computing the abstract syntax tree once to obtain the runnable abstract syntax tree is configured to: create a run stack according to the abstract syntax tree and push the elements of the abstract syntax tree onto the run stack; when an element is an operator, pop the two operands corresponding to the operator off the run stack and calculate a result; and when an element is a special element, convert it into a programming language data structure element and then push it onto the run stack.
Further, the fourth device 14 further includes a unit that takes the data body of the structured data to be filtered as an input parameter, traverses the plurality of runnable abstract syntax trees, and performs parallel matching calculation by using the plurality of runnable abstract syntax trees.
The matching calculation on a runnable abstract syntax tree proceeds in the same way as the one-time pre-calculation. During normal operation every expression in the AST is calculable, so the final result is a definite Boolean value, FALSE or TRUE; if the result is TRUE, the structured data is judged to satisfy the filtering rule.
Here, the data is matched against the runnable abstract syntax trees concurrently. For example, when 1000 filtering rules are assigned to the device 1, then for each piece of structured data to be filtered the matching operations for the 1000 filtering rules are executed concurrently in the thread pool of the device 1, so that the filtering rules are evaluated concurrently and the performance of the multi-core CPU is fully utilized.
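By way of illustration only, concurrent matching in a thread pool might be sketched as follows. This is a minimal sketch under stated assumptions: each compiled rule is represented by a `java.util.function.Predicate` standing in for a runnable abstract syntax tree, and the class `ParallelMatchSketch` and its `match` method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.function.Predicate;

/** Hypothetical sketch: evaluate all rules of one category concurrently for one data body. */
class ParallelMatchSketch {
    /** Returns the names of the rules that the data body satisfies, in rule order. */
    static List<String> match(Map<String, Predicate<Map<String, Object>>> compiledRules,
                              Map<String, Object> dataBody,
                              ExecutorService pool) {
        List<String> names = new ArrayList<>(compiledRules.keySet());
        List<Callable<Boolean>> tasks = new ArrayList<>();
        for (String name : names) {
            Predicate<Map<String, Object>> rule = compiledRules.get(name);
            tasks.add(() -> rule.test(dataBody)); // one task per filtering rule
        }
        List<String> matched = new ArrayList<>();
        try {
            // invokeAll returns the futures in the same order as the submitted tasks.
            List<Future<Boolean>> results = pool.invokeAll(tasks);
            for (int i = 0; i < names.size(); i++) {
                if (results.get(i).get()) {
                    matched.add(names.get(i));
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while matching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a rule failed to evaluate", e);
        }
        return matched;
    }
}
```

A fixed-size pool, for instance one created with `Executors.newFixedThreadPool`, would typically be shared across all incoming data so that threads are not created per message; the pool size is a tuning choice that the present text does not specify.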
Specifically, the unit for traversing the plurality of runnable abstract syntax trees with the data body of the structured data to be filtered as an input parameter, and performing parallel matching calculation using the plurality of runnable abstract syntax trees, includes: means for replacing variables of the runnable abstract syntax tree with parameters in the data body; and means for performing matching calculation on the runnable abstract syntax tree using the run stack.
Example code that replaces variables of the runnable abstract syntax tree with parameters in the data body is as follows:
Figure BDA0000758473060000151
Matching calculation is performed on the runnable abstract syntax tree by using the run stack, in which each node of the AST is processed and pushed onto the run stack; example code is as follows:
Figure BDA0000758473060000152
Example code for performing the corresponding operations on operator nodes in the AST is as follows:
Figure BDA0000758473060000153
Figure BDA0000758473060000161
Example code for performing the matching calculation is as follows:
Figure BDA0000758473060000162
Figure BDA0000758473060000171
Of course, those skilled in the art should understand that the above example code is only an example; other forms of methods, code and the like, whether existing now or appearing in the future, that are applicable to the present application also fall within its scope of protection and are included herein by reference.
Thereafter, the device 1 may further process the structured data to be filtered, for example by raising an alarm.
Fig. 2 shows a schematic diagram of an apparatus for filtering data according to a preferred embodiment of the present application, where the apparatus 1 includes: a first means 11', a second means 12', a third means 13', a fourth means 14' and a fifth means 15'.
The contents of the first means 11', the third means 13' and the fourth means 14' are the same as or substantially the same as the contents of the first means 11, the third means 13 and the fourth means 14 of the apparatus 1 shown in fig. 1; for the sake of brevity, they are not repeated and are included herein by reference.
Preferably, the second device 12' includes the content of the second device 12 shown in fig. 1, and further includes a unit that establishes a second rule list of the filtering rules indexed by the rule name of each filtering rule. The second device 12' thus establishes a first rule list and a second rule list from a two-dimensional index over the rule field identifier and the rule name of the filtering rules: the first rule list, indexed by rule field identifier, is searched when filtering data, and the second rule list, indexed by rule name, is searched when managing and maintaining the filtering rules. When structured data to be filtered is obtained, the first rule list is searched by matching the data field identifier to find the list of corresponding filtering rules; that list is then traversed with the data body of the structured data to be filtered as the input parameter, and matching calculation is performed concurrently on each rule in the list. The second rule list facilitates management of the filtering rules.
The fifth means 15' is used to add filtering rules, delete filtering rules or modify and compile existing filtering rules.
In particular, the fifth means 15' comprises: means for adding a newly added filtering rule to the second rule list; means for deleting the corresponding filtering rule from the second rule list; and means for looking up a filtering rule in the second rule list and modifying and compiling the rule found. The fifth device 15' can thus modify, add and delete filtering rules, which improves the flexibility of the filtering rules.
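By way of illustration only, the two rule lists and the add, delete and modify operations might be sketched as follows. This is a minimal sketch: the class `RuleRegistrySketch`, its `Rule` type and all method names are hypothetical, rule compilation is omitted, and the choice of concurrent collections is an assumption rather than something stated in the present application.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical registry holding the same rules under two indexes. */
class RuleRegistrySketch {
    static final class Rule {
        final String fieldId;     // rule field identifier (category)
        final String name;        // globally unique rule name
        final String expression;  // rule operational expression, uncompiled here
        Rule(String fieldId, String name, String expression) {
            this.fieldId = fieldId;
            this.name = name;
            this.expression = expression;
        }
    }

    // First rule list: looked up by field identifier while filtering data.
    private final Map<String, List<Rule>> byFieldId = new ConcurrentHashMap<>();
    // Second rule list: looked up by rule name while managing rules.
    private final Map<String, Rule> byName = new ConcurrentHashMap<>();

    synchronized void add(Rule rule) {
        if (byName.containsKey(rule.name)) {
            throw new IllegalArgumentException("duplicate rule name: " + rule.name);
        }
        byName.put(rule.name, rule);
        byFieldId.computeIfAbsent(rule.fieldId, k -> new CopyOnWriteArrayList<>()).add(rule);
    }

    synchronized boolean delete(String name) {
        Rule old = byName.remove(name);
        if (old == null) {
            return false;
        }
        byFieldId.get(old.fieldId).remove(old); // keep both indexes consistent
        return true;
    }

    synchronized boolean modify(String name, String newExpression) {
        Rule old = byName.get(name);
        if (old == null) {
            return false;
        }
        delete(name);
        add(new Rule(old.fieldId, name, newExpression));
        return true;
    }

    Rule findByName(String name) {
        return byName.get(name);
    }

    /** Filtering path: all rules whose field identifier equals the data field identifier. */
    List<Rule> rulesFor(String dataFieldId) {
        List<Rule> rules = byFieldId.get(dataFieldId);
        return rules == null ? Collections.<Rule>emptyList() : rules;
    }
}
```

Keeping both indexes behind one class means that every management operation updates them together, so a lookup by field identifier during filtering does not return a rule that has already been deleted by name.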
Fig. 3 shows a schematic diagram of an apparatus for filtering data according to another preferred embodiment of the present application, wherein the apparatus 1 includes a first device 11", a second device 12", a third device 13", a fourth device 14", a fifth device 15" and a sixth device 16".
The first device 11", the second device 12", the third device 13", the fourth device 14" and the fifth device 15" are the same as or substantially the same as the first device 11', the second device 12', the third device 13', the fourth device 14' and the fifth device 15' of the apparatus 1 shown in fig. 2; for the sake of brevity, their descriptions are not repeated and are included herein by reference.
Here, each of the filtering rules further includes information on the notifier to which the filtering rule is bound. The sixth means 16" is configured to send the structured data to be filtered that satisfies a filtering rule to the notifier bound to that filtering rule for transmission. Here, a notifier is one of a group of implementations of a reserved notification interface and can implement a customized notification manner, for example using different transmission protocols, different compression algorithms and different serialization algorithms to transmit to different systems in the downstream system cluster. Notifiers can be freely assembled and bound to any filtering rule when the filtering rule is created.
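By way of illustration only, a reserved notification interface and its binding to filtering rules might be sketched as follows. This is a minimal sketch: the interface `Notifier`, its `send` method, the `RecordingNotifier` implementation and the `bind` and `dispatch` methods are all hypothetical, and a real implementation would transmit the data rather than record it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of notifiers bound to filtering rules by rule name. */
class NotifierSketch {
    /** Hypothetical reserved notification interface; each implementation picks its own transport. */
    interface Notifier {
        void send(String ruleName, Map<String, Object> dataBody);
    }

    /** Example implementation that only records what it was asked to deliver. */
    static final class RecordingNotifier implements Notifier {
        final List<String> delivered = new ArrayList<>();
        @Override
        public void send(String ruleName, Map<String, Object> dataBody) {
            delivered.add(ruleName + " <- " + dataBody);
        }
    }

    private final Map<String, List<Notifier>> bindings = new HashMap<>();

    /** Binds a notifier to a rule, typically when the rule is created. */
    void bind(String ruleName, Notifier notifier) {
        bindings.computeIfAbsent(ruleName, k -> new ArrayList<>()).add(notifier);
    }

    /** Forwards data that satisfied the named rule to every bound notifier; returns how many. */
    int dispatch(String ruleName, Map<String, Object> dataBody) {
        List<Notifier> bound = bindings.getOrDefault(ruleName, Collections.<Notifier>emptyList());
        for (Notifier notifier : bound) {
            notifier.send(ruleName, dataBody);
        }
        return bound.size();
    }
}
```

Because the matching code depends only on the interface, a notifier using a different protocol, compression or serialization scheme can be added without changing the filtering path.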
Fig. 4 illustrates a flow chart of a method for filtering data provided in accordance with an aspect of the present application, wherein the method includes: step S11, step S12, step S13, and step S14.
Specifically, the step S11 includes: acquiring initial data to be filtered, and converting the initial data to be filtered into structured data to be filtered, wherein the structured data to be filtered comprises a data field identifier and a data body in a key-value pair format; the step S12 includes: loading filtering rules, wherein each filtering rule comprises a rule field identifier, a rule name and a rule operation expression, and establishing a first rule list of the filtering rules by taking the field identifier of the filtering rule as an index; the step S13 includes: acquiring the structured data to be filtered, and acquiring a plurality of filtering rules with rule field identifications corresponding to the data field identifications from the first rule list according to the data field identifications; the step S14 includes: and performing parallel matching operation on the structured data to be filtered by using the acquired plurality of filtering rules.
Further, in step S11, initial data to be filtered is obtained and converted into structured data to be filtered, wherein the structured data to be filtered comprises a data field identifier and a data body in a key-value pair format.
The data field identifier indicates the category of the structured data to be filtered, where the category is, for example and without limitation, the CPU occupancy of a host or the access delay of a certain website. The data field identifier may be expressed in numbers, characters or the like; any identifier that a computer can recognize may serve as an embodiment of the data field identifier and is included herein by reference. The data body records the detailed information of the structured data to be filtered in key-value pair format (Key-Value format), for example (by way of example only, and not limited thereto): instanceId=123456, clusterId=Hangzhou, Value=92, bizTime=1427041923825, unit=Percent, where the left side of each equal sign is a key (Key), the right side is a value (Value), and the two sides together form one key-value pair of the data body. The data body may include one or more key-value pairs, and the number of key-value pairs is not limited.
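By way of illustration only, the conversion of initial data into structured data might be sketched as follows. This is a minimal sketch that assumes the initial data arrives as a comma-separated list of key=value pairs; that wire format, the class `StructuredDataSketch` and the `parse` method are hypothetical and are not prescribed by the present application.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical structured data: a data field identifier plus a key-value data body. */
class StructuredDataSketch {
    final String fieldId;
    final Map<String, String> body;

    StructuredDataSketch(String fieldId, Map<String, String> body) {
        this.fieldId = fieldId;
        this.body = body;
    }

    /** Parses text such as "instanceId=123456, unit=Percent" into a data body. */
    static StructuredDataSketch parse(String fieldId, String raw) {
        Map<String, String> body = new LinkedHashMap<>();
        for (String part : raw.split(",")) {
            String pair = part.trim();
            if (pair.isEmpty()) {
                continue; // tolerate trailing commas
            }
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("not a key=value pair: " + pair);
            }
            // Left of the equal sign is the key, right of it is the value.
            body.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
        }
        return new StructuredDataSketch(fieldId, body);
    }
}
```

In this sketch every value is kept as a string; converting values such as 92 into numbers before matching would be a further step that depends on the rule operators in use.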
Preferably, the initial data to be filtered is obtained from mass data, and step S11 further includes obtaining the initial data to be filtered from a distributed message middleware. Preferably, the distributed message middleware is MetaQ, a message middleware with a distributed queue model that has the following characteristics: it guarantees strict message ordering, and it provides rich message pull modes, efficient horizontal scaling of subscribers, a real-time message subscription mechanism and the capacity to accumulate messages on the order of hundreds of millions. Fig. 7 shows a schematic diagram of a system of devices for filtering data according to an embodiment of the present application, in which, using the clustered data sharding of MetaQ, a plurality of devices 1 form a cluster of peer nodes with identical functions. This gives the cluster load-balancing capability and meets the requirements of scalability, high availability and performance in a mass data context.
Preferably, the step S11 further includes: sending the structured data to be filtered to a blocking queue; accordingly, the step S13 includes: and acquiring the structured data to be filtered from the blocking queue.
Here, the blocking queue blocks further enqueue operations when the queue is full, until the queue is no longer full. Specifically, step S11 sends the structured data to be filtered to the blocking queue, where it waits; step S13 obtains the structured data to be filtered from the blocking queue in the order in which it was queued, and the data is deleted from the blocking queue once it has been obtained. When the blocking queue is full of waiting structured data, it blocks step S11 from sending further structured data to be filtered into the queue. This prevents excessive memory occupation when processing capacity is insufficient, so the queue performs peak clipping and valley filling during the filtering of mass data and avoids processing faults.
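By way of illustration only, the bounded hand-off between step S11 and step S13 might be sketched with `java.util.concurrent.ArrayBlockingQueue` as follows. This is a minimal sketch: the class `BoundedHandoffSketch` and its method names are hypothetical, and the producer uses a timed `offer` so that the full-queue behaviour can be observed, whereas an implementation that blocks indefinitely, as described above, would call `put` instead.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical bounded queue between the conversion step and the matching step. */
class BoundedHandoffSketch {
    private final BlockingQueue<String> queue;

    BoundedHandoffSketch(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Producer side (step S11): waits up to timeoutMillis for free space; false if still full. */
    boolean enqueue(String structuredData, long timeoutMillis) {
        try {
            return queue.offer(structuredData, timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Consumer side (step S13): blocks until data is available, then removes and returns it. */
    String dequeue() {
        try {
            return queue.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for data", e);
        }
    }
}
```

The fixed capacity is what bounds memory use: when consumers fall behind, producers wait instead of accumulating unprocessed data.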
Further, in step S12, filtering rules are loaded, wherein each filtering rule comprises a rule field identifier, a rule name and a rule operational expression, and a first rule list of the filtering rules is established with the rule field identifier of each filtering rule as the index.
Here, the rule field identifier indicates the category of the filtering rule, where the category is, for example and without limitation, the CPU occupancy of a host or the access delay of a certain website. The rule field identifier may be expressed in numbers, characters or the like; any identifier that a computer can recognize may serve as an embodiment of the rule field identifier and is included herein by reference. Preferably, the content of the rule field identifier is the same as or substantially the same as the content of the data field identifier, so that step S13 can obtain from the first rule list, according to the data field identifier, a plurality of filtering rules whose rule field identifiers correspond to it. The rule name may be a globally unique rule name, which facilitates management and maintenance of the filtering rules. The rule operational expression may be an expression composed of numbers and character strings, for example (by way of example only, and not limited thereto): instanceId == 'AY123456' || clusterId && value > 80. The rule operational expression may further include data set types that are not native number or character string types, for example (by way of example only, and not limited thereto) arrays, hash sets and the like.
Further, the step S12 includes: establishing a first rule list of the filtering rules indexed by the domain identifier of the filtering rule, wherein the first rule list is used for providing support for acquiring the filtering rule in the step S13.
Further, in step S13, the structured data to be filtered is obtained, and a plurality of filtering rules whose rule field identifiers correspond to the data field identifier are obtained from the first rule list according to the data field identifier.
Further, in the step S14, a parallel matching operation is performed on the structured data to be filtered by using the obtained filtering rules.
Preferably, for each piece of structured data to be filtered, step S13 obtains, according to its data field identifier, the plurality of filtering rules having the same rule field identifier. Step S14 then matches the structured data against each obtained filtering rule, performing the matching operations for the obtained filtering rules in parallel so as to fully utilize the multi-core central processing unit and improve filtering efficiency.
Specifically, the step S14 includes: performing rule compiling on the acquired filtering rules to establish an executable abstract syntax tree; and traversing a plurality of runnable abstract syntax trees by taking the data body of the structured data to be filtered as an input parameter, and performing parallel matching calculation by using the plurality of runnable abstract syntax trees.
The step S14 realizes the function of abstract syntax tree, and can support arithmetic operation, string operation, relational operation, logical operation, regular expression operation, set operation, and the like, and reserves an extension interface, and can support user-defined operation, and the like.
Further, rule compilation is performed on the obtained filtering rules to establish a runnable abstract syntax tree (Abstract Syntax Tree, AST), where the AST is a tree-like representation of the abstract syntactic structure of the rule expression.
Rule compilation of the obtained filtering rules to establish the runnable abstract syntax tree comprises parsing the rule expression of each obtained filtering rule and converting it into an abstract syntax tree. Specifically, this can be implemented with Antlr (Another Tool for Language Recognition), so that a user-defined filtering rule expression can be converted into an abstract syntax tree. The token stream of the AST is obtained by lexical analysis of the rule expression; the tokens are the elements recognized when parsing the rule string, including but not limited to operators, numbers, strings, variables and regular expressions.
The content of the example code of the operational operator is the same as or substantially the same as the content of the example code of the operational operator of the abstract syntax tree converted by the fourth device 14 of the apparatus 1 shown in fig. 1, and for the sake of brevity, the description is omitted, and the example code is included herein only by way of reference.
In a specific application scenario, for example, the rule expression of the filtering rule is the following character string:
CPU > 90/100 and clusterId in ['hz','qd'] and instanceId like 'AK47\w+'
By writing Antlr lexical analysis rules, the AST token stream shown in fig. 9 is obtained. Within the system, the operational expression is stored in suffix (postfix) notation, which resolves the operator precedence problem. As shown in fig. 8, the stored element types are: OP (operator), Num (number), Var (variable), Regex (regular expression) and Strarray (string array).
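By way of illustration only, the following sketch shows why a suffix representation removes the precedence problem. It converts an already tokenized infix expression to postfix with the classic shunting-yard algorithm, which is swapped in here for brevity and is not the Antlr-based parsing described above; the class `PostfixSketch`, the operator set and the precedence levels are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical infix-to-postfix conversion for left-associative binary operators. */
class PostfixSketch {
    private static final Map<String, Integer> PRECEDENCE = new HashMap<>();
    static {
        PRECEDENCE.put("or", 1);
        PRECEDENCE.put("and", 2);
        for (String op : new String[] {">", "<", ">=", "<=", "==", "in", "like"}) {
            PRECEDENCE.put(op, 3);
        }
        PRECEDENCE.put("+", 4);
        PRECEDENCE.put("-", 4);
        PRECEDENCE.put("*", 5);
        PRECEDENCE.put("/", 5);
    }

    static List<String> toPostfix(List<String> infix) {
        List<String> out = new ArrayList<>();
        Deque<String> ops = new ArrayDeque<>();
        for (String t : infix) {
            if (PRECEDENCE.containsKey(t)) {
                // Emit operators that bind at least as tightly before pushing this one.
                while (!ops.isEmpty() && PRECEDENCE.containsKey(ops.peek())
                        && PRECEDENCE.get(ops.peek()) >= PRECEDENCE.get(t)) {
                    out.add(ops.pop());
                }
                ops.push(t);
            } else if (t.equals("(")) {
                ops.push(t);
            } else if (t.equals(")")) {
                while (!ops.isEmpty() && !ops.peek().equals("(")) {
                    out.add(ops.pop());
                }
                if (ops.isEmpty()) {
                    throw new IllegalArgumentException("unbalanced parentheses");
                }
                ops.pop(); // discard the matching "("
            } else {
                out.add(t); // operand: number, variable, set or pattern placeholder
            }
        }
        while (!ops.isEmpty()) {
            String op = ops.pop();
            if (op.equals("(")) {
                throw new IllegalArgumentException("unbalanced parentheses");
            }
            out.add(op);
        }
        return out;
    }
}
```

Once an expression is in postfix order, an evaluator no longer needs precedence rules or parentheses: it simply pushes operands and applies each operator to the top two stack entries.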
The abstract syntax tree is then pre-computed to obtain the runnable abstract syntax tree. Pre-computation evaluates the constant expressions in the AST token stream, determining for each sub-expression whether it can be calculated ahead of time. It also checks whether each element in the abstract syntax tree is of a special type and converts every such element into a programming-language data structure element, for example (but not limited to) translating the parameter of a Like operation into a regular expression and the parameter of an In operation into a set. Pre-evaluating the constant expressions in the AST accelerates runtime processing. Here, elements of special types are elements whose type is not one of the native number and character string types, such as (but not limited to) data set types like arrays, hash maps and hash sets.
In the concrete scenario, the abstract syntax tree shown in fig. 9 is pre-computed once, and the result is a runnable abstract syntax tree (AST) whose token stream is shown in fig. 10, in which "0.9", "java.util.HashSet['hz','qd']" and "java.util.regex.Pattern 'AK47\w+'" are the pre-computed results.
The exemplary code for performing the pre-calculation may be the same as or substantially the same as the content of the exemplary code for performing the pre-calculation by the fourth apparatus 14 shown in fig. 1, and for the sake of brevity, the description is omitted, and only the content is included herein by way of reference.
Specifically, pre-computing the abstract syntax tree comprises:
creating a run stack according to the abstract syntax tree, and pushing the elements of the abstract syntax tree onto the run stack; when an element is an operator, popping the two operands corresponding to the operator off the run stack and calculating a result; and when an element is a special element, converting it into a programming-language data structure element and pushing that element onto the run stack.
Further, the data body of the structured data to be filtered is taken as an input parameter, the plurality of runnable abstract syntax trees are traversed, and parallel matching calculation is performed using them. The matching calculation on a runnable abstract syntax tree proceeds in the same way as the one-time pre-calculation. During normal operation every expression in the AST is calculable, so the final result is a definite Boolean value, FALSE or TRUE; if the result is TRUE, the structured data is judged to satisfy the filtering rule.
Here, parallel matching calculation means that the data is matched against the runnable abstract syntax trees concurrently. For example, when 1000 filtering rules are assigned to the device 1, then for each piece of structured data to be filtered the matching operations for the 1000 filtering rules are executed concurrently in the thread pool of the device 1, so that the filtering rules are evaluated concurrently and the performance of the multi-core CPU is fully utilized.
Specifically, the parallel matching calculation using a plurality of runnable abstract syntax trees comprises: replacing variables of the runnable abstract syntax tree with parameters in the data body; and performing matching calculation on the runnable abstract syntax tree using the run stack.
The example code for replacing variables of the runnable abstract syntax tree with parameters in the data body is the same as or substantially the same as the corresponding example code of the fourth device 14 of the apparatus 1 in fig. 1; for the sake of brevity, it is not repeated and is included herein by reference.
The content of the example code for performing corresponding operations on the operator node in the AST is the same as or substantially the same as the content of the example code for performing corresponding operations on the fourth device 14 of the apparatus 1 in fig. 1, and for the sake of brevity, details are not repeated again, and are included herein only by way of reference.
Likewise, the content of the example code for performing the matching calculation is the same as or substantially the same as that of the example code for performing the matching calculation by the fourth device 14 of the apparatus 1 in fig. 1, and for the sake of brevity, the description is omitted, and the example code is included by way of reference.
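By way of illustration only, variable replacement and stack-based matching might be sketched together as follows. This is a minimal sketch and not the example code referred to above: the class `MatchSketch`, the `Var` and `Op` element types, the supported operator set, the full-match semantics chosen for `like` and the policy that a missing field never matches are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical evaluator for a pre-computed postfix rule against one data body. */
class MatchSketch {
    static final class Var {
        final String name;
        Var(String name) { this.name = name; }
    }

    static final class Op {
        final String symbol;
        Op(String symbol) { this.symbol = symbol; }
    }

    static boolean matches(List<Object> runnablePostfix, Map<String, Object> dataBody) {
        Deque<Object> stack = new ArrayDeque<>();
        for (Object element : runnablePostfix) {
            if (element instanceof Var) {
                // Replace the variable with the parameter of the same name in the data body.
                Object value = dataBody.get(((Var) element).name);
                if (value == null) {
                    return false; // assumed policy: a missing field never matches
                }
                stack.push(value);
            } else if (element instanceof Op) {
                Object right = stack.pop();
                Object left = stack.pop();
                stack.push(apply(((Op) element).symbol, left, right));
            } else {
                stack.push(element); // constant, set or compiled pattern
            }
        }
        return (Boolean) stack.pop(); // a well-formed rule leaves exactly one Boolean
    }

    private static Object apply(String op, Object left, Object right) {
        switch (op) {
            case ">":    return ((Number) left).doubleValue() > ((Number) right).doubleValue();
            case "<":    return ((Number) left).doubleValue() < ((Number) right).doubleValue();
            case "==":   return left.equals(right);
            case "in":   return ((Set<?>) right).contains(left);
            case "like": return ((Pattern) right).matcher((String) left).matches();
            case "and":  return (Boolean) left && (Boolean) right;
            case "or":   return (Boolean) left || (Boolean) right;
            default:     throw new IllegalArgumentException("unsupported operator: " + op);
        }
    }
}
```

Because sets and patterns are already built during pre-computation, the per-message work in this sketch is limited to map lookups, comparisons and stack operations.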
Thereafter, the method can further process the structured data to be filtered, such as alarming and the like.
Fig. 5 is a schematic flow chart illustrating a method for filtering data according to a preferred embodiment of the present application, the method including: step S11', step S12', step S13', step S14' and step S15'.
The contents of step S11', step S13' and step S14' are the same as or substantially the same as the contents of step S11, step S13 and step S14 shown in fig. 4; for the sake of brevity, they are not repeated and are included herein by reference.
Preferably, step S12' includes the content of step S12 shown in fig. 4, and further includes establishing a second rule list of the filtering rules indexed by the rule name of each filtering rule. Step S12' thus establishes a first rule list and a second rule list from a two-dimensional index over the rule field identifier and the rule name of the filtering rules: the first rule list, indexed by rule field identifier, is searched when filtering data, and the second rule list, indexed by rule name, is searched when managing and maintaining the filtering rules. When structured data to be filtered is obtained, the first rule list is searched by matching the data field identifier to find the list of corresponding filtering rules; that list is then traversed with the data body of the structured data to be filtered as the input parameter, and matching calculation is performed concurrently on each rule in the list. The second rule list facilitates management of the filtering rules.
In the step S15', a new filtering rule is added, a filtering rule is deleted, or an existing filtering rule is modified and compiled.
Specifically, step S15' includes at least any one of: adding the newly added filtering rule to the second rule list; deleting the corresponding filtering rule from the second rule list; and looking up a filtering rule in the second rule list and modifying and compiling the rule found. Step S15' can thus modify, add and delete filtering rules, which improves the flexibility of the filtering rules.
FIG. 6 is a flowchart illustrating a method for filtering data according to another preferred embodiment of the present application, wherein the method includes steps S11", S12", S13", S14", S15" and S16".
The steps S11", S12", S13", S14" and S15" are the same as or substantially the same as the steps S11', S12', S13', S14' and S15' shown in fig. 5; for the sake of brevity, they are not repeated and are included herein by reference.
Here, each of the filtering rules further includes information on the notifier to which the filtering rule is bound. In step S16", the structured data to be filtered that satisfies a filtering rule is sent to the notifier bound to that filtering rule for transmission. Here, a notifier is one of a group of implementations of a reserved notification interface and can implement a customized notification manner, for example using different transmission protocols, different compression algorithms and different serialization algorithms to transmit to different systems in the downstream system cluster. Notifiers can be freely assembled and bound to any filtering rule when the filtering rule is created.
Compared with the prior art, the device and method for filtering data provided by the embodiments of the present application adopt a streaming mode of operation in which data is neither cached nor persisted in memory: each time initial data to be filtered is obtained, it is converted into structured data to be filtered, matching calculation is performed in real time with the corresponding filtering rules, and the filtering result is obtained immediately, which solves the real-time problem of filtering massive streaming data.
Furthermore, the device and method for filtering data provided by the embodiments of the present application support arithmetic operations, character string operations, relational operations, logical operations, regular expression operations and set operations, and reserve an extension interface. The filtering rules take the form of simple operational expressions with variables, which solves the problems of filtering rules being complex to describe, hard to extend and hard to manage.
In addition, the device and method for filtering data provided by the embodiments of the present application are designed and developed independently, are relatively low in cost, and allow any code path to be monitored and optimized.
Across multiple performance tests, the measured performance was approximately as follows: a single virtual machine with 4 cores and 8 GB of memory can support 500,000 filtering rules; the TPS for processing streaming data reaches 20,000 and the TPS for filtered valid data reaches 2,000; the average system load stabilizes at about load1-4; and CPU resources are effectively utilized.
In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory. The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium. Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
In addition, some of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Program instructions which invoke the methods of the present application may be stored on a fixed or removable recording medium and/or transmitted via a data stream on a broadcast or other signal-bearing medium and/or stored within a working memory of a computer device operating in accordance with the program instructions. An embodiment according to the present application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform a method and/or a solution according to the aforementioned embodiments of the present application.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (18)

1. A method for filtering data, wherein the method comprises:
acquiring initial data to be filtered, and converting the initial data to be filtered into structured data to be filtered, wherein the structured data to be filtered comprises a data field identifier and a data body in a key-value pair format; the data field identifier indicates the category of the structured data to be filtered; and the data body records the detailed information of the structured data to be filtered as key-value pairs;
loading filtering rules, wherein each filtering rule comprises a rule field identifier, a rule name and a rule operational expression, and establishing a first rule list of the filtering rules indexed by the rule field identifier of each filtering rule; the rule field identifier indicates the category of the filtering rule;
acquiring the structured data to be filtered, and acquiring, from the first rule list according to the data field identifier, a plurality of filtering rules whose rule field identifiers correspond to the data field identifier;
and performing a parallel matching operation on the structured data to be filtered by using the acquired plurality of filtering rules.
2. The method of claim 1, wherein acquiring the initial data to be filtered comprises:
acquiring the initial data to be filtered from distributed message middleware.
3. The method of claim 1, wherein converting the initial data to be filtered into structured data to be filtered further comprises:
sending the structured data to be filtered to a blocking queue;
and wherein acquiring the structured data to be filtered comprises:
acquiring the structured data to be filtered from the blocking queue.
4. The method of claim 1, wherein performing the parallel matching operation on the structured data to be filtered by using the acquired plurality of filtering rules comprises:
performing rule compiling on each acquired filtering rule to establish a runnable abstract syntax tree;
and traversing the plurality of runnable abstract syntax trees with the data body of the structured data to be filtered as an input parameter, and performing parallel matching calculation by using the plurality of runnable abstract syntax trees.
5. The method of claim 4, wherein performing rule compiling on each acquired filtering rule to establish a runnable abstract syntax tree comprises:
parsing the rule operational expression of the acquired filtering rule to convert it into an abstract syntax tree;
pre-computing the abstract syntax tree to obtain the runnable abstract syntax tree;
wherein pre-computing the abstract syntax tree once comprises:
creating a running stack according to the abstract syntax tree, and pushing elements of the abstract syntax tree onto the running stack;
when the element is an operator, popping the two operands corresponding to the operator from the running stack and calculating a result;
and when the element is a special element, converting the special element into a programming-language data structure element and then pushing that element onto the running stack.
6. The method of claim 5, wherein performing parallel matching calculation by using the plurality of runnable abstract syntax trees comprises:
replacing variables of each runnable abstract syntax tree with parameters in the data body;
and performing matching calculation on the runnable abstract syntax tree by using the running stack.
7. The method of any of claims 1-6, wherein the method further comprises:
adding a filtering rule, deleting a filtering rule, or modifying and compiling an existing filtering rule.
8. The method of claim 7, wherein establishing the first rule list of the filtering rules indexed by the rule field identifier of each filtering rule further comprises:
establishing a second rule list of the filtering rules indexed by the rule name of each filtering rule;
and wherein adding a filtering rule, deleting a filtering rule, or modifying and compiling an existing filtering rule comprises at least one of the following:
adding the newly added filtering rule to the second rule list;
deleting the corresponding filtering rule from the second rule list;
and looking up the filtering rule in the second rule list, and modifying and compiling the filtering rule found.
9. The method of claim 1, wherein each filtering rule further comprises information about a notifier to which the filtering rule is bound;
and the method further comprises:
sending the structured data to be filtered that satisfies a filtering rule to the notifier bound to that filtering rule for transmission.
10. An apparatus for filtering data, wherein the apparatus comprises:
a first device, configured to acquire initial data to be filtered and convert the initial data to be filtered into structured data to be filtered, wherein the structured data to be filtered comprises a data field identifier and a data body in a key-value pair format; the data field identifier indicates the category of the structured data to be filtered; and the data body records the detailed information of the structured data to be filtered as key-value pairs;
a second device, configured to load filtering rules, wherein each filtering rule comprises a rule field identifier, a rule name and a rule operational expression, and to establish a first rule list of the filtering rules indexed by the rule field identifier of each filtering rule; the rule field identifier indicates the category of the filtering rule;
a third device, configured to acquire the structured data to be filtered and to acquire, from the first rule list according to the data field identifier, a plurality of filtering rules whose rule field identifiers correspond to the data field identifier;
and a fourth device, configured to perform a parallel matching operation on the structured data to be filtered by using the acquired plurality of filtering rules.
11. The apparatus of claim 10, wherein the first device comprises:
a unit for acquiring the initial data to be filtered from distributed message middleware.
12. The apparatus of claim 10, wherein the first device comprises:
a unit for sending the structured data to be filtered to a blocking queue;
and the third device comprises:
a unit for acquiring the structured data to be filtered from the blocking queue.
13. The apparatus of claim 10, wherein the fourth device comprises:
a unit for performing rule compiling on each acquired filtering rule to establish a runnable abstract syntax tree;
and a unit for traversing the plurality of runnable abstract syntax trees with the data body of the structured data to be filtered as an input parameter, and performing parallel matching calculation by using the plurality of runnable abstract syntax trees.
14. The apparatus of claim 13, wherein the unit for performing rule compiling on each acquired filtering rule to establish a runnable abstract syntax tree comprises:
a module for parsing the rule operational expression of the acquired filtering rule to convert it into an abstract syntax tree;
a module for pre-computing the abstract syntax tree to obtain the runnable abstract syntax tree, wherein the module is configured to:
create a running stack according to the abstract syntax tree, and push elements of the abstract syntax tree onto the running stack;
when the element is an operator, pop the two operands corresponding to the operator from the running stack and calculate a result;
and when the element is a special element, convert the special element into a programming-language data structure element and then push that element onto the running stack.
15. The apparatus of claim 13, wherein the unit for traversing the plurality of runnable abstract syntax trees with the data body of the structured data to be filtered as an input parameter and performing parallel matching calculation by using the plurality of runnable abstract syntax trees comprises:
a module for replacing variables of each runnable abstract syntax tree with parameters in the data body;
and a module for performing matching calculation on the runnable abstract syntax tree by using the running stack.
16. The apparatus of any of claims 10 to 15, wherein the apparatus further comprises:
a fifth device, configured to add a filtering rule, delete a filtering rule, or modify and compile an existing filtering rule.
17. The apparatus of claim 16, wherein the second device further comprises:
a unit for establishing a second rule list of the filtering rules indexed by the rule name of each filtering rule;
and the fifth device comprises:
a unit for adding the newly added filtering rule to the second rule list;
a unit for deleting the corresponding filtering rule from the second rule list;
and a unit for looking up a filtering rule in the second rule list, and modifying and compiling the filtering rule found.
18. The apparatus of claim 10, wherein each filtering rule further comprises information about a notifier to which the filtering rule is bound;
and the apparatus further comprises:
a sixth device, configured to send the structured data to be filtered that satisfies a filtering rule to the notifier bound to that filtering rule for transmission.
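
For illustration only, the flow recited in claims 1 and 4 to 6 (mirrored by claims 10 and 13 to 15) might be sketched in Python as follows. This is not the patented implementation. The rule grammar, the postfix list that stands in for the abstract syntax tree, the thread pool, and every name used here (`to_postfix`, `precompute`, `evaluate`, `load_rules`, `match`, the sample rules and the sample record) are assumptions introduced for the sketch.

```python
import operator
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rule grammar: binary operators mapped to (precedence, function).
OPS = {
    "or": (1, lambda a, b: a or b), "and": (2, lambda a, b: a and b),
    "==": (3, operator.eq), "!=": (3, operator.ne),
    ">": (3, operator.gt), "<": (3, operator.lt),
    "+": (4, operator.add), "-": (4, operator.sub), "*": (5, operator.mul),
}
TOKEN = re.compile(r"\s*(==|!=|[()<>+*-]|'[^']*'|\d+|\w+)")


def to_postfix(expression):
    """Parse an infix rule expression into postfix elements (a flattened syntax tree)."""
    output, pending = [], []
    for tok in TOKEN.findall(expression):
        if tok in OPS:
            while pending and pending[-1] in OPS and OPS[pending[-1]][0] >= OPS[tok][0]:
                output.append(("op", pending.pop()))
            pending.append(tok)
        elif tok == "(":
            pending.append(tok)
        elif tok == ")":
            while pending[-1] != "(":
                output.append(("op", pending.pop()))
            pending.pop()  # discard the matching "("
        elif tok.isdigit():
            output.append(("const", int(tok)))
        elif tok.startswith("'"):
            output.append(("const", tok[1:-1]))
        else:
            output.append(("var", tok))
    while pending:
        output.append(("op", pending.pop()))
    return output


def precompute(postfix):
    """Fold constant sub-expressions once on a run stack; the result is the runnable form."""
    stack = []
    for kind, value in postfix:
        if kind == "op" and len(stack) >= 2 and stack[-1][0] == stack[-2][0] == "const":
            right, left = stack.pop()[1], stack.pop()[1]
            stack.append(("const", OPS[value][1](left, right)))
        else:
            stack.append((kind, value))
    return stack


def evaluate(runnable, body):
    """Replace variables with values from the data body and evaluate on a run stack."""
    stack = []
    for kind, value in runnable:
        if kind == "op":
            right, left = stack.pop(), stack.pop()
            stack.append(OPS[value][1](left, right))
        else:
            # A key missing from the body becomes None; ordered comparisons would then raise.
            stack.append(body.get(value) if kind == "var" else value)
    return bool(stack.pop())


def load_rules(rules):
    """Build the first rule list: rule field identifier -> [(rule name, runnable form)]."""
    by_domain = defaultdict(list)
    for domain, name, expression in rules:
        by_domain[domain].append((name, precompute(to_postfix(expression))))
    return by_domain


def match(record, by_domain, pool):
    """Evaluate every rule of the record's domain in parallel; return matching rule names."""
    candidates = by_domain.get(record["domain"], [])
    hits = pool.map(lambda rule: evaluate(rule[1], record["body"]), candidates)
    return [name for (name, _), hit in zip(candidates, hits) if hit]


# Sample data, invented for the sketch.
RULES = [
    ("trade", "large_order", "amount > 10 * 100 and currency == 'CNY'"),
    ("trade", "refund", "type == 'refund'"),
    ("login", "many_failures", "failures > 2 + 3"),
]
RECORD = {"domain": "trade", "body": {"amount": 1500, "currency": "CNY", "type": "pay"}}

if __name__ == "__main__":
    index = load_rules(RULES)
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(match(RECORD, index, pool))  # ['large_order']
```

In this sketch, constant folding is done once when the rules are loaded, so each incoming record only pays for variable substitution and the remaining stack operations. Only rules filed under the record's own field identifier are evaluated.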
CN201510408180.1A 2015-07-13 2015-07-13 Equipment and method for filtering data Active CN107038161B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201510408180.1A CN107038161B (en) 2015-07-13 2015-07-13 Equipment and method for filtering data
PCT/CN2016/088302 WO2017008650A1 (en) 2015-07-13 2016-07-04 Device and method for filtering data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510408180.1A CN107038161B (en) 2015-07-13 2015-07-13 Equipment and method for filtering data

Publications (2)

Publication Number Publication Date
CN107038161A CN107038161A (en) 2017-08-11
CN107038161B true CN107038161B (en) 2021-03-26

Family

ID=57757755

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510408180.1A Active CN107038161B (en) 2015-07-13 2015-07-13 Equipment and method for filtering data

Country Status (2)

Country Link
CN (1) CN107038161B (en)
WO (1) WO2017008650A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107766466A (en) * 2017-09-29 2018-03-06 上海望友信息科技有限公司 Recognition methods, system, computer-readable recording medium and the equipment of data type
CN109672704B (en) * 2017-10-16 2022-02-25 阿里巴巴集团控股有限公司 Message processing method and device and electronic equipment
CN107766538A (en) * 2017-10-28 2018-03-06 杭州安恒信息技术有限公司 Data filtering processing module and synchronous, asynchronous filter method based on java
CN109189807A (en) * 2018-09-13 2019-01-11 北京奇虎科技有限公司 A kind of filter method and device of alert data
CN110287174A (en) * 2019-05-09 2019-09-27 北京善义善美科技有限公司 A kind of data filtering engine and system and filter method
CN110427754B (en) * 2019-08-12 2024-02-13 腾讯科技(深圳)有限公司 Network application attack detection method, device, equipment and storage medium
CN111427915A (en) * 2020-03-25 2020-07-17 京东数字科技控股有限公司 Information processing method and device, storage medium and electronic equipment
CN112068933B (en) * 2020-09-02 2021-08-10 成都鱼泡科技有限公司 Real-time distributed data monitoring method
CN112565338B (en) * 2020-11-10 2023-06-20 中国人民解放军战略支援部队信息工程大学 Ethernet message capturing, filtering, storing and real-time analyzing method and system
CN116383290B (en) * 2023-03-22 2023-10-31 中国华能集团有限公司北京招标分公司 Data generalization and analysis method

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1953373A (en) * 2006-09-19 2007-04-25 清华大学 A method to filter and verify open real IPv6 source address
CN101127774A (en) * 2007-09-19 2008-02-20 中兴通讯股份有限公司 Priority processing method for initial filtering rule
CN101158948A (en) * 2006-10-08 2008-04-09 中国科学院软件研究所 Text content filtering method and system
CN101282332A (en) * 2008-05-22 2008-10-08 上海交通大学 System for generating assaulting chart facing network safety alarm incident
CN101414929A (en) * 2008-11-18 2009-04-22 华为技术有限公司 Method, device and system for acquiring information
CN101860531A (en) * 2010-04-21 2010-10-13 北京星网锐捷网络技术有限公司 Filtering rule matching method of data packet and device thereof
CN102082728A (en) * 2010-12-28 2011-06-01 北京锐安科技有限公司 Dynamic loading method for filtering rules of network audit system
CN102231134A (en) * 2011-07-29 2011-11-02 哈尔滨工业大学 Method for detecting redundant code defects based on static analysis
CN102654864A (en) * 2011-03-02 2012-09-05 华北计算机系统工程研究所 Independent transparent security audit protection method facing real-time database
CN103116620A (en) * 2013-01-29 2013-05-22 中国电力科学研究院 Unstructured data safe filtering method based on strategy
CN103338155A (en) * 2013-07-01 2013-10-02 安徽中新软件有限公司 High-efficiency filtering method for data packets
CN103780460A (en) * 2014-01-15 2014-05-07 珠海市佳讯实业有限公司 System for realizing hardware filtering of TAP device through FPGA

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101304589A (en) * 2008-04-14 2008-11-12 中国联合通信有限公司 Method and system for monitoring and filtering garbage short message transmitted by short message gateway
CN102467561A (en) * 2010-11-19 2012-05-23 金蝶软件(中国)有限公司 Form data filtering method and device
US8949371B1 (en) * 2011-09-29 2015-02-03 Symantec Corporation Time and space efficient method and system for detecting structured data in free text
CN103034700B (en) * 2012-12-05 2016-06-29 北京奇虎科技有限公司 The processing method of rich text content and system
US9197632B2 (en) * 2013-03-15 2015-11-24 Kaarya Llc System and method for account access
CN103618733B (en) * 2013-12-06 2017-06-27 北京中创腾锐技术有限公司 A kind of data filtering system and method for being applied to mobile Internet
CN103631966B (en) * 2013-12-18 2017-10-10 用友网络科技股份有限公司 A kind of method of configurable parsing multivalue matching field
CN104331278B (en) * 2014-10-15 2017-08-25 南京航空航天大学 A kind of instruction filter method and device for ARINC661 specifications
CN104317947B (en) * 2014-11-07 2017-12-12 南京烽火星空通信发展有限公司 A kind of real-time architecture comparing system based on mass data

Also Published As

Publication number Publication date
WO2017008650A1 (en) 2017-01-19
CN107038161A (en) 2017-08-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant