CN107038161B

CN107038161B - Equipment and method for filtering data

Info

Publication number: CN107038161B
Application number: CN201510408180.1A
Authority: CN
Inventors: 丁崔灿
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2015-07-13
Filing date: 2015-07-13
Publication date: 2021-03-26
Anticipated expiration: 2035-07-13
Also published as: WO2017008650A1; CN107038161A

Abstract

The application provides a device and a method for filtering data. Each time initial data to be filtered is obtained, it is converted into structured data to be filtered and matched in real time against the corresponding filtering rules, so that the filtering result is obtained immediately, which solves the real-time problem. Arithmetic, string, relational, logical, regular expression and set operations are supported, and an extension interface is reserved. Because each filtering rule is a simple operation expression with variables, the problems of complex description, difficult extension and difficult management of filtering rules are also solved.

Description

Equipment and method for filtering data

Technical Field

The application relates to the field of computers, in particular to a technology for filtering data meeting a filtering rule from mass data in real time according to the set filtering rule.

Background

With the explosive growth of information technology, the data volume is increasing day by day, and the requirements of numerous fields on the processing of mass data are increasing continuously.

In the prior art, there are several methods for filtering data satisfying a filtering rule from a mass of data according to a set filtering rule:

filtering valid data with SQL (Structured Query Language) statements on an in-memory relational database; however, this method must cache the mass data in logical tables of the in-memory database, which occupies a large amount of memory, and periodically executing SQL statements can hardly meet the real-time requirement;

using a mass data storage scheme based on HBase (a distributed, column-oriented open-source database) and filtering valid data with Map-Reduce (a programming model for parallel computation over large data sets); however, a Map-Reduce task is an after-the-fact, batch-like computation that can only be run periodically over the mass data stored in HBase, so real-time performance is difficult to guarantee, and complex Map-Reduce tasks must be implemented by writing additional code, which makes it difficult to meet the requirements of a large number of filtering rules that change in real time and involve diverse computations;

filtering valid data with a pattern matching algorithm based on a CEP (Complex Event Processing) engine, which is better suited to monitoring and decision control of enterprise application systems; however, most mature CEP engines are commercial software with a high cost of use, and each CEP engine has its own way of describing pattern rules (for example, Drools uses an XML format and Esper uses an EPL format), so a large amount of adaptation code must be written for different system requirements, and non-standard matching algorithms must be implemented through extensions, which increases the implementation difficulty.

Disclosure of Invention

The technical problem to be solved by the application is how to filter, in real time and without occupying a large amount of memory resources, the data that satisfies the set filtering rules out of mass data, while meeting the requirements of a large number of filtering rules that change in real time and involve diverse computations.

To achieve the above object, the present application provides a method for filtering data, wherein the method comprises:

acquiring initial data to be filtered, and converting the initial data to be filtered into structured data to be filtered, wherein the structured data to be filtered comprises a data field identifier and a data body in a key-value pair format;

loading filtering rules, wherein each filtering rule comprises a rule field identifier, a rule name and a rule operation expression, and establishing a first rule list of the filtering rules indexed by the rule field identifier of each filtering rule;

acquiring the structured data to be filtered, and acquiring a plurality of filtering rules with rule field identifications corresponding to the data field identifications from the first rule list according to the data field identifications;

and performing parallel matching operation on the structured data to be filtered by using the acquired plurality of filtering rules.

Further, the acquiring initial data to be filtered includes:

and acquiring the initial data to be filtered from the distributed message middleware.

Further, converting the initial data to be filtered into structured data to be filtered further comprises:

sending the structured data to be filtered to a blocking queue;

acquiring the structured data to be filtered comprises:

and acquiring the structured data to be filtered from the blocking queue.

Further, the performing parallel matching operation on the structured data to be filtered by using the obtained plurality of filtering rules includes:

performing rule compiling on the acquired filtering rules to establish runnable abstract syntax trees;

and traversing a plurality of runnable abstract syntax trees by taking the data body of the structured data to be filtered as an input parameter, and performing parallel matching calculation by using the plurality of runnable abstract syntax trees.

Further, the rule compiling the obtained filtering rule to establish the runnable abstract syntax tree includes:

analyzing the rule expression of the acquired filtering rule to convert the rule expression into an abstract syntax tree;

pre-computing the abstract syntax tree to obtain the runnable abstract syntax tree;

wherein pre-computing the abstract syntax tree once comprises:

creating a running stack according to the abstract syntax tree, and transmitting elements in the abstract syntax tree into the running stack;

when the element is an operator, transmitting two operands corresponding to the operator out of the running stack, and calculating to obtain a calculation result;

and when the element is a special element, converting the special element into a program language data structure element and then transmitting the program language data structure element into the running stack.

Further, performing parallel matching calculations using a number of the runnable abstract syntax trees comprises:

replacing variables of the runnable abstract syntax tree with parameters in the data body;

and performing matching calculation on the runnable abstract syntax tree by utilizing the running stack.

Further, the method further comprises:

adding filter rules, deleting filter rules or modifying and compiling the existing filter rules.

Further, establishing the first rule list of the filtering rules indexed by the rule field identifier of each filtering rule further includes:

establishing a second rule list of the filtering rules indexed according to rule names of the filtering rules;

the adding, deleting or modifying and compiling of the filter rule comprises at least any one of the following steps:

adding the newly added filtering rule into the second rule list;

deleting the corresponding filtering rule from the second rule list;

and searching the filtering rule from the second rule list, and modifying and compiling the searched filtering rule.

Further, each of the filtering rules further includes: information of the notifier to which the filtering rule is bound;

the method further comprises the following steps:

and sending the structured data to be filtered meeting the corresponding filtering rule to the notifier bound by the filtering rule for transmission.

There is also provided according to another aspect of the present application, an apparatus for filtering data, wherein the apparatus includes:

the first device is used for acquiring initial data to be filtered and converting the initial data to be filtered into structured data to be filtered, wherein the structured data to be filtered comprises a data field identifier and a data body in a key-value pair format;

the second device is used for loading filtering rules, wherein each filtering rule comprises a rule field identifier, a rule name and a rule operation expression, and for establishing a first rule list of the filtering rules indexed by the rule field identifier of each filtering rule;

the third device is used for acquiring the structured data to be filtered and acquiring a plurality of filtering rules with rule field identifications corresponding to the data field identifications from the first rule list according to the data field identifications;

and the fourth device is used for performing parallel matching operation on the structured data to be filtered by utilizing the acquired plurality of filtering rules.

Further, the first apparatus includes:

a unit for acquiring the initial data to be filtered from the distributed message middleware.

Further, the first apparatus includes:

means for sending the structured data to be filtered to a blocking queue;

the third means comprises:

a unit for acquiring the structured data to be filtered from the blocking queue.

Further, the fourth apparatus includes:

means for performing rule compilation on the obtained filter rules to create a runnable abstract syntax tree;

and the unit is used for traversing the plurality of runnable abstract syntax trees by taking the data body of the structured data to be filtered as an input parameter and performing parallel matching calculation by utilizing the plurality of runnable abstract syntax trees.

Further, the means for performing rule compilation on the obtained filter rules to build a runnable abstract syntax tree includes:

a module for analyzing the rule expression of the obtained filtering rule to convert into an abstract syntax tree;

means for pre-computing the abstract syntax tree to obtain the runnable abstract syntax tree, wherein the means is configured to:

creating a run stack from the abstract syntax tree, passing elements of the abstract syntax tree into the run stack,

when the element is an operator, two operands corresponding to the operator are transmitted out of the running stack, calculated to obtain a calculation result,

and when the element is a special element, converting the special element into a program language data structure element and then transmitting the program language data structure element into the running stack.

Further, the unit for traversing the plurality of runnable abstract syntax trees by using the data body of the structured data to be filtered as an input parameter, and performing parallel matching calculation by using the plurality of runnable abstract syntax trees includes:

means for replacing variables of the runnable abstract syntax tree with parameters in the data body;

means for performing a match calculation on the runnable abstract syntax tree using the run stack.

Further, the apparatus further comprises:

and the fifth device is used for adding the filtering rules, deleting the filtering rules or modifying and compiling the existing filtering rules.

Further, the second apparatus further includes:

a unit that creates a second rule list of the filter rule indexed according to a rule name of the filter rule;

the fifth means includes:

means for adding the newly added filter rule to the second rule list;

means for deleting the corresponding filter rule from the second rule list;

means for searching for a filter rule from the second rule list, and performing modification compilation on the searched filter rule.

the apparatus further comprises:

and the sixth device is used for sending the structured data to be filtered meeting the corresponding filtering rule to the notifier bound by the filtering rule for transmission.

Compared with the prior art, the device and the method for filtering data provided by the embodiments of the application adopt a streaming mode of operation in which data is not cached or persisted in memory: each time initial data to be filtered is obtained, it is converted into structured data to be filtered and matched in real time against the corresponding filtering rules, and the filtering result is obtained immediately, which solves the real-time problem of filtering massive streaming data;

furthermore, the device and the method for filtering data provided by the embodiments of the application support arithmetic, string, relational, logical, regular expression and set operations and reserve an extension interface, and each filtering rule is a simple operation expression with variables, which solves the problems of complex description, difficult extension and difficult management of filtering rules;

in addition, the device and the method for filtering data provided by the embodiments of the application are independently designed and developed, have a relatively low cost, and allow any code path to be monitored and optimized.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 illustrates an apparatus schematic diagram of an apparatus for filtering data provided in accordance with an aspect of the present application;

FIG. 2 illustrates an apparatus diagram of an apparatus for filtering data provided in accordance with a preferred embodiment of the present application;

FIG. 3 illustrates an apparatus diagram of an apparatus for filtering data according to another preferred embodiment of the present application;

FIG. 4 illustrates a flow diagram of a method for filtering data provided in accordance with an aspect of the present application;

FIG. 5 illustrates a flow chart of a method for filtering data provided in accordance with a preferred embodiment of the present application;

FIG. 6 illustrates a flow chart of a method for filtering data provided in accordance with another preferred embodiment of the present application;

FIG. 7 illustrates a schematic diagram of a system including the apparatus for filtering data provided in accordance with a preferred embodiment of the present application;

FIGS. 8 to 10 are schematic diagrams illustrating parallel matching operations performed on the structured data to be filtered by using the obtained filtering rules according to a specific scenario of the present application.

The same or similar reference numbers in the drawings identify the same or similar elements.

Detailed Description

The present application is described in further detail below with reference to the attached figures.

Fig. 1 shows a schematic apparatus diagram of an apparatus for filtering data according to an aspect of the present application, where the apparatus 1 includes: a first device 11, a second device 12, a third device 13 and a fourth device 14.

Specifically, the first device 11 is configured to obtain initial data to be filtered and convert it into structured data to be filtered, where the structured data to be filtered includes a data field identifier and a data body in a key-value pair format; the second device 12 is configured to load filtering rules, where each filtering rule includes a rule field identifier, a rule name, and a rule operation expression, and to establish a first rule list of the filtering rules indexed by the rule field identifier of each filtering rule; the third device 13 is configured to obtain the structured data to be filtered and to obtain from the first rule list, according to the data field identifier, a plurality of filtering rules whose rule field identifiers correspond to the data field identifier; the fourth device 14 is configured to perform a parallel matching operation on the structured data to be filtered by using the obtained filtering rules.

Further, the first device 11 is configured to obtain initial data to be filtered and convert it into structured data to be filtered, where the structured data to be filtered includes a data field identifier and a data body in a key-value pair format.

The data field identifier indicates the category of the structured data to be filtered, for example (without limitation) the CPU occupancy of a host or the access latency of a certain website. The data field identifier may be expressed as data, characters, or the like; any identifier that can be recognized by a computer may serve as an embodiment of the data field identifier and is included herein by reference. The data body records the detailed information of the structured data to be filtered in key-value pair (Key-Value) format, for example (by way of example only, and not limited thereto): instanceId=123456, clusterId=Hangzhou, Value=92, bizTime=1427041923825, unit=Percent, where the left side of each equal sign is a key (Key) and the right side is a value (Value). The data body may include one or more key-value pairs, and the number of key-value pairs is not limited.
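The structured data described above can be sketched as follows. This is a minimal illustration only: the application's own implementation appears to be Java-based, Python is used here for brevity, and the names StructuredData and to_structured, the field identifier "cpu_usage", and the comma-separated "key=value" wire format are assumptions that do not come from the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuredData:
    field_id: str  # data field identifier: the category of the record
    body: dict     # data body: key-value pairs


def to_structured(field_id: str, raw: str) -> StructuredData:
    """Parse a raw 'k=v, k=v' message (hypothetical format) into a record."""
    body = {}
    for pair in raw.split(","):
        key, _, value = pair.strip().partition("=")
        # Numeric-looking values become numbers so that rules can compare them.
        try:
            body[key] = float(value) if "." in value else int(value)
        except ValueError:
            body[key] = value
    return StructuredData(field_id, body)


record = to_structured(
    "cpu_usage",
    "instanceId=123456, clusterId=Hangzhou, Value=92, "
    "bizTime=1427041923825, unit=Percent",
)
```

Here record.body maps each key to its value, so a rule variable such as Value can later be looked up by name.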

Preferably, the initial data to be filtered is obtained from mass data, and the first device 11 further includes a unit for acquiring the initial data to be filtered from distributed message middleware. Preferably, the distributed message middleware is MetaQ, a distributed message middleware based on a queue model. MetaQ has the following characteristics: it can guarantee strict message ordering, and it provides rich message pulling modes, efficient horizontal scaling of subscribers, a real-time message subscription mechanism, and the capacity to accumulate messages on the order of hundreds of millions. By using the data sharding of a MetaQ cluster, a plurality of devices 1 can form a cluster of peer nodes with identical functions, which gives the cluster load balancing capability and meets the requirements of scalability, high availability and performance in a mass data setting.

Preferably, the first device 11 may further include: means for sending the structured data to be filtered to a blocking queue; accordingly, the third device 13 comprises means for retrieving the structured data to be filtered from the blocking queue.

Here, when the queue is full, the blocking queue blocks further enqueue operations until the queue is no longer full. Specifically, the first device 11 sends the structured data to be filtered to the blocking queue, where it waits; the third device 13 obtains the structured data to be filtered from the blocking queue in the order in which it was queued, and each item is removed from the blocking queue once it has been obtained. When the blocking queue is full of waiting structured data to be filtered, it blocks the first device 11 from sending further structured data to be filtered, which prevents excessive memory usage when processing capacity is insufficient. The blocking queue thereby smooths peaks and troughs of load ("peak clipping and valley filling") during the filtering of mass data and avoids processing faults.
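The back-pressure behaviour of a blocking queue can be illustrated with Python's standard queue.Queue, whose put() blocks while the queue is full and whose get() blocks while it is empty. This is a sketch under stated assumptions: the capacity of 2, the None sentinel, and the names buffer, consumer and consumed are illustrative choices and are not taken from the application.

```python
import queue
import threading

buffer = queue.Queue(maxsize=2)  # bounded: put() blocks while 2 items wait
consumed = []


def consumer():
    """Stand-in for the third device: take records in arrival order."""
    while True:
        item = buffer.get()  # blocks while the queue is empty
        if item is None:     # sentinel: the producer has finished
            break
        consumed.append(item)


worker = threading.Thread(target=consumer)
worker.start()

# Stand-in for the first device: enqueue structured records.
for n in range(5):
    buffer.put({"Value": n})  # blocks when the queue is full (back-pressure)
buffer.put(None)
worker.join()
```

Because the queue holds at most two records, a fast producer is paused rather than allowed to fill memory, and records leave the queue in the order they entered it.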

Further, the second device 12 is configured to load filtering rules, where each filtering rule includes a rule field identifier, a rule name and a rule operation expression, and to establish a first rule list of the filtering rules indexed by the rule field identifier of each filtering rule.

Here, the rule field identifier indicates the category of the filtering rule, for example (without limitation) the CPU occupancy of a host or the access latency of a certain website. The rule field identifier may be expressed as data, characters, or the like; any identifier that can be recognized by a computer may serve as an embodiment of the rule field identifier and is included herein by reference. Preferably, the content of the rule field identifier is the same as or substantially the same as the content of the data field identifier, so that the third device 13 can obtain from the first rule list, according to the data field identifier, a plurality of filtering rules with the corresponding rule field identifier. The rule name may be a globally unique rule name, which facilitates management and maintenance of the filtering rules. The rule operation expression may be an expression composed of numbers and character strings, for example (by way of example only, and not limited thereto): instanceId == 'AY123456' || clusterId && value > 80. The rule operation expression may further include collection types other than the native number and string types, for example (by way of example only, and not limited thereto): arrays, hash sets, etc.

Further, the second device 12 establishes the first rule list of the filtering rules indexed by the rule field identifier of each filtering rule; the first rule list supports the third device 13 in obtaining filtering rules.

Further, the third device 13 obtains the structured data to be filtered and, according to its data field identifier, obtains from the first rule list a plurality of filtering rules whose rule field identifiers correspond to that data field identifier.
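One way to picture the first rule list is as a mapping from rule field identifier to the list of rules in that category. The sketch below assumes this representation; the class FilterRule, the function build_first_rule_list, and the sample identifiers and expressions are hypothetical and are not taken from the application.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FilterRule:
    field_id: str    # rule field identifier
    name: str        # globally unique rule name
    expression: str  # rule operation expression


def build_first_rule_list(rules):
    """Index the rules by their rule field identifier."""
    index = defaultdict(list)
    for rule in rules:
        index[rule.field_id].append(rule)
    return index


rules = [
    FilterRule("cpu_usage", "cpu_high", "Value > 80"),
    FilterRule("cpu_usage", "cpu_critical", "Value > 95"),
    FilterRule("site_latency", "latency_slow", "Value > 500"),
]
first_rule_list = build_first_rule_list(rules)

# Lookup by the data field identifier of an incoming record.
matching = first_rule_list.get("cpu_usage", [])
```

A record whose data field identifier is "cpu_usage" is then matched only against the two rules of that category, not against every loaded rule.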

Further, the fourth device 14 performs a parallel matching operation on the structured data to be filtered by using the obtained filtering rules.

Preferably, for each structured data to be filtered, the third device 13 obtains a plurality of filtering rules having corresponding same rule field identifiers according to the data field identifiers thereof, the fourth device 14 performs a matching operation on the structured data to be filtered by using each obtained filtering rule, and the fourth device 14 performs a parallel matching operation on the plurality of obtained filtering rules, so as to fully utilize the performance of the multi-core central processing unit and improve the filtering efficiency.

In particular, the fourth means 14 comprise: means for performing rule compilation on the obtained filter rules to create a runnable abstract syntax tree; and the unit is used for traversing the plurality of runnable abstract syntax trees by taking the data body of the structured data to be filtered as an input parameter and performing parallel matching calculation by utilizing the plurality of runnable abstract syntax trees.

The fourth device 14 implements the function of abstract syntax tree, and can support arithmetic operation, character string operation, relational operation, logical operation, regular expression operation, set operation, and the like, and reserves an extension interface, and can support user-defined operation, and the like.

Further, the fourth device 14 performs rule compiling on the obtained filtering rules to establish a runnable Abstract Syntax Tree (AST), which here is a tree representation of the abstract syntactic structure of the rule expression.

Specifically, the unit for performing rule compiling on the acquired filtering rules to establish a runnable abstract syntax tree includes: a module for analyzing the rule expression of the obtained filtering rule to convert into an abstract syntax tree; and means for pre-computing the abstract syntax tree to obtain the runnable abstract syntax tree.

Specifically, analyzing the rule expression of the obtained filtering rule to convert it into an abstract syntax tree can be implemented with ANTLR (ANother Tool for Language Recognition), which can convert a user-defined filtering rule expression into an abstract syntax tree. The token stream of the AST is obtained by lexical analysis of the rule expression; the tokens are the elements recognized in the rule string, including but not limited to: operators, numbers, strings, variables, regular expressions, and the like.

Among these, arithmetic operators include, for example, the following example code:

In a specific application scenario, for example, the rule expression of the filtering rule is the following character string:

CPU > 90/100 and clusterId in ['hz','qd'] and instanceId like 'AK47\w+'

FIGS. 8 to 10 are schematic diagrams illustrating parallel matching operations performed on the structured data to be filtered by using the obtained filtering rules according to a specific scenario of the present application. By writing ANTLR lexical analysis rules, the AST token stream shown in FIG. 9 is obtained. Within the system, the expression is stored in postfix (suffix) notation, which resolves operator precedence. As shown in FIG. 8, the stored token types are: OP (operator), Num (number), Var (variable), Regex (regular expression), and Strarray (string array).
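To illustrate why postfix storage removes the need for precedence handling at run time, the sketch below converts an already tokenized form of the above expression into postfix order. The application obtains its token stream with ANTLR and does not state how the postfix order is produced; the hand-written shunting-yard conversion used here is a substitute technique, and the precedence table, the "Str" token kind and the function name to_postfix are assumptions.

```python
# Higher number binds tighter; all operators are treated as left-associative.
PRECEDENCE = {"and": 1, ">": 2, "in": 2, "like": 2, "/": 3}


def to_postfix(tokens):
    """Shunting-yard conversion of (kind, value) tokens, without parentheses."""
    output, operators = [], []
    for kind, value in tokens:
        if kind == "OP":
            # Emit pending operators that bind at least as tightly.
            while operators and PRECEDENCE[operators[-1][1]] >= PRECEDENCE[value]:
                output.append(operators.pop())
            operators.append((kind, value))
        else:
            output.append((kind, value))
    while operators:
        output.append(operators.pop())
    return output


infix = [
    ("Var", "CPU"), ("OP", ">"), ("Num", 90), ("OP", "/"), ("Num", 100),
    ("OP", "and"),
    ("Var", "clusterId"), ("OP", "in"), ("StrArray", ["hz", "qd"]),
    ("OP", "and"),
    ("Var", "instanceId"), ("OP", "like"), ("Str", r"AK47\w+"),
]
postfix = to_postfix(infix)
```

In the resulting stream each operator follows its two operands, so the division 90/100 appears before the comparison that uses its result.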

The abstract syntax tree is then pre-computed once to obtain the runnable abstract syntax tree. Pre-computation evaluates in advance the constant expressions in the AST token stream, determining for each sub-expression whether it can already be calculated. It also checks whether each element in the abstract syntax tree is of a special type and converts elements of special types into program language data structure elements, for example (without limitation) translating the parameter element of a Like operation into a regular expression and the parameter element of an In operation into a set. Pre-evaluating the constant expressions in the AST speeds up processing at run time. Here, elements of special types are elements of types other than the native number and string types, such as (without limitation) collection types like arrays, hash maps and hash sets.

In this specific scenario, the fourth device 14 pre-computes the abstract syntax tree shown in FIG. 9 once, and the result is a runnable abstract syntax tree (AST) whose token stream is shown in FIG. 10, in which "0.9", "java.util.HashSet ['hz', 'qd']" and "java.util.regex.Pattern 'AK47\w+'" are the pre-computed results.

In an alternative embodiment, example code for performing the pre-calculation is as follows:

Of course, those skilled in the art should understand that the above example code is merely illustrative; other forms of pre-computation, code, and the like that may appear in the future, if applicable to the present application, are also intended to fall within the scope of protection of the present application and are included herein by reference.

In particular, the module for pre-computing the abstract syntax tree to obtain the runnable abstract syntax tree is configured to: creating a running stack according to the abstract syntax tree, transmitting elements in the abstract syntax tree into the running stack, transmitting two operands corresponding to the operators out of the running stack when the elements are the operators, calculating to obtain a calculation result, and converting the special elements into program language data structure elements and transmitting the program language data structure elements into the running stack when the elements are special elements.
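The run-stack pre-computation just described can be sketched as a single pass over a postfix token stream. This is an illustrative reading, not the application's code: the function name precompute, the representation of stack entries as postfix sub-streams, the "Str" and "Set" token kinds, and the choice to fold only numeric arithmetic are all assumptions. Python's frozenset and re.compile stand in for the java.util.HashSet and java.util.regex.Pattern results named in the scenario.

```python
import operator
import re

ARITHMETIC = {"+": operator.add, "-": operator.sub,
              "*": operator.mul, "/": operator.truediv}


def precompute(postfix):
    """Fold constant arithmetic and convert special elements, in one pass."""
    stack = []  # run stack; each entry is a postfix sub-stream (token list)
    for kind, value in postfix:
        if kind != "OP":
            stack.append([(kind, value)])
            continue
        right, left = stack.pop(), stack.pop()
        # Special elements: the parameter of In becomes a set,
        # the parameter of Like becomes a compiled regular expression.
        if value == "in" and len(right) == 1 and right[0][0] == "StrArray":
            right = [("Set", frozenset(right[0][1]))]
        elif value == "like" and len(right) == 1 and right[0][0] == "Str":
            right = [("Regex", re.compile(right[0][1]))]
        both_numbers = (len(left) == 1 and len(right) == 1
                        and left[0][0] == "Num" and right[0][0] == "Num")
        if value in ARITHMETIC and both_numbers:
            # Constant sub-expression: compute it now instead of at run time.
            result = ARITHMETIC[value](left[0][1], right[0][1])
            stack.append([("Num", result)])
        else:
            # Depends on a variable: keep it for run time, in postfix order.
            stack.append(left + right + [("OP", value)])
    return stack.pop()


postfix = [
    ("Var", "CPU"), ("Num", 90), ("Num", 100), ("OP", "/"), ("OP", ">"),
    ("Var", "clusterId"), ("StrArray", ["hz", "qd"]), ("OP", "in"),
    ("OP", "and"),
    ("Var", "instanceId"), ("Str", r"AK47\w+"), ("OP", "like"),
    ("OP", "and"),
]
runnable = precompute(postfix)
```

After this pass the stream contains the constant 0.9 in place of the tokens for 90/100, a set in place of the string array, and a compiled pattern in place of the pattern string, which corresponds to the three pre-computed results named for FIG. 10.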

Further, the fourth device 14 further includes a unit that takes the data body of the structured data to be filtered as an input parameter, traverses the plurality of runnable abstract syntax trees, and performs parallel matching calculation by using the plurality of runnable abstract syntax trees.

The process of performing matching calculation with the runnable abstract syntax trees is the same as that of the one-time pre-computation, except that at run time every expression in the AST is calculable, so the final result is a definite Boolean value, TRUE or FALSE. If the result is TRUE, the structured data is judged to satisfy the filtering rule.

Here, by performing matching calculation on the data with the runnable abstract syntax trees, when the device 1 is assigned 1000 filtering rules, the matching operations of those 1000 filtering rules are executed concurrently in the thread pool of the device 1 for each piece of structured data to be filtered, so that the performance of the multi-core CPU is fully used to evaluate the filtering rules concurrently.
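A matching calculation of this kind can be sketched as a stack evaluation in which each variable is replaced by the parameter of the same name in the data body. The sketch is not the application's code: the function name matches, the operator table, the hand-written RUNNABLE stream and the sample data bodies are assumptions, and "like" is assumed to mean a full match of the pattern against the value.

```python
import operator
import re

OPERATIONS = {
    "/": operator.truediv,
    ">": operator.gt,
    "and": lambda left, right: left and right,
    "in": lambda left, right: left in right,
    "like": lambda left, right: right.fullmatch(left) is not None,
}


def matches(runnable, body):
    """Evaluate a pre-computed postfix stream against one data body."""
    stack = []  # run stack
    for kind, value in runnable:
        if kind == "Var":
            stack.append(body[value])  # substitute the data-body parameter
        elif kind == "OP":
            right, left = stack.pop(), stack.pop()
            stack.append(OPERATIONS[value](left, right))
        else:
            stack.append(value)  # constant, set or compiled pattern
    return bool(stack.pop())


RUNNABLE = [
    ("Var", "CPU"), ("Num", 0.9), ("OP", ">"),
    ("Var", "clusterId"), ("Set", frozenset({"hz", "qd"})), ("OP", "in"),
    ("OP", "and"),
    ("Var", "instanceId"), ("Regex", re.compile(r"AK47\w+")), ("OP", "like"),
    ("OP", "and"),
]

hit = matches(RUNNABLE, {"CPU": 0.95, "clusterId": "hz", "instanceId": "AK47abc"})
miss = matches(RUNNABLE, {"CPU": 0.95, "clusterId": "bj", "instanceId": "AK47abc"})
```

The first data body satisfies all three conditions, while the second fails the set membership test on clusterId. A data body that lacks a referenced variable would raise KeyError in this sketch; how the application handles that case is not stated.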
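The concurrent evaluation of many rules against one record can be sketched with a thread pool from Python's standard concurrent.futures module. The rule names, the predicates standing in for runnable abstract syntax trees, and the function match_all are hypothetical. Note also that in standard CPython the global interpreter lock limits the speed-up of CPU-bound work across threads, so this shows the structure of the approach rather than its performance.

```python
from concurrent.futures import ThreadPoolExecutor


def match_all(rules, body, pool):
    """Submit one matching task per rule and collect the names that match.

    rules maps a rule name to a predicate(body) -> bool, which stands in
    for evaluating that rule's runnable abstract syntax tree.
    """
    futures = {name: pool.submit(predicate, body)
               for name, predicate in rules.items()}
    return sorted(name for name, future in futures.items() if future.result())


rules = {
    "cpu_high": lambda body: body["Value"] > 80,
    "cpu_critical": lambda body: body["Value"] > 95,
    "hangzhou_only": lambda body: body["clusterId"] == "Hangzhou",
}

with ThreadPoolExecutor(max_workers=4) as pool:
    matched = match_all(rules, {"Value": 92, "clusterId": "Hangzhou"}, pool)
```

Each rule is evaluated independently of the others, which is what allows the rules for one record to be distributed across worker threads.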

Specifically, the unit for traversing the plurality of runnable abstract syntax trees by using the data body of the structured data to be filtered as an input parameter, and performing parallel matching calculation by using the plurality of runnable abstract syntax trees includes: means for replacing variables of the runnable abstract syntax tree with parameters in the data body; means for performing a match calculation on the runnable abstract syntax tree using the run stack.

Example code that replaces variables of the runnable abstract syntax tree with parameters in the data body is as follows:

performing matching calculation on the runnable abstract syntax tree by using the running stack, wherein each node of the AST is processed, and example codes put into the running stack are as follows:

example codes for performing corresponding operations on operator nodes in the AST are as follows:

an example code for performing the matching calculation is as follows:

Of course, those skilled in the art should understand that the above example code is merely illustrative; other forms of methods, code, and the like that may appear in the future, if applicable to the present application, are also intended to fall within the scope of protection of the present application and are included herein by reference.

Thereafter, the device 1 may further process the structured data to be filtered, for example by raising an alarm or the like.

Fig. 2 shows a schematic diagram of an apparatus for filtering data according to a preferred embodiment of the present application, where the apparatus 1 includes: a first means 11', a second means 12', a third means 13', a fourth means 14' and a fifth means 15'.

The contents of the first means 11', the third means 13' and the fourth means 14' are the same as or substantially the same as the contents of the first means 11, the third means 13 and the fourth means 14 of the apparatus 1 shown in fig. 1, and for the sake of brevity, they are not repeated again and are only included herein by way of reference.

Preferably, the second device 12' includes the content of the second device 12 shown in fig. 1, and the second device 12' further includes: a unit for establishing a second rule list of the filtering rules indexed by the rule name of the filtering rule. The second device 12' thus establishes a first rule list and a second rule list as a two-dimensional index over the rule field identifier and the rule name of the filtering rules: the first rule list, indexed by the rule field identifier, is searched when filtering data, and the second rule list, indexed by the rule name, is searched when managing and maintaining the filtering rules. When structured data to be filtered is obtained, the first rule list is searched for filtering rules whose rule field identifier matches the data field identifier; the list of corresponding filtering rules is found and traversed, the data body of the structured data to be filtered is taken as the input parameter, and matching calculation is performed concurrently on each rule in the list. The second rule list facilitates management of the filtering rules.
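The two-dimensional index can be pictured with a small Python sketch. The class and attribute names (`Rule`, `RuleIndex`, `by_domain`, `by_name`) are hypothetical and chosen only for illustration; the application does not prescribe a particular data structure.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    domain: str       # rule field identifier (category of the rule)
    name: str         # globally unique rule name
    expression: str   # rule operation expression


class RuleIndex:
    """Holds the first rule list (by field identifier) and the second (by name)."""

    def __init__(self):
        self.by_domain = defaultdict(list)  # first rule list, used when filtering
        self.by_name = {}                   # second rule list, used for management

    def load(self, rules):
        for rule in rules:
            self.by_domain[rule.domain].append(rule)
            self.by_name[rule.name] = rule

    def rules_for(self, data_domain):
        """Return the rules whose field identifier matches the data field identifier."""
        return list(self.by_domain.get(data_domain, []))
```

With this layout, the filtering path performs one lookup by field identifier per record, while maintenance operations address a rule directly by its name.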

The fifth device 15' is used to add new filtering rules, delete filtering rules, or modify and compile existing filtering rules.

In particular, the fifth device 15' includes: a unit for adding a newly added filtering rule to the second rule list; a unit for deleting the corresponding filtering rule from the second rule list; and a unit for searching for a filtering rule in the second rule list and modifying and compiling the found filtering rule. The fifth device 15' thus allows filtering rules to be modified, added, and deleted, which improves the flexibility of the filtering rules.

Fig. 3 shows a schematic diagram of an apparatus for filtering data according to another preferred embodiment of the present application, wherein the apparatus 1 includes a first device 11", a second device 12", a third device 13", a fourth device 14", a fifth device 15", and a sixth device 16".

The first device 11", the second device 12", the third device 13", the fourth device 14", and the fifth device 15" are the same as or substantially the same as the first device 11', the second device 12', the third device 13', the fourth device 14', and the fifth device 15' of the apparatus 1 shown in fig. 2; for the sake of brevity, they are not described again and are included herein by reference.

Here, each of the filtering rules further includes information of the notifier to which the filtering rule is bound; the sixth device 16" is configured to send the structured data to be filtered that satisfies the corresponding filtering rule to the notifier bound to that filtering rule for transmission. Here, a notifier is one of a group of implementations of the reserved notification interface and can implement a customized notification manner; for example, different transmission protocols, compression algorithms, and serialization algorithms may be used to transmit to different systems in the downstream system cluster. Notifiers can be freely combined and bound to any filtering rule when the filtering rule is created.
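One possible shape for the notifier binding is sketched below in Python. The `NotifierRegistry` class, its method names, and the JSON-serializing notifier are assumptions for illustration only; a real notifier would send the serialized record over some transmission protocol instead of appending it to a list.

```python
import json
from typing import Callable, Dict, List

# A notifier is modelled here as any callable that accepts a matched record.
Notifier = Callable[[dict], None]


class NotifierRegistry:
    """Binds notifiers to rule names and dispatches matched records to them."""

    def __init__(self) -> None:
        self._bindings: Dict[str, List[Notifier]] = {}

    def bind(self, rule_name: str, notifier: Notifier) -> None:
        self._bindings.setdefault(rule_name, []).append(notifier)

    def dispatch(self, rule_name: str, record: dict) -> int:
        """Send the record to every notifier bound to the rule; return the count."""
        notifiers = self._bindings.get(rule_name, [])
        for notify in notifiers:
            notify(record)
        return len(notifiers)


def make_json_notifier(sink: list) -> Notifier:
    """Hypothetical notifier: serializes to JSON and hands the text to a sink."""
    def notify(record: dict) -> None:
        sink.append(json.dumps(record, sort_keys=True))
    return notify
```

Because the registry only depends on the callable interface, a notifier with a different serialization or compression scheme can be bound to a rule without touching the matching code.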

Fig. 4 illustrates a flow chart of a method for filtering data provided in accordance with an aspect of the present application, wherein the method includes: step S11, step S12, step S13, and step S14.

Specifically, the step S11 includes: acquiring initial data to be filtered, and converting the initial data to be filtered into structured data to be filtered, wherein the structured data to be filtered comprises a data field identifier and a data body in a key-value pair format; the step S12 includes: loading filtering rules, wherein each filtering rule comprises a rule field identifier, a rule name and a rule operation expression, and establishing a first rule list of the filtering rules indexed by the rule field identifier of the filtering rule; the step S13 includes: acquiring the structured data to be filtered, and acquiring, from the first rule list according to the data field identifier, a plurality of filtering rules whose rule field identifiers correspond to the data field identifier; the step S14 includes: performing parallel matching operation on the structured data to be filtered by using the acquired plurality of filtering rules.

Further, in the step S11, initial data to be filtered is obtained and converted into structured data to be filtered, wherein the structured data to be filtered comprises a data field identifier and a data body in a key-value pair format.
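The conversion in step S11 could look like the following Python sketch. The application does not state the format of the initial data, so the sketch assumes, purely for illustration, a JSON message whose `domain` key carries the data field identifier; `StructuredData` and `to_structured` are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class StructuredData:
    domain: str            # data field identifier (category of the record)
    body: Dict[str, Any]   # data body in key-value pair format


def to_structured(raw: str) -> StructuredData:
    """Convert one raw message into structured data to be filtered.

    Assumes the raw message is a JSON object with a "domain" key; every
    remaining key-value pair becomes part of the data body.
    """
    message = json.loads(raw)
    domain = message.pop("domain")
    return StructuredData(domain=domain, body=message)
```

For example, `to_structured('{"domain": "cpu", "CPU": 0.95}')` yields a record whose field identifier is `cpu` and whose body is `{"CPU": 0.95}`.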

Preferably, the initial data to be filtered is obtained from mass data, and the step S11 further includes: obtaining the initial data to be filtered from a distributed message middleware. The distributed message middleware is preferably MetaQ, a message middleware with a distributed queue model that has the following characteristics: it can guarantee strict message ordering; and it provides rich message pull modes, efficient horizontal scaling of subscribers, a real-time message subscription mechanism, and the capacity to accumulate messages on the order of hundreds of millions. Fig. 7 shows a schematic diagram of a system of devices for filtering data according to an embodiment of the present application, in which, by using the clustered data sharding of MetaQ, a plurality of devices 1 form a cluster of peer nodes with completely identical functions. This gives the cluster load-balancing capability and meets the requirements of scalability, high availability, and performance in a mass-data context.

Preferably, the step S11 further includes: sending the structured data to be filtered to a blocking queue; accordingly, the step S13 includes: and acquiring the structured data to be filtered from the blocking queue.

Here, the blocking queue blocks further enqueue operations when the queue is full, until the queue is no longer full. Specifically, the step S11 sends the structured data to be filtered to the blocking queue, where it waits; the step S13 obtains the structured data to be filtered from the blocking queue in the order in which it was enqueued, and each piece of structured data is removed from the blocking queue once it has been obtained. When the blocking queue is full of waiting structured data, it blocks the operation in step S11 that sends further structured data to be filtered into the queue. This prevents excessive memory usage when processing capacity is insufficient, smooths peaks and troughs while filtering mass data, and avoids processing faults.
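The back-pressure behaviour described above can be illustrated with Python's standard `queue.Queue`, which blocks `put` while the queue is full and removes an item on `get`. The function name `run_pipeline` and the sentinel convention are assumptions of this sketch, which stands in for steps S11 (producer) and S13 (consumer).

```python
import queue
import threading


def run_pipeline(records, capacity=2):
    """Pass records from a producer to a consumer through a bounded queue."""
    buffer = queue.Queue(maxsize=capacity)
    sentinel = object()   # marks the end of the stream
    consumed = []

    def producer():
        for record in records:
            buffer.put(record)   # blocks while the queue is full
        buffer.put(sentinel)

    def consumer():
        while True:
            item = buffer.get()  # blocks while the queue is empty; removes the item
            if item is sentinel:
                break
            consumed.append(item)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return consumed
```

Because the queue never holds more than `capacity` items, a slow consumer throttles the producer instead of letting pending records accumulate in memory.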

Further, in the step S12, filtering rules are loaded, wherein each filtering rule includes a rule field identifier, a rule name and a rule operation expression, and a first rule list of the filtering rules indexed by the rule field identifier of the filtering rule is established.

Here, the rule field identifier indicates the category of the filtering rule, where the category is, for example and without limitation, the CPU occupancy of a host or the access delay of a certain website. The rule field identifier may be represented by numbers, characters, or the like, and any identifier that can be recognized by a computer may serve as an embodiment of the rule field identifier and is included herein by reference. Preferably, the content of the rule field identifier is the same as or substantially the same as the content of the data field identifier, so that the step S13 can obtain, from the first rule list, a plurality of filtering rules whose rule field identifiers correspond to the data field identifier. The rule name may be a globally unique rule name, which facilitates management and maintenance of the filtering rules. The rule operation expression may be an expression composed of numbers and character strings, for example (by way of example only, and not limited thereto): instanceId == 'AY123456' || clusterId && value > 80. The rule operation expression may further include collection data types beyond native types such as numbers and character strings, for example (by way of example only, and not limited thereto): arrays, hash sets, etc.

Further, the step S12 includes: establishing a first rule list of the filtering rules indexed by the rule field identifier of the filtering rule, wherein the first rule list supports the acquisition of filtering rules in the step S13.

Further, in the step S13, the structured data to be filtered is obtained, and a plurality of filtering rules whose rule field identifiers correspond to the data field identifier are obtained from the first rule list according to the data field identifier.

Further, in the step S14, a parallel matching operation is performed on the structured data to be filtered by using the obtained filtering rules.

Preferably, for each of the structured data to be filtered, the step S13 obtains a plurality of filtering rules having corresponding same rule field identifiers according to the data field identifiers thereof, then the step S14 performs a matching operation on the structured data to be filtered by using each obtained filtering rule, and the step S14 performs a parallel matching operation on the plurality of obtained filtering rules, so as to fully utilize the performance of the multi-core central processing unit and improve the filtering efficiency.
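The concurrent evaluation of all rules retrieved for one record can be sketched in Python with `concurrent.futures.ThreadPoolExecutor`. Each rule is represented here by a plain predicate function, which is an assumption of the sketch (in the application each rule is a compiled abstract syntax tree), and `match_all` is a hypothetical name. Note that in CPython the global interpreter lock limits the CPU parallelism of threads, so the sketch shows the structure of the approach rather than a multi-core speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def match_all(rules, body, pool):
    """Evaluate every rule against one data body concurrently.

    rules: mapping of rule name -> predicate taking the data body.
    Returns the sorted names of the rules that the data body satisfies.
    """
    futures = {name: pool.submit(predicate, body)
               for name, predicate in rules.items()}
    return sorted(name for name, future in futures.items() if future.result())


# Hypothetical rules sharing the same rule field identifier.
CPU_RULES = {
    "high_cpu": lambda body: body["CPU"] > 0.9,
    "low_cpu": lambda body: body["CPU"] < 0.1,
    "in_hz": lambda body: body["clusterId"] == "hz",
}
```

A single long-lived pool would typically be shared across records, so that thread creation cost is not paid for each piece of structured data.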

Specifically, the step S14 includes: performing rule compiling on the acquired filtering rules to establish an executable abstract syntax tree; and traversing a plurality of runnable abstract syntax trees by taking the data body of the structured data to be filtered as an input parameter, and performing parallel matching calculation by using the plurality of runnable abstract syntax trees.

The step S14 implements the abstract syntax tree functionality and can support arithmetic operations, string operations, relational operations, logical operations, regular expression operations, set operations, and the like; it also reserves an extension interface so that user-defined operations and the like can be supported.

Further, rule compiling is performed on the obtained filtering rules to establish a runnable Abstract Syntax Tree (AST), which here is a tree-like representation of the abstract syntactic structure of the rule expression.

Performing rule compiling on the obtained filtering rules to establish the runnable abstract syntax tree includes: parsing the rule expression of each obtained filtering rule and converting it into an abstract syntax tree. Specifically, this can be implemented with Antlr (Another Tool for Language Recognition), so that a user-defined filtering rule expression can be converted into an abstract syntax tree. The token stream of the AST is obtained by lexical analysis of the rule expression; the tokens are the elements recognized when parsing the rule string, including but not limited to: operators, numbers, strings, variables, regular expressions, and the like.
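To make the lexical analysis step concrete, the following Python sketch tokenizes a rule expression with a hand-written regular-expression lexer. This is a deliberately simplified stand-in and not Antlr; the token kinds (`Num`, `Str`, `OP`, `Var`, `Punct`) and the operator set are assumptions for illustration, and the sketch expects keywords such as `and` to be separated from neighbouring operands by whitespace.

```python
import re

# (kind, pattern) pairs; order matters because earlier alternatives win.
TOKEN_SPEC = [
    ("Num", r"\d+(?:\.\d+)?"),
    ("Str", r"'[^']*'"),
    ("OP", r"==|!=|>=|<=|&&|\|\||[><+\-*/]|\b(?:and|or|in|like)\b"),
    ("Var", r"[A-Za-z_]\w*"),
    ("Punct", r"[\[\](),]"),
    ("Skip", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{kind}>{pattern})"
                               for kind, pattern in TOKEN_SPEC))


def tokenize(expression):
    """Split a rule expression into (kind, text) tokens, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(expression):
        match = TOKEN_RE.match(expression, pos)
        if match is None:
            raise ValueError(
                f"unexpected character at {pos}: {expression[pos]!r}")
        if match.lastgroup != "Skip":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

A grammar-based tool additionally builds the tree structure from these tokens; the sketch stops at the token stream.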

The content of the example code for the operators is the same as or substantially the same as the content of the example code for the operators of the abstract syntax tree converted by the fourth device 14 of the apparatus 1 shown in fig. 1; for the sake of brevity, it is not described again and is included herein by reference.

For example, in a concrete scenario, the rule expression is:

CPU > 90/100 and clusterId in ['hz', 'qd'] and instanceId like 'AK47\w+'

By writing Antlr lexical analysis rules, the AST token stream shown in fig. 9 is obtained. The system stores the operation expression in postfix (suffix) notation, which resolves the operator precedence problem. As shown in fig. 8, the storage form uses the following token types: OP (operator), Num (number), Var (variable), Regex (regular expression), and Strarray (string array).
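How a postfix representation settles operator precedence can be shown with the classic shunting-yard conversion, sketched below in Python. The application does not say which conversion procedure it uses, and the precedence table here is an assumption (logical operators bind loosest, then relational and set operators, then arithmetic); the token kinds follow the storage form named above.

```python
# Assumed precedence levels; a higher number binds more tightly.
PRECEDENCE = {
    "or": 1, "and": 2,
    ">": 3, "<": 3, "==": 3, "in": 3, "like": 3,
    "+": 4, "-": 4,
    "*": 5, "/": 5,
}


def to_postfix(tokens):
    """Convert infix (kind, value) tokens to postfix order (shunting-yard)."""
    output, pending = [], []
    for kind, value in tokens:
        if kind == "OP":
            # Emit pending operators that bind at least as tightly (left-assoc).
            while (pending and pending[-1][0] == "OP"
                   and PRECEDENCE[pending[-1][1]] >= PRECEDENCE[value]):
                output.append(pending.pop())
            pending.append((kind, value))
        elif kind == "Punct" and value == "(":
            pending.append((kind, value))
        elif kind == "Punct" and value == ")":
            while pending[-1] != ("Punct", "("):
                output.append(pending.pop())
            pending.pop()  # discard the opening parenthesis
        else:
            output.append((kind, value))  # operands pass straight through
    while pending:
        output.append(pending.pop())
    return output
```

Once the expression is in postfix order, an evaluator no longer needs precedence rules or parentheses; it only needs a stack.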

In this concrete scenario, the abstract syntax tree shown in fig. 9 is pre-computed once, and the result is a runnable Abstract Syntax Tree (AST) whose token stream is shown in fig. 10, where "0.9", "java.util.HashSet['hz', 'qd']" and "java.util.regex.Pattern 'AK47\w+'" are results of the pre-computation.

The exemplary code for performing the pre-calculation may be the same as or substantially the same as the content of the exemplary code for performing the pre-calculation by the fourth apparatus 14 shown in fig. 1, and for the sake of brevity, the description is omitted, and only the content is included herein by way of reference.

Specifically, pre-computing the abstract syntax tree comprises:

creating a run stack according to the abstract syntax tree, and pushing the elements of the abstract syntax tree onto the run stack; when an element is an operator, popping the two operands corresponding to the operator from the run stack and computing the result; and when an element is a special element, converting the special element into a data structure of the programming language and then pushing it onto the run stack.
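The one-time pre-computation can be sketched in Python as a pass over the postfix token stream with a run stack. The sketch makes several assumptions: only arithmetic on two numeric constants is folded, the "special elements" are taken to be string arrays and regular expressions, and the names `ARITH` and `precompute` are hypothetical. Sub-expressions that contain variables are kept unchanged for evaluation at match time.

```python
import operator
import re

ARITH = {"+": operator.add, "-": operator.sub,
         "*": operator.mul, "/": operator.truediv}


def precompute(tokens):
    """Fold constants and convert special elements in a postfix token stream."""
    stack = []  # run stack; each entry is the postfix tokens of one sub-expression
    for kind, value in tokens:
        if kind == "OP":
            right = stack.pop()
            left = stack.pop()
            foldable = (value in ARITH
                        and len(left) == 1 and left[0][0] == "Num"
                        and len(right) == 1 and right[0][0] == "Num")
            if foldable:
                # Both operands are constants: compute the result now.
                stack.append([("Num", ARITH[value](left[0][1], right[0][1]))])
            else:
                # A variable is involved: keep the sub-expression for later.
                stack.append(left + right + [(kind, value)])
        elif kind == "Strarray":
            stack.append([("Set", frozenset(value))])       # special element
        elif kind == "Regex":
            stack.append([("Pattern", re.compile(value))])  # special element
        else:
            stack.append([(kind, value)])
    return [token for entry in stack for token in entry]
```

Applied to the postfix form of the sample rule, the division 90/100 collapses into the constant 0.9, the string array becomes a hash-based set, and the regular expression is compiled once instead of on every match.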

Further, traversing the plurality of runnable abstract syntax trees with the data body of the structured data to be filtered as the input parameter, and performing parallel matching calculation by using the plurality of runnable abstract syntax trees, follows the same procedure as the one-time pre-computation. In normal operation, every expression in the AST is computable, so the final calculation result is a definite value, namely the Boolean value FALSE or TRUE. If the result is TRUE, the structured data is judged to satisfy the filtering rule.

Here, the parallel matching calculation performs matching calculation on the data by using the runnable abstract syntax trees. For example, when 1000 filtering rules are allocated to the device 1, then for each piece of structured data to be filtered, the matching operations for the 1000 filtering rules are executed concurrently in the thread pool of the device 1, so that the filtering rules are evaluated concurrently and the performance of the multi-core CPU is fully utilized.

Specifically, the parallel matching calculation by using the plurality of runnable abstract syntax trees comprises the following steps: replacing the variables of the runnable abstract syntax tree with the parameters in the data body; and performing matching calculation on the runnable abstract syntax tree by using the run stack.

The content of the example code for replacing the variables of the runnable abstract syntax tree with the parameters in the data body is the same as or substantially the same as the content of the corresponding example code for the fourth device 14 of the apparatus 1 in fig. 1; for the sake of brevity, it is not described again and is included herein by reference.

The content of the example code for performing the corresponding operation on an operator node in the AST is the same as or substantially the same as the content of the corresponding example code for the fourth device 14 of the apparatus 1 in fig. 1; for the sake of brevity, it is not repeated here and is included herein by reference.

Likewise, the content of the example code for performing the matching calculation is the same as or substantially the same as the content of the example code for performing the matching calculation by the fourth device 14 of the apparatus 1 in fig. 1; for the sake of brevity, it is not described again and is included herein by reference.

Thereafter, the method can further process the structured data to be filtered, for example by raising an alarm.

Fig. 5 is a schematic flow chart illustrating a method for filtering data according to a preferred embodiment of the present application, the method including: step S11', step S12', step S13', step S14', and step S15'.

The contents of step S11', step S13' and step S14' are the same as or substantially the same as the contents of step S11, step S13 and step S14 shown in fig. 4; for the sake of brevity, they are not repeated here and are included herein by reference.

Preferably, the step S12' includes the content of the step S12 shown in fig. 4, and the step S12' further includes: establishing a second rule list of the filtering rules indexed by the rule name of the filtering rule. The step S12' thus establishes a first rule list and a second rule list as a two-dimensional index over the rule field identifier and the rule name of the filtering rules: the first rule list, indexed by the rule field identifier, is searched when filtering data, and the second rule list, indexed by the rule name, is searched when managing and maintaining the filtering rules. When structured data to be filtered is obtained, the first rule list is searched for filtering rules whose rule field identifier matches the data field identifier; the list of corresponding filtering rules is found and traversed, the data body of the structured data to be filtered is taken as the input parameter, and matching calculation is performed concurrently on each rule in the list. The second rule list facilitates management of the filtering rules.

In the step S15', a new filtering rule is added, a filtering rule is deleted, or an existing filtering rule is modified and compiled.

Specifically, the step S15' includes at least any one of: adding the newly added filtering rule to the second rule list; deleting the corresponding filtering rule from the second rule list; and searching for a filtering rule in the second rule list, and modifying and compiling the found filtering rule. The step S15' thus allows filtering rules to be modified, added, and deleted, which improves the flexibility of the filtering rules.
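The maintenance operations of step S15' might be organised as in the following Python sketch. The `RuleManager` class and its method names are hypothetical; keeping the first rule list in step with the second is an assumption of the sketch (the text above names only the second rule list), and the compilation that accompanies a modification is omitted.

```python
class RuleManager:
    """Adds, deletes and modifies rules, addressing them by rule name."""

    def __init__(self):
        self.by_name = {}    # second rule list: rule name -> (field id, expression)
        self.by_domain = {}  # first rule list: field id -> {rule name: expression}

    def add(self, name, domain, expression):
        if name in self.by_name:
            raise KeyError(f"rule {name!r} already exists")
        self.by_name[name] = (domain, expression)
        self.by_domain.setdefault(domain, {})[name] = expression

    def delete(self, name):
        domain, _ = self.by_name.pop(name)
        del self.by_domain[domain][name]
        if not self.by_domain[domain]:
            del self.by_domain[domain]  # drop empty categories

    def modify(self, name, expression):
        # Recompiling the modified expression is omitted in this sketch.
        domain, _ = self.by_name[name]
        self.by_name[name] = (domain, expression)
        self.by_domain[domain][name] = expression
```

Looking rules up by name keeps each maintenance operation independent of how many rules share a category.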

FIG. 6 is a flowchart illustrating a method for filtering data according to another preferred embodiment of the present application, wherein the method includes steps S11", S12", S13", S14", S15" and S16".

The steps S11", S12", S13", S14" and S15" are the same as or substantially the same as the steps S11', S12', S13', S14' and S15' shown in fig. 5; for the sake of brevity, they are not repeated here and are included herein by reference.

Here, each of the filtering rules further includes information of the notifier to which the filtering rule is bound; in step S16", the structured data to be filtered that satisfies the corresponding filtering rule is sent to the notifier bound to that filtering rule for transmission. Here, a notifier is one of a group of implementations of the reserved notification interface and can implement a customized notification manner; for example, different transmission protocols, compression algorithms, and serialization algorithms may be used to transmit to different systems in the downstream system cluster. Notifiers can be freely combined and bound to any filtering rule when the filtering rule is created.

Through multiple performance tests, the performance figures obtained are approximately as follows: a single virtual machine with 4 cores and 8 GB of memory can support 500,000 filtering rules; the TPS for processing streaming data reaches 20,000; the TPS of valid data filtered out reaches 2,000; the average system load stabilizes at about load1-4; and CPU resources are effectively utilized.

In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory. The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium. Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.

In addition, some of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Program instructions which invoke the methods of the present application may be stored on a fixed or removable recording medium and/or transmitted via a data stream on a broadcast or other signal-bearing medium and/or stored within a working memory of a computer device operating in accordance with the program instructions. An embodiment according to the present application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform a method and/or a solution according to the aforementioned embodiments of the present application.

It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims

1. A method for filtering data, wherein the method comprises:

acquiring initial data to be filtered, and converting the initial data to be filtered into structured data to be filtered, wherein the structured data to be filtered comprises a data field identifier and a data body in a key-value pair format; the data field identification is used for indicating the category of the structured data to be filtered; the data body of the key-value pair format records the detailed information of the key-value pair format of the structured data to be filtered;

loading filtering rules, wherein each filtering rule comprises a rule field identifier, a rule name and a rule operation expression, and establishing a first rule list of the filtering rules indexed by the rule field identifier of the filtering rule; the rule field identifier is used for indicating the category of the filtering rule;

2. The method of claim 1, wherein obtaining initial data to be filtered comprises:

3. The method of claim 1, wherein converting the initial data to be filtered into structured data to be filtered further comprises:

sending the structured data to be filtered to a blocking queue;

acquiring the structured data to be filtered comprises:

and acquiring the structured data to be filtered from the blocking queue.

4. The method of claim 1, wherein performing a parallel matching operation on the structured data to be filtered using the obtained filtering rules comprises:

5. The method of claim 4, wherein the rule compiling the retrieved filtering rules to create a runnable abstract syntax tree comprises:

wherein pre-computing the abstract syntax tree once comprises:

6. The method of claim 5, wherein performing parallel matching computations using a number of the runnable abstract syntax trees comprises:

7. The method of any of claims 1-6, wherein the method further comprises:

8. The method of claim 7, wherein establishing the first rule list of the filtering rules indexed by the rule field identifier of the filtering rule further comprises:

adding the newly added filtering rule into the second rule list;

deleting the corresponding filtering rule from the second rule list;

9. The method of claim 1, wherein each of the filtering rules further comprises: information of the notifier to which the filtering rule is bound;

the method further comprises the following steps:

10. An apparatus for filtering data, wherein the apparatus comprises:

a first device, configured to acquire initial data to be filtered and convert the initial data to be filtered into structured data to be filtered, wherein the structured data to be filtered comprises a data field identifier and a data body in a key-value pair format; the data field identifier is used for indicating the category of the structured data to be filtered; the data body in the key-value pair format records the detailed information of the structured data to be filtered in key-value pair format;

a second device, configured to load filtering rules, wherein each filtering rule comprises a rule field identifier, a rule name and a rule operation expression, and to establish a first rule list of the filtering rules indexed by the rule field identifier of the filtering rule; the rule field identifier is used for indicating the category of the filtering rule;

11. The apparatus of claim 10, wherein the first device comprises:

12. The apparatus of claim 10, wherein the first device comprises:

a unit for sending the structured data to be filtered to a blocking queue;

the third device comprises:

13. The apparatus of claim 10, wherein the fourth device comprises:

14. The apparatus of claim 13, wherein the unit for performing rule compiling on the acquired filtering rules to establish a runnable abstract syntax tree comprises:

15. The apparatus of claim 13, wherein the unit for traversing the plurality of runnable abstract syntax trees with the data body of the structured data to be filtered as the input parameter and performing parallel matching calculation by using the plurality of runnable abstract syntax trees comprises:

16. The apparatus of any of claims 10 to 15, wherein the apparatus further comprises:

17. The apparatus of claim 16, wherein the second device further comprises:

the fifth device comprises:

a unit for adding the newly added filtering rule to the second rule list;

a unit for deleting the corresponding filtering rule from the second rule list;

18. The apparatus of claim 10, wherein each of the filtering rules further comprises: information of the notifier to which the filtering rule is bound;

the apparatus further comprises: