CN113722332A - Method and system for improving efficiency and robustness of matching algorithm based on data structure - Google Patents


Publication number
CN113722332A
CN113722332A CN202111056560.5A CN202111056560A CN113722332A CN 113722332 A CN113722332 A CN 113722332A CN 202111056560 A CN202111056560 A CN 202111056560A CN 113722332 A CN113722332 A CN 113722332A
Authority
CN
China
Prior art keywords
matching
width
predicates
predicate
data structure
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111056560.5A
Other languages
Chinese (zh)
Other versions
CN113722332B (en)
Inventor
钱诗友
廖政宇
曹健
薛广涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202111056560.5A
Publication of CN113722332A
Application granted
Publication of CN113722332B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2468 Fuzzy queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Automation & Control Theory (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a system for improving the efficiency and robustness of a matching algorithm based on a data structure, wherein the method comprises: indexing subscriptions with a preset data structure on which the matching algorithm operates; the preset data structure comprises a two-level index layer and a storage layer; the first-level index layer is an attribute-based mapping that maps predicates on the same attribute into the same attribute unit; the second-level index layer is a mapping based on interval predicate width that maps predicates into different width units according to their interval width, so that interval predicates with the same width but different centers are mapped into the same width unit; the storage layer is used for storing the subscriptions; the width units are divided uniformly.

Description

Method and system for improving efficiency and robustness of matching algorithm based on data structure
Technical Field
The invention relates to a matching algorithm in an event distribution network, in particular to a method and a system for improving the efficiency and robustness of the matching algorithm based on a data structure, and more particularly to a preset data structure supporting the efficient and robust matching algorithm in a publish/subscribe system.
Background
Publish/subscribe systems originally appeared as news subsystems. They fully decouple the communicating parties in time, space and synchronization. Because of this attractive property, publish/subscribe systems are widely deployed in many areas, such as system monitoring and management, real-time stock updates, online gaming, online advertising, and social media messaging. In particular, content-based publish/subscribe systems allow subscribers to express their interest in events using boolean expressions, enabling fine-grained selective information distribution.
Matching algorithms are key modules of large-scale publish/subscribe systems. To improve matching performance, researchers have proposed many matching algorithms based on different data structures. However, because matching performance is affected by many factors, existing matching algorithms show poor performance and robustness in a dynamic environment.
Patent document CN110427217B (application number: 201910672885.2) discloses a lightweight parallel method and system for the matching algorithm of a content-based publish/subscribe system. The index structure that stores the data structure is layered into a plurality of levels, each level corresponding to a set of storage units; the levels are grouped, and each level group comprises levels together with their corresponding sets of storage units. A matching thread is set for each level group, each matching event is distributed independently to a single matching thread for processing, and multiple matching threads update an indicator simultaneously, with synchronization performed during the update. This improves matching performance, and the degree of parallelism is adjusted dynamically according to performance requirements so that events are distributed quickly and reliably. The optimal degree of parallelism is determined by iterative optimization, and the task allocation of threads is improved, so that the time overhead is low.
To cope with this problem, the present invention proposes a new data structure. The data structure supports several matching algorithms at the same time, so the most suitable matching algorithm can be used in each environment; this reduces the influence of a dynamic environment on matching performance and yields better matching performance and stability.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a method and a system for improving the efficiency and robustness of a matching algorithm based on a data structure.
The method for improving the efficiency and robustness of a matching algorithm based on a data structure provided by the invention comprises: indexing subscriptions with a preset data structure on which the matching algorithm operates;
the preset data structure comprises a two-level index layer and a storage layer;
the first-level index layer is an attribute-based mapping: predicates on the same attribute are mapped into the same attribute unit;
the second-level index layer is a mapping based on interval predicate width: predicates are mapped into different width units according to their interval width, so that interval predicates with the same width but different centers are mapped into the same width unit;
the storage layer is used for storing the subscriptions;
the width units are divided uniformly.
Preferably, the storage layer uses B+ trees: each width unit has two B+ trees, one for the low values and one for the high values of the interval predicates, and the low-value tree holds links to the high-value tree; a B+ tree is self-balancing and keeps the inserted elements ordered.
Preferably, the matching process also uses two groups of markers and one group of recorders; one group of markers uses bit sets to mark unmatched subscriptions, and the other uses counters to count the matched predicates of each subscription; the recorders record the task division in hybrid matching.
Preferably, the matching algorithm comprises: forward matching (AFM), reverse matching (ABM) and hybrid matching (AHM);
forward matching (AFM) counts matched predicates, and all low-value trees in the data structure are checked during matching;
reverse matching (ABM) marks unmatched predicates;
hybrid matching (AHM) combines the forward and reverse methods by dividing the width units into two tasks: forward matching is used for the width units that index narrow-interval predicates, and reverse matching is used for the remaining width units.
Preferably, the forward matching employs:
step S1: incrementing by one the counter of each subscription whose predicate is indexed in the [v', v] space; wherein v represents the event value and, within a width unit whose indexed interval predicates have widths in the range [w, w'], v' = v - w;
step S2: for predicates indexed in the [v'', v'] space, checking their high values on the high-value tree through the pointers set on the B+ tree; when the high value of a predicate is greater than or equal to v, the predicate is matched, and the counter of the subscription containing that predicate is also incremented by one; wherein v'' = v - w';
step S3: after all low-value trees in the data structure have been processed, checking the counters; when a counter equals the number of predicates of its subscription, that subscription is matched.
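The reason steps S1 and S2 treat the two spaces differently follows from the width bound of the width unit. A brief derivation, writing l, h and d for a predicate's low value, high value and width (symbols introduced here only for illustration):

```latex
\begin{aligned}
& w \le d \le w', \qquad h = l + d, \qquad v' = v - w, \qquad v'' = v - w',\\
& l \in [v',\, v] \;\Rightarrow\; l \le v \ \text{and}\ h \ge l + w \ge v
  \quad \text{(always matched, step S1)},\\
& l \in [v'',\, v') \;\Rightarrow\; l \le v,\ \text{but } h \ge v \text{ is not guaranteed}
  \quad \text{(high value must be checked, step S2)},\\
& l < v'' \;\Rightarrow\; h \le l + w' < v
  \quad \text{(never matched)}.
\end{aligned}
```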
Preferably, the reverse matching employs:
step S4: on the low-value tree, marking all predicates whose low values are greater than the event value v;
step S5: on the high-value tree, marking all predicates whose high values are smaller than the event value v;
step S6: checking the markers; the unmarked subscriptions are the matched ones.
Preferably, the hybrid matching employs: dividing the width units into two tasks, with forward matching used for the width units that index narrow-interval predicates and reverse matching used for the other width units; recording, through the recorder, the number of predicates of each subscription assigned to forward matching; after forward and reverse matching are complete, for each subscription left unmarked by reverse matching, checking whether its counter value equals its recorder value; if so, the subscription is matched;
a narrow-interval predicate is a predicate whose interval width is smaller than a preset value;
the division point of the width units in hybrid matching should make the forward and reverse matching times as nearly equal as possible; assuming that the width at the division point is κ, forward and reverse matching have similar matching times when κ satisfies the following equation:
[Equation image BDA0003254807750000031 is not reproduced in this text]
where v represents the event value; the two symbols shown in images BDA0003254807750000032 and BDA0003254807750000033 represent the unit costs of performing marking and counting, respectively; Γ(x) represents the probability that a low or high value of a predicate equals x; and x represents a random variable.
Preferably, the entire search space is divided into a matching space, a non-matching space, a candidate space and an empty space according to the information of each width unit; all predicates in the matching space satisfy the condition, so when forward matching is used, the counters of the corresponding subscriptions are directly incremented by one; all predicates in the non-matching space fail the condition, so when reverse matching is used, the corresponding subscriptions are directly marked; the empty space contains no predicates and need not be examined during matching, which reduces traversal overhead.
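The four spaces can be illustrated on the low-value axis of a single width unit. The sketch below is one possible reading, not the patented procedure: the function name, the value domain [0, 1], and in particular the interpretation of the empty space as the region where no predicate of width at least w can start are assumptions made for illustration.

```python
def classify_low_value(low, v, w, w_prime, domain_high=1.0):
    """Classify a position on the low-value axis of one width unit.

    v is the event value; [w, w_prime] is the width range indexed by the unit.
    Assumption: values lie in [0, domain_high], so a predicate of width >= w
    cannot start above domain_high - w (treated here as the empty space).
    """
    if low > domain_high - w:
        return "empty"          # no predicate of this unit can be stored here
    if low > v or low < v - w_prime:
        return "non-matching"   # low > v, or high <= low + w_prime < v
    if low >= v - w:
        return "matching"       # high >= low + w >= v is guaranteed
    return "candidate"          # high value still has to be checked
```

With v = 0.5 and a unit indexing widths in [0.25, 0.5], a low value of 0.375 falls in the matching space, 0.125 in the candidate space, 0.625 in the non-matching space and 0.875 in the empty space.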
Preferably, string-type matching and fuzzy matching are supported;
string-type matching is supported by converting string values into the form of interval predicates;
fuzzy matching treats all predicates in the candidate space as satisfying the condition, and does not further check their high values on the high-value tree;
in forward matching, omitting the check on possibly matched predicates further improves matching efficiency, but introduces some error into the matching result: the matched subscriptions include some false positives;
given the number of width units ζ under each attribute, the maximum error rate f of a predicate on a single attribute is:
[Equation image BDA0003254807750000041 is not reproduced in this text]
given the maximum allowed error rate F, the number of width units into which each attribute is divided can be calculated.
The system for improving the efficiency and robustness of a matching algorithm based on a data structure provided by the invention is configured to: index subscriptions with a preset data structure on which the matching algorithm operates;
the preset data structure comprises a two-level index layer and a storage layer;
the first-level index layer is an attribute-based mapping: predicates on the same attribute are mapped into the same attribute unit;
the second-level index layer is a mapping based on interval predicate width: predicates are mapped into different width units according to their interval width, so that interval predicates with the same width but different centers are mapped into the same width unit;
the storage layer is used for storing the subscriptions;
the width units are divided uniformly.
Compared with the prior art, the invention has the following beneficial effects:
The main advantage of the preset data structure provided by the invention is that it supports several matching algorithms at once and achieves efficient, stable matching performance in a dynamic environment. First, the data structures of most existing matching algorithms support only one matching algorithm, so they cannot draw on other matching algorithms to further improve performance. Second, the invention can combine several matching algorithms, which overcomes the performance fluctuation of a single algorithm in a dynamic environment. This allows the invention to maintain more efficient and stable matching performance in a dynamic environment, and thus to guarantee the quality of service (QoS) of event distribution services in many highly variable scenarios.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is an abstract diagram of a data structure according to the present invention.
FIG. 2 is a schematic diagram of the AFM algorithm of the present invention.
FIG. 3 is a diagram of the ABM algorithm of the present invention.
FIG. 4 is a schematic diagram of the AFM algorithm optimization of the present invention.
FIG. 5 is a schematic diagram of the ABM algorithm optimization of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the invention, all of which fall within the scope of the present invention.
Example 1
The technical problem solved by the invention: event matching performance is crucial to the performance of a content-based publish/subscribe system; the preset data structure and the three matching algorithms built on it meet the matching requirements of a dynamic environment and achieve more efficient and stable matching performance.
The methods adopted by existing matching algorithms fall into two types: forward matching and reverse matching. One parameter affecting the performance of a matching algorithm is the matching probability of subscriptions: as it increases, the performance of forward matching decreases while the performance of reverse matching increases. The invention therefore provides a preset data structure for indexing subscriptions that supports three methods: forward matching, reverse matching and hybrid matching. The hybrid method uses both the forward and the reverse method during event matching and exploits the advantages of both, thereby improving the efficiency and robustness of event matching.
To achieve efficient and stable matching performance, the invention first provides a novel index structure. The index structure is divided into multiple levels and maps interval predicates according to their width, which allows it to support several matching algorithms. By analyzing the subscriptions in each index unit, the invention further provides three efficient matching algorithms suited to different environmental requirements. Together these ensure the required performance and stability in a dynamic environment.
The method for improving the efficiency and robustness of a matching algorithm based on a data structure provided by the invention comprises: indexing subscriptions with a preset data structure on which the matching algorithm operates;
the preset data structure comprises a two-level index layer and a storage layer;
the first-level index layer is an attribute-based mapping: predicates on the same attribute are mapped into the same attribute unit;
the second-level index layer is a mapping based on interval predicate width: predicates are mapped into different width units according to their interval width, so that interval predicates with the same width but different centers are mapped into the same width unit;
the storage layer is used for storing the subscriptions;
the width units are divided uniformly.
Specifically, the storage layer uses B+ trees: each width unit comprises two B+ trees, one for the low values and one for the high values of the interval predicates, and the low-value tree holds links to the high-value tree; a B+ tree is self-balancing and keeps the inserted elements ordered.
Specifically, the matching process also uses two groups of markers and one group of recorders; one group of markers uses bit sets to mark unmatched subscriptions, and the other uses counters to count the matched predicates of each subscription; the recorders record the task division in hybrid matching.
Specifically, the matching algorithm includes: forward matching (AFM), reverse matching (ABM) and hybrid matching (AHM);
forward matching (AFM) counts matched predicates, and all low-value trees in the data structure are checked during matching;
reverse matching (ABM) marks unmatched predicates;
hybrid matching (AHM) combines the forward and reverse methods by dividing the width units into two tasks: forward matching is used for the width units that index narrow-interval predicates, and reverse matching is used for the remaining width units.
Specifically, the forward matching employs:
step S1: incrementing by one the counter of each subscription whose predicate is indexed in the [v', v] space; wherein v represents the event value and, within a width unit whose indexed interval predicates have widths in the range [w, w'], v' = v - w;
step S2: for predicates indexed in the [v'', v'] space, checking their high values on the high-value tree through the pointers set on the B+ tree; when the high value of a predicate is greater than or equal to v, the predicate is matched, and the counter of the subscription containing that predicate is also incremented by one; wherein v'' = v - w';
step S3: after all low-value trees in the data structure have been processed, checking the counters; when a counter equals the number of predicates of its subscription, that subscription is matched.
Specifically, the reverse matching employs:
step S4: on the low-value tree, marking all predicates whose low values are greater than the event value v;
step S5: on the high-value tree, marking all predicates whose high values are smaller than the event value v;
step S6: checking the markers; the unmarked subscriptions are the matched ones.
Specifically, the hybrid matching employs: dividing the width units into two tasks, with forward matching used for the width units that index narrow-interval predicates and reverse matching used for the other width units; recording, through the recorder, the number of predicates of each subscription assigned to forward matching; after forward and reverse matching are complete, for each subscription left unmarked by reverse matching, checking whether its counter value equals its recorder value; if so, the subscription is matched;
a narrow-interval predicate is a predicate whose interval width is smaller than a preset value;
the division point of the width units in hybrid matching should make the forward and reverse matching times as nearly equal as possible; assuming that the width at the division point is κ, forward and reverse matching have similar matching times when κ satisfies the following equation:
[Equation image BDA0003254807750000071 is not reproduced in this text]
where v represents the event value; the two symbols shown in images BDA0003254807750000072 and BDA0003254807750000073 represent the unit costs of performing marking and counting, respectively; Γ(x) represents the probability that a low or high value of a predicate equals x; and x represents a random variable.
Specifically, the whole search space is divided into a matching space, a non-matching space, a candidate space and an empty space according to the information of each width unit; all predicates in the matching space satisfy the condition, so when forward matching is used, the counters of the corresponding subscriptions are directly incremented by one; all predicates in the non-matching space fail the condition, so when reverse matching is used, the corresponding subscriptions are directly marked; the empty space contains no predicates and need not be examined during matching, which reduces traversal overhead.
Specifically, string-type matching and fuzzy matching are supported;
string-type matching is supported by converting string values into the form of interval predicates;
fuzzy matching treats all predicates in the candidate space as satisfying the condition, and does not further check their high values on the high-value tree;
in forward matching, omitting the check on possibly matched predicates further improves matching efficiency, but introduces some error into the matching result: the matched subscriptions include some false positives;
given the number of width units ζ under each attribute, the maximum error rate f of a predicate on a single attribute is:
[Equation image BDA0003254807750000081 is not reproduced in this text]
given the maximum allowed error rate F, the number of width units into which each attribute is divided can be calculated.
The system for improving the efficiency and robustness of the matching algorithm based on the data structure can be realized by the steps and the flows in the method for improving the efficiency and robustness of the matching algorithm based on the data structure. The method for improving the efficiency and robustness of the matching algorithm based on the data structure can be understood as a preferred example of a system for improving the efficiency and robustness of the matching algorithm based on the data structure by those skilled in the art.
Example 2
Example 2 is a preferred example of Example 1.
Existing matching algorithms can be broadly classified into two categories. One is forward matching, which looks for matching predicates to determine which subscriptions match. Such algorithms can be further divided into counting-based matching and matching based on tree-structure filtering. The other is reverse matching, whose main idea is to determine matching subscriptions indirectly by determining which predicates do not match. The data structures of these algorithms support only a single matching method. As the number of matching predicates increases, the efficiency of forward matching decreases, whereas the efficiency of reverse matching increases. A single matching method therefore cannot adapt gracefully to a dynamic environment.
Matching algorithms can also be divided into exact and fuzzy matching. In contrast to exact matching, fuzzy matching tolerates false positives: some subscriptions that do not actually match may be reported as matching. This improves matching performance while keeping misjudgments within a given tolerance. In addition, according to the event types they support, matching algorithms can be divided into single-event-type and multi-event-type algorithms. Compared with a single event type, multi-event-type support provides richer subscription expressions and ensures event matching in a high-dimensional space.
To overcome the limitation that existing matching algorithms adopt a single matching method, and to realize a publish/subscribe system with stronger generality, higher matching efficiency and more stable performance, the invention designs a preset data structure for indexing subscriptions. The overall framework of the data structure is shown in FIG. 1. The data structure is divided into a two-level index layer and a storage layer. The first-level index is attribute-based: predicates on the same attribute are mapped into the same attribute unit. The second-level index is a mapping based on the width of the interval predicate: the width of each predicate is computed first, and the predicate is then mapped into a width unit according to that width. This mapping places interval predicates with the same width but different centers in the same width unit. The width units are divided uniformly. For example, if the value domain [0,1] is divided into 5 width units, each width unit covers a width range of 0.2: the width range of the first width unit is [0,0.2], that of the second is [0.2,0.4], and so on.
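The uniform width division can be sketched in a few lines. This is a minimal illustration assuming the value domain [0, 1]; the function name and the handling of widths that fall exactly on a unit boundary (the text lists [0,0.2] and [0.2,0.4] without saying which unit owns 0.2) are assumptions.

```python
def width_unit_index(low, high, num_units, domain_low=0.0, domain_high=1.0):
    """Return the index of the width unit for the interval predicate [low, high]."""
    if not (domain_low <= low <= high <= domain_high):
        raise ValueError("predicate must lie inside the value domain")
    unit_span = (domain_high - domain_low) / num_units  # e.g. 0.2 for 5 units
    width = high - low
    # Assumption: a boundary width goes to the upper unit; the maximum width
    # is clamped into the last unit so the index stays in range.
    return min(int(width / unit_span), num_units - 1)
```

For instance, with 5 width units the predicates [0.10, 0.25] and [0.60, 0.75] have the same width but different centers, and both map to width unit 0.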
The storage layer stores the subscriptions using B+ trees. Each width unit contains two B+ trees, one for the low values and one for the high values of the interval predicates, and the low-value tree holds links to the high-value tree. A B+ tree is self-balancing and keeps the inserted elements ordered. In addition, two groups of markers and one group of recorders are required during matching. One group of markers uses a bit set to mark the subscriptions that do not match, and the other uses counters to count the matched predicates of each subscription. The recorders record the task division in hybrid matching.
To insert a subscription, each of its predicates is mapped to the corresponding width unit according to its attribute and width, and its low value and high value are then inserted into the two B+ trees of that unit.
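The insertion process can be sketched as follows. This is an illustrative stand-in rather than the patented implementation: sorted Python lists maintained with `bisect.insort` replace the B+ trees, the link from the low-value tree to the high-value tree is modelled by storing the high value inside each low-value entry, and the class name `WidthIndex` and the value domain [0, 1] are assumptions.

```python
import bisect
from collections import defaultdict


class WidthIndex:
    """Two-level index: attribute -> width unit -> (low list, high list)."""

    def __init__(self, num_units):
        self.num_units = num_units
        # First level: one entry per attribute; second level: one dict per width unit.
        self.units = defaultdict(
            lambda: [{"low": [], "high": []} for _ in range(num_units)]
        )
        self.predicate_count = {}  # subscription id -> number of predicates

    def unit_of(self, low, high):
        # Uniform division of the width range, assuming values in [0, 1].
        return min(int((high - low) * self.num_units), self.num_units - 1)

    def insert(self, sub_id, predicates):
        """predicates: dict mapping attribute name -> (low, high)."""
        self.predicate_count[sub_id] = len(predicates)
        for attr, (low, high) in predicates.items():
            unit = self.units[attr][self.unit_of(low, high)]
            # Sorted lists stand in for the low-value and high-value B+ trees.
            bisect.insort(unit["low"], (low, high, sub_id))
            bisect.insort(unit["high"], (high, sub_id))
```

A subscription such as `{"price": (0.10, 0.25), "volume": (0.0, 0.9)}` then contributes one entry to width unit 0 of `price` and one entry to width unit 4 of `volume` when five width units are used.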
Three matching algorithms based on preset data structure
Based on a preset data structure, the invention provides three matching algorithms, namely forward matching (AFM), reverse matching (ABM) and hybrid matching (AHM). The three matching algorithms are based on the same data structure.
(1) Forward matching counts matched predicates. During matching, all low-value trees in the data structure need to be checked. FIG. 2 shows the search space of forward matching on the low-value tree. Within a width unit, assume that the widths of the indexed interval predicates lie in the range [w, w']. For an event value v, let v' = v - w and v'' = v - w'; then the predicates indexed in the [v', v] space are all matched, and the remaining potentially matching predicates are contained in the [v'', v'] space. Forward matching therefore comprises three steps. Step 1: increment by one the counter of each subscription whose predicate is indexed in the [v', v] space. Step 2: for predicates indexed in the [v'', v'] space, check their high values on the high-value tree through the pointers set on the B+ tree; if the high value is greater than or equal to v, the predicate is matched, and the counter of the subscription containing it is also incremented by one. Step 3: after all low-value trees in the data structure have been processed, check the counters; if a counter equals the number of predicates of its subscription, that subscription is a match.
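The three steps of forward matching can be sketched as below. This is a hedged illustration, not the patented code: each low-value tree is modelled as a sorted list of `(low, high, sub_id)` tuples (the stored high value stands in for the pointer to the high-value tree), and the function name and input layout are assumptions.

```python
import bisect
from collections import Counter


def forward_match(index, predicate_count, event):
    """index: attribute -> list of (w, w_prime, low_tree), one entry per width unit.
    low_tree: sorted list of (low, high, sub_id).
    predicate_count: sub_id -> number of predicates in that subscription.
    event: attribute -> event value v."""
    counter = Counter()
    for attr, v in event.items():
        for w, w_prime, low_tree in index.get(attr, []):
            v1 = v - w        # v'
            v2 = v - w_prime  # v''
            lo = bisect.bisect_left(low_tree, (v2,))             # first low >= v''
            mid = bisect.bisect_left(low_tree, (v1,))            # first low >= v'
            hi = bisect.bisect_right(low_tree, (v, float("inf")))  # past last low <= v
            # Step 2: candidates in [v'', v') need their high value checked.
            for _, high, sub in low_tree[lo:mid]:
                if high >= v:
                    counter[sub] += 1
            # Step 1: everything in [v', v] is matched without further checks.
            for _, _, sub in low_tree[mid:hi]:
                counter[sub] += 1
    # Step 3: a subscription matches when all of its predicates were counted.
    return {s for s, c in counter.items() if c == predicate_count[s]}
```

The two binary searches per width unit bound the work to the matched and candidate ranges, which is why the cost of this approach grows with the number of matching predicates.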
(2) Reverse matching marks the unmatched predicates. Fig. 3 shows the search space of reverse matching. Within a width unit, every predicate whose low value is greater than v or whose high value is less than v is unmatched for an event value v. Reverse matching therefore also comprises three steps. First, on the low-value tree, all predicates with low values greater than v are marked. Second, on the high-value tree, all predicates with high values less than v are marked. Finally, the markers are checked, and the subscriptions that remain unmarked are the matches.
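A minimal sketch of reverse matching for one width unit is given below. As an assumption for illustration, two sorted lists replace the low-value and high-value B+ trees, a `bytearray` indexed by subscription id stands in for the bit set, and the function names are hypothetical.

```python
import bisect


def reverse_match_cell(lows, highs, v, unmatched):
    """Mark subscriptions that own a predicate not containing v.

    lows     : list of (low, sub_id) sorted by low value
    highs    : list of (high, sub_id) sorted by high value
    unmatched: bytearray indexed by sub_id, 1 = marked as not matching
    """
    # Step 1: every predicate with low > v cannot contain v.
    first_above = bisect.bisect_right(lows, (v, float("inf")))
    for _, sub in lows[first_above:]:
        unmatched[sub] = 1

    # Step 2: every predicate with high < v cannot contain v.
    first_not_below = bisect.bisect_left(highs, (v,))
    for _, sub in highs[:first_not_below]:
        unmatched[sub] = 1


def unmarked(unmatched):
    """Step 3: the subscriptions left unmarked are the matches."""
    return [sub for sub, bit in enumerate(unmatched) if not bit]
```

Unlike forward matching, no per-predicate comparison is needed inside the two loops: the position found by the binary search already decides which entries are marked.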
(3) Hybrid matching combines the forward and reverse matching methods by dividing the matching task among the width units. Forward matching is used for the width units that index narrow-interval predicates (predicates whose interval width is less than a given threshold), and reverse matching is used for the other width units. The recorder records, for each subscription, the number of its predicates assigned to forward matching. After the forward and reverse matching parts are completed, each subscription left unmarked by reverse matching is checked: if the values of its counter and its recorder are equal, the subscription is a match.
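How the counter, the recorder, and the markers combine in hybrid matching can be sketched as follows. To keep the example short, the index lookups are replaced by a plain scan over the predicates, so the sketch shows only the decision logic and not the performance characteristics. The function name, the dictionary layout, the use of a `set` in place of the bit set, and the assumption that the event carries a value for every attribute are all illustrative.

```python
def hybrid_match(subscriptions, event, kappa):
    """Return the ids of subscriptions matched by the event.

    subscriptions: {sub_id: {attribute: (low, high)}}
    event        : {attribute: value}, assumed to cover every attribute
    kappa        : width threshold separating narrow from wide predicates
    """
    counter = {s: 0 for s in subscriptions}   # matched narrow predicates
    recorder = {s: 0 for s in subscriptions}  # predicates given to forward matching
    unmatched = set()                         # stands in for the bit set

    for sub, preds in subscriptions.items():
        for attr, (low, high) in preds.items():
            v = event[attr]
            if high - low < kappa:
                # Narrow predicate: forward matching counts it if satisfied.
                recorder[sub] += 1
                if low <= v <= high:
                    counter[sub] += 1
            elif v < low or v > high:
                # Wide predicate: reverse matching marks it if unsatisfied.
                unmatched.add(sub)

    # A subscription matches if reverse matching left it unmarked and
    # forward matching counted every predicate assigned to it.
    return sorted(
        s for s in subscriptions
        if s not in unmatched and counter[s] == recorder[s]
    )
```

In the indexed structure the recorder values would be fixed when the task division is made, rather than recomputed for every event as in this scan.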
The division point used by hybrid matching to allocate width units should be chosen so that forward matching and reverse matching take as nearly the same time as possible. Assuming that the width at the division point is κ, forward and reverse matching have similar matching times when κ satisfies the following equation.
Figure BDA0003254807750000101
where v represents the event value, the two symbols rendered as images in the original (BDA0003254807750000103 and BDA0003254807750000104) represent the unit costs of marking and counting, respectively, and Γ(x) represents the probability that a low or high value of a predicate equals x. Solving the above equation, for example when the predicates are uniformly distributed over the value domain, gives
Figure BDA0003254807750000102
Taking the width unit in which κ falls as the boundary, the width units that index predicates narrower than κ are allocated to forward matching and the remaining width units to reverse matching, which completes the task division.
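Once κ is known, the task division itself is a simple partition of the width units. The sketch below assumes a uniform division of the width range [0, 1] into `num_cells` units and assumes that the boundary unit containing κ is assigned to reverse matching; the source does not state which side the boundary unit belongs to, and the function name is hypothetical.

```python
def divide_tasks(num_cells, kappa):
    """Split width-unit indices into forward and reverse matching sets.

    Unit i is assumed to index predicate widths in [i/num_cells, (i+1)/num_cells).
    Units lying entirely below kappa go to forward matching; the unit that
    contains kappa and all wider units go to reverse matching (assumption).
    """
    boundary = min(int(kappa * num_cells), num_cells)
    forward = list(range(boundary))
    reverse = list(range(boundary, num_cells))
    return forward, reverse
```

For example, with ten width units and κ = 0.25, units 0 and 1 are handled by forward matching and units 2 through 9 by reverse matching.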
Search-space reduction for the matching algorithms
As shown in Figs. 4 and 5, the entire search space can be divided into a matching space, a non-matching space, and an empty space according to the information of each width unit, i.e., the lower and upper limits of the interval predicate widths that can be mapped to the unit. All predicates in the matching space are satisfied, so when forward matching is used the counters of the corresponding subscriptions can be incremented directly. All predicates in the non-matching space are unsatisfied, so when reverse matching is used the corresponding subscriptions can be marked directly. The empty space contains no predicates and need not be examined during matching, which reduces the traversal overhead.
Both forward matching and reverse matching include a step that determines the target space: forward matching must determine the space containing the matching predicates, and reverse matching the space containing the non-matching predicates. For this step, the invention provides an optimization that reduces the search space in certain cases. Assume that the value domain of the attribute is [0, 1]. For a width unit whose predicate width range is [w, w'], no predicate lies in the [1-w, 1] space of the low-value tree or in the [0, w] space of the high-value tree.
For forward matching, given an event value v, all predicates on the low-value tree are matched when v lies in [1-w, w]. When v is in this interval, two comparisons are therefore enough to determine the whole matching space, which saves the search on the B+ tree.
For reverse matching, when v is greater than 1-w, the low-value tree contains no non-matching predicate, and when v is less than w, the high-value tree contains no non-matching predicate. When v satisfies either condition, two comparisons are therefore enough to determine that the corresponding tree can be skipped.
This method reduces the cost of locating the target space and thus further improves matching efficiency. It is most effective when the predicate width exceeds half of the value domain.
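The comparisons described for both matching directions can be sketched as two small predicates over the event value v and the minimum width w of a width unit. The value domain [0, 1] follows the text; the function names are hypothetical.

```python
def forward_shortcut(v, w):
    """True if every predicate of the width unit is known to match v.

    Every low value is at most 1 - w and every high value is at least w,
    so any v in [1 - w, w] lies inside all predicates of the unit.
    The interval is non-empty only when w >= 0.5.
    """
    return 1 - w <= v <= w


def reverse_shortcuts(v, w):
    """Return (skip_low_tree, skip_high_tree) for reverse matching.

    skip_low_tree : no low value exceeds v, so nothing is marked there
    skip_high_tree: no high value is below v, so nothing is marked there
    """
    skip_low_tree = v > 1 - w
    skip_high_tree = v < w
    return skip_low_tree, skip_high_tree
```

For a unit with w = 0.75, any event value between 0.25 and 0.75 lets forward matching count the whole unit without a tree search, and lets reverse matching skip both trees.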
Support for string matching and fuzzy matching
Fuzzy matching means that the matching result is not exact and may contain false positives, i.e., non-matching subscriptions that are judged to be matches. Fuzzy matching algorithms generally improve matching performance at the cost of some false positives.
The invention supports string matching by converting string predicates into interval predicates. The supported operators are <, ≤, =, >, ≥, and * (wildcard). The conversion is as follows:
Ai < “abcde” → Ai ∈ [“”, “abcde”)
Ai ≤ “abcde” → Ai ∈ [“”, “abcde”]
Ai = “abcde” → Ai ∈ [“abcde”, “abcde”]
Ai > “abcde” → Ai ∈ (“abcde”, “INF”]
Ai ≥ “abcde” → Ai ∈ [“abcde”, “INF”]
Ai = “abcd*” → Ai ∈ [“abcd”, “abce”)
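The conversion table can be sketched as a small function. This is an illustration under assumptions: the function name and the 4-tuple return format are hypothetical, ASCII operator spellings `<=` and `>=` are used for ≤ and ≥, the wildcard is assumed to appear only as a trailing `*`, and `"INF"` is kept as a symbolic upper bound as in the text (a real implementation would need a bound that sorts after every string).

```python
INF = "INF"  # symbolic upper bound, as written in the conversion table


def to_interval(op, s):
    """Convert a string predicate into (low, high, low_closed, high_closed)."""
    if op == "=" and s.endswith("*"):
        prefix = s[:-1]
        if not prefix:
            # A bare wildcard matches every string.
            return ("", INF, True, True)
        # Smallest string greater than every string starting with `prefix`:
        # increment the last character (assumes it is not the maximum code point).
        upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
        return (prefix, upper, True, False)
    if op == "<":
        return ("", s, True, False)
    if op == "<=":
        return ("", s, True, True)
    if op == "=":
        return (s, s, True, True)
    if op == ">":
        return (s, INF, False, True)
    if op == ">=":
        return (s, INF, True, True)
    raise ValueError(f"unsupported operator: {op!r}")
```

The wildcard case is checked first so that `= "abcd*"` is treated as a prefix predicate rather than as equality with the literal string.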
The invention also supports fuzzy matching: all predicates in the candidate space shown in Fig. 2 are treated as satisfied, and their high values are not checked further on the high-value tree. In forward matching, omitting the check on the possibly matching predicates further improves matching efficiency, but it introduces some error into the matching result, i.e., the matched subscriptions include some false positives. Given the number of width units ζ under each attribute, the maximum error rate f of a predicate on a single attribute is:
Figure BDA0003254807750000111
Given the maximum allowed error rate F, equation (2) can be used to calculate the number of width units into which each attribute is divided.
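The effect of fuzzy matching on a single width unit can be sketched as follows. As an assumption for illustration, a sorted list of `(low, high, sub_id)` tuples replaces the low-value B+ tree, subscription ids are integers, and the function name is hypothetical. The high value is stored only so that the false positives can be inspected afterwards; the fuzzy matcher itself never reads it.

```python
import bisect

INF = float("inf")


def fuzzy_forward_cell(lows, v, w_max):
    """Return sub_ids counted as matched without any high-value check.

    lows : list of (low, high, sub_id) sorted by low value
    w_max: upper width limit w' of the predicates indexed by this unit

    Every predicate whose low value lies in [v - w', v] is accepted, so the
    candidate space [v'', v'] is treated exactly like the matched space.
    """
    start = bisect.bisect_left(lows, (v - w_max,))
    end = bisect.bisect_right(lows, (v, INF, INF))
    return [sub for _, _, sub in lows[start:end]]
```

Narrower width units shrink the candidate space, which is why the number of width units per attribute bounds the error rate.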
Those skilled in the art will appreciate that, besides implementing the system, the apparatus, and their modules provided by the invention purely as computer-readable program code, the same functions can be achieved by logically programming the method steps so that the system, the apparatus, and their modules take the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. The system, the apparatus, and their modules may therefore be regarded as hardware components; the modules they contain for implementing various programs may be regarded as structures within the hardware component; and modules for performing various functions may be regarded both as software programs implementing the method and as structures within the hardware component.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. A method for improving efficiency and robustness of a matching algorithm based on a data structure is characterized by comprising the following steps: indexing the subscriptions based on a matching algorithm using a preset data structure;
the preset data structure comprises two index layers and a storage layer;
the first-level index layer maps predicates with the same attribute into the same attribute unit based on the mapping of the attribute;
the second-level index layer is based on mapping of interval predicate width, and predicates are mapped into different width units according to the interval predicate width, so that the interval predicates with the same width but different centers can be mapped into the same width unit;
the storage layer is used for storing the subscription;
the width cells are divided in a uniform manner.
2. The method for improving the efficiency and robustness of the matching algorithm based on the data structure as claimed in claim 1, wherein the storage layer stores the subscriptions using B+ trees, two B+ trees are provided for each width unit, the two B+ trees respectively correspond to the low values and the high values of the interval predicates, and a link to the high-value tree is provided on the low-value tree; the B+ tree is self-balancing and keeps the inserted elements ordered;
and for each predicate, the predicate is mapped to the corresponding width unit according to its attribute and its width, and the low value and the high value of the predicate are respectively inserted into the two B+ trees to complete the insertion of the subscription.
3. The method for improving the efficiency and robustness of the matching algorithm based on the data structure as claimed in claim 1, wherein in the matching process, two groups of markers and one group of recorders are further included; one group of markers marks unmatched subscriptions by using the bit sets, and the other group of markers counts the number of matched predicates in the subscriptions by using a counter; the recorder is used for recording the task division in the mixed matching.
4. The method for improving efficiency and robustness of a matching algorithm based on a data structure of claim 1, wherein the matching algorithm comprises: forward matching AFM, reverse matching ABM and hybrid matching AHM;
the forward matching AFM adopts a mode of counting matched predicates, and all low-value trees in a data structure are checked during matching;
the reverse matching ABM adopts a mode of marking unmatched predicates;
the hybrid matching AHM combines the forward matching method and the reverse matching method and performs task division on the width units, wherein forward matching is used for the width units that index narrow-interval predicates, and reverse matching is used for the other width units.
5. The method for improving efficiency and robustness of matching algorithms based on data structures of claim 4, wherein the forward matching employs:
step S1: adding one to the counter of the subscription corresponding to each predicate indexed in the [v', v] space; wherein v represents an event value; within a width unit, the width range of the interval predicates indexed by the width unit is assumed to be [w, w'], and v' = v - w;
step S2: for predicates indexed in the [v'', v'] space, checking the high values of the predicates on the high-value tree through pointers provided on the B+ tree; when the high value of a predicate is greater than or equal to v, the predicate is matched, and the counter of the corresponding subscription containing the predicate is also incremented by one; wherein v'' = v - w';
step S3: after all the low-value trees in the data structure have been processed, checking the counters; when the value of a counter is the same as the number of predicates of the corresponding subscription, the subscription is matched.
6. The method for improving efficiency and robustness of matching algorithms based on data structures of claim 4, wherein the reverse matching employs:
step S4: marking all predicates with low values larger than the event value v on the low value tree;
step S5: marking all predicates with high values smaller than the event value v on the high value tree;
step S6: the markers are checked and the untagged subscriptions are matched.
7. The method for improving efficiency and robustness of matching algorithms based on data structures of claim 4, wherein the hybrid matching employs: performing task division on the width units, wherein forward matching is used for the width units that index narrow-interval predicates, and reverse matching is used for the other width units; recording, through a recorder, the number of predicates in each subscription that are assigned to forward matching; after the forward matching and the reverse matching are completed, for each subscription that is not marked in the reverse matching, checking whether the values of the counter and the recorder corresponding to the subscription are equal, and if so, the subscription is matched;
the narrow-interval predicate is a predicate with an interval width smaller than a preset value;
the division point used by the hybrid matching to allocate the width units is chosen so that the forward matching and the reverse matching have matching times that are as nearly equal as possible; assuming that the width at the division point is κ, the forward matching and the reverse matching have similar matching times when κ satisfies the following equation;
Figure FDA0003254807740000021
where v represents the event value; the two symbols rendered as images in the original (FDA0003254807740000023 and FDA0003254807740000022) represent the unit costs of marking and counting, respectively; Γ(x) represents the probability that a low or high value of a predicate is equal to x; and x represents a random variable.
8. The method for improving efficiency and robustness of a matching algorithm based on a data structure as claimed in claim 1, wherein the whole search space is divided into a matching space, a non-matching space and an empty space according to information of each width unit; all predicates in the matching space satisfy the condition, and when a forward matching algorithm is used, the counters of the subscriptions corresponding to these predicates are directly incremented by one; all predicates in the non-matching space do not satisfy the condition, and when a reverse matching algorithm is used, the subscriptions corresponding to these predicates are directly marked; the empty space does not contain any predicate and does not need to be checked in the matching process, so that the traversal overhead is reduced.
9. The method for improving efficiency and robustness of matching algorithms based on data structures of claim 1, wherein string type matching and fuzzy matching are supported;
the string type matching is supported by converting string predicates into the form of interval predicates;
the fuzzy matching treats all predicates in the candidate space as satisfying the condition, and the high values of these predicates are not further checked on the high-value tree;
in the forward matching, the matching efficiency is further improved by omitting the check on the possibly matching predicates, but some error is introduced into the matching result, and the matched subscriptions include some false positive subscriptions;
given the number of width units ζ under each attribute, the maximum error rate f of the predicate over a single attribute is:
Figure FDA0003254807740000031
given the maximum allowed error rate F, the number of width units into which each attribute is divided is calculated.
10. A system for improving efficiency and robustness of a matching algorithm based on a data structure, comprising: indexing the subscriptions based on a matching algorithm using a preset data structure;
the preset data structure comprises two index layers and a storage layer;
the first-level index layer maps predicates with the same attribute into the same attribute unit based on the mapping of the attribute;
the second-level index layer is based on mapping of interval predicate width, and predicates are mapped into different width units according to the interval predicate width, so that the interval predicates with the same width but different centers can be mapped into the same width unit;
the storage layer is used for storing the subscription;
the width cells are divided in a uniform manner.
CN202111056560.5A 2021-09-09 2021-09-09 Method and system for improving efficiency and robustness of matching algorithm based on data structure Active CN113722332B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111056560.5A CN113722332B (en) 2021-09-09 2021-09-09 Method and system for improving efficiency and robustness of matching algorithm based on data structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111056560.5A CN113722332B (en) 2021-09-09 2021-09-09 Method and system for improving efficiency and robustness of matching algorithm based on data structure

Publications (2)

Publication Number Publication Date
CN113722332A true CN113722332A (en) 2021-11-30
CN113722332B CN113722332B (en) 2024-03-26

Family

ID=78682867

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111056560.5A Active CN113722332B (en) 2021-09-09 2021-09-09 Method and system for improving efficiency and robustness of matching algorithm based on data structure

Country Status (1)

Country Link
CN (1) CN113722332B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102004798A (en) * 2010-12-27 2011-04-06 东北大学 Matching method of symmetrical issuing subscription system based on plural one-dimensional index
CN102819569A (en) * 2012-07-18 2012-12-12 中国科学院软件研究所 Matching method for data in distributed interactive simulation system
US20140280317A1 (en) * 2013-03-15 2014-09-18 University Of Florida Research Foundation, Incorporated Efficient publish/subscribe systems
CN103984760A (en) * 2014-05-29 2014-08-13 中国航空无线电电子研究所 Data structure oriented to content publishing and subscribing system and mixed event matching method thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DONGXIANG ZHANG等: "An efficient publish subscribe index for e-commerce databases.", 《PROCEEDINGS OF THE VLDB ENDOWMENT》, vol. 7, no. 8, 31 December 2014 (2014-12-31), pages 613 - 624, XP058064052, DOI: 10.14778/2732296.2732298 *

Also Published As

Publication number Publication date
CN113722332B (en) 2024-03-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant