CN113722332B - Method and system for improving efficiency and robustness of matching algorithm based on data structure - Google Patents


Info

Publication number
CN113722332B
Authority
CN
China
Prior art keywords
matching
predicates
width
predicate
subscription
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111056560.5A
Other languages
Chinese (zh)
Other versions
CN113722332A (en
Inventor
钱诗友
廖政宇
曹健
薛广涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202111056560.5A priority Critical patent/CN113722332B/en
Publication of CN113722332A publication Critical patent/CN113722332A/en
Application granted granted Critical
Publication of CN113722332B publication Critical patent/CN113722332B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2246Trees, e.g. B+trees
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2468Fuzzy queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Automation & Control Theory (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a system for improving the efficiency and robustness of a matching algorithm based on a data structure. The method comprises: indexing subscriptions for the matching algorithm by using a preset data structure; the preset data structure comprises a two-level index layer and a storage layer; the first-level index layer is a mapping based on attributes, and predicates with the same attribute are mapped into the same attribute unit; the second-level index layer is a mapping based on interval predicate width, and predicates are mapped into different width units according to their interval widths, so that interval predicates with the same width but different centers are mapped into the same width unit; the storage layer is used for storing subscriptions; the width units are divided uniformly.

Description

Method and system for improving efficiency and robustness of matching algorithm based on data structure
Technical Field
The invention relates to matching algorithms in event distribution networks, in particular to a method and a system for improving the efficiency and robustness of a matching algorithm based on a data structure, and more particularly to a preset data structure that supports efficient and robust matching algorithms in a publish/subscribe system.
Background
The publish/subscribe system initially appears as a news subsystem. It achieves complete decoupling of both parties in terms of time, space and synchronization. Because of its attractive nature, publish/subscribe systems are widely deployed in many areas, such as system monitoring and management, real-time stock updates, online gaming, online advertising, and social media messaging. In particular, content-based publish/subscribe systems allow subscribers to express their interest in events using boolean expressions, enabling fine-grained selective information distribution.
The matching algorithm is a key module of a large-scale publish/subscribe system. To improve matching performance, researchers have proposed many matching algorithms based on different data structures. However, the performance of the matching algorithm is affected by various factors, so that the performance and the robustness of the existing matching algorithm in a dynamic environment are poor.
Patent document CN110427217B (application number 201910672885.2) discloses a lightweight parallel method and system for matching algorithms in content-based publish/subscribe systems. The index structure of the storage data structure is layered into a plurality of levels, each level corresponding to a set of storage units of the storage data structure; the levels are grouped, and each level group comprises the levels and their corresponding storage unit sets. A matching thread is set for each level group, matching events are distributed independently to a single matching thread for processing, and the matching threads update an indicator concurrently, with synchronization performed on each update. This improves matching performance and allows the degree of parallelism to be adjusted dynamically according to performance requirements, ensuring fast and reliable event distribution. The optimal degree of parallelism is determined by an iterative optimization method, which improves the allocation of tasks to threads at low time overhead.
In order to cope with this problem, the present invention proposes a new data structure. The data structure can simultaneously support a plurality of matching algorithms, and realizes matching by using the optimal matching algorithm under different environments, thereby reducing the influence of dynamic environments on the matching performance and obtaining better matching performance and stability.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a method and a system for improving the efficiency and the robustness of a matching algorithm based on a data structure.
The method for improving the efficiency and the robustness of the matching algorithm based on the data structure provided by the invention comprises the following steps: indexing the subscription based on a matching algorithm by using a preset data structure;
the preset data structure comprises a two-level index layer and a storage layer;
the first-level index layer is based on mapping of attributes, and predicates with the same attributes are mapped into the same attribute units;
the second-stage index layer is based on mapping of interval predicate widths, and predicates are mapped into different width units according to the interval predicate widths, so that interval predicates with the same width but different centers can be mapped into the same width units;
the storage layer is used for storing subscription;
the width units are divided in a uniform manner.
Preferably, the storage layer adopts B+ trees for storage, two B+ trees are arranged for each width unit, the low value and the high value of the interval predicate are respectively corresponding to the two B+ trees, and the low value tree is provided with a link for the high value tree; the B+ tree can realize the self-balance of the tree and ensure the order of the inserted elements.
Preferably, in the matching process, two sets of markers and one set of recorders are also included; one group of markers uses a bit set to mark unmatched subscriptions, and the other group uses a counter to count the number of matched predicates in the subscriptions; the recorder is used for recording task partitions in hybrid matching.
Preferably, the matching algorithm comprises: forward matching AFM, reverse matching ABM, and hybrid matching AHM;
the forward matching AFM is used for checking all low-value trees in the data structure when matching by adopting a mode of counting matched predicates;
the reverse matching ABM is a mode of marking unmatched predicates;
the hybrid matching AHM combines forward matching and reverse matching methods, performs task division on width units, uses forward matching on width units used for indexing narrow interval predicates, and uses reverse matching on width units except narrow interval predicates.
Preferably, the forward matching employs:
step S1: incrementing by one the counter of each subscription whose predicate is indexed in the [v', v] space of the low-value tree; wherein v represents the event value; in one width unit, assuming that the width range of the interval predicates indexed by the width unit is [w, w'], v' = v - w;
step S2: for predicates indexed in the [v'', v'] space, checking the high value of each predicate on the high-value tree through the pointers arranged on the B+ tree; when the high value of the predicate is greater than or equal to v, the predicate is matched, and the counter of the subscription containing the predicate is incremented by one; wherein v'' = v - w';
step S3: after all low-value tree operations in the data structure are completed, the counters are checked, and when the value of a counter equals the number of predicates of the corresponding subscription, the subscription is matched.
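The reason steps S1 and S2 can treat the two ranges differently follows from the width bounds of a width unit. The short derivation below is a sketch; the symbols l and h for the low and high value of a predicate are introduced here for illustration and do not appear in the original text.

```latex
\begin{align*}
&\text{Let a predicate } [l, h] \text{ be indexed in a width unit with } w \le h - l \le w', \\
&\text{and let } v' = v - w, \quad v'' = v - w'. \\[4pt]
&\text{If } v' \le l \le v: \quad h \ge l + w \ge v' + w = v
  \;\Rightarrow\; l \le v \le h \quad \text{(always matched)}. \\
&\text{If } v'' \le l < v': \quad l + w < v \le l + w' \text{ and } l + w \le h \le l + w'
  \;\Rightarrow\; \text{the check } h \ge v \text{ is required}. \\
&\text{If } l < v'': \quad h \le l + w' < v'' + w' = v
  \;\Rightarrow\; \text{never matched}.
\end{align*}
```

Only the middle case needs a lookup of the high value, which is why forward matching is cheapest in width units whose width range [w, w'] is narrow.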
Preferably, the reverse matching employs:
step S4: marking all predicates with low values larger than the event value v on the low value tree;
step S5: marking all predicates with high values smaller than the event value v on the high value tree;
step S6: the markers are checked, and unmarked subscriptions are matched.
Preferably, the hybrid matching employs: dividing the width units between the two methods, using forward matching on the width units that index narrow-interval predicates and reverse matching on the other width units; recording, through a recorder, the number of predicates in each subscription that are assigned to forward matching; after the forward matching and the reverse matching are completed, checking, for each subscription left unmarked by the reverse matching, whether the values of its counter and its recorder are equal, and if so, the subscription is matched;
the narrow-interval predicate is a predicate with an interval width smaller than a preset value;
the division point used by the hybrid matching to assign width units to tasks should be chosen so that the forward matching and the reverse matching have matching times that are as equal as possible; assuming that the width at the division point is κ, the forward and reverse matching have similar matching times when κ satisfies the following equation;
where v represents the event value; the two cost coefficients in the equation represent the unit cost of performing marking and counting, respectively; Γ(x) represents the probability that the low or high value of a predicate equals x; and x represents a random variable.
Preferably, the entire search space is divided into a matching space, a non-matching space, a candidate space, and an empty space according to the information of each width unit; all predicates in the matching space satisfy the condition, so when the forward matching algorithm is used, the counters of the subscriptions containing these predicates are incremented directly; no predicate in the non-matching space satisfies the condition, so when the reverse matching algorithm is used, the subscriptions containing these predicates are marked directly; the empty space contains no predicates and need not be checked during matching, which reduces the traversal cost.
Preferably, string type matching and fuzzy matching are supported;
the character string type matching is realized by converting the character type into the form of interval predicates;
the fuzzy matching considers that all predicates in the candidate space meet the condition, and the high value of the predicates is not further checked on the high value tree;
in forward matching, the matching efficiency is further improved by omitting the checking of possible matching predicates; however, certain errors are brought to the matching result, and certain false positive subscriptions are contained in the matched subscriptions;
given the number ζ of width units per attribute, the predicate maximum error rate f on a single attribute is:
given the maximum allowed error rate F, the number of width units into which each attribute is divided is calculated.
The system for improving the efficiency and the robustness of the matching algorithm based on the data structure provided by the invention comprises the following components: indexing the subscription based on a matching algorithm by using a preset data structure;
the preset data structure comprises a two-level index layer and a storage layer;
the first-level index layer is based on mapping of attributes, and predicates with the same attributes are mapped into the same attribute units;
the second-stage index layer is based on mapping of interval predicate widths, and predicates are mapped into different width units according to the interval predicate widths, so that interval predicates with the same width but different centers can be mapped into the same width units;
the storage layer is used for storing subscription;
the width units are divided in a uniform manner.
Compared with the prior art, the invention has the following beneficial effects:
the preset data structure provided by the invention has the main advantages that a plurality of matching algorithms can be simultaneously supported, and the efficient and stable matching performance can be realized in a dynamic environment. First, the data structures of existing matching algorithms can mostly only support one matching algorithm, which makes it difficult for their data structures to support other matching algorithms to further improve performance. Secondly, the invention can mix a plurality of matching algorithms, thereby solving the defect of single algorithm performance fluctuation under dynamic environment. This feature enables the present invention to maintain more efficient and stable matching performance in a dynamic environment, thereby enabling quality of service (QoS) of event distribution services to be guaranteed in some more diverse scenarios.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, given with reference to the accompanying drawings in which:
FIG. 1 is an abstract view of a data structure of the present invention.
FIG. 2 is a schematic of an AFM algorithm of the present invention.
FIG. 3 is a schematic representation of the ABM algorithm of the present invention.
FIG. 4 is a schematic view of AFM algorithm optimization according to the present invention.
FIG. 5 is a schematic diagram of the ABM algorithm optimization of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present invention.
Example 1
The technical solution of the invention is as follows: event matching performance is crucial to the performance of a content-based publish/subscribe system; the preset data structure, and the three matching algorithms built on it, can cope with matching requirements in a dynamic environment, thereby achieving more efficient and stable matching performance.
Existing matching algorithms adopt one of two methods: forward matching and reverse matching. One parameter that affects the performance of a matching algorithm is the matching probability of subscriptions: as the matching probability increases, the performance of forward matching algorithms decreases, while the performance of reverse matching algorithms increases. Therefore, the invention provides a preset data structure for indexing subscriptions that supports three methods: forward matching, reverse matching, and hybrid matching. The hybrid method uses both the forward and the reverse method during event matching and exploits the advantages of each, thereby improving the efficiency and robustness of event matching.
To achieve efficient and stable matching performance, the invention first proposes a new index structure. The index structure adopts multi-level division and maps interval predicates according to their widths, which makes it possible to support multiple matching algorithms. By analyzing the subscriptions in each index unit, the invention then provides three efficient matching algorithms suited to different environmental requirements. Together, these ensure performance and stability in a dynamic environment.
The method for improving the efficiency and the robustness of the matching algorithm based on the data structure provided by the invention comprises the following steps: indexing the subscription based on a matching algorithm by using a preset data structure;
the preset data structure comprises a two-level index layer and a storage layer;
the first-level index layer is based on mapping of attributes, and predicates with the same attributes are mapped into the same attribute units;
the second-stage index layer is based on mapping of interval predicate widths, and predicates are mapped into different width units according to the interval predicate widths, so that interval predicates with the same width but different centers can be mapped into the same width units;
the storage layer is used for storing subscription;
the width units are divided in a uniform manner.
Specifically, the storage layer adopts B+ trees for storage, two B+ trees are arranged for each width unit, the low value and the high value of the interval predicate are respectively corresponding to the two B+ trees, and links to the high value tree are arranged on the low value tree; the B+ tree can realize the self-balance of the tree and ensure the order of the inserted elements.
Specifically, in the matching process, the method also comprises two groups of markers and one group of recorders; one group of markers uses a bit set to mark unmatched subscriptions, and the other group uses a counter to count the number of matched predicates in the subscriptions; the recorder is used for recording task partitions in hybrid matching.
Specifically, the matching algorithm includes: forward matching AFM, reverse matching ABM, and hybrid matching AHM;
the forward matching AFM is used for checking all low-value trees in the data structure when matching by adopting a mode of counting matched predicates;
the reverse matching ABM is a mode of marking unmatched predicates;
the hybrid matching AHM combines forward matching and reverse matching methods, performs task division on width units, uses forward matching on width units used for indexing narrow interval predicates, and uses reverse matching on width units except narrow interval predicates.
Specifically, the forward matching employs:
step S1: incrementing by one the counter of each subscription whose predicate is indexed in the [v', v] space of the low-value tree; wherein v represents the event value; in one width unit, assuming that the width range of the interval predicates indexed by the width unit is [w, w'], v' = v - w;
step S2: for predicates indexed in the [v'', v'] space, checking the high value of each predicate on the high-value tree through the pointers arranged on the B+ tree; when the high value of the predicate is greater than or equal to v, the predicate is matched, and the counter of the subscription containing the predicate is incremented by one; wherein v'' = v - w';
step S3: after all low-value tree operations in the data structure are completed, the counters are checked, and when the value of a counter equals the number of predicates of the corresponding subscription, the subscription is matched.
Specifically, the reverse matching employs:
step S4: marking all predicates with low values larger than the event value v on the low value tree;
step S5: marking all predicates with high values smaller than the event value v on the high value tree;
step S6: the markers are checked, and unmarked subscriptions are matched.
Specifically, the hybrid matching employs: dividing the width units between the two methods, using forward matching on the width units that index narrow-interval predicates and reverse matching on the other width units; recording, through a recorder, the number of predicates in each subscription that are assigned to forward matching; after the forward matching and the reverse matching are completed, checking, for each subscription left unmarked by the reverse matching, whether the values of its counter and its recorder are equal, and if so, the subscription is matched;
the narrow-interval predicate is a predicate with an interval width smaller than a preset value;
the division point used by the hybrid matching to assign width units to tasks should be chosen so that the forward matching and the reverse matching have matching times that are as equal as possible; assuming that the width at the division point is κ, the forward and reverse matching have similar matching times when κ satisfies the following equation;
where v represents the event value; the two cost coefficients in the equation represent the unit cost of performing marking and counting, respectively; Γ(x) represents the probability that the low or high value of a predicate equals x; and x represents a random variable.
Specifically, the entire search space is divided into a matching space, a non-matching space, a candidate space, and an empty space according to the information of each width unit; all predicates in the matching space satisfy the condition, so when the forward matching algorithm is used, the counters of the subscriptions containing these predicates are incremented directly; no predicate in the non-matching space satisfies the condition, so when the reverse matching algorithm is used, the subscriptions containing these predicates are marked directly; the empty space contains no predicates and need not be checked during matching, which reduces the traversal cost.
Specifically, supporting string type matching and fuzzy matching;
the character string type matching is realized by converting the character type into the form of interval predicates;
the fuzzy matching considers that all predicates in the candidate space meet the condition, and the high value of the predicates is not further checked on the high value tree;
in forward matching, the matching efficiency is further improved by omitting the checking of possible matching predicates; however, certain errors are brought to the matching result, and certain false positive subscriptions are contained in the matched subscriptions;
given the number ζ of width units per attribute, the predicate maximum error rate f on a single attribute is:
given the maximum allowed error rate F, the number of width units into which each attribute is divided is calculated.
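The difference between exact and fuzzy handling of the candidate space can be illustrated for a single width unit. The sketch below is an assumption-laden illustration, not the patented implementation: a sorted Python list of `(low, high, sub_id)` tuples stands in for the low-value B+ tree with its links to the high-value tree, attribute values are assumed to be integers, and the function name `matched_in_unit` is hypothetical.

```python
import bisect

INF = float("inf")


def matched_in_unit(low_tree, v, w_min, w_max, fuzzy=False):
    """Return the ids of predicates in one width unit that are reported as matching.

    low_tree: sorted list of (low, high, sub_id); widths lie in [w_min, w_max].
    """
    start = bisect.bisect_left(low_tree, (v - w_max,))   # v'' = v - w'
    mid = bisect.bisect_left(low_tree, (v - w_min,))     # v'  = v - w
    end = bisect.bisect_right(low_tree, (v, INF, INF))   # last low value <= v
    # Matching space: every predicate here contains v, no check needed.
    hits = [sub_id for _, _, sub_id in low_tree[mid:end]]
    # Candidate space: exact matching checks the high value,
    # fuzzy matching accepts every candidate without checking.
    for _, high, sub_id in low_tree[start:mid]:
        if fuzzy or high >= v:
            hits.append(sub_id)
    return sorted(hits)


# One width unit holding predicates with widths between 20 and 40.
tree = sorted([(0, 35, 1), (5, 27, 2), (15, 40, 3), (28, 50, 4)])
print(matched_in_unit(tree, 30, 20, 40))              # [1, 3, 4]
print(matched_in_unit(tree, 30, 20, 40, fuzzy=True))  # [1, 2, 3, 4]
```

In this example predicate 2, the interval [5, 27], does not contain the event value 30, so its appearance in the fuzzy result is a false positive of the kind described above.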
The system for improving the efficiency and the robustness of the matching algorithm based on the data structure can be realized through the step flow in the method for improving the efficiency and the robustness of the matching algorithm based on the data structure. Those skilled in the art can understand the method for improving the efficiency and the robustness of the matching algorithm based on the data structure as a preferred example of the system for improving the efficiency and the robustness of the matching algorithm based on the data structure.
Example 2
Example 2 is a preferred example of Example 1.
Existing matching algorithms can be broadly divided into two categories. One type is forward matching, which focuses on finding matching predicates to determine which subscriptions are matching. Such matching algorithms can be further divided into count-based matching and tree-structure filtering-based matching. Another type of matching algorithm is reverse matching. Their main idea is to indirectly determine matching subscriptions by determining which predicates are not matching. The data structures of these algorithms can only support a single matching method. For forward matching, the efficiency of the matching algorithm decreases as the number of matching predicates increases, and reverse matching is the opposite. Therefore, the dynamic environment cannot be gracefully adapted with a single matching method.
Matching algorithms can also be classified as exact or fuzzy. Unlike exact matching, fuzzy matching may accept some false-positive subscriptions as matches. In this way, matching performance is improved on the condition that a certain false-positive rate is tolerated. In addition, according to the event types they support, matching algorithms can be further divided into single-event-type and multi-event-type algorithms. Compared with a single event type, multi-event-type support provides richer subscription expressions and ensures event matching in high-dimensional spaces.
To overcome the drawback that existing matching algorithms adopt a single matching method, and to realize a publish/subscribe system with stronger generality, higher matching efficiency, and more stable performance, the invention proposes a preset data structure. The overall framework of the data structure is shown in FIG. 1. The entire data structure is divided into a two-level index layer and a storage layer. The first-level index is attribute-based: predicates with the same attribute are mapped into the same attribute unit. The second-level index is a mapping based on interval predicate width: the width of each predicate is computed first, and the predicate is then mapped into a width unit according to that width. This mapping enables interval predicates with the same width but different centers to be mapped into the same width unit. The width units are divided uniformly. For example, when the value space [0,1] is divided into 5 width units, each width unit covers a width range of 0.2: the width range of the first width unit is [0,0.2], that of the second width unit is [0.2,0.4], and so on.
The storage layer stores the subscriptions and uses B+ trees: two B+ trees are arranged for each width unit, one for the low values and one for the high values of the interval predicates, and the low-value tree holds links to the high-value tree. A B+ tree is self-balancing and keeps the inserted elements in order. In addition, two sets of markers and one set of recorders are required in the matching process. One set of markers uses a bit set to mark non-matching subscriptions, and the other uses counters to count the number of matching predicates in each subscription. The recorder records the task partition used in hybrid matching.
To insert a subscription, each predicate is mapped into the corresponding width unit according to its attribute and width, and the low value and the high value of the predicate are then inserted into the two B+ trees of that unit.
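The uniform division in the example above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `width_unit_index`, the zero-based unit numbering, and the choice to assign a width lying exactly on a boundary (such as 0.2) to the higher unit are assumptions, since the text does not say how boundary widths are handled.

```python
def width_unit_index(low: float, high: float, num_units: int = 5) -> int:
    """Return the zero-based width unit for the interval predicate [low, high].

    Assumes attribute values are normalized to [0, 1] and that the width
    range [0, 1] is divided uniformly into `num_units` units.
    """
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError("expected 0 <= low <= high <= 1")
    width = high - low
    # Boundary widths go to the higher unit (an assumption); a width of
    # exactly 1.0 is clamped into the last unit.
    return min(int(width * num_units), num_units - 1)


# Same width, different centers: both predicates land in the first unit.
print(width_unit_index(0.10, 0.25))  # 0
print(width_unit_index(0.60, 0.75))  # 0
# A wider predicate lands in a later unit.
print(width_unit_index(0.10, 0.60))  # 2
```

Because only the width enters the computation, the center of the interval has no influence on which unit is chosen.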
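The insertion path through the two index levels and the storage layer can be sketched as below. Several simplifications are assumptions made for illustration: sorted Python lists maintained with `bisect.insort` stand in for the B+ trees, attribute values are taken to be integers in [0, 100] so that the arithmetic is exact, storing the high value next to the low value stands in for the link from the low-value tree to the high-value tree, and all names are hypothetical.

```python
import bisect

NUM_UNITS = 5      # width units per attribute (assumed)
UNIT_WIDTH = 20    # attribute values are assumed to be integers in [0, 100]


def insert_subscription(index, sub_id, predicates):
    """Index one subscription.

    index:      dict mapping attribute -> list of NUM_UNITS width units
    predicates: dict mapping attribute -> (low, high)
    """
    for attr, (low, high) in predicates.items():
        # First level: locate (or create) the attribute unit.
        units = index.setdefault(
            attr, [{"low": [], "high": []} for _ in range(NUM_UNITS)]
        )
        # Second level: choose the width unit from the predicate width.
        unit = units[min((high - low) // UNIT_WIDTH, NUM_UNITS - 1)]
        # Storage layer: sorted lists stand in for the two B+ trees. Keeping
        # `high` next to `low` stands in for the low-tree-to-high-tree link.
        bisect.insort(unit["low"], (low, high, sub_id))
        bisect.insort(unit["high"], (high, sub_id))


index = {}
insert_subscription(index, 1, {"price": (10, 25), "volume": (20, 90)})
insert_subscription(index, 2, {"price": (30, 45)})
print(index["price"][0]["low"])    # [(10, 25, 1), (30, 45, 2)]
print(index["volume"][3]["high"])  # [(90, 1)]
```

A real B+ tree would give logarithmic insertion, whereas `bisect.insort` on a list is linear in the worst case; the lists are used here only because they keep the example short and ordered.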
Three matching algorithms based on preset data structure
Based on a preset data structure, the invention provides three matching algorithms, namely forward matching (AFM), reverse matching (ABM) and hybrid matching (AHM). The three matching algorithms are based on the same data structure.
(1) Forward matching employs a way of counting the predicates of the match. Upon a match, all low value trees in the data structure need to be checked. As shown in fig. 2, the search space on the low value tree is matched in the forward direction. In one width unit, the width range of the interval predicate of the width unit index is assumed to be [ w, w' ]. For an event value v, let v ' =v-w, v "=v-w ', then the subscriptions in the v ' v space are all matched. The remaining possible matching subscriptions are contained within the v ", v' ] space. The forward matching thus involves three steps. And step 1, adding one operation to the subscribed counter corresponding to the predicate indexed in the [ v', v ] space. And 2, checking the high value of the predicate in the [ v ', v' ] space through the pointer arranged on the B+ tree. If the high value is greater than or equal to v, then the counter of the corresponding subscription containing the predicate is matched and also incremented. And 3, checking the counter after all the low value tree operations in the data structure are completed. If the value of the counter is the same as the number of predicates for the corresponding subscription, the subscription is matched.
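The three forward matching steps can be sketched as follows. This is an illustration under stated assumptions rather than the AFM implementation itself: sorted lists of `(low, high, sub_id)` tuples stand in for the low-value B+ trees and their links, attribute values are assumed to be integers in [0, 100], five width units of width 20 are used, and the names `build_index` and `forward_match` are hypothetical.

```python
import bisect

NUM_UNITS = 5
UNIT_WIDTH = 20          # integer attribute values in [0, 100] are assumed
INF = float("inf")


def build_index(subs):
    """attribute -> one sorted low-value list per width unit."""
    index = {}
    for sub_id, preds in subs.items():
        for attr, (low, high) in preds.items():
            units = index.setdefault(attr, [[] for _ in range(NUM_UNITS)])
            unit = min((high - low) // UNIT_WIDTH, NUM_UNITS - 1)
            bisect.insort(units[unit], (low, high, sub_id))
    return index


def forward_match(index, subs, event):
    counters = {sub_id: 0 for sub_id in subs}
    for attr, v in event.items():
        for unit, low_tree in enumerate(index.get(attr, [])):
            w = unit * UNIT_WIDTH            # smallest width in this unit
            w_max = (unit + 1) * UNIT_WIDTH  # upper bound on widths in this unit
            start = bisect.bisect_left(low_tree, (v - w_max,))  # v''
            mid = bisect.bisect_left(low_tree, (v - w,))        # v'
            end = bisect.bisect_right(low_tree, (v, INF, INF))  # low <= v
            # Step 1: low values in [v', v] always match.
            for _, _, sub_id in low_tree[mid:end]:
                counters[sub_id] += 1
            # Step 2: low values in [v'', v') match only if high >= v.
            for _, high, sub_id in low_tree[start:mid]:
                if high >= v:
                    counters[sub_id] += 1
    # Step 3: a subscription matches when all of its predicates were counted.
    return sorted(s for s, c in counters.items() if c == len(subs[s]))


subs = {
    1: {"price": (10, 25), "volume": (20, 90)},
    2: {"price": (30, 45)},
    3: {"price": (5, 60), "volume": (50, 55)},
}
index = build_index(subs)
print(forward_match(index, subs, {"price": 20, "volume": 52}))  # [1, 3]
```

The work done per event grows with the number of predicates that are counted, which is consistent with the observation that forward matching slows down as the matching probability rises.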
(2) Reverse matching marks the unmatched predicates. Fig. 3 shows the search space of reverse matching. Within a width unit, for an event value v, every predicate whose low value is greater than v or whose high value is less than v is unmatched. Reverse matching can therefore be divided into three steps. First, on the low value tree, all predicates with low values greater than v are marked. Second, on the high value tree, all predicates with high values less than v are marked. Finally, the markers are checked, and the unmarked subscriptions are matched.
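A minimal sketch of reverse matching within one width unit follows. It assumes sorted lists in place of the two B+ trees, integer subscription ids, and a `bytearray` in place of the bit set; the function name `reverse_match_unit` is hypothetical.

```python
import bisect


def reverse_match_unit(lows, highs, v, marked):
    """Mark subscriptions that own an unmatched predicate in this width unit.

    lows:  sorted list of (low, sub_id)
    highs: sorted list of (high, sub_id)
    """
    # Step 1: every predicate with low > v is unmatched.
    first_above = bisect.bisect_right(lows, (v, float("inf")))
    for _, sub in lows[first_above:]:
        marked[sub] = 1
    # Step 2: every predicate with high < v is unmatched.
    first_not_below = bisect.bisect_left(highs, (v,))
    for _, sub in highs[:first_not_below]:
        marked[sub] = 1


intervals = {0: (0.10, 0.40), 1: (0.30, 0.60), 2: (0.45, 0.70), 3: (0.70, 0.95)}
lows = sorted((lo, s) for s, (lo, hi) in intervals.items())
highs = sorted((hi, s) for s, (lo, hi) in intervals.items())
marked = bytearray(len(intervals))
reverse_match_unit(lows, highs, 0.65, marked)
# Step 3: unmarked subscriptions are matched.
matched = [s for s in intervals if not marked[s]]
```

Neither scan needs to inspect the other endpoint of a predicate, which is why reverse matching suits wide predicates that rarely fail to match.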
(3) Hybrid matching combines forward matching and reverse matching by dividing the width units between them. Forward matching is used for the width units that index narrow-interval predicates (predicates whose interval width is less than a given threshold), and reverse matching is used for the other width units. A recorder records, for each subscription, the number of its predicates assigned to forward matching. After the forward matching and reverse matching parts are completed, each subscription left unmarked by reverse matching is checked: if the values of its counter and its recorder are equal, the subscription is matched.
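The way the counter, marker and recorder are combined can be sketched as follows. To keep the example short, plain loops replace the B+ tree range scans, and the recorder is filled during matching rather than at insertion time; `hybrid_match`, `kappa_unit` and the layout of `units` are assumptions made for illustration.

```python
def hybrid_match(units, kappa_unit, event, num_subs):
    """Return the ids of subscriptions matched by the event.

    units: {(attribute, unit_no): [(low, high, sub_id), ...]}
    Units with unit_no < kappa_unit index narrow predicates and use
    forward matching; all other units use reverse matching.
    """
    counter = [0] * num_subs      # forward part: matched predicates
    recorder = [0] * num_subs     # predicates assigned to the forward part
    marked = bytearray(num_subs)  # reverse part: 1 = has an unmatched predicate
    for (attr, unit_no), preds in units.items():
        v = event[attr]
        for low, high, sub in preds:
            if unit_no < kappa_unit:
                recorder[sub] += 1
                if low <= v <= high:
                    counter[sub] += 1
            elif low > v or high < v:
                marked[sub] = 1
    # A subscription matches if reverse matching left it unmarked and
    # all of its forward-assigned predicates were counted.
    return [s for s in range(num_subs)
            if not marked[s] and counter[s] == recorder[s]]


units = {
    ("price", 0): [(0.10, 0.20, 0), (0.15, 0.30, 1), (0.12, 0.18, 2)],
    ("size", 3): [(0.0, 0.9, 0)],
    ("size", 2): [(0.5, 1.0, 2)],
}
result = hybrid_match(units, 2, {"price": 0.16, "size": 0.3}, 3)
```

Here subscription 2 satisfies its narrow "price" predicate but is marked by its wide "size" predicate, so only subscriptions 0 and 1 are returned.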
The division point used by hybrid matching to assign width units should give forward matching and reverse matching matching times that are as close as possible. Assuming that the width at the division point is κ, forward and reverse matching have similar matching times when κ satisfies the following equation.
[Equation (1): image not recovered]
Here v represents the event value, two cost symbols (not recovered) represent the unit cost of marking and of counting, respectively, and Γ(x) represents the probability that the low or high value of a predicate equals x. Solving the above equation yields κ, for example when predicates are uniformly distributed in the value range space (the resulting expression is not recovered). The width unit containing κ serves as the boundary: width units that index predicates with widths smaller than κ are assigned to forward matching, and the remaining width units are assigned to reverse matching, which completes the task partition.
Reduced search space optimization for matching algorithms
As shown in Figs. 4 to 5, using the information of each width unit, that is, the upper and lower limits of the interval predicate widths that can be mapped to the unit, the entire search space can be divided into a matching space, a non-matching space, and an empty space. All predicates in the matching space are satisfied, so when the forward matching algorithm is used, the counters of the corresponding subscriptions can be incremented directly. No predicate in the non-matching space is satisfied, so when the reverse matching algorithm is used, the corresponding subscriptions can be marked directly. The empty space contains no predicates and need not be checked during matching, which reduces the traversal cost.
Both forward matching and reverse matching involve determining a target space: forward matching must determine the space in which matching predicates are located, while reverse matching must determine the space in which non-matching predicates are located. The invention provides an optimization that reduces the search space in some situations. Let the value range space of the attribute be [0,1]. For a width unit whose predicate width range is [w, w'], no predicate lies in the (1-w, 1] space on the low value tree or in the [0, w) space on the high value tree, because every predicate is at least w wide and must fit inside [0,1].
For forward matching, given an event value v, when v lies in [1-w, w], all predicates on the low value tree are matched. In that case the whole matching predicate space is determined with only two comparisons, which saves the search on the B+ tree.
For reverse matching, when v is greater than 1-w there is no unmatched predicate on the low value tree, and when v is less than w there is no unmatched predicate on the high value tree. When v satisfies either condition, two comparisons are enough to determine that the corresponding tree contains no unmatched predicate.
This method reduces the cost of locating the target space and further improves the matching efficiency. It is most effective when the predicate width exceeds half of the value range space, since the interval [1-w, w] is non-empty only when w is at least one half.
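The comparisons described above can be written as three small checks. This is a sketch under the stated assumption that attribute values lie in [0, 1] and that every predicate in the unit is at least `w_min` wide; the function names are hypothetical.

```python
def forward_all_match(w_min, v):
    """True when every predicate in the width unit matches v.

    All low values are at most 1 - w_min and all high values are at
    least w_min, so any v in [1 - w_min, w_min] lies inside every predicate.
    """
    return 1 - w_min <= v <= w_min


def low_tree_has_no_unmatched(w_min, v):
    """True when no low value can exceed v, so the low value tree is skipped."""
    return v > 1 - w_min


def high_tree_has_no_unmatched(w_min, v):
    """True when no high value can be below v, so the high value tree is skipped."""
    return v < w_min


# Predicates of a width unit whose minimum width is 0.75.
preds = [(0.0, 0.75), (0.1, 0.9), (0.25, 1.0)]
everything_matches = forward_all_match(0.75, 0.5)
```

With a minimum width of 0.75, any event value between 0.25 and 0.75 matches all three predicates without touching the trees.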
String type matching and fuzzy matching support
Fuzzy matching means that the matching result is not exact and may contain false positives, i.e., a non-matching subscription is judged to be matching. Fuzzy matching algorithms generally improve matching performance at the cost of some false positives.
The invention supports string matching by converting string predicates into interval predicates. The supported operators are <, ≤, =, >, ≥, and prefix matching. The conversions are as follows:
Ai < “abcde” → Ai ∈ [“”, “abcde”)
Ai ≤ “abcde” → Ai ∈ [“”, “abcde”]
Ai = “abcde” → Ai ∈ [“abcde”, “abcde”]
Ai > “abcde” → Ai ∈ (“abcde”, “INF”]
Ai ≥ “abcde” → Ai ∈ [“abcde”, “INF”]
Ai = “abcd*” → Ai ∈ [“abcd”, “abce”)
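The conversions listed above can be sketched as a small function. The sketch assumes ordinary lexicographic string comparison, represents “INF” by a sentinel string `MAX_STR` assumed to be greater than any stored value, and assumes a non-empty prefix before `*`; `to_interval` and `contains` are hypothetical names.

```python
MAX_STR = "\U0010ffff"  # assumed sentinel standing in for "INF"


def to_interval(op, value):
    """Convert a string predicate to (low, high, low_closed, high_closed)."""
    if op == "<":
        return ("", value, True, False)
    if op == "<=":
        return ("", value, True, True)
    if op == "=":
        if value.endswith("*"):
            prefix = value[:-1]
            # Upper bound: the prefix with its last character incremented.
            upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
            return (prefix, upper, True, False)
        return (value, value, True, True)
    if op == ">":
        return (value, MAX_STR, False, True)
    if op == ">=":
        return (value, MAX_STR, True, True)
    raise ValueError(f"unsupported operator: {op}")


def contains(interval, s):
    """Check whether string s falls inside the converted interval."""
    low, high, low_closed, high_closed = interval
    above = s >= low if low_closed else s > low
    below = s <= high if high_closed else s < high
    return above and below


prefix_interval = to_interval("=", "abcd*")
```

After conversion, a string predicate can be indexed like any other interval predicate once an ordering-preserving mapping of strings to the value range is chosen; that mapping is outside the scope of this sketch.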
The present invention also supports fuzzy matching: all predicates in the candidate space shown in FIG. 2 are considered satisfied, and their high values are not checked further on the high value tree. In forward matching, omitting the check of possibly matching predicates further improves the matching efficiency, but it introduces some error into the matching result: the matched subscriptions contain some false positive subscriptions. Given the number ζ of width units per attribute, the maximum predicate error rate f on a single attribute is:
[equation image not recovered]
Given the maximum allowed error rate F, equation (2) can be used to calculate the number of width units into which each attribute is divided.
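The effect of skipping the high value check can be illustrated with a short sketch. It assumes a sorted list of `(low, high, sub_id)` tuples for one width unit and integer subscription ids; `fuzzy_forward_unit` is a hypothetical name, and the error rate formula itself is not modeled here.

```python
import bisect


def fuzzy_forward_unit(lows, w_max, v):
    """Return ids of predicates treated as satisfied by fuzzy forward matching.

    Every predicate with a low value in [v - w_max, v] is accepted;
    high values are never read.
    """
    start = bisect.bisect_left(lows, (v - w_max,))
    end = bisect.bisect_right(lows, (v, float("inf"), float("inf")))
    return [sub for _, _, sub in lows[start:end]]


lows = [(0.10, 0.40, 0), (0.30, 0.60, 1), (0.45, 0.70, 2), (0.70, 0.95, 3)]
v = 0.65
fuzzy = fuzzy_forward_unit(lows, 0.50, v)
exact = [sub for low, high, sub in lows if low <= v <= high]
false_positives = [sub for sub in fuzzy if sub not in exact]
```

In this example subscription 1 is accepted although its high value 0.60 is below the event value, which is the kind of false positive the text describes. Narrower width units shrink the candidate space and therefore the number of such errors.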
Those skilled in the art will appreciate that the systems, apparatus, and their respective modules provided herein may be implemented entirely by logic programming of method steps such that the systems, apparatus, and their respective modules are implemented as logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, etc., in addition to the systems, apparatus, and their respective modules being implemented as pure computer readable program code. Therefore, the system, the apparatus, and the respective modules thereof provided by the present invention may be regarded as one hardware component, and the modules included therein for implementing various programs may also be regarded as structures within the hardware component; modules for implementing various functions may also be regarded as being either software programs for implementing the methods or structures within hardware components.
The foregoing describes specific embodiments of the present invention. It is to be understood that the invention is not limited to the particular embodiments described above, and that various changes or modifications may be made by those skilled in the art within the scope of the appended claims without affecting the spirit of the invention. The embodiments of the present application and features in the embodiments may be combined with each other arbitrarily without conflict.

Claims (8)

1. A method for improving the efficiency and robustness of a matching algorithm based on a data structure, comprising: indexing the subscription based on a matching algorithm by using a preset data structure;
the preset data structure comprises a two-level index layer and a storage layer;
the two-stage index layer comprises a first-stage index layer and a second-stage index layer;
the first-level index layer is based on mapping of attributes, and predicates with the same attributes are mapped into the same attribute units;
the second-stage index layer is based on mapping of interval predicate widths, and predicates are mapped into different width units according to the interval predicate widths, so that interval predicates with the same width but different centers can be mapped into the same width units;
the storage layer is used for storing subscription;
the width units are divided in a uniform manner;
the matching algorithm comprises the following steps: forward matching AFM, reverse matching ABM, and hybrid matching AHM;
the forward matching AFM is used for checking all low-value trees in the data structure when matching by adopting a mode of counting matched predicates;
the reverse matching ABM is a mode of marking unmatched predicates;
the hybrid matching AHM combines the forward matching and reverse matching methods, performs task division on the width units, uses forward matching on the width units used for indexing narrow-interval predicates, and uses reverse matching on the other width units;
the hybrid matching adopts: performing task division on the width units, using forward matching on the width units for indexing narrow-interval predicates, and using reverse matching on the other width units; recording, through a recorder, the number of predicates assigned to forward matching in each subscription; after the forward matching and the reverse matching are completed, for each subscription left unmarked in the reverse matching, checking whether the values of its counter and its recorder are equal, and if so, the current subscription is matched;
the narrow-interval predicate is a predicate with interval width smaller than a preset value;
the division point used by the hybrid matching for task allocation of the width units should make the forward matching and the reverse matching have matching times that are as equal as possible; assuming that the width of the division point is κ, when κ satisfies the following equation, the forward and reverse matching have similar matching times;
wherein v represents an event value, and I and J represent unit costs of performing marking and counting, respectively; Γ (x) represents the probability that the low or high value of the predicate equals x; x represents a random variable.
2. The method for improving the efficiency and the robustness of a matching algorithm based on a data structure according to claim 1, wherein the storage layer uses B+ trees for storage; two B+ trees are provided for each width unit, corresponding respectively to the low values and the high values of the interval predicates, and links to the high value tree are arranged on the low value tree; the B+ tree is self-balancing and keeps the inserted elements in order;
each predicate is mapped to a corresponding width unit according to its attribute and its width, and the low value and the high value of each predicate are respectively inserted into the two B+ trees to complete the subscription insertion process.
3. The method for improving the efficiency and robustness of a matching algorithm based on a data structure according to claim 1, further comprising two sets of markers and one set of recorders during the matching process; one group of markers uses a bit set to mark unmatched subscriptions, and the other group uses a counter to count the number of matched predicates in the subscriptions; the recorder is used for recording task partitions in hybrid matching.
4. The method for improving efficiency and robustness of a matching algorithm based on a data structure according to claim 1, wherein the forward matching employs:
step S1: incrementing the counter of the subscription corresponding to each predicate indexed in the [v', v] space; wherein v represents an event value; in one width unit, the width range of the interval predicates indexed by the width unit is assumed to be [w, w'], and v' = v - w;
step S2: for predicates indexed in the [v'', v'] space, checking the high value of the predicate on the high value tree through pointers arranged on the B+ tree; when the high value of the predicate is greater than or equal to v, the predicate is matched, and the counter of the subscription containing the current predicate is incremented; wherein v'' = v - w';
step S3: after all low value tree operations in the data structure are completed, the counter is checked, and when the value of the counter is the same as the number of predicates of the corresponding subscription, the current subscription is matched.
5. The method for improving efficiency and robustness of a matching algorithm based on a data structure according to claim 1, wherein the reverse matching employs:
step S4: marking all predicates with low values larger than the event value v on the low value tree;
step S5: marking all predicates with high values smaller than the event value v on the high value tree;
step S6: the markers are checked, and the unmarked subscriptions are matched.
6. The method for improving the efficiency and the robustness of a matching algorithm based on a data structure according to claim 1, wherein the whole search space is divided into a matching space, a non-matching space and an empty space according to the information of each width unit; all predicates in the matching space are satisfied, and when the forward matching algorithm is used, the counters of the subscriptions corresponding to the satisfied predicates are incremented directly; no predicate in the non-matching space is satisfied, and when the reverse matching algorithm is used, the subscriptions corresponding to the unsatisfied predicates are marked directly; the empty space does not contain any predicates and is not checked in the matching process, so that the traversal cost is reduced.
7. The method for improving the efficiency and the robustness of a matching algorithm based on a data structure according to claim 1, wherein character string type matching and fuzzy matching are supported;
the character string type matching is realized by converting the character type into the form of interval predicates;
the fuzzy matching considers that all predicates in the candidate subspace meet the conditions, and the high value of the predicates is not further checked on the high value tree;
in forward matching, the matching efficiency is further improved by omitting the checking of possible matching predicates; however, certain errors are brought to the matching result, and certain false positive subscriptions are contained in the matched subscriptions;
given the number ζ of width units per attribute, the predicate maximum error rate f on a single attribute is:
given the maximum error rate F allowed, the number of width cells divided on each attribute is calculated.
8. A system for improving the efficiency and robustness of a matching algorithm based on a data structure, comprising: indexing the subscription based on a matching algorithm by using a preset data structure;
the preset data structure comprises a two-level index layer and a storage layer;
the two-stage index layer comprises a first-stage index layer and a second-stage index layer;
the first-level index layer is based on mapping of attributes, and predicates with the same attributes are mapped into the same attribute units;
the second-stage index layer is based on mapping of interval predicate widths, and predicates are mapped into different width units according to the interval predicate widths, so that interval predicates with the same width but different centers can be mapped into the same width units;
the storage layer is used for storing subscription;
the width units are divided in a uniform manner;
the matching algorithm comprises the following steps: forward matching AFM, reverse matching ABM, and hybrid matching AHM;
the forward matching AFM is used for checking all low-value trees in the data structure when matching by adopting a mode of counting matched predicates;
the reverse matching ABM is a mode of marking unmatched predicates;
the hybrid matching AHM combines the forward matching and reverse matching methods, performs task division on the width units, uses forward matching on the width units used for indexing narrow-interval predicates, and uses reverse matching on the other width units;
the hybrid matching adopts: performing task division on the width units, using forward matching on the width units for indexing narrow-interval predicates, and using reverse matching on the other width units; recording, through a recorder, the number of predicates assigned to forward matching in each subscription; after the forward matching and the reverse matching are completed, for each subscription left unmarked in the reverse matching, checking whether the values of its counter and its recorder are equal, and if so, the current subscription is matched;
the narrow-interval predicate is a predicate with interval width smaller than a preset value;
the division point used by the hybrid matching for task allocation of the width units should make the forward matching and the reverse matching have matching times that are as equal as possible; assuming that the width of the division point is κ, when κ satisfies the following equation, the forward and reverse matching have similar matching times;
wherein v represents an event value, and I and J represent unit costs of performing marking and counting, respectively; Γ (x) represents the probability that the low or high value of the predicate equals x; x represents a random variable.
CN202111056560.5A 2021-09-09 2021-09-09 Method and system for improving efficiency and robustness of matching algorithm based on data structure Active CN113722332B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111056560.5A CN113722332B (en) 2021-09-09 2021-09-09 Method and system for improving efficiency and robustness of matching algorithm based on data structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111056560.5A CN113722332B (en) 2021-09-09 2021-09-09 Method and system for improving efficiency and robustness of matching algorithm based on data structure

Publications (2)

Publication Number Publication Date
CN113722332A CN113722332A (en) 2021-11-30
CN113722332B true CN113722332B (en) 2024-03-26

Family

ID=78682867

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111056560.5A Active CN113722332B (en) 2021-09-09 2021-09-09 Method and system for improving efficiency and robustness of matching algorithm based on data structure

Country Status (1)

Country Link
CN (1) CN113722332B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102004798A (en) * 2010-12-27 2011-04-06 东北大学 Matching method of symmetrical issuing subscription system based on plural one-dimensional index
CN102819569A (en) * 2012-07-18 2012-12-12 中国科学院软件研究所 Matching method for data in distributed interactive simulation system
CN103984760A (en) * 2014-05-29 2014-08-13 中国航空无线电电子研究所 Data structure oriented to content publishing and subscribing system and mixed event matching method thereof

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10642918B2 (en) * 2013-03-15 2020-05-05 University Of Florida Research Foundation, Incorporated Efficient publish/subscribe systems

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
An efficient publish subscribe index for e-commerce databases.;Dongxiang Zhang等;《Proceedings of the VLDB Endowment》;20141231;第Vol. 7卷(第No. 8期);正文第613-624页 *

Also Published As

Publication number Publication date
CN113722332A (en) 2021-11-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant