CN102298624B

CN102298624B - Approximate dynamic ring matching method for homogeneous and symmetric publication and subscription system

Info

Publication number: CN102298624B
Application number: CN201110233554.2A
Authority: CN
Inventors: 王波涛; 信俊昌; 王立军; 马素华
Original assignee: Northeastern University China
Current assignee: Northeastern University China
Priority date: 2011-08-15
Filing date: 2011-08-15
Publication date: 2014-08-20
Anticipated expiration: 2031-08-15
Also published as: CN102298624A

Abstract

The invention provides an approximate dynamic ring matching method for a homogeneous symmetric publish/subscribe system, comprising the following steps: step 1, obtaining the subscription probability; step 2, calculating the interval width corresponding to the threshold position and the approximate boundary within the domain; and step 3, estimating the proportion of storage space saved. The method applies to approximate dynamic ring matching under arbitrary data distributions and can be used in real-time environments, improving accuracy by 15% on average. The prediction formula for the proportion of space saved can be evaluated for any type of data distribution with high accuracy. By further mining how the probability of a subscription being matched is distributed over the whole domain-size space, and by analyzing the relationships between subscription dimensions and the distribution characteristics of the data in each dimension, strategies such as dimension reduction bring the predicted result closer to the true value and yield a better prediction.

Description

Approximate dynamic ring matching method for a homogeneous symmetric publish/subscribe system

Technical field

The invention belongs to the field of computer databases, and in particular relates to an approximate dynamic ring matching method for a homogeneous symmetric publish/subscribe system.

Background technology

A publish/subscribe system is a distributed middleware system that supports the interaction between producers and consumers of information. In symmetric applications, a publisher not only publishes messages but must also be able to select subscriptions, so the roles of publisher and subscriber are symmetric. In such applications, the information a user owns and the information the user needs are described with the same data structure; this is the notion of homogeneous symmetry. Homogeneous symmetric publish/subscribe (HSPUB/SUB) systems are increasingly widely applied, for example in barter exchange services and house exchange, and ring matching is one of the key problems an HSPUB/SUB system must solve. According to the index structure used for subscriptions, publish/subscribe matching methods can be divided into the following four classes.

Based on one-dimensional indexes. One-dimensional index structures, such as red-black trees, hash tables and B+ trees, are used to index the predicates defined in subscriptions and to count the predicates that are satisfied. In general, predicates defined on the same attribute with the same operator are indexed in one index structure. There are two main algorithms based on one-dimensional indexes: the Count algorithm and the Hanson algorithm.

Based on multi-dimensional indexes. In a multi-dimensional space a subscription is regarded as an object, and the matching operation is identical to a search operation. The main idea is either to index subscriptions directly with a multi-dimensional index, or to convert a d-dimensional hypercube into a point in 2d dimensions, thereby avoiding heavy overlap in the multi-dimensional space.

Based on test networks. Test-network techniques first store the subscription information in a matching tree. Unlike predicate indexing, a test network builds the subscription index tree from the subscription information itself. Each non-leaf node contains a test, and the edges leaving the node represent the results of that test. A leaf node contains a subscription and is reached by an edge that represents a test result. Matching consists of traversing the tree by performing the test described at each node and following the edge that corresponds to the test result.

Based on graphs. The basic idea of graph-based matching is to find cycles in a directed graph built from the subscriptions of a homogeneous symmetric publish/subscribe application. Each node in the graph represents a subscription. If subscription S1 matches S2, a weighted directed edge is created from S1 to S2.

The goal of the first three classes of matching methods is to find one-to-one matches efficiently; they cannot be used directly for symmetric matching. The goal of graph-based methods is to find the optimal set of ring matches, but the graph structure is ill-suited to real-time applications with frequent insertions and deletions, the problem is NP-hard, and the quality of the result depends on the concrete application environment. In a dynamic environment, for example, ring matches can be found with a dynamically updated ring matching method, but the number of ring matches generated grows exponentially with the ring length and requires a large amount of storage. In a 2010 master's thesis at Northeastern University, Tan Xianting proposed a threshold-based strategy for handling intermediate results, but its range of application is limited: it applies only when the data in every dimension are uniformly and independently distributed, and the formula it derives for predicting the proportion of space saved is not very accurate.

Summary of the invention

To overcome the deficiencies of the prior art, the present invention proposes an approximate dynamic ring matching method for a homogeneous symmetric publish/subscribe system that applies to approximate dynamic ring matching under arbitrary data distributions.

The technical solution adopted by the invention is as follows. Before a chain subscription of length MaxLength-1 (where MaxLength is the maximum ring length defined by the system) is inserted into the subscription database, the probability that the chain subscription will be matched is calculated. If this probability is less than the threshold, the subscription is discarded to save storage space. By mining how the probability of a subscription being matched is distributed over the whole domain-size space and analyzing the relationships between subscription dimensions, strategies such as dimension reduction are used to obtain more accurate upper and lower bounds on the proportion of space saved.
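The pruning rule in the preceding paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `try_insert_chain`, the list-based `database`, and the caller-supplied `match_probability` estimator are all hypothetical.

```python
from typing import Callable, List, Sequence


def try_insert_chain(
    database: List[Sequence],
    chain: Sequence,
    max_length: int,
    threshold: float,
    match_probability: Callable[[Sequence], float],
) -> bool:
    """Store `chain` unless it is a low-probability chain of length MaxLength-1.

    Returns True if the chain was stored and False if it was discarded.
    """
    # Only chains one step short of the maximum ring length are filtered.
    if len(chain) == max_length - 1 and match_probability(chain) < threshold:
        return False  # discarded to save storage space
    database.append(chain)
    return True
```

Shorter chains are always stored in this sketch, because the text applies the threshold test only to chains of length MaxLength-1.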

The matching and ring matching involved in the method are defined as follows:

Definition 1 (matching): For two subscriptions S_1 and S_2, each with 2d attributes, and for all 1≤i≤d, if the intervals I_{a_i} ∈ S_1 and I_{a_{i+d}} ∈ S_2 intersect, S_1 and S_2 are said to be in a one-way own-want match, or "one-way match" for short, as shown in Fig. 1(a). If S_2 also one-way matches S_1, then S_1 and S_2 are said to "match", as shown in Fig. 1(b).

Definition 2 (ring matching): For a subscription set C_N = {S_1, ..., S_i, ..., S_N}, where N > 2, if S_i matches S_{i+1} (1≤i≤N-1) and S_N matches S_1, then C_N is called a ring match of length N. As shown in Fig. 2(a), the difference between a subscription chain and a ring match is that in a ring match S_N must match S_1. A ring match is therefore also a chain: a ring match of length N can be regarded as a chain and used to create a ring match of length N+1, as shown in Fig. 2(b).

Definition 3 (chain subscription): For a subscription set L_N = {S_1, ..., S_i, ..., S_N}, where N > 2, if S_i matches S_{i+1} (1≤i≤N-1), then L_N is called a chain of length N. L_N consists of the same want predicates as S_1 and the same own predicates as S_N, and is regarded and processed as a single subscription; it is therefore also called a chain subscription. Fig. 2(a) shows an example of a chain subscription.
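Definitions 1 to 3 can be illustrated with a short sketch. The representation is an assumption made for illustration only: a subscription is a sequence of 2d closed integer intervals, and the relation used between consecutive subscriptions is passed in as the hypothetical parameter `rel`, defaulting to the two-way `matches` of Definition 1.

```python
from typing import Callable, Sequence, Tuple

Interval = Tuple[int, int]          # closed interval [S, E]
Subscription = Sequence[Interval]   # 2d intervals, attributes a_1 .. a_2d


def intersects(a: Interval, b: Interval) -> bool:
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def one_way_match(s1: Subscription, s2: Subscription) -> bool:
    """Definition 1: interval i of s1 intersects interval i+d of s2 for all i."""
    d = len(s1) // 2
    return all(intersects(s1[i], s2[i + d]) for i in range(d))


def matches(s1: Subscription, s2: Subscription) -> bool:
    """Definition 1: one-way match in both directions."""
    return one_way_match(s1, s2) and one_way_match(s2, s1)


Relation = Callable[[Subscription, Subscription], bool]


def is_chain(subs: Sequence[Subscription], rel: Relation = matches) -> bool:
    """Definition 3: more than two subscriptions, each related to the next."""
    return len(subs) > 2 and all(
        rel(subs[i], subs[i + 1]) for i in range(len(subs) - 1)
    )


def is_ring_match(subs: Sequence[Subscription], rel: Relation = matches) -> bool:
    """Definition 2: a chain whose last subscription is related to the first."""
    return is_chain(subs, rel) and rel(subs[-1], subs[0])
```

Passing `rel=one_way_match` instead checks rings built from one-way matches only; the definitions above do not say which reading an implementation should prefer, so the choice is left to the caller.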

The approximate dynamic ring matching method for a homogeneous symmetric publish/subscribe system according to the invention comprises the following steps:

Step 1: obtain the subscription probability

The probability is estimated from statistical information. A subscription contains two groups of predicates: own predicates and want predicates. If the probabilities that the own predicates and the want predicates are matched are denoted Pro_o and Pro_w respectively, the probability that the subscription is matched is:

Pro = {Pro}_{o} \times {Pro}_{w}

Pro_o can be estimated as:

{Pro}_{o} = \frac{OwnPredicateMatchedTimes}{TotalNumber}

where OwnPredicateMatchedTimes is the total number of times the own predicates have been matched, recorded as snapshot information of the current subscription database, and TotalNumber is the total number of subscriptions.

Similarly, Pro_w can be estimated as:

{Pro}_{w} = \frac{WantPredicateMatchedTimes}{TotalNumber}
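The estimate of step 1 amounts to two ratios and a product. The sketch below assumes the three counters are already available from a snapshot of the subscription database; the function name `estimate_match_probability` is hypothetical.

```python
def estimate_match_probability(
    own_predicate_matched_times: int,
    want_predicate_matched_times: int,
    total_number: int,
) -> float:
    """Return Pro = Pro_o * Pro_w estimated from snapshot counters."""
    if total_number <= 0:
        raise ValueError("total_number must be positive")
    pro_o = own_predicate_matched_times / total_number   # own predicates
    pro_w = want_predicate_matched_times / total_number  # want predicates
    return pro_o * pro_w
```

For instance, with 50 own-predicate matches and 20 want-predicate matches among 1000 subscriptions, Pro_o = 0.05 and Pro_w = 0.02, so Pro is approximately 0.001.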

Step 2: calculate the width W corresponding to the threshold within the domain and the approximate boundary within the domain

Step 2.1: calculate the width W corresponding to the threshold within the domain

The description of each attribute of a subscription can be represented as an interval on the space corresponding to that attribute, as shown in Fig. 2. The basic unit of an interval is defined as 1, the width of interval [S, E] is W=E-S+1, and the domain size is assumed to be N. Let p_i be the probability that interval [S, E] is matched. The invention gives a method for computing the upper and lower bounds of the space saved that applies to any data distribution. First assume that the probability density of the data is f(x). The probability that a continuous random variable falls in an interval is the definite integral of the probability density over that interval, so p_i is in general given by formula (1), where N is the domain size, W=E-S+1 is the interval width at the diagonal position, and Pro_o=Pro_w:

p_{i} = 1 - \left[ \left( \int_{1}^{S} f(x)\,dx \right)^{2} + \left( \int_{E}^{N} f(x)\,dx \right)^{2} \right] \qquad (1)

where p_{i} = \sqrt[2d]{Pro}, since Pro = {Pro}_{o} \times {Pro}_{w} = p_{i}^{2d} (d is the number of dimensions of the own predicates or of the want predicates). Formula (1) then gives:

1 - \sqrt[2d]{Pro} - \left( \int_{E}^{N} f(x)\,dx \right)^{2} = \left( \int_{1}^{S} f(x)\,dx \right)^{2} \qquad (2)

where S=E-W+1.

Therefore,

\int_{1}^{E - W + 1} f(x)\,dx = \sqrt{1 - \sqrt[2d]{Pro} - \left( \int_{E}^{N} f(x)\,dx \right)^{2}} \qquad (3)

Let the constant

C = \sqrt{1 - \sqrt[2d]{Pro}}

Setting E=N gives

\int_{1}^{N - W + 1} f(x)\,dx = C \qquad (4)

Let F(x) be the antiderivative of f(x):

F(x) \Big|_{1}^{N - W + 1} = C \qquad (5)

If F(x) has an inverse function, solving for W gives

W = N - F^{-1}(C + F(1)) + 1

that is,

W = N - F^{-1}\left( \sqrt{1 - \sqrt[2d]{Pro}} + F(1) \right) + 1 \qquad (6)

For data of any distribution, the coordinate (width) in the domain that corresponds to a given threshold can be obtained from formula (6). Pro is the probability that the predicates are matched. The method estimates it with the selectivity (the probability that any two subscriptions match completely, that is, the probability that any subscription is completely matched by any other subscription), and by default the threshold is set equal to the selectivity. Data with a given selectivity can be generated by tuning the experimental parameters.
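Formula (6) can be evaluated once F(1) and the inverse of F are known. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the uniform density f(x) = 1/(N-1) on [1, N], with F(x) = (x-1)/(N-1), is chosen here only as a simple example of a distribution whose antiderivative is invertible.

```python
import math
from typing import Callable


def threshold_width(
    n: float,
    d: int,
    pro: float,
    cdf_at_1: float,
    inverse_cdf: Callable[[float], float],
) -> float:
    """Evaluate formula (6): W = N - F^{-1}(C + F(1)) + 1."""
    if not 0.0 < pro <= 1.0:
        raise ValueError("pro must lie in (0, 1]")
    c = math.sqrt(1.0 - pro ** (1.0 / (2 * d)))  # the constant C
    return n - inverse_cdf(c + cdf_at_1) + 1


def uniform_inverse_cdf(n: float) -> Callable[[float], float]:
    """Inverse of F(x) = (x - 1) / (N - 1), an assumed example distribution."""
    return lambda u: 1.0 + u * (n - 1.0)


# Example call: domain size 400, d = 2, threshold equal to a selectivity of 1e-6.
w = threshold_width(400, 2, 1e-6, 0.0, uniform_inverse_cdf(400))
```

For this uniform example formula (6) reduces to W = N - C(N-1), so a threshold of Pro = 1 (C = 0) yields the full domain width W = N.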

Formula (6) applies when the data in every dimension are independent and identically distributed. In practice, however, this occurs only in highly idealized cases. The data in each dimension of a subscription may follow a different distribution with its own fixed characteristics. Formula (6) is therefore imprecise in such cases, and the predictions inevitably carry a larger error.

Consider the following situation. Suppose a subscription concerns machine parts. In a regular factory, machined parts conform to certain specifications. Take the size of a part: it follows a fixed industry standard, and suppose four grades A, B, C and D are used to measure part size. For a specific machine part, the probability that the part-size attribute is matched is then 0.25. In this case,

{Threshold}_{i} = \sqrt[2d - 1]{Pro / {threshold}_{j}} \quad (i < d,\ j < d,\ i \neq j)

where j denotes the size attribute of the part, i denotes the other attributes of the part, and threshold_j is the probability that attribute j is matched. Formula (6) should accordingly be adjusted to

W = N - F^{-1}\left( \sqrt{1 - \sqrt[2d - 1]{Pro / {threshold}_{j}}} + F(1) \right) + 1 \quad (i < d,\ j < d,\ i \neq j)

This is the basic idea of dimension reduction. In practice, which dimensions to reduce and how to reduce them call for application-specific strategies. For example, for an attribute describing gender, the probability that the attribute is matched is 0.5; for an attribute describing the district in which a house is located, if the city is divided into 7 districts, the probability that the attribute is matched is fixed at 1/7; and so on. After dimension reduction, formula (6) becomes

W = N - F^{-1}\left( \sqrt{1 - \sqrt[2d - 1]{Pro / {threshold}_{j}}} + F(1) \right) + 1 \quad (i < d,\ j < d,\ i \neq j) \qquad (7)

where j denotes an attribute of the subscription that has a fixed matching probability, i denotes the other attributes of the subscription, and threshold_j is the probability that attribute j is matched.
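The dimension-reduced formula (7) differs from formula (6) only in the inner root. The sketch below is illustrative: the function name `reduced_threshold_width` is hypothetical, and the uniform inverse F^{-1}(u) = 1 + u(N-1) used in the example call is an assumed distribution, not one prescribed by the text.

```python
import math
from typing import Callable


def reduced_threshold_width(
    n: float,
    d: int,
    pro: float,
    threshold_j: float,
    cdf_at_1: float,
    inverse_cdf: Callable[[float], float],
) -> float:
    """Evaluate formula (7) for one attribute j with fixed probability threshold_j."""
    if not 0.0 < threshold_j <= 1.0:
        raise ValueError("threshold_j must lie in (0, 1]")
    if not 0.0 < pro <= threshold_j:
        raise ValueError("pro must lie in (0, threshold_j]")
    # Remove attribute j, then spread the remaining probability over 2d-1 attributes.
    per_attribute = (pro / threshold_j) ** (1.0 / (2 * d - 1))
    c = math.sqrt(1.0 - per_attribute)
    return n - inverse_cdf(c + cdf_at_1) + 1


# Machine-part example from the text: attribute j is matched with probability 0.25.
w_reduced = reduced_threshold_width(
    400, 2, 1e-6, 0.25, 0.0, lambda u: 1.0 + u * 399.0
)
```

The guard on `pro` reflects that the root in formula (7) is only meaningful when Pro / threshold_j does not exceed 1.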

Step 2.2: calculate the approximate boundary within the domain

Tables of the probabilities that different subscription intervals are matched are obtained with the threshold-based method for handling intermediate results. Each table is then partitioned as shown in Fig. 4: with a point chosen on the diagonal, every table can be divided into four regions. The probabilities in region 3 are smaller than those in region 2, while regions 1 and 4 contain probabilities of all magnitudes. The subscriptions in regions 1 and 4 whose matching probability may exceed the threshold are therefore analyzed in full. There is a boundary such that the subscriptions on its two sides have matching probabilities greater than and less than the threshold respectively. As shown in Fig. 5, the boundary f(x_i) for any other domain size Sn can be obtained by shifting the boundary for domain size 400:

f(x_{j}) = 1.631 \times 10^{7} \times e^{-0.09502 x_{j}} + 655.9 \times e^{-0.005968 x_{j}} \qquad (8)

x_{i} = x_{j} \times Sn/400 + W - Sn/2 \qquad (9)

f(x_{i}) = f(x_{j}) \times Sn/400 + W - Sn/2 \quad (i \in size\ Sn,\ j \in size\ 400) \qquad (10)

Here x_i is the abscissa of a point on the boundary for domain size Sn, and x_j is the abscissa of a point on the boundary for domain size 400. f(x_i) is the ordinate of the point on the boundary for domain size Sn, and f(x_j) is the ordinate of the point on the boundary for domain size 400. W is the width of the interval at the diagonal position in Fig. 4. Fig. 6 compares this approximate boundary with the actual values.
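Formulas (8) to (10) map each point of the fitted boundary for domain size 400 to a point of the boundary for domain size Sn. A minimal sketch, in which the function names and the constant `BASE_SIZE` are hypothetical:

```python
import math
from typing import Tuple

BASE_SIZE = 400  # domain size for which the boundary of formula (8) was fitted


def base_boundary(x_j: float) -> float:
    """Formula (8): ordinate of the boundary for domain size 400."""
    return 1.631e7 * math.exp(-0.09502 * x_j) + 655.9 * math.exp(-0.005968 * x_j)


def scaled_boundary_point(x_j: float, sn: float, w: float) -> Tuple[float, float]:
    """Formulas (9) and (10): the point (x_i, f(x_i)) for domain size Sn."""
    scale = sn / BASE_SIZE
    shift = w - sn / 2
    x_i = x_j * scale + shift
    f_x_i = base_boundary(x_j) * scale + shift
    return x_i, f_x_i
```

Both coordinates receive the same scale factor Sn/400 and the same shift W - Sn/2, so when Sn = 400 and W = 200 the mapping leaves the base boundary unchanged.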

Step 3: estimate the proportion of storage space saved

Given the boundary obtained in step 2, the total number of possible cases in region 3 is taken as the lower bound Low(N, W) on the proportion of space saved, and the upper bound Upper(N, W) is obtained from formula (10) by fully accounting for regions 1 and 4. Here N is the domain size and W=E-S+1 is the interval width at the diagonal position:

Low(N, W) = \left( \frac{W \times (2 \times N - W + 1)}{N \times (N + 1)} \right)^{2d} \qquad (11)

Upper(N, W) = {1 - \frac{\left( \frac{(N - W) \times (N - W + 1)}{2} \right)^{2} + 2 \times \sum_{y = z}^{w} \sum_{x = w}^{N} (N - f(x) + 1) \times (N - y + 1)}{\left( \frac{N \times (N + 1)}{2} \right)^{2}}}^{d} \qquad (12)
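The lower bound of formula (11) is a closed-form expression and can be evaluated directly, as in the sketch below (the function name `low_saved_ratio` is hypothetical). The upper bound of formula (12) is deliberately left out of this sketch, because its double sum depends on the boundary f(x) from step 2.2 and on summation limits that would have to be assumed.

```python
def low_saved_ratio(n: int, w: int, d: int) -> float:
    """Formula (11): Low(N, W) = (W(2N - W + 1) / (N(N + 1)))^(2d)."""
    if not 1 <= w <= n:
        raise ValueError("w must satisfy 1 <= w <= n")
    base = w * (2 * n - w + 1) / (n * (n + 1))
    return base ** (2 * d)
```

As a sanity check on the formula, W = N makes the base equal to 1, so Low(N, N) = 1 for every d.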

Fig. 7 shows that the upper bound on the proportion of space saved obtained with this method is closer to the actual value than the previous one, a considerable gain in accuracy. Fig. 8 shows that after dimension reduction the upper bound almost coincides with the actual value, and the lower bound also improves on the previous lower bound. The method thus substantially improves prediction performance.

Beneficial effects: the method applies to approximate dynamic ring matching under arbitrary data distributions, can be used in real-time environments, and improves accuracy by 15 percentage points on average. The prediction formula for the proportion of space saved can be evaluated for any type of data distribution with high accuracy. By further mining how the probability of a subscription being matched is distributed over the whole domain-size space, and by analyzing the relationships between subscription dimensions and the distribution characteristics of the data in each dimension, strategies such as dimension reduction bring the predictions closer to the actual values and yield a better prediction.

Brief description of the drawings

Fig. 1 is an explanatory diagram of matching examples;

Fig. 2 is an explanatory diagram of ring matching and chain subscription examples;

Fig. 3 is an example diagram of the mapping between predicates and intervals;

Fig. 4 is an example diagram of the division into probability regions;

Fig. 5 is a scatter diagram of the boundaries for different domain sizes;

Fig. 6 compares the predicted boundary with the actual values;

Fig. 7 compares the proportion of space saved obtained by the invention with the result of the previous method;

Fig. 8 shows the comparison of Fig. 7 after the dimension reduction method is applied.

Embodiment

The method is further described below with reference to the drawings and an embodiment.

In one example of the invention, a house exchange website provides the information shown in Table 1:

Table 1: House property information provided by a house exchange website

Data are extracted from this house exchange website; the sample data are arranged as shown in Table 2:

Table 2: Data extracted from the house exchange website

Step 1: obtain the subscription probability

Each attribute in the table above is abstracted into valid data that the system can recognize. For each subscription, the number of times its own predicates are matched, OwnPredicateMatchedTimes (OPMT), and the number of times its want predicates are matched, WantPredicateMatchedTimes (WPMT), are counted. The probability that each subscription is matched is shown in Table 3:

Table 3: Calculated subscription probabilities

Step 2: calculate the interval width W at the threshold position and the approximate boundary within the domain

Step 2.1: calculate the interval width W at the threshold position

In this example, the first dimension of the simulated data is the house district attribute, with the values city center, east gate, south gate, north gate and west gate. The probability that the district attribute of a subscription is matched is 1/5, so W is obtained with the dimension reduction method according to formula (7).

W = N - F^{-1}\left( \sqrt{1 - \sqrt[2d - 1]{Pro / {threshold}_{j}}} + F(1) \right) + 1 \quad (i < d,\ j < d,\ i \neq j)

Step 2.2: calculate the approximate boundary within the domain

The domain size in this example is 10000. From formulas (8), (9) and (10), the approximate boundary f(x_i) for Sn=10000 is:

f(x_{i}) = 25 \times f(x_{j}) + W - 5000 \quad (i \in size\ Sn,\ j \in size\ 400)

where

f(x_{j}) = 1.631 \times 10^{7} \times e^{-0.09502 x_{j}} + 655.9 \times e^{-0.005968 x_{j}}

x_{i} = x_{j} \times Sn/400 + W - Sn/2

Step 3: estimate the proportion of storage space saved

In this example the threshold is set equal to the selectivity. With the selectivity set to 1 × 10^-6, the lower and upper bounds on the storage space saved after dimension reduction, calculated from formulas (11) and (12), are 59.65% and 89.74% respectively. Before dimension reduction, the upper and lower bounds are 98.35% and 58.69% respectively. The actual value is 88.95%. The upper bound after dimension reduction is 8.61 percentage points closer to the actual value than the upper bound before dimension reduction, and the lower bound is likewise 0.96 percentage points closer. The accuracy of the prediction formula is thus considerably improved.
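The two improvement figures quoted above follow from the stated bounds by simple arithmetic, which the short check below reproduces. It assumes that 98.35% is the upper bound and 58.69% the lower bound before dimension reduction, which is the only assignment consistent with the quoted gains; the helper name `gain` is hypothetical.

```python
ACTUAL = 88.95                              # actual proportion saved, in percent
UPPER_BEFORE, LOWER_BEFORE = 98.35, 58.69   # bounds before dimension reduction
UPPER_AFTER, LOWER_AFTER = 89.74, 59.65     # bounds after dimension reduction


def gain(before: float, after: float, actual: float = ACTUAL) -> float:
    """Percentage points by which `after` is closer to `actual` than `before`."""
    return abs(before - actual) - abs(after - actual)


upper_gain = round(gain(UPPER_BEFORE, UPPER_AFTER), 2)  # 9.40 - 0.79
lower_gain = round(gain(LOWER_BEFORE, LOWER_AFTER), 2)  # 30.26 - 29.30
```

The upper bound moves from 9.40 to 0.79 percentage points away from the actual value, and the lower bound from 30.26 to 29.30.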

Claims

1. An approximate dynamic ring matching method for a homogeneous symmetric publish/subscribe system, comprising the following steps:

Step 1: obtain the subscription probability

The probability is estimated from statistical information. A subscription contains two groups of predicates: own predicates and want predicates. If the probabilities that the own predicates and the want predicates are matched are denoted Pro_o and Pro_w respectively, the probability that the subscription is matched is:

Pro = {Pro}_{o} \times {Pro}_{w}

Pro_o can be estimated as:

{Pro}_{o} = \frac{OwnPredicateMatchedTimes}{TotalNumber},

where OwnPredicateMatchedTimes is the total number of times the own predicates have been matched, recorded as snapshot information of the current subscription database, and TotalNumber is the total number of subscriptions,

Similarly, Pro_w can be estimated as:

{Pro}_{w} = \frac{WantPredicateMatchedTimes}{TotalNumber},

where WantPredicateMatchedTimes is the number of times the want predicates have been matched;

characterized in that the method further comprises:

Step 2.1: calculate the width W corresponding to the threshold within the domain

The description of each attribute of a subscription can be represented as an interval on the space corresponding to that attribute. The basic unit of an interval is defined as 1, the width of interval [S, E] is W=E-S+1, and the domain size is assumed to be N. Let p_i be the probability that interval [S, E] is matched. To calculate the upper and lower bounds of the space saved for any data distribution, first assume that the probability density of the data is f(x). The probability that a continuous random variable falls in an interval is the definite integral of the probability density over that interval, so p_i is given by formula (1), where N is the domain size, W=E-S+1 is the interval width at the diagonal position, and Pro_o=Pro_w,

p_{i} = 1 - \left[ \left( \int_{1}^{S} f(x)\,dx \right)^{2} + \left( \int_{E}^{N} f(x)\,dx \right)^{2} \right] \qquad (1)

where p_{i} = \sqrt[2d]{Pro} and d is the number of dimensions of the own predicates or of the want predicates; formula (1) gives:

1 - \sqrt[2d]{Pro} - \left( \int_{E}^{N} f(x)\,dx \right)^{2} = \left( \int_{1}^{S} f(x)\,dx \right)^{2} \qquad (2)

where S=E-W+1.

Therefore,

\int_{1}^{E - W + 1} f(x)\,dx = \sqrt{1 - \sqrt[2d]{Pro} - \left( \int_{E}^{N} f(x)\,dx \right)^{2}} \qquad (3)

Let the constant

C = \sqrt{1 - \sqrt[2d]{Pro}}

Setting E=N gives

\int_{1}^{N - W + 1} f(x)\,dx = C \qquad (4)

Let F(x) be the antiderivative of f(x):

F(x) \Big|_{1}^{N - W + 1} = C \qquad (5)

If F(x) has an inverse function, solving for W gives

W = N - F^{-1}(C + F(1)) + 1

that is,

W = N - F^{-1}\left( \sqrt{1 - \sqrt[2d]{Pro}} + F(1) \right) + 1 \qquad (6)

For data of any distribution, the coordinate in the domain that corresponds to a given threshold can be obtained from formula (6). Pro is the probability that a subscription is matched and is estimated with the selectivity; by default the threshold is set equal to the selectivity, and data with a given selectivity can be generated by tuning the experimental parameters,

After dimension reduction of the subscription attributes, formula (6) becomes

W = N - F^{-1}\left( \sqrt{1 - \sqrt[2d - 1]{Pro / {threshold}_{j}}} + F(1) \right) + 1 \quad (i < d,\ j < d,\ i \neq j) \qquad (7)

where j denotes an attribute of the subscription that has a fixed matching probability, i denotes the other attributes of the subscription, and threshold_j is the probability that attribute j is matched;

Step 2.2: calculate the approximate boundary within the domain

Tables of the probabilities that different subscription intervals are matched are obtained with the threshold-based method for handling intermediate results. Each table can be divided into four regions, and the subscriptions on the two sides of the boundary have matching probabilities greater than and less than the threshold respectively. The region in the upper left corner of the table is region 3, and the subscriptions represented in this region have a matching probability less than the threshold. The boundary f(x_i) for any other domain size Sn is obtained by shifting the boundary for domain size 400,

f(x_{j}) = 1.631 \times 10^{7} \times e^{-0.09502 x_{j}} + 655.9 \times e^{-0.005968 x_{j}} \qquad (8)

x_{i} = x_{j} \times Sn/400 + W - Sn/2 \qquad (9)

f(x_{i}) = f(x_{j}) \times Sn/400 + W - Sn/2 \quad (i \in size\ Sn,\ j \in size\ 400) \qquad (10)

where x_i is the abscissa of a point on the boundary for domain size Sn, x_j is the abscissa of a point on the boundary for domain size 400, f(x_i) is the ordinate of the point on the boundary for domain size Sn, and f(x_j) is the ordinate of the point on the boundary for domain size 400;

Step 3: estimate the proportion of storage space saved

Given the boundary obtained in step 2, the total number of possible cases in region 3 is taken as the lower bound Low(N, W) on the proportion of space saved, and the upper bound Upper(N, W) on the proportion of space saved is obtained from formula (10), where N is the domain size and W=E-S+1 is the interval width at the diagonal position:

Low(N, W) = \left( \frac{W \times (2 \times N - W + 1)}{N \times (N + 1)} \right)^{2d} \qquad (11)

Upper(N, W) = {1 - \frac{\left( \frac{(N - W) \times (N - W + 1)}{2} \right)^{2} + 2 \times \sum_{y = z}^{w} \sum_{x = w}^{N} (N - f(x) + 1) \times (N - y + 1)}{\left( \frac{N \times (N + 1)}{2} \right)^{2}}}^{d} \qquad (12).