CN102117302B - Data origin tracking method on sensor data stream complex query results - Google Patents

Data origin tracking method on sensor data stream complex query results

Info

Publication number
CN102117302B
CN102117302B
Authority
CN
China
Prior art keywords
origin
inquiry
data
query
trail
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 200910264155
Other languages
Chinese (zh)
Other versions
CN102117302A (en)
Inventor
王永利
时真旺
徐佳
彭甫镕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology
Priority to CN 200910264155
Publication of CN102117302A
Application granted
Publication of CN102117302B
Status: Expired - Fee Related
Anticipated expiration

Abstract

The invention discloses a data origin tracking method for complex query results on sensor data streams. The method comprises the following steps: determining the size of the origin-tracking query sliding window; giving a standardized description of the origin query; classifying the origin-tracking query and designing the corresponding algorithms; designing the origin-tracking framework; and implementing the complete origin-tracking algorithm, thereby tracking the data origin of sensor data stream complex query results. The method overcomes the technical limitation that existing sensor data management systems cannot support backtracking of complex queries, introduces the concept of data origin tracking into the field of complex queries over sensor data streams for the first time, and provides a feasible solution for new online tracing applications.

Description

Data origin tracking method for complex query results on sensor data streams
Technical field
The invention belongs to the field of reverse tracking of iceberg query results in sensor data warehouses, and in particular relates to data origin tracking of complex query results over sensor data streams.
Background technology
New-generation sensor and RFID (radio frequency identification) technologies give people a powerful ability to perceive, understand and manage the world. Many new sensor-based applications urgently need a capability that existing data management systems do not possess: tracing the origin of events and query results, that is, the data origin tracking ability that lets higher-layer applications query back down to the underlying data. An iceberg query returns very few result tuples out of a large number of input tuples and is a typical, frequently used class of query on sensor data warehouses. Because iceberg queries involve aggregate functions over an attribute or attribute set, and because sensor data are uncertain, redundant, carry spatio-temporal information and require online response while the data sources may be inaccessible or expensive to access, tracing the data origin information of sensor iceberg query results is very difficult and raises many new challenges for research fields such as databases, sensor networks and complex event processing.
The origin of data records the entire processing history of the data, including where the data came from and every subsequent process applied to them. Data lineage tracing, also called data provenance tracing, mainly studies how to trace back from a high-level view of interest to the user to the raw source data from which that view was derived. Data origin tracking of iceberg queries is an essential supporting function for sensor-based localization and tracking applications; stages such as data analysis and data quality inspection frequently need to trace back iceberg query results, so the efficiency of origin tracking greatly affects how quickly a sensor data management system can answer queries, whereas in traditional data warehouses this function is of minor importance.
Research on data origin has in recent years attracted wide attention from scholars in data integration, Web search, semantic annotation and related fields. Abroad, research results on data origin already exist in data warehousing, e-science, quality tracing, data credibility assurance and reproducibility, while domestic research on data origin has only just begun. At present there are three main approaches to data origin tracking: query inversion, annotation, and workflow logs.
(1) Query inversion performs origin tracking by analyzing the query or view definition; the result of the inversion (the reverse-tracing query) is the origin of the data (Y. W. Cui, J. Widom, J. L. Wiener. Tracing the Lineage of View Data in a Warehousing Environment. ACM Transactions on Database Systems, 2000, 25(2): 179-227). Because the computation is carried out only when the data origin is needed, this is also called the "lazy" approach. It was originally proposed when data origin was applied to view maintenance and update problems. Its drawback is that it is not fully applicable to complex queries: existing work on query inversion assumes that the complex query satisfies certain conditions so that it can be normalized or rewritten, but in practice not all queries do, and even when the conditions hold the obtained data origin is sometimes imprecise.
(2) Annotation records information about the data source or the production history in annotations attached to the data (P. Buneman, S. Khanna, W. C. Tan. On Propagation of Deletions and Annotations Through Views. In: Proc of the Int'l Conf on Management of Data (ACM SIGMOD/PODS), 2002. 150-158). Because the data carry some origin information in annotations from the very start, this is also called the "eager" approach. A series of problems concerning the organization and management of annotations remain to be solved. Bhagwat et al. (D. Bhagwat, L. Chiticariu, W. C. Tan, G. Vijayvargiya. An Annotation Management System for Relational Databases. In: Proc of the Int'l Conf on Very Large Data Bases (VLDB), 2004. 900-911) designed a relation-based data model for managing annotations in which every data item (attribute) carries an annotation and the annotation moves together with the data when the data are transferred. The problems of this storage model are its large redundancy and the fact that it requires modifying the relation schema, which is impossible in many cases; it also only supports annotations at attribute granularity. Buneman et al. (Peter Buneman, Adriane P. Chapman, James Cheney. Provenance Management in Curated Databases. SIGMOD 2006, June 27-29, 2006, Chicago, Illinois, USA) studied provenance techniques for general data whose records are copied between databases and proposed a method of tracking user behavior - browsing a source database and copying data into the curated database - recording the user's actions in a form convenient for querying. The W7 dimensional model (Sudha Ram, Jun Hu, Regi Thomas George. PROMS: A System for Harvesting and Managing Data Provenance) is at present an advanced dimensional model carrying some semantic information, and its structure is more flexible than reverse-tracing query statements. Since most existing data management systems do not store annotations, the primary question is how to create or obtain them. Some systems provide tool sets that help users create annotations, but from the data origin perspective sensor data streams themselves already contain spatio-temporal information suitable for automatic acquisition; so far no work has studied automatic or manual annotation of sensor data origin.
(3) Workflow logs are based on message-level records of data processing. Existing research holds that workflow logs lack sufficient semantic information; even if they are collected it is difficult to use them to obtain the raw data or to re-enact the workflow, so in practice workflow logs usually play only an auxiliary role, supplementing the other two approaches. However, precisely this auxiliary role matters for the continuously processed nature of sensor data streams: workflow logs help methods such as directed-graph search and state transition to trace data origin effectively, which existing research has also not considered.
In China, Liu Xiping et al. first introduced the concept, significance and development of data origin (Liu Xiping, Wan Changxuan. Research on Data Provenance: An Overview. Science Plaza, 2005, 1: 47-52); Li Yazi compared several commonly used data origin description models and proposed evolving from XML Schema descriptions toward building domain ontologies and then realizing inference mechanisms, one of the future directions of data origin tracking (Li Yazi. Data origin dimensional models and description models. Modern Library Technology, 2007, 153(7): 10-13). In recent years domestic results exist only in the semantic annotation field, and those studies were not aimed at solving the origin tracing problem. Compared with origin tracking on conventional static databases, the origin tracing problem faced by sensor data is more complicated and the query cost of origin tracking is higher. Large-scale deployment of sensors causes label information to flood into the system, so the surge problem of sensor data must be solved. The currently accepted method is to place a data connector in the sensor data management system, comprising sensor middleware, event processing and an in-memory data cache. The raw data pass through complicated intermediate processing, which increases the difficulty of tracking. Under the premise that derivation rules are uncertain, using origin tracking to find the set of derivation rules suitable for materialization is an interesting unsolved problem in sensor data processing (R. Derakhshan, M. E. Orlowska, Li Xue. RFID Data Management: Challenges and Opportunities. In Proceedings of the IEEE International Conference on RFID 2007, 26-28 March 2007, pages 175-182).
Applying existing origin tracing techniques to sensor data streams faces the following four problems: (1) most existing data origin research targets scientific databases and does not consider the fast-response requirement of sensor data origin tracking, which is characterized by data stream processing, so existing origin tracing methods are hard to apply directly to sensor data management and the sensor data origin tracing problem must be solved by building a new origin tracing model; (2) existing research on data origin stays at the level of qualitative analysis and description of static data sets that change relatively slowly, and cannot adapt to changing sensor data streams; (3) invertibility is not a common property of data processing queries or functions, and without precisely specified data items even a weak inverse function is of little use to applications; (4) designing an inverse function or a reverse-tracing query requires understanding the complicated data processing in advance, so a solution can only target a specific application and is hard to automate; at the same time, the considerable effort required to code reverse-tracing queries or inverse functions has hindered the application of this technique.
Summary of the invention
The technical problem solved by the invention is to provide a fast and accurate data origin tracking method for complex query results on sensor data streams.
The technical solution that realizes the object of the invention is a data origin tracking method for sensor data stream complex query results, comprising the following steps:
Step 1: determine the size of the origin-tracking query sliding window;
Step 2: give a standardized description of the origin query;
Step 3: classify the origin-tracking query and design the corresponding algorithms;
Step 4: design the origin-tracking framework;
Step 5: implement the complete origin-tracking algorithm, thereby tracking the data origin of sensor data stream complex query results.
Compared with the prior art, the invention has the following notable advantages: (1) it overcomes the technical limitation that existing sensor data management cannot support backtracking of complex queries, introduces the concept of data origin tracking into the field of iceberg queries over sensor data streams for the first time, and provides a feasible solution for new online tracing applications; (2) it builds, in an online stream-processing manner, a data origin tracing model that adapts to characteristics such as the uncertainty and incompleteness of sensor data; (3) it dynamically determines the size of the origin-tracking query window according to the principle of non-uniform probabilistic sampling, adapting to changes in the data stream; (4) it provides stream-processing-based operational theory and search algorithms for sensor data origin tracking (involving inverse selection and projection on region location, distance and time, as well as union, intersection, aggregation and concatenation operations); the computational model established by the invention proposes, for rapidly changing data streams, reverse-tracing query algorithms for sensor data streams that cover different situations such as known and unknown processing logic, and can derive data origin information online; (5) the cost is small, the origin tracking result set is accurate, and the scalability is good.
The invention is described in further detail below in conjunction with the accompanying drawings.
Description of drawings
Fig. 1 is a flowchart of the data origin tracking method for sensor data stream complex query results of the present invention.
Fig. 2 is a schematic diagram of the meaning of origin tracking on sensor data streams.
Fig. 3 is a framework diagram of the streaming sensor data origin tracing system.
Fig. 4 is an example diagram of a data stream derived by a complex query on sensor data.
Fig. 5 illustrates iterative origin computation through intermediate sensor-derived data streams.
Fig. 6 shows the relationship between different origin tracing methods and origin tracking time.
Fig. 7 compares the origin tracking precision of three origin tracing methods on sensor data.
Embodiment
With reference to Fig. 1, a data origin tracking method for sensor data stream complex query results according to the present invention comprises the following steps:
Step 1: determine the size of the origin-tracking query sliding window, which specifically comprises the following steps:
Step 11: define the origin-tracking query sliding window. Let the origin-tracking query window size be w_i epochs, W_i = (t - w_i, t), where t denotes the current time; suppose tag i appears in the effective range of the reader, and during window W_i the reader reads tag i in each epoch with the same probability p_i.
Step 12: treat the reads of tag i in the epochs of the origin-tracking query sliding window, each with read probability p_i, as independent Bernoulli trials. Suppose that among all the epochs of W_i, tag i appears only in a subset S_i; let p_i^avg denote the average empirical read rate over these observed epochs, p_i^avg = Σ_{t∈S_i} p_{i,t} / |S_i|, where p_{i,t} is computed from the reader's tag list information. We regard S_i as a binomial sample, so |S_i| is a binomial random variable B(w_i, p_i^avg).
Step 13: choose a suitable w_i to guarantee that tag i is read with high probability. If the number of epochs in the sliding window satisfies the inequality
w_i ≥ ln(1/δ) / p_i^avg,
then tag i is guaranteed to be read in window W_i with probability greater than 1-δ, where δ is the desired error probability; this determines the size of the origin-tracking query sliding window.
Step 2: give a standardized description of the origin query; specifically, on the basis of the relational data model, introduce randomized tuples, provide a standard procedure for tracking the origin of uncertain information, and provide a declarative continuous query language interface for the user.
Step 3: classify the origin-tracking query and design the corresponding algorithms; this specifically comprises the following steps:
Step 31: according to whether the forward query pattern corresponding to the origin query is known, and whether the origin is in a standard relational pattern, divide origin tracking into four types. If the known forward query is a standard relational SPJ (select, project, join) view pattern, go to step 32; if the known forward query is a standard relational ASPJ (aggregate, select, project, join) view pattern, go to step 33; if the known forward query is a non-standard-relational ASPJ view pattern, go to step 34; if the forward query pattern is unknown and the operations form a non-standard-relational ASPJ view pattern, go to step 35.
Step 32: for origin-tracking queries on known forward queries in the standard relational SPJ view pattern, convert all SPJ views to the SPJ canonical form and compute the origin of the specified tuple with a tracing query based on the canonical form.
Step 33: for origin tracking on known forward queries in the standard relational ASPJ view pattern, use intermediate results as the tie between aggregated tuples and the base streams; compute the relevant portion of the intermediate result from the base streams when needed, and store the complete intermediate result in the data warehouse as a materialized auxiliary view.
Step 34: for origin-tracking queries on known forward queries in the non-standard-relational ASPJ view pattern, divide the operations acting on the sensor data stream into splitting and merging classes: if each input data item independently produces zero or more output data items, regard the operation as a splitting operation and determine the origin of an output item by enumerating the input data items; otherwise treat it as a merging operation, further subdivided into context-free merging and key-preserving merging, and verify subsets of the input items in a cumulative manner.
Step 35: when the forward query pattern is unknown and the operations form a non-standard-relational ASPJ view pattern, use the dynamic slicing technique to compute the origin of the specified tuple, designing a black-box origin tracing method for operations with unknown definitions.
Step 4: design the origin-tracking framework; this specifically comprises the following steps:
Step 41: classify the primary entities of the origin query information model into data streams and queries; a data stream is either a base stream or a derived stream: a base stream comes from a device, sensor network or service outside the system, while a derived stream comes from base streams or from other derived streams.
Step 42: design a distributed event processing system that accepts query requests through a central service, deploys queries on multiple distributed query execution engines, and executes each query during its own life cycle; the system monitors the load on each query engine, optimizes queries according to reuse rules and estimated query and network costs, and distributes the received queries to effective query execution engines.
Step 43: on the basis of step 42, build a stream-oriented sensor data origin query framework covering the organization and storage strategy of origins, the coupling of origins with the data, and the way origins are propagated.
Step 5: implement the complete origin tracing algorithm, thereby tracking the data origin of sensor data stream complex query results.
The invention provides a fine-grained sensor data origin tracing method that can trace the origin of events and query results within a sliding window over sensor data streams, and that can effectively audit sensor data quality, verify event response processes, and quickly recombine and integrate different data sources. For the sensor data flowing into the system, the invention mainly consists of determining in real time the size of the sliding window over which the data origin is computed, formalizing the origin query, classifying the origin-tracking query, and the framework and algorithms that implement origin tracking. The details are as follows:
1. Determining the size of the origin-tracking query sliding window
For the real-time sensor data flowing into the system, computing the origin information of a result tuple obtained after complex processing requires applying the sensor data origin tracing method to the set of input data streams (or to views defined over them). Because a data stream is potentially unbounded, it is impossible to store all the data before computing, so a suitable data stream window (sliding window) must be determined: a window of moderate size preserves the statistical completeness of the sample inside the window while reducing the storage cost of the data stream. The size of the sliding window on which the origin-tracking query depends is determined according to the principle of non-uniform probabilistic sampling.
Definition 1.1, sensor data stream set ψ: composed of several data streams S; within the sliding window, the portion of S is equivalent to a table in the relational data model, and a tuple t ∈ S consists of several attribute values.
Definition 1.2, query cycle: one iteration of the communication protocol between a reader and all tags in its vicinity.
Definition 1.3, epoch: the smallest time unit in which a reader tracks all the tags it identifies, jointly formed by the results of several query cycles; a typical epoch lasts 0.2-0.25 seconds. In each epoch, besides the tag IDs, the reader also records some additional information (such as the number of query responses of each tag and the moment at which each tag was last read) and periodically sends this information to the client.
Let the origin-tracking query window size be w_i epochs, W_i = (t - w_i, t); suppose tag i appears in the effective range of the reader, and during window W_i the reader reads tag i in each epoch with the same probability p_i. The present invention regards the reads in the individual epochs, each succeeding with probability p_i, as independent Bernoulli trials. This means that the number of successful observations of tag i within the window is a random variable following a binomial distribution with parameters (w_i, p_i). Suppose that among all the epochs of W_i, tag i appears only in a subset S_i; let p_i^avg denote the average empirical read rate over these observed epochs, p_i^avg = Σ_{t∈S_i} p_{i,t} / |S_i|, where p_{i,t} is computed from the reader's tag list information. We regard S_i as a binomial sample, so |S_i| is a binomial random variable B(w_i, p_i^avg). By standard probability theory, the expectation and variance of |S_i| are E[|S_i|] = w_i · p_i^avg and Var[|S_i|] = w_i · p_i^avg · (1 - p_i^avg).
To adjust the window size of each tag adaptively and establish the binomial sampling model, first consider the window size w_i: it must guarantee that within window W_i there are enough epochs in which tag i can be observed (provided the tag is within the readable range of the reader); choosing a suitable w_i guarantees that tag i is read with high probability.
Lemma 1.1: let p_i^avg denote the observation probability of tag i in an epoch. If the number of epochs in the sliding window satisfies the inequality
w_i ≥ ln(1/δ) / p_i^avg,
then tag i is read in window W_i with probability greater than 1-δ.
Proof: since the observations of tag i are independent Bernoulli trials, the probability of missing tag i in all w_i samples is (1 - p_i^avg)^{w_i}. Requiring this probability to be ≤ δ and taking logarithms on both sides gives w_i · ln(1 - p_i^avg) ≤ ln δ. By the inequality -x ≥ ln(1 - x) for x ∈ (0, 1), if -w_i · p_i^avg ≤ ln δ then w_i · ln(1 - p_i^avg) ≤ ln δ also holds, i.e. w_i ≥ ln(1/δ) / p_i^avg. This completes the proof.
Therefore a window of
w_i ≥ ln(1/δ) / p_i^avg
epochs guarantees completeness with high probability.
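As an aside, Lemma 1.1 translates directly into a small window-sizing routine. The following Python sketch (the helper names are illustrative and not part of the patent) picks the smallest w_i satisfying w_i ≥ ln(1/δ) / p_i^avg:
import math

def window_size(p_avg: float, delta: float) -> int:
    """Smallest w_i with w_i >= ln(1/delta) / p_avg (Lemma 1.1).
    p_avg : average empirical read rate of the tag over the observed epochs
    delta : acceptable probability of missing the tag within the window
    """
    if not (0.0 < p_avg <= 1.0 and 0.0 < delta < 1.0):
        raise ValueError("p_avg must lie in (0, 1] and delta in (0, 1)")
    return math.ceil(math.log(1.0 / delta) / p_avg)

def average_read_rate(observed_epoch_rates):
    """p_i^avg: the per-epoch read rates over the observed epochs S_i, summed and divided by |S_i|."""
    rates = list(observed_epoch_rates)
    return sum(rates) / len(rates)

if __name__ == "__main__":
    p = average_read_rate([0.9, 0.8, 0.85])   # per-epoch read rates of one tag (made-up values)
    w = window_size(p, delta=0.01)            # keep the miss probability below 1%
    print(p, w)                               # 0.85 -> 6 epochs
With p_i^avg ≈ 0.85 and δ = 0.01, six epochs already suffice, which illustrates why the window can stay small for reliably read tags.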
2. Formal description of sensor data origin tracking semantics
The present invention treats the uncertainty of the data to be processed, the data origin process and the data themselves as equally important. On the basis of the relational data model (which describes spatio-temporal attribute features fairly well but cannot represent the temporal and spatial characteristics of spatio-temporal objects), randomized tuples are introduced to handle the uncertainty of sensor data, a standard procedure for origin tracking is provided, the concepts of time and sliding windows are incorporated into the stream schema, and a declarative continuous query language interface is provided for the user.
Definition 2.1, possible-tuple: a tuple that may appear in the data stream, reflecting the uncertainty of tuple contents.
Definition 2.2, existed-tuple: each existed-tuple is composed of one or more possible-tuples and represents an actually possible combination of possible-tuples.
Definition 2.3, probability (probability of occurrence): depends on the possible-tuples and existed-tuples and represents the likelihood that the current value appears. The probability of occurrence can be computed as the ratio of the tag read count to the number of epochs over a period of time; in essence it is the average read rate.
Definition 2.4, lineage (origin): connects a possible-tuple with the possible-tuples that derive it. Let λ be an operation acting on ψ (if the operation is composed of relational operators, λ represents a view definition, denoted v). If there exists λ^{-1} such that ∀I ⊆ ψ, λ^{-1}(λ(I)) = I and ∀O ⊆ ψ, λ(λ^{-1}(O)) = O, where I is an input set and O an output set, then λ^{-1} is called the inverse operation of λ, and for any tuple t in O, λ^{-1}({t}) is called the origin of t. The meaning of origin tracking on a data stream is shown in Fig. 2.
Definition 2.5, data stream set ψ_L with origin L: a ψ_L is a sequence of triples (R, D, λ^{-1}), where R is a set of relations, D is a symbol set containing I(R), and λ^{-1} is an inverse operation constructed from D to 2^D. I(R) contains pairs (i, j), where i denotes an existed-tuple and j the index of one of its possible-tuples.
Let ψ = (R, D, λ^{-1}) be a data stream set and let D_k ⊆ D be a symbol set. A possible subset ψ_k of ψ is obtained as follows:
(1) if D(i, j) ∈ D_k, then for every j' ≠ j, D(i, j') ∉ D_k;
(2) ∀D(i, j) ∈ D_k, λ^{-1}(D(i, j)) ⊆ D_k;
(3) if for some existed-tuple t_i there is no D(i, j) ∈ D_k, then t_i is a possible existed-tuple and ∀D(i, j) ∈ t_i, either λ^{-1}(D(i, j)) = φ or λ^{-1}(D(i, j)) ⊄ D_k.
Condition (1) states that different possible-tuples within the same existed-tuple are mutually exclusive and appear at most once in every possible instance. Condition (2) strengthens the origin semantics: if a possible-tuple can appear in an instance, then the possible-tuples that derive it must appear as well, which also implies a direction.
The sensor data stream origin tracking query language designed here is as follows:
SELECT OriginStream_j.Attr
FROM OriginStream_1, ..., OriginStream_n
WHERE lineage(OriginStream_1, ..., OriginStream_n)
AND OtherPredication ...
WINDOW Now-Size, [Confidence Factor]
where OriginStream_i is a source data stream (or derived stream), Attr is an attribute name in the stream, lineage is the general-purpose function (procedure) that computes the origin, OtherPredication stands for the remaining standard SQL predicates, Now denotes the current time, the WINDOW clause defines a sliding window of size Size, and the optional Confidence Factor expresses the confidence of the origin information.
3. Data stream origin model and origin-tracking information system framework
A data stream origin tracing model for sensor iceberg queries is designed, and a stream-oriented sensor data origin query framework is built, covering the organization and storage strategy of origins, the coupling of origins with the data, and the way origins are propagated.
The primary entities of the origin query information model are data streams and queries. A data stream is either a base stream or a derived stream: a base stream comes from a device, sensor network or service outside the event processing system, while a derived stream comes from base streams or other derived streams; event processing can also produce derived streams. The origin information of an entity is divided into two parts: information collected when the data stream and the query are registered is static origin (for example, meta-information is part of the static origin), while information collected during query processing is dynamic origin (for example, a change of the data stream flow rate is a kind of dynamic origin).
Regarding the origin-tracking processing framework, the invention designs a distributed event processing system that accepts query requests through a central service and deploys queries on multiple distributed query execution engines, each executed during its own life cycle. The system monitors the load on the query engines and spawns new query execution engines according to the dynamic demand of the distributed terminals. The query planner optimizes queries according to reuse rules and estimated query and network costs and distributes the received queries to effective query execution engines. The framework of the streaming sensor data origin tracing system is shown in Fig. 3.
Resource management is responsible for maintaining the current state of the system and interacts with query computation. The origin tracing service keeps track of current resource changes in the system and provides a GUI (graphical user interface) for monitoring the current queries, data streams and computing resources. The user registers input data streams and iceberg queries through the origin service. When a new query is submitted, the system generates a registration for the derived stream. After an iceberg query is registered, the user can attach additional information such as notes and meta-information to the origin data set.
4. Sensor origin tracking theory and algorithm description
For fine-grained origin tracking on different types of sensor data streams, data origin tracing methods and origin tracing algorithms are designed for data streams with known pattern information (standard relational operations) and with unknown pattern information, covering complex queries composed of standard relational operators such as standard ASPJ (aggregate, select, project, join) as well as complex queries composed of non-ASPJ, non-standard relational operators.
4.1 Origin tracking for known standard relational operations
If the definitions of the operations applied to the sensor data are known and the operations are standard relational operations, the origin is computed as follows. The basic idea is to compute the origin of a tuple in a Select-Project-Join (SPJ) view with a relational query over the base tables: applying the query to the data stream S returns the origin in S of a tuple t of the given view v.
4.1.1 Origin tracking queries on SPJ views
Definition 4.1, origin tracking query: let ψ be a data stream set, v a view over ψ, and t ∈ v(ψ) a given tuple. TQ_{t,v} is the origin tracking query for v and t if and only if TQ_{t,v}(ψ) = v^{-1}_ψ(t), where v^{-1}_ψ(t) is the origin of t in ψ according to v; TQ_{t,v} does not depend on the data stream instance ψ. The tracking query of a set T of view tuples is defined analogously and denoted TQ_{T,v}(ψ).
Every SPJ view can be converted into a relational algebra sequence of the form
π_A(σ_C(S_1 ⋈ ... ⋈ S_m)),
where π_A, σ_C, S_i and ⋈ denote, respectively, projection onto attribute set A, selection on condition C, a base data stream, and join. This form is called the SPJ canonical form; converting an SPJ view does not affect the origin of its view tuples [CUI96]. Therefore a given SPJ view is first converted to the SPJ canonical form, and the origin of the specified tuple is computed with a tracking query based on the canonical form. The additional operator used by origin tracking queries on SPJ views is introduced below.
Definition 4.2, split operator ω: let T be a data stream (table); the split operator ω decomposes T into a sequence of tables, each being the projection of T onto an attribute set A_i (i = 1..m):
ω_{A_1,...,A_m}(T) = <π_{A_1}(T), ..., π_{A_m}(T)>.
Theorem 4.1: let the data stream set ψ contain the base data streams S_1, ..., S_m, and let
V = π_A(σ_C(S_1 ⋈ ... ⋈ S_m))
be an SPJ view over them. Given a tuple t ∈ V, its origin in ψ can be computed by the following query over the base data streams:
TQ_{t,V}(ψ) = ω_{S_1,...,S_m}(σ_{A=t}(σ_C(S_1 ⋈ ... ⋈ S_m))).
Given a tuple set T ⊆ V, the origin tracking query of T is
TQ_{T,V}(ψ) = ω_{S_1,...,S_m}(σ_{A∈T}(σ_C(S_1 ⋈ ... ⋈ S_m))).
Proof: let
V_2 = S_1 ⋈ ... ⋈ S_m, V_1 = σ_C(V_2), V = π_A(V_1).
If an operation is composed of several steps, the origin of a final result tuple can be obtained by tracing step by step through the multi-level views of the query tree, so
v^{-1}_ψ(t) = v_1^{-1}_ψ(π_A^{-1 V_1}(t))
            = v_1^{-1}_ψ(σ_{A=t}(V_1))
            = v_2^{-1}_ψ(σ_C^{-1 V_2}(σ_{A=t}(V_1)))
            = v_2^{-1}_ψ(σ_{A=t}(V_1))
            = ω_{S_1,...,S_m}(σ_{A=t}(V_1))
            = ω_{S_1,...,S_m}(σ_{A=t}(σ_C(S_1 ⋈ ... ⋈ S_m))).
The tuple-set case is obtained in the same way by replacing σ_{A=t} with σ_{A∈T}.
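To make the tracing query of Theorem 4.1 concrete, the following Python sketch recomputes an SPJ view in canonical form over in-memory streams and splits the qualifying combinations back onto the base streams. The function names and the toy data are illustrative only (they merely echo the spirit of Tables 1-2); this is not the patent's implementation.
from itertools import product

def spj_lineage(view_tuple, streams, condition, project):
    """Sketch of the tracing query TQ of Theorem 4.1 for an SPJ view in
    canonical form V = pi_A(sigma_C(S_1 join ... join S_m)).
    streams   : list of base streams, each a list of dict tuples
    condition : predicate over one tuple per stream (join predicate plus selection C)
    project   : function mapping one tuple per stream to the projected view tuple (pi_A)
    Returns, per base stream, the tuples whose combination derives view_tuple.
    """
    lineage = [[] for _ in streams]
    for combo in product(*streams):            # S_1 x ... x S_m
        if not condition(*combo):              # sigma_C, including the join predicate
            continue
        if project(*combo) != view_tuple:      # keep only combinations that yield t
            continue
        for k, part in enumerate(combo):       # split operator omega: back onto each S_k
            if part not in lineage[k]:
                lineage[k].append(part)
    return lineage

monitor = [{"tag": "087", "loc": "L1"}, {"tag": "120", "loc": "L3"}]
products = [{"tag": "087", "name": "HP 6320"}, {"tag": "120", "name": "IBM T63"}]
t = {"loc": "L1", "name": "HP 6320"}
print(spj_lineage(
    t, [monitor, products],
    condition=lambda m, p: m["tag"] == p["tag"],                 # join on the tag id
    project=lambda m, p: {"loc": m["loc"], "name": p["name"]}))  # pi_{loc, name}
# -> [[{'tag': '087', 'loc': 'L1'}], [{'tag': '087', 'name': 'HP 6320'}]]
Here the predicate plays the role of σ_C together with the join condition, and the final loop is the split operator ω.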
4.1.2 Origin tracking on ASPJ views
Although origin tracking on SPJ views needs no intermediate results, an ASPJ view that contains aggregation cannot be traced without storing certain intermediate results, because a view containing aggregation operators introduces new attributes computed by aggregate functions, and these attributes simply do not exist in the base streams. The intermediate result must therefore serve as the tie between aggregated tuples and the base streams. The relevant portion of the intermediate result can be computed from the base streams when needed, or the whole result can be stored in the data warehouse as a materialized auxiliary view; the discussion below assumes that all intermediate aggregation results are available.
(A) ASPJ normal form
Unlike SPJ views, ASPJ views have no simple normal form, because in an ASPJ view definition some selection, projection and join operations can be pushed neither before nor after the aggregation operators without destroying the equivalence of the relational operations. Nevertheless, by computing and merging some SPJ operations, any ASPJ view query tree can be transformed into a composition of operator sequences of the form
α_{G,aggr(B)}(π_A(σ_C(S_1 ⋈ ... ⋈ S_m)));
such a sequence is called an ASPJ segment, and the resulting form is the ASPJ normal form. An ASPJ segment may omit certain operators, although every segment except the outermost one in the ASPJ normal form must contain an aggregation operator (otherwise the segment could be merged into an adjacent segment). When a unary operator is omitted, we assume that a corresponding generic operator takes its place: the generic aggregation operator of stream S is α_S, the generic projection operator is π_S, and the generic selection operator is σ_S.
Definition 4.3, ASPJ normal form: let v be an ASPJ view definition over the data stream set ψ.
(1) v = S, where S is a base stream in ψ, is in ASPJ normal form.
(2) v = α_{G,aggr(B)}(π_A(σ_C(v_1 ⋈ ... ⋈ v_k))) is in ASPJ normal form if every v_j is in ASPJ normal form with a non-trivial top aggregation operator, j = 1..k, where α_{G,aggr(B)} denotes grouping on G with the aggregate function aggr defined on attribute B.
Determining the origin of an arbitrary view by tracing through each operator based on the original view definition would require storing or recomputing the intermediate result of every operator, which is expensive. Converting the ASPJ view definition into the normal form allows fewer intermediate results to be stored or recomputed, since all operators within a segment are traced together.
(B) Origin tracking queries on single-layer ASPJ views
A view defined by a single top-level ASPJ segment is called a single-layer ASPJ view. Analogously to SPJ views, a single query can be used to trace the origin of a tuple of a single-layer ASPJ view.
Theorem 4.2: let
V = α_{G,aggr(B)}(π_A(σ_C(S_1 ⋈ ... ⋈ S_m)))
be a single-layer ASPJ view. Given a tuple t ∈ V, its origin in S_1, ..., S_m as defined by v can be computed by the following query over the base streams:
TQ_{t,V}(ψ) = ω_{S_1,...,S_m}(σ_{G=t.G}(σ_C(S_1 ⋈ ... ⋈ S_m))).
Given a tuple set T ⊆ V, the origin tracking query of T is
TQ_{T,V}(ψ) = ω_{S_1,...,S_m}(σ_{G∈T.G}(σ_C(S_1 ⋈ ... ⋈ S_m))).
Proof: let
V_1 = π_A(σ_C(S_1 ⋈ ... ⋈ S_m)) and V = α_{G,aggr(B)}(V_1).
Then
v^{-1}_ψ(t) = v_1^{-1}_ψ(α_{G,aggr(B)}^{-1 V_1}(t)) = v_1^{-1}_ψ(σ_{G=t.G}(V_1)),
and applying Theorem 4.1 to the SPJ sub-view V_1 and pushing the selection σ_{G=t.G} down onto the joined and selected base streams yields
v^{-1}_ψ(t) = ω_{S_1,...,S_m}(σ_{G=t.G}(σ_C(S_1 ⋈ ... ⋈ S_m))),
which is the stated tracing query; the tuple-set case follows by replacing σ_{G=t.G} with σ_{G∈T.G}.
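Restricted to the aggregation step, the tracing query of Theorem 4.2 amounts to selecting every tuple of the intermediate (SPJ) result that falls into the same group as the traced output tuple. A minimal Python sketch with illustrative names and data:
def aspj_group_lineage(view_tuple, inner_tuples, group_attrs):
    """Sketch of the Theorem 4.2 tracing step for a single-layer ASPJ view:
    the origin of an aggregate result tuple t is sigma_{G = t.G} over the
    intermediate result feeding the aggregation.
    """
    key = tuple(view_tuple[a] for a in group_attrs)
    return [r for r in inner_tuples
            if tuple(r[a] for a in group_attrs) == key]

readings = [
    {"tag": "087", "prob": 0.8}, {"tag": "087", "prob": 0.9},
    {"tag": "120", "prob": 0.7}, {"tag": "120", "prob": 0.8},
]
t = {"tag": "120", "avg_prob": 0.75}
print(aspj_group_lineage(t, readings, group_attrs=["tag"]))
# -> [{'tag': '120', 'prob': 0.7}, {'tag': '120', 'prob': 0.8}]
In the method of step 33, inner_tuples would come either from a materialized auxiliary view or from recomputing the relevant part of the intermediate result on demand.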
(C) Multi-level ASPJ view origin tracing algorithm
Given a general ASPJ view definition, first convert the view into the ASPJ normal form and divide it into a set of ASPJ segments, defining one intermediate view per segment; the origin of a tuple is then traced by recursing top-down through the hierarchy of intermediate views. At each level, the tracking query for single-layer ASPJ views is used to compute, over the views or base streams of the level below, the origin of the tuples currently being traced.
Let v be a view definition in normal form and t ∈ v(ψ) a tuple; the procedure TupleLineageTracing() computes the origin of tuple t according to v over ψ. The main procedure StreamLineageTracing() computes the origin of a tuple set T ⊆ V. Let v = v'(V_1, ..., V_k), where v' is a single-layer ASPJ view and V_j = v_j(ψ), j = 1..k, is available as a base stream or intermediate view. The procedure first uses the single-layer ASPJ tracking query TQ(T, v', <V_1, ..., V_k>) of Theorem 4.2 to compute the origin <V_1*, ..., V_k*> of T in <V_1, ..., V_k>; it then recursively computes the origin of each tuple set V_j* according to the definition of v_j (j = 1..k) and concatenates the results into the origin of the whole view tuple set.
procedure TupleLineageTracing(t; v; ψ)
In: the tuple t whose origin is to be traced, the view definition v, the data stream set ψ
Out: the origin of tuple t in the data stream set ψ (the tuples that generate t through v)
1 return(StreamLineageTracing({t}; v; ψ));
procedure StreamLineageTracing(T; v; ψ)
In: the tuple set T to be traced, the view definition v, the data stream set ψ
Out: the origin of tuple set T in the data stream set ψ (the tuples that generate T through v)
1 if v = S ∈ ψ then return(<T>);
2 // otherwise v = v'(v_1, ..., v_k), where v' is a single-layer ASPJ view
3 <V_1*, ..., V_k*> ← TQ(T, v', {V_1, ..., V_k}); // V_j = v_j(ψ) is an intermediate view or base stream, j = 1..k
4 D* ← Φ;
5 for j ← 1 to k do
6   D* ← D* · StreamLineageTracing(V_j*; v_j; ψ);
7 return(D*);
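A compact Python rendering of the recursion in StreamLineageTracing is sketched below; the representation of a view (a base-stream name, or a pair of a single-layer tracing function and its child views) is an assumption made here for illustration rather than something the patent prescribes.
def stream_lineage_tracing(T, view):
    """Recursive sketch of StreamLineageTracing above.
    view is either the name of a base stream (the recursion stops there) or a
    dict {"trace": fn, "inputs": [child views]}, where fn(T) plays the role of
    the single-layer tracing query TQ of Theorem 4.2 and returns one traced
    tuple set per child view.
    """
    if isinstance(view, str):                        # v = S in psi
        return {view: list(T)}
    traced_per_child = view["trace"](T)              # <V_1*, ..., V_k*>
    lineage = {}                                     # D*, accumulated over the children
    for child, traced in zip(view["inputs"], traced_per_child):
        for stream, tuples in stream_lineage_tracing(traced, child).items():
            bucket = lineage.setdefault(stream, [])
            for item in tuples:
                if item not in bucket:               # concatenate without duplicates
                    bucket.append(item)
    return lineage
In practice each segment's trace function would wrap a single-layer tracing query such as the one sketched after Theorem 4.2.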
4.2 Origin tracking for non-relational operations
Many operations on sensor data streams are not standard relational operations and cannot be normalized into the ASPJ canonical form, so the invention designs origin tracing methods specifically for the non-relational operations of sensor data streams. Without loss of generality, the operations acting on a sensor data stream are divided into two classes: splitting and merging.
Definition 4.5, splitting operation: each input data item independently produces zero or more data items: ∀I, λ(I) = ∪_{i∈I} λ({i}). The origin of an output item o produced by a splitting operation is defined as λ*(o, I) = {i ∈ I | o ∈ λ({i})}.
The algorithm TraceDecompose(λ, O*, I) is designed to trace the origin of a set of output items produced by a splitting operation; it takes the output item set as a parameter rather than a single output item.
procedure TraceDecompose(λ, O*, I)
I* ← Φ;
for each i ∈ I do
  if λ({i}) ∩ O* ≠ Φ then I* ← I* ∪ {i};
return I*;
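The same idea in Python, with a toy splitting operation that explodes a (reader, tag-list) record into per-tag rows; all names and data here are illustrative.
def trace_decompose(op, target_outputs, inputs):
    """Sketch of TraceDecompose: for a splitting operation op (each input item
    independently yields zero or more output items), the origin of the target
    output set is every input item whose own outputs intersect it.
    """
    traced = []
    for i in inputs:
        if any(o in target_outputs for o in op([i])):   # lambda({i}) intersects O*
            traced.append(i)
    return traced

def explode(items):
    """Toy splitting operation: one (reader, tag list) record -> one row per tag."""
    out = []
    for reader, tags in items:
        out.extend((reader, tag) for tag in tags)
    return out

rows = [("R1", ["087", "098"]), ("R2", ["089", "125", "087"])]
print(trace_decompose(explode, {("R1", "087"), ("R2", "087")}, rows))
# -> [('R1', ['087', '098']), ('R2', ['089', '125', '087'])]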
Definition 4.6, merging operation: λ(I) = O = {o_1, ..., o_n}, and there exists a unique pairwise-disjoint partition I_1, ..., I_n with λ(I_k) = {o_k} for k = 1..n; I_1, ..., I_n is called the input partition, and the origin of o_k produced by operation λ is λ*(o_k, I) = I_k.
TraceAggregator(λ, O*, I) is designed to trace the origin of an output item set O* produced by a merging operation, where λ(I*) = O*, I* is a subset of I, and its complement produces the remaining items of O. The method verifies subsets of I in a cumulative manner: when a subset I' with λ(I') = O* and λ(I - I') = O - O* is found, it is the origin; otherwise only supersets of the current candidate still need to be verified, which noticeably improves efficiency.
procedure TraceAggregator(λ, O*, I)
L ← the subsets of I ordered by increasing cardinality;
for each I* ∈ L do
  if λ(I*) = O* then
    if λ(I - I*) = O - O* then break;
    else L ← the supersets of I* ordered by increasing cardinality;
return I*;
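A brute-force Python sketch of the same search follows; the cumulative superset pruning is omitted for clarity, and the aggregation used in the demonstration is illustrative.
from itertools import combinations

def trace_aggregator(op, target_outputs, inputs):
    """Sketch of TraceAggregator for a merging operation op: look for a subset
    I* of the inputs, in order of increasing size, such that op(I*) yields
    exactly the target outputs and op of the complement yields the rest.
    Exponential in |inputs|; intended only to illustrate the verification step.
    """
    all_outputs = op(list(inputs))
    rest = [o for o in all_outputs if o not in target_outputs]
    for size in range(1, len(inputs) + 1):
        for subset in combinations(inputs, size):
            complement = [i for i in inputs if i not in subset]
            if (sorted(op(list(subset))) == sorted(target_outputs)
                    and sorted(op(complement)) == sorted(rest)):
                return list(subset)
    return []

def avg_by_tag(items):
    """Toy merging operation: average detection probability per tag id."""
    groups = {}
    for tag, p in items:
        groups.setdefault(tag, []).append(p)
    return sorted((tag, round(sum(ps) / len(ps), 3)) for tag, ps in groups.items())

readings = [("087", 0.8), ("087", 0.9), ("120", 0.7), ("120", 0.8)]
print(trace_aggregator(avg_by_tag, [("120", 0.75)], readings))
# -> [('120', 0.7), ('120', 0.8)]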
Further analysis shows that merging operations on a data stream can be subdivided into context-free merges and key-preserving merges.
Definition 4.7, context-free merge: any two input data items either always belong to the same input partition or never do. In other words, in a context-free merge the partition membership of an input item depends only on its own value, not on the values of the other input items.
Definition 4.8, key-preserving merge: suppose every input and output item carries a unique key value, written i.key for an item i. If for any input set I with partition I_1, ..., I_n and output λ(I) = {o_1, ..., o_n}, every subset I' of each I_k produces a single output item with the same key as o_k, i.e. ∀I' ⊆ I_k, λ(I') = {o'_k} with o'_k.key = o_k.key (k = 1..n), then λ is called a key-preserving merge.
Theorem 4.3: a key-preserving merge is a context-free merge.
Proof: let λ be a key-preserving merge; we show that any two items i and i' either belong to the same input partition or to different partitions independently of the input set, i.e. that λ is context-free.
Assume, for contradiction, that λ is not a context-free merge; then there exist input sets I and I' and items i and i' whose partition membership differs between I and I'.
(1) If i and i' belong to the same partition I_j with λ(I_j) = {o_j}, then by the definition of a key-preserving merge the output items produced by λ({i}) and λ({i'}) have the same key o_j.key.
(2) If i and i' belong to different partitions I_j and I_k, with λ(I_j) = {o_j} and λ(I_k) = {o_k}, then by the definition of a key-preserving merge λ({i}) produces an item with key o_j.key and λ({i'}) produces an item with key o_k.key, where o_j.key ≠ o_k.key. The two cases assign contradictory keys to λ({i}) and λ({i'}), contradicting the definition of a key-preserving merge; hence λ is context-free.
Theorem 4.4: if operation λ is a merging operation with an inverse λ^{-1}, then for every instance λ(I) = O, the origin of every o ∈ O generated through λ is λ^{-1}({o}).
Proof: since λ is a merging operation, by Definition 4.6, λ(I) = O = {o_1, ..., o_n} and there is a unique partition I_1, ..., I_n of I with λ(I_k) = {o_k}, k = 1..n, where I_k is the origin of o_k. By Definition 2.4, λ^{-1}(λ(I)) = I, and therefore ∀o_k ∈ O, λ^{-1}({o_k}) = λ^{-1}(λ(I_k)) = I_k.
Note that, for completeness, the full strength of Theorem 4.4 is not actually needed; a weak inverse suffices. A weak inverse requires that
∀I ⊆ ψ, λ^{-1}(λ(I)) = I
must hold, while
∀O ⊆ ψ, λ(λ^{-1}(O)) = O
need not hold.
According to Theorem 4.4, if the invertible operation is a merging operation, origin tracking can be realized with the inverse operation; if it is a splitting operation, the inverse operation cannot be used to compute the origin. In general, a reverse query is an effective means of computing origins.
Example 1, list merging: a two-attribute input set produces a two-attribute output set by grouping the input on the first attribute and producing one output item per group value. With I = {<1, a>, <1, c>, <2, b>, <2, g>, <2, h>}, the output is O = λ(I) = {<1, "a,c">, <2, "b,g,h">}; the inverse λ^{-1} splits the second attribute of each output item back into several items, i.e. λ^{-1}(O) = I.
Since λ is a merging operation, by Theorem 4.4 λ^{-1} performs origin tracking of data items: for the output item o = {<2, "b,g,h">}, for example, the origin of o is λ^{-1}({o}) = {<2, b>, <2, g>, <2, h>}.
If the mapping pattern of a non-relational operation is known, the origin of an output item can be restricted, according to the pattern mapping, to a subset of the input item set when the origin is computed; this subset may be very small, which improves the efficiency of origin computation. For example, suppose operation λ has a reverse pattern mapping; when the origin of an output item o is sought, first find the input subset I' ⊆ {i ∈ I | i.A = g(o.B)}, and then use TraceAggregator(λ, o, I') to enumerate the subsets of I' and find the origin I* ⊆ I' of o.
4.3 Black-box origin tracing for operations with unknown definitions
In the black-box case, where the details of neither the relational nor the non-relational operations are known and no inverse operation is available, the invention uses the dynamic slicing technique to compute the origin of a specified tuple.
Dynamic slicing is a debugging technique that can catch errors in executed statements. Recent research shows that dynamic slicing is very effective for locating faults at run time and can effectively analyze unknown operations [Xiangyu Zhang, Precise dynamic slicing algorithms, International Conference on Software Engineering, 2003]. The invention adopts techniques such as roBDD [Randal E. Bryant, Graph-based algorithms for boolean function manipulation, IEEE, 1986] and focuses on data dependence and control dependence; traditional dynamic slicing values control dependence more, tracking the set of executed statements to assist the programmer in debugging, whereas origin computation tracks the input set associated with a specific output value and values data dependence.
Definition 4.9, dynamic slicing: given one execution of a program, the dynamic slice of an execution point s_i (the i-th executed instance of statement s) is the set of executed statements that directly or indirectly affect the execution of s_i.
To identify the set of related statements, dynamic slicing captures the dependences that actually arise between statement executions; dependences fall into two classes: data dependence and control dependence.
Definition 4.10, data dependence: statement execution instance s_i is data dependent on another statement execution instance t_j if and only if a variable defined in t_j is subsequently referenced in s_i.
Definition 4.11, control dependence: statement execution instance s_i is control dependent on t_j if and only if statement s_i executes as a result of the branch taken by statement t_j.
The dynamic slicing technique is used to compute, as the origin, the set of input items used to compute a particular value at a particular execution point. For a given program execution, the data origin of the value v at execution point s_i is denoted DL(v@s_i); DL(v@s_i) is the set of input items that, through data or control dependences, directly or indirectly participate in the computation of the value v at s_i. DL(s_i) denotes the data origin of the left-hand-side expression of s_i.
Dynamic slicing usually first builds a dynamic program dependence graph describing the data/control dependences between statement instances, and then traverses this graph to find the set of reachable statement instances. The drawback of this approach is that the size of the dependence graph is hard to estimate. Dynamic slices can also be computed in a forward manner, produced and updated continuously as the execution progresses; this alleviates the space problem (dynamic slices usually occupy a surprising amount of space), but keeping the slices up to date requires a costly update at every execution step. Since the invention only borrows the idea from program debugging and its goal is origin tracking, statement execution need not be traced: for the small program below, the origin set of OUTPUT contains only INPUT[0], even though all the statements are included in the dynamic slice of OUTPUT because they all contribute directly or indirectly to its value. With a good design, origin tracking can be more efficient than dynamic slicing.
10:x=INPUT[0];
20:x=x+1;
30:OUTPUT=x;
Data origin is computed during program execution as follows. The basic idea is that the set of input elements related to the variables on the right-hand side of s_i also belongs to the related input set of every statement instance on which s_i is data or control dependent. In other words, every operator or input item related to the predicate that controls the execution of s_i is also related to s_i. For example, if the executed statement s_i is: if t_j is true then d = f(use_0, use_1, ..., use_n), where use_0, use_1, ..., use_n are the variables used to assign d, then s_i is control dependent on t_j.
Let DEF(x) be the most recent executed statement instance that defines x, let use_x.t be the timestamp carried by the data item use_x, and let w_i be the appropriate sliding window size of the current data stream. The computation of data origin can then be described by the following equation:
DL(dest@s_i) = DL(t_j) ∪ (∪_x DL(use_x@s_i))
             = DL(t_j) ∪ (∪_{x: DEF(use_x) ≠ φ} DL(use_x@DEF(use_x))) ∪ (∪_{x: DEF(use_x) = φ} {use_x}),
restricted to items with use_x.t ∈ [t - w_i, t].
The origin set of the variable dest defined by s_i is therefore the union of the origin set of t_j and the origin sets of the variables use_x used by s_i. If a variable use_x has been defined earlier, DL(use_x@s_i) = DL(use_x@DEF(use_x)); otherwise use_x is regarded as an input and DL(use_x@s_i) = {use_x}. Analysis shows that, for origin tracking, data dependence is more crucial than control dependence. Based on the above equation, a recursive procedure SlicingTraceLineage(dest, s) can be designed to perform origin tracking by dynamic slicing.
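The lineage-propagation rule above can be illustrated with a tiny straight-line interpreter. Control dependence and the sliding-window bound on use_x.t are omitted for brevity, and the program representation is an assumption of this sketch only.
def slicing_trace_lineage(statements, inputs):
    """Minimal forward-computed data lineage in the spirit of the equation above:
    the origin of the variable defined by a statement is the union of the origins
    of the variables it uses; a use with no prior definition is an input item and
    contributes itself.
    statements : list of (target name, function, list of used variable names)
    inputs     : dict mapping input name -> value
    """
    values = dict(inputs)
    lineage = {name: {name} for name in inputs}     # DL(input) = {input}
    for target, fn, uses in statements:
        values[target] = fn(*(values[u] for u in uses))
        lineage[target] = set().union(*(lineage[u] for u in uses)) if uses else set()
    return values, lineage

# the three-line program listed earlier: x = INPUT[0]; x = x + 1; OUTPUT = x
program = [
    ("x", lambda a: a, ["INPUT[0]"]),
    ("x", lambda a: a + 1, ["x"]),
    ("OUTPUT", lambda a: a, ["x"]),
]
values, lineage = slicing_trace_lineage(program, {"INPUT[0]": 41})
print(values["OUTPUT"], lineage["OUTPUT"])   # 42 {'INPUT[0]'}
Running this prints 42 and {'INPUT[0]'}, matching the observation that the origin set of OUTPUT contains only INPUT[0].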
On the basis of the foregoing theory and algorithms, the main sensor stream lineage tracing (SLT) algorithm is described as follows:
Algorithm SensorStreamsLineageTracing
Input: LineageQuery, Assistant Information  // the origin tracking query, plus the continuous query pattern and auxiliary processing information
Output: Lineage Information                 // the origin information of the given data item (set)
1  Do While not eof(sensor streams)                       // loop while the sensor data stream has not ended
2    ComputingWindowSize(p_avg, δ);                       // adjust the sliding window to a suitable size
3    If the forward query schema is known Then            // forward query pattern information is known
4      If the operation is composed of relational operators Then   // the processing consists of standard relational operations
5        StreamLineageTracing(T; v; ψ; p_avg; δ);         // call the standard inverse ASPJ query procedure
6      Else                                               // call the non-relational reverse-tracing procedures
7        If the operation is a one-to-many relationship Then
8          TraceDecompose(λ, O*, I);                      // call the splitting reverse-tracing procedure
9        Else
10         TraceAggregator(λ, o, I');                     // call the merging reverse-tracing procedure
11       EndIf
12     EndIf
13   Else
14     SlicingTraceLineage(dest, s);                      // dynamically slice the operation sequence: call the black-box slicing procedure
15   EndIf
16 EndDo
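Purely as an illustration of the control flow above, the following Python skeleton wires the branches to the sketches given in the earlier sections; every field of ctx, and the way the sliding window restricts the input batch, are assumptions of this sketch rather than the patent's specification.
def sensor_stream_lineage_tracing(batch, ctx):
    """Control-flow skeleton of the main algorithm above.  Assumes the earlier
    sketch functions (window_size, stream_lineage_tracing, trace_decompose,
    trace_aggregator, slicing_trace_lineage) are in scope; ctx bundles what the
    algorithm branches on.
    """
    w = window_size(ctx.p_avg, ctx.delta)           # ComputingWindowSize
    window = batch[-w:]                             # restrict tracing to the sliding window
    if not ctx.forward_pattern_known:               # black-box case: dynamic slicing
        return slicing_trace_lineage(ctx.statements, ctx.inputs)
    if ctx.purely_relational:                       # standard (A)SPJ view
        return stream_lineage_tracing(ctx.target_tuples, ctx.view)
    if ctx.is_splitting:                            # non-relational, one-to-many
        return trace_decompose(ctx.op, ctx.target_outputs, window)
    return trace_aggregator(ctx.op, ctx.target_outputs, window)   # many-to-one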
The invention is described in further detail below in conjunction with an embodiment.
With reference to Fig. 1, the experimental environment mainly consists of ultra-high-frequency (2.45 GHz) active tags with their matching readers and grid localization devices, with an epoch of 2 seconds. The grid localization devices are divided into localization grades of 5 m, 10 m and 25 m, and an experimental environment consisting of 20 grid localization devices, 5 readers and 50 active tags was built in the laboratory. For triangulation, the industry-leading high-precision UbiSense localization platform is currently planned; its localization accuracy reaches 30 cm and the data volume it can send per unit time is large, satisfying the demands of iceberg query performance testing. The data server is configured as 2.0 GHz dual-core / 1.0 G / 160 M, and the sensor data are collected through the serial port.
Embodiment 1: consider the application scenario of whole-process quality tracking in a workshop. Near the completion point of every process step on the conveyor belt, a series of sensor readers with detection sensors is installed to track the products being processed; the system obtains the monitoring stream and the operation stream and performs a series of operations on them, finally obtaining a derived stream (i.e. the audit stream) through continuous query technology over the sensor data. The operation sequence is shown in Fig. 4.
The monitoring data stream comprises four attributes: reader number, location number, timestamp, and the list of products detected at the same time (formed by product numbers and detection probabilities), i.e. Monitor(ReaderId, LocationId, MTimeStamp, Productid-list). The actual content of the Monitor stream at a certain moment is shown in Table 1.
Table 1  Example of the sensor monitoring data stream Monitor
ReaderId  LocationId  MTimeStamp           Productid-list (Tagid, probability)
R1        L1          2008-6-11 11:51:12   (087, 0.9) (098, 0.2) (120, 0.7)
R2        L1          2008-6-11 11:56:10   (089, 0.9) (125, 0.8) (087, 0.9)
R3        L3          2008-6-11 12:01:00   (087, 0.8) (120, 0.8)
The Product table comprises six attributes: tag id, product name, category, price, weight and unit, i.e., Product(Tagid, ProductName, Category, Price, Weight, Unit), as shown in Table 2.
Table 2  Example of the product table Product
Tagid  ProductName  Category  Price  Weight  Unit
087    HP 6320      Computer  7000   2.8     Box
089    Rice         Food      100    50      Bag
120    IBM T63      Computer  13000  2       Box
In the operation sequence, λ1 splits the content of the Monitor stream so that each tag occupies its own row, conforming to relational 1NF, and adds two new columns, Tagid and Probability; λ2 filters the Product table to obtain the products whose category is "Computer"; λ3 joins the Monitor stream with the Product table, keeping the LocationId, Tagid, ProductName and Probability attribute columns while satisfying the time constraint; λ4 groups and aggregates the join result by the Tagid attribute, computing the average detection probability of each product per day and outputting <Tagid, ProductName, AverageProbability>; λ5 selects the product names whose average probability is greater than 0.85. According to the earlier definitions of operations, λ1, λ2 and λ5 are one-to-many operations, while λ3 and λ4 are many-to-one operations. All of these operations except λ1 can be realized by standard relational operators.
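For illustration, the following is a minimal Python sketch of the operation sequence λ1–λ5 over in-memory rows; the attribute names follow Tables 1 and 2, the time constraint of λ3 is omitted, and the function names are illustrative assumptions rather than the patented operators.

from collections import defaultdict

def split_monitor(monitor_rows):                      # λ1: one row per detected tag (1NF), adds Tagid and Probability
    for r in monitor_rows:
        for tagid, prob in r["Productid_list"]:
            yield {"ReaderId": r["ReaderId"], "LocationId": r["LocationId"],
                   "MTimeStamp": r["MTimeStamp"], "Tagid": tagid, "Probability": prob}

def filter_computers(product_rows):                   # λ2: keep products whose category is "Computer"
    return [p for p in product_rows if p["Category"] == "Computer"]

def join_monitor_product(split_rows, product_rows):   # λ3: join on Tagid (time constraint omitted here)
    by_tag = {p["Tagid"]: p for p in product_rows}
    for m in split_rows:
        p = by_tag.get(m["Tagid"])
        if p is not None:
            yield {"LocationId": m["LocationId"], "Tagid": m["Tagid"],
                   "ProductName": p["ProductName"], "Probability": m["Probability"]}

def average_probability(joined_rows):                 # λ4: group by Tagid and average the probabilities
    groups, names = defaultdict(list), {}
    for r in joined_rows:
        groups[r["Tagid"]].append(r["Probability"])
        names[r["Tagid"]] = r["ProductName"]
    return [{"Tagid": t, "ProductName": names[t], "AverageProbability": sum(ps) / len(ps)}
            for t, ps in groups.items()]

def select_reliable(agg_rows, threshold=0.85):        # λ5: keep products whose average exceeds the threshold
    return [r for r in agg_rows if r["AverageProbability"] > threshold]

Composing these five functions in order yields the derived (audit) stream whose result tuples the origin query then traces backwards.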
(1) Origin tracking for a single-layer view produced by a merge operation: λ4 groups by TagId and aggregates Probability by averaging, producing 5 output tuples, each containing a group value and an average identification probability. Based on Table 1 and representing tuples simply as <TagId, AverageProbability>, the input of λ4 is I = {<'087', 0.8>, <'087', 0.9>, <'087', 0.9>, <'089', 0.9>, <'098', 0.2>, <'120', 0.7>, <'120', 0.8>, <'125', 0.8>} and the output is O = λ4(I) = {<'087', 0.867>, <'089', 0.9>, <'098', 0.2>, <'120', 0.75>, <'125', 0.8>}. The inverse λ4^-1 decomposes the second attribute of each output item back into the several input items that produced it, i.e., λ4^-1(O) = I. Since λ4 is a merge operation, according to Theorem 4.4 the origin of a data item is traced with λ4^-1; for example, given the output item o = {<'120', 0.75>}, the origin of o is {<'120', 0.7>, <'120', 0.8>}.
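A minimal sketch of this merge (aggregation) trace-back under the key-preserving assumption: the group key of the output tuple directly selects its origin subset among the inputs; the function name and tuple layout are illustrative assumptions, not the patented procedure.

def trace_aggregator(output_tuple, inputs, key=lambda t: t[0]):
    """Return the input tuples whose group key equals the output tuple's group key."""
    return [t for t in inputs if key(t) == key(output_tuple)]

I = [("087", 0.8), ("087", 0.9), ("087", 0.9), ("089", 0.9),
     ("098", 0.2), ("120", 0.7), ("120", 0.8), ("125", 0.8)]
o = ("120", 0.75)
print(trace_aggregator(o, I))          # [('120', 0.7), ('120', 0.8)]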
(2) Origin tracking for a split operation: operation λ1 is a split operation. For the output o = {<'L1', '087', 'HP 6320', 0.8>}, the TraceDecompose() procedure is applied, yielding the origin of o with respect to λ1 as {<'R1', 'L1', '2008-6-11 11:51:12', '(087,0.9)(098,0.2)(120,0.7)'>, <'R2', 'L1', '2008-6-11 11:56:10', '(089,0.9)(125,0.8)(087,0.9)'>}, and the origin obtained with its inverse operation λ1^-1 is a subset of the correct origin set.
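A minimal sketch of the enumerate-and-verify idea behind TraceDecompose for a split operation: each input is replayed through the operation and kept if its outputs contain the target item; the toy split operation and tuple layout below are assumptions for illustration.

def trace_decompose(op, output_item, inputs):
    """Enumerate the inputs and keep those that (re)produce the given output item."""
    return [i for i in inputs if output_item in list(op(i))]

def split_row(row):                       # toy split: one (location, tag, probability) per list entry
    reader_id, location_id, ts, product_list = row
    for tag, prob in product_list:
        yield (location_id, tag, prob)

inputs = [("R1", "L1", "2008-6-11 11:51:12", [("087", 0.9), ("098", 0.2), ("120", 0.7)]),
          ("R2", "L1", "2008-6-11 11:56:10", [("089", 0.9), ("125", 0.8), ("087", 0.9)])]
o = ("L1", "087", 0.9)
print(trace_decompose(split_row, o, inputs))   # both rows detect tag 087 at L1 with probability 0.9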
(3) Origin tracking for a multi-layer view: suppose one output of operation λ5 is o = {<'087', 'HP 6320'>}; computing the origin of o in the source Monitor data stream requires origin tracking over a multi-layer data-stream view. Since λ5 has pruned part of the aggregated tuples, an intermediate stream that temporarily holds all of the aggregated computer tuples is introduced after λ4. The multi-level ASPJ view origin tracing algorithm of (C) is adopted: the origin with respect to the previous layer is computed level by level, and the final result is the same as in (2), as shown in Fig. 5.
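A minimal sketch of the level-by-level tracing over a multi-layer view: each layer contributes a tracer from one of its output items to the set of its input items, and the tracers are composed from the top view down to the base stream; the tracer signatures and the toy two-layer example are assumptions for illustration.

def trace_multilayer(tracers, top_item):
    """tracers are ordered from the topmost operation down to the base stream."""
    items = [top_item]
    for trace_layer in tracers:
        next_items = []
        for it in items:
            next_items.extend(trace_layer(it))    # origin of this item one layer down
        items = next_items
    return items

# Toy two-layer example: a selection (λ5-like) over an aggregation (λ4-like).
agg_inputs = {"087": [("087", 0.9), ("087", 0.9), ("087", 0.8)]}
tracers = [
    lambda o: [(o[0], sum(p for _, p in agg_inputs[o[0]]) / len(agg_inputs[o[0]]))],  # top view -> aggregate tuple
    lambda o: agg_inputs[o[0]],                                                       # aggregate tuple -> base tuples
]
print(trace_multilayer(tracers, ("087", "HP 6320")))   # [('087', 0.9), ('087', 0.9), ('087', 0.8)]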
In order to test all of the algorithms described above, we verify their performance experimentally. The experimental data come from the actual test environment; the input data stream is Monitor and the input table is Product. The origin tracking query is defined in an SQL-like manner as follows, and experiment 1 is carried out:
SELECT * FROM AllComputer WHERE
  SELECT Monitor.TagID, Product.ProductName
  FROM Monitor, Product
  WHERE lineage(Monitor, Product) AND Operation(P1, P2, P3, P4, P5)
  WINDOW Now-0
where P1 ... P5 denote the five operations, and the 0 in the WINDOW clause indicates that the window size is adjusted adaptively according to the read rate. This query traces the origin of result tuples that have passed through the five operations; among these operations there are both one-to-many and many-to-one types, and the many-to-one operations are key-preserving. The experiment designs three schemes for computing this origin query: (1) since the last four operations are standard relational operations, the TupleLineageTracing procedure can quickly compute the origin of P2 ... P5, while operation P1 is a non-standard operation, so the origin set obtained by TupleLineageTracing is taken as input and the TraceDecompose procedure is then called to compute the final origin set; (2) for operations P2 ... P5 the non-relational TraceAggregator procedure is adopted, tracing one operation at a time given the transition diagram, and for operation P1 its data origin is traced with the TraceDecompose procedure; (3) the five operations P1 ... P5 are treated as black-box operations and the SlicingTraceLineage procedure is called directly.
Each test was repeated 10 times, with the sliding-window size varied from 50 to 300 tuples, and the time taken by each of the three schemes to trace the origin of a single tuple was measured. The results are shown in Fig. 6: scheme (1) runs much faster than the other two schemes. If the operation information is known and the operations consist entirely of standard relational ASPJ operators, then as the sliding-window input size grows, the inverse query over the multi-layer standard ASPJ view significantly reduces the tracing time. Scheme (3) is suitable for origin tracking over more complex operations.
Embodiment 2: verify what performance improvement origin tracking with a dynamic sliding window obtains compared with a static fixed sliding window. We compare the existing inverse-query method of Cui [Y. W. Cui, J. Widom, J. L. Wiener. Tracing the Lineage of View Data in a Warehousing Environment. ACM Transactions on Database Systems, 2000, 25(2): 179-227] and the tracing method of Zhang [M. Zhang, X. Zhang, X. Zhang, and S. Prabhakar. Tracing lineage beyond relational operators. Technical report, Purdue University, 2007] with the method described in this patent. The experimental data come from the TPC-D standard test set. The input tables are LineItem, Order and PartSupp; the table contents are generated by the standard dbgen program, and a TPC-D scale factor of 1.0 corresponds to a total data warehouse size of 1 GB.
Experiment 2 compares the cost and performance of these methods in obtaining the origin of a specified tuple. Fig. 7 shows the relative computational accuracy of the Cui method and the Zhang method, invoked with fixed sliding-window sizes, versus the proposed method with adaptively adjusted sliding-window size. Because those two methods are designed for static data sets, they are compared with our method using batch-updated data subsets. The test was repeated 10 times, each time changing the number of input data items (equivalent to the fixed window size) for the Cui method and the Zhang method; computational accuracy is defined as the ratio of the number of origin tuples found by the algorithm to the true number of origin tuples. Fig. 7 shows that, compared with the Cui and Zhang methods, the accuracy of the origin obtained by our method over lossy sensor data is the highest. This is because, owing to the volatility of the data stream and the read errors inherent in sensor readers, the characteristics of origin tuples in the raw data stream differ from those in a static data set; a manually specified data set does not take the source stream into account, which lowers the identification accuracy of the Cui and Zhang methods.
Using a similar experimental methodology, we also verified by simulation the latency, the computational accuracy under different types of operations, the memory consumption, the throughput and the stability of the three sensor-data origin computation methods. All experiments show that our method performs well in terms of latency, adaptation to the uncertainty of sensor data, and origin-tracking success rate, and that it has a faster computation speed, meeting the requirement of real-time origin tracking.

Claims (1)

1. A data origin tracking method on sensor data stream complex query results, characterized in that it comprises the following steps:
Step 1, determining the size of the origin tracking query sliding window, wherein determining the size of the origin tracking query sliding window in step 1 specifically comprises the following steps:
Step 11, defining the origin tracking query sliding window: the origin tracking query window size is w_i epochs, W_i = (t − w_i, t); suppose tag i appears within the effective range of a reader, and during window W_i the reader reads tag i in each epoch with the same probability p_i;
Step 12, the reads in the epochs of the origin tracking query sliding window are independent Bernoulli trials with read probability p_i; suppose that among all the epochs of W_i, tag i appears only in a subset S_i of W_i, and let p_i^avg denote the average empirical read rate over these observation epochs,
p_i^avg = (1 / |S_i|) * Σ_{t ∈ S_i} p_{i,t},
where p_{i,t} is computed from the tag-list information of the reader; S_i is a binomial sample and |S_i| is a binomial random variable B(w_i, p_i^avg);
Step 13, choosing w_i so as to guarantee that tag i is read with high probability: if the number of epochs w_i in the smoothing window satisfies the inequality
w_i ≥ ⌈ ln(1/δ) / p_i^avg ⌉,
then tag i is guaranteed to be read in window W_i with probability greater than 1 − δ, where δ is the error probability expected by the user; the origin tracking query sliding window size is thereby determined;
Step 2, standardizing the description of the origin query, wherein standardizing the description of the origin query in step 2 means introducing randomized tuples on the basis of the relational data model, providing a standard procedure for tracking uncertain origin information, and providing a declarative continuous-query language interface for the user;
Step 3, judging the class of the origin tracking query and designing the corresponding algorithm, wherein judging the class of the origin tracking query and designing the corresponding algorithm in step 3 specifically comprises the following steps:
Step 31, according to whether the forward query pattern corresponding to the origin query is known and whether the operations follow a standard relational pattern, dividing origin tracking into four types: if the known forward query is a standard relational SPJ (select, project, join) view pattern, executing step 32; if the known forward query is a standard relational ASPJ (aggregate, select, project, join) view pattern, executing step 33; if the known forward query is a non-relational ASPJ view pattern, executing step 34; if the forward query pattern is unknown and the operations form a non-relational ASPJ view pattern, executing step 35;
Step 32, for origin tracking queries whose known forward query is a standard relational SPJ view pattern, converting all SPJ views into the SPJ canonical form and computing the origin of the specified tuple with a tracking query based on the canonical form;
Step 33, for origin tracking whose known forward query is a standard relational ASPJ view pattern, using the intermediate result as the tie between the aggregated tuples and the base stream, computing the relevant portion of the intermediate result from the base stream when needed, and storing the entire intermediate result in the data warehouse as a materialized auxiliary view;
Step 34, for origin tracking queries whose known forward query is a non-relational ASPJ view pattern, dividing the operations acting on the sensor data stream into two classes, split and merge: if each input data item produces zero or more independent data items, the operation is regarded as a split operation, and the origin of an output item is determined by enumerating the input data items; otherwise it is a merge operation, which is further subdivided into context-free merging and key-preserving merging, and the subsets of input items are verified in a cumulative manner;
Step 35, for origin tracking queries whose forward query pattern is unknown and whose operations form a non-relational ASPJ view pattern, computing the origin of the specified tuple with the dynamic slicing technique, thereby designing a black-box origin tracing method for operations whose definitions are unknown;
Step 4, designing the framework of origin tracking, wherein designing the framework of origin tracking in step 4 comprises the following steps:
Step 41, classifying the primary entities of the origin query information model into data streams and queries, a data stream being composed of two types, base streams and derived streams: a base stream comes from a device, a sensor network or a service outside the system, while a derived stream comes from base streams or other derived streams;
Step 42, designing a distributed event processing system which accepts query requests in a central-service manner, deploys queries on multiple distributed query engines, and executes each query within its own life cycle; the system monitors the load on each query engine, optimizes queries according to reuse rules, query cost and network cost estimates, and distributes the received queries to effective query execution engines;
Step 43, on the basis of step 42, building a sensor data origin query framework based on the data stream model, the framework covering the organization and storage policy of origins, the combination of origins with data, and the propagation manner of origins;
Step 5, implementing the whole origin tracing algorithm, thereby realizing the tracking of the data origin of sensor data stream complex query results.
CN 200910264155 2009-12-31 2009-12-31 Data origin tracking method on sensor data stream complex query results Expired - Fee Related CN102117302B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200910264155 CN102117302B (en) 2009-12-31 2009-12-31 Data origin tracking method on sensor data stream complex query results

Publications (2)

Publication Number Publication Date
CN102117302A CN102117302A (en) 2011-07-06
CN102117302B true CN102117302B (en) 2013-01-23

Family

ID=44216076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200910264155 Expired - Fee Related CN102117302B (en) 2009-12-31 2009-12-31 Data origin tracking method on sensor data stream complex query results

Country Status (1)

Country Link
CN (1) CN102117302B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1278345A (en) * 1997-11-04 2000-12-27 International Business Machines Corp Online database mining
CN1303061A (en) * 1999-10-21 2001-07-11 International Business Machines Corp System and method of sequencing and classifying attributes for better visible of multidimentional data
WO2002088925A1 (en) * 2001-04-30 2002-11-07 The Commonwealth Of Australia A data processing and observation system
CN101292222A (en) * 2005-10-17 2008-10-22 米德玛赤控股有限公司 A method and apparatus for improved processing and analysis of complex hierarchic data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130123

Termination date: 20151231

EXPY Termination of patent right or utility model