WO2011079706A1

WO2011079706A1 - Method and device for data query

Info

Publication number: WO2011079706A1
Application number: PCT/CN2010/079728
Authority: WO
Inventors: 申小次; 李建军; 贾学力; 庄明亮; 付新刚
Original assignee: 北京世纪高通科技有限公司
Priority date: 2009-12-30
Filing date: 2010-12-13
Publication date: 2011-07-07
Also published as: CN101763417A

Abstract

A method and a device for data query are provided, relating to the field of intelligent transportation systems. The method includes: acquiring a sub-series to be queried and its corresponding time parameter (101); acquiring, from history data, a sub-series set for the corresponding time parameter according to the time parameter of the sub-series to be queried (102); performing dimensionality reduction on the sub-series to be queried and on the sub-series in the obtained sub-series set (103); performing a match query between the dimension-reduced sub-series to be queried and the dimension-reduced sub-series in the obtained sub-series set (104); and acquiring the sub-series that match the sub-series to be queried (105). The method and the device can reduce the time complexity of data query and improve the resource utilization of the system. They address the problem in the prior art that data query tends to produce large errors, has high query complexity, and occupies substantial system resources.

Description

This application claims priority to Chinese patent application No. 200910244152.5, filed on December 30, 2009 and entitled "a data query method and device", the entire contents of which are incorporated herein by reference.

Technical field

The present invention relates to the field of intelligent transportation systems, and in particular, to a data query method and apparatus.

Background art

The Advanced Traffic Information System (ATIS) is built on a well-established information network. Sensors or data transmission equipment installed on roads, in vehicles, at transfer stations, in parking lots and at weather centers collect various types of traffic information, which the system processes comprehensively in order to provide the public with comprehensive and accurate real-time road congestion information. However, the data acquired by these devices cannot cover all roads, so real-time data must be filled in by similarity queries against historical data; the historical data can also be analyzed and used for prediction.

The historical data is an ordered list of data formed over time, i.e. a time series. A similarity query on time series finds similar patterns of change in a time series data set, which is of great significance for prediction, classification and knowledge discovery on time series. Similarity query over large-scale time series databases is one of the active topics in time series data mining. By running similarity queries of real-time data against the historical database, real-time data can be filled in and predicted quickly. However, because the historical data is massive and high-dimensional, computing distances directly on the original sequences to find subsequences similar to the sequence to be queried occupies a large amount of system resources. A commonly used high-level representation of time series is the discrete Fourier transform (DFT).

In the course of implementing the above data processing, the inventors found at least the following problems in the prior art: the discrete Fourier transform smooths away much of the information in the original sequence and therefore cannot represent the original sequence precisely, and the time complexity of the method is O(…). As a result, data queries tend to produce large errors, the query complexity is high, and a large amount of system resources is occupied.

Embodiments of the present invention provide a data query method and apparatus.

In order to achieve the above object, embodiments of the present invention use the following technical solutions:

A data query method, including:

Obtaining a subsequence to be queried and its corresponding time parameter;

Obtaining, according to the corresponding time parameter of the sub-sequence to be queried, a sub-sequence set of the corresponding time parameter from the historical data;

Performing a dimensionality reduction process on the sub-sequence to be queried and the sub-sequence in the acquired sub-sequence set;

Performing a matching query between the dimension-reduced sub-sequence to be queried and the sub-sequences in the dimension-reduced sub-sequence set;

Obtaining a subsequence that matches the subsequence to be queried.

A data query device includes:

An information acquiring unit, configured to acquire a subsequence to be queried and a corresponding time parameter thereof;

a history subsequence obtaining unit, configured to acquire, according to the corresponding time parameter of the subsequence to be queried, a subsequence set of the corresponding time parameter from the historical data;

a sequence processing unit, configured to perform dimensionality reduction on the sub-sequence to be queried and on the sub-sequences in the acquired sub-sequence set;

a matching query unit, configured to perform a matching query between the dimension-reduced sub-sequence to be queried and the sub-sequences in the dimension-reduced sub-sequence set;

And a matching sequence obtaining unit, configured to acquire a subsequence that matches the subsequence to be queried.

The data query method and device provided by the embodiment of the present invention obtain, according to the time parameter of the sub-sequence to be queried, the sub-sequence set for the corresponding time parameter from the historical data; perform dimensionality reduction on the sub-sequence to be queried and on the sub-sequences in the obtained sub-sequence set; perform a matching query between the dimension-reduced sub-sequence to be queried and the sub-sequences in the dimension-reduced sub-sequence set; and obtain the matching sub-sequences. Compared with the prior art, the embodiment of the present invention performs dimensionality reduction on both the sub-sequence to be queried and the sub-sequences in the historical sub-sequence set for the corresponding time parameter, so that the query time complexity of the whole system is reduced and the utilization of system resources is improved.

DRAWINGS

FIG. 1 is a flowchart of a data query method according to an embodiment of the present invention;

FIG. 2 is a flowchart of a specific implementation of the step of acquiring, from historical data, a sub-sequence set for the corresponding time parameter according to the time parameter of the sub-sequence to be queried, in a data query method according to an embodiment of the present invention;

FIG. 3 is a flowchart of a specific implementation of the step of performing dimension reduction on the sub-sequence to be queried and on the sub-sequences in the obtained sub-sequence set, in a data query method according to an embodiment of the present invention;

FIG. 4 is a flowchart of a specific implementation of the step of performing a matching query between the dimension-reduced sub-sequence to be queried and the sub-sequences in the dimension-reduced sub-sequence set, in a data query method according to an embodiment of the present invention;

FIG. 5 is a flowchart of a specific implementation of the step of acquiring a subsequence matching the sub-sequence to be queried, in a data query method according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a data query apparatus according to an embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The data query method and apparatus of the embodiments of the present invention are described in detail below with reference to the accompanying drawings.

As shown in FIG. 1 , a data query method according to an embodiment of the present invention includes:

101: Obtain a subsequence to be queried and a corresponding time parameter thereof;

102: Obtain, according to the corresponding time parameter of the sub-sequence to be queried, a sub-sequence set of the corresponding time parameter from the historical data;

103: Perform dimensionality reduction on the sub-sequence to be queried and on the sub-sequences in the obtained sub-sequence set;

104: Perform a matching query between the dimension-reduced sub-sequence to be queried and the sub-sequences in the dimension-reduced sub-sequence set;

105: Obtain a subsequence that matches the subsequence to be queried.

FIG. 2 shows the implementation of the step of obtaining, from historical data, a sub-sequence set for the corresponding time parameter according to the time parameter of the sub-sequence to be queried, in a data query method according to an embodiment of the present invention.

Let the database of historical time series be the first database DB, in which N time series of different lengths are stored. The current sub-sequence to be queried is $X_i = (x_i, x_{i+\tau}, \ldots, x_{i+(m-1)\tau})$, $i = 1, 2, \ldots, n-(m-1)\tau$, where $m$ is the embedding dimension; $\tau$ is the delay time, $\tau = 1, 2, \ldots$; and $X_i$ is a point in the phase space. The process of obtaining from the database DB the sub-sequences corresponding in time to the sub-sequence to be queried, that is, of obtaining the reconstructed phase space, is as follows:

201: Obtain the time parameter corresponding to each element of the sub-sequence to be queried. For example, the time point corresponding to $x_i$ in the sub-sequence to be queried is t1, the time point corresponding to $x_{i+\tau}$ is t2, the time point corresponding to $x_{i+2\tau}$ is t3, and so on.

202: According to the time parameters, search the database of historical time series, that is, the first database DB, for the sub-sequence set corresponding to those time parameters. For example, with the time parameters t1, t2, t3, ..., the first database DB is queried in turn for the sub-sequence at times t1, t2, t3, ... on the previous day, then for the sub-sequence at times t1, t2, t3, ... two days earlier, and so on until all sub-sequences at times t1, t2, t3, ... in the first database DB have been retrieved; all retrieved sub-sequences are then grouped into one sub-sequence set.
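Steps 201 and 202 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the dictionary-of-days layout of the first database DB, and the use of 0-based slot indices in place of the time points t1, t2, t3 are all assumptions made for the example.

```python
from typing import Dict, List, Sequence


def delay_embed(series: Sequence[float], m: int, tau: int) -> List[List[float]]:
    """Build the phase-space points X_i = (x_i, x_{i+tau}, ..., x_{i+(m-1)tau}).

    With 0-based indexing there are n - (m - 1) * tau points for a series of length n.
    """
    count = len(series) - (m - 1) * tau
    return [[series[i + k * tau] for k in range(m)] for i in range(max(count, 0))]


def same_time_subsequences(
    history: Dict[str, Sequence[float]], slots: Sequence[int]
) -> List[List[float]]:
    """Collect, from every stored day, the values at the query's time slots.

    `history` maps a day label (hypothetical ISO date string) to that day's series.
    Days that do not cover all requested slots are skipped (an assumption).
    """
    result = []
    for day in sorted(history, reverse=True):  # most recent day first
        series = history[day]
        if all(0 <= s < len(series) for s in slots):
            result.append([series[s] for s in slots])
    return result
```

For instance, `delay_embed([1, 2, 3, 4, 5], m=3, tau=1)` yields three points, and `same_time_subsequences` returns one candidate sub-sequence per historical day that has data at every requested slot.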

FIG. 3 shows a specific implementation of the step of performing dimension reduction on the sub-sequence to be queried and on the sub-sequences in the obtained sub-sequence set, in the data query method provided by the embodiment of the present invention. The process includes:

The following takes as an example a traffic history database that stores speeds at 5-minute intervals, where each time series holds the speed values of one day, so the length of each time series is less than or equal to 288. The sub-sequence to be queried is $x = \{x_1, x_2, \ldots, x_m\}$, with length $m < 288$; the preset error is $e$, and the initial dimension value is $p < m$.
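The bound of 288 follows from 24 × 60 / 5. The short sketch below works through that arithmetic and shows one possible way to turn a time of day into a position within a one-day series; the names `SLOT_MINUTES`, `SLOTS_PER_DAY` and `slot_index` are hypothetical and not part of the source.

```python
SLOT_MINUTES = 5                          # sampling interval from the example
SLOTS_PER_DAY = 24 * 60 // SLOT_MINUTES   # 1440 minutes / 5 = 288 values per day


def slot_index(hour: int, minute: int) -> int:
    """Return the 0-based position of a time of day within a one-day series."""
    return (hour * 60 + minute) // SLOT_MINUTES
```

Under this convention 00:00 maps to slot 0 and 23:55 maps to slot 287, the last of the 288 slots.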

301: Obtain the preset error parameter $e$, the initial dimension value $p$ ($p < m$), and the sub-sequence to be queried.

302: Map the sub-sequence to be queried into its corresponding piecewise polynomial feature space. The specific mapping process is as follows:

$\forall x \in X$ with length $|x| = m$, $x$ is approximated in the minimum-mean-square-error sense by the following polynomial function:

$$f(t) = w_0 + w_1 t + w_2 t^2 + \cdots + w_{p-1} t^{p-1}$$

so that $x$ is represented by the point $w = (w_0, w_1, \ldots, w_{p-1})$ in the $p$-dimensional feature space spanned by the polynomial basis $\{1, t, t^2, \ldots, t^{p-1}\}$. The point $w$ is the least-squares solution

$$w = (Q^{T} Q)^{-1} Q^{T} x$$

where $Q$ is the $m \times p$ matrix whose $i$-th row is $(1, t_i, t_i^2, \ldots, t_i^{p-1})$, $i = 1, 2, \ldots, m$.

The inverse transformation, which recovers the approximation $x'$ of $x$, is:

$$x' = Q w$$

$x$ and $x'$ satisfy:

$$x = x' + e$$

where $e$ is the residual sequence, which follows a zero-mean normal distribution, i.e. $e \sim N(0, \sigma^2)$.

This transformation implements the mapping $R^m \to R^p$; in general $m > p$, so the mapping $R^m \to R^p$ realizes dimensionality reduction of the time series data.

The preset dimension value $p$ may introduce a large error, that is, the error between the dimension-reduced sub-sequence and the sub-sequence before dimension reduction may exceed the preset error $e$. The embodiment of the present invention can therefore further include the following steps to ensure the accuracy of the dimension-reduced sub-sequence.

303: Obtain the approximating function $f(t)$.

304: Obtain the actual error and determine whether its value $\lVert x - x' \rVert$ is less than the preset error $e$. If it is less than the preset error $e$, perform step 306; otherwise, perform step 305.

305: Update the value of $p$, for example by increasing it, and return to step 302.

306: Output the result: $w$ is the polynomial representation of $x$. This completes the transformation from the $m$-dimensional time space to the $p$-dimensional feature space and thus achieves dimension reduction.

It should be noted that the process of mapping the sub-sequences in the obtained sub-sequence set into their corresponding piecewise polynomial feature spaces is the same as the dimension reduction process described above, and is not described again here.
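The loop formed by steps 301 to 306 can be illustrated with a small, dependency-free Python sketch. Several details are assumptions made only for this example: the sample times are rescaled to the interval [0, 1] for numerical conditioning, the actual error is measured as the Euclidean norm of the residual, the least-squares fit is computed by solving the normal equations, and the loop stops growing p once it reaches m. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a * w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (aug[r][n] - tail) / aug[r][r]
    return w


def sample_times(m: int) -> List[float]:
    """Assumed sample times: m equally spaced points in [0, 1] (requires m >= 2)."""
    return [i / (m - 1) for i in range(m)]


def fit_polynomial(x: Sequence[float], p: int) -> List[float]:
    """Step 302: least-squares coefficients w_0..w_{p-1} via the normal equations."""
    ts = sample_times(len(x))
    q = [[t ** j for j in range(p)] for t in ts]
    qtq = [[sum(row[r] * row[c] for row in q) for c in range(p)] for r in range(p)]
    qtx = [sum(row[r] * xi for row, xi in zip(q, x)) for r in range(p)]
    return solve_linear(qtq, qtx)


def reconstruct(w: Sequence[float], m: int) -> List[float]:
    """Inverse transformation: evaluate the fitted polynomial at the sample times."""
    return [sum(wj * t ** j for j, wj in enumerate(w)) for t in sample_times(m)]


def reduce_dimension(x: Sequence[float], e: float, p: int) -> Tuple[List[float], float]:
    """Steps 301 to 306: raise p until the reconstruction error drops below e."""
    m = len(x)
    while True:
        w = fit_polynomial(x, p)                                   # step 302
        x_approx = reconstruct(w, m)                               # step 303
        err = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_approx)))
        if err < e or p >= m:                                      # step 304 (p >= m is a safeguard)
            return w, err                                          # step 306
        p += 1                                                     # step 305
```

Starting from p = 1, a sequence sampled from a quadratic is reduced to three coefficients, since p = 3 is the first value at which the fit is exact up to rounding.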

FIG. 4 shows a specific implementation of the step of performing a matching query between the dimension-reduced sub-sequence to be queried and the sub-sequences in the dimension-reduced sub-sequence set, in a data query method according to an embodiment of the present invention. The process includes:

401: Perform MBR (minimum bounding rectangle) segmentation on the dimension-reduced sub-sequence set. The MBR segmentation is implemented as follows:

An MBR is the smallest circumscribed rectangle that encloses a primitive and has sides parallel to the X and Y axes. The trajectory of the original time series in the feature space is divided by MBRs into multiple sub-trajectories, so that the number of disk accesses is minimized.

In the MBR indexing method, an R* tree is built. Each node of the R* tree (i.e. each MBR) stores the following data: the unique identification number of the time series; the starting offset position and the ending offset position of the MBR within the corresponding time series; and the minimum and maximum vertex coordinate values of the MBR.

402: Perform a matching query between the dimension-reduced sub-sequence to be queried and the dimension-reduced sub-sequence set after MBR segmentation. Specifically, the index file is searched for all MBRs that satisfy the following condition, and these form the candidate set: $w_q \subset \mathrm{MBR}$.
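Steps 401 and 402 can be sketched as below. The sketch replaces the R* tree with a plain list that is scanned linearly, and it cuts each feature-space trajectory into fixed-size chunks; both are simplifying assumptions, since the source does not state how the trajectory is split. The record fields mirror the data described for each node, and the names `MBRRecord`, `segment_into_mbrs` and `candidate_set` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class MBRRecord:
    series_id: int              # unique identification number of the time series
    start: int                  # starting offset within that time series
    end: int                    # ending offset within that time series
    low: Tuple[float, ...]      # minimum vertex of the rectangle
    high: Tuple[float, ...]     # maximum vertex of the rectangle


def segment_into_mbrs(
    series_id: int, points: Sequence[Sequence[float]], chunk: int
) -> List[MBRRecord]:
    """Step 401 (simplified): bound every run of `chunk` consecutive points."""
    records = []
    for start in range(0, len(points), chunk):
        block = points[start:start + chunk]
        low = tuple(min(axis) for axis in zip(*block))
        high = tuple(max(axis) for axis in zip(*block))
        records.append(MBRRecord(series_id, start, start + len(block) - 1, low, high))
    return records


def candidate_set(records: Sequence[MBRRecord], wq: Sequence[float]) -> List[MBRRecord]:
    """Step 402 (simplified): keep every MBR that contains the query point wq."""
    return [
        r for r in records
        if all(lo <= v <= hi for lo, v, hi in zip(r.low, wq, r.high))
    ]
```

A production index would descend the R* tree instead of scanning every record, which is what keeps the number of disk accesses low.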

FIG. 5 shows a specific implementation of the step of obtaining a subsequence matching the subsequence to be queried, in a data query method according to an embodiment of the present invention. The process includes:

501: Obtain the Euclidean distance between each sub-sequence in the dimension-reduced sub-sequence set and the dimension-reduced sub-sequence to be queried.

The specific process of obtaining the Euclidean distance is:

$\forall x \in X$, $\forall y \in X$, the distance used is the actual Euclidean distance between $x$ and $y$.

It should be noted that, when the dimension-reduced sub-sequence to be queried is matched against the sub-sequence set after MBR segmentation, Euclidean distances to the dimension-reduced sub-sequence to be queried are obtained only for the sub-sequences of the MBRs in the candidate set.

502: Obtain, according to the acquired Euclidean distance, a subsequence that matches the to-be-queried subsequence q.

It should be noted that the process can also include:

Obtaining a Euclidean distance threshold; for example, let the Euclidean distance threshold ε be 0.001.

Obtaining, according to the Euclidean distance threshold, a set of subsequences matching the subsequence to be queried. Specifically, the subsequences whose Euclidean distance is less than or equal to ε, that is, 0.001, are obtained. It should be noted that, when the acquired sub-sequence set corresponds to the MBRs in the candidate set, the distance is computed to each corresponding point in the phase space, $i = 1, 2, \ldots$; if that distance is less than or equal to ε, the sub-sequence is a similar subsequence.

It should also be noted that, if all subsequences satisfying the above condition are output as the result, the query process may be referred to as a PQ query; if only the subsequence with the smallest distance among the subsequences satisfying the condition is output as the result, the query process is a nearest-neighbor query.
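The two ways of selecting results described above (every candidate within the threshold ε, or only the closest candidate) can be sketched as follows. The function names are hypothetical, and the candidates are assumed to be equal-length vectors taken from the MBR candidate set.

```python
import math
from typing import List, Optional, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def within_threshold(
    query: Sequence[float], candidates: Sequence[Sequence[float]], eps: float
) -> List[Sequence[float]]:
    """Return every candidate whose distance to the query is at most eps."""
    return [c for c in candidates if euclidean(query, c) <= eps]


def closest(
    query: Sequence[float], candidates: Sequence[Sequence[float]]
) -> Optional[Sequence[float]]:
    """Return the single candidate nearest to the query, or None if there is none."""
    return min(candidates, key=lambda c: euclidean(query, c), default=None)
```

With the example threshold, `within_threshold(q, cands, 0.001)` would keep only candidates at distance 0.001 or less from `q`.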

As shown in FIG. 6, a data query device according to an embodiment of the present invention includes:

The information obtaining unit 601 is configured to obtain a sub-sequence to be queried and its corresponding time parameter;

The historical sub-sequence obtaining unit 602 is configured to obtain, according to the corresponding time parameter of the sub-sequence to be queried, a sub-sequence set of the corresponding time parameter from the historical data;

a sequence processing unit 603, configured to perform dimension reduction processing on the sub-sequence to be queried and on the sub-sequences in the acquired sub-sequence set;

The matching query unit 604 is configured to perform a matching query between the dimension-reduced sub-sequence to be queried and the sub-sequences in the dimension-reduced sub-sequence set;

The matching sequence obtaining unit 605 is configured to acquire a subsequence that matches the subsequence to be queried.

It is to be noted that the sequence processing unit 603 includes:

a to-be-queried subsequence processing subunit, configured to map the subsequence to be queried into its corresponding piecewise polynomial feature space;

And a historical subsequence processing subunit, configured to map the subsequences in the acquired subsequence set into their corresponding piecewise polynomial feature spaces.

It should be noted that the matching query unit 604 includes:

a segmentation subunit, configured to perform MBR segmentation on the dimension reduction processed subsequence set;

The matching query sub-unit is configured to perform a matching query between the dimension-reduced sub-sequence to be queried and the dimension-reduced sub-sequence set after the MBR segmentation.

It is also to be noted that the matching sequence obtaining unit 605 includes:

a distance obtaining sub-unit, configured to obtain the Euclidean distance between each sub-sequence in the dimension-reduced sub-sequence set and the dimension-reduced sub-sequence to be queried;

a matching sequence acquisition subunit, configured to obtain, according to the obtained Euclidean distance, a subsequence that matches the subsequence to be queried.

It should also be noted that the device also includes:

a threshold acquisition unit for obtaining an Euclidean distance threshold;

And a matching subsequence obtaining unit, configured to obtain, according to the Euclidean distance threshold, a subsequence set that matches the subsequence to be queried.
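The way units 601 to 605 hand data to one another can be pictured with the following sketch, in which each unit is a pluggable callable. The class name, field names and callable signatures are illustrative assumptions and do not describe the actual structure of the device.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


@dataclass
class DataQueryDevice:
    acquire: Callable[[], Tuple[Vector, Sequence[int]]]        # unit 601
    fetch_history: Callable[[Sequence[int]], List[Vector]]     # unit 602
    reduce_dim: Callable[[Vector], Vector]                     # unit 603
    match: Callable[[Vector, List[Vector]], List[Vector]]      # unit 604
    select: Callable[[Vector, List[Vector]], List[Vector]]     # unit 605

    def run(self) -> List[Vector]:
        """Chain the five units in the order of steps 101 to 105."""
        query, times = self.acquire()
        history = self.fetch_history(times)
        reduced_query = self.reduce_dim(query)
        reduced_history = [self.reduce_dim(h) for h in history]
        candidates = self.match(reduced_query, reduced_history)
        return self.select(reduced_query, candidates)
```

Passing the units in as callables keeps the sketch independent of any particular dimension-reduction or indexing choice.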

The data query method and device provided by the embodiment of the present invention obtain, according to the time parameter of the sub-sequence to be queried, the sub-sequence set of the corresponding time parameter from the historical data; perform dimensionality reduction on the sub-sequence to be queried and on the sub-sequences in the obtained sub-sequence set; perform a matching query between the dimension-reduced sub-sequence to be queried and the sub-sequences in the dimension-reduced sub-sequence set; and obtain a subsequence that matches the subsequence to be queried. Compared with the prior art, the embodiment of the present invention performs dimensionality reduction on both the sub-sequence to be queried and the sub-sequences in the historical sub-sequence set of the corresponding time parameter, so that the query time complexity of the whole system is reduced and the utilization of system resources is improved; in addition, the time series is represented by piecewise polynomials, which reduces the error in the query process.

From the description of the above embodiments, those skilled in the art can understand that all or part of the steps of the foregoing embodiments can be implemented by a program instructing related hardware. The program can be stored in a computer-readable storage medium, such as FLASH, ROM/RAM, a magnetic disk or an optical disk, and when executed it performs the steps of the foregoing method embodiments.

The above is only a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall be covered by the scope of protection of the present invention. Therefore, the scope of protection of the invention shall be determined by the scope of the claims.

Claims

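1. A data query method, comprising:

Obtaining a subsequence to be queried and its corresponding time parameter;

Obtaining, according to the corresponding time parameter of the subsequence to be queried, a subsequence set of the corresponding time parameter from the historical data;

Performing dimension reduction processing on the subsequence to be queried and on the subsequences in the obtained subsequence set;

Performing a matching query between the dimension-reduced subsequence to be queried and the subsequences in the dimension-reduced subsequence set;

Obtaining a subsequence that matches the subsequence to be queried.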

2. The data query method according to claim 1, wherein the step of performing dimension reduction processing on the subsequence to be queried and on the subsequences in the obtained subsequence set includes:

Mapping the to-be-queried subsequence into its corresponding piecewise polynomial feature space;

The subsequences in the acquired set of subsequences are respectively mapped into their corresponding piecewise polynomial feature spaces.

3. The data query method according to claim 1 or 2, wherein the step of performing a matching query between the dimension-reduced subsequence to be queried and the subsequences in the dimension-reduced subsequence set includes:

Performing minimum bounding rectangle segmentation on the dimension-reduced subsequence set;

Performing a matching query between the dimension-reduced subsequence to be queried and the dimension-reduced subsequence set after the minimum bounding rectangle segmentation.

4. The data query method according to claim 1 or 2, wherein the step of acquiring a subsequence matching the subsequence to be queried comprises:

Obtaining the Euclidean distance between each subsequence in the dimension-reduced subsequence set and the dimension-reduced subsequence to be queried;

Obtaining, according to the acquired Euclidean distance, a subsequence matching the subsequence to be queried.

5. The data query method according to claim 4, further comprising: obtaining a Euclidean distance threshold;

According to the Euclidean distance threshold, a subsequence set matching the subsequence to be queried is obtained.

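6. A data query device, comprising:

an information acquiring unit, configured to acquire a subsequence to be queried and its corresponding time parameter;

a history subsequence obtaining unit, configured to acquire, according to the corresponding time parameter of the subsequence to be queried, a subsequence set of the corresponding time parameter from the historical data;

a sequence processing unit, configured to perform dimension reduction processing on the subsequence to be queried and on the subsequences in the acquired subsequence set;

a matching query unit, configured to perform a matching query between the dimension-reduced subsequence to be queried and the subsequences in the dimension-reduced subsequence set;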

And a matching sequence obtaining unit, configured to acquire a subsequence that matches the subsequence to be queried.

7. The data query device according to claim 6, wherein the sequence processing unit comprises:

a to-be-queried subsequence processing subunit, configured to map the subsequence to be queried into its corresponding piecewise polynomial feature space;

The historical subsequence processing subunit is configured to map the subsequences in the acquired subsequence set into their corresponding piecewise polynomial feature spaces, respectively.

8. The data query device according to claim 6 or 7, wherein the matching query unit comprises:

a segmentation subunit, configured to perform minimum bounding rectangle segmentation on the dimension-reduced subsequence set;

and a matching query subunit, configured to perform a matching query between the dimension-reduced subsequence to be queried and the dimension-reduced subsequence set after the minimum bounding rectangle segmentation.

9. The data query device according to claim 6 or 7, wherein the matching sequence obtaining unit comprises:

a distance obtaining sub-unit, configured to obtain the Euclidean distance between each subsequence in the dimension-reduced subsequence set and the dimension-reduced subsequence to be queried;

And a matching sequence obtaining subunit, configured to obtain a subsequence matching the subsequence to be queried according to the acquired Euclidean distance.

10. The data query device according to claim 9, wherein the device further comprises: a threshold value obtaining unit, configured to acquire a Euclidean distance threshold;

And a matching subsequence obtaining unit, configured to obtain, according to the Euclidean distance threshold, a set of subsequences that match the subsequence to be queried.