CN112988797A - Space-time adjoint query method based on p-stable lsh - Google Patents

Publication number
CN112988797A
Authority
CN
China
Prior art keywords
data
digital signature
offline
hash values
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110292813.2A
Other languages
Chinese (zh)
Inventor
胡宇飞
陈成斌
叶智慧
苏胜林
马军亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongruixin Digital Technology Co ltd
Original Assignee
Zhongruixin Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongruixin Digital Technology Co ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2255 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/64 Protecting data integrity, e.g. using checksums, certificates or signatures

Abstract

The invention discloses a space-time adjoint query method based on p-stable lsh, which comprises the following steps: acquiring offline data of space-time accompanying object tracks, and preprocessing the offline data to generate multidimensional vector data; generating an offline digital signature of each object track from the multidimensional vector data based on a p-stable lsh algorithm; synchronizing the offline digital signature to an online big-data query database; and constructing a fast retrieval tool on the online big-data query database and querying similar target objects according to the digital signature. The whole track is hashed with the p-stable lsh algorithm, which greatly reduces the track dimension and converts complicated accompanying analysis into fingerprint (hash value) comparison, greatly improving query efficiency and query effect; meanwhile, reducing the longitude and latitude dimension with geohash normalizes longitude and latitude errors within a certain range. In addition, the combination of online and offline processing greatly improves the user experience.

Description

Space-time adjoint query method based on p-stable lsh
Technical Field
The invention relates to the technical field of data processing, in particular to a space-time adjoint query method based on p-stable lsh.
Background
A spatiotemporal trajectory is a recorded sequence of the positions and times of a moving object. As an important type of spatiotemporal object data, spatiotemporal trajectories are applied in many areas, such as human behavior analysis and traffic logistics. Analyzing spatiotemporal trajectory data reveals similarity and anomaly features in the data and helps to find meaningful trajectory patterns. The accompanying (adjoint) pattern is one such spatiotemporal trajectory pattern and has important applications in fields such as traffic management and resource allocation.
In the prior art, moving objects on space-time trajectories are generally queried either by using geohash codes to perform the accompanying analysis, or by dividing several adjacent base stations into a group, numbering the group, and using the group number to perform the accompanying analysis on base-station-based trajectories. The latter method is only suitable for base station tracks, places high demands on the base station grouping, and its final accompanying result depends directly on the grouping quality.
Moreover, the prior art only applies a hash transformation to the longitude and latitude in order to reduce their dimension. During space-time comparison, business logic such as statistics and collision still has to be implemented, which greatly reduces efficiency and effect; the low efficiency and poor effect are particularly evident with massive data. Therefore, a space-time adjoint query method that improves query efficiency and query effect is needed.
Disclosure of Invention
The invention provides a space-time adjoint query method based on p-stable lsh, which is used for solving the problems of low space-time adjoint query efficiency and poor effect in the prior art.
The invention provides a space-time adjoint query method based on p-stable lsh, which comprises the following steps:
acquiring offline data of a space-time accompanying object track;
preprocessing the offline data to generate multidimensional vector data; the preprocessing comprises generating, according to time nodes and based on a geohash algorithm, multidimensional vector data from the longitude and latitude of the position points of the object track;
generating the multidimensional vector data into an offline digital signature of an object track based on a p-stable lsh algorithm;
synchronizing the offline digital signature to an online big-data query database;
and constructing a fast retrieval tool on the online big-data query database, and querying similar target objects according to the digital signature.
Optionally, the preprocessing the offline data to generate multidimensional vector data includes:
coding all position points of the object track based on a geohash algorithm;
performing time slicing processing on a preset total time length according to a preset time segment length, and dividing the total time length into a plurality of time slices;
sorting the geohash codes of the position points in each time slice, and extracting the geohash code with the most occurrence times as the geohash code of the corresponding position point;
and extracting the longitude and latitude data coded by the geohash of the corresponding position points in all the time slices to form multi-dimensional vector data.
Optionally, the generating the offline digital signature of the object trajectory from the multidimensional vector data based on the p-stable lsh algorithm includes:
appending a hash value to each vector in the multi-dimensional vector data;
determining the size w of each bucket according to the hash collision probability formula:
$$p(c) = \int_0^w \frac{1}{c}\, f\!\left(\frac{t}{c}\right)\left(1 - \frac{t}{w}\right) dt$$
wherein w is the bucket size; c represents the Euclidean distance; the value range of t is determined according to the maximum and minimum longitude and latitude of the region concerned; p is generally set to 0.7 or above according to empirical values; and f is the Gaussian probability density function:
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
where μ is the mathematical expectation and σ is the standard deviation;
determining the number of buckets according to the bucket size and the total number of multidimensional vector data;
determining the number of hash values according to the number of buckets;
the determined number of hash values is set to the corresponding number of digital signatures.
Optionally, the setting the determined number of hash values to the corresponding number of digital signatures includes:
classifying and storing the hash values, wherein each class comprises a plurality of hash values;
merging the hash values in each class, and determining a merged hash value in each class;
and taking the merged hash value as a digital signature of a corresponding class.
Optionally, the merging the hash values in each class includes:
converting the hash values in each class into a group of binary data;
performing a further base conversion on the binary data to form a character string;
correspondingly, taking the merged hash value as a digital signature of a corresponding class, including:
and taking the character string as a digital signature of the corresponding class.
Optionally, the classifying and storing the hash values, where each class includes a plurality of hash values, includes:
storing the hash values according to long fields, wherein each long field comprises 8 hash values;
correspondingly, the merging the hash values in each class, and determining a merged hash value for each class, includes:
merging the hash values in each long field to form a character string corresponding to the long field;
the step of using the merged hash value as a digital signature of a corresponding class includes:
and setting the character string corresponding to each long type field as the digital signature corresponding to the long type field.
Optionally, the constructing a quick search tool includes:
confirming a digital signature of an object track to be inquired;
parsing the long-type field corresponding to the digital signature;
and sorting according to the number of identical parsed hash values.
Optionally, parsing the corresponding long-type field includes:
performing the corresponding base conversion on the character string stored in the long-type field;
converting the result of that base conversion into binary data;
converting the binary data into decimal to form the hash values.
Optionally, the synchronizing the offline digital signature to the online big-data query database includes:
synchronizing the offline digital signature into an MPP or Elasticsearch database; the MPP or Elasticsearch database is an online database.
Optionally, the preprocessing the offline data includes:
the offline data preprocessing is implemented on a Hadoop platform, and the data preprocessing is completed by MapReduce or Spark offline jobs.
The invention provides a space-time adjoint query method based on p-stable lsh. The scheme provided by the invention can be applied to space-time adjoint analysis in any scenario and is not limited to base station data. In addition, the whole track is hashed with the p-stable lsh algorithm, which greatly reduces the track dimension and converts complicated accompanying analysis into fingerprint (hash value) comparison, greatly improving query efficiency and query effect; meanwhile, reducing the longitude and latitude dimension with geohash normalizes longitude and latitude errors within a certain range. Furthermore, the embodiment adopts big data processing technology and can therefore handle very large data sets, and its combination of online and offline processing greatly improves the user experience.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flowchart of a space-time adjoint query method based on p-stable lsh according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a geohash algorithm in an embodiment of the present invention;
FIG. 3 is an exemplary diagram of a p-stable lsh in two dimensions in an embodiment of the present invention;
FIG. 4 is a diagram illustrating hash value merging according to an embodiment of the present invention;
FIG. 5 is a diagrammatic illustration of a query for similar target objects in an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
Example 1:
the embodiment of the invention provides a space-time adjoint query method based on p-stable lsh, fig. 1 is a flow chart of the space-time adjoint query method based on p-stable lsh in the embodiment of the invention, please refer to fig. 1, the method comprises the following steps:
step S101, obtaining off-line data of space-time accompanying object tracks;
step S102, preprocessing the offline data to generate multidimensional vector data; the preprocessing comprises generating, according to time nodes and based on a geohash algorithm, multidimensional vector data from the longitude and latitude of the position points of the object track;
step S103, generating the multidimensional vector data into an offline digital signature of an object track based on a p-stable lsh algorithm;
step S104, synchronizing the offline digital signature to an online big-data query database;
and step S105, constructing a fast retrieval tool on the online big-data query database, and querying similar target objects according to the digital signature.
The working principle of the technical scheme is as follows: in the scheme provided by this embodiment, the offline data of the space-time accompanying object tracks is preprocessed. The preprocessing generates, according to time nodes and based on the geohash algorithm, multidimensional vector data for the position points of the object track; the longitude and latitude of each position point serve as the parameter data, and the longitudes and latitudes of all position points of all object tracks are taken as the multidimensional vector data.
The geohash algorithm is a way of encoding longitude and latitude that reduces two dimensions to one. The basic principle is to bisect the longitude range and the latitude range repeatedly, using 0 and 1 to indicate which half the point falls in. After several bisections a sequence of 0s and 1s is formed, and this sequence is then encoded as a Base32 character string. The more bisections are performed, the longer the string and the more precise the representation of the geographical location. Referring to FIG. 2, FIG. 2 is a schematic diagram of the geohash algorithm according to an embodiment of the present invention.
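The bisection-and-Base32 principle described above can be illustrated with a minimal Python sketch. It follows the commonly used public geohash convention (bits alternate between longitude and latitude, starting with longitude, and five bits form one Base32 character); the function name `geohash_encode` and the default precision are illustrative assumptions, not part of the patent text.

```python
# Sketch of geohash encoding; names and the default precision are illustrative.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_encode(lat: float, lon: float, precision: int = 7) -> str:
    """Encode (lat, lon) as a geohash string of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(bits) < precision * 5:  # each Base32 character carries 5 bits
        bounds, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (bounds[0] + bounds[1]) / 2
        if value >= mid:
            bits.append(1)   # the point lies in the upper half
            bounds[0] = mid
        else:
            bits.append(0)   # the point lies in the lower half
            bounds[1] = mid
        use_lon = not use_lon
    chars = []
    for i in range(0, len(bits), 5):
        index = 0
        for bit in bits[i:i + 5]:
            index = (index << 1) | bit
        chars.append(BASE32[index])
    return "".join(chars)


print(geohash_encode(57.64911, 10.40744, precision=3))  # u4p
```

Nearby points share a common prefix, which is what allows small longitude and latitude differences between neighbouring position points to be normalized to the same code.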
Then, the digital signature of the offline object track is generated from the multidimensional vector data based on the p-stable lsh algorithm.
The p-stable lsh algorithm is locality-sensitive hashing based on p-stable distributions: a hash value is attached to each feature vector using the idea of the p-stable distribution, and if two vectors are similar, the probability that their hash values fall into the same bucket is very high. Referring to FIG. 3, FIG. 3 is an example diagram of p-stable lsh in two-dimensional space in the embodiment of the present invention. A vector can therefore be converted into hash values by the p-stable lsh algorithm; the subsequent merging of the hash values generates a character string, which is set as the digital signature of the target track.
Because the above steps are all performed offline, the data processing and analysis generally need to be completed on a Hadoop platform as MapReduce or Spark offline jobs, which are not well suited to online real-time interactive analysis. The digital signature obtained by offline processing therefore needs to be synchronized to an online database; specifically, the offline digital signature is synchronized to the online big-data query database, namely an MPP or Elasticsearch database, which is an online database.
After the digital signature has been uploaded to the online database, space-time accompanying object tracks can be queried online. Specifically, a fast retrieval tool is constructed on the synchronized data in the online big-data query database: for any target object, the digital signature of its track is looked up first, and that signature is then used to query similar target objects.
The beneficial effects of the above technical scheme are: the scheme provided by this embodiment can be applied to space-time adjoint analysis in any scenario and is not limited to base station data. In addition, the whole track is hashed with the p-stable lsh algorithm, which greatly reduces the track dimension and converts complicated accompanying analysis into fingerprint (hash value) comparison, greatly improving query efficiency and query effect; meanwhile, reducing the longitude and latitude dimension with geohash normalizes longitude and latitude errors within a certain range. Furthermore, the embodiment adopts big data processing technology and can therefore handle very large data sets, and its combination of online and offline processing greatly improves the user experience.
Example 2:
on the basis of embodiment 1, the preprocessing the offline data to generate multidimensional vector data includes:
coding all position points of the object track based on a geohash algorithm;
performing time slicing processing on a preset total time length according to a preset time segment length, and dividing the total time length into a plurality of time slices;
sorting the geohash codes of the position points in each time slice, and extracting the geohash code with the most occurrence times as the geohash code of the corresponding position point;
and extracting the longitude and latitude data coded by the geohash of the corresponding position points in all the time slices to form multi-dimensional vector data.
The working principle of the technical scheme is as follows: this embodiment provides a preferred scheme for preprocessing the offline data. The specific scheme is as follows:
Taking a base station track as an example, the base station track data provided by operators commonly suffers from base station drift, inconsistent base station coverage, signaling loss, inconsistent base station positions across operators, and the like. Therefore, all position points of all object tracks are obtained first and traversed, and the longitude and latitude of each position point are encoded with the geohash algorithm to form the geohash codes of the position points. This resolves the problem that base stations of different operators at adjacent positions report different longitudes and latitudes.
Secondly, the geohash codes of the position points in each time slice are sorted, and the geohash code with the most occurrences is extracted as the geohash code of the corresponding position point. For example, one day of track data sliced into 5-minute segments yields 288 time slices. Within each time slice, only the most frequent geohash code is retained. For a slice that contains no position point, the geohash code of the most recent time slice is used as its geohash code. In this way each time slice corresponds to the geohash code of exactly one position point, so the number of geohash-coded position points equals the number of time slices.
Finally, the longitude and latitude data of the geohash codes of the corresponding position points in all time slices are extracted to form the multidimensional vector data. For example, taking the center longitude and latitude of the geohash code of every position point, the vector for one day has 288 × 2 = 576 dimensions.
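The slicing, majority-vote and vector-building steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the points are assumed to be already geohash-encoded and given as `(seconds_since_midnight, geohash_code)` pairs, `cell_center` is a hypothetical callable returning the center `(lat, lon)` of a geohash cell, and an empty slice is filled from the most recent earlier non-empty slice (leading empty slices take the first non-empty one, which is an additional assumption).

```python
from collections import Counter


def build_trajectory_vector(points, cell_center, slice_seconds=300, total_seconds=86400):
    """Return [lat_0, lon_0, lat_1, lon_1, ...] with one (lat, lon) pair per time slice.

    points      -- iterable of (seconds_since_midnight, geohash_code)
    cell_center -- hypothetical callable: geohash_code -> (lat, lon) of the cell center
    """
    n_slices = total_seconds // slice_seconds          # 288 slices for 5-minute slices
    counts = [Counter() for _ in range(n_slices)]
    for t, code in points:
        idx = int(t) // slice_seconds
        if 0 <= idx < n_slices:
            counts[idx][code] += 1

    # Keep only the most frequent geohash code in each slice (None if the slice is empty).
    codes = [c.most_common(1)[0][0] if c else None for c in counts]

    first_known = next((c for c in codes if c is not None), None)
    if first_known is None:
        raise ValueError("trajectory contains no position points")

    # Fill empty slices from the most recent earlier slice (assumption: leading
    # empty slices reuse the first non-empty slice).
    filled, last = [], first_known
    for c in codes:
        if c is not None:
            last = c
        filled.append(last)

    vector = []
    for c in filled:
        lat, lon = cell_center(c)
        vector.extend((lat, lon))
    return vector


# Usage with a toy lookup table standing in for real geohash cell centers.
centers = {"cellA": (30.0, 120.0), "cellB": (30.1, 120.1)}
vec = build_trajectory_vector(
    [(10, "cellA"), (20, "cellA"), (30, "cellB"), (700, "cellB")],
    centers.__getitem__,
)
print(len(vec))  # 576
```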
The beneficial effects of the above technical scheme are: using geohash for longitude and latitude dimension reduction normalizes longitude and latitude errors within a certain range. Time is sliced into a plurality of time slices that serve as time nodes; the geohash codes of the position points in each time slice are extracted and only the most frequent one is retained as the geohash code of the corresponding position point; and all of these geohash codes together form the multidimensional vector data.
Example 3:
on the basis of embodiment 1, the generating of the digital signature of the offline object trajectory based on the p-stable lsh algorithm according to the multidimensional vector data includes:
appending a hash value to each vector in the multi-dimensional vector data;
determining the size w of each bucket according to the hash collision probability formula:
$$p(c) = \int_0^w \frac{1}{c}\, f\!\left(\frac{t}{c}\right)\left(1 - \frac{t}{w}\right) dt$$
wherein w is the bucket size; c represents the Euclidean distance; the value range of t is determined according to the maximum and minimum longitude and latitude of the region concerned; p is generally set to 0.7 or above according to empirical values; and f is the Gaussian probability density function:
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
where μ is the mathematical expectation and σ is the standard deviation;
determining the number of buckets according to the bucket size and the total number of multidimensional vector data;
determining the number of hash values according to the number of buckets;
the determined number of hash values is set to the corresponding number of digital signatures.
The working principle of the technical scheme is as follows: the scheme adopted by this embodiment attaches a hash value to each feature vector using the idea of p-stable distributions; if two vectors are similar, the probability that their hash values fall into the same bucket is very high. The bucket size used in hashing is determined on this principle, and the number of buckets and the number of hash values are then determined.
The above-described p-stable distribution and hash function will be described and explained in detail below.
The p-stable distribution is defined as follows: a distribution D over the real numbers R is called a p-stable distribution if there exists p ≥ 0 such that, for any n real numbers v1, …, vn and n independent, identically distributed random variables X1, …, Xn with distribution D, the random variable Σi vi·Xi has the same distribution as (Σi |vi|^p)^(1/p)·X, where X is a random variable with distribution D.
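As a worked instance of this definition, the short derivation below assumes the Gaussian case, in which the standard normal distribution is 2-stable. It is not spelled out in the original text; the random vector a with independent standard normal entries and the two trajectory vectors u and v are illustrative symbols. The second line shows why the difference between two projections scales with the Euclidean distance between the vectors, which is the quantity c used in the collision probability.

```latex
% Assumption: X_1, ..., X_n (and the entries of a) are i.i.d. standard normal, D = N(0, 1).
\begin{align*}
\sum_i v_i X_i &\sim \mathcal{N}\Big(0, \sum_i v_i^2\Big)
  \;\overset{d}{=}\; \Big(\sum_i |v_i|^2\Big)^{1/2} X = \lVert v \rVert_2 \, X,
  \qquad X \sim \mathcal{N}(0, 1), \\
a \cdot u - a \cdot v &= a \cdot (u - v) \;\overset{d}{=}\; \lVert u - v \rVert_2 \, X .
\end{align*}
```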
The expression of the hash function is as follows:
$$h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{w} \right\rfloor$$
where v is a high-dimensional vector, a is a Gaussian random vector, b is a random number, and w is the bucket size.
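A minimal Python sketch of such a hash family is given below, assuming the standard p-stable LSH form h(v) = floor((a·v + b) / w). The text only says that a is a Gaussian random vector and b is a random number; drawing the entries of a from a standard normal distribution and b uniformly from [0, w) are assumptions, as are the function names. The sketch does not fold the hash values into a fixed number of buckets.

```python
import math
import random


def make_hash_functions(dim, num_hashes, w, seed=0):
    """Draw `num_hashes` pairs (a, b): a is a Gaussian random vector, b a random offset."""
    rng = random.Random(seed)
    funcs = []
    for _ in range(num_hashes):
        a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # assumption: standard normal entries
        b = rng.uniform(0.0, w)                        # assumption: b uniform in [0, w)
        funcs.append((a, b))
    return funcs


def lsh_signature(v, funcs, w):
    """Apply every hash function to v: h(v) = floor((a . v + b) / w)."""
    signature = []
    for a, b in funcs:
        dot = sum(ai * vi for ai, vi in zip(a, v))
        signature.append(math.floor((dot + b) / w))
    return signature


# Usage: two nearby vectors should agree on most hash values, a distant one on few.
funcs = make_hash_functions(dim=4, num_hashes=64, w=4.0, seed=42)
base = lsh_signature([30.0, 120.0, 30.1, 120.1], funcs, 4.0)
near = lsh_signature([30.0, 120.0, 30.1, 120.11], funcs, 4.0)
print(sum(x == y for x, y in zip(base, near)), "of", len(base), "hash values agree")
```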
With the p-stable distribution and the hash function introduced, a hash value is attached to each feature vector according to the p-stable idea; if two vectors are similar, the probability that their hash values fall into the same bucket is very high. The bucket size is then determined.
The bucket size is determined as follows:
first, the hash collision probability is calculated as follows:
$$p(c) = \int_0^w \frac{1}{c}\, f\!\left(\frac{t}{c}\right)\left(1 - \frac{t}{w}\right) dt$$
wherein f is the probability density function of the Gaussian distribution, whose expression is as follows:
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
wherein c represents the Euclidean distance; the value range of t is determined according to the maximum and minimum longitude and latitude of the region concerned; and p is generally set to 0.7 or above according to empirical values. Owing to the particular nature of base station track data, too large a collision probability leads to overfitting, and too small a collision probability leads to underfitting.
Thus, the value of the bucket size w can be determined from the above formula. The more hash functions are used, the more computation and storage are needed, but the more accurately track similarity is measured; the number should be chosen by weighing these factors against the actual situation.
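One way to determine w numerically is sketched below, assuming the collision probability has the standard p-stable LSH form p(c) = ∫₀ʷ (1/c)·f(t/c)·(1 − t/w) dt. The sketch takes μ = 0 and σ = 1 and, as a further assumption not stated in the text, uses the density of the absolute value |X| (twice the Gaussian density for t ≥ 0), which is the form used in the standard analysis; with the plain Gaussian density the integral cannot exceed 0.5, so a target of 0.7 would be unreachable. Function names and the search bracket are illustrative.

```python
import math


def half_normal_pdf(x):
    """Density of |X| for X ~ N(0, 1), defined for x >= 0 (assumption, see lead-in)."""
    return 2.0 * math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def collision_probability(w, c, steps=2000):
    """Midpoint-rule estimate of the integral of (1/c) f(t/c) (1 - t/w) over [0, w]."""
    h = w / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += half_normal_pdf(t / c) / c * (1.0 - t / w)
    return total * h


def smallest_bucket_size(c, target_p=0.7, iters=50):
    """Bisection on w: the collision probability grows monotonically with w / c."""
    lo, hi = c / 100.0, c * 100.0  # illustrative bracket around the expected solution
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if collision_probability(mid, c) >= target_p:
            hi = mid
        else:
            lo = mid
    return hi


# Usage: bucket size at which two tracks at Euclidean distance c = 1 collide
# with probability of at least 0.7 under the assumptions above.
w = smallest_bucket_size(c=1.0, target_p=0.7)
print(round(w, 3), round(collision_probability(w, 1.0), 3))
```

Because the probability depends only on the ratio w / c, the resulting bucket size scales linearly with the distance c that should still count as similar.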
And finally, determining the number of sub-buckets according to the determined size of the sub-buckets and the total number of the multi-dimensional vector data, further determining the number of hash values according to the determined number of the sub-buckets, and then setting the hash values with the determined number as the digital signatures with the corresponding number.
The beneficial effects of the above technical scheme are: by adopting the scheme provided by the embodiment, the hash transformation is carried out on the whole track by adopting the p-stable lsh algorithm, so that the track dimension is greatly reduced, the complicated accompanying analysis is converted into the fingerprint (hash value) comparison, and the query efficiency and the query effect are greatly improved.
Example 4:
on the basis of embodiment 3, the setting of the determined number of hash values to the corresponding number of digital signatures includes:
classifying and storing the hash values, wherein each class comprises a plurality of hash values;
merging the hash values in each class, and determining a merged hash value in each class;
and taking the merged hash value as a digital signature of a corresponding class.
The working principle and the beneficial effects of the technical scheme are as follows: the scheme adopted by the embodiment is that in order to reduce the number of digital signatures to the greatest extent, the hash values are classified and stored, the hash values are merged, and the merged hash values are used as new digital signatures, so that the number of digital signatures can be greatly reduced, the query time is reduced, and the query efficiency is improved.
Example 5:
on the basis of the embodiment 4, the merging the hash values in each class includes:
converting the hash values in each class into a group of binary data;
performing a further base conversion on the binary data to form a character string;
correspondingly, taking the merged hash value as a digital signature of a corresponding class, including:
and taking the character string as a digital signature of the corresponding class.
The working principle of the technical scheme is as follows: the scheme adopted by the embodiment is the forming process and the forming principle of the digital signature. The specific implementation mode is as follows:
FIG. 4 is a schematic diagram of hash value merging according to an embodiment of the present invention. Referring to FIG. 4, the hash values are first converted into a group of binary data; the binary data then undergoes a further base conversion to form a character string, which is set as the digital signature of the corresponding class. In this scheme, a plurality of hash values are merged and the character string formed after merging is set as the digital signature, which preserves the representativeness of the digital signature while reducing the number of digital signatures, thereby reducing computation and query costs.
The beneficial effects of the above technical scheme are: the scheme provided by the embodiment is suitable for calculation and query of the relevant data of the massive object tracks, so that the retrieval efficiency and the storage efficiency of the digital signature determined by the scheme of the embodiment are greatly improved.
Example 6:
on the basis of embodiment 4, classifying and storing the hash values, where each class includes a plurality of hash values, includes:
storing the hash values according to long fields, wherein each long field comprises 8 hash values;
correspondingly, the merging the hash values in each class, and determining a merged hash value for each class, includes:
merging the hash values in each long field to form a character string corresponding to the long field;
the step of using the merged hash value as a digital signature of a corresponding class includes:
and setting the character string corresponding to each long type field as the digital signature corresponding to the long type field.
The working principle of the technical scheme is as follows: the scheme adopted by this embodiment is to store hash values according to the long fields, each long field includes 8 hash values, the hash values in each long field are combined to form a string corresponding to the long field, and the string corresponding to each long field is set as a digital signature corresponding to the long field.
For example, this embodiment uses 64 hash functions and 128 buckets. Normally, the digital signatures generated by the hash functions already meet retrieval requirements. For massive numbers of tracks, however, retrieving and storing 64 digital signatures per track is a heavy burden. Therefore, in this embodiment, every 8 digital signatures are stored in one long-type field, with every 8 bits of the long-type field representing one signature, so that only 8 digital signatures are ultimately retained.
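The packing of 8 hash values into one 64-bit field can be sketched with bit operations as below. The byte order (first hash value in the most significant 8 bits) and the function names are assumptions; the text only states that every 8 bits of a long-type field represent one signature.

```python
def pack_signature(hash_values, per_field=8, bits=8):
    """Pack hash values into integers, `per_field` values of `bits` bits per field."""
    if len(hash_values) % per_field != 0:
        raise ValueError("number of hash values must be a multiple of per_field")
    fields = []
    for i in range(0, len(hash_values), per_field):
        field = 0
        for h in hash_values[i:i + per_field]:
            if not 0 <= h < (1 << bits):
                raise ValueError("hash value does not fit into the slot")
            field = (field << bits) | h   # assumption: first value ends up most significant
        fields.append(field)
    return fields


def unpack_field(field, per_field=8, bits=8):
    """Recover the hash values stored in one packed field."""
    mask = (1 << bits) - 1
    return [(field >> (bits * (per_field - 1 - k))) & mask for k in range(per_field)]


packed = pack_signature([1, 19, 88, 76, 8, 55, 44, 68])
print(format(packed[0], "016x"))   # 0113584c08372c44
print(unpack_field(packed[0]))     # [1, 19, 88, 76, 8, 55, 44, 68]
```

With 128 buckets every hash value is below 128, so the top bit of each 8-bit slot is 0 and the packed value also fits into a signed 64-bit long.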
The beneficial effects of the above technical scheme are: the scheme provided by the embodiment is suitable for calculation and query of the relevant data of the massive object tracks, so that the retrieval efficiency and the storage efficiency of the digital signature determined by the scheme of the embodiment are greatly improved.
Example 7:
on the basis of embodiment 6, the building of the quick retrieval tool comprises:
determining the digital signature of the object track to be queried;
parsing the long-type field corresponding to the digital signature;
and sorting according to the number of identical parsed hash values.
The working principle of the technical scheme is as follows: the scheme adopted by the embodiment is a process of determining the similar target object based on the digital signature of the track of the object to be queried, and by setting the quick retrieval tool, the query efficiency can be improved, and the user experience is greatly improved.
For example, a fast retrieval tool can be constructed on the synchronized data. For any target, the digital signature of its track is looked up first, and that signature is then used to query similar targets. Because the hash values are merged into long-type fields when the digital signature is set, every 8 hash values are strongly bound together. For example, the two groups of hash values 1,19,88,76,8,55,44,68 and 1,19,88,76,8,55,44,69 differ in only one value, yet their merged digital signatures are unequal. Therefore, when ranking by similarity, the long-type fields must first be decoded, and the ranking is then performed according to the number of identical decoded hash values. Fig. 5 shows a query for similar target objects in an embodiment of the present invention; the result shown in Fig. 5 further illustrates the above scheme.
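By way of non-limiting illustration, the decode-then-rank step may be sketched as follows. The sketch assumes that each long-type field is a non-negative 64-bit integer holding 8 hash values in big-endian order and that hash values are compared position by position; these choices and all function names are illustrative assumptions.

```python
def pack(values):
    """Pack 8 hash values (each 0..255) into one 64-bit integer."""
    if len(values) != 8:
        raise ValueError("expected exactly 8 hash values")
    return int.from_bytes(bytes(values), "big")


def unpack(field):
    """Decode a 64-bit long-type field back into its 8 hash values."""
    return list(field.to_bytes(8, "big"))


def shared_hashes(fields_a, fields_b):
    """Count positions at which the decoded hash values are identical."""
    return sum(
        x == y
        for fa, fb in zip(fields_a, fields_b)
        for x, y in zip(unpack(fa), unpack(fb))
    )


def rank_similar(query_fields, candidates):
    """Return candidate ids sorted by number of identical hash values."""
    scored = [(shared_hashes(query_fields, fields), cid)
              for cid, fields in candidates.items()]
    return [cid for score, cid in sorted(scored, key=lambda s: -s[0])]


a = [pack([1, 19, 88, 76, 8, 55, 44, 68])]
b = [pack([1, 19, 88, 76, 8, 55, 44, 69])]
c = [pack([2, 20, 88, 70, 9, 50, 40, 60])]
print(a == b)                             # False: the merged fields differ
print(shared_hashes(a, b))                # 7: seven of eight values agree
print(rank_similar(a, {"b": b, "c": c}))  # ['b', 'c']
```

Comparing the packed fields directly would treat `a` and `b` as unrelated, whereas comparing the decoded values recovers the fact that they agree in seven of eight positions.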
The beneficial effects of the above technical scheme are: by adopting the scheme provided by the embodiment, the hash value is merged and coded, the storage space is greatly reduced, and the retrieval efficiency can be greatly improved.
Example 8:
on the basis of embodiment 7, the parsing the corresponding long type field includes:
performing the corresponding base conversion on the character string stored in the long-type field;
converting the base-converted character string into binary data;
and converting the binary data to decimal to form the hash values.
The working principle and the beneficial effects of the technical scheme are as follows: the scheme of this embodiment corresponds to the hash value merging in embodiment 6. It is the process of parsing a long-type field, that is, the inverse of the base conversion performed in embodiment 6. Combined with the hash value merging and encoding scheme, the scheme of this embodiment greatly reduces storage space and improves retrieval efficiency.
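By way of non-limiting illustration, the parsing steps of this embodiment may be sketched as follows. The sketch assumes that the stored character string is hexadecimal and that each hash value occupies 8 bits; both are illustrative assumptions, as is the function name `parse_signature`.

```python
def parse_signature(signature, bits=8):
    """Recover the individual hash values from a merged signature string.

    Assumes a hexadecimal signature whose bit length is a multiple of
    `bits`; both assumptions are made for illustration only.
    """
    total_bits = len(signature) * 4
    if total_bits % bits != 0:
        raise ValueError("signature length does not match the bit width")
    # Step 1: base-convert the stored string to an integer.
    number = int(signature, 16)
    # Step 2: render the integer as fixed-width binary data.
    binary = format(number, f"0{total_bits}b")
    # Step 3: convert each group of bits to decimal to obtain a hash value.
    return [int(binary[i:i + bits], 2) for i in range(0, total_bits, bits)]


print(parse_signature("0113584c08372c44"))
# [1, 19, 88, 76, 8, 55, 44, 68]
```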
Example 9:
on the basis of embodiment 1, the synchronizing the offline digital signature into the online data query big database includes:
synchronizing the offline digital signature into an MPP or Elasticsearch database; the MPP or Elasticsearch database is an online database.
The working principle and the beneficial effects of the technical scheme are as follows: this embodiment mainly targets massive track data with a daily data scale in the billions. Processing and analyzing such data requires a Hadoop platform. However, offline jobs based on MapReduce or Spark are not well suited to online, real-time interactive analysis. The digital signatures generated offline therefore need to be synchronized to an online database such as an MPP database or Elasticsearch. Combining online and offline processing in this way greatly improves the user experience.
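By way of non-limiting illustration, preparing offline signatures for synchronization to Elasticsearch may be sketched as follows. The sketch only builds a request body in the newline-delimited JSON format accepted by the Elasticsearch bulk API and does not send it; the index name `trajectory_signatures` and the document fields `object_id` and `signature` are hypothetical and not taken from this embodiment.

```python
import json


def build_bulk_payload(signatures, index="trajectory_signatures"):
    """Build a newline-delimited JSON body for an Elasticsearch bulk request.

    `signatures` maps an object id to its list of long-type fields.
    The index name and field names are illustrative assumptions.
    """
    lines = []
    for object_id, fields in signatures.items():
        # Action line: index the document under the object's id.
        lines.append(json.dumps({"index": {"_index": index, "_id": object_id}}))
        # Source line: the document itself.
        lines.append(json.dumps({"object_id": object_id, "signature": fields}))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


payload = build_bulk_payload({"obj-1": [77486823936371780]})
print(payload)
```

Batching documents in this way keeps the offline job decoupled from the online store: the Hadoop-side job only has to emit the payload, and any HTTP client can submit it.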
Example 10:
on the basis of embodiment 1, the preprocessing the offline data includes:
the offline data preprocessing is implemented on a Hadoop platform, and the data preprocessing is completed by offline jobs based on MapReduce or Spark.
The working principle and the beneficial effects of the technical scheme are as follows: this embodiment mainly targets massive track data with a daily data scale in the billions. Processing and analyzing such data requires a Hadoop platform. However, offline jobs based on MapReduce or Spark are not well suited to online, real-time interactive analysis. The digital signatures generated offline therefore need to be synchronized to an online database such as an MPP database or Elasticsearch. Combining online and offline processing in this way greatly improves the user experience.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. The space-time adjoint query method based on the p-stable lsh is characterized by comprising the following steps:
acquiring offline data of a space-time accompanying object track;
preprocessing the offline data to generate multidimensional vector data; the preprocessing comprises generating, based on a geohash algorithm and according to time nodes, multidimensional vector data from the longitude and latitude of the position points of the object track;
generating the multidimensional vector data into an offline digital signature of an object track based on a p-stable lsh algorithm;
synchronizing the offline digital signature to an online data query big database;
and constructing a quick retrieval tool based on the online data query big database, and querying similar target objects according to the digital signature.
2. The p-stable lsh based spatio-temporal adjoint query method according to claim 1, wherein the preprocessing the offline data to generate multidimensional vector data comprises:
coding all position points of the object track based on a geohash algorithm;
performing time slicing processing on a preset total time length according to a preset time segment length, and dividing the total time length into a plurality of time slices;
sorting the geohash codes of the position points in each time slice, and extracting the most frequently occurring geohash code as the geohash code of the position point for that time slice;
and extracting the longitude and latitude data corresponding to the geohash codes of the position points in all time slices to form the multidimensional vector data.
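By way of non-limiting illustration, the preprocessing steps of claim 2 may be sketched as follows. The sketch uses the standard geohash base-32 encoding; the geohash precision, the use of the cell centre as the extracted longitude and latitude, the zero placeholder for empty time slices, and all function names are illustrative assumptions that are not specified in the claim.

```python
from collections import Counter

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a standard geohash string."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    chars, value, nbits, use_lon = [], 0, 0, True
    while len(chars) < precision:
        rng, coord = (lon_rng, lon) if use_lon else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value, rng[0] = (value << 1) | 1, mid
        else:
            value, rng[1] = value << 1, mid
        use_lon = not use_lon
        nbits += 1
        if nbits == 5:  # every 5 bits yield one base-32 character
            chars.append(BASE32[value])
            value, nbits = 0, 0
    return "".join(chars)


def geohash_decode(code):
    """Return the (lat, lon) centre of the cell named by a geohash."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    use_lon = True
    for ch in code:
        value = BASE32.index(ch)
        for shift in range(4, -1, -1):
            rng = lon_rng if use_lon else lat_rng
            mid = (rng[0] + rng[1]) / 2
            if (value >> shift) & 1:
                rng[0] = mid
            else:
                rng[1] = mid
            use_lon = not use_lon
    return sum(lat_rng) / 2, sum(lon_rng) / 2


def trajectory_vector(points, start, total, slice_len, precision=6):
    """Build a vector of (lat, lon) pairs, one pair per time slice.

    `points` is an iterable of (timestamp, lat, lon) tuples.
    """
    n_slices = total // slice_len
    counters = [Counter() for _ in range(n_slices)]
    for ts, lat, lon in points:
        idx = int((ts - start) // slice_len)
        if 0 <= idx < n_slices:
            counters[idx][geohash_encode(lat, lon, precision)] += 1
    vector = []
    for counter in counters:
        if counter:
            # Keep the most frequent geohash code of the slice.
            lat, lon = geohash_decode(counter.most_common(1)[0][0])
        else:
            lat, lon = 0.0, 0.0  # assumption: placeholder for empty slices
        vector.extend([lat, lon])
    return vector


pts = [(0, 57.64911, 10.40744), (10, 57.64911, 10.40744),
       (20, 40.0, 116.0), (70, 40.0, 116.0)]
print(trajectory_vector(pts, start=0, total=120, slice_len=60))
```

Taking the most frequent cell per slice makes the vector robust to short detours within a slice, at the cost of discarding the exact positions inside the cell.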
3. The p-stable lsh based spatio-temporal adjoint query method of claim 1, wherein the generating the multidimensional vector data into the digital signature of the offline object trajectory based on the p-stable lsh algorithm comprises:
appending a hash value to each vector in the multi-dimensional vector data;
determining the size w of each bucket according to a hash collision probability formula;
$$p(c) = \int_{0}^{w} \frac{1}{c}\, f\!\left(\frac{t}{c}\right)\left(1 - \frac{t}{w}\right)\mathrm{d}t$$
wherein w is the bucket size; c is the Euclidean distance; the value range of t is determined from the maximum and minimum longitude and latitude values of the region concerned; p is generally taken to be above 0.7 based on experience; and f is the Gaussian probability density function, given by:
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)$$
where μ is the mathematical expectation and σ is the standard deviation;
determining the number of buckets according to the bucket size and the total number of multidimensional vector data;
determining the number of hash values according to the number of buckets;
and setting the determined number of hash values as the corresponding number of digital signatures.
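By way of non-limiting illustration, choosing a bucket size from a collision probability target and then hashing a vector may be sketched as follows. The sketch follows the commonly published p-stable LSH construction for the Euclidean case, h(v) = floor((a·v + b) / w) with Gaussian a and uniform b, and evaluates the collision probability p(c) = ∫₀ʷ (1/c) f(t/c) (1 − t/w) dt numerically with f taken as the density of the absolute value of a standard normal variable. The use of that density, the folding of hash values into 128 buckets by modulo, and all function names are illustrative assumptions rather than details confirmed by the claim.

```python
import math
import random


def collision_probability(c, w, steps=2000):
    """Numerically integrate p(c) over t in [0, w] with the midpoint rule."""
    def f(x):
        # Density of |N(0, 1)| (assumption: standard normal, mu=0, sigma=1).
        return math.sqrt(2.0 / math.pi) * math.exp(-x * x / 2.0)

    dt = w / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt
        total += f(t / c) / c * (1.0 - t / w)
    return total * dt


def choose_bucket_width(c, target=0.7, tol=1e-3):
    """Find the smallest w (within tol) whose collision probability
    at distance c reaches the target."""
    lo, hi = 0.0, c
    while collision_probability(c, hi) < target:
        lo, hi = hi, hi * 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if collision_probability(c, mid) < target:
            lo = mid
        else:
            hi = mid
    return hi


def make_hash_family(dim, w, count=64, seed=0):
    """Draw `count` (a, b) pairs: a is Gaussian, b is uniform in [0, w)."""
    rng = random.Random(seed)
    return [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
            for _ in range(count)]


def signature(vector, family, w, n_buckets=128):
    """Compute one hash value per (a, b) pair, folded into n_buckets."""
    return [
        math.floor((sum(ai * vi for ai, vi in zip(a, vector)) + b) / w)
        % n_buckets  # assumption: modulo folding into a fixed bucket count
        for a, b in family
    ]


w = choose_bucket_width(c=1.0, target=0.7)
family = make_hash_family(dim=4, w=w)
print(round(w, 2), signature([57.65, 10.41, 40.0, 116.0], family, w)[:8])
```

A larger w raises the collision probability for nearby vectors but also for distant ones, so the 0.7 target acts as a trade-off between recall and selectivity.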
4. The p-stable lsh based spatio-temporal adjoint query method of claim 3, wherein said setting a certain number of hash values as a corresponding number of digital signatures comprises:
classifying and storing the hash values, wherein each class comprises a plurality of hash values;
merging the hash values in each class, and determining a merged hash value in each class;
and taking the merged hash value as a digital signature of a corresponding class.
5. The p-stable lsh-based spatio-temporal adjoint query method according to claim 4, wherein said merging the hash values in each class comprises:
converting the hash values in each class into a group of binary data;
base-converting the binary data again to form a character string;
correspondingly, taking the merged hash value as a digital signature of a corresponding class, including:
and taking the character string as a digital signature of the corresponding class.
6. The p-stable lsh-based spatio-temporal adjoint query method according to claim 4, wherein the classifying and storing the hash values, each class containing a plurality of hash values, comprises:
storing the hash values according to long fields, wherein each long field comprises 8 hash values;
correspondingly, the merging the hash values in each class, and determining a merged hash value for each class, includes:
merging the hash values in each long field to form a character string corresponding to the long field;
the step of using the merged hash value as a digital signature of a corresponding class includes:
and setting the character string corresponding to each long type field as the digital signature corresponding to the long type field.
7. The p-stable lsh based spatio-temporal adjoint query method according to claim 6, wherein said constructing a fast search tool comprises:
determining the digital signature of the object track to be queried;
parsing the long-type field corresponding to the digital signature;
and sorting according to the number of identical parsed hash values.
8. The p-stable lsh based spatio-temporal adjoint query method of claim 7, wherein said parsing the corresponding long type field comprises:
performing the corresponding base conversion on the character string stored in the long-type field;
converting the base-converted character string into binary data;
and converting the binary data to decimal to form the hash values.
9. The p-stable lsh based spatio-temporal adjoint query method according to claim 1, wherein the synchronizing the offline digital signature into the online data query big database comprises:
synchronizing the offline digital signature into an MPP or Elasticsearch database; the MPP or Elasticsearch database is an online database.
10. The p-stable lsh based spatio-temporal adjoint query method according to claim 1, wherein the preprocessing the offline data comprises:
the offline data preprocessing is implemented on a Hadoop platform, and the data preprocessing is completed by offline jobs based on MapReduce or Spark.
CN202110292813.2A 2021-03-18 2021-03-18 Space-time adjoint query method based on p-stable lsh Pending CN112988797A (en)

Publications (1)

Publication Number Publication Date
CN112988797A true CN112988797A (en) 2021-06-18

Family

ID=76332701

Country Status (1)

Country Link
CN (1) CN112988797A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210618
