CN113010525B - Ocean space-time big data parallel KNN query processing method based on PID - Google Patents


Info

Publication number
CN113010525B
Authority
CN
China
Prior art keywords
query
data
knn
grid
pid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110354512.8A
Other languages
Chinese (zh)
Other versions
CN113010525A (en)
Inventor
乔百友
马玲
郝元卿
胡兵
孙永佼
吴刚
韩东红
Original Assignee
东北大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 东北大学
Priority to CN202110354512.8A
Publication of CN113010525A
Application granted
Publication of CN113010525B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147Distances to closest patterns, e.g. nearest neighbour classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention provides a PID-based parallel KNN query processing method for ocean space-time big data, and relates to the technical field of space-time big data management. The method introduces, for the first time, the PID controller technology widely used in industry into KNN query processing, and realizes variable-step-length search based on a feedback mechanism. First, the acquired ocean data are preprocessed and partitioned by a grid partitioning method, and on this basis the preprocessed ocean data are indexed with a grid index; each grid cell is encoded in row order; the row-ordered grid index is used to determine which rows and columns lie within the radius of the query circle, and thus to determine directly whether a cell intersects the circle; during KNN query processing, the adjustability of the PID system is used to adjust the search range dynamically through negative feedback, so that the query radius is predicted dynamically, the number of KNN query rounds is reduced, and KNN query processing is accelerated.

Description

Ocean space-time big data parallel KNN query processing method based on PID
Technical Field
The invention relates to the technical field of space-time big data management, in particular to a marine space-time big data parallel KNN query processing method based on PID.
Background
Since the beginning of the 21st century, information technology has developed rapidly, as have ocean observation technologies such as ocean remote sensing and ocean buoys. The scale of ocean data has grown explosively, and the ocean field has entered the big data era. Given a spatial data set and a query point, a KNN query returns the k results nearest to the query point that satisfy the query condition. As a very important spatial query operation, the KNN query is widely used in spatial application systems, and has important applications in ocean application systems such as ocean detection, ocean rescue, and ocean information privacy protection. How to perform efficient KNN query processing on such ocean big data is a challenging problem and one of the research hotspots in the current spatial database field. Traditional KNN query processing methods generally adopt a centralized data processing mode and are not suitable for ocean big data. Existing KNN query processing algorithms for distributed environments, such as ParallelCirculorTrip, are mostly based on the MapReduce framework; MapReduce is a disk-based processing framework, and its efficiency for iterative processing is low. Meanwhile, the existing algorithms generally adopt index structures such as grid indexes and R-tree indexes, whose indexing efficiency is low. During KNN query processing, the query radius grows by a fixed step length, which leads to too many query rounds and too much subsequent computation, and reduces query processing efficiency. It is therefore necessary to design a parallel KNN query processing method for ocean big data that builds on Spark, the current efficient in-memory computing framework.
Disclosure of Invention
To address the defects of the prior art, the invention provides a PID-based parallel KNN query processing method for ocean space-time big data. The method builds on Spark, an in-memory big data parallel processing framework, introduces the PID controller technology into the calculation of the space-time KNN query step length, and realizes a variable-step-length search based on a feedback mechanism, thereby improving on the traditional query processing mode with a fixed query radius step length.
To solve the above technical problems, the invention adopts the following technical scheme: a PID-based parallel KNN query processing method for ocean space-time big data, comprising the following steps:
step 1: preprocessing ocean big data; the acquired ocean data is cleaned, including data deduplication, exception handling and filling of missing values; the ocean data set to be processed is read from the HDFS, and is converted into RDD in the memory by using the CreateRDD method of the Spark platform, and the following data preprocessing is performed in the process:
step 1-1: data deduplication; performing repeatability check and de-duplication treatment on the acquired ocean data to ensure that no repeated data exists;
step 1-2: exception handling; performing consistency checks and error detection on the de-duplicated ocean data, correcting inconsistent and abnormal data, and deleting data whose proportion of anomalies is larger than a set threshold γ as well as data that cannot be corrected;
step 1-3: processing a missing value; performing data interpolation processing on the missing of a single data item or a plurality of discontinuous data items, and performing missing value filling on the missing of a plurality of continuous data items by adopting an LSTM network;
step 2: data partitioning and grid index construction; the preprocessed ocean big data are partitioned by a grid partitioning method: the whole data space is divided into grid cells of equal size, the grid cells are encoded by a row-ordering method, each data object is projected into the corresponding grid cell according to its spatial position, and indexes are created to form an index data set; the specific method is as follows:
step 2-1: dividing the two-dimensional ocean data space into square grid cells of equal size by a grid partitioning method; the side length of each grid cell is L, and each grid cell is represented by a pair (m, n) denoting the cell in the m-th row and n-th column, where m and n are the coordinates of the grid cell in the longitude and latitude directions respectively; the code of the grid cell is denoted c(m, n); each grid cell is encoded in row order, and the code c(m, n) is calculated as follows:
c(m,n)=n*Row_num+m (1)
where Row_num is the number of grid cells per column;
step 2-2: for any ocean data object P_i, its spatial position is represented by a pair (D_i1, D_i2), where D_i1 and D_i2 are the longitude and latitude dimensions of P_i respectively; the data object P_i is assigned to the grid cell c(m, n) in which its spatial position lies, with m and n determined from D_i1, D_i2 and L, where L is the length of the grid cell; if s data objects are mapped into the same grid cell c(m, n), they are represented by the set P(m, n) = {P_1, P_2, P_3, …, P_s};
step 2-3: establishing the grid index; each grid cell together with the data objects projected into it forms an index item, and each index item is organized as a (Key, Value) pair, where Key is the grid cell code and Value is the set of all data objects mapped to that grid cell; for grid cell c(m, n) and the set of data objects {P_1, P_2, …, P_s} assigned to it, the corresponding index item (c(m, n), {P_1, P_2, …, P_s}) is formed;
step 2-4: on the basis of the grid index, a data item Num is added to each grid index item to record the number of data objects assigned to the grid cell, so that the index item becomes (c(m, n), {Num, P_1, P_2, …, P_s}) with Num = s; all index items form the grid index, which is stored in memory in RDD form as the grid index data set;
step 3: PID-based KNN query processing; PID-based KNN query processing is performed on the partitioned data, a local KNN query result is generated on each data partition, all local KNN query results are merged and sorted, and the first K results are returned as the final KNN query result; specifically: first, the query point information given by the user is read and an RDD query set is created; then the PID-based KNN query algorithm is executed on the partitioned index data set to obtain a local KNN query result set on each data partition; the local KNN query results of all data partitions are merged and sorted to form a candidate query result set, and the first K results are selected as the final KNN query result and stored in the HDFS file system; the specific steps are as follows:
step 3-1: initial KNN query processing; performing primary KNN query processing according to the KNN query requirement; firstly, according to a grid-based data partitioning method, calculating a grid unit where a query point q is located according to the position of the query point q, setting an initial value of a query radius r, and carrying out KNN query in a circular area formed by taking the q point as a circle center and r as a radius;
firstly, determining a candidate query partition set in a circular area formed by a query point q and a query radius r, then calculating the distance from each data point in each candidate query partition to the query point q, if K data points closest to the q point are found in the candidate query partition set, finishing the query, returning a result and storing the result in an HDFS; otherwise, executing the step 3-2, calculating the increment step length of the query radius by adopting a PID-based query radius calculation method, and forming a new query radius;
the specific method for determining the candidate query partition set is as follows: whether the data partition represented by a grid cell is to be queried is determined from the query point q and the query radius r; in the KNN query process, the data partitions that need to be queried within the circular area of radius r, namely the candidate query partitions, are determined first; the distance between the query point q and each data point in the candidate query partitions is then calculated, thereby determining the K nearest points; specifically: for any partition represented by a grid cell C, mindist(C, q) denotes the minimum distance from the query point q to the grid cell C, and maxdist(C, q) denotes the maximum distance from the query point q to the grid cell C; if the condition mindist(C, q) < r < maxdist(C, q) is satisfied, the data partition represented by the grid cell C is within the query range of the query point q, is put into the candidate query partition set, and KNN query processing is executed on it; otherwise, the grid cell C is not within the query range of the query point q, and the partition it represents does not enter the candidate query partition set;
step 3-2: calculating the increment step length of the query radius by adopting a PID-based query radius calculation method, and forming a new query radius;
step 3-2-1: a target value for the query radius is input to the PID system, and the PID system calculates a predicted value u(t) of the query radius r according to the PID calculation formula; the difference e(t) between the target value and the predicted value is then fed back into the PID system, and u(t) and e(t) are calculated again; this calculation is iterated until the PID system stabilizes, that is, until there is no error between the output predicted value and the input target value, or the result falls within the error range preset for the PID system; the PID calculation formula is:
$$u(t)=K_p\,e(t)+K_i\int_0^t e(\tau)\,d\tau+K_d\,\frac{de(t)}{dt}$$
where e(t) is the difference between the predicted value output by the PID system and the target value input to the PID system, t is time, K_p is the value of the P parameter, K_i is the value of the I parameter, and K_d is the value of the D parameter;
replacing the derivative in the PID system with a difference gives the following PID calculation formula:
$$u(k)=K_p\left[e(k)+\frac{T}{T_i}\sum_{j=0}^{k}e(j)+\frac{T_d}{T}\bigl(e(k)-e(k-1)\bigr)\right]$$
where u(k) is the predicted query radius, k is the sample number, T_i is the integral constant, T_d is the differential constant, T is the sampling period, e(j) is the accumulated error, and e(k) is the difference between the set value and the actual value of the query result; the set value of the query result is the target number of results K of the KNN query, and the actual value is the number of results obtained by an actual KNN search with the query radius u(k);
to keep the parameters consistent with those of the analog PID system, the above formula is rewritten as:
$$u(k)=K_p\,e(k)+K_i\sum_{j=0}^{k}e(j)+K_d\bigl(e(k)-e(k-1)\bigr)$$
step 3-2-2: setting PID system parameters to obtain a new query radius;
setting P, I, D parameters of a PID system through actual KNN query experiment tests, and further calculating a new query radius according to a given K value;
step 3-3: continuing KNN query by using the new query radius; after calculating a new query radius by using a PID system, executing new KNN queries on each data partition in parallel according to the new query radius, merging local KNN query results on each partition, calculating the overall KNN query result number, ending the query and saving the query result into the HDFS if the query ending condition is met, otherwise, jumping to step 3-2 to continue execution.
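The merge described in step 3-3 can be illustrated with a minimal single-process Python sketch. It is not the Spark implementation of the method: the helper names `local_knn` and `merge_knn` are hypothetical, partitions are plain lists of (longitude, latitude) tuples, and plain Euclidean distance is assumed.

```python
import heapq
import math

def local_knn(partition, q, k):
    """Return up to k (distance, point) pairs of one partition, nearest first."""
    return heapq.nsmallest(k, ((math.dist(p, q), p) for p in partition))

def merge_knn(local_results, k):
    """Merge the local result lists and keep the k globally nearest pairs."""
    return heapq.nsmallest(k, (item for part in local_results for item in part))

# Two partitions, query point at the origin, K = 2.
partitions = [[(0.0, 1.0), (0.0, 5.0)], [(2.0, 0.0), (0.0, 3.0)]]
q = (0.0, 0.0)
local_results = [local_knn(part, q, 2) for part in partitions]
merged = merge_knn(local_results, 2)
print([p for _, p in merged])
```

Keeping only K entries per partition bounds the data that has to be merged, since no partition can contribute more than K points to the final answer.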
The beneficial effects of the above technical scheme are as follows. The PID-based parallel KNN query processing method for ocean space-time big data builds on Spark, an advanced in-memory big data parallel processing framework, and for the first time introduces the PID controller, the automatic control algorithm most widely used in industry, into the calculation of the space-time KNN query step length; it realizes a variable-step-length search based on a feedback mechanism and improves on the traditional query processing mode with a fixed query radius step length. The data scanning time is reduced to a certain extent, so that fast KNN query processing is achieved, and the problem that existing KNN query processing methods cannot meet the KNN query requirements of ocean big data is effectively addressed. Ocean data have pronounced space-time characteristics and every record carries longitude and latitude information, so the data space is partitioned by longitude and latitude and a grid index is built over the ocean data, which accelerates query processing. On top of the original grid index, a data item is added to each grid index item to record the number of data objects assigned to the grid cell, which avoids unnecessary computation. Each grid cell is encoded in row order; with this encoding, the row and column of a data object are obtained simply by taking the remainder and the quotient of the code. With the row-ordered grid index, the method can easily determine which rows and columns lie within the radius of the query circle, and thus determine directly whether a cell intersects the circle, which is simpler and more efficient.
During KNN query processing, the adjustability of the PID system is used to adjust the search range dynamically through negative feedback, so that the step length of the query radius is predicted dynamically and the number of KNN query rounds is reduced, which achieves fast query processing.
Drawings
FIG. 1 is a flow chart of a PID-based ocean space-time big data parallel KNN query processing method provided by an embodiment of the invention;
FIG. 2 is a general framework diagram of a PID-based ocean space-time big data parallel KNN query processing method provided by an embodiment of the invention;
FIG. 3 is a schematic diagram of a grid index structure according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a PID system according to an embodiment of the invention;
FIG. 5 is a graph showing the comparison of execution times of two query methods for different grid numbers according to an embodiment of the present invention;
FIG. 6 is a graph showing the comparison of execution time of two query methods for different data amounts according to an embodiment of the present invention;
FIG. 7 is a graph showing the comparison of execution times of two query methods with different parallelism according to an embodiment of the present invention;
FIG. 8 is a graph showing the comparison of execution times of two query methods for different K values according to an embodiment of the present invention.
Detailed Description
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.
In this embodiment, a Spark cluster of 5 IBM X3650 M4 servers is built as the test and running environment of the method; one server serves as the Master node and the other servers serve as Worker nodes. All nodes have the same memory, network card, hard disk, and CPU configuration, as shown in Table 1.
Table 1 Server configuration
Configuration    Specification
CPU              Intel(R) Xeon(R) CPU 2.00GHz
Memory           32GB DDR RAM
Hard disk        3.5-inch 7200rpm 2TB
Network card     1Gb/s adaptive Ethernet card
In this embodiment, IntelliJ IDEA is used as the development environment and Scala as the programming language for designing and developing the method. The software environment in which the method runs comprises the operating system CentOS 6.4 and a running environment consisting of Scala, Java, Hadoop, and the Spark cluster system. The specific software environment is shown in Table 2.
Table 2 software environment
In this embodiment, a method for processing parallel KNN query of marine space-time big data based on PID, as shown in fig. 1, includes the following steps:
step 1: preprocessing ocean big data; the acquired ocean data is cleaned, including data deduplication, exception handling and filling of missing values; the ocean data set to be processed is read from the HDFS and converted into an RDD (Resilient Distributed Dataset) in memory by the CreateRDD method of the Spark platform, and the following data preprocessing is performed in the process:
step 1-1: data deduplication; performing repeatability check and de-duplication treatment on the acquired ocean data to ensure that no repeated data exists;
step 1-2: exception handling; performing consistency checks and error detection on the de-duplicated ocean data, correcting inconsistent and abnormal data, and deleting data whose proportion of anomalies is larger than a set threshold γ as well as data that cannot be corrected;
step 1-3: processing a missing value; performing data interpolation processing on the missing of a single data item or a plurality of discontinuous data items, and performing missing value filling on the missing of a plurality of continuous data items by adopting an LSTM network;
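Steps 1-1 and 1-3 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, records are tuples, isolated gaps are filled by averaging the two neighbours (one possible form of data interpolation), and runs of consecutive missing values are left untouched because the text assigns those to an LSTM network, which is not shown here.

```python
def deduplicate(records):
    """Drop exact duplicate records while keeping first-seen order."""
    seen = set()
    unique = []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            unique.append(rec)
    return unique

def interpolate_isolated(values):
    """Fill a missing value (None) only when both direct neighbours are present.

    Longer gaps stay None; they would be handled by a separate model.
    """
    filled = list(values)
    for i in range(1, len(values) - 1):
        if values[i] is None and values[i - 1] is not None and values[i + 1] is not None:
            filled[i] = (values[i - 1] + values[i + 1]) / 2.0
    return filled

# (longitude, latitude, measurement) records with one exact duplicate.
records = [(120.5, 30.1, 18.2), (120.5, 30.1, 18.2), (121.0, 30.4, 17.9)]
print(deduplicate(records))
print(interpolate_isolated([1.0, None, 3.0, None, None, 6.0]))
```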
step 2: data partitioning and grid index construction; the ocean data are partitioned and the corresponding index is created. The preprocessed ocean big data are partitioned by a grid partitioning method: the whole data space is divided into grid cells of equal size, the grid cells are encoded by a row-ordering method, each data object is projected into the corresponding grid cell according to its spatial position, and indexes are created to form an index data set; the specific method is as follows:
step 2-1: dividing the two-dimensional ocean data space into square grid cells of equal size by a grid partitioning method; the side length of each grid cell is L, and each grid cell is represented by a pair (m, n) denoting the cell in the m-th row and n-th column, where m and n are the coordinates of the grid cell in the longitude and latitude directions respectively; the code of the grid cell is denoted c(m, n); each grid cell is encoded in row order, and the code c(m, n) is calculated as follows:
c(m,n)=n*Row_num+m (1)
where Row_num is the number of grid cells per column; with this encoding, the row and column of a data object are obtained simply by taking the remainder and the quotient of the code, and the method can easily determine which rows and columns lie within the query radius, which facilitates KNN query processing.
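Equation (1) and its inverse can be checked with a short Python sketch. The function names are hypothetical, and zero-based indices m and n are assumed so that the remainder and quotient recover them exactly.

```python
def encode_cell(m, n, row_num):
    """Row-ordered cell code from equation (1): c(m, n) = n * Row_num + m."""
    return n * row_num + m

def decode_cell(code, row_num):
    """Recover (m, n) from a cell code: n is the quotient, m the remainder."""
    n, m = divmod(code, row_num)
    return m, n

code = encode_cell(2, 3, 10)
print(code, decode_cell(code, 10))
```

Because decoding needs only one integer division, the rows and columns covered by a query circle can be enumerated without consulting any auxiliary structure.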
step 2-2: for any ocean data object P_i, its spatial position is represented by a pair (D_i1, D_i2), where D_i1 and D_i2 are the longitude and latitude dimensions of P_i respectively; the data object P_i is assigned to the grid cell c(m, n) in which its spatial position lies, with m and n determined from D_i1, D_i2 and L, where L is the length of the grid cell; if s data objects are mapped into the same grid cell c(m, n), they are represented by the set P(m, n) = {P_1, P_2, P_3, …, P_s};
step 2-3: establishing the grid index; each grid cell together with the data objects projected into it forms an index item, and each index item is organized as a (Key, Value) pair, where Key is the grid cell code and Value is the set of all data objects mapped to that grid cell; for grid cell c(m, n) and the set of data objects {P_1, P_2, …, P_s} assigned to it, the corresponding index item (c(m, n), {P_1, P_2, …, P_s}) is formed;
step 2-4: on the basis of the grid index, a data item Num is added to each grid index item to record the number of data objects assigned to the grid cell, so that the index item becomes (c(m, n), {Num, P_1, P_2, …, P_s}) with Num = s; during a KNN query, comparing the value of this data item determines whether the data under a grid cell need to be traversed at all, which avoids unnecessary computation. All index items form the grid index, which is stored in memory in RDD form as the grid index data set;
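Steps 2-2 to 2-4 can be sketched as an in-memory dictionary in Python. The cell assignment used here, floor division of the offset from an origin by the cell length L, is an assumption made for the example, since the text does not spell the formula out; `cell_of` and `build_grid_index` are hypothetical names, and an ordinary dict stands in for the RDD.

```python
import math
from collections import defaultdict

def cell_of(lon, lat, cell_len, origin=(0.0, 0.0)):
    """Assumed mapping: cell coordinates by floor division of the offset by L."""
    m = math.floor((lon - origin[0]) / cell_len)
    n = math.floor((lat - origin[1]) / cell_len)
    return m, n

def build_grid_index(points, cell_len, row_num):
    """Build {cell code: (Num, [objects])}, Num being the object count of the cell."""
    buckets = defaultdict(list)
    for lon, lat in points:
        m, n = cell_of(lon, lat, cell_len)
        buckets[n * row_num + m].append((lon, lat))  # code from equation (1)
    return {code: (len(objs), objs) for code, objs in buckets.items()}

points = [(0.5, 0.5), (0.7, 0.2), (1.5, 0.5), (0.5, 2.5)]
index = build_grid_index(points, cell_len=1.0, row_num=4)
for code in sorted(index):
    print(code, index[code])
```

Storing Num next to the objects lets a query skip or short-circuit a cell by looking at one integer instead of scanning its contents.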
step 3: PID-based KNN query processing; PID-based KNN query processing is performed on the partitioned data, a local KNN query result is generated on each data partition, all local KNN query results are merged and sorted, and the first K results are returned as the final KNN query result; specifically: first, the query point information given by the user is read and an RDD query set is created; then the PID-based KNN query algorithm (the PIDKNN algorithm) is executed on the partitioned index data set RDD a to obtain a local KNN query result set RDD b on each data partition; the local KNN query results of all data partitions are merged and sorted to form a candidate query result set RDD c, and the first K results are selected as the final KNN query result and stored in the HDFS file system; the most critical part of parallel KNN query processing is therefore the PID-based KNN query processing algorithm executed on the data partitions, which comprises the following steps:
step 3-1: initial KNN query processing; performing the first round of KNN query processing according to the KNN query requirement; first, according to the grid-based data partitioning method, the grid cell containing the query point q is calculated from the position of q, an initial value of the query radius r is set (typically the grid cell length L, i.e. r = L), and a KNN query is carried out in the circular area centered at q with radius r;
firstly, determining a candidate query partition set in a circular area formed by a query point q and a query radius r, then calculating the distance from each data point in each candidate query partition to the query point q, if K data points closest to the q point are found in the candidate query partition set, finishing the query, returning a result and storing the result in an HDFS; otherwise, executing the step 3-2, calculating the increment step length of the query radius by adopting a PID-based query radius calculation method, and forming a new query radius;
the specific method for determining the candidate query partition set is as follows: whether the data partition represented by a grid cell is to be queried is determined from the query point q and the query radius r; in the KNN query process, the data partitions (namely grid cells) that need to be queried within the circular area of radius r, namely the candidate query partitions, are determined first; the distance between the query point q and each data point in the candidate query partitions is then calculated, thereby determining the K nearest points; specifically: for any partition represented by a grid cell C, mindist(C, q) denotes the minimum distance from the query point q to the grid cell C, and maxdist(C, q) denotes the maximum distance from the query point q to the grid cell C; if the condition mindist(C, q) < r < maxdist(C, q) is satisfied, the data partition represented by the grid cell C is within the query range of the query point q, is put into the candidate query partition set, and KNN query processing is executed on it; otherwise, the grid cell C is not within the query range of the query point q, and the partition it represents does not enter the candidate query partition set;
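The two distances used in this test can be computed for axis-aligned cells as in the Python sketch below. The representation of a cell as (x_min, y_min, x_max, y_max) and the function names are assumptions of the example. The filter shown keeps every cell that overlaps the query circle, that is, it applies only the mindist(C, q) <= r part, which is a simplification of the condition stated in the text.

```python
import math

def mindist(cell, q):
    """Smallest distance from point q to the rectangle cell (0 if q is inside)."""
    x_min, y_min, x_max, y_max = cell
    dx = max(x_min - q[0], 0.0, q[0] - x_max)
    dy = max(y_min - q[1], 0.0, q[1] - y_max)
    return math.hypot(dx, dy)

def maxdist(cell, q):
    """Largest distance from point q to the rectangle cell (its farthest corner)."""
    x_min, y_min, x_max, y_max = cell
    dx = max(abs(q[0] - x_min), abs(q[0] - x_max))
    dy = max(abs(q[1] - y_min), abs(q[1] - y_max))
    return math.hypot(dx, dy)

def candidate_cells(cells, q, r):
    """Simplified filter: keep the cells whose nearest point lies within radius r."""
    return [c for c in cells if mindist(c, q) <= r]

cells = [(0.0, 0.0, 1.0, 1.0), (3.0, 3.0, 4.0, 4.0)]
q = (2.0, 0.5)
print(mindist(cells[0], q), maxdist(cells[0], q))
print(candidate_cells(cells, q, 1.5))
```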
step 3-2: calculating the increment step length of the query radius by adopting a PID-based query radius calculation method, and forming a new query radius;
after the initial KNN inquiry, when the number of inquiry results is less than K, the radius of a circle is required to be enlarged for next data searching; for query radius calculation, a PID algorithm is adopted for prediction, the PID algorithm can return a proper query radius value according to the error of the output value and the set value, and the searching range can be dynamically adjusted through negative feedback adjustment, so that efficient KNN query is realized.
Step 3-2-1: inputting a target value of the query radius to the PID system, which calculates a predicted value u(t) of the query radius r according to the PID calculation formula; the difference e(t) between the target value and the predicted value is then fed back to the PID system, and u(t) and e(t) are calculated again; this calculation is iterated until the PID system stabilizes, that is, until there is no error between the output predicted value and the input target value, or the result falls within the error range preset by the PID system; the PID calculation formula is:
$$u(t) = K_p e(t) + K_i \int_0^t e(\tau)\,d\tau + K_d \frac{de(t)}{dt}$$
wherein e(t) is the difference between the predicted value output by the PID system and the target value input to the PID system, t is time, K_p is the value of the P parameter, K_i is the value of the I parameter, and K_d is the value of the D parameter;
the integrating module and the differentiating module of the analog PID system are approximated numerically, which converts the analog PID control algorithm into a digital PID control algorithm; according to the principles of calculus, as long as the sampling period is small enough, the integral part of the PID system can be replaced by a sum and the differential part by a difference, which gives the following PID calculation formula:
$$u(k) = K_p\left[e(k) + \frac{T}{T_i}\sum_{j=0}^{k} e(j) + \frac{T_d}{T}\bigl(e(k) - e(k-1)\bigr)\right]$$
where u(k) is the predicted query radius, k is the sampling number (k = 0, 1, 2, …), T_i is the integral constant, T_d is the differential constant, T is the sampling period, the sum over e(j) is the accumulated error, and e(k) is the difference between the set value and the actual value of the query result; the set value of the query result is the target number of results K of the KNN query, and the actual value is the number of results obtained by the actual KNN search with the query radius u(k);
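in order to keep the parameters consistent with those in the PID simulation system, the above formula is rewritten with K_i = K_p T / T_i and K_d = K_p T_d / T, giving the following formula:
$$u(k) = K_p e(k) + K_i \sum_{j=0}^{k} e(j) + K_d \bigl(e(k) - e(k-1)\bigr)$$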
step 3-2-2: setting PID system parameters to obtain a new query radius;
setting P, I, D parameters of a PID system through actual KNN query experiment tests, and further calculating a new query radius according to a given K value;
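As a minimal sketch of the digital PID calculation described above, the class below computes u(k) from the error e(k) = K − (number of results found), the accumulated error and the error difference. The class name, method name and the gain values used in the example are hypothetical; the text states only that the P, I and D parameters are set through experimental tests. Whether u(k) is used directly as the new radius or added to the current radius as an increment is left to the caller.

```python
class PidRadius:
    """Discrete PID: u(k) = Kp*e(k) + Ki*sum(e(j)) + Kd*(e(k) - e(k-1))."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.err_sum = 0.0   # accumulated error, sum of e(j) for j <= k
        self.prev_err = 0.0  # e(k-1), taken as 0 before the first sample

    def step(self, target_k, found):
        """Return u(k) for one feedback round.

        target_k: set value, the number of neighbours K requested.
        found:    actual value, the number of results the last search returned.
        """
        err = target_k - found
        self.err_sum += err
        delta = err - self.prev_err
        self.prev_err = err
        return self.kp * err + self.ki * self.err_sum + self.kd * delta


# Hypothetical gains; in practice they would be tuned on real KNN query runs.
pid = PidRadius(kp=0.5, ki=0.1, kd=0.25)
u_first = pid.step(target_k=10, found=4)   # large shortfall, large output
u_second = pid.step(target_k=10, found=8)  # smaller shortfall, smaller output
```

The output shrinks as the number of results approaches K, which is the negative-feedback behaviour the method relies on.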
step 3-3: continuing KNN query by using the new query radius; after calculating a new query radius by using a PID system, executing new KNN queries on each data partition in parallel according to the new query radius, merging local KNN query results on each partition, calculating the overall KNN query result number, ending the query and saving the query result into the HDFS if the query ending condition is met, otherwise, jumping to step 3-2 to continue execution.
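The loop formed by steps 3-1 to 3-3 can be sketched in plain Python as follows. This is an in-memory illustration under stated assumptions, not the Spark implementation: partitions are ordinary lists of (x, y) points, the per-partition step runs sequentially rather than in parallel, and `next_radius` is a hypothetical callback standing in for the PID-based radius calculation. The query ends once K points lie within the current radius, since every point inside the circle has then been examined.

```python
import heapq
import math

def local_knn(points, q, r, k):
    """Up to k nearest points of one partition that lie within radius r of q."""
    scored = ((math.dist(p, q), p) for p in points)
    return heapq.nsmallest(k, (dp for dp in scored if dp[0] <= r))

def knn_query(partitions, q, k, r0, next_radius, max_rounds=50):
    """Grow the radius until k neighbours are found; return (points, radius)."""
    r = r0
    for _ in range(max_rounds):
        # Local KNN on every partition (done in parallel on a real cluster).
        local = [local_knn(part, q, r, k) for part in partitions]
        # Merge the sorted local results and keep the k closest overall.
        merged = heapq.nsmallest(k, heapq.merge(*local))
        if len(merged) >= k:
            return [p for _, p in merged], r
        # Placeholder hook for the PID step: (radius, K, results found) -> new radius.
        r = next_radius(r, k, len(merged))
    raise RuntimeError("fewer than k points found within max_rounds")
```

A call such as `knn_query([[(0, 0), (1, 0)], [(3, 0), (10, 0)]], (0, 0), 3, 0.5, lambda r, k, found: r * 2)` uses simple doubling in place of the PID update, purely to keep the example short.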
In this embodiment, the method is studied experimentally and compared on several groups of synthetic data. The spatial distribution range of the synthetic data is 10k, and the spatial positions of the data objects and of the query points are generated randomly according to a Gaussian distribution. Five spatial data sets are generated in the experiment, containing 500,000, 1 million, 2 million, 5 million and 10 million data objects respectively. The number of Spark tasks in the experiment ranges from 6 to 72 and the K value ranges from 50 to 10000; the specific experimental data parameters are shown in Table 3.
TABLE 3 Experimental data parameters
The invention discloses a PID-based ocean space-time big data parallel KNN query processing method; the overall framework is shown in Fig. 2 and mainly comprises three parts: ocean big data preprocessing, data partitioning and grid index construction, and parallel KNN query processing.
Grid index structure: the grid index structure of the method is shown in Fig. 3; each index item consists of two parts, Key and Value. Key is the grid cell code, and Value consists of the number of data points in the grid cell and the data points themselves.
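A minimal sketch of such a grid index is given below, assuming plain Python dictionaries in place of a Spark RDD. The cell code follows the row-ordered formula c(m, n) = n*Row_num + m stated in the claims; mapping a coordinate to its cell by floor division with the cell length L, and the use of non-negative coordinates measured from the grid origin, are assumptions made for illustration.

```python
import math
from collections import defaultdict

def cell_code(m, n, row_num):
    """Row-ordered grid cell code c(m, n) = n * Row_num + m."""
    return n * row_num + m

def build_grid_index(points, cell_len, row_num):
    """Map each cell code (Key) to (Num, [data points]) (Value)."""
    cells = defaultdict(list)
    for x, y in points:
        m = math.floor(x / cell_len)  # assumed mapping of the first coordinate
        n = math.floor(y / cell_len)  # assumed mapping of the second coordinate
        cells[cell_code(m, n, row_num)].append((x, y))
    # Store the point count next to the points, as in the index item layout.
    return {code: (len(pts), pts) for code, pts in cells.items()}
```

With `cell_len=1` and `row_num=4`, the points `(0.5, 0.5)` and `(0.7, 0.9)` fall into the cell with code 0, while `(0.1, 1.2)` falls into the cell with code 4.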
PID system structure: the PID system structure is shown in Fig. 4 and mainly comprises three parts, a proportional module, an integration module and a differentiation module, with three parameters: the P parameter (deviation proportional coefficient), the I parameter (deviation integral coefficient) and the D parameter (deviation derivative coefficient), which together regulate the output of the PID system.
This example compares the method of the invention with the ParallelCircularTrip method:
in this embodiment, the query data set contains 100 query points, that is, 100 KNN queries are executed and their average is taken as the final experimental result. Based on the experimental environment and the experimental data sets, the performance of the PID-based KNN query processing method of the invention is tested and analyzed from the following aspects and compared with the ParallelCircularTrip method.
(1) Different grid cell numbers:
for a data set of fixed size, the number of grid cells directly determines the number of data objects in each cell, so the choice of grid number has a significant impact on the performance of the algorithm. Fig. 5 shows how the performance of the method varies with the number of grids when the data size is 2 million, the K value is 10000, and the Spark parallelism is 12. The experimental results in Fig. 5 show that the execution time of the method of the invention first decreases and then gradually increases as the number of grids increases. As the number of grids increases from 100 to 400, the query time decreases: with more grids, each grid holds fewer data points, so the set of grids involved in a query matches the query region more closely, which generally reduces the number of data points involved and therefore the amount and time of computation. As the number of grids continues to increase, the running time begins to rise again, because each grid of the grid index holds less data, so more search rounds are needed to satisfy the query condition, which in turn increases the search time. In addition, the time required for inter-node communication and data sorting in the distributed system becomes longer and longer.
It can also be seen from Fig. 5 that both methods take longer on the non-uniform data set than on the uniform data set. Because the data distribution is uneven, the number of data objects differs greatly between grids, and so does the amount of data assigned to each node; some nodes finish their computation while others are still computing, which slows down the overall computation. The uneven distribution also increases the number of KNN query rounds, which further increases the overall running time of the algorithm on the non-uniform data set. In the tests on both the uniformly and the non-uniformly distributed data sets, the execution time of the method of the invention is clearly shorter than that of the ParallelCircularTrip method, so the PID-based KNN query processing method has better KNN query processing efficiency.
(2) Different numbers of data objects:
for a fixed number of grid cells, the number of data objects (the data volume) in the data set directly determines the number of data objects in each grid cell, so it has a significant impact on the performance of the algorithm. Fig. 6 shows how the performance of the method varies with the data volume when the K value is set to 10000, the Spark parallelism is 24 and the number of grids is 6400. As can be seen from Fig. 6, on both uniform and non-uniform data sets the execution time of the method increases with the data volume, as expected, but the query method of the invention performs better than the conventional method. When the data volume increases from 1 million to 2 million, the calculation time of the method does not increase very quickly: the number of KNN query rounds differs little between the two data volumes, so the query time differs little, although the index construction time grows with the data volume. With non-uniform distribution and a data volume of 1 million, the method queries faster than with uniform distribution, because the queried data lies in a local area where the data is denser. In general, the running time on a non-uniformly distributed data set is longer than on a uniformly distributed one: the uneven distribution gives each node a different amount of data, so the nodes finish at different times and the overall computation is slower, and several query rounds may be needed to obtain a result that meets the requirement. The method of the invention outperforms the ParallelCircularTrip method at all data sizes tested.
(3) Under different parallelism:
parallelism, i.e. the number of Spark tasks executed in parallel, has a direct impact on the performance of the method. This embodiment tests how the performance of the PIDKNN query method changes under different degrees of parallelism. Here the data volume is 2 million, the K value is set to 10000, and the number of grids is 1600. The parallelism in the experiment is the number of executors. Fig. 7 shows how the running times of the ParallelCircularTrip method and the PIDKNN method of the invention vary with the parallelism. As the figure shows, the running times of both the query method of the invention and the conventional query method gradually decrease as the parallelism increases from 6 to 24, but they do not decrease linearly with the linear increase in parallelism, because communication and memory in distributed computation limit the computation speed as the Spark parallelism increases. When the parallelism reaches 48, the running time is almost the same as at 24, because inter-node communication and memory limits in the cluster have become the bottleneck for the method. When the parallelism reaches 72, the data is over-partitioned, so data processing queues up and the communication cost between nodes keeps growing, which increases the running time at a parallelism of 72. As shown in Fig. 7, in tests on both uniform and non-uniform data sets at different degrees of parallelism, the PID-based KNN query processing method of the invention performs better than the ParallelCircularTrip method.
(4) At different K values:
different K values directly affect the execution efficiency of the method, so this embodiment also tests the influence of the K value on the performance of the PIDKNN query algorithm. Fig. 8 shows how the execution time of the method varies with the K value when the data volume is 2 million, the number of grids is 1600, and the parallelism is 24. As can be seen from Fig. 8, for K values of 50, 100, 500 and 1000 the execution time of both methods on uniform and non-uniform data increases as the K value increases. This is because a larger K value means more data has to be found and more computation is needed, so the execution time of both algorithms increases. For K values of 1000 and below, the number of grid cells involved in the query is small and few query rounds are needed, so the execution time is short. Fig. 8 also shows that, as the K value increases, the efficiency advantage of the PIDKNN query method of the invention over the conventional KNN query processing method becomes increasingly significant on both uniform and non-uniform data sets; this benefits from introducing the PID control system to optimize the calculation of the query radius increment, and thus the PID-based KNN query processing method of the invention is more efficient.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions, which are defined by the scope of the appended claims.

Claims (3)

1. A PID-based ocean space-time big data parallel KNN query processing method is characterized in that: the method comprises the following steps:
step 1: preprocessing ocean big data; the acquired ocean data is cleaned, including data deduplication, exception handling and filling of missing values;
step 2: data partitioning and grid index construction; partitioning the preprocessed marine big data by a grid partitioning method: the whole data space is divided into grids of equal size, the grids are encoded in row order, the data is projected into the corresponding grids according to its spatial position, and an index is created to form an index data set;
step 3: KNN query processing based on PID; performing PID-based KNN query processing on the divided data areas, generating local KNN query results on each data area, merging all local KNN query results, sequencing, and finally returning the first K query results as final KNN query processing results; the method specifically comprises the following steps: firstly, reading inquiry point information given by a user, creating an RDD inquiry set, then executing a PID-based KNN inquiry algorithm on the divided index data set to obtain a local KNN inquiry result set on each data partition, merging and sorting the local KNN inquiry results on each data partition to form a candidate inquiry result set, selecting the first K inquiry results as final KNN inquiry results, and storing the final KNN inquiry results in an HDFS file system;
step 3-1: initial KNN query processing; performing primary KNN query processing according to the KNN query requirement; firstly, according to a grid-based data partitioning method, calculating a grid unit where a query point q is located according to the position of the query point q, setting an initial value of a query radius r, and carrying out KNN query in a circular area formed by taking the q point as a circle center and r as a radius;
firstly, determining a candidate query partition set in a circular area formed by a query point q and a query radius r, then calculating the distance from each data point in each candidate query partition to the query point q, if K data points closest to the q point are found in the candidate query partition set, finishing the query, returning a result and storing the result in an HDFS; otherwise, executing the step 3-2, calculating the increment step length of the query radius by adopting a PID-based query radius calculation method, and forming a new query radius;
the specific method for determining the candidate query partition set comprises the following steps: determining whether the data partition represented by one grid cell is the queried object according to the query point q and the query radius r: in the KNN query process, firstly determining a data partition needing to be queried, namely a candidate query partition, in a circular local area formed by a radius r; then calculating the distance between the query point q and each data point in the query partition, thereby determining K nearest points; the specific determination method comprises the following steps: for any partition represented by grid cell C, using Mindist (C, q) to represent the minimum distance from the query point q to the grid cell C, and using maxdist (C, q) to represent the maximum distance from the query point q to the grid cell C; if the condition mindist (C, q) is satisfied, r < maxdist (C, q), indicating that the data partition represented by the grid unit C is in the query range of the query q, putting the data partition into a candidate query partition set, and executing KNN query processing on the candidate partitions; otherwise, the grid unit C is not in the query range of the query point q, and the partition represented by the grid unit cannot enter the candidate query partition set;
step 3-2: calculating the increment step length of the query radius by adopting a PID-based query radius calculation method, and forming a new query radius;
step 3-2-1: inputting a target value of the query radius to the PID system, which calculates a predicted value u(t) of the query radius r according to the PID calculation formula; the difference e(t) between the target value and the predicted value is then fed back to the PID system, and u(t) and e(t) are calculated again; this calculation is iterated until the PID system stabilizes, that is, until there is no error between the output predicted value and the input target value, or the result falls within the error range preset by the PID system; the PID calculation formula is:
$$u(t) = K_p e(t) + K_i \int_0^t e(\tau)\,d\tau + K_d \frac{de(t)}{dt}$$
wherein e(t) is the difference between the predicted value output by the PID system and the target value input to the PID system, t is time, K_p is the value of the P parameter, K_i is the value of the I parameter, and K_d is the value of the D parameter;
replacing the derivative part in the PID system with a difference, the following PID calculation formula is obtained:
$$u(k) = K_p\left[e(k) + \frac{T}{T_i}\sum_{j=0}^{k} e(j) + \frac{T_d}{T}\bigl(e(k) - e(k-1)\bigr)\right]$$
where u(k) is the predicted query radius, k is the sampling number, T_i is the integral constant, T_d is the differential constant, T is the sampling period, the sum over e(j) is the accumulated error, and e(k) is the difference between the set value and the actual value of the query result; the set value of the query result is the target number of results K of the KNN query, and the actual value is the number of results obtained by the actual KNN search with the query radius u(k);
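in order to keep the parameters consistent with those in the PID simulation system, the above formula is rewritten with K_i = K_p T / T_i and K_d = K_p T_d / T, giving the following formula:
$$u(k) = K_p e(k) + K_i \sum_{j=0}^{k} e(j) + K_d \bigl(e(k) - e(k-1)\bigr)$$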
step 3-2-2: setting PID system parameters to obtain a new query radius;
setting P, I, D parameters of a PID system through actual KNN query experiment tests, and further calculating a new query radius according to a given K value;
step 3-3: continuing KNN query by using the new query radius; after calculating a new query radius by using a PID system, executing new KNN queries on each data partition in parallel according to the new query radius, merging local KNN query results on each partition, calculating the overall KNN query result number, ending the query and saving the query result into the HDFS if the query ending condition is met, otherwise, jumping to step 3-2 to continue execution.
2. The marine space-time big data parallel KNN query processing method based on PID of claim 1, wherein the method is characterized in that: the specific method of the step 1 is as follows:
the ocean data set to be processed is read from the HDFS, and is converted into RDD in the memory by using the CreateRDD method of the Spark platform, and the following data preprocessing is performed in the process:
step 1-1: data deduplication; performing repeatability check and de-duplication treatment on the acquired ocean data to ensure that no repeated data exists;
step 1-2: exception handling; performing consistency checks and error detection on the de-duplicated marine data, correcting inconsistent and abnormal data, and deleting data whose abnormal proportion is greater than a set threshold r_1 and which cannot be corrected;
step 1-3: processing a missing value; the data interpolation process is performed for the missing of a single data item or a plurality of discontinuous data items, and the missing value filling is performed for the missing of a plurality of continuous data items by adopting an LSTM network.
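The split between isolated gaps and consecutive gaps in step 1-3 can be sketched as follows. The sketch assumes a one-dimensional series in which missing items are marked with `None`, and it fills an isolated gap with the mean of its two neighbours (linear interpolation), which is only one possible interpolation scheme. Runs of consecutive missing items, and gaps at either end of the series, are returned untouched as index ranges so that a separate model such as an LSTM network could fill them; no such model is implemented here, and the function name is hypothetical.

```python
def fill_isolated_gaps(values):
    """Interpolate single missing items; report the remaining gaps.

    Returns (filled_values, deferred), where deferred is a list of
    half-open index ranges (start, end) that were left as None.
    """
    out = list(values)
    deferred = []
    i, n = 0, len(out)
    while i < n:
        if out[i] is not None:
            i += 1
            continue
        j = i
        while j < n and out[j] is None:
            j += 1  # extend to the end of this run of missing items
        if j - i == 1 and i > 0 and j < n:
            # Isolated gap with a known neighbour on each side.
            out[i] = (out[i - 1] + out[j]) / 2
        else:
            deferred.append((i, j))
        i = j
    return out, deferred
```

Keeping the deferred ranges explicit makes it easy to hand only the consecutive gaps to a sequence model while the cheap interpolation handles the rest.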
3. The marine space-time big data parallel KNN query processing method based on PID of claim 1, wherein the method is characterized in that: the specific method of the step 2 is as follows:
step 2-1: dividing the two-dimensional space of the ocean data into square grid cells of equal size by a grid division method; the side length of each grid cell is L, and each grid cell is represented by a 2-tuple (m, n) denoting the grid cell in row m and column n, where m and n are the coordinates of the grid cell in the longitude and latitude directions respectively, and the code of the grid cell is denoted c(m, n); the grid cells are encoded in row order, and the grid cell code c(m, n) is calculated as follows:
c(m,n)=n*Row_num+m (1)
wherein Row_num is the number of grid cells per column;
step 2-2: for any marine data object P_i, its spatial position information is represented by a 2-tuple (D_i1, D_i2), where D_i1 and D_i2 are the longitude and latitude dimensions of the data object P_i respectively; if the data object P_i is to be assigned to the grid cell c(m, n) in which its spatial position lies, m and n are determined from D_i1, D_i2 and L, where L is the length of a grid cell; if s data objects are mapped into the same grid cell c(m, n), they are represented by the set P(m, n) = {P_1, P_2, P_3, …, P_s};
step 2-3: establishing a grid index; each grid cell together with the data objects projected into it forms an index item, and each index item is organized as a (Key, Value) pair, where Key is the grid cell number and Value is the set of all data objects mapped to the grid cell; for grid cell c(m, n) and the set of data objects {P_1, P_2, P_3, …, P_s} assigned to that grid cell, the corresponding index item (c(m, n), {P_1, P_2, P_3, …, P_s}) is formed;
step 2-4: on the basis of the grid index, a data item Num is added to each grid index item to record the number of data objects assigned to each grid cell, so the index item becomes (c(m, n), {Num, P_1, P_2, P_3, …, P_s}), where Num = s; all index items form the grid index, which is stored in memory in RDD form to form the grid index data set.
CN202110354512.8A 2021-04-01 2021-04-01 Ocean space-time big data parallel KNN query processing method based on PID Active CN113010525B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110354512.8A CN113010525B (en) 2021-04-01 2021-04-01 Ocean space-time big data parallel KNN query processing method based on PID

Publications (2)

Publication Number Publication Date
CN113010525A CN113010525A (en) 2021-06-22
CN113010525B true CN113010525B (en) 2023-08-01

Family

ID=76387568

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110354512.8A Active CN113010525B (en) 2021-04-01 2021-04-01 Ocean space-time big data parallel KNN query processing method based on PID

Country Status (1)

Country Link
CN (1) CN113010525B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116028500B (en) * 2023-01-17 2023-07-14 黑龙江大学 Range query indexing method based on high-dimensional data
CN116126942B (en) * 2023-02-09 2023-11-24 国家气象信息中心(中国气象局气象数据中心) Multi-dimensional space meteorological grid data distributed storage query method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7933850B1 (en) * 2006-11-13 2011-04-26 Oracle America, Inc. Method and apparatus for functional relationship approximation through nonparametric regression using R-functions
CN104299247A (en) * 2014-10-15 2015-01-21 云南大学 Video object tracking method based on self-adaptive measurement matrix
WO2019030464A1 (en) * 2017-08-07 2019-02-14 Wadaro Limited A method of geolocation
CN109446293A (en) * 2018-11-13 2019-03-08 嘉兴学院 A kind of parallel higher-dimension nearest Neighbor
CN110046268A (en) * 2016-02-05 2019-07-23 大连大学 Establish the higher dimensional space kNN querying method that sensitive hash index is set based on ranking

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
PL2866484T3 (en) * 2013-10-24 2019-05-31 Telefonica Germany Gmbh & Co Ohg A method for anonymization of data collected within a mobile communication network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
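A PID-Based KNN query processing algorithm for spatial data; Baiyou Qiao et al.; Sensors; Vol. 22, No. 19; 1-9 *
kNN query method for uncertain data in P2P environments; Wang Guoren et al.; Journal of Northeastern University (Natural Science); Vol. 33, No. 5; 632-635 *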

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant