CN108460121B - Little file merging method for space-time data in smart city - Google Patents
- Publication number
- CN108460121B (application CN201810154658.6A)
- Authority
- CN
- China
- Prior art keywords
- time
- attribute
- file
- space
- small
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/1734—Details of monitoring file system events, e.g. by the use of hooks, filter drivers, logs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/16—File or folder operations, e.g. details of user interfaces specifically adapted to file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/231—Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2216/00—Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
- G06F2216/03—Data mining
Abstract
The invention discloses a method for merging small files of space-time data in a smart city. Small file merging serves as a microscopic data layout mechanism that can effectively improve the I/O performance of a system and reduce user access latency. By analyzing the characteristics of space-time data, the method parameterizes historical user access information and extracts its space-time attributes, performs hierarchical clustering on the access information in the space-time attribute domain with the AGNES algorithm, applies an access-density-based weighted calculation to the clustering result, and finds the access-correlated space-time range. Finally, this space-time range guides the merging of the small files. Experimental results show that the algorithm is simple and efficient and greatly improves the access efficiency of small space-time data files in the system.
Description
Technical Field
The invention relates to the field of research on a small file merging strategy of space-time data in a smart city.
Background
In a smart city built on the Internet of Things and cloud computing, ubiquitous sensors generate sensing data with three inherent attributes: time, space, and type. These data items are small (usually tens to hundreds of KB), varied, numerous, highly redundant, and grow continuously over time; they are typical small space-time data files.
Current mainstream distributed file systems are designed around large files in their implementation strategies, such as metadata management, data layout, stripe design, and cache management. Commonly used distributed file systems include Google GFS, Hadoop HDFS, PVFS, and Lustre. These file systems all adopt a centralized, master-slave data management mechanism that stores file metadata (data describing data, such as namespaces, access control information, file locations, and sizes) separately from the data block files. The master in the system is the Metadata Server (MDS), which stores file metadata and also maintains information such as the IP address and state of each data storage node. The worker is the Data Storage Server (DSS). A typical master-slave distributed file system and its access mechanism are shown in FIG. 1. As the figure shows, each time a Client issues a file access request, it must first communicate with the MDS to obtain the metadata and only then establish a file transmission link with the DSS. Large-scale, highly concurrent small-file access requests therefore cause frequent Client-MDS communication, which consumes the limited bandwidth and computing resources of the system. The MDS becomes the performance bottleneck, data access performance degrades severely, and file access response time grows.
Massive small space-time data files and the applications built on them bring convenience to daily life, but they also severely degrade the access performance of the system. The main problems are as follows. High memory occupancy: a large number of small files occupies a large amount of metadata server memory, and the total number of files the system can store is limited by that memory capacity. Heavy metadata server load: every file operation goes through the metadata server, and the frequent interaction overloads it, so it easily becomes the bottleneck of the access performance of the whole system. Low file access efficiency: every file write and read requires communication with the metadata server, so most of the time is spent on system overhead rather than on the short transmission of the file data itself.
Research has shown (Wang F, Xin Q, Hong B, et al. File System Workload Analysis for Large Scale Scientific Computing Applications [C]. IEEE, 2004: 139-152.) that in application service systems based on small files, user requests for small files account for more than 90% of all requests, while the amount of data accessed is less than 10% of all accessed data. Massive numbers of small files therefore severely affect the data access performance of the system. Small file merging, as a microscopic data layout mechanism, combines many different small files into one large file, which reduces the frequency of communication between the Client and the metadata server, relieves the MDS load, and improves small-file access performance. However, existing research on small file merging focuses on improving the storage system structure and on analyzing the characteristics of the files themselves. Current research on the small-file problem falls into two categories:
(1) improved system architecture
Ma et al. (Ma, Meng, Xiong, et al. Dawning cloud distributed file system: storage of massive small files [J]. Small-sized computer system, 2012, 7(33): 1481-1482.) address the access latency of massive small files by organizing and managing metadata through an improved distributed extensible hash, and propose HVFS, a file system based on distributed table storage, to achieve efficient small-file access. Zhang et al. (Zhang Zhao, Zyulien, Li Wenjuan, et al. A small-file-oriented cloud storage system based on a peer-to-peer network [J]. Journal of Zhejiang University (Engineering Science), 2013(1): 214-215.) introduce a central routing node that stores the routing and state information of all nodes in the system. Combined with a routing-information prefetching mechanism on the client, this reduces resource query time and addresses the small-file access efficiency of a distributed cloud storage system based on a peer-to-peer (P2P) network, but the access performance of the method is limited by the central node and its cost is relatively high. Fu et al. (Fu Songling, Liao Xiang, Huang Chenlin, et al. FlatLFS: a lightweight file system optimized for processing massive small files [J]. Journal of National University of Defense Technology, 2013, 35(2): 120-126.) abandon the hierarchical file management of traditional file systems and design FlatLFS, a lightweight file system with flat data storage, which trades flexibility for efficient small-file access. Zhao et al. (Zhao Yuan Ling, Xia Ling, Chua Yongji, et al. A study of a small file storage and access strategy with optimized performance [J]. Computer Research and Development, 2012, 49(7): 1579-. Zhang et al. (Zhang Z H, Ghose K. hFS: A Hybrid File System Prototype for Improving Small File and Metadata Performance [C]. ACM Proceedings of the 2007 EuroSys Conference on Operating Systems Review, 2007, 175-.
(2) Merging using file self-properties
Small file merging combines many small files into one large file and stores it in the DSS. On one hand, the Client can obtain the metadata of many small files through a single communication with the MDS, which avoids transmitting only a small amount of data in each interaction and improves the bandwidth utilization of the system. On the other hand, the MDS is the management center of the distributed file system, and an overloaded MDS degrades system performance; small file merging reduces the amount of file metadata stored in the MDS and therefore reduces its storage load.
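The saving in Client-MDS communication can be illustrated with a small counting sketch. It is only an illustration under stated assumptions: the function `mds_round_trips`, the file names, and the idea that the client caches the metadata of every large file it has already looked up are all hypothetical and are not taken from the patent.

```python
def mds_round_trips(requests, merged_into=None):
    """Count metadata lookups sent to the MDS for a sequence of small-file requests.

    merged_into optionally maps each small file to the large file that holds it.
    Assumption: the client caches metadata it has already fetched, so one lookup
    of a large file serves every small file merged into it.
    """
    cached = set()
    trips = 0
    for name in requests:
        key = merged_into[name] if merged_into else name
        if key not in cached:
            trips += 1          # one Client-MDS communication
            cached.add(key)
    return trips


requests = ["f1", "f2", "f3", "f4"]
layout = {"f1": "big1", "f2": "big1", "f3": "big1", "f4": "big2"}  # hypothetical merge result

unmerged_trips = mds_round_trips(requests)
merged_trips = mds_round_trips(requests, layout)
print(unmerged_trips, merged_trips)
```

The fewer large files a correlated request sequence touches, the fewer metadata lookups it needs, which is why the choice of which small files to merge together matters.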
The earliest small file merging methods include the Hadoop Archive (HAR) archive file technology, the SequenceFile technology, and MapFile. Subsequently, Yu et al. (Yusi, Gui Xiaolin, Huang Ru Wei, et al. A scheme for improving the storage efficiency of small files in cloud storage [J]. Journal of Xi'an Jiaotong University, 2011, 45(6): 59-60.) comprehensively considered the reading time, merging time, and memory occupancy of small files, applied multi-attribute decision theory, and merged the small files into a large file using the SequenceFile technology. The method reduces memory consumption and improves the storage efficiency of small files, but it offers nothing for improving file reading efficiency. Jiang et al. (Jiang L, Li B, Song M L. The Optimization of HDFS Based on Small Files [A]. Broadband Network and Multimedia Technology (IC-BNMT), 2010 3rd IEEE International Conference on [C]. 26-28 Oct. 2010: 912-915.) merge the small files into a large file and also store the metadata of some of the small files in DataNode memory, which further reduces NameNode memory consumption and improves the reading speed of small files. Dong et al. (Dong B, Zheng Q H, Tian F. An optimized approach for storing and accessing small files on cloud storage [J]. Journal of Network and Computer Applications, Volume 35, Issue 6, November 2012: 1847-1862.) divide small files into three types, structure-related, logic-related, and independent, according to the actual application characteristics and the characteristics of the files themselves, and derive a small file merging strategy from this division. Liu et al. (Liu X H, Han J Z, Zhong Y Q. Implementing WebGIS on Hadoop: A Case Study of Improving Small File I/O Performance on HDFS [C]. Cluster Computing and Workshops, 2009. CLUSTER '09. IEEE International Conference on, 2009: 1-8.) use the application characteristics and user access characteristics of WebGIS to merge small files with adjacent geographic location information and build a global index for them, effectively improving the storage efficiency of small files. Dong et al. (Dong B, Qiu J, Zheng Q, et al. A Novel Approach to Improving the Efficiency of Storing and Accessing Small Files on Hadoop: A Case Study by PowerPoint Files [M]. IEEE, 2010.) consider the correlation between courseware files and the locality of access, and propose a small file merging mechanism that merges the files belonging to the same courseware into a large file and reads them with a two-stage prefetching mechanism. Zhang et al. (Zhang Chung, Rui Jianwu, Wuting. A method for storing and reading Hadoop small files [J]. Computer Applications and Software, 2012(11): 95-100.) start from the correlation and directory structure between small files, build a hierarchical index for the merged large file, and preload the index file when the user accesses it, which improves the reading efficiency of small files.
It can be seen from the above research work that the access performance of small files can be improved by improving the system structure or combining the characteristics of the files. Moreover, on the premise of not changing the system architecture or improving the hardware performance, the small file combination is used as a microscopic data layout mechanism, so that the burden and communication overhead of the management node in the storage system can be effectively reduced, and the storage and access performance of the small file is improved.
However, existing small file merging strategies rely on the characteristics of the files themselves. (The small-file reading time and the user access characteristics mentioned above, mainly the number of user accesses, the dwell time, and the like, are tied directly to specific files and therefore count as characteristics of the files themselves.) Such strategies can only merge files that have been accessed historically, or simply merge files that are adjacent in space and time, without making deeper use of user access information. Yet the purpose of merging the massive small space-time data files in a smart city is to reduce user access latency and provide more convenient and faster space-time data services. Therefore, in addition to the characteristics of the files themselves, other characteristics of user access behavior also significantly affect access performance after merging, and current research does not take them into account.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method that makes deeper use of user access information and guides small file merging by the space-time characteristic that user access behavior is concentrated within a certain space-time range. The method focuses on where, when, and for what types of data users show greater interest and access heat, and merges those data. This can significantly improve data access efficiency, and it also allows data that users have never accessed, as well as newly generated data, to be merged.
The invention is based on the finding that user access behavior (the access behavior of users toward small files) is relatively concentrated within a certain space-time range, that is, the small files within that space-time range show access correlation. The method for merging small space-time data files in a smart city is realized by the following technical scheme: a data mining algorithm mines a space-time range with access correlation from the historical access information of the small files, and the small files within that space-time range are then merged.
The recommended data mining algorithms include association rule algorithms such as Apriori, FP-Growth, and Can-Tree, and clustering algorithms such as the hierarchical clustering algorithm AGNES (AGglomerative NESting) and the density clustering algorithm DBSCAN. The aim in each case is to mine the space-time range with access correlation from historical user access information.
These algorithms differ in computational complexity and in the storage space they occupy. More importantly, if an association rule algorithm is used, the position information of the mined associated files must be considered, and the spatial range is then calculated from the associated files in adjacent geographic areas. If a density clustering algorithm is used, the shape of the resulting clusters must be considered, and the files contained in each cluster are selected and the spatial range calculated in light of the actual application scenario. The invention recommends the AGNES clustering algorithm because it is simple and effective and does not require the cluster shape to be considered.
The process of mining a space-time range with access correlation from historical access information of the small files by using a hierarchical clustering algorithm AGNES, and then merging the small files in the space-time range comprises the following steps:
1) carrying out parameterization representation and space-time attribute extraction on historical user access information;
according to the definition of small space-time data files, each file has an inherent position attribute l, a type attribute s, and a time attribute t, so any small file can be represented by the space-time triple (l, s, t);
suppose the set of small space-time data files generated in a smart city is $F = \{f_1, f_2, \ldots, f_m\}$, the set of location attributes it contains is $L = \{l_1, l_2, \ldots, l_m\}$, the set of time attributes is $T = \{t_1, t_2, \ldots, t_m\}$, and the set of type attributes is $S = \{s_1, s_2, \ldots, s_m\}$; users accessing the application services in the smart city generate a small file access request sequence $A = (a_1, a_2, \ldots, a_n)$, where each request item $a_i$, $1 \le i \le n$, corresponds to a small space-time data file $f_i$; after the request sequence is parameterized and its attributes are extracted, a space-time attribute sequence is formed:
$$A = (a_1, a_2, \ldots, a_n) = ((l_1, s_1, t_1), (l_2, s_2, t_2), \ldots, (l_n, s_n, t_n)) \quad (1);$$
2) file merging
2.1) type attribute classification: from the historical small file access request sequence $A = (a_1, a_2, \ldots, a_n)$, separate out the access request sequence whose type attribute is $s_i$, $s_i \in S$;
2.2) space-time clustering: use the hierarchical clustering algorithm AGNES to cluster the position attributes and the time attributes of the separated access request sequence, respectively, to obtain the merging range of the position attribute and the merging range of the time attribute;
2.3) small file merging: merge the small files whose type attribute is $s_i$ according to the merging range of the position attribute and the merging range of the time attribute;
2.4) repeat steps 2.1)-2.3) to calculate the space-time merging ranges (that is, the merging ranges of the position attribute and the time attribute) of the small files with each type attribute, merge the files of each type accordingly, and establish indexes.
The invention classifies the type attributes before the spatio-temporal clustering, because generally speaking, users have different access characteristics to small files with different types of attributes, and the corresponding spatio-temporal ranges with access correlation are generally different. And the type attribute classification is carried out before clustering, so that a more accurate space-time combination range can be obtained.
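Step 1) and step 2.1) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Access` tuple, the log field names (`x`, `y`, `type`, `time`), and the sample values are assumptions introduced here.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Access(NamedTuple):
    """One parameterized request a_i = (l, s, t)."""
    l: Tuple[float, float]  # position attribute, assumed to be an (x, y) coordinate
    s: str                  # type attribute
    t: float                # time attribute, assumed to be a numeric timestamp


def parameterize(raw_log: List[dict]) -> List[Access]:
    # Step 1: extract (l, s, t) from each record; the dict keys are hypothetical.
    return [Access((r["x"], r["y"]), r["type"], r["time"]) for r in raw_log]


def split_by_type(requests: List[Access]) -> Dict[str, List[Access]]:
    # Step 2.1: separate the request sequence by type attribute s_i.
    groups: Dict[str, List[Access]] = defaultdict(list)
    for a in requests:
        groups[a.s].append(a)
    return dict(groups)


raw_log = [
    {"x": 1.0, "y": 2.0, "type": "traffic", "time": 10.0},
    {"x": 5.0, "y": 5.0, "type": "temperature", "time": 11.0},
    {"x": 1.2, "y": 2.1, "type": "traffic", "time": 12.0},
]
A = parameterize(raw_log)
by_type = split_by_type(A)
```

Each value of `by_type` is then clustered on its own, so the space-time range mined for one data type is not distorted by the access pattern of another.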
As a recommended scheme, in step 2.2) the hierarchical clustering algorithm AGNES is used to cluster the position attributes and the time attributes of the access request sequence separately, an access-density-based weighted calculation is applied to the clustering results, and the weighted results are then used to obtain the merging range of the position attribute and the merging range of the time attribute. Weighting the clustering results by access density reduces the influence of noise points on the calculation, so that the calculated space-time range represents, as closely as possible, the space-time range of the access-correlated files.
In the step 2.2), the recommended merging range of the position attribute is obtained as follows:
(1a) take the set of location attributes contained in the access request sequence whose type attribute is $s_i$, and treat each coordinate in this set as its own cluster;
(2a) calculate the group average distance between every pair of clusters, and merge the two clusters with the shortest distance;
(3a) repeat step (2a) until the group average distance between any two clusters is greater than a predefined distance threshold, at which point the clustering algorithm ends; the predefined distance threshold is the average of the distances between all coordinate points in the location attribute set;
(4a) take the cluster set produced when the clustering process of step (3a) ends and use it to calculate the average spatial range: the spatial range radius of each cluster is weighted by the user access heat, that is, by the density (number) of coordinate points in the cluster, so that a higher density gives a larger weight;
(5a) finally, average the weighted spatial range radii of all clusters in the cluster set to obtain the merging range of the position attribute for the small space-time data files whose type attribute is $s_i$, $s_i \in S$.
The invention uses the average-linkage method, which defines the similarity between two clusters by their group average distance: the shorter the group average distance, the higher the similarity. This helps keep the clustering process from being overly sensitive to outliers or noise points.
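Steps (1a)-(5a) can be sketched in plain Python as below. Two points are assumptions made for this sketch because the text does not fix them: the spatial range radius of a cluster is taken as the largest distance from the cluster centroid to one of its points, and "averaging the weighted radii" is read as a mean weighted by the number of points per cluster. Coordinates are treated as planar (x, y) pairs.

```python
import math
from itertools import combinations


def group_average(c1, c2):
    # Average-linkage distance: mean distance over all cross-cluster point pairs.
    return sum(math.dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))


def mean_pairwise_distance(points):
    # Threshold of step (3a): mean distance between all coordinate points.
    pairs = list(combinations(points, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


def agnes(points, threshold):
    clusters = [[p] for p in points]                      # (1a) one cluster per point
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: group_average(clusters[ij[0]], clusters[ij[1]]))
        if group_average(clusters[i], clusters[j]) > threshold:
            break                                         # (3a) closest pair is too far apart
        merged = clusters[i] + clusters[j]                # (2a) merge the closest pair
        del clusters[j]                                   # j > i, so index i stays valid
        clusters[i] = merged
    return clusters


def radius(cluster):
    # Assumed definition: largest distance from the centroid to a member point.
    cx = sum(p[0] for p in cluster) / len(cluster)
    cy = sum(p[1] for p in cluster) / len(cluster)
    return max(math.dist((cx, cy), p) for p in cluster)


def location_merge_range(points):
    clusters = agnes(points, mean_pairwise_distance(points))
    # (4a)-(5a): weight each radius by the cluster's point count, then average.
    return sum(len(c) * radius(c) for c in clusters) / len(points)


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
clusters = agnes(points, mean_pairwise_distance(points))
merge_range = location_merge_range(points)
```

With these four points the two nearby pairs are merged and the process stops, because the remaining group average distance (about 10.1) exceeds the mean pairwise distance (about 7.4).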
In the step 2.2), the recommended merging range of the time attribute is obtained as follows:
(1b) take the set of time attributes contained in the access request sequence whose type attribute is $s_i$, and treat each time point in this set as its own cluster;
(2b) calculate the group average time difference between every pair of clusters, and merge the two clusters with the smallest time difference;
(3b) repeat step (2b) until the group average time difference between any two clusters is greater than a predefined time difference threshold, at which point the clustering algorithm ends; the predefined time difference threshold is the average of the differences between all time points in the time attribute set;
(4b) take the cluster set produced when the clustering process of step (3b) ends and use it to calculate the average time span range: the time span radius of each cluster is weighted by the user access heat, that is, by the density (number) of time attribute points in the cluster, so that a higher density gives a larger weight;
(5b) finally, average the weighted time span radii of all clusters in the cluster set to obtain the merging range of the time attribute for the small space-time data files whose type attribute is $s_i$, $s_i \in S$.
Here too the average-linkage method is used: the similarity between two clusters is defined by their group average time difference, and the smaller the group average time difference, the higher the similarity. This helps keep the clustering process from being overly sensitive to outliers or noise points.
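Steps (1b)-(5b) are the one-dimensional counterpart of the location clustering. The sketch below assumes numeric timestamps, takes the time span radius of a cluster as half of its span (latest minus earliest time, divided by two), and again reads the final averaging as a mean weighted by the number of points per cluster; none of these choices is stated in the text.

```python
from itertools import combinations


def group_average_diff(c1, c2):
    # Group average time difference between two clusters of timestamps.
    return sum(abs(a - b) for a in c1 for b in c2) / (len(c1) * len(c2))


def time_merge_range(times):
    pairs = list(combinations(times, 2))
    # Threshold of step (3b): mean difference between all time points.
    threshold = sum(abs(a - b) for a, b in pairs) / len(pairs)

    clusters = [[t] for t in times]                       # (1b)
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: group_average_diff(clusters[ij[0]], clusters[ij[1]]))
        if group_average_diff(clusters[i], clusters[j]) > threshold:
            break                                         # (3b)
        merged = clusters[i] + clusters[j]                # (2b)
        del clusters[j]
        clusters[i] = merged

    # (4b)-(5b): half-span per cluster (assumed radius), weighted by point count.
    half_spans = [(max(c) - min(c)) / 2 for c in clusters]
    return sum(len(c) * h for c, h in zip(clusters, half_spans)) / len(times)


t_range = time_merge_range([0.0, 2.0, 100.0, 104.0])
print(t_range)
```

For the sample timestamps the clusters are {0, 2} and {100, 104}, with half-spans 1 and 2, so the weighted mean is 1.5.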
The process implemented in the step 2.3) is recommended as follows:
let the small file data set be $F = \{f_1, f_2, \ldots, f_M\}$; for the small space-time data files in it whose type attribute is $s_i$, $s_i \in S$, take the corresponding location attribute set and time attribute set. Using the merging range of the position attribute and the merging range of the time attribute mined in step 2.2), the small files whose type attribute is $s_i$ are merged as follows:
(1c) create a file;
(2c) taking the earliest time attribute $t_v$ in the time attribute set as the reference point, find the time attributes whose time span from $t_v$ is less than or equal to the merging range of the time attribute; they form the set Range_$t_v$;
(3c) taking any location attribute $l_u$ in the location attribute set as the reference point, find the location attributes whose spatial distance from $l_u$ is less than or equal to the merging range of the position attribute; they form the set Range_$l_u$;
(4c) merge the small space-time files whose location attribute belongs to Range_$l_u$ and whose time attribute belongs to Range_$t_v$ into the file created in step (1c);
(5c) if the total size of the merged file is larger than the predefined large file storage capacity, jump to step (6c); if it is smaller, jump to step (7c);
(6c) delete from the time attribute set the time attributes of Range_$t_v$ that have already taken part in the merge, keep the current location attribute reference point $l_u$ unchanged, and execute steps (1c)-(5c) in a loop;
(7c) delete the time attributes of Range_$t_v$ from the time attribute set, and execute steps (2c)-(5c) in a loop until all small files whose location attribute is in Range_$l_u$ and whose time attribute is in the time attribute set have been merged;
(8c) if the total size of the files merged so far still does not reach the predefined large file storage capacity, delete the location attributes of Range_$l_u$ from the location attribute set, reset the time attribute set, and execute steps (2c)-(5c);
(9c) execute steps (1c)-(8c) in a loop until all small files whose type attribute is $s_i$ and whose attributes fall in the location attribute set and the time attribute set have been merged.
The underlying storage of commonly used distributed storage systems, such as HDFS, is organized in data blocks, and the system default block size is usually 64 MB. When a file is stored, it must reach the size of a data block before the system places it on the underlying disk. Merging is therefore necessary when a distributed storage system stores small files. During merging, it must also be checked whether the whole data block has been filled; only a full data block is placed on the underlying disk. Otherwise the uploaded small files stay in the cache queue of the client, and the system performs the write operation, storing the data on the underlying disk as a data block, only once 64 MB has been reached. The data block size can of course be customized, for example to 128 MB, in which case the total size of the merged small files must reach 128 MB before they are stored on the underlying disk.
Small files generated in a smart city are typical stream data: new data are generated continuously within the same spatial range as time passes. Their spatial attribute range is therefore limited (for example, the spatial region of a city), but their time attribute range is unlimited. To merge both the existing data and the data added later, the merging proceeds by the time attribute of the files: the earliest data within a given position attribute range are merged first, then the data generated in the next time period within that range, until all small file data within the position attribute range have been merged. The next position attribute range is then found and its small files are merged in time order in the same way, until all small file data of the whole geographic area have been traversed and merged.
In the step 2.4), after the files are merged, a local index is established in the generated large file for storing the length of the small file and the offset position in the large file, so that the application server can directly position through the internal index of the file when processing the data request, and the required small file data can be conveniently and quickly obtained. The combined large file is placed in the continuous storage interval of the same node in the bottom data storage server as much as possible, because some originally continuous small files are likely to be placed at the edge position of the large file, when a user accesses the large file, the client needs to read across the large file and even across the data storage server, and the access performance of the small file is influenced.
Compared with the prior art, the invention has the following beneficial effects:
The invention exploits the spatio-temporal correlation of user file accesses: it mines the hidden space-time range from historical access information and uses that range to guide the merging of small files. In addition, because the invention merges files according to the space-time range of access-related files, independently of per-file characteristics such as the number of user accesses or the dwell time, all files in the system can be merged, and newly added data can be merged whether or not users have accessed it.
Drawings
FIG. 1 illustrates a distributed file system and access mechanism in a master-slave architecture;
FIG. 2 is a diagram illustrating the AGNES clustering process for data objects { a, b, c, d, e };
FIG. 3 is a diagram illustrating the structure of a large file generated after merging;
FIG. 4 is a diagram illustrating a size distribution of files stored in the system;
FIG. 5 is used to illustrate an MDS storage load condition;
FIG. 6 is used to show the storage speed of small files;
FIG. 7 is used to show the total average response time for small file accesses.
Detailed Description
The smart city provides convenient and fast space-time data service for users, and often provides space-time data application service for the public in a predefined application mode through a network platform. After receiving an application access request of a user, the system accesses data in a certain space-time range according to predefined application and space-time parameters selected by the user, analyzes and processes the accessed data by using a function in a background, and finally returns a result to the user.
Clearly, if small files within a certain spatial range and time span are frequently accessed by users, accesses to the small files in that space-time range are correlated. Based on this reasoning, the embodiment uses a hierarchical clustering algorithm to mine space-time ranges with access correlation from the historical access information of small files and uses these ranges to guide the merging of small files in the system, thereby reducing MDS memory consumption, the number of Client-MDS communications, and the user's access delay.
The method for merging the space-time data small files in the smart city specifically comprises the following steps:
step 1) carrying out parametric representation and space-time attribute extraction on historical user access information
Specifically: by the definition of a spatio-temporal small data file, each file has an inherent location attribute l, a type attribute s, and a time attribute t. Any small file can therefore be represented by its three space-time elements (l, s, t).
Suppose the set of spatio-temporal small data files generated in a smart city is $F=\{f_1,f_2,\dots,f_m\}$, with location attribute set $L=\{l_1,l_2,\dots,l_m\}$, time attribute set $T=\{t_1,t_2,\dots,t_m\}$, and type attribute set $S=\{s_1,s_2,\dots,s_m\}$. User access to the application services in the smart city generates a small-file access request sequence $A=(a_1,a_2,\dots,a_n)$, where each request item $a_i$, $1\le i\le n$, corresponds to a spatio-temporal small data file $f_i$. After parameterization and extraction, the request sequence forms the space-time attribute sequence:
$$A=(a_1,a_2,\dots,a_n)=((l_1,s_1,t_1),(l_2,s_2,t_2),\dots,(l_n,s_n,t_n)) \qquad (1)$$
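As an illustration, the parameterized representation can be sketched in Python. The `Request` class and the record field names (`x`, `y`, `type`, `time`) are hypothetical and do not come from the patent; the sketch only shows how each access record is reduced to a triple (l, s, t).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Request:
    """One access request a_i = (l, s, t)."""
    l: Tuple[float, float]  # location attribute (x, y); 2-D coordinates assumed
    s: str                  # type attribute
    t: float                # time attribute, e.g. seconds since some epoch (assumed)


def parameterize(log: List[dict]) -> List[Request]:
    """Turn raw access-log records into the sequence A of (l, s, t) triples."""
    return [Request((r["x"], r["y"]), r["type"], r["time"]) for r in log]


# Hypothetical access log with two records.
log = [
    {"x": 114.30, "y": 30.60, "type": "temperature", "time": 0.0},
    {"x": 114.31, "y": 30.61, "type": "humidity", "time": 300.0},
]
A = parameterize(log)
```

Keeping the triple immutable (`frozen=True`) makes requests usable as dictionary keys or set members in later grouping steps.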
step 2) document merging
2.1) type attribute classification: from the historical small-file access request sequence $A=(a_1,a_2,\dots,a_n)$, separate out the subsequence $A_{s_i}$ of access requests whose type attribute is $s_i$, $s_i\in S$;
2.2) spatio-temporal clustering: use the hierarchical clustering algorithm AGNES to cluster the location attributes and the time attributes of the access request sequence $A_{s_i}$ separately, weight the clustering results by access density, and obtain the merging range of the location attribute and the merging range of the time attribute from the weighted results;
2.3) small file merging: merge the small files whose type attribute is $s_i$ according to the merging ranges of the location attribute and the time attribute;
2.4) repeat steps 2.1)-2.3) to calculate the space-time merging ranges (that is, the merging ranges of the location attribute and the time attribute) of the small files of each type attribute, merge them accordingly, and build indexes.
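Step 2.1) amounts to grouping the request sequence by type attribute. A minimal sketch, assuming each request is a plain `(l, s, t)` tuple; the function name `split_by_type` and the sample values are illustrative only:

```python
from collections import defaultdict


def split_by_type(A):
    """Step 2.1: map each type attribute s_i to the (l, t) pairs of its requests."""
    groups = defaultdict(list)
    for l, s, t in A:
        groups[s].append((l, t))
    return dict(groups)


A = [
    ((0.0, 0.0), "temp", 10.0),
    ((1.0, 1.0), "rain", 20.0),
    ((0.5, 0.0), "temp", 30.0),
]
by_type = split_by_type(A)

# Location and time attribute sequences per type, the inputs of step 2.2).
locations = {s: [l for l, _ in reqs] for s, reqs in by_type.items()}
times = {s: [t for _, t in reqs] for s, reqs in by_type.items()}
```

Each per-type pair of sequences can then be clustered independently, which is what allows a different merging range for every data type.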
In step 2.2), the merging range of the position attribute is obtained as follows:
The merging range of the location attribute is, in effect, the spatial extent within which small files should be merged together. The location attributes of the historical small-file access request sequence are therefore clustered into different clusters, and the clusters are then analyzed to obtain a weighted average cluster radius. This radius is the spatial merging range, that is, the merging rule for the location attribute.
A. Position attribute clustering based on AGNES algorithm
The AGNES algorithm is a classical agglomerative hierarchical clustering method. Initially, AGNES treats each object as its own cluster, and the clusters are then merged step by step according to some criterion. For example, if the Euclidean distance between an object in cluster $C_1$ and an object in cluster $C_2$ is the smallest among the distances between all objects belonging to different clusters, the two clusters are considered similar, and $C_1$ and $C_2$ may be merged. In this clustering process, each cluster is represented by all of its objects, the similarity between two clusters is determined by their closest pair of data points, and cluster merging is repeated until all objects end up in a single cluster. The agglomerative AGNES clustering process is shown in FIG. 2.
Assume that the location attributes of the spatio-temporal small data files in the smart city considered here are all two-dimensional, i.e., longitude and latitude coordinates, so that they can be expressed as:
$$L=\{l_1,l_2,l_3,\dots\}=\{(x_1,y_1),(x_2,y_2),\dots,(x_M,y_M)\} \qquad (2)$$
Then, according to equation (1), the access requests in the user access request sequence $A$ that contain the type attribute $s_i$ can be expressed as:
$$A_{s_i}=((l_1,s_i,t_1),(l_2,s_i,t_2),(l_3,s_i,t_3),\dots)$$
The goal of clustering is to make objects in the same cluster as similar as possible and objects in different clusters as dissimilar as possible. A core problem of any clustering method is therefore how to measure the similarity between two clusters. Here similarity is defined for the clustering of the location attributes of spatio-temporal small data files, where each cluster is a set of location attributes. So that clustering is not overly sensitive to outliers or noise points, the similarity between two clusters is defined by the group average distance computed with the average-linkage algorithm: the smaller the group average distance, the higher the cluster similarity.
First, define the distance between any two location attribute coordinate points $l=(x,y)$ and $l'=(x',y')$:
$$d(l,l')=\sqrt{(x-x')^2+(y-y')^2}$$
Suppose cluster $C_m$ contains the coordinate point set $C_m=(l_1,l_2,l_3,\dots)$ and cluster $C_n$ contains the coordinate point set $C_n=(l'_1,l'_2,l'_3,\dots)$, and the two clusters share no elements. Then the group average distance between clusters $C_m$ and $C_n$ is:
$$D(C_m,C_n)=\frac{1}{N_mN_n}\sum_{l\in C_m}\sum_{l'\in C_n}d(l,l')$$
where $N_m=\mathrm{card}(C_m)$ is the number of coordinate points in cluster $C_m$ and $N_n=\mathrm{card}(C_n)$ is the number of coordinate points in cluster $C_n$. At this point, the spatial range radius of cluster $C_m=(l_1,l_2,l_3,\dots)$ is:
B. Generation of the merging range
According to the principle of the AGNES algorithm, if the number of clusters to generate is not specified in advance, the algorithm keeps merging clusters until all objects belong to a single cluster. Clustering everything into one cluster is equivalent to not clustering at all, so no merging rule for the location attribute could be derived from the original location attribute coordinate points. However, because the spatio-temporal small data files of each data type serve multiple application services, and different users have different access preferences, small-file data in different space-time ranges is accessed each time an application service is used. A termination condition therefore has to be set for the AGNES algorithm.
To this end, for the spatio-temporal small data files with type attribute $s_i$, $s_i\in S$, a spatial distance threshold is defined as the average of the distances between all coordinate points in the location attribute set of the type-$s_i$ access requests, and this average is used as the termination condition of the clustering algorithm.
If the group average distance between two clusters is greater than this threshold, they cannot be merged into one cluster. When no further clusters can be merged, the clustering process of the AGNES algorithm ends.
Next, the AGNES algorithm is used to cluster the location attribute sequence of the access requests that contain the type attribute $s_i$, so that the clustering result distinguishes the spatial range involved in each application-service access by users to this type of data. The clustering process is summarized as follows:
(1) Take each coordinate in the location attribute set as a cluster;
(2) Calculate the group average distance between every pair of clusters and merge the two closest clusters: if two clusters have the smallest group average distance, they are merged into a new cluster;
(3) Repeat step (2) until the group average distance between any two clusters is greater than the predefined distance threshold, at which point the clustering algorithm ends.
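The clustering loop with its threshold-based termination can be sketched as follows. This is a minimal, unoptimized illustration and not the patented implementation: the point distance is assumed to be Euclidean, the threshold is assumed to be the mean over all unordered point pairs (the exact normalization is an assumption), and all function names are hypothetical.

```python
import math
from itertools import combinations


def dist(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def group_avg_distance(cm, cn):
    """Average-linkage distance: mean of all cross-cluster point distances."""
    return sum(dist(p, q) for p in cm for q in cn) / (len(cm) * len(cn))


def distance_threshold(points):
    """Mean distance over all unordered point pairs (assumed normalization).

    Requires at least two points.
    """
    pairs = list(combinations(points, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


def agnes(points, threshold):
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > 1:
        # Find the closest pair of clusters under the group average distance.
        d, i, j = min(
            (group_avg_distance(clusters[i], clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if d > threshold:  # termination: no pair is close enough to merge
            break
        merged = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i stays valid
        clusters[i] = merged
    return clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
clusters = agnes(points, distance_threshold(points))
print(clusters)
```

On this toy input the two well-separated groups of points remain separate clusters, because their group average distance exceeds the mean pairwise distance. The loop recomputes all pairwise cluster distances in every iteration, which is acceptable for a sketch but too slow for millions of access records.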
Assume that the cluster set generated after the clustering process ends is $\{C_1,C_2,\dots,C_{K_l}\}$. This cluster set is then used to calculate the average spatial range. If the location attribute points contained in a cluster are dense, the corresponding spatial range is a hot-spot area in which users access small files of this type attribute.
Therefore, to reduce the influence of noise points on the result, so that the calculated spatial range represents the space-time attribute range of access-related files as closely as possible, the spatial range radius of each cluster is weighted by user access heat, that is, by the density (number) of coordinate points in the cluster: the greater the density, the greater the weight. For cluster $C_k$, $1\le k\le K_l$, the spatial range radius weighted by access density is:
$$R_w(C_k)=\frac{N_k}{N}\,R(C_k)$$
where $N_k$ is the number of coordinate points contained in cluster $C_k$, $N$ is the number of coordinate points contained in the location attribute sequence, and $R(C_k)$ is the spatial range radius of cluster $C_k$. Finally, the weighted spatial range radii of all clusters in the set are averaged to obtain the location attribute merging range corresponding to the spatio-temporal small data files with type attribute $s_i$, $s_i\in S$.
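A minimal sketch of the density weighting, under explicitly stated assumptions: the radius of a cluster is taken here to be the largest distance from its centroid, the weight of a cluster is its share of all coordinate points, and "averaging" is read as the arithmetic mean over clusters. None of these choices is confirmed by the text; the function names are hypothetical.

```python
import math


def radius(cluster):
    """ASSUMPTION: cluster radius = largest distance from the cluster centroid."""
    cx = sum(x for x, _ in cluster) / len(cluster)
    cy = sum(y for _, y in cluster) / len(cluster)
    return max(math.hypot(x - cx, y - cy) for x, y in cluster)


def location_merge_range(clusters):
    """Mean of the density-weighted cluster radii (assumed reading of 'averaging')."""
    n_total = sum(len(c) for c in clusters)
    weighted = [len(c) / n_total * radius(c) for c in clusters]
    return sum(weighted) / len(weighted)


# One dense cluster of two points and one isolated (noise-like) point.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 10.0)]]
r_l = location_merge_range(clusters)
```

With this weighting, a cluster holding a single isolated point contributes only a small share to the result, which matches the stated goal of limiting the effect of noise points.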
In step 2.2), the merging range of the time attribute is obtained as follows:
The merging range of the time attribute is, in effect, the time span within which small files should be merged together. As with the location attribute above, it is obtained by clustering the time attributes of the historical small-file access information, except that the clustered objects change from two-dimensional longitude and latitude coordinates to one-dimensional time attributes.
A. Time attribute clustering based on AGNES algorithm
As in the location attribute clustering above, the set of time attributes contained in the type-$s_i$ access requests can be expressed as:
$$T_{s_i}=\{t_1,t_2,t_3,\dots\}$$
First, the similarity between clusters is defined, where each cluster is a set of time attributes. The similarity between two clusters is given by the group mean time difference: the smaller the group mean time difference, the higher the cluster similarity. Define the time difference between any two time attribute points $t$ and $t'$:
d(t,t′)=|t-t′| (12)
Suppose cluster $C_m$ contains the time attribute point set $C_m=(t_1,t_2,t_3,\dots)$ and cluster $C_n$ contains the time attribute point set $C_n=(t'_1,t'_2,t'_3,\dots)$, and the two clusters share no elements. Then the group mean time difference between clusters $C_m$ and $C_n$ is:
$$D(C_m,C_n)=\frac{1}{N_mN_n}\sum_{t\in C_m}\sum_{t'\in C_n}d(t,t')$$
where $N_m=\mathrm{card}(C_m)$ is the number of time attribute points in cluster $C_m$ and $N_n=\mathrm{card}(C_n)$ is the number of time attribute points in cluster $C_n$. At this point, the time range radius of cluster $C_m=(t_1,t_2,t_3,\dots)$ is:
B. Generation of the merging range
As in the calculation for the location attribute above, and following the principle of the AGNES algorithm, for the spatio-temporal small data files with type attribute $s_i$, $s_i\in S$, a time difference threshold is defined as the average of the differences between all time points in the time attribute set of the type-$s_i$ access requests, and this average is used as the termination condition of the clustering algorithm.
If the group mean time difference between two clusters is greater than this threshold, they cannot be merged into one cluster. Clustering ends when no further clusters can be merged.
Next, the AGNES algorithm is used to cluster the time attribute sequence of the access requests that contain the type attribute $s_i$, so that the clustering result distinguishes the time span involved in each application-service access by users to this type of data. The clustering process is the same as for the location attribute above.
Assume that the cluster set generated after the clustering process ends is $\{C_1,C_2,\dots,C_{K_t}\}$. The time span of each cluster is weighted by the density of the time attribute points in the cluster: the greater the density, the greater the weight. For cluster $C_k$, $1\le k\le K_t$, the time range radius weighted by access density is:
$$R_w(C_k)=\frac{N_k}{N}\,R(C_k)$$
where $N_k$ is the number of time attributes contained in cluster $C_k$, $N$ is the number of time attribute points contained in the time attribute sequence, and $R(C_k)$ is the time range radius of cluster $C_k$. Finally, the weighted time range radii of all clusters in the set are averaged to obtain the time attribute merging range corresponding to the spatio-temporal small data files with type attribute $s_i$, $s_i\in S$.
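For the one-dimensional time attribute the whole procedure fits in a short sketch. The assumptions are labeled in the code: the threshold is the mean absolute difference over all unordered pairs of time points, the time range radius of a cluster is half of its span, and the merging range is the arithmetic mean of the weighted radii. These details, and the function names, are illustrative guesses rather than the patent's definitions.

```python
from itertools import combinations


def group_avg_diff(cm, cn):
    """Group mean time difference between two clusters of time points."""
    return sum(abs(t - u) for t in cm for u in cn) / (len(cm) * len(cn))


def cluster_times(times):
    """Average-linkage clustering of time points with a mean-difference threshold."""
    pairs = list(combinations(times, 2))
    # ASSUMPTION: threshold = mean |t - t'| over all unordered pairs.
    threshold = sum(abs(t - u) for t, u in pairs) / len(pairs)
    clusters = [[t] for t in times]
    while len(clusters) > 1:
        d, i, j = min(
            (group_avg_diff(clusters[i], clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if d > threshold:
            break
        merged = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i stays valid
        clusters[i] = merged
    return clusters


def time_merge_range(clusters):
    n_total = sum(len(c) for c in clusters)
    # ASSUMPTION: time range radius of a cluster = half of its span.
    weighted = [len(c) / n_total * (max(c) - min(c)) / 2 for c in clusters]
    return sum(weighted) / len(weighted)


times = [0, 5, 10, 100, 105]
time_clusters = cluster_times(times)
r_t = time_merge_range(time_clusters)
```

In this toy input the accesses fall into an early burst and a late burst, and the resulting merging range reflects the typical width of a burst rather than the gap between bursts.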
In step 2.3), the small files with type attribute $s_i$ are merged according to the merging ranges of the location attribute and the time attribute, as follows:
Suppose that, in the small-file dataset $F=\{f_1,f_2,\dots,f_M\}$, the spatio-temporal small data files with type attribute $s_i$, $s_i\in S$, have location attribute set $L_{s_i}$ and time attribute set $T_{s_i}$. Given the location attribute merging range $R_l$ and the time attribute merging range $R_t$ mined in step 2.2), the small files with type attribute $s_i$ are merged as follows:
(1c) create a file;
(2c) taking the earliest time attribute $t_v$ in the time attribute set $T_{s_i}$ as the reference point, find the time attributes whose time span from $t_v$ is less than or equal to $R_t$; these form the set $Range\_t_v$;
(3c) taking any location attribute $l_u$ in the location attribute set $L_{s_i}$ as the reference point, find the location attributes whose spatial distance from $l_u$ is less than or equal to $R_l$; these form the set $Range\_l_u$;
(4c) merge the spatio-temporal small files whose location attribute belongs to $Range\_l_u$ and whose time attribute belongs to $Range\_t_v$ into the file created in step (1c);
(5c) if the total size of the merged file is larger than the predefined large-file storage capacity, jump to step (6c); if the total size of the merged file is smaller than the predefined large-file storage capacity, jump to step (7c);
(6c) delete from $T_{s_i}$ the time attributes in $Range\_t_v$ that have already taken part in the merge, keep the current location attribute reference point $l_u$ unchanged, and repeat steps (1c)-(5c);
(7c) delete the time attributes in $Range\_t_v$ from $T_{s_i}$ and repeat steps (2c)-(5c) until all small files within the location attribute set $Range\_l_u$ and the time attribute set $T_{s_i}$ have been merged;
(8c) if the total size of the files merged so far still has not reached the predefined large-file storage capacity, delete the location attributes in $Range\_l_u$ from $L_{s_i}$, reset the time attribute set $T_{s_i}$, and perform steps (2c)-(5c);
(9c) repeat steps (1c)-(8c) until all small files with location attribute set $L_{s_i}$, time attribute set $T_{s_i}$, and type attribute $s_i$ have been merged.
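The control flow of steps (1c)-(9c) can be approximated by the following simplified sketch. It is an interpretation, not the patented procedure: files are plain tuples, the reference location is simply the location of the earliest remaining file, a large file is closed as soon as its size reaches the capacity, and a partly filled large file is carried over to the next location range. The names `merge_small_files`, `r_l`, `r_t` and `capacity` are hypothetical.

```python
import math


def merge_small_files(files, r_l, r_t, capacity):
    """files: list of (name, (x, y), t, size). Returns large files as lists of names."""
    remaining = sorted(files, key=lambda f: f[2])  # earliest files first
    large_files, current, size = [], [], 0         # (1c) open a large file
    while remaining:
        l_u = remaining[0][1]                       # (3c) reference location

        def near(f):
            return math.hypot(f[1][0] - l_u[0], f[1][1] - l_u[1]) <= r_l

        in_range = [f for f in remaining if near(f)]        # files in Range_l_u
        remaining = [f for f in remaining if not near(f)]   # (8c) drop Range_l_u
        while in_range:
            t_v = in_range[0][2]                    # (2c) earliest time attribute
            window = [f for f in in_range if f[2] - t_v <= r_t]
            in_range = in_range[len(window):]       # (6c)/(7c) drop Range_t_v
            current += [f[0] for f in window]       # (4c) merge into the large file
            size += sum(f[3] for f in window)
            if size >= capacity:                    # (5c) large file is full
                large_files.append(current)
                current, size = [], 0               # (1c) open the next large file
    if current:
        large_files.append(current)                 # final, partly filled large file
    return large_files


files = [
    ("a", (0, 0), 0, 40),
    ("b", (0, 1), 5, 40),
    ("c", (0, 0), 50, 40),
    ("d", (9, 9), 1, 40),
]
result = merge_small_files(files, r_l=2, r_t=10, capacity=64)
print(result)
```

In the example, files `a` and `b` are close in both space and time and fill the first large file; `c` is nearby but much later, so it starts a second large file that is then completed with `d` from the next location range.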
Using the above method of generating merging rules, the space-time merging range corresponding to the small files of each type attribute in $S=\{s_1,s_2,s_3,\dots\}$ can be calculated in turn, and the files merged according to the merging steps above. In this way, all files in the small-file dataset $F=\{f_1,f_2,\dots,f_M\}$ are merged into a large-file set $F'=\{F_1,F_2,\dots,F_N\}$.
After merging, a local index (Internal Index) is built inside each large file to store the length of every small file and its offset within the large file, so that the application server can locate the required small-file data directly through the file's internal index when processing a data request. The merged large files should be placed, as far as possible, in a contiguous storage interval on the same node of the underlying data storage server. The reason is that small files that were originally consecutive may end up at the edges of adjacent large files; a client accessing them then has to read across large files, or even across data storage servers, which degrades small-file access performance. The structure of the merged large file is shown in FIG. 3.
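The role of the local index can be illustrated with a small sketch. For simplicity the index is kept here as a separate in-memory dictionary rather than serialized into the large file itself, which differs from the layout of FIG. 3; the function names and sample file names are hypothetical.

```python
import io


def pack(small_files):
    """Concatenate small files; return (blob, index) with index[name] = (offset, length)."""
    blob = io.BytesIO()
    index = {}
    for name, data in small_files.items():
        index[name] = (blob.tell(), len(data))  # offset inside the large file, length
        blob.write(data)
    return blob.getvalue(), index


def read_small_file(blob, index, name):
    """Locate one small file directly through the index, without scanning the blob."""
    offset, length = index[name]
    return blob[offset:offset + length]


small_files = {
    "s1_t0.dat": b"temperature 21.5",
    "s1_t1.dat": b"temperature 21.7",
    "s1_t2.dat": b"temperature 22.0",
}
blob, index = pack(small_files)
payload = read_small_file(blob, index, "s1_t1.dat")
```

Because every lookup is a single offset computation, reading one small file does not require contacting the metadata server again once the large file has been located.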
Experiment
The experimental data comes from the Wuhan smart city network application demonstration platform, which includes 14 sensors in different regions, has generated sensor data since January 1, 2010, and provides 20 predefined applications to the public. The data all have obvious spatio-temporal attributes; most are typical spatio-temporal small data files (generally no larger than 4 MB), which are numerous while each occupies little storage space.
Taking meteorological monitoring sensor data as an example, 5 collection points are deployed in different areas of Wuhan, each with seven types of sensors and a sampling interval of 5 minutes. The data collected by one sensor in one day forms one file of 3.2 KB to 5.8 KB, 4.25 KB on average, and data for 10,000 collection points was subsequently simulated by spatio-temporal kriging interpolation. In total, all collection points produced about 12,740,058 data files from 1/2017 to 1/6/2017, with a total size of about 53.68 GB. FIG. 4 shows the size distribution of the files stored in the system.
After processing, we obtained 4,076,328 small-file access records from users for the period from 3/1/2017 to 6/1/2017.
The experimental evaluation consists of three parts: MDS storage load, file storage speed, and average response time of file access requests. The latter two tests cover both the single-user case and the concurrent multi-user case, where multi-user concurrency is simulated with multiple processes on a single client. Each experiment was repeated 10 times and the results averaged, and the results are compared with the original HDFS and with the HAR archive file technique described in the literature (Hadoop architecture. Hadoop archives guide [EB/OL], http://hadoop.apache.org/common/docs/current/hadoop_archives.html. 2011.).
MDS storage load
Merging small files effectively reduces the amount of metadata in the MDS, so the MDS storage load is measured by the memory consumed in the MDS. We tested the memory consumption of the MDS when the system stores 5,000, 10,000, 15,000, 20,000, 25,000 and 30,000 small files. The results are shown in FIG. 5.
As the figure shows, when no files are stored, the system itself consumes about 4.2 MB of memory, and memory usage grows linearly with the number of stored files. In conventional HDFS, every stored small file occupies one object in memory, so memory usage is very large. The HAR archiving technique and the algorithm proposed here both merge small files, so the objects kept in memory are the merged large files, which reduces the number of objects in MDS memory. Although both approaches merge files, and their merged large files contain equally many small files, the proposed algorithm differs from HAR in that it builds a local index for the merged small files (the directory information of the small files is placed inside the merged large file), which reduces memory overhead further compared with HAR.
Storage speed of file
The file storage test is carried out by a Client of the application demonstration platform's storage system (not by a user; users can only access the predefined application services through the network platform). In the test, 100,000 small data files with a total size of 0.396 GB are written to the storage system in one pass, and the average file storage speed is calculated. The results are shown in FIG. 6.
As the figure shows, regardless of whether files are merged, the storage system reaches its maximum write rate at 5 concurrent users, and the total transfer efficiency stabilizes as the number of concurrent users increases. Second, because they merge small files, both HAR and the proposed algorithm store files faster than conventional HDFS. HAR is fast because its merge mechanism simply packs multiple small files into one file and writes it to HDFS, but it has the drawbacks that a merged file has to be re-created if it is modified and that it is not well suited to subsequent small-file reads. The proposed algorithm takes the spatio-temporal attributes of files into account before merging, which lowers the storage speed to some extent; for a large data set, however, the effect on overall write speed is small, and storage efficiency remains very high.
Average response time
The purpose of merging is to reduce the average response time of user access, for which the total average response time of the system when a single user accesses 500-.
As the figure shows, the total average response time grows linearly with the number of small files; the proposed merging algorithm has the smallest total response time for small-file access, followed by HAR, with the original HDFS last. With too many small files, metadata retrieval in the MDS becomes complicated and the system communicates internally very often, so most of the time goes to system overhead and read time increases. The HAR archiving technique stores files efficiently, but when reading, although each access retrieves one large file, the correlation between files was not considered during archiving, so the small files in that large file are often not the ones the application service needs. This lowers the read hit rate of large files, increases the number of Client-MDS communications, and slows data reads. The proposed merging algorithm analyzes the spatio-temporal correlation of user accesses and the attributes of the files themselves, and merges associated files together as far as possible. The large file obtained by each application-service access request then contains the small files the service needs next, which avoids frequent Client-MDS communication, shortens the response time of the application service, and effectively improves the system's small-file access performance.
Conclusion
The method combines the spatio-temporal attributes of the files themselves with the spatio-temporal characteristics of user access. On the one hand, it solves the problem of choosing the space-time granularity for merging different types of files; on the other hand, it allows all files in the system to be merged, whether or not users have accessed them.
In general, merging small files according to their spatio-temporal attributes is less efficient for storage than merging them directly, but it greatly improves read efficiency. Since the purpose of merging spatio-temporal small data files in a smart city is to reduce user access delay, the proposed algorithm is clearly better suited to this application scenario.
In addition to mining the access-related space-time range from historical user access information with a clustering algorithm (hierarchical clustering AGNES), the invention can also use other data mining algorithms, such as the association rule algorithms Apriori, FP-Growth and Can-Tree, or other clustering algorithms such as the density-based DBSCAN. The applicant believes that, under the guidance of the inventive concept, it is possible to use these general data mining algorithms to mine access-related files, calculate the space-time ranges they cover, and finally use those ranges to guide the merging of small files; the details are therefore omitted.
Claims (4)
1. A method for merging small spatio-temporal data files in a smart city, characterized in that a data mining algorithm is used to mine a space-time range with access correlation from the historical access information of small files, and the small files within that space-time range are then merged;
the data mining algorithm adopts a hierarchical clustering algorithm AGNES in clustering;
the process of mining a space-time range with access correlation from historical access information of the small files by using a hierarchical clustering algorithm AGNES, and then merging the small files in the space-time range comprises the following steps:
1) carrying out parameterization representation and space-time attribute extraction on historical user access information;
according to the definition of the space-time data small files, each file comprises an inherent position attribute l, a type attribute s and a time attribute t, so that any small file can be represented by three space-time elements (l, s and t);
suppose the set of spatio-temporal small data files generated in a smart city is $F=\{f_1,f_2,\dots,f_m\}$, with location attribute set $L=\{l_1,l_2,\dots,l_m\}$, time attribute set $T=\{t_1,t_2,\dots,t_m\}$, and type attribute set $S=\{s_1,s_2,\dots,s_m\}$, and user access to the application services in the smart city generates a small-file access request sequence $A=(a_1,a_2,\dots,a_n)$, where each request item $a_i$, $1\le i\le n$, corresponds to a spatio-temporal small data file $f_i$; after parameterization and extraction, the request sequence forms the space-time attribute sequence:
$$A=(a_1,a_2,\dots,a_n)=((l_1,s_1,t_1),(l_2,s_2,t_2),\dots,(l_n,s_n,t_n)) \qquad (1);$$
2) file merging
2.1) type attribute classification: from the historical small-file access request sequence $A=(a_1,a_2,\dots,a_n)$, separate out the access request sequence $A_{s_i}$ whose type attribute is $s_i$, $s_i\in S$;
2.2) space-time clustering: use the hierarchical clustering algorithm AGNES to cluster the location attributes and the time attributes of the access request sequence $A_{s_i}$ separately, perform a weighted calculation based on access density on the clustering results, and then obtain the merging range of the location attribute and the merging range of the time attribute from the weighted results;
2.3) small file merging: merge the small files whose type attribute is $s_i$ according to the merging ranges of the location attribute and the time attribute;
2.4) repeat steps 2.1)-2.3) to calculate the space-time merging ranges of the small files of each type attribute, merge them accordingly, and build indexes;
in step 2.2), the merging range of the position attribute is obtained by the following method:
(1a) denote the set of location attributes contained in the requests $A_{s_i}$ as $L_{s_i}$, and take each coordinate in the location attribute set $L_{s_i}$ as a cluster;
(2a) calculate the group average distance between every pair of clusters, and merge the two clusters with the shortest distance;
(3a) repeat step (2a) until the group average distance between any two clusters is greater than a predefined distance threshold, at which point the clustering algorithm ends; the predefined distance threshold is the average of the distances between all coordinate points in the location attribute set $L_{s_i}$;
(4a) assuming that the cluster set generated after the clustering process of step (3a) ends is $\{C_1,C_2,\dots,C_{K_l}\}$, use this cluster set to calculate the average spatial range, weighting the spatial range radius of each cluster by user access heat, that is, the density (number) of coordinate points in the cluster, where a greater density gives a greater weight;
(5a) finally, average the weighted spatial range radii of all clusters in the cluster set to obtain the location attribute merging range corresponding to the spatio-temporal small data files with type attribute $s_i$, $s_i\in S$;
in step 2.2), the merging range of the time attribute is obtained as follows:
(1b) obtaining the time attribute set of the small files being processed, and treating each time point in the time attribute set as a separate cluster;
(2b) calculating the group average time difference between every pair of clusters, and merging the two clusters with the smallest time difference;
(3b) repeating step (2b) until the group average time difference between any two clusters is greater than a predefined time difference threshold, at which point the clustering algorithm ends; the predefined time difference threshold is the average of the differences between all time points in the time attribute set;
(4b) taking the cluster set produced when the clustering in step (3b) ends, calculating the average time span range of the cluster set, where the time-span radius of each cluster is weighted by user access heat, that is, by the density of time attribute points in the cluster: the greater the density, the greater the weight;
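The time-attribute steps (1b) to (4b) follow the same pattern in one dimension. The sketch below is illustrative only. It assumes timestamps are plain numbers (for example, seconds), takes a cluster's time-span radius to be half the difference between its latest and earliest point, and combines the radii as a weighted mean by analogy with step (5a); the text above does not spell out that final step for time, so it is an assumption. The function names are hypothetical.

```python
from itertools import combinations


def avg_gap(a, b):
    """Group average time difference between clusters a and b."""
    return sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))


def time_merge_range(times):
    """Return a density-weighted average time-span radius."""
    if len(times) < 2:
        return 0.0
    # Threshold from step (3b): mean difference over all pairs of time points.
    gaps = [abs(x - y) for x, y in combinations(times, 2)]
    threshold = sum(gaps) / len(gaps)

    clusters = [[t] for t in times]             # step (1b)
    while len(clusters) > 1:
        # Step (2b): closest pair of clusters by group average difference.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: avg_gap(clusters[ij[0]], clusters[ij[1]]),
        )
        if avg_gap(clusters[i], clusters[j]) > threshold:   # step (3b)
            break
        merged = clusters.pop(j)                # j > i, so index i stays valid
        clusters[i].extend(merged)

    # Step (4b): weight each radius by the cluster's share of time points.
    # Assumption: radius = half of (latest - earliest) within the cluster.
    n = len(times)
    return sum((max(c) - min(c)) / 2 * len(c) / n for c in clusters)


if __name__ == "__main__":
    print(time_merge_range([0, 10, 20, 1000, 1010]))
```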
2. Merging method according to claim 1, characterized in that step 2.3) is implemented as follows:
let the small file data set be F = {f_1, f_2, …, f_M}; for the space-time data small files in F whose type attribute is s_i, where s_i belongs to S, take their location attribute set and their time attribute set; according to the location attribute merging range and the time attribute merging range mined in step 2.2), the small files whose type attribute is s_i are merged in the following steps:
(1c) creating a file;
(2c) taking the earliest time attribute t_v in the time attribute set as the reference point, finding the time attributes whose time span from t_v is less than or equal to the time attribute merging range, which constitute a set Range_t_v;
(3c) taking any location attribute l_u in the location attribute set as the reference point, finding the location attributes whose spatial distance from l_u is less than or equal to the location attribute merging range, which constitute a set Range_l_u;
(4c) merging the space-time small files whose location attribute belongs to Range_l_u and whose time attribute belongs to Range_t_v into the file created in step (1c);
(5c) if the total size of the merged file is larger than the predefined large file storage capacity, jumping to step (6c); if the total size of the merged file is smaller than the predefined large file storage capacity, jumping to step (7c);
(6c) deleting from the time attribute set the time attributes in Range_t_v that have already participated in merging, keeping the current location attribute reference point l_u unchanged, and executing steps (1c) to (5c) in a loop;
(7c) deleting the time attributes in Range_t_v from the time attribute set, and executing steps (2c) to (5c) in a loop until all small files whose location attribute is in Range_l_u and whose time attribute is in the time attribute set have been merged;
(8c) if the total size of the merged file has still not reached the predefined large file storage capacity, deleting the location attributes in Range_l_u from the location attribute set, resetting the time attribute set, and executing steps (2c) to (5c);
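The control flow of steps (1c) to (8c) can be sketched as a greedy packing loop. This is a simplified reading rather than the claimed procedure: it tracks small files directly instead of separate attribute sets, closes a large file as soon as its size reaches or exceeds the capacity, and lets the last large file stay under capacity. `SmallFile`, `merge_small_files`, and all parameter names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SmallFile:
    name: str
    loc: tuple      # (x, y) location attribute
    time: float     # time attribute, e.g. seconds
    size: int       # size in bytes


def merge_small_files(files, loc_range, time_range, capacity):
    """Group small files of one type into large files by space, then time."""
    remaining = list(files)
    large_files = []
    current, current_size = [], 0               # step (1c): open a large file

    while remaining:
        # Step (3c): pick any location as reference and collect its neighbours.
        ref_loc = remaining[0].loc
        near = sorted(
            (f for f in remaining if math.dist(f.loc, ref_loc) <= loc_range),
            key=lambda f: f.time,
        )
        while near:
            # Step (2c): earliest remaining time is the reference point.
            ref_time = near[0].time
            window = [f for f in near if f.time - ref_time <= time_range]
            # Step (4c): merge the files in the space-time window.
            current.extend(window)
            current_size += sum(f.size for f in window)
            # Steps (6c)/(7c): drop the merged time window (a sorted prefix).
            near = near[len(window):]
            # Steps (5c)/(6c): capacity reached, so start a new large file.
            if current_size >= capacity:
                large_files.append(current)
                current, current_size = [], 0
        # Step (8c): drop this neighbourhood and keep filling the open file.
        remaining = [f for f in remaining
                     if math.dist(f.loc, ref_loc) > loc_range]

    if current:
        large_files.append(current)
    return large_files


if __name__ == "__main__":
    files = [
        SmallFile("a", (0, 0), 0, 40),
        SmallFile("b", (0, 1), 5, 40),
        SmallFile("c", (0, 0), 100, 40),
        SmallFile("d", (50, 50), 0, 40),
    ]
    for group in merge_small_files(files, loc_range=2, time_range=10, capacity=100):
        print([f.name for f in group])
```

In the sample run, files a and b fall into the first space-time window, c joins them in the next window and pushes the large file past the capacity, and d ends up in a second large file.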
3. The merging method according to claim 2, wherein in step 2.4), the index is built inside the large file generated by merging, and the index stores the length of each small file and its offset position inside the large file.
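One possible layout for such an in-file index is sketched below, purely as an illustration. The claim only requires that the index record each small file's length and offset; the 4-byte length prefix, the JSON-encoded index, and the function names `pack` and `read` are assumptions made for this example.

```python
import json
import struct


def pack(small_files):
    """Concatenate small files and prepend an index of offsets and lengths.

    small_files maps a file name to its bytes content.
    Layout (assumed): 4-byte big-endian index size, JSON index, payload.
    """
    index, payload, offset = {}, bytearray(), 0
    for name, data in small_files.items():
        index[name] = {"offset": offset, "length": len(data)}
        payload += data
        offset += len(data)
    header = json.dumps(index).encode("utf-8")
    return struct.pack(">I", len(header)) + header + bytes(payload)


def read(blob, name):
    """Read one small file back using only the index, without scanning."""
    (header_len,) = struct.unpack_from(">I", blob, 0)
    index = json.loads(blob[4:4 + header_len].decode("utf-8"))
    entry = index[name]
    start = 4 + header_len + entry["offset"]
    return blob[start:start + entry["length"]]


if __name__ == "__main__":
    blob = pack({"a.dat": b"hello", "b.dat": b"world!"})
    print(read(blob, "b.dat"))
```

Because the offset and length are stored together, a reader can seek directly to a small file inside the large file instead of parsing everything that precedes it.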
4. The merging method according to claim 3, wherein the large files generated after merging are placed in a continuous storage interval of the same node in the underlying data storage server.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810057495X | 2018-01-22 | ||
CN201810057495 | 2018-01-22 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108460121A (en) | 2018-08-28 |
CN108460121B (en) | 2022-02-08 |
Family
ID=63216584
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810154658.6A (Active, CN108460121B (en)) | Little file merging method for space-time data in smart city | 2018-01-22 | 2018-02-23 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108460121B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109783810B (en) * | 2018-12-26 | 2022-11-11 | 北京明略软件系统有限公司 | Text processing method and device and computer readable storage medium |
CN110297810B (en) * | 2019-07-05 | 2022-01-18 | 联想(北京)有限公司 | Stream data processing method and device and electronic equipment |
CN110334133B (en) * | 2019-07-11 | 2020-11-20 | 北京京东智能城市大数据研究院 | Rule mining method and device, electronic equipment and computer-readable storage medium |
US11562094B2 (en) | 2019-12-31 | 2023-01-24 | International Business Machines Corporation | Geography aware file dissemination |
CN112017044A (en) * | 2020-08-12 | 2020-12-01 | 西华大学 | Block chain user participation degree evaluation method based on AGNES and DBSCAN algorithm |
CN113810488B (en) * | 2021-09-14 | 2023-10-24 | 东北电力大学 | Resource searching system based on interest cluster-hotchain and construction method thereof |
CN113778949A (en) * | 2021-09-27 | 2021-12-10 | 武汉英仕达信息技术有限公司 | Data middleware system for Internet of things |
CN116540260B (en) * | 2023-04-20 | 2024-08-06 | 中国人民解放军国防科技大学 | Three-dimensional imaging method, system and medium based on single-line laser radar |
CN116721001B (en) * | 2023-08-10 | 2023-11-17 | 江苏网进科技股份有限公司 | Smart city resource management method based on digital twinning |
CN118395232A (en) * | 2024-04-26 | 2024-07-26 | 中企知研(北京)科技有限公司 | Digital service resource optimized storage method and system |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104679895A (en) * | 2015-03-18 | 2015-06-03 | 成都影泰科技有限公司 | Medical image data storing method |
CN104765876A (en) * | 2015-04-24 | 2015-07-08 | 中国人民解放军信息工程大学 | Massive GNSS small file cloud storage method |
CN106021585A (en) * | 2016-06-02 | 2016-10-12 | 同济大学 | Traffic incident video access method and system based on time-space characteristics |
CN106528756A (en) * | 2016-11-07 | 2017-03-22 | 王昱淇 | Network map data organization method based on space-time relevance |
CN106933511A (en) * | 2017-02-27 | 2017-07-07 | 武汉大学 | Consider the GML data storage method for organizing and system of load balancing and disk efficiency |
Non-Patent Citations (3)
Title |
---|
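Access-task-oriented small file merging and prefetching strategy in cloud storage; Wang Tao et al.; Geomatics and Information Science of Wuhan University; 2013-12-31; pp. 1504-1508 * |
Research on a merging strategy for massive spatiotemporal small files in a digital standard platform; Gu Xin et al.; Application Research of Computers; 2014-11-30; pp. 3340-3343 * |
Gu Xin et al. Research on a merging strategy for massive spatiotemporal small files in a digital standard platform. Application Research of Computers. 2014, * |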
Also Published As
Publication number | Publication date |
---|---|
CN108460121A (en) | 2018-08-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108460121B (en) | Little file merging method for space-time data in smart city | |
US11429630B2 (en) | Tiered storage for data processing | |
Li et al. | A prefetching model based on access popularity for geospatial data in a cluster-based caching system | |
CN104065568A (en) | Web server cluster routing method | |
JP5137339B2 (en) | Server, system and method for retrieving clustered vector data | |
Malensek et al. | Analytic queries over geospatial time-series data using distributed hash tables | |
Buddhika et al. | Synopsis: A distributed sketch over voluminous spatiotemporal observational streams | |
Malensek et al. | Expressive query support for multidimensional data in distributed hash tables | |
Havers et al. | DRIVEN: A framework for efficient Data Retrieval and clustering in Vehicular Networks | |
Sarwat | Interactive and scalable exploration of big spatial data--a data management perspective | |
Sabarish et al. | Clustering of trajectory data using hierarchical approaches | |
Azari et al. | A data replication algorithm for groups of files in data grids | |
Sun | Personalized music recommendation algorithm based on spark platform | |
CN109218366A (en) | Monitor video temperature cloud storage method based on k mean value | |
Xu et al. | Adaptive and scalable load balancing for metadata server cluster in cloud-scale file systems | |
Xiong et al. | A small file merging strategy for spatiotemporal data in smart health | |
He et al. | Dynamic multidimensional index for large-scale cloud data | |
Malensek et al. | Autonomously improving query evaluations over multidimensional data in distributed hash tables | |
Miao | Clustering of different dimensional variables based on distance correlation coefficient | |
Blank et al. | Using summaries to search and visualize distributed resources addressing spatial and multimedia features | |
CN102096723A (en) | Data query method based on copy replication algorithm | |
Li et al. | Mhb-tree: A distributed spatial index method for document based nosql database system | |
Cui et al. | Controllable Clustering Algorithm for Associated Real‐Time Streaming Big Data Based on Multi‐Source Data Fusion | |
Li et al. | An efficient scheme for probabilistic skyline queries over distributed uncertain data | |
Zhang | Large data oriented to image information fusion spark and improved fruit fly optimization based on the density clustering algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||