CN108460121B

CN108460121B - Small file merging method for space-time data in smart city

Info

Publication number: CN108460121B
Application number: CN201810154658.6A
Authority: CN
Inventors: 熊炼; 熊珊; 国代新
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2018-01-22
Filing date: 2018-02-23
Publication date: 2022-02-08
Anticipated expiration: 2038-02-23
Also published as: CN108460121A

Abstract

The invention discloses a method for merging small files of space-time data in a smart city, wherein the small file merging is used as a microscopic data layout mechanism, so that the I/O performance of a system can be effectively improved, and the access delay of a user is reduced. The method comprises the steps of carrying out parameterization representation and space-time attribute extraction on historical user access information by analyzing characteristics of space-time data, carrying out hierarchical clustering on the access information by utilizing an AGNES algorithm in a space-time attribute domain, carrying out access density-based weighted calculation on a clustering result, and finding out an access-related space-time range. And finally, guiding the combination of the small files by utilizing the space-time range. Experimental results show that the algorithm is simple and efficient, and the access efficiency of the small spatiotemporal data files in the system is greatly improved.

Description

Small file merging method for space-time data in smart city

Technical Field

The invention relates to the field of research on a small file merging strategy of space-time data in a smart city.

Background

In a smart city based on the internet of things and cloud computing, ubiquitous sensors generate sensing data with three inherent properties of time, space and type, and the sensing data are small in size (usually dozens to hundreds of KB), numerous in variety, huge in quantity, high in redundancy and dynamically increased along with time, and belong to typical small spatiotemporal data files.

Current mainstream distributed file systems are designed around large files in their implementation strategies for metadata management, data layout, stripe design, cache management and the like. Common distributed file systems include Google GFS, Hadoop HDFS, PVFS and Lustre. These file systems all employ a centralized, master-slave data management mechanism that stores file metadata (data describing data, such as namespaces, access control information, file locations and sizes) separately from the data block files. The master in the system is the Metadata Server (MDS), which, in addition to storing file metadata, maintains information such as the IP addresses and states of the data storage nodes. The workers are Data Storage Servers (DSS). A typical master-slave distributed file system and its access mechanism are shown in FIG. 1. As the figure shows, each time a Client issues a file access request, it must first communicate with the MDS to obtain metadata and only then establish a file transmission link with the DSS. Consequently, large-scale, highly concurrent small file access requests cause frequent Client-MDS communication that consumes the system's limited bandwidth and computing resources; the MDS becomes the performance bottleneck, data access performance degrades severely, and file access response time increases.

Massive space-time data small files and their applications bring convenience to people's lives, but they also seriously degrade the access performance of the system. This manifests mainly in three ways. High memory occupancy: a large number of small files occupies a large amount of metadata server memory, and the total number of files the system can store is limited by that memory capacity. Heavy metadata server load: every file operation goes through the metadata server, and the frequent interaction overloads it, so that it easily becomes the bottleneck of the access performance of the whole system. Low file access efficiency: every store and read of a file requires communication with the metadata server, so most of the time is spent on system overhead rather than on the short transfer of the file data itself.

Research has shown (Wang F, Xin Q, Hong B, et al. File System Workload Analysis for Large Scale Scientific Computing Applications [C]. IEEE, 2004: 139-152.) that, in application service systems based on small files, user requests for small files exceed 90% of all requests, while the amount of data accessed is less than 10% of all accessed data. The data access performance of the system is therefore seriously affected by massive numbers of small files. Small file merging, as a microscopic data layout mechanism, combines a number of different small files into one large file; this reduces the communication frequency between the Client and the metadata server, relieves the MDS load, and improves the access performance of small files. However, existing research on small file merging focuses on improving the storage system structure and on analyzing the characteristics of the files themselves. Current research on the small file problem can be summarized into two categories:

(1) improved system architecture

Ma et al. (Ma, Meng, Xiong, et al. Dawning Nebula distributed file system: storage of massive small files [J]. Journal of Chinese Computer Systems, 2012, 7(33): 1481-1482.), addressing the access delay of massive small files, organize and manage metadata through an improved distributed extensible hash and propose HVFS, a file system based on distributed table storage, achieving efficient access to small files. Zhang et al. (Zhang Zhao, Zyulien, Li wenjuan, et al. Small-file-oriented cloud storage system based on peer-to-peer network [J]. Journal of Zhejiang University (Engineering Science), 2013(1): 214-215.) introduce a central routing node that stores the routing and state information of all nodes in the system and combine it with a client-side routing information prefetching mechanism; this shortens resource query time and addresses small file access efficiency in distributed cloud storage systems based on a peer-to-peer (P2P) network, but the access performance of the method is limited by the central node and its cost is relatively high. Fu et al. (Fu Songling, Liao Xiangke, Huang Chenlin, et al. FlatLFS: a lightweight file system optimized for processing massive small files [J]. Journal of National University of Defense Technology, 2013, 35(2): 120-126.) abandon the hierarchical file management mode of traditional file systems and design FlatLFS, a lightweight file system with flat data storage, gaining efficient small file access at the cost of flexibility. Zhao et al (Zhao Yuan Ling, Xia Ling, Chua Yongji, et al. A study of small file storage access strategy with optimized performance [J]. Journal of Computer Research and Development, 2012, 49(7): 1579-. Zhang et al (Zhang Z H, Ghose K. hFS: A Hybrid File System Prototype for Improving Small File and Metadata Performance [C]. ACM Proceedings of the 2007 EuroSys Conference on Operating Systems Review, 2007, 175-.

(2) Merging using file self-properties

Small file merging combines a number of small files into one large file and stores that large file in the DSS. On one hand, the Client can obtain the metadata of many small files in a single communication with the MDS, which avoids transmitting only a small amount of data in each interaction and improves the bandwidth utilization of the system. On the other hand, the MDS is the management center of the distributed file system and an overloaded MDS degrades system performance; small file merging reduces the amount of file metadata stored in the MDS and thus reduces its storage load.
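The reduction in metadata can be illustrated with a rough, back-of-the-envelope sketch. The figures below (one million files of 100 KB each, 64 MB merged files) are hypothetical values chosen only for illustration, and the function name is an assumption, not part of the described system.

```python
import math

def merged_file_count(num_small_files: int, avg_size_kb: float,
                      large_file_mb: int = 64) -> int:
    """Estimate how many large files hold the given small files.

    Assumes (hypothetically) that small files are packed back to back
    with no padding and no per-file index overhead.
    """
    total_kb = num_small_files * avg_size_kb
    return math.ceil(total_kb / (large_file_mb * 1024))

# Hypothetical workload: one million sensor files of about 100 KB each.
before = 1_000_000                      # one metadata entry per small file
after = merged_file_count(before, 100)  # one metadata entry per large file
print(before, after)
```

Under these assumptions the MDS tracks on the order of a thousand large files instead of a million small ones; locating an individual small file then relies on an index kept with the large file.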

The earliest small file merging methods include the Hadoop Archive (HAR) archive file technique, the SequenceFile technique, and MapFile. Subsequently, Yu et al. (Yu Si, Gui Xiaolin, Huang Ruwei, et al. A scheme for improving the storage efficiency of small files in cloud storage [J]. Journal of Xi'an Jiaotong University, 2011, 45(6): 59-60.) jointly consider the read time, merge time and memory occupancy of small files, apply multi-attribute decision theory, and merge small files into a large file using the SequenceFile technique. The method reduces memory consumption and improves the storage efficiency of small files, but offers nothing for improving file read efficiency. Jiang et al. (Jiang L, Li B, Song M L. The Optimization of HDFS Based on Small Files [A]. Broadband Network and Multimedia Technology (IC-BNMT), 2010 3rd IEEE International Conference on [C]. 26-28 Oct. 2010: 912-915.) merge small files into a large file and, in addition, store the metadata of part of the small files in DataNode memory, further reducing NameNode memory consumption and improving the read speed of small files. Dong et al. (Dong B, Zheng Q H, Tian F. An optimized approach for storing and accessing small files on cloud storage [J]. Journal of Network and Computer Applications, Volume 35, Issue 6, November 2012: 1847-1862.) divide small files into three types, structurally related, logically related and independent, according to actual application characteristics and the characteristics of the files themselves, and formulate a small file merging strategy on that basis. Liu et al. (Liu X H, Han J Z, Zhong Y Q. Implementing WebGIS on Hadoop: A Case Study of Improving Small File I/O Performance on HDFS [C]. Cluster Computing and Workshops, 2009. CLUSTER '09. IEEE International Conference on, 2009: 1-8.) use the application characteristics and user access characteristics of WebGIS to merge small files with adjacent geographic location information and build a global index for them, effectively improving the storage efficiency of small files. Dong et al. (Dong B, Qiu J, Zheng Q, et al. A Novel Approach to Improving the Efficiency of Storing and Accessing Small Files on Hadoop: A Case Study by PowerPoint Files [M]. IEEE, 2010.), considering the correlation between courseware files and the locality of access, propose a small file merging mechanism that merges files belonging to the same courseware into one large file and reads them with a two-level prefetching mechanism. Zhang et al. (Zhang Chung, Rui Jianwu, Wuting. A method for storing and reading Hadoop small files [J]. Computer Applications and Software, 2012(11): 95-100.) start from the correlation and directory structure among small files, build a hierarchical index for the merged large file, and preload the index file when the user accesses it, improving the read efficiency of small files.

It can be seen from the above research work that the access performance of small files can be improved by improving the system structure or combining the characteristics of the files. Moreover, on the premise of not changing the system architecture or improving the hardware performance, the small file combination is used as a microscopic data layout mechanism, so that the burden and communication overhead of the management node in the storage system can be effectively reduced, and the storage and access performance of the small file is improved.

However, existing small file merging strategies rely on characteristics of the files themselves. (In the work above, the read time of small files and the user access characteristics, mainly the number of user accesses, the dwell time and the like, are tied directly to specific files and therefore count as characteristics of the files themselves.) Such strategies can only merge files that have been accessed historically, or simply merge space-time adjacent files, without making deeper use of user access information. Yet the purpose of merging the massive space-time data small files of a smart city is to reduce the access delay of users and provide more convenient and faster space-time data service. Therefore, besides the characteristics of the files themselves, other characteristics of user access behavior also have a significant influence on access performance after merging, and this aspect is missing from current research.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a method that makes deeper use of user access information and guides small file merging based on the space-time characteristic that user access behaviors aggregate within a certain space-time range. The method focuses on where, when and for what types of data users show greater interest and access heat, and merges those data, so that data access efficiency can be significantly improved and data that users have never accessed, as well as newly added data, can also be merged.

Based on the discovery that user access behaviors (accesses by users to small files) are relatively aggregated within a certain space-time range, namely that the small files within that range show access correlation, the space-time data small file merging method in the smart city is realized by the following technical scheme: a data mining algorithm is used to mine a space-time range with access correlation from the historical access information of the small files, and the small files within that space-time range are then merged.

The recommended data mining algorithms include association rule algorithms such as Apriori, FP-Growth and Can-Tree, and clustering algorithms such as the hierarchical clustering algorithm AGNES (AGglomerative NESting) and the density clustering algorithm DBSCAN; all of them aim to mine the space-time range with access correlation from historical user access information.

These algorithms differ in computational complexity and in the storage space they occupy. More importantly, if an association rule algorithm is used, the position information of the mined associated files must be considered, and the spatial range is then calculated from the associated files in adjacent geographic areas; if a density clustering algorithm is used, the shape of the resulting clusters must be considered, and the files contained in a cluster are selected and the spatial range calculated in light of the actual application scenario. The present invention recommends the AGNES clustering algorithm because it is simple and effective and does not require the shape of the clusters to be considered.

The process of mining a space-time range with access correlation from historical access information of the small files by using a hierarchical clustering algorithm AGNES, and then merging the small files in the space-time range comprises the following steps:

1) carrying out parameterization representation and space-time attribute extraction on historical user access information;

according to the definition of the space-time data small files, each file has an inherent position attribute l, type attribute s and time attribute t, so that any small file can be represented by its three space-time elements (l, s, t);

suppose the set of space-time data small files generated in a smart city is F = {f₁, f₂, …, f_m}, the set of position attributes they contain is L = {l₁, l₂, …, l_m}, the set of time attributes is T = {t₁, t₂, …, t_m}, and the set of type attributes is S = {s₁, s₂, …, s_m}; users accessing application services in the smart city generate a small file access request sequence A = (a₁, a₂, …, a_n), in which each request item a_i, 1 ≤ i ≤ n, corresponds to one space-time data small file f_i; after the request sequence is parameterized and extracted, a space-time attribute sequence is formed:

A＝(a₁,a₂,…a_n)＝((l₁,s₁,t₁),(l₂,s₂,t₂),…,(l_n,s_n,t_n)) (1)；
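The parameterized representation of equation (1) can be sketched as follows. This is a minimal illustration only: the comma-separated log format, the field names and the sample values are hypothetical, and positions are assumed to be two-dimensional (longitude, latitude) pairs.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class AccessRequest:
    """One request item a_i = (l_i, s_i, t_i)."""
    location: Tuple[float, float]  # l: (longitude, latitude), assumed 2-D
    type_attr: str                 # s: data type of the requested file
    timestamp: float               # t: assumed to be seconds since epoch

def parse_request(line: str) -> AccessRequest:
    # Hypothetical log format: "lon,lat,type,timestamp"
    lon, lat, s, t = line.strip().split(",")
    return AccessRequest((float(lon), float(lat)), s, float(t))

# Hypothetical access log entries.
log = [
    "106.55,29.56,traffic,1516600000",
    "106.56,29.57,weather,1516600030",
    "106.50,29.50,traffic,1516600060",
]
A: List[AccessRequest] = [parse_request(line) for line in log]
print(A[0])
```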

2) file merging

2.1) type attribute classification: from the historical small file access request sequence A = (a₁, a₂, …, a_n), separate out the access request sequence whose type attribute is s_i, s_i ∈ S;

2.2) space-time clustering: using the hierarchical clustering algorithm AGNES, cluster the position attributes and the time attributes of that access request sequence separately to obtain a merging range of the position attribute and a merging range of the time attribute;

2.3) small file merging: merge the small files whose type attribute is s_i according to the merging ranges of the position attribute and the time attribute;

2.4) loop over steps 2.1)-2.3) to calculate the space-time merging ranges (namely the merging ranges of the position attribute and the time attribute) of the small files of each type attribute, merge the files of each type accordingly, and establish indexes.

The invention classifies the type attributes before the spatio-temporal clustering, because generally speaking, users have different access characteristics to small files with different types of attributes, and the corresponding spatio-temporal ranges with access correlation are generally different. And the type attribute classification is carried out before clustering, so that a more accurate space-time combination range can be obtained.
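The type attribute classification that precedes clustering can be sketched as a simple grouping step. The tuple layout `(location, type, time)` and the sample values are assumptions made for illustration.

```python
from collections import defaultdict

def split_by_type(requests):
    """Group (location, type, time) request items by type attribute s_i."""
    groups = defaultdict(list)
    for loc, s, t in requests:
        groups[s].append((loc, t))
    return dict(groups)

# Hypothetical request sequence A.
A = [
    ((106.55, 29.56), "traffic", 100.0),
    ((106.56, 29.57), "weather", 130.0),
    ((106.50, 29.50), "traffic", 160.0),
]

by_type = split_by_type(A)
for s, reqs in by_type.items():
    locations = [loc for loc, _ in reqs]  # input to position clustering
    times = [t for _, t in reqs]          # input to time clustering
    print(s, len(locations), len(times))
```

Each group is then clustered on its own, so that the space-time range found for one type of data is not blurred by the access pattern of another type.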

As a recommended scheme, in step 2.2) the hierarchical clustering algorithm AGNES is used to cluster the position attributes and the time attributes of the access request sequence separately, a weighted calculation based on access density is applied to the clustering results, and the weighted results are then used to obtain the merging range of the position attribute and the merging range of the time attribute. Weighting the clustering results by access density reduces the influence of noise points on the result, so that the calculated space-time range represents, as far as possible, the space-time range of the access-correlated files.

In the step 2.2), the recommended merging range of the position attribute is obtained as follows:

(1a) express the position attributes contained in the access request sequence of type attribute s_i as a position attribute set, and take each coordinate in the position attribute set as a cluster;

(2a) calculate the group average distance between every two clusters, and merge the two clusters with the shortest distance;

(3a) repeat step (2a) until the group average distance between any two clusters is greater than a predefined distance threshold, at which point the clustering algorithm ends; the predefined distance threshold is the average of the distances between all coordinate points in the position attribute set;

(4a) using the cluster set generated when the clustering process of step (3a) ends, calculate the average spatial range of the cluster set, weighting the spatial range radius of each cluster according to the access heat of users, namely the density (number) of coordinate points in the cluster; the larger the density, the larger the weight;

(5a) finally, average the weighted spatial range radii of all clusters in the cluster set to obtain the merging range of the position attribute for the space-time data small files of type attribute s_i, s_i ∈ S.

The invention uses the average-linkage method, defining the similarity between two clusters by their group average distance: the shorter the group average distance, the higher the cluster similarity. This helps keep the clustering process from being overly sensitive to outliers or noise points.
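Steps (1a)-(5a) can be sketched in plain Python as below. This is an illustrative sketch, not the patented implementation: the text does not reproduce the exact radius and weight formulas, so the code assumes that a cluster's radius is the largest distance from its centroid to a member and that its weight is its share of all points. Function names and sample coordinates are hypothetical.

```python
import math
from itertools import combinations

def group_average(c1, c2):
    """Group average distance: mean distance over all cross-cluster pairs."""
    return sum(math.dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

def agnes(points):
    """Agglomerate until the closest pair of clusters exceeds the threshold."""
    pairs = list(combinations(points, 2))
    # (3a) threshold = average distance between all coordinate points
    threshold = sum(math.dist(p, q) for p, q in pairs) / len(pairs)
    clusters = [[p] for p in points]          # (1a) one cluster per point
    while len(clusters) > 1:
        d, i, j = min((group_average(clusters[i], clusters[j]), i, j)
                      for i, j in combinations(range(len(clusters)), 2))
        if d > threshold:                     # (3a) stop condition
            break
        clusters[i].extend(clusters.pop(j))   # (2a) merge the closest pair
    return clusters, threshold

def weighted_radius(clusters):
    """(4a)-(5a) density-weighted average radius (assumed definitions)."""
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        cx = sum(p[0] for p in c) / len(c)
        cy = sum(p[1] for p in c) / len(c)
        radius = max(math.dist(p, (cx, cy)) for p in c)
        result += len(c) / total * radius     # denser cluster, larger weight
    return result

# Hypothetical access locations, including one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (30, 30)]
clusters, threshold = agnes(points)
merge_range = weighted_radius(clusters)
print(len(clusters), round(threshold, 2), round(merge_range, 2))
```

In this sample the isolated point ends up in a cluster of its own with weight 1/6 and radius 0, so it contributes little to the resulting merging range.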

In the step 2.2), the recommended merging range of the time attribute is obtained as follows:

(1b) express the time attributes contained in the access request sequence of type attribute s_i as a time attribute set, and take each time point in the time attribute set as a cluster;

(2b) calculate the group average time difference between every two clusters, and merge the two clusters with the smallest time difference;

(3b) repeat step (2b) until the group average time difference between any two clusters is greater than a predefined time difference threshold, at which point the clustering algorithm ends; the predefined time difference threshold is the average of the differences between all time points in the time attribute set;

(4b) using the cluster set generated when the clustering process of step (3b) ends, calculate the average time span range of the cluster set, weighting the time span radius of each cluster according to the access heat of users, namely the density (number) of time attribute points in the cluster; the larger the density, the larger the weight;

(5b) finally, average the weighted time span radii of all clusters in the cluster set to obtain the merging range of the time attribute for the space-time data small files of type attribute s_i, s_i ∈ S.

Here too the average-linkage method is used: the similarity between two clusters is defined by their group average time difference, and the smaller the group average time difference, the higher the cluster similarity. This helps keep the clustering process from being overly sensitive to outliers or noise points.
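The time attribute counterpart, steps (1b)-(5b), can be sketched in one dimension. As an assumption made for illustration, a cluster's time span radius is taken as half the difference between its latest and earliest time point, and its weight as its share of all time points; timestamps are hypothetical values in seconds.

```python
from itertools import combinations

def avg_time_diff(c1, c2):
    """Group average time difference between two clusters of timestamps."""
    return sum(abs(a - b) for a in c1 for b in c2) / (len(c1) * len(c2))

def time_merge_range(times):
    pairs = list(combinations(times, 2))
    # (3b) threshold = average difference between all time points
    threshold = sum(abs(a - b) for a, b in pairs) / len(pairs)
    clusters = [[t] for t in times]           # (1b) one cluster per time point
    while len(clusters) > 1:
        d, i, j = min((avg_time_diff(clusters[i], clusters[j]), i, j)
                      for i, j in combinations(range(len(clusters)), 2))
        if d > threshold:
            break
        clusters[i].extend(clusters.pop(j))   # (2b) merge the closest pair
    n = len(times)
    # (4b)-(5b) density-weighted average of half-spans (assumed definition)
    return sum(len(c) / n * (max(c) - min(c)) / 2 for c in clusters)

# Hypothetical access times: two bursts an hour apart and one a day later.
times = [0, 10, 20, 3600, 3610, 86400]
r_time = time_merge_range(times)
print(round(r_time, 2))
```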

The process implemented in the step 2.3) is recommended as follows:

let the small file dataset be F = {f₁, f₂, …, f_M}; for the space-time data small files in F whose type attribute is s_i, s_i ∈ S, take their position attribute set and their time attribute set; according to the position attribute merging range and the time attribute merging range mined in step 2.2), the small files of type attribute s_i are merged as follows:

(1c) create a file;

(2c) taking the earliest time attribute t_v in the time attribute set as the reference point, find the time attributes whose time span from t_v is less than or equal to the time attribute merging range; these form the set Range_t_v;

(3c) taking any position attribute l_u in the position attribute set as the reference point, find the position attributes whose spatial distance from l_u is less than or equal to the position attribute merging range; these form the set Range_l_u;

(4c) merge into the file created in step (1c) the space-time small files whose position attribute belongs to Range_l_u and whose time attribute belongs to Range_t_v;

(5c) if the total size of the merged file is larger than the predefined large file storage capacity, jump to step (6c); if the total size of the merged file is smaller than the predefined large file storage capacity, jump to step (7c);

(6c) delete from the time attribute set the time attributes in Range_t_v that have already participated in the merge, keep the current position attribute reference point l_u unchanged, and execute steps (1c)-(5c) in a loop;

(7c) delete the time attributes in Range_t_v from the time attribute set, and execute steps (2c)-(5c) in a loop until all small files whose position attribute belongs to Range_l_u and whose time attribute belongs to the time attribute set are merged;

(8c) if the total size of the files merged so far still does not reach the predefined large file storage capacity, delete the position attributes in Range_l_u from the position attribute set, reset the time attribute set, and execute steps (2c)-(5c);

(9c) execute steps (1c)-(8c) in a loop until all small files of type attribute s_i described by the position attribute set and the time attribute set are merged.
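The merging loop of steps (1c)-(9c) can be sketched as follows. This is a simplified illustration under stated assumptions: the class `SmallFile`, the function name and all sample values are hypothetical; the first remaining file supplies the position reference point; and a large file is closed as soon as its size reaches the capacity, so it may slightly exceed that capacity.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SmallFile:
    name: str
    loc: Tuple[float, float]
    t: float
    size: int  # size in MB (hypothetical unit)

def merge_small_files(files: List[SmallFile], r_loc: float,
                      r_time: float, capacity: int) -> List[List[str]]:
    remaining = sorted(files, key=lambda f: f.t)
    merged, current, current_size = [], [], 0           # (1c) open a file
    while remaining:                                      # (9c) outer loop
        ref = remaining[0].loc                            # (3c) reference l_u
        in_range = [f for f in remaining if math.dist(f.loc, ref) <= r_loc]
        remaining = [f for f in remaining if math.dist(f.loc, ref) > r_loc]
        while in_range:                                   # (7c) next time range
            t_v = in_range[0].t                           # (2c) earliest time
            batch = [f for f in in_range if f.t - t_v <= r_time]
            in_range = [f for f in in_range if f.t - t_v > r_time]
            for f in batch:                               # (4c) merge batch
                current.append(f.name)
                current_size += f.size
            if current_size >= capacity:                  # (5c)/(6c) file full
                merged.append(current)
                current, current_size = [], 0
        # (8c) a file that is not yet full is carried to the next position range
    if current:
        merged.append(current)
    return merged

files = [
    SmallFile("a", (0, 0), 0, 40), SmallFile("b", (0, 1), 5, 40),
    SmallFile("c", (0, 0), 100, 40), SmallFile("d", (50, 50), 1, 40),
    SmallFile("e", (50, 51), 2, 40),
]
merged = merge_small_files(files, r_loc=2.0, r_time=10.0, capacity=80)
print(merged)
```

In the sample, files a and b share a position range and a time range and fill the first large file; c is left over from the same position range and is completed with d and e from the next position range.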

The underlying storage of commonly used distributed storage systems such as HDFS is organized in data blocks, and the default block size is usually 64 MB. When a file is stored, it must reach the size of a data block before the system places it on the underlying disk. Merging is therefore necessary when a distributed storage system stores small files. During merging it must also be determined whether a whole data block has been filled, because only a full data block is placed on the underlying disk. Otherwise the uploaded small files stay in the cache queue of the client, and the system performs the write operation and stores the data on the underlying disk in the form of a data block only once 64 MB is reached. The size of the data block can of course be customized, for example to 128 MB; in that case the total size of the merged small files must reach 128 MB before they are stored on the underlying disk.

Small files generated in a smart city are typical stream data: new data is continuously generated within the same spatial range as time passes. The small file data generated in a smart city therefore has a limited range of spatial attributes (for example, the spatial region of a city) but an unlimited range of time attributes. In order to merge both existing data and subsequently added data effectively, small files are merged according to their time attribute: the earliest data generated within a position attribute range is merged first, then the data generated in the next time period within that position attribute range, and so on until all small file data within the position attribute range is merged. The next position attribute range is then found and the small files within it are merged in time order, so that the small file data of the whole geographic area is traversed and merged.

In the step 2.4), after the files are merged, a local index is established inside each generated large file to store the length of every small file and its offset within the large file, so that the application server can locate data directly through the file's internal index when processing a data request and obtain the required small file data conveniently and quickly. The merged large files are placed, as far as possible, in a contiguous storage interval on the same node of the underlying data storage server. The reason is that some small files that are logically contiguous are likely to sit at the edges of adjacent large files; if those large files were scattered, a client accessing such files would have to read across large files or even across data storage servers, which would hurt small file access performance.
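A local index of this kind can be sketched as a mapping from small file name to (offset, length) within the merged data. The in-memory layout and function names below are assumptions for illustration; the actual on-disk structure of the large file is the one shown in FIG. 3.

```python
from typing import Dict, Tuple

def build_merged_file(small_files: Dict[str, bytes]):
    """Concatenate small files and record (offset, length) for each one."""
    index: Dict[str, Tuple[int, int]] = {}
    chunks = []
    offset = 0
    for name, data in small_files.items():
        index[name] = (offset, len(data))
        chunks.append(data)
        offset += len(data)
    return b"".join(chunks), index

def read_small_file(blob: bytes, index: Dict[str, Tuple[int, int]],
                    name: str) -> bytes:
    """Locate a small file directly through the local index."""
    offset, length = index[name]
    return blob[offset:offset + length]

# Hypothetical small files.
blob, index = build_merged_file({"a.dat": b"hello", "b.dat": b"world!"})
print(index["b.dat"], read_small_file(blob, index, "b.dat"))
```

With the index available, a read needs only one lookup and one ranged read of the large file rather than a scan of its contents.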

Compared with the prior art, the invention has the following beneficial effects:

the invention exploits the space-time correlation of user file accesses, mines the hidden space-time range from historical access information, and uses that range to guide the merging of small files; in addition, because the invention merges files according to the space-time range of access-correlated files, independently of file-specific characteristics such as the number of user accesses and the dwell time, all files in the system can be merged, and new data can be merged whether or not users have accessed it.

Drawings

FIG. 1 illustrates a distributed file system and access mechanism in a master-slave architecture;

FIG. 2 is a diagram illustrating the AGNES clustering process for data objects { a, b, c, d, e };

FIG. 3 is a diagram illustrating the structure of a large file generated after merging;

FIG. 4 is a diagram illustrating a size distribution of files stored in the system;

FIG. 5 is used to illustrate an MDS storage load condition;

FIG. 6 is used to show the storage speed of small files;

FIG. 7 is used to show the total average response time for small file accesses.

Detailed Description

The smart city provides convenient and fast space-time data service for users, and often provides space-time data application service for the public in a predefined application mode through a network platform. After receiving an application access request of a user, the system accesses data in a certain space-time range according to predefined application and space-time parameters selected by the user, analyzes and processes the accessed data by using a function in a background, and finally returns a result to the user.

Obviously, if small files in a certain space range and time span are frequently accessed by users, the small file access correlation in the space-time range is indicated. Based on the above reasoning, the embodiment considers that a hierarchical clustering algorithm is used to dig the space-time range with access correlation from historical access information of the small files, and the range is used to guide the combination of the small files in the system, thereby reducing the memory consumption of the MDS and the number of times of communication between the Client and the MDS, and reducing the access delay of the user.

The method for merging the space-time data small files in the smart city specifically comprises the following steps:

step 1) carrying out parametric representation and space-time attribute extraction on historical user access information

Specifically: according to the definition of the space-time data small files, each file has an inherent position attribute l, a type attribute s and a time attribute t. Therefore, any small file can be represented by its three space-time elements (l, s, t).

Suppose the set of space-time data small files generated in a smart city is F = {f₁, f₂, …, f_m}, the set of position attributes they contain is L = {l₁, l₂, …, l_m}, the set of time attributes is T = {t₁, t₂, …, t_m}, and the set of type attributes is S = {s₁, s₂, …, s_m}. Users accessing application services in the smart city generate a small file access request sequence A = (a₁, a₂, …, a_n), in which each request item a_i, 1 ≤ i ≤ n, corresponds to one space-time data small file f_i. After the request sequence is parameterized and extracted, a space-time attribute sequence is formed:

A＝(a₁,a₂,…a_n)＝((l₁,s₁,t₁),(l₂,s₂,t₂),…,(l_n,s_n,t_n)) (1)。

step 2) file merging

2.1) type attribute classification: from the historical small file access request sequence A = (a₁, a₂, …, a_n), separate out the access request sequence whose type attribute is s_i, s_i ∈ S;

2.2) space-time clustering: using the hierarchical clustering algorithm AGNES, cluster the position attributes and the time attributes of that access request sequence separately, apply a weighted calculation based on access density to the clustering results, and then use the weighted results to obtain the merging range of the position attribute and the merging range of the time attribute;

2.3) small file merging: merge the small files whose type attribute is s_i according to the merging ranges of the position attribute and the time attribute;

In step 2.2), the merging range of the position attribute is obtained as follows:

The merging range of the position attribute is, in effect, the spatial extent within which small files should be merged together. Different clusters are therefore formed by clustering the position attributes of the historical small file access request sequence, and these clusters are then analyzed to obtain a weighted average cluster radius. This radius is the spatial range of merging, namely the merging rule for the position attribute.

A. Position attribute clustering based on AGNES algorithm

The AGNES algorithm is a classical agglomerative hierarchical clustering method. Initially, AGNES treats each object as its own cluster, and the clusters are then merged step by step according to some criterion. For example, if the Euclidean distance between an object in cluster C₁ and an object in cluster C₂ is the smallest among the distances between all objects belonging to different clusters, the two clusters are considered similar and C₁ and C₂ may be merged. During clustering, each cluster is represented by all the objects it contains, the similarity between two clusters is determined by the similarity of the closest pair of data points from the two clusters, and the merging process is repeated until all objects are finally merged into one cluster. The agglomerative clustering process of the AGNES algorithm is shown in fig. 2.

Assume that the location attributes of the small spatio-temporal data files considered here are all two-dimensional, i.e., longitude and latitude coordinates, which can be expressed as:

L＝{l₁,l₂,l₃,…}＝{(x₁,y₁),(x₂,y₂),…,(x_M,y_M)} (2)

Then, according to equation (1), the access requests in the user access request sequence A that contain the type attribute s_i can be expressed as:

so that the set of location attributes contained in these requests can be expressed as:

The goal of clustering is to make the similarity of objects within the same cluster as large as possible and the similarity between objects in different clusters as small as possible. A core problem of any clustering method is therefore how to measure the similarity between two clusters. Here, similarity is defined for clustering the location attributes of small spatio-temporal data files, where each cluster is a set of location attributes. To keep the clustering from being overly sensitive to outliers or noise points, the similarity between two clusters is defined by the group average distance computed with the average-linkage criterion: the smaller the group average distance, the higher the cluster similarity.

First, define the distance between any two location attribute coordinate points l = (x, y) and l′ = (x′, y′):

Suppose cluster C_m contains the set of coordinate points C_m＝(l₁,l₂,l₃…) and cluster C_n contains the set of coordinate points C_n＝(l′₁,l′₂,l′₃…), with no elements in common between the two clusters. Then the group average distance between clusters C_m and C_n is:

where N_m＝card(C_m) is the number of coordinate points in cluster C_m and N_n＝card(C_n) is the number of coordinate points in cluster C_n. The spatial range radius of cluster C_m＝(l₁,l₂,l₃…) is then:

B. Generation of the merging range

According to the principle of the AGNES algorithm, if the number of clusters to be generated by the clustering algorithm is not pre-specified, the algorithm will merge the generated clusters until all objects are merged into one cluster. Obviously, clustering into a cluster is equivalent to no clustering, so that a merging rule of the position attribute cannot be found from the original position attribute coordinate points. However, since the small file of the spatio-temporal data of each data type involves a plurality of application services, and the access behavior preferences of different users are different, the small file data in different spatio-temporal ranges can be accessed every time the application services are accessed. Therefore, we need to set a termination condition for the AGNES algorithm.

To this end, the spatial distance threshold for the small spatio-temporal data files with type attribute s_i, s_i ∈ S, is defined as the average of the distances between all coordinate points in the corresponding location attribute set, and this average is used as the termination condition of the clustering algorithm:

wherein the position attribute coordinate point

Representing a sequence of location attributes

The number of coordinate points included. Obviously, if the group average distance between two clusters exceeds this threshold, they cannot be merged into one cluster. When no more clusters can be merged, the clustering process of the AGNES algorithm ends.

Next, the AGNES algorithm is used to cluster the location attribute sequence of the access requests with type attribute s_i, in order to identify, from the clustering result, the spatial range involved in each of the user's application service accesses to this type of data. The clustering process is summarized as follows:

(1) Treat each coordinate in the location attribute sequence as a cluster.

(2) Calculate the group average distance between every pair of clusters and find the two closest clusters; the two clusters with the smallest group average distance are merged to form a new cluster.

(3) Repeat step (2) until the group average distance between any two clusters is greater than the predefined distance threshold, at which point the clustering algorithm ends.
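The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: it assumes plain Euclidean distance on the (x, y) coordinates, uses the mean pairwise distance of all points as the stopping threshold as described in the text, and the function names (`dist`, `group_average_distance`, `mean_pairwise_distance`, `agnes`) are hypothetical.

```python
import math
from itertools import combinations


def dist(p, q):
    """Euclidean distance between two (x, y) points (an assumed metric)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def group_average_distance(cm, cn):
    """Average of all point-to-point distances between two clusters."""
    return sum(dist(p, q) for p in cm for q in cn) / (len(cm) * len(cn))


def mean_pairwise_distance(points):
    """Threshold: average distance over all pairs of points (needs >= 2 points)."""
    pairs = list(combinations(points, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


def agnes(points, threshold):
    """Average-linkage agglomeration that stops once the closest pair
    of clusters is farther apart than the threshold."""
    clusters = [[p] for p in points]          # step (1): one cluster per point
    while len(clusters) > 1:
        (i, j), best = min(                   # step (2): closest pair of clusters
            (((i, j), group_average_distance(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > threshold:                  # step (3): termination condition
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
clusters = agnes(points, mean_pairwise_distance(points))
```

On this toy input the three points near the origin form one cluster and the two points near (10, 10) form another, because the group average distance between the two groups exceeds the mean pairwise distance of the whole set. The pairwise search is quadratic per iteration, which is adequate for a sketch but not tuned for large request logs.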

Assume that a cluster set has been generated once the clustering process finishes. What remains is to use this cluster set to calculate the average spatial range. Obviously, if the location attribute points contained in a cluster are dense, the corresponding spatial range is a hot spot area in which users access small files of this type.

Therefore, in order to reduce the influence of noise points on the result and to make the calculated spatial range represent, as far as possible, the spatio-temporal attribute range of access-related files, the spatial range radius of each cluster can be weighted according to the user's access heat, that is, the density (number) of coordinate points in the cluster: the greater the density, the greater the weight. Then, for each cluster k, 1 ≤ k ≤ K_l, the spatial range radius weighted by the access density is:

where the quantities involved are the number of coordinate points contained in the cluster, the number of coordinate points contained in the location attribute sequence, and the spatial range radius of the cluster. Finally, the weighted spatial range radii of all clusters in the cluster set are averaged to obtain the location attribute merging range for the small spatio-temporal data files with type attribute s_i, s_i ∈ S:
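The density weighting can be illustrated with a short sketch. Two choices here are assumptions made only for illustration and are not taken from the text: the radius of a cluster is taken as the largest distance from its centroid to any of its points, and the merging range is taken as the sum of the radii weighted by each cluster's share of all points. The names `cluster_radius` and `density_weighted_range` are hypothetical.

```python
import math


def cluster_radius(cluster):
    """Assumed radius: largest distance from the centroid to a member point."""
    cx = sum(x for x, _ in cluster) / len(cluster)
    cy = sum(y for _, y in cluster) / len(cluster)
    return max(math.hypot(x - cx, y - cy) for x, y in cluster)


def density_weighted_range(clusters):
    """Assumed weighting: each radius counts in proportion to the number
    of points in its cluster, so dense (hot) clusters dominate."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_radius(c) for c in clusters)


clusters = [
    [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)],  # dense cluster
    [(10.0, 10.0)],                                     # isolated noise point
]
merge_range = density_weighted_range(clusters)
```

In this example the isolated point contributes a radius of zero with weight 1/5, so the result is governed by the dense cluster, which is the intended effect of weighting by access density.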

In step 2.2), the merging range of the time attribute is obtained as follows:

The merging range of the time attribute is, in effect, the time span within which small files should be merged together. As with the location attribute above, it is obtained by clustering the time attributes of the historical small-file access information; the only difference is that the clustered objects change from two-dimensional longitude and latitude coordinates to one-dimensional time attributes.

A. Time attribute clustering based on AGNES algorithm

As in the preceding location attribute clustering, the set of time attributes contained in the requests can be expressed as:

First, the similarity between clusters is defined, where each cluster is a set of time attributes. The similarity between two clusters is defined by the group average time difference: the smaller the group average time difference, the higher the cluster similarity. The time difference between any two time attribute points t and t′ is defined as:

d(t,t′)＝|t-t′| (12)

Suppose cluster C_m contains the set of time attribute points C_m＝(t₁,t₂,t₃…) and cluster C_n contains the set of time attribute points C_n＝(t′₁,t′₂,t′₃…), with no elements in common between the two clusters. Then the group average time difference between clusters C_m and C_n is:

where N_m＝card(C_m) is the number of time attribute points in cluster C_m and N_n＝card(C_n) is the number of time attribute points in cluster C_n. The time range radius of cluster C_m＝(t₁,t₂,t₃…) is then:

B. Generation of the merging range

As in the location attribute calculation above, and following the principle of the AGNES algorithm, the time difference threshold for the small spatio-temporal data files with type attribute s_i, s_i ∈ S, is defined as the average of the differences between all time points in the corresponding time attribute set, and this average is used as the termination condition of the clustering algorithm:

wherein the time attribute point

Representing a sequence of time attributes

The number of time attribute points included. Obviously, if the group average time difference between two clusters exceeds this threshold, they cannot be merged into one cluster. Clustering ends when no more clusters can be merged.

The time attribute sequence is then clustered in order to identify, from the clustering result, the time span involved in each of the user's application service accesses to this type of data. The clustering process is the same as for the location attribute above.
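For the one-dimensional case, the quantities that drive the clustering can be computed directly from equation (12). The sketch below is illustrative only: the helper names are hypothetical, and the sample time values (treated here as minutes) are invented.

```python
from itertools import combinations


def time_diff(t, u):
    """Equation (12): d(t, t') = |t - t'|."""
    return abs(t - u)


def group_mean_time_diff(cm, cn):
    """Average of all time differences between two clusters of time points."""
    return sum(time_diff(t, u) for t in cm for u in cn) / (len(cm) * len(cn))


def time_threshold(times):
    """Termination threshold: mean difference over all pairs of time points."""
    pairs = list(combinations(times, 2))
    return sum(time_diff(t, u) for t, u in pairs) / len(pairs)


times = [0, 5, 10, 100, 105]                  # hypothetical access times
threshold = time_threshold(times)
near = group_mean_time_diff([0, 5], [10])
far = group_mean_time_diff([0, 5, 10], [100, 105])
```

Here `near` is below the threshold and `far` is above it, so the clusters {0, 5} and {10} may be merged while {0, 5, 10} and {100, 105} stay separate, mirroring the termination rule used for location attributes.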

The time span of each cluster is weighted by the density of the time attribute points in the cluster: the greater the density, the greater the weight. Then, for each cluster, the time range radius weighted by the access density is:

where the quantities involved are the number of time attributes contained in the cluster, the number of time attribute points contained in the time attribute sequence, and the time range radius of the cluster. Finally, the weighted time range radii of all clusters in the cluster set are averaged to obtain the time attribute merging range for the small spatio-temporal data files with type attribute s_i, s_i ∈ S:

Step 2.3), merging the small files with type attribute s_i according to the merging ranges of the location attribute and the time attribute, proceeds as follows:

Given the set of time attributes, and the location attribute merging range and time attribute merging range mined in step 2.2), the steps for merging the small files with type attribute s_i are as follows:

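(1c) Create a file;

(2c) Taking the earliest time attribute t_v in the time attribute set as the reference point, find the time attributes whose time span from t_v is less than or equal to the time attribute merging range; these constitute a set Range_t_v;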

(3c) Gathering by location attributes

The location attributes of (1) constitute a set Range _ l_u；

Deletion in

(7c) will set Range _ t_vTime attribute of from

Deletion in

All the small files in the file are merged;

Deletion in

Resetting a set of time attributes

Performing step (2c-5 c);

Set of time attributes as

Type attribute is s_iAll of the small files are merged.
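The overall shape of the merging loop can be sketched as a greedy procedure. This is only one plausible reading and rests on assumptions not confirmed by the text: the earliest remaining file supplies both the reference time and the reference location, a file joins the current group when its time lies within the time merging range of the reference and its location lies within the location merging range, and grouped files are then removed before the loop repeats. The function name and the sample data are hypothetical.

```python
import math


def merge_small_files(files, time_range, loc_range):
    """files: list of (name, (x, y), t). Returns a list of groups of names,
    each group standing for one merged large file."""
    remaining = sorted(files, key=lambda f: f[2])   # earliest time first
    merged = []
    while remaining:
        _, ref_loc, ref_t = remaining[0]            # reference point (assumed)
        group = [
            f for f in remaining
            if f[2] - ref_t <= time_range
            and math.hypot(f[1][0] - ref_loc[0], f[1][1] - ref_loc[1]) <= loc_range
        ]
        merged.append([f[0] for f in group])        # merge the group into one file
        remaining = [f for f in remaining if f not in group]
    return merged


files = [
    ("a", (0.0, 0.0), 0),
    ("b", (0.5, 0.0), 5),
    ("c", (9.0, 9.0), 6),
    ("d", (0.2, 0.1), 50),
]
groups = merge_small_files(files, time_range=10, loc_range=1.0)
```

The reference file always satisfies both conditions, so every pass removes at least one file and the loop terminates once all files of the type have been assigned to a merged file.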

Following the above method of generating merging rules, the spatio-temporal merging ranges corresponding to the small files of each type attribute in S = {s₁, s₂, s₃, …} can be calculated in turn, and the files merged according to the merging steps above. In this way, all files in the whole small-file data set F = {f₁, f₂, …, f_M} are merged into a large-file set F′ = {F₁, F₂, …, F_N}.

After merging, a local index (Internal Index) must be built inside each large file to store the length of each small file and its offset within the large file, so that the application server can locate data directly through the file's internal index when processing a data request and obtain the required small-file data quickly. The merged large files should, as far as possible, be placed in a contiguous storage interval on the same node of the underlying data storage server. Small files that were originally contiguous may end up at the edges of large files, and if the large files are scattered, the client must read across large files or even across data storage servers when a user accesses them, which degrades small-file access performance. The structure of the merged large file is shown in fig. 3.
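The role of the internal index can be shown with a small sketch. The on-disk layout used here (a 4-byte big-endian header length, a JSON index, then the concatenated payload) is an assumption chosen for brevity and is not the layout of fig. 3; the function names are hypothetical.

```python
import json
import struct


def pack_large_file(small_files):
    """small_files: dict of name -> bytes. Returns one merged byte string
    whose header records each small file's offset and length."""
    index, payload = {}, bytearray()
    for name, data in small_files.items():
        index[name] = {"offset": len(payload), "length": len(data)}
        payload += data
    header = json.dumps(index).encode("utf-8")
    return struct.pack(">I", len(header)) + header + bytes(payload)


def read_small_file(large_file, name):
    """Locate one small file through the internal index and return its bytes."""
    (header_len,) = struct.unpack(">I", large_file[:4])
    index = json.loads(large_file[4:4 + header_len].decode("utf-8"))
    entry = index[name]
    start = 4 + header_len + entry["offset"]  # offsets are relative to the payload
    return large_file[start:start + entry["length"]]


large = pack_large_file({"f1": b"21.5C", "f2": b"63%RH", "f3": b"1013hPa"})
```

Because the offset and length live inside the large file itself, a reader needs only the large file to extract any small file, which is the property the text relies on to keep per-small-file metadata out of the MDS.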

Experiment

The experimental data come from the Wuhan smart city network application demonstration platform, which includes 14 sensors in different regions, has generated sensor data since January 1, 2010, and provides 20 predefined applications to the public. These data all have obvious spatio-temporal attributes; most are typical small spatio-temporal data files (generally no larger than 4 MB) that are numerous but individually occupy little storage space.

Taking meteorological monitoring sensor data as an example, 5 collection points are deployed in different areas of Wuhan, each containing seven types of sensors with a sampling interval of 5 minutes. The data collected by one sensor in one day form a file of 3.2 KB to 5.8 KB, 4.25 KB on average, and data for 10,000 collection points are then simulated by spatio-temporal kriging interpolation. The data files collected by all collection points from 1/2017 to 1/6/2017 total about 12,740,058, with a total size of about 53.68 GB. Fig. 4 shows the distribution of file sizes stored in the system.

After processing, we obtain 4,076,328 small-file access records of users for the period from 3/1/2017 to 6/1/2017.

The experimental evaluation comprises three parts: MDS storage load, file storage speed, and average response time of file access requests. The latter two tests cover both the single-user and the multi-user concurrent cases, where multi-user concurrency is simulated with multiple processes on a single client. Each experiment is repeated 10 times and the results averaged, and the results are compared with the original HDFS and the HAR archive technique described in the literature (Hadoop architecture. Hadoop archives guide [EB/OL], http://hadoop.apache.org/common/docs/current/hadoop_archives.html. 2011.).

MDS storage load

Small-file merging can effectively reduce the amount of metadata in the MDS, so the MDS storage load is measured by its memory consumption. The memory consumption of the MDS was tested when the system stored 5,000, 10,000, 15,000, 20,000, 25,000 and 30,000 small files. The results of the experiment are shown in FIG. 5.

As can be seen from the figure, when no files are stored the system itself consumes about 4.2 MB of memory, and memory usage increases linearly with the number of stored files. In traditional HDFS, every stored small file occupies one object in memory, so memory consumption is very high. The HAR archiving technique and the algorithm proposed herein both merge small files, so what is held in memory are the merged large-file objects, which reduces the number of objects in MDS memory. When the merged large files contain the same number of small files, the algorithm proposed herein differs from HAR in that it builds a local index for the merged small files (the directory information of the small files is placed inside the merged large file), so the memory overhead is further reduced compared with HAR.

File storage speed

The file storage test is performed by a Client of the application demonstration platform's storage system (not by a user; users can only access the predefined application services through the network platform). In the test, 100,000 small files with a total size of 0.396 GB are written into the storage system at one time, and the average file storage speed is calculated. The results of the experiment are shown in FIG. 6.

As can be seen from the figure, whether or not files are merged, the storage system reaches its maximum write rate with 5 concurrent users, and the total transmission efficiency levels off as the number of concurrent users increases. Second, because they merge small files, both HAR and the algorithm proposed herein store files faster than traditional HDFS. It can also be seen that HAR is fast because its merging mechanism simply packs multiple small files into one file and writes it to HDFS; its drawbacks are that a merged file must be re-created if it is modified and that it is poorly suited to subsequent small-file reads. The algorithm proposed herein considers the spatio-temporal attributes of the files before merging them, which reduces storage speed to some extent, but over a large data set the impact on overall write speed is small and storage efficiency remains very high.

Average response time

The purpose of merging is to reduce the average response time of user access, for which the total average response time of the system when a single user accesses 500-.

As can be seen from the figure, the total average response time grows linearly with the number of small files, and the merging algorithm proposed herein has the smallest total response time when accessing small files, followed by HAR and finally the original HDFS. In the original HDFS, the excessive number of small files complicates metadata retrieval in the MDS, the system communicates internally very frequently, and most of the time is spent on system overhead, which increases read time. The HAR archiving technique stores efficiently, but when reading, although each access retrieves one large file, the correlation between files is not considered, so the small files contained in that large file are not necessarily the ones the application service needs. This lowers the read hit rate of large files, increases the number of Client-MDS communications, and slows data reads. The merging algorithm proposed herein analyzes the spatio-temporal correlation of user access together with the attributes of the files, and merges associated files together as far as possible. The large file obtained by each application service access request therefore contains the small files the service needs next, which avoids frequent communication between the Client and the MDS, shortens the response time of the application service, and effectively improves the small-file access performance of the system.

Conclusion

The method combines the spatio-temporal attributes of files with the spatio-temporal characteristics of user access. On the one hand it solves the problem of the spatio-temporal granularity at which different types of files are merged; on the other hand it allows all files in the system to be merged (whether or not users have accessed them).

In general, merging small files according to their spatio-temporal attributes stores less efficiently than direct merging but greatly improves read efficiency. Since the purpose of merging small spatio-temporal data files in a smart city is to reduce user access latency, the algorithm is clearly better suited to this application scenario.

In addition to mining the access-related spatio-temporal range from historical user access information with a clustering algorithm (hierarchical clustering AGNES), the invention can also use other data mining algorithms, such as the association rule algorithms Apriori, FP-Growth and Can-Tree, and other clustering algorithms, such as density clustering DBSCAN. The applicant believes that, under the guidance of the inventive concept, the method can be implemented by using these general data mining algorithms to mine access-related files, calculating the spatio-temporal ranges the files cover, and finally using those ranges to guide the merging of small files, so a detailed description is omitted.

Claims

1. A small file merging method for spatio-temporal data in a smart city, characterized in that a data mining algorithm is used to mine a spatio-temporal range with access correlation from the historical access information of small files, and the small files within that spatio-temporal range are then merged;

the data mining algorithm adopts a hierarchical clustering algorithm AGNES in clustering;

according to the definition of the space-time data small files, each file comprises an inherent position attribute l, a type attribute s and a time attribute t, so that any small file can be represented by three space-time elements (l, s and t);

2) file merging

is separated out;

2.4) circulating the steps 2.1) -2.3), calculating space-time combination ranges of small files with different attributes, respectively combining the space-time combination ranges, and establishing indexes;

in step 2.2), the merging range of the position attribute is obtained by the following method:

(1a) request for

The set of location attributes contained therein is represented as

Aggregating location attributes

Each coordinate in the graph is used as a cluster;

Finishing the clustering algorithm; the predefined distance threshold

As a set of location attributes

Average value of distances between all coordinate points;

(5a) finally, the clusters are clustered

Averaging the weighted radii of all the clusters in the space range to calculate the type attribute s_i,s_iThe position attribute corresponding to the space-time data small file belonging to the S is merged;

in step 2.2), the merging range of the time attribute is obtained as follows:

(1b) request for

The time attribute set contained in it is expressed as

Aggregating temporal attributes

Each coordinate in the graph is used as a cluster;

As a collection of time attributes

Average value of the difference between all time points;

Calculating the average time span range of the cluster set by using the cluster set, and weighting the time span radius of the cluster according to the access heat of a user, namely the density of the time attribute point in each cluster, wherein the larger the density is, the larger the weight is;

(5b) finally, the clusters are clustered

2. Merging method according to claim 1, characterized in that step 2.3) is implemented as follows:

Set of time attributes as

According to the location attribute merging range mined in step 2.2)

And time attribute merge scopes

For type attribute of s_iThe small file merging steps are as follows:

(1c) creating a file;

(2c) taking the earliest time attribute t_v in the time attribute set as the reference point, find the time attributes whose time span from t_v is less than or equal to the time attribute merging range; these constitute a set Range_t_v;

(3c) Gathering by location attributes

The location attributes of (1) constitute a set Range _ l_u；

Deletion in

(7c) will set Range _ t_vTime attribute of from

Deletion in

All the small files in the file are merged;

Deletion in

Resetting a set of time attributes

Performing step (2c-5 c);

Set of time attributes as

Type attribute is s_iAll of the small files are merged.

3. The merging method according to claim 2, wherein in step 2.4), the index is built inside the large file after merging generation, and the index is used for storing the length of the small file and the offset position inside the large file.

4. The merging method according to claim 3, wherein the large files generated after merging are placed in a continuous storage interval of the same node in the underlying data storage server.