CN104573082B - Distributed storage method and system for spatial small-file data based on access log information - Google Patents

Distributed storage method and system for spatial small-file data based on access log information

Info

Publication number
CN104573082B
Authority
CN
China
Prior art keywords
small documents
space small
documents data
access
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201510042456.9A
Other languages
Chinese (zh)
Other versions
CN104573082A (en
Inventor
潘少明
徐正全
种衍文
李红
李明
汤戈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU
Priority to CN201510042456.9A
Publication of CN104573082A
Application granted
Publication of CN104573082B
Expired - Fee Related
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Abstract

The present invention provides a distributed storage method and system for spatial small-file data based on access log information. The spatial small-file data set is divided into a frequently accessed subset and an infrequently accessed subset; the access sequence of the frequently accessed subset is extracted; the association degree between frequently accessed files is computed segment by segment, and the pairwise association degrees are assembled into an association matrix. Each element of the association matrix is magnitude-inverted, and the matrix is then reordered with the RCM ordering algorithm and output. A local approximate search over the reordered matrix finds the optimal combinations, which are used to distribute the frequently accessed spatial small-file data across storage, while the infrequently accessed spatial small-file data are stored separately according to spatial adjacency. The invention improves the parallel access performance of spatial small-file data.

Description

Distributed storage method and system for spatial small-file data based on access log information
Technical field
The invention belongs to the technical field of distributed storage of spatial small-file data, and in particular relates to a new distributed storage method and system for spatial small-file data based on access log information.
Background
Storing massive spatial information and accessing it quickly has always been a major problem that spatial information service systems try to solve. For example, the data collected daily by a commonly used spatial information service system, NASA's Earth Observing System, has reached 2 TB. Distributing these data sensibly across storage so that they can be accessed quickly in parallel is key, and one important class of solutions stores the data in a distributed manner so that they can be accessed in parallel, improving data access efficiency.
Typical distributed file storage systems at present mainly include GFS (Google File System), HDFS (Hadoop Distributed File System) and Lustre. However, the storage performance improvements of these systems lie mainly in the handling of large files. GFS, for example, splits a large file into fixed-length blocks (e.g. 64 MB) and stores the blocks on different storage devices to improve the parallel access rate of the data (reference: Ghemawat S, Gobioff H, Shun-Tak L. The Google file system. In: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP'03). Bolton Landing, New York: IEEE, 2003. 1–15). Another typical storage technology, RAID (Redundant Array of Independent Disks), likewise splits each large data file into several data blocks and stores them on different disks to improve parallel access to the file.
Although the above distributed storage methods are effective for large files, they are poorly suited to small-file data, which cannot be split into further blocks. The general practice at present is simply to store each file on a single storage server, so parallel access to multiple small files is hard to achieve and I/O efficiency is low.
Research shows that most current systems hold a large amount of small-file data. For example, of the 13 million files at the U.S. National Energy Research Scientific Computing Center, 99% are smaller than 64 MB and 44% are smaller than 64 KB (reference: Carns P, Lang S, Ross R, et al. Small-file access in parallel file systems [C]. Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on. IEEE, 2009: 1-11).
In fact, spatial information service systems based on the pyramid model, such as Google Earth and World Wind, also store spatial data as small files. World Wind divides the Earth into tile data of different resolutions according to the pyramid model; each tile is saved as one file, each tile has a fixed size of 512 × 512 pixels, and each tile file is no larger than 1 MB (reference: Boschetti L, Roy D P, Justice C O. Using NASA's World Wind virtual globe for interactive internet visualization of the global MODIS burned area product. Int J Remote Sens, 2008, 29(11): 3067–3072). Google Earth likewise stores spatial data with a multi-resolution model, and each of its data files is no larger than 64 MB (reference: Sample J T, Loup E. Tile-base geospatial information system: principle and practices. New York: Springer, 2010. 23–200).
In short, current distributed storage methods for large files are hard to apply to small-file data, while optimizations aimed at small-file data concentrate on access optimization rather than storage optimization (access optimization targets the client side, storage optimization the server side). Examples include reducing the execution time of data-intensive applications (reference: J. Kim, A. Chandra, and J.B. Weissma. Using Data Accessibility for Resource Selection in Large-Scale Distributed Systems. IEEE Trans. Parallel Distributed Systems, vol. 20, no. 6, pp. 788-801, June 2009) and reducing the overhead of small-file index information (reference: A.L. Chervenak, R. Schuler, M. Ripeanu, M.A. Amer, S. Bharathi, I. Foster, A. Iamnitchi, and C. Kesselman. The Globus Replica Location Service: Design and Experience. IEEE Trans. Parallel Distributed Systems, vol. 20, no. 9, pp. 1260-1272, Sept. 2009). In a distributed system, however, access latency depends not only on the access method but also on how the data are distributed across storage. The optimization problem for small-file data therefore remains fundamentally unsolved.
Summary of the invention
To address the above problems, the present invention provides a distributed storage method and system for spatial small-file data based on access log information. It uses the access log information of spatial small-file data to analyze the correlation between files and distributes the data across storage accordingly, so as to improve the parallel access rate of spatial small-file data.
The technical scheme adopted by the distributed storage method and system of the present invention is as follows:
A distributed storage method for spatial small-file data based on access log information performs the following steps for any type of spatial small-file data:
Step 1: divide the spatial small-file data set into a frequently accessed subset and an infrequently accessed subset according to access frequency, comprising the following sub-steps.
Step 1.1: obtain the access heat of each spatial small file, as follows.
Let the spatial small-file data set be $F=\{f_1,f_2,\ldots,f_N\}$, containing the spatial small files $f_1,f_2,\ldots,f_N$, where $N$ is the total number of files and the $i$-th file is denoted $f_i$, $i=1,2,\ldots,N$.
Suppose the access log information records that the files $f_{a_1},f_{a_2},\ldots,f_{a_M}$ were accessed in turn. The access log sequence is then $R=(f_{a_1},f_{a_2},\ldots,f_{a_M})$, and $A=(a_1,a_2,\ldots,a_M)$ is the access sequence vector, with $a_t\in[1,N]$ and access index $t=1,2,\ldots,M$, where $M$ is the total number of accesses to all files in $F$.
Count the number of times $\lambda_i$ that each file $f_i$ occurs in the access log sequence $R$; $\lambda_i$ is the access heat of $f_i$.
Step 1.2: extract the frequently accessed spatial small files according to access heat, as follows.
Input the preset discriminant parameter $\lambda$.
If the access heat of a file $f_i$ in $F$ satisfies $\lambda_i>\lambda$, then $f_i$ is a frequently accessed spatial small file; otherwise $f_i$ is an infrequently accessed one.
Step 1.3: form a subset of the spatial small-file data set from the frequently accessed files obtained in step 1.2, as follows.
Let the subset formed by all frequently accessed files be $F_1=\{f^1_1,f^1_2,\ldots,f^1_{N_1}\}$, where $N_1$ is the total number of frequently accessed files, and the $i_1$-th and $j_1$-th frequently accessed files are denoted $f^1_{i_1}$ and $f^1_{j_1}$, $i_1,j_1\in[1,N_1]$.
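Step 1 can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `split_by_heat`, the string file identifiers and the toy log are all hypothetical, and the access log is assumed to have been reduced already to a list of file identifiers in access order.

```python
from collections import Counter

def split_by_heat(access_log, all_files, threshold):
    """Split files into (frequent, infrequent) by access heat.

    access_log: file ids in access order (the sequence R in the text).
    threshold:  the preset discriminant parameter (lambda in the text).
    """
    heat = Counter(access_log)  # heat[f] plays the role of lambda_i
    frequent = [f for f in all_files if heat[f] > threshold]
    infrequent = [f for f in all_files if heat[f] <= threshold]
    return frequent, infrequent

# Hypothetical toy data: five files, seven logged accesses.
log = ["f1", "f2", "f1", "f3", "f1", "f2", "f4"]
files = ["f1", "f2", "f3", "f4", "f5"]
frequent, infrequent = split_by_heat(log, files, threshold=1)
print(frequent)    # ['f1', 'f2']
print(infrequent)  # ['f3', 'f4', 'f5']
```

A file that never appears in the log has heat 0 and therefore falls into the infrequently accessed subset.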
Step 2: extract the access sequence of the frequently accessed subset from the access log information. The access sequence is formed in chronological order as $R_1=(f^1_{a^1_1},f^1_{a^1_2},\ldots,f^1_{a^1_{M_1}})$; $A_1=(a^1_1,a^1_2,\ldots,a^1_{M_1})$ is the access sequence vector of the frequently accessed files, with $a^1_{t_1}\in[1,N_1]$ and access index $t_1=1,2,\ldots,M_1$, where $M_1$ is the total number of accesses to all frequently accessed files in $F_1$.
Step 3: using the access sequence of the frequently accessed subset, compute segment by segment the association degree between frequently accessed files, and assemble the pairwise association degrees into an association matrix, comprising the following sub-steps.
Step 3.1: compute the segment length of the frequent access sequence, $n=N_1/m$, from the number of storage servers $m$ and the size $N_1$ of the frequently accessed subset.
Step 3.2: segment the frequent access sequence according to the segment length, as follows.
Following the access order, cut the access sequence vector $A_1$ into subvectors of $n$ elements each, written $A_1=(S_1,S_2,\ldots,S_l)$, where $S_k=(a_{k1},a_{k2},\ldots,a_{kn})$, $a_{kj}\in[1,N_1]$, $1\le k\le l$, $1\le j\le n$. Denote the set of all subvectors of $A_1$ by $S=\{S_k : k\in[1,l]\}$.
Step 3.3: compute the pairwise association degree of the frequently accessed files, as follows.
Define the function $r_{S_k}(i_1,j_1)=\begin{cases}1, & i_1\in\bar S_k \text{ and } j_1\in\bar S_k\\ 0, & \text{otherwise}\end{cases}$
where $\bar S_k$ is the set of all elements of $S_k$; the function $r_{S_k}(i_1,j_1)$ indicates whether the frequently accessed files $f^1_{i_1}$ and $f^1_{j_1}$ are associated within an access period of length $n$.
Define the function $R_S(i_1,j_1)=\sum_{k=1}^{l} r_{S_k}(i_1,j_1)$,
where $R_S(i_1,j_1)$ is the total association degree of $f^1_{i_1}$ and $f^1_{j_1}$ over $S$.
Step 3.4: assemble the pairwise association degrees of the frequently accessed files into the association matrix $R_S=\big(R_S(i_1,j_1)\big)_{N_1\times N_1}$.
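Steps 3.2 to 3.4 can be illustrated with a short Python sketch. It assumes that two files count as associated in a segment when both occur in it, that the diagonal is left at zero, and that a shorter trailing segment is kept; none of these details is spelled out in the text, and the name `association_matrix` is hypothetical.

```python
def association_matrix(a1, n1, n):
    """Build the N1 x N1 association matrix from the access vector.

    a1: access sequence vector with entries in 1..n1 (A_1 in the text).
    n:  segment length (n = N1 / m in the text).
    """
    # Cut A_1 into consecutive segments of n elements; a shorter
    # trailing segment is kept as-is (an assumption of this sketch).
    segments = [a1[k:k + n] for k in range(0, len(a1), n)]
    r = [[0] * n1 for _ in range(n1)]
    for seg in segments:
        members = set(seg)  # the element set of one segment S_k
        for i in members:
            for j in members:
                if i != j:  # diagonal left at 0 (assumption)
                    r[i - 1][j - 1] += 1
    return r

# Hypothetical example: N1 = 4 frequent files, m = 2 servers, so n = 2.
a1 = [1, 2, 3, 4, 1, 2, 3, 4, 1, 3]
r = association_matrix(a1, n1=4, n=2)
for row in r:
    print(row)
# [0, 2, 1, 0]
# [2, 0, 0, 0]
# [1, 0, 0, 2]
# [0, 0, 2, 0]
```

Files 1 and 2 share two segments, so their entry is 2; files 1 and 3 share only the last segment, so their entry is 1.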
Step 4: magnitude-invert each element of the association matrix, then reorder the matrix with the RCM ordering algorithm and output it.
Step 5: apply the local approximate search method to the reordered association matrix, iterating to find m optimal combinations, and output them.
Step 6: distribute the frequently accessed spatial small-file data across storage using the optimal combinations obtained in step 5, and store the infrequently accessed spatial small-file data separately according to spatial adjacency.
Furthermore, step 4 comprises the following sub-steps.
Step 4.1: obtain the maximum element of the association matrix by traversing all of its element values, giving the maximum $R_{max}$.
Step 4.2: magnitude-invert the elements of the association matrix by traversing all element values and performing $R_S(i_1,j_1)=R_{max}-R_S(i_1,j_1)$.
Step 4.3: reorder the association matrix with the standard RCM ordering algorithm.
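The following pure-Python sketch illustrates step 4 under stated assumptions. "RCM" is read here as the reverse Cuthill-McKee ordering; the graph is taken to have an edge wherever an off-diagonal entry of the inverted matrix is nonzero; and each component starts from a minimum-degree vertex, which is a common simplification rather than necessarily the variant the patent intends. All function names are hypothetical.

```python
from collections import deque

def invert(r):
    """Step 4.1 and 4.2: replace every element v by Rmax - v."""
    rmax = max(max(row) for row in r)
    return [[rmax - v for v in row] for row in r]

def rcm_order(mat):
    """Simplified reverse Cuthill-McKee ordering of a symmetric matrix."""
    n = len(mat)
    adj = [[j for j in range(n) if j != i and mat[i][j] != 0]
           for i in range(n)]
    degree = [len(a) for a in adj]
    visited = [False] * n
    order = []
    # Breadth-first search per component, lowest-degree vertices first.
    for start in sorted(range(n), key=lambda v: (degree[v], v)):
        if visited[start]:
            continue
        visited[start] = True
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in sorted(adj[v], key=lambda u: (degree[u], u)):
                if not visited[w]:
                    visited[w] = True
                    queue.append(w)
    return order[::-1]  # reversing gives the "reverse" ordering

def permute(mat, order):
    """Apply the same permutation to rows and columns."""
    return [[mat[i][j] for j in order] for i in order]

# Hypothetical 4 x 4 association matrix (0-based file indices).
r = [[0, 2, 1, 0],
     [2, 0, 0, 0],
     [1, 0, 0, 2],
     [0, 0, 2, 0]]
inverted = invert(r)
order = rcm_order(inverted)
reordered = permute(inverted, order)
print(order)  # [1, 3, 2, 0]
```

After inversion, a large value marks a pair of files that were rarely accessed together, which is the property wanted for files placed on the same server.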
Furthermore, step 5 comprises the following sub-steps.
Step 5.1: initialize the iteration counter d = 1.
Step 5.2: find one optimal combination with the local approximate search method: find an n × n block in the current matrix such that the matrix element values within that block are maximal; the n corresponding files form one optimal combination. The first time step 5.2 is performed, the current matrix is the reordered association matrix obtained in step 4; on subsequent executions, the current matrix is the matrix produced by the previous iteration.
Step 5.3: after step 5.2 of the current iteration has found an optimal combination of n files, delete the association matrix elements corresponding to those n files, giving an $(N_1-dn)\times(N_1-dn)$ matrix.
Step 5.4: determine whether d = m-1. If not, set d = d+1, take the $(N_1-dn)\times(N_1-dn)$ matrix obtained in step 5.3 of the current iteration as the current matrix, and return to step 5.2 to search for the next optimal combination in the next iteration. If so, stop searching; the m optimal combinations have been obtained.
The present invention correspondingly provides a distributed storage system for spatial small-file data based on access log information, comprising the following units.
A spatial small-file data set preprocessing unit (100), for dividing the spatial small-file data set of any spatial small-file data type into a frequently accessed subset and an infrequently accessed subset according to access frequency. It comprises the following modules. A spatial small-file access frequency statistics module (101), for obtaining the access heat of each spatial small file, as follows.
Let the spatial small-file data set be $F=\{f_1,f_2,\ldots,f_N\}$, containing the spatial small files $f_1,f_2,\ldots,f_N$, where $N$ is the total number of files and the $i$-th file is denoted $f_i$, $i=1,2,\ldots,N$.
Suppose the access log information records that the files $f_{a_1},f_{a_2},\ldots,f_{a_M}$ were accessed in turn. The access log sequence is then $R=(f_{a_1},f_{a_2},\ldots,f_{a_M})$, and $A=(a_1,a_2,\ldots,a_M)$ is the access sequence vector, with $a_t\in[1,N]$ and access index $t=1,2,\ldots,M$, where $M$ is the total number of accesses to all files in $F$.
Count the number of times $\lambda_i$ that each file $f_i$ occurs in the access log sequence $R$; $\lambda_i$ is the access heat of $f_i$.
A frequently accessed spatial small-file extraction module (102), for extracting the frequently accessed spatial small files according to access heat, as follows.
Input the preset discriminant parameter $\lambda$.
If the access heat of a file $f_i$ in $F$ satisfies $\lambda_i>\lambda$, then $f_i$ is a frequently accessed spatial small file; otherwise $f_i$ is an infrequently accessed one.
A frequently accessed spatial small-file subset construction module (103), for forming a subset of the spatial small-file data set from the frequently accessed files obtained by the extraction module (102), as follows.
Let the subset formed by all frequently accessed files be $F_1=\{f^1_1,f^1_2,\ldots,f^1_{N_1}\}$, where $N_1$ is the total number of frequently accessed files, and the $i_1$-th and $j_1$-th frequently accessed files are denoted $f^1_{i_1}$ and $f^1_{j_1}$, $i_1,j_1\in[1,N_1]$.
A spatial small-file access vector acquisition unit (200), for extracting the access sequence of the frequently accessed subset from the access log information. The access sequence is formed in chronological order as $R_1=(f^1_{a^1_1},f^1_{a^1_2},\ldots,f^1_{a^1_{M_1}})$; $A_1=(a^1_1,a^1_2,\ldots,a^1_{M_1})$ is the access sequence vector of the frequently accessed files, with $a^1_{t_1}\in[1,N_1]$ and access index $t_1=1,2,\ldots,M_1$, where $M_1$ is the total number of accesses to all frequently accessed files in $F_1$.
A spatial small-file access association matrix computing unit (300), for using the access sequence of the frequently accessed subset to compute segment by segment the association degree between frequently accessed files and to assemble the pairwise association degrees into an association matrix. It comprises the following modules. A frequent access sequence segment length computing module (301), for computing the segment length of the frequent access sequence, $n=N_1/m$, from the number of storage servers $m$ and the size $N_1$ of the frequently accessed subset.
The number of storage servers m is supplied as an external input.
A frequent access sequence segmentation module (302), for segmenting the frequent access sequence according to the segment length, as follows.
Following the access order, cut the access sequence vector $A_1$ into subvectors of $n$ elements each, written $A_1=(S_1,S_2,\ldots,S_l)$, where $S_k=(a_{k1},a_{k2},\ldots,a_{kn})$, $a_{kj}\in[1,N_1]$, $1\le k\le l$, $1\le j\le n$. Denote the set of all subvectors of $A_1$ by $S=\{S_k : k\in[1,l]\}$.
A spatial small-file association degree computing module (303), for computing the pairwise association degree of the frequently accessed files, as follows.
Define the function $r_{S_k}(i_1,j_1)=\begin{cases}1, & i_1\in\bar S_k \text{ and } j_1\in\bar S_k\\ 0, & \text{otherwise}\end{cases}$
where $\bar S_k$ is the set of all elements of $S_k$; the function $r_{S_k}(i_1,j_1)$ indicates whether the frequently accessed files $f^1_{i_1}$ and $f^1_{j_1}$ are associated within an access period of length $n$.
Define the function $R_S(i_1,j_1)=\sum_{k=1}^{l} r_{S_k}(i_1,j_1)$,
where $R_S(i_1,j_1)$ is the total association degree of $f^1_{i_1}$ and $f^1_{j_1}$ over $S$.
A spatial small-file association matrix generation module (304), for assembling the pairwise association degrees of the frequently accessed files into the association matrix $R_S=\big(R_S(i_1,j_1)\big)_{N_1\times N_1}$.
An association matrix conversion and reordering unit (400), for magnitude-inverting each element of the association matrix, then reordering the matrix with the RCM ordering algorithm and outputting it.
An association matrix optimal combination search unit (500), for finding the optimal combinations in the reordered association matrix with the local approximate search method.
A spatial small-file distributed storage unit (600), for distributing the frequently accessed spatial small-file data across storage using the optimal combinations obtained by the optimal combination search unit (500), and for storing the infrequently accessed spatial small-file data separately according to spatial adjacency.
Furthermore, the association matrix conversion and reordering unit (400) comprises the following modules.
An association matrix maximum element acquisition module (401), for obtaining the maximum element of the association matrix by traversing all of its element values, giving the maximum $R_{max}$.
An association matrix element magnitude inversion module (402), for magnitude-inverting the elements of the association matrix by traversing all element values and performing $R_S(i_1,j_1)=R_{max}-R_S(i_1,j_1)$.
An association matrix reordering module (403), for reordering the association matrix with the standard RCM ordering algorithm.
Furthermore, the association matrix optimal combination search unit (500) comprises the following modules.
An initialization module, for initializing the iteration counter d = 1.
An optimal combination search module, for finding one optimal combination with the local approximate search method: find an n × n block in the current matrix such that the matrix element values within that block are maximal; the n corresponding files form one optimal combination. When the optimal combination search module runs for the first time, the current matrix is the reordered association matrix produced by the association matrix conversion and reordering unit (400); on subsequent runs, the current matrix is the matrix produced by the previous iteration.
A matrix update module, for deleting the association matrix elements corresponding to the n files after the optimal combination search module has found an optimal combination of n files in the current iteration, giving an $(N_1-dn)\times(N_1-dn)$ matrix.
A decision and output module, for determining whether d = m-1. If not, it sets d = d+1, takes the $(N_1-dn)\times(N_1-dn)$ matrix produced by the matrix update module in the current iteration as the current matrix, and instructs the optimal combination search module to search for the next optimal combination in the next iteration. If so, the search stops; the m optimal combinations have been obtained.
The advantages of the invention are as follows. Spatial small-file data are enormous in number, yet user access behavior is clustered: most requests concentrate on a small fraction of the files. The invention therefore classifies spatial small-file data by access heat, uses access log information to compute the pairwise association degrees of the frequently accessed files, forms an association matrix, and finds an optimal distributed storage combination scheme by local approximate search, while storing files of different heat with different distribution schemes. With limited consumption of computing resources, this achieves optimized distributed storage of massive spatial small-file data, improving parallel access performance and the service capability of spatial information systems. The invention thus reduces the overlap of spatial small-file accesses within a single server and ultimately achieves a high parallel access rate across servers, improving the service performance for spatial small-file data. It also reduces the amount of data to be computed, is efficient, is practical for engineering use, and is applicable to the technical field of distributed storage of spatial small-file data in large-scale distributed environments.
Brief description of the drawings
Fig. 1 is a schematic diagram of the system structure in the embodiment of the present invention.
Fig. 2 is a schematic structural diagram of the spatial small-file data set preprocessing unit 100 in the embodiment of the present invention.
Fig. 3 is a schematic structural diagram of the spatial small-file access association matrix computing unit 300 in the embodiment of the present invention.
Fig. 4 is a schematic structural diagram of the association matrix conversion and reordering unit 400 in the embodiment of the present invention.
Fig. 5 is a flow chart of the method in the embodiment of the present invention.
Detailed description of the embodiments
In a distributed environment, parallel access to spatial small-file data cannot be achieved by splitting the data into blocks and distributing the blocks. The correlation between spatial small files must therefore be analyzed so that, when the data are accessed, the requested files are stored on different storage servers as far as possible, maximizing parallel retrieval and thereby improving the performance of the spatial information service system.
Because the number of spatial small files is huge, optimizing the storage combination of large-scale spatial small-file data has high computational complexity and a large search time overhead. The files therefore need to be classified by heat, and the best storage combination scheme is obtained with a different method for each heat class.
The implementation of the technical solution of the present invention is described in detail below.
The spatial small-file data of the present invention carry a spatial data type and a spatial coordinate position. Each file is small and is not suited to being split further into parts stored on different servers to improve parallel access efficiency. The access log information is recorded in chronological order by the corresponding spatial information service system and logs each client application's accesses to spatial small-file data, including the type and coordinates of the files accessed. It is recorded by the spatial information service system during operation, in a form that includes but is not limited to files and databases.
The spatial small-file data include different types, including but not limited to SRTM30 (the 30 m global Shuttle Radar Topography Mission terrain data files) and SRTM90.
The distributed storage method and system for spatial small-file data based on access log information handles each type of spatial small-file data separately, and the processing is the same for every type.
As shown in Fig. 5, the technical scheme adopted by the method of the present invention is a distributed storage method for spatial small-file data based on access log information which, for any type of spatial small-file data, performs the following steps:
(1) Extraction of the frequently accessed spatial small-file subset: divide the spatial small-file data set into a frequently accessed subset and an infrequently accessed subset according to access frequency, comprising the following sub-steps.
1. Obtain the access heat of each spatial small file.
Let the spatial small-file data set be $F=\{f_1,f_2,\ldots,f_N\}$, containing the spatial small files $f_1,f_2,\ldots,f_N$, where $N$ is the total number of files and the $i$-th file is denoted $f_i$, $i=1,2,\ldots,N$.
Suppose the access log information records that the files $f_{a_1},f_{a_2},\ldots,f_{a_M}$ were accessed in turn. The access log sequence is then $R=(f_{a_1},f_{a_2},\ldots,f_{a_M})$, and the corresponding $A=(a_1,a_2,\ldots,a_M)$ is the access sequence vector, with $a_t\in[1,N]$ (access index $t=1,2,\ldots,M$), where $M$ is the total number of accesses to all files in $F$.
Count the number of times $\lambda_i$ that each $f_i$ ($f_i\in F$) occurs in the access log sequence $R$; $\lambda_i$ is then the access heat of $f_i$.
2. Extract the frequently accessed spatial small files according to access heat.
Input the preset discriminant parameter $\lambda$ for frequently accessed spatial small files.
If the access heat of a file $f_i$ in $F$ satisfies $\lambda_i>\lambda$, then $f_i$ is a frequently accessed spatial small file; otherwise $f_i$ is an infrequently accessed one.
3. Form a subset of the spatial small-file data set F from the frequently accessed files obtained in 2.
Let the subset formed by all frequently accessed files be $F_1=\{f^1_1,f^1_2,\ldots,f^1_{N_1}\}$, where $N_1$ is the total number of frequently accessed files, and the $i_1$-th and $j_1$-th frequently accessed files are denoted $f^1_{i_1}$ and $f^1_{j_1}$, $i_1,j_1\in[1,N_1]$.
Similarly, let the set of infrequently accessed spatial small files be $F_2=\{f^2_1,f^2_2,\ldots,f^2_{N_2}\}$, where $N_2$ is the total number of infrequently accessed files and $N_1+N_2=N$.
(2) Extraction of the access sequence of the frequently accessed subset: extract the access sequence of the frequently accessed spatial small-file subset from the access log information.
The access log information records the coordinates of the spatial data, and different coordinates denote different data. The coordinate information of the accessed spatial small files can therefore be extracted from the access log in order of access time. In a specific implementation, the extraction method depends on the record format of the access log. The coordinate information is the longitude and latitude of the spatial small file.
The access sequence subset is extracted according to the frequently accessed subset, as follows.
Going through the spatial small files in the access log in order of access time, take those that are frequently accessed to form the access sequence of the frequently accessed subset, $R_1=(f^1_{a^1_1},f^1_{a^1_2},\ldots,f^1_{a^1_{M_1}})$. The corresponding $A_1=(a^1_1,a^1_2,\ldots,a^1_{M_1})$ is the access sequence vector of the frequently accessed files, with $a^1_{t_1}\in[1,N_1]$ (access index $t_1=1,2,\ldots,M_1$), where $M_1$ is the total number of accesses to all frequently accessed files in $F_1$.
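The filtering described above can be sketched as follows. This is an illustration under assumptions: the log is taken to be already parsed into a list of file identifiers in time order, the identifiers are hypothetical strings, and `frequent_access_vector` is a hypothetical helper name. In a real log each entry would be derived from the recorded data type and coordinates.

```python
def frequent_access_vector(access_log, frequent):
    """Return the access sequence vector A_1 over the frequent subset.

    access_log: file ids in order of access time.
    frequent:   the frequently accessed files, in the order that
                defines their indices 1..N1.
    """
    index = {f: i for i, f in enumerate(frequent, start=1)}
    # Keep only accesses to frequent files, preserving time order.
    return [index[f] for f in access_log if f in index]

# Hypothetical toy log: f1 and f2 are the frequently accessed files.
log = ["f1", "f2", "f1", "f3", "f1", "f2", "f4"]
a1 = frequent_access_vector(log, ["f1", "f2"])
print(a1)  # [1, 2, 1, 1, 2]
```

The length of the returned list is the total number of accesses to frequently accessed files.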
(3) Degree-of-association calculation and incidence matrix acquisition: segment the access sequence of the frequently accessed space small documents data subset, calculate the degree of association between the frequently accessed space small documents data, and assemble the pairwise degree-of-association values into an incidence matrix. This includes the following sub-steps:
1. Calculate the frequent access sequence segment length n from the number of storage servers and the length N_1 of the frequently accessed space small documents data subset.
The number of storage servers m can be supplied externally, for example through a system configuration file.
The frequent access sequence segment length n is calculated by the formula n = N_1/m.
2. Segment the frequent access sequence according to the access sequence segment length.
Following the access order of the frequently accessed space small documents data, the access sequence vector A_1 is cut into subvectors of n elements each, expressed as A_1 = (S_1, S_2, …, S_l), where each subvector S_k = (a_k1, a_k2, …, a_kn), a_kj ∈ [1, N_1], 1 ≤ k ≤ l, 1 ≤ j ≤ n, is a subvector of A_1 of length n. The set of all access vectors of length n in A_1 is denoted S, i.e. the set of all subvectors of A_1 is S = {S_k : k ∈ [1, l]}.
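The segment length and the segmentation can be sketched as below. The text does not say what happens when m does not divide N_1 or when the sequence length is not a multiple of n, so the integer division and the dropping of a short trailing remainder are assumptions; the function names are hypothetical.

```python
def segment_length(n_frequent, n_servers):
    """n = N1 / m; integer division is an assumption for the non-divisible case."""
    n = n_frequent // n_servers
    if n < 1:
        raise ValueError("need at least one frequent file per server")
    return n

def segment(access_vector, n):
    """Cut A_1 into consecutive subvectors S_1..S_l of exactly n elements.

    A trailing remainder shorter than n is dropped (an assumption).
    """
    l = len(access_vector) // n
    return [access_vector[k * n:(k + 1) * n] for k in range(l)]

n = segment_length(6, 3)                      # N1 = 6 files, m = 3 servers
S = segment([1, 2, 3, 4, 3, 4, 5, 6, 1], n)
print(n, S)  # 2 [[1, 2], [3, 4], [3, 4], [5, 6]]
```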
3. Calculate the pairwise degree-of-association values of the frequently accessed space small documents data.
First, the degree of association between space small documents data within each segment is calculated. For each S_k ∈ S, define the function R_{S_k}(i_1, j_1) over the set of all elements of S_k. The function indicates whether the i_1-th and j_1-th frequently accessed space small documents data are associated within an access period of length n.
On this basis, define the function:

$$R_S(i_1, j_1) = \sum_{k=1}^{l} R_{S_k}(i_1, j_1), \qquad 1 \le i_1 \le N_1,\ 1 \le j_1 \le N_1$$

Then R_S(i_1, j_1) represents the total degree of association between the i_1-th and j_1-th frequently accessed space small documents data over S.
4. Assemble the pairwise degree-of-association values of the frequently accessed space small documents data into an incidence matrix.
Representing the pairwise degrees of association of all N_1 frequently accessed space small documents data as a matrix yields the following incidence matrix R_S:

$$R_S = \bigl(R_S(i_1, j_1)\bigr)_{N_1 \times N_1}, \qquad 1 \le i_1 \le N_1,\ 1 \le j_1 \le N_1$$
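The construction of R_S can be sketched as follows. The per-segment function is described only in words here, so this sketch assumes it is an indicator: 1 when both files occur in the same segment S_k, 0 otherwise, with a file never counted as associated with itself. Both the indicator form and the zero diagonal are assumptions, and `association_matrix` is a hypothetical name.

```python
def association_matrix(segments, n_frequent):
    """Build R_S as an N1 x N1 list of lists (file i is stored at index i - 1)."""
    R = [[0] * n_frequent for _ in range(n_frequent)]
    for seg in segments:
        members = set(seg)  # the set of all elements of S_k
        for i in members:
            for j in members:
                if i != j:  # assumed: a file is not associated with itself
                    R[i - 1][j - 1] += 1  # adds the assumed indicator for S_k
    return R

# Hypothetical segments over N1 = 4 frequent files.
S = [[1, 2], [3, 4], [3, 4], [1, 2], [2, 3]]
R = association_matrix(S, 4)
for row in R:
    print(row)
```

Under these assumptions the matrix is symmetric, and files that often share a segment (here 1 and 2, or 3 and 4) receive the largest entries.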
(4) Incidence matrix conversion and rearrangement output: convert the magnitude of each element value in the incidence matrix, then rearrange the matrix with the RCM sorting algorithm and output it. This includes the following sub-steps:
1. Obtain the maximum element value in the incidence matrix.
Traverse all element values of the incidence matrix and obtain the maximum value R_max.
2. Convert the magnitude of the incidence matrix element values.
Traverse all element values of the incidence matrix and perform the operation R_S(i_1, j_1) = R_max - R_S(i_1, j_1), so that large element values become small and small element values become large.
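A minimal sketch of the two traversals, with the matrix held as a list of lists; the function name `invert_magnitudes` and the sample matrix are hypothetical.

```python
def invert_magnitudes(R):
    """Return a new matrix whose entries are R_max - R(i, j)."""
    r_max = max(max(row) for row in R)  # first traversal: find R_max
    # second traversal: every element, diagonal included, is replaced
    return [[r_max - value for value in row] for row in R]

R = [[0, 2, 0],
     [2, 0, 1],
     [0, 1, 0]]
converted = invert_magnitudes(R)
print(converted)  # [[2, 0, 2], [0, 2, 1], [2, 1, 2]]
```

After the conversion, a large entry marks a pair of files that rarely appear in the same access segment.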
3. Rearrange the incidence matrix using the standard RCM sorting algorithm.
The incidence matrix is rearranged using the standard RCM sorting algorithm, with the goal of concentrating the nonzero elements of the incidence matrix near the diagonal. The new matrix after rearrangement is denoted P. The standard RCM sorting algorithm is prior art; for a specific implementation, refer to Gibbs N E, Poole W G, Stockmeyer P K. An algorithm for reducing the bandwidth and profile of a sparse matrix. SIAM Journal on Numerical Analysis, 1976, 13(2): 236-250.
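To show what such a rearrangement does, here is a simplified pure-Python reverse Cuthill-McKee ordering. It is a sketch, not the procedure of the cited reference: it starts each connected component from a minimum-degree vertex rather than a carefully chosen peripheral one, and it assumes a symmetric matrix. The names `rcm_order`, `permute` and `bandwidth` are hypothetical.

```python
from collections import deque

def rcm_order(matrix):
    """Reverse Cuthill-McKee ordering; nonzero off-diagonal entries are edges."""
    size = len(matrix)
    neighbours = [[j for j in range(size) if j != i and matrix[i][j] != 0]
                  for i in range(size)]
    degree = [len(adj) for adj in neighbours]
    visited = [False] * size
    order = []
    # Start each connected component from an unvisited vertex of minimum degree.
    for start in sorted(range(size), key=lambda v: degree[v]):
        if visited[start]:
            continue
        visited[start] = True
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Breadth-first visit, lower-degree neighbours first.
            for w in sorted(neighbours[v], key=lambda u: degree[u]):
                if not visited[w]:
                    visited[w] = True
                    queue.append(w)
    order.reverse()  # the "reverse" in reverse Cuthill-McKee
    return order

def permute(matrix, order):
    """Apply the same ordering to rows and columns."""
    return [[matrix[i][j] for j in order] for i in order]

def bandwidth(matrix):
    """Largest distance of a nonzero entry from the diagonal."""
    return max((abs(i - j) for i, row in enumerate(matrix)
                for j, v in enumerate(row) if v != 0), default=0)

# Hypothetical 5 x 5 matrix whose nonzeros form the chain 0-3-1-4-2.
M = [[0] * 5 for _ in range(5)]
for a, b in [(0, 3), (3, 1), (1, 4), (4, 2)]:
    M[a][b] = M[b][a] = 1
order = rcm_order(M)
P = permute(M, order)
print(order, bandwidth(M), bandwidth(P))  # [2, 4, 1, 3, 0] 3 1
```

The ordering must be kept alongside P, because row k of P corresponds to file `order[k]` of the original matrix.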
(5) Optimal storage distribution combination search and output.
For the rearranged incidence matrix obtained in (4), the partial approximation search method is used to find the best combinations, so as to obtain the highest concurrent access rate for the space small documents data of the subset. The partial approximation search method is prior art; for a specific implementation, refer to XIA Kai, Wen-zhan. Adaptive Genetic Algorithm Based on Local Search Mechanism Quickly Solving TSP. Journal of Zhejiang Institute of Science and Technology, 2014, 31(3).
Starting from the rearranged incidence matrix obtained in (4), the partial approximation search method is applied iteratively. Each execution yields one best combination containing n files, and m combinations are finally obtained, to be stored on the m storage servers respectively. Each combination consists of n files, and the pairwise association values of these n files correspond to an n × n block in the matrix. The implementation is as follows:
1. Initialize the current iteration number d = 1.
2. Find one best combination using the partial approximation search method: find an n × n block in the current matrix such that the corresponding matrix element values within the n × n block are the largest in the matrix; the corresponding n files form one best combination.
When sub-step 2 is performed for the first time, the current matrix is the rearranged incidence matrix obtained in (4), of size N_1 × N_1. In subsequent executions of sub-step 2, the current matrix is the (N_1 - (d-1)n) × (N_1 - (d-1)n) matrix obtained in the previous iteration.
3. After sub-step 2 of the current iteration has found a best combination of n files, delete the incidence matrix elements corresponding to these n files from the incidence matrix, obtaining an (N_1 - dn) × (N_1 - dn) matrix. This shrinks the incidence matrix that remains to be searched and saves search time.
4. Judge whether d = m-1. If not, let d = d+1, take the (N_1 - dn) × (N_1 - dn) matrix obtained in sub-step 3 of the current iteration (which is (N_1 - (d-1)n) × (N_1 - (d-1)n) after d = d+1) as the current matrix, and return to sub-step 2 for the next iteration to search for the next best combination. If so, stop searching: the current matrix is now n × n, which directly gives the last best combination of n files; together with the m-1 best combinations obtained by the iterative search, this yields m best combinations in total.
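The iteration over d can be sketched as below. The search inside sub-step 2 is replaced here by a simple stand-in, not the cited partial approximation search method: it slides a window of n consecutive surviving rows along the diagonal and keeps the window whose n × n block has the largest sum. Instead of physically deleting rows and columns, the sketch tracks the indices that survive. It assumes m divides N_1; all names are hypothetical.

```python
def block_sum(P, ids, start, n):
    """Sum of the n x n block of P spanned by ids[start:start + n]."""
    window = ids[start:start + n]
    return sum(P[i][j] for i in window for j in window)

def find_combinations(P, m):
    """Split the rows/columns of P into m groups of n = N1 // m files each."""
    ids = list(range(len(P)))  # rows/columns still in the current matrix
    n = len(P) // m
    combos = []
    for d in range(1, m):      # iterations d = 1 .. m-1
        # Stand-in search: best contiguous diagonal block of the current matrix.
        best = max(range(len(ids) - n + 1),
                   key=lambda s: block_sum(P, ids, s, n))
        combos.append(ids[best:best + n])
        del ids[best:best + n]  # current matrix shrinks by n rows and columns
    combos.append(ids)          # the remaining n x n matrix is the last group
    return combos

# Hypothetical converted and rearranged matrix, N1 = 4, m = 2, so n = 2.
P = [[0, 1, 5, 5],
     [1, 0, 5, 5],
     [5, 5, 0, 9],
     [5, 5, 9, 0]]
print(find_combinations(P, 2))  # [[2, 3], [0, 1]]
```

Each returned group lists row indices of P; mapping them back to files requires the ordering produced by the rearrangement step.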
(6) Space small documents distributed data storage: the best combinations finally obtained in (5) are used to distribute the storage of the frequently accessed space small documents data, and the non-frequently accessed space small documents data are stored separately according to their spatial adjacency.
In the embodiment, the frequently accessed space small documents data are distributed according to the best combinations obtained, so as to obtain the highest concurrent access rate for these space small documents data.
The space small documents data in a best distributed storage combination obtained in step (5) have a low mutual degree of association (that is, the corresponding element values are large in the rearranged incidence matrix after the element magnitude conversion). All space small documents data in one best combination can therefore be stored on one server, which keeps concurrent access among the data on the same server low and achieves a high concurrent access rate across different servers.
In the embodiment, the non-frequently accessed space small documents data are stored separately according to their spatial position relationships, based on the coordinate information of the space small documents data.
For F_2 of step (1), the non-frequently accessed space small documents data are stored on the servers following the principle that data at adjacent positions are stored on different servers.
According to the characteristics of spatial data access, accesses follow continuous spatial paths, so adjacent spatial data have a higher probability of being accessed at the same time. Storing adjacent data on different servers therefore reduces concurrent access on any single server and improves parallelism.
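One simple way to satisfy the adjacency principle is sketched below. It assumes each non-frequently accessed file is identified by integer grid indices (row, col) derived from its coordinates, which the text does not specify; the rule `(row + col) % m` and the function names are illustrative assumptions, not the patent's stated rule.

```python
def server_for_tile(row, col, m):
    """Edge-adjacent tiles differ by 1 in row or col, so (row + col) % m
    places them on different servers whenever m >= 2."""
    return (row + col) % m

def assign_non_frequent(tiles, m):
    """Group (row, col) tiles by the server chosen for them."""
    placement = {server: [] for server in range(m)}
    for row, col in tiles:
        placement[server_for_tile(row, col, m)].append((row, col))
    return placement

tiles = [(0, 0), (0, 1), (1, 0), (1, 1)]
placement = assign_non_frequent(tiles, m=2)
print(placement)  # {0: [(0, 0), (1, 1)], 1: [(0, 1), (1, 0)]}
```

Diagonal neighbours can still share a server under this rule; avoiding that would need a different assignment.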
In a specific implementation, the discriminant parameter for frequently accessed space small documents data, the parameters of the incidence matrix RCM sorting algorithm, and the number of storage servers can be supplied externally or preset by those skilled in the art.
Referring to Fig. 1, the present invention correspondingly provides a space small documents distributed data storage system based on access log information, comprising the following units:
Space small documents data set pretreatment unit (100), for dividing the space small documents data set of any space small documents data type into a frequently accessed subset and a non-frequently accessed subset according to access frequency. Referring to Fig. 2, it includes the following modules: space small documents data access frequency statistics module (101), for obtaining the access temperature of each space small documents data, implemented as follows,
Let the space small documents data set be F = {f_1, f_2, ..., f_N}, containing space small documents data f_1, f_2, ..., f_N, where N is the total number of space small documents data and the i-th space small documents data is denoted f_i, i = 1, 2, ..., N;
Suppose the access log information records the space small documents data accessed in turn, giving the space small documents data access log sequence R; A = (a_1, a_2, …, a_M) is the space small documents data access sequence vector, a_t ∈ [1, N], with access sequence number t = 1, 2, ..., M, where M is the total number of accesses to all space small documents data in F;
Count the number of times λ_i that each space small documents data f_i occurs in the access log sequence R, and take λ_i as the access temperature of the space small documents data f_i;
Frequently accessed space small documents data set extraction module (102), for extracting the frequently accessed space small documents data according to the access temperature of the space small documents data, implemented as follows,
Input the preset discriminant parameter λ,
If the access temperature λ_i of a space small documents data item f_i in the space small documents data set F satisfies λ_i > λ, then f_i is frequently accessed space small documents data; otherwise f_i belongs to the non-frequently accessed space small documents data;
Frequently accessed space small documents subset construction module (103), for forming a subset of the space small documents data set from the frequently accessed space small documents data obtained by the frequently accessed space small documents data set extraction module (102), implemented as follows,
Let the subset formed by all frequently accessed space small documents data be F_1, where N_1 is the total number of frequently accessed space small documents data, and the frequently accessed space small documents data are indexed by i_1, j_1 ∈ [1, N_1];
Space small documents data access vector acquiring unit (200), for extracting the access sequence of the frequently accessed space small documents data subset from the access log information, including forming the access sequence in chronological order; A_1 is the frequently accessed space small documents data access sequence vector, with access sequence number t_1 = 1_1, 2_1, …, M_1, where M_1 is the total number of accesses to all frequently accessed space small documents data in F_1;
Space small documents data access incidence matrix computing unit (300), for segmenting the access sequence of the frequently accessed space small documents data subset, calculating the degree of association between the frequently accessed space small documents data, and assembling the pairwise degree-of-association values into an incidence matrix. Referring to Fig. 3, it includes the following modules: frequent access sequence segment length computing module (301), for calculating the frequent access sequence segment length n = N_1/m from the number of storage servers m and the length N_1 of the frequently accessed space small documents data subset;
Frequent access sequence segmentation module (302), for segmenting the frequent access sequence according to the access sequence segment length, implemented as follows,
Following the access order, the frequently accessed space small documents data access sequence vector A_1 is cut into subvectors of n elements each, expressed as A_1 = (S_1, S_2, …, S_l), where subvector S_k = (a_k1, a_k2, …, a_kn), a_kj ∈ [1, N_1], 1 ≤ k ≤ l, 1 ≤ j ≤ n; the set of all subvectors of A_1 is denoted S, S = {S_k : k ∈ [1, l]};
Space small documents data degree-of-association computing module (303), for calculating the pairwise degree-of-association values of the frequently accessed space small documents data, implemented as follows,
Define the function R_{S_k}(i_1, j_1) over the set of all elements of S_k; the function R_{S_k}(i_1, j_1) indicates whether the i_1-th and j_1-th frequently accessed space small documents data are associated within an access period of length n;
Define the function R_S(i_1, j_1),

$$R_S(i_1, j_1) = \sum_{k=1}^{l} R_{S_k}(i_1, j_1), \qquad 1 \le i_1 \le N_1,\ 1 \le j_1 \le N_1$$

where R_S(i_1, j_1) represents the total degree of association between the i_1-th and j_1-th frequently accessed space small documents data over S;
Space small documents data incidence matrix generation module (304), for assembling the pairwise degree-of-association values of the frequently accessed space small documents data into the incidence matrix R_S,

$$R_S = \bigl(R_S(i_1, j_1)\bigr)_{N_1 \times N_1}, \qquad 1 \le i_1 \le N_1,\ 1 \le j_1 \le N_1$$
Incidence matrix conversion and rearrangement unit (400), for converting the magnitude of each element value in the incidence matrix, then rearranging the matrix with the RCM sorting algorithm and outputting it;
Incidence matrix best combination search unit (500), for finding the best combinations in the rearranged incidence matrix using the partial approximation search method;
Space small documents distributed data storage unit (600), for distributing the storage of the frequently accessed space small documents data using the best combinations obtained by the incidence matrix best combination search unit (500), and storing the non-frequently accessed space small documents data separately according to their spatial adjacency.
Referring to Fig. 4, the incidence matrix conversion and rearrangement unit (400) further includes the following modules:
Incidence matrix element maximum acquisition module (401), for obtaining the maximum element value in the incidence matrix, including traversing all element values of the incidence matrix and obtaining the maximum value R_max;
Incidence matrix element value magnitude conversion module (402), for converting the magnitude of the incidence matrix element values, including traversing all element values of the incidence matrix and performing the operation R_S(i_1, j_1) = R_max - R_S(i_1, j_1);
Incidence matrix rearrangement module (403), for rearranging the incidence matrix using the standard RCM sorting algorithm.
The incidence matrix best combination search unit (500) includes the following modules:
Initialization module, for initializing the current iteration number d = 1;
Best combination search module, for finding one best combination using the partial approximation search method, including finding an n × n block in the current matrix such that the corresponding matrix element values within the n × n block are the largest in the matrix, the corresponding n files forming one best combination; when the best combination search module works for the first time, the current matrix is the rearranged incidence matrix obtained by the incidence matrix conversion and rearrangement unit (400); in subsequent work of the best combination search module, the current matrix is the matrix obtained in the previous iteration;
Matrix update module, for deleting from the incidence matrix the incidence matrix elements corresponding to the n files after the best combination search module has found a best combination of n files in the current iteration, obtaining an (N_1 - dn) × (N_1 - dn) matrix;
Judgment output module, for judging whether d = m-1; if not, let d = d+1, take the (N_1 - dn) × (N_1 - dn) matrix obtained by the matrix update module in the current iteration as the current matrix, and instruct the best combination search module to carry out the next iteration and search for the next best combination; if so, stop searching, m best combinations having been obtained.
The specific implementation of each module corresponds to the method steps and is not repeated here.
The specific embodiments described herein merely illustrate the spirit of the present invention. Those skilled in the art to which the present invention belongs may make various modifications or supplements to the described specific embodiments, or substitute them in similar ways, without departing from the spirit of the present invention or exceeding the scope defined by the appended claims.

Claims (4)

  1. A space small documents distributed data storage method based on access log information, characterized in that: for any space small documents data type, the following steps are performed:
    Step 1, divide the space small documents data set into a frequently accessed subset and a non-frequently accessed subset according to access frequency; including the following sub-steps,
    Step 1.1, obtain the access temperature of each space small documents data, implemented as follows,
    Let the space small documents data set be F = {f_1, f_2, ..., f_N}, containing space small documents data f_1, f_2, ..., f_N, where N is the total number of space small documents data and the i-th space small documents data is denoted f_i, i = 1, 2, ..., N;
    Suppose the access log information records the space small documents data accessed in turn, giving the space small documents data access log sequence R; A = (a_1, a_2, …, a_M) is the space small documents data access sequence vector, a_t ∈ [1, N], with access sequence number t = 1, 2, ..., M, where M is the total number of accesses to all space small documents data in F;
    Count the number of times λ_i that each space small documents data f_i occurs in the access log sequence R, and take λ_i as the access temperature of the space small documents data f_i;
    Step 1.2, extract the frequently accessed space small documents data according to the access temperature of the space small documents data, implemented as follows,
    Input the preset discriminant parameter λ,
    If the access temperature λ_i of a space small documents data item f_i in the space small documents data set F satisfies λ_i > λ, then f_i is frequently accessed space small documents data; otherwise f_i belongs to the non-frequently accessed space small documents data;
    Step 1.3, form a subset of the space small documents data set from the frequently accessed space small documents data obtained in step 1.2, implemented as follows,
    Let the subset formed by all frequently accessed space small documents data be F_1, where N_1 is the total number of frequently accessed space small documents data, and the frequently accessed space small documents data are indexed by i_1, j_1 ∈ [1, N_1];
    Step 2, extract the access sequence of the frequently accessed space small documents data subset from the access log information, including forming the access sequence in chronological order; A_1 is the frequently accessed space small documents data access sequence vector, with access sequence number t_1 = 1_1, 2_1, …, M_1, where M_1 is the total number of accesses to all frequently accessed space small documents data in F_1;
    Step 3, segment the access sequence of the frequently accessed space small documents data subset, calculate the degree of association between the frequently accessed space small documents data, and assemble the pairwise degree-of-association values into an incidence matrix; including the following sub-steps,
    Step 3.1, calculate the frequent access sequence segment length n = N_1/m from the number of storage servers m and the length N_1 of the frequently accessed space small documents data subset;
    Step 3.2, segment the frequent access sequence according to the access sequence segment length, implemented as follows,
    Following the access order, the frequently accessed space small documents data access sequence vector A_1 is cut into subvectors of n elements each, expressed as A_1 = (S_1, S_2, …, S_l), where subvector S_k = (a_k1, a_k2, …, a_kn), a_kj ∈ [1, N_1], 1 ≤ k ≤ l, 1 ≤ j ≤ n; the set of all subvectors of A_1 is denoted S, S = {S_k : k ∈ [1, l]};
    Step 3.3, calculate the pairwise degree-of-association values of the frequently accessed space small documents data, implemented as follows,
    Define the function R_{S_k}(i_1, j_1) over the set of all elements of S_k; the function R_{S_k}(i_1, j_1) indicates whether the i_1-th and j_1-th frequently accessed space small documents data are associated within an access period of length n;
    Define the function R_S(i_1, j_1),
    $$R_S(i_1, j_1) = \sum_{k=1}^{l} R_{S_k}(i_1, j_1), \qquad 1 \le i_1 \le N_1,\ 1 \le j_1 \le N_1$$
    where R_S(i_1, j_1) represents the total degree of association between the i_1-th and j_1-th frequently accessed space small documents data over S;
    Step 3.4, assemble the pairwise degree-of-association values of the frequently accessed space small documents data into the incidence matrix R_S,
    $$R_S = \bigl(R_S(i_1, j_1)\bigr)_{N_1 \times N_1}, \qquad 1 \le i_1 \le N_1,\ 1 \le j_1 \le N_1$$
    Step 4, convert the magnitude of each element value in the incidence matrix, then rearrange the matrix with the RCM sorting algorithm and output it;
    Step 5, find the best combinations in the rearranged incidence matrix using the partial approximation search method;
    Step 5 includes the following sub-steps,
    Step 5.1, initialize the current iteration number d = 1;
    Step 5.2, find one best combination using the partial approximation search method, including finding an n × n block in the current matrix such that the corresponding matrix element values within the n × n block are the largest in the matrix, the corresponding n files forming one best combination; when step 5.2 is performed for the first time, the current matrix is the rearranged incidence matrix obtained in step 4; in subsequent executions of step 5.2, the current matrix is the matrix obtained in the previous iteration;
    Step 5.3, after step 5.2 of the current iteration has found a best combination of n files, delete from the incidence matrix the incidence matrix elements corresponding to the n files, obtaining an (N_1 - dn) × (N_1 - dn) matrix;
    Step 5.4, judge whether d = m-1; if not, let d = d+1, take the (N_1 - dn) × (N_1 - dn) matrix obtained in step 5.3 of the current iteration as the current matrix, and return to step 5.2 for the next iteration to search for the next best combination; if so, stop searching, m best combinations having been obtained;
    Step 6, distribute the storage of the frequently accessed space small documents data using the best combinations obtained in step 5, and store the non-frequently accessed space small documents data separately according to their spatial adjacency.
  2. The space small documents distributed data storage method based on access log information according to claim 1, characterized in that: step 4 includes the following sub-steps,
    Step 4.1, obtain the maximum element value in the incidence matrix, including traversing all element values of the incidence matrix and obtaining the maximum value R_max;
    Step 4.2, convert the magnitude of the incidence matrix element values, including traversing all element values of the incidence matrix and performing the operation R_S(i_1, j_1) = R_max - R_S(i_1, j_1);
    Step 4.3, rearrange the incidence matrix using the standard RCM sorting algorithm.
  3. A space small documents distributed data storage system based on access log information, characterized in that: it comprises the following units,
    Space small documents data set pretreatment unit (100), for dividing the space small documents data set of any space small documents data type into a frequently accessed subset and a non-frequently accessed subset according to access frequency; including the following modules,
    Space small documents data access frequency statistics module (101), for obtaining the access temperature of each space small documents data, implemented as follows,
    Let the space small documents data set be F = {f_1, f_2, ..., f_N}, containing space small documents data f_1, f_2, ..., f_N, where N is the total number of space small documents data and the i-th space small documents data is denoted f_i, i = 1, 2, ..., N;
    Suppose the access log information records the space small documents data accessed in turn, giving the space small documents data access log sequence R; A = (a_1, a_2, …, a_M) is the space small documents data access sequence vector, a_t ∈ [1, N], with access sequence number t = 1, 2, ..., M, where M is the total number of accesses to all space small documents data in F;
    Count the number of times λ_i that each space small documents data f_i occurs in the access log sequence R, and take λ_i as the access temperature of the space small documents data f_i;
    Frequently accessed space small documents data set extraction module (102), for extracting the frequently accessed space small documents data according to the access temperature of the space small documents data, implemented as follows,
    Input the preset discriminant parameter λ,
    If the access temperature λ_i of a space small documents data item f_i in the space small documents data set F satisfies λ_i > λ, then f_i is frequently accessed space small documents data; otherwise f_i belongs to the non-frequently accessed space small documents data;
    Frequently accessed space small documents subset construction module (103), for forming a subset of the space small documents data set from the frequently accessed space small documents data obtained by the frequently accessed space small documents data set extraction module (102), implemented as follows,
    Let the subset formed by all frequently accessed space small documents data be F_1, where N_1 is the total number of frequently accessed space small documents data, and the frequently accessed space small documents data are indexed by i_1, j_1 ∈ [1, N_1];
    Space small documents data access vector acquiring unit (200), for extracting the access sequence of the frequently accessed space small documents data subset from the access log information, including forming the access sequence in chronological order; A_1 is the frequently accessed space small documents data access sequence vector, with access sequence number t_1 = 1_1, 2_1, …, M_1, where M_1 is the total number of accesses to all frequently accessed space small documents data in F_1;
    Space small documents data access incidence matrix computing unit (300), for segmenting the access sequence of the frequently accessed space small documents data subset, calculating the degree of association between the frequently accessed space small documents data, and assembling the pairwise degree-of-association values into an incidence matrix; including the following modules,
    Frequent access sequence segment length computing module (301), for calculating the frequent access sequence segment length n = N_1/m from the number of storage servers m and the length N_1 of the frequently accessed space small documents data subset;
    Frequent access sequence segmentation module (302), for segmenting the frequent access sequence according to the access sequence segment length, implemented as follows,
    Following the access order, the frequently accessed space small documents data access sequence vector A_1 is cut into subvectors of n elements each, expressed as A_1 = (S_1, S_2, …, S_l), where subvector S_k = (a_k1, a_k2, …, a_kn), a_kj ∈ [1, N_1], 1 ≤ k ≤ l, 1 ≤ j ≤ n; the set of all subvectors of A_1 is denoted S, S = {S_k : k ∈ [1, l]};
    Space small documents data degree-of-association computing module (303), for calculating the pairwise degree-of-association values of the frequently accessed space small documents data, implemented as follows,
    Define the function R_{S_k}(i_1, j_1) over the set of all elements of S_k; the function R_{S_k}(i_1, j_1) indicates whether the i_1-th and j_1-th frequently accessed space small documents data are associated within an access period of length n;
    Define the function R_S(i_1, j_1),
    $$R_S(i_1, j_1) = \sum_{k=1}^{l} R_{S_k}(i_1, j_1), \qquad 1 \le i_1 \le N_1,\ 1 \le j_1 \le N_1$$
    where R_S(i_1, j_1) represents the total degree of association between the i_1-th and j_1-th frequently accessed space small documents data over S;
    Space small documents data incidence matrix generation module (304), for assembling the pairwise degree-of-association values of the frequently accessed space small documents data into the incidence matrix R_S,
    $$R_S = \bigl(R_S(i_1, j_1)\bigr)_{N_1 \times N_1}, \qquad 1 \le i_1 \le N_1,\ 1 \le j_1 \le N_1$$
    Incidence matrix conversion rearrangement units (400), for utilizing RCM after carrying out size conversion to each element numerical value in incidence matrix Sort algorithm exports after resetting;
    Incidence matrix best of breed search unit (500), for being sought to the incidence matrix after rearrangement using partial approximation search method Look for best of breed;
    Incidence matrix best of breed search unit (500) is included with lower module,
    Initialization module, for initializing current iteration number d=1;
    Best of breed search module, for finding one best of breed using the partial approximation search method, including searching the current matrix for an $n \times n$ block such that the matrix element values within that block are the maximum in the matrix, the corresponding n files forming one best of breed (that is, one optimal combination of files); when the best of breed search module performs its first task, the current matrix is the rearranged incidence matrix obtained by the incidence matrix conversion rearrangement unit (400); when the best of breed search module performs a subsequent task, the current matrix is the matrix obtained in the previous iteration;
    Matrix update module, for deleting from the incidence matrix the elements corresponding to the n files once the best of breed search module has obtained, in the current iteration, a best of breed composed of n files, yielding an $(N_1 - dn) \times (N_1 - dn)$ matrix;
    Judge output module, for judging whether d=m-1; if not, setting d=d+1, taking the $(N_1 - dn) \times (N_1 - dn)$ matrix obtained by the matrix update module in the current iteration as the current matrix, and instructing the best of breed search module to carry out the next iteration and search for the next best of breed; if so, stopping the search, whereby m best of breed combinations are obtained;
    Space small documents distributed data storage unit (600), for using the best of breed combinations obtained by the incidence matrix best of breed search unit (500) to store the frequently accessed space small documents data in a distributed manner, and for storing the space small documents data that are not frequently accessed separately, according to their spatial adjacency relations.
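The iterative search carried out by unit (500) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the claim does not spell out how the partial approximation search scores an n × n block, so the sketch assumes that a block is a contiguous window on the diagonal of the rearranged matrix and that its score is the sum of its elements. It also assumes N1 = m·n, so that the n files left after m-1 iterations form the m-th combination. All function names are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def block_score(matrix: Matrix, start: int, n: int) -> float:
    """Sum of the n x n diagonal block starting at index `start`.

    ASSUMPTION: the claim only says the block's element values are maximal;
    using the plain sum as the score is a choice made for this sketch.
    """
    return sum(matrix[i][j]
               for i in range(start, start + n)
               for j in range(start, start + n))


def find_best_block(matrix: Matrix, n: int) -> int:
    """Start index of the contiguous diagonal block with the largest score."""
    size = len(matrix)
    return max(range(size - n + 1), key=lambda s: block_score(matrix, s, n))


def search_best_combinations(matrix: Matrix,
                             file_ids: Sequence[str],
                             n: int) -> List[List[str]]:
    """Greedy search: m - 1 iterations of find-block / delete-block."""
    size = len(matrix)
    if size != len(file_ids) or size % n != 0:
        raise ValueError("expected a square N1 x N1 matrix with N1 = m * n")
    m = size // n

    current = [row[:] for row in matrix]  # work on a copy
    ids = list(file_ids)
    groups: List[List[str]] = []

    for _ in range(m - 1):  # iterations d = 1 .. m-1
        start = find_best_block(current, n)
        groups.append(ids[start:start + n])
        # Matrix update: drop the rows and columns of the n chosen files,
        # shrinking the matrix by n in each dimension.
        keep = [k for k in range(len(ids)) if not start <= k < start + n]
        current = [[current[i][j] for j in keep] for i in keep]
        ids = [ids[k] for k in keep]

    groups.append(ids)  # ASSUMPTION: the remaining n files form the m-th group
    return groups


# Example: four files already in rearranged order, groups of n = 2.
example = [
    [0, 2, 1, 0],
    [2, 0, 8, 1],
    [1, 8, 0, 3],
    [0, 1, 3, 0],
]
print(search_best_combinations(example, ["a", "b", "c", "d"], 2))
# -> [['b', 'c'], ['a', 'd']]
```

Restricting the search to diagonal windows keeps each iteration linear in the number of candidate blocks, which only makes sense because the RCM rearrangement is expected to cluster related files near the diagonal; an exhaustive search over all n-subsets would be a different, much more expensive technique.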
  4. The space small documents distributed data storage system based on access log information according to claim 3, characterized in that: the incidence matrix conversion rearrangement unit (400) comprises the following modules,
    Incidence matrix element maximum acquisition module (401), for obtaining the maximum element of the incidence matrix, including traversing all element values of the incidence matrix and obtaining the maximum value $R_{max}$;
    Incidence matrix element value size conversion module (402), for performing size conversion on the incidence matrix element values, including traversing all element values of the incidence matrix and performing the operation $R_S(i_1, j_1) = R_{max} - R_S(i_1, j_1)$;
    Incidence matrix rearrangement module (403), for rearranging the incidence matrix using the standard RCM (reverse Cuthill-McKee) sort algorithm.
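The three modules of claim 4 can be sketched in a few lines. The following is an illustrative sketch only: the size conversion follows the stated operation R_S(i1,j1) = Rmax - R_S(i1,j1), while the reordering is a plain textbook reverse Cuthill-McKee pass written in pure Python. Treating every nonzero entry as a graph edge and breaking ties by node degree are assumptions of this sketch, and the function names are hypothetical.

```python
from collections import deque
from typing import List

Matrix = List[List[float]]


def max_element(matrix: Matrix) -> float:
    """Module (401): traverse all elements and return the maximum, Rmax."""
    return max(value for row in matrix for value in row)


def size_convert(matrix: Matrix) -> Matrix:
    """Module (402): replace every element R by Rmax - R."""
    r_max = max_element(matrix)
    return [[r_max - value for value in row] for row in matrix]


def rcm_order(matrix: Matrix) -> List[int]:
    """Module (403): reverse Cuthill-McKee ordering of the row/column indices.

    ASSUMPTION: an edge joins i and j whenever matrix[i][j] or matrix[j][i]
    is nonzero; the diagonal is ignored.
    """
    size = len(matrix)
    neighbours = [
        [j for j in range(size)
         if j != i and (matrix[i][j] != 0 or matrix[j][i] != 0)]
        for i in range(size)
    ]
    degree = [len(adj) for adj in neighbours]
    visited = [False] * size
    order: List[int] = []

    # Start each connected component from its lowest-degree unvisited node.
    for start in sorted(range(size), key=lambda v: degree[v]):
        if visited[start]:
            continue
        visited[start] = True
        queue = deque([start])
        while queue:
            node = queue.popleft()
            order.append(node)
            # Breadth-first search, visiting low-degree neighbours first.
            for nxt in sorted(neighbours[node], key=lambda v: degree[v]):
                if not visited[nxt]:
                    visited[nxt] = True
                    queue.append(nxt)

    order.reverse()  # the "reverse" in reverse Cuthill-McKee
    return order


def permute(matrix: Matrix, order: List[int]) -> Matrix:
    """Apply the same permutation to rows and columns."""
    return [[matrix[i][j] for j in order] for i in order]


def convert_and_rearrange(matrix: Matrix) -> Matrix:
    """Unit (400): size conversion followed by RCM rearrangement."""
    converted = size_convert(matrix)
    return permute(converted, rcm_order(converted))
```

Where SciPy is available, `scipy.sparse.csgraph.reverse_cuthill_mckee` offers a ready-made ordering for sparse matrices and could stand in for `rcm_order`; the pure-Python version above is kept only so that the sketch has no dependencies.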
CN201510042456.9A 2015-01-28 2015-01-28 Space small documents distributed data storage method and system based on access log information Expired - Fee Related CN104573082B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510042456.9A CN104573082B (en) 2015-01-28 2015-01-28 Space small documents distributed data storage method and system based on access log information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510042456.9A CN104573082B (en) 2015-01-28 2015-01-28 Space small documents distributed data storage method and system based on access log information

Publications (2)

Publication Number Publication Date
CN104573082A (en) 2015-04-29
CN104573082B (en) 2017-11-14

Family

ID=53089144

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510042456.9A Expired - Fee Related CN104573082B (en) 2015-01-28 2015-01-28 Space small documents distributed data storage method and system based on access log information

Country Status (1)

Country Link
CN (1) CN104573082B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885463B (en) * 2017-11-10 2021-08-31 下一代互联网重大应用技术(北京)工程研究中心有限公司 Target file processing method and device
CN109491594B (en) * 2018-09-28 2021-12-03 北京寄云鼎城科技有限公司 Method and device for optimizing data storage space in matrix inversion process
CN109542857B (en) * 2018-11-26 2021-06-29 杭州迪普科技股份有限公司 Audit log storage method, audit log query method, audit log storage device, audit log query device and related equipment
CN111104381A (en) * 2019-11-30 2020-05-05 北京浪潮数据技术有限公司 Log management method, device and equipment and computer readable storage medium
CN111966950B (en) * 2020-10-21 2021-01-15 北京每日优鲜电子商务有限公司 Log sending method and device, electronic equipment and computer readable medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101453688A (en) * 2007-12-04 2009-06-10 中兴通讯股份有限公司 Method for fast responding scene switching in mobile stream media service

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101324948B (en) * 2008-07-24 2015-11-25 阿里巴巴集团控股有限公司 A kind of method of information recommendation and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于向量关系表的自动数据收集算法;杨婧 等;《计算机工程与应用》;20071231;第43卷(第15期);176-179 *

Also Published As

Publication number Publication date
CN104573082A (en) 2015-04-29

Similar Documents

Publication Publication Date Title
CN104573082B (en) Space small documents distributed data storage method and system based on access log information
CN105843841A (en) Small file storing method and system
Chen et al. Distributed modeling in a MapReduce framework for data-driven traffic flow forecasting
CN106547882A (en) A kind of real-time processing method and system of big data of marketing in intelligent grid
Fu et al. An experimental evaluation of large scale GBDT systems
CN107515952A (en) The method and its system of cloud data storage, parallel computation and real-time retrieval
CN105912666A (en) Method for high-performance storage and inquiry of hybrid structure data aiming at cloud platform
Tran et al. MCHT: A maximal clique and hash table-based maximal prevalent co-location pattern mining algorithm
Skluzacek et al. Klimatic: a virtual data lake for harvesting and distribution of geospatial data
CN106570145B (en) Distributed database result caching method based on hierarchical mapping
Ji et al. Scalable nearest neighbor query processing based on inverted grid index
Madbouly et al. Clustering big data based on distributed fuzzy K-medoids: An application to geospatial informatics
Demir et al. Clustering spatial networks for aggregate query processing: A hypergraph approach
Shah et al. Big data analytics framework for spatial data
CN106547890A (en) Quick clustering preprocess method in large nuber of images characteristic vector
Chai et al. A node-priority based large-scale overlapping community detection using evolutionary multi-objective optimization
Anusha et al. Big data techniques for efficient storage and processing of weather data
Liu et al. Processing particle data flows with SmartNICs
Zhang et al. True-link clustering through signaling process and subcommunity merge in overlapping community detection
Mitra et al. Alleviating resource requirements for spatial deep learning workloads
Yang et al. Visualization and adaptive subsetting of earth science data in HDFS: A novel data analysis strategy with Hadoop and Spark
Zhang et al. High-performance spatial join processing on gpgpus with applications to large-scale taxi trip data
Rammer et al. Small is beautiful: Distributed orchestration of spatial deep learning workloads
Han et al. A parallel online trajectory compression approach for supporting big data workflow
Zhang et al. U2sod-db: a database system to manage large-scale ubiquitous urban sensing origin-destination data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20171114

Termination date: 20190128