CN103198119A - Method for fast searching all chained files having same repeating data deleting identification - Google Patents
- Publication number
- CN103198119A (application CN2013101121259A)
- Authority
- CN
- China
- Prior art keywords
- module
- redundancy
- performance
- searching
- database
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention provides a method for quickly finding all linked files that share the same deduplication identifier. The method is built around a high-performance, high-concurrency database and achieves its efficiency by integrating a traversal interface, a kernel hook module, and a redundancy search module. Its modular structure comprises the high-performance, high-concurrency database (1), a kernel hook module (2), a traversal interface module (3), and a redundancy search module (4). The traversal interface, the kernel hook module, and the redundancy search module support highly concurrent multi-process and multi-threaded operation, which improves the overall performance of the system. The redundancy search module provides a redundant configuration, which improves the availability of the system. Because the method seldom needs to traverse the whole file system directory tree, lookups are efficient.
Description
Technical field
The present invention relates to the field of computer application technology, and specifically to a method for quickly finding all linked files that share the same deduplication identifier.
Background technology
Since the start of the 21st century, as the information age has accelerated, enterprise data has grown explosively; the development of the mobile Internet, the Internet of Things, and cloud computing in particular has intensified that growth. An IDC report states that the global data volume is growing at about 60% per year: it reached 1.8 ZB in 2010 and was projected to reach 8 ZB in 2015 and 35 ZB in 2020, marking the arrival of the "big data" era. Data growth brings serious problems: costs rise sharply, bandwidth pressure is high, energy consumption is severe, equipment takes up a great deal of space, and adding equipment cannot fundamentally solve the surge in data volume. At the same time, the world's energy problems are becoming more acute, and energy consumption and environmental protection in the high-tech IT field are drawing increasing attention. The widespread use of the Internet has steadily enlarged the information centers of large enterprises, government agencies, and financial institutions; data exchange keeps increasing, equipment piles up, floor space grows, and power consumption repeatedly reaches new highs. To optimize information and management, building an enterprise information architecture calls all the more for green, energy-saving technology. Saving energy, reducing power consumption, and lowering system cost make research into new green storage technologies for emerging applications urgent. Against this background, data deduplication technology emerged. It effectively reduces duplicate data in a user's storage system, which saves storage capacity and lowers storage cost and management difficulty.
Existing methods for finding all linked files with the same deduplication identifier must traverse the whole file system directory tree one entry at a time, obtaining the identifier of each file found and comparing it. Traversing a directory tree with billions of files consumes a great deal of time and resources. Deduplication techniques can be divided by method into file-level deduplication and block-level deduplication. In a file-level scheme, only one copy is kept of files whose content is duplicated, and at each path where a duplicate file was located a link to that copy is created (the link carries a deduplication identifier that proves the file content is identical, generally the hash value of the file content). When files with the same content under multiple paths need to be restored quickly, a method for rapidly finding all file link paths with that content becomes very important.
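The file-level scheme described above can be illustrated with a short sketch. It is not taken from the patent: the function names `dedup_id` and `deduplicate`, the `store_dir` layout, and the choice of SHA-256 as the content hash are all assumptions made for illustration, and hard links require the stored copy and the paths to live on the same file system.

```python
import hashlib
import os


def dedup_id(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's content (assumed identifier)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def deduplicate(paths, store_dir):
    """Keep one copy per distinct content; every path becomes a hard link to it."""
    for path in paths:
        copy = os.path.join(store_dir, dedup_id(path))
        if not os.path.exists(copy):
            # First occurrence: the stored copy shares the data of this path.
            os.link(path, copy)
        else:
            # Later duplicate: drop its data and link the path to the stored copy.
            os.remove(path)
            os.link(copy, path)
```

After `deduplicate` runs, every path with the same content refers to the same underlying data, which is exactly the situation in which finding all such paths later becomes the costly step.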
Summary of the invention
The purpose of this invention is to provide a method for quickly finding all linked files that share the same deduplication identifier.
Existing methods for finding all linked files with the same deduplication identifier must traverse the whole file system directory tree one entry at a time, obtaining the identifier of each file found and comparing it; traversing a directory tree with billions of files consumes a great deal of time and resources.
The objective of the invention is achieved as follows:
The invention is structured as a method centered on a high-performance, high-concurrency database. The system architecture comprises the high-performance, high-concurrency database (1), a kernel hook module (2), a traversal interface module (3), and a redundancy search module (4). The kernel hook module, the traversal interface module, and the redundancy search module support highly concurrent multi-process and multi-threaded operation, which improves the overall performance of the system, wherein:
The high-performance, high-concurrency database (1) is the core of the architecture. It stores a large amount of hard link information and supports highly concurrent access from multiple processes and threads;
The kernel hook module (2) is mainly responsible for collecting and storing information when a linked file is created, and supports multi-threaded concurrency;
The traversal interface module (3) provides the calling interface through which upper-level applications traverse the system;
The redundancy search module (4) acts when the high-performance, high-concurrency database (1) does not contain the required information: it traverses the whole storage system, performs a redundant search, and puts the information it finds into the high-performance, high-concurrency database (1).
The beneficial effects of the invention are as follows. The kernel hook module, the traversal interface module, and the redundancy search module support highly concurrent multi-process and multi-threaded operation, which improves the overall performance of the system. The redundancy search module provides a redundant configuration, which improves the availability of the system. The whole file system directory tree seldom needs to be traversed, so lookups are very efficient.
Description of drawings
Fig. 1 is a topology diagram of the traditional search for all hard link paths with the same identifier;
Fig. 2 is a schematic flow diagram of the fast search for all linked files with the same deduplication identifier.
Embodiment
The method of the invention is explained below with reference to the accompanying drawings.
As described in the summary of the invention, the architecture mainly comprises the high-performance, high-concurrency database (1), the kernel hook module (2), the traversal interface module (3), and the redundancy search module (4).
The proposed method for quickly finding all linked files with the same deduplication identifier takes the high-performance, high-concurrency database as its core. In this method, the kernel hook module, the traversal interface module, and the redundancy search module support highly concurrent multi-process and multi-threaded operation, which improves the overall performance of the system. The kernel hook module, the traversal interface module, and the redundancy search module are configured redundantly, which improves the availability of the system. As shown in Fig. 2, the system architecture mainly comprises the high-performance, high-concurrency database (1), the kernel hook module (2), the traversal interface module (3), and the redundancy search module (4).
As the core of the method, the high-performance, high-concurrency database stores the link information and serves fast concurrent lookups.
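The patent does not name a database product or a schema. As a rough illustration only, the role of the database can be pictured as a concurrent key-value index from deduplication identifier to the set of link paths; the class name `LinkIndex` and its methods are hypothetical, and an in-memory dictionary guarded by a lock stands in for a real database.

```python
import threading
from collections import defaultdict


class LinkIndex:
    """In-memory stand-in for the database: identifier -> set of link paths."""

    def __init__(self):
        self._lock = threading.Lock()
        self._paths = defaultdict(set)

    def add(self, ident, path):
        """Record that the link at `path` carries identifier `ident`."""
        with self._lock:
            self._paths[ident].add(path)

    def lookup(self, ident):
        """Return all recorded paths for `ident`, or an empty list if unknown."""
        with self._lock:
            return sorted(self._paths.get(ident, ()))
```

Keying the index by identifier is the design choice that lets a lookup avoid the directory tree entirely: one key access returns every path that was recorded for that content.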
The kernel hook module is registered with the kernel. When a function that creates a linked file executes, control passes into the kernel, where the kernel hook routine stores information such as the file path and the deduplication identifier in the high-performance, high-concurrency database and writes the deduplication identifier into the linked file.
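A real kernel hook would be written as kernel code and is outside the scope of a short example. The following user-space sketch only imitates the behavior described here: it wraps link creation so that the path and identifier are recorded at the moment the link is made. The function name `create_link_with_hook` and the plain dictionary used as the index are assumptions, and the step of writing the identifier into the linked file is omitted.

```python
import os
import threading

# Guards the shared index so concurrent link creations do not lose updates.
_index_lock = threading.Lock()


def create_link_with_hook(index, copy_path, link_path, ident):
    """Create a hard link, then record identifier -> link path in the index.

    `index` is a dict mapping identifier to a set of paths; it stands in for
    the database. A kernel hook would do this inside the link-creation call.
    """
    os.link(copy_path, link_path)
    with _index_lock:
        index.setdefault(ident, set()).add(link_path)
```

Recording the path when the link is created is what makes later lookups cheap: the index is filled as a side effect of normal operation instead of by scanning.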
The traversal interface module provides the interface for traversal calls and is the entry point for all lookup functions. A lookup first queries the high-performance, high-concurrency database. If a database key matching the identifier being searched for is found, the value stored under that key is returned to the caller; otherwise the lookup is handed to the redundancy search module.
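The lookup order described here, database first and full search only on a miss, might be sketched as follows. The function name `find_links`, the dictionary index, and the `fallback_scan` callback are hypothetical; storing the fallback result back into the index follows the description of the redundancy search module in the summary.

```python
def find_links(index, ident, fallback_scan):
    """Return all link paths for `ident`, preferring the index over a scan.

    `index` maps identifier -> set of paths (stand-in for the database).
    `fallback_scan(ident)` is called only on a miss and returns a list of paths.
    """
    paths = index.get(ident)
    if paths:
        # Hit: the key exists, so no directory traversal is needed.
        return sorted(paths)
    # Miss: fall back to the slow search and cache what it finds.
    found = fallback_scan(ident)
    if found:
        index.setdefault(ident, set()).update(found)
    return sorted(found)
```

With this arrangement the expensive scan runs at most once per identifier that has any matches; repeated queries for the same identifier are answered from the index.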
The redundancy search module traverses the whole file system directory tree, depth-first or breadth-first, obtains the identifier of each file, and compares it with the search key. When the complete directory tree has been traversed, the results are returned.
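A minimal breadth-first version of this full traversal is sketched below. The names `file_ident` and `scan_tree` are hypothetical, and the sketch assumes the identifier is the SHA-256 hash of the file content, recomputed for each file; the method described above writes the identifier into the linked file, so a real implementation would presumably read it from there instead.

```python
import hashlib
import os
from collections import deque


def file_ident(path):
    """Assumed identifier: SHA-256 hex digest of the whole file content."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def scan_tree(root, ident):
    """Breadth-first walk of `root`; return every regular file matching `ident`."""
    matches = []
    queue = deque([root])
    while queue:
        directory = queue.popleft()  # use queue.pop() for a depth-first walk
        with os.scandir(directory) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    queue.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    if file_ident(entry.path) == ident:
                        matches.append(entry.path)
    return sorted(matches)
```

Every file in the tree is visited and examined, which is why this path is kept as a fallback and not used for ordinary lookups.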
Everything other than the technical features described in this specification is known to those skilled in the art.
Claims (3)
1. A method for quickly finding all linked files that share the same deduplication identifier, characterized in that it takes a high-performance, high-concurrency database as its core and achieves relatively high lookup efficiency by integrating a traversal interface, a kernel hook module, and a redundancy search module; the modular structure of the method comprises the high-performance, high-concurrency database (1), the kernel hook module (2), the traversal interface module (3), and the redundancy search module (4), wherein:
The high-performance, high-concurrency database (1) is the core of the structure. It stores a large amount of linked-file path information and supports highly concurrent access from multiple processes and threads;
The kernel hook module (2) is mainly responsible for collecting and storing information when a linked file is created, and supports multi-threaded concurrency;
The traversal interface module (3) provides the calling interface through which upper-level applications traverse the system;
The redundancy search module (4) acts when the high-performance, high-concurrency database (1) does not contain the required information: it traverses the whole storage system, performs a redundant search, and puts the information it finds into the high-performance, high-concurrency database (1).
2. The method according to claim 1, characterized in that the kernel hook module, the traversal interface module, and the redundancy search module support highly concurrent multi-process and multi-threaded operation, thereby improving the overall performance of the system.
3. The method according to claim 1, characterized in that the redundancy search module provides a redundant configuration of the method, thereby improving the availability of the system.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2013101121259A CN103198119A (en) | 2013-04-02 | 2013-04-02 | Method for fast searching all chained files having same repeating data deleting identification |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103198119A (en) | 2013-07-10 |
Family
ID=48720677
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2013101121259A (Pending), published as CN103198119A (en) | 2013-04-02 | 2013-04-02 | Method for fast searching all chained files having same repeating data deleting identification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103198119A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2007041456A2 (en) * | 2005-09-30 | 2007-04-12 | Neopath Networks, Inc. | Accumulating access frequency and file attributes for supporting policy based storage management |
CN101719936A (en) * | 2009-12-09 | 2010-06-02 | 成都市华为赛门铁克科技有限公司 | Method, device and cache system for providing file downloading service |
CN102289451A (en) * | 2011-06-17 | 2011-12-21 | 奇智软件(北京)有限公司 | Method and device for searching files or folders |
CN102609453A (en) * | 2012-01-11 | 2012-07-25 | 中国农业大学 | Embedded file searching method and embedded file searching system |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015024511A1 (en) * | 2013-08-21 | 2015-02-26 | International Business Machines Corporation | Adding cooperative file coloring in similarity based deduplication system |
US9542411B2 (en) | 2013-08-21 | 2017-01-10 | International Business Machines Corporation | Adding cooperative file coloring in a similarity based deduplication system |
US9830229B2 (en) | 2013-08-21 | 2017-11-28 | International Business Machines Corporation | Adding cooperative file coloring protocols in a data deduplication system |
US11048594B2 (en) | 2013-08-21 | 2021-06-29 | International Business Machines Corporation | Adding cooperative file coloring protocols in a data deduplication system |
CN106469167A (en) * | 2015-08-18 | 2017-03-01 | 北大方正集团有限公司 | The display packing of file statuss and the display system of file statuss |
CN106469167B (en) * | 2015-08-18 | 2019-06-28 | 北大方正集团有限公司 | The display methods of file status and the display system of file status |
CN107239314A (en) * | 2016-03-28 | 2017-10-10 | 苏州简约纳电子有限公司 | The minimizing technology of data structure is re-defined in ASN.1 compilation processes |
CN107239314B (en) * | 2016-03-28 | 2020-09-01 | 苏州简约纳电子有限公司 | Method for removing repeated definition data structure in ASN.1 compiling process |
WO2018113210A1 (en) * | 2016-12-21 | 2018-06-28 | 深圳市易特科信息技术有限公司 | Repeated medical documentation deletion system and method in medical informationization |
CN108009049A (en) * | 2017-11-28 | 2018-05-08 | 厦门市美亚柏科信息股份有限公司 | The offline restoration methods of MYISAM storage engines deletion records, storage medium |
CN109308284A (en) * | 2018-09-28 | 2019-02-05 | 中国平安财产保险股份有限公司 | Report menu generating method, device, computer equipment and storage medium |
CN109308284B (en) * | 2018-09-28 | 2023-09-19 | 中国平安财产保险股份有限公司 | Report menu generation method and device, computer equipment and storage medium |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20130710 |