CN103198119A - Method for fast searching all chained files having same repeating data deleting identification - Google Patents

Method for fast searching all chained files having same repeating data deleting identification

Info

Publication number
CN103198119A
CN103198119A CN2013101121259A CN201310112125A CN103198119A CN 103198119 A CN103198119 A CN 103198119A CN 2013101121259 A CN2013101121259 A CN 2013101121259A CN 201310112125 A CN201310112125 A CN 201310112125A CN 103198119 A CN103198119 A CN 103198119A
Authority
CN
China
Prior art keywords
module
redundancy
performance
searching
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013101121259A
Other languages
Chinese (zh)
Inventor
王通
郭鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd
Priority to CN2013101121259A
Publication of CN103198119A
Legal status: Pending

Abstract

The invention provides a method for quickly finding all link files that share the same deduplication identifier. The method takes a high-performance, high-concurrency database as its core and achieves high efficiency by integrating a traversal interface, a kernel hook module, and a redundancy search module. Its modular structure comprises the high-performance, high-concurrency database (1), a kernel hook module (2), a traversal interface module (3), and a redundancy search module (4). The traversal interface, kernel hook module, and redundancy search module support highly concurrent multi-process, multi-threaded operation, which improves the overall performance of the system. The redundancy search module provides redundant configuration, which improves system availability. Because the method seldom needs to traverse the entire file system directory tree, it is efficient.

Description

A method for quickly finding all link files with the same deduplication identifier
Technical field
The present invention relates to the field of computer application technology, and specifically to a method for quickly finding all link files with the same deduplication identifier.
Background art
Since the start of the 21st century, as the information age has accelerated, enterprise data has grown explosively, and the development of the mobile Internet, the Internet of Things, and cloud computing has intensified that growth. An IDC report states that the global data volume is growing by 60% per year: it reached 1.8 ZB in 2010 and is expected to reach 8 ZB in 2015 and 35 ZB in 2020, marking the arrival of the "big data" era. Data growth brings the following major problems: sharply rising costs, heavy bandwidth pressure, serious energy consumption, and a large equipment footprint. Simply adding equipment cannot fundamentally solve the problems caused by surging data volumes. At the same time, the energy problems facing the world are increasingly severe, and energy waste and environmental protection in the high-tech IT field are drawing more attention. Widespread use of the Internet has steadily expanded the information centers of large enterprises, government bodies, and financial institutions: data exchange increases, equipment piles up, floor space grows, and power consumption repeatedly reaches new highs. To optimize information and management, green energy-saving technology is increasingly called for when building enterprise information architectures. Saving energy, reducing power consumption, and lowering system cost urgently require research into new green storage technologies for emerging applications. Against this background, data deduplication technology emerged. It effectively reduces duplicate data in a user's storage system, thereby saving storage capacity and reducing storage cost and management difficulty.
Existing methods for finding all link files with the same deduplication identifier must traverse the entire file system directory tree, obtaining and comparing the identifier of every file found. For directory trees holding on the order of a billion files, this traversal consumes a great deal of time and resources. Deduplication techniques can be divided by method into file-level deduplication and block-level deduplication. In a file-level scheme, a single copy is kept for files with duplicate content, and a link to that copy is created at each path where a duplicate file resided (the link carries a deduplication identifier that attests to identical file content, generally a hash of the file content). When files with the same content under multiple paths need to be restored quickly, a method for quickly finding all link paths with that content becomes very important.
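As a minimal sketch of the file-level scheme described above, the Python below keeps the first copy of each distinct content and replaces later duplicates with hard links. It assumes SHA-256 as the content hash and hard links as the link mechanism; the names `dedup_id` and `deduplicate` are hypothetical and do not come from the patent.

```python
import hashlib
import os


def dedup_id(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's content (assumed identifier)."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def deduplicate(root):
    """Replace duplicate files under root with hard links to the first copy.

    Returns a dict mapping each identifier to the list of paths sharing it.
    """
    first_copy = {}  # identifier -> path of the retained copy
    groups = {}      # identifier -> every path with that content
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            ident = dedup_id(path)
            groups.setdefault(ident, []).append(path)
            kept = first_copy.setdefault(ident, path)
            if kept == path or os.path.samefile(kept, path):
                continue  # retained copy, or already linked to it
            # Link under a temporary name, then atomically swap it into place.
            tmp = path + ".dedup-tmp"
            os.link(kept, tmp)
            os.replace(tmp, path)
    return groups
```

The returned mapping is exactly the information that a later search needs: for a given identifier, every path that links to the retained copy.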
Summary of the invention
The purpose of the present invention is to provide a method for quickly finding all link files with the same deduplication identifier.
Existing methods for finding all link files with the same deduplication identifier must traverse the entire file system directory tree, obtaining and comparing the identifier of every file found. For directory trees holding on the order of a billion files, this traversal consumes a great deal of time and resources.
The purpose of the invention is achieved in the following manner:
The invention is structured as a method centered on a high-performance, high-concurrency database. The system architecture comprises a high-performance, high-concurrency database (1), a kernel hook module (2), a traversal interface module (3), and a redundancy search module (4). The kernel hook module, traversal interface module, and redundancy search module support highly concurrent multi-process, multi-threaded operation, thereby improving the overall performance of the system, wherein:
The high-performance, high-concurrency database (1) is the core of the architecture. It stores a large amount of hard link information and supports highly concurrent access by multiple processes and threads;
The kernel hook module (2) is mainly responsible for collecting and storing information when a link file is created, and supports multi-threaded concurrency;
The traversal interface module (3) provides the calling interface through which upper-level applications traverse the system;
The redundancy search module (4) acts when the required information is not present in the high-performance, high-concurrency database (1): it traverses the entire storage system to perform a redundancy search and puts the information it finds into the database (1).
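To make the role of database (1) concrete, the following Python is a rough sketch of an index that maps a deduplication identifier to the set of link paths and tolerates concurrent writers. The patent does not name a database product, so an in-memory dictionary guarded by a lock stands in for it here; the class name `LinkIndex` and its methods are hypothetical.

```python
import threading
from collections import defaultdict


class LinkIndex:
    """In-memory stand-in for database (1): identifier -> set of link paths."""

    def __init__(self):
        self._lock = threading.Lock()
        self._paths = defaultdict(set)

    def add(self, ident, path):
        """Record that `path` is a link whose content has identifier `ident`."""
        with self._lock:
            self._paths[ident].add(path)

    def lookup(self, ident):
        """Return all recorded paths for `ident`, sorted; empty list if unknown."""
        with self._lock:
            # .get avoids creating an empty entry for identifiers never seen.
            return sorted(self._paths.get(ident, ()))
```

A production system would more likely use a key-value store or database server with its own concurrency control; the lock here only illustrates that writers (the hook) and readers (the traversal interface) share one structure.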
The beneficial effects of the invention are as follows: the kernel hook module, traversal interface module, and redundancy search module support highly concurrent multi-process, multi-threaded operation, which improves the overall performance of the system. The redundancy search module provides redundant configuration, which improves system availability. Because the method seldom needs to traverse the entire file system directory tree, searches are very efficient.
Description of the drawings
Fig. 1 is a topology diagram of the traditional search for all hard link paths with the same identifier;
Fig. 2 is a flow diagram of quickly finding all link files with the same deduplication file identifier.
Detailed description of the embodiments
The method of the invention is explained below with reference to the accompanying drawings.
As described in the summary of the invention, the architecture of the invention mainly comprises a high-performance, high-concurrency database (1), a kernel hook module (2), a traversal interface module (3), and a redundancy search module (4).
The proposed method for quickly finding all link files with the same deduplication identifier takes the high-performance, high-concurrency database as its core. It is characterized in that the kernel hook module, traversal interface module, and redundancy search module support highly concurrent multi-process, multi-threaded operation, which improves the overall performance of the system. The kernel hook module, traversal interface module, and redundancy search module are configured redundantly, which improves system availability. As shown in Fig. 2, the system architecture mainly comprises a high-performance, high-concurrency database (1), a kernel hook module (2), a traversal interface module (3), and a redundancy search module (4).
The high-performance, high-concurrency database is the core of the method; it stores the information and serves fast concurrent lookups.
The kernel hook module is registered in the kernel. When a function that creates a link file executes and control passes into the kernel, the kernel hook routine stores information such as the file path and the deduplication identifier in the high-performance, high-concurrency database and writes the deduplication identifier into the link file.
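A real kernel hook cannot be written in Python, so the sketch below only simulates the behavior in user space: a wrapper creates a hard link and then records the link path and identifier, as the hook is described as doing. It assumes SQLite as a stand-in database and SHA-256 as the identifier; `open_index`, `content_id`, and `create_link_hooked` are hypothetical names. Writing the identifier into the link file itself is not shown.

```python
import hashlib
import os
import sqlite3


def open_index(db_path=":memory:"):
    """Open the stand-in database and make sure the link table exists."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links ("
        "ident TEXT NOT NULL, path TEXT NOT NULL, "
        "PRIMARY KEY (ident, path))"
    )
    return conn


def content_id(path):
    """SHA-256 of the whole file; read in one call for brevity."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def create_link_hooked(conn, source, link_path):
    """Create a hard link, then record (identifier, link path) in the index."""
    os.link(source, link_path)
    ident = content_id(link_path)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO links (ident, path) VALUES (?, ?)",
            (ident, os.path.abspath(link_path)),
        )
    return ident
```

Because the record is written at link-creation time, a later search for the identifier can be answered from the table without touching the directory tree.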
The traversal interface module provides the interface for traversal calls and is the entry point for the various search functions. A search first queries the high-performance, high-concurrency database. If a database key matching the search identifier is found, the content corresponding to that key is returned to the calling function; otherwise the search passes to the redundancy search module.
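The lookup order described above can be sketched as a single function: consult the index first and fall back to a full search only on a miss. In this illustration a plain dictionary stands in for the database and the fallback is passed in as a callable; `find_links` and its parameters are hypothetical. Storing the fallback result back into the index follows the summary's statement that found information is put into the database.

```python
def find_links(ident, index, fallback_search):
    """Return all link paths for `ident`, preferring the index.

    index: dict mapping identifier -> set of paths (stand-in for the database).
    fallback_search: callable(ident) -> iterable of paths (the redundancy search).
    """
    paths = index.get(ident)
    if paths:
        return sorted(paths)  # fast path: answered from the database
    found = set(fallback_search(ident))  # slow path: full traversal
    if found:
        index[ident] = found  # remember the result for later queries
    return sorted(found)
```

With this shape, the expensive traversal runs at most once per identifier that has any matches; later calls for the same identifier are served from the index.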
The redundancy search module traverses the entire file system directory tree using depth-first or breadth-first traversal, obtains each file's identifier and compares it with the search key until the whole tree has been traversed, and then returns the results.
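The full traversal can be illustrated with a breadth-first walk that hashes every regular file and keeps the paths whose identifier matches the search key. This is a sketch under the assumption that the identifier is the SHA-256 of the file content; `file_id` and `redundant_search` are hypothetical names.

```python
import hashlib
import os
from collections import deque


def file_id(path):
    """SHA-256 of the whole file; read in one call for brevity."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def redundant_search(root, target_id):
    """Breadth-first walk of `root`; return sorted paths matching `target_id`."""
    matches = []
    queue = deque([root])
    while queue:
        directory = queue.popleft()
        with os.scandir(directory) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    queue.append(entry.path)  # visit subdirectories later
                elif entry.is_file(follow_symlinks=False):
                    if file_id(entry.path) == target_id:
                        matches.append(entry.path)
    return sorted(matches)
```

Every file is read and hashed on each call, which is why the method treats this path as a fallback rather than the normal route.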
Apart from the technical features described in this specification, everything else is known to those skilled in the art.

Claims (3)

1. A method for quickly finding all link files with the same deduplication identifier, characterized in that it takes a high-performance, high-concurrency database as its core and achieves relatively high search efficiency by integrating a traversal interface, a kernel hook module, and a redundancy search module; the modular structure of the method comprises a high-performance, high-concurrency database (1), a kernel hook module (2), a traversal interface module (3), and a redundancy search module (4), wherein:
The high-performance, high-concurrency database (1) is the core of the structure. It stores a large amount of link file path information and supports highly concurrent access by multiple processes and threads;
The kernel hook module (2) is mainly responsible for collecting and storing information when a link file is created, and supports multi-threaded concurrency;
The traversal interface module (3) provides the calling interface through which upper-level applications traverse the system;
The redundancy search module (4) acts when the required information is not present in the high-performance, high-concurrency database (1): it traverses the entire storage system to perform a redundancy search and puts the information it finds into the database (1).
2. The method according to claim 1, characterized in that the kernel hook module, traversal interface module, and redundancy search module support highly concurrent multi-process, multi-threaded operation, thereby improving the overall performance of the system.
3. The method according to claim 1, characterized in that the redundancy search module provides redundant configuration for the method, thereby improving system availability.
CN2013101121259A 2013-04-02 2013-04-02 Method for fast searching all chained files having same repeating data deleting identification Pending CN103198119A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013101121259A CN103198119A (en) 2013-04-02 2013-04-02 Method for fast searching all chained files having same repeating data deleting identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013101121259A CN103198119A (en) 2013-04-02 2013-04-02 Method for fast searching all chained files having same repeating data deleting identification

Publications (1)

Publication Number Publication Date
CN103198119A true CN103198119A (en) 2013-07-10

Family

ID=48720677

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013101121259A Pending CN103198119A (en) 2013-04-02 2013-04-02 Method for fast searching all chained files having same repeating data deleting identification

Country Status (1)

Country Link
CN (1) CN103198119A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015024511A1 (en) * 2013-08-21 2015-02-26 International Business Machines Corporation Adding cooperative file coloring in similarity based deduplication system
CN106469167A (en) * 2015-08-18 2017-03-01 北大方正集团有限公司 The display packing of file statuss and the display system of file statuss
CN107239314A (en) * 2016-03-28 2017-10-10 苏州简约纳电子有限公司 The minimizing technology of data structure is re-defined in ASN.1 compilation processes
US9830229B2 (en) 2013-08-21 2017-11-28 International Business Machines Corporation Adding cooperative file coloring protocols in a data deduplication system
CN108009049A (en) * 2017-11-28 2018-05-08 厦门市美亚柏科信息股份有限公司 The offline restoration methods of MYISAM storage engines deletion records, storage medium
WO2018113210A1 (en) * 2016-12-21 2018-06-28 深圳市易特科信息技术有限公司 Repeated medical documentation deletion system and method in medical informationization
CN109308284A (en) * 2018-09-28 2019-02-05 中国平安财产保险股份有限公司 Report menu generating method, device, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007041456A2 (en) * 2005-09-30 2007-04-12 Neopath Networks, Inc. Accumulating access frequency and file attributes for supporting policy based storage management
CN101719936A (en) * 2009-12-09 2010-06-02 成都市华为赛门铁克科技有限公司 Method, device and cache system for providing file downloading service
CN102289451A (en) * 2011-06-17 2011-12-21 奇智软件(北京)有限公司 Method and device for searching files or folders
CN102609453A (en) * 2012-01-11 2012-07-25 中国农业大学 Embedded file searching method and embedded file searching system

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015024511A1 (en) * 2013-08-21 2015-02-26 International Business Machines Corporation Adding cooperative file coloring in similarity based deduplication system
US9542411B2 (en) 2013-08-21 2017-01-10 International Business Machines Corporation Adding cooperative file coloring in a similarity based deduplication system
US9830229B2 (en) 2013-08-21 2017-11-28 International Business Machines Corporation Adding cooperative file coloring protocols in a data deduplication system
US11048594B2 (en) 2013-08-21 2021-06-29 International Business Machines Corporation Adding cooperative file coloring protocols in a data deduplication system
CN106469167A (en) * 2015-08-18 2017-03-01 北大方正集团有限公司 The display packing of file statuss and the display system of file statuss
CN106469167B (en) * 2015-08-18 2019-06-28 北大方正集团有限公司 The display methods of file status and the display system of file status
CN107239314A (en) * 2016-03-28 2017-10-10 苏州简约纳电子有限公司 The minimizing technology of data structure is re-defined in ASN.1 compilation processes
CN107239314B (en) * 2016-03-28 2020-09-01 苏州简约纳电子有限公司 Method for removing repeated definition data structure in ASN.1 compiling process
WO2018113210A1 (en) * 2016-12-21 2018-06-28 深圳市易特科信息技术有限公司 Repeated medical documentation deletion system and method in medical informationization
CN108009049A (en) * 2017-11-28 2018-05-08 厦门市美亚柏科信息股份有限公司 The offline restoration methods of MYISAM storage engines deletion records, storage medium
CN109308284A (en) * 2018-09-28 2019-02-05 中国平安财产保险股份有限公司 Report menu generating method, device, computer equipment and storage medium
CN109308284B (en) * 2018-09-28 2023-09-19 中国平安财产保险股份有限公司 Report menu generation method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN102222085B (en) Data de-duplication method based on combination of similarity and locality
CN103198119A (en) Method for fast searching all chained files having same repeating data deleting identification
Ji et al. Big data processing in cloud computing environments
US11093466B2 (en) Incremental out-of-place updates for index structures
CN103577123A (en) Small file optimization storage method based on HDFS
CN103544261B (en) A kind of magnanimity structuring daily record data global index's management method and device
CN104239377A (en) Platform-crossing data retrieval method and device
Chatzimilioudis et al. Distributed in-memory processing of all k nearest neighbor queries
CN103279502B (en) A kind of framework and method with the data de-duplication file system be combined with parallel file system
US9110820B1 (en) Hybrid data storage system in an HPC exascale environment
US20170351620A1 (en) Caching Framework for Big-Data Engines in the Cloud
WO2014110940A1 (en) A method, apparatus and system for storing, reading the directory index
Von der Weth et al. Multiterm keyword search in NoSQL systems
Li et al. Efficient subspace skyline query based on user preference using MapReduce
CN104572505A (en) System and method for ensuring eventual consistency of mass data caches
CN102779138A (en) Hard disk access method of real time data
CN104951464A (en) Data storage method and system
Feng et al. Lcindex: a local and clustering index on distributed ordered tables for flexible multi-dimensional range queries
Shangguan et al. Big spatial data processing with Apache Spark
CN108319604A (en) The associated optimization method of size table in a kind of hive
CN103761290A (en) Data management method and system based on content aware
CN110413724A (en) A kind of data retrieval method and device
Bao et al. Query optimization of massive social network data based on hbase
Akdogan et al. Cost-efficient partitioning of spatial data on cloud
Sun et al. Handling multi-dimensional complex queries in key-value data stores

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130710

WD01 Invention patent application deemed withdrawn after publication