CN106250494B - A kind of data management and analysis system based on file system - Google Patents
- Publication number: CN106250494B
- Application number: CN201610623825.8A
- Authority: CN (China)
- Prior art keywords: data characteristics, data, library, file system, analysis
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/14—Details of searching files based on file metadata
- G06F16/148—File search processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/178—Techniques for file synchronisation in file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/1805—Append-only file systems, e.g. using logs or journals to store data
- G06F16/1815—Journaling file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
Abstract
The present invention discloses a data management and analysis system based on a file system, comprising: a log subsystem of the file system, provided with a client interface; a data characteristics catcher that operates out of band, reads journal entries from the log subsystem through the client interface, and extracts data characteristics and their changes from the journal entries it reads; a data characteristics library adapter that, according to specific data characteristics analysis requirements, converts the data characteristics and their changes into retrieval entries, sets the library type and library structure of the out-of-band data characteristics library, and then replays the retrieval entries into the data characteristics library; and a data characteristics management and analysis subsystem that, according to specific data characteristics analysis requirements, sets search conditions and organizes, manages, and analyzes the data characteristics in the data characteristics library. The invention can flexibly adapt the library type and library structure of the data characteristics library to the needs of data characteristics management and analysis applications, without frequently modifying the file system implementation as analysis and management needs change.
Description
Technical field
The present invention relates to the field of computer technology, and more particularly to a data management and analysis system based on a file system.
Background art
A computer's file system provides users with a namespace and an address space, so that users can store large volumes of data and organize and discover that data by file name, path, and directory. As data volumes keep growing, data management requirements become more complex, and organizing data only by file name, path, and directory no longer meets users' needs. In recent years, many data applications and scientific computing workloads have required sophisticated data organization and data discovery mechanisms, which has given rise to data management systems. A current data management system must first collect the file system's data characteristics and their changes into a relational database, and then define rules over the data characteristics in that database to perform data management, data discovery, and data organization. Here, the data characteristics of a file system are also known as metadata.
The prior art generally obtains the data characteristics of a file system and their changes in one of two ways:
First approach: obtain the data characteristics by scanning the file system, and periodically rescan and compare file system differences to detect changes in the data characteristics; aggregate the data characteristics and their changes into a database, then manage data based on the data characteristics. This approach has drawbacks. First, periodic scanning sacrifices real-time updates of the data characteristics. Second, scanning and comparing a large file system is time-consuming and inefficient.
Second approach: separate the file system's data characteristics from its data, and design the file system's data characteristics subsystem as a database. Every data characteristics operation of the file system is then inherently an operation on that database, and all data characteristics are stored in the database, which makes search and query convenient. Implementing the metadata server used for data management and for the file system as a data characteristics library in this way yields an in-band (In Band) data management system. Its drawback is that metadata changes caused by the file system's normal I/O must also update the metadata in the library, and the file system cannot adaptively and dynamically adjust how data is organized. Once the file system's data characteristics subsystem defines the data characteristics layout, the library type and library structure (schema) cannot be changed. This is a tightly coupled design in which the data characteristics library is part of the file system, so it is very inflexible: the library type and library structure cannot be adapted at any time to the goals of data characteristics management and the needs of analysis. In addition, the performance of a system with frequent data characteristics operations depends entirely on, and is limited by, the performance of the data characteristics subsystem's database.
Accordingly, there is a need for a data management and analysis system based on a file system.
Summary of the invention
The object of the present invention is to provide a data management and analysis system based on a file system that, without changing the file system implementation, flexibly adapts the library type and library structure of the data characteristics library to the needs of data characteristics management and analysis applications.
To achieve the above object, the present invention adopts the following technical solution:
A data management and analysis system based on a file system, comprising: a log subsystem of the file system, a data characteristics catcher, a data characteristics library adapter, a data characteristics library, and a data characteristics management and analysis subsystem;
The log subsystem of the file system is provided with a client interface;
The data characteristics catcher reads journal entries from the log subsystem of the file system through the client interface, and extracts data characteristics and their changes from the journal entries it reads;
The data characteristics library adapter, according to specific data characteristics analysis requirements, converts the data characteristics and their changes into retrieval entries, sets the library type and library structure of the data characteristics library, and then replays the retrieval entries into the data characteristics library;
The data characteristics management and analysis subsystem, according to specific data characteristics management or analysis requirements, sets search conditions and organizes and analyzes the data characteristics in the data characteristics library;
The data characteristics catcher and the data characteristics library operate out of band.
Preferably, the reclaim policy of the log subsystem of the file system is: journal entries are reclaimed, in order, only after the file system has applied the data characteristics operation and the data characteristics catcher has explicitly allowed them to be reclaimed.
Preferably, when the data characteristics catcher reads journal entries from the log subsystem of the file system through the client interface, it also updates the current journal cursor.
Preferably, the types of the data characteristics library include RDBMS relational databases, distributed NoSQL databases, search engines, and related retrieval and search systems.
To obtain the data characteristics and track their changes in real time while avoiding scans of a large file system (deep directory hierarchies, massive numbers of files), the present invention uses the file system's log subsystem to capture data characteristics and their changes in real time, and aggregates them into the data characteristics library.
To keep the invention flexible, the library type and library structure (schema) of the data characteristics library must be decoupled from the file system's own data characteristics layout, so that they can be adjusted easily according to the needs of data management and analysis applications without affecting the performance of the file system itself. The invention thus allows the library type and library structure of the data characteristics library to be adapted flexibly to those needs without changing the file system implementation.
Beneficial effects of the present invention are as follows:
(1) The invention does not affect the I/O performance of the file system. The data characteristics catcher and the data characteristics library operate out of band (Out Of Band), so they affect neither the file system's normal input/output code path nor its input/output performance.
(2) Any file system that has a log subsystem can be adapted, according to the invention, into a suitable data management and analysis system, so the invention is widely applicable.
(3) The invention captures data characteristics and their changes from journal entries, so it reflects data characteristics updates in real time and easily obtains the increments of change, keeping the data characteristics in the data characteristics library consistent with those in the file system.
(4) The invention flexibly adapts the library type and library structure (schema) of the data characteristics library to the specific needs of management and analysis, without changing the file system implementation. The data characteristics library can therefore serve the queries, retrievals, and searches required by many different applications.
Brief description of the drawings
Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawing.
Fig. 1 shows a schematic diagram of the data management and analysis system based on a file system.
Detailed description of the embodiments
To explain the present invention more clearly, it is further described below with reference to preferred embodiments and the accompanying drawing. Similar components are denoted by the same reference numerals in the drawing. Those skilled in the art will understand that what is specifically described below is illustrative rather than restrictive, and should not limit the scope of protection of the invention.
Regarding the log subsystem of a file system: many existing file systems implement a log subsystem to guarantee the consistency of data and data characteristics. The log subsystem is also known as a write-ahead log (WAL) or intent log. For every update operation, all the changes it makes to the file system's data characteristics are first appended persistently to the file system journal as journal entries, and only then applied to the file system. Only when the update operation completes, that is, when the file system has applied the data characteristics operation, can the journal entries related to the change be reclaimed.
Any local or distributed file system that has a write-ahead log or intent log subsystem can be adapted, following the design provided in this embodiment, into the file-system-based data management and analysis system provided in this embodiment.
The data characteristics on which the data management and analysis system provided in this embodiment performs data management and analysis include the standard attributes of a file (POSIX attributes, ATTR) and its extended attributes (XATTR).
The data management and analysis system provided in this embodiment obtains data characteristics and their changes from the file system's log subsystem (filesystem journaling subsystem), aggregates them into the data characteristics library, and performs management and analysis based on the data characteristics.
As shown in Fig. 1, the data management and analysis system based on a file system provided in this embodiment includes: the log subsystem of the file system, a data characteristics catcher, a data characteristics library adapter, a data characteristics library, and a data characteristics management and analysis subsystem.
Log subsystem of the file system: the file system's log subsystem is provided with a client interface whose function is to let the data characteristics catcher read journal entries in order, update the current journal cursor, and explicitly allow journal entries to be reclaimed. Journal entries record the file system's data characteristics and their changes. An existing file system log subsystem reclaims journal entries as soon as the data characteristics have been updated in the file system. To ensure that the data characteristics catcher does not miss any data characteristics update, this embodiment adjusts the reclaim policy of the log subsystem: a journal entry that the data characteristics catcher (the client of the log subsystem) has not explicitly allowed to be reclaimed cannot be reclaimed. Journal entries are reclaimed, in order, only after the file system has applied the data characteristics operation and the client of the log subsystem has explicitly allowed the reclaim.
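The adjusted reclaim policy can be sketched as follows. This is only a minimal in-memory illustration under stated assumptions, not the patented implementation: the names `Journal`, `read_next`, `allow_reclaim_before`, and `reclaim` are hypothetical, and the choice to clamp the reclaim cursor so that unread entries can never be released is an assumption added for safety.

```python
from dataclasses import dataclass, field


@dataclass
class JournalEntry:
    seq: int               # position of the entry in the journal
    op: dict               # the data characteristics change recorded by the file system
    applied: bool = False  # set once the file system has applied the change


@dataclass
class Journal:
    """Hypothetical in-memory stand-in for a file system log subsystem."""
    entries: list = field(default_factory=list)
    next_seq: int = 0
    read_cursor: int = 0     # next sequence number the client will read
    reclaim_cursor: int = 0  # client allows reclaiming entries below this number

    def append(self, op):
        entry = JournalEntry(self.next_seq, op)
        self.entries.append(entry)
        self.next_seq += 1
        return entry.seq

    def mark_applied(self, seq):
        for entry in self.entries:
            if entry.seq == seq:
                entry.applied = True

    # --- client interface ---
    def read_next(self):
        """Return the next unread entry in order and advance the read cursor."""
        for entry in self.entries:
            if entry.seq == self.read_cursor:
                self.read_cursor += 1
                return entry
        return None

    def allow_reclaim_before(self, seq):
        """Client explicitly permits reclaiming entries with a smaller seq.

        Assumption: the request is clamped to the read cursor, so an entry
        the client has not read yet can never be released.
        """
        self.reclaim_cursor = max(self.reclaim_cursor, min(seq, self.read_cursor))

    # --- reclaim policy ---
    def reclaim(self):
        """Reclaim entries in order while both conditions hold for the head."""
        reclaimed = []
        while self.entries:
            head = self.entries[0]
            if not (head.applied and head.seq < self.reclaim_cursor):
                break
            reclaimed.append(self.entries.pop(0).seq)
        return reclaimed
```

In this sketch an entry leaves the journal only when it has been applied by the file system and released by the client, and reclaiming stops at the first entry that fails either condition, which preserves the in-order behaviour described above.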
Data characteristics catcher: the data characteristics catcher operates out of band. As a client of the log subsystem, it actively reads journal entries from the file system's log subsystem through the client interface, extracts data characteristics and their changes from the entries it reads, updates the current journal cursor, and sends the extracted data characteristics and their changes to the data characteristics library adapter.
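The catcher's read, extract, and forward loop can be sketched as follows. The journal entry layout (a dict with `op`, `path`, `attr`, and `xattr` keys) and the names `FeatureChange`, `extract_changes`, and `Catcher` are assumptions made for illustration; the patent does not specify an entry format.

```python
from dataclasses import dataclass


@dataclass
class FeatureChange:
    path: str
    kind: str     # "set_attr", "set_xattr" or "unlink"
    values: dict  # attribute names mapped to their new values


def extract_changes(entry):
    """Turn one journal entry (assumed to be a dict) into FeatureChange objects."""
    path = entry["path"]
    if entry["op"] == "unlink":
        return [FeatureChange(path, "unlink", {})]
    changes = []
    if entry.get("attr"):
        changes.append(FeatureChange(path, "set_attr", dict(entry["attr"])))
    if entry.get("xattr"):
        changes.append(FeatureChange(path, "set_xattr", dict(entry["xattr"])))
    return changes


class Catcher:
    def __init__(self, journal_entries, sink):
        self.journal_entries = journal_entries  # ordered list standing in for the journal
        self.cursor = 0                         # current journal cursor
        self.sink = sink                        # callable standing in for the adapter

    def poll(self):
        """Read all unread entries in order, forward changes, advance the cursor."""
        forwarded = 0
        while self.cursor < len(self.journal_entries):
            entry = self.journal_entries[self.cursor]
            for change in extract_changes(entry):
                self.sink(change)
                forwarded += 1
            self.cursor += 1
        return forwarded
```

Because the catcher only consumes the journal, this loop can run in a separate process from the file system's I/O path, which is the out-of-band property the paragraph above relies on.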
Data characteristics library: the data characteristics library operates out of band, outside the file system, and stores the data characteristics and their changes captured by the data characteristics catcher. Through the adapter, the data characteristics library can take various library types and library structures, so it can differ according to the target file system and the specific data characteristics analysis requirements. Types of data characteristics library include RDBMS relational databases, distributed NoSQL databases, search engines, and related retrieval and search systems.
Data characteristics library adapter: because the data characteristics library can be given different library types and library structures depending on the file system application, the adapter must, according to the specific data characteristics analysis requirements, convert the data characteristics and their changes extracted by the data characteristics catcher into corresponding retrieval entries, set the library type and library structure of the data characteristics library according to those requirements, and then replay the retrieval entries corresponding to the journal entries into the data characteristics library.
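One way to picture the adapter is as a component that owns the schema and replays each change as an idempotent write. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for a relational data characteristics library; the table layout, the change format (a dict with `path`, `kind`, and `values`), and the class name `RelationalAdapter` are all assumptions for illustration.

```python
import sqlite3

# Hypothetical library structure (schema) chosen by the adapter.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    path  TEXT PRIMARY KEY,
    size  INTEGER,
    mtime INTEGER,
    owner TEXT
);
CREATE TABLE IF NOT EXISTS xattrs (
    path  TEXT NOT NULL,
    name  TEXT NOT NULL,
    value TEXT,
    PRIMARY KEY (path, name)
);
"""

ATTR_COLUMNS = ("size", "mtime", "owner")  # fixed set, safe to use as column names


class RelationalAdapter:
    def __init__(self, conn):
        self.conn = conn
        self.conn.executescript(SCHEMA)  # set the library structure

    def replay(self, change):
        """Replay one change as a retrieval entry; repeating it is harmless."""
        path, kind, values = change["path"], change["kind"], change["values"]
        with self.conn:  # one transaction per change
            if kind == "unlink":
                self.conn.execute("DELETE FROM files WHERE path = ?", (path,))
                self.conn.execute("DELETE FROM xattrs WHERE path = ?", (path,))
            elif kind == "set_attr":
                self.conn.execute(
                    "INSERT OR IGNORE INTO files (path) VALUES (?)", (path,))
                for column in ATTR_COLUMNS:
                    if column in values:
                        self.conn.execute(
                            f"UPDATE files SET {column} = ? WHERE path = ?",
                            (values[column], path),
                        )
            elif kind == "set_xattr":
                for name, value in values.items():
                    self.conn.execute(
                        "INSERT OR REPLACE INTO xattrs (path, name, value) "
                        "VALUES (?, ?, ?)",
                        (path, name, value),
                    )
```

Swapping the library type would then mean writing another adapter with the same `replay` method, for example one that emits documents for a search engine, while the catcher and the file system stay unchanged.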
Data characteristics management and analysis subsystem: according to the specific data characteristics analysis requirements, it sets search conditions and organizes, manages, and analyzes the data characteristics in the data characteristics library, thereby achieving data characteristics management and data characteristics analysis. Organizing and managing the data characteristics in the library includes searching, retrieving, and classifying by data characteristics, and setting policies, trigger conditions, and the actions to execute once a trigger condition fires.
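The policy, trigger condition, and action mechanism mentioned above might look like the following sketch. The `Policy` structure and the `run_policies` function are hypothetical, and the records are assumed to be plain dicts read from the data characteristics library.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Policy:
    name: str
    condition: Callable[[dict], bool]  # trigger condition evaluated on one record
    action: Callable[[dict], None]     # executed when the condition fires


def run_policies(records, policies):
    """Evaluate every policy on every record; return the (policy, path) pairs that fired."""
    fired = []
    for record in records:
        for policy in policies:
            if policy.condition(record):
                policy.action(record)
                fired.append((policy.name, record["path"]))
    return fired


# Usage: flag files larger than 1 GiB for archiving (the threshold is illustrative).
to_archive = []
archive_large = Policy(
    name="archive-large",
    condition=lambda r: r["size"] > 1 << 30,
    action=lambda r: to_archive.append(r["path"]),
)
records = [
    {"path": "/data/small.txt", "size": 512},
    {"path": "/data/huge.bin", "size": 5 << 30},
]
result = run_policies(records, [archive_large])
```

Here the action only records a path; in a real deployment it could move, tag, or delete data, which is outside what this sketch shows.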
The data management and analysis system provided in this embodiment is further described below using two specific data characteristics libraries.
CephFS is chosen as the example file system, although the system is not limited to CephFS. The file system log subsystem of CephFS is improved. Compared with the existing log subsystem, the improvements are: (1) a client interface is provided, through which a client can read journal entries in order and update the current journal read cursor, and through which, at the client's request, all entries before a given journal entry are reclaimed and the reclaim cursor is updated; (2) the journal entry reclaim policy is adjusted so that the file system actually reclaims a journal entry only after it has applied the data characteristics operation and the client of the log subsystem has explicitly reclaimed the relevant journal entry.
The data characteristics library is of type RDBMS, specifically a PostgreSQL database. Data is managed and organized according to the file system's standard file attributes ATTR (file size, creation and update times, directory size, owner, and so on) and extended attributes XATTR.
As a client of the log subsystem, the data characteristics catcher reads the corresponding journal entries in order and extracts data characteristics and their changes from the entries it reads.
The data characteristics library adapter converts the data characteristics and their changes into corresponding retrieval entries and, according to the PostgreSQL library type and the predefined table structure, replays the retrieval entries into the PostgreSQL data characteristics library.
The data characteristics management and analysis subsystem sets query conditions and organizes and manages data based on the contents of the PostgreSQL data characteristics library, for example: selecting the largest file, finding all files updated within a given period, and finding all files that share the same value for a given extended attribute.
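The three example queries can be written in plain SQL. The sketch below runs them against an in-memory SQLite database from Python's standard library as a stand-in for PostgreSQL; the table and column names and the sample rows are assumptions, since the patent does not give the predefined table structure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files  (path TEXT PRIMARY KEY, size INTEGER, mtime INTEGER);
CREATE TABLE xattrs (path TEXT, name TEXT, value TEXT, PRIMARY KEY (path, name));
""")
conn.executemany("INSERT INTO files VALUES (?, ?, ?)", [
    ("/data/a.bin", 4096, 100),
    ("/data/b.bin", 1048576, 250),
    ("/data/c.txt", 12, 300),
])
conn.executemany("INSERT INTO xattrs VALUES (?, ?, ?)", [
    ("/data/a.bin", "user.project", "apollo"),
    ("/data/c.txt", "user.project", "apollo"),
    ("/data/b.bin", "user.project", "gemini"),
])


def largest_file(conn):
    """Select the file with the largest size."""
    return conn.execute(
        "SELECT path, size FROM files ORDER BY size DESC LIMIT 1").fetchone()


def updated_between(conn, start, end):
    """Find all files whose update time falls within [start, end]."""
    rows = conn.execute(
        "SELECT path FROM files WHERE mtime BETWEEN ? AND ? ORDER BY path",
        (start, end))
    return [row[0] for row in rows]


def with_xattr_value(conn, name, value):
    """Find all files that carry the given extended attribute value."""
    rows = conn.execute(
        "SELECT path FROM xattrs WHERE name = ? AND value = ? ORDER BY path",
        (name, value))
    return [row[0] for row in rows]
```

None of these queries touches the file system itself, so they can be as expensive as the analysis requires without affecting file system I/O.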
The data characteristics library can also be the search engine Elasticsearch, used for example to query for files whose extended attribute content is ABC, or to find, across all files, the probability that extended attributes ABC and DEF occur together and the files in which they do.
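The co-occurrence analysis can be illustrated without a search engine. The sketch below is plain Python over an in-memory mapping and does not use any Elasticsearch API; it assumes that ABC and DEF are extended attribute names attached to files and that the probability is taken over all files, which is one reading of the passage above.

```python
def cooccurrence(files, first, second):
    """Return (probability, paths) for files that carry both extended attributes.

    `files` maps each path to the set of extended attribute names on that file.
    """
    both = sorted(path for path, names in files.items()
                  if first in names and second in names)
    probability = len(both) / len(files) if files else 0.0
    return probability, both


files = {
    "/a": {"ABC", "DEF"},
    "/b": {"ABC"},
    "/c": {"DEF", "ABC"},
    "/d": set(),
}
probability, paths = cooccurrence(files, "ABC", "DEF")
```

A search engine would typically answer the same question from its index rather than by iterating over every file, which is why it suits this kind of analysis on large file systems.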
Obviously, the above embodiments are merely examples given to illustrate the present invention clearly and do not limit its embodiments. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. Not all embodiments can be listed exhaustively here; any obvious change or variation derived from the technical solution of the present invention remains within its scope of protection.
Claims (3)
1. A data management and analysis system based on a file system, characterized in that the system comprises: a log subsystem of the file system, a data characteristics catcher, a data characteristics library adapter, a data characteristics library, and a data characteristics management and analysis subsystem;
The log subsystem of the file system is provided with a client interface;
The data characteristics catcher reads journal entries from the log subsystem of the file system through the client interface, and extracts data characteristics and their changes from the journal entries it reads;
The data characteristics library adapter, according to specific data characteristics analysis requirements, converts the data characteristics and their changes into retrieval entries, sets the library type and library structure of the data characteristics library, and then replays the retrieval entries into the data characteristics library;
The data characteristics management and analysis subsystem, according to specific data characteristics management or analysis requirements, sets search conditions and organizes, manages, and analyzes the data characteristics in the data characteristics library;
The data characteristics catcher and the data characteristics library operate out of band;
The reclaim policy of the log subsystem of the file system is: journal entries are reclaimed, in order, only after the file system has applied the data characteristics operation and the data characteristics catcher has explicitly allowed them to be reclaimed.
2. The data management and analysis system based on a file system according to claim 1, characterized in that, when the data characteristics catcher reads journal entries from the log subsystem of the file system through the client interface, it also updates the current journal cursor.
3. The data management and analysis system based on a file system according to claim 1, characterized in that the types of the data characteristics library include RDBMS relational databases, distributed NoSQL databases, search engines, and related retrieval and search systems.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610623825.8A CN106250494B (en) | 2016-08-02 | 2016-08-02 | A kind of data management and analysis system based on file system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106250494A CN106250494A (en) | 2016-12-21 |
CN106250494B true CN106250494B (en) | 2019-04-09 |
Family
ID=57606374
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610623825.8A Active CN106250494B (en) | 2016-08-02 | 2016-08-02 | A kind of data management and analysis system based on file system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106250494B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110297846B (en) * | 2019-05-28 | 2021-08-20 | 北京奇艺世纪科技有限公司 | Log feature processing system, method, electronic equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6725392B1 (en) * | 1999-03-03 | 2004-04-20 | Adaptec, Inc. | Controller fault recovery system for a distributed file system |
CN1893370A (en) * | 2005-06-29 | 2007-01-10 | 国际商业机器公司 | Server cluster recovery and maintenance method and system |
CN101304360A (en) * | 2007-05-08 | 2008-11-12 | 艾岩 | System and method for virtualization of user digital terminal |
CN101578599A (en) * | 2006-08-07 | 2009-11-11 | 米谋萨系统有限公司 | Synthesis of fatty acids |
CN103533023A (en) * | 2013-07-25 | 2014-01-22 | 上海和辰信息技术有限公司 | Cloud service application cluster synchronization system and synchronization method based on cloud service characteristics |
Also Published As
Publication number | Publication date |
---|---|
CN106250494A (en) | 2016-12-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |