CN106250494B - A kind of data management and analysis system based on file system - Google Patents

Info

Publication number
CN106250494B
Authority
CN
China
Prior art keywords
data characteristics
data
library
file system
analysis
Prior art date
Legal status
Active
Application number
CN201610623825.8A
Other languages
Chinese (zh)
Other versions
CN106250494A (en
Inventor
吴江
谢鹏
Current Assignee
Polar Technology (beijing) Co Ltd
Original Assignee
Polar Technology (beijing) Co Ltd
Priority date
Filing date
Publication date
Application filed by Polar Technology (beijing) Co Ltd
Priority to CN201610623825.8A
Publication of CN106250494A
Application granted
Publication of CN106250494B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/178 Techniques for file synchronisation in file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/1805 Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815 Journaling file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases

Abstract

The invention discloses a data management and analysis system based on a file system, comprising: a log subsystem of the file system, provided with a client interface; a data characteristics catcher that works out-of-band, reads journal entries from the log subsystem through the client interface, and extracts data characteristics and their changes from the entries it reads; a data characteristics library adapter that, according to specific data characteristics analysis requirements, converts the data characteristics and their changes into retrieval entries, sets the library type and library structure of the out-of-band data characteristics library, and then replays the retrieval entries into the data characteristics library; and a data characteristics management and analysis subsystem that, according to specific data characteristics analysis requirements, sets search conditions and organizes, manages, and analyzes the data characteristics in the data characteristics library. The invention can flexibly adapt the library type and library structure of the data characteristics library to the needs of data characteristics management and analysis applications, without frequently modifying the file system implementation as those needs change.

Description

A kind of data management and analysis system based on file system
Technical field
The present invention relates to the field of computer technology, and more particularly to a data management and analysis system based on a file system.
Background
A computer's file system provides users with a name space and an address space, so that users can store large amounts of data and organize and discover it by file name, path, and directory. As data volumes keep growing, data management needs become more complex, and organizing data only by file name, path, and directory no longer meets them. In recent years, many data applications and scientific computing workloads have come to need sophisticated data organization and data discovery mechanisms, which has given rise to data management systems. A current data management system must first collect the data characteristics of the file system and their changes into a relational database, and then define rules over the data characteristics in that database to perform data management, discovery, and organization. The data characteristics of a file system are also known as metadata.
The prior art generally obtains the data characteristics of a file system and their changes in one of two ways:
First way: data characteristics are obtained by scanning the file system, and changes are found by periodically rescanning and comparing the result against the previous scan. The data characteristics and their changes are aggregated into a database, and data management is then performed on them. This approach has drawbacks. First, periodic scanning loses real-time updates of the data characteristics. Second, scanning and comparing a large file system is very time-consuming and inefficient.
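The scan-and-compare approach can be illustrated with a minimal Python sketch. The function names `scan` and `diff` are hypothetical and are not part of the patent; the sketch only records file size and modification time, and it shows why the cost grows with the number of files, since every scan visits every file.

```python
import os

def scan(root):
    """Walk the whole tree and record (size, mtime_ns) for every regular file."""
    snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            snapshot[path] = (st.st_size, st.st_mtime_ns)
    return snapshot

def diff(old, new):
    """Compare two snapshots and report created, deleted, and modified paths."""
    created = sorted(new.keys() - old.keys())
    deleted = sorted(old.keys() - new.keys())
    modified = sorted(p for p in old.keys() & new.keys() if old[p] != new[p])
    return created, deleted, modified
```

Changes made between two scans are only noticed at the next scan, which is the loss of real-time behavior described above.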
Second way: the data characteristics of the file system are separated from the data, and the data characteristics subsystem of the file system is designed as a database. Every data characteristics operation of the file system is then essentially an operation on that database; all data characteristics are stored in the database, which makes searching and querying convenient. Implementing the metadata server of the file system as the data characteristics library used for data management makes this an in-band (In Band) data management system. Its drawback is that metadata changes caused by the normal I/O of the file system must also update the metadata, and the file system cannot adjust its organization adaptively and dynamically. Once the data characteristics subsystem of the file system has defined the data characteristics layout, the library type and library structure (schema) cannot be changed. This is a tightly coupled design in which the data characteristics library is part of the file system, so it is very inflexible: the library type and library structure cannot be adapted at any time to the goals of data characteristics management and the needs of analysis. In addition, the performance of a system with frequent data characteristics operations depends entirely on, and is limited by, the performance of the database of the data characteristics subsystem.
Accordingly, it is desirable to provide a kind of data management and analysis system based on file system.
Summary of the invention
The object of the present invention is to provide a data management and analysis system based on a file system that can flexibly adapt the library type and library structure of the data characteristics library to the needs of data characteristics management and analysis applications, without changing the file system implementation.
In order to achieve the above objectives, the present invention adopts the following technical solutions:
A data management and analysis system based on a file system, comprising: a log subsystem of the file system, a data characteristics catcher, a data characteristics library adapter, a data characteristics library, and a data characteristics management and analysis subsystem;
The log subsystem of the file system is provided with a client interface;
The data characteristics catcher reads journal entries from the log subsystem of the file system through the client interface and extracts data characteristics and their changes from the entries it reads;
The data characteristics library adapter converts the data characteristics and their changes into retrieval entries according to specific data characteristics analysis requirements, sets the library type and library structure of the data characteristics library according to those requirements, and then replays the retrieval entries into the data characteristics library;
The data characteristics management and analysis subsystem sets search conditions according to specific data characteristics management or analysis requirements, and organizes, manages, and analyzes the data characteristics in the data characteristics library;
The data characteristics catcher and the data characteristics library work out-of-band.
Preferably, the log reclaim policy of the log subsystem of the file system is: a journal entry can be reclaimed, in order, only after the file system has applied the data characteristics operation and the data characteristics catcher has explicitly allowed the entry to be reclaimed.
Preferably, the data characteristics catcher also updates the current log cursor when it reads journal entries from the log subsystem of the file system through the client interface.
Preferably, the types of the data characteristics library include an RDBMS relational database, a distributed NoSQL database, a search engine, or a related retrieval or search system.
To obtain data characteristics and track their changes in real time while avoiding scans of a large file system (deep directory hierarchies, massive numbers of files), the present invention uses the log subsystem of the file system to capture data characteristics and their changes in real time and aggregates them into the data characteristics library.
To keep the invention sufficiently flexible, the library type and library structure (schema) of the data characteristics library must be decoupled from the data characteristics layout implemented by the file system, so that they can be adjusted easily to the needs of data management and analysis applications without affecting the performance of the file system itself. The invention thus allows the library type and library structure of the data characteristics library to be adapted flexibly to the needs of data characteristics management and analysis applications without changing the file system implementation.
The beneficial effects of the present invention are as follows:
(1) The invention does not affect the I/O performance of the file system. The data characteristics catcher and the data characteristics library work out-of-band (Out Of Band), so they affect neither the normal input/output code path of the file system nor its input/output performance.
(2) Any file system that has a log subsystem can be adapted into a data management and analysis system according to the invention, so the invention is widely applicable.
(3) The invention captures data characteristics and their changes from journal entries, so it can reflect data characteristics updates in real time and easily obtain the increments of data characteristics changes, keeping the data characteristics in the data characteristics library consistent with those in the file system.
(4) The invention flexibly adapts the library type and library structure (schema) of the data characteristics library to the specific needs of management and analysis, without changing the file system implementation. Through the data characteristics library it can support the queries, retrieval, and searches that various applications require.
Brief description of the drawings
Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawing;
Fig. 1 shows a schematic diagram of the data management and analysis system based on a file system.
Detailed description of the embodiments
To illustrate the present invention more clearly, it is further described below with reference to preferred embodiments and the accompanying drawing. Similar components are denoted by the same reference numerals in the drawing. Those skilled in the art will appreciate that the following description is illustrative rather than restrictive and should not limit the scope of protection of the invention.
Regarding the log subsystem of a file system: many existing file systems implement a log subsystem to guarantee the consistency of data and data characteristics. The log subsystem of a file system is also known as a write-ahead log (WAL) or an intent log. All changes to file system data characteristics involved in a file system update operation are first persisted by appending them to the file system journal, and only then applied to the file system. Once the update operation completes, that is, once the file system has applied the data characteristics operation, the journal entries related to the change can be reclaimed.
Any local or distributed file system that has a write-ahead log or intent log subsystem can be adapted, following the data management and analysis system provided in this embodiment, into such a data management and analysis system.
The data characteristics on which the data management and analysis system of this embodiment operates include the standard attributes (POSIX attributes, ATTR) and the extended attributes (XATTR) of files.
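As a rough illustration of what these two kinds of data characteristics look like for a single file, the Python sketch below reads them directly from the operating system. It is only an assumption-laden example: the patent obtains them from journal entries rather than by calling `stat`, the function name `read_characteristics` is hypothetical, and `os.listxattr` and `os.getxattr` are available on Linux only, so the sketch guards for their absence.

```python
import os

def read_characteristics(path):
    """Collect the standard attributes (ATTR) and extended attributes (XATTR) of one file."""
    st = os.stat(path)
    attr = {
        "size": st.st_size,    # file size in bytes
        "mtime": st.st_mtime,  # last modification time
        "uid": st.st_uid,      # numeric owner id
        "mode": st.st_mode,    # file type and permission bits
    }
    xattr = {}
    if hasattr(os, "listxattr"):  # extended attribute calls exist on Linux only
        try:
            for name in os.listxattr(path):
                xattr[name] = os.getxattr(path, name)
        except OSError:
            pass  # the underlying file system does not support extended attributes
    return {"path": path, "attr": attr, "xattr": xattr}
```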
The data management and analysis system of this embodiment obtains data characteristics and their changes from the log subsystem of the file system (filesystem journaling subsystem), aggregates the data characteristics into a library, and performs management and analysis based on them.
As shown in Fig. 1, the data management and analysis system of this embodiment includes: a log subsystem of the file system, a data characteristics catcher, a data characteristics library adapter, a data characteristics library, and a data characteristics management and analysis subsystem;
Log subsystem of the file system: the log subsystem is provided with a client interface, through which the data characteristics catcher reads journal entries in order, updates the current log cursor, and explicitly allows journal entries to be reclaimed. Journal entries reflect the data characteristics of the file system and their changes. An existing file system log subsystem reclaims a journal entry as soon as the data characteristics update has been applied to the file system. To ensure that the data characteristics catcher does not miss any update, this embodiment adjusts the log reclaim policy: a journal entry that the data characteristics catcher (the client of the log subsystem) has not explicitly allowed to be reclaimed cannot be reclaimed. An entry is reclaimed, in order, only after the file system has applied the data characteristics operation and the client of the log subsystem has explicitly allowed its reclamation.
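The adjusted reclaim policy can be sketched as a small in-memory journal in Python. This is a minimal sketch under stated assumptions, not the patented implementation: the class `Journal` and its method names are hypothetical, entries are numbered from 0, and persistence is omitted. The point it shows is that an entry is trimmed only when both conditions hold, and always in sequence order.

```python
class Journal:
    """Append-only log whose entries are numbered from 0 (in-memory stand-in)."""

    def __init__(self):
        self.entries = {}      # sequence number -> entry
        self.next_seq = 0      # sequence number for the next append
        self.applied_upto = 0  # entries below this are applied to the file system
        self.client_upto = 0   # entries below this are released by the client
        self.trim_pos = 0      # entries below this have been reclaimed

    def append(self, entry):
        seq = self.next_seq
        self.entries[seq] = entry
        self.next_seq += 1
        return seq

    def mark_applied(self, seq):
        """File system side: entry `seq` and all earlier ones are applied."""
        self.applied_upto = max(self.applied_upto, seq + 1)

    def read(self, cursor, limit=10):
        """Client interface: return up to `limit` (seq, entry) pairs from `cursor`, in order."""
        start = max(cursor, self.trim_pos)
        end = min(start + limit, self.next_seq)
        return [(seq, self.entries[seq]) for seq in range(start, end)]

    def allow_reclaim(self, seq):
        """Client interface: explicitly release entry `seq` and all earlier ones."""
        self.client_upto = max(self.client_upto, seq + 1)

    def reclaim(self):
        """Trim, in order, only entries that are both applied and released."""
        limit = min(self.applied_upto, self.client_upto)
        while self.trim_pos < limit:
            del self.entries[self.trim_pos]
            self.trim_pos += 1
        return self.trim_pos
```

With three appended entries that are all applied, `reclaim()` trims nothing until the client calls `allow_reclaim`, which is the difference from a stock journal that trims as soon as entries are applied.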
Data characteristics catcher: the data characteristics catcher works out-of-band. As a client of the log subsystem, it actively reads journal entries from the log subsystem of the file system through the client interface, extracts data characteristics and their changes from the entries it reads, updates the current log cursor, and sends the extracted data characteristics and their changes to the data characteristics library adapter.
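One way to picture the catcher's read-extract-advance loop is the Python sketch below. The journal entry format is an assumption made for illustration (dictionaries with an `op` field of `setattr`, `setxattr`, or `unlink`); real journal entries, for example in CEPHFS, have their own format. The function name `capture` is hypothetical.

```python
def capture(journal, cursor, batch=100):
    """Read entries starting at `cursor`; return (changes, new_cursor)."""
    changes = []
    end = min(cursor + batch, len(journal))
    for seq in range(cursor, end):
        entry = journal[seq]
        op = entry["op"]
        if op == "setattr":
            changes.append({"seq": seq, "path": entry["path"],
                            "kind": "attr", "values": dict(entry["attrs"])})
        elif op == "setxattr":
            changes.append({"seq": seq, "path": entry["path"],
                            "kind": "xattr", "values": {entry["name"]: entry["value"]}})
        elif op == "unlink":
            changes.append({"seq": seq, "path": entry["path"],
                            "kind": "delete", "values": {}})
        # Other entry types carry no data characteristics and are skipped.
    return changes, end
```

The returned cursor is what the catcher would store as the current log cursor, so that a later call resumes where this one stopped.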
Data characteristics library: the data characteristics library works out-of-band and, outside the file system, stores the data characteristics and changes captured by the data characteristics catcher. The adapter fits the library to various library types and library structures, so the library can differ according to the target file system and the specific data characteristics analysis requirements. Its types include an RDBMS relational database, a distributed NoSQL database, a search engine, or a related retrieval or search system.
Data characteristics library adapter: because the data characteristics library can use different library types and library structures depending on how the file system is used, the adapter converts the data characteristics and changes extracted by the catcher into corresponding retrieval entries according to the specific data characteristics analysis requirements, sets the library type and library structure of the data characteristics library according to those requirements, and then replays the retrieval entries corresponding to the journal entries into the data characteristics library.
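The convert-then-replay step can be sketched in Python with a dictionary standing in for the library. All names here (`DictLibrary`, `convert`, `replay`) and the shape of the change records are assumptions for illustration. The sketch also skips entries whose sequence number was already replayed, a design choice that makes replay safe to repeat after a restart; the patent does not prescribe this.

```python
class DictLibrary:
    """Stand-in data characteristics library: path -> merged characteristics."""

    def __init__(self):
        self.docs = {}
        self.last_seq = -1  # highest journal sequence number already replayed

def convert(change):
    """Turn one captured change into a retrieval entry (an upsert or a delete)."""
    if change["kind"] == "delete":
        return {"seq": change["seq"], "action": "delete", "key": change["path"]}
    prefix = "xattr." if change["kind"] == "xattr" else "attr."
    fields = {prefix + name: value for name, value in change["values"].items()}
    return {"seq": change["seq"], "action": "upsert",
            "key": change["path"], "fields": fields}

def replay(library, entries):
    """Apply retrieval entries in journal order, skipping ones already replayed."""
    for entry in sorted(entries, key=lambda e: e["seq"]):
        if entry["seq"] <= library.last_seq:
            continue
        if entry["action"] == "delete":
            library.docs.pop(entry["key"], None)
        else:
            library.docs.setdefault(entry["key"], {}).update(entry["fields"])
        library.last_seq = entry["seq"]
```

Swapping the library type would mean replacing `DictLibrary` and `replay` while the catcher and the file system stay unchanged, which is the decoupling the adapter is meant to provide.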
Data characteristics management and analysis subsystem: according to specific data characteristics analysis requirements, it sets search conditions and organizes, manages, and analyzes the data characteristics in the data characteristics library, in order to achieve data characteristics management and analysis. Organizing and managing the data characteristics in the library includes searching, retrieving, and classifying by data characteristics, and setting policies, trigger conditions, and the actions executed when a trigger condition fires.
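A policy with a trigger condition and an action can be sketched as follows in Python. The policy structure, the name `run_policies`, the sample library contents, and the 1 GiB threshold are all hypothetical; the patent only states that policies, trigger conditions, and triggered actions can be set.

```python
GIB = 1024 ** 3

def run_policies(library, policies):
    """Evaluate every policy against every file; run the action when the condition holds."""
    fired = []
    for path in sorted(library):
        doc = library[path]
        for policy in policies:
            if policy["condition"](doc):
                policy["action"](path, doc)
                fired.append((policy["name"], path))
    return fired

archive_queue = []  # the action below only records which files should be archived
policies = [
    {
        "name": "archive-large-files",
        "condition": lambda doc: doc.get("size", 0) > GIB,
        "action": lambda path, doc: archive_queue.append(path),
    },
]
library = {
    "/data/a.bin": {"size": 5 * GIB, "owner": "alice"},
    "/data/b.txt": {"size": 120, "owner": "bob"},
}
fired = run_policies(library, policies)
```

Because the policies read only from the library, they run without touching the file system's own I/O path.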
The data management and analysis system of this embodiment is further described below with two specific data characteristics libraries.
The file system is CEPHFS in this example, although the system is not limited to CEPHFS. The file system log subsystem of CEPHFS is improved. Compared with the existing log subsystem, the improvements are: 1. A client interface is provided that lets the client read journal entries in order and update the current log read cursor, and that reclaims all entries before a given journal entry at the client's request and updates the reclaim cursor. 2. The journal entry reclaim policy is adjusted: the file system actually reclaims a journal entry only after the file system has applied the data characteristics operation and the client of the log subsystem has explicitly released the relevant journal entry.
The data characteristics library is a PostgreSQL RDBMS database. Data is managed and organized according to the standard file attributes (ATTR) of the file system (file size, creation and update times, directory size, owner, and so on) and the extended attributes (XATTR).
As a client of the log subsystem, the data characteristics catcher reads the journal entries in order and extracts data characteristics and their changes from them.
The data characteristics library adapter converts the data characteristics and their changes into corresponding retrieval entries and, according to the library type (a PostgreSQL data characteristics library) and the predefined table structure, replays the retrieval entries into the PostgreSQL data characteristics library.
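A predefined table structure and the replay into it might look like the Python sketch below. To keep the example runnable without a database server, it uses the standard-library `sqlite3` module as a stand-in for PostgreSQL; the table names `files` and `xattrs`, their columns, and the change record format are assumptions, not the schema used by the patent.

```python
import sqlite3

ATTR_COLUMNS = ("size", "mtime", "owner")

def create_schema(conn):
    """Create the predefined table structure: one table per kind of characteristic."""
    conn.execute("CREATE TABLE IF NOT EXISTS files ("
                 "path TEXT PRIMARY KEY, size INTEGER, mtime REAL, owner TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS xattrs ("
                 "path TEXT, name TEXT, value TEXT, PRIMARY KEY (path, name))")

def replay(conn, change):
    """Replay one captured change into the relational library."""
    path = change["path"]
    if change["kind"] == "delete":
        conn.execute("DELETE FROM files WHERE path = ?", (path,))
        conn.execute("DELETE FROM xattrs WHERE path = ?", (path,))
    elif change["kind"] == "attr":
        conn.execute("INSERT OR IGNORE INTO files (path) VALUES (?)", (path,))
        for column, value in change["values"].items():
            if column in ATTR_COLUMNS:  # whitelist: column names cannot be bound as parameters
                conn.execute(f"UPDATE files SET {column} = ? WHERE path = ?", (value, path))
    elif change["kind"] == "xattr":
        for name, value in change["values"].items():
            conn.execute("INSERT OR REPLACE INTO xattrs (path, name, value) VALUES (?, ?, ?)",
                         (path, name, value))
```

A partial attribute update touches only the columns it names, so an entry that changes the size leaves the stored owner intact.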
The data characteristics management and analysis subsystem sets query conditions and organizes and manages data based on the contents of the PostgreSQL data characteristics library: for example, finding the largest file, finding all files updated within a given period, or finding all files that share a given extended attribute value.
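The three example queries can be written as plain SQL. The Python sketch below runs them against an in-memory SQLite database used as a stand-in for PostgreSQL; the table layout, the sample rows, and the attribute name `user.project` are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (path TEXT PRIMARY KEY, size INTEGER, mtime INTEGER);
CREATE TABLE xattrs (path TEXT, name TEXT, value TEXT, PRIMARY KEY (path, name));
""")
conn.executemany("INSERT INTO files VALUES (?, ?, ?)",
                 [("/a", 10, 100), ("/b", 500, 200), ("/c", 70, 300)])
conn.executemany("INSERT INTO xattrs VALUES (?, ?, ?)",
                 [("/a", "user.project", "x"),
                  ("/b", "user.project", "y"),
                  ("/c", "user.project", "x")])

# 1. The largest file.
largest = conn.execute(
    "SELECT path FROM files ORDER BY size DESC LIMIT 1").fetchone()[0]

# 2. All files updated within a given period (mtime between two timestamps).
updated = [row[0] for row in conn.execute(
    "SELECT path FROM files WHERE mtime BETWEEN ? AND ? ORDER BY path", (150, 300))]

# 3. All files that share a given extended attribute value.
same_value = [row[0] for row in conn.execute(
    "SELECT path FROM xattrs WHERE name = ? AND value = ? ORDER BY path",
    ("user.project", "x"))]
```

None of these queries scans the file system; they run entirely against the out-of-band library.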
The data characteristics library can also be the search engine ElasticSearch, for example to query the files whose extended attribute content is ABC, or to find, across all files, the files in which the extended attributes ABC and DEF occur together and the probability of that co-occurrence.
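The co-occurrence question can be made concrete with a small Python sketch. It computes the result over an in-memory dictionary rather than calling a search engine, and it treats ABC and DEF as extended attribute values; both are assumptions. The `both_query` dictionary shows how the same filter could be phrased as an Elasticsearch bool query with term filters, assuming a hypothetical keyword field named `xattr_values`; it is not sent anywhere in this sketch.

```python
# Stand-in index: path -> list of extended attribute values (hypothetical data).
docs = {
    "/a": ["ABC", "DEF"],
    "/b": ["ABC"],
    "/c": ["DEF", "XYZ"],
    "/d": ["ABC", "DEF", "XYZ"],
}

def files_with(docs, *values):
    """Return the sorted paths whose extended attributes contain all given values."""
    return sorted(path for path, vals in docs.items()
                  if all(value in vals for value in values))

def co_occurrence(docs, first, second):
    """Return (probability, files) for both values occurring on the same file."""
    both = files_with(docs, first, second)
    probability = len(both) / len(docs) if docs else 0.0
    return probability, both

# Hypothetical request body for the same filter in Elasticsearch query DSL;
# the field name "xattr_values" is an assumption about the index mapping.
both_query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"xattr_values": "ABC"}},
                {"term": {"xattr_values": "DEF"}},
            ]
        }
    }
}
```

Here the probability is the fraction of all indexed files that carry both values; a different definition, such as a conditional probability, would need a different denominator.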
Obviously, the above embodiments are merely examples given to illustrate the invention clearly and do not limit its embodiments. Those of ordinary skill in the art can make other variations or changes on the basis of the above description. It is not possible to list all embodiments exhaustively here; any obvious variation or change derived from the technical solution of the invention remains within its scope of protection.

Claims (3)

1. A data management and analysis system based on a file system, characterized in that the system comprises: a log subsystem of the file system, a data characteristics catcher, a data characteristics library adapter, a data characteristics library, and a data characteristics management and analysis subsystem;
The log subsystem of the file system is provided with a client interface;
The data characteristics catcher reads journal entries from the log subsystem of the file system through the client interface and extracts data characteristics and their changes from the entries it reads;
The data characteristics library adapter converts the data characteristics and their changes into retrieval entries according to specific data characteristics analysis requirements, sets the library type and library structure of the data characteristics library according to those requirements, and then replays the retrieval entries into the data characteristics library;
The data characteristics management and analysis subsystem sets search conditions according to specific data characteristics management or analysis requirements, and organizes, manages, and analyzes the data characteristics in the data characteristics library;
The data characteristics catcher and the data characteristics library work out-of-band;
The log reclaim policy of the log subsystem of the file system is: a journal entry can be reclaimed, in order, only after the file system has applied the data characteristics operation and the data characteristics catcher has explicitly allowed the entry to be reclaimed.
2. The data management and analysis system based on a file system according to claim 1, characterized in that the data characteristics catcher also updates the current log cursor when it reads journal entries from the log subsystem of the file system through the client interface.
3. The data management and analysis system based on a file system according to claim 1, characterized in that the types of the data characteristics library include an RDBMS relational database, a distributed NoSQL database, a search engine, or a related retrieval or search system.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610623825.8A CN106250494B (en) 2016-08-02 2016-08-02 A kind of data management and analysis system based on file system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610623825.8A CN106250494B (en) 2016-08-02 2016-08-02 A kind of data management and analysis system based on file system

Publications (2)

Publication Number Publication Date
CN106250494A CN106250494A (en) 2016-12-21
CN106250494B true CN106250494B (en) 2019-04-09

Family

ID=57606374

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610623825.8A Active CN106250494B (en) 2016-08-02 2016-08-02 A kind of data management and analysis system based on file system

Country Status (1)

Country Link
CN (1) CN106250494B (en)

Also Published As

Publication number Publication date
CN106250494A (en) 2016-12-21

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant