CN103761290A - Data management method and system based on content aware - Google Patents

Info

Publication number
CN103761290A
Authority
CN
China
Prior art keywords
file
object data
test value
proof test
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410018214.1A
Other languages
Chinese (zh)
Inventor
王通
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Beijing Electronic Information Industry Co Ltd
Original Assignee
Inspur Beijing Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Beijing Electronic Information Industry Co Ltd filed Critical Inspur Beijing Electronic Information Industry Co Ltd
Priority to CN201410018214.1A priority Critical patent/CN103761290A/en
Publication of CN103761290A publication Critical patent/CN103761290A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/174Redundancy elimination performed by the file system
    • G06F16/1748De-duplication implemented within the file system, e.g. based on file segments

Abstract

The invention provides a content-aware data management method and system. The method includes: acquiring a preset file format template; matching a stored file against the file format template to obtain the file format of the file; dividing the content of the file into at least two data objects according to that file format, and storing all data objects in a storage database; determining, according to the file format, the checksum algorithm corresponding to the file; applying that checksum algorithm to each data object of the file and recording the checksum value of each data object; establishing a correspondence between the checksum value of each data object and the storage location of that data object in the storage database; and managing the stored files according to that correspondence.

Description

Content-aware data management method and system
Technical field
The present invention relates to the field of computer applications, and in particular to a content-aware data management method and system.
Background
Since the beginning of the 21st century, as the information age has accelerated, enterprise data has grown explosively; the development of the mobile Internet, the Internet of Things, and cloud computing in particular has intensified this growth. An IDC report states that the global data volume grows at 60% per year: it reached 1.8 ZB in 2010 and was projected to reach 8 ZB in 2015 and 35 ZB in 2020, marking the arrival of the "big data" era. Data growth brings serious problems: sharply rising costs, heavy bandwidth pressure, high energy consumption, and a large equipment footprint. Adding equipment cannot fully solve the problem of surging data volume. At the same time, the energy problems facing the world are becoming more severe, and energy consumption and environmental protection in the IT industry are drawing increasing attention. Widespread Internet use keeps enlarging the information centers of large enterprises, government agencies, and financial institutions: data exchange increases, equipment piles up, floor space grows, and power consumption repeatedly reaches new highs. To optimize information and its management, enterprises building their information architecture increasingly call for green, energy-saving technology. Saving energy, reducing power consumption, and lowering system cost all require new green storage technology for emerging applications. Data deduplication emerged in response to this trend: it effectively reduces duplicate data in a user's storage system, saving storage capacity and reducing storage cost and management difficulty.
Data deduplication techniques can be divided, by method, into file-level deduplication and block-level deduplication. Both methods deduplicate across files, so in practice the data cannot be read on demand.
Summary of the invention
The technical problem to be solved by the content-aware data management method and system provided by the present invention is how to read data on demand.
To solve the above technical problem, the present invention provides the following technical solution:
A content-aware data management method, comprising:
Acquiring a preset file format template;
Matching a stored file against the file format template to obtain the file format of the file;
Dividing the content of the file into at least two data objects according to the file format of the file, and storing all data objects in a storage database; and determining, according to the file format of the file, the checksum algorithm corresponding to the file;
Applying the checksum algorithm corresponding to the file to each data object of the file, and recording the checksum value of each data object of the file;
Establishing a correspondence between the checksum value of each data object and the storage location of that data object in the storage database;
Managing the stored files according to the correspondence between the checksum value of each data object and the storage location of that data object in the storage database.
Wherein determining, according to the file format of the file, the checksum algorithm corresponding to the file comprises:
When a file format corresponds to multiple checksum algorithms, selecting, according to the size of each data object of the file, a checksum algorithm whose complexity corresponds to the size of the file.
Wherein managing the stored files according to the correspondence between the checksum value of each data object and the storage location of that data object in the storage database comprises:
When performing cross-file deduplication, comparing the checksum values of the data objects of the stored files;
If the checksum values of two data objects are identical, using the storage location corresponding to the checksum value of one of the two data objects to locate that data object, and performing the deduplication operation on the data object found.
Wherein managing the stored files according to the correspondence between the checksum value of each data object and the storage location of that data object in the storage database comprises:
When an application reads a file, obtaining the checksum value of the data object the application uses;
Reading that data object from the storage location corresponding to the checksum value.
A content-aware data management system, comprising: a structured and unstructured file format template library, a file scanning and awareness module, a checksum computation module, a checksum storage database, and an object storage database, wherein:
The file format template library is configured to store multiple file format templates, wherein the file format templates include at least one of format templates for structured files, format templates for unstructured files, and user-defined file format templates;
The file scanning and awareness module is configured to scan files one by one, identify the format of a given file, and match it against the file format templates stored in the file format template library, so as to select a suitable checksum algorithm;
The checksum computation module is configured to provide various checksum algorithms for the objects extracted from inside a file according to its file format;
The checksum storage database is configured to, after the file scanning and awareness module has determined the file format, divide the content of the file into at least two data objects according to the file format, store the data objects in the object storage database, use the checksum computation module to compute a checksum for each object in the set of data objects, and record the checksum value and the storage location of each object in the object storage database;
The object storage database is configured to store the data objects.
Wherein the checksum computation module is specifically configured to, when a file format corresponds to multiple checksum algorithms, select, according to the size of each data object of the file, a checksum algorithm whose complexity corresponds to the size of the file.
Wherein the system further comprises a deduplication module, wherein:
The deduplication module is configured to, when performing cross-file deduplication, compare the checksum values of the data objects of the stored files; and, if the checksum values of two data objects are identical, use the storage location corresponding to the checksum value of one of the two data objects to locate that data object, and perform the deduplication operation on the data object found.
Wherein the system further comprises a read control module, wherein:
The read control module is configured to, when an application reads a file, obtain the checksum value of the data object the application uses, and read that data object from the storage location corresponding to the checksum value.
The content-aware data deduplication method and system provided by the present invention are centered on the file format template library. Data inside a file is divided into objects according to the file format; the actual object data is stored in the object storage database, and the hash value of each object is stored at that object's position in the file, so the format of the original file is not destroyed. When an application reads a file, it does not need to read all the data objects held in the object storage database and reassemble the original file; it reads only the required file objects, on demand, according to their checksum values. This reduces reads of unnecessary data and improves I/O performance.
Brief description of the drawings
Fig. 1 is a schematic flowchart of an embodiment of the content-aware data management method provided by the present invention;
Fig. 2 is a schematic structural diagram of an embodiment of the content-aware data management system provided by the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments in this application and the features in those embodiments may be combined with one another in any way.
Fig. 1 is a schematic flowchart of an embodiment of the content-aware data management method provided by the present invention. As shown in Fig. 1, the method embodiment comprises:
Step 101: acquire a preset file format template;
A file format template is built from the data format of a file. Each file format has its own data division rule, which is set according to the data storage unit of that file type; for example, a file in a video format can be divided according to the number of data frames. In the simplest case, a file is divided by size, and the minimum division unit can be set according to the size of the whole file. A data division rule can control the size of each resulting data object by setting upper and lower limits on the number of divisions; upper and lower limits on the size of a data object can of course also be set directly.
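The size-based division rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names DivisionRule, object_size_for, and split_into_objects, and the default bounds, are hypothetical. When the bounds conflict, this sketch gives priority to producing at least the minimum number of objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DivisionRule:
    """Hypothetical division rule attached to a file format template."""
    min_object_size: int      # lower limit on object size, in bytes
    max_object_size: int      # upper limit on object size, in bytes
    min_objects: int = 2      # the method requires at least two objects
    max_objects: int = 1024   # assumed upper limit on the number of divisions


def object_size_for(file_size: int, rule: DivisionRule) -> int:
    """Pick an object size that respects the size and count limits."""
    # Smallest size that keeps the object count at or below max_objects
    # (ceiling division), but never below the configured minimum size.
    size = max(rule.min_object_size, -(-file_size // rule.max_objects))
    # Cap the size so that at least min_objects objects are produced and
    # the configured maximum object size is not exceeded.
    return min(size, rule.max_object_size, max(1, file_size // rule.min_objects))


def split_into_objects(data: bytes, rule: DivisionRule) -> list[bytes]:
    """Divide file content into consecutive data objects."""
    size = object_size_for(len(data), rule)
    return [data[i:i + size] for i in range(0, len(data), size)]
```

Concatenating the returned objects in order reproduces the original content, which is what later allows individual objects to be read without reassembling the whole file.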
Step 102: match the stored file against the file format template to obtain the file format of the file;
Specifically, after a file is obtained, its suffix (file name extension) is read and compared with the file formats recorded in the file format templates to find the file format of the file.
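The suffix comparison in step 102 can be sketched as a simple lookup. The template table, the format names, and the "generic" fallback below are assumptions made for illustration; the source does not specify how templates are stored.

```python
from pathlib import PurePath

# Hypothetical template table: file suffix -> file format name.
FORMAT_TEMPLATES = {
    ".mp4": "video",
    ".avi": "video",
    ".jpg": "image",
    ".png": "image",
    ".txt": "text",
}


def detect_format(filename: str, templates=FORMAT_TEMPLATES, default="generic") -> str:
    """Return the file format whose template matches the file's suffix."""
    # Only the last suffix is considered, compared case-insensitively.
    suffix = PurePath(filename).suffix.lower()
    return templates.get(suffix, default)
```

A file with no suffix, or with a suffix missing from the table, falls back to the assumed default format in this sketch.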
Step 103: divide the content of the file into at least two data objects according to the file format of the file, and store all data objects in the storage database; and determine, according to the file format of the file, the checksum algorithm corresponding to the file;
Different files use different checksum algorithms. For example, pictures can use the scale-invariant feature transform (SIFT) and the pHash algorithm, while ordinary files can be verified with an MD5 checksum.
Step 104: apply the checksum algorithm corresponding to the file to each data object of the file, and record the checksum value of each data object of the file;
Steps 103 and 104 need not be performed in a fixed order.
Step 105: establish a correspondence between the checksum value of each data object and the storage location of that data object in the storage database;
For example, the checksum values of the data objects are saved sequentially, following the order in which the data objects appear in the file.
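The correspondence built in step 105 can be pictured as an ordered list of (checksum value, storage position) pairs per file. The sketch below uses a plain Python list as a stand-in for the storage database and MD5 as the checksum; both choices, and the name build_manifest, are assumptions.

```python
import hashlib


def build_manifest(objects: list[bytes], store: list[bytes]) -> list[tuple[str, int]]:
    """Store each object and record (checksum value, storage position) in file order."""
    manifest = []
    for obj in objects:
        store.append(obj)            # write the object to the stand-in store
        position = len(store) - 1    # its storage position is its list index
        manifest.append((hashlib.md5(obj).hexdigest(), position))
    return manifest
```

Because the pairs are kept in the order the objects appear in the file, the n-th entry always describes the n-th object of that file.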
Step 106: manage the stored files according to the correspondence between the checksum value of each data object and the storage location of that data object in the storage database.
The present invention adopts a content-aware data deduplication method centered on the structured and unstructured file format template library. Data inside a file is divided into objects according to the file format; the actual object data is stored in the object storage database, and the hash value of each object is stored at that object's position in the file, so the format of the original file is not destroyed. When an application reads a file, it does not need to read all the data objects held in the object storage database and reassemble the original file; it reads only the required file objects, on demand, according to their checksum values. This reduces reads of unnecessary data and improves I/O performance.
The structure of the present invention differs from traditional file-level and block-level deduplication. Specifically, existing deduplication methods do not deduplicate the duplicate data inside a file in advance; every deduplication pass must repeatedly compute checksums over the duplicate data and compare them, and it also destroys the original format of the file. When an application processes a file, it must read all of the file's data from the data store and reassemble the original file before processing, so it cannot read on demand. The present invention adopts a content-aware data deduplication method centered on the structured and unstructured file format template library. Data inside a file is divided into objects according to the file format; the actual object data is stored in the object storage database, and the checksum value of each object is stored at that object's position in the file, so the format of the original file is not destroyed. When an application reads a file, it does not need to read all the data objects held in the object storage database and reassemble the original file; it reads only the required file objects, on demand, according to their checksum values. This reduces reads of unnecessary data and improves I/O performance.
The method embodiment provided by the present invention is further described below:
Wherein determining, according to the file format of the file, the checksum algorithm corresponding to the file comprises:
When a file format corresponds to multiple checksum algorithms, selecting, according to the size of each data object of the file, a checksum algorithm whose complexity corresponds to the size of the file.
Specifically, for files of a given file format, many checksum algorithms may be applicable, but they differ in complexity. The applicable checksum algorithms can therefore be divided into high-complexity and low-complexity algorithms; when the data objects produced by the division are too large, a low-complexity algorithm is selected to preserve computational performance.
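Size-based selection between a cheaper and a costlier algorithm could look like the sketch below. The pairing of MD5 as the low-cost option and SHA-256 as the high-cost option, the 4 MiB threshold, and all function names are assumptions chosen for illustration; the source names no specific algorithms or thresholds for this step.

```python
import hashlib

# Assumed threshold above which objects count as "too large".
SIZE_THRESHOLD = 4 * 1024 * 1024


def low_cost_checksum(obj: bytes) -> str:
    """Stand-in for a low-complexity algorithm (MD5 here, by assumption)."""
    return hashlib.md5(obj).hexdigest()


def high_cost_checksum(obj: bytes) -> str:
    """Stand-in for a high-complexity algorithm (SHA-256 here, by assumption)."""
    return hashlib.sha256(obj).hexdigest()


def select_algorithm(object_sizes: list[int], threshold: int = SIZE_THRESHOLD):
    """Choose one algorithm for a file based on the size of its data objects."""
    # If any object exceeds the threshold, prefer the cheaper algorithm.
    if max(object_sizes, default=0) > threshold:
        return low_cost_checksum
    return high_cost_checksum
```

One algorithm is chosen per file rather than per object so that checksum values of objects from the same file remain directly comparable.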
Wherein managing the stored files according to the correspondence between the checksum value of each data object and the storage location of that data object in the storage database comprises:
When performing cross-file deduplication, comparing the checksum values of the data objects of the stored files;
If the checksum values of two data objects are identical, using the storage location corresponding to the checksum value of one of the two data objects to locate that data object, and performing the deduplication operation on the data object found.
Because the checksum algorithm uniquely identifies the content of a data object, data with identical content can be screened out simply by comparing checksum values, without inspecting the content of the data objects, which improves efficiency.
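Cross-file deduplication by checksum comparison can be sketched as follows. The data layout is an assumption: each file has a manifest of (checksum value, storage position) pairs, and the store is a dictionary from position to object bytes. The function name deduplicate is hypothetical.

```python
def deduplicate(manifests: dict[str, list[tuple[str, int]]],
                store: dict[int, bytes]) -> int:
    """Keep one stored copy per checksum value; return how many copies were removed."""
    canonical: dict[str, int] = {}   # checksum value -> position of the kept copy
    removed = 0
    for entries in manifests.values():
        for i, (digest, position) in enumerate(entries):
            kept = canonical.setdefault(digest, position)
            if kept != position:
                # Same checksum at a different position: drop the extra copy
                # and point this entry at the copy that was kept.
                if store.pop(position, None) is not None:
                    removed += 1
                entries[i] = (digest, kept)
    return removed
```

As in the text, this sketch compares checksum values only. An implementation worried about hash collisions could add a byte-for-byte comparison before removing a copy; that check is not part of the described method.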
Wherein managing the stored files according to the correspondence between the checksum value of each data object and the storage location of that data object in the storage database comprises:
When an application reads a file, obtaining the checksum value of the data object the application uses;
Reading that data object from the storage location corresponding to the checksum value.
After a user's read request for a data object is received, the requested data object is determined from the starting position of the data the user requests. For example, when a user watching a video online jumps to a different playback time, the corresponding data object is determined from the playback time that was clicked. The checksum value of that data object is read, the data content of the data object is then read, and the content is sent to the user. This approach can therefore respond quickly to a user's request to read part of the data, improving response speed.
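The on-demand read path can be sketched with byte offsets standing in for the requested start position (a playback time would first have to be converted to an offset, which is outside this sketch). The names locate_object and read_on_demand, and the list and dictionary layout, are assumptions.

```python
from bisect import bisect_right


def locate_object(offset: int, object_sizes: list[int]) -> int:
    """Index of the data object that contains byte `offset` of the file."""
    starts, total = [], 0
    for size in object_sizes:
        starts.append(total)   # starting offset of each object within the file
        total += size
    if not 0 <= offset < total:
        raise ValueError("offset outside file")
    # The containing object is the last one whose start is <= offset.
    return bisect_right(starts, offset) - 1


def read_on_demand(offset: int, object_sizes: list[int], checksums: list[str],
                   location_by_checksum: dict[str, int], store: list[bytes]) -> bytes:
    """Read only the object covering `offset`, via its checksum value."""
    index = locate_object(offset, object_sizes)
    digest = checksums[index]                    # checksum kept at the object's position
    return store[location_by_checksum[digest]]   # one lookup, one object read
```

Only a single object is fetched from the store; the remaining objects of the file are never read or reassembled.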
Fig. 2 is a schematic structural diagram of an embodiment of the content-aware data management system provided by the present invention. The system embodiment shown in Fig. 2 comprises: a structured and unstructured file format template library 201, a file scanning and awareness module 202, a checksum computation module 203, a checksum storage database 204, and an object storage database 205, wherein:
The file format template library 201 is configured to store multiple file format templates, wherein the file format templates include at least one of format templates for structured files, format templates for unstructured files, and user-defined file format templates;
The file scanning and awareness module 202 is configured to scan files one by one, identify the format of a given file, and match it against the file format templates stored in the file format template library 201, so as to select a suitable checksum algorithm;
The checksum computation module 203 is configured to provide various checksum algorithms for the objects extracted from inside a file according to its file format;
The checksum storage database 204 is configured to, after the file scanning and awareness module has determined the file format, divide the content of the file into at least two data objects according to the file format, store the data objects in the object storage database, use the checksum computation module to compute a checksum for each object in the set of data objects, and record the checksum value and the storage location of each object in the object storage database;
The object storage database 205 is configured to store the data objects.
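One way to picture how modules 201 to 205 fit together is the class sketch below. It is an illustrative arrangement only: the class names, the mapping of suffixes to a fixed object size, and the use of MD5 are all assumptions, and the real modules may divide responsibilities differently.

```python
import hashlib
from pathlib import PurePath


class FormatTemplateLibrary:            # stands in for library 201
    def __init__(self, templates: dict[str, int]):
        # Hypothetical template: file suffix -> object size in bytes.
        self.templates = templates


class FileScanner:                      # stands in for module 202
    def __init__(self, library: FormatTemplateLibrary, default_object_size: int = 4):
        self.library = library
        self.default_object_size = default_object_size

    def object_size_for(self, filename: str) -> int:
        suffix = PurePath(filename).suffix.lower()
        return self.library.templates.get(suffix, self.default_object_size)


class ChecksumModule:                   # stands in for module 203
    def compute(self, obj: bytes) -> str:
        return hashlib.md5(obj).hexdigest()


class ObjectDatabase:                   # stands in for database 205
    def __init__(self):
        self.slots: list[bytes] = []

    def put(self, obj: bytes) -> int:
        self.slots.append(obj)
        return len(self.slots) - 1      # storage position of the object


class ChecksumDatabase:                 # stands in for database 204
    def __init__(self, scanner: FileScanner, checksums: ChecksumModule,
                 objects: ObjectDatabase):
        self.scanner, self.checksums, self.objects = scanner, checksums, objects
        self.records: dict[str, list[tuple[str, int]]] = {}

    def ingest(self, filename: str, data: bytes) -> list[tuple[str, int]]:
        """Divide a file, store its objects, record (checksum, position) pairs."""
        size = self.scanner.object_size_for(filename)
        pieces = [data[i:i + size] for i in range(0, len(data), size)]
        self.records[filename] = [
            (self.checksums.compute(p), self.objects.put(p)) for p in pieces
        ]
        return self.records[filename]
```

In this arrangement the checksum storage database drives ingestion and holds the checksum-to-position records, while the object storage database holds nothing but object bytes.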
Wherein the checksum computation module is specifically configured to, when a file format corresponds to multiple checksum algorithms, select, according to the size of each data object of the file, a checksum algorithm whose complexity corresponds to the size of the file.
Wherein the system further comprises a deduplication module, wherein:
The deduplication module is configured to, when performing cross-file deduplication, compare the checksum values of the data objects of the stored files; and, if the checksum values of two data objects are identical, use the storage location corresponding to the checksum value of one of the two data objects to locate that data object, and perform the deduplication operation on the data object found.
Wherein the system further comprises a read control module, wherein:
The read control module is configured to, when an application reads a file, obtain the checksum value of the data object the application uses, and read that data object from the storage location corresponding to the checksum value.
The structure of the present invention differs from traditional file-level and block-level deduplication. Specifically, existing deduplication methods do not deduplicate the duplicate data inside a file in advance; every deduplication pass must repeatedly compute checksums over the duplicate data and compare them, and it also destroys the original format of the file. When an application processes a file, it must read all of the file's data from the data store and reassemble the original file before processing, so it cannot read on demand. The present invention adopts a content-aware data deduplication method centered on the structured and unstructured file format template library. Data inside a file is divided into objects according to the file format; the actual object data is stored in the object storage database, and the checksum value of each object is stored at that object's position in the file, so the format of the original file is not destroyed. When an application reads a file, it does not need to read all the data objects held in the object storage database and reassemble the original file; it reads only the required file objects, on demand, according to their checksum values. This reduces reads of unnecessary data and improves I/O performance.
A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments can be implemented by a computer program flow. The computer program can be stored in a computer-readable storage medium and is executed on a corresponding hardware platform (such as a system, apparatus, or device); when executed, it includes one of the steps of the method embodiment or a combination thereof.
Alternatively, all or part of the steps of the above embodiments can be implemented with integrated circuits: the steps can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The present invention is thus not limited to any specific combination of hardware and software.
Each apparatus, functional module, or functional unit in the above embodiments can be implemented with a general-purpose computing device; they can be concentrated on a single computing device or distributed across a network of multiple computing devices.
When the apparatuses, functional modules, or functional units in the above embodiments are implemented as software functional modules and sold or used as independent products, they can be stored in a computer-readable storage medium. The computer-readable storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disc, or the like.
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited to it. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. The protection scope of the present invention shall therefore be defined by the claims.

Claims (8)

1. A content-aware data management method, characterized by comprising:
Acquiring a preset file format template;
Matching a stored file against the file format template to obtain the file format of the file;
Dividing the content of the file into at least two data objects according to the file format of the file, and storing all data objects in a storage database; and determining, according to the file format of the file, the checksum algorithm corresponding to the file;
Applying the checksum algorithm corresponding to the file to each data object of the file, and recording the checksum value of each data object of the file;
Establishing a correspondence between the checksum value of each data object and the storage location of that data object in the storage database;
Managing the stored files according to the correspondence between the checksum value of each data object and the storage location of that data object in the storage database.
2. The method according to claim 1, characterized in that determining, according to the file format of the file, the checksum algorithm corresponding to the file comprises:
When a file format corresponds to multiple checksum algorithms, selecting, according to the size of each data object of the file, a checksum algorithm whose complexity corresponds to the size of the file.
3. The method according to claim 1, characterized in that managing the stored files according to the correspondence between the checksum value of each data object and the storage location of that data object in the storage database comprises:
When performing cross-file deduplication, comparing the checksum values of the data objects of the stored files;
If the checksum values of two data objects are identical, using the storage location corresponding to the checksum value of one of the two data objects to locate that data object, and performing the deduplication operation on the data object found.
4. The method according to claim 1, characterized in that managing the stored files according to the correspondence between the checksum value of each data object and the storage location of that data object in the storage database comprises:
When an application reads a file, obtaining the checksum value of the data object the application uses;
Reading that data object from the storage location corresponding to the checksum value.
5. A content-aware data management system, characterized in that the system comprises: a structured and unstructured file format template library, a file scanning and awareness module, a checksum computation module, a checksum storage database, and an object storage database, wherein:
The file format template library is configured to store multiple file format templates, wherein the file format templates include at least one of format templates for structured files, format templates for unstructured files, and user-defined file format templates;
The file scanning and awareness module is configured to scan files one by one, identify the format of a given file, and match it against the file format templates stored in the file format template library, so as to select a suitable checksum algorithm;
The checksum computation module is configured to provide various checksum algorithms for the objects extracted from inside a file according to its file format;
The checksum storage database is configured to, after the file scanning and awareness module has determined the file format, divide the content of the file into at least two data objects according to the file format, store the data objects in the object storage database, use the checksum computation module to compute a checksum for each object in the set of data objects, and record the checksum value and the storage location of each object in the object storage database;
The object storage database is configured to store the data objects.
6. The system according to claim 5, characterized in that:
The checksum computation module is specifically configured to, when a file format corresponds to multiple checksum algorithms, select, according to the size of each data object of the file, a checksum algorithm whose complexity corresponds to the size of the file.
7. The system according to claim 5, characterized in that the system further comprises a deduplication module, wherein:
The deduplication module is configured to, when performing cross-file deduplication, compare the checksum values of the data objects of the stored files; and, if the checksum values of two data objects are identical, use the storage location corresponding to the checksum value of one of the two data objects to locate that data object, and perform the deduplication operation on the data object found.
8. The system according to claim 5, characterized in that the system further comprises a read control module, wherein:
The read control module is configured to, when an application reads a file, obtain the checksum value of the data object the application uses, and read that data object from the storage location corresponding to the checksum value.
CN201410018214.1A 2014-01-15 2014-01-15 Data management method and system based on content aware Pending CN103761290A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410018214.1A CN103761290A (en) 2014-01-15 2014-01-15 Data management method and system based on content aware

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410018214.1A CN103761290A (en) 2014-01-15 2014-01-15 Data management method and system based on content aware

Publications (1)

Publication Number Publication Date
CN103761290A true CN103761290A (en) 2014-04-30

Family

ID=50528527

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410018214.1A Pending CN103761290A (en) 2014-01-15 2014-01-15 Data management method and system based on content aware

Country Status (1)

Country Link
CN (1) CN103761290A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104133888A (en) * 2014-07-30 2014-11-05 宇龙计算机通信科技(深圳)有限公司 Multi-system data processing method, device and terminal
CN104133888B (en) * 2014-07-30 2019-08-02 宇龙计算机通信科技(深圳)有限公司 A kind of multisystem data processing method, device and terminal
CN109446345A (en) * 2018-09-26 2019-03-08 深圳中广核工程设计有限公司 Nuclear power file verification processing method and system
CN110071782A (en) * 2019-04-12 2019-07-30 广州小鹏汽车科技有限公司 The processing method and processing unit of message
CN110071782B (en) * 2019-04-12 2022-03-18 广州小鹏汽车科技有限公司 Message processing method and processing device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140430
