CN103761290A

CN103761290A - Content-aware data management method and system

Info

Publication number: CN103761290A
Application number: CN201410018214.1A
Authority: CN
Inventors: 王通
Original assignee: Inspur Beijing Electronic Information Industry Co Ltd
Current assignee: Inspur Beijing Electronic Information Industry Co Ltd
Priority date: 2014-01-15
Filing date: 2014-01-15
Publication date: 2014-04-30

Abstract

The invention provides a content-aware data management method and system. The method includes: acquiring a preset file format template; matching a stored file against the file format template to obtain the file format of the file; dividing the content of the file into at least two data objects according to its file format, and storing all data objects in a storage database; determining, according to the file format, a check algorithm corresponding to the file; using that check algorithm to perform a check calculation on each data object of the file, and recording the check value of each data object; establishing a correspondence between the check value of each data object and the storage location of that data object in the storage database; and managing the stored files according to that correspondence.

Description

Content-aware data management method and system

Technical field

The present invention relates to the field of computer applications, and in particular to a content-aware data management method and system.

Background

Since the start of the 21st century, as the information age has accelerated, enterprise data has grown explosively; the development of the mobile Internet, the Internet of Things, and cloud computing in particular has intensified this growth. An IDC report points out that the global data volume grows at 60% per year: it reached 1.8 ZB in 2010 and is expected to reach 8 ZB in 2015 and 35 ZB in 2020, marking the arrival of the "big data" era. Data growth brings serious problems: sharply rising costs, heavy bandwidth pressure, severe energy consumption, and large equipment footprints. Adding equipment cannot fundamentally solve the surge in data volume. Meanwhile, the energy problems facing the world are increasingly severe, and energy waste and environmental protection in the high-tech IT field attract growing attention.

The widespread use of the Internet has steadily expanded the information centers of large enterprises, government bodies, and financial institutions: data exchange increases, equipment piles up, floor space grows, and power consumption repeatedly hits new highs. To optimize information and management, enterprises increasingly call for green, energy-saving technology when building their information architecture. To save energy, reduce power consumption, and lower system cost, research into new green storage technologies for emerging applications is urgently needed. Against this backdrop, data deduplication technology emerged. It effectively reduces duplicate data in a user's storage system, saving storage capacity and reducing storage cost and management difficulty.

By method, deduplication techniques can be divided into file-level deduplication and block-level deduplication. Both methods deduplicate across files, so data cannot be read on demand in actual use.

Summary of the invention

The technical problem to be solved by the content-aware data management method and system provided by the present invention is how to read data on demand.

To solve the above technical problem, the present invention provides the following technical solutions:

A content-aware data management method, comprising:

Acquiring a preset file format template;

Matching a stored file against the file format template to obtain the file format of the file;

Dividing the content of the file into at least two data objects according to the file format of the file, and storing all data objects in a storage database; and determining, according to the file format of the file, the check algorithm corresponding to the file;

Using the check algorithm corresponding to the file to perform a check calculation on each data object of the file, and recording the check value of each data object of the file;

Establishing a correspondence between the check value of each data object and the storage location of that data object in the storage database;

Managing the stored files according to the correspondence between the check value of each data object and the storage location of that data object in the storage database.

Wherein determining the check algorithm corresponding to the file according to the file format of the file comprises:

When a file format corresponds to multiple check algorithms, selecting, according to the size of each data object of the file, a check algorithm whose complexity matches the size of the file.

Wherein managing the stored files according to the correspondence between the check value of each data object and the storage location of that data object in the storage database comprises:

When performing cross-file deduplication, comparing the check values of the data objects of the stored files;

If the check values of two data objects are identical, locating one of the two data objects by the storage location corresponding to its check value, and performing the deduplication operation on the located data object.
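The cross-file comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function `deduplicate`, the `(check value, location)` record layout, and the dictionary standing in for the storage database are all hypothetical. The text does not say what the deduplication operation does to the located object; the sketch assumes the duplicate copy is deleted and its record is repointed to the retained copy.

```python
def deduplicate(records: list[tuple[str, int]],
                storage: dict[int, bytes]) -> list[tuple[str, int]]:
    """Collapse objects that share a check value onto one stored copy.

    records: (check value, storage location) pairs gathered across files.
    storage: hypothetical stand-in for the storage database, keyed by location.
    Only check values are compared; object contents are never inspected.
    """
    kept: dict[str, int] = {}  # check value -> location of the retained copy
    result = []
    for value, location in records:
        if value in kept and kept[value] != location:
            # Same check value seen before: drop this copy and reuse the first.
            storage.pop(location, None)
            location = kept[value]
        else:
            kept.setdefault(value, location)
        result.append((value, location))
    return result
```

After the pass, every record still resolves to a stored object, so files that referenced the deleted copy remain readable through the retained one.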

When an application reads a file, obtaining the check value of the data object used by the application;

Reading the data object from the storage location corresponding to that check value.

A content-aware data management system, comprising: a structured and unstructured file format template library, a file scanning and sensing module, a check calculation module, a check storage database, and an object storage database, wherein:

The file format template library is configured to store multiple file format templates, wherein the file format templates include at least one of format templates for structured files, format templates for unstructured files, and user-defined file format templates;

The file scanning and sensing module is configured to scan files one by one and identify the format of a given file by matching it against the file format templates stored in the file format template library, so that a suitable check algorithm can be selected;

The check calculation module is configured to provide various check algorithms for the objects extracted from within a file according to its file format;

The check storage database is configured to, after the file scanning and sensing module has determined the file format, divide the content of the file into at least two data objects according to the file format, store the data objects in the object storage database, use the check calculation module to perform a check calculation on the objects of the data object set, and record each check value together with the storage location of the corresponding object in the object storage database;

The object storage database is configured to store the data objects.

Wherein the check calculation module is specifically configured to, when a file format corresponds to multiple check algorithms, select, according to the size of each data object of the file, a check algorithm whose complexity matches the size of the file.

Wherein the system further comprises a deduplication module, wherein:

The deduplication module is configured to, when performing cross-file deduplication, compare the check values of the data objects of the stored files; and, if the check values of two data objects are identical, locate one of the two data objects by the storage location corresponding to its check value and perform the deduplication operation on the located data object.

Wherein the system further comprises a read control module, wherein:

The read control module is configured to, when an application reads a file, obtain the check value of the data object used by the application and read the data object from the storage location corresponding to that check value.

The content-aware data deduplication method and system provided by the present invention are centered on the file format template library. The internal data of a file is divided into objects according to the file format; the actual object data is stored in the object storage database, and the hash value of each object is stored at that object's position in the file, so the original file format is not destroyed. When an application reads a file, there is no need to read all of the data objects held in the object storage database and reassemble the source file; instead, only the required file objects are read on demand according to their check values, which reduces the reading of unnecessary data and improves I/O performance.

Brief description of the drawings

Fig. 1 is a schematic flowchart of an embodiment of the content-aware data management method provided by the present invention;

Fig. 2 is a schematic structural diagram of an embodiment of the content-aware data management system provided by the present invention.

Detailed description

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features in those embodiments may be combined with one another arbitrarily.

Fig. 1 is a schematic flowchart of an embodiment of the content-aware data management method provided by the present invention. As shown in Fig. 1, the method embodiment comprises:

Step 101: acquire a preset file format template;

The file format template is built from the data format of the file. Each file format has its own data division rule, which is set according to the data storage unit of that format; for example, a file in a video format can be divided by the number of data frames. In the simplest case, the file is divided by size, where the minimum division unit can be set according to the size of the whole file. In the data division rule, the size of each data object after division can be controlled by setting upper and lower limits on the number of divisions; upper and lower limits on the size of a data object can of course also be set directly.
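The size-based division rule can be illustrated with a short sketch. It is only one possible reading of the rule: the function name `split_into_objects` and the parameters `target_size`, `min_parts`, and `max_parts` are assumptions introduced here, and format-specific rules such as dividing video by frames are not modeled.

```python
import math


def split_into_objects(data: bytes, target_size: int,
                       min_parts: int = 2, max_parts: int = 1024) -> list[bytes]:
    """Divide file content into data objects of nearly equal size.

    target_size, min_parts and max_parts are hypothetical parameters. The
    number of divisions is clamped to [min_parts, max_parts], which in turn
    bounds the size of each resulting object.
    """
    if not data:
        return []
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    # Number of divisions implied by the target size, clamped to the limits.
    parts = math.ceil(len(data) / target_size)
    parts = max(min_parts, min(max_parts, parts))
    # A part cannot be empty, so never use more parts than there are bytes.
    parts = min(parts, len(data))
    # Spread the bytes evenly: the first `extra` parts get one additional byte.
    base, extra = divmod(len(data), parts)
    objects, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        objects.append(data[start:end])
        start = end
    return objects
```

Concatenating the returned objects in order reproduces the original content, which is what allows a file to be reassembled or read piecewise later.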

Step 102: match the stored file against the file format template to obtain the file format of the file;

Specifically, after a file is obtained, its suffix (file extension) is read and compared with the file formats recorded in the file format templates to find the file format of the file.
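A minimal sketch of the suffix comparison in step 102 follows. The template table `FORMAT_TEMPLATES`, its entries, and the function `detect_format` are hypothetical; the text does not specify how templates are represented.

```python
from pathlib import PurePath
from typing import Optional

# Hypothetical template table: each file format lists the suffixes it covers.
FORMAT_TEMPLATES = {
    "video": {".mp4", ".avi", ".mkv"},
    "picture": {".jpg", ".jpeg", ".png"},
    "document": {".txt", ".doc", ".pdf"},
}


def detect_format(filename: str) -> Optional[str]:
    """Return the format whose template records the file's suffix, if any."""
    suffix = PurePath(filename).suffix.lower()
    for file_format, suffixes in FORMAT_TEMPLATES.items():
        if suffix in suffixes:
            return file_format
    return None
```

A file whose suffix appears in no template yields `None`; how such files are handled is left open here, as it is in the text.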

Step 103: divide the content of the file into at least two data objects according to the file format of the file, and store all data objects in the storage database; and determine, according to the file format of the file, the check algorithm corresponding to the file;

Different files use different check algorithms. For example, pictures can use the scale-invariant feature transform (SIFT) and the pHash algorithm, while ordinary files can be checked with MD5.

Step 104: use the check algorithm corresponding to the file to perform a check calculation on each data object of the file, and record the check value of each data object of the file;

There is no fixed order between step 103 and step 104.

Step 105: establish a correspondence between the check value of each data object and the storage location of that data object in the storage database;

For example, the check values of the data objects are saved in the order in which the data objects appear in the file.
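The correspondence of step 105 can be sketched as an ordered list of pairs. This is an illustration only: the function `build_index` is a hypothetical name, a plain Python list stands in for the storage database (a storage location is just an index into it), and MD5 is assumed as the check algorithm.

```python
import hashlib


def build_index(objects: list[bytes],
                storage: list[bytes]) -> list[tuple[str, int]]:
    """Store each object and return (check value, storage location) pairs.

    The pairs are kept in the order the objects appear in the file, so the
    i-th pair always describes the i-th data object of the file.
    """
    index = []
    for obj in objects:
        storage.append(obj)
        location = len(storage) - 1  # position of the object just stored
        index.append((hashlib.md5(obj).hexdigest(), location))
    return index
```

Because the order is preserved, the position of a pair in the list identifies which part of the file it describes, and objects with identical content are recognizable by their equal check values.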

Step 106: manage the stored files according to the correspondence between the check value of each data object and the storage location of that data object in the storage database.

The present invention adopts a content-aware data deduplication method centered on the structured and unstructured file format template library. The internal data of a file is divided into objects according to the file format; the actual object data is stored in the object storage database, and the hash value of each object is stored at that object's position in the file, so the original file format is not destroyed. When an application reads a file, there is no need to read all of the data objects held in the object storage database and reassemble the source file; instead, only the required file objects are read on demand according to their check values, which reduces the reading of unnecessary data and improves I/O performance.

The structure of the present invention differs from traditional file-level and block-level deduplication. Specifically, existing deduplication methods do not deduplicate the repeated data inside a file in advance, so every deduplication pass must repeat the check calculation and comparison on the duplicate data. They also destroy the original format of the file: when an application processes a file, all of the file's data must first be read from the data store and reassembled into the original file, so data cannot be read on demand. By contrast, the present invention adopts a content-aware data deduplication method centered on the structured and unstructured file format template library. The internal data of a file is divided into objects according to the file format; the actual object data is stored in the object storage database, and the check value of each object is stored at that object's position in the file, so the original file format is not destroyed. When an application reads a file, there is no need to read all of the data objects held in the object storage database and reassemble the source file; instead, only the required file objects are read on demand according to their check values, which reduces the reading of unnecessary data and improves I/O performance.

The method embodiment provided by the present invention is further described below:

Specifically, for files of a given file format, many check algorithms may be applicable, but these algorithms differ in complexity. The applicable check algorithms can therefore be divided into high-complexity algorithms and low-complexity algorithms. When the data objects are too large, a low-complexity algorithm is selected to guarantee computing performance.
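The size-dependent choice can be sketched as a threshold test. Everything concrete here is an assumption: the text names neither the algorithms in each class nor a size limit, so MD5 stands in for a "high-complexity" algorithm, CRC-32 (from Python's standard `zlib`) for a "low-complexity" one, and `LARGE_OBJECT_BYTES` is a hypothetical threshold. CRC-32 produces only 32 bits and is far more collision-prone than MD5, so it appears purely to illustrate the cost trade-off.

```python
import hashlib
import zlib
from typing import Callable

# Hypothetical size above which a data object counts as "too large".
LARGE_OBJECT_BYTES = 64 * 1024 * 1024


def md5_check(obj: bytes) -> str:
    """Stand-in for a high-complexity check algorithm."""
    return hashlib.md5(obj).hexdigest()


def crc32_check(obj: bytes) -> str:
    """Stand-in for a low-complexity check algorithm."""
    return format(zlib.crc32(obj), "08x")


def select_check_algorithm(object_sizes: list[int],
                           threshold: int = LARGE_OBJECT_BYTES
                           ) -> Callable[[bytes], str]:
    """Pick the cheaper algorithm when any object exceeds the threshold."""
    if any(size > threshold for size in object_sizes):
        return crc32_check
    return md5_check
```

The selection is made once per file from the sizes of its objects, so every object of a file is checked with the same algorithm and its check values stay comparable.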

Because a check algorithm can uniquely identify the content of a data object, data with identical content can be screened out simply by check value, without inspecting the content of the data objects, which improves efficiency.

After a user's read request for a data object is received, the data object requested by the user is determined from the starting position of the requested data. For example, when a user watching a video online jumps to a different playback time, the corresponding data object is determined from the playback time that was clicked. The check value of that data object is read first, and then the content of the data object is read and sent to the user. This approach can therefore respond quickly to a user's request to read part of the data, improving response speed.
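The video example can be sketched as a lookup in three stages: playback time to object index, object index to check value, check value to storage location. The names `object_for_time` and `read_at`, the per-object `start_times` list, and the in-memory structures are assumptions made for illustration; the text does not describe how playback times are mapped to objects.

```python
import bisect


def object_for_time(start_times: list[float], seek_seconds: float) -> int:
    """Index of the data object that covers the requested playback time.

    start_times holds the start time of each object in ascending order,
    beginning at 0 (a hypothetical per-file table).
    """
    if not start_times:
        raise ValueError("file has no data objects")
    if seek_seconds < 0:
        raise ValueError("seek time must be non-negative")
    return bisect.bisect_right(start_times, seek_seconds) - 1


def read_at(seek_seconds: float, start_times: list[float],
            check_values: list[str], location_of: dict[str, int],
            storage: list[bytes]) -> bytes:
    """Read only the object needed for the requested playback time."""
    i = object_for_time(start_times, seek_seconds)
    value = check_values[i]              # check value recorded for that object
    return storage[location_of[value]]   # no other object is touched
```

Only one entry of `storage` is accessed per request, which is the on-demand behavior the paragraph describes.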

Fig. 2 is a schematic structural diagram of an embodiment of the content-aware data management system provided by the present invention. The system embodiment shown in Fig. 2 comprises: a structured and unstructured file format template library 201, a file scanning and sensing module 202, a check calculation module 203, a check storage database 204, and an object storage database 205, wherein:

The file format template library 201 is configured to store multiple file format templates, wherein the file format templates include at least one of format templates for structured files, format templates for unstructured files, and user-defined file format templates;

The file scanning and sensing module 202 is configured to scan files one by one and identify the format of a given file by matching it against the file format templates stored in the file format template library 201, so that a suitable check algorithm can be selected;

The check calculation module 203 is configured to provide various check algorithms for the objects extracted from within a file according to its file format;

The check storage database 204 is configured to, after the file scanning and sensing module has determined the file format, divide the content of the file into at least two data objects according to the file format, store the data objects in the object storage database, use the check calculation module to perform a check calculation on the objects of the data object set, and record each check value together with the storage location of the corresponding object in the object storage database;

The object storage database 205 is configured to store the data objects.

The system further comprises a deduplication module, which operates as described in the summary of the invention above.

The system further comprises a read control module, which likewise operates as described in the summary of the invention above.

A person of ordinary skill in the art will understand that all or some of the steps of the above embodiments can be implemented as a computer program. The computer program can be stored in a computer-readable storage medium and executed on a corresponding hardware platform (such as a system, unit, or device); when executed, it performs one of the steps of the method embodiment or a combination thereof.

Alternatively, all or some of the steps of the above embodiments can be implemented with integrated circuits: the steps can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The present invention is thus not limited to any specific combination of hardware and software.

Each device, functional module, or functional unit in the above embodiments can be implemented with general-purpose computing devices; they can be concentrated on a single computing device or distributed over a network of multiple computing devices.

When the devices, functional modules, or functional units in the above embodiments are implemented as software functional modules and sold or used as independent products, they can be stored in a computer-readable storage medium. The computer-readable storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disc, or the like.

The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited to them. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. The protection scope of the present invention shall therefore be subject to the protection scope of the claims.

Claims

1. A content-aware data management method, characterized by comprising:

Acquiring a preset file format template;

2. The method according to claim 1, characterized in that determining the check algorithm corresponding to the file according to the file format of the file comprises:

3. The method according to claim 1, characterized in that managing the stored files according to the correspondence between the check value of each data object and the storage location of that data object in the storage database comprises:

4. The method according to claim 1, characterized in that managing the stored files according to the correspondence between the check value of each data object and the storage location of that data object in the storage database comprises:

5. A content-aware data management system, characterized in that the system comprises: a structured and unstructured file format template library, a file scanning and sensing module, a check calculation module, a check storage database, and an object storage database, wherein:

The object storage database is configured to store the data objects.

6. The system according to claim 5, characterized in that:

The check calculation module is specifically configured to, when a file format corresponds to multiple check algorithms, select, according to the size of each data object of the file, a check algorithm whose complexity matches the size of the file.

7. The system according to claim 5, characterized in that the system further comprises a deduplication module, wherein:

8. The system according to claim 5, characterized in that the system further comprises a read control module, wherein: