CN116627320A

CN116627320A - Unstructured data storage, migration and identification method

Info

Publication number: CN116627320A
Application number: CN202310374828.2A
Authority: CN
Inventors: 王燕蓉; 闫丽飞; 赵维伟; 吕世雷; 赵元朋
Original assignee: State Grid Information and Telecommunication Co Ltd; Fujian Yirong Information Technology Co Ltd
Current assignee: State Grid Information and Telecommunication Co Ltd; Fujian Yirong Information Technology Co Ltd
Priority date: 2023-04-10
Filing date: 2023-04-10
Publication date: 2023-08-22

Abstract

The invention belongs to the technical field of unstructured data processing, and in particular relates to a method for storing, migrating and identifying unstructured data, comprising the following steps. S1: obtain unstructured data, establish an index tag for the unstructured data, and store the index tag in a guide partition. S2: analyze the unstructured data for which the index tag has been established, determine and split the same part and the different parts of the unstructured data, and store the split unstructured data in a storage area. S3: store the same part of the split unstructured data in a redundant partition of the storage area, and store the different parts of the split unstructured data in a total storage area of the storage area. By analyzing the unstructured data and then classifying, merging and storing it, the invention improves the storage compression efficiency of the data and facilitates its identification and migration.

Description

Unstructured data storage, migration and identification method

Technical Field

The invention belongs to the technical field of unstructured data processing, and particularly relates to a method for storing, migrating and identifying unstructured data.

Background

In the information age, every industry accumulates massive amounts of data in the course of handling its business. With the popularization and development of IT applications, traditional paper-based storage continues to decline, and more and more information is stored electronically in computers. For unstructured data such as pictures, images and video, the diversity of data formats makes it inconvenient to represent and compress the data using a two-dimensional table structure.

Existing unstructured data is generally stored in memory sequentially, so the stored items are not associated with one another, identification and retrieval take a long time, and the data is difficult to use. In addition, unstructured data contains a large amount of redundant information, so a large amount of storage space is wasted when the data is stored, and the storage compression efficiency is low.

Disclosure of Invention

To remedy the defects of the prior art, the invention analyzes unstructured data and then classifies, merges and stores it, which improves the storage compression efficiency of the data and facilitates its identification and migration.

The technical solution adopted by the invention to solve the technical problem is as follows: a method for storing, migrating and identifying unstructured data, characterized in that the method comprises the following steps:

S1: obtaining unstructured data, establishing an index tag for the unstructured data, and storing the index tag in a guide partition;

S2: analyzing the unstructured data for which the index tag has been established, determining and splitting the same part and the different parts of the unstructured data, and storing the split unstructured data in a storage area;

S3: storing the same part of the split unstructured data in a redundant partition of the storage area, and storing the different parts of the split unstructured data in a total storage area of the storage area;

wherein one copy of the part that is the same across different unstructured data is stored in the redundant partition, mutually independent labels are set for the multiple groups of same-part data from different sources stored in the redundant partition, and a backup of the redundant partition exists in the storage area.

Preferably, the index tag includes check information for checking the integrity of the unstructured data. The check information includes, but is not limited to, the MD5 value, SHA1 value and CRC32 value of the unstructured data, and the unstructured data must be verified against the check information recorded in the index tag whenever it is stored, consulted or used.

Preferably, the index tag established for the unstructured data in step S1 includes type information, the redundant partition and the total storage area are each divided into sub-partitions corresponding to different types of unstructured data, and the data blocks obtained by analyzing and splitting the unstructured data are stored in the sub-partition corresponding to the recorded type information.

Preferably, the index tag includes a main tag and a sub tag, the main tag stores index tags of all unstructured data stored currently, the sub tag is generated according to a fixed time interval, and the sub tag stores index tags of unstructured data stored in a corresponding period.

Preferably, when the unstructured data is migrated, the index tag, the data in the redundant partition and the data in the total storage area are migrated in that order; the data blocks in the redundant partition and the total storage area are assigned migration numbers before they are migrated, and the migration number data is migrated after the index tag.

Preferably, the unstructured data is verified at a fixed time interval in the unstructured data migration process.

Preferably, the unstructured data is divided into hot spot data and common data according to the number of uses and the usage time, and the hot spot data is transmitted preferentially during migration.

Preferably, the hot spot data is stored in the same storage device or storage area after being migrated.

The beneficial effects of the invention are as follows:

1. In the method for storing, migrating and identifying unstructured data according to the invention, the unstructured data to be stored is analyzed and split, and only one copy of the part that is the same across different unstructured data is stored. This effectively reduces the redundant parts of the stored unstructured data, reduces the size of the stored data, improves the storage compression efficiency of the unstructured data, and reduces wasted storage space.

2. In the method for storing, migrating and identifying unstructured data according to the invention, setting a main tag and sub tags allows the index tags to be compared against each other, which reduces cases in which stored data becomes unusable because of index tag errors; at the same time, the sub tags speed up searching for and using the stored data.

Drawings

The invention is further described below with reference to the accompanying drawings.

FIG. 1 is a block flow diagram of the identification method of the present invention.

Detailed Description

To make the technical means, inventive features, objectives and effects of the invention easy to understand, the invention is further described below with reference to specific embodiments.

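As shown in FIG. 1, the method for storing, migrating and identifying unstructured data according to the invention comprises the following steps:

S1: obtaining unstructured data, establishing an index tag for the unstructured data, and storing the index tag in a guide partition;

S2: analyzing the unstructured data for which the index tag has been established, determining and splitting the same part and the different parts of the unstructured data, and storing the split unstructured data in a storage area;

S3: storing the same part of the split unstructured data in a redundant partition of the storage area, and storing the different parts of the split unstructured data in a total storage area of the storage area;

wherein one copy of the part that is the same across different unstructured data is stored in the redundant partition, mutually independent labels are set for the multiple groups of same-part data from different sources stored in the redundant partition, and a backup of the redundant partition exists in the storage area.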

When unstructured data is stored, an index tag is established for the obtained unstructured data so that the stored data can be conveniently searched, read and migrated. This prevents the stored unstructured data from becoming disordered, which would reduce the storage compression efficiency and make the data difficult to consult and use. The unstructured data for which the index tag has been established is then analyzed and split, and the same part and the different parts of the resulting data blocks are determined. According to the differences between the data blocks, one copy of the same part is selected and stored in the redundant partition, while all of the different parts are stored in the total storage area. This effectively reduces the redundant parts of the stored unstructured data and its total size, thereby improving the storage compression efficiency, reducing wasted storage space, and making the unstructured data easier to identify, consult and migrate. When stored unstructured data needs to be used, the same part and the different parts corresponding to that data are retrieved from the redundant partition and the total storage area according to the index tag, and are combined in the order recorded in the index tag to restore the original unstructured data. The index tag also records the label of the same-part data that the unstructured data stores in the redundant partition. Because multiple groups of same-part data are stored in the redundant partition, this label prevents the wrong group from being retrieved, which would otherwise affect normal use of the unstructured data.
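The storage and recall flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the fixed-size chunking, the `Store` class and its method names, the use of SHA-256 digests as the independent labels, and the rule that a chunk counts as a "same part" when it occurs more than once in the batch are all choices made for the example.

```python
import hashlib
from collections import Counter

CHUNK_SIZE = 4  # assumed, deliberately tiny so the example is easy to follow


def split(data: bytes) -> list:
    """Split data into fixed-size chunks (an assumed splitting strategy)."""
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]


def label_of(chunk: bytes) -> str:
    """Independent label for a shared chunk (assumption: its SHA-256 digest)."""
    return hashlib.sha256(chunk).hexdigest()


class Store:
    def __init__(self):
        self.guide = {}      # guide partition: object name -> index tag
        self.redundant = {}  # redundant partition: label -> one copy of a same part
        self.total = {}      # total storage area: (name, position) -> different part

    def put_many(self, objects: dict) -> None:
        """Analyze a batch of objects, split them, and store the parts."""
        chunks = {name: split(data) for name, data in objects.items()}
        counts = Counter(label_of(c) for parts in chunks.values() for c in parts)
        for name, parts in chunks.items():
            order = []  # the index tag records where each part lives, in sequence
            for pos, chunk in enumerate(parts):
                label = label_of(chunk)
                if counts[label] > 1:
                    # Same part: keep a single copy in the redundant partition.
                    self.redundant.setdefault(label, chunk)
                    order.append(("redundant", label))
                else:
                    # Different part: store it in the total storage area.
                    key = (name, pos)
                    self.total[key] = chunk
                    order.append(("total", key))
            self.guide[name] = order

    def get(self, name: str) -> bytes:
        """Reassemble an object in the order recorded in its index tag."""
        areas = {"redundant": self.redundant, "total": self.total}
        return b"".join(areas[area][key] for area, key in self.guide[name])
```

Storing two objects that share a header and a trailer keeps each shared chunk once in `redundant`, while the chunks that differ land in `total`; calling `get` rebuilds each original from its index tag.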

As one embodiment of the invention, the index tag includes check information for checking the integrity of the unstructured data. The check information includes, but is not limited to, the MD5 value, SHA1 value and CRC32 value of the unstructured data, and the unstructured data must be verified against the check information recorded in the index tag whenever it is stored, consulted or used.

When an index tag is established for unstructured data, the check information of the unstructured data is calculated at the same time and added to the index tag. When the unstructured data is used, the same-part data and different-part data retrieved from the redundant partition and the total storage area are combined, and the result is verified against the check information recorded in the index tag. This ensures that the retrieved unstructured data is complete and error-free, and avoids the situation in which the combined data blocks differ from the original, so that the unstructured data cannot be restored to its original state and cannot be used normally.
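As a minimal sketch of this check, the three values named in the text can be computed with Python's standard `hashlib` and `zlib` modules. The function names and the dictionary layout of the check information are assumptions made for illustration; the text does not specify how the values are stored in the index tag.

```python
import hashlib
import zlib


def make_check_info(data: bytes) -> dict:
    """Compute the check information to record in the index tag."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "crc32": zlib.crc32(data) & 0xFFFFFFFF,  # unsigned 32-bit value
    }


def verify(data: bytes, check_info: dict) -> bool:
    """Return True when reassembled data matches the recorded check information."""
    return make_check_info(data) == check_info
```

A caller would run `make_check_info` once when the index tag is created and `verify` each time the data is reassembled from the redundant partition and the total storage area.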

As one embodiment of the invention, the index tag established for the unstructured data in step S1 includes type information, the redundant partition and the total storage area are each divided into sub-partitions corresponding to different types of unstructured data, and the data blocks obtained by analyzing and splitting the unstructured data are stored in the sub-partition corresponding to the recorded type information.

When an index tag is established for unstructured data, type information is added to the index tag according to the type of the unstructured data, such as picture, audio, video or text. The unstructured data is then analyzed and split into data blocks, and the data blocks are stored in the corresponding sub-partition of the redundant partition or the total storage area according to the type information recorded in the index tag. Unstructured data of the same type is therefore stored in the same partition, which improves the retrieval and reading speed of the stored unstructured data, supports its use, and shortens the response time when the data is used.
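The routing of data blocks to type-specific sub-partitions might look like the sketch below. Detecting the type from a file extension, the `TYPE_BY_EXTENSION` table, and the `PartitionedArea` class are hypothetical; the text only states that the type is recorded in the index tag and names pictures, audio, video and text as examples.

```python
import os
from collections import defaultdict

# Hypothetical mapping from file extension to the type names used in the text.
TYPE_BY_EXTENSION = {
    ".jpg": "picture", ".png": "picture",
    ".mp3": "audio", ".wav": "audio",
    ".mp4": "video",
    ".txt": "text",
}


def detect_type(name: str) -> str:
    """Derive type information from the file name (an assumed heuristic)."""
    extension = os.path.splitext(name)[1].lower()
    return TYPE_BY_EXTENSION.get(extension, "other")


class PartitionedArea:
    """A redundant partition or total storage area divided into sub-partitions."""

    def __init__(self):
        self.sub_partitions = defaultdict(dict)  # type -> {key: data block}

    def store(self, type_info: str, key, block: bytes) -> None:
        # The type recorded in the index tag selects the sub-partition.
        self.sub_partitions[type_info][key] = block

    def load(self, type_info: str, key) -> bytes:
        # Only the sub-partition for this type needs to be searched.
        return self.sub_partitions[type_info][key]
```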

As one embodiment of the invention, the index tag includes a main tag and sub tags. The main tag stores the index tags of all currently stored unstructured data; a sub tag is generated at each fixed time interval and stores the index tags of the unstructured data stored during the corresponding period.

Setting a main tag and sub tags gives the index tags established for the unstructured data backups that can be compared against each other. This prevents interference with an index tag from causing the stored unstructured data to become disordered or lost, which would affect its normal use. In addition, because the sub tags are established at fixed time intervals, an index tag is looked up in the sub tags first when unstructured data is used. This effectively reduces the number of index tags that must be searched and avoids the longer retrieval time and slower retrieval speed of searching the main tag directly. Because the sub tags are established in chronological order, they also make it easier to locate an index tag, which further speeds up retrieval and use of the unstructured data.
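One way to picture the main tag and the time-based sub tags is the sketch below. The `TagIndex` class, its method names, the one-hour default interval, and the assumption that object names are unique are all illustrative choices rather than details given in the text.

```python
INTERVAL_SECONDS = 3600  # assumed length of one sub tag period


class TagIndex:
    def __init__(self, interval: float = INTERVAL_SECONDS):
        self.interval = interval
        self.main = {}  # main tag: name -> index tag for everything stored
        self.subs = {}  # sub tags: period number -> {name: index tag}

    def add(self, name: str, tag: dict, stored_at: float) -> None:
        """Record an index tag in the main tag and in the sub tag of its period."""
        period = int(stored_at // self.interval)
        self.main[name] = tag
        self.subs.setdefault(period, {})[name] = tag

    def find(self, name: str, stored_at: float = None):
        """Search the small sub tag first, then fall back to the main tag."""
        if stored_at is not None:
            period = int(stored_at // self.interval)
            tag = self.subs.get(period, {}).get(name)
            if tag is not None:
                return tag
        return self.main.get(name)

    def inconsistent(self) -> list:
        """Cross-check the main tag against the sub tags and list mismatches."""
        merged = {}
        for sub in self.subs.values():
            merged.update(sub)
        names = set(self.main) | set(merged)
        return sorted(n for n in names if self.main.get(n) != merged.get(n))
```

Here `inconsistent` plays the role of the mutual comparison: any name it returns has an index tag that differs between the main tag and the sub tags and should be repaired before the data is used.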

As one embodiment of the invention, when the unstructured data is migrated, the index tag, the data in the redundant partition and the data in the total storage area are migrated in that order; the data blocks in the redundant partition and the total storage area are assigned migration numbers before they are migrated, and the migration number data is migrated after the index tag.

When unstructured data is migrated, the data blocks in the redundant partition and the total storage area are numbered, so the number of transmitted and untransmitted data blocks can be determined at any point during migration. Because all data blocks are numbered before migration, even if interference causes data blocks to be lost during migration, the positions of the lost blocks can be determined from their migration numbers and the blocks can be retransmitted. Migration can therefore be interrupted and resumed at any time and still complete normally. Moreover, when data blocks are lost in transmission, their positions can be determined from the migration numbers without searching the index tags from the beginning. This reduces the time and computing power spent searching index tags, speeds up migration of the unstructured data, and ensures the integrity of the migrated data.
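The numbering and retransmission logic can be sketched as follows. The function names, the `send` callback that returns `True` on success, and the in-memory dictionaries standing in for the source and target storage are assumptions made for the example.

```python
def number_blocks(redundant: dict, total: dict) -> dict:
    """Assign consecutive migration numbers, redundant partition first."""
    numbered = {}
    number = 0
    for area_name, area in (("redundant", redundant), ("total", total)):
        for key, block in area.items():
            numbered[number] = (area_name, key, block)
            number += 1
    return numbered


def missing_numbers(numbered: dict, received: dict) -> list:
    """Locate lost blocks by migration number, without touching the index tags."""
    return [n for n in numbered if n not in received]


def migrate(numbered: dict, received: dict, send) -> bool:
    """Send every block not yet received; return True once nothing is missing.

    Calling migrate again after an interruption resumes where it stopped,
    because only the missing numbers are retransmitted.
    """
    for n in missing_numbers(numbered, received):
        if send(n, numbered[n]):
            received[n] = numbered[n]
    return not missing_numbers(numbered, received)
```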

As one embodiment of the invention, the migrated unstructured data is verified at fixed time intervals during the migration process.

Interference during data migration is not under human control, so data blocks may be lost at any point in the transmission. By verifying the already migrated data against the check information and the migration numbers at fixed intervals during migration, lost or erroneous data among the unstructured data already transmitted can be detected promptly and retransmitted. This ensures the integrity of the unstructured data after migration, corrects errors that occur during migration quickly, and improves the efficiency of data migration.
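A periodic check of this kind could be sketched as below. Several details are assumptions: CRC32 stands in for the check information, `send` is a hypothetical transport callback that returns the bytes as they arrived (or `None` if the block was lost), `clock` is any monotonically increasing time source such as `time.monotonic`, and each bad block is retransmitted only once per check.

```python
import zlib


def find_bad(expected_crc: dict, received: dict, sent: list) -> list:
    """Return sent migration numbers that are missing or fail the CRC32 check."""
    return [n for n in sent
            if n not in received or zlib.crc32(received[n]) != expected_crc[n]]


def repair(source, expected_crc, received, sent, send) -> None:
    """Retransmit every lost or corrupted block once."""
    for n in find_bad(expected_crc, received, sent):
        result = send(n, source[n])
        if result is not None:
            received[n] = result


def migrate_with_checks(source: dict, send, clock, interval: float) -> dict:
    """Migrate numbered blocks, verifying what has arrived every `interval`."""
    expected_crc = {n: zlib.crc32(block) for n, block in source.items()}
    received, sent = {}, []
    last_check = clock()
    for n, block in source.items():
        result = send(n, block)
        if result is not None:
            received[n] = result
        sent.append(n)
        if clock() - last_check >= interval:
            repair(source, expected_crc, received, sent, send)
            last_check = clock()
    repair(source, expected_crc, received, sent, send)  # final check at the end
    return received
```

Checking only the blocks sent so far keeps each verification pass short, and errors are corrected while the migration is still running rather than after it finishes.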

As one embodiment of the invention, the unstructured data is divided into hot spot data and common data according to the number of uses and the usage time, and the hot spot data is transmitted preferentially during migration.

Hot spot data and common data are distinguished by the number of uses and the usage time of the stored unstructured data in daily use. Hot spot data has a high number of uses and a long usage time and is relatively important to users. When unstructured data is migrated, the hot spot data is therefore migrated first, so that it reaches the target storage area earlier than the common data. This shortens the time during which the hot spot data is tied up by the migration and ensures that users can continue to use it normally. In addition, because the transmission link may fluctuate over time, migrating the hot spot data early reduces the likelihood that it is damaged or lost during migration, which safeguards normal use of the stored data.
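The classification and ordering can be illustrated with the short sketch below. The `Usage` record, the two threshold constants, the reading of "usage time" as accumulated duration, and the rule that both thresholds must be met are assumptions; the text does not give concrete criteria.

```python
from dataclasses import dataclass

HOT_MIN_COUNT = 10       # assumed threshold for the number of uses
HOT_MIN_SECONDS = 600.0  # assumed threshold for accumulated usage time


@dataclass
class Usage:
    name: str
    use_count: int
    use_seconds: float


def is_hot(usage: Usage) -> bool:
    """Hot spot data has both a high use count and a long usage time."""
    return usage.use_count >= HOT_MIN_COUNT and usage.use_seconds >= HOT_MIN_SECONDS


def migration_order(usages: list) -> list:
    """Return names with hot spot data first.

    sorted() is stable, so the original order is kept within each group.
    """
    return [u.name for u in sorted(usages, key=lambda u: not is_hot(u))]
```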

As one embodiment of the invention, the hot spot data is stored in the same storage device or storage area after being migrated.

After the unstructured data has been migrated, the hot spot data is stored together in the same storage device or storage area, which changes the storage location of the corresponding unstructured data. This improves the access speed of the migrated hot spot data, reduces the delay users experience when using it, and improves the user experience.

The foregoing shows and describes the basic principles, principal features and advantages of the invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above; the embodiments and the description merely illustrate the principles of the invention, and various changes and modifications may be made without departing from its spirit and scope. The scope of the invention is defined by the appended claims and their equivalents.

Claims

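1. A method for storing, migrating and identifying unstructured data, characterized in that the method comprises the following steps:

S1: obtaining unstructured data, establishing an index tag for the unstructured data, and storing the index tag in a guide partition;

S2: analyzing the unstructured data for which the index tag has been established, determining and splitting the same part and the different parts of the unstructured data, and storing the split unstructured data in a storage area;

S3: storing the same part of the split unstructured data in a redundant partition of the storage area, and storing the different parts of the split unstructured data in a total storage area of the storage area;

wherein one copy of the part that is the same across different unstructured data is stored in the redundant partition, mutually independent labels are set for the multiple groups of same-part data from different sources stored in the redundant partition, and a backup of the redundant partition exists in the storage area.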

2. The method for storing, migrating and identifying unstructured data according to claim 1, wherein: the index tag includes check information for checking the integrity of the unstructured data, the check information includes, but is not limited to, the MD5 value, SHA1 value and CRC32 value of the unstructured data, and the unstructured data must be verified against the check information recorded in the index tag whenever it is stored, consulted or used.

3. The method for storing, migrating and identifying unstructured data according to claim 1, wherein: the index tag established for the unstructured data in step S1 includes type information, the redundant partition and the total storage area are each divided into sub-partitions corresponding to different types of unstructured data, and the data blocks obtained by analyzing and splitting the unstructured data are stored in the sub-partition corresponding to the recorded type information.

4. The method for storing, migrating and identifying unstructured data according to claim 1, wherein: the index tag includes a main tag and sub tags, the main tag stores the index tags of all currently stored unstructured data, a sub tag is generated at each fixed time interval, and each sub tag stores the index tags of the unstructured data stored during the corresponding period.

5. The method for storing, migrating and identifying unstructured data according to claim 2, wherein: when the unstructured data is migrated, the index tag, the data in the redundant partition and the data in the total storage area are migrated in that order, the data blocks in the redundant partition and the total storage area are assigned migration numbers before they are migrated, and the migration number data is migrated after the index tag.

6. The method for storing, migrating and identifying unstructured data according to claim 5, wherein: the migrated unstructured data is verified at fixed time intervals during the migration process.

7. The method for storing, migrating and identifying unstructured data according to claim 5, wherein: the unstructured data is divided into hot spot data and common data according to the number of uses and the usage time, and the hot spot data is transmitted preferentially during migration.

8. The method for storing, migrating and identifying unstructured data according to claim 7, wherein: the hot spot data is stored in the same storage device or storage area after being migrated.