CN111190992A

CN111190992A - Mass storage method and storage system for unstructured data

Info

Publication number: CN111190992A
Application number: CN201911257354.3A
Authority: CN
Inventors: 陈书平; 于长琦; 王绪繁; 陶俭; 陈竞翔; 姜志山; 王灿; 王玉宝
Original assignee: Huaneng Group Technology Innovation Center Co Ltd; Huaneng Information Technology Co Ltd
Current assignee: Huaneng Group Technology Innovation Center Co Ltd; Huaneng Information Technology Co Ltd
Priority date: 2019-12-10
Filing date: 2019-12-10
Publication date: 2020-05-22
Anticipated expiration: 2039-12-10
Also published as: CN111190992B

Abstract

The embodiment of the invention discloses a mass storage method and a storage system for unstructured data. The method comprises the following steps: dividing a cloud storage space into a plurality of distributed storage modules for storing different file types; dividing each distributed storage module into a plurality of sub-storage clusters by using a space simulation method, and setting the storage mode and the grid storage position of a data stream in the sub-storage clusters; setting a virtual channel between every two adjacent sub-storage clusters, and establishing matched transmission communication links between a front-end data source and the sub-storage clusters; and forming a storage implementation unit from a plurality of adjacent sub-storage clusters, and implementing fast storage by using the virtual channels of the same storage implementation unit. By adding virtual channels between storage units, a plurality of units awaiting storage serve as a storage buffer pool, which improves the effective data storage rate of the database, and each sub-storage cluster is monitored so that the clusters are fully utilized in sequence.

Description

Mass storage method and storage system for unstructured data

Technical Field

The embodiment of the invention relates to the technical field of mass storage, in particular to a mass storage method and a storage system of unstructured data.

Background

Data in computer information systems is divided into structured data and unstructured data. Unstructured data is data that has an irregular or incomplete structure, has no predefined data model, and is inconvenient to represent in the two-dimensional logical tables of a database. It includes office documents in all formats, text, pictures, XML, HTML, various types of reports, images, and audio/video information. Because unstructured data comes in a wide variety of formats and standards, it is technically more difficult to standardize and understand than structured information. Its storage, retrieval, distribution, and utilization therefore require more intelligent IT technologies, such as mass storage, intelligent retrieval, knowledge mining, content protection, and value-added development and utilization of information.

When unstructured data is stored in massive volumes, the explosive growth of the data slows the rate at which it can be written into storage, which seriously affects the timeliness of data storage and easily leads to data loss.

Disclosure of Invention

Therefore, the embodiment of the invention provides a mass storage method and a storage system for unstructured data. Virtual channels are added between storage units so that a plurality of units awaiting storage serve as a storage buffer pool, which improves the effective data storage rate of the database, and each sub-storage cluster is monitored so that the clusters are fully utilized in sequence. This solves the problems of data loss caused by data storage congestion and of low mass storage utilization in the prior art.

In order to achieve the above object, an embodiment of the present invention provides the following: a mass storage method and a storage system of unstructured data comprise the following steps:

step 100, dividing a cloud storage space into a plurality of distributed storage modules for storing different file types;

step 200, dividing each distributed storage module into a plurality of sub-storage clusters by using a space simulation method, and setting the storage mode of a data stream in the sub-storage clusters;

step 300, setting a virtual channel between every two adjacent sub-storage clusters, and establishing matched transmission communication links between a front-end data source and the sub-storage clusters;

step 400, forming a storage implementation unit from a plurality of adjacent sub-storage clusters, and implementing fast storage by using the virtual channels of the same storage implementation unit.

As a preferred scheme of the present invention, in step 200, the space simulation method divides the distributed storage module into a plurality of sub-storage clusters arranged as a three-dimensional matrix, and data streams of the same type are stored sequentially in sub-storage clusters at different three-dimensional positions.

As a preferred scheme of the present invention, in step 200, the specific implementation steps of setting the storage mode of the data stream in the sub-storage clusters according to the distribution characteristics of the sub-storage clusters include:

constructing a three-dimensional rectangular coordinate system along three mutually perpendicular edges of the three-dimensionally distributed sub-storage clusters;

marking the three-dimensional coordinates of each sub-storage cluster in the three-dimensional rectangular coordinate system;

storing the data streams layer by layer, from the upper layer to the lower layer, and within each layer of sub-storage clusters row by row, from the front row to the rear row.

As a preferred scheme of the present invention, in step 300, the virtual channels are arranged between sub-storage clusters in the same layer and between two adjacent layers of sub-storage clusters in the three-dimensional coordinate system, so that all the sub-storage clusters are connected end to end for storage through the virtual channels, and data streams are stored in the sub-storage clusters sequentially along an "S"-shaped path.

As a preferred scheme of the present invention, in step 400, the storage implementation unit uses one of the sub-storage clusters as a main storage object, and uses the other sub-storage clusters as buffer pools.

As a preferred scheme of the present invention, in step 400, the specific implementation steps of implementing fast storage through the virtual channel within the same storage implementation unit include:

step 401, connecting the import port of the main storage object in a storage implementation unit to the transmission communication link, and storing front-end data in the main storage object through that import port;

step 402, monitoring in real time the amount of data backlogged on the transmission communication link, and sequentially opening the other sub-storage clusters of the same storage implementation unit, which serve as buffer pools, according to the amount of backlogged data;

step 403, importing the front-end data into the main storage object through the virtual channel;

step 404, monitoring the remaining capacity of the main storage object of the storage implementation unit in real time by using a memory monitor, and switching data storage to the main storage object of the next storage implementation unit according to that remaining capacity.

As a preferred scheme of the present invention, a sub-storage cluster serving as a buffer pool in the previous storage implementation unit is the main storage object of the next storage implementation unit.

As a preferred scheme of the present invention, in step 402, the end of the transmission communication link that connects to the storage implementation unit is provided with a plurality of segmented link ends, and the segmented link ends are provided with storage ports in one-to-one correspondence with the sub-storage clusters in the storage implementation unit; the segmented link ends are connected to the buffer pools in order from nearest to farthest from the main storage object, and are disconnected from the sub-storage clusters serving as buffer pools in order from farthest to nearest.

In addition, the present invention also provides a mass storage system for unstructured data, which is characterized by comprising:

the cloud storage space differentiation module is used for dividing the cloud storage space into a plurality of distributed storage modules which respectively store different file types;

the storage module splitting unit is used for splitting the distributed storage module into sub storage clusters distributed in a three-dimensional matrix;

the virtual channel unit is used for carrying out data intercommunication on two adjacent sub-storage clusters;

and the storage implementation unit is used for dividing the combination of the plurality of sub-storage clusters into a main storage object and other buffer pools.

As a preferred aspect of the present invention, the virtual channel unit adds a data buffer area for reducing data warehousing pressure to each sub-storage cluster, and the data stream is transferred from an adjacent sub-storage cluster to the sub-storage cluster that is storing data.

The embodiment of the invention has the following advantages:

(1) In the process of storing mass data, to avoid high storage pressure and low storage speed, the invention adopts an asynchronous storage mode: all the sub-storage clusters are interconnected through virtual channels, and a plurality of units awaiting storage serve as a storage buffer pool. This improves the effective data storage rate of the database and avoids data loss caused by data storage congestion.

(2) According to the invention, each sub-storage cluster is monitored so that the clusters are fully utilized in sequence; all the sub-storage clusters are used in order as required, which avoids wasting storage space.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It should be apparent that the drawings in the following description are merely exemplary, and that other embodiments can be derived from the drawings provided by those of ordinary skill in the art without inventive effort.

FIG. 1 is a block diagram of a mass storage system according to an embodiment of the present invention;

FIG. 2 is a block diagram of a data transmission interactive system according to an embodiment of the present invention;

FIG. 3 is a schematic flow chart of a mass storage method according to an embodiment of the present invention;

FIG. 4 is a flowchart illustrating a data transmission interaction method according to an embodiment of the present invention.

In the figure:

1-a cloud storage space differentiation module; 2-a storage module splitting unit; 3-a virtual channel unit; 4-a storage implementation unit; 5-an interaction recording unit; 6-an interactive communication link unit; 7-a data transmission link unit.

Detailed Description

The present invention is described below by way of particular embodiments, and other advantages and features of the invention will become apparent to those skilled in the art from the following disclosure. The described embodiments are merely exemplary of the invention, and the invention is not limited to the particular embodiments disclosed. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.

Example 1

As shown in FIG. 1, the invention provides a mass storage method and a storage system for unstructured data. The cloud storage space used for storing mass data is divided into a plurality of distributed storage modules according to the types of unstructured data, and each distributed storage module is then divided into a plurality of sub-storage clusters arranged in three dimensions, so that different types of data are classified and stored separately, which facilitates later query interaction.

In addition, in the process of storing mass data, to avoid high storage pressure and low storage speed, an asynchronous storage mode is adopted: all the sub-storage clusters are interconnected through virtual channels, and when data is being stored in one sub-storage cluster, the sub-storage clusters connected to it serve as a storage buffer pool. This improves the effective data storage rate of the database and avoids data loss caused by data storage congestion.

Meanwhile, for data interaction with the storage system, an interaction record pool is added to speed up interaction. The interaction record pool tracks how often the same data is queried, the set of repeated request statements, and where in the sub-storage clusters the data queried by those statements is located. When a client next sends a data interaction request, the request is compared and looked up directly in the interaction record pool and the queried data is returned quickly from the sub-storage clusters, which avoids the slow responses caused by screening data across a huge mass storage system.

A mass storage system for unstructured data, comprising:

the cloud storage space differentiation module 1 is used for dividing a cloud storage space into a plurality of distributed storage modules which respectively store different file types;

the storage module splitting unit 2 is used for splitting the distributed storage module into sub storage clusters distributed in a three-dimensional matrix;

the virtual channel unit 3 is used for data intercommunication between two adjacent sub-storage clusters, wherein the virtual channel unit 3 adds, for each sub-storage cluster, a data buffer area for reducing data storage pressure, and the data stream is transferred from the adjacent sub-storage clusters to the sub-storage cluster that is storing data;

and the storage implementation unit 4 is used for dividing a combination of a plurality of sub-storage clusters into a main storage object and buffer pools.

The working principle and working mode of the mass storage system will be described in detail in the mass storage method.

As shown in FIG. 3, the storage method specifically includes the following steps:

Step 100, dividing a cloud storage space into a plurality of distributed storage modules for storing different file types.
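As a purely illustrative sketch of step 100, the following Python fragment routes files to distributed storage modules by file type. The extension table, the category names, and the functions `module_for` and `partition` are assumptions introduced here for illustration; the embodiment does not specify how file types are identified.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to a file-type category.
# Each category stands for one distributed storage module.
TYPE_BY_EXTENSION = {
    ".docx": "document", ".txt": "document",
    ".xml": "markup", ".html": "markup",
    ".jpg": "image", ".png": "image",
    ".mp3": "audio_video", ".mp4": "audio_video",
}


def module_for(filename: str) -> str:
    """Return the name of the storage module that should hold this file."""
    extension = PurePosixPath(filename).suffix.lower()
    return TYPE_BY_EXTENSION.get(extension, "other")


def partition(filenames):
    """Group file names by the storage module they are routed to."""
    modules = defaultdict(list)
    for name in filenames:
        modules[module_for(name)].append(name)
    return dict(modules)
```

Files with an unrecognized extension fall into a catch-all module in this sketch; a real system might instead inspect file contents.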

Step 200, dividing the distributed storage module into a plurality of sub-storage clusters by using a space simulation method, and setting a storage mode of a data stream in the sub-storage clusters.

The space simulation method divides the distributed storage module into a plurality of sub-storage clusters arranged as a three-dimensional matrix, and data streams of the same type are stored sequentially in sub-storage clusters at different three-dimensional positions.

The specific implementation steps of setting the storage mode of the data stream in the sub-storage clusters according to the distribution characteristics of the sub-storage clusters are as follows:

(1) constructing a three-dimensional rectangular coordinate system along three mutually perpendicular edges of the three-dimensionally distributed sub-storage clusters;

(2) marking the three-dimensional coordinates of each sub-storage cluster in the three-dimensional rectangular coordinate system;

(3) storing the data streams layer by layer, from the upper layer to the lower layer, and within each layer of sub-storage clusters row by row, from the front row to the rear row.

When data is stored in the sub-storage clusters, the layers may be filled from the upper layer to the lower layer or from the lower layer to the upper layer, and within each layer the sub-storage clusters may be filled row by row or column by column; the storage mode is not specifically limited here.
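The storage modes described above can be sketched as a generator of sub-storage cluster coordinates. This is only an illustration under stated assumptions: the function name `fill_order`, its parameters, and the convention that x is the column, y is the row, and z is the layer (with z = 0 taken as the upper layer) are not part of the embodiment.

```python
def fill_order(layers, rows, cols, by="row", reverse_layers=False):
    """Yield (x, y, z) coordinates of sub-storage clusters in storage order.

    Layers are visited one after another; inside a layer the clusters are
    visited row by row (by="row") or column by column (by="column").
    """
    layer_indices = range(layers - 1, -1, -1) if reverse_layers else range(layers)
    for z in layer_indices:
        if by == "row":
            for y in range(rows):
                for x in range(cols):
                    yield (x, y, z)
        elif by == "column":
            for x in range(cols):
                for y in range(rows):
                    yield (x, y, z)
        else:
            raise ValueError("by must be 'row' or 'column'")
```

For a 2 x 2 x 2 matrix, `list(fill_order(2, 2, 2))` visits the four clusters of layer 0 row by row before moving to layer 1.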

Step 300, setting a virtual channel between every two adjacent sub-storage clusters, and establishing matched transmission communication links between a front-end data source and the sub-storage clusters.

Although the storage mode of step 200 is not limited, once it has been defined, the virtual channels across each whole layer of sub-storage clusters are arranged differently to match it.

Within the three-dimensional coordinate system, virtual channels are arranged between sub-storage clusters in the same layer: they may connect the sub-storage clusters within each row or within each column, and sub-storage clusters in two adjacent rows or two adjacent columns are also connected through virtual channels.

Similarly, virtual channels are arranged between two adjacent layers of sub-storage clusters, so that all the sub-storage clusters are connected end to end for storage through the virtual channels, and data streams are stored in the sub-storage clusters sequentially along an S-shaped path. This addresses the problem of low storage efficiency within the three-dimensional matrix of sub-storage clusters.
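One possible reading of the S-shaped path is a serpentine traversal in which every consecutive pair of sub-storage clusters is adjacent, so a virtual channel always exists between them. The sketch below is an assumption-laden illustration, not the patented arrangement: the function name `serpentine_order` and the choice to reverse the in-layer path on alternate layers are introduced here only so that each layer starts directly above or below the cluster where the previous layer ended.

```python
def serpentine_order(layers, rows, cols):
    """Return (x, y, z) coordinates along an S-shaped path through the matrix.

    Rows alternate direction inside a layer, and alternate layers walk the
    in-layer path backwards, so consecutive clusters are always neighbors.
    """
    layer_path = []
    for y in range(rows):
        # Even rows run left to right, odd rows run right to left.
        xs = range(cols) if y % 2 == 0 else range(cols - 1, -1, -1)
        layer_path.extend((x, y) for x in xs)

    order = []
    for z in range(layers):
        path = layer_path if z % 2 == 0 else layer_path[::-1]
        order.extend((x, y, z) for x, y in path)
    return order
```

Because each step moves to a neighboring cluster, data that overflows one sub-storage cluster can continue into the next one over a single virtual channel.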

How the virtual channels are used to implement fast storage during data storage is described in detail in step 400.

Step 400, forming a storage implementation unit from a plurality of adjacent sub-storage clusters, and implementing fast storage by using the virtual channels of the same storage implementation unit.

The storage implementation unit takes one of the sub-storage clusters as the main storage object and the other sub-storage clusters as buffer pools, and the number of sub-storage clusters in a storage implementation unit can be customized as required. That is, when data is being stored in the main storage object and the storage speed drops, the data can first be transferred into the sub-storage clusters serving as buffer pools and then moved into the main storage object through the virtual channels between the sub-storage clusters, which achieves asynchronous, fast storage.

The specific implementation steps of implementing the fast storage through the virtual channel in the same storage implementation unit are as follows:

(I) Connecting the import port of the main storage object in the storage implementation unit to the transmission communication link, and storing front-end data in the main storage object through that import port.

(II) Monitoring in real time the amount of data backlogged on the transmission communication link, and sequentially opening the other sub-storage clusters of the same storage implementation unit, which serve as buffer pools, according to the amount of backlogged data.

The end of the transmission communication link that connects to the storage implementation unit is provided with a plurality of segmented link ends, and the segmented link ends are provided with storage ports in one-to-one correspondence with the sub-storage clusters in the storage implementation unit. The segmented link ends are connected to the buffer pools in order from nearest to farthest from the main storage object, and are disconnected from the sub-storage clusters serving as buffer pools in order from farthest to nearest.

(III) Importing the front-end data into the main storage object through the virtual channel.

According to steps I, II, and III, when storage efficiency drops at the import port of the main storage object, data is diverted into the other sub-storage clusters associated with the main storage object for buffering, which relieves the storage pressure on that import port; the data in the sub-storage clusters serving as buffer pools then enters the main storage object asynchronously through the virtual channels.

When the pressure on the import port of the main storage object falls, the transmission communication link is disconnected from the sub-storage clusters serving as buffer pools, so that data is stored mainly in chronological order through the import port of the main storage object, which facilitates later query and data comparison.

The segmented link ends are connected to the sub-storage clusters serving as buffer pools in order from nearest to farthest from the main storage object and disconnected from them in order from farthest to nearest. This prevents data from being scattered in disorder across the buffer pools, which would leave the storage order completely scrambled by the time each main storage object has been filled.
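The near-to-far connection and far-to-near disconnection described above behave like a stack. The following sketch illustrates that behavior under assumptions that are not in the embodiment: the class name `SegmentedLink`, the parameter `step` (the amount of backlog that justifies opening one more buffer pool), and the rule of one buffer pool per `step` units of backlog are all hypothetical.

```python
class SegmentedLink:
    """Tracks which sub-storage clusters are connected to the link."""

    def __init__(self, main, buffers_near_to_far, step):
        self.main = main
        self.buffers = list(buffers_near_to_far)  # ordered nearest first
        self.step = step      # hypothetical backlog size per extra buffer pool
        self.open = []        # currently connected buffer pools
        self.events = []      # log of ("open" | "close", cluster)

    def update(self, backlog):
        """Open or close buffer pools to match the current backlog size."""
        wanted = min(len(self.buffers), backlog // self.step)
        while len(self.open) < wanted:
            cluster = self.buffers[len(self.open)]  # next nearest pool
            self.open.append(cluster)
            self.events.append(("open", cluster))
        while len(self.open) > wanted:
            cluster = self.open.pop()               # farthest open pool
            self.events.append(("close", cluster))
        # The main storage object stays connected throughout.
        return [self.main] + self.open
```

A rising backlog opens cluster 2 before cluster 3, and a falling backlog closes cluster 3 before cluster 2, which keeps buffered data close to the main storage object.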

(IV) Monitoring the remaining capacity of the main storage object of the storage implementation unit in real time by using a memory monitor, and switching data storage to the main storage object of the next storage implementation unit according to that remaining capacity.

A sub-storage cluster serving as a buffer pool in the previous storage implementation unit is the main storage object of the next storage implementation unit.

For example, suppose a row contains six sub-storage clusters and every three sub-storage clusters form one storage implementation unit. The storage implementation units then consist of cluster 1, cluster 2, and cluster 3; cluster 2, cluster 3, and cluster 4; cluster 3, cluster 4, and cluster 5; and so on. Cluster 2 is thus a buffer pool of the first storage implementation unit and also the main storage object of the second. While data is being stored in cluster 1, the port of cluster 1 stays connected to the transmission communication link, and whether cluster 2 and cluster 3 are connected to the link depends on the storage pressure at the port of cluster 1. Once the capacity of cluster 1 is used up, data is stored in cluster 2 instead: the port of cluster 2 stays connected to the transmission communication link, and whether cluster 3 and cluster 4 are connected depends on the storage pressure at the port of cluster 2, and so on.
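The overlapping units in this example amount to a sliding window over the row of sub-storage clusters. The sketch below is illustrative only; the function names `storage_units` and `main_and_buffers` are hypothetical, and because the source does not say how the last clusters of a row are handled, the sketch simply stops at the last full window.

```python
def storage_units(clusters, unit_size):
    """Return overlapping storage implementation units as tuples."""
    last_start = len(clusters) - unit_size
    return [tuple(clusters[i:i + unit_size]) for i in range(last_start + 1)]


def main_and_buffers(unit):
    """Split a unit into its main storage object and its buffer pools."""
    return unit[0], list(unit[1:])


units = storage_units([1, 2, 3, 4, 5, 6], unit_size=3)
for unit in units:
    main, buffers = main_and_buffers(unit)
    print(f"main storage object: cluster {main}, buffer pools: {buffers}")
```

Each unit's first buffer pool becomes the main storage object of the following unit, matching the relationship between cluster 2 and the first two units in the example.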

Therefore, in the process of storing mass data, to avoid high storage pressure and low storage speed, an asynchronous storage mode is adopted in which all the sub-storage clusters are interconnected through virtual channels. This improves the effective data storage rate of the database and avoids data loss caused by data storage congestion. At the same time, each sub-storage cluster is monitored so that the clusters are fully utilized in sequence, which avoids wasting storage space.

Example 2

As is known, once mass data has been stored, the sheer size of the storage system means that storage space may be incompletely utilized during later data transmission, and when a user sends a query request from a client, lengthy screening is needed to find the corresponding data.

As shown in FIG. 2, the data transmission interactive system includes:

the cloud storage space differentiation module 1 is used for dividing a cloud storage space into a plurality of distributed storage modules which respectively store different file types;

the interaction recording unit 5 is used for storing the data in the sub-storage clusters that is queried most often and for storing the request statement set;

the interactive communication link unit 6 is used for constructing an interaction sequence that responds to client request statements;

and the data transmission link unit 7, wherein the data transmission link unit 7 can provide a plurality of links between the front-end data source and the plurality of sub-storage clusters, whereas the interactive communication link unit 6 has one and only one link between the front-end data source and the plurality of sub-storage clusters.

As shown in FIG. 4, the data transmission interactive system is implemented by the following steps:

step 100, dividing a cloud storage space into a plurality of distributed storage modules according to the types of unstructured data, and dividing the distributed storage modules into a plurality of sub-storage clusters by using a space simulation method.

Step 200, setting a virtual channel between every two adjacent sub-storage clusters, and establishing matched transmission communication links between a front-end data source and the sub-storage clusters.

The data transmission process is as described in Example 1. Transmitting and storing data through the virtual channels reduces the pressure of mass data transmission on the one hand, and on the other hand ensures that each sub-storage cluster is fully utilized so that no storage space is wasted.

After the data has been saved, the volume of data in the storage system is huge. How to interact and respond quickly during data interaction is described in step 300 and step 400.

Step 300, requesting space from the cloud storage space to create an interaction record pool, and backing up data from the sub-storage clusters into the interaction record pool according to the counted number of client requests, wherein the backup data in the interaction record pool is identical to the data in the sub-storage clusters.

The same front-end data source can be matched with a plurality of sub-storage clusters, so the storage space can be expanded continuously for open-ended mass storage. The number of interaction record pools equals the number of categories of front-end data sources.

The main function of the interaction record pool is to make it convenient for a user at a client to query data held at the cloud storage back end. To avoid operational complexity, only one interaction record pool is provided for each front-end data source. In big data processing systems, typically no more than 20 percent of the saved data is actually used, and the same kinds of data tend to be accessed repeatedly.

Based on this observation, this embodiment tracks the data query process of each front-end data source, including the request statements sent by clients and the specific data they finally retrieve. It determines in real time which specific data is queried most often and which request statements are sent most often, and backs up the most frequently queried data into the interaction record pool.

The specific implementation process is as follows:

A. Counting the number of client requests, and temporarily storing the data that clients request most often in the interaction record pool. The specific implementation steps are as follows:

B. acquiring the request statements with which clients query data in the sub-storage clusters;

C. counting the number of times each distinct request statement is sent, and determining the coordinates of the sub-storage cluster where the data responding to each request statement is located;

D. storing data in the interaction record pool in descending order of how often clients select it, and likewise storing the request statement set in descending order of query frequency;

E. storing in the interaction record pool, for each request statement in the request statement set, the set of coordinates of the sub-storage clusters where its data is located.
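Steps B to E can be sketched as a frequency count over a request log. The following is a minimal illustration under stated assumptions: the function name `build_pool_index`, the log format of (request statement, cluster coordinate) pairs, and the `capacity` limit on how many statements are kept are all hypothetical and not taken from the embodiment.

```python
from collections import Counter, defaultdict


def build_pool_index(request_log, capacity):
    """Summarize a request log for the interaction record pool.

    request_log: iterable of (statement, coord) pairs, where coord is the
    (x, y, z) coordinate of the sub-storage cluster that answered the request.
    Returns the most frequent statements in descending order of frequency and
    the coordinate set recorded for each of them.
    """
    counts = Counter()
    coords = defaultdict(set)
    for statement, coord in request_log:
        counts[statement] += 1
        coords[statement].add(coord)

    hot_statements = [s for s, _ in counts.most_common(capacity)]
    coordinate_sets = {s: coords[s] for s in hot_statements}
    return hot_statements, coordinate_sets
```

The returned statement list corresponds to the request statement set of step D, and the coordinate sets correspond to step E.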

That is, the request statement sent by the client is first compared with the names of the specific data backed up in the pool; if they match, the data is found quickly in the interaction record pool without searching the huge mass data system, so that the client request is answered quickly.

If the specific data is not found in the data set of the interaction record pool, the request statement is compared in real time against the request statement set. Once a match is found, the sub-storage clusters relevant to the request statement are screened out in one pass through the sub-storage cluster coordinate set, the data matching the request statement is searched for within those specific sub-storage clusters, and the specific data is finally retrieved.

Step 400, constructing a bidirectional interactive communication link according to the communication paths among the client, the interaction record pool, and the sub-storage clusters.

When the client requests data interaction, the request statement is first compared against the backup data in the interaction record pool;

second, the request statement is compared against the request statement set in the interaction record pool, and the specific data is queried within the coordinate set of the sub-storage clusters associated with the matching request statement;

and finally, if neither comparison succeeds, the data responding to the request statement is queried across all the sub-storage clusters.
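The three comparison stages above form a tiered lookup. The sketch below illustrates the idea with a deliberately simplified data model that is an assumption of this example, not of the embodiment: each sub-storage cluster is a dictionary keyed by request statement, and the class name `InteractionRecordPool` and its attributes are hypothetical.

```python
class InteractionRecordPool:
    """Tiered lookup: backup data, then coordinate sets, then a full scan."""

    def __init__(self, clusters):
        self.clusters = clusters          # {coord: {statement: data}}
        self.backup = {}                  # statement -> backed-up data
        self.coords_by_statement = {}     # statement -> set of coords

    def query(self, statement):
        """Return (data, stage) where stage names the tier that answered."""
        # Stage 1: compare against the backup data held in the pool.
        if statement in self.backup:
            return self.backup[statement], "backup"

        # Stage 2: search only the clusters listed in the coordinate set.
        for coord in sorted(self.coords_by_statement.get(statement, ())):
            data = self.clusters.get(coord, {}).get(statement)
            if data is not None:
                return data, "coordinates"

        # Stage 3: fall back to scanning every sub-storage cluster.
        for contents in self.clusters.values():
            if statement in contents:
                return contents[statement], "full scan"
        return None, "miss"
```

Only the third stage touches every sub-storage cluster, so frequently repeated requests avoid the cost of screening the whole storage system.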

In summary, the interaction record pool tracks how often the same data is queried, the set of repeated request statements, and where in the sub-storage clusters the data queried by those statements is located. When a client next sends a data interaction request, the request is compared and looked up directly in the interaction record pool and the queried data is returned quickly from the sub-storage clusters, which avoids the slow responses caused by screening data across a huge mass storage system.

In addition, as a feature of the present invention, backup data in the interaction record pool needs to be selectively deleted at regular intervals so that the pool keeps spare space in reserve. The criteria for selective deletion are as follows: first, backup data is deleted in order of query interaction time; then, specific backup data with a low query interaction frequency is selected and deleted.
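The two deletion criteria can be sketched as a two-pass selection. This illustration rests on several assumptions that the source does not state: that "order of query interaction time" means least recently queried first, that staleness and low frequency are decided by the hypothetical thresholds `max_age` and `min_count`, and that each backup entry records its last query time and query count.

```python
def select_for_deletion(entries, now, max_age, min_count):
    """Return names of backup entries to delete, in deletion order.

    entries: {name: (last_query_time, query_count)}
    Pass 1 picks entries not queried within max_age, oldest first (assumed).
    Pass 2 picks remaining entries queried fewer than min_count times.
    """
    stale = sorted(
        (name for name, (last, _) in entries.items() if now - last > max_age),
        key=lambda name: entries[name][0],
    )
    stale_set = set(stale)
    rare = [
        name
        for name, (_, count) in entries.items()
        if name not in stale_set and count < min_count
    ]
    return stale + rare
```

Entries that are both recent and frequently queried survive both passes and stay in the interaction record pool.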

Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to one skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims

1. A mass storage method and a storage system of unstructured data are characterized by comprising the following steps:

step 200, dividing the distributed storage module into a plurality of sub-storage clusters by using a space simulation method, and setting a storage mode of a data stream in the sub-storage clusters and a grid storage position;

2. The method according to claim 1, wherein in step 200, the spatial simulation method divides the distributed storage module into a plurality of the sub-storage clusters distributed in three-dimensional manner according to a three-dimensional matrix, and data streams of a same type are sequentially stored in the sub-storage clusters at different three-dimensional positions.

3. The method according to claim 2, wherein in step 200, the specific implementation step of setting the storage manner of the data stream in the sub-storage cluster according to the distribution characteristics of the sub-storage cluster is as follows:

4. The method according to claim 1, wherein in step 300, the virtual channels are disposed between the sub-storage clusters in the same layer and between two adjacent layers of sub-storage clusters in the three-dimensional coordinate system, the sub-storage clusters as a whole are connected end to end for storage through the virtual channels, and data streams are stored in the sub-storage clusters sequentially along an "S"-shaped path.

5. The method according to claim 1, wherein in step 400, the storage implementation unit uses one of the sub-storage clusters as a main storage object, and uses the other sub-storage clusters as buffer pools.

6. The method according to claim 5, wherein in step 400, the specific implementation steps of implementing fast storage in the same storage implementation unit through the virtual channel include:

7. The method of claim 6, wherein a sub-storage cluster serving as a buffer pool in the previous storage implementation unit is the main storage object of the next storage implementation unit.

8. The method according to claim 7, wherein in step 402, the connection end of the transmission communication link and the storage implementation unit is provided with a plurality of segment link ends, each segment link end is provided with a storage port corresponding to a sub-storage cluster in the storage implementation unit, the segment link ends are connected with the buffer pool in the order from near to far from the main storage object, and the segment link ends are disconnected from the sub-storage cluster in the buffer pool in the order from far to near from the main storage object.

9. Mass storage system for unstructured data according to any of the claims 1-8, characterized in that it comprises:

the cloud storage space differentiation module (1) is used for dividing a cloud storage space into a plurality of distributed storage modules which respectively store different file types;

the storage module splitting unit (2) is used for splitting the distributed storage module into sub storage clusters distributed in a three-dimensional matrix;

the virtual channel unit (3) is used for carrying out data intercommunication on two adjacent sub-storage clusters;

and the storage implementation unit (4) is used for dividing the combination of the plurality of sub storage clusters into a main storage object and other buffer pools.

10. A mass storage system for unstructured data according to claim 9, wherein the virtual channel unit (3) adds, for each sub-storage cluster, a data buffer area for reducing data storage pressure, and the data stream is transferred from the adjacent sub-storage clusters to the sub-storage cluster that is storing data.