CN116561230A

CN116561230A - Distributed storage and retrieval system based on cloud computing

Info

Publication number: CN116561230A
Application number: CN202310828240.XA
Authority: CN
Inventors: 袁庆伟
Original assignee: Changtong Intelligent Shenzhen Co ltd
Current assignee: Changtong Intelligent Shenzhen Co ltd
Priority date: 2023-07-07
Filing date: 2023-07-07
Publication date: 2023-08-08
Anticipated expiration: 2043-07-07
Also published as: CN116561230B

Abstract

The invention discloses a distributed storage and retrieval system based on cloud computing. It relates to the technical field of cloud computing and aims to solve the problems of unstable storage and low retrieval efficiency of cloud computing data. A suitable storage space is selected according to a calculation result, which improves the stability of compressed-data storage to the greatest extent and effectively improves the utilization rate of the storage space. Clustering the data segments further improves data accuracy: the two data segments with the smallest average association value are selected for merging, which reduces the complexity of the data segments during text extraction and improves accuracy. The distance between each word vector in a data segment and the standard word vector corresponding to that segment is calculated, and the keyword corresponding to the word vector with the smallest distance is screened out, so that the keywords in the data segment are obtained more accurately and the efficiency of later data retrieval is further improved.

Description

Distributed storage and retrieval system based on cloud computing

Technical Field

The invention relates to the technical field of cloud computing, in particular to a distributed storage and retrieval system based on cloud computing.

Background

Cloud computing breaks a huge data-computing task into numerous small programs through a network "cloud"; a system of multiple servers then processes and analyzes these small programs and returns the results to the user.

Chinese patent publication CN113535715A discloses an intelligent education storage system based on cloud computing. It classifies input data into labels, inputs the labels into storage library III, classifies and stores the data according to the statistics of the cloud computing classification system, and updates the labels, which improves retrieval accuracy and speed and addresses the data storage problem. However, the following problems exist in actual operation:

1. The data is not optimized, so useless data cannot be cleared in time and storage space is wasted during storage.

2. The storage space region is not selected according to the capacity and length of the data, so the data may be too large for the storage space and the two do not match.

3. Feature words of the text in the stored data are not determined, so the keywords of the text data are determined inaccurately and later data retrieval takes too long.

Disclosure of Invention

The invention aims to provide a distributed storage and retrieval system based on cloud computing that solves the problems in the prior art. The system selects a suitable storage space according to a calculation result, which improves the stability of compressed-data storage to the greatest extent and effectively improves the utilization rate of the storage space. Clustering the data segments further improves data accuracy: the two data segments with the smallest average association value are selected for merging, which reduces the complexity of the data segments during text extraction and improves accuracy. The system also calculates the distance between each word vector in a data segment and the standard word vector corresponding to that segment and screens out the keyword corresponding to the word vector with the smallest distance, so that the keywords in the data segment are acquired more accurately and the efficiency of later data retrieval is further improved.

In order to achieve the above purpose, the present invention provides the following technical solutions:

a cloud computing-based distributed storage and retrieval system, comprising:

a cloud computing data acquisition unit configured to:

according to different transmission terminals, respectively receiving different cloud computing data;

when the cloud computing data is received, a transmission channel with corresponding capacity is automatically adapted according to the flow of the cloud computing data;

the cloud computing data processing unit is used for:

based on the data acquired in the cloud computing data acquisition unit, uniformly acquiring the data, dividing the acquired data into a plurality of data segments, respectively performing redundancy processing on the data in the plurality of data segments, compressing the data after the processing is finished, and marking the compressed data as compressed data;

the cloud computing data storage unit is used for:

based on the compressed data acquired in the cloud computing data processing unit, respectively acquiring the lengths of the compressed data, and acquiring the space capacity data to be stored of the compressed data after the acquisition is completed;

extracting parameters of compressed data length data and parameters of space capacity data respectively, carrying out corresponding calculation on the parameters after extracting the parameters, and judging a storage area of the compressed data according to a calculation result;

the corresponding calculation formula of the parameters is as follows:

[formula image not reproduced in the source text]

wherein x is a parameter of compressed data length data, y is a parameter of space capacity data, a is a first storage area, b is a second storage area, and c is a third storage area;

according to the result of comparing the parameter of the compressed data length data with the parameter of the space capacity data, storing the compressed data in the first storage area, the second storage area, or the third storage area;

a stored data dictionary analyzing unit configured to:

based on the compressed data in the storage areas acquired by the cloud computing data storage unit, the compressed data in different storage areas are respectively subjected to keyword acquisition of the data, and the keywords in the data are classified according to the word attributes of the keywords after the keywords are acquired.

Preferably, the cloud computing data processing unit includes:

a data segmentation module for:

dividing the acquired cloud computing data into a plurality of segments of the same length, and labeling each segment with a unique numeric code after division;

the segment data redundancy processing module is used for:

based on the uniquely coded and labeled data acquired in the data segmentation module, performing data deduplication on each of the data segments;

the data deduplication overlays data segment on data segment and, after the overlay, removes the repeated data and the useless data in the data segments, where useless data is defined by a data model entered in the database in advance: if a data segment contains data consistent with the data model, that data is useless data.
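The segmentation and deduplication steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the ID scheme (running index plus a short hash), and the choice to compare whole segments against a set of known useless patterns are all hypothetical, since the source does not specify how segments are overlaid or how the data model is matched.

```python
import hashlib


def split_into_segments(data: bytes, segment_len: int) -> list[tuple[str, bytes]]:
    """Split data into fixed-length segments and label each with a unique ID.

    The final segment may be shorter when len(data) is not a multiple
    of segment_len.
    """
    segments = []
    for index, start in enumerate(range(0, len(data), segment_len)):
        chunk = data[start:start + segment_len]
        # Assumed ID scheme: running index plus a short content hash.
        seg_id = f"{index:06d}-{hashlib.sha256(chunk).hexdigest()[:8]}"
        segments.append((seg_id, chunk))
    return segments


def deduplicate(segments: list[tuple[str, bytes]],
                useless_patterns: set[bytes]) -> list[tuple[str, bytes]]:
    """Drop segments that repeat an earlier segment or match a useless pattern."""
    seen: set[bytes] = set()
    kept = []
    for seg_id, chunk in segments:
        if chunk in useless_patterns or chunk in seen:
            continue  # repeated or useless data is removed
        seen.add(chunk)
        kept.append((seg_id, chunk))
    return kept


if __name__ == "__main__":
    segs = split_into_segments(b"AAAABBBBAAAA0000CC", 4)
    print([chunk for _, chunk in deduplicate(segs, {b"0000"})])
```

Working at whole-segment granularity keeps the sketch short; a finer-grained variant would scan inside each segment for the useless patterns.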

Preferably, the cloud computing data processing unit includes:

the redundant data compression module is used for:

based on the deduplicated data segments obtained from the segment data redundancy processing module, obtaining the number of data segments, and generating the same number of compression threads;

after the compression threads are generated, importing each data segment into its own compression thread;

and cyclically compressing each data segment in its compression thread, and finally marking the cyclically compressed data as compressed data.
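One way to picture the thread-per-segment compression is the Python sketch below. It rests on assumptions the source does not confirm: "cyclic compression" is interpreted as repeating the compression while each round still shrinks the data, zlib stands in for the unspecified compressor, and the names `cycle_compress`, `cycle_decompress`, and `compress_segments` are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def cycle_compress(chunk: bytes, max_rounds: int = 3) -> tuple[bytes, int]:
    """Compress repeatedly while a round still makes the data smaller.

    Returns the final bytes and the number of rounds applied, which is
    needed later to reverse the process.
    """
    current = chunk
    rounds = 0
    for _ in range(max_rounds):
        candidate = zlib.compress(current, 9)
        if len(candidate) >= len(current):
            break  # another round would not help
        current = candidate
        rounds += 1
    return current, rounds


def cycle_decompress(blob: bytes, rounds: int) -> bytes:
    """Undo cycle_compress by decompressing the recorded number of rounds."""
    for _ in range(rounds):
        blob = zlib.decompress(blob)
    return blob


def compress_segments(segments: list[bytes]) -> list[tuple[bytes, int]]:
    """Compress every segment using as many worker threads as segments."""
    if not segments:
        return []
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        return list(pool.map(cycle_compress, segments))


if __name__ == "__main__":
    results = compress_segments([b"a" * 1000, b"xyz"])
    print([(len(blob), rounds) for blob, rounds in results])
```

Matching the thread count to the segment count follows the text literally; for a very large number of segments a bounded pool would be the more practical choice.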

Preferably, the cloud computing data storage unit includes:

the cloud computing data length acquisition module is used for:

based on the compressed data acquired in the cloud computing data processing unit, extracting the compressed data respectively, and acquiring the length data of each compressed data after extracting;

a space region capacity acquisition module, configured to:

extracting a storage space region in the storage hardware, and acquiring the number of the space regions after the storage space region is extracted;

the total capacity and the remaining capacity data in each storage space area are acquired.

Preferably, the cloud computing data storage unit further includes:

a capacity correspondence storage module for:

based on the length of the compressed data acquired in the cloud computing data length acquisition module and the total capacity and the residual capacity in the storage space area acquired in the space area capacity acquisition module, the length of the compressed data is compared with the capacity of the storage space area.

Preferably, the capacity correspondence storage module is further configured to:

if the comparison threshold is greater than or equal to the capacity threshold of the first storage area, storing the compressed data in the second storage area or the third storage area;

if the comparison threshold is greater than or equal to the capacity threshold of the second storage area, storing the compressed data in the first storage area or the third storage area;

and if the comparison threshold is greater than or equal to the capacity threshold of the third storage area, storing the compressed data in the first storage area or the second storage area.

Preferably, the stored data dictionary analyzing unit includes:

a storage area data acquisition module, configured to:

based on the compressed data stored in different areas obtained in the cloud computing data storage unit, each compressed data is extracted independently;

a storage data segmentation module for:

and extracting the data sequence of each compressed data based on the plurality of compressed data acquired by the storage area data acquisition module, dividing the acquired data sequence into a plurality of data segments with consistent lengths, and treating each data segment as a single cluster.

Preferably, the stored data dictionary analyzing unit further includes:

a stored data keyword acquisition module, configured to:

based on the data segments obtained in the stored data segmentation module, clustering the data in each data segment to obtain a plurality of clustered data after processing;

word segmentation operation is carried out on sentences in the clustered data, and a plurality of extracted words in the sentences are obtained;

matching the extracted words to the sentence characteristics in the dictionary library, and determining the sentence characteristics of the extracted words once the matching is finished;

and determining the word attribute of each extracted word according to its sentence characteristics, and labeling each word whose attribute has been determined as a target word.
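The word segmentation and attribute labeling steps might look like the following Python sketch. Everything concrete here is an assumption for illustration: the dictionary library is modeled as a plain word-to-attribute mapping with made-up entries, and segmentation is a simple regular expression suited to space-delimited text (Chinese text would need a dedicated segmenter).

```python
import re

# Hypothetical dictionary library: word -> word attribute.
DICTIONARY = {
    "cloud": "noun",
    "storage": "noun",
    "compress": "verb",
    "fast": "adjective",
}


def segment_words(sentence: str) -> list[str]:
    """Split a sentence into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def label_target_words(sentence: str,
                       dictionary: dict[str, str] = DICTIONARY) -> list[tuple[str, str]]:
    """Return (word, attribute) pairs for words found in the dictionary."""
    targets = []
    for word in segment_words(sentence):
        attribute = dictionary.get(word)
        if attribute is not None:
            # Words with a determined attribute are labeled as target words.
            targets.append((word, attribute))
    return targets


if __name__ == "__main__":
    print(label_target_words("Compress the cloud storage!"))
```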

Preferably, the stored data keyword obtaining module is further configured to:

the clustering process includes measuring the distance between two data segments, wherein the distance is measured according to an average association measurement method;

measuring the average distance between the data points of the first data segment and the data points of the second data segment, and merging the two data segments into one data segment after the measurement is completed;

and when merging, selecting the two data segments with the smallest average association value, finally obtaining the clustered data.
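The merging rule described above resembles what is commonly called average-linkage agglomerative clustering. The Python sketch below illustrates that reading under explicit assumptions: data points are plain numbers, the point-to-point distance is the absolute difference, segments are non-empty, and the stopping condition (a target cluster count) is hypothetical because the source does not state when merging ends.

```python
from itertools import combinations


def average_linkage(a: list[float], b: list[float]) -> float:
    """Mean distance over all cross pairs of points from segments a and b."""
    return sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))


def merge_closest(clusters: list[list[float]]) -> list[list[float]]:
    """Merge the single pair of clusters with the smallest average linkage."""
    if len(clusters) < 2:
        return clusters
    i, j = min(
        combinations(range(len(clusters)), 2),
        key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]]),
    )
    merged = clusters[i] + clusters[j]
    rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
    return rest + [merged]


def cluster_segments(segments: list[list[float]], target_count: int) -> list[list[float]]:
    """Treat each segment as a cluster and merge until target_count remain."""
    clusters = [list(s) for s in segments]
    while len(clusters) > max(target_count, 1):
        clusters = merge_closest(clusters)
    return clusters


if __name__ == "__main__":
    print(cluster_segments([[1.0, 2.0], [1.5, 2.5], [10.0, 11.0]], 2))
```

Each merge scans all cluster pairs, which is fine for a handful of segments; larger inputs would usually cache the pairwise distances.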

Preferably, the stored data dictionary analyzing unit further includes:

keyword category induction module for:

based on the target words acquired in the stored data keyword acquisition module, performing maximum length splicing on the target words by using a text minimum unit to acquire spliced words;

cleaning the spliced words according to sentence characteristics in the dictionary library to obtain a keyword set;

converting keywords in the keyword set into word vectors;

and respectively calculating the distance between each word vector in the data segment and the standard word vector corresponding to the data segment, and screening out the keyword corresponding to the word vector with the minimum distance as the target keyword in the data segment.
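The final screening step, choosing the keyword whose word vector lies closest to the segment's standard word vector, can be sketched in Python as below. The Euclidean metric, the toy two-dimensional vectors, and the function names are assumptions; the source names neither the distance measure nor the embedding method.

```python
import math


def euclidean(u: list[float], v: list[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pick_target_keyword(keyword_vectors: dict[str, list[float]],
                        standard_vector: list[float]) -> str:
    """Return the keyword whose vector is nearest to the standard vector."""
    return min(
        keyword_vectors,
        key=lambda word: euclidean(keyword_vectors[word], standard_vector),
    )


if __name__ == "__main__":
    vectors = {
        "storage": [0.9, 0.1],
        "weather": [0.1, 0.9],
        "retrieval": [0.7, 0.3],
    }
    print(pick_target_keyword(vectors, [1.0, 0.0]))
```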

Compared with the prior art, the invention has the following beneficial effects:

1. The invention provides a distributed storage and retrieval system based on cloud computing, which removes repeated data and useless data from the segments through data deduplication. This effectively avoids repeated computation and makes maximum use of the storage space. Compressing the data segments makes the cloud computing data more compact, which reduces the storage space required and speeds up later data transmission.

2. The invention provides a distributed storage and retrieval system based on cloud computing, in which, after the data length and the storage space capacity are compared and calculated, a suitable storage space is selected according to the calculation result. This improves the stability of compressed data during storage to the greatest extent, avoids storing data whose capacity is very small in the storage area with the largest remaining capacity, and effectively improves the utilization rate of the storage space.

3. The invention provides a distributed storage and retrieval system based on cloud computing, which further improves data accuracy by clustering the data segments. When there are too many data segments, the distance between data segments is measured and the two data segments with the smallest average association value are merged, which reduces the complexity of the data segments during text extraction and improves accuracy.

Drawings

FIG. 1 is a schematic overall flow chart of the present invention;

FIG. 2 is a schematic diagram of a cloud computing data processing unit module according to the present invention;

FIG. 3 is a schematic diagram of a cloud computing data storage unit module according to the present invention;

FIG. 4 is a schematic diagram of a stored data dictionary analyzing unit module according to the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

To solve the prior-art problem that cloud computing data is not optimized when stored, so that useless data cannot be cleared in time and storage space is wasted, and referring to FIG. 1 and FIG. 2, the present embodiment provides the following technical scheme:

a cloud computing-based distributed storage and retrieval system, comprising: a cloud computing data acquisition unit configured to: according to different transmission terminals, respectively receiving different cloud computing data; when the cloud computing data is received, a transmission channel with corresponding capacity is automatically adapted according to the flow of the cloud computing data; the cloud computing data processing unit is used for: based on the data acquired in the cloud computing data acquisition unit, uniformly acquiring the data, dividing the acquired data into a plurality of data segments, respectively performing redundancy processing on the data in the plurality of data segments, compressing the data after the processing is finished, and marking the compressed data as compressed data; the cloud computing data storage unit is used for: based on the compressed data acquired in the cloud computing data processing unit, respectively acquiring the lengths of the compressed data, and acquiring the space capacity data to be stored of the compressed data after the acquisition is completed; extracting parameters of compressed data length data and parameters of space capacity data respectively, carrying out corresponding calculation on the parameters after extracting the parameters, and judging a storage area of the compressed data according to a calculation result; the corresponding calculation formula of the parameters is as follows:

[formula image not reproduced in the source text]

wherein x is a parameter of the compressed data length data, y is a parameter of the space capacity data, a is the first storage area, b is the second storage area, and c is the third storage area; according to the result of comparing the parameter of the compressed data length data with the parameter of the space capacity data, the compressed data is stored in the first storage area, the second storage area, or the third storage area; a stored data dictionary analyzing unit configured to: based on the compressed data in the storage areas acquired by the cloud computing data storage unit, acquire keywords from the compressed data in each storage area, and classify the keywords according to their word attributes after acquisition.

Specifically, the cloud computing data acquisition unit selects a channel for the received cloud computing data according to the capacity of the data, which improves transmission efficiency. The cloud computing data processing unit performs preliminary deduplication on the acquired cloud computing data, which effectively avoids repeated computation and makes maximum use of the storage space. The cloud computing data storage unit selects a suitable storage space according to the length of the data, which effectively improves the utilization rate of the storage space. The stored data dictionary analyzing unit further processes the text words in the stored data, which reduces the complexity of the data segments during text extraction, improves the accuracy of keyword acquisition in the data segments, and further improves the efficiency of later data retrieval.

The cloud computing data processing unit comprises a data segmentation module, configured to: divide the acquired cloud computing data into a plurality of segments of the same length, and label each segment with a unique numeric code after division; and a segment data redundancy processing module, configured to: based on the uniquely coded and labeled data acquired in the data segmentation module, perform data deduplication on each of the data segments. The data deduplication overlays data segment on data segment and, after the overlay, removes the repeated data and the useless data in the data segments, where useless data is defined by a data model entered in the database in advance: if a data segment contains data consistent with the data model, that data is useless data. The cloud computing data processing unit further comprises a redundant data compression module, configured to: based on the deduplicated data segments obtained from the segment data redundancy processing module, obtain the number of data segments and generate the same number of compression threads; after the compression threads are generated, import each data segment into its own compression thread; and cyclically compress each data segment in its compression thread, finally marking the cyclically compressed data as compressed data.

Specifically, the data segmentation module divides the cloud computing data into a plurality of segments of the same length and labels each segment with a unique code, which effectively guarantees the stability and accuracy of data acquisition and avoids repeated processing of the data later. The segment data redundancy processing module performs redundancy processing on the divided segments; the redundancy processing is data deduplication, that is, removing the repeated data and useless data in the segments, which effectively avoids repeated computation and makes maximum use of the storage space. After deduplication, the redundant data compression module compresses the data in the data segments. Because the number of compression threads equals the number of data segments, every data segment in the cloud computing data is compressed. Compression makes the cloud computing data more compact, which reduces the storage space required and speeds up later data transmission.

To solve the prior-art problem that, when cloud computing data is stored, the storage space region is not selected according to the capacity and length of the data, so that the data may be too large for the storage space and the two do not match, and referring to FIG. 3, the present embodiment provides the following technical scheme:

The cloud computing data storage unit comprises a cloud computing data length acquisition module, configured to: based on the compressed data acquired in the cloud computing data processing unit, extract each piece of compressed data and acquire its length data after extraction; and a space region capacity acquisition module, configured to: extract the storage space regions in the storage hardware, acquire the number of space regions after extraction, and acquire the total capacity and remaining capacity data of each storage space region. The cloud computing data storage unit further comprises a capacity correspondence storage module, configured to: based on the length of the compressed data acquired in the cloud computing data length acquisition module and the total capacity and remaining capacity of the storage space regions acquired in the space region capacity acquisition module, compare the length of the compressed data with the capacity of the storage space regions. The capacity correspondence storage module is further configured to: if the comparison threshold is greater than or equal to the capacity threshold of the first storage area, store the compressed data in the second storage area or the third storage area; if the comparison threshold is greater than or equal to the capacity threshold of the second storage area, store the compressed data in the first storage area or the third storage area; and if the comparison threshold is greater than or equal to the capacity threshold of the third storage area, store the compressed data in the first storage area or the second storage area.

Specifically, the cloud computing data length acquisition module acquires the length of each piece of compressed data, and the space region capacity acquisition module acquires the total capacity and remaining capacity of each storage space. Storing data according to its length and the remaining capacity of the storage space makes storage more stable. The capacity correspondence storage module matches data lengths to the multiple storage spaces: after the data length and the storage space capacity are compared and calculated, a suitable storage space is selected according to the calculation result. The data length and the storage space capacity are calculated through the following formula:

[formula image not reproduced in the source text]

This improves the stability of compressed-data storage to the greatest extent, avoids storing data whose capacity is very small in the storage area with the largest remaining capacity, and effectively improves the utilization rate of the storage space.
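Because the patent's formula is not available in this text, the selection rule can only be illustrated by analogy. The Python sketch below uses a best-fit policy (pick the area with the least remaining capacity that still holds the data), which is consistent with the stated goal of not placing small data in the area with the most free space. The policy itself, the `StorageArea` class, and the function names are assumptions rather than the patented calculation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StorageArea:
    name: str
    total: int      # total capacity, in bytes
    remaining: int  # free capacity, in bytes


def choose_area(data_len: int, areas: list[StorageArea]) -> Optional[StorageArea]:
    """Best fit: the area with the least free space that still fits the data."""
    candidates = [a for a in areas if a.remaining >= data_len]
    if not candidates:
        return None
    return min(candidates, key=lambda a: a.remaining)


def store(data_len: int, areas: list[StorageArea]) -> str:
    """Reserve space for the data and return the name of the chosen area."""
    area = choose_area(data_len, areas)
    if area is None:
        raise ValueError("no storage area has enough remaining capacity")
    area.remaining -= data_len
    return area.name


if __name__ == "__main__":
    areas = [
        StorageArea("first", total=100, remaining=80),
        StorageArea("second", total=1000, remaining=500),
        StorageArea("third", total=50, remaining=20),
    ]
    print(store(15, areas), store(15, areas), store(70, areas))
```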

To solve the prior-art problem that feature words of the text in the stored data are not determined, so that the keywords of the text data are determined inaccurately and later data retrieval takes too long and is less efficient, and referring to FIG. 4, the present embodiment provides the following technical scheme:

The stored data dictionary analyzing unit includes a storage area data acquisition module, configured to: based on the compressed data stored in different areas obtained in the cloud computing data storage unit, extract each piece of compressed data independently; and a storage data segmentation module, configured to: based on the plurality of compressed data acquired by the storage area data acquisition module, extract the data sequence of each piece of compressed data, divide the acquired data sequence into a plurality of data segments of consistent length, and treat each data segment as a single cluster. The stored data dictionary analyzing unit further includes a stored data keyword acquisition module, configured to: based on the data segments obtained in the storage data segmentation module, cluster the data in each data segment to obtain a plurality of clustered data; perform word segmentation on the sentences in the clustered data to obtain a plurality of extracted words; match the extracted words to the sentence characteristics in the dictionary library, and determine the sentence characteristics of the extracted words once the matching is finished; and determine the word attribute of each extracted word according to its sentence characteristics, labeling each word whose attribute has been determined as a target word. The stored data keyword acquisition module is further configured as follows: the clustering process includes measuring the distance between two data segments according to an average association measurement method; the average distance between the data points of the first data segment and the data points of the second data segment is measured, and the two data segments are merged into one data segment after the measurement is completed; when merging, the two data segments with the smallest average association value are selected, finally obtaining the clustered data. The stored data dictionary analyzing unit further includes a keyword category induction module, configured to: based on the target words acquired in the stored data keyword acquisition module, perform maximum-length splicing on the target words using the minimum text unit to obtain spliced words; clean the spliced words according to the sentence characteristics in the dictionary library to obtain a keyword set; convert the keywords in the keyword set into word vectors; and calculate the distance between each word vector in a data segment and the standard word vector corresponding to that segment, screening out the keyword corresponding to the word vector with the smallest distance as the target keyword of the data segment.

Specifically, the storage area data acquisition module first acquires the compressed data, and the storage data segmentation module then divides the compressed data into a plurality of data segments, that is, clusters. The stored data keyword acquisition module acquires the text words in the data segments, and the clustering process further improves data accuracy: when there are too many data segments, the distance between data segments is measured and the two data segments with the smallest average association value are merged, which reduces the complexity of the data segments during text extraction and improves accuracy. After clustering, the target words in the data segments are extracted, and the keyword category induction module splices the target words to maximum length using the minimum text unit, which improves data accuracy. The distance between each word vector in a data segment and the standard word vector corresponding to that segment is then calculated, and the keyword corresponding to the word vector with the smallest distance is screened out, so that the keywords in the data segment are acquired more accurately and the efficiency of later data retrieval is further improved.

It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A cloud computing-based distributed storage and retrieval system, comprising:

a cloud computing data acquisition unit configured to:

the cloud computing data processing unit is used for:

the cloud computing data storage unit is used for:

the corresponding calculation formula of the parameters is as follows:

[formula image not reproduced in the source text]

a stored data dictionary analyzing unit configured to:

2. The cloud computing-based distributed storage and retrieval system of claim 1, wherein: the cloud computing data processing unit includes:

a data segmentation module for:

the segment data redundancy processing module is used for:

3. The cloud computing-based distributed storage and retrieval system of claim 1, wherein: the cloud computing data processing unit includes:

the redundant data compression module is used for:

4. The cloud computing-based distributed storage and retrieval system of claim 1, wherein: the cloud computing data storage unit includes:

the cloud computing data length acquisition module is used for:

a space region capacity acquisition module, configured to:

5. The cloud computing-based distributed storage and retrieval system of claim 4, wherein: the cloud computing data storage unit further includes:

a capacity correspondence storage module for:

6. The cloud computing-based distributed storage and retrieval system of claim 5, wherein: the capacity correspondence storage module is further configured to:

7. The cloud computing-based distributed storage and retrieval system of claim 1, wherein: the stored data dictionary analyzing unit includes:

a storage area data acquisition module, configured to:

a storage data segmentation module for:

8. The cloud computing-based distributed storage and retrieval system of claim 7, wherein: the stored data dictionary analyzing unit further includes:

a stored data keyword acquisition module, configured to:

9. The cloud computing-based distributed storage and retrieval system of claim 8, wherein: the stored data keyword obtaining module is further configured to:

10. The cloud computing-based distributed storage and retrieval system of claim 8, wherein: the stored data dictionary analyzing unit further includes:

keyword category induction module for:

converting keywords in the keyword set into word vectors;