CN116561230A - Distributed storage and retrieval system based on cloud computing - Google Patents

Publication number
CN116561230A
Authority
CN
China
Prior art keywords
data
cloud computing
storage
compressed
storage area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310828240.XA
Other languages
Chinese (zh)
Other versions
CN116561230B (en)
Inventor
袁庆伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changtong Intelligent Shenzhen Co ltd
Original Assignee
Changtong Intelligent Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changtong Intelligent Shenzhen Co ltd
Priority to CN202310828240.XA
Publication of CN116561230A
Application granted
Publication of CN116561230B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1744 Redundancy elimination performed by the file system using compression, e.g. sparse files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a distributed storage and retrieval system based on cloud computing, relates to the technical field of cloud computing, and aims to solve the problems of unstable storage and low retrieval efficiency of cloud computing data. The system selects a suitable storage space according to a calculation result, which improves the stability of compressed data storage to the greatest extent and effectively improves the utilization rate of the storage space. The data segments are clustered to further improve data accuracy: the two data segments with the smallest average association value are selected for merging, which reduces the complexity of the data segments during text extraction while improving accuracy. The distance between each word vector in a data segment and the standard word vector corresponding to that data segment is calculated, and the keyword corresponding to the word vector with the smallest distance is screened out, so that the keywords in the data segment are obtained more accurately and the efficiency of later data retrieval is further improved.

Description

Distributed storage and retrieval system based on cloud computing
Technical Field
The invention relates to the technical field of cloud computing, in particular to a distributed storage and retrieval system based on cloud computing.
Background
Cloud computing breaks a large data computing task into numerous small programs through a network "cloud", then processes and analyzes the small programs on a system of multiple servers and returns the results to users.
The Chinese patent with publication number CN113535715A discloses an intelligent education storage system based on cloud computing. It mainly classifies input data by label and inputs it into storage library III, classifies and stores the data according to the statistics of the cloud computing classification system, and updates the labels, which improves accuracy and speed during retrieval and addresses the data storage problem. However, the following problems exist in actual operation:
1. The data is not optimized, so useless data cannot be cleared in time and storage space is wasted during storage.
2. The storage space region is not selected according to the capacity and length of the data, so the data may be too large for a storage space that is too small, and the data and the storage space do not match.
3. Feature words of the text in the stored data are not determined, so keyword determination for the text data is inaccurate and retrieval takes too long in later data retrieval.
Disclosure of Invention
The invention aims to provide a distributed storage and retrieval system based on cloud computing that solves the problems in the prior art described above. The system selects a suitable storage space according to a calculation result, which improves the stability of compressed data storage to the greatest extent and effectively improves the utilization rate of the storage space. By clustering the data segments and selecting the two data segments with the smallest average association value for merging, the system further improves data accuracy and reduces the complexity of the data segments during text extraction. By calculating the distance between each word vector in a data segment and the standard word vector corresponding to that data segment, and screening out the keyword corresponding to the word vector with the smallest distance, the system obtains the keywords in the data segment more accurately and further improves the efficiency of later data retrieval.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a cloud computing-based distributed storage and retrieval system, comprising:
a cloud computing data acquisition unit configured to:
according to different transmission terminals, respectively receiving different cloud computing data;
when the cloud computing data is received, a transmission channel with corresponding capacity is automatically adapted according to the flow of the cloud computing data;
the cloud computing data processing unit is used for:
based on the data acquired in the cloud computing data acquisition unit, uniformly acquiring the data, dividing the acquired data into a plurality of data segments, respectively performing redundancy processing on the data in the plurality of data segments, compressing the data after the processing is finished, and marking the compressed data as compressed data;
the cloud computing data storage unit is used for:
based on the compressed data acquired in the cloud computing data processing unit, respectively acquiring the lengths of the compressed data, and acquiring the space capacity data to be stored of the compressed data after the acquisition is completed;
extracting parameters of compressed data length data and parameters of space capacity data respectively, carrying out corresponding calculation on the parameters after extracting the parameters, and judging a storage area of the compressed data according to a calculation result;
the corresponding calculation formula of the parameters is as follows:
wherein x is a parameter of compressed data length data, y is a parameter of space capacity data, a is a first storage area, b is a second storage area, and c is a third storage area;
storing the compressed data in the first storage area, the second storage area, or the third storage area according to the result of comparing the parameter of the compressed data length data with the parameter of the space capacity data;
a stored data dictionary analyzing unit configured to:
based on the compressed data in the storage areas acquired by the cloud computing data storage unit, acquiring keywords from the compressed data in each storage area, and, after the keywords are acquired, classifying them according to their word attributes.
Preferably, the cloud computing data processing unit includes:
a data segmentation module for:
dividing the acquired cloud computing data into a plurality of segments of the same length, and labeling each segment with a unique numeric code after dividing;
the segment data redundancy processing module is used for:
based on the uniquely coded segments acquired from the data segmentation module, performing data deduplication on each of the segments;
in the data deduplication, the data segments are compared with one another by overlapping them, and the repeated and useless data in the data segments is removed after the overlap, wherein useless data is identified by data models entered into the database in advance: any data in a data segment that is consistent with a data model is useless data.
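The segmentation and deduplication steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes deduplication is done by comparing whole segments, and the names `Segment`, `split_fixed`, and `deduplicate`, as well as the use of random UUIDs as the unique codes, are hypothetical choices made for the example.

```python
from dataclasses import dataclass
import uuid


@dataclass(frozen=True)
class Segment:
    segment_id: str   # unique label assigned after splitting (assumed: a UUID)
    payload: bytes


def split_fixed(data: bytes, length: int) -> list[Segment]:
    """Split data into segments of equal length (the last one may be shorter)."""
    if length <= 0:
        raise ValueError("length must be positive")
    return [
        Segment(uuid.uuid4().hex, data[i:i + length])
        for i in range(0, len(data), length)
    ]


def deduplicate(segments: list[Segment], useless: set[bytes]) -> list[Segment]:
    """Drop segments that repeat an earlier payload or match a useless pattern.

    `useless` stands in for the data models entered into the database in advance.
    """
    seen: set[bytes] = set()
    kept = []
    for seg in segments:
        if seg.payload in seen or seg.payload in useless:
            continue
        seen.add(seg.payload)
        kept.append(seg)
    return kept
```

Comparing whole segments keeps the sketch short; removing repeated data inside a segment, as the text describes, would need a finer-grained comparison.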
Preferably, the cloud computing data processing unit further includes:
the redundant data compression module is used for:
based on the deduplicated data segments obtained from the segment data redundancy processing module, obtaining the number of the data segments, and generating the same number of compression threads;
after the compression threads are generated, importing each data segment into its own compression thread;
and cyclically compressing each data segment in its compression thread, and finally marking the cyclically compressed data as compressed data.
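One way to picture the one-thread-per-segment compression is the sketch below. The text does not define "cyclic compression", so the example assumes it means repeating a compression pass while each pass still shrinks the data; `zlib`, the `max_rounds` limit, and the function names are assumptions of the example.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def cycle_compress(payload: bytes, max_rounds: int = 3) -> tuple[bytes, int]:
    """Compress repeatedly while each round still shrinks the data.

    Returns the resulting bytes and the number of rounds applied, which is
    needed later to decompress the same number of times.
    """
    current, rounds = payload, 0
    for _ in range(max_rounds):
        candidate = zlib.compress(current)
        if len(candidate) >= len(current):
            break
        current, rounds = candidate, rounds + 1
    return current, rounds


def compress_segments(segments: list[bytes]) -> list[tuple[bytes, int]]:
    """Run one worker thread per segment, preserving segment order."""
    if not segments:
        return []
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        return list(pool.map(cycle_compress, segments))


def restore(compressed: bytes, rounds: int) -> bytes:
    """Undo cycle_compress by decompressing once per recorded round."""
    for _ in range(rounds):
        compressed = zlib.decompress(compressed)
    return compressed
```

Matching the thread count to the segment count follows the text directly; for a very large number of segments a bounded pool would likely be the more practical choice.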
Preferably, the cloud computing data storage unit includes:
the cloud computing data length acquisition module is used for:
based on the compressed data acquired in the cloud computing data processing unit, extracting the compressed data respectively, and acquiring the length data of each compressed data after extracting;
a space region capacity acquisition module, configured to:
extracting a storage space region in the storage hardware, and acquiring the number of the space regions after the storage space region is extracted;
the total capacity and the remaining capacity data in each storage space area are acquired.
Preferably, the cloud computing data storage unit further includes:
a capacity correspondence storage module for:
based on the length of the compressed data acquired in the cloud computing data length acquisition module and the total capacity and the residual capacity in the storage space area acquired in the space area capacity acquisition module, the length of the compressed data is compared with the capacity of the storage space area.
Preferably, the capacity correspondence storage module is further configured to:
if the comparison threshold is greater than or equal to the capacity threshold of the first storage area, storing the compressed data in the second storage area or the third storage area;
if the comparison threshold is greater than or equal to the capacity threshold of the second storage area, storing the compressed data in the first storage area or the third storage area;
and if the comparison threshold is greater than or equal to the capacity threshold of the third storage area, storing the compressed data in the first storage area or the second storage area.
Preferably, the stored data dictionary analyzing unit includes:
a storage area data acquisition module, configured to:
based on the compressed data stored in different areas obtained in the cloud computing data storage unit, each compressed data is extracted independently;
a storage data segmentation module for:
and extracting the data sequence of each compressed data based on the plurality of compressed data acquired by the storage area data acquisition module, dividing the acquired data sequence into a plurality of data segments with consistent lengths, and treating each data segment as a single cluster.
Preferably, the stored data dictionary analyzing unit further includes:
a stored data keyword acquisition module, configured to:
based on the data segments obtained in the stored data segmentation module, clustering the data in each data segment to obtain a plurality of clustered data after processing;
word segmentation operation is carried out on sentences in the clustered data, and a plurality of extracted words in the sentences are obtained;
matching the extracted words against the sentence characteristics in the dictionary library, and determining the sentence characteristics of each extracted word after the matching is finished;
and determining the word attribute of each extracted word according to its sentence characteristics, and labeling the words whose word attributes have been determined as target words.
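The word segmentation and dictionary lookup can be illustrated with a small sketch. It simplifies the description in two ways that should be read as assumptions: the dictionary library is modeled as a direct word-to-attribute mapping, and sentences are split with a regular expression (text without spaces between words, such as Chinese, would need a dedicated segmenter). The `DICTIONARY` contents and function names are hypothetical.

```python
import re

# Hypothetical dictionary library: word -> word attribute.
DICTIONARY = {
    "storage": "noun",
    "cloud": "noun",
    "compress": "verb",
    "distributed": "adjective",
}


def segment_words(sentence: str) -> list[str]:
    """Split a sentence into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def label_target_words(sentence: str, dictionary: dict[str, str]) -> list[tuple[str, str]]:
    """Return (word, attribute) pairs for the words found in the dictionary."""
    return [(w, dictionary[w]) for w in segment_words(sentence) if w in dictionary]
```

Words that are absent from the dictionary are simply skipped here, so only words with a determined attribute become target words.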
Preferably, the stored data keyword obtaining module is further configured to:
the clustering process comprises measuring the distance between two data segments, wherein the distance is measured according to an average association measurement method;
measuring the average distance between the data points of the first data segment and the data points of the second data segment, and merging the two data segments into one data segment after the measurement is completed;
and, when merging, selecting the two data segments with the smallest average association value, finally obtaining the clustered data.
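The merge rule above resembles what is commonly called average-linkage agglomerative clustering; reading "average association" that way is an assumption. The sketch below also assumes one-dimensional numeric data points, non-empty segments, and a caller-supplied `target` number of clusters, none of which are specified in the text.

```python
from itertools import combinations


def average_linkage(a: list[float], b: list[float]) -> float:
    """Mean absolute distance over all cross pairs of points in a and b."""
    return sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))


def merge_closest(clusters: list[list[float]]) -> list[list[float]]:
    """Merge the two clusters with the smallest average-linkage distance."""
    if len(clusters) < 2:
        return clusters
    i, j = min(
        combinations(range(len(clusters)), 2),
        key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]]),
    )
    merged = clusters[i] + clusters[j]
    rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
    return rest + [merged]


def cluster(segments: list[list[float]], target: int) -> list[list[float]]:
    """Treat each segment as its own cluster, then merge until `target` remain."""
    clusters = [list(s) for s in segments]
    while len(clusters) > max(target, 1):
        clusters = merge_closest(clusters)
    return clusters
```

Each merge step recomputes all pairwise distances, which is simple but quadratic in the number of clusters; that is acceptable for a sketch and small inputs.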
Preferably, the stored data dictionary analyzing unit further includes:
keyword category induction module for:
based on the target words acquired in the stored data keyword acquisition module, performing maximum length splicing on the target words by using a text minimum unit to acquire spliced words;
cleaning the spliced words according to sentence characteristics in the dictionary library to obtain a keyword set;
converting keywords in the keyword set into word vectors;
and respectively calculating the distance between each word vector in the data segment and the standard word vector corresponding to the data segment, and screening out the keyword corresponding to the word vector with the minimum distance as the target keyword in the data segment.
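The final selection step can be sketched as a nearest-vector search. The text does not name the distance measure or say how the standard word vector is produced, so Euclidean distance and the hand-written vectors in the usage example are assumptions; `pick_target_keyword` is a hypothetical name.

```python
import math


def euclidean(u: list[float], v: list[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pick_target_keyword(word_vectors: dict[str, list[float]], standard: list[float]) -> str:
    """Return the keyword whose vector is closest to the standard word vector."""
    if not word_vectors:
        raise ValueError("no keywords to choose from")
    return min(word_vectors, key=lambda w: euclidean(word_vectors[w], standard))
```

For instance, with `{"storage": [0.9, 0.1], "cloud": [0.2, 0.8], "index": [0.5, 0.5]}` and a standard vector of `[1.0, 0.0]`, the function selects `"storage"`.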
Compared with the prior art, the invention has the following beneficial effects:
1. The invention provides a distributed storage and retrieval system based on cloud computing in which data deduplication eliminates repeated data and useless data in the segments, which effectively avoids repeated computation and makes maximum use of the storage space. Compressing the data segments makes the cloud computing data more compact, which reduces the storage space required and also improves the speed of later data transmission.
2. The invention provides a distributed storage and retrieval system based on cloud computing in which, after the data length and the storage space capacity are compared and calculated, a suitable storage space is selected according to the calculation result. This improves the stability of compressed data during storage to the greatest extent, prevents data whose capacity is small from being stored in the storage area with the largest remaining capacity, and effectively improves the utilization rate of the storage space.
3. The invention provides a distributed storage and retrieval system based on cloud computing in which clustering the data segments further improves the accuracy of the data. When the number of data segments is too large, the distance between data segments is measured and the two data segments with the smallest average association value are selected for merging, which reduces the complexity of the data segments during text extraction and improves accuracy.
Drawings
FIG. 1 is a schematic overall flow chart of the present invention;
FIG. 2 is a schematic diagram of a cloud computing data processing unit module according to the present invention;
FIG. 3 is a schematic diagram of a cloud computing data storage unit module according to the present invention;
FIG. 4 is a schematic diagram of a stored data dictionary analyzing unit module according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In order to solve the problem that in the prior art, when cloud computing data is stored, data is not optimized, so that useless data in the data cannot be cleared in time, and storage space is wasted during storage, referring to fig. 1 and 2, the present embodiment provides the following technical scheme:
a cloud computing-based distributed storage and retrieval system, comprising: a cloud computing data acquisition unit configured to: according to different transmission terminals, respectively receiving different cloud computing data; when the cloud computing data is received, a transmission channel with corresponding capacity is automatically adapted according to the flow of the cloud computing data; the cloud computing data processing unit is used for: based on the data acquired in the cloud computing data acquisition unit, uniformly acquiring the data, dividing the acquired data into a plurality of data segments, respectively performing redundancy processing on the data in the plurality of data segments, compressing the data after the processing is finished, and marking the compressed data as compressed data; the cloud computing data storage unit is used for: based on the compressed data acquired in the cloud computing data processing unit, respectively acquiring the lengths of the compressed data, and acquiring the space capacity data to be stored of the compressed data after the acquisition is completed; extracting parameters of compressed data length data and parameters of space capacity data respectively, carrying out corresponding calculation on the parameters after extracting the parameters, and judging a storage area of the compressed data according to a calculation result; the corresponding calculation formula of the parameters is as follows:
wherein x is a parameter of compressed data length data, y is a parameter of space capacity data, a is a first storage area, b is a second storage area, and c is a third storage area. The compressed data is stored in the first, second, or third storage area according to the result of comparing the parameter of the compressed data length data with the parameter of the space capacity data. The stored data dictionary analyzing unit is used for: based on the compressed data in the storage areas acquired by the cloud computing data storage unit, acquiring keywords from the compressed data in each storage area, and, after the keywords are acquired, classifying them according to their word attributes.
Specifically, the cloud computing data acquisition unit selects a channel for the received cloud computing data according to the capacity of the data, which improves the efficiency of cloud computing data transmission. The cloud computing data processing unit performs preliminary data deduplication on the acquired cloud computing data, which effectively avoids repeated computation and makes maximum use of the storage space. The cloud computing data storage unit selects a suitable storage space according to the length of the data, which effectively improves the utilization rate of the storage space. The stored data dictionary analyzing unit further processes the text words in the stored data, which reduces the complexity of the data segments during text extraction, improves the accuracy of keyword acquisition in the data segments, and further improves the efficiency of later data retrieval.
The cloud computing data processing unit comprises a data segmentation module, a segment data redundancy processing module, and a redundant data compression module. The data segmentation module is used for dividing the acquired cloud computing data into a plurality of segments of the same length and labeling each segment with a unique numeric code after dividing. The segment data redundancy processing module is used for performing data deduplication on each of the uniquely coded segments acquired from the data segmentation module. In the data deduplication, the data segments are compared with one another by overlapping them, and the repeated and useless data in the data segments is removed after the overlap; useless data is identified by data models entered into the database in advance, and any data in a data segment that is consistent with a data model is useless data. The redundant data compression module is used for: obtaining the number of deduplicated data segments from the segment data redundancy processing module and generating the same number of compression threads; after the compression threads are generated, importing each data segment into its own compression thread; and cyclically compressing each data segment in its compression thread and finally marking the cyclically compressed data as compressed data.
Specifically, the data segmentation module divides the cloud computing data into a plurality of segments of the same length and labels each segment with a unique code, which effectively guarantees the stability and accuracy of data acquisition and avoids repeated processing of the data later. The segment data redundancy processing module performs redundancy processing on the divided segments. The redundancy processing is data deduplication, that is, repeated data and useless data in the segments are removed, which effectively avoids repeated computation and makes maximum use of the storage space. After deduplication, the redundant data compression module compresses the data in the data segments. Because the number of compression threads matches the number of data segments, every data segment in the cloud computing data is compressed. Compressing the data segments makes the cloud computing data more compact, which reduces the storage space required and also improves the speed of later data transmission.
In order to solve the problem in the prior art that when cloud computing data is stored, a storage space region is not selected according to the capacity and the length of the data, so that the data is too large and the storage space is too small, and the storage space is not matched, referring to fig. 3, the embodiment provides the following technical scheme:
The cloud computing data storage unit comprises a cloud computing data length acquisition module, a space region capacity acquisition module, and a capacity correspondence storage module. The cloud computing data length acquisition module is used for extracting each piece of compressed data acquired from the cloud computing data processing unit and acquiring the length data of each piece of compressed data after extraction. The space region capacity acquisition module is used for extracting the storage space regions in the storage hardware, acquiring the number of space regions after extraction, and acquiring the total capacity and remaining capacity data of each storage space region. The capacity correspondence storage module is used for comparing the length of the compressed data, acquired by the cloud computing data length acquisition module, with the capacity of the storage space regions, based on the total capacity and remaining capacity acquired by the space region capacity acquisition module. The capacity correspondence storage module is further used for: if the comparison threshold is greater than or equal to the capacity threshold of the first storage area, storing the compressed data in the second storage area or the third storage area; if the comparison threshold is greater than or equal to the capacity threshold of the second storage area, storing the compressed data in the first storage area or the third storage area; and if the comparison threshold is greater than or equal to the capacity threshold of the third storage area, storing the compressed data in the first storage area or the second storage area.
Specifically, the cloud computing data length acquisition module acquires the length of each piece of compressed data, and the space region capacity acquisition module acquires the total capacity and remaining capacity of each storage space. Storing data according to its length and the remaining capacity of the storage space makes the data more stable in storage. Because there are multiple storage spaces, the capacity correspondence storage module compares the data length with the storage space capacity and selects a suitable storage space according to the calculation result. The data length and the storage space capacity are calculated through the following formula:
This improves the stability of compressed data storage to the greatest extent, prevents data whose capacity is small from being stored in the storage area with the largest remaining capacity, and effectively improves the utilization rate of the storage space.
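The formula itself is not reproduced in this text, so the following sketch does not implement it. It only illustrates one rule that is consistent with the stated goal of keeping small data out of the area with the largest remaining capacity: a best-fit choice among areas that can hold the data. The `StorageArea` type, the function names, and the best-fit rule are all assumptions of the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StorageArea:
    name: str
    total: int       # total capacity in bytes
    remaining: int   # remaining capacity in bytes


def choose_area(length: int, areas: list[StorageArea]) -> Optional[StorageArea]:
    """Best fit: the area that can hold the data with the least space left over."""
    candidates = [a for a in areas if a.remaining >= length]
    if not candidates:
        return None
    return min(candidates, key=lambda a: a.remaining - length)


def store(length: int, areas: list[StorageArea]) -> Optional[str]:
    """Reserve space in the chosen area and return its name, or None if nothing fits."""
    area = choose_area(length, areas)
    if area is None:
        return None
    area.remaining -= length
    return area.name
```

With three areas holding 10, 40, and 90 bytes of remaining capacity, a 30-byte item lands in the 40-byte area rather than the 90-byte one, leaving the largest area free for larger data.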
In order to solve the problems of the prior art that the stored data does not perform characteristic word determination of text words in the data, so that the keyword determination of the text data in the data is inaccurate, and the retrieval time is too long and the efficiency is reduced in the later period when the data retrieval is performed, please refer to fig. 4, the embodiment provides the following technical scheme:
The stored data dictionary analyzing unit includes a storage area data acquisition module, a stored data segmentation module, a stored data keyword acquisition module, and a keyword category induction module.
The storage area data acquisition module is configured to extract each piece of compressed data individually, based on the compressed data stored in the different areas of the cloud computing data storage unit.
The stored data segmentation module is configured to extract the data sequence of each piece of compressed data acquired by the storage area data acquisition module, divide each data sequence into a plurality of data segments of equal length, and treat each data segment as a single cluster.
The stored data keyword acquisition module is configured to: cluster the data in each data segment obtained by the stored data segmentation module to obtain a plurality of clustered data; perform word segmentation on the sentences in the clustered data to obtain a plurality of extracted words; match the extracted words against the sentence characteristics in the dictionary library to determine the sentence characteristics of each extracted word; and determine the word attribute of each extracted word according to its sentence characteristics, labeling the words whose attributes have been determined as target words. The clustering process measures the distance between two data segments according to the average association measurement method: the average distance between the data points of the first data segment and the data points of the second data segment is measured, and the two data segments with the smallest average association value are then merged into one data segment. The clustered data finally obtained is stored in the data dictionary analyzing unit.
The keyword category induction module is configured to: splice the target words acquired by the stored data keyword acquisition module to maximum length, using the minimum text unit, to obtain spliced words; clean the spliced words according to the sentence characteristics in the dictionary library to obtain a keyword set; convert the keywords in the keyword set into word vectors; and calculate the distance between each word vector in a data segment and the standard word vector corresponding to that data segment, selecting the keyword whose word vector has the smallest distance as the target keyword of the data segment.
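The segmentation and merging steps described above can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the data points are assumed to be plain numbers, the distance between two points is assumed to be their absolute difference, and the function names (`split_equal`, `average_linkage`, `merge_closest`, `cluster_segments`) and the `target` stopping count are hypothetical. The "average association" measure is read here as the average-linkage criterion used in agglomerative clustering.

```python
from itertools import combinations

def split_equal(sequence, length):
    """Divide a data sequence into consecutive segments of the given length.
    The last segment is shorter if the length does not divide the sequence."""
    return [sequence[i:i + length] for i in range(0, len(sequence), length)]

def average_linkage(a, b):
    """Average distance between every point of segment a and every point of b.
    Assumption: points are numbers and distance is the absolute difference."""
    return sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))

def merge_closest(clusters):
    """Merge the pair of clusters with the smallest average-linkage value."""
    i, j = min(
        combinations(range(len(clusters)), 2),
        key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]]),
    )
    merged = clusters[i] + clusters[j]
    return [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

def cluster_segments(sequence, length, target):
    """Start with one cluster per segment and merge until `target` remain."""
    clusters = split_equal(sequence, length)
    while len(clusters) > max(target, 1):
        clusters = merge_closest(clusters)
    return clusters

if __name__ == "__main__":
    print(cluster_segments([1, 2, 10, 11, 2, 3, 11, 12], length=2, target=2))
```

Each equal-length segment begins as its own cluster, matching the text's "each data segment as a single cluster"; the stopping rule (a fixed target count) is a choice made for the sketch, since the source does not state one.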
Specifically, the storage area data acquisition module first acquires the compressed data, and the stored data segmentation module then divides it into a plurality of data segments, each treated as a cluster. The stored data keyword acquisition module obtains the text words in the data segments, and its clustering step further improves the accuracy of the data. When there are too many data segments, the distance between each pair of data segments is measured and the two data segments with the smallest average association value are merged, which reduces the complexity of text extraction from the data segments and improves its accuracy. After clustering, the target words in the data segments are extracted, and the keyword category induction module splices them to maximum length using the minimum text unit, which improves data accuracy. Finally, the distance between each word vector in a data segment and the corresponding standard word vector is calculated, and the keyword whose word vector has the smallest distance is selected. In this way the keywords in each data segment are obtained accurately, which in turn improves the efficiency of later data retrieval.
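The final selection step, choosing the keyword whose word vector lies closest to the segment's standard word vector, might look like the sketch below. The source does not say how word vectors are produced or which distance is used, so the hand-written vectors, the Euclidean distance, and the names `euclidean` and `target_keyword` are all assumptions made for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def target_keyword(word_vectors, standard_vector):
    """Return the keyword whose vector is closest to the standard vector.

    word_vectors: dict mapping each keyword in the segment to its vector.
    standard_vector: the standard word vector associated with the segment.
    """
    return min(
        word_vectors,
        key=lambda word: euclidean(word_vectors[word], standard_vector),
    )

if __name__ == "__main__":
    # Hypothetical two-dimensional vectors, for illustration only.
    vectors = {
        "storage": [0.9, 0.1],
        "cloud": [0.2, 0.8],
        "index": [0.5, 0.5],
    }
    print(target_keyword(vectors, [1.0, 0.0]))
```

In practice the vectors would come from whatever embedding the system uses; only the minimum-distance selection is taken from the text.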
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A cloud computing-based distributed storage and retrieval system, comprising:
a cloud computing data acquisition unit configured to:
receiving different cloud computing data from different transmission terminals;
when the cloud computing data is received, automatically adapting a transmission channel of corresponding capacity according to the traffic volume of the cloud computing data;
a cloud computing data processing unit configured to:
based on the data acquired by the cloud computing data acquisition unit, collecting the data uniformly, dividing the collected data into a plurality of data segments, performing redundancy processing on the data in each data segment, compressing the data after the redundancy processing is finished, and labeling the result as compressed data;
a cloud computing data storage unit configured to:
based on the compressed data acquired by the cloud computing data processing unit, acquiring the length of each piece of compressed data, and then acquiring the space capacity data for storing the compressed data;
extracting a parameter of the compressed data length data and a parameter of the space capacity data, performing the corresponding calculation on the extracted parameters, and determining the storage area of the compressed data according to the calculation result;
the corresponding calculation formula of the parameters is as follows:
wherein x is a parameter of compressed data length data, y is a parameter of space capacity data, a is a first storage area, b is a second storage area, and c is a third storage area;
storing the compressed data in the first storage area, the second storage area, or the third storage area according to the result of comparing the parameter of the compressed data length data with the parameter of the space capacity data;
a stored data dictionary analyzing unit configured to:
based on the compressed data in the storage areas acquired by the cloud computing data storage unit, acquiring keywords from the compressed data in each storage area, and classifying the keywords according to their word attributes after they are acquired.
2. The cloud computing-based distributed storage and retrieval system of claim 1, wherein: the cloud computing data processing unit includes:
a data segmentation module for:
dividing the acquired cloud computing data into a plurality of segments of the same length, and labeling each segment with a unique number code after the division;
a segment data redundancy processing module for:
based on the uniquely coded and labeled data acquired by the data segmentation module, performing data deduplication on each of the plurality of data segments;
wherein the data deduplication overlaps one data segment with another and, after the overlap, removes the repeated data and the useless data in the data segments, the useless data being data that matches a data model entered into the database in advance.
3. The cloud computing-based distributed storage and retrieval system of claim 1, wherein: the cloud computing data processing unit includes:
the redundant data compression module is used for:
based on the deduplicated data segments obtained from the segment data redundancy processing module, obtaining the number of data segments and generating the same number of compression threads;
after the compression threads are generated, importing the data segments into the compression threads respectively;
performing cyclic data compression on each data segment in its compression thread, and finally labeling the cyclically compressed data as compressed data.
4. The cloud computing-based distributed storage and retrieval system of claim 1, wherein: the cloud computing data storage unit includes:
the cloud computing data length acquisition module is used for:
based on the compressed data acquired by the cloud computing data processing unit, extracting each piece of compressed data and then acquiring its length data;
a space region capacity acquisition module, configured to:
extracting the storage space regions in the storage hardware, and then acquiring the number of storage space regions;
acquiring the total capacity and remaining capacity data of each storage space region.
5. The cloud computing-based distributed storage and retrieval system of claim 4, wherein: the cloud computing data storage unit further includes:
a capacity correspondence storage module for:
based on the length of the compressed data acquired by the cloud computing data length acquisition module and the total capacity and remaining capacity of the storage space regions acquired by the space region capacity acquisition module, comparing the length of the compressed data with the capacity of the storage space regions.
6. The cloud computing-based distributed storage and retrieval system of claim 5, wherein: the capacity correspondence storage module is further configured to:
if the comparison threshold is greater than or equal to the capacity threshold of the first storage area, storing the compressed data in the second storage area or the third storage area;
if the comparison threshold is greater than or equal to the capacity threshold of the second storage area, storing the compressed data in the first storage area or the third storage area;
and if the comparison threshold is greater than or equal to the capacity threshold of the third storage area, storing the compressed data in the first storage area or the second storage area.
7. The cloud computing-based distributed storage and retrieval system of claim 1, wherein: the stored data dictionary analyzing unit includes:
a storage area data acquisition module, configured to:
extracting each piece of compressed data individually, based on the compressed data stored in the different areas obtained from the cloud computing data storage unit;
a stored data segmentation module for:
extracting the data sequence of each piece of compressed data acquired by the storage area data acquisition module, dividing each data sequence into a plurality of data segments of equal length, and treating each data segment as a single cluster.
8. The cloud computing-based distributed storage and retrieval system of claim 7, wherein: the stored data dictionary analyzing unit further includes:
a stored data keyword acquisition module, configured to:
clustering the data in each data segment obtained by the stored data segmentation module to obtain a plurality of clustered data;
performing word segmentation on the sentences in the clustered data to obtain a plurality of extracted words;
matching the extracted words against the sentence characteristics in the dictionary library to determine the sentence characteristics of each extracted word;
and determining the word attribute of each extracted word according to its sentence characteristics, and labeling the words whose word attributes have been determined as target words.
9. The cloud computing-based distributed storage and retrieval system of claim 8, wherein: the stored data keyword obtaining module is further configured to:
the clustering process comprises measuring the distance between two data segments, wherein the distance is measured according to an average association measurement method;
measuring the average distance between the data points of the first data segment and the data points of the second data segment, and merging the two data segments into one data segment after the measurement is completed;
and, when merging, selecting the two data segments with the smallest average association value, finally obtaining the clustered data.
10. The cloud computing-based distributed storage and retrieval system of claim 8, wherein: the stored data dictionary analyzing unit further includes:
keyword category induction module for:
based on the target words acquired by the stored data keyword acquisition module, splicing the target words to maximum length using the minimum text unit to obtain spliced words;
cleaning the spliced words according to sentence characteristics in the dictionary library to obtain a keyword set;
converting keywords in the keyword set into word vectors;
and respectively calculating the distance between each word vector in the data segment and the standard word vector corresponding to the data segment, and screening out the keyword corresponding to the word vector with the minimum distance as the target keyword in the data segment.
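As a rough illustration of the processing and storage steps recited in claims 2, 3 and 6, the sketch below splits data into equal-length segments, drops repeated and "useless" segments, compresses each remaining segment in its own thread, and picks a storage area. It is a sketch under assumptions, not the claimed implementation: deduplication is reduced to exact matching by SHA-256 digest, `zlib` stands in for the unspecified compression method (the "cyclic" aspect is not modeled), the area-selection rule simply skips any area whose capacity the data length meets or exceeds, and all function names are hypothetical.

```python
import hashlib
import zlib
from concurrent.futures import ThreadPoolExecutor

def split_equal(data, length):
    """Divide a byte string into consecutive segments of the given length."""
    return [data[i:i + length] for i in range(0, len(data), length)]

def deduplicate(segments, useless=frozenset()):
    """Drop repeated segments and segments matching a known useless pattern.
    Assumption: 'repeated' means byte-for-byte identical."""
    seen, kept = set(), []
    for seg in segments:
        digest = hashlib.sha256(seg).hexdigest()
        if digest in seen or seg in useless:
            continue
        seen.add(digest)
        kept.append(seg)
    return kept

def compress_all(segments):
    """Compress every segment, using one worker thread per segment."""
    if not segments:
        return []
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        return list(pool.map(zlib.compress, segments))

def choose_area(length, capacities):
    """Return the first area whose capacity exceeds the data length, else None.
    An area is skipped when length >= its capacity, as in claim 6."""
    for name, capacity in capacities.items():
        if length < capacity:
            return name
    return None

if __name__ == "__main__":
    segments = deduplicate(split_equal(b"abcdabcdwxyzjunk", 4), useless={b"junk"})
    for blob in compress_all(segments):
        print(len(blob), choose_area(len(blob), {"first": 8, "second": 16, "third": 32}))
```

Creating one thread per segment mirrors the claim wording; a bounded pool would be the more usual choice when the number of segments is large.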
CN202310828240.XA 2023-07-07 2023-07-07 Distributed storage and retrieval system based on cloud computing Active CN116561230B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310828240.XA CN116561230B (en) 2023-07-07 2023-07-07 Distributed storage and retrieval system based on cloud computing

Publications (2)

Publication Number Publication Date
CN116561230A (en) 2023-08-08
CN116561230B CN116561230B (en) 2023-09-01

Family

ID=87490180

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310828240.XA Active CN116561230B (en) 2023-07-07 2023-07-07 Distributed storage and retrieval system based on cloud computing

Country Status (1)

Country Link
CN (1) CN116561230B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9195635B2 (en) * 2012-07-13 2015-11-24 International Business Machines Corporation Temporal topic segmentation and keyword selection for text visualization
US20170147220A1 (en) * 2010-08-02 2017-05-25 International Business Machines Corporation Determining whether to compress a data segment in a dispersed storage network
CN109766451A (en) * 2019-01-09 2019-05-17 武汉巨正环保科技有限公司 A kind of cloud computing platform and its scheduling, data analysing method
CN113094346A (en) * 2021-03-10 2021-07-09 北京四达时代软件技术股份有限公司 Big data coding and decoding method and device based on time sequence
CN113407785A (en) * 2021-06-11 2021-09-17 西北工业大学 Data processing method and system based on distributed storage system
CN113505117A (en) * 2021-07-26 2021-10-15 平安信托有限责任公司 Data quality evaluation method, device, equipment and medium based on data indexes
CN114238264A (en) * 2021-11-12 2022-03-25 上海浦东发展银行股份有限公司 Data processing method, data processing device, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
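Guo Jiafeng; Fan Yixing: "Frontier exploration of deep learning retrieval frameworks", Journal of Computer Research and Development, no. 09 *
Lu Xiaoli; He Jiaming: "Research on a Map/Reduce-based cloud storage model for index data", Journal of Ningbo University (Natural Science & Engineering Edition), no. 03 *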

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117033393A (en) * 2023-10-08 2023-11-10 四川酷赛科技有限公司 Information storage management system based on artificial intelligence
CN117033393B (en) * 2023-10-08 2023-12-12 四川酷赛科技有限公司 Information storage management system based on artificial intelligence

Also Published As

Publication number Publication date
CN116561230B (en) 2023-09-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant