CN118075293A - Cloud platform data storage method based on SaaS

Info

Publication number: CN118075293A
Application number: CN202410303174.9A
Authority: CN
Inventors: 何立娟; 王昀
Original assignee: Beijing Guqi Data Technology Co ltd
Current assignee: Beijing Guqi Data Technology Co ltd
Priority date: 2024-03-18
Filing date: 2024-03-18
Publication date: 2024-05-24
Anticipated expiration: 2044-03-18
Also published as: CN118075293B

Abstract

The invention discloses a SaaS-based cloud platform data storage method, relating to the technical field of data storage. The method comprises: a system receives and automatically classifies data uploaded to a cloud platform; intelligent and dynamic data distribution management is performed based on the classification result; for data processing tasks requiring a fast response, the system uses edge computing to perform preprocessing and analysis during data upload; the system automatically adjusts the number of redundant copies and the backup frequency according to the importance of the data; the system automatically manages data migration, archiving and deletion according to a preset or self-learned strategy; and the system adjusts the data storage strategy according to the network conditions, storage resource utilization and data access patterns monitored in real time. By using edge computing to preprocess and analyze data during upload, the invention reduces the response time of data processing tasks and improves processing efficiency.

Description

Cloud platform data storage method based on SaaS

Technical Field

The invention relates to the technical field of data storage, in particular to a cloud platform data storage method based on SaaS.

Background

With the rapid development of cloud computing, and in particular the widespread adoption of the software-as-a-service (SaaS) model by enterprises and individual users, data storage requirements have grown explosively. In this context, cloud platform data storage methods and systems have become a focus of research and application. Traditional cloud data storage relies mainly on storage and management in a centralized data center. It can provide stable and reliable data services, but as data volumes keep growing and access requirements diversify, it suffers from low data processing efficiency, long response times and insufficient flexibility in data management. Existing data storage schemes often struggle to meet high-efficiency and low-latency requirements, particularly when handling large-scale data and high-frequency access requests.

The field of cloud data storage therefore needs a new method that improves data processing efficiency, optimizes storage resource utilization and copes flexibly with the demands of different data processing tasks. In particular, as big data and Internet of Things (IoT) applications multiply, intelligent and dynamic distributed management of data, together with real-time adjustment of storage strategies to changing network conditions and storage resources, becomes key to improving system performance and user experience. In addition, edge computing, a technique that processes data near where it is generated, is increasingly recognized and used, and has the potential to increase data processing speed and reduce network latency.

Disclosure of Invention

The present invention has been made in view of the above-described problems occurring in the conventional cloud data storage method.

Therefore, the problem to be solved by the present invention is how to provide a method for intelligent and dynamic data distribution management based on classification results by receiving and automatically classifying data uploaded to a cloud platform.

In order to solve the technical problems, the invention provides the following technical scheme:

In a first aspect, an embodiment of the present invention provides a SaaS-based cloud platform data storage method, comprising: a system receives and automatically classifies data uploaded to a cloud platform; intelligent and dynamic data distribution management is performed based on the classification result; for data processing tasks requiring a fast response, the system uses edge computing to perform preprocessing and analysis during data upload; the system automatically adjusts the number of redundant copies and the backup frequency according to the importance of the data; the system automatically manages data migration, archiving and deletion according to a preset or self-learned strategy; and the system adjusts the data storage strategy according to the network conditions, storage resource utilization and data access patterns monitored in real time.

As a preferred scheme of the SaaS-based cloud platform data storage method of the invention: the intelligent and dynamic data distribution management based on the classification result comprises designing a multidimensional distribution strategy according to the data feature extraction result, marking the different types of data, and adding category labels. Designing the multidimensional distribution strategy according to the data feature extraction result comprises: given a data set D, for each data item d_i in D, a feature vector F(d_i) is defined whose components are quantized values representing different classes; feature extraction is performed on the data items, and a multidimensional distribution strategy is designed on that basis, with the following formula:

[formula not reproduced in the source text]

wherein the formula uses the j-th feature value of data item d_i, the dynamic weight of the j-th feature, θ_jk, which is a weight coefficient accounting for the interaction between the two features F_j and F_k, and the feature vector F(d_i); the different types of data are marked and category labels are added.

As a preferred scheme of the SaaS-based cloud platform data storage method of the invention: the system performing preprocessing and analysis during data upload using edge computing comprises: prioritizing the data according to processing urgency and service requirements, and determining which data are to be preprocessed and analyzed on the edge computing nodes; arranging edge computing nodes at the data source, so that when the data source generates data, the selected data are sent to the edge computing nodes for preprocessing; the edge computing nodes analyzing the received data in real time using local computing resources and generating a simple analysis result; and integrating and optimizing results between the edge computing nodes and the cloud platform host to achieve cloud-edge collaborative data analysis.

As a preferred scheme of the SaaS-based cloud platform data storage method of the invention: prioritizing the data comprises establishing an evaluation model that scores each data set, the evaluation model being as follows:

[formula not reproduced in the source text]

wherein the model includes an exponential function of the real-time demand k; C is the computing power of the edge computing node; V is the total storage space of the edge computing node; u is the storage space already in use on the current edge computing node; λ and μ are adjustment parameters that balance the influence of the data set characteristics against the edge computing node resources; and D is the total amount of the data set;

[formula not reproduced in the source text]

where a and b are positive constants.

As a preferred scheme of the SaaS-based cloud platform data storage method of the invention: prioritizing the data further comprises calculating an S value for each data set as the basis for determining its priority and whether it needs to be preprocessed or analyzed on the edge node; an S value above the score threshold and approaching 1 indicates a higher processing priority; an S value below the score threshold indicates that the real-time demand is low or that the edge node's resources are insufficient for efficient processing; data sets with an S value above the score threshold are marked as high priority and are preprocessed or analyzed on the edge nodes first; data sets with an S value below the score threshold are processed later if resources allow, or processed in the cloud.

As a preferred scheme of the SaaS-based cloud platform data storage method of the invention: the score threshold is calculated as follows: historical data are collected over a time range of one month, the S values of all data sets are recorded, and the collected values are arranged in chronological order to form time series of S values; a statistical analysis is performed on each time series:

[formula not reproduced in the source text]

wherein S_i is the S value at the i-th time point, each time point i carries a weight, and N is the total number of observation points in the time window; the clustering algorithm is run multiple times, the silhouette coefficient of each clustering result is calculated, and the cluster center with the highest stability index is selected as the candidate threshold.

As a preferred scheme of the SaaS-based cloud platform data storage method of the invention: the system automatically managing data migration, archiving and deletion according to a preset or self-learned strategy comprises: dividing the life cycle of the data into hot data, warm data and cold data according to the characteristics of the data, the division proceeding as follows: access logs and update logs of all data sets in the storage system are collected; the update logs are analyzed, the distribution of update intervals of each data set is calculated, and the main update period is determined; all collected features are input into an LSTM-based time series model, and a life-cycle prediction model is trained as follows:

[formula not reproduced in the source text]

wherein L(T) is the life-cycle evaluation value of the data item over a time length T, and the remaining terms are the data access frequency, the update frequency, F(T), which is a normalization function of the total observation time T, and a complex information filter function applied to the temporal locality feature p(T); each data set is predicted online, and a data heat curve over a preset period is output; storage strategies are set for data at different life-cycle stages, with hot data stored in a high-performance storage layer and cold data stored in a low-cost storage layer; cold data that reach a preset retention time are automatically deleted or archived to the lowest-cost storage.

The beneficial effects of the invention are as follows. Edge computing is used to preprocess and analyze data during upload, which reduces the response time of data processing tasks and improves processing efficiency. The number of redundant copies and the backup frequency are adjusted automatically according to the importance of the data, which optimizes the use of storage resources and enhances data security and reliability. Data migration, archiving and deletion are managed automatically through a preset or self-learned strategy, which further improves the flexibility and intelligence of data management. In addition, the data storage strategy is adjusted according to the network conditions, storage resource utilization and data access patterns monitored in real time, so as to adapt to different usage scenarios and requirements. These characteristics are expected to improve the performance of the cloud data storage system and user satisfaction.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flowchart of a SaaS-based cloud platform data storage method in embodiment 1.

Fig. 2 is a schematic diagram of the prediction for each data set in embodiment 1.

Detailed Description

In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.

Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.

Example 1

Referring to fig. 1 and fig. 2, in a first embodiment of the present invention, a cloud platform data storage method based on SaaS is provided, including the following steps:

S1: the system receives and automatically classifies the data uploaded to the cloud platform.

Preferably, a data upload interface is provided that supports uploading data in various formats, such as structured, semi-structured and unstructured data.

The uploaded data are parsed, and key feature fields such as the timestamp, device ID and geographic location are extracted.

The data are automatically classified according to predefined classification rules, for example into user log data, device operation data and network monitoring data.

The different types of data are marked and category labels are added; other metadata, such as the data source and reporting time, may also be added.

The classified data are then placed in a temporary storage area of the cloud platform.
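The S1 steps above can be sketched as follows. This is a minimal illustration that assumes rule-based classification over dictionary records; the field names, rule predicates, category names and the `staging_area` list are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical classification rules: (category label, predicate on the raw record).
RULES: list[tuple[str, Callable[[dict[str, Any]], bool]]] = [
    ("user_log", lambda r: "user_id" in r and "action" in r),
    ("device_operation", lambda r: "device_id" in r and "status" in r),
    ("network_monitoring", lambda r: "latency_ms" in r or "bandwidth_mbps" in r),
]

# Key feature fields named in the text (timestamp, device ID, geographic location).
KEY_FIELDS = ("timestamp", "device_id", "location")


@dataclass
class ClassifiedRecord:
    payload: dict[str, Any]
    category: str
    metadata: dict[str, Any] = field(default_factory=dict)


def classify(record: dict[str, Any], source: str) -> ClassifiedRecord:
    """Extract key fields, apply the first matching rule, and attach labels."""
    category = next((name for name, pred in RULES if pred(record)), "unclassified")
    metadata = {k: record[k] for k in KEY_FIELDS if k in record}
    metadata["source"] = source
    return ClassifiedRecord(payload=record, category=category, metadata=metadata)


# A plain list stands in for the cloud platform's temporary storage area.
staging_area: list[ClassifiedRecord] = []
staging_area.append(
    classify({"device_id": "d-17", "status": "ok", "timestamp": 1710720000}, "gateway-a")
)
```

Rules are tried in order, so a record matching several predicates takes the first category; a real deployment would need to decide how such overlaps are resolved.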

S2: intelligent and dynamic data distribution management is performed based on the classification result, which comprises designing a multidimensional distribution strategy according to the data feature extraction result, marking the different types of data and adding category labels.

S2.1: given a data set D, for each data item d_i in D, a feature vector F(d_i) is defined whose components are quantized values representing characteristics such as data category, timeliness, location relevance and security; feature extraction is performed on the data items, and a multidimensional distribution strategy is designed on that basis, taking into account requirements such as data category, timeliness, location relevance and security, with the following formula:

[formula not reproduced in the source text]

wherein the formula uses the j-th feature value of data item d_i, the dynamic weight of the j-th feature, θ_jk, which is a weight coefficient accounting for the interaction between the two features F_j and F_k, and the feature vector F(d_i).
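Since the formula itself is not preserved in this text, the following is only a sketch of one form consistent with the symbols described: a per-feature weighted sum plus pairwise interaction terms weighted by θ_jk. The functional form, the function name `distribution_score` and all numeric values are assumptions.

```python
from itertools import combinations


def distribution_score(features, weights, theta):
    """Assumed form: sum_j w_j * f_j + sum_{j<k} theta[(j, k)] * f_j * f_k."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    linear = sum(w * f for w, f in zip(weights, features))
    pairwise = sum(
        theta.get((j, k), 0.0) * features[j] * features[k]
        for j, k in combinations(range(len(features)), 2)
    )
    return linear + pairwise


# Illustrative feature order: category, timeliness, location relevance, security.
f = [1.0, 0.5, 0.0, 2.0]
w = [0.4, 0.3, 0.2, 0.1]          # dynamic weights (hypothetical values)
theta = {(0, 1): 0.5, (1, 3): 0.25}  # interaction weights (hypothetical values)
score = distribution_score(f, w, theta)
```

Missing entries in `theta` are treated as zero, so only the feature pairs believed to interact need to be listed.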

S2.2: different types of data are marked, category labels are added, and other metadata such as data sources, reporting time and the like can also be added.

S3: for data processing tasks requiring a fast response, the system utilizes edge computing techniques to perform preprocessing and analysis during the data upload process.

S3.1: the data are prioritized according to processing urgency and service requirements, and the data to be preprocessed and analyzed on the edge computing nodes are determined. Specifically:

An evaluation model is established to score each data set, as follows:

[formula not reproduced in the source text]

wherein the model includes an exponential function of the real-time demand k, which expresses the influence of the data set's real-time demand on its priority; C is the computing power of the edge computing node; V is the total storage space of the edge computing node; u is the storage space already in use on the current edge computing node; λ and μ are adjustment parameters that balance the influence of the data set characteristics against the edge computing node resources; and D is the total amount of the data set.

[formula not reproduced in the source text]

where a and b are positive constants that adjust the influence of the real-time demand k, expressing how quickly priority rises for data sets with high real-time demand.

The calculated S value for each dataset will serve as a basis for determining its priority and whether or not preprocessing or analysis at the edge node is required.

For each dataset, an S value is calculated based on its real-time requirements and data volume.

The score threshold is derived from the natural clusters in the historical processing data, as follows:

Historical data are collected over a time range of one month, the S values of all data sets are recorded, and the collected values are arranged in chronological order to form time series of S values; a statistical analysis is performed on each time series:

[formula not reproduced in the source text]

wherein S_i is the S value at the i-th time point, each time point i carries a weight, and N is the total number of observation points in the time window.

The clustering algorithm is run multiple times, and a stability index (such as the silhouette coefficient) is calculated for each clustering result. The cluster center with the highest stability index is selected as the candidate threshold.
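The clustering step can be sketched as below. The text does not name the clustering algorithm or say which cluster center becomes the threshold, so this sketch assumes one-dimensional k-means with random restarts, scores each run with the mean silhouette coefficient, and takes the highest center of the best run as the candidate threshold. All function names are hypothetical.

```python
import random


def kmeans_1d(values, k, rng, iters=50):
    """Lloyd's algorithm on scalars; returns (centres, labels)."""
    centres = rng.sample(sorted(set(values)), k)  # needs at least k distinct values

    def assign(cs):
        return [min(range(k), key=lambda c: abs(v - cs[c])) for v in values]

    for _ in range(iters):
        labels = assign(centres)
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centres[c])
        if updated == centres:
            break
        centres = updated
    return centres, assign(centres)


def silhouette(values, labels, k):
    """Mean silhouette coefficient; points in singleton clusters contribute 0."""
    clusters = [[v for v, lab in zip(values, labels) if lab == c] for c in range(k)]
    total = 0.0
    for v, lab in zip(values, labels):
        own = clusters[lab]
        others = [c for i, c in enumerate(clusters) if i != lab and c]
        if len(own) < 2 or not others:
            continue
        a = sum(abs(v - o) for o in own) / (len(own) - 1)  # mean distance within own cluster
        b = min(sum(abs(v - o) for o in c) / len(c) for c in others)  # nearest other cluster
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(values)


def candidate_threshold(s_values, k=2, runs=10, seed=0):
    """Run k-means several times and keep the run with the best silhouette score."""
    rng = random.Random(seed)
    best_score, best_centres = None, None
    for _ in range(runs):
        centres, labels = kmeans_1d(s_values, k, rng)
        score = silhouette(s_values, labels, k)
        if best_score is None or score > best_score:
            best_score, best_centres = score, centres
    # Assumption: the highest centre of the most stable run serves as the threshold.
    return max(best_centres), best_score
```

Taking the midpoint between the two centers would be an equally defensible reading of the text; the choice only shifts how strict the high-priority cut is.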

A data set whose S value is closer to 1 has a higher processing priority: either its real-time demand is high, or the edge node has enough resources to process it quickly.

A data set whose S value is below the score threshold may be processed later or not processed on the edge node, because its real-time demand is low or the edge node's resources are insufficient for efficient processing.

Data sets with S values above the score threshold are marked as high priority and are preprocessed or analyzed on the edge nodes first.

Data sets with S values below the score threshold may be processed later if resources allow, or processed in the cloud, to conserve the edge node's resources.
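The routing rule described above can be sketched as a small function. The name `route_datasets`, the boolean capacity flag and the treatment of an S value exactly equal to the threshold (handled as "below") are assumptions made for illustration.

```python
def route_datasets(s_values, threshold, edge_has_spare_capacity):
    """Return (edge_first, edge_later, cloud) lists of data set names.

    s_values maps a data set name to its S value.
    """
    edge_first = sorted(
        (name for name, s in s_values.items() if s > threshold),
        key=lambda name: s_values[name],
        reverse=True,  # highest S value is processed first
    )
    below = [name for name, s in s_values.items() if s <= threshold]
    if edge_has_spare_capacity:
        return edge_first, below, []
    return edge_first, [], below
```

For example, with S values `{"a": 0.9, "b": 0.4, "c": 0.95}`, a threshold of 0.6 and no spare edge capacity, `c` and `a` are handled on the edge in that order and `b` is sent to the cloud.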

S3.2: edge computing nodes are deployed at the data source side, such as an edge server at the egress of a data acquisition device or of a local area network; when the data source generates data, the data are first sent to an edge computing node for preprocessing, such as filtering out useless data, extracting key fields and simple aggregation.

S3.3: the edge computing node analyzes the data in real time using local computing resources and generates a simple analysis result.

The real-time analysis extracts key features from the preprocessed data by time series analysis; k-nearest-neighbor (KNN) classification is then applied to the extracted features, and the analysis results mainly comprise anomaly flags, category labels, predicted values and the like.
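A minimal sketch of this edge-side analysis, assuming three simple window statistics as the time series features and a plain majority-vote KNN with Euclidean distance. The feature choice, the labels and the sample readings are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def window_features(samples):
    """Summarise a time window as (mean, population std-dev, last-minus-first trend)."""
    return (mean(samples), pstdev(samples), samples[-1] - samples[0])


def knn_classify(train, query, k=3):
    """Majority label among the k training points closest to `query` (Euclidean)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical labelled windows of sensor readings held on the edge node.
train = [
    (window_features([20, 21, 20, 21]), "normal"),
    (window_features([19, 20, 20, 19]), "normal"),
    (window_features([21, 20, 21, 22]), "normal"),
    (window_features([20, 35, 60, 90]), "anomaly"),
    (window_features([22, 40, 70, 95]), "anomaly"),
    (window_features([18, 30, 65, 85]), "anomaly"),
]
label = knn_classify(train, window_features([21, 38, 66, 92]))
```

The features are left unscaled here; with features on very different scales, normalizing them before computing distances would usually be advisable.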

The result data produced by preprocessing and real-time analysis are sent to the temporary storage area of the cloud platform.

S3.4: results are integrated and optimized between the edge computing nodes and the cloud platform host to achieve cloud-edge collaborative data analysis.

Delay-sensitive analysis tasks are completed in advance at the edge, which shortens response time; edge computing thus provides real-time data processing and reduces the computing load on the cloud platform.

It should be noted that, while data are stored and processed, the system continuously monitors their integrity and uses cryptographic hashes and digital signatures to ensure that the data have not been tampered with without authorization. Once a data integrity problem is found, the system immediately starts preset countermeasures, such as automatic quarantine and recovery.
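The integrity check can be illustrated with Python's standard library. This sketch assumes SHA-256 for the hash and, because the standard library has no public-key signature primitive, uses a keyed HMAC as a stand-in for the digital signature; a real deployment would use an asymmetric signature scheme. The function names and the key are hypothetical.

```python
import hashlib
import hmac


def seal(data: bytes, key: bytes) -> tuple[str, str]:
    """Return (SHA-256 digest, HMAC-SHA256 tag) to store alongside the data."""
    digest = hashlib.sha256(data).hexdigest()
    tag = hmac.new(key, data, hashlib.sha256).hexdigest()
    return digest, tag


def verify(data: bytes, key: bytes, digest: str, tag: str) -> bool:
    """Recompute both values and compare them in constant time."""
    ok_digest = hmac.compare_digest(hashlib.sha256(data).hexdigest(), digest)
    ok_tag = hmac.compare_digest(hmac.new(key, data, hashlib.sha256).hexdigest(), tag)
    return ok_digest and ok_tag


key = b"demo-key-not-for-production"  # hypothetical key
digest, tag = seal(b"sensor batch 42", key)
intact = verify(b"sensor batch 42", key, digest, tag)
tampered = verify(b"sensor batch 43", key, digest, tag)
```

A failed `verify` is the point at which the preset countermeasures, such as quarantine and recovery from a replica, would be triggered.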

S4: the system automatically adjusts the number of its redundant copies and the backup frequency according to the importance of the data.

Business Impact Analysis (BIA) is performed on each type of data to determine the specific impact of data loss or corruption on the business, taking into account factors including, but not limited to, severity of business outage, difficulty in recovery, potential financial loss, compliance and legal impact.

Based on the results of the BIA, the data are classified into the following business importance levels:

Critical: data essential to the business, whose loss or unavailability immediately affects the company's ability to operate and leads to significant financial or legal consequences.

High: data whose loss or unavailability does not immediately affect the company's ability to operate but can seriously affect the business within a short period, possibly causing significant financial loss or reputational damage.

Medium: data whose loss or unavailability has some impact on the business and may reduce operating efficiency, but does not immediately cause serious financial or legal consequences.

Low: data with little impact on business operations, whose loss or unavailability has limited effect on the company's long-term success and daily operations.

Further, different numbers of redundant copies are set according to the business importance level of the data, with more copies for more important data.

For data at the high level and above, more than 3 copies are kept and distributed across different machine rooms or availability zones to improve availability.

The number of redundant copies is adjusted dynamically by monitoring the access frequency and usage of the data in real time: copies are added for frequently accessed data and removed when access declines.

A hot/cold data division strategy is set, in which hot data keeps more copies and cold data keeps fewer.

Specifically, a time window is defined and the number of accesses to each data item within it is counted; based on the number of accesses, the data are divided into three categories: hot, warm and cold.

Hot data: the number of accesses in the last 30 days exceeds X, where the specific value of X depends on the business characteristics and the data access pattern.

Warm data: the number of accesses in the last 30 days is between Y and X, where Y < X.

Cold data: the number of accesses in the last 30 days is less than Y.
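The heat categories and the replica rules above can be combined in a short sketch. The base replica counts, the one-copy adjustment per heat class and the function names are illustrative assumptions; only the X/Y boundaries and the "more than 3 copies for high and above" rule come from the text.

```python
# Hypothetical base replica counts per business importance level.
BASE_REPLICAS = {"critical": 5, "high": 4, "medium": 2, "low": 1}


def heat_class(access_count_30d, x, y):
    """Hot above X, warm between Y and X inclusive, cold below Y."""
    if not y < x:
        raise ValueError("expected Y < X")
    if access_count_30d > x:
        return "hot"
    if access_count_30d >= y:
        return "warm"
    return "cold"


def replica_count(level, heat):
    """Start from the importance level, then add or drop one copy by heat."""
    adjusted = BASE_REPLICAS[level] + {"hot": 1, "warm": 0, "cold": -1}[heat]
    # Keep more than 3 copies for data at the high level and above.
    floor = 4 if level in ("critical", "high") else 1
    return max(adjusted, floor)
```

For instance, `replica_count("high", heat_class(5, 100, 10))` stays at 4 even though the data are cold, because the importance floor takes precedence over the heat adjustment.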

The backup retention period of each data item is calculated automatically from its write timestamp.

Data are backed up periodically, at an interval determined by the importance of the data: more important data are backed up more frequently, and copies of old data are automatically deleted or archived once they expire, which frees storage space. Backup synchronization uses incremental backup to avoid retransmitting unchanged data and improve backup efficiency. Consistency checks on the redundant copies are run periodically to ensure that the copies stay synchronized.

S5: the system automatically manages the migration, archiving and deleting of the data according to a preset or self-learning strategy.

S5.1: the life cycle of the data is divided into stages, such as hot data, warm data and cold data, according to characteristics such as access frequency and update time. The division proceeds as follows:

Access logs and update logs of all data sets in the storage system are collected; the access logs record the access time, access duration and data volume, and the update logs record the update time, updated data volume and similar information. The access logs are analyzed to calculate the daily, weekly and monthly access frequency of each data set; the distribution of access times is also computed to judge whether the data set exhibits temporal locality.

The update logs are analyzed, the distribution of update intervals of each data set is calculated, and the main update period is determined. All collected features are input into an LSTM-based time series model, and a life-cycle prediction model is trained; the model jointly considers the data access frequency, the update frequency and the temporal locality features over different time lengths, as follows:

[formula not reproduced in the source text]

wherein L(T) is the life-cycle evaluation value of the data item over a time length T, and the remaining terms are the data access frequency, the update frequency, F(T), which is a normalization function of the total observation time T, and a complex information filter function applied to the temporal locality feature p(T).

Further, [formula not reproduced in the source text].

Each data set is predicted online, and a data heat curve over a preset period (for example, the coming week) is output. The curve is interpreted as follows:

If the heat curve fluctuates little and is essentially stable, the heat of the data set is stable, and the data set is classified as continuously hot data.

If the heat curve has several peaks, the data set has periodic hot spots and is classified as periodically hot data.

If the heat curve has one clear peak and then cools rapidly, the data set has a burst hot spot and is classified as temporarily hot data.

If the peak heat is low and cooling is slow, the data set may be judged to be warm data, and the storage strategy can be somewhat looser.

If the heat curve shows a bimodal or multimodal pattern, the data set has several hot periods of uncertain timing; such unstable periodically hot data need a more flexible storage strategy in which hot spots are monitored in real time and storage is adjusted dynamically, because a preset periodic strategy alone cannot cope.

If the heat curve stays at a low level with no obvious peak, the data set is judged to be continuously cold data.

If the heat curve shows zero access for a long time, the data set is judged to have entered a cooling period; it is classified as cooled data and is backed up and then placed in temporary storage.

If the heat curve fluctuates in an unstable, random pattern with no fixed period or clear hot spot, the data set is judged to be randomly active data, which require real-time monitoring and response because their storage requirements are hard to predict.
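The curve interpretation above can be approximated with simple rules. The sketch below covers only a subset of the categories, does not distinguish stable from unstable periodic patterns, and relies on hypothetical thresholds (`hot_level`, `cold_level`, `stable_ratio`); it assumes the curve holds non-negative heat values.

```python
def count_peaks(curve, min_height):
    """Count local maxima that reach at least `min_height`."""
    return sum(
        1
        for i in range(1, len(curve) - 1)
        if curve[i - 1] < curve[i] >= curve[i + 1] and curve[i] >= min_height
    )


def classify_heat_curve(curve, hot_level=100.0, cold_level=10.0, stable_ratio=0.2):
    """Map a predicted heat curve to a category; unmatched curves are 'random active'."""
    if not curve:
        raise ValueError("curve must not be empty")
    if max(curve) == 0:
        return "cooled"  # zero access over the whole window
    avg = sum(curve) / len(curve)
    if max(curve) - min(curve) <= stable_ratio * avg:  # small fluctuation
        if avg >= hot_level:
            return "continuously hot"
        if avg <= cold_level:
            return "continuously cold"
        return "warm"
    peaks = count_peaks(curve, hot_level)
    if peaks >= 2:
        return "periodically hot"
    if peaks == 1:
        return "temporarily hot"
    return "random active"
```

In practice the thresholds would be tuned per workload, and the fallback category is where real-time monitoring would take over from preset policy.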

S5.2: the storage strategies of the data with different life cycles are set, hot data is stored in a high-performance storage layer, and cold data is stored in a low-cost storage layer.

Further, hot data: stored on low-latency, high-IOPS media such as SSDs and in-memory databases to guarantee high-speed query and access, with multi-copy redundancy and a cache deployed to accelerate access.

Warm data: stored on medium- to high-speed SAS hard disks or in a distributed storage system that supports high concurrency, providing fairly fast access with moderate redundancy and caching.

Cold data: stored on large-capacity, low-cost SATA hard disks, tape or object storage as a single copy; access is relatively slow but the cost is low.

Archived data: stored on offline tape or other offline media, and must be restored before it can be accessed.

When the life-cycle stage of data changes, data migration is triggered automatically, moving data from high-performance storage to low-cost storage.
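The tier mapping and the migration trigger can be sketched as a lookup plus a change check. The tier identifiers and the `plan_migration` function are hypothetical names chosen for illustration.

```python
# Hypothetical tier identifiers for each life-cycle stage described above.
TIER_BY_STAGE = {
    "hot": "ssd_or_in_memory",
    "warm": "sas_or_distributed",
    "cold": "sata_tape_or_object",
    "archive": "offline_media",
}


def plan_migration(dataset_id, old_stage, new_stage):
    """Return a migration instruction when the life-cycle stage changes, else None."""
    if old_stage == new_stage:
        return None
    return {
        "dataset": dataset_id,
        "source": TIER_BY_STAGE[old_stage],
        "target": TIER_BY_STAGE[new_stage],
    }
```

A scheduler could call `plan_migration` after each life-cycle prediction and queue the returned instruction for the storage layer to execute.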

S5.3: rarely accessed cold data that reach a preset retention time are automatically deleted or archived to the lowest-cost storage.

S6: the system intelligently adjusts the data storage strategy according to the network condition, the storage resource utilization rate and the data access mode which are monitored in real time.

Specifically, network bandwidth utilization and the network delay between different data centers are monitored in real time; the utilization, response time and read/write throughput of the different storage media are monitored; the application's storage access logs are collected and the read/write access frequency is analyzed to identify the type of storage medium chiefly accessed, determine whether data are hot or cold, and compute the read/write ratio of the different request types (for example, whether read requests dominate). When network delay increases, data are migrated to a storage server with better network conditions; when storage capacity utilization is high, cold data are compressed, deduplicated and archived to a low-cost storage medium; for frequently read data, more copies are placed on high-performance media; for frequently written data, more storage resources are allocated.
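The adjustment rules in S6 can be sketched as a function from monitored metrics to a list of actions. The `Metrics` fields, the threshold defaults and the action names are assumptions; the text gives the rules only qualitatively.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    network_latency_ms: float
    capacity_utilization: float  # fraction of capacity in use, 0..1
    read_ratio: float            # share of read requests, 0..1
    heat: str                    # "hot", "warm" or "cold"


def adjust_strategy(m, latency_limit_ms=50.0, capacity_limit=0.8,
                    read_heavy=0.7, write_heavy=0.3):
    """Translate monitored metrics into storage actions (thresholds are illustrative)."""
    actions = []
    if m.network_latency_ms > latency_limit_ms:
        actions.append("migrate_to_better_connected_server")
    if m.capacity_utilization > capacity_limit and m.heat == "cold":
        actions.append("compress_deduplicate_and_archive")
    if m.read_ratio >= read_heavy:
        actions.append("add_replicas_on_high_performance_media")
    elif m.read_ratio <= write_heavy:
        actions.append("allocate_more_storage_resources")
    return actions
```

Several actions can be returned at once, for example when a cold data set sits on a nearly full, poorly connected server; ordering and conflict handling are left to the caller.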

This embodiment also provides a computer device suitable for carrying out the SaaS-based cloud platform data storage method, comprising a memory and a processor; the memory stores computer-executable instructions, and the processor executes the computer-executable instructions to implement the SaaS-based cloud platform data storage method of the above embodiment.

The computer device may be a terminal comprising a processor, a memory, a communication interface, a display screen and input means connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.

The present embodiment also provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the SaaS-based cloud platform data storage method set forth in the above embodiments; the storage medium may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.

In conclusion, the invention utilizes the edge computing technology to preprocess and analyze the data in the uploading process, effectively reduces the response time of the data processing task and improves the processing efficiency; the number of redundant copies and the backup frequency of the data are automatically adjusted according to the importance of the data, so that the utilization of storage resources is optimized, the safety and the reliability of the data are enhanced, and the migration, archiving and deleting of the data can be automatically managed through a preset or self-learning strategy, so that the flexibility and the intelligence of the data management are further improved; meanwhile, according to network conditions, storage resource utilization rate and data access modes monitored in real time, a data storage strategy is intelligently adjusted to adapt to different use scenes and requirements, and the characteristics are expected to greatly improve the performance and user satisfaction of the cloud data storage system.

Example 2

Referring to Table 1, a second embodiment of the present invention gives comparative data between the SaaS-based cloud platform data storage method and the prior art, in order to further verify the advantages of the invention.

Company X builds an IoT platform provided in SaaS mode, providing equipment status monitoring and manufacturing process optimization services specifically for industrial manufacturing enterprises. The platform obtains a variety of industrial data including device sensor data, video stream data, manufacturing process event logs through a data collection gateway at the factory site. These data sources include both structured time series data and unstructured data such as images, video, etc. The platform is used for carrying out intelligent processing on massive heterogeneous data and responding to analysis query demands of enterprise users in real time.

For this scenario, the platform adopts the cloud storage scheme of the invention and obtains a clear technical improvement. The specific implementation is as follows:

A data upload interface using the secure HTTPS protocol is adopted, which supports multiple data formats such as Protobuf and JSON.

The platform back end parses the data stream, extracts features such as the timestamp and the equipment ID, marks the data type with a tag such as 'equipment sensor data' or 'video monitoring', and temporarily stores the classified and tagged data in a Ceph-based temporary cloud storage area.
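By way of non-limiting illustration, the parsing and tagging step can be sketched in Python for a JSON upload. The field names (`timestamp`, `device_id`, `source`, `payload`) and the `LABELS` mapping are hypothetical; the embodiment does not define the record schema, and writing to the Ceph-based temporary area is outside this sketch.

```python
import json

# Hypothetical mapping from a record's "source" field to a class label.
LABELS = {
    "sensor": "equipment sensor data",
    "camera": "video monitoring",
}


def classify_record(raw: bytes) -> dict:
    """Parse one JSON record, extract key features and attach a class label."""
    record = json.loads(raw)
    return {
        "timestamp": record["timestamp"],   # extracted feature
        "device_id": record["device_id"],   # extracted feature
        # Records with an unknown or missing source are left unclassified.
        "label": LABELS.get(record.get("source"), "unclassified"),
        "payload": record.get("payload"),
    }
```

For example, a record whose `source` is `"sensor"` receives the label 'equipment sensor data', and the labelled dictionary can then be handed to whatever staging store the platform uses.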

The platform sets different storage strategies for different kinds of industrial data, such as real-time critical equipment monitoring data and process parameter configuration data with low real-time requirements.

Three copies of the equipment monitoring data are distributed across different data centers in the same city, whereas only one copy of the parameter configuration data is kept. The platform uses a deep learning model to monitor the data access patterns of the different data sets in real time and dynamically adjusts the storage distribution strategy. For the collected video monitoring data, the platform performs real-time anomaly detection on edge nodes inside the factory by using a deep learning model, and transmits only the detection result marks to the cloud. This reduces raw data transmission by at least 60% and lowers the processing pressure on the platform. The detection algorithm achieves a significant improvement over the traditional method; the following is a comparison table of the invention and the prior art:

TABLE 1 comparison of the present invention with the prior art

As can be seen from the table, compared with the traditional technology, the invention shows clear improvement and innovation in data uploading, storage management, edge computing, data security and backup strategy, and its intelligent storage management and edge computing support are particularly advantageous when processing industrial big data. By applying these technical means, data processing efficiency can be improved, storage cost can be reduced, and user experience can be improved.

It should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the technical solution of the present invention may be modified or substituted without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered in the scope of the claims of the present invention.

Claims

1. A cloud platform data storage method based on SaaS, characterized by comprising the following steps:

the system receives and automatically classifies the data uploaded to the cloud platform;

performing intelligent and dynamic data distribution management based on the classification result;

for a data processing task needing quick response, the system performs preprocessing and analysis in the data uploading process by utilizing an edge computing technology;

according to the importance of the data, the system automatically adjusts the number of redundant copies and the backup frequency;

the system automatically manages the migration, archiving and deleting of the data according to a preset or self-learning strategy;

the system intelligently adjusts the data storage strategy according to the network condition, the storage resource utilization rate and the data access mode which are monitored in real time;

the intelligent and dynamic data distribution management based on the classification result comprises the steps of designing a multidimensional distribution strategy according to the data characteristic extraction result, marking different types of data and adding a class label;

The step of designing a multidimensional distribution strategy according to the data characteristic extraction result comprises the following steps:

a data set D is given, and for each data item d_i in the data set D a feature vector F(d_i) is defined, whose components are quantized values representing different classes;

Extracting characteristics of the data items, and designing a multidimensional distribution strategy based on the extracted characteristics, wherein the formula is as follows:

；

wherein the terms of the formula are: the j-th feature value of data item d_i; the dynamic weight of the j-th feature; θ_jk, a weight coefficient that accounts for the interaction between the two features F_j and F_k; and F(d_i), the feature vector;

the system performs preprocessing and analysis in the data uploading process by utilizing an edge computing technology, and comprises the following steps:

according to the processing emergency degree and the service requirement of the data, the data are classified according to priority, and the data are determined to be preprocessed and analyzed on the edge computing nodes;

the edge computing node is arranged at the end of the data source, and when the data source generates data, the determined data is firstly sent to the edge computing node for preprocessing;

The edge computing node utilizes local computing resources to analyze the transmitted data in real time, and generates a simple analysis result;

and carrying out result integration and optimization between the edge computing nodes and the cloud platform host to realize data analysis of cloud edge coordination.

2. The SaaS-based cloud platform data storage method of claim 1, wherein the prioritizing of the data includes establishing an assessment model that scores each data set, the assessment model being as follows:

；

where a and b are positive constants.

3. The SaaS-based cloud platform data storage method of claim 2, wherein the prioritizing of the data further includes:

the S value calculated for each data set is used as the basis for determining the priority of the data set and whether it needs to be preprocessed or analyzed on the edge node;

an S value that is higher than the score threshold and approaches 1 indicates a higher processing priority;

an S value lower than the score threshold indicates that the data set has a low real-time requirement, or that the resources of the edge node are insufficient to process it efficiently;

data sets with an S value higher than the score threshold are marked as high priority and are preprocessed or analyzed on the edge nodes first;

data sets with an S value below the score threshold are processed later if resources allow, or are processed in the cloud.
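By way of non-limiting illustration, the routing rule of this claim can be sketched in Python as follows. The function name, the `edge_has_capacity` flag and the returned labels are hypothetical. The claim does not say how an S value exactly equal to the threshold is handled; this sketch assumes it is treated as not exceeding the threshold.

```python
def route_data_set(s_value: float, threshold: float, edge_has_capacity: bool) -> str:
    """Decide where a data set is processed from its S value."""
    if s_value > threshold:
        # High priority: preprocess or analyze on the edge node first.
        return "edge_priority"
    if edge_has_capacity:
        # Not above the threshold: process later on the edge if resources allow.
        return "edge_deferred"
    # Otherwise fall back to processing in the cloud.
    return "cloud"
```

With a threshold of 0.6, for instance, a data set scoring 0.9 is routed to the edge with priority, while one scoring 0.3 is deferred or sent to the cloud depending on the available edge resources.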

4. The SaaS-based cloud platform data storage method of claim 3, wherein the score threshold is calculated as follows:

collecting historical data over a time range of 1 month, recording the S values of all data sets, and arranging the collected data in chronological order to form time series data of the S values;

performing statistical analysis on each time series:

；

wherein S_i is the S value at the i-th time point, the accompanying weight term is the weight of the i-th time point, and N is the total number of observation points in the time window;

running the clustering algorithm multiple times, calculating the silhouette coefficient of each clustering result, and selecting the cluster center with the highest stability index as the candidate threshold.
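By way of non-limiting illustration, the repeated clustering step can be sketched in pure Python as follows. Several points are assumptions made only for this example: the clustering algorithm is taken to be k-means with two clusters, the mean silhouette coefficient is used as a stand-in for the stability index (which the claim does not define), and the function returns both cluster centers of the best run rather than choosing between them.

```python
import random


def kmeans_1d(values, k, rng, iters=50):
    """Plain Lloyd's algorithm on scalar values; returns (centers, labels)."""
    centers = rng.sample(values, k)  # random initial centers
    labels = [0] * len(values)
    for _ in range(iters):
        # Assign each value to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            # Keep the old center if a cluster happens to be empty.
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


def silhouette(values, labels, k):
    """Mean silhouette coefficient of a 1-D clustering."""
    clusters = [[v for v, lab in zip(values, labels) if lab == c] for c in range(k)]
    if any(len(c) == 0 for c in clusters):
        return -1.0  # degenerate clustering, ranked last
    total = 0.0
    for v, lab in zip(values, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # the silhouette of a singleton is defined as 0
        a = sum(abs(v - u) for u in own) / (len(own) - 1)
        b = min(sum(abs(v - u) for u in clusters[c]) / len(clusters[c])
                for c in range(k) if c != lab)
        total += (b - a) / max(a, b)
    return total / len(values)


def candidate_centers(s_values, runs=10, seed=0):
    """Run k-means several times; return (best score, sorted centers)."""
    rng = random.Random(seed)
    best = None
    for _ in range(runs):
        centers, labels = kmeans_1d(s_values, 2, rng)
        score = silhouette(s_values, labels, 2)
        if best is None or score > best[0]:
            best = (score, sorted(centers))
    return best
```

Which of the returned centers serves as the candidate threshold depends on the stability index actually used, so that choice is left to the caller.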

5. The SaaS-based cloud platform data storage method of claim 4, wherein the system automatically managing the migration, archiving and deleting of the data according to a preset or self-learning strategy comprises the following steps:

according to the characteristics of the data, the life cycle of the data is divided, wherein the life cycle comprises hot data, warm data and cold data, and the specific dividing process is as follows:

Collecting access logs and update logs of all data sets in a storage system, analyzing the update logs, calculating the update interval time distribution of each data set, and determining an update period;

All the collected features are input into an LSTM-based time series model, and a life cycle prediction model is trained as follows:

；

wherein L(T) represents the life cycle evaluation value of the data item over a time length T, and the remaining terms of the formula are: the access frequency; the update frequency; F(T), a normalization function of the total observation time T; and a complex information filtering function of the temporal locality feature p(T);

predicting each data set on line, and outputting a data heat change curve in a preset time;

Setting storage strategies of data with different life cycles, wherein hot data is stored in a high-performance storage layer, and cold data is stored in a low-cost storage layer;

for cold data, after a preset residence time is reached, the data is automatically deleted or archived to the lowest-cost storage.
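By way of non-limiting illustration, the tier placement and cold-data retention steps can be sketched in Python as follows. This sketch replaces the LSTM-based life cycle prediction with fixed access-rate thresholds purely for brevity. The threshold constants, the retention period, the layer names and the placement of warm data in a `standard_layer` are all hypothetical; the claim specifies only the layers for hot and cold data.

```python
from datetime import datetime, timedelta

# Illustrative values only; in the claimed method the heat of a data set
# comes from the trained prediction model, not from fixed thresholds.
HOT_ACCESSES_PER_DAY = 100.0
WARM_ACCESSES_PER_DAY = 1.0
COLD_RETENTION = timedelta(days=365)  # stand-in for the preset residence time


def tier_for(accesses_per_day: float) -> str:
    """Classify a data set as hot, warm or cold from its access rate."""
    if accesses_per_day >= HOT_ACCESSES_PER_DAY:
        return "hot"
    if accesses_per_day >= WARM_ACCESSES_PER_DAY:
        return "warm"
    return "cold"


def placement(accesses_per_day, became_cold_at, now):
    """Return the storage layer or action for a data set.

    became_cold_at is the datetime at which the data set turned cold,
    or None if that moment is unknown or the data set is not cold.
    """
    tier = tier_for(accesses_per_day)
    if tier == "hot":
        return "high_performance_layer"
    if tier == "warm":
        return "standard_layer"  # assumed; the claim does not place warm data
    if became_cold_at is not None and now - became_cold_at >= COLD_RETENTION:
        # Residence time reached: delete or archive to the lowest-cost storage.
        return "archive_or_delete"
    return "low_cost_layer"
```

A predicted heat curve could be fed into `tier_for` in place of the measured access rate without changing the placement logic.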