CN118277620A - Virtual machine-based security cloud storage monitoring system and method - Google Patents

Info

Publication number
CN118277620A
CN118277620A (application CN202410500195.XA)
Authority
CN
China
Prior art keywords
data
hash
stored
hash function
static
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410500195.XA
Other languages
Chinese (zh)
Inventor
王君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Huaban Zhiyuan Technology Co ltd
Original Assignee
Beijing Huaban Zhiyuan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Huaban Zhiyuan Technology Co ltd filed Critical Beijing Huaban Zhiyuan Technology Co ltd
Publication of CN118277620A publication Critical patent/CN118277620A/en
Pending legal-status Critical Current

Abstract

The application discloses a virtual machine-based secure cloud storage monitoring system and method. The method comprises the following steps: the virtual machine acquires data to be stored; a hash-based memory data structure is established and the data to be stored is optimized; and the data to be stored is encoded with a first static hash function and the encoded data is stored. The application provides a virtual machine-based secure cloud storage monitoring system and method that serve as an efficient and scalable cloud storage monitoring solution.

Description

Virtual machine-based security cloud storage monitoring system and method
Technical Field
The application relates to the technical field of information security, in particular to a security cloud storage monitoring system and method based on a virtual machine.
Background
With the rapid growth of enterprise data, data management and archive management face increasing challenges. Enterprises need to efficiently and securely store and analyze large amounts of semi-structured or unstructured data, such as contract documents, customer information, financial statements, and the like. Traditional relational databases and file systems have failed to meet the needs of enterprises for data storage and analysis.
Cloud storage provides a flexible, scalable data management solution for enterprises. However, as the scale of cloud storage systems continues to increase, so does the complexity. For this challenge, monitoring and security protection of cloud storage systems is increasingly important. There is a need to collect and analyze system logs in real-time to detect potential security threats and performance problems while also ensuring the privacy and compliance of stored data.
Conventional relational database management systems (RDBMS), such as PostgreSQL or Oracle, were developed primarily to store relational data with well-defined schemas and to support transactional reads and updates. However, the data to be stored here, such as logs or metrics, is typically written once and rarely updated, and its attributes are dynamic and multidimensional. NoSQL document stores, such as MongoDB or CouchDB, scale out more easily and relax the schema requirements on data as it is ingested. Many NoSQL databases use log-structured merge trees, which primarily support efficient access to data through primary keys. Because creating and maintaining secondary indexes is expensive, NoSQL databases often require users to predefine which attributes to index, and full-text search over such data can incur significant storage overhead.
Therefore, the cloud storage monitoring system adopting the traditional storage technology has the common problems of low processing efficiency, insufficient safety guarantee and the like.
Disclosure of Invention
In view of this, the application provides a virtual machine-based secure cloud storage monitoring system and method that serve as an efficient and scalable cloud storage monitoring system. The system is based on a virtual machine and combines a dedicated hash-based memory data structure with an encoding technique, thereby improving the efficiency and real-time performance of enterprise unstructured data processing while ensuring the privacy and security of the data.
The application provides a secure cloud storage monitoring method based on a virtual machine, which comprises the following steps:
the virtual machine acquires data to be stored;
Establishing a memory data structure based on hash, and optimizing the data to be stored;
And encoding the data to be stored by using a first static hash function, and storing the encoded data to be stored.
Optionally, establishing a hash-based memory data structure and optimizing the data to be stored based on the memory data structure includes:
establishing a hash-based inverted index structure, and building an inverted index over the data to be stored;
and setting up a monitoring point, continuously tracking the inflow rate and data pattern of the data to be stored, and merging redundant posting lists online, wherein a posting list describes the position and frequency information of any token in the inverted index structure.
Optionally, establishing a hash-based inverted index structure and building an inverted index over the data to be stored includes:
decomposing the data to be stored into a plurality of tokens;
generating a hash fingerprint of each token by using a hash algorithm, wherein the hash algorithm is MurmurHash or CityHash;
and adding the hash fingerprint and corresponding ID of each token to the inverted list.
Optionally, merging redundant posting lists online includes:
identifying duplicate data entries in the posting list based on the hash fingerprint of each token;
and merging the duplicate data entries so that each data entry retains only the latest version of the data.
Optionally, the method further comprises:
dynamically adjusting the size of the posting list according to the data load;
and formulating a dynamic merging strategy, and, based on the dynamic merging strategy, executing posting-list merge operations in batches within a preset time based on the hash fingerprint of each token.
Optionally, encoding the data to be stored with a first static hash function includes:
examining the tokens and identifying sensitive tokens;
mapping each sensitive token to a non-personal-identifier reference;
and encoding the reference using the first static hash function.
Optionally, encoding the reference using a first static hash function includes:
constructing a first static hash function, wherein the first static hash function is SHA-256 or MD5;
converting the sensitive token into a hash value by using the first static hash function;
and constructing a posting list corresponding to the sensitive token by using the hash value.
Optionally, the method further comprises:
evaluating a validity period of the first static hash function;
And replacing the first static hash function with a second static hash function before the expiration of the validity period.
Optionally, before storing the encoded data to be stored, the method further includes:
and compressing the encoded data to be stored.
The embodiment of the application also provides a security cloud storage monitoring system based on the virtual machine, which comprises the following steps:
The acquisition module is used for acquiring data to be stored;
the optimizing module is used for establishing a memory data structure based on hash and optimizing the data to be stored;
and the encoding module is used for encoding the data to be stored by using a first static hash function and storing the encoded data to be stored.
The application provides a virtual machine-based secure cloud storage monitoring system and method that introduce a hash-based inverted index structure. This dedicated hash-based memory data structure greatly improves the storage and retrieval efficiency of enterprise unstructured data, making it well suited to managing massive amounts of enterprise unstructured data. In addition, the embodiment of the application encodes the data to be stored with a static hash function, providing an irreversible and only indirectly associated reference (identifier) for privacy-sensitive tokens, which greatly improves the security and privacy protection of the stored data and meets the compliance requirements of enterprise data management.
Drawings
For a clearer description of an embodiment of the present application, the drawings that are required to be used in the description of the embodiment will be briefly described, it being apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained from these drawings without inventive effort for those skilled in the art.
Fig. 1 is a schematic flow chart of a virtual machine-based security cloud storage monitoring method according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a virtual machine-based security cloud storage monitoring system according to an embodiment of the present application.
Detailed Description
Embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
It should be appreciated that the following specific embodiments of the disclosure are described in order to provide a better understanding of the present disclosure, and that other advantages and effects will be apparent to those skilled in the art from the present disclosure. It will be apparent that the described embodiments are merely some, but not all embodiments of the present disclosure. The disclosure may be embodied or practiced in other different specific embodiments, and details within the subject specification may be modified or changed from various points of view and applications without departing from the spirit of the disclosure. It should be noted that the following embodiments and features in the embodiments may be combined with each other without conflict. All other embodiments, which can be made by one of ordinary skill in the art without inventive effort, based on the embodiments in this disclosure are intended to be within the scope of this disclosure.
It is noted that various aspects of the embodiments are described below within the scope of the following claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the present disclosure, one skilled in the art will appreciate that one aspect described herein may be implemented independently of any other aspect, and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. In addition, such apparatus may be implemented and/or such methods practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
It should also be noted that the illustrations provided in the following embodiments merely illustrate the basic concepts of the disclosure by way of illustration, and only the components related to the disclosure are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.
In addition, in the following description, specific details are provided in order to provide a thorough understanding of the examples. However, it will be understood by those skilled in the art that the aspects may be practiced without these specific details.
Example 1
In modern computing environments, particularly in the field of cloud computing, the security of data storage and processing is a critical issue for enterprise data management. With the increasing adoption of virtualization technology, virtual machine-based secure cloud storage monitoring systems have become a key solution for ensuring data storage security in cloud environments. The technical scheme provides an efficient and scalable cloud storage monitoring system that is based on a virtual machine and combines a dedicated hash-based memory data structure with other optimization techniques, improving the efficiency and real-time performance of data processing while ensuring data privacy and security.
In the embodiment of the invention, a virtual machine is a complete computer system emulated in software on a physical machine through virtualization technology. It can run an operating system and applications independently, using a virtualization manager (hypervisor) to mediate interactions with the physical hardware. In a cloud storage system, virtual machines can dynamically allocate resources and scale elastically to optimize computing performance and storage efficiency.
The embodiment of the invention adopts a cloud storage system architecture based on a virtual machine, wherein the cloud storage system is a highly-extensible storage solution, separates data storage from computing resources through a virtual machine technology, and provides services through a network. The basic architecture generally includes the following key components:
Storage nodes (Storage Nodes): a cluster of physical or virtual servers carrying data, responsible for the actual storage and backup of data.
Management nodes (Management Nodes): control the storage nodes and manage data distribution, replicas, and cluster health.
Front-end load balancers (Front-end Load Balancers): manage data requests, implement load balancing and failover, and ensure high availability of user access.
API gateways (API Gateways): provide a series of interfaces that allow clients to interact with the storage system through RESTful APIs.
The storage system has the following functions:
Virtual machine dynamic resource management: in combination with the monitoring system, virtual machine resource allocation is adjusted automatically, for example by adding CPU and memory resources when data access demand is high, to maintain high data processing speed.
Data redundancy and backup strategy: the advantages of virtualization are used to implement multi-replica storage, real-time backup, and fast recovery of data, to cope with possible hardware failures and data loss.
Data distribution and fragmentation: to optimize storage and retrieval efficiency, data is physically scattered and fragmented. The inverted index links the data stored in a scattered manner into a large index table, so that the data can be rapidly indexed and retrieved.
Encryption and security: data is often stored encrypted in storage nodes, and in addition, encryption protocols (e.g., TLS/SSL) are also applied to the data in the network transmissions to ensure data privacy and security.
In order to meet different application requirements and ensure system availability, the architecture design of cloud storage systems needs to be highly reliable and expandable. The flexible design allows the system to increase storage space and computing power as needed without disrupting service. By using the virtualization technology and dynamic resource management, the cloud storage system not only can effectively process and store a large amount of data, but also can adjust resources according to the system load at any time.
The following is an explanation of technical terms of the embodiments of the present invention:
Hash algorithms such as MurmurHash or CityHash are used to convert inputs (e.g., tokens) into fixed-length outputs (hash values) that appear random and minimize collisions.
Static hash functions such as SHA-256 or MD5 are used to encode sensitive information into irreversible, anonymized references to enhance the privacy and security of data.
MurmurHash: MurmurHash is a non-cryptographic hash function designed to hash data efficiently. It provides good dispersion and a low collision probability when computing hash values. Because of its excellent performance, MurmurHash is often used to generate hash values in software development, particularly when handling large amounts of data.
CityHash: CityHash is also a non-cryptographic hash function, similar to MurmurHash. Google developed CityHash for fast processing of large amounts of data and generation of high-quality hash values. It is intended to provide a high-quality hash function for long character strings and to effectively reduce hash collisions when processing similar strings.
SHA-256 (Secure Hash Algorithm, 256-bit), a member of the SHA-2 family, is a cryptographic hash function. SHA-256 compresses a message of arbitrary length into a fixed-length (256-bit) digest. Because its output has high uniqueness and is irreversible, it is widely used for data integrity checking and information security.
MD5 (Message-Digest Algorithm 5) is also a widely used cryptographic hash function that produces a 128-bit (16-byte) hash value. Although MD5 has been largely deprecated because of its weak security (it is vulnerable to collision attacks), it is still used for consistency verification in certain non-security-critical applications.
Posting list (inverted index): the posting list is a component of the inverted index that stores, for a specific term, a list of entries ordered by document ID. In a search engine, when a set of documents is indexed and then searched by keyword, the engine uses the posting lists in the inverted index to quickly find all documents that contain the keyword.
Fig. 1 is a flow chart of a security cloud storage monitoring method based on a virtual machine according to an embodiment of the present disclosure, where the security cloud storage monitoring method based on a virtual machine includes steps S1 to S3.
S1, a virtual machine acquires data to be stored;
In the embodiment of the invention, the virtual machine is responsible for receiving and analyzing the input data of the user and converting the input data into a format which can be further processed by the storage system. Specifically:
Receiving data: the virtual machine receives data sent from a user terminal or another system over the network or another communication channel. For example, a communication protocol such as HTTP(S), FTP, WebSocket, or a RESTful API is used for uploading and downloading data; in addition, a listening service running in the virtual machine monitors a preset network port in real time so that any incoming data request is handled promptly.
Data parsing and verification: the received data is parsed by format. For example, JSON, XML, or data in other formats may need to be converted into an internally used data structure. At the same time, the integrity and security of the data must be verified to ensure that it has not been tampered with. In addition, a caching mechanism is established in the virtual machine's memory to temporarily store large incoming data streams, reducing the impact of instantaneous data peaks on the storage system.
Preprocessing: the necessary data cleansing and deduplication are performed before storage to optimize storage efficiency and data quality.
Specifically, the virtual machine configures a network listening service and receives data on a specific port; the virtual machine performs checksum verification and decoding on the received data packets to ensure data integrity and correct format; and a data caching mechanism temporarily stores the data and performs preprocessing as required to reduce redundancy.
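A minimal sketch of the ingestion path in S1, written with the Python standard library: a service inside the virtual machine listens on a port, verifies a checksum, decodes the payload, and buffers it for later preprocessing. The port number, the X-Content-SHA256 header, and the JSON payload format are illustrative assumptions, not details mandated by the method.

```python
import hashlib
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

BUFFER = []  # in-memory cache that absorbs bursts before they reach storage

class IngestHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        # Integrity check: compare the declared digest with the computed one.
        declared = self.headers.get("X-Content-SHA256", "")
        if hashlib.sha256(body).hexdigest() != declared:
            self.send_response(400)     # tampered or corrupted payload
            self.end_headers()
            return
        try:
            record = json.loads(body)   # format parsing (JSON assumed here)
        except json.JSONDecodeError:
            self.send_response(422)
            self.end_headers()
            return
        BUFFER.append(record)           # temporary caching before preprocessing
        self.send_response(202)
        self.end_headers()

if __name__ == "__main__":
    # Listen on a preset port of the virtual machine (port chosen arbitrarily).
    HTTPServer(("0.0.0.0", 8080), IngestHandler).serve_forever()
```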
S2, establishing a memory data structure based on hash, and optimizing the data to be stored based on the memory data structure;
In step S2, the data structure is established to manage the data to be stored efficiently, enabling fast retrieval and data security. The data optimization further improves storage and query efficiency and reduces the impact of redundant data on system performance.
S2 specifically comprises:
S21, establishing a hash-based inverted index structure, and building an inverted index over the data to be stored;
The hash-based inverted index structure is a data structure particularly suited for text retrieval and data mining, particularly when dealing with large-scale data sets. In this configuration, the hash table is used to store the index for fast lookup and insertion operations. The key to this structure is that it can quickly locate all documents that contain a particular term.
The inverted index is composed of two parts:
Vocabulary: stores the unique tokens (e.g., words, phrases, or lexical identifiers) that appear in all documents.
Inverted lists: for each term in the vocabulary, an inverted list is maintained that contains references (e.g., document IDs) to all documents containing that term.
Hash-based implementation:
In a hash-based inverted index:
Token hash table: a hash table is used in place of the traditional vocabulary; its keys are the hash values of tokens and its values are pointers to the inverted lists.
Inverted lists: similar to the inverted lists in the basic structure, typically with some optimizations such as merging duplicate document entries.
Illustratively, there are three documents:
Document 1 content: "security storage system"
Document 2 content: "storage system efficiency"
Document 3 content: "security monitoring"
The inverted index is constructed as follows:
Word segmentation: the content of each document is decomposed into tokens.
Hash calculation: a hash value is calculated for each token.
Index construction: the hash value of each token and the corresponding document ID are added to the inverted list.
Assuming H(token) denotes the hash value of a token, the constructed inverted index may look as follows:
Hash table:
H("security") -> inverted list [document 1, document 3]
H("storage") -> inverted list [document 1, document 2]
H("system") -> inverted list [document 1, document 2]
H("efficiency") -> inverted list [document 2]
H("monitoring") -> inverted list [document 3]
When retrieving documents containing "security" and "storage", the hash function quickly locates the corresponding inverted lists, and the set of documents containing both tokens is found by intersecting them. Since the inverted lists of both "security" and "storage" contain document 1, it can be quickly determined that document 1 contains all of the tokens to be retrieved.
A hash-based inverted index reduces the number of string comparisons on tokens and exploits the properties of hashing to enable fast lookup, which significantly improves retrieval efficiency, especially for very large data sets.
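A minimal sketch of the hash-based inverted index built from the three example documents above, together with the two-token query from the walk-through. The fingerprint function here is a stand-in (BLAKE2b truncated to 64 bits); the embodiment itself names MurmurHash or CityHash for this role.

```python
import hashlib
from collections import defaultdict

def fingerprint(token: str) -> str:
    # Stand-in hash fingerprint; MurmurHash/CityHash would be used in the embodiment.
    return hashlib.blake2b(token.encode("utf-8"), digest_size=8).hexdigest()

documents = {
    1: "security storage system",
    2: "storage system efficiency",
    3: "security monitoring",
}

inverted_index = defaultdict(list)   # H(token) -> posting list of document IDs
for doc_id, text in documents.items():
    for token in text.split():                      # word segmentation
        postings = inverted_index[fingerprint(token)]
        if doc_id not in postings:                  # avoid duplicate postings
            postings.append(doc_id)

def search_all(*tokens):
    """Return IDs of documents containing every query token."""
    result = None
    for token in tokens:
        ids = set(inverted_index.get(fingerprint(token), []))
        result = ids if result is None else result & ids
    return sorted(result or [])

print(search_all("security", "storage"))   # -> [1], matching the walk-through above
```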
Specifically, in S21, establishing a hash-based inverted index structure and building an inverted index over the data to be stored includes the following steps S211-S213:
S211, decomposing the data to be stored into a plurality of tokens;
In S211, the text string is decomposed into meaningful units, called tokens, by a word segmentation (tokenization) technique that follows the grammar and vocabulary rules of the language concerned, such as Chinese or English.
S212, generating a hash fingerprint of each token by using a hash algorithm, wherein the hash algorithm is MurmurHash or CityHash;
A hash fingerprint is the (ideally unique) hash value obtained for each token through a hash algorithm. In the embodiment of the application, a hash algorithm that balances collision probability and computational efficiency should be selected. For a monitoring system, the ideal hash algorithm has high randomness and a low collision rate, so that each token's fingerprint is as unique as possible. The embodiment of the application maps tokens to hash fingerprints while keeping this process fast and stable. Specifically, a mapping mechanism can be established: a mapping table maintained in the system records the relationship between tokens and their hash fingerprints. The constructed fingerprint mapping table then enables fast token lookup and comparison in the monitoring system. When a new token request is received, the system first computes its hash fingerprint and then performs a quick lookup in the mapping table to check whether the token is present.
In step S212, a hash fingerprint is generated for each token using the MurmurHash and CityHash algorithms, respectively. The refined steps are implemented as follows:
1. Generating hash fingerprints using the MurmurHash algorithm
Initializing the hash function: the MurmurHash algorithm library is loaded and a hash function instance is initialized for the tokens.
Calculating the hash value: the MurmurHash function is invoked on each segmented token, outputting a hash value of 128 bits or another suitable length.
For example, for the token "key", the procedure is as follows:
Token: "key"
Hash fingerprint (MurmurHash): "11f50ed2c5a3a3b5d260b9df465811b4"
2. Generating hash fingerprints using the CityHash algorithm
Initializing the hash function: the CityHash algorithm is introduced and a hash function is set up for the tokens.
Performing the hash operation: CityHash is computed on each token to obtain a fixed-length hash fingerprint.
For example, for the token "data", the procedure is as follows:
Token: "data"
Hash fingerprint (CityHash): "9ae0ea9e3c9c8e988b6a1850e17ea195"
In the actual inverted index creation process, hash fingerprints are generated as follows, for example:
The MurmurHash and CityHash algorithms are applied to the token "management":
Token: "management"
MurmurHash fingerprint, e.g. "2a94f8fa2ccb1e29ed2a1bcfe3c208f3"
CityHash fingerprint, e.g. "3bcdbcb8f63f6c29be1a369d9e041a"
The MurmurHash and CityHash algorithms are applied to the token "technology":
Token: "technology"
MurmurHash fingerprint, e.g. "d3bfbc2dd0f42d1a0446c8e0f1b79ae8"
CityHash fingerprint, e.g. "64c2f54b5e4dc9cc9a5b76465726443c"
After each token is processed by the hash function, a unique hash fingerprint is obtained.
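A hedged sketch of step S212 in Python. It assumes the third-party mmh3 (MurmurHash3) and cityhash packages are installed; the exact fingerprint values will differ from the illustrative ones listed above.

```python
import mmh3                       # pip install mmh3      (assumed dependency)
from cityhash import CityHash128  # pip install cityhash  (assumed dependency)

def murmur_fingerprint(token: str, seed: int = 0) -> str:
    """128-bit MurmurHash3 fingerprint rendered as a hex string."""
    return format(mmh3.hash128(token.encode("utf-8"), seed), "032x")

def city_fingerprint(token: str) -> str:
    """128-bit CityHash fingerprint rendered as a hex string."""
    return format(CityHash128(token.encode("utf-8")), "032x")

# Mapping table maintained by the system: token -> hash fingerprint.
fingerprint_map = {}
for token in ["key", "data", "management", "technology"]:
    fingerprint_map[token] = murmur_fingerprint(token)
    print(token, fingerprint_map[token], city_fingerprint(token))
```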
S213, adding the hash fingerprint and the corresponding ID of each token to the inverted list.
In S213, a hash table is used as the basic data structure of the inverted index, in which the key is the hash fingerprint of a token and the value is a list of document IDs. For each token, the IDs of the documents in which it appears are added under the corresponding hash fingerprint entry in the hash table.
For example, suppose there is a piece of data to be stored: "Key management and data encryption technology". The processing flow is as follows:
Preprocessing: meaningless characters are removed to obtain the keywords: "key management data encryption technology".
Word segmentation: the keywords are separated to obtain the token sequence: "key", "management", "data", "encryption", "technology".
Hash calculation: the MurmurHash algorithm is applied to each token to obtain a set of hash fingerprints.
Index insertion: the hash fingerprint of each token is associated with the document ID and added to the inverted index list.
The results were as follows:
Hash table:
H("key") -> [document ID1]
H("management") -> [document ID1]
H("data") -> [document ID1]
H("encryption") -> [document ID1]
H("technology") -> [document ID1]
In another example, assume there is a blog platform containing the following simplified article content:
Article 1: "cloud storage provides efficient data backup"
Article 2: "data security is crucial in cloud computing"
Article 3: "improving the reliability of cloud services"
When an inverted index is established for the blog platform, the embodiment of the invention executes the following steps to apply the hash fingerprint technique and optimize token matching:
Word segmentation (Tokenization):
each article is broken down into a series of tokens (e.g., "cloud storage," "high efficiency," "data," "backup," "security," "cloud computing," "critical," "boost," "cloud service," "reliability").
Calculating hash fingerprints (hashing):
A hash function is applied to each token to generate a unique hash value (fingerprint). For example, using some hash function H, the tokens are mapped as follows:
H("cloud storage") -> 2a6d
H("data") -> 8ef4
...
Establishing the index:
An inverted index is generated from the hash fingerprint of each token. For the fingerprint of each new term, if it does not yet exist in the index, a new entry is created and associated with the ID of the current article; if it already exists, the ID of the current article is added to the inverted list of the corresponding token fingerprint.
The result:
Hash table (inverted index):
2a6d -> [article 1, article 3]
8ef4 -> [article 1, article 2]
...
Index storage example:
2a6d: {article 1, article 3}
8ef4: {article 1, article 2}
...
This process simplifies the indexing and retrieval of article content, so that all articles containing specific tokens can be found very quickly while storage space is saved.
The embodiment of the invention therefore needs to select a hash function that generates unique identifiers without causing a large number of collisions, in order to speed up retrieval and improve storage efficiency.
The embodiment of the invention uses a high-performance in-memory data structure, such as a Bloom filter, to quickly determine whether an article ID already exists in a token's inverted list, thereby avoiding the insertion of duplicate article IDs into the list.
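A minimal Bloom-filter sketch for the duplicate check just described: before appending an article ID to a token's inverted list, the filter gives a fast "definitely not present" / "possibly present" answer. The bit-array size and the way hash positions are derived are illustrative assumptions.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 8192, num_hashes: int = 4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive k bit positions from salted BLAKE2b digests of the item.
        for i in range(self.k):
            h = hashlib.blake2b(item.encode("utf-8"),
                                salt=i.to_bytes(8, "big")).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Usage: one filter per token's inverted list. On a "possibly present" answer
# the actual list should still be consulted, since Bloom filters allow false positives.
seen = BloomFilter()
if not seen.might_contain("article 1"):
    seen.add("article 1")        # safe to insert: the ID was definitely unseen
```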
S22, setting up monitoring points, continuously tracking the inflow rate and data pattern of the data to be stored, and merging redundant posting lists online, wherein a posting list describes the position and frequency information of any token in the inverted index structure.
The posting list (Posting List) is a term from the inverted index structure; it is a list that records in which documents a particular term (typically a keyword or phrase) appears and at which positions. Each token in the vocabulary has a corresponding posting-list entry. The posting list is a key component in building text search engines and some database indexing systems.
Each entry (posting) in the posting list includes the following fields (a minimal code sketch of such an entry appears below, after the notes on its significance):
Document ID: the document number or unique identifier containing the term. In a distributed system, this may also be a file path or another unique resource locator.
Position information: the position(s) of the token in the document, either a single position or a series of positions.
Frequency information: the frequency with which the token appears in the document.
Other possible information: for example, the context of the term within the document.
Its significance is as follows:
Indexing and retrieval: posting lists enable a search engine or database to index and retrieve information quickly. For example, to find all documents containing the term "cloud storage", the search engine looks up the posting list corresponding to "cloud storage" in the vocabulary and then extracts the document IDs from the list, thereby finding all relevant documents.
Data storage efficiency: with the posting-list structure, the system can record keyword occurrences effectively without storing repeated character strings, which saves storage space and improves retrieval efficiency.
Search optimization: the posting list can be further optimized as required, for example by using a skip list, hash table, or another efficient data structure to improve its retrieval performance.
In S22, setting up the monitoring points typically involves configuring monitoring software to observe the data flow within the virtual machine. The monitoring points record the data inflow rate and data patterns, such as packet size and request frequency, in real time.
When designing the data processing strategy of a virtual machine-based secure cloud storage monitoring system, improving storage efficiency and reducing unnecessary data duplication are key optimization goals. Online posting-list merging makes it possible to monitor the data flow in real time while improving storage utilization and reducing redundant information.
Specifically, steps S221 and S222 are included:
S221, identifying duplicate data entries in the posting list based on the hash fingerprint of each token;
In S221, duplicate data entries must be identified in the existing posting list. For example, if a term appears in multiple documents and those documents carry the same document tag (e.g., the same version number or document type), the corresponding entries may be considered duplicates.
Specifically, S221 constructs a hash map with the hash fingerprint of each data entry as the key and the position and frequency information as the value. The mapping table is updated in real time to reflect the fingerprints of newly added data entries and their associated information. The mapping table is then traversed, using the hash values to quickly locate potential duplicates. For each newly added entry, the hash fingerprints are compared to check whether a matching entry already exists. For each matching hash fingerprint, all duplicate posting-list entries are marked in preparation for merging.
S222, merging duplicate data entries so that each data entry retains only the latest version of the data.
A merge policy is defined, for example the trigger conditions and priority of the merge operation. A rule for which version of a data entry to retain is chosen; for example, the latest version of the data may be retained.
The merge operation is performed according to the tags: the information of duplicate entries (such as file locations and frequencies) is merged into the retained entry, and duplicate and older versions of the data entries are removed from the posting list, i.e., a deduplication operation.
After the merge operation completes, the posting list is updated to ensure that it reflects the latest, non-duplicate information. In addition, the associated caches and indexes are updated where necessary to maintain data consistency and currency throughout the system.
It should be noted that the actual merge in S222 can be performed in various ways, such as consolidating duplicate document references into one, or reorganizing the data in the list according to certain logical rules, so as to reduce repetition and redundancy; that is, a deduplication operation.
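A hedged sketch of S221/S222: duplicate postings are grouped by the pair (token fingerprint, document ID) and only the highest-version entry is kept. The entry fields, in particular the "version" tag, are assumptions used to illustrate the merge; the method only requires that the latest data survive.

```python
from collections import defaultdict

def merge_postings(entries):
    """entries: list of dicts with keys fingerprint, doc_id, version, positions, frequency."""
    latest = {}
    for entry in entries:
        key = (entry["fingerprint"], entry["doc_id"])       # identifies duplicates
        if key not in latest or entry["version"] > latest[key]["version"]:
            latest[key] = entry                             # keep the newest version only
    merged = defaultdict(list)
    for (fingerprint, _), entry in latest.items():
        merged[fingerprint].append(entry)                   # rebuild the posting lists
    return dict(merged)

old = [
    {"fingerprint": "2a6d", "doc_id": "log-17", "version": 1, "positions": [4], "frequency": 1},
    {"fingerprint": "2a6d", "doc_id": "log-17", "version": 2, "positions": [4, 9], "frequency": 2},
]
print(merge_postings(old))   # only the version-2 posting for log-17 remains
```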
In addition, as the posting list grows, immediate and accurate deduplication is critical for maintaining system efficiency and storage optimization. For different data load levels, an algorithm is therefore needed to automatically manage the growing posting list and the deduplication operation, ensuring that the data is always up to date and not duplicated. The specific steps A1-A2 are as follows:
A1. dynamically adjusting the size of the posting list according to the data load;
In A1, the data load of the system is monitored continuously, covering the data inflow rate, data volume, query frequency, and so on. The size of the posting list is adjusted in real time according to the monitoring results, which may include expanding the memory allocation or enabling new data structures to reduce lookup time. In addition, it may be necessary to define manually under which circumstances the posting list should be resized, for example when the system reaches a certain CPU or memory usage threshold.
A2. formulating a dynamic merging strategy, and, based on that strategy, executing posting-list merge operations in batches within a preset time based on the hash fingerprint of each token.
The merging strategy is formulated dynamically according to the application scenario and system performance requirements, and an appropriate strategy is selected for different scenarios and needs. Specifically, the merging strategy may be real-time or periodic: a real-time system may favor fast merging of small batches, while a batch processing system may choose to merge large batches during off-peak hours.
S3, encoding the data to be stored by using a first static hash function, and storing the encoded data to be stored.
The "static hash function" in S3 refers to a deterministic hash function that does not change during system operation. Such functions are typically used to map an input (such as a token or other type of identifier) to a fixed-size, seemingly random string of data (hash value). This mapping process is an encoding operation for converting data into another form. And herein, "reference" refers to an identifier, such as a pointer or index, that points to a particular data or object, such as a data item stored in a database or file system.
The purpose of encoding the list and the references between the tokens using the first static hash function is to preserve anonymity and privacy of the data. This is achieved by replacing the data items and their identifiers (e.g. document IDs or tokens) with unique encodings generated by a hash function, which improves the non-identity of the data.
It should be noted that, in S2, the data to be stored is optimized to improve the efficiency of data storage and retrieval, while in S3, the sensitive data needs to be subjected to irreversible encoding operation so as to hide the plaintext information of the sensitive data, so as to improve the storage security.
Specifically, in S3, the data to be stored is encoded using a first static hash function, including the following steps S31-S33:
S31, examining the tokens and identifying sensitive tokens;
For example, the determination may be made using a preset sensitive-word dictionary, a natural language processing (NLP) tool, or the like.
S32, mapping the sensitive tokens to non-personal-identifier references;
For each sensitive token, a non-personal-identifier reference is created as a replacement. The replacement may be a randomly generated identifier or an anonymized code produced by a particular algorithm. In the data record, all sensitive tokens are replaced with the corresponding non-personal-identifier references, thereby desensitizing the data.
S33, encoding the reference using a first static hash function.
In S33, a suitable static hash function, such as SHA-256 or MD5, is selected; its purpose is to convert the sensitive token into a hash value and increase the security of the data. The hash function is applied to each non-personal-identifier reference, generating a hash value unique to each sensitive token. A new posting list is then constructed from these hash values, containing the encoded token hash values and their corresponding position information in the storage system.
Encoding the reference using the first static hash function includes the following steps S331-S333:
S331, constructing a first static hash function, wherein the first static hash function is SHA-256 or MD5;
S332, converting the sensitive token into a hash value by using the first static hash function;
The non-personal-identifier references are hashed in bulk with the selected static hash function, ensuring that the same hash function version or configuration is used for every reference to keep the encoding consistent.
S333, constructing a posting list corresponding to the sensitive token by using the hash value.
The hash value obtained for each encoded reference is filled into the posting list, where each hash value corresponds to one or more positions of the original data in the storage system.
An example illustrates the implementation of the above steps:
Assume the non-personal-identifier reference is "REF001", representing a piece of sensitive data to be protected.
Selecting the hash function:
SHA-256 is chosen as a static hash function suitable for this use case.
The SHA-256 hash function is applied to "REF001", yielding a unique hash fingerprint.
The encoded posting list is then constructed: its contents are updated to the hash fingerprints, which are not obviously or directly related to the storage location of the actual data, so the privacy of the data is maintained.
Example of the encoded posting list:
Hash value "5e88489..." -> [document ID1 position, document ID2 position]
These steps ensure that the original sensitive data is well protected through the non-personal-identifier reference and the subsequent hashing, while the functionality of the retrieval system is maintained.
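A minimal sketch of S31-S33 and the "REF001" walk-through above: a sensitive token is replaced by a non-personal-identifier reference, the reference is encoded with a static SHA-256 hash, and the encoded value keys the posting list. The sensitive-token set and the reference-generation scheme are illustrative assumptions.

```python
import hashlib
import secrets

SENSITIVE = {"alice@example.com"}          # assumed sensitive-token dictionary
reference_of = {}                          # sensitive token -> non-personal reference

def to_reference(token: str) -> str:
    if token not in reference_of:
        reference_of[token] = "REF" + secrets.token_hex(4)   # "REF001"-style identifier
    return reference_of[token]

def encode_reference(ref: str) -> str:
    # First static hash function: SHA-256, irreversible encoding of the reference.
    return hashlib.sha256(ref.encode("utf-8")).hexdigest()

encoded_posting_list = {}                  # hash value -> positions in the storage system
for position, token in enumerate("user alice@example.com uploaded report".split()):
    if token in SENSITIVE:
        key = encode_reference(to_reference(token))
        encoded_posting_list.setdefault(key, []).append(("document ID1", position))

print(encoded_posting_list)
```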
Furthermore, using a static hash function is very effective for enhancing data security, because it provides an irreversible reference that is not directly linkable to the stored tokens. However, to counter security attacks that may emerge over time, such as rainbow table attacks, these hash functions need to be updated periodically. A specific implementation includes the following steps:
B1. evaluating a validity period of the first static hash function;
The validity period of the static hash function is assessed according to the external security environment, i.e., how long the function can be used safely without exposing key information.
B2. replacing the first static hash function with a second static hash function before the validity period expires.
A plan is made to replace the old hash function before the end of the assessed validity period. The second static hash function, for example SHA-3, can adjust its hashing mechanism as needed during system operation, such as by adding a salt or changing the hash parameters, to accommodate changing data-protection requirements.
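A hedged sketch of B1/B2: when the first static hash function nears the end of its assessed validity period, references are re-encoded with a second static hash function (SHA3-256 here, with an added salt). The expiry date and the salt handling are assumptions for illustration only.

```python
import hashlib
from datetime import date

FIRST_EXPIRES = date(2025, 12, 31)          # assessed validity period (assumption)

def first_hash(ref: str) -> str:
    return hashlib.sha256(ref.encode("utf-8")).hexdigest()

def second_hash(ref: str, salt: bytes = b"rotation-2026") -> str:
    # Second static hash function: salted SHA3-256.
    return hashlib.sha3_256(salt + ref.encode("utf-8")).hexdigest()

def encode(ref: str, today: date) -> str:
    # Switch over before the first function's validity period ends.
    return second_hash(ref) if today >= FIRST_EXPIRES else first_hash(ref)

print(encode("REF001", date(2026, 1, 1)))
```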
In addition, the encoded data to be stored may be compressed before it is stored. Conventional compression techniques may be employed; a brief sketch is given below.
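A minimal example of conventional compression applied to the encoded data before it is written to the storage node; zlib is one common standard-library choice and is only an example.

```python
import json
import zlib

encoded_record = {"5e88489...": ["document ID1 position", "document ID2 position"]}
blob = zlib.compress(json.dumps(encoded_record).encode("utf-8"), 6)  # level 6 compression
print(len(blob), "bytes written to the storage node")
```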
Illustratively, in the field of cloud computing, monitoring logs and analyzing user behavior in real time is a critical task for ensuring quality of service and security. For this application scenario, the embodiment of the invention provides the following technical advantages:
Real-time log monitoring: the embodiment of the invention can monitor and analyze log data generated by the cloud infrastructure in real time. With the hash-based inverted index structure, specific events or patterns can be searched quickly, greatly increasing the speed of detecting and responding to problems.
User behavior analysis: analyzing users' activities on the cloud platform is important for improving user experience and security management. By building the inverted index and combining it with the hash algorithm, the system can track user activities such as file accesses and service requests, analyze behavior patterns, and quickly discover abnormal behavior, thereby preventing potential security threats.
Illustratively, in view of the continual flow of logs in cloud services and the continual generation of user behavior, monitoring systems need to efficiently process, sort, and store large volumes of log and behavior data records. The system realizes monitoring by the following steps:
After collecting log data and user behavior records generated in real time, the system segments them into tokens that can be indexed independently.
A hash fingerprint is generated for each token using a dedicated hash algorithm (e.g., MurmurHash) and stored in the inverted index structure.
The index is updated and optimized in real time, for example by dynamically adjusting the size of the posting list and executing merge operations in batches, to keep the data timely and accurate.
Sensitive information, such as user identities, is encoded with a static hash function, which strengthens data security and privacy protection.
By the method, the cloud service provider can provide more reliable service guarantee for users of the cloud service provider, and meanwhile, the privacy of user data and the overall safety of the system are effectively maintained.
In addition, for the actual scenario of storing user log files in a cloud service environment: in order to ensure the security of the stored data and efficiently process logs, the specific implementation steps comprise:
step 1, data collection and pretreatment
The virtual machine of the cloud service provider collects log data of the user in real time through monitoring software when the user interacts with the service, such as access records, error reports, system calls and the like.
Data preprocessing includes sorting log formats, filtering out garbage, and extracting key data points such as time stamps, event types, and resource identifiers.
Step 2, establishment of word segmentation and inverted index
The log content is broken down into a series of tokens using text analysis techniques to facilitate faster data lookup and analysis in subsequent steps.
An inverted index is created, a MurmurHash algorithm is applied to each token to generate a unique hash fingerprint, and the result is stored in an index structure. Inverted index is critical to efficient log data retrieval.
Step3, monitoring point setting and data flow tracking
Monitoring points are established to track the inflow rate and patterns of log data and to detect any potentially abnormal behavior in real time.
Through continuous analysis, the monitoring system is able to discover and defend against potential security threats, such as frequent login attempts, unauthorized data access, or abnormal system operation behavior.
Step 4, on-line de-duplication and high-efficiency storage
By merging redundant posting lists online, unnecessary use of storage space is reduced while the latest data version is retained.
The posting list can be adjusted dynamically according to the data load, optimizing memory usage and query response time.
Step 5, sensitive information processing and security coding
A mapping operation is performed on the identified sensitive tokens, e.g., portions containing personal information of the user, to translate the references to non-personal identifiers.
The references are encoded by using SHA-256 or MD5 static hash functions, so as to generate irreversible unique codes, and the security and privacy protection of the data are enhanced.
Through the steps, the security cloud storage monitoring system efficiently completes the processing and storage of log data on the premise of ensuring the privacy of user data, so that operators can easily analyze the past events afterwards, and the overall security of cloud environment data storage is improved.
The application provides a virtual machine-based secure cloud storage monitoring system and method that introduce a hash-based inverted index structure. This dedicated hash-based memory data structure greatly improves the storage and retrieval efficiency of enterprise unstructured data, making it well suited to managing massive amounts of enterprise unstructured data. In addition, the embodiment of the application encodes the data to be stored with a static hash function, providing an irreversible and only indirectly associated reference (identifier) for privacy-sensitive tokens, which greatly improves the security and privacy protection of the stored data and meets the compliance requirements of enterprise data management.
Example two
Fig. 2 is a schematic structural diagram of a security cloud storage monitoring system based on a virtual machine according to an embodiment of the present disclosure, where the security cloud storage monitoring system 200 based on a virtual machine includes:
An acquisition module 201, configured to acquire data to be stored;
an optimizing module 202, configured to establish a hash-based memory data structure, and optimize the data to be stored;
and the encoding module 203 is configured to encode the data to be stored by using a first static hash function, and store the encoded data to be stored.
Optionally, the optimizing module 202 is configured to establish a hash-based memory data structure and optimize the data to be stored based on the memory data structure, including:
establishing a hash-based inverted index structure, and building an inverted index over the data to be stored;
and setting up a monitoring point, continuously tracking the inflow rate and data pattern of the data to be stored, and merging redundant posting lists online, wherein a posting list describes the position and frequency information of any token in the inverted index structure.
Optionally, establishing a hash-based inverted index structure and building an inverted index over the data to be stored includes:
decomposing the data to be stored into a plurality of tokens;
generating a hash fingerprint of each token by using a hash algorithm, wherein the hash algorithm is MurmurHash or CityHash;
and adding the hash fingerprint and corresponding ID of each token to the inverted list.
Optionally, merging redundant posting lists online includes:
identifying duplicate data entries in the posting list based on the hash fingerprint of each token;
and merging the duplicate data entries so that each data entry retains only the latest version of the data.
Optionally, the system further comprises:
the adjustment module is used for dynamically adjusting the size of the posting list according to the data load;
and the merging module is used for formulating a dynamic merging strategy, and, based on the dynamic merging strategy, executing posting-list merge operations in batches within a preset time based on the hash fingerprint of each token.
Optionally, encoding the data to be stored with a first static hash function includes:
examining the tokens and identifying sensitive tokens;
mapping each sensitive token to a non-personal-identifier reference;
and encoding the reference using the first static hash function.
Optionally, encoding the reference using a first static hash function includes:
constructing a first static hash function, wherein the first static hash function is SHA-256 or MD5;
converting the sensitive token into a hash value by using the first static hash function;
and constructing a posting list corresponding to the sensitive token by using the hash value.
Optionally, the system further comprises:
the evaluation module is used for evaluating the validity period of the first static hash function;
and the replacing module is used for replacing the first static hash function by using a second static hash function before the expiration of the validity period.
Optionally, the system further comprises:
and the compression module is used for compressing the encoded data to be stored.
The application provides a virtual machine-based secure cloud storage monitoring system that introduces a hash-based inverted index structure. This dedicated hash-based memory data structure greatly improves the storage and retrieval efficiency of enterprise unstructured data, making it well suited to managing massive amounts of enterprise unstructured data. In addition, the embodiment of the application encodes the data to be stored with a static hash function, providing an irreversible and only indirectly associated reference (identifier) for privacy-sensitive tokens, which greatly improves the security and privacy protection of the stored data and meets the compliance requirements of enterprise data management.
The system of the embodiments of the present disclosure may perform the method provided by the embodiments of the present disclosure, and implementation principles thereof are similar, and actions performed by each module in the system of each embodiment of the present disclosure correspond to steps in the method of each embodiment of the present disclosure, and detailed functional descriptions of each module of the system may be specifically referred to descriptions in the corresponding method shown in the foregoing, which are not repeated herein.
The foregoing is merely an optional implementation manner of some implementation scenarios of the disclosure, and it should be noted that, for those skilled in the art, other similar implementation manners based on the technical ideas of the disclosure may be adopted without departing from the technical ideas of the scheme of the disclosure, which also belongs to the protection scope of the embodiments of the disclosure.

Claims (10)

1. A secure cloud storage monitoring method based on a virtual machine, the method comprising:
the virtual machine acquires data to be stored;
establishing a memory data structure based on hash, and optimizing the data to be stored based on the memory data structure;
And encoding the data to be stored by using a first static hash function, and storing the encoded data to be stored.
2. The method of claim 1, wherein establishing a hash-based memory data structure, based on which the data to be stored is optimized, comprises:
establishing a hash-based inverted index structure, and building an inverted index over the data to be stored;
and setting up a monitoring point, continuously tracking the inflow rate and data pattern of the data to be stored, and merging redundant posting lists online, wherein a posting list describes the position and frequency information of any token in the inverted index structure.
3. The method of claim 2, wherein creating a hash-based inverted index structure for inverted indexing the data to be stored comprises:
decomposing the data to be stored into a plurality of tokens;
generating a hash fingerprint of each token by using a hash algorithm, wherein the hash algorithm is MurmurHash or CityHash;
and adding the hash fingerprint and corresponding ID of each token to the inverted list.
4. The method of claim 2, wherein merging redundant posting lists online comprises:
identifying duplicate data entries in the posting list based on the hash fingerprint of each token;
and merging the duplicate data entries so that each data entry retains only the latest version of the data.
5. The method according to claim 4, wherein the method further comprises:
dynamically adjusting the size of the posting list according to the data load;
and formulating a dynamic merging strategy, and, based on the dynamic merging strategy, executing posting-list merge operations in batches within a preset time based on the hash fingerprint of each token.
6. A method according to claim 3, wherein encoding the data to be stored using a first static hash function comprises:
examining the tokens and identifying sensitive tokens;
mapping each sensitive token to a non-personal-identifier reference;
and encoding the reference using the first static hash function.
7. The method of claim 6, wherein encoding the reference using a first static hash function comprises:
Constructing a first static hash function, wherein the first static hash function is SHA-256 or MD5;
converting the sensitive token into a hash value by using the first static hash function;
and constructing a posting list corresponding to the sensitive token by using the hash value.
8. The method of claim 7, wherein the method further comprises:
evaluating a validity period of the first static hash function;
And replacing the first static hash function with a second static hash function before the expiration of the validity period.
9. The method of claim 1, wherein prior to storing the encoded data to be stored, the method further comprises:
and compressing the encoded data to be stored.
10. A virtual machine-based secure cloud storage monitoring system, comprising:
The acquisition module is used for acquiring data to be stored;
the optimizing module is used for establishing a memory data structure based on hash and optimizing the data to be stored;
And the encoding module is used for encoding the data to be stored by utilizing a first static hash function and storing the encoded data to be stored.
CN202410500195.XA 2024-03-26 2024-04-24 Virtual machine-based security cloud storage monitoring system and method Pending CN118277620A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2024103496534 2024-03-26

Publications (1)

Publication Number Publication Date
CN118277620A (en) 2024-07-02


Similar Documents

Publication Publication Date Title
US8543555B2 (en) Dictionary for data deduplication
US8344916B2 (en) System and method for simplifying transmission in parallel computing system
US7702640B1 (en) Stratified unbalanced trees for indexing of data items within a computer system
US20190050419A1 (en) De-duplicating distributed file system using cloud-based object store
US9361337B1 (en) System for organizing and fast searching of massive amounts of data
US8423520B2 (en) Methods and apparatus for efficient compression and deduplication
US9262432B2 (en) Scalable mechanism for detection of commonality in a deduplicated data set
CN108255647B (en) High-speed data backup method under samba server cluster
US10339124B2 (en) Data fingerprint strengthening
CN110727663A (en) Data cleaning method, device, equipment and medium
Franke et al. Parallel Privacy-preserving Record Linkage using LSH-based Blocking.
Brengel et al. {YARIX}: Scalable {YARA-based} malware intelligence
Xu et al. Reducing replication bandwidth for distributed document databases
Moia et al. Similarity digest search: A survey and comparative analysis of strategies to perform known file filtering using approximate matching
Patgiri et al. Hunting the pertinency of bloom filter in computer networking and beyond: A survey
CN117453646A (en) Kernel log combined compression and query method integrating semantics and deep neural network
CN110598467A (en) Memory data block integrity checking method
CN118277620A (en) Virtual machine-based security cloud storage monitoring system and method
Kumar et al. Differential Evolution based bucket indexed data deduplication for big data storage
Vikraman et al. A study on various data de-duplication systems
Abdulsalam et al. Evaluation of Two Thresholds Two Divisor Chunking Algorithm Using Rabin Finger print, Adler, and SHA1 Hashing Algorithms
Xiao et al. A Secure Lossless Redundancy Elimination Scheme With Semantic Awareness for Cloud-Assisted Health Systems
Hua Cheetah: An efficient flat addressing scheme for fast query services in cloud computing
Kumar et al. Comparative analysis of deduplication techniques for enhancing storage space
US12014169B2 (en) Software recognition using tree-structured pattern matching rules for software asset management

Legal Events

Date Code Title Description
PB01 Publication