CN113468561B

CN113468561B - Data protection method, device and server

Info

Publication number: CN113468561B
Application number: CN202110680699.0A
Authority: CN
Inventors: 宋晓峰
Original assignee: Baowan Capital Management Co ltd
Current assignee: Baowan Capital Management Co ltd
Priority date: 2021-06-18
Filing date: 2021-06-18
Publication date: 2024-04-23
Anticipated expiration: 2041-06-18
Also published as: CN113468561A

Abstract

An embodiment of the application relates to a data protection method comprising the following steps: partitioning all data in a public cloud according to the data attribute of each piece of data, and determining a region corresponding to each data attribute; and determining a protection mode for the region corresponding to each data attribute, and protecting all data in the public cloud based on those protection modes. By partitioning the data in the public cloud by data attribute and determining a protection mode for each resulting region, the method can improve both the efficiency of cloud data encryption and data security.

Description

Data protection method, device and server

Technical Field

The embodiment of the application relates to the technical field of data security, in particular to a data protection method, a data protection device and a server.

Background

With the continued development of big data and cloud computing technology, data resources have acquired extremely important economic and strategic attributes and have become important resources bearing on the interests and security of organizations and individuals. While the value of data resources is widely recognized, data resource security also faces a variety of threats and challenges. With the development of cloud computing in recent years, a large number of encryption and decryption operations that used to be performed on mobile devices can now be assisted by a cloud server, so that a user (an encryptor or decryptor) can hand complex operations over to the powerful computing capability of the cloud server. When the user side obtains the calculation result returned by the server, it combines the result to synthesize the ciphertext or recover the plaintext. From an information security standpoint, however, encryption and decryption performed in this way must ensure the integrity of the parameters transmitted among the sender, the receiver and the cloud server; otherwise, a wrong ciphertext or plaintext is generated.

Data encryption and access control are common methods of protecting data security in a cloud environment. Access control restricts users to the data content that matches their rights, and data encryption prevents confidential data content from being read by unauthorized persons. If only access control is used to protect the data, confidential content may still leak because of risks specific to the cloud environment; data encryption ensures that the content remains secure even if the data is inadvertently leaked.

The conventional data encryption method in a cloud environment is to encrypt the data in full, store the encrypted data in a cloud data center, and decrypt it when the user needs it. First, repeatedly encrypting and decrypting a large amount of data severely consumes system resources (such as CPU and memory), resulting in heavy system load and reduced system performance. Second, encryption and decryption may be repeated while the user accesses the data, giving the cloud provider or a malicious user a chance to steal the user's data. In addition, encrypted data appears as meaningless garbled text and is therefore hard to search and query; for a cloud provider, encrypted data not only greatly increases resource consumption and the complexity of key management, but also hinders subsequent business analysis.

In the process of implementing the present application, the applicant finds that at least the following problems exist in the related art:

Cloud data encryption is inefficient and insufficiently secure.

Disclosure of Invention

The embodiment of the application aims to provide a data protection method, a data protection device and a server so as to improve the data security of a cloud.

In a first aspect, an embodiment of the present application provides a data protection method, where the method includes:

partitioning all the data in the public cloud according to the data attribute of each data in the public cloud, and determining a region corresponding to each data attribute;

and determining a protection mode of the area corresponding to each data attribute, and protecting all the data in the public cloud based on the protection mode of the area corresponding to each data attribute.

In some embodiments, the data attributes include identification attributes, quasi-identification attributes, sensitive attributes, and non-identification attributes;

The determining the protection mode of the area corresponding to each data attribute comprises the following steps:

If the data attribute is the identification attribute, determining a protection mode of an area corresponding to the identification attribute as a recoding mode;

If the data attribute is a quasi-identification attribute or a sensitive attribute, determining that the protection mode of the area corresponding to the quasi-identification attribute or the sensitive attribute is a symmetric encryption mode combined with an anonymization mode;

if the data attribute is a non-identification attribute, the data of the area corresponding to the non-identification attribute is not processed.
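The mapping from data attribute to protection mode described above can be sketched as a small lookup. This is a minimal illustration only: the names `DataAttribute`, `ProtectionMode`, `protection_mode` and `partition_by_attribute` are hypothetical and do not come from the application, and the sketch assumes each column has already been labeled with its attribute.

```python
from enum import Enum
from typing import Dict, List


class DataAttribute(Enum):
    IDENTIFICATION = "identification"
    QUASI_IDENTIFICATION = "quasi-identification"
    SENSITIVE = "sensitive"
    NON_IDENTIFICATION = "non-identification"


class ProtectionMode(Enum):
    RECODE = "recoding"
    ENCRYPT_AND_ANONYMIZE = "symmetric encryption combined with anonymization"
    NONE = "no processing"


def protection_mode(attribute: DataAttribute) -> ProtectionMode:
    """Return the protection mode of the region holding this attribute."""
    if attribute is DataAttribute.IDENTIFICATION:
        return ProtectionMode.RECODE
    if attribute in (DataAttribute.QUASI_IDENTIFICATION, DataAttribute.SENSITIVE):
        return ProtectionMode.ENCRYPT_AND_ANONYMIZE
    return ProtectionMode.NONE


def partition_by_attribute(columns: Dict[str, DataAttribute]) -> Dict[DataAttribute, List[str]]:
    """Group column names into one region per data attribute."""
    regions: Dict[DataAttribute, List[str]] = {attr: [] for attr in DataAttribute}
    for name, attr in columns.items():
        regions[attr].append(name)
    return regions
```

Keeping the partitioning and the mode selection as separate functions mirrors the two steps of the method: regions are determined first, and a protection mode is then chosen per region.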

In some embodiments, the symmetric encryption mode in combination with anonymization mode comprises:

copying the data with the data attribute as the quasi-identification attribute or the sensitive attribute to generate a first data set and a second data set;

Anonymizing the first dataset;

and carrying out symmetrical encryption processing on the second data set.
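The copy-then-process step above can be sketched as follows. The function name `protect_region` and the two callables are hypothetical: the application does not name a specific cipher or anonymization routine here, so both are passed in by the caller, and a real deployment would supply a vetted symmetric cipher for `encrypt`.

```python
import copy
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


def protect_region(
    records: List[Record],
    anonymize: Callable[[Record], Record],
    encrypt: Callable[[Record], bytes],
) -> Tuple[List[Record], List[bytes]]:
    """Copy the records twice, anonymize one copy and encrypt the other."""
    first_data_set = copy.deepcopy(records)   # copy to be anonymized
    second_data_set = copy.deepcopy(records)  # copy to be symmetrically encrypted
    anonymized = [anonymize(record) for record in first_data_set]
    encrypted = [encrypt(record) for record in second_data_set]
    return anonymized, encrypted
```

Because both data sets are deep copies, the original records are left untouched, and the anonymized copy can be queried while the encrypted copy preserves the original values.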

In some embodiments, the anonymizing the first dataset comprises:

and anonymizing the first data set based on a firefly swarm optimization algorithm.

In some embodiments, the firefly swarm optimization algorithm comprises: an initialization group center evolution stage;

the initialization group center evolution stage comprises the following steps:

loading the first data set, assigning a firefly to each piece of data, and setting the firefly parameters to be used, wherein the firefly parameters comprise the anonymity number k, an initial radius, an initial brightness, a brightness attenuation coefficient, a proportionality constant and the number of loops;

entering the loop;

Updating the brightness of each firefly;

searching the fluorescence value of the neighbor in the initial radius of each firefly;

determining a target firefly corresponding to each firefly, and moving to the target firefly;

updating the radius;

returning to the start of the loop until the set number of loops has been executed, which completes the firefly evolution process;

outputting the data values of the fireflies from the last loop, wherein the data values comprise the identifier and data of each firefly, its fluorescence value in the last loop, its radius and the set of members within its radius;

and determining the data values of the fireflies from the last loop as the evolution result of the initialization group center evolution stage.
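The loop above can be sketched in Python under several stated assumptions, since the text does not give the update formulas. In this sketch the brightness update is `(1 - decay) * brightness + gain * (number of neighbors)`, where using the neighbor count as the objective is an assumption; each firefly moves a fixed `step` toward its brightest brighter neighbor (a deterministic simplification); and the radius is adjusted toward having `k` neighbors, capped at the initial radius. The names `Firefly`, `neighbors`, `evolve`, `step` and `beta` are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Firefly:
    ident: int
    data: Tuple[float, ...]   # original record values, never modified
    position: List[float]     # current location, moved during evolution
    brightness: float
    radius: float
    members: Set[int] = field(default_factory=set)


def neighbors(fly: Firefly, flies: List[Firefly]) -> Set[int]:
    """Identifiers of the other fireflies inside this firefly's radius."""
    return {g.ident for g in flies
            if g is not fly and math.dist(fly.position, g.position) <= fly.radius}


def evolve(data, k, radius0, brightness0, decay, gain, loops, step=0.1, beta=0.05):
    flies = [Firefly(i, tuple(row), list(row), brightness0, radius0)
             for i, row in enumerate(data)]
    for _ in range(loops):
        # Update brightness; the neighbor count is the assumed objective.
        for f in flies:
            f.members = neighbors(f, flies)
            f.brightness = (1.0 - decay) * f.brightness + gain * len(f.members)
        # Pick the brightest brighter neighbor as the target of each firefly.
        targets = []
        for f in flies:
            brighter = [flies[j] for j in f.members if flies[j].brightness > f.brightness]
            if brighter:
                best = max(brighter, key=lambda g: (g.brightness, -g.ident))
                targets.append((f, list(best.position)))
        # Move each firefly at most `step` toward its target.
        for f, goal in targets:
            gap = math.dist(f.position, goal)
            if gap > 0:
                move = min(step, gap)
                f.position = [p + move * (q - p) / gap for p, q in zip(f.position, goal)]
        # Update the radius toward k neighbors, never beyond the initial radius.
        for f in flies:
            f.radius = min(radius0, max(0.0, f.radius + beta * (k - len(f.members))))
    for f in flies:
        f.members = neighbors(f, flies)  # member sets for the final positions
    return flies
```

The returned objects carry the identifier, data, brightness, radius and intra-radius member set of each firefly, matching the output listed above; the `position` field is an addition of this sketch so that the original `data` stays intact while fireflies move.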

In some embodiments, the firefly swarm optimization algorithm further comprises: a data classification stage;

wherein the data classification stage comprises a successive classification stage, which specifically comprises:

obtaining the evolution result of the initialization group center evolution stage, and sorting all fireflies by fluorescence brightness from high to low;

starting from the firefly with the highest fluorescence brightness, counting the fireflies within its radius that have not yet been assigned to a cluster; if at least k-1 such unassigned fireflies lie within the radius of the firefly, taking the firefly as the starting point of a cluster and selecting, from all neighbor fireflies within the radius, the k-1 pieces of data with the minimum information loss to establish a k-anonymous cluster; if fewer than k-1 pieces of data lie within the radius of the selected firefly, skipping the firefly and continuing the search with the firefly of next-highest fluorescence brightness, until all fireflies have been searched;

and outputting a classification result of the successive classification stage.
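The successive classification stage can be sketched as a greedy pass over the fireflies in brightness order. The information loss measure is not defined in this passage, so `info_loss` below is a hypothetical stand-in (the sum of per-attribute value ranges in a cluster), and neighbors are added one at a time rather than chosen jointly. The names `Fly` and `successive_classification` are likewise illustrative.

```python
import math
from typing import List, NamedTuple, Sequence, Set, Tuple


class Fly(NamedTuple):
    ident: int
    data: Tuple[float, ...]
    brightness: float
    radius: float


def info_loss(rows: Sequence[Sequence[float]]) -> float:
    """Hypothetical loss: sum over attributes of the value range in the cluster."""
    return sum(max(col) - min(col) for col in zip(*rows))


def successive_classification(flies: List[Fly], k: int):
    by_id = {f.ident: f for f in flies}
    unassigned: Set[int] = set(by_id)
    clusters: List[List[int]] = []
    for seed in sorted(flies, key=lambda f: f.brightness, reverse=True):
        if seed.ident not in unassigned:
            continue
        candidates = [i for i in unassigned
                      if i != seed.ident
                      and math.dist(seed.data, by_id[i].data) <= seed.radius]
        if len(candidates) < k - 1:
            continue  # too few unassigned neighbors: skip this firefly
        cluster = [seed.ident]
        for _ in range(k - 1):
            # Add the neighbor that keeps the cluster's information loss lowest.
            best = min(candidates,
                       key=lambda i: (info_loss([by_id[j].data for j in cluster + [i]]), i))
            cluster.append(best)
            candidates.remove(best)
        unassigned.difference_update(cluster)
        clusters.append(cluster)
    return clusters, sorted(unassigned)
```

The second return value lists the fireflies that were skipped or never selected, which is the input that a later stage would have to handle.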

In some embodiments, the data classification stage further comprises a remaining data processing stage, which specifically comprises the following steps:

combining the fireflies that were not assigned in the successive classification stage into a residual data set;

loading the residual data set, obtaining the fluorescence brightness of each firefly in the residual data set, and setting the anonymity number;

sorting all fireflies that were not assigned in the successive classification stage by fluorescence brightness from high to low;

selecting the firefly with the highest fluorescence brightness as the initial point of a cluster;

adding data one by one in order of minimum information loss until the anonymity number is satisfied;

repeatedly selecting, from the remaining fireflies, the firefly with the highest fluorescence brightness as a cluster initial point, until the remaining fireflies can no longer satisfy the anonymity number;

adding each still-unassigned firefly to the closest of the clusters established in the remaining data processing stage;

and outputting the classification result of the remaining data processing stage.
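A minimal sketch of the remaining data processing stage follows. As assumptions of this sketch, information loss is again measured by a hypothetical `info_loss` (sum of per-attribute ranges), "closest" is taken to mean the smallest distance to a cluster centroid, and fireflies are returned unassigned if no cluster could be formed at all, a case the text does not address. The name `process_residual` is illustrative.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, ...]


def info_loss(rows) -> float:
    """Hypothetical loss: sum over attributes of the value range in the cluster."""
    return sum(max(col) - min(col) for col in zip(*rows))


def centroid(rows) -> Point:
    return tuple(sum(col) / len(col) for col in zip(*rows))


def process_residual(data: Dict[int, Point], brightness: Dict[int, float], k: int):
    # Sort the residual fireflies by brightness, highest first.
    remaining = sorted(data, key=lambda i: brightness[i], reverse=True)
    clusters: List[List[int]] = []
    while len(remaining) >= k:
        cluster = [remaining.pop(0)]  # brightest remaining firefly is the seed
        while len(cluster) < k:
            best = min(remaining,
                       key=lambda i: (info_loss([data[j] for j in cluster + [i]]), i))
            cluster.append(best)
            remaining.remove(best)
        clusters.append(cluster)
    leftovers: List[int] = []
    for i in remaining:
        if not clusters:
            leftovers.append(i)  # no cluster exists to absorb this firefly
            continue
        nearest = min(clusters,
                      key=lambda c: math.dist(data[i], centroid([data[j] for j in c])))
        nearest.append(i)
    return clusters, leftovers
```

Every cluster produced this way holds at least `k` members, so absorbing the last few fireflies into the nearest cluster keeps the k-anonymity property at the cost of slightly larger clusters.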

In a second aspect, an embodiment of the present application provides a data protection apparatus, including:

the data attribute partitioning unit is used for partitioning all the data in the public cloud according to the data attribute of each data in the public cloud and determining a region corresponding to each data attribute;

and the selective protection unit is used for determining the protection mode of the area corresponding to each data attribute and selectively protecting all the data in the public cloud based on the protection mode of the area corresponding to each data attribute.

The selective protection unit is specifically configured to:

In a third aspect, an embodiment of the present application provides a server, including:

at least one processor; and

A memory communicatively coupled to the at least one processor; wherein,

The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the data protection method according to the first aspect.

In a fourth aspect, a non-transitory computer-readable storage medium stores computer-executable instructions that, when executed by a server, cause the server to perform the data protection method according to the first aspect.

The embodiment of the application has the following beneficial effects: a data protection method is provided, comprising: partitioning all data in a public cloud according to the data attribute of each piece of data, and determining a region corresponding to each data attribute; and determining a protection mode for the region corresponding to each data attribute, and protecting all data in the public cloud based on those protection modes. By partitioning the data in the public cloud by data attribute and determining a protection mode for each resulting region, the method can improve both the efficiency of cloud data encryption and data security.

Drawings

One or more embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements, and in which the figures of the drawings are not to be taken in a limiting sense, unless otherwise indicated.

FIG. 1 is an overall schematic diagram of a data protection method according to an embodiment of the present application;

FIG. 2 is a diagram illustrating a data classification according to an embodiment of the present application;

FIG. 3 is a schematic flow chart of a data protection method according to an embodiment of the present application;

fig. 4 is a detailed flowchart of step S302 in fig. 3;

fig. 5 is a detailed flowchart of step S303 in fig. 3;

FIG. 6 is a schematic diagram of a selective data protection process according to an embodiment of the present application;

FIG. 7 is a flowchart illustrating another data protection method according to an embodiment of the present application;

FIG. 8 is a schematic diagram of a data individual at a corresponding location in a data set according to an embodiment of the present application;

FIG. 9 is a schematic flow chart of a successive classification stage according to an embodiment of the present application;

fig. 10 is a schematic structural diagram of a data protection device according to an embodiment of the present application;

FIG. 11 is a schematic diagram of a data protection device according to an embodiment of the present application;

fig. 12 is a schematic hardware structure of a server according to an embodiment of the present application.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

In the embodiment of the application, assuming that the internal storage space of the enterprise is limited and cannot hold all of its data, a hybrid architecture of public cloud and private cloud is adopted to store the enterprise data.

To protect the security of cloud data, the embodiment of the application stores high-value, highly confidential data in the private cloud and stores low-value or public data in the public cloud. That is, the data is roughly classified into two types, high risk and low risk. High-risk data is placed in the private cloud, which offers higher security, and low-risk data is placed in the public cloud, where it is available for querying and analysis. To prevent personal privacy from leaking when an unauthorized user analyzes the data with data mining techniques, while still maintaining the usability of the data to a certain extent, the scheme combines anonymization and partial encryption to protect public cloud data.

Referring to fig. 1, fig. 1 is an overall schematic diagram of a data protection method according to an embodiment of the present application;

As shown in fig. 1, data protection is divided into two stages: the first stage is a data classification stage, and the second stage is a selective data protection stage. In the data classification stage, the data is roughly divided into high-risk and low-risk data according to its risk value, and different protection measures are applied according to the degree of risk. In the selective data protection stage, the important attributes of the data to be stored in the public cloud are processed by symmetric encryption and by anonymization respectively. The protected data is then stored in the hybrid cloud environment according to its risk: high-risk data in the private cloud and low-risk data in the public cloud. High-risk data in the private cloud is accessed only by internal users, while data in the public cloud is accessed differently according to the user's authority: unauthorized users can access only data that is public or has been anonymized, and the encrypted content is accessible only to authorized users.

Data protection techniques (e.g., encryption, access control) are often used to protect important data. If all data is encrypted without differentiation, however, the large amount of computation required for encryption and decryption easily consumes additional system resources. If the data grades can be distinguished in advance, appropriate controls can be applied according to the confidentiality of each grade, and the security of the data can be further improved.

Referring to fig. 2 again, fig. 2 is a schematic diagram of data classification according to an embodiment of the application;

As shown in FIG. 2, the database is first pre-processed: in the pre-processing step, each data attribute value in the database is assigned a confidential value. In the risk classification step, the confidential values are combined with the attribute weights and a threshold to classify the data by risk, yielding high-risk and low-risk data respectively.

Referring to fig. 3 again, fig. 3 is a flow chart of a data protection method according to an embodiment of the application;

As shown in fig. 3, the data protection method includes:

step S301: acquiring data to be classified;

Specifically, the data to be classified is a database, for example an enterprise database.

Step S302: determining a risk value of each data column in the data to be classified;

Specifically, referring to fig. 4 again, fig. 4 is a detailed flowchart of step S302 in fig. 3;

As shown in fig. 4, this step S302: determining a risk value of each data column in the data to be classified, including:

step S3021: acquiring the attribute type of each data item variable in each data column;

Specifically, the attribute type of the data item variable comprises a category attribute and a numerical attribute.

Step S3022: determining the confidential value of each data item variable based on a preset value conversion function according to the attribute type;

Specifically, the preset value conversion function is used for: determining a preset interval range; and determining, within the preset interval range, the confidential value corresponding to each attribute type.

For example, the preset value conversion function assigns a confidential value to each data item variable and is represented by the following formula (1):

$V_{ij} = f_i(x_{ij})$    formula (1)

where $x_{ij}$ is the data item variable of the $j$-th data column in the $i$-th attribute, $f_i$ is the value conversion function of the $i$-th attribute, and $V_{ij}$ is the confidential value.

Specifically, for the category type attribute, the following table 1 shows:

Data number	Attribute {occupation}	Attribute confidential value
1	Agriculture, forestry, fishery and animal husbandry	60
2	Army and police	80
3	Business management	85
4	Educational service	75

TABLE 1

As Table 1 shows, the attribute contains several different occupations, and the data owner assigns each occupation a quantified value between 1 and 100 as its attribute confidential value according to its value; the larger the number, the higher the value it represents.

It will be appreciated that an attribute is a field in a two-dimensional relational table of the database, for example occupation, height or age. Each attribute has a plurality of attribute values, and a data item variable is the value of a given attribute in a sample record.

Specifically, for the numerical attributes, the following table 2 shows:

Data number	Numerical attribute {age}	Category attribute {age}
1	16	Teenagers
2	45	Young and middle-aged adults
3	67	Elderly people
4	27	Young and middle-aged adults

TABLE 2

As Table 2 shows, the numerical attribute is age. Because its values are continuously distributed, which makes direct quantification difficult, the values are divided into 4 different categories, and each category is given a quantified confidential value between 1 and 100 according to its value. For numerical values that are not easy to distinguish directly, cluster analysis can be used to calculate the distances between values as a measure of similarity, and the values are then classified by similarity. In the embodiment of the application, the quantified confidential value can be determined by means such as logistic regression, decision trees, ordinary linear regression, hierarchical analysis, cluster analysis and time series analysis.

In the embodiment of the application, cluster analysis refers to classifying the values of a numerical attribute into levels, for example dividing samples aged 17 to 65 into 4 levels: through the selection of center points and iteration, values that are numerically close are gathered together to form a level. Taking the teenagers category in Table 2 as an example, the question is how wide each interval should be so that the ranges drawn around the center points of the four levels cover the sample data to the greatest extent. It will be appreciated that the division of age should rely more on everyday experience, while telephone numbers may be more reasonably divided by area location, and so on. For a less familiar value, such as the number of family members per registered household, the question is how to project the values onto four levels, for example: very small, small, medium and large households. In that case, one considers where most of the samples are concentrated and chooses four suitable center points so that the four levels cover as much of the sample data as possible.
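As an illustration of choosing center points by iteration, the sketch below runs a one-dimensional k-means (Lloyd-style) pass over a list of ages. The text does not name a specific clustering algorithm, so k-means is a swapped-in choice, and the function names `cluster_1d` and `level_of`, the initialization rule and the sample ages are all hypothetical.

```python
from typing import List, Sequence


def cluster_1d(values: Sequence[float], levels: int = 4, iterations: int = 20) -> List[float]:
    """Return `levels` center points found by iterating assignment and averaging."""
    if len(values) < levels:
        raise ValueError("need at least as many values as levels")
    ordered = sorted(values)
    # Assumed initialization: evenly spaced picks from the sorted values.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * levels)] for i in range(levels)]
    for _ in range(iterations):
        groups: List[List[float]] = [[] for _ in centers]
        for v in ordered:
            nearest = min(range(levels), key=lambda c: abs(v - centers[c]))
            groups[nearest].append(v)
        # Move each center to the mean of its group; keep it if the group is empty.
        new_centers = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)


def level_of(value: float, centers: Sequence[float]) -> int:
    """Index of the level whose center point is closest to the value."""
    return min(range(len(centers)), key=lambda c: abs(value - centers[c]))
```

For example, with the made-up ages `[17, 18, 19, 30, 31, 32, 45, 46, 47, 63, 64, 65]`, the four center points settle on the middle of each group, and `level_of` then maps any age to one of the four levels.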

In an embodiment of the present application, the attribute types include a numerical attribute and a category attribute, and determining the confidential value of each data item variable based on a preset value conversion function according to the attribute type includes:

If the attribute type is a numerical attribute, converting the numerical attribute into a category attribute, and determining the confidential value of each data item variable based on a preset value conversion function;

if the attribute type is a category type attribute, determining the confidential value of each data item variable directly based on a preset value conversion function.

Specifically, different value conversion functions are determined for different attribute types. For a category type attribute, the application adopts a graded classification method: the attribute is divided into a plurality of grades or categories, and a corresponding confidential value is defined for each grade or category in advance. For a numerical type attribute, the attribute is first converted into a category attribute, and the confidential value is then calculated using the graded classification method. The graded classification method classifies the data samples, and the confidential value corresponding to each class is defined by multiple decision sequences or by empirical values.
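The two cases can be sketched as per-attribute lookup functions in the role of $f_i$. The occupation values come from Table 1; the age boundaries (18 and 65), the confidential values for the age categories, the middle category label and all function names are hypothetical, chosen only so that the ages in Table 2 fall into the categories shown there.

```python
from typing import Callable, Dict

# Confidential values for the category attribute {occupation}, from Table 1.
OCCUPATION_VALUES: Dict[str, int] = {
    "Agriculture, forestry, fishery and animal husbandry": 60,
    "Army and police": 80,
    "Business management": 85,
    "Educational service": 75,
}

# Hypothetical confidential values for the age categories (not given in the text).
AGE_VALUES: Dict[str, int] = {
    "Teenagers": 40,
    "Young and middle-aged adults": 70,
    "Elderly people": 55,
}


def age_to_category(age: float) -> str:
    """Convert the numerical attribute into a category (assumed boundaries)."""
    if age < 18:
        return "Teenagers"
    if age < 65:
        return "Young and middle-aged adults"
    return "Elderly people"


def f_occupation(x: str) -> int:
    return OCCUPATION_VALUES[x]


def f_age(x: float) -> int:
    # Numerical attribute: convert to a category first, then look up its value.
    return AGE_VALUES[age_to_category(x)]


VALUE_CONVERSION: Dict[str, Callable] = {"occupation": f_occupation, "age": f_age}


def confidential_value(attribute: str, x) -> int:
    """V_ij = f_i(x_ij): apply the conversion function of the given attribute."""
    return VALUE_CONVERSION[attribute](x)
```

Calling `confidential_value("occupation", "Army and police")` yields the Table 1 value of 80, while `confidential_value("age", 45)` first maps 45 to its category and then returns that category's (assumed) value.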

Step S3023: and determining a risk value of each data column according to the confidential value of each data item variable.

Specifically, the determining the risk value of each data column according to the confidential value of each data item variable includes:

determining the security attribute corresponding to each data item variable;

and calculating a risk value corresponding to the data column of each data item variable based on the weight corresponding to each predetermined security attribute and combining the confidential value of each data item variable.

Specifically, the data to be classified includes a relational data table. Attribute weights are defined according to the relative importance of the data attributes, and different weights are given to the security attributes of the relational data table, wherein the security attributes include: identification attributes, quasi-identification attributes, sensitive attributes, and non-identification attributes.

In the embodiment of the application, the identification attribute can directly identify a person, so its leakage harms personal privacy most seriously; it therefore has the highest importance and the highest weight. The sensitive attribute comes next, followed by the quasi-identification attribute, and any attribute that belongs to none of these is a non-identification attribute. Because weights among the data attributes are not easy to assess directly, the application calculates the weights using a multi-attribute decision method. Since the trade-off ordering method is the most widely used of the multi-attribute decision methods, the application adopts it to calculate the weights. Specifically, the trade-off ordering method includes the compromise solution (VIKOR).

The key to determining the weights in the compromise solution (VIKOR) is the ordering of the attributes and the number of attributes. If there are $n$ attributes ordered as $A_1, A_2, \ldots, A_n$, with relative weights $w_1, w_2, \ldots, w_n$ in sequence, then $1 > w_1 \ge w_2 \ge \cdots \ge w_n > 0$ is satisfied and the sum of all the weights is 1, where $w_n$ represents the weight of the $n$-th attribute. With $n$ the total number of attributes, the weight of the $k$-th attribute may be determined according to the following equation (2):

From equation (2):

……

In the embodiment of the application, to speed up the calculation of the weights, the application establishes in advance a table of ordered weights for different numbers of attributes and obtains the weights directly by table lookup. Risk classification in the present application generates a corresponding confidential value $V_{ij}$ for each data item of the data column attributes, multiplies each confidential value $V_{ij}$ by the corresponding attribute weight $W_k$, and sums the products to obtain the risk value $R_j$ of the data column, as shown in the following formula (3):

where $n$ is the number of attributes contained in each piece of data. Thus, if there are $m$ data rows, the risk values $R_1, R_2, \ldots, R_m$ of the $m$ data rows can be generated.
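Read together with the expanded formulas (4) to (6), the weighted sum described for formula (3) can be written compactly as below. This is a reconstruction from the surrounding definitions rather than a quotation, with $w_i$ denoting the weight of the $i$-th attribute (the quantity the text also writes as $W_k$):

```latex
R_j = \sum_{i=1}^{n} w_i \times V_{ij}, \qquad j = 1, 2, \ldots, m
```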

Step S303: according to the risk value, carrying out risk classification on the data to be classified, and determining a classification result;

specifically, referring to fig. 5 again, fig. 5 is a detailed flowchart of step S303 in fig. 3;

as shown in fig. 5, this step S303: according to the risk value, carrying out risk classification on the data to be classified, and determining a classification result, wherein the method comprises the following steps:

Step S3031: calculating a risk value of each data in the data to be classified;

Specifically, a risk value is calculated for each piece of data, i.e., the risk values $R_1, R_2, \ldots, R_m$ of the $m$ data rows.

Step S3032: judging whether the risk value of a certain data is larger than a preset risk threshold value or not;

Specifically, after the risk values of all the data columns have been calculated, the data is classified into two major categories, high-risk and low-risk data, according to a preset risk threshold $T_A$.

In the embodiment of the present application, the risk threshold depends on the manager's appetite for risk and on experience. For example, in a finance company handling customer profile records, if a customer information record with a risk value higher than 120 (an assumed value) is regarded as confidential data whose sample data is not suitable for release, the risk threshold is set to 120.

If the risk value of a data row is higher than the risk threshold $T_A$, the data is high-risk data; if the risk value of the data row is less than or equal to the risk threshold $T_A$, the data is low-risk data. For example, suppose the data table includes $n$ attributes $A$ and $m$ data columns $t$, and the weights of the attributes $A_1$ to $A_n$ are $w_1, w_2, \ldots, w_n$ in order. After calculation by the value conversion function of formula (1), the confidential value of the data item variable of attribute $A_1$ in data column $t_1$ is $V_{11}$, that of attribute $A_2$ is $V_{21}$, and that of attribute $A_n$ is $V_{n1}$. The risk values of the data columns $t_1$, $t_2$ and $t_m$ are then calculated as shown in the following formulas (4) to (6):

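$R_1 = w_1 \times V_{11} + w_2 \times V_{21} + \cdots + w_n \times V_{n1}$    formula (4)

$R_2 = w_1 \times V_{12} + w_2 \times V_{22} + \cdots + w_n \times V_{n2}$    formula (5)

$R_m = w_1 \times V_{1m} + w_2 \times V_{2m} + \cdots + w_n \times V_{nm}$    formula (6)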
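The weighted sums in formulas (4) to (6) can be computed with a short helper. The function name `risk_values` and the sample weights and confidential values below are made up for illustration; `values[i][j]` plays the role of $V_{ij}$, the confidential value of attribute $i$ in data column $j$.

```python
from typing import List, Sequence


def risk_values(weights: Sequence[float], values: Sequence[Sequence[float]]) -> List[float]:
    """Return [R_1, ..., R_m], where R_j is the sum over i of w_i * V_ij."""
    n = len(weights)
    if len(values) != n:
        raise ValueError("one row of confidential values is needed per attribute")
    m = len(values[0])
    if any(len(row) != m for row in values):
        raise ValueError("every attribute needs a value for each data column")
    return [sum(weights[i] * values[i][j] for i in range(n)) for j in range(m)]


# Illustrative numbers only: three attributes, two data columns.
weights = [0.5, 0.25, 0.25]          # satisfies 1 > w1 >= w2 >= w3 > 0, sum = 1
values = [[60, 80], [85, 75], [40, 90]]
print(risk_values(weights, values))  # [61.25, 81.25]
```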

To sum up, a risk value of each data in the data to be classified is calculated to obtain a data risk table, as shown in the following table 3:

TABLE 3

Step S3033: determining the classification result as high risk data;

if the risk value of a certain data is greater than or equal to a preset risk threshold value, determining that the classification result is high risk data;

step S3034: determining the classification result as low risk data;

if the risk value of a certain data is smaller than the preset risk threshold value, determining the classification result as low-risk data.

Step S304: selectively protecting the data to be classified according to the classification result;

specifically, if the classification result of a certain data is high risk data, storing the data through private cloud; if the classification result of a certain data is low risk data, the data is stored through public cloud.
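Steps S3031 to S304 amount to a threshold comparison followed by a storage decision, which can be sketched as below. The function names and record labels are hypothetical. The text states the boundary case both ways; this sketch treats a risk value equal to the threshold as high risk, following steps S3033 and S3034, and that choice is an assumption.

```python
from typing import Dict, List


def classify(risk_value: float, threshold: float) -> str:
    """Classify one record; equality counts as high risk (assumed, per S3033)."""
    return "high risk" if risk_value >= threshold else "low risk"


def assign_storage(risk_by_record: Dict[str, float], threshold: float) -> Dict[str, List[str]]:
    """Send high-risk records to the private cloud and the rest to the public cloud."""
    storage: Dict[str, List[str]] = {"private cloud": [], "public cloud": []}
    for record, risk in risk_by_record.items():
        target = "private cloud" if classify(risk, threshold) == "high risk" else "public cloud"
        storage[target].append(record)
    return storage
```

With the example threshold of 120 from the text and made-up risk values `{"t1": 135.0, "t2": 120.0, "t3": 81.25}`, records `t1` and `t2` would go to the private cloud and `t3` to the public cloud.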

In an embodiment of the present application, a data protection method is provided, where the data protection method includes: acquiring data to be classified; determining a risk value of each data column in the data to be classified; according to the risk value, carrying out risk classification on the data to be classified, and determining a classification result; and according to the classification result, selectively protecting the data to be classified. On the one hand, performing risk classification on the data to be classified by risk value determines the classification result of each piece of data; on the other hand, selectively protecting the data according to the classification result improves data encryption efficiency and information retrieval efficiency.

To protect the security of the data in the public cloud while preserving its availability, the application further uses a firefly swarm optimization method to improve on K-member in selecting the initial values of the clusters and searching for suitable data, so as to achieve the k-anonymity goal, and uses symmetric encryption for selective data protection. The application encrypts the important attributes with symmetric encryption, which has the advantage of fast encryption and decryption, to save processing time on huge volumes of cloud data. In addition, anonymization is used to retain the generalized semantic information of the data, so that users can conveniently query the content and analyze the data.

Specifically, referring to fig. 6, fig. 6 is a schematic flow chart of selective data protection according to an embodiment of the present application;

As shown in fig. 6, the selective data protection process includes:

Firstly, carrying out data attribute partitioning on public cloud data, and dividing the public cloud data into four attributes according to the attributes of the data, wherein the four attributes are respectively an identification attribute, a quasi-identification attribute, a sensitive attribute and a non-identification attribute.

For the identification attribute: because an identification attribute can directly identify a person, leaving it unprocessed makes the identity easy to leak, while deleting it may reduce the usability of the data. The application therefore protects it by recoding, for example by recoding the name or identification number as A001, A002, and so on.

For the quasi-identification attribute and the sensitive attribute, the attribute content is processed with both symmetric encryption and anonymization: the attribute content is copied into two parts, one of which is anonymized and the other symmetrically encrypted. For example, the precision of the released information is controlled by choosing among anonymization techniques, adjusting the level of numerical generalization, and so on; the anonymized content is provided for query and data analysis, while the original values are protected by symmetric encryption and can only be downloaded and decrypted by authorized users, which avoids the privacy leakage that direct decryption in the public cloud would cause.

Non-identifying attributes are left unprocessed, which reduces the amount of data that has to be processed.
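The per-attribute handling described above can be illustrated with a minimal Python sketch. The column names, the `ATTRIBUTE_TYPE` mapping, the `generalize` rule (bands of width 10 for integers, suppression for everything else) and the `_anon`/`_enc` column suffixes are all assumptions made for illustration; the application does not prescribe them. The symmetric cipher is passed in as a callable rather than fixed here.

```python
from typing import Callable, Dict, List

# Hypothetical column-to-attribute mapping for an example table.
ATTRIBUTE_TYPE: Dict[str, str] = {
    "name": "identifying",
    "age": "quasi",
    "zipcode": "quasi",
    "diagnosis": "sensitive",
    "visit_count": "non_identifying",
}


def generalize(value: object, level: int = 1) -> str:
    """Assumed anonymization rule: integers become bands of width 10**level,
    any other value is suppressed to '*'."""
    if isinstance(value, int):
        width = 10 ** level
        low = (value // width) * width
        return f"{low}-{low + width - 1}"
    return "*"


def protect(rows: List[dict], encrypt: Callable[[bytes], bytes]) -> List[dict]:
    """Apply one protection mode per attribute type to every row."""
    codes: Dict[object, str] = {}  # original identifier -> recoded label
    protected_rows = []
    for row in rows:
        out = {}
        for column, value in row.items():
            kind = ATTRIBUTE_TYPE[column]
            if kind == "identifying":
                # Recode as A001, A002, ... in order of first appearance.
                if value not in codes:
                    codes[value] = f"A{len(codes) + 1:03d}"
                out[column] = codes[value]
            elif kind in ("quasi", "sensitive"):
                # Two copies: an anonymized one for queries and an
                # encrypted copy of the original value.
                out[column + "_anon"] = generalize(value)
                out[column + "_enc"] = encrypt(str(value).encode("utf-8"))
            else:
                out[column] = value  # non-identifying: left unprocessed
        protected_rows.append(out)
    return protected_rows
```

The `encrypt` argument could be, for example, the `encrypt` method of a `cryptography.fernet.Fernet` instance; the choice of cipher and the key management are outside this sketch.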

Specifically, referring to fig. 7 again, fig. 7 is a flow chart of another data protection method according to an embodiment of the present application;

as shown in fig. 7, the data protection method includes:

step S701: partitioning all the data in the public cloud according to the data attribute of each data in the public cloud, and determining a region corresponding to each data attribute;

specifically, the data attributes comprise identification attributes, quasi-identification attributes, sensitive attributes and non-identification attributes;

step S702: and determining a protection mode of the area corresponding to each data attribute, and protecting all the data in the public cloud based on the protection mode of the area corresponding to each data attribute.

Specifically, the symmetric encryption mode and the anonymization mode are combined, and the method comprises the following steps:

Anonymizing the first dataset;

and carrying out symmetrical encryption processing on the second data set.

Specifically, the anonymizing the first dataset includes:

Wherein, the firefly swarm optimization algorithm comprises the following steps: initializing a group center evolution stage and a data classification stage;

Entering a loop;

Updating the brightness of each firefly;

updating the radius;

and outputting a classification result of the successive classification stage.

Wherein, the data classification stage further comprises: the residual data processing stage specifically comprises the following steps:

Specifically, the firefly swarm optimization algorithm comprises the following steps:

(1) Initializing group center evolution stage

In the firefly swarm optimization algorithm, each piece of data in the data set has a respective agent firefly (hereinafter referred to as firefly), and the data structure of the firefly is shown in table 4 below:

Table 4

1.1, Arranging fireflies

In the firefly swarm optimization algorithm, the fireflies are not scattered randomly over the solution space; each is placed at a position derived from the characteristics of the corresponding record in the data set. In the embodiment of the application, the Non-metric MDS function of the data analysis software PAST is first used to convert every record in the data set into a corresponding position (x, y) according to its attribute values, and the position is stored as the coordinate data of the initialization Data in the firefly data structure of the application.

Non-metric MDS is a function that takes multi-dimensional attribute values as input and automatically projects them as points on a two-dimensional plot while largely preserving the distances between the multi-dimensional data. For example, the quasi-identification attribute referred to in this application is a multi-field record and therefore corresponds to multi-dimensional data; the Non-metric MDS function can be used to project it onto a two-dimensional plane so that the distances are presented effectively.

Referring to fig. 8, fig. 8 is a schematic diagram of a data unit at a corresponding position of a data set according to an embodiment of the present application;

As shown in FIG. 8, the right side shows the result of the PAST analysis of the original data table T on the left. Records 1, 3, 4, 5, 6, 8 and 9 all have the value "United-States" in the "nativecountry" field, so their positions after the analysis lie toward the lower side. Across the three field attributes, records 9 and 1 differ only in the Age field, by 3, which is a smaller difference in attribute values than that of the nearby records 4 and 6; therefore record 9 is the closest to record 1 in the data set T. After the PAST analysis, each record in the data set T has a corresponding position, and the fireflies are arranged according to these positions.

1.2 Brightness update stage

After the data set is loaded, the brightness update stage is performed first. In this stage each firefly searches for the other fireflies within its sensing radius; taking itself as the cluster start point, it forms a temporary cluster with the fireflies within the radius, uses the temporary cluster to calculate the degree of change in information loss within its radius, and finally judges from the number of members within the radius whether the k-anonymity requirement is met. The key to k-anonymization is to blur the sample data so that every group of sample data contains at least k records. Here k is the preset number of fireflies within a radius: k fireflies whose records lie very close to one another can be treated as one group of data and anonymized in the same way.

Therefore, the number of members and the degree of information loss within the radius are what determine the brightness of a firefly in each cycle. The brightness of a firefly in each cycle in this application is given by formula (7) and formula (8) below:

$$\tau_i(t) = (1-\rho)\,\tau_i(t-1) + \gamma J_i(t) \tag{7}$$

[Formula (8) is not reproduced.]

Wherein, the meaning represented by each parameter is as follows:

τ_i(t): the fluorescence brightness of firefly i in the t-th cycle;

ρ: the brightness decay factor, in the interval [0, 1];

(1-ρ): the brightness attenuation rate, which controls the proportion of past experience that is retained and keeps the brightness of a firefly from changing drastically from one cycle to the next;

γ: a proportionality constant that controls the weight given to the solution found in the current cycle;

γJ_i(t): the fitness term, where J_i(t) is the objective function value at the position of firefly i in the t-th cycle; it mainly reflects how similar the members within the radius are when firefly i is at that position;

gw_i: the group formed by the members within the radius of firefly i;

D(gw_i): the degree of information loss of the group;

|gw_i|: the number of members within the radius of firefly i;

|QI|: the number of QI (quasi-identifier) fields in the data set;

k: the k-anonymity parameter, that is, the anonymity number.

The denominator |gw_i| × |QI| is the product of the number of members within the radius of firefly i and the number of QI fields; it represents the worst-case information loss of the group, max(D(gw_i)) = |gw_i| × |QI|. The more similar the attribute values of all fireflies j within the radius of firefly i are, the smaller the ratio of D(gw_i) to this denominator; conversely, the less similar they are, the larger the ratio. The parameter k (the k-anonymity parameter) reflects whether the number of members within the radius of firefly i reaches k: if it does not, the brightness is smaller; if the number of members within the radius exceeds k, only the k-1 most similar records are needed to meet the k-anonymity requirement.

As shown in fig. 8, when k = 3, ρ = 0.6 and γ = 0.7, fireflies 4, 6 and 9 lie within the sensing radius of firefly 1 in the t-th cycle, and the fluorescence value of firefly 1 in the previous cycle is τ_i(t-1) = 0.5; its brightness in the t-th cycle then follows from formula (7) and formula (8).
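The update in formula (7) can be sketched in a few lines of Python. Because formula (8) is not available in this text, the objective value J_i(t) is simply passed in as `j_value`; the value 0.5 used in the call is a made-up input for illustration, not a result from the application.

```python
def update_brightness(tau_prev: float, rho: float, gamma: float, j_value: float) -> float:
    """Formula (7): keep a (1 - rho) share of the previous brightness and add
    the gamma-weighted objective value of the current cycle."""
    return (1.0 - rho) * tau_prev + gamma * j_value


# Parameters from the example above; j_value is a hypothetical objective value.
tau = update_brightness(tau_prev=0.5, rho=0.6, gamma=0.7, j_value=0.5)
# tau is approximately 0.4 * 0.5 + 0.7 * 0.5 = 0.55
```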

1.3 Movement phase

In the movement stage, firefly i checks whether any other firefly within its radius has a fluorescence value brighter than its own. If no firefly within the radius of firefly i is brighter than firefly i, firefly i stays in place; if there are brighter fireflies within the radius, firefly i moves toward the brightest of them (called firefly j). If two or more fireflies within the radius are equally the brightest, firefly i moves toward the one whose data is closest to its own. The movement is performed as follows:

$$X_i(t+1) = X_i(t) + s\,d_{ij}(t) \tag{10}$$

[Formula (11) is not reproduced.]

Wherein, the meaning represented by each parameter is as follows:

N_i(t): the set of fireflies within the radius of firefly i whose fluorescence value is greater than that of firefly i;

d(i, j): the distance between firefly i and firefly j;

Sensing radius of firefly i in the t-th cycle;

τ_i(t): the fluorescence value of firefly i in the t-th cycle;

τ_j(t): the fluorescence value of firefly j in the t-th cycle;

X_i(t): the position of firefly i in the t-th cycle;

d_ij(t): a unit vector that controls the direction of movement;

s: the step size.

For example: in the t-th cycle firefly i is at position gw_i(x, y) = (0.2, 0.5), and two fireflies j₁ and j₂ within its radius, at their respective positions, have fluorescence values greater than that of firefly i. Because the fluorescence value of firefly j₁ is greater than that of firefly j₂, firefly i moves a distance s toward firefly j₁, and its position after the move is X_i(t+1) = gw_i(0.2 + 0.86s, 0.5 + (-0.52s)); that is, the unit vector toward j₁ is approximately (0.86, -0.52).
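The movement rule can be sketched as follows. Since formula (11) is not available in this text, the sketch assumes that d_ij(t) is the normalized vector from firefly i to firefly j, and that a tie between equally bright neighbors is broken by Euclidean distance between positions; the function name `move` and the neighbor representation are illustrative only.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def move(position: Point, luminance: float,
         neighbors: List[Tuple[Point, float]], step: float) -> Point:
    """Return the next position of a firefly.

    neighbors holds (position, luminance) pairs of the fireflies that lie
    within the radius of the moving firefly.
    """
    brighter = [(p, lum) for p, lum in neighbors if lum > luminance]
    if not brighter:
        return position  # nobody brighter in range: stay in place
    best = max(lum for _, lum in brighter)
    # Assumed tie-break: among the brightest, take the nearest one.
    candidates = [p for p, lum in brighter if lum == best]
    target = min(candidates, key=lambda p: math.dist(position, p))
    distance = math.dist(position, target)
    if distance == 0.0:
        return position  # already at the target, no direction defined
    # Assumed unit direction vector from the firefly toward the target.
    ux = (target[0] - position[0]) / distance
    uy = (target[1] - position[1]) / distance
    return (position[0] + step * ux, position[1] + step * uy)
```

With the firefly at (0.2, 0.5) and the brightest neighbor at (1.06, -0.02), the direction computed here is close to the (0.86, -0.52) of the example above.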

1.4 Zone radius update phase

After the movement stage, the region decision radius of firefly i for the next cycle is updated according to the number of fireflies within its sensing radius. The number of fireflies within the sensing radius of firefly i in the t-th cycle determines the value of D_i(t), which is defined as the number of neighbors covered by the circular area. The higher D_i(t) is, the smaller the region decision radius of firefly i in the next cycle; conversely, the lower D_i(t) is, the larger the region decision radius in the next cycle; otherwise the region decision radius remains unchanged. The update formula for the region decision radius and the neighbor density formula are given by formula (12) and formula (13) below:

Wherein, the meaning represented by each parameter is as follows:

Initial radius of induction

N_i(t): the number of neighbors in the circular area

β: a constant, the weight of the neighbor density.

For example: initial induction radius of firefly i in the t-th cycleNi (t) =5, β=0.2, then

Specifically, the initialization group center evolution stage is described in combination with the above steps, which specifically includes the following steps:

Step 1: loading the data set and deploying the agent firefly data, and setting the relevant parameters, such as the k-anonymity parameter k, the initial radius r₀, the initial brightness τ₀, the brightness decay factor ρ, the proportionality constant γ and the number of cycles t;

Step 2: entering the loop;

Step 3: updating the brightness of the fireflies;

Step 4: examining the fluorescence values of the neighbors within the radius, and selecting the direction of travel according to the brightness of the fluorescence values or the minimum information loss after joining;

Step 5: moving toward the target firefly;

Step 6: updating the region decision radius;

Step 7: returning to Step 2, and ending the firefly evolution process after the set t cycles have been executed;

Step 8: outputting the data of each firefly from the last cycle (ID, Data, fluorescence value Luminance, radius, and the set of members within the radius), and passing the evolution result to the data classification stage.
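The order of Steps 1 to 8 can be sketched as a loop skeleton. The `Firefly` record below mirrors the fields listed in Step 8; the three callables stand in for the brightness, movement and radius rules, which are supplied by the caller because their exact formulas are not all available in this text. Updating all fireflies synchronously within a cycle is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Firefly:
    id: int
    data: tuple                      # original attribute values of the record
    position: Tuple[float, float]    # (x, y) from the two-dimensional projection
    luminance: float                 # fluorescence value
    radius: float                    # sensing radius
    members: List[int] = field(default_factory=list)  # ids within the radius


Swarm = List[Firefly]


def evolve(swarm: Swarm,
           update_brightness: Callable[[Firefly, Swarm], float],
           move: Callable[[Firefly, Swarm], Tuple[float, float]],
           update_radius: Callable[[Firefly, Swarm], float],
           cycles: int) -> Swarm:
    """Run the evolution stage for a fixed number of cycles."""
    for _ in range(cycles):                          # Steps 2 and 7
        for firefly in swarm:                        # Step 3
            firefly.luminance = update_brightness(firefly, swarm)
        # Steps 4 and 5: compute every move first, then apply them together.
        new_positions = [move(firefly, swarm) for firefly in swarm]
        for firefly, position in zip(swarm, new_positions):
            firefly.position = position
        for firefly in swarm:                        # Step 6
            firefly.radius = update_radius(firefly, swarm)
    return swarm                                     # Step 8
```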

(2) A data classification stage comprising:

2.1 stage of successive classification

After the data set has gone through the evolution stage, the agent firefly of each record holds the result of the last cycle, including the attribute data, the coordinate position, the fluorescence value of the last cycle, the fireflies within its radius in the last cycle, the sensing radius and other information. In the data classification stage, cluster start points are selected from the agent fireflies of all data in order of brightness, beginning with the brightest. In the successive classification process, the firefly selected as the start point of the current cluster first counts, within its own sensing radius, the fireflies that have not yet been assigned to another cluster. If this count reaches k-1, the firefly becomes the cluster start point, and the k-1 neighbors within the radius that give the smallest information loss are selected one by one to establish a k-anonymous cluster; if the neighbors within the radius of the selected firefly do not amount to k-1 records, the firefly is skipped and the search continues with the next brightest firefly, until all fireflies have been examined.

In the successive classification process, fireflies whose radius contains at least k-1 unassigned fireflies are selected as cluster start points and k-anonymous clusters are established for them first; once no firefly has k-1 unassigned records within its radius, the unassigned fireflies enter the residual data processing stage as residual fireflies, to be handled in a separate classification stage.

Specifically, referring to fig. 9 again, fig. 9 is a schematic flow chart of a successive classification stage according to an embodiment of the present application;

After the data set passes evolution, each firefly carries its own original data, position, fluorescence value of last cycle, sensing radius and neighbor member set to go to successive data classification stage.

As shown in fig. 9, the successive classification stage includes:

Starting;

step S901: inputting an evolution result of the previous stage;

Specifically, agent firefly information (fluorescence value, sensing radius, neighbor within sensing radius) of evolution stage and anonymous number k of k anonymity are loaded.

Step S902: sequencing from high to low according to fluorescence brightness;

Specifically, the brightness of the fluorescent value of fireflies is ranked from high to low.

Step S903: determining whether there is data that can be classified;

Specifically, starting from the firefly with the brightest fluorescence value, searching the firefly quantity which is not distributed to clusters in the radius range of the firefly;

step S904: taking the nearest K-1 data from the radius to establish a cluster;

Specifically, if the number of neighbors within the radius of the candidate start-point firefly that have not been assigned to a previous cluster does not reach k-1, the current firefly is skipped, the next brightest firefly is selected, and the count of unassigned fireflies within its radius is repeated; if the number reaches k-1, the firefly is taken as the cluster start point, and neighbors within the radius are added one by one in the way that minimizes information loss until the cluster contains k records.

These steps are repeated: the next brightest firefly is selected and the number of unassigned fireflies within its radius is counted; if that number reaches k-1, the firefly is taken as a cluster start point and neighbors within the radius are added one by one in the way that minimizes information loss until the cluster contains k records;

This continues until no firefly has k-1 unassigned fireflies within its radius, at which point step S905 is entered;

Step S905: processing the residual value;

Specifically, combining the fireflies which are not distributed into a new data set, and carrying out residual data processing;

Specifically, entering a residual data processing stage, wherein the fireflies which are remained and not distributed in the successive classification stage are mainly divided into two types, one of which is that the fireflies have low fluorescence values and are not attracted by fireflies with other brighter fluorescence values; the other is data which is not allocated because fireflies are far from the starting point of the clusters in each successive cluster establishment process.

The residual data processing stage classifies these two types of remaining data again to establish k-anonymous clusters. There are two main processing modes at this stage: k-member (hereinafter KGSO-K) and k-member with a revised cluster start point selection (hereinafter KGSO-RK). KGSO-K applies the k-member method proposed by Byun to the residual data to establish k-anonymous clusters. KGSO-RK changes how the cluster start point is picked: it selects the brightest of the residual fireflies as the new cluster start point and then selects the k-1 records with the smallest information loss to establish a cluster; once the cluster contains k records, the brightest firefly among the remaining data is selected as the next cluster start point and the selection continues, until all residual data have been assigned to a suitable cluster.

Step S906: outputting a classification result;

End.
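Steps S901 to S905 can be sketched in Python as below. The information-loss measure `info_loss` is an assumption: it takes numeric quasi-identifier fields only and multiplies the group size by the sum of each field's range divided by that field's domain width, so its maximum is the group size times the number of fields. The input dictionaries (`luminance`, `neighbors`, `records`) are illustrative stand-ins for the evolution result.

```python
from typing import Dict, List, Sequence, Tuple


def info_loss(ids: Sequence[int], records: Dict[int, Tuple[float, ...]],
              widths: Sequence[float]) -> float:
    """Assumed measure: group size times the summed normalized field ranges."""
    columns = zip(*(records[i] for i in ids))
    return len(ids) * sum((max(c) - min(c)) / w for c, w in zip(columns, widths))


def successive_classification(luminance: Dict[int, float],
                              neighbors: Dict[int, List[int]],
                              records: Dict[int, Tuple[float, ...]],
                              widths: Sequence[float],
                              k: int) -> Tuple[List[List[int]], List[int]]:
    """Return (k-anonymous clusters, ids left for residual processing)."""
    assigned = set()
    clusters: List[List[int]] = []
    # S902: visit fireflies from the brightest to the dimmest.
    for start in sorted(luminance, key=luminance.get, reverse=True):
        if start in assigned:
            continue
        # S903: neighbors within the radius that are still unassigned.
        free = [n for n in neighbors[start] if n not in assigned]
        if len(free) < k - 1:
            continue  # not enough neighbors: skip this firefly
        # S904: add neighbors one by one, smallest information loss first.
        cluster = [start]
        while len(cluster) < k:
            best = min(free, key=lambda n: info_loss(cluster + [n], records, widths))
            cluster.append(best)
            free.remove(best)
        assigned.update(cluster)
        clusters.append(cluster)
    # S905: everything still unassigned goes to residual data processing.
    residual = [i for i in luminance if i not in assigned]
    return clusters, residual
```

A single pass in brightness order is enough here, because forming later clusters can only reduce the number of unassigned neighbors of a firefly that was skipped earlier.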

The embodiment of the application provides two modes for processing residual data:

mode one: the original greedy k-member algorithm comprises the following steps:

Step 1. Loading a data set and setting the anonymity number k.

Step 2. Executing the greedy k-member algorithm.

Step 3. Outputting a classification result.

Mode two: a revised greedy k-member algorithm, comprising the following steps:

Step 1. Loading the data set and the firefly fluorescence values, and setting the anonymity number k.

Step 2. Sorting the fireflies by fluorescence value from high to low.

Step 3. Selecting the firefly with the highest fluorescence value as the cluster start point.

Step 4. Adding data one by one in the way that minimizes information loss until the cluster contains k records.

Step 5. Selecting the unassigned firefly with the highest fluorescence value as the new cluster start point.

Step 6. Repeating Step 4 and Step 5 until fewer than k unassigned records remain.

Step 7. Adding each remaining unassigned firefly to the established cluster that its data is closest to.

Step 8. Outputting a classification result.
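Mode two can be sketched as follows. As an assumption, information loss is measured on numeric quasi-identifier fields as the group size times the sum of normalized field ranges, and "closest" in Step 7 is read as the cluster whose information loss grows least when the record joins; neither choice is fixed by the text above.

```python
from typing import Dict, List, Sequence, Tuple


def info_loss(ids: Sequence[int], records: Dict[int, Tuple[float, ...]],
              widths: Sequence[float]) -> float:
    """Assumed measure: group size times the summed normalized field ranges."""
    columns = zip(*(records[i] for i in ids))
    return len(ids) * sum((max(c) - min(c)) / w for c, w in zip(columns, widths))


def revised_k_member(luminance: Dict[int, float],
                     records: Dict[int, Tuple[float, ...]],
                     widths: Sequence[float], k: int) -> List[List[int]]:
    """Cluster residual fireflies, starting each cluster at the brightest one."""
    if len(luminance) < k:
        raise ValueError("fewer than k records: no k-anonymous cluster possible")
    # Step 2: brightest first.
    unassigned = sorted(luminance, key=luminance.get, reverse=True)
    clusters: List[List[int]] = []
    while len(unassigned) >= k:                       # Step 6
        cluster = [unassigned.pop(0)]                 # Steps 3 and 5
        while len(cluster) < k:                       # Step 4
            best = min(unassigned,
                       key=lambda n: info_loss(cluster + [n], records, widths))
            unassigned.remove(best)
            cluster.append(best)
        clusters.append(cluster)
    for leftover in unassigned:                       # Step 7
        target = min(clusters,
                     key=lambda c: info_loss(c + [leftover], records, widths)
                     - info_loss(c, records, widths))
        target.append(leftover)
    return clusters                                   # Step 8
```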

The operation mode of the classification stage is based on a greedy k-member and establishes k anonymous clusters, but the application still has a plurality of differences from a k-member classification method in the data classification process:

Firstly, when a cluster is established, taking fluorescent brightness as a cluster initial point;

When a new cluster is established, k-member first selects, from the remaining unassigned data, the record farthest from the previous cluster start point as the new cluster start point. In the method of the application, both the successive classification stage and the second mode of the residual data processing stage select the start point according to the firefly fluorescence value: the brighter a firefly is, the more similar data surrounds it, so it is well suited to serve as a cluster start point.

Secondly, in the classification stage, taking neighbor fireflies within the radius range of the initial point of the cluster as objects to be added into the cluster members;

When k-member builds a cluster, it must compute the data-to-cluster distance for every record in the data set and pick the nearest record to join the cluster, until the cluster contains k records. When clusters are established in the successive classification stage of the application, only the neighbor fireflies within the radius of the cluster start point are candidates for joining the cluster, and they are selected from among those neighbors according to their data-to-cluster distance until the cluster contains k records.

In the embodiment of the application, through a two-stage information classification and selective big data information security protection mechanism, on one hand, the information classification based on the risk value can effectively distinguish information types, is convenient for implementing information access hierarchical management and control, and improves the security of information resources, and on the other hand, by adopting a selective protection measure based on classification, the information retrieval efficiency can be effectively improved, the system resource loss (such as CPU, memory and the like) in the encryption and decryption processes of a large amount of data can be reduced, and the high efficiency and the availability of information resource utilization can be improved.

Referring to fig. 10, fig. 10 is a schematic structural diagram of a data protection device according to an embodiment of the present application;

the data protection device is applied to a server, as shown in fig. 10, and the data protection device 100 includes:

a data acquisition unit 101 for acquiring data to be classified;

a risk value determining unit 102, configured to determine a risk value of each data column in the data to be classified;

A classification result unit 103, configured to perform risk classification on the data to be classified according to the risk value, and determine a classification result;

and the selective data protection unit 104 is configured to selectively protect the data to be classified according to the classification result.

It should be noted that, the device can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of executing the method. Technical details which are not described in detail in the device embodiments may be found in the methods provided by the embodiments of the present application.

In the embodiment of the application, the data to be classified are obtained; determining a risk value of each data column in the data to be classified; according to the risk value, carrying out risk classification on the data to be classified, and determining a classification result; and according to the classification result, selectively protecting the data to be classified. On one hand, the method and the device can determine the classification results of different data by performing risk classification on the data to be classified through the risk values and determining the classification results, and on the other hand, the method and the device can improve the data encryption efficiency and the information retrieval efficiency by performing selective data protection on the data to be classified through the classification results.

Referring to fig. 11, fig. 11 is a schematic structural diagram of another data protection device according to an embodiment of the present application;

Wherein, the data protection device is applied to a server, as shown in fig. 11, the data protection device 110 includes:

a data attribute partitioning unit 111, configured to partition all the data in the public cloud according to the data attribute of each data in the public cloud, and determine a region corresponding to each data attribute;

The selective protection unit 112 is configured to determine a protection mode of an area corresponding to each data attribute, and selectively protect all the data in the public cloud based on the protection mode of the area corresponding to each data attribute.

In the embodiment of the application, the data attributes comprise identification attributes, quasi-identification attributes, sensitive attributes and non-identification attributes;

the selective protection unit 112 is specifically configured to:

In the embodiment of the application, all the data in the public cloud are partitioned according to the data attribute of each data in the public cloud, and the area corresponding to each data attribute is determined; and determining a protection mode of the area corresponding to each data attribute, and protecting all the data in the public cloud based on the protection mode of the area corresponding to each data attribute. By partitioning all the data in the public cloud, determining the area corresponding to each data attribute, and further determining the protection mode of the area corresponding to each data attribute, the cloud data encryption method and the cloud data encryption device can improve the cloud data encryption efficiency and the data security.

Referring to fig. 12 again, fig. 12 is a schematic hardware structure of a server according to an embodiment of the present application;

As shown in fig. 12, the server 120 includes: one or more processors 121, and a memory 122, one processor 121 being illustrated in fig. 12.

The processor 121 and the memory 122 may be connected by a bus or otherwise, which is illustrated in fig. 12 as a bus connection.

A processor 121 for acquiring data to be classified;

determining a risk value of each data column in the data to be classified;

According to the risk value, carrying out risk classification on the data to be classified, and determining a classification result;

And selectively protecting the data to be classified according to the classification result.

Processor 121, further configured to: partitioning all the data in the public cloud according to the data attribute of each data in the public cloud, and determining a region corresponding to each data attribute;

The memory 122 is a non-volatile computer readable storage medium, and may be used to store non-volatile software programs, non-volatile computer executable programs, and modules, such as program instructions/modules corresponding to the data protection method in the embodiment of the present application. The processor 121 executes various functional applications of the controller and data processing, i.e., implements the data protection method of the above-described method embodiment, by running nonvolatile software programs, instructions, and modules stored in the memory 122.

Memory 122 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created according to the use of the controller, etc. In addition, memory 122 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, memory 122 optionally includes memory remotely located relative to processor 121, which may be connected to the controller via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The one or more modules are stored in the memory 122 that, when executed by the one or more processors 121, perform the data protection method of any of the method embodiments described above.

It should be noted that the above product may execute the method provided by the embodiment of the present application, and has the corresponding functional module and beneficial effects of executing the method. Technical details not described in detail in this embodiment may be found in the methods provided in the embodiments of the present application.

Embodiments of the present application provide a non-transitory computer readable storage medium storing computer executable instructions that are executed by one or more processors, such as the one processor 121 in fig. 12, to cause the one or more processors to perform the data protection method of any of the method embodiments described above.

The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

From the above description of embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a general purpose hardware platform, but may also be implemented by means of hardware. Those skilled in the art will appreciate that all or part of the processes implementing the methods of the above embodiments may be implemented by a computer program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and where the program may include processes implementing the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting; the technical features of the above embodiments or in the different embodiments may also be combined within the idea of the application, the steps may be implemented in any order, and there are many other variations of the different aspects of the application as described above, which are not provided in detail for the sake of brevity; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.

Claims

1. A data protection method, the method comprising:

determining a protection mode of the region corresponding to each data attribute, and protecting all data in the public cloud based on the protection mode of the region corresponding to each data attribute;

wherein the data attributes comprise identification attributes, quasi-identification attributes, sensitive attributes and non-identification attributes;

wherein combining the symmetric encryption mode and the anonymization mode comprises the following steps:

anonymizing the first data set;

performing symmetric encryption processing on the second data set;

wherein anonymizing the first data set comprises:

2. The method of claim 1, wherein determining the protection mode of the area corresponding to each data attribute further comprises:

3. The method of claim 1, wherein the firefly swarm optimization algorithm comprises: initializing a group center evolution stage;

entering a loop;

updating the brightness of each firefly;

updating the radius;

4. The method of claim 3, wherein the firefly swarm optimization algorithm further comprises: a data classification stage;

and outputting a classification result of the successive classification stage.

5. The method of claim 4, wherein the data classification stage further comprises a residual data processing stage, which specifically comprises the following steps:

6. A data protection device, the device comprising:

a selective protection unit, configured to determine the protection mode of the region corresponding to each data attribute, and to selectively protect all the data in the public cloud based on the protection mode of the region corresponding to each data attribute;

wherein the selective protection unit is specifically configured to:

anonymize the first data set;

perform symmetric encryption processing on the second data set;

wherein anonymizing the first data set comprises:

7. The device of claim 6, wherein

the selective protection unit is further configured to:

8. A server, the server comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 5.