CN113468561A

CN113468561A - Data protection method and device and server

Info

Publication number: CN113468561A
Application number: CN202110680699.0A
Authority: CN
Inventors: 宋晓峰
Original assignee: Baowan Capital Management Co ltd
Current assignee: Baowan Capital Management Co ltd
Priority date: 2021-06-18
Filing date: 2021-06-18
Publication date: 2021-10-01
Anticipated expiration: 2041-06-18

Abstract

The embodiment of the application relates to a data protection method, which comprises the following steps: partitioning all the data in the public cloud according to the data attribute of each piece of data in the public cloud, and determining an area corresponding to each data attribute; and determining the protection mode of the area corresponding to each data attribute, and protecting all the data in the public cloud based on the protection mode of the area corresponding to each data attribute. By partitioning all data in the public cloud, determining the area corresponding to each data attribute and further determining the protection mode of the area corresponding to each data attribute, the method and the device can improve the efficiency of cloud data encryption and improve data security.

Description

Data protection method and device and server

Technical Field

The embodiment of the application relates to the technical field of data security, in particular to a data protection method, a data protection device and a server.

Background

With the ongoing development of big data and cloud computing, data resources have acquired important economic and strategic value and have become key resources bearing on the interests and security of organizations and individuals. While the value of data resources is widely recognized, data security also faces various threats and challenges. As cloud computing has developed in recent years, a large share of the encryption and decryption operations formerly performed on mobile devices is now assisted by cloud servers: the user side (the encrypting or decrypting party) can offload complex operations to a cloud server with strong computing power. When the user side receives the calculation result returned by the server, it can combine that result to assemble the ciphertext or recover the plaintext. From an information security perspective, however, this mode of encryption and decryption must guarantee the integrity of the parameters transmitted among the sender, the receiver and the cloud server; otherwise an incorrect ciphertext or plaintext is generated.

Data encryption and access control are commonly used to protect data security in the cloud environment. Access control restricts each user to the data content permitted by that user's authority, and data encryption protects confidential data content from being read by unauthorized persons. If only access control is used to protect file data, confidential content may still leak because of risks specific to the cloud environment; data encryption ensures that the content remains secure even if the data is inadvertently leaked.

At present, the common data encryption method in the cloud environment is to encrypt the data in full, store it in a cloud data center, and decrypt it when a user requires it. First, if large amounts of data are repeatedly encrypted and decrypted, system resources (such as CPU and memory) are heavily consumed, resulting in a heavy system load and reduced performance. Second, because encryption and decryption may be performed repeatedly while a user accesses the data, the cloud provider or a malicious user has an opportunity to steal the user's data. In addition, encrypted data appears as meaningless garbled text, which makes it hard to retrieve and query; for a cloud provider, encrypted data greatly increases resource consumption and the complexity of key management, and it is of no use for subsequent business analysis.

In the process of implementing the present application, the applicant finds that at least the following problems exist in the related art:

The efficiency of cloud data encryption is low, and security is insufficient.

Disclosure of Invention

An object of the embodiments of the present application is to provide a data protection method, an apparatus, and a server, so as to improve data security of a cloud.

In a first aspect, an embodiment of the present application provides a data protection method, where the method includes:

partitioning all the data in the public cloud according to the data attribute of each piece of data in the public cloud, and determining an area corresponding to each data attribute;

and determining the protection mode of the area corresponding to each data attribute, and protecting all the data in the public cloud based on the protection mode of the area corresponding to each data attribute.

In some embodiments, the data attributes include an identification attribute, a quasi-identification attribute, a sensitive attribute, and a non-identification attribute;

the determining the protection mode of the area corresponding to each data attribute includes:

if the data attribute is the identification attribute, determining that the protection mode of the area corresponding to the identification attribute is a recoding mode;

if the data attribute is the quasi-identification attribute or the sensitive attribute, determining that the protection mode of the area corresponding to the quasi-identification attribute or the sensitive attribute is a symmetric encryption mode combined with an anonymization mode;

and if the data attribute is the non-identification attribute, not processing the data of the area corresponding to the non-identification attribute.

In some embodiments, the symmetric encryption scheme in combination with the anonymization scheme includes:

copying the data with the data attribute as the quasi-identification attribute or the sensitive attribute to generate a first data set and a second data set;

anonymizing the first data set;

and carrying out symmetric encryption processing on the second data set.

In some embodiments, the anonymizing the first data set comprises:

and carrying out anonymization processing on the first data set based on a firefly swarm optimization algorithm.

In some embodiments, the firefly swarm optimization algorithm comprises an initial swarm center evolution stage;

wherein the initial swarm center evolution stage comprises:

loading the first data set, assigning a firefly to each data record, and setting the firefly parameters to be used, wherein the firefly parameters comprise an anonymous number, an initial radius, an initial brightness, a brightness attenuation coefficient, a proportionality constant and a number of cycles;

entering the loop;

updating the brightness of each firefly;

searching the fluorescence value of the neighbor within the initial radius of each firefly;

determining a target firefly corresponding to each firefly and moving towards the target firefly;

updating the radius;

returning to the loop until the set number of cycles has been executed, which ends the firefly evolution process;

outputting the data values of the fireflies from the last cycle, wherein the data values comprise the identification and data of each firefly, its fluorescence value in the last cycle, its radius and the set of members within the radius;

and determining the data values of the fireflies from the last cycle as the evolution result of the initial swarm center evolution stage.
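The loop described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the applicant's implementation: the text does not give the update formulas, so the brightness rule (decay plus a reward proportional to the number of neighbors), the deterministic choice of the brightest neighbor as target, the radius rule, and the names `Firefly`, `evolve`, `step` and `beta` are all hypothetical. The mapping of the "brightness attenuation coefficient" to `decay` and of the "proportionality constant" to `gamma` is likewise an assumption.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Firefly:
    ident: int
    data: tuple        # original record, never modified
    pos: list          # current position in attribute space
    luciferin: float   # fluorescence brightness
    radius: float
    members: list = field(default_factory=list)  # ids currently within radius


def neighbours(fly, flies):
    """Ids of the other fireflies lying within this firefly's radius."""
    return [g.ident for g in flies
            if g is not fly and math.dist(fly.pos, g.pos) <= fly.radius]


def evolve(records, k, radius0, luciferin0, decay, gamma, cycles,
           step=0.1, beta=0.05):
    """Run the evolution loop; every update rule here is an assumption."""
    flies = [Firefly(i, tuple(r), list(r), luciferin0, radius0)
             for i, r in enumerate(records)]
    for _ in range(cycles):
        for f in flies:
            f.members = neighbours(f, flies)
        # Update brightness: decay the old value, reward local density.
        for f in flies:
            f.luciferin = (1 - decay) * f.luciferin + gamma * len(f.members)
        # Inspect the neighbours' brightness, pick the brightest neighbour
        # that outshines this firefly, and take one bounded step towards it.
        new_pos = {}
        for f in flies:
            brighter = [flies[j] for j in f.members
                        if flies[j].luciferin > f.luciferin]
            if not brighter:
                continue
            target = max(brighter, key=lambda g: g.luciferin)
            d = math.dist(f.pos, target.pos)
            if d > 0:
                move = min(step, d)
                new_pos[f.ident] = [p + move * (q - p) / d
                                    for p, q in zip(f.pos, target.pos)]
        for ident, pos in new_pos.items():
            flies[ident].pos = pos
        # Update radius: grow with fewer than k neighbours, shrink with more.
        for f in flies:
            f.radius = max(0.0, f.radius + beta * (k - len(f.members)))
    for f in flies:  # member sets for the final positions and radii
        f.members = neighbours(f, flies)
    return flies
```

Each returned `Firefly` carries the items listed for the output step: identification, original data, last-cycle brightness, radius and the member set within the radius.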

In some embodiments, the firefly swarm optimization algorithm further comprises: a data classification stage;

wherein the data classification stage comprises a successive classification stage, which specifically includes:

acquiring the evolution result of the initial swarm center evolution stage, and sorting all fireflies by fluorescence brightness from high to low;

starting from the firefly with the highest fluorescence brightness, counting the fireflies within its radius that have not yet been assigned to a cluster; if at least k-1 such unassigned fireflies lie within the radius, taking the firefly as the cluster starting point and establishing a k-anonymous cluster by selecting, one by one from all neighbor fireflies within the radius, the k-1 data with the minimum information loss; if fewer than k-1 such neighbors lie within the radius of the selected firefly, skipping that firefly and continuing with the firefly of next-highest fluorescence brightness, until all fireflies have been examined;

and outputting the classification result of the successive classification stage.
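The successive classification stage can be illustrated with the following Python sketch. It is a sketch under assumptions, not a definitive implementation: the names `Fly`, `info_loss` and `successive_classification` are hypothetical, the seed firefly is assumed to be unassigned itself, and the information loss metric (the sum of per-attribute value ranges inside a cluster) is an assumption, since the text does not define how information loss is measured.

```python
from typing import NamedTuple


class Fly(NamedTuple):
    ident: int
    data: tuple        # numeric attribute values of the record
    luciferin: float   # fluorescence brightness after evolution
    members: tuple     # ids of the fireflies within this firefly's radius


def info_loss(rows):
    """Assumed metric: sum of the value ranges of each attribute."""
    return sum(max(col) - min(col) for col in zip(*rows))


def successive_classification(flies, k):
    by_id = {f.ident: f for f in flies}
    assigned = set()
    clusters = []
    # Visit fireflies from brightest to dimmest.
    for seed in sorted(flies, key=lambda f: f.luciferin, reverse=True):
        if seed.ident in assigned:
            continue
        free = [i for i in seed.members if i not in assigned]
        if len(free) < k - 1:
            continue  # not enough unassigned neighbours: skip this firefly
        cluster = [seed.ident]
        for _ in range(k - 1):
            # Add the neighbour that keeps the cluster's loss smallest.
            best = min(free, key=lambda i: info_loss(
                [by_id[j].data for j in cluster + [i]]))
            cluster.append(best)
            free.remove(best)
        assigned.update(cluster)
        clusters.append(cluster)
    leftover = [f.ident for f in flies if f.ident not in assigned]
    return clusters, leftover
```

The `leftover` list corresponds to the fireflies that remain unallocated after this stage.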

In some embodiments, the data classification stage further comprises a remaining data processing stage, which specifically includes:

combining the fireflies that were not allocated in the successive classification stage into a residual data set;

loading the residual data set, acquiring the fluorescence brightness of each firefly in the residual data set, and setting the anonymous number;

sorting all fireflies that were not allocated in the successive classification stage by fluorescence brightness from high to low;

selecting the firefly with the highest fluorescence brightness as a cluster starting point;

adding data one by one, each time choosing the data with the minimum information loss, until the anonymous number is met;

repeatedly selecting the firefly with the highest fluorescence brightness among the remaining fireflies as a cluster starting point, until the remaining fireflies can no longer satisfy the anonymous number;

adding each firefly that is still unassigned to the closest of the clusters established in the remaining data processing stage;

and outputting the classification result of the residual data processing stage.
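A minimal Python sketch of the remaining data processing stage follows. It rests on assumptions that the text does not confirm: information loss is measured as the sum of per-attribute value ranges, and the "closest" cluster for a final leftover record is taken to be the one whose information loss grows least when the record is added. The names `info_loss` and `process_remaining` are hypothetical.

```python
def info_loss(rows):
    """Assumed metric: sum of the value ranges of each attribute."""
    return sum(max(col) - min(col) for col in zip(*rows))


def process_remaining(flies, k):
    """flies: list of (ident, data, luciferin) tuples left unallocated."""
    data = {ident: row for ident, row, _ in flies}
    # Sort the residual fireflies from brightest to dimmest.
    pool = sorted(flies, key=lambda f: f[2], reverse=True)
    clusters = []
    while len(pool) >= k:
        seed = pool.pop(0)  # brightest remaining firefly starts a cluster
        cluster = [seed[0]]
        while len(cluster) < k:
            # Add the record that keeps the cluster's loss smallest.
            best = min(pool, key=lambda f: info_loss(
                [data[i] for i in cluster] + [f[1]]))
            pool.remove(best)
            cluster.append(best[0])
        clusters.append(cluster)
    # Fewer than k fireflies remain: attach each to its closest cluster.
    for ident, row, _ in pool:
        if not clusters:
            raise ValueError("fewer than k records: no cluster can be formed")
        target = min(clusters, key=lambda c: info_loss(
            [data[i] for i in c] + [row]) - info_loss([data[i] for i in c]))
        target.append(ident)
    return clusters
```

Every cluster returned this way holds at least k records, which is the property the anonymous number is meant to guarantee.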

In a second aspect, an embodiment of the present application provides a data protection apparatus, where the apparatus includes:

the data attribute partitioning unit is used for partitioning all data in the public cloud according to the data attribute of each data in the public cloud and determining an area corresponding to each data attribute;

and the selective protection unit is used for determining the protection mode of the area corresponding to each data attribute and selectively protecting all the data in the public cloud based on the protection mode of the area corresponding to each data attribute.

the selective protection unit is specifically configured to:

In a third aspect, an embodiment of the present application provides a server, where the server includes:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the data protection method of the first aspect.

In a fourth aspect, a non-transitory computer-readable storage medium stores computer-executable instructions that, when executed by a server, cause the server to perform the data protection method of the first aspect.

The beneficial effects of the embodiment of the application are that: by providing a method of data protection, the method comprising: partitioning all the data in the public cloud according to the data attribute of each piece of data in the public cloud, and determining an area corresponding to each data attribute; and determining the protection mode of the area corresponding to each data attribute, and protecting all the data in the public cloud based on the protection mode of the area corresponding to each data attribute. By partitioning all data in the public cloud, determining the area corresponding to each data attribute and further determining the protection mode of the area corresponding to each data attribute, the method and the device can improve the efficiency of cloud data encryption and improve data security.

Drawings

One or more embodiments are illustrated by way of example in the accompanying drawings. In the drawings, like reference numerals refer to similar elements, and the figures are not to scale unless otherwise specified.

Fig. 1 is an overall schematic diagram of a data protection method provided in an embodiment of the present application;

Fig. 2 is a schematic diagram of a data classification provided in an embodiment of the present application;

Fig. 3 is a schematic flowchart of a data protection method according to an embodiment of the present application;

Fig. 4 is a detailed flowchart of step S302 in Fig. 3;

Fig. 5 is a detailed flowchart of step S303 in Fig. 3;

Fig. 6 is a schematic flowchart illustrating selective data protection according to an embodiment of the present application;

Fig. 7 is a schematic flowchart of another data protection method provided in an embodiment of the present application;

Fig. 8 is a schematic diagram of a data unit in a corresponding position of a data set according to an embodiment of the present application;

Fig. 9 is a flowchart of a successive classification stage provided by an embodiment of the present application;

Fig. 10 is a schematic structural diagram of a data protection apparatus according to an embodiment of the present application;

Fig. 11 is a schematic structural diagram of a data protection apparatus according to an embodiment of the present application;

Fig. 12 is a schematic hardware structure diagram of a server according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

In the embodiment of the present application, it is assumed that the internal storage space of the enterprise is limited and cannot store all the data, so a configuration architecture combining a public cloud and a private cloud is adopted to store the data of the enterprise.

In order to protect the security of the cloud data, the embodiment of the application stores high-value and highly confidential data in the private cloud, and stores lower-value or public data in the public cloud. That is, the data is divided into high risk and low risk: the high-risk data is placed in the private cloud for higher security, and the low-risk data is placed in the public cloud, where it is available for query and data analysis. To prevent unauthorized users from disclosing personal privacy by analyzing the data with data mining techniques, while keeping the data reasonably usable, the scheme protects public cloud data by combining anonymization with partial encryption.

Referring to fig. 1, fig. 1 is a general schematic diagram of a data protection method according to an embodiment of the present application;

As shown in Fig. 1, the present application divides data protection into two stages: stage one is a data classification stage, and stage two is a selective data protection stage. In the data classification stage, the scheme classifies data as high risk or low risk according to its risk value and then applies different protection measures according to the degree of risk. In the selective data protection stage, the important attributes of the data to be stored in the public cloud are symmetrically encrypted and anonymized, respectively. The protected data is stored in the hybrid cloud environment according to its risk: high-risk data in the private cloud and low-risk data in the public cloud. High-risk data in the private cloud is accessed only by internal users, while data in the public cloud is accessed differently depending on user authority: unauthorized users can only access data that is public or anonymized, and the encrypted content is accessible only to authorized users.

Data protection techniques (e.g., encryption, access control, etc.) are often used to protect important data. If all data is encrypted to some degree, however, additional system resources are consumed because of the large number of operations required for encryption and decryption. Therefore, if data grades can be distinguished in advance, each piece of data can be controlled appropriately according to its degree of confidentiality, which further improves data security.

Referring to fig. 2 again, fig. 2 is a schematic diagram of a data classification according to an embodiment of the present application;

As shown in Fig. 2, the database is first pre-processed. In the pre-processing step, each attribute value of the data in the database is assigned a secret value. In the risk classification step, the secret values are combined with the attribute weights and a threshold to classify the data by risk, yielding high-risk data and low-risk data.

Referring to fig. 3, fig. 3 is a schematic flowchart of a data protection method according to an embodiment of the present application;

as shown in fig. 3, the data protection method includes:

step S301: acquiring data to be classified;

specifically, the data to be classified is a database, for example an enterprise database.

Step S302: determining a risk value of each data row in the data to be classified;

specifically, referring to fig. 4 again, fig. 4 is a detailed flowchart of step S302 in fig. 3;

as shown in fig. 4, the step S302 of determining a risk value of each data row in the data to be classified comprises:

step S3021: acquiring the attribute type of each data item variable in each data row;

specifically, the attribute types of the data item variables include a category type attribute and a numerical type attribute.

Step S3022: determining the confidential value of each data item variable based on a preset value conversion function according to the attribute type;

specifically, the preset value conversion function is configured to: determining a preset interval range; and determining a secret value corresponding to each attribute type in the preset interval range.

For example, the preset value conversion function assigns a secret value to each data item variable, as in the following formula (1):

$V_{ij} = f_i(x_{ij})$    Formula (1)

where $x_{ij}$ is the data item variable of the $j$-th data row under the $i$-th attribute, $f_i$ is the value conversion function of the $i$-th attribute, and $V_{ij}$ is the resulting secret value.

Specifically, for the category type attribute, see Table 1 below:

Data No.	Attribute (occupation)	Attribute secret value
1	Agriculture, forestry, fishery and animal husbandry	60
2	Military police	80
3	Business management	85
4	Educational services	75

TABLE 1

It can be seen that Table 1 contains several different occupations. The data owner assigns each occupation category, in advance and according to its value, a quantized number between 1 and 100, which is taken as the attribute secret value; the larger the number, the higher the value of the data.
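For a category type attribute, the value conversion function can be as simple as a table lookup. The Python sketch below uses the four occupations and secret values of Table 1; the names `OCCUPATION_SECRET` and `secret_value` are hypothetical and chosen only for illustration.

```python
# Secret values copied from Table 1 (scale 1-100, set by the data owner).
OCCUPATION_SECRET = {
    "Agriculture, forestry, fishery and animal husbandry": 60,
    "Military police": 80,
    "Business management": 85,
    "Educational services": 75,
}


def secret_value(occupation: str) -> int:
    """Value conversion for the category attribute 'occupation'."""
    try:
        return OCCUPATION_SECRET[occupation]
    except KeyError:
        raise ValueError(
            f"no secret value defined for occupation {occupation!r}"
        ) from None


values = [secret_value(o) for o in ("Military police", "Educational services")]
```

Raising an error for an unknown category, rather than returning a default, is a design choice of this sketch: it keeps an unclassified value from silently receiving a low secret value.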

It is understood that an attribute is a field in the two-dimensional relational table of the database, for example occupation, height or age. Each attribute has a plurality of attribute values, and a data item variable is the value that a sample record takes under a given attribute.

Specifically, for the numerical type attribute, see Table 2 below:

Data No.	Numerical attribute (age)	Category attribute (age)
1	16	Teenager
2	45	Young or middle-aged adult
3	67	Old age
4	27	Young or middle-aged adult

TABLE 2

It can be seen that the numerical attribute age in Table 2 has a continuous distribution, which does not lend itself to computing a quantized value directly. The values are therefore first divided into 4 different categories, and each category is then given a quantized secret value between 1 and 100 according to its value. For numerical values that are difficult to classify directly, cluster analysis can be used to calculate the distances between values as a measure of their similarity, and the values are classified according to that similarity. In the embodiment of the present application, the quantized secret value may be determined by logistic regression, decision trees, general linear regression, hierarchical analysis, cluster analysis, time series analysis, and the like.

In the embodiment of the application, cluster analysis refers to grouping the values of a numerical attribute, for example samples aged 17 to 65, into 4 levels: center points are selected, and values close to one another are clustered through iteration to form the hierarchy. For example, for the age categories in Table 2, a suitable number of intervals is selected, and the question is whether the interval ranges drawn around the center points of the four levels cover the sample data to the greatest extent. It can be understood that the division of age relies more on everyday experience, whereas telephone numbers are more reasonably divided by region and location. If a special value such as the number of family members is encountered, the value is projected onto four levels, for example: miniature, small, medium and large households. In this case, where the bulk of the samples is concentrated must be considered when dividing them into four levels, together with how to choose four suitable points that cover as much of the sample data as possible.

In an embodiment of the present application, the determining the secret value of each data item variable based on a preset value conversion function according to the attribute type includes:

if the attribute type is a numerical attribute, converting the numerical attribute into a category attribute, and determining the secret value of each data item variable based on a preset value conversion function;

and if the attribute type is the category type attribute, directly determining the secret value of each data item variable based on a preset value conversion function.

Specifically, a different value conversion function is determined for each attribute type. For a category type attribute, classification is used: the attribute is divided into several classes or categories, and a secret value is defined in advance for each. For a numerical attribute, the attribute is first converted into a category attribute, and its secret value is then calculated with a hierarchical classification method. The hierarchical classification method classifies the data samples, and the secret value of each category is determined by multi-attribute decision ranking or from empirical values.
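The two-step conversion for a numerical attribute (numerical value to category, then category to secret value) might look like the following Python sketch. The cut points, category labels and secret values are hypothetical placeholders, since the text gives neither the interval boundaries nor the secret value of each age category; they are chosen only so that the four ages of Table 2 fall into matching categories.

```python
from bisect import bisect_right

# Hypothetical cut points: below 13, 13-17, 18-65, 66 and above.
AGE_CUTS = [13, 18, 66]
AGE_LABELS = ["child", "teenager", "adult", "senior"]

# Hypothetical secret values on the 1-100 scale, one per category.
AGE_SECRET = {"child": 40, "teenager": 50, "adult": 70, "senior": 60}


def age_category(age: int) -> str:
    """Step 1: convert the numerical attribute into a category attribute."""
    if age < 0:
        raise ValueError("age must be non-negative")
    return AGE_LABELS[bisect_right(AGE_CUTS, age)]


def age_secret_value(age: int) -> int:
    """Step 2: look up the secret value defined for the category."""
    return AGE_SECRET[age_category(age)]
```

In practice the cut points would come from experience or from a cluster analysis of the sample, as the description suggests.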

Step S3023: determining the risk value of each data row according to the secret value of each data item variable.

Specifically, the determining the risk value of each data row according to the secret value of each data item variable includes:

determining the security attribute corresponding to each data item variable;

and calculating the risk value of the data row containing each data item variable, based on the predetermined weight of each security attribute combined with the secret value of each data item variable.

Specifically, the data to be classified includes a relational data table, attribute weights are defined according to importance degrees among data attributes, and different weights are given to security attributes of the relational data table, respectively, where the security attributes include: an identifying attribute, a quasi-identifying attribute, a sensitive attribute, and a non-identifying attribute.

In the embodiment of the application, because the identification attribute can directly identify an individual and its leakage would seriously damage personal privacy, the identification attribute is the most important and has the highest weight, followed by the sensitive attribute, then the quasi-identification attribute, and finally the non-identification attribute, which contains no sensitive information. Since it is not easy to evaluate the weights among data attributes directly, the present application uses a multi-attribute decision-making approach to calculate them. Because the compromise ranking method is the most widely used of the multi-attribute decision-making methods, the application adopts it to calculate the weights. Specifically, the compromise ranking method includes the compromise solution (VIKOR).

The keys to determining the weights in the compromise solution (VIKOR) are the attribute ranking and the number of attributes. If there are $n$ attributes ranked $A_1, A_2, \dots, A_n$, their relative weights are $w_1, w_2, \dots, w_n$ in turn, satisfying $1 > w_1 \ge w_2 \ge \dots \ge w_n > 0$, and all weights add up to 1, where $w_n$ denotes the weight of the $n$-th attribute. Assuming $n$ is the total number of attributes, the weight of the $k$-th attribute can be determined according to the following equation (2):

as can be seen from equation (2):

……

In the embodiment of the application, to speed up weight calculation, a table of rank-order weights for different numbers of attributes is established in advance, and the weights are obtained directly by table lookup. For risk classification, the secret value $V_{ij}$ generated for each attribute entry of a data row is multiplied by its corresponding attribute weight $w_i$, and the products are summed to give the risk value $R_j$ of the data row, as shown in the following formula (3):

$R_j = \sum_{i=1}^{n} w_i \times V_{ij}$    Formula (3)

where $n$ is the number of attributes contained in each data row. Therefore, if there are $m$ data rows, $m$ risk values $R_1, R_2, \dots, R_m$ are generated.
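The weighted sum that produces the risk value of a data row, which formulas (4) to (6) spell out term by term, can be sketched in Python as follows. The function names and the example weights and secret values are hypothetical; only the multiply-and-sum rule comes from the description.

```python
def risk_value(weights, secrets):
    """Risk of one data row: sum of weight x secret value per attribute."""
    if len(weights) != len(secrets):
        raise ValueError("one weight is needed per attribute")
    return sum(w * v for w, v in zip(weights, secrets))


def risk_values(weights, rows):
    """Risk values R_1 ... R_m for m data rows of secret values."""
    return [risk_value(weights, row) for row in rows]


# Hypothetical weights for four attributes (they sum to 1) and two rows.
weights = [0.4, 0.3, 0.2, 0.1]
rows = [[80, 60, 75, 50], [60, 60, 60, 60]]
risks = risk_values(weights, rows)
```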

Step S303: according to the risk value, carrying out risk classification on the data to be classified, and determining a classification result;

specifically, referring to fig. 5 again, fig. 5 is a detailed flowchart of step S303 in fig. 3;

as shown in fig. 5, the step S303: according to the risk value, carrying out risk classification on the data to be classified, and determining a classification result, wherein the classification result comprises the following steps:

step S3031: calculating a risk value of each data in the data to be classified;

specifically, the risk value of each data row is calculated, i.e. the risk values $R_1, R_2, \dots, R_m$ of the $m$ data rows.

Step S3032: judging whether the risk value of certain data is greater than a preset risk threshold value or not;

specifically, after the risk values of all the data rows have been calculated, the data is divided into two categories, high-risk data and low-risk data, according to a preset risk threshold $T_A$.

In the embodiment of the present application, the risk threshold depends on the administrator's risk tolerance and empirical judgment. For example, in a financial company, a customer profile record whose risk value is higher than 120 (an assumed value) is regarded as relatively confidential data and is not suitable for release as template data, so the risk threshold is set to 120.

If the risk value of a data row is higher than the risk threshold $T_A$, the row belongs to the high-risk data; if the risk value of the data row is less than or equal to the risk threshold $T_A$, it belongs to the low-risk data. For example, suppose the data table includes $n$ attributes $A$ and $m$ data rows $t$, and the attributes $A_1$ to $A_n$ have weights $w_1, w_2, \dots, w_n$ in order. In data row $t_1$, after the data item variable of attribute $A_1$ is substituted into the value conversion function (4-1), its secret value is $V_{11}$; the secret value of the data item of attribute $A_2$ is $V_{21}$; and the secret value of the data item of attribute $A_n$ is $V_{n1}$. The risk values of data rows $t_1$, $t_2$ and $t_m$ are then given by formulas (4) to (6):

$R_1 = w_1 \times V_{11} + w_2 \times V_{21} + \dots + w_n \times V_{n1}$    Formula (4)

$R_2 = w_1 \times V_{12} + w_2 \times V_{22} + \dots + w_n \times V_{n2}$    Formula (5)

$R_m = w_1 \times V_{1m} + w_2 \times V_{2m} + \dots + w_n \times V_{nm}$    Formula (6)

In summary, the risk value of each data in the data to be classified is calculated to obtain a data risk table, which is shown in table 3 below:

TABLE 3

Step S3033: determining the classification result as high-risk data;

if the risk value of a certain data is greater than or equal to a preset risk threshold value, determining that the classification result is high-risk data;

step S3034: determining the classification result as low-risk data;

and if the risk value of a certain data is smaller than a preset risk threshold value, determining that the classification result is the low-risk data.

Step S304: selectively protecting the data to be classified according to the classification result;

specifically, if a data row is classified as high-risk data, it is stored in the private cloud; if a data row is classified as low-risk data, it is stored in the public cloud.
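The threshold test and the resulting storage decision can be sketched in Python as follows. The function names are hypothetical. The sketch follows steps S3033 and S3034 as worded, so a row whose risk value equals the threshold is treated as high risk; this boundary choice is an assumption of the sketch.

```python
def classify(risk: float, threshold: float) -> str:
    """Return 'high' or 'low' for one data row (S3033 / S3034 wording)."""
    return "high" if risk >= threshold else "low"


def route(rows, risks, threshold):
    """Split rows into those for the private cloud and the public cloud."""
    private, public = [], []
    for row, risk in zip(rows, risks):
        if classify(risk, threshold) == "high":
            private.append(row)   # high-risk data: private cloud
        else:
            public.append(row)    # low-risk data: public cloud
    return private, public


# Example with the assumed threshold of 120 mentioned in the description.
private, public = route(["t1", "t2", "t3"], [130, 120, 90], 120)
```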

In an embodiment of the present application, there is provided a data protection method, including: acquiring data to be classified; determining a risk value of each data row in the data to be classified; according to the risk value, carrying out risk classification on the data to be classified, and determining a classification result; and carrying out selective data protection on the data to be classified according to the classification result. On the one hand, classifying the data by risk value determines the classification result of each piece of data. On the other hand, selectively protecting the data according to the classification result improves both data encryption efficiency and information retrieval efficiency.

In order to protect the security of the data in the public cloud while preserving its usability, the application further draws on the concept of firefly swarm optimization to improve the way k-member selects initial cluster values and searches for suitable data, so as to achieve k-anonymity, and uses symmetric encryption to protect the data selectively. Important attributes are encrypted with symmetric encryption, which has the advantage of fast encryption and decryption, to save processing time for the huge volume of cloud data. In addition, anonymization retains the generalized semantic information of the data, which facilitates content query and data analysis by users.

Specifically, please refer to fig. 6, fig. 6 is a schematic flow chart illustrating a selective data protection according to an embodiment of the present application;

as shown in fig. 6, the process of selective data protection includes:

first, the public cloud data is partitioned by data attribute: according to the attributes of the data, it is divided into four attribute classes, namely identification attributes, quasi-identification attributes, sensitive attributes and non-identification attributes.

For identification attributes: because an identification attribute can directly identify a person, personal identity is easily leaked if it is left unprocessed, whereas deleting it would reduce the usability of the data. The present application therefore protects it by recoding, for example by re-encoding a name or identification number as a001, a002, and so on.
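A minimal sketch of such recoding is given below, assuming that codes are assigned in order of first appearance and use a three-digit counter as in the a001, a002 example. The function name `recode_identifiers` and the `prefix` parameter are illustrative only.

```python
from typing import Dict, Iterable


def recode_identifiers(values: Iterable[str], prefix: str = "a") -> Dict[str, str]:
    """Map each distinct identifying value to a surrogate code such as a001."""
    mapping: Dict[str, str] = {}
    for value in values:
        if value not in mapping:
            # The same original value always receives the same code,
            # so records can still be linked after recoding.
            mapping[value] = f"{prefix}{len(mapping) + 1:03d}"
    return mapping
```

The mapping itself reveals the original identities, so in practice it would have to be kept outside the public cloud.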

For quasi-identification attributes and sensitive attributes: these attributes are processed with both symmetric encryption and anonymization, that is, the attribute content is duplicated into two copies, one of which is anonymized and the other symmetrically encrypted. For example, the precision of the released information is controlled by applying different anonymization techniques, adjusting the height of the numerical generalization hierarchy, and similar methods. The anonymized content is used for query and data analysis, while the original values are protected by symmetric encryption and can only be downloaded and decrypted by authorized users, which avoids the privacy leakage that direct decryption in the public cloud would cause.

Non-identification attributes are not processed, which reduces the amount of data processing.
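The two-copy treatment of a quasi-identification or sensitive attribute can be sketched as follows. The sketch assumes a numeric attribute generalized into fixed-width ranges; `generalize_age`, `protect_column` and the `encrypt` callable are hypothetical names, and `encrypt` stands in for whatever symmetric cipher is actually used (it is supplied by the caller and is not implemented here).

```python
from typing import Callable, List, Sequence, Tuple


def generalize_age(age: int, width: int = 10) -> str:
    """Generalize an exact age into a range; a larger width means coarser data."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def protect_column(
    values: Sequence[int],
    anonymize: Callable[[int], str],
    encrypt: Callable[[bytes], bytes],
) -> Tuple[List[str], List[bytes]]:
    """Return (anonymized copy for querying, encrypted copy of the original values)."""
    # Copy 1: generalized values, usable for query and analysis in the public cloud.
    anonymized = [anonymize(v) for v in values]
    # Copy 2: original values, protected by the caller-supplied symmetric cipher.
    encrypted = [encrypt(str(v).encode("utf-8")) for v in values]
    return anonymized, encrypted
```

Raising `width` corresponds to moving up the generalization hierarchy: the anonymized copy becomes less precise while the encrypted copy is unaffected.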

Specifically, please refer to fig. 7 again, fig. 7 is a schematic flow chart of another data protection method according to an embodiment of the present disclosure;

as shown in fig. 7, the data protection method includes:

step S701: partitioning all the data in the public cloud according to the data attribute of each piece of data in the public cloud, and determining an area corresponding to each data attribute;

specifically, the data attribute comprises an identification attribute, a quasi-identification attribute, a sensitive attribute and a non-identification attribute;

step S702: and determining the protection mode of the area corresponding to each data attribute, and protecting all the data in the public cloud based on the protection mode of the area corresponding to each data attribute.

Specifically, the symmetric encryption method in combination with the anonymization method includes:

anonymizing the first data set;

and carrying out symmetric encryption processing on the second data set.

Specifically, the anonymizing the first data set includes:

Wherein, the firefly swarm optimization algorithm comprises an initial cluster-center evolution stage and a data classification stage;

wherein, the initial cluster-center evolution stage comprises:

entering into circulation;

updating the brightness of each firefly;

updating the radius;

Wherein, the data classification stage further comprises a remaining data processing stage, which specifically includes:

and outputting the classification result of the residual data processing stage.

Specifically, the firefly swarm optimization algorithm comprises the following steps:

(1) Initial cluster-center evolution stage

In the firefly swarm optimization algorithm, each data in the data set has its own agent firefly (hereinafter, firefly), and the data structure of firefly is shown in table 4 below:

TABLE 4
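A minimal sketch of such an agent record is given below. It assumes only the fields named elsewhere in this description (ID, data of the firefly, coordinate position, fluorescence value, radius and radius member set); the class name, field names and types are illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Firefly:
    """Agent record for one data row (field names are assumptions)."""
    firefly_id: int
    data: Tuple                      # original attribute values of the record
    position: Tuple[float, float]    # (x, y) coordinates of the record
    brightness: float                # fluorescence value of the latest cycle
    radius: float                    # current sensing radius
    # IDs of the fireflies currently inside the radius; a fresh list per instance.
    neighbors: List[int] = field(default_factory=list)
```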

1.1 arrangement of fireflies

In the firefly swarm optimization algorithm, fireflies are not scattered randomly in the solution space; instead, they are placed according to the position that each data record occupies given the characteristics of the data set. In the embodiment of the present application, all data in the data set are converted into corresponding position (x, y) values according to their attribute values by using the Non-metric MDS function of the data analysis software PAST, and these values are stored as the coordinate data of the initialization data in the firefly data structure of the present application.

Non-metric MDS is a function that takes multi-dimensional attribute values as input and automatically projects them as points on a two-dimensional plot while substantially preserving the distances between the multi-dimensional data. For example, the quasi-identification attributes referred to in the present application are multi-field records, that is, multi-dimensional data; projecting them onto a two-dimensional plane so that the distances are displayed meaningfully can be achieved with the Non-metric MDS function.

Referring to fig. 8, fig. 8 is a schematic diagram of a data unit at a corresponding position of a data set according to an embodiment of the present application;

as shown in fig. 8, the right side is the result of analyzing the original data table T on the left with PAST. The 1st, 3rd, 4th, 5th, 6th, 8th and 9th records all have the value "United-States" in the "native-country" field, so their positions after the analysis are shifted downward. Among the three field attributes, the 9th and 1st records differ only in the Age field, by 3, which is smaller than the attribute difference between the 1st record and the other nearby records (the 4th and 6th); the 9th record is therefore the closest to the 1st record in data set T. After the PAST analysis, each record in data set T obtains a corresponding position, and the fireflies are placed according to these positions.

1.2, brightness update stage

After the data set is loaded, the brightness updating stage is entered first. In this stage, each firefly searches for other fireflies within its sensing radius and, taking itself as the cluster initial point, forms a temporary cluster with the fireflies within the radius. The degree of information loss within the radius of each firefly is calculated from this temporary cluster, and finally whether the k-anonymity requirement is met is judged from the number of members within the radius. The key point of k-anonymization is to generalize the sample data so that every group of sample records contains at least k records. Here k corresponds to the number of fireflies within the preset radius of a firefly; because the records represented by these k fireflies are very close to one another, the group can be anonymized together.

Therefore, the number of members within the radius and the degree of information loss are the factors that influence the brightness of a firefly in each cycle. The brightness of a firefly in each cycle in the present application is given by the following formula (7) and formula (8):
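The k-anonymity condition itself (every combination of quasi-identifier values is shared by at least k records) can be checked with a short sketch such as the following. The function name `is_k_anonymous` and the representation of rows as dictionaries are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def is_k_anonymous(rows: Sequence[Dict[str, Hashable]],
                   qi_columns: Sequence[str], k: int) -> bool:
    """True if every group of identical quasi-identifier values has >= k rows."""
    groups = Counter(tuple(row[c] for c in qi_columns) for row in rows)
    return all(count >= k for count in groups.values())
```

For instance, three rows with generalized ages "30-39", "30-39" and "40-49" are not 2-anonymous on the age column, because the "40-49" group contains a single record.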

τ_i(t)＝(1-ρ)τ_i(t-1)+γJ_i(t) formula (7)

Wherein, the values of the parameters are as follows:

τ_i(t): the fluorescence brightness of firefly i at the t-th cycle;

ρ: the brightness decay factor, in the interval [0, 1];

(1-ρ): the brightness attenuation rate, which mainly controls the proportion of past experience that is retained and keeps the brightness of a firefly from changing drastically between cycles;

γ: the proportionality constant, which controls the proportion contributed by the solution found in the current cycle;

J_i(t): the fitness value, that is, the objective function value at the position of firefly i in the t-th cycle, which mainly reflects how similar the members within the radius of firefly i are when firefly i is at its position in the t-th cycle;

gw_i: the group formed by the members within the radius of firefly i;

D(gw_i): the degree of information loss of the group;

|gw_i|: the number of members within the radius of firefly i;

|QI|: the number of QI fields in the data set;

k: the anonymity parameter of k-anonymity;

The denominator |gw_i| × |QI| is the product of the number of members within the radius of firefly i and the number of QI fields, and represents the worst-case information loss of the group, namely Max(D(gw_i))＝|gw_i| × |QI|. The more similar the attribute values of all fireflies j within the radius of firefly i, the smaller the resulting value of D(gw_i)/(|gw_i| × |QI|); conversely, the more dissimilar the attribute values of all fireflies j within the radius of firefly i, the larger the resulting value. k indicates whether the number of members within the radius of firefly i reaches k (the k-anonymity parameter): if it does not, the brightness is smaller; if the total number of members within the radius of firefly i is greater than k, there are at least k-1 similar records available to satisfy k-anonymity.

Referring to fig. 8, suppose k = 3, ρ = 0.6 and γ = 0.7, and suppose that in the t-th cycle the 4th, 6th and 9th fireflies lie within the sensing radius of the 1st firefly, whose fluorescence value in the previous cycle is τ_i(t-1) = 0.5. The brightness of the 1st firefly in the t-th cycle then follows from formula (7) and formula (8).
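The update of formula (7) can be sketched as follows. The fitness value J_i(t) is taken as an input, because this sketch does not attempt to evaluate formula (8); the value 0.5 used for it in the usage note is purely hypothetical. The function name `update_brightness` is likewise illustrative.

```python
def update_brightness(tau_prev: float, fitness: float, rho: float, gamma: float) -> float:
    """Formula (7): tau_i(t) = (1 - rho) * tau_i(t-1) + gamma * J_i(t)."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    # (1 - rho) keeps part of the previous brightness; gamma weights the new fitness.
    return (1.0 - rho) * tau_prev + gamma * fitness
```

With the parameters of the example above (ρ = 0.6, γ = 0.7, previous brightness 0.5) and an assumed fitness of 0.5, `update_brightness(0.5, 0.5, 0.6, 0.7)` evaluates 0.4 × 0.5 + 0.7 × 0.5, which is approximately 0.55.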

1.3, moving phase

In the moving stage, firefly i detects whether any other firefly within its radius has a fluorescence value brighter than its own. If no other firefly within the radius of firefly i is brighter than firefly i, firefly i stays where it is and does not move; if some firefly within the radius is brighter than firefly i, firefly i approaches the brightest one (called firefly j). If there are two or more such fireflies within the radius of firefly i, firefly i moves in the direction of the firefly with the closest data-to-data distance, according to the following formula (9) to formula (11):

X_i(t+1)＝X_i(t)+s·d_ij(t) formula (10)

Wherein, the values of the parameters are as follows:

N_i(t): the fireflies within the radius of firefly i whose fluorescence value is greater than that of firefly i;

d(i, j): the distance between firefly i and firefly j;

the sensing radius of firefly i at the t-th cycle;

τ_i(t): the fluorescence brightness of firefly i at the t-th cycle;

τ_j(t): the fluorescence brightness of firefly j at the t-th cycle;

X_i(t): the position of firefly i at the t-th cycle;

d_ij(t): a unit vector that controls the direction of movement;

s: the step size.

For example: the position of firefly i at the t-th cycle is gw_i(x, y)＝(0.2, 0.5), and fireflies j₁ and j₂, at their respective positions within the radius of firefly i, both have fluorescence values greater than that of firefly i. The fluorescence value of firefly j₁ is greater than that of firefly j₂, so firefly i moves a distance s toward firefly j₁, and the position of firefly i after the movement is X_i(t+1)＝gw_i(0.2+0.86s, 0.5+(-0.52s)).
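The movement step of formula (10) can be sketched as follows, under the assumption (taken from the textual description of this stage) that firefly i moves toward the brightest of the brighter fireflies within its radius. The function names `pick_target` and `move_toward` are illustrative.

```python
import math
from typing import Dict, Optional, Tuple

Point = Tuple[float, float]


def pick_target(own_brightness: float, neighbors: Dict[str, float]) -> Optional[str]:
    """Return the ID of the brightest neighbor that outshines firefly i, if any."""
    brighter = {j: b for j, b in neighbors.items() if b > own_brightness}
    if not brighter:
        return None  # no brighter neighbor: firefly i stays in place
    return max(brighter, key=lambda j: brighter[j])


def move_toward(x_i: Point, x_j: Point, s: float) -> Point:
    """Formula (10): X_i(t+1) = X_i(t) + s * d_ij(t), with d_ij a unit vector."""
    dx, dy = x_j[0] - x_i[0], x_j[1] - x_i[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        return x_i  # already at the target position; direction is undefined
    return (x_i[0] + s * dx / dist, x_i[1] + s * dy / dist)
```

In the example above the unit vector toward firefly j₁ is approximately (0.86, -0.52), so `move_toward` would add 0.86s to the x coordinate and subtract 0.52s from the y coordinate.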

1.4, area radius updating phase

After firefly i passes through the moving stage, its area decision radius for the next cycle is updated according to the number of fireflies within its sensing radius. The size of the area decision radius of firefly i is determined by the neighbor density D_i(t) within the radius of firefly i at the t-th cycle, where D_i(t) is defined as the number of neighbors covered in the area of the circle. The higher D_i(t) of firefly i, the smaller the area decision radius of firefly i in the next cycle; conversely, the lower D_i(t), the larger the area decision radius of firefly i in the next cycle; if D_i(t) does not change, the area decision radius remains unchanged. The update formula of the area decision radius and the neighbor density formula are shown in the following formula (12) and formula (13):

wherein, the values of the parameters are as follows:

the initial sensing radius;

N_i(t): the number of neighbors within the circular area;

β: a constant, the weight of the neighbor density.

For example: given the initial sensing radius of firefly i in the t-th cycle, N_i(t) = 5 and β = 0.2, the area decision radius for the next cycle is obtained from formula (12) and formula (13).

Specifically, combining the above stages, the initial cluster-center evolution stage includes the following steps:

step 1: loading the data set and deploying the agent firefly data; setting the anonymity parameter k to be used, the initial radius r₀ and the initial brightness τ₀; and setting related parameters such as the brightness decay factor ρ, the proportionality constant γ and the number of cycles t to be executed;

step 2: entering the loop;

step 3: updating the brightness of each firefly;

step 4: examining the fluorescence values of the neighbors within the radius, and selecting the direction of movement according to the highest fluorescence brightness or the minimum information loss after joining;

step 5: moving toward the target firefly;

step 6: updating the area decision radius;

step 7: returning to step 2 until the set t cycles have been executed, at which point the firefly evolution process ends;

step 8: outputting the data of the last cycle of each firefly (ID, data of the firefly, fluorescence value of the last cycle, radius and radius member set), and passing the evolution result to the data classification stage.
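The eight steps above can be sketched as a single loop. This is a simplified illustration rather than the implementation of the application: the fitness function (formula (8)) and the radius update rule (formulas (12) and (13)) are passed in as the hypothetical callables `fitness` and `update_radius`, positions are updated synchronously, and the target is simply the brightest brighter neighbor.

```python
import math
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]


def neighbors_within(i: int, pos: Dict[int, Point], radius: Dict[int, float]) -> List[int]:
    """IDs of the other fireflies inside the sensing radius of firefly i."""
    return [j for j in pos if j != i and math.dist(pos[i], pos[j]) <= radius[i]]


def evolve(
    pos: Dict[int, Point], r0: float, tau0: float, rho: float, gamma: float,
    step: float, cycles: int,
    fitness: Callable[[int, List[int]], float],
    update_radius: Callable[[float, int], float],
) -> Tuple[Dict[int, Point], Dict[int, float], Dict[int, float]]:
    pos = dict(pos)                                   # step 1: deploy fireflies
    tau = {i: tau0 for i in pos}
    radius = {i: r0 for i in pos}
    for _ in range(cycles):                           # steps 2 and 7: loop
        # step 3: brightness update, formula (7)
        tau = {i: (1.0 - rho) * tau[i]
               + gamma * fitness(i, neighbors_within(i, pos, radius))
               for i in pos}
        moved: Dict[int, Point] = {}
        for i in pos:
            # step 4: look for brighter neighbors within the radius
            brighter = [j for j in neighbors_within(i, pos, radius) if tau[j] > tau[i]]
            moved[i] = pos[i]
            if brighter:
                target = max(brighter, key=lambda j: tau[j])
                d = math.dist(pos[i], pos[target])
                if d > 0.0:
                    # step 5: move a distance `step` toward the target, formula (10)
                    moved[i] = (pos[i][0] + step * (pos[target][0] - pos[i][0]) / d,
                                pos[i][1] + step * (pos[target][1] - pos[i][1]) / d)
        pos = moved
        # step 6: radius update from the current neighbor count
        radius = {i: update_radius(radius[i], len(neighbors_within(i, pos, radius)))
                  for i in pos}
    return pos, tau, radius                           # step 8: last-cycle values
```

Passing the fitness and radius rules as callables keeps the loop structure visible without committing the sketch to a particular form of the formulas that are not evaluated here.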

(2) The data classification stage includes:

2.1 successive Classification stage

After the data set has passed through the evolution stage, the firefly agent of each data record holds the result of the last cycle, including the firefly's attribute data, coordinate position, fluorescence value of the last cycle, the firefly members within its radius in the last cycle, its sensing radius and other information. In the data classification stage, cluster initial points are selected from the agent fireflies of all data in order of brightness, starting with the brightest firefly. In the successive classification process, for a firefly selected as a candidate cluster initial point, the number of fireflies within its sensing radius that have not been allocated to other clusters is counted first. If the number of such unallocated fireflies within its radius reaches k-1, the firefly is used as the cluster initial point, and the k-1 records with the minimum information loss are selected one by one from the neighboring fireflies within the radius to establish a k-anonymous cluster; if the neighbors within the radius of the selected firefly do not amount to k-1 records, the firefly is skipped and the search continues with the next brightest firefly, until all fireflies have been examined.

In the successive classification process, k-anonymous clusters are established preferentially around cluster initial points that have at least k-1 fireflies within their radius. When no remaining firefly has k-1 unallocated records within its radius, the unallocated fireflies enter the remaining data processing stage as remaining fireflies for a further round of classification.

Specifically, please refer to fig. 9 again, fig. 9 is a schematic flow chart of a successive classification stage according to an embodiment of the present application;

after the data set evolves, each firefly respectively carries the original data, the position, the fluorescence value of the last cycle, the sensing radius and the neighbor member set of the firefly to a successive data classification stage.

As shown in fig. 9, the successive classification stage includes:

starting;

step S901: inputting the evolution result of the previous stage;

specifically, agent firefly information (fluorescence value, sensing radius and neighbors in the sensing radius) of an evolution stage and anonymous k of k anonymity are loaded.

Step S902: sequencing according to the fluorescent brightness from high to low;

specifically, the fluorescence value brightness of fireflies is ranked from high to low.

Step S903: judging whether there is still data that can be classified;

specifically, starting from the firefly with the brightest fluorescence value, counting the fireflies within its radius range that have not yet been allocated to a cluster;

step S904: taking the nearest k-1 records within the radius to establish a cluster;

specifically, if the number of neighbors within the radius range of the candidate cluster-starting firefly that have not been allocated to a previous cluster is less than k-1, the current firefly is skipped, the next brightest firefly is selected, and the count of unallocated fireflies within its radius range is repeated; if the number of such neighbors reaches k-1, the firefly is used as the cluster initial point, and neighbors within the radius are added to the cluster one record at a time, in the manner of minimum information loss, until the cluster contains k records.

These steps are repeated: the next brightest firefly is selected and the fireflies within its radius range that have not yet been allocated to a cluster are counted; if their number reaches k-1, the firefly is used as a cluster initial point and neighbors within the radius are added one record at a time, in the manner of minimum information loss, until the cluster contains k records;

this continues until no firefly has k-1 unallocated fireflies within its radius range, at which point the process goes to step S905;

step S905: processing a residual value;

specifically, combining the fireflies which are not allocated into a new data set, and performing residual data processing;

specifically, the remaining data processing stage is entered. The fireflies left unallocated by the successive classification stage fall mainly into two categories: one is fireflies that have low fluorescence values and were not attracted by other, brighter fireflies; the other is data that was not allocated during successive cluster building because the firefly is far from any cluster starting point.

The remaining data processing stage mainly re-classifies these two categories of remaining data to establish k-anonymous clusters. There are two main ways of handling this stage: k-member (hereinafter referred to as KGSO-K) and k-member with a modified cluster starting point selection (hereinafter referred to as KGSO-RK). KGSO-K applies the k-member method proposed by Byun to the remaining data to establish k-anonymous clusters. KGSO-RK changes the cluster initial point selection: it selects the brightest of the remaining fireflies as the new cluster initial point and selects the k-1 records with the minimum information loss to establish the cluster; once the cluster contains k records, the record with the highest fluorescence value among the remaining data is again selected as a new cluster initial point, and data selection continues until all remaining data are allocated to a suitable cluster.

Step S906: outputting a classification result;

ending.
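Steps S901 to S905 can be sketched as follows. The information-loss measure is supplied by the caller as the hypothetical callable `loss`, which receives a list of firefly IDs and returns the loss of the group they would form; the function name and the dictionary-based inputs are assumptions made for this illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]


def successive_classification(
    pos: Dict[int, Point], brightness: Dict[int, float], radius: Dict[int, float],
    k: int, loss: Callable[[List[int]], float],
) -> Tuple[List[List[int]], List[int]]:
    """Build k-member clusters around bright fireflies; return (clusters, remaining)."""
    unassigned = set(pos)
    clusters: List[List[int]] = []
    # S902: visit fireflies from the brightest to the dimmest.
    for seed in sorted(pos, key=lambda i: brightness[i], reverse=True):
        if seed not in unassigned:
            continue
        # S903: unassigned neighbors inside the seed's radius.
        candidates = [j for j in sorted(unassigned)
                      if j != seed and math.dist(pos[seed], pos[j]) <= radius[seed]]
        if len(candidates) < k - 1:
            continue  # not enough neighbors: skip to the next brightest firefly
        # S904: add neighbors one at a time, each time with minimum information loss.
        cluster = [seed]
        while len(cluster) < k:
            best = min(candidates, key=lambda j: loss(cluster + [j]))
            cluster.append(best)
            candidates.remove(best)
        unassigned -= set(cluster)
        clusters.append(cluster)
    # S905: whatever is left goes to the remaining data processing stage.
    return clusters, sorted(unassigned)
```

A single pass over the sorted fireflies is sufficient here, because the set of unassigned fireflies only shrinks: a seed that lacks k-1 unassigned neighbors when it is visited cannot gain any later.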

The embodiment of the application provides two modes for processing the residual data:

the first method: the original greedy k-member algorithm, which comprises the following steps:

Step 1: loading the data set and setting the anonymity parameter k.

Step 2: executing the greedy k-member algorithm.

Step 3: outputting the classification result.

The second method: the improved k-member (Revised greedy k-member) algorithm, which comprises the following steps:

Step 1: loading the data set and the firefly fluorescence values, and setting the anonymity parameter k.

Step 2: sorting the fireflies by fluorescence brightness from high to low.

Step 3: selecting the firefly with the highest fluorescence value as the cluster initial point.

Step 4: adding records one by one in the manner of minimum information loss until the cluster contains k records.

Step 5: selecting the unallocated firefly with the highest fluorescence value as the new cluster initial point.

Step 6: repeating Step 4 and Step 5 until fewer than k unallocated records remain.

Step 7: adding each remaining unallocated firefly to the established cluster to which its data-to-cluster distance is smallest.

Step 8: outputting the classification result.
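The second method can be sketched as follows. As before, `loss` is a caller-supplied, hypothetical information-loss function over a list of record IDs. For Step 7 the sketch assumes that the data-to-cluster distance is measured as the increase in information loss caused by adding the record to the cluster; this is one possible choice and is not stated in the steps above.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def revised_k_member(
    remaining: Sequence[int], brightness: Dict[int, float], k: int,
    loss: Callable[[List[int]], float],
) -> Tuple[List[List[int]], List[int]]:
    """Cluster the remaining records; return (clusters, records left unplaced)."""
    # Step 2: sort by fluorescence brightness, highest first.
    unassigned = sorted(remaining, key=lambda i: brightness[i], reverse=True)
    clusters: List[List[int]] = []
    while len(unassigned) >= k:                    # Step 6: stop below k records
        cluster = [unassigned.pop(0)]              # Steps 3 and 5: brightest as seed
        while len(cluster) < k:                    # Step 4: minimum information loss
            best = min(unassigned, key=lambda j: loss(cluster + [j]))
            unassigned.remove(best)
            cluster.append(best)
        clusters.append(cluster)
    leftover: List[int] = []
    for i in unassigned:                           # Step 7: attach to nearest cluster
        if not clusters:
            leftover.append(i)                     # no cluster exists to attach to
            continue
        # Assumed distance: increase in loss when record i joins cluster c.
        nearest = min(clusters, key=lambda c: loss(c + [i]) - loss(c))
        nearest.append(i)
    return clusters, leftover                      # Step 8: classification result
```

If fewer than k records are passed in, no cluster can be formed and all records are returned as `leftover`; in a full pipeline they would presumably be attached to clusters built in the successive classification stage.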

The classification stage establishes k-anonymous clusters on the basis of greedy k-member, but it differs from the k-member classification method in the data classification process in two respects:

firstly, when a cluster is established, the cluster initial point is chosen by fluorescence brightness;

when a new cluster is established, k-member first selects, from the remaining unallocated data, the record farthest from the previous cluster starting point as the new cluster starting point. In the method of the present application, both the successive classification stage and the second method of the remaining data processing stage select by firefly fluorescence value: the brighter a firefly, the better its surroundings satisfy the k-anonymity constraint with low information loss, so it is well suited to be the initial point of a cluster.

Secondly, in the classification stage, only neighbor fireflies within the radius range of the cluster initial point are candidates for joining the cluster;

when k-member establishes a cluster, it must calculate the data-to-cluster distance for every record in the entire data set and pick the closest record to add to the cluster, until the cluster contains k records. In the successive classification stage of the present application, neighbor fireflies within the radius range of the cluster starting point are the candidate members, and they are selected from these neighbors according to the data-to-cluster distance until the cluster contains k records.

In the embodiment of the application, through a two-stage information classification and selective big data information security protection mechanism, on one hand, information classification based on a risk value can effectively distinguish information categories, which is convenient for implementing hierarchical management and control of information access and improves the security of information resources, and on the other hand, selective protection measures based on classification are adopted, which can effectively improve information retrieval efficiency, reduce system resource loss (such as CPU, memory and the like) in the encryption and decryption processes of a large amount of data, and improve the efficiency and availability of information resource utilization.

Referring to fig. 10, fig. 10 is a schematic structural diagram of a data protection device according to an embodiment of the present application;

as shown in fig. 10, the data protection apparatus 100 includes:

a data obtaining unit 101, configured to obtain data to be classified;

a risk value determining unit 102, configured to determine a risk value of each data row in the data to be classified;

a classification result unit 103, configured to perform risk classification on the data to be classified according to the risk value, and determine a classification result;

a selective data protection unit 104, configured to perform selective data protection on the data to be classified according to the classification result.

It should be noted that the above-mentioned apparatus can execute the method provided by the embodiments of the present application, and has corresponding functional modules and beneficial effects for executing the method. For technical details which are not described in detail in the device embodiments, reference is made to the methods provided in the embodiments of the present application.

In the embodiment of the application, data to be classified is obtained; determining a risk value of each data row in the data to be classified; according to the risk value, carrying out risk classification on the data to be classified, and determining a classification result; and carrying out selective data protection on the data to be classified according to the classification result. On the one hand, the data to be classified is subjected to risk classification through the risk value, the classification result is determined, the classification result of different data can be determined, on the other hand, selective data protection is carried out on the data to be classified through the classification result, and the data encryption efficiency and the information retrieval efficiency can be improved.

Referring to fig. 11, fig. 11 is a schematic structural diagram of another data protection device according to an embodiment of the present application;

as shown in fig. 11, the data protection apparatus 110 includes:

the data attribute partitioning unit 111 is configured to partition all data in the public cloud according to a data attribute of each data in the public cloud, and determine an area corresponding to each data attribute;

the selective protection unit 112 is configured to determine a protection manner of an area corresponding to each data attribute, and selectively protect all data in the public cloud based on the protection manner of the area corresponding to each data attribute.

In the embodiment of the application, the data attribute comprises an identification attribute, a quasi-identification attribute, a sensitive attribute and a non-identification attribute;

the selective protection unit 112 is specifically configured to:

In the embodiment of the application, all the data in the public cloud are partitioned according to the data attribute of each data in the public cloud, and the area corresponding to each data attribute is determined; and determining the protection mode of the area corresponding to each data attribute, and protecting all the data in the public cloud based on the protection mode of the area corresponding to each data attribute. By partitioning all data in the public cloud, determining the area corresponding to each data attribute and further determining the protection mode of the area corresponding to each data attribute, the method and the device can improve the efficiency of cloud data encryption and improve data security.

Referring to fig. 12 again, fig. 12 is a schematic diagram of a hardware structure of a server according to an embodiment of the present disclosure;

as shown in fig. 12, the server 120 includes: one or more processors 121 and a memory 122, and one processor 121 is taken as an example in fig. 12.

The processor 121 and the memory 122 may be connected by a bus or other means, and fig. 12 illustrates the connection by a bus as an example.

A processor 121, configured to obtain data to be classified;

determining a risk value of each data row in the data to be classified;

according to the risk value, carrying out risk classification on the data to be classified, and determining a classification result;

and selectively protecting the data to be classified according to the classification result.

Processor 121, further configured to: partitioning all the data in the public cloud according to the data attribute of each piece of data in the public cloud, and determining an area corresponding to each data attribute;

The memory 122 is a non-volatile computer-readable storage medium, and can be used for storing non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the data protection method in the embodiment of the present application. The processor 121 executes various functional applications of the controller and data processing, i.e., implements the data protection method of the above-described method embodiment, by executing the nonvolatile software program, instructions, and modules stored in the memory 122.

The memory 122 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the controller, and the like. Further, the memory 122 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 122 may optionally include memory located remotely from the processor 121, which may be connected to the controller via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The one or more modules are stored in the memory 122 and, when executed by the one or more processors 121, perform the data protection method of any of the method embodiments described above.

It should be noted that the product can execute the method provided by the embodiment of the present application, and has corresponding functional modules and beneficial effects for executing the method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.

Embodiments of the present application provide a non-transitory computer-readable storage medium storing computer-executable instructions, which are executed by one or more processors, such as the processor 121 in fig. 12, to enable the one or more processors to perform the data protection method in any of the above method embodiments.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

Through the above description of the embodiments, those skilled in the art will clearly understand that the embodiments may be implemented by software plus a general hardware platform, and may also be implemented by hardware. It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware related to instructions of a computer program, which can be stored in a computer readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; within the context of the present application, where technical features in the above embodiments or in different embodiments can also be combined, the steps can be implemented in any order and there are many other variations of the different aspects of the present application as described above, which are not provided in detail for the sake of brevity; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims

1. A method for data protection, the method comprising:

2. The method of claim 1, wherein the data attributes include an identification attribute, a quasi-identification attribute, a sensitive attribute, and a non-identification attribute;

3. The method of claim 2, wherein the symmetric encryption scheme in combination with the anonymization scheme comprises:

anonymizing the first data set;

and carrying out symmetric encryption processing on the second data set.

4. The method of claim 3, wherein anonymizing the first data set comprises:

5. The method of claim 4, wherein the firefly swarm optimization algorithm comprises: an initial cluster-center evolution stage;

wherein, the initial cluster-center evolution stage comprises:

entering into circulation;

updating the brightness of each firefly;

updating the radius;

6. The method of claim 5, wherein the firefly swarm optimization algorithm further comprises: a data classification stage;

7. The method of claim 6, wherein the data classification stage further comprises a remaining data processing stage, which specifically includes:

and outputting the classification result of the residual data processing stage.

8. A data protection device, the device comprising:

9. The apparatus of claim 8, wherein the data attributes include an identification attribute, a quasi-identification attribute, a sensitive attribute, and a non-identification attribute;

the selective protection unit is specifically configured to:

10. A server, characterized in that the server comprises:

at least one processor; and

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.