Disclosure of Invention
1. Technical problem to be solved
To address the problem of low data security in electronic commerce in the prior art, the invention provides a method for processing electronic commerce data that improves data security through differential encryption, index establishment, access control, and the like.
2. Technical proposal
The aim of the invention is achieved by the following technical scheme.
The embodiments of the specification provide a method for processing electronic commerce data, which comprises the following steps: acquiring electronic commerce data, extracting feature information by feature engineering, and constructing feature vectors; training a prediction model based on the feature vectors using a machine learning algorithm to predict first data that needs to be encrypted; calculating the semantic relevance among the predicted data to determine the encryption order; encrypting the first data with a differential encryption algorithm according to the feature information and the encryption order of the data to generate second data; establishing a mapping relation between the first data and the second data in a database and constructing an index; and employing an access control policy based on data attributes and user roles.
Further, the step of training a prediction model based on the feature vectors using a machine learning algorithm and predicting the first data to be encrypted comprises: labeling the constructed feature vectors to generate a first training set and a verification set; detecting whether the sample quantity distribution of each class in the first training set is balanced; if the first training set has a class imbalance, processing the first training set using an over-sampling or under-sampling technique to generate a second training set; training a gradient boosting decision tree model using the second training set and the verification set to obtain an encryption prediction model; and predicting the data that needs to be encrypted in the electronic commerce data using the encryption prediction model to obtain the first data.
Further, if there is a class imbalance in the first training set, the step of processing the first training set using an over-sampling or under-sampling technique to generate a second training set includes: acquiring the total sample amount N of the first training set; dynamically setting, according to the total sample amount, a first threshold ratio P1 and a second threshold ratio P2 of the sample number, where P1 is smaller than P2; calculating a first threshold N1 from the total sample amount N and the first threshold ratio P1; calculating a second threshold N2 from the total sample amount N and the second threshold ratio P2; judging whether the number of samples Ni in each class is larger than the first threshold N1 or smaller than the second threshold N2; when the number of samples Ni is larger than the first threshold N1, under-sampling the corresponding samples; when the number of samples Ni is smaller than the second threshold N2, over-sampling the corresponding samples; and repeating the threshold judgment and sampling processing on the training set after under-sampling or over-sampling to obtain the second training set. The threshold ratios are set dynamically by an adaptive algorithm based on machine learning; the under-sampling process is one of down-sampling, random sampling, or weighted sampling; the over-sampling process is one of up-sampling, synthesizing samples, or replicating samples.
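The threshold-and-resample loop can be sketched as follows. This is a minimal illustration, assuming simple random under-sampling and over-sampling by replication, and following the reading in the detailed description in which classes below the lower threshold N1 are over-sampled and classes above the upper threshold N2 are under-sampled; the function name and the default p1/p2 values are illustrative.

```python
import random

def rebalance(samples_by_class, p1=0.1, p2=0.5, seed=0):
    """Repeat threshold judgment and sampling until every class size
    lies between N1 = p1*N and N2 = p2*N (N recomputed each pass)."""
    rng = random.Random(seed)
    data = {c: list(s) for c, s in samples_by_class.items()}
    for _ in range(100):                     # safety cap on the repeat loop
        n = sum(len(s) for s in data.values())
        n1, n2 = int(p1 * n), int(p2 * n)
        changed = False
        for c, s in data.items():
            if len(s) > n2:                  # majority class: random under-sampling
                data[c] = rng.sample(s, n2)
                changed = True
            elif len(s) < n1:                # minority class: over-sample by replication
                data[c] = s + rng.choices(s, k=n1 - len(s))
                changed = True
        if not changed:
            break
    return data
```

Because the thresholds are recomputed from the shrinking total after each pass, the loop converges to a band where all class sizes are comparable.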
Further, training the gradient boosting decision tree model using the second training set to obtain the encryption prediction model includes: obtaining, according to the Bagging algorithm, a plurality of bootstrap samples from the second training set by sampling with replacement, the number of samples being A1 to A2 of the total sample amount; training a GBDT regression model on each bootstrap sample, the maximum number of iterations of the model being B1 to B2; during GBDT regression model training, adding Laplace noise at the leaf nodes, the value of the Laplace noise being determined by a dynamically preset privacy budget ε; evaluating the trained GBDT models using the verification set and removing models whose RMSE exceeds a threshold, the threshold being C1 to C2 of the RMSE values evaluated on the verification set; for the retained GBDT models, reducing the number of nodes and the node depth using knowledge distillation, the target depth and number of nodes being respectively D1 to D2 of those of the GBDT model before distillation; and integrating the plurality of GBDT models through iteration, the number of iterations being E1 to E2, to obtain the encryption prediction model.
Further, the step of obtaining a plurality of bootstrap samples from the second training set by sampling with replacement according to the Bagging algorithm includes: when the number of samples Ni is smaller than or equal to the preset first threshold N1, training a single GBDT model using the second training set; when the number of samples Ni is larger than the first threshold N1 and smaller than the preset second threshold N2, drawing with replacement, from the second training set, bootstrap samples whose size is A1 to A2 of the second training set, and training a plurality of GBDT models using the sampled bootstrap samples; and when the number of samples Ni is greater than or equal to the second threshold N2, drawing with replacement, from the second training set, bootstrap samples with the same number of samples as the second training set, and training the plurality of GBDT models using the extracted bootstrap samples.
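The three-way bootstrap rule can be sketched as below. The unspecified ratios A1 and A2 are stood in for by illustrative fractions (frac_lo/frac_hi), and the number of models is an assumption.

```python
import random

def make_bootstraps(train, n_i, n1, n2, frac_lo=0.6, frac_hi=0.9,
                    n_models=5, seed=0):
    """Choose the Bagging strategy from the class sample count n_i."""
    rng = random.Random(seed)
    if n_i <= n1:                 # small class: one GBDT on the full training set
        return [list(train)]
    if n_i < n2:                  # moderate: bootstraps sized frac_lo..frac_hi of the set
        k = int(rng.uniform(frac_lo, frac_hi) * len(train))
    else:                         # large class: full-size bootstraps
        k = len(train)
    return [rng.choices(train, k=k) for _ in range(n_models)]
```

Each returned list is one bootstrap sample drawn with replacement, on which one GBDT model would be trained.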
Further, the step of dynamically presetting the privacy budget ε comprises the following steps: setting the initial privacy budget ε0 to F1 to F2; cyclically receiving an i-th round of query requests Qi, and acquiring the query type xi and the data size yi corresponding to the i-th round of query requests Qi; and calculating, according to the query type xi and the data size yi of the i-th round of query request Qi, the privacy budget εqi consumed by the i-th round of query Qi by the following formula:
εqi = k1 * log(xi) * yi^α

wherein k1 is a privacy consumption coefficient and α is a data-scale adjustment coefficient;
Subtracting the accumulated consumed privacy budget from the initial preset privacy budget ε0 yields the remaining privacy budget εt+1; judging whether the remaining privacy budget εt+1 is lower than a preset threshold εmin; when εt+1 is smaller than or equal to εmin, calculating the preset sensitivity weight corresponding to the query type; calculating the Laplace noise parameter bi corresponding to the i-th round of query request Qi according to the sensitivity weight ωi and the i-th round of query consumption εqi; and returning a query response after adding the calculated noise bi to the i-th round of query request Qi.
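The budget accounting above can be sketched as follows; the values of k1, α, the initial budget, and ε_min are illustrative assumptions.

```python
import math

def query_cost(x_i, y_i, k1=0.05, alpha=0.5):
    """Consumed budget per the formula εqi = k1 * log(xi) * yi^α."""
    return k1 * math.log(x_i) * (y_i ** alpha)

def remaining_budget(eps0, costs):
    """Remaining budget εt+1: initial ε0 minus accumulated consumption."""
    return eps0 - sum(costs)

# Example: two queries against an initial budget of 1.0.
costs = [query_cost(2, 100), query_cost(3, 400)]
eps_next = remaining_budget(1.0, costs)
needs_noise = eps_next <= 0.2          # compare against a preset ε_min
```

When `needs_noise` is true, the scheme switches to computing the sensitivity weight and adding calibrated noise to subsequent responses.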
Further, the step of calculating the Laplace noise parameter bi includes: calculating the privacy noise distribution center bi0 by the following formula:

bi0 = ωi * εqi

wherein ωi is the sensitivity weight of query type xi and εqi is the privacy budget consumed by the query of query type xi;

calculating the adaptive variance function f(yi) by the following formula:

f(yi) = log(k2 * (yi + 1))

wherein k2 is a constant of the adaptive variance function;

constructing a Laplace distribution or Gaussian distribution Fi centered at bi0 with variance f(yi); and sampling from the distribution Fi using a distribution-based sampling algorithm to obtain the privacy noise value bi.
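A minimal sketch of this noise construction, using inverse-CDF sampling for the Laplace case (the Gaussian alternative is omitted); k2 and the seed are illustrative assumptions.

```python
import math
import random

def sample_noise(omega_i, eps_qi, y_i, k2=1.0, seed=0):
    """Draw the privacy noise bi from a Laplace distribution centered at
    bi0 = omega_i * eps_qi with variance f(yi) = log(k2 * (yi + 1))."""
    center = omega_i * eps_qi
    variance = math.log(k2 * (y_i + 1))
    scale = math.sqrt(variance / 2.0)       # Laplace variance = 2 * scale**2
    u = random.Random(seed).random() - 0.5  # inverse-CDF (quantile) sampling
    return center - scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)
```

Note that a larger data size yi widens the variance logarithmically, so big queries receive proportionally more noise.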
Further, the step of differential encryption includes: classifying the electronic commerce data and extracting key business data, authority control data, and text data to be searched; performing layered authority control on key business data, such as payment information whose transaction amount is larger than a preset threshold, by adopting a chained encryption mechanism based on homomorphic encryption; controlling access rights to the authority control data, which comprises preset user authority levels and client types, by adopting attribute-based encryption; and, for the text data to be searched, which comprises preset customer feedback and order remark text, searching in the encrypted domain by adopting an encryption technique that supports search.
Further, the step of employing an access control policy based on data attributes and user roles includes: generating a random salt value for each first data to be indexed using the XTS mode of the AES-256 algorithm; deriving an encryption key and a decryption key for the index database from the master key using the HKDF algorithm; encrypting the first data and the random salt value with the encryption key to generate second data; creating an index table in the index database, the index table containing a data ID, an encrypted salt value, and the second data; when the index is read, looking up the matching encrypted salt value and second data by the data ID, and decrypting with the decryption key to obtain the first data; establishing an RBAC-based access control mechanism in the index database, comprising identity verification, authorization management, and access audit logs; and performing security control on the index database by means of authority classification and access auditing.
Further, the step of encrypting the first data and the random salt value using the encryption key to generate the second data includes: generating a random master key with a length of G1 to G2 bits using a TRNG; for the first data, generating a random salt value with a length of G1 to G2 bits using a secure random number generation algorithm; deriving, from the master key, an encryption key with a length of G3 to G4 bits using the HMAC-SHA256 algorithm, to generate the encryption key corresponding to the first data; performing AES-256 encryption on the first data with the encryption key and outputting encrypted first data S1; performing AES-256 encryption on the random salt value of the first data and outputting an encrypted salt value S2; hashing S1 and S2 with the SHA-256 algorithm to obtain hash values U1 and U2; splitting U1 into K1 fragments and U2 into K2 fragments using the Shamir secret sharing algorithm; and, in a multi-path splicing manner, selecting M fragments at different positions from the fragments of U1 and U2 according to a preset rule and splicing them by exclusive-OR (XOR) to generate the second data.
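The derive-encrypt-hash-splice pipeline can be sketched with the standard library alone. The AES-256 step is represented here by a keyed HMAC placeholder (a real implementation would use an AES library such as `cryptography`), the Shamir splitting is omitted, and all lengths and labels are illustrative assumptions.

```python
import hashlib
import hmac
import os

def derive_key(master_key: bytes, info: bytes, length: int = 32) -> bytes:
    """HMAC-SHA256-based key derivation (a minimal HKDF-like expand step)."""
    return hmac.new(master_key, info, hashlib.sha256).digest()[:length]

def protect(first_data: bytes, master_key: bytes) -> bytes:
    salt = os.urandom(16)                      # per-record random salt
    key = derive_key(master_key, b"record")
    # Placeholder for AES-256 encryption of the data and the salt;
    # a keyed HMAC keeps the shape of the pipeline without real AES.
    s1 = hmac.new(key, first_data, hashlib.sha256).digest()
    s2 = hmac.new(key, salt, hashlib.sha256).digest()
    u1 = hashlib.sha256(s1).digest()           # hash values U1, U2
    u2 = hashlib.sha256(s2).digest()
    # Multi-path splicing: XOR corresponding fragments of U1 and U2.
    return bytes(a ^ b for a, b in zip(u1, u2))
```

The random salt ensures that encrypting the same first data twice yields different second data, which is the point of the differential scheme.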
3. Advantageous effects
Compared with the prior art, the invention has the advantages that:
(1) The data that needs to be encrypted, which may include customers' personal identification information, credit card data, and the like, is predicted by a machine learning model, and a differential encryption algorithm is applied, effectively protecting data privacy. Differential encryption takes the characteristics of the data into account during encryption, so that the same sensitive data produces different results after encryption, improving data security;
(2) An index database of the encrypted data is established, and the security of the index is ensured by the Advanced Encryption Standard (AES-256) and a key derivation function (HKDF). AES-256 is a highly secure symmetric encryption algorithm that provides strong data protection, and HKDF is used to derive additional keys from a master key, ensuring the security and randomness of the keys. This means that even if index data is compromised, an attacker cannot easily decrypt it, thereby improving data security;
(3) An access control policy based on data attributes and user roles, together with dynamic privacy budget management, ensures that only authorized users can access the data and that privacy budgets are managed dynamically. Dynamic privacy budget management is an intelligent method for ensuring that the use of the data does not exceed predetermined privacy limits: even if the data has been accessed by many users, appropriate restrictions can be imposed according to privacy policies and regulations, protecting user privacy and improving data security.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present specification, the drawings that are required to be used in the description of the embodiments will be briefly described below. It is apparent that the drawings in the following description are only some examples or embodiments of the present specification, and it is possible for those of ordinary skill in the art to apply the present specification to other similar situations according to the drawings without inventive effort. Unless otherwise apparent from the context of the language or otherwise specified, like reference numerals in the figures refer to like structures or operations.
A flowchart is used in this specification to describe the operations performed by the system according to embodiments of the present specification. It should be appreciated that the preceding or following operations are not necessarily performed in order precisely. Rather, the steps may be processed in reverse order or simultaneously. Also, other operations may be added to or removed from these processes.
The method and system provided in the embodiments of the present specification are described in detail below with reference to the accompanying drawings.
FIG. 1 is an exemplary flowchart of a method for processing electronic commerce data according to some embodiments of the present description. As shown in FIG. 1, the method includes: S110, acquiring electronic commerce data, extracting feature information by feature engineering, and constructing feature vectors; feature engineering extracts key feature information from the raw data, which may include user behavior frequency, purchase history, merchandise attributes, and the like. S120, training a prediction model based on the feature vectors using a machine learning algorithm, and predicting the first data to be encrypted; the task of the prediction model is to predict, from the feature vector, which data needs to be encrypted, which may be decided based on factors such as sensitivity and privacy; the output of the model indicates which data needs to be encrypted. S130, for the predicted data, calculating the semantic relevance among the data to determine the encryption order; semantic relevance helps the system determine which data should be encrypted together to preserve the relevance of the data. S140, encrypting the first data with a differential encryption algorithm according to the feature information and the encryption order of the data to generate second data; the system determines the order and manner of encryption based on the output of the prediction model and the semantic relevance of the data, and the differential encryption algorithm can select different encryption modes according to the characteristics and sensitivity of the data.
S150, establishing a mapping relation between the first data and the second data in a database and constructing an index; the system creates a mapping relationship in the database that associates the first data with the generated second data. This index can be used to quickly retrieve and access both the original data and the encrypted data, ensuring the integrity and consistency of the data. S160, employing an access control policy based on data attributes (e.g., sensitivity level) and user roles (e.g., user permissions) to ensure that only authorized users can access the data; the access control policy can control access rights in a fine-grained manner according to the characteristics of different data and the roles of users.
Specifically, raw data is obtained from an electronic commerce platform, including transaction records, user information, order text, and the like; data cleaning, denoising, and formatting are performed for the different data types; important feature information such as user behavior, transaction amount, and time stamps is extracted by feature engineering; and feature vectors are constructed, converting the feature information into numerical form for training the machine learning model. An appropriate machine learning algorithm is selected, such as Gradient Boosting Decision Trees; a first training set and a verification set are constructed, and the training samples are labeled to mark which data need to be encrypted; whether the sample quantity distribution of each class in the first training set is balanced is detected; if a class imbalance exists, the first training set is processed with an over-sampling (e.g., SMOTE) or under-sampling technique to generate a second training set; and the gradient boosting decision tree model is trained using the second training set and the verification set to obtain the encryption prediction model. The data to be encrypted is predicted to obtain the first data; the semantic relevance between data is calculated using natural language processing (NLP) or text embedding methods; and a differential encryption algorithm is selected according to the feature information and the encryption order of the data to ensure proper protection of the different data types.
For highly sensitive data such as transaction amounts, hierarchical authority control is carried out using a homomorphic encryption algorithm; for authority control data (such as user authority level and client type), attribute-based encryption is adopted to realize fine-grained access control; and text data to be searched is searched within the encrypted domain using encryption techniques that support search, such as searchable encryption or homomorphic search techniques. A mapping relation between the first data and the second data is established in the database, and an index table is constructed; the index table contains a data ID, an encrypted salt value, and the second data; a random salt value is generated for each first data to be indexed using the XTS mode of the AES-256 algorithm; an encryption key and a decryption key for the index database are derived from the master key using the HKDF algorithm; and the first data and the random salt value are encrypted using the encryption key to generate the second data. A role-based access control mechanism, comprising authentication, authorization management, and access audit logs, is established in the index database; access rights to the data are controlled by a policy based on data attributes and user roles; authority grading and access audit strategies are formulated to realize fine-grained permission control; and each user or role is assigned a unique access key for decrypting the data.
Example 1: data such as e-commerce user registration information, browsing records, and transaction records are collected, 10 million records in total. Feature vectors containing 10 features such as age, occupation, and consumption level are constructed through statistical analysis. Based on the feature vectors, a GBDT model is used to predict that the financial transaction information needs to be encrypted. A dynamic threshold ratio is set to handle sample imbalance. For the predicted transaction information, commodity similarity between orders is calculated as the encryption order, and similar goods are encrypted in sequence. For payment information with an order amount larger than 5000 yuan, a chained encryption method based on homomorphic encryption is adopted; attribute-based encryption is employed for the customer identity information. In the database, the order IDs are mapped to the encrypted order contents in the trade order table. A role-based access control model is established; operators can only query the encrypted order information.
Example 2: unstructured data generated by users, such as customer feedback text and order notes, is collected. Text feature vectors comprising word frequencies, parts of speech, keywords, and the like are constructed. Based on the feature vectors, a GBDT model predicts that text containing user privacy information needs to be encrypted. For the predicted text, a text semantic similarity matrix is calculated as the encryption order. For the text data, a deterministic encryption algorithm supporting search is employed, so keyword searches can be performed within the encrypted domain. An index correspondence between the feedback text and the encrypted text is established. Operators and managers have different document access rights, and an RBAC model is adopted for identity authentication and authorization control.
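The deterministic, search-supporting encryption in Example 2 can be sketched as a keyed-token index: because the scheme is deterministic, equal keywords produce equal tokens, so equality search works over the encrypted index. A real deployment would use a proper searchable-encryption scheme; the key and tokenization below are illustrative assumptions.

```python
import hashlib
import hmac

def det_token(key: bytes, word: str) -> bytes:
    """Deterministic keyword token: the same word under the same key always
    yields the same token, enabling search in the encrypted domain."""
    return hmac.new(key, word.lower().encode(), hashlib.sha256).digest()

def index_doc(key: bytes, doc_id, text: str, index: dict) -> None:
    # Tokenize each word of the document and record which documents contain it.
    for word in text.split():
        index.setdefault(det_token(key, word), set()).add(doc_id)

def search(key: bytes, word: str, index: dict) -> set:
    # The server matches tokens without ever seeing the plaintext keyword.
    return index.get(det_token(key, word), set())
```

The trade-off of determinism is that token frequencies leak, which is why production schemes add further hardening.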
In summary, the data collection of S110 provides the raw data source for subsequent feature engineering and model training. The model training of S120 uses the feature vectors constructed in S110 to predict which data need to be encrypted, and the dynamic threshold setting improves the accuracy of the model. S130 calculates the semantic relevance of the data and, combined with the model output of S120, provides the encryption order; ordering encryption by data relevance improves security. S140 selects differential encryption algorithms suited to different data types and, together with the encryption order of S130, further strengthens the targeting and strength of the encryption. S150 establishes the mapping relation between the original data and the encrypted data generated by S140, supporting index lookup and access control requirements. S160 builds an access control model based on attributes and roles and, combined with the mapping index of S150, comprehensively safeguards data access security. Through the organic connection and cooperation of these technical features, full-flow protection from data acquisition to access control is realized, so that the original electronic commerce data is protected efficiently and reliably, and the security of the overall scheme is effectively improved.
FIG. 2 is an exemplary flowchart of predicting the first data that needs to be encrypted according to some embodiments of the present disclosure. As shown in FIG. 2, the step of training a prediction model based on the feature vectors using a machine learning algorithm and predicting the first data comprises: S121, labeling the constructed feature vectors to generate a first training set and a verification set; the training set is used to train the model, and the verification set is used to evaluate and optimize its performance. S122, detecting whether the sample number distribution of each class in the first training set is balanced; class imbalance may affect the performance of the model and the accuracy of its predictions. S123, if the first training set has a class imbalance, processing it with an over-sampling or under-sampling technique to generate a second training set; over-sampling increases the number of minority-class samples, while under-sampling decreases the number of majority-class samples, yielding balanced training data. S124, training a gradient boosting decision tree model using the second training set and the verification set to obtain the encryption prediction model; the gradient boosting decision tree is a powerful machine learning algorithm for classification tasks and performs well in predicting the first data that needs to be encrypted. S125, predicting the data that needs to be encrypted in the electronic commerce data using the encryption prediction model to obtain the first data.
The e-commerce data is predicted using the trained encryption prediction model to determine which data needs to be encrypted.
Specifically, a machine learning algorithm is used to predict the first data that needs to be encrypted. The method comprises feature vector labeling, sample balance detection, over-sampling/under-sampling, machine learning model training, and data prediction, and aims to provide an effective basis for data encryption. Suitable features are selected from the e-commerce data, such as user behavior, transaction amounts, and time stamps. The feature vectors are labeled: data that needs to be encrypted is labeled 1, and data that does not is labeled 0. The numbers of samples with labels 1 and 0 are counted and checked for imbalance. If an imbalance exists, an over-sampling or under-sampling technique is applied as needed to generate a second training set. A gradient boosting decision tree model, or another suitable machine learning algorithm, is trained using the second training set and the verification set, and the model parameters are tuned to improve performance. The e-commerce data is then predicted with the trained encryption prediction model to obtain the first data. The prediction result is used for subsequent data encryption, ensuring that only the data that needs to be encrypted is protected.
In this embodiment, 1 million user registration records containing the users' age, occupation, and the like are collected; user feature vectors containing 10-dimensional features such as age and occupation are constructed; the feature vectors are labeled, marking registration data containing personal privacy information such as real names and identity card numbers. A user registration training set and a verification set are generated; the sample distribution ratio of non-private and private data in the training set is detected, and the class distribution is unbalanced because private data are fewer; the private data class is over-sampled, generating a new training set with balanced sample sizes; the GBDT model is trained using the new training set and the verification set to predict the privacy of user registration data; all registration data are predicted by the GBDT model, and the user privacy registration information to be encrypted is output; the prediction result is checked to ensure that all real private data are covered; and the first data needing encryption protection are obtained, completing the privacy prediction of the user registration data.
FIG. 3 is an exemplary flowchart for generating the second training set according to some embodiments of the present description. As shown in FIG. 3, if there is a class imbalance in the first training set, the first training set is processed using an over-sampling or under-sampling technique, and the step of generating the second training set comprises: S123A, obtaining the total sample amount N of the first training set; S123B, dynamically setting a first threshold ratio P1 and a second threshold ratio P2 of the number of samples according to an adaptive strategy of the machine learning algorithm; these ratios are used to determine whether to under-sample or over-sample. S123C, calculating a first threshold N1 and a second threshold N2 according to the dynamically set threshold ratios, the thresholds being used to judge whether sampling processing is performed; S123D, judging whether the number of samples Ni of each class is larger than the first threshold N1 or smaller than the second threshold N2; S123E, when the number of samples Ni is larger than the first threshold N1, under-sampling the corresponding samples, and when the number of samples Ni is smaller than the second threshold N2, over-sampling the corresponding samples; S123F, repeating the threshold judgment and sampling processing on the training set after under-sampling or over-sampling to obtain the second training set; the threshold judgment and sampling processing are repeated until the number of samples of every class lies between the first threshold N1 and the second threshold N2, thereby generating a balanced second training set. The threshold ratios are set dynamically by an adaptive algorithm based on machine learning, so that the sampling threshold ratios adapt to different data distributions.
The undersampling process is one of downsampling, random sampling or weighted sampling; the over-sampling process is one of up-sampling, synthesizing, or replicating the sample.
This method dynamically sets the threshold ratios and applies under-sampling and over-sampling to make the class distribution of the training set more balanced, improving the performance of the machine learning model and thus enhancing the security of the electronic commerce data. The total sample amount N of the first training set is obtained, and two threshold ratios are defined dynamically according to N: a first threshold ratio P1 and a second threshold ratio P2, with P1 smaller than P2. These ratios determine when to under-sample and over-sample. A first threshold N1 is calculated from the total sample amount N and the first threshold ratio P1, and a second threshold N2 from N and the second threshold ratio P2. For the sample number Ni of each class, it is judged whether it is greater than the first threshold N1 or less than the second threshold N2. When Ni is greater than the first threshold N1, under-sampling is performed to reduce the number of samples; when Ni is smaller than the second threshold N2, over-sampling is performed to increase it. For the training set after under-sampling or over-sampling, the threshold judgment and sampling steps may be repeated as needed until the sample distribution is balanced. The threshold ratios may be set by an adaptive algorithm based on machine learning, ensuring their rationality and adaptability. The under-sampling process may employ down-sampling, random sampling, or weighted sampling; the over-sampling process may employ up-sampling, synthesizing samples, or replicating samples.
Specifically, two threshold ratios are defined dynamically: a first threshold ratio P1 and a second threshold ratio P2, with P1 smaller than P2. In this embodiment, recommended initial ranges are, empirically, P1 in 0.05 to 0.2 and P2 in 0.2 to 0.5. The total sample count N of the training data set is collected, and several candidate pairs are selected, such as {(0.05, 0.3), (0.1, 0.4), (0.15, 0.5)}. For each candidate pair, multiple rounds of experiments are performed: the pair of P1 and P2 values is applied to sample processing and model training, and the results are recorded. Model evaluation metrics such as AUC and F1 score are compared across the candidate pairs, and the highest-scoring pair is chosen as the final values of P1 and P2. Alternatively, an error function can be established and the values of P1 and P2 continuously optimized by gradient descent, or hyperparameter-tuning algorithms such as Bayesian optimization and random search can automatically find the optimal values of P1 and P2. When a new data set is acquired, the above process is repeated to update P1 and P2, realizing dynamic optimization.
Specifically, threshold ratios P1 and P2 are initialized, with 0 < P1 < P2 < 1; the total sample count N of the training set is calculated; dynamic thresholds N1 = P1 * N and N2 = P2 * N are computed from the ratios P1 and P2 and the sample count N. Each class i of the training set is traversed and its sample count Ni is counted, then compared against N1 and N2: if Ni <= N1, the class i sample size is small and oversampling is needed; if N1 < Ni < N2, the class i sample size is moderate and undersampling can be considered; if Ni >= N2, the class i sample size is large and undersampling is required. Different sampling strategies are selected for different classes according to this judgment, and the process is repeated until the sample size distribution is balanced.
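The per-class decision rule above can be sketched as follows; this is a minimal illustration, not the claimed implementation, and the helper name and the "keep" action for the moderate branch (where the text says undersampling "can be considered") are assumptions.

```python
from collections import Counter

def rebalance_plan(labels, p1=0.1, p2=0.3):
    """Decide a sampling action per class from dynamic thresholds.

    N1 = p1*N and N2 = p2*N are recomputed from the current total N,
    so the thresholds change with the data set size.
    """
    n = len(labels)
    n1, n2 = p1 * n, p2 * n
    plan = {}
    for cls, count in Counter(labels).items():
        if count <= n1:
            plan[cls] = "oversample"   # class too small
        elif count >= n2:
            plan[cls] = "undersample"  # class too large
        else:
            plan[cls] = "keep"         # moderate; optional undersampling
    return plan
```

For example, with 100 samples and the embodiment's ratios P1 = 0.1, P2 = 0.3, a class of 5 samples is marked for oversampling and a class of 80 for undersampling.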
In this embodiment, a registration data set including features such as user age and occupation is collected, with a total sample size of 500,000. Counting the sample sizes per category reveals that the category containing personal privacy data has only 50,000 samples, unbalanced relative to the other categories. The dynamic threshold ratios are set to P1 = 0.1 and P2 = 0.3, i.e. the thresholds change dynamically with the total sample size. The privacy-data category's sample size is calculated to be below the second threshold N2 = 150,000, so that category is oversampled: new samples are synthesized with the SMOTE algorithm, raising the privacy-data samples to 150,000. Balance is then judged again; the privacy samples are now above the first threshold N1 = 50,000, so undersampling continues, reducing the other categories to 150,000 samples by downsampling. Judgment and sampling are repeated until each category's sample size is close to 150,000, producing a second training set with balanced sample sizes. A Bayesian optimization algorithm is adopted to dynamically optimize the threshold ratios so as to adapt to more scenarios.
FIG. 4 is an exemplary flowchart of obtaining an encryption prediction model according to some embodiments of the present disclosure. As shown in FIG. 4, training a gradient boosting decision tree model using the second training set to obtain the encryption prediction model comprises: S124A, using a Bagging algorithm, randomly drawing a plurality of bootstrap samples from the second training set with replacement, the number of samples per draw being dynamically set between A1 and A2; S124B, training an independent GBDT regression model on each bootstrap sample, the model's maximum iteration count being dynamically set between B1 and B2; S124C, during training of each GBDT regression model, adding Laplacian noise to the prediction results at the leaf nodes to enhance privacy protection, the noise strength being adjusted according to a dynamically preset privacy budget ε; S124D, evaluating each trained GBDT model with the validation set and screening out models whose root mean square error (RMSE) on the validation set exceeds the threshold range C1 to C2, to ensure the models' predictive performance; S124E, applying knowledge distillation to the retained GBDT models to reduce the node count and depth of each model, lowering complexity while maintaining performance, with the target depth and node count dynamically set between D1 and D2; S124F, iteratively integrating the plurality of GBDT models, with the iteration count dynamically set between E1 and E2, to generate the final encryption prediction model. Integrating multiple models helps improve the robustness and performance of the model.
A Bagging-based sample sampling method acquires a plurality of bootstrap samples for training Gradient Boosting Decision Tree (GBDT) models. Different strategies can be adopted for different sample counts to ensure model diversity and performance. The Bagging algorithm generates multiple bootstrap samples by sampling with replacement, increasing model diversity and robustness, and different Bagging strategies are adopted at different sample counts to match the training requirements of different data scales.
When the sample count Ni is less than or equal to a preset first threshold N1, a single GBDT model is trained using the second training set: with few samples, Bagging sampling is unnecessary and one GBDT model is trained directly on the second training set.
When the sample count Ni is greater than N1 and less than N2, bootstrap samples of size A1 to A2 are randomly drawn with replacement from the second training set, and a plurality of GBDT models are trained on them: for this middle range, a number of bootstrap samples are drawn according to the preset range A1 to A2, and each drawn sample is used to train an independent GBDT model.
When the sample count Ni is greater than or equal to N2, bootstrap samples of the same size as the second training set are randomly drawn with replacement from it, and the plurality of GBDT models are trained on these samples: with many samples, drawing full-size bootstrap samples with replacement preserves diversity while making use of the larger data set.
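The three-branch sampling rule above can be sketched as follows; a minimal illustration, with the function name, the number of models, and the seed chosen here as assumptions (the text does not fix them).

```python
import random

def bagging_samples(train, n1, n2, a1, a2, n_models=5, seed=0):
    """Draw bootstrap samples per the three-branch rule: no bagging for
    small sets, size a1..a2 samples for medium sets, full-size samples
    (with replacement) for large sets."""
    rng = random.Random(seed)
    n = len(train)
    if n <= n1:                       # small: one model on the full set
        return [list(train)]
    if n < n2:                        # medium: bootstrap samples of size a1..a2
        sizes = [rng.randint(a1, a2) for _ in range(n_models)]
    else:                             # large: full-size bootstrap samples
        sizes = [n] * n_models
    return [[rng.choice(train) for _ in range(k)] for k in sizes]
```

Each returned sample would then train one independent GBDT model.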
The encryption prediction model is created by combining the Bagging algorithm, gradient boosting decision trees (GBDT), differential privacy, and knowledge distillation to protect sensitive data. GBDT models perform the data analysis and prediction, and differential privacy measures are introduced to ensure the privacy and safety of user data. An appropriate second training set is selected for training the GBDT models. The Bagging algorithm draws a plurality of bootstrap samples from the second training set with replacement, the sample count range being determined by A1 to A2. In the present application, A1 = (0.8~0.9)·N and A2 = (0.2~0.1)·N.
A GBDT regression model is trained on each bootstrap sample, with the maximum iteration count range determined by B1 to B2. During training of each GBDT regression model, Laplacian noise is added at the leaf nodes, its magnitude determined by the dynamically preset privacy budget ε. In the present application, the initial maximum iteration count B1 of GBDT is set to 10; for each bootstrap sample, the error descent curve is recorded during training, and when the error drop is smaller than a preset threshold τ for 5 consecutive iterations, the current iteration count is recorded as B2, giving that bootstrap sample a GBDT maximum-iteration range of [B1, B2]. The process is repeated for all bootstrap samples to obtain their respective maximum-iteration ranges.
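The B2 stopping rule above can be sketched as a scan over a recorded error curve; a minimal illustration, with the function name and the fallback (returning the curve length if the rule never triggers) as assumptions.

```python
def max_iter_range(errors, b1=10, tau=0.005, patience=5):
    """Return [B1, B2]: B2 is the iteration at which `patience`
    consecutive error drops each fall below tau."""
    streak = 0
    for i in range(1, len(errors)):
        drop = errors[i - 1] - errors[i]
        streak = streak + 1 if drop < tau else 0
        if streak >= patience and i + 1 >= b1:
            return (b1, i + 1)
    return (b1, len(errors))  # rule never triggered: use the full curve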
The trained GBDT models are evaluated on the validation set, and models whose RMSE on the validation set exceeds the threshold range determined by C1 to C2 are removed. In the present application, C1 = (0.8~0.9)·RMSE; setting C1 to a smaller value ensures that some poorly performing models are removed. C2 = (0.2~0.1)·RMSE; setting C2 slightly higher ensures that most better-performing models are preserved.
For the retained GBDT models, knowledge distillation is used to reduce the node count and tree depth, with the target depth and node count determined by D1 to D2 respectively. Knowledge distillation aims to produce a smaller model that preserves the predictive capability of the large one. For a GBDT model, the node count and tree depth determine its complexity and scale. Let the original GBDT model have M nodes and tree depth H. The target model should not be too small, or too much information is lost, so D1 = 0.5M and D2 = 0.8M for the node count; the target tree should not be too shallow, so D1 = 0.8H and D2 = H for the depth. That is, the target model keeps 50%-80% of the original node count and 80%-100% of the original tree depth. Shrinking the node count and tree depth within these ranges effectively reduces model scale while retaining most of its expressive capability.
By iteratively integrating a plurality of GBDT models, with the iteration count range determined by E1 to E2, the encryption prediction model is finally obtained. In the present application, the minimum iteration count E1 is initialized to 10. Each iteration adds a GBDT model and tests the overall effect of the enlarged ensemble on the validation set; if adding the new model does not noticeably improve prediction (the gain is below a preset lift threshold α), the current iteration count is recorded as E2. The iterative-integration count thus lies in the range [E1, E2].
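The E2 stopping rule above can be sketched as follows; a minimal illustration in which the scoring callback and function name are assumptions, and the gain is measured against the previous ensemble's score.

```python
def grow_ensemble(candidates, score_fn, e1=10, alpha=0.01):
    """Add models one by one; stop once at least e1 models are in and the
    validation-score gain over the previous round is below alpha."""
    ensemble, prev = [], None
    for model in candidates:
        ensemble.append(model)
        s = score_fn(ensemble)
        if prev is not None and len(ensemble) >= e1 and s - prev < alpha:
            break  # gain below the lift threshold: record E2 and stop
        prev = s
    return ensemble, (e1, len(ensemble))
```

With a score that rises by 0.1 per model and plateaus after the 11th, the loop stops at 12 models, matching the worked range [10, 12] in the embodiment below.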
Adding Laplacian noise during model training realizes differential privacy protection and ensures that sensitive user data is not leaked. Knowledge distillation shrinks the model, reducing computation and storage cost while maintaining predictive performance. The Bagging algorithm improves model robustness, reduces the risk of overfitting, and improves generalization.
Specifically, the Bagging algorithm draws a plurality of bootstrap samples from the second training set by sampling with replacement in order to train multiple GBDT models, achieving privacy protection and improved security for user data. The technical features of differential privacy and model integration are combined to ensure both data privacy and model robustness. Bagging is short for Bootstrap Aggregating: it generates subsets of the training data by sampling with replacement and then trains multiple models on those subsets. An appropriate second training set is selected for training the GBDT models, and the sample count Ni in the second training set is calculated from the actual data.
When the sample count Ni is less than or equal to the preset first threshold N1, the Bagging algorithm is not used; instead a single GBDT model is trained on the second training set. With so few samples, model integration is unnecessary, which improves training efficiency.
When the sample count Ni is greater than the first threshold N1 and less than the preset second threshold N2, the following steps are performed: bootstrap samples of size A1 to A2 are randomly drawn with replacement from the second training set, and multiple GBDT models are trained on them. The purpose is to increase model diversity and improve generalization and robustness.
When the sample count Ni is greater than or equal to the preset second threshold N2, the following steps are performed: bootstrap samples of the same size as the second training set are randomly drawn with replacement from it, and multiple GBDT models are trained on them, again to increase model diversity and robustness. The resulting GBDT models may be integrated by voting, averaging, or similar methods to produce more robust and accurate predictions. Bagging with random replacement sampling increases data diversity, which helps protect the privacy of user data, and integrating the multiple models it produces improves robustness and accuracy.
In this embodiment, for the second-training-set sample count range [A1, A2]: the total training sample count N = 10000; A1 = 0.8·N = 8000 and A2 = 0.2·N = 2000, so the sample count range spans 2000 to 8000. For the GBDT maximum iteration range [B1, B2]: B1 = 10 iterations; the error-drop threshold τ is set to 0.005; at iteration 15 the error drop falls below 0.005, so B2 = 15 and the range is [10, 15]. For the GBDT screening RMSE threshold range [C1, C2]: the model RMSE is 0.1; C1 = 0.8 × 0.1 = 0.08 and C2 = 0.2 × 0.1 = 0.02, so the threshold range spans 0.02 to 0.08. For the post-distillation node count range [D1, D2]: the original model has M = 500 nodes; the target node count range is [0.5 × 500, 0.8 × 500] = [250, 400]. For the iterative-integration count range [E1, E2]: E1 = 10; the lift threshold α is set to 0.01; at iteration 12 the gain falls below 0.01, so E2 = 12 and the range is [10, 12].
The privacy budget is dynamically calculated according to the query type and data size, and differential privacy is applied when a preset threshold is reached, protecting the privacy of user data. A privacy budget ε0 in the range F1 to F2 is initialized; this initial budget is used for subsequent privacy management. In each round, the i-th query request Qi is received, comprising a query type xi and a data size yi, and the privacy budget εqi consumed by the query is calculated from xi and yi as: εqi = k1 · log(xi) · yi^α. In the present application, referring to industry practice, the standard default value of the initial privacy budget ε0 is set to 1.0, and an adjustable budget coefficient k (default 1.0) is set per application scenario: where privacy protection requirements are high, k may be set between 0.5 and 0.8; where functional and output-quality requirements are high, k may be set between 1.5 and 2.0. The initial privacy budget is calculated as ε0 = k, so the reasonable range of values F1 to F2 for ε0 is [0.5, 2.0]. Multiple budget levels may also be set, for example: level 1, [0.5, 1.0]; level 2, [1.0, 1.5]; level 3, [1.5, 2.0]. The privacy consumption coefficient k1 takes values in [0.5, 2] and controls the consumption rate. The data-scale adjustment coefficient α takes values in [0.8, 1], giving a slightly sublinear relation to the data size. The query-type complexity xi may be graded 1-10 according to query semantics, higher grades indicating greater query complexity and sensitivity. The weight ωi can be tied to the query grade, ranging over [0.1, 1], with higher grades weighted more.
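The consumption formula above can be expressed directly; a minimal sketch in which the function names are assumptions and the natural logarithm is assumed, since the text does not fix the log base.

```python
import math

def budget_consumed(x_i, y_i, k1=1.0, alpha=0.9):
    """epsilon_qi = k1 * log(x_i) * y_i**alpha (natural log assumed)."""
    return k1 * math.log(x_i) * (y_i ** alpha)

def remaining_budget(eps0, consumptions):
    """Remaining budget after a sequence of queries; the caller enables
    the protection mechanism once this drops below eps_min."""
    return eps0 - sum(consumptions)
```

For instance, a query with xi = e and yi = 1 consumes exactly k1 units, and the remaining budget is the initial budget minus all consumed amounts.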
A noise parameter bi is calculated from the current remaining budget and the weight, so that high-weight queries retain a good privacy-protection effect when the budget runs low. The remaining-budget threshold εmin can be set to 10%-30% of the initial budget; reaching it triggers the protection mechanism.
Specifically, the privacy noise distribution center bi0 of each query request, which reflects the privacy sensitivity of that request, is computed from sensitivity weights and an adaptive variance function, and a corresponding random noise distribution is constructed. It is calculated as bi0 = ωi · εqi, where ωi is the sensitivity weight of query type xi: preset sensitivity weights (ω1, ω2, ..., ωm) correspond to the query types (x1, x2, ..., xm), with different weights assigned by weighing the importance of the different query types, and εqi is the privacy budget consumed by a query of type xi, calculated in the previous steps. An adaptive variance function f(yi) is introduced to control the noise variance and thereby adjust the noise level dynamically with the data size: f(yi) = log(k2 · (yi + 1)). In the present application, the adaptive variance coefficient k2 takes values in [1, 1.5] and controls the variance magnitude. Assigning different sensitivity weights ωi to different query types personalizes the privacy-protection level according to each query's importance and sensitivity: important data receives stronger protection, while for less sensitive data the protection overhead can be reduced, improving data usability. Introducing the random noise distribution and randomly perturbing query results around the center bi0 obscures the information of the original data, so that sensitive data cannot easily be recovered by malicious users or attackers; this randomization raises the level of data privacy protection. The adaptive variance function f(yi) allows the noise variance to be dynamically adjusted according to the data size of the query.
The variance is relatively large when processing large-scale data, allowing more randomness, and smaller when processing small-scale data, reducing excessive disturbance. Privacy can thus be protected while preserving data accuracy as much as possible, improving usability. The coefficient k2 in the adaptive variance function takes values in [1, 1.5], allowing the noise level to be controlled according to specific requirements: since f(yi) = log(k2 · (yi + 1)) grows with k2, a larger k2 value yields a larger variance and more stringent privacy protection, while a smaller k2 value yields a smaller variance, permitting more information leakage but improving the usability of the data. This flexibility enables the method to be tuned for different application scenarios.
In this embodiment, the initial privacy budget ε0 is set to 1.5, i.e. budget level 2 is selected. A query request Q1 for a user's purchase records is received; its type belongs to grade-4 sensitive queries and its data size is y1 = 5000. The privacy budget consumption of Q1 is calculated from the formula εqi = k1 · log(xi) · yi^α = 1 · log 4 · 5000^0.9 = 5.2. The sensitivity weight of Q1 is ω1 = 0.9, giving a noise distribution center bi0 = ωi · εqi = 0.9 × 5.2 = 4.68. With the adaptive variance coefficient k2 = 1.3 and data size y1 = 5000, the variance is calculated as f(yi) = log(k2 · (yi + 1)) = log(1.3 × 5001) = 8.8. A Laplace noise distribution with center bi0 and variance f(yi) is constructed; the privacy noise bi is sampled from this distribution, added to the query response, and sent to the user.
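The noise construction in this embodiment can be sketched as follows; a minimal illustration, with the function name assumed, the natural log assumed, and f(yi) used as the Laplace scale parameter (the text calls it a variance without fixing the parameterization). The sampler uses the fact that the difference of two Exp(1) variates is Laplace(0, 1).

```python
import math
import random

def adaptive_laplace_noise(omega_i, eps_qi, y_i, k2=1.3, rng=random):
    """Sample noise from Laplace(b_i0, f(y_i)) with b_i0 = omega_i*eps_qi
    and f(y_i) = log(k2*(y_i+1))."""
    b_i0 = omega_i * eps_qi           # distribution center
    f = math.log(k2 * (y_i + 1))      # adaptive spread term
    # difference of two Exp(1) draws is Laplace(0, 1); scale and shift
    return b_i0 + f * (rng.expovariate(1.0) - rng.expovariate(1.0))
```

Averaging many samples with the embodiment's parameters (ω1 = 0.9, εq1 = 5.2, y1 = 5000) recovers a mean near the center 4.68.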
FIG. 5 is an exemplary flowchart of differential encryption according to some embodiments of the present disclosure. As shown in FIG. 5, the data is separated into critical business data, rights control data, and text data to be searched, and different encryption techniques and rights-control policies protect its privacy and security. The differential encryption steps comprise: S141, classifying the e-commerce data into three categories: critical business data, including payment information whose transaction amount exceeds a preset threshold; rights control data, including preset user permission levels, client types, and the like; and text data to be searched, including customer feedback, order remark text, and the like. S142, for payment information in the critical business data whose transaction amount exceeds the preset threshold, applying hierarchical rights control with a chained encryption mechanism based on homomorphic encryption. Homomorphic encryption allows computation in the encrypted state, enabling rights control without exposing plaintext data; chained encryption provides multiple key levels, with only the proper key able to decrypt the data of the corresponding layer, strengthening data security. The homomorphic and chained encryption mechanisms realize multi-level rights control over critical business data and ensure that only authorized users can access sensitive data. Classifying the data and applying different encryption strategies protects it differentially according to its sensitivity and purpose, improving data security. S143, applying attribute-based encryption to the rights control data, such as user permission level and client type, to control access rights.
Attribute-based encryption ties encryption and decryption to user attributes, so only users possessing the specific attributes can access the corresponding data; it is used here to control rights, ensuring that only users with matching attributes can decrypt and access the rights control data. S144, for text data to be searched, such as customer feedback and order remark text, a searchable encryption technique is adopted, permitting search within the encryption domain. Searchable encryption allows search operations to be performed in the encrypted state without exposing plaintext data, protecting the privacy of the data.
For the classification of critical business data, filtering and classification can be based on data attributes: records whose transaction amount exceeds the preset threshold are identified as critical business data and further classified by transaction type (e.g. payment, refund, purchase); records classed as critical business data are labeled for subsequent processing. Rights control data typically includes user permission levels, client types, and similar information, classified by the permission levels of different users and by customer type (e.g. general user, VIP user, administrator). Text data to be searched typically includes customer feedback, order notes, and other text, which may be classified by content and purpose; natural language processing (NLP) techniques, such as text classification algorithms, identify keywords or topics in the text to determine its category.
Critical business data is typically classified according to transaction-amount criteria. For payment information whose transaction amount exceeds the preset threshold, the chained encryption mechanism based on homomorphic encryption is applied, allowing specific computations such as aggregation or statistics over payment amounts while the data remains encrypted, without ever decrypting it. This greatly improves the security of payment data and prevents unauthorized access and leakage. Rights control data comprises user permission level, client type, and similar information used to govern data access; only users with the corresponding attributes or rights can decrypt and access the related data, strengthening the protection of sensitive rights information, ensuring confidentiality and integrity of the rights data, and reducing the risk of data misuse. Text data to be searched may contain customer feedback, order notes, and similar information that typically must be searchable yet remain encrypted for privacy; while privacy is protected, necessary operations such as keyword search or data retrieval can still be performed, increasing the privacy and security of the text data while permitting limited analysis and querying.
In this embodiment, transaction description text is extracted from the e-commerce transaction data using NLP techniques; a text classification algorithm identifies keywords such as "purchase" and "payment" and labels those records as critical business data. Payment data with transaction amounts above 5000 yuan is encrypted with the Paillier homomorphic encryption algorithm, realizing statistical analysis within the encryption domain. Records whose user role grade is "manager" or "operator" are labeled as rights control data in the user database and encrypted with the CP-ABE attribute-based encryption algorithm, with the access-control policy bound to user department attributes. Customer feedback content is extracted from order comment text with regular expressions and encrypted with a searchable symmetric encryption (SSE) scheme supporting keyword search; within the encryption domain, the feedback text can then be searched with keywords such as "positive review" and "negative review".
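The additive homomorphism that enables statistics over encrypted payments can be shown with a toy Paillier implementation; this is an illustrative sketch with tiny fixed primes (NOT secure), and the function names are assumptions.

```python
import math
import random

def paillier_keygen(p=1009, q=1013):
    """Toy Paillier keypair with tiny primes, for illustration only."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    g = n + 1                                   # standard simple generator
    L = lambda x: (x - 1) // n
    mu = pow(L(pow(g, lam, n * n)), -1, n)
    return (n, g), (lam, mu, n)

def paillier_encrypt(pub, m, rng=random):
    n, g = pub
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def paillier_decrypt(priv, c):
    lam, mu, n = priv
    L = lambda x: (x - 1) // n
    return (L(pow(c, lam, n * n)) * mu) % n

# Homomorphic addition: the product of ciphertexts decrypts to the sum,
# so payment amounts can be totalled without decrypting any single one.
pub, priv = paillier_keygen()
c1 = paillier_encrypt(pub, 6200)   # payment above the 5000-yuan threshold
c2 = paillier_encrypt(pub, 7500)
total = paillier_decrypt(priv, (c1 * c2) % (pub[0] ** 2))
```

Here `total` equals 6200 + 7500 even though neither individual amount was exposed during the aggregation, which is the property the embodiment relies on.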
FIG. 6 is an exemplary flowchart for setting access control policies according to some embodiments of the disclosure. As shown in FIG. 6, adopting access control policies based on data attributes and user roles comprises: S143A, generating a random salt value, used with the XTS mode of the AES-256 algorithm, for each first datum that needs indexing. The random salt increases data randomness, improves security, and prevents rainbow-table attacks against identical data; encrypting in AES-256 XTS mode protects data privacy so that only authorized users can decrypt. S143B, deriving the encryption and decryption keys of the index database from the master key using the HKDF algorithm. HKDF ensures that keys derived from the master key have appropriate strength, are suited to data encryption and decryption, and are not easily cracked. S143C, encrypting the first data together with its random salt using the derived encryption key to generate the second data; this protects the privacy of the first data so that only holders of the decryption key can restore the original. S143D, creating an index table in the index database containing: the data ID, uniquely identifying each data item; the encryption salt, the random salt associated with each item; and the second data, the encrypted data. S143E, when reading the index, locating the matching encrypted salt and second data by data ID and decrypting with the decryption key to recover the first data.
Decryption can only be performed by authorized users, ensuring data security. S143F, establishing a role-based access control mechanism in the index database, comprising: identity authentication, where the user must present valid credentials including user name and password; authorization management, which determines from the user's role and attributes whether they may access specific data; an access audit log, recording every database access for auditing and monitoring; and rights grading, assigning different permission levels to different user roles and attributes to refine data access control. Role-based access control thus allows different rights to be assigned to different users, giving fine-grained control over data access.
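The S143F mechanism (role check plus audit log) can be sketched as follows; the role-to-level mapping, class name, and field names are illustrative assumptions, not the claimed implementation.

```python
from dataclasses import dataclass, field

# Hypothetical role -> permission level mapping for rights grading
ROLE_LEVEL = {"manager": 3, "operator": 2, "user": 1}

@dataclass
class IndexDB:
    audit_log: list = field(default_factory=list)

    def access(self, user, role, data_id, required_level):
        """Authorize by role level and record the attempt in the audit log."""
        allowed = ROLE_LEVEL.get(role, 0) >= required_level
        self.audit_log.append((user, role, data_id, allowed))
        return allowed
```

A manager can then read level-3 data while a general user is refused, and both attempts appear in the audit log for later monitoring.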
Adopting strong encryption algorithms and a flexible access control mechanism ensures that only authorized users can access sensitive data and that every access is recorded and monitored, improving data security. For each first datum to be indexed, the application generates a random salt value for use with the XTS mode of AES-256; this salt participates with the first data in the subsequent encryption. Using the master-key-based HKDF (HMAC-based Key Derivation Function) algorithm, the application derives the encryption and decryption keys of the index database, ensuring secure key generation and management. The first data and the random salt are encrypted with the derived encryption key to produce the second data, which is stored in the index database. There, the application creates an index table containing the data ID, the encrypted salt value, and the second data; this table is used to retrieve and decrypt the first data. The application establishes a role-based access control (RBAC) mechanism in the index database comprising identity authentication, authorization management, and access audit logs. Identity authentication requires valid credentials before data can be accessed, ensuring that only authenticated users reach the system. Authorization management controls at fine granularity, via user roles and attributes, who may access which data; only authorized users can decrypt and obtain the first data. The access audit log records every data-access event, including user identity, timestamp, and access operation, helping monitor data usage and trace potential security issues. The application adopts rights grading to ensure that data of different levels can only be accessed by correspondingly authorized users.
This means highly sensitive data is accessible only to high-level users, while low-sensitivity data is available to a wider population. Access auditing is an important component: it lets the application periodically audit data access, identify potential risks, and take appropriate measures to strengthen data security. The XTS mode of AES-256 together with the HKDF algorithm ensures strong data encryption and secure key management; the RBAC-based access control mechanism provides flexible rights management, allowing fine-grained control of data based on user roles and attributes; and audit logging with usage monitoring facilitates timely discovery and handling of potential security issues.
Example 3: using the XTS mode of the AES-256 algorithm, a random salt value of 256 bits in length is generated; based on the master key, a 256-bit database index encryption key and decryption key are derived through HKDF-SHA256; the client information data and the random salt value are encrypted with AES-256 using the encryption key, and the encrypted data is output; a data table is created in a MySQL database with the fields data ID, salt value, and encrypted data, and the encrypted data is inserted; when a user logs in, identity is verified by user name and password, the system queries the user's role, and an access token is generated according to the role's permissions; when the user accesses the index data, the access token is presented, and the system verifies it and records the access log.
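The key-derivation step of Example 3 can be illustrated with an RFC 5869 style HKDF-SHA256 built from the standard library. The master key and context string below are placeholders; a production system would load the master key from a hardware security module or key vault.

```python
import hmac, hashlib, os

def hkdf_sha256(master_key: bytes, salt: bytes, info: bytes, length: int) -> bytes:
    """RFC 5869 HKDF: extract a pseudorandom key from the master key,
    then expand it to the requested number of bytes."""
    prk = hmac.new(salt, master_key, hashlib.sha256).digest()   # extract
    okm, block, counter = b"", b"", 1
    while len(okm) < length:                                    # expand
        block = hmac.new(prk, block + info + bytes([counter]),
                         hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

master_key = os.urandom(32)          # placeholder master key
salt = os.urandom(32)                # 256-bit random salt, as in Example 3
enc_key = hkdf_sha256(master_key, salt, b"index-db-encryption", 32)
dec_key = hkdf_sha256(master_key, salt, b"index-db-encryption", 32)
assert len(enc_key) == 32            # 256-bit key, matching AES-256
assert enc_key == dec_key            # same inputs derive the same key
```

Because derivation is deterministic, the encryption and decryption sides only need to share the master key, salt, and context label; the working keys themselves never have to be stored.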
Example 4: based on the master key, a 512-bit database full-text index encryption key is generated through HKDF-SHA256; the commodity comment text is encrypted with AES-512 using this key, and the encrypted text is output; an index is created with the fields document ID and encrypted text, and the encrypted text is inserted; access control policies are defined so that different user roles have different data access rights; two-factor identity authentication is performed when the user logs in to the system; user rights are verified and logged when the text index is accessed.
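The access-token step in these examples (a token issued at login from the user's role, then verified on every index access) could be realized, for instance, as an HMAC-signed token. The field layout and the server signing key below are assumptions for illustration, not the claimed token format.

```python
import hmac, hashlib, os

SERVER_KEY = os.urandom(32)  # assumed per-deployment token signing key

def issue_token(user: str, role: str) -> str:
    """Issued at login: bind the user and role together and sign
    the pair with HMAC-SHA256 so it cannot be forged or altered."""
    payload = f"{user}:{role}"
    sig = hmac.new(SERVER_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{sig}"

def verify_token(token: str):
    """Presented on each access: recompute the signature and compare
    in constant time; return the (user, role) pair only if valid."""
    user, role, sig = token.rsplit(":", 2)
    expected = hmac.new(SERVER_KEY, f"{user}:{role}".encode(),
                        hashlib.sha256).hexdigest()
    return (user, role) if hmac.compare_digest(sig, expected) else None

token = issue_token("alice", "admin")
assert verify_token(token) == ("alice", "admin")
assert verify_token(token.replace("admin", "viewer", 1)) is None  # tamper rejected
```

The verified role then feeds the permission check, and both the token presentation and the outcome are written to the access log.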
Wherein the application ensures the security of the first data and the salt value by generating a random master key and a random salt value using a TRNG, performing key derivation with the HMAC-SHA256 algorithm, and encrypting with AES-256. By combining the Shamir secret sharing algorithm with a multi-path splicing mode, the application generates the second data and provides strong support for data security. A random master key of G1 to G2 bits is generated by a True Random Number Generator (TRNG), ensuring a high degree of randomness and security of the master key. A random salt value of G1 to G2 bits is generated using a secure random number generation algorithm to increase the complexity of data encryption. From the generated random master key, an encryption key of G3 to G4 bits corresponding to the first data is derived using the HMAC-SHA256 algorithm, ensuring secure generation and management of the key. The first data is encrypted with AES-256 using the generated encryption key to obtain the encrypted first data S1; the random salt value of the first data is likewise encrypted with AES-256 to obtain the encrypted salt value S2. S1 and S2 are hashed with the SHA-256 algorithm to obtain hash values U1 and U2 respectively, ensuring data integrity and security. Using the Shamir secret sharing algorithm, U1 is split into K1 fragments and U2 into K2 fragments to increase the dispersion and security of the data. In the multi-path splicing mode, M fragments at different positions are selected from the fragments of U1 and U2 according to a preset rule and spliced by exclusive-OR (XOR) to generate the second data. This ensures the complexity and security of the second data, which cannot be recovered from any single fragment.
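The hash-split-splice stage above can be sketched as follows. This is a simplified illustration: a fixed-size fragmentation stands in for full Shamir secret sharing (a real implementation would emit polynomial shares over a finite field), the AES-256 encryption of the first data is omitted, and the fragment counts and selection rule are the example values, not prescribed ones.

```python
import hashlib, os

def split_fragments(value: bytes, k: int):
    """Stand-in for Shamir splitting: pad the hash and cut it into
    k equal fragments (real Shamir would produce threshold shares)."""
    size = -(-len(value) // k)                    # ceiling division
    padded = value.ljust(size * k, b"\0")
    return [padded[i*size:(i+1)*size] for i in range(k)]

def xor_splice(fragments, positions):
    """Multi-path splicing: XOR the fragments chosen by the preset
    rule onto a running value and concatenate the intermediate
    results, so the output length is the sum of the spliced parts."""
    picked = [fragments[p] for p in positions]
    width = max(len(f) for f in picked)
    acc, out = bytes(width), b""
    for frag in picked:
        frag = frag.ljust(width, b"\0")
        acc = bytes(a ^ b for a, b in zip(acc, frag))
        out += acc
    return out

first_data = b"order #1024, card ****1111"        # placeholder first data
salt = os.urandom(32)                             # random salt value
u1 = hashlib.sha256(first_data).digest()          # stands in for SHA-256(S1)
u2 = hashlib.sha256(salt).digest()                # stands in for SHA-256(S2)
pool = split_fragments(u1, 5) + split_fragments(u2, 6)   # K1 = 5, K2 = 6
positions = [0, 3, 7, 9]                          # preset rule, M = 4
second_data = xor_splice(pool, positions)
assert len(second_data) == 28                     # sum of the 4 spliced parts
```

Because each output segment is the XOR of all fragments selected so far, no single fragment reveals the underlying hash, which is the dispersion property the method relies on.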
A random master key is generated with a TRNG, ensuring its high randomness and security, and the random salt value increases the complexity of data encryption. The HMAC-SHA256 algorithm is used for key derivation, ensuring secure generation and management of the key. The first data and the random salt value are encrypted with the AES-256 algorithm, ensuring the security of the encrypted data. The SHA-256 algorithm is used for hashing, ensuring data integrity and security; the hash values are split into multiple fragments, increasing the dispersion and security of the data; and the selected fragments are spliced by exclusive-OR (XOR) to generate the second data, ensuring its complexity and security. In the application, the master key length G1 to G2 is 256-512 bits, guaranteeing security strength; the random salt length G1 to G2 is 128-256 bits, increasing entropy and unpredictability; the derived key length G3 to G4 is 256 bits, matching the AES-256 algorithm; the SHA-256 output hash value is 256 bits long; the numbers of fragments K1 and K2 for the Shamir algorithm are 5-8, realizing a fault-tolerance threshold; the number M of fragments selected for multi-path splicing is 3-5, making reversal of the second data difficult; the XOR splicing positions are evenly spaced; and the length of the second index data is the sum of the lengths of the spliced fragments. Different security levels can be set, with larger parameter values for higher levels and smaller values for lower levels.
In this embodiment, a 512-bit random master key is generated using a hardware TRNG; a 256-bit random salt value is generated using the system RNG; a 256-bit encryption key is derived using HMAC-SHA256; the first data is encrypted with AES-256 using the encryption key, outputting the encrypted data S1; the random salt value is encrypted with AES-256 using a different encryption key, outputting the encrypted salt value S2; S1 and S2 are each hashed with SHA-256 to obtain the 256-bit hash values U1 and U2; U1 is split into 5 fragments and U2 into 6 fragments using the Shamir secret sharing algorithm; 4 fragments at different positions are selected from the fragments of U1 and U2 and XOR-spliced to generate the second index data, whose length, the sum of the 4 fragments, is 1024 bits; finally, the first data ID, the random salt value, and the second index data are stored in the database index table.