CN117235796A

CN117235796A - Electronic commerce data processing method

Info

Publication number: CN117235796A
Application number: CN202311261982.5A
Authority: CN
Inventors: 江海清; 刘祖豪; 张永杰
Original assignee: Qingdao Zhongqiyingcai Group Culture Media Co ltd
Current assignee: Ningyuan County Damai E-commerce Co.,Ltd.
Priority date: 2023-09-27
Filing date: 2023-09-27
Publication date: 2023-12-15

Abstract

The application discloses a method for processing electronic commerce data, which relates to the technical field of data security and comprises the following steps: acquiring electronic commerce data, extracting feature information by means of feature engineering, and constructing feature vectors; training a prediction model on the feature vectors with a machine learning algorithm to predict first data that needs to be encrypted; calculating the semantic relevance among the data predicted to need encryption and using it as the encryption order; encrypting the first data with a differential encryption algorithm according to the feature information and the encryption order of the data to generate second data; establishing a mapping relation between the first data and the second data in a database and constructing an index; and applying an access control policy based on data attributes and user roles. Addressing the low security of electronic commerce data in the prior art, the application improves the security of electronic commerce data through differential encryption, index construction, access control and the like.

Description

Electronic commerce data processing method

Technical Field

The application relates to the technical field of data security, in particular to a processing method of electronic commerce data.

Background

With the vigorous development of electronic commerce, the efficient use and security protection of massive electronic commerce data have become an important subject. Electronic commerce data typically contains user information, transaction information, merchandise information and the like, some of which may be sensitive.

Traditional encryption methods generally take a one-size-fits-all approach that encrypts all data without distinguishing how sensitive it is, which reduces the efficiency with which the data can be used. With the advent of technologies such as differential privacy, user privacy can be protected by adding carefully calibrated noise to query results. However, applying differential privacy directly to database queries can introduce so much noise that the usability of the data suffers.

In the related art, Chinese patent document CN111310207B, for example, provides an electronic commerce data processing method, apparatus, electronic commerce system and server. That scheme records an order encryption component corresponding to an electronic commerce order, acquires service coverage information of the order during its generation, establishes an order encryption node sequence corresponding to the service coverage information, and associates each item of order information contained in the order with the order encryption node sequence according to a preset association order. Based on the order encryption component corresponding to the order, at least some target order encryption nodes are then determined in the order encryption node sequence, and the corresponding order item information is encrypted according to those target nodes. However, this scheme has at least the following drawback: it relies on preset order encryption components and encryption nodes and does not consider differences among order data or dynamic adjustment, which reduces the security of electronic commerce data.

Disclosure of Invention

1. Technical problem to be solved

Aiming at the problem of low electronic commerce data security in the prior art, the invention provides a processing method of electronic commerce data, which improves the data security through differential encryption, index establishment, access control and the like.

2. Technical solution

The aim of the invention is achieved by the following technical scheme.

The embodiments of this specification provide a method for processing electronic commerce data, which comprises the following steps: acquiring electronic commerce data, extracting feature information by means of feature engineering, and constructing feature vectors; training a prediction model on the feature vectors with a machine learning algorithm to predict first data that needs to be encrypted; calculating the semantic relevance among the data predicted to need encryption and using it as the encryption order; encrypting the first data with a differential encryption algorithm according to the feature information and the encryption order of the data to generate second data; establishing a mapping relation between the first data and the second data in a database and constructing an index; and applying an access control policy based on data attributes and user roles.

Further, the step of training a prediction model on the feature vectors with a machine learning algorithm to predict the first data that needs to be encrypted includes: labeling the constructed feature vectors to generate a first training set and a validation set; detecting whether the distribution of sample counts across the categories in the first training set is balanced; if the first training set has a class imbalance, processing the first training set with an oversampling or undersampling technique to generate a second training set; training a gradient boosting decision tree model with the second training set and the validation set to obtain an encryption prediction model; and predicting, with the encryption prediction model, the data in the electronic commerce data that needs to be encrypted, to obtain the first data.

Further, if there is a class imbalance in the first training set, the step of processing the first training set using an oversampling or undersampling technique to generate a second training set includes: acquiring the total sample count N of the first training set; dynamically setting, according to the total sample count, a first threshold ratio P₁ and a second threshold ratio P₂ for the number of samples, where P₁ is less than P₂; calculating a first threshold N₁ from the total sample count N and the first threshold ratio P₁; calculating a second threshold N₂ from the total sample count N and the second threshold ratio P₂; judging whether the number of samples N_i of each category is greater than the first threshold N₁ or less than the second threshold N₂; when the number of samples N_i is greater than the first threshold N₁, undersampling the corresponding samples; when the number of samples N_i is less than the second threshold N₂, oversampling the corresponding samples; and repeating the threshold judgment and sampling on the undersampled or oversampled training set to obtain the second training set; wherein the threshold ratios are set dynamically by an adaptive algorithm based on a machine learning algorithm; the undersampling is one of downsampling, random sampling or weighted sampling; and the oversampling is one of upsampling, sample synthesis or sample replication.

Further, the step of training the gradient boosting decision tree model with the second training set to obtain the encryption prediction model includes: according to the Bagging algorithm, obtaining a plurality of bootstrap samples from the second training set by sampling with replacement, the number of samples in each being A₁ to A₂ of the total number of samples; training a GBDT regression model on each bootstrap sample, the maximum number of iterations of the model being B₁ to B₂ rounds; during GBDT regression model training, adding Laplace noise at the leaf nodes, the value of the Laplace noise being set according to a dynamically preset privacy budget ε; evaluating the trained GBDT models on the validation set and removing any model whose RMSE exceeds a threshold, the threshold being C₁ to C₂ of the RMSE values evaluated on the validation set; for the retained GBDT models, using knowledge distillation to reduce the number of nodes and the depth of each model, the target node count and depth being D₁ to D₂ of the node count and depth of the GBDT model before knowledge distillation; and iteratively integrating the plurality of GBDT models, with E₁ to E₂ iterations, to obtain the encryption prediction model.

Further, the step of obtaining a plurality of bootstrap samples from the second training set by sampling with replacement according to the Bagging algorithm includes: when the number of samples N_i is less than or equal to a preset first threshold N₁, training a single GBDT model on the second training set; when the number of samples N_i is greater than the first threshold N₁ and less than a preset second threshold N₂, randomly drawing from the second training set, with replacement, bootstrap samples whose size is A₁ to A₂ of the sample size of the second training set, and training a plurality of GBDT models on the drawn bootstrap samples; and when the number of samples N_i is greater than or equal to the second threshold N₂, randomly drawing from the second training set, with replacement, bootstrap samples with the same number of samples as the second training set, and training a plurality of GBDT models on the drawn bootstrap samples.

Further, the step of dynamically presetting the privacy budget ε includes: setting an initial privacy budget ε₀ of F₁ to F₂; cyclically receiving the i-th round query request Q_i and acquiring the query type x_i and data size y_i corresponding to Q_i; and calculating, from the query type x_i and data size y_i, the privacy budget ε_qi consumed by the i-th round query Q_i by the following formula:

$$\varepsilon_{qi} = k_1 \cdot \log(x_i) \cdot y_i^{\alpha}$$

where k₁ is a privacy consumption coefficient and α is a data scale adjustment coefficient;

subtracting the accumulated consumed privacy budget from the initially preset privacy budget ε₀ to obtain the remaining privacy budget ε_{t+1}; judging whether the remaining privacy budget ε_{t+1} is below a preset threshold ε_min; when ε_{t+1} is less than or equal to ε_min, determining the preset sensitivity weight ω_i corresponding to the query type; calculating, from the sensitivity weight ω_i and the i-th round query consumption ε_qi, the Laplace noise parameter b_i corresponding to the i-th round query request Q_i; and, for the i-th round query request Q_i, adding the calculated noise b_i and then returning a query response.
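The budget accounting described above can be illustrated with a minimal Python sketch. It assumes that the query type x_i is encoded as a number greater than 1 (so that log(x_i) is positive) and that k₁, α, ε₀ and ε_min are supplied by the caller; the names `query_cost` and `PrivacyBudget` are hypothetical and do not come from the application.

```python
import math
from typing import Tuple


def query_cost(x_i: float, y_i: float, k1: float, alpha: float) -> float:
    """Budget consumed by one query: eps_qi = k1 * log(x_i) * y_i**alpha."""
    if x_i <= 1:
        # Assumption: query types are coded as numbers > 1 so the cost is positive.
        raise ValueError("x_i must be greater than 1")
    return k1 * math.log(x_i) * (y_i ** alpha)


class PrivacyBudget:
    """Tracks the remaining budget across query rounds (hypothetical helper)."""

    def __init__(self, eps0: float, eps_min: float, k1: float, alpha: float):
        self.eps0 = eps0
        self.eps_min = eps_min
        self.k1 = k1
        self.alpha = alpha
        self.spent = 0.0

    def charge(self, x_i: float, y_i: float) -> Tuple[float, float, bool]:
        """Return (eps_qi, remaining budget, whether the noise step is triggered)."""
        eps_qi = query_cost(x_i, y_i, self.k1, self.alpha)
        self.spent += eps_qi
        remaining = self.eps0 - self.spent
        return eps_qi, remaining, remaining <= self.eps_min
```

Each call to `charge` corresponds to one query round; the boolean it returns marks the point at which the remaining budget has fallen to ε_min or below and the noise calculation is triggered.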

Further, calculating the Laplace noise parameter b_i includes: calculating the privacy noise distribution center b_i0 by the following formula:

$$b_{i0} = \omega_i \cdot \varepsilon_{qi}$$

where ω_i is the sensitivity weight of query type x_i and ε_qi is the privacy budget consumed by the query of type x_i;

setting an adaptive variance function f(y_i), calculated by the following formula:

$$f(y_i) = \log\left(k_2 \cdot (y_i + 1)\right)$$

where k₂ is a constant of the adaptive variance function;

constructing a Laplace or Gaussian distribution F_i centered at b_i0 with variance f(y_i); and sampling from the distribution F_i with a distribution-based sampling algorithm to obtain the privacy noise value b_i.
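As an illustration of the sampling step, the following Python sketch draws b_i from a Laplace distribution centered at b_i0 with variance f(y_i). It assumes the Laplace option is chosen, that k₂·(y_i + 1) exceeds 1 so the variance is positive, and that sampling is done by taking the difference of two exponential draws; the function names are hypothetical.

```python
import math
import random


def adaptive_variance(y_i: float, k2: float) -> float:
    """f(y_i) = log(k2 * (y_i + 1)); must be positive to serve as a variance."""
    value = math.log(k2 * (y_i + 1))
    if value <= 0:
        raise ValueError("k2 * (y_i + 1) must exceed 1 for a positive variance")
    return value


def sample_noise(omega_i: float, eps_qi: float, y_i: float, k2: float,
                 rng: random.Random) -> float:
    """Draw b_i from a Laplace distribution centered at b_i0 with variance f(y_i)."""
    center = omega_i * eps_qi                  # b_i0 = omega_i * eps_qi
    variance = adaptive_variance(y_i, k2)
    scale = math.sqrt(variance / 2.0)          # a Laplace variance equals 2 * scale**2
    # The difference of two independent Exp(1) draws is a standard Laplace variate.
    return center + scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
```

Passing a seeded `random.Random` instance keeps the sketch reproducible; a production system would more likely use a cryptographically secure source of randomness.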

Further, the step of differential encryption includes: classifying the electronic commerce data and extracting key business data, permission control data and text data to be searched; for the key business data, which includes payment information whose transaction amount is greater than a preset threshold, performing hierarchical permission control with a chained encryption mechanism based on homomorphic encryption; for the permission control data, which includes preset user permission levels and customer types, controlling access rights with attribute-based encryption; and for the text data to be searched, which includes preset customer feedback and order remark text, performing searches in the encrypted domain with an encryption technique that supports searching.

Further, the step of employing an access control policy based on the data attributes and the user roles includes: generating a random salt value for each first data to be indexed using the XTS mode of the AES-256 algorithm; deriving an encryption key and a decryption key of the index database by using an HKDF algorithm based on the master key; encrypting the first data and the random salt value by using an encryption key to generate second data; creating an index table in an index database, the index table containing a data ID, an encrypted salt value, and second data; when the index is read, searching the matched encrypted salt value and the second data through the data ID, and decrypting by using a decryption key to obtain the first data; establishing an RBAC-based access control mechanism in an index database, wherein the access control mechanism comprises identity verification, authorization management and access audit logs; and carrying out security control on the index database by adopting authority classification and access audit.
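The key derivation step can be sketched with the Python standard library alone. The function below follows the extract-and-expand construction of HKDF (RFC 5869) with HMAC-SHA256; the AES-256 encryption itself is omitted because the standard library has no AES implementation. The purpose labels and the per-database salt are illustrative assumptions, not values from the application.

```python
import hashlib
import hmac
import secrets


def hkdf_sha256(master_key: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with HMAC-SHA256: extract a PRK, then expand it."""
    if length > 255 * 32:
        raise ValueError("requested key is too long for HKDF-SHA256")
    # Extract: an empty salt is replaced by a block of zero bytes, as in the RFC.
    prk = hmac.new(salt or b"\x00" * 32, master_key, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    # Expand: T(i) = HMAC(PRK, T(i-1) || info || i)
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


master_key = secrets.token_bytes(32)   # stand-in for the master key
index_salt = secrets.token_bytes(16)   # hypothetical per-database salt
data_key = hkdf_sha256(master_key, index_salt, b"index-db/data")   # hypothetical label
salt_key = hkdf_sha256(master_key, index_salt, b"index-db/salt")   # hypothetical label
```

Because AES is symmetric, one derived key serves for both encrypting and decrypting a given item; the two labels above only show that distinct purpose labels yield independent 32-byte keys from the same master key.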

Further, the step of encrypting the first data and the random salt value with the encryption key to generate the second data includes: generating a random master key of length G₁ to G₂ using a TRNG; for the first data, generating a random salt value of length G₁ to G₂ bits using a secure random number generation algorithm; deriving from the master key, using the HMAC-SHA256 algorithm, an encryption key of length G₃ to G₄ corresponding to the first data; encrypting the first data with AES-256 using the encryption key and outputting the encrypted first data S₁; encrypting the random salt value of the first data with AES-256 and outputting the encrypted salt value S₂; hashing S₁ and S₂ respectively with the SHA-256 algorithm to obtain hash values U₁ and U₂; splitting U₁ into K₁ fragments and U₂ into K₂ fragments using the Shamir secret sharing algorithm; and, in a multi-path splicing manner and according to a preset rule, selecting M fragments at different positions from U₁ and U₂ and combining them by exclusive-or (XOR) splicing to generate the second data.

3. Advantageous effects

Compared with the prior art, the invention has the advantages that:

(1) The data that needs to be encrypted, which may include customers' personally identifiable information, credit card data and the like, is predicted by a machine learning model. Using a differential encryption algorithm effectively protects data privacy. Differential encryption takes the characteristics of the data into account during encryption, so that the same sensitive data can yield different results after encryption, which improves the security of the data;

(2) An index database of the encrypted data is established, and the security of the index is ensured by the Advanced Encryption Standard (AES-256) and a key derivation algorithm (HKDF). AES-256 is a highly secure symmetric encryption algorithm that provides strong data protection. The key derivation algorithm (HKDF) derives further keys from a master key, ensuring the security and randomness of the keys. Even if the index data is leaked, an attacker therefore cannot easily decrypt it, which improves the security of the data;

(3) An access control policy based on data attributes and user roles is combined with dynamic privacy budget management, so that only authorized users can access the data, and only when needed. Dynamic privacy budget management ensures that use of the data does not exceed predetermined privacy limits: even after the data has been accessed by many users, appropriate restrictions can still be applied according to privacy policies and regulations, which protects user privacy and improves the security of the data.

Drawings

The present specification will be further described by way of exemplary embodiments, which will be described in detail by way of the accompanying drawings. The embodiments are not limiting, in which like numerals represent like structures, wherein:

FIG. 1 is an exemplary flow chart of a method of processing electronic commerce data according to some embodiments of the present description;

FIG. 2 is an exemplary flow chart of predicting first data that needs to be encrypted according to some embodiments of the present description;

FIG. 3 is an exemplary flow chart for generating a second training set according to some embodiments of the present description;

FIG. 4 is an exemplary flow diagram of obtaining an encryption predictive model, shown in accordance with some embodiments of the present disclosure;

FIG. 5 is an exemplary flow diagram for acquiring a dynamic privacy budget in accordance with some embodiments of the present description;

FIG. 6 is an exemplary flow chart for setting access control policies according to some embodiments of the specification.

Detailed Description

In order to more clearly illustrate the technical solutions of the embodiments of the present specification, the drawings that are required to be used in the description of the embodiments will be briefly described below. It is apparent that the drawings in the following description are only some examples or embodiments of the present specification, and it is possible for those of ordinary skill in the art to apply the present specification to other similar situations according to the drawings without inventive effort. Unless otherwise apparent from the context of the language or otherwise specified, like reference numerals in the figures refer to like structures or operations.

A flowchart is used in this specification to describe the operations performed by the system according to embodiments of the present specification. It should be appreciated that the preceding or following operations are not necessarily performed in order precisely. Rather, the steps may be processed in reverse order or simultaneously. Also, other operations may be added to or removed from these processes.

The method and system provided in the embodiments of the present specification are described in detail below with reference to the accompanying drawings.

FIG. 1 is an exemplary flow chart of a method of processing electronic commerce data according to some embodiments of the present description. As shown in FIG. 1, a method for processing electronic commerce data includes:

S110, acquiring electronic commerce data, extracting feature information by means of feature engineering, and constructing feature vectors. Feature engineering extracts key feature information from the raw data, which may include user behavior frequency, purchase history, merchandise attributes, and the like.

S120, training a prediction model on the feature vectors with a machine learning algorithm and predicting the first data that needs to be encrypted. The task of the prediction model is to predict, from the feature vectors, which data needs to be encrypted; the decision may be based on factors such as sensitivity and privacy. The output of the model is a prediction indicating which data needs to be encrypted.

S130, for the data predicted to need encryption, calculating the semantic relevance among the data as the encryption order. The semantic relevance helps the system determine which data should be encrypted together so that the relationships among the data are preserved.

S140, encrypting the first data with a differential encryption algorithm according to the feature information and the encryption order of the data to generate second data. The system determines the order and manner of encryption from the output of the prediction model and the semantic relevance of the data. The differential encryption algorithm can select different encryption modes according to the characteristics and sensitivity of the data.

S150, establishing a mapping relation between the first data and the second data in a database and constructing an index. The system creates a mapping in the database that associates the first data with the generated second data. The index can be used to quickly retrieve and access both the original data and the encrypted data, ensuring the integrity and consistency of the data.

S160, applying an access control policy based on data attributes (e.g., sensitivity level) and user roles (e.g., user permissions) to ensure that only authorized users can access the data. The policy can control access rights in a fine-grained manner according to the characteristics of different data and the roles of users.
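A policy of the kind described in S160 can be sketched as a check that combines a data attribute with a user role. The role names, clearance levels and the `Record` type below are illustrative assumptions; the application does not specify them.

```python
from dataclasses import dataclass

# Hypothetical role table: the highest sensitivity level each role may read.
ROLE_CLEARANCE = {"operator": 1, "manager": 2, "auditor": 3}


@dataclass(frozen=True)
class Record:
    data_id: str
    sensitivity: int  # data attribute, e.g. 1 = low, 3 = high


def can_read(role: str, record: Record) -> bool:
    """Allow access only when the role's clearance covers the record's level."""
    # Unknown roles get clearance 0 and are therefore denied everything.
    return ROLE_CLEARANCE.get(role, 0) >= record.sensitivity
```

For instance, `can_read("operator", Record("order-1", 3))` evaluates to `False`, while the same record is readable under the `auditor` role.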

Specifically, raw data, including transaction records, user information, order text and the like, is obtained from the electronic commerce platform; data cleaning, denoising and formatting are performed for the different data types; important feature information such as user behavior, transaction amount and time stamp is extracted with feature engineering techniques; and feature vectors are constructed by converting the feature information into numerical form for training the machine learning model. An appropriate machine learning algorithm is selected, such as gradient boosting decision trees (Gradient Boosting Decision Trees); a first training set and a validation set are constructed, and the training samples are labeled to mark which data needs to be encrypted; whether the distribution of sample counts across the categories in the first training set is balanced is detected; if a class imbalance exists, the first training set is processed with an oversampling technique (such as SMOTE) or an undersampling technique to generate a second training set; and the gradient boosting decision tree model is trained with the second training set and the validation set to obtain the encryption prediction model. The data to be encrypted is predicted to obtain the first data; the semantic relevance between data is calculated using natural language processing (NLP) techniques or text embedding methods; and a differential encryption algorithm is selected according to the feature information and the encryption order of the data so that different data types are appropriately protected.

For highly sensitive data such as transaction amounts, hierarchical permission control is performed with a homomorphic encryption algorithm; for permission control data (such as user permission level and customer type), attribute-based encryption is adopted to achieve fine-grained access control; text data to be searched is searched within the encrypted domain using encryption techniques that support searching, such as searchable encryption (Searchable Encryption) or homomorphic search techniques. A mapping relation between the first data and the second data is established in a database and an index table is constructed; the index table contains a data ID, an encrypted salt value and the second data; a random salt value is generated for each item of first data to be indexed using the XTS mode of the AES-256 algorithm; an encryption key and a decryption key of the index database are derived from the master key with the HKDF algorithm; and the first data and the random salt value are encrypted with the encryption key to generate the second data. A role-based access control mechanism, comprising authentication, authorization management and access audit logs, is established in the index database; access rights to the data are controlled by a policy based on data attributes and user roles; permission grading and access audit policies are formulated to achieve fine-grained permission control; and each user or role is assigned a unique access key for decrypting the data.
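The selection of an encryption scheme by data category can be pictured as a simple dispatch. In the sketch below the field names, the scheme labels and the fallback value `"unclassified"` are all hypothetical; the 5000 threshold is borrowed from Example 1 purely for illustration, and no actual encryption is performed.

```python
def choose_scheme(field_name: str, value: object, amount_threshold: float = 5000.0) -> str:
    """Map a field to the encryption scheme suggested for its data category."""
    # Key business data: payment amounts above the preset threshold.
    if (field_name == "payment_amount"
            and isinstance(value, (int, float))
            and value > amount_threshold):
        return "homomorphic-chained"
    # Permission control data: user permission level and customer type.
    if field_name in {"permission_level", "customer_type"}:
        return "attribute-based"
    # Text data to be searched: customer feedback and order remarks.
    if field_name in {"customer_feedback", "order_remark"}:
        return "searchable"
    # Assumption: anything else falls outside the three named categories.
    return "unclassified"
```

A real system would attach an encryption routine to each label; the sketch only shows how the three categories named in the text lead to different schemes.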

Example 1: Data such as e-commerce user registration information, browsing records and transaction records are collected, 10 million records in total. Feature vectors containing 10 features such as age, occupation and consumption level are constructed through statistical analysis and similar means. Based on the feature vectors, the GBDT model predicts that financial transaction information needs to be encrypted. A dynamic threshold ratio is set to handle sample imbalance. For the predicted transaction information, the commodity similarity among orders is calculated as the encryption order, and orders with similar goods are encrypted in sequence. For payment information with an order amount greater than 5000 yuan, a chained encryption method based on homomorphic encryption is adopted; for customer identity information, attribute-based encryption is adopted. In the database, the order ID is mapped to the encrypted order content in the transaction order table. A role-based access control model is established, under which operators can only query the encrypted order information.

Example 2: Unstructured data generated by users, such as customer feedback text and order notes, is collected. Text feature vectors comprising word frequencies, parts of speech, keywords and the like are constructed. Based on the feature vectors, the GBDT model predicts that text containing user privacy information needs to be encrypted. For the predicted text, a text semantic similarity matrix is calculated as the encryption order. For text data, a deterministic encryption algorithm that supports searching is employed, so that keyword searches can be performed within the encrypted domain. An index correspondence between the feedback text and the encrypted text is established. Operators and managers have different document access rights, and an RBAC model is adopted for identity authentication and authorization control.
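One way to picture a similarity matrix serving as an encryption order is sketched below. It uses cosine similarity over word-frequency vectors as a simple stand-in for semantic similarity, and a greedy chain that always moves to the most similar text not yet visited; both choices are assumptions made for illustration, since the application does not fix the similarity measure or the ordering rule.

```python
import math
from collections import Counter
from typing import List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def similarity_matrix(texts: List[str]) -> List[List[float]]:
    bags = [Counter(t.lower().split()) for t in texts]
    return [[cosine(x, y) for y in bags] for x in bags]


def encryption_order(texts: List[str]) -> List[int]:
    """Greedy chain: start at text 0, then visit the most similar unvisited text."""
    if not texts:
        return []
    sim = similarity_matrix(texts)
    order = [0]
    remaining = set(range(1, len(texts)))
    while remaining:
        last = order[-1]
        # Ties are broken in favour of the lower index.
        nxt = max(remaining, key=lambda j: (sim[last][j], -j))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

With this ordering, texts that share vocabulary end up adjacent in the encryption sequence; an embedding-based similarity could be substituted for `cosine` without changing the ordering logic.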

In summary, the data collection of S110 provides the original data source for subsequent feature engineering and model training. The model training of S120 uses the feature vectors constructed in S110 to make predictions on the original data and outputs the data that needs to be encrypted; the dynamic threshold setting improves the accuracy of the model. S130 calculates the semantic relevance of the data and, combined with the model output of S120, provides an encryption order, so that data is ordered and encrypted according to its relevance, which improves security. S140 selects differential encryption algorithms suited to the different data types and, together with the encryption order of S130, further strengthens the specificity and security of the encryption. S150 establishes a mapping relation between the original data and the encrypted data generated in S140, supporting index lookup and access control requirements. S160 further builds an access control model based on attributes and roles and, combined with the mapping index of S150, comprehensively safeguards data access. Through the close connection and cooperation of these steps, full-process protection from data acquisition to access control is achieved, so that the original electronic commerce data is protected efficiently and reliably and the security of the overall scheme is effectively improved.

FIG. 2 is an exemplary flow chart of predicting first data that needs to be encrypted according to some embodiments of the present disclosure. As shown in FIG. 2, the step of training a prediction model on the feature vectors with a machine learning algorithm and predicting the first data that needs to be encrypted includes:

S121, labeling the constructed feature vectors to generate a first training set and a validation set. The labeling serves to divide the data into a training set and a validation set; the training set is used for training the model, and the validation set is used for evaluating and optimizing its performance.

S122, detecting whether the distribution of sample counts across the categories in the first training set is balanced. Class imbalance may affect the performance of the model and the accuracy of the prediction results.

S123, if the first training set has a class imbalance, processing the first training set with an oversampling or undersampling technique to generate a second training set. Oversampling increases the number of minority class samples, while undersampling decreases the number of majority class samples, so as to obtain balanced training data.

S124, training the gradient boosting decision tree (Gradient Boosting Decision Tree) model with the balanced second training set and the validation set to obtain an encryption prediction model. The gradient boosting decision tree is a powerful machine learning algorithm that can be used for classification tasks and performs well in predicting the first data that needs to be encrypted.

S125, predicting the data in the electronic commerce data that needs to be encrypted with the trained encryption prediction model to obtain the first data.

Specifically, the first data that needs to be encrypted is predicted with a machine learning algorithm. The method comprises feature vector labeling, sample balance detection, oversampling or undersampling, machine learning model training, data prediction and the like, and aims to provide an effective prediction basis for data encryption. Suitable features, such as user behavior, transaction amount and time stamp, are selected from the e-commerce data. The feature vectors are labeled: data that needs to be encrypted is labeled 1, and data that does not need to be encrypted is labeled 0. The numbers of samples labeled 1 and 0 are counted and checked for imbalance. If an imbalance exists, an oversampling or undersampling technique is applied as needed to generate a second training set. The gradient boosting decision tree model, or another suitable machine learning algorithm, is trained with the second training set and the validation set, and the model parameters are tuned to improve performance. The trained encryption prediction model is then used to predict the electronic commerce data and obtain the first data. The prediction result can be used in the subsequent data encryption operation, ensuring that only the data that needs to be encrypted is encrypted and protected.
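The balance check on the 0/1 labels can be sketched in a few lines of Python. The criterion used here, the ratio of the largest to the smallest class count compared against `max_ratio`, is an assumption chosen for illustration; the application does not state a specific test.

```python
from collections import Counter
from typing import Hashable, Sequence


def is_imbalanced(labels: Sequence[Hashable], max_ratio: float = 1.5) -> bool:
    """Return True when the largest class exceeds max_ratio times the smallest."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    if len(counts) < 2:
        # Only one class present: treated as imbalanced by assumption.
        return True
    return max(counts.values()) / min(counts.values()) > max_ratio
```

For example, 900 samples labeled 0 against 100 labeled 1 give a ratio of 9 and are reported as imbalanced, whereas an even split is not.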

In this embodiment, registration information of 1,000,000 users, including each user's age, occupation, and the like, is collected. A user feature vector with 10 feature dimensions such as age and occupation is constructed. The feature vectors are labeled, marking registration data that contains personal privacy information such as real names and identity card numbers. A user registration training set and verification set are generated. The proportion of non-private to private data in the training set is checked; because private data is scarce, the class distribution is unbalanced. The private data category is oversampled to generate a new training set with balanced sample sizes. A GBDT model is trained with the new training set and the verification set to predict the privacy of user registration data. All registration data is then predicted with the GBDT model, and the private registration information to be encrypted is output. The prediction result is checked to ensure that all real privacy data is covered. The first data needing encryption protection is thus obtained, completing the privacy prediction of the user registration data.

FIG. 3 is an exemplary flow chart for generating a second training set according to some embodiments of the present description. As shown in FIG. 3, if there is a class imbalance in the first training set, the step of processing the first training set with an oversampling or undersampling technique to generate the second training set comprises:
S123A, obtaining the total sample amount N of the first training set.
S123B, dynamically setting a first threshold ratio P₁ and a second threshold ratio P₂ of the number of samples according to an adaptive strategy of the machine learning algorithm. These ratios are used to determine whether to undersample or oversample.
S123C, calculating a first threshold N₁ and a second threshold N₂ according to the dynamically set threshold ratios. These thresholds are used to determine whether to sample.
S123D, judging whether the number of samples N_i of each category is greater than the first threshold N₁ or less than the second threshold N₂.
S123E, when the number of samples N_i is greater than the first threshold N₁, undersampling the corresponding samples; when the number of samples N_i is less than the second threshold N₂, oversampling the corresponding samples.
S123F, repeating the threshold judgment and sampling on the training set that has been undersampled or oversampled, until the number of samples of every category lies within the range defined by the first threshold N₁ and the second threshold N₂, thereby generating a balanced second training set.
The threshold ratios are set dynamically by an adaptive strategy based on the machine learning algorithm, so that the sampling thresholds adapt to different data distributions. The undersampling process is one of downsampling, random sampling, or weighted sampling; the oversampling process is one of upsampling, synthesizing samples, or replicating samples.

The method dynamically sets threshold ratios and applies undersampling and oversampling so that the class distribution of the training set is more balanced, which improves the performance of the machine learning model and thereby the security of the e-commerce data. The total sample amount N of the first training set is obtained, and two threshold ratios are defined dynamically according to N: a first threshold ratio P₁ and a second threshold ratio P₂, where P₁ is less than P₂. These ratios determine when to undersample and when to oversample. The first threshold N₁ is calculated from the total sample amount N and the first threshold ratio P₁; the second threshold N₂ is calculated from N and the second threshold ratio P₂. For each category, the number of samples N_i is compared with the first threshold N₁ and the second threshold N₂. When N_i is greater than the first threshold N₁, undersampling is performed to reduce the number of samples. When N_i is less than the second threshold N₂, oversampling is performed to increase the number of samples. For a training set that has been undersampled or oversampled, the threshold judgment and sampling steps may be repeated as needed until the sample distribution is balanced. The threshold ratios may be set by an adaptive algorithm based on the machine learning algorithm, which keeps them reasonable and adaptable. The undersampling process may employ one of downsampling, random sampling, or weighted sampling. The oversampling process may employ one of upsampling, synthesizing samples, or replicating samples.
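Two of the listed options, random downsampling and oversampling by replicating samples, can be sketched as below. The function names and the target size of 300 are illustrative assumptions; sample synthesis such as SMOTE is not shown.

```python
import random

def undersample(samples, target, rng):
    """Randomly keep `target` samples without replacement (random downsampling)."""
    if len(samples) <= target:
        return list(samples)
    return rng.sample(samples, target)

def oversample(samples, target, rng):
    """Grow to `target` samples by replicating randomly chosen existing samples."""
    if len(samples) >= target:
        return list(samples)
    extra = rng.choices(samples, k=target - len(samples))
    return list(samples) + extra

rng = random.Random(0)            # fixed seed so the sketch is repeatable
majority = list(range(1000))      # stand-in for majority-class samples
minority = list(range(100))       # stand-in for minority-class samples
balanced_major = undersample(majority, 300, rng)
balanced_minor = oversample(minority, 300, rng)
```

Replication keeps every original minority sample and only adds copies, whereas downsampling discards majority samples, so the two operations lose information asymmetrically.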

Specifically, the two threshold ratios, a first threshold ratio P₁ and a second threshold ratio P₂ with P₁ less than P₂, are defined dynamically. In the present embodiment, recommended value ranges for P₁ and P₂ are initialized from experience: P₁ is set within 0.05-0.2 and P₂ within 0.2-0.5. The total sample amount N of the training data set is collected. Several candidate pairs are selected, such as {(0.05, 0.3), (0.1, 0.4), (0.15, 0.5)}. For each candidate pair, multiple rounds of experiments are performed: the pair's P₁ and P₂ are applied to sample processing and model training, and the results are recorded. Model evaluation metrics such as AUC and F1 score are compared across the candidate pairs, and the pair with the highest score is selected as the final values of P₁ and P₂. Alternatively, an error function can be established and P₁ and P₂ optimized continuously by gradient descent, or parameter tuning algorithms such as Bayesian optimization and random search can be used to find the optimal values of P₁ and P₂ automatically. When a new data set is acquired, the process is repeated to update P₁ and P₂, realizing dynamic optimization.
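The candidate-pair comparison can be sketched as a small search loop. The callback `evaluate` is hypothetical: in practice it would resample the training set with the given ratios, train a model, and return a metric such as AUC or F1. The toy scorer below exists only so that the sketch runs.

```python
def select_threshold_ratios(candidates, evaluate, rounds=3):
    """Pick the (P1, P2) pair with the highest mean score over several rounds.

    evaluate(p1, p2, round_index) is a hypothetical callback returning a metric
    where higher is better; it is not defined by the source.
    """
    best_pair, best_score = None, float("-inf")
    for p1, p2 in candidates:
        if not 0 < p1 < p2 < 1:
            continue  # skip pairs that violate 0 < P1 < P2 < 1
        mean_score = sum(evaluate(p1, p2, r) for r in range(rounds)) / rounds
        if mean_score > best_score:
            best_pair, best_score = (p1, p2), mean_score
    return best_pair, best_score

def toy_evaluate(p1, p2, round_index):
    """Stand-in scorer that peaks at (0.1, 0.4); a real one would train a model."""
    return 1.0 - abs(p1 - 0.1) - abs(p2 - 0.4)

candidates = [(0.05, 0.3), (0.1, 0.4), (0.15, 0.5)]
best_pair, best_score = select_threshold_ratios(candidates, toy_evaluate)
```

Gradient descent or Bayesian optimization would replace the fixed candidate list with a proposal step, while the scoring interface could stay the same.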

Specifically, the threshold ratios P₁ and P₂ are initialized, where 0 < P₁ < P₂ < 1. The total amount N of training set samples is calculated. From the ratios P₁ and P₂ and the sample amount N, the dynamic thresholds are calculated: N₁ = P₁ * N, N₂ = P₂ * N. Each class i of the training set is traversed and its number of samples N_i is counted. The relationship of N_i to N₁ and N₂ is then judged: if N_i <= N₁, the class i sample size is small and oversampling is required; if N₁ < N_i < N₂, the class i sample size is moderate and undersampling can be considered; if N_i >= N₂, the class i sample size is large and undersampling is required. Different sampling strategies are selected for different categories according to the judgment result. The above process is repeated until the sample size distribution is balanced.
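The threshold calculation and the three-way judgment can be written out directly. The class names and sizes below are illustrative assumptions (a 500,000-sample set with 50,000 private records); the function names are hypothetical.

```python
def dynamic_thresholds(total, p1, p2):
    """Compute N1 = P1 * N and N2 = P2 * N."""
    if not 0 < p1 < p2 < 1:
        raise ValueError("expected 0 < P1 < P2 < 1")
    return p1 * total, p2 * total

def sampling_decision(n_i, n1, n2):
    """Map a class size N_i to the strategy described for its range."""
    if n_i <= n1:
        return "oversample"
    if n_i < n2:
        return "moderate: undersampling optional"
    return "undersample"

class_sizes = {"private": 50_000, "non_private": 450_000}
total = sum(class_sizes.values())
n1, n2 = dynamic_thresholds(total, 0.1, 0.3)
decisions = {name: sampling_decision(n, n1, n2) for name, n in class_sizes.items()}
```

With P₁ = 0.1 and P₂ = 0.3 the thresholds come out at about 50,000 and 150,000, so the private class falls in the oversampling range and the non-private class in the undersampling range.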

In this embodiment, a registration data set including characteristics such as user age and occupation is collected, with a total sample size of 500,000. Counting the sample sizes of the categories shows that the category containing personal privacy data has only 50,000 samples, which is unbalanced compared with the other categories. The dynamic threshold ratios are set to P₁ = 0.1 and P₂ = 0.3, so the thresholds vary with the total sample size. The sample size of the privacy data category is found to be below the second threshold N₂ = 150,000. The privacy data category is oversampled: new samples are synthesized with the SMOTE algorithm, increasing the privacy data samples to 150,000. Balance is then judged again; the privacy sample size is still higher than the first threshold N₁ = 50,000, so undersampling continues, and the samples of the other categories are reduced to 150,000 by downsampling. Judgment and sampling are repeated until the sample size of each category is close to 150,000, generating a second training set with balanced sample sizes. A Bayesian optimization algorithm is used to dynamically optimize the threshold ratios so as to adapt to more scenarios.

FIG. 4 is an exemplary flowchart of obtaining an encryption prediction model according to some embodiments of the present disclosure. As shown in FIG. 4, the step of training a gradient boosting decision tree model with the second training set to obtain the encryption prediction model comprises:
S124A, using a Bagging algorithm to randomly draw a plurality of bootstrap samples from the second training set with replacement, the number of samples being dynamically set between A₁ and A₂.
S124B, training an independent GBDT regression model for each bootstrap sample, the maximum number of iterations of the model being dynamically set between B₁ and B₂.
S124C, during training of the GBDT regression model, adding Laplacian noise to the prediction result at the leaf nodes to enhance privacy protection, the intensity of the noise being adjusted according to a dynamically preset privacy budget ε.
S124D, evaluating each trained GBDT model with the verification set and filtering out the models whose root mean square error (RMSE) on the verification set exceeds a threshold dynamically set between C₁ and C₂, to ensure the predictive performance of the models.
S124E, applying knowledge distillation to the retained GBDT models to reduce their number of nodes and depth, lowering model complexity while maintaining performance. The target depth and number of nodes are dynamically set between D₁ and D₂.
S124F, iteratively integrating the plurality of GBDT models, the number of iterations being dynamically set between E₁ and E₂, to generate the final encryption prediction model. Integrating multiple models helps to improve robustness and performance.

A sampling method based on the Bagging algorithm acquires a plurality of bootstrap samples, which are used to train gradient boosting decision tree (Gradient Boosting Decision Tree, GBDT) models. Different strategies are adopted for different sample numbers to ensure the diversity and performance of the models. The Bagging algorithm generates a plurality of bootstrap samples by sampling with replacement, increasing the diversity and robustness of the models, and the Bagging strategy is chosen according to the number of samples so as to suit the training requirements of different data scales.

When the number of samples N_i is less than or equal to a preset first threshold N₁, the number of samples is small, Bagging sampling is not needed, and a single GBDT model is trained directly on the second training set.

When the number of samples N_i is greater than N₁ and less than N₂, the number of samples lies in the middle range: a number of bootstrap samples within the preset range A₁ to A₂ is randomly drawn with replacement from the second training set, and a plurality of independent GBDT models are trained on the drawn bootstrap samples.

When the number of samples N_i is greater than or equal to N₂, the number of samples is large: to maintain diversity, bootstrap samples with the same number of samples as the second training set are randomly drawn with replacement from the second training set, and a plurality of GBDT models are trained on them.

The encryption prediction model is created through the Bagging algorithm, gradient boosting decision trees (GBDT), differential privacy, knowledge distillation, and related techniques to protect sensitive data. The GBDT model is used for data analysis and prediction, while differential privacy measures are introduced to ensure the privacy and security of user data. An appropriate second training set is selected for training the GBDT model. A plurality of bootstrap samples are drawn from the second training set with replacement using the Bagging algorithm, the number of samples being determined by the range A₁ to A₂. In the present application, A₁ = (0.8~0.9)*N and A₂ = (0.2~0.1)*N.

A GBDT regression model is trained for each bootstrap sample, with the range of the maximum number of iterations determined by B₁ to B₂. During training of the GBDT regression model, Laplacian noise is added at the leaf nodes, with the noise magnitude determined by a dynamically preset privacy budget ε. In the present application, the initial maximum number of iterations B₁ of the GBDT is set to 10. For each bootstrap sample trained, the error descent curve is recorded; when the error decrease over 5 consecutive iterations is smaller than a preset threshold τ, the current iteration number is recorded as B₂. The maximum number of iterations for this bootstrap sample then lies in the range [B₁, B₂]. The above process is repeated for all bootstrap samples to obtain their respective maximum iteration ranges.
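The rule for finding B₂ from an error curve can be sketched as follows. The reading that each of the 5 consecutive decreases must be below τ is an assumption, as are the function name and the synthetic error curve.

```python
def find_b2(errors, b1=10, tau=0.005, patience=5):
    """Return the iteration at which training would stop, or None if it never does.

    errors[t] is the error after iteration t + 1. Stopping requires each of the
    last `patience` decreases to be below tau (an assumed reading of the rule)
    and at least b1 iterations to have run.
    """
    small_drops = 0
    for t in range(1, len(errors)):
        drop = errors[t - 1] - errors[t]
        small_drops = small_drops + 1 if drop < tau else 0
        iteration = t + 1
        if iteration >= b1 and small_drops >= patience:
            return iteration
    return None

# Synthetic curve: the error falls by 0.02 per step for 10 iterations, then by 0.001.
errors = [1.0 - 0.02 * t for t in range(10)] + [0.82 - 0.001 * t for t in range(1, 11)]
b2 = find_b2(errors)
```

On this synthetic curve the fifth consecutive small decrease occurs at iteration 15, so the maximum-iteration range would be [10, 15].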

The trained GBDT models are evaluated on the verification set, and models whose RMSE on the verification set exceeds a threshold are removed, the threshold range being determined by C₁ to C₂. In the present application, C₁ = (0.8~0.9) × RMSE; setting C₁ to a smaller value ensures that some poorly performing models are removed. C₂ = (0.2~0.1) × RMSE; setting C₂ to a slightly higher value ensures that most of the better-performing models are preserved.
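The screening step reduces to computing each model's RMSE on the verification set and dropping the models above a threshold. In this sketch the model names, predictions, and the threshold of 0.2 are illustrative assumptions; predictions are passed in directly rather than produced by trained GBDT models.

```python
import math

def rmse(predictions, targets):
    """Root mean square error between two equal-length sequences."""
    if len(predictions) != len(targets):
        raise ValueError("length mismatch")
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))

def keep_models(model_predictions, targets, threshold):
    """Return the names of models whose verification RMSE does not exceed threshold."""
    return [name for name, preds in model_predictions.items()
            if rmse(preds, targets) <= threshold]

targets = [1.0, 0.0, 1.0, 0.0]
model_predictions = {
    "model_a": [0.9, 0.1, 0.9, 0.1],   # RMSE of about 0.1
    "model_b": [0.5, 0.5, 0.5, 0.5],   # RMSE of 0.5
}
kept = keep_models(model_predictions, targets, threshold=0.2)
```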

For the retained GBDT models, knowledge distillation is used to reduce the number of nodes and the depth of the model, the target depth and number of nodes each being determined by a range D₁ to D₂. Knowledge distillation aims to produce a smaller model that retains the predictive capability of the large model. For a GBDT model, the number of nodes and the tree depth determine the complexity and scale of the model. Let the number of nodes of the original GBDT model be M and its tree depth be H. The target model should not be too small, otherwise too much information is lost, so for the number of nodes D₁ = 0.5M and D₂ = 0.8M are set. The target tree depth should not be too shallow, so for the depth D₁ = 0.8H and D₂ = H are set. That is, the number of nodes of the target model is 50%-80% of the original model's, and its tree depth is 80%-100% of the original model's. Shrinking the number of nodes and the tree depth within these ranges effectively reduces the model scale while retaining the main expressive capability of the model.

The plurality of GBDT models are integrated iteratively, with the range of the number of iterations determined by E₁ to E₂, finally yielding the encryption prediction model. In the present application, the minimum number of iterations E₁ is initialized to 10. Each time a GBDT model is added in an iteration, the overall effect with the newly added model is tested on the verification set. If the new model brings no obvious improvement in prediction (an improvement smaller than the preset improvement threshold α), the current iteration number is recorded as E₂. The number of integration iterations then lies in the range [E₁, E₂].
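The stopping rule for iterative integration can be sketched in the same spirit. The score list is synthetic, and treating the score as "higher is better" is an assumption; the function name is hypothetical.

```python
def find_e2(scores_by_size, e1=10, alpha=0.01):
    """Return E2, the ensemble size at which adding a model stops helping.

    scores_by_size[k] is the verification score of an ensemble of k + 1 models
    (higher is better, an assumption). At least e1 models are always integrated.
    """
    for k in range(1, len(scores_by_size)):
        size = k + 1
        gain = scores_by_size[k] - scores_by_size[k - 1]
        if size >= e1 and gain < alpha:
            return size
    return len(scores_by_size)

# Synthetic scores: each added model gains 0.02 up to 11 models, then 0.005 or less.
scores = [0.50 + 0.02 * k for k in range(11)] + [0.705, 0.706]
e2 = find_e2(scores)
```

Here the twelfth model improves the score by only about 0.005, below α = 0.01, so the integration range would be [10, 12].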

Adding Laplacian noise during model training realizes differential privacy protection and helps ensure that sensitive user data is not leaked. Knowledge distillation reduces the scale of the model, lowering computation and storage costs while maintaining prediction performance. The Bagging algorithm improves the robustness of the model, reduces the risk of overfitting, and improves generalization.
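How Laplacian noise might be attached to a leaf prediction is sketched below. The scale `sensitivity / epsilon` is the textbook Laplace mechanism and is an assumption here; the source only states that the noise intensity follows the privacy budget ε. The function names and the leaf value 0.7 are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_leaf_value(leaf_value, sensitivity, epsilon, rng):
    """Perturb a leaf prediction using the standard Laplace mechanism (assumed)."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    return leaf_value + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(42)  # fixed seed so the sketch is repeatable
draws = [noisy_leaf_value(0.7, sensitivity=1.0, epsilon=1.0, rng=rng)
         for _ in range(20000)]
mean_draw = sum(draws) / len(draws)
```

A smaller ε gives a larger scale and therefore noisier leaves, which is the usual trade-off between privacy and accuracy.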

Specifically, the Bagging algorithm obtains a plurality of bootstrap samples from the second training set by sampling with replacement, and these samples are used to train a plurality of GBDT models, thereby protecting the privacy and improving the security of user data. Combining differential privacy with model integration serves both data privacy and model robustness. Bagging is an abbreviation of Bootstrap Aggregating: it generates subsets of the training data by sampling with replacement and then trains multiple models on these subsets. An appropriate second training set is selected for training the GBDT model. The number of samples N_i in the second training set is calculated from the actual data.

When the number of samples N_i is less than or equal to the preset first threshold N₁, the Bagging algorithm is not used and a single GBDT model is trained on the second training set. Because the number of samples is small, model integration is skipped, which improves training efficiency.

When the number of samples N_i is greater than the first threshold N₁ and less than the preset second threshold N₂, the following is performed: bootstrap samples whose size lies between A₁ and A₂ are randomly drawn with replacement from the second training set, and a plurality of GBDT models are trained on the drawn bootstrap samples. The purpose is to increase the diversity of the models and improve their generalization ability and robustness.

When the number of samples N_i is greater than or equal to the preset second threshold N₂, the following is performed: bootstrap samples with the same number of samples as the second training set are randomly drawn with replacement from the second training set, and a plurality of GBDT models are trained on them, as in the preceding case, to increase the diversity and robustness of the models. The resulting GBDT models can be integrated by voting or averaging to produce more robust and accurate predictions. The Bagging algorithm and random sampling with replacement increase the diversity of the data, which helps protect the privacy of user data. Generating a plurality of models with the Bagging algorithm and integrating them improves the robustness and accuracy of the models.
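The three sample-count cases and the averaging step can be sketched together. The parameter `n_models` and all numeric values below are hypothetical, and plain lists stand in for the training data and model outputs.

```python
import random

def draw_bootstrap_samples(data, n1, n2, a_low, a_high, n_models, rng):
    """Return the training subsets for the three sample-count cases.

    a_low and a_high bound the bootstrap size in the middle case; n_models is a
    hypothetical parameter for how many GBDT models to train.
    """
    size = len(data)
    if size <= n1:
        return [list(data)]  # small set: a single model, no Bagging
    if size < n2:
        # middle case: each bootstrap sample has a size within [a_low, a_high]
        return [rng.choices(data, k=rng.randint(a_low, a_high))
                for _ in range(n_models)]
    # large set: each bootstrap sample is as large as the training set
    return [rng.choices(data, k=size) for _ in range(n_models)]

def average_predictions(per_model_predictions):
    """Integrate models by averaging their predictions element-wise."""
    n = len(per_model_predictions)
    return [sum(values) / n for values in zip(*per_model_predictions)]

rng = random.Random(1)
data = list(range(100))
small_case = draw_bootstrap_samples(data, 200, 400, 20, 80, 5, rng)
middle_case = draw_bootstrap_samples(data, 50, 400, 20, 80, 5, rng)
large_case = draw_bootstrap_samples(data, 10, 50, 20, 80, 5, rng)
combined = average_predictions([[0.0, 1.0], [1.0, 1.0]])
```

Voting would replace the element-wise mean with a majority count over class labels; averaging is shown because the models are described as regression models.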

In this embodiment, the ranges are set as follows. Sample number range [A₁, A₂] of the second training set: the total number of training samples is N = 10000; A₁ = 0.8 * N = 8000; A₂ = 0.2 * N = 2000; the sample size range is set to [8000, 2000]. GBDT maximum iteration count range [B₁, B₂]: B₁ = 10 iterations; the error drop threshold τ is set to 0.005; at iteration 15 the error drop is less than 0.005, so B₂ = 15; the maximum iteration number is in the range [10, 15]. GBDT model screening RMSE threshold range [C₁, C₂]: the model RMSE is 0.1; C₁ = 0.8 * 0.1 = 0.08; C₂ = 0.2 * 0.1 = 0.02; the RMSE threshold range is [0.08, 0.02]. Node number range [D₁, D₂] of the knowledge-distilled model: the number of original model nodes is M = 500; the range of the number of target model nodes is [0.5 * 500, 0.8 * 500] = [250, 400]. Iterative integration count range [E₁, E₂]: E₁ = 10; the improvement threshold α is set to 0.01; at iteration 12 the improvement is smaller than 0.01, so E₂ = 12; the iteration number is in the range [10, 12].

The privacy budget is calculated dynamically according to the query type and the data size, and a differential privacy technique is applied when a preset threshold is reached, protecting the privacy of user data. An initial privacy budget ε₀ is initialized, with the budget in the range F₁ to F₂; this initial budget is used for subsequent privacy management. Query requests are received in a loop; the i-th query request Q_i includes a query type x_i and a data size y_i. From the query type x_i and data size y_i of Q_i, the privacy budget ε_qi consumed by the query is calculated as: ε_qi = k₁ * log(x_i) * (y_i ^ α). In the present application, with reference to industry standards, the standard default value of the initial privacy budget ε₀ is set to 1.0, and an adjustable budget adjustment coefficient k, with a default value of 1.0, is set according to the application scenario. When the application scenario has higher requirements on privacy protection, k can be set between 0.5 and 0.8; when the application scenario has higher requirements on functionality and output quality, k can be set between 1.5 and 2.0.
The initial privacy budget is calculated as ε₀ = k; thus the reasonable value range F₁ to F₂ of the initial privacy budget ε₀ is [0.5, 2.0]. Multiple levels of budget ranges may also be set, for example: level 1, [0.5, 1.0]; level 2, [1.0, 1.5]; level 3, [1.5, 2.0]. The privacy consumption coefficient k₁ takes values in [0.8, 1]'s companion range [0.5, 2] and controls the consumption rate. The data scale adjustment coefficient α takes values in [0.8, 1], so consumption grows at most linearly with the data size. The query type complexity x_i can be classified by query semantics into levels 1-10, with higher levels representing higher query complexity and sensitivity. The weight ω_i can be tied to the query level, with range [0.1, 1]; the higher the level, the greater the weight. The noise parameter b_i is calculated from the current remaining budget and the weight, so that high-weight queries retain good privacy protection when the budget is low. The remaining budget threshold ε_min may be set to 10%-30% of the initial budget; reaching it triggers the protection mechanism.
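The consumption formula and the remaining-budget threshold can be sketched as follows. Two points are assumptions: `log` is taken as the natural logarithm, and a query's cost is simply deducted from the remaining budget, which the source does not spell out. The class and function names are hypothetical.

```python
import math

def query_budget_cost(level, data_size, k1=1.0, alpha=0.9):
    """epsilon_qi = k1 * log(x_i) * y_i ** alpha, with log as natural log (assumed)."""
    if not 1 <= level <= 10:
        raise ValueError("query level must be in 1..10")
    return k1 * math.log(level) * data_size ** alpha

class PrivacyBudget:
    """Track the remaining budget and flag when it reaches the threshold epsilon_min."""

    def __init__(self, initial=1.0, min_fraction=0.2):
        self.remaining = initial
        # epsilon_min, chosen within 10%-30% of the initial budget
        self.threshold = min_fraction * initial

    def spend(self, cost):
        """Deduct a query's cost; return True when protection should be enabled."""
        self.remaining = max(0.0, self.remaining - cost)
        return self.remaining <= self.threshold

budget = PrivacyBudget(initial=1.5)
first_trigger = budget.spend(0.5)    # remaining budget stays above the threshold
second_trigger = budget.spend(0.8)   # remaining budget falls to the threshold region
```

A level 1 query costs nothing under this formula because log(1) = 0, and the cost rises with both the query level and the data size.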

Specifically, by calculating the sensitivity weight and the adaptive variance function, a corresponding random noise distribution is constructed for each query request. The privacy noise distribution center b_i0 of each query request reflects the privacy sensitivity of that request and is calculated as: b_i0 = ω_i * ε_qi. Here ω_i is the sensitivity weight of query type x_i: the query types (x₁, x₂, ..., x_m) correspond to preset sensitivity weights (ω₁, ω₂, ..., ω_m), and different weights are assigned by weighing the importance of the different query types. ε_qi is the privacy budget consumed by the query of type x_i, calculated in the previous step. An adaptive variance function f(y_i) is introduced to control the variance of the noise, dynamically adjusting the noise level according to the data size. It is calculated as: f(y_i) = log(k₂ * (y_i + 1)). In the present application, the adaptive variance parameter k₂ takes values in [1, 1.5] and controls the variance size. By assigning different sensitivity weights ω_i to different types of queries, the privacy protection level can be personalized according to the importance and privacy sensitivity of the query. Important data receives stronger privacy protection, while for less sensitive data the overhead of privacy protection is reduced, improving the usability of the data. The random noise distribution built around the center b_i0 randomly perturbs the query result, obscuring the information of the original data so that a malicious user or attacker cannot easily restore sensitive data. This randomization raises the level of data privacy protection. The adaptive variance function f(y_i) allows the variance of the noise to be adjusted dynamically according to the data size of the query.
The variance is relatively large when large-scale data is processed, allowing more randomness, and smaller when small-scale data is processed, avoiding excessive perturbation. Privacy is thus protected while the accuracy of the data is kept as far as possible, improving the usability of the data. The parameter k₂ of the adaptive variance function, with value range [1, 1.5], allows the size of the noise to be controlled according to specific requirements. Since f(y_i) increases with k₂, a greater k₂ yields a larger variance and thus stricter privacy protection, while a smaller k₂ yields a smaller variance, which perturbs the results less and improves the usability of the data at the cost of weaker protection. This flexibility enables the method to be tuned for different application scenarios.

In this embodiment, the initial privacy budget ε₀ is set to 1.5, i.e., budget level 2 is selected. A query request Q₁ is received that queries user purchase records; its type is a level 4 sensitive query and its data size is y₁ = 5000. The privacy budget consumption of Q₁ is calculated by the formula ε_qi = k₁ * log(x_i) * (y_i ^ α) = 1 * log 4 * (5000 ^ 0.9) = 5.2. The sensitivity weight of Q₁ is ω₁ = 0.9. The noise distribution center is calculated: b_i0 = ω_i * ε_qi = 0.9 * 5.2 = 4.68. The adaptive variance parameter is set to k₂ = 1.3, and from the data size y₁ = 5000, f(y_i) = log(k₂ * (y_i + 1)) = log(1.3 * (5000 + 1)) = 8.8 is calculated. A Laplace noise distribution with center b_i0 and variance f(y_i) is constructed. Privacy noise b_i is sampled from the noise distribution, added to the query response, and the response is then sent to the user.
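The noise construction for Q₁ can be sketched as below. The value ε_q1 = 5.2 is taken as stated in the embodiment rather than recomputed, and the conversion from variance to Laplace scale uses the fact that a Laplace distribution with scale s has variance 2s². The function names and the random seed are assumptions.

```python
import math
import random

def adaptive_variance(data_size, k2):
    """f(y) = log(k2 * (y + 1)), using the natural logarithm."""
    return math.log(k2 * (data_size + 1))

def sample_query_noise(center, variance, rng):
    """Draw from a Laplace distribution with the given center and variance.

    A Laplace distribution with scale s has variance 2 * s**2, so
    s = sqrt(variance / 2).
    """
    scale = math.sqrt(variance / 2.0)
    return center + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

epsilon_q1 = 5.2                 # taken as stated in the embodiment, not recomputed
omega_1 = 0.9                    # sensitivity weight of Q1
center = omega_1 * epsilon_q1    # noise distribution center b_i0
variance = adaptive_variance(5000, k2=1.3)
rng = random.Random(7)           # fixed seed so the sketch is repeatable
noise = sample_query_noise(center, variance, rng)
```

The sampled `noise` would then be added to the query response before it is returned to the user.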

FIG. 5 is an exemplary flow chart of differential encryption according to some embodiments of the present disclosure. As shown in FIG. 5, the data is separated into key business data, rights control data, and text data to be searched, and different encryption techniques and rights control policies are applied to protect the privacy and security of the data. The step of differential encryption comprises:
S141, classifying the e-commerce data into three categories: key business data, including payment information data whose transaction amount is greater than a preset threshold; rights control data, including preset user permission levels, client types, and the like; and text data to be searched, including customer feedback, order remark text, and the like.
S142, for payment information data in the key business data whose transaction amount is greater than the preset threshold, performing hierarchical rights control with a chained encryption mechanism based on homomorphic encryption. Homomorphic encryption allows computation in the encrypted state, so rights control is possible without exposing plaintext data. Chained encryption allows multiple key levels, and only the proper key can decrypt the data of the corresponding level, which enhances data security. Together, these mechanisms realize multi-level rights control over key business data and ensure that only authorized users can access sensitive data. Classifying the data and applying different encryption strategies protects the data differentially according to its sensitivity and purpose, improving data security.
S143, applying attribute-based encryption to the rights control data, such as the user permission level and the client type, to control access rights. Attribute-based encryption encrypts and decrypts data according to the attributes of the user, so that only users with the specific attributes can decrypt and access the corresponding rights control data.
S144, for the text data to be searched, such as customer feedback and order remark text, applying searchable encryption, which allows search operations to be performed in the encrypted domain without exposing plaintext data, protecting the privacy of the data.

The key business data can be filtered and classified based on the attributes of the data. Data whose transaction amount is greater than the preset threshold is identified as key business data, and the data is classified by transaction type (e.g., payment, refund, purchase). Records classified as key business data are labeled with their category for subsequent processing. The rights control data typically includes information such as the user permission level and the client type; it is classified by the permission levels of the different users and by the type of customer (e.g., general user, VIP user, administrator). The text data to be searched typically includes text such as customer feedback and order notes, and may be classified by content and purpose. Natural language processing (NLP) techniques, such as text classification algorithms, identify keywords or topics in the text data to determine its category.
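A rule-based version of this three-way classification can be sketched as follows. The record field names, the category labels, and the default threshold are hypothetical; NLP-based text classification is not shown.

```python
def classify_record(record, amount_threshold=5000):
    """Assign a record to one of the three categories (field names are hypothetical)."""
    if record.get("type") == "payment" and record.get("amount", 0) > amount_threshold:
        return "key_business"
    if "permission_level" in record or "client_type" in record:
        return "rights_control"
    if "feedback" in record or "order_remark" in record:
        return "searchable_text"
    return "unclassified"

records = [
    {"type": "payment", "amount": 8000},
    {"permission_level": "admin", "client_type": "VIP"},
    {"order_remark": "please deliver after 6 pm"},
    {"type": "payment", "amount": 120},
]
categories = [classify_record(r) for r in records]
```

Each category would then be routed to its own encryption technique: homomorphic chained encryption, attribute-based encryption, or searchable encryption.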

The key business data is typically classified by transaction amount. For payment information data whose transaction amount is greater than the preset threshold, a chained encryption mechanism based on homomorphic encryption is adopted. This allows specific calculations, such as aggregation or statistics over payment amounts, to be made while the data remains encrypted, without decrypting it, which improves the security of the payment data and prevents unauthorized access and leakage. The rights control data includes information such as the user permission level and the client type, which is used to control data access rights. Only users with the corresponding attributes or rights can decrypt and access the relevant data, which strengthens the protection of sensitive rights information, safeguards the confidentiality and integrity of the rights data, and reduces the risk of misuse. The text data to be searched may contain customer feedback, order notes, and similar information. Such data needs to be searchable yet must stay encrypted to preserve privacy. With searchable encryption, necessary search operations such as keyword search or data retrieval can be performed while the privacy of the data is protected, allowing limited analysis and querying of the text data.

In this embodiment, transaction description text is extracted from the electronic commerce transaction data using NLP techniques; a text classification algorithm identifies keywords such as 'purchase' and 'payment', and the matching text records are marked as key business data; payment data with a transaction amount larger than 5000 yuan is encrypted with the Paillier homomorphic encryption algorithm, which enables statistical analysis in the encrypted domain. Records in the user database whose user role level is 'manager' or 'operator' are marked as rights control data; the rights control data is encrypted with the CP-ABE attribute-based encryption algorithm, and the access control policy is bound to the user's department attribute. Customer feedback content is extracted from the order comment text using regular expressions, and the extracted feedback text is encrypted with an SSE encryption algorithm that supports keyword search; in the encrypted domain, the feedback text is searched with keywords such as 'good rating' and 'bad rating'.
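The additive property that makes statistics possible in the encrypted domain can be illustrated with a toy Paillier implementation. This is a sketch under stated assumptions, not the embodiment's implementation: it uses the common g = n + 1 variant, the function names are hypothetical, and the small primes in the usage note give no real security.

```python
import math
import secrets


def keygen(p: int, q: int):
    """Build a Paillier key pair from two primes, using the g = n + 1 variant."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)                               # inverse of lambda modulo n
    return n, (lam, mu)


def encrypt(n: int, m: int) -> int:
    """Encrypt an integer 0 <= m < n under the public modulus n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1               # random r in [1, n - 1]
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def add_encrypted(n: int, c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)


def decrypt(n: int, private_key, c: int) -> int:
    """Recover the plaintext with the private key (lambda, mu)."""
    lam, mu = private_key
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n
```

For example, with the two Mersenne primes 2^31 - 1 and 2^61 - 1 as toy key material, encrypting the amounts 5200, 8100 and 12000 separately, combining the ciphertexts with `add_encrypted`, and decrypting the result yields their sum, 25300, without any individual amount being decrypted.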

FIG. 6 is an exemplary flow chart for setting access control policies according to some embodiments of the disclosure. As shown in FIG. 6, the step of employing an access control policy based on data attributes and user roles includes:

S143A: for each first data that needs to be indexed, generating a random salt value using the XTS mode of the AES-256 algorithm. The random salt value increases the randomness of the data and prevents rainbow-table attacks on identical data, which reduces the risk of attacks on the data. Encrypting in AES-256 XTS mode protects the privacy of the data, so that only authorized users can decrypt it.

S143B: deriving an encryption key and a decryption key of the index database using the HKDF algorithm based on the master key. Deriving keys with HKDF ensures that the keys obtained from the master key are strong enough for data encryption and decryption and are not easily cracked.

S143C: encrypting the first data and the random salt value with the encryption key to generate second data. Encryption protects the privacy of the first data; only a user holding the decryption key can restore the original data.

S143D: creating an index table in the index database containing the following information: the data ID, which uniquely identifies each data item; the encrypted salt, which is the random salt value associated with each data item; and the second data, which is the encrypted data.

S143E: when the index is read, looking up the matching encrypted salt value and second data by data ID, and decrypting with the decryption key to obtain the first data. The decryption operation can only be carried out by an authorized user, which ensures the security of the data.

S143F: establishing a role-based access control mechanism in the index database, comprising: identity authentication, in which the user must provide valid authentication credentials, including a user name and password; authorization management, in which the role and attributes of the user determine whether the user has permission to access specific data; access audit logging, in which all access to the database is recorded for auditing and monitoring; and rights grading, in which different authority levels are assigned to different user roles and attributes to refine data access control. Role-based access control thus allows different rights to be assigned to different users.
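The key derivation in step S143B can be sketched with the Python standard library, following the extract-then-expand construction of HKDF with SHA-256. The function names and the `purpose` label are assumptions for illustration. Because the standard library has no AES, the salt here is drawn from the operating system's random source via `secrets` rather than from AES-256 XTS as the description states, and a single symmetric key is assumed to serve both encryption and decryption.

```python
import hashlib
import hmac
import secrets

HASH_LEN = 32  # SHA-256 output size in bytes


def new_salt(length: int = 32) -> bytes:
    """Random per-item salt (assumption: OS CSPRNG instead of AES-256 XTS)."""
    return secrets.token_bytes(length)


def hkdf_extract(salt: bytes, master_key: bytes) -> bytes:
    """HKDF extract step: condense the master key into a pseudorandom key."""
    if not salt:
        salt = bytes(HASH_LEN)
    return hmac.new(salt, master_key, hashlib.sha256).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """HKDF expand step: stretch the pseudorandom key to the requested length."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested key is too long for HKDF-SHA256")
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_index_key(master_key: bytes, salt: bytes, purpose: bytes, length: int = 32) -> bytes:
    """Derive a key for one purpose; different purpose labels give independent keys."""
    return hkdf_expand(hkdf_extract(salt, master_key), purpose, length)
```

Binding each derived key to a purpose label means that the index key and any other key derived from the same master key cannot be substituted for one another.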

By adopting a strong encryption algorithm and a flexible access control mechanism, only authorized users can access sensitive data, and all access to the data is recorded and monitored, which improves data security. For each first data to be indexed, the application uses the XTS mode of the AES-256 algorithm to generate a random salt value, which is used together with the first data in the subsequent encryption process. Using the HKDF (HMAC-based Key Derivation Function) algorithm based on the master key, the application derives the encryption and decryption keys of the index database, which ensures secure generation and management of keys. The first data and the random salt value are encrypted with the generated encryption key to produce the second data, which is stored in the index database. In the index database, the application creates an index table containing the data ID, the encrypted salt value, and the second data; this table is used to retrieve and decrypt the first data.

The application establishes a role-based access control (RBAC) mechanism in the index database, including identity authentication, authorization management, and access audit logs. Identity authentication: the user must provide valid credentials to access the data, which ensures that only authenticated users can access the system. Authorization management: through user roles and attributes, the application controls at fine granularity who can access which data, and only authorized users are able to decrypt and obtain the first data. Access audit log: the application records all data access events, including the user identity, the time stamp, and the access operation, which helps to monitor the use of data and track potential security issues. The application also adopts rights grading to ensure that data of different levels can only be accessed by authorized users: high-sensitivity data can only be accessed by high-level users, while low-sensitivity data can be accessed by a wider population of users. Access auditing allows data access to be audited periodically, potential risks to be identified, and appropriate action to be taken to enhance data security. The XTS mode of the AES-256 algorithm and the HKDF algorithm provide strong encryption of data and secure management of keys. The RBAC-based access control mechanism provides flexible rights management, allowing fine-grained control of data based on user roles and attributes. Recording access in audit logs and monitoring data use facilitates the timely discovery and handling of potential security issues.
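The combination of rights grading and access audit logging can be sketched as below. The role names, numeric levels, and log fields are hypothetical; the description does not fix a concrete grading scale.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List

# Hypothetical role grading: a higher number may read more sensitive data.
ROLE_LEVELS: Dict[str, int] = {
    "general": 1,
    "vip": 2,
    "operator": 3,
    "administrator": 4,
}


@dataclass
class AccessController:
    """Grants access by role level and records every attempt in an audit log."""
    audit_log: List[dict] = field(default_factory=list)

    def check_access(self, user: str, role: str, data_id: str, sensitivity: int) -> bool:
        # Unknown roles get level 0 and are therefore always refused.
        allowed = ROLE_LEVELS.get(role, 0) >= sensitivity
        # Both granted and refused attempts are logged for later auditing.
        self.audit_log.append({
            "user": user,
            "role": role,
            "data_id": data_id,
            "time": datetime.now(timezone.utc).isoformat(),
            "allowed": allowed,
        })
        return allowed
```

Logging refused attempts as well as granted ones is a design choice in this sketch; it lets a periodic audit spot repeated attempts to reach data above a user's level.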

Example 3: a random salt value of 256 bits in length is generated using the XTS mode of the AES-256 algorithm; a 256-bit database index encryption key and decryption key are derived from the master key through HKDF-SHA256; the client information data and the random salt value are encrypted with AES-256 using the encryption key, and the encrypted data is output; a data table is created in a MySQL database with the fields data ID, salt value, and encrypted data, and the encrypted data is inserted; when a user logs in, the identity is verified by user name and password, the system looks up the user's role, and an access token is generated according to the role's permissions; when the user accesses the index data, the access token is presented, and the system verifies it and records the access in the access log.
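The index table of Example 3 can be sketched as follows. The example names MySQL; this sketch substitutes Python's built-in `sqlite3` module so that it is self-contained, and the table and column names are assumptions. The ciphertext is treated as opaque bytes produced by an earlier encryption step.

```python
import sqlite3
from typing import Optional, Tuple


def create_index_table(conn: sqlite3.Connection) -> None:
    """Create the index table with the three fields named in the example."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS data_index ("
        "data_id TEXT PRIMARY KEY, "
        "salt BLOB NOT NULL, "
        "encrypted_data BLOB NOT NULL)"
    )


def insert_entry(conn: sqlite3.Connection, data_id: str, salt: bytes, encrypted_data: bytes) -> None:
    """Store one encrypted item together with its salt."""
    conn.execute(
        "INSERT INTO data_index (data_id, salt, encrypted_data) VALUES (?, ?, ?)",
        (data_id, salt, encrypted_data),
    )
    conn.commit()


def lookup_entry(conn: sqlite3.Connection, data_id: str) -> Optional[Tuple[bytes, bytes]]:
    """Return (salt, encrypted_data) for a data ID, or None if it is absent."""
    return conn.execute(
        "SELECT salt, encrypted_data FROM data_index WHERE data_id = ?",
        (data_id,),
    ).fetchone()
```

Reading the index then amounts to calling `lookup_entry` with the data ID and passing the returned salt and ciphertext to the decryption step.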

Example 4: a database full-text index encryption key of 512 bits in length is derived from the master key through HKDF-SHA256; the commodity comment text is encrypted with AES-512 using the encryption key, and the encrypted text is output; an index is created with the fields document ID and encrypted text, and the encrypted text is inserted; access control policies are defined so that different user roles have different data access rights; when the user logs in to the system, two-factor identity authentication is performed; when the text index is accessed, the user's rights are verified and the access is logged.

The application ensures the security of the first data and the salt value by generating a random master key with a TRNG, generating a random salt value, deriving keys with the HMAC-SHA256 algorithm, and encrypting with AES-256. By combining the Shamir secret sharing algorithm with multi-path splicing, the application generates the second data and provides strong support for data security. A true random number generator (TRNG) generates a random master key of length G₁ to G₂ bits, which ensures a high degree of randomness and security of the master key. A secure random number generation algorithm generates a random salt value of length G₁ to G₂ bits, which increases the complexity of data encryption. An encryption key of length G₃ to G₄ bits corresponding to the first data is derived from the generated random master key with the HMAC-SHA256 algorithm, which ensures secure generation and management of the key. The first data is encrypted with AES-256 using the generated encryption key to obtain the encrypted first data S₁. The random salt value of the first data is encrypted with AES-256 to obtain the encrypted salt value S₂. S₁ and S₂ are each hashed with the SHA-256 algorithm to obtain the hash values U₁ and U₂, which ensures data integrity and security. Using the Shamir secret sharing algorithm, U₁ is split into K₁ fragments and U₂ into K₂ fragments, which increases the dispersion and security of the data. In a multi-path splicing mode, M fragments at different positions are selected from the fragments of U₁ and U₂ according to a preset rule and spliced by exclusive-or (XOR) to generate the second data. This ensures the complexity and security of the second data, so that it is not compromised through any single fragment.

In summary, generating the random master key with a TRNG ensures the high randomness and security of the master key. Generating random salt values increases the complexity of data encryption. Key derivation with the HMAC-SHA256 algorithm ensures secure generation and management of the key. Encrypting the first data and the random salt value with the AES-256 algorithm ensures the encryption security of the data. Hashing with the SHA-256 algorithm ensures data integrity and security. Splitting each hash value into a plurality of fragments with the Shamir secret sharing algorithm increases the dispersion and security of the data. Splicing the plurality of fragments by exclusive-or (XOR) to generate the second data ensures the complexity and security of the second data. In the present application, the master key length G₁ to G₂ is 256 to 512 bits, which ensures the security strength; the random salt length G₁ to G₂ is 128 to 256 bits, which increases entropy and unpredictability; the derived key length G₃ to G₄ is 256 bits, matching the AES-256 algorithm; the SHA-256 output hash value is 256 bits long; the Shamir fragment counts K₁ and K₂ are 5 to 8, which provides a fault-tolerance threshold; the number M of fragments selected for multi-path splicing is 3 to 5, which ensures the difficulty of reverse derivation; the XOR splicing positions are distributed uniformly at regular intervals; and the length of the second index data is the sum of the lengths of the spliced fragments. Different security levels can be set, with higher levels taking larger parameter values and lower levels taking smaller values.

In this embodiment, a random master key of 512 bits in length is generated using a hardware TRNG; a random salt value of 256 bits in length is generated using the system RNG algorithm; a 256-bit encryption key is derived using HMAC-SHA256; the first data is encrypted with AES-256 using the encryption key, and the encrypted data S₁ is output; the random salt value is encrypted with AES-256 using a different encryption key, and the encrypted salt value S₂ is output; S₁ and S₂ are each hashed with SHA-256 to obtain the 256-bit hash values U₁ and U₂; using the Shamir secret sharing algorithm, U₁ is split into 5 fragments and U₂ into 6 fragments; 4 fragments at different positions are selected from the fragments of U₁ and U₂ and spliced by exclusive-or to generate the second index data; the length of the second index data is 1024 bits, the sum of the 4 fragments; finally, the first data ID, the random salt value, and the second index data are stored in the database index table.
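The splitting of a 256-bit hash value such as U₁ into fragments can be sketched with a textbook Shamir scheme over a prime field. This is an illustration under stated assumptions: the field prime 2^521 - 1, the function names, and the reconstruction threshold are choices made for the sketch, since the embodiment gives the number of fragments but not the threshold. The selection and XOR splicing of fragments is not shown.

```python
import hashlib
import secrets
from typing import List, Tuple

# Mersenne prime 2^521 - 1; any prime larger than a 256-bit secret would do.
PRIME = 2**521 - 1


def split(secret: int, n_shares: int, threshold: int) -> List[Tuple[int, int]]:
    """Split a secret into n_shares points, any `threshold` of which recover it."""
    # Random polynomial of degree threshold - 1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, n_shares + 1):
        y = 0
        for c in reversed(coeffs):          # Horner evaluation modulo PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def reconstruct(shares: List[Tuple[int, int]]) -> int:
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


def hash_to_int(data: bytes) -> int:
    """SHA-256 digest of the data, interpreted as a big-endian integer."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")
```

For instance, `split(hash_to_int(s1), 5, 3)` yields five fragments of U₁, and `reconstruct` applied to any three of them returns U₁ again, which is the fault tolerance the fragment counts are meant to provide.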

Claims

1. A method of processing electronic commerce data, comprising:

acquiring electronic commerce data, extracting feature information by utilizing feature engineering, and constructing feature vectors;

using a machine learning algorithm to train a prediction model based on the feature vector to predict first data to be encrypted;

calculating semantic relativity among the predicted encrypted data as an encryption sequence;

encrypting the first data by utilizing a differential encryption algorithm according to the characteristic information and the encryption sequence of the data to generate second data;

establishing a mapping relation between first data and second data in a database, and constructing an index;

an access control policy based on data attributes and user roles is employed.

2. The method for processing electronic commerce data according to claim 1, wherein:

the step of using a machine learning algorithm to train a prediction model based on the feature vector and predict the first data to be encrypted comprises:

labeling the constructed feature vectors to generate a first training set and a verification set;

detecting whether the sample quantity distribution of each category in the first training set is balanced or not;

if the first training set has unbalanced category, the first training set is processed by utilizing an oversampling or undersampling technology, and a second training set is generated;

training a gradient boosting decision tree (GBDT) model using the second training set and the verification set to obtain an encryption prediction model;

and predicting the data needing to be encrypted in the electronic commerce data by using the encryption prediction model to obtain first data.

3. The method for processing electronic commerce data according to claim 2, wherein:

the step of processing the first training set using an oversampling or undersampling technique to generate a second training set, if there is a class imbalance in the first training set, comprises:

acquiring the total sample amount N of the first training set;

dynamically setting a first threshold ratio P₁ and a second threshold ratio P₂ of the number of samples according to the total number of samples, P₁ being less than P₂;

calculating a first threshold N₁ according to the total sample amount N and the first threshold ratio P₁;

calculating a second threshold N₂ according to the total sample amount N and the second threshold ratio P₂;

judging whether the number of samples N_i of each category is greater than the first threshold N₁ or less than the second threshold N₂;

when the number of samples N_i is greater than the first threshold N₁, performing undersampling processing on the corresponding samples;

when the number of samples N_i is less than the second threshold N₂, performing oversampling processing on the corresponding samples;

repeating threshold judgment and sampling processing on the training set subjected to undersampling processing or oversampling processing to obtain a second training set;

wherein the threshold ratios are dynamically set by an adaptive algorithm based on a machine learning algorithm;

the undersampling process is one of downsampling, random sampling or weighted sampling;

the over-sampling process is one of up-sampling, synthesizing, or replicating the sample.

4. The method for processing electronic commerce data according to claim 3, wherein:

the step of training the gradient boosting decision tree model by using the second training set to obtain the encryption prediction model comprises:

obtaining a plurality of bootstrap samples from the second training set by sampling with replacement according to the Bagging algorithm, the number of samples in each bootstrap sample being A₁ to A₂ of the total number of samples;

training a GBDT regression model for each bootstrap sample, the maximum number of iterations of the model being B₁ to B₂ rounds;

during GBDT regression model training, adding Laplace noise at the leaf nodes, the value of the Laplace noise being determined by a dynamically preset privacy budget ε;

evaluating the trained GBDT models using the verification set and removing the models whose RMSE exceeds a threshold, the threshold being C₁ to C₂ of the RMSE values evaluated on the verification set;

for the retained GBDT models, reducing the number of nodes and the depth of each model by knowledge distillation, the target number of nodes and depth being D₁ to D₂ of the number of nodes and depth of the GBDT model before knowledge distillation;

iteratively integrating the plurality of GBDT models, the number of iterations being E₁ to E₂, to obtain the encryption prediction model.

5. The method for processing electronic commerce data according to claim 4, wherein:

the step of obtaining a plurality of bootstrap samples from the second training set by sampling with replacement according to the Bagging algorithm comprises:

when the number of samples N_i is less than or equal to a preset first threshold N₁, training a single GBDT model using the second training set;

when the number of samples N_i is greater than the first threshold N₁ and less than a preset second threshold N₂, randomly drawing, with replacement, bootstrap samples from the second training set, each containing A₁ to A₂ of the sample size of the second training set, and training a plurality of GBDT models using the drawn bootstrap samples;

when the number of samples N_i is greater than or equal to the second threshold N₂, randomly drawing, with replacement, bootstrap samples with the same number of samples as the second training set from the second training set, and training a plurality of GBDT models using the drawn bootstrap samples.

6. The method for processing electronic commerce data according to claim 4, wherein:

the step of dynamically presetting the privacy budget epsilon comprises the following steps:

setting an initial privacy budget ε₀ of F₁ to F₂;

cyclically receiving an i-th round query request Q_i, and acquiring the query type x_i and data size y_i corresponding to the i-th round query request Q_i;

calculating the privacy budget ε_qi consumed by the i-th round query request Q_i according to its query type x_i and data size y_i, ε_qi being calculated by the following formula:

$$\varepsilon_{qi} = k_1 \cdot \log(x_i) \cdot y_i^{\alpha}$$

subtracting the accumulated consumed privacy budget from the initially preset privacy budget ε₀ to obtain the remaining privacy budget ε_{t+1};

judging whether the remaining privacy budget ε_{t+1} is below a preset threshold ε_min;

when ε_{t+1} is less than or equal to ε_min, calculating a preset sensitivity weight ω_i corresponding to the query type;

calculating the Laplace noise parameter b_i corresponding to the i-th round query request Q_i according to the sensitivity weight ω_i and the i-th round query consumption ε_qi;

adding the calculated noise b_i for the i-th round query request Q_i, and then returning a query response.

7. The method for processing electronic commerce data according to claim 6, wherein:

the step of calculating the Laplace noise parameter b_i comprises:

calculating a privacy noise distribution center b_i0 and an adaptive variance function f(y_i) by the following formulas:

$$b_{i0} = \omega_i \cdot \varepsilon_{qi}$$

$$f(y_i) = \log\bigl(k_2 \cdot (y_i + 1)\bigr)$$

wherein k₂ is a constant of the adaptive variance function;

constructing a Laplace or Gaussian distribution F_i centered at b_i0 with variance f(y_i);

sampling from the distribution F_i using a distribution-based sampling algorithm to obtain the privacy noise value b_i.
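The formulas of claims 6 and 7 can be worked through numerically as follows. This is a sketch under stated assumptions: the function names are hypothetical, the query type x_i is assumed to be encoded as a number greater than 1 so that its logarithm is positive, natural logarithms are assumed, and only the Laplace case is sampled, using the fact that a Laplace distribution with scale s has variance 2s².

```python
import math
import random
from typing import Iterable


def query_cost(k1: float, x_i: float, y_i: float, alpha: float) -> float:
    """Privacy budget consumed by one query: k1 * log(x_i) * y_i ** alpha."""
    return k1 * math.log(x_i) * (y_i ** alpha)


def remaining_budget(eps0: float, costs: Iterable[float]) -> float:
    """Initial budget minus the accumulated consumption of all queries so far."""
    return eps0 - sum(costs)


def noise_center(omega_i: float, eps_qi: float) -> float:
    """Distribution center b_i0 = omega_i * eps_qi."""
    return omega_i * eps_qi


def adaptive_variance(k2: float, y_i: float) -> float:
    """Adaptive variance f(y_i) = log(k2 * (y_i + 1)); needs k2 * (y_i + 1) > 1."""
    return math.log(k2 * (y_i + 1))


def sample_noise(center: float, variance: float, rng: random.Random) -> float:
    """Draw one Laplace sample with the given center and variance."""
    scale = math.sqrt(variance / 2.0)       # Laplace variance is 2 * scale ** 2
    # The difference of two independent exponentials is Laplace-distributed.
    return center + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
```

With k₁ = 0.01, x_i = e, y_i = 100 and α = 0.5, the consumed budget is 0.01 · 1 · 10 = 0.1; with ω_i = 0.5 the distribution center is then 0.05.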

8. The method for processing electronic commerce data according to claim 1, wherein:

the step of differential encryption includes:

classifying the electronic commerce data, and extracting key business data, authority control data and text data to be searched;

for key business data comprising payment information data whose transaction amount is larger than a preset threshold, carrying out layered authority control by adopting a chained encryption mechanism based on homomorphic encryption;

for rights control data comprising a preset user rights level and client type, controlling access rights by adopting an attribute-based encryption technology;

for text data to be searched comprising preset customer feedback and order remark text, adopting an encryption technology supporting search so that the search is carried out in the encrypted domain.

9. The method for processing electronic commerce data according to claim 1, wherein:

the step of employing an access control policy based on data attributes and user roles includes:

generating a random salt value for each first data to be indexed using the XTS mode of the AES-256 algorithm;

deriving an encryption key and a decryption key of the index database by using an HKDF algorithm based on the master key;

encrypting the first data and the random salt value by using an encryption key to generate second data;

creating an index table in an index database, the index table containing a data ID, an encrypted salt value, and second data;

when the index is read, searching the matched encrypted salt value and the second data through the data ID, and decrypting by using a decryption key to obtain the first data;

establishing an RBAC-based access control mechanism in an index database, wherein the access control mechanism comprises identity verification, authorization management and access audit logs;

and carrying out security control on the index database by adopting authority classification and access audit.

10. The method for processing electronic commerce data according to claim 9, wherein:

the step of encrypting the first data and the random salt value using an encryption key to generate second data comprises:

generating a random master key of length G₁ to G₂ bits using a TRNG;

for the first data, generating a random salt value of length G₁ to G₂ bits using a secure random number generation algorithm;

deriving, from the master key using the HMAC-SHA256 algorithm, an encryption key of length G₃ to G₄ bits corresponding to the first data;

performing AES-256 encryption on the first data using the encryption key, and outputting the encrypted first data S₁;

performing AES-256 encryption on the random salt value of the first data, and outputting the encrypted salt value S₂;

hashing S₁ and S₂ respectively with the SHA-256 algorithm to obtain hash values U₁ and U₂;

splitting U₁ into K₁ fragments and U₂ into K₂ fragments using the Shamir secret sharing algorithm;

in a multi-path splicing mode, selecting M fragments at different positions from the fragments of U₁ and U₂ according to a preset rule and splicing them by exclusive-or (XOR) to generate the second data.