CN117290888A - Information desensitization method for big data, storage medium and server - Google Patents

Publication number
CN117290888A
Authority
CN
China
Prior art keywords
information
desensitization
sample
generator
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311574400.9A
Other languages
Chinese (zh)
Other versions
CN117290888B (en
Inventor
刘世闻
董爱平
戴晔
顾璇
严典范
李彩荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Fengyun Technology Service Co ltd
Original Assignee
Jiangsu Fengyun Technology Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Fengyun Technology Service Co ltd
Priority to CN202311574400.9A
Publication of CN117290888A
Application granted
Publication of CN117290888B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0475 Generative networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/094 Adversarial learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioethics (AREA)
  • Computer Security & Cryptography (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application provides an information desensitization method for big data, a storage medium and a server. The method comprises: acquiring an original data set containing a plurality of pieces of original data; extracting entities, relations and attributes from each piece of original data, and determining the sensitive information (information that exposes user privacy) in the entities and attributes; performing preliminary desensitization (one or more of desensitization replacement, hiding and generalization) on the sensitive information, and generating condition information based on the entities, relations and attributes after the preliminary desensitization; and using the generator of a trained conditional generative adversarial network to generate synthetic samples based on the condition information and random vectors, finally obtaining a desensitized data set composed of a plurality of synthetic samples. The conditional generative adversarial network thus generates, from the original data set, a desensitized data set that can be used for post-processing (such as data analysis and mining), which protects user privacy while meeting the post-processing needs of the data.

Description

Information desensitization method for big data, storage medium and server
Technical Field
The present application relates to the field of big data technologies, and in particular, to an information desensitizing method, a storage medium, and a server for big data.
Background
Desensitizing the sensitive information in big data is an important measure for protecting personal privacy and data security. Big data processing may involve sensitive data such as personal identity, financial information and medical records, and desensitization measures can effectively reduce the risk of such data being abused or leaked. Desensitization of sensitive information refers to processing sensitive data to eliminate or replace the sensitive content so that, while remaining usable, the data can no longer directly identify an individual or reveal sensitive information. Common desensitization methods include:
Anonymization: personal identification information, such as names and identification card numbers, is deleted or replaced so that the data cannot be directly associated with a particular individual.
Encryption: the encryption algorithm is used to encrypt the sensitive data, ensuring that only authorized users can decrypt and access the original data.
De-identification: personal identity information is separated from the other attributes so that the identifying information in the data cannot be directly associated with a personal identity.
Data perturbation: randomizing, perturbing or blurring the sensitive data to reduce the recognizability of the sensitive information in the data.
Data masking: generalized symbols or placeholders are used instead of sensitive data to hide the true sensitive information.
Conventional schemes find it difficult to keep desensitized data usable enough for data analysis, mining and similar needs while also ensuring that personal privacy or sensitive information is not revealed. In addition, sensitive information may still be recovered from desensitized data in various ways, such as restoration and association analysis. A more secure data desensitization scheme that protects user privacy is therefore needed.
Disclosure of Invention
The embodiments of the present application aim to provide an information desensitization method for big data, a storage medium and a server, in which a conditional generative adversarial network generates, from an original data set, a desensitized data set that can be used for post-processing (such as data analysis and mining), so that user privacy is protected and the post-processing needs of the data are met.
In order to achieve the above object, embodiments of the present application are realized by:
In a first aspect, an embodiment of the present application provides an information desensitization method for big data, including: acquiring an original data set, wherein the original data set comprises a plurality of pieces of original data; extracting entities, relations and attributes from each piece of original data, and determining the sensitive information in the entities and attributes, wherein the sensitive information is information that exposes user privacy; performing preliminary desensitization on the sensitive information, and generating condition information based on the entities, relations and attributes after the preliminary desensitization, wherein the preliminary desensitization comprises one or more of desensitization replacement, hiding and generalization; and using the generator of a trained conditional generative adversarial network to generate synthetic samples based on the condition information and random vectors, finally obtaining a desensitized data set composed of a plurality of synthetic samples.
With reference to the first aspect, in a first possible implementation manner of the first aspect, the conditional generative adversarial network is constructed as follows: acquiring a training set, wherein the training set comprises a plurality of pieces of training data; extracting entities, relations and attributes from each piece of training data, and determining the sensitive information in the entities and attributes, wherein the sensitive information is information that exposes user privacy; performing consistent preliminary desensitization on the training data and the sensitive information, taking the training data after the preliminary desensitization as real samples, and generating condition information based on the entities, relations and attributes after the preliminary desensitization, wherein the preliminary desensitization comprises one or more of desensitization replacement, hiding and generalization; and constructing a conditional generative adversarial network, generating synthetic samples with its generator based on the condition information and random vectors, distinguishing real samples from synthetic samples with its discriminator based on the condition information, and optimizing the generator and the discriminator by back propagation based on the discrimination result, thereby realizing adversarial training of the generator and the discriminator; the iteration is repeated to finally obtain the trained conditional generative adversarial network.
With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the generator is an LSTM and the discriminator is a Transformer.
With reference to the second possible implementation manner of the first aspect, in a third possible implementation manner of the first aspect, the generator loss function is defined as:
$$L_G = \mathrm{CE}\big(D(G(z, c), c),\ 1\big)$$
wherein $z$ represents a random vector, $c$ represents the condition information, $L_G$ represents the generator loss function, $G(z, c)$ represents the synthetic sample generated by the generator based on the condition information $c$ and the random vector $z$, $D(G(z, c), c)$ represents the probability given by the discriminator for the synthetic sample $G(z, c)$ and the condition information $c$, i.e., the probability that the synthetic sample is judged to be a real sample, and $\mathrm{CE}$ is the cross entropy loss function, which measures the gap between the model output and the true value.
With reference to the third possible implementation manner of the first aspect, in a fourth possible implementation manner of the first aspect, the discriminator loss function is defined as:
$$L_D = L_{real} + L_{fake}$$
$$L_{real} = \mathrm{CE}\big(D(x, c),\ 1\big)$$
$$L_{fake} = \mathrm{CE}\big(D(G(z, c), c),\ 0\big)$$
wherein $L_D$ represents the discriminator loss function, $L_{real}$ represents the loss for a real sample being judged to be a real sample, $L_{fake}$ represents the loss for a generated sample being judged to be a real sample, $x$ represents a real sample, and $D(x, c)$ represents the probability given by the discriminator for the real sample $x$ and the condition information $c$, i.e., the probability that the real sample $x$ is judged to be a real sample.
With reference to the fourth possible implementation manner of the first aspect, in a fifth possible implementation manner of the first aspect, the condition information $c$ satisfies:
$$c = \{e_1, e_2, \ldots, e_n\}$$
wherein $c$ represents the condition information, $n$ represents the total number of entity categories in the condition information, and $e_i$ represents the entity of the $i$-th category;
$$e_i = \{f_i, r_i, a_{i,1}, a_{i,2}, \ldots, a_{i,m_i}\}$$
wherein $f_i$ represents the frequency of occurrence of the entity $e_i$, $r_i$ represents the number of relations that exist between the entity $e_i$ and other entities, $m_i$ represents the total number of attributes of the entity $e_i$, and $a_{i,j}$ represents the $j$-th attribute of the entity $e_i$.
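With reference to the fifth possible implementation manner of the first aspect, in a sixth possible implementation manner of the first aspect, parameters of the generator are updated in the following manner:
$$\theta_G \leftarrow \theta_G - \eta_G \cdot \kappa(\hat{c}, c) \cdot \nabla_{\theta_G} L_G$$
wherein $\theta_G$ represents the generator parameters to be optimized, $\eta_G$ represents the learning rate corresponding to the generator, $\hat{c}$ represents the conditions satisfied in the synthetic sample, $c$ represents the condition information, $\kappa(\hat{c}, c)$ is the coefficient applied to the learning rate, determined from $\hat{c}$ and $c$ so that the fewer conditions the synthetic sample satisfies, the larger the update, and $\nabla_{\theta_G} L_G$ represents the gradient of the generator loss function $L_G$ with respect to the generator parameters $\theta_G$.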
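With reference to the fifth possible implementation manner of the first aspect, in a seventh possible implementation manner of the first aspect, parameters of the discriminator are updated in the following manner:
$$\theta_D \leftarrow \theta_D - \eta_D \cdot \kappa(\hat{c}, c) \cdot \big(\nabla_{\theta_D} L_{real} + \nabla_{\theta_D} L_{fake}\big)$$
wherein $\theta_D$ represents the discriminator parameters to be optimized, $\eta_D$ represents the learning rate corresponding to the discriminator, $\nabla_{\theta_D} L_{real}$ represents the gradient of the loss function $L_{real}$ in the discriminator with respect to the discriminator parameters $\theta_D$, $\nabla_{\theta_D} L_{fake}$ represents the gradient of the loss function $L_{fake}$ in the discriminator with respect to the discriminator parameters $\theta_D$, $\hat{c}$ represents the conditions satisfied in the synthetic sample, $c$ represents the condition information, and $\kappa(\hat{c}, c)$ is the coefficient applied to the learning rate, determined from $\hat{c}$ and $c$.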
In a second aspect, an embodiment of the present application provides a storage medium, where the storage medium includes a stored program, where the program when executed controls a device in which the storage medium is located to perform the method for information desensitization of big data according to any one of the first aspect or the possible implementation manners of the first aspect.
In a third aspect, embodiments of the present application provide a server, including a memory for storing information including program instructions, and a processor for controlling execution of the program instructions, which when loaded and executed by the processor, implement the steps of the method for information desensitization of big data according to the first aspect or any one of the possible implementations of the first aspect.
The beneficial effects are that:
1. The scheme adopts a conditional generative adversarial network. Entities, relations and attributes are extracted from each piece of original data in the original data set, and the sensitive information in the entities and attributes is determined; preliminary desensitization is then performed on the sensitive information, and condition information is generated based on the entities, relations and attributes after the preliminary desensitization; the generator of the trained conditional generative adversarial network then generates synthetic samples based on the condition information and random vectors, finally yielding a desensitized data set composed of a plurality of synthetic samples. When the desensitized data set for post-processing is generated, the user's private information first undergoes preliminary desensitization (which preserves the original data structure), and the condition information built from the extracted entities, relations and attributes guides the generation of the synthetic samples. This ensures the quality of the synthetic samples, keeps them very close to the real samples in the original data set, and ensures the reliability of post-processing (data analysis, data mining and the like). Using the desensitized data set effectively eliminates leakage of sensitive information containing user privacy, and because samples are generated only after preliminary desensitization, the threat that association analysis and restoration techniques pose to user privacy is avoided, which safeguards the privacy of user data.
2. When the conditional generative adversarial network is constructed, an LSTM is selected as the generator and a Transformer as the discriminator, so that the network is suitable for more complex text generation tasks. Corresponding loss functions are designed, and the generator and the discriminator are updated by back propagation. When the parameters to be optimized of the generator and the discriminator are updated, the conditions satisfied by the generated samples are taken into account to guide the optimization (the generation capability strengthens gradually as training iterates): the fewer conditions the early generated samples satisfy, the larger the update amplitude (a corresponding parameter is used as a coefficient of the learning rate). This effectively improves training efficiency and accelerates convergence of the model.
In order to make the above objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of constructing the conditional generative adversarial network.
Fig. 2 is a flowchart of a method for desensitizing information of big data according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
The data desensitization technique of this scheme relies on a conditional generative adversarial network, which generates synthetic samples from condition information (derived by extracting entities, relations and attributes from the original data) and random vectors in order to construct a desensitized data set corresponding to the original data set. The construction of the conditional generative adversarial network is therefore described first.
Referring to FIG. 1, FIG. 1 is a flowchart of constructing the conditional generative adversarial network. In this embodiment, the construction process includes step S11, step S12, step S13 and step S14.
First, step S11 may be performed.
Step S11: acquire a training set, wherein the training set comprises a plurality of pieces of training data.
In this embodiment, a training set containing a plurality of pieces of training data may be acquired. Take government affairs data as an example: one piece of government affairs data contains the handling process of a government affairs task, for example a government affairs item described in text.
After the training set is obtained, step S12 may be performed.
Step S12: extract entities, relations and attributes from each piece of training data, and determine the sensitive information in the entities and attributes, wherein the sensitive information is information that exposes user privacy.
Entities, relations and attributes can be extracted from each piece of training data. For example, the extracted entities may include a user's name, the government affair handled and the documents submitted; the attributes may include an identification card number, a mobile phone number, a bank card number and the category of a document (e.g., paper or electronic); and the relations characterize the relationships between different entities, such as A handles a government affair, B is the handler for A, C is the superior of B, and D is the affiliated unit or department.
After the entities, relations and attributes are extracted, the sensitive information in the entities and attributes can be determined, for example names, identification card numbers, mobile phone numbers, bank card numbers, addresses, job positions and other information that can expose user privacy.
After the sensitive information is determined, step S13 may be further performed.
Step S13: perform consistent preliminary desensitization on the training data and the sensitive information, take the training data after the preliminary desensitization as real samples, and generate condition information based on the entities, relations and attributes after the preliminary desensitization, wherein the preliminary desensitization comprises one or more of desensitization replacement, hiding and generalization.
In this embodiment, consistent preliminary desensitization may be performed on the training data and the sensitive information, the preliminary desensitization including one or more of desensitization replacement, hiding and generalization. For example, names may undergo desensitization replacement (e.g., replacement with pseudonyms, where different people keep different pseudonyms so that the data content stays stable); bank card numbers, identification card numbers, mobile phone numbers and the like may be hidden or replaced (e.g., a continuous run of digits, such as 8 consecutive digits, is replaced with 0 or a placeholder symbol); and sensitive information of certain data classes (e.g., salary, consumption, time nodes, specific locations) may be desensitized by generalization.
In this way, preliminary desensitization of the training data and of the sensitive information (in the extracted entities, relations and attributes) is achieved.
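The three preliminary desensitization operations described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the method of the application: the function names, the salted-hash pseudonym scheme, the `*` fill character, the 11-digit threshold and the 5000-unit salary bucket are all hypothetical choices made for the example.

```python
import hashlib
import re


def pseudonymize(name: str, salt: str = "demo-salt") -> str:
    """Map each distinct name to a stable pseudonym (same name, same pseudonym)."""
    digest = hashlib.sha256((salt + name).encode("utf-8")).hexdigest()
    return "Person_" + digest[:8]


def mask_digits(number: str, start: int = 3, length: int = 8, fill: str = "*") -> str:
    """Hide `length` consecutive characters of an identifier, starting at `start`."""
    end = min(len(number), start + length)
    return number[:start] + fill * (end - start) + number[end:]


def generalize_salary(amount: float, bucket: int = 5000) -> str:
    """Replace an exact amount with the range (bucket) it falls into."""
    low = int(amount // bucket) * bucket
    return f"{low}-{low + bucket}"


def desensitize(text: str, names: list[str]) -> str:
    """Apply pseudonym replacement and digit masking to free text."""
    for name in names:
        text = text.replace(name, pseudonymize(name))
    # Assumption: any run of 11 or more digits is a phone, ID or card number.
    return re.sub(r"\d{11,}", lambda m: mask_digits(m.group()), text)
```

Because the pseudonym depends only on the name and the salt, the same person receives the same pseudonym in every record, which is what keeps the desensitization consistent between the training data and the extracted entities.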
The training data after the preliminary desensitization is taken as the real samples (for training), and the information after the preliminary desensitization, namely the entities (including their occurrence frequency), the relations (including the number of relations between each entity and other entities) and the attributes (including the content and number of the attributes), forms the condition information.
The condition information $c$ satisfies:
$$c = \{e_1, e_2, \ldots, e_n\}, \tag{1}$$
wherein $c$ represents the condition information, $n$ represents the total number of entity categories in the condition information, and $e_i$ represents the entity of the $i$-th category.
$$e_i = \{f_i, r_i, a_{i,1}, a_{i,2}, \ldots, a_{i,m_i}\}, \tag{2}$$
wherein $f_i$ represents the frequency of occurrence of the entity $e_i$, $r_i$ represents the number of relations that exist between the entity $e_i$ and other entities, $m_i$ represents the total number of attributes of the entity $e_i$, and $a_{i,j}$ represents the $j$-th attribute of the entity $e_i$. For finer-grained application scenarios, a corresponding weight can be given to each item in the condition information to adapt it to the scenario, which is not described in detail here.
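One possible in-memory form of the condition information is sketched below in Python. It assumes that each distinct desensitized entity is treated as its own category and that relations arrive as (head, relation, tail) triples; the `EntityCondition` class and the `build_condition` function are hypothetical names introduced only for this illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class EntityCondition:
    """One entry of the condition information (one entity category)."""
    category: str
    frequency: int = 0                               # occurrences of the entity
    relation_count: int = 0                          # relations with other entities
    attributes: list = field(default_factory=list)   # attribute values of the entity


def build_condition(entities, relations, attributes):
    """Build the condition information for one desensitized record.

    entities:   list of entity mentions (after preliminary desensitization)
    relations:  list of (head, relation, tail) triples
    attributes: dict mapping an entity to its list of attribute values
    """
    freq = Counter(entities)
    rel = Counter()
    for head, _, tail in relations:
        rel[head] += 1
        rel[tail] += 1
    return [
        EntityCondition(e, freq[e], rel[e], list(attributes.get(e, [])))
        for e in sorted(freq)
    ]
```

The attribute total of an entity is simply `len(entry.attributes)`, so it is not stored separately in this sketch.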
After determining the real sample and the condition information corresponding to each piece of training data, step S14 may be performed.
Step S14: construct a conditional generative adversarial network, generate synthetic samples with its generator based on the condition information and random vectors, distinguish real samples from synthetic samples with its discriminator based on the condition information, and optimize the generator and the discriminator by back propagation based on the discrimination result, thereby realizing adversarial training of the generator and the discriminator; repeat the iteration to finally obtain the trained conditional generative adversarial network.
In this embodiment, a conditional generative adversarial network may be constructed in which the generator is an LSTM (long short-term memory) model and the discriminator is a Transformer.
The generator loss function is defined as:
$$L_G = \mathrm{CE}\big(D(G(z, c), c),\ 1\big), \tag{3}$$
wherein $z$ represents a random vector, $c$ represents the condition information, $L_G$ represents the generator loss function, $G(z, c)$ represents the synthetic sample generated by the generator based on the condition information $c$ and the random vector $z$, $D(G(z, c), c)$ represents the probability given by the discriminator for the synthetic sample $G(z, c)$ and the condition information $c$, i.e., the probability that the synthetic sample is judged to be a real sample, and $\mathrm{CE}$ is the cross entropy loss function, which measures the gap between the model output and the true value.
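As a worked expansion, write $z$ for the random vector, $c$ for the condition information, $G$ for the generator and $D$ for the discriminator. Assuming the cross entropy is the usual binary cross entropy and that the generator's target label for its own samples is 1 (both are assumptions of this illustration), the generator loss reduces to a single logarithm:

```latex
\begin{aligned}
\mathrm{CE}(p, y) &= -\big[\, y \log p + (1 - y)\log(1 - p) \,\big], \\
L_G = \mathrm{CE}\big(D(G(z, c), c),\, 1\big) &= -\log D\big(G(z, c), c\big).
\end{aligned}
```

Under this reading, the generator loss falls toward 0 as the discriminator's probability for the synthetic sample approaches 1, and grows without bound as that probability approaches 0.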
The discriminator loss function is defined as:
$$L_D = L_{real} + L_{fake}, \tag{4}$$
$$L_{real} = \mathrm{CE}\big(D(x, c),\ 1\big), \tag{5}$$
$$L_{fake} = \mathrm{CE}\big(D(G(z, c), c),\ 0\big), \tag{6}$$
wherein $L_D$ represents the discriminator loss function, $L_{real}$ represents the loss for a real sample being judged to be a real sample, $L_{fake}$ represents the loss for a generated sample being judged to be a real sample, $x$ represents a real sample, and $D(x, c)$ represents the probability given by the discriminator for the real sample $x$ and the condition information $c$, i.e., the probability that the real sample $x$ is judged to be a real sample.
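The two losses can be checked numerically with the short Python sketch below. It assumes binary cross entropy with label 1 for real samples and label 0 for synthetic samples in the discriminator, and label 1 as the generator's target; `d_real` and `d_fake` are hypothetical names for the discriminator's probabilities on a real sample and on a synthetic sample under the same condition information.

```python
import math


def bce(p: float, target: float, eps: float = 1e-12) -> float:
    """Binary cross entropy between a predicted probability and a 0/1 target."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def generator_loss(d_fake: float) -> float:
    """Generator loss: the synthetic sample should be judged real (target 1)."""
    return bce(d_fake, 1.0)


def discriminator_loss(d_real: float, d_fake: float) -> float:
    """Discriminator loss: real samples toward 1, synthetic samples toward 0."""
    loss_real = bce(d_real, 1.0)
    loss_fake = bce(d_fake, 0.0)
    return loss_real + loss_fake
```

The two objectives pull in opposite directions on `d_fake`: raising it lowers the generator loss and raises the discriminator loss, which is the adversarial relationship the training relies on.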
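Then, the parameters of the generator may be updated in the following manner:
$$\theta_G \leftarrow \theta_G - \eta_G \cdot \kappa(\hat{c}, c) \cdot \nabla_{\theta_G} L_G, \tag{7}$$
wherein $\theta_G$ represents the generator parameters to be optimized, $\eta_G$ represents the learning rate corresponding to the generator, $\hat{c}$ represents the conditions satisfied in the synthetic sample, $c$ represents the condition information, $\kappa(\hat{c}, c)$ is the coefficient applied to the learning rate, determined from $\hat{c}$ and $c$ so that the fewer conditions the synthetic sample satisfies, the larger the update, and $\nabla_{\theta_G} L_G$ represents the gradient of the generator loss function $L_G$ with respect to the generator parameters $\theta_G$.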
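The parameters of the discriminator can be updated in the following manner:
$$\theta_D \leftarrow \theta_D - \eta_D \cdot \kappa(\hat{c}, c) \cdot \big(\nabla_{\theta_D} L_{real} + \nabla_{\theta_D} L_{fake}\big), \tag{8}$$
wherein $\theta_D$ represents the discriminator parameters to be optimized, $\eta_D$ represents the learning rate corresponding to the discriminator, $\nabla_{\theta_D} L_{real}$ represents the gradient of the loss function $L_{real}$ in the discriminator with respect to the discriminator parameters $\theta_D$, $\nabla_{\theta_D} L_{fake}$ represents the gradient of the loss function $L_{fake}$ in the discriminator with respect to the discriminator parameters $\theta_D$, $\hat{c}$ represents the conditions satisfied in the synthetic sample (i.e., the degree to which the synthetic sample conforms to the condition information $c$), $c$ represents the condition information, and $\kappa(\hat{c}, c)$ is the coefficient applied to the learning rate, determined from $\hat{c}$ and $c$.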
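The idea of scaling the learning rate by how well the synthetic sample meets its conditions can be sketched in Python as below. The exact coefficient is not reproduced here, so the expression `2.0 - s` is purely a hypothetical stand-in chosen so that a lower satisfaction degree gives a larger step; the substring-based `satisfaction` measure and both function names are likewise assumptions of this illustration.

```python
def satisfaction(sample_text: str, required_terms: list[str]) -> float:
    """Fraction of required condition terms that appear in the synthetic sample."""
    if not required_terms:
        return 1.0
    hits = sum(term in sample_text for term in required_terms)
    return hits / len(required_terms)


def scaled_step(params, grads, lr, s):
    """One gradient-descent step whose size grows as the satisfaction s falls."""
    coef = 2.0 - s  # hypothetical coefficient: 2 when s == 0, 1 when s == 1
    return [p - lr * coef * g for p, g in zip(params, grads)]
```

With this stand-in, a sample that meets none of its conditions doubles the effective learning rate, while a sample that meets all of them leaves it unchanged, so early training takes larger steps and later training settles.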
After the network structure of the conditional generative adversarial network is constructed, the generator generates synthetic samples based on the condition information and random vectors, the discriminator distinguishes real samples from synthetic samples based on the condition information, and the generator and the discriminator are optimized by back propagation based on the discrimination result (namely, the probability that a given sample is a real sample). This realizes adversarial training of the generator and the discriminator; the iteration is repeated until the trained conditional generative adversarial network is obtained.
It should be noted that, in actual training, the generator and the discriminator need to be trained alternately in cycles to improve the training effect. Existing training strategies for conditional generative adversarial networks may be followed, which are not described here.
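The alternating schedule can be sketched as a small Python driver. This is only a skeleton under stated assumptions: `d_step` and `g_step` are hypothetical callables that each perform one update of the discriminator or the generator while the other network is held fixed, and the number of steps per round is a tunable choice rather than a value taken from the application.

```python
def train_alternating(d_step, g_step, iterations, d_steps_per_iter=1, g_steps_per_iter=1):
    """Alternate discriminator and generator updates; return the update order."""
    order = []
    for _ in range(iterations):
        for _ in range(d_steps_per_iter):
            d_step()            # update the discriminator, generator held fixed
            order.append("D")
        for _ in range(g_steps_per_iter):
            g_step()            # update the generator, discriminator held fixed
            order.append("G")
    return order
```

For example, `train_alternating(d_step, g_step, iterations=1000, d_steps_per_iter=2)` would give the discriminator two updates for every generator update.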
After the conditional generative adversarial network is trained, its generator can be deployed in a server as the core component and used to generate synthetic samples based on the condition information and random vectors so as to obtain the desensitized data set, as described in detail below.
Referring to fig. 2, fig. 2 is a flowchart of a method for desensitizing information of big data according to an embodiment of the present application. In the present embodiment, the information desensitization method of big data may include step S21, step S22, step S23, and step S24.
First, the server may run step S21.
Step S21: an original data set is obtained, wherein the original data set contains a plurality of pieces of original data.
In this embodiment, the server may acquire an original data set, where the original data set includes a plurality of pieces of original data.
After obtaining the original data set, the server may run step S22.
Step S22: extract entities, relations and attributes from each piece of original data, and determine the sensitive information in the entities and attributes, wherein the sensitive information is information that exposes user privacy.
For each piece of original data:
The server may extract entities, relations and attributes from the original data; the extraction may be implemented with BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), Seq2Seq, Transformer or similar models, which are not described here. Based on the extracted entities, relations and attributes, the server then determines the sensitive information (found mainly in the entities and attributes), for example with a rule-based identification scheme (rules are set to decide which entities and attributes are sensitive, such as names, identification card numbers, mobile phone numbers, bank card numbers and addresses) or with a deep learning model that classifies the extracted entities and attributes as sensitive or not; no limitation is imposed here.
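A rule-based identification scheme of the kind mentioned above might look like the following Python sketch. The rule table, the entity type labels and the function name are hypothetical; the patterns assume 18-character resident identification numbers, 11-digit mobile numbers starting with 1 and 16 to 19 digit bank card numbers, and would need tuning for real data.

```python
import re

# Hypothetical rules: attribute type -> pattern over the raw attribute value.
RULES = {
    "id_card": re.compile(r"^\d{17}[\dXx]$"),   # 18-character resident ID number
    "mobile": re.compile(r"^1\d{10}$"),         # 11-digit mobile number
    "bank_card": re.compile(r"^\d{16,19}$"),    # 16 to 19 digit card number
}
# Hypothetical entity types that are always treated as sensitive.
SENSITIVE_ENTITY_TYPES = {"name", "address"}


def find_sensitive(entities, attributes):
    """Return the sensitive items among extracted entities and attributes.

    entities:   list of (text, entity_type) pairs
    attributes: list of (owner, value) pairs
    """
    found = [(text, etype) for text, etype in entities
             if etype in SENSITIVE_ENTITY_TYPES]
    for _, value in attributes:
        for label, pattern in RULES.items():
            if pattern.match(value):
                found.append((value, label))
                break  # first matching rule wins
    return found
```

Rules are checked in insertion order, so an 18-digit value is labelled as an identification card number before the broader bank card pattern is tried.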
After determining the sensitive information, the server may run step S23.
Step S23: perform preliminary desensitization on the sensitive information, and generate condition information based on the entities, relations and attributes after the preliminary desensitization, wherein the preliminary desensitization comprises one or more of desensitization replacement, hiding and generalization.
In this embodiment, the server may perform preliminary desensitization on the sensitive information. For example, names may be desensitized by replacement (e.g., with pseudonyms, where different people are consistently given different pseudonyms so that the data content remains stable). Bank card numbers, ID card numbers, mobile phone numbers, and the like may be hidden or substituted (e.g., by replacing a consecutive run of digits, such as 8 consecutive digits, with 0 or a similar placeholder character). Some data-type sensitive information (e.g., salary, consumption, time points, specific locations) may be desensitized by generalization.
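The three preliminary desensitization operations can be sketched in Python as follows. The helper names (`pseudonymize`, `mask_digits`, `generalize_amount`), the salted-hash pseudonym format, the mask position, and the bucket width are all assumptions made for illustration; the application does not prescribe them.

```python
import hashlib

def pseudonymize(name, salt="demo-salt"):
    """Replacement: map each name to a stable pseudonym (same input, same output)."""
    digest = hashlib.sha256((salt + name).encode("utf-8")).hexdigest()
    return "User_" + digest[:8]

def mask_digits(number, start=3, length=8, fill="0"):
    """Hiding: overwrite `length` consecutive characters beginning at `start`."""
    end = min(start + length, len(number))
    return number[:start] + fill * (end - start) + number[end:]

def generalize_amount(value, width=5000):
    """Generalization: replace an exact amount with the bucket that contains it."""
    low = (value // width) * width
    return f"{low}-{low + width}"

print(pseudonymize("Zhang San") == pseudonymize("Zhang San"))  # stable mapping
print(mask_digits("13812345678"))   # 13800000000
print(generalize_amount(12300))     # 10000-15000
```

Because the pseudonym depends only on the salted name, the same person receives the same pseudonym across records, which is one way to keep the data content stable as described above.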
Then, based on the entities, relations and attributes after the preliminary desensitization, the condition information corresponding to this piece of original data can be generated. The same processing is applied to every piece of original data in the original data set to obtain its corresponding condition information.
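One possible layout for the condition information is sketched below in Python. It records the quantities that claim 6 lists for the condition information (the number of entity categories and, per entity, its frequency of occurrence, its number of relations with other entities, and its attributes). The function name `build_condition`, the dictionary layout, and the input formats are assumptions; the application does not fix a concrete encoding in this text.

```python
from collections import Counter

def build_condition(entities, relations, attributes):
    """Assemble condition information from desensitized extraction results.

    entities:   list of (category, entity) mentions
    relations:  list of (head, relation, tail) triples
    attributes: {entity: [attribute, ...]}
    """
    freq = Counter(entity for _, entity in entities)
    rel_count = Counter()
    for head, _, tail in relations:
        rel_count[head] += 1
        rel_count[tail] += 1
    categories = {}
    for category, entity in entities:
        categories.setdefault(category, {})[entity] = {
            "frequency": freq[entity],
            "relation_count": rel_count[entity],
            "attributes": attributes.get(entity, []),
        }
    return {"category_count": len(categories), "categories": categories}

entities = [("person", "User_1a2b"), ("org", "Bank A"), ("person", "User_1a2b")]
relations = [("User_1a2b", "customer_of", "Bank A")]
attributes = {"User_1a2b": ["salary: 10000-15000"]}
condition = build_condition(entities, relations, attributes)
print(condition["category_count"])  # 2
```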
After that (that is, as soon as the condition information of one piece of original data has been obtained; it is not necessary to wait until the condition information of all the original data has been obtained), the server may execute step S24.
Step S24: using the generator of the trained conditional generative adversarial network, generate synthetic samples based on the condition information and a random vector, finally obtaining a desensitization data set consisting of a plurality of synthetic samples.
In this embodiment, the server may use the generator of the trained conditional generative adversarial network to generate synthetic samples based on the condition information and a random vector, finally obtaining a desensitization data set consisting of a plurality of synthetic samples (each synthetic sample corresponds to one piece of original data).
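Step S24 amounts to a loop that pairs each piece of condition information with a fresh random vector and passes both to the generator. The Python sketch below shows only that control flow. `stub_generator` is a placeholder standing in for the trained generator, and the Gaussian noise, the noise dimension, and the function names are assumptions.

```python
import random

def desensitize_dataset(conditions, generator, noise_dim=16, seed=0):
    """Produce one synthetic sample per condition using generator(condition, z)."""
    rng = random.Random(seed)
    dataset = []
    for condition in conditions:
        # Draw a fresh random vector for every piece of condition information.
        z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
        dataset.append(generator(condition, z))
    return dataset

def stub_generator(condition, z):
    """Placeholder for the trained generator; returns a tagged dummy record."""
    return {"condition": condition, "noise_dim": len(z)}

conditions = [{"category_count": 2}, {"category_count": 1}]
synthetic = desensitize_dataset(conditions, stub_generator)
print(len(synthetic))  # 2, one synthetic sample per piece of original data
```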
The present embodiment also provides a storage medium including a stored program, wherein, when the program runs, the device in which the storage medium is located is controlled to execute the information desensitization method for big data of this embodiment.
The embodiment of the application also provides a server, which comprises a memory and a processor, wherein the memory is used for storing information comprising program instructions, and the processor is used for controlling the execution of the program instructions, and the program instructions realize the steps of the information desensitizing method of big data in the embodiment when being loaded and executed by the processor.
In summary, the embodiments of the present application provide an information desensitization method for big data, a storage medium, and a server. Following a conditional generative adversarial network approach, entities, relations and attributes are extracted from each piece of original data in the original data set, and the sensitive information among the entities and attributes is determined; preliminary desensitization is then performed on the sensitive information, and condition information is generated based on the entities, relations and attributes after the preliminary desensitization; finally, the generator of the trained conditional generative adversarial network generates synthetic samples based on the condition information and a random vector, yielding a desensitization data set consisting of a plurality of synthetic samples. When the desensitization data set for downstream processing is generated, the private information of the user is first subjected to preliminary desensitization (which preserves the original data structure), and the condition information generated from the extracted entities, relations and attributes guides the generation of the synthetic samples. This helps ensure the quality of the synthetic samples, keeping them very close to the real samples in the original data set, and thus supports the reliability of downstream processing (data analysis, data mining, and the like). Using the desensitization data set effectively prevents leakage of sensitive information containing user privacy, and because sample generation takes place after the preliminary desensitization, the threat that association analysis and restoration techniques pose to user privacy is avoided, safeguarding the privacy of user data.
When the conditional generative adversarial network is constructed, an LSTM is selected as the generator and a Transformer as the discriminator, so that the network is suited to more complex text generation tasks. Corresponding loss functions are designed, and the generator and the discriminator are updated by back propagation. When the parameters to be optimized of the generator and the discriminator are updated, the generation capability reflected in the generated samples is taken into account (this capability strengthens gradually over repeated training iterations): the degree to which the generated samples satisfy the conditions is used to guide the optimization of the relevant parameters. The fewer conditions the early generated samples satisfy, the larger the update amplitude (a corresponding parameter is used as a coefficient of the learning rate), which can effectively improve training efficiency and speed up convergence of the model.
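The loss design and the condition-weighted update can be sketched in Python under stated assumptions. The cross-entropy losses below follow the standard conditional GAN form, which is consistent with the term definitions in claims 4 and 5, but the formulas themselves are not reproduced in this text. The form of `condition_coefficient` is purely a hypothetical choice that grows as fewer conditions are satisfied; the application's actual coefficient may differ.

```python
import math

def bce(prob, target, eps=1e-12):
    """Binary cross-entropy between a predicted probability and a 0/1 target."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1 - target) * math.log(1.0 - prob))

def generator_loss(d_fake):
    """The generator wants synthetic samples to be scored as real (target 1)."""
    return bce(d_fake, 1)

def discriminator_loss(d_real, d_fake):
    """Real-sample term (target 1) plus synthetic-sample term (target 0)."""
    return bce(d_real, 1) + bce(d_fake, 0)

def condition_coefficient(satisfied, total):
    """Assumed coefficient: larger when fewer conditions are satisfied."""
    return 1.0 + (1.0 - satisfied / total)

def scaled_step(params, grads, lr, satisfied, total):
    """Gradient-descent step with the learning rate scaled by the coefficient."""
    k = condition_coefficient(satisfied, total)
    return [p - lr * k * g for p, g in zip(params, grads)]

# Early in training (0 of 4 conditions met) the step is larger than
# later on (4 of 4 conditions met), for the same gradient.
early = scaled_step([1.0], [0.5], lr=0.1, satisfied=0, total=4)
late = scaled_step([1.0], [0.5], lr=0.1, satisfied=4, total=4)
print(early[0] < late[0])  # True: the early update moves the parameter further
```

In a real implementation the gradients would come from back propagation through the LSTM generator and the Transformer discriminator; plain lists are used here only to keep the sketch self-contained.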
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application, and various modifications and variations may be suggested to one skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method of desensitizing information of big data, comprising:
acquiring an original data set, wherein the original data set comprises a plurality of pieces of original data;
extracting an entity, a relation and an attribute according to each piece of original data, and determining sensitive information in the entity and the attribute, wherein the sensitive information represents information exposing privacy of a user;
performing preliminary desensitization on the sensitive information, and generating condition information based on the entity, the relation and the attribute after the preliminary desensitization, wherein the preliminary desensitization comprises one or more of desensitization replacement, hiding and generalization;
and generating, by using the generator of the trained conditional generative adversarial network, a synthetic sample based on the condition information and a random vector, and finally obtaining a desensitization data set consisting of a plurality of synthetic samples.
2. The method for desensitizing information of big data according to claim 1, wherein the construction process of the conditional generative adversarial network is:
acquiring a training set, wherein the training set comprises a plurality of pieces of training data;
extracting entities, relations and attributes aiming at each piece of training data, and determining sensitive information in the entities and the attributes, wherein the sensitive information represents information exposing privacy of a user;
carrying out consistent preliminary desensitization treatment on the training data and the sensitive information, taking the training data after the preliminary desensitization treatment as a real sample, and generating condition information based on the entity, the relation and the attribute after the preliminary desensitization treatment, wherein the preliminary desensitization comprises one or more of desensitization replacement, hiding and generalization;
constructing a conditional generative adversarial network; generating a synthetic sample by using the generator of the conditional generative adversarial network based on the condition information and a random vector; distinguishing real samples from synthetic samples by using the discriminator of the conditional generative adversarial network based on the condition information; optimizing the generator and the discriminator by back propagation based on the discrimination result, thereby realizing adversarial training of the generator and the discriminator; and iterating repeatedly to finally obtain the trained conditional generative adversarial network.
3. The method of claim 2, wherein the generator is an LSTM and the discriminator is a Transformer.
4. A method of desensitizing information for big data according to claim 3, wherein the generator loss function is defined as:
wherein the terms of the formula denote, respectively: a random vector; the condition information; the generator loss function; the synthetic sample generated by the generator based on the condition information and the random vector; the probability output by the discriminator given the synthetic sample and the condition information, i.e., the probability that the synthetic sample is judged to be a real sample; and a cross-entropy loss function, which measures the gap between the model output and the true value.
5. The method of information desensitization of big data according to claim 4, wherein the discriminator loss function is defined as:
wherein the terms of the formula denote, respectively: the discriminator loss function; the loss function for a real sample being judged to be a real sample; the loss function for a generated sample being judged to be a real sample; a real sample; and the probability output by the discriminator given the real sample and the condition information, i.e., the probability that the real sample is judged to be a real sample.
6. The method for information desensitization of big data according to claim 5, wherein the condition information satisfies:
wherein the terms of the formula denote, respectively: the condition information; the total number of entity categories in the condition information; and the entity of each individual category;
wherein, for each such entity, the further terms denote, respectively: the frequency of occurrence of the entity; the number of relations the entity has with other entities; the total number of attributes of the entity; and the individual attributes of the entity.
7. The method of information desensitization of big data according to claim 6, wherein the parameters of the generator are updated by:
wherein the terms of the formula denote, respectively: the generator parameters to be optimized; the learning rate of the generator; the conditions satisfied in the synthetic sample; the condition information; and the gradient of the generator loss function with respect to the generator parameters.
8. The method for desensitizing information of big data according to claim 6, wherein parameters of the discriminator are updated by:
wherein the terms of the formula denote, respectively: the discriminator parameters to be optimized; the learning rate of the discriminator; the gradient of a first loss function in the discriminator with respect to the discriminator parameters; the gradient of a second loss function in the discriminator with respect to the discriminator parameters; the conditions satisfied in the synthetic sample; and the condition information.
9. A storage medium comprising a stored program, wherein the program, when run, controls a device in which the storage medium is located to perform the method of information desensitization of big data according to any one of claims 1-8.
10. A server comprising a memory for storing information including program instructions and a processor for controlling execution of the program instructions, characterized by: the program instructions, when loaded and executed by a processor, implement the steps of the method for information desensitization of big data according to any of claims 1-8.
CN202311574400.9A 2023-11-23 2023-11-23 Information desensitization method for big data, storage medium and server Active CN117290888B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311574400.9A CN117290888B (en) 2023-11-23 2023-11-23 Information desensitization method for big data, storage medium and server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311574400.9A CN117290888B (en) 2023-11-23 2023-11-23 Information desensitization method for big data, storage medium and server

Publications (2)

Publication Number Publication Date
CN117290888A true CN117290888A (en) 2023-12-26
CN117290888B CN117290888B (en) 2024-02-09

Family

ID=89248375

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311574400.9A Active CN117290888B (en) 2023-11-23 2023-11-23 Information desensitization method for big data, storage medium and server

Country Status (1)

Country Link
CN (1) CN117290888B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117592114A (en) * 2024-01-19 2024-02-23 中国电子科技集团公司第三十研究所 Network parallel simulation oriented data desensitization method, system and readable storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107480549A (en) * 2017-06-28 2017-12-15 银江股份有限公司 A kind of shared sensitive information desensitization method of data-oriented and system
CN108805833A (en) * 2018-05-29 2018-11-13 西安理工大学 Miscellaneous minimizing technology of copybook binaryzation ambient noise of network is fought based on condition
CN111079174A (en) * 2019-11-21 2020-04-28 中国电力科学研究院有限公司 Power consumption data desensitization method and system based on anonymization and differential privacy technology
CN111563275A (en) * 2020-07-14 2020-08-21 中国人民解放军国防科技大学 Data desensitization method based on generation countermeasure network
CN114357519A (en) * 2022-01-07 2022-04-15 支付宝(杭州)信息技术有限公司 Data desensitization method and system
CN114513337A (en) * 2022-01-20 2022-05-17 电子科技大学 Privacy protection link prediction method and system based on mail data
CN115374899A (en) * 2021-05-19 2022-11-22 富泰华工业(深圳)有限公司 Optimization method for generation countermeasure network and electronic equipment
WO2023065632A1 (en) * 2021-10-21 2023-04-27 平安科技(深圳)有限公司 Data desensitization method, data desensitization apparatus, device, and storage medium
CN116910806A (en) * 2023-06-30 2023-10-20 紫光云技术有限公司 Log desensitization method and system based on deep learning

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117592114A (en) * 2024-01-19 2024-02-23 中国电子科技集团公司第三十研究所 Network parallel simulation oriented data desensitization method, system and readable storage medium
CN117592114B (en) * 2024-01-19 2024-04-19 中国电子科技集团公司第三十研究所 Network parallel simulation oriented data desensitization method, system and readable storage medium

Also Published As

Publication number Publication date
CN117290888B (en) 2024-02-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant