CN117892358A - Verification method and system for limited data desensitization method - Google Patents
- Publication number
- CN117892358A (application CN202410303595.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- limited
- desensitization
- original
- limited data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
- G06F21/6254—Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
Abstract
The invention provides a verification method and system for a limited data desensitization method, relating to the technical field of limited data, and comprising the following steps: selecting a corresponding desensitization method according to the sensitivity degree and service requirements of the original data, desensitizing the original data, and recording the desensitized data as first limited data; importing the first limited data into an artificial intelligence model as training samples for model training, and recording the output content of the model as second limited data once training is finished; comparing the original data with the second limited data to judge whether they are consistent; if consistent, judging that the desensitization of the first limited data has failed; if inconsistent, verifying the second limited data against the original data to judge whether the desensitization of the first limited data succeeded. The method ensures that, after the limited data is desensitized and used to train an artificial intelligence model, no pattern or rule of the sensitive information is revealed to unauthorized users.
Description
Technical Field
The invention relates to the technical field of limited data, in particular to a verification method and a verification system for a limited data desensitization method.
Background
Restricted (limited) data refers to data whose access and use are restricted and protected in some way, including personal identification information, medical records, financial data, and trade secrets; the restrictions are typically tied to the sensitivity, privacy, or security of the data. Its defining characteristic is that the scope of access and use must be limited.
Because training an artificial intelligence model often requires sensitive data, such as personal identity information, for operations like associating multi-source data, limited data is frequently used for training to improve model performance and accuracy. Limited data is also closer to real application scenarios, helping the model better understand and adapt to varied semantic contexts and draw reasonable inferences. Using limited data as training samples lets the model handle similar scenarios better and provide more accurate predictions and decisions.
However, because limited data contains sensitive information, such as personal identities or specific groups, it may face security risks such as hacking and data leakage during model training, data processing, and data transmission, which can expose the data. The limited data is therefore usually desensitized first; yet even after desensitization, once the data has been used as training samples, the trained model may still reveal patterns or rules of the limited data to unauthorized users, and the limited data may also be misused.
These are the shortcomings of the prior art.
Disclosure of Invention
To address these shortcomings, the invention provides a verification method and system for a limited data desensitization method, ensuring that after desensitization the limited data can be trained on and used by an artificial intelligence model without patterns or rules of the limited data being revealed to unauthorized users.
In a first aspect, the present invention provides a verification method for a limited data desensitization method, the method comprising:
s1, selecting a corresponding desensitization method according to the sensitivity degree and service requirement of original data, carrying out desensitization treatment on the original data, and recording the desensitized data as first limited data;
s2, importing the first limited data into an artificial intelligent model as a training sample to perform model training, and recording the output content of the artificial intelligent model as second limited data after model training is finished;
s3, comparing the original data with the second limited data, and judging whether the second limited data is consistent with the original data in content or not;
if the first limited data are consistent, judging that the desensitization of the first limited data fails;
if not, executing the step S4;
and S4, verifying the second limited data based on the original data, and judging whether the first limited data is desensitized successfully.
Further, before step S2 is executed, data security verification is performed on the first limited data, including:
collecting prohibited words and constructing a security verification word bank;
performing keyword retrieval on the first limited data based on the security verification word bank, and judging whether prohibited words exist in the first limited data;
if prohibited words exist in the first limited data, judging that the desensitization of the first limited data fails;
if no prohibited word exists in the first limited data, executing step S2.
Further, in step S4, based on the original data, similarity verification is performed on the second limited data, including structured data similarity verification and unstructured data similarity verification.
Further, structured data similarity verification includes:
extracting structured data in original data, cleaning and preprocessing the structured data, and constructing a plurality of sets according to data attributes;
extracting structured data in the second limited data for cleaning and preprocessing, and constructing a plurality of sets according to data attributes;
the original data and the set with the same data attribute in the second limited data form a group of sets;
according to the Jaccard similarity coefficient formula:calculating Jaccard similarity coefficients of each group of sets;
judging whether the original data structured data set and the second limited data structured data set are associated in each data attribute according to Jaccard similarity coefficients of each group of sets;
when the Jaccard similarity coefficient of the set is larger than the preset Jaccard similarity coefficient, judging that the original data structured data set and the second limited data structured data set are associated under the data attribute;
wherein A represents the set of a data attribute in the structured data of the original data, B represents the set of the same data attribute in the structured data of the second limited data, A ∪ B represents the union of set A and set B, and A ∩ B represents the intersection of set A and set B.
Further, if the Jaccard similarity coefficients of the set groups of all data attributes are smaller than the preset Jaccard similarity coefficient, it is judged that the structured data in the original data and the structured data in the second limited data are not associated;
otherwise, it is judged that the structured data in the original data and the structured data in the second limited data are associated.
Further, unstructured data similarity verification includes:
extracting unstructured data in the original data and unstructured data in the second limited data, and cleaning and preprocessing;
performing word frequency statistics on unstructured data in the processed original data and unstructured data in the second limited data, counting the occurrence times of each word in the two groups of data, and respectively constructing an original data word frequency matrix and a second limited data word frequency matrix;
converting the original data word frequency matrix and the second limited data word frequency matrix into an original data vector and a second limited data vector respectively, wherein each dimension in the data vector represents the word frequency of a word;
calculation formula according to cosine similarityCalculating cosine similarity of the two data vectors;
if the cosine similarity is greater than the preset cosine similarity, judging that the unstructured data in the original data are associated with the unstructured data in the second limited data;
otherwise, judging that the unstructured data in the original data and the unstructured data in the second limited data are not associated;
wherein A represents the original data vector, B represents the second limited data vector, A · B represents the inner product of the two vectors, and |A| and |B| represent the moduli (lengths) of the original data vector and the second limited data vector, respectively.
Further, in step S4, when it is determined that there is an association between the structured data in the original data and the structured data in the second limited data and that there is an association between the unstructured data in the original data and the unstructured data in the second limited data, it is determined that the desensitization of the first limited data fails;
and otherwise, judging that the desensitization of the first limited data is successful.
In a second aspect, the present invention provides a verification system for a limited data desensitization method, the system comprising:
the desensitization processing module is used for selecting a corresponding desensitization method according to the sensitivity degree and the service requirement of the original data, carrying out desensitization processing on the original data, and recording the desensitized data as first limited data;
the model training module is used for importing the first limited data obtained by the desensitization processing module into the artificial intelligent model as a training sample to perform model training, and recording the output content of the artificial intelligent model as second limited data after the model training is finished;
the first desensitization verification module is used for comparing the original data with the second limited data and judging whether the second limited data is consistent with the original data;
and the second desensitization verification module is used for verifying the second limited data based on the original data when the first desensitization verification module verifies that the second limited data is inconsistent with the original data, and judging whether the first limited data is successfully desensitized.
Further, the system further comprises: a data security verification module, used for performing security verification on the first limited data obtained by the desensitization processing module and judging whether the first limited data contains prohibited words.
Further, the second desensitization verification module comprises a structured data similarity verification module and an unstructured data similarity verification module;
the structured data similarity verification module is used for verifying whether the structured data in the original data and the structured data in the second limited data are associated or not;
the unstructured data similarity verification module is used for verifying whether the unstructured data in the original data and the unstructured data in the second limited data are associated or not;
when the structured data similarity verification module verifies that the structured data in the original data and the structured data in the second limited data are associated, and the unstructured data similarity verification module verifies that the unstructured data in the original data and the unstructured data in the second limited data are associated, the second desensitization verification module judges that the first limited data fails to be desensitized;
otherwise, the second desensitization verification module determines that the first restricted data desensitization was successful.
From the above technical scheme, the invention has the following advantages:
the invention provides a method and a system for verifying a desensitization method of limited data, which are characterized in that after desensitization treatment is carried out on the limited data, the desensitized limited data is used as a sample to be imported into an artificial intelligent model for model training, content and data comparison verification is carried out according to the limited data and model output content, whether the output content of the verification model is associated with the limited data or not is judged, so that the problem that the limited data still contains a mode or rule of the limited data after the desensitization treatment is imported into the artificial intelligent model for training and is revealed to an unauthorized user is avoided, and the safety of the limited data and the personal privacy of the user are ensured.
Drawings
In order to more clearly illustrate the technical solutions of the present invention, the drawings that are needed in the description will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a method of one embodiment of the invention;
FIG. 2 is a schematic flow chart of one specific use of the method of one embodiment of the present invention;
fig. 3 is a schematic block diagram of a system of one embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, the present invention provides a verification method for a limited data desensitization method, which includes:
and 110, selecting a corresponding desensitization method according to the sensitivity degree and the service requirement of the original data, carrying out desensitization treatment on the original data, and recording the desensitized data as first limited data.
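The patent does not fix a particular desensitization method for step 110, so as a minimal sketch we assume masking is the method selected; the field names `name` and `phone` and the keep-head/keep-tail rules are illustrative assumptions, not part of the patent:

```python
def mask_middle(s: str, keep_head: int, keep_tail: int) -> str:
    """Replace the middle of s with '*', keeping the given head/tail characters."""
    middle = max(len(s) - keep_head - keep_tail, 0)
    return s[:keep_head] + "*" * middle + s[len(s) - keep_tail:] if keep_tail else s[:keep_head] + "*" * middle

def desensitize(record: dict) -> dict:
    """Hypothetical masking-based desensitization of one record; the field
    names and masking rules here are illustrative, chosen as if by the
    sensitivity/service-requirement selection of step 110."""
    out = dict(record)
    if "phone" in out:  # keep the first 3 and last 2 digits
        out["phone"] = mask_middle(out["phone"], 3, 2)
    if "name" in out:   # keep only the first character
        out["name"] = mask_middle(out["name"], 1, 0)
    return out
```

The masked output would then be recorded as the first limited data and passed on to the security verification of the next step.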
After the desensitization process, data security verification is performed on the first limited data. First, prohibited words are collected and a security verification word bank is built from them; keyword retrieval is then performed on the first limited data against this word bank to judge whether any prohibited words are present. If the first limited data contains prohibited words, it is judged that the desensitization of the first limited data has failed; if it contains none, step 120 is performed.
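The security verification above can be sketched as a simple lexicon lookup; the word-bank contents below are illustrative placeholders, and a real deployment would use a curated prohibited-word list and tokenization rules suited to the language of the data:

```python
# Illustrative security verification word bank (placeholder entries).
PROHIBITED = {"password", "id_card", "secret_key"}

def passes_security_check(text: str) -> bool:
    """Return True only if no prohibited word appears in the desensitized text.
    This naive version tokenizes on whitespace; production code would need
    language-aware segmentation (e.g. for Chinese text)."""
    tokens = set(text.lower().split())
    return not (PROHIBITED & tokens)
```

A record that fails this check is judged a desensitization failure before any model training takes place.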
Step 120, importing the first limited data into the artificial intelligence model as training samples for model training, and recording the output content of the model as second limited data after training is finished.
The artificial intelligence model into which the first limited data is imported as training samples may, at each stage, be a traditional machine learning model such as a decision tree, random forest, support vector machine, or logistic regression; a deep learning model such as a neural network, convolutional neural network, or recurrent neural network; or a pre-trained model such as BERT, RoBERTa, or XLNet.
BERT (Bidirectional Encoder Representations from Transformers) is a bidirectional, Transformer-based pre-trained model proposed by Google. By training on large-scale text data with self-supervised objectives, BERT learns context-dependent representations of words and sentences, providing good feature expressions for various downstream natural language processing tasks. RoBERTa, developed by Facebook AI, is an improvement and extension of BERT; it pre-trains on larger-scale data for longer and with additional training techniques, further improving model performance. XLNet is a Transformer-based pre-trained model developed by CMU together with Google Brain. Unlike traditional autoregressive pre-training models, XLNet uses a permutation language model objective, so the model can take all possible factorization orders of the context into account during prediction and thereby better capture relationships between words.
It should be noted that the artificial intelligence models into which the first limited data may be imported as training samples in the present invention include, but are not limited to, those listed in the above embodiments.
Step 130, comparing the original data with the second limited data, and judging whether the second limited data is consistent with the original data; if consistent, judging that the desensitization of the first limited data fails; if not, performing step 140.
In step 130, comparing all data of the original data with all data of the second limited data, and if the original data content is completely consistent with the second limited data content, determining that the desensitization of the first limited data fails; if the original data content is not completely consistent with the second limited data content, further verifying the second limited data.
Step 140, verifying the second limited data based on the original data, and judging whether the first limited data was desensitized successfully.
Similarity verification is performed on the second limited data based on the original data, comprising structured data similarity verification and unstructured data similarity verification. According to its organization and characteristics, data may be divided into structured and unstructured data. Structured data is organized in a predefined pattern and format, with explicit data types and relationships, and is typically stored as tables, rows, or in relational databases. Unstructured data has no explicit structure or format; it is highly diverse, complex, and irregular, and exists as text, images, audio, video, and the like.
In this embodiment, the structured data similarity verification includes:
and respectively extracting structured data in the original data and structured data in the second limited data by adopting a regular expression matching method, a natural language processing technology or a database query language method, and performing arrangement and cleaning, including removing repeated values, processing missing values, processing abnormal values and the like.
The structured data in the processed original data and in the second limited data are each organized into a number of sets according to data attributes. The sets with the same data attribute in the original data and the second limited data form a group, and the Jaccard similarity coefficient of each group is then calculated according to the formula J(A, B) = |A ∩ B| / |A ∪ B|.
And judging whether the original data structured data set and the second limited data structured data set are associated in each data attribute according to the Jaccard similarity coefficient of each group of sets. When the Jaccard similarity coefficient of the set is larger than the preset Jaccard similarity coefficient, judging that the original data structure data set and the second limited data structure data set are associated under the data attribute.
In this embodiment, only when Jaccard similarity coefficients of the set group of all data attributes in the original data structured data and the second limited data structured data are smaller than preset Jaccard similarity coefficients, it is determined that no association exists between the structured data in the original data and the structured data in the second limited data; otherwise, judging that the structured data in the original data and the structured data in the second limited data are associated.
Wherein A represents the set of a data attribute in the structured data of the original data, B represents the set of the same data attribute in the structured data of the second limited data, A ∪ B represents the union of set A and set B, and A ∩ B represents the intersection of set A and set B.
It should be noted that, a person skilled in the art may set the value of the preset Jaccard similarity coefficient according to actual needs.
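The Jaccard verification of this embodiment can be sketched as follows; the threshold value of 0.5 stands in for the preset Jaccard similarity coefficient, which the patent leaves for the practitioner to set:

```python
def jaccard(a: set, b: set) -> float:
    """J(A, B) = |A ∩ B| / |A ∪ B|; returns 0.0 for two empty sets by convention."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def attributes_associated(orig: dict, second: dict, threshold: float = 0.5) -> dict:
    """For each data attribute, compare the original-data set against the
    second-limited-data set with the same attribute; an attribute counts as
    associated when its Jaccard coefficient exceeds the (assumed) threshold."""
    return {attr: jaccard(orig[attr], second.get(attr, set())) > threshold
            for attr in orig}
```

Per the embodiment, only if every attribute's coefficient stays below the threshold are the two structured data sets judged unassociated.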
In this embodiment, unstructured data similarity verification includes:
the unstructured data in the original data and the unstructured data in the second limited data are respectively extracted by adopting the technologies of text classification, emotion analysis, keyword extraction and the like or using natural language processing technology, and are arranged and cleaned, including repeated value removal, missing value processing, abnormal value processing and the like. And carrying out word frequency statistics on unstructured data in the processed original data and unstructured data in the second limited data, counting the occurrence times of each word in the two groups of data, respectively constructing an original data word frequency matrix and a second limited data word frequency matrix, and respectively converting the original data word frequency matrix and the second limited data word frequency matrix into an original data vector and a second limited data vector, wherein each dimension in the data vector represents the word frequency of one word. Calculation formula according to cosine similarityCosine similarity of the two data vectors is calculated.
If the cosine similarity is greater than the preset cosine similarity, judging that the unstructured data in the original data are associated with the unstructured data in the second limited data; otherwise, judging that the unstructured data in the original data and the unstructured data in the second limited data are not associated.
Wherein A represents the original data vector, B represents the second limited data vector, A · B represents the inner product of the two vectors, and |A| and |B| represent the moduli (lengths) of the original data vector and the second limited data vector, respectively.
It should be noted that, a person skilled in the art may set the value of the preset cosine similarity according to actual needs.
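The word-frequency vectors and cosine similarity described above can be sketched with the standard library; whitespace tokenization is an assumption for brevity (real text would first need proper segmentation and cleaning):

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Build word-frequency vectors over the union vocabulary and return
    cos θ = (A · B) / (|A| |B|); 0.0 when either vector is all zeros."""
    fa = Counter(text_a.lower().split())
    fb = Counter(text_b.lower().split())
    vocab = set(fa) | set(fb)
    dot = sum(fa[w] * fb[w] for w in vocab)          # inner product A · B
    norm_a = math.sqrt(sum(v * v for v in fa.values()))  # |A|
    norm_b = math.sqrt(sum(v * v for v in fb.values()))  # |B|
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

The result is then compared against the preset cosine similarity to decide whether the unstructured data are associated.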
In this embodiment, when it is determined that there is an association between structured data in the original data and structured data in the second limited data and an association between unstructured data in the original data and unstructured data in the second limited data, it is determined that the first limited data fails to be desensitized; and otherwise, judging that the desensitization of the first limited data is successful.
To facilitate understanding of the present invention, the verification method for the limited data desensitization method is further described below with reference to its use in an embodiment.
As shown in fig. 2, the verification process for the limited data desensitization method includes:
Select a corresponding desensitization method according to the sensitivity degree and service requirements of the original data, desensitize the original data, and record the desensitized data as first limited data.
Perform data security verification on the first limited data and judge whether it contains prohibited words; if so, judge that the desensitization of the first limited data has failed.
If not, import the first limited data into the artificial intelligence model as training samples; after training finishes, record the model's output content as second limited data. Compare the original data with the second limited data and judge whether their contents are consistent; if so, judge that the desensitization of the first limited data has failed.
If not, extract the structured data from the original data and from the second limited data, perform structured data similarity verification, and judge whether the two are associated; if so, judge that the desensitization of the first limited data has failed.
If not, extract the unstructured data from the original data and from the second limited data, perform unstructured data similarity verification, and judge whether the two are associated; if so, judge that the desensitization of the first limited data has failed; if not, the desensitization of the first limited data has succeeded.
As shown in fig. 3, as an embodiment of the present invention, the present invention further provides a verification system for a limited data desensitization method, the system 200 comprising:
the desensitization processing module 210 is configured to select a corresponding desensitization method according to the sensitivity degree and the service requirement of the original data, perform desensitization processing on the original data, and record the desensitized data as first limited data;
the model training module 220 is configured to import the first limited data obtained by the desensitization processing module as a training sample into the artificial intelligence model for model training, and record the output content of the artificial intelligence model as second limited data after the model training is completed;
the first desensitization verification module 230 is configured to compare the original data with the second limited data, and determine whether the second limited data is consistent with the original data;
and the second desensitization verification module 240 is configured to, when the first desensitization verification module verifies that the second limited data is inconsistent with the original data content, verify the second limited data based on the original data, and determine whether the first limited data is desensitized successfully.
As an embodiment of the present invention, the system 200 further comprises: the data security verification module 250 is configured to perform security verification on the first limited data obtained by the desensitization processing module, and determine whether illegal words exist in the first limited data.
As one embodiment of the present invention, the second desensitization verification module includes a structured data similarity verification module and an unstructured data similarity verification module;
the structured data similarity verification module is used for verifying whether the structured data in the original data and the structured data in the second limited data are associated or not;
the unstructured data similarity verification module is used for verifying whether the unstructured data in the original data and the unstructured data in the second limited data are associated or not;
when the structured data similarity verification module verifies that the structured data in the original data and the structured data in the second limited data are associated, and the unstructured data similarity verification module verifies that the unstructured data in the original data and the unstructured data in the second limited data are associated, the second desensitization verification module judges that the first limited data fails to be desensitized; otherwise, the second desensitization verification module determines that the first restricted data desensitization was successful.
The invention provides a verification method and system for a limited data desensitization method. After desensitization processing is performed on the limited data, the desensitized limited data is imported as a sample into an artificial intelligence model for model training, and content and data comparison verification is performed on the limited data and the model output content to judge whether the model output content is associated with the limited data. This avoids the problem that limited data which, after desensitization processing, still contains the patterns or rules of the original limited data is imported into an artificial intelligence model for training and leaked to unauthorized users, thereby ensuring the security of the limited data and the personal privacy of users.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A verification method for a limited data desensitization method, the method comprising:
S1, selecting a corresponding desensitization method according to the sensitivity degree and the service requirement of the original data, performing desensitization processing on the original data, and recording the desensitized data as first limited data;
S2, importing the first limited data into an artificial intelligence model as a training sample for model training, and recording the output content of the artificial intelligence model as second limited data after model training is finished;
S3, comparing the original data with the second limited data, and judging whether the second limited data is consistent with the original data in content;
if consistent, judging that the desensitization of the first limited data fails;
if not, executing step S4;
S4, verifying the second limited data based on the original data, and judging whether the desensitization of the first limited data is successful.
2. The verification method for a limited data desensitization method according to claim 1, wherein before step S2 is performed, data security verification is performed on the first limited data, comprising:
collecting illegal words and constructing a security verification word stock;
based on the security verification word stock, keyword retrieval is carried out on the first limited data, and whether illegal words exist in the first limited data is judged;
if the first limited data has illegal words, judging that the desensitization of the first limited data fails;
if no illegal words exist in the first limited data, step S2 is performed.
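The keyword retrieval of claim 2 can be sketched as follows, assuming the security verification word stock is a plain set of strings and that simple substring matching suffices; both are illustrative simplifications, not the claimed implementation.

```python
def security_check(first_limited_text, word_stock):
    """Return the illegal words found in the text.

    An empty result means the security verification passes; any hit means
    the desensitization of the first limited data is judged to have failed.
    """
    return {word for word in word_stock if word in first_limited_text}
```

In practice the word stock would be built by collecting illegal words in advance, and a production system might use tokenization or an Aho-Corasick automaton instead of per-word substring scans.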
3. The method of claim 1, wherein in step S4, the second limited data is subjected to similarity verification based on the original data, including structured data similarity verification and unstructured data similarity verification.
4. The verification method for a limited data desensitization method according to claim 3, wherein structured data similarity verification comprises:
extracting structured data in original data, cleaning and preprocessing the structured data, and constructing a plurality of sets according to data attributes;
extracting structured data in the second limited data for cleaning and preprocessing, and constructing a plurality of sets according to data attributes;
the original data and the set with the same data attribute in the second limited data form a group of sets;
according to the Jaccard similarity coefficient formula J(A, B) = |A ∩ B| / |A ∪ B|, calculating the Jaccard similarity coefficient of each group of sets;
judging whether the original data structured data set and the second limited data structured data set are associated in each data attribute according to Jaccard similarity coefficients of each group of sets;
when the Jaccard similarity coefficient of the set is larger than the preset Jaccard similarity coefficient, judging that the original data structured data set and the second limited data structured data set are associated under the data attribute;
wherein A represents a set of a data attribute in the structured data of the original data, B represents a set of the same data attribute in the structured data of the second limited data, A ∪ B represents the union of set A and set B, and A ∩ B represents the intersection of set A and set B.
5. The verification method for a limited data desensitization method according to claim 4, wherein if the Jaccard similarity coefficients of the sets of all data attributes are smaller than the preset Jaccard similarity coefficient, it is determined that the structured data in the original data and the structured data in the second limited data are not associated;
otherwise, judging that the structured data in the original data and the structured data in the second limited data are associated.
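The per-attribute Jaccard check of claims 4 and 5 can be sketched as follows; the attribute-keyed dictionaries and the 0.5 preset coefficient are illustrative assumptions, not values from the claims.

```python
def jaccard(a, b):
    """J(A, B) = |A intersection B| / |A union B|; defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def structured_associated(original_attrs, limited_attrs, threshold=0.5):
    """True if any shared data attribute's set pair exceeds the preset coefficient.

    original_attrs / limited_attrs map each data attribute name to the set
    built from the cleaned and preprocessed structured data.
    """
    shared = original_attrs.keys() & limited_attrs.keys()
    return any(jaccard(original_attrs[k], limited_attrs[k]) > threshold
               for k in shared)
```

Per claim 5, only when every attribute pair stays at or below the preset coefficient is the structured data judged to be unassociated.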
6. The verification method for a limited data desensitization method according to claim 3, wherein unstructured data similarity verification comprises:
extracting unstructured data in the original data and unstructured data in the second limited data, and cleaning and preprocessing;
performing word frequency statistics on unstructured data in the processed original data and unstructured data in the second limited data, counting the occurrence times of each word in the two groups of data, and respectively constructing an original data word frequency matrix and a second limited data word frequency matrix;
converting the original data word frequency matrix and the second limited data word frequency matrix into an original data vector and a second limited data vector respectively, wherein each dimension in the data vector represents the word frequency of a word;
according to the cosine similarity calculation formula cos θ = (A · B) / (|A| × |B|), calculating the cosine similarity of the two data vectors;
if the cosine similarity is greater than the preset cosine similarity, judging that the unstructured data in the original data are associated with the unstructured data in the second limited data;
otherwise, judging that the unstructured data in the original data and the unstructured data in the second limited data are not associated;
wherein A represents the original data vector, B represents the second limited data vector, A · B represents the inner product of the original data vector and the second limited data vector, and |A| and |B| represent the modulus lengths of the original data vector and the second limited data vector, respectively.
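The word-frequency cosine check of claim 6 can be sketched as follows; whitespace tokenization stands in for the cleaning and preprocessing step and is an illustrative simplification, not the claimed implementation.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """cos(theta) = (A . B) / (|A| * |B|) over word-frequency vectors."""
    # Word frequency statistics: count occurrences of each word in both texts
    freq_a, freq_b = Counter(text_a.split()), Counter(text_b.split())
    # Inner product over the shared vocabulary (missing words count as 0)
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a)
    # Modulus lengths of the two word-frequency vectors
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

A cosine similarity above the preset value would then indicate that the unstructured data in the original data and in the second limited data are associated.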
7. The verification method for a limited data desensitization method according to claim 1, wherein in step S4, when it is determined that the structured data in the original data and the structured data in the second limited data are associated and the unstructured data in the original data and the unstructured data in the second limited data are associated, it is determined that the desensitization of the first limited data fails;
and otherwise, judging that the desensitization of the first limited data is successful.
8. A verification system for a limited data desensitization method, the system comprising:
the desensitization processing module is used for selecting a corresponding desensitization method according to the sensitivity degree and the service requirement of the original data, carrying out desensitization processing on the original data, and recording the desensitized data as first limited data;
the model training module is used for importing the first limited data obtained by the desensitization processing module into the artificial intelligence model as a training sample for model training, and recording the output content of the artificial intelligence model as second limited data after model training is finished;
the first desensitization verification module is used for comparing the original data with the second limited data and judging whether the second limited data is consistent with the original data;
and the second desensitization verification module is used for verifying the second limited data based on the original data when the first desensitization verification module verifies that the second limited data is inconsistent with the original data, and judging whether the first limited data is successfully desensitized.
9. The verification system for a limited data desensitization method according to claim 8, wherein the system further comprises: a data security verification module, used for performing security verification on the first limited data obtained by the desensitization processing module and judging whether illegal words exist in the first limited data.
10. The verification system for a limited data desensitization method according to claim 8, wherein the second desensitization verification module comprises a structured data similarity verification module and an unstructured data similarity verification module;
the structured data similarity verification module is used for verifying whether the structured data in the original data and the structured data in the second limited data are associated or not;
the unstructured data similarity verification module is used for verifying whether the unstructured data in the original data and the unstructured data in the second limited data are associated or not;
when the structured data similarity verification module verifies that the structured data in the original data and the structured data in the second limited data are associated, and the unstructured data similarity verification module verifies that the unstructured data in the original data and the unstructured data in the second limited data are associated, the second desensitization verification module judges that the first limited data fails to be desensitized;
otherwise, the second desensitization verification module determines that the first restricted data desensitization was successful.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410303595.1A CN117892358B (en) | 2024-03-18 | 2024-03-18 | Verification method and system for limited data desensitization method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117892358A | 2024-04-16
CN117892358B | 2024-07-05
Family
ID=90647753
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410303595.1A Active CN117892358B (en) | 2024-03-18 | 2024-03-18 | Verification method and system for limited data desensitization method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117892358B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111475855A (en) * | 2020-06-24 | 2020-07-31 | 支付宝(杭州)信息技术有限公司 | Data processing method and device for realizing privacy protection |
WO2021212968A1 (en) * | 2020-04-24 | 2021-10-28 | 华为技术有限公司 | Unstructured data processing method, apparatus, and device, and medium |
CN113989156A (en) * | 2021-11-01 | 2022-01-28 | 北京地平线信息技术有限公司 | Method, apparatus, medium, device, and program for reliability verification of desensitization method |
CN113988226A (en) * | 2021-12-29 | 2022-01-28 | 深圳红途科技有限公司 | Data desensitization validity verification method and device, computer equipment and storage medium |
CN116646046A (en) * | 2023-07-27 | 2023-08-25 | 中日友好医院(中日友好临床医学研究所) | Electronic medical record processing method and system based on Internet diagnosis and treatment |
CN117421773A (en) * | 2023-10-30 | 2024-01-19 | 阿维塔科技(重庆)有限公司 | Data desensitization processing method, device, equipment and storage medium |
CN117556455A (en) * | 2023-07-24 | 2024-02-13 | 上海凯馨信息科技有限公司 | Data desensitization security inspection method |
Non-Patent Citations (2)
Title |
---|
PING CHEN et al.: "Safety verification model of desensitization algorithm for civil aviation passenger data based on statistics", 2023 International Conference on Cyber-enabled Distributed Computing and Knowledge Discovery (CyberC), 21 February 2024 (2024-02-21) * |
ZHANG Yu; LÜ Xixiang; ZOU Yucong; LI Yige: "Desensitization of text sequence data sets based on generative adversarial networks", Chinese Journal of Network and Information Security (网络与信息安全学报), no. 04, 15 August 2020 (2020-08-15) * |
Also Published As
Publication number | Publication date |
---|---|
CN117892358B (en) | 2024-07-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11574077B2 (en) | Systems and methods for removing identifiable information | |
US11127403B2 (en) | Machine learning-based automatic detection and removal of personally identifiable information | |
CN113051371B (en) | Chinese machine reading understanding method and device, electronic equipment and storage medium | |
Luo et al. | A CNN-based Approach to the Detection of SQL Injection Attacks | |
CN106991312B (en) | Internet anti-fraud authentication method based on voiceprint recognition | |
CN112580352B (en) | Keyword extraction method, device and equipment and computer storage medium | |
CN113593661A (en) | Clinical term standardization method, device, electronic equipment and storage medium | |
CN115314268B (en) | Malicious encryption traffic detection method and system based on traffic fingerprint and behavior | |
CN111460114A (en) | Retrieval method, device, equipment and computer readable storage medium | |
CN115687980A (en) | Desensitization classification method of data table, and classification model training method and device | |
CN117520503A (en) | Financial customer service dialogue generation method, device, equipment and medium based on LLM model | |
CN110990834B (en) | Static detection method, system and medium for android malicious software | |
CN117272142A (en) | Log abnormality detection method and system and electronic equipment | |
CN116401343A (en) | Data compliance analysis method | |
CN117271558A (en) | Language query model construction method, query language acquisition method and related devices | |
CN117708297A (en) | Query statement generation method and device, electronic equipment and storage medium | |
CN112966507A (en) | Method, device, equipment and storage medium for constructing recognition model and identifying attack | |
CN117892358B (en) | Verification method and system for limited data desensitization method | |
Bisogni et al. | Multibiometric score-level fusion through optimization and training | |
CN106446696A (en) | Information processing method and electronic device | |
CN112668284B (en) | Legal document segmentation method and system | |
CN114741088A (en) | App source code linking method based on user comments and developer intelligence | |
CN117077680A (en) | Question and answer intention recognition method and device | |
CN113259369A (en) | Data set authentication method and system based on machine learning member inference attack | |
CN117272123B (en) | Sensitive data processing method and device based on large model and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||