CN117892358A - Verification method and system for limited data desensitization method - Google Patents

Verification method and system for limited data desensitization method

Info

Publication number: CN117892358A
Application number: CN202410303595.1A
Authority: CN (China)
Prior art keywords: data, limited, desensitization, original, limited data
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN117892358B
Inventors: 蔡卓人 (Cai Zhuoren), 蒋江涛 (Jiang Jiangtao), 郭鹏 (Guo Peng), 郭浩宇 (Guo Haoyu), 邓小宁 (Deng Xiaoning), 金剑 (Jin Jian), 马杰 (Ma Jie)
Current assignee: North Health Medical Big Data Technology Co., Ltd. (the listed assignee may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: North Health Medical Big Data Technology Co., Ltd.
Application filed by North Health Medical Big Data Technology Co., Ltd.; priority to CN202410303595.1A
Publication of application: CN117892358A; grant publication: CN117892358B

Classifications

    • G06F 21/6254: Protecting personal data, e.g. for financial or medical purposes, by anonymising data, e.g. decorrelating personal data from the owner's identification (under G06F 21/62, protecting access to data via a platform)
    • G06F 16/215: Improving data quality; data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors (under G06F 16/21, design, administration or maintenance of databases)
    • G06F 16/35: Clustering; classification (under G06F 16/30, information retrieval of unstructured textual data)
    • G06F 40/216: Parsing using statistical methods (under G06F 40/20, natural language analysis)
    • G06N 3/0455: Auto-encoder networks; encoder-decoder networks (under G06N 3/04, neural network architectures)


Abstract

The invention provides a verification method and system for a limited data desensitization method, in the technical field of limited data. The method comprises the following steps: selecting a corresponding desensitization method according to the sensitivity and business requirements of the original data, desensitizing the original data, and recording the desensitized data as first limited data; importing the first limited data into an artificial intelligence model as training samples, and recording the output of the model as second limited data after training finishes; comparing the original data with the second limited data to judge whether the second limited data is consistent with the original data; if consistent, judging that desensitization of the first limited data has failed; if inconsistent, verifying the second limited data based on the original data to judge whether the first limited data was desensitized successfully. The method ensures that, after the limited data is desensitized and used to train an artificial intelligence model, patterns or rules of the sensitive information are not revealed to unauthorized users.

Description

Verification method and system for limited data desensitization method
Technical Field
The invention relates to the technical field of limited data, in particular to a verification method and a verification system for a limited data desensitization method.
Background
Restricted (limited) data refers to data whose access and use must be restricted and protected in some way, including personal identification information, medical records, financial data, and business secrets; the restrictions are typically associated with the sensitivity, privacy, or security of the data.
Because artificial intelligence model training needs to reference sensitive data such as personal identity information to perform operations like associating multi-source data, models are often trained on limited data to improve their performance and accuracy. Limited data is also closer to real application scenarios, which helps the model understand and adapt to various semantic contexts and draw reasonable inferences. Using limited data as training samples allows a model to handle similar scenario data better and to provide more accurate predictions and decisions.
However, because limited data contains sensitive information, such as personal identification or information about specific groups, it may face security risks such as hacking and data leakage during model training, data processing, and data transmission, which could expose the limited data. It is therefore usually necessary to desensitize the limited data. Yet even after desensitization, once the data has been used as training samples by an artificial intelligence model, patterns or rules of the limited data may still be revealed to unauthorized users, and the limited data may also be misused.
These are shortcomings of the prior art.
Disclosure of Invention
In order to remedy the above defects, the invention provides a verification method and system for a limited data desensitization method, so that limited data which has undergone desensitization can be used for artificial intelligence model training without patterns or rules of the limited data being revealed to unauthorized users.
In a first aspect, the present invention provides a verification method for a limited data desensitization method, the method comprising:
S1, selecting a corresponding desensitization method according to the sensitivity and business requirements of the original data, desensitizing the original data, and recording the desensitized data as first limited data;
S2, importing the first limited data into an artificial intelligence model as training samples for model training, and recording the output of the model as second limited data after training finishes;
S3, comparing the original data with the second limited data, and judging whether the content of the second limited data is consistent with that of the original data;
if consistent, judging that desensitization of the first limited data has failed;
if inconsistent, executing step S4;
S4, verifying the second limited data based on the original data, and judging whether the first limited data was desensitized successfully.
Further, before step S2 is executed, data security verification needs to be performed on the first limited data, comprising:
collecting violation words and constructing a security-verification word stock;
performing keyword retrieval on the first limited data based on the security-verification word stock, and judging whether any violation word exists in the first limited data;
if a violation word exists in the first limited data, judging that desensitization of the first limited data has failed;
if no violation word exists in the first limited data, executing step S2.
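The security-verification step above can be sketched as a simple keyword lookup (a minimal sketch; the word stock, field contents, and function names are illustrative assumptions, since the patent does not fix an implementation):

```python
def build_word_stock(violation_words):
    # Collect violation words into a security-verification word stock
    # (lower-cased for case-insensitive retrieval).
    return {w.lower() for w in violation_words}

def passes_security_verification(first_limited_data, word_stock):
    # Keyword retrieval over the first limited data: desensitization is
    # judged to have failed (return False) if any violation word occurs.
    text = " ".join(str(v) for v in first_limited_data).lower()
    return not any(word in text for word in word_stock)
```

Only if this check passes is the first limited data imported into the artificial intelligence model in step S2.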
Further, in step S4, based on the original data, similarity verification is performed on the second limited data, including structured data similarity verification and unstructured data similarity verification.
Further, structured data similarity verification includes:
extracting structured data in original data, cleaning and preprocessing the structured data, and constructing a plurality of sets according to data attributes;
extracting structured data in the second limited data for cleaning and preprocessing, and constructing a plurality of sets according to data attributes;
the original data and the set with the same data attribute in the second limited data form a group of sets;
according to the Jaccard similarity coefficient formula J(A, B) = |A ∩ B| / |A ∪ B|, calculating the Jaccard similarity coefficient of each group of sets;
judging whether the original data structured data set and the second limited data structured data set are associated in each data attribute according to Jaccard similarity coefficients of each group of sets;
when the Jaccard similarity coefficient of the set is larger than the preset Jaccard similarity coefficient, judging that the original data structured data set and the second limited data structured data set are associated under the data attribute;
wherein A represents the set of a data attribute in the original structured data, B represents the set of the same data attribute in the structured data of the second limited data, A ∪ B represents the union of set A and set B, and A ∩ B represents the intersection of set A and set B.
Further, if the Jaccard similarity coefficients of the set of all the data attributes are smaller than the preset Jaccard similarity coefficients, judging that the structured data in the original data and the structured data in the second limited data are not associated;
otherwise, judging that the structured data in the original data and the structured data in the second limited data are associated.
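The per-group Jaccard comparison described above can be sketched directly from the formula (a minimal sketch; the convention for two empty sets is an assumption):

```python
def jaccard(a, b):
    # Jaccard similarity coefficient J(A, B) = |A intersect B| / |A union B|.
    if not a and not b:
        return 0.0  # convention: two empty attribute sets share nothing
    return len(a & b) / len(a | b)
```

A group of sets with the same data attribute (one from the original data, one from the second limited data) is judged associated under that attribute when jaccard(a, b) exceeds the preset coefficient.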
Further, unstructured data similarity verification includes:
extracting unstructured data in the original data and unstructured data in the second limited data, and cleaning and preprocessing;
performing word frequency statistics on unstructured data in the processed original data and unstructured data in the second limited data, counting the occurrence times of each word in the two groups of data, and respectively constructing an original data word frequency matrix and a second limited data word frequency matrix;
converting the original data word frequency matrix and the second limited data word frequency matrix into an original data vector and a second limited data vector respectively, wherein each dimension in the data vector represents the word frequency of a word;
calculation formula according to cosine similarityCalculating cosine similarity of the two data vectors;
if the cosine similarity is greater than the preset cosine similarity, judging that the unstructured data in the original data are associated with the unstructured data in the second limited data;
otherwise, judging that the unstructured data in the original data and the unstructured data in the second limited data are not associated;
wherein A represents the original data vector, B represents the second limited data vector, A · B represents the inner product of the two vectors, and |A| and |B| represent the modulus lengths of the original data vector and the second limited data vector, respectively.
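The cosine similarity computation can likewise be sketched from the formula (a minimal sketch operating on already-built word-frequency vectors of equal dimension):

```python
import math

def cosine_similarity(a, b):
    # cos(A, B) = (A . B) / (|A| |B|) for two word-frequency vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_a * norm_b)
```

The unstructured data are judged associated when this value exceeds the preset cosine similarity.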
Further, in step S4, when it is determined that there is an association between the structured data in the original data and the structured data in the second limited data and that there is an association between the unstructured data in the original data and the unstructured data in the second limited data, it is determined that the desensitization of the first limited data fails;
and otherwise, judging that the desensitization of the first limited data is successful.
In a second aspect, the present invention provides a verification system for a limited data desensitization method, the system comprising:
the desensitization processing module is used for selecting a corresponding desensitization method according to the sensitivity degree and the service requirement of the original data, carrying out desensitization processing on the original data, and recording the desensitized data as first limited data;
the model training module, which is used for importing the first limited data obtained by the desensitization processing module into the artificial intelligence model as training samples for model training, and for recording the output of the model as second limited data after training finishes;
the first desensitization verification module is used for comparing the original data with the second limited data and judging whether the second limited data is consistent with the original data;
and the second desensitization verification module is used for verifying the second limited data based on the original data when the first desensitization verification module verifies that the second limited data is inconsistent with the original data, and judging whether the first limited data is successfully desensitized.
Further, the system further comprises: and the data security verification module is used for performing security verification on the first limited data obtained by the desensitization processing module and judging whether the first limited data has illegal words or not.
Further, the second desensitization verification module comprises a structured data similarity verification module and an unstructured data similarity verification module;
the structured data similarity verification module is used for verifying whether the structured data in the original data and the structured data in the second limited data are associated or not;
the unstructured data similarity verification module is used for verifying whether the unstructured data in the original data and the unstructured data in the second limited data are associated or not;
when the structured data similarity verification module verifies that the structured data in the original data and the structured data in the second limited data are associated, and the unstructured data similarity verification module verifies that the unstructured data in the original data and the unstructured data in the second limited data are associated, the second desensitization verification module judges that the first limited data fails to be desensitized;
otherwise, the second desensitization verification module determines that the first restricted data desensitization was successful.
From the above technical scheme, the invention has the following advantages:
the invention provides a method and a system for verifying a desensitization method of limited data, which are characterized in that after desensitization treatment is carried out on the limited data, the desensitized limited data is used as a sample to be imported into an artificial intelligent model for model training, content and data comparison verification is carried out according to the limited data and model output content, whether the output content of the verification model is associated with the limited data or not is judged, so that the problem that the limited data still contains a mode or rule of the limited data after the desensitization treatment is imported into the artificial intelligent model for training and is revealed to an unauthorized user is avoided, and the safety of the limited data and the personal privacy of the user are ensured.
Drawings
In order to illustrate the technical solutions of the present invention more clearly, the drawings needed in the description are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of a method of one embodiment of the invention;
FIG. 2 is a schematic flow chart of one specific use of the method of one embodiment of the present invention;
fig. 3 is a schematic block diagram of a system of one embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
As shown in fig. 1, the present invention provides a verification method for a limited data desensitization method, which includes:
and 110, selecting a corresponding desensitization method according to the sensitivity degree and the service requirement of the original data, carrying out desensitization treatment on the original data, and recording the desensitized data as first limited data.
And selecting a corresponding desensitization method according to the sensitivity degree and the service requirement of the original data, carrying out desensitization processing on the original data, and recording the desensitized data as first limited data.
After the desensitization process, data security verification is performed on the first limited data. Firstly, violation words are collected and a security-verification word stock is constructed from them; keyword retrieval is then performed on the first limited data based on the word stock to judge whether any violation word is present. If the first limited data contains a violation word, it is judged that desensitization of the first limited data has failed; if not, step 120 is performed.
Step 120: importing the first limited data into the artificial intelligence model as training samples for model training, and recording the output of the model as second limited data after training finishes.
The artificial intelligence model into which the first limited data is imported as training samples may be a traditional machine learning model such as a decision tree, random forest, support vector machine, or logistic regression; a deep learning model such as a neural network, convolutional neural network, or recurrent neural network; or a pre-trained model such as BERT, RoBERTa, or XLNet.
BERT (Bidirectional Encoder Representations from Transformers) is a bidirectional, Transformer-based pre-trained model proposed by Google. Through unsupervised pre-training on large-scale text data, BERT learns context-dependent representations of words and sentences, providing good feature expressions for various downstream natural language processing tasks. RoBERTa, developed by Facebook AI, is an improvement and extension of BERT: it pre-trains on larger-scale data for longer, with additional training techniques, further improving model performance. XLNet is a Transformer-based pre-trained model developed by CMU together with Google Brain. Unlike traditional autoregressive pre-training models, XLNet employs a permutation language model objective, so that the model can take all possible context orderings into account during prediction and thereby better capture the relationships between words.
It should be noted that the artificial intelligence models into which the first limited data may be imported as training samples include, but are not limited to, the models listed in the above embodiments.
Step 130: comparing the original data with the second limited data, and judging whether the second limited data is consistent with the original data; if consistent, judging that desensitization of the first limited data has failed; if not, performing step 140.
In step 130, all of the original data is compared with all of the second limited data. If the content of the original data is completely consistent with that of the second limited data, it is judged that desensitization of the first limited data has failed; if the contents are not completely consistent, the second limited data is verified further.
Step 140: verifying the second limited data based on the original data, and judging whether the first limited data was desensitized successfully.
Similarity verification is performed on the second limited data based on the original data, comprising structured-data similarity verification and unstructured-data similarity verification. Data can be divided into structured and unstructured data according to its organization and characteristics. Structured data is organized in a predefined pattern and format, with explicit data types and relationships, and is typically stored as tables, rows, or relational databases. Unstructured data has no explicit structure or format; it is highly diverse, complex, and irregular, and exists in the form of text, images, audio, video, and the like.
In this embodiment, the structured data similarity verification includes:
and respectively extracting structured data in the original data and structured data in the second limited data by adopting a regular expression matching method, a natural language processing technology or a database query language method, and performing arrangement and cleaning, including removing repeated values, processing missing values, processing abnormal values and the like.
And respectively constructing a plurality of sets according to the data attribute by the structured data in the processed original data and the structured data in the second limited data. And forming a group of sets with the same data attribute in the original data and the second limited data, and then according to a Jaccard similarity coefficient formula:jaccard similarity coefficients were calculated for each set of sets.
And judging whether the original data structured data set and the second limited data structured data set are associated in each data attribute according to the Jaccard similarity coefficient of each group of sets. When the Jaccard similarity coefficient of the set is larger than the preset Jaccard similarity coefficient, judging that the original data structure data set and the second limited data structure data set are associated under the data attribute.
In this embodiment, only when Jaccard similarity coefficients of the set group of all data attributes in the original data structured data and the second limited data structured data are smaller than preset Jaccard similarity coefficients, it is determined that no association exists between the structured data in the original data and the structured data in the second limited data; otherwise, judging that the structured data in the original data and the structured data in the second limited data are associated.
Wherein A represents the set of a data attribute in the original structured data, B represents the set of the same data attribute in the structured data of the second limited data, A ∪ B represents the union of set A and set B, and A ∩ B represents the intersection of set A and set B.
It should be noted that, a person skilled in the art may set the value of the preset Jaccard similarity coefficient according to actual needs.
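Under these rules, the structured-data verdict can be sketched as follows (a minimal sketch; the attribute names and the threshold value 0.5 are illustrative assumptions, since the patent leaves the preset coefficient to the practitioner):

```python
def structured_data_associated(original_sets, limited_sets, preset=0.5):
    # original_sets / limited_sets map each data attribute to its set of
    # values. The structured data are judged associated if the Jaccard
    # coefficient of ANY shared attribute exceeds the preset coefficient;
    # only when every attribute stays below it is no association returned.
    def jaccard(a, b):
        return len(a & b) / len(a | b) if (a or b) else 0.0
    shared = original_sets.keys() & limited_sets.keys()
    return any(jaccard(original_sets[k], limited_sets[k]) > preset
               for k in shared)
```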
In this embodiment, unstructured data similarity verification includes:
the unstructured data in the original data and the unstructured data in the second limited data are respectively extracted by adopting the technologies of text classification, emotion analysis, keyword extraction and the like or using natural language processing technology, and are arranged and cleaned, including repeated value removal, missing value processing, abnormal value processing and the like. And carrying out word frequency statistics on unstructured data in the processed original data and unstructured data in the second limited data, counting the occurrence times of each word in the two groups of data, respectively constructing an original data word frequency matrix and a second limited data word frequency matrix, and respectively converting the original data word frequency matrix and the second limited data word frequency matrix into an original data vector and a second limited data vector, wherein each dimension in the data vector represents the word frequency of one word. Calculation formula according to cosine similarityCosine similarity of the two data vectors is calculated.
If the cosine similarity is greater than the preset cosine similarity, judging that the unstructured data in the original data are associated with the unstructured data in the second limited data; otherwise, judging that the unstructured data in the original data and the unstructured data in the second limited data are not associated.
Wherein A represents the original data vector, B represents the second limited data vector, A · B represents the inner product of the two vectors, and |A| and |B| represent the modulus lengths of the original data vector and the second limited data vector, respectively.
It should be noted that, a person skilled in the art may set the value of the preset cosine similarity according to actual needs.
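The word-frequency pipeline above can be sketched end to end (a minimal sketch; whitespace tokenisation and the shared-vocabulary layout are simplifying assumptions, and a real implementation would need proper word segmentation for Chinese text):

```python
import math
from collections import Counter

def word_frequency_vectors(original_text, limited_text):
    # Count occurrences of each word in both texts and lay the counts out
    # over a shared vocabulary, so each dimension is one word's frequency.
    orig_counts = Counter(original_text.lower().split())
    lim_counts = Counter(limited_text.lower().split())
    vocab = sorted(set(orig_counts) | set(lim_counts))
    return ([orig_counts[w] for w in vocab], [lim_counts[w] for w in vocab])

def cosine(a, b):
    # cos(A, B) = (A . B) / (|A| |B|)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0
```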
In this embodiment, when it is determined that there is an association between structured data in the original data and structured data in the second limited data and an association between unstructured data in the original data and unstructured data in the second limited data, it is determined that the first limited data fails to be desensitized; and otherwise, judging that the desensitization of the first limited data is successful.
In order to facilitate understanding, the verification method for the limited data desensitization method provided by the invention is further described below with reference to a specific use process in an embodiment.
As shown in fig. 2, the verification of the limited data desensitization method proceeds as follows:
A corresponding desensitization method is selected according to the sensitivity and business requirements of the original data, the original data is desensitized, and the desensitized data is recorded as first limited data. Data security verification is performed on the first limited data to judge whether it contains any violation word; if yes, it is judged that desensitization of the first limited data has failed. If not, the first limited data is imported into the artificial intelligence model as training samples, and after training finishes, the model's output is recorded as second limited data. The original data is then compared with the second limited data to judge whether their contents are consistent; if yes, it is judged that desensitization of the first limited data has failed. If not, structured data in the original data and in the second limited data is extracted and structured-data similarity verification is performed to judge whether the two are associated; if yes, it is judged that desensitization of the first limited data has failed. If not, unstructured data in the original data and in the second limited data is extracted and unstructured-data similarity verification is performed to judge whether the two are associated; if yes, it is judged that desensitization of the first limited data has failed. If not, the first limited data has been desensitized successfully.
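The flow of fig. 2 can be condensed into one driver function (a minimal sketch; the check functions are passed in as placeholders, since the patent leaves the concrete desensitization method, word stock, and similarity thresholds open; second_limited stands for the model output obtained after training on the first limited data):

```python
def verify_desensitization(original, first_limited, second_limited,
                           has_violation_word, structured_associated,
                           unstructured_associated):
    # Returns True only if the first limited data passes every stage of
    # the fig. 2 flow; any failed stage means desensitization has failed.
    if has_violation_word(first_limited):
        return False  # data security verification failed
    if second_limited == original:
        return False  # model output is consistent with the original data
    if structured_associated(original, second_limited):
        return False  # structured data still associated
    if unstructured_associated(original, second_limited):
        return False  # unstructured data still associated
    return True
```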
As shown in fig. 3, as an embodiment of the present invention, the invention further provides a verification system for a limited data desensitization method; the system 200 includes:
the desensitization processing module 210 is configured to select a corresponding desensitization method according to the sensitivity degree and the service requirement of the original data, perform desensitization processing on the original data, and record the desensitized data as first limited data;
the model training module 220, configured to import the first limited data obtained by the desensitization processing module into the artificial intelligence model as training samples for model training, and to record the output of the model as second limited data after training finishes;
the first desensitization verification module 230 is configured to compare the original data with the second limited data, and determine whether the second limited data is consistent with the original data;
and the second desensitization verification module 240 is configured to, when the first desensitization verification module verifies that the second limited data is inconsistent with the original data content, verify the second limited data based on the original data, and determine whether the first limited data is desensitized successfully.
As an embodiment of the present invention, the system 200 further comprises: the data security verification module 250 is configured to perform security verification on the first limited data obtained by the desensitization processing module, and determine whether illegal words exist in the first limited data.
As one embodiment of the present invention, the second desensitization verification module includes a structured data similarity verification module and an unstructured data similarity verification module;
the structured data similarity verification module is used for verifying whether the structured data in the original data and the structured data in the second limited data are associated or not;
the unstructured data similarity verification module is used for verifying whether the unstructured data in the original data and the unstructured data in the second limited data are associated or not;
when the structured data similarity verification module verifies that the structured data in the original data and the structured data in the second limited data are associated, and the unstructured data similarity verification module verifies that the unstructured data in the original data and the unstructured data in the second limited data are associated, the second desensitization verification module judges that the first limited data fails to be desensitized; otherwise, the second desensitization verification module determines that the first restricted data desensitization was successful.
The invention provides a verification method and system for a limited data desensitization method. After the limited data is desensitized, the desensitized limited data is imported as a sample into an artificial intelligence model for model training, and content and data comparison verification is carried out between the limited data and the model output content to judge whether the model output content is associated with the limited data. This avoids the problem that limited data which, after desensitization, still contains the patterns or rules of the original limited data is imported into an artificial intelligence model for training and thereby revealed to unauthorized users, ensuring the security of the limited data and the personal privacy of users.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A verification method for a limited data desensitization method, the method comprising:
s1, selecting a corresponding desensitization method according to the sensitivity degree and service requirement of original data, carrying out desensitization treatment on the original data, and recording the desensitized data as first limited data;
s2, importing the first limited data into an artificial intelligence model as a training sample to perform model training, and recording the output content of the artificial intelligence model as second limited data after model training is finished;
s3, comparing the original data with the second limited data, and judging whether the second limited data is consistent with the original data in content or not;
if the first limited data are consistent, judging that the desensitization of the first limited data fails;
if not, executing the step S4;
and S4, verifying the second limited data based on the original data, and judging whether the first limited data is desensitized successfully.
2. The verification method for a limited data desensitization method according to claim 1, wherein prior to performing step S2, data security verification is required for the first limited data, comprising:
collecting illegal words and constructing a security verification word stock;
based on the security verification word stock, keyword retrieval is carried out on the first limited data, and whether illegal words exist in the first limited data is judged;
if the first limited data has illegal words, judging that the desensitization of the first limited data fails;
if the first limited data does not have the violation word, step S2 is performed.
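The keyword retrieval of claim 2 can be sketched as follows (a minimal illustration; the word stock entries are invented examples, not taken from the patent):

```python
def has_illegal_words(first_limited_data: str, security_word_stock: set) -> bool:
    """Keyword retrieval: True if any word from the security verification
    word stock occurs in the first limited data."""
    return any(word in first_limited_data for word in security_word_stock)

# hypothetical security verification word stock for illustration
word_stock = {"id_number", "password"}
```

A hit means the first limited data fails desensitization; an empty result allows step S2 to proceed.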
3. The verification method for a limited data desensitization method according to claim 1, wherein in step S4, the second limited data is subjected to similarity verification based on the original data, including structured data similarity verification and unstructured data similarity verification.
4. The verification method for a limited data desensitization method according to claim 3, wherein the structured data similarity verification comprises:
extracting structured data in original data, cleaning and preprocessing the structured data, and constructing a plurality of sets according to data attributes;
extracting structured data in the second limited data for cleaning and preprocessing, and constructing a plurality of sets according to data attributes;
the original data and the set with the same data attribute in the second limited data form a group of sets;
according to the Jaccard similarity coefficient formula J(A, B) = |A ∩ B| / |A ∪ B|, calculating the Jaccard similarity coefficient of each group of sets;
judging whether the original data structured data set and the second limited data structured data set are associated in each data attribute according to Jaccard similarity coefficients of each group of sets;
when the Jaccard similarity coefficient of the set is larger than the preset Jaccard similarity coefficient, judging that the original data structured data set and the second limited data structured data set are associated under the data attribute;
wherein A represents the set of a data attribute in the original data structured data, B represents the set of the same data attribute in the second limited data structured data, |A ∪ B| represents the union of set A and set B, and |A ∩ B| represents the intersection of set A and set B.
5. The verification method for a limited data desensitization method according to claim 4, wherein if the Jaccard similarity coefficients of the sets of all data attributes are smaller than the preset Jaccard similarity coefficient, it is determined that there is no association between the structured data in the original data and the structured data in the second limited data;
otherwise, judging that the structured data in the original data and the structured data in the second limited data are associated.
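The structured-data check of claims 4 and 5 can be sketched as follows (a minimal illustration; the default threshold value is an assumed example, not taken from the patent):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity coefficient J(A, B) = |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def structured_associated(orig_attrs: dict, limited_attrs: dict,
                          preset_coefficient: float = 0.5) -> bool:
    """Associated if, for any shared data attribute, the Jaccard coefficient
    of the two sets exceeds the preset coefficient."""
    shared = orig_attrs.keys() & limited_attrs.keys()
    return any(jaccard(orig_attrs[k], limited_attrs[k]) > preset_coefficient
               for k in shared)
```

Each dictionary maps a data attribute to the set built from its cleaned, preprocessed values, so one Jaccard coefficient is computed per group of sets.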
6. The verification method for a limited data desensitization method according to claim 3, wherein the unstructured data similarity verification comprises:
extracting unstructured data in the original data and unstructured data in the second limited data, and cleaning and preprocessing;
performing word frequency statistics on unstructured data in the processed original data and unstructured data in the second limited data, counting the occurrence times of each word in the two groups of data, and respectively constructing an original data word frequency matrix and a second limited data word frequency matrix;
converting the original data word frequency matrix and the second limited data word frequency matrix into an original data vector and a second limited data vector respectively, wherein each dimension in the data vector represents the word frequency of a word;
calculation formula according to cosine similarityCalculating cosine similarity of the two data vectors;
if the cosine similarity is greater than the preset cosine similarity, judging that the unstructured data in the original data are associated with the unstructured data in the second limited data;
otherwise, judging that the unstructured data in the original data and the unstructured data in the second limited data are not associated;
wherein A represents the original data vector, B represents the second limited data vector, A · B represents the inner product of the original data vector and the second limited data vector, and |A| and |B| represent the modulus lengths of the original data vector and the second limited data vector, respectively.
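The word-frequency cosine check of claim 6 can be sketched as follows (a minimal illustration; whitespace tokenization stands in for the cleaning and preprocessing step and is an assumed simplification):

```python
import math
from collections import Counter

def cosine_similarity(original_text: str, limited_text: str) -> float:
    """cos(A, B) = (A · B) / (|A| |B|) over word-frequency vectors."""
    freq_a = Counter(original_text.split())   # original data word frequencies
    freq_b = Counter(limited_text.split())    # second limited data word frequencies
    vocab = set(freq_a) | set(freq_b)         # each dimension is one word
    dot = sum(freq_a[w] * freq_b[w] for w in vocab)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

A result above the preset cosine similarity would indicate that the unstructured data in the original data and in the second limited data are associated.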
7. The verification method for a limited data desensitization method according to claim 1, wherein in step S4, when it is determined that there is an association between structured data in the original data and structured data in the second limited data and an association between unstructured data in the original data and unstructured data in the second limited data, it is determined that the first limited data fails to be desensitized;
and otherwise, judging that the desensitization of the first limited data is successful.
8. A verification system for a limited data desensitization method, the system comprising:
the desensitization processing module is used for selecting a corresponding desensitization method according to the sensitivity degree and the service requirement of the original data, carrying out desensitization processing on the original data, and recording the desensitized data as first limited data;
the model training module is used for importing the first limited data obtained by the desensitization processing module into the artificial intelligence model as a training sample to perform model training, and recording the output content of the artificial intelligence model as second limited data after the model training is finished;
the first desensitization verification module is used for comparing the original data with the second limited data and judging whether the second limited data is consistent with the original data;
and the second desensitization verification module is used for verifying the second limited data based on the original data when the first desensitization verification module verifies that the second limited data is inconsistent with the original data, and judging whether the first limited data is successfully desensitized.
9. The verification system for a limited data desensitization method according to claim 8, wherein the system further comprises: the data security verification module, used for performing security verification on the first limited data obtained by the desensitization processing module and judging whether illegal words exist in the first limited data.
10. The verification system for a limited data desensitization method according to claim 8, wherein the second desensitization verification module comprises a structured data similarity verification module and an unstructured data similarity verification module;
the structured data similarity verification module is used for verifying whether the structured data in the original data and the structured data in the second limited data are associated or not;
the unstructured data similarity verification module is used for verifying whether the unstructured data in the original data and the unstructured data in the second limited data are associated or not;
when the structured data similarity verification module verifies that the structured data in the original data and the structured data in the second limited data are associated, and the unstructured data similarity verification module verifies that the unstructured data in the original data and the unstructured data in the second limited data are associated, the second desensitization verification module judges that the first limited data fails to be desensitized;
otherwise, the second desensitization verification module determines that the first restricted data desensitization was successful.
CN202410303595.1A 2024-03-18 2024-03-18 Verification method and system for limited data desensitization method Active CN117892358B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410303595.1A CN117892358B (en) 2024-03-18 2024-03-18 Verification method and system for limited data desensitization method


Publications (2)

Publication Number Publication Date
CN117892358A true CN117892358A (en) 2024-04-16
CN117892358B CN117892358B (en) 2024-07-05

Family

ID=90647753


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111475855A (en) * 2020-06-24 2020-07-31 支付宝(杭州)信息技术有限公司 Data processing method and device for realizing privacy protection
WO2021212968A1 (en) * 2020-04-24 2021-10-28 华为技术有限公司 Unstructured data processing method, apparatus, and device, and medium
CN113989156A (en) * 2021-11-01 2022-01-28 北京地平线信息技术有限公司 Method, apparatus, medium, device, and program for reliability verification of desensitization method
CN113988226A (en) * 2021-12-29 2022-01-28 深圳红途科技有限公司 Data desensitization validity verification method and device, computer equipment and storage medium
CN116646046A (en) * 2023-07-27 2023-08-25 中日友好医院(中日友好临床医学研究所) Electronic medical record processing method and system based on Internet diagnosis and treatment
CN117421773A (en) * 2023-10-30 2024-01-19 阿维塔科技(重庆)有限公司 Data desensitization processing method, device, equipment and storage medium
CN117556455A (en) * 2023-07-24 2024-02-13 上海凯馨信息科技有限公司 Data desensitization security inspection method


Non-Patent Citations (2)

Title
PING CHEN et al., "Safety verification model of desensitization algorithm for civil aviation passenger data based on statistics", 2023 International Conference on Cyber-enabled Distributed Computing and Knowledge Discovery (CyberC), 21 February 2024.
ZHANG Yu; LYU Xixiang; ZOU Yucong; LI Yige, "Text sequence dataset desensitization based on generative adversarial networks", Chinese Journal of Network and Information Security, no. 04, 15 August 2020.



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant