CN117892358A - Verification method and system for limited data desensitization method - Google Patents
- Publication number
- CN117892358A (application CN202410303595.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- limited
- desensitization
- original
- limited data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
- G06F21/6254—Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
Abstract
The invention provides a verification method and system for a limited data desensitization method, relating to the technical field of limited data, and comprising the following steps: selecting a corresponding desensitization method according to the sensitivity degree and service requirements of the original data, desensitizing the original data, and recording the desensitized data as first limited data; importing the first limited data into an artificial intelligence model as training samples for model training, and recording the output content of the model as second limited data once training is finished; comparing the original data with the second limited data to judge whether they are consistent; if consistent, judging that the desensitization of the first limited data has failed; if inconsistent, verifying the second limited data against the original data to judge whether the desensitization of the first limited data succeeded. The method ensures that, after the limited data is desensitized and used to train an artificial intelligence model, no pattern or rule of the sensitive information is revealed to unauthorized users.
Description
Technical Field
The invention relates to the technical field of limited data, in particular to a verification method and a verification system for a limited data desensitization method.
Background
Restricted (limited) data refers to data whose access and use are restricted and protected in some way, including personal identification information, medical records, financial data, and trade secrets; the restrictions are typically tied to the sensitivity, privacy, or security of the data. Its defining characteristic is that the scope of access and use must be limited.
Because training an artificial intelligence model often requires sensitive data, such as personal identity information, for operations like associating multi-source data, limited data is frequently used for training to improve model performance and accuracy. Limited data is also closer to real application scenarios, helping the model better understand and adapt to varied semantic contexts and draw reasonable inferences. Using limited data as training samples lets the model handle similar scenarios better and provide more accurate predictions and decisions.
However, because limited data contains sensitive information, such as personal identities or specific groups, it may face security risks such as hacking and data leakage during model training, data processing, and data transmission, which can expose the data. The limited data is therefore usually desensitized first; yet even after desensitization, once the data has been used as training samples, the trained model may still reveal patterns or rules of the limited data to unauthorized users, and the limited data may also be misused.
These are the shortcomings of the prior art.
Disclosure of Invention
To address these shortcomings, the invention provides a verification method and system for a limited data desensitization method, ensuring that after desensitization the limited data can be trained on and used by an artificial intelligence model without patterns or rules of the limited data being revealed to unauthorized users.
In a first aspect, the present invention provides a verification method for a limited data desensitization method, the method comprising:
s1, selecting a corresponding desensitization method according to the sensitivity degree and service requirement of original data, carrying out desensitization treatment on the original data, and recording the desensitized data as first limited data;
s2, importing the first limited data into an artificial intelligent model as a training sample to perform model training, and recording the output content of the artificial intelligent model as second limited data after model training is finished;
s3, comparing the original data with the second limited data, and judging whether the second limited data is consistent with the original data in content or not;
if the first limited data are consistent, judging that the desensitization of the first limited data fails;
if not, executing the step S4;
and S4, verifying the second limited data based on the original data, and judging whether the first limited data is desensitized successfully.
Further, before step S2 is executed, data security verification is performed on the first limited data, including:
collecting prohibited words and constructing a security verification word bank;
performing keyword retrieval on the first limited data based on the security verification word bank, and judging whether prohibited words exist in the first limited data;
if prohibited words exist in the first limited data, judging that the desensitization of the first limited data fails;
if no prohibited word exists in the first limited data, executing step S2.
Further, in step S4, based on the original data, similarity verification is performed on the second limited data, including structured data similarity verification and unstructured data similarity verification.
Further, structured data similarity verification includes:
extracting structured data in original data, cleaning and preprocessing the structured data, and constructing a plurality of sets according to data attributes;
extracting structured data in the second limited data for cleaning and preprocessing, and constructing a plurality of sets according to data attributes;
the original data and the set with the same data attribute in the second limited data form a group of sets;
according to the Jaccard similarity coefficient formula:calculating Jaccard similarity coefficients of each group of sets;
judging whether the original data structured data set and the second limited data structured data set are associated in each data attribute according to Jaccard similarity coefficients of each group of sets;
when the Jaccard similarity coefficient of the set is larger than the preset Jaccard similarity coefficient, judging that the original data structured data set and the second limited data structured data set are associated under the data attribute;
wherein A represents the set of a data attribute in the structured data of the original data, B represents the set of the same data attribute in the structured data of the second limited data, A ∪ B represents the union of set A and set B, and A ∩ B represents the intersection of set A and set B.
Further, if the Jaccard similarity coefficients of the set groups of all data attributes are smaller than the preset Jaccard similarity coefficient, it is judged that the structured data in the original data and the structured data in the second limited data are not associated;
otherwise, it is judged that the structured data in the original data and the structured data in the second limited data are associated.
Further, unstructured data similarity verification includes:
extracting unstructured data in the original data and unstructured data in the second limited data, and cleaning and preprocessing;
performing word frequency statistics on unstructured data in the processed original data and unstructured data in the second limited data, counting the occurrence times of each word in the two groups of data, and respectively constructing an original data word frequency matrix and a second limited data word frequency matrix;
converting the original data word frequency matrix and the second limited data word frequency matrix into an original data vector and a second limited data vector respectively, wherein each dimension in the data vector represents the word frequency of a word;
calculation formula according to cosine similarityCalculating cosine similarity of the two data vectors;
if the cosine similarity is greater than the preset cosine similarity, judging that the unstructured data in the original data are associated with the unstructured data in the second limited data;
otherwise, judging that the unstructured data in the original data and the unstructured data in the second limited data are not associated;
wherein A represents the original data vector, B represents the second limited data vector, A · B represents the inner product of the two vectors, and |A| and |B| represent the moduli (lengths) of the original data vector and the second limited data vector, respectively.
Further, in step S4, when it is determined that there is an association between the structured data in the original data and the structured data in the second limited data and that there is an association between the unstructured data in the original data and the unstructured data in the second limited data, it is determined that the desensitization of the first limited data fails;
and otherwise, judging that the desensitization of the first limited data is successful.
In a second aspect, the present invention provides a verification system for a limited data desensitization method, the system comprising:
the desensitization processing module is used for selecting a corresponding desensitization method according to the sensitivity degree and the service requirement of the original data, carrying out desensitization processing on the original data, and recording the desensitized data as first limited data;
the model training module is used for importing the first limited data obtained by the desensitization processing module into the artificial intelligent model as a training sample to perform model training, and recording the output content of the artificial intelligent model as second limited data after the model training is finished;
the first desensitization verification module is used for comparing the original data with the second limited data and judging whether the second limited data is consistent with the original data;
and the second desensitization verification module is used for verifying the second limited data based on the original data when the first desensitization verification module verifies that the second limited data is inconsistent with the original data, and judging whether the first limited data is successfully desensitized.
Further, the system further comprises: a data security verification module, used for performing security verification on the first limited data obtained by the desensitization processing module and judging whether the first limited data contains prohibited words.
Further, the second desensitization verification module comprises a structured data similarity verification module and an unstructured data similarity verification module;
the structured data similarity verification module is used for verifying whether the structured data in the original data and the structured data in the second limited data are associated or not;
the unstructured data similarity verification module is used for verifying whether the unstructured data in the original data and the unstructured data in the second limited data are associated or not;
when the structured data similarity verification module verifies that the structured data in the original data and the structured data in the second limited data are associated, and the unstructured data similarity verification module verifies that the unstructured data in the original data and the unstructured data in the second limited data are associated, the second desensitization verification module judges that the first limited data fails to be desensitized;
otherwise, the second desensitization verification module determines that the first restricted data desensitization was successful.
From the above technical scheme, the invention has the following advantages:
the invention provides a method and a system for verifying a desensitization method of limited data, which are characterized in that after desensitization treatment is carried out on the limited data, the desensitized limited data is used as a sample to be imported into an artificial intelligent model for model training, content and data comparison verification is carried out according to the limited data and model output content, whether the output content of the verification model is associated with the limited data or not is judged, so that the problem that the limited data still contains a mode or rule of the limited data after the desensitization treatment is imported into the artificial intelligent model for training and is revealed to an unauthorized user is avoided, and the safety of the limited data and the personal privacy of the user are ensured.
Drawings
In order to more clearly illustrate the technical solutions of the present invention, the drawings that are needed in the description will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a method of one embodiment of the invention;
FIG. 2 is a schematic flow chart of one specific use of the method of one embodiment of the present invention;
fig. 3 is a schematic block diagram of a system of one embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, the present invention provides a verification method for a limited data desensitization method, which includes:
and 110, selecting a corresponding desensitization method according to the sensitivity degree and the service requirement of the original data, carrying out desensitization treatment on the original data, and recording the desensitized data as first limited data.
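The patent does not fix a particular desensitization method for step 110, so as a minimal sketch we assume masking is the method selected; the field names `name` and `phone` and the keep-head/keep-tail rules are illustrative assumptions, not part of the patent:

```python
def mask_middle(s: str, keep_head: int, keep_tail: int) -> str:
    """Replace the middle of s with '*', keeping the given head/tail characters."""
    middle = max(len(s) - keep_head - keep_tail, 0)
    return s[:keep_head] + "*" * middle + s[len(s) - keep_tail:] if keep_tail else s[:keep_head] + "*" * middle

def desensitize(record: dict) -> dict:
    """Hypothetical masking-based desensitization of one record; the field
    names and masking rules here are illustrative, chosen as if by the
    sensitivity/service-requirement selection of step 110."""
    out = dict(record)
    if "phone" in out:  # keep the first 3 and last 2 digits
        out["phone"] = mask_middle(out["phone"], 3, 2)
    if "name" in out:   # keep only the first character
        out["name"] = mask_middle(out["name"], 1, 0)
    return out
```

The masked output would then be recorded as the first limited data and passed on to the security verification of the next step.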
After the desensitization process, data security verification is performed on the first limited data. First, prohibited words are collected and a security verification word bank is built from them; keyword retrieval is then performed on the first limited data against this word bank to judge whether any prohibited words are present. If the first limited data contains prohibited words, it is judged that the desensitization of the first limited data has failed; if it contains none, step 120 is performed.
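The security verification above can be sketched as a simple lexicon lookup; the word-bank contents below are illustrative placeholders, and a real deployment would use a curated prohibited-word list and tokenization rules suited to the language of the data:

```python
# Illustrative security verification word bank (placeholder entries).
PROHIBITED = {"password", "id_card", "secret_key"}

def passes_security_check(text: str) -> bool:
    """Return True only if no prohibited word appears in the desensitized text.
    This naive version tokenizes on whitespace; production code would need
    language-aware segmentation (e.g. for Chinese text)."""
    tokens = set(text.lower().split())
    return not (PROHIBITED & tokens)
```

A record that fails this check is judged a desensitization failure before any model training takes place.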
Step 120, importing the first limited data into the artificial intelligence model as training samples for model training, and recording the output content of the model as second limited data after training is finished.
The artificial intelligence model into which the first limited data is imported as training samples may, at each stage, be a traditional machine learning model such as a decision tree, random forest, support vector machine, or logistic regression; a deep learning model such as a neural network, convolutional neural network, or recurrent neural network; or a pre-trained model such as BERT, RoBERTa, or XLNet.
BERT (Bidirectional Encoder Representations from Transformers) is a bidirectional, Transformer-based pre-trained model proposed by Google. By training on large-scale text data with self-supervised objectives, BERT learns context-dependent representations of words and sentences, providing good feature expressions for various downstream natural language processing tasks. RoBERTa, developed by Facebook AI, is an improvement and extension of BERT; it pre-trains on larger-scale data for longer and with additional training techniques, further improving model performance. XLNet is a Transformer-based pre-trained model developed by CMU together with Google Brain. Unlike traditional autoregressive pre-training models, XLNet uses a permutation language model objective, so the model can take all possible factorization orders of the context into account during prediction and thereby better capture relationships between words.
It should be noted that the artificial intelligence models into which the first limited data may be imported as training samples in the present invention include, but are not limited to, those listed in the above embodiments.
Step 130, comparing the original data with the second limited data, and judging whether the second limited data is consistent with the original data; if consistent, judging that the desensitization of the first limited data fails; if not, performing step 140.
In step 130, comparing all data of the original data with all data of the second limited data, and if the original data content is completely consistent with the second limited data content, determining that the desensitization of the first limited data fails; if the original data content is not completely consistent with the second limited data content, further verifying the second limited data.
Step 140, verifying the second limited data based on the original data, and judging whether the first limited data was desensitized successfully.
Similarity verification is performed on the second limited data based on the original data, comprising structured data similarity verification and unstructured data similarity verification. According to its organization and characteristics, data may be divided into structured and unstructured data. Structured data is organized in a predefined pattern and format, with explicit data types and relationships, and is typically stored as tables, rows, or in relational databases. Unstructured data has no explicit structure or format; it is highly diverse, complex, and irregular, and exists as text, images, audio, video, and the like.
In this embodiment, the structured data similarity verification includes:
and respectively extracting structured data in the original data and structured data in the second limited data by adopting a regular expression matching method, a natural language processing technology or a database query language method, and performing arrangement and cleaning, including removing repeated values, processing missing values, processing abnormal values and the like.
The structured data in the processed original data and in the second limited data are each organized into a number of sets according to data attributes. The sets with the same data attribute in the original data and the second limited data form a group, and the Jaccard similarity coefficient of each group is then calculated according to the formula J(A, B) = |A ∩ B| / |A ∪ B|.
And judging whether the original data structured data set and the second limited data structured data set are associated in each data attribute according to the Jaccard similarity coefficient of each group of sets. When the Jaccard similarity coefficient of the set is larger than the preset Jaccard similarity coefficient, judging that the original data structure data set and the second limited data structure data set are associated under the data attribute.
In this embodiment, only when Jaccard similarity coefficients of the set group of all data attributes in the original data structured data and the second limited data structured data are smaller than preset Jaccard similarity coefficients, it is determined that no association exists between the structured data in the original data and the structured data in the second limited data; otherwise, judging that the structured data in the original data and the structured data in the second limited data are associated.
Wherein A represents the set of a data attribute in the structured data of the original data, B represents the set of the same data attribute in the structured data of the second limited data, A ∪ B represents the union of set A and set B, and A ∩ B represents the intersection of set A and set B.
It should be noted that, a person skilled in the art may set the value of the preset Jaccard similarity coefficient according to actual needs.
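The Jaccard verification of this embodiment can be sketched as follows; the threshold value of 0.5 stands in for the preset Jaccard similarity coefficient, which the patent leaves for the practitioner to set:

```python
def jaccard(a: set, b: set) -> float:
    """J(A, B) = |A ∩ B| / |A ∪ B|; returns 0.0 for two empty sets by convention."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def attributes_associated(orig: dict, second: dict, threshold: float = 0.5) -> dict:
    """For each data attribute, compare the original-data set against the
    second-limited-data set with the same attribute; an attribute counts as
    associated when its Jaccard coefficient exceeds the (assumed) threshold."""
    return {attr: jaccard(orig[attr], second.get(attr, set())) > threshold
            for attr in orig}
```

Per the embodiment, only if every attribute's coefficient stays below the threshold are the two structured data sets judged unassociated.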
In this embodiment, unstructured data similarity verification includes:
the unstructured data in the original data and the unstructured data in the second limited data are respectively extracted by adopting the technologies of text classification, emotion analysis, keyword extraction and the like or using natural language processing technology, and are arranged and cleaned, including repeated value removal, missing value processing, abnormal value processing and the like. And carrying out word frequency statistics on unstructured data in the processed original data and unstructured data in the second limited data, counting the occurrence times of each word in the two groups of data, respectively constructing an original data word frequency matrix and a second limited data word frequency matrix, and respectively converting the original data word frequency matrix and the second limited data word frequency matrix into an original data vector and a second limited data vector, wherein each dimension in the data vector represents the word frequency of one word. Calculation formula according to cosine similarityCosine similarity of the two data vectors is calculated.
If the cosine similarity is greater than the preset cosine similarity, judging that the unstructured data in the original data are associated with the unstructured data in the second limited data; otherwise, judging that the unstructured data in the original data and the unstructured data in the second limited data are not associated.
Wherein A represents the original data vector, B represents the second limited data vector, A · B represents the inner product of the two vectors, and |A| and |B| represent the moduli (lengths) of the original data vector and the second limited data vector, respectively.
It should be noted that, a person skilled in the art may set the value of the preset cosine similarity according to actual needs.
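The word-frequency vectors and cosine similarity described above can be sketched with the standard library; whitespace tokenization is an assumption for brevity (real text would first need proper segmentation and cleaning):

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Build word-frequency vectors over the union vocabulary and return
    cos θ = (A · B) / (|A| |B|); 0.0 when either vector is all zeros."""
    fa = Counter(text_a.lower().split())
    fb = Counter(text_b.lower().split())
    vocab = set(fa) | set(fb)
    dot = sum(fa[w] * fb[w] for w in vocab)          # inner product A · B
    norm_a = math.sqrt(sum(v * v for v in fa.values()))  # |A|
    norm_b = math.sqrt(sum(v * v for v in fb.values()))  # |B|
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

The result is then compared against the preset cosine similarity to decide whether the unstructured data are associated.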
In this embodiment, when it is determined that there is an association between structured data in the original data and structured data in the second limited data and an association between unstructured data in the original data and unstructured data in the second limited data, it is determined that the first limited data fails to be desensitized; and otherwise, judging that the desensitization of the first limited data is successful.
To facilitate understanding of the present invention, the verification method for the limited data desensitization method is further described below with reference to its use in an embodiment.
As shown in fig. 2, the verification process for the limited data desensitization method includes:
Select a corresponding desensitization method according to the sensitivity degree and service requirements of the original data, desensitize the original data, and record the desensitized data as first limited data.
Perform data security verification on the first limited data and judge whether it contains prohibited words; if so, judge that the desensitization of the first limited data has failed.
If not, import the first limited data into the artificial intelligence model as training samples; after training finishes, record the model's output content as second limited data. Compare the original data with the second limited data and judge whether their contents are consistent; if so, judge that the desensitization of the first limited data has failed.
If not, extract the structured data from the original data and from the second limited data, perform structured data similarity verification, and judge whether the two are associated; if so, judge that the desensitization of the first limited data has failed.
If not, extract the unstructured data from the original data and from the second limited data, perform unstructured data similarity verification, and judge whether the two are associated; if so, judge that the desensitization of the first limited data has failed; if not, the desensitization of the first limited data has succeeded.
As shown in fig. 3, as an embodiment of the present invention, the present invention further provides a verification system for a limited data desensitization method, the system 200 comprising:
the desensitization processing module 210 is configured to select a corresponding desensitization method according to the sensitivity degree and the service requirement of the original data, perform desensitization processing on the original data, and record the desensitized data as first limited data;
the model training module 220 is configured to import the first limited data obtained by the desensitization processing module as a training sample into the artificial intelligence model for model training, and record the output content of the artificial intelligence model as second limited data after the model training is completed;
the first desensitization verification module 230 is configured to compare the original data with the second limited data, and determine whether the second limited data is consistent with the original data;
and the second desensitization verification module 240 is configured to, when the first desensitization verification module verifies that the second limited data is inconsistent with the original data content, verify the second limited data based on the original data, and determine whether the first limited data is desensitized successfully.
As an embodiment of the present invention, the system 200 further comprises: the data security verification module 250 is configured to perform security verification on the first limited data obtained by the desensitization processing module, and determine whether illegal words exist in the first limited data.
As one embodiment of the present invention, the second desensitization verification module includes a structured data similarity verification module and an unstructured data similarity verification module;
the structured data similarity verification module is used for verifying whether the structured data in the original data and the structured data in the second limited data are associated or not;
the unstructured data similarity verification module is used for verifying whether the unstructured data in the original data and the unstructured data in the second limited data are associated or not;
when the structured data similarity verification module verifies that the structured data in the original data and the structured data in the second limited data are associated, and the unstructured data similarity verification module verifies that the unstructured data in the original data and the unstructured data in the second limited data are associated, the second desensitization verification module judges that the first limited data fails to be desensitized; otherwise, the second desensitization verification module determines that the first restricted data desensitization was successful.
The invention provides a verification method and system for a limited data desensitization method. After desensitization processing is performed on the limited data, the desensitized limited data is imported as a sample into an artificial intelligence model for model training, and content and data comparison verification is performed on the limited data and the model output content to judge whether the model output content is associated with the limited data. This avoids the problem that limited data which, after desensitization processing, still contains the patterns or rules of the original limited data is imported into an artificial intelligence model for training and leaked to unauthorized users, thereby ensuring the security of the limited data and the personal privacy of users.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A verification method for a limited data desensitization method, the method comprising:
S1, selecting a corresponding desensitization method according to the sensitivity degree and the service requirement of the original data, performing desensitization processing on the original data, and recording the desensitized data as first limited data;
S2, importing the first limited data into an artificial intelligence model as a training sample for model training, and recording the output content of the artificial intelligence model as second limited data after model training is finished;
S3, comparing the original data with the second limited data, and judging whether the second limited data is consistent with the original data in content;
if consistent, judging that the desensitization of the first limited data fails;
if not, executing step S4;
S4, verifying the second limited data based on the original data, and judging whether the desensitization of the first limited data is successful.
2. The verification method for a limited data desensitization method according to claim 1, wherein before step S2 is performed, data security verification is performed on the first limited data, comprising:
collecting illegal words and constructing a security verification word stock;
based on the security verification word stock, keyword retrieval is carried out on the first limited data, and whether illegal words exist in the first limited data is judged;
if the first limited data has illegal words, judging that the desensitization of the first limited data fails;
if no illegal words exist in the first limited data, step S2 is performed.
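The keyword retrieval of claim 2 can be sketched as follows, assuming the security verification word stock is a plain set of strings and that simple substring matching suffices; both are illustrative simplifications, not the claimed implementation.

```python
def security_check(first_limited_text, word_stock):
    """Return the illegal words found in the text.

    An empty result means the security verification passes; any hit means
    the desensitization of the first limited data is judged to have failed.
    """
    return {word for word in word_stock if word in first_limited_text}
```

In practice the word stock would be built by collecting illegal words in advance, and a production system might use tokenization or an Aho-Corasick automaton instead of per-word substring scans.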
3. The method of claim 1, wherein in step S4, the second limited data is subjected to similarity verification based on the original data, including structured data similarity verification and unstructured data similarity verification.
4. The verification method for a limited data desensitization method according to claim 3, wherein structured data similarity verification comprises:
extracting structured data in original data, cleaning and preprocessing the structured data, and constructing a plurality of sets according to data attributes;
extracting structured data in the second limited data for cleaning and preprocessing, and constructing a plurality of sets according to data attributes;
the original data and the set with the same data attribute in the second limited data form a group of sets;
according to the Jaccard similarity coefficient formula J(A, B) = |A ∩ B| / |A ∪ B|, calculating the Jaccard similarity coefficient of each group of sets;
judging whether the original data structured data set and the second limited data structured data set are associated in each data attribute according to Jaccard similarity coefficients of each group of sets;
when the Jaccard similarity coefficient of the set is larger than the preset Jaccard similarity coefficient, judging that the original data structured data set and the second limited data structured data set are associated under the data attribute;
wherein A represents a set of a data attribute in the structured data of the original data, B represents a set of the same data attribute in the structured data of the second limited data, A ∪ B represents the union of set A and set B, and A ∩ B represents the intersection of set A and set B.
5. The verification method for a limited data desensitization method according to claim 4, wherein if the Jaccard similarity coefficients of the sets of all data attributes are smaller than the preset Jaccard similarity coefficient, it is determined that the structured data in the original data and the structured data in the second limited data are not associated;
otherwise, judging that the structured data in the original data and the structured data in the second limited data are associated.
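The per-attribute Jaccard check of claims 4 and 5 can be sketched as follows; the attribute-keyed dictionaries and the 0.5 preset coefficient are illustrative assumptions, not values from the claims.

```python
def jaccard(a, b):
    """J(A, B) = |A intersection B| / |A union B|; defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def structured_associated(original_attrs, limited_attrs, threshold=0.5):
    """True if any shared data attribute's set pair exceeds the preset coefficient.

    original_attrs / limited_attrs map each data attribute name to the set
    built from the cleaned and preprocessed structured data.
    """
    shared = original_attrs.keys() & limited_attrs.keys()
    return any(jaccard(original_attrs[k], limited_attrs[k]) > threshold
               for k in shared)
```

Per claim 5, only when every attribute pair stays at or below the preset coefficient is the structured data judged to be unassociated.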
6. The verification method for a limited data desensitization method according to claim 3, wherein unstructured data similarity verification comprises:
extracting unstructured data in the original data and unstructured data in the second limited data, and cleaning and preprocessing;
performing word frequency statistics on unstructured data in the processed original data and unstructured data in the second limited data, counting the occurrence times of each word in the two groups of data, and respectively constructing an original data word frequency matrix and a second limited data word frequency matrix;
converting the original data word frequency matrix and the second limited data word frequency matrix into an original data vector and a second limited data vector respectively, wherein each dimension in the data vector represents the word frequency of a word;
according to the cosine similarity calculation formula cos θ = (A · B) / (|A| × |B|), calculating the cosine similarity of the two data vectors;
if the cosine similarity is greater than the preset cosine similarity, judging that the unstructured data in the original data are associated with the unstructured data in the second limited data;
otherwise, judging that the unstructured data in the original data and the unstructured data in the second limited data are not associated;
wherein A represents the original data vector, B represents the second limited data vector, A · B represents the inner product of the original data vector and the second limited data vector, and |A| and |B| represent the modulus lengths of the original data vector and the second limited data vector, respectively.
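The word-frequency cosine check of claim 6 can be sketched as follows; whitespace tokenization stands in for the cleaning and preprocessing step and is an illustrative simplification, not the claimed implementation.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """cos(theta) = (A . B) / (|A| * |B|) over word-frequency vectors."""
    # Word frequency statistics: count occurrences of each word in both texts
    freq_a, freq_b = Counter(text_a.split()), Counter(text_b.split())
    # Inner product over the shared vocabulary (missing words count as 0)
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a)
    # Modulus lengths of the two word-frequency vectors
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

A cosine similarity above the preset value would then indicate that the unstructured data in the original data and in the second limited data are associated.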
7. The verification method for a limited data desensitization method according to claim 1, wherein in step S4, when it is determined that the structured data in the original data and the structured data in the second limited data are associated and the unstructured data in the original data and the unstructured data in the second limited data are associated, it is determined that the desensitization of the first limited data fails;
and otherwise, judging that the desensitization of the first limited data is successful.
8. A verification system for a limited data desensitization method, the system comprising:
the desensitization processing module is used for selecting a corresponding desensitization method according to the sensitivity degree and the service requirement of the original data, carrying out desensitization processing on the original data, and recording the desensitized data as first limited data;
the model training module is used for importing the first limited data obtained by the desensitization processing module into the artificial intelligence model as a training sample for model training, and recording the output content of the artificial intelligence model as second limited data after model training is finished;
the first desensitization verification module is used for comparing the original data with the second limited data and judging whether the second limited data is consistent with the original data;
and the second desensitization verification module is used for verifying the second limited data based on the original data when the first desensitization verification module verifies that the second limited data is inconsistent with the original data, and judging whether the first limited data is successfully desensitized.
9. The verification system for a limited data desensitization method according to claim 8, wherein the system further comprises: a data security verification module, used for performing security verification on the first limited data obtained by the desensitization processing module and judging whether illegal words exist in the first limited data.
10. The verification system for a limited data desensitization method according to claim 8, wherein the second desensitization verification module comprises a structured data similarity verification module and an unstructured data similarity verification module;
the structured data similarity verification module is used for verifying whether the structured data in the original data and the structured data in the second limited data are associated or not;
the unstructured data similarity verification module is used for verifying whether the unstructured data in the original data and the unstructured data in the second limited data are associated or not;
when the structured data similarity verification module verifies that the structured data in the original data and the structured data in the second limited data are associated, and the unstructured data similarity verification module verifies that the unstructured data in the original data and the unstructured data in the second limited data are associated, the second desensitization verification module judges that the first limited data fails to be desensitized;
otherwise, the second desensitization verification module determines that the first restricted data desensitization was successful.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410303595.1A CN117892358B (en) | 2024-03-18 | 2024-03-18 | Verification method and system for limited data desensitization method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117892358A | 2024-04-16
CN117892358B | 2024-07-05
Family
ID=90647753
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410303595.1A Active CN117892358B (en) | 2024-03-18 | 2024-03-18 | Verification method and system for limited data desensitization method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117892358B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111475855A (en) * | 2020-06-24 | 2020-07-31 | 支付宝(杭州)信息技术有限公司 | Data processing method and device for realizing privacy protection |
WO2021212968A1 (en) * | 2020-04-24 | 2021-10-28 | 华为技术有限公司 | Unstructured data processing method, apparatus, and device, and medium |
CN113989156A (en) * | 2021-11-01 | 2022-01-28 | 北京地平线信息技术有限公司 | Method, apparatus, medium, device, and program for reliability verification of desensitization method |
CN113988226A (en) * | 2021-12-29 | 2022-01-28 | 深圳红途科技有限公司 | Data desensitization validity verification method and device, computer equipment and storage medium |
CN116646046A (en) * | 2023-07-27 | 2023-08-25 | 中日友好医院(中日友好临床医学研究所) | Electronic medical record processing method and system based on Internet diagnosis and treatment |
CN117421773A (en) * | 2023-10-30 | 2024-01-19 | 阿维塔科技(重庆)有限公司 | Data desensitization processing method, device, equipment and storage medium |
CN117556455A (en) * | 2023-07-24 | 2024-02-13 | 上海凯馨信息科技有限公司 | Data desensitization security inspection method |
Non-Patent Citations (2)
Title |
---|
PING CHEN et al.: "Safety verification model of desensitization algorithm for civil aviation passenger data based on statistics", 2023 International Conference on Cyber-enabled Distributed Computing and Knowledge Discovery (CyberC), 21 February 2024 (2024-02-21) * |
ZHANG Yu; LÜ Xixiang; ZOU Yucong; LI Yige: "Desensitization of text sequence data sets based on generative adversarial networks", Chinese Journal of Network and Information Security (网络与信息安全学报), no. 04, 15 August 2020 (2020-08-15) * |
Also Published As
Publication number | Publication date |
---|---|
CN117892358B (en) | 2024-07-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11574077B2 (en) | Systems and methods for removing identifiable information | |
US11127403B2 (en) | Machine learning-based automatic detection and removal of personally identifiable information | |
CN113051371B (en) | Chinese machine reading understanding method and device, electronic equipment and storage medium | |
Luo et al. | A CNN-based Approach to the Detection of SQL Injection Attacks | |
CN106991312B (en) | Internet anti-fraud authentication method based on voiceprint recognition | |
CN112580352B (en) | Keyword extraction method, device and equipment and computer storage medium | |
CN113593661A (en) | Clinical term standardization method, device, electronic equipment and storage medium | |
CN115314268B (en) | Malicious encryption traffic detection method and system based on traffic fingerprint and behavior | |
CN111460114A (en) | Retrieval method, device, equipment and computer readable storage medium | |
CN115687980A (en) | Desensitization classification method of data table, and classification model training method and device | |
CN117520503A (en) | Financial customer service dialogue generation method, device, equipment and medium based on LLM model | |
CN110990834B (en) | Static detection method, system and medium for android malicious software | |
CN117272142A (en) | Log abnormality detection method and system and electronic equipment | |
CN116401343A (en) | Data compliance analysis method | |
CN117271558A (en) | Language query model construction method, query language acquisition method and related devices | |
CN117708297A (en) | Query statement generation method and device, electronic equipment and storage medium | |
CN112966507A (en) | Method, device, equipment and storage medium for constructing recognition model and identifying attack | |
CN117892358B (en) | Verification method and system for limited data desensitization method | |
Bisogni et al. | Multibiometric score-level fusion through optimization and training | |
CN106446696A (en) | Information processing method and electronic device | |
CN112668284B (en) | Legal document segmentation method and system | |
CN114741088A (en) | App source code linking method based on user comments and developer intelligence | |
CN117077680A (en) | Question and answer intention recognition method and device | |
CN113259369A (en) | Data set authentication method and system based on machine learning member inference attack | |
CN117272123B (en) | Sensitive data processing method and device based on large model and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||