CN116049877B

CN116049877B - Method, system, equipment and storage medium for identifying and desensitizing private data

Info

Publication number: CN116049877B
Application number: CN202211741149.6A
Authority: CN
Inventors: 陈耀远
Original assignee: China Asean Information Harbor Co ltd
Current assignee: China Asean Information Harbor Co ltd
Priority date: 2022-12-30
Filing date: 2022-12-30
Publication date: 2024-05-28
Anticipated expiration: 2042-12-30
Also published as: CN116049877A

Abstract

The invention discloses a method for identifying and desensitizing private data, belonging to the technical field of data security. It addresses the high complexity, high implementation cost, and heavy manual effort of the prior art. The method comprises the following steps: identifying and extracting private data based on transfer learning; desensitizing the extracted private data according to a set desensitization strategy to obtain desensitized data; synchronizing the desensitized data to a data out-of-domain gateway interface and setting a data fusing threshold, so that desensitized data conforming to the fusing threshold is output normally through the interface, while desensitized data that does not conform is reported to a management center and the fusing mechanism is triggered. The invention also discloses a system, a device, and a storage medium for identifying and desensitizing private data. The invention can identify, monitor, and process sensitive data in diverse and changing data-transaction private data, reducing the risk of sensitive data leaking during use, storage, and similar stages, and reducing the security threats that such leakage causes.

Description

Method, system, equipment and storage medium for identifying and desensitizing private data

Technical Field

The present invention relates to the field of data security technologies, and in particular, to a method, a system, a device, and a storage medium for identifying and desensitizing private data.

Background

With the widespread adoption of big data in recent years, the value of data resources has gradually been recognized, and the demand for data transactions is increasing.

Data, as a new factor of production, is the foundation of digitalization, networking, and intelligent systems. It is rapidly being integrated into production, distribution, circulation, consumption, and social service management, profoundly changing modes of production, lifestyles, and social governance, and it plays an increasingly prominent role in improving production efficiency, quality of life, and innovation in social governance.

The data transaction market in China is growing rapidly, and the timely discovery and management of private information has become a pressing problem. There are many data-element circulation markets. The first problem is how to automatically find private data in massive transaction data, and the second is how to carry out compliance desensitization review of platform information based on the content of that private data. Resolving the trust and compliance problems of data transactions in the data-element economy has become one of the core issues of a data circulation system.

At present, privacy-information identification rules and desensitization on data transaction platforms rely mainly on manual collection, which wastes labor and time. An automated method for information discovery, extraction, and desensitization is urgently needed.

In recent years, artificial intelligence technology has developed rapidly and has advanced the field of natural language processing, in which text classification is used to categorize texts with different characteristics, and named entity recognition is mainly used for information extraction and text data structuring.

Named entity recognition methods in the prior art fall mainly into methods based on traditional machine learning and methods based on deep learning. A method based on traditional machine learning, such as that of patent application CN111274804A, learns a statistical model from labeled data, feeds the data to be predicted into the model, and uses the Viterbi algorithm to compute the most probable entity. A method based on deep learning, such as that of patent application CN111126068A, constructs a neural network model to learn semantic features and can capture more complex semantics, but it requires a large amount of labeled data, and data labeling is extremely time-consuming and labor-intensive.

Disclosure of Invention

The invention aims to solve the above technical problems in the prior art. Its first object is to provide a method for identifying and desensitizing private data that is efficient and operates in real time, and that identifies, monitors, and processes sensitive data in diverse and changing data-transaction private data, thereby reducing the risk of sensitive data leaking during use, storage, and similar stages and reducing the security threats that such leakage causes.

It is a second object of the present invention to provide a system for private data identification and desensitization.

The third object of the present invention is to provide a computer device.

It is a fourth object of the present invention to provide a computer readable storage medium.

To achieve the above object, the present invention provides a method for identifying and desensitizing private data, comprising the steps of:

S1, identifying and extracting private data based on a transfer learning technology;

S2, desensitizing the extracted private data according to a set desensitization strategy to obtain desensitized data;

S3, synchronizing the desensitized data to a data out-of-domain gateway interface, setting a data fusing threshold, and normally outputting the desensitized data conforming to the fusing threshold through the interface; and reporting the desensitized data which does not accord with the fusing threshold value to a management center and realizing a fusing mechanism.

As a further improvement, in step S1, multi-dimensional analysis and judgment is performed on the key information extracted according to the characteristics of the private data, so as to carry out classification and named entity recognition.

Further, the transfer learning is based on the fusion of feature representation and relational knowledge: knowledge is encoded in the form of features and transferred from the source domain to the target domain, thereby improving task performance in the target domain.

Further, for knowledge transfer between related domains, it is assumed that the relationships between data are the same in the source domain and the target domain; data from different domains is added, and the extracted knowledge is propagated backwards to generate virtual data for categories that are not visible in the feature space; adding data enhances the generalization capability of the semantic mapping function;

Through feature representation and relational knowledge fusion, transfer learning automatically completes the classification and information sampling of data transactions throughout the whole process, and the private data of the classified data-transaction samples can be gradually corrected and enriched.
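The idea of encoding knowledge as features in a source domain and carrying it into a target domain can be illustrated with a deliberately small sketch. It is not the patent's model: the shape features, the nearest-centroid classifier, the blending `weight`, and all sample values are assumptions chosen only to make the transfer step visible.

```python
import math
from collections import defaultdict


def features(text):
    """Map a field value to a small shape-feature vector (hypothetical features)."""
    n = max(len(text), 1)
    digits = sum(ch.isdigit() for ch in text)
    letters = sum(ch.isalpha() for ch in text)
    return [digits / n, letters / n, 1.0 if "@" in text else 0.0, min(len(text), 20) / 20]


def centroids(samples):
    """Average the feature vectors of each label: the knowledge encoded as features."""
    grouped = defaultdict(list)
    for text, label in samples:
        grouped[label].append(features(text))
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in grouped.items()}


def transfer(source_model, target_samples, weight=0.5):
    """Carry source-domain centroids into the target domain and adapt them."""
    merged = dict(source_model)
    for label, vec in centroids(target_samples).items():
        if label in merged:
            # Known category: blend source knowledge with the few target examples.
            merged[label] = [(1 - weight) * s + weight * t
                             for s, t in zip(merged[label], vec)]
        else:
            # Category unseen in the source domain: take it from the target data.
            merged[label] = vec
    return merged


def classify(model, text):
    """Return the label whose centroid is nearest to the text's feature vector."""
    vec = features(text)
    return min(model, key=lambda label: math.dist(model[label], vec))


# Source domain: plentiful labeled fields. Target domain: only two labeled examples.
source = [("13812345678", "phone"), ("13900001111", "phone"),
          ("alice@example.com", "email"), ("bob@example.org", "email"),
          ("Zhang Wei", "name"), ("Li Na", "name")]
target = [("138-1234-5678", "phone"), ("ORD-20221230-0001", "order_id")]

source_model = centroids(source)
target_model = transfer(source_model, target)
print(classify(target_model, "139-0000-1111"))
print(classify(target_model, "ORD-20240101-0007"))
```

The target domain contributes only two labeled examples, yet the adapted model still recognizes e-mail addresses and names, because those centroids were learned entirely in the source domain.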

Further, in step S2, the sensitive data identification rule is parsed, and the full set of sensitive data is desensitized and returned, including the following steps:

S21, acquiring the desensitization data strategy for the improved transfer learning, and then acquiring the data source type;

S22, parsing the sensitive data identification and desensitization strategy, and binding the sample data source to identify and desensitize the sensitive data, obtaining desensitized data;

S23, parsing a preview instruction from the management center, and sending the desensitized data source to the management center;

S24, for formatted data, asymmetrically encrypting the whole field and embedding the related decryption key in a new field of the data source; if the data source later needs to be decrypted, a certificate issued by the user and the data source decryption key are submitted to the management center to obtain the method for decrypting the whole data source;

S25, for non-formatted data, encrypting the field;

S26, establishing a connection with the management center, and reporting the file desensitization progress in real time.
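Step S24 can be sketched as follows. This is a toy illustration only: it uses textbook RSA with tiny primes so that it runs with the standard library, which is not secure and is not stated in the patent. The `_key_id` field name, the `key-001` identifier, and the reading that the new field holds a reference to the decryption key kept by the management center are all assumptions.

```python
# Toy RSA parameters (textbook example, NOT secure): n = 61 * 53.
N, E = 3233, 17                          # public key, usable by the gateway
D = pow(E, -1, (61 - 1) * (53 - 1))      # private exponent, held by the key owner


def encrypt_field(value):
    """Encrypt a whole field byte by byte with the public key (E, N)."""
    return "-".join(format(pow(b, E, N), "x") for b in value.encode("utf-8"))


def decrypt_field(cipher):
    """Decrypt a field with the private exponent D."""
    if not cipher:
        return ""
    return bytes(pow(int(part, 16), D, N) for part in cipher.split("-")).decode("utf-8")


def desensitize_record(record, sensitive_fields, key_id):
    """Encrypt the sensitive fields of a formatted record and add a key reference."""
    out = {name: encrypt_field(value) if name in sensitive_fields else value
           for name, value in record.items()}
    out["_key_id"] = key_id  # hypothetical new field pointing at the decryption key
    return out


record = {"name": "Zhang Wei", "phone": "13812345678", "city": "Nanning"}
masked = desensitize_record(record, {"name", "phone"}, "key-001")
print(masked)
print(decrypt_field(masked["phone"]))
```

A real deployment would replace the toy arithmetic with a vetted cryptographic library and padded encryption; the sketch only shows where the encrypted field and the key reference sit in the record.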

Further, in step S3, out-of-domain traffic is monitored using the time sequence model and the firewall device; the desensitized data source is downloaded using the data download server information in the data download rule; this comprises the following steps:

S31, acquiring the out-of-domain access rule for desensitized data from the management center, and synchronously outputting the desensitized data through the interface;

S32, establishing a related time sequence data model according to the data traffic of the desensitized data at the interface over a period of time;

S33, acquiring the internal firewall mechanism of the management center, and setting a data fusing threshold to prevent malicious recognition;

S34, desensitized data that conforms to the fusing threshold setting is output normally through the interface; desensitized data that does not conform is reported to the management center and the fusing mechanism is triggered.
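The fusing behaviour of steps S31 to S34 resembles a circuit breaker on outbound traffic. The sketch below is one possible reading, not the patent's implementation: the sliding-window byte counter, the class name `OutboundFuse`, its parameters, and the `report` callback standing in for the management center are all assumptions.

```python
from collections import deque


class OutboundFuse:
    """Trip when outbound volume in a sliding time window exceeds a threshold."""

    def __init__(self, max_bytes, window_seconds, report):
        self.max_bytes = max_bytes            # fusing threshold (assumed to be a byte budget)
        self.window_seconds = window_seconds  # length of the observation window
        self.report = report                  # callback standing in for the management center
        self.events = deque()                 # (timestamp, size) of recent transfers
        self.total = 0
        self.tripped = False

    def allow(self, timestamp, size):
        """Return True if the transfer may leave the domain, False if it is blocked."""
        if self.tripped:
            return False
        # Forget transfers that have fallen out of the window.
        while self.events and self.events[0][0] <= timestamp - self.window_seconds:
            _, old_size = self.events.popleft()
            self.total -= old_size
        if self.total + size > self.max_bytes:
            # Threshold exceeded: report and stop all further output.
            self.tripped = True
            self.report({"time": timestamp, "window_bytes": self.total,
                         "rejected_bytes": size})
            return False
        self.events.append((timestamp, size))
        self.total += size
        return True


reports = []
fuse = OutboundFuse(max_bytes=1000, window_seconds=60, report=reports.append)
print(fuse.allow(0, 400), fuse.allow(10, 400), fuse.allow(20, 400))
print(reports)
```

Once tripped, the fuse stays open until an operator resets it, which mirrors the usual circuit-breaker design; whether the patent intends automatic recovery is not stated.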

Further, the time sequence data model is:

Where v represents the text content in the website, and p(c₁|v) represents the probability that the text belongs to c₁, obtained from the time sequence data and the firewall. g(c₁, v) is used to determine whether the data out-of-domain gateway v belongs to the abnormal data class c₁. If g(c₁, v) ≥ 0, then v belongs to the abnormal data class c₁ within a certain period of time.

To achieve the second object, the present invention provides a system for identifying and desensitizing private data, comprising:

The identification module is used for identifying and extracting the privacy data based on the transfer learning technology;

The desensitization module is used for carrying out desensitization processing on the extracted privacy data according to a set desensitization strategy to obtain desensitized data;

The output module is used for synchronizing the desensitized data to the data out-of-domain gateway interface, setting a data fusing threshold, and normally outputting through the interface the desensitized data that conforms to the fusing threshold; desensitized data that does not conform to the fusing threshold is reported to a management center and the fusing mechanism is triggered.

In order to achieve the third object, the present invention provides a computer device, including a memory and a processor, the memory storing a computer program, the processor implementing a method of privacy data identification and desensitization as described above when executing the computer program.

In order to achieve the fourth object, the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of private data identification and desensitization as described above.

Advantageous effects

Compared with the prior art, the invention has the advantages that:

The invention uses transfer learning to classify and extract private information, automatically encrypts the data using asymmetric encryption, and automatically and securely distributes the desensitized sensitive key information. The invention can automatically complete the classification and information sampling of private data in data transactions throughout the whole process, and can gradually correct and enrich the identification of the classified data-transaction private data, thereby achieving continuous evolution and improvement.

Drawings

FIG. 1 is a schematic diagram of the present invention;

FIG. 2 is a business flow chart.

Detailed Description

The invention will be further described with reference to specific embodiments in the drawings.

Referring to fig. 1-2, a method for identifying and desensitizing private data includes the steps of:

S1, private data is identified and extracted based on a transfer learning technology. Key information is extracted according to the characteristics of the private data, such as important data like personal names, mobile phone numbers, identity card numbers, and longitude and latitude, together with related characteristics of personal data. The extracted key information is used for multi-dimensional analysis and judgment so as to carry out classification and named entity recognition, for example on texts, text images, and formatted data.

1) Transfer learning based on the fusion of feature representation and relational knowledge: knowledge is encoded in the form of features and transferred from the source domain to the target domain, thereby improving task performance in the target domain.

2) Knowledge transfer between related domains: it is assumed that the relationships between data are the same in the source domain and the target domain; data from different domains is added, and the extracted knowledge is propagated backwards to generate virtual data for categories that are not visible in the feature space; adding data enhances the generalization capability of the semantic mapping function.

The improved transfer learning model is used to parse the sensitive data identification rule, scan the data sources in batches, determine the sensitive data file type, identify sensitive data in the data source content, and record identification log information.

The input is the result of the policy module parsing the sensitive data identification policy, which is fed to the sensitive data identification algorithm engine.

1) Acquire and parse the data acquisition rule configuration policy of the management center, and then acquire the data source.

2) Acquire and parse the sensitive data identification policy of the management center, and then identify sensitive data in the data source.

3) Generate a sensitive data identification log from the identification result and upload it to the management center.
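The three steps above (obtain the data source, apply the identification policy, produce a log) can be sketched with a rule-based stand-in for the identification engine. The regular expressions, the rule names, and the log layout are assumptions for illustration; the patent's engine is the improved transfer learning model, not a fixed rule table.

```python
import json
import re

# Hypothetical identification policy: one pattern per sensitive data type.
RULES = {
    "mobile_phone": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    "id_card": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
    "lat_long": re.compile(r"-?\d{1,3}\.\d{4,},\s*-?\d{1,3}\.\d{4,}"),
}


def identify(source_name, text, rules=RULES):
    """Scan one data source and return identification log entries.

    Each entry records the type and position of a match, not the raw value,
    so that the log itself does not leak the sensitive data.
    """
    log = []
    for data_type, pattern in rules.items():
        for match in pattern.finditer(text):
            log.append({"source": source_name, "type": data_type,
                        "start": match.start(), "end": match.end()})
    return log


text = "Contact 13812345678, ID 11010519900307123X, located at 22.8170, 108.3665."
log = identify("orders.csv", text)
print(json.dumps(log, indent=2))
```

Uploading the log to the management center is left out; in this sketch the JSON produced by `json.dumps` is what would be sent.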

S2, the extracted private data is desensitized according to a set desensitization strategy to obtain desensitized data. The sensitive data identification rule is parsed, and the full set of sensitive data is desensitized and returned, including the following steps:

S25, for non-formatted data, encrypting the field;

S3, the desensitized data is synchronized to a data out-of-domain gateway interface and a data fusing threshold is set; desensitized data that conforms to the fusing threshold is output normally through the interface, and desensitized data that does not conform is reported to a management center and the fusing mechanism is triggered. Out-of-domain traffic is monitored using a time sequence model and a firewall device; the desensitized data source is downloaded using the data download server information in the data download rule; this comprises the following steps:

The time sequence data model is as follows:

Several kinds of policy information are described below:

1) Device information, user information, and user rights authorization mechanisms.

2) The sensitive data analysis model, which classifies sensitive data and initializes data extraction.

3) Parsing the remote data source policy configuration information obtained from the management center and sampling fields for analysis; if sensitive data is found, the analysis is issued.

5) Parsing the sensitive data identification policy: parsing the sensitive data identification policy configuration information fields of the management center.

6) Parsing the sensitive data desensitization policy: parsing the sensitive data desensitization policy configuration information fields of the management center.

7) Generating the sensitive data report: responsible for analyzing sensitive data statistics in the gateway and generating the report.

Policy synchronization requires the static desensitization gateway to monitor the management center in real time and fully synchronize every item of its policy information.
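Full synchronization of policy information can be sketched as a comparison of content digests between the gateway's local copy and the management center's copy. The dictionary layout, the policy names, and the use of SHA-256 digests are assumptions; the patent does not describe the synchronization protocol.

```python
import copy
import hashlib
import json


def digest(policy):
    """Return a stable fingerprint of one policy's content."""
    return hashlib.sha256(json.dumps(policy, sort_keys=True).encode("utf-8")).hexdigest()


def sync_policies(local, remote):
    """Make `local` a full copy of `remote` and return the names that changed."""
    changed = []
    for name, policy in remote.items():
        if name not in local or digest(local[name]) != digest(policy):
            local[name] = copy.deepcopy(policy)   # new or updated policy
            changed.append(name)
    for name in list(local):
        if name not in remote:
            del local[name]                       # policy withdrawn by the center
            changed.append(name)
    return sorted(changed)


local = {"identify": {"rules": ["phone"]}, "old": {}}
remote = {"identify": {"rules": ["phone", "id_card"]},
          "desensitize": {"mode": "encrypt"}}
print(sync_policies(local, remote))
```

Calling `sync_policies` on a timer or on a change notification would approximate the real-time monitoring the text asks for; a second call with unchanged policies returns an empty list.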

To display the gateway's running state in real time, raise timely fault alarms, and allow review of historical states, a time sequence feature model and a related firewall monitoring and alarm system are constructed. This enables unified analysis of and alerting on logs and security events from security devices across the whole network, the discovery, detection, and alerting of advanced and unknown threats, and the provision of security event reports. Abnormal network data packets are detected and blocked and an alarm is raised, thereby providing a gateway device for secure distribution.

A system for private data identification and desensitization, comprising:

A computer device comprising a memory storing a computer program and a processor implementing a method of private data identification and desensitization as described above when the processor executes the computer program.

A computer readable storage medium having stored thereon a computer program which when executed by a processor implements a method of privacy data identification and desensitization as described above.

The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and improvements can be made by those skilled in the art without departing from the structure of the present invention, and these do not affect the effect of the implementation of the present invention and the utility of the patent.

Claims

1. A method of private data identification and desensitization comprising the steps of:

S3, synchronizing the desensitized data to a data out-of-domain gateway interface, setting a data fusing threshold, and normally outputting through the interface the desensitized data that conforms to the fusing threshold; desensitized data that does not conform to the fusing threshold is reported to the management center and the fusing mechanism is triggered;

in step S1, multi-dimensional research and judgment are performed according to the key information extracted by the characteristic of the private data, so as to perform classification and named entity recognition operations;

in step S2, analyzing the sensitive data identification rule, performing data desensitization on the full sensitive data, and returning, including the following steps:

S25, for non-formatted data, encrypting the field;

S26, establishing a connection with the management center, and reporting the file desensitization progress in real time;

in step S3, monitoring the traffic of the outgoing domain by using the timing model and the firewall device; downloading the desensitized data source through the data downloading server information of the data downloading rule; the method comprises the following steps:

S34, desensitized data that conforms to the fusing threshold setting is output normally through the interface; desensitized data that does not conform is reported to the management center and the fusing mechanism is triggered;

The time sequence data model is as follows:

In the above formula, v represents the text content in the website; p(c₁|v) represents the probability that the text belongs to c₁, obtained from the time sequence data and the firewall; g(c₁, v) is used to determine whether the data out-of-domain gateway v belongs to the abnormal data class c₁. If g(c₁, v) ≥ 0, then v belongs to the abnormal data class c₁ within a certain period of time.

2. The method for identifying and desensitizing private data according to claim 1, wherein knowledge is encoded in a form of features based on feature representation and transfer learning of relational knowledge fusion, and is transferred from a source domain to a target domain, so that task effects of the target domain are improved.

3. A method of private data identification and desensitization according to claim 2, wherein knowledge migration between related domains, assuming that the relationship between data is the same in source domain and target domain, adding data in different domains, and the extracted knowledge is transmitted backwards to generate virtual data for classes not visible in feature space; enhancing the generalization capability of the semantic mapping function by adding data;

4. A system for private data identification and desensitization comprising:

The output module synchronizes the desensitized data to the data out-of-domain gateway interface, sets a data fusing threshold value, and normally outputs the desensitized data conforming to the fusing threshold value through the interface; the desensitized data which does not accord with the fusing threshold value is reported to the management center and a fusing mechanism is realized;

The output module monitors the flow of the out-of-domain by using the time sequence model and the firewall device; downloading the desensitized data source through the data downloading server information of the data downloading rule; the method comprises the following steps:

Acquiring a desensitization data out-of-domain access rule of a management center, and synchronously outputting the desensitization data through an interface;

establishing a related time sequence data model according to a data flow interface of desensitization data in a period of time;

Acquiring an internal firewall mechanism of a management center, and setting a data fusing threshold for preventing malicious recognition;

Desensitized data that conforms to the fusing threshold setting is output normally through the interface; desensitized data that does not conform is reported to the management center and the fusing mechanism is triggered;

The time sequence data model is as follows:

5. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements a method of private data identification and desensitization according to any one of claims 1-3 when executing the computer program.

6. A computer readable storage medium having stored thereon a computer program, which when executed by a processor implements a method of private data identification and desensitization according to any of claims 1 to 3.