CN112541193A

CN112541193A - Method and device for protecting private data

Info

Publication number: CN112541193A
Application number: CN202011432591.1A
Authority: CN
Inventors: 曹佳炯; 丁菁汀
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2020-12-10
Filing date: 2020-12-10
Publication date: 2021-03-23
Anticipated expiration: 2040-12-10
Also published as: CN112541193B

Abstract

Embodiments of this specification provide a method and a device for protecting private data. According to the method, historical data uploaded by a target user on a target platform is first obtained. Identity data and identity relevance data are then extracted from the historical data, where the identity data is data that uniquely corresponds to the user identity and the identity relevance data is data from which the user identity can be inferred. The extracted identity data and identity relevance data are then analyzed, and the privacy disclosure risk level of the target user on the target platform is determined according to the analysis result. Finally, data desensitization processing is performed on the data uploaded by the target user on the target platform according to the privacy disclosure risk level.

Description

Method and device for protecting private data

Technical Field

One or more embodiments of the present specification relate to the field of network security technologies, and in particular, to a method and an apparatus for protecting private data.

Background

Internet technology is an important pillar of social development and progress. However, while internet technology brings high-quality services to people, the leakage of user privacy data is becoming increasingly serious: it greatly degrades the user experience, causes economic loss to users, and can even threaten their personal safety. Protecting users' private data is therefore important in internet technology.

At present, privacy data protection generally covers strong privacy data only, that is, it protects a user's direct identity data, such as a certificate number. However, internet applications also involve a large amount of indirect identity data related to the user, such as address information, and malicious parties can infer the user's direct identity data by comprehensively analyzing this indirect identity data. Existing privacy data protection methods therefore cannot reliably ensure the security of the user's private data, and a more reliable privacy data protection scheme is needed.

Disclosure of Invention

One or more embodiments of the present specification describe a method and an apparatus for protecting private data, which can more reliably protect private data of a user.

According to a first aspect, there is provided a method of protecting private data, comprising:

acquiring historical data which is uploaded by a target user on a target platform;

extracting identity data and identity relevance data from the historical data, wherein the identity data is data that uniquely corresponds to the user identity, and the identity relevance data is data from which the user identity can be inferred;

analyzing the extracted identity data and the identity relevance data, and determining the privacy disclosure risk level of the target user on the target platform according to the analysis result;

and performing data desensitization processing on the data uploaded by the target user on the target platform according to the privacy disclosure risk level.

In one embodiment, wherein the identity data includes at least one of a face image and a certificate number, extracting the identity data from the historical data includes:

identifying whether a face image exists in the historical data by using a face detection model; if the historical data contains a face image, extracting the face image and using the face image as identity data;

and/or identifying whether a certificate number exists in the historical data by using an OCR model; and if the historical data contains a certificate number, extracting the certificate number and using the certificate number as identity data.

In one embodiment, the identity association data comprises at least one of positioning data, address information, a communication number and landmark buildings;

extracting identity association data from the historical data, including:

identifying whether the historical data contains positioning data or not; if the historical data contains positioning data, extracting the positioning data and taking the positioning data as identity relevance data;

and/or identifying whether address information exists in the historical data by using an OCR (optical character recognition) model; if the historical data contains address information, extracting the address information and using the address information as identity relevance data;

and/or, identifying whether the historical data has the communication number by using an OCR model; if the historical data contains the communication number, extracting the communication number and using the communication number as identity correlation data;

and/or, using a landmark detection model to identify whether landmark buildings exist in the historical data; and if the landmark buildings exist in the historical data, extracting the landmark buildings and using the landmark buildings as identity relevance data.

In one embodiment, the determining the privacy leakage risk level of the target user at the target platform includes:

analyzing the extracted identity data, and determining a risk level corresponding to the identity data according to an analysis result;

analyzing the extracted identity relevance data, and determining a risk grade corresponding to the identity relevance data according to an analysis result;

and determining the privacy disclosure risk level of the target user on the target platform according to the risk level corresponding to the identity data and the risk level corresponding to the identity relevance data.

In one embodiment, the analyzing the extracted identity data and determining the risk level corresponding to the identity data according to the analysis result includes:

if the extracted identity data comprises the face images, clustering the extracted face images, and determining the probability of the face images belonging to the target user according to the clustering result; determining a risk level corresponding to the face image according to the probability;

if the extracted identity data comprises the certificate number, determining the risk level corresponding to the certificate number according to the type and/or the number of the extracted certificate number;

and determining a risk grade corresponding to the identity data according to the risk grade corresponding to the face image and the risk grade corresponding to the certificate number.

In one embodiment, the analyzing the extracted identity relevance data and determining a risk level corresponding to the identity relevance data according to an analysis result includes:

if the extracted identity relevance data comprises positioning data, address information and landmark buildings, calculating the daily movement path of the target user according to the extracted positioning data, address information and landmark buildings; determining the risk level corresponding to the address data according to the proportion of the same path in the movement paths every day;

if the extracted identity relevance data comprises the communication number, determining the risk level corresponding to the communication number according to whether the extracted communication number is complete;

and determining the risk grade corresponding to the identity relevance data according to the risk grade corresponding to the address data and the risk grade corresponding to the communication number.

In one embodiment, the identity data comprises a face image, and the identity association data comprises positioning data, address information, a communication number and a landmark building;

the analyzing the extracted identity data and the identity relevance data, and determining the privacy disclosure risk level of the target user on the target platform according to the analysis result, comprises:

clustering the face images in the extracted identity data, determining the face images belonging to the target user in the extracted face images according to a clustering result, and extracting the face features of the face images belonging to the target user to obtain face feature vectors;

extracting the characteristics of the positioning data, the address information, the communication number and the landmark building, and constructing an identity relevance data characteristic vector based on the extracted characteristics;

respectively carrying out spherical data augmentation on the face feature vector and the identity relevance data feature vector, and dividing the augmented data into a training set and a test set;

training a risk recognition double-layer perceptron by using the training set, and calculating a loss function value of the trained risk recognition double-layer perceptron by using the test set;

and determining the privacy disclosure risk level of the target user on the target platform according to the value of the loss function.

In an embodiment, the performing spherical data augmentation on the face feature vector and the identity relevance data feature vector, and dividing augmented data into a training set and a test set includes:

defining a first spherical radius for augmenting the face feature vector, randomly sampling within the first spherical radius around the face feature vector to obtain augmented face feature vectors, selecting one part of the augmented face feature vectors as a training set of face feature vectors, and using the other part as a test set of face feature vectors;

defining a second spherical radius for augmenting the identity relevance data feature vector, randomly sampling within the second spherical radius around the identity relevance data feature vector to obtain augmented identity relevance data feature vectors, selecting one part of the augmented identity relevance data feature vectors as a training set of identity relevance data feature vectors, and using the other part as a test set of identity relevance data feature vectors.
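The spherical augmentation described in this embodiment can be sketched as follows. This is a minimal illustration only: the specification says that samples are drawn at random within a spherical radius but does not state the distribution, so uniform sampling inside the ball is an assumption here, and the function names, the 80/20 split, and the fixed seed are hypothetical.

```python
import math
import random


def sample_in_ball(center, radius, rng):
    """Draw one point uniformly from the ball of the given radius around center."""
    # A Gaussian vector gives a uniformly distributed direction.
    direction = [rng.gauss(0.0, 1.0) for _ in center]
    norm = math.sqrt(sum(x * x for x in direction)) or 1.0
    # Scaling the radius by u ** (1 / d) makes the points uniform in volume.
    scale = radius * rng.random() ** (1.0 / len(center)) / norm
    return [c + scale * x for c, x in zip(center, direction)]


def augment_and_split(vector, radius, n_samples, train_fraction=0.8, seed=0):
    """Augment one feature vector and split the samples into train and test sets."""
    rng = random.Random(seed)
    samples = [sample_in_ball(vector, radius, rng) for _ in range(n_samples)]
    cut = int(n_samples * train_fraction)
    return samples[:cut], samples[cut:]
```

Under these assumptions the function would be called once with the face feature vector and the first radius, and once with the identity relevance data feature vector and the second radius.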

In one embodiment, training the risk recognition double-layer perceptron by using the training set and calculating the loss function value of the trained risk recognition double-layer perceptron by using the test set includes:

using the augmented identity relevance data feature vectors in the training set as the input of the risk recognition double-layer perceptron, using the augmented face feature vectors in the training set as the output of the risk recognition double-layer perceptron, and performing data fitting on the risk recognition double-layer perceptron a preset number of times to obtain the trained risk recognition double-layer perceptron;

and using the augmented identity relevance data feature vectors in the test set as the input of the trained risk recognition double-layer perceptron, calculating the cosine similarity value of the trained risk recognition double-layer perceptron, and using the cosine similarity value as the value of the loss function.
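A rough sketch of this training and evaluation step is given below. Several details are assumptions rather than statements of the specification: the perceptron is taken to have one tanh hidden layer and a linear output layer without bias terms, fitting uses plain stochastic gradient descent on squared error, and the cosine similarity is computed between the perceptron output and the corresponding face feature vector in the test set. The class and function names are hypothetical.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class TwoLayerPerceptron:
    """One tanh hidden layer followed by a linear output layer (no biases)."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]

    def hidden(self, x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in self.w1]

    def predict(self, x):
        h = self.hidden(x)
        return [sum(w * hi for w, hi in zip(row, h)) for row in self.w2]

    def fit(self, inputs, targets, epochs=200, lr=0.05):
        """Fit for a preset number of passes over the training pairs."""
        for _ in range(epochs):
            for x, t in zip(inputs, targets):
                h = self.hidden(x)
                y = [sum(w * hi for w, hi in zip(row, h)) for row in self.w2]
                err = [yi - ti for yi, ti in zip(y, t)]  # gradient of 0.5 * squared error
                # Back-propagate to the hidden layer before w2 is updated.
                grad_h = [
                    sum(err[k] * self.w2[k][j] for k in range(len(err))) * (1 - h[j] ** 2)
                    for j in range(len(h))
                ]
                for k, row in enumerate(self.w2):
                    for j in range(len(row)):
                        row[j] -= lr * err[k] * h[j]
                for j, row in enumerate(self.w1):
                    for i in range(len(row)):
                        row[i] -= lr * grad_h[j] * x[i]


def mean_test_cosine(model, test_inputs, test_targets):
    """Average cosine similarity between predictions and test-set targets."""
    sims = [cosine_similarity(model.predict(x), t) for x, t in zip(test_inputs, test_targets)]
    return sum(sims) / len(sims)
```

In this reading, the inputs are augmented identity relevance data feature vectors and the targets are augmented face feature vectors; the value returned by `mean_test_cosine` would then play the role of the loss function value from which the risk level is determined.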

In one embodiment, the performing data desensitization processing on the data uploaded by the target user on the target platform includes:

if it is determined that the target user's instruction is to perform no privacy protection, performing no desensitization processing on the data uploaded by the target user on the target platform;

if it is determined that the target user's instruction is to perform medium-level privacy protection, desensitizing the identity data uploaded by the target user on the target platform and the communication number in the identity relevance data, and not desensitizing the identity relevance data other than the communication number;

and if it is determined that the target user's instruction is to perform high-level privacy protection, desensitizing the identity data uploaded by the target user on the target platform and the communication number, the address information and the landmark building in the identity relevance data, and performing randomization processing on the positioning data in the identity relevance data.
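The three protection levels in this embodiment (no protection, an intermediate level, and a high level) can be pictured as a lookup from level to the set of fields that are masked. The sketch below is illustrative only: the field names, the level names "none", "medium" and "high", the "***" mask, and the use of a small random coordinate offset as the "random processing" of positioning data are all assumptions, not details given in the specification.

```python
import random

# Hypothetical field names for one uploaded record.
IDENTITY_FIELDS = {"face_image", "certificate_number"}

# Fields that are masked at each protection level.
POLICY = {
    "none": set(),
    "medium": IDENTITY_FIELDS | {"communication_number"},
    "high": IDENTITY_FIELDS | {"communication_number", "address", "landmark"},
}


def desensitize(record, level, rng=None):
    """Return a copy of record with fields masked according to the level."""
    if rng is None:
        rng = random.Random()
    masked = POLICY[level]
    out = {}
    for name, value in record.items():
        if name in masked:
            out[name] = "***"
        elif name == "location" and level == "high":
            # Assumed form of randomization: jitter each coordinate slightly.
            lat, lon = value
            out[name] = (lat + rng.uniform(-0.01, 0.01), lon + rng.uniform(-0.01, 0.01))
        else:
            out[name] = value
    return out
```

Keeping the policy in a table rather than in branching code makes it straightforward to add further levels or fields without touching the masking logic.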

According to a second aspect, there is provided an apparatus for protecting private data, comprising:

the acquisition unit is configured to acquire historical data which is uploaded by a target user on a target platform;

the extracting unit is configured to extract identity data and identity correlation data from the historical data, wherein the identity data is data which uniquely corresponds to the user identity, and the identity correlation data is data which can infer the user identity;

the analysis unit is configured to analyze the extracted identity data and the identity correlation data and determine the privacy disclosure risk level of the target user on the target platform according to an analysis result;

and the data desensitization processing unit is configured to perform data desensitization processing on the data uploaded by the target user on the target platform according to the privacy disclosure risk level.

In one embodiment, wherein the identity data includes at least one of a face image and a certificate number, the extraction unit is configured to:

In one embodiment, wherein the analysis unit is configured to:

if the extracted identity data comprises the certificate and the certificate number thereof, determining the risk level corresponding to the certificate and the certificate number thereof according to the type and/or the number of the extracted certificate and the certificate number thereof;

and determining the risk grade corresponding to the identity data according to the risk grade corresponding to the face image and the risk grade corresponding to the certificate and the certificate number thereof.

In one embodiment, the analysis unit is configured to:

performing spherical data augmentation on the face feature vector and the feature vector of the identity relevance data, and dividing the augmented data into a training set and a test set;

In one embodiment, wherein the analysis unit is configured to:

In one embodiment, wherein the data desensitization processing unit is configured to:

if it is determined that the target user's instruction is to perform no privacy protection, performing no desensitization processing on the data uploaded by the target user on the target platform;

According to a third aspect, there is provided a computing device comprising a memory and a processor, the memory having stored therein executable code, the processor, when executing the executable code, being configured to perform the method of the first aspect.

According to the method and the device provided by the embodiments of this specification, identity data and identity relevance data are extracted from the historical data uploaded by the target user on the target platform and analyzed to obtain the privacy disclosure risk level of the target user on the target platform. Data desensitization processing can then be performed on the data uploaded by the target user on the target platform based on that risk level, so that the privacy protection requirements of different users can be met. Because the privacy disclosure risk level is obtained by comprehensively analyzing both the identity data and the identity relevance data of the target user, desensitizing the uploaded data according to this level protects the target user's private data comprehensively, so that an attacker can be prevented from carrying out an inference attack on the identity data based on the identity relevance data.

Drawings

In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present specification, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 is a flow diagram of a method for protecting private data provided by one embodiment of the present description;

FIG. 2 is a flow diagram of a method for determining a privacy disclosure risk level provided by one embodiment of the present description;

FIG. 3 is a flow chart of a method for determining a risk level corresponding to identity data provided in one embodiment of the present description;

FIG. 4 is a flow diagram of a method for determining a risk level corresponding to identity relevance data provided in one embodiment of the present description;

FIG. 5 is a flow diagram of a method for determining a privacy disclosure risk level provided by one embodiment of the present description;

FIG. 6 is a schematic diagram of a device for protecting private data according to another embodiment of the present disclosure.

Detailed Description

In recent years, with the development of the internet, it has become easier to acquire user data. For example, face images, positioning data, nicknames and other data published by users can be obtained on various social platforms. Although a user's identity data (name, identity card, and so on) cannot be obtained directly from data such as face images and positioning data, an attacker who obtains a sufficient amount of such data can infer the user's identity data with high probability. For example, suppose an attacker has obtained some pairs of identity data and non-identity data (for instance, a person's identity card and address are both known, so that the identity card and the address form a data pair). By combining these pairs with non-identity data crawled from the internet, the attacker can trace back and look up the user's identity data. For example, an attacker knows that Zhang San lives in residential community A, works at company B, and frequently visits restaurant C. If the attacker then finds that an anonymous account has published similar trajectory information online, the attacker can infer that the account belongs to Zhang San and can obtain more of Zhang San's private data, such as face images, from the content published by that account, causing serious leakage of private data.

In summary, current privacy protection methods usually protect only strong privacy data; for example, only the face region is protected in a face recognition system. The protection of weak privacy data (i.e., the identity relevance data described in this embodiment) is ignored, so an attacker can infer the user's strong privacy information (i.e., the identity data described in this embodiment) from a series of identity relevance data, causing more serious leakage of private data (both the identity data and the identity relevance data are leaked). To solve this problem, this specification provides a more reliable private data protection scheme.

The following describes implementations of the concepts of the embodiments of the present disclosure.

FIG. 1 is a flow diagram of a method for protecting private data according to an embodiment. The method may be performed by any apparatus, device, platform, or device cluster having computing and processing capabilities. The method can be applied to various social platforms, such as shopping platforms, instant messaging platforms, video platforms, short video platforms, or broadcast social network platforms. As shown in FIG. 1, the method includes:

step 101, acquiring historical data uploaded by a target user on a target platform;

step 103, extracting identity data and identity relevance data from the historical data, wherein the identity data is data that uniquely corresponds to the user identity, and the identity relevance data is data from which the user identity can be inferred;

step 105, analyzing the extracted identity data and identity relevance data, and determining the privacy disclosure risk level of the target user on the target platform according to the analysis result;

and step 107, performing data desensitization processing on the data uploaded by the target user on the target platform according to the privacy disclosure risk level.
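The four steps above can be read as a simple pipeline. The following sketch shows only that control flow; the extraction, assessment and desensitization functions are hypothetical placeholders passed in by the caller, the `Extracted` container and the toy helpers are invented for illustration, and the authorization check reflects the embodiment in which historical data is read only with the user's consent.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Extracted:
    identity: List[str] = field(default_factory=list)     # e.g. face images, certificate numbers
    association: List[str] = field(default_factory=list)  # e.g. addresses, communication numbers


def protect(history: List[str],
            authorized: bool,
            extract: Callable[[List[str]], Extracted],
            assess: Callable[[Extracted], int],
            desensitize: Callable[[List[str], int], List[str]]) -> List[str]:
    """Steps 101 to 107 as a pipeline; every callable is a stand-in."""
    if not authorized:                       # stop if the user has not authorized access
        return history
    extracted = extract(history)             # step 103
    risk_level = assess(extracted)           # step 105
    return desensitize(history, risk_level)  # step 107


# Toy stand-ins, for illustration only.
def toy_extract(items):
    return Extracted(identity=[s for s in items if s.startswith("id:")],
                     association=[s for s in items if s.startswith("addr:")])


def toy_assess(extracted):
    return len(extracted.identity) + len(extracted.association)


def toy_desensitize(items, level):
    return ["***" if level > 0 and ":" in s else s for s in items]
```

Passing the three stages in as functions keeps the control flow separate from any particular detection model or risk rule.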

According to the method provided by the embodiments of this specification, identity data and identity relevance data are extracted from the historical data uploaded by the target user on the target platform and analyzed to obtain the privacy disclosure risk level of the target user on the target platform. Data desensitization processing can then be performed on the data uploaded by the target user on the target platform based on that risk level, so that the privacy protection requirements of different users can be met. Because the privacy disclosure risk level is obtained by comprehensively analyzing both the identity data and the identity relevance data of the target user, desensitizing the uploaded data according to this level protects the target user's private data comprehensively, so that an attacker can be prevented from carrying out an inference attack on the identity data based on the identity relevance data.

The manner in which the various steps shown in fig. 1 are performed is described below.

First, in step 101, the target user is a user of a social platform whose private data can be protected by the method described in the embodiments of this specification. The target platform is a platform on which the method of the embodiments of this specification can be implemented or applied, such as a microblog platform or a short video platform.

In one embodiment, step 101 of obtaining the historical data uploaded by the target user on the target platform is performed only after the authorization of the target user has been obtained. If the authorization of the target user is not obtained, the subsequent steps of the embodiments of this specification are not performed. This ensures that the historical data uploaded by the target user on the target platform is obtained with the target user's permission, and avoids the aversion the target user would feel if the historical data were obtained and used without permission. The historical data acquired in step 101 includes text, images, positioning information, videos, and the like.

Secondly, the identity data described in step 103 is data that uniquely corresponds to the user identity, that is, unique identification data by which one user can be distinguished from other users, such as an identity card, a driving licence or another certificate, a voiceprint, or a fingerprint. The identity relevance data in step 103 is data from which the identity of the user can be inferred. For example, on an anonymous social platform, the user's home address (from positioning or indoor images), workplace (from positioning or indoor images) and frequently visited consumption locations (from positioning) can be obtained from the data shared by the user. From these, the identity of the user can be roughly inferred, for instance by associating the address with known information about the user's identity, so such data leads to leakage of the user's private data.

In one embodiment, the identity data includes at least one of a face image and a certificate number. On this basis, extracting the identity data from the historical data in step 103 includes extracting a face image and/or a certificate number.

Specifically, a face image may be extracted as follows: identifying whether a face image exists in the historical data by using a face detection model; and if the historical data contains a face image, extracting the face image and using it as identity data. The face images extracted here are all the face images contained in all images in the historical data; they may belong to the same user or to different users. For example, the face images extracted from the historical data include eight face images of user A, two face images of user B, and one face image of user C.

Specifically, a certificate number may be extracted as follows: identifying whether a certificate number exists in the historical data by using an OCR model; and if the historical data contains a certificate number, extracting the certificate number and using it as identity data. The certificate number may include an identity card number, a driving licence number, a marriage certificate number, and the like. The extracted certificate numbers are all the certificate numbers contained in the historical data, including numbers of different certificates of the same user and certificate numbers, of the same or different types, of different users. For example, the certificate numbers extracted from the historical data include the identity card number and driving licence number of Zhang San, the marriage certificate number of Li Si, the driving licence number of Wang Wu, and the like.
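Once an OCR model has turned images into text, locating certificate numbers in that text can be as simple as pattern matching. The sketch below assumes the OCR output is already available as plain strings and looks only for 18-character resident identity card numbers (17 digits followed by a digit or X); the pattern, the function name and the restriction to this one certificate type are assumptions made for illustration.

```python
import re

# Assumed pattern: 17 digits followed by a digit or X,
# not embedded in a longer run of digits.
ID_CARD_PATTERN = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")


def extract_certificate_numbers(ocr_texts):
    """Collect every candidate identity card number found in the OCR text."""
    found = []
    for text in ocr_texts:
        found.extend(ID_CARD_PATTERN.findall(text))
    return found
```

A production system would likely add further patterns for other certificate types and a checksum test, neither of which is described in the specification.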

In one embodiment, the identity relevance data includes at least one of positioning data, address information, a communication number, and a landmark building. On this basis, extracting the identity relevance data from the historical data in step 103 includes extracting one or more of positioning data, address information, communication numbers, and landmark buildings.

Specifically, positioning data may be extracted as follows: identifying whether positioning data exists in the historical data; and if the historical data contains positioning data, extracting the positioning data and using it as identity relevance data. For example, if a post uploaded by the user to a microblog includes the location "X Park, City A", then X Park in City A is used as one item of identity relevance data.

Specifically, address information may be extracted as follows: identifying whether address information exists in the historical data by using an OCR model; and if the historical data contains address information, extracting the address information and using it as identity relevance data. The address information extracted here includes textual address information, address information in pictures, and the like; for example, a delivery address stored on a shopping platform, or a delivery address or sender address in an uploaded picture of an express waybill.

Specifically, a communication number may be extracted as follows: identifying whether a communication number exists in the historical data by using an OCR model; and if the historical data contains a communication number, extracting the communication number and using it as identity relevance data. The extracted communication numbers include mobile phone numbers, instant messaging application accounts, email accounts, and the like.
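As with certificate numbers, communication numbers can be picked out of OCR text by pattern matching. The sketch below assumes plain-text OCR output and covers only two kinds of number: 11-digit mobile numbers beginning with 1, and email addresses. Both patterns and the function name are illustrative assumptions; instant messaging accounts are not handled.

```python
import re

# Assumed patterns, for illustration only.
PHONE = re.compile(r"(?<!\d)1\d{10}(?!\d)")  # 11-digit mobile number
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_communication_numbers(ocr_texts):
    """Return the phone numbers and email addresses found in the OCR text."""
    result = {"phone": [], "email": []}
    for text in ocr_texts:
        result["phone"].extend(PHONE.findall(text))
        result["email"].extend(EMAIL.findall(text))
    return result
```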

Specifically, landmark buildings may be extracted as follows: identifying whether landmark buildings exist in the historical data by using a landmark detection model; and if landmark buildings exist in the historical data, extracting the landmark buildings and using them as identity relevance data. The landmark buildings extracted here are those contained in the individual pictures, for example, the Big Wild Goose Pagoda in one picture, the Oriental Pearl Tower in another picture, and the like.

Next, a specific implementation of step 105 is described. Step 105 may be implemented in ways including, but not limited to, the following two:

In the first way, the identity data and the identity relevance data are analyzed separately to determine the risk level corresponding to each, and the privacy disclosure risk level of the target user on the target platform is then determined from both. Specifically, as shown in FIG. 2, this way includes:

step 1051, analyzing the extracted identity data, and determining a risk level corresponding to the identity data according to the analysis result;

step 1053, analyzing the extracted identity relevance data, and determining the risk level corresponding to the identity relevance data according to the analysis result;

and step 1055, determining the privacy disclosure risk level of the target user on the target platform according to the risk level corresponding to the identity data and the risk level corresponding to the identity relevance data.

In one embodiment, where the identity data includes a face image and a certificate number as described above, as shown in FIG. 3, analyzing the extracted identity data and determining the risk level corresponding to the identity data according to the analysis result in step 1051 includes:

step 10511, if the extracted identity data includes face images, clustering the extracted face images, determining the probability of the face images belonging to the target user according to the clustering result, and determining the risk level corresponding to the face image according to the probability;

step 10513, if the extracted identity data includes a certificate number, determining the risk level corresponding to the certificate number according to the type and/or the number of the extracted certificate numbers;

and step 10515, determining the risk level corresponding to the identity data according to the risk level corresponding to the face image and the risk level corresponding to the certificate number.

First, in step 10511, the extracted face images may be clustered by extracting features of the face images with a face recognition model and then performing k-means clustering. A large proportion of the face images that a user uploads on a social platform are usually images of that user, so when the probability of the face images belonging to the target user is determined according to the clustering result, the ratio of the most numerous face to all face images is calculated. For example, if the extracted face images include eight face images of user A, two face images of user B, and one face image of user C, it may be determined that user A is the target user and that the probability of the face images belonging to the target user is 8/11. Further, so that different risk levels can be distinguished, the embodiments of this specification predefine a risk level for each probability range. For example: if the most numerous face accounts for more than 50% of the face images, the risk level is +2; if it accounts for more than 30% but less than 50%, the risk level is +1; and if it accounts for less than 30%, the risk level is +0. On this basis, once the probability of the face images belonging to the target user has been determined, the risk level corresponding to the face image can be determined from that probability and the predefined correspondence between probabilities and risk levels.
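The mapping from clustering result to risk level in step 10511 can be sketched as below, taking the cluster label of each face image as input (the clustering itself is not shown). The function name is hypothetical, and because the text does not say which level applies at exactly 50% or 30%, the treatment of those boundaries here is an assumption.

```python
from collections import Counter


def face_risk_level(cluster_labels):
    """Map the share of the largest face cluster to a risk level (0, 1 or 2)."""
    if not cluster_labels:
        return 0
    largest = Counter(cluster_labels).most_common(1)[0][1]
    share = largest / len(cluster_labels)
    if share > 0.5:
        return 2
    if share > 0.3:  # exactly 50% falls here; the source leaves this case open
        return 1
    return 0
```

With the example from the text, eight images of user A out of eleven give a share of 8/11, which exceeds 50%, so the function returns 2.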

Secondly, leakage of a certificate number can have relatively serious consequences. Therefore, the embodiments of the present specification may define different risk levels according to whether the extracted identity data includes a certificate number, and may also define different risk levels for different types of certificate numbers. For example: if the extracted identity data includes a certificate number, a risk level of +2 is defined; if it does not include a certificate number, a risk level of +0 is defined; if it includes an identity card number, a risk level of +2 is defined; if it includes a driving license number, a risk level of +1 is defined; and so on. On this basis, if the extracted identity data includes a certificate number in step 10513, the risk level corresponding to the certificate number is determined according to the predefined correspondence between certificate numbers and risk levels.

Finally, in step 10515, the risk level corresponding to the face image and the risk level corresponding to the certificate number are added together to determine the risk level corresponding to the identity data. For example, if it is determined in step 10511 that the risk level corresponding to the face image is +3 and in step 10513 that the risk level corresponding to the certificate number is +1, the risk level corresponding to the identity data is +4.
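
Steps 10513 and 10515 can be illustrated with a small lookup table and a sum. The sketch below is an assumption-laden example: the type names, the table `CERTIFICATE_RISK`, and the choice of taking the highest level when several certificate types are present are hypothetical, since the text only gives example levels per type.

```python
# Example correspondence between certificate type and risk level (from the text).
CERTIFICATE_RISK = {"identity_card": 2, "driving_license": 1}


def certificate_risk_level(certificate_types):
    """Risk level for the extracted certificate numbers.

    Taking the maximum over all extracted types is an assumption of this
    sketch; no certificate number at all yields level 0.
    """
    return max((CERTIFICATE_RISK.get(t, 0) for t in certificate_types), default=0)


def identity_data_risk_level(face_level, certificate_level):
    """Step 10515: add the two partial risk levels together."""
    return face_level + certificate_level
```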

In another embodiment, where the identity relevance data includes positioning data, address information, a communication number, and a landmark building, as shown in fig. 4, analyzing the extracted identity relevance data and determining the risk level corresponding to the identity relevance data according to the analysis result in step 1053 includes: step 10531, if the extracted identity relevance data includes positioning data, address information, and landmark buildings, calculating the daily movement path of the target user according to the extracted positioning data, address information, and landmark buildings, and determining the risk level corresponding to the address data according to the proportion of identical paths among the daily movement paths. Step 10533, if the extracted identity relevance data includes a communication number, determining the risk level corresponding to the communication number according to the type and/or number of the extracted communication numbers. Step 10535, determining the risk level corresponding to the identity relevance data according to the risk level corresponding to the address data and the risk level corresponding to the communication number.

First, in step 10531, the movement paths commonly used by the target user may be analyzed according to the extracted positioning data, address information, and landmark buildings. The embodiments of the present specification calculate one movement path per day, that is, the movement path of a given day is calculated from the positioning data, address information, and landmark buildings of that day. If the same movement path accounts for more than a certain proportion of all movement paths, the user frequently travels back and forth along that path, and information such as the user's home can be inferred from it. Therefore, the embodiments of the present specification define different risk levels for different proportions of the same movement path, for example: if the same movement path accounts for more than 50%, the risk level is +2; if it accounts for more than 30% but less than 50%, the risk level is +1. On this basis, after the proportion of the same movement path is determined, the risk level corresponding to the address data can be determined according to the predefined correspondence between the proportion of the movement path and the risk level.
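
The proportion-based rule of step 10531 can be sketched as follows, assuming each daily movement path has already been reduced to an ordered sequence of place names. The function name `path_risk_level` is hypothetical, and assigning level 0 to proportions of 30% or less is an assumption, since the text only gives the two upper ranges.

```python
from collections import Counter


def path_risk_level(daily_paths):
    """Return (share, risk_level) for a list of daily movement paths.

    Each path is a sequence of place names for one day. The share is the
    fraction of days on which the most frequent path was taken.
    """
    if not daily_paths:
        return 0.0, 0
    counts = Counter(tuple(path) for path in daily_paths)
    share = counts.most_common(1)[0][1] / len(daily_paths)
    if share > 0.5:
        level = 2
    elif share > 0.3:
        level = 1
    else:
        level = 0  # assumption: not stated in the text
    return share, level


# Six of ten days follow the same path, so the share is 0.6.
paths = [["home", "subway", "office"]] * 6 + [["home", "park"]] * 3 + [["home", "mall"]]
share, level = path_risk_level(paths)
```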

Secondly, leakage of a communication number can have serious consequences. Therefore, the embodiments of the present specification may define different risk levels according to whether the extracted identity relevance data includes a complete communication number, where whether a communication number is complete is determined according to the characteristics of the communication number. For example, a mobile phone number typically includes 11 digits, so an extracted number consisting of 10 digits may be determined not to be a complete communication number. For example: if the extracted identity relevance data includes a complete communication number, a risk level of +2 is defined; if it does not include a complete communication number, a risk level of +0 is defined. In addition, the embodiments of the present specification may further define different risk levels for different types of communication numbers, for example: if the extracted identity relevance data includes a complete mobile phone number, a risk level of +2 is defined; if it includes a complete instant messaging application number, a risk level of +1 is defined; and so on. On this basis, when the risk level corresponding to the communication number is determined in step 10533, it is determined according to whether the extracted identity relevance data includes a complete communication number and the predefined correspondence between communication numbers and risk levels.
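
The completeness check and the type-based levels of step 10533 can be sketched as follows. The names `is_complete_mobile_number`, `communication_risk_level`, and the table `COMMUNICATION_RISK` are hypothetical. The sketch only checks completeness for mobile phone numbers (11 digits, as in the text) and treats other number types as complete, which is an assumption.

```python
import re

# Example correspondence between communication number type and risk level.
COMMUNICATION_RISK = {"mobile": 2, "instant_messaging": 1}


def is_complete_mobile_number(text):
    """A mobile phone number is treated as complete if it has exactly 11 digits."""
    return re.fullmatch(r"[0-9]{11}", text) is not None


def communication_risk_level(numbers):
    """numbers: iterable of (kind, value) pairs extracted from the history data.

    Incomplete mobile numbers contribute nothing; otherwise the highest
    level among the extracted types is used (an assumption of this sketch).
    """
    level = 0
    for kind, value in numbers:
        if kind == "mobile" and not is_complete_mobile_number(value):
            continue
        level = max(level, COMMUNICATION_RISK.get(kind, 0))
    return level
```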

In the second mode, the identity data and the identity relevance data are analyzed jointly to obtain the privacy disclosure risk level of the target user on the target platform. This mode requires that the identity data include a face image. Specifically, as shown in fig. 5, this mode includes:

Step 1051', clustering the face images in the extracted identity data, determining the face images belonging to the target user among the extracted face images according to the clustering result, and extracting the face features of the face images belonging to the target user to obtain a face feature vector. Step 1053', extracting features of the positioning data, the address information, the communication number, and the landmark building, and constructing an identity relevance data feature vector based on the extracted features. Step 1055', performing spherical data augmentation on the face feature vector and the identity relevance data feature vector respectively, and dividing the augmented data into a training set and a test set. Step 1057', training the risk recognition double-layer perceptron using the training set, and calculating the value of the loss function of the trained risk recognition double-layer perceptron using the test set. Step 1059', determining the privacy disclosure risk level of the target user on the target platform according to the value of the loss function.

First, the method for clustering the face images in the extracted identity data in step 1051' is similar to the method for clustering the extracted face images in step 1051, and is not described herein again. When the face images belonging to the target user are determined among the extracted face images according to the clustering result, since the target user generally uploads more face images of himself or herself than of anyone else, the embodiments of the present specification take the largest group of face images as the face images of the target user. Further, when the face features of the face images belonging to the target user are extracted to obtain the face feature vector, the face features of each face image belonging to the target user may be extracted first, and the average of the face features of all face images belonging to the target user is then taken as the face feature vector, denoted F.
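
The selection of the largest cluster and the averaging of its features in step 1051' can be sketched as follows. The sketch assumes that a face recognition model has already produced one feature list per image and that clustering has produced one label per image; the function name `target_face_vector` is hypothetical.

```python
from collections import Counter


def target_face_vector(features, cluster_labels):
    """Average the features of the images in the largest cluster.

    features: one feature list per face image (all of equal length).
    cluster_labels: one cluster label per face image.
    """
    if not features or len(features) != len(cluster_labels):
        raise ValueError("features and cluster_labels must be non-empty and aligned")
    target_label = Counter(cluster_labels).most_common(1)[0][0]
    selected = [f for f, label in zip(features, cluster_labels) if label == target_label]
    dim = len(selected[0])
    return [sum(f[i] for f in selected) / len(selected) for i in range(dim)]
```

For instance, with features `[[1.0, 0.0], [3.0, 2.0], [10.0, 10.0]]` and labels `["A", "A", "B"]`, the two images in cluster A are averaged and the image in cluster B is ignored.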

Secondly, in step 1053', when the identity relevance data feature vector is constructed, feature vectors of the positioning data, the address information, the communication number, and the landmark building may be extracted separately and then combined to obtain the feature vector of the identity relevance data. For text-type identity relevance data such as positioning data, address information, or communication numbers, a word2vec model may be used to convert the characters in the positioning data and the address information into a vector feature f_text. For image-type identity relevance data such as landmark buildings, a landmark classifier (or an ImageNet pre-trained model) may be used to extract the corresponding feature f_image. The identity relevance data feature vector is then obtained as F = [f_text, f_image].

Next, in step 1055', the spherical data augmentation of the face feature vector and the identity relevance data feature vector and the division of the augmented data into a training set and a test set are performed as follows: defining a first spherical radius for augmenting the face feature vector, and randomly sampling within the first spherical radius around the face feature vector to obtain augmented face feature vectors; selecting one part of the augmented face feature vectors as the training set of the face feature vector and using the other part as the test set of the face feature vector; defining a second spherical radius for augmenting the identity relevance data feature vector, and randomly sampling within the second spherical radius around the identity relevance data feature vector to obtain augmented identity relevance data feature vectors; selecting one part of the augmented identity relevance data feature vectors as the training set of the identity relevance data feature vector and using the other part as the test set of the identity relevance data feature vector.

Both the first spherical radius and the second spherical radius are numbers close to 0. After the augmented face feature vectors are obtained, a certain proportion of the data is selected as the training set of the face feature vector, and the remaining data is used as the test set of the face feature vector. For example, 50% of the data is selected as the training set of the face feature vector, and the other 50% is used as the test set of the face feature vector. The augmented identity relevance data feature vectors are divided into a training set and a test set in the same way as the augmented face feature vectors. It should be noted that, when the training set and the test set are divided, the proportion of the selected data may be set as required; in a specific implementation, to make the trained risk recognition double-layer perceptron more accurate, the training set may be given a larger share of the data than the test set.
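
The spherical data augmentation and the split of step 1055' can be sketched as follows. The text only says that samples are drawn randomly within a spherical radius around the feature vector; drawing them uniformly inside the ball, as done here, is an assumption, and the function names `sample_in_ball` and `augment_and_split` are hypothetical.

```python
import math
import random


def sample_in_ball(center, radius, rng):
    """Draw one point uniformly from the ball of the given radius around center."""
    dim = len(center)
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(d * d for d in direction)) or 1.0
    # Scaling by u**(1/dim) makes the samples uniform in volume.
    distance = radius * rng.random() ** (1.0 / dim)
    return [c + distance * d / norm for c, d in zip(center, direction)]


def augment_and_split(center, radius, n_samples, train_ratio, rng):
    """Augment one feature vector and split the samples into train and test sets."""
    samples = [sample_in_ball(center, radius, rng) for _ in range(n_samples)]
    n_train = round(n_samples * train_ratio)
    return samples[:n_train], samples[n_train:]


rng = random.Random(0)
face_vector = [0.1, 0.2, 0.3]  # stands in for the face feature vector
train_set, test_set = augment_and_split(face_vector, 0.01, 10, 0.8, rng)
```

The same function would be called a second time with the identity relevance data feature vector and the second spherical radius.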

Next, in step 1057', training the risk recognition double-layer perceptron using the training set and calculating the value of the loss function of the trained risk recognition double-layer perceptron using the test set may include: using the augmented identity relevance data feature vectors in the training set as the input of the risk recognition double-layer perceptron and the augmented face feature vectors in the training set as its output, and fitting the perceptron to the data for a preset number of iterations to obtain the trained risk recognition double-layer perceptron; then using the augmented identity relevance data feature vectors in the test set as the input of the trained risk recognition double-layer perceptron, calculating the cosine similarity between its output and the augmented face feature vectors in the test set, and using the cosine similarity value as the value of the loss function. The preset number of iterations can be set as required, choosing a value that balances the amount of computation against the accuracy of the training result. Of course, the Euclidean distance may also be selected as the loss function as needed. The benefit of selecting cosine similarity as the loss function is that its value is normalized, so that the privacy disclosure risk level can be determined intuitively from the cosine similarity value.
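
A minimal sketch of step 1057' in plain Python is given below. It is not the implementation of the embodiments: the hidden layer size, the tanh activation, the squared-error fitting objective, the learning rate, and all function names are assumptions chosen for illustration. The cosine similarity is computed between the perceptron output and the augmented face feature vectors of the test set and averaged over the test samples.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(params, x):
    """Two-layer perceptron: tanh hidden layer followed by a linear output layer."""
    (w1, b1), (w2, b2) = params
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(w1, b1)]
    out = [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]
    return hidden, out


def train(params, inputs, targets, epochs=200, lr=0.05):
    """Fit the perceptron with plain stochastic gradient descent on squared error."""
    (w1, b1), (w2, b2) = params
    for _ in range(epochs):
        for x, y in zip(inputs, targets):
            hidden, out = forward(params, x)
            d_out = [o - t for o, t in zip(out, y)]
            # Back-propagate through the output layer before updating it.
            d_hidden = [
                (1.0 - h * h) * sum(d_out[k] * w2[k][j] for k in range(len(w2)))
                for j, h in enumerate(hidden)
            ]
            for k, row in enumerate(w2):
                for j in range(len(row)):
                    row[j] -= lr * d_out[k] * hidden[j]
                b2[k] -= lr * d_out[k]
            for j, row in enumerate(w1):
                for i in range(len(row)):
                    row[i] -= lr * d_hidden[j] * x[i]
                b1[j] -= lr * d_hidden[j]
    return params


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_test_similarity(params, inputs, targets):
    """Average cosine similarity between predictions and face vectors on the test set."""
    sims = [cosine_similarity(forward(params, x)[1], y) for x, y in zip(inputs, targets)]
    return sum(sims) / len(sims)
```

A perceptron would be created with, for example, `params = (init_layer(3, 4, rng), init_layer(4, 2, rng))` for three-dimensional inputs and two-dimensional outputs, trained with `train`, and evaluated with `mean_test_similarity`.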

Finally, in step 1059', when the privacy disclosure risk level of the target user on the target platform is determined according to the value of the loss function, a higher value of the loss function (cosine similarity) calculated in step 1057' means that the identity data can be fitted more closely from the identity relevance data, and therefore that the risk of privacy exposure is higher. Further, the embodiments of the present specification may predefine risk levels corresponding to different values of the loss function, for example: if the cosine similarity >= 0.5, the risk level is +2; if the cosine similarity >= 0.2 and < 0.5, the risk level is +1. Thus, after the value of the cosine similarity is obtained, the privacy disclosure risk level of the target user on the target platform can be determined intuitively according to the predefined correspondence between the value of the loss function and the risk level.
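
The mapping of step 1059' can be written as a short function. The sketch reads the example thresholds as "at least 0.5" and "at least 0.2 but below 0.5"; the function name `similarity_risk_level` and the level 0 for values below 0.2 are assumptions, since the text does not state a level for that range.

```python
def similarity_risk_level(cosine_similarity):
    """Map the cosine similarity from step 1057' to an example risk level."""
    if cosine_similarity >= 0.5:
        return 2
    if cosine_similarity >= 0.2:
        return 1
    return 0  # assumption: not stated in the text
```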

In one embodiment, after the privacy disclosure risk level of the target user on the target platform is determined, step 105 may further include: recommending a privacy protection scheme to the target user according to the privacy disclosure risk level. Specifically, no privacy protection, medium-level privacy protection, or high-level privacy protection may be recommended to the target user according to the privacy disclosure risk level. For example, suppose the privacy disclosure risk level is defined on 9 levels from 0 to 8. In the embodiments of the present specification, levels 0 to 2 are low risk, for which no privacy protection is needed; levels 3 to 5 are medium risk, for which medium-level privacy protection may be performed; and levels 6 to 8 are high risk, for which high-level privacy protection may be performed. In this case, when the privacy disclosure risk level of the target user is level 7, high-level privacy protection may be recommended to the target user. The purpose of the recommendation is to provide the user with a more appropriate privacy protection scheme. However, based on the principle of free user choice, the target user may decline the recommended privacy protection scheme and select another privacy protection scheme according to actual requirements.
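
The example recommendation rule can be sketched as follows, using the 9-level scale from the text. The function name `recommend_protection` and the string labels are hypothetical.

```python
def recommend_protection(risk_level):
    """Recommend a privacy protection scheme for a risk level from 0 to 8."""
    if not 0 <= risk_level <= 8:
        raise ValueError("risk level must be between 0 and 8")
    if risk_level <= 2:
        return "none"
    if risk_level <= 5:
        return "medium"
    return "high"
```

The result is only a recommendation; the scheme that is actually applied is the one the target user selects.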

In an embodiment, in step 107, when data desensitization processing is performed on the data uploaded by the target user on the target platform according to the privacy disclosure risk level, the processing differs according to the privacy protection scheme instructed by the target user, specifically as follows:

If the target user instructs that no privacy protection be performed, no desensitization processing is performed on the data uploaded by the target user on the target platform, that is, neither the identity data nor the identity relevance data uploaded by the target user on the target platform is desensitized.

If the target user instructs medium-level privacy protection, desensitization processing is performed on the identity data uploaded by the target user on the target platform and on the communication number in the identity relevance data, and no desensitization processing is performed on the identity relevance data other than the communication number. For example, the identity data and the mobile phone number in the identity relevance data are desensitized, while the positioning data and the like are not. When the identity data and the communication number in the identity relevance data are desensitized, operations such as Gaussian blur or randomization may be performed.

If the target user instructs high-level privacy protection, desensitization processing is performed on the identity data uploaded by the target user on the target platform and on the communication number, the address information, and the landmark building in the identity relevance data, and the positioning data in the identity relevance data is randomized. That is, in addition to the processing performed for medium-level privacy protection, the positioning data is randomized and the address information and the landmark building are desensitized. When the positioning data is randomized, a position may be selected at random within a certain distance centered on the original positioning data, for example within a range of 1500 m.
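
The three processing modes can be sketched as follows. This is an illustrative sketch under several assumptions: the record layout and key names are hypothetical, digit randomization stands in for the desensitization of numbers, the address is simply replaced by a placeholder, and the location offset uses a flat-earth approximation. Face images and landmark building images, which the text suggests could be blurred, are left out.

```python
import math
import random

METERS_PER_DEGREE = 111_320.0  # approximate length of one degree of latitude


def randomize_digits(text, rng):
    """Replace every digit with a random digit, keeping other characters."""
    return "".join(str(rng.randrange(10)) if ch.isdigit() else ch for ch in text)


def randomize_location(lat, lon, radius_m, rng):
    """Pick a random point within radius_m metres of (lat, lon)."""
    distance = radius_m * math.sqrt(rng.random())
    angle = rng.uniform(0.0, 2.0 * math.pi)
    d_lat = distance * math.cos(angle) / METERS_PER_DEGREE
    d_lon = distance * math.sin(angle) / (METERS_PER_DEGREE * math.cos(math.radians(lat)))
    return lat + d_lat, lon + d_lon


def desensitize(record, level, rng, radius_m=1500.0):
    """Apply the scheme chosen by the user: 'none', 'medium', or 'high'."""
    result = dict(record)
    if level == "none":
        return result
    # Medium and high: identity data and the communication number.
    for key in ("certificate_number", "communication_number"):
        if key in result:
            result[key] = randomize_digits(result[key], rng)
    if level == "high":
        if "address" in result:
            result["address"] = "[redacted]"
        if "location" in result:
            lat, lon = result["location"]
            result["location"] = randomize_location(lat, lon, radius_m, rng)
    return result
```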

In another embodiment, since the data uploaded by the target user on the target platform is continuously updated, in order to ensure that a more accurate privacy protection scheme can be recommended to the target user, the method of the embodiments of the present specification further includes: periodically evaluating whether the privacy disclosure risk level of the target user on the target platform has risen, and if it has risen, prompting the target user to upgrade the privacy protection scheme. For example, the privacy disclosure risk level of the target user on the target platform is re-determined every month, every year, or at another interval in the manner described in steps 101 to 105. When it is determined that the privacy disclosure risk level of the target user on the target platform has risen, for example from level 5 to level 6, the target user is prompted in time to upgrade the privacy protection scheme. After the target user chooses to upgrade the privacy protection scheme, data desensitization processing is performed on the identity data and the identity relevance data according to the privacy protection scheme selected by the target user, using the method described in step 107.

The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

According to an embodiment of another aspect, a protection apparatus for private data is provided. Fig. 6 shows a schematic block diagram of the protection apparatus for private data according to an embodiment. It is to be appreciated that the apparatus can be implemented by any apparatus, device, platform, or device cluster having computing and processing capabilities. As shown in fig. 6, the apparatus 600 includes:

an obtaining unit 601 configured to obtain history data that a target user has uploaded on a target platform;

an extracting unit 603 configured to extract, from the history data, identity data and identity relevance data, where the identity data is data uniquely corresponding to the user identity, and the identity relevance data is data capable of inferring the user identity;

an analyzing unit 605 configured to analyze the extracted identity data and the identity relevance data, and determine a privacy disclosure risk level of the target user on the target platform according to an analysis result;

a data desensitization processing unit 607 configured to perform data desensitization processing on the data uploaded by the target user on the target platform according to the privacy disclosure risk level.

In one embodiment, the identity data includes at least one of a face image and a certificate number;

the extraction unit 603 is configured to:

identifying whether a face image exists in the historical data by using a face detection model; if the historical data contains a face image, extracting the face image and using it as identity data; and/or identifying whether a certificate number exists in the historical data by using an OCR model; if the historical data contains a certificate number, extracting the certificate number and using it as identity data.

In one embodiment, the identity association data includes at least one of location data, address information, a communication number, and a landmark building;

the extraction unit 603 is configured to:

identifying whether the historical data has positioning data or not; if the historical data contains positioning data, extracting the positioning data and taking the positioning data as identity relevance data;

and/or, identifying whether address information exists in the historical data by using an OCR model; if the historical data contains address information, extracting the address information and using the address information as identity relevance data;

and/or, whether landmark buildings exist in the historical data is identified by using the landmark detection model; if the landmark buildings exist in the historical data, the landmark buildings are extracted and serve as identity relevance data.

In one embodiment, among others, the analysis unit 605 is configured to:

In one embodiment, where the identity data includes a face image and a certificate number, the analysis unit 605 is configured to:

if the extracted identity data comprises the certificate number, determining the risk level corresponding to the certificate number;

In one embodiment, the analysis unit 605 is configured to:

if the extracted identity relevance data comprises the communication numbers, determining the risk level corresponding to the communication numbers according to the number and/or the type of the extracted communication numbers;

In one embodiment, the identity data comprises a face image, and the identity relevance data comprises positioning data, address information, a communication number and a landmark building;

the analysis unit 605 is configured to:

clustering the face images in the extracted identity data, determining the face images belonging to the target user in the extracted face images according to the clustering result, and extracting the face features of the face images belonging to the target user to obtain face feature vectors;

respectively carrying out spherical data augmentation on the face characteristic vector and the identity relevance data characteristic vector, and dividing the augmented data into a training set and a test set;

training the risk recognition double-layer perceptron by using a training set, and calculating the value of a loss function of the trained risk recognition double-layer perceptron by using a test set;

In one embodiment, among others, the analysis unit 605 is configured to:

In one embodiment, among others, the analysis unit 605 is configured to:

In one embodiment, among others, the data desensitization processing unit 607 is configured to:

if it is determined that the target user instructs that no privacy protection be performed, performing no desensitization processing on the data uploaded by the target user on the target platform;

and if it is determined that the target user instructs high-level privacy protection, performing desensitization processing on the identity data uploaded by the target user on the target platform and on the communication numbers, the address information, and the landmark buildings in the identity relevance data, and randomizing the positioning data in the identity relevance data.

In one embodiment, among others, the apparatus further comprises:

an evaluation unit configured to periodically evaluate whether the privacy disclosure risk level of the target user on the target platform has risen;

and a prompting unit configured to prompt the target user to upgrade the privacy protection scheme if the privacy disclosure risk level of the target user on the target platform has risen.

According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 1 to 5.

According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory and a processor, the memory having stored therein executable code, the processor implementing the method described in conjunction with fig. 1-5 when executing the executable code.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.

The above embodiments further describe the objects, technical solutions, and advantages of the present invention in detail. It should be understood that the above embodiments are only exemplary embodiments of the present invention and are not intended to limit the scope of the present invention; any modifications, equivalent substitutions, improvements, and the like made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.

Claims

1. A method for protecting private data, comprising:

2. The method of claim 1, wherein the identity data comprises at least one of a face image and a certificate number;

extracting identity data from the historical data, including:

3. The method of claim 1, wherein the identity association data comprises at least one of location data, address information, a communication number, and a landmark building;

extracting identity association data from the historical data, including:

4. The method of claim 1, wherein determining the privacy leakage risk level of the target user at the target platform comprises:

5. The method of claim 4, wherein,

the analyzing the extracted identity data and determining the risk level corresponding to the identity data according to the analysis result comprises the following steps:

6. The method of claim 4, wherein,

the analyzing the extracted identity relevance data and determining the risk level corresponding to the identity relevance data according to the analysis result comprises the following steps:

if the extracted identity relevance data comprises the communication number, determining the risk level corresponding to the communication number according to the type and/or the number of the extracted communication number;

7. The method of claim 1, wherein the identity data comprises a facial image, the identity association data comprises positioning data, address information, a communication number, and a landmark building;

and training the risk recognition double-layer perceptron by using the training set, and calculating the value of the loss function of the trained risk recognition double-layer perceptron by using the test set.

8. The method of claim 7, wherein the performing spherical data augmentation on the face feature vector and the identity relevance data feature vector respectively, and dividing augmented data into a training set and a test set comprises:

defining a second spherical radius for augmenting the identity relevance data feature vector, randomly sampling within the second spherical radius around the identity relevance data feature vector to obtain augmented identity relevance data feature vectors, selecting one part of the augmented identity relevance data feature vectors as a training set of the identity relevance data feature vector, and using the other part of the augmented identity relevance data feature vectors as a test set of the identity relevance data feature vector.

9. The method of claim 7, wherein training the risk recognition double-layer perceptron using the training set and calculating a value of a loss function of the trained risk recognition double-layer perceptron using the test set comprises:

10. The method of claim 1, wherein the performing data desensitization processing on the data uploaded by the target user on the target platform comprises:

11. A protection apparatus for private data, comprising:

12. The apparatus of claim 11, wherein the identity data comprises at least one of a face image and a certificate number, the extraction unit configured to:

13. The apparatus of claim 11, wherein the identity association data comprises at least one of location data, address information, a communication number, and a landmark building;

the extraction unit is configured to:

14. The apparatus of claim 11, wherein the analysis unit is configured to:

15. The apparatus of claim 14, wherein,

the analysis unit is configured to:

16. The apparatus of claim 14, wherein,

the analysis unit is configured to:

17. The apparatus of claim 11, wherein the identity data comprises a facial image, the identity association data comprises positioning data, address information, a communication number, and a landmark building;

the analysis unit is configured to:

18. The apparatus of claim 17, wherein the analysis unit is configured to:

19. The apparatus of claim 17, wherein the analysis unit is configured to:

20. The apparatus of claim 11, wherein the data desensitization processing unit is configured to:

21. A computing device comprising a memory and a processor, wherein the memory has stored therein executable code that, when executed by the processor, performs the method of any of claims 1-10.