CN114564740A - Big data anonymization processing method and big data processing equipment - Google Patents


Info

Publication number
CN114564740A
Authority
CN
China
Prior art keywords
data
privacy
user behavior
information
attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202210139315.9A
Other languages
Chinese (zh)
Inventor
陈笑男
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN202210139315.9A
Publication of CN114564740A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6254Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Storage Device Security (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides a big data anonymization processing method and a big data processing device. Data attribute identifiers of target type data having the same category of data description attribute in a plurality of user behavior data blocks of a user behavior data set are obtained from the user behavior data set at a first data security level. Data segments corresponding to the data attribute identifiers are then respectively obtained from the user behavior data blocks, yielding a plurality of pieces of to-be-processed data information. Finally, operations such as anonymization pre-analysis on a privacy tag sequence yield the importance information of the corresponding privacy data, which is used to process that privacy data hierarchically. Privacy data can thus be anonymized at different levels, realizing big data privacy protection for the target data information and meeting the requirements of different scenarios.

Description

Big data anonymization processing method and big data processing equipment
This application is a divisional application of invention patent application No. 202110175876.X, filed on February 6, 2021 and entitled "Data processing method and big data processing device for big data privacy protection".
Technical Field
The invention relates to the technical field of big data, in particular to a data processing method and big data processing equipment aiming at big data privacy protection.
Background
With the continuous development of computer science and information technology, big data has gradually become a high-value resource developed and exploited by governments, enterprises, individuals, and other parties. As big data technology matures, data mining, integration, and trading become more convenient. However, with big data in widespread use, data privacy disclosure is an issue of concern to all of these parties, and in the internet era digitization has further increased the risk of such disclosure. While big data brings enormous value, how to effectively prevent abnormal disclosure of privacy in each scenario is therefore an important technical problem that the industry urgently needs to solve.
Disclosure of Invention
In view of the defects of existing designs, an embodiment of the invention provides a data processing method for big data privacy protection, applied to a big data processing device and comprising the following steps:
acquiring, from a user behavior data set of a first data security level, data attribute identifiers of target type data having the same category of data description attribute in a plurality of user behavior data blocks of the user behavior data set, wherein each user behavior data block comprises data content obtained by data acquisition for at least one user behavior;
respectively acquiring data segments corresponding to the data attribute identifiers from the user behavior data blocks to obtain a plurality of pieces of to-be-processed data information;
and performing privacy data processing on the to-be-processed data information of each first data security level according to a preset privacy data processing rule to obtain target data information of a second data security level, the second data security level being used to realize big data privacy protection for the target data information.
In the embodiment provided by the present invention, the obtaining, from the user behavior data set of the first data security level, data attribute identifiers of target type data having the same type of data description attribute in a plurality of user behavior data blocks of the user behavior data set includes:
vector representation is carried out on each data fragment of each user behavior data block in the user behavior data set to obtain a first data description matrix, data attribute identification is carried out on each first data description matrix, and data attribute identification of each data fragment in each user behavior data block in the user behavior data set is obtained;
and matching the data attribute identifications of the target type data in the user behavior data blocks from the data attribute identifications of the identified data fragments.
In the embodiment provided by the present invention, vector-representing each data fragment of each user behavior data block in the user behavior data set to obtain a first data description matrix, and performing data attribute identification on each first data description matrix to obtain a data attribute identifier of each data fragment of target type data in each user behavior data block of the user behavior data set, includes:
inputting the user behavior data set into a first privacy data recognition model obtained by pre-training, performing feature vector conversion on each user behavior data block in the user behavior data set by using a feature vector conversion layer of the first privacy data recognition model to obtain a first data description matrix, and performing data attribute recognition on each first data description matrix by using an attribute extraction layer of the first privacy data recognition model to obtain data attribute identifications of each data fragment of target type data in each user behavior data block of the user behavior data set;
the feature vector conversion layer is used for performing at least one of the following feature vector conversions: the feature vector conversion layer comprises a target attribute extraction layer, and the data granularity of data extraction performed by an attribute extraction kernel of the target attribute extraction layer is the data size corresponding to at least one minimum data block of the data storage mode of the user behavior data block.
In the embodiment provided by the present invention, the vector representation of each data fragment of each user behavior data block in the user behavior data set to obtain a first data description matrix, and the data attribute identification of each first data description matrix to obtain the data attribute identifier of each data fragment in each user behavior data block in the user behavior data set includes:
vector representation is carried out on each data segment of each user behavior data block in the user behavior data set by adopting a preset data conversion mode to obtain a first data description matrix; the preset data conversion mode at least comprises attribute mapping and content hashing, wherein the attribute mapping and the content hashing comprise the steps of mapping the data attributes of each data segment to vector representations in a preset vector corresponding table, carrying out content hashing operation on the data contents of each data segment, and then correspondingly storing the data contents of each data segment and the corresponding vector representations;
and inputting each first data description matrix into a second privacy data recognition model obtained by pre-training, and performing data attribute recognition on each first data description matrix by an attribute extraction layer of the second privacy data recognition model to obtain data attribute identification of each data segment of target type data in each user behavior data block of the user behavior data set.
In the embodiment provided by the present invention, the data attribute identifier includes: a data type label of preset data information in the target type data, and a privacy type label indicating the privacy type corresponding to the target type data.
The obtaining of the data segments corresponding to the data attribute identifiers from the user behavior data blocks, respectively, to obtain a plurality of pieces of to-be-processed data information includes:
for each user behavior data block, acquiring, according to the data extraction range that corresponded to the target type data at acquisition time as indicated by the data type label and privacy type label in the data attribute identifier of the block, the data segments in the block whose type label is a preset type label, and determining the acquired data segments as to-be-processed data information; or
for each user behavior data block, traversing the data segments in the block whose matching type label is a privacy type label according to the data type label of the target type data in the data attribute identifier of the block, mapping the obtained data segments from their different privacy type labels onto target type labels by tag mapping, and determining the data information corresponding to the tag-mapped data segments as the to-be-processed data information.
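The tag-mapping alternative above can be sketched as a simple relabelling step. All tag names below are invented for illustration; the patent does not name concrete tags.

```python
# Hypothetical privacy type tags mapped onto a single target type tag
# (the patent's "tag mapping mode"); all tag names are illustrative.
PRIVACY_TO_TARGET_TAG = {
    "privacy:identity": "target:personal_data",
    "privacy:location": "target:personal_data",
}

def relabel_privacy_segments(segments):
    """Traverse segments whose type tag is a privacy type tag and map them
    onto the target type tag, producing to-be-processed data information."""
    return [
        (PRIVACY_TO_TARGET_TAG[tag], payload)
        for tag, payload in segments
        if tag in PRIVACY_TO_TARGET_TAG
    ]
```

Segments carrying non-privacy tags simply fall out of the result, which matches the traversal-and-match reading of the text.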
In the embodiment provided by the invention, performing privacy data processing on the to-be-processed data information of each first data security level to obtain the target data information of the second data security level comprises one of the following:
performing vector representation on each piece of to-be-processed data information to obtain a second data description matrix, performing data marking on the corresponding data description of the target type data in each second data description matrix to obtain a data-marked third data description matrix, and performing privacy data processing on each third data description matrix to obtain the target data information, wherein the privacy data processing comprises at least one of differential privacy processing, privacy diversification processing, and privacy anonymization processing; or
inputting each piece of to-be-processed data information into a third privacy data recognition model obtained by pre-training, performing feature vector conversion on each input piece by a feature vector conversion layer of the third privacy data recognition model to obtain a second data description matrix, performing data marking on the corresponding data description of the target type data in each second data description matrix by a matrix data marking layer of the model to obtain a data-marked third data description matrix, and performing privacy data processing on each third data description matrix by a matrix data privacy processing layer of the model to obtain the target data information. The feature vector conversion layer is configured to perform at least one of feature representation mapping processing, attribute and content segmentation processing, and attribute feature standardization processing; it comprises a target attribute extraction layer, and the data granularity at which an attribute extraction kernel of the target attribute extraction layer extracts data is the data size corresponding to at least one minimum data block of the storage mode of the user behavior data block.
In the embodiment provided by the present invention, the vector representation of each piece of to-be-processed data information to obtain the second data description matrix includes:
performing vector representation on each piece of to-be-processed data information in a preset data conversion mode to obtain a second data description matrix, the preset data conversion mode at least comprising attribute mapping and content hashing.
The data marking of the corresponding data description of the target type data in each second data description matrix to obtain a data-marked third data description matrix, and the privacy data processing of each third data description matrix to obtain the target data information, include:
inputting each second data description matrix into a fourth privacy data recognition model obtained by pre-training, performing data marking on the corresponding data description of the target type data in each second data description matrix by a matrix data marking layer of the fourth privacy data recognition model to obtain a data-marked third data description matrix, and performing privacy data processing on each third data description matrix by a matrix data privacy processing layer of the fourth privacy data recognition model to obtain the target data information.
In the embodiment provided by the present invention, the obtaining of the target data information by performing private data processing on each third data description matrix includes:
performing matrix fusion on the third data description matrix to obtain a fusion data matrix, and performing privacy data processing on the fusion data matrix through at least one data privacy processing unit to obtain the target data information; or
Mapping each data element in the third data description matrix into a specified data storage interval according to a preset mapping relation, and taking the mapped specified data storage interval as the target data information, wherein the data occupation space of the specified data storage interval is larger than that of the data-marked data information; or
Extracting the position information of the data elements with the same type of data description attributes in the third data description matrix, and performing privacy data processing on the data elements corresponding to the position information according to the position information to obtain the target data information; or
Performing data security processing on the third data description matrixes respectively according to the data security policy corresponding to the first data security level, and performing differential privacy processing, privacy diversification processing or privacy anonymization processing on each third data description matrix after data security processing to obtain the target data information; or
And performing differential privacy processing, privacy diversification processing or privacy anonymization processing on the third data description matrix to obtain reference data information, and performing data security processing on the reference data information according to a data security policy corresponding to the first data security level to obtain the target data information.
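Among the alternatives above, differential privacy processing is the most standard technique; the sketch below applies the classic Laplace mechanism element-wise to a third data description matrix. The epsilon and sensitivity values are illustrative only — the patent prescribes neither parameters nor this particular mechanism.

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise by inverse transform sampling."""
    u = rng.random() - 0.5
    if u <= -0.5:            # guard against log(0); probability ~0
        u = -0.5 + 1e-12
    sign = -1.0 if u < 0 else 1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def dp_process_matrix(matrix, epsilon=1.0, sensitivity=1.0, seed=0):
    """Add Laplace(sensitivity / epsilon) noise to every matrix element,
    one hedged reading of the patent's "differential privacy processing"."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    return [[x + laplace_noise(scale, rng) for x in row] for row in matrix]
```

Smaller epsilon means more noise and stronger privacy; the seed is only there to make the sketch reproducible.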
In the embodiment provided by the present invention, the processing of the to-be-processed data information of each first data security level according to the preset privacy data processing rule to obtain the target data information of the second data security level includes:
acquiring, from the to-be-processed data information of each first data security level, a local privacy tag sequence and a global privacy tag sequence corresponding to each piece of to-be-processed data information, where the local privacy tag sequence may include local privacy tags respectively corresponding to the data segments in each user behavior data block in the to-be-processed data information, one local privacy tag corresponding to the data of one user behavior data block;
performing anonymization pre-analysis on the local privacy tag sequence and the global privacy tag sequence in the privacy data information corresponding to the to-be-processed data information, based on the sequence correlation coefficient between the two sequences, to obtain an anonymization pre-analysis result;
determining, according to the anonymization pre-analysis result, a global privacy tag that is abnormal in the pre-analysis as a to-be-matched global privacy tag, and determining the anonymization demand information matched with the to-be-matched global privacy tag according to the information correlation coefficient between the data information corresponding to the non-abnormal global privacy tags in the result and the data information corresponding to the to-be-matched global privacy tag;
performing anonymization pre-analysis on the to-be-matched global privacy tag according to the anonymization demand information matched with it, to obtain an updated anonymization pre-analysis result; and
obtaining, according to the updated anonymization pre-analysis result, an anonymization processing instruction corresponding to the privacy data processing rule, and anonymizing the privacy data information according to the instruction to obtain the target data information.
the acquiring of the local privacy tag sequence and the global privacy tag sequence in the privacy data information corresponding to the data information to be processed includes:
according to-be-processed data information of each first data security level, at least two local privacy tags and at least two global privacy tags in the privacy data information corresponding to the to-be-processed data information are obtained;
obtaining a local privacy tag correlation coefficient and a local privacy tag feature difference between the at least two local privacy tags, and obtaining a global privacy tag correlation coefficient and a global privacy tag feature difference between the at least two global privacy tags;
arranging the at least two local privacy tags according to the correlation coefficient of the local privacy tags and the characteristic difference of the local privacy tags to obtain a local privacy tag sequence in the privacy data information corresponding to the data information to be processed; a sequence of local privacy tags comprising at least one local privacy tag; arranging the at least two global privacy tags according to the global privacy tag correlation coefficient and the global privacy tag feature difference to obtain a global privacy tag sequence in the privacy data information corresponding to the data information to be processed; a sequence of global privacy tags includes at least one global privacy tag.
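The arrangement step above — ordering privacy tags by correlation coefficient and feature difference — is left abstract in the patent. Under the assumption that each tag carries a numeric feature vector, one plausible reading is to order the tags by their Pearson correlation with a reference tag, breaking ties by feature distance:

```python
def pearson(a, b):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb) if sa and sb else 0.0

def feature_difference(a, b):
    """L1 distance as a simple stand-in for the tag feature difference."""
    return sum(abs(x - y) for x, y in zip(a, b))

def arrange_privacy_tags(tags):
    """Order (name, feature_vector) tags into a privacy tag sequence:
    highest correlation with the first tag comes first, ties broken by
    the smallest feature difference."""
    _, ref = tags[0]
    return [
        name for name, vec in sorted(
            tags,
            key=lambda t: (-pearson(ref, t[1]), feature_difference(ref, t[1])),
        )
    ]
```

The choice of reference tag, correlation measure, and distance are all assumptions made for the sketch; the same arrangement would apply to global privacy tags.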
The invention also provides a big data processing device, which comprises a processor and a machine-readable storage medium connected to the processor. The machine-readable storage medium is used to store programs, instructions, or code, and the processor executes the programs, instructions, or code in the machine-readable storage medium to realize the above data processing method for big data privacy protection.
In summary, in the data processing method and big data processing device for big data privacy protection provided by the embodiments of the invention, data attribute identifiers of target type data having the same category of data description attribute in a plurality of user behavior data blocks of a user behavior data set are obtained from the user behavior data set at a first data security level. Data segments corresponding to the data attribute identifiers are then respectively obtained from the user behavior data blocks, yielding a plurality of pieces of to-be-processed data information. Finally, the to-be-processed data information at each first data security level is subjected to privacy data processing according to a preset privacy data processing rule to obtain target data information at a second data security level, so that big data privacy protection for the target data information can be achieved. In addition, intelligent data processing tools such as privacy data recognition models and data description matrices are introduced, improving the accuracy both of privacy data recognition and of big data privacy protection. Meanwhile, hierarchical processing of different privacy data is achieved through data analysis along different dimensions, so that the privacy protection requirements of different occasions can be met and user experience is improved.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the invention and should therefore not be considered limiting of its scope; those skilled in the art can derive other related drawings from them without inventive effort.
Fig. 1 is a schematic flowchart of a data processing method for big data privacy protection according to an embodiment of the present invention.
Fig. 2 is a flow chart illustrating the sub-steps of step S10 in fig. 1.
Fig. 3 is a schematic flow chart illustrating the sub-steps of step S30 in fig. 1.
Fig. 4 is a schematic diagram of a big data processing device according to an embodiment of the present invention.
Fig. 5 is a functional block diagram of the big data processing apparatus in fig. 4.
Detailed Description
Exemplary embodiments of the present invention are described in detail herein. Where the following description refers to the accompanying drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims. The terminology used herein is for describing particular embodiments only and is not intended to limit the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a schematic flowchart of a data processing method for big data privacy protection according to an embodiment of the present invention. In this embodiment, the method may be implemented by a big data processing device, which may be a computer, a server, a computer cluster, a server cluster, a cloud server, a cloud data platform, or another device with big data processing and analysis capabilities, but is not limited thereto. The data processing method comprises steps S10 to S30, described in detail below.
Step S10, obtaining, from the user behavior data set of the first data security level, data attribute identifications of target type data with the same category of data description attribute in a plurality of user behavior data chunks of the user behavior data set. In this embodiment, each user behavior data block includes data content obtained by performing data acquisition on at least one user behavior.
Step S20, respectively obtaining data segments corresponding to the data attribute identifiers from the user behavior data blocks, and obtaining multiple pieces of to-be-processed data information.
Step S30, performing privacy data processing on the to-be-processed data information of each first data security level according to a preset privacy data processing rule to obtain target data information of a second data security level, where the second data security level is used to implement big data privacy protection for the target data information.
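Steps S10 to S30 can be summarized as a three-stage pipeline. The sketch below assumes each user behavior data block is a mapping from data attribute identifier to data segment, and takes the identifier-recognition and privacy-processing steps as pluggable callables — all simplifying assumptions, since the patent leaves the concrete data layout open.

```python
def process_user_behavior_dataset(dataset, identify_attrs, process_privacy):
    """S10: obtain attribute identifiers of the target type data;
    S20: pull the matching data segments out of each block;
    S30: apply privacy processing to lift the result from the first
    to the second data security level."""
    attr_ids = identify_attrs(dataset)                       # S10
    pending = [                                              # S20
        {aid: block[aid] for aid in attr_ids if aid in block}
        for block in dataset
    ]
    return [process_privacy(info) for info in pending]       # S30
```

With trivial stand-ins — a fixed identifier list for S10 and a masking rule for S30 — the pipeline keeps only the identified attributes of each block and emits their privacy-processed form.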
In detail, in this embodiment, the first data security level and the second data security level in the above steps may be preset data level types that identify the privacy status of the to-be-processed data information, and are mainly used to distinguish data in different privacy states. For example, the first data security level may represent originally collected data that has not undergone privacy processing or privacy protection, or data that has undergone some privacy processing but has not yet reached the privacy level required by this embodiment, while the second data security level represents data that has undergone the privacy processing or privacy protection of this embodiment. In this embodiment the second data security level is higher than the first, and a higher privacy level indicates better protection of the private information within the data. The user behavior data set of the first data security level may be composed of user behavior data obtained through big data acquisition — by the big data processing device, or by other data acquisition devices connected to or communicating with it — of the user behavior data generated when users use a relevant application client.
The data attribute identifier may be identification information carried by different data segments and used for representing data related characteristics (such as a user account ID, a user gender, a user age, a user location, and the like). The data segment may be a data segment formed by data acquired by data acquisition of a user behavior at a time, or a data segment formed by data acquired by data acquisition of a user behavior in one data acquisition period, or a data segment formed by data division of user behavior data acquired in a certain time according to a set data segment division rule, and is not particularly limited herein. The data segments may have the same data size (e.g., have the same byte space) or may have different data sizes (e.g., have different byte spaces).
The private data processing according to the set privacy processing rule may be a data processing method of identifying the private data in each data segment and processing the identified private data with a set privacy processing method (for example, differential privacy processing, privacy anonymization processing, or big data desensitization processing).
The specific implementation method of the above related steps will be described in detail with reference to specific embodiments.
In step S10, the data attribute identifiers of target type data having the same category of data description attribute in the user behavior data blocks of the user behavior data set are obtained from the user behavior data set of the first data security level. One specific implementation comprises the following sub-steps S101 to S103, described in detail below.
And a substep S101, performing vector representation on each data segment of each user behavior data block in the user behavior data set to obtain a first data description matrix.
In detail, in this embodiment, each user behavior data block may be a storage space for storing data obtained by data acquisition of a user behavior, may also be a storage space for storing data obtained by data acquisition of a user behavior in a data acquisition period, or may also be a storage space for storing user behavior data acquired in a certain time, which is not specifically limited herein, and the user behavior data block may be understood as a data storage unit or a data storage section in this embodiment.
In addition, in this embodiment, the user behavior data set may be input into a first privacy data recognition model obtained by pre-training, and a feature vector conversion layer of the first privacy data recognition model performs feature vector conversion on each user behavior data block in the set to obtain the first data description matrix. The feature vector conversion layer realizes the conversion through methods such as feature representation mapping processing, attribute and content segmentation processing, and attribute feature standardization processing; it comprises a target attribute extraction layer, and the data granularity at which an attribute extraction kernel of that layer extracts data is the data size corresponding to at least one minimum data block of the storage mode of the user behavior data block. Feature representation mapping processing may, for example, vector-map the attributes of the data in a data segment according to a set mapping relation to obtain a corresponding vector representation. Attribute and content segmentation processing may, for example, separate the data content and data attributes of each data segment, represent the attributes and the data as vectors respectively, and express the data content of each segment as a feature vector matrix. Attribute feature standardization processing may, for example, encode the data attributes of the content of each data segment into a standard feature description interval according to a set uniform or standard encoding rule, thereby obtaining the feature vector matrix.
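Of the three conversions, attribute feature standardization is the most concrete: encoding values into a standard feature description interval. A minimal min-max reading — one possible interpretation, since the patent only requires a set uniform encoding rule — looks like this:

```python
def standardize_features(matrix):
    """Encode each column of a data description matrix into the standard
    interval [0, 1] by min-max scaling; constant columns map to 0."""
    columns = list(zip(*matrix))
    scaled = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0      # avoid division by zero
        scaled.append([(v - lo) / span for v in col])
    return [list(row) for row in zip(*scaled)]
```

Any other fixed encoding rule (z-scores, fixed-width buckets) would equally satisfy the "standard feature description interval" wording.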
And a substep S102, performing data attribute identification on each first data description matrix to obtain a data attribute identifier of each data segment in each user behavior data block in the user behavior data set.
In detail, in this embodiment, a data attribute of each first data description matrix may be identified by an attribute extraction layer of the first privacy data identification model, so as to obtain a data attribute identifier of each data segment of the target type data in each user behavior data block of the user behavior data set.
And a substep S103, matching the data attribute identifications of the target type data in the user behavior data blocks from the data attribute identifications of the identified data fragments.
In this way, by the method described above, data attribute identifications of target type data having the same type of data description attribute in a plurality of user behavior data blocks of the user behavior data set can be obtained. The target type data is data of a type of interest, for example, data on which privacy processing is to be performed.
The data attribute identification of the target type data in the user behavior data block may include: data type labels of the feature points of the target type data in the user behavior data block and a type label of the target type data in the user behavior data block; or data type labels of the start and end points of the target type data detection box, and the like. The data attribute identifier is not limited to a specific form; it may be any data that can locate the target type data in the user behavior data block.
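One possible shape of such an identifier can be sketched as a small data structure. The field names and the byte-offset locating scheme are hypothetical assumptions for illustration; the patent does not prescribe a concrete format.

```python
from dataclasses import dataclass

@dataclass
class DataAttributeIdentifier:
    data_type_label: str    # e.g. type label of a feature point or detection-box endpoint
    privacy_type_label: str # privacy type of the target type data
    start: int              # assumed start offset of the data segment in the block
    end: int                # assumed end offset (exclusive)

    def locate(self, block):
        """Locate the target type data inside a user behavior data block."""
        return block[self.start:self.end]

ident = DataAttributeIdentifier("name", "identity", 4, 9)
located = ident.locate(b"xxx Alice yy")
```

Any structure that can locate the target type data in the block would serve equally well.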
In this embodiment, the target type data may be preset data that needs privacy protection; the specific type is not limited and may be, for example, a data type representing privacy information about the user's identity, such as an account, name, gender, age, or income. For example, this embodiment may first perform data preprocessing to convert the user behavior data block of the first data security level into common data on which target identification can be performed, and then perform target identification; alternatively, target identification may be performed directly on the user behavior data block of the first data security level to obtain the data attribute identifier. The specific implementation is not limited.
In addition, in the above step S101 and step S102, vector representation is performed on each data segment of each user behavior data block in the user behavior data set to obtain a first data description matrix, and data attribute identification is performed on each first data description matrix to obtain a data attribute identifier of each data segment in each user behavior data block in the user behavior data set, where another alternative implementation manner is as follows:
firstly, performing vector representation on each data segment of each user behavior data block in the user behavior data set by adopting a preset data conversion mode to obtain a first data description matrix;
and then, inputting each first data description matrix to a second privacy data recognition model obtained by pre-training, and performing data attribute recognition on each first data description matrix by an attribute extraction layer of the second privacy data recognition model to obtain data attribute identifications of each data segment of the target type data in each user behavior data block of the user behavior data set.
In this embodiment, the preset data conversion manner may include attribute mapping and content hashing. Attribute mapping and content hashing comprises mapping the data attribute of each data segment to a vector representation in a preset vector correspondence table, performing a content hashing operation on the data content of each data segment, and then storing the hashed content together with the corresponding vector representation.
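A minimal sketch of this attribute mapping and content hashing conversion follows. The correspondence table entries and the choice of SHA-256 as the content hash are assumptions for illustration only.

```python
import hashlib

# Hypothetical preset vector correspondence table.
VECTOR_TABLE = {"account": (1, 0, 0), "income": (0, 1, 0), "location": (0, 0, 1)}

def convert_segment(attribute, content):
    """Attribute mapping and content hashing for one data segment."""
    vector = VECTOR_TABLE[attribute]                             # attribute mapping
    content_hash = hashlib.sha256(content.encode()).hexdigest()  # content hashing
    return vector, content_hash                                  # stored together

vec, h = convert_segment("income", "85000")
```

The pair returned for each segment is what would be stored correspondingly: the vector representation alongside the hashed content.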
Further, in this embodiment, the data attribute identifier may include: a data type label of preset data information in the target type data and a privacy type label indicating the privacy type corresponding to the target type data. The preset data information in the target type data may be, for example, information containing private or sensitive data, such as the user's account information, name, gender, age, and income.
Based on the above, in step S20, the data segments corresponding to the data attribute identifiers are respectively obtained from the user behavior data blocks to obtain a plurality of pieces of to-be-processed data information, and the specific implementation method may be any one of the following two implementation methods.
The first method is as follows: for each user behavior data block, according to the data type label and the privacy type label in the data attribute identifier of the user behavior data block, acquire the data segments in the user behavior data block whose type label is a preset type label within the data extraction range corresponding to the target type data, and determine the acquired data segments as the to-be-processed data information. For example, the data extraction range may be a storage interval for the corresponding data, obtained by performing data matching in the user behavior data block according to the data attribute representation and then querying.
The second method is as follows: for each user behavior data block, according to the data type label of the target type data in the data attribute identifier of the user behavior data block, traverse the user behavior data block to match the data segments whose type label is a privacy type label. Then, map the obtained data segments from their different privacy type labels to a target type label by way of label mapping, and determine the data information corresponding to the label-mapped data segments as the to-be-processed data information. For example, in this embodiment, the privacy type labels include various labels belonging to privacy types, such as an account information label, an income information label, and a geographic location information label, which may then be mapped uniformly to preset uniform type labels, such as a primary privacy information label, a secondary privacy information label, and a tertiary privacy information label. Privacy information labels of different levels may represent different privacy degrees, with a higher level indicating a higher required level of privacy protection. The corresponding privacy data can therefore subsequently be subjected to targeted privacy processing according to the level of the target type label. For example, for a privacy label of the highest level, the corresponding privacy data may be directly deleted, while for a privacy label of the next-highest level, the corresponding privacy data may be replaced with a set code, and so on; this is not specifically limited.
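The label mapping and tiered processing of the second method can be sketched as follows. The concrete tag names, tier assignments, and the replacement code are hypothetical assumptions, not the embodiment's fixed values.

```python
# Hypothetical mapping from concrete privacy type labels to uniform tiered labels.
TAG_TIER = {
    "account_information": "primary_privacy",
    "income_information": "primary_privacy",
    "geographic_location": "secondary_privacy",
}

def process_by_tier(tag, value):
    """Apply tier-dependent privacy processing to one matched data segment."""
    tier = TAG_TIER[tag]
    if tier == "primary_privacy":   # highest level: delete the privacy data outright
        return None
    return "****"                   # lower level: replace with a set code

masked = process_by_tier("geographic_location", "Beijing")
```

Deletion for the top tier and replacement for lower tiers mirrors the tier-dependent handling described in the paragraph above.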
Further, in this embodiment, in step S30, the target data information of the second data security level is obtained by performing privacy data processing on the to-be-processed data information of each first data security level, which may be implemented by either of the first and second schemes described below.
The first scheme is as follows:
vector representation is carried out on each piece of data information to be processed to obtain a second data description matrix, and data marking is carried out on corresponding data description of target type data in each second data description matrix to obtain a third data description matrix after data marking;
and carrying out privacy data processing on each third data description matrix to obtain the target data information, wherein the privacy data processing includes any one or a combination of differential privacy processing, privacy diversification processing, and privacy anonymization processing. In this embodiment, for example, privacy processing may be performed on the data-marked data descriptions, such as big data desensitization, privacy differentiation, and privacy data encryption on the data descriptions of the marked portions. Big data desensitization may, for example, replace the data description of a marked portion with preset description information, so that the collected big data as a whole remains usable for later analysis while the corresponding data blocks are retained. In this way, through the data-marked data descriptions, the data content corresponding to each data description that needs privacy processing can be found, and the corresponding data content then subjected to privacy data processing.
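The desensitization step of the first scheme can be sketched as follows. The marker convention (a boolean flag per entry) and the preset description string are assumptions for illustration.

```python
MASK = "<desensitized>"  # hypothetical preset description information

def desensitize(marked_matrix):
    """Replace every (value, marked=True) entry with the preset description,
    keeping unmarked entries so the data remains usable for later analysis."""
    return [[MASK if marked else value for value, marked in row]
            for row in marked_matrix]

# A toy third data description matrix with data marks applied.
third_matrix = [[("Alice", True), ("clicked", False)],
                [("85000", True), ("viewed", False)]]
target = desensitize(third_matrix)
```

Only marked descriptions are replaced; the surrounding behavior data is preserved for analysis, which is the point of desensitization over deletion.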
The second scheme is as follows:
inputting each piece of to-be-processed data information into a third privacy data recognition model obtained through pre-training, and performing feature vector conversion on each piece of input to-be-processed data information through a feature vector conversion layer of the third privacy data recognition model to obtain a second data description matrix;
performing data marking on corresponding data description of the target type data in each second data description matrix by using a matrix data marking layer of the third privacy data identification model to obtain a data-marked third data description matrix;
performing privacy data processing on each third data description matrix by a matrix data privacy processing layer of the third privacy data recognition model to obtain the target data information; wherein the feature vector conversion layer is configured to perform at least one of the following feature vector conversions: feature representation mapping processing, attribute and content segmentation processing, and attribute feature normalization processing. The feature vector conversion layer comprises a target attribute extraction layer, and the data granularity at which an attribute extraction kernel of the target attribute extraction layer performs data extraction is the data size corresponding to at least one minimum data block of the data storage mode of the user behavior data block.
Further, in this implementation, the vector representation of each piece of to-be-processed data information is performed to obtain the second data description matrix, and one implementation manner may be: vector representation is carried out on each piece of data information to be processed by adopting a preset data conversion mode to obtain a second data description matrix; the preset data conversion mode at least comprises attribute mapping and content hashing.
Based on this, the data marking is performed on the corresponding data description of the target type data in each second data description matrix to obtain a third data description matrix after data marking, and the privacy data processing is performed on each third data description matrix to obtain the target data information, which may be implemented in a manner that:
and inputting each second data description matrix to a fourth privacy data recognition model obtained by pre-training, performing data marking on the corresponding data descriptions of the target type data in each second data description matrix by a matrix data marking layer of the fourth privacy data recognition model to obtain a data-marked third data description matrix, and performing privacy data processing on each third data description matrix by a matrix data privacy processing layer of the fourth privacy data recognition model to obtain the target data information. The fourth privacy data recognition model may be a deep learning model obtained by performing model training in advance with data description matrix samples, and may be used to perform data marking on each data description in the data description matrix, for example, marking private data in one data marking manner and non-private data in another, different manner. Privacy data processing can then be performed in a targeted manner on the data descriptions of the relevant private data in the third data description matrix containing the corresponding data marks, achieving the big data privacy protection purpose of the embodiment of the present invention.
Based on the above, the target data information is obtained by performing privacy data processing on each third data description matrix, and a specific implementation manner may be any one of the manners described in (1) to (5) below.
(1) Perform matrix fusion on the third data description matrices to obtain a fusion data matrix, and perform privacy data processing on the fusion data matrix through at least one data privacy processing unit to obtain the target data information. In this example, the data in each user behavior data block may yield a corresponding third data description matrix; for uniform processing, the third data description matrices may be combined into a fusion data matrix by matrix fusion, so that subsequent privacy data processing can be performed directly and uniformly on the fusion data matrix, without processing multiple matrices separately.
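Manner (1) can be sketched by row concatenation followed by a single processing pass. The fusion-by-concatenation choice and the placeholder zeroing unit are assumptions; any data privacy processing unit could be substituted.

```python
def fuse(matrices):
    """Matrix fusion: concatenate the rows of all third data description matrices."""
    fused = []
    for m in matrices:
        fused.extend(m)
    return fused

def privacy_unit(row):
    """Placeholder for a data privacy processing unit (here: zero out the row)."""
    return [0.0 for _ in row]

fused = fuse([[[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0]]])
processed = [privacy_unit(r) for r in fused]
```

One pass over the fused matrix replaces per-matrix handling, which is the stated benefit of fusing first.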
(2) Map each data element in the third data description matrix to a specified data storage interval according to a preset mapping relation, and take the specified data storage interval obtained after mapping as the target data information, wherein the data occupation space of the specified data storage interval is larger than that of the data-marked data information. In this embodiment, for example, the specified data storage interval may include a private data storage interval for storing private data and a non-private data storage interval for storing non-private data. Non-private data elements in the third data description matrix that carry no private-data mark may be mapped to the non-private data storage interval, and private data elements marked as private data may be mapped to the private data storage interval. The data access permissions of the two intervals differ; for example, the data access permission of the non-private data storage interval is lower than that of the private data storage interval. Processing of the private data in the third data description matrix is thereby achieved, serving the purpose of protecting big data privacy.
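Manner (2) can be sketched as routing elements into two permission-gated intervals. The interval names and numeric access levels are illustrative assumptions.

```python
def map_to_intervals(elements):
    """Route (value, marked) elements into private / non-private storage intervals
    with differing access permissions (higher level = stricter access)."""
    intervals = {
        "private":     {"access_level": 3, "data": []},  # hypothetical permission values
        "non_private": {"access_level": 1, "data": []},
    }
    for value, marked in elements:
        key = "private" if marked else "non_private"
        intervals[key]["data"].append(value)
    return intervals

result = map_to_intervals([("Alice", True), ("clicked", False)])
```

The resulting intervals, taken together, play the role of the target data information: marked elements sit behind the higher access permission.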
(3) Extract the position information of data elements having data description attributes of the same category in the third data description matrix, and perform privacy data processing on the data elements corresponding to the position information according to that position information to obtain the target data information. In this embodiment, the position information of the corresponding data elements may be located according to the data description attributes of each same category (privacy category), and the corresponding data elements then subjected to privacy data processing at those positions, for example, upgrading the data read permission at the position of a privacy data element, or applying differential privacy processing to the data at that position.
(4) And performing data security processing on the third data description matrixes respectively according to the data security policy corresponding to the first data security level, and performing differential privacy processing, privacy diversification processing or privacy anonymization processing on each third data description matrix after data security processing to obtain the target data information. In this way, data security processing (such as public key encryption based on big data, data security access control based on user attributes, and the like) may be performed on each third data description matrix according to the data security policy corresponding to the first data security level, and then privacy processing such as differential privacy processing, privacy diversification processing, or privacy anonymization processing may be performed on the relevant privacy data in the third data description matrix, so as to protect the privacy data, and then obtain target data information having the second data security level.
(5) And performing differential privacy processing, privacy diversification processing or privacy anonymization processing on the third data description matrix to obtain reference data information, and performing data security processing on the reference data information according to a data security policy corresponding to the first data security level to obtain the target data information. In this way, privacy processing such as differential privacy processing, privacy diversification processing, privacy anonymization processing, or the like may be performed on the related privacy data in the third data description matrix to protect the privacy data, and then data security processing (such as public key encryption based on big data, data security access control based on user attributes, and the like) may be performed on each third data description matrix according to the data security policy corresponding to the first data security level, so as to obtain target data information having the second data security level.
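The privacy-first ordering of manner (5) can be sketched with Laplace-noise differential privacy followed by a security step. The epsilon value, the seeded generator, and the tagged-tuple "encryption" placeholder are assumptions; a real deployment would apply the actual data security policy (e.g. public key encryption) in the second step.

```python
import math
import random

def add_laplace(values, sensitivity=1.0, epsilon=1.0, rng=random.Random(0)):
    """Differential privacy processing: add Laplace(sensitivity/epsilon) noise
    to each value via inverse-CDF sampling."""
    scale = sensitivity / epsilon
    out = []
    for v in values:
        u = rng.random() - 0.5
        out.append(v - scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u)))
    return out

def security_process(values):
    """Placeholder for the data security policy step (e.g. encryption)."""
    return [("enc", v) for v in values]

reference = add_laplace([10.0, 20.0])   # reference data information after privacy processing
target = security_process(reference)    # then data security processing → target data information
```

Swapping the two calls would give the security-first ordering of manner (4); the two manners differ only in which step runs first.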
Further, in the embodiment of the present invention, the above privacy data processing may be privacy data anonymization processing for specific data. On this basis, as shown in Fig. 3, in the above step S30, the to-be-processed data information of each first data security level is subjected to privacy data processing according to a preset privacy data processing rule to obtain target data information of a second data security level; another alternative embodiment includes the following sub-steps S301-S305, described in detail below.
And a substep S301, obtaining a local privacy tag sequence and a global privacy tag sequence corresponding to each piece of to-be-processed data information according to the to-be-processed data information of each first data security level. In this embodiment, the local privacy tag sequence may include local privacy tags respectively corresponding to the data segments in each user behavior data block in the to-be-processed data information, and one local privacy tag may correspond to the data of one user behavior data block. The global privacy tag is used to represent the overall privacy identification of each piece of to-be-processed data information. For example, in this embodiment, each piece of to-be-processed data information may be input into a pre-trained privacy data tag model for privacy data identification, which outputs the local privacy tag sequence and the global privacy tag sequence corresponding to each piece of to-be-processed data information.
And a substep S302, based on a sequence correlation coefficient between the local privacy tag sequence and the global privacy tag sequence corresponding to the to-be-processed data information, performing anonymization pre-analysis on the local privacy tag sequence and the global privacy tag sequence in the privacy data information corresponding to the to-be-processed data information to obtain an anonymization pre-analysis result. For example, in the anonymization pre-analysis, each local privacy tag in the local privacy tag sequence may be matched for relevance against the corresponding global privacy tag, and the degree of matching with the global privacy tag taken as the anonymization pre-analysis result. For example, when the global privacy tag represents a high privacy level, the matching degree may be classified as high, medium, or low, where the privacy data corresponding to the high and medium matching degrees must subsequently undergo corresponding anonymization processing.
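A sketch of this pre-analysis follows, representing tags as sets of privacy categories and grading the match degree by overlap. The set representation and the 0.7/0.3 thresholds are assumptions for illustration.

```python
def match_degree(local_tag, global_tag):
    """Relevance-match one local privacy tag against the global privacy tag."""
    overlap = len(local_tag & global_tag) / max(len(global_tag), 1)
    if overlap >= 0.7:      # hypothetical threshold for "high"
        return "high"
    if overlap >= 0.3:      # hypothetical threshold for "medium"
        return "medium"
    return "low"

def pre_analyze(local_sequence, global_tag):
    """Anonymization pre-analysis: one match degree per local privacy tag."""
    return [match_degree(t, global_tag) for t in local_sequence]

result = pre_analyze([{"identity", "account"}, {"clickstream"}],
                     {"identity", "account"})
```

Segments graded high or medium would then be queued for anonymization processing, as the sub-step describes.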
And a substep S303, determining the global privacy tag that is abnormal in the anonymization pre-analysis as a to-be-matched global privacy tag according to the anonymization pre-analysis result, and determining anonymization demand information matched with the to-be-matched global privacy tag according to the information correlation coefficient between the data information corresponding to the anomaly-free global privacy tags in the anonymization pre-analysis result and the data information corresponding to the to-be-matched global privacy tag. In this embodiment, an anomaly in the anonymization pre-analysis may mean that the analysis result of a certain global privacy tag is abnormal; for example, because a privacy tag bit of the global privacy tag is missing, the global privacy tag cannot be matched against the corresponding local privacy tag during pre-analysis. The corresponding anonymization demand information may therefore be determined according to the information correlation coefficient between the data information corresponding to the anomaly-free global privacy tags and the data information corresponding to the to-be-matched global privacy tag; for example, the anonymization demand information corresponding to the anomaly-free global privacy tag with the highest correlation coefficient may be used as the anonymization demand information for the to-be-matched global privacy tag.
And a substep S304, performing anonymization pre-analysis on the global privacy tag to be matched according to the anonymization demand information matched with the global privacy tag to be matched. The anonymization pre-analysis process here can refer to the manner of step S302, and is not described here in detail.
And a substep S305 of obtaining an anonymization processing instruction corresponding to the privacy data processing rule based on the anonymization pre-analysis results obtained in substeps S302 and S304, and performing anonymization processing on the privacy data information based on the anonymization processing instruction to obtain the target data information. For example, the anonymization processing instruction may be an instruction indicating a specific anonymization processing method, such as K-anonymization, l-diversification, data desensitization, privacy differentiation, privacy deletion, or privacy substitution, which is not specifically limited. Different anonymization processing instructions may represent different degrees of importance of the privacy processing, with different importance corresponding to different anonymization processing modes: for example, the highest-level anonymization processing instruction may directly delete the corresponding privacy data, while the next-highest-level instruction may replace the privacy data with preset anonymized data, or apply methods such as differential privacy and data desensitization. Therefore, by analyzing the privacy tag sequences to obtain the importance information of the corresponding privacy data, the privacy data can be processed hierarchically, realizing privacy data anonymization methods of different levels to meet the requirements of different scenarios.
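The tiered dispatch of anonymization instructions can be sketched as follows. The instruction names, the perturbation range, and the preset anonymized value are illustrative assumptions, not the embodiment's fixed rule.

```python
import random

def anonymize(value, instruction, rng=random.Random(1)):
    """Dispatch one anonymization processing instruction on a privacy data value."""
    if instruction == "delete":    # highest level: delete the privacy data outright
        return None
    if instruction == "replace":   # next level: substitute preset anonymized data
        return "<anon>"
    # otherwise apply a lightweight perturbation standing in for differential
    # privacy / desensitization methods
    return value + rng.uniform(-1, 1)

kept = anonymize(5.0, "perturb")
```

The instruction level thus selects deletion, replacement, or noise-based processing, matching the level-dependent modes described above.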
In the above substep S301, the obtaining of the local privacy tag sequence and the global privacy tag sequence in the privacy data information corresponding to the to-be-processed data information specifically includes:
according to the to-be-processed data information of each first data security level, obtaining at least two local privacy tags and at least two global privacy tags in the privacy data information corresponding to the to-be-processed data information;
obtaining a local privacy tag correlation coefficient and a local privacy tag characteristic difference between the at least two local privacy tags, and obtaining a global privacy tag correlation coefficient and a global privacy tag characteristic difference between the at least two global privacy tags;
arranging the at least two local privacy tags according to the local privacy tag correlation coefficient and the local privacy tag feature difference to obtain the local privacy tag sequence in the privacy data information corresponding to the to-be-processed data information, where a local privacy tag sequence comprises at least one local privacy tag; and arranging the at least two global privacy tags according to the global privacy tag correlation coefficient and the global privacy tag feature difference to obtain the global privacy tag sequence in the privacy data information corresponding to the to-be-processed data information, where a global privacy tag sequence comprises at least one global privacy tag.
In addition, for example, the performing anonymization pre-analysis on the local privacy tag sequence and the global privacy tag sequence in the private data information corresponding to the to-be-processed data information based on a sequence correlation coefficient between the local privacy tag sequence and the global privacy tag sequence in the private data information corresponding to the to-be-processed data information to obtain an anonymization pre-analysis result includes:
determining a global privacy tag sequence in the privacy data information corresponding to the data information to be processed as a global privacy tag sequence to be analyzed, and determining a local privacy tag sequence in the privacy data information corresponding to the data information to be processed as a local privacy tag sequence to be analyzed; the global privacy tag in the global privacy tag sequence to be analyzed is obtained from a privacy tag index table which is established in advance and aims at the privacy data information corresponding to the data information to be processed;
obtaining the local privacy tags in the privacy tag index table, and determining the sequence correlation coefficient between the to-be-analyzed global privacy tag sequence and the to-be-analyzed local privacy tag sequence according to the privacy tag correlation coefficients between the local privacy tags in the privacy tag index table and the local privacy tags in the to-be-analyzed local privacy tag sequence; and when the sequence correlation coefficient is not less than a correlation coefficient threshold, performing anonymization pre-analysis on the to-be-analyzed global privacy tag sequence and the to-be-analyzed local privacy tag sequence to obtain the anonymization pre-analysis result. A sequence correlation coefficient not less than the correlation coefficient threshold indicates that no anonymization pre-analysis anomaly occurs in the corresponding global privacy tag sequence, so anonymization pre-analysis can be performed.
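The threshold check can be sketched as follows, with the sequence correlation coefficient computed as the mean pairwise tag correlation. Both the Jaccard-style tag correlation and the 0.5 threshold are assumptions for illustration; the embodiment does not fix the coefficient's definition.

```python
def tag_correlation(a, b):
    """Hypothetical privacy tag correlation coefficient: Jaccard overlap of tag sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def sequence_correlation(seq_a, seq_b):
    """Mean pairwise tag correlation between two equal-length tag sequences."""
    pairs = list(zip(seq_a, seq_b))
    return sum(tag_correlation(x, y) for x, y in pairs) / len(pairs)

def can_pre_analyze(seq_a, seq_b, threshold=0.5):
    """Pre-analysis proceeds only when the sequence correlation reaches the threshold."""
    return sequence_correlation(seq_a, seq_b) >= threshold

ok = can_pre_analyze([{"id"}, {"income"}], [{"id"}, {"income", "age"}])
```

When the check fails, the global privacy tag sequence would be treated as anomalous and handled by the matching path of sub-step S303.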
Fig. 4 is a schematic diagram of a big data processing device 1 according to an embodiment of the present invention. In this embodiment, the big data processing device 1 is configured to implement the data processing method for big data privacy protection provided by the embodiment of the present invention. In this embodiment, the big data processing device 1 may include a data processing apparatus 10, a machine-readable storage medium 11, and a processor 12.
Alternatively, the machine-readable storage medium 11 may be accessed by the processor 12 through a bus interface. The machine-readable storage medium 11 may also be integrated into the processor 12, and may be, for example, a cache and/or general purpose registers.
The processor 12 is the control center of the big data processing device 1. It connects the various parts of the entire big data processing device 1 through various interfaces and lines, and performs the various functions of the big data processing device 1 and processes data by running or executing the software programs and/or modules stored in the machine-readable storage medium 11 and calling the data stored therein, thereby controlling the big data processing device 1 as a whole. Optionally, the processor 12 may include one or more processing cores. For example, the processor 12 may integrate an application processor, which mainly handles the operating system, user interface, and applications, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor may also not be integrated into the processor 12.
The processor 12 may be a general-purpose Central Processing Unit (CPU), a microprocessor, an Application-Specific Integrated Circuit (ASIC), or the like. The machine-readable storage medium 11 may be, but is not limited to, a ROM or other type of static storage device that may store static information and instructions, a RAM or other type of dynamic storage device that may store information and instructions, and the like. The machine-readable storage medium 11 may be self-contained and coupled to the processor 12 via a communication bus. The machine-readable storage medium 11 may also be integrated with the processor. The machine-readable storage medium 11 is used for storing machine-executable instructions for executing the scheme of the application. The processor 12 is configured to execute machine-executable instructions stored in the machine-readable storage medium 11 to implement the methods provided by the present invention.
Fig. 5 is a schematic diagram of functional modules of the data processing apparatus 10. The data processing apparatus 10 includes a plurality of software functional modules, and machine executable programs or instructions corresponding to the software functional modules may be stored in the machine readable storage medium and executed by the processor 12, so as to implement the data processing method for big data privacy protection according to the present invention. In detail, the data processing apparatus 10 may include an attribute identification acquisition module 101, a data fragment acquisition module 102, and a privacy data processing module 103. The above modules will be described in detail below.
The attribute identifier obtaining module 101 is configured to obtain, from a user behavior data set of a first data security level, data attribute identifiers of target type data having data description attributes of the same category in a plurality of user behavior data blocks of the user behavior data set, where each user behavior data block includes data content obtained by performing data acquisition for at least one user behavior.
The data segment obtaining module 102 is configured to obtain data segments corresponding to the data attribute identifiers from the user behavior data blocks, respectively, to obtain a plurality of pieces of to-be-processed data information.
The privacy data processing module 103 is configured to perform privacy data processing on the to-be-processed data information of each first data security level according to a preset privacy data processing rule to obtain target data information of a second data security level, where the second data security level is used to implement big data privacy protection for the target data information.
In detail, the attribute identifier obtaining module 101 obtains, from a user behavior data set of a first data security level, data attribute identifiers of target type data having data description attributes of the same category in a plurality of user behavior data blocks of the user behavior data set, and a specific implementation manner includes:
vector representation is performed on each data segment of each user behavior data block in the user behavior data set to obtain a first data description matrix, and data attribute identification is performed on each first data description matrix to obtain the data attribute identifier of each data segment in each user behavior data block in the user behavior data set;
and the data attribute identifiers of the target type data in the user behavior data blocks are matched from the identified data attribute identifiers of the data segments.
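For illustration only, the two steps above can be sketched as follows. The digest-based vectorization, the parity rule used for attribute identification, and all names are hypothetical stand-ins, not the claimed implementation:

```python
import hashlib

def vectorize_segment(segment: str) -> list:
    """Toy vector representation: the first four digest bytes of the segment."""
    digest = hashlib.sha256(segment.encode("utf-8")).digest()
    return list(digest[:4])

def identify_attribute(vector: list) -> str:
    """Toy data attribute identification over the description vector:
    an even leading byte is treated as a privacy attribute."""
    return "privacy" if vector[0] % 2 == 0 else "generic"

def match_target_identifiers(blocks: dict, target: str) -> dict:
    """Collect, per user behavior data block, the segments whose identified
    data attribute matches the target description category."""
    matched = {}
    for block_id, segments in blocks.items():
        hits = [s for s in segments
                if identify_attribute(vectorize_segment(s)) == target]
        if hits:
            matched[block_id] = hits
    return matched

blocks = {"block_1": ["phone:1380000", "page_view"],
          "block_2": ["id_card:110101", "click"]}
print(match_target_identifiers(blocks, "privacy"))
```

In the embodiments that follow, the vectorization and identification are performed by trained privacy data recognition models; the sketch only fixes the data flow between the two steps.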
In detail, the attribute identifier obtaining module 101 performs vector representation on each data segment of each user behavior data block in the user behavior data set to obtain a first data description matrix, performs data attribute identification on each first data description matrix, and obtains the data attribute identifier of each data segment in each user behavior data block in the user behavior data set, where a specific implementation manner includes:
inputting the user behavior data set into a first privacy data recognition model obtained by pre-training, performing feature vector conversion on each user behavior data block in the user behavior data set by using a feature vector conversion layer of the first privacy data recognition model to obtain a first data description matrix, and performing data attribute recognition on each first data description matrix by using an attribute extraction layer of the first privacy data recognition model to obtain data attribute identifiers of each data segment of target type data in each user behavior data block of the user behavior data set;
The feature vector conversion layer is used for performing at least one of the following feature vector conversions: feature representation mapping processing, attribute and content segmentation processing, and attribute feature standardization processing. The feature vector conversion layer at least comprises a target attribute extraction layer, and the data granularity of the data extraction performed by an attribute extraction kernel of the target attribute extraction layer is the data size corresponding to at least one minimum data block of the data storage mode of the user behavior data block.
In detail, the attribute identifier obtaining module 101 performs vector representation on each data segment of each user behavior data block in the user behavior data set to obtain a first data description matrix, performs data attribute identification on each first data description matrix to obtain a data attribute identifier of each data segment in each user behavior data block in the user behavior data set, and another specific implementation manner includes:
vector representation is carried out on each data segment of each user behavior data block in the user behavior data set by adopting a preset data conversion mode to obtain a first data description matrix; the preset data conversion mode at least comprises attribute mapping and content hashing, wherein the attribute mapping and the content hashing comprise the steps of mapping the data attributes of each data segment to vector representations in a preset vector corresponding table, carrying out content hashing operation on the data contents of each data segment, and then correspondingly storing the data contents of each data segment and the corresponding vector representations;
and inputting each first data description matrix to a second privacy data recognition model obtained by pre-training, and performing data attribute recognition on each first data description matrix by an attribute extraction layer of the second privacy data recognition model to obtain data attribute identifications of each data segment of target type data in each user behavior data block of the user behavior data set.
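As a non-limiting illustration of the attribute mapping and content hashing conversion just described, the following sketch maps data attributes to vectors through a preset correspondence table and hashes each segment's content; the table entries and all names are assumptions introduced for the example:

```python
import hashlib

ATTRIBUTE_VECTOR_TABLE = {  # illustrative preset vector corresponding table
    "phone": [1, 0, 0],
    "id_card": [0, 1, 0],
    "behavior": [0, 0, 1],
}

def convert_segment(attribute: str, content: str) -> dict:
    """Attribute mapping plus content hashing: look the attribute up in the
    preset table and store the hashed content alongside the vector."""
    vector = ATTRIBUTE_VECTOR_TABLE.get(attribute, [0, 0, 0])
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return {"vector": vector, "content_hash": content_hash}

def build_description_matrix(segments: list) -> list:
    """Stack the per-segment vectors into a first data description matrix."""
    return [convert_segment(attr, content)["vector"] for attr, content in segments]

matrix = build_description_matrix([("phone", "13800000000"), ("behavior", "page_view")])
print(matrix)  # → [[1, 0, 0], [0, 0, 1]]
```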
Further, the data attribute identifier includes a data type tag of preset data information in the target type data and a privacy type tag indicating the privacy type corresponding to the target type data. Based on this, the data segment obtaining module 102 obtains the data segments corresponding to the data attribute identifiers from the user behavior data blocks, respectively, to obtain a plurality of pieces of to-be-processed data information; a specific implementation manner is as follows:
for each user behavior data block, according to a data extraction range corresponding to the target type data when data acquisition is carried out on a data type tag and a privacy type tag in a data attribute identifier of the user behavior data block, acquiring a data segment of which the type tag is a preset type tag in the user behavior data block according to the data extraction range, and determining the acquired data segment as to-be-processed data information; or
For each user behavior data block, traversing a data segment of which the matching type label is a privacy type label in the user behavior data block according to the data type label of the target type data in the data attribute identification of the user behavior data block; and mapping the obtained data segments to target type tags from different privacy type tags by adopting a tag mapping mode, and determining data information corresponding to the data segments after tag mapping as to-be-processed data information.
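The second branch above (traversing the segments whose type tag matches a privacy type tag, then remapping them to one target tag) can be sketched as follows; the tag names and the segment record layout are illustrative assumptions:

```python
PRIVACY_TAGS = {"phone_privacy", "id_privacy"}  # assumed privacy type tags
TARGET_TAG = "sensitive"                        # assumed target type tag

def extract_and_remap(block: list) -> list:
    """Traverse a user behavior data block, keep segments whose type tag is
    a privacy type tag, and map each match onto the single target tag."""
    pending = []
    for segment in block:
        if segment["type_tag"] in PRIVACY_TAGS:            # traverse and match
            remapped = dict(segment, type_tag=TARGET_TAG)  # tag mapping
            pending.append(remapped)
    return pending

block = [
    {"type_tag": "phone_privacy", "data": "1380000"},
    {"type_tag": "behavior", "data": "click"},
    {"type_tag": "id_privacy", "data": "110101"},
]
print(extract_and_remap(block))
```

After this tag mapping, the remapped segments play the role of the to-be-processed data information handed to the privacy data processing module.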
In detail, the privacy data processing module 103 performs privacy data processing on the to-be-processed data information of each first data security level to obtain target data information of a second data security level, which may be implemented in either of the following manners:
vector representation is carried out on each piece of to-be-processed data information to obtain a second data description matrix, data marking is carried out on the corresponding data description of the target type data in each second data description matrix to obtain a third data description matrix after data marking, and privacy data processing is carried out on each third data description matrix to obtain the target data information, where the privacy data processing comprises at least one of the following processing modes: differential privacy processing, privacy diversification processing, and privacy anonymization processing; or
Inputting each piece of to-be-processed data information to a third privacy data recognition model obtained by pre-training, performing feature vector conversion on each piece of input to-be-processed data information by using a feature vector conversion layer of the third privacy data recognition model to obtain a second data description matrix, performing data marking on corresponding data description of target type data in each second data description matrix by using a matrix data marking layer of the third privacy data recognition model to obtain a data-marked third data description matrix, and performing privacy data processing on each third data description matrix by using a matrix data privacy processing layer of the third privacy data recognition model to obtain the target data information; the feature vector conversion layer realizes feature vector conversion through feature representation mapping processing, attribute and content segmentation processing or attribute feature standardization processing, and comprises a target attribute extraction layer, wherein the data granularity of data extraction performed by an attribute extraction kernel of the target attribute extraction layer is the data size corresponding to at least one minimum data block of the data storage mode of the user behavior data block.
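Of the processing modes listed above, differential privacy processing has a standard textbook form, the Laplace mechanism. The sketch below shows that generic mechanism, not the specific scheme of this application; the count query, epsilon value, and all names are illustrative assumptions:

```python
import random

def laplace_noise(scale: float) -> float:
    # The difference of two independent Exp(1) draws is Laplace(0, 1); scale it.
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def dp_count(values, predicate, epsilon: float = 1.0) -> float:
    """Differentially private count: a count query has sensitivity 1,
    so Laplace noise with scale 1/epsilon suffices."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon)

random.seed(0)  # fixed seed so the demo is reproducible
ages = [23, 31, 45, 52, 29]
noisy = dp_count(ages, lambda a: a > 30, epsilon=1.0)
print(round(noisy, 3))
```

A smaller epsilon yields stronger privacy at the cost of noisier released values, which is the trade-off the data security levels in this application are meant to manage.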
Further, in another alternative implementation manner, the privacy data processing module 103 performs privacy data processing on the to-be-processed data information of each first data security level according to a preset privacy data processing rule to obtain target data information of a second data security level; the specific implementation manner includes:
according to the to-be-processed data information of each first data security level, a local privacy tag sequence and a global privacy tag sequence corresponding to each to-be-processed data information are obtained; the local privacy tag sequence may include local privacy tags respectively corresponding to data segments in each user data block in the to-be-processed data information, and one local privacy tag may correspond to data of one user data block;
performing anonymization pre-analysis on a local privacy tag sequence and a global privacy tag sequence in the privacy data information corresponding to the data information to be processed based on a sequence correlation coefficient between the local privacy tag sequence and the global privacy tag sequence corresponding to the data information to be processed to obtain an anonymization pre-analysis result;
determining a global privacy tag with abnormality in anonymization pre-analysis as a global privacy tag to be matched according to the anonymization pre-analysis result, and determining anonymization demand information matched with the global privacy tag to be matched according to an information correlation coefficient between data information corresponding to the global privacy tag without abnormality in the anonymization pre-analysis result and data information corresponding to the global privacy tag to be matched;
carrying out anonymization pre-analysis on the global privacy tag to be matched according to the anonymization demand information matched with the global privacy tag to be matched to obtain an anonymization pre-analysis result;
and according to the anonymization pre-analysis result, obtaining an anonymization processing instruction corresponding to the privacy data processing rule, and according to the anonymization processing instruction, carrying out anonymization processing on the privacy data information to obtain the target data information.
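The pre-analysis based on a sequence correlation coefficient can be illustrated as follows, using the Pearson coefficient as one plausible choice of correlation measure; the abnormality threshold and the numeric encoding of the tag sequences are illustrative assumptions:

```python
import math

def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def preanalyze(local_seq: list, global_seqs: dict, threshold: float = 0.5) -> dict:
    """Map each global privacy tag to True when its sequence correlates
    normally with the local sequence, False when it is abnormal (i.e. a
    candidate "global privacy tag to be matched")."""
    return {tag: pearson(local_seq, seq) >= threshold
            for tag, seq in global_seqs.items()}

local = [1.0, 2.0, 3.0, 4.0]
result = preanalyze(local, {"g1": [2.0, 4.0, 6.0, 8.0], "g2": [4.0, 1.0, 3.0, 2.0]})
print(result)  # → {'g1': True, 'g2': False}
```

Tags flagged False here would then be matched against anonymization demand information derived from the non-abnormal tags, as described in the steps above.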
In detail, the local privacy tag sequence and the global privacy tag sequence in the privacy data information corresponding to the to-be-processed data information may be acquired in the following specific implementation manner:
according to-be-processed data information of each first data security level, at least two local privacy tags and at least two global privacy tags in the privacy data information corresponding to the to-be-processed data information are obtained;
obtaining a local privacy tag correlation coefficient and a local privacy tag feature difference between the at least two local privacy tags, and obtaining a global privacy tag correlation coefficient and a global privacy tag feature difference between the at least two global privacy tags;
arranging the at least two local privacy tags according to the correlation coefficient of the local privacy tags and the characteristic difference of the local privacy tags to obtain a local privacy tag sequence in the privacy data information corresponding to the data information to be processed; a sequence of local privacy tags comprising at least one local privacy tag; arranging the at least two global privacy tags according to the correlation coefficient of the global privacy tags and the feature difference of the global privacy tags to obtain a global privacy tag sequence in the privacy data information corresponding to the data information to be processed; a sequence of global privacy tags includes at least one global privacy tag.
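The arrangement step above can be sketched as a sort over the two scores; the tag records, the descending-by-correlation ordering, and the feature-difference tie-break are illustrative assumptions:

```python
def arrange_tags(tags: list) -> list:
    """Arrange privacy tags using the correlation coefficient as the primary
    key (descending) and the feature difference as a tie-breaker (ascending),
    returning the ordered tag names as the resulting tag sequence."""
    ordered = sorted(tags, key=lambda t: (-t["correlation"], t["feature_diff"]))
    return [t["name"] for t in ordered]

local_tags = [
    {"name": "tag_a", "correlation": 0.8, "feature_diff": 0.2},
    {"name": "tag_b", "correlation": 0.9, "feature_diff": 0.5},
    {"name": "tag_c", "correlation": 0.8, "feature_diff": 0.1},
]
print(arrange_tags(local_tags))  # → ['tag_b', 'tag_c', 'tag_a']
```

The same arrangement would be applied independently to the global privacy tags to obtain the global privacy tag sequence.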
In addition, it should be noted that the functional modules such as the attribute identifier obtaining module 101, the data segment obtaining module 102, and the privacy data processing module 103 may be respectively configured to execute steps S10 to S30 shown in Fig. 1; for further details of these modules, reference may be made to the descriptions of the corresponding steps, which are not repeated here.
In summary, in the data processing method and big data processing apparatus for big data privacy protection provided in the embodiments of the present invention, data attribute identifiers of target type data having data description attributes of the same category in a plurality of user behavior data blocks are first obtained from a user behavior data set of a first data security level. Data segments corresponding to the data attribute identifiers are then obtained from the user behavior data blocks, respectively, to obtain a plurality of pieces of to-be-processed data information. Finally, the to-be-processed data information of each first data security level is subjected to privacy data processing according to a preset privacy data processing rule to obtain target data information of a second data security level, thereby achieving big data privacy protection for the target data information. In addition, intelligent data processing tools such as the privacy data recognition models and data description matrices are introduced, which can improve the identification accuracy of privacy data and thus the accuracy of big data privacy protection. Meanwhile, hierarchical processing of different privacy data is achieved through data analysis of different dimensions, so that the privacy protection requirements of different scenarios can be met and user experience is improved.
The embodiments described above are only a part of the embodiments of the present invention, not all of them. The components of the embodiments of the present invention, as generally described and illustrated in the figures, can be arranged and designed in a wide variety of different configurations. The detailed description of the embodiments provided in the drawings is therefore not intended to limit the scope of the present invention, but is merely representative of selected embodiments; the protection scope of the present invention shall be subject to the claims. Moreover, any other embodiment obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.

Claims (7)

1. A big data anonymization processing method, applied to a big data processing device, the method comprising:
according to the to-be-processed data information of each first data security level, acquiring a corresponding local privacy tag sequence and a corresponding global privacy tag sequence in each to-be-processed data information; the local privacy tag sequence may include local privacy tags respectively corresponding to data segments in each user data block in the to-be-processed data information, and one local privacy tag may correspond to data of one user data block;
performing anonymization pre-analysis on a local privacy tag sequence and a global privacy tag sequence in the privacy data information corresponding to the data information to be processed based on a sequence correlation coefficient between the local privacy tag sequence and the global privacy tag sequence corresponding to the data information to be processed to obtain an anonymization pre-analysis result;
determining a global privacy label with abnormity in the anonymization preanalysis as a global privacy label to be matched according to the anonymization preanalysis result, and determining anonymization demand information matched with the global privacy label to be matched according to an information correlation coefficient between data information corresponding to the global privacy label without abnormity in the anonymization preanalysis result and data information corresponding to the global privacy label to be matched;
carrying out anonymization pre-analysis on the global privacy tag to be matched according to anonymization demand information matched with the global privacy tag to be matched to obtain an anonymization pre-analysis result;
and according to the anonymization pre-analysis result, obtaining an anonymization processing instruction corresponding to a preset privacy data processing rule, and according to the anonymization processing instruction, carrying out anonymization processing on the privacy data information to obtain target data information of a second data security level.
2. The method according to claim 1, wherein the method further comprises a step of obtaining the to-be-processed data information of the first data security level, specifically comprising:
vector representation is carried out on each data segment of each user behavior data block in a user behavior data set of a first data security level to obtain a first data description matrix, and data attribute identification is carried out on each first data description matrix to obtain the data attribute identifier of each data segment in each user behavior data block in the user behavior data set;
matching data attribute identifiers of target type data having data description attributes of the same category in a plurality of user behavior data blocks from the identified data attribute identifiers of each data segment;
and respectively acquiring data segments corresponding to the data attribute identifications from the user behavior data blocks to obtain a plurality of pieces of to-be-processed data information with first data security levels.
3. The method of claim 2, wherein the vector representation of each data segment of each user behavior data block in the user behavior data set of the first data security level to obtain a first data description matrix, and the data attribute identification of each first data description matrix to obtain the data attribute identifier of each data segment in each user behavior data block in the user behavior data set comprises:
inputting the user behavior data set into a first privacy data recognition model obtained by pre-training, performing feature vector conversion on each user behavior data block in the user behavior data set by using a feature vector conversion layer of the first privacy data recognition model to obtain a first data description matrix, and performing data attribute recognition on each first data description matrix by using an attribute extraction layer of the first privacy data recognition model to obtain data attribute identifiers of each data segment of target type data in each user behavior data block of the user behavior data set;
The feature vector conversion layer is used for performing at least one of the following feature vector conversions: feature representation mapping processing, attribute and content segmentation processing, and attribute feature standardization processing. The feature vector conversion layer at least comprises a target attribute extraction layer, and the data granularity of the data extraction performed by an attribute extraction kernel of the target attribute extraction layer is the data size corresponding to at least one minimum data block of the data storage mode of the user behavior data block.
4. The method of claim 2, wherein the vector representation of each data segment of each user behavior data block in the user behavior data set of the first data security level to obtain a first data description matrix, and the data attribute identification of each first data description matrix to obtain the data attribute identifier of each data segment in each user behavior data block in the user behavior data set comprises:
vector representation is carried out on each data segment of each user behavior data block in the user behavior data set by adopting a preset data conversion mode to obtain a first data description matrix; the preset data conversion mode at least comprises attribute mapping and content hashing, wherein the attribute mapping and the content hashing comprise mapping the data attribute of each data segment to a vector representation in a preset vector corresponding table, carrying out a content hashing operation on the data content of each data segment, and then storing the hashed data content of each data segment in correspondence with the corresponding vector representation;
and inputting each first data description matrix into a second privacy data recognition model obtained by pre-training, and performing data attribute recognition on each first data description matrix by an attribute extraction layer of the second privacy data recognition model to obtain data attribute identification of each data segment of target type data in each user behavior data block of the user behavior data set.
5. The method of claim 2, wherein the data attribute identification comprises: a data type label of preset data information in the target type data and a privacy type label indicating a privacy type corresponding to the target type data;
the obtaining of the data segments corresponding to the data attribute identifications from the user behavior data blocks respectively to obtain a plurality of pieces of to-be-processed data information includes:
for each user behavior data block, according to a data extraction range corresponding to the target type data when data acquisition is carried out on a data type tag and a privacy type tag in a data attribute identifier of the user behavior data block, acquiring a data segment of which the type tag is a preset type tag in the user behavior data block according to the data extraction range, and determining the acquired data segment as to-be-processed data information; or
For each user behavior data block, traversing a data segment of which the matching type label is a privacy type label in the user behavior data block according to the data type label of the target type data in the data attribute identification of the user behavior data block; and mapping the obtained data segments to target type tags from different privacy type tags by adopting a tag mapping mode, and determining data information corresponding to the data segments after tag mapping as to-be-processed data information.
6. The method according to any one of claims 1 to 5, wherein the obtaining of the corresponding local privacy tag sequence and global privacy tag sequence in each piece of to-be-processed data information includes:
according to-be-processed data information of each first data security level, at least two local privacy tags and at least two global privacy tags in the privacy data information corresponding to the to-be-processed data information are obtained;
obtaining a local privacy tag correlation coefficient and a local privacy tag feature difference between the at least two local privacy tags, and obtaining a global privacy tag correlation coefficient and a global privacy tag feature difference between the at least two global privacy tags;
arranging the at least two local privacy tags according to the correlation coefficient of the local privacy tags and the characteristic difference of the local privacy tags to obtain a local privacy tag sequence in the privacy data information corresponding to the data information to be processed; a sequence of local privacy tags comprising at least one local privacy tag; arranging the at least two global privacy tags according to the correlation coefficient of the global privacy tags and the feature difference of the global privacy tags to obtain a global privacy tag sequence in the privacy data information corresponding to the data information to be processed; a sequence of global privacy tags includes at least one global privacy tag.
7. A big data processing device, comprising a processor and a machine-readable storage medium coupled to the processor, the machine-readable storage medium storing a program, instructions, or code, and the processor being configured to execute the program, instructions, or code in the machine-readable storage medium to implement the method of any one of claims 1 to 6.
CN202210139315.9A 2021-02-06 2021-02-06 Big data anonymization processing method and big data processing equipment Withdrawn CN114564740A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210139315.9A CN114564740A (en) 2021-02-06 2021-02-06 Big data anonymization processing method and big data processing equipment

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210139315.9A CN114564740A (en) 2021-02-06 2021-02-06 Big data anonymization processing method and big data processing equipment
CN202110175876.XA CN112818398B (en) 2021-02-06 2021-02-06 Data processing method and big data processing equipment for big data privacy protection

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202110175876.XA Division CN112818398B (en) 2021-02-06 2021-02-06 Data processing method and big data processing equipment for big data privacy protection

Publications (1)

Publication Number Publication Date
CN114564740A true CN114564740A (en) 2022-05-31

Family

ID=75864454

Family Applications (3)

Application Number Title Priority Date Filing Date
CN202110175876.XA Active CN112818398B (en) 2021-02-06 2021-02-06 Data processing method and big data processing equipment for big data privacy protection
CN202210139315.9A Withdrawn CN114564740A (en) 2021-02-06 2021-02-06 Big data anonymization processing method and big data processing equipment
CN202210139326.7A Withdrawn CN114564741A (en) 2021-02-06 2021-02-06 Big data privacy protection method based on anonymization analysis and big data processing equipment

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202110175876.XA Active CN112818398B (en) 2021-02-06 2021-02-06 Data processing method and big data processing equipment for big data privacy protection

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202210139326.7A Withdrawn CN114564741A (en) 2021-02-06 2021-02-06 Big data privacy protection method based on anonymization analysis and big data processing equipment

Country Status (1)

Country Link
CN (3) CN112818398B (en)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113849133B (en) * 2021-09-29 2023-09-12 珠海格力电器股份有限公司 Processing method and device of privacy data, electronic equipment and storage medium
US11593521B1 (en) 2022-02-04 2023-02-28 Snowflake Inc. Tag-based application of masking policy
CN115456101B (en) * 2022-09-23 2023-09-12 上海豹云网络信息服务有限公司 Data security transmission method and system based on data center
CN116436704B (en) * 2023-06-13 2023-08-18 深存科技(无锡)有限公司 Data processing method and data processing equipment for user privacy data

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101175190B1 (en) * 2008-11-19 2012-08-20 한국전자통신연구원 Rotation based transformation method and apparatus for preserving data privacy
CN105046601A (en) * 2015-07-09 2015-11-11 传成文化传媒(上海)有限公司 User data processing method and system
CN106529329A (en) * 2016-10-11 2017-03-22 中国电子科技网络信息安全有限公司 Desensitization system and desensitization method used for big data

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116842562A (en) * 2023-06-30 2023-10-03 煋辰数梦(杭州)科技有限公司 Big data security platform based on privacy computing technology
CN116842562B (en) * 2023-06-30 2024-03-15 煋辰数梦(杭州)科技有限公司 Big data security platform based on privacy computing technology
CN117786739A (en) * 2023-12-19 2024-03-29 国网青海省电力公司信息通信公司 Data processing method, server and system

Also Published As

Publication number Publication date
CN112818398B (en) 2022-04-01
CN112818398A (en) 2021-05-18
CN114564741A (en) 2022-05-31

Similar Documents

Publication Publication Date Title
CN112818398B (en) Data processing method and big data processing equipment for big data privacy protection
JP2014029732A (en) Method for generating representation of image contents using image search and retrieval criteria
CN109933502B (en) Electronic device, user operation record processing method and storage medium
CN111984792A (en) Website classification method and device, computer equipment and storage medium
CN114550076A (en) Method, device and equipment for monitoring area abnormal behaviors and storage medium
CN115098679A (en) Method, device, equipment and medium for detecting abnormality of text classification labeling sample
CN113642025A (en) Interface data processing method, device, equipment and storage medium
CN114356712B (en) Data processing method, apparatus, device, readable storage medium, and program product
CN112732693B (en) Intelligent internet of things data acquisition method, device, equipment and storage medium
CN116662839A (en) Associated big data cluster analysis method and device based on multidimensional intelligent acquisition
CN111368128B (en) Target picture identification method, device and computer readable storage medium
CN112464180A (en) Page screenshot outgoing control method and system, electronic device and storage medium
CN116089541B (en) Abnormal identification method for massive real estate registration data
CN116738369A (en) Traffic data classification method, device, equipment and storage medium
CN111429110A (en) Store standardization auditing method, device, equipment and storage medium
CN114528908B (en) Network request data classification model training method, classification method and storage medium
CN116318860A (en) Intelligent control method for network security equipment
CN115600571A (en) Automatic information filling method, device, equipment and medium based on template matching
CN113868503A (en) Commodity picture compliance detection method, device, equipment and storage medium
CN114693955A (en) Method and device for comparing image similarity and electronic equipment
CN112597498A (en) Webshell detection method, system and device and readable storage medium
CN117112846B (en) Multi-information source license information management method, system and medium
CN116318985B (en) Computer network security early warning system and method based on big data
CN115048543B (en) Image similarity judgment method, image searching method and device
CN115695054B (en) WAF interception page identification method and device based on machine learning and related components

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20220531