CN115712703A - Decision analysis method and server applied to big data anonymous processing - Google Patents


Info

Publication number
CN115712703A
CN115712703A (application CN202211670623.0A)
Authority
CN
China
Prior art keywords
text
word vector
vector
user data
processing
Prior art date
Legal status
Withdrawn
Application number
CN202211670623.0A
Other languages
Chinese (zh)
Inventor
潘航
陈心
Current Assignee
Hefei Suithu Internet Technology Co ltd
Original Assignee
Hefei Suithu Internet Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hefei Suithu Internet Technology Co ltd filed Critical Hefei Suithu Internet Technology Co ltd
Priority to CN202211670623.0A
Publication of CN115712703A
Legal status: Withdrawn

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a decision analysis method and server applied to big data anonymization. Two different sensitive text extraction models extract sensitive text from the U text word vector relationship networks of an initial user data description file to obtain a first and a second user data description file, and these two files are used to generate a user data description file to be anonymized. By comprehensively considering the relations among the individual feature text set, the group feature text set and the fuzzy text set, the method ensures, when generating the user data description file to be anonymized, both complete coverage of individual privacy data and accurate positioning of the fuzzy text data that could cause its indirect disclosure. The generated user data description file to be anonymized can serve as a credible decision basis for data anonymization protection, thereby improving the pertinence and attack resistance of that protection.

Description

Decision analysis method and server applied to big data anonymous processing
Technical Field
The invention relates to the technical field of data processing, in particular to a decision analysis method and a server applied to big data anonymous processing.
Background
Data anonymization is the process of protecting private or sensitive information by removing or encrypting the identifiers that link individuals to stored data. It is also referred to as data desensitization, pseudonymization, or de-identification; in other words, it denotes the family of technical approaches for handling sensitive data. Security problems concerning users' personal privacy information, such as privacy disclosure, currently attract wide concern, and data anonymization is one of the important means of preventing privacy disclosure, so the technical barriers it faces cannot be ignored. Traditional data anonymization usually adopts a clipping approach, for example anonymizing only the obviously individual privacy data, but this approach has weak attack resistance and easily causes indirect disclosure of privacy data.
Disclosure of Invention
The invention provides a decision analysis method and a server applied to big data anonymization processing and, to achieve this technical purpose, adopts the following technical scheme.
A first aspect is a decision analysis method applied to big data anonymization processing, executed by an anonymization decision analysis server, and the method comprises the following steps:
acquiring an initial user data description file, wherein the initial user data description file comprises a sensitive keyword set;
mining the initial user data description file by using a feedforward neural network language model to obtain U text word vector relationship networks, wherein the feedforward neural network language model comprises U word vector mining units, the input of each word vector mining unit is the output of the previous word vector mining unit, and U is an integer not less than 1;
processing the U text word vector relationship networks by using a first sensitive text extraction model to obtain a first user data description file, wherein the first user data description file comprises a first individual feature text set, a first group feature text set and a fuzzy text set, the first individual feature text set corresponds to the text set corresponding to the sensitive keyword set, and the fuzzy text set is an associated text set of the first individual feature text set and the first group feature text set;
processing the U text word vector relationship networks by using a second sensitive text extraction model to obtain a second user data description file, wherein the second user data description file comprises a second individual feature text set and a second group feature text set, and the second individual feature text set corresponds to the text set corresponding to the sensitive keyword set;
and generating a user data description file to be anonymized according to the first user data description file and the second user data description file, wherein the user data description file to be anonymized comprises the sensitive keyword set.
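The four steps above can be sketched end to end. Everything below is a hypothetical toy illustration: the function names, the token-set modeling of the word vector relationship networks, and the prefix-overlap criterion for the fuzzy set are assumptions, not details from the patent.

```python
# Toy end-to-end sketch of the claimed pipeline (all names hypothetical).

def mine_word_vector_networks(profile, U):
    """Stand-in for the feed-forward language model: each mining unit
    consumes the previous unit's output (the profile seeds the first)."""
    networks, current = [], profile
    for _ in range(U):
        current = {"tokens": sorted(set(current.get("tokens", [])))}
        networks.append(current)
    return networks

def first_extraction(networks, sensitive_keywords):
    tokens = networks[-1]["tokens"]
    individual = [t for t in tokens if t in sensitive_keywords]
    cluster = [t for t in tokens if t not in sensitive_keywords]
    # fuzzy set: tokens associated with both sets (toy criterion: shared prefix)
    fuzzy = [t for t in cluster if any(t[:2] == s[:2] for s in individual)]
    return {"individual": individual, "cluster": cluster, "fuzzy": fuzzy}

def second_extraction(networks, sensitive_keywords):
    tokens = networks[-1]["tokens"]
    return {"individual": [t for t in tokens if t in sensitive_keywords],
            "cluster": [t for t in tokens if t not in sensitive_keywords]}

def to_anonymize(first, second):
    # deduplicated union of everything either model flagged as sensitive
    flagged = first["individual"] + first["fuzzy"] + second["individual"]
    return sorted(set(flagged))

profile = {"tokens": ["name", "nameplate", "city", "order"]}
nets = mine_word_vector_networks(profile, U=4)
f1 = first_extraction(nets, {"name"})
f2 = second_extraction(nets, {"name"})
print(to_anonymize(f1, f2))  # → ['name', 'nameplate']
```

Note how the fuzzy set lets the merge step flag "nameplate" even though it is not itself a sensitive keyword, which is the claimed defense against indirect disclosure.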
In some exemplary embodiments, the first sensitive text extraction model comprises U field category attention units and a text vector sorting unit; and the processing of the U text word vector relationship networks by using the first sensitive text extraction model to obtain a first user data description file comprises the following steps:
processing the U text word vector relationship networks by using the U field category attention units to generate U mixed entry vector sets;
processing V mixed entry vector sets in the U mixed entry vector sets by using the text vector sorting unit to obtain first entry vector distribution, wherein V is an integer smaller than U;
and generating the first user data description file according to the first entry vector distribution.
In some exemplary embodiments, each of the text word vector relationship networks corresponds to a text knowledge distribution chain; and the processing of the U text word vector relationship networks by using the U field category attention units to generate U mixed entry vector sets includes:
taking the nth text knowledge distribution chain and the (n-1)th text knowledge distribution chain as inputs of the nth field category attention unit, and outputting the nth mixed entry vector set by using the nth field category attention unit, wherein the nth text knowledge distribution chain corresponds to the nth text word vector relationship network, the (n-1)th text knowledge distribution chain corresponds to the (n-1)th text word vector relationship network, and n is an integer greater than 1 and less than U;
and taking the mth text knowledge distribution chain as the input of the mth field category attention unit, and outputting the mth mixed entry vector set by using the mth field category attention unit, wherein the mth text knowledge distribution chain corresponds to the mth text word vector relationship network, and m is 1 or U.
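The wiring rule above (the nth unit receives chains n and n-1 when 1 < n < U; units 1 and U receive only their own chain) can be sketched as follows. The attention unit itself is reduced to a toy merge-and-tag step, and all names are hypothetical.

```python
# Hypothetical sketch of the attention-unit input wiring; chains are lists.

def attention_unit(*chains):
    # toy "field category attention": merge inputs and tag every entry
    merged = [entry for chain in chains for entry in chain]
    return [(entry, "privacy" if entry.startswith("id_") else "non-privacy")
            for entry in merged]

def run_attention_units(chains):
    U = len(chains)
    mixed_sets = []
    for n in range(1, U + 1):                 # 1-based, as in the text
        if n == 1 or n == U:                  # m = 1 or U: own chain only
            inputs = (chains[n - 1],)
        else:                                 # 1 < n < U: chains n and n-1
            inputs = (chains[n - 1], chains[n - 2])
        mixed_sets.append(attention_unit(*inputs))
    return mixed_sets

chains = [["id_1"], ["addr"], ["id_2"], ["order"]]
sets_ = run_attention_units(chains)
print(len(sets_[1]))  # unit 2 saw chains 2 and 1, so two tagged entries
```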
In some exemplary embodiments, the first sensitive text extraction model further includes a detection module configured to record the number G of processing detections performed by the field category attention units on the text word vector relationship networks, where G is an integer not less than 1; the U text word vector relationship networks correspond to U text knowledge distribution chains; and the processing of the U text word vector relationship networks by using the U field category attention units to generate U mixed entry vector sets includes:
acquiring the processing detection number G recorded by the detection module;
when the processing detection number G is smaller than a first limit value, taking the first text knowledge distribution chain as the input of the first field category attention unit, and outputting a first mixed entry vector set by using the first field category attention unit, wherein the first text knowledge distribution chain corresponds to the first text word vector relationship network, which is obtained according to the first word vector mining unit in the feedforward neural network language model;
taking the second text knowledge distribution chain and the first mixed entry vector set as inputs of the second field category attention unit, and outputting a second mixed entry vector set by using the second field category attention unit, wherein the second text knowledge distribution chain corresponds to the second text word vector relationship network, which is obtained according to the second word vector mining unit in the feedforward neural network language model;
taking the third text knowledge distribution chain and the second mixed entry vector set as inputs of the third field category attention unit, and outputting a third mixed entry vector set by using the third field category attention unit, wherein the third text knowledge distribution chain corresponds to the third text word vector relationship network, which is obtained according to the third word vector mining unit in the feedforward neural network language model;
and taking the fourth text knowledge distribution chain as the input of the fourth field category attention unit, and outputting a fourth mixed entry vector set by using the fourth field category attention unit, wherein the fourth text knowledge distribution chain corresponds to the fourth text word vector relationship network, which is obtained according to the fourth word vector mining unit in the feedforward neural network language model.
In some exemplary embodiments, after the obtaining the processing detection number G recorded by the detection module, the method further includes:
when the processing detection number G is not smaller than the first limit value and smaller than a second limit value, taking the first text knowledge distribution chain and the third mixed entry vector set as inputs of the first field category attention unit, and outputting a fifth mixed entry vector set by using the first field category attention unit;
taking the second text knowledge distribution chain and the fifth mixed entry vector set as inputs of the second field category attention unit, and outputting a sixth mixed entry vector set by using the second field category attention unit;
and taking the third text knowledge distribution chain, the third mixed entry vector set and the sixth mixed entry vector set as inputs of the third field category attention unit, and outputting a seventh mixed entry vector set by using the third field category attention unit.
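The two detection-count branches above can be sketched together. The limit values, the list modeling of mixed entry vector sets, and the concatenating "unit" are all illustrative assumptions; the point is only the data flow, including the reuse of the third mixed entry vector set in the second pass.

```python
# Sketch of the detection-count-gated wiring (all names hypothetical).

def unit(*inputs):
    # toy attention unit: concatenate every input set
    return [e for inp in inputs for e in inp]

def forward_pass(chains, G, limit1, limit2, carry=None):
    c1, c2, c3, c4 = chains
    if G < limit1:
        m1 = unit(c1)                 # first unit: chain 1 only
        m2 = unit(c2, m1)             # second unit: chain 2 + mixed set 1
        m3 = unit(c3, m2)             # third unit: chain 3 + mixed set 2
        m4 = unit(c4)                 # fourth unit: chain 4 only
        return {"m3": m3, "m4": m4}
    elif G < limit2:                  # later pass reuses mixed set 3
        m5 = unit(c1, carry["m3"])
        m6 = unit(c2, m5)
        m7 = unit(c3, carry["m3"], m6)
        return {"m7": m7, "m4": carry["m4"]}

chains = [["a"], ["b"], ["c"], ["d"]]
first = forward_pass(chains, G=0, limit1=1, limit2=2)
second = forward_pass(chains, G=1, limit1=1, limit2=2, carry=first)
print(len(second["m7"]))  # → 9
```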
In some exemplary embodiments, the processing, by the text vector sorting unit, of the V mixed entry vector sets in the U mixed entry vector sets to obtain a first entry vector distribution includes: when the processing detection number G is equal to the second limit value, processing the fourth mixed entry vector set and the seventh mixed entry vector set by using the text vector sorting unit to obtain the first entry vector distribution.
In some exemplary embodiments, the processing, by the text vector sorting unit, of the fourth mixed entry vector set and the seventh mixed entry vector set to obtain the first entry vector distribution includes:
semantic description extraction is carried out on the fourth mixed entry vector set to obtain first entry semantic description;
extracting text polarity from the semantic description of the first entry to obtain a first text polarity;
processing the semantic description of the first entry by using a nonlinear processing module to obtain a first semantic feature;
semantic description extraction is carried out on the seventh mixed entry vector set to obtain second entry semantic description;
extracting text polarity from the semantic description of the second entry to obtain a second text polarity;
processing the second entry semantic description by using a nonlinear processing module to obtain a second semantic feature;
and generating a first entry vector distribution according to the fourth mixed entry vector set, the first semantic feature, the first text polarity, the seventh mixed entry vector set, the second semantic feature and the second text polarity.
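The sorting-unit steps above (semantic description extraction, text polarity extraction, nonlinear processing, and combination) might be sketched as follows. The mean summary, sign-based polarity, and ReLU-style nonlinearity are illustrative assumptions only; the patent does not specify the arithmetic.

```python
# Toy sketch of the text vector sorting unit (assumed arithmetic throughout).

def semantic_description(vectors):
    # summarize a mixed entry vector set by its component-wise mean
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def text_polarity(description):
    # sign of the summary, standing in for "text polarity extraction"
    return 1.0 if sum(description) >= 0 else -1.0

def nonlinear(description):
    # stand-in nonlinear processing module (ReLU-like)
    return [max(0.0, x) for x in description]

def entry_vector_distribution(set4, set7):
    d4, d7 = semantic_description(set4), semantic_description(set7)
    features = nonlinear(d4) + nonlinear(d7)
    polarities = [text_polarity(d4), text_polarity(d7)]
    return features + polarities          # flat combined distribution

set4 = [[1.0, -2.0], [3.0, 0.0]]
set7 = [[-1.0, -1.0]]
print(entry_vector_distribution(set4, set7))
```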
In some exemplary embodiments, the second sensitive text extraction model comprises U+1 moving average nodes; the U text word vector relationship networks correspond to U text knowledge distribution chains; and the processing of the U text word vector relationship networks by using the second sensitive text extraction model to obtain a second user data description file comprises the following steps:
taking the first text knowledge distribution chain as the input of a first moving average node, and outputting a first user information visual expression by using the first moving average node, wherein the first text knowledge distribution chain corresponds to the first text word vector relationship network, which is obtained according to the first word vector mining unit in the feedforward neural network language model;
taking the second text knowledge distribution chain and the first user information visual expression as inputs of a second moving average node, and outputting a second user information visual expression by using the second moving average node, wherein the second text knowledge distribution chain corresponds to the second text word vector relationship network, which is obtained according to the second word vector mining unit in the feedforward neural network language model;
taking the third text knowledge distribution chain and the second user information visual expression as inputs of a third moving average node, and outputting a third user information visual expression by using the third moving average node, wherein the third text knowledge distribution chain corresponds to the third text word vector relationship network, which is obtained according to the third word vector mining unit in the feedforward neural network language model;
taking the fourth text knowledge distribution chain and the third user information visual expression as inputs of a fourth moving average node, and outputting a fourth user information visual expression by using the fourth moving average node, wherein the fourth text knowledge distribution chain corresponds to the fourth text word vector relationship network, which is obtained according to the fourth word vector mining unit in the feedforward neural network language model;
taking the fourth user information visual expression as the input of a fifth moving average node, and outputting a fifth user information visual expression by using the fifth moving average node;
and generating the second user data description file according to the fifth user information visual expression.
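The chain of U+1 moving average nodes might be sketched like this for U = 4. The exponential-moving-average form and the smoothing factor alpha are assumptions, since the text does not specify the node's arithmetic; chain values are reduced to single numbers for readability.

```python
# Sketch of the U+1 moving-average-node chain (assumed arithmetic).

def moving_average_node(chain_value, previous=None, alpha=0.5):
    # first node (and the final node) has no separate previous expression
    if previous is None:
        return chain_value
    return alpha * chain_value + (1 - alpha) * previous

def second_extraction(chains, alpha=0.5):
    expr = moving_average_node(chains[0])      # node 1: chain 1 only
    for value in chains[1:]:                   # nodes 2..U: chain + previous expression
        expr = moving_average_node(value, expr, alpha)
    return moving_average_node(expr)           # node U+1: expression only (identity here)

print(second_extraction([8.0, 4.0, 2.0, 6.0]))  # → 5.0
```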
In some exemplary embodiments, the privacy risk index of the initial user data description file is F, where F is an integer greater than 1, and the mining of the initial user data description file by using a feedforward neural network language model to obtain U text word vector relationship networks includes:
mining the initial user data description file through a fourth word vector mining unit to obtain a fourth text word vector relationship network, wherein the privacy risk index of the fourth text word vector relationship network is q4 × F;
processing the fourth text word vector relationship network through a third word vector mining unit to obtain a third text word vector relationship network, wherein the privacy risk index of the third text word vector relationship network is q3 × F;
processing the third text word vector relationship network through a second word vector mining unit to obtain a second text word vector relationship network, wherein the privacy risk index of the second text word vector relationship network is q2 × F;
and processing the second text word vector relationship network through a first word vector mining unit to obtain a first text word vector relationship network, wherein the privacy risk index of the first text word vector relationship network is q1 × F.
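The risk-index bookkeeping above (mining runs from the fourth unit down to the first, with each network's privacy risk index a coefficient times F) can be illustrated as follows; the q values are assumed, since the patent does not fix them.

```python
# Sketch of the q_i × F risk-index bookkeeping (q values are assumptions).

def mine(profile_index, coefficients):
    """Return (mining unit, privacy risk index) pairs in mining order 4 → 1."""
    return [(i, coefficients[i] * profile_index) for i in (4, 3, 2, 1)]

q = {1: 0.125, 2: 0.25, 3: 0.5, 4: 0.75}   # assumed per-unit coefficients q1..q4
print(mine(profile_index=100, coefficients=q))
```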
In some exemplary embodiments, the generating a user data description profile to be anonymized according to the first user data description profile and the second user data description profile includes:
and obtaining the user data description file to be anonymized according to the result of deduplicating the file statements of the first user data description file and the file statements of the second user data description file.
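A minimal sketch of this deduplication-based merge, assuming order-preserving set semantics over file statements (the text only requires deduplication; the statement strings are invented examples):

```python
# Order-preserving deduplicated union of the two files' statements.

def merge_profiles(first_statements, second_statements):
    seen, merged = set(), []
    for stmt in first_statements + second_statements:
        if stmt not in seen:
            seen.add(stmt)
            merged.append(stmt)
    return merged

first = ["user: Alice", "city: Hefei", "id: 42"]
second = ["id: 42", "order: 7", "user: Alice"]
print(merge_profiles(first, second))
# → ['user: Alice', 'city: Hefei', 'id: 42', 'order: 7']
```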
A second aspect is an anonymous decision analysis server comprising a memory and a processor; the memory and the processor are coupled; the memory for storing computer program code, the computer program code comprising computer instructions; wherein the computer instructions, when executed by the processor, cause the anonymous decision analysis server to perform the method of the first aspect.
A third aspect is a computer-readable storage medium having stored thereon a computer program which, when executed, performs the method of the first aspect.
According to one embodiment of the invention, an initial user data description file is first obtained, wherein the initial user data description file comprises a sensitive keyword set; secondly, the initial user data description file is mined by using a feedforward neural network language model to obtain U text word vector relationship networks, wherein the feedforward neural network language model comprises U word vector mining units and the input of each word vector mining unit is the output of the previous word vector mining unit; then, the U text word vector relationship networks are processed by using a first sensitive text extraction model to obtain a first user data description file, wherein the first user data description file comprises a first individual feature text set, a first group feature text set and a fuzzy text set, the first individual feature text set corresponds to the text set corresponding to the sensitive keyword set, and the fuzzy text set is an associated text set of the first individual feature text set and the first group feature text set; the U text word vector relationship networks are processed by using a second sensitive text extraction model to obtain a second user data description file, wherein the second user data description file comprises a second individual feature text set and a second group feature text set, and the second individual feature text set corresponds to the text set corresponding to the sensitive keyword set; and finally, a user data description file to be anonymized is generated from the first user data description file and the second user data description file, wherein the user data description file to be anonymized comprises the sensitive keyword set.
According to the embodiment of the invention, two different sensitive text extraction models extract sensitive text from the U text word vector relationship networks of the initial user data description file to obtain the first and second user data description files, and these two files are used to generate the user data description file to be anonymized. Because the relations among the individual feature text set, the group feature text set and the fuzzy text set are considered comprehensively, the generated user data description file to be anonymized both completely covers the individual privacy data and accurately locates the fuzzy text data that could cause indirect disclosure of that privacy data. The generated file can therefore serve as a credible decision basis for data anonymization protection, improving the pertinence and attack resistance of the anonymity protection.
Drawings
Fig. 1 is a schematic flow chart of a decision analysis method applied to big data anonymization processing according to an embodiment of the present invention.
Detailed Description
In the following, the terms "first", "second" and "third", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first," "second," or "third," etc., may explicitly or implicitly include one or more of that feature.
Fig. 1 is a flow chart illustrating a decision analysis method applied to big data anonymization processing according to an embodiment of the present invention, where the decision analysis method applied to big data anonymization processing may be implemented by an anonymization decision analysis server, and the anonymization decision analysis server may include a memory and a processor; the memory and the processor are coupled; the memory for storing computer program code, the computer program code comprising computer instructions; wherein the computer instructions, when executed by the processor, cause the anonymous decision analysis server to perform the techniques described in steps 110-150.
Step 110, obtaining an initial user data description file.
In an embodiment of the invention, the initial user data description profile comprises a set of sensitive keywords.
Further, the obtained initial user data description file may be a user data description file received by using an existing network communication technology, or a user data description file stored by the anonymization decision analysis server itself. For example, the initial user data description file may be an e-commerce, office, medical or gaming user data description file. Taking an office user data description file as an example, the sensitive keyword set may be one or more of project keywords, partner keywords, key keywords, address keywords or price quotation keywords in the office user data description file. Taking an e-commerce user data description file as an example, the sensitive keyword set may be one or more of personalized preference keywords, contact address keywords or browsing record keywords. Taking a medical user data description file as an example, the sensitive keyword set may be one or more of medical record keywords or identity keywords.
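The per-scenario keyword sets above could be captured in a simple lookup table; the mapping below is purely illustrative and the identifiers are assumptions, not terms from the patent.

```python
# Illustrative (assumed) mapping from profile type to sensitive keyword set,
# following the example categories in the paragraph above.
SENSITIVE_KEYWORDS = {
    "office":    {"project", "partner", "key", "address", "quotation"},
    "ecommerce": {"personalized_preference", "contact_address", "browsing_record"},
    "medical":   {"medical_record", "identity"},
}

def sensitive_set(profile_type):
    # unknown profile types get an empty set rather than an error
    return SENSITIVE_KEYWORDS.get(profile_type, set())

print(sorted(sensitive_set("medical")))  # → ['identity', 'medical_record']
```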
Step 120, mining the initial user data description file by using a feedforward neural network language model to obtain U text word vector relationship networks.
The feed-forward neural network language model (FFNN Language Model, FFNNLM) comprises U word vector mining units; the input of each word vector mining unit is the output of the previous word vector mining unit, and U is an integer not less than 1. A word vector mining unit may be understood as a feature extraction layer for text feature extraction.
It can be understood that the feedforward neural network language model arranges user data description files into a queue of U text word vector relationship networks obtained by U rounds of word vector mining: the input of the first round is the initial user data description file, and the input of each subsequent round is the output of the previous round. Further, the head of the text word vector relationship network queue is a high privacy risk index representation of the initial user data description file, and the tail is a prediction with a low privacy risk index. In the embodiment of the invention, the queued user data description files can be understood as a file pyramid or file queue in which the privacy risk indexes at different queue positions differ. The initial user data description file has a high privacy risk index because no anonymization decision analysis has yet been performed on it.
It can be understood that a group of initial user data description files is used as the input of the feedforward neural network language model, the U layers of word vector mining units in the model generate U text word vector relationship networks, and the privacy risk indexes of the U text word vector relationship networks are different.
Step 130, processing the U text word vector relationship networks by using a first sensitive text extraction model to obtain a first user data description file.
The first user data description file comprises a first individual feature text set, a first group feature text set and a fuzzy text set, wherein the first individual feature text set corresponds to the text set corresponding to the sensitive keyword set, and the fuzzy text set is an associated text set of the first individual feature text set and the first group feature text set. The anonymization execution strength of the first individual feature text set is the greatest, that of the first group feature text set is the least (anonymization may even be skipped for it in some scenarios), and that of the fuzzy text set lies in between (the fuzzy text set can accordingly be understood as a cross text set of the first individual feature text set and the first group feature text set, with an anonymization execution strength between those of the two).
It can be understood that the first sensitive text extraction model may include a description file multi-classification module; after the U text word vector relationship networks are processed by this module, the first user data description file is obtained through text word vector sorting and translation.
Step 140, processing the U text word vector relationship networks by using a second sensitive text extraction model to obtain a second user data description file.
The second user data description file comprises a second individual characteristic text set and a second group characteristic text set, and the second individual characteristic text set corresponds to the text set corresponding to the sensitive keyword set.
It can be understood that the U text word vector relationship networks are used as inputs of the second sensitive text extraction model, and the second sensitive text extraction model outputs the second user data description file. The second sensitive text extraction model comprises U+1 moving average nodes: the input of the first node is a text word vector relationship network, the inputs of each of the subsequent U-1 nodes are the output of the previous node and a text word vector relationship network, and the input of the last, (U+1)th, node is the output of the Uth node.
Step 150, generating a user data description file to be anonymized from the first user data description file and the second user data description file.
The user data description file to be anonymized comprises the sensitive keyword set.
It can be understood that the first user data description file and the second user data description file are deduplicated and sorted to generate the user data description file to be anonymized, which includes the sensitive keyword set, thereby completing the anonymization pre-processing of the user data description file (this can also be understood as anonymization decision analysis processing, that is, providing the user data description file that needs to be anonymized).
According to the embodiment of the invention, two different sensitive text extraction models extract the sensitive texts of the U text word vector relationship networks of the initial user data description file to obtain the first user data description file and the second user data description file, and the two files are used to generate the user data description file to be anonymized. The relations among the individual feature text set, the group feature text set and the fuzzy text set are considered comprehensively, so that when the user data description file to be anonymized is generated, the individual privacy data are completely covered and the fuzzy text data that may cause their indirect leakage are accurately located. The generated user data description file to be anonymized can serve as a credible decision basis for data anonymization protection, so that the pertinence and attack resistance of the data anonymity protection are improved.
Under some exemplary design considerations, the first sensitive text extraction model includes U field category attention units and a text vector sorting unit, and step 130 includes sub-steps 1301 through 1305.
Step 1301, processing the U text word vector relationship networks by using the U field category attention units to generate U mixed entry vector sets.
It can be understood that the U text word vector relationship networks are used as inputs of the U field category attention units, one text word vector relationship network being input into each field category attention unit and the networks input into different units being different; the U field category attention units output U mixed entry vector sets, which correspond one-to-one to the U text word vector relationship networks.
The field type attention unit is used for identifying the entry type of the text word vector in the text word vector relation network, so that a mixed entry vector set with a plurality of entry types is obtained. The term category may exemplarily include a privacy category and a non-privacy category, or a privacy category, a non-privacy category, and a pending category, and the like, which are not limited herein.
Step 1303, processing the V mixed entry vector sets in the U mixed entry vector sets by using the text vector sorting unit to obtain a first entry vector distribution.
Wherein V is an integer less than U.
It can be understood that V mixed entry vector sets are determined from the U mixed entry vector sets, and the V mixed entry vector sets are vector-spliced (for example, concatenated) to obtain the first entry vector distribution.
Step 1305, generating the first user data description file from the first entry vector distribution.
It can be understood that the entry vector distribution is a representation of the user data description archive, either as an entry vector matrix or as an entry vector list.
According to the embodiment of the invention, the first sensitive text extraction model comprises U field category attention units for processing the U text word vector relationship networks, and vector splicing is carried out on V mixed entry vector sets selected from the obtained U mixed entry vector sets to generate the first user data description file, which improves both the accuracy and the integrity of the sensitive text extraction of the first sensitive text extraction model.
Under some exemplary design considerations, each text word vector relationship net corresponds to a text knowledge distribution chain. Based on this, sub-step 1301 includes sub-steps 3011 to 3013.
Step 3011, taking the nth text knowledge distribution chain and the (n-1) th text knowledge distribution chain as raw materials of the nth field category attention unit, and outputting the nth mixed entry vector set by using the nth field category attention unit.
The nth text knowledge distribution chain corresponds to the nth text word vector relationship network, the (n-1) th text knowledge distribution chain corresponds to the (n-1) th text word vector relationship network, and n is an integer which is larger than 1 and smaller than U. The text knowledge distribution chain may be a matrix representation of a text word vector relational network, which may be a representation in the form of a knowledge graph or directed graph.
Step 3013, taking the mth text knowledge distribution chain as a raw material of the mth field category attention unit, and outputting the mth mixed entry vector set by using the mth field category attention unit.
Wherein the mth text knowledge distribution chain corresponds to the mth text word vector relationship network, and m is 1 or U.
For example, the user data description archive may be represented in the form of an entry vector distribution, so that each text word vector relationship network corresponds to a text knowledge distribution chain. Taking U = 4 as an example, since n is greater than 1 and smaller than U, the value of n is 2 or 3, and the value of m is 1 or 4. When m is equal to 1, the first text knowledge distribution chain is taken as the raw material of the first field category attention unit, and the first mixed entry vector set is output by using the first field category attention unit; when n is equal to 2, the second text knowledge distribution chain and the first text knowledge distribution chain are taken as the raw materials of the second field category attention unit, and the second mixed entry vector set is output by using the second field category attention unit; when n is equal to 3, the third text knowledge distribution chain and the second text knowledge distribution chain are taken as the raw materials of the third field category attention unit, and the third mixed entry vector set is output by using the third field category attention unit; when m is equal to 4, the fourth text knowledge distribution chain is taken as the raw material of the fourth field category attention unit, and the fourth mixed entry vector set is output by using the fourth field category attention unit. Based on the four text word vector relationship networks, the four field category attention units thus process the four text word vector relationship networks and generate four mixed entry vector sets.
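The input schedule in the example above — boundary units get one chain, interior units get two — can be written down directly. This helper and its dictionary output are illustrative conveniences, not part of the patent:

```python
def attention_unit_inputs(U):
    """Return, for each field category attention unit i (1-based), the
    indices of the text knowledge distribution chains it receives:
    boundary units (m = 1 or m = U) get their own chain only, while
    interior units (1 < n < U) get chain n together with chain n - 1."""
    return {i: [i] if i in (1, U) else [i, i - 1] for i in range(1, U + 1)}

print(attention_unit_inputs(4))
```

For U = 4 this reproduces the schedule in the paragraph above: unit 1 sees chain 1, unit 2 sees chains 2 and 1, unit 3 sees chains 3 and 2, and unit 4 sees chain 4.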
According to the embodiment of the invention, one or two text knowledge distribution chains are processed by different field category attention units of the first sensitive text extraction model to obtain the mixed entry vector sets, which improves the reliability and integrity of the sensitive text extraction of the first sensitive text extraction model.
Under some exemplary design considerations, the first sensitive text extraction model further comprises a detection module, which records the processing detection number G of the field category attention units on the text word vector relationship networks, where G is an integer not less than 1. The detection module can thus be understood as a counting module, and the U text word vector relationship networks correspond to the U text knowledge distribution chains. Further, sub-step 1301 includes sub-steps 13010 through 13015.
Step 13010, acquiring the processing detection number G recorded by the detection module.
It can be understood that the detection module is configured to record the processing detection number G of the field category attention units on the text word vector relationship networks; the processing detection number is incremented by 1 each time a field category attention unit completes one processing of a text knowledge distribution chain. Before the first field category attention unit processes its text knowledge distribution chain, the processing detection number G of the detection module is 0, and after it finishes, G is 1; before the second field category attention unit processes its chain, G is 1, and after it finishes, G is 2; before the third field category attention unit processes its chain, G is 2, and after it finishes, G is 3; before the fourth field category attention unit processes its chain, G is 3, and after the fourth field category attention unit finishes, G is 4.
Step 13011, when the processing detection number G is less than the first limit value, executing steps 13012 to 13015.
It can be understood that the first limit is 1, and when the processing detection number is less than 1, the process of processing the text knowledge distribution chain by the field category attention unit for the first time is started.
Step 13012, taking the first text knowledge distribution chain as a raw material of the first field category attention unit, and outputting a first mixed entry vector set by using the first field category attention unit.
The first text knowledge distribution chain corresponds to a first text word vector relationship network, and the first text word vector relationship network is obtained according to a first word vector mining unit in the feedforward neural network language model.
Step 13013, taking the second text knowledge distribution chain and the first mixed entry vector set as raw materials of the second field category attention unit, and outputting a second mixed entry vector set by using the second field category attention unit.
The second text knowledge distribution chain corresponds to a second text word vector relationship network, and the second text word vector relationship network is obtained according to a second word vector mining unit in the feedforward neural network language model.
Step 13014, taking the third text knowledge distribution chain and the second mixed entry vector set as raw materials of the third field category attention unit, and outputting a third mixed entry vector set by using the third field category attention unit.
The third text knowledge distribution chain corresponds to a third text word vector relationship network, and the third text word vector relationship network is obtained according to a third word vector mining unit in the feedforward neural network language model.
Step 13015, taking the fourth text knowledge distribution chain as a raw material of the fourth field category attention unit, and outputting a fourth mixed entry vector set by using the fourth field category attention unit.
The fourth text knowledge distribution chain corresponds to a fourth text word vector relationship network, and the fourth text word vector relationship network is obtained according to a fourth word vector mining unit in the feedforward neural network language model.
For example, the first pass of field category attention unit processing is as follows: first, the first field category attention unit processes the first text knowledge distribution chain to obtain the first mixed entry vector set; second, the second field category attention unit processes the second text knowledge distribution chain and the first mixed entry vector set to obtain the second mixed entry vector set; then, the third field category attention unit processes the third text knowledge distribution chain and the second mixed entry vector set to obtain the third mixed entry vector set; and finally, the fourth field category attention unit processes the fourth text knowledge distribution chain to obtain the fourth mixed entry vector set.
In the embodiment of the invention, during the first processing pass, the first mixed entry vector set output by the first field category attention unit is added to the raw materials of the second field category attention unit and processed together with the second text knowledge distribution chain to generate the second mixed entry vector set. Similarly, the second mixed entry vector set output by the second field category attention unit is added to the raw materials of the third field category attention unit and processed together with the third text knowledge distribution chain to generate the third mixed entry vector set. This provides support for improving the reliability and integrity of the sensitive text extraction of the first sensitive text extraction model.
Under some exemplary design considerations, sub-step 13010 is further followed by sub-steps 13021-13024.
Step 13021, when the processing detection number G is not less than the first limit value and less than the second limit value, executing steps 13022 to 13024.
The second limit value is the number of iteration rounds. When the first limit value is 1 and the second limit value is 3, the second pass of field category attention unit processing of the text knowledge distribution chains is started; when the first limit value is 1 and the second limit value is 4, after the second pass of field category attention unit processing is finished, the third pass of processing of the text knowledge distribution chains is started.
Step 13022, taking the first text knowledge distribution chain and the third mixed entry vector set as raw materials of the first field type attention unit, and outputting a fifth mixed entry vector set by using the first field type attention unit.
Step 13023, taking the second text knowledge distribution chain and the fifth mixed entry vector set as raw materials of the second field category attention unit, and outputting a sixth mixed entry vector set by using the second field category attention unit.
Step 13024, taking the third text knowledge distribution chain, the third mixed entry vector set and the sixth mixed entry vector set as the raw materials of the third field type attention unit, and outputting a seventh mixed entry vector set by using the third field type attention unit.
The second pass of field category attention unit processing of the text knowledge distribution chains is as follows: first, the first field category attention unit processes the third mixed entry vector set output by the third field category attention unit in the first pass together with the first text knowledge distribution chain to obtain the fifth mixed entry vector set; then, the second field category attention unit processes the second text knowledge distribution chain and the fifth mixed entry vector set to obtain the sixth mixed entry vector set; and finally, the third field category attention unit processes the third text knowledge distribution chain, the third mixed entry vector set and the sixth mixed entry vector set to obtain the seventh mixed entry vector set.
Further, the third pass of field category attention unit processing is as follows: first, the first field category attention unit processes the seventh mixed entry vector set output by the third field category attention unit in the second pass together with the first text knowledge distribution chain to obtain the eighth mixed entry vector set; then, the second field category attention unit processes the eighth mixed entry vector set and the second text knowledge distribution chain to obtain the ninth mixed entry vector set; and finally, the third field category attention unit processes the third text knowledge distribution chain, the seventh mixed entry vector set and the ninth mixed entry vector set to obtain the tenth mixed entry vector set.
In the embodiment of the invention, from the second processing pass onward, the output of the third field category attention unit in the previous pass is reused both as a raw material of the first field category attention unit and as a raw material of the third field category attention unit in the current pass, which provides support for improving the reliability and integrity of the sensitive text extraction of the first sensitive text extraction model.
In an alternative embodiment, sub-step 13010 is followed by sub-steps 13031 through 13032.
Step 13031, when the processing detection number G is equal to the second limit value, executing step 13032.
Step 13032, processing the fourth mixed entry vector set and the seventh mixed entry vector set by using the text vector arrangement unit to obtain the first entry vector distribution.
It can be understood that, when the second limit is 3, the fourth mixed entry vector set output by the fourth field type attention unit in the first processing process and the seventh mixed entry vector set output by the third field type attention unit in the second processing process are subjected to vector set splicing to obtain the first entry vector distribution.
And when the second limit value is 4, carrying out vector set splicing on a fourth mixed entry vector set output by the fourth field type attention unit in the first processing process and a tenth mixed entry vector set output by the third field type attention unit in the third processing process to obtain the first entry vector distribution.
The embodiment of the invention provides an exemplary scheme for processing a user data description archive through the first sensitive text extraction model. Wherein knowledge link1, knowledge link2, knowledge link3 and knowledge link4 are the four text knowledge distribution chains corresponding to the four text word vector relationship networks generated in step 120; attention unit1, attention unit2, attention unit3 and attention unit4 are the four field category attention units; G is the detection module; and co-feature unit is the text vector arrangement unit. The first limit value is 1 and the second limit value is 4, i.e., when the processing detection number G of the field category attention units reaches 4, the iteration is terminated.
The first pass of field category attention unit processing comprises: first, the first text knowledge distribution chain knowledge link1 is taken as the raw material of the first field category attention unit attention unit1, and the first mixed entry vector set entry vector1 is output by using attention unit1; then, the second text knowledge distribution chain knowledge link2 and the first mixed entry vector set entry vector1 are taken as the raw materials of the second field category attention unit attention unit2, and the second mixed entry vector set entry vector2 is output by using attention unit2; then, the third text knowledge distribution chain knowledge link3 and the second mixed entry vector set entry vector2 are taken as the raw materials of the third field category attention unit attention unit3, and the third mixed entry vector set entry vector3 is output by using attention unit3; and finally, the fourth text knowledge distribution chain knowledge link4 is taken as the raw material of the fourth field category attention unit attention unit4, and the fourth mixed entry vector set entry vector4 is output by using attention unit4. After the first pass of field category attention unit processing is completed, the detection module G is 1. Since G is smaller than the second limit value 4, a second pass of field category attention unit processing is required.
Further, the second pass of field category attention unit processing comprises: first, the third mixed entry vector set entry vector3 output by the third field category attention unit attention unit3 in the first pass and the first text knowledge distribution chain knowledge link1 are taken as the raw materials of the first field category attention unit attention unit1, and the fifth mixed entry vector set entry vector5 is output by using attention unit1; then, the second text knowledge distribution chain knowledge link2 and the fifth mixed entry vector set entry vector5 are taken as the raw materials of the second field category attention unit attention unit2, and the sixth mixed entry vector set entry vector6 is output by using attention unit2; then, the third text knowledge distribution chain knowledge link3, the third mixed entry vector set entry vector3 and the sixth mixed entry vector set entry vector6 are taken as the raw materials of the third field category attention unit attention unit3, and the seventh mixed entry vector set entry vector7 is output by using attention unit3. After the second pass of field category attention unit processing is completed, the detection module G is 2. Since G is smaller than the second limit value 4, a third pass of field category attention unit processing is required.
Still further, the third pass of field category attention unit processing comprises: first, the seventh mixed entry vector set entry vector7 output by the third field category attention unit attention unit3 in the second pass and the first text knowledge distribution chain knowledge link1 are taken as the raw materials of the first field category attention unit attention unit1, and the eighth mixed entry vector set entry vector8 is output by using attention unit1; then, the second text knowledge distribution chain knowledge link2 and the eighth mixed entry vector set entry vector8 are taken as the raw materials of the second field category attention unit attention unit2, and the ninth mixed entry vector set entry vector9 is output by using attention unit2; then, the third text knowledge distribution chain knowledge link3, the seventh mixed entry vector set entry vector7 and the ninth mixed entry vector set entry vector9 are taken as the raw materials of the third field category attention unit attention unit3, and the tenth mixed entry vector set entry vector10 is output by using attention unit3. After the third pass of field category attention unit processing is completed, the detection module G is 3. Since G is smaller than the second limit value 4, a fourth pass of field category attention unit processing is required.
Further, the fourth pass of field category attention unit processing comprises: first, the tenth mixed entry vector set entry vector10 output by the third field category attention unit attention unit3 in the third pass and the first text knowledge distribution chain knowledge link1 are taken as the raw materials of the first field category attention unit attention unit1, and the eleventh mixed entry vector set entry vector11 is output by using attention unit1; then, the second text knowledge distribution chain knowledge link2 and the eleventh mixed entry vector set entry vector11 are taken as the raw materials of the second field category attention unit attention unit2, and the twelfth mixed entry vector set entry vector12 is output by using attention unit2; then, the third text knowledge distribution chain knowledge link3, the tenth mixed entry vector set entry vector10 and the twelfth mixed entry vector set entry vector12 are taken as the raw materials of the third field category attention unit attention unit3, and the thirteenth mixed entry vector set entry vector13 is output by using attention unit3. After the fourth pass of field category attention unit processing is completed, the detection module G is 4. Since G is equal to the second limit value 4, no fifth pass of field category attention unit processing is required.
The fourth mixed entry vector set entry vector4 output by the fourth field category attention unit attention unit4 in the first pass and the thirteenth mixed entry vector set entry vector13 output by the third field category attention unit attention unit3 in the fourth pass are input to the text vector arrangement unit co-feature unit to generate the first entry vector distribution, and the first user data description file is obtained through the first entry vector distribution.
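The multi-pass schedule worked through above can be sketched as the following toy loop, in which string tags stand in for vector sets so the data flow is easy to trace. All names (attention_unit, run_first_model, the chain labels) and the string-concatenation "fusion" are illustrative assumptions, not the patent's implementation:

```python
def attention_unit(raw_materials):
    """Toy stand-in for a field category attention unit: fuse its inputs
    into one tagged string (a real unit would output a vector set)."""
    return "(" + "+".join(raw_materials) + ")"

def run_first_model(chains, second_limit=4):
    """Run the multi-pass schedule: pass 1 uses all four units, later
    passes feed unit 3's previous output back into units 1 and 3, and
    iteration stops when the detection count G reaches second_limit."""
    G = 0                # detection module: number of completed passes
    unit4_first = None   # unit 4's output from pass 1
    prev_unit3 = None    # unit 3's output from the previous pass
    while G < second_limit:
        if G == 0:       # first pass
            out1 = attention_unit([chains[0]])
            out2 = attention_unit([chains[1], out1])
            out3 = attention_unit([chains[2], out2])
            unit4_first = attention_unit([chains[3]])
        else:            # later passes reuse unit 3's previous output
            out1 = attention_unit([prev_unit3, chains[0]])
            out2 = attention_unit([chains[1], out1])
            out3 = attention_unit([chains[2], prev_unit3, out2])
        prev_unit3 = out3
        G += 1
    # Text vector arrangement unit: splice pass-1 unit-4 output with the
    # final pass's unit-3 output into the first entry vector distribution.
    return unit4_first + prev_unit3

print(run_first_model(["link1", "link2", "link3", "link4"]))
```

Note that chain 4 enters only in pass 1 and chains 1 to 3 are revisited every pass, mirroring the four-pass example above.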
According to the embodiment of the invention, vector integration is carried out by using the mixed entry vector sets obtained at different processing detection numbers G, which provides support for improving the reliability and integrity of the sensitive text extraction of the first sensitive text extraction model.
In an alternative embodiment, the sub-step 13032 comprises sub-steps 30321 through 30327.
Step 30321, performing semantic description extraction on the fourth mixed entry vector set to obtain a first entry semantic description.
Step 30322, performing text polarity extraction on the first entry semantic description to obtain a first text polarity.
Step 30323, processing the first entry semantic description by using a nonlinear processing module to obtain a first semantic feature.
Step 30324, performing semantic description extraction on the seventh mixed entry vector set to obtain a second entry semantic description.
Step 30325, performing text polarity extraction on the second entry semantic description to obtain a second text polarity.
Step 30326, processing the second entry semantic description by using the nonlinear processing module to obtain a second semantic feature.
Step 30327, generating the first entry vector distribution through the fourth mixed entry vector set, the first semantic feature, the first text polarity, the seventh mixed entry vector set, the second semantic feature and the second text polarity.
It can be understood that generating the first entry vector distribution through the fourth mixed entry vector set, the first semantic feature, the first text polarity, the seventh mixed entry vector set, the second semantic feature and the second text polarity specifically includes: generating a first semantic feature set through the fourth mixed entry vector set and the first semantic feature; generating a second semantic feature set through the seventh mixed entry vector set and the second semantic feature; and performing weighted calculation on the first semantic feature set, the first text polarity, the second semantic feature set and the second text polarity to obtain the first entry vector distribution. The entry semantic description is used to represent the deep or derivative meaning of an entry and can serve as a simulation basis for reverse privacy-stealing deduction in the anonymity protection process. Further, the nonlinear processing module may be an activation function, and the obtained semantic features may be fused on the basis of text polarity (for example, the polarity of positive text that may help protect user privacy, the polarity of negative text that reveals user privacy to a certain extent, and the like), so as to obtain a complete and rich entry vector distribution.
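As an illustration of the weighted calculation described above, the sketch below uses a sigmoid as the nonlinear processing module (the patent only says "activation function") and a simple polarity-weighted sum as the fusion; both choices, and all names, are assumptions:

```python
import math

def nonlinear_processing(values):
    """Stand-in nonlinear processing module: an element-wise sigmoid
    activation applied to an entry semantic description."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

def weighted_fusion(feature_set_a, polarity_a, feature_set_b, polarity_b):
    """Weighted calculation over two semantic feature sets and their text
    polarities (e.g. positive = privacy-protective, negative = privacy-
    leaking), yielding the first entry vector distribution. The simple
    polarity-weighted sum is illustrative only."""
    return [polarity_a * a + polarity_b * b
            for a, b in zip(feature_set_a, feature_set_b)]

features_a = nonlinear_processing([0.0, 2.0])   # first semantic features
features_b = nonlinear_processing([1.0, -1.0])  # second semantic features
print(weighted_fusion(features_a, 0.7, features_b, 0.3))
```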
In the embodiment of the invention, during vector arrangement, semantic description extraction is performed on the vectors to be arranged to obtain the text polarities and the semantic feature sets, and the text polarities and the semantic feature sets are then weighted to obtain the first entry vector distribution, which provides support for improving the reliability and integrity of the sensitive text extraction of the first sensitive text extraction model.
In an alternative embodiment, the second sensitive text extraction model includes U + 1 moving average nodes (convolution nodes); the U text word vector relationship networks correspond to the U text knowledge distribution chains. Based on this, step 140 includes sub-steps 1401 through 1406.
Step 1401, taking the first text knowledge distribution chain as a raw material of a first moving average node, and outputting a first user information visual expression by using the first moving average node.
The first text knowledge distribution chain corresponds to a first text word vector relationship network, and the first text word vector relationship network is obtained according to a first word vector mining unit in the feedforward neural network language model. The user information visual expression may reflect, in the form of a feature vector, the textual detail information contained in the text knowledge distribution chain and/or image information derived from the text.
Step 1402, taking the second text knowledge distribution chain and the first user information visual expression as raw materials of a second moving average node, and outputting a second user information visual expression by using the second moving average node.
The second text knowledge distribution chain corresponds to a second text word vector relationship network, and the second text word vector relationship network is obtained according to a second word vector mining unit in the feedforward neural network language model.
Step 1403, taking the third text knowledge distribution chain and the second user information visual expression as raw materials of a third moving average node, and outputting a third user information visual expression by using the third moving average node.
The third text knowledge distribution chain corresponds to a third text word vector relationship network, and the third text word vector relationship network is obtained according to a third word vector mining unit in the feedforward neural network language model.
Step 1404, taking the fourth text knowledge distribution chain and the third user information visual expression as raw materials of a fourth moving average node, and outputting a fourth user information visual expression by using the fourth moving average node.
The fourth text knowledge distribution chain corresponds to a fourth text word vector relationship network, and the fourth text word vector relationship network is obtained according to a fourth word vector mining unit in the feedforward neural network language model.
Step 1405, taking the fourth user information visual expression as a raw material of a fifth moving average node, and outputting a fifth user information visual expression by using the fifth moving average node.
Step 1406, generating a second user data description file through the fifth user information visual expression.
The embodiment of the invention shows an exemplary scheme for processing the user data description archive through the second sensitive text extraction model. Wherein, knowledge link1, knowledge link2, knowledge link3, and knowledge link4 are four text knowledge distribution chains corresponding to the four text word vector relation nets generated in step 120, node1, node2, node3, node4, and node5 are five moving average nodes in the second sensitive text extraction model, and Visual expression1, visual expression2, visual expression3, visual expression4, and Visual expression5 are five user information Visual expressions generated by the five moving average nodes in the second sensitive text extraction model.
First, the first text knowledge distribution chain knowledge link1 is taken as the raw material of the first moving average node node1, and the first user information Visual expression1 is output by using node1; second, the second text knowledge distribution chain knowledge link2 and the first user information Visual expression1 are taken as the raw materials of the second moving average node node2, and the second user information Visual expression2 is output by using node2; third, the third text knowledge distribution chain knowledge link3 and the second user information Visual expression2 are taken as the raw materials of the third moving average node node3, and the third user information Visual expression3 is output by using node3; then, the fourth text knowledge distribution chain knowledge link4 and the third user information Visual expression3 are taken as the raw materials of the fourth moving average node node4, and the fourth user information Visual expression4 is output by using node4; then, the fourth user information Visual expression4 is taken as the raw material of the fifth moving average node node5, and the fifth user information Visual expression5 is output by using node5. Finally, the second user data description file is generated through the fifth user information Visual expression5.
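The U + 1 node chain above can be sketched with numbers standing in for chains and visual expressions; the equal-weight blending rule inside each node is a hypothetical stand-in for the actual moving average (convolution) operation:

```python
def moving_average_node(chain, prev_expression=None):
    """Toy moving average (convolution) node: average the chain's values
    and, when a previous user information visual expression is supplied,
    blend it in with equal weight."""
    avg = sum(chain) / len(chain)
    return avg if prev_expression is None else (avg + prev_expression) / 2.0

def second_sensitive_text_extraction(chains):
    """Run U + 1 moving average nodes over U text knowledge distribution
    chains, following the node1..node5 flow above (numeric chains here)."""
    expression = moving_average_node(chains[0])       # node 1: chain 1 only
    for chain in chains[1:]:                          # nodes 2..U
        expression = moving_average_node(chain, expression)
    return moving_average_node([expression])          # node U + 1: prev only

print(second_sensitive_text_extraction([[1, 1], [2, 2], [3, 3], [4, 4]]))
```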
According to the embodiment of the invention, the text word vector relationship networks are subjected to multiple rounds of moving averaging to generate the second user data description file, which provides support for improving the reliability and integrity of the sensitive text extraction of the second sensitive text extraction model.
In an alternative embodiment, the privacy risk index of the initial user data profile is F, where F is an integer greater than 1. Based on this, step 120 includes sub-steps 1201 to 1204.
Step 1201, mining the initial user data description archive through a fourth word vector mining unit to obtain a fourth text word vector relationship network.
The privacy risk index of the fourth text word vector relationship network is q4 × F.
Step 1202, processing the fourth text word vector relationship network through a third word vector mining unit to obtain a third text word vector relationship network.
The privacy risk index of the third text word vector relationship network is q3 × F.
Step 1203, processing the third text word vector relationship network through a second word vector mining unit to obtain a second text word vector relationship network.
The privacy risk index of the second text word vector relationship network is q2 × F.
Step 1204, processing the second text word vector relationship network through a first word vector mining unit to obtain a first text word vector relationship network.
The privacy risk index of the first text word vector relationship network is q1 × F.
Further, q1 to q4 correspond to different weight coefficients respectively, with q4 > q3 > q2 > q1. It can be understood that the privacy risk index decreases as the number of semantic description refinements increases.
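The risk-index relation above can be written out directly; the concrete weight values below are hypothetical, and only the ordering q4 > q3 > q2 > q1 comes from the description:

```python
def privacy_risk_indices(F, weights=(0.2, 0.4, 0.6, 0.8)):
    """Privacy risk index q_i * F of each text word vector relationship
    network. The default weights are illustrative placeholders chosen to
    satisfy q4 > q3 > q2 > q1, i.e. risk falls as refinement deepens."""
    q1, q2, q3, q4 = weights
    return {1: q1 * F, 2: q2 * F, 3: q3 * F, 4: q4 * F}

print(privacy_risk_indices(F=10))
```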
An exemplary scheme for mining the initial user data description archive by using the feedforward neural network language model is provided in the embodiment of the invention. Wherein file is the initial user data description archive; feature extraction unit1, feature extraction unit2, feature extraction unit3 and feature extraction unit4 are the four word vector mining units; relationship network1, relationship network2, relationship network3 and relationship network4 are the four text word vector relationship networks output by the four word vector mining units; and knowledge link1, knowledge link2, knowledge link3 and knowledge link4 are the text knowledge distribution chains corresponding to the four text word vector relationship networks.
First, the initial user data description archive file is taken as the raw material of the fourth word vector mining unit feature extraction unit4, and the fourth text word vector relationship network relationship network4 is output by using feature extraction unit4; relationship network4 corresponds to the fourth text knowledge distribution chain knowledge link4. Then, relationship network4 is taken as the raw material of the third word vector mining unit feature extraction unit3, and the third text word vector relationship network relationship network3 is output by using feature extraction unit3; relationship network3 corresponds to the third text knowledge distribution chain knowledge link3. Then, relationship network3 is taken as the raw material of the second word vector mining unit feature extraction unit2, and the second text word vector relationship network relationship network2 is output by using feature extraction unit2; relationship network2 corresponds to the second text knowledge distribution chain knowledge link2. Finally, relationship network2 is taken as the raw material of the first word vector mining unit feature extraction unit1, and the first text word vector relationship network relationship network1 is output by using feature extraction unit1; relationship network1 corresponds to the first text knowledge distribution chain knowledge link1.
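The reverse-order mining chain (unit 4 first, unit 1 last) can be sketched as follows, with plain callables standing in for the language model's mining units; all names and the string-wrapping units are illustrative assumptions:

```python
def mine_relationship_networks(profile, mining_units):
    """Chain the word vector mining units in reverse order: unit 4 mines
    the initial profile, and each unit's output relationship network is
    the raw material of the next lower-numbered unit."""
    networks = {}
    data = profile
    for index in range(len(mining_units), 0, -1):
        data = mining_units[index - 1](data)
        networks[index] = data
    return networks

# Toy units that just tag their input so the chaining order is visible.
units = [lambda x, i=i: f"net{i}({x})" for i in (1, 2, 3, 4)]
print(mine_relationship_networks("file", units))
```

The `i=i` default-argument idiom pins each lambda to its own index; without it, all four units would share the final loop value.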
In the embodiment of the invention, the feedforward neural network language model is used to mine the initial user data description file into a plurality of text word vector relationship networks, and these relationship networks serve as the raw material of the first sensitive text extraction model and the second sensitive text extraction model, which provides support for improving the reliability and integrity of sensitive text extraction.
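The chained structure described above can be sketched as follows. This is an illustrative sketch only: the text does not define the internals of the word vector mining units, so the co-occurrence-window "mining" and the function names below are assumptions standing in for the real feature maps.

```python
def mining_unit(level):
    """Toy word vector mining unit: maps each token to the tokens that
    co-occur within `level` positions of it (an assumed stand-in)."""
    def extract(tokens):
        net = {}
        for i, tok in enumerate(tokens):
            window = tokens[max(0, i - level):i] + tokens[i + 1:i + 1 + level]
            net.setdefault(tok, set()).update(window)
        return net
    return extract

def mine_archive(tokens):
    """STEP12-STEP15 wiring: unit4 consumes the raw archive, and units
    3, 2 and 1 each consume the previous relationship network."""
    nets, material = {}, list(tokens)
    for level in (4, 3, 2, 1):
        net = mining_unit(level)(material)
        nets[f"relationship network{level}"] = net
        material = sorted(net)   # raw material for the next unit
    return nets
```

The only property carried over from the text is the chaining: each unit's raw material is the previous unit's output.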
In an alternative embodiment, step 150 includes sub-step 1501.
Step 1501, obtaining the user data description file to be anonymized according to the deduplication processing result of the archive statements of the first user data description file and the archive statements of the second user data description file.
According to the embodiment of the invention, the deduplication processing result of the archive statements of the first user data description file and the second user data description file is used as the archive statements of the user data description file to be anonymized, and performing deduplication processing on the first user data description file and the second user data description file improves the reliability and integrity of the generated file.
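A minimal sketch of the deduplication in step 1501, assuming archive statements can be compared as strings and that first-seen order is preserved (the text specifies neither):

```python
def merge_deduplicate(first_statements, second_statements):
    """Merge the archive statements of the two description files,
    dropping repeated statements while keeping first-seen order."""
    seen, merged = set(), []
    for stmt in first_statements + second_statements:
        if stmt not in seen:
            seen.add(stmt)
            merged.append(stmt)
    return merged
```

The merged list stands for the archive statements of the user data description file to be anonymized.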
Under other design considerations which can be implemented independently, the decision analysis method applied to big data anonymization processing includes STEP11 to STEP51. It can be understood that STEP12 to STEP15 are the processing procedure of the feedforward neural network language model, STEP21 to STEP35 are the processing procedure of the first sensitive text extraction model, and STEP41 to STEP46 are the processing procedure of the second sensitive text extraction model. STEP21 to STEP35 and STEP41 to STEP46 have no required execution order and may be executed in parallel, and STEP51 is executed after both STEP21 to STEP35 and STEP41 to STEP46 are completed.
STEP11, acquiring an initial user data description file.

Wherein the initial user data description file comprises a sensitive keyword set. The privacy risk index of the initial user data description file is F.
STEP12, processing the initial user data description file according to the fourth word vector mining unit feature extraction unit4 in the feedforward neural network language model to obtain the fourth text word vector relationship network relationship network4.

The privacy risk index of the fourth text word vector relationship network relationship network4 is q4 × F. The fourth text word vector relationship network relationship network4 corresponds to the fourth text knowledge distribution chain knowledge link4.

STEP13, processing the fourth text word vector relationship network relationship network4 according to the third word vector mining unit feature extraction unit3 in the feedforward neural network language model to obtain the third text word vector relationship network relationship network3.

The privacy risk index of the third text word vector relationship network relationship network3 is q3 × F. The third text word vector relationship network relationship network3 corresponds to the third text knowledge distribution chain knowledge link3.

STEP14, processing the third text word vector relationship network relationship network3 according to the second word vector mining unit feature extraction unit2 in the feedforward neural network language model to obtain the second text word vector relationship network relationship network2.

The privacy risk index of the second text word vector relationship network relationship network2 is q2 × F. The second text word vector relationship network relationship network2 corresponds to the second text knowledge distribution chain knowledge link2.

STEP15, processing the second text word vector relationship network relationship network2 according to the first word vector mining unit feature extraction unit1 in the feedforward neural network language model to obtain the first text word vector relationship network relationship network1.

The privacy risk index of the first text word vector relationship network relationship network1 is q1 × F. The first text word vector relationship network relationship network1 corresponds to the first text knowledge distribution chain knowledge link1.
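The privacy risk index bookkeeping of STEP12 to STEP15 amounts to multiplying the archive's index F by a per-unit coefficient. A sketch, with placeholder values standing in for the unspecified coefficients q4 to q1:

```python
def propagate_risk(F, q):
    """Return the privacy risk index of each text word vector
    relationship network, given the archive's index F and per-unit
    coefficients q = {4: q4, 3: q3, 2: q2, 1: q1}."""
    return {level: q[level] * F for level in (4, 3, 2, 1)}
```

The coefficient values are illustrative; the text only fixes the form q_i × F.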
STEP21, obtaining the processing detection number G, recorded by the detection module, of the field type attention units on the text word vector relationship networks.

It can be understood that, when the processing detection number G recorded by the detection module is 0, the first-level processing of the text word vector relationship networks by the field type attention units is started.
STEP22, taking the first text knowledge distribution chain knowledge link1 as the raw material of the first field type attention unit attention unit1, and outputting the first mixed entry vector set entry vector1 by using the first field type attention unit attention unit1.

STEP23, taking the second text knowledge distribution chain knowledge link2 and the first mixed entry vector set entry vector1 as the raw materials of the second field type attention unit attention unit2, and outputting the second mixed entry vector set entry vector2 by using the second field type attention unit attention unit2.

STEP24, taking the third text knowledge distribution chain knowledge link3 and the second mixed entry vector set entry vector2 as the raw materials of the third field type attention unit attention unit3, and outputting the third mixed entry vector set entry vector3 by using the third field type attention unit attention unit3.

STEP25, taking the fourth text knowledge distribution chain knowledge link4 as the raw material of the fourth field type attention unit attention unit4, and outputting the fourth mixed entry vector set entry vector4 by using the fourth field type attention unit attention unit4.

It can be understood that the processing detection number G recorded by the detection module is 1 after the first-level field type attention unit processing is completed. Since G is smaller than the second limit value 4, second-level field type attention unit processing is required.
STEP26, taking the third mixed entry vector set entry vector3 output by the third field type attention unit attention unit3 in the first round and the first text knowledge distribution chain knowledge link1 as the raw materials of the first field type attention unit attention unit1, and outputting the fifth mixed entry vector set entry vector5 by using the first field type attention unit attention unit1.

STEP27, taking the second text knowledge distribution chain knowledge link2 and the fifth mixed entry vector set entry vector5 as the raw materials of the second field type attention unit attention unit2, and outputting the sixth mixed entry vector set entry vector6 by using the second field type attention unit attention unit2.

STEP28, taking the third text knowledge distribution chain knowledge link3, the third mixed entry vector set entry vector3 and the sixth mixed entry vector set entry vector6 as the raw materials of the third field type attention unit attention unit3, and outputting the seventh mixed entry vector set entry vector7 by using the third field type attention unit attention unit3.

It can be understood that the processing detection number G recorded by the detection module is 2 after the second-level field type attention unit processing is completed. Since G is smaller than the second limit value 4, third-level field type attention unit processing is required.
STEP29, taking the seventh mixed entry vector set entry vector7 output by the third field type attention unit attention unit3 in the second round and the first text knowledge distribution chain knowledge link1 as the raw materials of the first field type attention unit attention unit1, and outputting the eighth mixed entry vector set entry vector8 by using the first field type attention unit attention unit1.

STEP30, taking the second text knowledge distribution chain knowledge link2 and the eighth mixed entry vector set entry vector8 as the raw materials of the second field type attention unit attention unit2, and outputting the ninth mixed entry vector set entry vector9 by using the second field type attention unit attention unit2.

STEP31, taking the third text knowledge distribution chain knowledge link3, the seventh mixed entry vector set entry vector7 and the ninth mixed entry vector set entry vector9 as the raw materials of the third field type attention unit attention unit3, and outputting the tenth mixed entry vector set entry vector10 by using the third field type attention unit attention unit3.

It can be understood that the processing detection number G recorded by the detection module is 3 after the third-level field type attention unit processing is completed. Since G is smaller than the second limit value 4, fourth-level field type attention unit processing is required.
STEP32, taking the tenth mixed entry vector set entry vector10 output by the third field type attention unit attention unit3 in the third round and the first text knowledge distribution chain knowledge link1 as the raw materials of the first field type attention unit attention unit1, and outputting the eleventh mixed entry vector set entry vector11 by using the first field type attention unit attention unit1.

STEP33, taking the second text knowledge distribution chain knowledge link2 and the eleventh mixed entry vector set entry vector11 as the raw materials of the second field type attention unit attention unit2, and outputting the twelfth mixed entry vector set entry vector12 by using the second field type attention unit attention unit2.

STEP34, taking the third text knowledge distribution chain knowledge link3, the tenth mixed entry vector set entry vector10 and the twelfth mixed entry vector set entry vector12 as the raw materials of the third field type attention unit attention unit3, and outputting the thirteenth mixed entry vector set entry vector13 by using the third field type attention unit attention unit3.

It can be understood that the processing detection number G recorded by the detection module is 4 after the fourth-level field type attention unit processing is completed. Since G is equal to the second limit value 4, fifth-level field type attention unit processing is not required.

STEP35, inputting the fourth mixed entry vector set entry vector4 output by the fourth field type attention unit attention unit4 in the first round and the thirteenth mixed entry vector set entry vector13 output by the third field type attention unit attention unit3 in the fourth round into the text vector sorting unit co-feature unit to generate the first entry vector distribution, and obtaining the first user data description file through the first entry vector distribution.
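The control flow of STEP21 to STEP35, that is, four levels of field type attention unit processing gated by the detection counter G and the second limit value 4, can be sketched as follows. The attention units themselves are stubbed as set unions (an assumption, since the text does not define their internals); only the wiring and the counter follow the steps above.

```python
def run_attention_rounds(knowledge_links, second_limit=4):
    """knowledge_links maps 1..4 to a set standing in for each text
    knowledge distribution chain."""
    att = lambda *materials: set().union(*materials)  # stub attention unit
    G = 0
    # First level: units 1..3 chained on links 1..3; unit 4 standalone.
    v1 = att(knowledge_links[1])
    v2 = att(knowledge_links[2], v1)
    v3 = att(knowledge_links[3], v2)
    v4 = att(knowledge_links[4])
    G += 1
    # Later levels feed the previous level's unit-3 output back into
    # unit 1, and combine it again with unit 2's fresh output at unit 3.
    while G < second_limit:
        v1 = att(knowledge_links[1], v3)
        v2 = att(knowledge_links[2], v1)
        v3 = att(knowledge_links[3], v3, v2)
        G += 1
    # STEP35: unit 4's first-level output and unit 3's final output go
    # to the text vector sorting unit (stubbed as a sorted union).
    return sorted(att(v4, v3)), G
```

Processing stops exactly when G reaches the second limit value, matching the G = 4 check before STEP35.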
STEP41, taking the first text knowledge distribution chain knowledge link1 as the raw material of the first moving average node node1, and outputting the first user information visual expression visual expression1 by using the first moving average node node1.

STEP42, taking the second text knowledge distribution chain knowledge link2 and the first user information visual expression visual expression1 as the raw materials of the second moving average node node2, and outputting the second user information visual expression visual expression2 by using the second moving average node node2.

STEP43, taking the third text knowledge distribution chain knowledge link3 and the second user information visual expression visual expression2 as the raw materials of the third moving average node node3, and outputting the third user information visual expression visual expression3 by using the third moving average node node3.

STEP44, taking the fourth text knowledge distribution chain knowledge link4 and the third user information visual expression visual expression3 as the raw materials of the fourth moving average node node4, and outputting the fourth user information visual expression visual expression4 by using the fourth moving average node node4.

STEP45, taking the fourth user information visual expression visual expression4 as the raw material of the fifth moving average node node5, and outputting the fifth user information visual expression visual expression5 by using the fifth moving average node node5.

STEP46, generating the second user data description file according to the fifth user information visual expression visual expression5.
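The chained moving average nodes of STEP41 to STEP46 can be sketched as follows. The numeric summaries of the knowledge distribution chains and the averaging stub are assumptions, since the text does not define the nodes' internals; only the chaining follows the steps above.

```python
def moving_average_chain(chains):
    """chains maps 1..4 to a numeric summary of each text knowledge
    distribution chain; nodes are stubbed as running averages."""
    node = lambda chain, prev: (chain + prev) / 2.0  # stub moving average node
    expr = node(chains[1], 0.0)          # node 1: knowledge link1 only
    for level in (2, 3, 4):              # nodes 2-4: chain + previous expression
        expr = node(chains[level], expr)
    # Node 5 consumes only node 4's visual expression (STEP45); stubbed
    # here as an identity pass. The result is the basis of the second
    # user data description file (STEP46).
    return expr
```

The five-node layout mirrors the U + 1 = 5 moving average nodes of claim 7.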
STEP51, obtaining the user data description file to be anonymized according to the deduplication processing result of the archive statements of the first user data description file and the archive statements of the second user data description file.
Under other independent design ideas, after the user data description file to be anonymized is obtained, targeted data anonymization processing can be performed on the sensitive keyword set in the user data description file to be anonymized. Exemplary data anonymity means include masking, pseudonymization, generalization, shuffling, scrambling and the like. Technical personnel in the field can flexibly select a data anonymity means according to the actual situation to process the user data description file to be anonymized, so that the privacy information corresponding to the individual feature text set is accurately protected and the fuzzy text set is anonymized to the maximum extent, thereby avoiding partial leakage of the individual feature text set caused by a third party deducing it from the fuzzy text set in an anti-anonymity manner.
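For illustration, minimal sketches of three of the listed anonymity means (masking, pseudonymization and generalization) as they might be applied to values in the sensitive keyword set. The prefix length, salt and bucket width below are hypothetical choices, not part of the described method.

```python
import hashlib

def mask(value, keep=3):
    """Masking: keep a short prefix and blank out the rest."""
    return value[:keep] + "*" * (len(value) - keep)

def pseudonymize(value, salt="demo-salt"):
    """Pseudonymization: replace the value with a stable opaque token
    (salted hash prefix; salt is an illustrative placeholder)."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:8]

def generalize_age(age):
    """Generalization: report a coarse ten-year bucket, not the age."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"
```

Which means to apply to which field remains a per-deployment decision, as the text notes.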
In addition, under some design ideas which can be implemented independently, after the targeted data anonymization processing of the sensitive keyword set in the user data description file to be anonymized is completed, an anonymous user data description file can be obtained, and the anonymous user data description file is then published to realize data resource sharing on the premise of privacy protection. Based on this, the method may further include the following: performing data anonymization processing on the sensitive keyword set in the user data description file to be anonymized to obtain an anonymous user data description file; publishing the anonymous user data description file; performing data stealing intention analysis according to the data access behavior for the anonymous user data description file to obtain data stealing intention information; and updating the access authority by using the data stealing intention information. With this design, after the anonymous user data description file is published, data anonymity protection can be continuously followed up: data stealing intention analysis is performed on data access behaviors, and the access authority is updated after the data stealing intention information is obtained, so as to further improve the data security of the anonymous user data description file.
In addition, under some design ideas which can be implemented independently, performing data stealing intention analysis according to the data access behavior for the anonymous user data description file to obtain data stealing intention information can include the following: acquiring a visual operation data set for the data access behavior, wherein the visual operation data set comprises at least two pieces of visual operation data; obtaining a link index between each piece of visual operation data in the visual operation data set and the data access behavior; sorting the pieces of visual operation data according to their link indexes and their abnormal operation tendency prediction features to obtain a corresponding visual operation data group; generating a target risk intention conclusion group for the data access behavior based on the visual operation data group, the target risk intention conclusion group comprising at least two target risk intention conclusions; and determining the data stealing intention information based on the target risk intention conclusion group.
The data stealing intention information can be determined according to a set number of top-ranked target risk intention conclusions in the target risk intention conclusion group. For example, if the top three target risk intention conclusions all point to user portrait information stealing, the data stealing intention information can be determined to be a portrait stealing intention. In this way, comprehensive sorting and analysis can be performed on the basis of different visual operation data, so that the reliability of the data stealing intention information is improved.
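A sketch of the "top three conclusions point to the same target" rule described above; the conclusion labels and the fallback value are illustrative.

```python
def determine_theft_intent(ranked_conclusions, top_n=3):
    """Adopt the target of the top-ranked conclusions as the data
    stealing intention if they all agree; otherwise stay undetermined."""
    top = ranked_conclusions[:top_n]
    if top and all(c == top[0] for c in top):
        return top[0]
    return "undetermined"
```

The set number top_n = 3 matches the example in the text but could be any configured value.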
In addition, under some design ideas that can be implemented independently, sorting the pieces of visual operation data according to their link indexes and their abnormal operation tendency prediction features to obtain the corresponding visual operation data group includes: dividing the pieces of visual operation data according to their link indexes and their abnormal operation tendency prediction features to obtain at least two visual operation data queues; and sorting the visual operation data queues, and separately sorting the visual operation data within each queue, to obtain the visual operation data group.
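A sketch of the two-stage sorting described above, assuming each piece of visual operation data is a (tendency feature, link index, name) record. Ranking higher values first is an assumption, since the text does not fix the order.

```python
def sort_visual_operations(records):
    """Divide records into queues by abnormal operation tendency
    prediction feature, order the queues, then order each queue by the
    link index to the data access behavior."""
    queues = {}
    for tendency, link_index, name in records:
        queues.setdefault(tendency, []).append((link_index, name))
    ordered = []
    for tendency in sorted(queues, reverse=True):          # order the queues
        for link_index, name in sorted(queues[tendency], reverse=True):
            ordered.append(name)                           # order within a queue
    return ordered
```

The flat output list stands for the visual operation data group fed to intent conclusion generation.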
According to the embodiment of the invention, two different sensitive text extraction models are used to perform sensitive text extraction on the U text word vector relationship networks of the initial user data description file to obtain a first user data description file and a second user data description file, and the first user data description file and the second user data description file are used to generate the user data description file to be anonymized. In this way, the relations among the individual feature text set, the group feature text set and the fuzzy text set can be comprehensively considered, so that, when the user data description file to be anonymized is generated, the complete inclusion of individual privacy data and the accurate positioning of the partial fuzzy text data that may cause indirect leakage of individual privacy data are ensured. The generated user data description file to be anonymized can therefore serve as a credible decision basis for data anonymity protection, thereby improving the pertinence and attack resistance of data anonymity protection.
The above description is only a specific embodiment of the present invention. Those skilled in the art can conceive of changes or substitutions based on the specific embodiments provided by the present invention, and all such changes or substitutions are intended to be included within the scope of the present invention.

Claims (10)

1. A decision analysis method applied to big data anonymization processing is characterized by being applied to an anonymization decision analysis server, and the method comprises the following steps:
acquiring an initial user data description file, wherein the initial user data description file comprises a sensitive keyword set;
mining the initial user data description file by utilizing a feedforward neural network language model to obtain U text word vector relation networks, wherein the feedforward neural network language model comprises U word vector mining units, the raw material of each word vector mining unit is the output of the previous word vector mining unit, and U is an integer not less than 1;
processing the U text word vector relationship network by using a first sensitive text extraction model to obtain a first user data description file, wherein the first user data description file comprises a first individual feature text set, a first cluster feature text set and a fuzzy text set, the first individual feature text set corresponds to a text set corresponding to the sensitive keyword set, and the fuzzy text set is an associated text set of the first individual feature text set and the first cluster feature text set;
processing the U text word vector relationship network by using a second sensitive text extraction model to obtain a second user data description file, wherein the second user data description file comprises a second individual feature text set and a second group feature text set, and the second individual feature text set corresponds to a text set corresponding to the sensitive keyword set;
and generating a user data description file to be anonymous according to the first user data description file and the second user data description file, wherein the user data description file to be anonymous comprises the sensitive keyword set.
2. The decision analysis method applied to big data anonymization processing according to claim 1, wherein the first sensitive text extraction model comprises U field category attention units and a text vector sorting unit; and the processing the U text word vector relationship network by using the first sensitive text extraction model to obtain a first user data description file comprises the following steps:
processing the U text word vector relationship networks by using the U field category attention units to generate U mixed entry vector sets;
processing V mixed entry vector sets in the U mixed entry vector sets by using the text vector sorting unit to obtain first entry vector distribution, wherein V is an integer smaller than U;
and generating the first user data description file according to the first entry vector distribution.
3. The decision analysis method applied to big data anonymization of claim 2, wherein each of the text word vector relationship networks corresponds to a text knowledge distribution chain; and the processing the U text word vector relationship networks by using the U field category attention units to generate U mixed entry vector sets comprises:
taking an nth text knowledge distribution chain and an (n-1) th text knowledge distribution chain as raw materials of an nth field category attention unit, and outputting an nth mixed entry vector set by using the nth field category attention unit, wherein the nth text knowledge distribution chain corresponds to an nth text word vector relationship network, the (n-1) th text knowledge distribution chain corresponds to an (n-1) th text word vector relationship network, and n is an integer greater than 1 and less than U;
and taking the mth text knowledge distribution chain as a raw material of the mth field category attention unit, and outputting an mth mixed entry vector set by using the mth field category attention unit, wherein the mth text knowledge distribution chain corresponds to the mth text word vector relationship network, and m is 1 or U.
4. The decision analysis method applied to big data anonymization processing according to claim 2, wherein the first sensitive text extraction model further includes a detection module, the detection module is configured to record a processing detection number G of the field category attention unit on a text word vector relationship network, where G is an integer not less than 1; the U text word vector relationship networks correspond to U text knowledge distribution chains; the processing the U text word vector relationship networks by using the U field category attention units to generate U mixed entry vector sets, including:
acquiring the processing detection number G recorded by the detection module;
when the processing detection number G is smaller than a first limit value, a first text knowledge distribution chain is used as a raw material of a first field type attention unit, and a first mixed entry vector set is output by using the first field type attention unit, wherein the first text knowledge distribution chain corresponds to a first text word vector relation network which is obtained according to a first word vector mining unit in the feedforward neural network language model;
taking a second text knowledge distribution chain and the first mixed entry vector set as raw materials of a second field type attention unit, and outputting a second mixed entry vector set by using the second field type attention unit, wherein the second text knowledge distribution chain corresponds to a second text word vector relationship network which is obtained by a second word vector mining unit in the feedforward neural network language model;
taking a third text knowledge distribution chain and the second mixed entry vector set as raw materials of a third field type attention unit, and outputting a third mixed entry vector set by using the third field type attention unit, wherein the third text knowledge distribution chain corresponds to a third text word vector relationship network, and the third text word vector relationship network is obtained according to a third word vector mining unit in the feedforward neural network language model;
and taking a fourth text knowledge distribution chain as a raw material of a fourth field type attention unit, and outputting a fourth mixed entry vector set by using the fourth field type attention unit, wherein the fourth text knowledge distribution chain corresponds to a fourth text word vector relation network, and the fourth text word vector relation network is obtained according to a fourth word vector mining unit in the feedforward neural network language model.
5. The decision analysis method applied to big data anonymization processing according to claim 4, wherein after obtaining the processing detection number G recorded by the detection module, the method further comprises:
when the processing detection number G is not less than a first limit value and less than a second limit value, taking the first text knowledge distribution chain and the third mixed entry vector set as raw materials of the first field type attention unit, and outputting a fifth mixed entry vector set by using the first field type attention unit;
taking the second text knowledge distribution chain and the fifth mixed entry vector set as raw materials of the second field type attention unit, and outputting a sixth mixed entry vector set by using the second field type attention unit;
and taking the third text knowledge distribution chain, the third mixed entry vector set and the sixth mixed entry vector set as raw materials of the third field type attention unit, and outputting a seventh mixed entry vector set by using the third field type attention unit.
6. The decision analysis method as claimed in claim 5, wherein the processing, by the text vector sorting unit, the V mixed entry vector sets in the U mixed entry vector sets to obtain a first entry vector distribution comprises: when the processing detection number G is equal to the second limit value, processing the fourth mixed entry vector set and the seventh mixed entry vector set by using the text vector sorting unit to obtain first entry vector distribution;
wherein the processing the fourth mixed entry vector set and the seventh mixed entry vector set by using the text vector sorting unit to obtain a first entry vector distribution includes: semantic description extraction is carried out on the fourth mixed entry vector set to obtain first entry semantic description; extracting text polarity from the semantic description of the first entry to obtain a first text polarity; processing the semantic description of the first entry by using a nonlinear processing module to obtain a first semantic feature; semantic description extraction is carried out on the seventh mixed entry vector set to obtain second entry semantic description; extracting text polarity from the semantic description of the second entry to obtain a second text polarity; processing the second entry semantic description by using a nonlinear processing module to obtain a second semantic feature; and generating a first entry vector distribution according to the fourth mixed entry vector set, the first semantic feature, the first text polarity, the seventh mixed entry vector set, the second semantic feature and the second text polarity.
7. The decision analysis method applied to big data anonymization processing according to claim 1, wherein the second sensitive text extraction model includes U +1 moving average nodes; the U text word vector relation networks correspond to U text knowledge distribution chains; processing the U text word vector relationship network by using a second sensitive text extraction model to obtain a second user data description file, wherein the processing comprises the following steps:
taking a first text knowledge distribution chain as a raw material of a first moving average node, and outputting a first user information visual expression by using the first moving average node, wherein the first text knowledge distribution chain corresponds to a first text word vector relation network which is obtained according to a first word vector mining unit in the feedforward neural network language model;
taking a second text knowledge distribution chain and the first user information visual expression as raw materials of a second moving average node, and outputting a second user information visual expression by using the second moving average node, wherein the second text knowledge distribution chain corresponds to a second text word vector relation network which is obtained according to a second word vector mining unit in the feedforward neural network language model;
taking a third text knowledge distribution chain and the second user information visual expression as raw materials of a third moving average node, and outputting a third user information visual expression by using the third moving average node, wherein the third text knowledge distribution chain corresponds to a third text word vector relation network which is obtained according to a third word vector mining unit in the feedforward neural network language model;
taking a fourth text knowledge distribution chain and the third user information visual expression as raw materials of a fourth moving average node, and outputting a fourth user information visual expression by using the fourth moving average node, wherein the fourth text knowledge distribution chain corresponds to a fourth text word vector relation network, and the fourth text word vector relation network is obtained according to a fourth word vector mining unit in the feedforward neural network language model;
taking a fourth user information visual expression as a raw material of a fifth moving average node, and outputting the fifth user information visual expression by using the fifth moving average node;
and generating the second user data description file according to the fifth user information visual expression.
8. The decision analysis method applied to big data anonymization processing according to claim 1, wherein the privacy risk index of the initial user data description file is F, where F is an integer greater than 1, and the mining of the initial user data description file by using the feedforward neural network language model to obtain U text word vector relationship networks includes:
mining the initial user data description file through a fourth word vector mining unit to obtain a fourth text word vector relationship network, wherein the privacy risk index of the fourth text word vector relationship network is q4 × F;

processing the fourth text word vector relationship network through a third word vector mining unit to obtain a third text word vector relationship network, wherein the privacy risk index of the third text word vector relationship network is q3 × F;

processing the third text word vector relationship network through a second word vector mining unit to obtain a second text word vector relationship network, wherein the privacy risk index of the second text word vector relationship network is q2 × F;

and processing the second text word vector relationship network through a first word vector mining unit to obtain a first text word vector relationship network, wherein the privacy risk index of the first text word vector relationship network is q1 × F.
9. The method as claimed in claim 1, wherein the step of generating the user data description file to be anonymized according to the first user data description file and the second user data description file comprises:
and obtaining the user data description file to be anonymous according to the deduplication processing result of the file statement of the first user data description file and the file statement of the second user data description file.
10. An anonymous decision analysis server, comprising: a memory and a processor; the memory and the processor are coupled; the memory for storing computer program code, the computer program code comprising computer instructions; wherein the computer instructions, when executed by the processor, cause the anonymous decision analysis server to perform the method of any of claims 1-9.
CN202211670623.0A 2022-12-26 2022-12-26 Decision analysis method and server applied to big data anonymous processing Withdrawn CN115712703A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211670623.0A CN115712703A (en) 2022-12-26 2022-12-26 Decision analysis method and server applied to big data anonymous processing


Publications (1)

Publication Number Publication Date
CN115712703A true CN115712703A (en) 2023-02-24

Family

ID=85236089

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211670623.0A Withdrawn CN115712703A (en) 2022-12-26 2022-12-26 Decision analysis method and server applied to big data anonymous processing

Country Status (1)

Country Link
CN (1) CN115712703A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116341660A (en) * 2023-05-30 2023-06-27 八爪鱼人工智能科技(常熟)有限公司 Information optimization method and server applied to artificial intelligence
CN116361858A (en) * 2023-04-10 2023-06-30 广西南宁玺北科技有限公司 User session resource data protection method and software product applying AI decision
CN117349879A (en) * 2023-09-11 2024-01-05 江苏汉康东优信息技术有限公司 Text data anonymization privacy protection method based on continuous word bag model

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116361858A (en) * 2023-04-10 2023-06-30 广西南宁玺北科技有限公司 User session resource data protection method and software product applying AI decision
CN116361858B (en) * 2023-04-10 2024-01-26 北京无限自在文化传媒股份有限公司 User session resource data protection method and software product applying AI decision
CN116341660A (en) * 2023-05-30 2023-06-27 八爪鱼人工智能科技(常熟)有限公司 Information optimization method and server applied to artificial intelligence
CN117349879A (en) * 2023-09-11 2024-01-05 江苏汉康东优信息技术有限公司 Text data anonymization privacy protection method based on continuous word bag model

Similar Documents

Publication Publication Date Title
CN115712703A (en) Decision analysis method and server applied to big data anonymous processing
CA3021168C (en) Anticipatory cyber defense
Ranade et al. Generating fake cyber threat intelligence using transformer-based models
US10162848B2 (en) Methods and apparatus for harmonization of data stored in multiple databases using concept-based analysis
Murtaza et al. Mining trends and patterns of software vulnerabilities
Macdonald et al. Identifying digital threats in a hacker web forum
Yeboah-Ofori et al. Cyber intelligence and OSINT: Developing mitigation techniques against cybercrime threats on social media
Ampel et al. Labeling hacker exploits for proactive cyber threat intelligence: a deep transfer learning approach
JP2023542632A (en) Protecting sensitive data in documents
Yang et al. Automated cyber threat intelligence reports classification for early warning of cyber attacks in next generation SOC
Alhajjar et al. Survival analysis for insider threat: Detecting insider threat incidents using survival analysis techniques
Remmide et al. Detection of phishing URLs using temporal convolutional network
Marin et al. Inductive and deductive reasoning to assist in cyber-attack prediction
US11783088B2 (en) Processing electronic documents
Mujtaba et al. Detection of suspicious terrorist emails using text classification: A review
Lasky et al. Machine Learning Based Approach to Recommend MITRE ATT&CK Framework for Software Requirements and Design Specifications
CN114398887A (en) Text classification method and device and electronic equipment
Lytvynov et al. Corporate networks protection against attacks using content-analysis of global information space
Ma et al. A Parse Tree-Based NoSQL Injection Attacks Detection Mechanism.
Senanayake et al. LYZGen: A mechanism to generate leads from Generation Y and Z by analysing web and social media data
Krüger An Approach to Profiler Detection of Cyber Attacks using Case-based Reasoning.
Ali et al. Unintended Memorization and Timing Attacks in Named Entity Recognition Models
Latypov et al. Multilevel model of computer attack based on attributive metagraphs
Soud et al. PrAIoritize: Learning to Prioritize Smart Contract Bugs and Vulnerabilities
Nakid Evaluation and detection of cybercriminal attack type using machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
Application publication date: 20230224