CN106649262B - Method for protecting sensitive information of enterprise hardware facilities in social media - Google Patents
- Publication number
- CN106649262B (application CN201610971014.7A)
- Authority
- CN
- China
- Prior art keywords
- hardware
- feature
- information
- model
- classification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Business, Economics & Management (AREA)
- Health & Medical Sciences (AREA)
- Economics (AREA)
- Computing Systems (AREA)
- Human Resources & Organizations (AREA)
- Marketing (AREA)
- Primary Health Care (AREA)
- Strategic Management (AREA)
- Tourism & Hospitality (AREA)
- General Business, Economics & Management (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention belongs to the technical field of privacy protection, and particularly relates to a method for protecting sensitive information of enterprise hardware facilities in social media. First, a hardware infrastructure information base is established. Next, the hardware model referred to in a piece of social media description information is determined by constructing a hardware classification model and a hardware model matching algorithm. Finally, the determined hardware model is used to selectively shield or replace keywords in the hardware description information that may reveal sensitive information. The invention processes keywords differently according to their sensitivity levels and is highly extensible.
Description
Technical Field
The invention relates to a method for protecting sensitive information of enterprise hardware facilities in social media, and belongs to the technical field of privacy protection.
Background
With the emergence of traditional social media such as microblogs and online forums, and of newer social media such as WeChat, Facebook and Twitter, society has entered the social media era. The rapid rise of social media has accelerated the flow of information and made communication between people increasingly convenient. However, the widespread use of social media also poses security risks: social media users may, intentionally or unintentionally, expose confidential and sensitive information of a business or institution, and if that information is obtained, integrated and exploited by a business or by a lawless party, personal or institutional privacy may be disclosed [1]. A mobile device user can conveniently rely on location-based services to obtain his or her location and associated service information. Although location-based services provide great convenience, they must first acquire the location information of the mobile user in order to provide the corresponding service, and a location-based service system cannot guarantee that the server will not reveal or illegally use that location information. Location-based services therefore pose a significant challenge to the protection of users' location privacy [2]. In addition, with the rise of big data technology in recent years, more and more privacy protection technologies based on big data have been proposed, but research at home and abroad on big data security and privacy protection is still insufficient, and the problem can only be solved well by combining technical means with relevant policies and regulations [3].
With the widespread use of the internet, research on privacy protection and on the protection of commercial confidentiality is increasing at home and abroad. The main research directions of privacy protection include general privacy protection technology, privacy protection technology oriented to data mining, data publishing principles based on privacy protection, and privacy protection algorithms. General privacy protection techniques aim to protect data privacy at a lower application level, generally by introducing statistical and probabilistic models. Privacy protection technology oriented to data mining mainly addresses how to protect privacy in high-level data applications according to the characteristics of different data mining operations. Data publishing principles based on privacy protection aim to provide a privacy protection method that can be used across various applications, so that privacy protection algorithms designed on this basis are general-purpose. As an emerging research hotspot, privacy protection technology has very important value in both theoretical research and practical application [4].
Traditional sensitive information protection is mainly filtering based on keyword matching. This method ignores the semantic context, has low accuracy, is vulnerable to deliberate human interference, requires a large keyword dictionary to be maintained, and has a high labor cost. Emerging sensitive information protection methods include methods based on natural language processing and artificial intelligence, but these technologies are still at the research stage and cannot yet meet the filtering accuracy required in practice.
Disclosure of Invention
The invention does not study the protection of sensitive information from a macroscopic perspective. Instead, it selects one specific aspect of privacy or business confidentiality protection, namely the protection of enterprise hardware information in social media, and provides a corresponding information protection method.
As described above, privacy information may be leaked when a social media user posts content. Similarly, when enterprise personnel post content on social media such as microblogs or forums, sensitive information such as the models and configuration of the enterprise's internal hardware may also be leaked.
To solve this technical problem, the invention takes a new angle: information protection is carried out by combining text classification with semantic replacement. The basic idea is to determine, by classification, the hardware category and model described by the information publisher, then to retrieve all attribute information of that hardware model from the established hardware information base, and to shield or replace keywords in the published hardware description information according to the keywords in the attribute information. The main innovations of the method are the construction of a hardware information base, the design of a hardware information classification model and a hardware model matching algorithm, and a replacement method for key sensitive words.
The technical scheme of the invention is described in detail below.
The invention provides a method for protecting sensitive information of enterprise hardware facilities in social media, which comprises the following specific steps:
Step one: constructing the models
(1) Constructing a hardware information base
Hardware information is acquired; multi-level attribute and attribute value information, including hardware category, manufacturer and model, is extracted and organized into an XML hierarchical structure to construct the hardware information base;
(2) Performing Chinese word segmentation on the hardware description information in the hardware information base
(3) Constructing a hardware classification model and a hardware model matching algorithm
After the hardware description information in the hardware information base has been segmented, the feature information of the major classes is extracted first; the feature information of the manufacturers is then extracted on the basis of the major-class classification, and a manufacturer classification model is constructed; finally, a hardware model matching algorithm is constructed from the major-class and manufacturer category information, and the model of the hardware is determined;
(4) Constructing a keyword shielding replacement model
For each hardware major class, the attribute keywords appearing in hardware description information are divided by sensitivity level, and a keyword shielding replacement model is constructed by handling keywords of different sensitivity levels in different ways. The sensitivity levels are 0, 1, 2, 3 and 4: keywords with sensitivity level 0 are not processed, keywords with sensitivity level 4 are directly shielded with asterisks, and keywords with sensitivity levels 1, 2 and 3 are processed through a keyword semantic tree. The keyword semantic tree is built from the keywords at different levels of the hardware information base according to their XML structural relationship. The keyword semantic tree has four layers, and the replacement strategy based on it is as follows:
a keyword with sensitivity level 1 is replaced by its parent node; a keyword with sensitivity level 2 is replaced by the parent node of its parent node; a keyword with sensitivity level 3 is replaced directly by the root node;
Step two: detection and protection
After word segmentation of the input social media content, the major class, manufacturer and model to which it belongs are determined using the hardware classification model and hardware model matching algorithm from step one. Once the model is determined, the keyword shielding replacement model constructed in step one is applied: according to its sensitivity level and the corresponding processing mode, each attribute keyword in the segmented social media content is shielded, replaced, or left unprocessed.
In the invention, the hardware classification model classifies the hardware major classes and hardware manufacturers through a feature selection algorithm and a classification algorithm.
In the invention, when hardware major classes are classified, the feature selection algorithm adopts an improved information gain method; the specific calculation formula is as follows:
where $t$ is a feature, $c$ is a category, $k$ is the number of categories, $dis(t)$ is the distribution of feature $t$ among the categories, i.e. the ratio of the number of samples in which $t$ appears to the total number of samples, $P(t)$ is the probability that the feature occurs, $P(c)$ is the probability that the class occurs, $P(c, t)$ is the probability that the feature and the class co-occur, $P(\bar{t})$ is the probability that the feature does not occur, and $P(c \mid \bar{t})$ is the probability that a sample belongs to class $c$ given that the feature does not occur.
The classification algorithm adopts an improved KNN method, wherein the distance calculation formula is as follows:
where $x$ represents an unclassified sample and $y$ a classified sample, both $n$-dimensional vectors in which each dimension is a feature value, $x = (x_1, x_2, \dots, x_n)$, $y = (y_1, y_2, \dots, y_n)$; $IG'(t_i)$ represents the improved information gain value of the $i$-th feature $t_i$; $d(x, y)$ denotes the distance between $x$ and $y$; and $x_i$ and $y_i$ denote the $i$-th feature values of the two samples.
In the invention, when the hardware manufacturer is classified, the feature selection algorithm selects features by a feature similarity method, i.e. by the similarity among classes on each feature. Let the $p$ classes be $c_1, c_2, \dots, c_p$. The similarity of these $p$ classes on feature $t_i$ is defined as the average of the similarities on $t_i$ of all pairs of classes, i.e.:
$$\mathrm{sim}_{t_i}(c_1, \dots, c_p) = \frac{2}{p(p-1)} \sum_{1 \le a < b \le p} \mathrm{sim}_{t_i}(c_a, c_b)$$
where $\mathrm{sim}_{t_i}(c_a, c_b)$ denotes the similarity of classes $c_a$ and $c_b$ on $t_i$. If $\mathrm{sim}_{t_i}(c_1, \dots, c_p) \ge \delta$ for a given threshold $\delta$, the feature $t_i$ is considered too similar across the $p$ classes and unsuitable as a classification feature; otherwise it can be used as a classification feature;
the classification algorithm adopts an improved KNN method in which the reciprocal of the similarity is used as the feature weight in the KNN calculation; the specific KNN distance calculation formula is as follows:
where $c_i$ denotes the $i$-th class, $p$ is the total number of classes, $t_i$ denotes the $i$-th feature, $n$ is the total number of features, and $x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_n)$ represent an unclassified sample and a classified sample respectively, with $n$ feature values $x_i$ and $y_i$.
In the invention, a hardware model matching algorithm adopts a method based on a hardware model set, namely, hardware models with the same attribute value are put into one set, the attribute value of the hardware to be matched on some attributes is determined, so that the model set to which the hardware belongs is determined, and then the intersection of the sets is solved, so that the model to which the hardware belongs is obtained.
In the invention, the leaf nodes at the bottom layer of the keyword semantic tree are the sub-feature words of the attribute keywords at the innermost layer of the XML structure in the hardware information base; the attribute keywords at the innermost layer of the XML structure correspond to the second layer from the bottom of the semantic tree; the attribute keywords at the second layer of the XML structure are at the third layer from the bottom; and the root node, at the fourth layer, is the name of the hardware major class.
Compared with the prior art, the invention has substantive characteristics and remarkable progress:
(1) The method detects sensitive content that may reveal enterprise hardware information when social media content is published, and provides a fine-grained content control method. This is an advance over existing coarse-grained approaches that can only control the content as a whole, and it preserves the essential needs of social media content sharing as far as possible.
(2) A classification and matching method based on three levels (major class, manufacturer and model) is designed, so that information such as the words and attributes shared within a category is fully used, the recall rate of detection is improved, and leakage of sensitive hardware information is avoided. At the same time, the search range is narrowed during matching, since matching is carried out only within the information base of the same manufacturer, which improves matching efficiency.
(3) New ideas and implementation methods are provided for the structure of the hardware information base, feature selection, classifier construction and the protection method: an XML structural form is designed, the information gain calculation method is improved, a feature selection method based on feature similarity among manufacturer categories is designed, a keyword semantic tree is constructed, and a specific protection strategy is provided.
Drawings
FIG. 1 is a general flow diagram of the present invention.
FIG. 2 is a flowchart of the hardware manufacturer classification.
FIG. 3 is a flowchart of the hardware model matching method.
FIG. 4 is a flowchart of the keyword shielding replacement method.
Fig. 5 is a diagram of a hardware information base (XML structure).
FIG. 6 is a diagram showing the correspondence between keywords of each layer of the semantic tree and keywords of each layer of XML in the embodiment.
FIG. 7 is a final sample graph of the semantic tree established in the example.
Detailed Description
The technical solution of the present invention will be described in detail below with reference to the accompanying drawings and examples.
The general process of the invention is shown in FIG. 1. It comprises the model construction process on the left of FIG. 1 and the detection and protection process on the right; the results of the three links of the model construction process provide the essential basic data for the detection and protection process.
The main work of the present invention includes:
(1) constructing a hardware information base;
(2) performing Chinese word segmentation on the hardware description information;
(3) constructing a hardware classification model and a hardware model matching algorithm;
(4) constructing a keyword shielding replacement method.
The key technologies involved in the above process are explained in detail in turn below.
1. Construction of hardware information base
In the embodiment, a web crawler program is designed for a large computer network and automatically crawls hardware information for tens of thousands of models across 36 kinds of hardware, including mobile phones, notebooks, switches and routers. The hardware information is organized into XML files, in which each XML tag represents an attribute of the hardware and the text content of the tag represents the attribute value. Using the structural description capability of XML, a tree-shaped hardware information base is constructed. The hardware information base is the basic information source for the subsequent processing flow. The constructed hardware information base (XML structure) is shown in FIG. 5.
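As an illustration of the XML organization described above, the following is a minimal sketch using Python's standard `xml.etree.ElementTree`. The tag names (`hardware_base`, `category`, `manufacturer`, `model`), the sample records and the attribute names are assumptions made for the example; the actual schema is the one shown in FIG. 5, which is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical crawled records; every name and value below is invented.
records = [
    {"category": "router", "manufacturer": "VendorA", "model": "R-100",
     "attributes": {"ports": "4", "wireless_rate": "300Mbps"}},
    {"category": "router", "manufacturer": "VendorA", "model": "R-200",
     "attributes": {"ports": "8", "wireless_rate": "1200Mbps"}},
]

def _child(parent, tag, name):
    """Return the child <tag name=...> of parent, creating it if absent."""
    for node in parent.findall(tag):
        if node.get("name") == name:
            return node
    return ET.SubElement(parent, tag, name=name)

def build_info_base(records):
    """Organize flat records into a category > manufacturer > model tree."""
    root = ET.Element("hardware_base")
    for rec in records:
        cat = _child(root, "category", rec["category"])
        ven = _child(cat, "manufacturer", rec["manufacturer"])
        mod = _child(ven, "model", rec["model"])
        for attr_name, value in rec["attributes"].items():
            # Each tag is a hardware attribute; its text is the attribute value.
            ET.SubElement(mod, attr_name).text = value
    return root

base = build_info_base(records)
print(ET.tostring(base, encoding="unicode"))
```

Nesting models under manufacturers and manufacturers under categories mirrors the three-level classification used later, so a lookup can be restricted to one manufacturer's subtree.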
2. Chinese word segmentation for hardware information
Although all kinds of hardware information are obtained in step 1, this information cannot be processed directly by a computer: Chinese word segmentation must be carried out, auxiliary words removed and keywords extracted, and the extracted keywords are then used for subsequent classification and other processing. Common word segmentation tools can be used for this step, for example ICTCLAS, the Chinese lexical analysis system based on a hierarchical hidden Markov model developed by the Institute of Computing Technology of the Chinese Academy of Sciences, which supports user dictionaries and various encoding formats and has a word segmentation accuracy as high as 97.5%.
3. Construction of the hardware classification model and hardware model matching algorithm
On the basis of word segmentation, the hardware model described by a piece of hardware description information is determined by constructing a classification model and a hardware model matching algorithm. The hardware classification model comprises two sub-classification processes, namely classification of the hardware major class and classification of the hardware manufacturer, the latter being performed on the basis of the former. These two steps determine the category and the manufacturer of the hardware, and the hardware model matching method then determines the model. The basic ideas of the three processes are described below.
(1) Classification of hardware broad classes
The classification of hardware major classes draws on the KNN classification method used in text classification: first, the feature words that contribute most to classification are selected through feature selection, and then the hardware is classified by a classification algorithm. The feature selection algorithm and the classification algorithm draw on the information gain method and the KNN method respectively, but both are adapted to the characteristics of the hardware information base, which helps improve classification accuracy.
The traditional information gain method considers only how the presence or absence of a feature word affects the global information entropy; it does not consider how frequently the feature word appears within and across classes.
The improved information gain method has the following calculation formula:
where $dis(t)$ represents the distribution of feature $t$ among the classes, i.e. the ratio of the number of samples in which $t$ appears to the total number of samples. The adjustment coefficient is chosen for two reasons. First, it is a decreasing function of $dis(t)$: when the distribution value of feature $t$ among the classes is very small, the coefficient is large, which is exactly what is required. Second, it balances the weight between the conventional information gain value $IG(t)$ and the inter-class distribution value $dis(t)$, so that the result does not depend too heavily on either one.
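The adjustment coefficient itself is not recoverable from this text, so the sketch below computes only the two ingredients named above: the conventional information gain $IG(t)$, in its standard entropy-reduction form, and the distribution value $dis(t)$. How the two are combined into the improved value is deliberately left out. The sample data, function names and the choice to represent each sample as a set of segmented words are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(samples, labels, term):
    """Conventional IG(t): entropy reduction from knowing whether t occurs."""
    with_t = [c for s, c in zip(samples, labels) if term in s]
    without_t = [c for s, c in zip(samples, labels) if term not in s]
    p_t = len(with_t) / len(labels)
    return (entropy(labels)
            - p_t * entropy(with_t)
            - (1 - p_t) * entropy(without_t))

def dis(samples, term):
    """dis(t): fraction of all samples in which term t appears."""
    return sum(term in s for s in samples) / len(samples)

# Hypothetical segmented descriptions (one set of words per sample).
samples = [{"router", "wifi"}, {"router", "port"},
           {"phone", "screen"}, {"phone", "wifi"}]
labels = ["router", "router", "phone", "phone"]

for term in ("router", "wifi", "port"):
    print(term, information_gain(samples, labels, term), dis(samples, term))
```

In this toy data "router" separates the two classes perfectly while "wifi" appears equally in both, which is the kind of contrast the feature selection step is meant to expose.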
Similarly, the invention improves the traditional KNN algorithm by taking into account that different features influence classification to different degrees: the information gain value obtained during feature selection is used as the feature weight in the KNN algorithm. The information gain value of a feature represents its influence on the information entropy; the larger the value, the greater the influence of the feature on the classification result. Using the information gain value directly as the feature weight therefore reflects the different contributions of features with different information gain values. The distance calculation formula of the improved KNN algorithm is given below.
where $x$ represents an unclassified sample and $y$ a classified sample, both $n$-dimensional vectors in which each dimension is a feature value, $x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_n)$, and $IG(t_i)$ represents the information gain value of the $i$-th feature $t_i$.
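The exact distance formula is not preserved in this text. As one plausible reading, the sketch below assumes a weighted Euclidean distance in which each squared feature difference is multiplied by that feature's weight (here standing in for the information gain value), followed by an ordinary majority vote among the $k$ nearest samples. The weighted-Euclidean form, the function names, the weights and the data are all assumptions, not the patent's confirmed formula.

```python
import math
from collections import Counter

def weighted_distance(x, y, weights):
    """Weighted Euclidean distance; weights[i] scales feature i's contribution.

    Assumed form: sqrt(sum_i w_i * (x_i - y_i)^2).
    """
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

def knn_classify(x, train, weights, k=3):
    """Label x by majority vote among its k nearest labelled samples."""
    nearest = sorted(train,
                     key=lambda item: weighted_distance(x, item[0], weights))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical feature vectors (e.g. keyword presence) with class labels.
train = [((1, 0, 0), "router"), ((1, 0, 1), "router"),
         ((0, 1, 0), "phone"), ((0, 1, 1), "phone")]
weights = [1.0, 1.0, 0.0]   # third feature carries no weight in this example

print(knn_classify((1, 0, 1), train, weights, k=3))
```

Setting a weight to zero removes a feature from the distance entirely, which shows how low-information features stop influencing the neighbour ranking.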
(2) Classification of hardware manufacturers
After the classification of the hardware major class, the classification of the hardware vendor is to determine the vendor of the hardware under that class. Likewise, feature selection and classification using a suitable classification algorithm are required in this step of classification.
The feature selection algorithm adopted by the invention is based on feature similarity: for each feature, its similarity among the different manufacturer categories is considered. If the feature similarity is greater than or equal to a certain threshold, the feature is considered too similar across manufacturers and unsuitable as a classification feature; otherwise it can be used as a classification feature. The improved KNN classification algorithm is again used in this part of the classification, except that the weight of a feature is changed to the logarithm of the reciprocal of the feature similarity, as described below.
In the hardware information base, each hardware feature may comprise several sub-features. For example, the feature value of the feature "dimension" comprises three values, namely length, width and height; length, width and height are thus three sub-features of the feature "dimension". Assume a feature $t_i$ consists of $n$ sub-features, i.e. $t_i = (t_{i1}, t_{i2}, \dots, t_{in})$. Let the feature value of one sample on $t_i$ be $u = (u_1, u_2, \dots, u_n)$ and that of another sample be $v = (v_1, v_2, \dots, v_n)$. The similarity between $u$ and $v$ is then defined as:
$$\mathrm{sim}(u, v) = \frac{\sum_{j=1}^{n} u_j v_j}{\sqrt{\sum_{j=1}^{n} u_j^2}\,\sqrt{\sum_{j=1}^{n} v_j^2}}$$
That is, the cosine of the angle between the two vectors is used to define the similarity between two feature values. Because different features may contain different numbers of sub-features, and hence vectors of different dimensions, this definition disregards the dimension of the vectors and measures similarity by the angle between them: when two vectors, i.e. two feature values, are similar, the cosine of the angle is larger, and otherwise it is smaller.
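The cosine-based similarity described above can be sketched in a few lines of Python. The function name, the example dimension values and the convention of returning 0.0 for a zero vector are assumptions added for the example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature-value vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical "dimension" values (length, width, height) of two samples.
print(cosine_similarity((150, 70, 8), (300, 140, 16)))  # same proportions
print(cosine_similarity((1, 0), (0, 1)))                # orthogonal vectors
```

Because only the angle matters, a device twice as large in every dimension is treated as fully similar, which is a property to keep in mind when choosing which attributes to compare this way.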
After defining the similarity of a single feature, the similarity of two classes on a given feature is defined next. Since each class may contain several samples, suppose two classes $c_1$ and $c_2$ contain $m_1$ and $m_2$ samples respectively. The similarity of these two classes on feature $t_i$ is then calculated as:
$$\mathrm{sim}_{t_i}(c_1, c_2) = \frac{1}{m_1 m_2} \sum_{j=1}^{m_1} \sum_{k=1}^{m_2} \mathrm{sim}(u_j, v_k)$$
where $u_j$ is the feature value on $t_i$ of the $j$-th sample of $c_1$, $v_k$ is that of the $k$-th sample of $c_2$, and $\mathrm{sim}(u_j, v_k)$ is the similarity of two feature values as defined above. As the formula shows, the similarity of two classes on $t_i$ is simply the mean of the similarities on $t_i$ of all sample pairs drawn from the two classes, so that every sample pair between the two classes is taken into account.
Based on the similarity of two classes on feature $t_i$, the similarity of $p$ classes on $t_i$ is defined below. Let the $p$ classes be $c_1, c_2, \dots, c_p$. The similarity of these $p$ classes on feature $t_i$ is defined as the average of the similarities on $t_i$ of all pairs of classes, i.e.:
$$\mathrm{sim}_{t_i}(c_1, \dots, c_p) = \frac{2}{p(p-1)} \sum_{1 \le a < b \le p} \mathrm{sim}_{t_i}(c_a, c_b)$$
If the similarity of the $p$ classes on feature $t_i$ is greater than or equal to a certain threshold $\delta$, i.e. $\mathrm{sim}_{t_i}(c_1, \dots, c_p) \ge \delta$, the feature $t_i$ is considered too similar across the $p$ classes to serve as a classification feature; otherwise it can be used as a classification feature.
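The class-level similarity and the threshold test described above can be sketched as follows. The pairwise class similarity is the mean cosine similarity over all cross-class sample pairs, and the multi-class similarity is the mean over all pairs of classes. The manufacturer names, feature names, data and the threshold value 0.9 are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def class_pair_similarity(values_a, values_b):
    """Mean similarity over all sample pairs drawn from two classes."""
    total = sum(cosine(u, v) for u in values_a for v in values_b)
    return total / (len(values_a) * len(values_b))

def multi_class_similarity(values_by_class):
    """Average of the pairwise class similarities over all pairs of classes."""
    pairs = list(combinations(values_by_class.values(), 2))
    return sum(class_pair_similarity(a, b) for a, b in pairs) / len(pairs)

def select_features(data, delta):
    """Keep features whose cross-class similarity is below the threshold."""
    return [name for name, by_class in data.items()
            if multi_class_similarity(by_class) < delta]

# Hypothetical feature values per manufacturer, for two features.
data = {
    "size":  {"VendorA": [(150, 70, 8), (152, 71, 8)], "VendorB": [(151, 70, 8)]},
    "ports": {"VendorA": [(4, 0)], "VendorB": [(0, 8)]},
}
print(select_features(data, delta=0.9))
```

Here "size" is nearly identical across the two manufacturers and is rejected, while "ports" differs completely and is kept as a classification feature.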
Classification in this step still uses the improved KNN algorithm, except that the feature weight is no longer the information gain value but the reciprocal of the feature similarity. The reason is as follows: the feature similarity represents how similar a feature is across different classes; a feature with higher similarity contributes little to classification and should be given a smaller weight, whereas a feature with lower similarity contributes more and should be given a larger weight. It is therefore reasonable to use the reciprocal of the similarity as the feature weight in the KNN calculation. The specific KNN distance calculation formula is as follows:
The classification process for the hardware manufacturer is as follows; FIG. 2 shows the corresponding flowchart.
1) select samples of different manufacturers within a given major class from the hardware information base;
2) for each feature, calculate its feature similarity among the different manufacturers;
3) if the feature similarity is smaller than a certain threshold, take the feature as a classification feature; otherwise return to 2) and select the next feature to continue calculating the feature similarity;
4) classify using the selected features and the improved KNN algorithm to obtain the corresponding manufacturer category.
(3) Matching of hardware models
After the category of the hardware and the manufacturer under the category are determined, the invention determines the model of the hardware under the manufacturer by constructing a hardware model matching algorithm. The hardware model matching algorithm adopted by the invention is a method based on a hardware model set, namely, hardware models with the same attribute value are put into one set, when the model of certain hardware needs to be determined, only the attribute value of the hardware on certain attribute needs to be determined, so that the model set to which the hardware belongs can be determined, and then the model to which the hardware belongs can be obtained by solving the intersection of the sets. Compared with the method for successively comparing the hardware models, the hardware model matching method has great advantages in efficiency, and can greatly reduce the comparison times.
When matching the hardware model, products are not compared one by one; instead, a new algorithm is established to make the comparison more efficient. Specifically, suppose that the products of a category have $n$ attributes $(t_1, t_2, \dots, t_n)$ and that each attribute $t_i$ comprises $a_i$ sub-features, i.e. $t_i = (t_{i1}, t_{i2}, \dots, t_{ia_i})$. The products of the manufacturer that are identical on attribute $t_i$ are grouped into one set. Since a product model may be identical to other products on more than one attribute, it may appear in several sets, i.e. the sets may intersect.
If $p$ attributes, each with its attribute value, appear in the description information of the hardware, the algorithm for hardware model matching is described as follows:
1) for each attribute $t_i$, place the hardware models with the same attribute value in the same set;
2) let $i = 1$ and $C = \Omega$, where $\Omega$ denotes the universal set;
5) if $C$ contains only one element or $i > p$, proceed to 6); otherwise set $i = i + 1$ and return to 3);
6) return the set $C$, which is the final hardware model matching result.
FIG. 3 shows the flowchart of the hardware model matching method; the main steps are as follows.
1) for each attribute, construct the sets of hardware models that share the same attribute value;
2) take one attribute and examine the attribute value of the hardware on that attribute to obtain the hardware model set corresponding to that value;
3) intersect this hardware model set with the set obtained so far; if the intersection contains only one element or all attributes have been examined, stop, and the elements of the intersection are the model of the hardware; otherwise return to 2).
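The set-based matching steps above can be sketched as follows: an index maps each (attribute, value) pair to the set of models sharing that value, and the candidate set is narrowed by successive intersections, stopping early once at most one model remains. The model names, attribute names and function names are hypothetical, and the early-exit condition is an assumption based on the stopping rule in step 3).

```python
from collections import defaultdict

def build_index(models):
    """Map (attribute, value) to the set of model names having that value."""
    index = defaultdict(set)
    for name, attrs in models.items():
        for attr, value in attrs.items():
            index[(attr, value)].add(name)
    return index

def match_model(index, all_models, described):
    """Intersect per-attribute model sets until one candidate remains."""
    candidates = set(all_models)              # start from the universal set
    for attr, value in described.items():
        candidates &= index.get((attr, value), set())
        if len(candidates) <= 1:              # unique match or no match left
            break
    return candidates

# Hypothetical models of one manufacturer within one major class.
models = {
    "R-100": {"ports": "4", "rate": "300Mbps"},
    "R-200": {"ports": "8", "rate": "300Mbps"},
    "R-300": {"ports": "8", "rate": "1200Mbps"},
}
index = build_index(models)
print(match_model(index, models, {"rate": "300Mbps", "ports": "8"}))
```

Each described attribute costs one set intersection rather than a comparison against every model, which is where the efficiency gain over one-by-one comparison comes from.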
4. Constructing the keyword shielding replacement model
The invention designs a keyword shielding replacement model to shield or replace keywords in the hardware description information that may reveal sensitive hardware information. Keywords are assigned different sensitivity levels, and keywords at different levels are processed in different ways.
(1) Keyword sensitivity level classification
For each hardware major class, five sensitivity levels are established in advance for all attribute-value keywords. They are denoted by the numbers 0, 1, 2, 3 and 4 in order of increasing sensitivity, as shown in Table 1.
TABLE 1 Sensitivity levels
Sensitivity level | 0 | 1 | 2 | 3 | 4 |
Meaning | Not sensitive | Slightly sensitive | Moderately sensitive | Relatively sensitive | Very sensitive |
Processing | None | Replacement | Replacement | Replacement | Shielding |
Keywords at different sensitivity levels are processed differently: keywords at level 0 are left unchanged, keywords at level 4 are shielded directly with asterisks, and keywords at levels 1, 2 and 3 are replaced using a semantic tree.
(2) Construction of keyword semantic trees
Keywords with sensitivity levels 1, 2 and 3 are replaced using a semantic tree. The leaf nodes of the semantic tree are the keywords with the most specific semantics; the semantics become progressively vaguer toward the upper layers, and the root node has the vaguest semantics. For hardware description information the semantic tree has four layers, and the replacement strategy based on it is as follows:
A keyword with sensitivity level 1 is replaced by its parent node; a keyword with sensitivity level 2 is replaced by the parent of its parent node; a keyword with sensitivity level 3 is replaced directly by the root node.
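The five processing modes can be illustrated with a small Python sketch. The tree below is hypothetical: only the leaf "Mobile 4G" comes from the verification data, while the intermediate node names, the `PARENT` mapping and the `process_keyword` function are assumptions made for illustration. Masking with one asterisk per character is likewise an assumption, since the text only says that asterisks are used.

```python
# Child -> parent links of a hypothetical four-layer semantic tree.
PARENT = {
    "Mobile 4G": "4G network",   # leaf -> innermost attribute keyword
    "4G network": "network",     # -> second-layer attribute keyword
    "network": "mobile phone",   # -> root (hardware class name)
}


def process_keyword(word, level, parent=PARENT):
    """Apply the processing mode for sensitivity levels 0 to 4."""
    if level == 0:
        return word                      # not sensitive: leave unchanged
    if level == 4:
        return "*" * len(word)           # very sensitive: shield
    node = word
    if level == 3:
        while node in parent:            # climb all the way to the root
            node = parent[node]
        return node
    for _ in range(level):               # level 1: parent, level 2: grandparent
        node = parent.get(node, node)    # stay at the root if reached early
    return node


for lvl in range(5):
    print(lvl, process_keyword("Mobile 4G", lvl))
```

For a leaf of a four-layer tree, climbing three steps and climbing to the root give the same node, which is why level 3 can be described simply as replacement by the root node.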
The XML document for each hardware model in the hardware information base has a hierarchical structure, and the attribute keywords of an upper layer are semantically vaguer than those of the layer below, so the keyword semantic tree can be built from the XML document.
The semantic tree is built as follows. The leaf nodes at the bottom layer are the sub-feature words of the innermost attribute keywords. The second layer from the bottom corresponds to the innermost attribute keywords of the XML structure in the hardware information base, which are semantically vaguer than their sub-feature words. The third layer from the bottom consists of the second-layer attribute keywords of the XML structure. The first layer of the XML document is the specific hardware model, which is very sensitive information, so the fourth layer from the bottom does not correspond to the first layer of the XML document; instead, the name of the hardware class, which is semantically vaguer than the third layer from the bottom, is used as its keyword. Because this layer has been raised to the hardware class name, it is also the top layer of the whole semantic tree, i.e., the root node. Fig. 6 shows the correspondence between the layers of the semantic tree and the layers of the XML document, and Fig. 7 shows an example of the resulting semantic tree, in which "second-layer attribute keywords" and "third-layer attribute keywords" refer to the second-layer and third-layer attribute keywords of the XML document.
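As a rough sketch of this construction, the following Python code derives child-to-parent links from one XML document using the standard `xml.etree.ElementTree` module. The XML layout shown is an assumption: the tag names, the `name` attribute and the idea that sub-feature words are stored as element text are all invented for illustration, as is the function name `build_parent_map`.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML entry; the real schema of the information base is not given.
XML_TEXT = """
<model name="X1">
  <network>
    <network_type>Mobile 4G</network_type>
    <network_type>Unicom 4G</network_type>
  </network>
  <appearance>
    <form_factor>bar</form_factor>
  </appearance>
</model>
"""


def build_parent_map(xml_text, class_name):
    """Return child -> parent links of the four-layer semantic tree."""
    model = ET.fromstring(xml_text)  # first XML layer: the model, never a node
    parent = {}
    for second in model:             # second-layer attribute keywords
        parent[second.tag] = class_name      # their parent is the root
        for third in second:         # third-layer (innermost) keywords
            parent[third.tag] = second.tag
            value = (third.text or "").strip()
            if value:                # sub-feature word becomes a leaf
                parent[value] = third.tag
    return parent


tree = build_parent_map(XML_TEXT, "mobile phone")
for child, par in tree.items():
    print(child, "->", par)
```

The specific model name on the first XML layer is deliberately left out of the mapping, so no replacement can ever produce it; the hardware class name passed in as `class_name` takes its place as the root.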
Application example
Because relatively little information about enterprise IT hardware facilities is available on Internet social media, such data are difficult to collect. In the verification example, 5000 partial hardware descriptions are extracted from the hardware information base and arranged into text documents, one document per description. After processing, the word-segmented keyword samples (with some keywords randomly deleted) are consistent with content obtained from social media, so the processed data can approximately simulate hardware description samples from social media.
From each major class, 60 samples are arbitrarily selected as training samples, giving 2160 training samples in total; the remaining 40 samples of each class are used as test samples to be classified, giving 1440 test samples in total. The relation between classification performance and the value of k is shown in Table 2.
TABLE 2 Correct classification ratio and mean $F_1$ for hardware major classes at different k values
k | 1 | 5 | 10 | 15 | 20 | 25 | 30 |
Correct classification ratio | 80.1% | 72.8% | 69.3% | 67.3% | 65.7% | 63.8% | 60% |
Mean $F_1$ | 0.805 | 0.734 | 0.706 | 0.689 | 0.676 | 0.663 | 0.639 |
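For readers who want to reproduce metrics of this kind, the sketch below computes a correct classification ratio and a mean $F_1$ from true and predicted labels. It assumes that the mean is the unweighted average of the per-class $F_1$ scores, which the text does not state explicitly; the function name and the toy labels are hypothetical.

```python
def accuracy_and_mean_f1(y_true, y_pred):
    """Return (correct classification ratio, unweighted mean of per-class F1)."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return correct / len(pairs), sum(f1_scores) / len(f1_scores)


# Toy labels, not data from the experiment.
acc, mean_f1 = accuracy_and_mean_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
print(acc, mean_f1)
```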
For hardware manufacturer classification, the major class "mobile phone" is taken as an example, and eight mobile phone manufacturers are selected: Samsung, Apple, Huawei, OPPO, vivo, Meizu, Lenovo and Coolpad. The proportion of correctly classified samples and the mean $F_1$ are tested at different k values; the results are shown in Table 3.
TABLE 3 Proportion of correctly classified manufacturer samples and mean $F_1$ at different k values
k | 1 | 5 | 10 | 15 | 20 | 25 | 30 | 35 |
Correct classification ratio | 42.4% | 36.0% | 34.7% | 35.6% | 31.8% | 35.6% | 33.5% | 31.4% |
Mean $F_1$ | 0.422 | 0.350 | 0.339 | 0.328 | 0.295 | 0.319 | 0.299 | 0.281 |
Then 200 texts are randomly selected from the mobile phone category, and each sub-feature value is processed according to the sensitivity level of the corresponding sub-feature word. The resulting statistics are shown in Table 4.
TABLE 4 Partial keyword shielding replacement performance data
Sub-feature word | Full network communication | Mobile 4G | Unicom 4G | Telecommunications 4G | Transverse direction |
Number of sub-feature words | 20 | 89 | 76 | 41 | 138 |
Number correctly processed | 20 | 89 | 76 | 41 | 138 |
Accuracy | 100% | 100% | 100% | 100% | 100% |
Claims (6)
1. A method for protecting sensitive information of enterprise hardware facilities in social media is characterized by comprising the following specific steps:
step one, constructing a model
(1) Construction of hardware information base
Acquiring hardware information, extracting a plurality of levels, attributes and attribute value information including hardware categories, manufacturers and models, organizing the information into an XML hierarchical structure, and constructing a hardware information base;
(2) performing Chinese word segmentation on hardware description information in a hardware information base;
(3) Constructing the hardware classification model and the hardware model matching algorithm
After the hardware description information in the hardware information base has been segmented, the feature information of the major classes is extracted first; then, on the basis of the major-class classification, the feature information of the manufacturers is extracted and a manufacturer classification model is constructed; finally, a hardware model matching algorithm is constructed from the major-class and manufacturer category information to determine the hardware model;
(4) constructing keyword shielding replacement model
For each hardware major category, performing sensitivity level division on attribute keywords appearing in hardware description information, and constructing a keyword shielding replacement model by adopting different processing modes on the keywords with different sensitivity levels; wherein, the sensitivity levels are divided into 0, 1, 2, 3 and 4; the keywords with the sensitivity level of 0 are not processed, the keywords with the sensitivity level of 4 are directly shielded, and the keywords with the sensitivity levels of 1, 2 and 3 are processed through a keyword semantic tree; the keyword semantic tree is constructed by keywords on different levels in a hardware information base according to an XML structural relationship; the keyword semantic tree has four layers, and the replacement strategy based on the keyword semantic tree is as follows:
a keyword with sensitivity level 1 is replaced by its parent node; a keyword with sensitivity level 2 is replaced by the parent of its parent node; a keyword with sensitivity level 3 is replaced directly by the root node;
step two, detection and protection
After word segmentation is performed on the input social media content, the major class, the manufacturer and the model to which the hardware belongs are determined according to the hardware classification model and the hardware model matching algorithm of step one; after the model is determined, the keyword shielding replacement model constructed in step one is used to perform the corresponding action (shielding, replacing or no processing) on each attribute keyword in the segmented social media content according to its sensitivity level and processing mode.
2. The sensitive information protection method according to claim 1, wherein the hardware classification model classifies the hardware major classes and the hardware manufacturers according to a feature selection algorithm and a classification algorithm.
3. The sensitive information protection method according to claim 2, wherein, when classifying the hardware major classes, the feature selection algorithm adopts an improved information gain method, and a specific calculation formula is as follows:
wherein $t$ is a feature, $c$ denotes a category, $k$ denotes the number of categories, $dis(t)$ denotes the distribution of feature $t$ among the categories, namely the ratio of the number of samples in which feature $t$ appears to the total number of samples, $P(t)$ denotes the probability that the feature appears, $P(c)$ denotes the probability that the category appears, $P(c, t)$ denotes the probability that the feature and the category appear together, $P(\bar{t})$ denotes the probability that the feature does not appear, and $P(c \mid \bar{t})$ denotes the probability that a sample belongs to category $c$ given that the feature does not appear;
the classification algorithm adopts an improved KNN method, wherein the distance calculation formula is as follows:
where $x = (x_1, x_2, \ldots, x_n)$ denotes an unclassified sample and $y = (y_1, y_2, \ldots, y_n)$ denotes a classified sample, both $n$-dimensional vectors in which each dimension is a feature value, $IG'(t_i)$ denotes the improved information gain of the $i$-th feature $t_i$, $d(x, y)$ denotes the distance between $x$ and $y$, and $x_i$, $y_i$ denote the $i$-th feature values of the samples.
4. The sensitive information protection method according to claim 2, wherein, when classifying the hardware manufacturers, the feature selection algorithm performs feature selection using a feature similarity method: features are selected using the similarity among the classes on each feature; let the $p$ classes be $c_1, c_2, \ldots, c_p$; the similarity of the $p$ classes on feature $t_i$ is defined as the average of the sum of the similarities of every two classes on $t_i$, i.e.:
if this similarity is greater than $\delta$, where $\delta$ is a threshold, the similarity of the $p$ classes on feature $t_i$ is considered too large and $t_i$ is not suitable as a classification feature; otherwise $t_i$ can be used as a classification feature;
the classification algorithm adopts an improved KNN method, selects the reciprocal of the similarity as the weight of the features to participate in the calculation of the KNN algorithm, and adopts a specific KNN distance calculation formula as follows:
wherein $c_i$ denotes the $i$-th class, $p$ is the total number of classes, $t_i$ denotes the $i$-th feature, $n$ is the total number of features, and $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$ denote an unclassified sample and a classified sample, respectively, each with $n$ feature values $x_i$, $y_i$.
5. The sensitive information protection method according to claim 1, wherein the hardware model matching algorithm adopts a hardware model set method, that is, hardware models with the same attribute value are put into one set, the attribute value of the hardware to be matched on some attributes is determined, so as to determine the model set to which the hardware belongs, and then the intersection of the sets is obtained, so as to obtain the model to which the hardware belongs.
6. The sensitive information protection method according to claim 1, wherein the leaf nodes at the lowest layer of the keyword semantic tree are the sub-feature words of the innermost attribute keywords of the XML structure in the hardware information base, the second layer from the bottom corresponds to the innermost attribute keywords of the XML structure in the hardware information base, the third layer from the bottom consists of the second-layer attribute keywords of the XML structure, and the fourth layer is the root node, which is the name of the hardware major class.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610971014.7A CN106649262B (en) | 2016-10-31 | 2016-10-31 | Method for protecting sensitive information of enterprise hardware facilities in social media |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106649262A (en) | 2017-05-10 |
CN106649262B (en) | 2020-07-07 |
Family
ID=58821041
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610971014.7A Active CN106649262B (en) | 2016-10-31 | 2016-10-31 | Method for protecting sensitive information of enterprise hardware facilities in social media |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106649262B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108390865B (en) * | 2018-01-30 | 2021-03-02 | 南京航空航天大学 | Fine-grained access control method based on privacy drive |
CN111209735B (en) * | 2020-01-03 | 2023-06-02 | 广州杰赛科技股份有限公司 | Document sensitivity calculation method and device |
CN112100646A (en) * | 2020-04-09 | 2020-12-18 | 南京邮电大学 | Spatial data privacy protection matching method based on two-stage grid conversion |
CN112000867A (en) * | 2020-08-17 | 2020-11-27 | 桂林电子科技大学 | Text classification method based on social media platform |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101827102A (en) * | 2010-04-20 | 2010-09-08 | 中国人民解放军理工大学指挥自动化学院 | Data prevention method based on content filtering |
CN104239436A (en) * | 2014-08-27 | 2014-12-24 | 南京邮电大学 | Network hot event detection method based on text classification and clustering analysis |
CN104866465A (en) * | 2014-02-25 | 2015-08-26 | 腾讯科技(深圳)有限公司 | Sensitive text detection method and device |
US9245012B2 (en) * | 2008-03-28 | 2016-01-26 | International Business Machines Corporation | Information classification system, information processing apparatus, information classification method and program |
CN105426361A (en) * | 2015-12-02 | 2016-03-23 | 上海智臻智能网络科技股份有限公司 | Keyword extraction method and device |
CN105955978A (en) * | 2016-04-15 | 2016-09-21 | 宝利九章(北京)数据技术有限公司 | Method and system for data leakage protection |
Also Published As
Publication number | Publication date |
---|---|
CN106649262A (en) | 2017-05-10 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||