CN106649262B - Method for protecting sensitive information of enterprise hardware facilities in social media - Google Patents

Info

Publication number
CN106649262B
Authority
CN
China
Prior art keywords
hardware
feature
information
model
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610971014.7A
Other languages
Chinese (zh)
Other versions
CN106649262A (en)
Inventor
曾剑平
崔战伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University
Priority to CN201610971014.7A
Publication of CN106649262A
Application granted
Publication of CN106649262B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Computing Systems (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of privacy protection, and particularly relates to a method for protecting sensitive information of enterprise hardware facilities in social media. First, a hardware infrastructure information base is established; then a hardware classification model and a hardware model matching algorithm are constructed to determine the hardware model to which a social media description refers; finally, based on the determined hardware model, keywords in the hardware description that may reveal sensitive information are selectively shielded or replaced. The invention can process keywords differently according to their sensitivity levels, and is highly extensible.

Description

Method for protecting sensitive information of enterprise hardware facilities in social media
Technical Field
The invention relates to a method for protecting sensitive information of enterprise hardware facilities in social media, and belongs to the technical field of privacy protection.
Background
With the emergence of traditional social media such as microblogs and online forums, and of newer social media such as WeChat, Facebook and Twitter, people have entered the social media era. The rapid rise of social media has accelerated the flow of information and made communication between people increasingly convenient. However, the widespread use of social media also brings security risks: social media users may, intentionally or unintentionally, expose confidential and sensitive information of a business or institution, and if that information is collected, integrated and exploited by a competitor or by a lawbreaker, personal or institutional privacy may be disclosed [1]. A mobile device user can conveniently rely on location-based services to obtain his or her location and related service information. Although location-based services are very convenient, they must first acquire the mobile user's location in order to provide the corresponding service, and a location-based service system cannot guarantee that the server will not leak or misuse that location information. Location-based services therefore pose a significant challenge to the protection of users' location privacy [2]. In addition, with the rise of big data technology in recent years, more and more privacy protection techniques based on big data have been proposed; however, research on big data security and privacy protection, both in China and abroad, is still insufficient, and the problem can be solved well only by combining technical means with relevant policies and regulations [3].
With the widespread use of the internet, research on privacy protection and trade secret protection is increasing both in China and abroad. The main research directions of privacy protection include general privacy protection techniques, privacy protection techniques oriented to data mining, data publishing principles based on privacy protection, and privacy protection algorithms. General privacy protection techniques aim to protect data privacy at the lower application level, generally by introducing statistical and probabilistic models; privacy protection techniques oriented to data mining mainly address how to protect privacy in high-level data applications according to the characteristics of different data mining operations; data publishing principles based on privacy protection aim to provide privacy protection methods that can be used across various applications, so that privacy protection algorithms designed on this basis are also general. As an emerging research hotspot, privacy protection technology is of great value in both theoretical research and practical application [4].
The traditional sensitive information protection method is mainly filtering based on keyword matching. This method ignores the semantic context, has low accuracy, is easily defeated by deliberate human interference, and requires maintaining large keyword dictionaries at high labor cost. Emerging sensitive information protection methods include approaches based on natural language processing and artificial intelligence, but these techniques are still at the research stage and cannot yet meet the filtering accuracy required in practice.
Disclosure of Invention
Rather than studying sensitive information protection from a macroscopic perspective, the invention selects one specific aspect of privacy or trade secret protection, namely the protection of enterprise hardware information in social media, and provides a corresponding information protection method.
As described above, private information may be leaked when a social media user posts; similarly, when enterprise staff post on social media such as microblogs or forums, sensitive information such as the models and configuration of the enterprise's hardware may also be leaked.
To solve this technical problem, the invention takes a new angle: it protects information by combining text classification with semantic replacement. The basic idea is to determine, by classification, the category and model of the hardware described by the publisher, then to retrieve all attribute information of that model from the established hardware information base, and finally to shield or replace keywords in the published hardware description according to the keywords in the attribute information. The main innovations of the method are the construction of a hardware information base, the design of a hardware classification model and a hardware model matching algorithm, and a method for replacing key sensitive words.
The technical scheme of the invention is described in detail below.
The invention provides a method for protecting sensitive information of enterprise hardware facilities in social media, which comprises the following specific steps:
Step one, model construction
(1) Construction of the hardware information base
Hardware information is acquired; multiple levels of information, attributes and attribute values, including hardware category, manufacturer and model, are extracted and organized into an XML hierarchical structure to construct the hardware information base;
(2) Chinese word segmentation of the hardware description information in the hardware information base
(3) Construction of the hardware classification model and the hardware model matching algorithm
After the hardware description information in the hardware information base has been segmented, feature information of the major classes is extracted first; feature information of the manufacturers is then extracted on the basis of the major class classification, and a manufacturer classification model is constructed; finally, a hardware model matching algorithm is constructed from the major class and manufacturer category information to determine the model of the hardware;
(4) Construction of the keyword shielding replacement model
For each hardware major class, the attribute keywords appearing in the hardware description information are divided into sensitivity levels, and a keyword shielding replacement model is constructed by processing keywords of different sensitivity levels in different ways. The sensitivity levels are 0, 1, 2, 3 and 4: keywords with sensitivity level 0 are not processed, keywords with sensitivity level 4 are directly shielded with asterisks, and keywords with sensitivity levels 1, 2 and 3 are processed through a keyword semantic tree. The keyword semantic tree is constructed from keywords on different levels of the hardware information base according to the XML structural relationship. The keyword semantic tree has four layers, and the replacement strategy based on it is as follows:
a keyword with sensitivity level 1 is replaced by its parent node; a keyword with sensitivity level 2 is replaced by the parent of its parent node; a keyword with sensitivity level 3 is replaced directly by the root node;
Step two, detection and protection
After word segmentation of the input social media content, the major class, manufacturer and model to which it belongs are determined according to the hardware classification model and hardware model matching algorithm of step one. After the model is determined, the keyword shielding replacement model constructed in step one is applied: according to the corresponding sensitivity levels and processing modes, the corresponding action (shielding, replacement, or no processing) is performed on each attribute keyword in the segmented social media content.
In the invention, the hardware classification model classifies the hardware major classes and hardware manufacturers through a feature selection algorithm and a classification algorithm.
In the invention, when the hardware major classes are classified, the feature selection algorithm adopts an improved information gain method; the specific calculation formula is as follows:
Figure BDA0001145461840000031
where t is the feature, c is the category, k is the number of categories, dis(t) is the distribution of feature t among the categories, defined as the ratio of the number of samples in which feature t appears to the total number of samples, P(t) is the probability that the feature occurs, P(c) is the probability that the category occurs, P(c, t) is the probability that the feature and the category co-occur, $P(\bar{t})$ is the probability that the feature does not occur, and $P(c \mid \bar{t})$ is the probability that a sample belongs to category c given that the feature does not occur.
The classification algorithm adopts an improved KNN method, wherein the distance calculation formula is as follows:
Figure BDA0001145461840000032
where x denotes an unclassified sample and y a classified sample, both n-dimensional vectors in which each dimension is a feature value; $IG'(t_i)$ denotes the information gain value of the i-th feature $t_i$; $x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_n)$; d(x, y) denotes the distance between x and y; and $x_i$ and $y_i$ denote the i-th feature values of the samples.
In the invention, when the hardware manufacturer is classified, the feature selection algorithm selects features by feature similarity, i.e. by the similarity between classes on each feature. Let the p classes be $c_1, c_2, \dots, c_p$; the similarity of the p classes on feature $t_i$ is defined as the average of the similarities on $t_i$ of every pair of classes, i.e.:
Figure BDA0001145461840000041
If
Figure BDA0001145461840000042
then feature $t_i$ is considered too similar across the p classes and unsuitable as a classification feature; otherwise it can be used as a classification feature;
the classification algorithm adopts an improved KNN method, selects the reciprocal of the similarity as the weight of the features to participate in the calculation of the KNN algorithm, and adopts a specific KNN distance calculation formula as follows:
Figure BDA0001145461840000043
where $c_i$ denotes the i-th class, p is the total number of classes, $t_i$ denotes the i-th feature, n is the total number of features, and $x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_n)$ denote the unclassified sample and the classified sample respectively, with n feature values $x_i$ and $y_i$.
In the invention, the hardware model matching algorithm is based on sets of hardware models: hardware models that share the same attribute value are put into one set; the attribute values of the hardware to be matched are determined for some attributes, which determines the model sets to which the hardware belongs; the intersection of these sets is then computed to obtain the model of the hardware.
In the invention, the leaf nodes at the bottom layer of the keyword semantic tree are the sub-feature words of the innermost attribute keywords of the XML structure in the hardware information base; the innermost attribute keywords of the XML structure correspond to the second layer from the bottom of the semantic tree; the second-layer attribute keywords of the XML structure are at the third layer from the bottom of the semantic tree; and the root node, at the fourth layer, is the name of the hardware major class.
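The sensitivity-level policy and the four-layer semantic tree can be illustrated with a minimal Python sketch. The tree contents, the names `PARENT`, `ROOT`, `protect_keyword` and `protect_tokens`, and the sample keywords are all hypothetical; the patent's actual tree is the one shown in its drawings.

```python
# Hypothetical four-layer keyword semantic tree, stored as child -> parent.
PARENT = {
    "5.5 inch": "screen size",   # leaf (sub-feature word) -> innermost XML attribute
    "screen size": "display",    # innermost attribute -> second-layer attribute
    "display": "mobile phone",   # second-layer attribute -> root (major class name)
}
ROOT = "mobile phone"


def ancestor(word, steps):
    """Walk up the tree the given number of steps, never going above the root."""
    node = word
    for _ in range(steps):
        node = PARENT.get(node, ROOT)
    return node


def protect_keyword(word, level):
    """Apply the action that corresponds to the keyword's sensitivity level."""
    if level == 0:
        return word                  # not processed
    if level == 1:
        return ancestor(word, 1)     # replaced by its parent node
    if level == 2:
        return ancestor(word, 2)     # replaced by the parent of its parent
    if level == 3:
        return ROOT                  # replaced by the root node
    if level == 4:
        return "*" * len(word)       # shielded with asterisks
    raise ValueError(f"unknown sensitivity level: {level}")


def protect_tokens(tokens, levels):
    """Process segmented content; tokens without a listed level are left as-is."""
    return [protect_keyword(tok, levels.get(tok, 0)) for tok in tokens]
```

Storing only child-to-parent links keeps the replacement a simple upward walk, which is all the three replacement levels require.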
Compared with the prior art, the invention has substantive features and represents notable progress:
(1) The method can detect sensitive content that may reveal enterprise hardware information when social media content is published, and provides fine-grained content control. Compared with existing coarse-grained approaches, which can only control the content as a whole, this is an advance, and it preserves the essential need for content sharing on social media as far as possible.
(2) A classification and matching method based on three levels (major class, manufacturer and model) is designed, so that information such as words and attributes of the same category can be fully used, improving the recall of detection and avoiding leakage of sensitive hardware information. In addition, the search range is narrowed during matching, since matching is performed only within the information base of the same manufacturer, which improves matching efficiency.
(3) New ideas and implementation methods are provided for the structure of the hardware information base, feature selection, classifier construction and the protection method: an XML structure is designed, the information gain calculation is improved, a feature selection method based on the feature similarity of manufacturer categories is designed, a keyword semantic tree is constructed, and a specific protection strategy is provided.
Drawings
FIG. 1 is a general flow diagram of the present invention.
FIG. 2 is a schematic diagram of a hardware vendor's classification flow.
Fig. 3 is a flow chart diagram of a hardware model matching method.
FIG. 4 is a flow chart diagram of a keyword mask replacement method.
Fig. 5 is a diagram of a hardware information base (XML structure).
FIG. 6 is a diagram showing the correspondence between keywords of each layer of the semantic tree and keywords of each layer of XML in the embodiment.
FIG. 7 is a final sample graph of the semantic tree established in the example.
Detailed Description
The technical solution of the present invention will be described in detail below with reference to the accompanying drawings and examples.
The general process of the invention is shown in Fig. 1, which comprises the model construction process on the left and the detection and protection process on the right; the results of the three stages of the model construction process provide the essential basic data for the detection and protection process.
The main work of the present invention includes:
(1) constructing a hardware information base;
(2) performing Chinese word segmentation on the hardware description information;
(3) constructing a hardware classification model and a hardware model matching algorithm;
(4) constructing a keyword shielding replacement method.
The key technologies involved in the above process are explained in detail in turn below.
1. Construction of hardware information base
In the embodiment, a web crawler is designed for a large computer network, and hardware information for 36 kinds of hardware, covering tens of thousands of models and including mobile phones, notebooks, switches, routers and the like, is crawled automatically. The hardware information is organized as XML files, in which each XML tag represents a hardware attribute and the text content of the tag represents the attribute value. A tree-structured hardware information base is built using the structural description capability of XML. The hardware information base is the basic information source required by the subsequent processing flow. The constructed hardware information base (XML structure) is shown in Fig. 5.
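As an illustration of this organization, the following Python sketch builds a small category, vendor, model hierarchy with the standard `xml.etree.ElementTree` module. The tag names (`hardware_base`, `category`, `vendor`, `model`, `screen_size`, `cpu`) and the sample records are assumptions made for the example; the actual layout is the one in Fig. 5.

```python
import xml.etree.ElementTree as ET


def get_or_create(parent, tag, name):
    """Return the child <tag name="..."> of parent, creating it if missing."""
    for child in parent.findall(tag):
        if child.get("name") == name:
            return child
    return ET.SubElement(parent, tag, name=name)


def build_hardware_base(records):
    """Organize crawled records as category -> vendor -> model -> attributes."""
    root = ET.Element("hardware_base")  # hypothetical root tag
    for rec in records:
        category = get_or_create(root, "category", rec["category"])
        vendor = get_or_create(category, "vendor", rec["vendor"])
        model = get_or_create(vendor, "model", rec["model"])
        for attr, value in rec["attributes"].items():
            # Each tag is a hardware attribute; its text is the attribute value.
            ET.SubElement(model, attr).text = value
    return root


def lookup_attributes(root, category, vendor, model):
    """Return {attribute: value} for one model, or {} if it is unknown."""
    for cat in root.findall("category"):
        for ven in cat.findall("vendor"):
            for mod in ven.findall("model"):
                key = (cat.get("name"), ven.get("name"), mod.get("name"))
                if key == (category, vendor, model):
                    return {child.tag: child.text for child in mod}
    return {}


# Illustrative records only; a real base would be filled by the crawler.
records = [
    {"category": "mobile phone", "vendor": "VendorA", "model": "A1",
     "attributes": {"screen_size": "5.5 inch", "cpu": "octa-core"}},
    {"category": "mobile phone", "vendor": "VendorA", "model": "A2",
     "attributes": {"screen_size": "6.1 inch", "cpu": "octa-core"}},
]
base = build_hardware_base(records)
```

Calling `lookup_attributes(base, "mobile phone", "VendorA", "A2")` retrieves all attribute values of one model, which is the lookup the later shielding step relies on.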
2. Chinese word segmentation for hardware information
Although all kinds of hardware information are obtained in step 1, this information cannot be processed by computer directly: Chinese word segmentation must be performed, auxiliary words removed and keywords extracted, and the extracted keywords are then used for subsequent classification and other processing. Common word segmentation tools can be used for this step, for example ICTCLAS, a Chinese lexical analysis system based on a hierarchical hidden Markov model developed by the Institute of Computing Technology of the Chinese Academy of Sciences, which supports user dictionaries and various encoding formats and has a word segmentation accuracy of up to 97.5%.
3. Method for constructing hardware classification model and hardware model matching algorithm
On the basis of word segmentation, the hardware model described by the hardware description information is determined by constructing a classification model and a hardware model matching algorithm. The hardware classification model comprises two sub-classification processes, namely classification of the hardware major class and classification of the hardware manufacturer, where the manufacturer classification is performed on the basis of the major class classification. These two steps determine the category and the manufacturer of the hardware, and the model of the hardware is finally determined by the hardware model matching method. The basic ideas of the three processes are described below.
(1) Classification of hardware broad classes
The classification of hardware major classes follows the KNN classification method used in text classification: feature words that contribute most to classification are first selected by feature selection, and the hardware is then classified by a classification algorithm. The feature selection algorithm and the classification algorithm draw on the information gain method and the KNN method respectively, but both are improved to suit the characteristics of the hardware information base, which helps improve classification accuracy.
The traditional information gain method considers only the effect of the presence or absence of a feature word on the global information entropy, and does not consider how frequently the feature word appears within and across classes.
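For reference, the traditional information gain of a feature t is commonly written in the text classification literature in the form below. This is the standard textbook expression, given on the assumption that the patent's IG(t) follows it; here $c_i$ is the i-th of k categories and $\bar{t}$ denotes the absence of feature t.

```latex
IG(t) = -\sum_{i=1}^{k} P(c_i)\log P(c_i)
        + P(t)\sum_{i=1}^{k} P(c_i \mid t)\log P(c_i \mid t)
        + P(\bar{t})\sum_{i=1}^{k} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})
```

The first term is the entropy of the category distribution, and the remaining two terms subtract the conditional entropy given the presence or absence of t; no term reflects how the feature is distributed across samples, which is the gap the improved method addresses.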
The improved information gain method has the following calculation formula:
Figure BDA0001145461840000071
where dis(t) represents the distribution of feature t among the classes, defined as the ratio of the number of samples in which feature t appears to the total number of samples. There are two reasons for selecting
Figure BDA0001145461840000072
as the adjustment coefficient. First,
Figure BDA0001145461840000073
is a decreasing function of dis(t), i.e. when the distribution value of feature t among the classes is very small,
Figure BDA0001145461840000074
is large, which is exactly what is required. Second, selecting
Figure BDA0001145461840000075
as the adjustment coefficient balances the weight between the traditional information gain value IG(t) and the inter-class distribution value dis(t) of feature t, so that the result does not depend too heavily on either value.
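A minimal Python sketch of the two quantities involved, the traditional information gain IG(t) and the distribution value dis(t), is given below. The adjustment coefficient itself is defined by the formula figure and is not reproduced here, so the sketch takes it as a caller-supplied function `coef`; multiplying the coefficient by IG(t) is likewise an assumption made for illustration, and `information_gain`, `dis` and `adjusted_gain` are hypothetical names.

```python
import math
from collections import Counter


def information_gain(samples, labels, t):
    """Traditional IG of feature word t; each sample is a set of words."""
    n = len(samples)
    classes = Counter(labels)
    with_t = Counter(c for s, c in zip(samples, labels) if t in s)
    n_t = sum(with_t.values())

    def plogp(p):
        return p * math.log2(p) if p > 0 else 0.0

    # Entropy of the class distribution.
    ig = -sum(plogp(cnt / n) for cnt in classes.values())
    # Subtract the conditional entropy for "t present" and "t absent".
    for present, total in ((True, n_t), (False, n - n_t)):
        if total == 0:
            continue
        for c, cnt in classes.items():
            k = with_t[c] if present else cnt - with_t[c]
            ig += (total / n) * plogp(k / total)
    return ig


def dis(samples, t):
    """Ratio of the number of samples containing t to the number of samples."""
    return sum(1 for s in samples if t in s) / len(samples)


def adjusted_gain(samples, labels, t, coef):
    """Assumed combination: coefficient(dis(t)) * IG(t); coef is a placeholder."""
    return coef(dis(samples, t)) * information_gain(samples, labels, t)


# Toy data: segmented descriptions and their major classes.
samples = [{"cpu", "phone"}, {"phone", "screen"}, {"switch", "port"}, {"switch", "vlan"}]
labels = ["phone", "phone", "switch", "switch"]
# Any decreasing function of dis(t) can be tried here; 1/d is only a stand-in.
score = adjusted_gain(samples, labels, "phone", coef=lambda d: 1.0 / d)
```

Keeping the coefficient as a parameter makes it easy to substitute the patent's actual coefficient once it is available.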
Similarly, the invention improves the traditional KNN algorithm by taking into account the different influence of different features on classification: the information gain value obtained during feature selection is used as the feature weight in the KNN algorithm. The information gain value of a feature represents its influence on the information entropy; the larger the value, the greater the influence of the feature on the classification result. Using the information gain value of a feature directly as its weight in the KNN algorithm therefore reflects the contribution of features with different information gain values to classification. The distance formula of the improved KNN algorithm is given below.
Figure BDA0001145461840000076
where x denotes an unclassified sample and y a classified sample, both n-dimensional vectors in which each dimension is a feature value; $IG(t_i)$ denotes the information gain value of the i-th feature $t_i$; $x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_n)$.
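The idea of weighting each feature by its information gain can be sketched in Python as follows. The exact distance formula is given in the figure and is not reproduced here; this sketch assumes a weighted Euclidean distance, and `weighted_distance` and `knn_classify` are hypothetical names.

```python
import math
from collections import Counter


def weighted_distance(x, y, weights):
    """Assumed form: sqrt(sum_i w_i * (x_i - y_i)^2), with w_i = IG(t_i)."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))


def knn_classify(x, train, weights, k=3):
    """Majority vote among the k classified samples nearest to x."""
    nearest = sorted(train, key=lambda item: weighted_distance(x, item[0], weights))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy classified samples: (feature vector, class label).
train = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b")]
weights = (1.0, 0.5)  # illustrative information gain values, one per feature
predicted = knn_classify((1, 0), train, weights, k=3)
```

With this weighting, a difference on a high-gain feature moves two samples further apart than the same difference on a low-gain feature.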
(2) Classification of hardware manufacturers
After the hardware major class has been classified, the manufacturer classification determines the manufacturer of the hardware within that class. As before, this step requires feature selection and classification with a suitable classification algorithm.
The feature selection algorithm adopted by the invention is based on feature similarity: for each feature, its similarity across the different manufacturer categories is considered; if the similarity is greater than or equal to a certain threshold, the feature is considered too similar across manufacturers and unsuitable as a classification feature, otherwise it can be used as a classification feature. The improved KNN classification algorithm is again used for this classification, except that the feature weight is changed to the logarithm of the reciprocal of the feature similarity, as described below.
In the hardware information base, each hardware feature may include several sub-features; for example, the value of the feature "dimension" consists of three values: length, width and height. Here length, width and height are the three sub-features of the feature "dimension". Assume a feature $t_i$ consists of n sub-features, i.e. $t_i = (t_{i1}, t_{i2}, \dots, t_{in})$. If the feature value of one sample on feature $t_i$ is
Figure BDA0001145461840000081
and the feature value of another sample on feature $t_i$ is
Figure BDA0001145461840000082
then the similarity between
Figure BDA0001145461840000083
and
Figure BDA0001145461840000084
is defined as:
Figure BDA0001145461840000085
That is, the cosine of the angle between the two vectors is used to define the similarity between two feature values. Because different features may contain different numbers of sub-features, i.e. have different dimensions, this definition ignores the dimension of the vectors and measures similarity by the angle between them: when two vectors, i.e. two feature values, are similar, the cosine of the angle is large; otherwise it is small.
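The cosine-based similarity between two feature values can be sketched in Python as below. The function name `cosine_similarity`, the zero-vector convention and the sample dimension values are assumptions of the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sub-feature vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("both feature values must have the same sub-features")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for an all-zero feature value
    return dot / (norm_u * norm_v)


# "dimension" feature values (length, width, height) of two samples, illustrative.
sim = cosine_similarity((150.0, 75.0, 8.0), (160.0, 78.0, 7.5))
```

Because the measure depends only on the angle, two feature values that differ by a constant scale factor are treated as fully similar.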
Having defined the similarity for a single feature, the similarity of two classes on a given feature is defined next. Since each class may contain multiple samples, assume the two classes $c_1$ and $c_2$ contain $m_1$ and $m_2$ samples respectively; the similarity of the two classes on feature $t_i$ is then calculated as follows:
Figure BDA0001145461840000086
As the formula shows, the similarity of two classes on feature $t_i$ is defined directly as the mean of the similarities on $t_i$ over all sample pairs formed from the two classes, so that the similarity of every sample pair between the two classes on feature $t_i$ is taken into account.
Based on the similarity of two classes on feature $t_i$, the similarity of p classes on feature $t_i$ is defined as follows. Let the p classes be $c_1, c_2, \dots, c_p$; the similarity of the p classes on feature $t_i$ is defined as the average of the similarities on $t_i$ of every pair of classes, i.e.:
Figure BDA0001145461840000087
If the similarity of the p classes on feature $t_i$ is greater than or equal to a certain threshold $\delta$, i.e.
Figure BDA0001145461840000088
then feature $t_i$ is considered too similar across the p classes to be used as a classification feature; otherwise it can be used as a classification feature.
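The class-level similarity and the threshold test can be sketched in Python as follows. The normalization over all class pairs is an assumption, since the exact formula is in the figure, and the names `class_pair_similarity`, `feature_similarity`, `select_features` and the toy data are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two sub-feature vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def class_pair_similarity(samples_a, samples_b):
    """Mean similarity over all sample pairs drawn from the two classes."""
    total = sum(cosine(u, v) for u in samples_a for v in samples_b)
    return total / (len(samples_a) * len(samples_b))


def feature_similarity(classes):
    """Average of the pairwise class similarities over all pairs of classes."""
    pairs = list(combinations(classes, 2))
    return sum(class_pair_similarity(a, b) for a, b in pairs) / len(pairs)


def select_features(feature_values, delta):
    """Keep features whose similarity across classes is below the threshold."""
    return [name for name, classes in feature_values.items()
            if feature_similarity(classes) < delta]


# feature name -> list of classes, each a list of sub-feature vectors (toy data).
feature_values = {
    "size": [[(1.0, 0.0)], [(0.0, 1.0)]],    # classes differ strongly
    "color": [[(1.0, 1.0)], [(2.0, 2.0)]],   # classes look alike
}
selected = select_features(feature_values, delta=0.5)
```

In this toy data only "size" survives, because "color" takes nearly identical directions in both classes and so cannot separate them.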
Classification in this step still uses the improved KNN algorithm, except that the feature weight is no longer the information gain value but the reciprocal of the feature similarity. The reason is that the feature similarity expresses how similar a feature is across different classes: features with higher similarity contribute little to classification and should be given smaller weights, while features with lower similarity contribute more and should be given larger weights. It is therefore reasonable for the invention to use the reciprocal of the similarity as the feature weight in the KNN calculation. The specific KNN distance formula is as follows:
Figure BDA0001145461840000091
The classification process for hardware manufacturers is as follows; Fig. 2 shows the corresponding flowchart.
1) Select samples of the different manufacturers within a given major class from the hardware information base;
2) for each feature, calculate its feature similarity across the different manufacturers;
3) if the feature similarity is smaller than a certain threshold, take the feature as a classification feature; otherwise return to 2) and select the next feature to continue calculating feature similarity;
4) classify using the selected features and the improved KNN algorithm to obtain the corresponding manufacturer category.
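For step 4), the use of the reciprocal of the feature similarity as the feature weight can be sketched as follows. The weighted Euclidean form of the distance is an assumption, since the formula is in the figure; `reciprocal_weights`, `vendor_distance` and the small constant `eps` are hypothetical.

```python
import math


def reciprocal_weights(similarities, eps=1e-9):
    """Weight each selected feature by 1 / similarity (eps avoids division by zero)."""
    return [1.0 / max(s, eps) for s in similarities]


def vendor_distance(x, y, similarities):
    """Assumed form: sqrt(sum_i (1 / sim_i) * (x_i - y_i)^2)."""
    weights = reciprocal_weights(similarities)
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))
```

A feature with similarity 0.25 across manufacturers thus counts twice as much in the distance as one with similarity 0.5.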
(3) Matching of hardware models
After the category of the hardware and its manufacturer within that category have been determined, the invention determines the model of the hardware under that manufacturer with a hardware model matching algorithm. The algorithm is based on sets of hardware models: hardware models with the same attribute value are put into one set; to determine the model of a given piece of hardware, it is only necessary to determine its values on certain attributes, which identifies the model sets it belongs to, and the model is then obtained by intersecting those sets. Compared with comparing hardware models one at a time, this matching method is considerably more efficient and greatly reduces the number of comparisons.
When matching the hardware model, products are not compared one by one; instead, a new algorithm is used to make the comparison more efficient. Specifically, suppose the products of the category have n attributes $(t_1, t_2, \dots, t_n)$, and each attribute $t_i$ comprises $a_i$ sub-features, i.e.
Figure BDA0001145461840000092
Among the products made by the manufacturer, those with the same value on attribute $t_i$ are grouped into one set. Because a product model may share its value on more than one attribute with other products, the same model may appear in several sets; that is, the sets may intersect.
If p attributes appear in the description information of the hardware, namely
Figure BDA0001145461840000101
and the value of attribute
Figure BDA0001145461840000102
is
Figure BDA0001145461840000103
then the hardware model matching algorithm is described as follows:
1) For each attribute $t_i$, place the hardware models that have the same attribute value in the same set;
2) Let $i = 1$ and $C = \Omega$, where $\Omega$ denotes the universal set;
3) For attribute
Figure BDA0001145461840000104
find the set of models with the same attribute value,
Figure BDA0001145461840000105
4)
Figure BDA0001145461840000106
5) If $C$ contains only one element or $i > p$, proceed to 6); otherwise let $i = i + 1$ and return to 3);
6) Return the set $C$, which is the final hardware model matching result.
Fig. 3 shows the flowchart of the hardware model matching method; the main steps are as follows.
1) For each attribute, construct the sets of hardware models that share the same attribute value;
2) Take one attribute and examine the value of the hardware on that attribute to obtain the hardware model set corresponding to that attribute value;
3) Intersect this hardware model set with the hardware model set obtained so far. If the intersection contains only one element, or all attributes have been taken, stop; the elements of the intersection are the model of the hardware. Otherwise, return to 2).
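The set-intersection matching described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the catalogue is assumed to be a dictionary from model name to its attribute values, the names `build_index` and `match_model` are hypothetical, and the loop also stops early when the candidate set becomes empty, a case the text does not discuss.

```python
from collections import defaultdict

def build_index(catalog):
    """Map each (attribute, value) pair to the set of models that have it."""
    index = defaultdict(set)
    for model, attrs in catalog.items():
        for attr, value in attrs.items():
            index[(attr, value)].add(model)
    return index

def match_model(catalog, observed):
    """Intersect the model sets of every observed attribute value.

    catalog:  {model_name: {attribute: value}} for one manufacturer (assumed layout)
    observed: {attribute: value} extracted from the hardware description
    """
    index = build_index(catalog)
    candidates = set(catalog)                      # C starts as the universal set
    for attr, value in observed.items():
        candidates &= index.get((attr, value), set())
        if len(candidates) <= 1:                   # unique model found, or no match
            break
    return candidates
```

With an index built once per manufacturer, each query costs a handful of set intersections rather than one comparison against every model.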
4. Constructing the keyword shielding replacement model
A keyword shielding replacement model is designed to shield or replace keywords in the hardware description information that may reveal sensitive hardware information. Keywords are assigned to different sensitivity levels, and keywords at different levels are processed in different ways.
(1) Keyword sensitivity level classification
For each hardware major class, five sensitivity levels are established in advance for all attribute value keywords. They are denoted by the numbers 0, 1, 2, 3 and 4, with sensitivity increasing in that order, as shown in Table 1.
Table 1. Sensitivity levels
| Sensitivity level | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Meaning | Not sensitive | Slightly sensitive | Moderately sensitive | Fairly sensitive | Very sensitive |
| Processing | None | Replace | Replace | Replace | Shield |
Keywords at different sensitivity levels are processed differently. Keywords at level 0 are left unprocessed, keywords at level 4 are shielded directly with asterisks, and keywords at levels 1, 2 and 3 are replaced by means of a semantic tree.
(2) Construction of keyword semantic trees
Keywords at sensitivity levels 1, 2 and 3 are replaced by means of a semantic tree. The leaf nodes of the semantic tree are the keywords with the most specific semantics; the semantics become progressively more general as the node level rises, and the root node is the most general of all. For hardware description information the semantic tree has 4 layers in total, and the replacement strategy based on it is as follows:
A keyword at sensitivity level 1 is replaced with its parent node; a keyword at sensitivity level 2 is replaced with the parent of its parent node; a keyword at sensitivity level 3 is replaced directly with the root node.
The XML document for each hardware model in the hardware information base is hierarchical, and an attribute keyword in an upper layer is semantically more general than one in a lower layer, so the keyword semantic tree can be built from the XML document.
The semantic tree is built as follows. The leaf nodes in the bottom layer are the sub-feature words of the innermost attribute keywords. The second layer from the bottom corresponds to the innermost attribute keywords of the XML structure in the hardware information base, which are semantically more general than their sub-feature words. The third layer from the bottom consists of the second-layer attribute keywords of the XML structure. The first layer of the XML document is the specific model of the hardware, which is very sensitive information, so the fourth layer from the bottom does not correspond to the first layer of the XML document; instead, the name of the hardware major class, which is semantically more general than the third layer from the bottom, is used as the keyword of this layer. Because this layer has already been raised to the name of the hardware class, it is also the first layer of the whole semantic tree, that is, the root node. Fig. 6 shows the correspondence between the keyword layers of the semantic tree and the keyword layers of the XML document, and Fig. 7 shows an example of the resulting semantic tree; the "second-layer attribute keywords" and "third-layer attribute keywords" in the example refer to the second-layer and third-layer attribute keywords of the XML document.
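The level-dependent processing can be sketched in Python as below. The tree content is hypothetical (the keywords "network type" and "network" are invented for illustration and do not come from the hardware information base), the tree is stored as child-to-parent links, and shielding with one asterisk per character is an assumption, since the text only says that asterisks are used.

```python
# Hypothetical four-layer semantic tree, stored as child -> parent links.
PARENT = {
    "Mobile 4G": "network type",   # leaf sub-feature word -> innermost attribute keyword
    "network type": "network",     # innermost -> second-layer attribute keyword
    "network": "mobile phone",     # second layer -> root (hardware major class name)
}
ROOT = "mobile phone"

def ancestor(word, steps):
    """Climb the tree by the given number of steps, stopping at the root."""
    for _ in range(steps):
        word = PARENT.get(word, word)
    return word

def process_keyword(word, level):
    """Apply the processing mode that Table 1 assigns to each sensitivity level."""
    if level == 0:
        return word                  # not sensitive: leave unchanged
    if level == 1:
        return ancestor(word, 1)     # replace with the parent node
    if level == 2:
        return ancestor(word, 2)     # replace with the parent of the parent
    if level == 3:
        return ROOT                  # replace with the root node
    if level == 4:
        return "*" * len(word)       # shield (assumed: one asterisk per character)
    raise ValueError(f"unknown sensitivity level: {level}")
```

For a leaf keyword in a four-layer tree, the three replacement levels yield progressively more general terms, ending at the major class name.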
Application example
Little information about enterprise IT hardware facilities is available on Internet social media, so it is difficult to collect. In the example verification, 5000 pieces of partial hardware description information were therefore extracted from the hardware information base and arranged into text documents, one document per piece of description information. After word segmentation and random deletion of some keywords, the resulting keyword samples are consistent with content obtained from social media after processing, so the processed data can approximately simulate hardware description samples from social media.
From each major class, 60 samples were selected arbitrarily as training samples, giving 2160 training samples in total; the remaining 40 samples of each class were used as test samples to be classified, giving 1440 test samples in total. The relationship between classification performance and the value of k is shown in Table 2.
Table 2. Proportion of correctly classified samples and mean $F_1$ for hardware major classes at different values of k
| Parameter k | 1 | 5 | 10 | 15 | 20 | 25 | 30 |
|---|---|---|---|---|---|---|---|
| Correct classification ratio | 80.1% | 72.8% | 69.3% | 67.3% | 65.7% | 63.8% | 60% |
| Mean $F_1$ | 0.805 | 0.734 | 0.706 | 0.689 | 0.676 | 0.663 | 0.639 |
For the classification of hardware manufacturers, the hardware major class "mobile phone" is taken as the example, and eight mobile phone manufacturers are selected: Samsung, Apple, Huawei, OPPO, vivo, Meizu, Lenovo and Coolpad. The proportion of correctly classified samples and the mean $F_1$ were measured at different values of k; the verification results are shown in Table 3.
Table 3. Proportion of correctly classified samples and mean $F_1$ for manufacturers at different values of k
| Parameter k | 1 | 5 | 10 | 15 | 20 | 25 | 30 | 35 |
|---|---|---|---|---|---|---|---|---|
| Correct classification ratio | 42.4% | 36.0% | 34.7% | 35.6% | 31.8% | 35.6% | 33.5% | 31.4% |
| Mean $F_1$ | 0.422 | 0.350 | 0.339 | 0.328 | 0.295 | 0.319 | 0.299 | 0.281 |
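For readers who want to reproduce the two metrics reported in Tables 2 and 3, the following Python sketch shows one way to compute them. It assumes that the mean $F_1$ is the unweighted average of the per-class $F_1$ scores (macro averaging); the text does not state which averaging was used, and the function names are hypothetical.

```python
def correct_ratio(y_true, y_pred):
    """Proportion of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging, an assumption)."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)
```

Under macro averaging every class counts equally, which suits this experiment because each major class contributes the same number of test samples.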
Then 200 texts under the mobile phone category were selected at random, and each sub-feature value was processed according to the sensitivity level of the corresponding sub-feature word. The final statistics are shown in Table 4.
Table 4. Keyword shielding and replacement performance data (partial)
| Sub-feature word | Full network communication | Mobile 4G | Unicom 4G | Telecom 4G | Transverse direction |
|---|---|---|---|---|---|
| Number of sub-feature words | 20 | 89 | 76 | 41 | 138 |
| Number processed correctly | 20 | 89 | 76 | 41 | 138 |
| Accuracy | 100% | 100% | 100% | 100% | 100% |

Claims (6)

1. A method for protecting sensitive information of enterprise hardware facilities in social media is characterized by comprising the following specific steps:
Step one, constructing the models
(1) Constructing the hardware information base
Acquiring hardware information; extracting information at a plurality of levels, including hardware category, manufacturer and model, together with attributes and attribute values; organizing the information into an XML hierarchical structure; and constructing the hardware information base;
(2) Performing Chinese word segmentation on the hardware description information in the hardware information base;
(3) Constructing the hardware classification model and the hardware model matching algorithm
After the hardware description information in the hardware information base has been segmented, first extracting the feature information of the major classes, then extracting the feature information of the manufacturers on the basis of the major class classification and constructing a manufacturer classification model; finally, constructing a hardware model matching algorithm from the major class and manufacturer category information to determine the model of the hardware;
(4) Constructing the keyword shielding replacement model
For each hardware major class, dividing the attribute keywords appearing in the hardware description information into sensitivity levels, and constructing a keyword shielding replacement model that processes keywords at different sensitivity levels in different ways; wherein the sensitivity levels are 0, 1, 2, 3 and 4; keywords at sensitivity level 0 are not processed, keywords at sensitivity level 4 are shielded directly, and keywords at sensitivity levels 1, 2 and 3 are processed through a keyword semantic tree; the keyword semantic tree is constructed from the keywords at different levels of the hardware information base according to the XML structural relationship; the keyword semantic tree has four layers, and the replacement strategy based on the keyword semantic tree is as follows:
a keyword at sensitivity level 1 is replaced with its parent node; a keyword at sensitivity level 2 is replaced with the parent of its parent node; a keyword at sensitivity level 3 is replaced directly with the root node;
Step two, detection and protection
After word segmentation of the input social media content, determining the major class, the manufacturer and the model to which it belongs according to the hardware classification model and the hardware model matching algorithm of step one; and after the model has been determined, using the keyword shielding replacement model constructed in step one to perform the corresponding action, namely shielding, replacing or no processing, on each attribute keyword in the segmented social media content according to its sensitivity level and processing mode.
2. The sensitive information protection method according to claim 1, wherein the hardware classification model classifies the hardware major classes and the hardware manufacturers by means of a feature selection algorithm and a classification algorithm.
3. The sensitive information protection method according to claim 2, wherein, when classifying the hardware major classes, the feature selection algorithm adopts an improved information gain method, and a specific calculation formula is as follows:
Figure FDA0002406468000000011
wherein t is a feature, c denotes a category, k denotes the number of categories, dis(t) denotes the distribution of feature t among the categories, namely the ratio of the number of samples in which feature t appears to the total number of samples, P(t) denotes the probability that the feature appears, P(c) denotes the probability that the category appears, P(c, t) denotes the probability that the feature and the category appear together,
Figure FDA0002406468000000021
indicating the probability that the feature does not occur,
Figure FDA0002406468000000022
representing the probability that the feature does not appear that the sample belongs to the class c;
the classification algorithm adopts an improved KNN method, wherein the distance calculation formula is as follows:
Figure FDA0002406468000000023
where x denotes an unclassified sample and y a classified sample, both n-dimensional vectors in which each dimension is a feature value, $x = (x_1, x_2, \ldots, x_n)$, $y = (y_1, y_2, \ldots, y_n)$; $IG'(t_i)$ denotes the improved information gain of the i-th feature $t_i$; $d(x, y)$ denotes the distance between x and y; and $x_i$, $y_i$ denote the i-th feature values of the samples.
4. The sensitive information protection method according to claim 2, wherein, when classifying the hardware manufacturers, the feature selection algorithm selects features by a feature similarity method, using the similarity between classes on each feature; let the p classes be $c_1, c_2, \ldots, c_p$; the similarity of the p classes on feature $t_i$ is defined as the average of the similarities of every pair of classes on $t_i$, i.e.:
Figure FDA0002406468000000024
if
Figure FDA0002406468000000025
where δ is a threshold, then feature $t_i$ is considered too similar among the p classes and unsuitable as a classification feature; otherwise it can be used as a classification feature;
the classification algorithm adopts an improved KNN method in which the reciprocal of the similarity is used as the feature weight in the KNN calculation, and the specific KNN distance formula is as follows:
Figure FDA0002406468000000026
wherein $c_i$ denotes the i-th class, p is the total number of classes, $t_i$ denotes the i-th feature, n is the total number of features, and $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$ denote an unclassified sample and a classified sample respectively, each having n feature values $x_i$, $y_i$.
5. The sensitive information protection method according to claim 1, wherein the hardware model matching algorithm adopts a hardware model set method: hardware models with the same attribute value are put into one set; the attribute values of the hardware to be matched on certain attributes are determined, thereby determining the model sets to which the hardware belongs; and the intersection of these sets is then taken to obtain the model to which the hardware belongs.
6. The sensitive information protection method according to claim 1, wherein the leaf nodes in the lowest layer of the keyword semantic tree are the sub-feature words of the innermost attribute keywords of the XML structure in the hardware information base, the second layer from the bottom of the semantic tree corresponds to the innermost attribute keywords of the XML structure in the hardware information base, the third layer from the bottom of the semantic tree consists of the second-layer attribute keywords of the XML structure, and the fourth layer is the root node, which is the name of the hardware major class.
CN201610971014.7A 2016-10-31 2016-10-31 Method for protecting sensitive information of enterprise hardware facilities in social media Active CN106649262B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610971014.7A CN106649262B (en) 2016-10-31 2016-10-31 Method for protecting sensitive information of enterprise hardware facilities in social media

Publications (2)

Publication Number Publication Date
CN106649262A CN106649262A (en) 2017-05-10
CN106649262B true CN106649262B (en) 2020-07-07

Family

ID=58821041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610971014.7A Active CN106649262B (en) 2016-10-31 2016-10-31 Method for protecting sensitive information of enterprise hardware facilities in social media

Country Status (1)

Country Link
CN (1) CN106649262B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108390865B (en) * 2018-01-30 2021-03-02 南京航空航天大学 Fine-grained access control method based on privacy drive
CN111209735B (en) * 2020-01-03 2023-06-02 广州杰赛科技股份有限公司 Document sensitivity calculation method and device
CN112100646A (en) * 2020-04-09 2020-12-18 南京邮电大学 Spatial data privacy protection matching method based on two-stage grid conversion
CN112000867A (en) * 2020-08-17 2020-11-27 桂林电子科技大学 Text classification method based on social media platform

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101827102A (en) * 2010-04-20 2010-09-08 中国人民解放军理工大学指挥自动化学院 Data prevention method based on content filtering
CN104239436A (en) * 2014-08-27 2014-12-24 南京邮电大学 Network hot event detection method based on text classification and clustering analysis
CN104866465A (en) * 2014-02-25 2015-08-26 腾讯科技(深圳)有限公司 Sensitive text detection method and device
US9245012B2 (en) * 2008-03-28 2016-01-26 International Business Machines Corporation Information classification system, information processing apparatus, information classification method and program
CN105426361A (en) * 2015-12-02 2016-03-23 上海智臻智能网络科技股份有限公司 Keyword extraction method and device
CN105955978A (en) * 2016-04-15 2016-09-21 宝利九章(北京)数据技术有限公司 Method and system for data leakage protection

Also Published As

Publication number Publication date
CN106649262A (en) 2017-05-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant