CN117009596A - Identification method and device for power grid sensitive data - Google Patents


Info

Publication number
CN117009596A
CN117009596A (application CN202310778166.5A)
Authority
CN
China
Prior art keywords
data
keywords
document data
preset
sensitive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310778166.5A
Other languages
Chinese (zh)
Inventor
那琼澜
苏丹
来骥
张实君
杨艺西
任建伟
马跃
邢宁哲
庞思睿
曽婧
李硕
徐相森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
Information and Telecommunication Branch of State Grid Jibei Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
Information and Telecommunication Branch of State Grid Jibei Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, Information and Telecommunication Branch of State Grid Jibei Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN202310778166.5A priority Critical patent/CN117009596A/en
Publication of CN117009596A publication Critical patent/CN117009596A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/906Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a method and a device for identifying power grid sensitive data, relating to the technical field of power grid security reinforcement. The method improves the quality of the identified sensitive data and avoids collecting a large amount of redundant, low-value sensitive data, thereby providing a better solution for identifying power grid sensitive data. The main technical scheme of the application is as follows: acquiring document data to be processed from a power grid; extracting keywords from the document data by analyzing it, wherein the keywords characterize associated sensitive information; processing the keywords and their associated sensitive information with a preset machine learning model to determine the label corresponding to each keyword; processing the resulting labels, in combination with the document data, with a preset label distribution learning model to obtain the label probability distribution corresponding to the labels; and determining the target sensitive data from the document data according to the label probability distribution.

Description

Identification method and device for power grid sensitive data
Technical Field
The application relates to the technical field of power grid safety reinforcement, in particular to a method and a device for identifying power grid sensitive data.
Background
The power grid carries a large volume of data whose types and characteristics are complex and varied, and sensitive data identification is a precondition of data security protection. Current sensitive data recognition technology can recognize certain sensitive words, such as customer name, identification number, contact phone, and residence address. However, the same sensitive word contributes a different degree of sensitivity in different document instances; for example, in a specific scene, the name may have low sensitivity while the identification number has high sensitivity.
Therefore, if all data related to sensitive words are simply collected as sensitive data, it is difficult to distinguish how sensitive the content is in different document instances, and a large amount of redundant data is obtained (for example, if a name is shared by very many people, that name has little value as sensitive data). It then becomes difficult to identify, among the large amount of complex data in a power grid, which items are truly highly sensitive. How to identify genuinely valuable sensitive data from the power grid is thus a problem that currently needs to be solved.
Disclosure of Invention
In view of the above, the present application provides a method and an apparatus for identifying power grid sensitive data, which use label distribution learning to identify highly sensitive data from the large amount of complex data in the power grid, thereby obtaining genuine, higher-quality sensitive data, avoiding the collection of large amounts of redundant, low-value sensitive data, and providing a better solution for identifying power grid sensitive data.
In order to achieve the above purpose, the present application mainly provides the following technical solutions:
the first aspect of the application provides a method for identifying power grid sensitive data, which comprises the following steps:
acquiring document data to be processed from a power grid;
extracting keywords from the document data by analyzing the document data, wherein the keywords are used for representing associated sensitive information;
processing the keywords and the sensitive information associated with the keywords by using a preset machine learning model, and judging the labels corresponding to the keywords;
processing a plurality of labels, in combination with the document data, by using a preset label distribution learning model to obtain label probability distribution corresponding to the labels;
and determining target sensitive data from the document data according to the label probability distribution.
A second aspect of the present application provides an apparatus for identifying power grid sensitive data, the apparatus comprising:
the first acquisition unit is used for acquiring document data to be processed from a power grid;
the extraction unit is used for extracting keywords from the document data through analysis processing of the document data, wherein the keywords are used for representing associated sensitive information;
the first processing unit is used for processing the keywords and the sensitive information associated with the keywords by using a preset machine learning model and judging the labels corresponding to the keywords;
the second processing unit is used for processing a plurality of labels by combining the document data and utilizing a preset label distribution learning model to obtain label probability distribution corresponding to the labels;
and the determining unit is used for judging and determining the target sensitive data from the document data according to the tag probability distribution.
A third aspect of the present application provides a storage medium that includes a stored program, wherein the program, when executed, controls a device on which the storage medium is located to perform the method for identifying power grid sensitive data described above.
A fourth aspect of the application provides an electronic device comprising at least one processor, at least one memory connected to the processor, and a bus;
the processor and the memory communicate with each other through the bus;
the processor is used to call the program instructions in the memory to execute the method for identifying power grid sensitive data described above.
By means of the technical scheme, the technical scheme provided by the application has at least the following advantages:
the application provides a method and a device for identifying sensitive data of a power grid, which are used for analyzing the document data to be processed in the power grid, extracting keywords from the document data, correspondingly associating some sensitive information with the keywords, judging labels based on the keywords and the associated sensitive information, namely the sensitive labels associated with the document data, processing the labels by using a preset label distribution learning model on the basis, and obtaining label probability distribution which is used for representing the importance degree of the labels in the document data (namely the influence on the sensitivity degree of the document data), thereby further judging and determining target sensitive data contained in the document data. Compared with the technical problem that the prior art only collects sensitive word associated sensitive information to cause the redundancy of the obtained sensitive data, the method and the device for identifying the sensitive data by using the label distribution learning are used for identifying the sensitive data with higher sensitivity from a large amount of complex data information in the power grid, so that the real sensitive data with higher quality are obtained, the obtaining of a large amount of redundant sensitive data with low value is avoided, and a better solution for identifying the sensitive data of the power grid is provided.
The foregoing description is only an overview of the technical solution of the present application. It is provided so that the technical means of the present application can be understood clearly enough to be implemented according to the contents of the specification, and so that the above and other objects, features, and advantages of the present application become more apparent and understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a flowchart of a method for identifying grid sensitive data according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a training preset tag distribution learning model, as exemplified by an embodiment of the present application;
FIG. 3 is a flowchart of another method for identifying grid sensitive data according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a label distribution prediction flow according to an embodiment of the present application;
fig. 5 is a block diagram of an identification device for power grid sensitive data according to an embodiment of the present application;
fig. 6 is a block diagram of another device for identifying grid sensitive data according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the application to those skilled in the art.
The embodiment of the application provides a method for identifying power grid sensitive data, as shown in fig. 1, and the method comprises the following specific steps:
101. Acquire the document data to be processed from the power grid.
102. Extract keywords from the document data by analyzing it, where the keywords characterize associated sensitive information.
In embodiments of the present application, keywords may also be referred to as "sensitive words" that characterize the associated sensitive information, such as keywords "customer name", "identification number", "contact phone", "residence address", and so forth.
103. Process the keywords and their associated sensitive information with a preset machine learning model, and determine the labels corresponding to the keywords.
In the embodiment of the application, the corresponding label can be further determined based on the keywords and their associated sensitive information; the method of determination is not limited.
104. Process the plurality of labels, in combination with the document data, with a preset label distribution learning model to obtain the label probability distribution corresponding to the labels.
Here, label distribution learning (LDL) is, in short, a multi-label data identification method based on a model that describes data through a label distribution.
For many real-world problems, the importance of different labels tends to differ. For example, a piece of user electricity data is marked with several labels such as user name, user location, electricity consumption amount, and electricity consumption time, each describing the information to a different degree. Likewise, in power grid sensitive data, a complex data item is often the result of a mixture of multiple pieces of basic information (such as time, place, business, and user), each of which can be represented by a label, and these labels often express different intensities in a specific data instance, producing a complex meaning. Similar examples abound: once an instance is associated with multiple labels at the same time, the labels are typically not equally important for the instance, but rather stand in a primary-and-secondary relationship.
Therefore, the embodiment of the application constructs a preset label distribution learning model based on a label distribution learning algorithm, and is used for processing the labels in the document data so as to obtain label probability distribution corresponding to the labels.
One application example is as follows: for an instance x, a real number d_x^y is assigned to each possible label y, indicating the degree to which y describes x. Without loss of generality, assume d_x^y ∈ [0, 1]. Further assume that the label set is complete, i.e., all labels in the set together fully describe an instance, so that the sum over all labels satisfies Σ_y d_x^y = 1. A value d_x^y satisfying these two conditions is called the description degree of y for x. For one instance, the description degrees of all labels constitute a data structure resembling a probability distribution, and are therefore called the label distribution, while the process of learning on a data set labeled with label distributions is called label distribution learning.
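The two conditions above (degrees in [0, 1] that sum to 1) can be sketched in a few lines of Python; the labels and relevance scores below are illustrative, not taken from the application:

```python
def make_label_distribution(raw_scores):
    """Normalize non-negative relevance scores into description degrees
    that lie in [0, 1] and sum to 1 over the complete label set."""
    total = sum(raw_scores.values())
    return {label: score / total for label, score in raw_scores.items()}

# Hypothetical document instance in which the identification number is
# far more sensitive than the customer name.
dist = make_label_distribution({"name": 1.0, "id_number": 6.0, "phone": 3.0})
```

Every label then carries a continuous degree rather than a binary relevant/irrelevant flag, which is the property the following steps rely on.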
For example, as shown in FIG. 2, an embodiment of the present application further illustrates a schematic flow chart of training a preset tag distribution learning model.
105. Determine the target sensitive data from the document data according to the label probability distribution.
The label probability distribution is equivalent to the degree to which each label describes the sample, expressed as a real value between 0 and 1. The learned label distribution therefore reflects not only whether a label is related to the sample (0 means unrelated; non-zero means related) but also how well the label describes the sample.
In other words, whereas a plain "0" or "1" only indicates whether a label is related to an instance, label distribution learning expresses a label's description degree of the instance as a specific continuous value; all labels with a non-zero degree are related, and the entire label distribution set completely describes the instance.
Therefore, in the embodiment of the application, the instance above is the "document data", and the label probability distribution gives the description degree between each label and the document data. Here the description degree is equated with the sensitivity degree, so labels with higher sensitivity can be identified by their higher probability, and the target sensitive data with higher sensitivity can then be determined from the document data based on the sensitive information associated with those labels.
Compared with the prior art, which only collects the sensitive information associated with sensitive words and therefore yields redundant sensitive data, the embodiment of the application uses label distribution learning to identify highly sensitive data from the large amount of complex data in the power grid, thereby obtaining genuine, higher-quality sensitive data and avoiding the collection of large amounts of redundant, low-value sensitive data, so that a better solution for identifying power grid sensitive data is provided.
To describe the above embodiment in more detail, the embodiment of the present application further provides another method for identifying power grid sensitive data, as shown in fig. 3, with the following specific steps:
201. Acquire the document data to be processed from the power grid.
The document data includes structured data and unstructured data. In short, structured data, such as row data stored in a database, is data that can be logically represented by a two-dimensional table structure; data that cannot conveniently be represented by a two-dimensional database table is referred to as unstructured data, and includes office documents in all formats, texts, pictures, XML, HTML, various reports, and image and audio/video information.
202a. For structured data in the document data, determine keywords from the structured data using regular expressions.
A regular expression is commonly used to retrieve and replace text that matches a certain pattern (rule). It is a logical formula for operating on character strings: a "rule string" is formed from predefined specific characters and combinations of them, and this "rule string" expresses the filtering logic to be applied to character strings.
In the embodiment of the application, based on the characteristics of the structured data, the keyword segments in the document data can be filtered and screened based on the regular expression, so that the keywords are further determined based on the data information stored on the keyword segments.
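As a minimal sketch of step 202a, the snippet below filters structured field values with regular expressions; the patterns and field names are hypothetical stand-ins for the preset rules, not the application's actual expressions:

```python
import re

# Hypothetical patterns standing in for the preset filtering rules.
PATTERNS = {
    "id_number": re.compile(r"\b\d{17}[\dXx]\b"),  # 18-character ID format
    "contact_phone": re.compile(r"\b1\d{10}\b"),   # 11-digit mobile number
}

def match_keywords(field_value):
    """Return the keyword labels whose pattern matches a structured field."""
    return [kw for kw, pattern in PATTERNS.items() if pattern.search(field_value)]

# One row of structured (two-dimensional table) data.
row = {"customer": "Zhang San", "phone": "13812345678"}
hits = {column: match_keywords(value) for column, value in row.items()}
```

The matched columns then serve as the keyword fields from which the keywords and their stored data are taken.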
202b. For unstructured data in the document data, determine keywords by matching against a preset keyword library, or extract keywords from the unstructured data with a preset keyword extraction algorithm.
The embodiment of the application provides two parallel implementation methods: one is to match against a preset keyword library to determine the keywords; the other is to extract keywords from the unstructured data with a preset keyword extraction algorithm. For the latter method, the embodiments of the present application illustrate two different schemes:
One scheme is as follows: extract candidate words from the document data, then process the candidate words with a preset classification model to determine whether each candidate word is a keyword.
This is in fact a supervised learning approach that treats keyword extraction as a binary classification problem: candidate words are extracted first, each candidate word is labeled as either "keyword" or "not a keyword", and a keyword-extraction classifier is trained. When a new document arrives, all candidate words are extracted, the trained classifier classifies each of them, and the candidates labeled as keywords are taken as the keywords.
The other scheme is as follows: first, extract candidate words from the document data and process them with a preset scoring model to obtain a score for each candidate word, the score representing the probability of the candidate word being a keyword; then, select the top-ranked preset number of candidate words as keywords according to the scores.
This scheme is in fact an unsupervised learning approach: candidate words are extracted first, each candidate word is scored, and the candidate words with the highest top-K scores are output as the keywords. Depending on the scoring strategy, different algorithms are available, such as TF-IDF and TextRank.
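For the unsupervised scheme, a minimal TF-IDF top-K sketch (with a toy tokenized corpus; not the application's implementation) might look like this:

```python
import math
from collections import Counter

def tfidf_top_k(docs, target_index, k):
    """Score the candidate words of one document by TF-IDF and
    return the k highest-scoring candidates."""
    target = docs[target_index]
    tf = Counter(target)
    def idf(word):
        doc_freq = sum(1 for doc in docs if word in doc)
        return math.log(len(docs) / doc_freq)
    scores = {word: (count / len(target)) * idf(word)
              for word, count in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

corpus = [
    ["grid", "load", "transformer", "fault"],
    ["grid", "customer", "id_number", "phone"],
    ["grid", "schedule", "maintenance", "fault"],
]
# "grid" appears in every document, so its IDF (and score) is 0
# and it is never selected as a keyword.
top = tfidf_top_k(corpus, target_index=1, k=2)
```

TextRank would replace the scoring function with a graph-based ranking over word co-occurrences, but the candidate-score-select pipeline stays the same.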
203. Process the keywords and their associated sensitive information with a preset machine learning model, and determine the labels corresponding to the keywords.
After the keywords and their associated sensitive information are extracted from the structured and unstructured data, type discrimination is performed on the keywords, and labels are determined for the keywords using a machine learning method, for example, classification with a Support Vector Machine (SVM).
Label classification here is a multi-class classification problem with k classes. When a support vector machine is used for this problem, each binary sub-problem treats one group of classes as positive samples and the other as negative samples.
The one-to-many (one-vs-rest) method takes the samples of class i as positive samples and all other classes as negative samples, training one support vector machine between these two groups; this method constructs k classification support vector machines in total. When a vector is tested, the class whose machine computes the maximum value is taken as the class of the vector.
The one-to-one (one-vs-one) method selects the sample data of class i and class j and trains a classification vector machine between these two classes, so k(k-1)/2 vector machines are constructed in total. Although the "one-to-one" method produces (k-1)/2 times as many classification vector machines as the "one-to-many" method, the training scale of each "one-to-one" machine is much smaller. A vector is tested by scoring: after the k(k-1)/2 classifiers are evaluated, the category with the highest score is selected as the category of the test data.
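The bookkeeping of the "one-to-one" strategy can be sketched as follows; the label names and pairwise winners are hypothetical, and in practice each winner would come from a trained binary SVM:

```python
from itertools import combinations

def one_vs_one_pairs(classes):
    """One binary SVM is trained per unordered class pair,
    giving k*(k-1)/2 classifiers in total."""
    return list(combinations(classes, 2))

def vote(pairwise_winners):
    """Majority vote over the pairwise winners, as in the testing step."""
    tally = {}
    for winner in pairwise_winners:
        tally[winner] = tally.get(winner, 0) + 1
    return max(tally, key=tally.get)

labels = ["name", "id_number", "phone", "address"]
pairs = one_vs_one_pairs(labels)          # 4 * 3 / 2 = 6 classifiers
# Hypothetical winners produced by the six pairwise classifiers
# for one test vector:
predicted = vote(["id_number", "id_number", "phone",
                  "id_number", "phone", "address"])
```

The (k-1)/2 ratio mentioned above falls out of the pair count: k(k-1)/2 pairwise machines versus k one-vs-rest machines.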
204. Process the plurality of labels, in combination with the document data, with a preset label distribution learning model to obtain the label probability distribution corresponding to the labels.
205. Determine the target sensitive data from the document data according to the label probability distribution.
The embodiment of the application first provides an implementation method for constructing the preset label distribution learning model, comprising: acquiring a sample data set that includes a plurality of sample instances, each corresponding to one label distribution; and, taking this label distribution as the starting point, iteratively training on the sample data to realize label distribution learning and thereby construct the preset label distribution learning model. An exemplary explanation follows:
the label distribution generating process consists of two stages of learning on a sample marked by the label distribution and generating the label distribution.
Stage 1: learning on a sample marked with label distribution;
after the data multi-label identification is completed, further learning on the label distribution labeled sample is realized. In this step, for each picture in the set, there is a set of label distributions corresponding to it, where k is the number of iterations, so the goal of this step is to learn the parameter matrix Θ of the label distribution learning model k . If KL divergence is used to measure the distance between two distributions, the optimal parameter matrix should be:
according to the principle of maximum entropy, the model can be expressed as:
there are many optimization methods that can solve the problems, such as conjugate gradient drop, quasi-newton method, etc. As with BFGS-LLD, the quasi-Newton method BFGS is employed herein, and its optimization involves mainly the first derivative, namely:
after the mark distribution learning is finished, the parameter matrix Θ k Can be used to predict tag distribution. In order to maintain consistency between tagged and untagged data, a tag distribution, called pseudo tag distribution, needs to be estimated for each untagged data. Here we use a simple and efficient KNN algorithm, that is, for each unlabeled data its pseudo-age is the mean label distribution of its K nearest labeled neighbors. In addition, to find the most appropriate neighbors, we modify the index that measures the sample difference to the Euclidean distance of the sample featureKL divergence from the predicted tag distribution. The distances for the unlabeled data and the labeled data are:
wherein C is the balance factor hyper-parameter. There are two points to note, one is that for non-tag data, the tag distribution characterizes the uncertainty for the given tag, and two is because Θ k The label distribution predicted by it changes each time during each iteration, which means that the pseudo-label distribution of the unlabeled data may also change. These make SALDL more robust than other commonly used semi-supervised learning methods.
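A minimal sketch of this pseudo-label-distribution step, under the assumption that the mixed distance is the feature Euclidean distance plus C times the KL divergence as described above (all data below are illustrative):

```python
import math

def kl_divergence(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mixed_distance(u_feat, u_pred, l_feat, l_dist, c):
    """Feature Euclidean distance plus C times the KL divergence between
    the unlabeled sample's predicted distribution and the neighbor's
    label distribution (C is the balance-factor hyper-parameter)."""
    return math.dist(u_feat, l_feat) + c * kl_divergence(u_pred, l_dist)

def pseudo_distribution(u_feat, u_pred, labeled, k, c=1.0):
    """Pseudo label distribution: the mean label distribution of the
    K nearest labeled neighbors under the mixed distance."""
    ranked = sorted(labeled,
                    key=lambda n: mixed_distance(u_feat, u_pred, n[0], n[1], c))
    nearest = [label_dist for _, label_dist in ranked[:k]]
    return [sum(column) / k for column in zip(*nearest)]

labeled = [([0.0, 0.0], [0.8, 0.2]),   # (features, label distribution)
           ([1.0, 0.0], [0.6, 0.4]),
           ([5.0, 5.0], [0.1, 0.9])]
pseudo = pseudo_distribution([0.5, 0.0], [0.5, 0.5], labeled, k=2)
```

Because the mean of valid distributions is itself a valid distribution, the pseudo label distribution automatically satisfies the sum-to-1 condition.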
Stage 2: generating label distribution;
sensitive data is classified using tag distribution learning, and the input at the time of training is a tag distribution dataset in which a sample instance corresponds to a tag distribution. The components in each tag distribution are a real number between 0 and 1, indicating how well the corresponding tag describes a sample instance, and a sum of 1, meaning that one tag distribution describes a sample instance in its entirety. After training, model parameters are obtained, feature vectors of test samples are input during testing, and corresponding label distribution can be predicted through the obtained model.
First, k sample features are randomly selected from the training set X as the initial mean vectors {μ_1, μ_2, ..., μ_k}. Then the spatial distance between each sample x_j in the training set and the mean vectors is calculated, and each sample is assigned to the cluster of the mean vector with the smallest distance. After that, the mean vector of each cluster is updated. The iteration continues in this way until the mean vector of every cluster remains unchanged, that is, the within-cluster distances are minimized, or the number of iterations reaches the set maximum. At this point the training of the model on the training set is complete.
Then the sample feature data of the test set are input, and the distance between each test sample and the mean vector of each training-set feature cluster is calculated, finally yielding a distance matrix T. The distances are computed using the Minkowski distance. In the algorithm, a new matrix T' is obtained by taking the reciprocal of each element of the Minkowski distance matrix T, so that the closer a sample is to a certain mean vector, the larger the weight it obtains in that dimension.
The elements of the matrix T' are normalized with a softmax function so that the weight values are all constrained between 0 and 1, yielding a weight matrix W that can convert the mean label-distribution vectors of the training-set samples into the predicted label distributions of the test-set samples. After the weight matrix W of the test-set label distribution is obtained, it is multiplied by the matrix U of mean label-distribution vectors of the training set, P = WU, where U = [μ_1, μ_2, ..., μ_k]. P is the desired predicted label distribution. At this point the prediction process ends.
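The reciprocal-then-softmax weighting and the product P = WU can be sketched as follows (toy numbers; in the application the real U and T would come from the trained clusters):

```python
import math

def softmax(values):
    """Normalize a row of weights so they lie in (0, 1) and sum to 1."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def predict_distributions(T, U):
    """T' = element-wise reciprocal of the distance matrix T; a row-wise
    softmax of T' gives the weight matrix W; and P = W U mixes the
    clusters' mean label distributions into per-sample predictions."""
    P = []
    for distances in T:
        weights = softmax([1.0 / d for d in distances])
        P.append([sum(w * mu[j] for w, mu in zip(weights, U))
                  for j in range(len(U[0]))])
    return P

U = [[0.9, 0.1],   # mean label distribution of cluster 1
     [0.2, 0.8]]   # mean label distribution of cluster 2
T = [[0.5, 4.0]]   # one test sample's distances to the two clusters
P = predict_distributions(T, U)
```

Since each weight row sums to 1 and each row of U sums to 1, every predicted distribution in P also sums to 1, as the description requires.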
The whole process is represented by the label distribution prediction flowchart shown in fig. 4. The basic steps of a specific implementation can be summarized as follows.
Input: minkowski distance parameter P, maximum number of clustering iterations ite, cluster number of clusters k and training set S= { (x) 1 ,D 1 ),(x 2 ,D 2 ),...,(x n ,D n ) Test sample feature set
Output: the predicted label distribution P of the test set samples.
Step 1: obtaining k cluster-initiated mean vectors { mu } 12 ,...,μ n And randomly selecting k sample feature vectors from the sample feature set H of the training set, and then making the initial clustering clusters empty.
Step 2: the iteration begins, first, in order to determine sample x j Which cluster it belongs to, need to calculate x j And each mean value vector mu i Distance d of (2) ij
Step 3: sample to be sampledThe X is j And dividing the cluster into the corresponding cluster with the smallest distance value, and updating the cluster.
Step 4: Compute the new mean vector of each cluster updated in step 3.
Step 5: Determine whether the iteration is finished: if none of the mean vectors changed, or the number of iterations has reached the preset maximum, the iteration ends and step 6 is performed; otherwise, return to step 2 and continue iterating.
Step 6: In the test stage, feed the test samples into the model and compute the distance between each input test sample and every mean vector; the distances from all test samples to all mean vectors form the matrix T.
Step 7: Compute the weight matrix: convert the distance matrix into weights, then normalize the weights to finally obtain the weight matrix W.
Step 8: Output the predicted label distribution result, quickly obtaining the predicted label distribution P of the test samples.
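Steps 1 to 5 above (the clustering stage) amount to a plain k-means loop. The following is a sketch under assumed names, using the Euclidean distance (Minkowski with p = 2) for brevity:

```python
import numpy as np

def cluster_training_features(H, k, ite, rng=None):
    """k-means over the training feature set H: returns the k mean vectors
    later used to weight the training-set label distributions."""
    rng = np.random.default_rng(rng)
    # Step 1: initialize means with k randomly chosen training samples
    means = H[rng.choice(len(H), size=k, replace=False)].copy()
    for _ in range(ite):                       # Step 5: iteration cap
        # Step 2: distance from every sample to every mean vector
        d = np.linalg.norm(H[:, None, :] - means[None, :, :], axis=-1)
        # Step 3: assign each sample to the cluster with the smallest distance
        assign = d.argmin(axis=1)
        # Step 4: recompute each cluster's mean vector (keep the old mean
        # if a cluster happens to be empty)
        new_means = np.array([H[assign == i].mean(axis=0)
                              if np.any(assign == i) else means[i]
                              for i in range(k)])
        if np.allclose(new_means, means):      # Step 5: means unchanged
            break
        means = new_means
    return means
```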
Therefore, in the embodiment of the application, the instance in the above description is the document data, and the label probability distribution gives the degree to which each label describes the document data, which is taken as its degree of sensitivity. Labels with higher probability can thus be identified as more sensitive, and target sensitive data of higher sensitivity can be determined from the document data based on the sensitive information associated with those labels.
Further, as an implementation of the methods shown in fig. 1 and fig. 3, an embodiment of the present application provides an apparatus for identifying power grid sensitive data. The apparatus embodiment corresponds to the foregoing method embodiment; for ease of reading, the details of the method embodiment are not repeated one by one, but it should be clear that the apparatus in this embodiment can correspondingly realize all of them. The apparatus serves to improve the quality of sensitive data identification and, as shown in fig. 5, comprises:
a first obtaining unit 31, configured to obtain document data to be processed from a power grid;
an extracting unit 32, configured to extract keywords from the document data by performing parsing processing on the document data, where the keywords are used to characterize associated sensitive information;
a first processing unit 33, configured to process the keywords and the sensitive information associated with the keywords by using a preset machine learning model, and determine a label corresponding to each keyword;
a second processing unit 34, configured to process a plurality of the labels by using a preset label distribution learning model in combination with the document data, so as to obtain a label probability distribution corresponding to the labels;
a determining unit 35, configured to determine target sensitive data from the document data according to the tag probability distribution.
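Read together, the five units form one linear pipeline. The following schematic sketch is an assumption about how they might be wired; the "preset" models themselves are passed in as stub callables, and the probability threshold is an illustrative decision rule the patent does not specify:

```python
class GridSensitiveDataIdentifier:
    """Mirrors units 31-35: acquire -> extract -> label -> distribute -> decide."""

    def __init__(self, extract_keywords, label_model, distribution_model,
                 threshold=0.5):
        self.extract_keywords = extract_keywords      # extracting unit 32
        self.label_model = label_model                # first processing unit 33
        self.distribution_model = distribution_model  # second processing unit 34
        self.threshold = threshold                    # assumed decision rule

    def identify(self, document):
        # first obtaining unit 31 would supply `document` from the grid
        keywords = self.extract_keywords(document)
        labels = [self.label_model(kw) for kw in keywords]
        # label probability distribution over the document, e.g. {label: prob}
        dist = self.distribution_model(document, labels)
        # determining unit 35: keep labels whose probability clears the bar
        return [lb for lb in labels if dist.get(lb, 0.0) >= self.threshold]
```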
Further, as shown in fig. 6, the document data includes structured data and unstructured data, and the extracting unit 32 includes:
a first processing module 321, configured to determine, for the structured data in the document data, a keyword from the structured data using a regular expression;
a second processing module 322, configured to match the unstructured data in the document data based on a preset keyword library to determine keywords, or extract the keywords from the unstructured data by using a preset keyword extraction algorithm.
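For the structured branch handled by the first processing module 321, a regular-expression pass might look like the following; the two patterns (a mainland mobile-phone number and an 18-digit ID number) are hypothetical examples, not the patent's actual field rules:

```python
import re

# Hypothetical patterns for sensitive fields in structured records;
# a real deployment would use the grid operator's own field rules.
PATTERNS = {
    "phone": re.compile(r"\b1[3-9]\d{9}\b"),       # mainland mobile number
    "id_card": re.compile(r"\b\d{17}[\dXx]\b"),    # 18-digit ID number
}

def extract_structured_keywords(record: dict) -> list[tuple[str, str]]:
    """Scan each field of a structured record and return (label, match) pairs."""
    hits = []
    for value in record.values():
        for name, pattern in PATTERNS.items():
            hits.extend((name, m) for m in pattern.findall(str(value)))
    return hits
```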
Further, as shown in fig. 6, the second processing module 322 is specifically further configured to: extract candidate words from the document data; and process the candidate words with a preset binary classification model to determine whether each candidate word is a keyword.
Further, as shown in fig. 6, the second processing module 322 is specifically further configured to: extract candidate words from the document data; process the candidate words with a preset scoring model to obtain a scoring value for each candidate word, where the scoring value characterizes the probability that the candidate word is a keyword; and select, in descending order of scoring value, a preset number of top-ranked candidate words as the keywords.
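The scoring branch reduces to "score every candidate, sort in descending order, keep the top N". A sketch with a stand-in scorer (the real preset scoring model is left unspecified by the patent):

```python
def top_k_keywords(candidates, score_fn, k):
    """Rank candidate words by their score (probability of being a keyword)
    and keep the k highest-ranked ones."""
    scored = sorted(candidates, key=score_fn, reverse=True)
    return scored[:k]

# Stand-in scorer: a fixed lookup table pretending to be the scoring model.
demo_scores = {"substation": 0.9, "the": 0.1, "load curve": 0.7}
```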
Further, as shown in fig. 6, before the plurality of labels are processed with the preset label distribution learning model in combination with the document data to obtain the label probability distributions corresponding to the labels, the apparatus further includes:
a second obtaining unit 36, configured to obtain a sample data set, where the sample data set includes a plurality of sample instances and each sample instance corresponds to one label distribution;
and a training unit 37, configured to iteratively train on the sample data, with the label distributions as initialization, to implement label distribution learning and thereby construct the preset label distribution learning model.
In summary, the embodiment of the application provides a method and an apparatus for identifying power grid sensitive data. The document data to be processed in the power grid is parsed and keywords are extracted from it, with sensitive information associated with the keywords. Labels are then determined from the keywords and their associated sensitive information, namely the sensitive labels associated with the document data. On this basis, the labels are processed with a preset label distribution learning model to obtain a label probability distribution, which characterizes the importance of each label in the document data (that is, its influence on the document data's degree of sensitivity), so that the target sensitive data contained in the document data can be determined. Compared with the prior art, which merely collects sensitive words and their associated sensitive information and therefore yields redundant sensitive data, the embodiment of the application uses label distribution learning to identify data of higher sensitivity from the large amount of complex data information in the power grid. Real, higher-quality sensitive data is thereby obtained, large amounts of redundant, low-value sensitive data are avoided, and a better solution for identifying power grid sensitive data is provided.
The identification device for the grid sensitive data comprises a processor and a memory, wherein the first acquisition unit, the extraction unit, the first processing unit, the second processing unit, the determination unit and the like are all stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor includes a kernel, and the kernel fetches the corresponding program units from the memory. One or more kernels may be provided. By adjusting kernel parameters, label distribution learning is used to identify data of higher sensitivity from the large amount of complex data information in the power grid, so that real, higher-quality sensitive data is obtained, large amounts of redundant, low-value sensitive data are avoided, and a better solution for identifying power grid sensitive data is provided.
An embodiment of the application provides a storage medium on which a program is stored; when executed by a processor, the program implements the method for identifying power grid sensitive data.
An embodiment of the application provides a processor configured to run a program, where the program, when run, executes the above method for identifying power grid sensitive data.
The application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the steps of the method for identifying power grid sensitive data.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, the device includes one or more processors (CPUs), memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like.
The memory may include forms in computer-readable media such as volatile memory, random access memory (RAM), and/or nonvolatile memory, for example read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip. Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.

Claims (10)

1. A method for identifying grid sensitive data, the method comprising:
acquiring document data to be processed from a power grid;
extracting keywords from the document data by analyzing the document data, wherein the keywords are used for representing associated sensitive information;
processing the keywords and the sensitive information associated with the keywords by using a preset machine learning model, and determining the label corresponding to each keyword;
combining the document data, and processing a plurality of labels by using a preset label distribution learning model to obtain label probability distribution corresponding to the labels;
and determining target sensitive data from the document data according to the label probability distribution.
2. The method according to claim 1, wherein the document data includes structured data and unstructured data, and the extracting keywords from the document data by parsing the document data includes:
determining keywords from the structured data by using regular expressions for the structured data in the document data;
and matching the unstructured data in the document data based on a preset keyword library to determine keywords, or extracting the keywords from the unstructured data by using a preset keyword extraction algorithm.
3. The method of claim 2, wherein extracting the keywords from the unstructured data using a preset keyword extraction algorithm comprises:
extracting candidate words from the document data;
and processing the candidate words by using a preset binary classification model to determine whether each candidate word is a keyword.
4. The method of claim 2, wherein extracting the keywords from the unstructured data using a preset keyword extraction algorithm comprises:
extracting candidate words from the document data;
processing the candidate words by using a preset scoring model to obtain scoring values corresponding to each candidate word, wherein the scoring values are used for representing the probability of the candidate word as the keyword;
and selecting, in descending order of scoring value, a preset number of top-ranked candidate words from the candidate words as the keywords.
5. The method of claim 1, wherein before said combining said document data and processing a plurality of said tags using a preset tag distribution learning model to obtain a tag probability distribution corresponding to said tags, said method further comprises:
obtaining a sample data set, wherein the sample data set comprises a plurality of sample instances, and each sample instance corresponds to one label distribution;
and with the label distributions as initialization, iteratively training on the sample data to implement label distribution learning so as to construct the preset label distribution learning model.
6. An identification device for grid sensitive data, the device comprising:
the first acquisition unit is used for acquiring document data to be processed from a power grid;
the extraction unit is used for extracting keywords from the document data through analysis processing of the document data, wherein the keywords are used for representing associated sensitive information;
the first processing unit is used for processing the keywords and the sensitive information associated with the keywords by using a preset machine learning model and determining the label corresponding to each keyword;
the second processing unit is used for processing a plurality of labels by combining the document data and utilizing a preset label distribution learning model to obtain label probability distribution corresponding to the labels;
and the determining unit is used for determining the target sensitive data from the document data according to the label probability distribution.
7. The apparatus according to claim 6, wherein the document data includes structured data and unstructured data therein, and the extracting unit includes:
the first processing module is used for determining keywords from the structured data by using a regular expression for the structured data in the document data;
and the second processing module is used for matching the unstructured data in the document data based on a preset keyword library to determine keywords or extracting the keywords from the unstructured data by using a preset keyword extraction algorithm.
8. The apparatus of claim 7, wherein the second processing module is further specifically configured to: extract candidate words from the document data; and process the candidate words by using a preset binary classification model to determine whether each candidate word is a keyword.
9. A storage medium comprising a stored program, wherein the program, when run, controls a device in which the storage medium is located to perform the method of identifying grid sensitive data according to any one of claims 1-5.
10. An electronic device comprising at least one processor, and at least one memory and a bus connected to the processor;
the processor and the memory complete communication with each other through the bus;
the processor is configured to invoke program instructions in the memory to perform the method of identifying grid sensitive data as claimed in any of claims 1-5.
CN202310778166.5A 2023-06-28 2023-06-28 Identification method and device for power grid sensitive data Pending CN117009596A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310778166.5A CN117009596A (en) 2023-06-28 2023-06-28 Identification method and device for power grid sensitive data


Publications (1)

Publication Number Publication Date
CN117009596A true CN117009596A (en) 2023-11-07

Family

ID=88560980


Country Status (1)

Country Link
CN (1) CN117009596A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191275A (en) * 2019-11-28 2020-05-22 深圳云安宝科技有限公司 Sensitive data identification method, system and device
WO2020215571A1 (en) * 2019-04-25 2020-10-29 平安科技(深圳)有限公司 Sensitive data identification method and device, storage medium, and computer apparatus
US20210133279A1 (en) * 2019-11-04 2021-05-06 Adobe Inc. Utilizing a neural network to generate label distributions for text emphasis selection
CN113065330A (en) * 2021-03-22 2021-07-02 四川大学 Method for extracting sensitive information from unstructured data
CN113962302A (en) * 2021-10-20 2022-01-21 全球能源互联网研究院有限公司 Sensitive data intelligent identification method based on label distribution learning
US20220405274A1 (en) * 2021-06-17 2022-12-22 Huawei Technologies Co., Ltd. Method and system for detecting sensitive data


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
XIN GENG: "Label Distribution Learning", IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 7, 23 March 2016 (2016-03-23) *
尚芳剑; 来骥; 李信; 周巍: "Research on encryption of electric power Internet of Things data based on convolutional feature vectors", Information Technology, no. 04, 25 April 2021 (2021-04-25) *
王士元, 彭刚: "Language, Speech and Technology", Shanghai Education Press, 31 August 2006 *
龙宣东: "Research on multi-label feature selection algorithms based on label distribution and cost sensitivity", China Masters' Theses Full-text Database, Information Science and Technology, 15 April 2022 (2022-04-15) *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination