CN117009596A - Identification method and device for power grid sensitive data - Google Patents


Info

Publication number
CN117009596A
CN117009596A (application CN202310778166.5A)
Authority
CN
China
Prior art keywords
data
keywords
document data
preset
sensitive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310778166.5A
Other languages
Chinese (zh)
Inventor
那琼澜
苏丹
来骥
张实君
杨艺西
任建伟
马跃
邢宁哲
庞思睿
曽婧
李硕
徐相森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
Information and Telecommunication Branch of State Grid Jibei Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
Information and Telecommunication Branch of State Grid Jibei Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, Information and Telecommunication Branch of State Grid Jibei Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN202310778166.5A priority Critical patent/CN117009596A/en
Publication of CN117009596A publication Critical patent/CN117009596A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/906Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a method and a device for identifying power grid sensitive data, relating to the technical field of power grid security reinforcement. The method improves the quality of the identified sensitive data and avoids collecting a large amount of redundant, low-value sensitive data, thereby providing a better solution for identifying power grid sensitive data. The main technical scheme of the application is as follows: acquiring document data to be processed from a power grid; extracting keywords from the document data by analyzing it, wherein the keywords characterize associated sensitive information; processing the keywords and their associated sensitive information with a preset machine learning model to determine the label corresponding to each keyword; processing the resulting labels, in combination with the document data, with a preset label distribution learning model to obtain the label probability distribution corresponding to the labels; and determining the target sensitive data from the document data according to the label probability distribution.

Description

Identification method and device for power grid sensitive data
Technical Field
The application relates to the technical field of power grid safety reinforcement, in particular to a method and a device for identifying power grid sensitive data.
Background
The power grid carries a large volume of data whose types and characteristics are complex and varied, and sensitive data identification is a precondition of data security protection. Current sensitive data recognition technology can recognize certain sensitive words, such as customer name, identification number, contact phone, and residence address. However, the same sensitive word contributes a different degree of sensitivity in different document instances; for example, in a specific scene, the name may have low sensitivity while the identification number has high sensitivity.
Therefore, if all data related to sensitive words are simply collected as sensitive data, it is difficult to distinguish how sensitive the content is in different document instances, and a large amount of redundant data is obtained (for example, if a name is shared by very many people, that name has little value as sensitive data). It then becomes difficult to identify, among the large amount of complex data in a power grid, which items are truly highly sensitive. How to identify genuinely valuable sensitive data from the power grid is thus a problem that currently needs to be solved.
Disclosure of Invention
In view of the above, the present application provides a method and an apparatus for identifying power grid sensitive data, which use label distribution learning to identify highly sensitive data from the large amount of complex data in the power grid, thereby obtaining genuine, higher-quality sensitive data, avoiding the collection of large amounts of redundant, low-value sensitive data, and providing a better solution for identifying power grid sensitive data.
In order to achieve the above purpose, the present application mainly provides the following technical solutions:
the first aspect of the application provides a method for identifying power grid sensitive data, which comprises the following steps:
acquiring document data to be processed from a power grid;
extracting keywords from the document data by analyzing the document data, wherein the keywords are used for representing associated sensitive information;
processing the keywords and the sensitive information associated with the keywords by using a preset machine learning model, and judging the labels corresponding to the keywords;
processing a plurality of labels, in combination with the document data, by using a preset label distribution learning model to obtain label probability distribution corresponding to the labels;
and determining target sensitive data from the document data according to the label probability distribution.
A second aspect of the present application provides an apparatus for identifying power grid sensitive data, the apparatus comprising:
the first acquisition unit is used for acquiring document data to be processed from a power grid;
the extraction unit is used for extracting keywords from the document data through analysis processing of the document data, wherein the keywords are used for representing associated sensitive information;
the first processing unit is used for processing the keywords and the sensitive information associated with the keywords by using a preset machine learning model and judging the labels corresponding to the keywords;
the second processing unit is used for processing a plurality of labels by combining the document data and utilizing a preset label distribution learning model to obtain label probability distribution corresponding to the labels;
and the determining unit is used for judging and determining the target sensitive data from the document data according to the tag probability distribution.
A third aspect of the present application provides a storage medium that includes a stored program, wherein the program, when executed, controls a device on which the storage medium is located to perform the method for identifying power grid sensitive data described above.
A fourth aspect of the application provides an electronic device comprising at least one processor, at least one memory connected to the processor, and a bus;
the processor and the memory communicate with each other through the bus;
the processor is used to call the program instructions in the memory to execute the method for identifying power grid sensitive data described above.
By means of the technical scheme, the technical scheme provided by the application has at least the following advantages:
the application provides a method and a device for identifying sensitive data of a power grid, which are used for analyzing the document data to be processed in the power grid, extracting keywords from the document data, correspondingly associating some sensitive information with the keywords, judging labels based on the keywords and the associated sensitive information, namely the sensitive labels associated with the document data, processing the labels by using a preset label distribution learning model on the basis, and obtaining label probability distribution which is used for representing the importance degree of the labels in the document data (namely the influence on the sensitivity degree of the document data), thereby further judging and determining target sensitive data contained in the document data. Compared with the technical problem that the prior art only collects sensitive word associated sensitive information to cause the redundancy of the obtained sensitive data, the method and the device for identifying the sensitive data by using the label distribution learning are used for identifying the sensitive data with higher sensitivity from a large amount of complex data information in the power grid, so that the real sensitive data with higher quality are obtained, the obtaining of a large amount of redundant sensitive data with low value is avoided, and a better solution for identifying the sensitive data of the power grid is provided.
The foregoing description is only an overview of the technical solution of the present application. It is provided so that the technical means of the present application can be understood clearly enough to be implemented according to the contents of the specification, and so that the above and other objects, features, and advantages of the present application become more apparent and understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a flowchart of a method for identifying grid sensitive data according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a training preset tag distribution learning model, as exemplified by an embodiment of the present application;
FIG. 3 is a flowchart of another method for identifying grid sensitive data according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a label distribution prediction flow according to an embodiment of the present application;
fig. 5 is a block diagram of an identification device for power grid sensitive data according to an embodiment of the present application;
fig. 6 is a block diagram of another device for identifying grid sensitive data according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the application to those skilled in the art.
The embodiment of the application provides a method for identifying power grid sensitive data, as shown in fig. 1, and the method comprises the following specific steps:
101. Acquire the document data to be processed from the power grid.
102. Extract keywords from the document data by analyzing it, where the keywords characterize associated sensitive information.
In embodiments of the present application, keywords may also be referred to as "sensitive words" that characterize the associated sensitive information, such as keywords "customer name", "identification number", "contact phone", "residence address", and so forth.
103. Process the keywords and their associated sensitive information with a preset machine learning model, and determine the labels corresponding to the keywords.
In the embodiment of the application, the corresponding label can be further determined based on the keywords and their associated sensitive information; the method of determination is not limited.
104. Process the plurality of labels, in combination with the document data, with a preset label distribution learning model to obtain the label probability distribution corresponding to the labels.
Here, label distribution learning (LDL) is, in short, a multi-label data identification method based on a model that describes data through a label distribution.
For many real-world problems, the importance of different labels tends to differ. For example, a piece of user electricity data is marked with several labels such as user name, user location, electricity consumption amount, and electricity consumption time, each describing the information to a different degree. Likewise, in power grid sensitive data, a complex data item is often the result of a mixture of multiple pieces of basic information (such as time, place, business, and user), each of which can be represented by a label, and these labels often express different intensities in a specific data instance, producing a complex meaning. Similar examples abound: once an instance is associated with multiple labels at the same time, the labels are typically not equally important for the instance, but rather stand in a primary-and-secondary relationship.
Therefore, the embodiment of the application constructs a preset label distribution learning model based on a label distribution learning algorithm, and is used for processing the labels in the document data so as to obtain label probability distribution corresponding to the labels.
One application example is as follows: for an instance x, a real number d_x^y is assigned to each possible label y, indicating the degree to which y describes x. Without loss of generality, assume d_x^y ∈ [0, 1]. Further assume that the label set is complete, i.e., all labels in the set together fully describe an instance, so that the sum over all labels satisfies Σ_y d_x^y = 1. A value d_x^y satisfying these two conditions is called the description degree of y for x. For one instance, the description degrees of all labels constitute a data structure resembling a probability distribution, and are therefore called the label distribution, while the process of learning on a data set labeled with label distributions is called label distribution learning.
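The two conditions above (degrees in [0, 1] that sum to 1) can be sketched in a few lines of Python; the labels and relevance scores below are illustrative, not taken from the application:

```python
def make_label_distribution(raw_scores):
    """Normalize non-negative relevance scores into description degrees
    that lie in [0, 1] and sum to 1 over the complete label set."""
    total = sum(raw_scores.values())
    return {label: score / total for label, score in raw_scores.items()}

# Hypothetical document instance in which the identification number is
# far more sensitive than the customer name.
dist = make_label_distribution({"name": 1.0, "id_number": 6.0, "phone": 3.0})
```

Every label then carries a continuous degree rather than a binary relevant/irrelevant flag, which is the property the following steps rely on.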
For example, as shown in FIG. 2, an embodiment of the present application further illustrates a schematic flow chart of training a preset tag distribution learning model.
105. Determine the target sensitive data from the document data according to the label probability distribution.
The label probability distribution is equivalent to the degree to which each label describes the sample, expressed as a real value between 0 and 1. The learned label distribution therefore reflects not only whether a label is related to the sample (0 means unrelated; non-zero means related) but also how well the label describes the sample.
In other words, whereas a plain "0" or "1" only indicates whether a label is related to an instance, label distribution learning expresses a label's description degree of the instance as a specific continuous value; all labels with a non-zero degree are related, and the entire label distribution set completely describes the instance.
Therefore, in the embodiment of the application, the instance above is the "document data", and the label probability distribution gives the description degree between each label and the document data. Here the description degree is equated with the sensitivity degree, so labels with higher sensitivity can be identified by their higher probability, and the target sensitive data with higher sensitivity can then be determined from the document data based on the sensitive information associated with those labels.
Compared with the prior art, which only collects the sensitive information associated with sensitive words and therefore yields redundant sensitive data, the embodiment of the application uses label distribution learning to identify highly sensitive data from the large amount of complex data in the power grid, thereby obtaining genuine, higher-quality sensitive data and avoiding the collection of large amounts of redundant, low-value sensitive data, so that a better solution for identifying power grid sensitive data is provided.
To describe the above embodiment in more detail, the embodiment of the present application further provides another method for identifying power grid sensitive data, as shown in fig. 3, with the following specific steps:
201. Acquire the document data to be processed from the power grid.
The document data includes structured data and unstructured data. In short, structured data, such as row data stored in a database, is data that can be logically represented by a two-dimensional table structure; data that cannot conveniently be represented by a two-dimensional database table is referred to as unstructured data, and includes office documents in all formats, texts, pictures, XML, HTML, various reports, and image and audio/video information.
202a. For structured data in the document data, determine keywords from the structured data using regular expressions.
A regular expression is commonly used to retrieve and replace text that matches a certain pattern (rule). It is a logical formula for operating on character strings: a "rule string" is formed from predefined specific characters and combinations of them, and this "rule string" expresses the filtering logic to be applied to character strings.
In the embodiment of the application, based on the characteristics of the structured data, the keyword segments in the document data can be filtered and screened based on the regular expression, so that the keywords are further determined based on the data information stored on the keyword segments.
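As a minimal sketch of step 202a, the snippet below filters structured field values with regular expressions; the patterns and field names are hypothetical stand-ins for the preset rules, not the application's actual expressions:

```python
import re

# Hypothetical patterns standing in for the preset filtering rules.
PATTERNS = {
    "id_number": re.compile(r"\b\d{17}[\dXx]\b"),  # 18-character ID format
    "contact_phone": re.compile(r"\b1\d{10}\b"),   # 11-digit mobile number
}

def match_keywords(field_value):
    """Return the keyword labels whose pattern matches a structured field."""
    return [kw for kw, pattern in PATTERNS.items() if pattern.search(field_value)]

# One row of structured (two-dimensional table) data.
row = {"customer": "Zhang San", "phone": "13812345678"}
hits = {column: match_keywords(value) for column, value in row.items()}
```

The matched columns then serve as the keyword fields from which the keywords and their stored data are taken.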
202b. For unstructured data in the document data, determine keywords by matching against a preset keyword library, or extract keywords from the unstructured data with a preset keyword extraction algorithm.
The embodiment of the application provides two parallel implementation methods: one is to match against a preset keyword library to determine the keywords; the other is to extract keywords from the unstructured data with a preset keyword extraction algorithm. For the latter method, the embodiments of the present application illustrate two different schemes:
One scheme is as follows: extract candidate words from the document data, then process the candidate words with a preset classification model to determine whether each candidate word is a keyword.
This is in fact a supervised learning approach that treats keyword extraction as a binary classification problem: candidate words are extracted first, each candidate word is labeled as either "keyword" or "not a keyword", and a keyword-extraction classifier is trained. When a new document arrives, all candidate words are extracted, the trained classifier classifies each of them, and the candidates labeled as keywords are taken as the keywords.
The other scheme is as follows: first, extract candidate words from the document data and process them with a preset scoring model to obtain a score for each candidate word, the score representing the probability of the candidate word being a keyword; then, select the top-ranked preset number of candidate words as keywords according to the scores.
This scheme is in fact an unsupervised learning approach: candidate words are extracted first, each candidate word is scored, and the candidate words with the highest top-K scores are output as the keywords. Depending on the scoring strategy, different algorithms are available, such as TF-IDF and TextRank.
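For the unsupervised scheme, a minimal TF-IDF top-K sketch (with a toy tokenized corpus; not the application's implementation) might look like this:

```python
import math
from collections import Counter

def tfidf_top_k(docs, target_index, k):
    """Score the candidate words of one document by TF-IDF and
    return the k highest-scoring candidates."""
    target = docs[target_index]
    tf = Counter(target)
    def idf(word):
        doc_freq = sum(1 for doc in docs if word in doc)
        return math.log(len(docs) / doc_freq)
    scores = {word: (count / len(target)) * idf(word)
              for word, count in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

corpus = [
    ["grid", "load", "transformer", "fault"],
    ["grid", "customer", "id_number", "phone"],
    ["grid", "schedule", "maintenance", "fault"],
]
# "grid" appears in every document, so its IDF (and score) is 0
# and it is never selected as a keyword.
top = tfidf_top_k(corpus, target_index=1, k=2)
```

TextRank would replace the scoring function with a graph-based ranking over word co-occurrences, but the candidate-score-select pipeline stays the same.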
203. Process the keywords and their associated sensitive information with a preset machine learning model, and determine the labels corresponding to the keywords.
After the keywords and their associated sensitive information are extracted from the structured and unstructured data, type discrimination is performed on the keywords, and labels are determined for the keywords using a machine learning method, for example, classification with a Support Vector Machine (SVM).
Label classification here is a multi-class classification problem with k classes. When a support vector machine is used for this problem, each binary sub-problem treats one group of classes as positive samples and the other as negative samples.
The one-to-many (one-vs-rest) method takes the samples of class i as positive samples and all other classes as negative samples, training one support vector machine between these two groups; this method constructs k classification support vector machines in total. When a vector is tested, the class whose machine computes the maximum value is taken as the class of the vector.
The one-to-one (one-vs-one) method selects the sample data of class i and class j and trains a classification vector machine between these two classes, so k(k-1)/2 vector machines are constructed in total. Although the "one-to-one" method produces (k-1)/2 times as many classification vector machines as the "one-to-many" method, the training scale of each "one-to-one" machine is much smaller. A vector is tested by scoring: after the k(k-1)/2 classifiers are evaluated, the category with the highest score is selected as the category of the test data.
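The bookkeeping of the "one-to-one" strategy can be sketched as follows; the label names and pairwise winners are hypothetical, and in practice each winner would come from a trained binary SVM:

```python
from itertools import combinations

def one_vs_one_pairs(classes):
    """One binary SVM is trained per unordered class pair,
    giving k*(k-1)/2 classifiers in total."""
    return list(combinations(classes, 2))

def vote(pairwise_winners):
    """Majority vote over the pairwise winners, as in the testing step."""
    tally = {}
    for winner in pairwise_winners:
        tally[winner] = tally.get(winner, 0) + 1
    return max(tally, key=tally.get)

labels = ["name", "id_number", "phone", "address"]
pairs = one_vs_one_pairs(labels)          # 4 * 3 / 2 = 6 classifiers
# Hypothetical winners produced by the six pairwise classifiers
# for one test vector:
predicted = vote(["id_number", "id_number", "phone",
                  "id_number", "phone", "address"])
```

The (k-1)/2 ratio mentioned above falls out of the pair count: k(k-1)/2 pairwise machines versus k one-vs-rest machines.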
204. Process the plurality of labels, in combination with the document data, with a preset label distribution learning model to obtain the label probability distribution corresponding to the labels.
205. Determine the target sensitive data from the document data according to the label probability distribution.
The embodiment of the application first provides an implementation method for constructing the preset label distribution learning model, comprising: acquiring a sample data set that includes a plurality of sample instances, each corresponding to one label distribution; and, taking this label distribution as the starting point, iteratively training on the sample data to realize label distribution learning and thereby construct the preset label distribution learning model. An exemplary explanation follows:
the label distribution generating process consists of two stages of learning on a sample marked by the label distribution and generating the label distribution.
Stage 1: learning on a sample marked with label distribution;
after the data multi-label identification is completed, further learning on the label distribution labeled sample is realized. In this step, for each picture in the set, there is a set of label distributions corresponding to it, where k is the number of iterations, so the goal of this step is to learn the parameter matrix Θ of the label distribution learning model k . If KL divergence is used to measure the distance between two distributions, the optimal parameter matrix should be:
according to the principle of maximum entropy, the model can be expressed as:
there are many optimization methods that can solve the problems, such as conjugate gradient drop, quasi-newton method, etc. As with BFGS-LLD, the quasi-Newton method BFGS is employed herein, and its optimization involves mainly the first derivative, namely:
after the mark distribution learning is finished, the parameter matrix Θ k Can be used to predict tag distribution. In order to maintain consistency between tagged and untagged data, a tag distribution, called pseudo tag distribution, needs to be estimated for each untagged data. Here we use a simple and efficient KNN algorithm, that is, for each unlabeled data its pseudo-age is the mean label distribution of its K nearest labeled neighbors. In addition, to find the most appropriate neighbors, we modify the index that measures the sample difference to the Euclidean distance of the sample featureKL divergence from the predicted tag distribution. The distances for the unlabeled data and the labeled data are:
wherein C is the balance factor hyper-parameter. There are two points to note, one is that for non-tag data, the tag distribution characterizes the uncertainty for the given tag, and two is because Θ k The label distribution predicted by it changes each time during each iteration, which means that the pseudo-label distribution of the unlabeled data may also change. These make SALDL more robust than other commonly used semi-supervised learning methods.
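A minimal sketch of this pseudo-label-distribution step, under the assumption that the mixed distance is the feature Euclidean distance plus C times the KL divergence as described above (all data below are illustrative):

```python
import math

def kl_divergence(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mixed_distance(u_feat, u_pred, l_feat, l_dist, c):
    """Feature Euclidean distance plus C times the KL divergence between
    the unlabeled sample's predicted distribution and the neighbor's
    label distribution (C is the balance-factor hyper-parameter)."""
    return math.dist(u_feat, l_feat) + c * kl_divergence(u_pred, l_dist)

def pseudo_distribution(u_feat, u_pred, labeled, k, c=1.0):
    """Pseudo label distribution: the mean label distribution of the
    K nearest labeled neighbors under the mixed distance."""
    ranked = sorted(labeled,
                    key=lambda n: mixed_distance(u_feat, u_pred, n[0], n[1], c))
    nearest = [label_dist for _, label_dist in ranked[:k]]
    return [sum(column) / k for column in zip(*nearest)]

labeled = [([0.0, 0.0], [0.8, 0.2]),   # (features, label distribution)
           ([1.0, 0.0], [0.6, 0.4]),
           ([5.0, 5.0], [0.1, 0.9])]
pseudo = pseudo_distribution([0.5, 0.0], [0.5, 0.5], labeled, k=2)
```

Because the mean of valid distributions is itself a valid distribution, the pseudo label distribution automatically satisfies the sum-to-1 condition.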
Stage 2: generating label distribution;
sensitive data is classified using tag distribution learning, and the input at the time of training is a tag distribution dataset in which a sample instance corresponds to a tag distribution. The components in each tag distribution are a real number between 0 and 1, indicating how well the corresponding tag describes a sample instance, and a sum of 1, meaning that one tag distribution describes a sample instance in its entirety. After training, model parameters are obtained, feature vectors of test samples are input during testing, and corresponding label distribution can be predicted through the obtained model.
First, k sample features are randomly selected from the training set X as the initial mean vectors {μ_1, μ_2, ..., μ_k}. Then the spatial distance between each sample x_j in the training set and the mean vectors is calculated, and each sample is assigned to the cluster of the mean vector with the smallest distance. After that, the mean vector of each cluster is updated. The iteration continues in this way until the mean vector of every cluster remains unchanged, that is, the within-cluster distances are minimized, or the number of iterations reaches the set maximum. At this point the training of the model on the training set is complete.
Then the sample feature data of the test set are input, and the distance between each test sample and the mean vector of each training-set feature cluster is calculated, finally yielding a distance matrix T. The distances are computed using the Minkowski distance. In the algorithm, a new matrix T' is obtained by taking the reciprocal of each element of the Minkowski distance matrix T, so that the closer a sample is to a certain mean vector, the larger the weight it obtains in that dimension.
The elements of the matrix T' are normalized with a softmax function so that the weight values are all constrained between 0 and 1, yielding a weight matrix W that can convert the mean label-distribution vectors of the training-set samples into the predicted label distributions of the test-set samples. After the weight matrix W of the test-set label distribution is obtained, it is multiplied by the matrix U of mean label-distribution vectors of the training set, P = WU, where U = [μ_1, μ_2, ..., μ_k]. P is the desired predicted label distribution. At this point the prediction process ends.
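The reciprocal-then-softmax weighting and the product P = WU can be sketched as follows (toy numbers; in the application the real U and T would come from the trained clusters):

```python
import math

def softmax(values):
    """Normalize a row of weights so they lie in (0, 1) and sum to 1."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def predict_distributions(T, U):
    """T' = element-wise reciprocal of the distance matrix T; a row-wise
    softmax of T' gives the weight matrix W; and P = W U mixes the
    clusters' mean label distributions into per-sample predictions."""
    P = []
    for distances in T:
        weights = softmax([1.0 / d for d in distances])
        P.append([sum(w * mu[j] for w, mu in zip(weights, U))
                  for j in range(len(U[0]))])
    return P

U = [[0.9, 0.1],   # mean label distribution of cluster 1
     [0.2, 0.8]]   # mean label distribution of cluster 2
T = [[0.5, 4.0]]   # one test sample's distances to the two clusters
P = predict_distributions(T, U)
```

Since each weight row sums to 1 and each row of U sums to 1, every predicted distribution in P also sums to 1, as the description requires.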
The whole process is represented by the label distribution prediction flowchart shown in fig. 4. The basic steps of a specific implementation can be summarized as follows.
Input: minkowski distance parameter P, maximum number of clustering iterations ite, cluster number of clusters k and training set S= { (x) 1 ,D 1 ),(x 2 ,D 2 ),...,(x n ,D n ) Test sample feature set
Output: the predicted label distribution P of the test set samples.
Step 1: obtaining k cluster-initiated mean vectors { mu } 12 ,...,μ n And randomly selecting k sample feature vectors from the sample feature set H of the training set, and then making the initial clustering clusters empty.
Step 2: the iteration begins, first, in order to determine sample x j Which cluster it belongs to, need to calculate x j And each mean value vector mu i Distance d of (2) ij
Step 3: sample to be sampledThe X is j And dividing the cluster into the corresponding cluster with the smallest distance value, and updating the cluster.
Step 4: Compute the new mean vector of each cluster updated in step 3.
Step 5: Determine whether the iteration is finished: if none of the mean vectors changed, or the number of iterations has reached the preset maximum, the iteration ends and step 6 is performed; otherwise, return to step 2 and continue iterating.
Step 6: In the test stage, feed the test samples into the model and compute the distance between each input test sample and every mean vector; the distances from all test samples to all mean vectors form the matrix T.
Step 7: Compute the weight matrix: convert the distance matrix into weights, then normalize the weights to finally obtain the weight matrix W.
Step 8: Output the predicted label distribution result, quickly obtaining the predicted label distribution P of the test samples.
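Steps 1 to 5 above (the clustering stage) amount to a plain k-means loop. The following is a sketch under assumed names, using the Euclidean distance (Minkowski with p = 2) for brevity:

```python
import numpy as np

def cluster_training_features(H, k, ite, rng=None):
    """k-means over the training feature set H: returns the k mean vectors
    later used to weight the training-set label distributions."""
    rng = np.random.default_rng(rng)
    # Step 1: initialize means with k randomly chosen training samples
    means = H[rng.choice(len(H), size=k, replace=False)].copy()
    for _ in range(ite):                       # Step 5: iteration cap
        # Step 2: distance from every sample to every mean vector
        d = np.linalg.norm(H[:, None, :] - means[None, :, :], axis=-1)
        # Step 3: assign each sample to the cluster with the smallest distance
        assign = d.argmin(axis=1)
        # Step 4: recompute each cluster's mean vector (keep the old mean
        # if a cluster happens to be empty)
        new_means = np.array([H[assign == i].mean(axis=0)
                              if np.any(assign == i) else means[i]
                              for i in range(k)])
        if np.allclose(new_means, means):      # Step 5: means unchanged
            break
        means = new_means
    return means
```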
Therefore, in the embodiment of the application, the instance in the above description is the document data, and the label probability distribution gives the degree to which each label describes the document data, which is taken as its degree of sensitivity. Labels with higher probability can thus be identified as more sensitive, and target sensitive data of higher sensitivity can be determined from the document data based on the sensitive information associated with those labels.
Further, as an implementation of the methods shown in fig. 1 and fig. 3, an embodiment of the present application provides an apparatus for identifying power grid sensitive data. The apparatus embodiment corresponds to the foregoing method embodiment; for ease of reading, the details of the method embodiment are not repeated one by one, but it should be clear that the apparatus in this embodiment can correspondingly realize all of them. The apparatus serves to improve the quality of sensitive data identification and, as shown in fig. 5, comprises:
a first obtaining unit 31, configured to obtain document data to be processed from a power grid;
an extracting unit 32, configured to extract keywords from the document data by performing parsing processing on the document data, where the keywords are used to characterize associated sensitive information;
a first processing unit 33, configured to process the keywords and the sensitive information associated with the keywords by using a preset machine learning model, and determine a label corresponding to each keyword;
a second processing unit 34, configured to process a plurality of the labels by using a preset label distribution learning model in combination with the document data, so as to obtain a label probability distribution corresponding to the labels;
a determining unit 35, configured to determine target sensitive data from the document data according to the tag probability distribution.
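Read together, the five units form one linear pipeline. The following schematic sketch is an assumption about how they might be wired; the "preset" models themselves are passed in as stub callables, and the probability threshold is an illustrative decision rule the patent does not specify:

```python
class GridSensitiveDataIdentifier:
    """Mirrors units 31-35: acquire -> extract -> label -> distribute -> decide."""

    def __init__(self, extract_keywords, label_model, distribution_model,
                 threshold=0.5):
        self.extract_keywords = extract_keywords      # extracting unit 32
        self.label_model = label_model                # first processing unit 33
        self.distribution_model = distribution_model  # second processing unit 34
        self.threshold = threshold                    # assumed decision rule

    def identify(self, document):
        # first obtaining unit 31 would supply `document` from the grid
        keywords = self.extract_keywords(document)
        labels = [self.label_model(kw) for kw in keywords]
        # label probability distribution over the document, e.g. {label: prob}
        dist = self.distribution_model(document, labels)
        # determining unit 35: keep labels whose probability clears the bar
        return [lb for lb in labels if dist.get(lb, 0.0) >= self.threshold]
```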
Further, as shown in fig. 6, the document data includes structured data and unstructured data, and the extracting unit 32 includes:
a first processing module 321, configured to determine, for the structured data in the document data, a keyword from the structured data using a regular expression;
a second processing module 322, configured to match the unstructured data in the document data based on a preset keyword library to determine keywords, or extract the keywords from the unstructured data by using a preset keyword extraction algorithm.
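For the structured branch handled by the first processing module 321, a regular-expression pass might look like the following; the two patterns (a mainland mobile-phone number and an 18-digit ID number) are hypothetical examples, not the patent's actual field rules:

```python
import re

# Hypothetical patterns for sensitive fields in structured records;
# a real deployment would use the grid operator's own field rules.
PATTERNS = {
    "phone": re.compile(r"\b1[3-9]\d{9}\b"),       # mainland mobile number
    "id_card": re.compile(r"\b\d{17}[\dXx]\b"),    # 18-digit ID number
}

def extract_structured_keywords(record: dict) -> list[tuple[str, str]]:
    """Scan each field of a structured record and return (label, match) pairs."""
    hits = []
    for value in record.values():
        for name, pattern in PATTERNS.items():
            hits.extend((name, m) for m in pattern.findall(str(value)))
    return hits
```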
Further, as shown in fig. 6, the second processing module 322 is specifically further configured to: extract candidate words from the document data; and process the candidate words with a preset binary classification model to determine whether each candidate word is a keyword.
Further, as shown in fig. 6, the second processing module 322 is specifically further configured to: extract candidate words from the document data; process the candidate words with a preset scoring model to obtain a scoring value for each candidate word, where the scoring value characterizes the probability that the candidate word is a keyword; and select, in descending order of scoring value, a preset number of top-ranked candidate words as the keywords.
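The scoring branch reduces to "score every candidate, sort in descending order, keep the top N". A sketch with a stand-in scorer (the real preset scoring model is left unspecified by the patent):

```python
def top_k_keywords(candidates, score_fn, k):
    """Rank candidate words by their score (probability of being a keyword)
    and keep the k highest-ranked ones."""
    scored = sorted(candidates, key=score_fn, reverse=True)
    return scored[:k]

# Stand-in scorer: a fixed lookup table pretending to be the scoring model.
demo_scores = {"substation": 0.9, "the": 0.1, "load curve": 0.7}
```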
Further, as shown in fig. 6, before the plurality of labels are processed with the preset label distribution learning model in combination with the document data to obtain the label probability distributions corresponding to the labels, the apparatus further includes:
a second obtaining unit 36, configured to obtain a sample data set, where the sample data set includes a plurality of sample instances and each sample instance corresponds to one label distribution;
and a training unit 37, configured to iteratively train on the sample data, with the label distributions as initialization, to implement label distribution learning and thereby construct the preset label distribution learning model.
In summary, the embodiment of the application provides a method and an apparatus for identifying power grid sensitive data. The document data to be processed in the power grid is parsed and keywords are extracted from it, with sensitive information associated with the keywords. Labels are then determined from the keywords and their associated sensitive information, namely the sensitive labels associated with the document data. On this basis, the labels are processed with a preset label distribution learning model to obtain a label probability distribution, which characterizes the importance of each label in the document data (that is, its influence on the document data's degree of sensitivity), so that the target sensitive data contained in the document data can be determined. Compared with the prior art, which merely collects sensitive words and their associated sensitive information and therefore yields redundant sensitive data, the embodiment of the application uses label distribution learning to identify data of higher sensitivity from the large amount of complex data information in the power grid. Real, higher-quality sensitive data is thereby obtained, large amounts of redundant, low-value sensitive data are avoided, and a better solution for identifying power grid sensitive data is provided.
The identification device for the grid sensitive data comprises a processor and a memory, wherein the first acquisition unit, the extraction unit, the first processing unit, the second processing unit, the determination unit and the like are all stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor includes a kernel, and the kernel fetches the corresponding program units from the memory. One or more kernels may be provided. By adjusting kernel parameters, label distribution learning is used to identify data of higher sensitivity from the large amount of complex data information in the power grid, so that real, higher-quality sensitive data is obtained, large amounts of redundant, low-value sensitive data are avoided, and a better solution for identifying power grid sensitive data is provided.
An embodiment of the application provides a storage medium on which a program is stored; when executed by a processor, the program implements the method for identifying power grid sensitive data.
An embodiment of the application provides a processor configured to run a program, where the program, when run, executes the above method for identifying power grid sensitive data.
The application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the steps of the method for identifying power grid sensitive data.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, the device includes one or more processors (CPUs), memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like.
The memory may include forms in computer-readable media such as volatile memory, random access memory (RAM), and/or nonvolatile memory, for example read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip. Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.

Claims (10)

1. A method for identifying grid sensitive data, the method comprising:
acquiring document data to be processed from a power grid;
extracting keywords from the document data by analyzing the document data, wherein the keywords are used for representing associated sensitive information;
processing the keywords and the sensitive information associated with the keywords by using a preset machine learning model, and determining the label corresponding to each keyword;
combining the document data, and processing a plurality of labels by using a preset label distribution learning model to obtain label probability distribution corresponding to the labels;
and determining target sensitive data from the document data according to the label probability distribution.
2. The method according to claim 1, wherein the document data includes structured data and unstructured data, and the extracting keywords from the document data by parsing the document data includes:
determining keywords from the structured data by using regular expressions for the structured data in the document data;
and matching the unstructured data in the document data based on a preset keyword library to determine keywords, or extracting the keywords from the unstructured data by using a preset keyword extraction algorithm.
3. The method of claim 2, wherein extracting the keywords from the unstructured data using a preset keyword extraction algorithm comprises:
extracting candidate words from the document data;
and processing the candidate words by using a preset binary classification model to determine whether each candidate word is a keyword.
4. The method of claim 2, wherein extracting the keywords from the unstructured data using a preset keyword extraction algorithm comprises:
extracting candidate words from the document data;
processing the candidate words by using a preset scoring model to obtain scoring values corresponding to each candidate word, wherein the scoring values are used for representing the probability of the candidate word as the keyword;
and selecting, in descending order of scoring value, a preset number of top-ranked candidate words from the candidate words as the keywords.
5. The method of claim 1, wherein before said combining said document data and processing a plurality of said tags using a preset tag distribution learning model to obtain a tag probability distribution corresponding to said tags, said method further comprises:
obtaining a sample data set, wherein the sample data set comprises a plurality of sample instances, and each sample instance corresponds to one label distribution;
and with the label distributions as initialization, iteratively training on the sample data to implement label distribution learning so as to construct the preset label distribution learning model.
6. An identification device for grid sensitive data, the device comprising:
the first acquisition unit is used for acquiring document data to be processed from a power grid;
the extraction unit is used for extracting keywords from the document data through analysis processing of the document data, wherein the keywords are used for representing associated sensitive information;
the first processing unit is used for processing the keywords and the sensitive information associated with the keywords by using a preset machine learning model and determining the label corresponding to each keyword;
the second processing unit is used for processing a plurality of labels by combining the document data and utilizing a preset label distribution learning model to obtain label probability distribution corresponding to the labels;
and the determining unit is used for determining the target sensitive data from the document data according to the label probability distribution.
7. The apparatus according to claim 6, wherein the document data includes structured data and unstructured data therein, and the extracting unit includes:
the first processing module is used for determining keywords from the structured data by using a regular expression for the structured data in the document data;
and the second processing module is used for matching the unstructured data in the document data based on a preset keyword library to determine keywords or extracting the keywords from the unstructured data by using a preset keyword extraction algorithm.
8. The apparatus of claim 7, wherein the second processing module is further specifically configured to: extract candidate words from the document data; and process the candidate words by using a preset binary classification model to determine whether each candidate word is a keyword.
9. A storage medium comprising a stored program, wherein the program, when run, controls a device in which the storage medium is located to perform the method of identifying grid sensitive data according to any one of claims 1-5.
10. An electronic device comprising at least one processor, and at least one memory and a bus connected to the processor;
the processor and the memory complete communication with each other through the bus;
the processor is configured to invoke program instructions in the memory to perform the method of identifying grid sensitive data as claimed in any of claims 1-5.
CN202310778166.5A 2023-06-28 2023-06-28 Identification method and device for power grid sensitive data Pending CN117009596A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310778166.5A CN117009596A (en) 2023-06-28 2023-06-28 Identification method and device for power grid sensitive data


Publications (1)

Publication Number Publication Date
CN117009596A true CN117009596A (en) 2023-11-07

Family

ID=88560980


Country Status (1)

Country Link
CN (1) CN117009596A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191275A (en) * 2019-11-28 2020-05-22 深圳云安宝科技有限公司 Sensitive data identification method, system and device
WO2020215571A1 (en) * 2019-04-25 2020-10-29 平安科技(深圳)有限公司 Sensitive data identification method and device, storage medium, and computer apparatus
US20210133279A1 (en) * 2019-11-04 2021-05-06 Adobe Inc. Utilizing a neural network to generate label distributions for text emphasis selection
CN113065330A (en) * 2021-03-22 2021-07-02 四川大学 Method for extracting sensitive information from unstructured data
CN113962302A (en) * 2021-10-20 2022-01-21 全球能源互联网研究院有限公司 Sensitive data intelligent identification method based on label distribution learning
US20220405274A1 (en) * 2021-06-17 2022-12-22 Huawei Technologies Co., Ltd. Method and system for detecting sensitive data


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
XIN GENG: "Label Distribution Learning", IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 7, 23 March 2016 (2016-03-23) *
尚芳剑; 来骥; 李信; 周巍: "Research on encryption of electric power Internet of Things data based on convolutional feature vectors", Information Technology, no. 04, 25 April 2021 (2021-04-25) *
王士元, 彭刚: "Language, Speech and Technology", Shanghai Education Press, 31 August 2006 *
龙宣东: "Research on multi-label feature selection algorithms based on label distribution and cost sensitivity", China Masters' Theses Full-text Database, Information Science and Technology, 15 April 2022 (2022-04-15) *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination