CN117807190A - Intelligent identification method for sensitive data of energy big data

Info

Publication number: CN117807190A
Application number: CN202410217787.0A
Authority: CN
Inventors: 王世谦; 邵志鹏; 张小建; 贾一博; 高先周; 李为; 宋大为; 王圆圆; 费稼轩; 黄秀丽; 卜飞飞; 李秋燕; 华远鹏; 韩丁
Original assignee: State Grid Smart Grid Research Institute Co ltd; Qingdao Tatan Technology Service Co ltd; Economic and Technological Research Institute of State Grid Henan Electric Power Co Ltd
Current assignee: State Grid Smart Grid Research Institute Co ltd; Qingdao Tatan Technology Service Co ltd; Economic and Technological Research Institute of State Grid Henan Electric Power Co Ltd
Priority date: 2024-02-28
Filing date: 2024-02-28
Publication date: 2024-04-02

Abstract

The invention relates to the technical field of data processing, and in particular to an intelligent identification method for sensitive data in energy big data. The method comprises: collecting log data of the power industry; calculating the TF-IDF value of each segmented word, clustering the segmented words, obtaining the maximum single-mode dimension of each segmented word, and from these obtaining the log electric sensitivity index of each segmented word; acquiring a local subject word set for each segmented word and calculating the theme constant weight of each segmented word; acquiring the local electric sensitivity correction index of each segmented word from the theme constant weight, the log electric sensitivity index and the distances between subject words, and from it the sensitivity modification weight of each clause; and acquiring the power clause sensitivity index of each clause, then the power sensitive feature vector of each segmented word, and identifying the sensitive data. The invention aims to solve the problem that sensitive data is identified inaccurately because of the specificity and complexity of Chinese language structure.

Description

Intelligent identification method for sensitive data of energy big data

Technical Field

The application relates to the technical field of data processing, in particular to an intelligent identification method for energy big data sensitive data.

Background

The energy industry comprises sectors such as solar energy, wind energy and electric power. The electric power industry is a basic industry bearing on the national economy and people's livelihood, and it generates a large amount of business data as the information society becomes more intelligent. Because electric power data may contain sensitive information about individuals, families or organizations, its degree of sensitivity needs to be identified so that sensitive information is not leaked or misused.

However, while the aggregation of massive data brings great value, it also brings serious security risks, and it is difficult for power grid staff to extract sensitive information from massive data accurately and efficiently. Sensitive information in electric power data therefore needs to be identified by an intelligent method. Named entity recognition (NER) can be used for this purpose: it identifies entities with specific meaning in text, including key sensitive information such as personal names, place names and proper nouns. Traditional NER methods, however, mostly target English text and perform poorly on Chinese text. Because of the specificity and complexity of Chinese language structure, such as polysemy, traditional NER algorithms identify sensitive data in energy big data inaccurately.

Disclosure of Invention

In order to solve the technical problems, the invention provides an intelligent identification method for energy big data sensitive data, which aims to solve the existing problems.

The intelligent identification method for the energy big data sensitive data adopts the following technical scheme:

the embodiment of the invention provides an intelligent identification method for sensitive data of big energy data, which comprises the following steps:

collecting log data of the power industry, performing sentence segmentation, and extracting word vectors of each word segmentation in each sentence;

calculating the TF-IDF value of each segmented word by using the TF-IDF algorithm; clustering the segmented words according to the Euclidean distances between their word vectors; acquiring the maximum single-mode dimension of each segmented word according to the element values of its word vector; acquiring the log electric sensitivity index of each segmented word according to its maximum single-mode dimension and TF-IDF value; acquiring the local subject word set of each segmented word according to the order, in the log data of the power industry, of the clause in which the word is located; acquiring the theme constant weight of each segmented word according to the distances between the subject words in its local subject word set; acquiring the local electric sensitivity correction index of each segmented word according to the theme constant weight, the log electric sensitivity index and the distances between the word and the subject words; acquiring the sensitivity modification weight of each clause according to the log electric sensitivity index and the local electric sensitivity correction index; acquiring the power clause sensitivity index of each clause according to the sensitivity modification weight, the log electric sensitivity index and the local electric sensitivity correction index; acquiring the power sensitive feature vector of each segmented word according to its log electric sensitivity index, its local electric sensitivity correction index and the power clause sensitivity index of the clause in which it is located;

and identifying the sensitive data according to each word, the word vector of each word and the power sensitive characteristic vector.

Further, the clustering division is performed on each word according to the euclidean distance between word vectors of each word, and the maximum single-mode dimension of each word is obtained according to the element value in the word vector of each word, including:

taking word vectors corresponding to all segmented words in the log data of the power industry as input of a DBSCAN clustering algorithm, measuring the Euclidean distance between the word vectors corresponding to the segmented words, and outputting clustering clusters obtained after clustering of all segmented words;

for each segmented word, each element of the word vector is taken as one dimension and the dimensions are numbered in the order in which they appear in the word vector; the sum of the squares of the element values over all dimensions is taken as the modulus of the word vector; the ratio of the square of the element value of each dimension to the modulus is recorded as the single-mode ratio of that dimension; and the number of the dimension with the largest single-mode ratio is recorded as the maximum single-mode dimension.

Further, the obtaining the log electric sensitivity index of each word includes:

for each segmented word, calculating the standard deviation of the maximum single-mode dimensions of all segmented words in the cluster where the word is located; calculating the sum of the standard deviation and a preset coordination coefficient; calculating the ratio of the number of segmented words in that cluster to the sum; and taking the product of the ratio and the TF-IDF value of the word as the log electric sensitivity index of the word.

Further, the obtaining the local subject word set of each word segment includes:

sorting the clauses in the order in which they appear in the log data of the power industry and, for the clause in which each segmented word is located, taking the set formed by the A other clauses nearest to that clause as the neighbor sentence set of the segmented word, where A is a preset number of clauses;

and taking the neighbor sentence set as the input of an LDA (Latent Dirichlet Allocation) subject word extraction model, the output of which is all subject words in the neighbor sentence set, and taking the set of all these subject words as the local subject word set of the segmented word.

Further, the obtaining the theme constant weight of each segmented word includes:

calculating the Google normalized distances between all subject words in the local subject word set of each segmented word, and taking the reciprocal of the sum of all these normalized distances as the theme constant weight of the segmented word.

Further, the obtaining the local electric sensitivity correction index of each word includes:

for each segmented word, calculating the Euclidean distance between the log electric sensitivity index of the word and that of each subject word in its local subject word set; calculating the Google normalized distance between the word and each subject word in the local subject word set; adding the Euclidean distance and the Google normalized distance for each subject word; summing these values over all subject words; and taking the product of this total and the theme constant weight as the local electric sensitivity correction index of the word.

Further, the obtaining the sensitivity modification weight of each clause according to the log electric sensitivity index and the local electric sensitivity modification index includes:

taking each clause in the log data of the power industry as the input of a dependency syntax tree, wherein the output of the dependency syntax tree is the word segmentation of each clause, the dependency relationship among the word segmentation and the word segmentation part of speech;

for each clause, for every segmented word whose part of speech is a modifier, evaluating the exponential function with the natural constant as base and the log electric sensitivity index of the word as exponent, and multiplying the result by the local electric sensitivity correction index of the word; and taking the average of these products over all modifier words in the clause as the sensitivity modification weight of the clause.

Further, the obtaining the power clause sensitivity index of each clause includes:

for word segmentation with part of speech as a main word in each clause, calculating the product of the log electric sensitivity index and the local electric sensitivity correction index of the word segmentation as a first product, and taking the product of the sum of all the first products and the sensitivity modification weight in each clause as the electric power clause sensitivity index of each clause.

Further, the obtaining the power sensitive feature vector of each word includes:

and sequentially arranging the log electric sensitivity index, the local electric sensitivity correction index and the electric sentence sensitivity index of the sentence in which each word is positioned to obtain the electric sensitivity feature vector of each word.

Further, the identifying the sensitive data by each word segment, the word vector of each word segment and the power sensitive feature vector includes:

taking each word, a word vector of each word and an electric power sensitive characteristic vector of each word as inputs of a BP neural network, wherein the outputs of the BP neural network are named entities to which each word belongs and the sensitivity of each word;

the sensitivity of all the segmented words is used as the input of a K-means clustering algorithm, and the output of the K-means clustering algorithm is three clustering clusters; calculating the average value of the sensitivity of all the word segmentation in each cluster, arranging the cluster according to the average value in a descending order, and sequentially marking the clusters corresponding to the arranged results as important data, general data and auxiliary data;

calculating, for each cluster, the proportion of segmented words belonging to each named entity; arranging these proportions in descending order to form a named entity proportion sequence; summing the elements of the sequence in order; and, once the running sum exceeds 70%, taking the set of named entities corresponding to the elements included in the sum as the representative entity set of the cluster;

and using a named entity recognition technology to recognize named entities in the log data, respectively matching the named entities with the representative entity sets of the important data, the general data and the auxiliary data cluster to obtain a matching result of the sensitivity degree of the log data, and taking the matching result as a recognition result of the sensitive data.
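The selection of a representative entity set can be illustrated with a minimal Python sketch. The function name `representative_entities`, the `threshold` parameter and the sample labels are illustrative assumptions, not part of the method as filed; only the descending ordering and the 70% running-sum rule come from the text above.

```python
from collections import Counter

def representative_entities(entity_labels, threshold=0.70):
    """Return the entities whose cumulative share first exceeds `threshold`.

    entity_labels: the named entity assigned to each segmented word of one cluster.
    """
    counts = Counter(entity_labels)
    total = sum(counts.values())
    chosen, cumulative = [], 0.0
    # most_common() yields entities in descending order of frequency,
    # which is the same order as descending proportion.
    for entity, count in counts.most_common():
        chosen.append(entity)
        cumulative += count / total
        if cumulative > threshold:
            break
    return chosen

# Hypothetical cluster: 50% "device", 30% "person", 20% "location".
labels = ["device"] * 5 + ["person"] * 3 + ["location"] * 2
representative = representative_entities(labels)
```

In this hypothetical cluster the running sum reaches 0.5 after "device" and 0.8 after "person", so the sketch stops there and "location" is left out.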

The invention has at least the following beneficial effects:

according to the invention, the log electric sensitivity index is constructed by analyzing the global characteristics of the segmented words in the log data of the power industry, and the global sensitivity degree of each segmented word is reflected; constructing a local electric sensitivity correction index based on the electric sensitivity index of the log and the characteristics of the word segmentation in the context, correcting the electric sensitivity index of the log under certain special conditions, and improving the accuracy of identifying the named entity and judging the sensitivity degree in the subsequent steps; the electric clause sensitivity index is built based on the local electric sensitivity correction index, the sensitivity degree of the whole clause under special conditions is reflected, and the recognition efficiency in the subsequent steps is improved; the method comprises the steps of constructing an electric power sensitive feature vector based on a log electric sensitivity index, a local electric sensitivity correction index and an electric power clause sensitivity index, using the electric power sensitive feature vector as input of a BP neural network, obtaining named entities and sensitivities of each segmented word, clustering and grading the sensitivities, realizing identification of sensitive data of large energy data, and solving the problem of inaccurate identification of the sensitive data due to the specificity and complexity of a Chinese language structure.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions and advantages of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of an intelligent identification method for big energy data sensitive data;

FIG. 2 is a flow chart of power sensitive feature vector acquisition.

Detailed Description

In order to further describe the technical means and effects adopted by the invention to achieve the preset aim, the following is a detailed description of specific implementation, structure, characteristics and effects of the intelligent identification method for the energy big data sensitive data according to the invention in combination with the accompanying drawings and the preferred embodiment. In the following description, different "one embodiment" or "another embodiment" means that the embodiments are not necessarily the same. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

The specific scheme of the intelligent identification method for the energy big data sensitive data provided by the invention is specifically described below with reference to the accompanying drawings.

The invention provides an intelligent identification method for sensitive data of energy big data. Referring to FIG. 1, the method comprises the following steps:

Step S001, collecting log data of the power industry and preprocessing it.

Since the processing manners of different log data in the power industry are the same, the present embodiment will be described by taking one log data processing as an example.

Log data of the electric power system is collected through the electric power information system. The log data is Chinese text that records the operation activities of the electric power system, the running states and parameters of electric power equipment, the security data of the electric power system, the load information of the power grid, and the like. Part of this data is sensitive, for example the equipment identification information among the equipment parameters, which includes the unique identifiers and serial numbers of various devices. If this information is leaked, the devices may be attacked remotely, affecting the safety and stability of the electric power system.

To help the named entity recognition technique understand the context of the log data in later steps, the log data needs to be split into paragraphs, clauses and words, and each word needs to be converted into a word vector.

Specifically, with the paragraph mark as the separator, the log data of the electric power system is divided into paragraphs; with the period as the separator, each paragraph is divided into clauses; and each clause is segmented into words with the jieba library for the Python programming language. Stop words, such as 'in', that do not materially help the named entity recognition task are then removed from the segmentation results. The remaining words are used as the input of the BERT model, whose output, a low-dimensional dense word embedding vector carrying semantic features, is recorded as the word vector of each segmented word. The BERT model is a known technology, and its specific process is not described in detail in this embodiment.
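The splitting and stop-word removal described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the stop-word list `STOP_WORDS`, the function name `split_log` and the sample text are hypothetical, and the default tokenizer simply splits on whitespace as a stand-in. For real Chinese text, a segmenter such as `jieba.lcut` could be passed as `tokenizer`. The BERT embedding step is not shown.

```python
# Hypothetical stop-word list; a real deployment would load a fuller one.
STOP_WORDS = {"的", "了", "在", "和"}

def split_log(text, tokenizer=None):
    """Split log text into paragraphs, then clauses, then filtered tokens."""
    if tokenizer is None:
        # Stand-in tokenizer for this sketch: whitespace splitting.
        tokenizer = str.split
    clauses = []
    for paragraph in text.split("\n"):       # paragraph mark as separator
        for clause in paragraph.split("。"):  # Chinese full stop as separator
            tokens = [t for t in tokenizer(clause) if t and t not in STOP_WORDS]
            if tokens:                        # skip empty clauses
                clauses.append(tokens)
    return clauses

# Invented, pre-spaced sample text so the whitespace stand-in works.
sample = "变压器 在 运行 。 电流 互感器 的 状态 正常 。\n负荷 上升 。"
clauses = split_log(sample)
```

Each inner list holds the segmented words of one clause, which is the unit that later steps such as the neighbor sentence set and the dependency analysis operate on.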

Thus, log data of the power industry is obtained.

Step S002, analyzing the characteristics of the word in the log data of the power industry to construct a log electric sensitivity index, constructing a local electric sensitivity correction index based on the log electric sensitivity index and the context characteristics of the word, constructing a power clause sensitivity index based on the local electric sensitivity correction index and the dependency relationship of the word in the sentence, and constructing a power sensitivity feature vector based on the log electric sensitivity index, the local electric sensitivity correction index and the power clause sensitivity index.

In the log data of the power industry, the equipment involved is specific to the power industry, so the words used to describe equipment running states are relatively similar to one another, and proper nouns and industry-specific terms are used frequently. Segmented words of these kinds therefore occur frequently throughout the log. At the same time, descriptors of equipment running states and proper nouns reflect the parameters and states of the electric power equipment, so their degree of sensitivity is relatively high. This embodiment therefore constructs a log electric sensitivity index that reflects the global importance and sensitivity of each segmented word. The construction process of the log electric sensitivity index is as follows:

and taking each word in the log data of the power industry as the input of a TF-IDF algorithm, wherein the output of the TF-IDF algorithm is the TF-IDF value of each word.
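As an illustration of this step, the following Python sketch computes one TF-IDF value per segmented word. The embodiment does not state what is treated as a document or how a single value per word is obtained, so two assumptions are made here: each clause is treated as a document, and a word's value is the average of its TF-IDF scores over the clauses that contain it. The function name `tfidf_per_word` is hypothetical.

```python
import math
from collections import Counter

def tfidf_per_word(clauses):
    """clauses: list of token lists. Returns {word: averaged TF-IDF value}."""
    n_docs = len(clauses)
    # Document frequency: number of clauses containing each word.
    df = Counter()
    for tokens in clauses:
        df.update(set(tokens))
    totals = Counter()
    for tokens in clauses:
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)            # term frequency in this clause
            idf = math.log(n_docs / df[word])   # classic, unsmoothed IDF
            totals[word] += tf * idf
    # Assumption: average over the clauses in which the word occurs.
    return {word: totals[word] / df[word] for word in df}

scores = tfidf_per_word([["a", "b"], ["a", "c"], ["a", "d", "d"]])
```

With the unsmoothed IDF used here, a word that occurs in every clause (such as "a" in the toy input) scores zero. A smoothed variant would avoid that if it is undesirable.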

Further, the word vectors of all segmented words in the log data of the power industry are used as the input of the DBSCAN clustering algorithm, with the Euclidean distance between word vectors as the distance measure; the output is the clusters of the segmented words. Each element of a word vector is taken as one dimension, and the dimensions are numbered in the order of the elements in the word vector. The sum of the squares of the element values over all dimensions is taken as the modulus of the word vector, the ratio of the square of the element value of a single dimension to the modulus is recorded as the single-mode ratio, and the number of the dimension with the largest single-mode ratio is recorded as the maximum single-mode dimension.

For example, if the word vector of the segmented word "generator" is $[0.5, -0.3, 1.2]$, the dimensions corresponding to 0.5, -0.3 and 1.2 are numbered 1, 2 and 3 respectively. The modulus of the word vector is $0.5^2 + (-0.3)^2 + 1.2^2 = 1.78$, and the single-mode ratios of the dimensions are $0.25/1.78 \approx 0.140$, $0.09/1.78 \approx 0.051$ and $1.44/1.78 \approx 0.809$, so the maximum single-mode dimension is 3. The Euclidean distance, the TF-IDF algorithm and the DBSCAN clustering algorithm are known techniques and are not described in detail in this embodiment.
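The maximum single-mode dimension can be computed with a few lines of Python. The sketch below follows the definition given in this embodiment, with the modulus taken as the sum of squared elements and dimensions numbered from 1. The function name `max_single_mode_dimension` is an illustrative choice.

```python
def max_single_mode_dimension(vector):
    """Return (dimension number starting at 1, list of single-mode ratios)."""
    # Modulus as defined in this embodiment: the sum of squared elements.
    modulus = sum(x * x for x in vector)
    if modulus == 0:
        raise ValueError("an all-zero word vector has no single-mode ratio")
    ratios = [x * x / modulus for x in vector]
    # index() returns the first maximum, so ties go to the lowest dimension.
    return ratios.index(max(ratios)) + 1, ratios

dimension, ratios = max_single_mode_dimension([0.5, -0.3, 1.2])
```

For the "generator" vector the largest ratio belongs to the third element, so `dimension` is 3. Because only the largest ratio matters, using the square root of the sum of squares as the modulus instead would select the same dimension.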

The log electric sensitivity index is then calculated as:

$$\mathrm{LES}_c = \frac{N_c}{\sigma_c + \varepsilon} \times \mathrm{TFIDF}_c$$

where $\mathrm{LES}_c$ denotes the log electric sensitivity index of the c-th segmented word in the log data of the power industry, $N_c$ denotes the number of segmented words in the cluster where the c-th segmented word is located, $\sigma_c$ denotes the standard deviation of the maximum single-mode dimensions of all segmented words in that cluster, $\varepsilon$ denotes a coordination coefficient that prevents a zero denominator and is set to 1 in this embodiment, and $\mathrm{TFIDF}_c$ denotes the TF-IDF value of the c-th segmented word in the log data of the power industry.

The larger the number of segmented words $N_c$ in the cluster containing the c-th segmented word, the more samples the cluster contains and the more reliable an analysis based on that cluster is. The smaller the standard deviation $\sigma_c$ of the maximum single-mode dimensions of all segmented words in that cluster, the smaller the differences between the maximum single-mode dimensions of the words in the cluster. After a segmented word is processed by the BERT model, a larger value of its word vector in a given dimension means the word vector is more important in that dimension; so if the maximum single-mode dimensions within the current cluster differ little, the words in the cluster are more uniform and are more sensitive in a certain respect. Likewise, the larger the TF-IDF value of the c-th segmented word, the more important the word is in the log data, that is, the more sensitive it is. In each of these cases the calculated log electric sensitivity index is larger.
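A minimal Python sketch of the log electric sensitivity index follows. It assumes the population standard deviation (the embodiment does not say whether the population or sample form is meant) and takes the cluster membership as given. The function and parameter names are hypothetical.

```python
from statistics import pstdev

def log_electric_sensitivity(cluster_dims, tfidf, epsilon=1.0):
    """Log electric sensitivity index of one segmented word.

    cluster_dims: maximum single-mode dimensions of all words in the word's cluster.
    tfidf:        TF-IDF value of the word.
    epsilon:      coordination coefficient that keeps the denominator non-zero.
    """
    n_words = len(cluster_dims)
    # Assumption: population standard deviation over the cluster.
    sigma = pstdev(cluster_dims)
    return n_words / (sigma + epsilon) * tfidf

uniform = log_electric_sensitivity([3, 3, 3, 3], tfidf=0.5)  # sigma = 0
mixed = log_electric_sensitivity([1, 3], tfidf=0.5)          # sigma = 1
```

A large cluster whose words share the same maximum single-mode dimension scores higher than a small, mixed one with the same TF-IDF value, which matches the reasoning above.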

The log electric sensitivity index obtained through the above steps reflects the global importance and sensitivity of each segmented word. In special cases, however, owing to polysemy in Chinese, the importance and sensitivity of a sentence as a whole may be higher than its individual words suggest. For example, in descriptions of the running state or parameters of important equipment, the words describing that equipment are also important, that is, highly sensitive, even though the sensitivity of those descriptive words in terms of global features may be relatively low. Therefore, this embodiment constructs a local electric sensitivity correction index by analyzing the context of each segmented word, so as to locally correct the log electric sensitivity index. The construction process of the local electric sensitivity correction index is as follows:

The clauses are ordered as they appear in the log data of the power industry, and the set formed by the A clauses nearest to the clause containing the c-th segmented word is taken as the neighbor sentence set of the c-th segmented word; A is 10 in this embodiment. The neighbor sentence set is used as the input of an LDA (Latent Dirichlet Allocation) subject word extraction model, whose output is all subject words in the neighbor sentence set, and the set of all these subject words is taken as the local subject word set of the c-th segmented word. The Google normalized distance between the subject words in the local subject word set of the c-th segmented word is then calculated. The LDA subject word extraction model and the calculation of the Google normalized distance are known techniques, and their detailed processes are not repeated in this embodiment. The local electric sensitivity correction index can be calculated as follows:

$$\mathrm{TCW}_c = \frac{1}{\sum_{d=1}^{M_c}\sum_{e=1}^{M_c} \mathrm{NGD}_c(d,e)}$$

$$\mathrm{LEC}_c = \mathrm{TCW}_c \times \sum_{f=1}^{M_c}\left(E_{c,f} + \mathrm{NGD}_{c,f}\right)$$

where $\mathrm{TCW}_c$ denotes the theme constant weight of the c-th segmented word, $M_c$ denotes the number of subject words in the local subject word set of the c-th segmented word, $\mathrm{NGD}_c(d,e)$ denotes the Google normalized distance between the d-th subject word and the e-th subject word in the local subject word set of the c-th segmented word, $\mathrm{LEC}_c$ denotes the local electric sensitivity correction index of the c-th segmented word in the log data of the power industry, $E_{c,f}$ denotes the Euclidean distance between the log electric sensitivity indexes of the c-th segmented word and the f-th subject word in the local subject word set, and $\mathrm{NGD}_{c,f}$ denotes the Google normalized distance between the c-th segmented word and the f-th subject word in the local subject word set.

In the local subject word set of the c-th segmented word, the larger the Google normalized distance between pairs of subject words, the smaller the semantic similarity between them; that is, within the A nearby clauses, different subject words reflect different themes. Subject words of this kind should carry less weight when they are used for correction, so the calculated theme constant weight is smaller. The smaller the Euclidean distance between the log electric sensitivity index of the c-th segmented word and those of the subject words, the smaller the difference in sensitivity between the c-th segmented word and the subject words, and the less the c-th segmented word needs to be corrected. The smaller the Google normalized distance between the c-th segmented word and the subject words, the greater their semantic similarity, that is, the c-th segmented word and the subject words reflect the same theme, and the less the c-th segmented word needs to be corrected. In both cases the calculated local electric sensitivity correction index is smaller.
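The theme constant weight and the local electric sensitivity correction index can be sketched in Python as follows. Several points are assumptions made for illustration: the Google normalized distance is computed from occurrence counts in a local corpus rather than from search-engine hit counts; a fallback value is used for words that never co-occur; and the sum of pairwise distances runs over the whole symmetric distance matrix. All function names are hypothetical.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google distance from counts.

    fx, fy: number of documents containing each word; fxy: containing both;
    n: total number of documents.
    """
    if fxy == 0:
        return 1.0  # assumption: fallback for words that never co-occur
    numerator = max(math.log(fx), math.log(fy)) - math.log(fxy)
    denominator = math.log(n) - min(math.log(fx), math.log(fy))
    return numerator / denominator if denominator > 0 else 0.0

def theme_constant_weight(distance_matrix):
    """Reciprocal of the summed pairwise distances between subject words."""
    total = sum(sum(row) for row in distance_matrix)
    if total == 0:
        # The source gives no guard for this case; the sketch refuses it.
        raise ValueError("pairwise distances sum to zero")
    return 1.0 / total

def local_correction_index(les_word, les_topics, ngd_to_topics, weight):
    """Weight times the sum of (|LES difference| + NGD) over subject words."""
    total = sum(abs(les_word - les_t) + g
                for les_t, g in zip(les_topics, ngd_to_topics))
    return weight * total

weight = theme_constant_weight([[0.0, 0.5], [0.5, 0.0]])
index = local_correction_index(2.0, [1.0, 3.0], [0.2, 0.3], weight=0.5)
```

For scalar indexes the Euclidean distance reduces to an absolute difference, which is what `local_correction_index` uses.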

The local electric sensitivity correction index obtained through the above steps reflects how much the global sensitivity of each segmented word is corrected. In some special cases, no single segmented word in a sentence of the power industry log data is highly sensitive, yet the sentence as a whole is. For example, in the sentence 'After some routine maintenance work was completed, a new risk assessment indicated that the current transmission capacity of the electric power system poses a certain potential safety hazard', each word in the segmentation result has relatively low sensitivity, but the sentence as a whole indicates a potential safety hazard in the electric power system that needs to be repaired in time to keep it from growing. To improve the efficiency and accuracy with which sensitive data in the power industry log data is identified in later steps, this embodiment constructs a power clause sensitivity index based on the local electric sensitivity correction index to reflect the overall sensitivity of each clause. The construction process of the power clause sensitivity index is as follows:

The a-th clause in the log data of the power industry is used as the input of a dependency syntax tree, whose output is the segmented words of the a-th clause, the dependency relationships between them, and their parts of speech. The dependency relationships include the subject-predicate relationship, the verb-object relationship and the like, and the parts of speech include nouns, verbs, adjectives and the like. The dependency syntax tree is a known technology, and its specific process is not repeated in this embodiment. The power clause sensitivity index can be calculated as follows:

$$\mathrm{SMW}_a = \frac{1}{H_a}\sum_{h=1}^{H_a} \exp\left(\mathrm{LES}^{\mathrm{mod}}_{a,h}\right) \times \mathrm{LEC}^{\mathrm{mod}}_{a,h}$$

$$\mathrm{PCS}_a = \mathrm{SMW}_a \times \sum_{i=1}^{I_a} \mathrm{LES}^{\mathrm{main}}_{a,i} \times \mathrm{LEC}^{\mathrm{main}}_{a,i}$$

where $\mathrm{SMW}_a$ denotes the sensitivity modification weight of the a-th clause in the log data of the power industry; $H_a$ denotes the number of modifier words in the a-th clause, the modifier words including adjectives, ordinal words and the like; $\mathrm{LES}^{\mathrm{mod}}_{a,h}$ and $\mathrm{LEC}^{\mathrm{mod}}_{a,h}$ denote the log electric sensitivity index and the local electric sensitivity correction index of the h-th modifier word in the a-th clause; and $\exp()$ denotes the exponential function with the natural constant as base;

$\mathrm{PCS}_a$ denotes the power clause sensitivity index of the a-th clause in the log data of the power industry; $I_a$ denotes the number of main words in the a-th clause, that is, the number of words whose dependency relationship in the dependency syntax tree is that of subject; and $\mathrm{LES}^{\mathrm{main}}_{a,i}$ and $\mathrm{LEC}^{\mathrm{main}}_{a,i}$ denote the log electric sensitivity index and the local electric sensitivity correction index of the i-th main word in the a-th clause.

The more sensitive a modifier in a clause is, that is, the larger its log electric sensitivity index, and the more strongly the modifier is corrected, that is, the larger its local electric sensitivity correction index, the more the modifier modifies the sensitivity of the main word and hence of the whole clause. To amplify this effect, this embodiment applies the exponential function to the log electric sensitivity index, so the calculated sensitivity modification weight is larger. Likewise, the larger the log electric sensitivity index and the local electric sensitivity correction index of the main words, the more sensitive the clause is and the larger the calculated power clause sensitivity index.
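The two clause-level quantities can be sketched in Python as below. The inputs are assumed to be already available as pairs of (log electric sensitivity index, local electric sensitivity correction index) for the modifier words and for the main words of one clause. The function names and the `default` value used when a clause has no modifier are assumptions, since the embodiment does not cover that case.

```python
import math

def sensitivity_modification_weight(modifiers, default=1.0):
    """Mean of exp(LES) * LEC over the modifier words of one clause.

    modifiers: list of (les, lec) pairs for the modifier words.
    default:   assumed neutral weight for a clause without modifiers.
    """
    if not modifiers:
        return default
    return sum(math.exp(les) * lec for les, lec in modifiers) / len(modifiers)

def power_clause_sensitivity(main_words, weight):
    """Weight times the sum of LES * LEC over the main words of the clause."""
    return weight * sum(les * lec for les, lec in main_words)

weight = sensitivity_modification_weight([(0.0, 2.0), (0.0, 4.0)])
clause_index = power_clause_sensitivity([(1.0, 2.0), (3.0, 1.0)], weight)
```

Because of the exponential, a small increase in a modifier's log electric sensitivity index raises the weight far more than the same increase in its correction index, which is the amplification the paragraph above describes.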

From the log electric sensitivity index and the local electric sensitivity correction index of each segmented word in the power industry log data, together with the power clause sensitivity index of each clause, the power sensitive feature vector of each segmented word can be constructed, which facilitates the subsequent identification of the named entity and the sensitivity of each segmented word. The power sensitive feature vector of each segmented word can be expressed as follows:

wherein the quantities in the formula are: the power sensitive feature vector of the b-th segmented word in the a-th clause in the power industry log data; the log electric sensitivity index and the local electric sensitivity correction index of the b-th segmented word in the a-th clause, respectively; and the power clause sensitivity index of the a-th clause. The flow chart for obtaining the power sensitive feature vector is shown in Fig. 2.

Thus, the power sensitive feature vector of each segmented word in the power industry log data is obtained.
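As an illustration only, the assembly of the power sensitive feature vectors can be sketched as follows. The sketch assumes that the two word-level indexes and the clause-level index have already been computed by the preceding steps; the names `PowerSensitiveFeature` and `build_feature_vectors`, and the ordering of the three components, are hypothetical and are not taken from the embodiment.

```python
from typing import List, NamedTuple, Tuple


class PowerSensitiveFeature(NamedTuple):
    """Hypothetical container for one segmented word's feature vector."""
    log_sensitivity: float      # log electric sensitivity index of the word
    local_correction: float     # local electric sensitivity correction index of the word
    clause_sensitivity: float   # power clause sensitivity index of the word's clause


def build_feature_vectors(
    clauses: List[List[Tuple[float, float]]],
    clause_indexes: List[float],
) -> List[List[PowerSensitiveFeature]]:
    """Attach the clause-level index to every word of that clause.

    clauses[a][b] holds the (log sensitivity, local correction) pair of the
    b-th segmented word in the a-th clause; clause_indexes[a] holds the power
    clause sensitivity index of the a-th clause.
    """
    if len(clauses) != len(clause_indexes):
        raise ValueError("one clause-level index is required per clause")
    return [
        [PowerSensitiveFeature(s, c, clause_indexes[a]) for (s, c) in words]
        for a, words in enumerate(clauses)
    ]
```

Every segmented word of a clause shares the same third component, so the clause-level context is carried into the word-level input of the subsequent network.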

Step S003: the power sensitive feature vector of each segmented word in the power industry log data, among other inputs, is used as the input of the BP neural network to obtain the named entity and the sensitivity of each segmented word; the sensitivities are clustered to obtain the sensitivity degree of the named entities, thereby realizing identification of sensitive data in energy big data.

With the power sensitive feature vector of each segmented word in the power industry log data obtained in the above steps, the segmented words, the word vector of each segmented word and the power sensitive feature vector of each segmented word are used as the input of the BP neural network, with Adam as the optimization algorithm and the mean squared error (MSE) as the loss function. The network outputs the named entity to which each segmented word belongs and the sensitivity of each segmented word. For example, "current transformer" is a named entity of type equipment; during word segmentation it may be split into the two words "current" and "transformer", and the named entity to which both words belong is equipment. The sensitivities of all segmented words are then used as the input of the K-means clustering algorithm, with the Euclidean distance between sensitivities as the distance measure and the number of clusters K set to 3; the output is the clusters. The BP neural network and the K-means clustering algorithm are known techniques, and their specific processes are not repeated in this embodiment.
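The clustering step alone can be sketched as below; the BP neural network is not shown, and the sensitivities are assumed to be already available as a list of scalars. The function name `kmeans_1d` and the deterministic choice of initial centres are assumptions made for the sketch, since the embodiment does not specify an initialisation.

```python
from typing import List, Tuple


def kmeans_1d(values: List[float], k: int = 3, max_iter: int = 100) -> Tuple[List[int], List[float]]:
    """Lloyd's algorithm on scalar sensitivities.

    For scalars the Euclidean distance reduces to |x - c|.
    Returns (cluster label per value, final cluster centres).
    """
    distinct = sorted(set(values))
    if k < 2 or len(distinct) < k:
        raise ValueError("need k >= 2 and at least k distinct values")
    # Assumption: initial centres are spread evenly over the sorted distinct values.
    centres = [distinct[i * (len(distinct) - 1) // (k - 1)] for i in range(k)]
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centre.
        labels = [min(range(k), key=lambda j: abs(v - centres[j])) for v in values]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centres.append(sum(members) / len(members) if members else centres[j])
        if new_centres == centres:
            break
        centres = new_centres
    return labels, centres
```

In practice a library implementation of K-means with random restarts would normally be preferred; the deterministic initialisation here only keeps the sketch reproducible.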

The mean sensitivity of all segmented words in each cluster is calculated, the clusters are arranged in descending order of this mean, and the clusters are marked, in that order, as important data, general data and auxiliary data. For each cluster, the proportion of segmented words corresponding to each named entity is calculated, and these proportions are arranged in descending order to form a named entity proportion sequence. The elements of the sequence are summed one by one; when the running sum first exceeds 70%, the set of named entities corresponding to the elements included in the sum is taken as the representative entity set of the cluster.

For example, suppose a cluster contains 5 named entities whose proportions in descending order are equipment 30%, state 25%, working time 20%, working efficiency 13% and working content 12%. When the third element is added, the running sum is 75%, which exceeds 70%, so equipment, state and working time form the representative entity set of the cluster.
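The cluster labelling and the selection of the representative entity set can be sketched as follows. The function names `label_clusters` and `representative_entities` are hypothetical; the 70% threshold and the three level names are those given in this embodiment, and entity counts are used in place of percentages so that the comparison is exact.

```python
from collections import Counter
from typing import Dict, Hashable, List

LEVELS = ["important data", "general data", "auxiliary data"]


def label_clusters(sensitivities: List[float], labels: List[Hashable]) -> Dict[Hashable, str]:
    """Rank clusters by descending mean sensitivity and assign the three levels."""
    groups: Dict[Hashable, List[float]] = {}
    for s, lab in zip(sensitivities, labels):
        groups.setdefault(lab, []).append(s)
    ranked = sorted(groups, key=lambda c: sum(groups[c]) / len(groups[c]), reverse=True)
    return dict(zip(ranked, LEVELS))


def representative_entities(entities: List[str], threshold: float = 0.70) -> List[str]:
    """Accumulate entity proportions in descending order until the sum exceeds the threshold.

    entities holds the named entity of every segmented word in one cluster.
    """
    counts = Counter(entities)
    total = sum(counts.values())
    chosen: List[str] = []
    cumulative = 0
    for entity, n in counts.most_common():
        chosen.append(entity)
        cumulative += n
        if cumulative > threshold * total:
            break
    return chosen


# The worked example above, expressed as 100 segmented words in one cluster.
example = (["equipment"] * 30 + ["state"] * 25 + ["working time"] * 20
           + ["working efficiency"] * 13 + ["working content"] * 12)
print(representative_entities(example))  # ['equipment', 'state', 'working time']
```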

Further, when the remaining power industry log text data is processed, the named entities in the log data are identified by a named entity recognition technique and matched against the representative entity sets of the important data, general data and auxiliary data clusters to obtain a matching result for the sensitivity degree of the log data. This matching result is taken as the identification result of the sensitive data, thereby realizing intelligent identification of sensitive data. Named entity recognition is a well-known technique and is not described in detail here.
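One possible form of the matching step is sketched below. The embodiment does not state how a log entry that matches several representative entity sets is resolved; the sketch assumes, purely for illustration, that the most sensitive matching level is returned. The function name `match_sensitivity` and the example sets are hypothetical.

```python
from typing import Dict, Iterable, Optional, Set

# Levels in descending order of sensitivity, as named in this embodiment.
LEVEL_ORDER = ["important data", "general data", "auxiliary data"]


def match_sensitivity(
    log_entities: Iterable[str],
    representative_sets: Dict[str, Set[str]],
) -> Optional[str]:
    """Return the most sensitive level whose representative set shares an entity
    with the log entry, or None if no level matches.

    Assumption: ties between levels are resolved in favour of the more
    sensitive level; the source does not specify this rule.
    """
    found = set(log_entities)
    for level in LEVEL_ORDER:
        if found & representative_sets.get(level, set()):
            return level
    return None


# Hypothetical representative entity sets, for illustration only.
example_sets = {
    "important data": {"equipment", "state", "working time"},
    "general data": {"working efficiency"},
    "auxiliary data": {"working content"},
}
print(match_sensitivity(["working efficiency", "state"], example_sets))  # important data
```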

It should be noted that the ordering of the above embodiments of the present invention is for description only and does not represent the relative merits of the embodiments. The foregoing describes specific embodiments of this specification. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or a sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

In this specification, the embodiments are described in a progressive manner; for parts that are the same or similar, the embodiments may be referred to one another, and each embodiment mainly describes its differences from the other embodiments.

The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Any modification of the technical solutions described in the foregoing embodiments, or equivalent replacement of some of their technical features, that does not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application shall be included in the protection scope of the present application.

Claims

1. An intelligent identification method for sensitive data of energy big data, characterized by comprising the following steps:

2. The intelligent identification method for sensitive data of energy big data according to claim 1, wherein clustering the segmented words according to the Euclidean distance between the word vectors of the segmented words, and obtaining the maximum single-mode dimension of each segmented word according to the element values in the word vector of each segmented word, comprises:

3. The intelligent identification method for sensitive data of energy big data according to claim 1, wherein obtaining the log electric sensitivity index of each segmented word comprises:

4. The intelligent identification method for sensitive data of energy big data according to claim 1, wherein obtaining the local subject word set of each segmented word comprises:

5. The intelligent identification method for sensitive data of energy big data according to claim 1, wherein obtaining the main body constant weight of each segmented word comprises:

6. The intelligent identification method for sensitive data of energy big data according to claim 1, wherein obtaining the local electric sensitivity correction index of each segmented word comprises:

7. The intelligent identification method for sensitive data of energy big data according to claim 1, wherein obtaining the sensitive modification weight of each clause according to the log electric sensitivity index and the local electric sensitivity correction index comprises:

8. The intelligent identification method for sensitive data of energy big data according to claim 7, wherein obtaining the power clause sensitivity index of each clause comprises:

9. The intelligent identification method for sensitive data of energy big data according to claim 1, wherein obtaining the power sensitive feature vector of each segmented word comprises:

10. The intelligent identification method for sensitive data of energy big data according to claim 1, wherein identifying the sensitive data from the segmented words, the word vector of each segmented word and the power sensitive feature vectors comprises: