CN117807190A - Intelligent identification method for sensitive data of energy big data - Google Patents


Publication number
CN117807190A
Authority
CN
China
Prior art keywords
word
sensitivity
clause
index
data
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410217787.0A
Other languages
Chinese (zh)
Inventor
王世谦
邵志鹏
张小建
贾一博
高先周
李为
宋大为
王圆圆
费稼轩
黄秀丽
卜飞飞
李秋燕
华远鹏
韩丁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Smart Grid Research Institute Co ltd
Qingdao Tatan Technology Service Co ltd
Economic and Technological Research Institute of State Grid Henan Electric Power Co Ltd
Original Assignee
State Grid Smart Grid Research Institute Co ltd
Qingdao Tatan Technology Service Co ltd
Economic and Technological Research Institute of State Grid Henan Electric Power Co Ltd
Application filed by State Grid Smart Grid Research Institute Co ltd, Qingdao Tatan Technology Service Co ltd, Economic and Technological Research Institute of State Grid Henan Electric Power Co Ltd
Priority to CN202410217787.0A
Publication of CN117807190A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the technical field of data processing, and in particular to an intelligent identification method for sensitive data in energy big data, comprising the following steps: collecting log data of the power industry; calculating the TF-IDF value of each word, clustering the words, obtaining the maximum single-mode dimension of each word, and from these obtaining the log electric sensitivity index of each word; obtaining a local topic word set for each word and calculating the topic constant weight of each word; obtaining the local electric sensitivity correction index of each word from the topic constant weight, the log electric sensitivity index, and the distances to the topic words, and from these obtaining the sensitivity modification weight of each clause; and obtaining the power clause sensitivity index of each clause, then the power sensitive feature vector of each word, and identifying the sensitive data. The invention aims to solve the problem that sensitive data are identified inaccurately because of the particularity and complexity of Chinese language structure.

Description

Intelligent identification method for sensitive data of energy big data
Technical Field
The application relates to the technical field of data processing, in particular to an intelligent identification method for energy big data sensitive data.
Background
The energy industry comprises sectors such as solar energy, wind energy, and electric power. The electric power industry is a basic industry bearing on the national economy and people's livelihood, and it generates a large amount of business data as the information society develops toward intelligence. Because electric power data may contain sensitive information about individuals, families, or organizations, the degree of sensitivity of the data needs to be identified so that the sensitive information is not leaked or misused.
However, the aggregation of massive data brings great value and, at the same time, serious security risks. Power grid staff can hardly extract sensitive information from massive data accurately and efficiently, so sensitive information in electric power data needs to be identified by an intelligent method. Named entity recognition (NER) can be used for this purpose: it identifies entities with specific meaning in text, including key sensitive information such as person names, place names, and proper nouns. However, traditional named entity recognition methods are mostly oriented toward English text and perform poorly on Chinese text. Because of the particularity and complexity of Chinese language structure, such as the polysemy of Chinese words, traditional named entity recognition algorithms identify sensitive data in energy big data inaccurately.
Disclosure of Invention
In order to solve the technical problems, the invention provides an intelligent identification method for energy big data sensitive data, which aims to solve the existing problems.
The intelligent identification method for the energy big data sensitive data adopts the following technical scheme:
the embodiment of the invention provides an intelligent identification method for sensitive data of big energy data, which comprises the following steps:
collecting log data of the power industry, dividing the log data into clauses, and extracting the word vector of each word in each clause;
calculating the TF-IDF value of each word by using a TF-IDF algorithm; clustering the words according to the Euclidean distances between their word vectors; obtaining the maximum single-mode dimension of each word according to the element values in its word vector; obtaining the log electric sensitivity index of each word according to the maximum single-mode dimension and the TF-IDF value of the word; obtaining the local topic word set of each word according to the order, in the log data of the power industry, of the clause in which the word is located; obtaining the topic constant weight of each word according to the distances between the topic words in the local topic word set; obtaining the local electric sensitivity correction index of each word according to the topic constant weight, the log electric sensitivity index, and the distances to the topic words; obtaining the sensitivity modification weight of each clause according to the log electric sensitivity index and the local electric sensitivity correction index; obtaining the power clause sensitivity index of each clause according to the sensitivity modification weight, the log electric sensitivity index, and the local electric sensitivity correction index; obtaining the power sensitive feature vector of each word according to the log electric sensitivity index and the local electric sensitivity correction index of the word and the power clause sensitivity index of the clause in which the word is located;
and identifying the sensitive data according to each word, the word vector of each word, and the power sensitive feature vector of each word.
Further, clustering the words according to the Euclidean distances between their word vectors and obtaining the maximum single-mode dimension of each word according to the element values in its word vector includes:
taking the word vectors of all words in the log data of the power industry as the input of a DBSCAN clustering algorithm, with the Euclidean distance between word vectors as the distance metric, and outputting the clusters obtained by clustering all the words;
for each word, treating each element of its word vector as one dimension and numbering the dimensions in the order in which they appear in the word vector; taking the sum of the squares of the element values of all dimensions as the modulus of the word vector; recording the ratio of the square of the element value of each dimension to the modulus of the word vector as the single-mode ratio of that dimension; and recording the number of the dimension with the largest single-mode ratio as the maximum single-mode dimension.
Further, obtaining the log electric sensitivity index of each word includes:
for each word, calculating the standard deviation of the maximum single-mode dimensions of all words in the cluster in which the word is located; calculating the sum of the standard deviation and a preset coordination coefficient; calculating the ratio of the number of words in the cluster to that sum; and taking the product of the ratio and the TF-IDF value of the word as the log electric sensitivity index of the word.
Further, obtaining the local topic word set of each word includes:
sorting the clauses according to their order in the log data of the power industry and, for the clause in which each word is located, taking the set formed by the A other clauses nearest to that clause as the neighbor sentence set of the word, where A is a preset number of clauses;
taking the neighbor sentence set as the input of an LDA (Latent Dirichlet Allocation) topic word extraction model, whose output is all the topic words in the neighbor sentence set, and taking the set of all these topic words as the local topic word set of the word.
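The selection of the neighbor sentence set can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `neighbor_clause_indices` is hypothetical, "nearest" is taken to mean nearest by position in the log, and ties between an earlier and a later clause are broken in favor of the earlier one, which the text does not specify.

```python
from typing import List

def neighbor_clause_indices(i: int, n_clauses: int, a: int) -> List[int]:
    """Return the positions of the `a` other clauses nearest to clause `i`."""
    others = [j for j in range(n_clauses) if j != i]
    # Sort by positional distance; on ties prefer the earlier clause (assumption).
    others.sort(key=lambda j: (abs(j - i), j))
    return sorted(others[:a])

# Example with 6 clauses and A = 4 (the embodiment uses A = 10).
middle = neighbor_clause_indices(3, 6, 4)
first = neighbor_clause_indices(0, 6, 3)
```

The clauses at the returned positions would then be passed together to the topic word extraction model.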
Further, obtaining the topic constant weight of each word includes:
calculating the Google normalized distances between the topic words in the local topic word set of each word, and taking the reciprocal of the sum of all these normalized distances as the topic constant weight of the word.
Further, obtaining the local electric sensitivity correction index of each word includes:
for each word and each topic word in its local topic word set, calculating the Euclidean distance between the log electric sensitivity indexes of the word and the topic word, calculating the Google normalized distance between the word and the topic word, and adding the two distances; summing these values over all topic words in the local topic word set; and taking the product of this total and the topic constant weight as the local electric sensitivity correction index of the word.
Further, obtaining the sensitivity modification weight of each clause according to the log electric sensitivity index and the local electric sensitivity correction index includes:
taking each clause in the log data of the power industry as the input of a dependency syntax tree, whose output is the words of the clause, the dependency relationships between the words, and the part of speech of each word;
for each word in a clause whose part of speech is a modifier, evaluating the exponential function with the natural constant as base and the log electric sensitivity index of the word as exponent, and multiplying the result by the local electric sensitivity correction index of the word; and taking the mean of these products over all modifier words in the clause as the sensitivity modification weight of the clause.
Further, obtaining the power clause sensitivity index of each clause includes:
for each word in the clause that is a grammatical subject, calculating the product of its log electric sensitivity index and its local electric sensitivity correction index as a first product; and taking the product of the sum of all first products in the clause and the sensitivity modification weight of the clause as the power clause sensitivity index of the clause.
Further, obtaining the power sensitive feature vector of each word includes:
arranging, in order, the log electric sensitivity index of the word, the local electric sensitivity correction index of the word, and the power clause sensitivity index of the clause in which the word is located to obtain the power sensitive feature vector of the word.
Further, identifying the sensitive data according to each word, the word vector of each word, and the power sensitive feature vector includes:
taking each word, its word vector, and its power sensitive feature vector as the inputs of a BP neural network, whose outputs are the named entity to which each word belongs and the sensitivity of each word;
taking the sensitivities of all words as the input of a K-means clustering algorithm, whose output is three clusters; calculating the mean sensitivity of the words in each cluster, arranging the clusters in descending order of the mean, and labeling the arranged clusters in turn as important data, general data, and auxiliary data;
for each cluster, calculating the proportion of each named entity among the words in the cluster, arranging the proportions in descending order to form a named entity proportion sequence, and summing the elements of the sequence in order; when the running sum first exceeds 70%, taking the set of named entities corresponding to the elements included in the sum as the representative entity set of the cluster;
identifying the named entities in the log data by using a named entity recognition technique, matching them against the representative entity sets of the important data, general data, and auxiliary data clusters to obtain the sensitivity degree of the log data, and taking the matching result as the identification result of the sensitive data.
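The grading of the three clusters and the selection of a representative entity set can be sketched as follows. This is an illustration only: the K-means step and the BP neural network are assumed to have already run, and the names `grade_clusters`, `representative_entities`, and the sample entity labels are hypothetical.

```python
from collections import Counter
from typing import Dict, List

GRADES = ["important data", "general data", "auxiliary data"]

def grade_clusters(cluster_sens: Dict[int, List[float]]) -> Dict[int, str]:
    """Label three clusters by descending mean sensitivity."""
    means = {k: sum(v) / len(v) for k, v in cluster_sens.items()}
    ordered = sorted(means, key=means.get, reverse=True)
    return {k: GRADES[rank] for rank, k in enumerate(ordered)}

def representative_entities(entity_labels: List[str], threshold: float = 0.7) -> List[str]:
    """Take entities in descending order of proportion until the running sum exceeds the threshold."""
    counts = Counter(entity_labels)
    total = len(entity_labels)
    chosen, running = [], 0
    for entity, count in counts.most_common():
        chosen.append(entity)
        running += count
        if running > threshold * total:
            break
    return chosen

# Toy inputs: sensitivities per cluster and the entity label of each word in one cluster.
grades = grade_clusters({0: [0.9, 0.8], 1: [0.1, 0.2], 2: [0.5]})
labels = ["device_id"] * 5 + ["location"] * 3 + ["person"] * 2
representatives = representative_entities(labels)
```

In the toy cluster, the two most frequent entity types together account for 80% of the words, so they form the representative entity set and the third type is left out.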
The invention has at least the following beneficial effects:
The invention constructs the log electric sensitivity index by analyzing the global features of the words in the log data of the power industry, which reflects the global sensitivity of each word. It constructs the local electric sensitivity correction index from the log electric sensitivity index and the features of each word in its context, which corrects the log electric sensitivity index in certain special cases and improves the accuracy of named entity recognition and sensitivity judgment in the subsequent steps. It constructs the power clause sensitivity index from the local electric sensitivity correction index, which reflects the sensitivity of a whole clause in special cases and improves the recognition efficiency of the subsequent steps. Finally, it constructs the power sensitive feature vector from the log electric sensitivity index, the local electric sensitivity correction index, and the power clause sensitivity index, uses it as an input of a BP neural network to obtain the named entity and sensitivity of each word, and clusters and grades the sensitivities. This realizes the identification of sensitive data in energy big data and solves the problem that sensitive data are identified inaccurately because of the particularity and complexity of Chinese language structure.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions and advantages of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the intelligent identification method for sensitive data of energy big data;
FIG. 2 is a flow chart of power sensitive feature vector acquisition.
Detailed Description
In order to further describe the technical means and effects adopted by the invention to achieve the preset aim, the following is a detailed description of specific implementation, structure, characteristics and effects of the intelligent identification method for the energy big data sensitive data according to the invention in combination with the accompanying drawings and the preferred embodiment. In the following description, different "one embodiment" or "another embodiment" means that the embodiments are not necessarily the same. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The specific scheme of the intelligent identification method for the energy big data sensitive data provided by the invention is specifically described below with reference to the accompanying drawings.
The invention provides an intelligent identification method for sensitive data of energy big data. Referring to FIG. 1, the method comprises the following steps:
Step S001, collecting log data of the power industry and preprocessing the log data.
Since all log data of the power industry are processed in the same way, this embodiment takes the processing of one piece of log data as an example.
Log data of the electric power system are collected through the electric power information system. The log data are Chinese text that records the operation activities of the power system, the running states and parameters of power equipment, the security data of the power system, the load information of the power grid, and the like. Part of these data are sensitive. For example, the device identification information among the power equipment parameters, including the unique identifiers and serial numbers of devices, may expose the devices to remote attack if leaked, which affects the safety and stability of the power system.
So that the named entity recognition technique can later understand the context in the log data, the log data need to be divided into paragraphs, clauses, and words, and the words converted into word vectors.
Specifically, the log data of the electric power system are divided into paragraphs with the paragraph mark as separator, and each paragraph is divided into clauses with the period as separator. Each clause is segmented into words with the jieba library for the Python programming language, and stop words such as "in", which do not substantially help the named entity recognition task, are removed from the segmentation result. The remaining words are used as the input of a BERT model, whose output is a low-dimensional dense word embedding vector with semantic feature representation for each word, recorded as the word vector of the word. The BERT model is a known technique, and its specific process is not described in detail in this embodiment.
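The splitting and stop-word removal described above can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the function name `preprocess_log` and the tiny `STOP_WORDS` set are assumptions, and a whitespace tokenizer stands in for jieba so that the example runs without third-party packages. In the embodiment the tokenizer would be jieba (for example `jieba.lcut`), and the resulting words would then be passed to a BERT model, which is omitted here.

```python
import re
from typing import Callable, List

# Hypothetical stop-word list; a real list would be much larger.
STOP_WORDS = {"的", "在", "了", "in", "the", "of"}

def preprocess_log(text: str, tokenize: Callable[[str], List[str]]) -> List[List[str]]:
    """Split log text into paragraphs, then clauses, then stop-word-filtered words."""
    clauses = []
    for paragraph in text.splitlines():               # paragraph mark as separator
        # Period as separator; note this naive split would also cut decimals such as "0.5".
        for clause in re.split(r"[。.]", paragraph):
            clause = clause.strip()
            if not clause:
                continue
            words = [w for w in tokenize(clause) if w.strip() and w not in STOP_WORDS]
            if words:
                clauses.append(words)
    return clauses

# A whitespace tokenizer stands in for jieba in this example.
sample = "transformer load high. breaker tripped in substation\nvoltage normal."
result = preprocess_log(sample, str.split)
```

Passing the tokenizer as a parameter keeps the splitting logic independent of the segmentation library that is actually used.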
Thus, the preprocessed log data of the power industry are obtained.
Step S002, analyzing the characteristics of the word in the log data of the power industry to construct a log electric sensitivity index, constructing a local electric sensitivity correction index based on the log electric sensitivity index and the context characteristics of the word, constructing a power clause sensitivity index based on the local electric sensitivity correction index and the dependency relationship of the word in the sentence, and constructing a power sensitivity feature vector based on the log electric sensitivity index, the local electric sensitivity correction index and the power clause sensitivity index.
In the log data of the power industry, the equipment involved is specialized equipment of the power field, so the descriptors used for the running states of the equipment are relatively similar. Proper nouns and industry-specific terms are also used frequently, so these kinds of words occur often in the whole log. In addition, words such as running-state descriptors and proper nouns reflect the parameters and states of power equipment and are therefore relatively sensitive. This embodiment therefore constructs a log electric sensitivity index that reflects the global importance and sensitivity of each word. The construction process of the log electric sensitivity index is as follows:
Each word in the log data of the power industry is taken as the input of a TF-IDF algorithm, whose output is the TF-IDF value of the word.
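The text does not state which TF-IDF variant or document unit is used. As a sketch under stated assumptions, the example below computes one value per word, with the term frequency taken over the whole log and the inverse document frequency taken over clauses (each clause treated as a document). The function name `tfidf_per_word` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

def tfidf_per_word(clauses: List[List[str]]) -> Dict[str, float]:
    """One TF-IDF value per word: TF over the whole log, IDF over clauses (assumption)."""
    total_words = sum(len(c) for c in clauses)
    term_counts = Counter(w for c in clauses for w in c)
    clause_counts = Counter(w for c in clauses for w in set(c))
    n = len(clauses)
    return {
        w: (term_counts[w] / total_words) * math.log(n / clause_counts[w])
        for w in term_counts
    }

clauses = [["breaker", "tripped"], ["breaker", "closed"], ["voltage", "normal"]]
scores = tfidf_per_word(clauses)
```

With this variant a word that appears in every clause gets a value of 0, so rarer words such as "tripped" score higher than the more widespread "breaker".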
Further, the word vectors of all words in the log data of the power industry are used as the input of a DBSCAN clustering algorithm, with the Euclidean distance between word vectors as the distance metric, and the output is the clusters obtained by clustering all the words. Each element of a word vector is treated as one dimension, and the dimensions are numbered in the order of the elements in the word vector. The sum of the squares of the element values of all dimensions is taken as the modulus of the word vector, the ratio of the square of the element value of a single dimension to the modulus is recorded as the single-mode ratio of that dimension, and the number of the dimension with the largest single-mode ratio is recorded as the maximum single-mode dimension.
For example, if the word vector of the word "generator" is $[0.5, -0.3, 1.2]$, the dimensions corresponding to 0.5, -0.3, and 1.2 are numbered 1, 2, and 3, respectively. The modulus of the word vector is $0.5^2 + (-0.3)^2 + 1.2^2 = 1.78$, and the single-mode ratios of the dimensions are $0.25/1.78 \approx 0.140$, $0.09/1.78 \approx 0.051$, and $1.44/1.78 \approx 0.809$, so the maximum single-mode dimension is 3. The Euclidean distance, the TF-IDF algorithm, and the DBSCAN clustering algorithm are known techniques and are not described in detail in this embodiment.
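The computation in this example can be reproduced with a short sketch. The function name `max_single_mode_dimension` is hypothetical; the modulus is taken as the sum of squares, as the text defines it, and dimensions are numbered from 1.

```python
from typing import List, Tuple

def max_single_mode_dimension(vector: List[float]) -> Tuple[int, List[float]]:
    """Return the 1-based number of the dimension with the largest single-mode ratio, and all ratios."""
    squared = [x * x for x in vector]
    modulus = sum(squared)                 # sum of squares, as defined in the text
    ratios = [s / modulus for s in squared]
    best = max(range(len(ratios)), key=ratios.__getitem__) + 1
    return best, ratios

dim, ratios = max_single_mode_dimension([0.5, -0.3, 1.2])
```

An all-zero vector would cause a division by zero here; the text does not say how such a vector should be handled.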
The log electric sensitivity index is calculated as follows:

$$S_c = \frac{N_c}{\sigma_c + \varepsilon} \cdot T_c$$

where $S_c$ represents the log electric sensitivity index of the c-th word in the log data of the power industry, $N_c$ represents the number of words in the cluster in which the c-th word is located, $\sigma_c$ represents the standard deviation of the maximum single-mode dimensions of all words in that cluster, $\varepsilon$ represents a coordination coefficient that keeps the expression defined when the standard deviation is 0 (its value is 1 in this embodiment), and $T_c$ represents the TF-IDF value of the c-th word in the log data of the power industry.
The more words there are in the cluster in which the c-th word is located, that is, the larger the cluster size $N_c$ is, the more samples the cluster contains and the more reliable the analysis of the words belonging to it. The smaller the standard deviation $\sigma_c$ of the maximum single-mode dimensions of all words in the cluster, the smaller the differences between the maximum single-mode dimensions of those words. After a word is processed by the BERT model, a larger value of its word vector in some dimension means that dimension is more important for the word; if the maximum single-mode dimensions within the current cluster differ little, the words in the cluster are more uniform and more sensitive in a certain aspect. Meanwhile, the larger the TF-IDF value $T_c$ of the c-th word, the more important the word is in the log data, that is, the more sensitive it is. In all these cases the calculated log electric sensitivity index is larger.
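A sketch of this calculation, given cluster labels, maximum single-mode dimensions, and TF-IDF values that are assumed to have been computed already. The function name `log_sensitivity_index` and the toy inputs are hypothetical, and the population standard deviation is used because the text does not say which form of standard deviation is intended.

```python
from statistics import pstdev
from typing import Dict, List

def log_sensitivity_index(
    cluster_of: Dict[str, int],
    max_dim: Dict[str, int],
    tfidf: Dict[str, float],
    epsilon: float = 1.0,          # coordination coefficient, 1 in the embodiment
) -> Dict[str, float]:
    """index = cluster size / (std of max single-mode dimensions + epsilon) * TF-IDF."""
    members: Dict[int, List[str]] = {}
    for word, label in cluster_of.items():
        members.setdefault(label, []).append(word)
    result = {}
    for word, label in cluster_of.items():
        group = members[label]
        spread = pstdev([max_dim[w] for w in group])   # population std (assumption)
        result[word] = len(group) / (spread + epsilon) * tfidf[word]
    return result

cluster_of = {"generator": 0, "transformer": 0, "breaker": 0, "weather": 1}
max_dim = {"generator": 3, "transformer": 3, "breaker": 3, "weather": 1}
tfidf = {"generator": 0.2, "transformer": 0.1, "breaker": 0.3, "weather": 0.2}
index = log_sensitivity_index(cluster_of, max_dim, tfidf)
```

Here "generator" and "weather" have the same TF-IDF value, but "generator" sits in a larger, uniform cluster and therefore receives a larger index.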
The log electric sensitivity index obtained above reflects the global importance and sensitivity of each word in the log data of the power industry. In special cases, however, because of the polysemy of Chinese words, the overall importance and sensitivity of a sentence may be higher than its words suggest. For example, in a description of the running state or related parameters of some important equipment, the words describing the equipment are also important, that is, highly sensitive, even though the sensitivity of those descriptive words in terms of global features may be relatively low. Therefore, this embodiment constructs a local electric sensitivity correction index by analyzing the context features of each word and uses it to correct the log electric sensitivity index locally. The construction process of the local electric sensitivity correction index is as follows:
The clauses are sorted according to their order in the log data of the power industry. For the clause in which the c-th word is located, the set formed by the A nearest clauses is taken as the neighbor sentence set of the c-th word, where A is 10 in this embodiment. The neighbor sentence set is used as the input of an LDA (Latent Dirichlet Allocation) topic word extraction model, whose output is all the topic words in the neighbor sentence set, and the set of all these topic words is taken as the local topic word set of the c-th word. The Google normalized distances between the topic words in the local topic word set of the c-th word are then calculated. The LDA topic word extraction model and the Google normalized distance are known techniques, and their specific processes are not described in detail in this embodiment. The local electric sensitivity correction index can be calculated as follows:
$$W_c = \frac{1}{\sum_{d=1}^{M_c} \sum_{e=1}^{M_c} g_{d,e}}$$

$$L_c = W_c \cdot \sum_{f=1}^{M_c} \left( D_{c,f} + G_{c,f} \right)$$

where $W_c$ represents the topic constant weight of the c-th word, $M_c$ represents the number of topic words in the local topic word set of the c-th word, $g_{d,e}$ represents the Google normalized distance between the d-th topic word and the e-th topic word in the local topic word set of the c-th word, $L_c$ represents the local electric sensitivity correction index of the c-th word in the log data of the power industry, $D_{c,f}$ represents the Euclidean distance between the log electric sensitivity indexes of the c-th word and the f-th topic word in the local topic word set, and $G_{c,f}$ represents the Google normalized distance between the c-th word and the f-th topic word in the local topic word set.
In the local topic word set of the c-th word, the larger the Google normalized distance $g_{d,e}$ between two topic words, the smaller their semantic similarity; that is, within the A nearby clauses, different topic words reflect different topics. A correction based on such topic words should carry less weight, so the calculated topic constant weight is smaller. The smaller the Euclidean distance $D_{c,f}$ between the log electric sensitivity indexes of the c-th word and a topic word, the smaller the difference in sensitivity between them and the less the c-th word needs to be corrected. The smaller the Google normalized distance $G_{c,f}$ between the c-th word and a topic word, the greater their semantic similarity; that is, the c-th word and the topic words reflect the same topic, and the less the c-th word needs to be corrected. In both cases the calculated local electric sensitivity correction index is smaller.
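The topic constant weight and the local electric sensitivity correction index can be sketched as follows. This is an illustration under stated assumptions: `ngd` implements the standard Google normalized distance from occurrence counts, while the source of those counts (search hits or clause co-occurrence in the log) is not specified in the text; the sum over topic word pairs is taken over ordered pairs; and the function names and the toy distance table are hypothetical.

```python
import math
from typing import Callable, Dict, List

def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Google normalized distance from counts: f(x), f(y), f(x, y), and total size n."""
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

Dist = Callable[[str, str], float]

def topic_constant_weight(topics: List[str], dist: Dist) -> float:
    """Reciprocal of the summed distances between topic words (ordered pairs, an assumption)."""
    total = sum(dist(d, e) for d in topics for e in topics if d != e)
    return 1.0 / total

def local_correction_index(word: str, topics: List[str], sens: Dict[str, float], dist: Dist) -> float:
    """Weight times the sum over topic words of (index difference + normalized distance)."""
    weight = topic_constant_weight(topics, dist)
    # For scalar indexes the Euclidean distance reduces to the absolute difference.
    total = sum(abs(sens[word] - sens[t]) + dist(word, t) for t in topics)
    return weight * total

# Toy symmetric distance table and sensitivity indexes (made-up values for illustration).
table = {
    frozenset(("transformer", "breaker")): 0.5,
    frozenset(("overheated", "transformer")): 0.2,
    frozenset(("overheated", "breaker")): 0.4,
}
dist = lambda a, b: 0.0 if a == b else table[frozenset((a, b))]
sens = {"overheated": 0.3, "transformer": 0.9, "breaker": 0.6}
topics = ["transformer", "breaker"]
weight = topic_constant_weight(topics, dist)
index = local_correction_index("overheated", topics, sens, dist)
```

If all topic words are identical in meaning the summed distance is 0 and the reciprocal is undefined; the text does not describe how that case is handled, so the sketch leaves it unguarded.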
The local electric sensitivity correction index obtained above reflects how much the global sensitivity of each word should be corrected. In certain special cases, no single word of a sentence in the log data of the power industry is highly sensitive, yet the sentence as a whole is. For example, in the sentence "Some routine maintenance work has been completed, and a new risk assessment indicates that the current transmission capacity of the power system has a certain potential safety hazard", each word in the segmentation result has relatively low sensitivity, but the sentence as a whole indicates that the power system has a potential safety hazard that must be repaired in time to prevent it from growing. To improve the efficiency and accuracy of identifying sensitive data in the log data of the power industry in the subsequent steps, this embodiment constructs a power clause sensitivity index based on the local electric sensitivity correction index, which reflects the overall sensitivity of each clause in the log data of the power industry. The construction process of the power clause sensitivity index is as follows:
The a-th clause in the log data of the power industry is used as the input of a dependency syntax tree, whose output is the words of the a-th clause, the dependency relationships between the words, and the part of speech of each word. The dependency relationships include the subject-predicate relationship, the verb-object relationship, and the like; the parts of speech include nouns, verbs, adjectives, and the like. The dependency syntax tree is a known technique, and its specific process is not described in detail in this embodiment. The power clause sensitivity index can be calculated as follows:
wherein the method comprises the steps ofSensitive modification weight for representing a clause a in the log data of the power industry>Representing the number of modifier words in the a-th clause in the log data of the power industry, wherein the modifier words comprise adjectives, ordinal words and the like,/->、/>Log electric sensitivity indexes and local electric sensitivity correction indexes of h modifier words in an a-th clause in log data of the power industry are respectively represented, and exp () represents an exponential function based on a natural constant;
a power clause sensitivity index of an a-th clause in the log data of the power industry is represented by +.>Representing the number of subject words in the a-th clause, i.e. the number of words whose dependency relationship is subject in the dependency syntax tree, +.>、/>And the log electric sensitivity index and the local electric sensitivity correction index of the i-th main word in the a-th clause in the log data of the power industry are respectively represented.
The more sensitive the modifier words in a clause are (the larger their log electric sensitivity indexes) and the more strongly they are corrected (the larger their local electric sensitivity correction indexes), the more strongly they modify the sensitivity of the subject words and therefore of the whole clause. To amplify this effect, this embodiment applies an exponential function to the log electric sensitivity index of each modifier word, so the calculated sensitive modification weight becomes larger. Likewise, the larger the log electric sensitivity indexes and local electric sensitivity correction indexes of the subject words, the more sensitive the clause is and the larger the calculated power clause sensitivity index.
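The calculation described above can be sketched in Python as follows. This is a minimal illustration rather than the embodiment's implementation: the `Token` class, the role labels, and the neutral weight of 1.0 returned for a clause without modifier words are assumptions introduced here, since the text does not say how such a clause is handled.

```python
import math
from dataclasses import dataclass


@dataclass
class Token:
    """One segmented word of a clause (hypothetical container)."""
    text: str
    role: str          # "modifier", "subject", or "other" (assumed labels)
    log_index: float   # log electric sensitivity index of the word
    correction: float  # local electric sensitivity correction index of the word


def sensitive_modification_weight(tokens):
    """Mean over modifier words of exp(log index) * correction index."""
    modifiers = [t for t in tokens if t.role == "modifier"]
    if not modifiers:
        # Assumption: a clause without modifier words gets a neutral weight.
        return 1.0
    products = [math.exp(t.log_index) * t.correction for t in modifiers]
    return sum(products) / len(products)


def clause_sensitivity_index(tokens):
    """Weight times the sum over subject words of log index * correction index."""
    weight = sensitive_modification_weight(tokens)
    subject_sum = sum(t.log_index * t.correction
                      for t in tokens if t.role == "subject")
    return weight * subject_sum


clause = [
    Token("potential", "modifier", 0.0, 2.0),
    Token("hazard", "subject", 3.0, 0.5),
    Token("exists", "other", 0.1, 0.1),
]
index = clause_sensitivity_index(clause)
```

Because the modifier term is exponentiated, even a moderately sensitive modifier word scales up the contribution of every subject word in the clause.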
From the log electric sensitivity index and the local electric sensitivity correction index of each word in the power industry log data, together with the power clause sensitivity index of each clause, the power sensitive feature vector of each word can be constructed, which facilitates the subsequent identification of the named entity and the sensitivity of each word. The power sensitive feature vector of each word can be expressed as:

$$V_{a,b} = \left(S_{a,b},\ C_{a,b},\ P_a\right)$$

where $V_{a,b}$ represents the power sensitive feature vector of the b-th word in the a-th clause in the power industry log data; $S_{a,b}$ and $C_{a,b}$ respectively represent the log electric sensitivity index and the local electric sensitivity correction index of the b-th word in the a-th clause; and $P_a$ represents the power clause sensitivity index of the a-th clause. The flow chart for obtaining the power sensitive feature vector is shown in Fig. 2.
Thus, the power sensitive characteristic vector of each word in the log data of the power industry is obtained.
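As a rough sketch of this assembly step, the snippet below arranges the three indexes of each word into a tuple. The function name, the nested-list input layout, and the 0-based clause and word positions are assumptions made for illustration only.

```python
def build_feature_vectors(word_indexes, clause_indexes):
    """Arrange (log index, correction index, clause index) for every word.

    word_indexes[a][b] is the (log index, correction index) pair of the
    b-th word in the a-th clause; clause_indexes[a] is the power clause
    sensitivity index of the a-th clause. Positions are 0-based here.
    """
    vectors = {}
    for a, (words, clause_index) in enumerate(zip(word_indexes, clause_indexes)):
        for b, (log_index, correction) in enumerate(words):
            vectors[(a, b)] = (log_index, correction, clause_index)
    return vectors


vectors = build_feature_vectors(
    [[(0.2, 1.5), (0.4, 0.5)], [(0.1, 2.0)]],
    [3.0, 0.7],
)
```

Every word of a clause shares the same third component, which is how the clause-level sensitivity reaches words that are individually unremarkable.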
Step S003: the power sensitive feature vector of each word in the power industry log data, among other inputs, is used as the input of a BP neural network to obtain the named entity and the sensitivity of each word; the sensitivities are then clustered to obtain the sensitivity level of the named entities, thereby identifying the sensitive data of energy big data.
Given the power sensitive feature vector of each word in the power industry log data obtained in the above steps, each word, its word vector, and its power sensitive feature vector are used as the input of a BP neural network, with Adam as the optimization algorithm and the mean squared error (MSE) as the loss function. The network outputs the named entity to which each word belongs and the sensitivity of each word. For example, 'current transformer' is a named entity of type 'equipment'; during word segmentation it may be split into the two words 'current' and 'transformer', and the named entity to which both words belong is 'equipment'. The sensitivities of all words are then used as the input of the K-means clustering algorithm, with the Euclidean distance between sensitivities as the distance measure and the number of clusters K set to 3; the output is the resulting clusters. The BP neural network and the K-means clustering algorithm are known techniques, and their specific processes are not repeated in this embodiment.
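Because each sensitivity is a single number, the clustering step reduces to one-dimensional K-means, where the Euclidean distance is simply the absolute difference. The following is a minimal pure-Python sketch of Lloyd's algorithm with K = 3. The deterministic initialization from evenly spaced order statistics is an assumption made here for reproducibility; the text does not specify how the centroids are initialized.

```python
def kmeans_1d(values, k=3, max_iter=100):
    """Cluster scalar sensitivities with Lloyd's algorithm.

    Returns (assignment, centroids), where assignment[i] is the cluster
    number of values[i].
    """
    if len(values) < k:
        raise ValueError("need at least k values")
    ordered = sorted(values)
    # Assumed initialization: evenly spaced order statistics.
    centroids = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    assignment = None
    for _ in range(max_iter):
        # Assign every value to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: abs(v - centroids[j])) for v in values
        ]
        if new_assignment == assignment:
            break  # converged: no value changed cluster
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, c in zip(values, assignment) if c == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return assignment, centroids


sensitivities = [0.1, 0.15, 0.2, 0.5, 0.55, 0.9, 0.95, 1.0]
assignment, centroids = kmeans_1d(sensitivities)
```

In practice a library implementation with several random restarts would be the more usual choice; the sketch only shows what the step computes.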
The mean sensitivity of all words in each cluster is calculated, the clusters are arranged in descending order of this mean, and the clusters are labeled, in that order, as important data, general data, and auxiliary data. For each cluster, the proportion of each named entity among the words of the cluster is calculated, and the proportions are arranged in descending order to form a named entity proportion sequence. The elements of the sequence are summed one after another; as soon as the running sum exceeds 70%, the set of named entities corresponding to the elements summed so far is taken as the representative entity set of the cluster.
For example, suppose a cluster contains 5 named entities whose proportions, in descending order, are: equipment 30%, state 25%, working time 20%, working efficiency 13%, and working content 12%. When the third element is added, the running sum (75%) exceeds 70%, so equipment, state, and working time form the representative entity set of the cluster.
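The cluster labeling and the selection of the representative entity set can be sketched as below, using the proportions from the example above. The function names and the input layout (one named entity label per word) are assumptions for illustration; the comparison is done on word counts so that the 70% threshold is not affected by floating-point accumulation.

```python
from collections import Counter

LEVELS = ("important data", "general data", "auxiliary data")


def label_clusters(cluster_sensitivities):
    """Map cluster id -> level, ranking clusters by mean sensitivity (descending)."""
    means = {cid: sum(vals) / len(vals) for cid, vals in cluster_sensitivities.items()}
    ranked = sorted(means, key=means.get, reverse=True)
    return dict(zip(ranked, LEVELS))


def representative_entities(entity_labels, threshold=0.70):
    """Take entities in descending proportion until the running share exceeds threshold."""
    counts = Counter(entity_labels)
    total = sum(counts.values())
    chosen, running = [], 0
    for entity, count in counts.most_common():
        chosen.append(entity)
        running += count
        if running > threshold * total:
            break
    return chosen


# One entity label per word in the cluster: 30%, 25%, 20%, 13%, 12%.
labels = (["equipment"] * 30 + ["state"] * 25 + ["working time"] * 20
          + ["working efficiency"] * 13 + ["working content"] * 12)
representative = representative_entities(labels)
levels = label_clusters({0: [0.1, 0.2], 1: [0.9, 0.8], 2: [0.5]})
```

Here `representative` holds equipment, state, and working time, matching the example.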
Further, when the remaining power industry log text data is processed, the named entities in the log data are identified by a named entity recognition technique and matched against the representative entity sets of the important data, general data, and auxiliary data clusters. The result of this matching gives the sensitivity level of the log data and is taken as the identification result of the sensitive data, thereby realizing intelligent identification of sensitive data. Named entity recognition is a well-known technique and is not described in detail here.
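One possible reading of the matching step is sketched below. The text does not state how a log entry is resolved when its named entities appear in more than one representative entity set, so the rule used here (return the most sensitive level with any match, or `None` when nothing matches) is an assumption, as are the function name and the sample sets.

```python
LEVEL_ORDER = ("important data", "general data", "auxiliary data")


def match_sensitivity(entities, representative_sets):
    """Return the most sensitive level whose representative set shares an entity.

    entities: named entities recognized in one log entry.
    representative_sets: dict mapping a level name to its representative entity set.
    """
    found = set(entities)
    for level in LEVEL_ORDER:  # checked from most to least sensitive (assumed rule)
        if found & set(representative_sets.get(level, ())):
            return level
    return None  # no representative entity matched


# Hypothetical representative entity sets for the three clusters.
representative_sets = {
    "important data": {"equipment", "state"},
    "general data": {"working time"},
    "auxiliary data": {"working content"},
}
result = match_sensitivity(["working time", "state"], representative_sets)
```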
It should be noted that the order of the embodiments of the present invention described above is for description only and does not represent the relative merits of the embodiments. The foregoing describes specific embodiments of this specification. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or a sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In this specification, the embodiments are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Modifications to the technical solutions described in the foregoing embodiments, or equivalent replacements of some of their technical features, that do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application all fall within the protection scope of the present application.

Claims (10)

1. An intelligent identification method for sensitive data of energy big data, characterized by comprising the following steps:
collecting log data of the power industry, dividing it into clauses, and extracting the word vector of each word in each clause;
calculating the TF-IDF value of each word by using the TF-IDF algorithm; clustering the words according to the Euclidean distances between their word vectors, and acquiring the maximum single-mode dimension of each word according to the element values in its word vector; acquiring the log electric sensitivity index of each word according to the maximum single-mode dimension and the TF-IDF value of each word; acquiring a local subject word set of each word according to the order, in the power industry log data, of the clause in which the word is located; obtaining the subject constant weight of each word according to the distances between the subject words in the local subject word set; obtaining the local electric sensitivity correction index of each word according to the subject constant weight, the log electric sensitivity index, and the distances from the word to the subject words; acquiring the sensitive modification weight of each clause according to the log electric sensitivity index and the local electric sensitivity correction index; acquiring the power clause sensitivity index of each clause according to the sensitive modification weight, the log electric sensitivity index, and the local electric sensitivity correction index; acquiring the power sensitive feature vector of each word according to the log electric sensitivity index, the local electric sensitivity correction index, and the power clause sensitivity index of each clause;
and identifying the sensitive data according to each word, the word vector of each word and the power sensitive characteristic vector.
2. The intelligent identification method for sensitive data of energy big data according to claim 1, wherein clustering the words according to the Euclidean distances between their word vectors, and acquiring the maximum single-mode dimension of each word according to the element values in its word vector, comprises:
taking the word vectors of all words in the power industry log data as the input of the DBSCAN clustering algorithm, with the Euclidean distance between word vectors as the distance measure, and outputting the clusters obtained by clustering all words;
for each word, treating each element of its word vector as a dimension and numbering the dimensions in the order in which they appear in the word vector; taking the sum of the squares of the element values of all dimensions as the modulus of the word vector; recording the ratio of the square of the element value of each dimension to the modulus of the word vector as the single-mode ratio of that dimension; and recording the number of the dimension with the largest single-mode ratio as the maximum single-mode dimension.
3. The intelligent identification method for sensitive data of energy big data according to claim 1, wherein acquiring the log electric sensitivity index of each word comprises:
for each word, calculating the standard deviation of the maximum single-mode dimensions of all words in the cluster in which the word is located; calculating the sum of the standard deviation and a preset coordination coefficient; calculating the ratio of the number of words in that cluster to the sum; and taking the product of the ratio and the TF-IDF value of the word as the log electric sensitivity index of the word.
4. The intelligent identification method for sensitive data of energy big data according to claim 1, wherein acquiring the local subject word set of each word comprises:
ordering the clauses according to their order in the power industry log data and, for the clause in which each word is located, taking the set of the A other clauses nearest to that clause as the neighbor sentence set of the clause, wherein A is a preset number of clauses;
taking the neighbor sentence set as the input of an LDA (latent Dirichlet allocation) subject word extraction model, whose output is all subject words in the neighbor sentence set, and taking the set of all these subject words as the local subject word set of each word.
5. The intelligent identification method for sensitive data of energy big data according to claim 1, wherein obtaining the subject constant weight of each word comprises:
calculating the Google normalized distances between all subject words in the local subject word set of each word, and taking the reciprocal of the sum of all these normalized distances as the subject constant weight of each word.
6. The intelligent identification method for sensitive data of energy big data according to claim 1, wherein obtaining the local electric sensitivity correction index of each word comprises:
for each word, calculating the Euclidean distance between the log electric sensitivity index of the word and that of each subject word in its local subject word set; calculating the Google normalized distance between the word and each subject word in the local subject word set; calculating, for each subject word, the sum of the Euclidean distance and the Google normalized distance; calculating the total of these sums over all subject words; and taking the product of the total and the subject constant weight as the local electric sensitivity correction index of the word.
7. The intelligent identification method for sensitive data of energy big data according to claim 1, wherein acquiring the sensitive modification weight of each clause according to the log electric sensitivity index and the local electric sensitivity correction index comprises:
taking each clause in the power industry log data as the input of a dependency syntax tree, whose output is the word segmentation of the clause, the dependency relationships between the words, and the part of speech of each word;
for each clause, calculating, for each word whose part of speech is a modifier, an exponential function with the natural constant as the base and the log electric sensitivity index of the word as the exponent; calculating the product of this result and the local electric sensitivity correction index of the word; and taking the mean of these products over all modifier words in the clause as the sensitive modification weight of the clause.
8. The intelligent identification method for sensitive data of energy big data according to claim 7, wherein acquiring the power clause sensitivity index of each clause comprises:
for each word that is a subject word in each clause, calculating the product of its log electric sensitivity index and its local electric sensitivity correction index as a first product; and taking the product of the sum of all first products in the clause and the sensitive modification weight of the clause as the power clause sensitivity index of the clause.
9. The intelligent identification method for sensitive data of energy big data according to claim 1, wherein acquiring the power sensitive feature vector of each word comprises:
sequentially arranging the log electric sensitivity index of each word, the local electric sensitivity correction index of the word, and the power clause sensitivity index of the clause in which the word is located, to obtain the power sensitive feature vector of the word.
10. The intelligent identification method for sensitive data of energy big data according to claim 1, wherein identifying the sensitive data according to each word, the word vector of each word, and the power sensitive feature vector comprises:
taking each word, a word vector of each word and an electric power sensitive characteristic vector of each word as inputs of a BP neural network, wherein the outputs of the BP neural network are named entities to which each word belongs and the sensitivity of each word;
the sensitivity of all the segmented words is used as the input of a K-means clustering algorithm, and the output of the K-means clustering algorithm is three clustering clusters; calculating the average value of the sensitivity of all the word segmentation in each cluster, arranging the cluster according to the average value in a descending order, and sequentially marking the clusters corresponding to the arranged results as important data, general data and auxiliary data;
calculating the proportion of named entities corresponding to the words in each cluster, arranging the proportion of the named entities corresponding to the words in each cluster in a descending order to form a named entity proportion sequence, sequentially summing the elements in the named entity proportion sequence, and taking a set formed by the named entities corresponding to the elements in the named entity proportion sequence in the summation result as a representative entity set of the cluster when the summation is more than 70%;
and using a named entity recognition technology to recognize named entities in the log data, respectively matching the named entities with the representative entity sets of the important data, the general data and the auxiliary data cluster to obtain a matching result of the sensitivity degree of the log data, and taking the matching result as a recognition result of the sensitive data.
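As an illustration of the computations recited in claims 2 and 3, the following Python sketch derives the maximum single-mode dimension of a word vector and the log electric sensitivity index of a word. It is a reading of the claim language, not the applicant's implementation: the 1-based dimension numbering, the use of the population standard deviation, and the default coordination coefficient of 1.0 are assumptions introduced here.

```python
import statistics


def max_single_mode_dimension(vector):
    """Number of the dimension whose squared element has the largest share."""
    squared = [x * x for x in vector]
    modulus = sum(squared)  # the claim defines the modulus as the sum of squares
    if modulus == 0:
        raise ValueError("zero vector has no single-mode ratio")
    ratios = [s / modulus for s in squared]
    # Assumed numbering: dimensions are counted from 1.
    return max(range(len(ratios)), key=ratios.__getitem__) + 1


def log_electric_sensitivity_index(tfidf, cluster_vectors, coordination=1.0):
    """TF-IDF value * cluster size / (std of max single-mode dimensions + coefficient)."""
    dims = [max_single_mode_dimension(v) for v in cluster_vectors]
    spread = statistics.pstdev(dims)  # assumed: population standard deviation
    return tfidf * len(cluster_vectors) / (spread + coordination)


# Word vectors of the words that share one DBSCAN cluster (made-up values).
cluster = [[1.0, 0.0], [2.0, 0.5], [0.1, 3.0]]
index = log_electric_sensitivity_index(0.5, cluster)
```

The coordination coefficient keeps the denominator positive when every word in the cluster has the same maximum single-mode dimension.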
CN202410217787.0A 2024-02-28 2024-02-28 Intelligent identification method for sensitive data of energy big data Pending CN117807190A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410217787.0A CN117807190A (en) 2024-02-28 2024-02-28 Intelligent identification method for sensitive data of energy big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410217787.0A CN117807190A (en) 2024-02-28 2024-02-28 Intelligent identification method for sensitive data of energy big data

Publications (1)

Publication Number Publication Date
CN117807190A true CN117807190A (en) 2024-04-02

Family

ID=90423609

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410217787.0A Pending CN117807190A (en) 2024-02-28 2024-02-28 Intelligent identification method for sensitive data of energy big data

Country Status (1)

Country Link
CN (1) CN117807190A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination