CN116310407A

CN116310407A - Heterogeneous data semantic extraction method for power distribution and utilization multidimensional service

Info

Publication number: CN116310407A
Application number: CN202211095695.7A
Authority: CN
Inventors: 丁一; 张磐; 滕飞; 庞超; 霍现旭; 吴磊; 戚艳; 杨挺; 尚学军; 陈沛; 焦秋良; 孙峤
Original assignee: State Grid Corp of China SGCC; State Grid Tianjin Electric Power Co Ltd; Electric Power Research Institute of State Grid Tianjin Electric Power Co Ltd
Current assignee: State Grid Corp of China SGCC; State Grid Tianjin Electric Power Co Ltd; Electric Power Research Institute of State Grid Tianjin Electric Power Co Ltd
Priority date: 2022-09-08
Filing date: 2022-09-08
Publication date: 2023-06-23

Abstract

The invention relates to a heterogeneous data semantic extraction method for power distribution and utilization multi-dimensional service, comprising the following steps: step 1, preprocessing the power distribution and utilization multi-dimensional picture data; step 2, extracting semantic tags from the picture data preprocessed in step 1 using a deep learning model combined with manual correction, and constructing an image semantic tag set; step 3, performing semantic extraction on the inspection text data based on the image semantic tag set constructed in step 2, and matching corresponding semantic tags to the equipment and places mentioned in the inspection text; and step 4, based on the keywords extracted from the inspection text data in step 3 and the picture semantic tags from step 2, establishing text and picture semantic sequences, calculating the similarity between the sequences using the longest common subsequence (LCS) algorithm, and performing a data matching check. The invention can improve the accuracy of data semantic extraction.

Description

Heterogeneous data semantic extraction method for power distribution and utilization multidimensional service

Technical Field

The invention belongs to the field of deep learning algorithms applied to power big data, relates to a heterogeneous data semantic extraction method, and in particular relates to a heterogeneous data semantic extraction method for power distribution and utilization multi-dimensional service.

Background

In recent years, the rapid development of intelligent distribution networks and the promotion of paperless records have produced a large amount of poorly organized heterogeneous power data. Image and text data have grown substantially, including text data such as inspection records and picture data such as the operating states and environmental states of power equipment captured by inspection robots. Such data is difficult for a computer to understand and use. Meanwhile, data in the distribution network multi-dimensional service field comes from many sources and takes many forms, and each source or form can be regarded as a structural type, such as pictures, numbers, or text. Current semantic extraction mostly focuses on single-structure data such as text; however, because a large amount of image data is emerging, the need for semantic extraction from image data is becoming more urgent.

Semantic understanding of data enables an intelligent agent to perceive and understand real data scenes more deeply and to reason over the perceived information, so as to better support intelligent sensing applications in the power system. Data semantic extraction takes multi-source heterogeneous data as given and extracts the target semantics either manually or automatically, for example by machine learning or deep learning. Manual extraction generally relies on discussion among experts or organizations and requires considerable human resources. Automatic extraction relies on rapidly developing computer technology and has produced many results in recent years, progressing from early text semantic extraction to the extraction of structurally diverse data such as pictures and text.

Effectively extracting accurate semantic information from massive, growing measurement data makes it possible to better understand, retrieve, and manage the data, to realize multi-level and multi-dimensional semantic understanding and association of visual elements, and to lay an important foundation for content-oriented intelligent application services in the power industry.

However, existing heterogeneous data semantic extraction methods for power distribution and utilization multi-dimensional service have the following defect: conventional semantic extraction methods are mostly oriented to the general domain and are not applicable to the vertical domain of electric power. For example, LSTM-based natural language processing mainly exploits the contextual relevance of data to improve semantic extraction accuracy, but does not account for the large number of short messages with weak contextual relevance in power distribution service data, so existing semantic extraction methods are of limited practical use on power distribution data.

A search found no prior-art patent document identical or similar to the present invention.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a heterogeneous data semantic extraction method for power distribution and utilization multi-dimensional service, which addresses the technical problem that information is difficult to extract because data structures differ between services.

The invention solves this practical problem by adopting the following technical scheme:

A heterogeneous data semantic extraction method for power distribution and utilization multi-dimensional service comprises the following steps:

Step 1, preprocessing the power distribution and utilization multi-dimensional picture data;

Step 2, extracting semantic tags from the picture data preprocessed in step 1 using a deep learning model combined with manual correction, and constructing an image semantic tag set;

Step 3, performing semantic extraction on the inspection text data based on the image semantic tag set constructed in step 2, and matching corresponding semantic tags to the equipment and places mentioned in the inspection text;

Step 4, based on the keywords extracted from the inspection text data in step 3 and the picture semantic tags from step 2, establishing text and picture semantic sequences, calculating the similarity between the sequences using the LCS algorithm, and performing a data matching check.

Further, the specific steps of step 1 include:

(1) Unifying the size of the inspection pictures: scaling each original image to the specified image size, so that all images are unified to 600 × 800;

(2) Cropping and padding: to unify the sizes of images from different distribution network service sources, cropping the original image if it is larger than the target image, and, if it is smaller than the target image, filling the blank pixels produced when the image is stretched with black pixels;

(3) Adjusting the equipment image proportion: adjusting the aspect ratio of some image files in the data set, taking the function value as 1 and the center point as the reference, and saving the adjusted image as a new image.
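The crop-or-pad rule in item (2) can be illustrated with a minimal sketch. It assumes a grayscale image stored as a list of pixel rows, a target of 800 rows by 600 columns (the text does not say which of the two numbers is the height), and center alignment; the function name `fit_to_canvas` is hypothetical and is not taken from the patent.

```python
from typing import List

Image = List[List[int]]  # assumption: grayscale image as rows of pixel values

TARGET_H, TARGET_W = 800, 600  # assumed orientation of the 600 x 800 target size


def fit_to_canvas(img: Image, target_h: int = TARGET_H, target_w: int = TARGET_W) -> Image:
    """Center-crop any dimension that is too large and pad the rest with black (0)."""
    src_h = len(img)
    src_w = len(img[0]) if src_h else 0

    # Offset of the crop window inside the source (0 when the source is smaller).
    crop_top = max((src_h - target_h) // 2, 0)
    crop_left = max((src_w - target_w) // 2, 0)
    kept_h = min(src_h, target_h)
    kept_w = min(src_w, target_w)

    # Offset of the kept region inside the black canvas (0 when nothing is padded).
    pad_top = (target_h - kept_h) // 2
    pad_left = (target_w - kept_w) // 2

    canvas = [[0] * target_w for _ in range(target_h)]
    for r in range(kept_h):
        row = img[crop_top + r][crop_left:crop_left + kept_w]
        canvas[pad_top + r][pad_left:pad_left + kept_w] = row
    return canvas
```

In practice an image library would normally perform the scaling in item (1) before this step; the sketch only covers the cropping and black-pixel filling.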

Further, the specific steps of step 2 include:

(1) Constructing a deep learning model with an Encoder-Decoder structure: the power distribution and utilization multi-dimensional picture data preprocessed in step 1 is first input to the Encoder part, where, using the spatial characteristics of the CNN, the feature map of a convolution layer is used to extract the features $x = (x_1, x_2, \ldots, x_n)$ at n positions of the picture, where each $x_m$ is a D-dimensional vector;

(2) At decoding stage t, i.e. when the t-th feature semantic is generated, the context vector passed into the Decoder CNN is $z_t$ and the hidden-layer state of the previous stage of the CNN is $g_{t-1}$; the context vector $z_t$ is a weighted combination of $x = (x_1, x_2, \ldots, x_n)$, and the relationship between $z_t$ and $x$ can be expressed as:

$$ z_t = \sum_{m=1}^{n} \alpha_{t,m}\, x_m $$

where $\alpha_{t,m}$ is the weight given to the image feature at the m-th position when the t-th feature semantic is generated; this weight is a function of the previous hidden-layer state $g_{t-1}$ and the m-th position image feature $x_m$;

(3) Taking the obtained feature $z_t$ as the input of the CNN, and outputting the model result $y_t$ generated through the hidden variables;

(4) Calibrating part of the data by manual intervention to finally obtain the image semantic tag set.
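The attention step in item (2) can be sketched as follows. The text only states that each weight is a function of the previous hidden state and the position feature; the dot-product score followed by a softmax used here is an assumption, as are the function names and the requirement that the hidden state has the same dimension D as the features.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(
    features: Sequence[Sequence[float]],  # x_1 .. x_n, each of dimension D
    hidden: Sequence[float],              # g_{t-1}, assumed to have dimension D
) -> Tuple[List[float], List[float]]:
    """Return (z_t, alphas): the weighted sum of features and the weights used."""
    # Assumed scoring function: dot product between g_{t-1} and each x_m.
    scores = [sum(h * f for h, f in zip(hidden, x_m)) for x_m in features]
    alphas = softmax(scores)
    dim = len(features[0])
    # z_t is the sum over positions m of alpha_{t,m} * x_m.
    z_t = [sum(a * x_m[d] for a, x_m in zip(alphas, features)) for d in range(dim)]
    return z_t, alphas
```

With a zero hidden state every position receives the same weight, so `z_t` reduces to the plain average of the position features.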

Further, the specific steps of step 3 include:

(1) Feature representation: the word vector X mapped from the inspection text is taken as input; the dimensions of the power distribution station inspection text vector are encoded by hash coding to the h layer, and a 128-dimensional low-dimensional vector Y is then obtained by passing through a three-layer DNN in sequence. The calculation process is as follows:

where $l$ denotes the hidden-layer nodes, $W_i$ is the weight matrix of the i-th layer, and f is the activation function of the hidden layers and the output layer; the inspection text semantic model uses tanh as the activation function of the hidden layers and the output layer:

(2) Using N+1 low-dimensional vectors of 128 dimensions to represent the Query and the N Docs respectively, the semantic similarity between the Query and a Doc can be expressed by the cosine similarity between the two semantic vectors:

(3) The semantic similarity between the Query and the positive sample Doc can be converted into a posterior probability through a softmax function:

where $\gamma$ is the smoothing factor of the softmax, $D^{+}$ is a positive sample under the Query, $D'$ is any sample under the Query, and $\mathbf{D}$ is the entire sample space under the Query. In the training phase, a log-loss function is used:

$$ L = -\ln P(D^{+} \mid Q) \qquad (6) $$

(4) The CNN-DSSM extracts the context information of the inspection record text within a sliding window through the convolution layer and extracts global context information through the pooling layer, so that semantic features are effectively retained, and corresponding semantic tags are matched to the equipment, places, and other entities in the inspection text.
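The formulas referred to in items (1) to (3) are not reproduced in this text. As a hedged reconstruction, the standard DSSM formulation that matches the definitions above would read as follows; the bias terms $b_i$, the similarity symbol $R$, and the subscripts $Y_Q$, $Y_D$ are assumptions introduced for illustration.

```latex
\begin{aligned}
l_1 &= W_1 X, \\
l_2 &= f(W_2\, l_1 + b_2), \\
Y   &= f(W_3\, l_2 + b_3), \\
f(x) &= \tanh(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}}, \\
R(Q, D) &= \cos(Y_Q, Y_D) = \frac{Y_Q^{\top} Y_D}{\lVert Y_Q \rVert \, \lVert Y_D \rVert}, \\
P(D^{+} \mid Q) &= \frac{\exp\bigl(\gamma\, R(Q, D^{+})\bigr)}{\sum_{D' \in \mathbf{D}} \exp\bigl(\gamma\, R(Q, D')\bigr)}.
\end{aligned}
```

The last line is the posterior that enters the log-loss of equation (6).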

Further, the specific steps of step 4 include:

(1) Recursively solving the common subsequence of the picture semantic sequence and the text sequence: let the sequences be $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_m)$, where the numbers of elements of X and Y are n and m respectively; $X_i = (x_1, x_2, \ldots, x_i)$ and $Y_j = (y_1, y_2, \ldots, y_j)$ are sub-sequences (prefixes) of X and Y respectively, where $i \le n$ and $j \le m$. Based on the property that a subset of the longest common subsequence of two sequences is also a common subsequence of them, the longest common subsequence $LCS(X_n, Y_m)$ is solved recursively; the specific calculation formula is as follows:

(2) Similarity calculation: if the length of the obtained longest common subsequence is k, the length of text 1 is $n_1$, and the length of text 2 is $n_2$, the similarity S is calculated as:

(3) Sorting labels by similarity: the longest common subsequence between the power picture semantic sequence and the inspection text sequence to be matched is solved with the longest common subsequence algorithm, the similarity S between the sequences is measured from the result, all label pairs are sorted by S, and the group of labels with the highest similarity is taken as the matching result, completing the check between the different semantic sequences.
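The three items above can be sketched in a few lines of Python. The LCS length is computed with the usual dynamic-programming table. Because the similarity formula is not reproduced in this text, the normalisation S = 2k / (n1 + n2) used below is an assumption, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def lcs_length(x: Sequence, y: Sequence) -> int:
    """Length k of the longest common subsequence of x and y (dynamic programming)."""
    n, m = len(x), len(y)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


def similarity(x: Sequence, y: Sequence) -> float:
    """Assumed normalisation: S = 2k / (n1 + n2), which lies in [0, 1]."""
    if not x and not y:
        return 0.0
    return 2 * lcs_length(x, y) / (len(x) + len(y))


def rank_candidates(
    picture_tags: Sequence[str], candidate_texts: Sequence[Sequence[str]]
) -> List[Tuple[int, float]]:
    """Return (candidate index, S) pairs sorted from most to least similar."""
    scored = [(idx, similarity(picture_tags, text)) for idx, text in enumerate(candidate_texts)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored
```

The first entry of the list returned by `rank_candidates` corresponds to the label group with the highest similarity, which item (3) takes as the matching result.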

The invention has the following advantages and beneficial effects:

1. The invention provides a heterogeneous data semantic extraction method for power distribution and utilization multi-dimensional service which, according to the characteristics of multi-dimensional heterogeneous power distribution and utilization data, uses the CNN and CNN-DSSM algorithms to extract semantic information from equipment fault maintenance pictures and inspection record texts respectively, and uses LCS for operations such as semantic matching of image and text information. With this technical scheme, useful information can be obtained quickly and accurately from unstructured data, information can be cross-checked between structured and unstructured data, and the accuracy of data semantic extraction is improved.

2. The invention provides a data semantic extraction method for the specific field of power distribution and utilization multi-dimensional advanced measurement data, integrating data resources and applying construction methods such as data normalization, semantic extraction, and similarity calculation. The invention is mainly directed at multi-dimensional advanced measurement data of power distribution and utilization. It uses intelligent learning algorithms for semantic extraction, including picture character recognition, text semantic extraction, and image-text information matching. It adopts a CNN model fused with a multi-layer perceptron architecture to further increase the depth of the model so that the model extracts deeper information, realizes intelligent recognition and checking of the semantics of multi-dimensional power distribution and utilization maintenance data, and benefits intelligent applications of distribution network big data.

Drawings

Fig. 1 is a schematic diagram of power inspection picture extraction according to the present invention;

Fig. 2 is a schematic diagram of the power station house inspection text information extraction model of the present invention;

Fig. 3 is a schematic diagram of the image-text information matching check of the present invention.

Detailed Description

Embodiments of the invention are described in further detail below with reference to the accompanying drawings:

The specific steps of step 1 are as set out in the Disclosure of Invention above.

The working principle of step 1 is as follows:

For distribution network operation and maintenance service data, tool packages in Python such as Pandas and NumPy are used together to unify the format and the organized storage form of the data, including preprocessing operations such as removing fragmented data and resizing and normalizing the original power equipment images.

The specific steps of step 2 include:

(1) Constructing a deep learning model with an Encoder-Decoder structure, as shown in Fig. 1: the power distribution and utilization multi-dimensional picture data preprocessed in step 1 is first input to the Encoder part, where, using the spatial characteristics of the CNN, the feature map of a convolution layer is used to extract the features $x = (x_1, x_2, \ldots, x_n)$ at n positions of the picture, where each $x_m$ is a D-dimensional vector;

In this embodiment, the feature map has a height and width of 14 and 256 channels, so n = 14 × 14 = 196 and D = 256. Meanwhile, in order to extract the high-level semantics of the power equipment image, an Attention mechanism is added during decoding, and different weights are assigned to the extracted image features;
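To make the relation between the feature map and the n position features concrete, the following sketch flattens a 14 × 14 × 256 feature map into 196 vectors of dimension 256. The height × width × channels layout, the row-major ordering of positions, and the function name are assumptions for illustration.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # assumed layout: [row][column][channel]


def flatten_feature_map(fmap: FeatureMap) -> List[List[float]]:
    """Turn an H x W x D feature map into n = H * W position features of dimension D.

    Position m corresponds to (row, column) = divmod(m, W), i.e. row-major order.
    """
    return [fmap[r][c] for r in range(len(fmap)) for c in range(len(fmap[0]))]


# Example with the sizes given in this embodiment.
HEIGHT, WIDTH, CHANNELS = 14, 14, 256
example_map = [
    [[float(r * WIDTH + c)] * CHANNELS for c in range(WIDTH)] for r in range(HEIGHT)
]
features = flatten_feature_map(example_map)
```

Here `len(features)` is n and `len(features[0])` is D, which are the quantities the attention weights are computed over.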

The working principle of step 2 is as follows:

Semantic tags are extracted from the semi-structured and unstructured picture data by means such as deep learning and manual labeling, and a semantic tag set is constructed.

Because equipment fault maintenance pictures contain more visual information than inspection record texts, including shallow information such as color and texture and higher-level semantic information such as operators and power equipment, the invention extracts the semantic labels of power pictures with a deep learning model combined with manual correction.

Compared with traditional methods, the model greatly improves the accuracy of the generated semantic labels; however, because supervised data is scarce, problems such as shifts in semantic focus and imprecise descriptions remain. To reduce the influence of limited data on label accuracy, part of the data is calibrated by manual intervention, which improves the accuracy of the semantic labels and finally yields the image semantic labels.

The specific steps of step 3 include:

(1) Feature representation: the word vector X mapped from the inspection text is taken as input; the dimensions of the power distribution station inspection text vector are encoded by hash coding to the h layer, and a 128-dimensional low-dimensional vector Y is then obtained by passing through a three-layer DNN in sequence, as shown in Fig. 2. The calculation process is as follows:

$$ L = -\ln P(D^{+} \mid Q) \qquad (6) $$

The working principle of step 3 is as follows:

Keywords are extracted from the inspection text data by deep learning. Among existing semantic extraction methods, using CNN-DSSM for keyword extraction can improve how well the feature vectors represent semantics, reduce the sparsity of the data dimensions, and effectively mitigate the problem of data noise.

The core idea of DSSM is to map the query and the doc into a semantic space of a common dimension and to train a latent semantic model by maximizing the cosine similarity between the query and doc semantic vectors, thereby achieving the purpose of retrieval.
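The scoring side of this idea can be sketched as follows: cosine similarity between a query vector and each doc vector, a softmax with smoothing factor γ, and the negative log-likelihood of the positive doc as in equation (6). The vectors are taken as given (no network is trained here), and the default value `gamma=10.0` and the function names are assumptions.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def posterior(
    query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]], gamma: float = 10.0
) -> List[float]:
    """P(D | Q) for every doc: softmax over gamma-scaled cosine similarities."""
    logits = [gamma * cosine(query_vec, d) for d in doc_vecs]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_loss(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    positive_index: int,
    gamma: float = 10.0,
) -> float:
    """L = -ln P(D+ | Q), where doc_vecs[positive_index] is the positive sample."""
    return -math.log(posterior(query_vec, doc_vecs, gamma)[positive_index])
```

Training would adjust the network weights so that this loss decreases, which pushes the positive doc's cosine similarity above that of the other docs.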

The input of the original DSSM model consists of two parts: (1) the Query part; (2) the Doc, i.e. the document part, where the number of documents is dynamic and depends on the specific scenario and service. The MLP part of the model is divided into 5 layers whose dimensions decrease in order from bottom to top: 500k, 30k, 300, 128.
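The large drop in dimension between the first two layers comes from the hash coding step. The patent does not spell out the hashing scheme; the sketch below assumes the letter-trigram word hashing used in the original DSSM work, applied to whitespace-separated English words. Chinese inspection text would need a different tokenization, and the function names are hypothetical.

```python
from collections import Counter
from typing import List


def letter_trigrams(word: str) -> List[str]:
    """Split a word into letter trigrams after adding boundary marks (#)."""
    marked = f"#{word}#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]


def hash_text(text: str) -> Counter:
    """Bag of letter trigrams for a whole text (assumed word-hashing scheme)."""
    counts: Counter = Counter()
    for word in text.lower().split():
        counts.update(letter_trigrams(word))
    return counts
```

For example, `letter_trigrams("good")` gives `['#go', 'goo', 'ood', 'od#']`. Because the number of distinct trigrams is far smaller than the vocabulary, the hashed vector is much lower-dimensional than a one-hot term vector.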

Step 4, based on the keywords extracted from the inspection text data in step 3 and the picture semantic tags from step 2, establishing text and picture semantic sequences, calculating the similarity between the sequences using the LCS algorithm, and performing a data matching check;

As shown in Fig. 3, the specific steps of step 4 include:

(1) Recursively solving the common subsequence of the picture semantic sequence and the text sequence: let the sequences be $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_m)$, where the numbers of elements of X and Y are n and m respectively; $X_i = (x_1, x_2, \ldots, x_i)$ and $Y_j = (y_1, y_2, \ldots, y_j)$ are sub-sequences (prefixes) of X and Y respectively, where $i \le n$ and $j \le m$. Based on the property that a subset of the longest common subsequence of two sequences is also a common subsequence of them, the longest common subsequence $LCS(X_n, Y_m)$ is solved recursively according to whether the corresponding element values are equal; the specific calculation formula is as follows:

(3) Sorting labels by similarity: the longest common subsequence between the power picture semantic sequence and the inspection text sequence to be matched is quickly solved with the longest common subsequence algorithm, the similarity S between the sequences is measured from the result, all label pairs are sorted by S, and the group of labels with the highest similarity is taken as the matching result, completing the check between the different semantic sequences.

The working principle of step 4 is as follows:

The label similarity of the label sets is calculated with a semantic similarity method based on the longest common subsequence, and the text with the highest confidence is selected to check the semantic match between data of different modalities.

Given two sequences X and Y, the longest common subsequence is the longest of all their common subsequences, and its length can be used to quantify the similarity between the sequences. Methods for solving the longest common subsequence include exhaustive search and dynamic programming. With exhaustive search, once the data volume reaches a certain size the amount of computation grows exponentially with the number of elements in the sequences, whereas dynamic programming avoids repeated computation and thus improves the efficiency with which LCS matches the two sequences.
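The recursion that dynamic programming tabulates is not reproduced in this text. The standard LCS recurrence, written here with an assumed table symbol $c[i, j]$ for the LCS length of the prefixes $X_i$ and $Y_j$, is:

```latex
c[i, j] =
\begin{cases}
0, & i = 0 \ \text{or}\ j = 0, \\
c[i-1,\, j-1] + 1, & i, j > 0 \ \text{and}\ x_i = y_j, \\
\max\bigl(c[i,\, j-1],\ c[i-1,\, j]\bigr), & i, j > 0 \ \text{and}\ x_i \neq y_j.
\end{cases}
```

Filling the table once costs on the order of $n \times m$ operations, and the length of $LCS(X_n, Y_m)$ is $c[n, m]$; exhaustive search would instead enumerate the $2^n$ subsequences of X.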

Step 5, experimental verification

To verify that the atlas constructed by the method offers rich modalities, high accuracy, and high retrieval efficiency, two common metrics, AUC and average precision (AP), are selected for method verification.

These metrics are likewise used to verify that the matched heterogeneous power data semantic extraction technique constructed by the method offers high accuracy and high retrieval efficiency. The AP metric corresponds to the area under the Precision-Recall curve and therefore better reflects the overall performance of the algorithm; its calculation formula is as follows:

where TP is the number of correctly aligned data pairs and FP is the number of data pairs that are semantically unrelated to the data tag.
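Since the AP formula itself is not reproduced in this text, the following sketch relies on two common definitions as assumptions: precision = TP / (TP + FP), and AP as the mean of the precision values measured at each rank where a correct pair is retrieved (a discrete approximation of the area under the Precision-Recall curve). The function names are hypothetical.

```python
from typing import Sequence


def precision(tp: int, fp: int) -> float:
    """Share of returned pairs that are correctly aligned."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def average_precision(ranked_hits: Sequence[bool]) -> float:
    """Mean of precision@k over the ranks k at which a correct pair appears.

    ranked_hits[k] is True when the k-th returned pair is correctly aligned.
    """
    tp = 0
    precisions = []
    for rank, hit in enumerate(ranked_hits, start=1):
        if hit:
            tp += 1
            precisions.append(precision(tp, rank - tp))
    return sum(precisions) / len(precisions) if precisions else 0.0
```

For a ranking in which the first and third results are correct and the second is not, the precision values at the two hits are 1 and 2/3, so the AP is their mean.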

The AUC metric measures the probability that the correct data pair is selected for an alignment link. Suppose that a total of f alignment comparisons are performed in the experiment, that in $f_1$ of them the linked data pair receives the higher score and can be selected, and that in $f_2$ of them the scores of the two node pairs are equal; the AUC can then be expressed as:
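The AUC expression is not reproduced in this text. A commonly used form for this kind of comparison-based estimate is AUC = (f1 + 0.5 × f2) / f, which the sketch below assumes; the function names and the exhaustive pairwise comparison are illustrative choices, not taken from the patent.

```python
from typing import Sequence


def auc_from_counts(f: int, f1: int, f2: int) -> float:
    """Assumed form: AUC = (f1 + 0.5 * f2) / f."""
    if f <= 0:
        raise ValueError("f must be positive")
    return (f1 + 0.5 * f2) / f


def auc_by_comparison(
    correct_scores: Sequence[float], incorrect_scores: Sequence[float]
) -> float:
    """Compare every correct pair's score with every incorrect pair's score."""
    f = f1 = f2 = 0
    for good in correct_scores:
        for bad in incorrect_scores:
            f += 1
            if good > bad:
                f1 += 1   # the correct pair wins the comparison
            elif good == bad:
                f2 += 1   # tie, counted with weight 0.5
    return auc_from_counts(f, f1, f2)
```

A value of 0.5 corresponds to random selection, and values closer to 1 indicate that correct pairs are consistently scored above incorrect ones.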

it should be emphasized that the embodiments described herein are illustrative rather than limiting, and that this invention encompasses other embodiments which may be made by those skilled in the art based on the teachings herein and which fall within the scope of this invention.

Claims

1. A heterogeneous data semantic extraction method for power distribution and utilization multi-dimensional service, characterized in that the method comprises the following steps:

2. The heterogeneous data semantic extraction method for power distribution and utilization multi-dimensional service according to claim 1, characterized in that the specific steps of step 1 include:

3. The heterogeneous data semantic extraction method for power distribution and utilization multi-dimensional service according to claim 1, characterized in that the specific steps of step 2 include:

(1) Constructing a deep learning model with an Encoder-Decoder structure: the power distribution and utilization multi-dimensional picture data preprocessed in step 1 is first input to the Encoder part, where, using the spatial characteristics of the CNN, the feature map of a convolution layer is used to extract the features $x = (x_1, x_2, \ldots, x_n)$ at n positions of the picture, where each $x_m$ is a D-dimensional vector;

4. The heterogeneous data semantic extraction method for power distribution and utilization multi-dimensional service according to claim 1, characterized in that the specific steps of step 3 include:

where $\gamma$ is the smoothing factor of the softmax, $D^{+}$ is a positive sample under the Query, $D'$ is any sample under the Query, and $\mathbf{D}$ is the entire sample space under the Query; in the training phase, a log-loss function is used:

$$ L = -\ln P(D^{+} \mid Q) \qquad (6) $$

5. The heterogeneous data semantic extraction method for power distribution and utilization multi-dimensional service according to claim 1, characterized in that the specific steps of step 4 include:

(1) Recursively solving the common subsequence of the picture semantic sequence and the text sequence: let the sequences be $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_m)$; the numbers of elements of X and Y are n and m respectively; $X_i = (x_1, x_2, \ldots, x_i)$ and $Y_j = (y_1, y_2, \ldots, y_j)$ are sub-sequences (prefixes) of X and Y respectively, where $i \le n$ and $j \le m$; based on the property that a subset of the longest common subsequence of two sequences is also a common subsequence of them, the longest common subsequence $LCS(X_n, Y_m)$ is solved recursively; the specific calculation formula is as follows:

(2) Sorting labels by similarity: the longest common subsequence between the power picture semantic sequence and the inspection text sequence to be matched is solved with the longest common subsequence algorithm, the similarity S between the sequences is measured from the result, all label pairs are sorted by S, and the group of labels with the highest similarity is taken as the matching result, completing the check between the different semantic sequences.