CN116680423A - Management method, device, equipment and medium for multi-source heterogeneous data of power supply chain

Info

Publication number
CN116680423A
Authority
CN
China
Prior art keywords
data
source heterogeneous
heterogeneous data
supply chain
feature vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310968192.4A
Other languages
Chinese (zh)
Other versions
CN116680423B (en)
Inventor
李海弘
刘勇
徐天天
陈甜妹
陈枫
张莹
岳衡
杨岸涛
沈琦
蔡亮
李伟键
李启雷
莫加杰
王刘俊
丁靖
马新强
杨建党
张可鑫
符艳青
杨新益
包江雪
马骏
俞晨玺
翁慧颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ningbo Production Chain Digital Technology Co ltd
State Grid Zhejiang Zhedian Tendering Consulting Co ltd
Zhejiang University ZJU
State Grid Zhejiang Electric Power Co Ltd
Jiaxing Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
Ningbo Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
Materials Branch of State Grid Zhejiang Electric Power Co Ltd
Original Assignee
Ningbo Production Chain Digital Technology Co ltd
State Grid Zhejiang Zhedian Tendering Consulting Co ltd
Zhejiang University ZJU
State Grid Zhejiang Electric Power Co Ltd
Jiaxing Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
Ningbo Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
Materials Branch of State Grid Zhejiang Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ningbo Production Chain Digital Technology Co ltd, State Grid Zhejiang Zhedian Tendering Consulting Co ltd, Zhejiang University ZJU, State Grid Zhejiang Electric Power Co Ltd, Jiaxing Power Supply Co of State Grid Zhejiang Electric Power Co Ltd, Ningbo Power Supply Co of State Grid Zhejiang Electric Power Co Ltd, Materials Branch of State Grid Zhejiang Electric Power Co Ltd
Priority to CN202310968192.4A
Publication of CN116680423A
Application granted
Publication of CN116680423B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/45 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/483 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a management method, a device, equipment and a medium for multi-source heterogeneous data of an electric power supply chain, which are used for collecting multi-source heterogeneous data under the electric power supply chain and storing the multi-source heterogeneous data according to data types in a classified manner; carrying out vector conversion on data of different data types by adopting an encoder of a CLIP algorithm to obtain a multidimensional feature vector; calculating the similarity of the multidimensional feature vectors obtained by converting the structured data in the multi-source heterogeneous data, and merging fields of the multidimensional feature vectors with the similarity exceeding a preset threshold; clustering the multidimensional feature vectors obtained by converting unstructured data in the multi-source heterogeneous data, and storing the clustered multidimensional feature vectors in a classified manner. The multi-source heterogeneous data management of the power supply chain can be realized, and the operation efficiency of the supply chain is optimized.

Description

Management method, device, equipment and medium for multi-source heterogeneous data of power supply chain
Technical Field
The invention relates to the technical field of information data fusion, in particular to a method, a device, equipment and a medium for managing multi-source heterogeneous data of a power supply chain.
Background
In the era of digital transformation, data is an important production factor for platform operation, and business models built on data circulation and application are opening up new fields of use. In the supply chain, the continued growth of data assets, together with end-to-end business collaboration and operation across departments, levels, domains and enterprises, enables the innovation and transformation of core business processes such as plan management, bidding and procurement, contract performance, quality supervision and supplier management, optimizes business operation and resource allocation efficiency, and produces a data multiplier effect. Meanwhile, stronger data modeling and application allow the value of big data assets to be mined in depth and management activities such as state tracking, process control and dynamic optimization to be digitized.
However, when traditional big data technology is used to acquire supply chain data, it suffers from limited accuracy and a lack of semantic understanding, and it handles only a narrow range of data types, so it cannot cope with the multi-source heterogeneous data of the power supply chain.
Disclosure of Invention
In order to overcome the defects, the invention provides a method, a device, equipment and a medium for managing multi-source heterogeneous data of a power supply chain, which can realize multi-source heterogeneous data management of the power supply chain and optimize the operation efficiency of the power supply chain.
The embodiment of the invention provides a management method of multi-source heterogeneous data of a power supply chain, which comprises the following steps:
collecting multi-source heterogeneous data under a power grid supply chain, and storing the multi-source heterogeneous data in a classified manner according to data types;
carrying out vector conversion on data of different data types by adopting an encoder of a CLIP algorithm to obtain a multidimensional feature vector;
calculating the similarity of the multidimensional feature vectors obtained by converting the structured data in the multi-source heterogeneous data, and merging fields of the multidimensional feature vectors with the similarity exceeding a preset threshold;
clustering the multidimensional feature vectors obtained by converting unstructured data in the multi-source heterogeneous data, and storing the clustered multidimensional feature vectors in a classified manner.
Preferably, the collecting multi-source heterogeneous data under the power grid supply chain specifically includes:
based on big data cluster technology, multi-source heterogeneous data under a power grid supply chain is collected through a Logstash data collection engine;
the multi-source heterogeneous data includes structured data, semi-structured data, and unstructured data.
As a preferred solution, the storing the multi-source heterogeneous data according to data type classification specifically includes:
storing unstructured data in the multi-source heterogeneous data into a MinIO object storage server, storing structured data in the multi-source heterogeneous data into a PostgreSQL relational database, and storing semi-structured data in the multi-source heterogeneous data into a MongoDB distributed database.
Preferably, after storing the multi-source heterogeneous data according to the data type classification, the method further comprises:
computing MD5 message-digest values for the picture files and video files of the unstructured data in the multi-source heterogeneous data, and matching them against the files already stored in the database to remove duplicates;
and, for the structured data in the multi-source heterogeneous data, supplementing missing user or product information through the correspondence among the user ID card number, the taxpayer identification number and the unique product code.
As a preferred solution, the encoder using the CLIP algorithm performs vector conversion on data of different data types to obtain a multidimensional feature vector, which specifically includes:
encoding the text in the document files of the unstructured data by adopting a CLIP text encoder, and converting the text into multidimensional feature vectors containing text semantic features;
encoding the pictures in the picture files of the unstructured data by adopting a CLIP picture encoder, and converting the pictures into multidimensional feature vectors containing picture semantic features;
encoding the videos in the video files of the unstructured data by adopting a CLIP video encoder, and converting the videos into multidimensional feature vectors containing video semantic features;
and encoding the fields of the structured data by adopting a CLIP text encoder to obtain corresponding multidimensional feature vectors.
Preferably, the similarity of the multidimensional feature vectors is
$$\cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$
wherein $\cos(A, B)$ represents the cosine similarity between vector $A$ and vector $B$, $\|A\|$ is the L2 norm of vector $A$, and $\|B\|$ is the L2 norm of vector $B$.
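As an illustration only, the cosine similarity defined above can be computed as in the following minimal sketch. The function name `cosine_similarity` and the use of plain Python lists for vectors are assumptions made for this example and are not taken from the patent.

```python
import math


def cosine_similarity(a, b):
    """Return (A . B) / (||A|| * ||B||) for two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))  # L2 norm of A
    norm_b = math.sqrt(sum(y * y for y in b))  # L2 norm of B
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction give a value close to 1, and orthogonal vectors give 0, which is why a preset threshold near 1 can be used to select fields for merging.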
As a preferred solution, the clustering the multidimensional feature vectors obtained by converting unstructured data in the multi-source heterogeneous data, and storing the clustered multidimensional feature vectors in a classification manner specifically includes:
based on the preset clustering quantity and the initial value of a clustering center, performing unsupervised learning cluster analysis on elements of the multi-dimensional feature vector obtained by converting the unstructured data by adopting a K-means algorithm, and dividing the multi-dimensional feature vector of the unstructured data according to different cluster boundaries determined by the analysis;
the multidimensional feature vectors are stored in a classified manner according to different kinds of the division.
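The K-means step described above, with a preset number of clusters and preset initial cluster centers, might look like the following minimal sketch. It is a plain-Python illustration under stated assumptions: the function names, the squared Euclidean distance and the `max_iter` limit are choices made for this example, not details given in the patent.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def k_means(vectors, initial_centers, max_iter=100):
    """Assign each vector to its nearest center and update centers until stable.

    The number of clusters is len(initial_centers).
    Returns (labels, centers), where labels[i] is the cluster index of vectors[i].
    """
    centers = [list(c) for c in initial_centers]
    labels = [None] * len(vectors)
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda j: squared_distance(v, centers[j]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def group_by_cluster(ids, labels):
    """Group item identifiers by cluster label, e.g. for classified storage."""
    groups = {}
    for item_id, label in zip(ids, labels):
        groups.setdefault(label, []).append(item_id)
    return groups
```

The grouping returned by `group_by_cluster` corresponds to storing the feature vectors according to the category they were divided into.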
The embodiment of the invention also provides a device for managing multi-source heterogeneous data of the power supply chain, which comprises:
the data acquisition module is used for acquiring multi-source heterogeneous data under a power grid supply chain and storing the multi-source heterogeneous data according to data types in a classified mode;
the vector conversion module is used for carrying out vector conversion on the data with different data types by adopting an encoder of the CLIP algorithm to obtain a multidimensional feature vector;
the merging module is used for calculating the similarity of the multidimensional feature vectors obtained by converting the structured data in the multi-source heterogeneous data and merging fields of the multidimensional feature vectors with the similarity exceeding a preset threshold value;
and the clustering module is used for clustering the multidimensional feature vectors obtained by converting unstructured data in the multi-source heterogeneous data, and storing the clustered multidimensional feature vectors in a classified manner.
Preferably, the process of collecting multi-source heterogeneous data under the power grid supply chain by the data acquisition module specifically includes:
based on big data cluster technology, multi-source heterogeneous data under a power grid supply chain is collected through a Logstash data collection engine;
the multi-source heterogeneous data includes structured data, semi-structured data, and unstructured data.
Preferably, the process of classifying and storing the multi-source heterogeneous data according to the data type by the data acquisition module specifically includes:
storing unstructured data in the multi-source heterogeneous data into a MinIO object storage server, storing structured data in the multi-source heterogeneous data into a PostgreSQL relational database, and storing semi-structured data in the multi-source heterogeneous data into a MongoDB distributed database.
Preferably, the apparatus further comprises a preprocessing module for:
after the multi-source heterogeneous data is stored in a classified manner according to data type, computing MD5 message-digest values for the picture files and video files of the unstructured data in the multi-source heterogeneous data, and matching them against the files already stored in the database to remove duplicates;
and, for the structured data in the multi-source heterogeneous data, supplementing missing user or product information through the correspondence among the user ID card number, the taxpayer identification number and the unique product code.
Preferably, the vector conversion module is specifically configured to:
encoding the text in the document files of the unstructured data by adopting a CLIP text encoder, and converting the text into multidimensional feature vectors containing text semantic features;
encoding the pictures in the picture files of the unstructured data by adopting a CLIP picture encoder, and converting the pictures into multidimensional feature vectors containing picture semantic features;
encoding the videos in the video files of the unstructured data by adopting a CLIP video encoder, and converting the videos into multidimensional feature vectors containing video semantic features;
and encoding the fields of the structured data by adopting a CLIP text encoder to obtain corresponding multidimensional feature vectors.
Preferably, the similarity of the multidimensional feature vectors is
$$\cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$
wherein $\cos(A, B)$ represents the cosine similarity between vector $A$ and vector $B$, $\|A\|$ is the L2 norm of vector $A$, and $\|B\|$ is the L2 norm of vector $B$.
Preferably, the clustering module is specifically configured to:
based on the preset clustering quantity and the initial value of a clustering center, performing unsupervised learning cluster analysis on elements of the multi-dimensional feature vector obtained by converting the unstructured data by adopting a K-means algorithm, and dividing the multi-dimensional feature vector of the unstructured data according to different cluster boundaries determined by the analysis;
the multidimensional feature vectors are stored in a classified manner according to different kinds of the division.
The embodiment of the invention also provides a terminal device, which comprises a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, wherein the processor implements the method for managing multi-source heterogeneous data of the power supply chain according to any one of the above embodiments when executing the computer program.
The embodiment of the invention also provides a computer readable storage medium, which comprises a stored computer program, wherein when the computer program runs, equipment where the computer readable storage medium is located is controlled to execute the method for managing the multi-source heterogeneous data of the power supply chain according to any one of the above embodiments.
The method, the device, the equipment and the medium for managing the multi-source heterogeneous data of the power supply chain collect the multi-source heterogeneous data under the power grid supply chain and store the multi-source heterogeneous data according to the data type in a classified manner; carrying out vector conversion on data of different data types by adopting an encoder of a CLIP algorithm to obtain a multidimensional feature vector; calculating the similarity of the multidimensional feature vectors obtained by converting the structured data in the multi-source heterogeneous data, and merging fields of the multidimensional feature vectors with the similarity exceeding a preset threshold; clustering the multidimensional feature vectors obtained by converting unstructured data in the multi-source heterogeneous data, and storing the clustered multidimensional feature vectors in a classified manner. The multi-source heterogeneous data management of the power supply chain can be realized, and the operation efficiency of the supply chain is optimized.
Drawings
FIG. 1 is a flow chart of a method for managing multi-source heterogeneous data of a power supply chain according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for managing multi-source heterogeneous data of a power supply chain according to another embodiment of the present invention;
FIG. 3 is a schematic flow chart of text-to-picture matching provided by an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a management device for multi-source heterogeneous data of a power supply chain according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, a flow chart of a method for managing multi-source heterogeneous data of a power supply chain according to an embodiment of the present invention is provided, and the method includes steps S1 to S4:
S1, acquiring multi-source heterogeneous data under a power grid supply chain, and storing the multi-source heterogeneous data according to data types in a classified manner;
S2, carrying out vector conversion on data of different data types by adopting an encoder of a CLIP algorithm to obtain a multidimensional feature vector;
S3, calculating the similarity of the multidimensional feature vectors obtained by converting the structured data in the multi-source heterogeneous data, and merging fields of the multidimensional feature vectors with the similarity exceeding a preset threshold;
and S4, clustering the multidimensional feature vectors obtained by converting unstructured data in the multi-source heterogeneous data, and storing the clustered multidimensional feature vectors in a classified manner.
When the embodiment is implemented, multi-source heterogeneous data under a power grid supply chain is collected, wherein the multi-source heterogeneous data relate to energy saving/intelligent equipment information, supplier, client and cooperation unit information in the field of power systems. And classifying and storing the acquired data according to the data type.
An encoder of the CLIP algorithm is used to convert data of different data types into multidimensional feature vectors. The CLIP (Contrastive Language-Image Pre-training) algorithm comprises a plurality of encoders, and data of different data types is converted into vectors by different encoders. This enables efficient fusion and adaptation of the multi-source heterogeneous data, improves the value density and quality of the data, and makes the fusion and adaptation of the multi-source heterogeneous data more intelligent and automated.
For structured data, the vector similarity is calculated, and data with high vector similarity is merged.
For unstructured data, the multidimensional feature vectors are clustered, so that they are divided in the feature space and stored according to different feature classifications, which facilitates high-quality retrieval.
In this embodiment, the collected multi-source heterogeneous data is vectorized according to data type and converted into multidimensional feature vectors. For structured data, the vector similarity is calculated and data with high vector similarity is merged; for unstructured data, the multidimensional feature vectors are clustered so that they are divided in the feature space and stored according to different feature classifications, which facilitates high-quality retrieval. By introducing several artificial intelligence algorithms, the value density and quality of the multi-source heterogeneous data can be effectively improved, and the fusion and adaptation of the multi-source heterogeneous data becomes more intelligent and automated. This helps supply chain owners better understand the running condition and potential risks of the supply chain, formulate more accurate and effective supply chain strategies, and improve the efficiency and cost-effectiveness of the supply chain.
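The merging of structured-data fields whose feature vectors exceed a preset similarity threshold can be sketched as follows. This is a hedged illustration: the field names and toy three-dimensional vectors are invented for the example (real CLIP vectors have hundreds of dimensions), and grouping fields transitively with a union-find structure is an assumption, since the patent does not say how more than two similar fields are combined.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity (A . B) / (||A|| * ||B||)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def merge_similar_fields(field_vectors, threshold):
    """Group field names whose vectors have similarity above the threshold.

    field_vectors maps a field name to its feature vector.
    Returns a sorted list of sorted groups of field names.
    """
    parent = {name: name for name in field_vectors}

    def find(name):
        # Follow parent links to the group representative (with path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(field_vectors, 2):
        if cosine_similarity(field_vectors[a], field_vectors[b]) > threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for name in field_vectors:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical field vectors, for illustration only.
fields = {
    "supplier_name": [0.90, 0.10, 0.00],
    "vendor_name": [0.88, 0.12, 0.01],
    "contract_amount": [0.00, 0.20, 0.95],
}
print(merge_similar_fields(fields, threshold=0.95))
# [['contract_amount'], ['supplier_name', 'vendor_name']]
```

Here the two name fields are treated as one field to be merged, while the amount field stays on its own.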
In still another embodiment of the present invention, the collecting multi-source heterogeneous data under the power grid supply chain in step S1 specifically includes:
based on big data cluster technology, multi-source heterogeneous data under a power grid supply chain is collected through a Logstash data collection engine;
the multi-source heterogeneous data includes structured data, semi-structured data, and unstructured data.
In the implementation of this embodiment, referring to fig. 2, which is a flow chart of a method for managing multi-source heterogeneous data of a power supply chain according to another embodiment of the present invention, the multi-source heterogeneous data is collected based on big data cluster technology and the Logstash data collection engine. The data sources are the multi-source heterogeneous data under the power grid supply chain, covering energy-saving power equipment, intelligent power equipment, suppliers, customers and cooperating units in the field of power systems.
A big data cluster is a distributed computing cluster formed by a plurality of computers and dedicated to storing and processing massive data. It typically comprises a master node and several data nodes: the master node manages and monitors all nodes and tasks in the cluster, while the data nodes carry out the actual data processing and storage tasks. By distributing computing tasks across multiple nodes for parallel execution, a big data cluster can greatly reduce the time and cost of data processing and improve the efficiency and accuracy of data processing.
The Logstash data collection engine can dynamically collect, transform and transmit data from a plurality of data sources, regardless of format or complexity.
The multi-source heterogeneous data includes structured data, semi-structured data, and unstructured data. Structured data refers to data organized according to a certain format and rule, semi-structured data is a data type between structured data and unstructured data, has a characteristic of partial structuring, but does not completely conform to the format and rule of conventional structured data, and unstructured data is a data type without an explicit structure or a simple interpretation, such as picture data, audio data, video data, text data, and the like.
In still another embodiment of the present invention, the step S1 stores the multi-source heterogeneous data according to data type classification, and specifically includes:
storing unstructured data in the multi-source heterogeneous data into a MinIO object storage server, storing structured data in the multi-source heterogeneous data into a PostgreSQL relational database, and storing semi-structured data in the multi-source heterogeneous data into a MongoDB distributed database.
In the implementation of the present embodiment, referring to fig. 2, the collected data is classified and stored according to data types.
Unstructured data such as documents, pictures, videos and text is stored in a MinIO object storage server, which is a high-performance distributed object storage system that supports both single-machine and cluster deployment.
The semi-structured data is stored in a MongoDB distributed database, which is a database based on distributed file storage.
Structured data is stored in a PostgreSQL relational database, which is a mature object-relational database.
It should be noted that, the classification storage manner provided in this embodiment is not the only manner of classification storage, and other file storage or object storage manners may be adopted in other embodiments.
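The classified storage of this embodiment amounts to routing each collected item to a backend by data type. The sketch below shows only that routing decision and makes no calls to MinIO, PostgreSQL or MongoDB. The `DataKind` enum, the `classify` and `route` functions, the record shape and the file-extension rule are all hypothetical; the patent does not state how the data type of an item is detected.

```python
import os
from enum import Enum


class DataKind(Enum):
    STRUCTURED = "structured"
    SEMI_STRUCTURED = "semi-structured"
    UNSTRUCTURED = "unstructured"


# Backends named in this embodiment, keyed by data type.
STORAGE_BACKEND = {
    DataKind.UNSTRUCTURED: "MinIO",
    DataKind.STRUCTURED: "PostgreSQL",
    DataKind.SEMI_STRUCTURED: "MongoDB",
}

# Assumed detection rule: these extensions are treated as semi-structured.
SEMI_STRUCTURED_EXTENSIONS = {".json", ".xml"}


def classify(record):
    """Classify a record that is either a table row or a named file."""
    if "row" in record:  # tabular record with fixed columns
        return DataKind.STRUCTURED
    extension = os.path.splitext(record["filename"])[1].lower()
    if extension in SEMI_STRUCTURED_EXTENSIONS:
        return DataKind.SEMI_STRUCTURED
    # Everything else (pictures, videos, documents) is treated as unstructured.
    return DataKind.UNSTRUCTURED


def route(record):
    """Return the name of the storage backend for a record."""
    return STORAGE_BACKEND[classify(record)]
```

Keeping the mapping in one table makes it straightforward to substitute other file or object storage systems, as the embodiment allows.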
In yet another embodiment provided by the present invention, after the step S1, the method further includes:
computing MD5 message-digest values for the picture files and video files of the unstructured data in the multi-source heterogeneous data, and matching them against the files already stored in the database to remove duplicates;
and, for the structured data in the multi-source heterogeneous data, supplementing missing user or product information through the correspondence among the user ID card number, the taxpayer identification number and the unique product code.
When this embodiment is implemented, after the data of different types is stored, data aggregation and data cleaning are carried out on each data category, including operations such as data deduplication and missing-data completion, in preparation for multimodal feature extraction.
Specifically, MD5 message-digest values are computed for files such as pictures and videos in the multi-source heterogeneous data, and the digests are matched against the files already in the database to remove duplicates.
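A minimal sketch of this deduplication step, using Python's standard `hashlib` module, is shown below. The function names and the in-memory set of known digests are assumptions for illustration; in practice the digests of stored files would be looked up in the database.

```python
import hashlib


def md5_digest(content: bytes) -> str:
    """Return the MD5 message digest of a file's content as a hex string."""
    return hashlib.md5(content).hexdigest()


def deduplicate(files, known_digests):
    """Keep only files whose digest has not been seen before.

    files is an iterable of (name, content) pairs; known_digests is a set of
    digests of files already stored, and is updated with each kept file.
    """
    kept = []
    for name, content in files:
        digest = md5_digest(content)
        if digest in known_digests:
            continue  # identical content is already stored
        known_digests.add(digest)
        kept.append(name)
    return kept
```

Two files with different names but identical bytes produce the same digest, so only the first is kept.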
For the structured data, missing user or product information is completed through the correspondence among information such as the user ID card number, the taxpayer identification number and the unique product code.
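One way to read this completion step is that records sharing the same identifier describe the same user or product, so a value missing in one record can be filled from another. The sketch below follows that reading. The function name, the dictionary record shape, the field names and the sample values are all hypothetical.

```python
def complete_missing(records, key_fields):
    """Fill fields that are None using other records with the same key value.

    records is a list of dicts; key_fields lists the identifier fields
    (for example a taxpayer identification number or a product code).
    Returns new dicts and leaves the input records unchanged.
    """
    # Collect the first known value of every field for each (key, value) pair.
    known = {}
    for record in records:
        for key in key_fields:
            if record.get(key) is None:
                continue
            bucket = known.setdefault((key, record[key]), {})
            for field, value in record.items():
                if value is not None:
                    bucket.setdefault(field, value)

    completed = []
    for record in records:
        filled = dict(record)
        for key in key_fields:
            if record.get(key) is None:
                continue
            for field, value in known[(key, record[key])].items():
                if filled.get(field) is None:
                    filled[field] = value
        completed.append(filled)
    return completed


# Made-up sample records, for illustration only.
rows = [
    {"taxpayer_id": "T-001", "name": "Supplier A", "address": None},
    {"taxpayer_id": "T-001", "name": None, "address": "Ningbo"},
]
print(complete_missing(rows, ["taxpayer_id"]))
```

After completion, both sample records carry the name and the address, because they share the same taxpayer identifier.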
In yet another embodiment of the present invention, the step S2 specifically includes:
encoding the text in the document files of the unstructured data by adopting a CLIP text encoder, and converting the text into multidimensional feature vectors containing text semantic features;
encoding the pictures in the picture files of the unstructured data by adopting a CLIP picture encoder, and converting the pictures into multidimensional feature vectors containing picture semantic features;
encoding the videos in the video files of the unstructured data by adopting a CLIP video encoder, and converting the videos into multidimensional feature vectors containing video semantic features;
and encoding the fields of the structured data by adopting a CLIP text encoder to obtain corresponding multidimensional feature vectors.
In the implementation of this embodiment, referring to fig. 2, the CLIP algorithm includes a plurality of encoders: the text encoder CLIP encoder, the picture encoder CLIP Feature Extractor, and the video encoder CLIP2TV. The multidimensional feature vector conversion specifically includes:
the text encoder CLIP encoder is a deep learning model that encodes text using a Transformer architecture and converts a text description into a high-dimensional vector representation, so that texts with similar semantics are also close in the vector space; document-type data is converted into multidimensional feature vectors containing text semantic features by the CLIP encoder, and overlong text is handled by truncation;
the picture encoder CLIP Feature Extractor is an image feature extractor based on the CLIP model; it converts an input picture into a high-dimensional feature vector rich in the semantic information of the picture, and picture-type unstructured data is converted into multidimensional feature vectors containing picture semantic features by CLIP Feature Extractor;
the video encoder CLIP2TV is a video-text retrieval framework based on the Transformer architecture, whose video encoder is Transformer-based and can encode a video into feature vectors rich in video semantic features, so that the type of the video can be judged from the feature vectors; video-type unstructured data is encoded into multidimensional feature vectors containing video semantic features by CLIP2TV;
and the fields of the structured data are converted into corresponding multidimensional feature vectors by the text encoder CLIP encoder.
Referring to fig. 3, which is a schematic flow chart of text-picture matching provided by an embodiment of the present invention: for the text data of a certain power transformer, a text encoder performs vector conversion to obtain the corresponding text vectors T_1 ~ T_N, and a picture encoder performs vector conversion to obtain the corresponding picture vectors I_1 ~ I_N. After the vector conversion is completed, the text vectors and the picture vectors are matched one by one.
When a text is input for searching, it is encoded by the text encoder and matched against the picture encodings, and the picture with the highest feature similarity is returned.
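The retrieval step above can be illustrated with a small sketch. It assumes the text and picture vectors have already been produced by the encoders; the three-dimensional toy vectors, the file names, and the function names below are hypothetical, and cosine similarity is used as the feature similarity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(text_vector, picture_index):
    """Return the name of the picture whose vector best matches the text vector."""
    return max(picture_index, key=lambda name: cosine(text_vector, picture_index[name]))


# Toy vectors standing in for real encoder outputs (assumed values)
picture_index = {
    "transformer.jpg": [0.9, 0.1, 0.0],
    "insulator.jpg": [0.1, 0.9, 0.2],
    "cable.jpg": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # pretend encoding of a query about a transformer

best = retrieve(query_vector, picture_index)
print(best)  # transformer.jpg
```

A linear scan is enough for illustration; a production system would more likely search only within the cluster nearest to the query.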
By introducing new artificial intelligence technologies such as text image matching, text video retrieval and the like, the semantics and the context information of the text content can be better understood, the data quality is improved, and the efficiency of text picture and text video retrieval is greatly improved; by using computer vision techniques, the platform can directly process image and video data, enabling content-based retrieval.
In yet another embodiment of the present invention, the similarity of the multidimensional feature vectors calculated in the step S3 is
$$\cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$
wherein $\cos(A, B)$ represents the cosine similarity between vector $A$ and vector $B$, $\|A\|$ is the L2 norm of vector $A$, and $\|B\|$ is the L2 norm of vector $B$.
In the implementation of this embodiment, after the structured data is vectorized, the cosine similarity between the vectors is calculated to judge whether the semantics of the corresponding fields are similar. A threshold is set; when the cosine similarity of two vectors exceeds the threshold, the two fields are judged to be highly similar and are merged, that is, one of the similar fields is selected to replace the other near-synonymous fields, which reduces data redundancy and improves data quality.
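The threshold-based field merging can be sketched as below. This is an illustrative greedy pass, not necessarily the procedure used in the embodiment: the field names, the toy vectors, and the threshold of 0.95 are assumptions, and the first field seen in each group is arbitrarily kept as the representative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def merge_fields(field_vectors, threshold):
    """Map every field name to the representative field that replaces it."""
    representatives = []  # fields kept so far
    mapping = {}
    for name, vector in field_vectors.items():
        for rep in representatives:
            if cosine(vector, field_vectors[rep]) > threshold:
                mapping[name] = rep  # near-synonymous field: merge into rep
                break
        else:
            representatives.append(name)  # no similar field yet: keep it
            mapping[name] = name
    return mapping


# Toy vectors standing in for encoded field names (assumed values)
fields = {
    "supplier_name": [0.90, 0.10, 0.10],
    "vendor_name": [0.88, 0.12, 0.09],
    "delivery_date": [0.10, 0.20, 0.95],
}
mapping = merge_fields(fields, threshold=0.95)
print(mapping)
```

Here `vendor_name` is mapped to `supplier_name`, while `delivery_date` stays a separate field because its similarity to the others is far below the threshold.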
Cosine similarity is a vector similarity measure used to measure how similar two vectors are. Its value ranges from -1 to 1: the closer the value is to 1, the more similar the vectors; the closer it is to -1, the more dissimilar; a value of 0 means the vectors are orthogonal.
The cosine similarity $\cos(A, B)$ between vectors $A$ and $B$ is the dot product of vector $A$ and vector $B$ divided by the product of their L2 norms, where $\|A\|$ is the L2 norm of vector $A$ and $\|B\|$ is the L2 norm of vector $B$.
The L2 norm of a vector is the square root of the sum of the squares of all its dimension values. For vector $A$, the L2 norm is $\|A\| = \sqrt{\sum_{i=1}^{n} A_i^2}$, where $n$ is the number of dimensions of vector $A$ and $A_i$ is the $i$-th dimension value of vector $A$.
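As a worked example of these definitions, with illustrative vectors that do not come from the source, take $A = (1, 2, 2)$ and $B = (2, 1, 2)$:

```latex
\begin{align*}
A \cdot B &= 1 \cdot 2 + 2 \cdot 1 + 2 \cdot 2 = 8 \\
\|A\| &= \sqrt{1^2 + 2^2 + 2^2} = 3 \\
\|B\| &= \sqrt{2^2 + 1^2 + 2^2} = 3 \\
\cos(A, B) &= \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{8}{9} \approx 0.889
\end{align*}
```

Whether a value of about 0.889 counts as highly similar depends on the preset threshold, which the source leaves to be configured.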
And based on analysis processing of the similarity, automatic and intelligent data processing and analysis are realized, the efficiency of data acquisition and fusion is improved, and the workload of manual data classification is reduced.
In yet another embodiment of the present invention, the step S4 specifically includes:
based on the preset clustering quantity and the initial value of a clustering center, performing unsupervised learning cluster analysis on elements of the multi-dimensional feature vector obtained by converting the unstructured data by adopting a K-means algorithm, and dividing the multi-dimensional feature vector of the unstructured data according to different cluster boundaries determined by the analysis;
the multidimensional feature vectors are stored in a classified manner according to different kinds of the division.
When the embodiment is implemented, after unstructured data is vectorized, based on the preset clustering quantity and the initial value of a clustering center, a K-Means algorithm is adopted to cluster the multidimensional feature vectors, so that the positions of the multidimensional feature vectors are divided in a feature vector space and stored according to the division of the multidimensional feature vectors, and high-quality retrieval is carried out subsequently.
The K-means algorithm is a common unsupervised learning algorithm whose main task is to divide a data set into K clusters. K is a hyperparameter that needs to be tuned over several trials according to the actual data. The objective is to minimize the sum of the distances between the data points and the centers of the clusters to which they belong, that is, to minimize the intra-cluster difference. The clustering is gradually optimized by iteratively updating the positions of the cluster centers. When the cluster centers no longer change, the algorithm has converged and the final clustering result is obtained.
The K-means algorithm comprises the steps of setting the number K of clusters to be formed and setting K initialized cluster center points;
for each data point, calculating the distance between the data point and each cluster center point, and distributing each data point to the cluster which belongs to the cluster center closest to the data point;
calculating a new cluster center according to all data points in each cluster;
the above two steps are repeated until the cluster center is no longer changed or a predetermined number of iterations is reached.
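The steps listed above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the sample points and initial centers are invented, the distance is the Manhattan distance named in this embodiment, and the new center is taken as the per-dimension mean of the cluster's points, which the source does not specify.

```python
def manhattan(x, y):
    """Sum of absolute per-dimension differences between two vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def k_means(points, centers, distance=manhattan, max_iter=100):
    """Cluster `points` starting from the given initial `centers`.

    Returns (labels, centers). Using the mean for the center update is
    an assumption made for this sketch.
    """
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center
        labels = [
            min(range(len(centers)), key=lambda k: distance(p, centers[k]))
            for p in points
        ]
        # Update step: recompute each center from the points assigned to it
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centers.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centers.append(old)  # keep the center of an empty cluster
        if new_centers == centers:
            break  # centers no longer change: converged
        centers = new_centers
    return labels, centers


# Hypothetical two-dimensional feature vectors and initial centers (K = 2)
points = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.0], [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
labels, centers = k_means(points, centers=[[0.0, 0.0], [10.0, 10.0]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The `max_iter` argument implements the predetermined iteration limit, and the equality check on the centers implements the convergence condition.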
The distance between a data point and a cluster center point in this embodiment is the Manhattan distance
$$d(X, Y) = \sum_{i=1}^{n} |x_i - y_i|$$
wherein $x_i$ and $y_i$ represent the corresponding elements of vector $X$ and vector $Y$; the Manhattan distance is the sum of the absolute values of the differences between the two vectors over all dimensions.
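As a worked example with illustrative values that do not come from the source, take $X = (1, 4, 2)$ and $Y = (3, 1, 2)$:

```latex
\begin{align*}
d(X, Y) &= |1 - 3| + |4 - 1| + |2 - 2| \\
        &= 2 + 3 + 0 = 5
\end{align*}
```

Each dimension contributes its absolute difference independently, so a single large difference is not squared as it would be under the Euclidean distance.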
And carrying out cluster analysis on the feature vectors by using a K-Means algorithm, dividing the heterogeneous data in a feature space by adopting unsupervised learning through clustering, determining the boundary of the heterogeneous data, dividing the multidimensional feature vectors of the unstructured data according to different cluster boundaries determined by analysis, and determining the labels of different clusters.
And the divided data are classified and stored in a database, so that the subsequent quick retrieval is convenient.
Through the K-Means algorithm, the data can be finely processed and effectively fused, and the accuracy and credibility of data analysis are improved, so that the ability of supply chain users to manage and govern the supply chain can be effectively improved.
The embodiment of the invention also provides a device for managing multi-source heterogeneous data of a power supply chain, referring to fig. 4, which is a schematic structural diagram of the device for managing multi-source heterogeneous data of a power supply chain, provided by the embodiment of the invention, wherein the device comprises:
the data acquisition module is used for acquiring multi-source heterogeneous data under a power grid supply chain and storing the multi-source heterogeneous data according to data types in a classified mode;
the vector conversion module is used for carrying out vector conversion on the data with different data types by adopting an encoder of the CLIP algorithm to obtain a multidimensional feature vector;
the merging module is used for calculating the similarity of the multidimensional feature vectors obtained by converting the structured data in the multi-source heterogeneous data and merging fields of the multidimensional feature vectors with the similarity exceeding a preset threshold value;
and the clustering module is used for clustering the multidimensional feature vectors obtained by converting unstructured data in the multi-source heterogeneous data, and storing the clustered multidimensional feature vectors in a classified manner.
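One way to picture how the four modules fit together is the skeleton below. It is only a structural sketch: the class name, the callable signatures, and the stub lambdas are hypothetical and do not reflect an actual implementation of the device.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ManagementDevice:
    """Hypothetical wiring of the four modules of the management device."""

    acquire: Callable[[], dict]         # data acquisition module
    to_vectors: Callable[[dict], dict]  # vector conversion module
    merge: Callable[[dict], Any]        # merging module (structured data)
    cluster: Callable[[dict], Any]      # clustering module (unstructured data)

    def run(self) -> dict:
        stored = self.acquire()            # collect and store by data type
        vectors = self.to_vectors(stored)  # encode every item as a vector
        return {
            "merged_fields": self.merge(vectors["structured"]),
            "clusters": self.cluster(vectors["unstructured"]),
        }


# Stub modules standing in for real collectors, encoders and algorithms
device = ManagementDevice(
    acquire=lambda: {"structured": ["supplier_name"], "unstructured": ["a.jpg"]},
    to_vectors=lambda data: {
        kind: {item: [1.0, 0.0] for item in items} for kind, items in data.items()
    },
    merge=lambda vectors: sorted(vectors),
    cluster=lambda vectors: {name: 0 for name in vectors},
)
result = device.run()
print(result)
```

Keeping the modules as separate callables mirrors the module division above: the structured vectors flow only to the merging module and the unstructured vectors only to the clustering module.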
It should be noted that the management device for multi-source heterogeneous data of a power supply chain provided in the embodiment of the present invention can execute the management method for multi-source heterogeneous data of a power supply chain described in any of the foregoing embodiments; the specific functions of the device are not described again herein.
Referring to fig. 5, a schematic structural diagram of a terminal device according to an embodiment of the present invention is provided. The terminal device of this embodiment includes: a processor, a memory, and a computer program stored in the memory and executable on the processor, such as a power supply chain multi-source heterogeneous data management program. When the processor executes the computer program, the steps in each of the above embodiments of the method for managing multi-source heterogeneous data of a power supply chain are implemented, for example, steps S1 to S4 shown in fig. 1. Alternatively, when executing the computer program, the processor performs the functions of the modules in the above apparatus embodiments.
For example, the computer program may be divided into one or more modules/units, which are stored in the memory and executed by the processor to accomplish the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments describing the execution of the computer program in the terminal device. For example, the computer program may be divided into modules whose specific functions are not described again herein.
The terminal equipment can be computing equipment such as a desktop computer, a notebook computer, a palm computer, a cloud server and the like. The terminal device may include, but is not limited to, a processor, a memory. It will be appreciated by those skilled in the art that the schematic diagram is merely an example of a terminal device and does not constitute a limitation of the terminal device, and may include more or less components than illustrated, or may combine certain components, or different components, e.g., the terminal device may further include an input-output device, a network access device, a bus, etc.
The processor may be a central processing unit (Central Processing Unit, CPU), other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. The general purpose processor may be a microprocessor or the processor may be any conventional processor or the like, which is a control center of the terminal device, and which connects various parts of the entire terminal device using various interfaces and lines.
The memory may be used to store the computer program and/or module, and the processor may implement various functions of the terminal device by running or executing the computer program and/or module stored in the memory and invoking data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required for at least one function, and the like; the storage data area may store data (such as audio data, phonebook, etc.) created according to the use of the handset, etc. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, memory, plug-in hard disk, smart Media Card (SMC), secure Digital (SD) Card, flash Card (Flash Card), at least one disk storage device, flash memory device, or other volatile solid-state storage device.
Wherein the terminal device integrated modules/units may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as stand alone products. Based on such understanding, the present invention may implement all or part of the flow of the method of the above embodiment, or may be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, and when the computer program is executed by a processor, the computer program may implement the steps of each of the method embodiments described above. Wherein the computer program comprises computer program code, which may be in the form of code, object code, executable files, or in some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.
While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of the invention, and such changes and modifications are also intended to be within the scope of the invention.

Claims (10)

1. A method for managing multi-source heterogeneous data of a power supply chain, the method comprising:
collecting multi-source heterogeneous data under a power grid supply chain, and storing the multi-source heterogeneous data in a classified manner according to data types;
carrying out vector conversion on data of different data types by adopting an encoder of a CLIP algorithm to obtain a multidimensional feature vector;
calculating the similarity of the multidimensional feature vectors obtained by converting the structured data in the multi-source heterogeneous data, and merging fields of the multidimensional feature vectors with the similarity exceeding a preset threshold;
clustering the multidimensional feature vectors obtained by converting unstructured data in the multi-source heterogeneous data, and storing the clustered multidimensional feature vectors in a classified manner.
2. The method for managing multi-source heterogeneous data of an electric power supply chain according to claim 1, wherein the collecting multi-source heterogeneous data under an electric power grid supply chain specifically comprises:
based on a big data cluster technology, multi-source heterogeneous data under a power grid supply chain are collected through a Logstash data collection engine;
the multi-source heterogeneous data includes structured data, semi-structured data, and unstructured data.
3. The method for managing multi-source heterogeneous data of a power supply chain according to claim 1, wherein the storing the multi-source heterogeneous data according to data types includes:
storing unstructured data in the multi-source heterogeneous data into a MinIO object storage server, storing structured data in the multi-source heterogeneous data into a PostgreSQL relational database, and storing semi-structured data in the multi-source heterogeneous data into a MongoDB distributed database.
4. The method of managing multi-source heterogeneous data of a power supply chain according to claim 1, wherein after storing the multi-source heterogeneous data classified by data type, the method further comprises:
extracting MD5 message-digest values of the picture files and video files of unstructured data in the multi-source heterogeneous data through an MD5 message-digest algorithm, and matching and de-duplicating the picture files and the video files against files existing in the stored database;
and supplementing the missing information of the user or the product for the picture file and the video file of the structured data in the multi-source heterogeneous data through the corresponding relation among the user identity card number, the taxpayer identification number and the unique coding information of the product.
5. The method for managing multi-source heterogeneous data of a power supply chain according to claim 1, wherein the encoder using the CLIP algorithm performs vector conversion on data of different data types to obtain a multi-dimensional feature vector, and the method specifically comprises:
encoding the text in the document files in the unstructured data by adopting a CLIP text encoder, and converting the text into multidimensional feature vectors carrying text semantic features;
encoding the pictures in the picture files in the unstructured data by adopting a CLIP picture encoder, and converting the pictures into multidimensional feature vectors carrying picture semantic features;
encoding the videos in the video files in the unstructured data by adopting a CLIP video encoder, and converting the videos into multidimensional feature vectors carrying video semantic features;
and encoding the fields of the structured data by adopting a CLIP text encoder to obtain corresponding multidimensional feature vectors.
6. The method for managing multi-source heterogeneous data of a power supply chain according to claim 1, wherein the similarity of the multi-dimensional feature vectors is
$$\cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$
wherein $\cos(A, B)$ represents the cosine similarity between vector $A$ and vector $B$, $\|A\|$ is the L2 norm of vector $A$, and $\|B\|$ is the L2 norm of vector $B$.
7. The method for managing multi-source heterogeneous data of a power supply chain according to claim 1, wherein the clustering of the multi-dimensional feature vectors obtained by converting unstructured data in the multi-source heterogeneous data includes:
based on the preset clustering quantity and the initial value of a clustering center, performing unsupervised learning cluster analysis on elements of the multi-dimensional feature vector obtained by converting the unstructured data by adopting a K-means algorithm, and dividing the multi-dimensional feature vector of the unstructured data according to different cluster boundaries determined by the analysis;
the multidimensional feature vectors are stored in a classified manner according to different kinds of the division.
8. A device for managing multi-source heterogeneous data of an electric power supply chain, the device comprising:
the data acquisition module is used for acquiring multi-source heterogeneous data under a power grid supply chain and storing the multi-source heterogeneous data according to data types in a classified mode;
the vector conversion module is used for carrying out vector conversion on the data with different data types by adopting an encoder of the CLIP algorithm to obtain a multidimensional feature vector;
the merging module is used for calculating the similarity of the multidimensional feature vectors obtained by converting the structured data in the multi-source heterogeneous data and merging fields of the multidimensional feature vectors with the similarity exceeding a preset threshold value;
and the clustering module is used for clustering the multidimensional feature vectors obtained by converting unstructured data in the multi-source heterogeneous data, and storing the clustered multidimensional feature vectors in a classified manner.
9. A terminal device comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the method of managing multi-source heterogeneous data of a power supply chain according to any one of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium, characterized in that the computer readable storage medium comprises a stored computer program, wherein the computer program when run controls a device in which the computer readable storage medium is located to perform the method for managing multi-source heterogeneous data of an electric power supply chain according to any one of claims 1 to 7.
CN202310968192.4A 2023-08-03 2023-08-03 Management method, device, equipment and medium for multi-source heterogeneous data of power supply chain Active CN116680423B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310968192.4A CN116680423B (en) 2023-08-03 2023-08-03 Management method, device, equipment and medium for multi-source heterogeneous data of power supply chain

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310968192.4A CN116680423B (en) 2023-08-03 2023-08-03 Management method, device, equipment and medium for multi-source heterogeneous data of power supply chain

Publications (2)

Publication Number Publication Date
CN116680423A true CN116680423A (en) 2023-09-01
CN116680423B CN116680423B (en) 2023-10-20

Family

ID=87782266

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310968192.4A Active CN116680423B (en) 2023-08-03 2023-08-03 Management method, device, equipment and medium for multi-source heterogeneous data of power supply chain

Country Status (1)

Country Link
CN (1) CN116680423B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190171660A1 (en) * 2017-06-22 2019-06-06 NewVoiceMedia Ltd. System and method for text categorization and sentiment analysis
CN113256980A (en) * 2021-05-28 2021-08-13 佳都科技集团股份有限公司 Road network state determination method, device, equipment and storage medium
CN113986873A (en) * 2021-09-26 2022-01-28 夏文祥 Massive Internet of things data modeling processing, storing and sharing method
CN114003731A (en) * 2021-10-29 2022-02-01 国网河北省电力有限公司电力科学研究院 Heterogeneous data processing method, device, server and storage medium
CN114153839A (en) * 2021-10-29 2022-03-08 杭州未名信科科技有限公司 Integration method, device, equipment and storage medium of multi-source heterogeneous data
CN114379608A (en) * 2021-12-13 2022-04-22 中铁南方投资集团有限公司 Multi-source heterogeneous data integration processing method for urban rail transit engineering
CN114708976A (en) * 2022-03-11 2022-07-05 中国医学科学院阜外医院深圳医院(深圳市孙逸仙心血管医院) Method, device, equipment and storage medium for assisting diagnosis technology
CN115309734A (en) * 2022-08-26 2022-11-08 广东电网有限责任公司东莞供电局 Multi-source heterogeneous data processing method for transformer substation
CN115544206A (en) * 2022-08-31 2022-12-30 许闻 ERP data management method and system combining artificial intelligence and big data
CN115731425A (en) * 2022-12-05 2023-03-03 广州欢聚时代信息科技有限公司 Commodity classification method, commodity classification device, commodity classification equipment and commodity classification medium
CN116070143A (en) * 2022-12-09 2023-05-05 广西电网有限责任公司 Power distribution network multi-source heterogeneous data fusion method and system based on artificial intelligence
CN116089898A (en) * 2022-11-29 2023-05-09 贵州电网有限责任公司 Multi-source heterogeneous data acquisition fusion method and system for distributed power distribution network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JING FAN;SHAOWEN GAO;YONG LIU;XINQIANG MA;JIANDANG YANG;CHANGJIE FAN: "Semisupervised Game Player Categorization From Very Big Behavior Log Data", IEEE, pages 3419 - 3430 *
罗恩韬;王国军;李超良: "Research on a clustering algorithm for multi-dimensional data de-duplication in a big data environment", 小型微型计算机系统, no. 03, pages 438 - 442 *

Also Published As

Publication number Publication date
CN116680423B (en) 2023-10-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant