CN114003596A - Multi-source heterogeneous data processing system and method based on industrial system - Google Patents

Multi-source heterogeneous data processing system and method based on industrial system

Info

Publication number
CN114003596A
Authority
CN
China
Prior art keywords
data
vulnerability
useful
distribution metric
industrial system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111355901.9A
Other languages
Chinese (zh)
Other versions
CN114003596B (en
Inventor
许丰娟
李俊
郝志强
高建磊
李耀兵
江浩
巩天宇
赵千
李赟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Industrial Control Systems Cyber Emergency Response Team
Original Assignee
China Industrial Control Systems Cyber Emergency Response Team
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Industrial Control Systems Cyber Emergency Response Team filed Critical China Industrial Control Systems Cyber Emergency Response Team
Priority to CN202111355901.9A priority Critical patent/CN114003596B/en
Publication of CN114003596A publication Critical patent/CN114003596A/en
Application granted granted Critical
Publication of CN114003596B publication Critical patent/CN114003596B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24552Database cache management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1433Vulnerability analysis
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Computer And Data Communications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

In the multi-source heterogeneous data processing system and method based on an industrial system, an edge computing module completes part of the computing tasks (such as data cleaning, screening, and encryption of the preprocessed data), which effectively relieves the computing pressure on the cloud data center. Heterogeneous data are encoded in a multi-path parallel manner to form unified identifiers, which facilitates subsequent computation and increases processing speed. The edge computing module screens data based on the unified encoding, which greatly reduces the data storage overhead of the cloud data center; this screening is also an efficient form of data cleaning and further reduces the computing burden on the cloud data center. In addition, vulnerability data are detected in real time and uploaded directly to the cloud data center, which meets the timeliness requirement of abnormality alarms.

Description

Multi-source heterogeneous data processing system and method based on industrial system
Technical Field
The invention relates to the technical field of industrial data processing, in particular to a multisource heterogeneous data processing system and method based on an industrial system.
Background
Enterprises place increasingly strict requirements on the speed, timeliness, and professionalism of industrial data acquisition. In traditional industrial informatization, data are acquired on site and transmitted mainly within a local area network. Industrial data are now gradually migrating to public clouds, which makes high-speed data transmission to the cloud challenging. Traditional wireless data acquisition technology struggles to deliver the high precision and low latency that industrial scenarios demand, so it cannot meet the real-time monitoring requirements of highly automated production processes.
With the continuous development of industrial automation and internet applications, and especially the development and application of 5G technology, the industrial internet has become an inevitable trend in modern industry. The quantity of data generated at industrial sites has increased greatly, and industrial data will inevitably grow geometrically. Industrial data are the foundation of industrial internet development and the core of industrial internet applications and control. However, the large volume of industrial data makes analysis and application difficult, especially when current data processing devices lag far behind. Meanwhile, to ensure the normal and stable operation of the industrial system, the historical data that must be recorded are becoming more diverse. If these data are stored directly, or sent from the network edge to a data center for processing, a large amount of storage space is wasted, and querying, transmitting, and retrieving the data become very cumbersome. A means of screening and compressing the data is therefore urgently needed to solve these problems in the prior art.
Disclosure of Invention
The invention aims to provide a multisource heterogeneous data processing system and method based on an industrial system, which can greatly shorten the waiting time, improve the processing efficiency and the analysis efficiency of data and further solve the problems of data real-time performance and reliability caused by a large number of heterogeneous devices and networks on the site of an industrial internet.
In order to achieve the purpose, the invention provides the following scheme:
an industrial system based multi-source heterogeneous data processing system comprising:
the multi-channel data acquisition terminal is used for acquiring data of each device in the industrial system; the devices in the industrial system include: industrial host equipment, production control equipment, network equipment, safety equipment, office equipment and industrial auxiliary equipment;
the acquisition preprocessing terminal is connected with the multi-path data acquisition terminal and is used for preprocessing the acquired data of each device in the industrial system; the preprocessing includes: coding processing, classification processing and vulnerability data detection;
the edge calculation module is connected with the acquisition preprocessing terminal and is used for carrying out data cleaning, screening and encryption processing on the preprocessed data;
and the cloud data center is respectively connected with the acquisition preprocessing terminal and the edge computing module and is used for storing the preprocessed data and the data subjected to data cleaning, screening and encryption processing.
Preferably, the acquisition preprocessing terminal includes:
the encoding unit is connected with the multi-path data acquisition end and is used for encoding the acquired data of each device in the industrial system to obtain encoded data;
the classification unit is connected with the coding unit and is used for classifying the coded data to obtain classified data; the classification data includes: control data, network data, platform data, log data, traffic data, asset data, tool data, production data, or vulnerability data;
the cache unit comprises a plurality of buffer areas, is respectively connected with the classification unit and the edge calculation module, and is used for caching the classification data, transmitting the cached classification data to the edge calculation module when any one of the buffer areas is full, and simultaneously clearing the cached data in the full buffer area;
and the vulnerability detection unit is connected with the classification unit and the cloud data center and is used for detecting whether vulnerability data exist in the classification data, encrypting the existing vulnerability data and uploading the encrypted vulnerability data to the cloud data center when the vulnerability data exist, and simultaneously generating an alarm signal.
Preferably, the method further comprises the following steps:
the alarm module is connected with the vulnerability detection unit and is used for receiving the alarm signal and then issuing an alarm; the alarm is delivered as a short message, an email, or an alert.
Preferably, the plurality of buffers includes: a production data cache region, a control data cache region, a log data cache region, a network data cache region, a traffic data cache region, an asset data cache region, a tool data cache region, a platform data cache region, and a vulnerability data cache region.
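The buffering behavior described above (one buffer per data category, flushed downstream and cleared when full) can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `CacheUnit`, the category keys, the shared fixed `capacity`, and the `sink` callback standing in for the edge calculation module are all assumptions introduced here.

```python
from typing import Callable, Dict, List

# The nine buffer categories named in the text; the string keys are illustrative.
CATEGORIES = ("production", "control", "log", "network", "traffic",
              "asset", "tool", "platform", "vulnerability")


class CacheUnit:
    """Per-category buffers that flush to a downstream sink when full."""

    def __init__(self, capacity: int, sink: Callable[[str, List[bytes]], None]):
        self.capacity = capacity  # assumed: every buffer has the same fixed capacity
        self.sink = sink          # assumed stand-in for the edge calculation module
        self.buffers: Dict[str, List[bytes]] = {c: [] for c in CATEGORIES}

    def put(self, category: str, record: bytes) -> None:
        buf = self.buffers[category]
        buf.append(record)
        if len(buf) >= self.capacity:
            self.sink(category, list(buf))  # transmit a copy of the full buffer
            buf.clear()                     # then clear only that buffer


sent = []
unit = CacheUnit(capacity=2, sink=lambda cat, batch: sent.append((cat, batch)))
unit.put("log", b"a")
unit.put("control", b"b")
unit.put("log", b"c")  # the log buffer is now full and is flushed
```

Only the buffer that fills is transmitted and cleared; the other buffers keep accumulating, which matches the "any one of the buffer areas is full" wording.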
Preferably, the edge calculation module includes:
the data cleaning unit is connected with the acquisition preprocessing terminal and is used for cleaning the preprocessed data;
the data supplementing unit is connected with the data cleaning unit and is used for supplementing the cleaned data by interpolation to obtain supplementary data; the interpolation methods include: random interpolation and linear interpolation;
the data screening unit is connected with the data cleaning unit and used for screening the supplementary data by adopting a distribution measurement-based downsampling method to obtain useful data;
and the encryption unit is connected with the data screening unit and is used for encrypting the useful data.
Preferably, the data screening unit includes:
the data distance determining subunit is connected with the data supplementing unit and is used for measuring the distance between any two data in the supplementing data by adopting the Euclidean distance;
the distribution metric determining subunit is connected with the data distance determining subunit and used for determining the distribution metric of each data according to the distance based on the neighborhood of each data in the supplementary data; the neighborhood is a hyper-sphere formed by taking any data point in the supplementary data as a center and taking a preset value as a radius;
the data sorting subunit is connected with the distribution metric determining subunit and is used for sorting the data in the supplementary data in a descending order based on the distribution metric to obtain sorted data;
the first judging subunit is connected with the data sorting subunit and is used for judging whether the distribution metric of each data in the sorted data is greater than a preset threshold, to obtain a first judgment result;
the first useful data determining subunit is connected with the first judging subunit and is used for retaining the data corresponding to the distribution metric and determining the data to be useful data when the first judgment result is that the distribution metric is greater than the preset threshold;
the second judging subunit is connected with the first judging subunit and is used for judging, when the first judgment result is that the distribution metric is less than or equal to the preset threshold, whether the data corresponding to the distribution metric is in the neighborhood of existing useful data, to obtain a second judgment result;
a second useful data determining subunit, connected to the second judging subunit, and configured to determine, when the second judgment result indicates that the data corresponding to the distribution metric is not in a neighborhood of existing useful data, that the data corresponding to the distribution metric is useful data;
and the redundant data determining subunit is connected with the second judging subunit and is used for determining that the data corresponding to the distribution metric is redundant data when the second judgment result is that the data corresponding to the distribution metric is in the neighborhood of existing useful data.
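The screening flow of the subunits listed above can be sketched as follows. This is a hedged illustration only. The passage does not define how the distribution metric is computed, so the sketch assumes, purely for illustration, that it is the number of other data points inside the neighborhood. The function names `distribution_metric` and `screen` are hypothetical, and data that falls inside the neighborhood of existing useful data is treated as redundant, in line with the name of the redundant data determining subunit.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def distribution_metric(points: List[Point], i: int, radius: float) -> int:
    # ASSUMPTION: the metric is the count of other points within `radius` of point i.
    return sum(1 for j, q in enumerate(points)
               if j != i and math.dist(points[i], q) <= radius)


def screen(points: List[Point], radius: float,
           threshold: float) -> Tuple[List[Point], List[Point]]:
    metrics = [distribution_metric(points, i, radius) for i in range(len(points))]
    # Sort indices by metric in descending order (ties keep their input order).
    order = sorted(range(len(points)), key=lambda i: metrics[i], reverse=True)
    useful: List[Point] = []
    redundant: List[Point] = []
    for i in order:
        p = points[i]
        if metrics[i] > threshold:
            useful.append(p)      # first judgment: metric above the threshold
        elif all(math.dist(p, u) > radius for u in useful):
            useful.append(p)      # second judgment: outside every useful neighborhood
        else:
            redundant.append(p)   # inside the neighborhood of existing useful data
    return useful, redundant


points = [(0.0, 0.0), (0.4, 0.0), (-0.4, 0.0), (0.8, 0.0), (5.0, 5.0)]
useful, redundant = screen(points, radius=0.5, threshold=1)
```

In this toy run the two points with two neighbors each are kept by the first judgment, the isolated point is kept by the second judgment, and the two remaining points are dropped because each lies inside the neighborhood of a point already kept.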
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
In the multi-source heterogeneous data processing system based on the industrial system provided by the invention, the edge computing module completes part of the computing tasks, which effectively relieves the computing pressure on the cloud data center. Heterogeneous data are encoded in a multi-path parallel manner to form unified identifiers, which facilitates subsequent computation and increases processing speed. The edge computing module screens data based on the unified encoding, which greatly reduces the data storage overhead of the cloud data center; this screening is also an efficient form of data cleaning and further reduces the computing burden on the cloud data center. In addition, vulnerability data are detected in real time and uploaded directly to the cloud data center, which meets the timeliness requirement of abnormality alarms.
Corresponding to the multi-source heterogeneous data processing system based on the industrial system, the invention also provides a multi-source heterogeneous data processing method based on the industrial system, and the method comprises the following steps:
collecting data of each device in an industrial system; the devices in the industrial system include: industrial host equipment, production control equipment, network equipment, safety equipment, office equipment and industrial auxiliary equipment;
preprocessing the acquired data of each device in the industrial system; the preprocessing includes: coding processing, classification processing and vulnerability data detection;
carrying out data cleaning, screening and encryption processing on the preprocessed data;
and storing the preprocessed data and the data subjected to data cleaning, screening and encryption.
Preferably, the preprocessing the acquired data of each device in the industrial system specifically includes:
encoding the acquired data of each device in the industrial system to obtain encoded data;
classifying the coded data to obtain classified data; the classification data includes: control data, network data, platform data, log data, traffic data, asset data, tool data, production data, or vulnerability data;
caching the classified data, transmitting the cached classified data to the edge computing module when the cache is full, and simultaneously clearing the cached data in the full cache region;
and detecting whether vulnerability data exists in the classified data, encrypting the existing vulnerability data and uploading the encrypted vulnerability data to the cloud data center when the vulnerability data exists, and generating an alarm signal at the same time.
Preferably, the data cleaning, screening and encrypting the preprocessed data specifically includes:
carrying out data cleaning on the preprocessed data;
supplementing the cleaned data by adopting an interpolation method to obtain supplemented data;
screening the supplementary data by adopting a distribution measurement-based downsampling method to obtain useful data;
and encrypting the useful data.
Preferably, the screening of the supplementary data by using a downsampling method based on distribution metric to obtain useful data specifically includes:
measuring the distance between any two data in the supplementary data by adopting a Euclidean distance;
determining distribution measurement of each data according to the distance based on the neighborhood of each data in the supplementary data; the neighborhood is a hyper-sphere formed by taking any data point in the supplementary data as a center and taking a preset value as a radius;
sorting the data in the supplementary data in a descending order based on the distribution measurement to obtain sorted data;
judging whether the distribution metric of each data in the sorted data is greater than a preset threshold, to obtain a first judgment result;
when the first judgment result is that the distribution metric is larger than the preset threshold, retaining data corresponding to the distribution metric and judging the data to be useful data;
when the first judgment result is that the distribution metric is less than or equal to the preset threshold, judging whether the data corresponding to the distribution metric is in the neighborhood of the existing useful data or not to obtain a second judgment result;
when the second judgment result is that the data corresponding to the distribution metric is not in the neighborhood of the existing useful data, determining that the data corresponding to the distribution metric is useful data;
and when the second judgment result is that the data corresponding to the distribution metric is in the neighborhood of the existing useful data, determining that the data corresponding to the distribution metric is redundant data.
The technical effect achieved by the multisource heterogeneous data processing method based on the industrial system provided by the invention is the same as that achieved by the multisource heterogeneous data processing system based on the industrial system provided by the invention, so that the detailed description is omitted.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of an industrial system based multi-source heterogeneous data processing system according to the present invention;
FIG. 2 is a flowchart of a multi-source heterogeneous data processing method based on an industrial system according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a multisource heterogeneous data processing system and method based on an industrial system, which can greatly shorten the waiting time, improve the processing efficiency and the analysis efficiency of data and further solve the problems of data real-time performance and reliability caused by a large number of heterogeneous devices and networks on the site of an industrial internet.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in fig. 1, the multi-source heterogeneous data processing system based on an industrial system provided by the present invention includes: the system comprises a multi-path data acquisition terminal, an acquisition preprocessing terminal, an edge computing module and a cloud data center.
The multi-path data acquisition end is used for acquiring data of each device in the industrial system. The multi-channel data acquisition end comprises a plurality of data acquisition devices, and the data acquisition devices acquire data of various different types on an industrial field. Industrial field device objects include industrial host devices, production control devices, network devices, security devices, office devices, industrial auxiliary devices, and the like.
The acquisition preprocessing terminal is connected with the multi-path data acquisition terminal and is used for preprocessing the acquired data of each device in the industrial system. The preprocessing includes: coding processing, classification processing and vulnerability data detection. The acquisition preprocessing terminal classifies and caches the encoded data; once the vulnerability data cache region is full, the vulnerability data are encrypted and uploaded directly to the cloud data center.
The edge calculation module is connected with the acquisition preprocessing terminal and is used for carrying out data cleaning, screening and encryption processing on the preprocessed data. The edge calculation module has certain calculation capacity, and performs data cleaning processing on uniformly coded data, and the specific content is as follows:
The edge calculation module cleans the uniformly coded data and fills in missing values by interpolation; the specific methods include random interpolation and Newton interpolation. Abnormal data outside the valid value range are deleted directly.
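The cleaning step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `clean`, the representation of missing values as `None`, and the explicit `lo`/`hi` bounds are introduced here, and "random interpolation" is interpreted as filling each gap with a randomly chosen observed value, which the text does not spell out. Newton interpolation is not shown.

```python
import random
from typing import List, Optional


def clean(values: List[Optional[float]], lo: float, hi: float,
          rng: random.Random) -> List[float]:
    # Delete readings outside the valid range [lo, hi]; keep gaps (None) for now.
    kept = [v for v in values if v is None or lo <= v <= hi]
    observed = [v for v in kept if v is not None]
    if not observed:
        return []  # nothing to draw replacement values from
    # ASSUMPTION: "random interpolation" fills each gap with a random observed value.
    return [v if v is not None else rng.choice(observed) for v in kept]


cleaned = clean([20.5, None, 21.0, 999.0, 20.8], lo=0.0, hi=100.0,
                rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the fill reproducible, which helps when the same cleaning run needs to be audited later.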
The edge computing module computes useful data and redundant data by adopting a distribution measurement-based downsampling method for the cleaned data, stores the useful data into a data cache region of the edge computing module, encrypts the data after the cache region is full, and uploads the encrypted data to a cloud data center.
Both the edge calculation module and the acquisition preprocessing terminal encrypt data before uploading it, so as to ensure the security of the data.
And the cloud data center is respectively connected with the acquisition preprocessing terminal and the edge computing module and is used for storing the preprocessed data and the data subjected to data cleaning, screening and encryption processing so as to facilitate subsequent analysis and decision.
As another embodiment of the present invention, the acquisition preprocessing terminal adopted by the present invention may be configured to include: the device comprises an encoding unit, a classification unit, a cache unit comprising a plurality of buffers and a vulnerability detection unit.
The coding unit is connected with the multi-path data acquisition end and used for coding the acquired data of each device in the industrial system to obtain coded data.
The classification unit is connected with the coding unit and is used for classifying the coded data to obtain classified data. The classification data includes: control data, network data, platform data, log data, traffic data, asset data, tool data, production data, or vulnerability data. The vulnerability data refers to data which has security threat to the industrial system or causes abnormal operation of the industrial system.
The cache unit comprising a plurality of buffer areas is respectively connected with the classification unit and the edge calculation module, and is used for caching the classification data, transmitting the cached classification data to the edge calculation module when any buffer area is full, and simultaneously clearing the cached data in the full buffer area. Wherein the plurality of buffers include: a production data cache region, a control data cache region, a log data cache region, a network data cache region, a traffic data cache region, an asset data cache region, a tool data cache region, a platform data cache region, and a vulnerability data cache region.
The vulnerability detection unit is connected with the classification unit and the cloud data center, and is used for detecting whether vulnerability data exists in the classification data, encrypting the existing vulnerability data and uploading the encrypted vulnerability data to the cloud data center when the vulnerability data exists, and meanwhile generating an alarm signal.
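The detect, encrypt, upload, and alarm sequence of the vulnerability detection unit can be sketched as follows. The text does not name an encryption scheme, a transport, or a detection rule, so `encrypt`, `upload`, and `alarm` are hypothetical callables supplied by the caller, and detection is reduced to checking an assumed category label. The byte reversal used in the demonstration is a placeholder and is not real encryption.

```python
from typing import Callable, Iterable, Tuple


def handle_vulnerabilities(
    classified: Iterable[Tuple[str, bytes]],
    encrypt: Callable[[bytes], bytes],
    upload: Callable[[bytes], None],
    alarm: Callable[[str], None],
) -> int:
    """Encrypt and upload each vulnerability record, raising one alarm per record."""
    count = 0
    for category, payload in classified:
        if category == "vulnerability":  # assumed label set by the classification unit
            upload(encrypt(payload))     # encrypt first, then send to the cloud data center
            alarm("vulnerability data detected")
            count += 1
    return count


uploaded = []
alarms = []
n = handle_vulnerabilities(
    [("log", b"x"), ("vulnerability", b"cve")],
    encrypt=lambda b: b[::-1],  # placeholder transform, NOT real encryption
    upload=uploaded.append,
    alarm=alarms.append,
)
```

Because vulnerability records bypass the edge calculation module and go straight to the cloud data center, this path stays short, which is what the timeliness argument in the text relies on.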
As another embodiment of the present invention, the multi-source heterogeneous data processing system based on the industrial system provided above of the present invention may further include: and an alarm module.
The alarm module is connected with the vulnerability detection unit and is used for receiving the alarm signal and then issuing an alarm. The alarm is delivered as a short message, an email, or an alert.
As another embodiment of the present invention, the edge calculation module adopted in the foregoing may include: the device comprises a data cleaning unit, a data supplementing unit, a data screening unit and an encryption unit.
The data cleaning unit is connected with the acquisition preprocessing terminal and is used for cleaning the preprocessed data.
The data supplementing unit is connected with the data cleaning unit and is used for supplementing the cleaned data by interpolation to obtain supplementary data. The interpolation methods include: random interpolation and linear interpolation.
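Linear interpolation of missing values can be sketched as follows. This is an illustrative example, assuming evenly spaced samples with gaps marked as `None`; the function name `fill_linear` is hypothetical, and gaps before the first or after the last known value are left unfilled because linear interpolation needs a known value on both sides.

```python
from typing import List, Optional


def fill_linear(values: List[Optional[float]]) -> List[Optional[float]]:
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    # Walk over consecutive pairs of known samples and fill the gap between them.
    for left, right in zip(known, known[1:]):
        step = (out[right] - out[left]) / (right - left)
        for k in range(left + 1, right):
            out[k] = out[left] + step * (k - left)
    return out  # leading and trailing gaps stay as None


filled = fill_linear([1.0, None, None, 4.0, None])
```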
The data screening unit is connected with the data cleaning unit and is used for screening the supplementary data by adopting a distribution measurement-based downsampling method to obtain useful data.
The encryption unit is connected with the data screening unit and is used for encrypting the useful data.
Further, the data filtering unit includes: the device comprises a data distance determining subunit, a distribution metric determining subunit, a data sorting subunit, a first judging subunit, a first useful data determining subunit, a second judging subunit, a second useful data determining subunit and a redundant data determining subunit.
The data distance determining subunit is connected with the data supplementing unit and is used for measuring the distance between any two data in the supplementary data by using the Euclidean distance.
The distribution metric determining subunit is connected with the data distance determining subunit, and the distribution metric determining subunit is used for determining the distribution metric of each data according to the distance based on the neighborhood of each data in the supplementary data. The neighborhood is a hyper-sphere formed by taking any data point in the supplementary data as a center and taking a preset value as a radius.
The data sorting subunit is connected with the distribution metric determining subunit, and the data sorting subunit is used for sorting the data in the supplementary data in a descending order based on the distribution metric to obtain the sorted data.
The first judging subunit is connected with the data sorting subunit and is used for judging whether the distribution metric of each data in the sorted data is greater than a preset threshold, to obtain a first judgment result.
The first useful data determining subunit is connected with the first judging subunit and is used for retaining the data corresponding to the distribution metric and determining the data to be useful data when the first judgment result is that the distribution metric is greater than the preset threshold.
The second judging subunit is connected with the first judging subunit and is used for judging, when the first judgment result is that the distribution metric is less than or equal to the preset threshold, whether the data corresponding to the distribution metric is in the neighborhood of existing useful data, to obtain a second judgment result.
The second useful data determining subunit is connected with the second judging subunit and is used for determining that the data corresponding to the distribution metric is useful data when the second judgment result is that the data corresponding to the distribution metric is not in the neighborhood of existing useful data.
The redundant data determining subunit is connected with the second judging subunit and is used for determining that the data corresponding to the distribution metric is redundant data when the second judgment result is that the data corresponding to the distribution metric is in the neighborhood of existing useful data.
Corresponding to the multi-source heterogeneous data processing system based on the industrial system, the invention also provides a multi-source heterogeneous data processing method based on the industrial system, as shown in fig. 2, the method comprises the following steps:
Step 100: collect data of each device in the industrial system. The devices in the industrial system include: industrial host equipment, production control equipment, network equipment, security equipment, office equipment and industrial auxiliary equipment.
Step 101: preprocess the acquired data of each device in the industrial system. The preprocessing includes: coding processing, classification processing and vulnerability data detection. This step may be implemented as follows:
Step 1011: encode the acquired data of each device in the industrial system to obtain encoded data.
Step 1012: classify the encoded data to obtain classified data. The classified data include: control data, network data, platform data, log data, traffic data, asset data, tool data, production data, or vulnerability data.
Step 1013: cache the classified data; when the cache is full, transmit the cached classified data to the edge computing module and clear the cached data in the full cache region.
Step 1014: detect whether vulnerability data exist in the classified data; when vulnerability data exist, encrypt them, upload the encrypted vulnerability data to the cloud data center, and generate an alarm signal at the same time.
Step 102: perform data cleaning, screening and encryption on the preprocessed data. This step may include the following:
Step 1021: clean the preprocessed data.
Step 1022: supplement the cleaned data by interpolation to obtain supplementary data.
Step 1023: screen the supplementary data by a downsampling method based on the distribution metric to obtain useful data.
Step 1024: encrypt the useful data.
Step 103: store the preprocessed data and the data that have undergone data cleaning, screening and encryption.
As another embodiment, the implementation process of the step 1023 may be:
Measuring the distance between any two data in the supplementary data by adopting the Euclidean distance.
Determining a distribution metric for each data from the distance, based on the neighborhood of each data in the supplementary data. The neighborhood is a hypersphere formed by taking any data point in the supplementary data as a center and taking a preset value as a radius.
Sorting the data in the supplementary data in descending order based on the distribution metric to obtain sorted data.
Judging whether the distribution metric of each data in the sorted data is greater than a preset threshold value, to obtain a first judgment result.
When the first judgment result is that the distribution metric is greater than the preset threshold, retaining the data corresponding to the distribution metric and judging the data to be useful data.
When the first judgment result is that the distribution metric is less than or equal to the preset threshold, judging whether the data corresponding to the distribution metric is in the neighborhood of existing useful data, to obtain a second judgment result.
When the second judgment result is that the data corresponding to the distribution metric is not in the neighborhood of existing useful data, determining that the data corresponding to the distribution metric is useful data.
When the second judgment result is that the data corresponding to the distribution metric is in the neighborhood of existing useful data, determining that the data corresponding to the distribution metric is redundant data.
The following provides a specific embodiment, which is used to explain the specific implementation process of the multi-source heterogeneous data processing system and method based on the industrial system, and in the practical application process, the implementation process is not limited to the algorithm adopted in the following embodiments.
Step 1: the multi-path data acquisition area comprises a plurality of data acquisition devices, the data acquisition devices acquire data of various industrial field devices, the device objects comprise industrial host equipment, production control equipment, network equipment, safety equipment, office equipment, industrial auxiliary equipment and the like, and the data acquisition devices transmit the data to the acquisition preprocessing terminal.
Step 2: The acquisition preprocessing terminal uniformly encodes, classifies and caches the obtained data.
Step 2.1: the acquisition preprocessing terminal collects data sent by the data acquisition equipment and uniformly encodes the obtained data.
Step 2.2: the collection preprocessing terminal carries out preliminary classification and caching on the coded data, the data are divided into control data, network data, platform data, log data, flow data, asset data, tool data, production data and vulnerability data, the control data, the network data, the platform data, the log data, the flow data, the asset data, the tool data, the production data and the vulnerability data are stored in cache regions corresponding to the collection preprocessing terminal respectively, and the cache regions comprise a control data cache region, a network data cache region, a platform data cache region, a log data cache region, a flow data cache region, an asset data cache region, a tool data cache region, a production data cache region and a vulnerability data cache region. And after any cache region is full, sending the data of the cache region to an edge calculation module, emptying the data of the cache region after the data is successfully sent, and waiting for new data to be stored.
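The per-category caching in Step 2.2 can be illustrated with a minimal Python sketch. The class name `CategoryBuffers`, the `send` callback, and the idea of measuring capacity as a record count are assumptions made for illustration; the patent text does not specify how "full" is measured or how data is transmitted.

```python
# Category names follow the nine classes listed in Step 2.2.
CATEGORIES = (
    "control", "network", "platform", "log", "traffic",
    "asset", "tool", "production", "vulnerability",
)


class CategoryBuffers:
    """One fixed-capacity buffer per data category (hypothetical helper)."""

    def __init__(self, capacity, send):
        self.capacity = capacity  # assumption: capacity is a record count
        self.send = send          # assumption: send(category, records) -> bool
        self.buffers = {name: [] for name in CATEGORIES}

    def store(self, category, record):
        buf = self.buffers[category]
        buf.append(record)
        if len(buf) >= self.capacity:
            # Empty the buffer only after the send has succeeded.
            if self.send(category, list(buf)):
                buf.clear()


sent = []


def fake_send(category, records):
    """Stand-in for transmission to the edge calculation module."""
    sent.append((category, records))
    return True


buffers = CategoryBuffers(capacity=3, send=fake_send)
for value in range(4):
    buffers.store("log", {"value": value})
```

In this sketch the fourth record stays in the log buffer because the first three were flushed when the buffer reached capacity. A failed send leaves the buffer untouched, which mirrors the "emptying after the data is successfully sent" wording.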
Step 2.3: if the vulnerability data is detected, the vulnerability data is encrypted and then directly uploaded to a cloud data center, and abnormal information is sent to an alarm module in the modes of short messages, mails, alarms and the like.
Step 3: The edge calculation module cleans and screens the data.
Step 3.1: after the edge calculation module receives data sent by the cache region of the acquisition preprocessing terminal, the data is firstly cleaned through the data cleaning module, and missing values are supplemented by adopting various interpolation methods, wherein the methods comprise a random interpolation method and a linear interpolation method. The random interpolation method is to select the historical data of the buffer area to carry out random sampling to replace the missing data.
The linear interpolation formula is as follows:
$$y_2 = y_0 + (x_2 - x_0)\,\frac{y_1 - y_0}{x_1 - x_0}$$
where $(x_0, y_0)$ and $(x_1, y_1)$ are known historical data, $(x_2, y_2)$ is the data with the missing value, and $y_2$ is the missing value.
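The two interpolation methods named in Step 3.1 might be sketched as follows. This assumes the standard two-point form of linear interpolation and treats the buffered history as a simple list; the function names `linear_fill` and `random_fill` are hypothetical.

```python
import random


def linear_fill(x0, y0, x1, y1, x2):
    """Estimate the missing y2 at x2 from two known points (x0, y0), (x1, y1)."""
    if x1 == x0:
        raise ValueError("x0 and x1 must differ")
    return y0 + (x2 - x0) * (y1 - y0) / (x1 - x0)


def random_fill(history, rng=random):
    """Replace a missing value with a random sample from buffered history."""
    return rng.choice(history)


# Known points (0, 10) and (2, 20); the value at x = 1 is missing.
y2 = linear_fill(0.0, 10.0, 2.0, 20.0, 1.0)  # 15.0
```

Passing a seeded `random.Random` instance as `rng` makes the random fill reproducible, which can help when comparing cleaning runs.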
Step 3.2: the edge calculation module screens out useful data and redundant data by adopting a distribution measurement-based downsampling method for the cleaned data, and the specific method is as follows:
The Euclidean distance is used to measure the distance $d(x_i, x_j)$ between any two data:
$$d(x_i, x_j) = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2}$$
where $x_i$ and $x_j$ are any two pieces of data, $n$ is the data dimension, and $x_{ik}$ is the $k$-th component of the $i$-th piece of data.
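As a small illustration, the Euclidean distance between two records can be computed as below, assuming each record has already been encoded as a numeric vector of equal length. The function name is hypothetical.

```python
import math


def euclidean_distance(xi, xj):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(xi) != len(xj):
        raise ValueError("both records must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))


d = euclidean_distance([0.0, 0.0], [3.0, 4.0])  # 5.0
```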
The hypersphere centered on the sample point $x_i$ with radius $\varepsilon$ is defined as the $\varepsilon$ neighborhood of $x_i$. Let $N_\varepsilon(x_i)$ denote the number of sample points from all the data that fall within this neighborhood; a larger $N_\varepsilon(x_i)$ means that more data are distributed near $x_i$. Set an adjustable radius $\varepsilon$ and a threshold $q$, and calculate the $N_\varepsilon$ corresponding to each data.
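The neighborhood count can be sketched with a brute-force scan. Two details here are assumptions not fixed by the text: the point itself is counted as a member of its own neighborhood, and the boundary of the hypersphere is treated as inside.

```python
import math


def neighborhood_count(points, i, eps):
    """Number of points within distance eps of points[i] (itself included)."""
    center = points[i]
    return sum(1 for p in points if math.dist(center, p) <= eps)


points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (5.0, 5.0)]
dense = neighborhood_count(points, 0, eps=1.0)     # 3
isolated = neighborhood_count(points, 3, eps=1.0)  # 1
```

The scan costs one distance evaluation per pair of points, so for large buffers a spatial index would likely be preferable; that choice is outside what the text describes.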
The distribution metric of each data is defined by the epsilon neighborhood of each data:
Figure BDA0003357063850000122
where ρ (x) is the distribution metric of data x and n is the number of data points in the ε neighborhood.
And calculating a distribution metric of all the data, wherein the distribution metric represents the distribution information of the data to a certain extent, and the larger the distribution metric value is, the more redundant data near the data is represented.
The data screening unit in the edge calculation module arranges all the data in descending order of $\rho(x)$ and screens the data one by one, starting from the data point with the largest $\rho(x)$. If $N_\varepsilon(x_i)$ is greater than the threshold $q$, the data point is retained as useful data. If $N_\varepsilon(x_i)$ is less than or equal to the threshold $q$, then: if the data is not in the $\varepsilon$ neighborhood of any existing useful data point, the point is taken as useful data; if the data is in the hypersphere (namely the $\varepsilon$ neighborhood of an existing useful data point), the point is considered redundant data.
And after traversing all the data, the data screening module divides all the data into useful data and redundant data.
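The screening pass of Step 3.2 can be sketched as follows. Because the exact definition of the distribution metric is given only in a figure, this sketch makes a loud assumption: it ranks points by the neighborhood count itself rather than by ρ(x). The function name `screen` and the inclusive boundary test are likewise illustrative choices.

```python
import math


def screen(points, eps, q):
    """Split points into (useful, redundant) following the Step 3.2 rules."""
    counts = [
        sum(1 for p in points if math.dist(c, p) <= eps) for c in points
    ]
    # Assumption: the neighborhood count stands in for rho(x) when ranking.
    order = sorted(range(len(points)), key=lambda i: counts[i], reverse=True)
    useful, redundant = [], []
    for i in order:
        x = points[i]
        if counts[i] > q:
            # Count above the threshold: always retained.
            useful.append(x)
        elif any(math.dist(x, u) <= eps for u in useful):
            # Already covered by the neighborhood of a useful point.
            redundant.append(x)
        else:
            useful.append(x)
    return useful, redundant


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
useful, redundant = screen(points, eps=0.5, q=2)
```

With these inputs the three clustered points each have a count of 3 and are kept, `(5.0, 5.0)` is kept because no useful point covers it, and `(5.1, 5.0)` is marked redundant because it lies inside the neighborhood of `(5.0, 5.0)`.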
Step 4: Useful data is stored in a cache region of the edge computing module; after the cache region is full, the data is encrypted and uploaded to the cloud data center.
Based on the above description, the technical solution provided by the present invention now has the following advantages over the prior art:
1. The invention targets industrial field multi-source heterogeneous data, and uses an acquisition preprocessing terminal to uniformly encode, classify and store the data in a multipath parallel mode. After uniform encoding, the data are divided into log data, flow data, asset data, tool data, production data and vulnerability data, which are respectively stored in corresponding cache regions. Uniform encoding and preliminary classification facilitate data management and the subsequent screening of useful data.
2. The method is oriented to the industrial field multi-source heterogeneous data, and classification and screening of the industrial field multi-source heterogeneous data are achieved. And the computing pressure of the cloud data center is effectively relieved by completing a part of computing tasks through the edge computing module. The edge computing module screens data based on unified coding, performs data compression by eliminating redundant data, and utilizes distribution measurement information to reduce data volume to a certain extent when screening data, and meanwhile, the information of original data is kept to a greater extent, useful data are effectively extracted, and data storage expenditure of a cloud data center is greatly saved. Meanwhile, for the cloud data center, data screening of the edge computing module is also an efficient data cleaning mode, and the edge computing module bears a part of computing tasks, so that the computing burden of the cloud data center is reduced. The vulnerability data is directly uploaded to the alarm module, and the timeliness requirement of abnormal alarm is met.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (10)

1. A multi-source heterogeneous data processing system based on an industrial system, comprising:
the multi-path data acquisition terminal is used for acquiring data of each device in the industrial system; the devices in the industrial system include: industrial host equipment, production control equipment, network equipment, safety equipment, office equipment and industrial auxiliary equipment;
the acquisition preprocessing terminal is connected with the multi-path data acquisition terminal and is used for preprocessing the acquired data of each device in the industrial system; the preprocessing comprises: coding processing, classification processing and vulnerability data detection;
the edge calculation module is connected with the acquisition preprocessing terminal and is used for carrying out data cleaning, screening and encryption processing on the preprocessed data;
and the cloud data center is respectively connected with the acquisition preprocessing terminal and the edge computing module and is used for storing the preprocessed data and the data subjected to data cleaning, screening and encryption processing.
2. The industrial system-based multi-source heterogeneous data processing system according to claim 1, wherein the acquisition preprocessing terminal comprises:
the encoding unit is connected with the multi-path data acquisition end and is used for encoding the acquired data of each device in the industrial system to obtain encoded data;
the classification unit is connected with the coding unit and is used for classifying the coded data to obtain classified data; the classification data includes: control data, network data, platform data, log data, traffic data, asset data, tool data, production data, or vulnerability data;
the cache unit comprises a plurality of buffer areas, is respectively connected with the classification unit and the edge calculation module, and is used for caching the classification data, transmitting the cached classification data to the edge calculation module when any one of the buffer areas is full, and simultaneously clearing the cached data in the full buffer area;
and the vulnerability detection unit is connected with the classification unit and the cloud data center and is used for detecting whether vulnerability data exist in the classification data, encrypting the existing vulnerability data and uploading the encrypted vulnerability data to the cloud data center when the vulnerability data exist, and simultaneously generating an alarm signal.
3. The industrial system based multi-source heterogeneous data processing system of claim 2, further comprising:
the alarm module is connected with the vulnerability detection unit and used for receiving the alarm signal and then sending an alarm; the mode of receiving the alarm signal is a short message, an email or an alarm mode.
4. The industrial system-based multi-source heterogeneous data processing system of claim 2, wherein the plurality of buffers comprises: a production data cache region, a control data cache region, a log data cache region, a network data cache region, a traffic data cache region, an asset data cache region, a tool data cache region, a platform data cache region, and a vulnerability data cache region.
5. The industrial system-based multi-source heterogeneous data processing system of claim 1, wherein the edge calculation module comprises:
the data cleaning unit is connected with the acquisition preprocessing terminal and is used for cleaning the preprocessed data;
the data supplementing unit is connected with the data cleaning unit and used for supplementing the cleaned data by adopting an interpolation method to obtain supplemented data; the interpolation method comprises the following steps: random interpolation and linear interpolation;
the data screening unit is connected with the data cleaning unit and used for screening the supplementary data by adopting a distribution measurement-based downsampling method to obtain useful data;
and the encryption unit is connected with the data screening unit and is used for encrypting the useful data.
6. The industrial system-based multi-source heterogeneous data processing system of claim 5, wherein the data screening unit comprises:
the data distance determining subunit is connected with the data supplementing unit and is used for measuring the distance between any two data in the supplementing data by adopting the Euclidean distance;
the distribution metric determining subunit is connected with the data distance determining subunit and used for determining the distribution metric of each data according to the distance based on the neighborhood of each data in the supplementary data; the neighborhood is a hyper-sphere formed by taking any data point in the supplementary data as a center and taking a preset value as a radius;
the data sorting subunit is connected with the distribution metric determining subunit and is used for sorting the data in the supplementary data in a descending order based on the distribution metric to obtain sorted data;
the first judging subunit is connected with the data sorting subunit and is used for judging whether the distribution metric of each data in the sorted data is greater than a preset threshold value, to obtain a first judgment result;
the first useful data determining subunit is connected with the first judging subunit and is used for retaining the data corresponding to the distribution metric and judging the data to be useful data when the first judgment result is that the distribution metric is greater than the preset threshold;
the second judging subunit is connected with the first judging subunit and is used for judging whether the data corresponding to the distribution metric is in the neighborhood of existing useful data when the first judgment result is that the distribution metric is less than or equal to the preset threshold value, to obtain a second judgment result;
the second useful data determining subunit is connected with the second judging subunit and is used for determining that the data corresponding to the distribution metric is useful data when the second judgment result is that the data corresponding to the distribution metric is not in the neighborhood of existing useful data;
and the redundant data determining subunit is connected with the second judging subunit and is used for determining that the data corresponding to the distribution metric is redundant data when the second judgment result is that the data corresponding to the distribution metric is in the neighborhood of existing useful data.
7. A multi-source heterogeneous data processing method based on an industrial system is characterized by comprising the following steps:
collecting data of each device in an industrial system; the devices in the industrial system include: industrial host equipment, production control equipment, network equipment, safety equipment, office equipment and industrial auxiliary equipment;
preprocessing the acquired data of each device in the industrial system; the preprocessing comprises: coding processing, classification processing and vulnerability data detection;
carrying out data cleaning, screening and encryption processing on the preprocessed data;
and storing the preprocessed data and the data subjected to data cleaning, screening and encryption.
8. The multi-source heterogeneous data processing method based on the industrial system according to claim 7, wherein the preprocessing of the collected data of each device in the industrial system specifically includes:
encoding the acquired data of each device in the industrial system to obtain encoded data;
classifying the coded data to obtain classified data; the classification data includes: control data, network data, platform data, log data, traffic data, asset data, tool data, production data, or vulnerability data;
caching the classified data, transmitting the cached classified data to the edge computing module when the cache is full, and simultaneously clearing the cached data in the full cache region;
and detecting whether vulnerability data exists in the classified data, encrypting the existing vulnerability data and uploading the encrypted vulnerability data to the cloud data center when the vulnerability data exists, and generating an alarm signal at the same time.
9. The multi-source heterogeneous data processing method based on the industrial system according to claim 7, wherein the data cleaning, screening and encrypting the preprocessed data specifically comprises:
carrying out data cleaning on the preprocessed data;
supplementing the cleaned data by adopting an interpolation method to obtain supplemented data;
screening the supplementary data by adopting a distribution measurement-based downsampling method to obtain useful data;
and encrypting the useful data.
10. The multi-source heterogeneous data processing method based on the industrial system according to claim 9, wherein the filtering of the supplementary data by using a downsampling method based on distribution metrics to obtain useful data specifically comprises:
measuring the distance between any two data in the supplementary data by adopting a Euclidean distance;
determining distribution measurement of each data according to the distance based on the neighborhood of each data in the supplementary data; the neighborhood is a hyper-sphere formed by taking any data point in the supplementary data as a center and taking a preset value as a radius;
sorting the data in the supplementary data in a descending order based on the distribution measurement to obtain sorted data;
judging whether the distribution metric of each data in the sorted data is greater than a preset threshold value, to obtain a first judgment result;
when the first judgment result is that the distribution metric is larger than the preset threshold, retaining data corresponding to the distribution metric and judging the data to be useful data;
when the first judgment result is that the distribution metric is less than or equal to the preset threshold, judging whether the data corresponding to the distribution metric is in the neighborhood of the existing useful data or not to obtain a second judgment result;
when the second judgment result is that the data corresponding to the distribution metric is not in the neighborhood of the existing useful data, determining that the data corresponding to the distribution metric is useful data;
and when the second judgment result is that the data corresponding to the distribution metric is in the neighborhood of the existing useful data, determining that the data corresponding to the distribution metric is redundant data.
CN202111355901.9A 2021-11-16 2021-11-16 Multi-source heterogeneous data processing system and method based on industrial system Active CN114003596B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111355901.9A CN114003596B (en) 2021-11-16 2021-11-16 Multi-source heterogeneous data processing system and method based on industrial system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111355901.9A CN114003596B (en) 2021-11-16 2021-11-16 Multi-source heterogeneous data processing system and method based on industrial system

Publications (2)

Publication Number Publication Date
CN114003596A true CN114003596A (en) 2022-02-01
CN114003596B CN114003596B (en) 2022-07-12

Family

ID=79929181

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111355901.9A Active CN114003596B (en) 2021-11-16 2021-11-16 Multi-source heterogeneous data processing system and method based on industrial system

Country Status (1)

Country Link
CN (1) CN114003596B (en)

Also Published As

Publication number Publication date
CN114003596B (en) 2022-07-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant