CN115293662A - Parallel and distributed ocean observation data fusion intelligent calculation method and system - Google Patents


Info

Publication number
CN115293662A
Authority
CN
China
Prior art keywords
data
distributed
observation data
model
channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211230749.6A
Other languages
Chinese (zh)
Inventor
张兆虔
李响
赵志刚
王春晓
耿丽婷
郭莹
吴晓明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Computer Science Center National Super Computing Center in Jinan
Original Assignee
Shandong Computer Science Center National Super Computing Center in Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Computer Science Center National Super Computing Center in Jinan filed Critical Shandong Computer Science Center National Super Computing Center in Jinan
Priority to CN202211230749.6A
Publication of CN115293662A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/103 Workflow collaboration or project management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Strategic Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Tourism & Hospitality (AREA)
  • Educational Administration (AREA)
  • Marketing (AREA)
  • Data Mining & Analysis (AREA)
  • Development Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Operations Research (AREA)
  • Game Theory and Decision Science (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Primary Health Care (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a method and a system for intelligent calculation of ocean observation data that fuses parallel and distributed processing, relating to the field of intelligent calculation over ocean observation time-series data streams. The ocean observation data stream of each channel is acquired in real time and stored in a distributed cluster. The stored streams are preprocessed to handle out-of-order, duplicate and missing data. Based on the preprocessed ocean observation data streams, multi-channel online learning models are trained by supercomputing MPI (Message Passing Interface) parallel training, yielding the latest intelligent calculation model of ocean observation data for each channel. Based on a Flink distributed stream processing system, the latest intelligent calculation model corresponding to each channel is selected for the ocean observation data continuously flowing into that channel, and real-time inference and prediction are carried out. The method suits multi-channel, multi-task application scenarios, effectively supports online learning and inference over streaming data and the management of high-throughput sensor data, and realizes rapid iterative upgrading of the multi-channel calculation models and real-time inference on the data.

Description

Parallel and distributed ocean observation data fusion intelligent calculation method and system
Technical Field
The invention belongs to the field of intelligent calculation over ocean observation time-series data streams, and particularly relates to a parallel and distributed method and system for intelligent calculation of ocean observation data.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
With the rapid development of marine observation networks, large numbers of buoy sensors generate hydro-ecological and physical-environment observation time-series data in streaming form. Calculation and analysis of ocean observation big data must be automatic, concurrent, efficient and accurate; developing a computing system and method that supports real-time automated processing and analysis of observed data streams is therefore a key core technology to be mastered. However, the high-throughput, multi-channel nature of the ocean observation data stream itself poses the following challenges for building such a computing system:
Challenge 1: challenges presented by multiple channels, high throughput, and massive data streams.
Data streams generated by many classes of marine observation sensors are transmitted over their respective channels; ensuring ordered caching and classified storage of this massive data, and thereby achieving high throughput and high reliability, is the primary challenge facing the system. To reduce the system's overall processing latency, high-throughput handling of the continuously generated streaming data must be satisfied first. The commonly used approach today is batch processing: the continuously generated data is processed in small batches over a fixed time interval or window, i.e., accumulated and then computed at once. This introduces latency, increases overall processing time, and fails the real-time requirement; it is an inherent defect of batch processing that must be overcome, because real-time behaviour is a precondition of ocean observation data processing.
In addition, for streaming data generated by ocean sensors in real time, repeated transmission and processing of the same data no longer meets the scenario's requirements; guaranteeing that each datum is processed only once while preserving the timeliness of data processing is another challenge. The high-reliability problem also covers abnormal situations such as out-of-order and missing data during stream processing, described concretely as follows. Stream processing starts from sensor data that is generated, transmitted and then delivered to the processing unit, which takes a certain process and time. Ideally, the data flowing to the processing unit arrives in the order in which it was generated, as shown in fig. 1 (a), but disorder caused by the network, distributed storage and other factors cannot be ruled out. Disorder means that the sequence in which the processor receives data is not strictly ordered by the data's event time (Event Time). Ideally data arrives and is processed in one order, but in practice the factors mentioned above can cause data to arrive out of order or late. Once disorder occurs, if window (Window) operations are decided by event time alone, the system cannot determine whether all data is in place, yet it cannot wait indefinitely. For time-series data, the out-of-order problem must be solved first to guarantee the accuracy of inference results.
Challenge 2: challenges presented by data "concept drift" (caused by the evolution of the data stream over time).
A traditional static model trained on batch data quickly becomes outdated; it cannot handle dynamic concept drift or produce accurate inference. Existing classical machine learning models are essentially trained on batch data using supervised or unsupervised learning, i.e., they usually operate on static/batch datasets, and in that setting they infer and predict well on batch historical data. However, when models built by batch learning are applied to stream data, their inference quality degrades over time as concept drift occurs. In a high-throughput ocean stream-data scenario, a system is needed that supports online learning of models over stream data, ensuring the model continuously adapts to the changing stream and thereby solving the concept-drift problem.
Challenge 3: challenges presented by concurrent training of a large number of models and concurrent data-stream inference.
Aiming at model training in this scenario: a different model must be trained for each observation channel's data, which raises the demands on resource allocation and memory, and strong computing power is needed for real-time parallel computation over massive data. Training this many models is better served by supercomputing (HPC) MPI parallel training jobs, whereas big-data computing frameworks such as Spark and Flink were not originally designed for deep learning model training; their computation mechanism only suits training a single large model in a data-parallel fashion, not concurrent training of a large number of models at once.
Aiming at data-stream inference in this scenario: data is acquired and inferred in stream form. If the computation mode of model training were carried over (inferring multi-channel stream data as supercomputing parallel jobs), then under a single-node abnormality (such as a physical fault on one node or an abnormal data-stream transmission), the rigid nature of supercomputing MPI parallel jobs does not support switching to other computing resources; the job cannot continue, and inference and prediction over the whole ocean dataset is affected. The system must therefore handle abnormal conditions of various causes while guaranteeing highly concurrent multi-channel inference, improving its reliability and stability, which is itself a challenge. On the premise of high reliability, data streams also need high-speed processing, fast response and real-time prediction. Although today's mature Spark processing framework greatly improves the timeliness of data processing compared with the traditional MapReduce distributed framework, for our application scenario, where every datum must be processed and fed back promptly, Spark's micro-batch design cannot achieve truly pure real-time online inference.
Based on the above analysis, an intelligent calculation method for ocean observation data urgently needs to be researched.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a method and a system for intelligent calculation of ocean observation data that fuses parallel and distributed processing. A streaming big-data framework and a deep learning framework are bound and fused, while a supercomputing platform performs deep learning model training that follows the Online Learning paradigm. Exploiting the advantages of data partitioning, a window selection mechanism, a distributed architecture, iterative computation characteristics and high-performance computing, online learning and inference over ocean observation data are realized, meeting the needs of online learning and real-time online inference over multi-channel data. Problems of the massive data stream, including early-stage data loss and disorder, high throughput, high reliability, low latency and concept drift, are solved, realizing real-time, intelligent processing of high-throughput data streams by the system; the approach suits multi-channel, multi-task application scenarios and is convenient to manage.
To achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
The invention provides, in a first aspect, a method for intelligent calculation of ocean observation data fusing parallel and distributed modes;
the method for intelligent calculation of fused parallel and distributed ocean observation data comprises the following steps:
acquiring the ocean observation data stream of each channel in real time, and storing it into a distributed cluster Kafka;
preprocessing the stored data stream for out-of-order, duplicate and missing data;
training multi-channel online learning models via supercomputing MPI parallel training on the preprocessed ocean observation data stream, to obtain the latest intelligent calculation model of ocean observation data for each channel;
based on a Flink distributed stream processing system, selecting the latest intelligent calculation model corresponding to each channel for the ocean observation data continuously flowing into that channel, and carrying out real-time inference and prediction.
Further, the distributed cluster Kafka stores the data stream of each channel in its own topic (Topic), divides each topic into several ordered partitions (Partition), and distributes them across multiple servers within the cluster.
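The topic/partition layout described above can be sketched in a few lines. The partitioner below is a hypothetical stand-in (Kafka's actual default partitioner differs): a stable CRC32 hash of the record key picks the partition, so records with the same key always land in the same ordered partition.

```python
# Hypothetical sketch of Kafka-style topic partitioning; names are illustrative.
from collections import defaultdict
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # Stable hash so the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def append_to_topic(topic: dict, key: str, value, num_partitions: int = 4) -> int:
    p = partition_for(key, num_partitions)
    topic[p].append((key, value))   # each partition is an ordered queue
    return p

topic = defaultdict(list)
for i in range(8):
    append_to_topic(topic, key=f"sensor-{i % 2}", value=i)

# All records of one key share a single partition, in arrival order.
same_partition = {partition_for("sensor-0", 4) for _ in range(3)}
```

Because ordering is only guaranteed within a partition, keying by channel (as here) preserves per-channel order while spreading a huge topic across brokers.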
Further, in the distributed cluster Kafka, when data with the same primary key value is submitted, the servers in the cluster persist only one piece of ocean observation data per primary key value, ensuring that each datum is processed only once during cluster processing.
Furthermore, an evolving convolutional neural network architecture realizes the online learning method, so that the model adapts to the dynamically evolving data stream: its structure is adjusted dynamically from shallow to deep as data flows in, its substructures are dynamically re-weighted from the data stream in a sequential or online learning fashion, and a neural network with a deep structure and complex nonlinear functions is learned in an online environment.
Furthermore, the supercomputing MPI parallel training creates a separate data connection for each topic of the distributed cluster Kafka, feeds training data through that connection to each channel's online learning model, and submits the training as an MPI job to the supercomputing platform, where one process on one computing node is responsible for training one channel's online learning model.
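A minimal sketch of the rank-to-channel assignment this describes, with the MPI runtime left out so the example is self-contained (with mpi4py, the rank would come from `MPI.COMM_WORLD.Get_rank()`). The per-channel "model" here is just a running mean; all names and data are illustrative assumptions.

```python
# One MPI rank trains the online model for exactly one channel.
def channel_for_rank(rank: int, channels: list) -> str:
    if rank >= len(channels):
        raise ValueError("more ranks than channels")
    return channels[rank]

def train_channel(rank: int, channels: list, batches: dict):
    """Each rank consumes only its own channel's topic and updates
    that channel's model incrementally (here: a running mean)."""
    ch = channel_for_rank(rank, channels)
    model, n = 0.0, 0
    for x in batches[ch]:
        n += 1
        model += (x - model) / n   # incremental mean update
    return ch, model

channels = ["temperature", "salinity", "chlorophyll"]
batches = {"temperature": [10.0, 12.0],
           "salinity": [35.0],
           "chlorophyll": [0.5, 0.7, 0.9]}
results = dict(train_channel(r, channels, batches) for r in range(3))
```

Under a real MPI job the three calls would run as three concurrent processes, one per computing-node process, rather than in a loop.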
And further, version control is performed on each channel's model: trained models are pushed to a model version library in real time, and the model storage granularity is adjusted to the specific situation, giving a corresponding storage step length.
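One possible shape for such a version library with an adjustable storage step; the class and its interface are illustrative assumptions, not the system's actual design.

```python
# Per-channel model version store: only every step-th trained model is
# persisted, trading storage volume against rollback granularity.
class ModelVersionStore:
    def __init__(self, step: int = 1):
        self.step = step      # storage granularity (step length)
        self.counter = {}     # channel -> number of models submitted
        self.versions = {}    # channel -> list of (version_no, model)

    def submit(self, channel: str, model) -> bool:
        n = self.counter.get(channel, 0) + 1
        self.counter[channel] = n
        if n % self.step == 0:            # persist only every step-th model
            self.versions.setdefault(channel, []).append((n, model))
            return True
        return False

    def latest(self, channel: str):
        return self.versions[channel][-1]

store = ModelVersionStore(step=3)
for i in range(1, 8):
    store.submit("ch0", f"model-{i}")
```

With `step=3`, seven submitted models leave versions 3 and 6 in the library, and inference always loads the latest persisted one.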
Furthermore, the Flink distributed stream processing system is a purpose-built distributed cluster. With its distributed architecture, when one or several nodes go down, the inference program need not restart and re-consume the stream from the beginning for inference; it rolls back to the most recent state and resumes inference and prediction, so the overall inference program and the feedback of prediction results are unaffected.
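The rollback behaviour can be illustrated with a toy checkpointing loop (a drastic simplification of Flink's checkpoint/restore mechanism; all names are assumptions): the loop periodically checkpoints its stream offset and state, and after a simulated failure resumes from the nearest checkpoint instead of re-reading the stream from the beginning.

```python
# Toy illustration of checkpoint-and-rollback stream processing.
def run_with_checkpoints(stream, checkpoint_every=3, fail_at=None):
    offset, total = 0, 0
    checkpoint = (0, 0)                  # (offset, state) last persisted
    while offset < len(stream):
        if fail_at is not None and offset == fail_at:
            offset, total = checkpoint   # roll back to the nearest state
            fail_at = None               # recover and continue
            continue
        total += stream[offset]
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, total)
    return total

clean = run_with_checkpoints([1, 2, 3, 4, 5])
recovered = run_with_checkpoints([1, 2, 3, 4, 5], fail_at=4)
```

Because the state is restored together with the offset, work done after the last checkpoint is redone once but never double-counted, so the failed run produces the same result as the clean one.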
A second aspect of the present invention provides an intelligent computing system for fused parallel and distributed ocean observation data.
The intelligent computing system for fused parallel and distributed ocean observation data comprises a distributed storage module, a data preprocessing module, an online learning module and an online inference module:
a distributed storage module configured to: acquire the ocean observation data stream of each channel in real time, and store it into a distributed cluster Kafka;
a data preprocessing module configured to: preprocess the stored data stream for out-of-order, duplicate and missing data;
an online learning module configured to: train multi-channel online learning models via supercomputing MPI parallel training on the preprocessed ocean observation data stream, to obtain the latest intelligent calculation model of ocean observation data for each channel;
an online inference module configured to: based on a Flink distributed stream processing system, select the latest intelligent calculation model corresponding to each channel for the ocean observation data continuously flowing into that channel, and carry out real-time inference and prediction.
A third aspect of the present invention provides a computer readable storage medium, on which a program is stored, which when executed by a processor, implements the steps in the method for intelligent computation of fused parallel and distributed marine observation data according to the first aspect of the present invention.
A fourth aspect of the present invention provides an electronic device, including a memory, a processor, and a program stored in the memory and executable on the processor, where the processor executes the program to implement the steps in the method for intelligently computing the merged parallel and distributed marine observation data according to the first aspect of the present invention.
The above one or more technical solutions have the following beneficial effects:
the invention effectively solves the problems of transmission and storage of high-flux data streams generated by multiple sensors in multiple classes, and ensures ordered caching and classified storage of mass data so as to solve the problems of high throughput and high reliability in the data transmission and storage stage; each data in the data stream is only processed once, the real-time performance of the whole data processing is considered, and abnormal conditions such as data disorder and missing in the data stream processing process are solved.
At the system level, deploying a distributed message queue guarantees high reliability, so ocean observation data streams are delivered continuously and accurately to the corresponding computing nodes throughout production and deployment. The approach suits multi-channel, multi-task application scenarios: powerful supercomputer resources provide real-time parallel computation over massive data, and HPC (high-performance computing) MPI (Message Passing Interface) jobs meet the distributed training needs of the multi-channel deep learning models under the Online Learning paradigm. Building a Flink distributed cluster realizes distributed observation-data stream processing and online inference, ensuring high fault tolerance of the system, effectively supporting online learning and inference over stream data and the management of high-throughput sensor data, and achieving fast data acquisition, rapid iterative upgrading of the multi-channel calculation models and real-time inference on the data.
Compared with other algorithms, the calculation is faster and space utilization higher: the online learner in the processing pipeline does not need to write data to disk, and the structure and complexity of the model are adjusted dynamically from shallow to deep as the evolving data stream keeps arriving. The model maintains stable performance, avoiding over-fitting or under-fitting, and model accuracy is well guaranteed while real-time behaviour is preserved.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain, not to limit, the invention.
FIG. 1 (a) is a diagram illustrating an example of sequential processing of an ideal case data stream according to an embodiment of the present invention;
FIG. 1 (b) is a diagram illustrating an example of an out-of-order handling mechanism for actual scene data streams according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a filling effect of missing values of time series data according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a single-channel online learning mechanism according to an embodiment of the present invention;
FIG. 4 is a diagram of an architecture of an online training and online reasoning system according to an embodiment of the present invention;
fig. 5 is a diagram of the internal operation structure of the YARN-based Flink system according to an embodiment of the present invention;
FIG. 6 is a flowchart of a method provided by an embodiment of the invention;
fig. 7 is a system structure diagram provided in the embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention; unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well unless the context clearly indicates otherwise, and the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components and/or combinations thereof.
Noun explanation
Online Learning: a model-training paradigm that rapidly adjusts and iteratively updates a model in real time as the data stream changes, so the model reflects online changes in real time, improving prediction accuracy and guaranteeing inference timeliness;
Kafka: a message queue based on the publish/subscribe pattern;
MPI: Message Passing Interface, a message-passing standard supporting high-performance computing;
Topic: a subject (message category);
Partition: a partition of a topic;
Broker: a server within a cluster;
Replica: a copy of a partition;
Exactly Once: the guarantee that each datum is processed only once;
SeqNumber: a sequence number;
Flink: a distributed processing engine for batch/stream data;
Watermark: a "water level line", a delay-trigger mechanism that can meet specific requirements;
Scala: a multi-paradigm programming language similar to the Java programming language;
Redis: a key-value storage database, one of the non-relational databases;
Set: a data structure holding distinct elements;
Bloom Filter: a space-efficient probabilistic data structure that can quickly determine whether an element belongs to a set.
Example one
The embodiment discloses an intelligent calculation method for fused parallel and distributed ocean observation data.
As shown in fig. 6, the method for intelligent calculation of fused parallel and distributed ocean observation data includes:
step S1: acquiring the ocean observation data stream of each channel in real time, and storing it into a distributed cluster Kafka;
for data storage and caching: the data streams generated by the various observation sensors are transmitted over their respective channels. To guarantee ordered caching and classified storage of high-throughput data, a distributed cluster Kafka produces each channel's data stream into its own topic (Topic). To avoid any topic growing too large under high-throughput data, each topic is divided into several partitions (Partition), each partition being an ordered queue, so that a huge topic is spread across the brokers (Broker) of several servers in the cluster. The whole cluster can thus adapt to data of any size, achieves high scalability, and lowers the cost of use.
If, during one transmission of streaming data, a broker in the cluster goes down or suffers another abnormal condition, data would be lost. To improve the system's fault tolerance, several replicas (Replica) are therefore kept for each topic; even if one broker fails, the cluster can still consume the corresponding data through the replicas, guaranteeing the high reliability of the distributed cluster.
To guarantee Exactly Once semantics when sensor data is produced and consumed, the specific process is as follows:
firstly, the idempotence parameter of the producer needs to be set to True. Idempotence means that however much duplicate data arrives, only one copy is processed. When idempotence is initialized, the producer is assigned a primary key value PID, and each record sent to the same partition carries a queue sequence number SeqNumber. The broker caches the triple of PID, Partition and SeqNumber, and therefore persists at most one record for any given sequence number of a given queue of a given partition.
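The broker-side idempotence check described above can be sketched as follows; this is a simplified model of the Kafka mechanism, not its implementation, and the record values are made up for illustration.

```python
class Broker:
    """Toy broker that caches, per (PID, partition), the highest persisted
    SeqNumber and drops any resend with an already-seen sequence number."""
    def __init__(self):
        self.last_seq = {}   # (pid, partition) -> highest seq persisted
        self.log = []        # persisted records, in order

    def append(self, pid, partition, seq, record):
        key = (pid, partition)
        if self.last_seq.get(key, -1) >= seq:
            return False                     # duplicate resend, dropped
        self.last_seq[key] = seq
        self.log.append(record)
        return True

b = Broker()
ok1 = b.append(pid=7, partition=0, seq=0, record="t=10.1")
ok2 = b.append(pid=7, partition=0, seq=0, record="t=10.1")  # producer retry
ok3 = b.append(pid=7, partition=0, seq=1, record="t=11.3")
```

The retry with the same sequence number is rejected, so the log holds each record exactly once even though the producer sent it twice.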
Step S2: preprocessing the stored data stream for out-of-order, duplicate and missing data;
out-of-order and duplicate data are handled based on a window selection mechanism and a delay-trigger mechanism.
The delay-trigger mechanism of this embodiment draws on the Watermark mechanism. Concretely, the delay time of the mechanism is set to t. Each time data arrives, the processing program tracks the largest EventTime seen so far in the ocean observation data, denoted maxEventTime (EventTime being the time at which the ocean observation sensor generated the datum), and assumes that all data with EventTime smaller than maxEventTime - t has already arrived; whenever a window's cutoff time is no later than maxEventTime - t, that window is triggered to execute.
The Watermark is set according to the specific scenario and requirements. Taking fig. 1 (b) as an example, the delay time t of the delay-trigger mechanism is set to 2 s, so the Watermark corresponding to an event with timestamp 10:00 trails that timestamp by 2 s.
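A simplified, self-contained model of the delay-trigger rule (watermark = maxEventTime - t; a window fires once the watermark reaches its cutoff). Times are plain integers and all names are illustrative.

```python
# Watermark-driven tumbling windows over out-of-order events.
def fired_windows(events, delay, window_size):
    """events: iterable of (event_time, value); returns the end times of
    windows in the order they were triggered."""
    max_event_time = float("-inf")
    pending = {}      # window end time -> buffered values
    fired = []
    for t, v in events:
        end = (t // window_size + 1) * window_size   # tumbling window cutoff
        pending.setdefault(end, []).append(v)
        max_event_time = max(max_event_time, t)
        watermark = max_event_time - delay
        for w_end in sorted(pending):                # fire completed windows
            if watermark >= w_end:
                fired.append(w_end)
                del pending[w_end]
    return fired

# The late event (3, "late") still lands in window [0, 5), which only
# fires after an event with time >= 7 pushes the watermark to 5.
out = fired_windows([(1, "a"), (4, "b"), (3, "late"), (7, "c")], delay=2, window_size=5)
```

The delay t is the tolerated out-of-orderness: larger t accepts later data but adds latency before each window fires.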
For the data-duplication problem: during data acquisition, caching and transmission, factors such as the network may cause duplicates, so this embodiment uses the timestamp of each datum as its key. For massive data, the memory load is limited, so deduplication with a set data structure like Scala's set or a Redis set is clearly infeasible; a Bloom Filter is used instead. A Bloom Filter can check whether an element is in a set with query time and space efficiency much better than other approaches, meeting the real-time deduplication needs of the ocean observation data stream in this scenario.
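A minimal Bloom filter sufficient for timestamp-keyed deduplication. The hash scheme (salted SHA-256) and the parameters m and k are illustrative assumptions, not tuned values; a production filter would size m and k from the expected element count and target false-positive rate.

```python
import hashlib

class BloomFilter:
    def __init__(self, m: int = 1 << 16, k: int = 4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)   # m-bit array, all zero

    def _positions(self, key: str):
        # Derive k hash positions from SHA-256 with different salts.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

bf = BloomFilter()
seen_before = "2022-10-06T10:00:07" in bf   # hypothetical timestamp key
bf.add("2022-10-06T10:00:07")
seen_after = "2022-10-06T10:00:07" in bf
```

Membership answers are probabilistic: "no" is always correct, while "yes" may rarely be a false positive, which for deduplication means an occasional datum dropped rather than a duplicate passed through.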
For the problem of missing data, the common remedies are filling and deletion. Abnormal data can simply be deleted in scenarios with low stability requirements, but scenarios with high stability requirements call for filling in the abnormal values.
To avoid information loss, this embodiment fills the missing data. Common filling methods are mainly statistical, including neighbor filling, feature-value filling, and linear filling. On top of the machine-learning pipeline, the system fills missing values by linear interpolation; the specific effect is shown in fig. 2. This is a prior-based method that requires the missing data and its adjacent points to satisfy an approximately linear fitting relation, and it effectively solves the missing-value problem of ocean observation time-series data.
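The linear interpolation fill can be sketched with numpy's `np.interp`, which fills each missing point linearly from its nearest known neighbors; the variable names and sample series are illustrative.

```python
import numpy as np

def fill_linear(times, values):
    """Fill NaN gaps in a time series by linear interpolation between
    the nearest known neighbors, as described in the text above."""
    values = np.asarray(values, dtype=float)
    times = np.asarray(times, dtype=float)
    known = ~np.isnan(values)
    # np.interp linearly interpolates the missing points from the known ones
    return np.interp(times, times[known], values[known])

# a short temperature-like series with two missing readings
t = [0, 1, 2, 3, 4]
v = [10.0, float("nan"), 14.0, float("nan"), 18.0]
filled = fill_linear(t, v)   # -> [10., 12., 14., 16., 18.]
```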
Step S3: based on the preprocessed ocean observation data streams, perform multi-channel online learning model training by means of supercomputing MPI parallel model training, obtaining the latest intelligent ocean-observation-data calculation model for each channel;
the data stream may evolve over time; this behavior is referred to as concept drift. A traditional static model trained on batch data becomes outdated quickly, cannot handle dynamic concept drift, and therefore cannot predict accurately.
Taking the single-channel model of the application scenario as an example, as shown in fig. 3: first, whether to initialize the model is chosen according to the specific service scenario and requirements; the model is then updated incrementally by continuous training on the continuous data stream, and the current latest model is used to perform real-time inference on unknown data. The decision boundary of the online learning algorithm can be adjusted promptly according to the received feedback, which keeps predictions grounded in reality and improves the opportunity to make more accurate decisions in subsequent rounds.
Compared with the traditional batch learning mode, every input sample in online learning also participates in model training and the model is updated iteratively, so the model changes with the environment and dynamically adapts to continuously changing environment variables. Because online training follows the curve of the error surface within each epoch, it can stably use a higher learning rate, and therefore reaches convergence with fewer passes over the training data than batch training. The aim of online learning is that at every step the learner learns and updates the model so as to guarantee the best possible predictions on future data, and the model is applied to the real-time sensor data stream without manual adjustment of model hyper-parameters.
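The predict-then-update cycle described above can be sketched with a minimal linear model trained by stochastic gradient descent. The model form and learning rate are illustrative assumptions; the embodiment's actual model is the evolving convolutional network of the next subsection.

```python
# Minimal sketch of the online learning loop: predict on the newly arrived
# sample, observe the revealed target, then update the model immediately.
def online_learn(stream, lr=0.1):
    w, b = 0.0, 0.0                     # linear model, illustrative only
    losses = []
    for x, y in stream:
        y_hat = w * x + b               # 1) predict with the current model
        err = y_hat - y                 # 2) the true target y is revealed
        losses.append(err * err)        #    instantaneous squared loss
        w -= lr * 2 * err * x           # 3) gradient step; the sample is
        b -= lr * 2 * err               #    never revisited afterwards
    return (w, b), losses

# the model improves as the stream flows, with no batch retraining
stream = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)] * 50
(w, b), losses = online_learn(stream)   # losses shrink toward zero
```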
(1) Online learning algorithm with adaptively expanding representation capacity
This method implements online learning with an evolving convolutional neural network architecture, so that the model adapts to the evolution of a dynamically evolving data stream: as data flows in, the structure of the model is dynamically adjusted from shallow to deep, and the substructures of the model are dynamically re-weighted from the data stream in a sequential, online learning manner.
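As a concrete illustration of the shallow-to-deep adjustment, the following sketch grows a model's block count only when its recent loss stops improving. The plateau heuristic and its thresholds are illustrative assumptions; the embodiment states only that capacity grows as data flows in.

```python
class GrowingModel:
    """Capacity-growth sketch: start with one block, append a block when the
    loss averaged over a recent window stops improving over the previous one."""
    def __init__(self, max_blocks=4, window=10, tol=0.01):  # illustrative values
        self.num_blocks = 1        # start shallow for fast initial convergence
        self.max_blocks = max_blocks
        self.window = window
        self.tol = tol
        self.losses = []

    def record_loss(self, loss):
        self.losses.append(loss)
        w, n = self.window, len(self.losses)
        if n >= 2 * w and self.num_blocks < self.max_blocks:
            recent = sum(self.losses[-w:]) / w
            previous = sum(self.losses[-2 * w:-w]) / w
            if previous - recent < self.tol:   # loss has plateaued
                self.num_blocks += 1           # deepen the model
                self.losses.clear()            # restart the plateau test

m = GrowingModel()
for loss in [0.5] * 20:      # a flat loss curve triggers one growth step
    m.record_loss(loss)
```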
For the ocean observation data stream, one-dimensional convolution is used to learn time-series features of the observation data, realizing the learning and modeling of ocean data sequences. Specifically:
each model is provided with n convolution blocks, n representing the maximum representation capacity of the model. Traditional CNN prediction depends only on the last layer $h_n$; the final prediction of this scheme is instead based on a weighted combination of the predictions of every layer $\{h_1, h_2, \dots, h_n\}$, as follows:

$$P_i = \mathrm{softmax}(h_i \Theta_i), \quad i = 1, \dots, n$$

where $P_i$ is the prediction derived in the i-th CNN block, $h_i$ is the flattened feature map learned in that convolution module, and $\Theta_i$ are the parameters of the shallow feed-forward network assigned to each CNN block.
To realize dynamic adjustment of each layer, an attention network is added to learn attention weights for the CNN blocks, so as to determine the relationship between the layer blocks and thereby realize the re-weighting operation:

$$\alpha_i = \mathrm{Attention}(h_i), \qquad \sum_{i=1}^{n} \alpha_i = 1$$

where $\mathrm{Attention}(\cdot)$ is a shallow network that calculates the weight of each CNN block; it is intended to measure the relationship between the substructures, and a normalization operation is performed once more in every iteration.
The prediction of the final model is obtained by fusing the weighted predictions of the blocks:

$$P(x) = \sum_{i=1}^{n} \alpha_i P_i$$
For the marine observation data stream, the loss function of the model at time step t is:

$$\mathcal{L}\big(P(x_t), y_t\big) = \sum_{i=1}^{n} \alpha_i\, \mathcal{L}\big(P_i(x_t), y_t\big)$$
At time step t, the parameters of the i-th block of the model are updated as:

$$\Theta_i^{(t+1)} = \Theta_i^{(t)} - \eta\, \alpha_i \nabla_{\Theta_i} \mathcal{L}\big(P_i(x_t), y_t\big)$$

Further, the gradient of the final prediction with respect to each block's parameters is calculated according to the above equation.
The function of the model is represented as:

$$\hat{Y} = F(X)$$

where $X = \{x_1, x_2, \dots, x_T\}$ represents the input sample sequence of the model, and $\hat{Y} = \{\hat{y}_1, \hat{y}_2, \dots, \hat{y}_T\}$ represents the predicted sequence derived by the model for the input sequence $X$.
The online deep learning model predicts round by round: in each round, a sample is input into the algorithm to obtain the prediction $\hat{y}_t$ for the next stage; after the prediction is finished, the target value $y_t$ is revealed and the instantaneous loss is calculated:

$$\ell_t = \mathcal{L}(\hat{y}_t, y_t)$$
This loss reflects the degree of prediction deviation. At the end of each round, the algorithm adjusts itself for the next round according to the instantaneous loss, without manual intervention to tune parameters, which improves efficiency while strongly guaranteeing accuracy.
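The block-weighted fusion and per-round re-weighting described above can be sketched numerically as follows. The exponential discounting rule is an assumption in the spirit of hedge-style online learning; the embodiment itself specifies only that a shallow attention network re-weights the blocks and re-normalizes each iteration.

```python
import numpy as np

def fuse_predict(block_preds, alpha):
    # P(x) = sum_i alpha_i * P_i : fuse the per-block predictions
    return float(np.dot(alpha, block_preds))

def reweight(alpha, block_losses, beta=0.5):
    # discount each block in proportion to its instantaneous loss,
    # then re-normalize so the weights sum to one (beta is illustrative)
    alpha = alpha * np.power(beta, block_losses)
    return alpha / alpha.sum()

alpha = np.ones(3) / 3                 # three blocks, uniform initial weights
preds = np.array([1.0, 2.0, 3.0])      # per-block predictions P_i
y = 3.0                                # revealed target value
p = fuse_predict(preds, alpha)         # fused prediction = 2.0
losses = (preds - y) ** 2              # per-block instantaneous losses
alpha = reweight(alpha, losses)        # the best block gains weight
```

After one round, the weight mass shifts toward the block whose prediction matched the revealed target.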
(2) Online learning model training supported by supercomputing parallel computation
For model training, as shown in fig. 4: in the online learning stage, the data distribution and evolution mechanism of each observation channel are assumed to differ substantially, so a separate model training task is set up for the observation sensor data of each channel, and multiple models must be learned online. Note that during online learning of the observed data stream, the online learner does not need to write data to disk. The specific process is as follows:
first, the corresponding data streams are consumed from the corresponding topics in the distributed message queue, each Topic storing one type of cleaned data;
then a connection is established through the connection operator; in this process the important connection parameters must be configured, including the broker.id (one broker serves as leader), the Kafka listening port (set to 9092), the ZooKeeper address used to store broker metadata (the hostname or IP address of ZooKeeper with client connection port 2181), and the available path /path.
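The parameters above can be collected into a configuration mapping along the following lines. The key names follow common Kafka/ZooKeeper convention, and the host names are illustrative assumptions; only the ports 9092 and 2181 and the path are given in the text.

```python
# Hedged sketch of the connection parameters for the consume operator;
# "broker1" and "zk1" are placeholder hosts, not values from the text.
kafka_connection = {
    "broker.id": 0,                        # broker id; one broker acts as leader
    "bootstrap.servers": "broker1:9092",   # Kafka listening port 9092
    "zookeeper.connect": "zk1:2181/path",  # ZooKeeper host, client port 2181,
                                           # plus the path storing broker metadata
}

def bootstrap_url(cfg):
    # helper returning the broker address a consumer would connect to
    return cfg["bootstrap.servers"]
```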
Finally, after the connection is established, each channel corresponds to its own online learning model. Each model can either learn from scratch, or a base model can first be built from part of the initial data. According to the loss value obtained on each data item, the model back-propagates the error layer by layer and automatically adjusts the corresponding parameters and the weight of each neuron.
Note that how the data stream consumed from the Kafka cluster is used can be set autonomously according to the specific business requirements: either a small amount of windowed data is split off as training data, or each individual data item serves as the training data of one training step.
Training a different model for each channel's data places higher demands on resource allocation and memory, and strong computing power is needed to support real-time parallel computation over massive data. The requirement of multi-channel online learning is therefore met by supercomputing MPI parallel model training, that is, the online learning task of every channel is submitted to the supercomputer platform as an MPI job.
After submission, the job is distributed over massive computing resources; each node contains several computing processes, and one process of one computing node can be responsible for the online learning task of one channel's sensor data. Alternatively, a node may be made responsible for the data of only one channel to optimize performance. For the observation sensor data of one channel, newly collected sensor data from Kafka are continuously fed into the corresponding online learning program to train and update the model, and the model is saved to the model version library periodically; this provides the precondition for rollback when an online model develops problems. The initial model of a channel starts with a shallow network to ensure that it converges quickly: if the initial model is too complex, convergence is slow, training time grows, the advantages of online learning are lost, and high real-time performance cannot be guaranteed. As data are continuously consumed, the corresponding model is iteratively updated, and as more Internet-of-Things sensor data are obtained, the representation capacity and complexity of the model increase as well.
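The one-process-per-channel assignment described above can be sketched as a rank-to-channel mapping. With mpi4py (or any MPI binding, which is an assumption since the embodiment does not name one), `rank` and `size` would come from `MPI.COMM_WORLD`; a plain function keeps the mapping testable without an MPI launcher.

```python
def channels_for_rank(rank, size, num_channels):
    """Channels assigned to one MPI process; when size >= num_channels each
    process handles at most one channel, matching the text above."""
    return [c for c in range(num_channels) if c % size == rank]

def run_worker(rank, size, num_channels):
    assigned = channels_for_rank(rank, size, num_channels)
    # each assigned channel would consume its Kafka topic here and feed
    # the data to its own online learning model
    return assigned

# 8 processes over 8 channels: exactly one channel per process
assignments = [run_worker(r, 8, 8) for r in range(8)]
```

When fewer processes than channels are available, the modulo mapping degrades gracefully by giving some processes more than one channel.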
(3) Model library construction and version maintenance
Model version control is performed for each channel's data: the trained model is written into the model version library in real time, and the corresponding storage step length is adjusted according to the model's storage granularity.
This is explained with the sliding-window training mode. After each round of training finishes and the model is updated, the continuously iterated model is named with the timestamp of the latest data in the current training window and stored into the model version library as a stream. To meet the real-time inference requirement, the memory-based Redis is chosen as the model version library to speed up model writes and reads. A model is destroyed after being consumed, but every iteration version must be kept for subsequent calls or online services. Furthermore, since the continuous data stream keeps updating the models and the iteration count of the multi-channel models keeps growing, model throughput and system storage capacity become a major challenge; a distributed file system (HDFS) is therefore used to back up the models, realizing distributed storage of the models trained in real time, so that even if a storage node goes down or another uncontrollable fault occurs, high fault tolerance and high reliability of the models are still guaranteed.
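The timestamp-named versioning and rollback logic described above can be sketched as follows. The in-memory dict stands in for Redis (hot versions) plus HDFS (durable backup), and the method names are illustrative assumptions.

```python
import bisect

class ModelVersionStore:
    """Sketch of the model version library: versions are keyed by channel and
    by the timestamp of the newest data in the training window, and rollback
    falls back to the nearest older version."""
    def __init__(self):
        self.versions = {}       # "channel:timestamp" -> stored model blob
        self.timeline = {}       # channel -> sorted list of version timestamps

    def save(self, channel, window_end_ts, model_blob):
        # name the model after the newest data timestamp in the window
        self.versions[f"{channel}:{window_end_ts}"] = model_blob
        bisect.insort(self.timeline.setdefault(channel, []), window_end_ts)

    def latest(self, channel):
        ts = self.timeline[channel][-1]
        return ts, self.versions[f"{channel}:{ts}"]

    def rollback(self, channel):
        # drop the newest version and fall back to the nearest older one
        ts = self.timeline[channel].pop()
        del self.versions[f"{channel}:{ts}"]
        return self.latest(channel)

store = ModelVersionStore()
store.save("temp", 1000, "model-v1")
store.save("temp", 2000, "model-v2")   # latest() now returns the 2000 version
```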
Step S4: based on the Flink distributed stream processing system, for the ocean observation data continuously flowing into each channel, the latest intelligent ocean-observation-data calculation model corresponding to that channel is selected to perform real-time inference and prediction.
With single-node inference, an abnormal condition followed by an inference failure affects the overall inference and prediction. Compared with implementing the inference process on the supercomputer, the high concurrency and high availability of the Flink distributed computation engine better guarantee the reliability of the inference process, so building a Flink distributed cluster for online inference well avoids the single point of failure that supercomputing single-node inference might cause.
With the distributed cluster and distributed architecture, when one or more nodes encounter abnormal conditions such as downtime, the inference program does not need to be restarted to re-acquire the streaming data from the beginning; the system rolls back to the nearest state to continue inference and prediction, and the feedback of the overall inference program and prediction results is unaffected.
Because the inference stage consumes far fewer resources than the training stage, using the above approach for per-channel inference prediction is the better scheme. First a Flink cluster is built on Yarn, with Flink running in Per-Job-Cluster mode on Yarn; Flink-based job execution proceeds as follows:
each Job corresponds to its own cluster; after a streaming-data inference Job is submitted, it applies for resources from Yarn according to its own needs until its execution finishes, and whether one Job fails has no effect on the normal submission and operation of the next Job.
In addition, each job has its own Dispatcher and Resource Manager and can accept resource applications on demand. Since the workload here is large-scale and long-running, this scheme effectively satisfies the operation and application requirements of this embodiment. Most importantly, the tasks are mutually independent and do not affect one another, which suits multi-channel, multi-task application scenarios and is easy to manage, and the created cluster disappears after its task finishes.
The operator classifies the models obtained from the model version library by timestamp and filters out the latest model. The application program is submitted to the Flink cluster as a job. The Job Manager in the Flink cluster first receives the inference computation program to be executed, which concretely contains: the Job Graph (the logical dataflow graph) and the packed JAR with all classes, libraries, and other resources. The Job Manager converts the Job Graph into a physical-level dataflow graph, called the Execution Graph, which contains all tasks that can be executed concurrently. The Job Manager requests from the Resource Manager the resources necessary to execute the tasks, namely slots on Task Managers. Once it has acquired enough resources, the Execution Graph is distributed to the Task Managers that actually run the tasks. During operation, the Job Manager is responsible for all operations requiring central coordination, such as checkpoint coordination.
Specifically, the Flink cluster is deployed on Yarn; based on the Yarn scheduling system, failover of every role is handled automatically, and Yarn resources are used on demand, which improves the resource utilization of the cluster. The specific submission flow of an inference job is shown in fig. 5:
after a Flink inference task is submitted, the Client uploads Flink's Jar package and configuration to HDFS, then submits the inference task to Yarn's Resource Manager. The Resource Manager allocates container resources and notifies the corresponding Node Manager to start the Application Master. Once started, the Application Master loads Flink's Jar package and configuration, builds the environment, and starts the Job Manager. The Application Master then applies to the Resource Manager for resources to start Task Managers; after the Resource Manager allocates container resources, the Application Master notifies the Node Manager of the node holding the resources to start a Task Manager. That Node Manager loads Flink's Jar package and configuration, builds the environment, and starts the Task Manager. After starting, the Task Manager sends heartbeat packets to the Job Manager and waits for the Job Manager to assign it inference tasks. This is the concrete implementation process of the inference program on Flink.
In conclusion, in the method of this embodiment: distributed storage based on a distributed message queue system ensures high fault tolerance and high reliability; data preprocessing solves problems such as out-of-order and missing data that can occur in the data streams; online learning with supercomputing-based high-performance computation realizes real-time updating and iteration of the multi-channel models and, combined with the proposed algorithm, effectively solves the concept-drift problem; and online inference on the constructed Flink distributed stream processing system guarantees fast response, fast inference, and high reliability for the data.
Example two
The embodiment discloses an intelligent computing system for fusing parallel and distributed ocean observation data;
as shown in fig. 7, the intelligent computing system for merging parallel and distributed marine observation data includes a distributed storage module, a data preprocessing module, an online learning module, and an online reasoning module;
a distributed storage module configured to: acquiring ocean observation data flow of each channel in real time, and storing the ocean observation data flow into a distributed cluster Kafka;
a data preprocessing module configured to: carry out out-of-order handling, deduplication, and missing-data preprocessing on the stored data stream;
an online learning module configured to: performing multi-channel online learning model training by adopting a mode of performing supercomputing MPI parallel training model based on the preprocessed marine observation data stream to obtain the latest marine observation data intelligent calculation model of each channel;
an online reasoning module configured to: based on a Flink distributed stream processing system, the latest ocean observation data intelligent calculation model corresponding to each channel is selected for ocean observation data which continuously flow into each channel, and real-time reasoning and prediction are carried out.
EXAMPLE III
An object of the present embodiment is to provide a computer-readable storage medium.
A computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps in the intelligent calculation method for fused parallel and distributed marine observation data according to embodiment 1 of the present disclosure.
Example four
An object of the present embodiment is to provide an electronic device.
The electronic device comprises a memory, a processor and a program which is stored on the memory and can run on the processor, wherein the processor executes the program to realize the steps of the intelligent calculation method for the fused parallel and distributed ocean observation data according to the embodiment 1 of the disclosure.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. The method for intelligently calculating the ocean observation data integrating the parallel and the distributed modes is characterized by comprising the following steps of:
acquiring ocean observation data flow of each channel in real time, and storing the ocean observation data flow into a distributed cluster Kafka;
carrying out out-of-order handling, deduplication, and missing-data preprocessing on the stored data stream;
performing multi-channel online learning model training by adopting a mode of performing supercomputing MPI parallel training model based on the preprocessed marine observation data stream to obtain the latest marine observation data intelligent calculation model of each channel;
based on a Flink distributed stream processing system, the latest ocean observation data intelligent calculation model corresponding to each channel is selected for ocean observation data which continuously flow into each channel, and real-time reasoning and prediction are carried out.
2. The intelligent calculation method for merged parallel and distributed marine observation data according to claim 1, wherein the distributed cluster Kafka stores the data stream of each channel into a respective topic, each topic is divided into a plurality of partitions arranged in order, and each topic is distributed on a plurality of servers in the cluster.
3. The intelligent calculation method for merged parallel and distributed marine observation data as claimed in claim 2, wherein in the distributed cluster Kafka, when data with the same primary key value are submitted, the servers in the cluster persist only one piece of marine observation data for each primary key value, ensuring that the data is processed only once during cluster processing.
4. The intelligent calculation method for merged parallel and distributed marine observation data as claimed in claim 1, wherein an evolutionary convolutional neural network architecture is adopted to realize an online learning method, so that the model adaptively evolves for a dynamically evolving data stream, and simultaneously the structure of the model is dynamically adjusted from shallow to deep along with the inflow of data, and the sub-structure of the model is dynamically re-weighted from the data stream in a sequential or online learning manner, so as to learn a neural network with a deep structure and a complex nonlinear function in an online environment.
5. The method for intelligent computation of fused parallel and distributed marine observation data according to claim 1, wherein the supercomputing MPI parallel training model creates a respective data connection for each topic of the distributed cluster Kafka, provides training data for the online learning model of each channel through the data connection, and submits the training data to the supercomputer platform through MPI jobs, and a process of a computation node of the supercomputer platform is responsible for the online learning model training of one channel.
6. The intelligent calculation method for merged parallel and distributed marine observation data according to claim 1, further comprising performing version control on the model of each channel, transmitting the trained model to a model version library in real time, and adjusting the storage granularity of the model according to specific conditions to correspond to the storage step length.
7. The intelligent calculation method for merged parallel and distributed marine observation data according to claim 1, wherein the Flink distributed stream processing system is used for building a distributed cluster, and when one or more nodes are down, the inference program does not need to be restarted to acquire the stream data from the beginning for inference, but the system rolls back to the nearest state for inference prediction, and the feedback of the overall inference program and the prediction result is not influenced.
8. The ocean observation data intelligent computing system integrating the parallel and the distributed modes is characterized by comprising a distributed storage module, a data preprocessing module, an online learning module and an online reasoning module;
a distributed storage module configured to: acquiring ocean observation data flow of each channel in real time, and storing the ocean observation data flow into a distributed cluster Kafka;
a data preprocessing module configured to: carry out out-of-order handling, deduplication, and missing-data preprocessing on the stored data stream;
an online learning module configured to: perform multi-channel online learning model training based on the preprocessed marine observation data stream by adopting supercomputing MPI (Message Passing Interface) parallel model training to obtain the latest marine observation data intelligent calculation model of each channel;
an online reasoning module configured to: based on a Flink distributed stream processing system, the latest ocean observation data intelligent calculation model corresponding to each channel is selected for ocean observation data which continuously flow into each channel, and real-time reasoning and prediction are carried out.
9. Computer readable storage medium, on which a program is stored, which program, when being executed by a processor, carries out the steps of the method for intelligent computation of fused parallel and distributed marine observation data according to any one of claims 1 to 7.
10. Electronic device comprising a memory, a processor and a program stored on the memory and executable on the processor, wherein the processor implements the steps of the method for intelligent computation of fused parallel and distributed marine observation data according to any one of claims 1 to 7 when executing the program.
CN202211230749.6A 2022-10-10 2022-10-10 Parallel and distributed ocean observation data fusion intelligent calculation method and system Pending CN115293662A (en)


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115879569A (en) * 2023-03-08 2023-03-31 齐鲁工业大学(山东省科学院) IoT observation data online learning method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113783931A (en) * 2021-08-02 2021-12-10 中企云链(北京)金融信息服务有限公司 Internet of things data aggregation and analysis method
CN114385601A (en) * 2022-03-24 2022-04-22 山东省计算中心(国家超级计算济南中心) Cloud-edge collaborative high-throughput ocean data intelligent processing method and system based on super computation


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
毛亚青等: "基于Flink的海量医学图像检索系统设计与实现", 《计算机测量与控制》 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20221104