CN115293662A - Parallel and distributed ocean observation data fusion intelligent calculation method and system - Google Patents


Info

Publication number
CN115293662A
Authority
CN
China
Prior art keywords
data
distributed
observation data
model
channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211230749.6A
Other languages
Chinese (zh)
Inventor
张兆虔
李响
赵志刚
王春晓
耿丽婷
郭莹
吴晓明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Computer Science Center National Super Computing Center in Jinan
Original Assignee
Shandong Computer Science Center National Super Computing Center in Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Computer Science Center National Super Computing Center in Jinan filed Critical Shandong Computer Science Center National Super Computing Center in Jinan
Priority to CN202211230749.6A
Publication of CN115293662A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/103 Workflow collaboration or project management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Strategic Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Tourism & Hospitality (AREA)
  • Educational Administration (AREA)
  • Marketing (AREA)
  • Data Mining & Analysis (AREA)
  • Development Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Operations Research (AREA)
  • Game Theory and Decision Science (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Primary Health Care (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a method and a system for intelligent calculation of ocean observation data that fuses parallel and distributed processing, relating to the field of intelligent calculation over ocean observation time-series data streams. The ocean observation data stream of each channel is acquired in real time and stored in a distributed cluster. The stored streams are preprocessed to handle out-of-order, duplicate and missing data. Based on the preprocessed ocean observation data streams, multi-channel online learning models are trained by supercomputing MPI (Message Passing Interface) parallel training, yielding the latest intelligent calculation model of ocean observation data for each channel. Based on a Flink distributed stream processing system, the latest intelligent calculation model corresponding to each channel is selected for the ocean observation data continuously flowing into that channel, and real-time inference and prediction are carried out. The method suits multi-channel, multi-task application scenarios, effectively supports online learning and inference over streaming data and the management of high-throughput sensor data, and realizes rapid iterative upgrading of the multi-channel calculation models and real-time inference on the data.

Description

Parallel and distributed ocean observation data fusion intelligent calculation method and system
Technical Field
The invention belongs to the field of intelligent calculation over ocean observation time-series data streams, and particularly relates to a parallel and distributed method and system for intelligent calculation of ocean observation data.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
With the rapid development of marine observation networks, large numbers of buoy sensors generate hydro-ecological and physical-environment observation time-series data in streaming form. Calculation and analysis of ocean observation big data must be automatic, concurrent, efficient and accurate; developing a computing system and method that supports real-time automated processing and analysis of observed data streams is therefore a key core technology to be mastered. However, the high-throughput, multi-channel nature of the ocean observation data stream itself poses the following challenges for building such a computing system:
Challenge 1: challenges presented by multiple channels, high throughput, and massive data streams.
Data streams generated by many classes of marine observation sensors are transmitted over their respective channels; ensuring ordered caching and classified storage of this massive data, and thereby achieving high throughput and high reliability, is the primary challenge facing the system. To reduce the system's overall processing latency, high-throughput handling of the continuously generated streaming data must be satisfied first. The commonly used approach today is batch processing: the continuously generated data is processed in small batches over a fixed time interval or window, i.e., accumulated and then computed at once. This introduces latency, increases overall processing time, and fails the real-time requirement; it is an inherent defect of batch processing that must be overcome, because real-time behaviour is a precondition of ocean observation data processing.
In addition, for streaming data generated by ocean sensors in real time, repeated transmission and processing of the same data no longer meets the scenario's requirements; guaranteeing that each datum is processed only once while preserving the timeliness of data processing is another challenge. The high-reliability problem also covers abnormal situations such as out-of-order and missing data during stream processing, described concretely as follows. Stream processing starts from sensor data that is generated, transmitted and then delivered to the processing unit, which takes a certain process and time. Ideally, the data flowing to the processing unit arrives in the order in which it was generated, as shown in fig. 1 (a), but disorder caused by the network, distributed storage and other factors cannot be ruled out. Disorder means that the sequence in which the processor receives data is not strictly ordered by the data's event time (Event Time). Ideally data arrives and is processed in one order, but in practice the factors mentioned above can cause data to arrive out of order or late. Once disorder occurs, if window (Window) operations are decided by event time alone, the system cannot determine whether all data is in place, yet it cannot wait indefinitely. For time-series data, the out-of-order problem must be solved first to guarantee the accuracy of inference results.
Challenge 2: challenges presented by data "concept drift" (caused by the evolution of the data stream over time).
A traditional static model trained on batch data quickly becomes outdated; it cannot handle dynamic concept drift or produce accurate inference. Existing classical machine learning models are essentially trained on batch data using supervised or unsupervised learning, i.e., they usually operate on static/batch datasets, and in that setting they infer and predict well on batch historical data. However, when models built by batch learning are applied to stream data, their inference quality degrades over time as concept drift occurs. In a high-throughput ocean stream-data scenario, a system is needed that supports online learning of models over stream data, ensuring the model continuously adapts to the changing stream and thereby solving the concept-drift problem.
Challenge 3: challenges presented by concurrent training of a large number of models and concurrent data-stream inference.
Aiming at model training in this scenario: a different model must be trained for each observation channel's data, which raises the demands on resource allocation and memory, and strong computing power is needed for real-time parallel computation over massive data. Training this many models is better served by supercomputing (HPC) MPI parallel training jobs, whereas big-data computing frameworks such as Spark and Flink were not originally designed for deep learning model training; their computation mechanism only suits training a single large model in a data-parallel fashion, not concurrent training of a large number of models at once.
Aiming at data-stream inference in this scenario: data is acquired and inferred in stream form. If the computation mode of model training were carried over (inferring multi-channel stream data as supercomputing parallel jobs), then under a single-node abnormality (such as a physical fault on one node or an abnormal data-stream transmission), the rigid nature of supercomputing MPI parallel jobs does not support switching to other computing resources; the job cannot continue, and inference and prediction over the whole ocean dataset is affected. The system must therefore handle abnormal conditions of various causes while guaranteeing highly concurrent multi-channel inference, improving its reliability and stability, which is itself a challenge. On the premise of high reliability, data streams also need high-speed processing, fast response and real-time prediction. Although today's mature Spark processing framework greatly improves the timeliness of data processing compared with the traditional MapReduce distributed framework, for our application scenario, where every datum must be processed and fed back promptly, Spark's micro-batch design cannot achieve truly pure real-time online inference.
Based on the above analysis, an intelligent calculation method for ocean observation data urgently needs to be researched.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a method and a system for intelligent calculation of ocean observation data that fuses parallel and distributed processing. A streaming big-data framework and a deep learning framework are bound and fused, while a supercomputing platform performs deep learning model training that follows the Online Learning paradigm. Exploiting the advantages of data partitioning, a window selection mechanism, a distributed architecture, iterative computation characteristics and high-performance computing, online learning and inference over ocean observation data are realized, meeting the needs of online learning and real-time online inference over multi-channel data. Problems of the massive data stream, including early-stage data loss and disorder, high throughput, high reliability, low latency and concept drift, are solved, realizing real-time, intelligent processing of high-throughput data streams by the system; the approach suits multi-channel, multi-task application scenarios and is convenient to manage.
To achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
The invention provides, in a first aspect, a method for intelligent calculation of ocean observation data fusing parallel and distributed modes;
the method for intelligent calculation of fused parallel and distributed ocean observation data comprises the following steps:
acquiring the ocean observation data stream of each channel in real time, and storing it into a distributed cluster Kafka;
preprocessing the stored data stream for out-of-order, duplicate and missing data;
training multi-channel online learning models via supercomputing MPI parallel training on the preprocessed ocean observation data stream, to obtain the latest intelligent calculation model of ocean observation data for each channel;
based on a Flink distributed stream processing system, selecting the latest intelligent calculation model corresponding to each channel for the ocean observation data continuously flowing into that channel, and carrying out real-time inference and prediction.
Further, the distributed cluster Kafka stores the data stream of each channel in its own topic (Topic), divides each topic into several ordered partitions (Partition), and distributes them across multiple servers within the cluster.
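The topic/partition layout described above can be sketched in a few lines. The partitioner below is a hypothetical stand-in (Kafka's actual default partitioner differs): a stable CRC32 hash of the record key picks the partition, so records with the same key always land in the same ordered partition.

```python
# Hypothetical sketch of Kafka-style topic partitioning; names are illustrative.
from collections import defaultdict
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # Stable hash so the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def append_to_topic(topic: dict, key: str, value, num_partitions: int = 4) -> int:
    p = partition_for(key, num_partitions)
    topic[p].append((key, value))   # each partition is an ordered queue
    return p

topic = defaultdict(list)
for i in range(8):
    append_to_topic(topic, key=f"sensor-{i % 2}", value=i)

# All records of one key share a single partition, in arrival order.
same_partition = {partition_for("sensor-0", 4) for _ in range(3)}
```

Because ordering is only guaranteed within a partition, keying by channel (as here) preserves per-channel order while spreading a huge topic across brokers.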
Further, in the distributed cluster Kafka, when data with the same primary key value is submitted, the servers in the cluster persist only one piece of ocean observation data per primary key value, ensuring that each datum is processed only once during cluster processing.
Furthermore, an evolving convolutional neural network architecture realizes the online learning method, so that the model adapts to the dynamically evolving data stream: its structure is adjusted dynamically from shallow to deep as data flows in, its substructures are dynamically re-weighted from the data stream in a sequential or online learning fashion, and a neural network with a deep structure and complex nonlinear functions is learned in an online environment.
Furthermore, the supercomputing MPI parallel training creates a separate data connection for each topic of the distributed cluster Kafka, feeds training data through that connection to each channel's online learning model, and submits the training as an MPI job to the supercomputing platform, where one process on one computing node is responsible for training one channel's online learning model.
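A minimal sketch of the rank-to-channel assignment this describes, with the MPI runtime left out so the example is self-contained (with mpi4py, the rank would come from `MPI.COMM_WORLD.Get_rank()`). The per-channel "model" here is just a running mean; all names and data are illustrative assumptions.

```python
# One MPI rank trains the online model for exactly one channel.
def channel_for_rank(rank: int, channels: list) -> str:
    if rank >= len(channels):
        raise ValueError("more ranks than channels")
    return channels[rank]

def train_channel(rank: int, channels: list, batches: dict):
    """Each rank consumes only its own channel's topic and updates
    that channel's model incrementally (here: a running mean)."""
    ch = channel_for_rank(rank, channels)
    model, n = 0.0, 0
    for x in batches[ch]:
        n += 1
        model += (x - model) / n   # incremental mean update
    return ch, model

channels = ["temperature", "salinity", "chlorophyll"]
batches = {"temperature": [10.0, 12.0],
           "salinity": [35.0],
           "chlorophyll": [0.5, 0.7, 0.9]}
results = dict(train_channel(r, channels, batches) for r in range(3))
```

Under a real MPI job the three calls would run as three concurrent processes, one per computing-node process, rather than in a loop.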
And further, version control is performed on each channel's model: trained models are pushed to a model version library in real time, and the model storage granularity is adjusted to the specific situation, giving a corresponding storage step length.
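One possible shape for such a version library with an adjustable storage step; the class and its interface are illustrative assumptions, not the system's actual design.

```python
# Per-channel model version store: only every step-th trained model is
# persisted, trading storage volume against rollback granularity.
class ModelVersionStore:
    def __init__(self, step: int = 1):
        self.step = step      # storage granularity (step length)
        self.counter = {}     # channel -> number of models submitted
        self.versions = {}    # channel -> list of (version_no, model)

    def submit(self, channel: str, model) -> bool:
        n = self.counter.get(channel, 0) + 1
        self.counter[channel] = n
        if n % self.step == 0:            # persist only every step-th model
            self.versions.setdefault(channel, []).append((n, model))
            return True
        return False

    def latest(self, channel: str):
        return self.versions[channel][-1]

store = ModelVersionStore(step=3)
for i in range(1, 8):
    store.submit("ch0", f"model-{i}")
```

With `step=3`, seven submitted models leave versions 3 and 6 in the library, and inference always loads the latest persisted one.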
Furthermore, the Flink distributed stream processing system is a purpose-built distributed cluster. With its distributed architecture, when one or several nodes go down, the inference program need not restart and re-consume the stream from the beginning for inference; it rolls back to the most recent state and resumes inference and prediction, so the overall inference program and the feedback of prediction results are unaffected.
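The rollback behaviour can be illustrated with a toy checkpointing loop (a drastic simplification of Flink's checkpoint/restore mechanism; all names are assumptions): the loop periodically checkpoints its stream offset and state, and after a simulated failure resumes from the nearest checkpoint instead of re-reading the stream from the beginning.

```python
# Toy illustration of checkpoint-and-rollback stream processing.
def run_with_checkpoints(stream, checkpoint_every=3, fail_at=None):
    offset, total = 0, 0
    checkpoint = (0, 0)                  # (offset, state) last persisted
    while offset < len(stream):
        if fail_at is not None and offset == fail_at:
            offset, total = checkpoint   # roll back to the nearest state
            fail_at = None               # recover and continue
            continue
        total += stream[offset]
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, total)
    return total

clean = run_with_checkpoints([1, 2, 3, 4, 5])
recovered = run_with_checkpoints([1, 2, 3, 4, 5], fail_at=4)
```

Because the state is restored together with the offset, work done after the last checkpoint is redone once but never double-counted, so the failed run produces the same result as the clean one.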
A second aspect of the present invention provides an intelligent computing system for fused parallel and distributed ocean observation data.
The intelligent computing system for fused parallel and distributed ocean observation data comprises a distributed storage module, a data preprocessing module, an online learning module and an online inference module:
a distributed storage module configured to: acquire the ocean observation data stream of each channel in real time, and store it into a distributed cluster Kafka;
a data preprocessing module configured to: preprocess the stored data stream for out-of-order, duplicate and missing data;
an online learning module configured to: train multi-channel online learning models via supercomputing MPI parallel training on the preprocessed ocean observation data stream, to obtain the latest intelligent calculation model of ocean observation data for each channel;
an online inference module configured to: based on a Flink distributed stream processing system, select the latest intelligent calculation model corresponding to each channel for the ocean observation data continuously flowing into that channel, and carry out real-time inference and prediction.
A third aspect of the present invention provides a computer readable storage medium, on which a program is stored, which when executed by a processor, implements the steps in the method for intelligent computation of fused parallel and distributed marine observation data according to the first aspect of the present invention.
A fourth aspect of the present invention provides an electronic device, including a memory, a processor, and a program stored in the memory and executable on the processor, where the processor executes the program to implement the steps in the method for intelligently computing the merged parallel and distributed marine observation data according to the first aspect of the present invention.
The above one or more technical solutions have the following beneficial effects:
the invention effectively solves the problems of transmission and storage of high-flux data streams generated by multiple sensors in multiple classes, and ensures ordered caching and classified storage of mass data so as to solve the problems of high throughput and high reliability in the data transmission and storage stage; each data in the data stream is only processed once, the real-time performance of the whole data processing is considered, and abnormal conditions such as data disorder and missing in the data stream processing process are solved.
At the system level, deploying a distributed message queue guarantees high reliability, so ocean observation data streams are delivered continuously and accurately to the corresponding computing nodes throughout production and deployment. The approach suits multi-channel, multi-task application scenarios: powerful supercomputer resources provide real-time parallel computation over massive data, and HPC (high-performance computing) MPI (Message Passing Interface) jobs meet the distributed training needs of the multi-channel deep learning models under the Online Learning paradigm. Building a Flink distributed cluster realizes distributed observation-data stream processing and online inference, ensuring high fault tolerance of the system, effectively supporting online learning and inference over stream data and the management of high-throughput sensor data, and achieving fast data acquisition, rapid iterative upgrading of the multi-channel calculation models and real-time inference on the data.
Compared with other algorithms, the calculation is faster and space utilization higher: the online learner in the processing pipeline does not need to write data to disk, and the structure and complexity of the model are adjusted dynamically from shallow to deep as the evolving data stream keeps arriving. The model maintains stable performance, avoiding over-fitting or under-fitting, and model accuracy is well guaranteed while real-time behaviour is preserved.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain, not to limit, the invention.
FIG. 1 (a) is a diagram illustrating an example of sequential processing of an ideal case data stream according to an embodiment of the present invention;
FIG. 1 (b) is a diagram illustrating an example of an out-of-order handling mechanism for actual scene data streams according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a filling effect of missing values of time series data according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a single-channel online learning mechanism according to an embodiment of the present invention;
FIG. 4 is a diagram of an architecture of an online training and online reasoning system according to an embodiment of the present invention;
fig. 5 is a diagram of the internal operation structure of the YARN-based Flink system according to an embodiment of the present invention;
FIG. 6 is a flowchart of a method provided by an embodiment of the invention;
fig. 7 is a system structure diagram provided in the embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention; unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well unless the context clearly indicates otherwise, and the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components and/or combinations thereof.
Noun explanation
Online Learning: a model-training paradigm that rapidly adjusts and iteratively updates a model in real time as the data stream changes, so the model reflects online changes in real time, improving prediction accuracy and guaranteeing inference timeliness;
Kafka: a message queue based on the publish/subscribe pattern;
MPI: Message Passing Interface, a message-passing standard supporting high-performance computing;
Topic: a subject (message category);
Partition: a partition of a topic;
Broker: a server within a cluster;
Replica: a copy of a partition;
Exactly Once: the guarantee that each datum is processed only once;
SeqNumber: a sequence number;
Flink: a distributed processing engine for batch/stream data;
Watermark: a "water level line", a delay-trigger mechanism that can meet specific requirements;
Scala: a multi-paradigm programming language similar to the Java programming language;
Redis: a key-value storage database, one of the non-relational databases;
Set: a data structure holding distinct elements;
Bloom Filter: a space-efficient probabilistic data structure that can quickly determine whether an element belongs to a set.
Example one
The embodiment discloses an intelligent calculation method for fused parallel and distributed ocean observation data.
As shown in fig. 6, the method for intelligent calculation of fused parallel and distributed ocean observation data includes:
step S1: acquiring the ocean observation data stream of each channel in real time, and storing it into a distributed cluster Kafka;
for data storage and caching: the data streams generated by the various observation sensors are transmitted over their respective channels. To guarantee ordered caching and classified storage of high-throughput data, a distributed cluster Kafka produces each channel's data stream into its own topic (Topic). To avoid any topic growing too large under high-throughput data, each topic is divided into several partitions (Partition), each partition being an ordered queue, so that a huge topic is spread across the brokers (Broker) of several servers in the cluster. The whole cluster can thus adapt to data of any size, achieves high scalability, and lowers the cost of use.
If, during one transmission of streaming data, a broker in the cluster goes down or suffers another abnormal condition, data would be lost. To improve the system's fault tolerance, several replicas (Replica) are therefore kept for each topic; even if one broker fails, the cluster can still consume the corresponding data through the replicas, guaranteeing the high reliability of the distributed cluster.
To guarantee Exactly Once semantics when sensor data is produced and consumed, the specific process is as follows:
firstly, the idempotence parameter of the producer needs to be set to True. Idempotence means that however much duplicate data arrives, only one copy is processed. When idempotence is initialized, the producer is assigned a primary key value PID, and each record sent to the same partition carries a queue sequence number SeqNumber. The broker caches the triple of PID, Partition and SeqNumber, and therefore persists at most one record for any given sequence number of a given queue of a given partition.
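The broker-side idempotence check described above can be sketched as follows; this is a simplified model of the Kafka mechanism, not its implementation, and the record values are made up for illustration.

```python
class Broker:
    """Toy broker that caches, per (PID, partition), the highest persisted
    SeqNumber and drops any resend with an already-seen sequence number."""
    def __init__(self):
        self.last_seq = {}   # (pid, partition) -> highest seq persisted
        self.log = []        # persisted records, in order

    def append(self, pid, partition, seq, record):
        key = (pid, partition)
        if self.last_seq.get(key, -1) >= seq:
            return False                     # duplicate resend, dropped
        self.last_seq[key] = seq
        self.log.append(record)
        return True

b = Broker()
ok1 = b.append(pid=7, partition=0, seq=0, record="t=10.1")
ok2 = b.append(pid=7, partition=0, seq=0, record="t=10.1")  # producer retry
ok3 = b.append(pid=7, partition=0, seq=1, record="t=11.3")
```

The retry with the same sequence number is rejected, so the log holds each record exactly once even though the producer sent it twice.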
Step S2: preprocessing the stored data stream for out-of-order, duplicate and missing data;
out-of-order and duplicate data are handled based on a window selection mechanism and a delay-trigger mechanism.
The delay-trigger mechanism of this embodiment draws on the Watermark mechanism. Concretely, the delay time of the mechanism is set to t. Each time data arrives, the processing program tracks the largest EventTime seen so far in the ocean observation data, denoted maxEventTime (EventTime being the time at which the ocean observation sensor generated the datum), and assumes that all data with EventTime smaller than maxEventTime - t has already arrived; whenever a window's cutoff time is no later than maxEventTime - t, that window is triggered to execute.
The Watermark is set according to the specific scenario and requirements. Taking fig. 1 (b) as an example, the delay time t of the delay-trigger mechanism is set to 2 s, so the Watermark corresponding to an event with timestamp 10:00 trails that timestamp by 2 s.
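A simplified, self-contained model of the delay-trigger rule (watermark = maxEventTime - t; a window fires once the watermark reaches its cutoff). Times are plain integers and all names are illustrative.

```python
# Watermark-driven tumbling windows over out-of-order events.
def fired_windows(events, delay, window_size):
    """events: iterable of (event_time, value); returns the end times of
    windows in the order they were triggered."""
    max_event_time = float("-inf")
    pending = {}      # window end time -> buffered values
    fired = []
    for t, v in events:
        end = (t // window_size + 1) * window_size   # tumbling window cutoff
        pending.setdefault(end, []).append(v)
        max_event_time = max(max_event_time, t)
        watermark = max_event_time - delay
        for w_end in sorted(pending):                # fire completed windows
            if watermark >= w_end:
                fired.append(w_end)
                del pending[w_end]
    return fired

# The late event (3, "late") still lands in window [0, 5), which only
# fires after an event with time >= 7 pushes the watermark to 5.
out = fired_windows([(1, "a"), (4, "b"), (3, "late"), (7, "c")], delay=2, window_size=5)
```

The delay t is the tolerated out-of-orderness: larger t accepts later data but adds latency before each window fires.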
For the data-duplication problem: during data acquisition, caching and transmission, factors such as the network may cause duplicates, so this embodiment uses the timestamp of each datum as its key. For massive data, the memory load is limited, so deduplication with a set data structure like Scala's set or a Redis set is clearly infeasible; a Bloom Filter is used instead. A Bloom Filter can check whether an element is in a set with query time and space efficiency much better than other approaches, meeting the real-time deduplication needs of the ocean observation data stream in this scenario.
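A minimal Bloom filter sufficient for timestamp-keyed deduplication. The hash scheme (salted SHA-256) and the parameters m and k are illustrative assumptions, not tuned values; a production filter would size m and k from the expected element count and target false-positive rate.

```python
import hashlib

class BloomFilter:
    def __init__(self, m: int = 1 << 16, k: int = 4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)   # m-bit array, all zero

    def _positions(self, key: str):
        # Derive k hash positions from SHA-256 with different salts.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

bf = BloomFilter()
seen_before = "2022-10-06T10:00:07" in bf   # hypothetical timestamp key
bf.add("2022-10-06T10:00:07")
seen_after = "2022-10-06T10:00:07" in bf
```

Membership answers are probabilistic: "no" is always correct, while "yes" may rarely be a false positive, which for deduplication means an occasional datum dropped rather than a duplicate passed through.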
For the problem of missing data, the common remedies are filling and deletion. Abnormal data can simply be deleted in scenarios with low stability requirements, but scenarios with high stability requirements call for filling in the abnormal values.
To avoid information loss, this embodiment fills the missing data. Common filling methods are mainly statistical, including neighbor filling, feature-value filling, and linear filling. On top of the machine-learning pipeline, the system fills missing values by linear interpolation; the specific effect is shown in fig. 2. This is a prior-based method that requires the missing data and its adjacent points to satisfy an approximately linear fitting relation, and it effectively solves the missing-value problem of ocean observation time-series data.
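The linear interpolation fill can be sketched with numpy's `np.interp`, which fills each missing point linearly from its nearest known neighbors; the variable names and sample series are illustrative.

```python
import numpy as np

def fill_linear(times, values):
    """Fill NaN gaps in a time series by linear interpolation between
    the nearest known neighbors, as described in the text above."""
    values = np.asarray(values, dtype=float)
    times = np.asarray(times, dtype=float)
    known = ~np.isnan(values)
    # np.interp linearly interpolates the missing points from the known ones
    return np.interp(times, times[known], values[known])

# a short temperature-like series with two missing readings
t = [0, 1, 2, 3, 4]
v = [10.0, float("nan"), 14.0, float("nan"), 18.0]
filled = fill_linear(t, v)   # -> [10., 12., 14., 16., 18.]
```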
Step S3: based on the preprocessed ocean observation data streams, perform multi-channel online learning model training by means of supercomputing MPI parallel model training, obtaining the latest intelligent ocean-observation-data calculation model for each channel;
the data stream may evolve over time; this behavior is referred to as concept drift. A traditional static model trained on batch data becomes outdated quickly, cannot handle dynamic concept drift, and therefore cannot predict accurately.
Taking the single-channel model of the application scenario as an example, as shown in fig. 3: first, whether to initialize the model is chosen according to the specific service scenario and requirements; the model is then updated incrementally by continuous training on the continuous data stream, and the current latest model is used to perform real-time inference on unknown data. The decision boundary of the online learning algorithm can be adjusted promptly according to the received feedback, which keeps predictions grounded in reality and improves the opportunity to make more accurate decisions in subsequent rounds.
Compared with the traditional batch learning mode, every input sample in online learning also participates in model training and the model is updated iteratively, so the model changes with the environment and dynamically adapts to continuously changing environment variables. Because online training follows the curve of the error surface within each epoch, it can stably use a higher learning rate, and therefore reaches convergence with fewer passes over the training data than batch training. The aim of online learning is that at every step the learner learns and updates the model so as to guarantee the best possible predictions on future data, and the model is applied to the real-time sensor data stream without manual adjustment of model hyper-parameters.
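The predict-then-update cycle described above can be sketched with a minimal linear model trained by stochastic gradient descent. The model form and learning rate are illustrative assumptions; the embodiment's actual model is the evolving convolutional network of the next subsection.

```python
# Minimal sketch of the online learning loop: predict on the newly arrived
# sample, observe the revealed target, then update the model immediately.
def online_learn(stream, lr=0.1):
    w, b = 0.0, 0.0                     # linear model, illustrative only
    losses = []
    for x, y in stream:
        y_hat = w * x + b               # 1) predict with the current model
        err = y_hat - y                 # 2) the true target y is revealed
        losses.append(err * err)        #    instantaneous squared loss
        w -= lr * 2 * err * x           # 3) gradient step; the sample is
        b -= lr * 2 * err               #    never revisited afterwards
    return (w, b), losses

# the model improves as the stream flows, with no batch retraining
stream = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)] * 50
(w, b), losses = online_learn(stream)   # losses shrink toward zero
```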
(1) Online learning algorithm with adaptively expanding representation capacity
This method implements online learning with an evolving convolutional neural network architecture, so that the model adapts to the evolution of a dynamically evolving data stream: as data flows in, the structure of the model is dynamically adjusted from shallow to deep, and the substructures of the model are dynamically re-weighted from the data stream in a sequential, online learning manner.
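As a concrete illustration of the shallow-to-deep adjustment, the following sketch grows a model's block count only when its recent loss stops improving. The plateau heuristic and its thresholds are illustrative assumptions; the embodiment states only that capacity grows as data flows in.

```python
class GrowingModel:
    """Capacity-growth sketch: start with one block, append a block when the
    loss averaged over a recent window stops improving over the previous one."""
    def __init__(self, max_blocks=4, window=10, tol=0.01):  # illustrative values
        self.num_blocks = 1        # start shallow for fast initial convergence
        self.max_blocks = max_blocks
        self.window = window
        self.tol = tol
        self.losses = []

    def record_loss(self, loss):
        self.losses.append(loss)
        w, n = self.window, len(self.losses)
        if n >= 2 * w and self.num_blocks < self.max_blocks:
            recent = sum(self.losses[-w:]) / w
            previous = sum(self.losses[-2 * w:-w]) / w
            if previous - recent < self.tol:   # loss has plateaued
                self.num_blocks += 1           # deepen the model
                self.losses.clear()            # restart the plateau test

m = GrowingModel()
for loss in [0.5] * 20:      # a flat loss curve triggers one growth step
    m.record_loss(loss)
```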
For the ocean observation data stream, one-dimensional convolution is used to learn time-series features of the observation data, realizing the learning and modeling of ocean data sequences. Specifically:
each model is provided with n convolution blocks, n representing the maximum representation capacity of the model. Traditional CNN prediction depends only on the last layer $h_n$; the final prediction of this scheme is instead based on a weighted combination of the predictions of every layer $\{h_1, h_2, \dots, h_n\}$, as follows:

$$P_i = \mathrm{softmax}(h_i \Theta_i), \quad i = 1, \dots, n$$

where $P_i$ is the prediction derived in the i-th CNN block, $h_i$ is the flattened feature map learned in that convolution module, and $\Theta_i$ are the parameters of the shallow feed-forward network assigned to each CNN block.
To realize dynamic adjustment of each layer, an attention network is added to learn attention weights for the CNN blocks, so as to determine the relationship between the layer blocks and thereby realize the re-weighting operation:

$$\alpha_i = \mathrm{Attention}(h_i), \qquad \sum_{i=1}^{n} \alpha_i = 1$$

where $\mathrm{Attention}(\cdot)$ is a shallow network that calculates the weight of each CNN block; it is intended to measure the relationship between the substructures, and a normalization operation is performed once more in every iteration.
The prediction of the final model is obtained by fusing the weighted predictions of the blocks:

$$P(x) = \sum_{i=1}^{n} \alpha_i P_i$$
For the marine observation data stream, the loss function of the model at time step t is:

$$\mathcal{L}\big(P(x_t), y_t\big) = \sum_{i=1}^{n} \alpha_i\, \mathcal{L}\big(P_i(x_t), y_t\big)$$
At time step t, the parameters of the i-th block of the model are updated as:

$$\Theta_i^{(t+1)} = \Theta_i^{(t)} - \eta\, \alpha_i \nabla_{\Theta_i} \mathcal{L}\big(P_i(x_t), y_t\big)$$

Further, the gradient of the final prediction with respect to each block's parameters is calculated according to the above equation.
The function of the model is represented as:

$$\hat{Y} = F(X)$$

where $X = \{x_1, x_2, \dots, x_T\}$ represents the input sample sequence of the model, and $\hat{Y} = \{\hat{y}_1, \hat{y}_2, \dots, \hat{y}_T\}$ represents the predicted sequence derived by the model for the input sequence $X$.
The online deep learning model predicts round by round: in each round, a sample is input into the algorithm to obtain the prediction $\hat{y}_t$ for the next stage; after the prediction is finished, the target value $y_t$ is revealed and the instantaneous loss is calculated:

$$\ell_t = \mathcal{L}(\hat{y}_t, y_t)$$
This loss reflects the degree of prediction deviation. At the end of each round, the algorithm adjusts itself for the next round according to the instantaneous loss, without manual intervention to tune parameters, which improves efficiency while strongly guaranteeing accuracy.
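The block-weighted fusion and per-round re-weighting described above can be sketched numerically as follows. The exponential discounting rule is an assumption in the spirit of hedge-style online learning; the embodiment itself specifies only that a shallow attention network re-weights the blocks and re-normalizes each iteration.

```python
import numpy as np

def fuse_predict(block_preds, alpha):
    # P(x) = sum_i alpha_i * P_i : fuse the per-block predictions
    return float(np.dot(alpha, block_preds))

def reweight(alpha, block_losses, beta=0.5):
    # discount each block in proportion to its instantaneous loss,
    # then re-normalize so the weights sum to one (beta is illustrative)
    alpha = alpha * np.power(beta, block_losses)
    return alpha / alpha.sum()

alpha = np.ones(3) / 3                 # three blocks, uniform initial weights
preds = np.array([1.0, 2.0, 3.0])      # per-block predictions P_i
y = 3.0                                # revealed target value
p = fuse_predict(preds, alpha)         # fused prediction = 2.0
losses = (preds - y) ** 2              # per-block instantaneous losses
alpha = reweight(alpha, losses)        # the best block gains weight
```

After one round, the weight mass shifts toward the block whose prediction matched the revealed target.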
(2) Online learning model training supported by supercomputing parallel computation
For model training, as shown in fig. 4: in the online learning stage, the data distribution and evolution mechanism of each observation channel are assumed to differ substantially, so a separate model training task is set up for the observation sensor data of each channel, and multiple models must be learned online. Note that during online learning of the observed data stream, the online learner does not need to write data to disk. The specific process is as follows:
first, the corresponding data streams are consumed from the corresponding topics in the distributed message queue, each Topic storing one type of cleaned data;
then a connection is established through the connection operator; in this process the important connection parameters must be configured, including the broker.id (one broker serves as leader), the Kafka listening port (set to 9092), the ZooKeeper address used to store broker metadata (the hostname or IP address of ZooKeeper with client connection port 2181), and the available path /path.
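The parameters above can be collected into a configuration mapping along the following lines. The key names follow common Kafka/ZooKeeper convention, and the host names are illustrative assumptions; only the ports 9092 and 2181 and the path are given in the text.

```python
# Hedged sketch of the connection parameters for the consume operator;
# "broker1" and "zk1" are placeholder hosts, not values from the text.
kafka_connection = {
    "broker.id": 0,                        # broker id; one broker acts as leader
    "bootstrap.servers": "broker1:9092",   # Kafka listening port 9092
    "zookeeper.connect": "zk1:2181/path",  # ZooKeeper host, client port 2181,
                                           # plus the path storing broker metadata
}

def bootstrap_url(cfg):
    # helper returning the broker address a consumer would connect to
    return cfg["bootstrap.servers"]
```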
Finally, after the connection is established, each channel corresponds to its own online learning model. Each model can either learn from scratch, or a base model can first be built from part of the initial data. According to the loss value obtained on each data item, the model back-propagates the error layer by layer and automatically adjusts the corresponding parameters and the weight of each neuron.
Note that how the data stream consumed from the Kafka cluster is used can be set autonomously according to the specific business requirements: either a small amount of windowed data is split off as training data, or each individual data item serves as the training data of one training step.
Training a different model for each channel's data places higher demands on resource allocation and memory, and strong computing power is needed to support real-time parallel computation over massive data. The requirement of multi-channel online learning is therefore met by supercomputing MPI parallel model training, that is, the online learning task of every channel is submitted to the supercomputer platform as an MPI job.
After submission, the job is distributed over massive computing resources; each node contains several computing processes, and one process of one computing node can be responsible for the online learning task of one channel's sensor data. Alternatively, a node may be made responsible for the data of only one channel to optimize performance. For the observation sensor data of one channel, newly collected sensor data from Kafka are continuously fed into the corresponding online learning program to train and update the model, and the model is saved to the model version library periodically; this provides the precondition for rollback when an online model develops problems. The initial model of a channel starts with a shallow network to ensure that it converges quickly: if the initial model is too complex, convergence is slow, training time grows, the advantages of online learning are lost, and high real-time performance cannot be guaranteed. As data are continuously consumed, the corresponding model is iteratively updated, and as more Internet-of-Things sensor data are obtained, the representation capacity and complexity of the model increase as well.
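The one-process-per-channel assignment described above can be sketched as a rank-to-channel mapping. With mpi4py (or any MPI binding, which is an assumption since the embodiment does not name one), `rank` and `size` would come from `MPI.COMM_WORLD`; a plain function keeps the mapping testable without an MPI launcher.

```python
def channels_for_rank(rank, size, num_channels):
    """Channels assigned to one MPI process; when size >= num_channels each
    process handles at most one channel, matching the text above."""
    return [c for c in range(num_channels) if c % size == rank]

def run_worker(rank, size, num_channels):
    assigned = channels_for_rank(rank, size, num_channels)
    # each assigned channel would consume its Kafka topic here and feed
    # the data to its own online learning model
    return assigned

# 8 processes over 8 channels: exactly one channel per process
assignments = [run_worker(r, 8, 8) for r in range(8)]
```

When fewer processes than channels are available, the modulo mapping degrades gracefully by giving some processes more than one channel.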
(3) Model library construction and version maintenance
Model version control is performed for each channel's data: the trained model is written into the model version library in real time, and the corresponding storage step length is adjusted according to the model's storage granularity.
This is explained with the sliding-window training mode. After each round of training finishes and the model is updated, the continuously iterated model is named with the timestamp of the latest data in the current training window and stored into the model version library as a stream. To meet the real-time inference requirement, the memory-based Redis is chosen as the model version library to speed up model writes and reads. A model is destroyed after being consumed, but every iteration version must be kept for subsequent calls or online services. Furthermore, since the continuous data stream keeps updating the models and the iteration count of the multi-channel models keeps growing, model throughput and system storage capacity become a major challenge; a distributed file system (HDFS) is therefore used to back up the models, realizing distributed storage of the models trained in real time, so that even if a storage node goes down or another uncontrollable fault occurs, high fault tolerance and high reliability of the models are still guaranteed.
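The timestamp-named versioning and rollback logic described above can be sketched as follows. The in-memory dict stands in for Redis (hot versions) plus HDFS (durable backup), and the method names are illustrative assumptions.

```python
import bisect

class ModelVersionStore:
    """Sketch of the model version library: versions are keyed by channel and
    by the timestamp of the newest data in the training window, and rollback
    falls back to the nearest older version."""
    def __init__(self):
        self.versions = {}       # "channel:timestamp" -> stored model blob
        self.timeline = {}       # channel -> sorted list of version timestamps

    def save(self, channel, window_end_ts, model_blob):
        # name the model after the newest data timestamp in the window
        self.versions[f"{channel}:{window_end_ts}"] = model_blob
        bisect.insort(self.timeline.setdefault(channel, []), window_end_ts)

    def latest(self, channel):
        ts = self.timeline[channel][-1]
        return ts, self.versions[f"{channel}:{ts}"]

    def rollback(self, channel):
        # drop the newest version and fall back to the nearest older one
        ts = self.timeline[channel].pop()
        del self.versions[f"{channel}:{ts}"]
        return self.latest(channel)

store = ModelVersionStore()
store.save("temp", 1000, "model-v1")
store.save("temp", 2000, "model-v2")   # latest() now returns the 2000 version
```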
Step S4: based on the Flink distributed stream processing system, for the ocean observation data continuously flowing into each channel, the latest intelligent ocean-observation-data calculation model corresponding to that channel is selected to perform real-time inference and prediction.
With single-node inference, an abnormal condition followed by an inference failure affects the overall inference and prediction. Compared with implementing the inference process on the supercomputer, the high concurrency and high availability of the Flink distributed computation engine better guarantee the reliability of the inference process, so building a Flink distributed cluster for online inference well avoids the single point of failure that supercomputing single-node inference might cause.
With the distributed cluster and distributed architecture, when one or more nodes encounter abnormal conditions such as downtime, the inference program does not need to be restarted to re-acquire the streaming data from the beginning; the system rolls back to the nearest state to continue inference and prediction, and the feedback of the overall inference program and prediction results is unaffected.
Because the inference stage consumes far fewer resources than the training stage, using the above approach for per-channel inference prediction is the better scheme. First a Flink cluster is built on Yarn, with Flink running in Per-Job-Cluster mode on Yarn; Flink-based job execution proceeds as follows:
each Job corresponds to its own cluster; after a streaming-data inference Job is submitted, it applies for resources from Yarn according to its own needs until its execution finishes, and whether one Job fails has no effect on the normal submission and operation of the next Job.
In addition, each job has its own Dispatcher and Resource Manager and can accept resource applications on demand. Since the workload here is large-scale and long-running, this scheme effectively satisfies the operation and application requirements of this embodiment. Most importantly, the tasks are mutually independent and do not affect one another, which suits multi-channel, multi-task application scenarios and is easy to manage, and the created cluster disappears after its task finishes.
The operator classifies the models obtained from the model version library by timestamp and filters out the latest model. The application program is submitted to the Flink cluster as a job. The Job Manager in the Flink cluster first receives the inference computation program to be executed, which concretely contains: the Job Graph (the logical dataflow graph) and the packed JAR with all classes, libraries, and other resources. The Job Manager converts the Job Graph into a physical-level dataflow graph, called the Execution Graph, which contains all tasks that can be executed concurrently. The Job Manager requests from the Resource Manager the resources necessary to execute the tasks, namely slots on Task Managers. Once it has acquired enough resources, the Execution Graph is distributed to the Task Managers that actually run the tasks. During operation, the Job Manager is responsible for all operations requiring central coordination, such as checkpoint coordination.
Specifically, the Flink cluster is deployed on Yarn; based on the Yarn scheduling system, failover of every role is handled automatically, and Yarn resources are used on demand, which improves the resource utilization of the cluster. The specific submission flow of an inference job is shown in fig. 5:
after a Flink inference task is submitted, the Client uploads Flink's Jar package and configuration to HDFS, then submits the inference task to Yarn's Resource Manager. The Resource Manager allocates container resources and notifies the corresponding Node Manager to start the Application Master. Once started, the Application Master loads Flink's Jar package and configuration, builds the environment, and starts the Job Manager. The Application Master then applies to the Resource Manager for resources to start Task Managers; after the Resource Manager allocates container resources, the Application Master notifies the Node Manager of the node holding the resources to start a Task Manager. That Node Manager loads Flink's Jar package and configuration, builds the environment, and starts the Task Manager. After starting, the Task Manager sends heartbeat packets to the Job Manager and waits for the Job Manager to assign it inference tasks. This is the concrete implementation process of the inference program on Flink.
In conclusion, in the method of this embodiment: distributed storage based on a distributed message queue system ensures high fault tolerance and high reliability; data preprocessing solves problems such as out-of-order and missing data that can occur in the data streams; online learning with supercomputing-based high-performance computation realizes real-time updating and iteration of the multi-channel models and, combined with the proposed algorithm, effectively solves the concept-drift problem; and online inference on the constructed Flink distributed stream processing system guarantees fast response, fast inference, and high reliability for the data.
Example two
The embodiment discloses an intelligent computing system for fusing parallel and distributed ocean observation data;
as shown in fig. 7, the intelligent computing system for merging parallel and distributed marine observation data includes a distributed storage module, a data preprocessing module, an online learning module, and an online reasoning module;
a distributed storage module configured to: acquiring ocean observation data flow of each channel in real time, and storing the ocean observation data flow into a distributed cluster Kafka;
a data preprocessing module configured to: carry out out-of-order handling, deduplication, and missing-data preprocessing on the stored data stream;
an online learning module configured to: performing multi-channel online learning model training by adopting a mode of performing supercomputing MPI parallel training model based on the preprocessed marine observation data stream to obtain the latest marine observation data intelligent calculation model of each channel;
an online reasoning module configured to: based on a Flink distributed stream processing system, the latest ocean observation data intelligent calculation model corresponding to each channel is selected for ocean observation data which continuously flow into each channel, and real-time reasoning and prediction are carried out.
EXAMPLE III
An object of the present embodiment is to provide a computer-readable storage medium.
A computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps in the intelligent calculation method for fused parallel and distributed marine observation data according to embodiment 1 of the present disclosure.
Example four
An object of the present embodiment is to provide an electronic device.
The electronic device comprises a memory, a processor and a program which is stored on the memory and can run on the processor, wherein the processor executes the program to realize the steps of the intelligent calculation method for the fused parallel and distributed ocean observation data according to the embodiment 1 of the disclosure.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. The method for intelligently calculating the ocean observation data integrating the parallel and the distributed modes is characterized by comprising the following steps of:
acquiring ocean observation data flow of each channel in real time, and storing the ocean observation data flow into a distributed cluster Kafka;
carrying out out-of-order handling, deduplication, and missing-data preprocessing on the stored data stream;
performing multi-channel online learning model training by adopting a mode of performing supercomputing MPI parallel training model based on the preprocessed marine observation data stream to obtain the latest marine observation data intelligent calculation model of each channel;
based on a Flink distributed stream processing system, the latest ocean observation data intelligent calculation model corresponding to each channel is selected for ocean observation data which continuously flow into each channel, and real-time reasoning and prediction are carried out.
2. The intelligent calculation method for merged parallel and distributed marine observation data according to claim 1, wherein the distributed cluster Kafka stores the data stream of each channel into a respective topic, each topic is divided into a plurality of partitions arranged in order, and each topic is distributed on a plurality of servers in the cluster.
3. The intelligent calculation method for merged parallel and distributed marine observation data as claimed in claim 2, wherein in the distributed cluster Kafka, when data with the same primary key value are submitted, the servers in the cluster persist only one piece of marine observation data for each primary key value, ensuring that the data is processed only once during cluster processing.
4. The intelligent calculation method for merged parallel and distributed marine observation data as claimed in claim 1, wherein an evolutionary convolutional neural network architecture is adopted to realize an online learning method, so that the model adaptively evolves for a dynamically evolving data stream, and simultaneously the structure of the model is dynamically adjusted from shallow to deep along with the inflow of data, and the sub-structure of the model is dynamically re-weighted from the data stream in a sequential or online learning manner, so as to learn a neural network with a deep structure and a complex nonlinear function in an online environment.
5. The method for intelligent computation of fused parallel and distributed marine observation data according to claim 1, wherein the supercomputing MPI parallel training model creates a respective data connection for each topic of the distributed cluster Kafka, provides training data for the online learning model of each channel through the data connection, and submits the training data to the supercomputer platform through MPI jobs, and a process of a computation node of the supercomputer platform is responsible for the online learning model training of one channel.
6. The intelligent calculation method for merged parallel and distributed marine observation data according to claim 1, further comprising performing version control on the model of each channel, transmitting the trained model to a model version library in real time, and adjusting the storage granularity of the model according to specific conditions to correspond to the storage step length.
7. The intelligent calculation method for merged parallel and distributed marine observation data according to claim 1, wherein the Flink distributed stream processing system is used for building a distributed cluster, and when one or more nodes are down, the inference program does not need to be restarted to acquire the stream data from the beginning for inference, but the system rolls back to the nearest state for inference prediction, and the feedback of the overall inference program and the prediction result is not influenced.
8. The ocean observation data intelligent computing system integrating the parallel and the distributed modes is characterized by comprising a distributed storage module, a data preprocessing module, an online learning module and an online reasoning module;
a distributed storage module configured to: acquiring ocean observation data flow of each channel in real time, and storing the ocean observation data flow into a distributed cluster Kafka;
a data preprocessing module configured to: carry out out-of-order handling, deduplication, and missing-data preprocessing on the stored data stream;
an online learning module configured to: perform multi-channel online learning model training based on the preprocessed marine observation data stream by adopting supercomputing MPI (Message Passing Interface) parallel model training to obtain the latest marine observation data intelligent calculation model of each channel;
an online reasoning module configured to: based on a Flink distributed stream processing system, the latest ocean observation data intelligent calculation model corresponding to each channel is selected for ocean observation data which continuously flow into each channel, and real-time reasoning and prediction are carried out.
9. Computer readable storage medium, on which a program is stored, which program, when being executed by a processor, carries out the steps of the method for intelligent computation of fused parallel and distributed marine observation data according to any one of claims 1 to 7.
10. Electronic device comprising a memory, a processor and a program stored on the memory and executable on the processor, wherein the processor implements the steps of the method for intelligent computation of fused parallel and distributed marine observation data according to any one of claims 1 to 7 when executing the program.
CN202211230749.6A 2022-10-10 2022-10-10 Parallel and distributed ocean observation data fusion intelligent calculation method and system Pending CN115293662A (en)


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115879569A (en) * 2023-03-08 2023-03-31 齐鲁工业大学(山东省科学院) IoT observation data online learning method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113783931A (en) * 2021-08-02 2021-12-10 中企云链(北京)金融信息服务有限公司 Internet of things data aggregation and analysis method
CN114385601A (en) * 2022-03-24 2022-04-22 山东省计算中心(国家超级计算济南中心) Cloud-edge collaborative high-throughput ocean data intelligent processing method and system based on super computation


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
毛亚青等: "基于Flink的海量医学图像检索系统设计与实现", 《计算机测量与控制》 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20221104