CN117354330A - Improved edge computing IoT big data analysis architecture - Google Patents
- Publication number
- CN117354330A CN117354330A CN202311335469.6A CN202311335469A CN117354330A CN 117354330 A CN117354330 A CN 117354330A CN 202311335469 A CN202311335469 A CN 202311335469A CN 117354330 A CN117354330 A CN 117354330A
- Authority
- CN
- China
- Prior art keywords
- data
- processing
- internet
- things
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/067—Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/10—Pre-processing; Data cleansing
- G06F18/15—Statistical pre-processing, e.g. techniques for normalisation or restoring missing data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0655—Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1097—Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/12—Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/56—Provisioning of proxy services
- H04L67/568—Storing data temporarily at an intermediate stage, e.g. caching
Abstract
The patent relates to an improved internet of things (IoT) big data analysis system comprising an internet of things edge layer and a cloud processing layer, aimed at the challenge of processing and managing large-scale IoT sensor data. The edge layer collects and preliminarily processes data from various sensors and devices, then transmits it to the cloud processing layer for distributed storage and preprocessing. Preprocessing includes normalization, filtering, queuing, and data aggregation, which improve data quality and provide a better basis for subsequent processing and model training. A Map-Only algorithm and a MapReduce parallel processing mechanism are adopted to improve data processing speed and efficiency, and training and inference of the machine learning model are performed with an optimized BP neural network algorithm. The architecture provides strong support for internet of things applications, can be used for real-time decision making, prediction, and resource optimization, offers an efficient solution for processing and analyzing internet of things data, and has wide application potential.
Description
Technical Field
The invention relates to the field of internet of things big data analysis, and in particular to an improved edge computing architecture for internet of things (IoT) big data analysis.
Background
With the popularity of the internet of things (IoT), the large-scale data generated by various sensor devices is growing explosively, and data processing and management therefore face unprecedented challenges. The rise of edge computing offers new possibilities for addressing these challenges by pushing computation and data processing toward the network edge to reduce latency and improve scalability. In addition, techniques such as machine learning and federated learning have made data analysis more intelligent, but have also introduced new challenges. Traditional data processing frameworks and methods have become inefficient, so new research is needed to optimize data ingestion, processing, and storage in order to achieve more intelligent, efficient, and scalable IoT applications.
Disclosure of Invention
The present invention aims to propose an improved edge computing IoT big data analysis architecture that copes with the big data challenges generated by IoT using edge computing and machine learning technologies. The proposed framework and algorithms address data ingestion, processing, and storage to achieve more efficient big data analysis. This provides new methods and tools for the IoT and edge computing fields, drives their development, creates opportunities for various application domains, and helps solve the problem of large-scale data processing. The final goal is to introduce intelligence into objects in the physical world, promote the fusion of internet of things applications and machine learning technology, and provide support for future intelligent systems.
In order to achieve the aim of the invention, the invention adopts the following technical scheme:
In one aspect, an embodiment of the invention comprises an internet of things edge data processing method that performs preliminary processing and integration of raw sensor data and provides a basis for subsequent big data analysis and management. Data is collected from various internet of things devices and sensors, preliminarily processed, and then transmitted to nearby edge devices or servers, which cache and process it.
In another aspect, an embodiment of the invention further comprises a cloud processing layer responsible for data loading and parallel processing. After the cloud processing layer receives cached data from the edge, the data must be stored in a distributed manner. Distributed storage typically employs multiple storage nodes, which may be located at different physical locations. This architecture ensures redundant backup of the data, improving reliability and fault tolerance.
Further, the cloud processing layer performs preliminary processing on the raw data; preprocessing mechanisms such as normalization, filtering, and queuing prepare the data for effective processing and training, improving its quality and accuracy and making it suitable for subsequent processing, analysis, and modeling.
Preferably, the normalization operation is used to eliminate deviations in the data, ensure consistency of the data, and improve accuracy of data processing.
Preferably, filtering is used to speed up actual processing. High-quality information is selectively retained while bad or noisy data is filtered out, thereby improving data quality.
In order to accelerate data processing and make efficient use of big data, the M/M/1 queuing model is optimized and a hybrid M/M/1 queuing model is adopted for queuing.
Preferably, a message queue is used to speed up data processing. The message queue runs in a specific operating mode: a message M acquired at time t is forwarded to the designated component under the control of a specific handler H. This improves the efficient use of big data, ensuring that messages are processed and delivered in a predetermined manner when needed.
Further, data aggregation of the preprocessed data integrates data from multiple IoT sensor sources into a central location, providing accurate grouped data for further analysis.
Further, the data divided into blocks is loaded and mapped in parallel using a parallel algorithm. Data blocks are loaded on multiple nodes simultaneously, improving overall computation efficiency. The large data set is divided into small fixed-size blocks that are processed in parallel by each node, and the block size of the processing unit is optimized to keep the number of parallel channels balanced.
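As an illustration, the block splitting and parallel loading described above can be sketched in Python; `load_block` is a hypothetical stand-in for loading one block onto a processing node:

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_blocks(records, block_size):
    """Divide a large data set into fixed-size blocks for parallel loading."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]

def load_block(block):
    """Hypothetical stand-in: 'loading' a block just reports its size."""
    return len(block)

def parallel_load(records, block_size, workers=4):
    """Load all blocks concurrently, one task per block."""
    blocks = split_into_blocks(records, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        loaded = list(pool.map(load_block, blocks))
    return blocks, loaded

blocks, loaded = parallel_load(list(range(10)), block_size=4)
```

Keeping the block size fixed balances the number of parallel channels: ten records with a block size of four yield three blocks that can be loaded on three nodes at once.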
In another aspect, parallel processing of a BP neural network machine learning algorithm is provided, including training and inference of a BP neural network machine learning model. The BP network can learn and store a large number of mappings between inputs and expected results, automatically fine-tuning network weights and thresholds through error approximation and error back-propagation. BP model construction is parallelizable: the optimized BP model is trained and validated in parallel on multiple processing nodes, and the score results produced by each node are combined with ensemble learning to improve the results.
The beneficial effects of the invention are as follows. First, the edge computing IoT big data analysis architecture effectively addresses the large-scale heterogeneous data processing and storage challenges generated by IoT applications and provides an efficient data processing solution. Second, by optimizing data loading, cluster resource management, and machine learning applications, the performance and speed of data processing are improved, enabling IoT applications to obtain valuable information more quickly. In addition, distributed storage and parallel processing improve the efficiency and availability of data processing. The invention has a wide application range, can be applied in various IoT fields, and provides better data analysis and modeling tools for business and decision support. It is therefore expected to yield significant practical benefits in the fields of edge computing and IoT big data processing.
Drawings
FIG. 1 is a system model of data analysis of the Internet of things;
FIG. 2 is a schematic diagram of a modified edge computing IoT big data architecture data processing flow;
FIG. 3 is a schematic diagram of data processing and transmission for an M/M/1 queue;
FIG. 4 is a schematic diagram of parallel processing by MapReduce operations;
FIG. 5 is a block diagram of a hybrid BP neural model.
Detailed Description
The invention will now be described in further detail with reference to the drawings and examples.
Referring to fig. 1, the invention provides a system model for internet of things big data analysis, consisting of an internet of things edge layer and a cloud processing layer. The first layer, the internet of things edge layer, includes various IoT sensors and embedded devices covering areas such as environmental monitoring, security monitoring, facility monitoring, traffic monitoring, power monitoring, and transportation monitoring, integrated with edge servers.
The second layer, the cloud processing layer, is a core component of the big data analysis framework and plays a crucial role. Its main responsibilities include receiving, processing, and storing data from the internet of things edge layer. The layer consists of a cloud server and various processing units, which collect the complex data generated by the edge servers and edge devices and execute the data processing tasks needed to extract valuable information from it. At the same time, redundant backup and high availability of the data must be ensured so that the data is accessible when needed. The cloud processing layer enables the system to efficiently process large, heterogeneous data, providing support for various IoT applications.
In implementation, the edge computing IoT big data processing architecture covers the whole process of big data processing across the internet of things edge layer and the cloud computing environment, aiming to efficiently manage and analyze large-scale internet of things data. Referring to fig. 2, big data processing includes the following steps:
(S1) the edge devices and servers receive a large amount of data generated by various internet of things sensors and embedded devices.
(S2) The edge data is collected in an edge cache and transferred to the cloud processing layer, where it is stored in a distributed manner.
(S3) Loading of big data is optimized using a Map-Only algorithm; parallelizing data loading and mapping reduces communication overhead and improves data processing speed and efficiency.
(S4) The collected data is preprocessed (normalization, filtering, queuing, and data aggregation) to prepare it for effective processing and training.
(S5) Parallel processing in the edge environment is realized through an optimized MapReduce mechanism. The MapReduce programming paradigm offers better scalability, flexibility, cost-effectiveness, speed, simplicity, and resilience.
(S6) The machine learning model is trained and used for inference with an optimized back-propagation (BP) neural network algorithm. The model learns the mapping between input and output data by continuously adjusting the network's weights and thresholds to minimize its loss function.
In step S1, a large amount of data generated by sensors and embedded devices is received. In the edge environment, various sensors, such as environmental, security, facility, traffic, power, and transportation sensors, are deployed at different locations and on different devices. These sensors monitor and collect data such as temperature, humidity, location, and events, and, depending on their design and use, constantly generate large amounts of data. The generated data is received, buffered, and processed by nearby edge devices or servers.
In step S2, the edge cache performs preliminary processing, aggregation, and temporary storage of the data. This helps reduce redundancy and the bandwidth consumed by transmission to the cloud. Data is grouped and compressed by characteristics such as type and timestamp so that it can be processed more efficiently.
After the data has been initially processed in the edge cache, it is uploaded to cloud storage. This process is periodic or triggered by conditions, ensuring consistency and timeliness of the data. In the cloud, the data is stored in a distributed storage system responsible for persistent storage, backup, management, and scalability. The data is stored under a partitioned, replicated policy to ensure its reliability and availability.
In step S3, a Map-Only algorithm is introduced to parallelize data loading and mapping. Data loading depends on the type of processing available inside the parallel and distributed platforms, and data must be loaded onto the parallel processing platform before processing. The Sqoop utility is integrated with the Map job of the MapReduce paradigm, and the Map-Only algorithm then adjusts the split size and replication factor of the traditional method for parallel data ingestion. First, a new directory is created under the root directory. Next, a file is verified and added to the directory. Then, its replication factor is changed by command, and likewise the replication of all files in the directory is changed. Throughout, specific and generic parameters are used to control the operation of the Sqoop tool.
This process configures the generic Hadoop command-line parameters via the Sqoop tool, then selects the source tables to import from the relational database management system (RDBMS) and specifies the storage format of the data. Next, the particular column subset to import is selected with the --columns parameter as needed, while an SQL WHERE clause is used to filter the data to import. Finally, incremental import is realized with Sqoop's --incremental parameter, so that only new or updated records in the RDBMS source table are imported, ensuring that the data in Hadoop stays up to date. This makes the import of data from the RDBMS into the Hadoop Distributed File System (HDFS) highly configurable and flexible.
In step S4, the data is preprocessed. Preprocessing comprises normalization, filtering, queuing, and data aggregation, which improve data quality, accelerate processing, and provide a better basis for subsequent processing and model training.
Data normalization scales the value of the data to a range of 0 to 1 using a min-max normalization method. Data of different scales, ranges or units are converted into uniform standard scales, and the values of the data points are mapped to a range of 0 to 1 so that the minimum value becomes 0, the maximum value becomes 1, and other values are located between the two. The dimensional difference of the data is eliminated, and different characteristics or variables are ensured to have similar dimensions, so that the data analysis and modeling are easier to perform.
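A minimal sketch of the min-max normalization described above, assuming a plain list of numeric values:

```python
def min_max_normalize(values):
    """Scale values to [0, 1]: the minimum maps to 0, the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # degenerate case: all values identical, no scale to preserve
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_normalize([10.0, 20.0, 40.0])
```

After scaling, features with different units or ranges share the same [0, 1] scale, which is the property the section relies on for easier analysis and modeling.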
The core idea of filtering with optimized Kalman filtering (KF) is to combine the previous state estimate with new observation data to obtain a more accurate estimate of the system state. The starting point of the algorithm is initialization, which includes defining the dynamics of the system (transition model T), the observation mode (observation model O), and the estimates of system uncertainty (process-noise covariance CN and observation covariance CO). State estimation then proceeds through a series of steps: initial data is obtained, the previous state estimate is retrieved, and new observations are acquired. Next, the transition model and previous state estimate are used to predict the current system state and its uncertainty. The new observations are then combined with the predicted state, and the state estimate is updated by calculating the Kalman gain. The prediction and update steps are repeated to gradually refine the state estimate. Once all time steps have been processed, the filtering ends, providing a series of state estimates that account for observation noise and system dynamics and can deliver accurate state estimates in various applications.
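A scalar sketch of this predict/update cycle, reusing the section's names (transition model T, observation model O, process-noise covariance CN, observation covariance CO); the default parameter values are illustrative assumptions:

```python
def kalman_1d(observations, T=1.0, O=1.0, CN=1e-3, CO=0.25, x0=0.0, p0=1.0):
    """Scalar Kalman filter. T: transition model, O: observation model,
    CN: process-noise covariance, CO: observation-noise covariance."""
    x, p = x0, p0
    estimates = []
    for z in observations:
        # predict the current state and its uncertainty from the previous estimate
        x = T * x
        p = T * p * T + CN
        # update: combine the prediction with the observation via the Kalman gain
        k = p * O / (O * p * O + CO)
        x = x + k * (z - O * x)
        p = (1.0 - k * O) * p
        estimates.append(x)
    return estimates

estimates = kalman_1d([1.0] * 50)  # constant signal for illustration
```

Fed a constant signal, the estimate moves most of the way toward the observation on the first step (the initial uncertainty p0 is large) and then converges as the gain settles.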
A hybrid M/M/1 queuing model is adopted for queuing, accelerating data processing and making efficient use of big data. The model performs various operations when it receives a data segment D at time t, at which point the system is considered to be in steady state. FIG. 3 illustrates data processing and transmission for the M/M/1 queue. Let S_k denote the steady-state probability of k tasks in the system, Λ the arrival rate, and μ the service rate. Balancing the flows between adjacent states S_0, S_1, ..., S_{k-1}, S_k, S_{k+1} gives:
ΛS_0 = μS_1, ΛS_1 = μS_2, ..., ΛS_k = μS_{k+1}
Thus S_k = (Λ/μ)S_{k-1} = (Λ/μ)^k S_0. Since the probabilities must sum to 1:
S_0 (1 + (Λ/μ) + (Λ/μ)^2 + ...) = 1
Summing the geometric series (with utilization ρ = Λ/μ < 1) yields S_0 = 1 − Λ/μ, and hence S_k = ρ^k (1 − ρ). The average number of tasks in the system is therefore
N = Σ_k k·S_k = ρ/(1 − ρ) = Λ/(μ − Λ)
the average queue length is N_q = N − ρ, and by Little's law the average time a task spends in the system is
W = N/Λ = 1/(μ − Λ).
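The steady-state quantities of the M/M/1 queue can be computed directly; this sketch assumes a stable queue with arrival rate Λ below service rate μ:

```python
def mm1_metrics(arrival_rate, service_rate):
    """Steady-state M/M/1 metrics; requires arrival_rate < service_rate."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    rho = arrival_rate / service_rate        # utilization, rho = arrival/service
    n = rho / (1.0 - rho)                    # mean number of tasks in the system
    nq = n - rho                             # mean queue length
    w = 1.0 / (service_rate - arrival_rate)  # mean time in system (Little's law)
    return rho, n, nq, w

rho, n, nq, w = mm1_metrics(2.0, 4.0)
```

With two arrivals and four services per unit time, utilization is 0.5, one task is in the system on average, and each task spends half a time unit in the system.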
Finally, the preferred technique for data aggregation divides the vast data set into smaller blocks, reducing the complexity of each block and making it easier to manage and process; similar data is grouped within blocks, which are then processed simultaneously on different processing units. Inside each data block, various aggregation operations may be performed, such as summing, averaging, and counting. These operations generate summary data, reduce the size of the data set, and provide a higher level of data summarization.
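A minimal sketch of this block-wise aggregation, with hypothetical sum/average/count summaries produced per block:

```python
def aggregate_blocks(readings, block_size):
    """Split readings into blocks and summarize each block in place of raw data."""
    summaries = []
    for i in range(0, len(readings), block_size):
        block = readings[i:i + block_size]
        summaries.append({
            "count": len(block),
            "sum": sum(block),
            "avg": sum(block) / len(block),
        })
    return summaries

summaries = aggregate_blocks([1, 2, 3, 4, 5, 6], block_size=3)
```

Each summary is far smaller than the block it replaces, which is exactly the data-reduction effect the aggregation step aims for.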
In step S5, parallel processing in the edge environment is realized through an optimized MapReduce mechanism. Big data sets are extremely large and must be divided into blocks or segments for distributed storage and parallel processing; the MapReduce paradigm is therefore preferred.
MapReduce is a programming model and processing technique for processing and generating large-scale data sets; parallel processing comprises mapping (Map) and reduction (Reduce), with the flow shown in fig. 4. First, the large-scale data set is divided into small data blocks that can be processed on different nodes of a parallel computing cluster, each block containing multiple records or data items. In the mapping phase, each data block is passed to a set of mapping tasks, whose goal is to convert each record or data item in an input block into a set of key-value pairs, where the key identifies certain attributes of the data and the value contains the actual data. Mapping tasks run in parallel, each independently processing its assigned block. Next, the MapReduce framework sorts and groups the key-value pairs output by the mapping tasks: pairs with the same key are grouped together and assigned to different reduction tasks, and the keys are partitioned to balance the load across reduction tasks. Each reduction task then takes a set of key-value pairs sharing the same key and performs a user-defined reduction operation, typically aggregating values, computing statistics, or performing other data processing. Reduction tasks also run in parallel, each independently processing its assigned data. Each reduction task generates a partial result, and these are finally combined into a complete result set containing the final processing results for the input data set.
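The map, shuffle/sort, and reduce phases can be sketched in miniature; counting records per sensor type is an illustrative stand-in for a user-defined reduction:

```python
from itertools import groupby

def map_phase(records):
    """Map: emit one (key, value) pair per record; here (sensor_type, 1)."""
    return [(r["type"], 1) for r in records]

def shuffle(pairs):
    """Sort by key and group values sharing the same key together."""
    pairs = sorted(pairs, key=lambda kv: kv[0])
    return {k: [v for _, v in g] for k, g in groupby(pairs, key=lambda kv: kv[0])}

def reduce_phase(grouped):
    """Reduce: apply the user-defined aggregation (a sum) per key."""
    return {k: sum(vs) for k, vs in grouped.items()}

records = [{"type": "temp"}, {"type": "traffic"}, {"type": "temp"}]
counts = reduce_phase(shuffle(map_phase(records)))
```

In a real cluster the three phases run on different nodes in parallel; here they run sequentially to make the data flow between phases explicit.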
In step S6, a machine learning model is trained and used for inference with an optimized back-propagation (BP) neural network algorithm. The BP neural network is a neural network model for machine learning, used for training and inference. BP model construction is parallelizable: multiple processing nodes train and validate the optimized BP model in parallel, and the score results produced by each node are combined using ensemble learning to improve the results. The BP model consists of an input layer, a hidden layer, and an output layer, as shown in fig. 5.
The preferred variant of the BP network adopts an additional momentum technique: a momentum coefficient μ is introduced into the gradient descent algorithm to construct the parallel model. The weight increment as a function of the weights is:
Δω(n+1) = μ·Δω(n) + (1 − μ)·Δ·(gx/gw)
where Δω(n+1) and Δω(n) represent the weight increments after the (n+1)-th and n-th iterations, the value of μ must be between 0 and 1, gx/gw represents the negative of the gradient, and Δ is the learning rate. A variable learning-rate method is also adopted, with adaptive adjustment according to the error change:
Δ(n+1) = m+ · Δ(n), if X(n+1) < X(n)
Δ(n+1) = m− · Δ(n), if X(n+1) ≥ X(n)
where the increment factor m+ is greater than 1, the decrement factor m− is between 0 and 1, and X(n+1) and X(n) represent the sum of squared total errors after the (n+1)-th and n-th iterations, respectively. The learning-direction principle of BP is that the weights and thresholds of the network are adjusted along the negative gradient direction.
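The momentum and variable-learning-rate updates can be sketched as follows; the loop, factors, and parameter values are illustrative assumptions, not the patented configuration:

```python
def momentum_step(w, grad, prev_delta, lr=0.1, mu=0.9):
    """One gradient-descent step with additional momentum:
    delta = mu * prev_delta + (1 - mu) * lr * (-grad)."""
    delta = mu * prev_delta - (1.0 - mu) * lr * grad
    return w + delta, delta

def adapt_lr(lr, err_new, err_old, m_inc=1.05, m_dec=0.7):
    """Variable learning rate: grow it when error falls, shrink it when error rises."""
    return lr * m_inc if err_new < err_old else lr * m_dec

# minimize E(w) = w^2 / 2, whose gradient is w, starting from w = 2
w, delta = 2.0, 0.0
for _ in range(300):
    w, delta = momentum_step(w, grad=w, prev_delta=delta)
```

The momentum term smooths successive updates, so the weight spirals toward the minimum instead of oscillating with the raw gradient.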
According to the formula S_{i+1} = S_i − Δ_i·g_i, where S_i represents the matrix of existing weights and thresholds, g_i the gradient of the current operation, and Δ_i the learning rate. Assume a three-layer BP model with input nodes y_k, hidden-layer nodes x_j, and output-layer nodes z_i. The hidden-layer output is
x_j = f(Σ_k ω_jk·y_k − θ_j)
the calculated output of the output node is
z_i = f(Σ_j ν_ij·x_j − θ_i)
the error of the output node is e_i = d_i − z_i, where d_i is the expected output, and the final network output is produced once the error falls below the threshold.
the invention provides a comprehensive data processing and analyzing system of the Internet of things, which integrates an Internet of things edge layer and a cloud processing layer, and efficiently manages and analyzes large-scale and diversified data of the Internet of things through the steps of data acquisition, preprocessing, distributed processing, machine learning model training reasoning and the like. The method provides key data support for various IoT applications, facilitates real-time decision making, prediction and resource utilization optimization, improves data processing efficiency and quality of the internet of things system, facilitates development of the internet of things technology, and provides a solid foundation for innovation in the fields of intelligent cities, intelligent transportation, environmental protection and the like.
The preferred embodiments disclosed above are merely to help illustrate the present invention, and it is obvious to those skilled in the art that the scope of the present invention is not limited to these specific embodiments. Equivalent modifications and substitutions for related technical features may be made by those skilled in the art without departing from the principles of the present invention, and such modifications and substitutions will be within the scope of the present invention.
Claims (7)
1. A system model for big data analysis of Internet of Things data, the system model comprising two layers: an Internet of Things edge layer and a cloud processing layer.
The Internet of Things edge layer comprises various IoT sensors and embedded devices and is used in fields such as environmental monitoring, security monitoring, facility monitoring, traffic monitoring, power monitoring, and transportation monitoring. The Internet of Things edge layer also integrates an edge server for receiving, processing, and buffering data from the sensors and embedded devices.
The cloud processing layer comprises a cloud server and various processing units, and is used for receiving, processing and storing data from the edge layer of the Internet of things, and comprises the following data processing flows:
the optimization of data loading is realized through a Map-Only algorithm, communication overhead is reduced through parallelization of data loading and mapping, and data processing speed and efficiency are improved.
The data is pre-processed, e.g., normalized, filtered, queued, and aggregated, to prepare the data for further processing and training.
Parallel processing in the edge environment is implemented using a MapReduce mechanism, providing better scalability, flexibility, cost-effectiveness, speed, simplicity, and elasticity.
Machine learning model training and inference are performed using an optimized back-propagation (BP) neural network algorithm to learn the mapping from input data to output data.
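The MapReduce-style flow in the data processing steps above can be sketched as a minimal in-process pipeline: a parallel map phase, a shuffle that groups values by key, and a reduce phase that aggregates each group. The sensor records, pool size, and mean aggregation are illustrative assumptions, not details specified by the invention.

```python
from collections import defaultdict
from multiprocessing.dummy import Pool  # thread pool standing in for distributed workers

# Hypothetical sensor records from the edge layer: (sensor_type, reading).
records = [
    ("temperature", 21.5), ("traffic", 340.0), ("temperature", 22.1),
    ("power", 5.2), ("traffic", 298.0), ("power", 4.8),
]

def map_phase(record):
    """Map: emit a (key, value) pair; calls run in parallel across workers."""
    sensor_type, reading = record
    return (sensor_type, reading)

def reduce_phase(grouped):
    """Reduce: aggregate all values sharing a key (here, the mean reading)."""
    return {k: sum(v) / len(v) for k, v in grouped.items()}

with Pool(4) as pool:
    pairs = pool.map(map_phase, records)  # parallel map over the input records

shuffled = defaultdict(list)              # shuffle: group mapped values by key
for key, value in pairs:
    shuffled[key].append(value)

means = reduce_phase(shuffled)            # e.g. mean reading per sensor type
```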
2. The system model of claim 1, wherein the sensors of the internet of things edge layer include environmental observation sensors, security monitoring sensors, facility monitoring sensors, traffic monitoring sensors, power monitoring sensors, and transportation observation sensors.
3. The system model of claim 1, wherein the data loading process of the cloud processing layer includes integrating mapping jobs with the MapReduce paradigm using the Sqoop tool, achieving parallel data ingestion by adjusting the partition size and replication factors.
4. The system model of claim 1, wherein the data preprocessing process of the cloud processing layer includes filtering using Kalman filtering, queuing using a hybrid M/1 queuing model, and data aggregation using a divide-and-conquer approach.
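The filtering step of claim 4 names Kalman filtering. A minimal one-dimensional sketch, assuming a roughly constant signal level and illustrative noise variances q and r (not values specified by the invention), could look like:

```python
def kalman_1d(measurements, q=1e-3, r=0.5, x0=0.0, p0=1.0):
    """One-dimensional Kalman filter for a roughly constant signal level.
    q is the process-noise variance, r the measurement-noise variance."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p = p + q              # predict: uncertainty grows by the process noise
        k = p / (p + r)        # Kalman gain: how much to trust the measurement
        x = x + k * (z - x)    # update the estimate toward the measurement
        p = (1.0 - k) * p      # shrink the uncertainty after the update
        estimates.append(x)
    return estimates

# Example: smooth noisy readings scattered around a true level of 10.
noisy = [10.4, 9.7, 10.2, 9.9, 10.1, 9.8, 10.3]
smooth = kalman_1d(noisy, x0=noisy[0])
```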
5. The system model of claim 1, wherein the machine model training and reasoning process of the cloud processing layer includes model training using a BP neural network algorithm, with weights and thresholds of the neural network being continuously adjusted to minimize a loss function of the model, enabling the model to learn a mapping relationship between input data and output data.
6. The system model of claim 1, wherein the system model is used for processing large-scale and heterogeneous Internet of Things data, supports various IoT applications, provides capabilities for real-time decision making, prediction, and resource optimization, and improves the data processing efficiency and quality of the Internet of Things system.
7. The system model of claim 1, wherein the system model is suitable for innovation in fields such as smart cities, intelligent transportation, and environmental protection, and provides a solid foundation for development in these fields.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311335469.6A CN117354330A (en) | 2023-10-13 | 2023-10-13 | Improved edge computing IoT big data analysis architecture |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311335469.6A CN117354330A (en) | 2023-10-13 | 2023-10-13 | Improved edge computing IoT big data analysis architecture |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117354330A true CN117354330A (en) | 2024-01-05 |
Family
ID=89359014
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311335469.6A Pending CN117354330A (en) | 2023-10-13 | 2023-10-13 | Improved edge computing IoT big data analysis architecture |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117354330A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117873402A (en) * | 2024-03-07 | 2024-04-12 | 南京邮电大学 | Collaborative edge cache optimization method based on asynchronous federal learning and perceptual clustering |
CN117873402B (en) * | 2024-03-07 | 2024-05-07 | 南京邮电大学 | Collaborative edge cache optimization method based on asynchronous federal learning and perceptual clustering |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103345514B (en) | Streaming data processing method under big data environment | |
Jiang et al. | Dimboost: Boosting gradient boosting decision tree to higher dimensions | |
CN102737126B (en) | Classification rule mining method under cloud computing environment | |
CN103701635B (en) | Method and device for configuring Hadoop parameters on line | |
CN105550374A (en) | Random forest parallelization machine studying method for big data in Spark cloud service environment | |
CN102567312A (en) | Machine translation method based on distributive parallel computation framework | |
CN117354330A (en) | Improved edge computing IoT big data analysis architecture | |
CN107247799A (en) | Data processing method, system and its modeling method of compatible a variety of big data storages | |
CN114418129B (en) | Deep learning model training method and related device | |
CN108885641A (en) | High Performance Data Query processing and data analysis | |
CN115858675A (en) | Non-independent same-distribution data processing method based on federal learning framework | |
CN103281374A (en) | Method for rapid data scheduling in cloud storage | |
CN112199154B (en) | Reinforced learning training system and method based on distributed collaborative sampling center type optimization | |
CN117875454B (en) | Multistage intelligent linkage-based data heterogeneous federation learning method and storage medium | |
CN113672684A (en) | Layered user training management system and method for non-independent same-distribution data | |
CN109754638B (en) | Parking space allocation method based on distributed technology | |
Zhang et al. | Txallo: Dynamic transaction allocation in sharded blockchain systems | |
CN105550351B (en) | The extemporaneous inquiry system of passenger's run-length data and method | |
Wei et al. | Participant selection for hierarchical federated learning in edge clouds | |
Liang et al. | Collaborative Edge Service Placement for Maximizing QoS with Distributed Data Cleaning | |
CN114691327A (en) | Multi-objective group intelligent optimization method and system for two-stage task scheduling | |
CN117391858A (en) | Inductive blockchain account distribution method and device based on graphic neural network | |
Esfahanizadeh et al. | Stream iterative distributed coded computing for learning applications in heterogeneous systems | |
Fan et al. | Self-adaptive gradient quantization for geo-distributed machine learning over heterogeneous and dynamic networks | |
Ge et al. | Compressed collective sparse-sketch for distributed data-parallel training of deep learning models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||