CN117992549A

CN117992549A - Multi-source heterogeneous big data storage and management system

Info

Publication number: CN117992549A
Application number: CN202410326538.5A
Authority: CN
Inventors: 丁凡; 周萌; 吕李志; 周文博
Original assignee: Lanzhou University of Technology
Current assignee: Lanzhou University of Technology
Priority date: 2024-03-21
Filing date: 2024-03-21
Publication date: 2024-05-07

Abstract

The invention discloses a multi-source heterogeneous big data storage and management system. A data compression module is arranged in the data management module; by calculating the system parameters, it minimizes the squared difference between the estimated value and the actual value and thereby obtains the error of the compressed data. Applying these algorithms improves the efficiency of data compression and makes data transmission and storage more convenient and economical. Cold data is data that users access infrequently and that is not affected by overload; hot data is the opposite. By classifying data into these two types, the cold and hot data partitioning module reduces the proportion of cold data held in memory as far as possible, and can even reduce backups, so that user access requests for cold and hot data are load-balanced. This improves overall data storage and management efficiency, saves data storage space, makes the data easier for technicians to study, brings the technical advantages into full play, and improves the utilization of data information.

Description

Multi-source heterogeneous big data storage and management system

Technical Field

The invention belongs to the technical field of big data storage and management, and particularly relates to a multi-source heterogeneous big data storage and management system.

Background

Multi-source heterogeneous big data refers to large-scale data sets that come from many different sources and have different structures. Such data may be stored in different formats, in different databases or file systems, and may contain structured data (e.g., table data in relational databases), semi-structured data (e.g., XML, JSON), and unstructured data (e.g., text, images, audio). Its characteristics include the following. Multi-source: the data sources are wide-ranging and may include different business systems, sensors, social media, log files, and so on, and the data may differ in format, structure, and semantics. Heterogeneity: the data has different structures and forms, and may be stored in a variety of storage systems such as relational databases, NoSQL databases, and distributed file systems. Processing and analyzing multi-source heterogeneous big data is of great significance to enterprises and organizations. By integrating and analyzing data from different sources, they can obtain more comprehensive, accurate, and deeper insights, which helps them make better business decisions, improve products and services, identify market trends, and optimize operations. At the same time, multi-source heterogeneous big data presents challenges, including data integration, data cleaning, and data security and privacy, which must be handled with appropriate technologies and methods.

However, in common data storage and management systems the data is large in scale, comes from multiple sources, differs in structure and is highly redundant, which makes it difficult for the system to read, write, store and manage the data.

Disclosure of Invention

The object of the invention is to provide a multi-source heterogeneous big data storage and management system that solves the above problems.

The technical scheme adopted by the invention is as follows: a multi-source heterogeneous big data storage and management system comprises a power supply module, a data acquisition layer module, a data transmission module, a data preprocessing module, a data storage module, a data management module, an application service layer module, a cold and hot data partitioning module, a data index optimization module, a data compression module and an abnormal storage detection module. The output end of the power supply module is connected with the input end of the data acquisition layer module, the output end of the data acquisition layer module is connected with the input end of the data transmission module, the output end of the data transmission module is connected with the input end of the data preprocessing module, the output end of the data preprocessing module is connected with the input end of the data storage module, the output end of the data storage module is connected with the input end of the data management module, and the output end of the data management module is connected with the input end of the application service layer module.

In a preferred embodiment, the cold and hot data partitioning module, the data index optimization module, the data compression module and the abnormal storage detection module are arranged inside the data management module, and their combined output end is connected with the input end of the data management module.

In a preferred embodiment, the data acquisition layer module provides three interfaces, one for each data structure, and classifies the three types of data during acquisition, so that structured data and part of the semi-structured data can be stored in a database while the rest of the semi-structured data and the unstructured data are stored locally.

In a preferred embodiment, when the data transmission module is in the receive mode, a single receive mode and a continuous receive mode are provided using LoRaMode of the LSD4RF-2F917N 10. In the single receive mode, the demodulator first determines the size of the data to be received from the machine-vision segmentation of the data boundaries; if no target signal is recognized within one period of the rated time window, the demodulator returns to standby mode. If the target signal is recognized, the execution parameter information is written into PayloadCrcE in the manner indicated by the application service layer module, and the communication data is thereby received. In the continuous receive mode, after determining the size of the data, the demodulator does not use the period of the time window as the reference for target signal recognition; instead, it uses the frequency difference to perform signal detection and data reception in real time over the full period. In this way, smooth communication is ensured.

In a preferred embodiment, the data preprocessing module first stores all data uniformly in a cold data area and assigns each piece of data a global ID. The ID is the unique identifier of the data: its first three bytes respectively represent hot index, hot data and cold data, and the remaining bytes are the identifier fields of the specific data under the corresponding partition.

The data storage module extracts the fields to be searched to form a single piece of index data and adds the designed primary key. The original data is written into HBase according to the primary key, and the search data is written into the search index according to the primary key. After the data has been stored, abnormal storage detection is performed on the original data.

In a preferred embodiment, the cold and hot data partitioning module selects a batch of cold data and hot data as a data set according to the number of times the data in the database has been accessed, and constructs and trains a model based on a logistic regression algorithm, so that the model can report whether data is cold or hot from the data features it reads in. Once the data has been classified, the module marks the classification in the Rowkey by modifying the first three bytes of the designed Rowkey according to the classification result, and then writes the data into the cold and hot data regions of different nodes respectively. During writing, data is written into several Region partitions simultaneously, which improves data writing efficiency.

In a preferred embodiment, during design optimization the data index optimization module calculates the remaining space within 85% of the disk space, sets the size of each index shard, and divides the remaining disk space by the shard size to obtain the number of index shards for the cluster system. The number of index shards is thus allocated reasonably and set dynamically, which completes the creation of the index library and allows data to be written efficiently.

In a preferred embodiment, the flow of the data compression module is as follows:

(1) Setting a sliding window as M bytes and a buffer area as N bytes;

(2) Receiving new data;

(3) Calculating the differences s1, s2, s3, s4, s5 and s6 between the data at time T and the data at the preceding and following times;

(4) Calculating the mean residual between the actual value and the theoretical value, and setting the data fluctuation range;

(5) If the differences s1 to s6 are beyond the fluctuation range, marking the data as abnormal data and storing it directly; otherwise, adding the data to the sliding window;

(6) Updating and recording the fluctuation range of the data;

(7) If the fluctuation range of the data exceeds the set threshold, going to step (11); otherwise, continuing;

(8) Compression parameters;

(9) If the compression parameters have not converged, returning to step (2); otherwise, continuing;

(10) If the buffer area still holds new data, returning to step (2); otherwise, continuing;

(11) Compressing the data.
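The flow above can be illustrated with a minimal sketch. It is not the patented procedure: it assumes that the "compression parameters" are the slope and intercept of an ordinary least-squares line fitted to each full sliding window (in keeping with the stated goal of minimizing the squared difference between estimated and actual values), it counts the window in samples rather than bytes, and it reduces the six differences s1 to s6 to a single difference against the previous accepted sample. The names `fit_line`, `compress`, `window_size` and `tolerance` are hypothetical.

```python
from statistics import mean

def fit_line(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...; returns (slope, intercept)."""
    n = len(ys)
    if n == 1:
        return 0.0, float(ys[0])
    x_bar = (n - 1) / 2
    y_bar = mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in range(n))
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def compress(stream, window_size, tolerance):
    """Return (segments, anomalies).

    segments:  list of (point_count, slope, intercept), one per fitted window
    anomalies: list of (time_index, value) stored directly, uncompressed
    """
    segments, anomalies, window = [], [], []
    for t, value in enumerate(stream):
        # Simplified steps (3)-(5): a jump larger than the tolerance is treated
        # as abnormal data and stored as-is instead of entering the window.
        if window and abs(value - window[-1]) > tolerance:
            anomalies.append((t, value))
            continue
        window.append(value)
        if len(window) == window_size:
            # Simplified steps (8) and (11): replace the window by two parameters.
            segments.append((len(window), *fit_line(window)))
            window = []
    if window:  # flush whatever is left in the buffer
        segments.append((len(window), *fit_line(window)))
    return segments, anomalies
```

For example, `compress([1.0, 2.0, 3.0, 100.0, 4.0, 5.0], window_size=5, tolerance=3.0)` stores the sample at index 3 directly and represents the five remaining samples by one line segment.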

In a preferred embodiment, the abnormal storage detection module performs storage anomaly detection based on a least squares support vector machine algorithm, and its detection steps include:

(1) Data processing: performing zero-mean processing and normalization on the original data (training samples may be formed by adding random white noise to the measured data);

(2) Phase space reconstruction: converting the one-dimensional time series into a two-dimensional matrix to capture more comprehensive data information and the potential association patterns between the data;

(3) Model construction: establishing a time series prediction model based on the least squares support vector machine algorithm, with the prediction error kept to a minimum;

(4) Model prediction: applying the model to the obtained measurement data to obtain a predicted value;

(5) Obtaining an error value: comparing the predicted value with the actual value to obtain an error value;

(6) Calculating the regression function value: calculating the linear regression function of the time series data;

(7) Data comparison: comparing the linear regression function value of the time series data with a threshold; if it is larger than the threshold, issuing an early warning that an abnormality has occurred and ending the detection; otherwise, repeating steps (5) to (7).
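The detection steps above can be sketched in plain Python as follows. This is an illustrative reading, not the patented implementation: it assumes an RBF kernel, solves the standard LS-SVM regression linear system by Gaussian elimination, and compares the absolute prediction error (rather than a separately computed linear regression function value) with the threshold. All function names and the parameters `gamma`, `sigma`, `dim` and `threshold` are hypothetical.

```python
import math

def normalize(series):
    """Step (1): remove the mean and scale to a maximum magnitude of 1."""
    mu = sum(series) / len(series)
    centered = [v - mu for v in series]
    scale = max(abs(v) for v in centered) or 1.0
    return [v / scale for v in centered], mu, scale

def embed(series, dim):
    """Step (2): phase-space reconstruction into lag vectors and next-value targets."""
    rows = [series[i:i + dim] for i in range(len(series) - dim)]
    targets = [series[i + dim] for i in range(len(series) - dim)]
    return rows, targets

def rbf(a, b, sigma):
    return math.exp(-sum((p - q) ** 2 for p, q in zip(a, b)) / (2 * sigma ** 2))

def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def lssvm_fit(rows, targets, gamma=100.0, sigma=1.0):
    """Step (3): solve [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = len(rows)
    system = [[0.0] + [1.0] * n]
    for i in range(n):
        system.append([1.0] + [rbf(rows[i], rows[j], sigma) + (1.0 / gamma if i == j else 0.0)
                               for j in range(n)])
    solution = solve(system, [0.0] + list(targets))
    return solution[0], solution[1:]  # bias, alphas

def lssvm_predict(model, rows, x, sigma=1.0):
    """Step (4): predict the next value for lag vector x."""
    bias, alphas = model
    return bias + sum(a * rbf(row, x, sigma) for a, row in zip(alphas, rows))

def is_anomalous(model, rows, window, actual, threshold, sigma=1.0):
    """Steps (5)-(7), simplified: flag when |prediction - actual| exceeds the threshold."""
    return abs(lssvm_predict(model, rows, window, sigma) - actual) > threshold
```

A typical use would train on a stretch of normal storage measurements, then call `is_anomalous` on each new normalized sample together with the preceding `dim` samples.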

In a preferred embodiment, the application service layer module provides the collected resources and functions to users as services. It thereby meets the data analysis requirements of traffic prediction, road network planning and traffic management, provides a scientific reference for urban traffic management and control in China, and meets the demand for downloading urban road traffic statistics. By means of the resource, data and document exchange framework for urban road traffic flow, the cloud service platform enables data and service interaction between the systems.

In summary, by adopting the above technical scheme, the invention has the following beneficial effects:

In the invention, the data compression module is arranged in the data management module; by calculating the system parameters, it minimizes the squared difference between the estimated value and the actual value and thereby obtains the error of the compressed data. Applying these algorithms improves the efficiency of data compression and makes data transmission and storage more convenient and economical. Cold data is data that users access infrequently and that is not affected by overload; hot data is the opposite. By classifying data into these two types, the cold and hot data partitioning module reduces the proportion of cold data held in memory as far as possible, and can even reduce backups, so that user access requests for cold and hot data are load-balanced. This improves overall data storage and management efficiency, saves data storage space, and reduces the overall difficulty of storage management. It also makes the data easier for technicians to study, brings the technical advantages into full play, improves the utilization of data information, provides efficient storage and low-latency retrieval of the data, and allows the system to be applied in more fields.

Drawings

FIG. 1 is an overall system block diagram of the present invention;

FIG. 2 is a block diagram of a data management module system in accordance with the present invention;

FIG. 3 is a block diagram of the overall system flow in the present invention.

Reference numerals in the figures: 1, power supply module; 2, data acquisition layer module; 3, data transmission module; 4, data preprocessing module; 5, data storage module; 6, data management module; 7, application service layer module; 8, cold and hot data partitioning module; 9, data index optimization module; 10, data compression module; 11, abnormal storage detection module.

Detailed Description

The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

With reference to FIGS. 1-2, a multi-source heterogeneous big data storage and management system comprises a power supply module 1, a data acquisition layer module 2, a data transmission module 3, a data preprocessing module 4, a data storage module 5, a data management module 6, an application service layer module 7, a cold and hot data partitioning module 8, a data index optimization module 9, a data compression module 10 and an abnormal storage detection module 11. The output end of the power supply module 1 is connected with the input end of the data acquisition layer module 2, the output end of the data acquisition layer module 2 is connected with the input end of the data transmission module 3, the output end of the data transmission module 3 is connected with the input end of the data preprocessing module 4, the output end of the data preprocessing module 4 is connected with the input end of the data storage module 5, the output end of the data storage module 5 is connected with the input end of the data management module 6, and the output end of the data management module 6 is connected with the input end of the application service layer module 7.

The cold and hot data partitioning module 8, the data index optimization module 9, the data compression module 10 and the abnormal storage detection module 11 are arranged inside the data management module 6, and their combined output end is connected with the input end of the data management module 6.

The data acquisition layer module 2 provides three interfaces, one for each data structure (structured data, semi-structured data and unstructured data), and classifies the three types of data during acquisition, so that structured data and part of the semi-structured data, such as CSV file data, can be stored in a database, while the rest of the semi-structured data, such as PDF file data, and the unstructured data are stored locally.
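As a rough illustration of this classification, the following sketch routes incoming files by extension. Only the CSV and PDF assignments come from the description; every other extension in the tables, and the function name `route`, are assumptions made for the example.

```python
from pathlib import Path

# Hypothetical extension tables. The description itself names only CSV
# (semi-structured, stored in a database) and PDF (semi-structured, stored locally).
STRUCTURED = {".sql", ".db"}
SEMI_STRUCTURED_TO_DATABASE = {".csv", ".json", ".xml"}
SEMI_STRUCTURED_TO_LOCAL = {".pdf"}

def route(filename: str) -> tuple[str, str]:
    """Return (data category, storage destination) for an incoming file."""
    extension = Path(filename).suffix.lower()
    if extension in STRUCTURED:
        return "structured", "database"
    if extension in SEMI_STRUCTURED_TO_DATABASE:
        return "semi-structured", "database"
    if extension in SEMI_STRUCTURED_TO_LOCAL:
        return "semi-structured", "local"
    # Anything unrecognized (images, audio, free text) is treated as unstructured.
    return "unstructured", "local"
```

A real acquisition layer would more likely inspect content or source metadata than file names; the extension lookup is used here only to keep the example short.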

When the data transmission module 3 is in the receive mode, a single receive mode and a continuous receive mode are set using LoRaMode of the LSD4RF-2F917N 10. In the single receive mode, the demodulator first determines the size of the data to be received from the machine-vision segmentation of the data boundaries; if no target signal is recognized within one period of the rated time window, the demodulator returns to standby mode. If the target signal is recognized, the execution parameter information is written into PayloadCrcE in the manner indicated by the application service layer module 7, and the communication data is thereby received. In the continuous receive mode, after determining the size of the data, the demodulator does not use the period of the time window as the reference for target signal recognition; instead, it uses the frequency difference to perform signal detection and data reception in real time over the full period. In this way, smooth communication is ensured.
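The difference between the two receive modes can be modeled with a small hardware-independent simulation. It does not use any real radio driver API: time is divided into slots, `None` stands for a slot without a target signal, and the names `Mode`, `receive` and `window` are hypothetical. The sketch also assumes that single mode returns to standby after one successful reception, which the description does not state explicitly.

```python
from enum import Enum

class Mode(Enum):
    SINGLE = "single"
    CONTINUOUS = "continuous"

def receive(mode, slots, window):
    """Simulate reception over a sequence of time slots.

    slots:  iterable of payloads, with None where no target signal is present
    window: number of slots in the single-mode time window
    Returns (received payloads, final demodulator state).
    """
    payloads = []
    for index, signal in enumerate(slots):
        if signal is not None:
            payloads.append(signal)
            if mode is Mode.SINGLE:
                # Assumption: one packet ends a single reception.
                return payloads, "standby"
        elif mode is Mode.SINGLE and index + 1 >= window:
            # No target signal inside the time window: fall back to standby.
            return payloads, "standby"
    final_state = "standby" if mode is Mode.SINGLE else "receiving"
    return payloads, final_state
```

In continuous mode the loop ignores the window entirely and keeps collecting payloads for as long as slots arrive, which mirrors the full-period reception described above.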

The data preprocessing module 4 first stores all data uniformly in a cold data area and assigns each piece of data a global ID. The ID is the unique identifier of the data: its first three bytes respectively represent hot index, hot data and cold data, and the remaining bytes are the identifier fields of the specific data under the corresponding partition.

The data storage module 5 extracts the fields to be searched to form a single piece of index data and adds the designed primary key. Cold and hot partitioning of conventional data is performed in the third stage of the data processing layer, at which point the HBase primary key is designed. The original data is written into HBase according to the primary key, and the search data is written into the search index according to the primary key. After the data has been stored, abnormal storage detection is performed on the original data.
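The ID layout and the key-based double write can be sketched as follows. The sketch assumes that each of the three prefix bytes is a 0/1 flag, which is one possible reading of the description, and it uses two in-memory dictionaries as stand-ins for HBase and the search index; no real HBase or search client API is shown. The names `make_global_id`, `parse_global_id` and `store` are hypothetical.

```python
def make_global_id(identifier: bytes, hot_index: bool = False,
                   hot: bool = False, cold: bool = True) -> bytes:
    """Prefix the record identifier with three one-byte flags.

    Assumed byte order: hot index, hot data, cold data. New records default
    to cold because all data is first stored in the cold data area.
    """
    return bytes([int(hot_index), int(hot), int(cold)]) + identifier

def parse_global_id(global_id: bytes) -> dict:
    """Split a global ID back into its three flags and the identifier field."""
    hot_index, hot, cold = (bool(b) for b in global_id[:3])
    return {"hot_index": hot_index, "hot": hot, "cold": cold,
            "identifier": global_id[3:]}

def store(record: dict, search_fields: list, key: bytes,
          raw_store: dict, search_store: dict) -> None:
    """Write the full record and its searchable fields under the same key.

    raw_store stands in for HBase and search_store for the search index.
    """
    raw_store[key] = record
    search_store[key] = {f: record[f] for f in search_fields if f in record}
```

Because both stores share the key, a hit in the search index can be resolved to the full record with a single lookup by key.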

The cold and hot data partitioning module 8 first selects a batch of cold data and hot data as a data set according to the number of times the data in the database has been accessed, and constructs and trains a model based on a logistic regression algorithm, so that the model can output whether data is cold or hot from the data features it takes as input. Once the data has been classified, the module marks the classification in the Rowkey: according to the classification result, where hot data is 1 and cold data is 0, it modifies the first three bytes of the designed Rowkey, which respectively represent hot index, hot data and cold data, and then writes the data into the cold and hot data regions of different nodes respectively. During writing, data is written into several Region partitions simultaneously, which improves data writing efficiency.
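A minimal sketch of such a classifier is given below, assuming a single feature (the access count) and plain batch gradient descent; the description does not specify the features, the optimizer or any hyperparameters. The output convention (1 for hot, 0 for cold) follows the description, while the names `train_cold_hot` and `classify` and the sample data are hypothetical.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train_cold_hot(access_counts, labels, learning_rate=1.0, epochs=5000):
    """Fit a one-feature logistic regression; labels are 1 (hot) or 0 (cold)."""
    scale = max(access_counts) or 1          # keep the feature in [0, 1]
    weight, bias = 0.0, 0.0
    n = len(access_counts)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for count, label in zip(access_counts, labels):
            x = count / scale
            error = sigmoid(weight * x + bias) - label
            grad_w += error * x
            grad_b += error
        weight -= learning_rate * grad_w / n
        bias -= learning_rate * grad_b / n
    return weight, bias, scale

def classify(model, access_count) -> int:
    """Return 1 for hot data and 0 for cold data."""
    weight, bias, scale = model
    return int(sigmoid(weight * access_count / scale + bias) >= 0.5)
```

With a model trained on counts such as `[1, 2, 3, 5, 40, 60, 80, 100]` labeled `[0, 0, 0, 0, 1, 1, 1, 1]`, the returned label can then be used to set the hot or cold flag in the Rowkey prefix.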

During design optimization, the data index optimization module 9 calculates the remaining space within 85% of the disk space, sets the size of each index shard, and divides the remaining disk space by the shard size to obtain the number of index shards for the cluster system. The number of index shards is thus allocated reasonably and set dynamically, which completes the creation of the index library and allows data to be written efficiently.
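The calculation can be written out as a short function. It assumes that "the remaining space within 85% of the disk space" means the space still free below an 85% usage limit, uses integer arithmetic to avoid rounding surprises, and returns at least one shard; these choices and the name `index_shard_count` are assumptions, not part of the description.

```python
def index_shard_count(total_bytes: int, used_bytes: int, shard_bytes: int,
                      usable_percent: int = 85) -> int:
    """Number of index shards that fit below the usable-space limit."""
    limit = total_bytes * usable_percent // 100   # 85% of the disk
    remaining = max(0, limit - used_bytes)        # space still free below the limit
    # Assumption: an index always needs at least one shard.
    return max(1, remaining // shard_bytes)
```

For a 1000 GiB disk with 250 GiB in use and 30 GiB shards, the limit is 850 GiB, 600 GiB remain, and the result is 20 shards. On a live node the first two arguments could be taken from `shutil.disk_usage(path)`, which reports total, used and free bytes.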

The flow of the data compression module 10 is:

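(1) Setting a sliding window as M bytes and a buffer area as N bytes;

(2) Receiving new data;

(3) Calculating the differences s1, s2, s3, s4, s5 and s6 between the data at time T and the data at the preceding and following times;

(4) Calculating the mean residual between the actual value and the theoretical value, and setting the data fluctuation range;

(5) If the differences s1 to s6 are beyond the fluctuation range, marking the data as abnormal data and storing it directly; otherwise, adding the data to the sliding window;

(6) Updating and recording the fluctuation range of the data;

(7) If the fluctuation range of the data exceeds the set threshold, going to step (11); otherwise, continuing;

(8) Compression parameters;

(9) If the compression parameters have not converged, returning to step (2); otherwise, continuing;

(10) If the buffer area still holds new data, returning to step (2); otherwise, continuing;

(11) Compressing the data.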

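The abnormal storage detection module 11 performs storage anomaly detection based on a least squares support vector machine algorithm, and its detection steps include:

(1) Data processing: performing zero-mean processing and normalization on the original data (training samples may be formed by adding random white noise to the measured data);

(2) Phase space reconstruction: converting the one-dimensional time series into a two-dimensional matrix to capture more comprehensive data information and the potential association patterns between the data;

(3) Model construction: establishing a time series prediction model based on the least squares support vector machine algorithm, with the prediction error kept to a minimum;

(4) Model prediction: applying the model to the obtained measurement data to obtain a predicted value;

(5) Obtaining an error value: comparing the predicted value with the actual value to obtain an error value;

(6) Calculating the regression function value: calculating the linear regression function of the time series data;

(7) Data comparison: comparing the linear regression function value of the time series data with a threshold; if it is larger than the threshold, issuing an early warning that an abnormality has occurred and ending the detection; otherwise, repeating steps (5) to (7).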

The application service layer module 7 provides the collected resources and functions to users as services. It thereby meets the data analysis requirements of traffic prediction, road network planning and traffic management, provides a scientific reference for urban traffic management and control in China, and meets the demand for downloading urban road traffic statistics. By means of the resource, data and document exchange framework for urban road traffic flow, the cloud service platform enables data and service interaction between the systems.

In the invention, the data compression module 10 is arranged in the data management module 6; by calculating the system parameters, it minimizes the squared difference between the estimated value and the actual value and thereby obtains the error of the compressed data. Applying these algorithms improves the efficiency of data compression and makes data transmission and storage more convenient and economical. Cold data is data that users access infrequently and that is not affected by overload; hot data is the opposite. By classifying data into these two types, the cold and hot data partitioning module 8 reduces the proportion of cold data held in memory as far as possible, and can even reduce backups, so that user access requests for cold and hot data are load-balanced. This improves overall data storage and management efficiency, saves data storage space, and reduces the overall difficulty of storage management. It also makes the data easier for technicians to study, brings the technical advantages into full play, improves the utilization of data information, provides efficient storage and low-latency retrieval of the data, and allows the system to be applied in more fields.

It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A multi-source heterogeneous big data storage and management system, comprising a power supply module (1), a data acquisition layer module (2), a data transmission module (3), a data preprocessing module (4), a data storage module (5), a data management module (6), an application service layer module (7), a cold and hot data partitioning module (8), a data index optimization module (9), a data compression module (10) and an abnormal storage detection module (11), characterized in that: the output end of the power supply module (1) is connected with the input end of the data acquisition layer module (2), the output end of the data acquisition layer module (2) is connected with the input end of the data transmission module (3), the output end of the data transmission module (3) is connected with the input end of the data preprocessing module (4), the output end of the data preprocessing module (4) is connected with the input end of the data storage module (5), the output end of the data storage module (5) is connected with the input end of the data management module (6), and the output end of the data management module (6) is connected with the input end of the application service layer module (7).

2. A multi-source heterogeneous big data storage and management system according to claim 1, wherein: the cold and hot data partitioning module (8), the data index optimization module (9), the data compression module (10) and the abnormal storage detection module (11) are arranged inside the data management module (6), and their combined output end is connected with the input end of the data management module (6).

3. A multi-source heterogeneous big data storage and management system according to claim 1, wherein: the data acquisition layer module (2) provides three interfaces, one for each data structure (structured data, semi-structured data and unstructured data), and classifies the three types of data during acquisition, so that structured data and part of the semi-structured data can be stored in a database while the rest of the semi-structured data and the unstructured data are stored locally.

4. A multi-source heterogeneous big data storage and management system according to claim 1, wherein: when the data transmission module (3) is in the receive mode, a single receive mode and a continuous receive mode are set using LoRaMode of the LSD4RF-2F917N 10; in the single receive mode, the demodulator first determines the size of the data to be received from the machine-vision segmentation of the data boundaries, and if no target signal is recognized within one period of the rated time window, the demodulator returns to standby mode; if the target signal is recognized, the execution parameter information is written into PayloadCrcE in the manner indicated by the application service layer module (7), and the communication data is thereby received; in the continuous receive mode, after determining the size of the data, the demodulator does not use the period of the time window as the reference for target signal recognition, but uses the frequency difference to perform signal detection and data reception in real time over the full period; in this way, smooth communication is ensured.

5. A multi-source heterogeneous big data storage and management system according to claim 1, wherein: the data preprocessing module (4) first stores all data uniformly in a cold data area and assigns each piece of data a global ID, the ID being the unique identifier of the data, its first three bytes respectively representing hot index, hot data and cold data, and the remaining bytes being the identifier fields of the specific data under the corresponding partition;

the data storage module (5) extracts the fields to be searched to form a single piece of index data and adds the designed primary key; cold and hot partitioning of conventional data is performed in the third stage of the data processing layer, at which point the HBase primary key is designed; the original data is written into HBase according to the primary key, and the search data is written into the search index according to the primary key; after the data has been stored, abnormal storage detection is performed on the original data.

6. A multi-source heterogeneous big data storage and management system according to claim 1, wherein: the cold and hot data partitioning module (8) first selects a batch of cold data and hot data as a data set according to the number of times the data in the database has been accessed, and constructs and trains a model based on a logistic regression algorithm, so that the model can output whether data is cold or hot from the data features it takes as input; once the data has been classified, the module marks the classification in the Rowkey by modifying the first three bytes of the designed Rowkey according to the classification result, and then writes the data into the cold and hot data regions of different nodes respectively; during writing, data is written into several Region partitions simultaneously, which improves data writing efficiency.

7. A multi-source heterogeneous big data storage and management system according to claim 1, wherein: during design optimization, the data index optimization module (9) calculates the remaining space within 85% of the disk space, sets the size of each index shard, and divides the remaining disk space by the shard size to obtain the number of index shards for the cluster system, so that the number of index shards is allocated reasonably and set dynamically, which completes the creation of the index library and allows data to be written efficiently.

8. A multi-source heterogeneous big data storage and management system according to claim 1, wherein: the flow of the data compression module (10) is as follows:

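(1) Setting a sliding window as M bytes and a buffer area as N bytes;

(2) Receiving new data;

(3) Calculating the differences s1, s2, s3, s4, s5 and s6 between the data at time T and the data at the preceding and following times;

(4) Calculating the mean residual between the actual value and the theoretical value, and setting the data fluctuation range;

(5) If the differences s1 to s6 are beyond the fluctuation range, marking the data as abnormal data and storing it directly; otherwise, adding the data to the sliding window;

(6) Updating and recording the fluctuation range of the data;

(7) If the fluctuation range of the data exceeds the set threshold, going to step (11); otherwise, continuing;

(8) Compression parameters;

(9) If the compression parameters have not converged, returning to step (2); otherwise, continuing;

(10) If the buffer area still holds new data, returning to step (2); otherwise, continuing;

(11) Compressing the data.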

9. A multi-source heterogeneous big data storage and management system according to claim 1, wherein: the abnormal storage detection module (11) performs storage anomaly detection based on a least squares support vector machine algorithm, and the detection steps of the abnormal storage detection module (11) comprise:

(1) Data processing: performing zero-mean processing and normalization on the original data, wherein training samples may be formed by adding random white noise to the measured data;

(2) Phase space reconstruction: converting the one-dimensional time series into a two-dimensional matrix to capture more comprehensive data information and the potential association patterns between the data;

(3) Model construction: establishing a time series prediction model based on the least squares support vector machine algorithm, with the prediction error kept to a minimum;

(4) Model prediction: applying the model to the obtained measurement data to obtain a predicted value;

(5) Obtaining an error value: comparing the predicted value with the actual value to obtain an error value;

(6) Calculating the regression function value: calculating the linear regression function of the time series data;

(7) Data comparison: comparing the linear regression function value of the time series data with a threshold; if it is larger than the threshold, issuing an early warning that an abnormality has occurred and ending the detection; otherwise, repeating steps (5) to (7).

10. A multi-source heterogeneous big data storage and management system according to claim 1, wherein: the application service layer module (7) provides the collected resources and functions to users as services, thereby meeting the data analysis requirements of traffic prediction, road network planning and traffic management, providing a scientific reference for urban traffic management and control in China, and meeting the demand for downloading urban road traffic statistics; by means of the resource, data and document exchange framework for urban road traffic flow, the cloud service platform enables data and service interaction between the systems.