CN112084190A

CN112084190A - Big data based acquired data real-time storage and management system and method

Info

Publication number: CN112084190A
Application number: CN202010900066.1A
Authority: CN
Inventors: 程德心; 周风明; 郝江波; 周昭晖
Original assignee: Wuhan Kotei Informatics Co Ltd
Current assignee: Wuhan Kotei Informatics Co Ltd
Priority date: 2020-08-31
Filing date: 2020-08-31
Publication date: 2020-12-15

Abstract

The embodiment of the invention provides a big data-based system and method for real-time storage and management of collected data. The collected data uploaded in real time is stored and sorted by the modules of a big data framework, so that data in different formats are stored by category and massive collected data are stored in real time and classified automatically. Real-time uploading and storage improves storage efficiency, the required data can be retrieved conveniently during research, and labor and hardware costs are reduced.

Description

Big data based acquired data real-time storage and management system and method

Technical Field

The embodiment of the invention relates to the technical field of big data processing, in particular to a big data-based collected data real-time storage and management system and method.

Background

An autonomous vehicle is an intelligent vehicle that achieves unmanned driving by means of a computer system. It relies on the cooperation of artificial intelligence, visual computing, radar, monitoring devices and a global positioning system, so that the computer can operate the motor vehicle automatically and safely without active human operation.

Collected data is one of the important data sources for training automatic driving and is indispensable for high-precision maps, road analysis and driving-behavior decision analysis. How to manage such data uniformly and store it efficiently is an important research topic in the field of automatic driving.

Feeding the complex road conditions and vehicle driving conditions in China back into research inevitably requires massive data, and a storage, management and reading process is needed between the sensors (such as cameras, laser radar and millimeter wave radar) and the data usable by researchers. Traditionally, collected data is stored on data hard disks or in a small database, and every collection or update requires manual maintenance, classification and summarization. As the number of collection devices grows, the manual maintenance cost rises sharply, which is unfavorable for storing and managing massive data. As the level of automatic driving increases, the collected data also grows by orders of magnitude, so the traditional storage approach requires higher labor cost and a large amount of supporting hardware equipment.

Disclosure of Invention

The embodiment of the invention provides a big data-based collected data real-time storage and management system and method, which are used to solve the problem in the prior art that, because collected data grows by orders of magnitude, data storage requires not only higher labor cost but also a large amount of supporting hardware equipment.

In a first aspect, an embodiment of the present invention provides a big data-based collected data real-time storage and management system, including a distributed system architecture Hadoop, where the distributed system architecture Hadoop includes a distributed column storage database HBase, a distributed file system HDFS, and a programming model MapReduce;

the distributed column storage database HBase is used for receiving the collected data uploaded by the data interface, randomly accessing the collected data in real time, and writing a big data file with the size exceeding a set threshold value in the collected data into the distributed file system HDFS;

the distributed file system HDFS is used for file management, file storage and file acquisition;

the programming model MapReduce is used for carrying out screening operation and processing on collected data so as to classify and store data of different formats, establish SQL query, and import the result of Reduce into a distributed column storage database HBase after the result is summarized.

Further, the distributed column storage database HBase is further configured to:

taking the position of the big data file in the distributed file system HDFS as an index, and replacing the content of the big data file in the distributed column storage database HBase with the index.
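The threshold routing and index replacement described above can be illustrated with a minimal in-memory sketch. It does not use the real HBase or HDFS client APIs: `TinyStore`, the `/collected/` path prefix and the 10 MiB value of `LARGE_FILE_THRESHOLD` are all hypothetical, since the text only speaks of "a set threshold".

```python
from dataclasses import dataclass, field

# Assumed value for illustration; the text only says "a set threshold".
LARGE_FILE_THRESHOLD = 10 * 1024 * 1024  # 10 MiB


@dataclass
class TinyStore:
    """In-memory stand-in for an HBase table plus an HDFS namespace."""

    hbase_rows: dict = field(default_factory=dict)  # row key -> cell
    hdfs_files: dict = field(default_factory=dict)  # HDFS path -> bytes

    def put(self, row_key: str, payload: bytes,
            threshold: int = LARGE_FILE_THRESHOLD) -> None:
        if len(payload) > threshold:
            # Large file: write the bytes to "HDFS" and keep only the path
            # (the index) in the "HBase" cell.
            path = f"/collected/{row_key}"  # hypothetical path layout
            self.hdfs_files[path] = payload
            self.hbase_rows[row_key] = {"kind": "index", "value": path}
        else:
            # Small record: store the bytes inline in the "HBase" cell.
            self.hbase_rows[row_key] = {"kind": "inline", "value": payload}

    def get(self, row_key: str) -> bytes:
        cell = self.hbase_rows[row_key]
        if cell["kind"] == "index":
            # Follow the index to the file content held in "HDFS".
            return self.hdfs_files[cell["value"]]
        return cell["value"]
```

With this layout a reader always starts from the HBase row; only rows whose cell is an index require a second lookup in HDFS.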

Further, the system comprises a large-scale data analysis platform Pig, which is used for reading the Hadoop configuration file of the distributed system architecture to obtain the machines on which the NameNode and JobTracker processes run, converting data analysis requests into a series of MapReduce jobs according to the size of the data set and running them in the distributed system architecture Hadoop, and, based on Pig Latin, a data-flow-oriented SQL-like language, loading, merging, filtering, sorting, grouping and joining big data in Hadoop and applying functions to the data set.

The system further comprises a data warehouse tool Hive, wherein the data warehouse tool Hive is used for establishing a complete Structured Query Language (SQL) query, converting SQL statements into a MapReduce task for processing, and finally summarizing Reduce results.

Further, the HDFS comprises a NameNode node and a plurality of DataNode nodes; the NameNode node is used for storing and managing file metadata, and the DataNode node is used for storing an HDFS file in a data form, responding to a read-write request of an HDFS client, periodically reporting heartbeat information to the NameNode node, periodically reporting data block information to the NameNode node, and periodically reporting caching data block information to the NameNode node.

In a second aspect, an embodiment of the present invention provides a method for storing and managing collected data in real time based on big data, including:

receiving collected data uploaded by a data interface based on a distributed column storage database HBase, randomly accessing the collected data in real time, and writing a big data file with the size exceeding a set threshold in the collected data into the HDFS;

performing file management, file storage and file acquisition based on the distributed file system HDFS;

performing screening operations and processing on the collected data based on the programming model MapReduce, so as to classify and store data in different formats, establish SQL queries, and import the summarized Reduce results into the distributed column storage database HBase.

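Further, after writing the big data file with the size exceeding the set threshold in the collected data into the distributed file system HDFS, the method further includes:

taking the position of the big data file in the distributed file system HDFS as an index, and replacing the content of the big data file in the distributed column storage database HBase with the index.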

Further, the method also includes:

reading the Hadoop configuration file of the distributed system architecture to obtain the machines on which the NameNode and JobTracker processes run, converting data analysis requests into a series of MapReduce jobs according to the size of the data set and running them in the distributed system architecture Hadoop, and, based on Pig Latin, a data-flow-oriented SQL-like language, loading, merging, filtering, sorting, grouping and joining big data in Hadoop and applying functions to the data set.

In a third aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method for storing and managing collected data based on big data in real time according to the embodiment of the second aspect of the present invention when executing the program.

In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the big-data-based collected data real-time storage and management method according to an embodiment of the second aspect of the present invention.

According to the big data-based collected data real-time storage and management system and method provided by the embodiments of the invention, the collected data uploaded in real time is stored and sorted by the modules of the big data framework, so that data in different formats are stored by category and massive collected data are stored in real time and classified automatically. Real-time uploading and storage improves storage efficiency. Compared with manually collecting data from hard disks and then summarizing, analyzing, classifying and storing it again, efficiency is improved and industrial cost is reduced; subsequent use of the data is facilitated and maintenance by administrators is easier; storage on a large number of hard disks is avoided, and the kind of data loss that occurs with data hard disks is less likely. The required data can be retrieved conveniently during research, and labor and hardware costs are reduced.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a schematic structural diagram of a device for real-time storage and management of collected data based on big data according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method for real-time storage and management of collected data based on big data according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a physical structure according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the embodiments of the present application, the term "and/or" merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone.

The terms "first" and "second" in the embodiments of the present application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, the terms "comprise" and "have", as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a system, product or apparatus that comprises a list of elements or components is not limited to only those elements or components but may alternatively include other elements or components not expressly listed or inherent to such product or apparatus. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless explicitly specifically limited otherwise.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

Traditionally, collected data is stored on data hard disks or in a small database, and every collection or update requires manual maintenance, classification and summarization. As the number of collection devices grows, the manual maintenance cost rises sharply, which is unfavorable for storing and managing massive data. As the level of automatic driving increases, the collected data also grows by orders of magnitude, so the traditional storage approach requires higher labor cost and a large amount of supporting hardware equipment.

Therefore, the embodiments of the invention provide a big data-based system and method for real-time storage and management of collected data. The collected data uploaded in real time is stored and sorted by the modules of a big data framework, so that data in different formats are stored by category and massive collected data are stored in real time and classified automatically. Real-time uploading and storage improves storage efficiency; compared with manually summarizing, analyzing, classifying and re-storing data from hard disks, efficiency is improved and industrial cost is reduced. The description below proceeds with reference to various embodiments.

Fig. 1 is a diagram of a big data-based collected data real-time storage and management system according to an embodiment of the present invention, which includes a distributed system architecture Hadoop, where the distributed system architecture Hadoop includes a distributed column storage database HBase, a distributed file system (HDFS), and a programming model MapReduce;

Fig. 2 is a flow chart of a method for storing and managing collected data based on big data in real time according to an embodiment of the present invention, and the following describes a principle of a system for storing and managing collected data based on big data in real time according to an embodiment of the present invention with reference to fig. 1 and fig. 2.

Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware; it provides high-throughput access to application data and is suitable for applications with very large data sets. HDFS relaxes some POSIX requirements so that data in the file system can be accessed as a stream. The core of the Hadoop framework is HDFS and the programming model MapReduce: HDFS provides storage for massive data, while MapReduce provides computation over massive data.

Hadoop is made up of many elements. The bottommost part is the Hadoop Distributed File System (HDFS), which stores files across all storage nodes in the Hadoop cluster. The layer above HDFS is the MapReduce engine, which consists of JobTrackers and TaskTrackers.

To external clients, HDFS behaves like a traditional hierarchical file system. Files may be created, deleted, moved, or renamed, among other things.

Files stored in HDFS are divided into blocks, and the blocks are then replicated to multiple computers (DataNodes). This differs from a conventional RAID architecture. The block size (64 MB by default in version 1.x, 128 MB by default in version 2.x) and the number of replicas of each block are determined by the client when the file is created.
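As a rough illustration of the block arithmetic described above, the following sketch computes how many blocks a file occupies and how much raw storage its replicas consume. The function name `plan_blocks` is hypothetical; the defaults assume the 2.x block size of 128 MB quoted in the text and a replication factor of 3.

```python
def plan_blocks(file_size: int,
                block_size: int = 128 * 1024 * 1024,
                replication: int = 3) -> tuple:
    """Return (block count, size of the last block, total bytes stored).

    All sizes are in bytes. The last block may be shorter than block_size.
    """
    if file_size < 0 or block_size <= 0 or replication < 1:
        raise ValueError("sizes must be non-negative and factors positive")
    if file_size == 0:
        return 0, 0, 0
    # Integer ceiling division avoids floating-point rounding on huge files.
    n_blocks = -(-file_size // block_size)
    last_block = file_size - (n_blocks - 1) * block_size
    # Every block is stored `replication` times across the DataNodes.
    return n_blocks, last_block, file_size * replication
```

For example, under these assumptions a 300 MiB recording occupies three blocks (128 + 128 + 44 MiB) and 900 MiB of raw cluster storage.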

MapReduce is a programming model for processing large sets of semi-structured data. A programming model is a way of structuring and solving a particular class of problem. In this embodiment, MapReduce mainly performs parallel computation over large-scale data: it analyzes the format of the collected data, such as the csv format for tables, the mpeg or flv format for video, and the png or jpg format for pictures, and then classifies the data accordingly.

HBase is built on top of HDFS and is used for random, real-time access to massive data.
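The format analysis and classification step can be sketched as a map phase, a shuffle and a reduce phase in plain Python. This is only a single-process illustration of the programming model, not Hadoop code; the extension table follows the formats named in the text, while the function names and the "other" category are assumptions.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Extension-to-category table built from the formats named in the text.
CATEGORY_BY_EXTENSION = {
    ".csv": "table",
    ".mpeg": "video", ".flv": "video",
    ".png": "picture", ".jpg": "picture",
}


def map_phase(file_name):
    """Map: emit one (category, file name) pair per input file."""
    ext = PurePosixPath(file_name).suffix.lower()
    yield CATEGORY_BY_EXTENSION.get(ext, "other"), file_name


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(category, file_names):
    """Reduce: produce the sorted file list for one category."""
    return category, sorted(file_names)


def classify(file_names):
    """Run map, shuffle and reduce over a list of file names."""
    pairs = (pair for name in file_names for pair in map_phase(name))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real cluster the map and reduce functions would run in parallel on different nodes; here they run sequentially so the data flow is easy to follow.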

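On the basis of the above embodiment, the distributed column storage database HBase is further configured to:

take the position of the big data file in the distributed file system HDFS as an index, and replace the content of the big data file in the distributed column storage database HBase with the index.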

On the basis of the above embodiments, the system further comprises a large-scale data analysis platform Pig, which is used for reading the Hadoop configuration file of the distributed system architecture to obtain the machines on which the NameNode and JobTracker processes run, converting data analysis requests into a series of MapReduce jobs according to the size of the data set and running them in the distributed system architecture Hadoop, and, based on Pig Latin, a data-flow-oriented SQL-like language, loading, merging, filtering, sorting, grouping and joining the data set and applying functions to it.

Pig is a Hadoop-based large-scale data analysis platform. It provides an SQL-like language called Pig Latin, whose compiler converts SQL-like data analysis requests into a series of optimized MapReduce operations. Pig thus provides a simple operation and programming interface for complex, massively parallel data computation.
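The data-flow style of the operations listed above (load, filter, group, sort, apply a function) can be mimicked in plain Python. This sketch is not Pig Latin and does not call Pig; the record fields `sensor`, `fmt` and `size` and all function names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def load(rows):
    """LOAD: turn raw tuples into records with named fields."""
    return [{"sensor": s, "fmt": f, "size": n} for s, f, n in rows]


def filter_by(records, predicate):
    """FILTER: keep only the records for which the predicate holds."""
    return [r for r in records if predicate(r)]


def group_by(records, key):
    """GROUP: collect records that share the same value of `key`."""
    ordered = sorted(records, key=itemgetter(key))
    return {k: list(g) for k, g in groupby(ordered, key=itemgetter(key))}


def total_size_per_group(groups):
    """Apply SUM to each group, then ORDER the groups by total, descending."""
    totals = [(k, sum(r["size"] for r in rs)) for k, rs in groups.items()]
    return sorted(totals, key=itemgetter(1), reverse=True)
```

Each step consumes the output of the previous one, which is the property that lets a compiler turn such a script into a chain of MapReduce jobs.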

On the basis of the above embodiments, the system further comprises a data warehouse tool Hive, wherein the data warehouse tool Hive is used for establishing a complete Structured Query Language (SQL) query, converting an SQL statement into a MapReduce task for processing, and finally, summarizing Reduce results.

The big data set is queried through the SQL analysis tool Hive. Hive runs on top of Hadoop: a directory is created for Hive on HDFS, Hive's data is stored in HDFS, and the data model in Hive takes the form of tables. Hive query execution strictly follows the Hadoop MapReduce job execution model: Hive converts the user's HiveQL statement into MapReduce jobs through an interpreter and submits them to the Hadoop cluster, and Hadoop monitors job execution and then returns the result to the user.
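To make the SQL-to-MapReduce conversion concrete, the sketch below shows how a query of the shape `SELECT column, COUNT(*) FROM table GROUP BY column` decomposes into a map step and a reduce step. It is a single-process illustration, not Hive's actual query planner, and the function name `run_group_count` is hypothetical.

```python
from collections import Counter


def run_group_count(table, column):
    """Emulate SELECT column, COUNT(*) FROM table GROUP BY column."""
    # Map: emit (group key, 1) for every row of the table.
    mapped = [(row[column], 1) for row in table]
    # Reduce: sum the emitted ones for each key to summarize the result.
    counts = Counter()
    for key, one in mapped:
        counts[key] += one
    return dict(counts)
```

The GROUP BY column becomes the map output key, and the aggregate function becomes the reduce function.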

On the basis of the above embodiments, the HDFS includes a NameNode node and a plurality of DataNode nodes; the NameNode node is used for storing and managing file metadata, and the DataNode node is used for storing an HDFS file in a data form, responding to a read-write request of an HDFS client, periodically reporting heartbeat information to the NameNode node, periodically reporting data block information to the NameNode node, and periodically reporting caching data block information to the NameNode node.

HDFS supports a traditional hierarchical file organization. A user or a program may create directories and store files in them. The namespace hierarchy of the file system is similar to that of other file systems. Files may be created, moved from one directory to another, or renamed.

Files are managed, stored and retrieved through HDFS, a highly fault-tolerant distributed file system, and the throughput of the system is adjusted according to the magnitude of the collected data. NameNode (NN) nodes store and manage the metadata of the system, DataNode (DN) nodes store the data of the files, and the NameNode and DataNode nodes communicate at regular intervals through a heartbeat mechanism.

The main function of the NameNode node is to receive read and write requests from clients.

The NameNode node stores metadata information, which mainly includes:

(1) The file owner and permissions

(2) Which blocks a file contains

(3) Which DataNode node each block is stored on (reported by the DataNode node when it starts)

The information about which DataNode node each block is stored on is not kept on the NameNode node's disk. When the HDFS system starts, each DataNode node reports this information to the NameNode node, which keeps it in memory, and the DataNode node reports it again at intervals.

After startup, the metadata information of the NameNode node is loaded into memory:

(1) The metadata is stored on disk in a file named fsimage

(2) The location information of blocks is not saved in the fsimage file

(3) The edits file records the operation log of the metadata. Operations such as deleting or uploading a file are recorded in the edits file rather than being applied to the fsimage file immediately; the edits file and the fsimage file are merged at intervals, at which point the corresponding data is deleted or added.
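The relationship between the fsimage file and the edits log can be sketched as follows. This is a simplified model in which the namespace is just a set of paths and only "add" and "delete" operations exist; the function names and the tuple format of an edit are assumptions, not the real on-disk formats.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to an in-memory namespace (a set of paths)."""
    op, path = edit
    if op == "add":
        namespace.add(path)
    elif op == "delete":
        namespace.discard(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def load_namespace(fsimage, edits):
    """Rebuild the in-memory namespace: load fsimage, then replay the edits."""
    namespace = set(fsimage)
    for edit in edits:
        apply_edit(namespace, edit)
    return namespace


def checkpoint(fsimage, edits):
    """Merge the edits log into a new fsimage and return it with an empty log."""
    return sorted(load_namespace(fsimage, edits)), []
```

Because every change is appended to the log first, the namespace rebuilt from the old image plus its log is identical to the namespace rebuilt from the merged image alone.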

When a DataNode (DN) starts, it reports its block location information to the NameNode and then keeps in contact with the NameNode by sending heartbeats (once every 3 seconds). If the NameNode receives no heartbeat from a DataNode for 10 minutes, it considers the node lost and copies the blocks held on that node to other DataNode nodes, so that the minimum number of replicas (3 by default) is maintained.

Metadata is always held in two parts: one part is stored on disk and the other in memory. The location information of blocks, however, is held only in memory; it is lost on shutdown and reported again by the DataNodes after a restart.
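The heartbeat timeout can be sketched with a small monitor that records when each DataNode was last heard from. The 3-second interval and 10-minute timeout come from the text; the class `HeartbeatMonitor` and its methods are hypothetical, and time is passed in explicitly as seconds so that the sketch stays deterministic.

```python
HEARTBEAT_INTERVAL_S = 3       # from the text: one heartbeat every 3 seconds
DEAD_NODE_TIMEOUT_S = 10 * 60  # from the text: lost after 10 minutes of silence


class HeartbeatMonitor:
    """Tracks the last heartbeat time of each DataNode."""

    def __init__(self, timeout_s=DEAD_NODE_TIMEOUT_S):
        self.timeout_s = timeout_s
        self.last_seen = {}  # node id -> time of last heartbeat, in seconds

    def heartbeat(self, node_id, now_s):
        """Record a heartbeat received from node_id at time now_s."""
        self.last_seen[node_id] = now_s

    def lost_nodes(self, now_s):
        """Return the nodes whose silence has exceeded the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now_s - seen > self.timeout_s
        )
```

Blocks held by the nodes returned from `lost_nodes` would then be candidates for copying to other DataNodes.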

HDFS is designed to store a large number of files reliably across a large number of machines in a cluster, and it stores each file as a sequence of blocks. All blocks of a file are the same size except the last one. The blocks of a file are replicated for fault tolerance. The block size and the replication factor are configurable per file, and an application may set the replication factor when the file is created or change it afterwards. Files in HDFS are written once, and a file has only one writer at any time.

The NameNode is responsible for all decisions related to block replication. It periodically receives heartbeats and block reports from the DataNodes in the cluster. The arrival of a heartbeat indicates that the DataNode is working normally. A block report contains a list of all blocks on that DataNode.
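How block reports could feed replication decisions is sketched below: the reports are inverted into a block-to-DataNode map, and blocks with too few live replicas are flagged. The function names are hypothetical, and the default replication factor of 3 is the default quoted in the text.

```python
from collections import defaultdict


def build_block_map(block_reports):
    """Invert {datanode: [block ids]} reports into {block id: set of datanodes}."""
    locations = defaultdict(set)
    for node, blocks in block_reports.items():
        for block in blocks:
            locations[block].add(node)
    return dict(locations)


def under_replicated(block_map, live_nodes, replication=3):
    """Return {block id: number of missing replicas}, counting live nodes only.

    `live_nodes` must be a set of DataNode ids currently sending heartbeats.
    """
    missing = {}
    for block, nodes in block_map.items():
        live = len(nodes & live_nodes)
        if live < replication:
            missing[block] = replication - live
    return missing
```

Combining the live-node set from the heartbeats with the block map from the reports gives the NameNode enough information to decide which blocks need additional copies.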

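On the basis of the foregoing embodiments, after writing a big data file whose size exceeds a set threshold in the collected data into the distributed file system HDFS, the method further includes:

taking the position of the big data file in the distributed file system HDFS as an index, and replacing the content of the big data file in the distributed column storage database HBase with the index.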

On the basis of the above embodiments, the method further includes:

Based on the same concept, an embodiment of the present invention further provides a schematic diagram of a physical structure. As shown in FIG. 3, the server may include a processor 301, a communication interface 302, a memory 303 and a communication bus 304, wherein the processor 301, the communication interface 302 and the memory 303 communicate with one another through the communication bus 304. The processor 301 may call logic instructions in the memory 303 to perform the steps of the big data-based collected data real-time storage and management method according to the embodiments described above. Examples include:

In addition, the logic instructions in the memory 303 may be implemented in the form of software functional units and, when sold or used as independent products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

Based on the same conception, the embodiment of the present invention further provides a non-transitory computer-readable storage medium, where a computer program is stored, where the computer program includes at least one code, and the at least one code is executable by the main control device to control the main control device to implement the steps of the big data-based collected data real-time storage and management method according to the embodiments. Examples include:

Based on the same technical concept, the embodiment of the present application further provides a computer program, which is used to implement the above method embodiment when the computer program is executed by the main control device.

The program may be stored in whole or in part on a storage medium packaged with the processor, or in part or in whole on a memory not packaged with the processor.

Based on the same technical concept, the embodiment of the present application further provides a processor, and the processor is configured to implement the above method embodiment. The processor may be a chip.

In summary, according to the big data-based collected data real-time storage and management system and method provided by the embodiments of the present invention, the collected data uploaded in real time is stored and sorted by the modules of the big data framework, so that data in different formats are stored by category and massive collected data are stored in real time and classified automatically. Real-time uploading and storage improves storage efficiency. Compared with manually collecting data from hard disks and then summarizing, analyzing, classifying and storing it again, efficiency is improved and industrial cost is reduced; subsequent use of the data is facilitated and maintenance by administrators is easier; the use of a large number of hard disks is avoided, and the kind of data loss that occurs with data hard disks is less likely. The required data can be retrieved conveniently during research, and labor and hardware costs are reduced.

The embodiments of the present invention can be arbitrarily combined to achieve different technical effects.

In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in accordance with the present application are generated, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line) or wirelessly (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.

One of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by hardware related to instructions of a computer program, which may be stored in a computer-readable storage medium, and when executed, may include the processes of the above method embodiments. And the aforementioned storage medium includes: various media capable of storing program codes, such as ROM or RAM, magnetic or optical disks, etc.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. The big data-based acquired data real-time storage and management system is characterized by comprising a distributed system architecture Hadoop, wherein the distributed system architecture Hadoop comprises a distributed column storage database HBase, a distributed file system HDFS and a programming model MapReduce;

the programming model MapReduce is used for carrying out screening operation and processing on the acquired data so as to classify and store data of different formats, establish SQL query, and import the result of Reduce into a distributed column storage database HBase after the result is summarized; wherein the collected data is data collected by a sensor carried by the autonomous vehicle.

2. The big-data based collected data real-time storage and management system according to claim 1, wherein the distributed column store database HBase is further configured to:

3. The big-data-based collected data real-time storage and management system according to claim 1, further comprising a large-scale data analysis platform Pig, wherein the large-scale data analysis platform Pig is used for reading a Hadoop configuration file of the distributed system architecture to obtain the machines on which the NameNode and JobTracker processes run, converting data analysis requests into a series of MapReduce jobs according to the size of the data set and running the MapReduce jobs on the distributed system architecture Hadoop, and, based on Pig Latin, a data-flow-oriented SQL-like language, loading, merging, filtering, sorting, grouping and joining the data set and applying functions to it.

4. The big-data-based collected data real-time storage and management system according to claim 1, further comprising a data warehouse tool Hive, wherein the data warehouse tool Hive is used for establishing a complete Structured Query Language (SQL) query, converting SQL statements into MapReduce tasks for processing, and finally performing Reduce result summarization.

5. The big data based acquired data real-time storage and management system according to claim 1, wherein the distributed file system HDFS comprises a NameNode node and a plurality of DataNode nodes; the NameNode node is used for storing and managing file metadata, and the DataNode node is used for storing an HDFS file in a data form, responding to a read-write request of an HDFS client, periodically reporting heartbeat information to the NameNode node, periodically reporting data block information to the NameNode node, and periodically reporting caching data block information to the NameNode node.

6. A big data-based collected data real-time storage and management method is characterized by comprising the following steps:

7. The big data-based collected data real-time storage and management method according to claim 6, wherein after writing the big data file with a size exceeding a set threshold in the collected data into the distributed file system HDFS, the method further comprises:

8. The big data based collected data real-time storage and management method according to claim 6, further comprising:

reading a Hadoop configuration file of the distributed system architecture to obtain the machines on which the NameNode and JobTracker processes run, converting data analysis requests into a series of MapReduce jobs according to the size of the data set and running the MapReduce jobs in the distributed system architecture Hadoop, and, based on Pig Latin, a data-flow-oriented SQL-like language, loading, merging, filtering, sorting, grouping and joining big data in Hadoop and applying functions to the data set.

9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the big data based collected data real-time storage and management method according to any one of claims 6 to 8 when executing the program.

10. A non-transitory computer readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the big data based acquisition data real-time storage and management method according to any one of claims 6 to 8.