CN110888861A - Novel big data storage method

Info

Publication number: CN110888861A
Application number: CN201911103257.9A
Authority: CN
Inventors: 冯报安; 杨晶生
Original assignee: Shanghai Microphone Culture Media Co Ltd
Current assignee: Shanghai Microphone Culture Media Co Ltd
Priority date: 2019-11-12
Filing date: 2019-11-12
Publication date: 2020-03-17

Abstract

The invention belongs to the technical field of big data storage and relates in particular to a novel big data storage method comprising the following steps: selecting an excellent database tool; writing excellent program code; partitioning the mass data; establishing a wide index and a cache mechanism; enlarging the virtual memory and processing in batches; using a data warehouse and multidimensional database storage; and performing data mining on sampled data and storing the mass data in an associated manner. Data is divided into three types, namely hot data, warm data and cold data, and each type is stored separately. Different technologies and solutions are applied to different usage scenarios, which improves service performance and reduces computation time and hardware cost. Different types of databases are applied to different data, so that complex data can be classified and processed, which facilitates data storage, analysis and processing.

Description

Novel big data storage method

Technical Field

The invention relates to the technical field of big data storage, in particular to a novel big data storage method.

Background

With the rapid development of applications such as the mobile internet and the internet of things, the global data volume has grown explosively. This rapid growth signals that the big data era has arrived. Network operators have a huge user base and control both the terminals and the users' internet access channels, so they have a solid data foundation for user behavior analysis. Analyzing in depth the characteristics and patterns of users' traffic behavior and discovering their potential consumption needs is an effective means of improving value and operational level. However, not only is the data volume growing ever larger, but the variety of big data types and the real-time processing requirements also greatly increase the complexity of big data processing.

Traditional data analysis and processing methods target only a single type of data, whereas big data is characterized by huge volume, complex structure and numerous types, which poses new challenges for the storage, processing and analysis of big data.

To this end, we propose a new big data storage method to solve the above problems.

Disclosure of Invention

The invention aims to overcome the shortcomings of the prior art and provides a novel big data storage method.

In order to achieve the purpose, the invention adopts the following technical scheme:

A novel big data storage method comprises the following steps:

S1, selecting an excellent database tool;

S2, writing excellent program code;

S3, partitioning the mass data;

S4, establishing a wide index and a cache mechanism;

S5, enlarging the virtual memory and processing in batches;

S6, using a data warehouse and multidimensional database storage;

and S7, performing data mining on sampled data and storing the mass data in an associated manner.

In the above novel big data storage method, the types of the database in step S1 mainly include: relational databases, columnar databases, key-value databases, graph databases, and distributed document storage databases.

In the above novel big data storage method, the partitioning operation in step S3 divides the data into three areas: a hot spot area, a warm spot area and a cold spot area. The hot spot area holds the latest and most frequently accessed data and must provide low-latency random reads and writes for online services; the warm spot area holds relatively hot data that is used only for batch computation and does not require real-time random access; the cold spot area holds historical data that is accessed very infrequently.
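The three areas described above can be illustrated with a minimal Python sketch. The age thresholds, the `Record` fields and the `classify` function are hypothetical and chosen only for illustration; the source does not specify how the boundary between hot, warm and cold data is drawn.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical thresholds; the source gives no concrete numbers.
HOT_MAX_AGE = timedelta(days=7)
WARM_MAX_AGE = timedelta(days=90)


@dataclass
class Record:
    key: str
    last_access: datetime
    needs_online_access: bool  # True if online services read/write it randomly


def classify(record: Record, now: datetime) -> str:
    """Return 'hot', 'warm' or 'cold' for a record."""
    age = now - record.last_access
    # Hot: recent data that online services access with low latency.
    if age <= HOT_MAX_AGE and record.needs_online_access:
        return "hot"
    # Warm: still relatively recent, but only used for batch computation.
    if age <= WARM_MAX_AGE:
        return "warm"
    # Cold: historical data with very infrequent access.
    return "cold"
```

In practice the rule could also take access frequency into account; recency alone is used here to keep the sketch short.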

In the above novel big data storage method, for hot spot data that requires random reads and writes, HBase and SSD hard disks are used to provide random read performance of 20 ms on average; for warm spot data used in large-scale analysis and computation, Spark, partial and hybrid hard disk storage is used; and for cold spot data that is accessed infrequently, Hive and HDFS are used to store all data.

In the above novel big data storage method, the data warehouse in step S6 obtains raw data from multiple information sources, stores the raw data in the internal database of the data warehouse after sorting and processing, provides a unified, coordinated and integrated information environment for users of the data warehouse through a data warehouse access tool, and supports an enterprise global decision process and deep comprehensive analysis of enterprise operation and management.

In the above novel big data storage method, the real-time data acquisition tool is Kafka, the offline data acquisition tool is ETL, and the internet data acquisition tool is a web crawler (Crawler).

In the above novel big data storage method, the methods applied in data sampling are broadly simple random sampling, stratified sampling, reservoir sampling, random undersampling and random oversampling.

Compared with the prior art, the novel big data storage method has the advantages that:

1. Data is divided into three types, namely hot data, warm data and cold data, and each type is stored separately. Different technologies and solutions are applied according to the usage scenario, which improves service performance and reduces computation time and hardware cost.

2. Different types of databases are applied to different data, so that complex data can be classified and processed. This avoids the limitation of traditional data analysis, which targets only a single data type, and allows different sampling and computation methods to be applied to different types of data, which benefits data storage, analysis and processing.

Drawings

FIG. 1 is a diagram of the method steps of the novel big data storage method proposed by the present invention;

FIG. 2 is a data type structure diagram of the novel big data storage method proposed by the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.

Referring to FIGS. 1-2, a novel big data storage method includes the following steps:

S1, selecting an excellent database tool;

S2, writing excellent program code;

S3, partitioning the mass data;

S4, establishing a wide index and a cache mechanism;

S5, enlarging the virtual memory and processing in batches;

S6, using a data warehouse and multidimensional database storage;

and S7, performing data mining on sampled data and storing the mass data in an associated manner.
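Steps S4 and S5 are stated without further detail, so the following Python sketch is only one possible reading. It assumes an in-memory toy table `ROWS`, a simple secondary index built by the hypothetical helper `build_index`, a query cache based on the standard-library `functools.lru_cache`, and a `batches` helper that processes data in fixed-size chunks. The source does not define what a "wide index" is; the index here is an ordinary lookup table used for illustration.

```python
from functools import lru_cache
from itertools import islice
from typing import Dict, Iterable, Iterator, List

# Toy data set standing in for a large table: (user_id, city) rows.
ROWS = [(i, "city_%d" % (i % 3)) for i in range(10)]


def build_index(rows) -> Dict[str, List[int]]:
    """Map each city to the positions of its rows (a simple secondary index)."""
    index: Dict[str, List[int]] = {}
    for pos, (_, city) in enumerate(rows):
        index.setdefault(city, []).append(pos)
    return index


INDEX = build_index(ROWS)


@lru_cache(maxsize=128)
def users_in_city(city: str) -> tuple:
    """Indexed lookup; repeated queries for the same city hit the cache."""
    return tuple(ROWS[pos][0] for pos in INDEX.get(city, []))


def batches(items: Iterable, size: int) -> Iterator[list]:
    """Yield fixed-size chunks so only one chunk is held in memory at a time."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk
```

Calling `users_in_city("city_0")` twice scans the index once and answers the second call from the cache, and `batches(range(10), 4)` yields chunks of 4, 4 and 2 items.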

The types of database in step S1 mainly include relational databases, columnar databases, key-value databases, graph databases and distributed document storage databases.

Relational databases, such as Oracle, MySQL, SQL Server and PostgreSQL, are the most widely used. Data stored in a relational database must conform to a defined data model, with constraints such as primary keys, foreign keys and controlled data redundancy.

Columnar databases are generally applied to large volumes of string data; examples include HBase, Cassandra, Sybase IQ, HP Vertica and EMC Greenplum. Columnar databases arose from data analysis in data warehouses for big data environments and are mainly suited to batch data processing and ad hoc queries.

A key-value database, that is, Key-Value (KV) storage, is a form of NoSQL storage in which data is organized, indexed and stored as key-value pairs. KV storage is well suited to business data that does not involve many relationships between records. It also effectively reduces the number of disk reads and writes and offers better read and write performance than SQL database storage.

A graph database is not dedicated to storing graphic images; it is so named because it maintains the relationships between its data in a graph-like structure. Neo4j and Sones are typical representatives.

Distributed document storage databases are flexible in application. Document storage supports access to structured data but, unlike the relational model, imposes no mandatory schema.
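The difference between row-oriented, column-oriented and key-value organization can be illustrated with plain Python data structures. This is a conceptual sketch only and does not reflect how any of the products named above store data internally; the sample records and the helper `to_columns` are hypothetical.

```python
# Row-oriented layout: all fields of a record are stored together,
# as in a relational table.
rows = [
    {"user": "u1", "city": "Shanghai", "bytes": 120},
    {"user": "u2", "city": "Beijing", "bytes": 300},
    {"user": "u3", "city": "Shanghai", "bytes": 80},
]


def to_columns(records):
    """Pivot records into a column-oriented layout: one list per field.

    Assumes every record has the same fields.
    """
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column-oriented layout: a batch aggregate reads only the column it needs.
columns = to_columns(rows)
total_bytes = sum(columns["bytes"])

# Key-value layout: each record is found directly by its key,
# with no relationships between records.
kv = {record["user"]: record for record in rows}
```

Here `total_bytes` is computed from the single `bytes` column, while `kv["u2"]` retrieves one whole record by key, which mirrors the batch-processing and fast-lookup use cases described above.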

The partitioning operation in step S3 divides the data into three areas: a hot spot area, a warm spot area and a cold spot area. The hot spot area holds the latest and most frequently accessed data and must provide low-latency random reads and writes for online services; the warm spot area holds relatively hot data that is used only for batch computation and does not require real-time random access; the cold spot area holds historical data that is accessed very infrequently. Specifically, for hot spot area data that requires random reads and writes, HBase and SSD hard disks are used to provide random read performance of 20 ms on average; for the warm spot area used in large-scale analysis and computation, Spark, partial and hybrid hard disk storage is used; and for the cold spot area holding infrequently accessed data, Hive and HDFS are used to store all data. A big data platform is thus built from different big data frameworks and technologies, and different technologies and solutions are applied according to the usage scenario, which improves service performance and reduces computation time and hardware cost.
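The mapping from area to storage stack can be written down as a simple lookup. The sketch below is an assumption-laden illustration: the description strings paraphrase the text above, the function `plan_placement` is hypothetical, and no HBase, Spark, Hive or HDFS API is called.

```python
from collections import defaultdict

# Area -> storage stack, paraphrased from the description (labels only).
STORAGE_BY_TIER = {
    "hot": "HBase on SSD",
    "warm": "Spark with hybrid hard disk storage",
    "cold": "Hive on HDFS",
}


def plan_placement(tiered_keys):
    """Group record keys by the storage target of their area.

    tiered_keys: iterable of (key, tier) pairs, tier in STORAGE_BY_TIER.
    """
    plan = defaultdict(list)
    for key, tier in tiered_keys:
        plan[STORAGE_BY_TIER[tier]].append(key)
    return dict(plan)
```

For example, `plan_placement([("a", "hot"), ("b", "cold"), ("c", "hot")])` groups keys `a` and `c` under the hot-area stack and key `b` under the cold-area stack.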

The data warehouse in step S6 obtains raw data from multiple information sources, stores the raw data in the internal database of the data warehouse after sorting and processing, provides a unified, coordinated and integrated information environment for users of the data warehouse through a data warehouse access tool, and supports the enterprise's global decision process and in-depth comprehensive analysis of enterprise operation and management.
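The flow described above (collect from several sources, clean, load into one internal database, query through a unified interface) can be sketched with Python's standard `sqlite3` module. The two sources, their field names and the `sales` table are hypothetical; an in-memory SQLite database stands in for the warehouse's internal database.

```python
import sqlite3

# Two hypothetical sources with different field names and formats.
crm_rows = [
    {"customer": " Alice ", "amount": "120.50"},
    {"customer": "Bob", "amount": "80"},
]
web_rows = [{"user_name": "alice", "total": 30.0}]


def normalize(name: str) -> str:
    """Bring customer names into one canonical form."""
    return name.strip().lower()


def extract_and_clean():
    """Yield (source, customer, amount) tuples in a unified shape."""
    for r in crm_rows:
        yield ("crm", normalize(r["customer"]), float(r["amount"]))
    for r in web_rows:
        yield ("web", normalize(r["user_name"]), float(r["total"]))


# Load the cleaned rows into the warehouse's internal database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (source TEXT, customer TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", extract_and_clean())
conn.commit()

# Unified view across sources: total amount per customer.
totals = dict(
    conn.execute("SELECT customer, SUM(amount) FROM sales GROUP BY customer")
)
```

After loading, `totals` combines the records for the same customer from both sources, which is the kind of integrated view the data warehouse is meant to provide.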

Kafka is selected as the real-time data acquisition tool, ETL as the offline data acquisition tool, and a web crawler (Crawler) as the internet data acquisition tool, so that the requirements of different data acquisition scenarios are met.

More specifically, the methods applied in data sampling are broadly simple random sampling, stratified sampling, reservoir sampling, random undersampling and random oversampling, which can be used to solve different data analysis and computation problems.
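The sampling methods named above can each be sketched in a few lines of Python using only the standard `random` module. The function names and parameters are chosen for illustration; the reservoir sampler follows the classic Algorithm R, and the undersampling and oversampling helpers assume a two-class setting in which `majority` is at least as large as `minority`.

```python
import random
from collections import defaultdict


def simple_random_sample(items, k, rng):
    """Draw k items uniformly without replacement."""
    return rng.sample(list(items), k)


def stratified_sample(items, key, fraction, rng):
    """Sample the same fraction from each stratum (at least one item each)."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


def reservoir_sample(stream, k, rng):
    """Algorithm R: uniform sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def random_undersample(majority, minority, rng):
    """Shrink the majority class to the size of the minority class."""
    return rng.sample(majority, len(minority)) + list(minority)


def random_oversample(majority, minority, rng):
    """Grow the minority class by drawing duplicates with replacement."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority) + list(minority) + extra
```

Passing a seeded `random.Random` instance as `rng` makes the samples reproducible. Reservoir sampling is the natural choice when the data arrives as a stream too large to hold in memory, while undersampling and oversampling address class imbalance.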

The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. A novel big data storage method, characterized by comprising the following steps:

S1, selecting an excellent database tool;

S2, writing excellent program code;

S3, partitioning the mass data;

S4, establishing a wide index and a cache mechanism;

S5, enlarging the virtual memory and processing in batches;

S6, using a data warehouse and multidimensional database storage;

and S7, performing data mining on sampled data and storing the mass data in an associated manner.

2. The novel big data storage method according to claim 1, wherein the types of database in step S1 mainly include: relational databases, columnar databases, key-value databases, graph databases, and distributed document storage databases.

3. The novel big data storage method according to claim 1, wherein the partitioning operation in step S3 divides the data into three areas, namely a hot spot area, a warm spot area and a cold spot area, wherein the hot spot area holds the latest and most frequently accessed data and must provide low-latency random reads and writes for online services; the warm spot area holds relatively hot data that is used only for batch computation and does not require real-time random access; and the cold spot area holds historical data that is accessed very infrequently.

4. The novel big data storage method according to claim 3, wherein for hot spot area data that requires random reads and writes, HBase and SSD hard disks are used to provide random read performance of 20 ms on average; for the warm spot area used in large-scale analysis and computation, Spark, partial and hybrid hard disk storage is used; and for the cold spot area holding infrequently accessed data, Hive and HDFS are used to store all data.

5. The novel big data storage method according to claim 1, wherein the data warehouse in step S6 obtains raw data from multiple information sources, stores the raw data in an internal database of the data warehouse after sorting and processing, provides a unified, coordinated and integrated information environment for users of the data warehouse through a data warehouse access tool, and supports the enterprise's global decision process and in-depth comprehensive analysis of enterprise operation and management.

6. The novel big data storage method according to claim 1, wherein Kafka is selected as the real-time data acquisition tool, ETL is selected as the offline data acquisition tool, and a web crawler (Crawler) is selected as the internet data acquisition tool.

7. The novel big data storage method according to claim 1, wherein the methods applied in data sampling are broadly simple random sampling, stratified sampling, reservoir sampling, random undersampling and random oversampling.