CN110888861A - Novel big data storage method - Google Patents

Info

Publication number
CN110888861A
Authority
CN
China
Prior art keywords
data
storage method
novel big
data storage
sampling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911103257.9A
Other languages
Chinese (zh)
Inventor
冯报安
杨晶生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Microphone Culture Media Co Ltd
Original Assignee
Shanghai Microphone Culture Media Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Microphone Culture Media Co Ltd filed Critical Shanghai Microphone Culture Media Co Ltd
Priority to CN201911103257.9A priority Critical patent/CN110888861A/en
Publication of CN110888861A publication Critical patent/CN110888861A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases

Abstract

The invention belongs to the technical field of big data storage and relates in particular to a novel big data storage method comprising the following steps: selecting an excellent database tool; writing excellent program code; partitioning the mass data; establishing extensive indexes and a cache mechanism; enlarging the virtual memory and processing in batches; storing data using a data warehouse and a multidimensional database; and performing data mining using sampled data and storing the mass data in an associated manner. Data is divided into three types, namely hot data, warm data and cold data, and each type is stored separately. Different technologies and solutions are applied to different usage scenarios, which improves service performance and reduces calculation time and hardware cost. Different types of databases are applied to different data, so that complex data can be classified and processed, which facilitates data storage, analysis and processing.

Description

Novel big data storage method
Technical Field
The invention relates to the technical field of big data storage, in particular to a novel big data storage method.
Background
With the rapid development of applications such as the mobile internet and the Internet of Things, the global data volume has grown explosively, signaling that the big data era has arrived. Network operators have huge user bases and control both the terminals and the users' internet access channels, so they have a good data foundation for user behavior analysis. Deeply analyzing the traffic behavior characteristics and patterns of users and discovering their potential consumption needs is an effective means of improving value and operating level. However, not only is the data volume growing, but the variety of big data types and the real-time processing requirements also greatly increase the complexity of big data processing.
Traditional data analysis and processing methods target only a single type of data, whereas big data is characterized by huge volume, complex structure and numerous types, which poses new challenges for its storage, processing and analysis.
To this end, we propose a new big data storage method to solve the above problems.
Disclosure of Invention
The invention aims to solve the defects in the prior art and provides a novel big data storage method.
In order to achieve the purpose, the invention adopts the following technical scheme:
A novel big data storage method comprises the following steps:
S1, selecting an excellent database tool;
S2, writing excellent program code;
S3, partitioning the mass data;
S4, establishing extensive indexes and a cache mechanism;
S5, enlarging the virtual memory and performing batch processing;
S6, storing data using a data warehouse and a multidimensional database;
and S7, performing data mining using sampled data, and storing the mass data in an associated manner.
In the above novel big data storage method, the types of the database in step S1 mainly include: relational databases, columnar databases, key-value databases, graph databases, and distributed document storage databases.
In the above novel big data storage method, the partitioning operation in step S3 divides the data into three areas, namely a hot spot area, a warm spot area and a cold spot area. The hot spot area corresponds to the latest and hottest data and needs to provide low-latency random reading and writing for online services; the warm spot area corresponds to relatively hot data that is used only for batch calculation and does not need real-time random access; the cold spot area corresponds to historical data that is accessed at very low frequency.
In the above novel big data storage method, for hot spot data that needs random reading and writing, HBase and SSD hard disks are used to provide random read performance of 20 ms on average; for warm spot data used in large-scale analysis and calculation, Spark, partial and hybrid hard disk storage is used; and for cold spot data that is accessed at low frequency, Hive and HDFS are used to store all data.
In the above novel big data storage method, the data warehouse in step S6 obtains raw data from multiple information sources, stores the raw data in the internal database of the data warehouse after sorting and processing, provides a unified, coordinated and integrated information environment for users of the data warehouse through a data warehouse access tool, and supports an enterprise global decision process and deep comprehensive analysis of enterprise operation and management.
In the above novel big data storage method, the real-time data acquisition tool is Kafka, the offline data acquisition tool is ETL, and the internet data acquisition tool is Crawler.
In the above novel big data storage method, the methods applied in data sampling are broadly simple random sampling, stratified sampling, reservoir sampling, random undersampling and oversampling.
Compared with the prior art, the novel big data storage method has the advantages that:
1. The data is divided into three types, namely hot data, warm data and cold data, and each type is stored separately. Different technologies and solutions are applied according to different usage scenarios, which improves service performance and reduces calculation time and hardware cost.
2. Different types of databases are applied to different data, so that complex data can be classified and processed. This avoids the limitation of traditional data analysis, which targets only a single type of data, and allows different sampling and calculation methods to be applied to different types of data, which benefits data storage, analysis and processing.
Drawings
FIG. 1 is a method step diagram of a novel big data storage method proposed by the present invention;
fig. 2 is a data type structure diagram of a novel big data storage method according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
Referring to fig. 1-2, a novel big data storage method includes the following steps:
S1, selecting an excellent database tool;
S2, writing excellent program code;
S3, partitioning the mass data;
S4, establishing extensive indexes and a cache mechanism;
S5, enlarging the virtual memory and performing batch processing;
S6, storing data using a data warehouse and a multidimensional database;
and S7, performing data mining using sampled data, and storing the mass data in an associated manner.
The types of database in step S1 mainly include relational databases, columnar databases, key-value databases, graph databases and distributed document storage databases. Relational databases, such as Oracle, MySQL, SQL Server and PostgreSQL, have long been in common use; data stored in a relational database must meet certain requirements and conform to a certain data model, for example with respect to primary keys, foreign keys and data redundancy. Columnar databases are generally applied to large amounts of string data, examples being HBase, Cassandra, Sybase IQ, HP Vertica and EMC Greenplum; they originated in data analysis for data warehouses in big data environments and are mainly suited to batch data processing and instant queries. A key-value database, i.e. Key-Value (KV) storage, is a form of NoSQL storage in which data is organized, indexed and stored as key-value pairs; KV storage is well suited to business data that does not involve many relationships between data items, can effectively reduce the number of disk reads and writes, and offers better read and write performance than SQL database storage. A graph database is not dedicated to storing graphic images; it is so named because it maintains the relationships between its data in a graph structure, with Neo4j and Sones as typical representatives. Distributed document storage databases are flexible in application: document storage supports access to structured data but, unlike the relational model, imposes no mandatory schema.
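The difference between these storage models can be illustrated with a minimal Python sketch. It is not part of the claimed method; the sample records, the field names and the helper `to_columns` are hypothetical and use only in-memory structures, not any of the database products named above.

```python
# Row-oriented layout: each record is kept together, as in a relational row store.
rows = [
    {"user_id": 1, "city": "Shanghai", "bytes_used": 120},
    {"user_id": 2, "city": "Beijing", "bytes_used": 300},
    {"user_id": 3, "city": "Shanghai", "bytes_used": 80},
]


def to_columns(records):
    """Pivot row records into a column-oriented layout: one list per field."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# Key-value layout: every record is indexed by a single key for direct lookup.
kv_store = {record["user_id"]: record for record in rows}

# A batch aggregate over one field scans only that field's list in the columnar layout.
total_bytes = sum(columns["bytes_used"])

# A point lookup by key is a single dictionary access in the key-value layout.
user_2 = kv_store[2]
```

The columnar form favors batch scans over a few fields, while the key-value form favors direct lookups by key, which matches the usage scenarios the description assigns to each database type.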
The partitioning operation in step S3 divides the data into three areas, namely a hot spot area, a warm spot area and a cold spot area. The hot spot area corresponds to the latest and hottest data and needs to provide low-latency random reading and writing for online services; the warm spot area corresponds to relatively hot data that is used only for batch calculation and does not need real-time random access; the cold spot area corresponds to historical data that is accessed at very low frequency. Specifically, for hot spot data that needs random reading and writing, HBase and SSD hard disks are used to provide random read performance of 20 ms on average; for warm spot data used in large-scale analysis and calculation, Spark, partial and hybrid hard disk storage is used; and for cold spot data that is accessed at low frequency, Hive and HDFS are used to store all data. The big data platform is thus built from different big data frameworks and technologies, and different technologies and solutions are applied according to different usage scenarios, which improves service performance and reduces calculation time and hardware cost.
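As an illustration only, the three-way partition can be sketched in Python. The age thresholds, the `Record` fields and the function name `classify` are hypothetical, since the description gives no numeric boundaries. The backend labels follow the description, and the unexplained "partial" storage element is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical thresholds: the description does not state numeric boundaries.
HOT_MAX_AGE = timedelta(days=7)
WARM_MAX_AGE = timedelta(days=90)

# Backend labels taken from the description of each area.
TIER_BACKENDS = {
    "hot": "HBase on SSD",
    "warm": "Spark with hybrid hard disk storage",
    "cold": "Hive on HDFS",
}


@dataclass
class Record:
    key: str
    last_written: datetime
    needs_online_access: bool  # True if online services read or write it randomly


def classify(record, now):
    """Assign a record to the hot, warm or cold area."""
    age = now - record.last_written
    # Hot: recent data that online services access with low latency.
    if record.needs_online_access and age <= HOT_MAX_AGE:
        return "hot"
    # Warm: still fairly recent, but only needed for batch calculation.
    if age <= WARM_MAX_AGE:
        return "warm"
    # Cold: historical data accessed at very low frequency.
    return "cold"
```

A storage router could then look up `TIER_BACKENDS[classify(record, now)]` to decide where a record is written.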
The data warehouse in step S6 obtains raw data from multiple information sources, stores the raw data in the internal database of the data warehouse after sorting and processing, provides a unified, coordinated and integrated information environment for users of the data warehouse through a data warehouse access tool, and supports the enterprise's global decision-making process and in-depth comprehensive analysis of enterprise operation and management.
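The flow from multiple sources into one unified store can be sketched as follows. This is a minimal illustration that uses Python's built-in `sqlite3` module as a stand-in for the warehouse's internal database; the two sources, the table name `facts` and all field names are hypothetical.

```python
import sqlite3

# Two hypothetical information sources with inconsistent shapes and units.
billing_source = [{"uid": "u1", "amount_cents": 1250}, {"uid": "u2", "amount_cents": 300}]
traffic_source = [("u1", 2.5), ("u2", 0.5), ("u1", 1.0)]  # (user id, gigabytes)


def extract_transform():
    """Normalize both sources into (user_id, metric, value) rows."""
    rows = []
    for item in billing_source:
        rows.append((item["uid"], "spend", item["amount_cents"] / 100))
    for user_id, gigabytes in traffic_source:
        rows.append((user_id, "traffic_gb", gigabytes))
    return rows


def load(conn, rows):
    """Store the normalized rows in the warehouse's internal table."""
    conn.execute("CREATE TABLE IF NOT EXISTS facts (user_id TEXT, metric TEXT, value REAL)")
    conn.executemany("INSERT INTO facts VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, extract_transform())

# Users of the warehouse query one unified table instead of each raw source.
summary = conn.execute(
    "SELECT user_id, metric, SUM(value) FROM facts "
    "GROUP BY user_id, metric ORDER BY user_id, metric"
).fetchall()
```

Here `summary` holds one aggregated row per user and metric, which stands in for the unified information environment mentioned above.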
Kafka is selected as the real-time data acquisition tool, ETL as the offline data acquisition tool, and Crawler as the internet data acquisition tool, which meets the requirements of different data acquisition scenarios.
More specifically, the methods applied in data sampling are broadly simple random sampling, stratified sampling, reservoir sampling, random undersampling and oversampling, which can be used to solve different data analysis and calculation problems.
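Two of these methods can be sketched in Python as an illustration: single-pass sampling from a stream of unknown length (commonly called reservoir sampling, shown here as Algorithm R) and per-group sampling (stratified sampling). The function names and parameters are hypothetical; the description does not specify which variant of either method is intended.

```python
import random
from collections import defaultdict


def reservoir_sample(stream, k, rng):
    """Algorithm R: uniform sample of k items from an iterable of unknown length."""
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1).
            j = rng.randint(0, index)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def stratified_sample(records, key, fraction, rng):
    """Draw the same fraction, at least one item, from each group given by key(record)."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    sample = []
    for members in groups.values():
        count = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, count))
    return sample
```

Reservoir sampling suits mass data that cannot be held in memory, because it reads each item once and keeps only `k` items. Stratified sampling keeps every group represented, for example hot, warm and cold records, even when group sizes differ widely.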
The above description covers only the preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art within the technical scope disclosed by the present invention, according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.

Claims (7)

1. A novel big data storage method is characterized by comprising the following steps:
S1, selecting an excellent database tool;
S2, writing excellent program code;
S3, partitioning the mass data;
S4, establishing extensive indexes and a cache mechanism;
S5, enlarging the virtual memory and performing batch processing;
S6, storing data using a data warehouse and a multidimensional database;
and S7, performing data mining using sampled data, and storing the mass data in an associated manner.
2. The method according to claim 1, wherein the types of the database in step S1 mainly include: relational databases, columnar databases, key-value databases, graph databases, and distributed document storage databases.
3. The novel big data storage method according to claim 1, wherein the partitioning operation in step S3 divides the data into three areas, namely a hot spot area, a warm spot area and a cold spot area, wherein the hot spot area corresponds to the latest and hottest data and needs to provide low-latency random reading and writing for online services; the warm spot area corresponds to relatively hot data that is used only for batch calculation and does not need real-time random access; and the cold spot area corresponds to historical data that is accessed at very low frequency.
4. The novel big data storage method according to claim 3, wherein for hot spot data that needs random reading and writing, HBase and SSD hard disks are used to provide random read performance of 20 ms on average; for warm spot data used in large-scale analysis and calculation, Spark, partial and hybrid hard disk storage is used; and for cold spot data that is accessed at low frequency, Hive and HDFS are used to store all data.
5. The novel big data storage method according to claim 1, wherein the data warehouse in step S6 obtains raw data from multiple information sources, stores the raw data in an internal database of the data warehouse after sorting and processing, provides a unified, coordinated and integrated information environment for users of the data warehouse through a data warehouse access tool, and supports the enterprise's global decision-making process and in-depth comprehensive analysis of enterprise operation and management.
6. The novel big data storage method according to claim 1, wherein Kafka is selected as the real-time data collection tool, ETL is selected as the off-line data collection tool, and Crawler is selected as the internet data collection tool.
7. The novel big data storage method according to claim 1, wherein the methods applied in data sampling are broadly simple random sampling, stratified sampling, reservoir sampling, random undersampling and oversampling.
CN201911103257.9A 2019-11-12 2019-11-12 Novel big data storage method Pending CN110888861A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911103257.9A CN110888861A (en) 2019-11-12 2019-11-12 Novel big data storage method

Publications (1)

Publication Number Publication Date
CN110888861A true CN110888861A (en) 2020-03-17

Family

ID=69747284

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911103257.9A Pending CN110888861A (en) 2019-11-12 2019-11-12 Novel big data storage method

Country Status (1)

Country Link
CN (1) CN110888861A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150074037A1 (en) * 2013-09-12 2015-03-12 Sap Ag In Memory Database Warehouse
CN106682067A (en) * 2016-11-08 2017-05-17 浙江邦盛科技有限公司 Machine learning anti-fraud monitoring system based on transaction data
CN110083649A (en) * 2019-04-24 2019-08-02 北京电子工程总体研究所 Data management system and data management method based on cold, warm and hot data

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113704346A (en) * 2020-05-20 2021-11-26 杭州海康威视数字技术股份有限公司 Hbase table cold and hot data conversion method and device and electronic equipment
CN111930848A (en) * 2020-09-17 2020-11-13 阿里云计算有限公司 Data partition storage method, device and system
WO2022057739A1 (en) * 2020-09-17 2022-03-24 阿里云计算有限公司 Partition-based data storage method, apparatus, and system
CN113254442A (en) * 2021-05-21 2021-08-13 首约科技(北京)有限公司 Warehouse and table dividing method for trip industry

Similar Documents

Publication Publication Date Title
CN102521405B (en) Massive structured data storage and query methods and systems supporting high-speed loading
CN102521406B (en) Distributed query method and system for complex task of querying massive structured data
CN109669934B (en) Data warehouse system suitable for electric power customer service and construction method thereof
JP6639420B2 (en) Method for flash-optimized data layout, apparatus for flash-optimized storage, and computer program
CN110888861A (en) Novel big data storage method
Goil et al. A parallel scalable infrastructure for OLAP and data mining
EP2263180B1 (en) Indexing large-scale gps tracks
CN104361113B (en) A kind of OLAP query optimization method under internal memory flash memory mixing memory module
CN107291806A (en) A kind of Data View copy alternative manner in Web visible environments
CN103810219B (en) Line storage database-based data processing method and device
CN109783441A (en) Mass data inquiry method based on Bloom Filter
CN104408163A (en) Data hierarchical storage method and device
CN103744913A (en) Database retrieval method based on search engine technology
Liu et al. Smartcube: An adaptive data management architecture for the real-time visualization of spatiotemporal datasets
CN113032427B (en) Vectorization query processing method for CPU and GPU platform
Ilić et al. A comparative analysis of smart metering data aggregation performance
Hu et al. Efficient provenance management via clustering and hybrid storage in big data environments
Gedik et al. Motion adaptive indexing for moving continual queries over moving objects
CN110990340B (en) Big data multi-level storage architecture
Gaurav et al. An outline on big data and big data analytics
CN108319604B (en) Optimization method for association of large and small tables in hive
CN110825744B (en) Cluster environment-based air quality monitoring big data partition storage method
CN112463904B (en) Mixed analysis method of distributed space vector data and single-point space data
Li et al. SP-phoenix: a massive spatial point data management system based on phoenix
Feng et al. Indexing techniques of distributed ordered tables: A survey and analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination