CN110888861A - Novel big data storage method - Google Patents
- Publication number
- CN110888861A (application CN201911103257.9A)
- Authority
- CN
- China
- Prior art keywords
- data
- storage method
- novel big
- data storage
- sampling
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention belongs to the technical field of big data storage, and particularly relates to a novel big data storage method comprising the following steps: selecting an excellent database tool; writing excellent program code; partitioning the mass data; establishing extensive indexes and a cache mechanism; enlarging the virtual memory and processing in batches; using a data warehouse and multidimensional database storage; and mining the data using sampled data and storing the mass data in an associated manner. Data is divided into three types, namely hot data, warm data and cold data, and each type is stored separately. Different technologies and solutions are applied to different usage scenarios, which improves service performance and reduces computation time and hardware cost. Different types of databases are applied to different data, so that complex data can be classified and processed, which facilitates data storage, analysis and processing.
Description
Technical Field
The invention relates to the technical field of big data storage, in particular to a novel big data storage method.
Background
With the rapid development of applications such as the mobile internet and the internet of things, the global volume of data has grown explosively, which signals that the era of big data has arrived. Network operators have a huge user base and control both the terminals and the channels through which users access the internet, so they have a good data foundation for user behavior analysis. Deeply analyzing the traffic behavior characteristics and patterns of users, and discovering their potential consumption needs, is an effective means of improving value and operating level. However, data volumes keep growing, and the variety of big data types and the real-time processing requirements greatly increase the complexity of big data processing.
Traditional data analysis and processing methods target only a single type of data, whereas big data is characterized by huge volume, complex structure and numerous types, which poses new challenges for its storage, processing and analysis.
To this end, we propose a novel big data storage method to solve the above problems.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a novel big data storage method.
In order to achieve this purpose, the invention adopts the following technical scheme:
A novel big data storage method comprises the following steps:
S1, selecting an excellent database tool;
S2, writing excellent program code;
S3, partitioning the mass data;
S4, establishing extensive indexes and a cache mechanism;
S5, enlarging the virtual memory and processing in batches;
S6, using a data warehouse and multidimensional database storage;
and S7, mining the data using sampled data, and storing the mass data in an associated manner.
In the above novel big data storage method, the types of the database in step S1 mainly include: relational databases, columnar databases, key-value databases, graph databases, and distributed document storage databases.
In the above novel big data storage method, the partitioning operation in step S3 divides the data into three areas: a hot spot area, a warm spot area and a cold spot area. The hot spot area holds the latest and most frequently accessed data and must provide low-latency random reads and writes for online services; the warm spot area holds relatively hot data that is used only for batch computation and does not require real-time random access; the cold spot area holds historical data that is accessed very infrequently.
In the above novel big data storage method, for hot spot data requiring random reads and writes, HBase and SSDs are used to provide random read performance of 20 ms on average; for warm spot data used in large-scale analysis and computation, Spark, Parquet and hybrid hard disk storage are used; and for cold spot data that is accessed infrequently, Hive and HDFS are used to store all data.
In the above novel big data storage method, the data warehouse in step S6 obtains raw data from multiple information sources, stores the raw data in the internal database of the data warehouse after sorting and processing, provides a unified, coordinated and integrated information environment for users of the data warehouse through a data warehouse access tool, and supports an enterprise global decision process and deep comprehensive analysis of enterprise operation and management.
In the above novel big data storage method, the real-time data acquisition tool is Kafka, the offline data acquisition tool is ETL, and the internet data acquisition tool is Crawler.
In the above novel big data storage method, the methods applied in data sampling broadly include simple random sampling, stratified sampling, reservoir sampling, random undersampling and random oversampling.
Compared with the prior art, the novel big data storage method has the advantages that:
1. Data is divided into three types, namely hot data, warm data and cold data, and each type is stored separately. Different technologies and solutions are applied to different usage scenarios, which improves service performance and reduces computation time and hardware cost.
2. Different types of databases are applied to different data, so that complex data can be classified and processed. This avoids the limitation of traditional data analysis, which targets only a single type of data, and allows different computation and sampling methods to be applied to different types of data, which benefits data storage, analysis and processing.
Drawings
FIG. 1 is a diagram of the method steps of the novel big data storage method proposed by the present invention;
FIG. 2 is a diagram of the data type structure of the novel big data storage method proposed by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
Referring to FIGS. 1-2, a novel big data storage method includes the following steps:
S1, selecting an excellent database tool;
S2, writing excellent program code;
S3, partitioning the mass data;
S4, establishing extensive indexes and a cache mechanism;
S5, enlarging the virtual memory and processing in batches;
S6, using a data warehouse and multidimensional database storage;
and S7, mining the data using sampled data, and storing the mass data in an associated manner.
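The patent does not say how steps S4 and S5 are carried out. The following is only a minimal sketch of one possible reading, written in Python as an assumption: `build_index` stands in for an index over one field, `lru_cache` from the standard library stands in for the cache mechanism, and `batches` splits a large input into fixed-size chunks for batch processing. All function names and sizes here are hypothetical and are not taken from the patent.

```python
from functools import lru_cache
from itertools import islice


def batches(records, size):
    """Yield successive lists of at most `size` records (one reading of S5)."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def build_index(records, field):
    """Map each value of `field` to the positions of matching records (S4)."""
    index = {}
    for pos, rec in enumerate(records):
        index.setdefault(rec[field], []).append(pos)
    return index


# Counts how often the uncached lookup body actually runs.
lookup_calls = {"count": 0}


@lru_cache(maxsize=1024)
def slow_lookup(key):
    """Stand-in for an expensive read; repeated keys are served from cache."""
    lookup_calls["count"] += 1
    return key.upper()
```

Processing a large input chunk by chunk keeps only one batch in memory at a time, which is the usual motivation for batch processing when the data does not fit in memory.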
The types of database in step S1 mainly include relational databases, columnar databases, key-value databases, graph databases and distributed document storage databases. Relational databases, such as Oracle, MySQL, SQL Server and PostgreSQL, have long been in common use; data stored in a relational database must meet certain requirements and conform to a defined data model, for example constraints on primary keys, foreign keys and data redundancy. Columnar databases are generally applied to large volumes of string data; examples include HBase, Cassandra, Sybase IQ, HP Vertica and EMC Greenplum. Columnar databases were created from the outset for data warehouse analysis in big data environments, and are mainly suited to batch data processing and ad hoc queries. Key-value databases, i.e. Key-Value (KV) storage, are a form of NoSQL storage in which data is organized, indexed and stored as key-value pairs; KV storage is well suited to business data that does not involve many relationships among data or business entities, can effectively reduce the number of disk reads and writes, and offers better read and write performance than SQL database storage. Graph databases are not dedicated to storing graphics or images; they are so named because they maintain the relationships among their data in a graph-like structure, with Neo4j and Sones as typical representatives. Distributed document storage databases are flexible in application: document storage supports access to structured data and, unlike the relational model, imposes no mandatory schema.
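The contrast drawn above between the relational model and key-value storage can be illustrated with a small sketch. It is an assumption-laden example rather than part of the patent: SQLite (via Python's standard `sqlite3` module) stands in for a relational database, a plain `dict` stands in for a key-value store, and the table names, keys and the helper `insert_orphan_order` are hypothetical.

```python
import sqlite3

# Relational model: rows must fit a declared schema, and the database
# enforces primary-key and foreign-key constraints.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks FKs only when enabled
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "user_id INTEGER NOT NULL REFERENCES users(id), "
    "amount REAL)"
)
conn.execute("INSERT INTO users VALUES (1, 'alice')")
conn.execute("INSERT INTO orders VALUES (10, 1, 9.5)")


def insert_orphan_order():
    """Try to insert an order for a user that does not exist."""
    try:
        conn.execute("INSERT INTO orders VALUES (11, 99, 1.0)")
        return True
    except sqlite3.IntegrityError:
        return False  # rejected by the foreign-key constraint


# Key-value model: each value is looked up by its key alone; the store
# imposes no schema and knows nothing about relations between values.
kv = {}
kv["user:1"] = {"name": "alice"}
kv["session:abc"] = b"opaque bytes"
```

The relational side rejects data that violates the model, while the key-value side accepts any value under any key, which is why it suits data with few relationships.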
The partitioning operation in step S3 divides the data into three areas: a hot spot area, a warm spot area and a cold spot area. The hot spot area holds the latest and most frequently accessed data and must provide low-latency random reads and writes for online services. The warm spot area holds relatively hot data that is used only for batch computation and does not require real-time random access. The cold spot area holds historical data that is accessed very infrequently. Specifically, for hot spot data requiring random reads and writes, HBase and SSDs are used to provide random read performance of 20 ms on average; for warm spot data used in large-scale analysis and computation, Spark, Parquet and hybrid hard disk storage are used; and for cold spot data that is accessed infrequently, Hive and HDFS are used to store all data. A big data architecture platform is thus built from different big data frameworks and technologies, and different technologies and solutions are applied to different usage scenarios, which improves service performance and reduces computation time and hardware cost.
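The patent does not state the rule that assigns a record to the hot, warm or cold area. As an illustration only, the sketch below assumes that assignment depends on the age of the record; the thresholds `HOT_MAX_AGE` and `WARM_MAX_AGE` are hypothetical values chosen for the example, and the backend labels merely restate the storage choices named above.

```python
from datetime import datetime, timedelta
from enum import Enum


class Tier(Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"


# Hypothetical age thresholds; the patent gives no concrete values.
HOT_MAX_AGE = timedelta(days=7)
WARM_MAX_AGE = timedelta(days=90)

# Storage named in the description for each area (labels only).
BACKENDS = {
    Tier.HOT: "HBase on SSD",
    Tier.WARM: "Spark with hybrid hard disk storage",
    Tier.COLD: "Hive on HDFS",
}


def classify(last_written, now):
    """Assign a tier from the age of a record (assumed rule)."""
    age = now - last_written
    if age <= HOT_MAX_AGE:
        return Tier.HOT
    if age <= WARM_MAX_AGE:
        return Tier.WARM
    return Tier.COLD


def backend_for(last_written, now):
    """Return the storage label for a record of the given age."""
    return BACKENDS[classify(last_written, now)]
```

A real system would more likely combine age with observed access frequency; age alone is used here only to keep the sketch short.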
The data warehouse in step S6 obtains raw data from multiple information sources, stores it in the internal database of the data warehouse after sorting and processing, provides a unified, coordinated and integrated information environment for users of the data warehouse through a data warehouse access tool, and supports the enterprise's global decision process and deep comprehensive analysis of enterprise operation and management.
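The flow described for step S6 (collect raw data from several sources, clean it, store it in one internal database, then query it through a single interface) can be sketched as follows. This is a toy illustration under stated assumptions: the two in-memory source lists, the table `fact_spend` and the function names are all hypothetical, and SQLite stands in for the warehouse's internal database.

```python
import sqlite3

# Two hypothetical sources that describe the same kind of fact
# with different field names and formatting.
crm_rows = [
    {"customer": " Alice ", "spend": "12.50"},
    {"customer": "bob", "spend": "7"},
]
billing_rows = [
    {"name": "ALICE", "amount": 3.5},
    {"name": "Carol", "amount": 20.0},
]


def normalize(name, amount, source):
    """Clean one raw record into the warehouse's unified layout."""
    return (name.strip().lower(), float(amount), source)


def build_warehouse():
    """Load cleaned records from every source into one table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fact_spend (customer TEXT, amount REAL, source TEXT)"
    )
    rows = [normalize(r["customer"], r["spend"], "crm") for r in crm_rows]
    rows += [normalize(r["name"], r["amount"], "billing") for r in billing_rows]
    conn.executemany("INSERT INTO fact_spend VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def total_by_customer(conn):
    """Query the integrated data regardless of where each row came from."""
    cur = conn.execute(
        "SELECT customer, SUM(amount) FROM fact_spend "
        "GROUP BY customer ORDER BY customer"
    )
    return dict(cur.fetchall())
```

After loading, a single query spans both sources, which is the "unified, coordinated and integrated information environment" the description refers to.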
Kafka is selected as the real-time data acquisition tool, ETL as the offline data acquisition tool, and a crawler as the internet data acquisition tool, so that the requirements of different kinds of data acquisition are met.
More specifically, the methods applied in data sampling broadly include simple random sampling, stratified sampling, reservoir sampling, random undersampling and random oversampling, which can be used to solve different data analysis and computation problems.
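The patent names these sampling methods without describing them. The sketch below shows textbook versions of three of them in Python, as an illustration rather than the patent's own implementation: reservoir sampling (Algorithm R) draws a fixed-size uniform sample from a stream of unknown length, stratified sampling draws the same fraction from each group, and random oversampling duplicates records of smaller groups until all groups are the same size. The function names and the `key` callback are assumptions made for the example.

```python
import random
from collections import defaultdict


def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of k items from a stream (Algorithm R)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def _group(records, key):
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)
    return groups


def stratified_sample(records, key, fraction, rng):
    """Sample the same fraction from every group, at least one per group."""
    sample = []
    for members in _group(records, key).values():
        n = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, n))
    return sample


def random_oversample(records, key, rng):
    """Duplicate random members of smaller groups up to the largest size."""
    groups = _group(records, key)
    target = max(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced
```

Passing an explicit `random.Random` instance as `rng` makes the samples reproducible when a seed is fixed. Random undersampling is the mirror image of `random_oversample`: larger groups are cut down to the size of the smallest one.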
The above description covers only preferred embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.
Claims (7)
1. A novel big data storage method is characterized by comprising the following steps:
S1, selecting an excellent database tool;
S2, writing excellent program code;
S3, partitioning the mass data;
S4, establishing extensive indexes and a cache mechanism;
S5, enlarging the virtual memory and processing in batches;
S6, using a data warehouse and multidimensional database storage;
and S7, mining the data using sampled data, and storing the mass data in an associated manner.
2. The method according to claim 1, wherein the types of the database in step S1 mainly include: relational databases, columnar databases, key-value databases, graph databases, and distributed document storage databases.
3. The novel big data storage method according to claim 1, wherein the partitioning operation in step S3 divides the data into three areas, namely a hot spot area, a warm spot area and a cold spot area, where the hot spot area holds the latest and most frequently accessed data and must provide low-latency random reads and writes for online services; the warm spot area holds relatively hot data that is used only for batch computation and does not require real-time random access; and the cold spot area holds historical data that is accessed very infrequently.
4. The novel big data storage method according to claim 3, wherein for hot spot data requiring random reads and writes, HBase and SSDs are used to provide random read performance of 20 ms on average; for warm spot data used in large-scale analysis and computation, Spark, Parquet and hybrid hard disk storage are used; and for cold spot data that is accessed infrequently, Hive and HDFS are used to store all data.
5. The novel big data storage method according to claim 1, wherein the data warehouse in step S6 obtains raw data from multiple information sources, stores it in an internal database of the data warehouse after sorting and processing, provides a unified, coordinated and integrated information environment for users of the data warehouse through a data warehouse access tool, and supports the enterprise's global decision process and deep comprehensive analysis of enterprise operation and management.
6. The novel big data storage method according to claim 1, wherein Kafka is selected as the real-time data collection tool, ETL is selected as the off-line data collection tool, and Crawler is selected as the internet data collection tool.
7. The novel big data storage method according to claim 1, wherein the methods applied in data sampling broadly include simple random sampling, stratified sampling, reservoir sampling, random undersampling and random oversampling.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911103257.9A CN110888861A (en) | 2019-11-12 | 2019-11-12 | Novel big data storage method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911103257.9A CN110888861A (en) | 2019-11-12 | 2019-11-12 | Novel big data storage method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110888861A (en) | 2020-03-17 |
Family
ID=69747284
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911103257.9A Pending CN110888861A (en) | 2019-11-12 | 2019-11-12 | Novel big data storage method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110888861A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111930848A (en) * | 2020-09-17 | 2020-11-13 | 阿里云计算有限公司 | Data partition storage method, device and system |
CN113254442A (en) * | 2021-05-21 | 2021-08-13 | 首约科技(北京)有限公司 | Warehouse and table dividing method for trip industry |
CN113704346A (en) * | 2020-05-20 | 2021-11-26 | 杭州海康威视数字技术股份有限公司 | Hbase table cold and hot data conversion method and device and electronic equipment |
CN113722280A (en) * | 2021-08-16 | 2021-11-30 | 盛隆电气集团有限公司 | Storage analysis method for massive power network big data |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150074037A1 (en) * | 2013-09-12 | 2015-03-12 | Sap Ag | In Memory Database Warehouse |
CN106682067A (en) * | 2016-11-08 | 2017-05-17 | 浙江邦盛科技有限公司 | Machine learning anti-fraud monitoring system based on transaction data |
CN110083649A (en) * | 2019-04-24 | 2019-08-02 | 北京电子工程总体研究所 | Data management system and data management method based on cold, warm and hot data |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113704346A (en) * | 2020-05-20 | 2021-11-26 | 杭州海康威视数字技术股份有限公司 | Hbase table cold and hot data conversion method and device and electronic equipment |
CN113704346B (en) * | 2020-05-20 | 2024-06-04 | 杭州海康威视数字技术股份有限公司 | Hbase table cold-hot data conversion method and device and electronic equipment |
CN111930848A (en) * | 2020-09-17 | 2020-11-13 | 阿里云计算有限公司 | Data partition storage method, device and system |
WO2022057739A1 (en) * | 2020-09-17 | 2022-03-24 | 阿里云计算有限公司 | Partition-based data storage method, apparatus, and system |
CN113254442A (en) * | 2021-05-21 | 2021-08-13 | 首约科技(北京)有限公司 | Warehouse and table dividing method for trip industry |
CN113722280A (en) * | 2021-08-16 | 2021-11-30 | 盛隆电气集团有限公司 | Storage analysis method for massive power network big data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110888861A (en) | Novel big data storage method | |
CN102521405B (en) | Massive structured data storage and query methods and systems supporting high-speed loading | |
CN102521406B (en) | Distributed query method and system for complex task of querying massive structured data | |
CN109669934B (en) | Data warehouse system suitable for electric power customer service and construction method thereof | |
Goil et al. | A parallel scalable infrastructure for OLAP and data mining | |
EP2263180B1 (en) | Indexing large-scale gps tracks | |
CN104361113B (en) | A kind of OLAP query optimization method under internal memory flash memory mixing memory module | |
Ma et al. | KSQ: Top-k similarity query on uncertain trajectories | |
CN107291806A (en) | A kind of Data View copy alternative manner in Web visible environments | |
CN103810219B (en) | Line storage database-based data processing method and device | |
CN109783441A (en) | Mass data inquiry method based on Bloom Filter | |
CN104408163A (en) | Data hierarchical storage method and device | |
CN103744913A (en) | Database retrieval method based on search engine technology | |
CN113032427B (en) | Vectorization query processing method for CPU and GPU platform | |
Ilić et al. | A comparative analysis of smart metering data aggregation performance | |
Gedik et al. | Motion adaptive indexing for moving continual queries over moving objects | |
Gaurav et al. | An outline on big data and big data analytics | |
CN110990340B (en) | Big data multi-level storage architecture | |
CN108319604B (en) | Optimization method for association of large and small tables in hive | |
Colmenares et al. | A single-node datastore for high-velocity multidimensional sensor data | |
CN110825744B (en) | Cluster environment-based air quality monitoring big data partition storage method | |
Tao et al. | Range aggregation with set selection | |
Li et al. | SP-phoenix: a massive spatial point data management system based on phoenix | |
CN112463904B (en) | Mixed analysis method of distributed space vector data and single-point space data | |
Feng et al. | Indexing techniques of distributed ordered tables: A survey and analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200317 |