CN112835895A

CN112835895A - Data storage system and storage method thereof

Info

Publication number: CN112835895A
Application number: CN202110110508.7A
Authority: CN
Inventors: 王媛; 仇国祥; 何成; 熊腾辉
Original assignee: CENTURY DRAGON INFORMATION NETWORK CO LTD
Current assignee: Tianyi Digital Life Technology Co Ltd
Priority date: 2021-01-27
Filing date: 2021-01-27
Publication date: 2021-05-25

Abstract

The application discloses a data storage system and a storage method thereof, wherein the storage system comprises: the data acquisition unit, the data processing unit, the data storage unit, the data application unit and the display unit; the data acquisition unit is used for acquiring subsystem data from the service subsystem; the data processing unit is used for preprocessing the subsystem data to obtain preprocessed data and sending the preprocessed data to the data storage unit through a preset interface; a Kylin cluster is built on the data storage unit, and a storage database in the data storage unit comprises MySQL; the data storage unit is used for constructing a multi-dimensional table of the preprocessed data through the Kylin cluster and storing the preprocessed data and result data calculated through the Kylin cluster pre-query to MySQL; the data application unit is used for constructing an application program according to the multidimensional table and the result data stored in the MySQL; and the display unit is used for displaying the application program, the preprocessed data, the multi-dimensional table and the result data.

Description

Data storage system and storage method thereof

Technical Field

The present application relates to the field of databases, and in particular, to a data storage system and a storage method thereof.

Background

Data is an important product of the modern internet. As enterprises grow, large amounts of data become scattered across their business subsystems, the data volume keeps increasing, and so does the value hidden in that data. How to store large-scale data has therefore become a focus of researchers' attention.

At present, a widely used large-scale data storage system is based on HBase + SparkSQL. The bottom layer of the system is the Hadoop ecosystem, and the SparkSQL large-scale data computing engine runs on top of HBase, remaining compatible with Hive queries while optimizing their performance. In this system, SparkSQL reads data into memory and uses memory to improve query and computation performance. If most of the data fits in memory, Spark's in-memory cache gives SparkSQL good performance. For ultra-large-scale data, however, Spark must read from and write to disk frequently, and its performance drops sharply.

Disclosure of Invention

The application provides a data storage system and a storage method thereof, which solve the technical problem that the performance of existing data storage systems drops sharply when Spark frequently reads from and writes to disk for ultra-large-scale data.

In view of the above, a first aspect of the present application provides a data storage system, including: the data acquisition unit, the data processing unit, the data storage unit, the data application unit and the display unit;

the data acquisition unit is used for acquiring subsystem data from the service subsystem;

the data processing unit is used for preprocessing the subsystem data to obtain preprocessed data and sending the preprocessed data to the data storage unit through a preset interface;

a Kylin cluster is built on the data storage unit, and a storage database in the data storage unit comprises MySQL;

the data storage unit is used for constructing a multi-dimensional table of the preprocessed data through the Kylin cluster and storing the preprocessed data and result data calculated through the Kylin cluster pre-query to MySQL;

the data application unit is used for constructing an application program according to the multidimensional table and the result data stored in the MySQL;

and the display unit is used for displaying the application program, the preprocessed data, the multi-dimensional table and the result data.

Optionally, the subsystem data comprises: business data, service logs, buried point (event-tracking) data and external crawler data.

Optionally, the data processing unit includes:

the preprocessing subunit is used for converting, cleaning and summarizing the subsystem data to obtain the preprocessed data;

and the sending subunit is used for sending the preprocessed data to the data storage unit through a preset interface.

Optionally, the preprocessing subunit is specifically configured to convert, clean, and summarize the subsystem data in real time to obtain the preprocessed data.

Optionally, the preprocessing subunit acquires the subsystem data through an ETL tool.

Optionally, the Kylin cluster is further configured to, in response to a query request input by a user, obtain query data corresponding to the query request from the MySQL.

Optionally, the display unit is further configured to present the query data in the form of a report.

Optionally, the storage database in the data storage unit further includes: a Spark database;

the data storage unit is used for computing the preprocessed data offline through the Spark database and storing the computed data to the MySQL.

A second aspect of the present application provides a data storage method applied to the data storage system of the first aspect, including:

the data acquisition unit acquires subsystem data from the business subsystem;

the data processing unit preprocesses the subsystem data to obtain preprocessed data, and sends the preprocessed data to the data storage unit through a preset interface;

the data storage unit constructs a multi-dimensional table of the preprocessed data through the Kylin cluster, and stores the preprocessed data and result data calculated through the Kylin cluster pre-query to MySQL;

the data application unit constructs an application program according to the multidimensional table and the result data stored in the MySQL;

and the display unit displays the application program, the preprocessed data, the multi-dimensional table and the result data.

It can be seen from the above technical solutions that the present application has the following advantages:

the data storage system in this application includes: the data acquisition unit, the data processing unit, the data storage unit, the data application unit and the display unit; the data acquisition unit is used for acquiring subsystem data from the service subsystem; the data processing unit is used for preprocessing the subsystem data to obtain preprocessed data and sending the preprocessed data to the data storage unit through a preset interface; a Kylin cluster is built on the data storage unit, and a storage database in the data storage unit comprises MySQL; the data storage unit is used for constructing a multi-dimensional table of the preprocessed data through the Kylin cluster and storing the preprocessed data and result data calculated through the Kylin cluster pre-query to MySQL; the data application unit is used for constructing an application program according to the multidimensional table and the result data stored in the MySQL; the display unit is used for displaying the application program, the preprocessed data, the multi-dimensional table and the result data. This solves the technical problem that the performance of existing data storage systems drops sharply when Spark frequently reads from and writes to disk for ultra-large-scale data.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive labor.

FIG. 1 is a schematic structural diagram of an embodiment of a data storage system according to an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a specific embodiment of a data storage system in an embodiment of the present application;

FIG. 3 is a schematic flowchart of an embodiment of a data storage method in an embodiment of the present application.

Detailed Description

The embodiments of the application provide a data storage system and a storage method thereof, which solve the technical problem that the performance of existing data storage systems drops sharply when Spark frequently reads from and writes to disk for ultra-large-scale data.

In order that those skilled in the art may better understand the solutions of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present application.

For ease of understanding, please refer to FIG. 1, which is a schematic structural diagram of an embodiment of a data storage system according to an embodiment of the present application.

A data storage system in this embodiment includes: a data acquisition unit 101, a data processing unit 102, a data storage unit 103, a data application unit 104 and a display unit 105; the data acquisition unit 101 is used for acquiring subsystem data from the service subsystem; the data processing unit 102 is configured to preprocess the subsystem data to obtain preprocessed data, and send the preprocessed data to the data storage unit 103 through a preset interface; a Kylin cluster is built on the data storage unit 103, and a storage database in the data storage unit 103 comprises MySQL; the data storage unit 103 is used for constructing a multidimensional table of the preprocessed data through the Kylin cluster, and storing the preprocessed data and result data calculated through the Kylin cluster pre-query to MySQL; the data application unit 104 is used for constructing an application program according to the multidimensional table and the result data stored in the MySQL; and the display unit 105 is used for displaying the application program, the preprocessed data, the multi-dimensional table and the result data.

In the data storage system in this embodiment, the data acquisition unit 101 acquires subsystem data from the business subsystem; then, the data processing unit 102 preprocesses the subsystem data to obtain preprocessed data, and sends the preprocessed data to the data storage unit 103 through a preset interface; a Kylin cluster is built on the data storage unit 103, and a storage database in the data storage unit 103 comprises MySQL; the data storage unit 103 constructs a multidimensional table of the preprocessed data through the Kylin cluster, and stores the preprocessed data and result data calculated through the Kylin cluster pre-query to MySQL; the data application unit 104 constructs an application program according to the multidimensional table and the result data stored in MySQL; and finally, the display unit 105 displays the application program, the preprocessed data, the multidimensional table and the result data. This solves the technical problem that the performance of existing data storage systems drops sharply when Spark frequently reads from and writes to disk for ultra-large-scale data.

The foregoing is a first embodiment of a data storage system provided in the embodiments of the present application, and the following is a second embodiment of a data storage system provided in the embodiments of the present application.

Referring to fig. 1, fig. 1 is a schematic structural diagram of an embodiment of a data storage system according to the present application.

Further, the subsystem data includes: business data, service logs, buried point data and external crawler data.

Further, the data processing unit 102 includes:

the preprocessing subunit is used for converting, cleaning and summarizing the subsystem data to obtain preprocessed data;

and the sending subunit is used for sending the preprocessed data to the data storage unit 103 through a preset interface.

Further, the preprocessing subunit is specifically configured to convert, clean, and summarize the subsystem data in real time to obtain preprocessed data.

Further, the preprocessing subunit acquires subsystem data through the ETL tool.

Further, the Kylin cluster is also used for responding to a query request input by the user and acquiring the query data corresponding to the query request from MySQL.
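The convert, clean and summarize stages of the preprocessing subunit can be pictured with a small sketch. This is only an illustration under assumptions: the record fields `user_id`, `channel` and `amount`, the cleaning rules, and all function names are hypothetical and do not come from the application, which leaves the actual ETL tool and rules unspecified.

```python
from collections import defaultdict


def convert(record):
    """Normalize one raw record (hypothetical fields): trim ids, lowercase channel, cast amount."""
    return {
        "user_id": str(record.get("user_id", "")).strip(),
        "channel": str(record.get("channel", "unknown")).strip().lower(),
        "amount": float(record.get("amount", 0) or 0),
    }


def clean(records):
    """Drop records with an empty user id or a negative amount (assumed cleaning rules)."""
    return [r for r in records if r["user_id"] and r["amount"] >= 0]


def summarize(records):
    """Sum the amount per (user_id, channel) pair."""
    totals = defaultdict(float)
    for r in records:
        totals[(r["user_id"], r["channel"])] += r["amount"]
    return dict(totals)


def preprocess(raw_records):
    """Chain the three stages in the order named in the text."""
    return summarize(clean([convert(r) for r in raw_records]))


raw = [
    {"user_id": " u1 ", "channel": "Web", "amount": "10.5"},
    {"user_id": "u1", "channel": "web", "amount": 4.5},
    {"user_id": "", "channel": "app", "amount": 3},      # removed: no user id
    {"user_id": "u2", "channel": "app", "amount": -1},   # removed: negative amount
]
preprocessed = preprocess(raw)
```

Keeping the three stages as separate functions mirrors the wording of the text and lets each rule be replaced independently.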
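As a rough sketch of serving a query request from stored result data, the following uses Python's built-in `sqlite3` module as a stand-in for MySQL so that it runs without a server. The table `result_data`, its columns, and the function name `fetch_query_data` are assumptions made for illustration only.

```python
import sqlite3


def fetch_query_data(conn, channel):
    """Return stored result rows for one channel, using a parameterized query."""
    cur = conn.execute(
        "SELECT channel, day, total_amount FROM result_data "
        "WHERE channel = ? ORDER BY day",
        (channel,),
    )
    return cur.fetchall()


# In-memory database standing in for the MySQL result store (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE result_data (channel TEXT, day TEXT, total_amount REAL)")
conn.executemany(
    "INSERT INTO result_data VALUES (?, ?, ?)",
    [
        ("web", "2021-01-02", 7.0),
        ("web", "2021-01-01", 15.0),
        ("app", "2021-01-01", 3.5),
    ],
)
rows = fetch_query_data(conn, "web")
```

Because the aggregation was done ahead of time, the request is answered by a simple filtered read instead of a scan over the raw data.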

Further, the display unit 105 is further configured to present the query data in the form of a report.
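Presenting query data as a report can be as simple as laying the rows out in an aligned table. The sketch below is a plain-text stand-in for whatever reporting front end is actually used; the function name `render_report` and the sample columns are hypothetical.

```python
def render_report(headers, rows):
    """Format query rows as a fixed-width text table with a header separator."""
    table = [[str(h) for h in headers]] + [[str(c) for c in row] for row in rows]
    # Column width is the longest cell in that column.
    widths = [max(len(row[i]) for row in table) for i in range(len(headers))]
    lines = [
        " | ".join(cell.ljust(w) for cell, w in zip(row, widths))
        for row in table
    ]
    lines.insert(1, "-+-".join("-" * w for w in widths))
    return "\n".join(lines)


report = render_report(["channel", "total"], [("web", 15.0), ("app", 3.5)])
print(report)
```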

Further, the storage database in the data storage unit 103 further includes: a Spark database;

and the data storage unit 103 is used for computing the preprocessed data offline through the Spark database and storing the computed data into MySQL.
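The offline compute-then-store flow can be sketched as follows. Plain Python stands in for the Spark batch job and `sqlite3` stands in for MySQL, so both are substitutions made for a runnable illustration; the table `daily_event_counts` and the record fields are hypothetical.

```python
import sqlite3
from collections import Counter


def offline_compute(preprocessed_rows):
    """Count events per (day, event_type); a stand-in for a Spark batch job."""
    return Counter((r["day"], r["event_type"]) for r in preprocessed_rows)


def store_results(conn, counts):
    """Write the aggregated counts, replacing earlier values for the same key."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_event_counts ("
        "day TEXT, event_type TEXT, n INTEGER, PRIMARY KEY (day, event_type))"
    )
    # INSERT OR REPLACE is SQLite syntax; MySQL would use REPLACE INTO or
    # INSERT ... ON DUPLICATE KEY UPDATE for the same effect.
    conn.executemany(
        "INSERT OR REPLACE INTO daily_event_counts VALUES (?, ?, ?)",
        [(day, event, n) for (day, event), n in sorted(counts.items())],
    )
    conn.commit()


events = [
    {"day": "2021-01-01", "event_type": "click"},
    {"day": "2021-01-01", "event_type": "click"},
    {"day": "2021-01-01", "event_type": "view"},
    {"day": "2021-01-02", "event_type": "click"},
]
db = sqlite3.connect(":memory:")
store_results(db, offline_compute(events))
stored = db.execute(
    "SELECT day, event_type, n FROM daily_event_counts ORDER BY day, event_type"
).fetchall()
```

Making the write idempotent on the `(day, event_type)` key means a batch can be rerun without duplicating rows.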

For ease of understanding, the data storage system of the present embodiment is described in detail below with reference to a specific embodiment:

As shown in FIG. 2, the data storage system in this embodiment builds a complete enterprise-level system for user data storage, interactive query analysis and visual report display, based on the Hadoop ecosystem and the Apache Kylin large-scale data query engine. The layers are as follows:

1) Data acquisition layer (i.e., data acquisition unit 101): mainly responsible for gathering the subsystem data of each subsystem (namely business data, service logs, buried point data, external crawler data and the like).

2) Data processing layer (i.e., data processing unit 102): extracts, converts, cleans and summarizes the subsystem data, either periodically in batches through an ETL tool or in real time through Kafka, to obtain preprocessed data, and stores the preprocessed data in an enterprise-level central information database.

3) Storage computation layer (i.e., data storage unit 103): a big data storage and computing cluster is built on the Hadoop + HDFS + HBase + Hive + YARN ecosystem, which improves data storage capacity and the execution efficiency of computing tasks. A Kylin cluster environment is built on top of this cluster (Kylin usually needs only a small number of nodes). The report data, query data and preprocessed data used for display are computed offline by Spark, Hive or MapReduce and stored in MySQL. At the same time, the Kylin cluster builds cubes by precomputation, constructing multidimensional tables from the fact table. Kylin is not a substitute for Hive or Spark SQL but a query accelerator for them, and it is suited to tasks where the data query and aggregation analysis targets are already determined and response speed must be improved.

4) Data application layer (i.e., data application unit 104): a Web application is built (for example, with Spring Boot) on the report data stored in MySQL or on the dimension tables built by the Kylin cube. The application obtains the report data through a JDBC driver and standard SQL, querying via JdbcTemplate or an ORM framework.

5) Presentation and application layer (i.e., display unit 105): the presentation layer presents query reports, statistical analysis, multidimensional online analysis and data mining conclusions to the user through front-end analysis tools. Rich visual charts can be drawn with visualization libraries such as ECharts, amCharts and D3, or the query results can be shown directly in tables. BI tools such as Tableau and Power BI can also connect directly through a standard JDBC driver to build an integrated visual dashboard, so that users or enterprise decision makers obtain more rational and more intuitive data analysis results.

In this embodiment, once the user's data analysis targets and query conditions are determined, the cube precomputation capability of the Kylin cluster is used to build multidimensional tables from the original data tables in Hive, which greatly reduces the scale of the online data. Queries are then issued in standard SQL through ODBC, JDBC or the RESTful API, and sub-second responses can be achieved on TB-scale ultra-large data sets, realizing low-latency queries with data aggregation conditions.
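The idea of building a cube from a fact table in the storage computation layer can be illustrated conceptually: for n dimensions, every one of the 2^n dimension subsets (cuboids) is aggregated ahead of time, so a later group-by query becomes a lookup. The sketch below is a toy in-memory model, not Kylin's implementation or API; the fact-table columns and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(fact_rows, dimensions, measure):
    """Pre-aggregate SUM(measure) for every subset of the dimensions (each subset is a cuboid)."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            cuboid = defaultdict(float)
            for row in fact_rows:
                cuboid[tuple(row[d] for d in dims)] += row[measure]
            cube[dims] = dict(cuboid)
    return cube


def query(cube, dimensions, group_by):
    """Answer a group-by request by looking up the matching precomputed cuboid."""
    key = tuple(d for d in dimensions if d in group_by)
    return cube[key]


DIMS = ("day", "channel", "region")
facts = [
    {"day": "2021-01-01", "channel": "web", "region": "south", "amount": 10.0},
    {"day": "2021-01-01", "channel": "app", "region": "south", "amount": 5.0},
    {"day": "2021-01-02", "channel": "web", "region": "north", "amount": 7.0},
    {"day": "2021-01-02", "channel": "web", "region": "south", "amount": 3.0},
]
cube = build_cube(facts, DIMS, "amount")
by_channel = query(cube, DIMS, {"channel"})
```

The trade-off is storage and build time for query latency: the number of cuboids grows exponentially with the number of dimensions, which is why this approach fits workloads whose query and aggregation targets are known in advance.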
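For the RESTful route, a query can be prepared as an HTTP request. The sketch below only constructs the request and does not send it. It assumes the query endpoint is `POST /kylin/api/query` with a JSON body carrying `sql` and `project` and HTTP Basic authentication, as described in the Apache Kylin REST API documentation; this should be checked against the deployed Kylin version. The host, project name, credentials and table are placeholders.

```python
import base64
import json
import urllib.request


def build_kylin_query_request(base_url, project, sql, user, password, limit=100):
    """Build (but do not send) a POST request for an assumed Kylin query endpoint."""
    payload = {
        "sql": sql,
        "project": project,
        "offset": 0,
        "limit": limit,
        "acceptPartial": False,
    }
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/kylin/api/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# All values below are placeholders, not taken from the application.
req = build_kylin_query_request(
    base_url="http://kylin.example.internal:7070/",
    project="demo_project",
    sql="SELECT channel, SUM(amount) FROM sales_fact GROUP BY channel",
    user="demo_user",
    password="demo_password",
)
```

Sending the request would be a matter of passing `req` to `urllib.request.urlopen` against a reachable cluster and decoding the JSON response.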

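The foregoing is the second embodiment of the data storage system provided in the embodiments of the present application; the following is an embodiment of the data storage method provided in the embodiments of the present application.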

Referring to FIG. 3, FIG. 3 is a schematic flowchart illustrating an embodiment of a data storage method according to an embodiment of the present application.

A data storage method in this embodiment includes:

Step 301, the data acquisition unit acquires subsystem data from the service subsystem.

Step 302, the data processing unit preprocesses the subsystem data to obtain preprocessed data, and sends the preprocessed data to the data storage unit through a preset interface.

Step 303, the data storage unit constructs a multidimensional table of the preprocessed data through the Kylin cluster, and stores the preprocessed data and the result data calculated through the Kylin cluster pre-query to MySQL.

Step 304, the data application unit constructs an application program according to the multidimensional table and the result data stored in the MySQL.

Step 305, the display unit displays the application program, the preprocessed data, the multi-dimensional table and the result data.

In the data storage method in this embodiment, the data acquisition unit collects subsystem data from the business subsystem; then, the data processing unit preprocesses the subsystem data to obtain preprocessed data, and sends the preprocessed data to the data storage unit through a preset interface; a Kylin cluster is built on the data storage unit, and a storage database in the data storage unit comprises MySQL; the data storage unit constructs a multidimensional table of the preprocessed data through the Kylin cluster, and stores the preprocessed data and result data calculated through the Kylin cluster pre-query to MySQL; the data application unit constructs an application program according to the multidimensional table and the result data stored in MySQL; and finally, the display unit displays the application program, the preprocessed data, the multi-dimensional table and the result data. This solves the technical problem that the performance of existing data storage systems drops sharply when Spark frequently reads from and writes to disk for ultra-large-scale data.
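Steps 301 to 305 can be read as a five-stage pipeline in which each unit hands its output to the next. The following toy sketch wires the stages together in memory; every function name, field and data value is hypothetical, and dictionaries stand in for the Kylin cluster and MySQL.

```python
def collect(subsystems):
    """Step 301: gather records from every business subsystem."""
    return [rec for records in subsystems.values() for rec in records]


def preprocess(records):
    """Step 302: normalize records and drop those without a channel."""
    return [
        {"channel": r["channel"].lower(), "amount": float(r["amount"])}
        for r in records
        if r.get("channel")
    ]


def store(preprocessed):
    """Step 303: keep the preprocessed data plus a pre-aggregated table by channel."""
    table = {}
    for r in preprocessed:
        table[r["channel"]] = table.get(r["channel"], 0.0) + r["amount"]
    return {"preprocessed": preprocessed, "multidimensional_table": table}


def build_application(storage):
    """Step 304: expose the stored aggregates through a small query function."""
    def app(channel):
        return storage["multidimensional_table"].get(channel, 0.0)
    return app


def display(app, channels):
    """Step 305: render the application's answers as display lines."""
    return [f"{c}: {app(c):.2f}" for c in channels]


subsystems = {
    "orders": [{"channel": "Web", "amount": "10"}],
    "logs": [{"channel": "web", "amount": 2.5}, {"channel": "", "amount": 1}],
}
storage = store(preprocess(collect(subsystems)))
output = display(build_application(storage), ["web", "app"])
```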

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

The terms "first," "second," "third," "fourth," and the like in the description of the application and the above-described figures, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

It should be understood that in the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships are possible; e.g., "A and/or B" may indicate: only A is present, only B is present, or both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division of the units is only one logical division, and other divisions are possible in an actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling, direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, apparatuses or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.

The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims

1. A data storage system, comprising: the data acquisition unit, the data processing unit, the data storage unit, the data application unit and the display unit;

the display unit is used for displaying the application program, the preprocessed data, the multi-dimensional table and the result data.

2. The data storage system of claim 1, wherein the subsystem data comprises: business data, service logs, buried point data and external crawler data.

3. The data storage system of claim 1, wherein the data processing unit comprises:

4. The data storage system of claim 3, wherein the preprocessing subunit is configured to convert, clean, and summarize the subsystem data in real time to obtain the preprocessed data.

5. The data storage system of claim 3, wherein the preprocessing subunit obtains the subsystem data via an ETL tool.

6. The data storage system of claim 1, wherein the Kylin cluster is further configured to, in response to a query request input by a user, obtain query data corresponding to the query request from the MySQL.

7. The data storage system of claim 6, wherein the display unit is further configured to present the query data in the form of a report.

8. The data storage system of claim 1, wherein the storage database in the data storage unit further comprises: a Spark database;

9. A data storage method applied to the data storage system according to any one of claims 1 to 8, comprising:

the data acquisition unit acquires subsystem data from the business subsystem;