
CN115309724A - Data warehouse design and data analysis acceleration system for industrial analysis

Info

Publication number: CN115309724A
Application number: CN202211037559.2A
Authority: CN
Inventors: 王永顺; 吴楠; 齐海茂; 赵飞飞; 郭向国; 张克猛
Original assignee: Ningbo Digital Technology Co ltd
Current assignee: Ningbo Digital Technology Co ltd
Priority date: 2022-08-26
Filing date: 2022-08-26
Publication date: 2022-11-08

Abstract

The invention discloses a data warehouse design and data analysis acceleration system for industrial analysis, and relates to the technical field of industrial analysis. The system comprises a data warehouse design module and a data analysis acceleration module. The data warehouse design module is used for storing data from the data sources and operating on it as required; the data analysis acceleration module is used for constructing and analyzing the stored data. The data warehouse design module comprises an ODS layer, a DWD layer, a DWM layer and an ADS layer; the ODS layer is used for storing data extracted from various data sources; the DWD layer is used for data cleaning. On a macroscopic level, the system plays a positive role in industrial research, analysis, modeling and the construction of industrial information systems; on a microscopic level, it serves as a cornerstone for comprehensively evaluating enterprises within an industry and can better help governments understand how those enterprises are developing; it also accelerates the analysis, by governments in various regions, of local enterprise operating conditions, regional industry distribution, industrial cluster distribution and the like.

Description

Data warehouse design and data analysis acceleration system for industrial analysis

Technical Field

The invention belongs to the technical field of industrial analysis, and particularly relates to a data warehouse design and data analysis acceleration system for industrial analysis.

Background

Data is the core of every platform, the core of industrial intelligence, and the fuel for data mining and data analysis.

As industrial analysis research deepens, more and more types of data are involved, and the multi-source, heterogeneous nature of the data is evident. At present there is no unified processing standard for the many kinds of data collected about enterprises, so the collected content is disordered and of uneven quality, which severely restricts industrial analysis modeling.

Therefore, a subject-oriented, integrated, time-variant, non-volatile data collection is needed. It yields cleaner, higher-quality data, lays a foundation for further data activities, establishes standardized data asset management methods, processes and strategies, and effectively improves the efficiency of data use. The present system is intended to solve these problems.

Disclosure of Invention

The present invention is directed to a data warehouse design and data analysis acceleration system for industrial analysis, so as to solve the technical problems in the background art.

In order to solve the technical problems, the invention is realized by the following technical scheme:

the invention relates to a data warehouse design and data analysis acceleration system for industrial analysis, which comprises a data warehouse design module and a data analysis acceleration module, wherein:

the data warehouse design module is used for storing data from the data sources and operating on it as required;

the data analysis acceleration module is used for constructing and analyzing the stored data.

Further, the data warehouse design module comprises an ODS layer, a DWD layer, a DWM layer and an ADS layer;

the ODS layer is used for storing data extracted from various data sources;

the DWD layer is used for data cleaning;

the DWM layer is used for summarizing and aggregating data, calculating indicators, and storing the processed results;

the ADS layer is used for storing data for directly supporting application development.

Further, the data analysis acceleration module comprises a data cleaning and pre-aggregation analysis module, a data metadata management module and a data quality monitoring module:

the data cleaning and pre-aggregation analysis module is used for cleaning the structured information and aggregating the existing data by dimension;

the data metadata management module is used for creating, storing, integrating and controlling metadata;

the data quality monitoring module is used for detecting the quality of data to judge the accuracy of information.

Further, the data cleaning and pre-aggregation analysis module comprises creating an ODS layer, creating a DWD layer and creating a DWM layer;

creating the ODS layer comprises specifying the content to be extracted and completing the descriptive documents, then summarizing the existing structured and unstructured data into structured data tables that are stored in the big data platform, and storing the metadata information.

Further, the step of creating the DWD layer is as follows:

S1: after data is acquired, various conversion operations are performed before the data enters the target data warehouse;

S2: dimensional modeling is carried out on the existing data according to facts and dimensions.

Further, the data conversion in step S1 is as follows:

uniformly processing collected unstructured data such as text, XML and pictures into structured data;

performing data cleaning based on the structured information.

Further, the data cleansing includes:

A. the naming conventions of data fields are unified;

B. the storage formats of the different data types are unified;

C. the statistical units of the data are unified;

D. abnormal values, such as excessively large amounts or invalid dates, are handled;

E. the representation of null data is unified.

Further, creating the DWM layer comprises aggregating the existing data by dimension and building wide tables.

Further, the metadata is collected by:

where the collected data is original enterprise data, that is, structured data, its original metadata information is retained;

where the collected data is information obtained from web pages, the metadata information is supplemented according to the web page content;

where the collected data is structured data extracted from unstructured data, the metadata information is created by us.

Further, the detection indexes of the data quality are as follows:

a. a curve chart of the daily newly added data volume of each enterprise fact table;

b. the data volume of each enterprise fact table is summarized by province, and an alarm is raised when the data volume is zero;

c. the key fields of each fact table are monitored: null rate detection, uniqueness detection, field repetition rate detection and field format error rate detection are performed on the daily updated data, and the results are drawn as charts displayed on a web page.

The invention has the following beneficial effects:

on a macroscopic level, the system plays a positive role in industrial research, analysis, modeling and the construction of industrial information systems; on a microscopic level, it serves as a cornerstone for comprehensively evaluating enterprises within an industry and can better help governments understand how those enterprises are developing; it also accelerates the analysis, by governments in various regions, of local enterprise operating conditions, regional industry distribution, industrial cluster distribution and the like.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a schematic structural view of the present invention;

FIG. 2 is a flow chart of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1-2, the present invention is a data warehouse design and data analysis acceleration system for industrial analysis.

1. Design of the industry analysis data warehouse

1. Necessity of establishing an industry analysis data warehouse

1.1 the layered design of the data mart avoids having to redevelop models when a data source changes;

1.2 the original data is isolated, whether because of data anomalies or data sensitivity, so that the raw data is decoupled from the statistical data;

1.3 complex problems are simplified: each layer handles only simple tasks, which makes problems easy to locate.

2. Industry analytics data warehouse design

The data warehouse is divided into four layers, namely an ODS layer, a DWD layer, a DWM layer and an ADS layer, and reference can be made to FIG. 1.

The ODS layer is used for storing data extracted from various data sources. The main data sources include company CRM system data, purchased third-party data sources, data tables in Excel/CSV format, enterprise business data and the like; the data is stored in the ODS layer after ETL.

The DWD layer is mainly responsible for data cleaning: the required data is cleaned according to business requirements. For example, industrial analysis is mainly concerned with data on industrial and commercial enterprises, which can be extracted to the DWD layer by a rule (an enterprise whose unified social credit code begins with 91 is an industrial and commercial enterprise); at the same time, tables are built according to the type of data.
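The extraction rule described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the implementation of the invention: the record layout, the field name `credit_code` and the function name `extract_commercial_enterprises` are hypothetical; only the "91" prefix rule comes from the text.

```python
from typing import Iterable

# Prefix rule quoted in the description: codes starting with "91"
# identify industrial and commercial enterprises.
COMMERCIAL_PREFIX = "91"


def extract_commercial_enterprises(ods_rows: Iterable[dict]) -> list:
    """Keep only ODS rows whose unified social credit code starts with '91'.

    `credit_code` is a hypothetical field name used for illustration.
    """
    dwd_rows = []
    for row in ods_rows:
        # Missing codes are treated as empty strings and therefore dropped.
        code = (row.get("credit_code") or "").strip()
        if code.startswith(COMMERCIAL_PREFIX):
            dwd_rows.append(row)
    return dwd_rows


# Placeholder codes; the X characters are not real code digits.
ods_sample = [
    {"name": "Enterprise A", "credit_code": "91330200XXXXXXXXXX"},
    {"name": "Organization B", "credit_code": "52330200XXXXXXXXXX"},
    {"name": "Enterprise C", "credit_code": None},
]
dwd_sample = extract_commercial_enterprises(ods_sample)
```

Rows with a missing code are dropped here rather than passed through; a real pipeline might instead route them to a review table.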

The DWM layer is used for summarizing and aggregating data, calculating indicators and the like, and the processed results are stored in this layer. It is mainly divided into two parts: DWS (high-granularity summary data) and DWB (low-granularity summary data).

The ADS layer is used for storing data that directly supports application development. For example, to support enterprise queries, the data is stored in an Elasticsearch database; to support display in a visualization system, the data is stored in MySQL or a similar database.

2. Implementation process

2.1 Data cleaning and pre-aggregation analysis

2.1.1 Creating the ODS layer

A data extraction process is designed and established to collect multiple related dimensions of an enterprise from multiple heterogeneous data sources. Because the source systems are very complex and lack corresponding documentation, the first task is to specify the content to be extracted and to complete the descriptive documents. The existing structured and unstructured data are then summarized into structured data tables, which are stored in the big data platform together with their metadata information.

The collected enterprise content information is shown in Table 1.
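As a rough illustration of the ODS step, the sketch below merges two heterogeneous sources (a CSV table and an XML fragment) into one structured table and records metadata for each source. The source names, field names and the shape of the metadata records are assumptions made for this example; the patent does not specify them.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical inputs standing in for two heterogeneous sources.
CSV_SOURCE = "name,region\nEnterprise A,Ningbo\n"
XML_SOURCE = (
    "<enterprises><enterprise><name>Enterprise B</name>"
    "<region>Hangzhou</region></enterprise></enterprises>"
)


def rows_from_csv(text: str) -> list:
    """Parse an already-structured CSV source into row dictionaries."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def rows_from_xml(text: str) -> list:
    """Flatten each <enterprise> element into a row dictionary."""
    root = ET.fromstring(text)
    return [
        {child.tag: (child.text or "") for child in enterprise}
        for enterprise in root.iter("enterprise")
    ]


def build_ods_table(sources: dict) -> tuple:
    """Merge all sources into one table and record per-source metadata."""
    table, metadata = [], []
    for source_name, rows in sources.items():
        for row in rows:
            # Tag every row with its origin so lineage is not lost.
            table.append({**row, "source": source_name})
        columns = sorted({key for row in rows for key in row})
        metadata.append(
            {"source": source_name, "columns": columns, "row_count": len(rows)}
        )
    return table, metadata


ods_table, ods_metadata = build_ods_table(
    {"crm_csv": rows_from_csv(CSV_SOURCE), "vendor_xml": rows_from_xml(XML_SOURCE)}
)
```

Keeping a `source` column on every row is a design choice of this sketch; it makes later lineage questions answerable without consulting the metadata store.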

2.1.2 Creating the DWD layer

2.1.2.1. After data is acquired, various conversion operations are required, and only compliant data can enter the target data warehouse.

The operations required at this step are as follows:

(1) processing collected unstructured data such as text, XML and pictures into structured data;

(2) cleaning the data based on the structured information, where the cleaning comprises:

A. the naming conventions of data fields are unified;

B. the storage formats of the different data types are unified;

C. the statistical units of the data are unified;

D. abnormal values, such as excessively large amounts or invalid dates, are handled;

E. the representation of null data is unified.
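The cleaning rules can be sketched as a single row-level function. Everything concrete below is an assumption for illustration: the alias map, the null tokens, the unit factors, the date formats and the upper bound on amounts are hypothetical values, and rule D (abnormal values) is taken from the corresponding list in the Disclosure of Invention.

```python
from datetime import datetime

# All mappings and thresholds below are illustrative assumptions.
FIELD_ALIASES = {
    "corp_name": "enterprise_name",
    "reg_capital": "registered_capital",
    "est_date": "established_date",
}
NULL_TOKENS = {"", "null", "none", "n/a", "-"}
UNIT_FACTORS = {"yuan": 1.0, "10k_yuan": 10_000.0}
MAX_CAPITAL_YUAN = 1e13  # hypothetical bound for an "excessively large" amount
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y%m%d")


def normalize_null(value):
    """Rule E: map the different spellings of 'no data' to None."""
    if value is None or str(value).strip().lower() in NULL_TOKENS:
        return None
    return value


def normalize_date(value):
    """Rules B and D: store dates as ISO strings; invalid dates become None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(str(value), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_row(raw: dict) -> dict:
    # Rule A: unify field names. Rule E: unify null representations.
    row = {FIELD_ALIASES.get(k, k): normalize_null(v) for k, v in raw.items()}
    # Rule C: convert amounts to one unit (yuan); unknown units raise KeyError.
    unit = row.pop("capital_unit", None) or "yuan"
    if row.get("registered_capital") is not None:
        amount = float(row["registered_capital"]) * UNIT_FACTORS[unit]
        # Rule D: treat implausibly large amounts as abnormal.
        row["registered_capital"] = amount if amount <= MAX_CAPITAL_YUAN else None
    if row.get("established_date") is not None:
        row["established_date"] = normalize_date(row["established_date"])
    return row


cleaned = clean_row(
    {
        "corp_name": "Enterprise A",
        "reg_capital": "500",
        "capital_unit": "10k_yuan",
        "est_date": "2020/02/30",  # 30 February does not exist
    }
)
```

In this sketch abnormal values are replaced with None; whether to null them, clip them or quarantine the whole row is a policy decision the text leaves open.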

2.1.2.2. Dimensional modeling is carried out on the existing data according to facts and dimensions. Facts represent measures of business data, while dimensions are the angles from which the data is observed. Facts are numeric values that can be aggregated and computed; dimensions are sets of hierarchies or descriptors that define the facts. Dimensional modeling has three main advantages:

Easy to understand: the dimensional model is more intuitive. Information is grouped by business category or dimension, which improves readability and makes the meaning of the data easier to interpret. The simplified model also lets systems access the database more efficiently;

High performance: the dimensional model leans toward denormalization because that optimizes query performance. In a relational model, by contrast, normalization essentially reduces data redundancy in order to optimize transaction processing or data updates;

Extensible: because the dimensional model tolerates data redundancy, adding fields to a dimension table or fact table has far less impact than it would in a relational model, so the model more easily accommodates data additions that could not be foreseen.

The determined dimension table and fact table are listed in Table 2
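Since Table 2 is not reproduced here, the following sketch uses made-up tables to show how facts and dimensions relate. The names `RegionDim`, `IndustryDim` and `InvestmentFact`, their fields and the sample values are all hypothetical; the point is only that a fact row carries numeric measures plus keys into dimension tables, and that the same measure can be observed from different dimensions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionDim:
    region_id: int
    province: str
    city: str


@dataclass(frozen=True)
class IndustryDim:
    industry_id: int
    category: str


@dataclass(frozen=True)
class InvestmentFact:
    # Keys point at dimension rows; `amount` is the numeric measure.
    region_id: int
    industry_id: int
    month: str
    amount: float


regions = {
    1: RegionDim(1, "Zhejiang", "Ningbo"),
    2: RegionDim(2, "Zhejiang", "Hangzhou"),
}
industries = {10: IndustryDim(10, "Manufacturing")}
facts = [
    InvestmentFact(1, 10, "2022-08", 120.0),
    InvestmentFact(2, 10, "2022-08", 80.0),
    InvestmentFact(1, 10, "2022-09", 50.0),
]


def total_by(fact_rows, describe):
    """Sum the measure, grouped by whatever dimension attribute `describe` returns."""
    totals = {}
    for fact in fact_rows:
        key = describe(fact)
        totals[key] = totals.get(key, 0.0) + fact.amount
    return totals


# The same facts observed from two different dimensions.
city_totals = total_by(facts, lambda f: regions[f.region_id].city)
category_totals = total_by(facts, lambda f: industries[f.industry_id].category)
```

Adding a new descriptive attribute to `RegionDim` would not require touching the fact rows, which is the extensibility property described above.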

2.1.3 Creating the DWM layer

The existing data is aggregated by dimension (time granularity and region granularity) and wide tables are built.

Aggregating by dimension effectively speeds up subsequent calculations. Based on our past experience, the month is chosen as the aggregation granularity in the time dimension, the prefecture-level city in the region dimension, and the top-level industry category in the industry dimension. For the statistical values, the number of occurrences is counted for non-measurable facts, and both the number of occurrences and the amount are counted for measurable facts.
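The aggregation described above might look like the following sketch, which groups fact rows by month, city and industry category. The field names and sample rows are assumptions; a row whose `amount` is None stands in for a fact that has no measurable amount, so it contributes to the count only.

```python
from collections import defaultdict

# Hypothetical DWD fact rows; field names are assumptions for illustration.
dwd_facts = [
    {"date": "2022-08-03", "city": "Ningbo", "category": "Manufacturing", "amount": 120.0},
    {"date": "2022-08-19", "city": "Ningbo", "category": "Manufacturing", "amount": 30.0},
    {"date": "2022-08-21", "city": "Ningbo", "category": "Manufacturing", "amount": None},
    {"date": "2022-09-01", "city": "Hangzhou", "category": "Construction", "amount": 80.0},
]


def aggregate_to_dwm(rows):
    """Aggregate facts by (month, city, industry category)."""
    summary = defaultdict(lambda: {"count": 0, "amount": 0.0})
    for row in rows:
        month = row["date"][:7]  # "YYYY-MM-DD" -> "YYYY-MM"
        key = (month, row["city"], row["category"])
        summary[key]["count"] += 1  # occurrences are counted for every fact
        if row["amount"] is not None:  # amounts are summed only when present
            summary[key]["amount"] += row["amount"]
    return dict(summary)


dwm_summary = aggregate_to_dwm(dwd_facts)
```

Downstream queries at month and city level can then read a handful of summary rows instead of rescanning the detailed facts, which is where the acceleration comes from.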

2.2 Data metadata management

Metadata is information about the organization of data, about data fields and about their relationships; put simply, metadata is data that describes data.

Metadata management is the complete process of creating, storing, integrating and controlling metadata. It helps development and business personnel quickly understand the upstream and downstream relationships of the data and its meaning, locate the data they are looking for accurately, reduce the time cost of data research, and improve working efficiency.

Our metadata comes mainly from three sources:

where the collected data is original enterprise data, that is, structured data, its original metadata information is retained;

where the collected data is information obtained from web pages, the metadata information is supplemented according to the web page content;

where the collected data is structured data extracted from unstructured data, the metadata information is created by us.

After the metadata information is acquired, it is first preserved in a document and then recorded in the metadata management platform for convenient query and maintenance.
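A metadata management platform can be approximated, for illustration only, by a small in-memory catalog that answers the two questions mentioned above: what a field means, and which tables sit downstream of a given table. The class names, table names and the `origin` values are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class FieldMeta:
    name: str
    meaning: str


@dataclass
class TableMeta:
    table: str
    origin: str  # assumed labels: "original", "web" or "extracted"
    upstream: list = field(default_factory=list)
    fields: list = field(default_factory=list)


class MetadataCatalog:
    """In-memory stand-in for a metadata management platform."""

    def __init__(self):
        self._tables = {}

    def register(self, meta: TableMeta) -> None:
        self._tables[meta.table] = meta

    def describe(self, table: str, column: str) -> str:
        """Return the documented meaning of one column."""
        for column_meta in self._tables[table].fields:
            if column_meta.name == column:
                return column_meta.meaning
        raise KeyError(f"{table}.{column} is not documented")

    def downstream_of(self, table: str) -> list:
        """List the tables that read directly from `table`."""
        return sorted(m.table for m in self._tables.values() if table in m.upstream)


catalog = MetadataCatalog()
catalog.register(
    TableMeta(
        "ods_enterprise",
        "original",
        fields=[FieldMeta("credit_code", "unified social credit code")],
    )
)
catalog.register(TableMeta("dwd_enterprise", "original", upstream=["ods_enterprise"]))
downstream = catalog.downstream_of("ods_enterprise")
meaning = catalog.describe("ods_enterprise", "credit_code")
```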

2.3 Data quality monitoring

Data quality directly determines the accuracy of the information and indirectly affects the effectiveness of, and confidence in, enterprise evaluation and analysis. Data quality monitoring is therefore critical and necessary.

The main indicators fall into the following three parts:

a. a curve chart of the daily newly added data volume of each enterprise fact table;

b. the data volume of each enterprise fact table is summarized by province, and an alarm is raised when the data volume is zero;

c. the key fields of each fact table are monitored: null rate detection, uniqueness detection, field repetition rate detection and field format error rate detection are performed on the daily updated data, and the results are drawn as charts displayed on a web page.
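Indicators b and c can be sketched as two small functions. The exact definitions of the rates are assumptions made here (each rate is computed over all rows of the daily batch), and the format check is a simplified pattern for an 18-character code rather than a full validation of unified social credit codes.

```python
import re
from collections import Counter

# Simplified, assumed format check: 18 digits or uppercase letters.
CODE_PATTERN = re.compile(r"^[0-9A-Z]{18}$")


def field_quality(rows, key_field, pattern=CODE_PATTERN):
    """Compute the indicator-c metrics for one key field of a daily batch."""
    total = len(rows)
    if total == 0:
        return {"null_rate": 0.0, "unique": True,
                "repetition_rate": 0.0, "format_error_rate": 0.0}
    values = [row.get(key_field) for row in rows]
    present = [v for v in values if v is not None]
    counts = Counter(present)
    # Rows whose value occurs more than once count as repeated.
    repeated = sum(c for c in counts.values() if c > 1)
    bad_format = sum(1 for v in present if not pattern.match(v))
    return {
        "null_rate": (total - len(present)) / total,
        "unique": all(c == 1 for c in counts.values()),
        "repetition_rate": repeated / total,
        "format_error_rate": bad_format / total,
    }


def zero_volume_alarms(rows, provinces):
    """Indicator b: provinces that contributed no rows to this fact table."""
    volume = Counter(row["province"] for row in rows)
    return sorted(p for p in provinces if volume[p] == 0)


daily_batch = [
    {"province": "Zhejiang", "credit_code": "91330200AAAAAAAAAA"},
    {"province": "Zhejiang", "credit_code": "91330200AAAAAAAAAA"},
    {"province": "Zhejiang", "credit_code": None},
    {"province": "Jiangsu", "credit_code": "bad"},
]
quality = field_quality(daily_batch, "credit_code")
alarms = zero_volume_alarms(daily_batch, ["Zhejiang", "Jiangsu", "Anhui"])
```

The resulting dictionaries are the kind of values that would be plotted per day on the monitoring web page; the charting itself is outside the scope of this sketch.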

3. Effects achieved

1. A data warehouse architecture design for industrial analysis information systems is provided.

2. An aggregation scheme covering the cleaning, screening and governance of enterprise data is provided, which better accelerates subsequent analysis of the current state of enterprises.

3. Enterprise metadata and data quality are comprehensively monitored, which helps ensure the correctness and availability of the data.

In the description herein, references to the description of "one embodiment," "an example," "a specific example" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims

1. A data warehouse design and data analysis acceleration system for industrial analysis, characterized by: the system comprises a data warehouse design module and a data analysis acceleration module:

the data warehouse design module is used for storing a data source and operating according to the requirement;

2. The data warehouse design and data analysis acceleration system for industrial analysis of claim 1, wherein the data warehouse design module comprises an ODS layer, a DWD layer, a DWM layer, and an ADS layer;

the ODS layer is used for storing data extracted from various data sources;

the DWD layer is used for cleaning data;

3. The data warehouse design and data analysis acceleration system for industry analysis of claim 1, wherein the data analysis acceleration module comprises a data cleaning and pre-aggregation analysis module, a data metadata management module and a data quality monitoring module:

4. The data warehouse design and data analysis acceleration system for industry analysis of claim 3, wherein the data cleaning and pre-aggregation analysis module comprises creating an ODS layer, creating a DWD layer, and creating a DWM layer;

creating the ODS layer comprises specifying the content to be extracted and completing the descriptive documents, then summarizing the existing structured and unstructured data into structured data tables that are stored in the big data platform, and storing the metadata information.

5. The data warehouse design and data analysis acceleration system for industry analysis of claim 4, wherein the steps of creating the DWD layer are as follows:

6. The system of claim 5, wherein the data conversion step in the step S1 is as follows:

and performing data cleaning on the basis of the structured information.

7. The data warehouse design and data analysis acceleration system for industry analysis of claim 6, wherein the data cleansing comprises:

A. the naming conventions of data fields are unified;

B. the storage formats of the different data types are unified;

C. the statistical units of the data are unified;

E. the representation of null data is unified.

8. The data warehouse design and data analysis acceleration system for industry analysis of claim 4, wherein creating the DWM layer comprises aggregating the existing data by dimension and building wide tables.

9. The data warehouse design and data analysis acceleration system for industry analysis of claim 3, wherein the metadata is collected by:

10. The data warehouse design and data analysis acceleration system for industry analysis of claim 3, wherein the detection index of data quality is as follows: