CN115309724A - Data warehouse design and data analysis acceleration system for industrial analysis - Google Patents

Data warehouse design and data analysis acceleration system for industrial analysis

Info

Publication number
CN115309724A
Authority
CN
China
Prior art keywords
data
analysis
layer
module
warehouse design
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202211037559.2A
Other languages
Chinese (zh)
Inventor
王永顺
吴楠
齐海茂
赵飞飞
郭向国
张克猛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ningbo Digital Technology Co ltd
Original Assignee
Ningbo Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ningbo Digital Technology Co ltd
Priority to CN202211037559.2A
Publication of CN115309724A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/211 Schema design and management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a data warehouse design and data analysis acceleration system for industrial analysis, and relates to the technical field of industrial analysis. The system comprises a data warehouse design module and a data analysis acceleration module. The data warehouse design module stores data from the data sources and operates on it as required; the data analysis acceleration module builds and analyzes the stored data. The data warehouse design module comprises an ODS layer, a DWD layer, a DWM layer and an ADS layer; the ODS layer stores data extracted from various data sources, and the DWD layer performs data cleaning. At the macro level, the system supports industrial research, analysis, modeling and the construction of industrial information systems; at the micro level, it serves as a cornerstone for the comprehensive evaluation of enterprises within an industry and helps governments understand how those enterprises are developing. It also accelerates regional governments' analysis of local enterprise operations, the regional distribution of industry, the distribution of industrial clusters, and the like.

Description

Data warehouse design and data analysis acceleration system for industrial analysis
Technical Field
The invention belongs to the technical field of industrial analysis, and particularly relates to a data warehouse design and data analysis acceleration system for industrial analysis.
Background
Data is the core of every platform, the core of industrial intelligence, and the fuel for data mining and data analysis;
as industrial analysis research deepens, more and more types of data are involved and the data is conspicuously multi-source and heterogeneous. At present there is no unified processing standard for the many kinds of data that enterprises collect, so the collected data is disorganized and uneven in quality, which severely restricts industrial analysis modeling;
therefore, a subject-oriented, integrated, time-variant, non-volatile data collection is needed. It yields cleaner, higher-quality data, lays a foundation for further data activities, establishes standardized methods, processes and strategies for data asset management, and effectively improves the efficiency of data use. The present system is intended to solve these problems.
Disclosure of Invention
The present invention aims to provide a data warehouse design and data analysis acceleration system for industrial analysis, so as to solve the technical problems described in the background.
To solve these technical problems, the invention is realized through the following technical scheme:
the invention is a data warehouse design and data analysis acceleration system for industrial analysis, comprising a data warehouse design module and a data analysis acceleration module, wherein:
the data warehouse design module is used for storing data from the data sources and operating on it as required;
the data analysis acceleration module is used for building and analyzing the stored data.
Further, the data warehouse design module comprises an ODS layer, a DWD layer, a DWM layer and an ADS layer;
the ODS layer is used for storing data extracted from various data sources;
the DWD layer is used for data cleaning;
the DWM layer is used for summarizing and aggregating data, calculating indicators, and storing the processed results;
the ADS layer is used for storing data for directly supporting application development.
Further, the data analysis acceleration module comprises a data cleaning and pre-aggregation analysis module, a metadata management module and a data quality monitoring module:
the data cleaning and pre-aggregation analysis module is used for cleaning the structured information and aggregating the existing data by dimension;
the metadata management module is used for creating, storing, integrating and controlling metadata;
the data quality monitoring module is used for checking the quality of the data in order to judge the accuracy of the information.
Further, the data cleaning and pre-aggregation analysis module comprises creating an ODS layer, creating a DWD layer and creating a DWM layer;
creating the ODS layer comprises specifying the content to be extracted and completing its descriptive documentation, then consolidating the existing structured and unstructured data into structured data tables that are stored on the big data platform together with their metadata information.
Further, the steps of creating the DWD layer are as follows:
S1: after data is acquired, it undergoes various conversion operations before it enters the target data warehouse;
S2: dimensional modeling is carried out on the existing data according to facts and dimensions.
Further, the data conversion in step S1 is as follows:
collected unstructured data such as text, XML and pictures is uniformly processed into structured data;
data cleaning is then performed on the structured information.
Further, the data cleaning includes:
A. unifying the naming conventions of data fields;
B. unifying the storage formats of different data types;
C. unifying the units of statistical data;
D. handling abnormal values, such as excessively large amounts or invalid dates;
E. unifying the representation of null data.
Further, creating the DWM layer comprises aggregating the existing data by dimension and building wide tables.
Further, the metadata is collected in the following ways:
the collected data is original enterprise data, that is, structured data stored together with its original metadata information;
the collected data is information obtained from web pages, and the metadata information can be supplemented according to the web page content;
the collected data is structured data extracted from unstructured data, for which the metadata information is created during extraction.
Further, the data quality indicators are as follows:
a. a daily curve of the volume of newly added data for each enterprise fact table;
b. a summary of data volume by province for each enterprise fact table, with an alarm raised when the volume is zero;
c. monitoring of the key fields of each fact table: null rate detection, uniqueness detection, field repetition rate detection and field format error rate detection are run on the daily updated data, and the results are plotted as charts displayed on a web page.
The invention has the following beneficial effects:
at the macro level, the system supports industrial research, analysis, modeling and the construction of industrial information systems; at the micro level, it serves as a cornerstone for the comprehensive evaluation of enterprises within an industry and helps governments understand how those enterprises are developing; it also accelerates regional governments' analysis of local enterprise operations, the regional distribution of industry, the distribution of industrial clusters, and the like.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic structural view of the present invention;
FIG. 2 is a flow chart of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-2, the present invention is a data warehouse design and data analysis acceleration system for industrial analysis.
1. Design of the industry analysis data warehouse
1. Necessity of establishing an industry analysis data warehouse
1.1 A layered design avoids having to redevelop models when a data source changes;
1.2 Raw data is isolated: whether the concern is abnormal data or sensitive data, the real data is decoupled from the statistical data;
1.3 Complex problems are simplified: each layer handles only simple tasks, which makes problems easy to locate.
2. Industry analysis data warehouse design
The data warehouse is divided into four layers: an ODS layer, a DWD layer, a DWM layer and an ADS layer (see FIG. 1).
The ODS layer stores data extracted from various data sources. The main data sources include company CRM system data, purchased third-party data sources, Excel/CSV data tables, enterprise business data and the like. The data is stored in the ODS layer after ETL.
The DWD layer is mainly responsible for data cleaning, and cleans the required data according to business requirements. For example, industry analysis is mainly concerned with data on industrial and commercial enterprises, which can be extracted to the DWD layer by a rule (an enterprise whose unified social credit code begins with 91 is an industrial and commercial enterprise); at the same time, tables are built according to the type of data.
The DWM layer summarizes and aggregates data, calculates indicators and so on, and stores the processed results in this layer. It is divided into two parts: DWS (high-granularity summary data) and DWB (low-granularity summary data).
The ADS layer stores data that directly supports application development. For example, to support enterprise queries, the data is stored in an Elasticsearch database; to support display in a visualization system, the data is stored in MySQL or a similar database.
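The DWD extraction rule described above (unified social credit codes beginning with 91, one table per data type) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation disclosed in the application: the field names `credit_code` and `data_type`, the function names and the sample rows are all hypothetical, and the codes are placeholders rather than real registrations.

```python
from collections import defaultdict


def is_commercial_enterprise(credit_code: str) -> bool:
    """Rule from the text: the unified social credit code begins with 91."""
    return credit_code.strip().startswith("91")


def extract_to_dwd(ods_rows):
    """Keep rows that match the rule and group them into one table per data type."""
    dwd_tables = defaultdict(list)
    for row in ods_rows:
        if is_commercial_enterprise(row["credit_code"]):
            dwd_tables[row["data_type"]].append(row)
    return dict(dwd_tables)


# Hypothetical ODS rows; the 18-character codes are placeholders.
ods_rows = [
    {"credit_code": "91" + "0" * 16, "data_type": "registration", "name": "Enterprise A"},
    {"credit_code": "52" + "0" * 16, "data_type": "registration", "name": "Organization B"},
    {"credit_code": "91" + "1" * 16, "data_type": "patent", "name": "Enterprise C"},
]

dwd = extract_to_dwd(ods_rows)
for table_name, rows in sorted(dwd.items()):
    print(table_name, [r["name"] for r in rows])
```

Keeping the rule in a single predicate function means a change to the business rule touches one place, which matches the layering goal of keeping each layer's task simple.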
2. Implementation process
2.1 Data cleaning and pre-aggregation analysis
2.1.1 Creating the ODS layer
A data extraction process is designed and established to collect multiple related dimensions of enterprise data from multiple heterogeneous data sources. The source systems are very complex and lack corresponding documentation, so the first task is to specify the content to be extracted and complete the descriptive documentation. The existing structured and unstructured data is then consolidated into structured data tables stored on the big data platform, and the metadata information is stored as well;
the collected enterprise content information is shown in Table 1.
[Table 1: collected enterprise content information; reproduced only as images in the original publication (BDA0003817994660000061, BDA0003817994660000071)]
2.1.2 Creating the DWD layer
2.1.2.1. After data is acquired, various conversion operations are required, and only compliant data may enter the target data warehouse.
The operations required at this step are as follows:
(1) processing collected unstructured data such as text, XML and pictures into structured data;
(2) cleaning the data on the basis of the structured information, the cleaning comprising:
A. unifying the naming conventions of data fields;
B. unifying the storage formats of different data types;
C. unifying the units of statistical data;
D. handling abnormal values, such as excessively large amounts or invalid dates;
E. unifying the representation of null data.
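Cleaning steps A to E can be illustrated with a short sketch. The alias table, null tokens, unit factors, date formats and the upper bound on capital below are all assumptions chosen for the example; the application does not specify them.

```python
from datetime import datetime

# All names and thresholds below are illustrative assumptions.
FIELD_ALIASES = {"company_name": "enterprise_name",      # A: one canonical field name
                 "ent_name": "enterprise_name",
                 "reg_capital": "registered_capital"}
NULL_TOKENS = {"", "null", "none", "n/a", "-"}            # E: one representation of null
UNIT_TO_YUAN = {"yuan": 1.0, "10k_yuan": 10_000.0}        # C: one statistical unit
MAX_CAPITAL_YUAN = 1e13                                   # D: hypothetical sanity bound
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y%m%d")         # B: accepted input formats


def parse_date(text):
    """Return an ISO date string, or None when the date is invalid (B and D)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_record(raw):
    rec = {}
    for key, value in raw.items():
        name = key.strip().lower()
        name = FIELD_ALIASES.get(name, name)              # A
        if isinstance(value, str):
            value = value.strip()
            if value.lower() in NULL_TOKENS:              # E
                value = None
        rec[name] = value

    unit = rec.pop("capital_unit", None) or "yuan"
    capital = rec.get("registered_capital")
    if capital is not None:
        try:
            amount = float(capital) * UNIT_TO_YUAN[unit]  # B and C
        except (KeyError, TypeError, ValueError):
            amount = None
        if amount is not None and not 0 <= amount <= MAX_CAPITAL_YUAN:
            amount = None                                 # D: out-of-range amount
        rec["registered_capital"] = amount

    if rec.get("founded_date") is not None:
        rec["founded_date"] = parse_date(str(rec["founded_date"]))
    return rec


raw = {"Company_Name": " Enterprise A ", "reg_capital": "500",
       "capital_unit": "10k_yuan", "founded_date": "2022/13/01", "website": "N/A"}
cleaned = clean_record(raw)
print(cleaned)
```

In this sketch an abnormal value is replaced by `None` rather than dropped with its row; whether to null, clip or reject such values is a policy decision the text leaves open.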
2.1.2.2. Dimensional modeling is carried out on the existing data according to facts and dimensions. Facts represent measurements of business data, while dimensions are the angles from which the data is observed. Facts are numeric values that can be aggregated and computed; dimensions are sets of hierarchies or descriptors that give the facts their context. Dimensional modeling has three main advantages:
Easy to understand: the dimensional model is intuitive. Information is grouped by business type or dimension, which improves readability and makes the meaning of the data easier to interpret. The simplified model also lets the system access the database more efficiently;
High performance: the dimensional model favors denormalization because it optimizes query performance. In a relational model, by contrast, the purpose of normalization is to reduce data redundancy in order to optimize transaction processing and data updates;
Extensible: because the dimensional model tolerates data redundancy, adding fields to a dimension table or fact table has far less impact than it would in a relational model, so data added unpredictably later is easier to accommodate.
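A small star schema makes the fact/dimension split concrete. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table and column names (`fact_investment`, `dim_region`, `dim_industry`) and the sample rows are hypothetical, since the actual tables of Table 2 are not reproduced in this text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_region   (region_id INTEGER PRIMARY KEY, province TEXT, city TEXT);
CREATE TABLE dim_industry (industry_id INTEGER PRIMARY KEY, category TEXT);
-- The fact table holds numeric measures plus keys into the dimensions.
CREATE TABLE fact_investment (
    enterprise_id INTEGER,
    region_id     INTEGER REFERENCES dim_region(region_id),
    industry_id   INTEGER REFERENCES dim_industry(industry_id),
    event_date    TEXT,
    amount        REAL
);
""")
conn.executemany("INSERT INTO dim_region VALUES (?, ?, ?)",
                 [(1, "Province P", "City X"), (2, "Province P", "City Y")])
conn.executemany("INSERT INTO dim_industry VALUES (?, ?)",
                 [(1, "Manufacturing"), (2, "Software")])
conn.executemany("INSERT INTO fact_investment VALUES (?, ?, ?, ?, ?)", [
    (101, 1, 1, "2022-01-15", 300.0),
    (102, 1, 2, "2022-01-20", 200.0),
    (103, 2, 1, "2022-02-03", 500.0),
])

# Extensibility: adding a descriptive column to a dimension leaves queries intact.
conn.execute("ALTER TABLE dim_region ADD COLUMN district TEXT")

rows = conn.execute("""
SELECT r.city, i.category, COUNT(*) AS events, SUM(f.amount) AS total
FROM fact_investment AS f
JOIN dim_region   AS r ON r.region_id = f.region_id
JOIN dim_industry AS i ON i.industry_id = f.industry_id
GROUP BY r.city, i.category
ORDER BY r.city, i.category
""").fetchall()

for row in rows:
    print(row)
```

The query reads the way an analyst would phrase the question (amount by city and industry category), which is the readability benefit the text attributes to dimensional models.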
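The dimension tables and fact tables that were determined are listed in Table 2.
[Table 2: dimension tables and fact tables; reproduced only as images in the original publication (BDA0003817994660000081, BDA0003817994660000091)]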
2.1.3. Creating the DWM layer
The existing data is aggregated by dimension (time granularity and region granularity) and wide tables are built;
aggregating by dimension effectively speeds up subsequent calculations. Based on our past experience, the month is chosen as the aggregation granularity in the time dimension, the prefecture-level city in the region dimension, and the top-level industry category in the industry dimension. As for the statistics chosen, non-measurable facts are counted by number of occurrences, while measurable facts are counted by both number of occurrences and total amount.
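The pre-aggregation described above can be sketched as a single pass that rolls detail facts up to month, city and industry category. The record layout (`date`, `city`, `industry`, optional `amount`) and the output column names are assumptions made for this example only.

```python
from collections import defaultdict


def pre_aggregate(facts):
    """Roll detail facts up to (month, city, industry category).

    Every fact contributes to the occurrence count; only facts that carry
    a numeric amount (measurable facts) contribute to the summed amount.
    """
    wide = defaultdict(lambda: {"event_count": 0, "amount_sum": 0.0})
    for fact in facts:
        month = fact["date"][:7]          # ISO date "YYYY-MM-DD" -> "YYYY-MM"
        cell = wide[(month, fact["city"], fact["industry"])]
        cell["event_count"] += 1
        if fact.get("amount") is not None:
            cell["amount_sum"] += fact["amount"]
    return dict(wide)


# Hypothetical detail facts from the DWD layer.
facts = [
    {"date": "2022-01-15", "city": "City X", "industry": "Manufacturing", "amount": 100.0},
    {"date": "2022-01-28", "city": "City X", "industry": "Manufacturing", "amount": None},
    {"date": "2022-02-03", "city": "City X", "industry": "Manufacturing", "amount": 50.0},
]

wide_table = pre_aggregate(facts)
for key, cell in sorted(wide_table.items()):
    print(key, cell)
```

Downstream queries can then read one row per month, city and category instead of rescanning the detail facts, which is where the speed-up comes from.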
2.2. Metadata management
Metadata is information about the organization of data, data fields and their relationships; put simply, metadata is data that describes data.
Metadata management is the complete process of creating, storing, integrating and controlling metadata. It helps developers and business staff quickly understand the upstream and downstream relationships of the data and what the data means, locate the data they are looking for precisely, reduce the time spent on data research, and work more efficiently.
Our metadata comes mainly from three sources:
the collected data is original enterprise data, that is, structured data stored together with its original metadata information;
the collected data is information obtained from web pages, and the metadata information can be supplemented according to the web page content;
the collected data is structured data extracted from unstructured data, for which we create the metadata information ourselves.
After the metadata information is acquired, it is first retained as a document and then entered into the metadata management platform for convenient query and maintenance.
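A metadata record and a lookup registry can be sketched in a few lines. This is only an illustration of the idea: the class names, fields and the in-memory registry are hypothetical and stand in for whatever metadata management platform is actually used.

```python
from dataclasses import dataclass, field
from enum import Enum


class MetadataSource(Enum):
    """The three metadata sources named in the text."""
    ORIGINAL = "stored with the original structured data"
    WEB_PAGE = "supplemented from web page content"
    EXTRACTED = "created while structuring unstructured data"


@dataclass
class TableMetadata:
    table: str
    layer: str                      # e.g. "ODS", "DWD", "DWM", "ADS"
    source: MetadataSource
    columns: dict                   # column name -> meaning
    upstream: list = field(default_factory=list)


class MetadataRegistry:
    """In-memory stand-in for a metadata management platform."""

    def __init__(self):
        self._tables = {}

    def register(self, meta: TableMetadata) -> None:
        self._tables[meta.table] = meta

    def describe(self, table: str, column: str) -> str:
        return self._tables[table].columns[column]

    def downstream_of(self, table: str) -> list:
        return sorted(m.table for m in self._tables.values() if table in m.upstream)


registry = MetadataRegistry()
registry.register(TableMetadata(
    table="ods_enterprise", layer="ODS", source=MetadataSource.ORIGINAL,
    columns={"credit_code": "unified social credit code"}))
registry.register(TableMetadata(
    table="dwd_enterprise", layer="DWD", source=MetadataSource.ORIGINAL,
    columns={"credit_code": "unified social credit code"},
    upstream=["ods_enterprise"]))

print(registry.describe("dwd_enterprise", "credit_code"))
print(registry.downstream_of("ods_enterprise"))
```

Recording `upstream` on each table is enough to answer both lineage questions mentioned in the text: upstream is read directly, and downstream is derived by scanning the registry.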
2.3 Data quality monitoring
Data quality directly affects the accuracy of information and indirectly affects the quality of, and confidence in, enterprise evaluation and analysis. Data quality monitoring is therefore critical and necessary.
The main indicators fall into the following three parts:
a. a daily curve of the volume of newly added data for each enterprise fact table;
b. a summary of data volume by province for each enterprise fact table, with an alarm raised when the volume is zero;
c. monitoring of the key fields of each fact table: null rate detection, uniqueness detection, field repetition rate detection and field format error rate detection are run on the daily updated data, and the results are plotted as charts displayed on a web page.
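Indicators b and c can be computed with simple counting. The sketch below is one possible formulation: the metric definitions (for example, repetition rate as surplus copies divided by non-null values), the 18-character pattern and the field names are assumptions, not definitions taken from the application.

```python
import re
from collections import Counter


def field_quality(rows, field_name, pattern):
    """Null rate, uniqueness, repetition rate and format error rate for one field."""
    values = [row.get(field_name) for row in rows]
    total = len(values)
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    duplicates = sum(c - 1 for c in counts.values())    # surplus copies of a value
    bad_format = sum(1 for v in present if not re.fullmatch(pattern, v))
    return {
        "null_rate": (total - len(present)) / total if total else 0.0,
        "is_unique": duplicates == 0,
        "repetition_rate": duplicates / len(present) if present else 0.0,
        "format_error_rate": bad_format / len(present) if present else 0.0,
    }


def zero_volume_alerts(rows, provinces):
    """Provinces from the expected list that received no rows in this update."""
    seen = Counter(row["province"] for row in rows)
    return [p for p in provinces if seen[p] == 0]


# Hypothetical daily update of one fact table; codes are placeholders.
code_a = "91" + "0" * 16
daily_rows = [
    {"credit_code": code_a, "province": "Province P"},
    {"credit_code": code_a, "province": "Province P"},
    {"credit_code": "91" + "1" * 16, "province": "Province P"},
    {"credit_code": "bad", "province": "Province P"},
    {"credit_code": None, "province": "Province P"},
]

report = field_quality(daily_rows, "credit_code", r"[0-9A-Z]{18}")
alerts = zero_volume_alerts(daily_rows, ["Province P", "Province Q"])
print(report)
print(alerts)
```

Run once per fact table per day, the resulting numbers form the time series that the text describes plotting as charts on a web page.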
3. Effects achieved
1. A data warehouse architecture design for industrial analysis information systems is provided.
2. A scheme for cleaning, screening, governing and aggregating enterprise data is provided, which better accelerates subsequent analysis of the current state of enterprises.
3. The metadata and data quality of enterprise data are comprehensively monitored, which helps ensure that the data is correct and usable.
In the description herein, references to the description of "one embodiment," "an example," "a specific example" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (10)

1. A data warehouse design and data analysis acceleration system for industrial analysis, characterized in that the system comprises a data warehouse design module and a data analysis acceleration module, wherein:
the data warehouse design module is used for storing data from the data sources and operating on it as required;
the data analysis acceleration module is used for building and analyzing the stored data.
2. The data warehouse design and data analysis acceleration system for industrial analysis of claim 1, wherein the data warehouse design module comprises an ODS layer, a DWD layer, a DWM layer, and an ADS layer;
the ODS layer is used for storing data extracted from various data sources;
the DWD layer is used for cleaning data;
the DWM layer is used for summarizing and aggregating data, calculating indicators, and storing the processed results;
the ADS layer is used for storing data for directly supporting application development.
3. The data warehouse design and data analysis acceleration system for industrial analysis of claim 1, wherein the data analysis acceleration module comprises a data cleaning and pre-aggregation analysis module, a metadata management module and a data quality monitoring module:
the data cleaning and pre-aggregation analysis module is used for cleaning the structured information and aggregating the existing data by dimension;
the metadata management module is used for creating, storing, integrating and controlling metadata;
the data quality monitoring module is used for checking the quality of the data in order to judge the accuracy of the information.
4. The data warehouse design and data analysis acceleration system for industrial analysis of claim 3, wherein the data cleaning and pre-aggregation analysis module comprises creating an ODS layer, creating a DWD layer and creating a DWM layer;
creating the ODS layer comprises specifying the content to be extracted and completing its descriptive documentation, then consolidating the existing structured and unstructured data into structured data tables that are stored on the big data platform together with their metadata information.
5. The data warehouse design and data analysis acceleration system for industrial analysis of claim 4, wherein the steps of creating the DWD layer are as follows:
S1: after data is acquired, it undergoes various conversion operations before it enters the target data warehouse;
S2: dimensional modeling is carried out on the existing data according to facts and dimensions.
6. The system of claim 5, wherein the data conversion in step S1 is as follows:
collected unstructured data such as text, XML and pictures is uniformly processed into structured data;
data cleaning is then performed on the structured information.
7. The data warehouse design and data analysis acceleration system for industrial analysis of claim 6, wherein the data cleaning comprises:
A. unifying the naming conventions of data fields;
B. unifying the storage formats of different data types;
C. unifying the units of statistical data;
D. handling abnormal values, such as excessively large amounts or invalid dates;
E. unifying the representation of null data.
8. The data warehouse design and data analysis acceleration system for industrial analysis of claim 4, wherein creating the DWM layer comprises aggregating the existing data by dimension and building wide tables.
9. The data warehouse design and data analysis acceleration system for industrial analysis of claim 3, wherein the metadata is collected in the following ways:
the collected data is original enterprise data, that is, structured data stored together with its original metadata information;
the collected data is information obtained from web pages, and the metadata information can be supplemented according to the web page content;
the collected data is structured data extracted from unstructured data, for which the metadata information is created during extraction.
10. The data warehouse design and data analysis acceleration system for industrial analysis of claim 3, wherein the data quality indicators are as follows:
a. a daily curve of the volume of newly added data for each enterprise fact table;
b. a summary of data volume by province for each enterprise fact table, with an alarm raised when the volume is zero;
c. monitoring of the key fields of each fact table: null rate detection, uniqueness detection, field repetition rate detection and field format error rate detection are run on the daily updated data, and the results are plotted as charts displayed on a web page.
CN202211037559.2A 2022-08-26 2022-08-26 Data warehouse design and data analysis acceleration system for industrial analysis Withdrawn CN115309724A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211037559.2A CN115309724A (en) 2022-08-26 2022-08-26 Data warehouse design and data analysis acceleration system for industrial analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211037559.2A CN115309724A (en) 2022-08-26 2022-08-26 Data warehouse design and data analysis acceleration system for industrial analysis

Publications (1)

Publication Number Publication Date
CN115309724A true CN115309724A (en) 2022-11-08

Family

ID=83864090

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211037559.2A Withdrawn CN115309724A (en) 2022-08-26 2022-08-26 Data warehouse design and data analysis acceleration system for industrial analysis

Country Status (1)

Country Link
CN (1) CN115309724A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117785983A (en) * 2024-02-20 2024-03-29 四川大学华西医院 Target object evaluation method, system, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20221108