CN112527886A - Data warehouse system based on urban brain - Google Patents

Publication number
CN112527886A
Authority
CN
China
Prior art keywords
data
layer
data warehouse
fact
granularity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110173925.6A
Other languages
Chinese (zh)
Inventor
梁鹏飞
李晓东
崔师龙
王崟乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongguancun Smart City Co Ltd
Original Assignee
Zhongguancun Smart City Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongguancun Smart City Co Ltd
Priority to CN202110173925.6A
Publication of CN112527886A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25: Integrating or interfacing systems involving database management systems
    • G06F16/254: Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G06F16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F16/283: Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Abstract

An embodiment of the invention provides a data warehouse system based on a city brain, comprising: a Hadoop-based distributed file system, data ETL, a five-layer data warehouse, online analytical processing, a Hadoop-based distributed computing engine, and metadata. The Hadoop-based distributed file system and computing engine are used to build a distributed data warehouse system in which multi-source heterogeneous data are stored, processed and analyzed in a unified manner. The data warehouse is divided into well-defined layers, which raises the reuse rate of data: the data in each layer are, for the most part, derived from the preceding layer, so that each new requirement does not bring repeated work.

Description

Data warehouse system based on urban brain
Technical Field
The invention relates to the technical field of data warehouses, in particular to a data warehouse construction method and system based on a city brain.
Background
The city brain is built on the data resources generated by a city. Using new-generation information technologies such as artificial intelligence, big data, blockchain, 5G and the Internet of Things, it constructs an artificial intelligence center for urban traffic management, public safety, emergency management, urban sanitation, tourism, environmental protection, fine-grained urban management and the like. It promotes the construction and interconnection of the city's various digital management platforms, uses real-time, full-volume city data to correct operational weaknesses in time and optimize urban public resources, and achieves breakthroughs in urban governance models, service models and the high-quality development of the digital industry.
Data warehouse technology has developed rapidly in China. Industries such as telecommunications, banking, finance, insurance, manufacturing and retail have built their own data warehouses, the most representative being the operation analysis systems built by telecom operators. Data warehouse projects nevertheless carry considerable risk. First, a data warehouse is analysis-oriented, so a firm grasp of business requirements is a prerequisite for the successful implementation of a data warehouse project: if the business requirements are not met, even a technically perfect data warehouse is meaningless. Second, on the technical side, data must be obtained from the business systems efficiently and accurately, the huge volume of data in the warehouse must be managed effectively, and personnel at different levels of the enterprise must be given flexible and effective access to the data. In addition, a data warehouse involves multiple departments and multiple systems, so obtaining the support of senior leadership, coordinating the resources of all parties and managing the project effectively are also key to its success or failure. At present there is no unified specification for data warehouses, and each company chooses a layering and modeling approach suited to its own needs.
With the explosive growth of Internet data, people have gradually become aware of the importance of data, and scientific data processing and business-intelligence data analysis are ever more widely applied. Building a city brain requires integrated, city-wide data analysis, and traditional databases cannot meet its requirements for storing and processing big data. At present there is no clear specification for building a data warehouse on the Hadoop ecosystem, so a data warehouse construction method and system for the city brain is explored here.
Disclosure of Invention
Therefore, in order to meet business requirements, an embodiment of the invention provides a data warehouse system for a city brain. The method is applied to a city brain scenario, improves the training efficiency of AI models by introducing quantum computing technology and exploiting the speed advantage of quantum computing, and meets the growing demands of city operation data and new business scenarios. The specific technical scheme is as follows:
To achieve the above object, an embodiment of the present invention provides a data warehouse system based on a city brain, comprising: a Hadoop-based distributed file system, data ETL, a five-layer data warehouse, online analytical processing, a Hadoop-based distributed computing engine, and metadata. The distributed file system is used for storing the data source in file form; the five-layer data warehouse is used for computing statistics on and storing the data source; the online analytical processing is used for responding to most analysis requests within a preset time period; the distributed computing engine in Hadoop is used for computing on the data source after data ETL; the metadata is data that describes data, and is used for identifying resources, evaluating resources and tracking changes to resources during use, so that large volumes of networked data can be managed simply and efficiently, information resources can be effectively discovered, searched and organized as a whole, and the resources in use can be effectively managed.
Further, the five-layer data warehouse comprises: an original data layer, a detail granularity fact layer, a data service common granularity layer, a data subject accumulation layer and a data application layer; wherein:
the original data layer is used for acquiring original data from a data source and storing the original data;
the detail granularity fact layer is used for constructing a detail layer fact table with the finest granularity based on each specific business process characteristic by taking a business process as a modeling drive;
the data service common granularity layer is used for taking the analyzed subject object as the modeling driver, constructing summary index fact tables of common granularity based on the index requirements of upper-layer applications and products, and physically modeling them as wide tables; it constructs statistical indexes with standardized names and consistent definitions, provides common indexes to the upper layers, and establishes aggregated wide tables and detail fact tables;
the data subject accumulation layer is used for summarizing the index fact tables daily and processing them into wide tables;
the data application layer is used for storing personalized statistical index data of the data products.
Further, the raw data includes: log data and structured data from geographic information systems, government affairs systems and Internet of Things devices.
Further, the statistical index data are obtained from the data subject accumulation layer and the data service common granularity layer; when certain complex statistical indexes cannot be obtained from these two layers, they are obtained from the original data layer.
Further, the five-layer data warehouse is implemented as a layered data warehouse on open-source Hive.
Further, the detail granularity fact layer further comprises a relational model modeling layer and a dimensional model modeling layer, wherein the relational model modeling layer is used for building a relational model of the database and the dimensional model modeling layer is used for building a dimensional model of the data; the relational model of the database is designed to comply with third normal form (3NF); the dimensional model organizes the data tables around a fact table.
Further, the data table comprises a dimension table and a fact table; the dimension table is used for storing description information of the fact; the fact table comprises a transactional fact table, a periodic snapshot fact table and an accumulative snapshot fact table.
Further, the dimension model adopts a star model; the dimension model modeling layer comprises:
the business selection module is used for selecting, within the business process system, the business line related to a specific business;
the granularity declaration module is used for declaring the level of detail or aggregation of the data stored in the data warehouse;
the dimension determination module is used for determining the dimensions that describe the business facts;
and the fact confirmation module is used for confirming the metric values in the business.
Further, the data application layer is also used for labeling the data and classifying it with Spark machine learning algorithms.
Further, the Hadoop-based distributed storage framework comprises Kafka as a message storage medium organized into different topics, and Spark processes messages between the topics on a timed basis.
The data warehouse system based on a city brain provided by the embodiment of the invention comprises a Hadoop-based distributed file system, data ETL, a five-layer data warehouse, online analytical processing, a Hadoop-based distributed computing engine and metadata, each serving the function described above. By building a distributed data warehouse system on the Hadoop-based distributed file system and computing engine, multi-source heterogeneous data are stored, processed and analyzed in a unified manner. By layering the data warehouse sensibly, the reuse rate of data is raised: the data in each layer are, for the most part, derived from the preceding layer, which avoids repeated work each time a new requirement appears.
Furthermore, in the real-time system, Kafka is used as the storage medium for messages, organized into different topics, and Spark processes messages between the topics on a timed basis; compared with the MR (MapReduce) computing engine of a traditional data warehouse, this reduces frequent file read/write I/O and greatly improves computing efficiency.
Drawings
Fig. 1 is a schematic diagram of a data warehouse system based on a city brain according to embodiment 1 of the present invention;
Fig. 2 is a diagram of the relationships between fact tables and dimension tables for part of the Internet of Things related business in the DWD layer of the city brain data warehouse;
Fig. 3 is a schematic structural diagram of the real-time alarm data warehouse system of a data warehouse system based on a city brain according to an embodiment of the present invention.
Detailed Description
In order to clearly and thoroughly show the technical solution of the present invention, the following description is made with reference to the accompanying drawings, but the scope of the present invention is not limited thereto.
Referring to fig. 1, a data warehouse system based on a city brain according to embodiment 1 of the present invention includes:
a Hadoop-based distributed storage framework, a distributed file system, data ETL, a five-layer data warehouse, online analytical processing, a Hadoop-based distributed computing engine, and metadata. The distributed storage framework is used for storing massive external data sources; the distributed file system is used for storing the data source in file form; the five-layer data warehouse is used for computing statistics on and storing the data source; the online analytical processing is used for responding to most analysis requests within a preset time period; the distributed computing engine in Hadoop is used for computing on the data source after data ETL; the metadata is data that describes data, and is used for identifying resources, evaluating resources and tracking changes to resources during use, so that large volumes of networked data can be managed simply and efficiently, information resources can be effectively discovered, searched and organized as a whole, and the resources in use can be effectively managed.
Hadoop is a distributed system infrastructure developed by the Apache Foundation. It lets users develop distributed programs without knowing the underlying details of distribution, making full use of the power of a cluster for high-speed computation and storage. Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware; it provides high-throughput access to application data and is suitable for applications with very large data sets. HDFS relaxes some POSIX requirements so that data in the file system can be accessed as a stream (streaming access). The core of the Hadoop framework is HDFS and MapReduce: HDFS provides storage for massive data, and MapReduce provides computation over it.
The data warehouse performs real-time log processing and analysis on the data source using a Flume + Logstash + Kafka + Spark Streaming architecture. Referring to the figure, behavior data reaches the log server through Nginx and device app data reaches the business server through Nginx; the servers store the data as log files, and the data is then stored in HDFS through Kafka's publish-subscribe mechanism.
Nginx (engine x) is a high-performance HTTP and reverse proxy web server that also provides IMAP/POP3/SMTP proxy services.
Metadata is data about data. It is defined here as follows: in a program, it is not the object being processed but data whose values change the behavior of the program. Its role is to control program behavior at run time in an interpreted manner.
In the embodiment of the invention, the system also comprises management layers such as distributed coordination, monitoring, timing scheduling, metadata management, authority management, quality management and the like.
A data warehouse is a subject-oriented, integrated, relatively stable collection of data that reflects historical changes and is used to support management decisions. A data warehouse architecture typically contains four levels: data source, data storage and management, data service, and data application, with a data integration step between the data source and storage.
Data source: the data sources of the data warehouse include external data, existing business systems, document data and the like.
Data integration: data extraction, cleaning, transformation and loading are completed here; an ETL (Extract-Transform-Load) tool loads data from the data sources into the data warehouse at a fixed period.
Data storage and management: this level covers the storage and management of data, including the data warehouse, data marts, data warehouse monitoring, operation and maintenance tools, metadata management and the like.
Data service: this level provides data services to front ends and applications; data can be taken directly from the data warehouse for front-end applications to use, or served to them through an OnLine Analytical Processing (OLAP) server.
Data application: this level faces users directly and includes data query tools, ad hoc reporting tools, data analysis tools, data mining tools and various application systems.
To improve analysis efficiency and the reusability of table data in the data warehouse, the warehouse must be divided into layers, so that final statistical requirements rely on intermediate analysis results as far as possible instead of being computed repeatedly from the original tables. The data warehouse is divided into five layers: the data ingestion layer, i.e. the original data layer (ODS); the detail granularity fact layer (DWD); the data service common granularity layer (DWS); the data subject accumulation layer, i.e. periodic summary (DWT); and the data application layer (ADS).
The original data layer (ODS) is used for collecting raw data from the data sources and storing it. The raw data includes log data from GIS, government affairs and IoT alarm devices and the like, and structured data stored in an RDBMS. This layer serves two purposes: first, it keeps a copy of the source data in HDFS as a record; second, subsequent ETL processing is based on this layer, and the data cleaned by ETL is imported into the DWD layer.
The detail granularity fact layer (DWD) is the detail data layer. Taking the business process as the modeling driver, it constructs detail-layer fact tables of the finest granularity based on the characteristics of each specific business process. Depending on how the enterprise uses its data, certain important dimension attribute fields may be stored redundantly in the detail fact tables, i.e. the tables are widened.
The data service common granularity layer (DWS) takes the analyzed subject object as the modeling driver, constructs summary index fact tables of common granularity based on the index requirements of upper-layer applications and products, and physically models them as wide tables. It constructs statistical indexes with standardized names and consistent definitions, provides common indexes to the upper layers, and establishes aggregated wide tables and detail fact tables.
The data subject accumulation layer (DWT) is used for summarizing the index fact tables daily and processing them into wide tables. Like the DWS layer, it holds wide-table summary index fact tables; only the summary granularity differs: the DWS layer mostly holds per-day statistical results, while the DWT layer mostly holds cumulative results over 30 days.
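The difference in summary granularity between the DWS and DWT layers can be sketched as follows. This is a minimal illustration in plain Python, not the Hive implementation described in the patent; the event rows, the function names `dws_daily_counts` and `dwt_cumulative`, and the device type values are all hypothetical.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical DWD-level alarm events: (event_date, device_type).
events = [
    (date(2021, 1, 1), "manhole_cover"),
    (date(2021, 1, 1), "fire_sensor"),
    (date(2021, 1, 15), "fire_sensor"),
    (date(2021, 2, 20), "fire_sensor"),
]

def dws_daily_counts(rows):
    """DWS-style summary: one count per (day, device_type)."""
    return Counter(rows)

def dwt_cumulative(daily, as_of, days=30):
    """DWT-style summary: per-device totals over the `days` days ending at `as_of`."""
    start = as_of - timedelta(days=days - 1)
    totals = Counter()
    for (day, device_type), n in daily.items():
        if start <= day <= as_of:
            totals[device_type] += n
    return totals

daily = dws_daily_counts(events)
last_30 = dwt_cumulative(daily, as_of=date(2021, 1, 30))
```

Note that the cumulative summary is computed from the daily summary rather than from the raw events, which mirrors the layering principle that each layer is derived from the preceding one.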
The data application layer (ADS) is used for storing the personalized statistical index data of data products. These data are generally obtained by computing statistics over the DWT or DWS layer; when certain complex statistical indexes cannot be obtained from the DWT and DWS layers, they are obtained from tables in the DWD layer.
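The source-selection rule for the ADS layer (prefer DWT, then DWS, then fall back to DWD) can be sketched as a simple ordered lookup. The catalogue contents and the names `LOOKUP_ORDER` and `pick_source_layer` are assumptions made for illustration only.

```python
from typing import Dict, Optional, Set

# Lookup order for the ADS layer: prefer the most aggregated layer first.
LOOKUP_ORDER = ["DWT", "DWS", "DWD"]

def pick_source_layer(metric: str, available: Dict[str, Set[str]]) -> Optional[str]:
    """Return the first layer in LOOKUP_ORDER that can supply `metric`, else None."""
    for layer in LOOKUP_ORDER:
        if metric in available.get(layer, set()):
            return layer
    return None

# Hypothetical catalogue of which metrics each layer exposes.
catalogue = {
    "DWT": {"alarm_count_30d"},
    "DWS": {"alarm_count_1d"},
    "DWD": {"alarm_detail"},
}
```

For example, `pick_source_layer("alarm_detail", catalogue)` falls through DWT and DWS and resolves to the DWD layer.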
In the embodiment of the present invention, after an external data source is received, data flows through the data warehouse as follows. The data of the ODS layer comes partly from the business databases, imported into HDFS through Sqoop, and partly from log data, which is collected by Flume, buffered through Kafka to smooth load peaks, then landed in HDFS and loaded into the ODS layer in Hive. The data in the ODS layer is the most primitive data and has not been processed.
Data ETL: when ODS-layer data is cleaned and loaded into the DWD layer, dirty data in the collected data or external interface data that does not meet requirements is handled, including but not limited to:
a) cleaning data whose format or content deviates;
b) cleaning logically erroneous data;
c) cleaning records with missing fields;
d) cleaning data that does not meet business requirements;
e) cleaning contradictory data;
f) desensitizing data.
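Several of the cleaning rules above can be sketched as a single record-level function. This is only an illustration under stated assumptions: the field names in `REQUIRED_FIELDS`, the timestamp format, the rule that `level` must lie between 1 and 5, and the use of a truncated SHA-256 hash for desensitization are all hypothetical and are not taken from the patent.

```python
import hashlib
import re
from typing import Optional

# Hypothetical schema for an alarm record.
REQUIRED_FIELDS = ("device_id", "alarm_time", "level", "phone")
TIME_FORMAT = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$")

def clean_record(rec: dict) -> Optional[dict]:
    """Return a cleaned copy of `rec`, or None if the record is dirty."""
    # c) records with missing fields are dropped
    if any(rec.get(f) in (None, "") for f in REQUIRED_FIELDS):
        return None
    # a) records whose timestamp format deviates are dropped
    if not TIME_FORMAT.match(rec["alarm_time"]):
        return None
    # b)/d) assumed business rule: alarm level must be an integer from 1 to 5
    if not isinstance(rec["level"], int) or not 1 <= rec["level"] <= 5:
        return None
    # f) desensitization: replace the phone number with a one-way hash
    out = dict(rec)
    out["phone"] = hashlib.sha256(rec["phone"].encode("utf-8")).hexdigest()[:16]
    return out
```

Returning a copy rather than mutating the input keeps the ODS-layer record intact, consistent with the ODS layer holding an unprocessed copy of the source data.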
In an optional implementation manner of the embodiment of the present invention, the detailed granularity fact layer further includes a relationship model modeling layer and a dimension model modeling layer, the relationship model modeling layer is configured to construct a relationship model of a database, and the dimension model modeling layer is configured to construct a dimension model of data.
Relational modeling: a relational database is designed to comply with third normal form (3NF) in order to reduce data redundancy. Associating tables through primary and foreign keys reduces redundant fields and increases flexibility between tables, but it causes an efficiency problem: querying data requires frequent join operations across multiple tables, which lowers query efficiency.
Dimensional modeling: unlike normal-form modeling, dimensional modeling is mainly used in OLAP systems and usually organizes the tables around a fact table. It is oriented mainly toward the business of the city brain; it tolerates some data redundancy in exchange for more efficient data retrieval. Considering the large data volumes of a big data environment, a constellation model (multiple star schemas sharing dimension tables) is adopted.
In the DWD layer, data tables are divided into dimension tables and fact tables. A dimension table generally holds descriptive information about facts. Each dimension table corresponds to an object or concept in the real world, such as date, region or device type. Dimension tables are wide (many attributes), have few rows compared with fact tables, have relatively fixed content, and are mostly code tables.
Each row of a fact table represents a business event; the "facts" are the measurable values of that event. Each fact table row contains additive numeric metric values and foreign keys to the dimension tables, typically two or more, and thereby expresses a many-to-many relationship between the dimension tables. Fact tables are characterized as follows:
fact tables are very large and relatively narrow: they have few columns, and they change often, with many rows added every day.
Fact tables are divided into transactional fact tables, periodic snapshot fact tables and cumulative snapshot fact tables.
Transactional fact table: each transaction or event, such as a device alarm prompt or an alarm record, is one row in the fact table. Once the transaction is committed and the row is inserted, the data is not changed; the table is updated incrementally.
Fig. 2 shows the relationships between fact tables and dimension tables for part of the Internet of Things related business in the DWD layer of the city brain data warehouse. The fact tables include an alarm information table, a manhole cover displacement sensor table, a fire platform sensor acquisition table and a toxic and harmful sensor acquisition table. The corresponding dimension tables include a bearer table, a data type table, an enterprise table, a component table and a device type table.
Periodic snapshot fact table: a periodic snapshot fact table does not retain all data, only data at fixed time intervals, such as daily or weekly population flows or monthly alarm counts.
Cumulative snapshot fact table: a cumulative snapshot fact table is used to track changes in a business fact. For example, the data warehouse may need to accumulate or store the time points of each stage of a case, from alarm through recording and processing to resolution, in order to track its progress. The fact table record is updated continually as the business process advances.
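The three kinds of fact table differ mainly in how rows are written and updated, which the following sketch tries to make concrete. It uses in-memory Python structures in place of Hive tables; the stage names in `STAGES`, the function names and the sample identifiers are hypothetical.

```python
from datetime import date

# Transactional fact table: append-only, one row per alarm event.
transaction_facts = []

def record_alarm(alarm_id, device_id, day):
    transaction_facts.append({"alarm_id": alarm_id, "device_id": device_id, "day": day})

# Periodic snapshot: one value per fixed interval, here the alarm count for a month.
def monthly_snapshot(facts, year, month):
    return sum(1 for f in facts if f["day"].year == year and f["day"].month == month)

# Cumulative snapshot: one row per case, updated in place as each stage is reached.
STAGES = ("alarmed", "recorded", "processed", "resolved")
cases = {}

def advance_case(case_id, stage, day):
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    row = cases.setdefault(case_id, {s: None for s in STAGES})
    row[stage] = day

record_alarm(1, "sensor-01", date(2021, 1, 5))
record_alarm(2, "sensor-02", date(2021, 1, 20))
record_alarm(3, "sensor-01", date(2021, 2, 1))
january_alarms = monthly_snapshot(transaction_facts, 2021, 1)

advance_case("case-1", "alarmed", date(2021, 1, 5))
advance_case("case-1", "resolved", date(2021, 1, 7))
```

The transactional table only ever grows, the snapshot is recomputed per interval, and the cumulative row for `case-1` keeps one date column per stage, with stages not yet reached left empty.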
The most important part of data warehouse modeling is the construction of the DWD layer: the DWD layer requires a dimensional model, generally a star schema; because there are multiple fact tables, the result generally takes the form of a constellation model.
In an optional implementation manner of the embodiment of the present invention, the dimensional model modeling layer includes:
the business selection module is used for selecting, within the business process system, the business line related to a specific business;
the granularity declaration module is used for declaring the level of detail or aggregation of the data stored in the data warehouse;
the dimension determination module is used for determining the dimensions that describe the business facts;
and the fact confirmation module is used for confirming the metric values in the business.
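The four modeling steps above (select the business, declare the granularity, determine the dimensions, confirm the facts) can be illustrated with a small star schema. The sketch below uses Python's built-in `sqlite3` purely so that it runs anywhere; the patent's warehouse is Hive-based, and the table names, columns and sample rows here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions: descriptive context for each alarm (step 3).
CREATE TABLE dim_device_type (device_type_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_enterprise (enterprise_id INTEGER PRIMARY KEY, name TEXT);

-- Business: device alarms (step 1). Granularity: one row per alarm (step 2).
-- Fact: alarm_count, an additive metric (step 4).
CREATE TABLE fact_alarm (
    alarm_id INTEGER PRIMARY KEY,
    device_type_id INTEGER REFERENCES dim_device_type(device_type_id),
    enterprise_id INTEGER REFERENCES dim_enterprise(enterprise_id),
    alarm_count INTEGER
);

INSERT INTO dim_device_type VALUES (1, 'manhole cover sensor'), (2, 'fire sensor');
INSERT INTO dim_enterprise VALUES (10, 'Enterprise A');
INSERT INTO fact_alarm VALUES (100, 1, 10, 1), (101, 2, 10, 1), (102, 2, 10, 1);
""")

# A typical star-schema query: join the fact table to one dimension and aggregate.
rows = conn.execute("""
SELECT d.name, SUM(f.alarm_count)
FROM fact_alarm AS f
JOIN dim_device_type AS d ON f.device_type_id = d.device_type_id
GROUP BY d.name
ORDER BY d.name
""").fetchall()
```

Adding a second fact table that reuses `dim_device_type` and `dim_enterprise` would turn this single star into the constellation form the description mentions.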
At the DWD layer, taking the business process as the modeling driver, detail-layer fact tables of the finest granularity are constructed based on each specific business process. The fact tables may be widened as appropriate.
The above is the dimensional modeling procedure for the data warehouse; the subsequent DWS, DWT and ADS layers do not involve dimensional modeling.
DWS layer: the current-day behavior of each subject object is aggregated along the dimensions, serving the subject wide tables of the DWT layer as well as certain business detail data, to meet special requirements.
DWT layer: taking the analyzed subject object as the modeling driver, a full wide table of the subject object is constructed based on the index requirements of upper-layer applications and products.
The task scheduling system then runs the previous day's tasks at 1 a.m.; when a scheduled task fails, the system sends an email to the developers so that the failure can be handled promptly.
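The scheduling behaviour described here, a daily 1 a.m. run with a notification on failure, can be sketched without any particular scheduler. The functions `next_run` and `run_with_alert` are hypothetical helpers, and `notify` is an assumed callback standing in for the email step; the patent does not name the scheduling tool.

```python
from datetime import datetime, timedelta

def next_run(now: datetime, hour: int = 1) -> datetime:
    """Return the next daily run time at `hour`:00 strictly after `now`."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate

def run_with_alert(task, notify) -> bool:
    """Run `task`; on any exception call `notify` with a message and return False."""
    try:
        task()
        return True
    except Exception as exc:
        notify(f"scheduled task failed: {exc!r}")
        return False
```

In a real deployment `notify` would wrap whatever mail or messaging facility the scheduler provides; here it can be any callable, such as `print` or a list's `append`.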
After data analysis is complete, the results can be displayed in real time by visualization tools. Hive-based OLAP multidimensional analysis components include Kylin, Druid, Presto, Elasticsearch and the like, so that decision and display systems can read results in real time; analysis with the machine learning algorithms in Spark MLlib is also supported.
In addition, secondary development can be carried out on top of the OLAP tool to build a client interface that provides services. The client interface connects to Kylin through its REST API; Kylin precomputes multidimensional cubes over the fact tables of the Hive star schema and stores the results in HBase for efficient retrieval.
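The idea of precomputing a multidimensional cube can be shown with a toy example: aggregate the measure once for every subset of the dimensions, so that later queries become lookups. This is a sketch of the general pre-aggregation technique only, not of Kylin's actual build process or storage format; `build_cube` and the sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_cube(rows, dims, measure):
    """Pre-aggregate `measure` for every subset of `dims` (one cuboid per subset)."""
    cube = {}
    for r in range(len(dims) + 1):
        for group in combinations(dims, r):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in group)
                totals[key] += row[measure]
            cube[group] = dict(totals)
    return cube

# Hypothetical fact rows with two dimensions and one additive measure.
rows = [
    {"region": "east", "device_type": "fire", "alarms": 2},
    {"region": "east", "device_type": "manhole", "alarms": 1},
    {"region": "west", "device_type": "fire", "alarms": 4},
]
cube = build_cube(rows, ("region", "device_type"), "alarms")
```

With two dimensions the cube holds four cuboids; a query such as "alarms by region" is answered by reading `cube[("region",)]` instead of scanning the fact rows again.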
Meanwhile, application-layer data in Hive can be labeled and classified with Spark machine learning algorithms.
The real-time data warehouse system shown in Fig. 3 is constructed using the same data warehouse modeling approach.
Whereas the offline data warehouse uses the ODS, DWD, DWS, DWT and ADS databases in Hive, the real-time system uses Kafka as the message storage medium, with each layer corresponding to a different topic, and Spark processes messages between the topics on a timed basis.
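The real-time flow, in which each warehouse layer is a topic and a timed job moves messages from one topic to the next, can be imitated in memory. The sketch below deliberately does not use the Kafka or Spark APIs: `topics` is a plain dictionary of queues standing in for Kafka topics, `micro_batch` stands in for one timed processing run, and the topic names and the `clean` rule are hypothetical.

```python
from collections import defaultdict, deque

# In-memory stand-in for Kafka topics, one per warehouse layer.
topics = defaultdict(deque)

def micro_batch(src, dst, transform):
    """Drain `src`, apply `transform`, and append non-None results to `dst`."""
    moved = 0
    while topics[src]:
        out = transform(topics[src].popleft())
        if out is not None:
            topics[dst].append(out)
            moved += 1
    return moved

def clean(msg):
    """Assumed ODS-to-DWD rule: drop messages that have no device_id."""
    if msg.get("device_id") is None:
        return None
    return {**msg, "cleaned": True}

topics["ods_alarm"].extend([{"device_id": "d1"}, {"device_id": None}, {"device_id": "d2"}])
moved = micro_batch("ods_alarm", "dwd_alarm", clean)
```

Because messages pass from topic to topic without intermediate files, the sketch also hints at why this arrangement avoids the repeated file reads and writes of a MapReduce pipeline.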
Example: an offline and real-time alarm data warehouse system for the urban brain is constructed.
Offline system:
1. Acquisition system: alarm data from different devices distributed across the city, networked alarms, telephone alarms, and the like are collected and stored on HDFS.
2. Data loading: the data on HDFS is imported into the raw data layer of the Hive data warehouse, so that a copy of the raw data is retained for the subsequent ETL process.
3. Data cleaning: according to the metadata design of the system, the data in the raw data layer is cleaned according to the cleaning rules and loaded into the data detail layer.
4. Common granularity summary layer: according to business system requirements and the metadata, analyses along different dimensions are carried out with ETL tools such as Kettle and Hive SQL, and day-based summary tables are built. The data source of this layer is the data detail layer.
5. Periodic summary layer: according to business requirements, summary statistics by week, month, year, or the last N days are computed with ETL tools. The source tables are those of the common granularity summary layer and the data detail layer.
6. Data application layer: the tables needed for visualization are computed with Hive SQL, preferring the periodic summary layer as the source; when the periodic summary layer cannot provide a required field, the common granularity summary layer, the data detail layer, and so on are considered in turn.
7. Task scheduling: the tasks are set to run in the early morning of each day.
8. Visual display: Superset or Kylin is connected to Hive for visual display of business reports and multidimensional analysis of data tables; the data can also be used to build an ES search index to facilitate searching for various alarm events.
9. Model training: prediction models are trained with machine learning algorithms on the data in the data warehouse to better serve the alarm system of the urban brain.
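Steps 3 to 6 of the offline flow above can be sketched in plain Python. This is an illustration under stated assumptions and not the Hive SQL or ETL jobs of the actual system: the record fields (`device_id`, `day`, `alarm_type`), the cleaning rule, and all function names are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta
from typing import Dict, List, Optional, Set, Tuple


def clean(raw_rows: List[dict]) -> List[dict]:
    """Detail layer: drop rows missing required fields and normalise values."""
    detail = []
    for row in raw_rows:
        if not row.get("device_id") or not row.get("day"):
            continue
        detail.append({
            "device_id": row["device_id"],
            "day": date.fromisoformat(row["day"]),
            "alarm_type": row.get("alarm_type", "unknown").strip().lower(),
        })
    return detail


def daily_summary(detail_rows: List[dict]) -> Counter:
    """Common granularity summary layer: alarm count per (day, alarm_type)."""
    return Counter((row["day"], row["alarm_type"]) for row in detail_rows)


def last_n_days_summary(daily: Counter, end_day: date, n: int) -> Counter:
    """Periodic summary layer: counts per alarm_type over the n days ending at end_day."""
    start = end_day - timedelta(days=n - 1)
    totals: Counter = Counter()
    for (day, alarm_type), count in daily.items():
        if start <= day <= end_day:
            totals[alarm_type] += count
    return totals


def pick_source(field: str, layers: List[Tuple[str, Set[str]]]) -> Optional[str]:
    """Application layer: return the first layer (most aggregated first) that has the field."""
    for name, fields in layers:
        if field in fields:
            return name
    return None


raw = [
    {"device_id": "d1", "day": "2021-02-01", "alarm_type": " Fire "},
    {"device_id": "d2", "day": "2021-02-01", "alarm_type": "fire"},
    {"device_id": "d1", "day": "2021-02-03", "alarm_type": "Flood"},
    {"device_id": "", "day": "2021-02-03", "alarm_type": "fire"},    # dropped by cleaning
    {"device_id": "d3", "day": "2021-01-20", "alarm_type": "fire"},  # outside the 7-day window
]

detail = clean(raw)
daily = daily_summary(detail)
weekly = last_n_days_summary(daily, date(2021, 2, 3), 7)
print(dict(weekly))  # {'fire': 2, 'flood': 1}

layers = [
    ("periodic_summary", {"alarm_type", "count_7d"}),
    ("common_summary", {"day", "alarm_type", "count"}),
    ("detail", {"device_id", "day", "alarm_type"}),
]
print(pick_source("alarm_type", layers), pick_source("device_id", layers))
```

`pick_source` mirrors the preference order in step 6: the most aggregated layer is tried first, and lower layers are consulted only when a field is missing.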
Real-time system:
Referring to Fig. 3:
1. The data from the acquisition system accessed by the real-time system is stored in a specified Kafka topic, which serves as the ODS layer.
2. In the data ETL process, Spark Streaming reads the Kafka data of the ODS layer and processes it; cleaning rules and some code tables can be obtained from ES, and the processed information is written to the topics of the DW layer for storage.
3. The DW layer from step 2 can be divided into topics such as a DWD layer, a DWS layer, and a DWT layer according to business requirements, with statistical aggregation of the data performed by Spark Streaming.
4. By connecting Kafka to Druid for real-time data display, or by importing the real-time data into Redis for visual display by the web server, alarm information can be obtained as quickly as possible when an indicator becomes abnormal.
Therefore, when an alarm collector in a given area receives an abnormal signal, such as a fire or flood alarm, the real-time platform of the urban brain can acquire the information accurately and raise an alarm immediately so that it can be handled effectively.
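The alerting behaviour of the real-time flow described above can be sketched as follows. This is a hedged illustration only: the plain dictionary stands in for a cache such as Redis, and the critical alarm types, the count threshold, the event fields, and `process_batch` are assumptions, not details given in the source.

```python
from typing import Dict, Iterable, List

CRITICAL_TYPES = {"fire", "flood"}  # alarm types that alert immediately (assumed)
COUNT_THRESHOLD = 3                 # hypothetical threshold for other alarm types


def process_batch(events: Iterable[dict], cache: Dict[str, int]) -> List[str]:
    """Update per-area counters in `cache` and return alert messages.

    Critical types alert on every event; other types alert once, when
    their running count in an area reaches COUNT_THRESHOLD.
    """
    alerts: List[str] = []
    for event in events:
        key = f"{event['district']}:{event['type']}"
        cache[key] = cache.get(key, 0) + 1
        if event["type"] in CRITICAL_TYPES:
            alerts.append(f"critical {event['type']} alarm in {event['district']}")
        elif cache[key] == COUNT_THRESHOLD:
            alerts.append(
                f"{event['type']} count reached {COUNT_THRESHOLD} in {event['district']}"
            )
    return alerts


cache: Dict[str, int] = {}
batch = [
    {"district": "north", "type": "fire"},
    {"district": "north", "type": "noise"},
    {"district": "north", "type": "noise"},
    {"district": "north", "type": "noise"},
]
alerts = process_batch(batch, cache)
print(alerts)
# ['critical fire alarm in north', 'noise count reached 3 in north']
```

Keeping the counters in a shared cache lets a display layer read current values without querying the warehouse.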
Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to one skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims (10)

1. A city brain-based data warehouse system, comprising: a Hadoop-based distributed file system, data ETL, a five-layer data warehouse, online analytical processing, a Hadoop-based distributed computing engine, and metadata; wherein the distributed file system is used for storing the data source in file form; the five-layer data warehouse is used for computing statistics on and storing the data source; the online analytical processing is used for responding to most analysis requirements within a preset time period; the Hadoop-based distributed computing engine is used for performing computation on the data source that has passed through the data ETL; and the metadata is data that describes data, used for identifying resources, evaluating resources, and tracking changes to resources during use, so as to achieve simple and efficient management of a large amount of networked data and the effective discovery, search, integrated organization, and management of information resources.
2. The city brain-based data warehouse system of claim 1, wherein the five-layer data warehouse comprises: an original data layer, a detail granularity fact layer, a data service common granularity layer, a data subject accumulation layer, and a data application layer; wherein:
the original data layer is used for acquiring original data from a data source and storing the original data;
the detail granularity fact layer is used for building detail-layer fact tables of the finest granularity, taking the business process as the modeling driver and based on the characteristics of each specific business process;
the data service common granularity layer is used for taking the analyzed subject object as the modeling driver, building summary index fact tables of common granularity based on the index requirements of upper-layer applications and products, and physically modeling them as wide tables; building statistical indicators with standard naming and consistent definitions, providing common indicators to the upper layer, and establishing aggregate wide tables and detail fact tables;
the data subject accumulation layer is used for summarizing the index fact tables every day and performing wide-table processing;
the data application layer is used for storing personalized statistical index data of the data products.
3. The city brain-based data warehouse system of claim 2, wherein the original data comprises: geographic information system data, government affairs system data, log data, and structured data from Internet of Things devices.
4. The city brain-based data warehouse system of claim 3, wherein the statistical index data is obtained from the data subject accumulation layer and the data service common granularity layer, and, when certain statistical indicators cannot be obtained from those two layers, is obtained from the original data layer.
5. The city brain-based data warehouse system of claim 1, wherein the five-layer data warehouse is an open-source Hive layered data warehouse.
6. The city brain-based data warehouse system of claim 2, wherein the detail granularity fact layer further comprises a relational model modeling layer for building the relational model of the database and a dimensional model modeling layer for building the dimensional model of the data; when the relational model of the database is designed and built, the requirements of third normal form (3NF) are followed; and when the dimensional model is built, the data tables are organized around the fact table.
7. The city brain-based data warehouse system of claim 6, wherein the data tables comprise dimension tables and fact tables; the dimension tables are used for storing descriptive information about the facts; and the fact tables comprise a transaction fact table, a periodic snapshot fact table, and an accumulating snapshot fact table.
8. The city brain-based data warehouse system of claim 6 or 7, wherein the dimensional model adopts a star schema; and the dimensional model modeling layer comprises:
a business selection module, used for selecting the business line related to a specific business in the business process system;
a granularity declaration module, used for declaring the level of detail or aggregation of the data stored in the data warehouse;
a dimension determination module, used for determining the dimensions that describe the business facts;
and a fact confirmation module, used for confirming the metric values in the business.
9. The city brain-based data warehouse system of claim 2, wherein the data application layer further comprises label data classified using Spark's machine learning algorithms.
10. The city brain-based data warehouse system of claim 1, wherein the Hadoop-based distributed file system comprises a Kafka storage medium corresponding to different topics, and wherein messages between the different topics are processed by Spark Streaming.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110173925.6A CN112527886A (en) 2021-02-09 2021-02-09 Data warehouse system based on urban brain

Publications (1)

Publication Number Publication Date
CN112527886A (en) 2021-03-19

Family

ID=74975712

Country Status (1)

Country Link
CN (1) CN112527886A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120016835A1 (en) * 2010-07-15 2012-01-19 Amarjit Singh Universal database - cDB
CN103678665A (en) * 2013-12-24 2014-03-26 焦点科技股份有限公司 Heterogeneous large data integration method and system based on data warehouses
CN104899199A (en) * 2014-03-04 2015-09-09 阿里巴巴集团控股有限公司 Data processing method and system for data warehouse
CN105843880A (en) * 2016-03-21 2016-08-10 中国矿业大学 Coal mine multi-dimensional data warehousing system based on multiple data marts

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GOAL: ""数据仓库概念总结"", 《HTTP://WWW.GAOHONGWEI.CN/636/》 *
码农家园: ""数据仓库(二) 数仓理论(重点核心)"", 《HTTPS://WWW.CODENONG.COM/CS106931344/》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113468246A (en) * 2021-07-20 2021-10-01 上海齐屹信息科技有限公司 Intelligent data counting and subscribing system and method based on OLTP
CN113468246B (en) * 2021-07-20 2023-06-23 上海齐屹信息科技有限公司 Intelligent data statistics and subscription system and method based on OLTP
CN113792041A (en) * 2021-08-04 2021-12-14 河南大学 Hive and Spark-based remote sensing data service offline batch processing system and method
CN113792041B (en) * 2021-08-04 2024-04-09 河南大学 Remote sensing data service offline batch processing system and method based on Hive and Spark
CN115374329A (en) * 2022-10-25 2022-11-22 杭州比智科技有限公司 Method and system for managing enterprise business metadata and technical metadata
CN115374329B (en) * 2022-10-25 2023-03-17 杭州比智科技有限公司 Method and system for managing enterprise business metadata and technical metadata
CN117609289A (en) * 2024-01-22 2024-02-27 山东浪潮数据库技术有限公司 Energy data processing system based on wide table

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210319

RJ01 Rejection of invention patent application after publication