CN113407649A

CN113407649A - Data warehouse modeling method and device, electronic equipment and storage medium

Info

Publication number: CN113407649A
Application number: CN202110734409.6A
Authority: CN
Inventors: 李翛然; 孙丛丛
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2021-06-30
Filing date: 2021-06-30
Publication date: 2021-09-17

Abstract

The disclosure provides a data warehouse modeling method and apparatus, an electronic device, and a storage medium, and relates to artificial intelligence fields such as big data processing and distributed storage. The method comprises: determining a vertical service to be modeled; constructing, for the vertical service, a data access layer for acquiring basic data from each service end, a basic data layer for cleaning the basic data to obtain cleaned data, a middle data layer for computing statistics on the cleaned data along different dimensions to obtain a statistical result, a data subject layer for constructing a subject according to one or both of the cleaned data and the statistical result, and a data application layer for generating a data index according to one or any combination of the cleaned data, the statistical result, and the subject; and forming a data warehouse from the constructed data access layer, basic data layer, middle data layer, data subject layer, and data application layer. Applying the disclosed scheme can reduce implementation cost and improve processing efficiency, among other benefits.

Description

Data warehouse modeling method and device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a data warehouse modeling method and apparatus, an electronic device, and a storage medium in the fields of big data processing and distributed storage.

Background

A data warehouse is a technology that developed in response to the needs of information technology and Decision Support Systems (DSS), and that guides business decisions by managing historical basic data.

A data warehouse can be implemented in different ways. A commonly used one is the Inmon method, which is top-down in terms of flow: data is organized top-down and then serves the various vertical services. This method, however, suffers from problems such as high development difficulty and a long development period.

Disclosure of Invention

The disclosure provides a data warehouse modeling method, a data warehouse modeling device, an electronic device and a storage medium.

A data warehouse modeling method, comprising:

determining a vertical service to be modeled;

constructing, for the vertical service, a data access layer for acquiring basic data from each service end, a basic data layer for cleaning the basic data to obtain cleaned data, a middle data layer for computing statistics on the cleaned data along different dimensions to obtain a statistical result, a data subject layer for constructing a subject according to one or both of the cleaned data and the statistical result, and a data application layer for generating a data index according to one or any combination of the cleaned data, the statistical result, and the subject;

and forming the data warehouse by utilizing the data access layer, the basic data layer, the middle data layer, the data subject layer and the data application layer.

A data warehouse modeling apparatus, comprising: a first processing module and a second processing module;

the first processing module is used for determining vertical services to be modeled;

the second processing module is configured to construct, for the vertical service, a data warehouse including the following layers: a data access layer, a basic data layer, a middle data layer, a data subject layer, and a data application layer;

the data access layer is used for acquiring basic data from each service end; the basic data layer is used for cleaning the basic data to obtain cleaned data; the middle data layer is used for counting the cleaned data according to different dimensions to obtain a statistical result; the data theme layer is used for constructing a theme according to one or all of the cleaned data and the statistical result; and the data application layer is used for generating data indexes according to one or any combination of the cleaned data, the statistical result and the theme.

An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as described above.

A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method as described above.

A computer program product comprising a computer program which, when executed by a processor, implements a method as described above.

One embodiment in the above disclosure has the following advantages or benefits: the data warehouse formed by a plurality of layers such as a data access layer, a basic data layer, a middle data layer, a data subject layer and a data application layer can be constructed, the data can be organized and stored orderly, and the realization is simple, so that the development difficulty is reduced, the development period is shortened, the realization cost is reduced, the processing efficiency is improved, and the like.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a flow chart of an embodiment of a data warehouse modeling method according to the present disclosure;

FIG. 2 is a schematic diagram of a hierarchical structure of a data warehouse according to the present disclosure;

FIG. 3 is a schematic diagram of a technical architecture of a data warehouse according to the present disclosure;

FIG. 4 is a schematic diagram of a component structure of an embodiment 400 of a data warehouse modeling apparatus according to the present disclosure;

FIG. 5 illustrates a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

In addition, it should be understood that the term "and/or" herein merely describes an association relationship between associated objects, meaning that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the objects before and after it are in an "or" relationship.

With the development of the mobile internet and the arrival of the "Internet Plus" era, more vertical services have gained new opportunities and more application scenarios have emerged. As data accumulates day by day, it also becomes markedly more specialized and industry-specific. In the course of business development, the information obtained through data mining and data analysis receives growing attention and serves as an important reference when setting the direction of an industry's development. The data warehouse is the foundation of data analysis and data mining, and it helps guarantee the accuracy, timeliness, and other qualities of that information.

Generally speaking, a data warehouse mostly adopts the idea of layered construction, but according to the difference of industry background, data format, data source, index requirement and the like, the layered design of the data warehouse needs to be reasonably and flexibly performed.

Accordingly, the present disclosure proposes a data warehouse that is a data collection that is data-integrated, subject-oriented, relatively stable, and capable of reflecting historical changes.

1) Data integration: the original data often comes from a multi-source heterogeneous scene, and certain processing, such as cleaning, needs to be performed on the data so as to keep the data consistent, complete, effective and accurate.

2) Subject-oriented: data in the data warehouse can be aggregated according to different dimensions and different granularities, and is organized according to a certain theme, wherein the theme is an abstract concept and is a key aspect concerned when a user uses the data warehouse to make a decision, and one theme is usually related to a plurality of operation type information systems.

3) Relatively stable: data in the data warehouse is mainly used for decision analysis; once data enters the data warehouse, it is usually retained for a long time, and modification and deletion operations are rare.

4) Reflecting historical changes: the data in the data warehouse typically contains historical information, such as information recording the business from a certain past point in time (e.g., when the data warehouse was first put into use) through the stages up to the present. With this information, quantitative analysis and prediction can be made on the development history and future trends of the business.

Accordingly, fig. 1 is a flowchart of an embodiment of a data warehouse modeling method according to the present disclosure. As shown in fig. 1, the following detailed implementation is included.

In step 101, a vertical service to be modeled is determined.

In step 102, for the vertical services, a data access layer for acquiring basic data from each service end, a basic data layer for cleaning the basic data to obtain cleaned data, a middle data layer for performing statistics on the cleaned data according to different dimensions to obtain statistical results, a data subject layer for constructing subjects according to one or all of the cleaned data and the statistical results, and a data application layer for generating data indexes according to one or any combination of the cleaned data, the statistical results, and the subjects are constructed.

In step 103, a data warehouse is formed by using the data access layer, the basic data layer, the intermediate data layer, the data subject layer and the data application layer.

In the scheme of the embodiment of the method, a data warehouse consisting of a plurality of layers, such as a data access layer, a basic data layer, an intermediate data layer, a data subject layer, a data application layer and the like, can be constructed, and can organize and store data in order, and the implementation is simple, so that the development difficulty is reduced, the development period is shortened, the implementation cost is reduced, the processing efficiency is improved, and the like.
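As an illustration only, steps 101 to 103 can be sketched as a small layer registry in which each layer declares the upstream layers it may read from. The names `Layer`, `build_warehouse`, and `is_well_ordered` are hypothetical and are not part of the disclosure; the dependencies simply mirror the wording of step 102.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Layer:
    name: str
    inputs: List[str]  # upstream layers this layer may read from


def build_warehouse(vertical: str) -> Dict[str, object]:
    """Steps 101-103: pick a vertical service, construct the layers, form the warehouse."""
    layers = [
        Layer("data_access", []),                      # acquires basic data
        Layer("basic_data", ["data_access"]),          # cleans basic data
        Layer("middle_data", ["basic_data"]),          # statistics by dimension
        Layer("data_subject", ["basic_data", "middle_data"]),
        Layer("data_application", ["basic_data", "middle_data", "data_subject"]),
    ]
    return {"vertical": vertical, "layers": layers}


def is_well_ordered(warehouse: Dict[str, object]) -> bool:
    """Check that every layer only reads from layers constructed before it."""
    seen = set()
    for layer in warehouse["layers"]:
        if any(dep not in seen for dep in layer.inputs):
            return False
        seen.add(layer.name)
    return True


warehouse = build_warehouse("automobile")
```

Declaring inputs explicitly is one way to keep the direction of data flow between layers visible; other representations would work equally well.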

In an embodiment of the present disclosure, the data warehouse may further include a data dimension layer, which is used for providing required dimension data for other layers in the data warehouse. That is, the data warehouse can be formed by the data dimension layer, the data access layer, the basic data layer, the middle data layer, the data subject layer, and the data application layer. Accordingly, fig. 2 is a schematic diagram of a hierarchical structure of a data warehouse according to the present disclosure.

The various layers in the data warehouse shown in FIG. 2 are further described below.

1) Data dimension layer

This layer can store dimension data based on an Entity-Relationship (ER) model. Dimension data can be divided into high-cardinality and low-cardinality dimension data; different dimension data can have the same or different update and maintenance periods, and the data is updated according to those periods.

The dimension data refers to basic data. Taking the automobile vertical service as an example, the dimension data may include a city code dimension table, a user characteristic dimension table, a vehicle model basic information table, a mobile terminal type configuration table, a dealer information table, and the like; the specific contents can be determined according to actual needs.

The data dimension layer can provide required dimension data for other layers in the data warehouse, namely, the other layers can obtain the dimension data from the data dimension layer when the dimension data is required to be used.
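A minimal sketch of per-table update and maintenance periods, assuming each dimension table records when it was last refreshed. The class `DimensionTable`, its fields, and the example periods are hypothetical illustrations, not values given in the disclosure.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class DimensionTable:
    name: str
    high_cardinality: bool   # high- vs. low-cardinality dimension data
    refresh_days: int        # assumed update and maintenance period, in days
    last_refreshed: date

    def is_due(self, today: date) -> bool:
        """True when the table's maintenance period has elapsed."""
        return today - self.last_refreshed >= timedelta(days=self.refresh_days)


def tables_due(tables: List[DimensionTable], today: date) -> List[str]:
    """Names of the dimension tables that should be refreshed today."""
    return [t.name for t in tables if t.is_due(today)]


tables = [
    DimensionTable("city_code", False, 30, date(2021, 6, 1)),
    DimensionTable("user_characteristic", True, 1, date(2021, 6, 29)),
]
due = tables_due(tables, date(2021, 6, 30))
```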

2) Data access layer

Basic data (namely original data) can be obtained from each service terminal based on an ER model, and a daily full-volume updating mode or an incremental updating mode can be adopted for different data.

Taking the automobile vertical service as an example, the acquired basic data may include user behavior log data, lead (clue) data, automobile vertical information distribution data, event-tracking (dotting) data, labeling data, external data, and the like; the specific contents can be determined according to actual needs.
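The difference between the daily full update mode and the incremental update mode can be sketched as follows, assuming records are keyed by an identifier. The function names and sample records are hypothetical.

```python
from typing import Dict

Record = Dict[str, object]


def full_update(snapshot: Dict[str, Record]) -> Dict[str, Record]:
    """Daily full update: the stored table is replaced by today's snapshot."""
    return dict(snapshot)


def incremental_update(stored: Dict[str, Record],
                       changes: Dict[str, Record]) -> Dict[str, Record]:
    """Incremental update: only new or changed records are merged in."""
    merged = dict(stored)
    merged.update(changes)
    return merged


stored = {"u1": {"city": "beijing"}, "u2": {"city": "shanghai"}}
changes = {"u2": {"city": "shenzhen"}, "u3": {"city": "hangzhou"}}
merged = incremental_update(stored, changes)
```

A full update is simpler and self-correcting but moves all data every day; an incremental update moves less data but depends on change records being complete.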

3) Base data layer

The basic data layer, which can also be called the data detail layer, can clean the basic data acquired by the data access layer based on the ER model, for example through Extract-Transform-Load (ETL) processing, to obtain cleaned data. Because this step involves no complex logic or model algorithms, clean, usable data can be obtained quickly and efficiently.

Taking the automobile vertical service as an example, the cleaned data may include user login data, lead distribution data, lead state update data, new user data, dealer feedback data, and the like; the specific contents can be determined according to actual needs.
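A minimal sketch of the kind of cleaning this layer performs, assuming raw login records arrive as dictionaries. The field names and the three rules shown (drop incomplete rows, normalize values, remove duplicates) are illustrative assumptions; the disclosure does not specify concrete cleaning rules.

```python
from typing import Dict, List, Optional

Raw = Dict[str, Optional[str]]


def clean(records: List[Raw]) -> List[Dict[str, str]]:
    """Drop incomplete rows, normalize fields, and remove exact duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        user_id = (rec.get("user_id") or "").strip()
        city = (rec.get("city") or "").strip().lower()
        ts = (rec.get("ts") or "").strip()
        if not user_id or not city or not ts:
            continue  # incomplete record: not usable downstream
        key = (user_id, city, ts)
        if key in seen:
            continue  # duplicate of a record already kept
        seen.add(key)
        cleaned.append({"user_id": user_id, "city": city, "ts": ts})
    return cleaned


raw = [
    {"user_id": "u1", "city": " Beijing ", "ts": "2021-06-30T08:00"},
    {"user_id": "u1", "city": "beijing", "ts": "2021-06-30T08:00"},
    {"user_id": None, "city": "shanghai", "ts": "2021-06-30T09:00"},
]
cleaned = clean(raw)
```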

4) Intermediate data layer

This layer can process the data obtained from the basic data layer based on the wide table model and the snowflake model. For example, model algorithms such as a risk control model and cleaning rules can be introduced, data features can be extracted, and data can be aggregated along different dimensions, so that the cleaned data is counted along different dimensions to obtain a statistical result.

Taking the automobile vertical service as an example, the statistical result may include user login statistics, user activity, lead distribution statistics, lead transaction statistics, regional dispersion, user stickiness level, city level, and the like; the specific contents can be determined according to actual needs.
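Counting cleaned data along different dimensions can be sketched with a generic group-and-count helper. The helper name `count_by` and the sample fields are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def count_by(records: Iterable[Dict[str, str]], *dims: str) -> Counter:
    """Count records grouped by the given dimension fields."""
    return Counter(tuple(rec[d] for d in dims) for rec in records)


logins = [
    {"user_id": "u1", "city": "beijing", "day": "2021-06-29"},
    {"user_id": "u1", "city": "beijing", "day": "2021-06-30"},
    {"user_id": "u2", "city": "shanghai", "day": "2021-06-30"},
]
by_city = count_by(logins, "city")
by_city_day = count_by(logins, "city", "day")
```

The same cleaned records yield different statistical results depending on which dimensions are passed, which is the point of aggregating "according to different dimensions".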

5) Data topic layer

This layer constructs a theme according to one or both of the cleaned data and the statistical result. For example, a theme profile can be built around a business theme based on the star model and the wide table model; such a profile is mostly expressed as a large wide table and can be updated daily.

Taking the automobile vertical service as an example, the constructed themes may include a user profile, a vehicle profile, a lead theme, a dealer profile, and the like; the specific contents can be determined according to actual needs.
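A theme expressed as a wide table can be sketched as one row per user that combines cleaned attributes with statistical results. The function `build_user_profile` and the column names are hypothetical.

```python
from typing import Dict, List


def build_user_profile(users: List[Dict[str, str]],
                       login_counts: Dict[str, int],
                       active_days: Dict[str, int]) -> List[Dict[str, object]]:
    """Join cleaned user rows with per-user statistics into one wide row each."""
    return [
        {
            **user,
            "login_count": login_counts.get(user["user_id"], 0),
            "active_days": active_days.get(user["user_id"], 0),
        }
        for user in users
    ]


users = [{"user_id": "u1", "city": "beijing"}, {"user_id": "u2", "city": "shanghai"}]
profile = build_user_profile(users, {"u1": 5}, {"u1": 3, "u2": 1})
```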

6) Data application layer

This layer can generate a data index according to one or any combination of the cleaned data, the statistical result, and the theme. For example, based on the star model and the wide table model, the cleaned data, statistical results, themes, and other data can be logically summarized and processed according to specific service requirements to generate a data analysis report for the service end.
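As a hedged example of a data index, the sketch below derives average logins per user for each city from wide-table rows. The metric itself and the names used are assumptions chosen for illustration, not indexes named in the disclosure.

```python
from collections import defaultdict
from typing import Dict, List


def avg_logins_by_city(rows: List[Dict[str, object]]) -> Dict[str, float]:
    """Average login count per user, grouped by city."""
    totals = defaultdict(lambda: [0, 0])  # city -> [sum of logins, number of users]
    for row in rows:
        entry = totals[row["city"]]
        entry[0] += row["login_count"]
        entry[1] += 1
    return {city: total / users for city, (total, users) in totals.items()}


rows = [
    {"city": "beijing", "login_count": 4},
    {"city": "beijing", "login_count": 2},
    {"city": "shanghai", "login_count": 1},
]
index = avg_logins_by_city(rows)
```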

With the data warehouse of the present disclosure, space is traded for time: a multi-level data model is built for users, which can improve the efficiency of data processing and access, simplify complex problems, make business changes easier to handle, keep the data structure clear, facilitate data lineage tracking, shield users from anomalies in the original data, and reduce repeated development.

In addition, the data warehouse of the present disclosure supports a stream-batch integrated technical architecture, which is described below in several aspects.

1) Data acquisition

When the basic data are acquired, different types of data can be acquired according to the acquisition modes corresponding to the different types of data respectively.

The core data of the data warehouse comes mainly from two sources. One is service interaction data, such as data related to orders, leads (clues), materials, and the like, which is usually stored in a relational database (MySQL) and/or a non-relational database (MongoDB); this type of data can be collected with the Sqoop data transfer tool. The other is data generated while a user interacts with a client product, including user behaviors, timestamps, and the like; this type of data is stored as files and can be collected with Flume, a system for collecting, aggregating, and transmitting massive logs. With these acquisition modes, the required data can be obtained conveniently and efficiently.
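Choosing an acquisition mode per data type can be sketched as a simple lookup. The source-type keys and the function `pick_collector` are hypothetical; the sketch only returns the name of the component described above and does not invoke Sqoop or Flume.

```python
# Mapping from where the data lives to the collection component described above.
COLLECTOR_BY_SOURCE = {
    "mysql": "sqoop",     # service interaction data in a relational database
    "mongo": "sqoop",     # service interaction data in a non-relational database
    "log_file": "flume",  # user behavior data stored as files
}


def pick_collector(source_type: str) -> str:
    """Return the collection component for a source type, or fail loudly."""
    try:
        return COLLECTOR_BY_SOURCE[source_type]
    except KeyError:
        raise ValueError(f"no collector configured for {source_type!r}") from None


collectors = [pick_collector(s) for s in ("mysql", "mongo", "log_file")]
```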

2) Data storage

For a data warehouse, the data may be stored in one or both of the following ways: storing offline data in the Hadoop Distributed File System (HDFS), and storing real-time data in a message queue system (Kafka).

For offline data, the acquired data can be directly stored in the HDFS, and data related to each layer in the data warehouse can also be stored in the HDFS. For real-time data, Kafka can be used for storage, so that real-time acquisition and instant consumption of the data are facilitated, and the data can be stored in the HDFS after being processed. HDFS is a highly fault tolerant system and provides high throughput data access, etc., and is well suited for application on large-scale data sets.
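The storage routing described above can be summarized in a few lines. The function `storage_route` is a hypothetical illustration that returns the order of storage systems a record passes through; it does not call HDFS or Kafka.

```python
from typing import List


def storage_route(is_realtime: bool) -> List[str]:
    """Storage systems a piece of data passes through, in order."""
    if is_realtime:
        # Real-time data is held in Kafka for instant consumption,
        # then may be persisted to HDFS after processing.
        return ["kafka", "hdfs"]
    # Offline data is written to HDFS directly.
    return ["hdfs"]


offline_route = storage_route(False)
realtime_route = storage_route(True)
```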

3) Data computation

For a data warehouse, one or any combination of the following may be performed: using a resource manager (Yarn) for resource management and scheduling; using a task scheduler (Azkaban) for task scheduling and management; and using MapReduce in combination with a computing engine (Spark) for data calculation.

In the data warehouse, a series of processing such as cleaning and statistics needs to be performed on the accessed data. According to the difference of the priority, the data scale, the timeliness and the like of the data processing task, Yarn can be adopted for resource management and scheduling, Azkaban is adopted for task scheduling and management, and MapReduce and Spark are adopted for data calculation.

Yarn is the Hadoop resource manager, a general-purpose resource management system that provides unified resource management and scheduling for applications. Azkaban is a batch workflow task scheduler that runs a set of jobs and processes in a particular order within a workflow. MapReduce is highly stable and flexible in its use of resources but slow to produce task output, so it suits data processing tasks with large data scale, complex processing logic, and low timeliness requirements. Spark consumes more resources but runs fast, so it suits data processing tasks with small data scale, simple processing logic, and high timeliness requirements.
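The trade-off between the two engines can be sketched as a selection rule. The `Task` fields and the row-count threshold are assumptions introduced for illustration; the disclosure does not give numeric cut-offs.

```python
from dataclasses import dataclass


@dataclass
class Task:
    rows: int                 # approximate data scale
    complex_logic: bool       # whether the processing logic is complex
    needs_fast_result: bool   # whether timeliness requirements are high


def choose_engine(task: Task, large_rows: int = 100_000_000) -> str:
    """Pick Spark for small, simple, time-sensitive tasks; MapReduce otherwise.

    `large_rows` is a hypothetical threshold, not a value from the disclosure.
    """
    small = task.rows < large_rows
    if task.needs_fast_result and small and not task.complex_logic:
        return "spark"
    return "mapreduce"


hourly_report = choose_engine(Task(rows=50_000, complex_logic=False, needs_fast_result=True))
nightly_rebuild = choose_engine(Task(rows=5_000_000_000, complex_logic=True, needs_fast_result=False))
```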

4) Management and/or querying of data

According to the scheme, management and/or query of the data in the data warehouse in a predetermined mode are supported.

For example, the data may be managed based on the HDFS data storage manner, and an external table may be established using a data warehouse tool (Hive) or a distributed database (Doris). Hive and Doris both adopt the protocol of a relational database management system (MySQL), so a user may use Structured Query Language (SQL) to perform data query, statistics, analysis, and the like. The difference is that Hive must compute on top of MapReduce or Spark at the bottom layer; it can support large-scale queries, with minute-level response times, whereas Doris is suited to small-scale data queries with second-level response times. In addition, a search engine (Elasticsearch) may be used to query data. The specific method can be determined according to actual needs.
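To illustrate the kind of SQL statistics query a user might run, the sketch below uses Python's built-in `sqlite3` module purely as a local stand-in; it is not Hive or Doris, and the table and column names are hypothetical.

```python
import sqlite3
from typing import List, Tuple


def logins_per_city(rows: List[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Load (user_id, city) rows into an in-memory table and count logins per city."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE user_login (user_id TEXT, city TEXT)")
        conn.executemany("INSERT INTO user_login VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT city, COUNT(*) FROM user_login GROUP BY city ORDER BY city"
        )
        return cursor.fetchall()
    finally:
        conn.close()


result = logins_per_city([("u1", "beijing"), ("u2", "shanghai"), ("u3", "beijing")])
```

The query engine behind the SQL interface would in practice be chosen according to the query scale and response-time needs described above.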

5) Data application

Based on the data warehouse, data indexes can be output externally by providing tools such as an Application Programming Interface (API), a Doris client and the like to the outside, and the data warehouse can assist in data analysis, data mining and other work.

Based on the above description, fig. 3 is a schematic diagram of a technical architecture of a data warehouse according to the present disclosure; for details, refer to the related description above.

In a word, by adopting the scheme disclosed by the invention, the data can be organized and stored orderly, the technical architecture of big data can be combined, the data can be utilized efficiently and in high quality, the method has the characteristics of rapid construction, rapid expansion, rapid application and the like, and the output efficiency of the data is ensured.

It is noted that while for simplicity of explanation, the foregoing method embodiments are described as a series of acts, those skilled in the art will appreciate that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders and concurrently. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required for the disclosure.

The above is a description of embodiments of the method, and the embodiments of the apparatus are further described below.

Fig. 4 is a schematic diagram illustrating a structure of a data warehouse modeling apparatus 400 according to an embodiment of the present disclosure. As shown in fig. 4, the apparatus includes a first processing module 401 and a second processing module 402.

The first processing module 401 is configured to determine a vertical service to be modeled.

The second processing module 402 is configured to construct, for the vertical service, a data warehouse including the following layers: a data access layer, a basic data layer, a middle data layer, a data subject layer, and a data application layer. The data access layer is used for acquiring basic data from each service end; the basic data layer is used for cleaning the basic data to obtain cleaned data; the middle data layer is used for counting the cleaned data according to different dimensions to obtain a statistical result; the data subject layer is used for constructing a subject according to one or both of the cleaned data and the statistical result; and the data application layer is used for generating data indexes according to one or any combination of the cleaned data, the statistical result, and the subject.

In an embodiment of the present disclosure, the data warehouse may further include a data dimension layer, which is used for providing required dimension data for other layers in the data warehouse.

In one embodiment of the present disclosure, ETL processing may be performed on the base data to obtain cleaned data.

In addition, the data warehouse of the present disclosure can also support a stream-batch integrated technical architecture.

In an embodiment of the present disclosure, when acquiring the basic data, different types of data may be acquired according to the acquisition manners corresponding to the different types of data, that is, different types of data may be acquired according to the respective corresponding acquisition manners.

In an embodiment of the disclosure, the second processing module 402 may further perform data storage for the data warehouse in one or both of the following manners: storing offline data in HDFS (Hadoop Distributed File System), and storing real-time data in Kafka.

In an embodiment of the disclosure, the second processing module 402 may further perform, for the data warehouse, one or any combination of the following: using Yarn for resource management and scheduling; using Azkaban for task scheduling and management; and using MapReduce in combination with Spark for data calculation.

In an embodiment of the present disclosure, the second processing module 402 may further manage and/or query the data in the data warehouse, that is, may support managing and/or querying the data in the data warehouse in a predetermined manner.

In addition, based on the data warehouse, various tools can be provided to the outside, the output of data indexes is carried out to the outside, and the data warehouse can assist in data analysis, data mining and other work.

For a specific work flow of the apparatus embodiment shown in fig. 4, reference is made to the related description in the foregoing method embodiment, and details are not repeated.

In a word, by adopting the scheme of the embodiment of the device disclosed by the invention, the data can be organized and stored orderly, the technical architecture of big data can be combined, the data can be utilized efficiently and in high quality, the device has the characteristics of rapid construction, rapid expansion, rapid application and the like, and the output efficiency of the data is ensured.

The scheme disclosed by the disclosure can be applied to the field of artificial intelligence, in particular to the fields of big data processing, distributed storage and the like.

Artificial intelligence is the discipline that studies how to make a computer simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it involves both hardware and software technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like. Artificial intelligence software technologies mainly include computer vision, voice recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technologies, and the like.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

FIG. 5 illustrates a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 5, the device 500 comprises a computing unit 501 which may perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The computing unit 501, the ROM 502, and the RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.

A number of components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, or the like; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.

The computing unit 501 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 501 performs the various methods and processes described above, such as the methods described in this disclosure. For example, in some embodiments, the methods described in this disclosure may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into RAM 503 and executed by computing unit 501, one or more steps of the methods described in the present disclosure may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured by any other suitable means (e.g., by means of firmware) to perform the methods described by the present disclosure.

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that addresses the high management difficulty and weak service scalability of traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server incorporating a blockchain. Cloud computing refers to a technical system in which an elastically scalable pool of shared physical or virtual resources is accessed through a network, where the resources may include servers, operating systems, networks, software, applications, storage devices and the like, and in which the resources can be deployed and managed on demand in a self-service manner. Cloud computing technology can provide efficient and powerful data processing capabilities for technical applications and model training in fields such as artificial intelligence and blockchain.

It should be understood that the various forms of flows shown above may be used with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; no limitation is imposed herein.
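The point about execution order can be illustrated with a minimal sketch, assuming that the steps in question are independent of one another, such as counting the same cleaned data along different dimensions. The helper names and record fields below are hypothetical; the sketch only shows that sequential and parallel execution of such independent steps yield the same result.

```python
from concurrent.futures import ThreadPoolExecutor


def count_by(records, dimension):
    """Count records along one dimension; each call is independent of the others."""
    counts = {}
    for r in records:
        counts[r[dimension]] = counts.get(r[dimension], 0) + 1
    return counts


def run_sequentially(records, dimensions):
    """Execute one statistics step per dimension, one after another."""
    return {d: count_by(records, d) for d in dimensions}


def run_in_parallel(records, dimensions):
    """Execute the same steps concurrently in a thread pool."""
    with ThreadPoolExecutor() as pool:
        futures = {d: pool.submit(count_by, records, d) for d in dimensions}
        return {d: f.result() for d, f in futures.items()}
```

Steps that consume the output of an earlier step (for example, cleaning before counting) would still need to keep their relative order.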

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. A data warehouse modeling method, comprising:

determining a vertical service to be modeled;

2. The method of claim 1, further comprising:

constructing a data dimension layer for providing required dimension data to other layers in the data warehouse; and composing the data warehouse from the data dimension layer, the data access layer, the basic data layer, the middle data layer, the data subject layer and the data application layer.

3. The method of claim 1, wherein,

the cleaning of the basic data comprises: performing extract-transform-load (ETL) processing on the basic data.

4. The method according to any one of claims 1 to 3,

the basic data includes: different types of data, each type being acquired according to its corresponding acquisition mode.

5. The method of any of claims 1-3, further comprising:

for the data warehouse, storing data in one or both of the following manners: storing offline data using the Hadoop Distributed File System (HDFS); and storing real-time data using the message queue system Kafka.

6. The method of any of claims 1-3, further comprising:

for the data warehouse, performing one or any combination of the following:

managing and scheduling resources using the resource manager Yarn;

performing task scheduling management using the task scheduler Azkaban;

and performing data computation using MapReduce in combination with the computing engine Spark.

7. The method of any of claims 1-3, further comprising: managing and/or querying data in the data warehouse.

8. A data warehouse modeling apparatus, comprising: a first processing module and a second processing module;

9. The apparatus of claim 8, wherein,

the data warehouse further comprises: a data dimension layer;

the data dimension layer is used for providing required dimension data for other layers in the data warehouse.

10. The apparatus of claim 8, wherein,

11. The apparatus of any one of claims 8 to 10,

12. The apparatus of any one of claims 8 to 10,

the second processing module is further configured to, for the data warehouse, store data in one or both of the following manners: storing offline data using the Hadoop Distributed File System (HDFS); and storing real-time data using the message queue system Kafka.

13. The apparatus of any one of claims 8 to 10,

the second processing module is further configured to, for the data warehouse, perform one or any combination of the following: managing and scheduling resources using the resource manager Yarn; performing task scheduling management using the task scheduler Azkaban; and performing data computation using MapReduce in combination with the computing engine Spark.

14. The apparatus of any one of claims 8 to 10,

the second processing module is further configured to manage and/or query data in the data warehouse.

15. An electronic device, comprising:

at least one processor; and

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.

16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-7.

17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.