CN115617790A - Data warehouse creation method, electronic device and storage medium

Info

Publication number: CN115617790A
Application number: CN202110806403.5A
Authority: CN
Inventors: 王隆生; 王宇
Original assignee: Shenzhen Fugui Precision Industrial Co Ltd
Current assignee: Shenzhen Fulian Fugui Precision Industry Co Ltd
Priority date: 2021-07-16
Filing date: 2021-07-16
Publication date: 2023-01-17
Also published as: TW202305613A; TWI785724B

Abstract

The application provides a data warehouse creation method, an electronic device and a storage medium. The method comprises: acquiring a data source; analyzing the data source to generate a data overview; extracting topics of the data source based on the data overview to obtain a plurality of topic tables; establishing a data mapping relation for the topic tables according to the data overview; and creating a data warehouse based on the topic tables and the data mapping relation. The method and the device can improve the efficiency of data warehouse creation.

Description

Data warehouse creation method, electronic device and storage medium

Technical Field

The present application relates to the field of data analysis, and in particular, to a data warehouse creation method, an electronic device, and a storage medium.

Background

The creation of data warehouses is currently an important foundation for the digital transformation of large enterprises. Without a data warehouse, data must be extracted directly from the business database for analysis. The business database mainly serves business operations; although it can be used for analysis, its data structure is complex, its data is disordered and difficult to process, query history is lacking, and large-scale queries are slow.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a data warehouse creation method, an electronic device, and a storage medium, which can improve the efficiency of data warehouse creation.

The application provides a data warehouse creation method, which comprises the following steps: acquiring a data source; analyzing the data source to generate a data overview; extracting topics of the data source based on the data overview to obtain a plurality of topic tables; establishing a data mapping relation for the topic tables according to the data overview; and creating a data warehouse based on the topic tables and the data mapping relation.

In one possible implementation, the analyzing the data source and generating the data overview include: converting the data source into a data table; acquiring a first field with a numerical value in the data table and calculating the numerical value distribution of the first field; acquiring a second field without a numerical value in the data table and calculating a pattern character string of the second field; acquiring data distribution of the data table on a time axis; and taking the numerical value distribution, the pattern character string and the data distribution as the data overview.

In one possible implementation manner, the calculating the pattern character string of the second field includes: reading characters of the second field; if a character is any one of a special character, a number, a letter or Chinese text, adding the character into a result set; and obtaining the pattern character string according to the result set.

In one possible implementation, the extracting topics of the data source based on the data overview to obtain a plurality of topic tables includes: inputting fields in the data table and fields in the data overview into a pre-trained topic model; outputting the topic of the data table through the topic model; and generating the topic table according to the data table and the topic.

In a possible implementation manner, the establishing a data mapping relationship for the topic table according to the data overview includes: extracting the pattern character string corresponding to each topic table; selecting any one topic table from the plurality of topic tables as a first topic table, and taking the unselected topic tables as second topic tables; calculating the similarity between the pattern character string corresponding to the first topic table and the pattern character string corresponding to each second topic table; and establishing a data mapping relation between the second topic table corresponding to the maximum similarity and the first topic table.

In a possible implementation manner, the establishing a data mapping relationship for the topic table according to the data overview includes: selecting one topic table from the plurality of topic tables as a target topic table, and taking the unselected topic tables as non-target topic tables; extracting a first data distribution of the target topic table on a target time scale and a second data distribution of each non-target topic table on the target time scale; calculating the similarity of the first data distribution and the second data distribution; and establishing a data mapping relation for the topic table according to the similarity.

In one possible implementation manner, the calculating the similarity between the first data distribution and the second data distribution includes: calculating a Pearson correlation coefficient between the first data distribution and the second data distribution; and taking the Pearson correlation coefficient as the similarity.

In one possible implementation manner, the calculating the similarity between the first data distribution and the second data distribution includes: inputting the first data distribution and the second data distribution into a dynamic programming solution algorithm model which is trained in advance; outputting, by the dynamic programming solution algorithm model, a similarity distance between the first data distribution and the second data distribution; and calculating the similarity according to the similarity distance.

The present application further provides an electronic device, which includes a processor and a memory, wherein the processor is configured to implement the data warehouse creating method when executing the computer program stored in the memory.

The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the data warehouse creation method.

According to the data warehouse creation method and the related device, the topic tables and the data mapping relation are established, and the data warehouse is created according to the topic tables and the data mapping relation, so that disordered data becomes regular, product consultation and data analysis are faster and more efficient, and problems can be analyzed in a centralized way.

Drawings

Fig. 1 is a schematic structural diagram of an electronic device of a data warehouse creation method according to an embodiment of the present application.

Fig. 2 is a flowchart of a data warehouse creation method provided in an embodiment of the present application.

Fig. 3 is an exemplary data distribution diagram provided in an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in detail below with reference to the accompanying drawings and specific embodiments.

Fig. 1 is a schematic view of an electronic device according to an embodiment of the present application. Referring to Fig. 1, the electronic device 1 includes, but is not limited to, a memory 11 and at least one processor 12, which may be connected via a bus or directly.

The electronic device 1 may be a computer, a mobile phone, a tablet computer, a Personal Digital Assistant (PDA), or another device with application programs installed. It will be understood by those skilled in the art that Fig. 1 is only an example of the electronic device 1 and does not constitute a limitation on it; the electronic device 1 may include more or fewer components than those shown, combine some components, or use different components. For example, the electronic device 1 may further include an input-output device, a network access device, a bus, etc.

Fig. 2 is a flow chart of a preferred embodiment of the data warehouse creation method of the present application. The data warehouse creation method is applied to the electronic device 1. The order of the steps in the flow chart may be changed and some steps may be omitted according to different needs. In this embodiment, the data warehouse creating method includes:

S21, acquiring a data source.

Data warehouse construction is currently an important foundation for the digital transformation of large enterprises. When no data warehouse exists, data must be taken directly from the business database for analysis. The business database mainly serves business operations; although it can be used for analysis, its data structure is complex, its data is disordered and difficult to understand, and large-scale queries are very slow. Building a large-scale data warehouse mainly involves three steps: data overview, building the data topic model, and building the data mapping. The data overview is the basic step of building the data warehouse, and it requires obtaining a data source of business data.

In one embodiment of the present application, the data source is obtained by extracting, transforming and loading (ETL) business data in a business system. The business data includes a business database, a file, and an API. The number of data sources can be set according to the actual requirements of users; for example, data from 3-5 periods of the business system are acquired.

In an embodiment of the present application, an automatic scheduling time for acquiring the data source may be set, and the data source acquisition task is executed according to the automatic scheduling time, so that the business data may be automatically integrated into the data warehouse.

S22, analyzing the data source to generate a data overview.

In one embodiment of the present application, the data overview serves as a basic link for subsequent analysis, and mainly calculates three contents: the value distribution of the fields, the pattern of the field values, and the data distribution.

In an embodiment of the application, the analyzing the data source and generating the data overview includes:

(1) Converting the data source into a data table. In specific implementation, the business database, the file and the API are converted into a relational data form, which includes a data table. The business database, the file and the API can be converted through a relational model into a row-and-column data form, and this row-and-column form is taken as the data table. For example, a production work order table is generated from production data, a material work order table from material data, a maintenance work order table from maintenance data, and a work order type table from the production work order table, the material work order table and the maintenance work order table.

(2) Acquiring a first field with a numerical value in the data table and calculating the numerical value distribution of the first field. In specific implementation, the data table is traversed, and a first field with a numerical value is extracted. For example, the work order type table has 200 production work order fields, 24 material work order fields, and 40 maintenance work order fields. The numerical value distribution of the data table is determined according to the first field. By analyzing the numerical value distribution, a data modeler gains a basic understanding of the overall data.

(3) Acquiring a second field without a numerical value in the data table and calculating a pattern character string of the second field. Some fields in a table have no numerical value distribution; such fields are typically primary keys, or fields similar to primary keys, and are critical in determining the relationships between data tables. In specific implementation, the data table is traversed, and a second field without a numerical value is extracted. The characters of the second field are read; if a character is any one of a special character, a number, a letter or Chinese text, the character is added to a result set, and the pattern character string is obtained from the result set. Illustratively, each character of the second field is read and checked in order: if it is a special character, it is added to the result set; otherwise, if it is a number, it is added to the result set; otherwise, if it is a letter, it is added to the result set; otherwise, if it is Chinese, it is added to the result set. The pattern character string is then obtained from the result set.

(4) Acquiring the data distribution of the data table on a time axis. In specific implementation, the data of the data table is aggregated at hour granularity and at day granularity to obtain two-dimensional data with time as the x axis and data amount as the y axis, such as the data distribution shown in Fig. 3.

(5) Taking the numerical value distribution, the pattern character string and the data distribution as the data overview. In specific implementation, because data needs to be identified and calculated at a fine granularity (field and character level), the Spark computing framework is mainly adopted for the data overview analysis.
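The pattern character string step described above can be sketched as follows. The text does not say how the result set becomes a pattern character string, so this sketch makes an assumption: each digit is written as `9`, each ASCII letter as `A`, each Chinese character as `C`, special characters are kept literally, and anything else (such as whitespace) is skipped. The function names and class symbols are hypothetical.

```python
import string

# Assumption: "special characters" means ASCII punctuation.
SPECIAL = set(string.punctuation)

def char_class(ch):
    """Check one character in the order given in the text:
    special character, number, letter, then Chinese."""
    if ch in SPECIAL:
        return ch                      # keep the special character itself
    if ch in string.digits:
        return "9"                     # hypothetical symbol for a digit
    if ch in string.ascii_letters:
        return "A"                     # hypothetical symbol for a letter
    if "\u4e00" <= ch <= "\u9fff":     # CJK Unified Ideographs block
        return "C"                     # hypothetical symbol for Chinese
    return None                        # not in any class: skipped

def pattern_string(value):
    """Build the pattern character string of one field value."""
    result_set = []
    for ch in value:
        symbol = char_class(ch)
        if symbol is not None:
            result_set.append(symbol)
    return "".join(result_set)

print(pattern_string("WO-2021-0001"))  # AA-9999-9999
```

Under this convention, key-like fields that share a format (for example, work order numbers) produce identical pattern strings even when their values differ, which is what makes the pattern usable for comparing tables later.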

By generating the data overview, a data base is provided for the establishment of a subsequent data warehouse.
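As a rough illustration of the other two parts of the data overview, the sketch below counts how often each value of a field occurs and how many rows fall into each hour or day. It uses only the Python standard library rather than Spark, and the function names and sample data are hypothetical. Reading the "numerical value distribution" as frequency counts is also an assumption.

```python
from collections import Counter
from datetime import datetime

def value_distribution(values):
    """Frequency of each value in a field (assumed reading of
    'numerical value distribution')."""
    return dict(Counter(values))

def time_distribution(timestamps, granularity="hour"):
    """Rows per time bucket: x = time bucket, y = amount of data."""
    fmt = "%Y-%m-%d %H:00" if granularity == "hour" else "%Y-%m-%d"
    counts = Counter(ts.strftime(fmt) for ts in timestamps)
    return sorted(counts.items())

rows = [
    datetime(2021, 7, 16, 8, 5),
    datetime(2021, 7, 16, 8, 40),
    datetime(2021, 7, 16, 9, 0),
    datetime(2021, 7, 17, 8, 0),
]
print(time_distribution(rows, "hour"))
print(time_distribution(rows, "day"))
```

The sorted list of (bucket, count) pairs corresponds to the two-dimensional data with time on the x axis and data amount on the y axis.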

S23, extracting the topics of the data source based on the data overview to obtain a plurality of topic tables.

The core of the establishment of the data warehouse is to perform data modeling on all data in the system. The main task of data modeling is to establish a data topic model, namely, the whole data content is logically divided according to the business significance of the data. Such division requires rearrangement and merger of the data sets according to the business meaning of various data, and if all data sets are manually identified one by one, a lot of time and labor are required.

In an embodiment of the application, the extracting topics of the data source based on the data overview to obtain a plurality of topic tables includes: inputting fields in the data table and fields in the data overview into a pre-trained topic model; outputting the topic of the data table through the topic model; and generating the topic table according to the data table and the topic.

In specific implementation, training data of a preset first proportion is obtained and input into a pre-established neural network framework for training to obtain the topic model, for example, a Latent Dirichlet Allocation (LDA) topic model. Test data of a preset second proportion is obtained, the topic model is tested with the test data, and a test pass rate is obtained. When the test pass rate is greater than a preset pass rate threshold, the topic model is taken as the pre-trained topic model.

The fields in the data table and the fields in the data overview are input into the pre-trained topic model. The topic model automatically analyzes these fields, counts the words in them, and, according to the counted information, determines the topics contained in the current fields and the proportion of each topic. Finally, each table has proportions over a number of topics, e.g., topic table EQP_PROD_NUM_REL (95% device, 5% event), where EQP_PROD_NUM_REL is the name of the topic table.

In machine learning and natural language processing, a topic model is a statistical model used to find abstract topics in a series of documents. Intuitively, if a passage describes a central idea, certain specific words appear more frequently. For example, if a description concerns production, the words "work order" and "production count" occur more frequently; if a description concerns maintenance, the words "abnormal" and "maintenance accessory" appear more frequently.
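To make the "proportion of each topic" output concrete, here is a minimal LDA sketch fitted by collapsed Gibbs sampling, one standard way to fit LDA. It is not the training setup described above, which the text leaves unspecified. Each table is treated as a document whose words come from splitting its field names; the tokenizer, hyperparameters and field names are all hypothetical.

```python
import random
from collections import Counter

def tokenize(name):
    """Split an identifier such as EQP_PROD_NUM_REL into lowercase words."""
    return [part for part in name.lower().split("_") if part]

def lda_gibbs(docs, num_topics, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA.
    docs: list of token lists. Returns one topic-proportion list per doc."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]   # count of word w in topic k
    topic_total = [0] * num_topics                        # tokens in topic k
    assign = []                                           # topic of each token

    def bump(d, w, k, delta):
        doc_topic[d][k] += delta
        topic_word[k][w] += delta
        topic_total[k] += delta

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        assign.append([rng.randrange(num_topics) for _ in doc])
        for w, k in zip(doc, assign[d]):
            bump(d, w, k, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                bump(d, w, assign[d][i], -1)   # remove the token's current topic
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                bump(d, w, k, +1)              # add it back under the new topic

    return [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]

# Hypothetical tables: each document is the words of a table's field names.
docs = [
    tokenize("EQP_ID") + tokenize("EQP_CLASS") + tokenize("EQP_STATUS"),
    tokenize("WORK_ORDER_ID") + tokenize("WORK_ORDER_QTY"),
]
for proportions in lda_gibbs(docs, num_topics=2):
    print([round(p, 2) for p in proportions])
```

Each returned row sums to 1 and plays the role of the per-table percentages, such as "95% device, 5% event". The topics themselves are unlabeled indices; naming them "device" or "event" would be a separate step.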

S24, establishing a data mapping relation for the topic tables according to the data overview.

After the topic tables have been built for all data, the relationships between the tables must be made explicit.

In an embodiment of the application, the establishing a data mapping relationship for the topic table according to the data overview includes: extracting the pattern character string corresponding to each topic table; selecting any one topic table from the plurality of topic tables as a first topic table, and taking the unselected topic tables as second topic tables; calculating the similarity between the pattern character string corresponding to the first topic table and the pattern character string corresponding to each second topic table; and establishing a data mapping relation between the second topic table corresponding to the maximum similarity and the first topic table. In a specific implementation, a first Pearson correlation coefficient between the pattern character string corresponding to the first topic table and the pattern character string corresponding to each second topic table may be calculated and used as the similarity.
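A sketch of this matching step follows. The text does not specify how a Pearson coefficient is computed between two character strings, so the sketch assumes each pattern string is first turned into a vector of symbol counts over a shared alphabet. That vectorization, the function names and the sample patterns are all hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.
    Returns 0.0 when either sequence has no variance."""
    n = len(xs)
    if n == 0:
        return 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def pattern_vector(pattern, alphabet):
    """Assumed encoding: how many times each symbol occurs in the pattern."""
    return [pattern.count(symbol) for symbol in alphabet]

def best_match(first_pattern, second_patterns):
    """Return (table name, similarity) of the second topic table whose
    pattern string is most similar to the first topic table's."""
    alphabet = sorted(set(first_pattern).union(*second_patterns.values()))
    first_vec = pattern_vector(first_pattern, alphabet)
    scores = {
        name: pearson(first_vec, pattern_vector(pattern, alphabet))
        for name, pattern in second_patterns.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical pattern strings for one first table and two second tables.
name, similarity = best_match(
    "AA-9999-9999",
    {"MATERIAL_ORDER": "AA-9999-9999", "EQUIPMENT": "CCA99"},
)
print(name, round(similarity, 2))
```

Count vectors ignore symbol order, so two differently ordered patterns can score as identical. A position-aware encoding would be stricter but is equally an assumption.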

In an embodiment of the present application, when the maximum similarity is smaller than a preset threshold, the relation cannot be found directly through the pattern character strings of the fields, and the relation may instead be found by comparing data distributions. In this case, the establishing a data mapping relationship for the topic table according to the data overview includes: selecting one topic table from the plurality of topic tables as a target topic table, and taking the unselected topic tables as non-target topic tables; extracting a first data distribution of the target topic table on a target time scale and a second data distribution of each non-target topic table on the target time scale; calculating the similarity of the first data distribution and the second data distribution; and establishing a data mapping relation for the topic table according to the similarity. In a specific implementation, a second Pearson correlation coefficient between the first data distribution and the second data distribution may be calculated and used as the similarity. Alternatively, the first data distribution and the second data distribution may be input to a Dynamic Time Warping (DTW) model trained in advance; the DTW model outputs a similarity distance between the first data distribution and the second data distribution, and the similarity is calculated from the similarity distance as similarity = 1/(1 + similarity distance). A data mapping relation is then established between the topic table corresponding to the second data distribution with the maximum similarity and the topic table corresponding to the first data distribution.
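The distribution-based comparison can be sketched as below. The DTW part is the classic dynamic programming recurrence with absolute difference as the local cost, which needs no training. Treating it as a stand-in for the "DTW model trained in advance" is an assumption, as are the function names and sample series.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric series."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

def dtw_similarity(a, b):
    """similarity = 1 / (1 + similarity distance), as in the text."""
    return 1.0 / (1.0 + dtw_distance(a, b))

def most_similar(target, others, similarity=dtw_similarity):
    """Return (table name, similarity) of the non-target topic table whose
    data distribution is most similar to the target's."""
    scores = {name: similarity(target, series) for name, series in others.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical hourly row counts for a target table and two other tables.
target = [1, 3, 5, 3, 1]
others = {"EQP_SCD": [1, 3, 5, 3, 1], "EQP_CLASS": [5, 5, 5, 5, 5]}
print(most_similar(target, others))
```

Unlike the Pearson coefficient, DTW tolerates series of different lengths and small shifts along the time axis, which may matter when related tables are written at slightly different times.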

In an embodiment of the present application, if field names are used directly for the lookup, the lookup mainly compares whether the fields of one table are consistent with the fields of another. This field comparison covers all fields, but because a common field name such as ID appears in almost every table, relationships identified this way are not accurate. Searching by the pattern character string and the data distribution instead yields a relatively accurate data mapping relation.

S25, creating a data warehouse based on the topic tables and the data mapping relation.

Step S23 establishes a plurality of topic tables, and step S24 establishes a data mapping relationship between the plurality of topic tables. A data warehouse is then required to be established through the plurality of topic tables and the data mapping relationships.

In an embodiment of the application, the creating a data warehouse based on the topic table and the data mapping relationship includes:

(1) Classifying the topic tables with the same topic together according to the topic of each topic table. For example, the device topic includes: topic table EQP_CLASS (device 100%), topic table EQUIPMENT (device 100%), topic table EQP_SCD (device 85%, event 15%), and topic table EQP_PROD_NUM_REL (device 95%, material 5%).

(2) Connecting the topic tables that have a mapping relation according to the data mapping relation. For example, topic table EQP_CLASS is connected to topic table EQUIPMENT.

By establishing the topic tables and the data mapping relation and creating the data warehouse according to them, disordered data becomes regular, product consultation and data analysis are faster and more efficient, and problems can be analyzed in a centralized way.
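The two assembly steps above can be sketched as a small in-memory structure. The text does not describe the warehouse's physical schema, so the dictionary layout, the rule of grouping each table under its largest-proportion topic, and the function name are all assumptions.

```python
from collections import defaultdict

def build_warehouse(topic_tables, mappings):
    """topic_tables: {table name: {topic: proportion}}
    mappings: list of (table_a, table_b) data mapping relations."""
    # Step (1): group tables under their dominant topic (assumed rule).
    groups = defaultdict(list)
    for table, proportions in topic_tables.items():
        dominant = max(proportions, key=proportions.get)
        groups[dominant].append(table)

    # Step (2): connect tables that have a mapping relation, in both directions.
    links = defaultdict(set)
    for a, b in mappings:
        links[a].add(b)
        links[b].add(a)

    return {
        "topics": dict(groups),
        "links": {table: sorted(peers) for table, peers in links.items()},
    }

tables = {
    "EQP_CLASS": {"device": 1.0},
    "EQUIPMENT": {"device": 1.0},
    "EQP_SCD": {"device": 0.85, "event": 0.15},
    "EQP_PROD_NUM_REL": {"device": 0.95, "material": 0.05},
}
warehouse = build_warehouse(tables, [("EQP_CLASS", "EQUIPMENT")])
print(warehouse["topics"])
print(warehouse["links"])
```

With the example tables from the text, all four land under the device topic, and EQP_CLASS and EQUIPMENT are linked to each other.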

Referring to fig. 1, in the present embodiment, the memory 11 may be an internal memory of the electronic device 1, that is, a memory built in the electronic device 1. In other embodiments, the memory 11 may also be an external memory of the electronic device 1, that is, a memory externally connected to the electronic device 1.

In some embodiments, the memory 11 is used for storing program codes and various data, and realizes high-speed and automatic access to programs or data during the operation of the electronic device 1.

The memory 11 may include random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other non-volatile solid state storage device.

In one embodiment, the Processor 12 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, or the like. A general purpose processor may be a microprocessor or the processor may be any other conventional processor or the like.

The program code and various data in the memory 11 may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, all or part of the processes in the methods of the embodiments, such as the data warehouse creation method, may also be implemented by a computer program that can be stored in a computer-readable storage medium and instructs related hardware to implement the steps of the embodiments of the methods when executed by a processor. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying said computer program code, a recording medium, a usb-disk, a removable hard disk, a magnetic diskette, an optical disk, a computer Memory, a Read-Only Memory (ROM), etc.

It is understood that the above described module division is a logical function division, and there may be other division ways in actual implementation. In addition, functional modules in the embodiments of the present application may be integrated into the same processing unit, or each module may exist alone physically, or two or more modules are integrated into the same unit. The integrated module can be realized in a hardware form, and can also be realized in a form of hardware and a software functional module.

Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present application and not for limiting, and although the present application is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made to the technical solutions of the present application without departing from the spirit and scope of the technical solutions of the present application.

Claims

1. A data warehouse creation method, characterized in that the data warehouse creation method comprises:

acquiring a data source;

analyzing the data source to generate a data overview;

extracting topics of the data source based on the data overview to obtain a plurality of topic tables;

establishing a data mapping relation for the topic tables according to the data overview;

and creating a data warehouse based on the topic tables and the data mapping relation.

2. The data warehouse creation method of claim 1, wherein the analyzing the data source and generating a data overview comprises:

converting the data source into a data table;

acquiring a first field with a numerical value in the data table and calculating the numerical value distribution of the first field;

acquiring a second field without a numerical value in the data table and calculating a pattern character string of the second field;

acquiring data distribution of the data table on a time axis;

and taking the numerical value distribution, the pattern character string and the data distribution as the data overview.

3. The data warehouse creation method of claim 2, wherein the computing a pattern string for the second field comprises:

reading characters of the second field;

if the character is any one of a special character, a number, a letter or a Chinese text, adding the character into a result set;

and obtaining the pattern character string according to the result set.

4. The data warehouse creation method of claim 1, wherein the extracting topics of the data sources based on the data overview, resulting in a plurality of topic tables, comprises:

inputting fields in the data table and fields in the data overview into a pre-trained topic model;

outputting the topic of the data table through the topic model;

and generating the topic table according to the data table and the topic.

5. The data warehouse creation method of claim 4, wherein the establishing a data mapping relationship for the topic table from the data overview comprises:

extracting a pattern character string corresponding to each topic table;

selecting any one topic table from the plurality of topic tables as a first topic table, and taking the unselected topic tables as second topic tables;

calculating the similarity between the pattern character string corresponding to the first topic table and the pattern character string corresponding to each second topic table;

and establishing a data mapping relation between the second topic table corresponding to the maximum similarity and the first topic table.

6. The data warehouse creation method of claim 4, wherein the establishing a data mapping relationship for the topic table in accordance with the data overview comprises:

selecting one topic table from the plurality of topic tables as a target topic table, and taking the unselected topic tables as non-target topic tables;

extracting a first data distribution of the target topic table on a target time scale and a second data distribution of each non-target topic table on the target time scale;

calculating the similarity of the first data distribution and the second data distribution;

and establishing a data mapping relation for the topic table according to the similarity.

7. The data warehouse creation method of claim 5 or 6, wherein the calculating the similarity of the first data distribution and the second data distribution comprises:

calculating a Pearson correlation coefficient between the first data distribution and the second data distribution;

and taking the Pearson correlation coefficient as the similarity.

8. The data warehouse creation method of claim 5 or 6, wherein the calculating the similarity of the first data distribution and the second data distribution comprises:

inputting the first data distribution and the second data distribution into a dynamic programming solution algorithm model trained in advance;

outputting, by the dynamic programming solution algorithm model, a similarity distance between the first data distribution and the second data distribution;

and calculating the similarity according to the similarity distance.

9. An electronic device, characterized in that the electronic device comprises a processor and a memory, the processor being configured to execute a computer program stored in the memory to implement the data warehouse creation method according to any of claims 1 to 8.

10. A computer-readable storage medium storing at least one instruction which, when executed by a processor, implements a data warehouse creation method as claimed in any one of claims 1 to 8.