CN105824914A

CN105824914A - Configuration-based snowflake model information extraction method

Info

Publication number: CN105824914A
Application number: CN201610148250.9A
Authority: CN
Inventors: 张引; 魏宝刚; 庄越挺; 钱宏泽
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2016-03-15
Filing date: 2016-03-15
Publication date: 2016-08-03

Abstract

The invention discloses a configuration-based snowflake model information extraction method comprising the following steps. First, the data model in a database is analyzed against the definition of a snowflake model, and any model that does not satisfy the definition is redesigned and converted. Second, the characteristics of different data sources are analyzed, their common abstract features are summarized and expressed in the form of a configuration file, and a corresponding configuration file parser and a subsequent general-purpose processing program are implemented. Third, the tables and fields to be processed are selected from a specific database, and a configuration file conforming to the specification is written. Finally, the configuration file parser extracts the content required by the user according to the configuration file, and that content is processed. With this method, for tasks that conform to the snowflake model and share the same subsequent processing, only the corresponding configuration file needs to be rewritten; the existing general-purpose processing program is reused for the new task, saving time and labor.

Description

A configuration-based snowflake model information extraction method

Technical field

The present invention relates to the field of data model processing, and in particular to a configuration-based snowflake model information extraction method.

Background art

With the development of science and technology, data is produced ever faster, and finding an efficient means of processing it has become a pressing task. Here, "efficient" means that a processing program can be reused, without large-scale modification, on other databases that have similar information processing tasks.

At present, two models are mainly used to organize data: the star model and the snowflake model. The star model is a denormalized structure in which every dimension of the cube is connected directly to the fact table and there are no slowly changing dimensions, so the data carries some redundancy but queries are comparatively efficient. The snowflake schema further stratifies the dimension tables of the star schema, reducing data redundancy as far as possible by decomposing the multi-dimensional tables. Its dimension tables are based on normalization theory, and the design lies between third normal form and the star schema: typically part of the data is organized in the normalized structure of third normal form, while the rest uses the fact table and dimension table structure of the star schema. The advantages of the snowflake schema are that it reduces storage space to some extent and that the normalized structure is easier to update and maintain. It also has shortcomings: it is more complex, users find it harder to understand, browsing its content is relatively difficult, and the extra joins reduce query performance.

Although the specific fields of the dimension tables obtained by decomposing the tables connected to the fact table differ, their structure is largely similar. Different multi-dimensional tables can therefore be characterized by defining a configuration file, and the processing program can be reused.

Summary of the invention

The object of the invention is to generalize and abstract the features of different data sources under the snowflake model, distinguish them by means of a configuration file, and thereby reuse existing processing programs and improve data processing efficiency.

The object of the invention is achieved by the following technical scheme:

A configuration-based snowflake model information extraction method which, for databases of a particular model, reuses the processing program by modifying a configuration file, and specifically comprises the following steps:

1) Judge, for each of several databases to be processed, whether its data model satisfies the definition of the snowflake model; if it is not a snowflake model, convert it into one; if it is, proceed to step 2);

2) The dimension tables obtained by decomposition under the snowflake model are in a one-to-many relationship and are connected by foreign keys; different data sources usually differ in the number and names of their fields, but their structure is largely the same. Analyze the overall characteristics of the databases to be processed, summarize their common abstract features, and define these in a configuration file;

3) Implement a parser that reads the content of the configuration file and uses it as the run-time parameters with which the program traverses the database. A general-purpose processing program that further processes the extracted database information must also be implemented; its details vary with the requirements of the task.

4) Reuse the program through a custom configuration file: for the same processing task on a different database, it is only necessary to observe and analyze the characteristics of the data and define them in the configuration file in order to reuse the program and improve information processing efficiency. Specifically, when an information processing task is pending for a particular database, first analyze the data characteristics of that database; then, on the basis of the configuration file obtained in step 2), place some or all of the defined abstract features in the configuration file as the effective abstract features. The program is thus reused efficiently without modifying its code.

The snowflake model is defined as a model in which one or more dimension tables are not connected directly to the fact table but are connected to it through a primary dimension table.

Step 2) is specifically as follows: for different snowflake model databases, the primary dimension table, primary key, split tables, and filtered fields are configured in the file as features; special symbols (chosen according to the actual situation) are defined as the start line and end line of each category of feature, and the lines between the start line and the end line are the values of that feature.

The process of traversing the information in the database in step 3) is as follows: traverse every record in the primary dimension table; for each record, obtain and process the values of its non-filtered fields; after the record has been processed, use its primary key to traverse all the split tables; for each split table, obtain the records associated with the primary dimension table by foreign key, and likewise process their field values after filtering. The program that processes the information is determined by the requirements of the actual task.

Compared with the prior art, the invention has the following advantages:

1. For a different task only the configuration file needs to be modified; the program source code does not have to be changed and recompiled, which saves time;

2. The configuration file is highly readable, so non-specialists can get started easily;

3. Further functions can be added later by modifying and improving the configuration file parser.

Brief description of the drawings

Fig. 1 is the star model of "drug treats disease" in the embodiment;

Fig. 2 is the snowflake model of "drug treats disease" in the embodiment;

Fig. 3 is the custom configuration file for drug processing in the embodiment;

Fig. 4 is the processing flow of the configuration parser in the embodiment;

Fig. 5 is the custom configuration file for disease processing in the embodiment.

Detailed description of the invention

The invention is described in further detail below with reference to a specific example and the accompanying drawings. The original data has been simplified, which does not affect the explanation of the method. The technical features described in the invention may be combined with one another provided they do not conflict.

The core idea of the configuration-based snowflake model information extraction method of the invention is that, for databases of a particular model, the processing program is reused by modifying a configuration file. The method specifically comprises the following steps:

(1) Judge whether the data model of the database satisfies the definition of the snowflake model; if it is not a snowflake model, it must be converted into one. A snowflake model is characterized by one or more dimension tables that are not connected directly to the fact table but are connected to it through a primary dimension table; its diagram resembles several snowflakes joined together, hence the name. The fact table stores the detailed values or metrics of the facts but contains no descriptive information; in the drug-treats-disease relation, for example, it holds the primary keys of the drug and the disease. The description of each is recorded in the dimension tables, and a dimension table can be further decomposed into a primary dimension table and split tables associated with it by foreign key. Designing or converting to a snowflake model is done by decomposing each dimension table connected to the fact table into a primary dimension table and split tables, associated with each other by foreign key.
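
The check described in step (1) can be sketched as a small graph test. This is only an illustration under stated assumptions: the function name `is_snowflake`, the representation of the schema as a list of table-to-table links, and the table names in the usage note are all hypothetical and do not come from the patent.

```python
from collections import defaultdict


def is_snowflake(fact_table, links):
    """Return True if some table is linked to a primary dimension table
    but not directly to the fact table.

    `links` is a list of (table_a, table_b) pairs, one per foreign-key
    relation, treated as undirected (hypothetical representation).
    """
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    # Tables linked directly to the fact table are the primary dimension tables.
    primary = set(neighbours[fact_table])

    for dim in primary:
        for other in neighbours[dim]:
            # A split table hangs off a primary dimension table only.
            if other != fact_table and other not in primary:
                return True
    return False
```

For example, with the hypothetical links `[("treats", "drug"), ("treats", "disease")]` the function reports a star model, and adding `("efficacy", "drug")` turns the result into a snowflake model.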

(2) Analyze the characteristics of different databases, abstract their common features, and design a configuration file in which these common features are defined. Snowflake model databases differ mainly in their primary dimension table, primary key, and split tables, which must be configured in the file as features. In addition, subsequent data processing does not involve every field in a table, so the filtered fields can also be configured as a feature. The configuration file therefore contains four categories of feature: primary dimension table, primary key, split tables, and filtered fields. To make the file easy for a computer to read, "# feature name" is defined as the start line of each category of feature and " " as its end line, and the lines between them are the values of that feature. The parser must read this content and store it in in-memory program variables. The information in the database is traversed as follows: traverse every record in the primary dimension table; for each record, obtain and process the values of its non-filtered fields; after the record has been processed, use its primary key to traverse all the split tables; for each split table, obtain the records associated with the primary dimension table by foreign key, and likewise process their field values after filtering. The general-purpose processing program that processes the information varies with the specific requirements.
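
The traversal in step (2) might look roughly like the following sketch, which uses Python's standard `sqlite3` module. Several things here are assumptions made for illustration rather than details from the patent: the function name `extract`, the dictionary keys `main_table`, `primary_key`, `split_tables` and `ignored_fields`, and the convention that each split table stores its foreign key in a column with the same name as the primary key of the primary dimension table.

```python
import sqlite3


def extract(conn, config, process):
    """Walk the primary dimension table and its split tables.

    `config` is a dict with the (hypothetical) keys main_table,
    primary_key, split_tables and ignored_fields. `process` is the
    task-specific callback, called as process(table_name, values).
    Table and column names are interpolated into SQL, so the
    configuration is assumed to come from a trusted source.
    """
    conn.row_factory = sqlite3.Row  # allows access to columns by name
    main = config["main_table"]
    key = config["primary_key"]
    ignored = set(config["ignored_fields"])

    for record in conn.execute(f"SELECT * FROM {main}"):
        # Process the non-filtered fields of the main record first.
        process(main, {k: record[k] for k in record.keys() if k not in ignored})

        # Then follow the primary key into every split table.
        for table in config["split_tables"]:
            rows = conn.execute(
                f"SELECT * FROM {table} WHERE {key} = ?", (record[key],)
            )
            for row in rows:
                process(table, {k: row[k] for k in row.keys() if k not in ignored})
```

Because the task-specific work lives entirely in the `process` callback, the same function can serve a different database by passing a different `config`.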

(3) Once the configuration file has been defined as above and the configuration parser and general-purpose processing program have been implemented, the same information processing task on a different database requires only observing and analyzing the data to determine the features that database needs, and defining them in the configuration file on the basis of the configuration file already defined. If a common feature defined in step (2) is not needed for a given database, it can be set to "empty". The program can thus be reused without modifying its code, improving information processing efficiency.

Embodiment

The configuration-based snowflake model information extraction method in this embodiment comprises the following steps:

(1) Fig. 1 shows the "drug treats disease" data as a star model. If, for example, there are several drugs with the same name, such as "Herba Ephedrae", that come from different sources and whose efficacy is described differently, the name "Herba Ephedrae" is stored repeatedly, causing data redundancy. Decomposing and restructuring the multi-dimensional data table yields the snowflake model shown in Fig. 2. The efficacy and source fields are split out of the original table and associated with it using the drug number as the foreign key. Because the efficacy of the same drug is described differently for different sources, the drug table and the efficacy table are in a one-to-many relationship.
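
The decomposition described in (1) can be illustrated with a short `sqlite3` sketch. The figures are not reproduced in this text, so every table and column name below (`drug_flat`, `drug`, `efficacy`, `drug_id`, `name`, `source`) is a hypothetical stand-in, and keeping source and efficacy together in one split table is a simplifying assumption.

```python
import sqlite3


def split_dimension(conn):
    """Split a flat drug table into a primary dimension table and a split table.

    Assumes a table drug_flat(name, source, efficacy) already exists
    in which a drug name repeats once per source (hypothetical layout).
    """
    conn.executescript("""
        -- Primary dimension table: each drug name is stored once.
        CREATE TABLE drug (
            drug_id INTEGER PRIMARY KEY,
            name    TEXT UNIQUE
        );
        -- Split table: many rows per drug, linked by the drug number.
        CREATE TABLE efficacy (
            drug_id  INTEGER REFERENCES drug(drug_id),
            source   TEXT,
            efficacy TEXT
        );
        INSERT INTO drug (name)
            SELECT DISTINCT name FROM drug_flat;
        INSERT INTO efficacy (drug_id, source, efficacy)
            SELECT d.drug_id, f.source, f.efficacy
            FROM drug_flat AS f
            JOIN drug AS d ON d.name = f.name;
    """)
```

After the split, a repeated name occupies a single row in `drug`, while its per-source descriptions remain as separate rows in `efficacy`, which gives the one-to-many relationship described above.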

(2) Observe the data forms of the disease-related tables and the drug-related tables: although the two differ in the number and names of their fields, their structure is largely similar. The attributes of the configuration file are therefore defined as primary dimension table, base table, relation table, and ignored fields. Because the configuration file must be readable by a computer, its format has to be specified: in the example, "#" plus the attribute name marks the start of an attribute value and " " marks its end, and the content between them is the set of attribute values. A valid configuration file for drug processing is shown in Fig. 3. The processing flow of the corresponding configuration parser is shown in Fig. 4; it reads each parameter value into the corresponding variable.
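
A parser for a file of this shape could be sketched as follows. The end-of-attribute symbol is not legible in this text, so the sketch takes it as a parameter, `end_marker`, whose default `#end` is purely a placeholder. The function name `parse_config` and the attribute names in the usage note are likewise hypothetical, and Fig. 3 and Fig. 4 are not reproduced here.

```python
def parse_config(text, end_marker="#end"):
    """Parse attribute blocks into a dict mapping name -> list of values.

    A block starts with a line "#<attribute name>" and ends with a line
    equal to `end_marker` (placeholder default; the real symbol is not
    known from this text). Lines in between are the attribute's values.
    """
    features = {}
    current = None  # name of the attribute block being read, if any
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == end_marker:
            # Must be tested before the "#" check, since it may start with "#".
            current = None
        elif line.startswith("#"):
            current = line[1:].strip()
            features[current] = []
        elif current is not None:
            features[current].append(line)
    return features
```

An attribute that is not needed for a given database can simply be left as an empty block, which the parser returns as an empty list.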

(3) Perform the common part of the processing. In this example the general-purpose processing program generates a semantic network; this part can be changed according to the specific task.
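
The patent does not say how the semantic network is built. As one possible illustration, the general-purpose step could turn each extracted record into subject-predicate-object triples. The function name `to_triples`, the `table:key` subject format, and the field names in the usage note are assumptions made for this sketch only.

```python
def to_triples(table, key_field, record):
    """Turn one extracted record into (subject, predicate, object) triples.

    The subject is built from the table name and the record's key value;
    every other non-empty field becomes one triple (hypothetical scheme).
    """
    subject = f"{table}:{record[key_field]}"
    return [
        (subject, field, value)
        for field, value in record.items()
        if field != key_field and value is not None
    ]
```

For a record such as `{"drug_id": 1, "name": "Herba Ephedrae"}` in a table called `drug`, this yields the single triple `("drug:1", "name", "Herba Ephedrae")`.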

(4) The configuration file shown in Fig. 3 completes the processing task for the drug-related data source. If the disease-related data source needs to be processed, with the name field ignored, the configuration file shown in Fig. 5 allows the program to be reused.

Claims

1. A configuration-based snowflake model information extraction method, characterized in that, for databases of a particular data model, the processing program is reused by modifying a configuration file, the method specifically comprising the following steps:

1) judging, for each of several databases to be processed, whether its data model satisfies the definition of the snowflake model; if it is not a snowflake model, converting it into one; if it is, proceeding to step 2);

2) analyzing the overall characteristics of the databases to be processed, summarizing their common abstract features, and defining these in a configuration file;

3) implementing a parser that reads the content of the configuration file, stores it in in-memory program variables, and traverses the information in the database; and also implementing a general-purpose processing program that further processes the extracted database information;

4) for an information processing task on a particular database to be processed, analyzing the data characteristics of that database and, on the basis of the configuration file obtained in step 2), placing some or all of the defined abstract features in the configuration file as the effective abstract features, thereby reusing the program.

2. The configuration-based snowflake model information extraction method according to claim 1, characterized in that the snowflake model is defined as a model in which one or more dimension tables are not connected directly to the fact table but are connected to it through a primary dimension table.

3. The configuration-based snowflake model information extraction method according to claim 1, characterized in that step 2) is specifically as follows: for different snowflake model databases, the primary dimension table, primary key, split tables, and filtered fields are configured in the file as features; special symbols are defined as the start line and end line of each category of feature, and the lines between the start line and the end line are the values of that feature.

4. The configuration-based snowflake model information extraction method according to claim 1, characterized in that the process of traversing the information in the database in step 3) is as follows: traversing every record in the primary dimension table; for each record, obtaining and processing the values of its non-filtered fields; after the record has been processed, using its primary key to traverse all the split tables; for each split table, obtaining the records associated with the primary dimension table by foreign key, and likewise processing their field values after filtering; wherein the program that processes the information is determined by the requirements of the actual task.