CN105824914A - Configuration-based snowflake model information extraction method - Google Patents

Info

Publication number
CN105824914A
Authority
CN
China
Prior art keywords
model
snowflake
snowflake model
data
configuration file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610148250.9A
Other languages
Chinese (zh)
Inventor
张引
魏宝刚
庄越挺
钱宏泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN201610148250.9A
Publication of CN105824914A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/252 Integrating or interfacing systems involving database management systems between a Database Management System and a front-end application

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a configuration-based method for extracting information from snowflake models. The method comprises the following steps: first, the data models in the databases are analyzed against the definition of a snowflake model, and any model that does not conform is redesigned and converted; second, the characteristics of the different data sources are analyzed, their common abstract features are summarized and expressed in a configuration file, and a corresponding configuration file parser and a subsequent general-purpose processing program are implemented; third, the tables and fields to be processed are selected from a specific database, and a configuration file that conforms to the specification is written; finally, the parser extracts the content the user needs according to the configuration file, and that content is processed. With this method, for tasks that conform to the snowflake model and share the same subsequent processing, only the corresponding configuration file needs to be rewritten; the existing general-purpose processing program is reused for the new task, which saves time and labor.

Description

A configuration-based snowflake model information extraction method
Technical field
The present invention relates to the field of data model processing, and in particular to a configuration-based snowflake model information extraction method.
Background
With the development of science and technology, data is produced ever faster, and finding an efficient means of processing it has become a pressing task. Here "efficient" means that a processing program can be reused across several databases that have similar information processing tasks, without large-scale modification of the program.
At present, data is commonly organized according to one of two models: the star model and the snowflake model. The star model is a denormalized structure in which every dimension of the cube is connected directly to the fact table and there are no slowly changing dimensions; the data therefore contains some redundancy, but query efficiency is comparatively high. The snowflake schema adds further levels to the dimension tables of the star schema, reducing data redundancy as far as possible by decomposing the dimension tables. Its dimension tables are based on normalization theory, which makes it a design pattern that lies between third normal form and the star schema: typically part of the data is organized in normalized third-normal-form structures, and part in the fact-table-and-dimension-table structure of the star schema. The advantages of the snowflake schema are that it reduces storage space to some extent and that its normalized structure is easier to update and maintain. It also has drawbacks: it is more complex, users find it harder to understand and to browse, and the extra joins degrade query performance.
Although the specific fields differ among the dimension tables obtained by decomposing the tables connected to the fact table, their structures are largely similar. The different dimension tables can therefore be characterized through a defined configuration file, and the processing program reused.
Summary of the invention
The object of the invention is to summarize and abstract the features of different data sources under the snowflake model and to distinguish them by means of a configuration file, so that an existing processing program can be reused and data processing efficiency improved.
The object of the invention is achieved by the following technical solution:
A configuration-based snowflake model information extraction method which, for databases of a particular model, achieves reuse of the processing program by modifying a configuration file, and which comprises the following steps:
1) Determine, for each of the databases to be processed, whether its data model conforms to the definition of a snowflake model; if it does not, convert it to a snowflake model; if it does, proceed to step 2);
2) The dimension tables obtained by decomposition in a snowflake model are in one-to-many relationships; different data sources usually differ only in the number and names of their fields, while their structures are largely the same and the tables are linked by foreign keys. Analyze the overall characteristics of the databases to be processed, summarize their common abstract features, and define these in a configuration file;
3) For the configuration file, implement a parser that reads its content and uses it as the run-time parameters with which the program traverses the database. Also implement a general-purpose processing program that further processes the information extracted from the database; the specifics of this program vary with the task requirements.
4) Customize the configuration file to reuse the program: for the same processing task on a different database, it is only necessary to examine the characteristics of the data and define them in the configuration file; the program can then be reused and information processing efficiency improved. Specifically, when an information processing task is to be carried out on a particular database, first analyze the data characteristics of that database; then, on the basis of the configuration file obtained in step 2), place some or all of the already defined abstract features in the configuration file as the effective abstract features. The program code need not be modified, so the program is reused efficiently.
The snowflake model is defined as a model in which one or more dimension tables are not connected directly to the fact table but are connected to it through a primary dimension table.
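The definition above can be turned into a simple structural check. The following is a minimal sketch, not the patent's own procedure: it assumes the schema is given as a list of joined table pairs, and the function name `classify_model` and the table names are hypothetical. A table that can only be reached from the fact table in two or more hops is treated as a dimension table connected through a primary dimension table.

```python
from collections import deque
from typing import Iterable, Tuple


def classify_model(fact_table: str, joins: Iterable[Tuple[str, str]]) -> str:
    """Return "snowflake" if some table reaches the fact table only indirectly."""
    # Build an undirected adjacency map from the joined table pairs.
    adjacency = {}
    for left, right in joins:
        adjacency.setdefault(left, set()).add(right)
        adjacency.setdefault(right, set()).add(left)

    # Breadth-first search from the fact table, recording the hop count.
    distance = {fact_table: 0}
    queue = deque([fact_table])
    while queue:
        table = queue.popleft()
        for neighbour in adjacency.get(table, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[table] + 1
                queue.append(neighbour)

    # Two or more hops means the table hangs off another dimension table.
    indirect = [t for t, hops in distance.items() if hops >= 2]
    return "snowflake" if indirect else "star"


# Hypothetical schemas: a fact table "treats" with drug and disease dimensions.
star_joins = [("treats", "drug"), ("treats", "disease")]
snowflake_joins = star_joins + [("drug", "drug_efficacy"), ("drug", "drug_source")]
```

A result of "star" would indicate that the dimension tables still need to be decomposed before the rest of the method applies.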
Step 2) is specifically as follows: for the different snowflake model databases, the primary dimension table, primary key, decomposition tables, and filtered fields are configured in the file as features. Special symbols (chosen to suit the actual situation) are defined as the start line and the end line of each feature category; the lines between the start line and the end line are the values of that feature category.
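A parser for such a file might look like the sketch below. The patent leaves the special symbols open, so the choices here are assumptions: a start line is taken to be "#" followed by the feature name, and the end line is a hypothetical marker `#end` that can be overridden through the `end_marker` parameter. The feature names in the sample are likewise illustrative.

```python
from typing import Dict, List


def parse_config(text: str, end_marker: str = "#end") -> Dict[str, List[str]]:
    """Read feature blocks into a dict mapping feature name to its values."""
    features: Dict[str, List[str]] = {}
    current = None  # name of the feature block currently being read
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line == end_marker:          # end line closes the current block
            current = None
        elif line.startswith("#"):      # start line opens a new block
            current = line[1:].strip()
            features[current] = []
        elif current is not None:       # any other line is a value
            features[current].append(line)
    return features


# Hypothetical configuration text for a drug database.
SAMPLE = """
#primary_dimension_table
drug
#end
#primary_key
drug_id
#end
#decomposition_tables
drug_efficacy
drug_source
#end
#filtered_fields
#end
"""

config = parse_config(SAMPLE)
```

An empty block, such as `filtered_fields` in the sample, parses to an empty list, which is one way to represent a feature that a given database does not need.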
In step 3), the information in the database is traversed as follows: traverse each record in the primary dimension table; for each record, obtain the values of its non-filtered fields and process them; once a record has been processed, use its primary key to traverse all the decomposition tables and, for each of them, obtain the records associated with the primary dimension table by foreign key and process their values in turn, again taking the filtered fields into account. The program that processes the information is determined by the actual task requirements.
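The traversal can be sketched with Python's standard `sqlite3` module. This is an illustration under stated assumptions rather than the patent's implementation: the configuration is a dict of lists keyed by hypothetical feature names, the foreign key column in each decomposition table is assumed to carry the same name as the primary key, filtered fields are interpreted as fields to skip, and table names are trusted because they come from the configuration file.

```python
import sqlite3


def traverse(conn, config, process):
    """Visit each primary-dimension record, then its decomposition records."""
    main_table = config["primary_dimension_table"][0]
    key = config["primary_key"][0]
    filtered = set(config["filtered_fields"])
    conn.row_factory = sqlite3.Row  # rows become addressable by column name

    for record in conn.execute(f'SELECT * FROM "{main_table}"'):
        # Process the non-filtered fields of the primary dimension record.
        process(main_table, {k: record[k] for k in record.keys() if k not in filtered})
        # Follow the primary key into every decomposition table.
        for table in config["decomposition_tables"]:
            query = f'SELECT * FROM "{table}" WHERE "{key}" = ?'
            for row in conn.execute(query, (record[key],)):
                process(table, {k: row[k] for k in row.keys() if k not in filtered})


# Small in-memory database with hypothetical drug data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE drug (drug_id INTEGER PRIMARY KEY, name TEXT, note TEXT);
CREATE TABLE drug_efficacy (drug_id INTEGER, efficacy TEXT);
INSERT INTO drug VALUES (1, 'Herba Ephedrae', 'internal remark');
INSERT INTO drug_efficacy VALUES (1, 'efficacy statement A');
INSERT INTO drug_efficacy VALUES (1, 'efficacy statement B');
""")

config = {
    "primary_dimension_table": ["drug"],
    "primary_key": ["drug_id"],
    "decomposition_tables": ["drug_efficacy"],
    "filtered_fields": ["note"],
}

seen = []
traverse(conn, config, lambda table, values: seen.append((table, values)))
```

The `process` callback stands in for the general-purpose processing program, so a different task would supply a different callback while the traversal stays unchanged.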
Compared with the prior art, the present invention has the following advantages:
1. For a different task only the configuration file needs to be modified; the program source code need not be modified and recompiled, which saves time;
2. The configuration file is highly readable and easy for non-specialists to pick up;
3. Further functionality can be added later by modifying and improving the configuration file parser.
Brief description of the drawings
Fig. 1 shows the star model of "drug treats disease" in the embodiment;
Fig. 2 shows the snowflake model of "drug treats disease" in the embodiment;
Fig. 3 shows the custom configuration file for drug processing in the embodiment;
Fig. 4 shows the processing flow of the configuration file parser in the embodiment;
Fig. 5 shows the custom configuration file for disease processing in the embodiment.
Detailed description
The invention is described in further detail below with reference to a specific example and the drawings. The original data has been simplified, which does not affect the explanation of the method. The technical features described in the invention may be combined with one another provided they do not conflict.
The core idea of the configuration-based snowflake model information extraction method of the invention is, for databases of a particular model, to achieve reuse of the processing program by modifying a configuration file. The method comprises the following steps:
(1) Determine whether the data model of the database conforms to the definition of a snowflake model; if it does not, it must be converted to one. A snowflake model is characterized by one or more dimension tables that are not connected directly to the fact table but are connected to it through a primary dimension table; its diagram resembles several snowflakes joined together, hence the name. The fact table stores the detailed values of the facts or measures but contains no descriptive information. In the "drug treats disease" relation, for example, it holds the primary keys of the drug and the disease, while the description of each is recorded in dimension tables, which can be further decomposed into a primary dimension table and decomposition tables associated with it by foreign key. Designing or converting to a snowflake model is done by decomposing each dimension table connected to the fact table into a primary dimension table and decomposition tables, the two being associated by foreign key.
(2) Analyze the characteristics of the different databases, abstract their common features, and design a configuration file in which those common features are defined. Snowflake model databases differ mainly in their primary dimension table, primary key, and decomposition tables, which need to be configured in the file as features. In addition, subsequent data processing does not involve every field in the tables, so the filtered fields can also be configured in the file as a feature. The configuration file thus comprises four features: primary dimension table, primary key, decomposition tables, and filtered fields. To make the file easy for a computer to read, "# feature name" is defined as the start line of each feature category and " " as its end line; the lines in between are the values of that feature category. The parser must read this content and store it in in-memory program variables. The information in the database is traversed as follows: traverse each record in the primary dimension table; for each record, obtain the values of its non-filtered fields and process them; once a record has been processed, use its primary key to traverse all the decomposition tables and, for each of them, obtain the records associated with the primary dimension table by foreign key and process their values in turn, again taking the filtered fields into account. The general-purpose processing program that processes the information varies with the specific requirements.
(3) Once the configuration file has been defined in the above manner and the configuration file parser and general-purpose processing program have been implemented, the same information processing task on a different database only requires examining the data, identifying the features that database needs, and defining them in the configuration file, starting from the configuration file already defined. If a common feature defined in step (2) is not needed for a given database, it can be set to "empty". The program can thus be reused without modifying its code, improving information processing efficiency.
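The conversion mentioned in step (1), decomposing a dimension table into a primary dimension table and a decomposition table linked by foreign key, can be sketched as follows. The function name `split_dimension`, the field names, and the sample rows are hypothetical, and the sketch places all detail fields in a single decomposition table, which is only one possible design.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def split_dimension(
    rows: List[Row], name_field: str, detail_fields: List[str]
) -> Tuple[List[Row], List[Row]]:
    """Split denormalized rows into a primary table and a decomposition table."""
    ids: Dict[object, int] = {}  # name value -> generated key
    details: List[Row] = []
    for row in rows:
        # Reuse the key if this name was seen before, otherwise assign the next one.
        row_id = ids.setdefault(row[name_field], len(ids) + 1)
        detail = {"id": row_id}  # "id" acts as the foreign key
        detail.update({field: row[field] for field in detail_fields})
        details.append(detail)
    primary = [{"id": row_id, name_field: name} for name, row_id in ids.items()]
    return primary, details


# Hypothetical star-model rows in which one drug name is stored twice.
flat_rows = [
    {"name": "Herba Ephedrae", "source": "source A", "efficacy": "statement A"},
    {"name": "Herba Ephedrae", "source": "source B", "efficacy": "statement B"},
    {"name": "another drug", "source": "source A", "efficacy": "statement C"},
]

primary_table, decomposition_table = split_dimension(
    flat_rows, "name", ["source", "efficacy"]
)
```

After the split each name is stored once in the primary table, and the repeated descriptions become rows of the decomposition table in a one-to-many relationship.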
Embodiment
The configuration-based snowflake model information extraction method in this embodiment comprises the following steps:
(1) Fig. 1 shows the "drug treats disease" data in a star model. If there are several drugs with the same name, for example "Herba Ephedrae", but with different sources and differently stated efficacy, the name "Herba Ephedrae" is stored repeatedly, causing data redundancy. The multidimensional data table is decomposed and restructured to form the snowflake model shown in Fig. 2. As can be seen, the efficacy and source fields are split out of the original table and associated with it using the drug number as the foreign key. Because the efficacy of the same drug is described differently for different sources, the drug table and the efficacy table are in a one-to-many relationship.
(2) Examine the data format of the disease-related tables and the drug-related tables: although the number and names of their fields differ, their structures are largely similar. The attributes of the configuration file are therefore defined as the primary dimension table, base table, relation table, and ignored fields. Because the configuration file must be machine-readable, its format has to be specified: in this example "#" followed by the attribute name marks the start of an attribute value and " " marks its end, and the content in between is the set of attribute values. A valid configuration file for drug processing is shown in Fig. 3. The processing flow of the corresponding configuration file parser is shown in Fig. 4; it reads each parameter value into the corresponding variable.
(3) Carry out the common processing. In this example the general-purpose program generates a semantic network; this part can be changed according to the specific task.
(4) The configuration file shown in Fig. 3 completes the processing task for the drug-related data source. If the disease-related data source needs to be processed with the title ignored, the configuration file shown in Fig. 5 is used instead, and the program is reused.
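The reuse described in steps (3) and (4) can be illustrated with a small general-purpose routine that is driven only by configuration values. The patent does not specify how the semantic network is represented; emitting (subject, predicate, object) triples is an assumption made here, and the function name `to_triples`, the field names, and the sample records are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, object]


def to_triples(
    table: str, record: Dict[str, object], key_field: str, ignored: Set[str]
) -> List[Triple]:
    """Turn one record into triples, skipping the key and the ignored fields."""
    subject = f"{table}:{record[key_field]}"
    return [
        (subject, field, value)
        for field, value in record.items()
        if field != key_field and field not in ignored
    ]


# The same routine serves both data sources; only the configuration differs.
drug_config = {"table": "drug", "key": "drug_id", "ignored": set()}
disease_config = {"table": "disease", "key": "disease_id", "ignored": {"title"}}

drug_record = {"drug_id": 1, "name": "Herba Ephedrae"}
disease_record = {"disease_id": 7, "title": "some disease", "symptom": "some symptom"}

drug_triples = to_triples(
    drug_config["table"], drug_record, drug_config["key"], drug_config["ignored"]
)
disease_triples = to_triples(
    disease_config["table"], disease_record, disease_config["key"], disease_config["ignored"]
)
```

Switching from the drug source to the disease source changes only the configuration values passed in; the routine itself is untouched, which is the kind of reuse the method aims for.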

Claims (4)

1. A configuration-based snowflake model information extraction method, characterized in that, for databases of a particular data model, reuse of the processing program is achieved by modifying a configuration file, the method comprising the following steps:
1) determining, for each of the databases to be processed, whether its data model conforms to the definition of a snowflake model; if it does not, converting it to a snowflake model; if it does, proceeding to step 2);
2) analyzing the overall characteristics of the databases to be processed, summarizing their common abstract features, and defining these in a configuration file;
3) for the configuration file, implementing a parser that reads its content, stores it in in-memory program variables, and traverses the information in the database; and also implementing a general-purpose processing program that further processes the information extracted from the database;
4) for an information processing task on a particular database to be processed, analyzing the data characteristics of that database and, on the basis of the configuration file obtained in step 2), placing some or all of the already defined abstract features in the configuration file as the effective abstract features, thereby reusing the program.
2. The configuration-based snowflake model information extraction method according to claim 1, characterized in that the snowflake model is defined as a model in which one or more dimension tables are not connected directly to the fact table but are connected to it through a primary dimension table.
3. The configuration-based snowflake model information extraction method according to claim 1, characterized in that step 2) is specifically as follows: for the different snowflake model databases, the primary dimension table, primary key, decomposition tables, and filtered fields are configured in the file as features; special symbols are defined as the start line and the end line of each feature category, and the lines between the start line and the end line are the values of that feature category.
4. The configuration-based snowflake model information extraction method according to claim 1, characterized in that, in step 3), the information in the database is traversed as follows: traverse each record in the primary dimension table; for each record, obtain the values of its non-filtered fields and process them; once a record has been processed, use its primary key to traverse all the decomposition tables and, for each of them, obtain the records associated with the primary dimension table by foreign key and process their values in turn, again taking the filtered fields into account; the program that processes the information is determined by the actual task requirements.
CN201610148250.9A 2016-03-15 2016-03-15 Configuration-based snowflake model information extraction method Pending CN105824914A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610148250.9A CN105824914A (en) 2016-03-15 2016-03-15 Configuration-based snowflake model information extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610148250.9A CN105824914A (en) 2016-03-15 2016-03-15 Configuration-based snowflake model information extraction method

Publications (1)

Publication Number Publication Date
CN105824914A true CN105824914A (en) 2016-08-03

Family

ID=56987823

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610148250.9A Pending CN105824914A (en) 2016-03-15 2016-03-15 Configuration-based snowflake model information extraction method

Country Status (1)

Country Link
CN (1) CN105824914A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101833708A (en) * 2010-05-07 2010-09-15 山东中创软件工程股份有限公司 Method and device for generating early warning information
CN105335412A (en) * 2014-07-31 2016-02-17 阿里巴巴集团控股有限公司 Method and device for data conversion and data migration
CN103092866B (en) * 2011-11-03 2016-08-31 金蝶软件(中国)有限公司 Data monitoring method and supervising device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
汤晓炜: ""电信经营分析中的指标上传系统研究与实现"", 《中国优秀硕士学位论文全文数据库》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019357B (en) * 2017-09-29 2021-06-29 北京国双科技有限公司 Database query script generation method and device
CN113010611A (en) * 2019-12-19 2021-06-22 北京阿博茨科技有限公司 Method and system for automatically generating relations between relational database tables
CN111427938A (en) * 2020-03-18 2020-07-17 中国建设银行股份有限公司 Data unloading method and device
CN111427938B (en) * 2020-03-18 2023-08-29 中国建设银行股份有限公司 Data transfer method and device
CN114706575A (en) * 2022-06-07 2022-07-05 杭州比智科技有限公司 Method and system for migrating and multiplexing data model

Similar Documents

Publication Publication Date Title
US10860632B2 (en) Information query method and device
CN105868204B (en) A kind of method and device for converting Oracle scripting language SQL
JP6165864B2 (en) Working with distributed databases with external tables
CN107622103B (en) Managing data queries
EP2608074A2 (en) Systems and methods for merging source records in accordance with survivorship rules
Chung et al. JackHare: a framework for SQL to NoSQL translation using MapReduce
EP3671526B1 (en) Dependency graph based natural language processing
CN105824914A (en) Configuration-based snowflake model information extraction method
CN103425780A (en) Data inquiry method and data inquiry device
CN1592908B (en) Database system having heterogeneous object types
Jiang et al. Mapping-driven XML transformation
JP5927886B2 (en) Query system and computer program
JP2008171181A (en) Structured data search apparatus
CN107491476A (en) A kind of data model translation and query analysis method suitable for a variety of big data management systems
JP2006053724A (en) Xml data management method
CN113297251A (en) Multi-source data retrieval method, device, equipment and storage medium
US20140067853A1 (en) Data search method, information system, and recording medium storing data search program
WO2013111287A1 (en) Sparql query optimization method
CN107818181A (en) Indexing means and its system based on Plcient interactive mode engines
AU2003222783A1 (en) Method and apparatus for querying relational databases
Budikova et al. Query language for complex similarity queries
US20080126317A1 (en) Method and system for converting source data files into database query language
Elamparithi et al. A Review on Database Migration Strategies, Techniques and Tools
CN110825792A (en) High-concurrency distributed data retrieval method based on golang middleware coroutine mode
KR20130057715A (en) Method for providing deep domain knowledge based on massive science information and apparatus thereof

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160803