CN109710602A - Data model detection method and device - Google Patents


Info

Publication number
CN109710602A
Authority
CN
China
Prior art keywords
data
column
metadatabase
unstructured
structural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201811599084.XA
Other languages
Chinese (zh)
Inventor
于阔
郭庆
宋怀明
谢莹莹
蒋丹东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Dawning International Information Industry Co Ltd
Dawning Information Industry Co Ltd
Original Assignee
Zhongke Dawning International Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by: Zhongke Dawning International Information Industry Co Ltd
Priority to: CN201811599084.XA
Publication of: CN109710602A
Legal status: Withdrawn

Abstract

The present invention provides a data model detection method and device. The method comprises: extracting the schema information carried in the data; performing column filtering on the extracted data to filter out columns irrelevant to association; performing value aggregation on each column; comparing the data of each column with the column fields already present in a metadata database and finding the column in the metadata database with the highest matching degree, and, if the matching degree exceeds a set threshold, identifying the column as a duplicate field and defining its metadata column as the existing column; and updating the metadata database by adding the newly added columns. The present invention can improve the success rate of automatic data model detection and reduce manual intervention.

Description

Data model detection method and device
Technical field
The present invention relates to the technical field of big data, and in particular to a data model detection method and device.
Background art
Many data sources that store complex data, from social databases and e-commerce databases to human gene databases, are built on complex, multidimensional data sets stored in large volumes. A major challenge in processing such data sets is how to discover the implicit data structures and data associations in massive data and ultimately extract meaningful data. Typically, analysts use various analysis tools to help extract part of the meaningful data. However, modeling and presenting the data of a complex data source with existing analysis tools requires continuous human-computer interaction. The user must be very familiar with the characteristics of the complex data set and must give the computer explicit instructions so that it invokes the corresponding algorithms to complete the modeling. In many cases this interaction has to be repeated many times. When the data being handled is counted in the trillions, this way of processing data is extremely complex and tedious. To meet this need, automatic data modeling methods and visualization tools have already appeared on the market.
However, existing automatic data modeling only examines the data of the new data source. Although the detection result can be applied effectively to that single data source, considerable obstacles remain for association queries across multiple data sources: tedious manual intervention is needed to align the associated fields. This work generates no new value and is highly error-prone. For data that is used repeatedly, it is very inefficient.
With the development of data services, a single business application scenario generally contains both structured and unstructured data, and the two frequently need to be processed together. Structured data refers to row data that is stored in a database and can be expressed logically with a two-dimensional table structure. Data that is difficult to represent with the two-dimensional logical tables of a database is called unstructured data; it includes office documents of all formats, text, pictures, XML, HTML, various reports, images, and audio/video information.
In prior-art data processing, structured data can be stored directly in a relational database, where querying, filtering, or computation on it is carried out. Unstructured data is batch-processed with big data technologies such as Hadoop, MapReduce, and Spark, including querying, filtering, or computation. Association queries between structured and unstructured data are possible in the prior art: the unstructured data is modeled in advance and the association query is performed afterwards. However, existing automatic modeling techniques for unstructured data have considerable limitations: the identified model may not match the structured data model exactly, which causes the query to fail. The root cause is that the models of different data sources are defined by different people, so essentially identical data may be named in several ways or even in different languages. For example, a field called "Identity" in one data source may be called "Social ID" in another. The two hold the same information and can serve as the key of an association query, but they can only be used correctly if the user is clearly aware of this.
Summary of the invention
The data model detection method and device provided by the present invention can improve the success rate of automatic data model detection and reduce manual intervention.
In a first aspect, the present invention provides a data model detection method, comprising:
extracting the schema information carried in the data;
performing column filtering on the extracted data to filter out columns irrelevant to association;
performing value aggregation on each column;
comparing the data of each column with the column fields already present in a metadata database, finding the column in the metadata database with the highest matching degree, and, if the matching degree exceeds a set threshold, identifying the column as a duplicate field and defining its metadata column as the existing column; and
updating the metadata database by adding the newly added columns.
Optionally, after updating the metadata database and adding the newly added columns, the method further comprises:
obtaining an association query request and decomposing it into multiple subquery requests;
when the multiple subquery requests include a query request for an unstructured data component, invoking the parsing mode corresponding to the unstructured data component to parse that component, obtaining data with a schema; and
performing an association query on the data with a schema and the structured data to obtain the result set of the association query.
Optionally, performing the association query on the data with a schema and the structured data to obtain the result set of the association query comprises:
aggregating each column of the data with a schema to obtain a value-domain set, where numeric columns that do not serve as join conditions may be skipped; and querying the metadata database for each column, where, if the value-domain set matches the data of an existing column, the new model column references the existing column, and otherwise the column name is normalized and the column is added to the source database as a new column.
Optionally, the association query comprises: a join operation between structured data and unstructured data; a cascade of a structured-data SQL query and unstructured-data query processing; and a union operation on a structured-data query result and unstructured data.
In a second aspect, the present invention provides a data model detection device, comprising:
an extraction unit, configured to extract the schema information carried in the data;
a filtering unit, configured to perform column filtering on the extracted data to filter out columns irrelevant to association;
an aggregation unit, configured to perform value aggregation on each column;
a comparison unit, configured to compare the data of each column with the column fields already present in a metadata database, find the column in the metadata database with the highest matching degree, and, if the matching degree exceeds a set threshold, identify the column as a duplicate field and define its metadata column as the existing column; and
an updating unit, configured to update the metadata database by adding the newly added columns.
Optionally, the device further comprises:
an obtaining unit, configured to obtain an association query request after the updating unit has updated the metadata database and added the newly added columns, and to decompose the association query request into multiple subquery requests;
an invoking unit, configured to, when the multiple subquery requests include a query request for an unstructured data component, invoke the parsing mode corresponding to the unstructured data component to parse that component, obtaining data with a schema; and
a query unit, configured to perform an association query on the data with a schema and the structured data to obtain the result set of the association query.
Optionally, the query unit is configured to aggregate each column of the data with a schema to obtain a value-domain set, where numeric columns that do not serve as join conditions may be skipped; and to query the metadata database for each column, where, if the value-domain set matches the data of an existing column, the new model column references the existing column, and otherwise the column name is normalized and the column is added to the source database as a new column.
Optionally, the association query comprises: a join operation between structured data and unstructured data; a cascade of a structured-data SQL query and unstructured-data query processing; and a union operation on a structured-data query result and unstructured data.
The data model detection method and device provided by the embodiments of the present invention obtain data with a schema through independent parsing of unstructured data. No manual intervention is needed: the data parsing mode set when the data was defined can be invoked automatically to parse the unstructured data, achieving accurate association queries between unstructured and structured data. This can improve the degree of matching between the data model identified for an unstructured data source and the existing models, substantially increase the success rate of automatic data model detection, and reduce manual intervention. By introducing the metadata database, data model detection results are persisted, which can greatly reduce the overhead of repeated queries.
Brief description of the drawings
Fig. 1 is a flowchart of the data model detection method provided by an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a data model detection device according to one embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a data model detection device according to another embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
An embodiment of the present invention provides a data model detection method. As shown in Fig. 1, the method comprises:
S11. Extract the schema information carried in the data.
S12. Perform column filtering on the extracted data to filter out columns irrelevant to association.
The columns filtered out include floating-point numeric columns.
S13. Perform value aggregation on each column.
One way to perform the value aggregation is sorting and deduplication.
S14. Compare the data of each column with the column fields already present in the metadata database and find the column in the metadata database with the highest matching degree; if the matching degree exceeds a set threshold, identify the column as a duplicate field and define its metadata column as the existing column.
S15. Update the metadata database by adding the newly added columns.
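Steps S12 to S15 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patented implementation: the text does not define how the matching degree is computed, so Jaccard overlap of the aggregated value sets is assumed here, and the function names, the in-memory `metadata` dictionary, and the `threshold` default of 0.5 are all hypothetical.

```python
from typing import Dict, List

Table = Dict[str, list]  # column name -> list of cell values


def filter_columns(table: Table) -> Table:
    """S12: drop columns unlikely to be association keys (here: any column holding floats)."""
    return {
        name: values
        for name, values in table.items()
        if not any(isinstance(v, float) for v in values)
    }


def aggregate(values: list) -> List[str]:
    """S13: value aggregation by deduplicating and sorting (values compared as strings)."""
    return sorted({str(v) for v in values})


def match_degree(a: List[str], b: List[str]) -> float:
    """Assumed metric: Jaccard overlap of two aggregated value lists."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0


def detect(table: Table, metadata: Dict[str, List[str]], threshold: float = 0.5) -> Dict[str, str]:
    """S14 and S15: map each column to an existing metadata column or register it as new."""
    mapping = {}
    for name, values in filter_columns(table).items():
        agg = aggregate(values)
        best = max(metadata, key=lambda m: match_degree(agg, metadata[m]), default=None)
        if best is not None and match_degree(agg, metadata[best]) > threshold:
            mapping[name] = best      # duplicate field: reuse the existing column
        else:
            metadata[name] = agg      # newly added column
            mapping[name] = name
    return mapping


metadata = {"medical_record_no": ["1001", "1002", "1003"]}
table = {
    "patient_id": [1003, 1001, 1001, 1002],
    "ecg_variance": [0.8, 1.9, 1.9, 0.4],
    "ward": ["A", "B", "A", "B"],
}
print(detect(table, metadata))
```

In this sketch the floating-point column is dropped before matching, `patient_id` is recognized as the existing `medical_record_no` column because their value sets coincide, and `ward` is registered as a new metadata column.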
The data schema referred to in the embodiments of the present invention is an explicit description of data. A database stores data according to schemas; it is precisely because data schemas exist that complex data structures can be constructed to establish the inner links and complex relationships among data, thereby forming the global structural schema of the data. A data schema characterizes the "type" aspect of data on the basis of the selected data model, while the corresponding "instance" describes the "value" aspect of the data. The data model comes first, and the corresponding data schema is discussed on its basis; once a data schema exists, the corresponding instances can be obtained according to it. Data that has definite fields and types has a data schema and is called structured data. Otherwise the data has no schema and is unstructured data, such as logs, blogs, web pages, data stored in NoSQL (non-relational) databases, and media data files such as pictures, video, and audio.
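The distinction between the "type" aspect and the "value" aspect can be illustrated with a small Python sketch. The `PatientRecord` class and the `conforms` helper are hypothetical names introduced only for this illustration; they are assumptions and do not appear in the patent.

```python
from dataclasses import dataclass, fields


@dataclass
class PatientRecord:
    """The schema: field names and their types (the "type" aspect)."""
    record_no: str
    age: int
    gender: str


# An instance: concrete values that follow the schema (the "value" aspect).
instance = PatientRecord("1001", 64, "F")


def conforms(row: dict, schema) -> bool:
    """Return True if the row has exactly the schema's fields with matching types."""
    spec = {f.name: f.type for f in fields(schema)}
    return row.keys() == spec.keys() and all(
        isinstance(row[name], expected) for name, expected in spec.items()
    )


print(conforms({"record_no": "1001", "age": 64, "gender": "F"}, PatientRecord))
print(conforms({"text": "free-form log line"}, PatientRecord))
```

A row with definite fields and types passes the check and can be treated as structured data; a free-form log line does not, which is the sense in which it has no schema.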
The association query in this embodiment is not merely a join of two two-dimensional tables in a relational database. It refers to join operations, union operations, cascade operations, and so on between structured and unstructured data. Structured and unstructured data are treated as equal data objects, and operations on the two kinds of data objects are fused into a unified operation.
The association queries proposed here mainly include, but are not limited to, the following three: a join operation between structured data and unstructured data; a cascade of a structured-data SQL query and unstructured-data query processing; and a union operation on a structured-data query result and unstructured data.
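The three kinds of association query can be sketched over plain Python lists of dictionaries. This is a minimal sketch under assumptions: the sample rows, the `record_no` key, and the exact semantics chosen for the cascade and union helpers are hypothetical, since the text names the operations without specifying them further.

```python
# Structured side: rows as they might come from a relational table.
patients = [
    {"record_no": "1001", "age": 64},
    {"record_no": "1002", "age": 37},
]
# Unstructured side after parsing: hypothetical ECG findings.
ecg = [
    {"record_no": "1001", "fluctuation": "high"},
    {"record_no": "1003", "fluctuation": "low"},
]


def join(left, right, key):
    """Inner join of two row lists on a shared key."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def cascade(structured, unstructured, predicate, key):
    """Filter the structured side first, then query the unstructured side with the surviving keys."""
    keys = {row[key] for row in structured if predicate(row)}
    return [row for row in unstructured if row[key] in keys]


def union(structured_result, unstructured, key):
    """Union of the key values found on either side, sorted for stable output."""
    return sorted({row[key] for row in structured_result} | {row[key] for row in unstructured})


print(join(patients, ecg, "record_no"))
print(cascade(patients, ecg, lambda r: r["age"] > 60, "record_no"))
print(union(patients, ecg, "record_no"))
```

In a real system the structured side would be served by SQL and the unstructured side by a batch engine; the sketch only shows how the three result shapes differ.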
To help those skilled in the art understand the association query involved in the present invention more clearly, the following association query scenario is provided. Specific implementations are not limited to this scenario, and this embodiment places no specific limitation on it.
Association query scenario one:
A join operation is performed between structured data and unstructured data. Such applications fall into two cases:
The first case: the structured data and the unstructured data belong to the same object, share a potentially identical associated field, and correspond one to one.
This association query scenario is illustrated below using hospital medical records as the application scenario:
The medical record of each patient contains structured data, including age, gender, medical record number, medical history, date of the most recent visit, and a description of the illness. It also contains unstructured data, including analysis reports, CT films, electrocardiograms, and waveform charts. Analyzing patient history data can be of great help for follow-up treatment, illness analysis, and analysis of pathogenic factors. For example, to analyze patients whose electrocardiograms fluctuate strongly, the characteristics of those patients must be searched for. An unstructured data analysis tool is first needed to identify the electrocardiograms with large fluctuations; the other information of the patients corresponding to those images is then output, and the patients are analyzed according to that output. This process uses unstructured data (electrocardiograms) to index structured data (patient information).
The heterogeneous data sources in the above scenario, comprising structured and unstructured data, correspond to each other. For example, the path of the unstructured data may store the patient's "patient ID" field, while the structured data stores an associated column called "medical record number". An associated column is a column present in both components (structured and unstructured), and the two parts of the data should be associated through it. In this scenario, the "patient ID" value in the unstructured data and the "medical record number" in the structured data are potentially identical associated columns. Because the names differ, the data analyst would otherwise have to remember that the two columns are the same. The approach of the present invention helps model unstructured data more accurately and establishes metadata using the same standard as the structured data source, thereby simplifying the query process.
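The role of the associated column in this scenario can be sketched as follows. The `CANONICAL` mapping stands in for what the metadata database would record once "patient ID" has been identified as the same field as "medical record number"; the mapping, the sample rows, and the helper names are hypothetical assumptions made for illustration.

```python
# Hypothetical metadata: source column name -> canonical (existing) column name.
CANONICAL = {
    "patient ID": "medical record number",
    "medical record number": "medical record number",
}


def canonicalize(row: dict) -> dict:
    """Rename columns to their canonical metadata names; unknown columns are kept as-is."""
    return {CANONICAL.get(name, name): value for name, value in row.items()}


def associate(structured_rows, unstructured_rows, key):
    """Join the two sides on the canonical associated column."""
    by_key = {}
    for row in structured_rows:
        c = canonicalize(row)
        by_key[c[key]] = c
    result = []
    for row in unstructured_rows:
        c = canonicalize(row)
        if c.get(key) in by_key:
            result.append({**by_key[c[key]], **c})
    return result


structured = [{"medical record number": "M-17", "age": 58, "gender": "F"}]
unstructured = [{"patient ID": "M-17", "ecg_file": "ecg/M-17.png"}]
print(associate(structured, unstructured, "medical record number"))
```

With the alias resolved through metadata, the analyst queries by one column name and no longer needs to remember that the two source columns are the same.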
The automatic intelligent modeling method for unstructured data provided by the embodiment of the present invention can be used to perform association queries on structured and unstructured data. The process comprises:
1. Obtain an association query request and decompose it into multiple subquery requests.
2. When the multiple subquery requests include a query request for an unstructured data component, invoke the parsing mode corresponding to the unstructured data component to parse that component, obtaining data with a schema.
3. Perform an association query on the data with a schema and the structured data to obtain the result set of the association query.
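The three steps can be sketched as a small dispatcher. All names here are hypothetical: the `PARSERS` registry stands in for the parsing modes set when the data was defined, the request format is invented for the example, and JSON-lines text is assumed as the unstructured component only because it is easy to parse with the standard library.

```python
import json
from typing import Callable, Dict, List

# Hypothetical registry: component name -> parser registered when the data was defined.
PARSERS: Dict[str, Callable[[str], List[dict]]] = {
    "ecg_log": lambda text: [json.loads(line) for line in text.splitlines() if line.strip()],
}


def decompose(request: dict) -> List[dict]:
    """Step 1: split the request into one structured subquery plus one per unstructured component."""
    subqueries = [{"target": "structured", "table": request["table"]}]
    subqueries += [{"target": "unstructured", "component": c} for c in request["components"]]
    return subqueries


def run_association_query(request, structured_rows, raw_components):
    key = request["key"]
    schema_rows = []
    for sub in decompose(request):
        if sub["target"] == "unstructured":
            # Step 2: invoke the parser for this component to obtain data with a schema.
            parser = PARSERS[sub["component"]]
            schema_rows.extend(parser(raw_components[sub["component"]]))
    # Step 3: associate the parsed rows with the structured rows on the shared key.
    index = {row[key]: row for row in structured_rows}
    return [{**index[r[key]], **r} for r in schema_rows if r[key] in index]


request = {"table": "patients", "components": ["ecg_log"], "key": "record_no"}
raw = {"ecg_log": '{"record_no": "1001", "fluctuation": "high"}\n{"record_no": "1009", "fluctuation": "low"}\n'}
patients = [{"record_no": "1001", "age": 64}]
print(run_association_query(request, patients, raw))
```

The point of the registry is that the parser is chosen automatically from the component named in the subquery, so no manual step is needed between decomposition and association.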
Specifically, each column of the data with a schema is aggregated to obtain a value-domain set, where numeric columns that do not serve as join conditions may be skipped. The metadata database is queried for each column: if the value-domain set matches the data of an existing column, the new model column references the existing column; otherwise the column name is normalized and the column is added to the source database as a new column.
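A minimal sketch of this column registration step follows. Several choices are assumptions, not taken from the text: a value-domain match is treated as exact set equality, every all-numeric column is skipped for simplicity, normalization means lower-casing and replacing non-alphanumeric runs with underscores, and a plain dictionary stands in for the database that receives new columns.

```python
import re
from typing import Dict, Set


def normalize(name: str) -> str:
    """Assumed normalization: lower-case, runs of non-alphanumerics become underscores."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def is_numeric(values) -> bool:
    """True if every value is an int or float (booleans excluded)."""
    return all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values)


def register(columns: Dict[str, list], metadata: Dict[str, Set[str]]) -> Dict[str, str]:
    """Return, for each kept column, the metadata column it references."""
    refs = {}
    for name, values in columns.items():
        if is_numeric(values):
            continue  # simplification: numeric columns are not used as join conditions
        domain = {str(v) for v in values}  # value-domain set
        existing = next((m for m, d in metadata.items() if d == domain), None)
        if existing is not None:
            refs[name] = existing          # reference the existing column
        else:
            new_name = normalize(name)     # normalize the name, then add as a new column
            metadata[new_name] = domain
            refs[name] = new_name
    return refs


metadata = {"medical_record_number": {"M-17", "M-18"}}
columns = {
    "Patient ID": ["M-18", "M-17", "M-17"],
    "ECG File": ["a.png", "b.png", "c.png"],
    "heart_rate": [72, 88, 91],
}
print(register(columns, metadata))
```

Exact equality is the strictest possible match; a thresholded overlap measure could be substituted where the two sources only partly share values.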
In this embodiment, data with a schema refers to the data obtained by parsing the unstructured data component in the defined manner; this data has a defined schema.
The association query on the data with a schema and the structured data is performed on the same storage platform as the structured data, yielding the result set of the association query.
Through independent parsing of unstructured data, the present invention obtains data with a schema. No manual intervention is needed: the data parsing mode set when the data was defined can be invoked automatically to parse the unstructured data, achieving accurate association queries between unstructured and structured data. The present invention can improve the degree of matching between the data model identified for an unstructured data source and the existing models, substantially increase the success rate of automatic data model detection, and reduce manual intervention. By introducing the metadata database, data model detection results are persisted, which can greatly reduce the overhead of repeated queries.
An embodiment of the present invention also provides a data model detection device. As shown in Fig. 2, the device comprises:
an extraction unit 11, configured to extract the schema information carried in the data;
a filtering unit 12, configured to perform column filtering on the extracted data to filter out columns irrelevant to association;
an aggregation unit 13, configured to perform value aggregation on each column;
a comparison unit 14, configured to compare the data of each column with the column fields already present in the metadata database, find the column in the metadata database with the highest matching degree, and, if the matching degree exceeds a set threshold, identify the column as a duplicate field and define its metadata column as the existing column; and
an updating unit 15, configured to update the metadata database by adding the newly added columns.
Optionally, as shown in Fig. 3, the device further comprises:
an obtaining unit 16, configured to obtain an association query request after the updating unit 15 has updated the metadata database and added the newly added columns, and to decompose the association query request into multiple subquery requests;
an invoking unit 17, configured to, when the multiple subquery requests include a query request for an unstructured data component, invoke the parsing mode corresponding to the unstructured data component to parse that component, obtaining data with a schema; and
a query unit 18, configured to perform an association query on the data with a schema and the structured data to obtain the result set of the association query.
Optionally, the query unit 18 is configured to aggregate each column of the data with a schema to obtain a value-domain set, where numeric columns that do not serve as join conditions may be skipped; and to query the metadata database for each column, where, if the value-domain set matches the data of an existing column, the new model column references the existing column, and otherwise the column name is normalized and the column is added to the source database as a new column.
Optionally, the association query comprises: a join operation between structured data and unstructured data; a cascade of a structured-data SQL query and unstructured-data query processing; and a union operation on a structured-data query result and unstructured data.
The data model detection device provided by the embodiment of the present invention obtains data with a schema through independent parsing of unstructured data. No manual intervention is needed: the data parsing mode set when the data was defined can be invoked automatically to parse the unstructured data, achieving accurate association queries between unstructured and structured data. This can improve the degree of matching between the data model identified for an unstructured data source and the existing models, substantially increase the success rate of automatic data model detection, and reduce manual intervention. By introducing the metadata database, data model detection results are persisted, which can greatly reduce the overhead of repeated queries.
Those of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The above is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within its scope of protection. Therefore, the scope of protection of the present invention shall be determined by the scope of protection of the claims.

Claims (8)

1. A data model detection method, characterized by comprising:
extracting the schema information carried in the data;
performing column filtering on the extracted data to filter out columns irrelevant to association;
performing value aggregation on each column;
comparing the data of each column with the column fields already present in a metadata database, finding the column in the metadata database with the highest matching degree, and, if the matching degree exceeds a set threshold, identifying the column as a duplicate field and defining its metadata column as the existing column; and
updating the metadata database by adding the newly added columns.
2. The method according to claim 1, characterized in that, after updating the metadata database and adding the newly added columns, the method further comprises:
obtaining an association query request and decomposing it into multiple subquery requests;
when the multiple subquery requests include a query request for an unstructured data component, invoking the parsing mode corresponding to the unstructured data component to parse that component, obtaining data with a schema; and
performing an association query on the data with a schema and the structured data to obtain the result set of the association query.
3. The method according to claim 2, characterized in that performing the association query on the data with a schema and the structured data to obtain the result set of the association query comprises:
aggregating each column of the data with a schema to obtain a value-domain set, where numeric columns that do not serve as join conditions may be skipped; and querying the metadata database for each column, where, if the value-domain set matches the data of an existing column, the new model column references the existing column, and otherwise the column name is normalized and the column is added to the source database as a new column.
4. The method according to claim 2 or 3, characterized in that the association query comprises: a join operation between structured data and unstructured data; a cascade of a structured-data SQL query and unstructured-data query processing; and a union operation on a structured-data query result and unstructured data.
5. A data model detection device, characterized by comprising:
an extraction unit, configured to extract the schema information carried in the data;
a filtering unit, configured to perform column filtering on the extracted data to filter out columns irrelevant to association;
an aggregation unit, configured to perform value aggregation on each column;
a comparison unit, configured to compare the data of each column with the column fields already present in a metadata database, find the column in the metadata database with the highest matching degree, and, if the matching degree exceeds a set threshold, identify the column as a duplicate field and define its metadata column as the existing column; and
an updating unit, configured to update the metadata database by adding the newly added columns.
6. The device according to claim 5, characterized in that the device further comprises:
an obtaining unit, configured to obtain an association query request after the updating unit has updated the metadata database and added the newly added columns, and to decompose the association query request into multiple subquery requests;
an invoking unit, configured to, when the multiple subquery requests include a query request for an unstructured data component, invoke the parsing mode corresponding to the unstructured data component to parse that component, obtaining data with a schema; and
a query unit, configured to perform an association query on the data with a schema and the structured data to obtain the result set of the association query.
7. The device according to claim 6, characterized in that the query unit is configured to aggregate each column of the data with a schema to obtain a value-domain set, where numeric columns that do not serve as join conditions may be skipped; and to query the metadata database for each column, where, if the value-domain set matches the data of an existing column, the new model column references the existing column, and otherwise the column name is normalized and the column is added to the source database as a new column.
8. The device according to claim 6 or 7, characterized in that the association query comprises: a join operation between structured data and unstructured data; a cascade of a structured-data SQL query and unstructured-data query processing; and a union operation on a structured-data query result and unstructured data.
CN201811599084.XA 2018-12-26 2018-12-26 Data model detection method and device Withdrawn CN109710602A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811599084.XA CN109710602A (en) 2018-12-26 2018-12-26 Data model detection method and device

Publications (1)

Publication Number Publication Date
CN109710602A true CN109710602A (en) 2019-05-03

Family

ID=66258305

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811599084.XA Withdrawn CN109710602A (en) 2018-12-26 2018-12-26 Data model detection method and device

Country Status (1)

Country Link
CN (1) CN109710602A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103425780A (en) * 2013-08-19 2013-12-04 曙光信息产业股份有限公司 Data inquiry method and data inquiry device
CN104142980A (en) * 2014-07-15 2014-11-12 中电科华云信息技术有限公司 Big data-based metadata model management system and method
US20160019299A1 (en) * 2014-07-17 2016-01-21 International Business Machines Corporation Deep semantic search of electronic medical records
CN108595614A (en) * 2018-04-20 2018-09-28 成都智信电子技术有限公司 Tables of data mapping method applied to HIS systems
CN109063178A (en) * 2018-08-22 2018-12-21 四川新网银行股份有限公司 A kind of method and device of the self-service analytical statement extended automatically

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20190503)