CN109710602A - Data model detection method and device - Google Patents
- Publication number
- CN109710602A (application CN201811599084.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- column
- metadatabase
- unstructured
- structural
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a data model detection method and device. The method includes: extracting the schema information carried in the data; performing column filtering on the extracted data to filter out columns irrelevant to association; performing value aggregation on each column; comparing each column's data with the column fields already in the metadata database to find the column in the metadata database with the highest matching degree and, if the matching degree exceeds a set threshold, deeming it an overlapping field and defining the metadata column as the existing column; and updating the metadata database by adding the new columns. The invention can improve the success rate of automatic data model detection and reduce manual intervention.
Description
Technical field
The present invention relates to the field of big data technology, and in particular to a data model detection method and device.
Background
Many data sources store complex data. From social network databases and e-commerce databases to human genome databases, all are built on complex, multidimensional data sets stored in large volumes. A major challenge in processing such data sets is how to discover the implicit data structures and data associations in the mass of data and ultimately extract meaningful data. Typically, analysts use various analysis tools to help extract part of the meaningful data. However, modeling and presenting the data in a complex data source with existing analysis tools requires continuous human-computer interaction. The user must be very familiar with the characteristics of the complex data set and must give the computer explicit instructions so that it invokes the corresponding algorithm to complete the modeling. In many cases this interaction has to be repeated many times. When the data to be processed is measured in the trillions, this way of processing data is extremely complex and cumbersome. To meet this need, automatic data modeling methods and visualization tools have appeared on the market.
However, existing automatic data modeling only probes the data of a new data source. Although the probe result can be applied effectively to that single data source, association queries across multiple data sources still face considerable obstacles: cumbersome manual intervention is needed to align the association fields. This work creates no new value and is highly error-prone. For data that is used repeatedly, it is very inefficient.
As data services develop, a single business application scenario usually involves both structured and unstructured data, which frequently need to be processed together. Structured data is row data stored in a database that can be logically expressed with a two-dimensional table structure. Data that is difficult to represent with two-dimensional logical database tables is called unstructured data; it includes office documents in all formats, text, pictures, XML, HTML, reports of all kinds, images, and audio/video information.
In prior-art data processing, structured data can be stored directly in a relational database, where queries, filtering, and computation on it are carried out. Unstructured data is processed in batches, including querying, filtering, and computation, using big data frameworks such as Hadoop, MapReduce, and Spark. Association queries between structured and unstructured data are possible in the prior art: the unstructured data is modeled in advance and then queried in association. However, existing automatic modeling techniques for unstructured data have considerable limitations: the identified model may not exactly match the structured data model, which causes the query to fail. The root cause is that the models of different data sources are defined by different people, so essentially identical data may be named in many ways, or even in different languages. For example, a field called "Identity" in one data source may be called "Social ID" in another. The two hold essentially the same information and could serve as the key of an association query, but only a user who knows this can apply it correctly.
Summary of the invention
The data model detection method and device provided by the present invention can improve the success rate of automatic data model detection and reduce manual intervention.
In a first aspect, the present invention provides a data model detection method, comprising:
Extracting the schema information carried in the data;
Performing column filtering on the extracted data to filter out columns irrelevant to association;
Performing value aggregation on each column;
Comparing each column's data with the column fields already in the metadata database and finding the column in the metadata database with the highest matching degree; if the matching degree exceeds a set threshold, the column is deemed an overlapping field and the metadata column is defined as the existing column;
Updating the metadata database by adding the new columns.
Optionally, after updating the metadata database and adding the new columns, the method further comprises:
Obtaining an association query request and decomposing it into multiple sub-query requests;
When the sub-query requests include a query request on an unstructured data component, invoking the parsing method corresponding to that component to parse it, obtaining data with a schema;
Performing an association query on the data with a schema and the structured data to obtain the result set of the association query.
Optionally, performing the association query on the data with a schema and the structured data to obtain the result set of the association query comprises:
Aggregating each column of the data with a schema to obtain a value-domain set, where numeric columns are not used as join conditions and can be skipped; and querying the metadata database for each column: if the value-domain set matches the data of an existing column, the new model column references the existing column; otherwise the column name is normalized and the column is added to the source database as a new column.
Optionally, the association query comprises: a join operation on structured data and unstructured data, a cascade of a structured-data SQL query and unstructured-data query processing, and a union operation on a structured-data query result and unstructured data.
In a second aspect, the present invention provides a data model detection device, comprising:
An extraction unit, configured to extract the schema information carried in the data;
A filtering unit, configured to perform column filtering on the extracted data to filter out columns irrelevant to association;
An aggregation unit, configured to perform value aggregation on each column;
A comparison unit, configured to compare each column's data with the column fields already in the metadata database and find the column in the metadata database with the highest matching degree; if the matching degree exceeds a set threshold, the column is deemed an overlapping field and the metadata column is defined as the existing column;
An update unit, configured to update the metadata database by adding the new columns.
Optionally, the device further comprises:
An acquisition unit, configured to obtain an association query request after the update unit has updated the metadata database and added the new columns, and to decompose the request into multiple sub-query requests;
An invocation unit, configured to, when the sub-query requests include a query request on an unstructured data component, invoke the parsing method corresponding to that component to parse it, obtaining data with a schema;
A query unit, configured to perform an association query on the data with a schema and the structured data to obtain the result set of the association query.
Optionally, the query unit is configured to aggregate each column of the data with a schema to obtain a value-domain set, where numeric columns are not used as join conditions and can be skipped; and to query the metadata database for each column: if the value-domain set matches the data of an existing column, the new model column references the existing column; otherwise the column name is normalized and the column is added to the source database as a new column.
Optionally, the association query comprises: a join operation on structured data and unstructured data, a cascade of a structured-data SQL query and unstructured-data query processing, and a union operation on a structured-data query result and unstructured data.
The data model detection method and device provided by the embodiments of the present invention parse unstructured data independently to obtain data with a schema. No manual intervention is needed: the data parsing method set when the data was defined is invoked automatically to parse the unstructured data, enabling accurate association queries between unstructured and structured data. This improves the degree of match between the data model identified for an unstructured data source and the existing models, substantially raises the success rate of automatic data model detection, and reduces manual intervention. Introducing the metadata database persists the data model detection results and can greatly reduce the cost of repeated queries.
Brief description of the drawings
Fig. 1 is a flowchart of the data model detection method provided by an embodiment of the present invention;
Fig. 2 is a structural diagram of a data model detection device according to one embodiment of the present invention;
Fig. 3 is a structural diagram of a data model detection device according to another embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
An embodiment of the present invention provides a data model detection method. As shown in Fig. 1, the method comprises:
S11: Extract the schema information carried in the data.
S12: Perform column filtering on the extracted data to filter out columns irrelevant to association.
The filtered-out columns include floating-point numeric columns.
S13: Perform value aggregation on each column.
One way to perform the value aggregation is sorting and de-duplication.
S14: Compare each column's data with the column fields already in the metadata database and find the column in the metadata database with the highest matching degree; if the matching degree exceeds a set threshold, the column is deemed an overlapping field and the metadata column is defined as the existing column.
S15: Update the metadata database by adding the new columns.
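Steps S11 to S15 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the text does not define how the matching degree is computed, so the sketch assumes Jaccard similarity between value sets; the metadata database is modeled as an in-memory dictionary; and the names `filter_columns`, `aggregate`, `match_degree`, and `detect`, as well as the default threshold of 0.5, are hypothetical.

```python
from typing import Dict, List, Set

Row = Dict[str, object]


def filter_columns(rows: List[Row]) -> List[str]:
    """S11/S12: read the column names and keep possible association keys.

    Assumption: a column is dropped as soon as it holds any float value.
    """
    columns = list(rows[0].keys()) if rows else []
    return [c for c in columns
            if not any(isinstance(r.get(c), float) for r in rows)]


def aggregate(rows: List[Row], column: str) -> List[str]:
    """S13: value aggregation by de-duplicating and sorting the values."""
    return sorted({str(r[column]) for r in rows if r.get(column) is not None})


def match_degree(values: List[str], known: Set[str]) -> float:
    """Assumed measure: Jaccard similarity between the two value sets."""
    candidate = set(values)
    union = candidate | known
    return len(candidate & known) / len(union) if union else 0.0


def detect(rows: List[Row], metadata: Dict[str, Set[str]],
           threshold: float = 0.5) -> Dict[str, str]:
    """S14 and S15: map each source column to a metadata column."""
    mapping: Dict[str, str] = {}
    new_columns: Dict[str, Set[str]] = {}
    for column in filter_columns(rows):
        values = aggregate(rows, column)
        best_name, best_score = None, 0.0
        for name, known in metadata.items():
            score = match_degree(values, known)
            if score > best_score:
                best_name, best_score = name, score
        if best_name is not None and best_score > threshold:
            mapping[column] = best_name        # overlapping field: reuse existing column
        else:
            mapping[column] = column
            new_columns[column] = set(values)  # remembered for S15
    metadata.update(new_columns)               # S15: add the new columns
    return mapping
```

With a metadata entry "Identity" holding the values A1 to A4, a source column "Social ID" holding A1 to A3 would be mapped to "Identity" even though the names differ, because the comparison is made on values rather than on names.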
The data schema referred to in the embodiments of the present invention is an explicit description of the data. A database stores data according to a schema; it is precisely because a data schema exists that complex data structures can be built to establish the internal links and complex relationships among data, forming the global structural schema of the data. A data schema characterizes the "type" aspect of data on the basis of a chosen data model, while the corresponding "instance" describes the "value" aspect. A data model must come first before the corresponding data schema can be discussed; once a data schema exists, the corresponding instances can be obtained according to it. Data that has definite fields and types has a data schema and is called structured data; data without one is schemaless, that is, unstructured data, such as logs, blogs, web pages, data stored in NoSQL (non-relational) databases, and media files such as pictures, video, and audio.
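The distinction between a schema (the "type" side) and an instance (the "value" side) can be illustrated with a short Python sketch. The `Schema` class and its `conforms` method are hypothetical names introduced only for this illustration; the patent does not prescribe any particular representation.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Schema:
    """The "type" side: field names and the Python type expected for each."""
    fields: Dict[str, type]

    def conforms(self, instance: Dict[str, object]) -> bool:
        """True if the instance (the "value" side) has exactly these typed fields."""
        return (set(instance) == set(self.fields)
                and all(isinstance(instance[name], expected)
                        for name, expected in self.fields.items()))
```

A record with definite fields and types conforms to such a schema and counts as structured data, whereas a raw log line wrapped in a single free-text field does not.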
The association query in this embodiment is not limited to a join of two two-dimensional tables in a relational database. It refers to join operations, union operations, cascade operations, and the like between structured and unstructured data. Structured and unstructured data are treated as equal data objects, and operations on both kinds of object are merged into a unified operation.
The association queries proposed here mainly include, but are not limited to, the following three kinds: a join of structured data with unstructured data; a cascade of a structured-data SQL query and unstructured-data query processing; and a union of a structured-data query result with unstructured data.
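The three kinds of association query can be sketched over plain Python records. This is a simplified illustration, assuming the unstructured data has already been parsed into rows; the function names are hypothetical, and the reading of "cascade" as a structured query whose result drives the unstructured query is an interpretation, since the text does not spell out the direction.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def join(structured: List[Row], parsed: List[Row], key: str) -> List[Row]:
    """Join: pair structured rows with parsed unstructured rows on a shared key."""
    index: Dict[object, List[Row]] = {}
    for row in parsed:
        index.setdefault(row[key], []).append(row)
    return [{**s, **p} for s in structured for p in index.get(s[key], [])]


def cascade(structured: List[Row], predicate: Callable[[Row], bool],
            lookup_unstructured: Callable[[object], List[Row]],
            key: str) -> List[Row]:
    """Cascade (assumed direction): the structured query runs first and its
    result keys feed the unstructured query."""
    selected_keys = [row[key] for row in structured if predicate(row)]
    return [record for k in selected_keys for record in lookup_unstructured(k)]


def union(structured_result: List[Row], parsed: List[Row],
          columns: List[str]) -> List[Row]:
    """Union: stack both inputs after projecting them onto common columns.

    Duplicates are kept, as with SQL UNION ALL.
    """
    def project(row: Row) -> Row:
        return {c: row.get(c) for c in columns}
    return [project(r) for r in structured_result] + [project(r) for r in parsed]
```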
To help those skilled in the art understand the association query of the present invention more clearly, the following association query scenario is provided. Implementations are not limited to this scenario, and this embodiment places no specific limitation on it.
Association query scenario one:
Structured data and unstructured data undergo a join operation. This kind of application is further divided into two cases.
Case one: the structured data and the unstructured data belong to the same object, share a potentially identical association field, and correspond one to one.
This scenario is illustrated below using hospital medical records as an example:
Each patient's medical record contains structured data, including age, gender, medical record number, medical history, date of the most recent visit, and description of the illness. It also contains unstructured data, including laboratory reports, CT films, electrocardiograms, and waveform charts. Analyzing patient history data is very helpful for follow-up treatment, illness analysis, and analysis of pathogenic factors. For example, to analyze patients whose electrocardiograms fluctuate strongly, one must search for the characteristics of such patients. This first requires an unstructured data analysis tool that identifies the strongly fluctuating electrocardiograms; the other information of the patients corresponding to those images is then output, and the patients are analyzed on that basis. This process uses unstructured data (electrocardiograms) to index structured data (patient information).
The heterogeneous data sources in this scenario, comprising structured and unstructured data, correspond to each other. The path of the unstructured data may store a field such as the patient's "patient ID", while the structured data stores an association column called "medical record number". An association column is a column present in both components (structured and unstructured), and the two parts of the data should be associated through it. In this scenario, the "patient ID" value in the unstructured data and the "medical record number" in the structured data are potentially identical association columns. Because the names differ, the data analyst would have to remember that the two columns are the same. The approach of the present invention helps model the unstructured data more accurately and establishes metadata using the same standard as the structured data source, thereby simplifying the query process.
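The medical record scenario can be sketched as follows. The directory layout `patient_<ID>/`, the column name `medical_record_no`, and the `ALIASES` table are all hypothetical; they stand in for whatever the metadata database would record once "patient ID" has been recognized as the same field as "medical record number".

```python
import re
from typing import Dict, List, Optional

# Hypothetical layout: the patient ID is embedded in the file path.
PATH_PATTERN = re.compile(r"patient_(?P<patient_id>[A-Za-z0-9]+)/")

# Assumed metadata entry: the unstructured field name maps to the existing column.
ALIASES = {"patient_id": "medical_record_no"}


def parse_path(path: str) -> Optional[Dict[str, str]]:
    """Extract the association value from an unstructured file path."""
    match = PATH_PATTERN.search(path)
    if match is None:
        return None
    canonical = ALIASES["patient_id"]
    return {canonical: match.group("patient_id"), "path": path}


def patients_for(paths: List[str], patients: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Use unstructured files (e.g. electrocardiograms) to index patient rows."""
    by_record = {p["medical_record_no"]: p for p in patients}
    result = []
    for path in paths:
        parsed = parse_path(path)
        if parsed and parsed["medical_record_no"] in by_record:
            result.append({**by_record[parsed["medical_record_no"]], "path": path})
    return result
```

Because the alias is stored once, the analyst no longer has to remember that the two differently named columns carry the same values.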
The automatic intelligent modeling method for unstructured data provided by the embodiments of the present invention can be used for association queries between structured and unstructured data. The process comprises:
1. Obtain an association query request and decompose it into multiple sub-query requests.
2. When the sub-query requests include a query request on an unstructured data component, invoke the parsing method corresponding to that component to parse it, obtaining data with a schema.
3. Perform an association query on the data with a schema and the structured data to obtain the result set of the association query.
Specifically, each column of the data with a schema is aggregated to obtain a value-domain set; numeric columns are not used as join conditions and can be skipped. The metadata database is queried for each column: if the value-domain set matches the data of an existing column, the new model column references the existing column; otherwise the column name is normalized and the column is added to the source database as a new column.
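The column resolution just described can be sketched as follows. Several points are assumptions made for the illustration: a value-domain set is taken to "match" when it is a subset of an existing column's values; normalization is taken to mean lower-casing and replacing runs of non-alphanumeric characters with underscores; and the new column is written into the same in-memory `metadata` dictionary that is queried. The function names are hypothetical.

```python
import re
from typing import Dict, List, Set


def normalize_name(name: str) -> str:
    """Assumed normalization: lower-case, non-alphanumeric runs become '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def is_numeric_column(values: List[object]) -> bool:
    """Numeric columns are not used as join conditions and are skipped."""
    return all(isinstance(v, (int, float)) and not isinstance(v, bool)
               for v in values)


def resolve_columns(table: Dict[str, List[object]],
                    metadata: Dict[str, Set[str]]) -> Dict[str, str]:
    """Map each non-numeric column to an existing or newly added column."""
    resolved: Dict[str, str] = {}
    for name, values in table.items():
        if is_numeric_column(values):
            continue
        domain = {str(v) for v in values}          # value-domain set
        existing = next((known_name for known_name, known in metadata.items()
                         if domain <= known), None)
        if existing is not None:
            resolved[name] = existing              # reference the existing column
        else:
            new_name = normalize_name(name)
            metadata[new_name] = domain            # add as a new column
            resolved[name] = new_name
    return resolved
```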
In this embodiment, data with a schema means the data obtained by parsing an unstructured data component according to a defined schema; such data has that defined schema.
The association query on the data with a schema and the structured data is performed on the same storage platform as the structured data, yielding the result set of the association query.
The present invention parses unstructured data independently to obtain data with a schema. No manual intervention is needed: the data parsing method set when the data was defined is invoked automatically to parse the unstructured data, enabling accurate association queries between unstructured and structured data. The present invention improves the degree of match between the data model identified for an unstructured data source and the existing models, substantially raises the success rate of automatic data model detection, and reduces manual intervention. Introducing the metadata database persists the data model detection results and can greatly reduce the cost of repeated queries.
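Running the association query on the same storage platform as the structured data can be sketched with Python's built-in `sqlite3` module. The sketch assumes SQLite as the storage platform, a hypothetical `patient` table, a hypothetical comma-separated log format, and a hypothetical `PARSERS` registry that maps an unstructured component to its parsing method. The decomposition into sub-queries is hard-coded here rather than derived from a request.

```python
import sqlite3
from typing import Callable, Dict, List, Tuple


def parse_ecg_line(line: str) -> Tuple[str, str]:
    """Hypothetical parsing method: 'record_no,fluctuation' per line."""
    record_no, fluctuation = line.strip().split(",")
    return record_no, fluctuation


# Hypothetical registry: unstructured component name -> parsing method.
PARSERS: Dict[str, Callable[[str], Tuple[str, str]]] = {"ecg_log": parse_ecg_line}


def run_association_query(conn: sqlite3.Connection, component: str,
                          raw_lines: List[str]) -> List[tuple]:
    # Sub-query on the unstructured component: parse it into data with a schema.
    parser = PARSERS[component]
    parsed = [parser(line) for line in raw_lines]
    # Load the parsed rows into the platform that holds the structured data.
    conn.execute("CREATE TEMP TABLE parsed_ecg "
                 "(medical_record_no TEXT, fluctuation TEXT)")
    conn.executemany("INSERT INTO parsed_ecg VALUES (?, ?)", parsed)
    # Sub-query on the structured data, associated through the shared column.
    cursor = conn.execute(
        "SELECT p.medical_record_no, p.age, e.fluctuation "
        "FROM patient AS p "
        "JOIN parsed_ecg AS e ON p.medical_record_no = e.medical_record_no "
        "WHERE e.fluctuation = 'high' "
        "ORDER BY p.medical_record_no")
    return cursor.fetchall()
```

Once the parsed rows sit in the same database, the association itself is an ordinary SQL join, which is the practical benefit of giving the unstructured data a schema first.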
An embodiment of the present invention also provides a data model detection device. As shown in Fig. 2, the device comprises:
An extraction unit 11, configured to extract the schema information carried in the data;
A filtering unit 12, configured to perform column filtering on the extracted data to filter out columns irrelevant to association;
An aggregation unit 13, configured to perform value aggregation on each column;
A comparison unit 14, configured to compare each column's data with the column fields already in the metadata database and find the column in the metadata database with the highest matching degree; if the matching degree exceeds a set threshold, the column is deemed an overlapping field and the metadata column is defined as the existing column;
An update unit 15, configured to update the metadata database by adding the new columns.
Optionally, as shown in Fig. 3, the device further comprises:
An acquisition unit 16, configured to obtain an association query request after the update unit 15 has updated the metadata database and added the new columns, and to decompose the request into multiple sub-query requests;
An invocation unit 17, configured to, when the sub-query requests include a query request on an unstructured data component, invoke the parsing method corresponding to that component to parse it, obtaining data with a schema;
A query unit 18, configured to perform an association query on the data with a schema and the structured data to obtain the result set of the association query.
Optionally, the query unit 18 is configured to aggregate each column of the data with a schema to obtain a value-domain set, where numeric columns are not used as join conditions and can be skipped; and to query the metadata database for each column: if the value-domain set matches the data of an existing column, the new model column references the existing column; otherwise the column name is normalized and the column is added to the source database as a new column.
Optionally, the association query comprises: a join operation on structured data and unstructured data, a cascade of a structured-data SQL query and unstructured-data query processing, and a union operation on a structured-data query result and unstructured data.
The data model detection device provided by the embodiments of the present invention parses unstructured data independently to obtain data with a schema. No manual intervention is needed: the data parsing method set when the data was defined is invoked automatically to parse the unstructured data, enabling accurate association queries between unstructured and structured data. This improves the degree of match between the data model identified for an unstructured data source and the existing models, substantially raises the success rate of automatic data model detection, and reduces manual intervention. Introducing the metadata database persists the data model detection results and can greatly reduce the cost of repeated queries.
A person of ordinary skill in the art will understand that all or part of the processes in the above method embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The above is merely a specific embodiment of the present invention, but the protection scope of the present invention is not limited to it. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. The protection scope of the present invention shall therefore be determined by the protection scope of the claims.
Claims (8)
1. A data model detection method, characterized by comprising:
Extracting the schema information carried in the data;
Performing column filtering on the extracted data to filter out columns irrelevant to association;
Performing value aggregation on each column;
Comparing each column's data with the column fields already in the metadata database and finding the column in the metadata database with the highest matching degree; if the matching degree exceeds a set threshold, the column is deemed an overlapping field and the metadata column is defined as the existing column;
Updating the metadata database by adding the new columns.
2. The method according to claim 1, characterized in that, after updating the metadata database and adding the new columns, the method further comprises:
Obtaining an association query request and decomposing it into multiple sub-query requests;
When the sub-query requests include a query request on an unstructured data component, invoking the parsing method corresponding to that component to parse it, obtaining data with a schema;
Performing an association query on the data with a schema and the structured data to obtain the result set of the association query.
3. The method according to claim 2, characterized in that performing the association query on the data with a schema and the structured data to obtain the result set of the association query comprises:
Aggregating each column of the data with a schema to obtain a value-domain set, where numeric columns are not used as join conditions and can be skipped; and querying the metadata database for each column: if the value-domain set matches the data of an existing column, the new model column references the existing column; otherwise the column name is normalized and the column is added to the source database as a new column.
4. The method according to claim 2 or 3, characterized in that the association query comprises: a join operation on structured data and unstructured data, a cascade of a structured-data SQL query and unstructured-data query processing, and a union operation on a structured-data query result and unstructured data.
5. A data model detection device, characterized by comprising:
An extraction unit, configured to extract the schema information carried in the data;
A filtering unit, configured to perform column filtering on the extracted data to filter out columns irrelevant to association;
An aggregation unit, configured to perform value aggregation on each column;
A comparison unit, configured to compare each column's data with the column fields already in the metadata database and find the column in the metadata database with the highest matching degree; if the matching degree exceeds a set threshold, the column is deemed an overlapping field and the metadata column is defined as the existing column;
An update unit, configured to update the metadata database by adding the new columns.
6. The device according to claim 5, characterized in that the device further comprises:
An acquisition unit, configured to obtain an association query request after the update unit has updated the metadata database and added the new columns, and to decompose the request into multiple sub-query requests;
An invocation unit, configured to, when the sub-query requests include a query request on an unstructured data component, invoke the parsing method corresponding to that component to parse it, obtaining data with a schema;
A query unit, configured to perform an association query on the data with a schema and the structured data to obtain the result set of the association query.
7. The device according to claim 6, characterized in that the query unit is configured to aggregate each column of the data with a schema to obtain a value-domain set, where numeric columns are not used as join conditions and can be skipped; and to query the metadata database for each column: if the value-domain set matches the data of an existing column, the new model column references the existing column; otherwise the column name is normalized and the column is added to the source database as a new column.
8. The device according to claim 6 or 7, characterized in that the association query comprises: a join operation on structured data and unstructured data, a cascade of a structured-data SQL query and unstructured-data query processing, and a union operation on a structured-data query result and unstructured data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811599084.XA CN109710602A (en) | 2018-12-26 | 2018-12-26 | Data model detection method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811599084.XA CN109710602A (en) | 2018-12-26 | 2018-12-26 | Data model detection method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109710602A (en) | 2019-05-03 |
Family ID: 66258305
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811599084.XA Withdrawn CN109710602A (en) | 2018-12-26 | 2018-12-26 | Data model detection method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109710602A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103425780A (en) * | 2013-08-19 | 2013-12-04 | 曙光信息产业股份有限公司 | Data inquiry method and data inquiry device |
CN104142980A (en) * | 2014-07-15 | 2014-11-12 | 中电科华云信息技术有限公司 | Big data-based metadata model management system and method |
US20160019299A1 (en) * | 2014-07-17 | 2016-01-21 | International Business Machines Corporation | Deep semantic search of electronic medical records |
CN108595614A (en) * | 2018-04-20 | 2018-09-28 | 成都智信电子技术有限公司 | Tables of data mapping method applied to HIS systems |
CN109063178A (en) * | 2018-08-22 | 2018-12-21 | 四川新网银行股份有限公司 | A kind of method and device of the self-service analytical statement extended automatically |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | WW01 | Invention patent application withdrawn after publication | Application publication date: 20190503 |