CN109766436A

CN109766436A - Method and apparatus for matching fields of a data table with data elements of a knowledge base

Info

Publication number: CN109766436A
Application number: CN201811472910.4A
Authority: CN
Inventors: 张毅然
Original assignee: Beijing Mininglamp Software System Co ltd
Current assignee: Beijing Mininglamp Software System Co ltd
Priority date: 2018-12-04
Filing date: 2018-12-04
Publication date: 2019-05-17

Abstract

The invention discloses a method and apparatus for matching fields of a data table with data elements of a knowledge base. The method comprises: performing word segmentation on a field in the data table and constructing a feature vector of the field; searching a feature vector library of the knowledge base according to the feature vector of the field, performing similarity matching against the feature vectors of the data elements in the feature vector library, and, when the match passes, determining the matched data element. Embodiments of the invention can be applied to preprocessing when a data table is accessed, improving governance efficiency and accuracy.

Description

Method and apparatus for matching fields of a data table with data elements of a knowledge base

Technical field

The present invention relates to the field of databases, and in particular to a method and apparatus for matching fields of a data table with data elements of a knowledge base.

Background art

Data elements can be used to standardize and classify the names, types, and values of an industry's data, and data elements are themselves data. In data governance, whether standardized governance can be achieved directly determines the efficiency and quality of the governance. Data tables from different sources may carry the same business meaning yet have inconsistent field information. Given a raw data table, how to accurately and rapidly find, in an existing knowledge base, the data element corresponding to each field of the table is a critical part of achieving fast and efficient standardized governance.

At present, as data element standards are gradually established across industries, data elements are becoming increasingly important in data integration and can be used to standardize databases and the data items in data tables. Data element standards currently exist mostly in document form. As business systems depend more and more on data, data quality determines whether a business system can run normally. Verifying the data tables of a data source against data elements can safeguard data quality, and establishing a mapping between the standard data elements and the data items of the data tables in the data source is the prerequisite for such verification. During data processing, data tables arrive from many kinds of sources, each with different fields, and the corresponding data elements must be identified from the fields in the tables. The approach generally used today is manual identification, which is relatively accurate but extremely inefficient. Manual identification remains workable when the data volume is small, but becomes impractical when the data volume is large.

Summary of the invention

To solve the above technical problem, the present invention provides a method and apparatus for matching fields of a data table with data elements of a knowledge base, so as to improve the efficiency of that matching.

To achieve the object of the invention, the present invention provides a method for matching fields of a data table with data elements of a knowledge base, comprising:

performing word segmentation on a field in the data table, and constructing a feature vector of the field;

searching a feature vector library of the knowledge base according to the feature vector of the field, performing similarity matching against the feature vectors of the data elements in the feature vector library, and, when the match passes, determining the matched data element.

Optionally, the method further comprises:

obtaining information on the data elements in a standard table, obtaining the corresponding feature vectors according to the information on the data elements, and storing the feature vectors of the data elements in the feature vector library of the knowledge base.

Optionally, performing word segmentation on the field in the data table and constructing the feature vector of the field comprises:

obtaining the field in the data table;

segmenting the field into words and generating word vectors;

generating a feature vector for each word according to the word vectors;

combining the feature vectors of the words to generate the feature vector of the field.

Optionally, searching the feature vector library of the knowledge base according to the feature vector of the field and performing similarity matching against the feature vectors of the data elements in the feature vector library comprises:

computing, in turn, the cosine similarity between the feature vector of the field and the feature vector of each data element in the feature vector library;

determining that the match passes when the computed similarity score is greater than a preset threshold.

Optionally, after searching the feature vector library of the knowledge base according to the feature vector of the field, performing similarity matching against the feature vectors of the data elements in the feature vector library, and determining the matched data element when the match passes, the method further comprises:

sorting the data items that contain the matched data element in descending order of similarity score, and selecting the data item with the highest similarity score for standard alignment with the field in the data table.

The present invention also provides an apparatus for matching fields of a data table with data elements of a knowledge base, comprising:

a field processing module, configured to perform word segmentation on a field in the data table and construct a feature vector of the field;

a matching module, configured to search a feature vector library of the knowledge base according to the feature vector of the field, perform similarity matching against the feature vectors of the data elements in the feature vector library, and, when the match passes, determine the matched data element.

Optionally, the apparatus further comprises:

a feature vector library generation module, configured to obtain information on the data elements in a standard table, obtain the corresponding feature vectors according to the information on the data elements, and store the feature vectors of the data elements in the feature vector library of the knowledge base.

Optionally, the field processing module is configured to:

obtain the field in the data table;

segment the field into words and generate word vectors;

generate a feature vector for each word according to the word vectors;

combine the feature vectors of the words to generate the feature vector of the field.

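Optionally, the matching module is configured to:

compute, in turn, the cosine similarity between the feature vector of the field and the feature vector of each data element in the feature vector library;

determine that the match passes when the computed similarity score is greater than a preset threshold.

Optionally, the matching module is further configured to:

sort the data items that contain the matched data element in descending order of similarity score, and select the data item with the highest similarity score for standard alignment with the field in the data table.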

Embodiments of the present invention comprise: performing word segmentation on a field in a data table and constructing a feature vector of the field; searching a feature vector library of a knowledge base according to the feature vector of the field, performing similarity matching against the feature vectors of the data elements in the feature vector library, and, when the match passes, determining the matched data element. Embodiments of the present invention can be applied to preprocessing when a data table is accessed, improving governance efficiency and accuracy.

Other features and advantages of the present invention will be set forth in the description that follows, and in part will become apparent from the description or be understood by practicing the invention. The objects and other advantages of the invention can be realized and obtained by the structures particularly pointed out in the description, the claims, and the drawings.

Brief description of the drawings

The drawings are provided for a further understanding of the technical solution of the present invention and constitute a part of the specification. Together with the embodiments of this application, they serve to explain the technical solution of the present invention and do not limit it.

Fig. 1 is a flowchart of the method for matching fields of a data table with data elements of a knowledge base according to an embodiment of the present invention;

Fig. 2 is a flowchart of step 101 according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of constructing the feature vector of a field according to an embodiment of the present invention;

Fig. 4 is a flowchart of establishing the feature vector library according to an embodiment of the present invention;

Fig. 5 is a flowchart of the method for matching fields of a data table with data elements of a knowledge base according to an application example of the present invention;

Fig. 6 is a schematic diagram of the apparatus for matching fields of a data table with data elements of a knowledge base according to an embodiment of the present invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the present invention clearer, embodiments of the present invention are described in detail below with reference to the drawings. It should be noted that, where no conflict arises, the embodiments of this application and the features in the embodiments may be combined with one another arbitrarily.

The steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions. Moreover, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order.

Building on the accumulated content of an existing knowledge base, embodiments of the present invention can identify the data elements in the knowledge base from the fields in a data table.

As shown in Fig. 1, the method for matching fields of a data table with data elements of a knowledge base according to an embodiment of the present invention comprises:

Step 101: perform word segmentation on a field in the data table and construct a feature vector of the field.

The data table here refers to a data table that needs to be standardized.

As shown in Fig. 2, step 101 may comprise:

Step 201: obtain the field in the data table.

Step 202: segment the field into words and generate word vectors.

Here, a word vector is generated for each word m ∈ [1, M] and a dictionary is constructed, where M is the number of word categories.

Step 203: generate a feature vector for each word according to the word vectors.

For each word in the field, a feature vector is obtained, where L is the number of words in the field.

Step 204: combine the feature vectors of the words to generate the feature vector of the field.

For each field, a feature vector V = {v₁, v₂, ..., v_M} is obtained, as shown in Fig. 3.
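Steps 201 to 204 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes field names are identifier-like strings such as `sender_name` or `receiverName`, uses a simple regular-expression splitter in place of a real word segmenter (Chinese field names would need a dedicated segmenter), and reads "word vector" as a one-hot vector over a dictionary of M words, so that the field vector is the sum of its words' one-hot vectors. The function names `segment`, `build_dictionary`, and `field_vector` are hypothetical.

```python
import re
from collections import Counter


def segment(field_name):
    """Split a field name into lowercase words (hypothetical segmenter)."""
    # Insert a space at lower-to-upper camelCase boundaries,
    # then split on any run of non-alphanumeric characters.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", field_name)
    return [w.lower() for w in re.split(r"[^A-Za-z0-9]+", spaced) if w]


def build_dictionary(field_names):
    """Assign each distinct word an index; M is the dictionary size."""
    words = sorted({w for name in field_names for w in segment(name)})
    return {w: i for i, w in enumerate(words)}


def field_vector(field_name, dictionary):
    """Sum the one-hot vectors of the field's words into a length-M vector."""
    vec = [0] * len(dictionary)
    for word, count in Counter(segment(field_name)).items():
        if word in dictionary:  # words outside the dictionary are ignored (assumption)
            vec[dictionary[word]] += count
    return vec


dictionary = build_dictionary(["sender_name", "receiverName", "send_time"])
print(field_vector("SenderName", dictionary))
```

With this reading, two fields that share most of their words end up with vectors pointing in similar directions, which is what the later cosine comparison relies on.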

In one embodiment, the method further comprises establishing the feature vector library, which, as shown in Fig. 4, includes the following steps:

Step 301: obtain information on the data elements in a standard table.

Step 302: obtain the corresponding feature vectors according to the information on the data elements.

The feature vector of a data element may be generated in the same way as the feature vector of a field.

Step 303: store the feature vectors of the data elements in the feature vector library of the knowledge base.
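Steps 301 to 303 might look like the sketch below. It assumes the standard table is available as a list of records with hypothetical keys `id` and `name`, that the "information" used for vectorization is the data element's name, and that the library is an in-memory dictionary; a real knowledge base would persist the vectors in its own storage.

```python
def tokens(text):
    """Lowercase and split a name on underscores and whitespace (assumed segmenter)."""
    return [w for w in text.lower().replace("_", " ").split() if w]


def build_vector_library(standard_table):
    """Return (word index, library) where library maps data element id -> vector."""
    # Step 301: read the data element information from the standard table.
    vocab = sorted({w for row in standard_table for w in tokens(row["name"])})
    index = {w: i for i, w in enumerate(vocab)}
    library = {}
    for row in standard_table:
        # Step 302: build the feature vector the same way as for a field.
        vec = [0] * len(index)
        for w in tokens(row["name"]):
            vec[index[w]] += 1
        # Step 303: store the vector in the library.
        library[row["id"]] = vec
    return index, library


standard_table = [
    {"id": "DE001", "name": "name"},
    {"id": "DE002", "name": "send time"},
]
index, library = build_vector_library(standard_table)
```

Fields must be vectorized against the same word index as the library, otherwise the vector positions do not line up and the similarity scores are meaningless.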

Step 102: search the feature vector library of the knowledge base according to the feature vector of the field, perform similarity matching against the feature vectors of the data elements in the feature vector library, and, when the match passes, determine the matched data element.

In this step, the cosine similarity between the feature vector of the field and the feature vector of each data element in the feature vector library is computed in turn; when the computed similarity score is greater than a preset threshold, the match is determined to pass.

The similarity score, score, is the cosine similarity between V = {v₁, v₂, ..., v_M}, the feature vector of the field, and the feature vector of the data element.
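For reference, the standard definition of cosine similarity can be written out as below. The symbol U = {u₁, u₂, ..., u_M} for the data element's feature vector is an assumption introduced here for illustration; V and M are as defined in the text.

```latex
\mathrm{score}(V, U)
  = \frac{V \cdot U}{\lVert V \rVert \, \lVert U \rVert}
  = \frac{\sum_{m=1}^{M} v_m u_m}
         {\sqrt{\sum_{m=1}^{M} v_m^{2}} \; \sqrt{\sum_{m=1}^{M} u_m^{2}}}
```

For vectors with non-negative components, such as word counts, the score lies in [0, 1], and a score of 1 means the two vectors point in the same direction.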

In embodiments of the present invention, data elements are split and clustered during standardization, and the fields of the data source's data tables can be compared with the data elements of the knowledge base using cosine similarity, so that the accurate data element is identified. Compared with the traditional approach, this is more efficient and more intelligent.

In one embodiment, after step 102, the method may further comprise: sorting the data items that contain the matched data element in descending order of similarity score, and selecting the data item with the highest similarity score for standard alignment with the field in the data table.

A data item contains a data element and may also contain a qualifier. For example, in the data item sender_name, name is the data element and sender is the qualifier.

When selecting the data item with the highest similarity score for standard alignment with the field in the data table, the one or more data items with the highest similarity scores may be recommended, and a suitable data item may then be chosen from among them for standard alignment with the field in the data table.
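The ranking and recommendation described above can be sketched as follows. This is an illustration only: the list-of-pairs input format, the `top_k` parameter, and the convention that the data element is the last underscore-separated part of a data item (taken from the sender_name example) are assumptions.

```python
def rank_data_items(scored_items, top_k=1):
    """Sort (data_item, score) pairs by score, highest first, and keep top_k."""
    ranked = sorted(scored_items, key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


def split_data_item(data_item):
    """Split 'qualifier_element' into (qualifier, data_element).

    The qualifier is an empty string when the data item has no qualifier.
    """
    qualifier, _, element = data_item.rpartition("_")
    return qualifier, element


candidates = [("sender_name", 0.82), ("receiver_name", 0.64), ("name", 0.71)]
recommended = rank_data_items(candidates, top_k=2)
```

Returning more than one candidate matches the text's option of recommending several data items and leaving the final choice to a later selection step.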

Through embodiments of the present invention, in ongoing data governance tasks, the standard data elements in the knowledge base can be identified from the fields of the data tables in a data source, enabling fast standard alignment in standardized data governance. Embodiments of the present invention can be applied to preprocessing when a data table is accessed, improving governance efficiency and accuracy.

An application example is described below.

As shown in Fig. 5, the example includes the following steps:

Step 401: obtain a field of a data table.

Step 402: generate the feature vector of the field.

The feature vector of the field may be generated as described with reference to Fig. 2.

Step 403: obtain a feature vector from the feature vector library.

Step 404: determine whether matching is complete; if not, go to step 405; if so, go to step 408.

Matching is considered complete once every feature vector in the feature vector library has been matched, in turn, against the feature vector of the field.

Step 405: compute the similarity of the two feature vectors.

Step 406: determine whether the similarity is greater than the preset threshold; if so, go to step 407; if not, return to step 403.

Step 407: record the data item corresponding to the data element, and return to step 403.

Step 408: output the matching result. If every similarity score obtained from the similarity calculation is less than or equal to the preset threshold, the matching result is a failure and the field is classified as unmatched. If at least one similarity score is greater than the preset threshold, the matching result is a success, the field is classified as matched, and the one or more data items with the highest similarity scores are output.
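The loop of steps 403 to 408 can be sketched as follows, assuming the field vector from step 402 is already available. The threshold value 0.8, the `top_k` parameter, the list-of-pairs library layout, and the shape of the returned result are all assumptions made for illustration; the source specifies only "a preset threshold".

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_field(field_vec, library, threshold=0.8, top_k=1):
    """library: list of (data_item, vector) pairs from the feature vector library."""
    hits = []
    for data_item, vec in library:            # steps 403-404: iterate until exhausted
        score = cosine(field_vec, vec)        # step 405
        if score > threshold:                 # step 406
            hits.append((data_item, score))   # step 407
    # Step 408: report failure, or the best-scoring data items.
    if not hits:
        return {"matched": False, "items": []}
    hits.sort(key=lambda h: h[1], reverse=True)
    return {"matched": True, "items": hits[:top_k]}


library = [("sender_name", [1, 1, 0]), ("send_time", [0, 0, 1])]
result = match_field([1, 1, 0], library)
```

This linear scan compares the field against every library vector, as the flowchart does; for a large library an index over the vectors could replace the scan, but that is outside what the source describes.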

As shown in Fig. 6, an embodiment of the present invention also provides an apparatus for matching fields of a data table with data elements of a knowledge base, comprising:

a field processing module 51, configured to perform word segmentation on a field in the data table and construct a feature vector of the field;

a matching module 52, configured to search a feature vector library of the knowledge base according to the feature vector of the field, perform similarity matching against the feature vectors of the data elements in the feature vector library, and, when the match passes, determine the matched data element.

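In one embodiment, the apparatus further comprises:

a feature vector library generation module, configured to obtain information on the data elements in a standard table, obtain the corresponding feature vectors according to the information on the data elements, and store the feature vectors of the data elements in the feature vector library of the knowledge base.

In one embodiment, the field processing module 51 is configured to:

obtain the field in the data table;

segment the field into words and generate word vectors;

generate a feature vector for each word according to the word vectors;

combine the feature vectors of the words to generate the feature vector of the field.

In one embodiment, the matching module 52 is configured to:

compute, in turn, the cosine similarity between the feature vector of the field and the feature vector of each data element in the feature vector library;

determine that the match passes when the computed similarity score is greater than a preset threshold.

In one embodiment, the matching module 52 is further configured to:

sort the data items that contain the matched data element in descending order of similarity score, and select the data item with the highest similarity score for standard alignment with the field in the data table.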

Embodiments of the present invention can be applied to preprocessing when a data table is accessed, improving governance efficiency and accuracy.
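One way to picture the division of labor between the two modules is the sketch below. The class names mirror the module names in the text, but the constructor parameters, the injected `vectorize` and `similarity` callables, and the choice to return only the single best match are assumptions for illustration, not the structure of the claimed apparatus.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]


@dataclass
class FieldProcessingModule:
    """Turns a field name into a feature vector (module 51 in the text)."""
    vectorize: Callable[[str], Vector]

    def process(self, field_name: str) -> Vector:
        return self.vectorize(field_name)


@dataclass
class MatchingModule:
    """Compares a field vector against the feature vector library (module 52)."""
    library: Dict[str, Vector]
    similarity: Callable[[Vector, Vector], float]
    threshold: float

    def match(self, field_vec: Vector) -> Optional[Tuple[str, float]]:
        best = None
        for element, vec in self.library.items():
            score = self.similarity(field_vec, vec)
            # Keep the highest score that exceeds the preset threshold.
            if score > self.threshold and (best is None or score > best[1]):
                best = (element, score)
        return best
```

Passing the vectorizer and the similarity function in as callables keeps the two modules independent, so either can be replaced without touching the other.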

An embodiment of the present invention also provides a device for matching fields of a data table with data elements of a knowledge base, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the above method for matching fields of a data table with data elements of a knowledge base.

An embodiment of the present invention also provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the above method for matching fields of a data table with data elements of a knowledge base.

Those of ordinary skill in the art will appreciate that all or some of the steps of the methods disclosed above, and the functional modules/units of the systems and apparatuses, may be implemented as software, firmware, hardware, or suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to a division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or a microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically contain computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

Claims

1. A method for matching fields of a data table with data elements of a knowledge base, comprising:

performing word segmentation on a field in the data table, and constructing a feature vector of the field;

searching a feature vector library of the knowledge base according to the feature vector of the field, performing similarity matching against the feature vectors of the data elements in the feature vector library, and, when the match passes, determining the matched data element.

2. The method according to claim 1, characterized in that the method further comprises:

obtaining information on the data elements in a standard table, obtaining the corresponding feature vectors according to the information on the data elements, and storing the feature vectors of the data elements in the feature vector library of the knowledge base.

3. The method according to claim 1, characterized in that performing word segmentation on the field in the data table and constructing the feature vector of the field comprises:

obtaining the field in the data table;

segmenting the field into words and generating word vectors;

generating a feature vector for each word according to the word vectors;

combining the feature vectors of the words to generate the feature vector of the field.

4. The method according to claim 1, characterized in that searching the feature vector library of the knowledge base according to the feature vector of the field and performing similarity matching against the feature vectors of the data elements in the feature vector library comprises:

computing, in turn, the cosine similarity between the feature vector of the field and the feature vector of each data element in the feature vector library;

determining that the match passes when the computed similarity score is greater than a preset threshold.

5. The method according to claim 4, characterized in that, after searching the feature vector library of the knowledge base according to the feature vector of the field, performing similarity matching against the feature vectors of the data elements in the feature vector library, and determining the matched data element when the match passes, the method further comprises:

sorting the data items that contain the matched data element in descending order of similarity score, and selecting the data item with the highest similarity score for standard alignment with the field in the data table.

6. An apparatus for matching fields of a data table with data elements of a knowledge base, characterized by comprising:

a field processing module, configured to perform word segmentation on a field in the data table and construct a feature vector of the field;

a matching module, configured to search a feature vector library of the knowledge base according to the feature vector of the field, perform similarity matching against the feature vectors of the data elements in the feature vector library, and, when the match passes, determine the matched data element.

7. The apparatus according to claim 6, characterized in that the apparatus further comprises:

a feature vector library generation module, configured to obtain information on the data elements in a standard table, obtain the corresponding feature vectors according to the information on the data elements, and store the feature vectors of the data elements in the feature vector library of the knowledge base.

8. The apparatus according to claim 6, characterized in that the field processing module is configured to:

obtain the field in the data table;

segment the field into words and generate word vectors;

generate a feature vector for each word according to the word vectors;

combine the feature vectors of the words to generate the feature vector of the field.

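9. The apparatus according to claim 6, characterized in that the matching module is configured to:

compute, in turn, the cosine similarity between the feature vector of the field and the feature vector of each data element in the feature vector library;

determine that the match passes when the computed similarity score is greater than a preset threshold.

10. The apparatus according to claim 9, characterized in that the matching module is further configured to:

sort the data items that contain the matched data element in descending order of similarity score, and select the data item with the highest similarity score for standard alignment with the field in the data table.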