CN113836897A

CN113836897A - Method for aligning multi-source heterogeneous data dictionary

Info

Publication number: CN113836897A
Application number: CN202111108385.XA
Authority: CN
Inventors: 贾少敏; 余增文; 张东
Original assignee: Beijing Institute of Computer Technology and Applications
Current assignee: Beijing Institute of Computer Technology and Applications
Priority date: 2021-09-22
Filing date: 2021-09-22
Publication date: 2021-12-24

Abstract

The invention relates to a method for aligning a multi-source heterogeneous data dictionary and belongs to the field of big data. The method comprises: selecting a source database, a source table, and a source field to serve as the standard; selecting a target database, a target table, and a target field to be aligned; selecting a data value in the source table; directly selecting data values in the target table, or intelligently screening them through algorithms including but not limited to cosine similarity matching, edit distance matching, longitude and latitude distance matching, classification code matching, and time and date matching; if data values in the target table need to be expanded into the source table, enabling expansion and expanding them into the source table; and checking the matching result after matching succeeds. The method is simple to operate, the matching result is clear at a glance, and after the data dictionaries are aligned the specific data values are displayed as uniform data values.

Description

Method for aligning multi-source heterogeneous data dictionary

Technical Field

The invention belongs to the field of big data and relates to a method for aligning a multi-source heterogeneous data dictionary, in particular to a method that uses the standard field values of a table in a source database and the field values of a table in a target database, and matches the data in the target table to the standard field values either by direct selection or by intelligent matching algorithms.

Background

With the development of big data, data aggregation has become an indispensable part of a big data platform. During aggregation, data from sources in different places are merged together, and each source may follow its own data standard. These standards define dictionaries for certain data values, so values with the same meaning are encoded differently from place to place.

For example, for the value representing a person's sex, some places use 0 for male and 1 for female, some use M for male and F for female, and others use yet other value pairs. Because the values differ, unmatched values may be dropped when data are collected, and codes that users cannot interpret may be shown when data are displayed.

In many projects, dictionary values are defined and unified in-house, and field values are changed into standard values according to the project's own standard. This easily alters the data and pollutes the data source.

What is needed is a way to map a field value to a standard value without changing the field value in the source data. The present invention is a technique realized to meet this need.
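As a minimal sketch of this idea, assuming hypothetical source names, row layouts, and an in-memory lookup (none of which come from the patent), the raw values can stay untouched while a separate alignment table supplies the standard value for display:

```python
# Hypothetical example data: two sources that encode sex differently.
source_a_rows = [{"name": "Li", "sex": "0"}, {"name": "Wang", "sex": "1"}]
source_b_rows = [{"name": "Zhao", "sex": "M"}, {"name": "Sun", "sex": "F"}]

# Alignment table: (source name, raw value) -> standard value.
# The raw rows above are never rewritten; only this lookup is maintained.
alignment = {
    ("source_a", "0"): "male",
    ("source_a", "1"): "female",
    ("source_b", "M"): "male",
    ("source_b", "F"): "female",
}


def display_value(source_name, raw_value):
    """Return the standard value for display, or the raw value if unmapped."""
    return alignment.get((source_name, raw_value), raw_value)


standardized = [display_value("source_b", row["sex"]) for row in source_b_rows]
```

Because the lookup is kept apart from the rows, removing or correcting a mapping never requires touching the source data.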

Disclosure of Invention

(I) Technical problem to be solved

The technical problem to be solved by the invention is how to provide a method for aligning a multi-source heterogeneous data dictionary, so as to solve the problem that the existing dictionary value processing method causes data change and pollutes a data source.

(II) Technical solution

In order to solve the technical problem, the invention provides a method for aligning a multi-source heterogeneous data dictionary, which comprises the following steps:

S1, selecting a source database, selecting a source table, and selecting a source field as a standard;

S2, selecting a target database, selecting a target table, and selecting a target field to be aligned;

S3, selecting the data value of the source field in the source table;

S4, directly selecting the data value of the target field in the target table or screening the data value of the target field in the target table through an algorithm;

S5, if the data value of the target field in the target table needs to be expanded into the source table, enabling expansion and expanding the data value of the target field in the target table into the source table;

S6, checking the matching result after matching is successful.

Further, the step S1 specifically includes: selecting a source database as a dictionary table, selecting a table in the database, and selecting a specific field in the table.

Further, the step S2 specifically includes: selecting a target database to be aligned, selecting a table in the database, and selecting a specific field in the table.

Further, the step S3 specifically includes: a certain data value of the source field in the source table is selected in turn.

Further, the step S4 specifically includes: directly selecting the data value of the target field in the target table, or setting a threshold value when the data value cannot be directly selected, then calculating the similarity between the data value of the source field and the data value of the target field through several algorithms, and screening out the data value of the target field in the target table which meets the threshold value, wherein the algorithms include but are not limited to cosine similarity matching, edit distance matching, longitude and latitude distance matching, classification code matching and time and date matching.
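The threshold-based screening described in step S4 can be sketched generically as follows. This is only an illustration: the function names and the toy character-set similarity are assumptions made for the example, and any of the matching algorithms listed above could be passed in as the `similarity` argument instead.

```python
from typing import Callable, Iterable, List


def char_set_similarity(a: str, b: str) -> float:
    """Toy similarity (an assumption): Jaccard overlap of lowercase character sets."""
    set_a, set_b = set(a.lower()), set(b.lower())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def screen_target_values(
    source_value: str,
    target_values: Iterable[str],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Keep the target values whose similarity to source_value meets the threshold."""
    return [t for t in target_values if similarity(source_value, t) >= threshold]


selected = screen_target_values(
    "Beijing", ["beijing", "Beijing City", "Shanghai"], char_set_similarity, 0.5
)
```

Passing the similarity function as a parameter keeps the threshold logic in one place while the individual algorithms remain interchangeable.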

Further, the cosine similarity matching algorithm specifically includes the following steps:

firstly, performing word segmentation on a source field and a target field;

secondly, listing all words appearing in the source field and the target field;

thirdly, counting the word frequency appearing in the source field and the target field for all the words in the second step;

fourthly, writing out word frequency vectors of the source field and the target field according to the word frequency;

fifthly, calculating cosine values of included angles of the two word frequency vectors to measure similarity between the two word frequency vectors;

and sixthly, judging whether the cosine value is larger than a threshold value, if so, considering that the value of the source field is similar to the value of the target field, and screening the value of the target field.

Further, if the source field and the target field are coordinate values, longitude and latitude distance matching is adopted: the absolute longitude difference and the absolute latitude difference between the source field and the target field are calculated, and the coordinate values of the target field that differ from the coordinate values of the source field by no more than a set error range are selected.

Further, the classification code matching algorithm comprises: after the strings are split, searching for the characters of the shorter of the source and target field strings in the longer string, and dividing the number of matching characters by the length of the longer string to obtain a percentage, wherein the larger the percentage, the higher the matching degree.

Further, if the source field and the target field are time values, time and date matching is adopted: the time values of the source field and the target field are converted into timestamps, the absolute difference of the two timestamps is calculated, and if the absolute difference is smaller than a threshold, the value of the target field is selected.

Further, the step S5 specifically includes: by enabling expansion, removing from the field values in the target table the data values that already exist in the source table, and expanding the remaining field values into the source table.

(III) Advantageous effects

The invention provides a method for aligning a multi-source heterogeneous data dictionary. The operation is simple, the matching result is clear at a glance, and the specific data values are displayed as uniform data values after the data dictionaries are aligned.

Drawings

FIG. 1 is a flow chart of a multi-source heterogeneous data dictionary alignment method of the present invention.

Detailed Description

In order to make the objects, contents and advantages of the present invention clearer, the following detailed description of the embodiments of the present invention will be made in conjunction with the accompanying drawings and examples.

The invention discloses a method for aligning a multi-source heterogeneous data dictionary, which comprises: (1) selecting a source database, selecting a source table, and selecting a source field as the standard; (2) selecting a target database, selecting a target table, and selecting a target field to be aligned; (3) selecting a data value in the source table; (4) directly selecting a data value in the target table, or intelligently screening the data values in the target table through an algorithm, wherein the algorithm includes but is not limited to: 1) cosine similarity matching; 2) edit distance matching; 3) longitude and latitude distance matching; 4) classification code matching; 5) time and date matching; (5) if the data value in the target table needs to be expanded into the source table, enabling expansion and expanding the data value into the source table; and (6) checking the matching result after matching is successful.

The purpose of the invention is: a method for unifying the mapping values of data values in different data sources by directly matching the data values or intelligently matching the data values through an algorithm and expanding standard data is provided.

In order to achieve the above object, the present invention provides a method for aligning a multi-source heterogeneous data dictionary, which comprises:

S1, selecting a source database, selecting a source table, and selecting a source field as a standard.

S2, selecting a target database, selecting a target table, and selecting a target field to be aligned.

S3, selecting a data value in the source table.

S4, directly selecting the data values in the target table or screening the data values in the target table through an algorithm, wherein the algorithm includes but is not limited to: 1) cosine similarity matching; 2) edit distance matching; 3) longitude and latitude distance matching; 4) classification code matching; 5) time and date matching.

S5, if data values in the target table need to be expanded into the source table, enabling expansion and expanding the data values into the source table.

S6, checking the matching result after matching is successful.

In the process of dictionary alignment, a source database table serving as a dictionary table needs to be selected, then field values in the source database are matched with field values in a target database, and a dictionary pool can be expanded if the field values in the source database are not matched with the field values in the target database. The operation is simple, the matching result is clear at a glance, and the specific data values are displayed as uniform data values after the data dictionaries are aligned.

FIG. 1 is a flow chart of a method for multi-source heterogeneous data dictionary alignment according to the present invention. As shown in fig. 1, the method includes:

(1) selecting a source database, selecting a source table, and selecting a source field as a standard.

In specific implementation, a source database as a dictionary table is selected, a table in the database is selected, and a specific field in the table is selected.

(2) Selecting a target database, selecting a target table, and selecting a target field needing to be aligned.

In specific implementation, a target database to be aligned is selected, a table in the database is selected, and a specific field in the table is selected.

(3) The data value of the source field in the source table is selected.

In specific implementation, a certain data value of the source field in the source table is selected in sequence.

(4) Directly selecting the data value of the target field in the target table, or intelligently screening the data values of the target field in the target table through an algorithm.

In specific implementation, the data value of the target field in the target table is directly selected, or when the data value cannot be directly selected, a threshold is set, then the similarity between the data value of the source field and the data value of the target field is calculated through several algorithms, and the data value of the target field in the target table meeting the threshold is intelligently screened out, wherein the algorithms include but are not limited to:

1) cosine similarity matching:

The algorithm measures the similarity between two vectors by the cosine of the angle between them, which can be derived from the Euclidean dot product formula:

$$a \cdot b = \|a\| \, \|b\| \cos\theta$$

Given two attribute vectors A and B, the cosine similarity cos θ is expressed in terms of the dot product and the vector lengths as follows:

$$\cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$

Here A_i and B_i denote the components of vectors A and B, respectively.

The resulting similarity ranges from -1 to 1: -1 means that the two vectors point in exactly opposite directions, 1 means that they point in exactly the same direction, 0 usually means that they are independent, and values in between indicate intermediate similarity or dissimilarity.

Assume the value of the source field A is "This leather boot's number is big. That one's number is appropriate." and the value of the target field B is "This leather boot's number is not small, that one is more appropriate." The similarity is calculated with this algorithm as follows:

First, perform word segmentation on the source field and the target field.

A: this / leather boot / number / big. that / number / appropriate.

B: this / leather boot / number / not / small, that / more / appropriate.

Second, list all words appearing in the source field and the target field.

this, leather boot, number, big, that, appropriate, not, small, more

Third, count the frequency of each word from the second step in the source field and in the target field.

A: this: 1, leather boot: 1, number: 2, big: 1, that: 1, appropriate: 1, not: 0, small: 0, more: 0

B: this: 1, leather boot: 1, number: 1, big: 0, that: 1, appropriate: 1, not: 1, small: 1, more: 1

And fourthly, writing out the word frequency vectors of the source field and the target field according to the word frequency.

A＝(1，1，2，1，1，1，0，0，0)

B＝(1，1，1，0，1，1，1，1，1)

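Applying the formula above:

$$\cos\theta = \frac{1 \cdot 1 + 1 \cdot 1 + 2 \cdot 1 + 1 \cdot 0 + 1 \cdot 1 + 1 \cdot 1 + 0 \cdot 1 + 0 \cdot 1 + 0 \cdot 1}{\sqrt{1^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2} \, \sqrt{1^2 + 1^2 + 1^2 + 1^2 + 1^2 + 1^2 + 1^2 + 1^2}} = \frac{6}{\sqrt{9} \, \sqrt{8}} \approx 0.71$$

The cosine of the angle is therefore about 0.71. If the threshold is set to 0.7, the value of the source field A and the value of the target field B are considered similar, and the value of the target field B is selected.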
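The six steps can be sketched in Python as follows. This is a hedged illustration, not the patent's implementation: the function name is hypothetical, and word segmentation is assumed to have been done already, so the function takes token lists (here the English renderings of the segmented words) rather than raw text. Evaluated on the token lists exactly as segmented above, the sketch gives 6 / (3 × √8), roughly 0.707.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between the word-frequency vectors of two token lists."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    # Step 2: the shared vocabulary; steps 3-4: one frequency vector per field.
    vocabulary = sorted(set(freq_a) | set(freq_b))
    vec_a = [freq_a[word] for word in vocabulary]
    vec_b = [freq_b[word] for word in vocabulary]
    # Step 5: dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty value is treated as dissimilar
    return dot / (norm_a * norm_b)


# Step 1 (segmentation) is assumed done; tokens mirror the example above.
tokens_a = ["this", "leather boot", "number", "big", "that", "number", "appropriate"]
tokens_b = ["this", "leather boot", "number", "not", "small", "that", "more", "appropriate"]
score = cosine_similarity(tokens_a, tokens_b)
```

Step 6 is then a plain comparison of `score` against the chosen threshold.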

2) Edit distance matching:

The Levenshtein distance (also called edit distance) is the minimum number of editing operations required to change one string into another; the larger the distance, the more the two strings differ.

Similarity = 1 - distance / Math.Max(str1.length, str2.length)

Suppose the value of the source field A, "word", is defined as str1, and the value of the target field B, "wind", is defined as str2.

Step 1: if the length of str1 or str2 is 0, return the length of the other string.

Here str1 has length 4 and str2 has length 4, so the maximum length is 4.

Step 2: initialize a matrix d of size (n+1) × (m+1), with the values of the first row and the first column increasing from 0. Scan the two strings (on the order of n × m comparisons): if str1[i] = str2[j], record temp as 0; otherwise record temp as 1. Then assign d[i, j] the minimum of d[i-1, j] + 1, d[i, j-1] + 1, and d[i-1, j-1] + temp.

The matrix is initialized as d of size (4+1) × (4+1) = 5 × 5, as shown in the table:

		w	o	r	d
	0	1	2	3	4
w	1
i	2
n	3
d	4

The two strings are then scanned character by character, and after this step the matrix is:

		w	o	r	d
	0	1	2	3	4
w	1	0	1	2	3
i	2	1	1	2	3
n	3	2	2	2	3
d	4	3	3	3	2

Step 3: after scanning, return the last value d[n][m] of the matrix, which is the distance between the two strings.

The last value of the matrix is d[4][4] = 2.

The similarity formula is: 1 - distance / maximum of the two string lengths.

Applying the formula above: similarity = 1 - 2/4 = 0.5.

If we set the threshold value to 0.5, then the values of the source field A and the target field B are considered similar, and the value of the target field B is selected.
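The three steps can be sketched in Python as below. The function names are hypothetical, and treating two empty strings as fully similar is an assumption added here to avoid dividing by zero; the recurrence itself follows Step 2 as described.

```python
def levenshtein(str1: str, str2: str) -> int:
    """Edit distance between str1 and str2 using the (n+1) x (m+1) matrix d."""
    n, m = len(str1), len(str2)
    # Step 1: if either string is empty, the distance is the other's length.
    if n == 0:
        return m
    if m == 0:
        return n
    # Step 2: first row and first column count up from 0.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            temp = 0 if str1[i - 1] == str2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + temp)
    # Step 3: the last cell holds the distance.
    return d[n][m]


def edit_similarity(str1: str, str2: str) -> float:
    """Similarity = 1 - distance / max(len(str1), len(str2))."""
    longest = max(len(str1), len(str2))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1 - levenshtein(str1, str2) / longest


distance = levenshtein("word", "wind")
similarity = edit_similarity("word", "wind")
```

For "word" and "wind" this reproduces the distance of 2 and the similarity of 0.5 from the worked example.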

3) Longitude and latitude distance matching:

Based on the coordinate values of A and B, calculate the absolute longitude difference and the absolute latitude difference between A and B, and select the target coordinates whose differences from the source field's coordinates fall within the set error range.

If the field values being compared are place names, they can also be matched with this calculation method.

Suppose the source field A holds a place name and the value of the target field B is "Beijing West Railway Station".

The coordinates of A obtained through the Baidu Maps open developer API are distanceA = (116.37790192578123, 39.86384240940923), and the coordinates of B are distanceB = (116.32056702587889, 39.894931215390564).

Then the absolute difference in longitude and latitude is calculated separately:

|116.37790192578123-116.32056702587889|＝0.05733489990234

|39.86384240940923-39.894931215390564|＝0.031088805981334

If the threshold is set so that the longitude difference does not exceed 0.06 and the latitude difference does not exceed 0.05, the value of field B is selected.
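A minimal sketch of this per-axis comparison follows. The function name and the (longitude, latitude) tuple layout are assumptions; the geocoding call that turns a place name into coordinates is not shown, so the coordinates from the example are hard-coded. Note that this compares differences in degrees on each axis, as the text describes, rather than a great-circle distance.

```python
def within_coordinate_tolerance(source, target, max_lon_diff, max_lat_diff):
    """Return True if both per-axis differences are within the allowed error.

    source and target are (longitude, latitude) tuples in degrees.
    """
    lon_diff = abs(source[0] - target[0])
    lat_diff = abs(source[1] - target[1])
    return lon_diff <= max_lon_diff and lat_diff <= max_lat_diff


coord_a = (116.37790192578123, 39.86384240940923)
coord_b = (116.32056702587889, 39.894931215390564)

# With the thresholds from the example, B is selected.
b_selected = within_coordinate_tolerance(coord_a, coord_b, 0.06, 0.05)
```

Tightening the longitude tolerance to 0.05 would exclude B, since the longitude difference is about 0.057.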

4) Classification code matching:

After splitting both values into characters, search for each character of the shorter string (of the source and target field values) in the longer string, and divide the number of matching characters by the length of the longer string to obtain a percentage; the larger the percentage, the higher the matching degree.

Suppose the value of the source field A is "word" and the value of the target field B is "world".

First, split the values of A and B into characters.

The value of A is split into: w, o, r, d, length 4.

The value of B is split into: w, o, r, l, d, length 5.

Second, search for each character of A in B to obtain its frequency:

w: 1, o: 1, r: 1, d: 1, so the sum of the frequencies is 4.

Third, divide the sum of the frequencies by the longer of the two lengths, 5:

Similarity = 4 ÷ 5 × 100% = 80%, i.e. 0.8.

If the threshold is set to 0.8, the value of the target field B is selected.
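The calculation can be sketched as below. The function name is hypothetical, and one detail is an assumption: the text does not say how repeated characters are counted, so this sketch lets each character of the longer string be matched at most once (a multiset intersection).

```python
from collections import Counter


def classification_code_similarity(value_a: str, value_b: str) -> float:
    """Matched characters of the shorter string divided by the longer length."""
    shorter, longer = sorted((value_a, value_b), key=len)
    if not longer:
        return 1.0  # assumption: two empty values count as identical
    # Assumption: each character in the longer string can be matched only once.
    matched = sum((Counter(shorter) & Counter(longer)).values())
    return matched / len(longer)


similarity = classification_code_similarity("word", "world")
```

For "word" and "world" the four characters w, o, r, d are all found, giving 4 / 5 = 0.8 as in the worked example.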

5) Time-date matching

The closer two dates are, the higher their matching degree; the matching degree is calculated as the absolute difference between the two dates.

Similarity = |date(A) - date(B)|

Suppose the value of the source field A is 14:14:20 on September 18, 2021, and the value of the target field B is 14:15:10 on September 18, 2021.

These two field values are first converted to timestamps (in milliseconds):

A＝1631945660000, B＝1631945710000

Similarity = |A - B| = |1631945660000 - 1631945710000| = 50000

If the threshold is set to 10000, the difference exceeds the threshold, so the value of target field B is not selected.
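A short sketch of this comparison is given below. It reads the two example values as 14:14:20 and 14:15:10 on 18 September 2021 in UTC+8, an interpretation that reproduces the millisecond timestamps given above; the time zone, the function names, and the strict "smaller than" comparison (taken from the summary of this algorithm) are assumptions of the sketch.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the example times are wall-clock times in UTC+8.
UTC_PLUS_8 = timezone(timedelta(hours=8))


def to_millis(moment: datetime) -> int:
    """Convert an aware datetime to a Unix timestamp in milliseconds."""
    return round(moment.timestamp() * 1000)


def times_match(a: datetime, b: datetime, threshold_ms: int) -> bool:
    """Select b only if the absolute timestamp difference is below the threshold."""
    return abs(to_millis(a) - to_millis(b)) < threshold_ms


time_a = datetime(2021, 9, 18, 14, 14, 20, tzinfo=UTC_PLUS_8)
time_b = datetime(2021, 9, 18, 14, 15, 10, tzinfo=UTC_PLUS_8)
difference_ms = abs(to_millis(time_a) - to_millis(time_b))
b_selected = times_match(time_a, time_b, 10000)
```

With a threshold of 10000 ms the 50000 ms gap is too large, so B is not selected; a threshold above 50000 ms would select it.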

(5) If the field values of the target field in the target table need to be expanded into the source table, enable expansion and expand the field values of the target field into the source table.

In specific implementation, when the target table contains more field values than the standard data table and those additional values are wanted, expansion can be enabled so that, after the values already present are removed, the field values in the target table are expanded into the source table, thereby making the standard data in the standard data table as complete as possible.
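The expansion step amounts to adding only those target values that the source does not already contain. A minimal sketch, assuming the field values are held in plain Python lists and using a hypothetical function name:

```python
def expand_source_values(source_values, target_values):
    """Return the source values extended with target values not already present.

    Order is preserved and duplicates are dropped; the inputs are not modified.
    """
    seen = set(source_values)
    expanded = list(source_values)
    for value in target_values:
        if value not in seen:
            seen.add(value)
            expanded.append(value)
    return expanded


standard_values = ["male", "female"]
target_field_values = ["female", "unknown", "unknown"]
expanded_values = expand_source_values(standard_values, target_field_values)
```

Returning a new list rather than mutating the input leaves the decision to persist the expanded dictionary to the caller.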

(6) Checking the matching result after matching is successful.

In specific implementation, the matching is regarded as successful once the matching result has been checked and confirmed to be correct.
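To make the result easy to check, the matches can be collected into a mapping from each target value to its standard value, with unmatched target values listed separately. The sketch below is an assumption-laden illustration: the function names, the best-score selection rule, and the character-overlap similarity are all hypothetical choices, not details taken from the patent.

```python
from collections import Counter


def char_overlap(a: str, b: str) -> float:
    """Hypothetical similarity: shared characters / length of the longer string."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    shared = sum((Counter(a) & Counter(b)).values())
    return shared / longest


def build_alignment(source_values, target_values, similarity, threshold):
    """Map each target value to its best-scoring source value, if it meets the threshold."""
    mapping, unmatched = {}, []
    for target in target_values:
        scored = [(similarity(source, target), source) for source in source_values]
        best_score, best_source = max(scored, default=(0.0, None))
        if best_source is not None and best_score >= threshold:
            mapping[target] = best_source
        else:
            unmatched.append(target)
    return mapping, unmatched


mapping, unmatched = build_alignment(
    ["male", "female"], ["Male", "FEMALE", "M", "x"], char_overlap, 0.8
)
```

The `mapping` dictionary is what a reviewer would confirm, while `unmatched` lists values that need direct selection or a different algorithm; the target values themselves are never rewritten.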

Of course, the present invention may have other embodiments, and those skilled in the art can make various changes and modifications according to the present invention without departing from the spirit and the essence of the present invention, for example, by matching similar numbers, matching data in different databases, not limited to dictionary values, and applying the successfully matched values as similar values to each scene.

Example 1:

a method of multi-source heterogeneous data dictionary alignment, comprising:

(1) different data sources may be selected.

(2) The data in the target table is not modified.

(3) The standard data values can be expanded in a continuous expansion mode.

(4) The method is not limited to matching dictionary values.

(5) The screening may be aided by different algorithms.

Further, different data sources can be selected, so that a user can flexibly select and widely expand data dictionaries or data values.

Further, because the data in the target table are not modified, the data in the target table can be interpreted as uniform data or as similar data values without polluting the data in the target table.

Further, by expanding the standard data values in the expansion mode, the standard data values can accumulate continuously, which facilitates standardized and generalized matching applications.

Further, the method is not limited to matching dictionary values and may be applied elsewhere, for example when data in different tables need to be associated and matched, or to match the data in two tables and find similar data.

Further, the intelligent screening may be performed by different algorithms, including but not limited to:

cosine similarity matching;

edit distance matching;

longitude and latitude distance matching;

classification code matching;

time and date matching.

The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims

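1. A method for aligning a multi-source heterogeneous data dictionary, characterized by comprising the following steps:

S1, selecting a source database, selecting a source table, and selecting a source field as a standard;

S2, selecting a target database, selecting a target table, and selecting a target field to be aligned;

S3, selecting the data value of the source field in the source table;

S4, directly selecting the data value of the target field in the target table or screening the data value of the target field in the target table through an algorithm;

S5, if the data value of the target field in the target table needs to be expanded into the source table, enabling expansion and expanding the data value of the target field in the target table into the source table;

and S6, checking the matching result after matching is successful.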

2. The method for dictionary alignment of multi-source heterogeneous data according to claim 1, wherein the step S1 specifically includes: selecting a source database as a dictionary table and selecting a table in the database, selecting a specific field in the table.

3. The method for dictionary alignment of multi-source heterogeneous data according to claim 1, wherein the step S2 specifically includes: selecting a target database to be aligned, selecting a table in the database, and selecting a specific field in the table.

4. The method for dictionary alignment of multi-source heterogeneous data according to any one of claims 1 to 3, wherein the step S3 specifically includes: a certain data value of the source field in the source table is selected in turn.

5. The method for dictionary alignment of multi-source heterogeneous data according to claim 4, wherein the step S4 specifically includes: directly selecting the data value of the target field in the target table, or setting a threshold value when the data value cannot be directly selected, then calculating the similarity between the data value of the source field and the data value of the target field through several algorithms, and screening out the data value of the target field in the target table which meets the threshold value, wherein the algorithms include but are not limited to cosine similarity matching, edit distance matching, longitude and latitude distance matching, classification code matching and time and date matching.

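6. The method for dictionary alignment of multi-source heterogeneous data according to claim 5, wherein the cosine similarity matching algorithm specifically comprises the steps of:

firstly, performing word segmentation on a source field and a target field;

secondly, listing all words appearing in the source field and the target field;

thirdly, counting, for all the words in the second step, the word frequency in the source field and in the target field;

fourthly, writing out word frequency vectors of the source field and the target field according to the word frequency;

fifthly, calculating the cosine of the angle between the two word frequency vectors to measure the similarity between them;

and sixthly, judging whether the cosine value is larger than a threshold value, and if so, considering the value of the source field to be similar to the value of the target field and selecting the value of the target field.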

7. The method for dictionary alignment of multi-source heterogeneous data according to claim 5, wherein if the source field and the target field are coordinate values, longitude and latitude distance matching is adopted: the absolute longitude difference and the absolute latitude difference between the source field and the target field are calculated, and the coordinate values of the target field that differ from the coordinate values of the source field by no more than a set error range are selected.

8. The method for dictionary alignment of multi-source heterogeneous data according to claim 5, wherein the classification code matching algorithm comprises: after the strings are split, searching for the characters of the shorter of the source and target field strings in the longer string, and dividing the number of matching characters by the length of the longer string to obtain a percentage, wherein the larger the percentage, the higher the matching degree.

9. The method for dictionary alignment of multi-source heterogeneous data according to claim 5, wherein if the source field and the target field are time values, time and date matching is adopted: the time values of the source field and the target field are converted into timestamps, the absolute difference between the two timestamps is calculated, and if the absolute difference is smaller than a threshold, the value of the target field is selected.

10. The method for dictionary alignment of multi-source heterogeneous data according to claim 5, wherein the step S5 specifically includes: by enabling expansion, removing from the field values in the target table the data values that already exist in the source table, and expanding the remaining field values into the source table.