CN113836897A - Method for aligning multi-source heterogeneous data dictionary - Google Patents

Publication number
CN113836897A
Authority
CN
China
Prior art keywords
field
source
target
selecting
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111108385.XA
Other languages
Chinese (zh)
Inventor
贾少敏
余增文
张东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Computer Technology and Applications
Original Assignee
Beijing Institute of Computer Technology and Applications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Computer Technology and Applications filed Critical Beijing Institute of Computer Technology and Applications
Priority to CN202111108385.XA priority Critical patent/CN113836897A/en
Publication of CN113836897A publication Critical patent/CN113836897A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for aligning multi-source heterogeneous data dictionaries and belongs to the field of big data. The method comprises: selecting a source database, a source table, and a source field to serve as the standard; selecting a target database, a target table, and a target field to be aligned; selecting a data value in the source table; selecting the matching data values in the target table either directly or by screening them with algorithms, including but not limited to cosine similarity matching, edit distance matching, longitude and latitude distance matching, classification code matching, and time and date matching; if data values in the target table need to be added to the source table, enabling expansion and expanding those values into the source table; and checking the matching result after matching succeeds. The method is simple to operate, the matching result can be read at a glance, and after the data dictionaries are aligned the specific data values are displayed as uniform data values.

Description

Method for aligning multi-source heterogeneous data dictionary
Technical Field
The invention belongs to the field of big data and relates to a method for aligning multi-source heterogeneous data dictionaries, in particular to a method that takes the field values of a table in a source database as the standard and matches the field values of a table in a target database to those standard values, either by direct selection or by intelligent matching algorithms.
Background
With the development of big data, data aggregation has become an indispensable part of a big data platform. During aggregation, data from sources in different places are merged together, and each source may follow its own data standard. A data standard defines a dictionary of codes for certain data values, so values with the same meaning are encoded according to different local standards and the corresponding data values differ.
For example, consider the value representing a person's sex: some places represent male with 0 and female with 1, some represent male with M and female with F, and others use yet other values of the same kind. Because the data values differ, unmatched values can be omitted when the data are aggregated, and codes that are unreadable to people are shown when the data are displayed.
In many projects, dictionary values are defined and unified locally, and field values are changed to convert them to the project's own standard values. This easily changes the data and pollutes the data source.
It is therefore desirable to map a field value to a standard value without changing the field value in the source data. The present invention is a technique developed to meet this need.
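As a rough illustration of this requirement, the following Python sketch keeps the raw rows untouched and resolves each code to a standard value only when the data are displayed. The system names, the codes, and the helper `to_standard` are hypothetical and are not taken from the patent.

```python
# Hypothetical alignment table: (source system, raw code) -> standard value.
# The system names and codes are illustrative assumptions only.
ALIGNMENT = {
    ("system_a", "0"): "male",
    ("system_a", "1"): "female",
    ("system_b", "M"): "male",
    ("system_b", "F"): "female",
}


def to_standard(system: str, raw_value: str) -> str:
    """Return the standard value for a raw code, or the raw code if unmapped."""
    return ALIGNMENT.get((system, raw_value), raw_value)


# The source rows are only read; the mapping is applied to copies for display.
rows_b = [{"name": "Li", "sex": "M"}, {"name": "Wang", "sex": "F"}]
displayed = [dict(row, sex=to_standard("system_b", row["sex"])) for row in rows_b]

print(displayed)
print(rows_b)  # unchanged: the stored codes are still "M" and "F"
```

Keeping the mapping in a separate structure means the standard can later be extended without rewriting any stored field values.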
Disclosure of Invention
(I) Technical problem to be solved
The technical problem to be solved by the invention is how to provide a method for aligning a multi-source heterogeneous data dictionary, so as to solve the problem that the existing dictionary value processing method causes data change and pollutes a data source.
(II) Technical solution
In order to solve the technical problem, the invention provides a method for aligning a multi-source heterogeneous data dictionary, which comprises the following steps:
S1, selecting a source database, selecting a source table, and selecting a source field as the standard;
S2, selecting a target database, selecting a target table, and selecting a target field to be aligned;
S3, selecting a data value of the source field in the source table;
S4, directly selecting the data value of the target field in the target table, or screening out the data value of the target field in the target table through an algorithm;
S5, if the data value of the target field in the target table needs to be expanded into the source table, enabling expansion and expanding the data value of the target field in the target table into the source table;
S6, checking the matching result after matching is successful.
Further, step S1 specifically includes: selecting a source database to serve as the dictionary, selecting a table in the database, and selecting a specific field in the table.
Further, step S2 specifically includes: selecting a target database to be aligned, selecting a table in the database, and selecting a specific field in the table.
Further, step S3 specifically includes: selecting each data value of the source field in the source table in turn.
Further, the step S4 specifically includes: directly selecting the data value of the target field in the target table, or setting a threshold value when the data value cannot be directly selected, then calculating the similarity between the data value of the source field and the data value of the target field through several algorithms, and screening out the data value of the target field in the target table which meets the threshold value, wherein the algorithms include but are not limited to cosine similarity matching, edit distance matching, longitude and latitude distance matching, classification code matching and time and date matching.
Further, the cosine similarity matching algorithm specifically includes the following steps:
firstly, performing word segmentation on a source field and a target field;
secondly, listing all words appearing in the source field and the target field;
thirdly, counting the word frequency appearing in the source field and the target field for all the words in the second step;
fourthly, writing out word frequency vectors of the source field and the target field according to the word frequency;
fifthly, calculating cosine values of included angles of the two word frequency vectors to measure similarity between the two word frequency vectors;
and sixthly, judging whether the cosine value is larger than a threshold value, if so, considering that the value of the source field is similar to the value of the target field, and screening the value of the target field.
Further, if the source field and the target field hold coordinate values, longitude and latitude distance matching is adopted: the absolute longitude difference and the absolute latitude difference between the source field value and the target field value are calculated separately, and the coordinate values of the target field whose differences from the coordinate value of the source field fall within a set error range are selected.
Further, the classification code matching algorithm comprises: splitting the strings of the source field and the target field into characters, searching for each character of the shorter string in the longer string, and dividing the number of matched characters by the length of the longer string to obtain a percentage, wherein the larger the percentage, the higher the matching degree.
Further, if the source field and the target field hold time values, time and date matching is adopted: the time values of the source field and the target field are converted into timestamps, the absolute difference of the two timestamps is calculated, and if the absolute difference is smaller than a threshold, the value of the target field is selected.
Further, step S5 specifically includes: when expansion is enabled, the field values in the target table that are not already present in the source table are added to the field values in the source table.
(III) advantageous effects
The invention provides a method for aligning a multi-source heterogeneous data dictionary. The operation is simple, the matching result is clear at a glance, and the specific data values are displayed as uniform data values after the data dictionaries are aligned.
Drawings
FIG. 1 is a flow chart of a multi-source heterogeneous data dictionary alignment method of the present invention.
Detailed Description
In order to make the objects, contents and advantages of the present invention clearer, the following detailed description of the embodiments of the present invention will be made in conjunction with the accompanying drawings and examples.
The invention discloses a method for aligning multi-source heterogeneous data dictionaries, comprising: (1) selecting a source database, selecting a source table, and selecting a source field as the standard; (2) selecting a target database, selecting a target table, and selecting a target field to be aligned; (3) selecting a data value in the source table; (4) directly selecting the data values in the target table or intelligently screening them out through algorithms, including but not limited to: 1) cosine similarity matching; 2) edit distance matching; 3) longitude and latitude distance matching; 4) classification code matching; 5) time and date matching; (5) if data values in the target table need to be expanded into the source table, enabling expansion and expanding the data values into the source table; and (6) checking the matching result after matching is successful.
The purpose of the invention is: a method for unifying the mapping values of data values in different data sources by directly matching the data values or intelligently matching the data values through an algorithm and expanding standard data is provided.
In order to achieve the above object, the present invention provides a method for aligning a multi-source heterogeneous data dictionary, which comprises:
S1, selecting a source database, selecting a source table, and selecting a source field as the standard.
S2, selecting a target database, selecting a target table, and selecting a target field to be aligned.
S3, selecting a data value in the source table.
S4, directly selecting the data values in the target table or screening out the data values in the target table through an algorithm, wherein the algorithms include but are not limited to: 1) cosine similarity matching; 2) edit distance matching; 3) longitude and latitude distance matching; 4) classification code matching; 5) time and date matching.
S5, if data values in the target table need to be expanded into the source table, enabling expansion and expanding the data values into the source table.
S6, checking the matching result after matching is successful.
In the process of dictionary alignment, a source database table serving as the dictionary table is selected first; the field values in the source database are then matched with the field values in the target database, and if field values in the target database have no match among the field values in the source database, the dictionary pool can be expanded with them. The operation is simple, the matching result can be read at a glance, and after the data dictionaries are aligned the specific data values are displayed as uniform data values.
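The matching stage of this process can be sketched in Python as a generic screening loop that accepts any similarity function and a threshold. This is a minimal sketch under stated assumptions: the function names `screen` and `exact_ignore_case` and the sample values are hypothetical, and the patent does not prescribe this interface.

```python
from typing import Callable, Dict, List


def screen(source_values: List[str],
           target_values: List[str],
           similarity: Callable[[str, str], float],
           threshold: float) -> Dict[str, List[str]]:
    """For each standard (source) value, collect the target values whose
    similarity reaches the threshold. Neither input list is modified."""
    matches: Dict[str, List[str]] = {}
    for src in source_values:
        matches[src] = [tgt for tgt in target_values
                        if similarity(src, tgt) >= threshold]
    return matches


def exact_ignore_case(a: str, b: str) -> float:
    """Toy similarity used only for this illustration: 1.0 or 0.0."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


result = screen(["male", "female"], ["Male", "FEMALE ", "unknown"],
                exact_ignore_case, threshold=1.0)
print(result)
```

Any of the similarity measures described in this document could be passed in place of the toy function, as long as larger return values mean a closer match.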
FIG. 1 is a flow chart of a method for multi-source heterogeneous data dictionary alignment according to the present invention. As shown in fig. 1, the method includes:
(1) Selecting a source database, selecting a source table, and selecting a source field as the standard.
In specific implementation, a source database as a dictionary table is selected, a table in the database is selected, and a specific field in the table is selected.
(2) Selecting a target database, selecting a target table, and selecting a target field needing to be aligned.
In specific implementation, a target database to be aligned is selected, a table in the database is selected, and a specific field in the table is selected.
(3) The data value of the source field in the source table is selected.
In specific implementation, a certain data value of the source field in the source table is selected in sequence.
(4) Directly selecting the data value of the target field in the target table, or intelligently screening out the data value of the target field in the target table through an algorithm.
In specific implementation, the data value of the target field in the target table is selected directly; when it cannot be selected directly, a threshold is set, the similarity between the data value of the source field and the data value of the target field is calculated with one of several algorithms, and the data values of the target field in the target table that meet the threshold are screened out. The algorithms include but are not limited to:
1) Cosine similarity matching:
The algorithm measures the similarity between two vectors by the cosine of the angle between them, which can be derived from the Euclidean dot product formula:
$$\mathbf{a}\cdot\mathbf{b} = \|\mathbf{a}\|\,\|\mathbf{b}\|\cos\theta$$
Given two attribute vectors A and B, the cosine similarity cos θ is obtained from the dot product and the vector lengths as follows:
$$\cos\theta = \frac{A\cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$
Here $A_i$ and $B_i$ represent the components of vectors A and B, respectively.
The similarity ranges from -1 to 1: -1 means that the two vectors point in exactly opposite directions, 1 means that they point in exactly the same direction, 0 usually means that they are independent, and values in between indicate intermediate similarity or dissimilarity.
Assume the value of the source field A is "This leather boot's size is too big. That one's size is suitable." and the value of the target field B is "This leather boot's size is not small; that one is more suitable." The similarity is calculated with this algorithm as follows:
First, perform word segmentation on the source field value and the target field value.
A: this one / leather boot / size / too big. that one / size / suitable.
B: this one / leather boot / size / not / small, that one / more / suitable.
Second, list all words appearing in the source field value and the target field value.
this one, leather boot, size, too big, that one, suitable, not, small, more
Third, count the frequency of each word from the second step in the source field value and in the target field value.
A: this one: 1, leather boot: 1, size: 2, too big: 1, that one: 1, suitable: 1, not: 0, small: 0, more: 0
B: this one: 1, leather boot: 1, size: 1, too big: 0, that one: 1, suitable: 1, not: 1, small: 1, more: 1
Fourth, write out the word frequency vectors of the source field value and the target field value according to the word frequencies.
A=(1,1,2,1,1,1,0,0,0)
B=(1,1,1,0,1,1,1,1,1)
Fifth, calculate the cosine of the angle between the two word frequency vectors to measure their similarity.
Applying the above formula to A=(1,1,2,1,1,1,0,0,0) and B=(1,1,1,0,1,1,1,1,1):
$$\cos\theta = \frac{1\cdot1+1\cdot1+2\cdot1+1\cdot0+1\cdot1+1\cdot1+0\cdot1+0\cdot1+0\cdot1}{\sqrt{1^2+1^2+2^2+1^2+1^2+1^2}\,\sqrt{1^2+1^2+1^2+1^2+1^2+1^2+1^2+1^2}} = \frac{6}{\sqrt{9}\,\sqrt{8}} \approx 0.71$$
Sixth, judge whether the cosine value is larger than the threshold; if so, the value of the source field is considered similar to the value of the target field, and the value of the target field is selected.
The cosine of the angle in this calculation is about 0.71, so if the threshold is set to 0.7, the value of the source field A and the value of the target field B are considered similar, and the value of the target field B is selected.
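The six steps can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: it assumes word segmentation has already been done by a separate tool, so the inputs are lists of tokens, and the English tokens below are stand-ins for the segmented words of the example. For the vectors A=(1,1,2,1,1,1,0,0,0) and B=(1,1,1,0,1,1,1,1,1) the function evaluates 6 / (√9 · √8), which is approximately 0.707.

```python
import math
from collections import Counter
from typing import List


def cosine_similarity(tokens_a: List[str], tokens_b: List[str]) -> float:
    """Cosine of the angle between the word frequency vectors of two token lists."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    # Step 2: every word that appears in either value.
    vocabulary = sorted(set(freq_a) | set(freq_b))
    # Steps 3 and 4: word frequency vectors over the shared vocabulary.
    vec_a = [freq_a[word] for word in vocabulary]
    vec_b = [freq_b[word] for word in vocabulary]
    # Step 5: dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty value is similar to nothing
    return dot / (norm_a * norm_b)


# Step 1 (segmentation) is assumed to have produced these tokens.
tokens_a = ["this one", "leather boot", "size", "too big",
            "that one", "size", "suitable"]
tokens_b = ["this one", "leather boot", "size", "not", "small",
            "that one", "more", "suitable"]

score = cosine_similarity(tokens_a, tokens_b)
print(round(score, 4))
```

Step 6 is then a single comparison of `score` against the chosen threshold.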
2) Edit distance matching:
The Levenshtein distance (also called edit distance) is the minimum number of editing operations required to change one string into another; the larger the distance, the more the two strings differ. The similarity is computed as:
similarity = 1 - distance / max(str1.length, str2.length)
Suppose the value of the source field A, "word", is defined as str1, and the value of the target field B, "wind", is defined as str2.
Step 1: if the length of str1 or str2 is 0, return the length of the other string.
Here str1 has length 4 and str2 has length 4, so the maximum length is 4.
Step 2: initialize a matrix d of size (n+1)×(m+1), where n and m are the lengths of the two strings, and let the values of the first row and the first column increase from 0. Scan the two strings (an n×m pass): if str1[i] = str2[j], record temp as 0; otherwise record temp as 1. Then assign d[i, j] the minimum of d[i-1, j] + 1, d[i, j-1] + 1, and d[i-1, j-1] + temp.
The initialization gives a (4+1)×(4+1) = 5×5 matrix, as in the table:
-  0  w  o  r  d
0  0  1  2  3  4
w  1
i  2
n  3
d  4
The two strings are then scanned character by character, and the matrix is filled in step by step:
[Figures: the matrix d as it is filled in during the scan]
Step 3: after the scan, return the last value d[n][m] of the matrix, which is the edit distance between the two strings.
The last value of the matrix is d[4][4] = 2.
The similarity formula is: 1 - distance / maximum of the two string lengths.
Applying the above formula: similarity = 1 - 2/4 = 0.5.
If the threshold is set to 0.5, the similarity reaches the threshold, so the values of the source field A and the target field B are considered similar, and the value of the target field B is selected.
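The three steps above correspond to the standard dynamic-programming computation of the Levenshtein distance, sketched here in Python. The function names are illustrative; returning 1.0 for two empty strings is an assumption, since that case is not covered by the description.

```python
from typing import List


def levenshtein_matrix(str1: str, str2: str) -> List[List[int]]:
    """Build the (n+1) x (m+1) edit distance matrix for str1 and str2."""
    n, m = len(str1), len(str2)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # first column increases from 0
    for j in range(m + 1):
        d[0][j] = j  # first row increases from 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            temp = 0 if str1[i - 1] == str2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + temp)  # match or substitution
    return d


def edit_similarity(str1: str, str2: str) -> float:
    """similarity = 1 - distance / max(len(str1), len(str2))."""
    longest = max(len(str1), len(str2))
    if longest == 0:
        return 1.0  # assumption: two empty strings are treated as identical
    distance = levenshtein_matrix(str1, str2)[len(str1)][len(str2)]
    return 1 - distance / longest


matrix = levenshtein_matrix("word", "wind")
for row in matrix:
    print(row)
print(edit_similarity("word", "wind"))
```

Initializing the first row and column with 0..m and 0..n also covers Step 1: when one string is empty, the last cell already holds the length of the other string.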
3) Longitude and latitude distance matching:
From the coordinate values of A and B, the absolute longitude difference and the absolute latitude difference between A and B are calculated separately, and the target coordinates whose differences from the coordinate value of the source field fall within the set error range are selected.
If the field values being compared are place names, they can be converted to coordinates and matched by this calculation.
Suppose the value of the source field A is a place name and the value of the target field B is "Beijing West Railway Station".
The coordinate values obtained through the open API for Baidu Maps developers are: distanceA = (116.37790192578123, 39.86384240940923) for A, and distanceB = (116.32056702587889, 39.894931215390564) for B.
The absolute differences in longitude and latitude are then calculated separately:
|116.37790192578123 - 116.32056702587889| = 0.05733489990234
|39.86384240940923 - 39.894931215390564| = 0.031088805981334
If the threshold is set so that the longitude difference does not exceed 0.06 and the latitude difference does not exceed 0.05, the value of field B is selected.
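The comparison can be sketched in Python as below. The function name and the (longitude, latitude) tuple layout are assumptions made for illustration; the coordinates are the ones given in the example, and obtaining them from a geocoding service is outside the scope of this sketch.

```python
from typing import Tuple

Coordinate = Tuple[float, float]  # (longitude, latitude), an assumed layout


def within_tolerance(coord_a: Coordinate, coord_b: Coordinate,
                     max_lng_diff: float, max_lat_diff: float) -> bool:
    """True if both absolute differences stay within their error ranges."""
    lng_diff = abs(coord_a[0] - coord_b[0])
    lat_diff = abs(coord_a[1] - coord_b[1])
    return lng_diff <= max_lng_diff and lat_diff <= max_lat_diff


coord_a = (116.37790192578123, 39.86384240940923)
coord_b = (116.32056702587889, 39.894931215390564)

selected = within_tolerance(coord_a, coord_b,
                            max_lng_diff=0.06, max_lat_diff=0.05)
print(selected)
```

Note that this compares the two axes independently in degrees, as the description does; it is not a great-circle distance.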
4) Classification code matching:
The strings of the source field and the target field are split into characters, each character of the shorter string is searched for in the longer string, and the number of matched characters is divided by the length of the longer string to obtain a percentage; the larger the percentage, the higher the matching degree.
Suppose the value of the source field A is "word" and the value of the target field B is "world".
First, split the values of A and B into characters.
A is split into: w, o, r, d, length 4.
B is split into: w, o, r, l, d, length 5.
Second, search for each character of A in B to obtain the character frequencies:
w: 1, o: 1, r: 1, d: 1, so the sum of the frequencies is 4.
Third, divide the sum of the frequencies by the longer of the two lengths, 5:
similarity = 4 ÷ 5 × 100% = 80%.
If the threshold is set to 0.8 (80%), the value of the target field B is selected.
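A minimal Python sketch of this calculation follows. The description does not say how repeated characters should be counted, so this sketch assumes each character of the longer string can be matched at most once; the function name is illustrative.

```python
from collections import Counter


def char_overlap_ratio(a: str, b: str) -> float:
    """Matched characters of the shorter string divided by the longer length."""
    shorter, longer = sorted((a, b), key=len)
    if not longer:
        return 1.0  # assumption: two empty strings match completely
    available = Counter(longer)  # characters of the longer string still unmatched
    matched = 0
    for ch in shorter:
        if available[ch] > 0:
            available[ch] -= 1   # assumption: each character is used at most once
            matched += 1
    return matched / len(longer)


ratio = char_overlap_ratio("word", "world")
print(ratio)
```

Because the characters are compared without regard to position, strings that are anagrams of each other score 1.0 under this measure.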
5) Time and date matching:
The closer two dates are, the higher their matching degree; the matching degree is measured by the absolute difference between the two dates.
difference = |date(A) - date(B)|
Suppose the value of the source field A is 14:14:20 on September 18, 2021, and the value of the target field B is 14:15:10 on September 18, 2021.
These two field values are first converted to timestamps (in milliseconds):
A = 1631945660000, B = 1631945710000
difference = |A - B| = |1631945660000 - 1631945710000| = 50000
If the threshold is set to 10000, the difference exceeds the threshold, so the value of the target field B is not selected.
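The same check can be sketched in Python with the standard `datetime` module. The timestamps in the example correspond to local times at UTC+8, so the sketch assumes that offset; the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the example times are local times at UTC+8.
UTC_PLUS_8 = timezone(timedelta(hours=8))


def to_millis(moment: datetime) -> int:
    """Convert an aware datetime to a millisecond timestamp."""
    return int(moment.timestamp() * 1000)


def within_time_threshold(a: datetime, b: datetime, threshold_ms: int) -> bool:
    """True if the absolute timestamp difference is smaller than the threshold."""
    return abs(to_millis(a) - to_millis(b)) < threshold_ms


time_a = datetime(2021, 9, 18, 14, 14, 20, tzinfo=UTC_PLUS_8)
time_b = datetime(2021, 9, 18, 14, 15, 10, tzinfo=UTC_PLUS_8)

difference = abs(to_millis(time_a) - to_millis(time_b))
print(to_millis(time_a), to_millis(time_b), difference)
print(within_time_threshold(time_a, time_b, threshold_ms=10000))
```

Unlike the similarity measures above, a smaller value here means a closer match, which is why the comparison is "smaller than the threshold" rather than "at least the threshold".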
(5) If the field values of the target field in the target table need to be expanded into the source table, enabling expansion and expanding the field values of the target field into the source table.
In specific implementation, when the target table contains more field values than the standard data table and those additional values are wanted, expansion can be enabled: the values already present in the source table are removed from the target field values, and the remaining values are added to the source table, so that the standard data in the standard data table are as complete as possible.
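The expansion step amounts to an order-preserving union of the two value lists, sketched below in Python. The function name and sample values are hypothetical; the sketch returns a new list rather than writing to a database, which is left out here.

```python
from typing import List


def expand_source(source_values: List[str], target_values: List[str]) -> List[str]:
    """Return the source values followed by target values not already present."""
    seen = set(source_values)
    expanded = list(source_values)  # copy, so the input list is not modified
    for value in target_values:
        if value not in seen:       # drop values that duplicate existing ones
            seen.add(value)
            expanded.append(value)
    return expanded


standard = ["male", "female"]
incoming = ["female", "unknown", "unknown", "male"]

expanded = expand_source(standard, incoming)
print(expanded)
```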
(6) Checking the matching result after matching is successful.
In specific implementation, the matching is considered successful once the match has been confirmed to be correct.
Of course, the present invention may have other embodiments, and those skilled in the art can make various changes and modifications according to the present invention without departing from the spirit and the essence of the present invention, for example, by matching similar numbers, matching data in different databases, not limited to dictionary values, and applying the successfully matched values as similar values to each scene.
Example 1:
a method of multi-source heterogeneous data dictionary alignment, comprising:
(1) Different data sources may be selected.
(2) The data in the target table is not modified.
(3) The standard data values can be extended continuously through expansion.
(4) The method is not limited to matching dictionary values.
(5) The screening may be aided by different algorithms.
Further, because different data sources can be selected, a user can flexibly choose and broadly extend data dictionaries or data values.
Further, because the data in the target table is not modified, the data in the target table can be interpreted as the final unified data or as similar data values without polluting the data in the target table.
Further, when the standard data values are extended through expansion, the standard data values accumulate continuously, which facilitates standardized and generalized matching applications.
Further, the method is not limited to matching dictionary values and may be applied elsewhere, for example when data in different tables need to be associated and matched, or when similar data in two tables need to be matched.
Further, the intelligent screening may be performed by different algorithms, including but not limited to:
cosine similarity matching;
edit distance matching;
longitude and latitude distance matching;
classification code matching;
time and date matching.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (10)

1. A method for aligning a multi-source heterogeneous data dictionary is characterized by comprising the following steps:
S1, selecting a source database, selecting a source table, and selecting a source field as the standard;
S2, selecting a target database, selecting a target table, and selecting a target field to be aligned;
S3, selecting a data value of the source field in the source table;
S4, directly selecting the data value of the target field in the target table, or screening out the data value of the target field in the target table through an algorithm;
S5, if the data value of the target field in the target table needs to be expanded into the source table, enabling expansion and expanding the data value of the target field in the target table into the source table;
S6, checking the matching result after matching is successful.
2. The method for dictionary alignment of multi-source heterogeneous data according to claim 1, wherein the step S1 specifically includes: selecting a source database as a dictionary table and selecting a table in the database, selecting a specific field in the table.
3. The method for dictionary alignment of multi-source heterogeneous data according to claim 1, wherein the step S2 specifically includes: selecting a target database to be aligned, selecting a table in the database, and selecting a specific field in the table.
4. The method for dictionary alignment of multi-source heterogeneous data according to any one of claims 1 to 3, wherein the step S3 specifically includes: a certain data value of the source field in the source table is selected in turn.
5. The method for dictionary alignment of multi-source heterogeneous data according to claim 4, wherein the step S4 specifically includes: directly selecting the data value of the target field in the target table, or setting a threshold value when the data value cannot be directly selected, then calculating the similarity between the data value of the source field and the data value of the target field through several algorithms, and screening out the data value of the target field in the target table which meets the threshold value, wherein the algorithms include but are not limited to cosine similarity matching, edit distance matching, longitude and latitude distance matching, classification code matching and time and date matching.
6. The method for dictionary alignment of multi-source heterogeneous data according to claim 5, wherein the cosine similarity matching algorithm specifically comprises the steps of:
firstly, performing word segmentation on a source field and a target field;
secondly, listing all words appearing in the source field and the target field;
thirdly, counting the word frequency appearing in the source field and the target field for all the words in the second step;
fourthly, writing out word frequency vectors of the source field and the target field according to the word frequency;
fifthly, calculating cosine values of included angles of the two word frequency vectors to measure similarity between the two word frequency vectors;
and sixthly, judging whether the cosine value is larger than a threshold value, if so, considering that the value of the source field is similar to the value of the target field, and screening the value of the target field.
7. The method for dictionary alignment of multi-source heterogeneous data according to claim 5, wherein if the source field and the target field hold coordinate values, longitude and latitude distance matching is adopted: the absolute longitude difference and the absolute latitude difference between the source field value and the target field value are calculated separately, and the coordinate values of the target field whose differences from the coordinate value of the source field fall within a set error range are selected.
8. The method for dictionary alignment of multi-source heterogeneous data according to claim 5, wherein the classification code matching algorithm comprises: splitting the strings of the source field and the target field into characters, searching for each character of the shorter string in the longer string, and dividing the number of matched characters by the length of the longer string to obtain a percentage, wherein the larger the percentage, the higher the matching degree.
9. The method for dictionary alignment of multi-source heterogeneous data according to claim 5, wherein if the source field and the target field hold time values, time and date matching is adopted: the time values of the source field and the target field are converted into timestamps, the absolute difference between the two timestamps is calculated, and if the absolute difference is smaller than a threshold, the value of the target field is selected.
10. The method for dictionary alignment of multi-source heterogeneous data according to claim 5, wherein the step S5 specifically includes: when expansion is enabled, the field values in the target table that are not already present in the source table are added to the field values in the source table.
CN202111108385.XA 2021-09-22 2021-09-22 Method for aligning multi-source heterogeneous data dictionary Pending CN113836897A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111108385.XA CN113836897A (en) 2021-09-22 2021-09-22 Method for aligning multi-source heterogeneous data dictionary

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111108385.XA CN113836897A (en) 2021-09-22 2021-09-22 Method for aligning multi-source heterogeneous data dictionary

Publications (1)

Publication Number Publication Date
CN113836897A true CN113836897A (en) 2021-12-24

Family

ID=78960317

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111108385.XA Pending CN113836897A (en) 2021-09-22 2021-09-22 Method for aligning multi-source heterogeneous data dictionary

Country Status (1)

Country Link
CN (1) CN113836897A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116910496A (en) * 2023-09-14 2023-10-20 深圳市智慧城市科技发展集团有限公司 Configuration method and device of data quality monitoring rule and readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6389429B1 (en) * 1999-07-30 2002-05-14 Aprimo, Inc. System and method for generating a target database from one or more source databases
EP2610763A1 (en) * 2011-11-30 2013-07-03 Tata Consultancy Services Limited Automated framework for post archival data comparison
CN108920601A (en) * 2018-06-27 2018-11-30 中国联合网络通信集团有限公司 A kind of data matching method and device
CN109783490A (en) * 2018-12-25 2019-05-21 杭州数梦工场科技有限公司 Data fusion method, device, computer equipment and storage medium
CN110717080A (en) * 2019-10-08 2020-01-21 深圳市新系区块链技术有限公司 Data processing method, system and related equipment
CN112307097A (en) * 2020-10-13 2021-02-02 武汉中科通达高新技术股份有限公司 Data asset management method and device
CN113220782A (en) * 2021-04-30 2021-08-06 土巴兔集团股份有限公司 Method, device, equipment and medium for generating multivariate test data source

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
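Han Hongqi (韩红旗): 《语义指纹著者姓名消歧理论及应用》 (Theory and Application of Author Name Disambiguation with Semantic Fingerprints), Beijing: Scientific and Technical Documentation Press (科学技术文献出版社), pages 114-116 *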

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116910496A (en) * 2023-09-14 2023-10-20 深圳市智慧城市科技发展集团有限公司 Configuration method and device of data quality monitoring rule and readable storage medium
CN116910496B (en) * 2023-09-14 2024-01-23 深圳市智慧城市科技发展集团有限公司 Configuration method and device of data quality monitoring rule and readable storage medium

Similar Documents

Publication Publication Date Title
JP6629942B2 (en) Hierarchical automatic document classification and metadata identification using machine learning and fuzzy matching
CN101978348B (en) Manage the archives about approximate string matching
JP3067980B2 (en) String matching method and apparatus
Moreau et al. Robust similarity measures for named entities matching
Zagoris et al. A document image retrieval system
CN103473327A (en) Image retrieval method and image retrieval system
US20220171753A1 (en) Matching Non-exact Addresses
CN110162681B (en) Text recognition method, text processing method, text recognition device, text processing device, computer equipment and storage medium
CN110941720B (en) Knowledge base-based specific personnel information error correction method
CN107526721B (en) Ambiguity elimination method and device for comment vocabularies of e-commerce products
CN105580003A (en) Data sanitization and normalization and geocoding methods
CN102971729A (en) Ascribing actionable attributes to data that describes a personal identity
CN111428511B (en) Event detection method and device
CN112784009B (en) Method and device for mining subject term, electronic equipment and storage medium
CN109522417A (en) A kind of trading company's abstracting method of company name
CN112633001A (en) Text named entity recognition method and device, electronic equipment and storage medium
CN113420546A (en) Text error correction method and device, electronic equipment and readable storage medium
Dehghan et al. Unconstrained Farsi handwritten word recognition using fuzzy vector quantization and hidden Markov models
CN113868351A (en) Address clustering method and device, electronic equipment and storage medium
CN113836897A (en) Method for aligning multi-source heterogeneous data dictionary
CN113792188B (en) Directory data comparison method
CN116843155A (en) SAAS-based person post bidirectional matching method and system
CN111581496A (en) Industry data analysis method and data analysis platform based on search engine keyword data
Carbonnel et al. Lexical post-processing optimization for handwritten word recognition
CN115831117A (en) Entity identification method, entity identification device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination