WO2019093675A1 - Data merging device and method for big data analysis - Google Patents

Data merging device and method for big data analysis

Info

Publication number
WO2019093675A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
data set
similarity
row
ranking
Prior art date
Application number
PCT/KR2018/012358
Other languages
English (en)
Korean (ko)
Inventor
최용준
장명수
공성원
Original Assignee
(주) 위세아이텍
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by (주) 위세아이텍
Publication of WO2019093675A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2219 Large Object storage; Management thereof

Definitions

  • the present invention relates to a data merging apparatus and method for big data analysis.
  • Merging data is impossible when internal and external data are generated from different viewpoints and for different purposes, or when, for lack of data standards, the same data is expressed differently or the same expression carries different meanings. In such cases, excessive human resources must be devoted to matching the data, and the data cannot be normalized.
  • A representative data matching algorithm, which is one of the data preprocessing steps, is the fuzzy data matching algorithm.
  • This algorithm matches data using values calculated from the edit distance (Levenshtein distance).
  • The present invention has been made to solve the above problems of the prior art, and an object of the present invention is to provide a data merge apparatus and method that merge a plurality of data by selecting a plurality of column items from a plurality of data sets and performing a similarity calculation.
  • A further object of the present invention is to provide a data merge apparatus and method that merge a plurality of data more accurately by performing a first matching similarity calculation and a second matching similarity calculation.
  • A data merge apparatus for big data analysis comprises: a database including a plurality of data sets; a data matching unit that determines a first data set and a second data set on which to perform data matching from among the plurality of data sets included in the database, selects a plurality of column items from the column items of the first data set and the second data set, calculates the similarity between the selected column items, and matches data rows of the first data set and the second data set based on the similarity; and a data merge unit that merges the first data set and the second data set based on the matching result of the data rows.
  • The data matching unit may derive a first similarity ranking data set by performing a first matching similarity calculation based on the first column of the first data set and the first column of the second data set, derive a second similarity ranking data set by performing a second matching similarity calculation based on the first similarity ranking data set, and match the data rows of the first data set and the second data set by comparing the first similarity ranking data set with the second similarity ranking data set.
  • The first similarity ranking data set and the second similarity ranking data set derived by the data matching unit may include ranking information for each data row of the second data set.
  • In relation to the first row data of the first data set, the data matching unit may determine that the row data of the second data set located at the first rank matches the first row data when (i) the ranking, in the second similarity ranking data set, of the row data of the second data set located at the first rank of the first similarity ranking data set and the ranking, in the first similarity ranking data set, of the row data of the second data set located at the first rank of the second similarity ranking data set are both 1, and (ii) the row data of the second data set located at the first rank of the first similarity ranking data set coincides with the row data of the second data set located at the first rank of the second similarity ranking data set.
  • In relation to the first row data of the first data set, when there are a plurality of row data of the second data set for which the ranking in the second similarity ranking data set of the row data located at the first rank of the first similarity ranking data set and the ranking in the first similarity ranking data set of the row data located at the first rank of the second similarity ranking data set are both 1, the data matching unit may determine that the row data of the second data set matching the first row data cannot be determined.
  • the data matching unit is configured to combine the first row data of the first data set and the first row data of the second data set, Ranking of the second similarity ranking data set of the first row data of the second data set is 1 and the ranking of the second similarity ranking data set of the second row data of the second data set is 1 , It can be determined that the first row data of the second data set is matched with the first row data.
  • The data merge apparatus for big data analysis may further include a profiling unit that analyzes the data types and character string patterns of a plurality of column items of the first data set and the second data set through data analysis, and a similar item analysis unit that analyzes similar items between the first data set and the second data set based on the analysis result of the profiling unit.
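  • The kind of profiling described here (data type plus character string pattern per value) can be sketched as follows. This is only an illustration under assumptions: the function names `infer_type` and `string_pattern`, the three type labels, and the convention of masking digits as `9` and letters as `A` are hypothetical and do not come from the patent.

```python
def string_pattern(value: str) -> str:
    """Mask a value: digits become '9', letters become 'A', other characters stay.

    The masking convention is an assumption made for this sketch.
    """
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def infer_type(value: str) -> str:
    """Return 'int', 'float', or 'str' depending on which parse succeeds first."""
    for name, cast in (("int", int), ("float", float)):
        try:
            cast(value)
            return name
        except ValueError:
            pass
    return "str"


# Two time-like values share the same pattern, which hints that the
# columns holding them may be similar items.
print(string_pattern("09:00"), string_pattern("18:30"))
print(infer_type("1500"), infer_type("09:00"))
```

  • Columns whose values share the same inferred type and pattern would be natural candidates for the similar item analysis; how the patent actually scores such candidates is not stated in this section.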
  • Based on the analysis result of the similar item analysis unit, the data merge unit may exclude duplicated column items when merging the plurality of column items of the first data set and the second data set, and may then merge the first data set and the second data set.
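  • A minimal sketch of merging one matched pair of rows while excluding duplicated column items, assuming rows are represented as dictionaries and the duplicated columns of the second data set are already known. The function name `merge_rows`, the column names, and the sample values are hypothetical.

```python
from typing import Dict, Set


def merge_rows(row1: Dict[str, str], row2: Dict[str, str],
               duplicate_columns: Set[str]) -> Dict[str, str]:
    """Merge two matched rows, skipping columns of row2 that duplicate row1.

    duplicate_columns lists the column names of the second data set that the
    similar item analysis (assumed to have run already) flagged as duplicates.
    """
    merged = dict(row1)
    for column, value in row2.items():
        if column in duplicate_columns:
            continue  # already represented by a column of the first data set
        merged[column] = value
    return merged


# Made-up sample rows for illustration only.
first = {"parking lot name": "Training Center", "operation start time": "09:00"}
second = {"parking lot": "Training Center Park", "start time": "09:00",
          "address": "example address"}
print(merge_rows(first, second, {"parking lot", "start time"}))
```

  • Only the non-duplicated column (here the hypothetical `address`) is carried over from the second row, so the merged row keeps one copy of each piece of information.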
  • A data merge method for big data analysis comprises: determining a first data set and a second data set on which to perform data matching from among a plurality of data sets included in a database; deriving a first similarity ranking data set by performing a first matching similarity calculation based on the first column of the first data set and the first column of the second data set; deriving a second similarity ranking data set by performing a second matching similarity calculation based on the first similarity ranking data set; comparing the first similarity ranking data set with the second similarity ranking data set and matching, to each data row of the first data set, the corresponding data row of the second data set; and merging the first data set and the second data set based on the matching result.
  • a plurality of data items can be merged by performing a similarity calculation by selecting a plurality of column items from a plurality of data sets.
  • Because final data matching is performed by comparing the first matching similarity calculation result with the second matching similarity calculation result, a plurality of data can be merged more accurately.
  • FIG. 1 is a block diagram schematically illustrating a configuration of a data merge apparatus according to an embodiment of the present invention.
  • FIGS. 2A to 2D are schematic diagrams for explaining a first matching similarity calculation according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram for explaining a second matching similarity calculation according to an embodiment of the present invention.
  • FIG. 4 is a diagram illustrating exemplary results of the first matching similarity calculation and the second matching similarity calculation according to an embodiment of the present invention.
  • FIGS. 5A and 5B are schematic diagrams for illustrating a method of matching data rows of a first data set and a second data set by comparing a first similarity ranking data set and a second similarity ranking data set according to an embodiment of the present invention.
  • FIGS. 6A and 6B are schematic diagrams for explaining an embodiment of data analysis and similar item analysis according to an embodiment of the present invention.
  • FIG. 7 is a schematic diagram for explaining an embodiment of data merging for big data analysis according to an embodiment of the present invention.
  • FIG. 8 is a flowchart illustrating a data merging method according to an embodiment of the present invention.
  • FIG. 4 is a diagram illustrating a first matching similarity calculation result and a second matching similarity calculation result according to an embodiment of the present invention
  • FIG. 4 illustrates a result of a second matching similarity calculation according to an embodiment of the present invention
  • FIGS. 5A and 5B are schematic diagrams for explaining matching of data rows of a first data set and a second data set by comparing a first similarity degree ranking data set and a second similarity degree ranking data set according to an embodiment of the present invention .
  • The data merge apparatus 100 includes a database 110, a data matching unit 120, a profiling unit 130, a similar item analysis unit 140, and a data merge unit 150.
  • the data merge apparatus 100 may receive data necessary for data merge from an external server, but the present invention is not limited thereto.
  • The data merge apparatus 100 determines a first data set and a second data set among a plurality of data sets, selects a plurality of column items from the column items of the first data set and the second data set, and calculates the similarity between the selected column items.
  • the data merge apparatus 100 may match the data rows of the first data set and the second data set based on the similarity.
  • The data merge apparatus 100 may provide the user with matching determination information indicating whether the plurality of data can be matched, and may merge the plurality of data based on the matching determination information.
  • the data merge apparatus 100 may merge the first data set and the second data set based on the matching result of the data row.
  • the data merge apparatus 100 may merge data by matching a first data set and a second data set using a data matching algorithm based on artificial intelligence in a plurality of data sets.
  • The data merge apparatus 100 can resolve a problem that arises when data matching relies on a similarity calculation over a single column item, namely that first data and second data of the second data set are both matched to the same first data of the first data set.
  • By performing data matching through a similarity calculation across a plurality of column items, the first row data of the first data set can be matched with the first row data of the second data set.
  • Because the data merge apparatus 100 analyzes and matches the data similarity of a plurality of data sets, the data can be merged without the intervention of a data engineering specialist or the person in charge of the task.
  • the data merge apparatus 100 can improve the accuracy of data matching by learning and analyzing the amount of data that increases under a big data environment and the data characteristics of a specific business domain.
  • The data merge apparatus 100 can perform more accurate data matching and merging by letting the user directly select a matching candidate for a data item determined to be unmatchable and correcting the result accordingly.
  • the database 110 may include a plurality of data sets used for merging big data.
  • the database 110 may include unstructured data.
  • The data matching unit 120 may determine a first data set 10 and a second data set 20 on which to perform data matching from among a plurality of data sets included in the database 110.
  • The first data set 10 and the second data set 20 may be set to data sets that include a plurality of column items with preset similarity information, based on the similarity calculation result of the similar item analysis unit 140.
  • the first data set 10 and the second data set 20 may be a data set determined by the user's selection.
  • the data matching unit 120 may determine the first data set 10 and the second data set 20 that include items required to be merged among a plurality of data sets to perform the similarity calculation.
  • the data matching unit 120 may select a plurality of column items among the column items of the first data set and the second data set to calculate the similarity between the selected column items.
  • the data matching unit 120 may perform an operation using an artificial intelligence based algorithm when performing a similarity operation.
  • the data matching algorithm of the data matching unit 120 may perform similarity calculation using a fuzzy data matching algorithm, but the present invention is not limited thereto.
  • the Fuzzy Data Matching algorithm is an algorithm that performs matching between data using the calculated values based on the edit distance (Levenshtein Distance).
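  • As a minimal sketch of an edit-distance-based score, the following computes the Levenshtein distance with the standard dynamic-programming recurrence and converts it to a 0 to 100 similarity. The normalization by the longer string length is an assumption made for illustration; the patent does not state how its scores are scaled.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity on a 0..100 scale (assumed normalization, not the patent's)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


print(levenshtein("kitten", "sitting"))  # 3
```

  • Identical strings score 100 and strings with nothing in common score 0, so a higher value means a closer match.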
  • Reference numeral (a) in FIGS. 2A and 2B is the first data set 10, and reference numeral (b) in FIGS. 2A and 2B is the second data set 20.
  • the first data set may include four column entries 11-14.
  • the second data set may include four column entries 21 to 24.
  • The column items included in the first data set and the second data set can be distinguished by representative keys.
  • For example, the representative key of the first column item 11 of the first data set may be 'parking lot name', and the first column item 11 may be the set of column values that belong to the representative key 'parking lot name'.
  • The representative key of the first column item 11 of the first data set 10 may be 'parking lot name', the representative key of the second column item 12 may be 'parking lot operation start time', the representative key of the third column item 13 may be 'parking lot operation end time', and the representative key of the fourth column item 14 may be 'parking fee (10 minutes, unit level)'.
  • The representative key of the first column item 21 of the second data set 20 may be 'parking lot', the representative key of the second column item 22 may be 'parking lot operation start time', the representative key of the third column item 23 may be 'parking lot operation end time', and the representative key of the fourth column item 24 may be 'address'.
  • From the plurality of column items of the first data set 10 and the second data set 20, the data matching unit 120 may select the first to third column items 11 to 13 of the first data set 10 and the first to third column items 21 to 23 of the second data set 20, and calculate the similarity between the selected column items. The number of column items selected from the first data set 10 must match the number selected from the second data set 20. That is, the data matching unit 120 can determine the number of column items on which data matching is performed. As another example, the user may select the column items on which to perform data matching.
  • The data matching unit 120 may select a representative column item from among the selected column items. Referring to FIG. 2C, the data matching unit 120 may select the first column item 11 of the first data set 10 and the first column item 21 of the second data set 20 as the representative column items and perform the similarity calculation on them.
  • the representative column entry may be selected by the user.
  • the data matching unit 120 may perform data matching using a fuzzy algorithm when performing a similarity operation by selecting a representative column item among a plurality of selected column items, but the present invention is not limited thereto.
  • The data matching unit 120 may perform a first matching similarity calculation based on the first columns 11 and 21 of the first data set 10 and the second data set 20 to derive a first similarity ranking data set 30.
  • The first columns of the first data set and the second data set may be the parking lot name 11 and the parking lot 21.
  • The first columns are not limited to the parking lot name 11 and the parking lot 21; they may be chosen by the data matching unit 120 or by the user, and the first matching similarity calculation is performed on the chosen first columns 11 and 21 to derive the first similarity ranking data set 30.
  • the data matching unit 120 may perform a similarity calculation between each row data of the first data set 10 and all row data of the corresponding column items of the second data set 20.
  • The first matching similarity calculation may be an operation that compares the similarity between the data of the first column item 11 of the first data set 10 and the data of the first column item 21 of the second data set 20.
  • Illustratively, the representative key of the first column item 11 of the first data set 10 is 'parking lot name', and the first row data of the first column item 11 is 'training center'.
  • The representative key of the first column item 21 of the second data set 20 is 'parking lot', and the first column item 21 includes row data such as 'Seoul Station 2 Parking Lot', 'Seoul Station 3 Parking Lot', 'Cheonggyecheon Doosan We've the Zenith', 'First Bank', 'Training Center Park', 'New Shinhae Station (Link to Line 9)', and 'Golden Tower'. That is, the data matching unit 120 compares each row data of the first column item 11 of the first data set 10 with the row data of the first column item 21 of the second data set 20.
  • The first similarity ranking data set 30 may be a data set in which, for each row data included in the first column item 11 of the first data set 10, the row data included in the first column item 21 of the second data set 20 are sorted in descending order of similarity.
  • the first degree of similarity ranking data set 30 according to one embodiment of the present application can be described with reference to FIG. 2D.
  • Column A of FIG. 2D is each row data included in the first column item 11 of the first data set 10.
  • Rank 1, Rank 2, and so on in FIG. 2D are the data included in the first column item 21 of the second data set 20, arranged in descending order of the similarity calculation result with each row data of column A.
  • For each row data of the first column item 21 of the second data set 20, the data matching unit 120 may record in the first similarity ranking data set 30 the row data, the similarity calculation result, and the column index of the row data in the second data set 20.
  • Illustratively, the result of the similarity calculation between 'training center' in the first column item 11 of the first data set 10 and 'Training Center Park' in the first column item 21 of the second data set 20 may be stored as ('Training Center Park', 90, 5). That is, the similarity between 'training center' and 'Training Center Park' is 90, and the column index of 'Training Center Park' in the second data set 20 is 5.
  • the data matching unit 120 may list the operation results in the order of the column indexes arranged in the second data set 20.
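  • The derivation of one row of the first similarity ranking data set can be sketched as below: each candidate from the second data set is scored against a value from the first data set, stored as (row data, similarity, index), and sorted by descending similarity. The function names are hypothetical, and the default scorer uses Python's `difflib.SequenceMatcher` purely as a stand-in; it is not the edit-distance scoring of the patent and will not reproduce its figures.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

# (row data of data set 2, similarity score, index of the row in data set 2)
Entry = Tuple[str, float, int]


def default_sim(a: str, b: str) -> float:
    """Stand-in scorer on a 0..100 scale (an assumption for this sketch)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def first_similarity_ranking(value: str, candidates: List[str],
                             sim: Callable[[str, str], float] = default_sim
                             ) -> List[Entry]:
    """Rank every candidate row of data set 2 against one value of data set 1."""
    scored = [(cand, sim(value, cand), idx) for idx, cand in enumerate(candidates)]
    # sorted() is stable, so equal scores keep the order of the second data set.
    return sorted(scored, key=lambda entry: entry[1], reverse=True)


ranking = first_similarity_ranking(
    "Training Center",
    ["Seoul Station Parking Lot", "Training Center Park", "Golden Tower"],
)
print(ranking[0][0], ranking[0][2])
```

  • Keeping the index alongside the score lets a later step look up the remaining columns of the same row in the second data set.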
  • the data matching unit 120 may derive the second similarity rank order data set 40 by performing a second matching similarity calculation based on the first similarity rank order data set 30.
  • the first similarity ranking data set 30 and the second similarity ranking data set 40 may include ranking information of each data row of the second data set 20.
  • the first similarity ranking data set 30 and the second similarity ranking data set 40 may include respective data rows of the second data set 20 in a ranked order of similarity values.
  • The data matching unit 120 may compare a plurality of column items of the first data set 10 and the second data set 20 based on the first similarity ranking data set 30 to derive the second similarity ranking data set 40.
  • After deriving the first similarity ranking data set 30 by performing the first matching similarity calculation based on the first columns 11 and 21, the data matching unit 120 may perform the second matching similarity calculation, in the order of the ranking, on each row data of the remaining selected column items of the first data set 10 and the second data set 20.
  • In the order of similarity of the first similarity ranking data set 30, the data matching unit 120 may compare the remaining column data of each row data of the second data set 20 included in the first similarity ranking data set 30 with the remaining column data of the corresponding row data of the first data set 10 to calculate the second matching similarity.
  • FIG. 3 (a) may be a process of deriving a second similarity ranking data set 40.
  • In the second matching similarity calculation, the data matching unit 120 may perform the similarity calculation between column items based on the first columns 11 and 21 as shown in (1) of FIG. 3 (a), based on the second columns 12 and 22 as shown in (2) of FIG. 3 (a), and based on the third columns 13 and 23.
  • The data matching unit 120 may perform the first matching similarity calculation against the second data set 20 based on the first row data 'training center' of the first column item 11 of the first data set 10, and the result of the similarity calculation may be derived as the first similarity ranking data set 30 as shown by (1) in FIG. 3 (a).
  • The data matching unit 120 may then perform the similarity calculation between the second column item 12 and third column item 13 located in the same row as the first row data 'training center' of the first column item 11 of the first data set 10 and the second column item 22 and third column item 23 located in the same rows as the first to third data, 'Training Center Park' to 'Seoul Station 2 Parking Lot', of the first column item 21 of the second data set 20.
  • the second matching similarity computation includes a computation of similarity with all the row data of the second data set 20 for each row data of the first data set 10.
  • Illustratively, in the second matching similarity calculation the data matching unit 120 may calculate (1) the similarity between 'training center' in the 'parking lot name' column item of the first data set 10 and 'Training Center Park' in the 'parking lot' column item of the second data set 20, (2) the similarity between the 'parking lot operation start time' data in the 'training center' row of the first data set 10 and the 'parking lot operation start time' data in the 'Training Center Park' row of the second data set 20, and (3) the similarity between the 'parking lot operation end time' data in the 'training center' row of the first data set 10 and the 'parking lot operation end time' data in the 'Training Center Park' row of the second data set 20, and may use these results to determine the second similarity ranking and derive the second similarity ranking data set 40.
  • When deriving the second similarity ranking data set 40, the data matching unit 120 may carry over the data of the first column item 11 and the calculation results of each row data of the first similarity ranking data set 30. Referring to FIG. 3 (b), for 'Training Center Park' the column index in the second data set 20 is 5, the first matching similarity calculation result is 90, and the second matching similarity result (the average of the similarity results) is 96.67.
  • Because the second matching similarity value, calculated by considering the first to third column items, is higher than the first matching similarity value, calculated by considering only the first column item, more accurate data matching can be performed through comparison across a plurality of column items.
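  • The averaging step can be illustrated with a short worked example. The text gives a first-column similarity of 90 and an averaged result of 96.67 over three column items; the two remaining per-column scores of 100 used below are an assumption chosen to be consistent with those figures, and the function name is hypothetical.

```python
from statistics import mean
from typing import Sequence


def second_matching_similarity(column_scores: Sequence[float]) -> float:
    """Average the per-column similarity scores of one candidate row pair."""
    return round(mean(column_scores), 2)


# 90 is the first-column score from the text; the two 100s are assumed
# scores for the operation start time and end time columns.
scores = [90.0, 100.0, 100.0]
print(second_matching_similarity(scores))  # 96.67
```

  • With these inputs the average is (90 + 100 + 100) / 3, which rounds to 96.67, so agreement on the additional columns raises the score of the correct candidate above its single-column value.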
  • the data matching unit 120 may match the data rows of the first data set and the second data set based on the similarity.
  • FIG. 4 may illustrate the result of performing the first matching similarity calculation based on the first column 11 of the first data set 10 and the first column 21 of the second data set 20, showing for each row the data with the highest similarity.
  • The first matching similarity between 'training center' included in the first data set 10 and 'Training Center Park' included in the second data set 20 may be calculated as 90%.
  • For 'Seoul station parking lot', the matching results in the first column 21 of the second data set 20 are different row data with the same similarity value of 93%. That is, a plurality of data, 'Seoul Station 2 parking lot' and 'Seoul Station 3 parking lot', may be matched to a single row data of the first column 11 of the first data set 10.
  • FIG. 4B may show the result of performing the second matching similarity calculation with a second column item added. The similarity calculation can be performed on the added column item based on the result of the first matching similarity calculation performed on the first columns 11 and 21 of the first data set 10 and the second data set 20.
  • the data matching unit 120 may perform a similarity calculation based on a plurality of columns to derive a result of a similarity calculation more accurately than a result of a single column.
  • Although the first matching similarity calculation and the second matching similarity calculation are described above as being performed on the first column and the second column, a plurality of (for example, three) columns may be selected to calculate the similarity value.
  • The above-described columns may be selected by the data matching unit 120, but the present invention is not limited thereto. The column items of the first data set 10 and the column items of the second data set 20 on which the similarity calculation is performed may also be determined by the user.
  • After the first matching similarity calculation is performed, the second matching similarity calculation is performed based on its result, and final data matching is performed by comparing the first matching similarity calculation result with the second matching similarity calculation result.
  • Because the data matching unit 120 performs final data matching by comparing the first matching similarity calculation result with the second matching similarity calculation result, the accuracy of data matching between the first data set 10 and the second data set 20 is improved.
  • The data matching unit 120 may compare the first similarity ranking data set 30 with the second similarity ranking data set 40 and match the data rows of the first data set 10 and the second data set 20.
  • the data matching unit 120 may compare N and N 'in step S510. If the comparison result of N and N 'coincides with each other, the data matching unit 120 may compare a and a' in step S520. The data matching unit 120 determines that the row data of the second data set 20 located at the first rank of the first similarity rank order data set 30 is equal to the row data of the second data set 20 when N and N ' It can be determined that the first row data of the first data set 10 is matched with the first row data of the first data set 10.
  • the data matching unit 120 may associate with the first row data (e.g., A1) of the first data set 10, (N) in the second similarity ranking data set 40 of the row data of the second data set 20 located in the ranking and the second data set (N) located in the first ranking of the second similarity ranking data set 40 20 are all 1 and the rank N 'of the second data set 20 located in the first rank of the first similarity rank order data set 30 is 1
  • the row data a and the row data a 'of the second data set 20 located at the first rank of the second similarity rank order data set 40 coincide with each other, It can be determined that the row data of the second data set 30 located in the first rank matches with the first row data of the first data set 10.
  • the data matching unit 120 determines that 'N' and 'N' match, and 'a' and 'a' can do. That is, in the second similarity rank order data set 40 of the 'Trainee Park' having the rank 1 in the first similarity rank order data set 30 in association with the 'trainee', the ranking is 1 and the second similarity rank order data set 40, the ranking in the first similarity rank order data set 30 of the 'Trainee Park' having rank 1 is 1, so it can be determined that N and N 'match.
  • the data matching unit 120 stores the data of the first data set 10 including the training data and the data of the first data set 10 including the training data and the second data including the training data The row data of the set 20) is matched.
  • The data matching unit 120 may compare a and a' in step S520. If a and a' do not coincide, the row data of the second data set 20 to be matched with the first row data may be determined to be unmatchable.
  • Likewise, when N and N' are both 1 but there are a plurality of row data a and a' of the second data set 20 having those rankings, the data matching unit 120 may determine that the row data of the second data set 20 to be matched with the first row data is unmatchable.
  • for example, the data matching unit 120 may compare the data items included in the second data set 20 in association with 'Opicia', deriving entries such as ('Opicia 1', 89, 13) in the first similarity ranking data set 30 and entries such as ('Opicia 1', 13, 89, 94.5) and ('Opicia 2', 14, 89, 94) in the second similarity ranking data set 40.
  • the data matching unit 120 may determine that the data ranked first in the first similarity ranking data set 30 are 'Opicia 1' and 'Opicia 2' (that is, a plurality of row data share the highest first similarity ranking), and that the data ranked first in the second similarity ranking data set 40 are likewise 'Opicia 1' and 'Opicia 2' (that is, a plurality of row data share the highest second similarity ranking).
  • RANK 1 and RANK 2 may be the number of data, not the order of similarity.
  • since there are a plurality of data that can be matched with 'Opicia', the data matching unit 120 may provide the determination result to the user in order to obtain a more accurate matching result, after which the row data of the second data set 20 to be matched with the first row data of the first data set 10 can be matched. That is, when there are a plurality of row data of the second data set 20 whose first similarity ranking and second similarity ranking are both first in relation to specific row data of the first data set 10, the row data of the second data set 20 matching that specific row data of the first data set 10 cannot be determined.
  • when the row data of the second data set 20 to be matched with the first row data of the first data set 10 cannot be determined, the row data of the second data set 20 to be matched can be determined through correction or selection by the user.
  • the data matching unit 120 may compare N and N' in step S530.
  • when the comparison shows that N is greater than or equal to N', the data matching unit 120 may match the data having rank 1 in the second similarity ranking data set 40 with the first row data of the first data set 10.
  • that is, when the rankings in the first similarity ranking data set 30 of the first row data and the second row data of the second data set 20 are both 1, the ranking in the second similarity ranking data set 40 of the first row data of the second data set 20 is 1, and the ranking in the second similarity ranking data set 40 of the second row data of the second data set 20 is not 1, it can be determined that the first row data of the second data set 20 is matched with the first row data of the first data set 10.
  • the data matching unit 120 may calculate and compare the rank information N and N' of the respective row data a and a'.
  • the data matching unit 120 may compare the rankings N and N' to determine the matched data when N and N' do not match.
  • for example, the first row data of the first data set 10 may be 'Seoul Station parking lot'.
  • the data matching unit 120 may derive, for 'Seoul Station parking lot', entries such as ('Seoul Station 2 parking lot', 93, 1) and ('Seoul Station 3 parking lot', 93, 2) in the first similarity ranking data set 30, and entries such as ('Seoul Station 3 parking lot', 2, 93, 97.67) and ('Seoul Station 2 parking lot', 1, 93, 84.33) in the second similarity ranking data set 40.
  • the data matching unit 120 may determine a as 'Seoul Station 2 parking lot', a' as 'Seoul Station 3 parking lot', N as 2, and N' as 1.
  • the data matching unit 120 may determine that N and N' do not coincide and that N is larger than N', and may therefore match the data located at rank 1 in the second similarity ranking data set 40 ('Seoul Station 3 parking lot') with the first row data of the first data set 10 ('Seoul Station parking lot').
  • the data matching unit 120 may calculate and compare the rank information N and N' of the respective row data a and a'.
  • in association with the first row data of the first data set 10, when the rankings in the first similarity ranking data set 30 of the first row data and the second row data of the second data set 20 are both 1 but their rankings in the second similarity ranking data set 40 differ, the data matching unit 120 may determine that the row data of the second data set 20 having the higher second similarity ranking is matched with the first row data of the first data set 10.
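  • the comparison of steps S510 to S530 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the function name `decide_match`, the status strings, and the handling of the case N < N' (which the description does not address) are hypothetical. The inputs a, a', N and N' are assumed to have already been read from the first and second similarity ranking data sets.

```python
def decide_match(a, a_prime, n, n_prime):
    """Sketch of the S510-S530 comparison (hypothetical helper).

    a       -- row of the second data set ranked first in the first ranking set
    a_prime -- row of the second data set ranked first in the second ranking set
    n       -- rank of `a` in the second similarity ranking data set
    n_prime -- rank of `a_prime` in the first similarity ranking data set
    Returns a (status, matched_row) tuple.
    """
    if n == n_prime:                      # S510: N and N' coincide
        if a == a_prime:                  # S520: a and a' coincide
            return ("matched", a)
        # Several candidates rank first in both sets; defer to the user.
        return ("unmatchable", None)
    if n > n_prime:                       # S530: N >= N' (equality handled above)
        # The row ranked first in the second ranking set is taken as the match.
        return ("matched", a_prime)
    # Assumption: the description gives no rule for N < N'.
    return ("undetermined", None)


# Values taken from the 'Seoul Station parking lot' example: N = 2, N' = 1.
seoul = decide_match("Seoul Station 2 parking lot",
                     "Seoul Station 3 parking lot", 2, 1)

# Tie case from the 'Opicia' example: N = N' = 1 but a and a' differ.
opicia = decide_match("Opicia 1", "Opicia 2", 1, 1)

# Unambiguous case from the 'Trainee Park' example.
trainee = decide_match("Trainee Park", "Trainee Park", 1, 1)
```

  • returning a status alongside the row keeps the tie case explicit, so that a caller can hand the candidates to the user for correction or selection as described above.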
  • FIGS. 6A and 6B are schematic diagrams for explaining an embodiment of data analysis and similar item analysis according to an embodiment of the present invention.
  • the profiling unit 130 may analyze data types and character string patterns of a plurality of column items of the first data set 10 and the second data set 20 through data analysis.
  • the profiling unit 130 may analyze the data of a plurality of data sets included in the database 110 in terms of data type and character string pattern for each column item.
  • the profiling unit 130 may perform an analysis for structuring the unstructured data included in the database 110.
  • the similar item analysis unit 140 may analyze similar items between the first data set 10 and the second data set 20 based on the analysis result of the profiling unit 130.
  • the similar item analysis unit 140 may perform similarity analysis by analyzing the representative keys of the plurality of column items 11 to 14 included in the first data set 10 and of the plurality of column items 21 to 24 included in the second data set 20.
  • for example, the similarity between the representative key of 'parking lot operation start time' included in the first data set 10 and the representative key of 'parking lot operation start time' included in the second data set 20 may be calculated to be 99%.
  • the column corresponding to the 'parking lot operation start time' is included in the first data set 10 and is also included in the second data set 20.
  • the similar item analysis unit 140 may calculate the similarity between 'parking lot name' of the first data set 10 and 'parking lot' of the second data set 20 to be 80%.
  • as a result of the character string pattern analysis, the similar item analysis unit 140 may judge that 'parking lot name' and 'parking lot' use similar character strings but are not the same column item included in both the first data set 10 and the second data set 20.
  • the data matching unit 120 and the similar item analysis unit 140 may perform similarity analysis based on the Levenshtein distance.
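  • a Levenshtein-based similarity of the kind mentioned above can be sketched as follows. The edit-distance routine is the standard dynamic-programming formulation; the conversion to a percentage (dividing by the length of the longer string) is an assumption, since the description does not state how the distance is normalized, and the function names are hypothetical.

```python
def levenshtein(s, t):
    """Return the Levenshtein (edit) distance between strings s and t."""
    if len(s) < len(t):
        s, t = t, s                       # keep the inner row as short as possible
    previous = list(range(len(t) + 1))    # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        current = [i]                     # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_percent(s, t):
    """Assumed normalization: 100 * (1 - distance / length of longer string)."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 100.0                      # two empty strings are identical
    return 100.0 * (1 - levenshtein(s, t) / longest)
```

  • under this normalization identical strings score 100 and strings with no characters in common score 0; other normalizations are equally possible and would give different percentages.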
  • FIG. 7 is a schematic diagram for explaining an embodiment of data merging for big data analysis according to an embodiment of the present invention.
  • the data merge unit 150 may merge the first data set 10 and the second data set 20 based on the matching result of the data row.
  • the data merge unit 150 may merge the first data set 10 and the second data set 20 based on the analysis result of the similar item analysis unit 140.
  • the data merge unit 150 may merge the column items using the matching row data of the first data set 10 and the second data set 20 according to the matching result of the data row.
  • the data merge unit 150 may perform the merging of the first data set 10 and the second data set 20 based on the column items required for the big data analysis.
  • for example, the data merge unit 150 may perform a merge by selecting, from the first data set 10, the column items of the parking lot name 11, the parking lot operation start time 12, the parking lot operation end time 13, and the parking fee 14, and, from the second data set 20, the column items of the parking lot 21, the parking lot operation start time 22, the parking lot operation end time 23, and the address 24.
  • the data merge unit 150 may merge the column items based on the column items selected by the user's decision, but the present invention is not limited thereto.
  • based on the analysis result of the similar item analysis unit 140, the data merge unit 150 may perform the data merge by selecting one of the two column items for each of the parking lot operation start times 12 and 22 and the parking lot operation end times 13 and 23, which overlap between the first data set 10 and the second data set 20.
  • although the data merge unit 150 has been described with reference to column items such as the parking lot name 11, the parking lot operation start time 12, the parking lot operation end time 13, and the parking fee 14, the present invention is not limited thereto, and a data merge may be performed by selecting a plurality of column items among the plurality of column items included in the first data set 10 and the second data set 20.
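  • the column-wise merge described above can be sketched as follows, assuming rows are held as dictionaries and the row matching result is a mapping from row indices of the first data set to row indices of the second. The function name, the sample values (times, fee, address) and the rule that the first data set's value is kept for an overlapping column are all assumptions made for illustration.

```python
def merge_matched_rows(first_rows, second_rows, matches,
                       first_columns, second_columns):
    """Merge matched rows, keeping one copy of any overlapping column."""
    merged = []
    for first_index, second_index in matches.items():
        first_row = first_rows[first_index]
        second_row = second_rows[second_index]
        # Start from the selected columns of the first data set.
        row = {column: first_row[column] for column in first_columns}
        # Add second-data-set columns that are not already present.
        for column in second_columns:
            if column not in row:
                row[column] = second_row[column]
        merged.append(row)
    return merged


# Hypothetical sample rows using the column items named in the description.
first_rows = [{"parking lot name": "Seoul Station parking lot",
               "parking lot operation start time": "09:00",
               "parking lot operation end time": "18:00",
               "parking fee": 1000}]
second_rows = [{"parking lot": "Seoul Station 3 parking lot",
                "parking lot operation start time": "09:00",
                "parking lot operation end time": "18:00",
                "address": "(hypothetical address)"}]

merged = merge_matched_rows(
    first_rows, second_rows, {0: 0},
    ["parking lot name", "parking lot operation start time",
     "parking lot operation end time", "parking fee"],
    ["parking lot", "parking lot operation start time",
     "parking lot operation end time", "address"])
```

  • the overlapping start and end time columns appear once in each merged row, which corresponds to selecting one of the two column items for each overlapping pair.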
  • FIG. 8 is a flowchart illustrating a data merging method according to an embodiment of the present invention.
  • the data merge method shown in FIG. 8 is performed by the data merge apparatus 100 described with reference to FIGS. 1 through 7. Therefore, even if omitted below, the description of the data merge apparatus 100 given with reference to FIGS. 1 through 7 also applies to FIG. 8.
  • the data merge apparatus 100 may determine a first data set 10 and a second data set 20 to perform data matching among a plurality of data sets.
  • in step S802, the data merge apparatus 100 may derive a first similarity ranking data set 30 by performing a first matching similarity calculation based on the first column of the first data set 10 and the second data set 20.
  • in step S803, the data merge apparatus 100 may derive a second similarity ranking data set 40 by performing a second matching similarity calculation based on the first similarity ranking data set 30.
  • in step S804, the data merge apparatus 100 may compare the first similarity ranking data set 30 and the second similarity ranking data set 40 and match each data row of the first data set 10 with the corresponding data row of the second data set 20.
  • in step S805, the data merge apparatus 100 may merge the first data set 10 and the second data set 20 based on the matching result.

Abstract

According to the invention, a data merging apparatus for big data analysis may comprise: a database that includes a plurality of data sets; a data matching unit that determines a first data set and a second data set on which to perform data matching from among the plurality of data sets included in the database, then selects a plurality of column items from among the column items of the first data set and the second data set in order to calculate a similarity between the selected column items and to match the data rows of the first data set with the second data set based on the similarity; and a data merging unit that merges the first data set and the second data set based on the matching result of the data rows.
PCT/KR2018/012358 2017-11-10 2018-10-18 Dispositif de fusion de données et procédé d'analyse de données volumineuses WO2019093675A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2017-0149691 2017-11-10
KR1020170149691A KR102033151B1 (ko) 2017-11-10 2017-11-10 빅데이터 분석을 위한 데이터 병합 장치 및 방법

Publications (1)

Publication Number Publication Date
WO2019093675A1 (fr) 2019-05-16

Family

ID=66438462

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2018/012358 WO2019093675A1 (fr) 2017-11-10 2018-10-18 Dispositif de fusion de données et procédé d'analyse de données volumineuses

Country Status (2)

Country Link
KR (1) KR102033151B1 (fr)
WO (1) WO2019093675A1 (fr)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102345410B1 (ko) 2019-11-19 2021-12-30 주식회사 피씨엔 빅데이터 지능형 수집 방법 및 장치
KR102378737B1 (ko) * 2020-01-16 2022-03-28 (주) 에스아이 오토 금융 서비스 자동차 견적제공 시스템
KR102541934B1 (ko) 2020-11-20 2023-06-12 주식회사 피씨엔 빅데이터 증강분석 프로파일링 시스템
KR102640444B1 (ko) 2021-10-15 2024-02-27 주식회사 피씨엔 빅데이터 신뢰성과 활용성 극대화를 위한 빅데이터 증강분석 프로파일링 방법 및 장치
KR102483584B1 (ko) * 2021-12-03 2023-01-02 한국과학기술정보연구원 표준 항목명을 이용한 데이터셋 관리 방법, 그리고 이를 구현하기 위한 장치
KR102513676B1 (ko) * 2022-05-30 2023-03-24 한국과학기술정보연구원 데이터 분석 시스템 및 그 방법

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006099236A (ja) * 2004-09-28 2006-04-13 Toshiba Corp 分類支援装置、分類支援方法及び分類支援プログラム
JP4997856B2 (ja) * 2006-07-19 2012-08-08 富士通株式会社 データベース分析プログラム、データベース分析装置、データベース分析方法
KR20150077669A (ko) * 2013-12-30 2015-07-08 충남대학교산학협력단 맵리듀스 방식을 이용한 데이터 분석 방법 및 시스템
JP5785617B2 (ja) * 2010-09-14 2015-09-30 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation データ・セットを取り扱うための方法及び構成、データ処理プログラム及びコンピュータ・プログラム製品
KR20160099540A (ko) * 2013-12-20 2016-08-22 페이스북, 인크. 다양한 소셜 네트워킹 시스템에 의해 관리되는 사용자 프로필 정보의 조합

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8122045B2 (en) * 2007-02-27 2012-02-21 International Business Machines Corporation Method for mapping a data source to a data target

Also Published As

Publication number Publication date
KR102033151B1 (ko) 2019-10-16
KR20190053616A (ko) 2019-05-20

Similar Documents

Publication Publication Date Title
WO2019093675A1 (fr) Dispositif de fusion de données et procédé d'analyse de données volumineuses
CN109697162B (zh) 一种基于开源代码库的软件缺陷自动检测方法
WO2019103183A1 (fr) Dispositif d'évaluation d'entreprise se basant sur des critères esg et son procédé de fonctionnement
WO2021194056A1 (fr) Procédé de formation d'un réseau d'apprentissage profond basé sur une intelligence artificielle et dispositif d'apprentissage l'utilisant
WO2020111314A1 (fr) Appareil et procédé d'interrogation-réponse basés sur un graphe conceptuel
WO2011162446A1 (fr) Module et procédé permettant de décider une entité nommée d'un terme à l'aide d'un dictionnaire d'entités nommées combiné avec un schéma d'ontologie et une règle d'exploration
WO2020040537A1 (fr) Système de recherche d'informations de statut de dispositions de construction selon un système de classification de dispositions de construction, et procédé associé
WO2021049706A1 (fr) Système et procédé de réponse aux questions d'ensemble
WO2018088664A1 (fr) Dispositif de détection automatique d'erreur de corpus d'étiquetage morphosyntaxique au moyen d'ensembles approximatifs, et procédé associé
WO2018212396A1 (fr) Procédé, dispositif et programme informatique pour analyser des données
WO2021251558A1 (fr) Appareil, système et procédé de classification de données pour une recherche d'essai clinique
WO2020138607A1 (fr) Procédé et dispositif pour fournir une question et une réponse à l'aide d'un agent conversationnel
WO2017131325A1 (fr) Système et procédé de vérification et de correction de base de connaissances
WO2022265480A1 (fr) Procédé et dispositif d'analyse d'interactions entre des médicaments
WO2012144683A1 (fr) Procédé et dispositif pour évaluer un stade prometteur à l'aide d'un cycle de vie de technologie prometteuse
WO2022080583A1 (fr) Système de prédiction de données de bloc de bitcoin basé sur un apprentissage profond prenant en compte des caractéristiques de distribution en série chronologique
WO2012046904A1 (fr) Procédé et dispositif pour fournir des informations de recherche à partir de ressources multiples
WO2015126058A1 (fr) Procédé de prévision du pronostic d'un cancer
WO2012144685A1 (fr) Procédé et dispositif de visualisation du développement de technologie
CN109240903A (zh) 一种自动评估的方法和装置
WO2022114447A1 (fr) Procédé de fourniture de données d'essai clinique similaires et serveur l'exécutant
WO2019107840A1 (fr) Dispositif et procédé permettant de détecter une déclaration de sinistre frauduleuse sur la base de l'intelligence artificielle
WO2013172500A1 (fr) Appareil et procédé de détermination de la ressemblance entre des phrases basées sur l'identification de paraphrases
WO2022186539A1 (fr) Procédé et appareil d'identification de célébrité basés sur la classification d'image
WO2012144684A1 (fr) Procédé et dispositif de prédiction de vitesse de développement d'une technologie

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18875074

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18875074

Country of ref document: EP

Kind code of ref document: A1