WO2021047373A1 - Column data processing method, device and medium based on big data - Google Patents


Info

Publication number: WO2021047373A1 (PCT/CN2020/110364, CN2020110364W)
Authority: WIPO (PCT)
Prior art keywords: column, data, column data, similarity, remarks
Prior art date
Application number: PCT/CN2020/110364
Other languages: English (en), French (fr)
Inventor: 李光跃 (Li Guangyue)
Original Assignee: 星环信息科技(上海)股份有限公司 (Transwarp Information Technology (Shanghai) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 星环信息科技(上海)股份有限公司
Publication of WO2021047373A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval of structured data, e.g. relational data
    • G06F 16/24: Querying
    • G06F 16/245: Query processing
    • G06F 16/2458: Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F 16/2462: Approximate or statistical queries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval of structured data, e.g. relational data
    • G06F 16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F 16/284: Relational databases
    • G06F 16/285: Clustering or classification

Definitions

  • the present disclosure relates to data processing technology, for example, to a column data processing method, device, and medium based on big data.
  • Calculating the similarity between data items can help staff find subject data similar to the data being analyzed.
  • In the related art, unsupervised learning is used to cluster data, and the similarity between data is calculated from features such as data overlap, overlap of distinct or unique values, pattern matching, and name matching.
  • Because this approach clusters all of the data with unsupervised learning, it involves a large amount of calculation, and the accuracy of the calculated similarity results is not high.
  • the present disclosure provides a column data processing method, device and medium based on big data, which can obtain the similarity of column data pairs with higher accuracy and reduce the amount of calculation.
  • A column data processing method based on big data is provided, the method including: obtaining a column data set to be processed, and classifying the column data according to the data attributes of the column data in the column data set to obtain at least two initial column data sets; performing unsupervised clustering on each initial column data set to obtain at least two unsupervised clusters; generating multiple column data pairs according to the at least two unsupervised clusters, and determining the column name similarity and column remark similarity between the two column data in each pair; and determining, according to the column name similarity and the column remark similarity, the similarity of each column data pair.
  • A computer device is also provided, including a processor and a memory, where the memory is used to store instructions, and when the instructions are executed, the processor is caused to perform the operations of the above method, by which the similarity of each column data pair is determined.
  • A computer-readable storage medium is also provided, the storage medium being used to store instructions, the instructions being used to execute the above method, by which the similarity of each column data pair is determined.
  • FIG. 1 is a flowchart of a method for processing column data based on big data in the first embodiment of the present invention
  • FIG. 2 is a schematic diagram of an application scenario in Embodiment 1 of the present invention.
  • FIG. 3 is a flowchart of a method for calculating column name similarity in the second embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of a column data processing device based on big data in the fourth embodiment of the present invention.
  • Fig. 8 is a schematic structural diagram of a computer device in the fifth embodiment of the present invention.
  • Some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart describes multiple operations (or steps) as sequential processing, many of the operations can be implemented in parallel, concurrently, or simultaneously, and the order of the operations can be rearranged. The processing may be terminated when its operations are completed, but may also have additional steps not included in the drawings. The processing may correspond to methods, functions, procedures, subroutines, subprograms, and so on.
  • column data refers to data stored in the database in a column storage manner, where the amount of data included in each column is not fixed.
  • data attribute of column data is the meta information of column data, and the column type of column data is included in the meta information.
  • An "initial column data set" refers to a set obtained by grouping the column data in the column data set according to their data attributes; for example, a numeric initial column data set, a character initial column data set, and a time initial column data set can be obtained.
  • An "unsupervised cluster" as used herein refers to a classification result of column data obtained by performing unsupervised clustering on an initial column data set.
  • "Similarity" as used herein refers to the degree of similarity between two columns of data: the more similar the two columns, the greater the similarity. Correspondingly, "column name similarity" refers to the degree of similarity between the column names of the two columns of data.
  • "Column remark similarity" refers to the degree of similarity between the column remarks of two column data. Column remarks are used to understand the attributes of the column data; a column of data may or may not have remarks.
  • A "column data pair" as used herein can be composed of any two column data, and can also be formed from any two unsupervised clusters. Correspondingly, the "first column name" is the name of the first column data or of the first unsupervised cluster; the "second column name" is the name of the second column data or of the second unsupervised cluster; the "first column remark" is the remark of the first column data or of the first unsupervised cluster; and the "second column remark" is the remark of the second column data or of the second unsupervised cluster.
  • In the related art, unsupervised learning is used to cluster column data, and the similarity between column data is calculated from features such as column data overlap, overlap of distinct or unique values, pattern matching, and name matching.
  • Although this method can calculate the similarity between column data, it clusters the column data with unsupervised learning, so the amount of calculation is large and the accuracy of the calculated similarity results is not high.
  • The embodiment of the present invention adopts a method of calculating column data similarity that can reduce the amount of calculation and improve the accuracy of the calculated column similarity.
  • The embodiment of the present invention obtains the column data set to be processed and classifies it according to the data attributes of the column data in the set, obtaining at least two initial column data sets; performs unsupervised clustering on the initial column data sets to obtain at least two unsupervised clusters; generates multiple column data pairs from the at least two unsupervised clusters and determines the column name similarity and column remark similarity between the two column data in each pair; and determines, according to the column name similarity and column remark similarity, the similarity of each column data pair.
  • In this way the amount of calculation can be greatly reduced; at the same time, determining the similarity of column data from the column name similarity and column remark similarity of each column data pair can improve the accuracy of the calculated column data similarity.
  • Figure 1 is a flowchart of a column data processing method based on big data in the first embodiment of the present invention. This embodiment is applicable to processing a large amount of column data in an enterprise.
  • The method can be executed by a column data processing device, which can be implemented by software and/or hardware and integrated in the device that executes the method. The device that executes the method may be a computer, a tablet, a mobile phone, or another smart terminal. Referring to Figure 1, the method includes the following steps.
  • Step 110 Obtain a column data set to be processed, and classify the column data according to the data attributes of the column data in the column data set to obtain at least two initial column data sets.
  • When storing data in a database, the data may be stored by rows or by columns. Row storage without indexes consumes a large number of input/output operations, and building indexes and materialized views takes significant time and resources; moreover, to meet query performance requirements the database must be expanded substantially. With column storage, because the selection rules in a query are defined by columns, the entire database is effectively self-indexing. Column storage also aggregates and stores the data of each field, which greatly reduces the amount of data read when a query needs only a few fields.
  • the column data involved in the embodiment of the present invention is processed in units of columns, and each column data can contain one or more data.
  • The amount of data read can thus be greatly reduced, which makes subsequent data processing operations more convenient.
  • the column data processing method involved in the embodiment of the present invention can also calculate the similarity of the row data. In order to facilitate the description of the embodiment of the present invention, only the column data is introduced as an example in the embodiment of the present invention.
  • the column data to be processed is stored in a columnar storage database, and all column data stored in the columnar storage database is called a column data set.
  • the column data may be classified according to the data attributes of the column data in the column data set to obtain at least two initial column data sets.
  • Classifying the column data according to the data attributes of the column data in the column data set includes: obtaining metadata of the column data in the column data set, where the metadata includes the column type of the column data; and classifying the column data according to its column type.
  • the column type can be at least one of a character type, a numeric type, and a time type.
  • the metadata of the column data may also include the column name, column remarks, or statistical information of the column.
  • If the column type of the column data is the character type, the statistical information of the column data can be the shortest length, the longest length, the average length, and/or the most frequent string length; if the column type is the numeric type, the statistical information can be the maximum value, minimum value, and/or average value of the column data.
  • At least two initial column data sets can be obtained.
  • a numeric initial column data set, a character initial column data set, and a time initial column data set can be obtained.
  • an initial column data set consistent with the data type of the column can also be obtained.
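Step 110 can be sketched as a small grouping routine. The `Column` class and the type names `"numeric"`, `"character"`, and `"time"` below are illustrative assumptions, not part of the disclosure; the point is only that columns are grouped into initial column data sets keyed by the column type stored in their metadata.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str
    col_type: str          # assumed type labels: "numeric", "character", "time"
    remarks: str = ""
    values: list = field(default_factory=list)

def build_initial_sets(columns):
    """Group columns into initial column data sets keyed by column type."""
    initial_sets = defaultdict(list)
    for col in columns:
        initial_sets[col.col_type].append(col)
    return dict(initial_sets)

cols = [
    Column("age", "numeric", values=[23, 41, 35]),
    Column("user_name", "character", values=["amy", "bob"]),
    Column("created_at", "time", values=["2020-01-01"]),
    Column("salary", "numeric", values=[5000, 7000]),
]
sets_ = build_initial_sets(cols)
# Three initial column data sets: numeric (2 columns), character (1), time (1)
```

Each resulting set then feeds the unsupervised clustering of Step 120 independently of the others.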
  • Step 120 Perform unsupervised clustering processing on each of the at least two initial column data sets to obtain at least two unsupervised clusters, wherein the at least two unsupervised clusters correspond to the at least two initial column data sets one to one.
  • unsupervised clustering is performed on the initial column data set obtained in step 110 to obtain at least two unsupervised clusters corresponding to each initial column data set.
  • If the initial column data set is a numerical initial column data set, at least two unsupervised clusters corresponding to the numerical initial column data set can be obtained. The following introduces how to perform unsupervised clustering on an initial column data set.
  • If the initial column data set is a numerical initial column data set, the statistical indicators of each column of data can be calculated, including the maximum value a1, the minimum value a2, and the average value a3; the column feature of a column of numerical data is then expressed as [a1, a2, a3].
  • If the initial column data set is a character-type initial column data set, the minimum string length b1, the maximum string length b2, the average length b3, and the most frequent string length b4 of each column of character data are used as statistical information, so the column feature of a column of character data can be expressed as [b1, b2, b3, b4].
  • Assuming there are m columns of character data and their feature indicators are calculated, a column feature matrix of m rows is obtained, with one row [b1, b2, b3, b4] per column of data.
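The character-column features above can be sketched directly; stacking one [b1, b2, b3, b4] row per column yields the m x 4 feature matrix that the unsupervised clustering operates on. The helper names are assumptions for illustration.

```python
from collections import Counter

def char_column_features(values):
    """[b1, b2, b3, b4]: min, max, average, and most frequent string length."""
    lengths = [len(v) for v in values]
    b1 = min(lengths)
    b2 = max(lengths)
    b3 = sum(lengths) / len(lengths)
    b4 = Counter(lengths).most_common(1)[0][0]  # length with highest frequency
    return [b1, b2, b3, b4]

def feature_matrix(columns):
    # One row per column of character data; the result is an m x 4 matrix.
    return [char_column_features(vals) for vals in columns]

M = feature_matrix([
    ["cat", "horse", "ox", "dog"],   # string lengths 3, 5, 2, 3
    ["alice", "bob", "carol"],       # string lengths 5, 3, 5
])
# M[0] == [2, 5, 3.25, 3]
```

Any standard unsupervised clustering algorithm (e.g. k-means, which is an assumed choice here since the text does not name one) can then be applied to the rows of this matrix.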
  • Step 130 Generate a plurality of column data pairs according to the at least two unsupervised clusters, and determine the column name similarity and the column remark similarity between the two column data in each column data pair.
  • After step 120, at least two unsupervised clusters are obtained. Within each unsupervised cluster, the column data are combined in pairs to obtain multiple column data pairs. Column data pairs are unordered: the pair 12 composed of column data 1 and column data 2 and the pair 21 composed of column data 2 and column data 1 are the same column data pair. After generating the column data pairs, the column name similarity and column remark similarity of each pair are determined.
  • If steps 110 and 120 are not performed and column data pairs are generated directly from the obtained column data set, then a set of 100,000 column data yields nearly 5 billion column data pairs; that is, the column name similarity and column remark similarity must be calculated for about 4,999,950,000 pairs to obtain all similar data. If instead the 100,000 column data are processed by steps 110 and 120 to obtain 400 unsupervised clusters, and assuming the column data are evenly distributed so that each cluster holds 250 column data, these 400 clusters generate only 400 × C(250, 2) = 12,450,000 column data pairs. The solution of the embodiment of the present invention can therefore greatly reduce the amount of calculation, and the more uniform the data across clusters, the more obvious the reduction.
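The pair-count reduction above can be checked directly: compare all pairs over 100,000 columns against pairs formed only within 400 uniform clusters of 250 columns each.

```python
from math import comb

all_pairs = comb(100_000, 2)          # pairs without clustering
clustered_pairs = 400 * comb(250, 2)  # pairs within 400 clusters of 250 columns

print(all_pairs)        # 4999950000, i.e. nearly 5 billion
print(clustered_pairs)  # 12450000, about 12.45 million
print(all_pairs // clustered_pairs)  # roughly a 400x reduction
```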
  • Step 140 Determine the similarity of each column data pair according to the similarity of the column name and the similarity of the column remarks.
  • The column name similarity and column remark similarity of each column data pair can be obtained through step 130; the column name similarity of the column data is recorded as s_col, and the column remark similarity as s_com. The similarity S of the column data pair can then be calculated by the following formula.
  • The technical solution of this embodiment obtains the column data set to be processed and classifies it according to the data attributes of the column data to obtain at least two initial column data sets; performs unsupervised clustering on the initial column data sets to obtain at least two unsupervised clusters; generates multiple column data pairs from the at least two unsupervised clusters and determines the column name similarity and column remark similarity between the two column data in each pair; and determines the similarity of each column data pair from the column name similarity and column remark similarity. This obtains column data pair similarity results with higher accuracy while reducing the amount of calculation.
  • Figure 2 illustrates a system to which an embodiment of the present invention can be applied.
  • The data catalog system sends the metadata of column data to the column similarity back-end service; any change in the metadata of column data triggers recalculation of the similarity of the columns related to the changed column data.
  • After the column similarity back-end service receives the column data metadata change sent by the data catalog system, it writes the column data metadata into the similarity back-end database; the metadata of column data can then be queried through the similarity back-end database.
  • the column similarity back-end service can also send the column data similarity calculation task to the task scheduling service.
  • After the task scheduling service receives the request, it calculates the similarity of the column data through a distributed computing engine, which breaks the column data similarity calculation into tasks. In the first stage of the first task, the column data are classified to obtain the initial column data sets, unsupervised clustering is performed on the initial column data sets to obtain unsupervised clusters, and multiple column data pairs are generated from the at least two resulting unsupervised clusters. In the second stage of the first task, the column names and column remarks of the first generated column data pair are preprocessed to obtain its column name similarity and column remark similarity, the column similarity of the first pair is obtained from these two similarities, and the result is saved in the similarity back-end database. At the same time, the second through nth tasks in the distributed computing engine process the second through nth column data pairs in the same way.
  • The n in the nth task is not fixed and is related to the number of generated column data pairs. For example, if 100 column data pairs are generated, the nth task is the hundredth task.
  • the column similarity back-end service can query the meta-information of the column data and the similarity of the column data in the similarity back-end database in real time, and can also query the task status of the task scheduling service.
  • The distributed computing engine can obtain the similarity of each column data pair concurrently through different tasks, and the similarities are saved in the similarity back-end database through the column similarity back-end service, so the similarity of column data pairs can be queried in real time. This not only obtains column data pair similarity results with higher accuracy, but also reduces the amount of calculation and the calculation time.
  • Fig. 3 is a flowchart of calculating the similarity of column names between two column data in a column data pair involved in the second embodiment of the present invention.
  • This embodiment elaborates on the above embodiment. Determining the column name similarity between the two column data in each column data pair can include: obtaining the first column name and second column name corresponding to the two column data in the pair; calculating the column name edit distance between the first column name and the second column name; obtaining the first column noun vector table corresponding to the first column name and the second column noun vector table corresponding to the second column name, and calculating the column name semantic distance between the two noun vector tables; and determining the column name similarity between the two column data in the pair according to the column name edit distance and the column name semantic distance.
  • the method includes:
  • Step 310 Obtain the first column name and the second column name corresponding to the two column data in the column data pair.
  • each column data pair contains two column data, which are respectively recorded as the first column data and the second column data.
  • the column name of the first column data is recorded as the first column name.
  • the column name of the second column of data is recorded as the second column name.
  • The column name of the column data needs to be standardized first; that is, the column name is expanded into independent words. For example, a column name composed of the words "my", "first", and "name" is expanded into [my, first, name]. Since numbers have no effect on column name similarity, any digits appearing in a column name are ignored, that is, deleted from the column data name.
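The standardization step above can be sketched as a small tokenizer: split the column name into independent lowercase words and drop digits, which the text says have no effect on column name similarity. The regex-based handling of underscores and camelCase boundaries is an assumed implementation detail.

```python
import re

def normalize_column_name(name):
    # Break camelCase boundaries with a space, then keep only alphabetic runs;
    # digits and separators such as "_" fall out automatically.
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)
    return [w.lower() for w in re.findall(r"[A-Za-z]+", spaced)]

print(normalize_column_name("my_first_name"))  # ['my', 'first', 'name']
print(normalize_column_name("userAddress2"))   # ['user', 'address']
```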
  • A word vector model is then used to convert the English words in the column data names into word vectors.
  • The word list of the first column name is recorded as A = [a1, a2, ..., an], and the word list of the second column name as B = [b1, b2, ..., bm], where a1, a2, ..., an are the words of the first column name, b1, b2, ..., bm are the words of the second column name, n is the number of words in the word list of the first column name, and m is the number of words in the word list of the second column name. The corresponding noun vector tables are [Va1, Va2, ..., Van′] and [Vb1, Vb2, ..., Vbm′], where n′ is the number of word vectors in the first column noun vector table, m′ is the number of word vectors in the second column noun vector table, and n′ ≤ n and m′ ≤ m, since words without a word vector are omitted; Va1, Va2, ..., Van′ are the word vectors corresponding to the words of the first column name A, and Vb1, Vb2, ..., Vbm′ are the word vectors corresponding to the words of the second column name B.
  • Step 320 Calculate the column name edit distance between the first column name and the second column name.
  • Edit distance refers to the minimum number of editing operations required to convert one string into the other; the allowable editing operations are replacing one character with another, inserting a character, and deleting a character. The edit distance is therefore not confined to a fixed range. To standardize it to the range [0, 1], the edit distance is redefined by the following formula:
  • s(x) is the sigmoid function, used to normalize the edit distance to the range [0, 1];
  • x is the edit distance;
  • L_max is the maximum number of characters in the two strings;
  • d is the original edit distance of the two strings, i.e., the minimum number of editing operations;
  • h′ is the custom edit distance, i.e., the column name edit distance in this embodiment; the larger h′ is, the more similar the two strings are.
  • f(a_i, B) = max(g(a_i, b_1), g(a_i, b_2), ..., g(a_i, b_m))
  • s_2 represents the edit distance between the first column name and the second column name;
  • f(a_i, B) represents the maximum edit distance between the i-th word a_i of the first column name and all words of the second column name;
  • g(a_i, b_j) represents the edit distance between the i-th word of the first column name and the j-th word of the second column name;
  • L_A is the length of the first column name.
  • g(a_i, b_j) = max(h′(a_i, b_j), h′(a_i1, b_j), h′(a_i2, b_j), ..., h′(a_ik, b_j))
  • a_ik represents the k-th synonym of the i-th word a_i in the first column name, and h′(a_ik, b_j) represents the edit distance between a_ik and b_j.
  • If a_i does not appear in the word vector model (that is, it may not be a word, or it is misspelled), its synonyms are not considered.
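The building block of the step above is the word-level edit distance h′. The exact sigmoid-based normalization formula is elided in the text, so this sketch uses the plain Levenshtein distance d and one plausible normalization, h′ = 1 - d / L_max, chosen only to match the stated properties: h′ lies in [0, 1] and is larger for more similar strings.

```python
def levenshtein(a, b):
    """Minimum number of insert/delete/replace operations turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete
                           cur[j - 1] + 1,               # insert
                           prev[j - 1] + (ca != cb)))    # replace
        prev = cur
    return prev[-1]

def h_prime(a, b):
    # Assumed normalization: 1 - d / L_max, in [0, 1], larger = more similar.
    d = levenshtein(a, b)
    l_max = max(len(a), len(b)) or 1
    return 1 - d / l_max

print(levenshtein("name", "names"))  # 1
print(h_prime("name", "name"))       # 1.0
```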
  • Step 330 Calculate the semantic distance of the column names between the noun vector table in the first column and the noun vector table in the second column.
  • the semantic distance of the column names between the noun vector table in the first column and the noun vector table in the second column can be calculated according to the following formula.
  • s_1 is the column name semantic distance between the first column noun vector table and the second column noun vector table;
  • L_Av and L_Bv are the numbers of word vectors in the first column noun vector table A and the second column noun vector table B;
  • V_A is the mean of the word vectors of the first column name A;
  • V_B is the mean of the word vectors of the second column name B.
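The semantic-distance formula itself is elided above; what survives is that it compares the mean word vector V_A of the first column name with the mean V_B of the second. This sketch computes the two means and compares them with cosine similarity, which is an assumed choice of vector comparison, not the patent's stated formula.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length word vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv)

V_A = mean_vector([[1.0, 0.0], [0.0, 1.0]])  # word vectors of the first name
V_B = mean_vector([[1.0, 1.0]])              # word vectors of the second name
print(V_A)               # [0.5, 0.5]
print(cosine(V_A, V_B))  # approximately 1.0 (same direction)
```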
  • Step 340 Determine the similarity of the column names between the two column data in the column data pair according to the column name edit distance and the column name semantic distance.
  • step 320 obtains the column name edit distance s 2 between the first column name and the second column name
  • step 330 obtains the column name semantic distance s 1 between the first column name and the second column name
  • the ranges of column name edit distance s 2 and column name semantic distance s 1 are both [0,1].
  • this embodiment adopts a piecewise function to calculate the column name similarity, and the column name similarity s col can be calculated according to the following formula.
  • In this embodiment, the first column name and second column name corresponding to the two column data in a column data pair are obtained and processed to obtain column noun vectors; the column name edit distance and column name semantic distance between the two names are calculated from these vectors; and a piecewise function assigns different weights to the edit distance and semantic distance under different conditions, yielding a column name similarity with higher accuracy.
  • Determining the column remark similarity between the two column data can include: obtaining the first column remark and second column remark corresponding to the two column data in the pair; calculating the column remark edit distance between the first column remark and the second column remark; performing word segmentation on the two remarks to obtain the first column remark word vector table corresponding to the first column remark and the second column remark word vector table corresponding to the second column remark; calculating the column remark semantic distance between the two word vector tables; and determining the column remark similarity between the two column data according to the column remark edit distance and the column remark semantic distance. Referring to Figure 4, the method includes:
  • Step 410 Obtain the first column remarks and the second column remarks corresponding to the two column data in the column data pair.
  • the column remarks can reflect the main content of the column data corresponding to the column remarks, and the user needs to define the column remarks. Therefore, there may be a phenomenon that a column of data does not have a column remark.
  • the remarks corresponding to the first column of data in the column data pair are recorded as the first column remarks, and the column remarks corresponding to the second column of data are recorded as the second column remarks.
  • If either column of data lacks a column remark, the column remark similarity between the two column data in the pair is determined to be 0; that is, if the first column of data contains no remark, the second column contains no remark, or neither contains a remark, the column remark similarity of the pair can be directly set to 0.
  • The word vector corresponding to each word in the column remarks can be obtained by querying AI Lib, using the Directional Skip-Gram (DSG) algorithm to train a public word vector data set.
  • Step 420 Calculate the column remark edit distance between the first column of remarks and the second column of remarks.
  • The calculation of the column remark edit distance between the first column remark and the second column remark is the same as the calculation of the column name edit distance between the first column name and the second column name, and is not repeated here.
  • the edit distance of the column remarks between the first column of remarks and the second column of remarks can be obtained as s 4 .
  • Step 430 Calculate the column remark semantic distance between the first column of remark word vector table and the second column of remark word vector table.
  • the semantic distance of column remarks between the remark word vector table in the first column and the remark word vector table in the second column can be calculated by the following formula.
  • s_3 is the column remark semantic distance;
  • L_Cv and L_Dv are the numbers of word vectors in the first column remark word vector table and the second column remark word vector table;
  • V_C is the mean of the first column remark word vector table;
  • V_D is the mean of the second column remark word vector table.
  • Step 440 Determine the column remark similarity between the two column data in the column data pair according to the column remark edit distance and the column remark semantic distance.
  • step 420 obtains the column remark edit distance s 4 between the first column of remarks and the second column of remarks
  • step 430 obtains the column remarks semantic distance s 3 between the first column of remarks and the second column of remarks
  • The ranges of the column remark edit distance s_4 and the column remark semantic distance s_3 are both [0, 1].
  • this embodiment adopts a piecewise function to calculate the column remark similarity, and the column remark similarity s com can be calculated according to the following formula.
  • In this embodiment, the first column remark and second column remark corresponding to the two column data in a column data pair are obtained and processed to obtain column remark word vectors; the column remark edit distance and column remark semantic distance between the two remarks are calculated from these vectors; and a piecewise function assigns different weights to the two distances under different conditions, yielding a column remark similarity with higher accuracy.
  • Figure 5 is a flowchart of calculating column edit distance, where column edit distance includes column name edit distance and column remark edit distance.
  • the column names and column remarks are processed to obtain their corresponding word vectors. The lengths of the two tables are compared: the longer is denoted table A and the shorter table B, where table A corresponds to the first column and table B to the second. For the i-th word in table A, it is checked whether a word vector exists; if so, its k synonyms are looked up, giving A i=[a i, a i1, ..., a ik]. The edit distance between every word in A i and every word in table B is then computed, and the maximum is selected to obtain the column edit distance.
  • the column edit distance is obtained by determining the maximum edit distance, which improves the accuracy of solving the edit distance.
  • FIG. 6 is a flowchart of calculating column similarity in an embodiment of the present invention.
  • the column names of the two columns contained in the column data pair are processed to obtain column-name word vectors, and from these word vectors
  • the semantic distance and edit distance of the column names are calculated, giving the column-name similarity; for the column-remark similarity, it is first judged whether both columns in the column data pair contain remarks.
  • if one of the two columns contains no remarks, the column-remark similarity between the two columns is 0; if both columns contain remarks, the remarks are processed into remark word vectors, the semantic distance and edit distance of the remarks are computed from those word vectors, and the column-remark similarity is calculated; the column similarity is then determined from the column-name similarity and the column-remark similarity.
  • FIG. 7 is a schematic structural diagram of a column data processing device based on big data provided in the fourth embodiment of the present invention.
  • the device can be implemented by software and/or hardware, and can execute the big-data-based column data processing method described in any embodiment of the present disclosure.
  • the device includes: a column data set acquisition module 710, an unsupervised clustering processing module 720, a column data pair generation module 730, and a column data pair similarity determination module 740.
  • The column data set acquisition module 710 is configured to obtain the column data set to be processed and to classify the column data according to the data attributes of the column data in the set, obtaining at least two initial column data sets;
  • The unsupervised clustering processing module 720 is configured to perform unsupervised clustering on each of the at least two initial column data sets, obtaining at least two unsupervised clusters, where the at least two unsupervised clusters correspond one-to-one with the at least two initial column data sets;
  • The column data pair generation module 730 is configured to generate multiple column data pairs corresponding to the at least two unsupervised clusters, and to determine the column-name similarity and column-remark similarity between the two columns in each pair;
  • The column data pair similarity determination module 740 is configured to determine the similarity of each column data pair according to the column-name similarity and the column-remark similarity.
  • In this technical solution, at least two initial column data sets are obtained through the column data set acquisition module; at least two unsupervised clusters are obtained by applying unsupervised clustering to the initial sets through the unsupervised clustering processing module; column data pairs are generated by the column data pair generation module, and the column-name similarity and column-remark similarity of each pair are calculated; and the similarity of each column data pair is obtained through the column data pair similarity determination module.
  • This embodiment can obtain a similarity result of a column data pair with a higher accuracy rate and can reduce the amount of calculation.
  • the column data set obtaining module 710 may further include a column data meta-information obtaining unit, configured to obtain the meta-information of the column data in the column data set, where the meta-information includes the column type of the column data, and to classify the column data according to its column type.
  • the column type includes at least one of the following: character type, numeric type, and time type.
  • the column data pair generation module 730 is further configured to combine the column data in each of the at least two unsupervised clusters in pairs to obtain multiple column data pairs.
  • the column data pair generation module 730 further includes a column name similarity calculation unit, where the column name similarity calculation unit is configured to: obtain the first column name and the second column name corresponding to the two columns in the column data pair; calculate the column-name edit distance between the first column name and the second column name; calculate the column-name semantic distance between the first-column-name word-vector table and the second-column-name word-vector table; and determine the column-name similarity between the two columns in the column data pair according to the column-name edit distance and the column-name semantic distance.
  • the column data pair generation module 730 further includes a column remark similarity calculation unit, where the column remark similarity calculation unit is configured to: obtain the first-column remarks and the second-column remarks corresponding to the two columns in the column data pair; calculate the column-remark edit distance between the first-column remarks and the second-column remarks; calculate the column-remark semantic distance between the first-column remark word-vector table and the second-column remark word-vector table; and determine the column-remark similarity between the two columns in the column data pair according to the column-remark edit distance and the column-remark semantic distance.
  • the big-data-based column data processing device in this embodiment further includes a column remark judging module, configured to determine that the column-remark similarity between the two columns in the column data pair is 0 when the first-column remarks or the second-column remarks are empty.
  • the device for processing column data based on big data provided by the embodiment of the present invention can execute the method for processing column data based on big data provided by any embodiment of the present disclosure, and has corresponding functional modules and beneficial effects for the execution method.
  • FIG. 8 is a schematic structural diagram of a computer device provided by Embodiment 5 of the present invention.
  • the device includes a processor 80, a memory 81, an input device 82, and an output device 83; there may be one or more processors 80, and one processor 80 is taken as an example in FIG. 8; the processor 80, the memory 81, the input device 82, and the output device 83 may be connected by a bus or in other ways, and connection by a bus is taken as an example in FIG. 8.
  • the memory 81, as a computer-readable storage medium, can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the big-data-based column data processing method in the embodiments of the present invention (for example, the column data set acquisition module 710, the unsupervised clustering processing module 720, the column data pair generation module 730, and the column data pair similarity determination module 740 in the big-data-based column data processing device).
  • the processor 80 executes various functional applications and data processing of the computer device by running the software programs, instructions, and modules stored in the memory 81, that is, realizes the above-mentioned column data processing method based on big data.
  • the memory 81 may include a program storage area and a data storage area.
  • the program storage area may store the operating system and the application programs required by at least one function; the data storage area may store data created according to the use of the terminal.
  • the memory 81 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage devices.
  • the memory 81 may include a memory remotely provided with respect to the processor 80, and these remote memories may be connected to a computer device through a network. Examples of the aforementioned networks include the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
  • the input device 82 can be used to receive input digital or character information, and to generate key signal input related to user settings and function control of the computer equipment.
  • the output device 83 may include a display device such as a display screen.
  • the sixth embodiment of the present invention also provides a storage medium containing computer-executable instructions.
  • the computer-executable instructions are executed by a computer processor, they are used to execute a method for processing column data based on big data.
  • the method includes: obtaining the column data set to be processed, and classifying the column data according to the data attributes of the column data in the set to obtain at least two initial column data sets; performing unsupervised clustering on each of the at least two initial column data sets to obtain at least two unsupervised clusters, where the at least two unsupervised clusters correspond one-to-one with the at least two initial column data sets; generating multiple column data pairs from the at least two unsupervised clusters, and determining the column-name similarity and column-remark similarity between the two columns in each pair; and determining the similarity of each column data pair according to the column-name similarity and the column-remark similarity;
  • An embodiment of the present invention provides a storage medium containing computer-executable instructions.
  • the computer-executable instructions are not limited to the method operations described above, and can also perform related operations in the big-data-based column data processing method provided by any embodiment of the present disclosure.
  • the present disclosure can be implemented by software and necessary general-purpose hardware, and can also be implemented by hardware.
  • the present disclosure can be embodied in the form of a software product.
  • the computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, Read-Only Memory (ROM), Random Access Memory (RAM), flash memory (FLASH), hard disk, or optical disc, and includes at least one instruction to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute the methods described in multiple embodiments of the present disclosure.
  • the multiple units and modules included are divided only according to functional logic, but the division is not limited to the above as long as the corresponding functions can be realized; in addition, the names of the multiple functional units are only used to distinguish them from each other and are not intended to limit the protection scope of the present disclosure.

Abstract

A big-data-based column data processing method, device, and medium. The method includes: obtaining a column data set to be processed, and classifying the column data according to the data attributes of the column data in the set to obtain at least two initial column data sets (110); performing unsupervised clustering on each of the at least two initial column data sets to obtain at least two unsupervised clusters, where the at least two unsupervised clusters correspond one-to-one with the at least two initial column data sets (120); generating multiple column data pairs corresponding to the at least two unsupervised clusters, and determining the column-name similarity and column-remark similarity between the two columns in each pair (130); and determining the similarity of each column data pair according to the column-name similarity and the column-remark similarity (140).

Description

基于大数据的列数据处理方法、设备及介质
本申请要求在2019年09月11日提交中国专利局、申请号为201910860409.3的中国专利申请的优先权,该申请的全部内容通过引用结合在本申请中。
技术领域
本公开涉及数据处理技术,例如涉及一种基于大数据的列数据处理方法、设备及介质。
背景技术
随着大数据时代的来临,企业中往往会涉及到大量的数据,需要工作人员对大量的数据进行维护,并确定每个数据的含义,以及数据之间的关联关系,从而使得数据能够更好地辅助业务分析。
通过计算数据之间的相似性以及相似程度可以很好地帮助工作人员去寻找到与分析的数据相近的主题数据。相关技术中,采用无监督学习的方式对数据进行聚类,并通过数据重叠度,不同值或唯一值重叠度、模式匹配和名称匹配等特征来计算数据之间的相似度。
相关技术中的方法虽然可以计算出数据之间的相似度,但是由于采用了无监督学习的方式对数据进行聚类,导致计算量大且计算得到的相似度结果准确度不高。
发明内容
本公开提供一种基于大数据的列数据处理方法、设备及介质,可以得到准确率较高的列数据对的相似度并且可以减少计算量。
提供了一种基于大数据的列数据处理方法,该方法包括:
获取待处理的列数据集合,并根据所述列数据集合中列数据的数据属性对所述列数据进行分类处理,得到至少两个初始列数据集合;
对所述至少两个初始列数据集合中的每一个进行无监督聚类处理,得到至少两个无监督聚类簇,其中,所述至少两个无监督聚类簇与所述至少两个初始列数据集合一一对应;
根据至少两个无监督聚类簇生成多个列数据对,并确定每个列数据对中的两个列数据间的列名相似度以及列备注相似度;
根据所述列名相似度以及所述列备注相似度,确定每个列数据对的相似度。
还提供了一种计算机设备,包括处理器和存储器,所述存储器用于存储指令,当所述指令执行时使得所述处理器执行以下操作:
获取待处理的列数据集合,并根据所述列数据集合中列数据的数据属性对所述列数据进行分类处理,得到至少两个初始列数据集合;
对所述至少两个初始列数据集合中的每一个进行无监督聚类处理,得到至少两个无监督聚类簇,其中,所述至少两个无监督聚类簇与所述至少两个初始列数据集合一一对应;
根据至少两个无监督聚类簇生成多个列数据对,并确定每个列数据对中的两个列数据间的列名相似度以及列备注相似度;
根据所述列名相似度以及所述列备注相似度,确定每个列数据对的相似度。
还提供了一种计算机可读存储介质,存储介质用于存储指令,指令用于执行:
获取待处理的列数据集合,并根据所述列数据集合中列数据的数据属性对所述列数据进行分类处理,得到至少两个初始列数据集合;
对所述至少两个初始列数据集合中的每一个进行无监督聚类处理,得到至少两个无监督聚类簇,其中,所述至少两个无监督聚类簇与所述至少两个初始列数据集合一一对应;
根据至少两个无监督聚类簇生成多个列数据对,并确定每个列数据对中的两个列数据间的列名相似度以及列备注相似度;
根据所述列名相似度以及所述列备注相似度,确定每个列数据对的相似度。
附图说明
图1是本发明实施例一中的一种基于大数据的列数据处理方法的流程图;
图2是本发明实施例一中的一种应用场景的示意图;
图3是本发明实施例二中的一种列名相似度的计算方法的流程图;
图4是本发明实施例三中的一种列备注相似度的计算方法的流程图;
图5是本发明实施例三中的计算编辑距离的流程图;
图6是本发明实施例三中的计算列相似度的流程图;
图7是本发明实施例四中的一种基于大数据的列数据处理装置的结构示意图;
图8是本发明实施例五中的一种计算机设备的结构示意图。
具体实施方式
下面结合附图和实施例对本发明实施例进行说明。附图中仅示出了与本发明实施例相关的部分而非全部结构。
一些示例性实施例被描述成作为流程图描绘的处理或方法。虽然流程图将多项操作(或步骤)描述成顺序的处理,但是其中的许多操作可以被并行地、并发地或者同时实施。此外,多项操作的顺序可以被重新安排。当其操作完成时所述处理可以被终止,但是还可以具有未包括在附图中的附加步骤。所述处理可以对应于方法、函数、规程、子例程、子程序等等。
本文使用的术语“列数据”是按列存储的方式存储在数据库中的数据,其中,每一列包括的数据量是不固定的。
本文使用的术语“列数据的数据属性”是列数据的元信息,元信息中包括列数据的列类型。
本文使用的术语“初始列数据集合”是指根据列数据集合中列数据的数据属性对列数据进行聚类处理,可以得到数值型初始列数据集合、字符型初始列数据集合以及时间型初始列数据集合。
本文使用的术语“无监督聚类簇”是指根据对初始列数据集合进行无监督聚类处理,而得到的列数据的分类结果。
本文使用的术语“相似度”是指两个列数据之间的相似程度,即两个列数据越类似,其相似度越大;相应的,“列名相似度”是指两个列数据之间列名的相似程度;“列备注相似度”是指两个列数据之间列备注的相似程度,其中,列备注是为了便于了解列数据的属性,人为加上的对列数据的备注,一列数据可能有列备注,也可能没有列备注。
本文使用的术语“列数据对”可以由任意两个列数据组成,本文中的列数据对也可以由任意两个无监督聚类簇组成;相应的,“第一列名”即为第一列数据或者第一无监督聚类簇的名字;“第二列名”即为第二列数据或者第二无监督聚类簇的名字;“第一列备注”即为第一列数据或者第一无监督聚类簇的备注;“第二列备注”即为第二列数据或者第二无监督聚类簇的备注。
为了便于理解,对本发明实施例进行简述。
通常,采用无监督学习的方式对列数据进行聚类,并通过列数据重叠度,不同值或唯一值重叠度、模式匹配、名称匹配等特征来计算列数据之间的相似度。该方法虽然可以计算出列数据之间的相似度,但是由于采用了无监督学习的方式对列数据进行聚类,导致计算量大且计算得到的相似度结果准确度不高。针对采用了无监督学习的方式对列数据进行聚类,导致计算量大且计算得到的相似度结果准确度不高的问题,本发明实施例采用一种方法计算列数据的相似度,能够减少计算量并且能够提高计算列相似度的准确率。
本发明实施例通过获取待处理的列数据集合,根据列数据集合中的列数据属性对获取的列数据集合进行分类处理,得到了至少两个初始列数据集合;对初始列数据集合进行无监督聚类处理,得到至少两个无监督聚类簇;根据至少两个无监督聚类簇生成多个列数据对,并确定多个列数据对中的两个列数据间的列名相似度以及列备注相似度;根据列名相似度以及列备注相似度,确定与多个列数据对匹配的相似度。通过将大量的列数据进行分类处理后再对初始列数据集合进行无监督聚类,并生成列数据对可以大量的减少计算量;同时,通过计算列数据对的列名相似度和列备注相似度来确定列数据的相似度,可以提高计算列数据的相似度的准确性。
实施例一
图1是本发明实施例一中的一种基于大数据的列数据处理方法的流程图,本实施例可适用于对企业中大量的列数据进行处理的情况,该方法可以由基于大数据的列数据处理装置来执行,该装置可以通过软件和/或硬件的方式实现,并集成在执行本方法的设备中,在本实施例中执行本方法的设备可以是计算机、平板电脑和/或手机等智能终端。参考图1,该方法包括如下步骤。
步骤110、获取待处理的列数据集合,并根据列数据集合中列数据的数据属性对列数据进行分类处理,得到至少两个初始列数据集合。
在一实施例中,数据库中存储数据时可以对数据进行按行存储也可以进行按列存储。按行存储数据没有索引的查询使用大量输入/输出接口,并且建立索引和物化视图需要花费大量时间和资源,同时,面对查询的需求,数据库必须被大量膨胀才能满足性能需求;按列存储数据由于查询中的选择规则是通过列来定义的,因此整个数据库是自动索引化的。按列存储数据可以将每个字段的数据聚集存储,在查询只需要少数几个字段的时候,能大大减少读取的数据量。
在一实施例中,本发明实施例所涉及到的列数据是以列为单位对数据进行 处理,每一个列数据中可以包含一个或多个数据,通过对列数据进行处理能大大减少读取的数据量,也更加方便进行后续的数据处理操作。相应的,本发明实施例中涉及到的列数据处理方法也可以计算行数据的相似度,为了便于对本发明实施例的叙述在本发明实施例中仅以列数据为例进行介绍。
在一实施例中,待处理的列数据存储在列式存储数据库中,存储在列式存储数据库中的所有列数据被称为列数据集合。在一实施例中,可以根据列数据集合中列数据的数据属性对列数据进行分类处理,得到至少两个初始列数据集合。
可选的,根据列数据集合中列数据的数据属性对列数据进行分类处理,包括:获取列数据集合中列数据的元信息,元信息中包括列数据的列类型;根据列数据的列类型,对列数据进行分类处理。其中,列类型可以为字符型、数值型以及时间型中的至少一项。示例性的,列数据的元信息还可以包括列名、列备注或者该列的统计信息。其中,若列数据的列类型为字符型,则该列数据的统计信息可以为列数据的最短长度、最长长度、平均长度和/或频数最大的数据的长度;若列数据的列类型为数值型,则该列数据的统计信息可以为列数据的极大值、极小值和/或平均值。
在一实施例中,根据列数据集合中列数据的数据属性对列数据进行分类处理,可以得到至少两个初始列数据集合。示例性的,根据列数据集合中列数据的数据属性对列数据进行分类处理,可以得到数值型初始列数据集合、字符型初始列数据集合以及时间型初始列数据集合。在一实施例中,若列数据集合中还包括其他类型的列数据,相应的,也可以得到与该列数据类型一致的初始列数据集合。
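A minimal sketch of the classification in step 110: columns are grouped into initial column data sets by the column type recorded in their meta-information. The dictionary-based `meta` structure and the type labels are illustrative assumptions, not the patent's data model.

```python
from collections import defaultdict

def classify_columns(columns):
    # columns: iterable of dicts carrying meta-information, e.g.
    # {"name": "age", "type": "numeric"}.  Returns one initial
    # column data set per column type (character / numeric / time / other).
    initial_sets = defaultdict(list)
    for col in columns:
        initial_sets[col.get("type", "other")].append(col)
    return dict(initial_sets)

cols = [
    {"name": "age", "type": "numeric"},
    {"name": "city", "type": "character"},
    {"name": "created_at", "type": "time"},
    {"name": "salary", "type": "numeric"},
]
sets_by_type = classify_columns(cols)
# sets_by_type["numeric"] now holds the "age" and "salary" columns.
```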
步骤120、对至少两个初始列数据集合中的每一个进行无监督分类处理,得到至少两个无监督聚类簇,其中,所述至少两个无监督聚类簇与至少两个初始列数据集合一一对应。
在一实施例中,对步骤110中得到的初始列数据集合进行无监督聚类,得到与每个初始列数据集合相对应的至少两个无监督聚类簇。示例性的,若初始列数据集合为数值型初始列数据集合,则通过无监督聚类,可以得到与数值型初始列数据集合对应的至少两个无监督聚类簇。以下将介绍如何对初始列数据集合进行无监督聚类。
其中,若初始列数据集合为数值型初始列数据集合,可计算列数据的统计指标,可以包括极大值a 1、极小值a 2和平均值a 3,则可以把一列数值型数据的列 特性表示为[a 1,a 2,a 3]。
假设有n列数值型数据,计算其统计信息,可得到其列特性矩阵:
$$N=\begin{bmatrix}a_{11}&a_{12}&a_{13}\\a_{21}&a_{22}&a_{23}\\\vdots&\vdots&\vdots\\a_{n1}&a_{n2}&a_{n3}\end{bmatrix}$$
将N作为聚类(ISODATA)算法的输入,对数值型初始列数据集合进行聚类,从而可以实现对n列数据进行更细致的分类,可以得到至少两个无监督聚类簇。
若初始列数据集合为字符型初始列数据集合,由于字符型数据没有数值型数据直观的统计信息,故将每列字符数据中字符串最短长度b 1,字符串最长长度b 2,字符串平均长度b 3,频数最大的字符串长度b 4作为统计信息,可以把一列字符型数据的列特性表示为[b 1,b 2,b 3,b 4]。
假设有m列字符型数据,计算其特性指标,可得到其列特性矩阵:
$$M=\begin{bmatrix}b_{11}&b_{12}&b_{13}&b_{14}\\b_{21}&b_{22}&b_{23}&b_{24}\\\vdots&\vdots&\vdots&\vdots\\b_{m1}&b_{m2}&b_{m3}&b_{m4}\end{bmatrix}$$
将M作为ISODATA算法的输入,对字符型初始列数据集合进行聚类,从而对m列数据进行更细致的分类,可以得到至少两个无监督聚类簇。
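Both clustering steps feed a column-feature matrix (N or M) into the ISODATA algorithm. ISODATA is not commonly available in standard libraries, so the sketch below substitutes a plain k-means loop purely to show the shape of the step (feature matrix in, cluster labels out); the substitution, the toy statistics, and k=2 are all assumptions, not the patent's implementation.

```python
import random

def simple_kmeans(rows, k, iters=20, seed=0):
    # Plain k-means as a stand-in for ISODATA: rows is the column-feature
    # matrix (one feature vector per column), k the number of clusters.
    rnd = random.Random(seed)
    centers = rnd.sample(rows, k)
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assign each feature vector to its nearest center (squared distance).
        labels = [min(range(k),
                      key=lambda c: sum((x - y) ** 2
                                        for x, y in zip(row, centers[c])))
                  for row in rows]
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [row for row, l in zip(rows, labels) if l == c]
            if members:
                centers[c] = [sum(vals) / len(members)
                              for vals in zip(*members)]
    return labels

# Feature matrix N for numeric columns: one [max, min, mean] row per column.
N = [[100.0, 0.0, 50.0], [99.0, 1.0, 48.0],
     [1e6, 1e3, 5e5], [9e5, 2e3, 4.8e5]]
labels = simple_kmeans(N, k=2)
# Columns with similar statistics end up in the same unsupervised cluster.
```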
由于时间型列数据的数据量总体相对较少,不需要进行聚类;而其他类型的数据没有统一的结构,不便于找到其列特性,故本发明实施例中不对其他类型的数据进行聚类。
步骤130、根据至少两个无监督聚类簇生成多个列数据对,并确定每个列数据对中的两个列数据间的列名相似度以及列备注相似度。
在一实施例中,通过对初始列数据集合进行无监督处理后,可以得到至少 两个无监督聚类簇,可以通过将至少两个无监督聚类簇中的每一个无监督聚类簇中的列数据进行两两组合,从而得到多个列数据对。在一实施例中,列数据1和列数据2组成的列数据对12与列数据2和列数据1组成的列数据对21为同一个列数据对。生成列数据对后,分别确定每个列数据对的列名相似度和列备注相似度。
在一实施例中,若不进行步骤110和步骤120,直接对获取的列数据集合生成列数据对,若列数据集合中有100000个列数据,则会生成将近50亿个列数据对,也就是说需要对5000000000个列数据对进行列名和列备注的相似度的计算,才能得到全部的相似数据,而假设将上述的10万个列数据,经过步骤110和步骤120的处理后,得到400个无监督聚类簇,假设列数据均匀情况下,每个簇250条列数据,这400个无监督聚类簇可以生成
$$400\times C_{250}^{2}=400\times\frac{250\times 249}{2}=12450000$$
个列数据对。因此,本发明实施例的方案,可以极大地降低计算量,并且聚类簇中的数据越均匀,降低计算量的效果越明显。
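The pair-count reduction described above can be verified with a short calculation (100,000 columns without clustering versus 400 clusters of 250 columns each):

```python
from math import comb

all_pairs = comb(100_000, 2)          # pairing every column with every other
clustered_pairs = 400 * comb(250, 2)  # pairs only inside each of 400 clusters

print(all_pairs)        # 4999950000, i.e. almost 5 billion pairs
print(clustered_pairs)  # 12450000, i.e. about 12.45 million pairs
print(all_pairs // clustered_pairs)   # 401, roughly a 400x reduction
```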
步骤140、根据列名相似度以及列备注相似度,确定每个列数据对的相似度。
在一实施例中,通过步骤130可以得到列数据对的列名相似度和列备注相似度,并将列数据的列名相似度记为s col,列数据的列备注相似度记为s com,则列数据对的相似度S,可以通过如下公式计算得到。
[公式图片：列数据对相似度 S 的计算公式，原文以图片形式给出，未能还原为文本]
本实施例的技术方案,通过获取待处理的列数据集合,根据列数据集合中的列数据属性对获取的列数据集合进行分类处理,得到了至少两个初始列数据集合;对初始列数据集合进行无监督聚类处理,得到至少两个无监督聚类簇;根据至少两个无监督聚类簇生成多个列数据对,并确定每个列数据对中的两个 列数据间的列名相似度以及列备注相似度;根据列名相似度以及列备注相似度,确定与每个列数据对匹配的相似度,可以得到准确率较高的列数据对的相似度结果并且可以减少计算量。
应用场景
图2列举了一种可以应用本发明实施例的系统,在一实施例中,数据目录系统发送列数据的元信息到列相似度后端服务,其中,列数据的元信息的任何变化都会引起与变化的列数据相关列的相似度的重新计算。列相似度后端服务接收到数据目录系统发送的列数据的元信息的变化后,列相似度后端服务会将列数据的元信息写入到相似度后端数据库中,即通过相似度后端数据库可以查询列数据的元信息。同时,列相似度后端服务也可以将列数据相似度计算任务发送至任务调度服务,任务调度服务接收到请求后,会通过分布式计算引擎进行列数据的相似度计算;分布式计算引擎对列数据相似度计算任务进行细化;第一任务中的第一阶段对列数据即初始列数据集进行分类处理得到初始列数据集合,然后对初始列数据集合进行无监督聚类处理得到无监督聚类簇,最后根据上一步得到的至少两个无监督聚类簇生成多个列数据对;第一任务中的第二阶段对生成的多个列数据对中的第一列数据对进行列名和列备注的预处理,得到第一列数据对的列名相似度和列备注相似度,并根据第一列数据对的列名相似度和列备注相似度得到第一列数据对的列相似度,最后将第一列数据对的相似度保存至相似度后端数据库中;同时,分布式计算引擎中的第二任务-第n任务可以对生成的第二列数据对-第n列数据对的列相似度进行计算,并保存至相似度后端数据库中。第n任务中的n的数值并不固定,其与生成的列数据对的对数相关,例如,生成的列数据对的对数为100,那么第n任务即为第一百任务。本系统中,列相似度后端服务可以实时的查询相似度后端数据库中的列数据的元信息和列数据的相似度,也可以查询任务调度服务的任务状态。
在本应用场景中,分布式计算引擎通过不同的任务可以同时得到每个列数据对的相似度,并将列数据对的相似度保存在相似度后端数据库中,通过列相似度后端服务可以实时的查询列数据对的相似度,不但可以得到准确率较高的列数据对的相似度结果并且可以减少计算量和计算时间。
实施例二
图3为本发明实施例二涉及到的计算列数据对中的两个列数据间的列名相似度的流程图,本实施例对上述实施例进行说明,确定每个列数据对中的两个列数据间的列名相似度可以包括:获取与列数据对中的两个列数据对应的第一列名以及第二列名;计算第一列名与第二列名间的列名编辑距离;分别获取与 第一列名对应的第一列名词向量表,以及与第二列名对应的第二列名词向量表,并计算第一列名词向量表与第二列名词向量表间的列名语义距离;根据列名编辑距离以及列名语义距离,确定列数据对中的两个列数据间的列名相似度。参考图3,该方法包括:
步骤310、获取与列数据对中的两个列数据对应的第一列名以及第二列名。
在一实施例中,每一个列数据对中都包含两个列数据,分别记为第一列数据和第二列数据,相应的,第一列数据的列名被记为第一列名,第二列数据的列名被记为第二列名。本实施例中涉及到的第一列数据、第二列数据、第一列名以及第二列名等词,均是为了便于对本发明实施例的描述而用到的,并不是对本发明实施例的限制。
可选的,列数据的列名的命名方式有两种,一种是驼峰式命名,如myFirstName;另一种是下划线命名,如my_first_name。本实施例中需要对列数据的列名进行标准化操作,即将列名展开成独立的单词,如上述列数据名需要展开为[my,first,name]。由于数字对列名相似度没有影响,当列名中出现数字时,对数字进行忽略处理,即将列数据名中的数字删除。
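The standardization described above (splitting camelCase and snake_case column names into independent words and dropping digits) can be sketched as follows; the regular expressions are illustrative assumptions rather than the patent's exact rules.

```python
import re

def normalize_column_name(name: str):
    # Drop digits (they do not affect name similarity), split snake_case on
    # underscores and camelCase on case boundaries, then lowercase each word.
    name = re.sub(r"\d+", "", name)
    parts = []
    for chunk in name.split("_"):
        parts.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", chunk))
    return [p.lower() for p in parts if p]

print(normalize_column_name("myFirstName"))    # ['my', 'first', 'name']
print(normalize_column_name("my_first_name"))  # ['my', 'first', 'name']
print(normalize_column_name("addr2Line"))      # ['addr', 'line']
```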
可选的,将列数据的列名进行标准化操作后,利用词向量模型将列数据名中的英文单词转化为词向量,示例性的,可以通过facebook利用fasttext算法训练公开的英文词向量模型得到列数据名中的英文单词的词向量。假设第一列名和第二列名分别为A和B,预处理之后得到两个列名的单词表分别为A=[a 1,a 2,...,a n]和B=[b 1,b 2,...,b m],其中a 1,a 2,...,a n分别为第一列名A的单词表中的每个单词,b 1,b 2,...,b m分别为第二列名B的单词表中的每个单词,其中,n为第一列名的单词表中包含的单词数量,m为第二列名的单词表中包含的单词数量。由于列名分解后的单词可能存在缩写单词、错拼单词等情况,因此一些单词不一定存在对应的词向量,本实施例中若查询不到一些单词的向量,则忽略该单词。因此,第一列名和第二列名的单词表转化为词向量表A v=[V a1,V a2,...,V an′]和B v=[V b1,V b,...,V bm′],其中,n为第一列名的单词表中包含的单词数量,m为第二列名的单词表中包含的单词数量,n′为第一列名词 向量表中包含的词向量数量,m′为第二列名词向量表中包含的词向量数量,且n′≤n,m′≤m,V a1,V a2,...,V an′分别为第一列名A词向量表中与单词a 1,a 2,...,a n对应的词向量,V b1,V b,...,V bm′分别为第二列名B词向量表中与单词b 1,b 2,...,b m对应的词向量。
步骤320、计算第一列名与第二列名间的列名编辑距离。
编辑距离是指两个字符之间,由一个字符转换成为另一个字符所需要的最少编辑操作次数,允许的编辑操作包括一个字符替换为另一个字符,插入一个字符,删除一个字符。因此编辑距离的大小并不是在一个范围内,为了将编辑距离标准化在[0,1]范围内,对编辑距离重新定义,公式如下:
$$s(x)=\frac{1}{1+e^{-x}}$$
$$h'=\left(s\!\left(\frac{L_{max}-d}{L_{max}}\times 6\right)-0.5\right)\times 2$$
其中,s(x)为sigmoid函数,用于将编辑距离标准化到[0,1]范围内,x为编辑距离;L max表示两个字符串的最大字符数;d表示两个字符串的原编辑距离,即最少编辑操作次数;h′为自定义编辑距离,即本实施例中涉及到的列名编辑距离,其中,h′越大,表示两个字符串越相似。
由于列数据名的单词大小可能不一致,为了保证列数据对AB与列数据BA的编辑距离一致,故令单词列表较长的列为第一列,第一列名与第二列名间的列名编辑距离的计算公式如下:
$$s_{2}=\frac{1}{L_{A}}\sum_{i=1}^{L_{A}}f(a_{i},B)$$
f(a i,B)=max(g(a i,b 1),g(a i,b 2),...,g(a i,b m))
其中,s 2表示第一列名与第二列名间的列名编辑距离,f(a i,B)表示第一列名中第i个单词a i与第二列名中所有单词的编辑距离的最大值,g(a i,b j)表示第一列名中第i个单词与第二列名中第j个单词的编辑距离,L A为第一列名的长度。
若a i为单词,且在词向量中出现,考虑到单词存在近义词的情况,利用词向 量集得到a i的近义词,利用近义词优化编辑距离,从而增大编辑距离的可靠性,故g(a i,b j)的计算公式定义如下:
g(a i,b j)=max(h′(a i,b j),h′(a i1,b j),h′(a i2,b j),...,h′(a ik,b j))
其中a ik表示第一列名中第i个单词a i的第k个近义词,h′(a ik,b j)表示a ik和b j的编辑距离。
若a i不在词向量中出现(即可能不是单词或拼写错误),则不考虑近义词情况。
由上可知,在计算a i和b j的编辑距离时,需要考虑a i的近义词。若a i存在词向量,利用已有词向量集可以得到a i的空间距离最近的k个单词,即[a i1,a i2,...,a ik];若a i不存在词向量,此时a i无近义词,k=0,故g(a i,b j)=h′(a i,b j)。
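Putting the pieces above together — h′ from the sigmoid normalization, g(a_i, b_j) taking the maximum over a_i and its k synonyms, and s_2 averaging f(a_i, B) over the longer name — gives the following runnable sketch. The Levenshtein routine and the `synonyms` lookup table are assumptions; a real system would obtain synonyms as nearest neighbors in the word-vector space.

```python
import math

def levenshtein(a: str, b: str) -> int:
    # Minimum number of insert/delete/substitute operations (the original d).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def h_prime(a: str, b: str) -> float:
    # Normalized edit distance h' in [0, 1]; larger means more similar.
    l_max = max(len(a), len(b))
    if l_max == 0:
        return 1.0
    d = levenshtein(a, b)
    sig = 1.0 / (1.0 + math.exp(-((l_max - d) / l_max) * 6))  # sigmoid s(x)
    return (sig - 0.5) * 2

def s2(words_a, words_b, synonyms=None):
    # words_a is the longer word list A; f(a_i, B) is the max of g over B,
    # where g(a_i, b_j) is the max of h' over a_i and its synonyms.
    synonyms = synonyms or {}
    total = 0.0
    for a_i in words_a:
        candidates = [a_i] + synonyms.get(a_i, [])
        total += max(h_prime(c, b_j) for c in candidates for b_j in words_b)
    return total / len(words_a)

score = s2(["customer", "name"], ["client", "name"],
           synonyms={"customer": ["client"]})
# "customer" matches "client" through its synonym, so the score is high.
```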
步骤330、计算第一列名词向量表与第二列名词向量表间的列名语义距离。
在一实施例中,通过步骤310中得到第一列名与第二列名的词向量表分别为A v=[V a1,V a2,...,V an′]和B v=[V b1,V b,...,V bm′],其中n′≤n,m′≤m,V a1,V a2,...,V an′分别为第一列名A词向量表中与单词a 1,a 2,...,a n对应的词向量,V b1,V b,...,V bm′分别为第二列名B词向量表中与单词b 1,b 2,...,b m对应的词向量。则可以根据以下公式计算第一列名词向量表与第二列名词向量表间的列名语义距离。
$$V_{A}=\frac{1}{L_{Av}}\sum_{i=1}^{L_{Av}}V_{ai}$$
$$V_{B}=\frac{1}{L_{Bv}}\sum_{j=1}^{L_{Bv}}V_{bj}$$
s 1=V A·V B/(‖V A‖×‖V B‖)
其中,s 1为第一列名词向量表与第二列名词向量表间的列名语义距离,L Av和L Bv分别为第一列名词向量表A和第二列名词向量表B的词向量个数,V A为第一列名A词向量表的均值,V B为第一列名B词向量表的均值。
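Following the formulas above — average the word vectors of each name table, then take the cosine of the two mean vectors — here is a sketch with toy three-dimensional vectors. Real vectors would come from a pretrained model such as fastText; the example vectors are made up.

```python
import math

def mean_vector(vectors):
    # V = (1/L) * sum of the word vectors in the table.
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def semantic_distance(table_a, table_b):
    # s1 = V_A . V_B / (|V_A| * |V_B|), the cosine of the two mean vectors.
    va, vb = mean_vector(table_a), mean_vector(table_b)
    dot = sum(x * y for x, y in zip(va, vb))
    na = math.sqrt(sum(x * x for x in va))
    nb = math.sqrt(sum(x * x for x in vb))
    return dot / (na * nb)

# Toy word vectors for the words of two column names.
A_v = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
B_v = [[0.7, 0.3, 0.2]]
print(round(semantic_distance(A_v, B_v), 3))  # 0.955
```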
步骤340、根据列名编辑距离以及列名语义距离,确定列数据对中的两个列数据间的列名相似度。
在一实施例中,由步骤320得到第一列名与第二列名间的列名编辑距离s 2,由步骤330得到第一列名与第二列名间的列名语义距离s 1,列名编辑距离s 2和列名语义距离s 1的范围均为[0,1]。列名相似度计算中,由于列名单词很大可能不存在词向量,故列名相似度中,编辑距离的权重相对较大。而单纯的线性权重关系无法得到一个准确的列名相似度,故本实施例采用分段函数的形式,计算列名相似度,可以根据如下公式计算列名相似度s col
[公式图片：列名相似度 s col 的分段函数，原文以图片形式给出，未能还原为文本]
本实施例在列名相似度计算过程中,获取与列数据对中的两个列数据对应的第一列名以及第二列名,并对获取的列名进行处理,得到列名词向量,根据列名词向量计算第一列名与第二列名间的列名编辑距离和列名语义距离,并通过分段函数将不同条件下的列名编辑距离和列名语义距离设置不同的权重,得到了准确率更高的列名相似度。
实施例三
图4是本发明实施例三涉及到的计算列数据对中的两个列数据间的列备注相似度的流程图,本实施例对上述任意实施例进行说明,将确定每个列数据对中的两个列数据间的列备注相似度可以包括:获取与列数据对中的两个列数据对应的第一列备注以及第二列备注;计算第一列备注与第二列备注间的列备注 编辑距离;对第一列备注以及第二列备注进行分词处理后,得到与第一列备注对应的第一列备注词向量表以及与第二列备注对应的第二列备注词向量表;计算第一列备注词向量表与第二列备注词向量表间的列备注语义距离;根据列备注编辑距离以及列备注语义距离,确定列数据对中的两个列数据间的列备注相似度,参考图4,该方法包括:
步骤410、获取与列数据对中的两个列数据对应的第一列备注以及第二列备注。
在一实施例中,列备注可以反映与列备注对应的列数据的主要内容,需要用户对列备注进行定义,因此,有可能存在一列数据不存在列备注的现象。将列数据对中与第一列数据对应的备注记为第一列备注,与第二列数据对应的列备注记为第二列备注。
可选的,在确定第一列备注或者第二列备注为空时,确定列数据对中的两个列数据间的列备注相似度为0,即若确定第一列数据不包含列备注、第二列数据不包含列备注或者第一列数据和第二列数据都不包含列备注,可以直接确定列数据对中两个列数据间的列备注相似度为0。
可选的,获取到列数据对中的两个列数据对应的第一列备注以及第二列备注后,可以对第一列备注和第二列备注进行分词处理,其中,分词处理是将连续的文本按照一定的规则重新组合成词序列的过程。示例性的,可以将第一列备注和第二列备注经中的停用词、标点符号、英文字母以及数字删除,并通过分词工具对第一列备注和第二列备注进行分词处理,得到第一列备注表C=[c 1,c 2,...,c n]和第二列备注表D=[d 1,d 2,...,d m],其中,n为第一列备注表中包含的单词数量,m为第二列备注表中包含的单词数量。利用词向量模型将英文单词转化为词向量,本实施例中可以通过查询AI Lib利用Directional Skip-Gram(DSG)算法训练公开的词向量数据集而得到列备注中每个单词对应的词向量,当列备注中的单词查询不到词向量时,对其进行忽略处理,得到第一列备注和第二列备注的词向量表分别为C v=[V c1,V c2,...,V cn′]和d v=[V d1,V d2,...,V dm′],其中n为第一列备注表中包含的单词数量,m为第二列备注表中包含的单词数量,n′为第一列备注词向量表中包含的词向量数量,m′为第二列备注词向量表中包含的词向量数量,且n′≤n,m′≤m,V c1,V c2,...,V cn′分别为与第一列备注表中的单词c 1,c 2,...,c n对应的词向量,V d1,V d2,...,V dm′分别为与第二列备注表中的单词d 1,d 2,...,d m对应的词向量。
步骤420、计算第一列备注与第二列备注间的列备注编辑距离。
在一实施例中,第一列备注与第二列备注间的列备注编辑距离与第一列名与第二列名间的列名编辑距离的计算方法一致,本实施例中在此不再对其进行 阐述,通过步骤320中涉及到的方法,可以得到第一列备注与第二列备注间的列备注编辑距离为s 4
步骤430、计算第一列备注词向量表与第二列备注词向量表间的列备注语义距离。
在一实施例中,由步骤410得到第一列备注和第二列备注的词向量表分别为C v=[V c1,V c2,...,V cn′]和d v=[V d1,V d2,...,V dm′],其中n为第一列备注表中包含的单词数量,m为第二列备注表中包含的单词数量,n′为第一列备注词向量表中包含的词向量数量,m′为第二列备注词向量表中包含的词向量数量,且n′≤n,m′≤m,V c1,V c2,...,V cn′分别为与第一列备注表中的单词c 1,c 2,...,c n对应的词向量,V d1,V d2,...,V dm′分别为与第二列备注表中的单词d 1,d 2,...,d m对应的词向量。
第一列备注词向量表与第二列备注词向量表间的列备注语义距离可以通过如下公式计算得到。
$$V_{C}=\frac{1}{L_{Cv}}\sum_{i=1}^{L_{Cv}}V_{ci}$$
$$V_{D}=\frac{1}{L_{Dv}}\sum_{j=1}^{L_{Dv}}V_{dj}$$
s 3=V C·V D/(‖V C‖×‖V D‖)
其中,s 3为列备注的语义距离,L Cv和L Dv分别为第一列备注词向量表和第二列备注词向量表的词向量个数,V C为第一列备注词向量表的均值,V D为第二列备注词向量表的均值。
步骤440、根据列备注编辑距离以及列备注语义距离,确定列数据对中的两个列数据间的列备注相似度。
在一实施例中,由步骤420得到第一列备注与第二列备注间的列备注编辑距离s 4,由步骤430得到第一列备注与第二列备注间的列备注语义距离s 3,列备注编辑距离s 4和列备注语义距离s 3的范围均为[0,1]。列备注相似度计算中,由于列备注为文本信息,故列备注的语义距离的重要性大于列备注的编辑距离。而 单纯的线性权重关系无法得到一个准确的列备注相似度,故本实施例采用分段函数的形式,计算列备注相似度,可以根据如下公式计算列备注相似度s com
[公式图片：列备注相似度 s com 的分段函数，原文以图片形式给出，未能还原为文本]
本实施例在列备注相似度计算过程中,获取与列数据对中的两个列数据对应的第一列备注以及第二列备注,并对获取的列备注进行处理,得到列备注词向量,根据列备注词向量计算第一列备注与第二列备注间的列备注编辑距离和列备注语义距离,并通过分段函数将不同条件下的列备注编辑距离和列备注语义距离设置不同的权重,得到了准确率更高的列备注相似度。
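Since the piecewise formula for s_com is only available as an image in this text, the sketch below uses purely hypothetical thresholds and weights; it illustrates only the stated idea that, for remarks, the semantic distance s_3 is weighted more heavily than the edit distance s_4, with both inputs in [0,1].

```python
def remark_similarity(s3: float, s4: float) -> float:
    # s3: column-remark semantic distance, s4: column-remark edit distance,
    # both in [0, 1].  The 0.8 threshold and the 0.7/0.3 weights are
    # hypothetical placeholders -- the patent's actual piecewise constants
    # are not reproduced in this text.
    if s3 >= 0.8:                    # remarks already near-identical semantically
        return s3
    return 0.7 * s3 + 0.3 * s4      # otherwise favor the semantic distance

print(remark_similarity(0.9, 0.2))  # 0.9
print(remark_similarity(0.5, 0.5))  # 0.5
```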
图5是计算列编辑距离的流程图,其中列编辑距离包括列名编辑距离和列备注编辑距离。对列名和列备注进行处理得到与列名和列备注分别对应的词向量,比较两个列数据表的长度,将较长的列数据表记为A表,较短的列数据表记为B表,其中,A表与第一列数据表对应,B表与第二列数据表对应;判断A表中的第i个单词是否存在词向量,若存在,则寻找k个第i个单词的近义词,即A i=[a i,a 1,...,a k],分别计算A i中每个单词与B表中每个单词的编辑距离,并选出最大的编辑距离,得到列编辑距离。
在上述例子中,通过确定最大编辑距离而得到列编辑距离,提高了求解编辑距离的准确度。
示例性的,图6是本发明实施例中计算列相似度的流程图,对列数据对中包含的两个列数据的列名进行处理,得到列名词向量,根据列名词向量分别计算列名的语义距离和编辑距离,并计算得到列名相似度;针对列备注相似度的计算,判断列数据对中的两个列数据是否包含列备注,若其中一个列数据不包含列备注,则两个列数据间的列备注相似度为0;若两个列数据都包含列备注,则对列备注进行处理,得到列备注词向量,并根据列备注词向量计算列备注的语义距离和编辑距离,并计算得到列备注相似度;通过列名相似度和列备注相似度确定列相似度。
在上述例子中,通过对列名和列备注进行处理,得到列名和列备注的词向量,并根据列名和列备注的词向量确定了列名和列备注的相似度,最终得到了准确率较高的列相似度。
实施例四
图7是本发明实施例四提供的一种基于大数据的列数据处理装置的结构示意图,该装置可以由软件和/或硬件的方式实现,并且可以执行本公开任意实施例所述的基于大数据的列数据处理方法,参考图7,该装置包括:列数据集合获取模块710、无监督聚类处理模块720、列数据对生成模块730及列数据对相似度确定模块740。
列数据集合获取模块710:设置为获取待处理的列数据集合,并根据列数据集合中列数据的数据属性对列数据进行分类处理,得到至少两个初始列数据集合;
无监督聚类处理模块720:设置为对至少两个初始列数据集合中的每一个进行无监督聚类处理,得到至少两个无监督聚类簇,其中,所述至少两个无监督聚类簇与至少两个初始列数据集合一一对应;
列数据对生成模块730:设置为生成与至少两个无监督聚类簇分别对应的多个列数据对,并确定每个列数据对中的两个列数据间的列名相似度以及列备注相似度;
列数据对相似度确定模块740:设置为根据列名相似度以及列备注相似度,确定每个列数据对的相似度。
本实施例的技术方案,通过列数据集合获取模块得到了至少两个初始列数据集合,并通过无监督聚类模块对初始列数据集合进行无监督聚类处理得到了至少两个无监督聚类簇;通过列数据对生成模块生成列数据对,计算得到每个列数据对的列名相似度和列备注相似度;通过列数据对相似度确定模块得到列数据对的相似度。本实施例可以得到准确率较高的列数据对的相似度结果并且可以减少计算量。
可选的,本实施例在上述方案的基础上,列数据集合获取模块710还可以包括:列数据的元信息获取单元,设置为获取列数据集合中列数据的元信息,元信息中包括列数据的列类型;根据列数据的列类型,对列数据进行分类处理。
可选的,列类型包括下述至少一项:字符型、数值型、时间型。
可选的,列数据对生成模块730还设置为将至少两个无监督聚类簇中的每 一个无监督聚类簇中的列数据进行两两组合,得到多个列数据对。
可选的,列数据对生成模块730还包括列名相似度计算单元,其中,列名相似度计算单元是设置为:获取与列数据对中的两个列数据对应的第一列名以及第二列名;计算第一列名与第二列名间的列名编辑距离;计算第一列名词向量表与第二列名词向量表间的列名语义距离;根据列名编辑距离以及列名语义距离,确定列数据对中的两个列数据间的列名相似度。
可选的,列数据对生成模块730还包括列备注相似度计算单元,其中,列备注相似度计算单元是设置为:获取与列数据对中的两个列数据对应的第一列备注以及第二列备注;计算第一列备注与第二列备注间的列备注编辑距离;计算第一列备注词向量表与第二列备注词向量表间的列备注语义距离;根据列备注编辑距离以及列备注语义距离,确定列数据对中的两个列数据间的列备注相似度。
可选的,本实施例所述的基于大数据的列数据处理装置还包括列备注判断模块,设置为在确定第一列备注或者第二列备注为空时,确定列数据对中的两个列数据间的列备注相似度为0。
本发明实施例所提供的基于大数据的列数据处理装置可执行本公开任意实施例所提供的基于大数据的列数据处理方法,具备执行方法相应的功能模块和有益效果。
实施例五
图8为本发明实施例五提供的一种计算机设备的结构示意图,如图8所示,该设备包括处理器80、存储器81、输入装置82和输出装置83;处理器80的数量可以是一个或多个,图8中以一个处理器80为例;处理器80、存储器81、输入装置82和输出装置83可以通过总线或其他方式连接,图8中以通过总线连接为例。
存储器81作为一种计算机可读存储介质,可用于存储软件程序、计算机可执行程序以及模块,如本发明实施例中的基于大数据的列数据处理方法对应的程序指令/模块(例如,基于大数据的列数据处理装置中的列数据集合获取模块710、无监督聚类处理模块720、列数据对生成模块730及列数据对相似度确定模块740)。处理器80通过运行存储在存储器81中的软件程序、指令以及模块,从而执行计算机设备的多种功能应用以及数据处理,即实现上述的基于大数据的列数据处理方法。
存储器81可包括存储程序区和存储数据区,其中,存储程序区可存储操作 系统、至少一个功能所需的应用程序;存储数据区可存储根据终端的使用所创建的数据等。此外,存储器81可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他非易失性固态存储器件。在一些实例中,存储器81可包括相对于处理器80远程设置的存储器,这些远程存储器可以通过网络连接至计算机设备。上述网络的实例包括互联网、企业内部网、局域网、移动通信网及其组合。
输入装置82可用于接收输入的数字或字符信息,以及产生与计算机设备的用户设置以及功能控制有关的键信号输入。输出装置83可包括显示屏等显示设备。
实施例六
本发明实施例六还提供一种包含计算机可执行指令的存储介质,所述计算机可执行指令在由计算机处理器执行时用于执行一种基于大数据的列数据处理方法,该方法包括:获取待处理的列数据集合,并根据列数据集合中列数据的数据属性对列数据进行分类处理,得到至少两个初始列数据集合;
对至少两个初始列数据集合中的每一个进行无监督聚类处理,得到至少两个无监督聚类簇,其中,所述至少两个无监督聚类簇与至少两个初始列数据集合一一对应;
根据至少两个无监督聚类簇生成多个列数据对,并确定每个列数据对中的两个列数据间的列名相似度以及列备注相似度;
根据列名相似度以及列备注相似度,确定每个列数据对的相似度。
本发明实施例所提供的一种包含计算机可执行指令的存储介质,其计算机可执行指令不限于如上所述的方法操作,还可以执行本公开任意实施例所提供的基于大数据的列数据处理方法中的相关操作。
通过以上关于实施方式的描述,本公开可借助软件及必需的通用硬件来实现,也可以通过硬件实现。本公开可以以软件产品的形式体现出来,该计算机软件产品可以存储在计算机可读存储介质中,如计算机的软盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、闪存(FLASH)、硬盘或光盘等,包括至少一个指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本公开多个实施例所述的方法。
上述基于大数据的列数据处理装置的实施例中,所包括的多个单元和模块只是按照功能逻辑进行划分的,但并不局限于上述的划分,只要能够实现相应 的功能即可;另外,多个功能单元的名称也只是为了便于相互区分,并不用于限制本公开的保护范围。

Claims (15)

  1. 一种基于大数据的列数据处理方法,包括:
    获取待处理的列数据集合,并根据所述列数据集合中列数据的数据属性对所述列数据进行分类处理,得到至少两个初始列数据集合;
    对所述至少两个初始列数据集合中的每一个进行无监督聚类处理,得到至少两个无监督聚类簇,其中,所述至少两个无监督聚类簇与所述至少两个初始列数据集合一一对应;
    根据所述至少两个无监督聚类簇生成多个列数据对,并确定每个列数据对中的两个列数据间的列名相似度以及列备注相似度;
    根据所述列名相似度以及所述列备注相似度,确定每个列数据对的相似度。
  2. 根据权利要求1所述的方法,其中,所述根据所述列数据集合中列数据的数据属性对所述列数据进行分类处理,包括:
    获取所述列数据集合中所述列数据的元信息,所述元信息中包括所述列数据的列类型;
    根据所述列数据的列类型,对所述列数据进行分类处理。
  3. 根据权利要求2所述的方法,其中,所述列类型包括下述至少一项:字符型、数值型、时间型。
  4. 根据权利要求1所述的方法,其中,生成与所述至少两个无监督聚类簇对应的多个列数据对,包括:
    将所述至少两个无监督聚类簇中的每一个无监督聚类簇中的列数据进行两两组合,得到所述多个列数据对。
  5. 根据权利要求1所述的方法,其中,所述确定每个列数据对中的两个列数据间的列名相似度,包括:
    获取与每个列数据对中的两个列数据对应的第一列名以及第二列名;
    计算所述第一列名与所述第二列名间的列名编辑距离;
    计算第一列名词向量表与第二列名词向量表间的列名语义距离;
    根据所述列名编辑距离以及所述列名语义距离,确定每个列数据对中的两个列数据间的列名相似度。
  6. 根据权利要求1所述的方法,其中,所述确定每个列数据对中的两个列数据间的列备注相似度,包括:
    获取与每个列数据对中的两个列数据对应的第一列备注以及第二列备注;
    计算所述第一列备注与所述第二列备注间的列备注编辑距离;
    计算第一列备注词向量表与第二列备注词向量表间的列备注语义距离;
    根据所述列备注编辑距离以及所述列备注语义距离,确定每个列数据对中的两个列数据间的列备注相似度。
  7. 根据权利要求6所述的方法,其中,在所述获取与每个列数据对中的两个列数据对应的第一列备注以及第二列备注之后,还包括:
    在确定所述第一列备注或者所述第二列备注为空的情况下,确定每个列数据对中的两个列数据间的列备注相似度为0。
  8. 一种计算机设备,包括处理器和存储器,所述存储器用于存储指令,所述指令执行时使得所述处理器执行以下操作:
    获取待处理的列数据集合,并根据所述列数据集合中列数据的数据属性对所述列数据进行分类处理,得到至少两个初始列数据集合;
    对所述至少两个初始列数据集合中的每一个进行无监督聚类处理,得到至少两个无监督聚类簇,其中,所述至少两个无监督聚类簇与所述至少两个初始列数据集合一一对应;
    根据所述至少两个无监督聚类簇生成多个列数据对,并确定每个列数据对中的两个列数据间的列名相似度以及列备注相似度;
    根据所述列名相似度以及所述列备注相似度,确定每个列数据对的相似度。
  9. 根据权利要求8所述的计算机设备,其中,所述处理器是设置为通过以下方式对所述列数据进行分类处理:
    获取所述列数据集合中所述列数据的元信息,所述元信息中包括所述列数据的列类型;
    根据所述列数据的列类型,对所述列数据进行分类处理。
  10. 根据权利要求9所述的计算机设备,其中,所述列类型包括下述至少一项:字符型、数值型、时间型。
  11. 根据权利要求8所述的计算机设备,其中,所述处理器是设置为通过以下方式生成与所述至少两个无监督聚类簇对应的多个列数据对:
    将所述至少两个无监督聚类簇中的每一个无监督聚类簇中的列数据进行两两组合,得到所述多个列数据对。
  12. 根据权利要求8所述的计算机设备,其中,所述处理器是设置为通过以下方式确定每个列数据对中的两个列数据间的列名相似度:
    获取与每个列数据对中的两个列数据对应的第一列名以及第二列名;
    计算所述第一列名与所述第二列名间的列名编辑距离;
    计算第一列名词向量表与第二列名词向量表间的列名语义距离;
    根据所述列名编辑距离以及所述列名语义距离,确定每个列数据对中的两个列数据间的列名相似度。
  13. 根据权利要求8所述的计算机设备,其中,所述处理器是设置为通过以下方式确定每个列数据对中的两个列数据间的列备注相似度:
    获取与每个列数据对中的两个列数据对应的第一列备注以及第二列备注;
    计算所述第一列备注与所述第二列备注间的列备注编辑距离;
    计算第一列备注词向量表与第二列备注词向量表间的列备注语义距离;
    根据所述列备注编辑距离以及所述列备注语义距离,确定每个列数据对中的两个列数据间的列备注相似度。
  14. 根据权利要求13所述的计算机设备,其中,所述处理器在确定所述第一列备注或者所述第二列备注为空的情况下,确定每个列数据对中的两个列数据间的列备注相似度为0。
  15. 一种计算机可读存储介质,存储有计算机程序,所述计算机程序被处理器执行时实现如权利要求1-7中任一项所述的基于大数据的列数据处理方法。
PCT/CN2020/110364 2019-09-11 2020-08-21 基于大数据的列数据处理方法、设备及介质 WO2021047373A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910860409.3A CN110569289B (zh) 2019-09-11 2019-09-11 基于大数据的列数据处理方法、设备及介质
CN201910860409.3 2019-09-11

Publications (1)

Publication Number Publication Date
WO2021047373A1 true WO2021047373A1 (zh) 2021-03-18

Family

ID=68779365

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/110364 WO2021047373A1 (zh) 2019-09-11 2020-08-21 基于大数据的列数据处理方法、设备及介质

Country Status (2)

Country Link
CN (1) CN110569289B (zh)
WO (1) WO2021047373A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569289B (zh) * 2019-09-11 2020-06-02 星环信息科技(上海)有限公司 基于大数据的列数据处理方法、设备及介质
CN113127573A (zh) * 2019-12-31 2021-07-16 奇安信科技集团股份有限公司 相关数据的确定方法、装置、计算机设备和存储介质
US11455321B2 (en) 2020-03-19 2022-09-27 International Business Machines Corporation Deep data classification using governance and machine learning
US20210374525A1 (en) * 2020-05-28 2021-12-02 International Business Machines Corporation Method and system for processing data records
WO2021246958A1 (en) * 2020-06-02 2021-12-09 Ecommerce Enablers Pte. Ltd. System and method for combining related data from separate databases using identifier field pairs

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100192024A1 (en) * 2009-01-23 2010-07-29 Intelliscience Corporation Methods and systems for detection of anomalies in digital data streams
CN103810388A (zh) * 2014-02-19 2014-05-21 福建工程学院 Large-scale ontology mapping method based on mapping-oriented blocking technology
CN105893349A (zh) * 2016-03-31 2016-08-24 新浪网技术(中国)有限公司 Category label matching and mapping method and device
CN107273430A (zh) * 2017-05-16 2017-10-20 北京奇虎科技有限公司 Data storage method and device
CN108614884A (zh) * 2018-05-03 2018-10-02 桂林电子科技大学 Clothing image retrieval method based on convolutional neural network
CN110569289A (zh) * 2019-09-11 2019-12-13 星环信息科技(上海)有限公司 Big-data-based column data processing method, device and medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9679370B2 (en) * 2012-12-06 2017-06-13 Nec Corporation Image processing device and image processing method
CN103744955B (zh) * 2014-01-04 2017-04-05 北京理工大学 Semantic query method based on ontology matching
CN104063801B (zh) * 2014-06-23 2016-05-25 有米科技股份有限公司 Clustering-based mobile advertisement recommendation method
CN104182517B (zh) * 2014-08-22 2017-10-27 北京羽乐创新科技有限公司 Data processing method and device
CN105049341A (zh) * 2015-09-10 2015-11-11 陈包容 Method and device for automatically adding remark information to a newly added instant messaging number
CN107704474B (zh) * 2016-08-08 2020-08-25 华为技术有限公司 Attribute alignment method and device
CN108090068B (zh) * 2016-11-21 2021-05-25 医渡云(北京)技术有限公司 Method and device for classifying tables in a hospital database
CN106777218B (zh) * 2016-12-26 2020-04-28 中央军委装备发展部第六十三研究所 Ontology matching method based on attribute similarity
CN110019169B (zh) * 2017-12-29 2021-04-13 中国移动通信集团陕西有限公司 Data processing method and device


Also Published As

Publication number Publication date
CN110569289A (zh) 2019-12-13
CN110569289B (zh) 2020-06-02

Similar Documents

Publication Publication Date Title
WO2021047373A1 (zh) Big-data-based column data processing method, device and medium
WO2021017721A1 (zh) Intelligent question answering method and apparatus, medium, and electronic device
WO2022126971A1 (zh) Density-based text clustering method and apparatus, device, and storage medium
US11620306B2 (en) Low-latency predictive database analysis
Meng et al. Semi-supervised heterogeneous fusion for multimedia data co-clustering
US11941034B2 (en) Conversational database analysis
WO2021164231A1 (zh) Official document abstract extraction method, apparatus and device, and computer-readable storage medium
CN107329987A (zh) Search system based on a Mongo database
CN111552788B (zh) Database retrieval method, system and device based on entity attribute relations
US11487943B2 (en) Automatic synonyms using word embedding and word similarity models
US20220300543A1 (en) Method of retrieving query, electronic device and medium
Li et al. Bounded approximate query processing
WO2023160137A1 (zh) Graph data storage method and system, and computer device
Ma et al. Typifier: Inferring the type semantics of structured data
US11809468B2 (en) Phrase indexing
WO2017107130A1 (zh) Data query method and database system
CN110874366A (zh) Data processing and query method and device
WO2022257455A1 (zh) Similar text determination method and apparatus, terminal device, and storage medium
Lu et al. A novel approach towards large scale cross-media retrieval
WO2017019883A1 (en) Locality-sensitive hashing for algebraic expressions
Kong et al. A Multi-source Heterogeneous Data Storage and Retrieval System for Intelligent Manufacturing
WO2022116324A1 (zh) Search model training method and apparatus, terminal device, and storage medium
CN115408491B (zh) Text retrieval method and system for historical data
CN116127086B (zh) Geoscience data demand analysis method and device based on scientific and technological literature resources
US11734318B1 (en) Superindexing systems and methods

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20862359

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 20862359

Country of ref document: EP

Kind code of ref document: A1