WO2015183306A1 - Processing a table of columns - Google Patents
- Publication number
- WO2015183306A1 (PCT/US2014/040234)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- column
- columns
- similarity measure
- predefined threshold
- larger
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
Definitions
- FIG. 5 depicts a flow diagram of a method for processing a table, according to another example of the present disclosure.
- the tables in databases often contain very large numbers of columns.
- Prior to performing data analytics applications on the tables, the data contained in the tables are often manually pre-processed. That is, an analyst often spends a great deal of time preparing the data for the analytics applications by manually determining which of the columns are informative and which of the columns are non-informative. Typically, only a small subset of the columns is informative. By way of example, analysts often spend upwards of 80% of their time preparing the data for the analysis. As such, the manual pre-processing of the data contained in the tables is typically time-consuming and inefficient.
- columns that may contain relatively low entropies may automatically be identified and removed from a table, which may be a structured data table.
- a determination may be made as to which of the columns contain information that is relatively similar to other columns and those columns may also be removed from the table.
- the similarity determination among the columns may be made through implementation of a two-step procedure, in which the computational complexity of the first step is relatively lower than the computational complexity of the second step. That is, the first step may be simpler and faster to implement than the second step, but may provide a relatively less accurate determination of similarity among the columns.
- the similarity measures disclosed herein may be effective for different column types (e.g., categorical and numerical) because the similarity measures use mutual information and entropies and thus do not assume a specific dependence, e.g., a linear mapping between two columns.
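To illustrate why an information-theoretic measure can work across column types, the following sketch computes a normalized mutual information between two columns from their empirical joint distribution. The function names and the normalization by the smaller entropy are assumptions made for illustration; they are not the measure defined in the disclosure.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (bits) of the empirical distribution of `values`."""
    n = len(values)
    return sum(-(c / n) * log2(c / n) for c in Counter(values).values())


def normalized_mutual_information(col_a, col_b):
    """I(A;B) / min(H(A), H(B)); the normalization is an assumed choice."""
    h_a, h_b = entropy(col_a), entropy(col_b)
    h_ab = entropy(list(zip(col_a, col_b)))  # joint entropy H(A,B)
    mutual_info = h_a + h_b - h_ab
    denom = min(h_a, h_b)
    return mutual_info / denom if denom > 0 else 0.0


# A categorical column and a numerical column linked by an arbitrary
# one-to-one (non-linear) mapping, plus a column unrelated to both.
colors = ["red", "green", "blue", "red", "green", "blue"]
codes = [9, 1, 4, 9, 1, 4]
unrelated = [0, 0, 0, 1, 1, 1]
nmi_related = normalized_mutual_information(colors, codes)
nmi_unrelated = normalized_mutual_information(colors, unrelated)
```

Because only the co-occurrence counts of values matter, the mapping between `colors` and `codes` need not be linear, or even numeric, for the pair to score as highly similar.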
- data may be generalized to any data that contains elements (equivalent to rows) and fields (equivalent to columns) and may be transformed to have the form of the table 112.
- data received in a format other than a table format, such as an XML file, may be transformed into the form of the table 112.
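As a hedged illustration of such a transformation, the sketch below flattens a small XML document into column names and rows using Python's standard `xml.etree.ElementTree` module. The sample document, the `xml_to_table` helper, and the convention that each child of the root is one row are all assumptions; the disclosure does not specify the XML layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: each <record> is an element (row), each child a field.
XML_TEXT = """
<records>
  <record><id>1</id><city>Paris</city></record>
  <record><id>2</id><city>Lyon</city><zip>69000</zip></record>
</records>
""".strip()


def xml_to_table(xml_text):
    """Return (column_names, rows); each child of the root becomes one row."""
    root = ET.fromstring(xml_text)
    columns = []
    for element in root:
        for field in element:
            if field.tag not in columns:
                columns.append(field.tag)  # keep first-seen field order
    rows = [
        [element.findtext(name) for name in columns]  # None marks a missing field
        for element in root
    ]
    return columns, rows


columns, rows = xml_to_table(XML_TEXT)
```

Fields that are absent from an element are kept as `None`, so sparsely populated fields surface as mostly empty columns in the resulting table.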
- the table processing apparatus 120 may be, for instance, a volatile or non-volatile memory, such as dynamic random access memory (DRAM), electrically erasable programmable read-only memory (EEPROM), magnetoresistive random access memory (MRAM), memristor, flash memory, floppy disk, a compact disc read only memory (CD-ROM), a digital video disc read only memory (DVD-ROM), or other optical or magnetic media, and the like, on which software may be stored.
- the modules 210-224 may be software modules, e.g., sets of machine readable instructions, stored in the table processing apparatus 120.
- the interface 204 may include hardware and/or software to enable the processor 202 to communicate with the database 110.
- the interface 204 may enable a wired or wireless connection to the database 110 (user input/output devices) over a network, such as an intranet, a local area network, the Internet, etc.
- the computing device 200 may be directly connected to the database 110, e.g., the computing device 200 may be a server that is to control access to the database 110.
- the interface may include a network interface card and/or may also include hardware and/or software to enable the processor 202 to communicate with various input and/or output devices, such as a keyboard, a mouse, a display, another computing device, etc., through which a user may input instructions into the computing device 200.
- a determination may be made as to whether a second similarity measure between the first column and the second column is larger than a second predefined threshold, in which the second similarity measure has a greater computational complexity as compared with the first similarity measure. That is, computation of the second similarity measure may require a greater amount of time and/or computational resources as compared with computation of the first similarity measure.
- the second similarity measure determining module 218 may determine a second similarity measure between the first column and the second column and may determine whether the second similarity measure is above a second predefined threshold. The second similarity measure and the second predefined threshold are discussed in greater detail below with respect to the method 400.
- one of the first column and the second column may be removed from the table 112. More particularly, for instance, the column removing module 224 may remove one of the first column and the second column from the table 112.
- a determination may be made that one of the first column and the second column may include non-informative, e.g., duplicative, information and thus, one of the first column and the second column may be removed to reduce the number of columns that may be analyzed during a data analytics operation.
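The two-step flow described above can be sketched as a cascade in which the costly measure is computed only for pairs that pass the cheap screen. Both measures below are simple stand-ins chosen for illustration (a distinct-value-count ratio and a one-to-one mapping check); they, the function names, and the thresholds are assumptions rather than the measures of the disclosure. The sketch also assumes non-empty columns.

```python
from itertools import combinations


def cheap_measure(a, b):
    """Stand-in first measure: ratio of distinct-value counts (fast, coarse)."""
    na, nb = len(set(a)), len(set(b))
    return min(na, nb) / max(na, nb)


def costly_measure(a, b):
    """Stand-in second measure: 1.0 when each column determines the other."""
    return max(len(set(a)), len(set(b))) / len(set(zip(a, b)))


def remove_similar_columns(table, first_threshold, second_threshold):
    """Drop one column of each pair that passes both similarity tests."""
    kept = dict(table)
    calls = {"cheap": 0, "costly": 0}
    for name_a, name_b in combinations(list(table), 2):
        if name_a not in kept or name_b not in kept:
            continue  # one column of this pair was already removed
        calls["cheap"] += 1
        if cheap_measure(kept[name_a], kept[name_b]) <= first_threshold:
            continue  # fails the cheap screen, so the costly measure is skipped
        calls["costly"] += 1
        if costly_measure(kept[name_a], kept[name_b]) > second_threshold:
            del kept[name_b]  # arbitrary choice: keep the earlier column
    return kept, calls


table = {
    "country": ["FR", "US", "FR", "DE"],
    "country_code": [33, 1, 33, 49],
    "flag": [0, 1, 0, 1],
    "row_id": [1, 2, 3, 4],
}
kept, calls = remove_similar_columns(table, first_threshold=0.9, second_threshold=0.9)
```

In this toy table the costly measure runs for a single pair while the cheap measure runs for four, which is the saving the cascade is meant to provide on tables with many columns.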
- entropies of each of the columns may be accessed.
- the entropies of the columns, for instance low entropies, may denote that the columns are empty, almost empty, constant, almost constant, etc.
- high entropies may denote that columns are full, almost full, have different values along the columns, etc.
- the entropy of a column may denote the amount of change between values contained in the rows of the columns.
- the entropy accessing module 214 may access the entropies by determining the entropies of the columns based upon the computed histograms of the columns.
- the entropy accessing module 214 may determine the entropies of the columns through computation of the following equation:
- the columns identified as having entropies that fall below the predetermined threshold may be removed from the table 112. That is, for instance, the column removing module 224 may remove the columns that have relatively constant data across their respective rows because those columns contain non-informative information.
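A minimal sketch of the entropy filter described above follows. The entropy equation is not reproduced in this text, so the sketch assumes the standard Shannon entropy, H = -Σ p_i log2 p_i, taken over each column's normalized histogram; the function names and the threshold value are likewise illustrative assumptions.

```python
from collections import Counter
from math import log2


def column_entropy(values):
    """Shannon entropy (bits) of a column's normalized value histogram."""
    histogram = Counter(values)  # value -> number of rows holding it
    total = len(values)
    return sum(-(c / total) * log2(c / total) for c in histogram.values())


def drop_low_entropy_columns(table, threshold):
    """Keep only the columns whose entropy does not fall below `threshold`."""
    return {
        name: values
        for name, values in table.items()
        if column_entropy(values) >= threshold
    }


table = {
    "status": ["ok"] * 8,                  # constant column
    "mostly_empty": [None] * 7 + ["x"],    # almost empty column
    "user_id": [1, 2, 3, 4, 5, 6, 7, 8],   # every row differs
}
entropies = {name: column_entropy(values) for name, values in table.items()}
filtered = drop_low_entropy_columns(table, threshold=1.0)
```

Here the constant column has zero entropy, the almost empty column has roughly 0.54 bits, and the all-distinct column has 3 bits, so only `user_id` survives a threshold of 1 bit.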
- In Equation (2), for a given pair of columns A and B, their histograms are denoted by P_A, P_B, their expectations are denoted by μ_A, μ_B, and "T" represents a transpose function. According to an example, the histograms for the columns are sorted. In addition, or alternatively, the shorter histogram is padded with zeroes such that the lengths of the two histograms are identical.
- a determination may be made as to whether the HS of column A and column B is larger than a first predefined threshold.
- the first similarity measure determining module 216 may compare the HS of column A and column B with the first predefined threshold to determine whether the HS exceeds the first predefined threshold.
- the first predefined threshold may be set to meet any suitable objective. Thus, for instance, the first predefined threshold may be set to be a higher value when it is desired to have a higher number of columns removed from the table 112.
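The histogram similarity (HS) test can be sketched as follows. Equation (2) is not reproduced in this text, so the sketch assumes a plausible form consistent with the description: the correlation between the two sorted, zero-padded histograms after subtracting each histogram's expectation (mean). That formula, the function names, the handling of flat histograms, and the threshold value are assumptions.

```python
from collections import Counter
from math import sqrt


def sorted_histogram(values, length):
    """Value counts sorted in descending order and zero-padded to `length`."""
    counts = sorted(Counter(values).values(), reverse=True)
    return counts + [0] * (length - len(counts))


def histogram_similarity(col_a, col_b):
    """Assumed HS: correlation of the centered, sorted, padded histograms.

    Raw counts are used because the correlation is unchanged by scaling.
    """
    length = max(len(set(col_a)), len(set(col_b)))
    p_a = sorted_histogram(col_a, length)
    p_b = sorted_histogram(col_b, length)
    mu_a, mu_b = sum(p_a) / length, sum(p_b) / length
    d_a = [p - mu_a for p in p_a]
    d_b = [p - mu_b for p in p_b]
    norm = sqrt(sum(x * x for x in d_a)) * sqrt(sum(x * x for x in d_b))
    if norm == 0.0:
        # A flat histogram has no variation; assumed convention for this case.
        return 1.0 if p_a == p_b else 0.0
    return sum(x * y for x, y in zip(d_a, d_b)) / norm


FIRST_THRESHOLD = 0.99  # hypothetical value
names = ["a", "a", "a", "b", "b", "c"]
codes = [10, 10, 10, 20, 20, 30]
skewed = [1, 1, 1, 1, 1, 2]
hs_same_shape = histogram_similarity(names, codes)
hs_other_shape = histogram_similarity(names, skewed)
passes_first_step = hs_same_shape > FIRST_THRESHOLD
```

Two columns can share a histogram shape without carrying the same information, which is why a pair that exceeds the first threshold is only a candidate and still has to pass the second, costlier similarity measure.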
Abstract
According to an example, the present invention relates to a method of processing a table comprising a plurality of columns, in which a determination may be made as to whether a first similarity measure between a first column of the plurality of columns and a second column of the plurality of columns is larger than a first predefined threshold. In response to a determination that the first similarity measure is larger than the first predefined threshold, a determination may be made as to whether a second similarity measure between the first column and the second column is larger than a second predefined threshold. In addition, one of the first column and the second column may be removed from the table if the second similarity measure is larger than the second predefined threshold.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2014/040234 WO2015183306A1 (fr) | 2014-05-30 | 2014-05-30 | Processing a table of columns |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2014/040234 WO2015183306A1 (fr) | 2014-05-30 | 2014-05-30 | Processing a table of columns |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2015183306A1 (fr) | 2015-12-03 |
Family
ID=54699458
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2014/040234 WO2015183306A1 (fr) | Processing a table of columns | 2014-05-30 | 2014-05-30 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2015183306A1 (fr) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10353927B2 (en) | 2014-07-10 | 2019-07-16 | Entit Software Llc | Categorizing columns in a data table |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050240615A1 (en) * | 2004-04-22 | 2005-10-27 | International Business Machines Corporation | Techniques for identifying mergeable data |
US20050289191A1 (en) * | 2004-06-29 | 2005-12-29 | International Business Machines Corporation | Method, system, program for determining frequency of updating database histograms |
US20070214168A1 (en) * | 2003-11-24 | 2007-09-13 | Computer Associates Think, Inc. | Method and System for Removing Rows from Directory Tables |
US20120173226A1 (en) * | 2010-12-30 | 2012-07-05 | International Business Machines Corporation | Table merging with row data reduction |
US20130041910A1 (en) * | 2006-02-17 | 2013-02-14 | Jonathan T. Betz | Attribute Entropy as a Signal in Object Normalization |
- 2014-05-30: WO application PCT/US2014/040234 (WO2015183306A1), active, Application Filing
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 14892891; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 14892891; Country of ref document: EP; Kind code of ref document: A1 |