WO2015183306A1 - Processing a table of columns - Google Patents

Info

Publication number
WO2015183306A1
Authority
WO
WIPO (PCT)
Prior art keywords
column
columns
similarity measure
predefined threshold
larger
Application number
PCT/US2014/040234
Other languages
English (en)
Inventor
Hadas Kogan
Inbal Tadeski
Original Assignee
Hewlett-Packard Development Company, L.P.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2014-05-30
Filing date
2014-05-30
Publication date
2015-12-03
Application filed by Hewlett-Packard Development Company, L.P.
Priority to PCT/US2014/040234
Publication of WO2015183306A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases

Definitions

  • FIG. 5 depicts a flow diagram of a method for processing a table, according to another example of the present disclosure.
  • the tables in databases often contain very large numbers of columns.
  • prior to performing data analytics applications on the tables, the data contained in the tables are often manually pre-processed. That is, an analyst often spends a great deal of time preparing the data for the analytics applications by manually determining which of the columns are informative and which of the columns are non-informative. Typically, only a small subset of the columns is informative. By way of example, analysts often spend upwards of 80% of their time preparing the data for the analysis. As such, the manual pre-processing of the data contained in the tables is typically time-consuming and inefficient.
  • columns that may contain relatively low entropies may automatically be identified and removed from a table, which may be a structured data table.
  • a determination may be made as to which of the columns contain information that is relatively similar to other columns and those columns may also be removed from the table.
  • the similarity determination among the columns may be made through implementation of a two-step procedure, in which the computational complexity of the first step is relatively lower than the computational complexity of the second step. That is, the first step may be simpler and faster to implement than the second step, but may provide a relatively less accurate determination of similarity among the columns.
  • the similarity measures disclosed herein may be effective for different column types (e.g., categorical and numerical) because they use mutual information and entropies and thus do not assume a specific dependence, e.g., a linear mapping, between two columns.
  • data may be generalized to any data that contains elements (equivalent to rows) and fields (equivalent to columns) and may be transformed to have the form of the table 112.
  • data received in a format other than a table format, such as an XML file, may be transformed into the form of the table 112.
  • the table processing apparatus 120 may be, for instance, a volatile or non-volatile memory, such as dynamic random access memory (DRAM), electrically erasable programmable read-only memory (EEPROM), magnetoresistive random access memory (MRAM), memristor, flash memory, floppy disk, a compact disc read only memory (CD-ROM), a digital video disc read only memory (DVD-ROM), or other optical or magnetic media, and the like, on which software may be stored.
  • the modules 210-224 may be software modules, e.g., sets of machine readable instructions, stored in the table processing apparatus 120.
  • the interface 204 may include hardware and/or software to enable the processor 202 to communicate with the database 110.
  • the interface 204 may enable a wired or wireless connection to the database 110 (user input/output devices) over a network, such as an intranet, a local area network, the Internet, etc.
  • the computing device 200 may be directly connected to the database 110, e.g., the computing device 200 may be a server that is to control access to the database 110.
  • the interface may include a network interface card and/or may also include hardware and/or software to enable the processor 202 to communicate with various input and/or output devices, such as a keyboard, a mouse, a display, another computing device, etc., through which a user may input instructions into the computing device 200.
  • a determination may be made as to whether a second similarity measure between the first column and the second column is larger than a second predefined threshold, in which the second similarity measure has a greater computational complexity as compared with the first similarity measure. That is, computation of the second similarity measure may require a greater amount of time and/or computational resources as compared with computation of the first similarity measure.
  • the second similarity measure determining module 218 may determine a second similarity measure between the first column and the second column and may determine whether the second similarity measure is above a second predefined threshold. The second similarity measure and the second predefined threshold are discussed in greater detail below with respect to the method 400.
  • one of the first column and the second column may be removed from the table 112. More particularly, for instance, the column removing module 224 may remove one of the first column and the second column from the table 112.
  • a determination may be made that one of the first column and the second column may include non-informative, e.g., duplicative, information and thus, one of the first column and the second column may be removed to reduce the number of columns that may be analyzed during a data analytics operation.
  • entropies of each of the columns may be accessed.
  • the entropies of the columns, for instance, low entropies, may denote that the columns are empty, almost empty, constant, almost constant, etc.
  • high entropies may denote that columns are full, almost full, have different values along the columns, etc.
  • the entropy of a column may denote the amount of change between values contained in the rows of the columns.
  • the entropy accessing module 214 may access the entropies by determining the entropies of the columns based upon the computed histograms of the columns.
  • the entropy accessing module 214 may determine the entropies of the columns through computation of the following equation:
  • the columns identified as having entropies that fall below the predetermined threshold may be removed from the table 112. That is, for instance, the column removing module 224 may remove the columns that have relatively constant data across their respective rows because those columns contain non-informative information.
  • in Equation (2), for a given pair of columns A and B, their histograms are denoted by P_A, P_B, their expectations are denoted by μ_A, μ_B, and "T" represents a transpose function. According to an example, the histograms for the columns are sorted. In addition, or alternatively, the shorter histogram is padded with zeroes such that the lengths of the two histograms are identical.
  • a determination may be made as to whether the HS of column A and column B is larger than a first predefined threshold.
  • the first similarity measure determining module 216 may compare the HS of column A and column B with the first predefined threshold to determine whether the HS exceeds the first predefined threshold.
  • the first predefined threshold may be set to meet any suitable objective. Thus, for instance, the first predefined threshold may be set to be a higher value when it is desired to have a higher number of columns removed from the table 112.
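
The entropy-based pruning described in the excerpts above can be sketched as follows. The entropy equation itself is not reproduced in this text, so the sketch assumes the standard Shannon entropy of each column's normalized histogram. The names `column_entropy` and `drop_low_entropy_columns`, the dictionary-of-lists table layout, and the threshold of 0.9 bits are illustrative assumptions, not taken from the disclosure.

```python
import math
from collections import Counter


def column_entropy(values):
    """Shannon entropy (in bits) of a column's empirical histogram.

    Assumption: the disclosure's entropy equation is not shown in the text,
    so the standard Shannon form is used here.
    """
    counts = Counter(values)  # histogram of the values in the column
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def drop_low_entropy_columns(table, threshold):
    """Remove columns whose entropy falls below `threshold`.

    `table` is a hypothetical layout: a dict mapping column name -> list of values.
    """
    return {name: col for name, col in table.items()
            if column_entropy(col) >= threshold}


# Illustrative data only.
table = {
    "constant": ["x", "x", "x", "x"],           # entropy 0: no information
    "binary": ["a", "b", "a", "b"],             # entropy 1 bit
    "almost_constant": [None, None, None, 1],   # entropy of about 0.81 bits
}
kept = drop_low_entropy_columns(table, threshold=0.9)
print(list(kept))
```

With this threshold only the `binary` column survives; the constant and almost constant columns are treated as non-informative. A real table would likely need a binning step for numerical columns before the histogram is computed, which this sketch omits.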

Abstract

According to an example, in a method for processing a table having a plurality of columns, a determination may be made as to whether a first similarity measure between a first column of the plurality of columns and a second column of the plurality of columns is larger than a first predefined threshold. In response to a determination that the first similarity measure is larger than the first predefined threshold, a determination may be made as to whether a second similarity measure between the first column and the second column is larger than a second predefined threshold. In addition, one of the first column and the second column may be removed from the table if the second similarity measure is larger than the second predefined threshold.
PCT/US2014/040234 2014-05-30 2014-05-30 Processing a table of columns WO2015183306A1 (fr)
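
The two-step check in the abstract can be illustrated with a minimal sketch. The actual similarity measures are not reproduced in this text, so both are stand-ins: the cheap first step is assumed to be the cosine similarity of the sorted, zero-padded histograms, and the costlier second step is assumed to be mutual information normalized by the larger column entropy. All function names, thresholds, and data below are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (bits) of the empirical histogram of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def histogram_similarity(a, b):
    """Cheap first step (assumed stand-in, not the disclosure's Equation (2)):
    cosine similarity of the sorted, zero-padded histograms."""
    pa = sorted((c / len(a) for c in Counter(a).values()), reverse=True)
    pb = sorted((c / len(b) for c in Counter(b).values()), reverse=True)
    size = max(len(pa), len(pb))
    pa += [0.0] * (size - len(pa))  # pad the shorter histogram with zeroes
    pb += [0.0] * (size - len(pb))
    dot = sum(x * y for x, y in zip(pa, pb))
    norm = math.sqrt(sum(x * x for x in pa)) * math.sqrt(sum(y * y for y in pb))
    return dot / norm if norm else 0.0


def normalized_mutual_information(a, b):
    """Costlier second step (assumed): I(A;B) / max(H(A), H(B))."""
    ha, hb = entropy(a), entropy(b)
    hab = entropy(list(zip(a, b)))  # joint entropy needs a pass over row pairs
    denom = max(ha, hb)
    return (ha + hb - hab) / denom if denom > 0 else 0.0


def remove_similar_columns(table, first_threshold, second_threshold):
    """Keep a column unless it passes both checks against an already kept column."""
    kept = {}
    for name, col in table.items():
        redundant = any(
            histogram_similarity(col, other) > first_threshold
            # `and` short-circuits: the second measure runs only if the first passes
            and normalized_mutual_information(col, other) > second_threshold
            for other in kept.values()
        )
        if not redundant:
            kept[name] = col
    return kept


# Illustrative data only.
table = {
    "city": ["NY", "LA", "NY", "SF", "LA", "NY"],
    "city_code": [1, 2, 1, 3, 2, 1],                   # one-to-one with "city"
    "shuffled": ["LA", "NY", "SF", "NY", "NY", "LA"],  # same histogram, different rows
}
result = remove_similar_columns(table, first_threshold=0.95, second_threshold=0.9)
print(list(result))
```

In this example `city_code` passes both checks against `city` and is removed, while `shuffled` passes the cheap histogram check but fails the mutual information check and is kept. This is why the first step only acts as a filter: matching histograms do not imply that two columns carry the same information row by row.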

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2014/040234 WO2015183306A1 (fr) 2014-05-30 2014-05-30 Processing a table of columns

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2014/040234 WO2015183306A1 (fr) 2014-05-30 2014-05-30 Processing a table of columns

Publications (1)

Publication Number Publication Date
WO2015183306A1 (fr) 2015-12-03

Family

ID=54699458

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2014/040234 WO2015183306A1 (fr) 2014-05-30 2014-05-30 Processing a table of columns

Country Status (1)

Country Link
WO (1) WO2015183306A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10353927B2 (en) 2014-07-10 2019-07-16 Entit Software Llc Categorizing columns in a data table

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050240615A1 (en) * 2004-04-22 2005-10-27 International Business Machines Corporation Techniques for identifying mergeable data
US20050289191A1 (en) * 2004-06-29 2005-12-29 International Business Machines Corporation Method, system, program for determining frequency of updating database histograms
US20070214168A1 (en) * 2003-11-24 2007-09-13 Computer Associates Think, Inc. Method and System for Removing Rows from Directory Tables
US20120173226A1 (en) * 2010-12-30 2012-07-05 International Business Machines Corporation Table merging with row data reduction
US20130041910A1 (en) * 2006-02-17 2013-02-14 Jonathan T. Betz Attribute Entropy as a Signal in Object Normalization

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 14892891

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 14892891

Country of ref document: EP

Kind code of ref document: A1