EP3933699A1 - A computer-implemented method and apparatus for automatically annotating columns of a table with semantic types - Google Patents


Info

Publication number
EP3933699A1
Authority
EP
European Patent Office
Prior art keywords
semantic
cell
neural network
row
annotations
Prior art date
Legal status
Withdrawn
Application number
EP20183087.4A
Other languages
German (de)
French (fr)
Inventor
Rakebul Hasan
Current Assignee
Siemens AG
Original Assignee
Siemens AG
Priority date
Filing date
Publication date
Application filed by Siemens AG
Priority to EP20183087.4A
Priority to US17/350,330 (US11977833B2)
Publication of EP3933699A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • G06F40/169Annotation, e.g. comment data or footnotes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/254Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • G06F40/177Editing, e.g. inserting or deleting of tables; using ruled lines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation

Definitions

  • the computer-implemented method according to the first aspect of the present invention can comprise several main steps.
  • the computer-implemented method comprises three main steps S1, S2, S3.
  • the computer-implemented method illustrated in Fig. 1 is provided for generating automatically annotations for tabular cell data of a table T having columns C and rows R.
  • raw cell data of cells C of a row R within the table T is supplied as input to an embedding layer EL of a semantic type annotation neural network STANN which transforms the received raw cell data of the cells C of the supplied row R into cell embedding vectors e.
  • the cell embedding vectors e generated by the embedding layer EL are processed by a self-attention layer SAL of the semantic type annotation neural network STANN to calculate attentions among the cells C of the respective row R of said table T encoding a context within said row R output as cell context vectors.
  • the cell context vectors generated by the self-attention layer SAL are processed by a classification layer CL of the semantic type annotation neural network STANN to predict semantic column type annotations and/or to predict relations between semantic column type annotations for the columns C of the respective table T.
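  • The three processing steps S1 to S3 can be sketched in a compact, self-contained way. The sketch below is purely illustrative and not the patented implementation: a deterministic hashing trick stands in for the trained cell encoder of the embedding layer EL, a single unparameterized attention head stands in for the self-attention layer SAL, and the linear classifier of the classification layer CL is randomly initialized rather than trained. All dimensions and type labels are hypothetical.

```python
import zlib
import numpy as np

D = 16                                  # cell embedding dimension (hypothetical)
TYPES = ["City", "Country", "NONE"]     # hypothetical semantic column types
rng = np.random.default_rng(0)

# S1: embedding layer EL -- transform raw cell text into a cell embedding vector.
# A deterministic hash-seeded random vector stands in for the trained cell encoder.
def embed_cell(text: str) -> np.ndarray:
    g = np.random.default_rng(zlib.crc32(text.encode()))
    return g.standard_normal(D)

# S2: self-attention layer SAL -- attentions among the cells of one row,
# producing one cell context vector per cell.
def self_attention(E: np.ndarray) -> np.ndarray:
    scores = E @ E.T / np.sqrt(D)                    # pairwise attention scores
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                # softmax over the cells
    return w @ E                                     # cell context vectors

# S3: classification layer CL -- linear classifier over the context vectors.
W = rng.standard_normal((D, len(TYPES)))

def classify(ctx: np.ndarray) -> np.ndarray:
    logits = ctx @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)          # per-cell type probabilities

row = ["Munich", "Germany", "1.46 million"]
E = np.stack([embed_cell(c) for c in row])           # S1
y = classify(self_attention(E))                      # S2 + S3: shape (3 cells, 3 types)
```

Each row of y is a probability distribution over the candidate semantic types for one cell; in the trained network these per-cell distributions are what the classification layer CL outputs for a supplied row R.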
  • Fig. 2 shows the overall architecture of the semantic type annotation neural network STANN as used by the computer-implemented method illustrated in Fig. 1 .
  • the used semantic type annotation neural network STANN comprises an embedding layer EL, a self-attention layer SAL and a classification layer CL.
  • the embedding layer EL is adapted to transform the received raw cell data of cells C of a supplied row R of a table T into cell embedding vectors e.
  • The received input data comprises a row R having cell data of three different cells corresponding to columns in the table T including a first cell data "Munich", a second cell data "Germany" and a third cell data "1.46 million".
  • A task of the computer-implemented method according to the present invention resides in providing automatically corresponding semantic column type annotations for the different columns of the table T. In the very simple example of Fig. 2, these semantic column type annotations comprise the semantic type "City" for the first column, the semantic type "Country" for the second column and the identification that no semantic type could be found or generated for the cell data in the third column.
  • each cell or cell data for a given row R is first considered as an atomic unit. Then, self-attention is used to learn the context among the cells in the row R.
  • the embedding layer EL of the semantic type annotation neural network STANN transforms each cell text of a cell C into a corresponding embedding vector e using a cell encoder. This cell encoder can be trained separately using an autoencoder structure as illustrated in Fig. 3 .
  • the semantic type annotation neural network STANN as shown in Fig. 2 further comprises a self-attention layer SAL adapted to calculate attentions among the cells C of the respective row R of the table T encoding the context within said row R output as cell context vectors.
  • the self-attention layer SAL is adapted to learn the context for semantic types for each cell C.
  • the used self-attention mechanism SAM can be based on transformers as described by Vaswani, Ashish, et al. "Attention is all you need" Advances in neural information processing systems, 2017 , to encode context in a sentence by computing attention among the tokens in the respective sentence.
  • the self-attention layer SAL uses this approach to encode context in a row R by computing attention among the cells C in the respective row R.
  • Transformers can also encode the order of the sequence of the tokens using a positional encoding method. In case that the order of the cells is not relevant, the transformers can be modified by removing the positional encoding.
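  • That removing the positional encoding makes the layer indifferent to cell order can be checked directly: a self-attention head without positional encoding is permutation-equivariant, i.e. shuffling the cells of a row merely shuffles the resulting cell context vectors in the same way. A minimal sketch under assumed dimensions, with a single head and no learned projections:

```python
import numpy as np

D = 8  # embedding dimension (hypothetical)

def attention(X: np.ndarray) -> np.ndarray:
    # single-head self-attention with no positional encoding
    s = X @ X.T / np.sqrt(D)
    w = np.exp(s - s.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)     # softmax over cells
    return w @ X

rng = np.random.default_rng(1)
X = rng.standard_normal((4, D))           # embeddings of 4 cells of one row
perm = [2, 0, 3, 1]                       # an arbitrary reordering of the cells

# permutation equivariance: reordering the input cells reorders the
# output context vectors identically, so cell order carries no signal
assert np.allclose(attention(X[perm]), attention(X)[perm])
```

With a positional encoding added to X before the attention step, this property would no longer hold, which is exactly why the encoding is removed when the order of the cells is not relevant.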
  • the semantic type annotation neural network STANN as illustrated in Fig. 2 further comprises a classification layer CL.
  • the classification layer CL is adapted to process the cell context vectors received from the self-attention layer SAL to predict semantic column type annotations and/or to predict relations between semantic column type annotations for the columns C of the respective table T.
  • the classification layer CL is adapted to learn the semantic type for each cell C.
  • a linear classifier can be used for this classification task.
  • the task performed by the classification layer CL is similar to the task in a named entity recognition, NER, process where it is necessary to predict a type for each token in a sentence.
  • the final classification in a NER model is done by a conditional random field model.
  • the conditional random field model conditions on the order of the sequence of tokens.
  • However, the order of the cells C in a row R of a table T in most cases does not have to be considered, and hence a linear layer can be used for the final classification task performed by the classification layer CL.
  • To annotate a complete table, all rows R of the table T can be passed through the semantic type annotation neural network STANN as illustrated in Fig. 2. A mean of all the semantic type prediction probabilities over the rows can then be calculated.
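  • The pooling of the per-row prediction probabilities can be illustrated with made-up numbers for a table with three rows R and two columns C:

```python
import numpy as np

# per-row, per-column semantic type probabilities from the classification
# layer CL; the values are invented for illustration,
# shape (rows, columns, candidate types)
y = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],   # row 1
    [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]],   # row 2
    [[0.5, 0.3, 0.2], [0.3, 0.6, 0.1]],   # row 3
])
TYPES = ["City", "Country", "NONE"]

mean_p = y.mean(axis=0)                    # mean pooling over all rows R
prediction = [TYPES[i] for i in mean_p.argmax(axis=1)]
print(prediction)                          # → ['City', 'Country']
```

Averaging before taking the argmax lets every row contribute evidence, so a few noisy rows do not flip the semantic column type predicted for a column.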
  • the semantic type annotation neural network STANN as depicted in Fig. 2 predicts only the semantic types that correspond to the different columns C of the table T.
  • To also predict relations between the columns C of the table T, the same model can be used with slight modifications.
  • The embedding layer EL of the modified semantic type annotation neural network STANN, which is used to also predict relations between the different columns C of the table T, can additionally include the cell position embeddings of the two cells C which are the subject and object of the relation in question.
  • The classification layer CL is also slightly modified and performs a relation classification for the whole row R, with two cells C of the row R highlighted as the subject and object of the respective relation.
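  • One simple way to realize such subject/object highlighting, shown here only as an assumed sketch and not as the patented mechanism, is to add two dedicated marker vectors to the embeddings of the two cells in question before the row is passed to the self-attention layer SAL:

```python
import numpy as np

D = 8
rng = np.random.default_rng(2)

# hypothetical learned marker embeddings for the two highlighted cells
SUBJ_MARK = rng.standard_normal(D)
OBJ_MARK = rng.standard_normal(D)

def mark_cells(E: np.ndarray, subj_idx: int, obj_idx: int) -> np.ndarray:
    # E: cell embeddings of one row R, shape (cells, D)
    E = E.copy()
    E[subj_idx] += SUBJ_MARK   # highlight the subject cell of the relation
    E[obj_idx] += OBJ_MARK     # highlight the object cell of the relation
    return E

E = rng.standard_normal((3, D))
E_marked = mark_cells(E, subj_idx=0, obj_idx=1)

# unmarked cells are unchanged; the marked cells carry the position signal
assert np.allclose(E_marked[2], E[2])
assert not np.allclose(E_marked[0], E[0])
```

The downstream relation classifier then sees a whole-row representation in which exactly two cells are distinguishable as subject and object.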
  • a bidirectional recurrent neural network RNN can be trained as the encoder ENC of an autoencoder AE on cell embeddings provided by a byte-pair encoding model BPE.
  • the trained bidirectional recurrent neural network RNN can then be used as an encoder within the embedding layer EL of the semantic type annotation neural network STANN as shown in Fig. 2 .
  • The decoder DEC of the autoencoder AE learns to decode the lower dimensional representation ldr back to the original embedding representation as closely as possible, providing a reconstructed cell text embedding ct-e' whose reconstruction error is fed back by a loss function LF.
  • For the encoder ENC and the decoder DEC of the autoencoder AE as illustrated in Fig. 3, it is possible to use bidirectional recurrent neural networks RNNs.
  • The apparatus and method according to the present invention reuse the learned encoder ENC of the autoencoder AE as shown in Fig. 3 in the embedding layer EL of the semantic type annotation neural network STANN as shown in Fig. 2.
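  • The training of such an autoencoder and the reuse of only its encoder can be sketched with linear maps standing in for the bidirectional RNNs; the random data, the dimensions and the plain gradient descent loop below are illustrative assumptions, not the patented training procedure:

```python
import numpy as np

rng = np.random.default_rng(3)
d_bpe, d_ldr = 12, 4                     # BPE embedding size / size of ldr
X = rng.standard_normal((200, d_bpe))    # stand-in for BPE cell embeddings

# linear stand-ins for the bidirectional RNN encoder ENC and decoder DEC
W_enc = 0.1 * rng.standard_normal((d_bpe, d_ldr))
W_dec = 0.1 * rng.standard_normal((d_ldr, d_bpe))

def reconstruction_loss() -> float:
    ldr = X @ W_enc            # ENC: lower dimensional representation ldr
    X_rec = ldr @ W_dec        # DEC: reconstructed cell text embedding ct-e'
    return float(np.mean((X - X_rec) ** 2))   # loss function LF

loss_before = reconstruction_loss()
lr = 0.1
for _ in range(300):                           # plain gradient descent on LF
    ldr = X @ W_enc
    err = 2.0 * (ldr @ W_dec - X) / X.size     # d LF / d reconstruction
    g_dec = ldr.T @ err                        # gradient w.r.t. W_dec
    g_enc = X.T @ (err @ W_dec.T)              # gradient w.r.t. W_enc
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
loss_after = reconstruction_loss()

# after training, only the encoder is reused in the embedding layer EL
encode = lambda cell_embedding: cell_embedding @ W_enc
```

The decoder exists only to provide the reconstruction signal during training; at annotation time the embedding layer EL needs nothing but the trained encoder.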
  • the generated annotations of the tabular cell data of the table T can be supplied to an ETL process used to generate a knowledge graph instance stored in a memory.
  • the classification layer CL calculates column type vectors y comprising for the cell data of each cell C of the respective supplied row R predicted semantic column type probabilities.
  • a mean pooling of the column type vectors y of all rows R of the table T can be performed to predict a semantic column type for each column C of the respective table T.
  • the self-attention layer SAL of the semantic type annotation neural network STANN comprises a stack of transformers to calculate the attentions among the cells C of the respective row R of the table T.
  • the semantic type annotation neural network STANN can be trained in a supervised learning process using labeled rows R as samples.
  • The computer-implemented method as shown in the flowchart of Fig. 1 can be implemented in a software tool used to generate automatically annotations for tabular cell data of a table T received from any kind of data source.
  • The computer-implemented method according to the present invention does not rely on ontology alignment approaches. Accordingly, the computer-implemented method according to the present invention is automatic and data-driven. Further, it does not require manually engineered features for different use cases. Unlike conventional approaches, the computer-implemented method and apparatus according to the present invention can learn the context for each semantic type from the raw data.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

A computer-implemented method for generating automatically annotations for tabular cell data of a table having columns and rows, wherein the method comprises the steps of: supplying raw cell data of cells of a row of the table as input to an embedding layer of a semantic type annotation neural network which transforms the received raw cell data of the cells of the supplied row into cell embedding vectors; processing the cell embedding vectors generated by the embedding layer by a self-attention layer of the semantic type annotation neural network to calculate attentions among the cells of the respective row of said table encoding a context within said row output as cell context vectors; and processing the cell context vectors generated by the self-attention layer by a classification layer of the semantic type annotation neural network to predict semantic column type annotations and/or to predict relations between semantic column type annotations for the columns of said table.

Description

  • The invention relates to a computer-implemented method for generating automatically annotations for tabular cell data of a table consisting of columns and rows, in particular to perform context-aware semantic annotation for tabular data using self-attention.
  • Tables comprise collections of related data entities organized in rows. A table T comprises several columns C and rows R. A table consists of a plurality of cell data organized in cells and rows of the table T. A detection of a semantic type of data columns in relational tables is necessary for various data preparation and information retrieval tasks such as schema matching, data discovery, semantic search or data cleaning. Further, recognizing the semantic types of table data is requested to aggregate information from multiple different tabular data sources or tables.
  • The mapping of tabular data of tables T can be categorized into two main categories, i.e. ontology alignment approaches and machine learning approaches. An ontology alignment approach does align a table schema to a target ontology. Machine learning approaches are provided to predict column annotations.
  • Ontology alignment approaches may first automatically generate a putative ontology from the table schema. This generation process can be based on manually curated rules. A putative ontology is then augmented and refined based on the instance data in the respective table. For example, entropy-based rules can be used in order to discard or keep classes and properties from the initial putative ontology. A further step in the ontology alignment approach is to find alignments between the putative ontology and the target ontology. For the ontology alignment approach, it is possible to use syntactic, semantic or structural similarity metrics to find similarity between the elements of two ontologies. A final step in the ontology alignment approach is to generate declarative mapping definitions from the estimated alignments in a conventional standard format such as R2RML. These declarative mappings can be used both to transform and materialize the tabular source data to a knowledge graph or to query the tabular source data using a target ontology.
  • Among the conventional machine learning-based approaches to predict column annotations, it is possible to treat the semantic type detection problem as a multi-class classification problem using a feedforward deep neural network. For instance, Hulsebos, Madelon, et al. "Sherlock: A deep learning approach to semantic data type detection", Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, is an approach where only the values of a single column are considered for a semantic type. Accordingly, this conventional approach ignores the context of all the other columns in the table. Zhang, Dan, et al. "Sato: Contextual Semantic Type Detection in Tables", arXiv preprint arXiv:1911.06311, 2019, describes an approach where both single-column features and global table context features are considered. However, the extraction of these features requires case-by-case analysis and manual feature engineering.
  • Accordingly, it is an object of the present invention to provide a method and apparatus for generating automatically annotations for tabular cell data of a table where the features can be learned from the raw data and therefore no manual feature engineering is required.
  • This object is achieved by a computer-implemented method according to the first aspect of the present invention.
  • The invention provides according to the first aspect a computer-implemented method for generating automatically annotations for tabular cell data of a table having columns and rows,
    wherein the method comprises the steps of:
    supplying raw cell data of cells of a row of the table as input to an embedding layer of a semantic type annotation neural network which transforms the received raw cell data of the cells of the supplied row into cell embedding vectors, processing the cell embedding vectors generated by the embedding layer by a self-attention layer of the semantic type annotation neural network to calculate attentions among the cells of the respective row of said table encoding a context within said row output as cell context vectors and processing the cell context vectors generated by the self-attention layer by a classification layer of the semantic type annotation neural network to predict semantic column type annotations and/or to predict relations between semantic column type annotations for the columns of said table.
  • The computer-implemented method according to the present invention has the advantage that it does take into consideration the context of a certain entity occurrence in the table.
  • A further advantage of the computer-implemented method according to the first aspect of the present invention is that in contrast to existing machine learning approaches which rely on hand-picked features the computer-implemented method according to the present invention does not require any manual feature engineering.
  • With the computer-implemented method according to the present invention, a self-attention model is used to encode context of each cell in a row of the table and then a classifier is used to predict semantic type annotations for each column of the table. Consequently, the computer-implemented method according to the present invention is relatively simple to engineer and can be adapted to a wide range of use cases since it does not use any hand-crafted or hand-picked features.
  • In a possible embodiment of the computer-implemented method according to the first aspect of the present invention, a bidirectional recurrent neural network is trained as an encoder of an autoencoder on cell embeddings provided by a byte-pair encoding model and is used as an encoder of the embedding layer of the semantic type annotation neural network.
  • In a further possible embodiment of the computer-implemented method according to the first aspect of the present invention, the generated annotations of the tabular cell data of the table are supplied to an ETL process used to generate a knowledge graph instance stored in a memory.
  • In a still further possible embodiment of the computer-implemented method according to the first aspect of the present invention, the classification layer calculates column type vectors comprising for the cell data of each cell of the respective supplied row predicted semantic column type probabilities.
  • In a further possible embodiment of the computer-implemented method according to the first aspect of the present invention, a mean pooling of the column type vectors of all rows of the table is performed to predict a semantic column type for each column of said table.
  • In a still further possible embodiment of the computer-implemented method according to the first aspect of the present invention, the self-attention layer of the semantic type annotation neural network comprises a stack of transformers to calculate attentions among the cells of the respective row of said table.
  • In a still further possible embodiment of the computer-implemented method according to the first aspect of the present invention, the semantic type annotation neural network is trained in a supervised learning process using labeled rows as samples.
  • The invention provides according to a further aspect an annotation software tool comprising the features of claim 8.
  • The invention provides according to the second aspect an annotation software tool adapted to perform the computer-implemented method according to the first aspect of the present invention to generate automatically annotations for tabular cell data of a table received from a data source.
  • The invention further provides according to a further aspect an apparatus comprising the features of claim 9.
  • The invention provides according to the third aspect an apparatus used for automatic generation of annotations processed for providing a knowledge graph instance of a knowledge graph stored in a knowledge base,
    said apparatus comprising
    a semantic type annotation neural network having an embedding layer adapted to transform the received raw cell data of cells of a supplied row of a table into cell embedding vectors,
    a self-attention layer adapted to calculate attentions among the cells of the respective row of said table encoding a context within said row output as cell context vectors and having
    a classification layer adapted to process the cell context vectors received from the self-attention layer to predict semantic column type annotations and/or to predict relations between semantic column type annotations for the columns of said table.
  • In a still further possible embodiment of the apparatus according to the third aspect of the present invention, the bidirectional recurrent neural network trained as an encoder of an autoencoder on cell embeddings provided by a byte-pair encoding model is implemented as an encoder of the embedding layer of the semantic type annotation neural network of said apparatus.
  • In a possible embodiment of the apparatus according to the third aspect of the present invention, the generated annotations of the tabular cell data of the table are supplied to an ETL process used to generate a knowledge graph instance of the knowledge base.
  • In a further possible embodiment of the apparatus according to the third aspect of the present invention, the classification layer of the semantic type annotation neural network is adapted to calculate column type vectors comprising for the cell data of each cell of the respective supplied row predicted semantic column type probabilities.
  • In a further possible embodiment of the apparatus according to the third aspect of the present invention, a mean pooling of column type vectors of all rows of the table is performed to predict the semantic type annotation of each column of said table.
  • In a still further possible embodiment of the apparatus according to the third aspect of the present invention, the self-attention layer of the semantic type annotation neural network comprises a stack of transformers adapted to calculate attentions among the cells of the respective row of said table.
  • In a further possible embodiment of the apparatus according to the third aspect of the present invention, the semantic type annotation neural network of said apparatus is trained in a supervised learning process using labeled rows as samples.
  • In the following, possible embodiments of the different aspects of the present invention are described in more detail with reference to the enclosed figures.
  • Fig. 1 shows a flowchart of a possible exemplary embodiment of a computer-implemented method according to the first aspect of the present invention;
  • Fig. 2 shows a diagram for illustrating a possible exemplary embodiment of the apparatus according to a further aspect of the present invention;
  • Fig. 3 shows a diagram for illustrating a training process used for training an encoder of an embedding layer of the semantic type annotation neural network used by the method and apparatus according to the present invention.
  • As can be seen from the flowchart illustrated in Fig. 1, the computer-implemented method according to the first aspect of the present invention can comprise several main steps. In the illustrated exemplary embodiment, the computer-implemented method comprises three main steps S1, S2, S3. The computer-implemented method illustrated in Fig. 1 is provided for generating automatically annotations for tabular cell data of a table T having columns C and rows R.
  • In a first step S1 of the computer-implemented method, raw cell data of cells C of a row R within the table T is supplied as input to an embedding layer EL of a semantic type annotation neural network STANN which transforms the received raw cell data of the cells C of the supplied row R into cell embedding vectors e.
  • In a further step S2, the cell embedding vectors e generated by the embedding layer EL are processed by a self-attention layer SAL of the semantic type annotation neural network STANN to calculate attentions among the cells C of the respective row R of said table T encoding a context within said row R output as cell context vectors.
  • In a further step S3, the cell context vectors generated by the self-attention layer SAL are processed by a classification layer CL of the semantic type annotation neural network STANN to predict semantic column type annotations and/or to predict relations between semantic column type annotations for the columns C of the respective table T.
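Steps S1 to S3 can be sketched as a composition of three stages. The following sketch is purely illustrative and not the claimed implementation: the stand-in layer functions (`embed`, `self_attend`, `classify`) are hypothetical placeholders with random weights, shown only to make the data flow of the method concrete.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def annotate_row(raw_cells, embed, self_attend, classify):
    """Pass one table row through the three STANN stages."""
    e = embed(raw_cells)    # S1: raw cell data -> cell embedding vectors e
    ctx = self_attend(e)    # S2: attentions among cells -> cell context vectors
    return classify(ctx)    # S3: semantic column type probabilities per cell

# Toy stand-ins with random weights (illustrative only).
rng = np.random.default_rng(0)
d, n_types = 8, 3
W = rng.normal(size=(d, n_types))
embed = lambda cells: rng.normal(size=(len(cells), d))
self_attend = lambda e: softmax(e @ e.T / np.sqrt(d)) @ e
classify = lambda ctx: softmax(ctx @ W)

y = annotate_row(["Munich", "Germany", "1.46 million"], embed, self_attend, classify)
assert y.shape == (3, n_types)
```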
  • Fig. 2 shows the overall architecture of the semantic type annotation neural network STANN as used by the computer-implemented method illustrated in Fig. 1. As can be seen from the diagram of Fig. 2, the used semantic type annotation neural network STANN comprises an embedding layer EL, a self-attention layer SAL and a classification layer CL.
  • The embedding layer EL is adapted to transform the received raw cell data of cells C of a supplied row R of a table T into cell embedding vectors e. In the illustrated simple example of Fig. 2, the received input data comprises a row R having cell data of three different cells corresponding to columns in the table T, including a first cell data "Munich", a second cell data "Germany" and a third cell data "1.46 million". A task of the computer-implemented method according to the present invention is to automatically provide corresponding semantic column type annotations for the different columns of the table T. In the very simple example of Fig. 2, these semantic column type annotations comprise the semantic type "City" for the first column, the semantic type "Country" for the second column and the indication that no semantic type could be found or generated for the cell data in the third column. With the computer-implemented method according to the first aspect of the present invention, each cell or cell data of a given row R is first considered as an atomic unit. Then, self-attention is used to learn the context among the cells in the row R. The embedding layer EL of the semantic type annotation neural network STANN transforms each cell text of a cell C into a corresponding embedding vector e using a cell encoder. This cell encoder can be trained separately using an autoencoder structure as illustrated in Fig. 3.
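A minimal sketch of such a cell encoder, assuming subword embeddings are mean-pooled into one vector per cell; the fixed-length chunking here is a crude stand-in for byte-pair encoding, and the deterministic pseudo-embedding table is hypothetical (in the document, pre-trained BPE vectors feed a trained bidirectional RNN encoder instead):

```python
import numpy as np

DIM = 8

def subword_vec(tok):
    """Deterministic pseudo-embedding for a subword (stand-in for a
    pre-trained BPE embedding table; illustrative only)."""
    seed = sum(ord(c) for c in tok)
    return np.random.default_rng(seed).normal(size=DIM)

def embed_cell(text, max_len=4):
    """Segment the cell text into subword-like chunks (a crude stand-in
    for byte-pair encoding) and mean-pool their vectors into one
    cell embedding vector e."""
    toks = [text[i:i + max_len] for i in range(0, len(text), max_len)]
    return np.mean([subword_vec(t) for t in toks], axis=0)

row = ["Munich", "Germany", "1.46 million"]
E = np.stack([embed_cell(c) for c in row])   # one embedding per cell of the row
assert E.shape == (3, DIM)
```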
  • The semantic type annotation neural network STANN as shown in Fig. 2 further comprises a self-attention layer SAL adapted to calculate attentions among the cells C of the respective row R of the table T, encoding the context within said row R output as cell context vectors. The self-attention layer SAL is adapted to learn the context for semantic types for each cell C. The self-attention mechanism SAM used can be based on transformers, as described by Vaswani, Ashish, et al., "Attention is all you need", Advances in Neural Information Processing Systems, 2017, which encode context in a sentence by computing attention among the tokens of the respective sentence. The self-attention layer SAL uses this approach to encode context in a row R by computing attention among the cells C of the respective row R. Transformers can also encode the order of the sequence of tokens using a positional encoding method. If the order of the cells is not relevant, the transformers can be modified by removing the positional encoding.
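A single-head scaled dot-product attention sketch over the cells of one row, with random illustrative projection weights; it also demonstrates why dropping the positional encoding makes the layer indifferent to cell order (permuting the cells permutes the outputs identically):

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(E, Wq, Wk, Wv):
    """Scaled dot-product attention among the cells of one row; no
    positional encoding is added, so the order of the cells is irrelevant."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # attention weights among cells
    return A @ V                                  # cell context vectors

rng = np.random.default_rng(0)
d = 8
E = rng.normal(size=(3, d))                      # embeddings of a 3-cell row
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
ctx = self_attention(E, Wq, Wk, Wv)

# Without positional encoding the layer is permutation-equivariant.
perm = [2, 0, 1]
assert np.allclose(self_attention(E[perm], Wq, Wk, Wv), ctx[perm])
```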
  • The semantic type annotation neural network STANN as illustrated in Fig. 2 further comprises a classification layer CL. The classification layer CL is adapted to process the cell context vectors received from the self-attention layer SAL to predict semantic column type annotations and/or to predict relations between semantic column type annotations for the columns C of the respective table T.
  • The classification layer CL is adapted to learn the semantic type for each cell C. In a possible embodiment, a linear classifier can be used for this classification task. The task performed by the classification layer CL is similar to the task in a named entity recognition, NER, process, where it is necessary to predict a type for each token in a sentence. Typically, the final classification in a NER model is done by a conditional random field model, which conditions on the order of the sequence of tokens. In contrast, with the computer-implemented method according to the present invention, the order of the cells C in a row R of a table T in most cases does not have to be considered, and hence a linear layer can be used for the final classification task performed by the classification layer CL.
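A minimal sketch of such a classification layer: a shared linear layer plus softmax applied independently to each cell context vector, with no CRF-style sequence dependency. The weights and the label set are illustrative placeholders, not values from the document:

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

TYPES = ["City", "Country", "None"]    # hypothetical semantic type label set

def classify_cells(ctx, W, b):
    """Shared linear layer + softmax over semantic types for each cell;
    unlike a CRF, each cell is scored independently of cell order."""
    return softmax(ctx @ W + b)

rng = np.random.default_rng(0)
ctx = rng.normal(size=(3, 8))          # cell context vectors for one row
W = rng.normal(size=(8, len(TYPES)))
b = np.zeros(len(TYPES))
y = classify_cells(ctx, W, b)          # column type vector per cell of the row
assert y.shape == (3, 3)
assert np.allclose(y.sum(axis=1), 1.0)
```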
  • In a possible embodiment, to compute the column semantic types of the whole table T, all rows R of the table T can be passed through the semantic type annotation neural network STANN as illustrated in Fig. 2. Further, it is possible to compute a mean probability over all the semantic type prediction probabilities.
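A sketch of this pooling step, using made-up per-row probability vectors `Y` (random placeholders for the network output) for a four-row, three-column table over three candidate types:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols, n_types = 4, 3, 3
# Per-row column type vectors y: one probability distribution over
# semantic types per cell (random placeholders for the network output).
Y = rng.dirichlet(np.ones(n_types), size=(n_rows, n_cols))

col_probs = Y.mean(axis=0)            # mean pooling over all rows R
col_types = col_probs.argmax(axis=1)  # predicted semantic type per column C
assert col_probs.shape == (n_cols, n_types)
assert np.allclose(col_probs.sum(axis=1), 1.0)
```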
  • The semantic type annotation neural network STANN as depicted in Fig. 2 predicts only the semantic types that correspond to the different columns C of the table T. To additionally predict the relations between the columns C, the same model with a slight modification can be used. In that case, the embedding layer EL of the modified semantic type annotation neural network STANN can also include the cell position embeddings of the two cells C which are the subject and object of the relation in question. The classification is also slightly modified and performs a relation classification for the whole row R, with two cells C of the row R highlighted as the subject and object of the respective relation.
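A minimal sketch of this modification, where hypothetical learned marker vectors stand in for the cell position embeddings of the subject and object cells; in the modified model, the marked row would pass through the same self-attention stack before a single relation label is predicted for the whole row (here approximated by mean-pooling and a random linear layer, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_relations = 8, 4
E = rng.normal(size=(3, d))              # cell embeddings for one row

# Hypothetical learned position embeddings marking subject and object cells.
subj_marker = rng.normal(size=d) * 0.1
obj_marker = rng.normal(size=d) * 0.1

def mark(E, subj, obj):
    """Add subject/object cell position embeddings to the embedding output."""
    E = E.copy()
    E[subj] += subj_marker
    E[obj] += obj_marker
    return E

# Relation classification for the whole row: pool the marked row and
# score it against each relation label (illustrative random weights).
W_rel = rng.normal(size=(d, n_relations))
row_vec = mark(E, subj=0, obj=1).mean(axis=0)
scores = row_vec @ W_rel                 # one score per candidate relation
assert scores.shape == (n_relations,)
```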
  • As also illustrated in Fig. 3, a bidirectional recurrent neural network RNN can be trained as the encoder ENC of an autoencoder AE on cell embeddings provided by a byte-pair encoding model BPE. The trained bidirectional recurrent neural network RNN can then be used as an encoder within the embedding layer EL of the semantic type annotation neural network STANN as shown in Fig. 2. For the training, the cell text ct of each training sample is embedded to provide a cell text embedding ct-e using a pre-trained byte-pair encoding model, as also described by Heinzerling, Benjamin, and Michael Strube, "BPEmb: Tokenization-free pre-trained subword embeddings in 275 languages", arXiv preprint arXiv:1710.02187, 2017. The byte-pair encoding model BPE can segment words into subwords or tokens, which effectively handles out-of-vocabulary words. In the next step, the encoder ENC encodes the input embedding ct-e into a lower dimensional representation ldr. Then, the decoder DEC of the autoencoder AE learns to decode the lower dimensional representation ldr into a reconstructed cell text embedding ct-e' that is as close as possible to the original embedding, with the reconstruction error measured by a loss function LF. For the encoder ENC and decoder DEC of the autoencoder AE as illustrated in Fig. 3, it is possible to use bidirectional recurrent neural networks RNNs. The apparatus and method according to the present invention reuse the learned encoder ENC of the autoencoder AE as shown in Fig. 3 in the embedding layer EL of the semantic type annotation neural network STANN as shown in Fig. 2.
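Assuming a mean-squared reconstruction loss, the training loop of Fig. 3 can be sketched as follows. To keep the sketch short and dependency-free, plain linear maps stand in for the bidirectional RNN encoder ENC and decoder DEC, and random vectors stand in for the BPE cell text embeddings ct-e; none of these stand-ins come from the document itself:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_lat = 64, 16, 4
X = rng.normal(size=(n, d_in))        # stand-in for BPE cell text embeddings ct-e

# Linear encoder/decoder stand in for the BiRNNs of Fig. 3.
Enc = rng.normal(size=(d_in, d_lat)) * 0.1
Dec = rng.normal(size=(d_lat, d_in)) * 0.1

lr = 0.05
for _ in range(500):
    Z = X @ Enc                       # lower dimensional representation ldr
    X_hat = Z @ Dec                   # reconstructed embedding ct-e'
    G = 2 * (X_hat - X) / n           # gradient of the mean-squared loss LF
    dDec = Z.T @ G
    dEnc = X.T @ (G @ Dec.T)
    Dec -= lr * dDec
    Enc -= lr * dEnc

# After training, reconstruction should beat the trivial all-zero output.
final_loss = np.mean((X @ Enc @ Dec - X) ** 2)
assert final_loss < np.mean(X ** 2)
```

Only the trained encoder `Enc` would then be reused inside the embedding layer EL; the decoder is discarded after training.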
  • In a possible embodiment, the generated annotations of the tabular cell data of the table T can be supplied to an ETL process used to generate a knowledge graph instance stored in a memory.
  • As also illustrated in Fig. 2, the classification layer CL calculates column type vectors y comprising for the cell data of each cell C of the respective supplied row R predicted semantic column type probabilities. A mean pooling of the column type vectors y of all rows R of the table T can be performed to predict a semantic column type for each column C of the respective table T.
  • In a preferred embodiment, the self-attention layer SAL of the semantic type annotation neural network STANN comprises a stack of transformers to calculate the attentions among the cells C of the respective row R of the table T. The semantic type annotation neural network STANN can be trained in a supervised learning process using labeled rows R as samples. The computer-implemented method as shown in the flowchart of Fig. 1 can be used by a software tool to generate automatically annotations for tabular cell data of a table T received from any kind of data source.
  • In contrast to conventional approaches, the computer-implemented method according to the present invention does not rely on ontology alignment approaches. Accordingly, the computer-implemented method according to the present invention is automatic and data-driven. Further, it does not require manually engineered features for different use cases. Unlike conventional approaches, the computer-implemented method and apparatus according to the present invention can learn the context for each semantic type from the raw data.

Claims (15)

  1. A computer-implemented method for generating automatically annotations for tabular cell data of a table, T, having columns, C, and rows, R,
    wherein the method comprises the steps (S) of:
supplying (S1) raw cell data of cells, C, of a row, R, of the table, T, as input to an embedding layer (EL) of a semantic type annotation neural network (STANN) which transforms the received raw cell data of the cells, C, of the supplied row, R, into cell embedding vectors, e;
    processing (S2) the cell embedding vectors, e, generated by the embedding layer (EL) by a self-attention layer (SAL) of the semantic type annotation neural network (STANN) to calculate attentions among the cells, C, of the respective row, R, of said table, T, encoding a context within said row, R, output as cell context vectors; and
    processing (S3) the cell context vectors generated by the self-attention layer (SAL) by a classification layer (CL) of the semantic type annotation neural network (STANN) to predict semantic column type annotations and/or to predict relations between semantic column type annotations for the columns, C, of said table, T.
  2. The computer-implemented method according to claim 1 wherein a bidirectional recurrent neural network, RNN, trained as an encoder of an autoencoder on cell embeddings provided by a byte-pair encoding model, BPE, is used as an encoder of the embedding layer (EL) of the semantic type annotation neural network (STANN).
  3. The computer-implemented method according to claim 1 or 2 wherein the generated annotations and the tabular cell data of the table, T, are supplied to an ETL process used to generate a knowledge graph instance stored in a memory.
  4. The computer-implemented method according to any of the preceding claims 1 to 3 wherein the classification layer (CL) calculates column type vectors, y, comprising for the cell data of each cell, C, of the respective supplied row, R, predicted semantic column type probabilities.
  5. The computer-implemented method according to claim 4 wherein a mean pooling of the column type vectors, y, of all rows, R, of the table, T, is performed to predict a semantic column type for each column, C, of said table, T.
  6. The computer-implemented method according to any of the preceding claims 1 to 5 wherein the self-attention layer (SAL) of the semantic type annotation neural network (STANN) comprises a stack of transformers to calculate attentions among the cells, C, of the respective row, R, of said table, T.
  7. The computer-implemented method according to any of the preceding claims 1 to 6 wherein the semantic type annotation neural network (STANN) is trained in a supervised learning process using labeled rows, R, as samples.
  8. An annotation software tool adapted to perform the computer-implemented method according to any of the preceding claims 1 to 7 to generate automatically annotations for tabular cell data of a table, T, received from a data source.
  9. An apparatus used for automatic generation of annotations processed for providing a knowledge graph instance of a knowledge graph stored in a knowledge base,
    said apparatus comprising
    a semantic type annotation neural network (STANN) having an embedding layer (EL) adapted to transform the received raw cell data of cells, C, of a supplied row, R, of a table, T, into cell embedding vectors, e,
    a self-attention layer (SAL) adapted to calculate attentions among the cells, C, of the respective row, R, of said table, T, encoding a context within said row, R, output as cell context vectors and having
    a classification layer (CL) adapted to process the cell context vectors received from the self-attention layer (SAL) to predict semantic column type annotations and/or to predict relations between semantic column type annotations for the columns, C, of said table, T.
  10. The apparatus according to claim 9 wherein a bidirectional recurrent neural network, RNN, trained as an encoder of an autoencoder on cell embeddings provided by a byte-pair encoding model, BPE, is implemented as an encoder of the embedding layer (EL) of the semantic type annotation neural network (STANN) of said apparatus.
  11. The apparatus according to claim 9 or 10 wherein the generated annotations and the tabular cell data of the table, T, are supplied to an ETL process used to generate a knowledge graph instance of the knowledge base.
  12. The apparatus according to any of the preceding claims 9 to 11 wherein the classification layer (CL) of the semantic type annotation neural network (STANN) is adapted to calculate column type vectors, y, comprising for the cell data of each cell, C, of the respective supplied row, R, predicted semantic column type probabilities.
  13. The apparatus according to any of the preceding claims 9 to 12 wherein a mean pooling of column type vectors, y, of all rows, R, of the table, T, is performed to predict the semantic type annotation of each column, C, of said table, T.
  14. The apparatus according to any of the preceding claims 9 to 13 wherein the self-attention layer (SAL) of the semantic type annotation neural network (STANN) comprises a stack of transformers adapted to calculate attentions among the cells, C, of the respective row, R, of said table, T.
  15. The apparatus according to any of the preceding claims 9 to 14 wherein the semantic type annotation neural network (STANN) of said apparatus is trained in a supervised learning process using labeled rows, R, as samples.
EP20183087.4A 2020-06-30 2020-06-30 A computer-implemented method and apparatus for automatically annotating columns of a table with semantic types Withdrawn EP3933699A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20183087.4A EP3933699A1 (en) 2020-06-30 2020-06-30 A computer-implemented method and apparatus for automatically annotating columns of a table with semantic types
US17/350,330 US11977833B2 (en) 2020-06-30 2021-06-17 Computer-implemented method and apparatus for automatically annotating columns of a table with semantic types

Publications (1)

Publication Number Publication Date
EP3933699A1 (en) 2022-01-05







Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HEINZERLING, BENJAMIN; STRUBE, MICHAEL: "BPEmb: Tokenization-free pre-trained subword embeddings in 275 languages", ARXIV PREPRINT ARXIV:1710.02187, 2017
HULSEBOS, MADELON, ET AL.: "Sherlock: A deep learning approach to semantic data type detection", PROCEEDINGS OF THE 25TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2019
VASWANI, ASHISH, ET AL.: "Attention is all you need", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, 2017
ZHANG, DAN, ET AL.: "Sato: Contextual Semantic Type Detection in Tables", ARXIV PREPRINT ARXIV:1911.06311, 2019

Also Published As

Publication number Publication date
US11977833B2 (en) 2024-05-07
US20210406452A1 (en) 2021-12-30


Legal Events

- PUAI: Public reference made under Article 153(3) EPC to a published international application that has entered the European phase (original code: 0009012)
- STAA: Status: the application has been published
- AK: Designated contracting states: AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR (kind code of ref document: A1)
- B565: Issuance of search results under Rule 164(2) EPC, effective date 2021-01-18
- STAA: Status: the application is deemed to be withdrawn
- 18D: Application deemed to be withdrawn, effective date 2022-07-06