EP2601594A1 - Method and device for the automatic processing of data in a cell format - Google Patents

Method and device for the automatic processing of data in a cell format

Info

Publication number
EP2601594A1
Authority
EP
European Patent Office
Prior art keywords
data
cell
cells
similarity
automatically
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
EP11749377.5A
Other languages
German (de)
English (en)
Inventor
Martin RÜGAMER
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SOLYP Informatik GmbH
Original Assignee
SOLYP Informatik GmbH
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SOLYP Informatik GmbH
Priority to EP11749377.5A
Publication of EP2601594A1
Legal status: Ceased

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/177 Editing, e.g. inserting or deleting of tables; using ruled lines
    • G06F40/18 Editing, e.g. inserting or deleting of tables; using ruled lines of spreadsheets

Definitions

  • the invention relates to a method for automatic
  • data in a cell format is known, e.g., from spreadsheets. Typically, this allows data from one category (e.g., in vertically arranged cells) to be linked to data from other categories (e.g., in horizontally arranged cells).
  • categories e.g., in vertically arranged cells
  • cells and data cells are used synonymously here.
  • Data in cell format is frequently used as an import/export format for programs.
  • the arrangement of the data in cell format has established itself as an interface between programs.
  • data, in particular "soft data"
  • cell format in which a) a start cell is selected as the first data cell for a data rectangle,
  • the similarity threshold determines whether the data rectangle is expanded in the horizontal and / or vertical direction.
  • steps b) and c) are carried out up to a termination criterion.
  • a label is to be understood as a string that can be regarded as a caption for a number of cells.
  • the labeling information is used for the subsequent further processing of the pure numerical information
  • formula property of the respective data cells, the protection defined for the respective data cell, the respective height of the data cell, the respective width of the data cell, absolute relation between data cells, relative relation between
  • Data cell is determined. In this way, a meaningful evaluation of similarity can be made.
  • the criteria can be applied in particular in combination.
  • caption data for data cells in the vicinity of the data rectangle are automatically detected. This allows an improved allocation of the data.
  • Similarity Analysis automatically generates a file that has data cells to which certain attributes can be attributed based on the similarity analysis. Also, it is advantageous if the calculation of the measure and the adaptation of the size of the data rectangle in a
  • Spreadsheet programs are integrated. This makes it possible to analyze soft data in a spreadsheet program.
  • Spreadsheet programs are widely used and offer data in cell formats, so that an advantageous use of the method is possible here.
  • a determined data rectangle is automatically integrated into a database, which is in particular linked to an input template.
  • an input template is understood to mean, e.g., an input mask.
  • Structure of a first data cell and a second data cell, in particular adjacent data cells is automatically compared and, if necessary, a measure of the difference is determined. This automatically determines the similarity of data cells.
  • the method can be used in conjunction with a
  • Data rectangles to be integrated into a spreadsheet program e.g. determine which areas in a data sheet are similar to each other so that they
  • the task is also performed by a system for automatic
  • Spreadsheet program has an integrated system according to claim 14.
  • Fig. 2 is an illustration of a uniform XML envelope
  • FIG. 3 is a schematic representation of the data exchange
  • Fig. 10 is a screenshot of an Excel file as
  • Fig. 5 is a detail of the table of Fig. 4;
  • Fig. 6 is a tabular representation of the calculation of
  • FIG. 7 is a tabular representation of the calculation of the similarities between further data lines
  • Fig. 8-10 is an illustration of the characterization of adjacent ones
  • Fig. 11 is a flow chart of the basic algorithm
  • Fig. 12-13 an example of the determination of orders of magnitude
  • Fig. 14 shows an example of the detection of stripe patterns
  • Fig. 15 is an example of the capturing of labels
  • Fig. 16 is a flowchart similar to that for detection
  • Figs. 17-18 show an example of a similarity search
  • Fig. 22 is a view of a questionnaire.
  • soft data e.g., data without a hard
  • Soft data is business information that cannot be expressed by measures.
  • SAP BW upstream systems
  • a questionnaire is a structured template into which data that is not specially adapted to this template can be imported from a data source.
  • the algorithm described here analyzes the information in the data source, among other things to determine similarities. This calculated information is then imported into the template, with the template only general
  • Presets that allow mapping of the parsed data from the data source can be, e.g., the metadata (table name, foreign keys, column names, etc.) of a relational database linked to the template.
  • the template does not have extensive presets that allow the mapping; the "intelligence" for the assignment of the data is in the procedure, not in the mapping
  • One embodiment of the overall method is divided into three phases, with the most important second phase in turn passing through three stages.
  • Fig. 1 shows a flowchart in which these phases are shown.
  • phase of the syntactic unification (FIG. 1, steps 1.1 to 1.5) is already known in principle.
  • the phase of the automatic analysis (FIG. 1, steps 2.1 to 2.3) relates to the automatic processing of the data in the
  • a data source is selected on a client (e.g., a browser) (Fig. 1: step 1.1; Fig. 3: step 1) and, by clicking or dragging, is transmitted to a server (Fig. 1: step 1.2; Fig. 3: step 2). This is also called "binary upload".
  • File formats converted into XML data with which then further processing of the data is possible.
  • Possible file formats may be generated, e.g., by word processing programs such as Word or OpenOffice, or by presentation programs such as PowerPoint.
  • PDF formats and HTML documents can serve as a starting point for the conversion.
  • uniform XML format then contains a representation of the cell format and possibly also the connections between the
  • Data cells e.g., formulas
  • PowerPoint files can be stored in .ppt
  • An example of what an XML download (see Fig. 3, step) may look like is shown in Figs. 2 and 21. Fig. 2 shows a visualization of the XML grammar.
  • the automatic analysis of the data advantageously takes place on the client (i.e., browser) side, for one thing to relieve the expensive, central processing power of the server and to scale arbitrarily.
  • regions, i.e., data cells
  • an .xlsx file or its representation in XML identify features that have specific structural (e.g., rectangular range of numbers in a table) or content (e.g., "EBIT" as a measure and "2010" as the current year) characteristics. These areas are hereafter referred to as
  • a content-related feature is to be understood as meaning that there are identifiers (e.g., a header) in the data source that categorize certain data (e.g., in the adjacent data cells), so the content that follows is not content in the sense of, but in the assignment of data cells to a
  • this area is automatically assigned to a part of a questionnaire by inferring the subject dimensioning (e.g., different measures in several years) from the form of the information (e.g., first column and column headings).
  • the questionnaire corresponds to a database table
  • the technical dimensioning corresponds to the.
  • the assignment is a search for the primary key in the metadata repository of the database.
  • Programs that deal with cell formats are provided. Starting from a first data cell, these may be e.g. characterized by the following criteria:
  • Another criterion is the structure of a formula in one of the data cells. Even if the numbers in formulas of neighboring cells are different, the syntactic structure (decomposition into terms) of a formula (e.g., a sum, an exponential expression, etc.) can provide information about the similarity of the cells to be compared.
  • the syntactic structure allows the analysis of the formula without numbers and / or
  • Another criterion may be the reference of the data cells in a formula.
  • an absolute reference or a relative reference can be evaluated.
  • the semantics of a formula can be used as a criterion, e.g., by automatically recognizing that two data cells contain two types of mean-value calculation whose syntax differs but whose calculation goal is similar.
  • Magnitude is. In principle, it is possible to use some or all of these criteria for characterization.
  • FIG. 5 shows by way of example a section of FIG. 4
  • the number "89.3" is intended to serve as the first data cell from which a similarity to neighboring data cells is automatically determined, since the technical evaluation of the "similarity" of two cells is of particular importance for the automatic method.
  • the similarity between two data cells is calculated by comparing the respective criteria.
  • for each criterion, a percentage "single similarity" is formed. Then, to increase the fault tolerance of
  • the worst value is deleted and the remaining values are added with a (learned) weighting.
  • Fig. 6 shows, in the form of a table, the calculation of
  • the similarity also takes into account the order of magnitude of the numbers, e.g., to identify outliers.
  • the orders of magnitude are determined via a logarithmic measure
  • an outlier can be determined.
  • the criterion of the order of magnitude has been identified as the outlier, since this criterion had the lowest match. Removing this result gives the best overall value, which, incidentally, can be understood as a definition of the outlier.
  • the overall similarity (last line in Fig. 6) is then calculated from the matches (taking the weights into account), where the divisor is the sum of the relevant weights (i.e., excluding the outlier).
  • Comparison with dimension values is categorized. In the example of FIG. 7 this circumstance is not taken into consideration.
  • Criteria a compliance of 0% was determined.
  • the criterion with the highest weighting, here the "order of magnitude", is considered to be the outlier, i.e., the divisor is 1 - 0.15 when calculating the overall score. For example, if a formatted year number is
  • the tolerance threshold is the limit at which the percentage similarity value is interpreted as a yes / no decision "similar".
  • both vertically (see Fig. 8)
  • horizontally (see Fig. 9)
  • the immediate and then further neighbors are also characterized and compared with the initial characterization. This comparison leads to the positive result "similar" when only a few aspects (up to a weighted average) are different.
  • Treating empty cells is an important point.
  • a data cell is "similar" to an adjacent empty cell, so the expansion of the data area does not stop at empty data cells; completely unfilled areas must of course be prevented from being included, and in particular the
  • Similarity of adjacent data cells is determined. If the similarity reaches a certain threshold, the data rectangle is extended horizontally.
  • the similarity in the vertical direction is determined.
  • a certain threshold_2 the data rectangle is extended by a vertical neighbor data cell and the method is included with the calculation in
  • threshold_2 it is checked whether a horizontal extension was made in the step before. If so, then with the repeated determination of similarity in horizontal
  • KPI i.e., Key Performance
  • the percentage growth value for the following year can be listed: percentage and absolute values, possibly highlighted by different layouts, alternate with each other
  • the data rectangle includes 9x4 data cells.
  • the surrounding ones are 0
  • Caption above includes 9x1 data cells, the caption below also contains 9x1 data cells.
  • 1x4 cells are arranged left and right.
  • the first strategy is described in Fig. 16 in the form of a flowchart. It should be noted that this embodiment of the method can in principle be used independently as well as in combination with the method for detecting a data rectangle (e.g., Fig. 11).
  • search for keywords can also be used in connection with the search for the data rectangle.
  • a feature vector receives one for each subject dimension ("time”, “market participant”, etc.)
  • each cell value is searched in the space of all previously known dimension values and, in the case of a found value, this criterion is additionally included in the similarity analysis.
  • the database is searched for whether this data has already occurred once.
  • Dimension attributes can be assigned to these rectangles, as in Fig. 18 for the values BMW and VW; both are market participants (MT). From this, a similarity can again be calculated, here 100%.
  • multidimensional data model (Fig. 1, step 2.3), which may also be referred to as a data cube.
  • a data cube can be thought of as a multidimensional matrix, with the columns and rows being the dimensions
  • the dimension combination "KPI x year” is used for the central data area of the "Financial Objectives in the
  • Data cells B6 to J9 contain market shares.
  • a file is automatically generated whose data cells can be assigned certain attributes.
  • the embodiment according to Fig. 1 can, e.g., be coupled with a learning system so that certain relationships between the data cells and the structure of a spreadsheet are stored.
  • FIG. 22 shows a view of a questionnaire into which the data from FIG. 20 has been read.
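
The similarity calculation described in the fragments above (a percentage "single similarity" per criterion, removal of the worst value, a weighted sum divided by the sum of the remaining weights, and a tolerance threshold for the yes/no decision) can be sketched roughly as follows. This is only an illustration under stated assumptions: the criterion names, the weights, the tolerance value, and the one-decade scale of the logarithmic measure are hypothetical, and the tie-breaking rule is one possible reading of the remark that the criterion with the highest weighting is treated as the outlier.

```python
import math

# Hypothetical criteria and weights (assumption); the text only says that
# the weighting is learned.
WEIGHTS = {"data_type": 0.35, "format": 0.25, "formula": 0.25, "magnitude": 0.15}


def magnitude_similarity(a: float, b: float) -> float:
    """Single similarity of two numbers on a logarithmic scale, in [0, 1]."""
    if a == 0 or b == 0:
        return 1.0 if a == b else 0.0
    # Assumed scale: one full order of magnitude apart (or more) gives 0.
    distance = abs(math.log10(abs(a)) - math.log10(abs(b)))
    return max(0.0, 1.0 - distance)


def overall_similarity(single: dict, weights: dict) -> float:
    """Drop the worst single similarity, then form the weighted mean of the rest."""
    # Outlier = lowest single similarity; ties go to the highest weight
    # (an assumed reading of the source).
    outlier = min(single, key=lambda name: (single[name], -weights[name]))
    kept = [name for name in single if name != outlier]
    divisor = sum(weights[name] for name in kept)  # sum of the relevant weights
    return sum(single[name] * weights[name] for name in kept) / divisor


def is_similar(single: dict, weights: dict = WEIGHTS, tolerance: float = 0.8) -> bool:
    """Turn the percentage value into a yes/no decision via a tolerance threshold."""
    return overall_similarity(single, weights) >= tolerance


# Two numeric cells that agree on everything except the order of magnitude.
singles = {
    "data_type": 1.0,
    "format": 1.0,
    "formula": 1.0,
    "magnitude": magnitude_similarity(89.3, 2010),
}
print(overall_similarity(singles, WEIGHTS), is_similar(singles))
```

Because the magnitude criterion is the worst value in this example, it is removed and the divisor becomes the sum of the three remaining weights, so the two cells still count as similar.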

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a system for the automatic processing of data, in particular soft data, in a cell format, and to a method in which a) a start cell is selected as the first data cell for a data rectangle, b) a measure of the similarity between the first data cell and at least one second data cell, in particular in the vicinity of the first data cell, is then generated automatically, and c) depending on at least one predetermined threshold value for the similarity, it is decided whether the data rectangle is expanded in the horizontal and/or vertical direction.
EP11749377.5A 2010-08-06 2011-08-04 Method and device for the automatic processing of data in a cell format Ceased EP2601594A1
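
Steps a) to c) of the abstract can be illustrated with a small sketch that grows a rectangle from a start cell. It is not the patented implementation: the similarity function is a deliberately crude stand-in, averaging the similarity along the rectangle's edge is an assumption, and all names (`grow_rectangle`, `toy_similarity`, `threshold_h`, `threshold_v`, `EXAMPLE_GRID`) are hypothetical.

```python
from typing import Callable, List, Tuple

Cell = object
Rect = Tuple[int, int, int, int]  # (top, left, bottom, right), all inclusive


def toy_similarity(a: Cell, b: Cell) -> float:
    """Stand-in measure (assumption): 1.0 if both cells are numbers or both are text."""
    def is_number(v: Cell) -> bool:
        return isinstance(v, (int, float)) and not isinstance(v, bool)

    if is_number(a) and is_number(b):
        return 1.0
    if isinstance(a, str) and isinstance(b, str):
        return 1.0
    return 0.0


def grow_rectangle(
    grid: List[List[Cell]],
    start: Tuple[int, int],
    similarity: Callable[[Cell, Cell], float] = toy_similarity,
    threshold_h: float = 0.8,
    threshold_v: float = 0.8,
) -> Rect:
    """Grow a data rectangle from `start` (row, column) in a rectangular grid."""
    n_rows, n_cols = len(grid), len(grid[0])
    top, left = start      # step a): the start cell is the first data cell
    bottom, right = start

    def mean_similarity(edge, candidate):
        # step b): compare each edge cell with its neighbour in the candidate line
        scores = [similarity(grid[r][c], grid[nr][nc])
                  for (r, c), (nr, nc) in zip(edge, candidate)]
        return sum(scores) / len(scores)

    def column(c):
        return [(r, c) for r in range(top, bottom + 1)]

    def row(r):
        return [(r, c) for c in range(left, right + 1)]

    changed = True
    while changed:  # termination criterion: a full pass without any extension
        changed = False
        # step c), horizontal direction
        if right + 1 < n_cols and mean_similarity(column(right), column(right + 1)) >= threshold_h:
            right += 1
            changed = True
        if left > 0 and mean_similarity(column(left), column(left - 1)) >= threshold_h:
            left -= 1
            changed = True
        # step c), vertical direction
        if bottom + 1 < n_rows and mean_similarity(row(bottom), row(bottom + 1)) >= threshold_v:
            bottom += 1
            changed = True
        if top > 0 and mean_similarity(row(top), row(top - 1)) >= threshold_v:
            top -= 1
            changed = True
    return top, left, bottom, right


EXAMPLE_GRID = [
    ["KPI",   "2009", "2010", "2011"],
    ["EBIT",  89.3,   91.0,   95.2],
    ["Sales", 400.0,  420.0,  450.0],
    ["Note",  "n/a",  "n/a",  "n/a"],
]
print(grow_rectangle(EXAMPLE_GRID, (1, 1)))  # (1, 1, 2, 3)
```

Starting at the cell holding 89.3, the rectangle stops at the text cells that surround the numeric block. In the method as described, such surrounding cells are candidates for captions, and the similarity measure would combine several criteria rather than the data type alone.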

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP11749377.5A EP2601594A1 2010-08-06 2011-08-04 Method and device for the automatic processing of data in a cell format

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
EP10172235 2010-08-06
EP11749377.5A EP2601594A1 2010-08-06 2011-08-04 Method and device for the automatic processing of data in a cell format
PCT/EP2011/063489 WO2012017056A1 2010-08-06 2011-08-04 Method and device for the automatic processing of data in a cell format

Publications (1)

Publication Number Publication Date
EP2601594A1 2013-06-12

Family

ID=44532823

Family Applications (1)

Application Number Title Priority Date Filing Date
EP11749377.5A Ceased EP2601594A1 2010-08-06 2011-08-04 Method and device for the automatic processing of data in a cell format

Country Status (2)

Country Link
EP (1) EP2601594A1 (fr)
WO (1) WO2012017056A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020013956A1 * 2018-07-13 2020-01-16 Microsoft Technology Licensing, Llc Systems, methods, and computer-readable media for improved table identification using a neural network

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
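CN110659527B 2018-06-29 2023-03-28 微软技术许可有限责任公司 Table detection in spreadsheets
CN109829144B * 2018-12-28 2023-06-06 陈德芹 Method and device for cross-table referencing in online spreadsheets
KR20210057306A * 2019-11-12 2021-05-21 주식회사 모카앤제이에스 Method for providing a block-editor-based document editing service, and server and computer program for performing the same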

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8205149B2 (en) * 2001-01-05 2012-06-19 Microsoft Corporation Enhanced find and replace for electronic documents
US20060167911A1 (en) * 2005-01-24 2006-07-27 Stephane Le Cam Automatic data pattern recognition and extraction
US7779000B2 (en) * 2005-08-29 2010-08-17 Microsoft Corporation Associating conditions to summary table data
US8856649B2 (en) * 2009-06-08 2014-10-07 Business Objects Software Limited Aggregation level and measure based hinting and selection of cells in a data display

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
None *
See also references of WO2012017056A1 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020013956A1 * 2018-07-13 2020-01-16 Microsoft Technology Licensing, Llc Systems, methods, and computer-readable media for improved table identification using a neural network
US12039257B2 (en) 2018-07-13 2024-07-16 Microsoft Technology Licensing, Llc Systems, methods, and computer-readable media for improved table identification using a neural network

Also Published As

Publication number Publication date
WO2012017056A1 (fr) 2012-02-09

Similar Documents

Publication Publication Date Title
DE202011110895U1 Real-time synchronized editing of documents by multiple users for blogging
DE112013000987T5 Generating visualizations of a display group of tags representing content instances in objects satisfying search criteria
DE102013206281A1 Optimizing sparse schema-less data in relational stores
EP2439691A1 Device and method for the machine-based creation of a process diagram
DE60310881T2 Method and user interface for forming a representation of data with meta-morphing
DE112018002626T5 Methods and systems for optimized visual summarization of sequences with time-related event data
WO2012017056A1 Method and device for the automatic processing of data in a cell format
DE19849855C1 Method for the automatic generation of a textual utterance from a meaning representation by a computer system
AT522281B1 Method for characterizing the operating state of a computer system
DE3689502T2 System and method for program structuring by data table translation
DE112012004300T5 Method, program, and system for creating a workflow from a work specification
DE102012025349A1 Determining a similarity measure and processing documents
WO2003054727A1 Categorization system for data objects and method for checking the consistency of assignments of data objects to categories
DE10325843B4 Method, printing system, computer, and computer program for managing resources for use in a resource-based document data stream
WO2009012802A1 System and method for managing the generation and distribution of publications
EP0978052A1 Computer-assisted selection of training data for a neural network
EP2149844B1 Method and computer program product for automatically entering data from a database system into a data structure
DE102009037848A1 Method for the computer-aided processing of digital semantically annotated information
DE102009016588A1 Method for determining text information
EP1629401A2 Method, device, and computer program comprising program code elements and a program code product for analyzing user data structured according to a database structure
EP1324236A1 Determining a characteristic function of a matrix with enrichment and compression
EP2518644A1 Method for controlling the translation of predetermined rules and/or of data to be recorded from a data stream
EP4102378A1 Method for reorganizing and/or transforming data
DE10109876B4 Method and device for data management
DE202023106456U1 A system for preparing a thesis for an applied research project

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20130306

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAX Request for extension of the european patent (deleted)
17Q First examination report despatched

Effective date: 20160914

REG Reference to a national code

Ref country code: DE

Ref legal event code: R003

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN REFUSED

18R Application refused

Effective date: 20190912