EP2601594A1 - Method and device for automatically processing data in a cell format - Google Patents
Info
- Publication number
- EP2601594A1
- Authority
- EP
- European Patent Office
- Prior art keywords
- data
- cell
- cells
- similarity
- automatically
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
- G06F40/177—Editing, e.g. inserting or deleting of tables; using ruled lines
- G06F40/18—Editing, e.g. inserting or deleting of tables; using ruled lines of spreadsheets
Definitions
- the invention relates to a method for automatic
- data in a cell format is known, e.g., from spreadsheets. Typically, this allows data from one category (e.g., in vertically arranged cells) to be linked to data from other categories (e.g., in horizontally arranged cells).
- categories e.g., in vertically arranged cells
- cells and data cells are used synonymously here.
- Data in cell format is used again and again as import / export format for programs.
- the arrangement of the data in cell format has established itself as an interface between programs.
- data, in particular "soft data"
- cell format in which a) a start cell is selected as the first data cell for a data rectangle,
- the similarity threshold determines whether the data rectangle is expanded in the horizontal and / or vertical direction.
- steps b) and c) are carried out up to a termination criterion.
- a label is to be understood as a string that can be regarded as a caption for a number of cells.
- the use of the labeling information is for the subsequent further processing of the pure number information
- formula property of the respective data cells, the respectively defined protection of the data cell, the respective height of the data cell, the respective width of the data cell, absolute relation between data cells, relative relation between
- Data cell is determined. In this way, a meaningful evaluation of similarity can be made.
- the criteria can be applied in particular in combination.
- caption data for data cells in the vicinity of the data rectangle are automatically detected. This allows an improved allocation of the data.
- Similarity Analysis automatically generates a file that has data cells to which certain attributes can be attributed based on the similarity analysis. Also, it is advantageous if the calculation of the measure and the adaptation of the size of the data rectangle in a
- Spreadsheet programs are integrated. This makes it possible to analyze soft data in a spreadsheet program.
- Spreadsheet programs are widely used and offer data in cell formats, so that an advantageous use of the method is possible here.
- a determined data rectangle is automatically integrated into a database, which is in particular linked to an input template.
- an input template is understood to mean, e.g., an input mask.
- The structure of a first data cell and a second data cell, in particular of adjacent data cells, is automatically compared and, if necessary, a measure of the difference is determined. This automatically determines the similarity of data cells.
- the method can be used in conjunction with a
- Data rectangles to be integrated into a spreadsheet program e.g. determine which areas in a data sheet are similar to each other so that they
- the task is also performed by a system for automatic
- Spreadsheet program has an integrated system according to claim 14.
- Fig. 2 is an illustration of a uniform XML envelope
- Fig. 3 is a schematic representation of the data exchange
- Fig. 10 is a screenshot of an Excel file as
- Fig. 5 is a detail of the table of Fig. 4;
- Fig. 6 is a tabular representation of the calculation of
- FIG. 7 is a tabular representation of the calculation of the similarities between further data lines
- Fig. 8-10 is an illustration of the characterization of adjacent ones
- Fig. 11 is a flow chart of the basic algorithm
- Fig. 12-13 an example of the determination of orders of magnitude
- Fig. 14 shows an example of the detection of stripe patterns
- Fig. 15 is an example of the capturing of labels
- Fig. 16 is a flowchart similar to that for detection
- Figs. 17-18 show an example of a similarity search
- Fig. 22 is a view of a questionnaire.
- soft data e.g., data without a hard
- Soft data is business information that cannot be expressed by measures.
- SAP BW upstream systems
- a questionnaire is a structured template into which data that is not specially adapted to this template can be imported from a data source.
- the algorithm described here analyzes the information in the data source in order to determine, among other things, similarities. This calculated information is then imported into the template, with the template only general
- Presets that allow mapping of the parsed data from the data source can be, e.g., the metadata (table name, foreign keys, column names, etc.) of a relational database linked to the template.
- the template does not have extensive presets that allow the mapping; the "intelligence" for the assignment of the data is in the procedure, not in the mapping
- One embodiment of the overall method is divided into three phases, with the most important second phase in turn passing through three stages.
- Fig. 1 shows a flowchart in which these phases are shown.
- phase of the syntactic unification (FIG. 1, steps 1.1 to 1.5) is already known in principle.
- the phase of the automatic analysis (FIG. 1, steps 2.1 to 2.3) relates to the automatic processing of the data in the
- a data source is selected on a client (e.g., a browser) (Fig. 1: step 1.1; Fig. 3: step 1) and is transmitted to a server by clicking or by drag and drop (Fig. 1: step 1.2; Fig. 3: step 2). This is also called "binary upload".
- File formats converted into XML data with which then further processing of the data is possible.
- Possible file formats may be generated, e.g., by word processing programs such as Word or OpenOffice, or by presentation programs such as PowerPoint.
- PDF formats and HTML documents can serve as a starting point for the conversion.
- uniform XML format then contains a representation of the cell format and possibly also the connections between the
- Data cells e.g., formulas
- PowerPoint files can be stored in .ppt
- An example of what an XML download (see Fig. 3, step) may look like is shown in Figs. 2 and 21. Fig. 2 shows a visualization of the XML grammar.
- the automatic analysis of the data advantageously takes place on the client (i.e., browser) side, on the one hand to relieve the expensive, central processing power of the server and to scale arbitrarily.
- regions ie data cells
- an .xlsx file or its representation in XML identify features that have specific structural (e.g., a rectangular range of numbers in a table) or content-related (e.g., "EBIT" as a measure and "2010" as the current year) characteristics. These areas are hereafter referred to as
- a content-related feature is to be understood as meaning that there are identifiers (e.g., a header) in the data source that categorize certain data (e.g., in the adjacent data cells), so the content that follows is not content in the sense of, but in the assignment of data cells to a
- this area is automatically assigned to a part of a questionnaire by deducing the form of the information (e.g., first column and column headings) on the subject sizing (e.g., different measures in several years).
- the form of the information e.g., first column and column headings
- the subject sizing e.g., different measures in several years.
- the questionnaire corresponds to a database table
- the technical dimensioning corresponds to the.
- the assignment is a search for the primary key in the metadata repository of the database.
- Programs that deal with cell formats are provided. Starting from a first data cell, these may be e.g. characterized by the following criteria:
- Another criterion is the structure of a formula in one of the data cells. Even if the numbers in formulas of neighboring cells are different, the syntactic structure (decomposition into terms) of a formula (e.g., a sum, an exponential expression, etc.) can provide information about the similarity of the cells to be compared.
- the syntactic structure allows the analysis of the formula without numbers and / or
- Another criterion may be the reference of the data cells in a formula.
- an absolute reference or a relative reference can be evaluated.
- the semantics of a formula can be used as a criterion, e.g., by automatically recognizing that two data cells contain two kinds of mean value calculation whose syntax differs but whose purpose is similar.
- Magnitude is. In principle, it is possible to use some or all of these criteria for characterization.
- FIG. 5 shows by way of example a section of FIG. 4
- the number "89.3" is intended to serve as the first data cell from which a similarity to neighboring data cells is automatically determined, since the technical evaluation of the "similarity" of two cells is of particular importance for the automatic method.
- the similarity between two data cells is calculated by comparing the respective criteria.
- for each criterion, a percentage "single similarity" is formed. Then, to increase the fault tolerance of
- the worst value is deleted and the remaining values are added with a (learned) weighting.
- Fig. 6 shows, in the form of a table, the calculation of
- the similarity also involves the order of the numbers, e.g., to identify outliers.
- the orders of magnitude become over a logarithmic measure
- an outlier can be determined.
- the criterion of the order of magnitude has been identified as the outlier, since this criterion had the lowest match. The removal of this result gives the best overall value, which, incidentally, can be understood as a definition of the outlier.
- the overall similarity (last line in Fig. 6) is then calculated from the matches (taking the weights into account), where the divisor is the sum of the relevant weights (i.e., excluding the outlier).
- Comparison with dimension values is categorized. In the example of FIG. 7 this circumstance is not taken into consideration.
- Criteria a compliance of 0% was determined.
- the criterion with the highest weighting, here the "order of magnitude", is considered to be the outlier, i.e., the divisor is 1-0.15 when calculating the overall score. For example, if a formatted year number is
- the tolerance threshold is the limit at which the percentage similarity value is interpreted as a yes / no decision "similar".
- both vertically (see Fig. 8)
- horizontally (see Fig. 9)
- the immediate and then further neighbors are also characterized and compared with the initial characterization. This comparison leads to the positive result "similar" when only a few aspects (up to a weighted average) are different.
- Treating empty cells is an important point.
- a data cell is "similar" to an adjacent empty cell, so the spread of the data area does not stop at empty data cells, although spreading across completely unfilled areas must of course be prevented, and in particular the
- Similarity of adjacent data cells is determined. If the similarity reaches a certain threshold, the data rectangle is extended horizontally.
- the similarity in the vertical direction is determined.
- a certain threshold_2 the data rectangle is extended by a vertical neighbor data cell and the method is included with the calculation in
- threshold_2 it is checked whether a horizontal extension was made in the step before. If so, then with the repeated determination of similarity in horizontal
- KPI i.e., Key Performance
- the percentage growth value for the following year can be listed: percentage and absolute values, possibly highlighted by different layouts, alternate with each other
- the data rectangle includes 9x4 data cells.
- the surrounding ones are 0
- Caption above includes 9x1 data cells, the caption below also contains 9x1 data cells.
- 1x4 cells are arranged left and right.
- the first strategy is described in Fig. 16 in the form of a flowchart. It should be noted that this embodiment of the method can, in principle, be used independently as well as in combination with the
- method for detecting a data rectangle (e.g., Fig. 11).
- search for keywords can also be used in connection with the search for the data rectangle.
- a feature vector receives one for each subject dimension ("time”, “market participant”, etc.)
- each cell value is searched in the space of all previously known dimension values and, in the case of a found value, this criterion is additionally included in the similarity analysis.
- the database is searched for whether this data has already occurred once.
- Dimensional attributes can be assigned to these foursides, such as in Fig. 18 for the values BMW and VW; both are market participants (MT). From this a similarity can be calculated again, here 100%.
- multidimensional data model (Fig. 1, step 2.3), which may also be referred to as a data cube.
- a data cube can be thought of as a multidimensional matrix, with the columns and rows being the dimensions
- the dimension combination "KPI x year” is used for the central data area of the "Financial Objectives in the
- Data cells B6 to J9 contain market shares.
- a file is automatically generated whose data cells can be assigned certain attributes.
- the embodiment according to Fig. 1 can, e.g., be coupled with a learning system so that certain relationships between the data cells and the structure of a spreadsheet are stored.
- FIG. 22 shows a view of a questionnaire into which the data from FIG. 20 has been read.
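
The similarity scoring and the expansion of the data rectangle described in the definitions above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the three criteria (`type`, `magnitude`, `sign`), their weights, the base-10 logarithmic magnitude measure, the thresholds of 0.7, and the function names are all hypothetical. The sketch also only grows the rectangle to the right and downwards from the start cell. What it takes from the text is the idea that the outlier is the criterion whose removal gives the best overall value, that the divisor is the sum of the remaining weights, and that horizontal and vertical extension alternate until no further extension is possible.

```python
from math import log10
from typing import Dict, List, Tuple

# name -> (single similarity in 0..1, weight); weights are illustrative only
Criteria = Dict[str, Tuple[float, float]]


def overall_similarity(criteria: Criteria) -> float:
    """Weighted mean of the single similarities after dropping one outlier.

    The outlier is the criterion whose removal yields the best overall value;
    the divisor is the sum of the remaining weights.
    """
    best = 0.0
    for outlier in criteria:
        rest = [(s, w) for name, (s, w) in criteria.items() if name != outlier]
        total_weight = sum(w for _, w in rest)
        if total_weight > 0:
            best = max(best, sum(s * w for s, w in rest) / total_weight)
    return best


def cell_similarity(a: object, b: object) -> float:
    """Toy comparison of two cell values using three assumed criteria."""
    numeric = all(isinstance(x, (int, float)) for x in (a, b))
    same_type = 1.0 if type(a) is type(b) else 0.0
    magnitude = sign = 0.0
    if numeric:
        sign = 1.0 if (a >= 0) == (b >= 0) else 0.0
        if a == 0 or b == 0:
            magnitude = 1.0 if a == b else 0.0
        else:
            # assumed logarithmic measure: one decade apart means no match
            magnitude = max(0.0, 1.0 - abs(log10(abs(a)) - log10(abs(b))))
    return overall_similarity(
        {"type": (same_type, 0.4), "magnitude": (magnitude, 0.3), "sign": (sign, 0.3)}
    )


def grow_rectangle(
    grid: List[List[object]],
    start: Tuple[int, int],
    threshold_h: float = 0.7,
    threshold_v: float = 0.7,
) -> Tuple[int, int, int, int]:
    """Return (top, left, bottom, right) of the rectangle grown from start."""
    top, left = start
    bottom, right = start
    n_rows, n_cols = len(grid), len(grid[0])
    grew = True
    while grew:  # termination criterion: no extension in either direction
        grew = False
        while right + 1 < n_cols:  # horizontal extension
            sims = [cell_similarity(grid[r][right], grid[r][right + 1])
                    for r in range(top, bottom + 1)]
            if sum(sims) / len(sims) < threshold_h:
                break
            right += 1
            grew = True
        while bottom + 1 < n_rows:  # vertical extension
            sims = [cell_similarity(grid[bottom][c], grid[bottom + 1][c])
                    for c in range(left, right + 1)]
            if sum(sims) / len(sims) < threshold_v:
                break
            bottom += 1
            grew = True
    return top, left, bottom, right


grid = [
    ["KPI", "2009", "2010", "2011"],
    ["EBIT", 89.3, 92.1, 95.0],
    ["Sales", 410.0, 425.5, 440.2],
    ["Note", "n/a", "n/a", "n/a"],
]
print(grow_rectangle(grid, (1, 1)))  # (1, 1, 2, 3)
```

In this example the rectangle starts at the cell containing 89.3 and stops at the text row, because none of the assumed criteria match between a number and a string. Dropping one outlier makes the score tolerant of a single deviating criterion, such as a different order of magnitude between 89.3 and 410.0.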
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP11749377.5A EP2601594A1 (fr) | 2010-08-06 | 2011-08-04 | Method and device for automatically processing data in a cell format |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP10172235 | 2010-08-06 | ||
EP11749377.5A EP2601594A1 (fr) | 2010-08-06 | 2011-08-04 | Method and device for automatically processing data in a cell format |
PCT/EP2011/063489 WO2012017056A1 (fr) | 2010-08-06 | 2011-08-04 | Method and device for automatically processing data in a cell format |
Publications (1)
Publication Number | Publication Date |
---|---|
EP2601594A1 (fr) | 2013-06-12 |
Family
ID=44532823
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP11749377.5A Ceased EP2601594A1 (fr) | Method and device for automatically processing data in a cell format | 2010-08-06 | 2011-08-04 |
Country Status (2)
Country | Link |
---|---|
EP (1) | EP2601594A1 (fr) |
WO (1) | WO2012017056A1 (fr) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020013956A1 (fr) * | 2018-07-13 | 2020-01-16 | Microsoft Technology Licensing, Llc | Systems, methods, and computer-readable media for improved table identification using a neural network |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
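CN110659527B (zh) | 2018-06-29 | 2023-03-28 | 微软技术许可有限责任公司 | Table detection in spreadsheets |
CN109829144B (zh) * | 2018-12-28 | 2023-06-06 | 陈德芹 | Method and device for cross-table references in online spreadsheets |
KR20210057306A (ko) * | 2019-11-12 | 2021-05-21 | 주식회사 모카앤제이에스 | Method for providing a block-editor-based document editing service, and server and computer program for performing the same |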
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8205149B2 (en) * | 2001-01-05 | 2012-06-19 | Microsoft Corporation | Enhanced find and replace for electronic documents |
US20060167911A1 (en) * | 2005-01-24 | 2006-07-27 | Stephane Le Cam | Automatic data pattern recognition and extraction |
US7779000B2 (en) * | 2005-08-29 | 2010-08-17 | Microsoft Corporation | Associating conditions to summary table data |
US8856649B2 (en) * | 2009-06-08 | 2014-10-07 | Business Objects Software Limited | Aggregation level and measure based hinting and selection of cells in a data display |
-
2011
- 2011-08-04 WO PCT/EP2011/063489 patent/WO2012017056A1/fr active Application Filing
- 2011-08-04 EP EP11749377.5A patent/EP2601594A1/fr not_active Ceased
Non-Patent Citations (2)
Title |
---|
None * |
See also references of WO2012017056A1 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020013956A1 (fr) * | 2018-07-13 | 2020-01-16 | Microsoft Technology Licensing, Llc | Systems, methods, and computer-readable media for improved table identification using a neural network |
US12039257B2 (en) | 2018-07-13 | 2024-07-16 | Microsoft Technology Licensing, Llc | Systems, methods, and computer-readable media for improved table identification using a neural network |
Also Published As
Publication number | Publication date |
---|---|
WO2012017056A1 (fr) | 2012-02-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
DE202011110895U1 (de) | | Real-time synchronized editing of documents by multiple users for blogging |
DE112013000987T5 (de) | | Generating visualizations of a display group of tags representing content instances in objects satisfying search criteria |
DE102013206281A1 (de) | | Optimizing sparse schema-less data in relational stores |
EP2439691A1 (fr) | | Device and method for the machine-based creation of a process diagram |
DE60310881T2 (de) | | Method and user interface for forming a representation of data with meta-morphing |
DE112018002626T5 (de) | | Methods and systems for optimized visual summarization of sequences with time-related event data |
WO2012017056A1 (fr) | | Method and device for automatically processing data in a cell format |
DE19849855C1 (de) | | Method for the automatic generation of a textual utterance from a meaning representation by a computer system |
AT522281B1 (de) | | Method for characterizing the operating state of a computer system |
DE3689502T2 (de) | | System and method for program structuring by data table translation |
DE112012004300T5 (de) | | Method, program and system for creating a workflow from a work specification |
DE102012025349A1 (de) | | Determining a similarity measure and processing documents |
WO2003054727A1 (fr) | | Categorization system for data objects and method for checking the consistency of assignments of data objects to categories |
DE10325843B4 (de) | | Method, printing system, computer and computer program for managing resources for use in a resource-based document data stream |
WO2009012802A1 (fr) | | System and method for controlling the generation and distribution of publications |
EP0978052A1 (fr) | | Computer-aided selection of training data for a neural network |
EP2149844B1 (fr) | | Method and computer program product for automatically importing data from a database system into a data structure |
DE102009037848A1 (de) | | Method for the computer-aided processing of digital, semantically annotated information |
DE102009016588A1 (de) | | Method for determining text information |
EP1629401A2 (fr) | | Method, device and computer program with program code means and a program code product for analyzing user data structured according to a database structure |
EP1324236A1 (fr) | | Determination of a characteristic function of a matrix with enrichment and compression |
EP2518644A1 (fr) | | Method for controlling the translation of predetermined rules and/or data to be recorded of a data stream |
EP4102378A1 (fr) | | Method for reorganizing and/or transforming data |
DE10109876B4 (de) | | Method and device for data management |
DE202023106456U1 (de) | | A system for preparing a thesis for an applied research project |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
| | 17P | Request for examination filed | Effective date: 20130306 |
| | AK | Designated contracting states | Kind code of ref document: A1; Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
| | DAX | Request for extension of the european patent (deleted) | |
| | 17Q | First examination report despatched | Effective date: 20160914 |
| | REG | Reference to a national code | Ref country code: DE; Ref legal event code: R003 |
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE APPLICATION HAS BEEN REFUSED |
| | 18R | Application refused | Effective date: 20190912 |