FR2944895A1

FR2944895A1 - METHOD FOR THE HEURISTIC ANALYSIS OF A FILE AND COMPUTER PROGRAM PRODUCT FOR IMPLEMENTING SUCH A METHOD

Info

Publication number: FR2944895A1
Application number: FR0902028A
Authority: FR
Inventors: Franck Jeannin
Original assignee: ORMETIS
Current assignee: ORMETIS
Priority date: 2009-04-27
Filing date: 2009-04-27
Publication date: 2010-10-29
Also published as: WO2010125256A1

Abstract

Procédé d'analyse heuristique d'un fichier représentatif d'un tableau de données afin de reconstituer ledit tableau, le fichier étant structuré en lignes de caractères comportant des caractères séparateurs de colonnes, caractérisé en ce qu'il comporte la recherche d'un caractère dont le nombre d'occurrences est le même pour toutes les lignes ; ledit caractère étant choisi en tant que séparateur de colonnes, de manière à déterminer une structure de tableau dépendante du format. Produit programme d'ordinateur comprenant des instructions de code de programme pour la mise en oeuvre d'un tel procédé.Method for heuristically analyzing a file representative of a data table in order to reconstitute said table, the file being structured in lines of characters comprising column separator characters, characterized in that it comprises the search for a character whose number of occurrences is the same for all lines; said character being chosen as a column separator, so as to determine a format dependent array structure. A computer program product comprising program code instructions for carrying out such a method.

Description

PROCEDE D'ANALYSE HEURISTIQUE D'UN FICHIER ET PRODUIT PROGRAMME D'ORDINATEUR POUR LA MISE EN OEUVRE D'UN TEL PROCEDE L'invention porte sur un procédé d'analyse heuristique d'un fichier représentatif d'un tableau de données, afin de reconstituer ledit tableau, ainsi que sur un produit programme d'ordinateur comprenant des instructions de code de programme pour la mise en oeuvre d'un tel procédé. L'invention s'applique en particulier à l'interprétation des fichiers de type CSV ( comma separated values , c'est à dire valeurs séparées par virgule ). Le format CSV est un format informatique ouvert permettant de représenter des données sous la forme d'un tableau. Il s'agit d'un format texte, structuré en lignes de caractères séparées par un retour de chariot (caractère CRLF ). Chaque ligne est subdivisée en un même nombre de colonnes par un caractère prédéfini, dit séparateur de colonnes . On obtient ainsi une représentation unidimensionnelle d'un tableau bidimensionnel. Le séparateur de colonnes peut être traité comme un caractère ordinaire à condition d'apparaître à l'intérieur d'un groupe de caractères enfermés entre des doubles guillemets. The invention relates to a heuristic analysis method of a file representative of a data table, in order to provide a method for heuristically analyzing a file representative of a table of data, in order to reconstructing said array, as well as a computer program product comprising program code instructions for carrying out such a method. The invention applies in particular to the interpretation of CSV type files (comma separated values, ie values separated by comma). The CSV format is an open computer format for representing data in the form of a table. It is a text format, structured in lines of characters separated by a carriage return (CRLF character). Each line is subdivided into the same number of columns by a predefined character, called column separator. This gives a one-dimensional representation of a two-dimensional array. The column separator can be treated as an ordinary character as long as it appears inside a group of characters enclosed in double quotation marks.

Ce format est très utilisé, en raison de sa simplicité. Cependant, son utilisation est rendue difficile par !e fait qu'il n'est qu'imparfaitement normalisé. En effet, contrairement au séparateur de lignes (caractère CRLF ), le séparateur de colonnes n'est pas standardisé. Il s'agit parfois de la virgule û d'où le nom du format û mais ce choix n'est pas très heureux, car la virgule est également utilisée comme séparateur décimal (ou de milliers, dans les pays anglo-saxons), dans les adresses, etc. Par conséquent d'autres caractères (point-virgule, tabulation, tiret, barre oblique...) sont couramment utilisés en tant que séparateurs de colonnes. Il s'ensuit que l'utilisateur souhaitant importer un fichier de type CSV dans un tableur doit choisir manuellement le caractère de séparation des colonnes. Cette tâche est fastidieuse, mais son automatisation n'a pas été possible à ce jour. This format is very used, because of its simplicity. However, its use is made difficult by the fact that it is only imperfectly standardized. Unlike the line separator (CRLF character), the column separator is not standardized. Sometimes this is the comma - hence the name of the format - but this choice is not very happy, because the comma is also used as decimal separator (or thousands, in the Anglo-Saxon countries), in addresses, etc. Therefore other characters (semicolon, tab, dash, slash ...) are commonly used as column separators. It follows that the user wishing to import a CSV type file into a spreadsheet must manually choose the separator character of the columns. This task is tedious, but its automation has not been possible to date.

L'invention vise à résoudre ce problème. Dans un mode de réalisation préféré, l'invention permet également d'identifier, de manière automatique, un format des données du fichier. Par format on entend ici en particulier un format régional . En effet certaines données telles que les dates ou les nombres s'écrivent d'une manière différente dans différents pays : ainsi, dans de nombreux pays la virgule est le séparateur décimal et le point le séparateur optionnel des milliers, tandis que dans les pays anglo-saxons la réciproque est vraie ; de même, les dates s'écrivent sous la forme jour/mois/année en Europe (Royaume-Uni inclus) et mois/jour/année aux Etats-Unis, et ainsi de suite. L'identification automatique du format régional du fichier ù supposé unique ù permet l'interprétation des données contenues dans les groupements signifiants. Une idée à la base de ce mode de réalisation préféré de l'invention est que les problèmes consistant à identifier la structure du tableau et son format régional ne doivent pas être résolus séparément, mais de manière synergique, ce qui permet de trouver une solution heuristique particulièrement performante. Il est important de souligner le caractère heuristique du procédé de l'invention : il n'est pas possible de garantir que le tableau sera reconstitué de manière exacte, mais en fait cela se produit dans l'immense majorité des cas, sauf lorsqu'on utilise des fichiers trop petits ou pathologiques . En cas d'échec, bien entendu, il reste possible d'avoir recours à une identification manuelle des caractères séparateurs et/ou du format régional. Un objet de l'invention est donc un procédé d'analyse heuristique d'un fichier représentatif d'un tableau de données afin de reconstituer ledit tableau, le fichier étant structuré en lignes de caractères comportant des caractères séparateurs de colonnes, caractérisé en ce qu'il comporte la recherche d'un caractère, à l'exception de caractères éventuels encadrés par des doubles guillemets, dont le nombre d'occurrences est le même pour toutes les lignes ; ledit caractère étant choisi en tant que séparateur de colonnes de manière à déterminer une structure de tableau. Selon un mode de réalisation préféré de l'invention, le procédé peut comporter : a. l'identification, à l'intérieur de chaque ligne, de groupements de caractères considérés comme signifiants, au moins certains des caractères n'appartenant pas à de tels groupements étant considérés comme des séparateurs de colonnes potentiels ; cette identification étant effectuée en parallèle pour plusieurs formats de données prédéfinis ; b. pour chaque format de données, la recherche d'un caractère, parmi lesdits séparateurs de colonnes potentiels, dont le nombre d'occurrences est le même pour toutes les lignes ; ledit caractère étant choisi en tant que séparateur de colonnes effectif, de manière à déterminer une structure de tableau dépendante du format ; et c. l'application d'un critère de sélection pour choisir un format de données et le séparateur de colonnes correspondant comme étant les plus vraisemblables. Selon des variantes particulières du procédé de l'invention : - L'étape c. peut comporter l'élimination du ou des formats prédéfinis ne permettant pas d'identifier un séparateur de colonnes effectif. - L'étape a. peut comporter l'utilisation d'une pluralité de stratégies d'identification de groupements de caractères considérés comme signifiants pour chaque format de données prédéfini. - L'étape c. peut comporter la sélection du format ou du couple format de données / stratégie d'identification de groupements de caractères conduisant à l'identification du plus grand nombre de caractères considérés comme signifiants. - Les étapes a. et b. peuvent être mises en oeuvre conjointement, et comporter: al. l'identification, à l'intérieur de chaque ligne, desdits groupements de caractères considérés comme signifiants ; 4 a2. le comptage des occurrences des séparateurs de colonnes potentiels dans ladite ligne ; et a3. sauf pour la première ligne analysée, la suppression des séparateurs de colonnes potentiels dont le nombre 5 d'occurrences est différent de celui de la ligne précédente. - Le procédé peut comporter également l'identification de types prédéfinis desdits groupements de caractères considérés comme signifiants. - Le procédé peut comporter également l'identification de la 10 première ligne du tableau comme étant une ligne d'en-têtes si toutes les autres lignes du tableau présentent une structure commune, la structure de ladite première ligne étant différente. - Le procédé peut comporter également l'identification de la première colonne du tableau comme étant une colonne d'en-têtes si toutes 15 les autres colonnes du tableau présentent une structure commune, la structure de ladite première colonne étant différente. Un autre objet de l'invention est un produit programme d'ordinateur comprenant des instructions de code de programme pour la mise en oeuvre d'un tel procédé. 20 D'autres caractéristiques, détails et avantages de l'invention ressortiront à la lecture de la description faite en référence aux figures 1, 2, 3 et 4 annexées, qui illustrent un exemple de mise en oeuvre du procédé de l'invention. L'invention sera décrite en référence à des cas particuliers, 25 dans lesquels le fichier à analyser est de type CSV et peut contenir du texte, des nombres éventuellement décimaux et des dates. On considère trois formats de données, ou paramètres, possibles, correspondant à des régions géographiques distinctes : un format français , dans lequel la virgule sert comme 30 séparateur décimal, le point comme séparateur optionnel des milliers et les dates s'écrivent de la manière suivante : jour (entier compris entre 1 et 31) / mois (entier compris entre 1 et 12) / année (entier à quatre chiffres, ne commençant pas par zéro) ; - un format américain , dans lequel le point sert comme séparateur décimal, la virgule comme séparateur optionnel des milliers et les dates s'écrivent de la manière suivante : mois (entier compris entre 1 et 12) / jour (entier compris entre 1 et 31) / année (entier à quatre chiffres, ne commençant pas par zéro) ; et - un format britannique dans lequel le point sert comme séparateur décimal, la virgule comme séparateur optionnel des milliers et les dates s'écrivent de la manière suivante : jour (entier compris entre 1 et 31) / mois (entier compris entre 1 et 12) / année (entier à quatre chiffres, ne commençant pas par zéro). II est clair qu'il s'agit là d'une simplification : en réalité, les dates peuvent s'écrire de plusieurs manières différentes, par exemple avec le nom du mois écrit en lettres, en entier ou sous une forme raccourcie. Par ailleurs, d'autres types de données peuvent être utilisés, tels que des sommes d'argent (nombre entier ou décimal suivi ou précédé par le nom ou le symbole de la devise), des adresses, des adresses de courriel, des numéros de téléphone, des pourcentages, etc. The invention aims to solve this problem. In a preferred embodiment, the invention also makes it possible to automatically identify a file data format. By format we mean here in particular a regional format. Indeed some data such as dates or numbers are written differently in different countries: thus, in many countries the comma is the decimal separator and the point the optional separator thousands, while in Anglo countries -should the reciprocal be true; similarly, dates are written as day / month / year in Europe (United Kingdom included) and month / day / year in the United States, and so on. The automatic identification of the regional format of the so-called unique file allows the interpretation of the data contained in the signifying groups. An idea underlying this preferred embodiment of the invention is that the problems of identifying the structure of the table and its regional format should not be solved separately, but in a synergistic way, which allows to find a heuristic solution particularly powerful. It is important to underline the heuristic nature of the process of the invention: it is not possible to guarantee that the table will be reconstituted exactly, but in fact it happens in the vast majority of cases, except when uses files that are too small or pathological. In the event of failure, of course, it remains possible to use manual identification of the separator characters and / or the regional format. An object of the invention is therefore a heuristic analysis method of a file representative of a data table in order to reconstitute said table, the file being structured in lines of characters comprising column separator characters, characterized in that it includes the search for a character, with the exception of possible characters enclosed by double quotation marks, whose number of occurrences is the same for all lines; said character being chosen as a column separator so as to determine an array structure. According to a preferred embodiment of the invention, the method may comprise: a. identifying, within each line, groups of characters considered significant, at least some of the characters not belonging to such groups being considered as potential column separators; this identification being performed in parallel for several predefined data formats; b. for each data format, searching for a character, among said potential column separators, whose number of occurrences is the same for all the lines; said character being chosen as an effective column separator, so as to determine a format dependent array structure; and c. the application of a selection criterion to choose a data format and the corresponding column separator as the most likely. According to particular variants of the process of the invention: step c. may include the elimination of the predefined format or formats that do not identify an effective column separator. - Step a. may include the use of a plurality of character group identification strategies considered significant for each predefined data format. - Step c. may include the selection of the format or the data format / identification strategy combination of character groups leading to the identification of the largest number of characters considered significant. - The steps a. and B. may be implemented jointly, and include: identifying, within each line, said groupings of characters considered significant; 4 a2. counting occurrences of potential column separators in said line; and a3. except for the first line analyzed, the deletion of potential column separators whose number of occurrences is different from that of the previous line. - The method may also include the identification of predefined types of said character groups considered significant. The method may also include identifying the first row of the array as a header line if all other rows of the array have a common structure, the structure of said first row being different. The method may also include identifying the first column of the array as a column of headers if all the other columns of the array have a common structure, the structure of said first column being different. Another object of the invention is a computer program product comprising program code instructions for carrying out such a method. Other characteristics, details and advantages of the invention will emerge on reading the description made with reference to FIGS. 1, 2, 3 and 4, which illustrate an example of implementation of the method of the invention. The invention will be described with reference to particular cases in which the file to be analyzed is of CSV type and may contain text, possibly decimal numbers and dates. We consider three possible data formats, or parameters, corresponding to distinct geographical regions: a French format, in which the comma serves as a decimal separator, the point as an optional separator of thousands, and the dates are written in the following manner : day (an integer between 1 and 31) / month (an integer between 1 and 12) / year (a four-digit integer, not beginning with zero); - an American format, in which the point serves as a decimal separator, the comma as an optional separator for thousands, and the dates are written in the following way: month (integer between 1 and 12) / day (integer between 1 and 31) ) / year (four-digit integer, not starting with zero); and - a British format in which the point is used as a decimal separator, the comma as an optional separator for thousands and the dates are written in the following way: day (integer between 1 and 31) / month (integer between 1 and 12) ) / year (four-digit integer, not starting with zero). It is clear that this is a simplification: in reality, the dates can be written in several different ways, for example with the name of the month written in letters, in whole or in a shortened form. In addition, other types of data may be used, such as sums of money (integer or decimal followed or preceded by the name or symbol of the currency), addresses, e-mail addresses, phone, percentages, etc.

Le jeu de caractères utilisé par le fichier est supposé connu. Il existe d'ailleurs de nombreuses techniques permettant de le déterminer de manière automatique. On fait par ailleurs les hypothèses suivantes, qui ne sont pas très limitatives, car on peut supposer qu'elles sont satisfaites par tout fichier 25 CSV raisonnable : - le caractère utilisé pour séparer les colonnes existe (il y a donc au moins deux colonnes ù cette hypothèse sera cependant relâchée par la suite) et il est unique sur toute la longueur du fichier ; - le double guillemet " est un caractère spécial et ne peut 30 pas être un séparateur de colonnes. Tous les caractères compris entre deux doubles guillemets sont traités littéralement et ne peuvent pas être à cheval entre deux colonnes, même si la chaîne de caractères contient le caractère de séparation de colonnes ; - Le nombre de colonnes est constant pour chaque ligne du fichier ; - Les paramètres de format ( paramètres régionaux ) ne peuvent pas varier au sein d'un même fichier. Par ailleurs, certains formats de nombres et de dates ne sont valides que pour certains paramètres régionaux ; - La première ligne peut être de nature différente des lignes lo suivantes si elle correspond aux en-têtes de colonnes ; - De même, la première colonne peut être de nature différente des autres colonnes si elle correspond aux en-têtes de lignes. Le procédé selon l'invention vise à identifier le caractère séparateur et, par conséquent, à déterminer le nombre de colonnes. 15 Optionnellement, un tel procédé peut permettre de déterminer également : - le format géographique des données (limité dans les exemples aux trois possibilités précitées : français, américain, anglais) ; - si la première ligne et/ou la première colonne 20 représentent des en-têtes ; - le type (nombre, date, texte) et, le cas échéant, la valeur de chaque donnée. Comme mentionné plus haut, le mode préféré de réalisation de l'invention se base sur la détermination simultanée du caractère 25 séparateur de colonnes (et donc du nombre de colonnes) et du format des données û format qui n'a pas nécessairement une signification géographique comme dans les exemples qui suivent. Cette idée peut être exposée en s'aidant de quelques exemples très simples. On peut commencer par considérer le fichier suivant, 30 constitué de deux lignes : 1,234,567 1,234 Prises isolément, ces deux lignes peuvent chacune être interprétées de deux façons : virgule comme séparateur de milliers ou virgule comme séparateur de colonnes. Il est clair que la virgule ne peut pas être un séparateur décimal, car dans ce cas la première ligne aurait un format s irrégulier. Par contre, une seule interprétation permet de maintenir le nombre de colonnes constant : virgule comme séparateur de milliers, une seule colonne. A l'inverse, dans le fichier suivant : 10 1,234,567 1,23,456 La virgule n'est pas un séparateur de milliers valide sur la deuxième ligne, ce qui suggère de l'interpréter plutôt comme un séparateur de colonnes (y compris, rétroactivement, pour la première ligne, là où les 15 deux options étaient possibles). Si l'on s'en tient uniquement à la règle du nombre de colonnes constant, les caractères point , virgule et barre oblique sont tous les trois des séparateurs de colonnes potentiels dans le fichier suivant : 20 1.25,13/10/2008,9.99 10.5,5/5/2008,200.5 La première ligne du fichier peut donc être décomposée de la manière suivante : 1 ù 25,13/10/2008,9 ù 99 (Si le point est le 25 séparateur de colonnes); ou 1.25 ù 13/10/2008 ù 9.99 (Si la virgule est le séparateur de colonnes) ; ou 1.25,13 ù 10 ù 2008,9.99 (Si la barre oblique est le séparateur de colonnes). 30 On voit donc que le simple comptage des instances des caractères dans les différentes lignes du fichier est parfois insuffisant pour identifier de manière fiable le séparateur de colonnes. Par ailleurs, cette approche ne fournit aucune indication quant au format des données. C'est à ce point qu'intervient la notion de groupement de caractères signifiant, ou considéré comme tel. Un groupement signifiant est tout groupement de caractères qui est susceptible, en raison de sa forme, de représenter autre chose que du texte : un nombre, une date, un numéro de téléphone, etc. Un groupement de caractères n'est pas signifiant en soi, mais seulement dans le contexte d'un format de données. The character set used by the file is assumed to be known. There are also many techniques to determine it automatically. We also make the following assumptions, which are not very limiting, because we can assume that they are satisfied by any reasonable CSV file: - the character used to separate the columns exists (there are therefore at least two columns where this hypothesis will however be released later) and it is unique over the entire length of the file; The double quotation mark is a special character and can not be a column separator.All characters between two double quotation marks are treated literally and can not be straddling between two columns, even if the character string contains the double quotation mark. Column separation character - The number of columns is constant for each line of the file - The format parameters (regional settings) can not vary within the same file Some formats of numbers and dates are only valid for certain locales - The first line may be different from the following lo lines if it matches the column headers - Similarly, the first column may be different in nature from the other columns if it matches The method according to the invention aims at identifying the separator character and, consequently, at determining the number Optionally, such a method can also determine: - the geographical format of the data (limited in the examples to the three aforementioned possibilities: French, American, English); if the first row and / or the first column represent headers; - the type (number, date, text) and, if applicable, the value of each data. As mentioned above, the preferred embodiment of the invention is based on the simultaneous determination of the column separator character (and thus the number of columns) and the format of the data - a format that does not necessarily have a geographic meaning. as in the examples that follow. This idea can be exposed with the help of some very simple examples. We can begin by considering the following file, consisting of two lines: 1,234,567 1,234 taken in isolation, these two lines can each be interpreted in two ways: comma as separator of thousands or comma as separator of columns. It is clear that the comma can not be a decimal separator, because in this case the first line would have an irregular format. On the other hand, a single interpretation makes it possible to maintain the number of columns constant: comma as separator of thousands, a single column. Conversely, in the following file: 10 1,234,567 1,23,456 The comma is not a valid thousands separator on the second line, which suggests to interpret it as a column separator (including, retroactively, for the first line, where both options were possible). If you stick to the constant number of columns rule only, the period, decimal, and slash characters are all potential column separators in the following file: 20 1.25,13 / 10 / 2008.9.99 10.5,5 / 5 / 2008,200.5 The first line of the file can therefore be decomposed as follows: 1 ù 25,13 / 10 / 2008,9 ù 99 (If the point is the column separator); or 1.25 ù 13/10/2008 ù 9.99 (If the decimal point is the column separator); or 1.25,13 ù 10 ù 2008,9.99 (If the slash is the column separator). Thus, it can be seen that simply counting the instances of the characters in the different lines of the file is sometimes insufficient to reliably identify the column separator. In addition, this approach provides no indication of the format of the data. It is at this point that the notion of a grouping of characters signifying, or considered as such, comes into play. A signifying group is any grouping of characters that is likely, because of its form, to represent something other than text: a number, a date, a telephone number, and so on. A grouping of characters is not significant in itself, but only in the context of a data format.

Par exemple, 1.25 est un groupement qui peut être considéré comme signifiant en format anglais ou américain : il représente un nombre décimal. En revanche, il n'est pas signifiant (il ne représente que du texte) en format français. De même, 13/10/2008 est signifiant (une date) en format 15 français ou anglais , mais pas en format américain . Le procédé selon le mode préféré de réalisation de l'invention comporte l'identification, à l'intérieur de chaque ligne, de groupements de caractères potentiellement signifiants, et cela pour chacun des formats prédéfinis. 20 Ainsi, en format français, on pourra découper la première ligne du fichier de la manière suivante : 1 (nombre) ù .25,13/10/2008,9. (texte) ù 99 (nombre) ; ou encore : 1 (nombre) ù . (texte) ù 25,13 (nombre) ù / 25 (texte) ù 10 (nombre) ù / (texte) ù 2008 (nombre) ù , (texte) ù 9 (nombre) ù . (texte) ù 99 (nombre) ; ou encore : 1.25, (texte) ù 13/10/2008 (date) ù ,9.99 (texte), et ainsi de suite. On voit que plusieurs découpages sont possibles pour 30 chaque format régional : il est donc nécessaire de définir une ou plusieurs stratégies de découpage. Par exemple on peut envisager une première stratégie, dite gloutonne , qui tend à privilégier les grands nombres et les dates par rapport aux petits nombres, et une deuxième stratégie dite frugale qui tend à privilégier les petits nombres. Avantageusement, ces stratégies seront utilisées en parallèle. Une autre idée à la base dudit mode de réalisation préféré de l'invention est que les caractères faisant partie des groupements identifiés comme signifiants ne peuvent pas être des séparateurs de colonnes (les groupements de caractères encadrés par des doubles guillemets sont traités comme des groupements signifiants). C'est donc seulement parmi les autres caractères, qui ne font partie d'aucun groupement signifiant, que l'on cherchera celui qui présente un même nombre d'occurrences dans toutes les lignes, et qui sera donc identifié comme étant le séparateur de colonnes effectif. Si le nombre de lignes du fichier est suffisant, le plus souvent un seul caractère satisfera à cette condition, et cela pour un seul format. On aura donc identifié en même temps la structure du tableau représenté par le fichier et le format des données dudit tableau. Le principe de fonctionnement de l'invention sera maintenant illustré à l'aide d'un exemple plus complet. Dans cet exemple, on cherche à analyser le fichier suivant, constitué de quatre lignes : Nom,Age,Date de naissance Pierre,38,10/5/1970 Paul,20,25/8/1988 Jacques,70,7/7/1938 L'interprétation de ce fichier est évidente pour un être humain, 25 mais pas pour un ordinateur. Comme dans l'exemple précédent, on considère trois formats régionaux (français - FR, anglais - UK et américain - US), avec les deux stratégies gloutonne et frugale , et des données de type texte, date ou nombre. 30 On remarquera que la première ligne constitue un en-tête, et est donc différente des autres lignes. La méthode de l'invention permet de gérer cette difficulté supplémentaire. 2944895 io La figure 1 illustre l'analyse de la première ligne du fichier. En principe, cette ligne est découpée en groupements signifiants de six manières différentes (deux stratégies pour trois formats régionaux). Concrètement, elle ne contient que du texte, et donc aucun groupement signifiant ne peut être 5 identifié, et cela quel que soit le format considéré. Tous les caractères (N, o, m, , , A, g, e, D, a, t, , d, n, i, s, c) sont donc considérés comme des séparateurs de colonnes potentiels, et leurs occurrences sont comptées. Pour chaque couple format régional / stratégie, un compteur indique le nombre de caractères potentiellement signifiants identifiés depuis le 10 début du fichier. Ces compteurs sont initialisés à zéro, et leurs valeurs restent nulles après analyse de la première ligne (qui ne contient que du texte, et donc aucun caractère signifiant). L'analyse de la deuxième ligne (figure 2) s'avère moins banale. 15 En format US, stratégie gloutonne , l'algorithme identifie les groupements signifiants suivants : 38 (nombre) et 10/5/1970 (date), soit 11 caractères ; le compteur prend donc la valeur 11. Les séparateurs de colonnes potentiels sont donc : P , i , e , r et , . Or, P et r peuvent être écartés tout de suite, car ces caractères 20 étaient absent dans la première ligne : il ne peut donc pas s'agir de séparateurs de colonne effectifs. Le caractère e doit également être écarté, car il apparaît quatre fois dans la première ligne, mais seulement deux fois dans la deuxième. Seules les caractères , et i , dont le nombre d'occurrences est le même dans la première et la deuxième ligne sont donc retenus comme des séparateurs de colonnes potentiels. En format US, stratégie frugale , les groupements signifiants sont : 38 , 10 , 5 et 1970 , soit 9 caractères. Les séparateurs de colonnes potentiels sont donc : P , i , e , r , , et / . Comme dans le cas précédent, seuls les caractères , et i présentent un même nombre d'occurrences dans la première et dans la deuxième ligne, et sont donc retenus comme séparateurs de colonnes potentiels. 2944895 Il En format UK, les résultats sont les mêmes que pour le format US. II n'est donc pas nécessaire de décrire l'analyse en détail. En format FR, stratégie gloutonne , les groupements signifiants sont 38,10 (nombre, avec virgule comme séparateur décimal), 5 5 , et 1970 , soit 10 caractères. Les séparateurs de colonnes potentiels sont donc : P , i , e , r , , et / . Mais cette fois, seule une des deux virgules est considérée comme un séparateur de colonnes potentiel : en effet, la virgule qui apparaît dans 38,10 appartient à un groupement signifiant, et ne peut donc pas servir pour séparer deux colonnes. lo Dans ces conditions, le seul candidat à la fonction de séparateur de colonnes est le caractère i , qui apparaît une fois dans chaque ligne en dehors de tout groupement signifiant. La stratégie frugale appliquée au format FR donne les mêmes résultats. 15 L'analyse de la troisième ligne est illustrée sur la figure 3. En format US, stratégie gloutonne , l'algorithme identifie les groupements signifiants suivants : 20 25 , 8 et 1988 , car 25/8/1988 n'est pas une date valable dans ce format. Ces groupements comprennent 9 caractères au total, ce qui amène la valeur du compteur à 20. 20 Il n'est pas nécessaire de compter les occurrences de tous les autres caractères de la ligne, mais on peut s'intéresser uniquement à , et i , dont le nombre d'occurrences était le même dans la première et la deuxième ligne. Or, le caractère i n'apparaît pas dans la troisième ligne du fichier, ce qui indique qu'il ne peut pas s'agir d'un séparateur de colonnes. En revanche, 25 le caractère , apparaît bien deux fois, comme dans la première et la deuxième ligne : ce caractère retient donc son statut de séparateur de colonnes potentiel. L'analyse en stratégie frugale / format US ou UK donne les mêmes résultats. 30 En format UK, stratégie gloutonne , en revanche, le découpage donne les groupes signifiants suivants : 20 (nombre) et 25/8/1988 (date), soit 11 caractères, ce qui amène la valeur du compteur à 22. Comme dans les autres cas, le caractère , reste le seul séparateur de colonnes potentiel. En format FR on rencontre une difficulté : la virgule avait été éliminée lors de l'étape précédente, et seul le caractère i était encore considéré comme un séparateur de colonne potentiel. Mais la troisième colonne ne contient pas de i : ce caractère doit donc être écarté à son tour. Il s'ensuit que l'adoption du format FR ne permet pas d'identifier un séparateur de colonnes. Cela ne signifie pas nécessairement que le fichier n'utilise pas le format français : après tout, un fichier peut bien ne comporter qu'une seule colonne. Cependant, cette hypothèse est considérée comme peu probable, par conséquent le format FR est écarté à titre provisoire. L'analyse de la quatrième et dernière ligne est illustrée sur la figure 4. Cette analyse est simplifiée par rapport aux précédentes : premièrement, car seulement quatre découpages doivent être considérés (formats US et UK, stratégies frugale et gloutonne ) ; deuxièmement, seul le nombre d'occurrences du caractère , doit être déterminé. Dans les quatre cas, on trouve deux occurrences du caractère , en dehors de tout groupement signifiant. Le séparateur de colonnes a ainsi été identifié de manière non-ambiguë comme étant le caractère , , mais deux formats régionaux restent envisageables : US et UK. Pour choisir entre ces deux formats, on s'intéresse aux valeurs enregistrées dans les compteurs. On trouve : - Format US, stratégie gloutonne : 30 caractères signifiants For example, 1.25 is a grouping that can be considered as meaning in English or American format: it represents a decimal number. On the other hand, it is not significant (it only represents text) in French format. Similarly, 13/10/2008 is meaning (a date) in French or English format, but not in American format. The method according to the preferred embodiment of the invention comprises identifying, within each line, groups of potentially significant characters, and this for each of the predefined formats. Thus, in French format, the first line of the file can be broken down as follows: 1 (number) ù .25,13 / 10 / 2008,9. (text) 99 (number); or again: 1 (number) ù. (text) ù 25,13 (number) ù / 25 (text) ù 10 (number) ù / (text) ù 2008 (number) ù, (text) ù 9 (number) ù. (text) 99 (number); or again: 1.25, (text) ù 13/10/2008 (date) ù, 9.99 (text), and so on. It can be seen that several divisions are possible for each regional format: it is therefore necessary to define one or more cutting strategies. For example we can consider a first strategy, called gluttonous, which tends to favor large numbers and dates compared to small numbers, and a second so-called frugal strategy that tends to favor small numbers. Advantageously, these strategies will be used in parallel. Another idea underlying said preferred embodiment of the invention is that the characters forming part of the groups identified as signifiers can not be column separators (the groupings of characters enclosed in double quotation marks are treated as signifying groups ). It is therefore only among the other characters, which are not part of any signifying group, that we will look for the one that has the same number of occurrences in all the lines, and which will therefore be identified as the column separator. effective. If the number of lines of the file is sufficient, most often only one character will satisfy this condition, and that for a single format. We will have identified at the same time the structure of the table represented by the file and the data format of the table. The operating principle of the invention will now be illustrated with the aid of a more complete example. In this example, we try to analyze the following file, consisting of four lines: Name, Age, Date of birth Peter, 38.10 / 5/1970 Paul, 20,25 / 8/1988 Jacques, 70,7 / 7 / 1938 The interpretation of this file is obvious to a human being, but not to a computer. As in the previous example, we consider three regional formats (French - FR, English - UK and US - US), with both gluttonous and frugal strategies, and text type data, date or number. Note that the first line is a header, and is therefore different from the other lines. The method of the invention makes it possible to manage this additional difficulty. Figure 1 illustrates the analysis of the first line of the file. In principle, this line is divided into meaningful groupings in six different ways (two strategies for three regional formats). Concretely, it contains only text, and therefore no meaningful grouping can be identified, and this whatever the format considered. All characters (N, o, m,,, A, g, e, D, a, t,, d, n, i, s, c) are therefore considered as potential column separators, and their occurrences are counted . For each regional format / strategy pair, a counter indicates the number of potentially significant characters identified from the beginning of the file. These counters are initialized to zero, and their values remain null after analysis of the first line (which contains only text, and therefore no significant character). The analysis of the second line (Figure 2) is less commonplace. In US format, gluttonous strategy, the algorithm identifies the following signifying groups: 38 (number) and 10/5/1970 (date), that is 11 characters; the counter therefore takes the value 11. The potential column separators are therefore: P, i, e, r and,. However, P and r can be discarded immediately, because these characters 20 were absent in the first line: it can not therefore be effective column separators. The character e must also be discarded because it appears four times in the first line, but only twice in the second line. Only the characters, and i, whose number of occurrences is the same in the first and second rows are therefore retained as potential column separators. In US format, frugal strategy, the signifying groups are: 38, 10, 5 and 1970, that is 9 characters. The potential column separators are therefore: P, i, e, r,, and /. As in the previous case, only the characters, and i have the same number of occurrences in the first and in the second row, and are therefore retained as potential column separators. 2944895 Il In UK format, the results are the same as for the US format. It is therefore not necessary to describe the analysis in detail. In FR format, gluttonous strategy, the signifying groupings are 38.10 (number, with decimal point as a separator), 5 5, and 1970, that is 10 characters. The potential column separators are therefore: P, i, e, r,, and /. But this time, only one of the two commas is considered as a potential column separator: indeed, the comma that appears in 38,10 belongs to a signifying group, and can not be used to separate two columns. In these conditions, the only candidate for the column separator function is the character i, which appears once in each line outside any signifying group. The frugal strategy applied to the FR format gives the same results. The analysis of the third line is illustrated in FIG. 3. In US format, gluttonous strategy, the algorithm identifies the following signifying groups: 25, 8 and 1988, since 25/8/1988 is not a date valid in this format. These groupings comprise 9 characters in total, which brings the value of the counter to 20. 20 It is not necessary to count the occurrences of all the other characters of the line, but one can be interested only in, and i, whose number of occurrences was the same in the first and the second line. However, the i character does not appear in the third line of the file, which indicates that it can not be a column separator. On the other hand, the character, appears twice, as in the first and the second line: this character thus retains its potential column separator status. The frugal strategy analysis / US or UK format gives the same results. In UK format, gluttonous strategy, on the other hand, the division gives the following significant groups: 20 (number) and 25/8/1988 (date), that is 11 characters, which brings the value of the counter to 22. As in the In other cases, the character remains the only potential column separator. In FR format there is a difficulty: the comma was eliminated in the previous step, and only the character i was still considered a potential column separator. But the third column contains no i: this character must be discarded in turn. It follows that the adoption of the FR format does not identify a column separator. This does not necessarily mean that the file does not use the French format: after all, a file may have only one column. However, this assumption is considered unlikely, therefore the FR format is provisionally discarded. The analysis of the fourth and last line is illustrated in Figure 4. This analysis is simplified compared to the previous ones: firstly, because only four cuts have to be considered (US and UK formats, frugal and greedy strategies); secondly, only the number of occurrences of the character, must be determined. In all four cases, we find two occurrences of the character, outside any meaningful grouping. The column separator has thus been unambiguously identified as the character, but two regional formats remain possible: US and UK. To choose between these two formats, we are interested in the values recorded in the counters. We find: - US format, gluttonous strategy: 30 significant characters

signifiants - Format US, stratégie frugale : 26 caractères - Format UK, stratégie gloutonne : 32 caractères 30 signifiants et - Format UK, stratégie frugale : 26 caractères signifiants. Signifiers - US format, frugal strategy: 26 characters - UK format, gluttonous strategy: 32 characters 30 signifiers and - UK format, frugal strategy: 26 significant characters.

On choisit de préférence le couple format/stratégie qui donne le nombre plus élevé de caractères signifiants, soit le format UK couplé à la stratégie gloutonne . En conclusion, on a déterminé que toutes les lignes à 5 l'exception de la première (qui est constituée uniquement par du texte) présentent la structure suivante : texte ù nombre ù date en format anglais avec , comme séparateur de colonnes. La première ligne est considérée comme un en-tête en raison 10 de sa structure différente des autres lignes. Comme les colonnes ont toutes des structures différentes (la première est constituée de données de type texte ; la deuxième de données de type nombre ; et la troisième de données de type date) il n'y a pas de raison de considérer que la première colonne constitue un en-tête.One chooses preferably the format / strategy pair which gives the higher number of significant characters, that is to say the UK format coupled with the gluttonous strategy. In conclusion, it has been determined that all the lines except the first one (which consists only of text) have the following structure: text - number in date in English format, as column separator. The first line is considered a header because of its different structure from the other lines. Since the columns all have different structures (the first is text data, the second is number data, and the third is date data) there is no reason to consider that the first column constitutes a header.

15 Ainsi, le procédé de l'invention a permis de déterminer la structure du tableau en identifiant le séparateur de colonnes, le type de chaque champ de données, le format régional utilisé et le fait que la première ligne constitue un en-tête. Cela a été fait de manière complètement automatique, sans besoin d'aucune intervention de la part de l'utilisateur.Thus, the method of the invention made it possible to determine the structure of the table by identifying the column separator, the type of each data field, the regional format used and the fact that the first line constitutes a header. This was done completely automatically, without the need for any intervention on the part of the user.

20 Si aucun couple format/stratégie n'avait permis d'identifier un caractère séparateur, cela aurait amené à la conclusion que le fichier était constitué d'une seule colonne. L'identification du format régional et du type des données n'aurait pas été possible. En variante, il est possible de mettre en oeuvre une version 25 simplifiée du procédé, dans laquelle seulement certains des caractères non signifiants sont considérés comme des séparateurs de colonnes potentiels. En effet, il est peu probable qu'un caractère alphanumérique (A ù Z, a ù z, 0 ù 9) soit choisi comme séparateur de colonnes : on peut donc se limiter à compter les occurrences des autres caractères non signifiants (signes de 30 ponctuation, caractères spéciaux, etc.). Selon un mode de réalisation encore plus simplifié de l'invention, on peut se limiter à comparer ligne par ligne le nombre d'occurrences de tous les caractères, à l'exception de ceux encadrés par des doubles guillemets, sans identification préalable de groupements signifiants. Cette variante est beaucoup plus simple à mettre en oeuvre mais, comme montré par l'exemple ci-dessus, elle ne permet pas toujours de déterminer de manière univoque un séparateur de colonnes. Il peut donc être avantageux de procéder en deux phases : lors d'une première phase, on met en oeuvre uniquement le procédé simplifié ; si à la fin de cette phase une ambiguïté sur le caractère séparateur subsiste, on met en oeuvre le procédé plus complet, tel que décrit ci-dessus. lo 15 20 25 If no format / strategy pair had identified a separator character, this would have led to the conclusion that the file consisted of a single column. Identification of the regional format and type of data would not have been possible. Alternatively, it is possible to implement a simplified version of the method, in which only some of the non-significant characters are considered potential column separators. Indeed, it is unlikely that an alphanumeric character (A ù Z, a ù z, 0 ù 9) will be chosen as a column separator: we can therefore limit ourselves to counting the occurrences of other non-significant characters (signs of 30 punctuation, special characters, etc.). According to an even more simplified embodiment of the invention, it is sufficient to compare line by line the number of occurrences of all the characters, with the exception of those framed by double quotation marks, without prior identification of signifying groups. . This variant is much simpler to implement but, as shown by the example above, it does not always allow to uniquely determine a column separator. It may therefore be advantageous to proceed in two phases: in a first phase, only the simplified method is implemented; if at the end of this phase an ambiguity on the separator character remains, the more complete method, as described above, is implemented. lo 15 20 25

Claims

REVENDICATIONS1. A computer heuristic analysis method of a computer file representative of a data table in order to reconstitute said table, the file being structured in lines of characters comprising column separator characters, characterized in that it comprises the use said computer to search for a character, except for any characters enclosed in double quotation marks, whose number of occurrences is the same for all the lines; said character being chosen as a column separator so as to determine an array structure.

2. Method according to claim 1, comprising the following steps: a. identifying, within each line, groups of characters considered significant, at least some of the characters not belonging to such groups being considered as potential column separators; this identification being performed in parallel for several predefined data formats; b. for each data format, searching for a character, among said potential column separators, whose number of occurrences is the same for all the lines; said character being chosen as an effective column separator, so as to determine a format dependent array structure; and c. the application of a selection criterion to choose a data format and the corresponding column separator as the most likely.

3. The method of claim 2, wherein step c. 30 comprises the elimination of the predefined format or formats that do not make it possible to identify an effective column separator. 0 5 10 20 25

4. Method according to one of claims 2 or 3, wherein step a. comprises using a plurality of character group identification strategies considered significant for each predefined data format.

The method of claim 4, wherein step c. includes the selection of the format or the data format / character group identification strategy pair leading to the identification of the largest number of characters considered significant.

6. Method according to one of claims 2 to 5, wherein the steps a. and B. are implemented jointly, and include: identifying, within each line, said groupings of characters considered significant; a2. counting occurrences of potential column separators in said line; and a3. except for the first line analyzed, the deletion of potential column separators whose number of occurrences is different from that of the previous line.

7. Method according to one of claims 2 to 6, also comprising the identification of predefined types of said groups of characters considered significant.

The method of claim 7 including identifying the first row of the array as a header line if all other rows in the array have a common structure, the structure of said first row being different. 30

9. Method according to one of claims 7 or 8, comprising the identification of the first column of the table as a column of headers if all the other columns of the table have a common structure, the structure of said first column being different.

A computer program product comprising program code instructions for carrying out a method according to one of the preceding claims.