FR2911201A1

FR2911201A1 - Written text editing method for correcting spelling error, involves calculating difference between apparition frequency of one n-gram in text and in language using n-gram by n-gram technique

Info

Publication number: FR2911201A1
Application number: FR0752564A
Authority: FR
Inventors: Jerome Berger
Original assignee: Sagem Communications SAS
Current assignee: Sagemcom Broadband SAS
Priority date: 2007-01-08
Filing date: 2007-01-08
Publication date: 2008-07-11

Abstract

The method involves establishing a text distribution and a language distribution of the frequencies at which n-grams occur in a text and in a language, respectively, where the n-grams may be groups of characters. The distributions are compared with one another. The language (18) whose language distribution is most similar to the text distribution is determined to be the language of the written text (15). The text is processed according to this determination. The difference between the occurrence frequency of an n-gram in the text and in the language is calculated n-gram by n-gram.

Description

Method for editing a text expressed in a language

TECHNICAL FIELD OF THE INVENTION

The subject of the present invention is a method for editing a written document: a text. An essential aim of the method according to the invention is to recognize, automatically and systematically, the language in which the text is written, even if the text contains errors caused, for example, by typing mistakes or by misrecognition of the characters of a handwritten text.

The field of the invention is, in general, that of text processing. In this field, spelling correction and translation, both of which rely on dictionaries, are well known. Other known kinds of processing are handwritten character recognition and speech recognition, which require a database relating to the letters of different alphabets. For word-processing software and/or handwriting or speech recognition software to be effective, it must include a dictionary module that corrects errors according to a reference language. One of the means of the invention is to determine automatically, by machine, the dictionary that corresponds to a given text.

BACKGROUND ART OF THE INVENTION

A first dictionary search method consists in verifying that all the words of the text are present in a single dictionary. Besides requiring a large memory capacity (one dictionary for each language to be tested), this method does not solve the problem of spelling mistakes.

To find the best fit among all possible dictionaries, it is also known in the state of the art to use a semantic approach. With such an approach, the frequencies at which words occur in the text are counted. These word frequencies are compared with the frequencies of the same words in a set of texts in the language. If the comparison is favorable, the text is considered to be written in that language. This second method also has a constraint: a substantial dictionary of about 40,000 words is needed for each language. Moreover, in terms of computation, this method is too heavy for embedded environments.

Another method, described in document US/1006 002 5988, uses the occurrence frequencies of groups or sequences of n letters called n-grams, n being a non-zero positive integer. This method requires far less data. For example, for an alphabet of 26 letters, there are:
- 26^3 = 17,576 combinations for trigrams;
- 26^2 = 676 combinations for bigrams.
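
The two figures quoted for a 26-letter alphabet follow from counting all sequences of n letters. A minimal Python sketch of that arithmetic is given below; the function name `ngram_table_size` is illustrative and does not come from the patent.

```python
def ngram_table_size(alphabet_size: int, n: int) -> int:
    """Number of distinct n-grams that can be formed over an alphabet."""
    return alphabet_size ** n


# Sizes for the 26-letter alphabet mentioned in the text.
print(ngram_table_size(26, 3))  # trigrams: 17576
print(ngram_table_size(26, 2))  # bigrams: 676
```

The bigram table is 26 times smaller than the trigram table, which is one way to read the later remark that bigrams speed up processing.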

According to that document, a statistical profile is obtained that represents, for each n-gram, the probability that it belongs to a given language. This statistical profile is then processed by multiplying its probabilities together to compute an overall probability. The overall probability is computed for several candidate languages. The highest overall probability then determines the candidate language that is taken to be the language of the text.

With this state of the art, a problem remains. If the text under study contains an n-gram that does not exist in a candidate language, for example "zz" in the case of a bigram, that n-gram has a zero probability of belonging to the language. The overall probability is therefore also zero, since it is the product of the individual n-gram probabilities. Such an erroneous n-gram, which may be caused in particular by a typing mistake or by a character misrecognition, prevents the language from being recognized. In addition, not all possible n-grams are necessarily present in the text under study, and some n-grams occur several times; those that occur several times are counted several times. This method therefore lacks tolerance to errors due in particular to character misrecognition or typing mistakes.

Occurrence frequencies of n-grams are already used in the field of language processing, in particular for speech recognition. One such speech recognition method analyzes n-gram occurrence frequencies to choose one word among others.
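
The zero-probability problem of the product-based prior art can be illustrated with a short Python sketch. The probability values and the function name `overall_probability` are hypothetical, chosen only for illustration; treating an unseen bigram as probability 0.0 mirrors the situation described in the text.

```python
import math


def overall_probability(bigrams, language_probs):
    """Product of per-bigram probabilities; an unseen bigram counts as 0.0."""
    return math.prod(language_probs.get(b, 0.0) for b in bigrams)


# Hypothetical per-bigram probabilities for one candidate language.
probs = {"th": 0.03, "he": 0.03, "e_": 0.04}

clean = ["th", "he", "e_"]
with_typo = ["th", "he", "e_", "zz"]  # a single erroneous bigram

print(overall_probability(clean, probs) > 0)   # True
print(overall_probability(with_typo, probs))   # 0.0
```

A single erroneous bigram is enough to cancel the whole product, whatever the other bigrams contribute.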

This speech recognition method is not used to recognize the language being spoken. In particular, a speech recognition method requires prior knowledge of the language in which the speaker is speaking; searching for that language is out of the question.

GENERAL DESCRIPTION OF THE INVENTION

The invention proposes a solution to the problem just described. It aims to minimize the overall computational load and to make the analysis of a written text more tolerant of errors due in particular to character misrecognition and/or typing mistakes.

To this end, the invention proposes to recognize the language of a written text by establishing statistics on the distribution of occurrence frequencies of groups of letters, or n-grams, in the text, n being a non-zero positive integer, and comparing this distribution with reference statistics on the distributions of n-gram occurrence frequencies in several languages, in order to determine which one is closest.

More precisely, the distribution of the text is compared with the respective distributions of the predefined languages by calculating, n-gram by n-gram, a difference between the probability of presence of the n-gram in the language and the statistic of presence of the same n-gram in the text. In one example, the sum of the absolute values of these per-n-gram differences is used to assess similarity. Other mathematical methods exist for quantifying these differences, such as the sum of the squares of the differences or the maximum of their absolute values, among others.

Thus a typing mistake and/or a handwritten character misrecognition, producing at least one erroneous n-gram, is negligible compared with the number of correctly distributed n-grams. In addition, this method represents a smaller workload than the state of the art.

The invention therefore essentially relates to a method for editing a text expressed in a language, comprising steps in which:
- a sequence of binary information corresponding to characters of an alphabet is received in an apparatus;
- a text corresponding to the received characters is edited on a peripheral of the apparatus;
- the text is split into a plurality of groups of n characters called n-grams, n being a non-zero positive integer;
- a text distribution of the frequencies at which the n-grams occur in the text is established;
- a language distribution of the frequencies at which the n-grams occur in a language is established;
- this last operation is repeated for a number of known languages, each known language being characterized by a distribution of n-gram occurrence frequencies;
- said text and language distributions are compared with one another;
- the language whose language distribution is most similar to the text distribution is determined to be the language of the text, and the text is processed according to this determination;
characterized in that it comprises the following step:
- the difference between the occurrence frequency of an n-gram in the text and in the language is calculated, n-gram by n-gram.

In addition to the main features just mentioned, the method according to the invention may have one or more of the following additional features:
- the possible combinations formed from the letters of the alphabet of the language are extended with one or two space characters placed before or after an (n-1)-gram, or before and after an (n-2)-gram;
- the n-grams are groups of two characters called bigrams;
- the n-grams are groups of three characters called trigrams;
- the comparison of the distributions comprises a step of adding, over all n-grams, the absolute values of the differences between them;
- the comparison of the distributions comprises a step of adding, over all n-grams, the squares of the differences between them;
- the comparison of the distributions comprises a step of determining the maximum of the absolute values of the differences between them;
- the differences are retained only if they satisfy a certain criterion;
- the criterion that the differences must satisfy is to have a value below a threshold;
- the criterion that the differences must satisfy is to have a minimum value;
- the processing performed by the method is spelling correction;
- the processing performed by the method is automatic translation of the received text into a language configured on the apparatus.

The invention and its various applications will be better understood on reading the following description and examining the accompanying figures.

BRIEF DESCRIPTION OF THE FIGURES

The figures are given only by way of indication and in no way limit the invention. They show:
- in Figure 1, an example of implementation of the method according to the invention in a terminal;
- in Figure 2, an example of comparison of a bigram distribution of a text with the n-gram distributions of three distinct languages.
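
The comparison step and its three variants (sum of absolute values, sum of squares, maximum of absolute values) can be sketched as follows in Python. This is only an illustrative sketch: the function names, the toy distributions and the choice to treat an n-gram absent from one distribution as having frequency 0 are assumptions, not details taken from the patent.

```python
def deviations(text_dist, lang_dist):
    """Per-n-gram differences between two frequency distributions.

    Assumption: an n-gram missing from one distribution has frequency 0.
    """
    keys = set(text_dist) | set(lang_dist)
    return [text_dist.get(k, 0.0) - lang_dist.get(k, 0.0) for k in keys]


def sum_abs(text_dist, lang_dist):
    """Sum of the absolute values of the differences."""
    return sum(abs(d) for d in deviations(text_dist, lang_dist))


def sum_squares(text_dist, lang_dist):
    """Sum of the squares of the differences."""
    return sum(d * d for d in deviations(text_dist, lang_dist))


def max_abs(text_dist, lang_dist):
    """Maximum of the absolute values of the differences."""
    return max((abs(d) for d in deviations(text_dist, lang_dist)), default=0.0)


def closest_language(text_dist, languages, distance=sum_abs):
    """Name of the language whose distribution is nearest to the text's."""
    return min(languages, key=lambda name: distance(text_dist, languages[name]))


# Hypothetical toy distributions; "zz" stands for an erroneous bigram.
text = {"ab": 0.5, "ba": 0.3, "zz": 0.2}
languages = {
    "alpha": {"ab": 0.6, "ba": 0.4},
    "beta": {"cd": 0.7, "dc": 0.3},
}

print(closest_language(text, languages))               # alpha
print(closest_language(text, languages, sum_squares))  # alpha
print(closest_language(text, languages, max_abs))      # alpha
```

Unlike a product of probabilities, the erroneous bigram "zz" only adds a bounded term to each distance, so the nearest language is still found.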

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS OF THE INVENTION

Elements appearing in several figures keep the same reference unless otherwise stated.

Figure 1 shows an example of implementation of the method according to the invention in a terminal. In a preferred example, the terminal is a multi-function printer 1. Alternatively, the invention may be implemented in a mobile telephone. The invention is particularly useful when applied to a mobile device.

The first intended use is that of multi-function printers 1 in the context of scanning with character recognition. Such scanning is carried out when the multi-function printer 1 has an optical reader that produces a digital image of a text placed on a scanner glass, or moving past a reading bar, of the optical reader. The digital image is then converted into text by character recognition software. It may then be appropriate to search automatically for the language of the text.

The multi-function printer 1 comprises, connected by an address, data and command bus 2: a microprocessor 3, a screen 4, a keyboard 5, a program memory 6, a data memory 7 and an interface 8. The program memory 6 comprises, for this purpose, optionally a general program 9, optionally a scanning program 10, optionally a character recognition program 11, according to the invention a language recognition program 12, optionally a translation program 13 and optionally a spelling correction program 14. In one example, the program 9 is the operating system of the multi-function printer 1. The data memory 7 comprises: an initial text 15, information 16 on the language of the initial text, a dictionary 17, information on a language 18 of the user and a final text 19. The information 16 is the information sought.

According to the invention, a handwritten text 20 is received by the multi-function printer 1, for example in the form of a paper document bearing characters of an alphabet. In one example, the alphabet used is the Latin alphabet, which contains 26 letters. The handwritten text 20 is digitized by the scanning program 10 of the multi-function printer 1 in the form of a digital image. The digitized characters are then recognized by the character recognition program 11 and converted to obtain the initial text 15.

Alternatively, the initial text 15 is received through an interface 8 connected to the bus 2, or through a reader such as a memory card reader, also connected to the bus 2 and receiving a removable memory loaded with the text to be edited. There are many types of removable memory card, each generally associated with a proprietary exchange protocol. For example, the removable memory may be a chip-based medium containing a user manual for the multi-function printer 1.

When the initial text 15 is received by the multi-function printer 1, it may be received in the form of a paper document bearing handwritten characters of an alphabet. The language of the text 15 must be recognized by the language recognition program 12, by means of the method according to the invention. In all cases, the apparatus 1 receives a sequence of binary information corresponding to characters of an alphabet.

Once the handwritten characters and the language have been recognized, or even before the language is recognized, the received text 15 is edited on a peripheral of the apparatus, or printed on paper. Alternatively, the editing may be vocal, in which case the peripheral is a loudspeaker. However, within the meaning of the invention, it is possible not to edit the received text as is, or to edit it only once it has been processed (for example, with spelling mistakes removed). In the literal sense, the invention further considers that editing has taken place simply because the text is at least stored in the memory 7, whether or not it is displayed, printed or broadcast, immediately or later. The edited text of the invention corresponding to the received characters is thus either the initial text 15 or the processed text 19.

The program 12 for recognizing the language in which the text 15 is written comprises the following steps:
- the initial text 15 is fragmented into a plurality of groups of n characters called n-grams. In a preferred example, n = 2, i.e. bigrams are used. Alternatively, n = 3, i.e. trigrams are used. Bigrams offer the advantage of faster processing;
- a text distribution is established, giving the frequencies at which the bigrams appear in the text;
- a language distribution is established, giving the frequencies at which the bigrams appear in a language α;
- this last operation is repeated for a number of known languages β and γ, each known language β and γ being characterized by a distribution of bigram frequencies;
- said text and language distributions are compared with one another by calculating, n-gram by n-gram (bigram by bigram in the example), the deviation between the frequency of appearance of an n-gram in the text and in the languages α, β and γ, and by adding up all the absolute values of said deviations. Other mathematical methods exist to quantify these deviations, namely the sum of the squares of said deviations or the maximum of the absolute values of said deviations;
- the language whose language distribution shows the greatest similarity to the text distribution is determined to be the language of the text. This similarity is determined by retaining only the deviations that meet a certain criterion: having a minimum value. Alternatively, the criterion that the deviations must meet may be having a value below a threshold.

The initial text 15 is then processed according to this determination. In one example, it is translated by the translation program 13, by means of an appropriate dictionary 17, in order to obtain a final text 19 written in the language 18 chosen by a user of the multi-function printer 1; said final text is displayed on the screen 4.

Many other kinds of processing of written documents can benefit from identification of the language in which they are written. For example:
- the document can be filed in the data memory 7 differently depending on its language;
- a spelling correction program 14 can select a suitable dictionary 17 rather than use a general dictionary containing all possible languages. Such programs should therefore improve in speed, memory consumption and quality of results.

Figure 2 shows an example of the comparison of an n-gram distribution of a text with the n-gram distributions of three distinct languages α, β and γ.

In this embodiment of the method according to the invention, the received text is assumed to be composed, arbitrarily, in a language comprising only the letters a, b, c and d. The alphabet of the language is, however, extended by a special character _ corresponding to the space present at the beginning and/or at the end of a word. The method of the invention fragments the text into a plurality of groups of n characters called n-grams.

With bigrams, the letters present in the language give a set of 24 possible bigram combinations. In a given text, however, not all combinations are necessarily present. Here, for example, the bigrams "aa", "a"' and "cc" do not occur in the text. A text distribution is then established, giving the frequencies at which the bigrams appear in the text. The text distribution is the occurrence statistics of the bigrams considered in the text under study: the number of appearances of each bigram in the text is counted, and this number is divided by the total number of n-grams (here bigrams) in the text.
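The fragmentation and counting steps can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the function names `ngrams` and `text_distribution`, the choice of splitting on whitespace, and the use of a single `_` on each side of a word are assumptions made for the example.

```python
from collections import Counter

def ngrams(text, n=2, pad="_"):
    """Split text into character n-grams, marking word boundaries with `pad`.

    Assumption: words are separated by whitespace and each word receives
    one boundary marker on each side.
    """
    grams = []
    for word in text.split():
        padded = pad + word + pad
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

def text_distribution(text, n=2):
    """Relative frequency of each n-gram: its count divided by the total count."""
    counts = Counter(ngrams(text, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()} if total else {}

# Hypothetical sample text over the toy alphabet a, b, c, d.
dist = text_distribution("abad cab")
```

Here the sample text yields nine bigrams in total, of which "ab" occurs twice, so its relative frequency is 2/9.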

A language distribution, giving the frequencies at which the bigrams appear in a language α, has been established beforehand from a set of known texts in this language α, or even from a dictionary of this language α. This operation is repeated for a number of known languages β and γ. Each known language β and γ is characterized by a distribution of bigram frequencies. These language distributions represent the probability of occurrence of the n-grams (here bigrams) in the language. Whereas the text distribution is calculated by the apparatus 20, the language distributions are stored in advance in the memory 7.
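Building the stored language distributions ahead of time might look like the following sketch. The corpora, the language names and the helper names (`bigrams`, `language_distribution`, `LANGUAGE_DISTRIBUTIONS`) are all hypothetical; the source only states that the distributions are derived from known texts or a dictionary and kept in memory.

```python
from collections import Counter

def bigrams(text, pad="_"):
    """Character bigrams of each whitespace-separated word, padded with `pad`."""
    out = []
    for word in text.split():
        padded = pad + word + pad
        out.extend(padded[i:i + 2] for i in range(len(padded) - 1))
    return out

def language_distribution(corpus):
    """Pool bigram counts over all known texts of one language, then normalise."""
    counts = Counter()
    for document in corpus:
        counts.update(bigrams(document))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

# Hypothetical toy corpora standing in for the known texts of each language.
CORPORA = {
    "alpha": ["abad cab", "dab bad"],
    "beta": ["cc dd cd", "dc cdc"],
}

# Computed once, in advance, and then only looked up at recognition time.
LANGUAGE_DISTRIBUTIONS = {
    name: language_distribution(docs) for name, docs in CORPORA.items()
}
```

Precomputing the table means that only the text distribution has to be calculated when a document arrives.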

To simplify the explanation, the text distribution and the language distributions α, β and γ have been represented graphically. This representation shows the deviations calculated between the frequency of a bigram in the language and in the text. These deviations are marked as positive or negative. Said text and language distributions α, β and γ are then compared with one another. In this embodiment of the method according to the invention, the comparison is made by calculating the sum of all the absolute values of the deviations separating them. With the values shown, this sum is 243 for language distribution α, 451 for language distribution β and 763 for language distribution γ. Language α, which shows the smallest deviation, is therefore chosen. Other mathematical methods exist to quantify these deviations, for example the sum of the squares of said deviations. In this case, the results are respectively 4080 for language distribution α, 11033 for language distribution β and 29345 for language distribution γ.
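The two comparison measures described above can be sketched as follows. The frequencies below are invented percentages chosen for easy arithmetic; they are not the values of Figure 2, and the names `sum_abs`, `sum_squares` and `closest_language` are assumptions made for the example.

```python
def sum_abs(text_dist, lang_dist):
    """Sum of absolute deviations over every bigram seen in either distribution."""
    keys = set(text_dist) | set(lang_dist)
    return sum(abs(text_dist.get(k, 0) - lang_dist.get(k, 0)) for k in keys)

def sum_squares(text_dist, lang_dist):
    """Sum of squared deviations over every bigram seen in either distribution."""
    keys = set(text_dist) | set(lang_dist)
    return sum((text_dist.get(k, 0) - lang_dist.get(k, 0)) ** 2 for k in keys)

def closest_language(text_dist, lang_dists, score=sum_abs):
    """Name of the language whose distribution deviates least from the text."""
    return min(lang_dists, key=lambda name: score(text_dist, lang_dists[name]))

# Hypothetical frequencies, expressed in percent.
TEXT = {"ab": 40, "ba": 35, "ac": 25}
LANGS = {
    "alpha": {"ab": 45, "ba": 40, "ad": 15},
    "beta": {"ab": 10, "cd": 60, "dc": 30},
}

best_abs = closest_language(TEXT, LANGS, sum_abs)
best_sq = closest_language(TEXT, LANGS, sum_squares)
```

With these toy values the sum of absolute deviations is 50 for "alpha" and 180 for "beta", and the sums of squares are 900 and 7250, so both measures select "alpha".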

Another method is that of the maximum of the absolute values of said deviations. The results are then 27 for language distribution α, 37 for language distribution β and 73 for language distribution γ. Each of the three methods leads here to the same conclusion: it is language distribution α that shows the smallest deviation (243, 4080 or 27) from the text distribution. The deviation between language distribution α and the text distribution, for each bigram, is symbolized by the hatched area in Figure 2. Language α shows maximum similarity with the text; it is therefore considered to be the language of the text.
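The maximum-deviation variant can be sketched in the same style. Again, the frequencies are invented percentages rather than the values of Figure 2, and `max_abs` is a name chosen for this example.

```python
def max_abs(text_dist, lang_dist):
    """Largest absolute deviation over every bigram seen in either distribution."""
    keys = set(text_dist) | set(lang_dist)
    return max(
        (abs(text_dist.get(k, 0) - lang_dist.get(k, 0)) for k in keys),
        default=0,
    )

# Hypothetical frequencies, expressed in percent.
TEXT = {"ab": 40, "ba": 35, "ac": 25}
LANGS = {
    "alpha": {"ab": 45, "ba": 40, "ad": 15},
    "beta": {"ab": 10, "cd": 60, "dc": 30},
}

scores = {name: max_abs(TEXT, dist) for name, dist in LANGS.items()}
best = min(scores, key=scores.get)
```

Because this measure depends on a single bigram, one badly recognized bigram can dominate the score, which is one reason an implementation might prefer the summed measures.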

The particular features of the method related to the calculation of the deviations should be highlighted. For example, typing errors or misrecognition of handwritten characters may have produced the erroneous bigram "ac" here; this bigram does not exist in language α. On the one hand, the deviation of value 20 determined for the bigram "ac" between the text distribution and language distribution α is negligible in the addition described above, compared with the twenty-three other bigrams correctly distributed in the text. On the other hand, it is possible not to take into account n-grams that appear in the text but do not appear in the language. In that case, the deviations are retained in the comparison only if they meet a criterion. The criterion here is the presence of the n-gram in the language. Another criterion is to ignore deviations below a threshold. For example, if deviations below 10 are ignored, the first method gives a result of 195 (instead of 243) for language α, while the result is almost unchanged for languages β and γ. Using a criterion thus speeds up the calculation and makes it more selective.

Once the language of the text has been determined in this way, the text can be processed: it can be edited as is, corrected, or translated. For example, if a text is received in a language α and a user of the apparatus, here the mobile terminal, has chosen a language β in zone 18 of the data memory 7, provision can be made to translate the text from language α into language β.
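The two filtering criteria described above (presence of the n-gram in the language, and a threshold on the deviation) can be sketched as optional parameters of the summed measure. The function name `filtered_sum_abs`, its parameters and the percentage values are assumptions made for the example; in particular, "absent from the language" is modelled here simply as a missing dictionary key.

```python
def filtered_sum_abs(text_dist, lang_dist, threshold=0, require_in_language=False):
    """Sum of absolute deviations, keeping only deviations that meet the criteria.

    threshold: deviations strictly below this value are ignored.
    require_in_language: if True, bigrams absent from lang_dist are skipped.
    """
    total = 0
    for gram in set(text_dist) | set(lang_dist):
        if require_in_language and gram not in lang_dist:
            continue  # e.g. a bigram produced by a typing or recognition error
        deviation = abs(text_dist.get(gram, 0) - lang_dist.get(gram, 0))
        if deviation < threshold:
            continue  # small deviations are neglected
        total += deviation
    return total

# Hypothetical frequencies, expressed in percent; "ac" plays the erroneous bigram.
TEXT = {"ab": 40, "ba": 35, "ac": 25}
ALPHA = {"ab": 45, "ba": 40, "ad": 15}

unfiltered = filtered_sum_abs(TEXT, ALPHA)
above_threshold = filtered_sum_abs(TEXT, ALPHA, threshold=10)
known_only = filtered_sum_abs(TEXT, ALPHA, require_in_language=True)
```

With these toy values the unfiltered sum is 50, ignoring deviations below 10 gives 40, and skipping the bigram that is absent from the language gives 25.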

Claims

1 - Editing method (12) of a text (15) expressed in a language, comprising steps in which: - an apparatus (1) receives a sequence of binary information corresponding to characters of an alphabet; - a text corresponding to the received characters is edited on a device (4) of the apparatus; - the text is fragmented into a plurality of groups of n characters called n-grams, n being a non-zero positive integer; - a text distribution is established, giving the frequencies at which the n-grams appear in the text; - a language distribution is established, giving the frequencies at which the n-grams appear in a language (α; β; γ); - this last operation is repeated for a number of known languages, each known language being characterized by a frequency distribution of appearance of n-grams; - said text and language distributions are compared with one another; - the language whose language distribution shows the greatest similarity to the text distribution is determined to be the language of the text, and the text is processed according to this determination, characterized in that it comprises the following step: - calculating, n-gram by n-gram, the difference between the frequency of appearance of an n-gram in the text and in the language.

2 - Process according to claim 1, characterized in that the possible combinations formed from all the letters of the alphabet of the language are extended with one or two space characters (_) placed before or after an (n-1)-gram, or before and after an (n-2)-gram.

3 - Process according to at least one of claims 1 to 2, characterized in that the n-grams are groups of two characters called bigrams.

4 - Process according to at least one of claims 1 to 2, characterized in that the n-grams are groups of three characters called trigrams.

5 - Process according to at least one of claims 1 to 4, characterized in that the comparison of the distributions comprises a step of adding, for all the n-grams, all the absolute values of the differences separating them.

6 - Process according to one of claims 1 to 4, characterized in that the comparison of the distributions comprises a step of adding, for all the n-grams, the squares of all the differences separating them.

7 - Method according to one of claims 1 to 4, characterized in that the comparison of the distributions comprises a step of determining the maximum of all the absolute values of the differences separating them.

8 - Method according to one of claims 1 to 7, characterized in that the differences are retained only if they meet a certain criterion.

9 - Process according to one of claims 1 to 8, characterized in that the criterion that the differences must meet is having a value below a threshold.

9 - Process according to at least one of claims 1 to 8, characterized in that the criterion that the differences must meet is having a minimum value.

10 - Process according to at least one of claims 1 to 9, characterized in that the processing carried out by the method is a spelling correction (14).

11 - Process according to at least one of claims 1 to 9, characterized in that the processing carried out by the method is an automatic translation (13) of the received text into a language (18) set on the apparatus.