WO2014189400A1 - A method for diacritisation of texts written in latin- or cyrillic-derived alphabets - Google Patents

A method for diacritisation of texts written in latin- or cyrillic-derived alphabets Download PDF

Info

Publication number
WO2014189400A1
WO2014189400A1 (PCT/RS2013/000010)
Authority
WO
WIPO (PCT)
Prior art keywords
text
word
information
diacritised
morphological
Prior art date
Application number
PCT/RS2013/000010
Other languages
French (fr)
Inventor
Milan SEČUJSKI
Stevan OSTROGONAC
Darko Pekar
Dragan KNEŽEVIĆ
Branislav POPOVIĆ
Milana BOJANIĆ
Original Assignee
Axon Doo
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Axon Doo
Priority to PCT/RS2013/000010
Publication of WO2014189400A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/12 Use of codes for handling textual entities
    • G06F 40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/274 Converting codes to words; Guess-ahead of partial word inputs


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The presented invention is related to the method for the recovery of diacritical marks in texts written in any of the languages using Latin- or Cyrillic-derived alphabets with diacritical marks. The embodiment of the invention presented in this document uses multiple information sources (topical information and information on semantic proximity) in the task of diacritisation, which is recognised and treated as a classification task. The invention relies on classification based on topical information provided by text categorisation, the information on semantic proximity of particular words in the text, as well as morphological information. At word level the classification task is limited to the calculation of the semantic score of each particular word interpretation. The actual recovery of the diacritical marks is carried out only at the sentence level (or possibly some higher level such as paragraph or the entire text), with the assumption that users, when adapting a text to a non-diacritised setting, consistently use one of the existing conventions, rather than switching between different conventions.

Description

A METHOD FOR DIACRITISATION OF TEXTS WRITTEN IN
LATIN- OR CYRILLIC-DERIVED ALPHABETS
Technical Field
The invention belongs to the field of natural language processing (NLP). It can be applied in a variety of systems related to language technologies, including but not limited to natural language understanding, machine translation, question answering, information retrieval, information extraction, as well as text-to-speech synthesis.
Background Art
Diacritical marks (diacritics) are symbols added to particular letters of an alphabet to indicate a value and/or pronunciation different from the one that the letters are otherwise given. In the orthography of some languages a letter modified by a diacritic is treated as a new, distinct letter, while in others it is treated as a letter-diacritic combination, sometimes with a different phonological value, sometimes with the sole function of distinguishing between homonyms pronounced in the same way. Diacritical marks mostly appear above letters, although other positions can be found in some Latin- or Cyrillic-derived alphabets, such as below the letter, within the letter or between two consecutive letters. The use of diacritical marks in traditional writing is obligatory in most cases, while in the case of documents written using computers or other similar devices the situation is less clear. Namely, the computer technology of today was developed mostly in English-speaking countries, where an alphabet without diacritics is used, and for that reason keyboard layouts and data formats were initially developed with a bias favouring English. A number of extensions of the basic ASCII code chart were proposed in order to accommodate letters with diacritical marks common to particular Latin-derived alphabets. Today, Unicode and the ISO/IEC 10646 Universal Character Set (UCS) cover a much wider array of characters, including both Latin and Cyrillic ones, and they are rapidly replacing ASCII and similar standards in many environments. Letters with diacritics can be composed in most of the existing keyboard layouts, and existing standards such as Unicode assign a unique code to every known character. Although considerable effort has evidently been made to enable computer users to use the alphabets of their own languages with equal ease, in practice this often results in user dissatisfaction due to problems related to the conversion of documents between various software applications or versions thereof. A particularly severe example of this problem occurs in e-mail or SMS correspondence, where e.g. the use of (some) characters with diacritics does not always produce the desired result on the side of the recipient. Furthermore, sometimes even the pricing policy of the provider of mobile telephone services varies depending on the code page used for composing messages. For all these reasons, computer users and particularly mobile phone users have begun using their own alternative alphabets, established by de facto standard, with particular diacritised letters replaced by their non-diacritised versions or sequences thereof (the second option is most notably used in cases where the pronunciation differs greatly between the diacritised and the non-diacritised version, so that simply omitting the diacritical mark would constitute an error). A growing number of users in this category, particularly young ones, has given rise to speculation that, in the future, diacritical marks in a number of languages may become obsolete.
Such an introduction of alternative diacritic-less alphabets leads to a considerable amount of ambiguity, since words which differ only in the pattern of diacritical marks can appear identical in a setting without these marks. To an educated native speaker the recovery of diacritical marks is most often an easy task, with rare exceptions. However, from the point of view of natural language processing, the task is not trivial, and there is clearly a need for automatic diacritisation of text as a necessary step in the recovery of both the meaning of text and its pronunciation. There is no standard scientific framework established for resolving this particular problem, although solutions related to particular languages have been proposed in the past, both rule-based and based on machine learning techniques.
Some of them, which are protected, are mentioned below.
The patent application EP1471440 A2, filed March 23rd 2004, entitled "System and method for word analysis", relates to a computer-implemented method and system for morphological analysis of input text using a rule engine, which includes various transitions based on a lexicon, an orthography rule module and a morpheme combination module.
The patent application EP1402480 A1, filed July 4th 2001, entitled "Category based, extensible and interactive system for document retrieval", relates to a system designed to search for documents and analyze them in order to determine their word-pair patterns after receiving a search query from a requestor, based on linguistic and mathematical approaches for automatic text categorization.
The patent application EP2447854 A1, filed January 14th 2011, entitled "Method and system of automatic diacritization of Arabic", relates to a method and system for diacritizing a text. The method includes determination of the string of diacritized characters in the Arabic language given the string of non-diacritized ones.
The patent EP0138079 B1, filed September 17th 1984, entitled "Character recognition apparatus and method for recognizing characters associated with diacritical marks", relates to a method and apparatus for the recognition of characters or symbols which have diacritical marks.
Disclosure of the Invention
The presented invention is related to the method for the recovery of diacritical marks in texts written in any of the languages using Latin- or Cyrillic-derived alphabets with diacritical marks. The input of the method is a diacritic-less text in a particular language using a Latin- or Cyrillic-derived alphabet. The output of the method is the diacritised version of the input text.
The embodiment of the invention presented in this document uses multiple information sources in the task of diacritisation, which is recognised and treated as a classification task. Owing to the aggregation of the available information, the invention, which can be better appreciated with reference to the following figures and specifications, restores diacritical marks on the input text with great accuracy. In doing so, the invention uses topical information provided by text categorisation, the information on semantic proximity of particular words in the text, as well as morphological information.
The invention will be presented for the case of the Serbian language and its Latin script as a typical example of this problem, although a generalisation to any of the languages using Latin- or Cyrillic-derived alphabets, as well as any modification or variation that such a generalisation would include, should be readily apparent to any person skilled in the art.
Namely, the Latin script of Serbian recognises five letters with diacritical marks: Č, Ć, Š, Ž and Đ, as well as the digraph DŽ. To adapt a text to a non-diacritised setting, two conventions are predominantly used, as shown in Table 1.
Table 1 (conventions for adapting the diacritised letters to a non-diacritised setting; reproduced as an image in the original publication).
Some replacements can lead to ambiguity, which is nevertheless easily resolved by an educated human reader. If the word casa appears in the diacritic-less input text, its corresponding output words are identified as časa, čaša and ćasa by a combination of lexical look-up (relying on a morphological dictionary of Serbian) and morphological analysis. The classification task amounts to the selection of the most likely option based on an aggregation of multiple information sources (topical information and semantic proximity information). Within the invention, the final decision as to the most likely option is postponed and carried out only at the sentence level (or possibly some higher level such as the paragraph or the entire text), while at word level the decision is limited to soft classification, i.e. the calculation of the semantic score of each particular option, based on topical information as well as information related to the semantic proximity (two information sources) between (1) each option corresponding to the target word and (2) its neighbouring word(s) in the sentence. In reaching the final decision at the sentence (or higher) level, the invention operates under the assumption that, when adapting a text to a diacritic-less setting, users consistently use one of the existing conventions. Exceptions to this rule exist, but they are primarily related to cases where the use of the adopted convention would lead to an ambiguity that even an educated reader would consider confusing.
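The candidate-generation step illustrated by this example can be sketched in code. The following Python fragment is not taken from the patent: the inverse replacement map, the digraph handling and the toy dictionary are illustrative assumptions standing in for the conventions of Table 1 and the morphological dictionary 105.

```python
# Illustrative sketch (not from the patent text): generating candidate
# diacritised interpretations of a non-diacritised Serbian word by applying
# the inverse of the replacement conventions and filtering the results
# against a (hypothetical) morphological dictionary.
from itertools import product

# Inverse replacement options per non-diacritised character/digraph;
# the exact convention tables correspond to Table 1 and are assumed here.
INVERSE_MAP = {
    "c": ["c", "ć", "č"],
    "s": ["s", "š"],
    "z": ["z", "ž"],
    "dj": ["dj", "đ"],
    "dz": ["dz", "dž"],
}

def candidate_interpretations(word, dictionary):
    """Return all diacritised spellings of `word` found in `dictionary`."""
    # Tokenise into digraphs first, then single letters (a simplification).
    units, i = [], 0
    while i < len(word):
        if word[i:i + 2] in ("dj", "dz"):
            units.append(word[i:i + 2]); i += 2
        else:
            units.append(word[i]); i += 1
    options = [INVERSE_MAP.get(u, [u]) for u in units]
    candidates = {"".join(combo) for combo in product(*options)}
    return sorted(c for c in candidates if c in dictionary)

# Toy dictionary standing in for the morphological dictionary 105.
toy_dictionary = {"časa", "čaša", "ćasa"}
print(candidate_interpretations("casa", toy_dictionary))
```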
Brief Description of the Drawings
Figure 1 describes the basic structure of the invention, based on successive application of soft classification between various interpretations of the ambiguous non-diacritised input word at word level, as well as hard classification at sentence level (or possibly some higher level such as paragraph or the entire text).
Figure 2 represents the module for training (including the construction of separate soft classifiers for sufficiently frequent words in the non-diacritised versions of a large text corpus, as well as the construction of a language model required for the disambiguation process), which is the preparatory phase for the process of diacritisation.
Best Mode for Carrying Out the Invention
The invention can be understood as the method carrying out the classification task amounting to the selection of the most likely one of all possible strings of words corresponding to a string of words as presented in a non-diacritised setting in the input text. The words possibly corresponding to one non-diacritised word in the input text can be diacritised or non-diacritised themselves, i.e. in some cases the input and the output word can be identical.
Figure 1 describes the basic structure of the method which is the subject of the invention. The text is firstly searched for diacritical marks to establish whether diacritisation is required or not 100. If a diacritical mark is found, it is assumed that the text is fully diacritised; otherwise diacritisation is required. In the latter case, the text is firstly categorised as to its topic and/or functional style 101, and for each word in the text it is established whether there exists a soft classifier 102 related to that particular word and text category, the term 'soft classifier' denoting a structure containing:
• the list of possible interpretations of the ambiguous non-diacritised input word,
• for each one of the possible interpretations:
o information related to the morphology (part-of-speech labels and values of particular morphological categories where applicable);
o a decision tree based structure 103 used to calculate the semantic score of the particular interpretation in a given context.
If the soft classifier 102 for the target word does not exist, the word is accepted as is (in its non-diacritised form); however, it can still have multiple possible interpretations, since it can belong to different parts of speech (adverb, adjective, noun, pronoun, preposition...) or have different values of relevant morphological categories (gender, number, case...). In other words, it can belong to different morphological classes, the term 'morphological class' denoting the combination of the part-of-speech category and the values of relevant morphological categories. The method 104 for establishing possible interpretations (related to different morphological classes) can use either one or a combination of the following two approaches: lexical lookup based on a morphological dictionary 105 and morphological analysis 106. If the soft classifier 102 for the target word exists, it is used to provide the list of possible interpretations of the ambiguous non-diacritised input word, as well as the semantic score and morphological information related to each interpretation, which have been established during the construction of the classifier in the training phase. Upon arriving at the end of the sentence, a list of possible interpretations has thus been determined for each word, the differences between them being related to morphological information, diacritical patterns or both. Next, a process of disambiguation is carried out at sentence level, where a hard decision related to each ambiguous word is reached. This process will hereafter be referred to as decoding 107, and is based on the language model 108 built during the training phase.
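As an illustration of the word-level pass described above, the sketch below (with assumed data structures and function names, not the patent's reference implementation) shows the diacritic check 100, the stored soft classifier structure 102 and the fallback to lexical lookup or morphological analysis 104 for words without a classifier.

```python
# A minimal sketch, assuming simplified data structures for the word-level pass.
import unicodedata
from dataclasses import dataclass, field

def has_diacritics(text):
    """Step 100: if any letter carries a combining mark, assume the text is diacritised."""
    return any(unicodedata.combining(ch) for ch in unicodedata.normalize("NFD", text))

@dataclass
class Interpretation:
    surface: str              # possibly diacritised spelling
    morph_class: str          # e.g. "N:fem:sg:nom"
    semantic_score: float = 0.0

@dataclass
class SoftClassifier:         # stands in for structure 102
    interpretations: list = field(default_factory=list)
    # the decision tree based structures 103 would also be stored here

def word_level_pass(words, category, classifiers, fallback_lookup):
    """Build the interpretation lattice that is later passed to decoding 107."""
    lattice = []
    for w in words:
        clf = classifiers.get((w, category))
        if clf is not None:
            lattice.append(clf.interpretations)   # soft classification (102)
        else:
            lattice.append(fallback_lookup(w))    # lexical lookup / morphological analysis (104)
    return lattice
```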
Figure 2 represents the module for training the method which is the subject of the invention. This module uses a large textual corpus 201 which is fully diacritised and categorised with respect to topical information and/or functional style (newspaper style, scientific style...). It also uses the non-diacritised versions 202 of the textual corpus 201 (there can be more than one corpus 202 because there can be more than one convention for the adaptation of text to a diacritic-less setting). The module for training firstly establishes a list 203 of words in the non-diacritised corpora 202 that appear more often than a predefined threshold and are such that they could have been produced by the adaptation of some other words to a non-diacritised setting. For each of the words from this list a soft classifier 102 is built 204, using multiple information sources (topical information and information on semantic proximity) based on the corpora 201 and 202. The soft classifier 102 stores the list of possible interpretations of the ambiguous non-diacritised input word, the morphological classes (i.e. information related to the morphology) of each of them, as well as decision tree based structures 103 intended for the calculation of the corresponding category-dependent semantic score of each of the possible interpretations in a given context. To determine the semantic score of each particular (possibly diacritised) interpretation of an ambiguous non-diacritised word which appears in the input text, the soft classifier 102 uses the following information as the set of features of the decision tree based structure 103 related to that interpretation:
• topical information, provided by text categorisation, which establishes the topic of the text, its category and/or its functional style, having in mind that texts belonging to various topics, categories or functional styles can have significantly different statistical properties;
• information on semantic proximity (semantical relatedness), calculated between the target word and its neighbour(s) using any of the standard techniques, including latent semantic analysis (LSA);
The output of each tree based structure 103 is the semantic score s of the corresponding interpretation. In this way, during the exploitation phase, each of the interpretations will be assigned a semantic 'likelihood' in a given context, with a definite decision as to the correct one being postponed to the decoding process 107. The semantic score s of a particular interpretation will be used as one of the parameters that affect the values of the search metrics in the decoding process 107.
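A possible realisation of the feature computation for the decision tree based structures 103 is sketched below. The choice of scikit-learn, the TF-IDF/LSA pipeline, the toy corpus and the two-element feature vector are assumptions made for illustration; the patent only specifies that topical information and a semantic-proximity measure such as LSA serve as features, and that the tree outputs the semantic score s.

```python
# Illustrative feature construction for the decision tree based structures 103
# (library choice and feature layout are assumptions, not taken from the patent).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.tree import DecisionTreeRegressor

# LSA word space built from (a toy stand-in for) the diacritised corpus 201.
docs = ["pije vodu iz čaše", "u ćasi je supa", "posle jednog časa"]
vectoriser = TfidfVectorizer()
X = vectoriser.fit_transform(docs)
lsa = TruncatedSVD(n_components=2).fit(X)

def lsa_vector(text):
    return lsa.transform(vectoriser.transform([text]))[0]

def features(interpretation, context_words, category_id):
    """Topical feature (text category 101) + semantic-proximity feature (LSA cosine)."""
    sim = cosine_similarity([lsa_vector(interpretation)],
                            [lsa_vector(" ".join(context_words))])[0, 0]
    return [category_id, sim]

# A regression tree mapping the features to the semantic score s.
tree_103 = DecisionTreeRegressor(max_depth=3)
# tree_103.fit(training_features, training_scores)   # fitted during training step 204
```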
Besides the construction of soft classifiers, the training process includes the construction 205 of a language model 108, which will be used in the decoding 107. This model consists of the estimates of the probabilities of unigrams and higher-order n-grams of morphological classes. The language model 108 is built using the textual corpus 201 and the method 104 for establishing possible interpretations related to different morphological classes of the words in the corpus 201. The estimated probabilities are smoothed using any of the standard techniques, such as Good-Turing estimation.
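A minimal sketch of such a language model over morphological classes is given below. The bigram order and the add-one smoothing are simplifying assumptions made for brevity; the patent itself mentions Good-Turing estimation as one suitable smoothing technique.

```python
# Sketch of the language model 108: n-gram counts over morphological class
# sequences with a simple smoothing stand-in (add-one instead of Good-Turing).
from collections import Counter
import math

class MorphClassBigramLM:
    def __init__(self, class_sequences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for seq in class_sequences:
            self.unigrams.update(seq)
            self.bigrams.update(zip(seq, seq[1:]))
        self.vocab = len(self.unigrams) or 1

    def log_prob(self, prev_class, cur_class):
        # Add-one smoothed P(cur | prev); a Good-Turing estimate would replace this.
        num = self.bigrams[(prev_class, cur_class)] + 1
        den = self.unigrams[prev_class] + self.vocab
        return math.log(num / den)

# Toy training data: sequences of morphological class labels from corpus 201.
lm_108 = MorphClassBigramLM([["Adj:f:sg:nom", "N:f:sg:nom", "V:3sg"]])
print(lm_108.log_prob("Adj:f:sg:nom", "N:f:sg:nom"))
```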
The decoding process 107 is based on the language model 108 in that it consists of a search through a lattice of possible word interpretations, disambiguating between them using the probability estimates contained in the language model 108 and producing the output text with recovered diacritical marks as the result. The search can be organised along the lines of the Viterbi algorithm or some other appropriate technique, with search metrics based on the values of the following parameters:
• Estimated probabilities of a particular morphological class after some other morphological class or a sequence thereof (e.g. probability that a noun [feminine, nominative, singular] will appear after an adjective [feminine, nominative, singular]);
• Estimated probabilities of particular words given their morphological class (words that appear more frequently within a morphological class will have a higher probability than words that appear less frequently);
• Semantic score s, as provided by the soft classifiers 102, for particular words and the particular text category and/or functional style as established by 101;
• Occurrences of switches between different conventions for adapting a text to a diacritic-less setting. As the system operates under the assumption that, when adapting a text to a diacritic-less setting, users consistently use one of the existing conventions, with exceptions primarily related to cases where the consistent use of the adopted convention would lead to critical ambiguity, switches between different conventions should be considered as low-probability events. Consequently, each transition between the nodes of the lattice should also be attributed a score denoting the existence (or absence) of a switch between the conventions used, i.e. the fact that the convention used for the adaptation of the current word to a diacritic-less setting was different from the one used on the last occasion when one of the previous words in the text was thus adapted. The optimal pair of score values used to indicate the presence or absence of a switch can be set heuristically or estimated using a segment of the textual corpora withheld from training. Namely, soft classifiers 102 can be built using only a portion of the available textual corpus 201 and the corresponding portions of the corpora 202, and the accuracy of diacritisation obtained with particular pairs of values can be measured on the remaining portions of these corpora. The pair of values which yields the highest accuracy can then be adopted for use within the method which is the subject of the invention. An illustrative sketch of such a lattice search is given below.
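The following sketch illustrates how such a lattice search might be organised. The weighting of the individual terms, the fixed switch penalty and the beam recombination strategy are assumptions; the object lm is assumed to expose the log_prob interface of the earlier language-model sketch, and each lattice entry is assumed to carry the surface form, morphological class, semantic score s and the convention under which it was generated.

```python
# Sketch of the decoding step 107 (Viterbi-style search over the word lattice).
# Weights and the switch penalty value are illustrative assumptions; the patent
# only lists which quantities enter the search metric.
def decode(lattice, lm, switch_penalty=-5.0, sem_weight=1.0):
    """lattice: list of lists of (surface, morph_class, semantic_score, convention)."""
    # Each hypothesis: (score, chosen surfaces, last morph class, last convention)
    beams = [(0.0, [], "<s>", None)]
    for interpretations in lattice:
        new_beams = []
        for score, path, prev_cls, prev_conv in beams:
            for surface, cls, sem, conv in interpretations:
                s = score + lm.log_prob(prev_cls, cls) + sem_weight * sem
                if prev_conv is not None and conv is not None and conv != prev_conv:
                    s += switch_penalty          # convention switches are unlikely
                new_beams.append((s, path + [surface], cls, conv or prev_conv))
        # Viterbi-style recombination: keep the best hypothesis per (class, convention) pair.
        best = {}
        for cand in new_beams:
            key = (cand[2], cand[3])
            if key not in best or cand[0] > best[key][0]:
                best[key] = cand
        beams = list(best.values())
    return max(beams, key=lambda b: b[0])[1]
```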
Industrial Applicability
The method for automatic diacritisation of text can be applied within a range of systems related to language technologies, which can be divided into two principal groups:
• Systems where text diacritisation is aimed at the recovery of the meaning of words.
These include, but are not limited to, natural language understanding, machine translation, question answering, information retrieval and information extraction;
• Systems where text diacritisation is aimed at the recovery of the pronunciation of words, which principally refers to text-to-speech synthesis, but also to automatic speech recognition to a lesser degree. Namely, in order to build language models for automatic speech recognition it is necessary to possess a large text corpus which has to be fully diacritised. Although most of the diacritisation is mandatory in traditional writing, segments of the corpus related to less formal settings such as personal correspondence or various forms of Internet posting can be lacking diacritical marks, which could affect the accuracy of acoustic models used in recognition if not treated properly.

Claims

Claims
1. The method for diacritisation of texts written in Latin- or Cyrillic-derived alphabets which are given in a non-diacritised setting, establishing the list of possible interpretations of each word in an input text, performing the recovery of diacritical marks and producing an output sentence with a correct pattern of diacritical marks, characterized by:
soft classification step (102) at word level using a set of features of the decision tree based structure (103) as input, aimed at calculating the semantic score of each particular interpretation as well as information related to the morphological class of each particular interpretation; and
hard classification step (107) at sentence level or a higher level.
2. A method as claimed in claim 1, characterized in that interpretation of each word stands for the version of the word in a fully diacritised setting with established part of speech and values of particular morphological categories.
3. A method as claimed in claim 1, characterized in that soft classification (102) assigns to each possible interpretation of word from the list of the ambiguous non-diacritised input words, information related to the morphological class and decision tree based structure (103) used to calculate the semantic score of the particular interpretation in a given context.
4. A method as claimed in claim 1, characterized in that a set of features of the decision tree based structure includes topical information and information on semantic proximity.
5. A method as claimed in claim 4, characterized in that topical information establishes the topic of the text, its category and/or its functional style.
6. A method as claimed in claim 5, characterized in that the topic of the text may denote sports, art, or science, category may denote a newspaper article, excerpt from a book or sample of personal correspondence and functional style may denote newspaper, scientific, publicistic or poetic style.
7. A method as claimed in claim 4, characterized in that information on semantic proximity is calculated between each interpretation corresponding to the target word and its neighbouring word(s) in the sentence.
8. A method as claimed in claim 1, characterized in that information related to the morphological class combines the part-of-speech category, which may denote adverb, adjective, conjunction, interjection, pronoun, preposition, noun, or verb; and the values of relevant morphological categories, which may denote gender, number, case, tense or mood.
9. A method as claimed in claim 1, characterized in that performing hard classification (107) at higher level means carrying out of the process of disambiguation at the paragraph level or the entire text level.
PCT/RS2013/000010 2013-05-22 2013-05-22 A method for diacritisation of texts written in latin- or cyrillic-derived alphabets WO2014189400A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/RS2013/000010 WO2014189400A1 (en) 2013-05-22 2013-05-22 A method for diacritisation of texts written in latin- or cyrillic-derived alphabets

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/RS2013/000010 WO2014189400A1 (en) 2013-05-22 2013-05-22 A method for diacritisation of texts written in latin- or cyrillic-derived alphabets

Publications (1)

Publication Number Publication Date
WO2014189400A1 (en) 2014-11-27

Family

ID=48747699

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RS2013/000010 WO2014189400A1 (en) 2013-05-22 2013-05-22 A method for diacritisation of texts written in latin- or cyrillic-derived alphabets

Country Status (1)

Country Link
WO (1) WO2014189400A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107045586A (en) * 2015-12-02 2017-08-15 松下知识产权经营株式会社 Control method and control device
CN108304373A (en) * 2017-10-13 2018-07-20 腾讯科技(深圳)有限公司 Construction method, device, storage medium and the electronic device of semantic dictionary
CN108717406A (en) * 2018-05-10 2018-10-30 平安科技(深圳)有限公司 Text mood analysis method, device and storage medium
CN110275938A (en) * 2019-05-29 2019-09-24 广州伟宏智能科技有限公司 Knowledge extraction method and system based on non-structured document
US20220188515A1 (en) * 2019-03-27 2022-06-16 Qatar Foundation For Education, Science And Community Development Method and system for diacritizing arabic text

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0138079B1 (en) 1983-09-29 1991-08-07 International Business Machines Corporation Character recognition apparatus and method for recognising characters associated with diacritical marks
EP1402480A2 (en) 2001-06-30 2004-03-31 Siemens Aktiengesellschaft Means for connecting a front panel of a parallelepipedal built-in unit to the built-in housing of the latter
EP1471440A2 (en) 2003-03-31 2004-10-27 Microsoft Corporation System and method for word analysis
EP2447854A1 (en) 2010-10-27 2012-05-02 King Abdulaziz City for Science and Technology Method and system of automatic diacritization of Arabic

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
DAVID YAROWSKY: "Decision lists for lexical ambiguity resolution", PROCEEDINGS OF THE 32ND ANNUAL MEETING ON ASSOCIATION FOR COMPUTATIONAL LINGUISTICS -, 1 January 1994 (1994-01-01), Morristown, NJ, USA, pages 88 - 95, XP055077663, DOI: 10.3115/981732.981745 *
DIMITRA VERGYRI AND KATRIN KIRCHHOFF: "Automatic diacritization of Arabic for acoustic modeling in speech recognition", INTERNET CITATION, 28 August 2004 (2004-08-28), pages 1 - 8, XP002624241, Retrieved from the Internet <URL:http://acl.ldc.upenn.edu/W/W04/W04-1612.pdf> [retrieved on 20110222] *
MOHSEN A.A. RASHWAN ET AL: "A hybrid system for automatic Arabic diacritization", PROCEEDINGS OF THE SECOND INTERNATIONAL CONFERENCE ON ARABIC LANGUAGE RESOURCES AND TOOLS, 22 April 2009 (2009-04-22), Cairo, Egypt, pages 54 - 60, XP055077894, ISBN: 978-2-95-174085-3 *
TUAN ANH LUU ET AL: "A Pointwise Approach for Vietnamese Diacritics Restoration", ASIAN LANGUAGE PROCESSING (IALP), 2012 INTERNATIONAL CONFERENCE ON, IEEE, 13 November 2012 (2012-11-13), pages 189 - 192, XP032339752, ISBN: 978-1-4673-6113-2, DOI: 10.1109/IALP.2012.18 *
ZITOUNI I ET AL: "Arabic diacritic restoration approach based on maximum entropy models", COMPUTER SPEECH AND LANGUAGE, ELSEVIER, LONDON, GB, vol. 23, no. 3, 1 July 2009 (2009-07-01), pages 257 - 276, XP026010724, ISSN: 0885-2308, [retrieved on 20080617], DOI: 10.1016/J.CSL.2008.06.001 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107045586A (en) * 2015-12-02 2017-08-15 松下知识产权经营株式会社 Control method and control device
CN108304373A (en) * 2017-10-13 2018-07-20 腾讯科技(深圳)有限公司 Construction method, device, storage medium and the electronic device of semantic dictionary
CN108304373B (en) * 2017-10-13 2021-07-09 腾讯科技(深圳)有限公司 Semantic dictionary construction method and device, storage medium and electronic device
CN108717406A (en) * 2018-05-10 2018-10-30 平安科技(深圳)有限公司 Text mood analysis method, device and storage medium
CN108717406B (en) * 2018-05-10 2021-08-24 平安科技(深圳)有限公司 Text emotion analysis method and device and storage medium
US20220188515A1 (en) * 2019-03-27 2022-06-16 Qatar Foundation For Education, Science And Community Development Method and system for diacritizing arabic text
CN110275938A (en) * 2019-05-29 2019-09-24 广州伟宏智能科技有限公司 Knowledge extraction method and system based on non-structured document

Similar Documents

Publication Publication Date Title
CN101133411B (en) Fault-tolerant romanized input method for non-roman characters
Laboreiro et al. Tokenizing micro-blogging messages using a text classification approach
US8660834B2 (en) User input classification
CN101002198B (en) Systems and methods for spell correction of non-roman characters and words
Gupta et al. A survey of common stemming techniques and existing stemmers for indian languages
US8301435B2 (en) Removing ambiguity when analyzing a sentence with a word having multiple meanings
WO2005064490A1 (en) System for recognising and classifying named entities
CN106096664A (en) A kind of sentiment analysis method based on social network data
Lee et al. Syllable-pattern-based unknown-morpheme segmentation and estimation for hybrid part-of-speech tagging of Korean
Bar-Haim et al. Part-of-speech tagging of Modern Hebrew text
WO2014189400A1 (en) A method for diacritisation of texts written in latin- or cyrillic-derived alphabets
Gambäck et al. Methods for Amharic part-of-speech tagging
Luu et al. A pointwise approach for Vietnamese diacritics restoration
Tufiş et al. DIAC+: A professional diacritics recovering system
CN107229611B (en) Word alignment-based historical book classical word segmentation method
Alsayadi et al. Integrating semantic features for enhancing arabic named entity recognition
Onyenwe et al. Toward an effective igbo part-of-speech tagger
Makazhanov et al. On certain aspects of kazakh part-of-speech tagging
Muhamad et al. Proposal: A hybrid dictionary modelling approach for malay tweet normalization
WO2008131509A1 (en) Systems and methods for improving translation systems
Tantug A probabilistic mobile text entry system for agglutinative languages
Kranig Evaluation of language identification methods
CN113158693A (en) Uygur language keyword generation method and device based on Chinese keywords, electronic equipment and storage medium
Yıldırım et al. An unsupervised text normalization architecture for turkish language
Manohar et al. Spellchecker for Malayalam using finite state transition models

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13734524

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13734524

Country of ref document: EP

Kind code of ref document: A1