WO2014189400A1 - A method for diacritisation of texts written in latin- or cyrillic-derived alphabets - Google Patents

A method for diacritisation of texts written in latin- or cyrillic-derived alphabets Download PDF

Info

Publication number
WO2014189400A1
WO2014189400A1 (PCT/RS2013/000010)
Authority
WO
WIPO (PCT)
Prior art keywords
text
word
information
diacritised
morphological
Prior art date
Application number
PCT/RS2013/000010
Other languages
French (fr)
Inventor
Milan SEČUJSKI
Stevan OSTROGONAC
Darko Pekar
Dragan KNEŽEVIĆ
Branislav POPOVIĆ
Milana BOJANIĆ
Original Assignee
Axon Doo
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Axon Doo
Priority to PCT/RS2013/000010
Publication of WO2014189400A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/12 Use of codes for handling textual entities
    • G06F 40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/274 Converting codes to words; Guess-ahead of partial word inputs


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The presented invention is related to the method for the recovery of diacritical marks in texts written in any of the languages using Latin- or Cyrillic-derived alphabets with diacritical marks. The embodiment of the invention presented in this document uses multiple information sources (topical information and information on semantic proximity) in the task of diacritisation, which is recognised and treated as a classification task. The invention relies on classification based on topical information provided by text categorisation, the information on semantic proximity of particular words in the text, as well as morphological information. At word level the classification task is limited to the calculation of the semantic score of each particular word interpretation. The actual recovery of the diacritical marks is carried out only at the sentence level (or possibly some higher level such as paragraph or the entire text), with the assumption that users, when adapting a text to a non-diacritised setting, consistently use one of the existing conventions, rather than switching between different conventions.

Description

A METHOD FOR DIACRITISATION OF TEXTS WRITTEN IN
LATIN- OR CYRILLIC-DERIVED ALPHABETS
Technical Field
The invention belongs to the field of natural language processing (NLP). It can be applied in a variety of systems related to language technologies, including but not limited to natural language understanding, machine translation, question answering, information retrieval, information extraction, as well as text-to-speech synthesis.
Background Art
Diacritical marks (diacritics) are symbols added to particular letters of an alphabet to indicate a value and/or pronunciation different from the one that the letters are otherwise given. In the orthography of some languages a letter modified by a diacritic is treated as a new, distinct letter, while in others it is treated as a letter-diacritic combination, sometimes with a different phonological value, sometimes with the sole function of distinguishing between homonyms pronounced in the same way. Diacritical marks mostly appear above letters, although other positions can be found in some Latin- or Cyrillic-derived alphabets, such as below the letter, within the letter or between two consecutive letters. The use of diacritical marks in traditional writing is obligatory in most cases, while in the case of documents written using computers or other similar devices the situation is less clear. Namely, the computer technology of today was developed mostly in English-speaking countries, where an alphabet without diacritics is used, and for that reason keyboard layouts and data formats were initially developed with a bias favouring English. A number of extensions of the basic ASCII code chart were proposed in order to accommodate letters with diacritical marks common to particular Latin-derived alphabets. Today, Unicode and the ISO/IEC 10646 Universal Character Set (UCS) cover a much wider array of characters, including both Latin and Cyrillic ones, and they are rapidly replacing ASCII and similar standards in many environments. Letters with diacritics can be composed in most of the existing keyboard layouts, and existing standards such as Unicode assign a unique code to every known character. Although considerable effort has evidently been made to enable computer users to use the alphabets of their own languages with equal ease, in practice this often results in user dissatisfaction due to problems related to the conversion of documents between various software applications or versions thereof. A particularly severe example of this problem occurs in e-mail or SMS correspondence, where e.g. the use of (some) characters with diacritics does not always produce the desired result on the side of the recipient. Furthermore, sometimes even the pricing policy of the provider of mobile telephone services varies depending on the code page used for composing messages. For all these reasons, computer users and particularly mobile phone users have begun using their own alternative alphabets, established by de facto standard, with particular diacritised letters replaced by their non-diacritised versions or sequences thereof (the second option is most notably used in cases where the pronunciation differs greatly between the diacritised and the non-diacritised version, so that simply omitting the diacritical mark would constitute an error). A growing number of users in this category, particularly young ones, has given rise to speculation that, in the future, diacritical marks in a number of languages may become obsolete.
Such an introduction of alternative diacritic-less alphabets leads to a considerable amount of ambiguity, since words which differ only in the pattern of diacritical marks can appear identical in a setting without these marks. To an educated native speaker the recovery of diacritical marks is most often an easy task, with rare exceptions. However, from the point of view of natural language processing, the task is not trivial, and there is clearly a need for automatic diacritisation of text as a necessary step in the recovery of both the meaning of text and its pronunciation. There is no standard scientific framework established for resolving this particular problem, although solutions related to particular languages have been proposed in the past, both rule-based and based on machine learning techniques.
Some of them, which are protected, are mentioned below.
The patent application EP1471440 A2, filed March 23rd 2004, entitled "System and method for word analysis", relates to a computer-implemented method and system for morphological analysis of input text using a rule engine, which includes various transitions based on a lexicon, an orthography rule module and a morpheme combination module.
The patent application EP1402480 A1, filed July 4th 2001, entitled "Category based, extensible and interactive system for document retrieval", relates to a system designed to search for documents and analyze them in order to determine their word-pair patterns after receiving a search query from a requestor, based on linguistic and mathematical approaches for automatic text categorization.
The patent application EP2447854 A1, filed January 14th 2011, entitled "Method and system of automatic diacritization of Arabic", relates to a method and system for diacritizing a text. The method includes determination of the string of diacritized characters in the Arabic language given the string of non-diacritized ones.
The patent EP0138079 B1, filed September 17th 1984, entitled "Character recognition apparatus and method for recognizing characters associated with diacritical marks", relates to a method and apparatus for the recognition of characters or symbols which have diacritical marks.
Disclosure of the Invention
The presented invention is related to the method for the recovery of diacritical marks in texts written in any of the languages using Latin- or Cyrillic-derived alphabets with diacritical marks. The input of the method is a diacritic-less text in a particular language using a Latin- or Cyrillic-derived alphabet. The output of the method is the diacritised version of the input text.
The embodiment of the invention presented in this document uses multiple information sources in the task of diacritisation, which is recognised and treated as a classification task. Owing to the aggregation of the available information, the invention, which can be better appreciated with reference to the following figures and specifications, restores diacritical marks on the input text with great accuracy. In doing so, the invention uses topical information provided by text categorisation, the information on semantic proximity of particular words in the text, as well as morphological information.
The invention will be presented for the case of the Serbian language and its Latin script as a typical example of this problem, although a generalisation to any of the languages using Latin- or Cyrillic-derived alphabets, as well as any modification or variation that such a generalisation would include, should be readily apparent to any person skilled in the art.
Namely, the Latin script of Serbian recognises five letters with diacritical marks: Č, Ć, Š, Ž and Đ, as well as the digraph DŽ. To adapt a text to a non-diacritised setting, two conventions are predominantly used, as shown in Table 1.
Table 1 (conventions for adapting the diacritised letters to a non-diacritised setting; reproduced as an image in the original publication).
Some replacements can lead to ambiguity, which is nevertheless easily resolved by an educated human reader. If the word casa appears in the diacritic-less input text, its corresponding output words are identified as časa, čaša and ćasa by a combination of lexical look-up (relying on a morphological dictionary of Serbian) and morphological analysis. The classification task amounts to the selection of the most likely option based on an aggregation of multiple information sources (topical information and semantic proximity information). Within the invention, the final decision as to the most likely option is postponed and carried out only at the sentence level (or possibly some higher level such as the paragraph or the entire text), while at word level the decision is limited to soft classification, i.e. the calculation of the semantic score of each particular option, based on topical information as well as information related to the semantic proximity (two information sources) between (1) each option corresponding to the target word and (2) its neighbouring word(s) in the sentence. In reaching the final decision at the sentence (or higher) level, the invention operates under the assumption that, when adapting a text to a diacritic-less setting, users consistently use one of the existing conventions. Exceptions to this rule exist, but they are primarily related to cases where the use of the adopted convention would lead to an ambiguity that even an educated reader would consider confusing.
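The candidate-generation step illustrated by this example can be sketched in code. The following Python fragment is not taken from the patent: the inverse replacement map, the digraph handling and the toy dictionary are illustrative assumptions standing in for the conventions of Table 1 and the morphological dictionary 105.

```python
# Illustrative sketch (not from the patent text): generating candidate
# diacritised interpretations of a non-diacritised Serbian word by applying
# the inverse of the replacement conventions and filtering the results
# against a (hypothetical) morphological dictionary.
from itertools import product

# Inverse replacement options per non-diacritised character/digraph;
# the exact convention tables correspond to Table 1 and are assumed here.
INVERSE_MAP = {
    "c": ["c", "ć", "č"],
    "s": ["s", "š"],
    "z": ["z", "ž"],
    "dj": ["dj", "đ"],
    "dz": ["dz", "dž"],
}

def candidate_interpretations(word, dictionary):
    """Return all diacritised spellings of `word` found in `dictionary`."""
    # Tokenise into digraphs first, then single letters (a simplification).
    units, i = [], 0
    while i < len(word):
        if word[i:i + 2] in ("dj", "dz"):
            units.append(word[i:i + 2]); i += 2
        else:
            units.append(word[i]); i += 1
    options = [INVERSE_MAP.get(u, [u]) for u in units]
    candidates = {"".join(combo) for combo in product(*options)}
    return sorted(c for c in candidates if c in dictionary)

# Toy dictionary standing in for the morphological dictionary 105.
toy_dictionary = {"časa", "čaša", "ćasa"}
print(candidate_interpretations("casa", toy_dictionary))
```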
Brief Description of the Drawings
Figure 1 describes the basic structure of the invention, based on successive application of soft classification between various interpretations of the ambiguous non-diacritised input word at word level, as well as hard classification at sentence level (or possibly some higher level such as paragraph or the entire text).
Figure 2 represents the module for training (including the construction of separate soft classifiers for sufficiently frequent words in the non-diacritised versions of a large text corpus, as well as the construction of a language model required for the disambiguation process), which is the preparatory phase for the process of diacritisation.
Best Mode for Carrying Out the Invention
The invention can be understood as the method carrying out the classification task amounting to the selection of the most likely one of all possible strings of words corresponding to a string of words as presented in a non-diacritised setting in the input text. The words possibly corresponding to one non-diacritised word in the input text can be diacritised or non-diacritised themselves, i.e. in some cases the input and the output word can be identical.
Figure 1 describes the basic structure of the method which is the subject of the invention. The text is firstly searched for diacritical marks to establish whether diacritisation is required or not 100. If a diacritical mark is found, it is assumed that the text is fully diacritised; otherwise diacritisation is required. In the latter case, the text is firstly categorised as to its topic and/or functional style 101, and for each word in the text it is established whether there exists a soft classifier 102 related to that particular word and text category, the term 'soft classifier' denoting a structure containing:
• the list of possible interpretations of the ambiguous non-diacritised input word,
• for each one of the possible interpretations:
o information related to the morphology (part-of-speech labels and values of particular morphological categories where applicable);
o a decision tree based structure 103 used to calculate the semantic score of the particular interpretation in a given context.
If the soft classifier 102 for the target word does not exist, the word is accepted as is (in its non-diacritised form); however, it can still have multiple possible interpretations, since it can belong to different parts of speech (adverb, adjective, noun, pronoun, preposition...) or have different values of relevant morphological categories (gender, number, case...). In other words, it can belong to different morphological classes, the term 'morphological class' denoting the combination of the part-of-speech category and the values of relevant morphological categories. The method 104 for establishing possible interpretations (related to different morphological classes) can use either one or a combination of the following two approaches: lexical lookup based on a morphological dictionary 105 and morphological analysis 106. If the soft classifier 102 for the target word exists, it is used to provide the list of possible interpretations of the ambiguous non-diacritised input word, as well as the semantic score and morphological information related to each interpretation, which have been established during the construction of the classifier in the training phase. Upon arriving at the end of the sentence, a list of possible interpretations has thus been determined for each word, the differences between them being related to morphological information, diacritical patterns or both. Next, a process of disambiguation is carried out at sentence level, where a hard decision related to each ambiguous word is reached. This process will hereafter be referred to as decoding 107, and is based on the language model 108 built during the training phase.
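As an illustration of the word-level pass described above, the sketch below (with assumed data structures and function names, not the patent's reference implementation) shows the diacritic check 100, the stored soft classifier structure 102 and the fallback to lexical lookup or morphological analysis 104 for words without a classifier.

```python
# A minimal sketch, assuming simplified data structures for the word-level pass.
import unicodedata
from dataclasses import dataclass, field

def has_diacritics(text):
    """Step 100: if any letter carries a combining mark, assume the text is diacritised."""
    return any(unicodedata.combining(ch) for ch in unicodedata.normalize("NFD", text))

@dataclass
class Interpretation:
    surface: str              # possibly diacritised spelling
    morph_class: str          # e.g. "N:fem:sg:nom"
    semantic_score: float = 0.0

@dataclass
class SoftClassifier:         # stands in for structure 102
    interpretations: list = field(default_factory=list)
    # the decision tree based structures 103 would also be stored here

def word_level_pass(words, category, classifiers, fallback_lookup):
    """Build the interpretation lattice that is later passed to decoding 107."""
    lattice = []
    for w in words:
        clf = classifiers.get((w, category))
        if clf is not None:
            lattice.append(clf.interpretations)   # soft classification (102)
        else:
            lattice.append(fallback_lookup(w))    # lexical lookup / morphological analysis (104)
    return lattice
```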
Figure 2 represents the module for training the method which is the subject of the invention. This module uses a large textual corpus 201 which is fully diacritised and categorised with respect to topical information and/or functional style (newspaper style, scientific style...). It also uses the non-diacritised versions 202 of the textual corpus 201 (there can be more than one corpus 202 because there can be more than one convention for the adaptation of text to a diacritic-less setting). The module for training firstly establishes a list 203 of words in the non-diacritised corpora 202 that appear more often than a predefined threshold and are such that they could have been produced by the adaptation of some other words to a non-diacritised setting. For each of the words from this list a soft classifier 102 is built 204, using multiple information sources (topical information and information on semantic proximity) based on the corpora 201 and 202. The soft classifier 102 stores the list of possible interpretations of the ambiguous non-diacritised input word, the morphological classes (i.e. information related to the morphology) of each of them, as well as decision tree based structures 103 intended for the calculation of the corresponding category-dependent semantic score of each of the possible interpretations in a given context. To determine the semantic score of each particular (possibly diacritised) interpretation of an ambiguous non-diacritised word which appears in the input text, the soft classifier 102 uses the following information as the set of features of the decision tree based structure 103 related to that interpretation:
• topical information, provided by text categorisation, which establishes the topic of the text, its category and/or its functional style, having in mind that texts belonging to various topics, categories or functional styles can have significantly different statistical properties;
• information on semantic proximity (semantical relatedness), calculated between the target word and its neighbour(s) using any of the standard techniques, including latent semantic analysis (LSA);
The output of each tree based structure 103 is the semantic score s of the corresponding interpretation. In this way, during the exploitation phase, each of the interpretations will be assigned a semantic 'likelihood' in a given context, with a definite decision as to the correct one being postponed to the decoding process 107. The semantic score s of a particular interpretation will be used as one of the parameters that affect the values of the search metrics in the decoding process 107.
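A possible realisation of the feature computation for the decision tree based structures 103 is sketched below. The choice of scikit-learn, the TF-IDF/LSA pipeline, the toy corpus and the two-element feature vector are assumptions made for illustration; the patent only specifies that topical information and a semantic-proximity measure such as LSA serve as features, and that the tree outputs the semantic score s.

```python
# Illustrative feature construction for the decision tree based structures 103
# (library choice and feature layout are assumptions, not taken from the patent).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.tree import DecisionTreeRegressor

# LSA word space built from (a toy stand-in for) the diacritised corpus 201.
docs = ["pije vodu iz čaše", "u ćasi je supa", "posle jednog časa"]
vectoriser = TfidfVectorizer()
X = vectoriser.fit_transform(docs)
lsa = TruncatedSVD(n_components=2).fit(X)

def lsa_vector(text):
    return lsa.transform(vectoriser.transform([text]))[0]

def features(interpretation, context_words, category_id):
    """Topical feature (text category 101) + semantic-proximity feature (LSA cosine)."""
    sim = cosine_similarity([lsa_vector(interpretation)],
                            [lsa_vector(" ".join(context_words))])[0, 0]
    return [category_id, sim]

# A regression tree mapping the features to the semantic score s.
tree_103 = DecisionTreeRegressor(max_depth=3)
# tree_103.fit(training_features, training_scores)   # fitted during training step 204
```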
Besides the construction of soft classifiers, the training process includes the construction 205 of a language model 108, which will be used in the decoding 107. This model consists of the estimates of the probabilities of unigrams and higher-order n-grams of morphological classes. The language model 108 is built using the textual corpus 201 and the method 104 for establishing possible interpretations related to different morphological classes of the words in the corpus 201. The estimated probabilities are smoothed using any of the standard techniques, such as Good-Turing estimation.
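A minimal sketch of such a language model over morphological classes is given below. The bigram order and the add-one smoothing are simplifying assumptions made for brevity; the patent itself mentions Good-Turing estimation as one suitable smoothing technique.

```python
# Sketch of the language model 108: n-gram counts over morphological class
# sequences with a simple smoothing stand-in (add-one instead of Good-Turing).
from collections import Counter
import math

class MorphClassBigramLM:
    def __init__(self, class_sequences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for seq in class_sequences:
            self.unigrams.update(seq)
            self.bigrams.update(zip(seq, seq[1:]))
        self.vocab = len(self.unigrams) or 1

    def log_prob(self, prev_class, cur_class):
        # Add-one smoothed P(cur | prev); a Good-Turing estimate would replace this.
        num = self.bigrams[(prev_class, cur_class)] + 1
        den = self.unigrams[prev_class] + self.vocab
        return math.log(num / den)

# Toy training data: sequences of morphological class labels from corpus 201.
lm_108 = MorphClassBigramLM([["Adj:f:sg:nom", "N:f:sg:nom", "V:3sg"]])
print(lm_108.log_prob("Adj:f:sg:nom", "N:f:sg:nom"))
```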
The decoding process 107 is based on the language model 108 in that it consists of a search through a lattice of possible word interpretations, disambiguating between them using the probability estimates contained in the language model 108 and producing the output text with recovered diacritical marks as the result. The search can be organised along the lines of the Viterbi algorithm or some other appropriate technique, with search metrics based on the values of the following parameters:
• Estimated probabilities of a particular morphological class after some other morphological class or a sequence thereof (e.g. probability that a noun [feminine, nominative, singular] will appear after an adjective [feminine, nominative, singular]);
• Estimated probabilities of particular words given their morphological class (words that appear more frequently within a morphological class will have a higher probability than words that appear less frequently);
• Semantic score s, as provided by the soft classifiers 102, for particular words and the particular text category and/or functional style as established by 101;
• Occurrences of switches between different conventions for adapting a text to a diacritic-less setting. As the system operates under the assumption that, when adapting a text to a diacritic-less setting, users consistently use one of the existing conventions, with exceptions primarily related to cases where the consistent use of the adopted convention would lead to critical ambiguity, switches between different conventions should be considered as low-probability events. Consequently, each transition between the nodes of the lattice should also be attributed a score denoting the existence (or absence) of a switch between the conventions used, i.e. the fact that the convention used for the adaptation of the current word to a diacritic-less setting was different from the one used on the last occasion when one of the previous words in the text was thus adapted. The optimal pair of score values used to indicate the presence or absence of a switch can be set heuristically or estimated using a segment of the textual corpora withheld from training. Namely, soft classifiers 102 can be built using only a portion of the available textual corpus 201 and the corresponding portions of the corpora 202, and the accuracy of diacritisation obtained with particular pairs of values can be measured on the remaining portions of these corpora. The pair of values which yields the highest accuracy can then be adopted for use within the method which is the subject of the invention. An illustrative sketch of such a lattice search is given below.
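The following sketch illustrates how such a lattice search might be organised. The weighting of the individual terms, the fixed switch penalty and the beam recombination strategy are assumptions; the object lm is assumed to expose the log_prob interface of the earlier language-model sketch, and each lattice entry is assumed to carry the surface form, morphological class, semantic score s and the convention under which it was generated.

```python
# Sketch of the decoding step 107 (Viterbi-style search over the word lattice).
# Weights and the switch penalty value are illustrative assumptions; the patent
# only lists which quantities enter the search metric.
def decode(lattice, lm, switch_penalty=-5.0, sem_weight=1.0):
    """lattice: list of lists of (surface, morph_class, semantic_score, convention)."""
    # Each hypothesis: (score, chosen surfaces, last morph class, last convention)
    beams = [(0.0, [], "<s>", None)]
    for interpretations in lattice:
        new_beams = []
        for score, path, prev_cls, prev_conv in beams:
            for surface, cls, sem, conv in interpretations:
                s = score + lm.log_prob(prev_cls, cls) + sem_weight * sem
                if prev_conv is not None and conv is not None and conv != prev_conv:
                    s += switch_penalty          # convention switches are unlikely
                new_beams.append((s, path + [surface], cls, conv or prev_conv))
        # Viterbi-style recombination: keep the best hypothesis per (class, convention) pair.
        best = {}
        for cand in new_beams:
            key = (cand[2], cand[3])
            if key not in best or cand[0] > best[key][0]:
                best[key] = cand
        beams = list(best.values())
    return max(beams, key=lambda b: b[0])[1]
```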
Industrial Applicability
The method for automatic diacritisation of text can be applied within a range of systems related to language technologies, which can be divided into two principal groups:
• Systems where text diacritisation is aimed at the recovery of the meaning of words.
These include, but are not limited to, natural language understanding, machine translation, question answering, information retrieval and information extraction;
• Systems where text diacritisation is aimed at the recovery of the pronunciation of words, which principally refers to text-to-speech synthesis, but also to automatic speech recognition to a lesser degree. Namely, in order to build language models for automatic speech recognition it is necessary to possess a large text corpus which has to be fully diacritised. Although most of the diacritisation is mandatory in traditional writing, segments of the corpus related to less formal settings such as personal correspondence or various forms of Internet posting can be lacking diacritical marks, which could affect the accuracy of acoustic models used in recognition if not treated properly.

Claims

Claims
1. The method for diacritisation of texts written in Latin- or Cyrillic-derived alphabets which are given in a non-diacritised setting, establishing the list of possible interpretations of each word in an input text, performing the recovery of diacritical marks and producing an output sentence with a correct pattern of diacritical marks, characterized by:
soft classification step (102) at word level using a set of features of the decision tree based structure (103) as input, aimed at calculating the semantic score of each particular interpretation as well as information related to the morphological class of each particular interpretation; and
hard classification step (107) at sentence level or a higher level.
2. A method as claimed in claim 1, characterized in that interpretation of each word stands for the version of the word in a fully diacritised setting with established part of speech and values of particular morphological categories.
3. A method as claimed in claim 1, characterized in that soft classification (102) assigns to each possible interpretation of word from the list of the ambiguous non-diacritised input words, information related to the morphological class and decision tree based structure (103) used to calculate the semantic score of the particular interpretation in a given context.
4. A method as claimed in claim 1, characterized in that a set of features of the decision tree based structure includes topical information and information on semantic proximity.
5. A method as claimed in claim 4, characterized in that topical information establishes the topic of the text, its category and/or its functional style.
6. A method as claimed in claim 5, characterized in that the topic of the text may denote sports, art, or science, category may denote a newspaper article, excerpt from a book or sample of personal correspondence and functional style may denote newspaper, scientific, publicistic or poetic style.
7. A method as claimed in claim 4, characterized in that information on semantic proximity is calculated between each interpretation corresponding to the target word and its neighbouring word(s) in the sentence.
8. A method as claimed in claim 1, characterized in that information related to the morphological class combines the part-of-speech category, which may denote adverb, adjective, conjunction, interjection, pronoun, preposition, noun, or verb; and the values of relevant morphological categories, which may denote gender, number, case, tense or mood.
9. A method as claimed in claim 1, characterized in that performing hard classification (107) at higher level means carrying out of the process of disambiguation at the paragraph level or the entire text level.
PCT/RS2013/000010 2013-05-22 2013-05-22 A method for diacritisation of texts written in latin- or cyrillic-derived alphabets WO2014189400A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/RS2013/000010 WO2014189400A1 (en) 2013-05-22 2013-05-22 A method for diacritisation of texts written in latin- or cyrillic-derived alphabets

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/RS2013/000010 WO2014189400A1 (en) 2013-05-22 2013-05-22 A method for diacritisation of texts written in latin- or cyrillic-derived alphabets

Publications (1)

Publication Number Publication Date
WO2014189400A1 (en) 2014-11-27

Family

ID=48747699

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RS2013/000010 WO2014189400A1 (en) 2013-05-22 2013-05-22 A method for diacritisation of texts written in latin- or cyrillic-derived alphabets

Country Status (1)

Country Link
WO (1) WO2014189400A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107045586A (en) * 2015-12-02 2017-08-15 松下知识产权经营株式会社 Control method and control device
CN108304373A (en) * 2017-10-13 2018-07-20 腾讯科技(深圳)有限公司 Construction method, device, storage medium and the electronic device of semantic dictionary
CN108717406A (en) * 2018-05-10 2018-10-30 平安科技(深圳)有限公司 Text mood analysis method, device and storage medium
CN110275938A (en) * 2019-05-29 2019-09-24 广州伟宏智能科技有限公司 Knowledge extraction method and system based on non-structured document
US20220188515A1 (en) * 2019-03-27 2022-06-16 Qatar Foundation For Education, Science And Community Development Method and system for diacritizing arabic text

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0138079B1 (en) 1983-09-29 1991-08-07 International Business Machines Corporation Character recognition apparatus and method for recognising characters associated with diacritical marks
EP1402480A2 (en) 2001-06-30 2004-03-31 Siemens Aktiengesellschaft Means for connecting a front panel of a parallelepipedal built-in unit to the built-in housing of the latter
EP1471440A2 (en) 2003-03-31 2004-10-27 Microsoft Corporation System and method for word analysis
EP2447854A1 (en) 2010-10-27 2012-05-02 King Abdulaziz City for Science and Technology Method and system of automatic diacritization of Arabic

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
DAVID YAROWSKY: "Decision lists for lexical ambiguity resolution", PROCEEDINGS OF THE 32ND ANNUAL MEETING ON ASSOCIATION FOR COMPUTATIONAL LINGUISTICS -, 1 January 1994 (1994-01-01), Morristown, NJ, USA, pages 88 - 95, XP055077663, DOI: 10.3115/981732.981745 *
DIMITRA VERGYRI AND KATRIN KIRCHHOFF: "Automatic diacritization of Arabic for acoustic modeling in speech recognition", INTERNET CITATION, 28 August 2004 (2004-08-28), pages 1 - 8, XP002624241, Retrieved from the Internet <URL:http://acl.ldc.upenn.edu/W/W04/W04-1612.pdf> [retrieved on 20110222] *
MOHSEN A.A. RASHWAN ET AL: "A hybrid system for automatic Arabic diacritization", PROCEEDINGS OF THE SECOND INTERNATIONAL CONFERENCE ON ARABIC LANGUAGE RESOURCES AND TOOLS, 22 April 2009 (2009-04-22), Cairo, Egypt, pages 54 - 60, XP055077894, ISBN: 978-2-95-174085-3 *
TUAN ANH LUU ET AL: "A Pointwise Approach for Vietnamese Diacritics Restoration", ASIAN LANGUAGE PROCESSING (IALP), 2012 INTERNATIONAL CONFERENCE ON, IEEE, 13 November 2012 (2012-11-13), pages 189 - 192, XP032339752, ISBN: 978-1-4673-6113-2, DOI: 10.1109/IALP.2012.18 *
ZITOUNI I ET AL: "Arabic diacritic restoration approach based on maximum entropy models", COMPUTER SPEECH AND LANGUAGE, ELSEVIER, LONDON, GB, vol. 23, no. 3, 1 July 2009 (2009-07-01), pages 257 - 276, XP026010724, ISSN: 0885-2308, [retrieved on 20080617], DOI: 10.1016/J.CSL.2008.06.001 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107045586A (en) * 2015-12-02 2017-08-15 松下知识产权经营株式会社 Control method and control device
CN108304373A (en) * 2017-10-13 2018-07-20 腾讯科技(深圳)有限公司 Construction method, device, storage medium and the electronic device of semantic dictionary
CN108304373B (en) * 2017-10-13 2021-07-09 腾讯科技(深圳)有限公司 Semantic dictionary construction method and device, storage medium and electronic device
CN108717406A (en) * 2018-05-10 2018-10-30 平安科技(深圳)有限公司 Text mood analysis method, device and storage medium
CN108717406B (en) * 2018-05-10 2021-08-24 平安科技(深圳)有限公司 Text emotion analysis method and device and storage medium
US20220188515A1 (en) * 2019-03-27 2022-06-16 Qatar Foundation For Education, Science And Community Development Method and system for diacritizing arabic text
CN110275938A (en) * 2019-05-29 2019-09-24 广州伟宏智能科技有限公司 Knowledge extraction method and system based on non-structured document

Similar Documents

Publication Publication Date Title
CN101133411B (en) Fault-tolerant romanized input method for non-roman characters
Laboreiro et al. Tokenizing micro-blogging messages using a text classification approach
US8660834B2 (en) User input classification
CN101002198B (en) Systems and methods for spell correction of non-roman characters and words
Gupta et al. A survey of common stemming techniques and existing stemmers for indian languages
US8301435B2 (en) Removing ambiguity when analyzing a sentence with a word having multiple meanings
WO2005064490A1 (en) System for recognising and classifying named entities
CN106096664A (en) A kind of sentiment analysis method based on social network data
Lee et al. Syllable-pattern-based unknown-morpheme segmentation and estimation for hybrid part-of-speech tagging of Korean
Bar-Haim et al. Part-of-speech tagging of Modern Hebrew text
WO2014189400A1 (en) A method for diacritisation of texts written in latin- or cyrillic-derived alphabets
Gambäck et al. Methods for Amharic part-of-speech tagging
Luu et al. A pointwise approach for Vietnamese diacritics restoration
Tufiş et al. DIAC+: A professional diacritics recovering system
CN107229611B (en) Word alignment-based historical book classical word segmentation method
Alsayadi et al. Integrating semantic features for enhancing arabic named entity recognition
Onyenwe et al. Toward an effective igbo part-of-speech tagger
Makazhanov et al. On certain aspects of kazakh part-of-speech tagging
Muhamad et al. Proposal: A hybrid dictionary modelling approach for malay tweet normalization
WO2008131509A1 (en) Systems and methods for improving translation systems
Tantug A probabilistic mobile text entry system for agglutinative languages
Kranig Evaluation of language identification methods
CN113158693A (en) Uygur language keyword generation method and device based on Chinese keywords, electronic equipment and storage medium
Yıldırım et al. An unsupervised text normalization architecture for turkish language
Manohar et al. Spellchecker for Malayalam using finite state transition models

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13734524

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13734524

Country of ref document: EP

Kind code of ref document: A1