WO2009006911A1

WO2009006911A1 - System and method for large-scale Arabic lexical semantic analysis

Info

Publication number: WO2009006911A1
Application number: PCT/EG2007/000022
Authority: WO
Inventors: Mohsen Abdel-Razik Ali Rashwan; Mohamed Attia Mohamed El-Araby Ahmed
Original assignee: The Engineering Company For The Development Of Computer Systems. (Rdi)
Priority date: 2007-07-12
Filing date: 2007-07-12
Publication date: 2009-01-15

Abstract

System and method for extracting word sense sequences (01) corresponding to sequences of input Arabic words (11). These senses belong to a global compact basis set of predefined semantic fields. The system can also produce the semantic relations (02) between the words of two input Arabic word sequences (11, 12). It relies on a lexical semantic relational database that relates Arabic lexical compounds to semantic fields in both the forward and backward directions. To achieve high coverage of the highly derivative and inflective Arabic language, this database is morpheme-based rather than vocabulary-based. Semantic fields are associated with lexical compounds with the aid of a large-scale morphological analyzer. Semantic relations between words are determined by first reducing the words to lexical compounds, mapping the lexical compounds to semantic fields, and then relating the semantic fields to each other. This approach reduces complexity considerably.

Description

A new compact approach to large-scale Arabic lexical semantic analysis

Technical field:

This invention finds the disambiguated word sense sequence corresponding to an input Arabic text and the most likely sequence of semantic relations between two input Arabic texts. It can also find the most likely word sequence related to an input Arabic text through a given input sequence of semantic relations.

Background art:

The following tools and Arabic language resources are the most relevant to our work:

1. Arabic Word Net (AWN) by the Linguistic Data Consortium (LDC) of the University of Pennsylvania (www.LDC.UNPENN.edu) is the most relevant to our work. The 1st release of this vocabulary-based AWN appeared in Mar. 2007 and has drawn much criticism, especially for its poor coverage of the Arabic language and for the scarcity, and sometimes the peculiarity, of its semantic relations. It should be noted that this AWN has been bootstrapped from the LDC's English Word Net.

2. For machine translation (MT) where Arabic is involved as either the source or the target language, the following companies have developed semantic knowledge bases and analysis mechanisms to be embedded in their MT systems: i. SAKHR®, a Kuwaiti-Egyptian company (www.Sakhr.com). ii. Cimos®, based in France (www.Cimos.com). iii. ScanSoft® (now acquired by Nuance); all are EU companies (www.nuance.com).

3. Arabic Thesaurus by Coltec® (www.CoiTec.net), an Egyptian company.

• Innovation points in our work:

1- It is morpheme-based, not vocabulary-based like most of its relevant rivals, which gives it much wider coverage of the highly derivative and inflective Arabic language than any vocabulary-based technique.

2- The lexical-to-semantic mapping is made to a compact basis set of (hundreds of) semantic fields. This implementation of the Semantic Fields theory is done for the first time in Arabic linguistics, at least in the published literature.

3- Based on the previous two points, the lexical semantic text factorization produced by our system can enhance the performance of Arabic text mining applications by reducing the dimensionality and concentrating the correlations among the factorized text entities subject to the statistical methods typically deployed in text mining.

While text mining systems can perform pattern discovery for many tasks by directly processing surface (raw) text, performance improves steadily as the mining is done on progressively deeper linguistic analysis of this text, given the same mining algorithms, training corpus size, and computational power.

The mathematical explanation is that as linguistic analysis gets deeper (resolving more and more complex relations), the raw text is factorized into more fundamental, and typically less numerous, atomic entities. This in turn reveals higher (statistical) correlations and reduces the dimensionality of the problem, which sharpens the effectiveness of the mining process.

Disclosure of invention:

The proposed system shown in figure 1 can be operated in the following three modes:

1. Semantic Factorization mode: O₁(I₁) is the disambiguated word senses sequence corresponding to the i/p text I₁ after the scenario: A₁, M₁, A₂, A₃, inverse relational database (RDB), A₄, A₅, M₂, A₆.

2. Semantic Similarity mode: O₂(I₁, I₂) is the most likely semantic relations sequence between the i/p text I₁ and the i/p text I₂ after the scenario: A₁, M₁, A₂, A₃, inverse RDB, A₄, A₅, M₂, A₆, A₇, semantic relations RDB, and A₈.

3. Semantic Provocation mode: O₃(I₁, R) is the most likely words sequence related to the i/p text I₁ with the i/p semantic relations sequence R after the scenario: A₁, M₁, A₂, A₃, inverse RDB, A₄, A₅, M₂, A₆, A₇, R from A₉, semantic relations RDB, A₁₀, A₁₁, forward RDB, A₁₂.

In addition to the hardware shown in figures 0_a and 0_b, the system is mainly composed of the following modules:

I- Arabic Morphological Analyzer & PoS tagger modules;

• Arabic Morphological Model:

Due to the highly derivative and inflective nature of the Arabic language, it is much more comprehensive, effective, and economical to deal with its compact set of basic building entities, i.e. morphemes, than with its unmanageably huge generable vocabulary. Following that morpheme-based approach,¹ the canonical lexical structure of any Arabic word w according to ArabMorpho® has been formulated as a quadruple:

w → q = (t : p, r, f, s)

where t is the Type Code (the possible types are Regular Derivative, Irregular Derivative, Fixed, and Arabized), p is the Prefix Code, r is the Root Code, f is the Pattern Code, and s is the Suffix Code.

¹ For full details of this model, see: Attia, M., 2000, A Large-Scale Computational Processor of The Arabic Morphology, and Applications, MSc. thesis, Dept. of Computer Engineering, Faculty of Engineering, Cairo University. http://www.RDI-eg.com/RDI/Technologies/paper.htm
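As a minimal sketch of the quadruple model above, the following Python fragment represents q = (t : p, r, f, s) as an immutable record. The class name, field layout, and the numeric codes in the example are illustrative assumptions; the actual ArabMorpho® code tables are not given in this text.

```python
from dataclasses import dataclass
from enum import Enum


class WordType(Enum):
    """The four type codes named in the text."""
    REGULAR_DERIVATIVE = "regular derivative"
    IRREGULAR_DERIVATIVE = "irregular derivative"
    FIXED = "fixed"
    ARABIZED = "arabized"


@dataclass(frozen=True)
class Quadruple:
    """Canonical lexical structure q = (t : p, r, f, s) of one word."""
    t: WordType  # type code
    p: int       # prefix code
    r: int       # root code
    f: int       # pattern (form) code
    s: int       # suffix code

    def stem(self):
        """Return the (t : f) part of the quadruple."""
        return (self.t, self.f)


# Hypothetical morpheme codes, for illustration only.
q = Quadruple(WordType.REGULAR_DERIVATIVE, p=9, r=1234, f=57, s=0)
```

Because the record is frozen, it is hashable and could serve directly as a dictionary key in lookup tables.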

These kinds of morphemes in the Arabic lexicon of ArabMorpho are clearly classified in figure 2.

With a dynamic coverage ratio exceeding 99.8% (not counting Arabic transliterated foreign words), the knowledge base (i.e. the lexicon) of ArabMorpho® based on this model is composed of only about 7,700 morphemes, with the fully agent-oriented linguistic description of each. The sizes of each kind of morpheme in figure 2 are as follows:

1- P: About 260 Arabic prefixes.

2- R_d: About 4,600 Arabic derivative roots.

3- F_rd: About 1,000 Arabic regular derivative patterns.

4- F_id: About 300 Arabic irregularly derived words.

5- R_f: About 250 roots of Arabic fixed words.

6- F_f: About 300 Arabic fixed words.

7- R_a: About 240 roots of Arabized words.

8- F_a: About 290 Arabized words.

9- S: About 550 Arabic suffixes.

Figure 3 shows this model applied to a few sample Arabic words.

• Arabic PoS tagging model:²

• Defining And Positioning Arabic PoS Tagging

Part-Of-Speech (PoS) tagging is a fundamental linguistic analysis process where PoS tags that convey the basic context-free syntactic features of input surface text words are extracted.

Among the several linguistic processing tasks for which PoS tagging may be quite useful, including the problems tackled here, PoS tags are the most essential input features for all kinds of computational syntax parsers of natural language, which are in turn one step higher on the ladder towards language understanding and machine translation.

Based on that definition, the position of PoS tagging is clearly a middle sublayer between the two fundamental lexical and syntactic layers on the NLP ladder, as shown in figure 4.

² For full details of this model, see: Attia, M., Rashwan, M., 2004, A Large-Scale Arabic POS Tagger Based on a Compact Arabic POS Tags Set, and Application on the Statistical Inference of Syntactic Diacritics of Arabic Text Words, Proceedings of the Arabic Language Technologies and Resources Int'l Conference; NEMLAR, Cairo 2004. http://www.RDI-eg.com/RDI/Technologies/paper.htm

• Compact Arabic PoS Tags Set

Composing an Arabic PoS tags set necessitates scanning the lexico-syntactic features of each possible word of the Arabic vocabulary, which is clearly infeasible. Instead, thanks to the morpheme-based approach, the features of each morpheme in the relatively compact ArabMorpho® knowledge base have been scanned, then digested through several iterations of decimation into a non-redundant compact Arabic PoS tags set.

During that scanning process, the following criteria have been adhered to:

1- All the existing lexico-syntactic features must be named and registered, which aims at the completeness of the resulting PoS tags set.

2- All the named and registered features must be atomic, which aims at compactness and avoids redundancy in the resulting tags set. This in turn is vital for the effectiveness of the PoS tagging process based upon it (which is essentially an abstraction process) and of all higher processing layers as well.

3- All the named and registered features must be ones that can be ensured (i.e. assigned with certainty) in the PoS labeling of the morphemes in our Arabic lexical knowledge base. (More on this point in the next section below.)

The table in figure 5 shows our Arabic PoS tags set along with the meaning of each tag verbalized in both English and Arabic. Moreover, the 62 tags in the set are functionally categorized in order to maximize clarity.

While some tags in that table may have corresponding ones in other languages; e.g. English, others do not have such counterparts and are specific to the Arabic language.

• Arabic PoS Labeling

Once the Arabic PoS tags set has been designed, labeling the morphemes of the lexical knowledge base comes as the next job, which is a straightforward one provided that the following three main points are carefully considered:

1- For morphologically analyzed words, the f part of the quadruple gives the Arabic PoS tagging of stems, while the p and s parts give the Arabic PoS tagging of affixes. Hence, the root morphemes of all kinds, which do not participate in tagging, are not Arabic PoS labeled.

2- Due to the atomicity of the tags in the Arabic PoS tags set (see the previous section) and at the same time the compound nature of Arabic morphemes in general, the PoS labels of Arabic morphemes are vectors, not simple scalars.

3- Only ensured Arabic PoS tags are considered in the Arabic PoS labeling of morphemes; i.e. when an Arabic PoS tag is a possible, or even a highly probable, but not an ensured feature of a given morpheme, it is not included in its Arabic PoS label vector.

The table in figure 6 shows a few morpheme labeling examples (ArabMorpho®) in order to illustrate the process concretely.

• Arabic PoS Tagging

The Arabic PoS tagging process is implemented in the following steps:

1- The sequence of Arabic strings to be PoS tagged is morphologically analyzed and combinatorially disambiguated. This results in a disambiguated quadruples sequence where each string is substituted by either one quadruple or a mark of a transliterated string.

2- For the prefix, pattern, and suffix morphemes of each quadruple in the sequence, the Arabic PoS labels APoS(p), APoS(t : f), and APoS(s) are retrieved from the Arabic lexicon of ArabMorpho®.

3- The Arabic PoS tags vector of each word in the sequence is then composed using the formula:

APoS(w) = Concat(APoS(p), APoS(t : f), APoS(s))

where the Concat function simply concatenates the PoS sub vectors of the constituting morphemes after eliminating any mutual redundancy among their tags.
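The composition step can be sketched as follows. This is a minimal illustration that reads "eliminating any mutual redundancy" as dropping repeated tags while preserving order, which is an assumption about the intended behavior. Apart from Femin, which appears elsewhere in this text, the tag names are hypothetical.

```python
def concat(*sub_vectors):
    """Concatenate PoS sub-vectors in order, keeping the first copy of each tag."""
    seen = set()
    result = []
    for vector in sub_vectors:
        for tag in vector:
            if tag not in seen:
                seen.add(tag)
                result.append(tag)
    return result


# Hypothetical label vectors for the prefix, the (t : f) stem part, and the suffix.
apos_p = ["Definit"]
apos_tf = ["Noun", "Femin"]
apos_s = ["Femin", "Plural"]

apos_w = concat(apos_p, apos_tf, apos_s)
# apos_w is ["Definit", "Noun", "Femin", "Plural"]
```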

The Arabic PoS tags vectors produced by RDI's ArabTagger® for sample words in a real-life phrase are shown in figure 7.

II- Word sense statistical disambiguation module:

The well-known Yarowsky algorithm (http://www.vinartus.net/spa/03c-v7.pdf) for semi-supervised statistical learning is used to realize the Word Sense Disambiguation (WSD). Given the high cost and the vulnerability to human error of manually annotating large enough text corpora with semantics (sizes in the millions of words are required), the main virtue of such an algorithm is clear: it is able to learn from both labeled (manually annotated) and unlabeled data.
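To illustrate how such bootstrapping learns from both labeled and unlabeled data, here is a simplified Yarowsky-style sketch. It is not the implementation used in this system: the feature representation (sets of context words), the smoothing constant, the threshold, and the toy data are all assumptions, and the one-sense-per-discourse step of the original algorithm is omitted.

```python
import math
from collections import defaultdict


def train_decision_list(labeled, senses, alpha=0.1):
    """Score each context feature by a smoothed log-likelihood ratio."""
    counts = defaultdict(lambda: defaultdict(float))
    for features, sense in labeled:
        for feat in features:
            counts[feat][sense] += 1.0
    rules = []
    for feat, by_sense in counts.items():
        ranked = sorted(senses, key=lambda s: by_sense[s], reverse=True)
        best, runner_up = ranked[0], ranked[1]
        score = math.log((by_sense[best] + alpha) / (by_sense[runner_up] + alpha))
        rules.append((score, feat, best))
    rules.sort(reverse=True)  # strongest evidence first
    return rules


def classify(features, rules, threshold):
    """Apply the single strongest matching rule, as a decision list does."""
    for score, feat, sense in rules:
        if score < threshold:
            break
        if feat in features:
            return sense
    return None


def yarowsky_bootstrap(seeds, unlabeled, senses, threshold=1.0, max_iter=10):
    """Grow the labeled set from a few seeds until the labeling stops changing."""
    labeled, labels = list(seeds), {}
    for _ in range(max_iter):
        rules = train_decision_list(labeled, senses)
        new_labels = {}
        for i, features in enumerate(unlabeled):
            sense = classify(features, rules, threshold)
            if sense is not None:
                new_labels[i] = sense
        if new_labels == labels:
            break
        labels = new_labels
        labeled = list(seeds) + [(unlabeled[i], s) for i, s in labels.items()]
    return labels


# Toy contexts of one ambiguous word with two senses (English stand-ins).
seeds = [({"water", "drink"}, "spring"), ({"doctor", "see"}, "eye")]
unlabeled = [{"water", "mountain"}, {"mountain", "cold"},
             {"doctor", "glasses"}, {"glasses", "read"}]
labels = yarowsky_bootstrap(seeds, unlabeled, ["eye", "spring"])
```

In this toy run, contexts 1 and 3 share no word with the seeds; they are only labeled in the second round, through features learned from contexts 0 and 2.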

As it is not one of this patent's claims, it should be noted that the implementation of WSD may be altered if more effective methods emerge in the field.

III- Semantic Fields to Lexical Compounds mapping (forward Lexical Semantics DB) module:

The forward lexical semantics DB is the direct transformation of our primary Arabic lexical semantic source into a relational database where the primary key is the ID of a semantic field, which provokes all the terms grouped under that field. More than 36,000 core terms are originally covered in our primary source under about 1,800 semantic fields.

• The DB building criteria:

• Originality of the source Arabic lexical-semantic knowledge base.

The published literature has been surveyed for sound semantic knowledge bases crafted originally for the Arabic language by credible experts in Arabic linguistics. Matched against the criteria listed here, the sources based on the theory of semantic fields have been found to be our best fit, and have hence been elected to be our primary (not necessarily sole) sources of raw Arabic lexical semantics.

• Widest coverage of possible lexical compounds, and semantic relations.

In order to avoid (or at least minimize) a high runtime retrieval miss ratio of input words against the (inevitably limited) terms covered in that source of raw Arabic lexical semantics, two flexible concepts have been established and deployed while coding our DB to tame the highly inflective and derivative nature of Arabic: lexical compounds (instead of full-form words), and constraining by morphological codes and PoS tags.

• Simplicity and compactness of the resulting Arabic lexical semantic DB.

• Minimum implementation and updating cost.

Relying on well-ordered sources of original Arabic lexical semantics, establishing the concepts of lexical compounds and constraining, and the two-level semantic mapping all enabled our team to produce v1.0 of this DB in less than 80 man-months.

• The process of building the Arabic Morpho-Lexical Semantics Database:

Each term (which may span one or more words) is morphologically analyzed and PoS tagged, considering that:

• Any semantically neutral morpheme (one that does not contribute to attributing its lexical compound to its semantic field) is marked by -1000 as a don't care value. Ex: The definite article (ال) in the word (_jiyjl) has no semantic effect; i.e. the two different derivatives (CK IJILAJ l) are semantically equivalent. Hence -1000 is put in the prefix field as a don't care value.

• A morpheme whose explicit occurrence is necessary for its lexical compound to belong to a certain semantic field is marked by its positive morphological code.

Ex: The definite article (ال) in the word (Λl) is semantically determinant, hence its specific positive morphological code (9) is put in the prefix field. Similarly, the null suffix is semantically determinant, hence its positive morphological code (0) is put in the suffix field.

• If the PoS tag of a morpheme, not the specific morpheme itself, is necessary for its lexical compound to belong to a certain semantic field, the morpheme is marked by its negative PoS tag code.

Ex: The suffix (L) whose PoS sub vector contains a PoS tag (Femin) is semantically determinant in the word (IcL,). However, other compound suffixes like (... ciiL - Ujj_) which are common with the suffix (L) in the PoS tag (Femin) may be used with the same stem to produce words like (... ejUL, - lμs.L>) keeping them semantically mapped to the same semantic field. In such a case, we put (-48) which is the negative code of the PoS tag (Femin) in the suffix field.

The table in figure 8 shows a sample fragment of the forward Arabic lexical semantics Relational Database (RDB).
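The three coding rules above can be sketched as a single matching test per field. This is a minimal illustration, assuming that -1000 is never itself the negative of a PoS tag code and that any non-negative value (including the 0 of the null suffix) is a morpheme code. The function name and the codes 3, 5 and 17 are hypothetical; 9, 0 and -48 are the values quoted in the examples.

```python
DONT_CARE = -1000  # marks a semantically neutral morpheme


def field_matches(constraint, morpheme_code, pos_tags):
    """Test one field (prefix, pattern or suffix) of a constrained lexical compound.

    constraint:    the value stored in the DB field
    morpheme_code: the code of the morpheme found in the analyzed word
    pos_tags:      the set of PoS tag codes in that morpheme's label vector
    """
    if constraint == DONT_CARE:
        return True                         # any morpheme is accepted
    if constraint >= 0:
        return morpheme_code == constraint  # this specific morpheme is required
    return -constraint in pos_tags          # only the PoS tag is required


FEMIN = 48  # PoS tag code quoted in the text for Femin

neutral_prefix = field_matches(DONT_CARE, 9, set())
exact_prefix = field_matches(9, 9, set())
wrong_prefix = field_matches(9, 3, set())
femin_suffix = field_matches(-FEMIN, 17, {FEMIN, 5})
other_suffix = field_matches(-FEMIN, 17, {5})
```

A whole lexical compound would then match a word when all of its constrained fields match.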

IV- Lexical Compounds to Semantic Fields mapping (inverse Lexical Semantics DB) module:

• The forward lexical semantics RDB is inverted so that its primary key becomes a morphologically and PoS-tags constrained lexical compound which provokes the ID's of the possible semantic fields (i.e. word senses) that it may belong to.

• To maximize the coverage of the generable Arabic vocabulary, a special set of lexical compounds whose morphemes are all don't care ones except for the roots (of the 1st words) is added to the inverse RDB. Each one of these ground lexical compounds provokes the union of the semantic fields tied to all the derivatives of its root. The lexical compounds comprising the primary keys in the inverse RDB are arranged in ascending order, with sorting priority within each quadruple given first to the root, then the form, the prefix, and finally the suffix.

The table in figure 9 shows a sample fragment of the inverse Arabic lexical semantics RDB.
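The ground lexical compounds and the key ordering described above can be sketched as follows. This is an illustration for single-word compounds only. The tuple layout (prefix, root, form, suffix), the function names, and all codes and field IDs are assumptions made for the example.

```python
DONT_CARE = -1000


def add_ground_compounds(inverse):
    """Return a copy of the inverse mapping extended with one ground compound per root.

    inverse maps (prefix, root, form, suffix) to a set of semantic field IDs.
    """
    result = {key: set(fields) for key, fields in inverse.items()}
    for (_prefix, root, _form, _suffix), fields in inverse.items():
        ground = (DONT_CARE, root, DONT_CARE, DONT_CARE)
        result.setdefault(ground, set()).update(fields)
    return result


def key_order(compound):
    """Sort priority: root first, then form, prefix, and suffix."""
    prefix, root, form, suffix = compound
    return (root, form, prefix, suffix)


# Hypothetical inverse entries.
inverse = {
    (9, 120, 57, 0): {301},
    (DONT_CARE, 120, 63, -48): {302, 417},
    (DONT_CARE, 77, 12, DONT_CARE): {88},
}
with_ground = add_ground_compounds(inverse)
ordered = sorted(with_ground, key=key_order)
```

With this ordering, the ground compound of each root sorts immediately before that root's more specific compounds.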

V- RDB inverter module:

This module is used to automatically invert the Semantic Fields to Lexical Compounds (forward RDB) into Lexical Compounds to Semantic Fields (inverse RDB). This inversion is implemented using the standard SQL (Structured Query Language).
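As a hedged sketch of what such an SQL inversion might look like, the fragment below uses Python's built-in sqlite3 module with an in-memory database. The table names, column names, and sample rows are hypothetical; the actual schema is only shown in figures 8 and 9.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE forward_rdb (
    field_id INTEGER, prefix INTEGER, root INTEGER, form INTEGER, suffix INTEGER
);
CREATE TABLE inverse_rdb (
    prefix INTEGER, root INTEGER, form INTEGER, suffix INTEGER, field_id INTEGER,
    PRIMARY KEY (root, form, prefix, suffix, field_id)
);
""")

# Forward rows: one semantic field ID per constrained lexical compound.
conn.executemany(
    "INSERT INTO forward_rdb VALUES (?, ?, ?, ?, ?)",
    [
        (301, 9, 120, 57, 0),
        (302, -1000, 120, 63, -48),
        (417, -1000, 120, 63, -48),
    ],
)

# The inversion: re-key the same rows by lexical compound.
conn.execute("""
INSERT INTO inverse_rdb (prefix, root, form, suffix, field_id)
SELECT DISTINCT prefix, root, form, suffix, field_id
FROM forward_rdb
ORDER BY root, form, prefix, suffix
""")

# Look up every semantic field a given compound may belong to.
rows = conn.execute(
    """
    SELECT field_id FROM inverse_rdb
    WHERE prefix = ? AND root = ? AND form = ? AND suffix = ?
    ORDER BY field_id
    """,
    (-1000, 120, 63, -48),
).fetchall()
fields = [row[0] for row in rows]
```

Placing the root first in the composite primary key mirrors the sort priority stated for the inverse RDB.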

VI- DB of the semantic relations among the set of the semantic fields:

This is the database that carries the major semantic relations (antonymy, hyponymy, ...) among the semantic fields, in the form of one matrix per relation.

Given the forward/backward mapping between the lexical compounds and the semantic fields, as well as the mapping among the semantic fields themselves (via the semantic relations DB), we have an indirect method for semantically mapping the Arabic lexical compounds (hence, full-form words) to each other, by the following sequence:

• The two i/p Arabic texts are morphologically analyzed and PoS tagged in order to produce the Arabic lexical compounds sequence for each of them.

• Each Arabic lexical compound is then mapped to all its possible word senses using the inverse lexical semantics DB.

• These possible senses are then disambiguated using the word sense statistical disambiguation module to find the most probable sense for each compound among all the candidates.

• Using the semantic relations RDB, we can find the relations between any two given senses. Finally, the semantic relations between words are indirectly inferred.
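The sequence above can be sketched end to end for a single pair of words. Everything here is a stand-in: the analyzer is a lookup table, the disambiguator simply picks the lowest field ID instead of running statistical WSD, each relation matrix is stored sparsely as a set of (field, field) pairs, and the English words, compound keys, and field IDs are invented for the example.

```python
def semantic_relations(word_a, word_b, analyze, inverse_rdb, disambiguate, relations):
    """Infer the relations between two words through their semantic fields."""
    # Steps 1-2: word -> lexical compound -> candidate semantic fields.
    senses_a = inverse_rdb.get(analyze(word_a), set())
    senses_b = inverse_rdb.get(analyze(word_b), set())
    # Step 3: keep one sense per word.
    sense_a, sense_b = disambiguate(senses_a), disambiguate(senses_b)
    if sense_a is None or sense_b is None:
        return []
    # Step 4: read the relations off the field-to-field matrices.
    return [name for name, pairs in relations.items() if (sense_a, sense_b) in pairs]


# Toy stand-ins for the analyzer, the inverse RDB, and the relation matrices.
toy_analysis = {"hot": ("LC", 1), "cold": ("LC", 2)}
inverse_rdb = {("LC", 1): {10, 11}, ("LC", 2): {20}}
relations = {"antonymy": {(10, 20), (20, 10)}, "hyponymy": set()}


def pick_lowest(senses):
    """Placeholder for the WSD module: choose the smallest field ID."""
    return min(senses) if senses else None


result = semantic_relations(
    "hot", "cold", toy_analysis.get, inverse_rdb, pick_lowest, relations
)
# result is ["antonymy"]
```

No word-to-word relation is stored anywhere; the answer is derived entirely from the field-level matrices.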

- Mapping from the lexical compounds to the semantic fields and building the relations among the semantic fields, instead of building the relations directly among final-form words, reduces the order of complexity (in storage, human labor, and computational cost) since:

1. The order of complexity of building the relations directly among full-form words is O(V²), where V is the vocabulary size. For this highly derivative and inflective language, hundreds of millions is not an exaggeration.

2. The order of complexity of our indirect semantic mapping approach becomes O(S² + S·V̂), where S is the number of the semantic fields (hundreds) and V̂ is the number of core lexical compounds (tens of thousands).

3. Since both S and V̂ ≪ V, O(S² + S·V̂) is much more computationally tractable than O(V²).
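A quick worked comparison makes the gap concrete. The numbers are illustrative only: V is an assumed vocabulary size in the "hundreds of millions" range, while S and the number of core lexical compounds (named V_hat below) borrow the counts of about 1,800 semantic fields and 36,000 core terms quoted earlier for the primary source.

```python
V = 200_000_000  # assumed full-form vocabulary size
S = 1_800        # semantic fields in the primary source
V_hat = 36_000   # core lexical compounds, taken from the core terms count

direct = V ** 2                 # relations stored word against word
indirect = S ** 2 + S * V_hat   # field-to-field plus field-to-compound

ratio = direct // indirect
# direct is 4 * 10**16 entries; indirect is 68,040,000 entries
```

Under these assumptions the indirect scheme needs several hundred million times fewer entries than the direct one.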

Brief description of the drawings;

Figure 0_a: shows the outer layout of the device. It consists of the following parts:

Display (B₁) module: To display the i/p data from the user and the o/p of the system.

Keyboard (B₂) module: To get the i/p data from the user.

Mode of Operation Selection Button (B₃): To let the user select one of the three modes of operation of the system (Semantic Factorization mode, Semantic Similarity mode, Semantic Provocation mode).

Figure 0_b: shows the internal layout of the device. It consists of the following parts:

Display Controller (U₁) module: To adapt the data format and the data voltage level so that the data can be displayed on the display of the device.

Field Programmable Gate Array (FPGA) (U₂) module: To perform the morphological analysis and the PoS tagging processes. The FPGA can provide the fast operation that is needed by these two processes.

Random Access Memory (RAM) (U₃) module: To store the temporary results.

Read Only Memory (ROM) (U₄) module: To store the permanent system settings.

Micro Controller (U₅) module: To control all system operations and to apply the system that is shown in figure 1.

Figure 1: represents the Architecture of our Arabic lexical semantic analysis system, which is composed of the following modules:

Arabic Morphological Analyzer & PoS tagger (M₁) module: Analyzes input Arabic words into their canonical Arabic morphological model (prefix, root, form, and suffix) and produces their PoS tags vectors.

Word sense statistical disambiguation (M₂) module: Disambiguates the input senses to find the most probable sense among them.

Arabic forward lexical semantics RDB; D₁ module: RDB that contains the ID of the semantic field as the primary key, which provokes the Arabic lexical compounds belonging to it.

Arabic inverse lexical semantics RDB; D₂ module: RDB that contains the lexical compound as the primary key, which provokes all its word senses; i.e. the possible semantic fields triggered by that compound.

RDB inverter (M₃) module: This module is used to automatically produce D₂ by the SQL inversion of D₁.

Semantic relations RDB among the set of the semantic fields; D₃ module: This RDB codes the major semantic relations (antonymy, hyponymy, ...) among the semantic fields, in the form of one matrix per relation.

Figure 2: represents the 9 types of morphemes in our Arabic lexicon of ArabMorpho®.

Figure 3: represents ArabMorpho®'s canonical lexical structure of sample Arabic words.

Figure 4: represents the position of PoS tagging on the standard NLP ladder.

Figure 5: represents our Arabic PoS tags set.

Figure 6: represents sample Arabic PoS labels of sample Arabic morphemes from RDI's ArabMorpho® lexicon.

Figure 7: represents the resulting Arabic PoS tagging of the words of a sample phrase using ArabTagger®.

Figure 8: represents a sample fragment of our forward Arabic lexical semantics RDB.

Figure 9: represents a sample fragment of our inverse Arabic lexical semantics RDB.

Claims

1. Relying on a large-scale Arabic morphological analyzer & PoS tagger, along with the concept of morphologically & PoS tags constrained lexical compounds, the presented system is a morpheme-based not a vocabulary-based one (like its other relevant rivals) which maximizes the coverage over the highly derivative and inflective Arabic language.

2. Based on the semantic fields theory, semantics are described using a compact basis set of semantic fields (word senses). The lexical semantics knowledge base of the system is coded in the form of RDBs where lexical compounds are mapped to semantic fields and vice versa. This basis set of semantic fields, along with the RDBs, is originally crafted for the Arabic language and neither borrowed nor bootstrapped from other languages.

3. Based on the previous two claims, our system produces what may be called Arabic lexical semantic text factorization which can enhance the performance of Arabic text mining applications by reducing the dimensionality and concentrating the correlation between the entities subject to the statistical methods typically deployed for data mining.

The mathematical explanation is that as linguistic analysis gets deeper (resolving more and more complex relations), the raw text is factorized into more fundamental, and typically less numerous, atomic entities (here, morphemes, PoS tags, and semantic fields). This in turn reveals higher (statistical) correlations and reduces the dimensionality of the problem, which sharpens the effectiveness of the mining process.

4. Instead of the infeasible full semantic mapping directly among the whole Arabic vocabulary, with a complexity of O(V²) where V is the huge vocabulary size of Arabic, our system does it using a two-level mapping that may be coded as W_i ↔ LC_m ↔ SF_u ↔ SF_v ↔ LC_n ↔ W_j. Arabic words W_i are analyzed (using the Arabic morphological analyzer and PoS tagger) into lexical compounds LC_m, which are in turn mapped to semantic fields SF_u (using the inverse lexical semantics RDB). As the semantic fields are semantically related using one matrix per relation, the other half of the mapping process may go in the inverse direction, so that Arabic words are in totality semantically mapped indirectly.

The order of complexity of our indirect semantic mapping approach becomes O(S² + S·V̂), where S is the number of the semantic fields (hundreds) and V̂ is the number of core lexical compounds (tens of thousands). Since both S and V̂ ≪ V, O(S² + S·V̂) is much more computationally tractable than O(V²).