EP1652107A1

EP1652107A1 - Method and system for categorizing Arabic text

Info

Publication number: EP1652107A1
Application number: EP04740317A
Authority: EP
Inventors: Hisham El-Shishiny
Original assignee: Compagnie IBM France SAS; International Business Machines Corp
Current assignee: Compagnie IBM France SAS; International Business Machines Corp
Priority date: 2003-07-23
Filing date: 2004-05-13
Publication date: 2006-05-03
Also published as: IL173306A0; WO2005015434A1

Abstract

The present invention is directed to a system, method and computer program for categorizing Arabic documents based on the text content. More particularly, the invention is a frequency based method using a learning approach that exploits Arabic lexical look-up, Arabic morphological analysis, and a number of interconnected Arabic linguistic filters to categorize Arabic texts. The present Arabic text categorization method comprises two phases, namely the learning phase and the automatic categorization phase. During the learning phase, lemma forms (called stems) of specific noun types are extracted from manually categorized Arabic texts and then filtered, using Arabic morphological analysis. Based on these lemma forms and on the normalized frequency of these lemma forms for each predefined category, it is possible to automatically assign new Arabic texts to predefined categories during the automatic text categorization phase. As a result, categorization of Arabic texts is more precise and less sensitive to noise than prior art solutions. The present invention relates to a method for automatically assigning Arabic texts to predefined categories supporting information retrieval. For example, the method can be used to filter out Arabic documents that are unlikely to contain extractable data and can be used to route Arabic texts to processing mechanisms that are category specific.

Description

Method and System for Categorizing Arabic Text

Technical field of the invention

The present invention relates to text categorization, machine learning and knowledge management, and more particularly to a method and system for categorizing Arabic text.

Background art

Technical Field

The automated assignment of natural language texts to predefined categories based on their content is known as "text categorization". Text categorization attempts to reproduce human categorization judgments and is becoming increasingly important in information retrieval systems. It is an essential part of any document processing system.

A category is a class and a class is a specifically defined division in a system of classification. The purpose of text categorization is to assign subject categories to documents, based on their contents, to support information retrieval. There is an increasing use of categorization for data extraction, as it can be used to filter out documents that are unlikely to contain extractable data. Categorization can also be used to route texts to category specific processing mechanisms.

Initial Problem

The present invention relates to the initial problem of how to reproduce human categorization judgments and automatically assign predefined categories to Arabic free text documents, based on their content. These Arabic free text documents may be news, stories, electronic mail, technical abstracts, etc.

Text categorization is a challenging problem for machine learning. Feature sets are huge, on the order of tens of thousands of features when words are used and even more if phrases are allowed. Natural language features have a number of properties such as ambiguity and skewed distributions, that make automatic categorization a difficult problem.

Prior Art

Prior art systems mainly address the problem of constructing text categorization systems by two approaches.

• The first approach is based on knowledge engineering and uses methods similar to those used in Expert Systems for classification. Categorization rules are constructed based on relationships between some text features and the output categories.

• The second approach is to use existing bodies of manually categorized texts in constructing categorizers by inductive learning. A training set of texts is used to extract some characteristic features of each category. Then, based on these features new texts are categorized.

A variety of learning approaches have been adopted by researchers, including Bayesian classification, decision trees and factor analysis. Bayesian classification uses Bayes' rule to estimate the category assignment probabilities, then assigns to a text those categories with high probabilities. Decision trees are used to recursively subdivide the training examples into subsets, based on an information gain metric.

Another approach is to compute the correlation coefficients between texts based on their word coordinates and then to perform a factor analysis. In this manner, the initially high dimension (the number of words) may be reduced.

Another approach is the k-nearest neighbor approach where the new text is assigned to the category of the majority of its nearest neighbors. Given an arbitrary input text, the system ranks its nearest neighbors among training texts, and uses the categories of the k top-ranking neighbors to predict the categories of the input text. The similarity score of each neighbor text is used as the weight of its categories, and the sum of category weights over the k nearest neighbors is used for category ranking.
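The similarity-weighted k-nearest neighbor ranking described above can be sketched as follows. This is only an illustration of the prior-art idea, not part of the claimed method: the use of cosine similarity over bag-of-words counts, and the names `cosine` and `knn_rank`, are assumptions introduced here, since the text does not specify a similarity measure.

```python
import math
from collections import Counter, defaultdict


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors (assumed measure)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_rank(text_tokens, training, k=3):
    """Rank categories for a text using its k most similar training texts.

    training: list of (tokens, categories) pairs.
    Each neighbor's similarity is added to the weight of its categories.
    """
    query = Counter(text_tokens)
    scored = [(cosine(query, Counter(tokens)), cats) for tokens, cats in training]
    neighbours = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    weights = defaultdict(float)
    for similarity, cats in neighbours:
        for cat in cats:
            weights[cat] += similarity
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)


# Toy English tokens stand in for real training texts.
training = [
    (["goal", "match", "team"], ["sport"]),
    (["team", "coach", "match"], ["sport"]),
    (["bank", "market", "price"], ["economy"]),
]
ranked = knn_rank(["match", "team", "win"], training, k=2)
```

With `k=2` only the two sport texts are neighbors of the query, so `ranked` contains a single entry for "sport" whose weight is the sum of the two similarities.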

Residual Problem

The current text categorization approaches, previously described, are mainly constructed and tested on English, German and French test corpora (a corpus is a collection of writings used for linguistic analysis), assuming that the solution will work as well with other languages. The Arabic language has special characteristics and features (that do not necessarily exist in other languages) which, when taken into consideration, give a more efficient categorization for Arabic texts.

The previously described prior methods are not satisfactory if used for categorizing Arabic texts, because they do not properly take into consideration the morphological features and the special characteristics of the Arabic language. This deficiency makes these methods less precise and more sensitive to noise when used to categorize Arabic texts.

Objects of the invention

It is an object of the present invention to provide an improved system and method for categorizing Arabic documents based on the text content.

It is another object of the invention to provide a system and method that use special Arabic morphological features and characteristics of Arabic texts to make the resulting categorization more precise and less sensitive to noise.

Summary of the invention

The present invention is directed to a system, method and computer program for categorizing Arabic documents based on the text content. More particularly, the invention is a frequency based method using a learning approach that exploits :

• Arabic lexical look-up,

• Arabic morphological analysis, and

• a number of interconnected Arabic linguistic filters, to categorize Arabic texts.

The present Arabic text categorization method comprises two phases namely :

• the learning phase, and

• the automatic categorization phase.

During the learning phase, lemma forms (called stems) of specific noun types are extracted from manually categorized Arabic texts and then filtered, using Arabic morphological analysis. Based on these lemma forms and on the normalized frequency of these lemma forms for each predefined category, it is possible to automatically assign new Arabic texts to predefined categories during the automatic text categorization phase. As a result, categorization of Arabic texts is more precise and less sensitive to noise than prior art solutions.

The present invention relates to a method for automatically assigning Arabic texts to predefined categories supporting information retrieval. For example, the method can be used to filter out Arabic documents that are unlikely to contain extractable data and can be used to route Arabic texts to processing mechanisms that are category specific.

Claim 1 recites a method, for use in a computer system, categorizing Arabic texts based on their content, comprising the step of :

• learning Arabic morphological features from selected Arabic texts belonging to predefined categories, said step comprising, for each selected Arabic text belonging to a predefined category, the further steps of :

  • extracting stems of words corresponding to specific Arabic word types;

  • computing a normalized frequency for each extracted stem;

  • listing extracted stems having a normalized frequency above a given threshold;

  • comparing said list of stems with lists of stems associated with other predefined categories;

  • removing from said list, stems appearing in at least one other list so that a unique list of keywords and their normalized frequencies is associated with each predefined category, stems in each unique list being defined as keywords for the corresponding predefined category.

Further embodiments of the invention are provided in the appended dependent claims.

The foregoing, together with other objects, features, and advantages of this invention can be better appreciated with reference to the following specification, claims and drawings.

Brief description of the drawings

The novel and inventive features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objects and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein :

• Figure 1 shows the main components of the system for categorizing Arabic texts according to the present invention.

• Figure 2 is a view of the learning sub-system comprised in the system for categorizing Arabic texts according to the present invention.

• Figure 3 is a view of the categorization sub-system comprised in the system for categorizing Arabic texts according to the present invention.

• Figure 4 is a flow chart showing the different steps of the learning phase of the system for categorizing Arabic texts according to the present invention.

• Figure 5 is a flow chart showing the different steps of the categorization phase of the system for categorizing Arabic texts according to the present invention.

Preferred embodiment of the invention

Main Components of the Arabic Text Categorization System

Figure 1 shows the different components comprised in the system 100 for categorizing Arabic texts according to the present invention. The system 100 includes :

• a Learning Sub-system 101 used to learn the Arabic morphological features of selected Arabic texts with predefined categories 105, and

• a Categorization Sub-system 102 used to assign an Arabic text 103 to a category 104, based on the results obtained by the learning sub-system 101.

Learning Sub-system

Figure 2 shows the Learning Sub-system 200 comprised in the system for categorizing Arabic texts according to the present invention.

The Sub-system 200 includes a Learning Engine 201 for learning the morphological features of manually categorized Arabic texts 206 by applying a first 202 and a second linguistic filter 205 on them.

• The first linguistic filter 202 :

  • identifies words belonging to specific Arabic word types in the Arabic manually categorized texts, using an Arabic morphology analyzer 203, and

  • extracts their lemma forms, called 'stems', using an Arabic lexical database 204 where 'stems' of Arabic words are stored.

• The second linguistic filter 205 filters the list of 'stems' extracted by the first linguistic filter 202 to remove the 'stop' like words, in data retrieval terminology, that are commonly used in any category.

The learning engine 201 uses the output information from the first linguistic filter 202 and the second linguistic filter 205, to produce lists of extracted keywords (keywords are significant or descriptive words) and their normalized frequencies 207 for the Arabic manually categorized texts 206.
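The chaining of the two linguistic filters can be sketched as below. This is a minimal illustration under stated assumptions: a real Arabic morphology analyzer and lexical database are replaced by a hypothetical in-memory table `LEXICON`, the entries are transliterated toy data whose word-type labels are illustrative rather than the output of any actual analyzer, and the names `Analysis`, `SELECTED_TYPES`, `first_filter` and `second_filter` are introduced here for the example only.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    stem: str       # lemma form, called 'stem' in the text
    word_type: str  # morphological class reported by the (hypothetical) analyzer


# Hypothetical stand-in for the morphology analyzer plus lexical database:
# surface form -> analysis. Toy, transliterated entries for illustration only.
LEXICON = {
    "al-kaatib": Analysis("kaatib", "active_participle"),
    "kaatibuun": Analysis("kaatib", "active_participle"),
    "maktab": Analysis("maktab", "noun_of_place"),
    "kataba": Analysis("kataba", "verb"),
    "yawm": Analysis("yawm", "native_noun"),
}

# Word types whose stems are kept, following the list given in the description.
SELECTED_TYPES = {
    "native_noun",
    "active_participle",
    "passive_noun",
    "noun_of_instrument",
    "noun_of_place",
    "verbal_noun",
}


def first_filter(words):
    """Keep the stems of words whose analysed type is one of the selected types."""
    return [
        LEXICON[w].stem
        for w in words
        if w in LEXICON and LEXICON[w].word_type in SELECTED_TYPES
    ]


def second_filter(stems, stop_stems):
    """Drop 'stop' like stems that are common to any category."""
    return [s for s in stems if s not in stop_stems]


words = ["al-kaatib", "kataba", "maktab", "kaatibuun", "yawm", "unknown"]
stems = first_filter(words)
kept = second_filter(stems, stop_stems={"yawm"})
```

In this toy run the verb and the unknown word are discarded by the first filter, the two inflected variants collapse to the same stem, and the assumed stop-like stem is removed by the second filter.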

Categorization Sub-system

Figure 3 shows the Categorization Sub-system comprised in the system for categorizing Arabic texts according to the present invention. The Sub-system 300 includes a categorization engine 301 for assigning new Arabic texts 306 to predefined categories 308. The categorization engine 301 applies a first 302 and a second linguistic filter 305 :

• The first linguistic filter 302 applied on the new Arabic text, extracts the lemma forms called 'stems' of all words belonging to specific Arabic word types, using an Arabic morphology analyzer 303, which makes use of an Arabic lexical database 304 where 'stems' of Arabic words are stored.

• The second linguistic filter 305 removes the 'stop' like words, in information retrieval terminology, commonly used in any category, from the list of extracted 'stems' of each new Arabic text.

The categorization engine 301 uses the output information from the first linguistic filter 302 and the second linguistic filter 305 together with the lists of keywords and their normalized frequencies for the predefined categories 307, to assign the new Arabic texts 306 to their categories 308.

Description of the Arabic Text Categorization Method

The invented Arabic text categorization method comprises two phases namely

• the learning phase, and

• the automatic categorization phase.

Description of the learning phase

Figure 4 is a flow chart showing the different steps of the learning phase according to the present invention.

During the learning phase, large Arabic text files (a few thousand words, as representative as possible of the texts that one expects to encounter in the future) 400, pertaining to specific predefined Arabic text categories, are collected by human transcribers. For each Arabic text file, pertaining to a predefined category, the following steps are performed:

• Step a 401. Applying a first linguistic filter that, based on Arabic morphological analysis and Arabic lexical look-up, obtains the lemma forms of the text words, called 'stems', which allow the linking of multiple inflected variants of the same Arabic word. According to the current invention, only the 'stems' of the following Arabic word types (not usually existing in other languages) need to be extracted :

  • native nouns (not derived from verbs);

  • the following selected Arabic noun types derived from verbs :

    • active participles, that are used in Arabic to describe the person/thing carrying out an action,

    • passive nouns, that are used in Arabic to describe the person/thing which is the object of an action,

    • nouns of instrument, which describe the implement used to perform an action,

    • nouns of place, which describe the place where an action happens,

    • verbal nouns, which generally relate to actions without time and are used in sentences where English would use the infinitive. Many verbal nouns have also acquired specific meanings in general circulation.

Nouns derived from verbs are constructions that follow predictable patterns that can be identified by Arabic morphological analysis. Research related to this invention revealed that Arabic text can be best characterized, for the purpose of text categorization, by the above mentioned Arabic word types.

• Step b 402.

  • Computing the normalized frequency of each stem extracted in step a 401 above, and

  • Collecting all stems having a normalized frequency (defined as the number of times a stem appears in the text divided by the total number of words in the text) above a given cut-off (threshold) level.

• Step c 403. When 'vowel marks' are not used on top of and underneath the Arabic text characters, which is usually the case in the majority of Arabic texts, there would be a multitude of possible vowel combinations for the same set of characters which constitute the word. All of these combinations would be correct in the sense that they would be valid forms, but not all of them would be correct in the syntax in which the word is used. Due to this ambiguity, some 'stop' like stems (in data retrieval terminology), commonly used in any text category, may exist in the list of stems produced in step b 402 and should be filtered out. Therefore, a second linguistic filter is applied on the collected stems in step b 402 above, to remove 'stop' like stems (in data retrieval terminology), by comparison with a given list of words comprising the stems of the 'stop' like words.

• Step d 404.

  • Comparing the final lists of identified stems for all categories,

  • Removing stems that appear in more than one category, and thereby

  • Obtaining a unique list of stems (considered as 'keywords') and their normalized frequencies (considered as weights) 407, said list being specific and characteristic of each predefined category.

If removing identical stems gives one or more category lists of stems having a number of stems below a given threshold, then the Arabic texts 400, from which those lists were extracted, are augmented 406 with additional words (a few hundred words) from texts related to those predefined categories.

Then steps a 401, b 402, c 403 and d 404 are repeated.
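Steps b 402 to d 404 of the learning phase can be sketched as follows, starting from stems already produced by step a 401. This is a hedged illustration, not the patented implementation: the function names, the English placeholder stems, and the numeric cut-off and minimum list size are assumptions chosen for the example. The normalized frequency follows the definition given in step b 402 (occurrences of a stem divided by the total number of words in the text).

```python
from collections import Counter


def normalized_frequencies(stems, total_words):
    """Step b: occurrences of each stem divided by the total word count of the text."""
    counts = Counter(stems)
    return {stem: n / total_words for stem, n in counts.items()}


def candidate_stems(stems, total_words, cutoff, stop_stems):
    """Steps b and c: keep stems above the cut-off, then drop 'stop' like stems."""
    freqs = normalized_frequencies(stems, total_words)
    return {s: f for s, f in freqs.items() if f > cutoff and s not in stop_stems}


def unique_keywords(per_category):
    """Step d: remove every stem that appears in more than one category list."""
    seen = Counter(stem for stems in per_category.values() for stem in stems)
    return {
        cat: {s: f for s, f in stems.items() if seen[s] == 1}
        for cat, stems in per_category.items()
    }


def categories_needing_more_text(keywords, min_keywords):
    """Categories whose keyword list is too short and whose texts should be augmented."""
    return [cat for cat, kws in keywords.items() if len(kws) < min_keywords]


# Placeholder stems (English tokens stand in for Arabic stems); 20 words per text.
sport = ["match"] * 4 + ["team"] * 3 + ["year"] * 2 + ["price"]
economy = ["price"] * 5 + ["market"] * 3 + ["team"] * 2 + ["year"] * 2
stop_stems = {"year"}  # assumed stop-like stem list

per_category = {
    "sport": candidate_stems(sport, 20, cutoff=0.08, stop_stems=stop_stems),
    "economy": candidate_stems(economy, 20, cutoff=0.08, stop_stems=stop_stems),
}
keywords = unique_keywords(per_category)
short = categories_needing_more_text(keywords, min_keywords=2)
```

Here "team" passes the cut-off in both categories and is therefore removed from both lists, which leaves the sport list with a single keyword. With the assumed minimum of two keywords, the sport texts would be augmented and the steps repeated.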

Description of the automatic categorization phase

Figure 5 is a flow chart showing the different steps of the automatic categorization phase according to the present invention.

During the automatic categorization phase, where predefined categories are automatically assigned to new Arabic texts, the following steps are performed for each new (uncategorized) Arabic text 500:

• Step a 501. Applying a first linguistic filter using Arabic morphological analysis and Arabic lexical look-up, on the words of the new Arabic text, to extract a list of the stems of all the following word types:

  • native nouns (not derived from verbs);

  • the following selected Arabic noun types derived from verbs :

    • active participles, that are used in Arabic to describe the person/thing carrying out an action,

    • passive nouns, that are used in Arabic to describe the person/thing which is the object of an action,

    • nouns of instrument, which describe the implement used to perform an action,

    • nouns of place, which describe the place where an action happens,

    • verbal nouns, which generally relate to actions without time and are used in sentences where English would use the infinitive. Many verbal nouns have also acquired specific meanings in general circulation.

• Step b 502. Due to a certain ambiguity in Arabic texts when 'vowel marks' are not used (as explained in step c 403 above), applying a second linguistic filter for removing the 'stop' like stems (in data retrieval terminology) from the list of stems extracted from the new Arabic text in step a 501 above, by comparison with a given list of words comprising the stems of the 'stop' like words.

• Step c 503.

  • Comparing the list of stems of the new Arabic text with the 'Keywords' of each predefined category, obtained in the learning phase, and

  • Computing for each predefined category the 'SCORE', that is, the sum, over all keywords pertaining to that predefined category, of the number of matches of the keyword in the list of stems of the new document multiplied by the keyword's normalized frequency (normalized frequencies are computed in step b 402 of the learning phase, Figure 4).

• Step d 507.

  • Assigning the new document to the text category giving the highest 'SCORE' for the new Arabic text, and

  • Assigning Arabic texts with a 'SCORE' below a given threshold 504 to a 'non categorized' category 505, and documents having an identical 'SCORE' (or very close scores) for more than one category 506 to more than one category 508.

While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.
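Steps c 503 and d 507 of the automatic categorization phase can be sketched as follows. This is a minimal illustration under stated assumptions: the keyword lists and weights are made-up placeholder values (English tokens stand in for Arabic stems), the names `category_scores` and `assign` are hypothetical, and the `tie_margin` parameter is one possible reading of "identical (or very close) scores", which the text does not quantify.

```python
from collections import Counter


def category_scores(text_stems, keywords):
    """Step c: for each category, sum (matches of keyword in text) * (keyword weight)."""
    counts = Counter(text_stems)
    return {
        cat: sum(counts[kw] * weight for kw, weight in kws.items())
        for cat, kws in keywords.items()
    }


def assign(text_stems, keywords, min_score, tie_margin=0.0):
    """Step d: pick the best category, or several on near ties, or 'non categorized'."""
    scores = category_scores(text_stems, keywords)
    best = max(scores.values(), default=0.0)
    if best < min_score:
        return ["non categorized"]
    # Categories whose score lies within tie_margin of the best score (assumed rule).
    return sorted(cat for cat, s in scores.items() if best - s <= tie_margin)


# Placeholder keyword lists with normalized frequencies as weights.
keywords = {
    "sport": {"match": 0.2, "team": 0.15},
    "economy": {"price": 0.25, "market": 0.15},
}

scores = category_scores(["match", "match", "team", "price"], keywords)
single = assign(["match", "match", "team", "price"], keywords, min_score=0.1)
none = assign(["bank"], keywords, min_score=0.1)
both = assign(["match", "price"], keywords, min_score=0.1, tie_margin=0.1)
```

In the first call the sport score (two matches of "match" and one of "team") exceeds the economy score, so the text goes to "sport". The second text matches no keyword and falls below the threshold. In the third call the two scores differ by less than the assumed margin, so both categories are returned.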

Claims

What we claim is:

1. A method, for use in a computer system, categorizing Arabic texts based on their content, said method comprising the preliminary step of :

• learning Arabic morphological features from selected Arabic texts belonging to predefined categories, said step comprising, for each selected Arabic text belonging to a predefined category, the further steps of :

  • extracting stems of words corresponding to specific Arabic word types;

  • computing a normalized frequency for each extracted stem;

  • listing extracted stems having a normalized frequency above a given threshold;

  • comparing said list of stems with lists of stems associated with other predefined categories;

  • removing from said list, stems appearing in at least one other list so that a unique list of keywords and their normalized frequencies is associated with each predefined category, stems in each unique list being defined as keywords for the corresponding predefined category.

2. The Method according to the preceding claim wherein said specific Arabic word types are part of the following Arabic word types :

• native nouns;

• noun types derived from verbs comprising :

  • active participles used in Arabic to describe a person/thing carrying out an action;

  • passive nouns used in Arabic to describe the person/thing which is the object of an action;

  • nouns of instrument describing the implement used to perform an action;

  • nouns of place describing the place where an action happens;

  • verbal nouns generally related to actions without time and used in sentences where English would use the infinitive.

3. The Method according to any one of the preceding claims wherein the step of extracting stems of words corresponding to specific Arabic word types, comprises the further steps of :

• identifying words belonging to specific Arabic word types by means of an Arabic morphological analyzer;

• extracting stems of identified words by means of an Arabic lexical database comprising stems of Arabic words.

4. The Method according to any one of the preceding claims wherein the step of listing extracted stems having a normalized frequency above a given threshold, comprises the further step of:

• removing from said list of stems having a normalized frequency above a given threshold, 'stop' like stems by comparison with a given list comprising stems of 'stop' like words.

5. The Method according to any one of the preceding claims wherein a normalized frequency is defined as the number of times a stem appears in the text divided by the total number of words in the text.

6. The method according to any one of the preceding claims wherein the step of removing from said list, stems appearing in at least one other list, comprises the further step of :

if the number of stems in said list is below a predefined number :

• augmenting the selected Arabic text with additional words from one or a plurality of Arabic texts belonging to the predefined category.

7. The Method according to any one of the preceding claims comprising the further step of :

• assigning a new Arabic text to a predefined category using the lists of keywords and the corresponding normalized frequencies associated with the predefined categories, said step comprising for this new Arabic text, the further steps of :

  • extracting stems of words corresponding to said specific Arabic word types;

  • assigning the new Arabic document to the category giving the highest match between the extracted stems and the keywords weighted with the normalized frequency associated with each keyword.

8. The Method according to the preceding claim wherein the step of assigning the new Arabic document to the category giving the highest match between extracted stems and keywords, comprises the further steps of :

• comparing the extracted stems with the keywords of each predefined category;

• computing, for each predefined category, the result of the sum, for all keywords belonging to said predefined category, of the number of matches of each keyword in the list of extracted stems multiplied by the keyword normalized frequency;

• assigning the new Arabic document to the category giving the highest result.

9. The Method according to any one of claims 7 to 8 wherein the step of extracting from the new Arabic text, stems of words corresponding to specific Arabic word types, comprises the further steps of :

10. The Method according to any one of claims 7 to 9 wherein the step of extracting stems of words corresponding to specific Arabic word types, comprises the further step of:

• removing from extracted stems, 'stop' like stems by comparison with a given list comprising stems of 'stop' like words.

11. A computer system comprising means adapted for carrying out the steps of the method according to any one of the preceding claims.

12. A computer program comprising instructions for carrying out the method according to any one of claims 1 to 10 when said computer program is executed on the system according to the preceding claim.