EP1652107A1 - Method and system for categorizing Arabic text

Info

Publication number
EP1652107A1
EP1652107A1 EP04740317A EP04740317A EP1652107A1 EP 1652107 A1 EP1652107 A1 EP 1652107A1 EP 04740317 A EP04740317 A EP 04740317A EP 04740317 A EP04740317 A EP 04740317A EP 1652107 A1 EP1652107 A1 EP 1652107A1
Authority
EP
European Patent Office
Prior art keywords
arabic
stems
words
text
list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP04740317A
Other languages
German (de)
French (fr)
Inventor
Hisham El-Shishiny
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Compagnie IBM France SAS
International Business Machines Corp
Original Assignee
Compagnie IBM France SAS
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Compagnie IBM France SAS and International Business Machines Corp
Priority to EP04740317A
Publication of EP1652107A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention is directed to a system, method and computer program for categorizing Arabic documents based on the text content. More particularly, the invention is a frequency-based method using a learning approach that exploits Arabic lexical look-up, Arabic morphological analysis, and a number of interconnected Arabic linguistic filters to categorize Arabic texts. The present Arabic text categorization method comprises two phases, namely the learning phase and the automatic categorization phase. During the learning phase, lemma forms (called stems) of specific noun types are extracted from manually categorized Arabic texts and then filtered, using Arabic morphological analysis. Based on these lemma forms and on the normalized frequency of these lemma forms for each predefined category, it is possible to automatically assign new Arabic texts to predefined categories during the automatic text categorization phase. As a result, categorization of Arabic texts is more precise and less sensitive to noise than prior art solutions. The present invention relates to a method for automatically assigning Arabic texts to predefined categories supporting information retrieval. For example, the method can be used to filter out Arabic documents that are unlikely to contain extractable data and can be used to route Arabic texts to processing mechanisms that are category-specific.

Description

Method and System for Categorizing Arabic Text
Technical field of the invention
The present invention relates to text categorization, machine learning and knowledge management, and more particularly to a method and system for categorizing Arabic text.
Background art
Technical Field
The automated assignment of natural language texts to predefined categories based on their content is known as "text categorization". Text categorization attempts to reproduce human categorization judgments and is becoming increasingly important in information retrieval systems. It is an essential part of any document processing system.
A category is a class, and a class is a specifically defined division in a system of classification. The purpose of text categorization is to assign subject categories to documents, based on their contents, to support information retrieval. Categorization is increasingly used for data extraction, as it can filter out documents that are unlikely to contain extractable data. Categorization can also be used to route texts to category-specific processing mechanisms.
Initial Problem
The present invention relates to the initial problem of how to reproduce human categorization judgments and automatically assign predefined categories to Arabic free text documents, based on their content. These Arabic free text documents may be news, stories, electronic mail, technical abstracts, etc.
Text categorization is a challenging problem for machine learning. Feature sets are huge, on the order of tens of thousands of features when words are used, and even more if phrases are allowed. Natural language features have a number of properties, such as ambiguity and skewed distributions, that make automatic categorization a difficult problem.
Prior Art
Prior art systems mainly address the problem of constructing text categorization systems by two approaches.
• The first approach is based on knowledge engineering and uses methods similar to those used in Expert Systems for classification. Categorization rules are constructed based on relationships between some text features and the output categories.
• The second approach is to use existing bodies of manually categorized texts in constructing categorizers by inductive learning. A training set of texts is used to extract some characteristic features of each category. Then, based on these features, new texts are categorized.
A variety of learning approaches have been adopted by researchers, including Bayesian classification, decision trees and factor analysis. Bayesian classification uses Bayes' rule to estimate the category assignment probabilities, then assigns to a text those categories with high probabilities. Decision trees are used to recursively subdivide the training examples into subsets, based on an information gain metric.
Another approach is to compute the correlation coefficients between texts based on their word coordinates and then to perform a factor analysis. In this manner, the initially high dimension (the number of words) may be reduced.
Another approach is the k-nearest neighbor approach, where the new text is assigned to the category of the majority of its nearest neighbors. Given an arbitrary input text, the system ranks its nearest neighbors among training texts, and uses the categories of the k top-ranking neighbors to predict the categories of the input text. The similarity score of each neighbor text is used as the weight of its categories, and the sum of category weights over the k nearest neighbors is used for category ranking.
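The weighted k-nearest neighbor ranking described above can be illustrated with a minimal Python sketch. It assumes that similarity scores between the input text and each training text have already been computed by some means; the function name `rank_categories` and the sample data are hypothetical and do not come from any prior art system.

```python
from collections import defaultdict


def rank_categories(neighbors, k):
    """Rank categories from (similarity, categories) pairs of training texts.

    `neighbors` holds one (similarity, [category, ...]) tuple per training
    text; how the similarity is computed is deliberately left open here.
    """
    # Keep the k training texts most similar to the input text.
    top_k = sorted(neighbors, key=lambda pair: pair[0], reverse=True)[:k]

    # The similarity score of each neighbor is the weight of its categories.
    weights = defaultdict(float)
    for similarity, categories in top_k:
        for category in categories:
            weights[category] += similarity

    # Categories with the highest summed weight come first.
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)


# Illustrative data only: similarity scores and category labels are made up.
neighbors = [
    (0.9, ["sports"]),
    (0.8, ["politics"]),
    (0.7, ["politics", "economy"]),
    (0.1, ["sports"]),
]
ranking = rank_categories(neighbors, k=3)
```

With k = 3 the least similar training text is ignored, so "politics" outranks "sports" because two moderately similar neighbors outweigh one very similar neighbor.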
Residual Problem
The current text categorization approaches, previously described, are mainly constructed and tested on English, German and French test corpora (a corpus is a collection of writings used for linguistic analysis), assuming that the solution will work as well with other languages. The Arabic language has special characteristics and features (that do not necessarily exist in other languages) which, when taken into consideration, give a more efficient categorization of Arabic texts.
The previously described prior methods are not satisfactory if used for categorizing Arabic texts, because they do not properly take into consideration the morphological features and the special characteristics of the Arabic language. This deficiency makes these methods less precise and more sensitive to noise when used to categorize Arabic texts.
Objects of the invention
It is an object of the present invention to provide an improved system and method for categorizing Arabic documents based on the text content.
It is another object of the invention to provide a system and method that use special Arabic morphological features and characteristics of Arabic texts to make the resulting categorization more precise and less sensitive to noise.
Summary of the invention
The present invention is directed to a system, method and computer program for categorizing Arabic documents based on the text content. More particularly, the invention is a frequency-based method using a learning approach that exploits the following to categorize Arabic texts:
• Arabic lexical look-up,
• Arabic morphological analysis, and
• a number of interconnected Arabic linguistic filters.
The present Arabic text categorization method comprises two phases, namely:
• the learning phase, and
• the automatic categorization phase.
During the learning phase, lemma forms (called stems) of specific noun types are extracted from manually categorized Arabic texts and then filtered, using Arabic morphological analysis. Based on these lemma forms and on the normalized frequency of these lemma forms for each predefined category, it is possible to automatically assign new Arabic texts to predefined categories during the automatic text categorization phase. As a result, categorization of Arabic texts is more precise and less sensitive to noise than prior art solutions.
The present invention relates to a method for automatically assigning Arabic texts to predefined categories supporting information retrieval. For example, the method can be used to filter out Arabic documents that are unlikely to contain extractable data and can be used to route Arabic texts to processing mechanisms that are category specific.
Claim 1 recites a method, for use in a computer system, categorizing Arabic texts based on their content, comprising the step of:
• learning Arabic morphological features from selected Arabic texts belonging to predefined categories, said step comprising, for each selected Arabic text belonging to a predefined category, the further steps of:
  • extracting stems of words corresponding to specific Arabic word types;
  • computing a normalized frequency for each extracted stem;
  • listing extracted stems having a normalized frequency above a given threshold;
  • comparing said list of stems with lists of stems associated with other predefined categories;
  • removing from said list, stems appearing in at least one list so that a unique list of keywords and their normalized frequencies is associated with each predefined category, stems in each unique list being defined as keywords for the corresponding predefined category.
Further embodiments of the invention are provided in the appended dependent claims.
The foregoing, together with other objects, features, and advantages of this invention can be better appreciated with reference to the following specification, claims and drawings.
Brief description of the drawings
The novel and inventive features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objects and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
• Figure 1 shows the main components of the system for categorizing Arabic texts according to the present invention.
• Figure 2 is a view of the learning sub-system comprised in the system for categorizing Arabic texts according to the present invention.
• Figure 3 is a view of the categorization sub-system comprised in the system for categorizing Arabic texts according to the present invention.
• Figure 4 is a flow chart showing the different steps of the learning phase of the system for categorizing Arabic texts according to the present invention.
• Figure 5 is a flow chart showing the different steps of the categorization phase of the system for categorizing Arabic texts according to the present invention.
Preferred embodiment of the invention
Main Components of the Arabic Text Categorization System
Figure 1 shows the different components comprised in the system 100 for categorizing Arabic texts according to the present invention. The system 100 includes:
• a Learning Sub-system 101 used to learn the Arabic morphological features of selected Arabic texts with predefined categories 105, and
• a Categorization Sub-system 102 used to assign an Arabic text 103 to a category 104, based on the results obtained by the learning sub-system 101.
Learning Sub-system
Figure 2 shows the Learning Sub-system 200 comprised in the system for categorizing Arabic texts according to the present invention.
The Sub-system 200 includes a Learning Engine 201 for learning the morphological features of manually categorized Arabic texts 206 by applying a first linguistic filter 202 and a second linguistic filter 205 to them.
• The first linguistic filter 202:
  • identifies words belonging to specific Arabic word types in the manually categorized Arabic texts, using an Arabic morphology analyzer 203, and
  • extracts their lemma forms, called 'stems', using an Arabic lexical database 204 where 'stems' of Arabic words are stored.
• The second linguistic filter 205 filters the list of 'stems' extracted by the first linguistic filter 202 to remove the 'stop'-like words (in data retrieval terminology) that are commonly used in any category.
The learning engine 201 uses the output information from the first linguistic filter 202 and the second linguistic filter 205, to produce lists of extracted keywords (keywords are significant or descriptive words) and their normalized frequencies 207 for the Arabic manually categorized texts 206.
Categorization Sub-system
Figure 3 shows the Categorization Sub-system comprised in the system for categorizing Arabic texts according to the present invention. The Sub-system 300 includes a categorization engine 301 for assigning new Arabic texts 306 to predefined categories 308. The categorization engine 301 applies a first linguistic filter 302 and a second linguistic filter 305:
• The first linguistic filter 302, applied to the new Arabic text, extracts the lemma forms, called 'stems', of all words belonging to specific Arabic word types, using an Arabic morphology analyzer 303, which makes use of an Arabic lexical database 304 where 'stems' of Arabic words are stored.
• The second linguistic filter 305 removes the 'stop'-like words (in information retrieval terminology), commonly used in any category, from the list of extracted 'stems' of each new Arabic text.
The categorization engine 301 uses the output information from the first linguistic filter 302 and the second linguistic filter 305 together with the lists of keywords and their normalized frequencies for the predefined categories 307, to assign the new Arabic texts 306 to their categories 308.
Description of the Arabic Text Categorization Method
The invented Arabic text categorization method comprises two phases, namely:
• the learning phase, and
• the automatic categorization phase.
Description of the learning phase
Figure 4 is a flow chart showing the different steps of the learning phase according to the present invention.
During the learning phase, large Arabic text files 400 (a few thousand words, as representative as possible of the texts that one expects to encounter in the future), pertaining to specific predefined Arabic text categories, are collected by human transcribers. For each Arabic text file pertaining to a predefined category, the following steps are performed:
• Step a 401. Applying a first linguistic filter that, based on Arabic morphological analysis and Arabic lexical look-up, obtains the lemma forms of the text words, called 'stems', which allow the linking of multiple inflected variants of the same Arabic word. According to the current invention, only the 'stems' of the following Arabic word types (not usually existing in other languages) need to be extracted:
  • native nouns (not derived from verbs);
  • the following selected Arabic noun types derived from verbs:
    • active participles, which are used in Arabic to describe the person/thing carrying out an action,
    • passive nouns, which are used in Arabic to describe the person/thing which is the object of an action,
    • nouns of instrument, which describe the implement used to perform an action,
    • nouns of place, which describe the place where an action happens,
    • verbal nouns, which generally relate to actions without time and are used in sentences where English would use the infinitive. Many verbal nouns have also acquired specific meanings in general circulation.
Nouns derived from verbs are constructions that follow predictable patterns that can be identified by Arabic morphological analysis. Research related to this invention revealed that Arabic text can be best characterized, for the purpose of text categorization, by the above mentioned Arabic word types.
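The selection performed by the first linguistic filter in step a 401 can be illustrated with a minimal Python sketch. The morphology analyzer and lexical database are not specified in the text, so a small hand-written lookup table (`TOY_ANALYSES`, using transliterated words) stands in for them here; that table, the `WordType` enumeration and the function name are assumptions made purely for illustration.

```python
from enum import Enum


class WordType(Enum):
    NATIVE_NOUN = "native noun"
    ACTIVE_PARTICIPLE = "active participle"
    PASSIVE_NOUN = "passive noun"
    NOUN_OF_INSTRUMENT = "noun of instrument"
    NOUN_OF_PLACE = "noun of place"
    VERBAL_NOUN = "verbal noun"
    OTHER = "other"  # verbs, particles, etc.: not retained by the filter


# Hypothetical stand-in for the morphology analyzer and lexical database:
# maps a (transliterated) surface word to its stem and word type.
TOY_ANALYSES = {
    "al-katib": ("katib", WordType.ACTIVE_PARTICIPLE),  # "the writer"
    "maktab": ("maktab", WordType.NOUN_OF_PLACE),       # "office"
    "miftah": ("miftah", WordType.NOUN_OF_INSTRUMENT),  # "key"
    "kataba": ("kataba", WordType.OTHER),               # a verb
    "fi": ("fi", WordType.OTHER),                       # a preposition
}


def first_linguistic_filter(words, analyses=TOY_ANALYSES):
    """Return the stems of the words whose type is one of the retained types."""
    stems = []
    for word in words:
        # Words unknown to the toy table are treated as not retained.
        stem, word_type = analyses.get(word, (word, WordType.OTHER))
        if word_type is not WordType.OTHER:
            stems.append(stem)
    return stems


stems = first_linguistic_filter(["kataba", "al-katib", "fi", "maktab", "unknown"])
```

In this sketch only the active participle and the noun of place survive; the verb, the preposition and the unanalyzed word are dropped. A real system would replace the lookup table with an actual morphological analysis of undiacritized Arabic text.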
• Step b 402.
  • Computing the normalized frequency of each stem extracted in step a 401 above, and
  • Collecting all stems having a normalized frequency (defined as the number of times a stem appears in the text divided by the total number of words in the text) above a given cut-off (threshold) level.
• Step c 403. When 'vowel marks' are not written above and below the Arabic text characters, which is the case in the majority of Arabic texts, many vowel combinations are possible for the same set of characters constituting a word. All of these combinations are valid word forms, but not all of them are correct in the syntactic context in which the word is used. Because of this ambiguity, some 'stop'-like stems (in data retrieval terminology), commonly used in any text category, may remain in the list of stems produced in step b 402 and should be filtered out. Therefore, a second linguistic filter is applied to the stems collected in step b 402 above, to remove 'stop'-like stems by comparison with a given list of words comprising the stems of the 'stop'-like words.
• Step d 404.
  • Comparing the final lists of identified stems for all categories,
  • Removing stems that appear in more than one category, and thereby
  • Obtaining a unique list of stems (considered as 'keywords') and their normalized frequencies (considered as weights) 407, said list being specific and characteristic of each predefined category.
If removing identical stems leaves one or more category lists with fewer stems than a given threshold, then the Arabic texts 400 from which those lists were extracted are augmented 406 with additional words (a few hundred words) from texts related to those predefined categories. Then steps a 401, b 402, c 403 and d 404 are repeated.
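Steps b 402 to d 404 of the learning phase can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the stems are taken as already extracted by step a 401, placeholder English tokens stand in for Arabic stems, and all function names, the cut-off value, the stop-stem list and the minimum keyword count are hypothetical.

```python
from collections import Counter


def frequent_stems(stems, total_words, cutoff):
    """Step b: keep stems whose normalized frequency exceeds the cut-off.

    Normalized frequency = occurrences of the stem / total words in the text.
    """
    counts = Counter(stems)
    return {
        stem: count / total_words
        for stem, count in counts.items()
        if count / total_words > cutoff
    }


def remove_stop_stems(stem_frequencies, stop_stems):
    """Step c: drop stems found in the given list of 'stop'-like stems."""
    return {s: f for s, f in stem_frequencies.items() if s not in stop_stems}


def unique_keywords(stems_by_category):
    """Step d: keep only stems that occur in exactly one category list."""
    occurrences = Counter(
        stem for stems in stems_by_category.values() for stem in stems
    )
    return {
        category: {s: f for s, f in stems.items() if occurrences[s] == 1}
        for category, stems in stems_by_category.items()
    }


def categories_to_augment(keywords_by_category, min_keywords):
    """Categories whose keyword list is too short and needs more training text."""
    return [c for c, kw in keywords_by_category.items() if len(kw) < min_keywords]


# Illustrative data: (stems from step a, total word count of the text).
training = {
    "economy": (["bank"] * 3 + ["market"] * 2 + ["year"] * 2 + ["thing"] * 2, 20),
    "sports": (["match"] * 3 + ["year"] * 2 + ["thing"] * 2, 20),
}
stop_stems = {"thing"}

stems_by_category = {
    category: remove_stop_stems(frequent_stems(stems, total, cutoff=0.08), stop_stems)
    for category, (stems, total) in training.items()
}
keywords = unique_keywords(stems_by_category)
needs_more_text = categories_to_augment(keywords, min_keywords=2)
```

Here "year" is dropped from both lists because it appears in more than one category, which leaves "sports" with a single keyword. With the assumed minimum of two keywords, that category would be flagged for augmentation before the steps are repeated.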
Description of the automatic categorization phase
Figure 5 is a flow chart showing the different steps of the automatic categorization phase according to the present invention.
During the automatic categorization phase, where predefined categories are automatically assigned to new Arabic texts, the following steps are performed for each new (uncategorized) Arabic text 500:
• Step a 501. Applying a first linguistic filter, using Arabic morphological analysis and Arabic lexical lookup, to the words of the new Arabic text, to extract a list of the stems of all the following word types: • native nouns (not derived from verbs) • the following selected Arabic noun types derived from verbs: • active participles, which are used in Arabic to describe the person/thing carrying out an action, • passive nouns, which are used in Arabic to describe the person/thing which is the object of an action, • nouns of instrument, which describe the implement used to perform an action, • nouns of place, which describe the place where an action happens, • verbal nouns, which generally relate to actions without time and are used in sentences where English would use the infinitive. Many verbal nouns have also acquired specific meanings in general circulation.
• Step b 502. Because of the ambiguity that arises in Arabic texts when 'vowel marks' are not used (as explained in step c 403 above), applying a second linguistic filter to remove the 'stop' like stems (in data retrieval terminology) from the list of stems extracted from the new Arabic text in step a 501 above, by comparison with a given list of words comprising the stems of the 'stop' like words.
• Step c 503. • Comparing the list of stems of the new Arabic text with the 'Keywords' of each predefined category, obtained in the learning phase, and • Computing for each predefined category the 'SCORE', defined as the SUM (over all keywords pertaining to that predefined category) of the number of matches of each keyword in the list of stems of the new document multiplied by the normalized frequency of that keyword (normalized frequencies are computed in step b 402 of the learning phase, Figure 4).
• Step d 507. • Assigning the new document to the text category giving the highest 'SCORE' for the new Arabic text, and • Assigning Arabic texts with a 'SCORE' below a given threshold 504 to a 'non categorized' category 505, and documents having an identical 'SCORE' (or very close scores) for more than one category 506 to more than one category 508.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.
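The scoring and assignment of steps c 503 and d 507 of the automatic categorization phase can be sketched in Python as follows. This is only an illustration under stated assumptions: the stems of the new text are taken as already extracted and filtered (steps a 501 and b 502), the keyword weights are the normalized frequencies from the learning phase, and all names are hypothetical. In particular, `tie_margin` is an assumed way of expressing "very close scores", which the description does not quantify.

```python
from collections import Counter


def category_scores(new_text_stems, category_keywords):
    """Step c 503: per category, sum of (matches of a keyword) x (keyword weight)."""
    matches = Counter(new_text_stems)  # a stem absent from the text counts as 0
    return {
        cat: sum(matches[kw] * weight for kw, weight in keywords.items())
        for cat, keywords in category_keywords.items()
    }


def assign_categories(scores, min_score, tie_margin=0.0):
    """Step d 507: pick the best category, with fallback and tie handling."""
    if not scores:
        return ["non categorized"]
    best = max(scores.values())
    if best < min_score:
        # Score below the threshold 504: 'non categorized' category 505.
        return ["non categorized"]
    # Categories whose score equals or is within tie_margin of the best (506, 508).
    return sorted(cat for cat, s in scores.items() if best - s <= tie_margin)
```

Returning a list rather than a single label keeps the three outcomes of step d 507 (one category, several categories, or 'non categorized') in one return type.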

Claims

What we claim is:
1. A method, for use in a computer system, for categorizing Arabic texts based on their content, said method comprising the preliminary step of:
• learning Arabic morphological features from selected Arabic texts belonging to predefined categories, said step comprising, for each selected Arabic text belonging to a predefined category, the further steps of :
• extracting stems of words corresponding to specific Arabic word types; • computing a normalized frequency for each extracted stem; • listing extracted stems having a normalized frequency above a given threshold; • comparing said list of stems with lists of stems associated with other predefined categories; • removing from said list stems appearing in at least one other list, so that a unique list of keywords and their normalized frequencies is associated with each predefined category, stems in each unique list being defined as keywords for the corresponding predefined category.
2. The Method according to the preceding claim wherein said specific Arabic word types are part of the following Arabic word types :
• native nouns;
• noun types derived from verbs comprising: • active participles used in Arabic to describe a person/thing carrying out an action; • passive nouns used in Arabic to describe the person/thing which is the object of an action; • nouns of instrument describing the implement used to perform an action; • nouns of place describing the place where an action happens; • verbal nouns generally related to actions without time and used in sentences where English would use the infinitive.
3. The Method according to any one of the preceding claims wherein the step of extracting stems of words corresponding to specific Arabic word types, comprises the further steps of :
• identifying words belonging to specific Arabic word types by means of an Arabic morphological analyzer;
• extracting stems of identified words by means of an Arabic lexical database comprising stems of Arabic words.
4. The Method according to any one of the preceding claims wherein the step of listing extracted stems having a normalized frequency above a given threshold, comprises the further step of:
• removing from said list of stems having a normalized frequency above a given threshold, 'stop' like stems by comparison with a given list comprising stems of 'stop' like words.
5. The Method according to any one of the preceding claims wherein a normalized frequency is defined as the number of times a stem appears in the text divided by the total number of words in the text.
6. The method according to any one of the preceding claims wherein the step of removing from said list stems appearing in at least one other list comprises the further step of:
if the number of stems in said list is below a predefined number :
• augmenting the selected Arabic text with additional words from one or a plurality of Arabic texts belonging to the predefined category.
7. The Method according to any one of the preceding claims comprising the further step of : • assigning a new Arabic text to a predefined category using the lists of keywords and the corresponding normalized frequencies associated with the predefined categories, said step comprising for this new Arabic text, the further steps of :
• extracting stems of words corresponding to said specific Arabic word types; • assigning the new Arabic document to the category giving the highest match between the extracted stems and the keywords weighted with the normalized frequency associated with each keyword.
8. The Method according to the preceding claim wherein the step of assigning the new Arabic document to the category giving the highest match between extracted stems and keywords, comprises the further steps of :
• comparing the extracted stems with the keywords of each predefined category; • computing, for each predefined category, the result of the sum, for all keywords belonging to said predefined category, of the number of matches of each keyword in the list of extracted stems multiplied by the keyword normalized frequency; • assigning the new Arabic document to the category giving the highest result.
9. The Method according to any one of claims 7 to 8 wherein the step of extracting from the new Arabic text, stems of words corresponding to specific Arabic word types, comprises the further steps of :
• identifying words belonging to specific Arabic word types by means of an Arabic morphological analyzer;
• extracting stems of identified words by means of an Arabic lexical database comprising stems of Arabic words.
10. The Method according to any one of claims 7 to 9 wherein the step of extracting stems of words corresponding to specific Arabic word types, comprises the further step of: • removing from extracted stems, 'stop' like stems by comparison with a given list comprising stems of 'stop' like words.
11. A computer system comprising means adapted for carrying out the steps of the method according to any one of the preceding claims.
12. A computer program comprising instructions for carrying out the method according to any one of claims 1 to 10 when said computer program is executed on the system according to the preceding claim.
EP04740317A 2003-07-23 2004-05-13 Method and system for categorizing arabic text Withdrawn EP1652107A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP04740317A EP1652107A1 (en) 2003-07-23 2004-05-13 Method and system for categorizing arabic text

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
EP03368072 2003-07-23
PCT/EP2004/006906 WO2005015434A1 (en) 2003-07-23 2004-05-13 Method and system for categorizing arabic text
EP04740317A EP1652107A1 (en) 2003-07-23 2004-05-13 Method and system for categorizing arabic text

Publications (1)

Publication Number Publication Date
EP1652107A1 true EP1652107A1 (en) 2006-05-03

Family

ID=34130383

Family Applications (1)

Application Number Title Priority Date Filing Date
EP04740317A Withdrawn EP1652107A1 (en) 2003-07-23 2004-05-13 Method and system for categorizing arabic text

Country Status (3)

Country Link
EP (1) EP1652107A1 (en)
IL (1) IL173306A0 (en)
WO (1) WO2005015434A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009006911A1 (en) * 2007-07-12 2009-01-15 The Engineering Company For The Development Of Computer Systems. (Rdi) System and method for large-scale arabic lexical semantic analysis
CN102169495B (en) * 2011-04-11 2014-04-02 趣拿开曼群岛有限公司 Industry dictionary generating method and device
CN110717040A (en) * 2019-09-18 2020-01-21 平安科技(深圳)有限公司 Dictionary expansion method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO2005015434A1 *

Also Published As

Publication number Publication date
IL173306A0 (en) 2006-06-11
WO2005015434A1 (en) 2005-02-17

Similar Documents

Publication Publication Date Title
Alajmi et al. Toward an ARABIC stop-words list generation
Duwairi Machine learning for Arabic text categorization
US7296009B1 (en) Search system
JP3950535B2 (en) Data processing method and apparatus
Song et al. A comparative study on text representation schemes in text categorization
CN107992633A (en) Electronic document automatic classification method and system based on keyword feature
WO2005020091A1 (en) System and method for processing text utilizing a suite of disambiguation techniques
US20050102251A1 (en) Method of document searching
NZ524988A (en) A document categorisation system
WO2002021324A1 (en) Method and apparatus for summarizing multiple documents using a subsumption model
Alami et al. Arabic text summarization based on graph theory
Rathod Extractive text summarization of Marathi news articles
US20020059219A1 (en) System and methods for web resource discovery
Gopan et al. Comparative study on different approaches in keyword extraction
JP3847273B2 (en) Word classification device, word classification method, and word classification program
EP0822503A1 (en) Document retrieval system
KR100435442B1 (en) Method And System For Summarizing Document
Basili et al. NLP-driven IR: Evaluating performances over a text classification task
CN112711666A (en) Futures label extraction method and device
CN108475265B (en) Method and device for acquiring unknown words
EP1652107A1 (en) Method and system for categorizing arabic text
Sahmoudi et al. Towards a linguistic patterns for arabic keyphrases extraction
Syafiq et al. A concise review of named entity recognition system: Methods and features
Basili et al. A robust model for intelligent text classification
Tran et al. Learning based approaches for vietnamese question classification using keywords extraction from the web

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20060223

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LI LU MC NL PL PT RO SE SI SK TR

DAX Request for extension of the european patent (deleted)
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20061012