EP1203309A1 - System and method for detecting text similarity over short passages - Google Patents
Info
- Publication number
- EP1203309A1 (application EP00951059A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- primitive
- features
- common
- normalizing
- similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
Definitions
- Figure 3 is a flow chart which illustrates a method for extracting and scaling primitive features in accordance with the present invention.
- the text segments are compared for a level of commonality, including determining whether there is a common single word (step 305), a common noun phrase (step 310), whether two words in the phrases are synonyms (step 315), whether the phrases include verbs having a common semantic class (step 320), and whether a common proper noun can be found in the two phrases (step 325). If none of these conditions are satisfied for the applied small text segments, there is no primitive feature common to these two text segments (step 327). When a primitive feature has been identified, e.g., one of the conditions in steps 305 through 325 is satisfied, a feature value is assigned to that primitive feature.
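The five checks of steps 305 through 325 can be sketched as follows. This is an illustrative sketch, not the patented implementation: only the single-word check is spelled out, while the noun-phrase, synonym, verb-class, and proper-noun checks are hypothetical predicate hooks standing in for the external NLP tools the text describes.

```python
def common_single_words(seg_a, seg_b):
    """Step 305: words shared by both segments (case-folded)."""
    return set(w.lower() for w in seg_a) & set(w.lower() for w in seg_b)

def primitive_features(seg_a, seg_b,
                       noun_phrases=None, synonyms=None,
                       verb_classes=None, proper_nouns=None):
    """Return the set of primitive-feature types found in common.

    The four optional arguments are predicates supplied by external
    tools (assumed here, not specified by this text).
    """
    found = set()
    if common_single_words(seg_a, seg_b):
        found.add("single_word")                 # step 305
    if noun_phrases and noun_phrases(seg_a, seg_b):
        found.add("noun_phrase")                 # step 310
    if synonyms and synonyms(seg_a, seg_b):
        found.add("synonym")                     # step 315
    if verb_classes and verb_classes(seg_a, seg_b):
        found.add("verb_class")                  # step 320
    if proper_nouns and proper_nouns(seg_a, seg_b):
        found.add("proper_noun")                 # step 325
    return found                                 # empty set -> step 327

a = "two planes lost contact".split()
b = "contact was lost with two helicopters".split()
print(primitive_features(a, b))  # {'single_word'}
```

An empty result corresponds to step 327: no primitive feature is common to the two segments.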
- the values which are assigned to the features are determined by a machine learning algorithm, such as RIPPER, which is trained using a suitable training corpus.
- RIPPER is a widely-used and effective rule induction system which is available from AT&T Laboratories and is described by Cohen in "Learning Trees and Rules with Set-Valued Features," Proceedings of the Fourteenth National Conference on Artificial Intelligence, American Association for Artificial Intelligence, 1996, which is incorporated by reference. It has been found that a subset of a corpus of 264 paragraphs which have been manually tagged by human readers as similar or not similar can be used to establish a feature rule set for RIPPER which is then suitable for assigning values to the features identified in the text segments.
- the particular training corpus and learned rule set will generally vary depending on the desired application.
- the values assigned will vary based on properties of the machine learning algorithm and training corpus.
- these values can be normalized based on text length (step 335) and/or noted frequency of occurrence (step 340). Though normalization is optional, it is a desirable step to provide uniform and accurate results across varying types of text and length of text segments.
- Primitive features provide a baseline indication of similarity.
- relationships among primitive features referred to as composite features, can also be evaluated. Referring to Figure 4, a method of evaluating composite features is illustrated.
- Composite features are those features which identify relationships among primitive feature pairs.
- composite features are defined by placing different forms of restrictions on participating primitive feature pairs.
- the primitive features identified in each of the small text segments are applied to a test layer 400 where various feature relationships are evaluated.
- the relationships illustrated in test layer 400 are exemplary in nature and are not intended to illustrate an exhaustive list of possible relationships.
- a large number of relationships between and among primitive features can be used to establish composite features.
- one type of feature relationship for composite features can be that the primitives occur in the same order in each of the text samples (step 405). This is illustrated by example in Figure 7.
- Figure 6 provides three short text segments to be compared.
- Figure 7 illustrates a match according to the "same order" composite feature rule.
- primitive features are identified by shading and the relationships which form the composite features are illustrated by connecting lines.
- the primitive features {two, contact} appear in the same order in text segments (a) and (b) of Figure 6.
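The "same order" rule can be sketched as below. The tokenization and the use of first occurrences are simplifying assumptions; a real implementation would operate on the primitive features extracted by the earlier stages.

```python
def positions(word, seg):
    """Indices at which word occurs in a tokenized segment."""
    return [i for i, w in enumerate(seg) if w.lower() == word]

def same_order(p1, p2, seg_a, seg_b):
    """True if primitives p1 and p2 occur in the same relative order
    in both segments (first occurrences compared)."""
    def order(seg):
        ia, ib = positions(p1, seg), positions(p2, seg)
        if not ia or not ib:
            return None          # a primitive is missing from this segment
        return ia[0] < ib[0]
    oa, ob = order(seg_a), order(seg_b)
    return oa is not None and oa == ob

seg1 = "two of the planes lost radio contact".split()
seg2 = "two helicopters lost contact with the tower".split()
print(same_order("two", "contact", seg1, seg2))  # True: same order in both
```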
- Another possible relationship is that two pairs of primitive elements are required to occur within a certain distance in both text segments.
- the maximum distance between the primitive elements which would satisfy the relationship can be a variable or a predetermined constant (step 410).
- n is set to a value less than three.
- while the primitive features {contact, lost} do not appear in the same order, they occur within n words of each other (n < 3 in this case).
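The "within distance" rule can be sketched as follows. The default threshold and the word-level tokenization are assumptions made for illustration only.

```python
def within_distance(p1, p2, seg_a, seg_b, n=2):
    """True if primitives p1 and p2 occur within n words of each other
    in both segments, regardless of their order."""
    def close(seg):
        ia = [i for i, w in enumerate(seg) if w.lower() == p1]
        ib = [i for i, w in enumerate(seg) if w.lower() == p2]
        return any(abs(i - j) <= n for i in ia for j in ib)
    return close(seg_a) and close(seg_b)

seg1 = "the planes lost radio contact".split()  # |lost - contact| = 2
seg2 = "contact was lost".split()               # |contact - lost| = 2
print(within_distance("lost", "contact", seg1, seg2))  # True
```

Note that, as in the {contact, lost} example above, the rule is satisfied even when the two primitives appear in opposite orders in the two segments.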
- Yet another exemplary relationship can be that the two text segments include the same primitive feature types.
- one primitive feature can be restricted to a simplex noun phrase while the other to a verb.
- two noun phrases, one from each text unit, must match according to the rule for matching simplex noun phrases, and two verbs must match according to the applied rules for verb primitives (e.g., sharing the same semantic class).
- This is illustrated in Figure 9, where the primitive feature "An OH-58 helicopter" is deemed a simplex noun phrase match with "the helicopter" and both phrases include a common verb, "lost".
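A minimal sketch of the "primitive type" rule follows. The head-noun extraction (last word of the phrase) and the tiny verb-class lexicon are simplifying assumptions standing in for the noun-phrase and verb-classification tools described later.

```python
# Hypothetical verb-class lexicon; a real system would use a semantic
# verb classification resource.
VERB_CLASSES = {"lost": "loss", "misplaced": "loss", "said": "communication"}

def np_head_match(np_a, np_b):
    """Match simplex NPs that share the same head (last word, simplified)."""
    return np_a.split()[-1].lower() == np_b.split()[-1].lower()

def verb_class_match(v_a, v_b):
    ca, cb = VERB_CLASSES.get(v_a.lower()), VERB_CLASSES.get(v_b.lower())
    return ca is not None and ca == cb

def primitive_type_rule(np_a, v_a, np_b, v_b):
    """One primitive restricted to a simplex NP, the other to a verb."""
    return np_head_match(np_a, np_b) and verb_class_match(v_a, v_b)

print(primitive_type_rule("An OH-58 helicopter", "lost",
                          "the helicopter", "lost"))  # True
```

This mirrors the Figure 9 example: "An OH-58 helicopter" matches "the helicopter" by head, and both segments share the verb "lost".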
- feature values are assigned to those composite features identified (step 420).
- the feature values are assigned by a machine learning algorithm, such as RIPPER, which has been trained on a suitable training corpus.
- the feature values assigned to the composite feature can be normalized for text length and relative occurrence of the primitive feature or composite feature (steps 425, 430, respectively).
- a machine learning algorithm is applied to determine a similarity value between the text segments (step 435).
- the machine learning algorithm can perform a rule-based analysis to determine similarity. Alternatively, a simpler algorithm can be used to determine similarity by comparing the total feature value of the text segments being compared to a predetermined threshold value.
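The simpler threshold alternative can be sketched in a few lines. The feature values and the threshold below are illustrative placeholders, not values from the patent.

```python
def similar_by_threshold(feature_values, threshold=1.5):
    """Simpler alternative to rule learning: sum all (normalized)
    primitive and composite feature values and compare the total
    against a predetermined threshold."""
    return sum(feature_values.values()) >= threshold

features = {"single_word": 0.8, "same_order": 0.5, "within_distance": 0.4}
print(similar_by_threshold(features))  # 1.7 >= 1.5 -> True
```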
- FIG. 5 is a block diagram of an exemplary software system for conducting the method described in connection with Figures 1-4.
- the system is generally implemented in software for a general purpose computer, such as a personal computer or work station.
- the system includes a main processing section 500.
- One or more interface modules 510 are included for receiving text input for the text segments to be compared and for providing the text segments to the main processing section 500.
- the text input can be provided by a number of sources, including but not limited to, computer readable memory, hard disks, optical disks, network databases, on-line sources, manual keyed input and the like. Based on the desired text source and input mechanism, one skilled in the art can provide appropriate text input interface module 510 hardware and software.
- the main processing section 500 is also operatively coupled to a training corpus 515, which is generally stored in computer readable storage media.
- the main processing section 500 is generally programmed in a structured manner which calls various subprograms, library routines, and the like to perform the various functions described in accordance with Figures 1-4.
- the main processing section 500 can invoke the various subroutines sequentially (serial) or in a parallel, or batched, processing mode.
- the received text is generally passed to a preprocessing routine 520.
- the preprocessing routine cleans up the received text, such as by removing control characters from the text.
- the preprocessing routine also performs part-of-speech (POS) tagging, using known techniques, such as are available in the ALEMBIC tool set, described by Aberdeen et al. in "MITRE: Description of the Alembic System as Used for MUC-6," Proceedings of the Sixth Message Understanding Conference (MUC-6), 1995.
- ALEMBIC provides a set of data and language processing tools which identify the various parts of speech present in the small text segments.
- a noun phrase comparison subroutine 525, such as LinkIt, can be called by the main processing section 500. LinkIt can be employed to determine whether a common noun phrase is present in the applied text segments and to identify simplex noun phrases, matching those that share the same noun head.
- the LinkIt tool is described by N. Wacholder in "Simplex NPs Clustered by Head: A Method for Identifying Significant Topics in a Document," Proceedings of the Workshop on the Computational Treatment of Nominals, October 1998, which is hereby incorporated by reference in its entirety.
- the noun comparison algorithm can also be used to match those nouns identified using the ALEMBIC toolset using various predetermined matching criteria. Variations on proper noun matching can include restricting the proper noun type to a person, place or organization. Such subcategories can also be extracted using ALEMBIC's named entity finder.
- a word co-occurrence detection sub-routine 540 can be called by the main program 500. Variations of the word co-occurrence operation can restrict matching to cases where the parts of speech of the words also match, or relax the comparison to cases where only the word stems of the two words are identical.
- a synonym detection algorithm 530 can be called by the main processing routine 500.
- a lexical database such as WordNet®, as described by G. Miller in "WordNet: An On-Line Lexical Database," International Journal of Lexicography, Vol. 3, No. 4, 1990, can be used by the synonym detection algorithm.
- WordNet provides sense information and places words in sets of synonyms (synsets). Words that appear in the same synset are generally considered matches. Variations on this feature can be used to restrict the words being compared to a specific part-of-speech class.
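The synset-matching idea can be sketched with a hand-rolled stand-in for a WordNet-style database. The tiny SYNSETS table is purely illustrative; a real system would query WordNet itself.

```python
# Words appearing in the same synset are treated as synonym matches.
SYNSETS = [
    {"crash", "collision", "accident"},
    {"plane", "aircraft", "airplane"},
]

def are_synonyms(w1, w2):
    """True if the two (distinct) words share a synset."""
    w1, w2 = w1.lower(), w2.lower()
    if w1 == w2:
        return False  # identical words are handled by the word primitive
    return any(w1 in s and w2 in s for s in SYNSETS)

print(are_synonyms("plane", "aircraft"))  # True
```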
- a verb classifier and comparator algorithm 535 can be operatively coupled to the main processing section 500 and called by the main program.
- Semantic classes for verbs have been found to be useful for determining document types and text similarity. This is discussed, for example, in "The Role of Verbs in Document Analysis" by J. Klavans et al., Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics, 1998, which is hereby incorporated by reference in its entirety.
- verbs which are found to have a common semantic class, e.g., communication, motion, agreement, argument, etc., are considered to match.
- the program operating in main processing section 500 can also provide algorithms to normalize feature values for text lengths and relative occurrence of the primitive.
- each feature value can be normalized by the size of the textual segments in the pair. For example, for a pair of textual segments A and B, the feature values assigned are divided by a normalization value, N:
- N = Length(A) × Length(B)    (1)
- Normalization of feature values can also be based on the relative frequency of occurrence of each primitive feature. Such normalization is motivated by the general observation that infrequently matching primitive elements are likely to have a higher impact on similarity than primitives which match more frequently. Such normalization is similar to the document frequency component of the commonly employed TF*IDF calculation.
- each primitive feature is associated with a value which is equal to the number of textual units in which the primitive appeared in the corpus. For a primitive element which compares single words, this is the number of text segments which contain that word in the corpus; for a noun phrase, this is the number of textual units that contain noun phrases that share the same head; and similarly for other primitive types. Each feature's value is then multiplied by a weight that decreases as this frequency grows.
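The two normalizations described above can be sketched as follows. The length normalization implements equation (1); the frequency weight is an assumption, since the exact multiplier is not reproduced in this text — a log-IDF form is used here because the text likens the weighting to the document-frequency component of TF*IDF.

```python
import math

def length_normalize(value, seg_a, seg_b):
    """Equation (1): divide a feature value by Length(A) x Length(B)."""
    return value / (len(seg_a) * len(seg_b))

def frequency_weight(total_units, units_with_primitive):
    """IDF-style weight (assumed form): rare primitives count more."""
    return math.log(total_units / units_with_primitive)

v = length_normalize(1.0, ["two", "planes", "lost"], ["contact", "lost"])
print(v)  # 1.0 / (3 * 2)
```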
- the program in main processing section 500 generally employs a machine learning algorithm 545 to determine whether the text units match overall.
- a suitable machine learning algorithm is RIPPER, as disclosed by Cohen in "Learning Trees and Rules with Set- Valued Features, Proceedings of the Fourteenth National Conference on Artificial Intelligence, American Association on Artificial Intelligence, 1996, which is incorporated by reference.
- RIPPER is a widely-used and effective rule induction system. The RIPPER algorithm is trained over a corpus of manually marked pairs of text units contained in the training corpus 515.
- a suitable corpus was constructed using a subset of the Topic Detection and Tracking (TDT) corpus developed by NIST and DARPA.
- the TDT corpus is a collection of over 16,000 news articles from Reuters and CNN, where many of the articles have been manually grouped into 25 categories, each of which corresponds to a single event.
- the selected corpus was formed using the Reuters articles in five of the twenty-five categories from randomly selected days.
- the resulting training corpus 515 contained 30 related articles.
- the 30 articles provided 264 paragraphs which were selected as the small text segments and resulted in 10,345 comparisons between segments.
- a machine learning algorithm can add the total value of composite features found in the text segments and compare this value against a similarity threshold.
- feature values can be predetermined based on human experience through the use of a look-up table.
- all features can be given a binary value and the similarity comparison can be determined based on a simple accumulated count of detected primary and composite features.
- the present methods, while evaluated on a corpus of English-language documents, are not language specific and are generally applicable to any language. Of course, the individual subroutines may require some alteration to accommodate the varied constructions found in different languages.
- the methods for determining similarity in small text segments described herein form an important component in larger systems, such as document archiving systems and multi-document summarization systems.
Abstract
A system and method are provided for determining similarity in short text segments. The method provides a definition of similarity which is appropriate for the small text setting (100). Small text segments are compared to determine if there exist common primitive features, such as words, noun phrases, synonyms, verbs with a common semantic class, proper nouns and the like (105). From the primitive features identified, the small text segments are evaluated to determine whether composite features are present (110). Composite features are defined as predetermined relationships between primitive features. The common primitive features and composite features are applied as inputs to an appropriate machine learning algorithm which is trained to ascertain a similarity measure based on the primitive and composite features common to the text segments (115).
Description
SYSTEM AND METHOD FOR DETECTING TEXT SIMILARITY OVER SHORT PASSAGES
FIELD OF THE INVENTION The present invention relates generally to natural language processing and more particularly relates to a system and method for determining the similarity of text in short passages.
BACKGROUND OF THE INVENTION
With the growing volume of textual information, such as newspaper articles, magazines, Internet articles, and the like, there is a growing need to automatically cluster and/or classify such documents and determine whether groups of documents express similarities or not. For the most part, research in this area has focused on detecting similarity between documents and large segments of text or between a short query phrase and one or more documents.
While effective techniques have been developed for document clustering and classification which depend on inter-document similarity measures, these techniques generally rely only on shared words, or occasionally on collocation of words. Such techniques are applicable when large units of text, such as full documents, are compared. In this case, there is generally sufficient overlap to detect similarity in the documents and/or document segments. However, when the units of text are small, for example a paragraph or abstract, such simple surface matching of words and phrases is far more prone to error. In the case of small text units, the sample size is reduced and the number of potential matches is reduced accordingly. Thus, there remains a need for improved techniques for detecting similarities between small text units.
A further problem with known techniques for detecting similarity is that the conventional notions of similarity which are applicable to large text samples, such as documents and large text segments, do not provide sufficient measures of similarity for measuring similarity in small text segments. Standard notions of similarity generally involve the creation of a vector or profile of characteristics of a text
fragment and determine a conceptual distance between vectors on the basis of frequencies. Features typically include stemmed words, although multi-word units and collocations also have been used. Typological characteristics, such as thesaural features, have also been used to calculate features. The difference between vectors for one text unit (usually a query) and another text unit (usually a document) then determines closeness or similarity of the text units.
In some cases, the text units are represented as vectors of sparse n-grams of word occurrences and learning is applied over those vectors. Though effective in the context of large document comparisons, a more fine-grained distinction for similarity measures is required to properly characterize the similarity of two small text segments.
SUMMARY OF THE INVENTION It is an object of the present invention to provide systems and methods for detecting similarity between two or more small text segments. A method for determining similarity in short text segments in accordance with the present invention includes the steps of determining common primitive features in the text segments, determining common composite features in the text segments and then calculating a similarity measure based upon the primitive and composite features. The primitive features can be selected from the group including common single words, common noun phrases, synonyms, common semantic classes of verbs, and common proper nouns. The composite features, which represent relationships between and among the primitive features, can be selected from the group including primitive feature order restrictions, primitive feature distance restrictions, and primitive type restrictions. Preferably, the step of determining common primitive features can include the further steps of identifying common primitive features, assigning a value to the primitive features, and normalizing the feature values. Normalizing the values can include normalizing for text segment length and normalizing for the frequency of primitive feature occurrence. Similarly, determining composite features generally includes identifying the composite features, assigning a value to the composite
features, and normalizing the feature values. Again, normalization of the feature values can include normalizing for text segment length and normalizing for the frequency of feature occurrence.
BRIEF DESCRIPTION OF THE DRAWING Further objects, features and advantages of the invention will become apparent from the following detailed description taken in conjunction with the accompanying figures showing illustrative embodiments of the invention, in which
Figure 1 is a flow chart illustrating an overview of a present method for comparing small text segments; Figure 2 is a flow chart illustrating the step of defining similarity for small text segments in accordance with the present methods;
Figure 3 is a flow chart illustrating the process of computing primitive features for use in detecting similarity in small text segments;
Figure 4 is a flow chart illustrating the process of calculating composite features for use in detecting similarity of small text segments in accordance with the present methods;
Figure 5 is a block diagram of a software system topology for determining similarity in small text segments in accordance with the present methods; Figure 6 is an illustration of exemplary short text segments; Figure 7 is a diagram illustrating a composite feature match between two of the short text segments provided in Figure 6 using a "same order" rule;
Figure 8 is a diagram illustrating a composite feature match between two of the short text segments provided in Figure 6 using a "within distance" rule; and
Figure 9 is a diagram illustrating a composite feature match between two of the short text segments provided in Figure 6 using a "primitive type" rule.
Throughout the figures, the same reference numerals and characters, unless otherwise stated, are used to denote like features, elements, components or portions of the illustrated embodiments. Moreover, while the subject invention will now be described in detail with reference to the figures, it is done so in connection with the illustrative embodiments. It is intended that changes and modifications can be made
to the described embodiments without departing from the true scope and spirit of the subject invention as defined by the appended claims.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS Figure 1 is a flow chart illustrating an overview of the process used in the present invention for detecting similarity in small text segments. As previously noted, a problem in the prior art is that the definition of similarity commonly used for large text segments, such as documents, is not sufficiently refined to provide an adequate measure of similarity when comparing small text segments. Generally, small text segments refer to sentences, phrases and short paragraphs. Referring to Figure 1, in step 100 a definition of similarity for small text segments is provided. From this definition, the method proceeds to identify primitive features of the small text segments and determine feature values for the primitive features (step 105). Primitive features are those which generally compare simple parts of speech and text, such as single words, word categories, or phrases such as noun phrases, synonyms, verb class and proper nouns. In addition to primitive features, the process can identify composite features of the short text segments and determine composite feature values (step 110). Composite features are those which compare relationships among two or more primitive features. Once primitive features and composite features have been identified and given an appropriate value, a machine learning algorithm is applied to classify small text segments as similar or not similar (step 115).
Figure 2 is a flow chart which illustrates the process of establishing an appropriate definition of similarity for small text segments. In general, two text units can be considered similar if they share the same focus on a common concept, actor, object or action. In addition, the common concept, actor or object must perform or be subjected to the same action, or be the subject of the same description. This is exemplified in the flow chart of Figure 2, where two small text segments are selected from a body of text and analyzed. If the two text segments relate to a common concept (step 205), then further analysis is performed to see if the common concept relates to the same action (step 210) or to the same description (step 215).
Similar tests are performed to determine if the two text segments relate to a common actor (step 220) or to a common object (step 225). If there is no common concept, actor or object, the text segments are considered not similar (step 235). Similarly, for those text segments which do refer or relate to a common concept, actor or object, those segments will still be found not similar unless they also relate to a common action or involve the same description. Thus, for short text segments to be similar, they must contain a common concept, actor, or object which is also the subject of a common action or description. The comparisons in steps 205, 220 and 225 can be the basis for primitive features 240. Those relationships between primitive features which are identified in steps 210, 215 can be referred to as composite features 245.
While Figure 2 is illustrated as a sequential process, it represents a decision tree involved in a definition of similarity of two short text segments as applied in the present invention which can also be performed in a largely parallel manner. For example, decisions 205, 220 and 225 can be performed concurrently as can decisions 210 and 215. Using this definition of similarity for small text segments, a feature- based process can be employed which compares primitive and composite features of short text segments to determine if the definition is satisfied for two or more given input text segments.
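The decision tree of Figure 2 can be sketched as a short predicate. Representing a segment as sets of concepts, actors, objects, actions and descriptions is an illustrative assumption for this sketch, not part of the disclosed embodiment, which derives these from the feature routines described below.

```python
# Sketch of the Figure 2 similarity definition. The set-based segment
# representation is an illustrative assumption only.

def segments_similar(seg_a, seg_b):
    """Steps 205/220/225: require a common concept, actor or object;
    steps 210/215: that common focus must also share an action or
    description."""
    common_focus = ((seg_a["concepts"] & seg_b["concepts"]) or
                    (seg_a["actors"] & seg_b["actors"]) or
                    (seg_a["objects"] & seg_b["objects"]))
    if not common_focus:
        return False  # step 235: no common focus, not similar
    return bool((seg_a["actions"] & seg_b["actions"]) or
                (seg_a["descriptions"] & seg_b["descriptions"]))
```

Decisions 205, 220 and 225 map to the first test and decisions 210 and 215 to the second; as noted above, each group can be evaluated concurrently.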
Figure 3 is a flow chart which illustrates a method for extracting and scaling primitive features in accordance with the present invention. The text segments are compared for a level of commonality, including determining whether there is a common single word (step 305), a common noun phrase (step 310), whether two words in the phrases are synonyms (step 315), whether the phrases include verbs having a common semantic class (step 320), and whether a common proper noun can be found in the two phrases (step 325). If none of these conditions is satisfied for the applied small text segments, there is no primitive feature common to these two text segments (step 327). When a primitive feature has been identified, e.g., one of the conditions in steps 305 through 325 is satisfied, a feature value is assigned to that primitive feature. Preferably, the values which are assigned to the features are determined by a machine learning algorithm, such as RIPPER, which is trained using a suitable training corpus. RIPPER is a widely-used and effective rule induction
system which is available from AT&T Laboratories and is described by Cohen in "Learning Trees and Rules with Set-Valued Features," Proceedings of the Fourteenth National Conference on Artificial Intelligence, American Association for Artificial Intelligence, 1996, which is incorporated by reference. It has been found that a subset of a corpus of 264 paragraphs which have been manually tagged by human readers as similar or not similar can be used to establish a feature rule set for RIPPER which is then suitable for assigning values to the features identified in the text segments. The particular training corpus and learned rule set will generally vary depending on the desired application. The values assigned will vary based on properties of the machine learning algorithm and training corpus. After feature values are assigned in step 330, these values can be normalized based on text length (step 335) and/or noted frequency of occurrence (step 340). Though normalization is optional, it is a desirable step to provide uniform and accurate results across varying types and lengths of text segments.

Primitive features provide a baseline indication of similarity. To further refine the notion of similarity in small text segments, relationships among primitive features, referred to as composite features, can also be evaluated. Referring to Figure 4, a method of evaluating composite features is illustrated. Composite features identify relationships among primitive feature pairs and are generally defined by placing different forms of restrictions on the participating pairs. The primitive features identified in each of the small text segments are applied to a test layer 400 where various feature relationships are evaluated. The relationships illustrated in test layer 400 are exemplary in nature and are not intended to be an exhaustive list of possible relationships.
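A rough sketch of the primitive checks of steps 305 through 325 over pre-tokenized word lists follows. The tiny synonym-pair and verb-class tables are hypothetical stand-ins for WordNet and the verb classifier described later; the noun-phrase and proper-noun checks are omitted here because they depend on the parsing tools discussed below.

```python
# Hedged sketch of primitive feature extraction (steps 305-325).
# SYNONYM_PAIRS and VERB_CLASS are hypothetical stand-ins for the
# WordNet database and verb classifier, not part of the embodiment.

SYNONYM_PAIRS = {("lost", "dropped")}                    # hypothetical
VERB_CLASS = {"lost": "contact", "dropped": "contact"}   # hypothetical

def primitive_features(words_a, words_b):
    feats = {}
    common = set(words_a) & set(words_b)
    if common:
        feats["single_word"] = len(common)               # step 305
    syn = {(x, y) for x in words_a for y in words_b
           if (x, y) in SYNONYM_PAIRS or (y, x) in SYNONYM_PAIRS}
    if syn:
        feats["synonym"] = len(syn)                      # step 315
    verbs = ({VERB_CLASS.get(w) for w in words_a} &
             {VERB_CLASS.get(w) for w in words_b}) - {None}
    if verbs:
        feats["verb_class"] = len(verbs)                 # step 320
    return feats  # an empty dict corresponds to step 327
```

The feature values here are raw counts; in the described system they would instead be assigned via the trained rule set and then normalized in steps 335 and 340.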
It will be appreciated that a large number of relationships between and among primitive features can be used to establish composite features. For example, one type of feature relationship for composite features can be that the primitives occur in the same order in each of the text samples (step 405). This is illustrated by example in Figure 7. Figure 6 provides three short text segments to be compared, and Figure 7 illustrates a match according to the "same order" composite feature rule. In Figures 7-9, primitive features are identified by shading and the relationships which form the composite features are illustrated by connecting lines. In the case illustrated in Figure 7, the primitive features {two, contact} appear in the same order in text segments (a) and (b) of Figure 6.
Another possible relationship is that a pair of primitive elements is required to occur within a certain distance of each other in both text segments. The maximum distance between the primitive elements which would satisfy the relationship can be a variable or a predetermined constant (step 410). Referring to Figure 8, an example of a positive match for the "within distance" composite feature rule is provided, given that the distance, n, is set to a value less than three. In Figure 8, although the primitive features {contact, lost} do not appear in the same order, they occur within n words of each other (n<3 in this case).
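The "same order" rule of step 405 and the "within distance" rule of step 410 can be sketched over word lists as below; the example sentences in the usage are illustrative, not taken from the figures.

```python
# Sketch of two composite-feature restrictions on a matched primitive
# pair: same relative order (step 405) and within-distance (step 410).

def same_order(words_a, words_b, p1, p2):
    """Step 405: the pair occurs in the same relative order in both."""
    try:
        in_a = words_a.index(p1) < words_a.index(p2)
        in_b = words_b.index(p1) < words_b.index(p2)
    except ValueError:  # a primitive is absent from one segment
        return False
    return in_a == in_b

def within_distance(words_a, words_b, p1, p2, n=3):
    """Step 410: the pair occurs within n words of each other in both."""
    try:
        return (abs(words_a.index(p1) - words_a.index(p2)) < n and
                abs(words_b.index(p1) - words_b.index(p2)) < n)
    except ValueError:
        return False
```

As in the Figure 8 scenario, a pair such as {lost, contact} can fail the same-order rule yet still satisfy the within-distance rule.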
Yet another exemplary relationship can be that the two text segments include the same primitive feature types. For example, one primitive feature can be restricted to a simplex noun phrase while the other is restricted to a verb. In such a case, two noun phrases, one from each text unit, must match according to the rule for matching simplex noun phrases, and two verbs must match according to the applied rules of verb primitives (e.g., sharing the same semantic class). This is illustrated in Figure 9, where the primitive feature "An OH-58 helicopter" is deemed a simplex noun phrase match with "the helicopter" and both phrases include a common verb, "lost". By matching primitive feature types, a simple grammatical relationship is determined in the text segments.

Returning to Figure 4, for each condition that is satisfied in test layer 400, feature values are assigned to those composite features identified (step 420). The feature values are assigned by a machine learning algorithm, such as RIPPER, which has been trained on a suitable training corpus. As in the case of primitive features, the feature values assigned to the composite features can optionally be normalized for text length and relative occurrence of the primitive or composite feature (steps 425, 430, respectively). Once both primitive features and composite features of the small text segments have been identified, a machine learning algorithm is applied to determine a similarity value between the text segments (step 435). The machine learning algorithm can perform a rule-based analysis to determine similarity. Alternatively, a simpler algorithm can be
used to determine similarity by comparing the total feature value of the text segments being compared to a predetermined threshold value.
Figure 5 is a block diagram of an exemplary software system for conducting the method described in connection with Figures 1-4. The system is generally implemented in software for a general purpose computer, such as a personal computer or workstation. The system includes a main processing section 500. One or more interface modules 510 are included for receiving text input for the text segments to be compared and for providing the text segments to the main processing section 500. The text input can be provided by a number of sources, including but not limited to, computer readable memory, hard disks, optical disks, network databases, on-line sources, manually keyed input and the like. Based on the desired text source and input mechanism, one skilled in the art can provide appropriate text input interface module 510 hardware and software.
The main processing section 500 is also operatively coupled to a training corpus 515, which is generally stored in computer readable storage media. The main processing section 500 is generally programmed in a structured manner which calls various subprograms, library routines, and the like to perform the various functions described in accordance with Figures 1-4. The main processing section 500 can invoke the various subroutines sequentially or in a parallel or batched processing mode.

The received text is generally passed to a preprocessing routine 520. The preprocessing routine cleans up the received text, such as by removing control characters from the text. The preprocessing routine also performs part-of-speech (POS) tagging, using known techniques, such as are available in the ALEMBIC tool set, described by Aberdeen et al. in "MITRE: Description of the Alembic System as used for MUC-6," Proceedings of the Sixth Message
Understanding Conference, 1995, which is hereby incorporated by reference. ALEMBIC provides a set of data and language processing tools which identify the various parts of speech present in the small text segments.
Following text preprocessing, control is returned to the main processing section 500, which then preferably invokes a noun phrase comparison subroutine 525, such as LinkIT, to perform the noun phrase comparison of step 310. LinkIT can be
employed to determine whether a common noun phrase is present in the applied text segments and to identify simplex noun phrases and match those that share the same noun head. The LinkIT tool is described by N. Wacholder in "Simplex NPs Clustered by Head: A Method for Identifying Significant Topics in a Document," Proceedings of the Workshop on the Computational Treatment of Nominals, October 1998, which is hereby incorporated by reference in its entirety.
To determine if two segments include common proper nouns as required in step 325, the noun comparison algorithm can also be used to match proper nouns identified by the ALEMBIC toolset, according to various predetermined matching criteria. Variations on proper noun matching can include restricting the proper noun type to a person, place or organization. Such subcategories can also be extracted using ALEMBIC's named entity finder.
Following noun phrase identification and matching, other routines for detecting primitive features can be employed. For example, to perform step 305 and determine whether common single word primitive features exist between two text segments, a word co-occurrence detection subroutine 540 can be called by the main program 500. Variations of the word co-occurrence operation can restrict matching to cases where the parts of speech of the words also match, or relax the comparison to cases where only the word stems of the two words are identical.

Similarly, to determine if two text segments include words which are synonyms, a synonym detection algorithm 530 can be called by the main processing routine 500. In this regard, a lexical database such as WordNet®, as described by G. Miller in "WordNet: An On-Line Lexical Database," International Journal of Lexicography, Vol. 3, No. 4 (1990), can be employed. WordNet provides sense information and places words in sets of synonyms (synsets). Words that appear in the same synset are generally considered matches. Variations on this feature can be used to restrict the words being compared to a specific part-of-speech class.
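The synset-based matching described above can be sketched as follows. The three-entry synset table is a hypothetical stand-in for the WordNet database; real synset labels would come from WordNet itself.

```python
# Minimal sketch of synset-based synonym matching (step 315).
# SYNSETS is a hypothetical stand-in: each word maps to the set of
# synsets (plain string labels here) it belongs to.

SYNSETS = {
    "lost":    {"lose.v.01"},
    "mislaid": {"lose.v.01"},
    "contact": {"contact.n.01"},
}

def are_synonyms(w1, w2, synsets=SYNSETS):
    """Two words match if they appear in at least one common synset."""
    return bool(synsets.get(w1, set()) & synsets.get(w2, set()))
```

A part-of-speech restriction, as mentioned above, would simply filter the synsets by POS before intersecting.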
To determine if two verbs present in the short text segments are of the same semantic class as set forth in step 320, a verb classifier and comparator algorithm 535 can be operatively coupled to the main processing section 500 and called by the main program. Semantic classes for verbs have been found to be useful for determining
document types and text similarity. This is discussed, for example, in "The Role of Verbs in Document Analysis" by J. Klavans et al., Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics, 1998, which is hereby incorporated by reference in its entirety. Verbs found to share a common semantic class, e.g., communication, motion, agreement, argument, etc., are considered to match.
The program operating in main processing section 500 can also provide algorithms to normalize feature values for text lengths and relative occurrence of the primitive. To normalize feature values for text length, as set forth in step 335, each feature value can be normalized by the size of the textual segments in the pair. For example, for a pair of textual segments A and B, the feature values assigned are divided by a normalization value, N:
N = √(Length(A) × Length(B))    (1)
This operation removes any potential bias in favor of longer text segments. It is noted that the lengths of A and B are generally measured in words, i.e., by a word count.
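The length normalization of equation (1) can be sketched as below; the function name is illustrative.

```python
from math import sqrt

def normalize_for_length(value, words_a, words_b):
    """Step 335 / equation (1): divide a feature value by
    N = sqrt(Length(A) * Length(B)), lengths measured in words."""
    return value / sqrt(len(words_a) * len(words_b))
```

For example, a raw feature value of 6 over segments of 4 and 9 words is divided by sqrt(36) = 6, yielding 1.0.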
Normalization of feature values can also be based on the relative frequency of occurrence of each primitive feature. Such normalization is motivated by the general observation that infrequently matching primitive elements are likely to have a higher impact on similarity than primitives which match more frequently. Such normalization is similar to the document frequency component of the commonly employed TF*IDF calculation. In this case, each primitive feature is associated with a value which is equal to the number of textual units in which the primitive appeared in the corpus. For a primitive element which compares single words, this is the number of text segments which contain that word in the corpus; for a noun phrase, this is the number of textual units that contain noun phrases that share the same head; and similarly for other primitive types. We multiply each feature's value by:
log(T/N)    (2)
where T is the total number of textual segments in the corpus and N is the number of textual segments containing the primitive. It is noted that since normalization for text length and normalization for frequency of occurrence are both optional operations, selectively applying these two techniques yields up to four variations of normalization for each primitive feature. Of course, other normalization techniques may be added to, or substituted for, the two methods discussed herein.
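The frequency weighting of equation (2) can be sketched directly:

```python
from math import log

def frequency_weight(t, n):
    """Step 340 / equation (2): log(T/N), where T is the total number
    of textual segments in the corpus and N is the number containing
    the primitive. Rarer primitives receive larger weights, analogous
    to the IDF component of TF*IDF."""
    return log(t / n)
```

A primitive appearing in every segment receives weight log(1) = 0, i.e., it contributes nothing to similarity, while a primitive appearing in a single segment receives the maximum weight log(T).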
The program in main processing section 500 generally employs a machine learning algorithm 545 to determine whether the text units match overall. A suitable machine learning algorithm is RIPPER, a widely-used and effective rule induction system, as disclosed by Cohen in "Learning Trees and Rules with Set-Valued Features," Proceedings of the Fourteenth National Conference on Artificial Intelligence, American Association for Artificial Intelligence, 1996, which is incorporated by reference. The RIPPER algorithm is trained over a corpus of manually marked pairs of text units contained in the training corpus 515. A suitable corpus was constructed using a subset of the Topic Detection and Tracking (TDT) corpus developed by NIST and DARPA. The TDT corpus is a collection of over 16,000 news articles from Reuters and CNN, many of which have been manually grouped into 25 categories, each of which corresponds to a single event. The selected corpus was formed using the Reuters articles in five of the twenty-five categories from randomly selected days. The resulting training corpus 515 contained 30 related articles. The 30 articles provided 264 paragraphs which were selected as the small text segments and resulted in 10,345 comparisons between segments.
Although use of a machine learning algorithm is preferred, other algorithms can also be used. For example, an algorithm can add the total value of composite features found in the text segments and compare this value against a similarity threshold. Similarly, although it is preferred to determine feature values based on the use of a machine learning algorithm, feature values can be predetermined based on human experience through the use of a look-up table. Alternatively, all features can be given a binary value and the similarity comparison can be determined based on a simple accumulated count of detected primary and composite features.
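The simplest alternative described above, giving every feature a binary value and comparing the accumulated count against a similarity threshold, can be sketched as follows; the threshold value of 3 is illustrative, not taken from the disclosure.

```python
# Sketch of the non-learning alternative: count detected primitive
# and composite features and compare against a threshold.

def similar_by_count(detected_features, threshold=3):
    """Segments are deemed similar if the number of detected
    (truthy) features reaches the threshold."""
    return sum(1 for present in detected_features.values()
               if present) >= threshold
```

A look-up table of predetermined per-feature weights, as also mentioned above, would replace the constant 1 with the feature's tabulated value.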
The present methods, while evaluated on a corpus of English language documents, are not language specific and are generally applicable to any language. Of course, the individual subroutines may require some alteration to accommodate the varied constructions found in different languages. The methods for determining similarity in small text segments described herein form an important component in larger systems, such as document archiving systems and multi-document summarization systems.
Although the present invention has been described in connection with specific exemplary embodiments, it should be understood that various changes, substitutions and alterations can be made to the disclosed embodiments without departing from the spirit and scope of the invention as set forth in the appended claims.
Claims
1. A method for determining similarity in short text segments comprising:
determining common primitive features in the text segments;
determining common composite features in the text segments; and
calculating a similarity measure based upon said primitive and composite features.
2. The method for determining similarity as defined by claim 1, wherein said primitive features are selected from the group including common single word, common noun phrase, synonyms, common semantic class of verbs, and common proper nouns.
3. The method for determining similarity as defined by claim 1, wherein said composite features are selected from the group including primitive feature order restrictions, primitive distance restrictions, and primitive type restrictions.
4. The method for determining similarity as defined by claim 1, wherein said step of determining common primitive features includes:
identifying common primitive features;
assigning a value to said primitive features; and
normalizing said value.
5. The method for determining similarity as defined by claim 4, wherein said step of normalizing includes at least one of normalizing for text segment length and normalizing for frequency of primitive occurrence.
6. The method for determining similarity as defined by claim 1, wherein said step of determining common composite features includes:
identifying common composite features;
assigning a value to said composite features; and
normalizing said value.
7. The method for determining similarity as defined by claim 6, wherein said step of normalizing includes at least one of normalizing for text segment length and normalizing for frequency of primitive occurrence.
8. A system for determining similarity in short text segments comprising:
an interface circuit for receiving text segments for comparison; and
a main processing section, the main processing section being operatively coupled to the interface circuit and operating under the control of a computer program, the program performing operations to determine common primitive features in the text segments, determine common composite features in the text segments, calculate a similarity measure based upon said primitive and composite features, and provide an output indicative of the similarity measure.
9. The system for determining similarity as defined by claim 8, wherein said primitive features are selected from the group including common single word, common noun phrase, synonyms, common semantic class of verbs, and common proper nouns.
10. The system for determining similarity as defined by claim 8, wherein said composite features are selected from the group including primitive feature order restrictions, primitive distance restrictions, and primitive type restrictions.
11. The system for determining similarity as defined by claim 8, wherein the processing operation of determining common primitive features includes:
identifying common primitive features;
assigning a value to said primitive features; and
normalizing said value.
12. The system for determining similarity as defined by claim 11, wherein the processing operation of normalizing includes at least one of normalizing for text segment length and normalizing for frequency of primitive occurrence.
13. The system for determining similarity as defined by claim 8, wherein said processing operation for determining common composite features includes:
identifying common composite features;
assigning a value to said composite features; and
normalizing said value.
14. The system for determining similarity as defined by claim 13, wherein said processing operation for normalizing includes at least one of normalizing for text segment length and normalizing for frequency of primitive occurrence.
15. The system for determining similarity as defined by claim 8, wherein the computer program includes a noun phrase identification subroutine, a synonym detection subroutine, a verb classifier subroutine and a word co-occurrence subroutine.
16. The system for determining similarity as defined by claim 8, further comprising a computer readable training corpus, and wherein the computer program includes a machine learning algorithm operatively coupled to the training corpus for learning and applying a rule set for determining similarity in small text segments.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13993099P | 1999-06-18 | 1999-06-18 | |
US139930P | 1999-06-18 | ||
PCT/US2000/040238 WO2000079426A1 (en) | 1999-06-18 | 2000-06-19 | System and method for detecting text similarity over short passages |
Publications (2)
Publication Number | Publication Date |
---|---|
EP1203309A1 true EP1203309A1 (en) | 2002-05-08 |
EP1203309A4 EP1203309A4 (en) | 2006-06-21 |
Also Published As
Publication number | Publication date |
---|---|
WO2000079426A1 (en) | 2000-12-28 |
EP1203309A4 (en) | 2006-06-21 |
Legal Events

Date | Code | Description
---|---|---
| PUAI | Public reference made under article 153(3) EPC to a published international application that has entered the European phase
2001-12-17 | 17P | Request for examination filed
| AK | Designated contracting states: AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE
2006-05-23 | A4 | Supplementary search report drawn up and despatched
2007-01-26 | 17Q | First examination report despatched
2007-06-06 | 18D | Application deemed to be withdrawn