WO2000033215A1 - Term-length term-frequency method for measuring document similarity and classifying text - Google Patents

Term-length term-frequency method for measuring document similarity and classifying text

Info

Publication number
WO2000033215A1
WO2000033215A1 (PCT/US1999/025686)
Authority
WO
WIPO (PCT)
Prior art keywords
term
document
terms
moduli
establish
Prior art date
Application number
PCT/US1999/025686
Other languages
French (fr)
Inventor
Mark Kantrowitz
Original Assignee
Justsystem Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Justsystem Corporation
Priority to AU19073/00A
Publication of WO2000033215A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/313Selection or weighting of terms for indexing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition


Abstract

A computer implemented method of extracting characterizing terms from a document comprising the steps of extracting terms from the document, counting the number of occurrences of each term extracted from the document to establish a frequency value for each term, counting the characters or strokes in each term to establish a character count for each term, multiplying the frequency value for each term or a monotonic function thereof by the character count for each term or a monotonic function thereof to establish a modulus for each term, and sorting the terms according to their moduli, whereby the terms with the greatest moduli may be accepted as keywords characteristic of the document's content.

Description

TERM-LENGTH TERM-FREQUENCY METHOD FOR MEASURING DOCUMENT SIMILARITY AND CLASSIFYING TEXT
BACKGROUND OF THE INVENTION
Information retrieval, text classification, information filtering, text summarization, text keyword extraction and highlighting, document routing and related processes are increasingly needed with the advent of the World Wide Web. Most such methods involve forming vectors of term scores or weights for a document or term, where a term may be a word, character n-gram or phrase. The task is then to find the pairs of vectors that are closest using some definition of closeness, such as the least-squares method or the cosine distance (dot product) method.
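For reference, the cosine distance mentioned above is conventionally computed as the dot product of the two term vectors divided by the product of their magnitudes, cos(v1, v2) = (v1 · v2) / (|v1| |v2|), so that two documents with proportional term vectors score 1 regardless of their lengths; this is the standard formulation in the information retrieval literature rather than a formula recited in this disclosure.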
The most common methods of computing scores for terms involve statistical information about the document itself, as well as information about the universe of documents that includes the document. For example, term frequency multiplied by inverse document frequency (TFIDF) computes the frequency of each term in the document or query and multiplies it by the reciprocal of the frequency of the term across documents (a measure of the rarity of the term). Unfortunately, one must then maintain statistics about the whole set of documents, that is, substantial information external to the document itself. Updating this information is a significant problem for such methods.
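By way of illustration (the weighting shown is the textbook formulation, not a formula recited here), a typical TFIDF score for a term t in a document d is w(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) is the frequency of t in d, N is the number of documents in the collection and df(t) is the number of documents containing t. The log(N / df(t)) factor is the measure of rarity referred to above, and N and df(t) are precisely the external statistics that must be maintained and updated as the collection changes.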
U.S. Patent No. 4,843,389 entitled "Text Compression and Expansion Method and Apparatus" uses the product of the word length in characters and the word's frequency of occurrence for text compression, but not for information retrieval. Also, that patent involved the frequency of occurrence of the word in the language in which it is used (i.e., "general usage over a sample of texts from the user's environment"), not the frequency of occurrence of the term in the document being compressed.
U.S. Patent No. 5,182,708 entitled "Method and Apparatus for Classifying Text" focuses on using a document readability metric to distinguish text in computer manuals from text written by foreigners and native English speakers. It multiplies the quantity log(N/L) / [log(N) - 1], where N is the number of words in the document and L is the number of different words, by the correlation coefficient between word length and the logarithm-scaled rank order of word frequency. The latter is evaluated for the particular document, not all documents, and yields a measure of the degree to which polysyllabic words are used in the document. N/L is the average term frequency for the document.
U.S. Patent No. 5,293,552 entitled "Method for Storing Bibliometric Information on Items From a Finite Source of Text, and in Particular Document Postings for Use in a Full-Text Document Retrieval System" makes use of a postulated rank-occurrence frequency relation. It was found that the resulting computed frequencies are too high for high frequency terms and too low for low frequency terms, and that taking the square root of the estimated occurrence frequencies yields better results than using the raw occurrence frequencies themselves. The patent concerns the use of this estimation technique to reduce the size of the indexes in the information retrieval algorithm.
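(The postulated relation is of the kind made familiar by Zipf's law, under which a term's occurrence frequency is roughly inversely proportional to its frequency rank, f(r) ≈ C/r; we note this only as general background, since the relation is not spelled out here.)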
The computer method, according to the present invention, does not require any information outside the document being scored and is easy to implement. It is so simple that it would not be expected to work well, but in fact it outperforms some existing methods. A document summarizer based on this method is easy to implement and use and requires less memory than other methods. The present invention is also scalable because it does not rely on information outside the document itself and so does not consume more resources as the number of documents increases. Avoiding the need to update this information makes the present invention more scalable than state-of-the-art information retrieval algorithms, making it also highly suitable for distributed information retrieval applications.
The present invention is directed to information retrieval, text filtering, text summarization, text classification, keyword extraction and related tasks, but not text compression.
SUMMARY OF THE INVENTION
Briefly, according to this invention, there is provided a method for identifying the most descriptive words in a computer text file or stream. The method of extracting characterizing terms from a document comprises a) extracting terms from the document, b) counting the number of occurrences of each term extracted from the document to establish a frequency value for each term, c) counting the characters or strokes in each term to establish a character count for each term, d) multiplying the frequency value for each term, or a monotonic function of the frequency value, by the character count for each term, or a monotonic function of the character count, to establish a modulus for each term and e) sorting the terms according to their moduli, whereby the terms with the greatest moduli may be accepted as characteristic of the document's content. In non-Roman alphabets, such as those used with Asian languages, the complexity of a word is reflected more by the number of strokes in the word than by the number of characters.
By computer text file, it is meant an ASCII or other standard text file comprised of bytes representative of alphanumeric characters grouped together to represent words. By text stream, it is meant a series of bytes representative of alphanumeric characters grouped together to represent words being transferred serially or in parallel into the computer from a file, keyboard, modem or some other device. The computer implemented method disclosed herein multiplies the frequency of terms (words or phrases) appearing in a text file or stream (hereafter "document") by the length of each term to establish a modulus (score or weight) by which each term can be ranked. Terms with the highest moduli are the most descriptive of the content or information found in the document. Variations on this method include multiplying term frequency by the logarithm of term length, multiplying term frequency by the square root of term length, using a stop-list of articles, prepositions and other common terms to eliminate such terms from the calculation and stemming or truncating words to a standard word form. Additionally, the resulting term weights can be normalized using a fixed table of term constants. These constants would be based on a large training corpus of text documents, but this corpus would not necessarily be the same as the collection of documents being indexed. The idea is not to reintroduce term frequency inverse document frequency (TFIDF) into the formula, but to normalize the term length term frequency (TLTF) values by the typical values for the term so that TLTF then highlights departures from the norm. The possible constants include normalizing term frequencies by the overall frequency of occurrence of the term in the language and normalizing term frequencies by precomputed term rarity values (i.e., multiplying by inverted document frequencies for a reference corpus). Both term length (TL) and term frequency (TF) are available from the document itself without requiring any external resources. For extracting significant keywords, one presents the n terms (for some number n) with the highest moduli (scores). For summarizing documents, one may present the sentences with the highest total scores. For document similarity, one uses the scores as the elements of the term vector.
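The following short Perl sketch is illustrative only (the subroutine name tltf_score and its arguments are ours, not the patent's); it shows the basic modulus computation together with the logarithm and square-root variants described above:

    sub tltf_score {
        # $term is the extracted term, $freq its frequency in the document
        # and $variant selects the basic product, the log of term length or
        # the square root of term length.
        my ($term, $freq, $variant) = @_;
        $variant = "" unless defined($variant);
        my $len = length($term);               # character count for the term
        return $freq * log($len)  if $variant eq "log";
        return $freq * sqrt($len) if $variant eq "sqrt";
        return $freq * $len;                   # basic term-length term-frequency
    }

For example, &tltf_score("retrieval", 7) yields a modulus of 63, while a three-letter word occurring seven times yields only 21, reflecting the premise that longer terms carry more content.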
The method disclosed herein has application for term weighting and is applicable to information retrieval applications, such as document retrieval; cross-language information retrieval; keyword extraction; document routing; classification; categorization; clustering; document filtering; query expansion; chapter, paragraph and sentence segmentation; spelling correction (i.e., ranking candidate corrections); term, query and document similarity metrics; and text summarization.
BRIEF DESCRIPTION OF THE DRAWING
Further features and other objects and advantages will become clear from the following detailed description made with reference to the drawing which is a flow diagram for explaining a computer program for implementing the invention.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
This is a computer program implemented method. It has been implemented in the Perl language, which is particularly useful for processing documents. The Perl language is described in Learning Perl by Randal L. Schwartz and Tom Christiansen (O'Reilly & Associates, Inc., 1997). The main Perl program instructions comprise:

    &load_file("file");
    &gather_freq();
    &process_freq();
    foreach $key (@result) { print " $key"; }
As those skilled in the art know, most programs can be written in any one of a number of languages, and different approaches can cause the computer to follow the same basic steps. The drawing schematically illustrates the main portion of the program.
Referring to the drawing, the first step 10 is to capture the document content line by line. Perl has a standard function for inputting a line of text from a file. The line is stored in a scalar variable. The next step 20 is to isolate each word in the line. Perl has a function for splitting a line of text into an array of words. The array of words is then processed at 30 to build a word frequency hash using each new word as a key. A hash is like an array in that it is a collection of scalar values with individual elements indexed by keys associated with each value. Hence, at 30, each word is individually compared to the keys already in the frequency hash and, if not present, it is added to the hash as a key; if present, the value (number) associated with the key is incremented. After each line is processed at 30, a test is made at 40 to determine if another line of text is available in the document. Steps 10, 20 and 30 are repeated until all lines in the entire document are processed.
In the Perl embodiment, steps 10, 20 and 30 are implemented by the &load_file and &gather_freq subroutines. Portions of &gather_freq are set forth here:

    sub gather_freq {
        foreach $line (@file) {
            if ($line =~ /^\</) {
                # skip markup lines that begin with "<"
            } else {
                @words = split(/\s+/, $line);
                foreach $word (@words) {
                    &add_keyword("$word");
                }
            }
        }
    }

The "&add_keyword" subroutine adds or increments the count of words in the frequency hash "%keyfreq".
The processing step 30 can be made more sophisticated by ignoring certain extremely common words ("stop words"), such as articles and prepositions, that have little value in characterizing the content of the document. As each word in the array is encountered, it is compared for membership in a stop word array and, if present, it is skipped. Alternatively, the stop words can be eliminated at the end of the process rather than at this time. The processing step 30 can be made even more sophisticated by stemming words to the same form. Hence, the plural form of words can be changed to singular and the past tenses and past participles of verb forms can be truncated to the present tense.

Where the method of this invention is used simply to identify the most characteristic terms in a document, the normalizing step 50 need not be implemented. However, if the word characterizing vector created by the method is to be compared with the word characterizing vectors of other documents, normalization is desirable. The simplest normalization technique is to divide all frequency values in the word frequency hash by a constant indicative of the length of the document. A more complicated normalization process would be to use a table of normalization constants which holds a normalization constant for each word. The normalization process would then divide the frequency value associated with each word key by a constant that either increases or decreases the value according to a preconceived understanding of the import of the word for characterizing the text. It would not be unusual for both of these normalization techniques to be used together.

Yet another normalization method comprises subtracting the average frequency for the term from the term's frequency and dividing the result by the standard deviation for the frequencies. The average frequency for the term and the standard deviation are computed using a collection of documents. Since the average and standard deviation do not change much when documents are added to the collection, we can consider them to be fixed. This normalization method scales the moduli into a standard units domain, identifying the number of standard deviations the term's frequency is away from the typical frequency for the term. This makes the scaled values comparable between documents and terms.
The next step 60 is to form a keyword array by extracting the keys from the word frequency hash. The keyword array is then sorted by the product of each word's frequency value and its length.
    sub process_freq {
        @result = sort by_freq (keys(%keyfreq));
    }

    sub by_freq {
        # Compare two words by frequency times length so that the words
        # with the greatest moduli sort to the front of the result array.
        $x = $keyfreq{"$a"} * length("$a");
        $y = $keyfreq{"$b"} * length("$b");
        if    ($x < $y)  { 1; }
        elsif ($x == $y) { 0; }
        else             { -1; }
    }
At 70, the words at the front of the keyword array (those with the greatest product) are then displayed as the words that best characterize the document. For example, the entire document might be displayed highlighting the five words at the front of the keyword array. In an extension of this method, the keyword array might be converted to a keyword value hash. The value for each keyword may simply be one, or it might be the product of frequency times word length. The keyword value hashes for two documents can then be combined by dot product multiplication followed by summing the product components to get a factor indicative of the similarity of the content of the two documents.
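A minimal sketch of that combination step (our illustration, not code from the patent; %doc1 and %doc2 are assumed to be the keyword value hashes built for the two documents and are passed by reference):

    sub similarity {
        # Dot product of two keyword value hashes: multiply the scores
        # of keywords common to both documents and sum the products to
        # obtain the similarity factor.
        my ($h1, $h2) = @_;
        my $sum = 0;
        foreach $key (keys(%$h1)) {
            $sum += $$h1{$key} * $$h2{$key} if defined($$h2{$key});
        }
        return $sum;
    }

A call such as $score = &similarity(\%doc1, \%doc2); then returns a larger factor when the two documents share more heavily weighted terms.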
Having thus described my invention with the detail and particularity required by the Patent Laws, what is desired to be protected by Letters Patent is set forth in the following claims.

Claims

I CLAIM :
1. A computer implemented method of extracting characterizing terms from a document comprising the steps of: a) extracting terms from the document; b) counting the number of occurrences of each term extracted from the document to establish a frequency value for each term; c) counting the characters or strokes in each term to establish a character count for each term; d) multiplying the frequency value for each term or a monotonic function thereof by the character count for each term or a monotonic function thereof to establish a modulus for each term; and e) sorting the terms according to their moduli whereby the terms with the greatest moduli may be accepted as keywords characteristic of the document's content.
2. The method according to claim 1, wherein a step is provided for discarding stop words from the extracted terms.
3. The method according to claim 1, wherein a step is provided for the term counts to be normalized.
4. The method according to claim 3, wherein a step is provided for the term count to be normalized by a constant indicative of the length of the document.
5. The method according to claim 3 or 4, wherein a step is provided for the term count to be normalized by dividing the term count for individual terms by constants for terms that are preconceived to increase or decrease the frequency value according to the import of the term for characterizing the content of the document.
6. The method according to claim 3, wherein a step is provided for the term count to be normalized by subtracting a term-specific constant and dividing by another term-specific constant.
7. The method according to claim 1, wherein a step is provided for stemming terms extracted from the document to a standard word form.
8. The method according to claim 1 further comprising the steps of counting the keywords in each sentence of the document and displaying the sentence with the most keywords therein.
9. The method according to claim 1 further comprising the steps of generating a score for each sentence in the document based upon the moduli of the words in the sentence and displaying the sentences with the highest scores as a summary of the document.
10. The method according to claim 9, wherein the scores are normalized by the sentence length.
11. The method according to claim 9, wherein the scores are computed by adding the top n moduli for each sentence.
12. The method according to claim 1 further comprising the steps of displaying the entire document with the keywords highlighted.
13. A computer implemented method of comparing the content of two documents comprising extracting characterizing terms from each document by the steps of: a) extracting terms from the document; b) counting the number of occurrences of each term extracted from the document to establish a frequency value for each term; c) counting the characters in each term to establish a character count for each term; d) multiplying the frequency value for each term or a monotonic function thereof by the character count for each term or a monotonic function thereof to establish a modulus for each term; e) sorting the terms according to their moduli whereby the terms with the greatest moduli may be accepted as characteristic of each document's content; f) forming a vector for each document based on a number of terms having the greatest moduli; and g) generating a factor indicative of the similarity of the two documents.
14. The method according to claim 13, wherein the step for generating a factor indicative of the similarity of the two documents includes forming the dot product of the two vectors .
15. A computer implemented method of document retrieval based on a query including query terms comprising extracting characterizing terms from each document selected by the query comprising the steps of: a) extracting terms from the document; b) counting the number of occurrences of each term extracted from the document to establish a frequency value for each term; c) counting the characters in each term to establish a character count for each term; d) multiplying the frequency value for each term or a monotonic function thereof by the character count for each term or a monotonic function thereof to establish a modulus for each term; e) sorting the terms according to their moduli whereby the terms with the greatest moduli may be accepted as characteristic of each document's content; f) forming a vector for each document based on a number of terms having the greatest moduli; and g) generating a factor indicative of the similarity of the document to the query terms.
16. The method according to claim 15, wherein the step for generating a factor indicative of the similarity of the document to the query terms includes forming the dot product of the document vector and a vector of query terms.
17. The method according to claim 1 comprising the further steps of calling a spell checking program which provides a candidate correction list of terms for each misspelled term encountered and sorting the candidate correction list in ascending order according to modulus.
18. The method according to claim 17, wherein terms in the candidate correction list that do not have a modulus are sorted in ascending order by term length.
PCT/US1999/025686 1998-11-30 1999-11-01 Term-length term-frequency method for measuring document similarity and classifying text WO2000033215A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU19073/00A AU1907300A (en) 1998-11-30 1999-11-01 Term-length term-frequency method for measuring document similarity and classifying text

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US20156998A 1998-11-30 1998-11-30
US09/201,569 1998-11-30

Publications (1)

Publication Number Publication Date
WO2000033215A1 true WO2000033215A1 (en) 2000-06-08

Family

ID=22746357

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1999/025686 WO2000033215A1 (en) 1998-11-30 1999-11-01 Term-length term-frequency method for measuring document similarity and classifying text

Country Status (2)

Country Link
AU (1) AU1907300A (en)
WO (1) WO2000033215A1 (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1168202A2 (en) * 2000-06-28 2002-01-02 Matsushita Electric Industrial Co., Ltd. Apparatus for retrieving similar documents and apparatus for extracting relevant keywords
WO2002048905A1 (en) * 2000-12-15 2002-06-20 80-20 Software Pty. Limited Method of document searching
EP1462950A1 (en) * 2003-03-27 2004-09-29 Sony International (Europe) GmbH Method of analysis of a text corpus
WO2007057809A2 (en) * 2005-11-15 2007-05-24 Koninklijke Philips Electronics N.V. Method of obtaining a representation of a text
US7321880B2 (en) 2003-07-02 2008-01-22 International Business Machines Corporation Web services access to classification engines
EP1921860A2 (en) * 2006-11-10 2008-05-14 Fujitsu Limited Information retrieval apparatus and information retrieval method
US7412453B2 (en) 2002-12-30 2008-08-12 International Business Machines Corporation Document analysis and retrieval
US8244767B2 (en) 2009-10-09 2012-08-14 Stratify, Inc. Composite locality sensitive hash based processing of documents
CN103218435A (en) * 2013-04-15 2013-07-24 上海嘉之道企业管理咨询有限公司 Method and system for clustering Chinese text data
WO2015108723A1 (en) * 2014-01-20 2015-07-23 Array Technology, LLC Document grouping system
US9355171B2 (en) 2009-10-09 2016-05-31 Hewlett Packard Enterprise Development Lp Clustering of near-duplicate documents
CN114492446A (en) * 2022-02-16 2022-05-13 平安科技(深圳)有限公司 Legal document processing method and device, electronic equipment and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105956010B (en) * 2016-04-20 2019-03-26 浙江大学 Distributed information retrieval set option method based on distributed characterization and partial ordering

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5544049A (en) * 1992-09-29 1996-08-06 Xerox Corporation Method for performing a search of a plurality of documents for similarity to a plurality of query words
US5642502A (en) * 1994-12-06 1997-06-24 University Of Central Florida Method and system for searching for relevant documents from a text database collection, using statistical ranking, relevancy feedback and small pieces of text
US5748953A (en) * 1989-06-14 1998-05-05 Hitachi, Ltd. Document search method wherein stored documents and search queries comprise segmented text data of spaced, nonconsecutive text elements and words segmented by predetermined symbols

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5748953A (en) * 1989-06-14 1998-05-05 Hitachi, Ltd. Document search method wherein stored documents and search queries comprise segmented text data of spaced, nonconsecutive text elements and words segmented by predetermined symbols
US5544049A (en) * 1992-09-29 1996-08-06 Xerox Corporation Method for performing a search of a plurality of documents for similarity to a plurality of query words
US5642502A (en) * 1994-12-06 1997-06-24 University Of Central Florida Method and system for searching for relevant documents from a text database collection, using statistical ranking, relevancy feedback and small pieces of text

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1168202A3 (en) * 2000-06-28 2004-01-14 Matsushita Electric Industrial Co., Ltd. Apparatus for retrieving similar documents and apparatus for extracting relevant keywords
EP1168202A2 (en) * 2000-06-28 2002-01-02 Matsushita Electric Industrial Co., Ltd. Apparatus for retrieving similar documents and apparatus for extracting relevant keywords
WO2002048905A1 (en) * 2000-12-15 2002-06-20 80-20 Software Pty. Limited Method of document searching
US7412453B2 (en) 2002-12-30 2008-08-12 International Business Machines Corporation Document analysis and retrieval
US8015206B2 (en) 2002-12-30 2011-09-06 International Business Machines Corporation Document analysis and retrieval
US8015171B2 (en) 2002-12-30 2011-09-06 International Business Machines Corporation Document analysis and retrieval
EP1462950A1 (en) * 2003-03-27 2004-09-29 Sony International (Europe) GmbH Method of analysis of a text corpus
US7321880B2 (en) 2003-07-02 2008-01-22 International Business Machines Corporation Web services access to classification engines
WO2007057809A3 (en) * 2005-11-15 2007-08-02 Koninkl Philips Electronics Nv Method of obtaining a representation of a text
WO2007057809A2 (en) * 2005-11-15 2007-05-24 Koninklijke Philips Electronics N.V. Method of obtaining a representation of a text
EP1921860A2 (en) * 2006-11-10 2008-05-14 Fujitsu Limited Information retrieval apparatus and information retrieval method
EP1921860A3 (en) * 2006-11-10 2012-02-15 Fujitsu Limited Information retrieval apparatus and information retrieval method
EP2802143A1 (en) * 2006-11-10 2014-11-12 Fujitsu Limited Information retrieval apparatus and information retrieval method
US8244767B2 (en) 2009-10-09 2012-08-14 Stratify, Inc. Composite locality sensitive hash based processing of documents
US9355171B2 (en) 2009-10-09 2016-05-31 Hewlett Packard Enterprise Development Lp Clustering of near-duplicate documents
CN103218435A (en) * 2013-04-15 2013-07-24 上海嘉之道企业管理咨询有限公司 Method and system for clustering Chinese text data
WO2015108723A1 (en) * 2014-01-20 2015-07-23 Array Technology, LLC Document grouping system
US9298983B2 (en) 2014-01-20 2016-03-29 Array Technology, LLC System and method for document grouping and user interface
CN114492446A (en) * 2022-02-16 2022-05-13 平安科技(深圳)有限公司 Legal document processing method and device, electronic equipment and storage medium
CN114492446B (en) * 2022-02-16 2023-06-16 平安科技(深圳)有限公司 Legal document processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
AU1907300A (en) 2000-06-19

Similar Documents

Publication Publication Date Title
US5991714A (en) Method of identifying data type and locating in a file
Damashek Gauging similarity with n-grams: Language-independent categorization of text
Ahonen et al. Applying data mining techniques for descriptive phrase extraction in digital document collections
Ahmed et al. Language identification from text using n-gram based cumulative frequency addition
JP2978044B2 (en) Document classification device
US8046370B2 (en) Retrieval of structured documents
US8005665B2 (en) Method and apparatus for generating a language independent document abstract
EP0750266B1 (en) Document classification unit and document retrieval unit
Kruengkrai et al. Generic text summarization using local and global properties of sentences
JPH03172966A (en) Similar document retrieving device
WO2000033215A1 (en) Term-length term-frequency method for measuring document similarity and classifying text
CN108363694B (en) Keyword extraction method and device
US20020059346A1 (en) Sort system for text retrieval
Irfan et al. Implementation of Fuzzy C-Means algorithm and TF-IDF on English journal summary
Bohne et al. Efficient keyword extraction for meaningful document perception
JP3198932B2 (en) Document search device
Raskutti et al. Second order features for maximising text classification performance
Hajeer et al. A new stemming algorithm for efficient information retrieval systems and web search engines
Gowtham et al. An approach for document pre-processing and K means algorithm implementation
Zhou et al. Chinese documents classification based on N-grams
Lv et al. Research of english text classification methods based on semantic meaning
CN113609247A (en) Big data text duplicate removal technology based on improved Simhash algorithm
Ilham et al. Implementation of clustering and similarity analysis for detecting content similarity in student final projects
Carmel et al. Morphological disambiguation for Hebrew search systems
Gandotra et al. Functional words removal techniques: A review

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AL AM AT AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ CZ DE DE DK DK DM EE EE ES FI FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase