WO2019136457A1 - Method for automated categorization of keyword data - Google Patents

Method for automated categorization of keyword data

Info

Publication number
WO2019136457A1
Authority
WO
WIPO (PCT)
Prior art keywords
text string
uniform resource
category
text
resource locators
Prior art date
Application number
PCT/US2019/012730
Other languages
French (fr)
Other versions
WO2019136457A4 (en)
Inventor
Stephen Scarr
Original Assignee
Stephen Scarr
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to US15/865,110 (US20190121914A1)
Application filed by Stephen Scarr
Publication of WO2019136457A1
Publication of WO2019136457A4

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Abstract

A method for categorizing text strings assigns text strings to topical categories (100). A search engine (120) retrieves and ranks (130) a list of Uniform Resource Locators (URLs) for each text string. The most highly-ranked URLs for a set of text strings form a whitelist (170) of pre-approved text strings that are assumed to correlate closely with category meaning. Incorrectly categorized text strings are identified by scoring a list of URLs retrieved by a search engine for each text string, comparing each score to the whitelist position of the text string (220), flagging text strings with scores that deviate from whitelist position by at least a threshold amount (230), and reassigning flagged text strings to categories with the most similar sets of retrieved URLs (260).

Description

METHOD FOR AUTOMATED CATEGORIZATION OF KEYWORD DATA

Technical Field

The invention relates to algorithmic classification of text strings into topical hierarchies.

Background Art

In the field of automated algorithmic classification of text strings into topical hierarchies or ontologies there is a need to quickly identify incorrect categorizations and to provide a path for improvement. Being able to rapidly analyze and improve a large dataset of classified text with limited manual intervention allows for quick release of updated datasets and can identify and correct errors before they manifest in applications that may rely on them.

Collaborative filtering solutions require observation of usage patterns over a period of time. What is needed is a method by which large datasets may be quickly tested and text strings that are poorly classified are quickly identified and correctly categorized.

Disclosure of the Invention

A method for categorizing text strings employs humans to create topical categories and vocabulary rules. Text strings are input into computer system memory and assigned to topical categories according to the vocabulary rules. One or more search engines are then used to retrieve and rank a list of Uniform Resource Locators (URLs) for each training set text string. The most highly-ranked URLs for a set of text strings form a whitelist of pre-approved text strings that are assumed to correlate closely with category meaning.

Each text string is processed with one or more internet search engines to retrieve a ranked set of uniform resource locators. A numerical score is assigned to each ranked uniform resource locator for each text string in each category. A whitelist of uniform resource locators is created for each category with the uniform resource locators ranked by numerical score. One or more text strings may be processed by an internet search engine to retrieve an audit set of ranked uniform resource locators. The positional rank of each uniform resource locator in the audit set may be compared to the positional rank of the same uniform resource locator in the whitelist for the category to which the audited text string is assigned.

Incorrectly categorized text strings may be identified by scoring a list of URLs retrieved by a search engine for each text string, comparing each score to the whitelist position of the text string, flagging text strings with scores that deviate from whitelist position by at least a threshold amount, and reassigning flagged text strings to categories with the most similar sets of retrieved URLs.

A new, unknown text string may be efficiently and accurately categorized by using one or more horizontal search engines to generate a list of returned URLs for the text string. A score is assigned to each URL, and the text string is assigned to the category with the most similar URL whitelist.

Brief Description of the Drawings

FIG. 1 is a processing flow diagram showing a method for creating whitelists of closely related URLs.

FIG. 2 is a processing flow diagram showing a method for auditing a text string to confirm accuracy of category assignment for the text string.

FIG. 3 is a processing flow diagram showing a method for creating whitelists of URLs that may lack significant overlap with audited URLs.

FIG. 4 is a processing flow diagram showing a method for auditing a URL that lacks significant overlap with whitelisted URLs and accurately assigning the audited URL to at least one category.

Modes for Carrying Out the Invention

Accurate categorization of keywords by meaning can effect substantial improvements in the usefulness of search engine results. Although automated systems may excel at making rapid and accurate comparisons between keywords and other text strings, such systems have had at best limited success at recognizing the meanings attributed to text strings by human beings. Human evaluation of large volumes of text strings is a slow and expensive process.

The combined speed and accuracy of text string categorization may be substantially improved by a method that employs humans to create categories of meaning and audit a training set of text strings assigned to those categories. One or more search engines such as Google, Bing and/or other search engines known in the art are then used to retrieve and rank a list of Uniform Resource Locators (URLs) for each training set text string. The most highly-ranked URLs for a set of text strings form a whitelist of pre-approved text strings that are assumed to correlate closely with category meaning.

Incorrectly categorized text strings may be identified by scoring a list of URLs retrieved by a search engine for each text string, comparing each score to the whitelist position of the text string, flagging text strings with scores that deviate from whitelist position by at least a threshold amount, and reassigning flagged text strings to categories with the most similar sets of retrieved URLs.

A new, unknown text string may be efficiently and accurately categorized by using one or more horizontal search engines to generate a list of returned URLs for the text string. A score is assigned to each URL, and the text string is assigned to the category with the most similar URL whitelist.

A preferred embodiment of the invention utilizes over 450,000 hierarchical categories that together encompass the entire commercial and social internet. A dataset of text strings is processed by an automated system using human-created vocabulary rules to assign each text string to one category, creating a training set of text strings for each category. The vocabulary rules may include positive and negative filter words that allow or prevent assignment of a text string to a category. In the preferred embodiment each training set comprises the lesser of the top 25% of text strings or the top 500 text strings assigned to the category, ranked in descending order by volume.
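The training set selection and the filter-word rules described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names, the rounding up of the 25% cut-off, and the reading of "positive" words as "at least one must be present" are all assumptions.

```python
import math

def rule_allows(text, positive_words, negative_words):
    """Assumed rule semantics: at least one positive filter word is present
    and no negative filter word is present."""
    tokens = set(text.lower().split())
    return bool(tokens & positive_words) and not (tokens & negative_words)

def select_training_set(strings_with_volume, fraction=0.25, cap=500):
    """Return the lesser of the top 25% or the top 500 text strings by volume.

    strings_with_volume is a list of (text, volume) pairs. Rounding the 25%
    cut-off up is an assumption; the source does not state a rounding rule.
    """
    ranked = sorted(strings_with_volume, key=lambda item: item[1], reverse=True)
    limit = min(math.ceil(len(ranked) * fraction), cap)
    return [text for text, _volume in ranked[:limit]]
```

For a category with 8 assigned text strings this keeps the 2 with the highest volume; for a category with 4,000 it keeps 500.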

Each text string training set is processed by one or more horizontal search engines to create a whitelist. FIG. 1 is a processing flow diagram showing a method for creating a whitelist. A category is selected 100 for processing and the highest volume text strings from the training set assigned to the category are identified 110. Each identified text string from the training set is processed by one or more horizontal search engines and a predetermined number of URLs are retrieved from the provided result sets 120. In this preferred embodiment, a maximum of 100 URLs are retrieved for each text string. In other embodiments more or fewer URLs may be retrieved. Each retrieved URL is recorded along with its positional rank, starting at 1, in the search engine results set.

In an alternate embodiment URLs that are deemed “noisy” (common across all text strings and all categories) may be excluded. Wikipedia, eBay, Bing, Ask.com, Google, Yahoo, and Amazon are examples of sites with URLs associated with so many categories that they are too noisy to be useful.
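One way such an exclusion might look is a host-based filter. The domain names below are assumed mappings of the site names listed above, and matching on the host name (rather than measuring how common a URL is across categories) is a simplification chosen for illustration.

```python
from urllib.parse import urlparse

# Assumed domains for the example sites named in the text.
NOISY_DOMAINS = {
    "wikipedia.org", "ebay.com", "bing.com", "ask.com",
    "google.com", "yahoo.com", "amazon.com",
}

def host_of(url):
    """Extract the lower-cased host, tolerating URLs written without a scheme."""
    parsed = urlparse(url if "://" in url else "//" + url)
    return (parsed.hostname or "").lower()

def is_noisy(url):
    """True when the host is a noisy domain or one of its subdomains."""
    host = host_of(url)
    return any(host == d or host.endswith("." + d) for d in NOISY_DOMAINS)

def drop_noisy(urls):
    """Keep result order, removing URLs on noisy domains."""
    return [url for url in urls if not is_noisy(url)]
```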

In this preferred embodiment each retrieved URL is given a position score 130 between 1 and 0, determined by the formula 1 - ((P - 1) / T) where P is the position rank and T is the total number of URLs retrieved. If a set of 100 URLs is retrieved, the first position URL is assigned a score of 1.00, the second 0.99, the third 0.98, etc. The 100th receives a score of 0.01. If a retrieved set comprises only 10 URLs, the first position receives a score of 1.00, the second 0.90, and the third 0.80. Scored URLs are stored in a database 140.
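The position score formula can be written directly as a small function. The function name and the range check are illustrative additions; the arithmetic is the formula stated above.

```python
def position_score(position, total):
    """Score a URL by rank: 1 - ((P - 1) / T), with P starting at 1."""
    if not 1 <= position <= total:
        raise ValueError("position must be between 1 and total")
    return 1 - (position - 1) / total
```

With `total=100`, positions 1, 2 and 100 give 1.00, 0.99 and 0.01 after rounding to two decimals; with `total=10`, position 3 gives 0.80.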

Stored URLs are aggregated and ranked 150 by cumulative position scores. A whitelist for the category is created 160 from the 50 highest scoring URLs, ranked by score, descending. A new category is selected 165 and the process is repeated until a whitelist is created for every category 170.
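A sketch of the aggregation step follows, assuming each text string's results arrive as a list of URLs in rank order. Breaking score ties alphabetically is an assumption made so the output is deterministic; the source does not specify a tie rule.

```python
from collections import defaultdict

def build_whitelist(result_sets, size=50):
    """Sum position scores per URL across a category's training strings
    and return the top `size` (url, cumulative_score) pairs, best first."""
    totals = defaultdict(float)
    for urls in result_sets:
        total = len(urls)
        for position, url in enumerate(urls, start=1):
            totals[url] += 1 - (position - 1) / total
    # Highest cumulative score first; alphabetical order breaks ties (assumed).
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:size]
```

Because scores are summed, a URL that appears near the top for many training strings outranks one that appears first for only a single string.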

When a whitelist has been created for every category the entire dataset or any portion of the dataset of text strings may be audited for correct categorization. FIG. 2 shows a method for auditing text string categorization.

A text string from the dataset is selected 200 for auditing. The text string is processed by one or more horizontal search engines and URLs are retrieved 210. If the text string has been assigned a category 215, the retrieved text string URLs are scored 220 against the whitelist URLs for the category according to the similarity in position the search result URLs have to the position of corresponding URLs in the whitelist. A text string URL in the same rank position as the corresponding whitelist URL is given 100% of the whitelist URL’s score.

If a text string URL is ranked higher than a corresponding whitelist URL, the text string score is decreased by a compounded 5% for each unit of difference in rank position. For example, a text string’s URL www.test1.com is in position 5. The corresponding whitelist URL, www.test1.com, is in rank position 8 with a score of 4.00. The text string’s URL score becomes 3.43, or ((4*.95)*.95)*.95, or 4*(0.95^3).

If a text string URL is ranked lower than a corresponding whitelist URL, the text string score is decreased by a compounded 25% for each unit of difference in rank position. For example, a text string’s URL www.test2.com is in position 10. The corresponding whitelist URL, www.test2.com, is in rank position 2 with a score of 20.00. The text string’s URL score becomes 2.00, or 20*(0.75^8).
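Both worked examples can be reproduced with one function. The function and parameter names are illustrative; the 5% and 25% rates are the ones given for the preferred embodiment.

```python
def adjusted_score(whitelist_score, audit_position, whitelist_position,
                   higher_penalty=0.05, lower_penalty=0.25):
    """Discount a whitelist score by the rank gap between audit and whitelist.

    A smaller position number means a higher rank. An audited URL ranked
    higher than its whitelist counterpart loses 5% per step, compounded;
    one ranked lower loses 25% per step, compounded.
    """
    gap = whitelist_position - audit_position
    if gap > 0:   # audited URL ranks higher than the whitelist URL
        return whitelist_score * (1 - higher_penalty) ** gap
    if gap < 0:   # audited URL ranks lower than the whitelist URL
        return whitelist_score * (1 - lower_penalty) ** (-gap)
    return whitelist_score  # same position: 100% of the whitelist score
```

Calling `adjusted_score(4.00, 5, 8)` gives about 3.43, and `adjusted_score(20.00, 10, 2)` gives about 2.00, matching the two examples in the text.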

Each text string URL is compared to its corresponding whitelist URL and the text string scores are adjusted as described above. Text strings with cumulative URL scores below a chosen threshold of similarity to the corresponding whitelist scores are flagged 230 as incorrectly classified. Any suitable threshold of similarity may be selected. In this preferred embodiment a minimum threshold value of 0.5% of the “perfect” URL whitelist score is selected.

A text string with an adjusted cumulative score at or exceeding the selected threshold value remains assigned 240 to the same category. The returned URLs for a text string with an adjusted cumulative score below the selected threshold value are compared 250 to whitelist URLs for other categories using the same auditing process until the category having the most similar whitelist URLs is identified and the text string is reassigned 260 to a new category. Scores of multiple potential new categories may be compared against each other to break ties in instances where more than one category’s whitelist shows a high degree of overlap with text string URLs.
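The flag-and-reassign logic might be sketched as below. Several points are assumptions rather than statements from the source: the "perfect" score is taken to be the sum of all whitelist scores, candidate categories are compared by the ratio of cumulative score to that perfect score, and the example whitelist scores are invented for illustration.

```python
def cumulative_score(audit_urls, whitelist):
    """Sum rank-adjusted whitelist scores for URLs found in both lists.

    audit_urls is in rank order; whitelist is a ranked list of (url, score).
    """
    ranked = {url: (rank, score)
              for rank, (url, score) in enumerate(whitelist, start=1)}
    total = 0.0
    for audit_rank, url in enumerate(audit_urls, start=1):
        if url not in ranked:
            continue
        whitelist_rank, score = ranked[url]
        gap = whitelist_rank - audit_rank
        # Higher than the whitelist: 5% per step; lower: 25% per step.
        factor = 0.95 ** gap if gap >= 0 else 0.75 ** (-gap)
        total += score * factor
    return total

def similarity(audit_urls, whitelist):
    """Cumulative score as a fraction of the assumed 'perfect' score."""
    perfect = sum(score for _url, score in whitelist)
    return cumulative_score(audit_urls, whitelist) / perfect if perfect else 0.0

def audit_category(audit_urls, assigned, whitelists, threshold=0.005):
    """Keep the assigned category if it meets the threshold, else pick the
    other category whose whitelist is most similar. Pass assigned=None for
    a text string that has no category yet."""
    if assigned is not None and similarity(audit_urls, whitelists[assigned]) >= threshold:
        return assigned
    candidates = [name for name in whitelists if name != assigned]
    if not candidates:
        return assigned
    return max(candidates, key=lambda name: similarity(audit_urls, whitelists[name]))

# Hypothetical whitelists with invented scores.
whitelists = {
    "bow tie": [("bowtieclub.com", 3.0), ("bowties.com", 2.0)],
    "pasta": [("cooks.com", 3.0), ("allrecipes.com", 2.0)],
}
```

Here `audit_category(["cooks.com", "allrecipes.com"], "bow tie", whitelists)` returns `"pasta"`, because none of the retrieved URLs appear in the "bow tie" whitelist.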

Once whitelists have been created and audited for every category, unknown and unclassified text strings may be processed and assigned to categories. Referring again to FIG. 2, a new text string is selected 200 for auditing. The text string is processed by one or more horizontal search engines and URLs are retrieved 210. If the text string has not been assigned a category 215, the returned URLs for the text string are compared 250 to whitelist URLs for each category using the same auditing process until the category having the most similar whitelist URLs is identified and the text string is assigned 260 to a category.

For example, in the category “bow tie” the dataset would include http://www.bowtieclub.com/ and http://www.bowties.com/. A search for “bow tie pasta” may return URLs such as www.cooks.com/rec/search/0,1-0,bow_tie_pasta,FF.html and allrecipes.com/recipe/bowtie-pasta/. The URLs in the search results will have a high match rate to the URLs for the category “pasta” but a low match rate to the list of URLs for the category “bow tie”, causing the text string to be placed in the “pasta” category.

Processing of a text string selected for auditing may result in retrieval of a set of URLs which, when compared with stored URL whitelists, have in common a percentage of URLs (in any position or score) less than a chosen threshold of significance. Any suitable threshold of significance may be selected. In an alternate embodiment, a minimum value of 10% of URLs in common between a category URL whitelist and a set of text string URLs is the chosen threshold of significance.
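The threshold of significance can be illustrated with a position-independent overlap check. Measuring the shared URLs as a fraction of the retrieved set (rather than of the whitelist) is an assumption; the source does not say which set is the denominator.

```python
def overlap_fraction(audit_urls, whitelist_urls):
    """Fraction of retrieved URLs that also appear in the whitelist,
    ignoring position and score."""
    audit = set(audit_urls)
    if not audit:
        return 0.0
    return len(audit & set(whitelist_urls)) / len(audit)

def has_significant_overlap(audit_urls, whitelist_urls, threshold=0.10):
    """True when the overlap meets the chosen threshold of significance."""
    return overlap_fraction(audit_urls, whitelist_urls) >= threshold
```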

In an alternate method shown in FIG. 3, a text string from the dataset is selected 300 for auditing. The text string is processed by one or more horizontal search engines and URLs are retrieved 305. URLs retrieved for the audited text string are compared to whitelist URLs 310. If URLs retrieved for the audited text string do not significantly overlap whitelist URLs, processing branches 350 to a categorization process shown in FIG. 4 400.

If URLs retrieved for the audited text string do significantly overlap whitelist URLs, processing continues to a test for category assignment 315. If the audited text string has not been assigned a category, the returned text string URLs are compared to whitelist URLs for hierarchical categories 335 until significant overlaps are found 340.

If the text string has been assigned a category 315, the retrieved text string URLs are scored 320 against the whitelist URLs for the category according to the similarity in position the search result URLs have to the position of corresponding URLs in the whitelist. A text string URL in the same rank position as the corresponding whitelist URL is given 100% of the whitelist URL’s score.

If a text string URL is ranked higher than a corresponding whitelist URL, the text string score is decreased by a compounded 5% for each unit of difference in rank position. For example, a text string’s URL www.test1.com is in position 5. The corresponding whitelist URL, www.test1.com, is in rank position 8 with a score of 4.00. The text string’s URL score becomes 3.43, or ((4*.95)*.95)*.95, or 4*(0.95^3). In alternate embodiments the score may be decreased by amounts greater or less than 5% according to a programmer’s judgment.

If a text string URL is ranked lower than a corresponding whitelist URL, the text string score is in this embodiment decreased by a compounded 25% for each unit of difference in rank position. For example, a text string’s URL www.test2.com is in position 10. The corresponding whitelist URL, www.test2.com, is in rank position 2 with a score of 20.00. The text string’s URL score becomes 2.00, or 20*(0.75^8). In alternate embodiments the score may be decreased by amounts greater or less than 25% according to a programmer’s judgment.

Each text string URL is compared to its corresponding whitelist URL and the text string scores are adjusted as described above. Text strings with cumulative URL scores below a chosen threshold of similarity to the corresponding whitelist scores are flagged 325 as incorrectly classified. Any suitable threshold of similarity may be selected. In this embodiment a minimum threshold value of 0.5% of the “perfect” URL whitelist score is selected.

A text string with an adjusted cumulative score at or exceeding the selected threshold value remains assigned 330 to the same category. The returned URLs for a text string with an adjusted cumulative score below the selected threshold value are compared 335 to whitelist URLs for other categories using the same auditing process until the category having the most overlap with whitelist URLs is identified 340 and the text string is reassigned 345 to a new category. Scores of multiple potential new categories may be compared against each other to break ties in instances where more than one category’s whitelist shows a high degree of overlap with text string URLs.

As described above, when overlap between URLs for an audited text string and whitelist URLs falls below a selected threshold of significance, processing branches 350 to a categorization process shown in FIG. 4 400. A web crawler 405 as is known in the art is directed to each URL returned for the audited text string. For each URL the web crawler extracts hypertext markup language (HTML) content from the web page addressed by the URL and separates content from HTML tags 410.
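Separating content from HTML tags can be sketched with Python's standard `html.parser` module. Fetching the page is left out, and skipping `script` and `style` content is an assumption about what counts as "content"; the class and function names are illustrative.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, ignoring tags and script/style bodies."""
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    """Return the page text with all markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```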

The content retrieved from the web pages is passed into a content classification system 415 that uses a natural language processing engine to parse the content, tag parts of speech and chunk the content into variable length n-grams of word tokens. These chunks are then compared by the content classification system to sets of human-created vocabulary rules curated within human-created hierarchical categories containing text strings audited and selected directly by humans as examples that best embody the concept of each category. Potential topic classifications for each URL are returned 420 for storage and evaluation.
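The chunking and rule-matching steps can be illustrated in a much simplified form. This sketch splits on whitespace and skips part-of-speech tagging entirely, so it stands in for the natural language processing engine rather than reproducing it; the `phrase_to_categories` lookup table is a hypothetical stand-in for the human-created vocabulary rules.

```python
def ngrams(text, min_n=1, max_n=3):
    """Return all word n-grams of length min_n..max_n, shortest first."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def candidate_categories(text, phrase_to_categories, max_n=3):
    """Look up every n-gram in a phrase -> category-path table (hypothetical)."""
    found = []
    for gram in ngrams(text, 1, max_n):
        found.extend(phrase_to_categories.get(gram, []))
    return found
```

For the input `"bow tie pasta"` the n-grams are `bow`, `tie`, `pasta`, `bow tie`, `tie pasta`, and `bow tie pasta`, each of which may match rules in a different category.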

The set of n-grams associated with each URL may be assigned to no category or to many categories, with a typical range being 5 to 25 categories. Each category exists in a hierarchical tree data structure having only one parent, or root, node that may branch to many child nodes that may in turn branch to descendant nodes, ending with leaf nodes that have no further branches. Many categories may share the same ancestor. Each node represents a tier numbered according to the count of nodes back to the root node. For example, the category for the leaf-node concept of “Bow Tie Pasta” may have the following hierarchical path: Food & Drink::Food::Grains, Noodles, & Rice::Noodles::Noodles [No Brand Specified]::Bow Tie Pasta. This path contains 6 tiers, where “Food & Drink” is the Tier 1, “Food” is the Tier 2, etc.

All categories returned for all n-grams associated with all URLs are stored and counted 425. Every appearance of the same category in the same tier level increments the category count for a text string by 1. Every appearance of the leaf-node category increments the category count by 1. For example, a text string might be assigned to three categories within the following branches:

Food & Drink::Food::Grains, Noodles, & Rice::Noodles::Noodles [No Brand Specified]::Bow Tie Pasta

Food & Drink::Food::Produce::Vegetables::Tomatoes

Food & Drink::Food::Grains, Noodles, & Rice::Noodles::Dell'Alpe Pasta

Tier 1 of “Food & Drink” would be counted three times; the Tier 2 of “Food” also three times; the Tier 3 of “Grains, Noodles, & Rice” would be counted two times; the Tier 4 of “Noodles” two times; and all other categories counted once.
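The counting in this example can be reproduced by keying each count on the pair (tier number, category name). Representing a hierarchical path as a tuple of names is an illustrative choice.

```python
from collections import Counter

def tier_counts(paths):
    """Count how often each category appears at each tier across all paths."""
    counts = Counter()
    for path in paths:
        for tier, name in enumerate(path, start=1):
            counts[(tier, name)] += 1
    return counts

# The three branches from the example above, as tuples of category names.
paths = [
    ("Food & Drink", "Food", "Grains, Noodles, & Rice", "Noodles",
     "Noodles [No Brand Specified]", "Bow Tie Pasta"),
    ("Food & Drink", "Food", "Produce", "Vegetables", "Tomatoes"),
    ("Food & Drink", "Food", "Grains, Noodles, & Rice", "Noodles",
     "Dell'Alpe Pasta"),
]
counts = tier_counts(paths)
```

The resulting counts are 3 for “Food & Drink” and “Food”, 2 for “Grains, Noodles, & Rice” and “Noodles”, and 1 for every other category.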

Each tier of the hierarchy is given a weight value, based on a Fibonacci-like sequence:

Tier 1 = 1

Tier 2 = 3

Tier 3 = 5

Tier 4 = 8

Tier 5 = 13

Final Node = 21

Any tiers between 5 and the final node are not scored. These weights are then multiplied 430 by the category counts to create a confidence score for every unique category in all URLs retrieved for a text string.

Classification scores are summed 435 down the hierarchical path until a maximum potential confidence score is reached. Confidence scores between paths are compared. The category 440 with the path that creates this maximum confidence score is the category to which the audited text string should be assigned 445. If two or more categories have the same score, the audited text string should be assigned to the highest scoring common parent of the tied categories 450.
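One possible reading of the weighting and tie-breaking steps is sketched below. It assumes that the last node of a path always takes the final-node weight of 21 (even when the path has five or fewer tiers), that a path's confidence is the sum of weight times count over its nodes, and that the "common parent" of tied categories is their deepest shared ancestor. None of these readings is confirmed by the source.

```python
from collections import Counter

TIER_WEIGHTS = {1: 1, 2: 3, 3: 5, 4: 8, 5: 13}
FINAL_NODE_WEIGHT = 21

def path_score(path, counts):
    """Sum weight * count along a path; tiers between 5 and the final
    node contribute nothing."""
    last = len(path)
    score = 0
    for tier, name in enumerate(path, start=1):
        weight = FINAL_NODE_WEIGHT if tier == last else TIER_WEIGHTS.get(tier, 0)
        score += weight * counts[(tier, name)]
    return score

def choose_category(paths):
    """Return the best-scoring path, or the shared ancestor path on a tie."""
    counts = Counter((tier, name)
                     for path in paths
                     for tier, name in enumerate(path, start=1))
    scores = {path: path_score(path, counts) for path in set(paths)}
    best = max(scores.values())
    tied = [path for path, score in scores.items() if score == best]
    if len(tied) == 1:
        return tied[0]
    common = []
    for nodes in zip(*tied):  # walk the tied paths tier by tier
        if len(set(nodes)) != 1:
            break
        common.append(nodes[0])
    return tuple(common)
```

Under these assumptions, the three example branches ending in “Bow Tie Pasta”, “Tomatoes”, and “Dell'Alpe Pasta” score 72, 46, and 59 respectively, so the text string is assigned to “Bow Tie Pasta”.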

The entire method may be repeated periodically or as needed to accommodate additions, deletions, or modifications in categories, changes in text string meanings, changes in search engine algorithms, and other changes in the content and function of the internet.

The methods described above may be implemented on an electronic data processing device or system including but not limited to a general-purpose computer or a computer network as known in the art. Client computers and server computers provide processing, storage, and input/output devices executing application programs. A computer can be linked through communications networks to other computing devices. A communications network can be part of a remote access network, the Internet, or a local area or wide area network. The disclosed techniques are readily employed in hyperlinked databases, such as a corporate Intranet, a designated portion of the Internet such as a wiki or a particular high-level domain, or in a set of hyperlinked documents.

Each computer contains a system bus comprising a set of lines used for data transfer among the components of a computer or processing system, connecting a processor, disk storage, memory, input/output ports, network ports, and other system elements. An Input/Output (I/O) device interface connects various input and output devices such as a keyboard, mouse, monitor, printer, and speakers to the computer. A network interface connects the computer to various other devices attached to a network. Random access memory provides volatile storage for computer software instructions and data used to implement the embodiments described above. Disk storage, solid state storage or other high-speed read/write memory devices provide non-volatile storage for computer software instructions and data used to implement the embodiments described above. A central processor unit attached to the system bus executes computer instructions. Processor routines and data may be read from and written to computer readable mediums such as DVD-ROMs, CD-ROMs, diskettes, tapes, and hard drives that provide at least portions of the software instructions for the system. Computer programs can be installed by any suitable software installation procedure, as is well known in the art. Alternatively, at least a portion of the software instructions may also be downloaded over a cable, communication and/or wireless connection.

The principles, embodiments, and modes of operation of the present invention have been set forth in the foregoing specification. The embodiments disclosed herein should be interpreted as illustrating the present invention and not as restricting it. The foregoing disclosure is not intended to limit the range of equivalent structure available to a person of ordinary skill in the art in any way, but rather to expand the range of equivalent structures in ways not previously contemplated. Numerous variations and changes can be made to the foregoing illustrative embodiments without departing from the scope and spirit of the present invention.

Claims

I Claim:
1. A computer-implemented method for categorizing text strings, comprising the steps of:
creating topical categories;
creating vocabulary rules;
inputting text strings to system memory;
assigning the text strings to the topical categories with the vocabulary rules;
processing each text string assigned to each category with at least one internet search engine to retrieve a set of ranked uniform resource locators;
assigning a numerical score to each ranked uniform resource locator retrieved for each text string for each category; and
creating a whitelist of uniform resource locators retrieved for each text string for at least a first category, the uniform resource locators ranked by the numerical scores of the text strings assigned to the first category.
2. A computer-implemented method for categorizing text strings as claimed in Claim 1, comprising the additional step of repeating the step of creating a whitelist of uniform resource locators for each additional category until a whitelist is created for every category.
3. A computer-implemented method for categorizing text strings as claimed in Claim 2, comprising the additional step of auditing at least one text string by processing the text string with at least one internet search engine to retrieve an audit set of ranked uniform resource locators.
4. A computer-implemented method for categorizing text strings as claimed in Claim 2, comprising the additional step of auditing each text string by processing each text string with at least one internet search engine to retrieve an audit set of ranked uniform resource locators.
5. A computer-implemented method for categorizing text strings as claimed in Claim 4, comprising the additional step of comparing the positional rank of each uniform resource locator in the audit set to the positional rank of the same uniform resource locator in the whitelist for the category to which the audited text string is assigned.
6. A computer-implemented method for categorizing text strings, comprising the steps of:
creating topical categories;
creating vocabulary rules;
inputting text strings to system memory;
assigning the text strings to the topical categories with the vocabulary rules;
processing each text string assigned to each category with at least one internet search engine to retrieve a set of ranked uniform resource locators;
assigning a numerical score to each ranked uniform resource locator retrieved for each text string for each category;
creating a whitelist of uniform resource locators retrieved for each text string for a first category, the uniform resource locators ranked by the numerical scores of the text strings assigned to the first category;
repeating the step of creating a whitelist of uniform resource locators for each additional category until a whitelist is created for every category; and
auditing at least a selected text string by processing the text string with at least one internet search engine to retrieve text string uniform resource locators and comparing whitelist uniform resource locators to the text string uniform resource locators.
7. A computer-implemented method for categorizing text strings as claimed in Claim 6, further comprising the step of directing a web crawler to the audited text string uniform resource locators, the web crawler extracting hypertext markup language content from the web pages addressed by the uniform resource locators.
8. A computer-implemented method for categorizing text strings as claimed in Claim 7, further comprising the steps of parsing the hypertext markup language content with a natural language processing engine and chunking the hypertext markup language content into variable length n-grams of word tokens.
9. A computer-implemented method for categorizing text strings as claimed in Claim 8, further comprising the steps of returning and storing all categories for all n-grams associated with the uniform resource locator, comparing the n-grams of word tokens to sets of human-created vocabulary rules to generate at least one confidence score, and assigning the audited text string to the category with the highest confidence score.
10. A non-volatile storage medium storing instructions readable and executable by an electronic data processing device to perform a text string categorization method including the operations of:
inputting human-created topical categories and vocabulary rules to memory;
inputting text strings to memory;
processing the text strings with the vocabulary rules to assign the text strings to at least a first topical category;
processing each text string assigned to each category with at least one internet search engine to retrieve a set of ranked uniform resource locators;
calculating a numerical score for each ranked uniform resource locator retrieved for each text string for each category; and
storing in memory a whitelist of uniform resource locators retrieved for each text string for at least a first category, the uniform resource locators ranked by the numerical scores of the text strings assigned to the first category.
11. A non-volatile storage medium storing instructions readable and executable by an electronic data processing device to perform a text string categorization method as claimed in Claim 10 wherein the operation of storing in memory a whitelist of uniform resource locators retrieved for each text string is repeated for every topical category.
12. A non-volatile storage medium storing instructions readable and executable by an electronic data processing device to perform a text string categorization method as claimed in Claim 10 further comprising the operation of retrieving an audit set of ranked uniform resource locators with at least one internet search engine.
13. A non-volatile storage medium storing instructions readable and executable by an electronic data processing device to perform a text string categorization method as claimed in Claim 12, further comprising the operation of comparing the positional rank of each uniform resource locator in the audit set to the positional rank of the same uniform resource locator in the whitelist for the categories to which each audited text string is assigned.
14. A non-volatile storage medium storing instructions readable and executable by an electronic data processing device to perform a text string categorization method as claimed in Claim 10, further comprising the operations of auditing at least a selected text string by processing the text string with at least one internet search engine to retrieve text string uniform resource locators and comparing whitelist uniform resource locators to the text string uniform resource locators.
15. A non-volatile storage medium storing instructions readable and executable by an electronic data processing device to perform a text string categorization method as claimed in Claim 14, further comprising the operation of directing a web crawler to the audited text string uniform resource locators, the web crawler extracting hypertext markup language content from the web pages addressed by the uniform resource locators.
16. A non-volatile storage medium storing instructions readable and executable by an electronic data processing device to perform a text string categorization method as claimed in Claim 15, further comprising the operations of parsing the hypertext markup language content with a natural language processing engine and chunking the hypertext markup language content into variable length n-grams of word tokens.
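By way of non-limiting illustration, the chunking operation of Claim 16 might be sketched as follows. The regular-expression tokenizer is a simplified stand-in for the natural language processing engine, which the claim does not specify; the names `html_to_tokens` and `variable_ngrams` and the default n-gram lengths of 1 to 3 are assumptions for the example.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, discarding the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Simplification: script and style text is not filtered out here.
        self.parts.append(data)


def html_to_tokens(html):
    """Strip markup and split the remaining text into lowercase word tokens."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    return re.findall(r"[a-z0-9]+", text)


def variable_ngrams(tokens, min_n=1, max_n=3):
    """Return every n-gram (as a tuple) for each n from min_n to max_n."""
    return [
        tuple(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


if __name__ == "__main__":
    tokens = html_to_tokens("<p>Used <b>car</b> loans</p>")
    print(variable_ngrams(tokens, 1, 2))
```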
17. A non-volatile storage medium storing instructions readable and executable by an electronic data processing device to perform a text string categorization method as claimed in Claim 16, further comprising the operations of returning and storing all categories for all n-grams associated with the uniform resource locator, comparing the n-grams of word tokens to sets of human-created vocabulary rules to generate at least one confidence score, and assigning the audited text string to the category with the highest confidence score.
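By way of non-limiting illustration, the confidence scoring and category assignment of Claim 17 might be sketched as follows. The claim does not define the form of the vocabulary rules or the confidence score; here each rule set is assumed to be a set of n-gram tuples per category, and the confidence score is assumed to be the fraction of n-grams that match a category's rule set. The names `score_categories` and `assign_category` are hypothetical.

```python
def score_categories(ngrams, vocabulary_rules):
    """Return {category: confidence}.

    ngrams:           list of n-grams, each a tuple of word tokens
    vocabulary_rules: {category: set of n-gram tuples}; an assumed encoding
                      of the human-created vocabulary rules.
    """
    total = len(ngrams)
    scores = {}
    for category, vocabulary in vocabulary_rules.items():
        hits = sum(1 for gram in ngrams if gram in vocabulary)
        # Assumed confidence: share of n-grams matching the category.
        scores[category] = hits / total if total else 0.0
    return scores


def assign_category(ngrams, vocabulary_rules):
    """Return (best_category, all_scores); best_category is None if no rules."""
    scores = score_categories(ngrams, vocabulary_rules)
    if not scores:
        return None, scores
    # On a tie, the category listed first in vocabulary_rules is chosen.
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    grams = [("used",), ("car",), ("loans",), ("used", "car"), ("car", "loans")]
    rules = {
        "autos": {("car",), ("used", "car")},
        "finance": {("loans",)},
    }
    print(assign_category(grams, rules))
```

Returning the full score mapping alongside the winning category keeps every category available for storage, as the claim recites, while the assignment itself uses only the highest score.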
PCT/US2019/012730 2014-10-03 2019-01-08 Method for automated categorization of keyword data WO2019136457A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US15/865,110 US20190121914A1 (en) 2014-10-03 2018-01-08 Method for Automated Categorization of Keyword Data
US15/865,110 2018-01-08

Publications (2)

Publication Number Publication Date
WO2019136457A1 (en) 2019-07-11
WO2019136457A4 (en) 2019-08-15

Family

ID=66175475

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2019/012730 WO2019136457A1 (en) 2014-10-03 2019-01-08 Method for automated categorization of keyword data

Country Status (1)

Country Link
WO (1) WO2019136457A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030225763A1 (en) * 2002-04-15 2003-12-04 Microsoft Corporation Self-improving system and method for classifying pages on the world wide web
US8078625B1 (en) * 2006-09-11 2011-12-13 Aol Inc. URL-based content categorization
US20140258261A1 (en) * 2013-03-11 2014-09-11 Xerox Corporation Language-oriented focused crawling using transliteration based meta-features

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ARDO A ET AL: "AUTOMATIC CLASSIFICATION APPLIED TO FULL-TEXT INTERNET DOCUMENTS IN A ROBOT-GENERATED SUBJECT INDEX", ONLINE INFORMATION. INTERNATIONAL ONLINE INFORMATION MEETINGPROCEEDINGS, XX, XX, 7 December 1999 (1999-12-07), pages 239 - 246, XP001062342 *
None

Also Published As

Publication number Publication date
WO2019136457A4 (en) 2019-08-15

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 19717634

Country of ref document: EP

Kind code of ref document: A1