WO2019136457A4 - Method for automated categorization of keyword data - Google Patents
- Publication number
- WO2019136457A4 (PCT/US2019/012730)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- uniform resource
- text string
- category
- text
- resource locators
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/906—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
Abstract
A method for categorizing text strings assigns text strings to topical categories (100). A search engine (120) retrieves and ranks (130) a list of Uniform Resource Locators (URLs) for each text string. The most highly-ranked URLs for a set of text strings form a whitelist (170) of pre-approved text strings that are assumed to correlate closely with category meaning. Incorrectly categorized text strings are identified by scoring a list of URLs retrieved by a search engine for each text string, comparing each score to the whitelist position of the text string (220), flagging text strings with scores that deviate from whitelist position by at least a threshold amount (230), and reassigning flagged text strings to categories with the most similar sets of retrieved URLs (260).
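The flagging and reassignment steps (230, 260) described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the numeric scores, the `threshold` value, the choice of Jaccard similarity to measure how similar two URL sets are, and all function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two URL collections (an assumed similarity measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def flag_deviations(scores, whitelist_positions, threshold):
    """Return text strings whose score deviates from the whitelist position by >= threshold."""
    return [
        text
        for text, score in scores.items()
        if abs(score - whitelist_positions[text]) >= threshold
    ]


def reassign(flagged_urls, category_urls):
    """Pick the category whose URL set is most similar to the flagged string's URLs."""
    return max(category_urls, key=lambda cat: jaccard(flagged_urls, category_urls[cat]))


# Hypothetical data: per-string scores and whitelist positions.
scores = {"trail shoes": 2.0, "jaguar": 9.0}
whitelist_positions = {"trail shoes": 1.0, "jaguar": 2.0}
flagged = flag_deviations(scores, whitelist_positions, threshold=5.0)

# Hypothetical URL sets per category, and the URLs retrieved for the flagged string.
category_urls = {
    "running": {"a.example/run", "b.example/shoes"},
    "animals": {"c.example/cats", "d.example/zoo"},
}
new_category = reassign({"c.example/cats", "e.example/wild"}, category_urls)
```

Here only "jaguar" exceeds the threshold, and it moves to the category sharing the most URLs with its own search results.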
Claims
1. A computer-implemented method for categorizing text strings, comprising the steps of:
creating topical categories;
creating vocabulary rules;
inputting text strings to system memory;
processing a dataset of text strings with an automated system, the automated system using human-created vocabulary rules to assign each text string to a topical category;
processing each text string assigned to each category with at least one internet search engine to retrieve a set of ranked uniform resource locators;
assigning a numerical score to each ranked uniform resource locator retrieved for each text string for each category; and
creating a whitelist of uniform resource locators retrieved for each text string for at least a first category, the uniform resource locators ranked by the numerical scores of the text strings assigned to the first category.
2. A computer-implemented method for categorizing text strings as claimed in Claim 1, comprising the additional step of repeating the step of creating a whitelist of uniform resource locators for each additional category until a whitelist is created for every category.
3. A computer-implemented method for categorizing text strings as claimed in Claim 2, comprising the additional step of auditing at least one text string by processing the text string with at least one internet search engine to retrieve an audit set of ranked uniform resource locators.
4. A computer-implemented method for categorizing text strings as claimed in Claim 2, comprising the additional step of auditing each text string by processing each text string with at least one internet search engine to retrieve an audit set of ranked uniform resource locators.
5. A computer-implemented method for categorizing text strings as claimed in Claim 4, comprising the additional step of comparing the positional rank of each uniform resource locator in the audit set to the positional rank of the same uniform resource locator in the whitelist for the category to which the audited text string is assigned.
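The scoring, whitelist, and positional-rank steps recited in Claims 1 through 5 could look roughly like the sketch below. The claims do not specify a scoring formula, so the linear rank-based score, the summing of scores across text strings, and the names `score_results`, `build_whitelist`, and `rank_differences` are all assumptions made for illustration.

```python
from collections import defaultdict


def score_results(ranked_urls):
    """Assumed scoring: the top result gets len(ranked_urls) points, the last gets 1."""
    n = len(ranked_urls)
    return {url: n - i for i, url in enumerate(ranked_urls)}


def build_whitelist(results_by_text):
    """Sum URL scores across a category's text strings; return URLs, best first."""
    totals = defaultdict(int)
    for ranked_urls in results_by_text.values():
        for url, score in score_results(ranked_urls).items():
            totals[url] += score
    # Sort by descending total score, breaking ties alphabetically for determinism.
    return sorted(totals, key=lambda u: (-totals[u], u))


def rank_differences(audit_urls, whitelist):
    """For each audit URL also on the whitelist, return audit rank minus whitelist rank."""
    position = {url: i for i, url in enumerate(whitelist)}
    return {
        url: i - position[url]
        for i, url in enumerate(audit_urls)
        if url in position
    }


# Hypothetical search results for two text strings assigned to one category.
running = {
    "trail shoes": ["a.example", "b.example", "c.example"],
    "marathon training": ["b.example", "a.example", "d.example"],
}
whitelist = build_whitelist(running)

# Hypothetical audit set retrieved for one text string in that category.
diffs = rank_differences(["b.example", "x.example", "a.example"], whitelist)
```

A URL absent from the whitelist (`x.example` above) is simply skipped in this sketch; a real audit might instead treat it as evidence of miscategorization.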
6. A computer-implemented method for categorizing text strings, comprising the steps of:
creating topical categories;
creating vocabulary rules;
inputting text strings to system memory;
processing a dataset of text strings with an automated system, the automated system using human-created vocabulary rules to assign each text string to a topical category;
processing each text string assigned to each category with at least one internet search engine to retrieve a set of ranked uniform resource locators;
assigning a numerical score to each ranked uniform resource locator retrieved for each text string for each category;
creating a whitelist of uniform resource locators retrieved for each text string for a first category, the uniform resource locators ranked by the numerical scores of the text strings assigned to the first category;
repeating the step of creating a whitelist of uniform resource locators for each additional category until a whitelist is created for every category; and
auditing at least a selected text string by processing the text string with at least one horizontal internet search engine to identify at least one text string not assigned to a topical category;
retrieving text string uniform resource locators for unassigned text strings and comparing whitelist uniform resource locators to the text string uniform resource locators.
7. A computer-implemented method for categorizing text strings as claimed in Claim 6, further comprising the step of directing a web crawler to the audited text string uniform resource locators, the web crawler extracting hypertext markup language content from the web pages addressed by the uniform resource locators.
8. A computer-implemented method for categorizing text strings as claimed in Claim 7, further comprising the steps of parsing the hypertext markup language content with a natural language processing engine and chunking the hypertext markup language content into variable length n-grams of word tokens.
9. A computer-implemented method for categorizing text strings as claimed in Claim 8, further comprising the steps of returning and storing all categories for all n-grams associated with the uniform resource locator, comparing the n-grams of word tokens to sets of human-created vocabulary rules to generate at least one confidence score, and assigning the audited text string to the category with the highest confidence score.
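The chunking and confidence-scoring steps of Claims 8 and 9 can be illustrated with the sketch below. It assumes the crawler has already reduced the hypertext markup language content to plain text, substitutes a regular-expression tokenizer for the natural language processing engine, and defines the confidence score as the fraction of n-grams found in a category's vocabulary. None of these choices comes from the claims; the rule sets and names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase word tokens; a simple stand-in for a natural language processing engine."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, max_n=3):
    """All contiguous n-grams of length 1..max_n, joined with spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def confidence_scores(grams, rules):
    """Assumed confidence: fraction of the page's n-grams matching a category's vocabulary."""
    if not grams:
        return {category: 0.0 for category in rules}
    return {
        category: sum(g in vocab for g in grams) / len(grams)
        for category, vocab in rules.items()
    }


# Hypothetical vocabulary rules, one set of terms per category.
rules = {
    "running": {"trail", "running shoes", "marathon"},
    "cooking": {"recipe", "oven"},
}
page_text = "Best trail running shoes for your first marathon"
grams = ngrams(tokenize(page_text))
scores = confidence_scores(grams, rules)

# The audited text string goes to the category with the highest confidence score.
best_category = max(scores, key=scores.get)
```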
10. A non-volatile storage medium storing instructions readable and executable by an electronic data processing device to perform a text string categorization method including the operations of:
inputting human-created topical categories and vocabulary rules to memory;
inputting text strings to memory;
processing the text strings with the vocabulary rules to assign the text strings to at least a first topical category;
processing each text string assigned to each category with at least one internet search engine to retrieve a set of ranked uniform resource locators;
calculating a numerical score for each ranked uniform resource locator retrieved for each text string for each category; and
storing in memory a whitelist of uniform resource locators retrieved for each text string for at least a first category, the uniform resource locators ranked by the numerical scores of the text strings assigned to the first category.
11. A non-volatile storage medium storing instructions readable and executable by an electronic data processing device to perform a text string categorization method as claimed in Claim 10 wherein the operation of storing in memory a whitelist of uniform resource locators retrieved for each text string is repeated for every topical category.
12. A non-volatile storage medium storing instructions readable and executable by an electronic data processing device to perform a text string categorization method as claimed in Claim 10 further comprising the operation of retrieving an audit set of ranked uniform resource locators with at least one internet search engine.
13. A non-volatile storage medium storing instructions readable and executable by an electronic data processing device to perform a text string categorization method as claimed in Claim 12, further comprising the operation of comparing the positional rank of each uniform resource locator in the audit set to the positional rank of the same uniform resource locator in the whitelist for the categories to which each audited text string is assigned.
14. A non-volatile storage medium storing instructions readable and executable by an electronic data processing device to perform a text string categorization method as claimed in Claim 10, further comprising the operations of auditing at least a selected text string by processing the text string with at least one internet search engine to retrieve text string uniform resource locators and comparing whitelist uniform resource locators to the text string uniform resource locators.
15. A non-volatile storage medium storing instructions readable and executable by an electronic data processing device to perform a text string categorization method as claimed in Claim 14, further comprising the operation of directing a web crawler to the audited text string uniform resource locators, the web crawler extracting hypertext markup language content from the web pages addressed by the uniform resource locators.
16. A non-volatile storage medium storing instructions readable and executable by an electronic data processing device to perform a text string categorization method as claimed in Claim 15, further comprising the operations of parsing the hypertext markup language content with a natural language processing engine and chunking the hypertext markup language content into variable length n-grams of word tokens.
17. A non-volatile storage medium storing instructions readable and executable by an electronic data processing device to perform a text string categorization method as claimed in Claim 16, further comprising the operations of returning and storing all categories for all n-grams associated with the uniform resource locator, comparing the n-grams of word tokens to sets of human-created vocabulary rules to generate at least one confidence score, and assigning the audited text string to the category with the highest confidence score.
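The initial assignment operation shared by Claims 1, 6, and 10, in which vocabulary rules place each text string in a topical category, might be sketched as below. The claims do not define the form of a vocabulary rule, so this example assumes each rule is simply a set of trigger words per category and that the first matching category wins. The function name and sample data are hypothetical.

```python
def assign_categories(text_strings, rules):
    """Assign each text string to the first category whose vocabulary contains one of its words."""
    assignments = {}
    for text in text_strings:
        words = set(text.lower().split())
        assignments[text] = next(
            (category for category, vocab in rules.items() if words & vocab),
            None,  # no rule matched: the string stays unassigned
        )
    return assignments


# Hypothetical human-created vocabulary rules.
rules = {
    "running": {"marathon", "shoes", "trail"},
    "cooking": {"recipe", "oven"},
}
assignments = assign_categories(
    ["trail shoes", "oven recipe ideas", "quantum physics"], rules
)
```

Strings left with `None` correspond to the unassigned text strings that Claim 6 later compares against the whitelists.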
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/865,110 US10977332B2 (en) | 2014-10-03 | 2018-01-08 | Method for automated categorization of keyword data |
US15/865,110 | 2018-01-08 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2019136457A1 (en) | 2019-07-11 |
WO2019136457A4 (en) | 2019-08-15 |
Family
ID=66175475
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2019/012730 WO2019136457A1 (en) | 2018-01-08 | 2019-01-08 | Method for automated categorization of keyword data |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2019136457A1 (en) |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030225763A1 (en) * | 2002-04-15 | 2003-12-04 | Microsoft Corporation | Self-improving system and method for classifying pages on the world wide web |
US8078625B1 (en) * | 2006-09-11 | 2011-12-13 | Aol Inc. | URL-based content categorization |
US9189557B2 (en) * | 2013-03-11 | 2015-11-17 | Xerox Corporation | Language-oriented focused crawling using transliteration based meta-features |
-
2019
- 2019-01-08 WO PCT/US2019/012730 patent/WO2019136457A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
WO2019136457A1 (en) | 2019-07-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11030199B2 (en) | Systems and methods for contextual retrieval and contextual display of records | |
US11106664B2 (en) | Systems and methods for generating a contextually and conversationally correct response to a query | |
US8386240B2 (en) | Domain dictionary creation by detection of new topic words using divergence value comparison | |
US8073877B2 (en) | Scalable semi-structured named entity detection | |
US8204874B2 (en) | Abbreviation handling in web search | |
US9069857B2 (en) | Per-document index for semantic searching | |
US8515731B1 (en) | Synonym verification | |
US20130177893A1 (en) | Method and Apparatus for Responding to an Inquiry | |
US11086866B2 (en) | Method and system for rewriting a query | |
KR20190057282A (en) | Scenario Passage classifier, scenario classifier, and computer program for it | |
CN1629837A (en) | Method and apparatus for processing, browsing and classified searching of electronic document and system thereof | |
CN109471889B (en) | Report accelerating method, system, computer equipment and storage medium | |
CN110705285B (en) | Government affair text subject word library construction method, device, server and readable storage medium | |
US10977332B2 (en) | Method for automated categorization of keyword data | |
Shekhar et al. | Linguistic structural framework for encoding transliteration variants for word origin detection using bilingual lexicon | |
WO2019136457A4 (en) | Method for automated categorization of keyword data | |
Ganjisaffar et al. | qspell: Spelling correction of web search queries using ranking models and iterative correction | |
Amalia et al. | The usage evaluation of official computer terms in bahasa indonesia in indonesian government official websites | |
German et al. | Information extraction method from a resume (CV) | |
US9898540B1 (en) | Method for automated categorization of keyword data | |
Khaitan et al. | Data-driven compound splitting method for English compounds in domain names | |
Pérez-Granados et al. | Sentiment analysis in Colombian online newspaper comments | |
Gyorodi et al. | Full-text search engine using mySQL | |
Wenzlitschke et al. | Using BERT to retrieve relevant and argumentative sentence pairs. | |
Malumba et al. | AfriWeb: a web search engine for a marginalized language |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19717634; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 19717634; Country of ref document: EP; Kind code of ref document: A1 |