WO2019136457A4 - Method for automated categorization of keyword data

Method for automated categorization of keyword data

Info

Publication number
WO2019136457A4
Authority
WO
WIPO (PCT)
Prior art keywords
uniform resource
text string
category
text
resource locators
Prior art date
Application number
PCT/US2019/012730
Other languages
French (fr)
Other versions
WO2019136457A1 (en)
Inventor
Stephen Scarr
Original Assignee
Stephen Scarr
Priority date
Filing date
Publication date
Priority claimed from US15/865,110 (US10977332B2)
Application filed by Stephen Scarr
Publication of WO2019136457A1
Publication of WO2019136457A4

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Abstract

A method for categorizing text strings assigns text strings to topical categories (100). A search engine (120) retrieves and ranks (130) a list of Uniform Resource Locators (URLs) for each text string. The most highly-ranked URLs for a set of text strings form a whitelist (170) of pre-approved text strings that are assumed to correlate closely with category meaning. Incorrectly categorized text strings are identified by scoring a list of URLs retrieved by a search engine for each text string, comparing each score to the whitelist position of the text string (220), flagging text strings with scores that deviate from whitelist position by at least a threshold amount (230), and reassigning flagged text strings to categories with the most similar sets of retrieved URLs (260).
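The flagging (230) and reassignment (260) steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say how the deviation or the URL-set similarity is measured, so the absolute difference and the Jaccard overlap used here, along with all function and parameter names, are hypothetical choices.

```python
from typing import Dict, List, Set


def flag_deviating(scores: Dict[str, float],
                   whitelist_position: Dict[str, float],
                   threshold: float) -> List[str]:
    """Return text strings whose score differs from their whitelist
    position by at least `threshold` (assumed: absolute difference)."""
    return [
        text for text, score in scores.items()
        if abs(score - whitelist_position[text]) >= threshold
    ]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Set-overlap similarity; an assumed stand-in for 'most similar'."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def reassign(text_urls: Set[str],
             category_urls: Dict[str, Set[str]]) -> str:
    """Pick the category whose URL set overlaps most with the text string's."""
    return max(category_urls,
               key=lambda c: jaccard(text_urls, category_urls[c]))
```

With a threshold of 5.0, a text string scored 9.0 against a whitelist position of 2.0 would be flagged, and it would then move to whichever category shares the largest fraction of its retrieved URLs.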

Claims

AMENDED CLAIMS received by the International Bureau on 22 July 2019 (22.07.2019)
I Claim:
1. A computer-implemented method for categorizing text strings, comprising the steps of:
creating topical categories;
creating vocabulary rules;
inputting text strings to system memory;
processing a dataset of text strings with an automated system, the automated system using human-created vocabulary rules to assign each text string to a topical category;
processing each text string assigned to each category with at least one internet search engine to retrieve a set of ranked uniform resource locators;
assigning a numerical score to each ranked uniform resource locator retrieved for each text string for each category; and
creating a whitelist of uniform resource locators retrieved for each text string for at least a first category, the uniform resource locators ranked by the numerical scores of the text strings assigned to the first category.
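As a rough illustration of the steps of Claim 1, the sketch below assigns text strings to categories with keyword rules, scores retrieved URLs by rank, and sorts them into a whitelist. The claim does not specify the rule format, the scoring formula, or the search interface, so the substring matching, the `len(urls) - rank` score, and the `search` callable are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional


def assign_category(text: str, rules: Dict[str, List[str]]) -> Optional[str]:
    """Return the first category with a rule keyword found in the text.

    Assumed rule format: category -> list of lowercase keywords,
    matched as plain substrings.
    """
    lowered = text.lower()
    for category, keywords in rules.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return None


def build_whitelist(texts: List[str],
                    search: Callable[[str], List[str]]) -> List[str]:
    """Score each retrieved URL by rank and sort URLs by total score.

    `search` is a hypothetical stand-in for a search engine call that
    returns URLs in ranked order for one text string.
    """
    totals: Dict[str, float] = defaultdict(float)
    for text in texts:
        urls = search(text)
        for rank, url in enumerate(urls):
            # Assumed score: earlier positions earn more points.
            totals[url] += len(urls) - rank
    # Highest total first; ties broken alphabetically for determinism.
    return sorted(totals, key=lambda u: (-totals[u], u))
```

In practice `build_whitelist` would be called once per category, passing only the text strings that `assign_category` placed in that category.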
2. A computer-implemented method for categorizing text strings as claimed in Claim 1, comprising the additional step of repeating the step of creating a whitelist of uniform resource locators for each additional category until a whitelist is created for every category.
3. A computer-implemented method for categorizing text strings as claimed in Claim 2, comprising the additional step of auditing at least one text string by processing the text string with at least one internet search engine to retrieve an audit set of ranked uniform resource locators.
4. A computer-implemented method for categorizing text strings as claimed in Claim 2, comprising the additional step of auditing each text string by processing each text string with at least one internet search engine to retrieve an audit set of ranked uniform resource locators.
5. A computer-implemented method for categorizing text strings as claimed in Claim 4, comprising the additional step of comparing the positional rank of each uniform resource locator in the audit set to the positional rank of the same uniform resource locator in the whitelist for the category to which the audited text string is assigned.
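The positional-rank comparison of Claim 5 might look something like the following. The claim only says that ranks are compared; the signed rank difference, the handling of URLs absent from the whitelist, and the `missing_penalty` parameter are assumptions introduced for this sketch.

```python
from typing import Dict, List, Optional


def rank_deviations(audit_urls: List[str],
                    whitelist: List[str]) -> Dict[str, Optional[int]]:
    """Map each audit URL to (audit rank - whitelist rank).

    URLs that do not appear in the whitelist map to None.
    """
    whitelist_rank = {url: i for i, url in enumerate(whitelist)}
    return {
        url: (i - whitelist_rank[url]) if url in whitelist_rank else None
        for i, url in enumerate(audit_urls)
    }


def mean_abs_deviation(deviations: Dict[str, Optional[int]],
                       missing_penalty: int) -> float:
    """Average absolute rank shift, charging a fixed (assumed) penalty
    for each audit URL that is missing from the whitelist."""
    if not deviations:
        return 0.0
    total = sum(missing_penalty if d is None else abs(d)
                for d in deviations.values())
    return total / len(deviations)
```

A large mean deviation would suggest that the audited text string retrieves URLs that sit far from their whitelist positions, which is one plausible signal that it has been placed in the wrong category.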
6. A computer-implemented method for categorizing text strings, comprising the steps of:
creating topical categories;
creating vocabulary rules;
inputting text strings to system memory;
processing a dataset of text strings with an automated system, the automated system using human-created vocabulary rules to assign each text string to a topical category;
processing each text string assigned to each category with at least one internet search engine to retrieve a set of ranked uniform resource locators;
assigning a numerical score to each ranked uniform resource locator retrieved for each text string for each category;
creating a whitelist of uniform resource locators retrieved for each text string for a first category, the uniform resource locators ranked by the numerical scores of the text strings assigned to the first category;
repeating the step of creating a whitelist of uniform resource locators for each additional category until a whitelist is created for every category;
auditing at least a selected text string by processing the text string with at least one horizontal internet search engine to identify at least one text string not assigned to a topical category; and
retrieving text string uniform resource locators for unassigned text strings and comparing whitelist uniform resource locators to the text string uniform resource locators.
7. A computer-implemented method for categorizing text strings as claimed in Claim 6, further comprising the step of directing a web crawler to the audited text string uniform resource locators, the web crawler extracting hypertext markup language content from the web pages addressed by the uniform resource locators.
8. A computer-implemented method for categorizing text strings as claimed in Claim 7, further comprising the steps of parsing the hypertext markup language content with a natural language processing engine and chunking the hypertext markup language content into variable length n-grams of word tokens.
9. A computer-implemented method for categorizing text strings as claimed in Claim 8, further comprising the steps of returning and storing all categories for all n-grams associated with the uniform resource locator, comparing the n-grams of word tokens to sets of human-created vocabulary rules to generate at least one confidence score, and assigning the audited text string to the category with the highest confidence score.
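The chunking and confidence-scoring steps of Claims 8 and 9 can be illustrated with a minimal sketch. It assumes the HTML has already been reduced to plain text, uses a simple regular-expression tokenizer in place of a natural language processing engine, and defines confidence as the fraction of n-grams found in a category's vocabulary; none of these choices come from the claims.

```python
import re
from typing import Dict, List, Set


def ngrams(text: str, max_n: int = 3) -> List[str]:
    """Split text into lowercase word tokens and return all 1..max_n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def confidence_scores(grams: List[str],
                      rules: Dict[str, Set[str]]) -> Dict[str, float]:
    """Score each category by the fraction of n-grams matching its
    vocabulary (an assumed definition of confidence)."""
    if not grams:
        return {category: 0.0 for category in rules}
    return {
        category: sum(g in vocab for g in grams) / len(grams)
        for category, vocab in rules.items()
    }


def best_category(scores: Dict[str, float]) -> str:
    """Return the category with the highest confidence score."""
    return max(scores, key=lambda c: scores[c])
```

For page text such as "Cheap flights to Rome" and a travel vocabulary containing "cheap flights" and "rome", the travel category would receive the highest score and the audited text string would be assigned to it.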
10. A non-volatile storage medium storing instructions readable and executable by an electronic data processing device to perform a text string categorization method including the operations of:
inputting human-created topical categories and vocabulary rules to memory;
inputting text strings to memory;
processing the text strings with the vocabulary rules to assign the text strings to at least a first topical category;
processing each text string assigned to each category with at least one internet search engine to retrieve a set of ranked uniform resource locators;
calculating a numerical score for each ranked uniform resource locator retrieved for each text string for each category; and
storing in memory a whitelist of uniform resource locators retrieved for each text string for at least a first category, the uniform resource locators ranked by the numerical scores of the text strings assigned to the first category.
11. A non-volatile storage medium storing instructions readable and executable by an electronic data processing device to perform a text string categorization method as claimed in Claim 10 wherein the operation of storing in memory a whitelist of uniform resource locators retrieved for each text string is repeated for every topical category.
12. A non-volatile storage medium storing instructions readable and executable by an electronic data processing device to perform a text string categorization method as claimed in Claim 10 further comprising the operation of retrieving an audit set of ranked uniform resource locators with at least one internet search engine.
13. A non-volatile storage medium storing instructions readable and executable by an electronic data processing device to perform a text string categorization method as claimed in Claim 12, further comprising the operation of comparing the positional rank of each uniform resource locator in the audit set to the positional rank of the same uniform resource locator in the whitelist for the categories to which each audited text string is assigned.
14. A non-volatile storage medium storing instructions readable and executable by an electronic data processing device to perform a text string categorization method as claimed in Claim 10, further comprising the operations of auditing at least a selected text string by processing the text string with at least one internet search engine to retrieve text string uniform resource locators and comparing whitelist uniform resource locators to the text string uniform resource locators.
15. A non-volatile storage medium storing instructions readable and executable by an electronic data processing device to perform a text string categorization method as claimed in Claim 14, further comprising the operation of directing a web crawler to the audited text string uniform resource locators, the web crawler extracting hypertext markup language content from the web pages addressed by the uniform resource locators.
16. A non-volatile storage medium storing instructions readable and executable by an electronic data processing device to perform a text string categorization method as claimed in Claim 15, further comprising the operations of parsing the hypertext markup language content with a natural language processing engine and chunking the hypertext markup language content into variable length n-grams of word tokens.
17. A non-volatile storage medium storing instructions readable and executable by an electronic data processing device to perform a text string categorization method as claimed in Claim 16, further comprising the operations of returning and storing all categories for all n-grams associated with the uniform resource locator, comparing the n-grams of word tokens to sets of human-created vocabulary rules to generate at least one confidence score, and assigning the audited text string to the category with the highest confidence score.
PCT/US2019/012730 2018-01-08 2019-01-08 Method for automated categorization of keyword data WO2019136457A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US15/865,110 US10977332B2 (en) 2014-10-03 2018-01-08 Method for automated categorization of keyword data
US15/865,110 2018-01-08

Publications (2)

Publication Number Publication Date
WO2019136457A1 (en) 2019-07-11
WO2019136457A4 (en) 2019-08-15

Family

ID=66175475

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2019/012730 WO2019136457A1 (en) 2018-01-08 2019-01-08 Method for automated categorization of keyword data

Country Status (1)

Country Link
WO (1) WO2019136457A1 (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030225763A1 (en) * 2002-04-15 2003-12-04 Microsoft Corporation Self-improving system and method for classifying pages on the world wide web
US8078625B1 (en) * 2006-09-11 2011-12-13 Aol Inc. URL-based content categorization
US9189557B2 (en) * 2013-03-11 2015-11-17 Xerox Corporation Language-oriented focused crawling using transliteration based meta-features

Also Published As

Publication number Publication date
WO2019136457A1 (en) 2019-07-11

Similar Documents

Publication Publication Date Title
US11030199B2 (en) Systems and methods for contextual retrieval and contextual display of records
US11106664B2 (en) Systems and methods for generating a contextually and conversationally correct response to a query
US8386240B2 (en) Domain dictionary creation by detection of new topic words using divergence value comparison
US8073877B2 (en) Scalable semi-structured named entity detection
US8204874B2 (en) Abbreviation handling in web search
US9069857B2 (en) Per-document index for semantic searching
US8515731B1 (en) Synonym verification
US20130177893A1 (en) Method and Apparatus for Responding to an Inquiry
US11086866B2 (en) Method and system for rewriting a query
KR20190057282A (en) Scenario Passage classifier, scenario classifier, and computer program for it
CN1629837A (en) Method and apparatus for processing, browsing and classified searching of electronic document and system thereof
CN109471889B (en) Report accelerating method, system, computer equipment and storage medium
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
US10977332B2 (en) Method for automated categorization of keyword data
Shekhar et al. Linguistic structural framework for encoding transliteration variants for word origin detection using bilingual lexicon
WO2019136457A4 (en) Method for automated categorization of keyword data
Ganjisaffar et al. qspell: Spelling correction of web search queries using ranking models and iterative correction
Amalia et al. The usage evaluation of official computer terms in bahasa indonesia in indonesian government official websites
German et al. Information extraction method from a resume (CV)
US9898540B1 (en) Method for automated categorization of keyword data
Khaitan et al. Data-driven compound splitting method for English compounds in domain names
Pérez-Granados et al. Sentiment analysis in Colombian online newspaper comments
Gyorodi et al. Full-text search engine using mySQL
Wenzlitschke et al. Using BERT to retrieve relevant and argumentative sentence pairs.
Malumba et al. AfriWeb: a web search engine for a marginalized language

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 19717634

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 19717634

Country of ref document: EP

Kind code of ref document: A1