WO2012061462A1 - Systems and methods regarding keyword extraction - Google Patents
Systems and methods regarding keyword extraction
- Publication number
- WO2012061462A1 (application PCT/US2011/058899)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- candidate
- computer system
- pool
- keywords
- candidate pool
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/49—Data-driven translation using very large corpora, e.g. the web
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/14—Details of searching files based on file metadata
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2452—Query translation
- G06F16/24522—Translation of natural language queries to structured queries
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/48—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/93—Document management systems
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Definitions
- Keyword extraction typically serves as the core component of contextual advertising systems, where advertisements that match webpage content are chosen based on keywords automatically selected from the page text.
- In order to display ads relevant to the webpage, and thus potentially more interesting to the user, numerous features present in the text need to be assessed to decide which keywords accurately reflect the content of the page.
- a keyword extraction system takes a page URL as input and returns 10 keyword phrases ranked by the system as top keyword candidates.
- the system first processes webpage text and uses its structure to extract phrases which serve as a keyword candidate pool.
- Each phrase can then be described by a set of features such as its frequency on the webpage, location in the text, capitalization and its linguistic structure (for example, whether it constitutes a noun phrase).
- the system learns how these features contribute to the decision whether a candidate phrase is likely to be a "good" keyword. Once it has been trained in this manner, the system can be used to identify keywords on previously unseen webpages (i.e., that were not in the training set).
- An exemplary system embodiment improves this approach by using natural language processing techniques in order to achieve improved performance.
- One or more exemplary embodiments employ a novel keyword candidate extraction method that is sensitive to phrase structure, and may include additional linguistic features that lead to better machine learning results.
- One exemplary aspect comprises a computer system comprising: (a) a preprocessing unit that extracts text from a webpage to produce at least a first set of candidate keywords, applies language processing to produce at least a second set of candidate keywords, and combines said first and second sets of candidate keywords into a first candidate pool; (b) a candidate extraction unit that receives data from said preprocessing unit describing at least said first candidate pool and produces a second candidate pool; (c) a feature extraction unit that receives data describing at least said second candidate pool and analyzes said second candidate pool for general features and linguistic features; and (d) a classification unit that receives said data describing at least said second candidate pool and related data from said feature extraction unit, and determines a likelihood of each candidate in said second candidate pool being a primary or secondary keyword.
- The tf-idf weight (term frequency-inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
- (1) at least part of said language processing is performed by a tokenizer and a parser; (2) at least part of said language processing is performed by a tokenizer, a parser, a part of speech tagger, and a named entity tagger; (3) at least part of said language processing is performed by a tokenizer; (4) at least part of said language processing is performed by a parser; (5) at least part of said language processing is performed by a part of speech tagger; (6) at least part of said language processing is performed by a named entity tagger; (7) said first set of candidate keywords comprises metadata text; (8) said second candidate pool comprises noun phrases and noun sequences; (9) said second candidate pool comprises noun phrases, noun sequences, and n-grams; (10) said general features comprise one or more of frequency, position in the document, and capitalization; (11) said linguistic features relate to one or more of part of speech, phrase structure, and named entity information; (12) said general features comprise frequency features, and said frequency features
- Another aspect comprises a method comprising steps implemented by a computer processing system, said steps comprising: (a) extracting text from a webpage to produce at least a first set of candidate keywords, applying language processing to produce at least a second set of candidate keywords, and combining said first and second sets of candidate keywords into a first candidate pool; (b) receiving data describing at least said first candidate pool and producing a second candidate pool; (c) receiving data describing at least said second candidate pool and analyzing said second candidate pool for general features and linguistic features; and (d) receiving said data describing at least said second candidate pool and related data from said feature extraction unit, and determining a likelihood of each candidate in said second candidate pool being a primary or secondary keyword.
- Another aspect comprises a tangible computer readable medium storing software operable to perform steps comprising: (a) extracting text from a webpage to produce at least a first set of candidate keywords, applying language processing to produce at least a second set of candidate keywords, and combining said first and second sets of candidate keywords into a first candidate pool; (b) receiving data describing at least said first candidate pool and producing a second candidate pool; (c) receiving data describing at least said second candidate pool and analyzing said second candidate pool for general features and linguistic features; and (d) receiving said data describing at least said second candidate pool and related data from said feature extraction unit, and determining a likelihood of each candidate in said second candidate pool being a primary or secondary keyword.
- FIG. 1 depicts an overview of processing of an exemplary embodiment.
- FIG. 2 depicts a computer system over which an exemplary embodiment may be implemented.
- FIG. 1. Each component is described in further detail in the remaining sections of this description.
- plain text of the page may be extracted from the HTML format.
- this text may be processed further to obtain information about its structure that can be useful to the keyword extraction system.
- the preprocessing unit of the system preferably performs extraction as well as tagging and formatting webpage text, to provide suitable input for the stages of candidate phrase selection and feature extraction that follow.
- content text may be first extracted from the webpage using BoilerPipe (see, e.g., [9]), which removes boilerplate content and preserves only the main text body of the page. Aside from the body text, header information such as title, meta-description, and meta-keywords may be extracted and combined with BoilerPipe output to form plain text input for further processing.
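The header-extraction step described above can be sketched with Python's standard `html.parser` module. This is only an illustration under stated assumptions: the class name `HeaderTextParser` and the function `build_plain_text` are hypothetical, and the main body text is assumed to have been produced separately (for example by BoilerPipe) and passed in as a string.

```python
from html.parser import HTMLParser


class HeaderTextParser(HTMLParser):
    """Collects the <title> text and the description/keywords <meta> tags."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def build_plain_text(html, body_text):
    """Combine header information with already-extracted body text."""
    parser = HeaderTextParser()
    parser.feed(html)
    parts = [
        parser.title.strip(),
        parser.meta.get("description", ""),
        parser.meta.get("keywords", ""),
        body_text,
    ]
    # Skip header fields that are missing from the page.
    return "\n".join(p for p in parts if p)
```

The combined string would then be the input to tokenization, tagging, and parsing.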
- the page text may then be tokenized and the tokenizer output passed to a part-of-speech tagger (see, e.g., [18]) and a parser (see, e.g., [13]). Since there is a tendency for keywords to constitute noun phrases, parser output may be used to find noun phrases in the text.
- the use of a parser rather than a chunker may be motivated by the desire to obtain finer-grained information on hierarchical phrase structure, as opposed to basic noun phrase chunks, in order to improve keyword candidate extraction.
- Two different Named Entity (NE) systems (see, e.g., [18], [4]) preferably are used in order to provide coverage of a larger set of entity types.
- Candidate extraction may be used to select phrases that are potential keywords and can be used as input for the classifier which estimates the likelihood that a given phrase is a keyword.
- better accuracy of candidate extraction helps to filter word combinations that are not likely keywords and thus reduces the amount of negative training samples, thereby improving the ratio of positive to negative training data (the keyword extraction task has an imbalance between positive and negative samples, with very few positive label data).
- a keyword extraction method performs as follows. First, a base candidate set is formed by recursively extracting all noun phrases from parsed text. Then all candidate subsequences (extracted left to right) that consist of nouns only are added to the candidate set (for example, if best Nixon camera accessories is the candidate, Nixon camera accessories, camera accessories and accessories would be added to the candidate set). Finally, the candidate set is augmented with all unigrams, bigrams, and trigrams extracted from the candidate phrases.
- the candidate set may also be filtered against a stoplist of most frequent English words. Unigrams or bigrams containing a stopword preferably are removed from the candidate set. However, longer phrases containing a word from the stoplist in the middle of the phrase may be retained.
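The candidate extraction and stoplist filtering steps above can be sketched as follows. This is a minimal illustration, not the patented implementation. It makes several assumptions: the parse tree is a nested `(label, children)` tuple with `(tag, word)` leaves, noun tags start with `NN` (Penn Treebank style), "noun-only subsequences" are read as noun-only suffixes (which matches the example given), and longer phrases are dropped only when a stopword sits at either edge. All function names are hypothetical.

```python
def leaves(node):
    """Return the (tag, word) leaves under a node, left to right."""
    label, content = node
    if isinstance(content, str):
        return [(label, content)]
    out = []
    for child in content:
        out.extend(leaves(child))
    return out


def noun_phrases(node):
    """Recursively collect every NP, including nested ones."""
    label, content = node
    if isinstance(content, str):
        return []
    found = [leaves(node)] if label == "NP" else []
    for child in content:
        found.extend(noun_phrases(child))
    return found


def noun_suffixes(tagged):
    """Suffixes of a phrase that consist of nouns only."""
    return [tagged[i:] for i in range(len(tagged))
            if all(tag.startswith("NN") for tag, _ in tagged[i:])]


def ngrams(words, max_n=3):
    """All unigrams, bigrams, and trigrams of a phrase."""
    return [tuple(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]


def keep(phrase, stoplist):
    """Stoplist filter: strict for 1-2 words, edge-only for longer phrases."""
    words = [w.lower() for w in phrase]
    if len(words) <= 2:
        return not any(w in stoplist for w in words)
    # Assumption: a stopword is tolerated only in the middle of the phrase.
    return words[0] not in stoplist and words[-1] not in stoplist


def extract_candidates(tree, stoplist):
    candidates = set()
    for np in noun_phrases(tree):
        candidates.add(tuple(w for _, w in np))
        for seq in noun_suffixes(np):
            candidates.add(tuple(w for _, w in seq))
    for phrase in list(candidates):
        candidates.update(ngrams(phrase))
    return {c for c in candidates if keep(c, stoplist)}
```

For the phrase "best Nixon camera accessories" tagged `JJS NNP NN NNS`, the noun-only suffixes are "Nixon camera accessories", "camera accessories", and "accessories", as in the example above.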
- an exemplary embodiment employs a classifier that uses the input (features of the candidate phrase) to estimate the probability that the phrase is a keyword, and assigns an output label
- the classifier function that maps the feature input to a keyword label may be obtained using supervised machine learning. That is, the mapping may be learned by the classifier system based on a dataset where "correct" output labels have been provided by human annotators.
- a maximum entropy (ME) model may be used (this is sometimes called a logistic regression model; for an introduction, see [11]).
- An ME model derives constraints from the training data and assumes a distribution of maximum entropy in cases not covered by the training set.
- the ME classifier input consists of vectors of values for each keyword candidate, which are used by the model to learn the weights associated with each feature. Given new input data, the trained classifier can then compute the probability that a phrase is a keyword given the input values for that candidate phrase.
- f is a joint-feature (a function of the input vector and the label) and a is a weight assigned to that feature.
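The equation that these symbols belong to did not survive in this text. As a point of reference only, the standard conditional maximum entropy form is shown below, assuming x denotes the input vector, c the label, f_i the joint-features, and a_i the corresponding weights; the exact notation of the original may differ.

```latex
P(c \mid x) = \frac{\exp\left(\sum_i a_i \, f_i(x, c)\right)}
                   {\sum_{c'} \exp\left(\sum_i a_i \, f_i(x, c')\right)}
```

With two labels (keyword, non-keyword) this reduces to the logistic regression form mentioned above.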
- CG Natural Language Toolkit
- rbf kernel support-vector machines
- CG refers to the Conjugate Gradient method, a standard iterative method to solve sparse linear equation systems that is provided as one of the training methods in the classifier library.
- CG requires the scipy package (http://www.scipy.org/) to be installed with Python and NLTK.
- candidates with the highest probabilities in a given webpage.
- a set of feature values may be computed for each keyword candidate and used as classifier input.
- the choice of features plays an important role in classifier performance.
- the features may be divided into two types: (a) general, non-linguistic features, and (b) linguistic features.
- General features may be similar to the features employed by the system described in [17] and include information such as frequency, position in the document, and capitalization. Linguistic features make use of part of speech, phrase structure, and named entity information. The two types of features are described in more detail below.
- HasCap | 1 if at least one word in ... | binary | YES
- IsNoun | 1 if all words in keyword candidate are nouns, 0 otherwise. | binary | YES (but defined differently, with a distinction between proper and generic nouns)
- hasNoun | 1 if at least one word in ... | binary | YES
- HasNE_oak | 1 if keyword ... | binary | NO
- Frequency features provide information similar to TFxIDF. Frequency features may include relative term frequency within the document, log of term frequency, as well as DF (frequency in document collection) and log DF values. DF values may be approximated using frequencies from the Google Ngram corpus. Preferably only unigram and bigram frequency information is used to calculate DF. For candidate phrases longer than 2 words, the average of DFs for all bigrams in the phrase may be used as the DF value. Averages may be used in order to obtain a similar range of values for phrases of different length. Also, DF values computed for the entire blog collection may be used, instead of the frequencies from the Google Ngram corpus.
- TFxIDF refers to term frequency-inverse document frequency and is a standard score used in information retrieval to evaluate the relative importance of a term. It is based on the frequency of the term in a given document and in the document collection.
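A rough sketch of how such frequency features might be computed is shown below. The dictionaries `unigram_df` and `bigram_df` are hypothetical stand-ins for counts taken from an n-gram corpus, the feature names are illustrative, and the add-one inside the logarithms is an assumption made here to avoid `log(0)`; none of these details come from the source.

```python
import math


def document_frequency(phrase, unigram_df, bigram_df):
    """DF for a phrase; phrases longer than two words use the bigram average."""
    words = phrase.lower().split()
    if len(words) == 1:
        return unigram_df.get(words[0], 0)
    bigrams = [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]
    return sum(bigram_df.get(b, 0) for b in bigrams) / len(bigrams)


def frequency_features(phrase, doc_tokens, unigram_df, bigram_df):
    """Relative TF, log TF, DF, and log DF for one candidate phrase."""
    words = phrase.lower().split()
    tokens = [t.lower() for t in doc_tokens]
    n = len(words)
    # Count occurrences of the phrase as a contiguous token sequence.
    tf = sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words)
    df = document_frequency(phrase, unigram_df, bigram_df)
    return {
        "rel_tf": tf / max(len(tokens), 1),
        "log_tf": math.log(1 + tf),  # add-one smoothing is an assumption
        "df": df,
        "log_df": math.log(1 + df),  # add-one smoothing is an assumption
    }
```

Averaging bigram DFs keeps the value for a four-word phrase in roughly the same range as for a two-word phrase, which is the motivation given above.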
- Capitalized words include proper names or words marked as important terms in a given document. Exemplary capitalization features are: whether all words in keyword candidate are capitalized, and whether at least one word in a candidate phrase is capitalized.
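The two capitalization features can be illustrated in a few lines. `HasCap` is a feature name that appears in this document; `IsCap` is a hypothetical name chosen here for the all-words-capitalized feature, and treating "capitalized" as "first character is uppercase" is an assumption.

```python
def capitalization_features(phrase):
    """Binary capitalization features for one candidate phrase."""
    words = phrase.split()
    caps = [w[:1].isupper() for w in words]
    return {
        "IsCap": int(bool(caps) and all(caps)),  # every word capitalized
        "HasCap": int(any(caps)),                # at least one word capitalized
    }
```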
- Wikipedia traffic statistics may be used to reflect the popularity of keyword candidates as frequent search/query items. This set of features may include: whether the candidate phrase is a Wikipedia title (including redirects), and the traffic figure for the candidate phrase (0 if the candidate is not a Wikipedia title). Traffic statistics may be based on hourly Wikipedia logs aggregated over a certain period (e.g., a 20 day period in June 2010).
- the candidate is a Noun Phrase or contains a Noun Phrase.
- the candidate phrase contains at least one noun, and whether the candidate phrase consists of nouns only.
- a keyword candidate is a Named Entity, whether it contains a Named Entity and the Named Entity tag assigned to the candidate ("O" if the candidate phrase is not an NE).
- Pointwise mutual information reflects whether a phrase is likely to be a collocation.
- a PMI score of a candidate phrase may be calculated as follows:
- PMI may be set to the average of PMI scores for all bigrams in the phrase.
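The PMI formula itself is missing from this text. The sketch below uses the standard definition for a bigram, log(P(w1, w2) / (P(w1) P(w2))), and averages bigram scores for longer phrases as described above. The count dictionaries and totals are hypothetical inputs, and returning 0.0 for a single word and negative infinity for unseen events are placeholder assumptions, since the source's handling of those cases is not shown.

```python
import math


def bigram_pmi(w1, w2, unigram_count, bigram_count, total_unigrams, total_bigrams):
    """Standard pointwise mutual information of an adjacent word pair."""
    p_xy = bigram_count.get((w1, w2), 0) / total_bigrams
    p_x = unigram_count.get(w1, 0) / total_unigrams
    p_y = unigram_count.get(w2, 0) / total_unigrams
    if p_xy == 0 or p_x == 0 or p_y == 0:
        return float("-inf")  # unseen events: placeholder assumption
    return math.log(p_xy / (p_x * p_y))


def phrase_pmi(words, unigram_count, bigram_count, total_unigrams, total_bigrams):
    """PMI of a phrase: the mean PMI over its adjacent bigrams."""
    if len(words) < 2:
        return 0.0  # single-word handling is an assumption
    scores = [
        bigram_pmi(a, b, unigram_count, bigram_count, total_unigrams, total_bigrams)
        for a, b in zip(words, words[1:])
    ]
    return sum(scores) / len(scores)
```

A high score indicates that the words co-occur more often than their individual frequencies would predict, which is what makes the phrase a likely collocation.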
- the training data may comprise, say, 500 web pages (selected randomly from a blog page corpus; see [3]). Annotators may be presented with plain-text extracted from the blog page and instructed to select keywords that best express the content of the page. Meta information from the header preferably is not included in the annotated text.
- There is no limit on the number of keywords that may be chosen for a single page. Additional pages may also be annotated and set aside as a test set not used for training.
- the keywords preferably are selected by two annotators.
- the inter-annotator agreement on this task might not be high (for example, in one implementation, the kappa score of the annotators was 0.49).
- Low kappa scores may be due to the following: First, annotators may tag similar phrases that are only partial matches. Second, when a maximum number of keywords that can be selected is not specified, one annotator may choose to select a greater number of keywords than another for a given text.
- GS Golden Standard
- annotators may be instructed to also select whether the keyword is a "primary keyword" or a "secondary keyword."
- Primary keywords may be defined as keywords
- Cohen's kappa coefficient is a statistical measure commonly employed to measure agreement between annotators.
- Kappa is calculated as $\kappa = \frac{P(A) - P(E)}{1 - P(E)}$, where P(A) is the observed agreement and P(E) is the agreement expected by chance.
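As an illustration, Cohen's kappa for two annotators can be computed as (P(A) - P(E)) / (1 - P(E)), where P(A) is the observed agreement and P(E) the chance agreement estimated from each annotator's label distribution. The function below is a sketch that assumes both annotators labeled the same items in the same order (for example, 1 for "keyword" and 0 for "not a keyword"); the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long lists of categorical labels."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both annotators pick the same label independently.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)
```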
- kappa scores between each annotator and the standard were 0.75 for annotator 1 and 0.74 for annotator 2.
- Detailed agreement statistics for primary and secondary keywords are shown in Table 2 below.
- an exemplary embodiment uses noun phrases as a base candidate set, but augments the candidate pool with noun sequences and unigrams, bigrams, and trigrams extracted from the noun phrases.
- One prior art method of obtaining all possible candidate phrases from a text is to include all n-grams up to length n (typically 3-5) in the candidate set.
- a serious disadvantage of this n-gram method is that it introduces substantial noise, in the form of word sequences that are not meaningful phrases and/or are not likely to be potential keywords. The n-gram method thus suffers from low precision.
- An alternative prior art method is to use language structure cues to extract candidates. Since keywords tend to be noun phrases, all noun phrases from the text can be used to form the candidate pool. However, this method has a markedly lower recall than the n-gram extraction method, which means that many potential keywords are not included in the candidate set.
- the n-gram approach has a recall above 80%, but it also has the lowest precision of the three methods (i.e., the candidate set includes a substantial amount of noise). Extracting noun phrases as candidates has the advantage of increasing precision, but this method has a very low recall (only 26%), so there is a high chance of missing potential keywords.
- an exemplary embodiment of the inventive method results in an improvement in recall compared to extracting noun phrases.
- the recall of this approach is comparable to the n-gram method, but the precision is higher. Evaluation results of how the different methods combine with classifier performance are described below.
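The precision and recall of a candidate extraction method against annotated keywords can be measured as in the sketch below. It assumes exact string matching between candidates and gold keywords, which is a simplification; the function name is hypothetical.

```python
def candidate_precision_recall(candidates, gold_keywords):
    """Precision and recall of a candidate pool against annotated keywords."""
    candidates, gold = set(candidates), set(gold_keywords)
    overlap = len(candidates & gold)
    precision = overlap / len(candidates) if candidates else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall
```

A large n-gram pool tends to raise recall while lowering precision, because most of the added word sequences are not keywords.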
- the results achieved by the inventive system were compared to a baseline, based on [17].
- the candidate extraction method is the n-gram method, and features consist of general non-linguistic features (plus a simple set of NP/Noun features). How system
- Top-10 score (like R-Precision but with a cut-off at top-10 results, i.e. all R > 10 are set to 10).
- the top-10 measure was used for evaluation since it provides an estimate of how the classifier performs as an extraction system when the candidates with top-10 scores are selected as the keyword output.
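One possible reading of this measure is sketched below. It assumes R-Precision is precision over the top R ranked candidates, where R is the number of gold keywords for the page, and that the top-10 variant caps R at 10. The function name and the exact-match comparison are assumptions.

```python
def top10_score(ranked_candidates, gold_keywords, cutoff=10):
    """R-Precision with R capped at `cutoff` (10 by default)."""
    gold = set(gold_keywords)
    r = min(len(gold), cutoff)
    if r == 0:
        return 0.0
    # Count gold keywords among the r highest-ranked candidates.
    hits = sum(1 for c in ranked_candidates[:r] if c in gold)
    return hits / r
```

Here `ranked_candidates` is the classifier output sorted by descending keyword probability.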
- System performance was tested on a held-out test set of 100 webpages which were never used in classifier training (see Table 4) and
- Table 4: Top-10 score results for the held-out set.
- Table 5: Top-10 score results for cross-validation tests.
- keyword extraction preferably comprises: (a) preprocessing, which includes text extraction from the webpage as well as linguistic processing such as part of speech tagging and parsing; (b) extraction of keyword candidate phrases; and (c) candidate classification using supervised machine learning.
- the inventive systems and methods may achieve improved performance due to use of linguistic information, both at the candidate selection and at the feature extraction stage.
- An exemplary embodiment comprises candidate selection that uses hierarchical phrase structure, resulting in a less noisy candidate pool.
- Features that may be used for classification also include linguistic features such as part of speech and named entity information, resulting in improved classifier performance.
- Embodiments comprise computer components and computer-implemented steps that will be apparent to those skilled in the art. For example, calculations and communications can be performed electronically, and results can be displayed using a graphical user interface.
- Computers 100 communicate via network 110 with a server 130.
- a plurality of sources of data 120-121 also communicate via network 110 with a server 130, processor 150, and/or other components operable to calculate and/or transmit information.
- Server(s) 130 may be coupled to one or more storage devices 140, one or more processors 150, and software 160.
- Server 130 may facilitate communication of data from a storage device 140 to and from processor(s) 150, and communications to computers 100.
- Processor 150 may optionally include or communicate with local or networked storage (not shown) which may be used to store temporary or other information.
- Software 160 can be installed locally at a computer 100, processor 150 and/or centrally supported for facilitating calculations and applications.
- processing and decision-making may be performed by functionally equivalent circuits such as a digital signal processor circuit or an application specific integrated circuit.
- [9] Boilerplate detection using shallow text features. In WSDM '10: Proceedings of the third ACM international conference on Web search and data mining, pages 441-450, New York, NY, USA, 2010. ACM.
- [10] Matsuo, Y. and Ishizuka, M. Keyword Extraction from a Document using Word Co-occurrence Statistical Information. Transactions of the Japanese Society for Artificial Intelligence, 17:217-223, 2002.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Mathematical Physics (AREA)
- Human Computer Interaction (AREA)
- Library & Information Science (AREA)
- Probability & Statistics with Applications (AREA)
- Multimedia (AREA)
- Business, Economics & Management (AREA)
- General Business, Economics & Management (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
- Document Processing Apparatus (AREA)
Priority Applications (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| KR1020137011659A KR101672579B1 (ko) | 2010-11-05 | 2011-11-02 | 키워드 추출에 관한 시스템 및 방법 |
| EP11838723.2A EP2635965A4 (en) | 2010-11-05 | 2011-11-02 | KEYWORK EXTRACTION SYSTEMS AND METHODS |
| JP2013537776A JP5990178B2 (ja) | 2010-11-05 | 2011-11-02 | キーワード抽出に関するシステム及び方法 |
| CN2011800531753A CN103201718A (zh) | 2010-11-05 | 2011-11-02 | 关于关键词提取的系统和方法 |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US41039210P | 2010-11-05 | 2010-11-05 | |
| US61/410,392 | 2010-11-05 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2012061462A1 (en) | 2012-05-10 |
Family
ID=46020615
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2011/058899 Ceased WO2012061462A1 (en) | 2010-11-05 | 2011-11-02 | Systems and methods regarding keyword extraction |
Country Status (6)
| Country | Link |
|---|---|
| US (1) | US8874568B2 |
| EP (1) | EP2635965A4 |
| JP (1) | JP5990178B2 |
| KR (1) | KR101672579B1 |
| CN (1) | CN103201718A |
| WO (1) | WO2012061462A1 |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2014203264A1 (en) * | 2013-06-21 | 2014-12-24 | Hewlett-Packard Development Company, L.P. | Topic based classification of documents |
| US20240419890A1 (en) * | 2023-06-19 | 2024-12-19 | International Business Machines Corporation | Signature discourse transformation |
Families Citing this family (72)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20120076414A1 (en) * | 2010-09-27 | 2012-03-29 | Microsoft Corporation | External Image Based Summarization Techniques |
| US8732014B2 (en) * | 2010-12-20 | 2014-05-20 | Yahoo! Inc. | Automatic classification of display ads using ad images and landing pages |
| US9558267B2 (en) * | 2011-02-11 | 2017-01-31 | International Business Machines Corporation | Real-time data mining |
| US8898163B2 (en) | 2011-02-11 | 2014-11-25 | International Business Machines Corporation | Real-time information mining |
| US8983826B2 (en) * | 2011-06-30 | 2015-03-17 | Palo Alto Research Center Incorporated | Method and system for extracting shadow entities from emails |
| CN103198057B (zh) * | 2012-01-05 | 2017-11-07 | 深圳市世纪光速信息技术有限公司 | 一种自动给文档添加标签的方法和装置 |
| US9613125B2 (en) * | 2012-10-11 | 2017-04-04 | Nuance Communications, Inc. | Data store organizing data using semantic classification |
| US9292797B2 (en) | 2012-12-14 | 2016-03-22 | International Business Machines Corporation | Semi-supervised data integration model for named entity classification |
| CN103473317A (zh) * | 2013-09-12 | 2013-12-25 | 百度在线网络技术(北京)有限公司 | 提取关键词的方法和设备 |
| EP3063669A4 (en) * | 2013-10-31 | 2017-04-26 | Hewlett-Packard Enterprise Development LP | Classifying document using patterns |
| CN104679768B (zh) * | 2013-11-29 | 2019-08-09 | 百度在线网络技术(北京)有限公司 | 从文档中提取关键词的方法和设备 |
| US9384287B2 (en) | 2014-01-15 | 2016-07-05 | Sap Portals Isreal Ltd. | Methods, apparatus, systems and computer readable media for use in keyword extraction |
| US8924338B1 (en) | 2014-06-11 | 2014-12-30 | Fmr Llc | Automated predictive tag management system |
| KR101624909B1 (ko) * | 2014-12-10 | 2016-05-27 | 주식회사 와이즈넛 | 정규화된 키워드 가중치에 기반한 연관 키워드 추출 방법 |
| JP6074820B2 (ja) * | 2015-01-23 | 2017-02-08 | 国立研究開発法人情報通信研究機構 | アノテーション補助装置及びそのためのコンピュータプログラム |
| US10140314B2 (en) * | 2015-08-21 | 2018-11-27 | Adobe Systems Incorporated | Previews for contextual searches |
| US10169374B2 (en) * | 2015-08-21 | 2019-01-01 | Adobe Systems Incorporated | Image searches using image frame context |
| KR101708444B1 (ko) * | 2015-11-16 | 2017-02-22 | 주식회사 위버플 | 키워드 및 자산 가격 관련성 평가 방법 및 그 장치 |
| CN105653701B (zh) | 2015-12-31 | 2019-01-15 | 百度在线网络技术(北京)有限公司 | 模型生成方法及装置、词语赋权方法及装置 |
| US10558785B2 (en) | 2016-01-27 | 2020-02-11 | International Business Machines Corporation | Variable list based caching of patient information for evaluation of patient rules |
| US10528702B2 (en) | 2016-02-02 | 2020-01-07 | International Business Machines Corporation | Multi-modal communication with patients based on historical analysis |
| US10565309B2 (en) * | 2016-02-17 | 2020-02-18 | International Business Machines Corporation | Interpreting the meaning of clinical values in electronic medical records |
| US11037658B2 (en) | 2016-02-17 | 2021-06-15 | International Business Machines Corporation | Clinical condition based cohort identification and evaluation |
| US10937526B2 (en) | 2016-02-17 | 2021-03-02 | International Business Machines Corporation | Cognitive evaluation of assessment questions and answers to determine patient characteristics |
| US10685089B2 (en) | 2016-02-17 | 2020-06-16 | International Business Machines Corporation | Modifying patient communications based on simulation of vendor communications |
| US10282356B2 (en) | 2016-03-07 | 2019-05-07 | International Business Machines Corporation | Evaluating quality of annotation |
| CN107203542A (zh) * | 2016-03-17 | 2017-09-26 | 阿里巴巴集团控股有限公司 | Phrase extraction method and device |
| US10311388B2 (en) | 2016-03-22 | 2019-06-04 | International Business Machines Corporation | Optimization of patient care team based on correlation of patient characteristics and care provider characteristics |
| US10923231B2 (en) | 2016-03-23 | 2021-02-16 | International Business Machines Corporation | Dynamic selection and sequencing of healthcare assessments for patients |
| CN105912524B (zh) * | 2016-04-09 | 2019-08-20 | 北京交通大学 | Method and device for extracting article topic keywords based on low-rank matrix factorization |
| RU2619193C1 (ru) | 2016-06-17 | 2017-05-12 | Общество с ограниченной ответственностью "Аби ИнфоПоиск" | Multi-stage recognition of named entities in natural language texts based on morphological and semantic features |
| US10318562B2 (en) * | 2016-07-27 | 2019-06-11 | Google Llc | Triggering application information |
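| KR101931859B1 (ko) * | 2016-09-29 | 2018-12-21 | (주)시지온 | Method for selecting representative words of an electronic document, method for providing an electronic document, and computing system performing the same |
| CN108073568B (zh) * | 2016-11-10 | 2020-09-11 | 腾讯科技(深圳)有限公司 | Keyword extraction method and device |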
| US9965460B1 (en) * | 2016-12-29 | 2018-05-08 | Konica Minolta Laboratory U.S.A., Inc. | Keyword extraction for relationship maps |
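| CN107248927B (zh) * | 2017-05-02 | 2020-06-09 | 华为技术有限公司 | Method for generating a fault location model, and fault location method and device |
| CN107704503A (zh) * | 2017-08-29 | 2018-02-16 | 平安科技(深圳)有限公司 | User keyword extraction device, method, and computer-readable storage medium |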
| US10417268B2 (en) * | 2017-09-22 | 2019-09-17 | Druva Technologies Pte. Ltd. | Keyphrase extraction system and method |
| CN112037774B (zh) * | 2017-10-24 | 2024-04-26 | 北京嘀嘀无限科技发展有限公司 | System and method for key phrase recognition |
| US11216452B2 (en) * | 2017-11-01 | 2022-01-04 | Sap Se | Systems and methods for disparate data source aggregation, self-adjusting data model and API |
| KR102019194B1 (ko) | 2017-11-22 | 2019-09-06 | 주식회사 와이즈넛 | System and method for extracting core keywords from a document |
| JP7239991B2 (ja) * | 2018-01-05 | 2023-03-15 | 国立大学法人九州工業大学 | Labeling device, labeling method, and program |
| US20190272071A1 (en) * | 2018-03-02 | 2019-09-05 | International Business Machines Corporation | Automatic generation of a hierarchically layered collaboratively edited document view |
| US10831803B2 (en) * | 2018-07-26 | 2020-11-10 | Beijing Jingdong Shangke Information Technology Co., Ltd. | System and method for true product word recognition |
| US11404058B2 (en) | 2018-10-31 | 2022-08-02 | Walmart Apollo, Llc | System and method for handling multi-turn conversations and context management for voice enabled ecommerce transactions |
| US11238850B2 (en) | 2018-10-31 | 2022-02-01 | Walmart Apollo, Llc | Systems and methods for e-commerce API orchestration using natural language interfaces |
| US11195524B2 (en) * | 2018-10-31 | 2021-12-07 | Walmart Apollo, Llc | System and method for contextual search query revision |
| US11183176B2 (en) | 2018-10-31 | 2021-11-23 | Walmart Apollo, Llc | Systems and methods for server-less voice applications |
| CN109977397B (zh) * | 2019-02-18 | 2022-11-15 | 广州市诚毅科技软件开发有限公司 | News hotspot extraction method, system, and storage medium based on part-of-speech combinations |
| US12118314B2 (en) * | 2019-05-31 | 2024-10-15 | Nec Corporation | Parameter learning apparatus, parameter learning method, and computer readable recording medium |
| US11874882B2 (en) * | 2019-07-02 | 2024-01-16 | Microsoft Technology Licensing, Llc | Extracting key phrase candidates from documents and producing topical authority ranking |
| US11250214B2 (en) | 2019-07-02 | 2022-02-15 | Microsoft Technology Licensing, Llc | Keyphrase extraction beyond language modeling |
| CN110362827B (zh) * | 2019-07-11 | 2024-05-14 | 腾讯科技(深圳)有限公司 | Keyword extraction method, device, and storage medium |
| CN110377725B (zh) * | 2019-07-12 | 2021-09-24 | 深圳新度博望科技有限公司 | Data generation method, device, computer equipment, and storage medium |
| CN110516237B (zh) * | 2019-08-15 | 2022-12-09 | 重庆长安汽车股份有限公司 | Short-text phrase extraction method, system, and storage medium |
| CN110781662B (zh) * | 2019-10-21 | 2022-02-01 | 腾讯科技(深圳)有限公司 | Method for determining pointwise mutual information and related device |
| CN113703588B (zh) * | 2020-05-20 | 2024-10-29 | 北京搜狗科技发展有限公司 | Input method, device, and device for input |
| US10878174B1 (en) * | 2020-06-24 | 2020-12-29 | Starmind Ag | Advanced text tagging using key phrase extraction and key phrase generation |
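| CN114490956A (zh) * | 2020-10-26 | 2022-05-13 | 北京金山数字娱乐科技有限公司 | Keyword extraction method and device |
| CN112347778B (zh) * | 2020-11-06 | 2023-06-20 | 平安科技(深圳)有限公司 | Keyword extraction method, device, terminal equipment, and storage medium |
| KR102639979B1 (ko) * | 2020-12-08 | 2024-02-22 | 주식회사 카카오엔터프라이즈 | Main keyword extraction device, control method thereof, and main keyword extraction program |
| CN115700499A (zh) * | 2021-07-23 | 2023-02-07 | 北京橙心无限科技发展有限公司 | Information processing method, device, electronic equipment, and readable storage medium |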
| US11379763B1 (en) | 2021-08-10 | 2022-07-05 | Starmind Ag | Ontology-based technology platform for mapping and filtering skills, job titles, and expertise topics |
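| KR102334236B1 (ko) | 2021-08-31 | 2021-12-02 | (주)네오플로우 | Method for extracting meaningful keywords from speech-converted text data and use thereof |
| KR102334255B1 (ko) | 2021-08-31 | 2021-12-02 | (주)네오플로우 | Method for building a text data collection platform for AI-based voice services and integrated management thereof |
| CN114416919A (zh) * | 2021-12-21 | 2022-04-29 | 火星语盟(深圳)科技有限公司 | Keyword extraction method and system |
| CN114398968B (zh) * | 2022-01-06 | 2022-09-20 | 北京博瑞彤芸科技股份有限公司 | Method and device for labeling similar customer-acquisition files based on file similarity |
| CN115204146B (zh) * | 2022-07-28 | 2023-06-27 | 平安科技(深圳)有限公司 | Keyword extraction method, device, computer equipment, and storage medium |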
| US20240386062A1 (en) * | 2023-05-16 | 2024-11-21 | Sap Se | Label Extraction and Recommendation Based on Data Asset Metadata |
| US20250190701A1 (en) * | 2023-12-06 | 2025-06-12 | Jpmorgan Chase Bank, N.A. | Method and system for extracting key phrases |
| US20250231952A1 (en) * | 2024-01-12 | 2025-07-17 | Y.E. Hub Armenia LLC | System and method for ranking search engine results |
| CN119558314B (zh) * | 2025-01-27 | 2025-04-11 | 成都理工大学 | Neural-network-based method for automatic recognition of multilingual accounting terms |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20060287988A1 (en) * | 2005-06-20 | 2006-12-21 | Efficient Frontier | Keyword charaterization and application |
| US20070198506A1 (en) * | 2006-01-18 | 2007-08-23 | Ilial, Inc. | System and method for context-based knowledge search, tagging, collaboration, management, and advertisement |
| US20090254512A1 (en) * | 2008-04-03 | 2009-10-08 | Yahoo! Inc. | Ad matching by augmenting a search query with knowledge obtained through search engine results |
| US20100185689A1 (en) * | 2009-01-20 | 2010-07-22 | Microsoft Corporation | Enhancing Keyword Advertising Using Wikipedia Semantics |
Family Cites Families (22)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPH0765018A (ja) * | 1993-08-31 | 1995-03-10 | Matsushita Electric Ind Co Ltd | Automatic keyword extraction device |
| US6167368A (en) * | 1998-08-14 | 2000-12-26 | The Trustees Of Columbia University In The City Of New York | Method and system for indentifying significant topics of a document |
| US7925610B2 (en) * | 1999-09-22 | 2011-04-12 | Google Inc. | Determining a meaning of a knowledge item using document-based information |
| US7526425B2 (en) * | 2001-08-14 | 2009-04-28 | Evri Inc. | Method and system for extending keyword searching to syntactically and semantically annotated data |
| JP2004139553A (ja) * | 2002-08-19 | 2004-05-13 | Matsushita Electric Ind Co Ltd | Document retrieval system and question answering system |
| US7139752B2 (en) * | 2003-05-30 | 2006-11-21 | International Business Machines Corporation | System, method and computer program product for performing unstructured information management and automatic text analysis, and providing multiple document views derived from different document tokenizations |
| US7555705B2 (en) * | 2003-09-10 | 2009-06-30 | Microsoft Corporation | Annotation management in a pen-based computing system |
| US7428529B2 (en) * | 2004-04-15 | 2008-09-23 | Microsoft Corporation | Term suggestion for multi-sense query |
| JP2006146705A (ja) * | 2004-11-22 | 2006-06-08 | Mitsubishi Electric Corp | Structured document fuzzy matching device and program therefor |
| US8135728B2 (en) * | 2005-03-24 | 2012-03-13 | Microsoft Corporation | Web document keyword and phrase extraction |
| JP4236057B2 (ja) * | 2006-03-24 | 2009-03-11 | インターナショナル・ビジネス・マシーンズ・コーポレーション | System for extracting new compound words |
| US8341112B2 (en) * | 2006-05-19 | 2012-12-25 | Microsoft Corporation | Annotation by search |
| US8001105B2 (en) * | 2006-06-09 | 2011-08-16 | Ebay Inc. | System and method for keyword extraction and contextual advertisement generation |
| JP2008065417A (ja) * | 2006-09-05 | 2008-03-21 | Hottolink Inc | Associated word group search device, system, and content-match advertising system |
| JP3983265B1 (ja) * | 2006-09-27 | 2007-09-26 | 沖電気工業株式会社 | Dictionary creation support system, method, and program |
| US20080098300A1 (en) * | 2006-10-24 | 2008-04-24 | Brilliant Shopper, Inc. | Method and system for extracting information from web pages |
| JP5193669B2 (ja) * | 2008-05-08 | 2013-05-08 | 株式会社野村総合研究所 | Search system |
| US8386519B2 (en) * | 2008-12-30 | 2013-02-26 | Expanse Networks, Inc. | Pangenetic web item recommendation system |
| JP5143057B2 (ja) * | 2009-03-02 | 2013-02-13 | 日本電信電話株式会社 | Important keyword extraction device, method, and program |
| US20100281025A1 (en) * | 2009-05-04 | 2010-11-04 | Motorola, Inc. | Method and system for recommendation of content items |
| BR112012006743A2 (pt) * | 2009-09-26 | 2019-09-24 | Ogilvy Hamish | Method for indexing a plurality of documents, system for indexing a plurality of documents, method for analyzing a portion of text and retrieving documents relevant to the portion of text, method for refining the results of a search, system for refining the results of a search, system for analyzing a portion of input text and retrieving documents relevant to the portion of text, and computer-readable media |
| US8463786B2 (en) * | 2010-06-10 | 2013-06-11 | Microsoft Corporation | Extracting topically related keywords from related documents |
2011
- 2011-11-02 US US13/287,294 patent/US8874568B2/en active Active
- 2011-11-02 JP JP2013537776A patent/JP5990178B2/ja active Active
- 2011-11-02 WO PCT/US2011/058899 patent/WO2012061462A1/en not_active Ceased
- 2011-11-02 KR KR1020137011659A patent/KR101672579B1/ko active Active
- 2011-11-02 CN CN2011800531753A patent/CN103201718A/zh active Pending
- 2011-11-02 EP EP11838723.2A patent/EP2635965A4/en not_active Withdrawn
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP2635965A4 * |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2014203264A1 (en) * | 2013-06-21 | 2014-12-24 | Hewlett-Packard Development Company, L.P. | Topic based classification of documents |
| US20240419890A1 (en) * | 2023-06-19 | 2024-12-19 | International Business Machines Corporation | Signature discourse transformation |
| US12327078B2 (en) * | 2023-06-19 | 2025-06-10 | International Business Machines Corporation | Signature discourse transformation |
Also Published As
| Publication number | Publication date |
|---|---|
| KR20130142124A (ko) | 2013-12-27 |
| EP2635965A4 (en) | 2016-08-10 |
| US20120117092A1 (en) | 2012-05-10 |
| KR101672579B1 (ko) | 2016-11-03 |
| JP5990178B2 (ja) | 2016-09-07 |
| CN103201718A (zh) | 2013-07-10 |
| EP2635965A1 (en) | 2013-09-11 |
| JP2013544397A (ja) | 2013-12-12 |
| US8874568B2 (en) | 2014-10-28 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| US8874568B2 (en) | Systems and methods regarding keyword extraction | |
| Firoozeh et al. | Keyword extraction: Issues and methods | |
| US8819047B2 (en) | Fact verification engine | |
| US8819001B1 (en) | Systems, methods, and user interface for discovering and presenting important contents in a document | |
| US8892422B1 (en) | Phrase identification in a sequence of words | |
| Bansal et al. | Hybrid attribute based sentiment classification of online reviews for consumer intelligence | |
| US20170277668A1 (en) | Automatic document summarization using search engine intelligence | |
| CN104281702B (zh) | Data retrieval method and device based on power keyword segmentation | |
| US20090070322A1 (en) | Browsing knowledge on the basis of semantic relations | |
| WO2019217096A1 (en) | System and method for automatically responding to user requests | |
| WO2006108069A2 (en) | Searching through content which is accessible through web-based forms | |
| Boston et al. | Wikimantic: Toward effective disambiguation and expansion of queries | |
| Das et al. | Temporal analysis of sentiment events–a visual realization and tracking | |
| Bendersky et al. | Joint annotation of search queries | |
| Fauzi et al. | Image understanding and the web: a state-of-the-art review | |
| Lin et al. | Online Plagiarized Detection Through Exploiting Lexical, Syntax, and Semantic Information | |
| Noah et al. | Evaluation of lexical-based approaches to the semantic similarity of Malay sentences | |
| Riaz | Concept search in Urdu | |
| Alashri et al. | Lexi-augmenter: Lexicon-based model for tweets sentiment analysis | |
| Klang et al. | Linking, searching, and visualizing entities in wikipedia | |
| Zhang et al. | A semantics-based method for clustering of Chinese web search results | |
| Bhaskar et al. | Tweet Contextualization (Answering Tweet Question)-the Role of Multi-document Summarization. | |
| Ghorai | An Information Retrieval System for FIRE 2016 Microblog Track. | |
| Ermakova et al. | IRIT at INEX: question answering task | |
| Kanhabua | Time-aware approaches to information retrieval | |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 11838723; Country of ref document: EP; Kind code of ref document: A1 |
| | REEP | Request for entry into the european phase | Ref document number: 2011838723; Country of ref document: EP |
| | WWE | Wipo information: entry into national phase | Ref document number: 2011838723; Country of ref document: EP |
| | ENP | Entry into the national phase | Ref document number: 2013537776; Country of ref document: JP; Kind code of ref document: A |
| | ENP | Entry into the national phase | Ref document number: 20137011659; Country of ref document: KR; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |