WO2016093532A1 - Method for extracting related keywords based on normalized keyword weights - Google Patents

Method for extracting related keywords based on normalized keyword weights

Info

Publication number
WO2016093532A1
Authority
WO
WIPO (PCT)
Prior art keywords
keyword
weight
document
keywords
weights
Application number
PCT/KR2015/012949
Other languages
English (en)
Korean (ko)
Inventor
한규열
안영민
Original Assignee
주식회사 와이즈넛
Application filed by 주식회사 와이즈넛
Publication of WO2016093532A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/178 Techniques for file synchronisation in file systems
    • G06F16/1794 Details of file format conversion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually

Definitions

  • The present invention relates to search techniques and, in particular, to a method for extracting related keywords used in search.
  • Information is extracted from document databases residing on a network or in the memory of a computer. Documents are organized around concepts: a document is made up of a set of conceptual words, and a concept is usually expressed by several words. Human memory, however, is not mechanically perfect. People often remember the meaning, background, or feel of an approximate concept without remembering the exact word or words.
  • Each conceptual word can be assumed to be related to the concept. If so, the keywords contained in the documents can be organized in advance, and related keywords can be offered for the keywords used in a search, for the convenience of the user.
  • If an association analysis technique is applied to the keywords appearing in the documents, the related keywords for each keyword can be obtained.
  • Conventionally, the frequency of co-occurrence in documents is used to measure the reliability of a related keyword.
  • The problem with this approach is that the keyword with the highest frequency in the documents is analyzed as being related to every keyword that co-occurs with it.
  • Existing methods that extract keywords from documents and determine their importance rely on frequency of appearance. Because a keyword weight by TF-IDF is computed over the entire document set, it does not reflect the importance of the keyword within a single document. In particular, no weight can exist for a keyword that did not appear in the training document set on which TF-IDF was calculated.
  • The inventors completed the present invention after a long research effort to solve this problem.
  • An object of the present invention is to propose an algorithmic method that conveniently reveals the associations between keywords appearing in documents.
  • The present invention provides a new methodology for analyzing the relationships between keywords that overcomes the problems of related-keyword analysis by simple frequency described in the background art.
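  • The related keyword extraction method based on normalized keyword weights of the present invention comprises: (a) generating, by a computer device, keyword candidates from a document set; (b) assigning a first weight to each generated keyword and normalizing the keyword weight by the sum of the weights in all documents; (c) calculating the degree of association of a pair of co-occurring keywords as the sum of the weights in the documents in which they appear together, thereby calculating the weight (second weight) of the related keywords for each keyword; and (d) determining the ranking of the keyword pairs using the degree of association of the pairs to which the second weight has been assigned.
  • The first weight may be calculated using the formula of FIG. 4, where W is the weight of the keyword in a specific document, TL is the length of the keyword, TF is the frequency of the keyword, and ISF is the inverted sentence frequency, derived from the frequency of sentences in which the keyword appears.
  • The normalized weight NTFISF(W_i) of step (b) may be calculated using the formula of FIG. 5, where W_i is the weight of the i-th keyword and t is a document belonging to the document set.
  • The degree of association (RW) of the keyword pair in step (c) may be calculated using the second-weight formula of FIG. 6, where W_i is the weight of the i-th keyword, W_j is the weight of the j-th keyword, and t is a document belonging to the document set.
  • The ranking of step (d) may be determined by a weighted sum of the second weight and the document frequency with which the keywords of the pair appear together.
  • Step (b) may further include selecting the top N (N being an integer greater than 1) of the normalized keyword weights as important keywords.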
  • FIG. 1 is a diagram conceptually showing an example of a computer system configuration of the present invention.
  • FIG. 2 is a diagram schematically showing an overall process according to a preferred embodiment of the present invention.
  • FIG. 3 is a flowchart describing step S200 of FIG. 2 hierarchically.
  • FIG. 5 is a formula for normalizing the keyword weight of FIG. 4 by the sum of the weights in all documents in the embodiment of the present invention.
  • FIG. 6 is a formula for calculating the degree of association of two co-occurring keywords in the embodiment of the present invention.
  • The method of the present invention is executed by information retrieval and data mining software implemented on a computer system.
  • FIG. 1 conceptually illustrates the system configuration of the present invention.
  • In the document database 20, a plurality of documents are stored and together constitute a document set.
  • The document database 20 may be a database inside a company or organization.
  • Alternatively, the document database 20 may be a database of a website that a plurality of user terminals can access through a network.
  • The server 30 may manage the document database 20 through devices and programs for information retrieval, data mining, and other database operations.
  • The computer device 10 may access the document database 20 for information retrieval. Normally, the device 10 searches using a keyword 7. Such a search may look for knowledge on a search site such as Google, for a product on an online shopping site, or for desired information on any other website. If the retrieval system can extract keywords 9 that have an association RW with the keyword 7 by means of a sophisticated algorithm, the utility of the document database 20 increases greatly.
  • The present specification discloses a specific, novel method for extracting keywords with a high degree of association (RW) with the keyword 7.
  • FIG. 2 schematically illustrates the entire process by which a computer device 10 extracts related keywords from a document set in a document database 20, in accordance with one preferred embodiment of the present invention.
  • Keyword candidates are generated from the document set (S100). Sentence analysis is performed to generate the candidates, and it may use a commonly used morphological analysis method.
  • A unit consisting of zero or more prefixes in front of a noun and zero or more suffixes after it can be used as a keyword candidate, as shown in Equation 1 below.
  • Here, * means that the part of speech may be repeated zero or more times.
  • + means that the part of speech may be repeated one or more times.
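  • Equation 1 itself is not reproduced in this text, so the Python sketch below assumes the pattern reads prefix* noun+ suffix* over a sequence of part-of-speech tags. The tag names (PREFIX, NOUN, SUFFIX), the one-letter codes, and the function name are hypothetical, and a real morphological analyzer would supply its own tag set.

```python
import re

# Hypothetical one-letter codes for part-of-speech tags (not from the source).
TAG_CODES = {"PREFIX": "p", "NOUN": "n", "SUFFIX": "s"}

# Assumed reading of the candidate pattern: prefix* noun+ suffix*
CANDIDATE = re.compile(r"p*n+s*")

def keyword_candidates(tagged):
    """tagged: list of (morpheme, tag) pairs produced by a morphological analyzer."""
    # Encode the tag sequence as a string so a regex can scan it; other tags become "x".
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    candidates = []
    for match in CANDIDATE.finditer(codes):
        morphemes = [morph for morph, _ in tagged[match.start():match.end()]]
        candidates.append(" ".join(morphemes))
    return candidates

sentence = [("Korean", "NOUN"), ("football", "NOUN"), ("is", "VERB"), ("popular", "ADJ")]
print(keyword_candidates(sentence))
```

  • Encoding tags as single characters keeps the pattern close to the * and + notation used in the text; morphemes are joined with a space here, which would need adjusting for languages written without spaces between morphemes.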
  • The keyword weight is calculated for each generated keyword and normalized by the sum of the weights in all documents (S200).
  • The keyword weight at this stage is referred to as the <first weight>.
  • The first weight may be obtained using the Term Frequency-Inverted Sentence Frequency (TF-ISF) technique.
  • As shown in FIG. 3, step S200 may include calculating keyword weights by the TF-ISF technique (S210), and normalizing over the entire document set and selecting important keywords (S220).
  • The keyword weight by TF-ISF is given by the equation of FIG. 4. W is the weight of a keyword in a particular document.
  • TL indicates the length of the keyword. For example, when "Korea" and "soccer" are combined to form a compound noun such as "Korean football," the length of the compound can be reflected in the keyword's weight.
  • TF refers to the frequency of the keyword.
  • ISF refers to the inverted sentence frequency, which is derived from the frequency of sentences in which the keyword appears.
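  • The formula of FIG. 4 is not reproduced in this text, so the following Python sketch assumes one plausible form, W = TL × TF × log(1 + S/SF), where S is the number of sentences in the document and SF is the number of sentences containing the keyword. The function name, the substring-based counting, and the logarithmic form of ISF are assumptions made for illustration, not the formula of the patent.

```python
import math

def tf_isf_weight(keyword, sentences):
    """Sketch of a first weight for `keyword` in one document (a list of sentences)."""
    tl = len(keyword.split())                        # TL: number of component words
    tf = sum(s.count(keyword) for s in sentences)    # TF: occurrences in the document
    sf = sum(1 for s in sentences if keyword in s)   # sentences containing the keyword
    if sf == 0:
        return 0.0                                   # keyword absent from this document
    isf = math.log(1 + len(sentences) / sf)          # assumed inverted sentence frequency
    return tl * tf * isf

doc = ["korean football wins", "football fans cheer", "the weather is fine"]
print(tf_isf_weight("football", doc))
print(tf_isf_weight("korean football", doc))
```

  • Because the weight is computed from a single document, a keyword that never appeared in any training collection still receives a weight, which is the limitation of TF-IDF that the description points out.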
  • The weight of the keyword in a specific document (the first weight) obtained by FIG. 4 is normalized by the sum of the weights in all documents using the formula of FIG. 5.
  • W_i is the weight of the i-th keyword.
  • t is a document belonging to the document set.
  • The top N of the keyword weights normalized by the equation of FIG. 5 may be selected as important keywords.
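  • The formula of FIG. 5 is likewise not reproduced here. The Python sketch below takes one possible reading, in which the weight of keyword i in document t is divided by the sum of that keyword's weights over all documents, and the top N keywords are then kept per document. Both the reading and the function names are assumptions.

```python
from collections import defaultdict

def normalize_weights(weights):
    """weights: {doc_id: {keyword: first_weight}} -> same shape with normalized weights."""
    totals = defaultdict(float)
    for doc in weights.values():
        for keyword, w in doc.items():
            totals[keyword] += w                     # sum of the keyword's weights over all documents
    return {
        doc_id: {kw: w / totals[kw] for kw, w in doc.items() if totals[kw] > 0}
        for doc_id, doc in weights.items()
    }

def top_n(doc_weights, n):
    """Keep the n highest-weighted keywords of one document as important keywords."""
    ranked = sorted(doc_weights.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:n])

first = {"d1": {"a": 2.0, "b": 1.0}, "d2": {"a": 6.0}}
norm = normalize_weights(first)
print(norm)
print(top_n(norm["d1"], 1))
```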
  • Any two keywords can be paired, and the keywords of a pair are related to each other.
  • The weight of the related keywords (second weight) for each keyword can be calculated by computing the degree of association of each co-occurring keyword pair as the sum of the weights in the documents in which the two keywords appear together (S300).
  • The degree of association (RW) of two co-occurring keywords may be calculated by the equation of FIG. 6.
  • W_i denotes the weight of the i-th keyword.
  • W_j denotes the weight of the j-th keyword.
  • t denotes a document belonging to the document set. That is, when the i-th keyword is "AAA" and the j-th keyword is "BBB", the degree of association between "AAA" and "BBB" can be calculated as a sum of the normalized weights in the documents.
  • W_ti and W_tj indicate that both keywords belong to document t.
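  • As a rough Python illustration of step S300, the sketch below assumes that RW for a pair is the sum, over every document containing both keywords, of the two normalized weights W_ti + W_tj. The exact formula of FIG. 6 is not available in this text, so this form and the function name are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def association_degrees(norm):
    """norm: {doc_id: {keyword: normalized_weight}} -> {(kw_i, kw_j): RW}."""
    rw = defaultdict(float)
    for doc in norm.values():
        # Every unordered pair of keywords in the same document co-occurs in it.
        for kw_i, kw_j in combinations(sorted(doc), 2):
            rw[(kw_i, kw_j)] += doc[kw_i] + doc[kw_j]
    return dict(rw)

norm = {
    "d1": {"AAA": 0.25, "BBB": 0.5},
    "d2": {"AAA": 0.75, "BBB": 0.5, "CCC": 1.0},
}
print(association_degrees(norm))
```

  • Sorting the keywords before pairing gives each pair a single canonical key, so ("AAA", "BBB") and ("BBB", "AAA") are never counted separately.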
  • The ranking of the related keywords is determined (S400).
  • The ranking may be determined in descending order of association among the keyword pairs, using the degree of association of the pairs to which the second weight has been assigned. It may be calculated as a weighted sum of the related-keyword weight of the two keywords (FIG. 6) and the document frequency (DF).
  • DF(W_i, W_j) of FIG. 7 is the number of documents in which "AAA" and "BBB" from the previous example appear together; the higher this number, the higher the weight. This means that keyword pairs whose normalized weights sum to a high value and that appear together in many documents are used as related keywords.
  • The related keyword ranking indicates which keywords are highly related by sorting the weight values of FIG. 7 in descending order.
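  • The ranking step can be sketched in Python as follows, assuming the score of FIG. 7 has the form alpha × RW + beta × DF. The coefficients alpha and beta, their default values, and the function name are hypothetical, since the text only states that a weighted sum is used.

```python
from collections import Counter
from itertools import combinations

def rank_pairs(norm, rw, alpha=1.0, beta=1.0):
    """Rank keyword pairs by an assumed weighted sum of RW and document frequency.

    norm: {doc_id: {keyword: normalized_weight}}
    rw:   {(kw_i, kw_j): degree of association}, keys sorted alphabetically
    """
    df = Counter()
    for doc in norm.values():
        df.update(combinations(sorted(doc), 2))      # DF: documents containing both keywords
    scores = {pair: alpha * value + beta * df[pair] for pair, value in rw.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

norm = {
    "d1": {"AAA": 0.25, "BBB": 0.5},
    "d2": {"AAA": 0.75, "BBB": 0.5, "CCC": 1.0},
}
rw = {("AAA", "BBB"): 2.0, ("AAA", "CCC"): 1.75, ("BBB", "CCC"): 1.5}
for pair, score in rank_pairs(norm, rw):
    print(pair, score)
```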
  • The related keyword extraction method of the present invention described above can effectively reflect the importance of a keyword in each document.
  • Highly relevant keyword pairs can be extracted more accurately than with the conventional method.
  • Related keywords can be recommended more accurately to users who search for information by entering keywords. If an information provider operates a shopping-related database, more effective product recommendations can be made.
  • The algorithm of the present invention described above may also be used to extract key words from a document set, to rank search results in a search engine, or to calculate the degree of similarity between documents.
  • The method of extracting related keywords may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer-readable medium.
  • The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination.
  • The program instructions recorded on the medium may be specially designed and constructed for the present invention, or they may be of a kind well known and available to those skilled in the computer software arts.
  • Examples of computer-readable media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, and flash memory.
  • Examples of program instructions include machine code, such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter.
  • The hardware devices may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a method for extracting related keywords from a document set in a document database. The method of the present invention comprises: (a) generating, by a computer device, keyword candidates from the document set; (b) assigning a first weight to each generated keyword and normalizing the keyword weight by the sum of the weights in all documents; (c) calculating the degree of association of a pair of co-occurring keywords as the sum of the weights in the documents in which the keywords appear together, thereby calculating, for each keyword, the weight (second weight) of its related keywords; and (d) determining the ranking of the keyword pairs using the degree of association of the keyword pairs to which the second weight has been assigned.
PCT/KR2015/012949 2014-12-10 2015-12-01 Method for extracting related keywords based on normalized keyword weights WO2016093532A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020140177226A KR101624909B1 (ko) 2014-12-10 2014-12-10 Method for extracting related keywords based on normalized keyword weights
KR10-2014-0177226 2014-12-10

Publications (1)

Publication Number Publication Date
WO2016093532A1 (fr) 2016-06-16

Family

ID=56106180

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2015/012949 WO2016093532A1 (fr) Method for extracting related keywords based on normalized keyword weights 2014-12-10 2015-12-01

Country Status (2)

Country Link
KR (1) KR101624909B1 (fr)
WO (1) WO2016093532A1 (fr)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019022262A1 (fr) * 2017-07-24 2019-01-31 주식회사 마이셀럽스 System for searching and guiding preferences for each area of interest
KR102019194B1 (ko) * 2017-11-22 2019-09-06 주식회사 와이즈넛 System and method for extracting core keywords in a document
KR102128659B1 (ko) * 2018-10-16 2020-06-30 주식회사 포스코아이씨티 System and method for keyword extraction and summary generation
KR102376489B1 (ko) * 2019-11-22 2022-03-18 주식회사 와이즈넛 Apparatus and method for text document clustering and topic generation based on word ranking
KR102675245B1 (ko) 2021-03-22 2024-06-17 삼육대학교산학협력단 Efficient keyword extraction method for social big data based on cohesion score
KR102528401B1 (ko) 2021-06-07 2023-05-03 삼육대학교산학협력단 System for providing interactive morphological analysis

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20090017269A (ko) * 2007-08-14 2009-02-18 엔에이치엔(주) Method for measuring category relevance based on word relevance, and system for performing the method
WO2010120101A2 (fr) * 2009-04-13 2010-10-21 (주)미디어레 Keyword recommendation method using an inverse vector space model, and corresponding apparatus
US20110264699A1 (en) * 2008-12-30 2011-10-27 Telecom Italia S.P.A. Method and system for content classification
KR20130036863A (ko) * 2011-10-05 2013-04-15 (주)워드워즈 Document classification system using semantic features, and method thereof
KR20130142124A (ko) * 2010-11-05 2013-12-27 라쿠텐 인코포레이티드 System and method for keyword extraction

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
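CN107766307A (zh) * 2016-08-18 2018-03-06 阿里巴巴集团控股有限公司 Method and device for linking form elements
CN109101574A (zh) * 2018-07-18 2018-12-28 北京明朝万达科技股份有限公司 Task approval method and system for a data leakage prevention system
CN109101574B (zh) * 2018-07-18 2020-09-25 北京明朝万达科技股份有限公司 Task approval method and system for a data leakage prevention system
CN111782986A (zh) * 2019-05-17 2020-10-16 北京京东尚科信息技术有限公司 Method and apparatus for monitoring access based on short links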

Also Published As

Publication number Publication date
KR101624909B1 (ko) 2016-05-27

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 15868420; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 15868420; Country of ref document: EP; Kind code of ref document: A1)