WO2021027085A1 - Method and device for automatically extracting text keyword, and storage medium - Google Patents
- Publication number
- WO2021027085A1 (application PCT/CN2019/115115)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- keywords
- word
- ary
- words
- candidate
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
Definitions
- The present invention relates to the field of Internet technology, and in particular to a method, a device, and a storage medium for automatically extracting text keywords.
- Automatic keyword extraction is to automatically extract thematic or important words or phrases from the text. It is the basic and necessary work of many text mining tasks such as text retrieval and text summarization.
- Document keywords represent the subject matter and key content of the document, and are the smallest unit of document content understanding.
- The statistical method uses statistical information about the words in a document to extract the keywords of the document. This method is relatively simple, requires no training data, and generally needs no external knowledge base, so extraction is fast and the method is often used in scenarios that require real-time computation.
- The first step of extracting keywords from Chinese natural language is to segment the text and build a vocabulary; keywords are then extracted from the vocabulary.
- With this approach, keywords can only be words in the vocabulary. General-purpose word segmentation tools segment at a relatively fine granularity, which produces little noise and is easy to filter, but fine segmentation often fragments semantics. For example, "China Internet Conference" is split into "China", "Internet", and "Conference"; if "China Internet Conference" is the actual keyword, it is not in the vocabulary, so it is discarded and never extracted as a keyword. If the word segmentation tool instead uses a coarse granularity (such as ternary or above), it introduces more noise that is hard to filter, so many noisy keywords are extracted.
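- The purpose of the present invention is to provide a method, a device, and a storage medium for automatically extracting text keywords.
- An embodiment of the present invention provides a method for automatically extracting text keywords.
- The method includes:
- obtaining an n-ary candidate keyword set; and merging pairs of keywords in the n-ary candidate keyword set that contain the same (n-1)-ary word at different positions in the two keywords, to obtain an (n+1)-ary result keyword set, where n is a positive integer greater than 1.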
- The method further includes:
- removing, from the n-ary candidate keyword set, the n-ary candidate keywords contained in the (n+1)-ary result keywords, to obtain an n-ary result keyword set.
- The method further includes:
- when the word length of a keyword in the (n+1)-ary result keyword set exceeds the maximum word length, removing the first or last unary word of the keyword to obtain an optimized keyword; and matching the optimized keyword against the qualifier table and, if it matches, replacing the keyword in the (n+1)-ary result keyword set with the optimized keyword.
- The step of "obtaining an n-ary candidate keyword set" includes:
- The text is segmented into an n-ary set, the noise in the set is filtered first, and then the keywords in the set are extracted to obtain the n-ary candidate keyword set.
- When n is 2, that is, after the text is segmented into a binary set, the steps of filtering noise include:
- Each word in the binary set consists of a preceding word and a following word.
- With x denoting the minimum of the word frequencies of the preceding word and the following word in the unary set, words in the binary set whose word frequency is less than 2x/3 are filtered.
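- The step of "obtaining an n-ary candidate keyword set" includes:
- obtaining an (n-1)-ary candidate keyword set; and merging pairs of keywords in the (n-1)-ary candidate keyword set that contain the same (n-2)-ary word at different positions in the two keywords, to obtain the (n-1)-ary result keyword set, where n is a positive integer greater than 2.
- After the keywords in the n-ary candidate keyword set are merged, noise filtering is performed to obtain the (n+1)-ary result keyword set.
- The step of performing noise filtering after merging the keywords in the n-ary candidate keyword set includes: after the text is segmented into a unary set, filtering the noise in the set, extracting the keywords in the set, and obtaining the highest word frequency max_count among those keywords; filtering merged words whose word frequency is less than max_count/4; and filtering merged words whose word frequency is less than 2.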
- An embodiment of the present invention provides an electronic device including a memory and a processor, where the memory stores a computer program that can run on the processor, and the processor implements the steps of the method for automatically extracting text keywords when executing the program.
- An embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the method for automatically extracting text keywords are implemented.
- The technical solution of the present invention merges the keywords extracted after segmentation, so that the semantics of split keywords are completed, avoiding the incomplete semantics caused by overly fine word segmentation.
- FIG. 1 is a schematic flowchart of a method for automatically extracting text keywords according to an embodiment of the present invention.
- FIG. 2 is a schematic flowchart of a method for automatically extracting text keywords in a specific embodiment of the present invention.
- The keywords of the present invention can be words, which are the smallest language units that can be used, such as flowers, birds, people, young, language, etc.
- The keywords can also be phrases, which consist of two or more words.
- A unary word refers to a single word.
- A binary word is a phrase that combines two words; that is, a binary word comprises two unary words, and so on.
- The method for automatically extracting text keywords of the present invention includes:
- obtaining an n-ary candidate keyword set; and merging pairs of keywords in the n-ary candidate keyword set that contain the same (n-1)-ary word at different positions in the two keywords, to obtain an (n+1)-ary result keyword set, where n is a positive integer greater than 1.
- There are many ways to obtain the n-ary candidate keyword set, which are described in detail later. After the n-ary candidate keyword set is obtained, the keywords in the set are compared one by one; any two keywords that contain the same (n-1)-ary word at different positions are merged into an (n+1)-ary result keyword, yielding the (n+1)-ary result keyword set.
- For example, the obtained binary candidate keyword set is: {core interest, smart phone, conflict of interest, semiconductor field, Android operation, operating system, operating ecology, attracting talents, lack of talents}. Since both "core interest" and "conflict of interest" include the unary word "interest", and "interest" occupies a different position in each of the two keywords, the two keywords can be merged in order; the merged ternary result keyword is "core conflict of interest". Proceeding in the same way, the ternary result keyword set is: {core conflict of interest, Android operating system, Android operating ecology, insufficient talent attraction}.
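The merge rule can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: each keyword is assumed to be stored as a tuple of unary words, and "the same (n-1)-ary word at different positions" is interpreted as the tail of one keyword being equal to the head of the other. The name `merge_ngrams` and the sample tokens are hypothetical; the tokens are ordered so that the shared word ends one keyword and begins the next, which differs from the English word order of "conflict of interest".

```python
from itertools import permutations


def merge_ngrams(candidates):
    """Merge pairs of n-ary keywords that overlap on an (n-1)-ary word.

    Keywords are tuples of unary words. Keywords a and b are merged when
    the last n-1 words of a equal the first n-1 words of b. Intended for
    n > 1, as in the text.
    """
    merged = []
    for a, b in permutations(candidates, 2):
        if len(a) == len(b) and a[1:] == b[:-1]:
            result = a + b[-1:]
            if result not in merged:  # avoid duplicate results
                merged.append(result)
    return merged


# Hypothetical binary candidates (assumed token order, see lead-in).
binary_candidates = [
    ("core", "interest"),
    ("interest", "conflict"),
    ("smart", "phone"),
    ("Android", "operating"),
    ("operating", "system"),
    ("operating", "ecology"),
]
ternary = merge_ngrams(binary_candidates)
```

Keywords with no overlapping partner, such as ("smart", "phone"), produce no merged result and stay in the binary candidate set.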
- The technical solution of the present invention merges the keywords extracted after segmentation, so that the semantics of split keywords are completed and the incomplete semantics caused by overly fine word segmentation is avoided.
- After the keywords in the n-ary candidate keyword set are merged, noise filtering is performed to obtain the (n+1)-ary result keyword set.
- Noise filtering refers to removing some keywords that do not meet the grammatical regulations or do not meet the requirements.
- The step of merging keywords in the n-ary candidate keyword set and performing noise filtering includes:
- Word segmentation tools such as jieba, hanlp, stanfordNLP, and thulac can be used to segment the text from which keywords are to be extracted into a unary set (also called a unigram set), and noise filtering is then performed on the words in the unary set. Specifically, the words in the set can be filtered by part of speech, word frequency, and word length.
- Part-of-speech filtering can filter out adjectives, adverbs, and prepositions, and only retain nouns and verbs.
- Word frequency filtering refers to filtering out words that appear in the text with a frequency greater than the maximum word frequency or less than the minimum word frequency.
- Word length filtering refers to filtering out words that appear in the text whose length is greater than the maximum length or less than the minimum length.
- Word frequency and word length filtering are based on the empirical data collected in tens of thousands or even tens of millions of samples to derive the maximum word frequency, minimum word frequency, maximum length, minimum length, etc., and then filter.
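The three unary filters (part of speech, word frequency, word length) could look roughly like the sketch below. All thresholds and tag names are hypothetical placeholders, since the text says the real limits are derived empirically from large samples; the input is assumed to be (word, part-of-speech tag) pairs produced by any segmenter.

```python
from collections import Counter

# Hypothetical settings; real values would be derived from sample data.
KEEP_POS = {"n", "v"}        # assumed tag names for nouns and verbs
MIN_FREQ, MAX_FREQ = 2, 50   # assumed word-frequency bounds
MIN_LEN, MAX_LEN = 2, 6      # assumed word-length bounds (characters)


def filter_unary(tagged_words):
    """Return {word: frequency} for the words that pass all three filters."""
    freq = Counter(word for word, _ in tagged_words)
    kept = {}
    for word, pos in tagged_words:
        if pos not in KEEP_POS:
            continue  # part-of-speech filter
        if not (MIN_FREQ <= freq[word] <= MAX_FREQ):
            continue  # word-frequency filter
        if not (MIN_LEN <= len(word) <= MAX_LEN):
            continue  # word-length filter
        kept[word] = freq[word]
    return kept


sample = [("phone", "n"), ("phone", "n"), ("very", "d"), ("very", "d"),
          ("a", "n"), ("a", "n"), ("run", "v")]
unary_set = filter_unary(sample)
```

In this sample, "very" fails the part-of-speech filter, "a" fails the length filter, and "run" fails the frequency filter.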
- TF-IDF uses the global statistic IDF (inverse document frequency) of words in the corpus and the TF (term frequency) of words in the current document to calculate the weights of words, and the words with the highest weights are used as keywords.
- TF (term frequency) is the number of occurrences of the specified word in the text.
- IDF (inverse document frequency) is the logarithm of the ratio of the total number of documents in the corpus to the number of documents containing the specified word.
- TF-IDF is the product of TF and IDF.
- The TF-IDF of each word in the document is calculated and used as the weight of the word for selecting keywords.
- There are at least two ways to extract keywords using TF-IDF. Method one, absolute value: all words in the set whose weight exceeds a fixed value are extracted as keywords. Method two, relative value: the top-ranked words by weight in the set are extracted as keywords.
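As an illustration of the definitions above, the following minimal sketch computes TF-IDF with TF as a raw count and IDF as log(N / df), then applies both selection methods. The function names, the toy corpus, and the absence of smoothing are assumptions made for the example; production libraries often use smoothed variants.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Return {word: weight} for one document.

    TF is the raw count of the word in doc_tokens. IDF is log(N / df),
    where N is the number of documents in corpus and df is the number
    of documents that contain the word.
    """
    n_docs = len(corpus)
    weights = {}
    for word, count in Counter(doc_tokens).items():
        df = sum(1 for doc in corpus if word in doc)
        weights[word] = count * math.log(n_docs / df) if df else 0.0
    return weights


def by_threshold(weights, min_weight):
    """Method one (absolute value): words whose weight exceeds a fixed value."""
    return [w for w, score in weights.items() if score > min_weight]


def top_k(weights, k):
    """Method two (relative value): the k highest-weighted words."""
    return sorted(weights, key=weights.get, reverse=True)[:k]


# Toy corpus of already-segmented documents (hypothetical data).
corpus = [
    ["phone", "chip", "phone"],
    ["chip", "talent"],
    ["chip", "market"],
]
weights = tf_idf(corpus[0], corpus)
```

Here "chip" appears in every document, so its IDF is log(3/3) = 0 and it is never selected, while "phone" gets weight 2 * log(3).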
- After the keywords in the unary set are extracted, the unary candidate keyword set is obtained, and the highest word frequency max_count among the keywords in this set is found. Words obtained by merging n-ary candidate keywords whose word frequency is less than max_count/4 are filtered.
- The filtering conditions can be tightened appropriately. For example, words obtained by merging n-ary candidate keywords whose word frequency is less than max_count/3 can be filtered, and so on.
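A possible sketch of this post-merge filter follows. The text does not say how the frequency of a merged keyword is measured, so counting contiguous occurrences in the segmented text is an assumption, as are the function names and sample data. The additional floor of 2 occurrences comes from claim 8.

```python
def count_occurrences(tokens, keyword):
    """Count contiguous occurrences of the tuple `keyword` in `tokens`."""
    n = len(keyword)
    return sum(
        1 for i in range(len(tokens) - n + 1)
        if tuple(tokens[i:i + n]) == keyword
    )


def filter_merged(merged, tokens, max_count, divisor=4, min_freq=2):
    """Drop merged keywords with frequency below max_count/divisor or min_freq.

    max_count is the highest word frequency in the unary candidate
    keyword set; a smaller divisor (for example 3) tightens the filter.
    """
    kept = []
    for keyword in merged:
        freq = count_occurrences(tokens, keyword)
        if freq >= max_count / divisor and freq >= min_freq:
            kept.append(keyword)
    return kept


# Hypothetical segmented text and merged ternary keywords.
tokens = ["Android", "operating", "system"] * 3 + ["core", "interest", "conflict"]
merged = [("Android", "operating", "system"), ("core", "interest", "conflict")]
kept = filter_merged(merged, tokens, max_count=8)
```

With max_count = 8 the threshold is 2, so the keyword seen three times is kept and the one seen once is dropped.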
- the method further includes: removing the n-ary candidate keywords included in the n+1-ary result keywords from the n-ary candidate keyword set to obtain an n-ary result keyword set.
- For example, the binary candidate keyword set is: {core interest, smart phone, conflict of interest, semiconductor field, Android operation, operating system, operating ecology, attracting talents, lack of talents}. The binary candidate keywords contained in the ternary result keywords are removed; for example, "core conflict of interest" contains "core interest" and "conflict of interest", so both are removed from the binary candidate keyword set. The resulting binary result keyword set is {smart phone, semiconductor field}.
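The removal step can be sketched as a containment check. This is an illustrative sketch that assumes keywords are tuples of unary words and that "contained" means appearing as a contiguous run inside a longer keyword; the function names and token order are hypothetical.

```python
def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def remove_subsumed(candidates, results):
    """Keep only the candidates not contained in any longer result keyword."""
    return [c for c in candidates if not any(contains(r, c) for r in results)]


binary_candidates = [
    ("core", "interest"),
    ("smart", "phone"),
    ("interest", "conflict"),
    ("semiconductor", "field"),
]
ternary_results = [("core", "interest", "conflict")]
binary_results = remove_subsumed(binary_candidates, ternary_results)
```

The same function applies one level down, removing unary candidates that are contained in binary candidate keywords.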
- The method further includes:
- when the word length of a keyword in the (n+1)-ary result keyword set exceeds the maximum word length, removing the first or last unary word of the keyword to obtain an optimized keyword; and matching the optimized keyword against the qualifier table and, if it matches, replacing the keyword in the (n+1)-ary result keyword set with the optimized keyword.
- The qualifier table can be customized according to actual needs, such as proper nouns of the input method, full names and abbreviations of various companies, etc.
- The maximum word length can be obtained through experience.
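A rough sketch of this length optimization is shown below. Several details are assumptions, because the text leaves them open: word length is measured as the total number of characters, dropping the first unary word is tried before dropping the last, and a keyword with no match in the qualifier table is left unchanged. The names `optimize` and `qualifiers` are hypothetical.

```python
def optimize(keyword, max_len, qualifiers):
    """Shorten an over-long keyword if the shortened form is in the table.

    keyword: tuple of unary words; qualifiers: set of allowed tuples.
    """
    if sum(len(word) for word in keyword) <= max_len:
        return keyword  # within the maximum word length, nothing to do
    # Assumed order: try dropping the first unary word, then the last.
    for trimmed in (keyword[1:], keyword[:-1]):
        if trimmed in qualifiers:
            return trimmed  # the optimized keyword replaces the original
    return keyword  # assumed fallback when nothing matches


qualifiers = {("operating", "system")}
shortened = optimize(("Android", "operating", "system"), 10, qualifiers)
unchanged = optimize(("core", "interest", "conflict"), 10, qualifiers)
```

The first call returns the table entry ("operating", "system"); the second finds no match and returns its input.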
- The step of "obtaining an n-ary candidate keyword set" includes:
- The text is segmented into an n-ary set, the noise in the set is filtered first, and then the keywords in the set are extracted to obtain the n-ary candidate keyword set.
- This step is similar to the step of obtaining the unary candidate keyword set. The difference lies in the noise filtering method: as n increases, the noise increases and the filtering method becomes more complicated.
- When n is 2, that is, after the text is segmented into a binary set, the steps of filtering noise include:
- The highest word frequency max_count in the unary candidate keyword set is obtained, and words in the binary set whose word frequency is less than or equal to max_count/5 are filtered.
- The max_count/5 threshold can be adjusted; it can also be max_count/6 or max_count/4.
- Words in the binary set whose word frequency is less than 2 are filtered.
- Each word in the binary set consists of two unary words, the preceding word and the following word (or the preceding unary word and the following unary word). With x denoting the minimum of the word frequencies of the preceding word and the following word in the unary set, words in the binary set whose word frequency is less than 2x/3 are filtered.
- Non-compliant words in the binary set should be filtered.
- Non-compliant words can be words with obvious grammatical errors (such as suffix words), words with prepositions (such as "tomorrow"), or unit words (such as "80 yuan").
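The binary noise filters can be combined into one pass, as in the sketch below. It applies the max_count/5 rule, the minimum frequency of 2 and the 2x/3 rule from claim 5, and a pluggable compliance check. The function names, the sample frequencies, and the digit-based compliance predicate are hypothetical; the text does not specify how non-compliant words are detected.

```python
def filter_bigrams(bigram_freq, unigram_freq, max_count,
                   is_compliant=lambda bigram: True):
    """Return the binary words that survive all noise filters.

    bigram_freq: {(preceding, following): frequency}
    unigram_freq: {word: frequency} for the unary set
    max_count: highest word frequency in the unary candidate keyword set
    """
    kept = {}
    for (front, back), freq in bigram_freq.items():
        if freq <= max_count / 5:
            continue  # too rare relative to the most frequent unary keyword
        if freq < 2:
            continue  # absolute minimum frequency
        x = min(unigram_freq.get(front, 0), unigram_freq.get(back, 0))
        if freq < 2 * x / 3:
            continue  # its parts mostly occur outside this combination
        if not is_compliant((front, back)):
            continue  # e.g. unit words such as "80 yuan"
        kept[(front, back)] = freq
    return kept


unigram_freq = {"core": 6, "interest": 10, "smart": 3, "phone": 3,
                "80": 3, "yuan": 3}
bigram_freq = {
    ("core", "interest"): 5,
    ("smart", "phone"): 3,
    ("interest", "core"): 3,
    ("80", "yuan"): 3,
    ("phone", "core"): 2,
}
kept = filter_bigrams(
    bigram_freq, unigram_freq, max_count=max(unigram_freq.values()),
    is_compliant=lambda bigram: not bigram[0].isdigit(),
)
```

In this sample, ("interest", "core") fails the 2x/3 rule (3 < 4), ("80", "yuan") fails the compliance check, and ("phone", "core") fails the max_count/5 rule (2 <= 2).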
- The n-ary candidate keyword set can also be obtained by merging (n-1)-ary candidate keywords:
- obtaining an (n-1)-ary candidate keyword set; and merging pairs of keywords in the (n-1)-ary candidate keyword set that contain the same (n-2)-ary word at different positions in the two keywords, to obtain the (n-1)-ary result keyword set, where n is a positive integer greater than 2.
- The unary candidate keywords contained in the binary candidate keywords are removed to obtain the unary result keyword set.
- The results of extracting the text keywords are: a set of unary result keywords, a set of binary result keywords, and a set of ternary result keywords.
- The length can be restricted.
- The keywords are optimized; for the optimization method, refer to the preceding text.
- The present invention also provides an electronic device, including a memory and a processor, where the memory stores a computer program that can run on the processor, and the steps of the method for automatically extracting text keywords described above are implemented when the processor executes the program.
- The present invention also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the method for automatically extracting text keywords are implemented.
Claims (10)
- A method for automatically extracting text keywords, characterized in that the method comprises: obtaining an n-ary candidate keyword set; and merging two keywords in the n-ary candidate keyword set that contain the same (n-1)-ary word, the (n-1)-ary word being at different positions in the two keywords, to obtain an (n+1)-ary result keyword set, where n is a positive integer greater than 1.
- The method for automatically extracting text keywords according to claim 1, characterized in that the method further comprises: removing, from the n-ary candidate keyword set, the n-ary candidate keywords contained in the (n+1)-ary result keywords, to obtain an n-ary result keyword set.
- The method for automatically extracting text keywords according to claim 1, characterized in that the method further comprises: when the word length of a keyword in the (n+1)-ary result keyword set is greater than the maximum word length, removing the first or last unary word of the keyword to obtain an optimized keyword; and matching the optimized keyword against the qualifier table and, if it matches, replacing the keyword in the (n+1)-ary result keyword set with the optimized keyword.
- The method for automatically extracting text keywords according to claim 1, characterized in that the step of "obtaining an n-ary candidate keyword set" comprises: segmenting the text into an n-ary set, first filtering the noise in the set, and then extracting the keywords in the set, to obtain the n-ary candidate keyword set.
- The method for automatically extracting text keywords according to claim 4, characterized in that, when n is 2, that is, after the text is segmented into a binary set, the step of filtering noise comprises: after segmenting the text into a unary set, first filtering the noise in the set, then extracting the keywords in the set, and obtaining the highest word frequency max_count among the keywords; filtering words in the binary set whose word frequency is less than or equal to max_count/5; filtering words in the binary set whose word frequency is less than 2; the words in the binary set comprising a preceding word and a following word, the minimum word frequency of the preceding word and the following word in the unary set being x, filtering words in the binary set whose word frequency is less than 2x/3; and filtering non-compliant words in the binary set.
- The method for automatically extracting text keywords according to claim 1, characterized in that the step of "obtaining an n-ary candidate keyword set" comprises: obtaining an (n-1)-ary candidate keyword set; and merging two keywords in the (n-1)-ary candidate keyword set that contain the same (n-2)-ary word, the (n-2)-ary word being at different positions in the two keywords, to obtain the (n-1)-ary result keyword set, where n is a positive integer greater than 2.
- The method for automatically extracting text keywords according to claim 1, characterized in that, after the keywords in the n-ary candidate keyword set are merged, noise filtering is performed to obtain the (n+1)-ary result keyword set.
- The method for automatically extracting text keywords according to claim 7, characterized in that the step of performing noise filtering after merging the keywords in the n-ary candidate keyword set comprises: after segmenting the text into a unary set, first filtering the noise in the set, then extracting the keywords in the set, and obtaining the highest word frequency max_count among the keywords; filtering words obtained by merging n-ary candidate keywords whose word frequency is less than max_count/4; and filtering words obtained by merging n-ary candidate keywords whose word frequency is less than 2.
- An electronic device, comprising a memory and a processor, the memory storing a computer program that can run on the processor, characterized in that the processor, when executing the program, implements the steps of the method for automatically extracting text keywords according to any one of claims 1-8.
- A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method for automatically extracting text keywords according to any one of claims 1-8.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910754155.7A CN110532551A (en) | 2019-08-15 | 2019-08-15 | Method, equipment and the storage medium that text key word automatically extracts |
CN201910754155.7 | 2019-08-15 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021027085A1 (en) | 2021-02-18 |
Family
ID=68663358
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/115115 WO2021027085A1 (en) | 2019-08-15 | 2019-11-01 | Method and device for automatically extracting text keyword, and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110532551A (en) |
WO (1) | WO2021027085A1 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111488727B (en) * | 2020-03-24 | 2023-09-19 | 南阳柯丽尔科技有限公司 | Word file parsing method, word file parsing apparatus, and computer-readable storage medium |
CN116978384B (en) * | 2023-09-25 | 2024-01-02 | 成都市青羊大数据有限责任公司 | Public security integrated big data management system |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004006124A2 (en) * | 2002-07-03 | 2004-01-15 | Word Data Corp. | Text-representation, text-matching and text-classification code, system and method |
CN101196904A (en) * | 2007-11-09 | 2008-06-11 | 清华大学 | News keyword abstraction method based on word frequency and multi-component grammar |
CN105956158A (en) * | 2016-05-17 | 2016-09-21 | 清华大学 | Automatic extraction method of network neologism on the basis of mass microblog texts and use information |
CN106557459A (en) * | 2015-09-24 | 2017-04-05 | 北京神州泰岳软件股份有限公司 | A kind of method and apparatus that neologisms are extracted from work order |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1280757C (en) * | 2000-11-17 | 2006-10-18 | 意蓝科技股份有限公司 | Method for automatically-searching key word from file and its system |
CN102375863A (en) * | 2010-08-27 | 2012-03-14 | 北京四维图新科技股份有限公司 | Method and device for keyword extraction in geographic information field |
CN102411563B (en) * | 2010-09-26 | 2015-06-17 | 阿里巴巴集团控股有限公司 | Method, device and system for identifying target words |
CN103678318B (en) * | 2012-08-31 | 2016-12-21 | 富士通株式会社 | Multi-word unit extraction method and equipment and artificial neural network training method and equipment |
CN103092979B (en) * | 2013-01-31 | 2016-01-27 | 中国科学院对地观测与数字地球科学中心 | The disposal route of remotely-sensed data retrieval natural language |
CN104216875B (en) * | 2014-09-26 | 2017-05-03 | 中国科学院自动化研究所 | Automatic microblog text abstracting method based on unsupervised key bigram extraction |
CN105426539B (en) * | 2015-12-23 | 2018-12-18 | 成都云数未来信息科学有限公司 | A kind of lucene Chinese word cutting method based on dictionary |
CN107665191B (en) * | 2017-10-19 | 2020-08-04 | 中国人民解放军陆军工程大学 | Private protocol message format inference method based on extended prefix tree |
- 2019-08-15 CN CN201910754155.7A patent/CN110532551A/en active Pending
- 2019-11-01 WO PCT/CN2019/115115 patent/WO2021027085A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN110532551A (en) | 2019-12-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19941090 Country of ref document: EP Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 19941090 Country of ref document: EP Kind code of ref document: A1 |
| 32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 210922) |