WO2021027085A1 - Method and device for automatically extracting text keyword, and storage medium - Google Patents

Publication number
WO2021027085A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2019/115115
Other languages
French (fr)
Chinese (zh)
Inventor
龚朝辉
陶予祺
童刚
Original Assignee
苏州朗动网络科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州朗动网络科技有限公司
Publication of WO2021027085A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions

Definitions

  • The present invention relates to the field of Internet technology, and in particular to a method, device, and storage medium for automatically extracting text keywords.
  • Automatic keyword extraction automatically extracts topical or important words or phrases from a text. It is a basic and necessary step in many text mining tasks, such as text retrieval and text summarization.
  • Document keywords represent the topic and key content of a document and are the smallest unit for understanding document content.
  • The statistical method uses statistical information about the words in a document to extract its keywords. This method is relatively simple, requires no training data, and generally requires no external knowledge base, so extraction is fast, and it is often used in scenarios that require real-time computation.
  • The first step in extracting keywords from Chinese natural-language text is to segment the text, build a vocabulary, and then extract keywords from the vocabulary.
  • With this approach, keywords can only be words in the vocabulary. General-purpose word segmentation tools segment at a relatively fine granularity (the noise this introduces is relatively small and easy to filter), but fine segmentation often fragments semantics. For example, "China Internet Conference" is split into "China", "Internet", and "Conference"; if "China Internet Conference" is the actual keyword, it is not in the vocabulary, so it is discarded and never extracted as a keyword. Conversely, if the segmentation granularity is coarse (for example, ternary or above), more noise is introduced and it is harder to filter, so many noisy keywords are extracted.
  • The purpose of the present invention is to provide a method, device, and storage medium for automatically extracting text keywords.
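  • An embodiment of the present invention provides a method for automatically extracting text keywords.
  • The method includes: obtaining an n-ary candidate keyword set; and merging two keywords in the n-ary candidate keyword set that contain the same (n-1)-ary word at different positions within the keywords, to obtain an (n+1)-ary result keyword set, where n is a positive integer greater than 1.
  • The method further includes: removing from the n-ary candidate keyword set the n-ary candidate keywords contained in the (n+1)-ary result keywords, to obtain an n-ary result keyword set.
  • The method further includes: when the length in characters of a keyword in the (n+1)-ary result keyword set is greater than the maximum word length, removing the first or the last unary word of the keyword to obtain an optimized keyword; and matching the optimized keyword against the qualifier table and, if it matches, replacing the keyword in the (n+1)-ary result keyword set with the optimized keyword.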
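  • The step of "obtaining an n-ary candidate keyword set" includes: segmenting the text into an n-ary set, first filtering the noise in the set, and then extracting the keywords in the set to obtain the n-ary candidate keyword set.
  • When n is 2, that is, after the text is segmented into a binary set, the steps of filtering noise include: obtaining the highest word frequency max_count among the keywords extracted from the unary set; filtering out words in the binary set whose frequency is less than or equal to max_count/5; filtering out words in the binary set whose frequency is less than 2; filtering out words in the binary set whose frequency is less than 2x/3; and filtering out non-compliant words in the binary set.
  • Each word in the binary set includes a preceding word and a following word, and x is the smaller of the frequencies of the preceding word and the following word in the unary set.
  • Alternatively, the step of "obtaining an n-ary candidate keyword set" includes: obtaining an (n-1)-ary candidate keyword set; and merging two keywords in the (n-1)-ary candidate keyword set that contain the same (n-2)-ary word at different positions within the keywords, to obtain an (n-1)-ary result keyword set, where n is a positive integer greater than 2.
  • After the keywords in the n-ary candidate keyword set are merged, noise filtering is performed to obtain the (n+1)-ary result keyword set.
  • The step of performing noise filtering after merging the keywords in the n-ary candidate keyword set includes: obtaining the highest word frequency max_count among the keywords extracted from the unary set; filtering out merged words whose frequency is less than max_count/4; and filtering out merged words whose frequency is less than 2.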
  • An embodiment of the present invention provides an electronic device including a memory and a processor. The memory stores a computer program that can run on the processor, and the processor implements the steps of the above method for automatically extracting text keywords when it executes the program.
  • An embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the steps of the above method for automatically extracting text keywords are implemented.
  • Compared with the prior art, the technical solution of the present invention merges the keywords extracted after fine-grained segmentation, so that the semantics of split keywords are completed and the semantic incompleteness caused by overly fine word segmentation is avoided.
  • FIG. 1 is a schematic flowchart of a method for automatically extracting text keywords according to an embodiment of the present invention.
  • FIG. 2 is a schematic flowchart of a method for automatically extracting text keywords according to a specific embodiment of the present invention.
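  • A keyword in the present invention can be a single word; a word is the smallest language unit that can be used independently, such as flower, bird, person, young, and language.
  • A keyword can also be a phrase; a phrase is a grammatical unit formed by combining two or more words, such as 优化方案 (optimization plan), 知识产权 (intellectual property), and 中国互联网大会 (China Internet Conference).
  • An n-ary word (n is a positive integer) is a word or phrase made up of n words. A unary word is a single word.
  • A binary word is a phrase made up of two words, that is, a binary word includes two unary words, and so on.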
  • The method for automatically extracting text keywords of the present invention includes: obtaining an n-ary candidate keyword set; and merging two keywords in the set that contain the same (n-1)-ary word at different positions within the keywords, to obtain an (n+1)-ary result keyword set, where n is a positive integer greater than 1.
  • There are several ways to obtain the n-ary candidate keyword set, which are described in detail later. After the n-ary candidate keyword set is obtained, the keywords in the set are compared pairwise to find two keywords that contain the same (n-1)-ary word at different positions; these are merged into an (n+1)-ary result keyword, yielding the (n+1)-ary result keyword set.
  • Taking n = 2 as an example, suppose the obtained binary candidate keyword set is {核心利益 (core interests), 智能手机 (smartphone), 利益冲突 (conflict of interest), 半导体领域 (semiconductor field), 安卓操作 (Android operation), 操作系统 (operating system), 操作生态 (operating ecosystem), 吸引人才 (attracting talent), 人才不足 (talent shortage)}. Because 核心利益 and 利益冲突 both contain the unary word 利益 (interests), and 利益 occupies a different position in each of the two keywords, the two keywords can be merged in order; the merged ternary result keyword is 核心利益冲突 (core conflict of interest). By analogy, the ternary result keyword set is {核心利益冲突 (core conflict of interest), 安卓操作系统 (Android operating system), 安卓操作生态 (Android operating ecosystem), 吸引人才不足 (insufficient attraction of talent)}.
  • The technical solution of the present invention merges the keywords extracted after fine-grained segmentation, so that the semantics of split keywords are completed and the semantic incompleteness caused by overly fine word segmentation is avoided.
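  • After the keywords in the n-ary candidate keyword set are merged, noise filtering is first performed to obtain the (n+1)-ary result keyword set.
  • Noise filtering means removing keywords that violate grammatical rules or otherwise do not meet requirements.
  • In a preferred embodiment, the step of performing noise filtering after merging the keywords in the n-ary candidate keyword set includes: obtaining the highest word frequency max_count among the keywords extracted from the unary set; filtering out merged words whose frequency is less than max_count/4; and filtering out merged words whose frequency is less than 2.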
  • Word segmentation tools such as jieba, hanlp, stanfordNLP, and thulac can be used to segment the text from which keywords are to be extracted into a unary set (also called a unigram set). Noise filtering is then performed on the words in the unary set; specifically, the words can be filtered by part of speech, word frequency, and word length.
  • Part-of-speech filtering removes adjectives, adverbs, prepositions, and the like, and retains only nouns and verbs.
  • Word frequency filtering removes words whose frequency in the text is greater than the maximum word frequency or less than the minimum word frequency.
  • Word length filtering removes words in the text whose length is greater than the maximum length or less than the minimum length.
  • For word frequency and word length filtering, the maximum word frequency, minimum word frequency, maximum length, minimum length, and so on are derived from empirical data collected over tens of thousands to tens of millions of samples, and filtering is then applied.
  • TF-IDF uses a word's global statistic in the corpus, IDF (inverse document frequency), and its TF (term frequency) in the current document to compute the word's weight; the words with the highest weights are taken as keywords.
  • TF (term frequency) is the number of times the specified word occurs in the text.
  • IDF (inverse document frequency) is the logarithm of the ratio of the total number of documents in the corpus to the number of documents containing the specified word.
  • TF-IDF is the product of TF and IDF.
  • The TF-IDF of each word in the document is computed and used as the word's weight for keyword selection.
  • There are at least two ways to extract keywords using TF-IDF. Method one, absolute value: every word in the set whose weight exceeds a fixed value is extracted as a keyword. Method two, relative value: the top-ranked words in the set by weight are extracted as keywords.
  • After the keywords in the unary set are extracted, the unary candidate keyword set is obtained, and the highest word frequency max_count among the keywords in this set is found. Merged words formed from n-ary candidate keywords whose frequency is less than max_count/4 are filtered out.
  • If more (n+1)-ary keywords are wanted, the filtering condition can be relaxed, for example by filtering out merged words whose frequency is less than max_count/5. If fewer (n+1)-ary keywords are wanted, the filtering condition can be tightened, for example by filtering out merged words whose frequency is less than max_count/3, and so on.
  • The method further includes: removing from the n-ary candidate keyword set the n-ary candidate keywords contained in the (n+1)-ary result keywords, to obtain an n-ary result keyword set.
  • Continuing the example above, the binary candidate keyword set is {核心利益 (core interests), 智能手机 (smartphone), 利益冲突 (conflict of interest), 半导体领域 (semiconductor field), 安卓操作 (Android operation), 操作系统 (operating system), 操作生态 (operating ecosystem), 吸引人才 (attracting talent), 人才不足 (talent shortage)}. The binary candidate keywords contained in the ternary result keywords are removed. For example, 核心利益冲突 (core conflict of interest) contains 核心利益 and 利益冲突, so these two are removed from the binary candidate keyword set. By analogy, the binary result keyword set is {智能手机 (smartphone), 半导体领域 (semiconductor field)}.
  • The method further includes: when the length in characters of a keyword in the (n+1)-ary result keyword set is greater than the maximum word length, removing the first or the last unary word of the keyword to obtain an optimized keyword.
  • The optimized keyword is matched against the qualifier table; if it matches, the keyword in the (n+1)-ary result keyword set is replaced with the optimized keyword.
  • The qualifier table can be customized as needed; for example, it can contain the proper nouns from an input method and the full and abbreviated names of companies.
  • The maximum word length can be determined empirically.
  • The step of "obtaining an n-ary candidate keyword set" includes:
  • The text is segmented into an n-ary set, the noise in the set is filtered first, and then the keywords in the set are extracted to obtain the n-ary candidate keyword set.
  • This step is similar to the step of obtaining the unary candidate keyword set. The difference lies in the noise filtering: as the arity increases, the noise increases and the filtering becomes more complicated.
  • The steps of filtering noise include:
  • Obtain the highest word frequency max_count in the unary candidate keyword set, and filter out the words in the binary set whose frequency is less than or equal to max_count/5.
  • The threshold max_count/5 can be adjusted; max_count/6 or max_count/4 can also be used.
  • Each word in the binary set includes two unary words: the preceding word and the following word (that is, the preceding unary word and the following unary word).
  • Non-compliant words in the binary set are filtered out.
  • Non-compliant words can be words with obvious grammatical errors (such as suffix words), words with prepositions (such as "tomorrow"), or unit words (such as "80 yuan").
  • The n-ary candidate keyword set can also be obtained by merging (n-1)-ary candidate keywords:
  • Two keywords in the (n-1)-ary candidate keyword set that contain the same (n-2)-ary word at different positions within the keywords are merged to obtain the (n-1)-ary result keyword set, where n is a positive integer greater than 2.
  • The unary candidate keywords contained in the binary candidate keywords are removed to obtain the unary result keyword set.
  • The results of extracting the text keywords are: a unary result keyword set, a binary result keyword set, and a ternary result keyword set.
  • The length can be restricted.
  • The keywords are optimized; for the optimization method, see the description above.
  • The present invention also provides an electronic device including a memory and a processor. The memory stores a computer program that can run on the processor, and the processor implements the steps of the method for automatically extracting text keywords described above when it executes the program.
  • The present invention also provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the steps of the method for automatically extracting text keywords are implemented.

Abstract

A method and device for automatically extracting text keywords, and a storage medium. The method comprises: acquiring an n-ary candidate keyword set; and merging two keywords in the n-ary candidate keyword set that contain the same (n-1)-ary word at different positions within the keywords, so as to obtain an (n+1)-ary result keyword set, where n is a positive integer greater than 1. By merging the keywords extracted after fine-grained segmentation, the method completes the semantics of split keywords and avoids the semantic incompleteness caused by overly fine word segmentation.

Description

Method, device, and storage medium for automatically extracting text keywords
Technical field
The present invention relates to the field of Internet technology, and in particular to a method, device, and storage medium for automatically extracting text keywords.
Background art
Automatic keyword extraction automatically extracts topical or important words or phrases from a text. It is a basic and necessary step in many text mining tasks, such as text retrieval and text summarization. Document keywords represent the topic and key content of a document and are the smallest unit for understanding document content.
There are many methods of automatic keyword extraction; more than ten mainstream ones exist, including methods based on linguistic analysis, statistical methods, and machine learning methods. The statistical method uses statistical information about the words in a document to extract its keywords. This method is relatively simple, requires no training data, and generally requires no external knowledge base, so extraction is fast, and it is often used in scenarios that require real-time computation.
The first step in extracting keywords from Chinese natural-language text is to segment the text, build a vocabulary, and then extract keywords from the vocabulary. With this approach, keywords can only be words in the vocabulary. General-purpose word segmentation tools segment at a relatively fine granularity (the noise this introduces is relatively small and easy to filter), but fine segmentation often fragments semantics. For example, 中国互联网大会 (China Internet Conference) is split into 中国 (China), 互联网 (Internet), and 大会 (Conference); if 中国互联网大会 is the actual keyword, it is not in the vocabulary, so it is discarded and never extracted as a keyword. Conversely, if the segmentation granularity is coarse (for example, ternary or above), more noise is introduced and it is harder to filter, so many noisy keywords are extracted.
Summary of the invention
The purpose of the present invention is to provide a method, device, and storage medium for automatically extracting text keywords.
To achieve one of the above objects, an embodiment of the present invention provides a method for automatically extracting text keywords, the method including:
obtaining an n-ary candidate keyword set;
merging two keywords in the n-ary candidate keyword set that contain the same (n-1)-ary word at different positions within the keywords, to obtain an (n+1)-ary result keyword set, where n is a positive integer greater than 1.
As a further improvement of an embodiment of the present invention, the method further includes:
removing from the n-ary candidate keyword set the n-ary candidate keywords contained in the (n+1)-ary result keywords, to obtain an n-ary result keyword set.
As a further improvement of an embodiment of the present invention, the method further includes:
when the length in characters of a keyword in the (n+1)-ary result keyword set is greater than the maximum word length, removing the first or the last unary word of the keyword to obtain an optimized keyword;
matching the optimized keyword against the qualifier table and, if it matches, replacing the keyword in the (n+1)-ary result keyword set with the optimized keyword.
As a further improvement of an embodiment of the present invention, the step of "obtaining an n-ary candidate keyword set" includes:
segmenting the text into an n-ary set, first filtering the noise in the set, and then extracting the keywords in the set to obtain the n-ary candidate keyword set.
As a further improvement of an embodiment of the present invention, when n is 2, that is, after the text is segmented into a binary set, the steps of filtering noise include:
after segmenting the text into a unary set, first filtering the noise in that set, then extracting the keywords in that set, and obtaining the highest word frequency max_count among those keywords;
filtering out words in the binary set whose frequency is less than or equal to max_count/5;
filtering out words in the binary set whose frequency is less than 2;
each word in the binary set includes a preceding word and a following word; with x denoting the smaller of the frequencies of the preceding word and the following word in the unary set, filtering out words in the binary set whose frequency is less than 2x/3;
filtering out non-compliant words in the binary set.
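The binary-set filtering steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation described in the patent: the function name `filter_bigrams`, the `divisor` parameter, and the `is_compliant` callback are hypothetical, frequencies are counted over the raw token list, and `max_count` is simply taken as the highest unigram frequency in the sample.

```python
from collections import Counter

def filter_bigrams(tokens, max_count, divisor=5, is_compliant=lambda bigram: True):
    """Return {bigram: frequency} for the bigrams that survive all four filters.

    tokens: the segmented text as a list of unary words (assumed input format).
    max_count: highest frequency among the unary keywords.
    is_compliant: hypothetical hook for the "non-compliant word" check.
    """
    unigram_freq = Counter(tokens)
    bigram_freq = Counter(zip(tokens, tokens[1:]))
    kept = {}
    for (front, back), freq in bigram_freq.items():
        # x is the smaller unigram frequency of the preceding and following word.
        x = min(unigram_freq[front], unigram_freq[back])
        if freq <= max_count / divisor:      # frequency <= max_count/5
            continue
        if freq < 2:                         # occurs only once
            continue
        if freq < 2 * x / 3:                 # frequency < 2x/3
            continue
        if not is_compliant((front, back)):  # non-compliant word
            continue
        kept[(front, back)] = freq
    return kept

tokens = ["操作", "系统", "市场", "操作", "系统", "人才", "操作", "系统"]
max_count = max(Counter(tokens).values())  # stand-in for the unary keyword maximum
surviving = filter_bigrams(tokens, max_count)
print(surviving)
```

Only the bigram (操作, 系统) occurs more than once in this sample, so it is the only one kept. A smaller `divisor` tightens the first filter; a larger one relaxes it.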
As a further improvement of an embodiment of the present invention, the step of "obtaining an n-ary candidate keyword set" includes:
obtaining an (n-1)-ary candidate keyword set;
merging two keywords in the (n-1)-ary candidate keyword set that contain the same (n-2)-ary word at different positions within the keywords, to obtain an (n-1)-ary result keyword set, where n is a positive integer greater than 2.
As a further improvement of an embodiment of the present invention, after the keywords in the n-ary candidate keyword set are merged, noise filtering is performed to obtain the (n+1)-ary result keyword set.
As a further improvement of an embodiment of the present invention, the step of performing noise filtering after merging the keywords in the n-ary candidate keyword set includes:
after segmenting the text into a unary set, first filtering the noise in that set, then extracting the keywords in that set, and obtaining the highest word frequency max_count among those keywords;
filtering out merged words, formed from n-ary candidate keywords, whose frequency is less than max_count/4;
filtering out merged words, formed from n-ary candidate keywords, whose frequency is less than 2.
To achieve one of the above objects, an embodiment of the present invention provides an electronic device including a memory and a processor. The memory stores a computer program that can run on the processor, and the processor implements the steps of the above method for automatically extracting text keywords when it executes the program.
To achieve one of the above objects, an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the steps of the above method for automatically extracting text keywords are implemented.
Compared with the prior art, the technical solution of the present invention merges the keywords extracted after fine-grained segmentation, so that the semantics of split keywords are completed and the semantic incompleteness caused by overly fine word segmentation is avoided.
Description of the drawings
FIG. 1 is a schematic flowchart of a method for automatically extracting text keywords according to an embodiment of the present invention.
FIG. 2 is a schematic flowchart of a method for automatically extracting text keywords according to a specific embodiment of the present invention.
Detailed description
The present invention is described in detail below with reference to the specific embodiments shown in the drawings. These embodiments do not limit the present invention, however, and structural, methodological, or functional changes made by those of ordinary skill in the art on the basis of these embodiments all fall within the scope of protection of the present invention.
It should be noted that a keyword in the present invention can be a single word; a word is the smallest language unit that can be used independently, such as 花 (flower), 鸟 (bird), 人 (person), 年轻 (young), and 语言 (language). A keyword can also be a phrase; a phrase is a grammatical unit formed by combining two or more words, such as 优化方案 (optimization plan), 知识产权 (intellectual property), 中国互联网大会 (China Internet Conference), and 中华全国专利代理人协会 (All-China Patent Agents Association). Accordingly, an n-ary word (n is a positive integer) in the present invention is a word or phrase made up of n words: a unary word is a single word, a binary word is a phrase made up of two words, that is, a binary word includes two unary words, and so on.
As shown in FIG. 1, the method for automatically extracting text keywords of the present invention includes:
obtaining an n-ary candidate keyword set;
merging two keywords in the n-ary candidate keyword set that contain the same (n-1)-ary word at different positions within the keywords, to obtain an (n+1)-ary result keyword set, where n is a positive integer greater than 1.
In the present invention, there are several ways to obtain the n-ary candidate keyword set, which are described in detail later. After the n-ary candidate keyword set is obtained, the keywords in the set are compared pairwise to find two keywords that contain the same (n-1)-ary word at different positions; these are merged into an (n+1)-ary result keyword, yielding the (n+1)-ary result keyword set.
The method is explained below with a concrete example. Taking n = 2, suppose the obtained binary candidate keyword set is {核心利益 (core interests), 智能手机 (smartphone), 利益冲突 (conflict of interest), 半导体领域 (semiconductor field), 安卓操作 (Android operation), 操作系统 (operating system), 操作生态 (operating ecosystem), 吸引人才 (attracting talent), 人才不足 (talent shortage)}. Because 核心利益 and 利益冲突 both contain the unary word 利益 (interests), and 利益 occupies a different position in each of the two keywords, the two keywords can be merged in order; the merged ternary result keyword is 核心利益冲突 (core conflict of interest). By analogy, the ternary result keyword set is {核心利益冲突 (core conflict of interest), 安卓操作系统 (Android operating system), 安卓操作生态 (Android operating ecosystem), 吸引人才不足 (insufficient attraction of talent)}.
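The merge step in this example can be sketched in Python. The sketch assumes one particular reading of "the same (n-1)-ary word at different positions": the trailing n-1 words of one keyword equal the leading n-1 words of the other, which is what the worked example shows. The function name `merge_candidates` and the representation of keywords as tuples of unary words are illustrative choices, not taken from the patent.

```python
from itertools import permutations

def merge_candidates(candidates):
    """Merge n-ary keywords pairwise into (n+1)-ary keywords.

    Two keywords a and b are merged when the last n-1 words of a
    equal the first n-1 words of b (assumed overlap rule).
    """
    merged = []
    for a, b in permutations(candidates, 2):
        if a[1:] == b[:-1]:
            result = a + b[-1:]
            if result not in merged:   # keep the first occurrence only
                merged.append(result)
    return merged

# Binary candidate keywords from the example, each as a tuple of unary words.
bigrams = [
    ("核心", "利益"), ("智能", "手机"), ("利益", "冲突"),
    ("半导体", "领域"), ("安卓", "操作"), ("操作", "系统"),
    ("操作", "生态"), ("吸引", "人才"), ("人才", "不足"),
]
trigrams = ["".join(t) for t in merge_candidates(bigrams)]
print(trigrams)
# ['核心利益冲突', '安卓操作系统', '安卓操作生态', '吸引人才不足']
```

The pairwise comparison is quadratic in the number of candidates, which is usually acceptable because the candidate set has already been reduced by noise filtering and keyword extraction.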
The technical solution of the present invention merges the keywords extracted after fine-grained segmentation, so that the semantics of split keywords are completed and the semantic incompleteness caused by overly fine word segmentation is avoided.
Further, after the keywords in the n-ary candidate keyword set are merged, noise filtering is first performed to obtain the (n+1)-ary result keyword set.
Noise filtering means removing keywords that violate grammatical rules or otherwise do not meet requirements.
In a preferred embodiment, the step of performing noise filtering after merging the keywords in the n-ary candidate keyword set includes:
after segmenting the text from which keywords are to be extracted into a unary set, first filtering the noise in that set, then extracting the keywords in that set, and obtaining the highest word frequency max_count among those keywords;
filtering out merged words, formed from n-ary candidate keywords, whose frequency is less than max_count/4;
filtering out merged words, formed from n-ary candidate keywords, whose frequency is less than 2.
In this embodiment, word segmentation tools such as jieba, hanlp, stanfordNLP, and thulac can be used to segment the text from which keywords are to be extracted into a unary set (also called a unigram set). Noise filtering is then performed on the words in the unary set; specifically, the words can be filtered by part of speech, word frequency, and word length. Part-of-speech filtering removes adjectives, adverbs, prepositions, and the like, and retains only nouns and verbs. Word frequency filtering removes words whose frequency in the text is greater than the maximum word frequency or less than the minimum word frequency. Word length filtering removes words in the text whose length is greater than the maximum length or less than the minimum length. For word frequency and word length filtering, the maximum word frequency, minimum word frequency, maximum length, minimum length, and so on are derived from empirical data collected over tens of thousands to tens of millions of samples, and filtering is then applied.
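The three unary noise filters can be sketched as below. To stay self-contained, the sketch takes already-segmented `(word, part-of-speech)` pairs as input instead of calling a segmentation library. The tag convention (tags beginning with `n` for nouns and `v` for verbs) and all threshold values are placeholders chosen for illustration; the patent derives its thresholds from empirical data and does not state them.

```python
from collections import Counter

def filter_unigrams(tagged_tokens, min_freq=1, max_freq=100,
                    min_len=2, max_len=6, keep_pos=("n", "v")):
    """Keep tokens that pass part-of-speech, frequency, and length filters.

    tagged_tokens: list of (word, pos_tag) pairs (assumed input format).
    keep_pos: tag prefixes to retain (nouns and verbs in this sketch).
    """
    freq = Counter(word for word, _ in tagged_tokens)
    kept = []
    for word, pos in tagged_tokens:
        if not pos.startswith(keep_pos):             # part-of-speech filter
            continue
        if not min_freq <= freq[word] <= max_freq:   # word frequency filter
            continue
        if not min_len <= len(word) <= max_len:      # word length filter
            continue
        kept.append(word)
    return kept

tagged = [("手机", "n"), ("很", "d"), ("好", "a"), ("发展", "v"),
          ("的", "u"), ("人", "n"), ("手机", "n")]
unigrams = filter_unigrams(tagged)
print(unigrams)
# ['手机', '发展', '手机']
```

Repeated tokens are kept in the output so that later steps can still count word frequencies.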
After the noise in the unary set is filtered out, a keyword extraction algorithm, such as TF-IDF or TextRank, is used to extract the keywords in the unary set. The present invention preferably uses the TF-IDF algorithm. TF-IDF uses a word's global statistic in the corpus, IDF (inverse document frequency), and its TF (term frequency) in the current document to compute the word's weight; the words with the highest weights are taken as keywords. TF is the number of times the specified word occurs in the text. IDF is the logarithm of the ratio of the total number of documents in the corpus to the number of documents containing the specified word. TF-IDF is the product of TF and IDF. The TF-IDF of each word in the document is computed and used as the word's weight for keyword selection. There are at least two ways to extract keywords using TF-IDF. Method one, absolute value: every word in the set whose weight exceeds a fixed value is extracted as a keyword. Method two, relative value: the top-ranked words in the set by weight are extracted as keywords.
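A minimal Python sketch of the TF-IDF weighting and the two selection modes follows. It uses the definitions given in the text (raw count for TF, logarithm of the document ratio for IDF) with the natural logarithm, which the text does not specify. The function names are illustrative, and the sketch assumes the document being scored is itself part of the corpus, so every word appears in at least one document.

```python
import math
from collections import Counter

def tf_idf_weights(doc_tokens, corpus):
    """Weight each word in doc_tokens by TF * IDF.

    TF  = number of occurrences of the word in the document.
    IDF = log(total documents / documents containing the word).
    Assumes doc_tokens is one of the documents in corpus.
    """
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for word, count in tf.items():
        doc_freq = sum(1 for doc in corpus if word in doc)
        weights[word] = count * math.log(n_docs / doc_freq)
    return weights

def select_absolute(weights, threshold):
    """Method one: keep every word whose weight exceeds a fixed value."""
    return [w for w, score in weights.items() if score > threshold]

def select_top_k(weights, k):
    """Method two: keep the k highest-weighted words."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [w for w, _ in ranked[:k]]

corpus = [["手机", "芯片", "手机"], ["芯片", "市场"], ["市场", "人才"]]
weights = tf_idf_weights(corpus[0], corpus)
print(select_absolute(weights, 1.0), select_top_k(weights, 1))
```

In this toy corpus 手机 occurs twice in the first document and in no other document, so it receives the highest weight under both selection modes.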
After the keywords in the unary set have been extracted, the unary candidate keyword set is obtained. The highest word frequency among the keywords in this set, max_count, is found, and merged n-ary candidate keywords whose word frequency is less than max_count/4 are filtered out. It should be noted that if more (n+1)-ary keywords are wanted, the filtering condition can be relaxed, for example by filtering out only merged words whose frequency is less than max_count/5; if fewer (n+1)-ary keywords are wanted, the condition can be tightened, for example by filtering out merged words whose frequency is less than max_count/3, and so on. In addition, words that were combined by chance and occur only once must also be filtered out, that is, merged n-ary candidate keywords whose word frequency is less than 2.
After noise filtering, an (n+1)-ary result keyword set with little noise and relatively high accuracy is obtained.
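The frequency filter applied to merged keywords can be sketched as below. The function name, the `divisor` parameter, and the sample counts are hypothetical; how the frequency of a merged keyword is counted is not specified in this passage and is simply taken as given here.

```python
def filter_merged(merged_counts, max_count, divisor=4, min_count=2):
    """Keep merged keywords that are frequent enough.

    merged_counts: {merged keyword (tuple of unary words): word frequency}
    max_count:     highest word frequency in the unary candidate keyword set
    divisor:       4 by default; 5 relaxes the filter, 3 tightens it
    min_count:     drops chance combinations that occur only once
    """
    threshold = max_count / divisor
    return {
        keyword: count
        for keyword, count in merged_counts.items()
        if count >= threshold and count >= min_count
    }


# Hypothetical counts for three merged ternary keywords.
merged_counts = {
    ("core", "interests", "conflict"): 4,
    ("android", "operating", "system"): 3,
    ("attract", "talent", "shortage"): 1,
}
default_result = filter_merged(merged_counts, max_count=12)
tight_result = filter_merged(merged_counts, max_count=12, divisor=3)
```

With max_count = 12 the default threshold is 3, so the first two keywords survive; tightening the divisor to 3 raises the threshold to 4 and keeps only the first.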
Further, the method also includes: removing, from the n-ary candidate keyword set, the n-ary candidate keywords contained in the (n+1)-ary result keywords, to obtain the n-ary result keyword set.
Continuing with the example above, the binary candidate keyword set is {core interests (核心利益), smartphone (智能手机), conflict of interest (利益冲突), semiconductor field (半导体领域), Android operating (安卓操作), operating system (操作系统), operating ecosystem (操作生态), attracting talent (吸引人才), talent shortage (人才不足)}. The binary candidate keywords contained in the ternary result keywords are removed. For example, "core conflict of interest" (核心利益冲突) contains "core interests" and "conflict of interest", so both are removed from the binary candidate keyword set, and so on. The resulting binary result keyword set is {smartphone, semiconductor field}.
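The removal step in this example can be sketched as follows, representing each keyword as a tuple of unary words. Only the ternary keyword 核心利益冲突 is stated explicitly in the text; the other three ternary results used below are an assumption inferred from "and so on", and the function names are hypothetical.

```python
def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run of words inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def remove_contained(candidates, results):
    """Drop every n-ary candidate that is contained in an (n+1)-ary result."""
    return {c for c in candidates if not any(contains(r, c) for r in results)}


binary_candidates = {
    ("核心", "利益"),    # core interests
    ("智能", "手机"),    # smartphone
    ("利益", "冲突"),    # conflict of interest
    ("半导体", "领域"),  # semiconductor field
    ("安卓", "操作"),    # Android operating
    ("操作", "系统"),    # operating system
    ("操作", "生态"),    # operating ecosystem
    ("吸引", "人才"),    # attracting talent
    ("人才", "不足"),    # talent shortage
}
# Assumed ternary result keywords (only the first is given in the text).
ternary_results = {
    ("核心", "利益", "冲突"),
    ("安卓", "操作", "系统"),
    ("安卓", "操作", "生态"),
    ("吸引", "人才", "不足"),
}
binary_results = remove_contained(binary_candidates, ternary_results)
```

Under these assumptions only 智能手机 (smartphone) and 半导体领域 (semiconductor field) remain, which matches the binary result keyword set given in the text.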
As keywords are merged and grow longer, the error rate also rises. In a preferred embodiment, the method further includes:
when the character length of a keyword in the (n+1)-ary result keyword set exceeds the maximum character length, removing the first or the last unary word of the keyword to obtain an optimized keyword;
matching the optimized keyword against a qualifier table and, if it matches, replacing the keyword in the (n+1)-ary result keyword set with the optimized keyword.
The qualifier table can be customized as needed; for example, it may contain the proper nouns of an input method's lexicon and the full and abbreviated names of companies. The maximum character length can be determined empirically.
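A minimal sketch of this length-based optimization is given below. The function name, the order in which the two trimmed forms are tried, and the decision to keep the original keyword when neither trimmed form matches are assumptions; the text does not specify them.

```python
def optimize_keyword(keyword, max_chars, qualifier_table):
    """Shorten an over-long keyword if a trimmed form is in the qualifier table.

    keyword:         tuple of unary words
    max_chars:       maximum character length (chosen empirically)
    qualifier_table: set of accepted strings, e.g. proper nouns or company names
    """
    if sum(len(word) for word in keyword) <= max_chars:
        return keyword
    # Assumption: try dropping the first unary word, then the last one.
    for trimmed in (keyword[1:], keyword[:-1]):
        if "".join(trimmed) in qualifier_table:
            return trimmed
    # Assumption: with no match, the original keyword is left unchanged.
    return keyword


qualifier_table = {"操作系统"}  # hypothetical table containing "operating system"
shortened = optimize_keyword(("安卓", "操作", "系统"), 4, qualifier_table)
unchanged = optimize_keyword(("吸引", "人才", "不足"), 4, qualifier_table)
short_enough = optimize_keyword(("智能", "手机"), 4, qualifier_table)
```

Here 安卓操作系统 has six characters, exceeds the limit of four, and is replaced by 操作系统 because that trimmed form is in the table.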
Further, the step of "obtaining an n-ary candidate keyword set" includes:
segmenting the text into an n-ary set, first filtering the noise in the set, and then extracting the keywords in the set to obtain the n-ary candidate keyword set. This step is similar to the step of obtaining the unary candidate keyword set. The difference lies in how the noise is filtered: as the arity increases, there is more noise and the filtering becomes more complex.
Taking n equal to 2 as an example, after the text has been segmented into a binary set, the steps of filtering noise include:
Obtain the highest word frequency max_count in the unary candidate keyword set, and filter out the words in the binary set whose frequency is less than or equal to max_count/5. The value max_count/5 is adjustable; max_count/6 or max_count/4 may also be used.
Filter out words that were combined by chance and occur only once, that is, words in the binary set whose frequency is less than 2.
In addition, each word in the binary set consists of two unary words, a preceding word and a following word. Let x be the smaller of the word frequencies of the preceding word and the following word in the unary candidate keyword set (for example, if the preceding word has frequency 3 and the following word has frequency 4, then x = 3); words in the binary set whose frequency is less than 2x/3 are filtered out.
Words in the binary set that do not meet the rules must also be filtered out. Such words may be words with obvious grammatical errors (for example, a suffix word placed first), words that include a preposition (for example "在明天", "at tomorrow"), or words that include a unit (for example "80元", "80 yuan"). Compared with the filtering of the unary set, the filtering of the binary set is more complex.
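The four binary-set filters above can be combined in one pass, as in the sketch below. The function and parameter names are hypothetical, the rule check is reduced to a caller-supplied predicate, and treating a unary word that is missing from the candidate set as having frequency 0 is an assumption.

```python
def filter_binary(binary_counts, unary_counts, max_count, divisor=5,
                  is_invalid=lambda bigram: False):
    """Apply the binary-set noise filters and return the surviving counts.

    binary_counts: {(preceding word, following word): word frequency}
    unary_counts:  word frequencies in the unary candidate keyword set
    max_count:     highest word frequency in the unary candidate keyword set
    is_invalid:    predicate for words that break the rules
                   (grammar errors, prepositions, units)
    """
    kept = {}
    for (first, second), count in binary_counts.items():
        if count <= max_count / divisor:   # too rare relative to max_count
            continue
        if count < 2:                      # chance combination seen once
            continue
        # Assumption: a word missing from the unary candidates counts as 0.
        x = min(unary_counts.get(first, 0), unary_counts.get(second, 0))
        if count < 2 * x / 3:              # rare relative to its parts
            continue
        if is_invalid((first, second)):
            continue
        kept[(first, second)] = count
    return kept


unary_counts = {"核心": 6, "利益": 10, "冲突": 4}   # hypothetical frequencies
binary_counts = {
    ("核心", "利益"): 5,
    ("利益", "冲突"): 3,
    ("利益", "核心"): 3,   # below 2x/3 = 4, since x = 6
    ("冲突", "利益"): 2,   # not above max_count/5 = 2
    ("在", "明天"): 4,     # contains a preposition
}
prepositions = {"在"}
kept = filter_binary(binary_counts, unary_counts, max_count=10,
                     is_invalid=lambda bigram: bigram[0] in prepositions)
```

Only 核心利益 and 利益冲突 pass all four filters in this example.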
It should be noted that when n is a positive integer greater than 2, the n-ary candidate keyword set can also be obtained by merging (n-1)-ary candidate keywords:
obtaining an (n-1)-ary candidate keyword set;
merging two keywords in the (n-1)-ary candidate keyword set that contain the same (n-2)-ary word at different positions within the keywords, to obtain the n-ary candidate keyword set, where n is a positive integer greater than 2.
The principle of this embodiment has already been described above and is not repeated here.
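One way to read the merging rule is sketched below: two keywords of equal arity are merged when the tail of one equals the head of the other, so the shared word sits at a different position in each. This interpretation of "different positions", along with the function name, is an assumption made for illustration.

```python
def merge_candidates(candidates):
    """Merge pairs of k-ary keywords that overlap in a shared (k-1)-ary word.

    candidates: set of tuples of unary words, all of the same length k >= 2.
    Returns the set of (k+1)-ary keywords formed by the overlapping pairs.
    """
    merged = set()
    for a in candidates:
        for b in candidates:
            # The last k-1 words of `a` must equal the first k-1 words of `b`.
            if a != b and a[1:] == b[:-1]:
                merged.add(a + b[-1:])
    return merged


binary_candidates = {
    ("核心", "利益"), ("智能", "手机"), ("利益", "冲突"),
    ("半导体", "领域"), ("安卓", "操作"), ("操作", "系统"),
    ("操作", "生态"), ("吸引", "人才"), ("人才", "不足"),
}
ternary = merge_candidates(binary_candidates)
```

For the binary candidates from the earlier example this yields four ternary keywords, including 核心利益冲突, which the text names explicitly. The merged keywords would still need the noise filtering described earlier.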
The process of extracting text keywords with the above method is further explained below through a specific embodiment:
As shown in Figure 2, the jieba word segmentation tool is used to segment the text from which keywords are to be extracted (hereinafter "the text") into a unary set and a binary set. The noise in the unary set is filtered, TF-IDF is used to extract the keywords of the unary set, the unary candidate keyword set is obtained, and the highest word frequency max_count in that set is determined. Using max_count and the approach described above, the noise in the binary set is filtered, and TF-IDF is used to extract the keywords of the binary set, giving the binary candidate keyword set.
From the unary candidate keyword set, the unary candidate keywords contained in the binary candidate keywords are removed to obtain the unary result keyword set.
The keywords in the binary candidate keyword set are merged and the noise is filtered (as described above) to obtain the ternary result keyword set. At the same time, the binary candidate keywords contained in the ternary result keywords are removed from the binary candidate keyword set to obtain the binary result keyword set.
The result of extracting keywords from the text is the unary result keyword set, the binary result keyword set, and the ternary result keyword set.
Of course, a length limit can be applied to the final result: when the character length of a keyword in a result keyword set exceeds the maximum character length, the keyword is optimized in the manner described above.
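The first step of this embodiment, building the unary and binary sets with their word frequencies, can be sketched as follows. The sketch assumes the text has already been tokenized (for example by jieba) and that the binary set consists of adjacent word pairs; the function name and the sample tokens are hypothetical, and max_count is taken here directly from the raw unary counts rather than from the filtered candidate set.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n adjacent tokens in an already segmented text."""
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


# Hypothetical segmented text; a real run would obtain it from a tokenizer.
tokens = ["核心", "利益", "冲突", "影响", "核心", "利益"]
unary = ngram_counts(tokens, 1)
binary = ngram_counts(tokens, 2)

# Simplification: the embodiment takes max_count from the unary candidate
# keyword set after noise filtering and TF-IDF extraction.
max_count = max(unary.values())
```

The resulting counters can then be passed to the noise filters and the merging step described earlier.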
The present invention also provides an electronic device comprising a memory and a processor, the memory storing a computer program that can run on the processor, wherein the processor, when executing the program, implements the steps of the above method for automatically extracting text keywords.
The present invention also provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the above method for automatically extracting text keywords.
It should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way only for clarity; those skilled in the art should treat the specification as a whole, and the technical solutions of the embodiments may be combined as appropriate to form other embodiments that those skilled in the art can understand.
The detailed descriptions listed above are only specific descriptions of feasible embodiments of the present invention and are not intended to limit its scope of protection. Any equivalent embodiment or modification that does not depart from the technical spirit of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

  1. A method for automatically extracting text keywords, characterized in that the method comprises:
    obtaining an n-ary candidate keyword set;
    merging two keywords in the n-ary candidate keyword set that contain the same (n-1)-ary word at different positions within the keywords, to obtain an (n+1)-ary result keyword set, where n is a positive integer greater than 1.
  2. The method for automatically extracting text keywords according to claim 1, characterized in that the method further comprises:
    removing, from the n-ary candidate keyword set, the n-ary candidate keywords contained in the (n+1)-ary result keywords, to obtain an n-ary result keyword set.
  3. The method for automatically extracting text keywords according to claim 1, characterized in that the method further comprises:
    when the character length of a keyword in the (n+1)-ary result keyword set exceeds the maximum character length, removing the first or the last unary word of the keyword to obtain an optimized keyword;
    matching the optimized keyword against a qualifier table and, if it matches, replacing the keyword in the (n+1)-ary result keyword set with the optimized keyword.
  4. The method for automatically extracting text keywords according to claim 1, characterized in that the step of "obtaining an n-ary candidate keyword set" comprises:
    segmenting the text into an n-ary set, first filtering the noise in the set, and then extracting the keywords in the set to obtain the n-ary candidate keyword set.
  5. The method for automatically extracting text keywords according to claim 4, characterized in that, when n is 2, that is, after the text has been segmented into a binary set, the step of filtering noise comprises:
    after segmenting the text into a unary set, first filtering the noise in the set, then extracting the keywords in the set, and obtaining the highest word frequency max_count among the keywords;
    filtering out the words in the binary set whose word frequency is less than or equal to max_count/5;
    filtering out the words in the binary set whose word frequency is less than 2;
    each word in the binary set comprising a preceding word and a following word, the smaller of the word frequencies of the preceding word and the following word in the unary set being x, filtering out the words in the binary set whose word frequency is less than 2x/3;
    filtering out the words in the binary set that do not meet the rules.
  6. The method for automatically extracting text keywords according to claim 1, characterized in that the step of "obtaining an n-ary candidate keyword set" comprises:
    obtaining an (n-1)-ary candidate keyword set;
    merging two keywords in the (n-1)-ary candidate keyword set that contain the same (n-2)-ary word at different positions within the keywords, to obtain an (n-1)-ary result keyword set, where n is a positive integer greater than 2.
  7. The method for automatically extracting text keywords according to claim 1, characterized in that:
    after the keywords in the n-ary candidate keyword set are merged, noise filtering is performed to obtain the (n+1)-ary result keyword set.
  8. The method for automatically extracting text keywords according to claim 7, characterized in that the step of performing noise filtering after merging the keywords in the n-ary candidate keyword set comprises:
    after segmenting the text into a unary set, first filtering the noise in the set, then extracting the keywords in the set, and obtaining the highest word frequency max_count among the keywords;
    filtering out merged n-ary candidate keywords whose word frequency is less than max_count/4;
    filtering out merged n-ary candidate keywords whose word frequency is less than 2.
  9. An electronic device comprising a memory and a processor, the memory storing a computer program that can run on the processor, characterized in that the processor, when executing the program, implements the steps of the method for automatically extracting text keywords according to any one of claims 1-8.
  10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method for automatically extracting text keywords according to any one of claims 1-8.
PCT/CN2019/115115 2019-08-15 2019-11-01 Method and device for automatically extracting text keyword, and storage medium WO2021027085A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910754155.7A CN110532551A (en) 2019-08-15 2019-08-15 Method, equipment and the storage medium that text key word automatically extracts
CN201910754155.7 2019-08-15

Publications (1)

Publication Number Publication Date
WO2021027085A1 true WO2021027085A1 (en) 2021-02-18

Family

ID=68663358

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/115115 WO2021027085A1 (en) 2019-08-15 2019-11-01 Method and device for automatically extracting text keyword, and storage medium

Country Status (2)

Country Link
CN (1) CN110532551A (en)
WO (1) WO2021027085A1 (en)

Also Published As

Publication number Publication date
CN110532551A (en) 2019-12-03

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19941090

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19941090

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 210922)
