WO2021027085A1

WO2021027085A1 - Method and device for automatically extracting text keyword, and storage medium

Info

Publication number: WO2021027085A1
Application number: PCT/CN2019/115115
Authority: WO
Inventors: 龚朝辉; 陶予祺; 童刚
Original assignee: 苏州朗动网络科技有限公司
Priority date: 2019-08-15
Filing date: 2019-11-01
Publication date: 2021-02-18
Also published as: CN110532551A

Abstract

A method and device for automatically extracting text keywords, and a storage medium. The method comprises: acquiring an n-ary candidate keyword set; and merging two keywords in the n-ary candidate keyword set that contain the same (n-1)-ary word at different positions, so as to obtain an (n+1)-ary result keyword set, wherein n is a positive integer greater than 1. By merging the keywords extracted after fine-grained segmentation, the method completes the semantics of keywords that were split apart, thereby avoiding the semantic incompleteness caused by overly fine word segmentation.

Description

Method, equipment and storage medium for automatically extracting text keywords

Technical field

The present invention relates to the field of Internet technology, in particular to a method, equipment and storage medium for automatically extracting text keywords.

Background art

Automatic keyword extraction automatically extracts thematic or important words or phrases from a text. It is a basic and necessary step in many text mining tasks, such as text retrieval and text summarization. Document keywords represent the subject matter and key content of a document, and are the smallest unit for understanding document content.

There are many methods of automatic keyword extraction. More than 10 are in mainstream use, including methods based on language analysis, statistical methods, and machine learning methods. Statistical methods use the statistical information of the words in a document to extract its keywords. They are relatively simple, do not require training data, and generally do not require an external knowledge base, so extraction is fast and they are often used in scenarios that require real-time computation.

The first step of extracting keywords from Chinese natural language is to segment the text and build a vocabulary; keywords are then extracted from that vocabulary. As a result, keywords can only be words in the vocabulary. General word segmentation tools segment at a relatively fine granularity, which produces relatively little noise that is easy to filter, but often fragments the semantics. For example, "China Internet Conference" is divided into "China", "Internet" and "Conference". If "China Internet Conference" is the keyword, it is not in the vocabulary and therefore will not be extracted. If the word segmentation tool instead uses a coarse granularity (such as ternary or above), it introduces more noise that is difficult to filter, and many noisy keywords are extracted.

Summary of the invention

The purpose of the present invention is to provide a method, equipment and storage medium for automatically extracting text keywords.

In order to achieve one of the above-mentioned objects of the invention, an embodiment of the present invention provides a method for automatically extracting text keywords. The method includes:

Obtain an n-ary candidate keyword set;

Merge two keywords in the n-ary candidate keyword set that contain the same (n-1)-ary word, where the (n-1)-ary word is at a different position in each keyword, to obtain an (n+1)-ary result keyword set, where n is a positive integer greater than 1.

As a further improvement of an embodiment of the present invention, the method further includes:

From the n-ary candidate keyword set, remove the n-ary candidate keywords included in the n+1-ary result keywords to obtain an n-ary result keyword set.

When the word length of a keyword in the (n+1)-ary result keyword set is greater than the maximum word length, remove the first or last unary word of the keyword to obtain an optimized keyword;

Match the optimized keyword against the qualifier table and, if it matches, replace the keyword in the (n+1)-ary result keyword set with the optimized keyword.

As a further improvement of an embodiment of the present invention, the step of "obtaining an n-ary candidate keyword set" includes:

The text is segmented into an n-ary set, the noise in the set is filtered first, and then keywords in the set are extracted to obtain an n-ary candidate keyword set.

As a further improvement of an embodiment of the present invention, when n is 2, that is, after the text is segmented into a binary set, the steps of filtering noise include:

After segmenting the text into a unary set, first filter the noise in the set, then extract the keywords in the set, and obtain the highest word frequency max_count among the keywords;

Filter the words whose word frequency is less than or equal to max_count/5 in the binary set;

Filter the words whose word frequency is less than 2 in the binary set;

Each word in the binary set consists of a preceding word and a following word. Let x be the smaller of the word frequencies of the preceding word and the following word in the unary set; filter the words in the binary set whose word frequency is less than 2x/3;

Filter the non-compliant words in the binary set.

As a further improvement of an embodiment of the present invention, the step of "obtaining an n-ary candidate keyword set" includes:

Obtain an (n-1)-ary candidate keyword set;

Merge two keywords in the (n-1)-ary candidate keyword set that contain the same (n-2)-ary word, where the (n-2)-ary word is at a different position in each keyword, to obtain the n-1 result keyword set, where n is a positive integer greater than 2.

As a further improvement of an embodiment of the present invention, after the keywords in the n-ary candidate keyword set are merged, noise filtering is performed to obtain an n+1-ary result keyword set.

As a further improvement of an embodiment of the present invention, the step of performing noise filtering after merging keywords in the n-ary candidate keyword set includes:

Filter the merged keywords whose word frequency is less than max_count/4;

Filter the merged keywords whose word frequency is less than 2.

In order to achieve one of the above-mentioned objects of the invention, an embodiment of the present invention provides an electronic device including a memory and a processor. The memory stores a computer program that can run on the processor, and the processor implements the steps in the method for automatically extracting text keywords when executing the program.

In order to achieve one of the above-mentioned objects of the invention, an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the steps in the method for automatically extracting text keywords are implemented.

Compared with the prior art, the technical solution of the present invention merges the keywords extracted after segmentation, so that the semantics of keywords that were split apart are completed, avoiding the semantic incompleteness caused by overly fine word segmentation.

Description of the drawings

FIG. 1 is a schematic flowchart of a method for automatically extracting text keywords according to an embodiment of the present invention.

FIG. 2 is a schematic flowchart of a method for automatically extracting text keywords in a specific embodiment of the present invention.

Detailed description

The present invention will be described in detail below in conjunction with the specific embodiments shown in the drawings. However, these embodiments do not limit the present invention, and the structural, method, or functional changes made by those skilled in the art based on these embodiments are all included in the protection scope of the present invention.

It should be noted that a keyword in the present invention can be a word, which is the smallest language unit that can be used independently, such as flower, bird, person, young, or language. A keyword can also be a phrase, which is a grammatical unit formed by combining two or more words, such as optimization plan, intellectual property, China Internet Conference, or China National Patent Attorney Association. Therefore, an n-ary word (an n-gram, where n is a positive integer) in the present invention refers to a word or phrase that includes n words. For example, a unary word is a single word, and a binary word is a phrase that combines two words, that is, a binary word includes two unary words, and so on.

As shown in Figure 1, the method for automatically extracting text keywords of the present invention includes:

Obtain an n-ary candidate keyword set;

In the present invention, there are many ways to obtain the n-ary candidate keyword set, which will be described in detail later. After the n-ary candidate keyword set is obtained, the keywords in the set are compared one by one. Two keywords that contain the same (n-1)-ary word, with the (n-1)-ary word at a different position in each keyword, are merged into an (n+1)-ary result keyword, which yields the (n+1)-ary result keyword set.

The method is explained below with a specific example. Taking n equal to 2 as an example, the obtained binary candidate keyword set is {core interest, smart phone, conflict of interest, semiconductor field, Android operation, operating system, operating ecology, attracting talents, lack of talents}. Since both "core interest" and "conflict of interest" include the unary word "interest", and "interest" is at a different position in each of the two keywords, the two keywords can be merged in order, giving the ternary result keyword "core conflict of interest". By analogy, the ternary result keyword set is {core conflict of interest, Android operating system, Android operating ecology, insufficient talent attraction}.
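The merge step in this example can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: keywords are modeled as tuples of unary words, the English tokens are stand-ins for the segmented Chinese words, the name `merge_ngrams` is hypothetical, and "the shared (n-1)-ary word is at a different position" is read as the shared word being the tail of one keyword and the head of the other.

```python
from itertools import permutations


def merge_ngrams(candidates):
    """Merge keywords a and b into an (n+1)-word keyword when the last
    n-1 words of a equal the first n-1 words of b (assumed merge rule)."""
    merged = set()
    for a, b in permutations(candidates, 2):
        if a[1:] == b[:-1]:
            # Append only the final word of b, since the rest overlaps with a.
            merged.add(a + b[-1:])
    return merged


# Binary candidates from the example; the English tokens are illustrative.
binary_candidates = {
    ("core", "interest"), ("smart", "phone"), ("interest", "conflict"),
    ("semiconductor", "field"), ("Android", "operation"),
    ("operation", "system"), ("operation", "ecology"),
    ("attract", "talent"), ("talent", "shortage"),
}

ternary_results = merge_ngrams(binary_candidates)
for keyword in sorted(ternary_results):
    print(" ".join(keyword))
```

With this tokenization the sketch yields four ternary keywords, matching the four entries of the ternary result keyword set in the example.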

The technical solution of the present invention merges the keywords extracted after fine-grained segmentation, so that the semantics of keywords that were split apart are completed, and the semantic incompleteness caused by overly fine word segmentation is avoided.

Further, after the keywords in the n-ary candidate keyword set are merged, noise filtering is performed to obtain the (n+1)-ary result keyword set.

Noise filtering refers to removing some keywords that do not meet the grammatical regulations or do not meet the requirements.

In a preferred embodiment, the step of merging keywords in the n-ary candidate keyword set and performing noise filtering includes:

After segmenting the text from which keywords are to be extracted into a unary set, first filter the noise in the set, then extract the keywords in the set, and obtain the highest word frequency max_count among the keywords;

In this embodiment, word segmentation tools such as jieba, HanLP, StanfordNLP, and THULAC can be used to segment the text from which keywords are to be extracted into a unary set (also called a unigram set), and noise filtering is then performed on the words in the unary set. Specifically, the words in the set can be filtered by part of speech, word frequency, and word length. Part-of-speech filtering removes adjectives, adverbs, and prepositions, and retains only nouns and verbs. Word frequency filtering removes words whose frequency in the text is greater than the maximum word frequency or less than the minimum word frequency. Word length filtering removes words whose length is greater than the maximum length or less than the minimum length. The maximum word frequency, minimum word frequency, maximum length, and minimum length are derived from empirical data collected over tens of thousands or even tens of millions of samples.
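The three unary filters described above can be sketched as follows. This is an illustration only: it takes already segmented and tagged tokens as input rather than calling a segmentation tool, it assumes part-of-speech tags in which noun tags start with "n" and verb tags start with "v", and the function name, bounds, and sample tokens are hypothetical.

```python
from collections import Counter


def filter_unigrams(tagged_tokens, min_freq, max_freq, min_len, max_len):
    """Keep nouns and verbs whose frequency and length lie within the bounds."""
    # Part-of-speech filter (assumed tag convention: n* = noun, v* = verb).
    kept = [word for word, tag in tagged_tokens if tag.startswith(("n", "v"))]
    freq = Counter(kept)
    # Word frequency and word length filters.
    return {
        word: count
        for word, count in freq.items()
        if min_freq <= count <= max_freq and min_len <= len(word) <= max_len
    }


# Hypothetical (word, tag) pairs standing in for segmenter output.
tagged = [
    ("system", "n"), ("system", "n"), ("very", "d"), ("operate", "v"),
    ("operate", "v"), ("of", "u"), ("Android", "n"), ("system", "n"),
]
unigrams = filter_unigrams(tagged, min_freq=2, max_freq=100, min_len=2, max_len=12)
print(unigrams)
```

In practice the bounds would come from the empirical sample data mentioned above rather than from fixed constants.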

After the noise in the unary set is filtered, a keyword extraction algorithm, such as TF-IDF or TextRank, is used to extract the keywords in the unary set. The present invention preferably uses the TF-IDF algorithm. TF-IDF uses the global statistic IDF (inverse document frequency) of a word in the corpus and the TF (term frequency) of the word in the current document to calculate the weight of the word, and the words with the highest weights are used as keywords. TF is the number of occurrences of the specified word in the text. IDF is the logarithm of the ratio of the total number of documents in the corpus to the number of documents containing the specified word. TF-IDF is the product of TF and IDF. The TF-IDF of each word in the document is calculated and used as the weight of the word to select keywords. There are at least two ways to extract keywords using TF-IDF. Method one, absolute value: all words in the set whose weight exceeds a certain fixed value are extracted as keywords. Method two, relative value: the top-ranked words by weight in the set are extracted as keywords.

After the keywords in the unary set are extracted, the unary candidate keyword set is obtained and the highest word frequency max_count of the keywords in this set is found. The merged n-ary candidate keywords whose word frequency is less than max_count/4 are then filtered. It should be noted that if more (n+1)-ary keywords are wanted, the filtering condition can be relaxed appropriately, for example by filtering merged keywords whose word frequency is less than max_count/5. If fewer (n+1)-ary keywords are wanted, the filtering condition can be tightened appropriately, for example by filtering merged keywords whose word frequency is less than max_count/3, and so on. In addition, it is necessary to filter words that were grouped together by chance and appear only once, that is, to filter merged keywords whose word frequency is less than 2.
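The TF-IDF weighting and the two selection methods described above can be sketched in plain Python. This is a minimal illustration using the definitions in this paragraph (TF as a raw count, IDF as the logarithm of total documents over documents containing the word); the function names and the toy corpus are hypothetical, and the sketch assumes the document being scored is itself part of the corpus so that every word has a nonzero document count.

```python
import math
from collections import Counter


def tf_idf(document, corpus):
    """Return {word: TF * IDF}, with TF = raw count and IDF = log(N / df)."""
    tf = Counter(document)
    n_docs = len(corpus)
    weights = {}
    for word, count in tf.items():
        # df: number of documents in the corpus that contain the word.
        df = sum(1 for doc in corpus if word in doc)
        weights[word] = count * math.log(n_docs / df)
    return weights


def by_threshold(weights, threshold):
    """Method one (absolute value): keep words whose weight exceeds a fixed value."""
    return {word for word, score in weights.items() if score > threshold}


def top_k(weights, k):
    """Method two (relative value): keep the k highest-weighted words."""
    return sorted(weights, key=weights.get, reverse=True)[:k]


corpus = [
    ["android", "system", "android"],
    ["system", "phone"],
    ["phone", "market"],
]
weights = tf_idf(corpus[0], corpus)
print(by_threshold(weights, 1.0), top_k(weights, 1))
```

Here "android" appears twice in one of three documents, so its weight is 2·log(3), while "system" appears once in two of three documents and gets log(1.5); both selection methods therefore pick "android".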

After noise filtering, an (n+1)-ary result keyword set with less noise and higher accuracy is obtained.
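The two post-merge filtering rules can be sketched as follows. This is an illustration under assumptions: the word frequency of a merged keyword is taken to be the number of times its words occur consecutively in the segmented text, and the names `count_occurrences`, `filter_merged`, and `divisor` are hypothetical.

```python
def count_occurrences(tokens, keyword):
    """Count how often the keyword's words appear consecutively in tokens."""
    n = len(keyword)
    return sum(
        1 for i in range(len(tokens) - n + 1) if tuple(tokens[i:i + n]) == keyword
    )


def filter_merged(merged, tokens, max_count, divisor=4):
    """Drop merged keywords with frequency < max_count / divisor or < 2."""
    kept = {}
    for keyword in merged:
        freq = count_occurrences(tokens, keyword)
        if freq < max_count / divisor:
            continue  # too rare relative to the most frequent unary keyword
        if freq < 2:
            continue  # grouped together by chance, appears only once
        kept[keyword] = freq
    return kept


# Toy segmented text; "Android" is the most frequent unary keyword (4 times).
tokens = ["Android", "operation", "system"] * 3 + ["Android", "operation", "ecology"]
merged = {("Android", "operation", "system"), ("Android", "operation", "ecology")}
result = filter_merged(merged, tokens, max_count=4)
print(result)
```

Raising `divisor` to 5 relaxes the first rule and lowering it to 3 tightens it, which corresponds to the adjustment described above.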

Further, the method further includes: removing the n-ary candidate keywords included in the n+1-ary result keywords from the n-ary candidate keyword set to obtain an n-ary result keyword set.

Continuing the above example, the binary candidate keyword set is {core interest, smart phone, conflict of interest, semiconductor field, Android operation, operating system, operating ecology, attracting talents, lack of talents}. The binary candidate keywords contained in the ternary result keywords are removed. For example, "core conflict of interest" contains "core interest" and "conflict of interest", so "core interest" and "conflict of interest" are removed from the binary candidate keyword set. By analogy, the binary result keyword set is {smart phone, semiconductor field}.
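The removal step can be sketched as follows, again modeling keywords as tuples of unary words. The English tokens are illustrative stand-ins for the segmented Chinese words, and the function names are hypothetical; this is a sketch of the idea rather than the implementation of the invention.

```python
def contained_ngrams(keyword, n):
    """Return every run of n consecutive words inside the keyword."""
    return {keyword[i:i + n] for i in range(len(keyword) - n + 1)}


def remove_covered(candidates, results, n):
    """Drop n-word candidates that occur inside any (n+1)-word result keyword."""
    covered = set()
    for keyword in results:
        covered |= contained_ngrams(keyword, n)
    return {candidate for candidate in candidates if candidate not in covered}


binary_candidates = {
    ("core", "interest"), ("smart", "phone"), ("interest", "conflict"),
    ("semiconductor", "field"), ("Android", "operation"),
    ("operation", "system"), ("operation", "ecology"),
    ("attract", "talent"), ("talent", "shortage"),
}
ternary_results = {
    ("core", "interest", "conflict"), ("Android", "operation", "system"),
    ("Android", "operation", "ecology"), ("attract", "talent", "shortage"),
}

binary_results = remove_covered(binary_candidates, ternary_results, n=2)
print(sorted(binary_results))
```

Only the two binary keywords that took part in no merge remain, which matches the binary result keyword set in the example.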

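After keywords are merged, the error rate increases as their length increases. In a preferred embodiment, the method further includes:

When the word length of a keyword in the (n+1)-ary result keyword set is greater than the maximum word length, remove the first or last unary word of the keyword to obtain an optimized keyword;

Match the optimized keyword against the qualifier table and, if it matches, replace the keyword in the (n+1)-ary result keyword set with the optimized keyword.

The qualifier table can be customized according to actual needs, for example with proper nouns from the input method or the full names and abbreviations of companies. The maximum word length can be determined empirically.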
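The length optimization with a qualifier table can be sketched as follows. Several details here are assumptions rather than statements of the invention: word length is measured as the total number of characters in the keyword, the first unary word is tried before the last, the keyword is left unchanged when neither trimmed form is in the table, and the function name and table contents are hypothetical.

```python
def optimize_keyword(keyword, max_length, qualifier_table):
    """Trim an over-long keyword if a trimmed form is in the qualifier table."""
    length = sum(len(word) for word in keyword)  # assumed length measure
    if length <= max_length:
        return keyword
    # Try dropping the first unary word, then the last one.
    for trimmed in (keyword[1:], keyword[:-1]):
        if trimmed in qualifier_table:
            return trimmed
    return keyword  # no match: keep the original keyword


# Hypothetical qualifier table, e.g. built from known proper nouns.
qualifier_table = {("Android", "operating", "system")}

print(optimize_keyword(("new", "Android", "operating", "system"), 20, qualifier_table))
print(optimize_keyword(("Android", "operating", "system", "today"), 20, qualifier_table))
```

Both calls return the three-word keyword from the table, the first by dropping the leading word and the second by dropping the trailing word.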

Further, the step of "obtaining an n-ary candidate keyword set" includes:

The text is segmented into an n-ary set, the noise in the set is filtered first, and then the keywords in the set are extracted to obtain an n-ary candidate keyword set. This step is similar to the step of obtaining the unary candidate keyword set. The difference lies in the noise filtering method: as n increases, the noise increases and the filtering method becomes more complicated.

Taking n equal to 2 as an example, after the text is segmented into a binary set, the steps of filtering noise include:

Obtain the highest word frequency max_count in the unary candidate keyword set, and filter the words in the binary set whose word frequency is less than or equal to max_count/5. The value max_count/5 can be adjusted, for example to max_count/6 or max_count/4.

Filter words that were grouped together by chance and appear only once, that is, filter the words in the binary set whose word frequency is less than 2.

In addition, each word in the binary set includes two unary words, the preceding word and the following word. Let x be the smaller of the word frequencies of the preceding word and the following word in the unary candidate keyword set (for example, if the word frequency of the preceding word is 3 and the word frequency of the following word is 4, then x=3). Words in the binary set whose word frequency is less than 2x/3 are filtered.

At the same time, the non-compliant words in the binary set should be filtered. Non-compliant words can be words with obvious grammatical errors (such as suffix words), words with prepositions (such as "tomorrow"), or unit words (such as "80 yuan"), etc. Compared with the filtering steps for the unary set, the filtering steps for the binary set are more complicated.
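The four binary-set filtering rules can be sketched together as follows. This is an illustration under assumptions: the frequency tables are given as plain dictionaries, a component word missing from the unary candidate set is treated as having frequency 0 (so the 2x/3 rule never removes that word), the non-compliance check is a caller-supplied predicate, and all names and sample numbers are hypothetical.

```python
def filter_bigrams(bigram_freq, unigram_freq, max_count, divisor=5,
                   is_non_compliant=lambda bigram: False):
    """Apply the four noise-filtering rules to a binary set."""
    kept = {}
    for bigram, freq in bigram_freq.items():
        if freq <= max_count / divisor:
            continue  # rule 1: too rare relative to max_count
        if freq < 2:
            continue  # rule 2: appears only once
        preceding, following = bigram
        # x: smaller unary frequency of the two component words (0 if absent).
        x = min(unigram_freq.get(preceding, 0), unigram_freq.get(following, 0))
        if freq < 2 * x / 3:
            continue  # rule 3: components mostly occur apart
        if is_non_compliant(bigram):
            continue  # rule 4: caller-defined non-compliant words
        kept[bigram] = freq
    return kept


unigram_freq = {"Android": 9, "operation": 6, "system": 6, "smart": 3, "phone": 3}
bigram_freq = {
    ("Android", "operation"): 5,  # kept: 5 >= 2 * 6 / 3
    ("operation", "system"): 3,   # removed by rule 3: 3 < 4
    ("smart", "phone"): 2,        # kept: 2 >= 2 * 3 / 3
    ("Android", "system"): 1,     # removed by rule 1: 1 <= 9 / 5
    ("80", "yuan"): 4,            # removed by rule 4: number plus unit
}
kept = filter_bigrams(
    bigram_freq, unigram_freq, max_count=9,
    is_non_compliant=lambda bigram: bigram[0].isdigit(),
)
print(kept)
```

The `divisor` parameter reflects the adjustable max_count/5 threshold; passing 6 or 4 corresponds to the looser or stricter variants mentioned above.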

It should be noted that when n is a positive integer greater than 2, the n-ary candidate keyword set can also be obtained by merging (n-1)-ary candidate keywords:

Obtain an (n-1)-ary candidate keyword set;

The principle of this embodiment has been introduced before, and will not be repeated here.

The following further explains through specific embodiments, the process of extracting text keywords using the above-mentioned method for automatically extracting text keywords:

As shown in Figure 2, the jieba word segmentation tool is used to segment the text from which keywords are to be extracted (hereinafter referred to as the text) into a unary set and a binary set. The noise of the unary set is filtered, and TF-IDF is used to extract the keywords of the unary set to obtain the unary candidate keyword set and the highest word frequency max_count in that set. Using max_count and the aforementioned method, the noise in the binary set is filtered, and the keywords of the binary set are extracted using TF-IDF to obtain the binary candidate keyword set.

In the unary candidate keyword set, the unary candidate keywords contained in the binary candidate keywords are removed to obtain the unary result keyword set.

Merge the keywords in the binary candidate keyword set and filter the noise (as described above) to obtain the ternary result keyword set. At the same time, remove from the binary candidate keyword set the binary candidate keywords contained in the ternary result keywords to obtain the binary result keyword set.

The results of extracting the text keywords are: a unary result keyword set, a binary result keyword set, and a ternary result keyword set.

Of course, the length of the final results can be restricted. When the word length of a keyword in a result keyword set is greater than the maximum word length, the keyword is optimized as described above.

The present invention also provides an electronic device, including a memory and a processor. The memory stores a computer program that can run on the processor, and the steps in the method for automatically extracting text keywords described above are implemented when the processor executes the program.

The present invention also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps in the method for automatically extracting text keywords are realized.

It should be understood that although this specification is described in accordance with the implementation manners, not each implementation manner only includes an independent technical solution. This narration in the specification is only for clarity, and those skilled in the art should regard the specification as a whole. The technical solutions in the embodiments can also be appropriately combined to form other embodiments that can be understood by those skilled in the art.

The series of detailed descriptions listed above are only specific descriptions of feasible implementations of the present invention and are not intended to limit the scope of protection of the present invention. Any equivalent implementations or changes made without departing from the technical spirit of the present invention shall be included in the protection scope of the present invention.

Claims

1. A method for automatically extracting text keywords, characterized in that the method includes:

Obtain an n-ary candidate keyword set;

Merge two keywords in the n-ary candidate keyword set that contain the same (n-1)-ary word, where the (n-1)-ary word is at a different position in each keyword, to obtain an (n+1)-ary result keyword set, where n is a positive integer greater than 1.

2. The method for automatically extracting text keywords according to claim 1, wherein the method further comprises:

From the n-ary candidate keyword set, remove the n-ary candidate keywords included in the (n+1)-ary result keywords to obtain an n-ary result keyword set.

3. The method for automatically extracting text keywords according to claim 1, wherein the method further comprises:

When the word length of a keyword in the (n+1)-ary result keyword set is greater than the maximum word length, remove the first or last unary word of the keyword to obtain an optimized keyword;

Match the optimized keyword against the qualifier table and, if it matches, replace the keyword in the (n+1)-ary result keyword set with the optimized keyword.

4. The method for automatically extracting text keywords according to claim 1, wherein the step of "obtaining an n-ary candidate keyword set" comprises:

The text is segmented into an n-ary set, the noise in the set is filtered first, and then keywords in the set are extracted to obtain an n-ary candidate keyword set.

5. The method for automatically extracting text keywords according to claim 4, characterized in that, when n is 2, that is, after the text is segmented into a binary set, the step of filtering noise comprises:

After segmenting the text into a unary set, first filter the noise in the set, then extract the keywords in the set, and obtain the highest word frequency max_count among the keywords;

Filter the words whose word frequency is less than or equal to max_count/5 in the binary set;

Filter the words whose word frequency is less than 2 in the binary set;

Each word in the binary set consists of a preceding word and a following word. Let x be the smaller of the word frequencies of the preceding word and the following word in the unary set; filter the words in the binary set whose word frequency is less than 2x/3;

Filter the non-compliant words in the binary set.

6. The method for automatically extracting text keywords according to claim 1, wherein the step of "obtaining an n-ary candidate keyword set" comprises:

Obtain an (n-1)-ary candidate keyword set;

Merge two keywords in the (n-1)-ary candidate keyword set that contain the same (n-2)-ary word, where the (n-2)-ary word is at a different position in each keyword, to obtain the n-1 result keyword set, where n is a positive integer greater than 2.

7. The method for automatically extracting text keywords according to claim 1, wherein:

After merging the keywords in the n-ary candidate keyword set, noise filtering is performed to obtain the (n+1)-ary result keyword set.

8. The method for automatically extracting text keywords according to claim 7, wherein the step of merging keywords in the n-ary candidate keyword set and then performing noise filtering comprises:

After segmenting the text into a unary set, first filter the noise in the set, then extract the keywords in the set, and obtain the highest word frequency max_count among the keywords;

Filter the merged keywords whose word frequency is less than max_count/4;

Filter the merged keywords whose word frequency is less than 2.

9. An electronic device, comprising a memory and a processor, the memory storing a computer program that can run on the processor, wherein the processor, when executing the program, implements the steps in the method for automatically extracting text keywords according to any one of claims 1-8.

10. A computer-readable storage medium with a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps in the method for automatically extracting text keywords according to any one of claims 1-8.