CN110532551A

CN110532551A - Method, device, and storage medium for automatic extraction of text keywords

Info

Publication number: CN110532551A
Application number: CN201910754155.7A
Authority: CN
Inventors: 龚朝辉; 陶予祺; 童刚
Original assignee: Suzhou Long Mobile Network Technology Co Ltd
Current assignee: Suzhou Long Mobile Network Technology Co Ltd
Priority date: 2019-08-15
Filing date: 2019-08-15
Publication date: 2019-12-03
Also published as: WO2021027085A1

Abstract

The present invention discloses a method, device, and storage medium for automatically extracting text keywords. The method comprises: obtaining an n-gram candidate keyword set; and merging any two keywords in the n-gram candidate keyword set that contain the same (n-1)-gram word at different positions within the keyword, to obtain an (n+1)-gram result keyword set, where n is a positive integer greater than 1. Compared with the prior art, the technical solution of the present invention merges the keywords extracted after fine-grained segmentation, so that the semantics of keywords that were split apart are restored, avoiding the semantic incompleteness caused by overly fine word segmentation.

Description

Method, device, and storage medium for automatic extraction of text keywords

Technical field

The present invention relates to the field of Internet technology, and in particular to a method, device, and storage medium for automatically extracting text keywords.

Background art

Automatic keyword extraction is the automatic extraction of topical or important words or phrases from a text. It is a basic and necessary step for many text-mining tasks such as text retrieval and text summarization. The keywords of a document characterize its topic and key content and are the smallest unit for understanding the document's content.

There are many methods of automatic keyword extraction. The mainstream ones, such as methods based on linguistic analysis, statistical methods, and machine-learning methods, number more than ten. Statistical methods extract a document's keywords from the statistical information of the words in the document. They are comparatively simple, need no training data, and generally need no external knowledge base, so extraction is fast and they are commonly used in scenarios that require real-time computation.

The first step of keyword extraction for Chinese natural language is to segment the text into words and build a vocabulary, from which the keywords are then extracted. As a result, a keyword can only be a word in the vocabulary. General-purpose segmentation tools segment at a fine granularity (which introduces relatively little noise, and that noise is easy to filter), but segmentation often splits semantic units apart. For example, "China Internet conference" is segmented into "China", "Internet", and "conference". If "China Internet conference" is the actual keyword, it is not in the vocabulary, so it is discarded and never extracted as a keyword. Conversely, if the segmentation tool uses a coarse granularity (for example trigrams or longer), it introduces more noise that is hard to filter, and many noisy keywords are extracted.

Summary of the invention

The purpose of the present invention is to provide a method, device, and storage medium for automatically extracting text keywords.

To achieve one of the above objects, an embodiment of the present invention provides a method for automatically extracting text keywords, the method comprising:

obtaining an n-gram candidate keyword set;

merging any two keywords in the n-gram candidate keyword set that contain the same (n-1)-gram word at different positions within the keyword, to obtain an (n+1)-gram result keyword set, where n is a positive integer greater than 1.

As a further improvement of an embodiment of the present invention, the method further comprises:

removing, from the n-gram candidate keyword set, the n-gram candidate keywords contained in the (n+1)-gram result keywords, to obtain an n-gram result keyword set.

As a further improvement of an embodiment of the present invention, the method further comprises:

when the word length of a keyword in the (n+1)-gram result keyword set is greater than the maximum word length, removing the first or the last unigram word of the keyword to obtain an optimized keyword;

matching the optimized keyword against a restricted vocabulary and, if it matches, replacing the keyword in the (n+1)-gram result keyword set with the optimized keyword.

As a further improvement of an embodiment of the present invention, the step of "obtaining an n-gram candidate keyword set" comprises:

segmenting the text into an n-gram set, first filtering the noise in the set, and then extracting the keywords in the set to obtain the n-gram candidate keyword set.

As a further improvement of an embodiment of the present invention, when n is 2, that is, after the text is segmented into a bigram set, the step of filtering noise comprises:

segmenting the text into a unigram set, first filtering the noise in that set, then extracting the keywords in that set, and obtaining the highest word frequency max_count among those keywords;

filtering out words in the bigram set whose word frequency is less than or equal to max_count/5;

filtering out words in the bigram set whose word frequency is less than 2;

where each word in the bigram set consists of a preceding word and a following word, and x is the smaller of the word frequencies of the preceding word and the following word in the unigram set, filtering out words in the bigram set whose word frequency is less than 2x/3;

filtering out words in the bigram set that do not conform to the rules.

As a further improvement of an embodiment of the present invention, the step of "obtaining an n-gram candidate keyword set" comprises:

obtaining an (n-1)-gram candidate keyword set;

merging any two keywords in the (n-1)-gram candidate keyword set that contain the same (n-2)-gram word at different positions within the keyword, to obtain the n-gram candidate keyword set, where n is a positive integer greater than 2.

As a further improvement of an embodiment of the present invention, after the keywords in the n-gram candidate keyword set are merged, noise filtering is performed to obtain the (n+1)-gram result keyword set.

As a further improvement of an embodiment of the present invention, the step of noise filtering after merging the keywords in the n-gram candidate keyword set comprises:

filtering out merged words whose word frequency is less than max_count/4;

filtering out merged words whose word frequency is less than 2.

To achieve one of the above objects, an embodiment of the present invention provides an electronic device comprising a memory and a processor, the memory storing a computer program that can run on the processor, wherein the processor, when executing the program, implements the steps of the above method for automatically extracting text keywords.

To achieve one of the above objects, an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the above method for automatically extracting text keywords.

Compared with the prior art, the technical solution of the present invention merges the keywords extracted after fine-grained segmentation, so that the semantics of keywords that were split apart are restored, avoiding the semantic incompleteness caused by overly fine word segmentation.

Brief description of the drawings

Fig. 1 is a flow diagram of the method for automatically extracting text keywords according to an embodiment of the present invention.

Fig. 2 is a flow diagram of the method for automatically extracting text keywords according to a specific embodiment of the present invention.

Detailed description of the embodiments

The present invention is described in detail below with reference to the specific embodiments shown in the drawings. These embodiments do not limit the present invention; structural, methodological, or functional changes made by those of ordinary skill in the art on the basis of these embodiments all fall within the scope of protection of the present invention.

It should be noted that a keyword in the present invention may be a word, a word being the smallest linguistic unit that can be used independently, such as flower, bird, people, youth, or language. A keyword may also be a phrase, a phrase being a syntactic unit formed by combining two or more words, such as optimization scheme, intellectual property, China Internet conference, or China's national patent agents association. Accordingly, an n-gram word in the present invention (n being a positive integer) is a word or phrase consisting of n words. For example, a unigram word is a single word, and a bigram word is a phrase of two words combined, that is, a bigram word contains two unigram words, and so on.

As shown in Fig. 1, the method for automatically extracting text keywords of the present invention comprises:

obtaining an n-gram candidate keyword set;

In the present invention, the n-gram candidate keyword set can be obtained in several ways, which are described in detail later. After the n-gram candidate keyword set is obtained, the keywords in the set are compared pairwise to find two keywords that contain the same (n-1)-gram word at different positions within the keyword. These are merged into an (n+1)-gram result keyword, yielding the (n+1)-gram result keyword set.

The method is explained below with a specific example in which n equals 2. Suppose the bigram candidate keyword set obtained is: {core interests, smart phone, interests conflict, semiconductor field, Android operating, operating system, operating ecosystem, attract talent, talent shortage}. "Core interests" and "interests conflict" both contain the unigram word "interests", and "interests" occupies a different position in each of the two keywords, so the two can be merged in order; the resulting trigram result keyword is "core interests conflict". Proceeding in the same way gives the trigram result keyword set: {core interests conflict, Android operating system, Android operating ecosystem, attract talent shortage}.
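The merge step in this example can be sketched in Python as follows. This is a minimal sketch, not the patented implementation: it assumes each keyword is a tuple of unigram tokens, that all candidates have the same length n (n greater than 1), and that "the same (n-1)-gram at different positions" means the (n-1)-gram is the tail of one keyword and the head of the other. The function name `merge_candidates` and the English tokens are illustrative.

```python
from typing import List, Tuple

NGram = Tuple[str, ...]


def merge_candidates(candidates: List[NGram]) -> List[NGram]:
    """Merge n-grams whose shared (n-1)-gram ends one keyword and starts the other."""
    merged: List[NGram] = []
    for left in candidates:
        for right in candidates:
            if left == right:
                continue  # a keyword is never merged with itself
            # Assumed reading: tail of `left` equals head of `right`.
            if left[1:] == right[:-1]:
                combined = left + right[-1:]
                if combined not in merged:
                    merged.append(combined)
    return merged


# Illustrative English rendering of the bigram candidate set.
bigrams = [
    ("core", "interests"), ("smart", "phone"), ("interests", "conflict"),
    ("semiconductor", "field"), ("Android", "operating"),
    ("operating", "system"), ("operating", "ecosystem"),
    ("attract", "talent"), ("talent", "shortage"),
]
trigrams = merge_candidates(bigrams)
```

Note that "operating system" and "operating ecosystem" are not merged with each other, because the shared unigram sits at the same position in both.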

The technical solution of the present invention merges the keywords extracted after fine-grained segmentation, so that the semantics of keywords that were split apart are restored, avoiding the semantic incompleteness caused by overly fine word segmentation.

Further, after the keywords in the n-gram candidate keyword set are merged, noise filtering is performed to obtain the (n+1)-gram result keyword set.

Noise filtering means removing keywords that do not conform to grammatical norms or are otherwise unsuitable.

In a preferred embodiment, the step of noise filtering after merging the keywords in the n-gram candidate keyword set comprises:

segmenting the text from which keywords are to be extracted into a unigram set, first filtering the noise in that set, then extracting the keywords in that set, and obtaining the highest word frequency max_count among those keywords;

In this embodiment, a word segmentation tool such as jieba, hanlp, stanfordNLP, or thulac can be used to segment the text from which keywords are to be extracted into a unigram set. Noise filtering is then applied to the words in this unigram set; specifically, the words can be filtered by part of speech, word frequency, and word length. Part-of-speech filtering removes adjectives, adverbs, prepositions, and the like, keeping only nouns and verbs. Word-frequency filtering removes words whose frequency in the text is greater than a maximum word frequency or less than a minimum word frequency. Word-length filtering removes words whose length is greater than a maximum length or less than a minimum length. For word-frequency and word-length filtering, the maximum word frequency, minimum word frequency, maximum length, and minimum length are derived from empirical data collected over tens of thousands or even tens of millions of samples, and filtering is then performed against these thresholds.
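The three unigram filters described above can be sketched as follows. This is a minimal sketch under stated assumptions: the input is a list of (word, part-of-speech tag) pairs already produced by some segmentation tool, the tags "n" and "v" for nouns and verbs are hypothetical, and all threshold defaults are placeholders rather than the empirically derived values the text refers to.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def filter_unigrams(
    tagged_tokens: Iterable[Tuple[str, str]],
    keep_pos: Tuple[str, ...] = ("n", "v"),  # hypothetical noun/verb tags
    min_count: int = 1,     # placeholder minimum word frequency
    max_count: int = 1000,  # placeholder maximum word frequency
    min_len: int = 2,       # placeholder minimum word length
    max_len: int = 8,       # placeholder maximum word length
) -> List[str]:
    """Keep distinct words that pass the part-of-speech, frequency, and length filters."""
    tagged = list(tagged_tokens)
    counts = Counter(word for word, _ in tagged)
    # Simplification: a word is judged by the tag of its first occurrence.
    first_pos: Dict[str, str] = {}
    for word, pos in tagged:
        first_pos.setdefault(word, pos)
    kept: List[str] = []
    for word, pos in first_pos.items():
        if pos not in keep_pos:
            continue  # part-of-speech filter
        if not min_count <= counts[word] <= max_count:
            continue  # word-frequency filter
        if not min_len <= len(word) <= max_len:
            continue  # word-length filter
        kept.append(word)
    return kept
```

For example, `filter_unigrams([("keyword", "n"), ("quickly", "d")])` keeps only "keyword", because the tag "d" is not in `keep_pos`.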

After the noise in the unigram set has been filtered out, a keyword extraction algorithm, such as the TF-IDF algorithm or the TextRank algorithm, is used to extract the keywords in the unigram set. The present invention preferably uses the TF-IDF algorithm. TF-IDF computes the weight of a word from the word's global statistic in the corpus, IDF (inverse document frequency), and its statistic in the current document, TF (term frequency), and takes the words with the highest weights as keywords. TF is the number of times the specified word occurs in the text. IDF is the logarithm of the ratio of the total number of documents in the corpus to the number of documents that contain the specified word. TF-IDF is the product of TF and IDF. The TF-IDF of each word in the document is computed and used as the word's weight for keyword selection. There are at least two ways to extract keywords with TF-IDF. Mode one, absolute value: every word in the set whose weight exceeds some fixed value is extracted as a keyword. Mode two, relative value: the top few words in the set by weight ranking are extracted as keywords.
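The TF-IDF weighting and the two selection modes can be sketched as follows. This is a minimal sketch using only the definitions given above (TF as a raw count, IDF as the logarithm of total documents over documents containing the word, with no smoothing). The function names and the representation of the corpus as a list of word sets are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Set


def tf_idf(doc_tokens: Sequence[str], corpus: Sequence[Set[str]]) -> Dict[str, float]:
    """Weight each word in the document by TF * IDF."""
    tf = Counter(doc_tokens)  # TF: raw occurrence count in the document
    n_docs = len(corpus)
    weights: Dict[str, float] = {}
    for word, count in tf.items():
        df = sum(1 for doc in corpus if word in doc)
        if df == 0:
            continue  # assumption: skip words unseen in the corpus
        weights[word] = count * math.log(n_docs / df)
    return weights


def select_absolute(weights: Dict[str, float], threshold: float) -> List[str]:
    """Mode one: keep every word whose weight exceeds a fixed value."""
    return [word for word, weight in weights.items() if weight > threshold]


def select_top_k(weights: Dict[str, float], k: int) -> List[str]:
    """Mode two: keep the k words with the highest weights."""
    return sorted(weights, key=weights.get, reverse=True)[:k]
```

A word that appears in every document of the corpus gets IDF = log(1) = 0 and therefore weight 0, which is why very common words are not selected under either mode.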

After the keywords in the unigram set are extracted, the unigram candidate keyword set is obtained. The highest word frequency max_count among the keywords in this set is found, and merged words whose word frequency is less than max_count/4 are filtered out. It should be noted that if more (n+1)-gram keywords are wanted, the filtering condition can be relaxed, for example by filtering out only merged words whose word frequency is less than max_count/5; if fewer (n+1)-gram keywords are wanted, the condition can be tightened, for example by filtering out merged words whose word frequency is less than max_count/3, and so on. In addition, words that are accidental combinations occurring only once are also filtered out, that is, merged words whose word frequency is less than 2.

Noise filtering yields an (n+1)-gram result keyword set with little noise and relatively high accuracy.
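The post-merge frequency filter can be sketched as follows. This is a minimal sketch, assuming the word frequency of each merged keyword in the text has already been counted and is passed in as a dictionary; the function name and the parameter names `divisor` and `min_count` are illustrative.

```python
from typing import Dict, Tuple

NGram = Tuple[str, ...]


def filter_merged(
    merged_counts: Dict[NGram, int],
    max_count: int,
    divisor: float = 4,  # 5 relaxes the filter, 3 tightens it
    min_count: int = 2,  # drops combinations that occur only once
) -> Dict[NGram, int]:
    """Drop merged keywords with frequency below max_count/divisor or below min_count."""
    threshold = max_count / divisor
    return {
        keyword: count
        for keyword, count in merged_counts.items()
        if count >= threshold and count >= min_count
    }
```

With `max_count = 12`, the default divisor keeps merged keywords that occur at least 3 times, while `divisor=3` raises the bar to 4 occurrences.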

Further, the method also comprises: removing, from the n-gram candidate keyword set, the n-gram candidate keywords contained in the (n+1)-gram result keywords, to obtain the n-gram result keyword set.

Continuing the example above, the bigram candidate keyword set is: {core interests, smart phone, interests conflict, semiconductor field, Android operating, operating system, operating ecosystem, attract talent, talent shortage}. The bigram candidate keywords contained in the trigram result keywords are removed. For example, "core interests conflict" contains "core interests" and "interests conflict", so "core interests" and "interests conflict" are removed from the bigram candidate keyword set, and so on. The resulting bigram result keyword set is {smart phone, semiconductor field}.
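The removal step can be sketched as follows. This is a minimal sketch, assuming keywords are tuples of unigram tokens and that "contained in" means appearing as a contiguous run of tokens inside a longer keyword; the function names and English tokens are illustrative.

```python
from typing import List, Tuple

NGram = Tuple[str, ...]


def contains(longer: NGram, shorter: NGram) -> bool:
    """True if `shorter` appears as a contiguous run of tokens in `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def remove_contained(candidates: List[NGram], results: List[NGram]) -> List[NGram]:
    """Keep only the candidates that no longer result keyword contains."""
    return [c for c in candidates if not any(contains(r, c) for r in results)]


bigram_candidates = [
    ("core", "interests"), ("smart", "phone"), ("interests", "conflict"),
    ("semiconductor", "field"), ("Android", "operating"),
    ("operating", "system"), ("operating", "ecosystem"),
    ("attract", "talent"), ("talent", "shortage"),
]
trigram_results = [
    ("core", "interests", "conflict"), ("Android", "operating", "system"),
    ("Android", "operating", "ecosystem"), ("attract", "talent", "shortage"),
]
bigram_results = remove_contained(bigram_candidates, trigram_results)
```

Only the two bigrams that took part in no merge remain in `bigram_results`.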

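As keywords are merged and grow longer, the error rate also rises. In a preferred embodiment, the method further comprises:

when the word length of a keyword in the (n+1)-gram result keyword set is greater than the maximum word length, removing the first or the last unigram word of the keyword to obtain an optimized keyword;

matching the optimized keyword against a restricted vocabulary and, if it matches, replacing the keyword in the (n+1)-gram result keyword set with the optimized keyword.

The restricted vocabulary can be customized as needed; for example, it may consist of the proper nouns of an input method and the full names and abbreviations of companies. The maximum word length can be determined empirically.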
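The length-limiting optimization (trim the first or last unigram word of an over-long keyword and accept the trimmed form only if it matches the restricted vocabulary) can be sketched as follows. This is a minimal sketch with several assumptions that the text does not settle: word length is taken as the total number of characters across the tokens, the first unigram is tried before the last, and a keyword with no matching trimmed form is left unchanged. The function name is illustrative.

```python
from typing import Set, Tuple

NGram = Tuple[str, ...]


def optimize_keyword(keyword: NGram, max_len: int, vocabulary: Set[NGram]) -> NGram:
    """Trim one unigram from an over-long keyword if the result is in the vocabulary."""
    # Assumption: word length is the total character count of the tokens.
    if sum(len(token) for token in keyword) <= max_len:
        return keyword
    # Assumption: try dropping the first unigram, then the last.
    for trimmed in (keyword[1:], keyword[:-1]):
        if trimmed in vocabulary:
            return trimmed
    # Assumption: with no vocabulary match, the keyword is kept as is.
    return keyword
```

For example, with a vocabulary containing ("operating", "system") and `max_len=15`, the keyword ("Android", "operating", "system") is replaced by ("operating", "system").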

Further, the step of "obtaining an n-gram candidate keyword set" comprises:

segmenting the text into an n-gram set, first filtering the noise in the set, and then extracting the keywords in the set to obtain the n-gram candidate keyword set. This step is similar to the step of obtaining the unigram candidate keyword set; the difference lies in how the noise is filtered. As n increases, there is more noise and the filtering becomes more complex.
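One way to build an n-gram set together with its word frequencies is sketched below. This is a minimal sketch, assuming the text has already been segmented into a list of unigram tokens and that the n-gram set is formed from runs of n adjacent tokens; the source does not specify how the segmentation tool produces the n-gram set, so this sliding-window construction is an assumption.

```python
from collections import Counter
from typing import Sequence, Tuple


def ngram_counts(tokens: Sequence[str], n: int) -> "Counter[Tuple[str, ...]]":
    """Count every run of n adjacent tokens in the segmented text."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


tokens = ["Android", "operating", "system", "Android", "operating", "ecosystem"]
unigram_counts = ngram_counts(tokens, 1)
bigram_counts = ngram_counts(tokens, 2)
```

The resulting counters provide the word frequencies that the noise-filtering steps compare against their thresholds.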

Taking n equal to 2 as an example, after the text is segmented into a bigram set, the step of filtering noise comprises:

Obtain the highest word frequency max_count in the unigram candidate keyword set, and filter out words in the bigram set whose word frequency is less than or equal to max_count/5. The value max_count/5 is adjustable; max_count/6 or max_count/4 may also be used.

Filter out words that are accidental combinations occurring only once, that is, words in the bigram set whose word frequency is less than 2.

In addition, a word in the bigram set contains two unigram words, a preceding word and a following word. Let x be the smaller of the word frequencies of the preceding word and the following word in the unigram candidate keyword set (for example, if the word frequency of the preceding word is 3 and that of the following word is 4, then x = 3). Filter out words in the bigram set whose word frequency is less than 2x/3.

Words in the bigram set that do not conform to the rules are also filtered out. Such words may be obviously ungrammatical (for example, ending in a preposition), may contain a preposition (such as "in tomorrow"), or may contain a unit word (such as "80 yuan"). Compared with the filtering of the unigram set, the filtering of the bigram set is more complex.
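The four bigram filters can be sketched together as follows. This is a minimal sketch, assuming bigram and unigram word frequencies are available as dictionaries, that a unigram missing from the unigram candidate set has frequency 0, and that the rule check is supplied by the caller as a predicate. The names `filter_bigrams`, `is_valid`, and `no_preposition` are hypothetical.

```python
from typing import Callable, Dict, Tuple

Bigram = Tuple[str, str]


def accept_all(bigram: Bigram) -> bool:
    """Default rule check that rejects nothing."""
    return True


def filter_bigrams(
    bigram_counts: Dict[Bigram, int],
    unigram_counts: Dict[str, int],
    max_count: int,
    divisor: float = 5,  # adjustable; 6 or 4 may also be used
    is_valid: Callable[[Bigram], bool] = accept_all,
) -> Dict[Bigram, int]:
    """Apply the frequency filters and the rule filter to the bigram set."""
    kept: Dict[Bigram, int] = {}
    for (front, back), count in bigram_counts.items():
        if count <= max_count / divisor:
            continue  # frequency less than or equal to max_count/5
        if count < 2:
            continue  # accidental combination that occurs only once
        # x: smaller unigram frequency of the two component words.
        x = min(unigram_counts.get(front, 0), unigram_counts.get(back, 0))
        if count < 2 * x / 3:
            continue  # rare relative to its component words
        if not is_valid((front, back)):
            continue  # does not conform to the rules
        kept[(front, back)] = count
    return kept


# Hypothetical rule: reject bigrams containing a listed preposition.
PREPOSITIONS = {"in", "at", "on"}


def no_preposition(bigram: Bigram) -> bool:
    return not any(word in PREPOSITIONS for word in bigram)
```

The 2x/3 test removes a bigram whose component words mostly occur apart from each other, which suggests the pairing is incidental rather than a fixed phrase.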

It should be noted that, when n is a positive integer greater than 2, the n-gram candidate keyword set can also be obtained by merging (n-1)-gram candidate keywords:

obtaining an (n-1)-gram candidate keyword set;

merging any two keywords in the (n-1)-gram candidate keyword set that contain the same (n-2)-gram word at different positions within the keyword, to obtain the n-gram candidate keyword set.

The principle of this embodiment has been explained above and is not repeated here.

The process of extracting text keywords with the above method is further explained below by way of a specific embodiment:

As shown in Fig. 2, the jieba segmentation tool is used to segment the text from which keywords are to be extracted (hereinafter "the text") into a unigram set and a bigram set respectively. The noise in the unigram set is filtered, and the keywords of the unigram set are extracted with TF-IDF to obtain the unigram candidate keyword set, from which the highest word frequency max_count is obtained. Using max_count and the filtering described above, the noise in the bigram set is filtered, and the keywords of the bigram set are extracted with TF-IDF to obtain the bigram candidate keyword set.

From the unigram candidate keyword set, the unigram candidate keywords contained in the bigram candidate keywords are removed to obtain the unigram result keyword set.

The keywords in the bigram candidate keyword set are merged and the noise is filtered (as described above) to obtain the trigram result keyword set. At the same time, the bigram candidate keywords contained in the trigram result keywords are removed from the bigram candidate keyword set to obtain the bigram result keyword set.

The result of extracting the keywords of the text is: the unigram result keyword set, the bigram result keyword set, and the trigram result keyword set.

Of course, a length limit can be applied to the final result: when the word length of a keyword in a result keyword set is greater than the maximum word length, the keyword is optimized in the manner described above.

The present invention also provides an electronic device comprising a memory and a processor, the memory storing a computer program that can run on the processor, wherein the processor, when executing the program, implements the steps of the above method for automatically extracting text keywords.

The present invention also provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the above method for automatically extracting text keywords.

It should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way merely for clarity; those skilled in the art should treat the specification as a whole, and the technical solutions in the embodiments may be suitably combined to form other embodiments that those skilled in the art can understand.

The detailed descriptions listed above are only specific illustrations of feasible embodiments of the present invention and are not intended to limit its scope of protection; all equivalent embodiments or modifications that do not depart from the technical spirit of the present invention shall fall within its scope of protection.

Claims

1. A method for automatically extracting text keywords, characterized in that the method comprises:

obtaining an n-gram candidate keyword set;

merging any two keywords in the n-gram candidate keyword set that contain the same (n-1)-gram word at different positions within the keyword, to obtain an (n+1)-gram result keyword set, where n is a positive integer greater than 1.

2. The method for automatically extracting text keywords according to claim 1, characterized in that the method further comprises:

removing, from the n-gram candidate keyword set, the n-gram candidate keywords contained in the (n+1)-gram result keywords, to obtain an n-gram result keyword set.

3. The method for automatically extracting text keywords according to claim 1, characterized in that the method further comprises:

when the word length of a keyword in the (n+1)-gram result keyword set is greater than the maximum word length, removing the first or the last unigram word of the keyword to obtain an optimized keyword;

matching the optimized keyword against a restricted vocabulary and, if it matches, replacing the keyword in the (n+1)-gram result keyword set with the optimized keyword.

4. The method for automatically extracting text keywords according to claim 1, characterized in that the step of "obtaining an n-gram candidate keyword set" comprises:

segmenting the text into an n-gram set, first filtering the noise in the set, and then extracting the keywords in the set to obtain the n-gram candidate keyword set.

5. The method for automatically extracting text keywords according to claim 4, characterized in that, when n is 2, that is, after the text is segmented into a bigram set, the step of filtering noise comprises:

segmenting the text into a unigram set, first filtering the noise in that set, then extracting the keywords in that set, and obtaining the highest word frequency max_count among those keywords;

filtering out words in the bigram set whose word frequency is less than 2;

where each word in the bigram set consists of a preceding word and a following word, and x is the smaller of the word frequencies of the preceding word and the following word in the unigram set, filtering out words in the bigram set whose word frequency is less than 2x/3;

filtering out words in the bigram set that do not conform to the rules.

6. The method for automatically extracting text keywords according to claim 1, characterized in that the step of "obtaining an n-gram candidate keyword set" comprises:

obtaining an (n-1)-gram candidate keyword set;

merging any two keywords in the (n-1)-gram candidate keyword set that contain the same (n-2)-gram word at different positions within the keyword, to obtain the n-gram candidate keyword set, where n is a positive integer greater than 2.

7. The method for automatically extracting text keywords according to claim 1, characterized in that:

after the keywords in the n-gram candidate keyword set are merged, noise filtering is performed to obtain the (n+1)-gram result keyword set.

8. The method for automatically extracting text keywords according to claim 7, characterized in that the step of noise filtering after merging the keywords in the n-gram candidate keyword set comprises:

9. An electronic device comprising a memory and a processor, the memory storing a computer program that can run on the processor, characterized in that the processor, when executing the program, implements the steps of the method for automatically extracting text keywords according to any one of claims 1-8.

10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method for automatically extracting text keywords according to any one of claims 1-8.