CN106557508A

CN106557508A - Text keyword extraction method and device

Info

Publication number: CN106557508A
Application number: CN201510629350.9A
Authority: CN
Inventors: 李国洋; 王庆磊; 梁德兴
Original assignee: Beijing Shenzhou Taiyue Software Co Ltd
Current assignee: Beijing Shenzhou Taiyue Software Co Ltd
Priority date: 2015-09-28
Filing date: 2015-09-28
Publication date: 2017-04-05

Abstract

The invention discloses a text keyword extraction method and device. The method comprises: performing word segmentation on the sentences of a text to be processed and setting an initial weight for each resulting word; modifying the initial weight according to whether each word is in a keyword library, to obtain a final weight for each word; and, starting from the word with the highest weight, extracting a predetermined number of words as the keywords of the text to be processed according to the final weights. The technical solution rests on the assumption that, among a large number of short texts, there exist short texts on the same topic and the keywords of short texts have a word frequency greater than 1. By building the keyword library, segmenting the text to be processed, setting word weights twice, and sorting words in descending order of weight, it improves the accuracy of keyword identification.

Description

A text keyword extraction method and device

Technical field

The present invention relates to the technical field of Chinese natural language processing, and in particular to a text keyword extraction method and device.

Background art

Chinese text keyword extraction is a technique for extracting the core words that express the central idea of an article. As a basic text processing technique, it has reached a very mature stage, and different disciplines have developed different approaches. The main ones are the TF-IDF (term frequency-inverse document frequency) algorithm, the TextRank algorithm, and classification-model identification. There is also a non-mainstream approach, part-of-speech inference, which infers whether a word is a keyword from the part-of-speech combinations in which keywords commonly occur.

The main principle by which the TF-IDF algorithm extracts keywords is as follows. Keywords are extracted by computing a positive word frequency and an inverse word frequency. The positive word frequency is the number of times a word occurs in the article: the larger it is, the more likely the word is a keyword and the higher its score. The inverse word frequency reflects how often the word occurs across all articles: a word that occurs in many articles is less likely to be a keyword of any one of them, so the more widely it occurs, the lower its score. In other words, the score is directly proportional to the word's frequency within the article and inversely proportional to its frequency across all articles. The advantages of TF-IDF are a simple formula and fast computation, and it suits single-topic articles well. Its disadvantage is that extraction results are unsatisfactory for more complex cases such as multi-topic articles and short texts.
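As a rough illustration of the scoring principle described above, the following minimal Python sketch scores the words of one article against a small corpus. The function name `tf_idf_scores`, the smoothed logarithmic form of the inverse frequency, and the toy corpus are assumptions made for illustration; the text above does not give a concrete formula.

```python
import math
from collections import Counter

def tf_idf_scores(doc, corpus):
    """Score each word of `doc` (a list of words) against `corpus` (a list of word lists)."""
    tf = Counter(doc)
    n_docs = len(corpus)
    scores = {}
    for word, count in tf.items():
        # Number of corpus documents that contain the word.
        df = sum(1 for d in corpus if word in d)
        # Smoothed inverse document frequency (an assumed, commonly used variant).
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        # Higher frequency in the article raises the score; wider spread lowers it.
        scores[word] = (count / len(doc)) * idf
    return scores

# Hypothetical, already segmented toy corpus.
corpus = [["network", "fault", "repair"], ["network", "bill", "query"], ["bill", "refund"]]
scores = tf_idf_scores(corpus[0], corpus)
```

In this toy corpus "network" appears in two documents and "fault" in only one, so "fault" receives the higher score for the first document.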

The TextRank algorithm is derived from the PageRank algorithm. PageRank is a web page recommendation algorithm, whereas TextRank is a word recommendation algorithm. TextRank extracts keywords as follows: each word votes for its adjacent words; if an adjacent word casts 5 votes for word A, word A correspondingly casts 5 votes for that word. In the end, words with a high frequency in the article receive more votes, and so do the words near them. The results of TextRank are close to those of TF-IDF, and its advantages and disadvantages in keyword extraction are essentially the same as those of TF-IDF.
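The voting idea can be sketched in Python as follows. This is a simplified, unweighted variant written for illustration only: the function name `text_rank`, the co-occurrence window, the damping factor of 0.85, and the fixed iteration count are assumptions and are not taken from the text above.

```python
from collections import defaultdict

def text_rank(words, window=2, damping=0.85, iterations=30):
    """Rank words by iterative voting over an undirected co-occurrence graph."""
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        # Link each word to the words that follow it within the window.
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    rank = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        new_rank = {}
        for w in neighbors:
            # Each neighbor splits its current rank evenly among its own neighbors.
            votes = sum(rank[n] / len(neighbors[n]) for n in neighbors[w])
            new_rank[w] = (1 - damping) + damping * votes
        rank = new_rank
    return rank

# Hypothetical segmented sentence in which "network" is the most frequent word.
ranks = text_rank(["network", "fault", "network", "repair", "network", "slow"], window=1)
```

Here "network" is adjacent to every other word, so it collects the most votes and ends up with the highest rank.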

The classification-model extraction algorithm differs from the TF-IDF and TextRank algorithms above in that it identifies keywords with a classification model rather than relying entirely on word frequency statistics, so the keywords it identifies differ considerably from those of TF-IDF and TextRank. It generally uses manually extracted keywords as training data and defines a number of features, such as the position, part of speech, and word frequency of a keyword; typically, words at the beginning of an article and nouns are more likely to be keywords. There are many classification algorithms, and the choice of algorithm strongly affects the extraction result. The advantages of this approach are that it can extract verbs and adjectives as keywords and that feature selection is flexible, so different features can be set for different forms of text. The disadvantages are that training data must be prepared manually, which makes labor costs high, and that, because the computation of a classification model is opaque, problems that arise during development are hard to resolve.

At present, work on Chinese text keyword extraction mostly uses long single-topic articles as experimental data. Such long articles look complicated, but because each has a single topic, even a simple algorithm such as TF-IDF can extract the keywords of its central topic. Chat records are short texts, but the chat records of a given period usually belong to one topic or a few topics; it is enough to merge multiple short texts and extract keywords from the result, and the keywords of the central topic can still be extracted well.

However, when a single short text describes several actions or things, keyword extraction becomes very difficult. An effective method for extracting keywords from short texts is therefore urgently needed.

Summary of the invention

In view of the above problems, the present invention proposes a text keyword extraction method and device that overcome the above problems or at least partially solve them. The technical solution of the present invention is implemented as follows:

In one aspect, the present invention proposes a text keyword extraction method, the method comprising:

performing word segmentation on the sentences of a text to be processed, and setting an initial weight for each word obtained by segmentation;

modifying the initial weight according to whether each word obtained by segmentation is in a keyword library, to obtain a final weight for each word;

starting from the word with the highest weight, extracting a predetermined number of words as the keywords of the text to be processed according to the final weights of the words.

Preferably, the keyword library is built in advance as follows:

extracting keywords from each history text in the history text data, and calculating a score for each keyword;

selecting the keywords whose score exceeds a score threshold and whose recommendation count exceeds a recommendation count threshold, and building the keyword library of the history texts;

identifying the part of speech of each keyword in the keyword library, removing the adjectives and modal particles from the keyword library, and setting a word weight for each keyword according to its part of speech; and, according to the application field of the text to be processed, adding or screening corresponding business words in the keyword library and setting word weights for the business words.

Preferably, performing word segmentation on the sentences of the text to be processed and setting the initial weight of each word obtained by segmentation comprises:

calculating a score for each word obtained by segmentation, and calculating the initial weight of each word from its score.

Further preferably, modifying the initial weight according to whether each word obtained by segmentation is in the pre-built keyword library, to obtain the final weight of each word, comprises:

judging whether each word obtained by segmentation is in the keyword library; if a word is in the keyword library, changing its initial weight to the sum of its initial weight and its word weight in the keyword library, and taking the changed weight as the final weight of the word; if a word is not in the keyword library, leaving its initial weight unchanged and taking the initial weight as the final weight of the word.

Preferably, extracting, starting from the word with the highest weight, a predetermined number of words as the keywords of the text to be processed according to the final weights of the words comprises:

determining the number of keywords of the text to be processed from the number of words obtained by segmenting it; when the number of words is less than a predetermined number, taking every word obtained by segmentation as a keyword of the text to be processed;

when the number of words is greater than the predetermined number, rounding the quotient of the number of words and a positive integer K, and taking the rounded value as the number of keywords of the text to be processed; and sorting the words obtained by segmentation in descending order of final weight, and extracting that number of words, starting from the word with the highest weight, as the keywords of the text to be processed.

Further preferably, when the number of words is greater than the predetermined number, the method further comprises:

determining, according to the text category, the proportion of nouns among the keywords of the text to be processed, and extracting that number of words, starting from the word with the highest weight and in accordance with the proportion of nouns, as the keywords of the text to be processed.

In another aspect, the present invention also proposes a text keyword extraction device, the device comprising:

a word segmentation processing unit, configured to perform word segmentation on the sentences of a text to be processed and to set an initial weight for each word obtained by segmentation;

a final weight calculation unit, configured to modify the initial weight according to whether each word obtained by segmentation is in a pre-built keyword library, to obtain a final weight for each word;

a keyword extraction unit, configured to extract, starting from the word with the highest weight, a predetermined number of words as the keywords of the text to be processed according to the final weights of the words.

Preferably, the device further comprises a keyword library construction unit;

the keyword library construction unit is configured to extract keywords from each history text in the history text data and calculate a score for each keyword; select the keywords whose score exceeds a score threshold and whose recommendation count exceeds a recommendation count threshold, and build the keyword library of the history texts; identify the part of speech of each keyword in the keyword library, remove the adjectives and modal particles from the keyword library, and set a word weight for each keyword according to its part of speech; and, according to the application field of the text to be processed, add corresponding business words to the keyword library and set word weights for the business words.

Preferably, the word segmentation processing unit is specifically configured to calculate a score for each word obtained by segmentation and to calculate the initial weight of each word from its score;

the final weight calculation unit is specifically configured to judge whether each word obtained by segmentation is in the keyword library; if a word is in the keyword library, to change its initial weight to the sum of its initial weight and its word weight in the keyword library and take the changed weight as the final weight of the word; if a word is not in the keyword library, to leave its initial weight unchanged and take the initial weight as the final weight of the word.

Preferably, the keyword extraction unit is configured to determine the number of keywords of the text to be processed from the number of words obtained by segmenting it; when the number of words is less than a predetermined number, to take every word obtained by segmentation as a keyword of the text to be processed; when the number of words is greater than the predetermined number, to round the quotient of the number of words and a positive integer K and take the rounded value as the number of keywords of the text to be processed; and to sort the words obtained by segmentation in descending order of final weight and extract that number of words, starting from the word with the highest weight, as the keywords of the text to be processed;

when the number of words is greater than the predetermined number, the keyword extraction unit is further configured to determine, according to the text category, the proportion of nouns among the keywords of the text to be processed, and to extract that number of words, starting from the word with the highest weight and in accordance with the proportion of nouns, as the keywords of the text to be processed.

The beneficial effects of the embodiments of the present invention are as follows. Based on the assumption that, among a large number of short texts, there exist short texts on the same topic and the keywords of short texts have a word frequency greater than 1, the invention builds a keyword library from history texts in advance and sets a weight for each keyword in the library according to its part of speech. When extracting the keywords of a text to be processed, it segments the sentences of the text and obtains an initial weight for each word; it modifies the initial weight according to whether the word is in the keyword library, which gives the final weight of each word; and, starting from the word with the highest weight, it extracts a determined number of keywords according to the final weights. Compared with the prior art, the invention significantly improves the accuracy of keyword extraction.

Brief description of the drawings

Fig. 1 is a flowchart of the text keyword extraction method provided by an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of the text keyword extraction device provided by an embodiment of the present invention.

Detailed description

To make the objects, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

First, the technical terms used in the detailed description are briefly explained:

Word frequency: the number of times a word occurs in an article.

Part of speech: a basic concept in semantic analysis; words are generally divided into parts of speech such as nouns, verbs, adjectives, and adverbs.

Weight: in search, weights are generally set to express different degrees of importance for different words. The more critical a word, the higher its weight; the less important a word, the lower its weight. A word whose weight is set to zero is ignored outright.

Word segmentation: in natural language processing, a sentence must be split into words before it can be processed; the process of splitting a sentence into words is word segmentation.

The overall design idea of the present invention is as follows. Each word of a short text carries little word frequency information, since in most short texts each word occurs only once; keywords extracted from word frequency information alone therefore rarely give good results, and classification-model keyword extraction is also seriously inadequate. The present invention assumes that, among a large number of short texts, there exist short texts on the same topic, and that the keywords of short texts have a word frequency greater than 1. On this basis, a batch of keywords is extracted from a large amount of history text data to build a keyword library, and the keywords of a text to be processed are then obtained by means of the keyword library together with word segmentation, weight calculation, and descending sorting of the text to be processed.

Embodiment one:

Before text keywords are extracted, this embodiment builds a keyword library in advance and sets a weight for each keyword in the keyword library.

Specifically, the keyword library is built in advance as follows:

Extract keywords from each history text in the history text data, and calculate a score for each keyword;

Select the keywords whose score exceeds the score threshold and whose recommendation count exceeds the recommendation count threshold, and build the keyword library of the history texts. For example, with a scoring scale of 0 to 100, the score threshold may be set to 80 and the recommendation count threshold to 10;

Identify the part of speech of each keyword in the keyword library, remove the adjectives and/or modal particles from the keyword library, and set a word weight for each keyword according to its part of speech; and, according to the application field of the text to be processed, add or screen corresponding business words in the keyword library and set word weights for the business words.

In practice, when this embodiment uses the TF-IDF and/or TextRank algorithm to extract keywords from each history text, the algorithm scores every word. Suppose the algorithm scores each word on a scale of 0 to 100. After all history texts have been scored in this way, this embodiment selects the words with a score above 80 and lists those recommended more than 10 times in the keyword library. The FNLP tool is then used to identify the part of speech of each keyword in the keyword library, and a weight is set according to part of speech; for example, the weight of nouns is set to 5, the weight of verbs to 4, and the weight of other parts of speech to 3. This completes the construction of the keyword library. However, the keywords that meet the above conditions may still include some common adjectives and modal particles, which need to be removed from the keyword library.

In addition, a batch of business nouns may be set depending on the application field. Such words are usually very important, and their weight may be set to 6.
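The construction of the keyword library can be sketched in Python as follows. The thresholds (80 and 10) and the weights (5, 4, 3, and 6) come from the example above; everything else is an assumption made for illustration: the function name `build_keyword_library`, the input format (one `(word, score)` pair per recommendation of a word on some history text), the reading that only recommendations scoring above the threshold are counted, and the hypothetical `pos_of` callable that stands in for a part-of-speech tagger such as FNLP.

```python
from collections import defaultdict

POS_WEIGHTS = {"noun": 5, "verb": 4}      # example weights from the text
OTHER_WEIGHT = 3                          # weight of all other parts of speech
BUSINESS_WEIGHT = 6                       # weight of business words
EXCLUDED_POS = {"adjective", "modal_particle"}

def build_keyword_library(scored_keywords, pos_of, business_words=(),
                          score_threshold=80, count_threshold=10):
    """Return {keyword: word weight} built from scored history keywords."""
    counts = defaultdict(int)
    for word, score in scored_keywords:
        # Assumption: only recommendations scoring above the threshold count.
        if score > score_threshold:
            counts[word] += 1
    library = {}
    for word, count in counts.items():
        pos = pos_of(word)  # hypothetical tagger callable
        # Keep frequently recommended words, dropping adjectives and modal particles.
        if count > count_threshold and pos not in EXCLUDED_POS:
            library[word] = POS_WEIGHTS.get(pos, OTHER_WEIGHT)
    # Business words for the application field get the highest weight.
    for word in business_words:
        library[word] = BUSINESS_WEIGHT
    return library
```

A dictionary keyed by word keeps the later lookup ("is this word in the keyword library, and with what weight?") a single constant-time operation.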

Once the keyword library has been built, keywords can be extracted from the text to be processed. Fig. 1 is a flowchart of the text keyword extraction method provided by an embodiment of the present invention. As shown in Fig. 1, the method includes:

S100: perform word segmentation on the sentences of the text to be processed, and set an initial weight for each word obtained by segmentation.

In this step, the sentences of the text to be processed may be segmented using the prior art, for example with the IKanalyzer word segmentation tool.

Here, performing word segmentation on the sentences of the text to be processed and setting the initial weight of each word obtained by segmentation includes:

Calculating a score for each word obtained by segmentation using the TF-IDF algorithm and/or the TextRank algorithm, and calculating the initial weight of each word from its score. For example, when the algorithm scores each word on a scale of 0 to 100, the score of the word may be converted to a value between 0 and 5.
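The score conversion in S100 can be sketched as follows. The text only states that a 0 to 100 score may be converted to a value between 0 and 5; the linear rescaling and the function name `initial_weights` are assumptions made for illustration.

```python
def initial_weights(scores, max_score=100.0, max_weight=5.0):
    """Linearly rescale scores in [0, max_score] to initial weights in [0, max_weight]."""
    return {word: score / max_score * max_weight for word, score in scores.items()}

# Hypothetical TF-IDF or TextRank scores on a 0 to 100 scale.
weights = initial_weights({"network": 90, "fault": 60, "the": 10})
```

Keeping the initial weight on the same 0 to 5 scale as the part-of-speech weights means that membership in the keyword library can change the ranking noticeably without completely overriding the algorithm's own score.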

S200: modify the initial weight according to whether each word obtained by segmentation is in the pre-built keyword library, to obtain a final weight for each word.

Here, modifying the initial weight according to whether each word obtained by segmentation is in the keyword library, to obtain the final weight of each word, includes:

Judging whether each word obtained by segmentation is in the keyword library. If a word is in the keyword library, its initial weight is changed to the sum of its initial weight and its word weight in the keyword library, and the changed weight is taken as the final weight of the word. If a word is not in the keyword library, its initial weight is left unchanged and is taken as the final weight of the word.
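Step S200 amounts to a dictionary lookup per word. A minimal sketch, assuming the keyword library is represented as a `{word: word weight}` mapping and using the illustrative function name `final_weights`:

```python
def final_weights(initial, keyword_library):
    """Add the library weight to words found in the keyword library; leave others unchanged."""
    return {word: weight + keyword_library.get(word, 0)
            for word, weight in initial.items()}

# Hypothetical initial weights and keyword library.
final = final_weights({"network": 4.5, "hello": 1.0}, {"network": 5, "repair": 4})
```

In this example "network" is in the library and ends with a final weight of 9.5, while "hello" keeps its initial weight of 1.0.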

S300, makees from the starting of weight highest word according to the word that the final weight of word extracts predetermined quantity For the key word of pending text.

Wherein, make from the starting of weight highest word according to the word that the final weight of word extracts predetermined quantity Key word for pending text includes：

The key word quantity of pending text is determined according to the word quantity after pending text participle, works as word Language quantity be less than predetermined quantity when, using each word after participle as pending text key word；

When word quantity is more than predetermined quantity, to word quantity, the business with positive integer K rounds, and will round Key word quantity of the numerical value for obtaining as pending text；And according to final weight descending arrangement point Each word after word, from weight highest word starting extract key word quantity word as described treating from The key word of reason text.

Assume that acquiescence at least extracts 4 key words, when the word lazy weight that pending text participle is obtained When 4, each word that participle is obtained is as the key word of pending text；When pending text When the word quantity that participle is obtained is more than 4, then the key word quantity of pending text can be according to following Formula is calculated：

Pending text key word quantity=[word quantity/K]

Wherein, for word number is obtained after pending text participle, K is the positive integer of setting to word quantity, [] is accorded with for rounding operation.
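The counting rule and the descending selection of S300 can be sketched as follows. The text does not say how the rounding operator rounds or what happens when the number of words exactly equals the predetermined number; this sketch assumes rounding down and treats the equal case like the "fewer" case. The function names `keyword_count` and `extract_keywords` are illustrative.

```python
def keyword_count(num_words, k, min_keywords=4):
    """Number of keywords to extract for a text with `num_words` segmented words."""
    if num_words <= min_keywords:      # assumption: equality handled like "fewer"
        return num_words
    return num_words // k              # assumption: [] rounds down

def extract_keywords(final, k, min_keywords=4):
    """Take the top words by final weight, in descending order of weight."""
    n = keyword_count(len(final), k, min_keywords)
    ranked = sorted(final, key=final.get, reverse=True)
    return ranked[:n]

# Hypothetical final weights for a six-word text, with K = 2.
final = {"network": 9.5, "fault": 8.0, "please": 2.5, "fix": 4.0, "the": 0.5, "today": 1.0}
keywords = extract_keywords(final, k=2)
```

With six words and K = 2, three keywords are extracted: "network", "fault", and "fix". Note that, as written, the formula can yield fewer keywords than the predetermined number when K is large; the sketch follows the formula and does not add a lower bound.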

In practice, this embodiment also sets the proportion of nouns among the keywords of the text to be processed, to improve how well the keywords express the central topic of the text.

Specifically, in step S300, when the number of words is greater than the predetermined number, the method in Fig. 1 further includes:

Determining, according to the text category, the proportion of nouns among the keywords of the text to be processed, and extracting that number of words, starting from the word with the highest weight and in accordance with the proportion of nouns, as the keywords of the text to be processed.
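The text does not spell out how the proportion of nouns is enforced, so the following Python sketch shows only one possible reading: reserve a quota of noun slots, fill it with the highest-weight nouns, fill the remaining slots with the highest-weight non-nouns, and top up from the overall ranking if either group runs short. The function name `extract_with_noun_ratio`, the quota rule, and the hypothetical `pos_of` tagger callable are all assumptions.

```python
def extract_with_noun_ratio(final, pos_of, n, noun_ratio):
    """Pick n keywords by final weight while reserving a share of the slots for nouns."""
    ranked = sorted(final, key=final.get, reverse=True)
    noun_quota = int(n * noun_ratio)   # assumption: quota rounded down
    nouns = [w for w in ranked if pos_of(w) == "noun"][:noun_quota]
    others = [w for w in ranked if pos_of(w) != "noun"][:n - len(nouns)]
    chosen = set(nouns) | set(others)
    # Top up from the overall ranking if either group ran short.
    for w in ranked:
        if len(chosen) >= n:
            break
        chosen.add(w)
    # Return the chosen words in descending order of final weight.
    return [w for w in ranked if w in chosen]

# Hypothetical final weights and part-of-speech tags.
final = {"network": 9, "repair": 8, "slow": 7, "fault": 6, "call": 5, "bill": 4}
pos = {"network": "noun", "repair": "verb", "slow": "adjective",
       "fault": "noun", "call": "verb", "bill": "noun"}
half_nouns = extract_with_noun_ratio(final, pos.get, n=4, noun_ratio=0.5)
mostly_nouns = extract_with_noun_ratio(final, pos.get, n=4, noun_ratio=0.75)
```

Raising the ratio from 0.5 to 0.75 replaces the adjective "slow" with the lower-weight noun "bill", which is the effect the noun proportion is meant to have for text categories where nouns carry the topic.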

The present invention is particularly suitable for extracting keywords from short texts.

The embodiments of the present invention are based on the assumption that, among a large number of short texts, there exist short texts on the same topic and the keywords of short texts have a word frequency greater than 1. A keyword library is built from history texts in advance, and a weight is set for each keyword in the library according to its part of speech. When the keywords of a text to be processed are extracted, the sentences of the text are first segmented and an initial weight is obtained for each word; the initial weight of each word is then adjusted according to whether the word is in the keyword library, which gives the final weight of each word; finally, the keywords of the text to be processed are obtained by sorting.

Embodiment two:

This embodiment uses the technical solution provided in Embodiment one to extract the keywords of a short text and compares the result with the prior art, to show the advantages of the technical solution of the present invention more intuitively.

This embodiment takes a given short text; the keywords of this short text, obtained manually in advance, serve as the standard for comparison.

The method provided in Embodiment one of the present invention and the prior-art TF-IDF, TextRank, and classification-model keyword extraction methods are each used to extract keywords from the above short text, and the keywords extracted by each method are evaluated. The evaluation results are shown in Table 1:

Table 1

Keyword extraction method	Accuracy
TF-IDF	50.1%
TextRank	49.8%
Classification model	48.0%
Technical solution of the present invention	64.0%

As the data in Table 1 show, the accuracy of the keywords extracted by the technical solution of the present invention is well above that of the prior-art TF-IDF, TextRank, and classification-model keyword extraction methods. Although an accuracy of 64% is not very high, compared with the roughly 50% achieved by the prior art, the technical solution of the present invention considerably improves the accuracy of keyword extraction from short texts, and more significant effects can be expected when the invention is applied in actual projects.

Embodiment three:

Based on the same technical concept as the above text keyword extraction method, Embodiment three of the present invention also provides a text keyword extraction device.

Fig. 2 is a schematic structural diagram of the text keyword extraction device provided by Embodiment three of the present invention. The device in Fig. 2 includes:

A word segmentation processing unit 21, configured to perform word segmentation on the sentences of the text to be processed and to set an initial weight for each word obtained by segmentation.

The word segmentation processing unit 21 is specifically configured to calculate a score for each word obtained by segmentation using the TF-IDF and/or TextRank algorithm, and to calculate the initial weight of each word from its score.

A final weight calculation unit 22, configured to modify the initial weight according to whether each word obtained by segmentation is in the pre-built keyword library, to obtain a final weight for each word.

The final weight calculation unit 22 is specifically configured to judge whether each word obtained by segmentation is in the keyword library; if a word is in the keyword library, to change its initial weight to the sum of its initial weight and its word weight in the keyword library and take the changed weight as the final weight of the word; if a word is not in the keyword library, to leave its initial weight unchanged and take the initial weight as the final weight of the word.

A keyword extraction unit 23, configured to extract, starting from the word with the highest weight, a predetermined number of words as the keywords of the text to be processed according to the final weights of the words.

The keyword extraction unit 23 is specifically configured to determine the number of keywords of the text to be processed from the number of words obtained by segmenting it; when the number of words is less than a predetermined number, every word obtained by segmentation is taken as a keyword of the text to be processed;

When the number of words is greater than the predetermined number, the quotient of the number of words and a positive integer K is rounded, and the rounded value is taken as the number of keywords of the text to be processed; the words obtained by segmentation are sorted in descending order of final weight, and that number of words is extracted, starting from the word with the highest weight, as the keywords of the text to be processed;

When the number of words is greater than the predetermined number, the keyword extraction unit is further configured to determine, according to the text category, the proportion of nouns among the keywords of the text to be processed, and to extract that number of words, starting from the word with the highest weight and in accordance with the proportion of nouns, as the keywords of the text to be processed.

The device in Fig. 2 further includes a keyword library construction unit. The keyword library construction unit is configured to extract keywords from each history text in the history text data using the TF-IDF and/or TextRank algorithm and calculate a score for each keyword; select the keywords whose score exceeds the score threshold and whose recommendation count exceeds the recommendation count threshold, and build the keyword library of the history texts; identify the part of speech of each keyword in the keyword library, remove the adjectives and/or modal particles from the keyword library, and set a word weight for each keyword according to its part of speech; and, according to the application field of the text to be processed, add or screen corresponding business words in the keyword library and set word weights for the business words.
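One way to picture how units 21, 22, and 23 cooperate is a small Python class with one method per unit. This is an illustrative sketch rather than the structure of Fig. 2: the class name `KeywordExtractionDevice`, the injected `segment` and `score` callables (standing in for a word segmentation tool and a TF-IDF or TextRank scorer), the linear 0 to 100 to 0 to 5 conversion, and the rounding-down rule are all assumptions, and the noun proportion step is omitted for brevity.

```python
class KeywordExtractionDevice:
    def __init__(self, segment, score, keyword_library, k=2, min_keywords=4):
        self.segment = segment                  # hypothetical: text -> list of words
        self.score = score                      # hypothetical: words -> {word: 0..100}
        self.keyword_library = keyword_library  # {word: word weight}
        self.k = k
        self.min_keywords = min_keywords

    def segmentation_unit(self, text):
        """Unit 21: segment the text and derive initial weights from scores."""
        scores = self.score(self.segment(text))
        return {w: s / 100 * 5 for w, s in scores.items()}

    def final_weight_unit(self, initial):
        """Unit 22: add the library weight to words found in the keyword library."""
        return {w: v + self.keyword_library.get(w, 0) for w, v in initial.items()}

    def extraction_unit(self, final):
        """Unit 23: take the top words in descending order of final weight."""
        total = len(final)
        n = total if total <= self.min_keywords else total // self.k
        return sorted(final, key=final.get, reverse=True)[:n]

    def extract(self, text):
        initial = self.segmentation_unit(text)
        return self.extraction_unit(self.final_weight_unit(initial))

# Toy wiring: whitespace segmentation and a flat score of 50 for every word.
device = KeywordExtractionDevice(
    segment=str.split,
    score=lambda words: {w: 50 for w in words},
    keyword_library={"network": 5, "fault": 5},
)
result = device.extract("please fix the network fault at home today now")
```

Because every word receives the same initial weight in this toy setup, the two words found in the keyword library rise to the top of the ranking, which is the effect the second weighting step is meant to have.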

In summary, the invention discloses a text keyword extraction method and device. Based on the assumption that, among a large number of short texts, there exist short texts on the same topic and the keywords of short texts have a word frequency greater than 1, the invention builds a keyword library from history texts in advance and sets a weight for each keyword in the library according to its part of speech. When extracting the keywords of a text to be processed, it segments the sentences of the text and obtains an initial weight for each word; it modifies the initial weight according to whether the word is in the keyword library, which gives the final weight of each word; and, starting from the word with the highest weight, it extracts a determined number of keywords according to the final weights. Compared with the prior art, the invention significantly improves the accuracy of keyword extraction.

The foregoing are merely preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention falls within its scope of protection.

Claims

1. A text keyword extraction method, characterized in that the method comprises:

modifying the initial weight according to whether each word obtained by segmentation is in a pre-built keyword library, to obtain a final weight for each word;

2. The text keyword extraction method according to claim 1, characterized in that the keyword library is built in advance as follows:

3. The text keyword extraction method according to claim 1, characterized in that performing word segmentation on the sentences of the text to be processed and setting the initial weight of each word obtained by segmentation comprises:

4. The text keyword extraction method according to claim 3, characterized in that modifying the initial weight according to whether each word obtained by segmentation is in the pre-built keyword library, to obtain the final weight of each word, comprises:

5. The text keyword extraction method according to claim 1, characterized in that extracting, starting from the word with the highest weight, a predetermined number of words as the keywords of the text to be processed according to the final weights of the words comprises:

6. The text keyword extraction method according to claim 5, characterized in that, when the number of words is greater than the predetermined number, the method further comprises:

7. A text keyword extraction device, characterized in that the device comprises:

a keyword extraction unit, configured to extract, starting from the word with the highest weight, a predetermined number of words as the keywords of the text to be processed according to the final weights of the words.

8. The text keyword extraction device according to claim 7, characterized in that the device further comprises a keyword library construction unit;

the keyword library construction unit is configured to extract keywords from each history text in the history text data and calculate a score for each keyword; select the keywords whose score exceeds a score threshold and whose recommendation count exceeds a recommendation count threshold, and build the keyword library of the history texts; identify the part of speech of each keyword in the keyword library, remove the adjectives and modal particles from the keyword library, and set a word weight for each keyword according to its part of speech; and, according to the application field of the text to be processed, add or screen corresponding business words in the keyword library and set word weights for the business words.

9. The text keyword extraction device according to claim 7, characterized in that

the word segmentation processing unit is specifically configured to calculate a score for each word obtained by segmentation and to calculate the initial weight of each word from its score;

10. The text keyword extraction device according to claim 7, characterized in that the keyword extraction unit is specifically configured to determine the number of keywords of the text to be processed from the number of words obtained by segmenting it; when the number of words is less than a predetermined number, every word obtained by segmentation is taken as a keyword of the text to be processed;

when the number of words is greater than the predetermined number, the quotient of the number of words and a positive integer K is rounded, and the rounded value is taken as the number of keywords of the text to be processed; the words obtained by segmentation are sorted in descending order of final weight, and that number of words is extracted, starting from the word with the highest weight, as the keywords of the text to be processed;