CN108763213A - Theme feature text key word extracting method - Google Patents


Info

Publication number
CN108763213A
CN108763213A (application CN201810516408.2A)
Authority
CN
China
Prior art keywords
word
text
vocabulary
theme
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810516408.2A
Other languages
Chinese (zh)
Inventor
彭易锦
代翔
黄细凤
王侃
杨拓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 10 Research Institute
Southwest Electronic Technology Institute No 10 Institute of Cetc
Original Assignee
Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Electronic Technology Institute No 10 Institute of Cetc filed Critical Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority to CN201810516408.2A priority Critical patent/CN108763213A/en
Publication of CN108763213A publication Critical patent/CN108763213A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a theme-feature text keyword extraction method that yields better keyword extraction results than the traditional TF-IDF method. The technical scheme is as follows: in the training stage, the training texts are preprocessed by word segmentation, stop-word removal, and part-of-speech filtering; the inverse document frequency of each word is counted while a topic model learning method is used to obtain the word-topic probability matrix, which is normalized; the topic distribution entropy of each word is computed from the word-topic probability matrix, and the global weight of each word is computed by combining its inverse document frequency with its topic distribution entropy. The global weight results are passed to the test stage, where, after the same preprocessing of the test text, the normalized term frequency of each word in the test text is counted and combined with the global weights obtained in the training stage to compute a comprehensive score for each word. The words are ranked by score, and the several highest-scoring words are taken as the automatically extracted keywords of the current test text.

Description

Theme-feature text keyword extraction method
Technical field
The invention belongs to the field of natural language processing, and in particular relates to a text keyword extraction method based on the topic distribution features of words.
Background technology
Keyword extraction is key to technologies such as information retrieval, text classification and clustering, and automatic summarization, and is an important means of quickly grasping the subject matter of a document. Keywords are traditionally defined as a group of words or phrases that can summarize the subject content of a document. Keywords characterize a document's topics and critical content and are the smallest units that express the core content of a text. They have very important applications in many fields, such as automatic summarization, Web page information extraction, document classification and clustering, and search engines. In most cases, however, a text does not directly provide its keywords, so keyword extraction methods must be designed. The purpose of keyword extraction is to derive from the text the feature words that reflect its main content and meaning. A typical text keyword extraction method first extracts the feature words of the text, then computes a weight for each feature word according to some rule, and finally determines, according to these weights, the keywords that reflect the text's subject content. Because Internet resources are constantly updated and Chinese texts are growing explosively, extracting keywords manually is time-consuming and somewhat subjective; it is therefore necessary to study methods that can extract keywords from documents automatically. Keyword extraction, also called keyword annotation, is the process of extracting from a text the words or phrases most relevant to the ideas the text expresses; automatic keyword extraction is the automation technology for identifying or annotating the representative words or phrases in a document. Automatic text keyword extraction has always been a key problem and research hotspot in the field of natural language processing. With the ever-increasing demand for text data applications, many automatic keyword extraction methods have been proposed in recent years, and some have achieved good results for keyword extraction in specific domains; however, general-purpose automatic keyword extraction methods independent of language and domain still require further study. At present, some keyword extraction systems are implemented with a single method, while others combine several methods. By core method, they can be summarized into the following representative categories:
1) Thesaurus-based methods. These establish a thesaurus in a specific domain and compute word weights by combining the thesaurus with factors such as word length and word frequency. Such methods are limited by the background dictionary, which makes the extracted keywords insufficiently comprehensive.
2) Word-sense-based methods. These use a rule base or a synonym dictionary to annotate word senses, perform word sense disambiguation, and compute word weights from the disambiguation results. Such methods are directly affected by the quality of the user-built rule base, and the extra work of word sense disambiguation and synonym identification makes extraction relatively inefficient.
3) Statistics-based methods. These are currently the most widely used. They extract document keywords from statistical information about words: certain features of each word, such as TF, DF, TF-IDF, or information entropy, are computed and combined with positional features, such as occurrence in titles or section headings, to assign weights, and keywords are extracted in order of weight. The approach is relatively simple and generally requires neither training data nor an external knowledge base: simple statistical rules such as part-of-speech filtering and word frequency can be used to screen a candidate keyword set, and the candidates are then evaluated with some statistic to realize keyword extraction. Its disadvantages are that the computation can be heavy, extraction results may include semantically incomplete strings and thus lower accuracy, low-frequency words cannot be extracted, and a large amount of raw text is required.
4) Topic-model-based methods. A topic model is a probabilistic language model that simulates human writing: a document is a mixture of several topics, and each topic is a probability distribution over the vocabulary. The more pronounced a word's topic features are within a document, the stronger its ability to represent a particular topic. The topic weights of words are computed with a topic model to obtain a word-topic matrix; then, under each topic, the several highest-weighted words are selected as that topic's keywords.
5) Complex-network-based methods. These are unsupervised: feature words are taken as nodes and the relationships between them as edges to construct a linguistic network graph, which is then analyzed to find the words or phrases playing a central role; these are the document's keywords. The words of a document are built into a network according to given rules, and, after verifying the small-world property of the network, the nodes with a strong influence on the network's average path length are extracted as keywords. This method fails to explain the relationship between keywords and the increment of average path length, network connectivity is often hard to guarantee, and the computation is heavy.
6) Neural-network-based methods. Current explorations are mostly based on word vector representations and rest on two assumptions: first, that the words of a document unfold around its keywords, which embody the central idea of the article; second, that most words in an article are semantically similar to its keywords. Research on neural-network-based keyword extraction is still at an early stage. Although keyword extraction techniques have made significant progress in recent years, extraction results are still far from satisfactory.
Among the above methods, statistics-based methods are the earliest studied and most widely applied. They focus on statistical properties, generalize well, are easy to implement, and are independent of language and domain. The most typical is the term frequency-inverse document frequency method (Term Frequency-Inverse Document Frequency, TF-IDF). TF-IDF evaluates how important a word is to a document: TF, the term frequency, measures the word's ability to describe the content of a document; IDF, the inverse document frequency, measures the word's ability to distinguish documents. TF-IDF rests on the basic assumption that a word occurring many times in one text will also occur many times in other texts of the same class, and vice versa. It also considers a word's ability to distinguish classes: the smaller a word's document frequency, the greater its discriminating power. When word weights are computed with the TF-IDF algorithm, a word that occurs frequently in one document but rarely in others is considered highly discriminative for that document and receives a large weight. The advantages of TF-IDF are that it is simple and fast and its results largely match reality. However, the traditional TF-IDF method measures the importance of a word by term frequency alone: it uses only the term-frequency feature, ignores how words are distributed across classes or topics in the document collection, and cannot reflect features such as part of speech. As a result, some low-frequency words that cannot represent the text receive very high IDF values, while some high-frequency words that represent the text well receive very low IDF values. In essence, IDF is a weighting that tries to suppress noise by simply assuming that words with low document frequency are more important and those with high document frequency are less useful, which is clearly not entirely correct. IDF cannot effectively reflect the importance of words or the distribution of feature words, so it cannot adjust weights well, and the precision of the TF-IDF method is therefore not very high.
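For comparison, the traditional TF-IDF weighting described above can be sketched as follows. This is a minimal stand-in, not the patent's method; the tokenized toy documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Plain TF-IDF over tokenized documents (lists of words):
    weight(w, d) = (count of w in d / length of d) * log(N / df(w))."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))           # document frequency: one count per doc
    idf = {w: math.log(n / df[w]) for w in df}
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weights.append({w: (tf[w] / total) * idf[w] for w in tf})
    return weights

docs = [["topic", "model", "keyword"],
        ["keyword", "extraction", "text"],
        ["topic", "distribution", "entropy"]]
w = tf_idf(docs)
# "keyword" appears in 2 of 3 docs, so its IDF is log(3/2)
```

Note how a word appearing in every document gets IDF log(1) = 0 and is suppressed entirely, which is exactly the blunt behavior the passage above criticizes.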
Summary of the invention
Aiming at the traditional TF-IDF method's shortcoming of using only the term-frequency feature and neglecting the class and topic distribution characteristics of words, the present invention provides a keyword extraction method with high extraction efficiency and higher accuracy that makes full use of the topic distribution characteristics of words.
To achieve the above object, the theme-feature text keyword extraction method proposed by the present invention is characterized by the following steps. Taking text as the carrier of information, text keyword extraction is divided, according to topic distribution features, into a training stage and a test stage. The training-stage training text preprocessing, inverse document frequency computation, topic model learning, and global weight computation modules, together with the test-stage test text preprocessing, local weight computation, and comprehensive score computation and ranking modules, form the text keyword extraction model. The training text preprocessing module applies Chinese word segmentation, stop-word removal, and part-of-speech filtering to the input training text data in turn, then feeds the preprocessed training text data to the topic model learning module and the inverse document frequency computation module. The topic model learning module learns the topic distribution features of words from the preprocessed training text data automatically and without supervision, obtaining through training a word-topic matrix that reflects the probability distribution of each word over the topics. The inverse document frequency computation module counts, for each word, the number of training documents containing it, computes the ratio of the total number of training documents to that count, and takes the logarithm of the ratio as the word's inverse document frequency. The global weight computation module takes the inverse document frequency results and the word-topic matrix obtained by the topic model learning module, computes each word's topic distribution entropy from the word-topic probability matrix, multiplies the reciprocal of the entropy by the word's inverse document frequency to obtain the word's global weight, and sends the global weight results to the comprehensive score computation and ranking module of the test stage. The test text preprocessing module applies Chinese word segmentation, stop-word removal, and part-of-speech filtering to the input test text in turn and feeds the preprocessed test text to the local weight computation module, which counts the normalized term frequency of each word in the test text and takes it as the word's local weight. The comprehensive score computation and ranking module multiplies each word's global weight from the training stage by its local weight, computes and ranks the comprehensive scores, and takes the several highest-scoring words as the keyword extraction result for the current test text.
Compared with the prior art, the present invention has the following notable advantages:
Improved working efficiency. Taking text as the information carrier, the invention learns the global weights of words automatically from training text data, and the learned results are used for automatic keyword extraction from test texts. This reduces the influence of subjective human factors in manual keyword selection, reduces manual workload, and improves efficiency. The training part of the text keyword extraction model, composed of the text preprocessing, inverse document frequency computation, topic model learning, and global weight computation modules, is simple to implement and fast to run.
High extraction accuracy. In the course of automatic keyword extraction, information entropy is introduced on the basis of the traditional TF-IDF method: the normalized topic distribution entropy of each word is computed and combined with the traditional IDF, achieving higher-accuracy keyword extraction and overcoming the traditional TF-IDF method's failure to consider the class or topic distribution features of words. Meanwhile, during text preprocessing, Chinese word segmentation, stop-word removal, and part-of-speech filtering are applied to the input training texts in turn; filtering useless words by their part of speech avoids the adverse effect of frequently occurring useless words on the keyword extraction result.
Good scalability. When computing the global weights of words in the training stage, the invention learns the topic distribution features of words automatically and without supervision using a topic model, requiring no additional labeled data. This gives the method good scalability and makes extended applications of the keyword extraction model possible.
The present invention provides a complete processing method and workflow for text keywords. On experimental data, its automatic keyword extraction results improve considerably on the traditional TF-IDF method, giving it good practicality and strong engineering value.
Description of the drawings
For a clearer understanding of the present invention, it will now be described by way of specific embodiments with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart of the theme-feature text keyword extraction method of the present invention.
Fig. 2 is a flowchart of the training text preprocessing module in Fig. 1.
Fig. 3 is a flowchart of the global weight computation module in Fig. 1.
The invention is further described below with reference to the accompanying drawings.
Detailed description of the embodiments
Referring to Fig. 1, and in accordance with the present invention, a theme-feature text keyword extraction method comprises the following steps. Taking text as the carrier of information, text keyword extraction is divided, according to topic distribution features, into a training stage and a test stage; the training-stage training text preprocessing, inverse document frequency computation, topic model learning, and global weight computation modules, together with the test-stage test text preprocessing, local weight computation, and comprehensive score computation and ranking modules, form the text keyword extraction model. The training text preprocessing module applies Chinese word segmentation, stop-word removal, and part-of-speech filtering to the input training text data in turn, then feeds the preprocessed training text data to the topic model learning module and the inverse document frequency computation module. The topic model learning module learns the topic distribution features of words from the preprocessed training text data automatically and without supervision, obtaining through training a word-topic matrix that reflects the probability distribution of each word over the topics. The inverse document frequency computation module counts, for each word, the number of training documents containing it, computes the ratio of the total number of training documents to that count, and takes the logarithm of the ratio as the word's inverse document frequency. The global weight computation module computes each word's topic distribution entropy from the word-topic matrix, multiplies the reciprocal of the entropy by the word's inverse document frequency to obtain the word's global weight, and sends the global weight results to the comprehensive score computation and ranking module of the test stage. The comprehensive score computation and ranking module multiplies each word's global weight from the training stage by the local weight obtained by the local weight computation module, computes and ranks the comprehensive scores, and takes the several highest-scoring words as the keyword extraction result for the current test text.
Between the text preprocessing module and the global weight computation module, the inverse document frequency computation module and the topic model learning module form two parallel computation paths, computing respectively the document distribution features and the topic distribution features of words.
When computing the global weights of words, the global weight computation module introduces information entropy on the basis of the traditional inverse document frequency (IDF) computation: after normalizing each word's topic probability distribution, it computes the word's topic distribution entropy and combines it with the traditional IDF value to obtain the word's global weight.
The test stage contains the test text preprocessing module, the local weight computation module, and the comprehensive score computation and ranking module. The test text preprocessing module applies Chinese word segmentation, stop-word removal, and part-of-speech filtering to the input test text in turn and feeds the preprocessed test text to the local weight computation module. The local weight computation module counts, from the preprocessing result, the normalized term frequency of each word in the test text, takes it as the word's local weight, and sends the local weight results to the comprehensive score computation and ranking module. That module multiplies each word's global weight from the training stage by its local weight from the test stage, computes and ranks the comprehensive scores, and takes the several highest-scoring words as the keyword extraction result for the current test text.
Referring to Fig. 2, the training text preprocessing module first applies a Chinese word segmentation algorithm to the several input training texts, obtaining a segmentation result consisting of the list of all words in the training texts together with each word's part-of-speech tag. Segmentation is performed with the open-source toolkit FudanNLP. Stop words are then removed according to a stop-word list: the word list containing M words is compared with the stop-word list, and any word found in the stop-word list is deleted. Part-of-speech filtering is then applied according to a filtered-POS table: the word list is compared with the table, and all words whose part of speech appears in it are deleted, yielding the preprocessed word list word = {word_1, ..., word_j, ..., word_M}, which is output to the inverse document frequency computation module and the topic model learning module, where j is the index of the j-th of the M words.
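The preprocessing chain of segmentation, stop-word removal, and part-of-speech filtering can be sketched as below. FudanNLP is a Java toolkit, so a POS-tagged token list is assumed as input here; the example tags, stop-word list, and filtered-POS set are hypothetical stand-ins.

```python
def preprocess(tagged_tokens, stopwords, drop_pos):
    """Filter a POS-tagged token list: drop stop words and words
    whose part-of-speech tag is in the filtered-POS set."""
    return [w for w, pos in tagged_tokens
            if w not in stopwords and pos not in drop_pos]

# toy tagged tokens: (word, POS tag); "u" stands in for a particle tag
tagged = [("主题", "n"), ("的", "u"), ("模型", "n"), ("学习", "v")]
kept = preprocess(tagged, stopwords={"的"}, drop_pos={"u"})
# → ["主题", "模型", "学习"]
```

In a full pipeline the tagged input would come from the segmenter's output; only the surviving word list is passed on to the IDF and topic-model modules.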
The inverse document frequency computation module scans the word list word = {word_1, ..., word_j, ..., word_M}, collects the distinct words that occur into a dictionary w = (w_1, ..., w_i, ..., w_N), then takes each dictionary word w_i in turn and counts the number df_i of input training documents containing w_i. The inverse document frequency is idf_i = log(Num_doc / df_i), the logarithm of the ratio between the total number Num_doc of training documents and the number df_i of documents containing w_i. This process is repeated until the inverse document frequencies of all dictionary words have been computed, forming the inverse document frequency vector idf = (idf_1, ..., idf_i, ..., idf_N), where w_i denotes the i-th word in the dictionary and i = 1, 2, ..., N, with N the vocabulary size.
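Under the definition above, with idf_i as the log of the ratio of the total document count to the count of documents containing w_i, the module can be sketched as follows (toy documents invented for illustration):

```python
import math
from collections import Counter

def inverse_document_frequency(train_docs):
    """idf_i = log(Num_doc / df_i) for each dictionary word w_i,
    where df_i counts documents containing w_i (not raw occurrences)."""
    num_doc = len(train_docs)
    df = Counter()
    for doc in train_docs:
        df.update(set(doc))   # set(): at most one count per document
    return {w: math.log(num_doc / df[w]) for w in df}

idf = inverse_document_frequency([["a", "b"], ["b", "c"], ["b"]])
# df: a=1, b=3, c=1, so idf["b"] = log(3/3) = 0
```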
According to the word list word = {word_1, ..., word_j, ..., word_M} obtained by the training text preprocessing module, the topic model learning module performs training with a common topic model learning method, such as Latent Dirichlet Allocation (LDA), to obtain the probability matrix P between words and topics.
The probability matrix P has N rows and K columns, where N is the number of words in the dictionary and K is the manually set number of topics. P reflects the probability distribution of each word over the topics; its rows p_1, p_2, ..., p_N are the topic probability distribution vectors of the individual words.
LDA is a document topic generation model, also known as a three-layer Bayesian probability model, and an unsupervised machine learning technique. It contains the three-layer structure of words, topics, and documents and can be used to identify latent topic information in large-scale document collections or corpora. It uses the bag-of-words approach, which treats each document as a term-frequency vector, thereby converting text information into numerical information that is easy to model.
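In practice the word-topic matrix P would come from an LDA implementation's topic-term distributions; the toy sketch below only shows the N-by-K layout assumed here, with invented probabilities and vocabulary.

```python
def word_topic_matrix(topic_term_dists, vocab):
    """Build the N-by-K word-topic matrix P from K topic-term
    distributions: P[i][j] = probability of word_i under topic_j."""
    return [[dist[w] for dist in topic_term_dists] for w in vocab]

# toy distributions over a 3-word vocabulary, K = 2 topics
vocab = ["topic", "model", "entropy"]
dists = [{"topic": 0.6, "model": 0.3, "entropy": 0.1},
         {"topic": 0.1, "model": 0.2, "entropy": 0.7}]
P = word_topic_matrix(dists, vocab)
# P[0] = [0.6, 0.1]  (row for "topic": its probability under each topic)
```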
Referring to Fig. 3, the global weight computation module takes a row of the word-topic matrix obtained by the topic model learning module, denoted p_i = (p_i1, ..., p_ij, ..., p_iK), where p_i is the probability distribution of the i-th word over the topics and p_ij is the probability of the i-th word under the j-th topic. The row p_i is normalized to obtain the normalized topic probability distribution vector q_i, whose components are computed as q_ij = p_ij / (p_i1 + ... + p_iK).
The global weight computation module uses the normalized topic probability distribution vector q_i to compute the information entropy ent_i = -(q_i1 log q_i1 + ... + q_iK log q_iK). Information entropy measures the expected information of a random variable: the larger a variable's entropy, the more varied its possible outcomes and the more content it carries. A larger ent_i indicates that the word is distributed more uniformly over the topics, that is, the word has no obvious topic tendency and is less likely to be a keyword; conversely, a smaller entropy indicates a stronger topic tendency and a greater likelihood that the word is a keyword.
Using the information entropy ent_i of word w_i and the inverse document frequency idf_i obtained by the inverse document frequency computation module, the global weight computation module computes the global weight of w_i according to the formula g_i = idf_i / ent_i. The above procedure is repeated until the global weights of all dictionary words have been computed, yielding the global weight results, where g denotes the global weight.
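The normalization, topic-distribution entropy, and global weight g_i = idf_i / ent_i described above can be sketched as follows. A uniform topic distribution maximizes the entropy and thus minimizes the global weight; note that a word concentrated entirely in one topic would have zero entropy, so a small smoothing term may be needed in practice.

```python
import math

def topic_entropy(row):
    """Normalize a word's topic-probability row, then compute its
    topic distribution entropy ent_i = -sum(q_ij * log(q_ij))."""
    s = sum(row)
    q = [p / s for p in row]
    return -sum(p * math.log(p) for p in q if p > 0)

def global_weight(idf_i, ent_i):
    """g_i = idf_i / ent_i: low entropy (strong topic bias) boosts the weight."""
    return idf_i / ent_i

uniform = topic_entropy([0.25, 0.25, 0.25, 0.25])   # log(4), maximal
peaked  = topic_entropy([0.97, 0.01, 0.01, 0.01])   # much smaller
# a topic-biased word gets a larger global weight than a uniform one
assert global_weight(1.0, peaked) > global_weight(1.0, uniform)
```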
For each input test text, the test text preprocessing module performs the same operations as the training text preprocessing module, including word segmentation, stop-word removal, and part-of-speech filtering, and outputs the preprocessed word list word = {word_1, ..., word_j, ..., word_T}, where j = 1, 2, ..., T indexes the j-th word and the list contains T words in total.
According to the output word list word = {word_1, ..., word_j, ..., word_T}, the local weight computation module counts in turn, for each word w_i of the dictionary w = {w_1, ..., w_i, ..., w_N}, its number of occurrences tf_i in the word list, and takes the normalized count l_i = tf_i / T as the local weight, i = 1, 2, ..., N, where T is the total number of words in the word list and l denotes the local weight.
After obtaining the local weight result l_i, the comprehensive score computation and ranking module combines it with the global weight g_i obtained in the training stage to compute the comprehensive score of the i-th dictionary word, score_i = g_i * l_i, i = 1, 2, ..., N. The scores of all words are computed in turn, giving the score vector score = (score_1, ..., score_i, ..., score_N). The module then sorts the comprehensive scores of all words in descending order and takes the Q highest-scoring words as the keyword extraction result for the current test text; the Q extracted words reflect the main content and meaning of the text, completing the keyword extraction and the output of the extraction result, where Q is set manually as needed.
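The test-stage scoring, normalized term frequency times the global weight from training followed by ranking, can be sketched as below; the global weights and the tokenized test document are invented for illustration.

```python
from collections import Counter

def keyword_scores(test_tokens, global_w, top_q=3):
    """score_i = local weight (normalized TF in the test text)
    times global weight; return the Q highest-scoring words."""
    tf = Counter(test_tokens)
    total = len(test_tokens)
    scores = {w: (c / total) * global_w.get(w, 0.0) for w, c in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_q]

g = {"topic": 2.0, "model": 1.5, "the": 0.1, "entropy": 1.8}
doc = ["topic", "model", "topic", "the", "entropy", "the", "the"]
top = keyword_scores(doc, g, top_q=2)
# "topic": (2/7)*2.0 = 0.571 beats frequent-but-low-weight "the": (3/7)*0.1
# → ["topic", "entropy"]
```

The frequent stop-word-like token "the" is demoted by its tiny global weight, which is the point of combining the two factors.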
The above is a description of the invention and its embodiments provided for engineers and technicians familiar with the field of the invention; it should be considered illustrative rather than restrictive. Engineers and technicians may implement specific operations according to the ideas in the claims, and various changes in form and detail may be made without departing from the spirit and scope of the invention as defined by the appended claims. All of these are regarded as falling within the scope of the invention.

Claims (10)

1. A topic-feature text keyword extraction method, characterized by comprising the following steps: taking text as the information carrier, keyword extraction is divided, according to topic distribution characteristics, into a training stage and a test stage. The keyword extraction algorithm model is composed of a training-text preprocessing module, an inverse document frequency (IDF) computation module, a topic model learning module, and a global weight computation module in the training stage, together with a test-text preprocessing module, a local weight computation module, and a comprehensive score computation and ranking module in the test stage. The training-text preprocessing module applies Chinese word segmentation, stop-word removal, and part-of-speech filtering, in that order, to the input training text data, and then feeds the preprocessed training text to the topic model learning module and the IDF computation module. The topic model learning module uses a topic model method to learn the topic distribution characteristics of words from the preprocessed training text automatically and without supervision, producing through training a word-topic matrix that reflects each word's probability distribution over the different topics. The IDF computation module uses the preprocessed training text to count, for each word, the number of documents containing that word, and computes the word's inverse document frequency as the logarithm of the ratio of the total number of training documents to the number of documents containing the word. The global weight computation module takes the IDF results and the word-topic matrix produced by the topic model learning module, computes each word's topic distribution entropy from its topic probability distribution, and multiplies the reciprocal of the topic distribution entropy by the word's IDF to obtain each word's global weight; the global weights are passed to the comprehensive score computation and ranking module of the test stage. That module multiplies each word's global weight, obtained in the training stage, by its local weight, obtained from the local weight computation module, computes each word's comprehensive score, and sorts the words by score; the several highest-scoring words are returned as the keyword extraction result for the current test text.
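The two-stage flow of claim 1 can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: `topic_matrix[i]` is assumed to already hold word i's probability vector over K topics (the patent learns it with LDA), the topic entropy is taken as the Shannon entropy of that normalized vector, and the local weight is term frequency normalized by the total token count.

```python
import math
from collections import Counter

def global_weights(docs, topic_matrix, vocab):
    """Training stage sketch: global weight = idf x (1 / topic entropy).
    `docs` is a list of token collections; `topic_matrix[i]` is word i's
    (possibly unnormalized) probability vector over K topics."""
    n_docs = len(docs)
    weights = {}
    for i, w in enumerate(vocab):
        df = sum(1 for d in docs if w in d)          # documents containing w
        idf = math.log(n_docs / df)
        p = topic_matrix[i]
        s = sum(p)
        p = [x / s for x in p]                       # normalize over topics
        ent = -sum(x * math.log(x) for x in p if x > 0)
        weights[w] = idf / ent if ent > 0 else 0.0   # reciprocal entropy x idf
    return weights

def extract_keywords(test_tokens, weights, top_n=3):
    """Test stage sketch: local weight = normalized term frequency;
    comprehensive score = global weight x local weight, ranked descending."""
    tf = Counter(test_tokens)
    total = len(test_tokens)
    scores = {w: weights.get(w, 0.0) * (c / total) for w, c in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Words that are both rare across the training corpus (high IDF) and concentrated on few topics (low entropy) receive the largest global weights, which is the intuition behind combining the two factors.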
2. The topic-feature text keyword extraction method of claim 1, characterized in that: the test stage comprises the test-text preprocessing module, the local weight computation module, and the comprehensive score computation and ranking module. The test-text preprocessing module applies Chinese word segmentation, stop-word removal, and part-of-speech filtering, in that order, to the input test text, and passes the preprocessed test text to the local weight computation module. The local weight computation module counts the normalized term frequency of each word in the preprocessed test text and takes the normalized term frequency as the word's local weight.
3. The topic-feature text keyword extraction method of claim 1, characterized in that: the training-text preprocessing module first applies a Chinese word segmentation algorithm to the several input training texts, obtaining a segmentation result that lists all words in the training text together with each word's part-of-speech tag.
4. The topic-feature text keyword extraction method of claim 1, characterized in that: the training-text preprocessing module performs word segmentation with the open-source toolkit FudanNLP, and then removes stop words from the text vocabulary according to a stop-word list: the word list containing M words is compared against the stop-word list, and any word found in the stop-word list is deleted. The text vocabulary is then filtered by part of speech according to a filtered-POS table: the word list is compared against the filtered-POS table, and every word whose part of speech appears in the table is deleted. The preprocessed word list, denoted word = {word_1 … word_j … word_M}, is then output to the IDF computation module and the topic model learning module, where j is the index of the j-th word among the M words.
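The stop-word removal and POS filtering of claim 4 reduce to two set-membership tests per token. A minimal sketch, assuming segmentation (done with FudanNLP in the patent) has already produced (word, pos) pairs:

```python
def preprocess(tagged_tokens, stopwords, filtered_pos):
    """Keep only tokens that are neither stop words nor of a filtered
    part of speech (claim 4). `tagged_tokens` is a list of (word, pos)
    pairs produced by a segmenter such as FudanNLP."""
    return [w for w, pos in tagged_tokens
            if w not in stopwords and pos not in filtered_pos]
```

For example, with the particle 的 in the stop-word list and the POS tag `u` (auxiliary) in the filtered-POS table, only content words survive.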
5. The topic-feature text keyword extraction method of claim 1, characterized in that: the IDF computation module takes the word list word = {word_1 … word_j … word_M}, counts the distinct words that occur, and forms a dictionary w = (w_1 … w_i … w_N) containing N words. It then takes each dictionary word w_i in turn and counts the number of input training documents df_i that contain w_i; the inverse document frequency idf_i is the logarithm of the ratio between the total number of training documents Num_doc and the number of documents df_i containing w_i, i.e. idf_i = log(Num_doc / df_i). This process is repeated until the IDF of every dictionary word has been computed, forming the IDF matrix idf = (idf_1 … idf_i … idf_N), where w_i denotes the i-th word in the dictionary and i = 1, 2, … N indexes the N words.
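The IDF computation of claim 5 can be written directly from the definition idf_i = log(Num_doc / df_i); the only subtlety is counting each word at most once per document:

```python
import math
from collections import Counter

def inverse_document_frequency(docs):
    """Claim 5 sketch: return idf_i = log(Num_doc / df_i) for every word
    in the training corpus. `docs` is a list of token lists."""
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))            # each word counted once per document
    return {w: math.log(n / c) for w, c in df.items()}
```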
6. The topic-feature text keyword extraction method of claim 5, characterized in that: the topic model learning module takes the word list word = {word_1 … word_j … word_M} obtained after preprocessing by the training-text preprocessing module and performs learning and training with LDA, the three-layer Bayesian probability model of the topic model learning method, obtaining the probability matrix P between words and topics.
The probability matrix P has N columns and K rows: the number of columns N is the number of words in the dictionary, and the number of rows K is the manually set number of topics. P reflects the probability density characteristics of each word over the different topics, where p_1, p_2, … p_N are the topic probability distribution vectors of the different words in P.
7. The topic-feature text keyword extraction method of claim 6, characterized in that: the global weight computation module takes one row at a time from the word-topic matrix obtained by the topic model learning module, denoted p_i = (p_i1 … p_ij … p_iK), and normalizes p_i to obtain the normalized topic probability distribution vector p̂_i = (p̂_i1 … p̂_ij … p̂_iK), where p̂_ij = p_ij / Σ_{k=1}^{K} p_ik, p_i is the probability distribution of the i-th word over the different topics, and p_ij is the probability of the i-th word under the j-th topic.
8. The topic-feature text keyword extraction method of claim 6, characterized in that: the global weight computation module uses the information entropy ent_i of word w_i and the inverse document frequency idf_i obtained by the IDF computation module to compute the global weight of w_i according to the global weight formula weight_i^g = idf_i / ent_i. The above process is repeated until the global weight of every word in the dictionary has been computed, yielding the global weight results, where g is the global weight identifier.
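Claims 7 and 8 together amount to: normalize the word's topic vector, take its Shannon entropy, and divide the IDF by that entropy. A minimal sketch (the use of natural-log Shannon entropy for ent_i is an assumption; the patent only names it "information entropy"):

```python
import math

def topic_entropy(p):
    """Claim 7 + entropy: normalize a word's topic vector p_i and return
    the entropy of the normalized distribution."""
    s = sum(p)
    q = [x / s for x in p]
    return -sum(x * math.log(x) for x in q if x > 0)

def global_weight(p, idf):
    """Claim 8: weight_i^g = idf_i / ent_i. Words concentrated on few
    topics (low entropy) receive larger global weights."""
    ent = topic_entropy(p)
    return idf / ent if ent > 0 else float("inf")
```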
9. The topic-feature text keyword extraction method of claim 5, characterized in that: the local weight computation module takes the output word list word = {word_1 … word_j … word_T} and counts, in turn, the number of occurrences tf_i of each dictionary word w_i of w = {w_1 … w_i … w_N} in the word list word; the occurrence count tf_i after normalization is used as the local weight, i.e. weight_i^l = tf_i / T, i = 1, 2, … N, where T is the total number of words in the word list word and l is the local weight identifier.
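The local weight of claim 9 is just term frequency divided by the test text's total token count T:

```python
from collections import Counter

def local_weights(test_tokens):
    """Claim 9 sketch: weight_i^l = tf_i / T, where T is the total
    number of tokens in the preprocessed test text."""
    total = len(test_tokens)
    return {w: c / total for w, c in Counter(test_tokens).items()}
```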
10. The topic-feature text keyword extraction method of claim 9, characterized in that: after obtaining the local weight results, the comprehensive score computation and ranking module combines them with the global weight results obtained in the training stage to compute the comprehensive score score_i of the i-th word in the dictionary; the scores of all words are computed in turn, yielding the word score matrix score = (score_1 … score_i … score_N), where the comprehensive score of the i-th word is score_i = weight_i^g × weight_i^l, i = 1, 2, … N.
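The final step of claim 10, multiplying global by local weights and ranking, can be sketched as:

```python
def rank_keywords(global_w, local_w, top_n=5):
    """Claim 10 sketch: score_i = global weight x local weight; return
    the top_n highest-scoring words as the keyword extraction result.
    Words unseen in training are given a global weight of 0."""
    scores = {w: global_w.get(w, 0.0) * lw for w, lw in local_w.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Note that a word with a large local weight can still be outranked by a rarer, more topic-focused word with a larger global weight, which is the point of the combined score.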
CN201810516408.2A 2018-05-25 2018-05-25 Theme feature text key word extracting method Pending CN108763213A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810516408.2A CN108763213A (en) 2018-05-25 2018-05-25 Theme feature text key word extracting method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810516408.2A CN108763213A (en) 2018-05-25 2018-05-25 Theme feature text key word extracting method

Publications (1)

Publication Number Publication Date
CN108763213A true CN108763213A (en) 2018-11-06

Family

ID=64006351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810516408.2A Pending CN108763213A (en) 2018-05-25 2018-05-25 Theme feature text key word extracting method

Country Status (1)

Country Link
CN (1) CN108763213A (en)

Cited By (46)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766544A (en) * 2018-12-24 2019-05-17 中国科学院合肥物质科学研究院 Document keyword abstraction method and device based on LDA and term vector
CN109783798A (en) * 2018-12-12 2019-05-21 平安科技(深圳)有限公司 Method, apparatus, terminal and the storage medium of text information addition picture
CN109859291A (en) * 2019-02-21 2019-06-07 北京一品智尚信息科技有限公司 Intelligent LOGO design method, system and storage medium
CN109977399A (en) * 2019-03-05 2019-07-05 国网青海省电力公司 A kind of data analysing method and device based on NLP technology
CN110059311A (en) * 2019-03-27 2019-07-26 银江股份有限公司 A kind of keyword extracting method and system towards judicial style data
CN110134767A (en) * 2019-05-10 2019-08-16 云知声(上海)智能科技有限公司 A kind of screening technique of vocabulary
CN110287481A (en) * 2019-05-29 2019-09-27 西南电子技术研究所(中国电子科技集团公司第十研究所) Name entity corpus labeling training system
CN110298033A (en) * 2019-05-29 2019-10-01 西南电子技术研究所(中国电子科技集团公司第十研究所) Keyword corpus labeling trains extracting tool
CN110377734A (en) * 2019-07-01 2019-10-25 厦门美域中央信息科技有限公司 A kind of file classification method based on support vector machines
CN110413863A (en) * 2019-08-01 2019-11-05 信雅达系统工程股份有限公司 A kind of public sentiment news duplicate removal and method for pushing based on deep learning
CN110413997A (en) * 2019-07-16 2019-11-05 深圳供电局有限公司 For the new word discovery method and its system of power industry, readable storage medium storing program for executing
CN110580279A (en) * 2019-08-19 2019-12-17 湖南正宇软件技术开发有限公司 Information classification method, system, equipment and storage medium
CN110705285A (en) * 2019-09-20 2020-01-17 北京市计算中心 Government affair text subject word bank construction method, device, server and readable storage medium
CN110852085A (en) * 2019-08-19 2020-02-28 湖南正宇软件技术开发有限公司 Hotspot topic mining method and system
CN111160025A (en) * 2019-12-12 2020-05-15 日照睿安信息科技有限公司 Method for actively discovering case keywords based on public security text
WO2020155769A1 (en) * 2019-01-30 2020-08-06 平安科技(深圳)有限公司 Method and device for establishing keyword generation model
CN111625578A (en) * 2020-05-26 2020-09-04 辽宁大学 Feature extraction method suitable for time sequence data in cultural science and technology fusion field
CN111680131A (en) * 2020-06-22 2020-09-18 平安银行股份有限公司 Document clustering method and system based on semantics and computer equipment
CN111753048A (en) * 2020-05-21 2020-10-09 高新兴科技集团股份有限公司 Document retrieval method, device, equipment and storage medium
CN111950261A (en) * 2020-10-16 2020-11-17 腾讯科技(深圳)有限公司 Method, device and computer readable storage medium for extracting text keywords
CN111949767A (en) * 2020-08-20 2020-11-17 深圳市卡牛科技有限公司 Method, device, equipment and storage medium for searching text keywords
CN112100317A (en) * 2020-09-24 2020-12-18 南京邮电大学 Feature keyword extraction method based on theme semantic perception
CN112100492A (en) * 2020-09-11 2020-12-18 河北冀联人力资源服务集团有限公司 Batch delivery method and system for resumes of different versions
CN112199926A (en) * 2020-10-16 2021-01-08 中国地质大学(武汉) Geological report text visualization method based on text mining and natural language processing
CN112464635A (en) * 2020-07-27 2021-03-09 上海汇招信息技术有限公司 Method and system for automatically scoring bid document
CN112507060A (en) * 2020-12-14 2021-03-16 福建正孚软件有限公司 Domain corpus construction method and system
CN112530591A (en) * 2020-12-10 2021-03-19 厦门越人健康技术研发有限公司 Method for generating auscultation test vocabulary and storage equipment
CN112541105A (en) * 2019-09-20 2021-03-23 福建师范大学地理研究所 Keyword generation method, public opinion monitoring method, device, equipment and medium
CN112686026A (en) * 2021-03-17 2021-04-20 平安科技(深圳)有限公司 Keyword extraction method, device, equipment and medium based on information entropy
CN112765979A (en) * 2021-01-15 2021-05-07 西华大学 System and method for extracting thesis keywords
CN112836507A (en) * 2021-01-13 2021-05-25 哈尔滨工程大学 Method for extracting domain text theme
CN112989824A (en) * 2021-05-12 2021-06-18 武汉卓尔数字传媒科技有限公司 Information pushing method and device, electronic equipment and storage medium
CN113095073A (en) * 2021-03-12 2021-07-09 深圳索信达数据技术有限公司 Corpus tag generation method and device, computer equipment and storage medium
WO2021139466A1 (en) * 2020-01-06 2021-07-15 北京大米科技有限公司 Topic word determination method for text, device, storage medium, and terminal
CN113191147A (en) * 2021-05-27 2021-07-30 中国人民解放军军事科学院评估论证研究中心 Unsupervised automatic term extraction method, apparatus, device and medium
CN113312463A (en) * 2021-05-26 2021-08-27 中国平安人寿保险股份有限公司 Intelligent evaluation method and device for voice question answering, computer equipment and storage medium
CN113408286A (en) * 2021-05-28 2021-09-17 浙江工业大学 Chinese entity identification method and system for mechanical and chemical engineering field
CN113761130A (en) * 2021-08-31 2021-12-07 珠海读书郎软件科技有限公司 System and method for assisting composition writing
CN113808742A (en) * 2021-08-10 2021-12-17 三峡大学 LSTM (localized surface technology) attention mechanism disease prediction method based on text feature dimension reduction
CN114065759A (en) * 2021-11-19 2022-02-18 深圳视界信息技术有限公司 Model failure detection method and device, electronic equipment and medium
CN114357996A (en) * 2021-12-06 2022-04-15 北京网宿科技有限公司 Time sequence text feature extraction method and device, electronic equipment and storage medium
CN114491034A (en) * 2022-01-24 2022-05-13 聚好看科技股份有限公司 Text classification method and intelligent device
CN115983251A (en) * 2023-02-16 2023-04-18 江苏联著实业股份有限公司 Text topic extraction system and method based on sentence analysis
CN115994527A (en) * 2023-03-23 2023-04-21 广东聚智诚科技有限公司 Machine learning-based PPT automatic generation system
CN116842945A (en) * 2023-07-07 2023-10-03 中国标准化研究院 Digital library data mining method
CN114491034B (en) * 2022-01-24 2024-05-28 聚好看科技股份有限公司 Text classification method and intelligent device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105843795A (en) * 2016-03-21 2016-08-10 华南理工大学 Topic model based document keyword extraction method and system
US20160314191A1 (en) * 2015-04-24 2016-10-27 Linkedin Corporation Topic extraction using clause segmentation and high-frequency words
CN106294863A (en) * 2016-08-23 2017-01-04 电子科技大学 A kind of abstract method for mass text fast understanding

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160314191A1 (en) * 2015-04-24 2016-10-27 Linkedin Corporation Topic extraction using clause segmentation and high-frequency words
CN105843795A (en) * 2016-03-21 2016-08-10 华南理工大学 Topic model based document keyword extraction method and system
CN106294863A (en) * 2016-08-23 2017-01-04 电子科技大学 A kind of abstract method for mass text fast understanding

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Liu Xiaojian: "Research on Keyword Extraction Algorithms Based on Topic Models", China Masters' Theses Full-text Database *
Qian Aibing: "Chinese Web Page Keyword Extraction Based on Improved TF-IDF: The Case of News Pages", Information Studies: Theory & Application *

Cited By (61)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109783798A (en) * 2018-12-12 2019-05-21 平安科技(深圳)有限公司 Method, apparatus, terminal and the storage medium of text information addition picture
CN109766544A (en) * 2018-12-24 2019-05-17 中国科学院合肥物质科学研究院 Document keyword abstraction method and device based on LDA and term vector
CN109766544B (en) * 2018-12-24 2022-09-30 中国科学院合肥物质科学研究院 Document keyword extraction method and device based on LDA and word vector
WO2020155769A1 (en) * 2019-01-30 2020-08-06 平安科技(深圳)有限公司 Method and device for establishing keyword generation model
CN109859291A (en) * 2019-02-21 2019-06-07 北京一品智尚信息科技有限公司 Intelligent LOGO design method, system and storage medium
CN109977399A (en) * 2019-03-05 2019-07-05 国网青海省电力公司 A kind of data analysing method and device based on NLP technology
CN110059311A (en) * 2019-03-27 2019-07-26 银江股份有限公司 A kind of keyword extracting method and system towards judicial style data
CN110134767A (en) * 2019-05-10 2019-08-16 云知声(上海)智能科技有限公司 A kind of screening technique of vocabulary
CN110134767B (en) * 2019-05-10 2021-07-23 云知声(上海)智能科技有限公司 Screening method of vocabulary
CN110298033B (en) * 2019-05-29 2022-07-08 西南电子技术研究所(中国电子科技集团公司第十研究所) Keyword corpus labeling training extraction system
CN110287481B (en) * 2019-05-29 2022-06-14 西南电子技术研究所(中国电子科技集团公司第十研究所) Named entity corpus labeling training system
CN110298033A (en) * 2019-05-29 2019-10-01 西南电子技术研究所(中国电子科技集团公司第十研究所) Keyword corpus labeling trains extracting tool
CN110287481A (en) * 2019-05-29 2019-09-27 西南电子技术研究所(中国电子科技集团公司第十研究所) Name entity corpus labeling training system
CN110377734A (en) * 2019-07-01 2019-10-25 厦门美域中央信息科技有限公司 A kind of file classification method based on support vector machines
CN110413997A (en) * 2019-07-16 2019-11-05 深圳供电局有限公司 For the new word discovery method and its system of power industry, readable storage medium storing program for executing
CN110413997B (en) * 2019-07-16 2023-04-07 深圳供电局有限公司 New word discovery method, system and readable storage medium for power industry
CN110413863A (en) * 2019-08-01 2019-11-05 信雅达系统工程股份有限公司 A kind of public sentiment news duplicate removal and method for pushing based on deep learning
CN110580279A (en) * 2019-08-19 2019-12-17 湖南正宇软件技术开发有限公司 Information classification method, system, equipment and storage medium
CN110852085A (en) * 2019-08-19 2020-02-28 湖南正宇软件技术开发有限公司 Hotspot topic mining method and system
CN112541105A (en) * 2019-09-20 2021-03-23 福建师范大学地理研究所 Keyword generation method, public opinion monitoring method, device, equipment and medium
CN110705285B (en) * 2019-09-20 2022-11-22 北京市计算中心有限公司 Government affair text subject word library construction method, device, server and readable storage medium
CN110705285A (en) * 2019-09-20 2020-01-17 北京市计算中心 Government affair text subject word bank construction method, device, server and readable storage medium
CN111160025A (en) * 2019-12-12 2020-05-15 日照睿安信息科技有限公司 Method for actively discovering case keywords based on public security text
WO2021139466A1 (en) * 2020-01-06 2021-07-15 北京大米科技有限公司 Topic word determination method for text, device, storage medium, and terminal
CN111753048A (en) * 2020-05-21 2020-10-09 高新兴科技集团股份有限公司 Document retrieval method, device, equipment and storage medium
CN111625578A (en) * 2020-05-26 2020-09-04 辽宁大学 Feature extraction method suitable for time sequence data in cultural science and technology fusion field
CN111625578B (en) * 2020-05-26 2023-12-08 辽宁大学 Feature extraction method suitable for time series data in cultural science and technology fusion field
CN111680131A (en) * 2020-06-22 2020-09-18 平安银行股份有限公司 Document clustering method and system based on semantics and computer equipment
CN111680131B (en) * 2020-06-22 2022-08-12 平安银行股份有限公司 Document clustering method and system based on semantics and computer equipment
CN112464635A (en) * 2020-07-27 2021-03-09 上海汇招信息技术有限公司 Method and system for automatically scoring bid document
CN111949767A (en) * 2020-08-20 2020-11-17 深圳市卡牛科技有限公司 Method, device, equipment and storage medium for searching text keywords
CN112100492A (en) * 2020-09-11 2020-12-18 河北冀联人力资源服务集团有限公司 Batch delivery method and system for resumes of different versions
CN112100317B (en) * 2020-09-24 2022-10-14 南京邮电大学 Feature keyword extraction method based on theme semantic perception
CN112100317A (en) * 2020-09-24 2020-12-18 南京邮电大学 Feature keyword extraction method based on theme semantic perception
CN111950261A (en) * 2020-10-16 2020-11-17 腾讯科技(深圳)有限公司 Method, device and computer readable storage medium for extracting text keywords
CN112199926A (en) * 2020-10-16 2021-01-08 中国地质大学(武汉) Geological report text visualization method based on text mining and natural language processing
CN112199926B (en) * 2020-10-16 2024-05-10 中国地质大学(武汉) Geological report text visualization method based on text mining and natural language processing
CN112530591A (en) * 2020-12-10 2021-03-19 厦门越人健康技术研发有限公司 Method for generating auscultation test vocabulary and storage equipment
CN112507060A (en) * 2020-12-14 2021-03-16 福建正孚软件有限公司 Domain corpus construction method and system
CN112836507A (en) * 2021-01-13 2021-05-25 哈尔滨工程大学 Method for extracting domain text theme
CN112836507B (en) * 2021-01-13 2022-12-09 哈尔滨工程大学 Method for extracting domain text theme
CN112765979B (en) * 2021-01-15 2023-05-09 西华大学 Paper keyword extraction system and method thereof
CN112765979A (en) * 2021-01-15 2021-05-07 西华大学 System and method for extracting thesis keywords
CN113095073A (en) * 2021-03-12 2021-07-09 深圳索信达数据技术有限公司 Corpus tag generation method and device, computer equipment and storage medium
CN112686026B (en) * 2021-03-17 2021-06-18 平安科技(深圳)有限公司 Keyword extraction method, device, equipment and medium based on information entropy
CN112686026A (en) * 2021-03-17 2021-04-20 平安科技(深圳)有限公司 Keyword extraction method, device, equipment and medium based on information entropy
CN112989824A (en) * 2021-05-12 2021-06-18 武汉卓尔数字传媒科技有限公司 Information pushing method and device, electronic equipment and storage medium
CN113312463B (en) * 2021-05-26 2023-07-18 中国平安人寿保险股份有限公司 Intelligent evaluation method and device for voice questions and answers, computer equipment and storage medium
CN113312463A (en) * 2021-05-26 2021-08-27 中国平安人寿保险股份有限公司 Intelligent evaluation method and device for voice question answering, computer equipment and storage medium
CN113191147A (en) * 2021-05-27 2021-07-30 中国人民解放军军事科学院评估论证研究中心 Unsupervised automatic term extraction method, apparatus, device and medium
CN113408286A (en) * 2021-05-28 2021-09-17 浙江工业大学 Chinese entity identification method and system for mechanical and chemical engineering field
CN113808742A (en) * 2021-08-10 2021-12-17 三峡大学 LSTM (localized surface technology) attention mechanism disease prediction method based on text feature dimension reduction
CN113761130A (en) * 2021-08-31 2021-12-07 珠海读书郎软件科技有限公司 System and method for assisting composition writing
CN114065759A (en) * 2021-11-19 2022-02-18 深圳视界信息技术有限公司 Model failure detection method and device, electronic equipment and medium
CN114065759B (en) * 2021-11-19 2023-10-13 深圳数阔信息技术有限公司 Model failure detection method and device, electronic equipment and medium
CN114357996A (en) * 2021-12-06 2022-04-15 北京网宿科技有限公司 Time sequence text feature extraction method and device, electronic equipment and storage medium
CN114491034A (en) * 2022-01-24 2022-05-13 聚好看科技股份有限公司 Text classification method and intelligent device
CN114491034B (en) * 2022-01-24 2024-05-28 聚好看科技股份有限公司 Text classification method and intelligent device
CN115983251A (en) * 2023-02-16 2023-04-18 江苏联著实业股份有限公司 Text topic extraction system and method based on sentence analysis
CN115994527A (en) * 2023-03-23 2023-04-21 广东聚智诚科技有限公司 Machine learning-based PPT automatic generation system
CN116842945A (en) * 2023-07-07 2023-10-03 中国标准化研究院 Digital library data mining method

Similar Documents

Publication Publication Date Title
CN108763213A (en) Theme feature text key word extracting method
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
CN111177365B (en) Unsupervised automatic abstract extraction method based on graph model
CN106997382B (en) Innovative creative tag automatic labeling method and system based on big data
CN104765769B (en) The short text query expansion and search method of a kind of word-based vector
Yu et al. Hierarchical topic modeling of Twitter data for online analytical processing
CN107122413A (en) A kind of keyword extracting method and device based on graph model
CN107193801A (en) A kind of short text characteristic optimization and sentiment analysis method based on depth belief network
CN108595425A (en) Based on theme and semantic dialogue language material keyword abstraction method
Wang et al. Ptr: Phrase-based topical ranking for automatic keyphrase extraction in scientific publications
CN105786991A (en) Chinese emotion new word recognition method and system in combination with user emotion expression ways
CN110134792B (en) Text recognition method and device, electronic equipment and storage medium
CN110134925A (en) A kind of Chinese patent text similarity calculating method
CN112989802B (en) Bullet screen keyword extraction method, bullet screen keyword extraction device, bullet screen keyword extraction equipment and bullet screen keyword extraction medium
Sabuna et al. Summarizing Indonesian text automatically by using sentence scoring and decision tree
CN106599054A (en) Method and system for title classification and push
CN110222172B (en) Multi-source network public opinion theme mining method based on improved hierarchical clustering
CN108170666A (en) A kind of improved method based on TF-IDF keyword extractions
CN109408802A (en) A kind of method, system and storage medium promoting sentence vector semanteme
CN110175221A (en) Utilize the refuse messages recognition methods of term vector combination machine learning
CN116501875B (en) Document processing method and system based on natural language and knowledge graph
CN114997288A (en) Design resource association method
CN116362243A (en) Text key phrase extraction method, storage medium and device integrating incidence relation among sentences
Silvia et al. Summarizing text for indonesian language by using latent dirichlet allocation and genetic algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20181106