CN108763213A - Theme feature text key word extracting method - Google Patents


Info

Publication number
CN108763213A
CN108763213A (application CN201810516408.2A)
Authority
CN
China
Prior art keywords
word
text
vocabulary
theme
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810516408.2A
Other languages
Chinese (zh)
Inventor
彭易锦
代翔
黄细凤
王侃
杨拓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 10 Research Institute
Southwest Electronic Technology Institute No 10 Institute of Cetc
Original Assignee
Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Electronic Technology Institute No 10 Institute of Cetc filed Critical Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority to CN201810516408.2A priority Critical patent/CN108763213A/en
Publication of CN108763213A publication Critical patent/CN108763213A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a theme-feature text keyword extraction method that yields better keyword extraction results than the traditional TF-IDF method. The technical scheme is as follows: in the training stage, the training texts are preprocessed by word segmentation, stop-word removal, and part-of-speech filtering; the inverse document frequency of each word is counted while a topic model learning method is used to obtain the word-topic probability matrix, which is normalized; the topic distribution entropy of each word is computed from the word-topic probability matrix, and the global weight of each word is computed by combining its inverse document frequency with its topic distribution entropy. The global weight results are passed to the test stage, where, after the same preprocessing of the test text, the normalized term frequency of each word in the test text is counted and combined with the global weights obtained in the training stage to compute a comprehensive score for each word. The words are ranked by score, and the several highest-scoring words are taken as the automatically extracted keywords of the current test text.

Description

Theme-feature text keyword extraction method
Technical field
The invention belongs to the field of natural language processing, and in particular relates to a text keyword extraction method based on the topic distribution features of words.
Background technology
Keyword extraction is key to technologies such as information retrieval, text classification and clustering, and automatic summarization, and is an important means of quickly grasping the subject matter of a document. Keywords are traditionally defined as a group of words or phrases that can summarize the subject content of a document. Keywords characterize a document's topics and critical content and are the smallest units that express the core content of a text. They have very important applications in many fields, such as automatic summarization, Web page information extraction, document classification and clustering, and search engines. In most cases, however, a text does not directly provide its keywords, so keyword extraction methods must be designed. The purpose of keyword extraction is to derive from the text the feature words that reflect its main content and meaning. A typical text keyword extraction method first extracts the feature words of the text, then computes a weight for each feature word according to some rule, and finally determines, according to these weights, the keywords that reflect the text's subject content. Because Internet resources are constantly updated and Chinese texts are growing explosively, extracting keywords manually is time-consuming and somewhat subjective; it is therefore necessary to study methods that can extract keywords from documents automatically. Keyword extraction, also called keyword annotation, is the process of extracting from a text the words or phrases most relevant to the ideas the text expresses; automatic keyword extraction is the automation technology for identifying or annotating the representative words or phrases in a document. Automatic text keyword extraction has always been a key problem and research hotspot in the field of natural language processing. With the ever-increasing demand for text data applications, many automatic keyword extraction methods have been proposed in recent years, and some have achieved good results for keyword extraction in specific domains; however, general-purpose automatic keyword extraction methods independent of language and domain still require further study. At present, some keyword extraction systems are implemented with a single method, while others combine several methods. By core method, they can be summarized into the following representative categories:
1) Thesaurus-based methods. These establish a thesaurus in a specific domain and compute word weights by combining the thesaurus with factors such as word length and word frequency. Such methods are limited by the background dictionary, which makes the extracted keywords insufficiently comprehensive.
2) Word-sense-based methods. These use a rule base or a synonym dictionary to annotate word senses, perform word sense disambiguation, and compute word weights from the disambiguation results. Such methods are directly affected by the quality of the user-built rule base, and the extra work of word sense disambiguation and synonym identification makes extraction relatively inefficient.
3) Statistics-based methods. These are currently the most widely used. They extract document keywords from statistical information about words: certain features of each word, such as TF, DF, TF-IDF, or information entropy, are computed and combined with positional features, such as occurrence in titles or section headings, to assign weights, and keywords are extracted in order of weight. The approach is relatively simple and generally requires neither training data nor an external knowledge base: simple statistical rules such as part-of-speech filtering and word frequency can be used to screen a candidate keyword set, and the candidates are then evaluated with some statistic to realize keyword extraction. Its disadvantages are that the computation can be heavy, extraction results may include semantically incomplete strings and thus lower accuracy, low-frequency words cannot be extracted, and a large amount of raw text is required.
4) Topic-model-based methods. A topic model is a probabilistic language model that simulates human writing: a document is a mixture of several topics, and each topic is a probability distribution over the vocabulary. The more pronounced a word's topic features are within a document, the stronger its ability to represent a particular topic. The topic weights of words are computed with a topic model to obtain a word-topic matrix; then, under each topic, the several highest-weighted words are selected as that topic's keywords.
5) Complex-network-based methods. These are unsupervised: feature words are taken as nodes and the relationships between them as edges to construct a linguistic network graph, which is then analyzed to find the words or phrases playing a central role; these are the document's keywords. The words of a document are built into a network according to given rules, and, after verifying the small-world property of the network, the nodes with a strong influence on the network's average path length are extracted as keywords. This method fails to explain the relationship between keywords and the increment of average path length, network connectivity is often hard to guarantee, and the computation is heavy.
6) Neural-network-based methods. Current explorations are mostly based on word vector representations and rest on two assumptions: first, that the words of a document unfold around its keywords, which embody the central idea of the article; second, that most words in an article are semantically similar to its keywords. Research on neural-network-based keyword extraction is still at an early stage. Although keyword extraction techniques have made significant progress in recent years, extraction results are still far from satisfactory.
Among the above methods, statistics-based methods are the earliest studied and most widely applied. They focus on statistical properties, generalize well, are easy to implement, and are independent of language and domain. The most typical is the term frequency-inverse document frequency method (Term Frequency-Inverse Document Frequency, TF-IDF). TF-IDF evaluates how important a word is to a document: TF, the term frequency, measures the word's ability to describe the content of a document; IDF, the inverse document frequency, measures the word's ability to distinguish documents. TF-IDF rests on the basic assumption that a word occurring many times in one text will also occur many times in other texts of the same class, and vice versa. It also considers a word's ability to distinguish classes: the smaller a word's document frequency, the greater its discriminating power. When word weights are computed with the TF-IDF algorithm, a word that occurs frequently in one document but rarely in others is considered highly discriminative for that document and receives a large weight. The advantages of TF-IDF are that it is simple and fast and its results largely match reality. However, the traditional TF-IDF method measures the importance of a word by term frequency alone: it uses only the term-frequency feature, ignores how words are distributed across classes or topics in the document collection, and cannot reflect features such as part of speech. As a result, some low-frequency words that cannot represent the text receive very high IDF values, while some high-frequency words that represent the text well receive very low IDF values. In essence, IDF is a weighting that tries to suppress noise by simply assuming that words with low document frequency are more important and those with high document frequency are less useful, which is clearly not entirely correct. IDF cannot effectively reflect the importance of words or the distribution of feature words, so it cannot adjust weights well, and the precision of the TF-IDF method is therefore not very high.
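For comparison, the traditional TF-IDF weighting described above can be sketched as follows. This is a minimal stand-in, not the patent's method; the tokenized toy documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Plain TF-IDF over tokenized documents (lists of words):
    weight(w, d) = (count of w in d / length of d) * log(N / df(w))."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))           # document frequency: one count per doc
    idf = {w: math.log(n / df[w]) for w in df}
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weights.append({w: (tf[w] / total) * idf[w] for w in tf})
    return weights

docs = [["topic", "model", "keyword"],
        ["keyword", "extraction", "text"],
        ["topic", "distribution", "entropy"]]
w = tf_idf(docs)
# "keyword" appears in 2 of 3 docs, so its IDF is log(3/2)
```

Note how a word appearing in every document gets IDF log(1) = 0 and is suppressed entirely, which is exactly the blunt behavior the passage above criticizes.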
Summary of the invention
Aiming at the traditional TF-IDF method's shortcoming of using only the term-frequency feature and neglecting the class and topic distribution characteristics of words, the present invention provides a keyword extraction method with high extraction efficiency and higher accuracy that makes full use of the topic distribution characteristics of words.
To achieve the above object, the theme-feature text keyword extraction method proposed by the present invention is characterized by the following steps. Taking text as the carrier of information, text keyword extraction is divided, according to topic distribution features, into a training stage and a test stage. The training-stage training text preprocessing, inverse document frequency computation, topic model learning, and global weight computation modules, together with the test-stage test text preprocessing, local weight computation, and comprehensive score computation and ranking modules, form the text keyword extraction model. The training text preprocessing module applies Chinese word segmentation, stop-word removal, and part-of-speech filtering to the input training text data in turn, then feeds the preprocessed training text data to the topic model learning module and the inverse document frequency computation module. The topic model learning module learns the topic distribution features of words from the preprocessed training text data automatically and without supervision, obtaining through training a word-topic matrix that reflects the probability distribution of each word over the topics. The inverse document frequency computation module counts, for each word, the number of training documents containing it, computes the ratio of the total number of training documents to that count, and takes the logarithm of the ratio as the word's inverse document frequency. The global weight computation module takes the inverse document frequency results and the word-topic matrix obtained by the topic model learning module, computes each word's topic distribution entropy from the word-topic probability matrix, multiplies the reciprocal of the entropy by the word's inverse document frequency to obtain the word's global weight, and sends the global weight results to the comprehensive score computation and ranking module of the test stage. The test text preprocessing module applies Chinese word segmentation, stop-word removal, and part-of-speech filtering to the input test text in turn and feeds the preprocessed test text to the local weight computation module, which counts the normalized term frequency of each word in the test text and takes it as the word's local weight. The comprehensive score computation and ranking module multiplies each word's global weight from the training stage by its local weight, computes and ranks the comprehensive scores, and takes the several highest-scoring words as the keyword extraction result for the current test text.
Compared with the prior art, the present invention has the following notable advantages:
Improved working efficiency. Taking text as the information carrier, the invention learns the global weights of words automatically from training text data, and the learned results are used for automatic keyword extraction from test texts. This reduces the influence of subjective human factors in manual keyword selection, reduces manual workload, and improves efficiency. The training part of the text keyword extraction model, composed of the text preprocessing, inverse document frequency computation, topic model learning, and global weight computation modules, is simple to implement and fast to run.
High extraction accuracy. In the course of automatic keyword extraction, information entropy is introduced on the basis of the traditional TF-IDF method: the normalized topic distribution entropy of each word is computed and combined with the traditional IDF, achieving higher-accuracy keyword extraction and overcoming the traditional TF-IDF method's failure to consider the class or topic distribution features of words. Meanwhile, during text preprocessing, Chinese word segmentation, stop-word removal, and part-of-speech filtering are applied to the input training texts in turn; filtering useless words by their part of speech avoids the adverse effect of frequently occurring useless words on the keyword extraction result.
Good scalability. When computing the global weights of words in the training stage, the invention learns the topic distribution features of words automatically and without supervision using a topic model, requiring no additional labeled data. This gives the method good scalability and makes extended applications of the keyword extraction model possible.
The present invention provides a complete processing method and workflow for text keywords. On experimental data, its automatic keyword extraction results improve considerably on the traditional TF-IDF method, giving it good practicality and strong engineering value.
Description of the drawings
For a clearer understanding of the present invention, it will now be described by way of specific embodiments with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart of the theme-feature text keyword extraction method of the present invention.
Fig. 2 is a flowchart of the training text preprocessing module in Fig. 1.
Fig. 3 is a flowchart of the global weight computation module in Fig. 1.
The invention is further described below with reference to the accompanying drawings.
Detailed description of the embodiments
Referring to Fig. 1, and in accordance with the present invention, a theme-feature text keyword extraction method comprises the following steps. Taking text as the carrier of information, text keyword extraction is divided, according to topic distribution features, into a training stage and a test stage; the training-stage training text preprocessing, inverse document frequency computation, topic model learning, and global weight computation modules, together with the test-stage test text preprocessing, local weight computation, and comprehensive score computation and ranking modules, form the text keyword extraction model. The training text preprocessing module applies Chinese word segmentation, stop-word removal, and part-of-speech filtering to the input training text data in turn, then feeds the preprocessed training text data to the topic model learning module and the inverse document frequency computation module. The topic model learning module learns the topic distribution features of words from the preprocessed training text data automatically and without supervision, obtaining through training a word-topic matrix that reflects the probability distribution of each word over the topics. The inverse document frequency computation module counts, for each word, the number of training documents containing it, computes the ratio of the total number of training documents to that count, and takes the logarithm of the ratio as the word's inverse document frequency. The global weight computation module computes each word's topic distribution entropy from the word-topic matrix, multiplies the reciprocal of the entropy by the word's inverse document frequency to obtain the word's global weight, and sends the global weight results to the comprehensive score computation and ranking module of the test stage. The comprehensive score computation and ranking module multiplies each word's global weight from the training stage by the local weight obtained by the local weight computation module, computes and ranks the comprehensive scores, and takes the several highest-scoring words as the keyword extraction result for the current test text.
Between the text preprocessing module and the global weight computation module, the inverse document frequency computation module and the topic model learning module form two parallel computation paths, computing respectively the document distribution features and the topic distribution features of words.
When computing the global weights of words, the global weight computation module introduces information entropy on the basis of the traditional inverse document frequency (IDF) computation: after normalizing each word's topic probability distribution, it computes the word's topic distribution entropy and combines it with the traditional IDF value to obtain the word's global weight.
The test stage contains the test text preprocessing module, the local weight computation module, and the comprehensive score computation and ranking module. The test text preprocessing module applies Chinese word segmentation, stop-word removal, and part-of-speech filtering to the input test text in turn and feeds the preprocessed test text to the local weight computation module. The local weight computation module counts, from the preprocessing result, the normalized term frequency of each word in the test text, takes it as the word's local weight, and sends the local weight results to the comprehensive score computation and ranking module. That module multiplies each word's global weight from the training stage by its local weight from the test stage, computes and ranks the comprehensive scores, and takes the several highest-scoring words as the keyword extraction result for the current test text.
Referring to Fig. 2, the training text preprocessing module first applies a Chinese word segmentation algorithm to the several input training texts, obtaining a segmentation result consisting of the list of all words in the training texts together with each word's part-of-speech tag. Segmentation is performed with the open-source toolkit FudanNLP. Stop words are then removed according to a stop-word list: the word list containing M words is compared with the stop-word list, and any word found in the stop-word list is deleted. Part-of-speech filtering is then applied according to a filtered-POS table: the word list is compared with the table, and all words whose part of speech appears in it are deleted, yielding the preprocessed word list word = {word_1, ..., word_j, ..., word_M}, which is output to the inverse document frequency computation module and the topic model learning module, where j is the index of the j-th of the M words.
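The preprocessing chain of segmentation, stop-word removal, and part-of-speech filtering can be sketched as below. FudanNLP is a Java toolkit, so a POS-tagged token list is assumed as input here; the example tags, stop-word list, and filtered-POS set are hypothetical stand-ins.

```python
def preprocess(tagged_tokens, stopwords, drop_pos):
    """Filter a POS-tagged token list: drop stop words and words
    whose part-of-speech tag is in the filtered-POS set."""
    return [w for w, pos in tagged_tokens
            if w not in stopwords and pos not in drop_pos]

# toy tagged tokens: (word, POS tag); "u" stands in for a particle tag
tagged = [("主题", "n"), ("的", "u"), ("模型", "n"), ("学习", "v")]
kept = preprocess(tagged, stopwords={"的"}, drop_pos={"u"})
# → ["主题", "模型", "学习"]
```

In a full pipeline the tagged input would come from the segmenter's output; only the surviving word list is passed on to the IDF and topic-model modules.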
The inverse document frequency computation module scans the word list word = {word_1, ..., word_j, ..., word_M}, collects the distinct words that occur into a dictionary w = (w_1, ..., w_i, ..., w_N), then takes each dictionary word w_i in turn and counts the number df_i of input training documents containing w_i. The inverse document frequency is idf_i = log(Num_doc / df_i), the logarithm of the ratio between the total number Num_doc of training documents and the number df_i of documents containing w_i. This process is repeated until the inverse document frequencies of all dictionary words have been computed, forming the inverse document frequency vector idf = (idf_1, ..., idf_i, ..., idf_N), where w_i denotes the i-th word in the dictionary and i = 1, 2, ..., N, with N the vocabulary size.
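Under the definition above, with idf_i as the log of the ratio of the total document count to the count of documents containing w_i, the module can be sketched as follows (toy documents invented for illustration):

```python
import math
from collections import Counter

def inverse_document_frequency(train_docs):
    """idf_i = log(Num_doc / df_i) for each dictionary word w_i,
    where df_i counts documents containing w_i (not raw occurrences)."""
    num_doc = len(train_docs)
    df = Counter()
    for doc in train_docs:
        df.update(set(doc))   # set(): at most one count per document
    return {w: math.log(num_doc / df[w]) for w in df}

idf = inverse_document_frequency([["a", "b"], ["b", "c"], ["b"]])
# df: a=1, b=3, c=1, so idf["b"] = log(3/3) = 0
```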
According to the word list word = {word_1, ..., word_j, ..., word_M} obtained by the training text preprocessing module, the topic model learning module performs training with a common topic model learning method, such as Latent Dirichlet Allocation (LDA), to obtain the probability matrix P between words and topics.
The probability matrix P has N rows and K columns, where N is the number of words in the dictionary and K is the manually set number of topics. P reflects the probability distribution of each word over the topics; its rows p_1, p_2, ..., p_N are the topic probability distribution vectors of the individual words.
LDA is a document topic generation model, also known as a three-layer Bayesian probability model, and an unsupervised machine learning technique. It contains the three-layer structure of words, topics, and documents and can be used to identify latent topic information in large-scale document collections or corpora. It uses the bag-of-words approach, which treats each document as a term-frequency vector, thereby converting text information into numerical information that is easy to model.
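In practice the word-topic matrix P would come from an LDA implementation's topic-term distributions; the toy sketch below only shows the N-by-K layout assumed here, with invented probabilities and vocabulary.

```python
def word_topic_matrix(topic_term_dists, vocab):
    """Build the N-by-K word-topic matrix P from K topic-term
    distributions: P[i][j] = probability of word_i under topic_j."""
    return [[dist[w] for dist in topic_term_dists] for w in vocab]

# toy distributions over a 3-word vocabulary, K = 2 topics
vocab = ["topic", "model", "entropy"]
dists = [{"topic": 0.6, "model": 0.3, "entropy": 0.1},
         {"topic": 0.1, "model": 0.2, "entropy": 0.7}]
P = word_topic_matrix(dists, vocab)
# P[0] = [0.6, 0.1]  (row for "topic": its probability under each topic)
```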
Referring to Fig. 3, the global weight computation module takes a row of the word-topic matrix obtained by the topic model learning module, denoted p_i = (p_i1, ..., p_ij, ..., p_iK), where p_i is the probability distribution of the i-th word over the topics and p_ij is the probability of the i-th word under the j-th topic. The row p_i is normalized to obtain the normalized topic probability distribution vector q_i, whose components are computed as q_ij = p_ij / (p_i1 + ... + p_iK).
The global weight computation module uses the normalized topic probability distribution vector q_i to compute the information entropy ent_i = -(q_i1 log q_i1 + ... + q_iK log q_iK). Information entropy measures the expected information of a random variable: the larger a variable's entropy, the more varied its possible outcomes and the more content it carries. A larger ent_i indicates that the word is distributed more uniformly over the topics, that is, the word has no obvious topic tendency and is less likely to be a keyword; conversely, a smaller entropy indicates a stronger topic tendency and a greater likelihood that the word is a keyword.
Using the information entropy ent_i of word w_i and the inverse document frequency idf_i obtained by the inverse document frequency computation module, the global weight computation module computes the global weight of w_i according to the formula g_i = idf_i / ent_i. The above procedure is repeated until the global weights of all dictionary words have been computed, yielding the global weight results, where g denotes the global weight.
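The normalization, topic-distribution entropy, and global weight g_i = idf_i / ent_i described above can be sketched as follows. A uniform topic distribution maximizes the entropy and thus minimizes the global weight; note that a word concentrated entirely in one topic would have zero entropy, so a small smoothing term may be needed in practice.

```python
import math

def topic_entropy(row):
    """Normalize a word's topic-probability row, then compute its
    topic distribution entropy ent_i = -sum(q_ij * log(q_ij))."""
    s = sum(row)
    q = [p / s for p in row]
    return -sum(p * math.log(p) for p in q if p > 0)

def global_weight(idf_i, ent_i):
    """g_i = idf_i / ent_i: low entropy (strong topic bias) boosts the weight."""
    return idf_i / ent_i

uniform = topic_entropy([0.25, 0.25, 0.25, 0.25])   # log(4), maximal
peaked  = topic_entropy([0.97, 0.01, 0.01, 0.01])   # much smaller
# a topic-biased word gets a larger global weight than a uniform one
assert global_weight(1.0, peaked) > global_weight(1.0, uniform)
```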
For each input test text, the test text preprocessing module performs the same operations as the training text preprocessing module, including word segmentation, stop-word removal, and part-of-speech filtering, and outputs the preprocessed word list word = {word_1, ..., word_j, ..., word_T}, where j = 1, 2, ..., T indexes the j-th word and the list contains T words in total.
According to the output word list word = {word_1, ..., word_j, ..., word_T}, the local weight computation module counts in turn, for each word w_i of the dictionary w = {w_1, ..., w_i, ..., w_N}, its number of occurrences tf_i in the word list, and takes the normalized count l_i = tf_i / T as the local weight, i = 1, 2, ..., N, where T is the total number of words in the word list and l denotes the local weight.
After obtaining the local weight result l_i, the comprehensive score computation and ranking module combines it with the global weight g_i obtained in the training stage to compute the comprehensive score of the i-th dictionary word, score_i = g_i * l_i, i = 1, 2, ..., N. The scores of all words are computed in turn, giving the score vector score = (score_1, ..., score_i, ..., score_N). The module then sorts the comprehensive scores of all words in descending order and takes the Q highest-scoring words as the keyword extraction result for the current test text; the Q extracted words reflect the main content and meaning of the text, completing the keyword extraction and the output of the extraction result, where Q is set manually as needed.
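The test-stage scoring, normalized term frequency times the global weight from training followed by ranking, can be sketched as below; the global weights and the tokenized test document are invented for illustration.

```python
from collections import Counter

def keyword_scores(test_tokens, global_w, top_q=3):
    """score_i = local weight (normalized TF in the test text)
    times global weight; return the Q highest-scoring words."""
    tf = Counter(test_tokens)
    total = len(test_tokens)
    scores = {w: (c / total) * global_w.get(w, 0.0) for w, c in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_q]

g = {"topic": 2.0, "model": 1.5, "the": 0.1, "entropy": 1.8}
doc = ["topic", "model", "topic", "the", "entropy", "the", "the"]
top = keyword_scores(doc, g, top_q=2)
# "topic": (2/7)*2.0 = 0.571 beats frequent-but-low-weight "the": (3/7)*0.1
# → ["topic", "entropy"]
```

The frequent stop-word-like token "the" is demoted by its tiny global weight, which is the point of combining the two factors.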
The above is a description of the invention and its embodiments provided for engineers and technicians familiar with the field of the invention; it should be considered illustrative rather than restrictive. Engineers and technicians may implement specific operations according to the ideas in the claims, and various changes in form and detail may be made without departing from the spirit and scope of the invention as defined by the appended claims. All of these are regarded as falling within the scope of the invention.

Claims (10)

1. A topic-feature text keyword extraction method, characterized by comprising the following steps: taking text as the information carrier, keyword extraction is divided, according to topic distribution characteristics, into a training stage and a test stage. The keyword extraction algorithm model is composed of a training-text preprocessing module, an inverse document frequency (IDF) computation module, a topic model learning module, and a global weight computation module in the training stage, together with a test-text preprocessing module, a local weight computation module, and a comprehensive score computation and ranking module in the test stage. The training-text preprocessing module applies Chinese word segmentation, stop-word removal, and part-of-speech filtering, in that order, to the input training text data, and then feeds the preprocessed training text to the topic model learning module and the IDF computation module. The topic model learning module uses a topic model method to learn the topic distribution characteristics of words from the preprocessed training text automatically and without supervision, producing through training a word-topic matrix that reflects each word's probability distribution over the different topics. The IDF computation module uses the preprocessed training text to count, for each word, the number of documents containing that word, and computes the word's inverse document frequency as the logarithm of the ratio of the total number of training documents to the number of documents containing the word. The global weight computation module takes the IDF results and the word-topic matrix produced by the topic model learning module, computes each word's topic distribution entropy from its topic probability distribution, and multiplies the reciprocal of the topic distribution entropy by the word's IDF to obtain each word's global weight; the global weights are passed to the comprehensive score computation and ranking module of the test stage. That module multiplies each word's global weight, obtained in the training stage, by its local weight, obtained from the local weight computation module, computes each word's comprehensive score, and sorts the words by score; the several highest-scoring words are returned as the keyword extraction result for the current test text.
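The two-stage flow of claim 1 can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: `topic_matrix[i]` is assumed to already hold word i's probability vector over K topics (the patent learns it with LDA), the topic entropy is taken as the Shannon entropy of that normalized vector, and the local weight is term frequency normalized by the total token count.

```python
import math
from collections import Counter

def global_weights(docs, topic_matrix, vocab):
    """Training stage sketch: global weight = idf x (1 / topic entropy).
    `docs` is a list of token collections; `topic_matrix[i]` is word i's
    (possibly unnormalized) probability vector over K topics."""
    n_docs = len(docs)
    weights = {}
    for i, w in enumerate(vocab):
        df = sum(1 for d in docs if w in d)          # documents containing w
        idf = math.log(n_docs / df)
        p = topic_matrix[i]
        s = sum(p)
        p = [x / s for x in p]                       # normalize over topics
        ent = -sum(x * math.log(x) for x in p if x > 0)
        weights[w] = idf / ent if ent > 0 else 0.0   # reciprocal entropy x idf
    return weights

def extract_keywords(test_tokens, weights, top_n=3):
    """Test stage sketch: local weight = normalized term frequency;
    comprehensive score = global weight x local weight, ranked descending."""
    tf = Counter(test_tokens)
    total = len(test_tokens)
    scores = {w: weights.get(w, 0.0) * (c / total) for w, c in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Words that are both rare across the training corpus (high IDF) and concentrated on few topics (low entropy) receive the largest global weights, which is the intuition behind combining the two factors.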
2. The topic-feature text keyword extraction method of claim 1, characterized in that: the test stage comprises the test-text preprocessing module, the local weight computation module, and the comprehensive score computation and ranking module. The test-text preprocessing module applies Chinese word segmentation, stop-word removal, and part-of-speech filtering, in that order, to the input test text, and passes the preprocessed test text to the local weight computation module. The local weight computation module counts the normalized term frequency of each word in the preprocessed test text and takes the normalized term frequency as the word's local weight.
3. The topic-feature text keyword extraction method of claim 1, characterized in that: the training-text preprocessing module first applies a Chinese word segmentation algorithm to the several input training texts, obtaining a segmentation result that lists all words in the training text together with each word's part-of-speech tag.
4. The topic-feature text keyword extraction method of claim 1, characterized in that: the training-text preprocessing module performs word segmentation with the open-source toolkit FudanNLP, and then removes stop words from the text vocabulary according to a stop-word list: the word list containing M words is compared against the stop-word list, and any word found in the stop-word list is deleted. The text vocabulary is then filtered by part of speech according to a filtered-POS table: the word list is compared against the filtered-POS table, and every word whose part of speech appears in the table is deleted. The preprocessed word list, denoted word = {word_1 … word_j … word_M}, is then output to the IDF computation module and the topic model learning module, where j is the index of the j-th word among the M words.
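The stop-word removal and POS filtering of claim 4 reduce to two set-membership tests per token. A minimal sketch, assuming segmentation (done with FudanNLP in the patent) has already produced (word, pos) pairs:

```python
def preprocess(tagged_tokens, stopwords, filtered_pos):
    """Keep only tokens that are neither stop words nor of a filtered
    part of speech (claim 4). `tagged_tokens` is a list of (word, pos)
    pairs produced by a segmenter such as FudanNLP."""
    return [w for w, pos in tagged_tokens
            if w not in stopwords and pos not in filtered_pos]
```

For example, with the particle 的 in the stop-word list and the POS tag `u` (auxiliary) in the filtered-POS table, only content words survive.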
5. The topic-feature text keyword extraction method of claim 1, characterized in that: the IDF computation module takes the word list word = {word_1 … word_j … word_M}, counts the distinct words that occur, and forms a dictionary w = (w_1 … w_i … w_N) containing N words. It then takes each dictionary word w_i in turn and counts the number of input training documents df_i that contain w_i; the inverse document frequency idf_i is the logarithm of the ratio between the total number of training documents Num_doc and the number of documents df_i containing w_i, i.e. idf_i = log(Num_doc / df_i). This process is repeated until the IDF of every dictionary word has been computed, forming the IDF matrix idf = (idf_1 … idf_i … idf_N), where w_i denotes the i-th word in the dictionary and i = 1, 2, … N indexes the N words.
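The IDF computation of claim 5 can be written directly from the definition idf_i = log(Num_doc / df_i); the only subtlety is counting each word at most once per document:

```python
import math
from collections import Counter

def inverse_document_frequency(docs):
    """Claim 5 sketch: return idf_i = log(Num_doc / df_i) for every word
    in the training corpus. `docs` is a list of token lists."""
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))            # each word counted once per document
    return {w: math.log(n / c) for w, c in df.items()}
```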
6. The topic-feature text keyword extraction method of claim 5, characterized in that: the topic model learning module takes the word list word = {word_1 … word_j … word_M} obtained after preprocessing by the training-text preprocessing module and performs learning and training with LDA, the three-layer Bayesian probability model of the topic model learning method, obtaining the probability matrix P between words and topics.
The probability matrix P has N columns and K rows: the number of columns N is the number of words in the dictionary, and the number of rows K is the manually set number of topics. P reflects the probability density characteristics of each word over the different topics, where p_1, p_2, … p_N are the topic probability distribution vectors of the different words in P.
7. The topic-feature text keyword extraction method of claim 6, characterized in that: the global weight computation module takes one row at a time from the word-topic matrix obtained by the topic model learning module, denoted p_i = (p_i1 … p_ij … p_iK), and normalizes p_i to obtain the normalized topic probability distribution vector p̂_i = (p̂_i1 … p̂_ij … p̂_iK), where p̂_ij = p_ij / Σ_{k=1}^{K} p_ik, p_i is the probability distribution of the i-th word over the different topics, and p_ij is the probability of the i-th word under the j-th topic.
8. The topic-feature text keyword extraction method of claim 6, characterized in that: the global weight computation module uses the information entropy ent_i of word w_i and the inverse document frequency idf_i obtained by the IDF computation module to compute the global weight of w_i according to the global weight formula weight_i^g = idf_i / ent_i. The above process is repeated until the global weight of every word in the dictionary has been computed, yielding the global weight results, where g is the global weight identifier.
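Claims 7 and 8 together amount to: normalize the word's topic vector, take its Shannon entropy, and divide the IDF by that entropy. A minimal sketch (the use of natural-log Shannon entropy for ent_i is an assumption; the patent only names it "information entropy"):

```python
import math

def topic_entropy(p):
    """Claim 7 + entropy: normalize a word's topic vector p_i and return
    the entropy of the normalized distribution."""
    s = sum(p)
    q = [x / s for x in p]
    return -sum(x * math.log(x) for x in q if x > 0)

def global_weight(p, idf):
    """Claim 8: weight_i^g = idf_i / ent_i. Words concentrated on few
    topics (low entropy) receive larger global weights."""
    ent = topic_entropy(p)
    return idf / ent if ent > 0 else float("inf")
```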
9. The topic-feature text keyword extraction method of claim 5, characterized in that: the local weight computation module takes the output word list word = {word_1 … word_j … word_T} and counts, in turn, the number of occurrences tf_i of each dictionary word w_i of w = {w_1 … w_i … w_N} in the word list word; the occurrence count tf_i after normalization is used as the local weight, i.e. weight_i^l = tf_i / T, i = 1, 2, … N, where T is the total number of words in the word list word and l is the local weight identifier.
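The local weight of claim 9 is just term frequency divided by the test text's total token count T:

```python
from collections import Counter

def local_weights(test_tokens):
    """Claim 9 sketch: weight_i^l = tf_i / T, where T is the total
    number of tokens in the preprocessed test text."""
    total = len(test_tokens)
    return {w: c / total for w, c in Counter(test_tokens).items()}
```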
10. The topic-feature text keyword extraction method of claim 9, characterized in that: after obtaining the local weight results, the comprehensive score computation and ranking module combines them with the global weight results obtained in the training stage to compute the comprehensive score score_i of the i-th word in the dictionary; the scores of all words are computed in turn, yielding the word score matrix score = (score_1 … score_i … score_N), where the comprehensive score of the i-th word is score_i = weight_i^g × weight_i^l, i = 1, 2, … N.
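The final step of claim 10, multiplying global by local weights and ranking, can be sketched as:

```python
def rank_keywords(global_w, local_w, top_n=5):
    """Claim 10 sketch: score_i = global weight x local weight; return
    the top_n highest-scoring words as the keyword extraction result.
    Words unseen in training are given a global weight of 0."""
    scores = {w: global_w.get(w, 0.0) * lw for w, lw in local_w.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Note that a word with a large local weight can still be outranked by a rarer, more topic-focused word with a larger global weight, which is the point of the combined score.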
CN201810516408.2A 2018-05-25 2018-05-25 Theme feature text key word extracting method Pending CN108763213A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810516408.2A CN108763213A (en) 2018-05-25 2018-05-25 Theme feature text key word extracting method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810516408.2A CN108763213A (en) 2018-05-25 2018-05-25 Theme feature text key word extracting method

Publications (1)

Publication Number Publication Date
CN108763213A true CN108763213A (en) 2018-11-06

Family

ID=64006351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810516408.2A Pending CN108763213A (en) 2018-05-25 2018-05-25 Theme feature text key word extracting method

Country Status (1)

Country Link
CN (1) CN108763213A (en)

Cited By (46)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766544A (en) * 2018-12-24 2019-05-17 中国科学院合肥物质科学研究院 Document keyword abstraction method and device based on LDA and term vector
CN109783798A (en) * 2018-12-12 2019-05-21 平安科技(深圳)有限公司 Method, apparatus, terminal and the storage medium of text information addition picture
CN109859291A (en) * 2019-02-21 2019-06-07 北京一品智尚信息科技有限公司 Intelligent LOGO design method, system and storage medium
CN109977399A (en) * 2019-03-05 2019-07-05 国网青海省电力公司 A kind of data analysing method and device based on NLP technology
CN110059311A (en) * 2019-03-27 2019-07-26 银江股份有限公司 A kind of keyword extracting method and system towards judicial style data
CN110134767A (en) * 2019-05-10 2019-08-16 云知声(上海)智能科技有限公司 A kind of screening technique of vocabulary
CN110287481A (en) * 2019-05-29 2019-09-27 西南电子技术研究所(中国电子科技集团公司第十研究所) Name entity corpus labeling training system
CN110298033A (en) * 2019-05-29 2019-10-01 西南电子技术研究所(中国电子科技集团公司第十研究所) Keyword corpus labeling trains extracting tool
CN110377734A (en) * 2019-07-01 2019-10-25 厦门美域中央信息科技有限公司 A kind of file classification method based on support vector machines
CN110413863A (en) * 2019-08-01 2019-11-05 信雅达系统工程股份有限公司 A kind of public sentiment news duplicate removal and method for pushing based on deep learning
CN110413997A (en) * 2019-07-16 2019-11-05 深圳供电局有限公司 For the new word discovery method and its system of power industry, readable storage medium storing program for executing
CN110580279A (en) * 2019-08-19 2019-12-17 湖南正宇软件技术开发有限公司 Information classification method, system, equipment and storage medium
CN110705285A (en) * 2019-09-20 2020-01-17 北京市计算中心 Government affair text subject word bank construction method, device, server and readable storage medium
CN110852085A (en) * 2019-08-19 2020-02-28 湖南正宇软件技术开发有限公司 Hotspot topic mining method and system
CN111160025A (en) * 2019-12-12 2020-05-15 日照睿安信息科技有限公司 Method for actively discovering case keywords based on public security text
WO2020155769A1 (en) * 2019-01-30 2020-08-06 平安科技(深圳)有限公司 Method and device for establishing keyword generation model
CN111625578A (en) * 2020-05-26 2020-09-04 辽宁大学 Feature extraction method suitable for time sequence data in cultural science and technology fusion field
CN111680131A (en) * 2020-06-22 2020-09-18 平安银行股份有限公司 Document clustering method and system based on semantics and computer equipment
CN111753048A (en) * 2020-05-21 2020-10-09 高新兴科技集团股份有限公司 Document retrieval method, device, equipment and storage medium
CN111950261A (en) * 2020-10-16 2020-11-17 腾讯科技(深圳)有限公司 Method, device and computer readable storage medium for extracting text keywords
CN111949767A (en) * 2020-08-20 2020-11-17 深圳市卡牛科技有限公司 Method, device, equipment and storage medium for searching text keywords
CN112100317A (en) * 2020-09-24 2020-12-18 南京邮电大学 Feature keyword extraction method based on theme semantic perception
CN112100492A (en) * 2020-09-11 2020-12-18 河北冀联人力资源服务集团有限公司 Batch delivery method and system for resumes of different versions
CN112199926A (en) * 2020-10-16 2021-01-08 中国地质大学(武汉) Geological report text visualization method based on text mining and natural language processing
CN112464635A (en) * 2020-07-27 2021-03-09 上海汇招信息技术有限公司 Method and system for automatically scoring bid document
CN112507060A (en) * 2020-12-14 2021-03-16 福建正孚软件有限公司 Domain corpus construction method and system
CN112530591A (en) * 2020-12-10 2021-03-19 厦门越人健康技术研发有限公司 Method for generating auscultation test vocabulary and storage equipment
CN112541105A (en) * 2019-09-20 2021-03-23 福建师范大学地理研究所 Keyword generation method, public opinion monitoring method, device, equipment and medium
CN112686026A (en) * 2021-03-17 2021-04-20 平安科技(深圳)有限公司 Keyword extraction method, device, equipment and medium based on information entropy
CN112765979A (en) * 2021-01-15 2021-05-07 西华大学 System and method for extracting thesis keywords
CN112836507A (en) * 2021-01-13 2021-05-25 哈尔滨工程大学 Method for extracting domain text theme
CN112989824A (en) * 2021-05-12 2021-06-18 武汉卓尔数字传媒科技有限公司 Information pushing method and device, electronic equipment and storage medium
CN113095073A (en) * 2021-03-12 2021-07-09 深圳索信达数据技术有限公司 Corpus tag generation method and device, computer equipment and storage medium
WO2021139466A1 (en) * 2020-01-06 2021-07-15 北京大米科技有限公司 Topic word determination method for text, device, storage medium, and terminal
CN113191147A (en) * 2021-05-27 2021-07-30 中国人民解放军军事科学院评估论证研究中心 Unsupervised automatic term extraction method, apparatus, device and medium
CN113312463A (en) * 2021-05-26 2021-08-27 中国平安人寿保险股份有限公司 Intelligent evaluation method and device for voice question answering, computer equipment and storage medium
CN113408286A (en) * 2021-05-28 2021-09-17 浙江工业大学 Chinese entity identification method and system for mechanical and chemical engineering field
CN113761130A (en) * 2021-08-31 2021-12-07 珠海读书郎软件科技有限公司 System and method for assisting composition writing
CN113808742A (en) * 2021-08-10 2021-12-17 三峡大学 LSTM (localized surface technology) attention mechanism disease prediction method based on text feature dimension reduction
CN114065759A (en) * 2021-11-19 2022-02-18 深圳视界信息技术有限公司 Model failure detection method and device, electronic equipment and medium
CN114357996A (en) * 2021-12-06 2022-04-15 北京网宿科技有限公司 Time sequence text feature extraction method and device, electronic equipment and storage medium
CN114491034A (en) * 2022-01-24 2022-05-13 聚好看科技股份有限公司 Text classification method and intelligent device
CN115983251A (en) * 2023-02-16 2023-04-18 江苏联著实业股份有限公司 Text topic extraction system and method based on sentence analysis
CN115994527A (en) * 2023-03-23 2023-04-21 广东聚智诚科技有限公司 Machine learning-based PPT automatic generation system
CN116842945A (en) * 2023-07-07 2023-10-03 中国标准化研究院 Digital library data mining method
CN114491034B (en) * 2022-01-24 2024-05-28 聚好看科技股份有限公司 Text classification method and intelligent device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105843795A (en) * 2016-03-21 2016-08-10 华南理工大学 Topic model based document keyword extraction method and system
US20160314191A1 (en) * 2015-04-24 2016-10-27 Linkedin Corporation Topic extraction using clause segmentation and high-frequency words
CN106294863A (en) * 2016-08-23 2017-01-04 电子科技大学 A kind of abstract method for mass text fast understanding

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160314191A1 (en) * 2015-04-24 2016-10-27 Linkedin Corporation Topic extraction using clause segmentation and high-frequency words
CN105843795A (en) * 2016-03-21 2016-08-10 华南理工大学 Topic model based document keyword extraction method and system
CN106294863A (en) * 2016-08-23 2017-01-04 电子科技大学 A kind of abstract method for mass text fast understanding

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Liu Xiaojian: "Research on Keyword Extraction Algorithms Based on Topic Models", China Masters' Theses Full-text Database *
Qian Aibing: "Chinese Web Page Keyword Extraction Based on Improved TF-IDF: The Case of News Pages", Information Studies: Theory & Application *

Cited By (61)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109783798A (en) * 2018-12-12 2019-05-21 平安科技(深圳)有限公司 Method, apparatus, terminal and the storage medium of text information addition picture
CN109766544A (en) * 2018-12-24 2019-05-17 中国科学院合肥物质科学研究院 Document keyword abstraction method and device based on LDA and term vector
CN109766544B (en) * 2018-12-24 2022-09-30 中国科学院合肥物质科学研究院 Document keyword extraction method and device based on LDA and word vector
WO2020155769A1 (en) * 2019-01-30 2020-08-06 平安科技(深圳)有限公司 Method and device for establishing keyword generation model
CN109859291A (en) * 2019-02-21 2019-06-07 北京一品智尚信息科技有限公司 Intelligent LOGO design method, system and storage medium
CN109977399A (en) * 2019-03-05 2019-07-05 国网青海省电力公司 A kind of data analysing method and device based on NLP technology
CN110059311A (en) * 2019-03-27 2019-07-26 银江股份有限公司 A kind of keyword extracting method and system towards judicial style data
CN110134767A (en) * 2019-05-10 2019-08-16 云知声(上海)智能科技有限公司 A kind of screening technique of vocabulary
CN110134767B (en) * 2019-05-10 2021-07-23 云知声(上海)智能科技有限公司 Screening method of vocabulary
CN110298033B (en) * 2019-05-29 2022-07-08 西南电子技术研究所(中国电子科技集团公司第十研究所) Keyword corpus labeling training extraction system
CN110287481B (en) * 2019-05-29 2022-06-14 西南电子技术研究所(中国电子科技集团公司第十研究所) Named entity corpus labeling training system
CN110298033A (en) * 2019-05-29 2019-10-01 西南电子技术研究所(中国电子科技集团公司第十研究所) Keyword corpus labeling trains extracting tool
CN110287481A (en) * 2019-05-29 2019-09-27 西南电子技术研究所(中国电子科技集团公司第十研究所) Name entity corpus labeling training system
CN110377734A (en) * 2019-07-01 2019-10-25 厦门美域中央信息科技有限公司 A kind of file classification method based on support vector machines
CN110413997A (en) * 2019-07-16 2019-11-05 深圳供电局有限公司 For the new word discovery method and its system of power industry, readable storage medium storing program for executing
CN110413997B (en) * 2019-07-16 2023-04-07 深圳供电局有限公司 New word discovery method, system and readable storage medium for power industry
CN110413863A (en) * 2019-08-01 2019-11-05 信雅达系统工程股份有限公司 A kind of public sentiment news duplicate removal and method for pushing based on deep learning
CN110580279A (en) * 2019-08-19 2019-12-17 湖南正宇软件技术开发有限公司 Information classification method, system, equipment and storage medium
CN110852085A (en) * 2019-08-19 2020-02-28 湖南正宇软件技术开发有限公司 Hotspot topic mining method and system
CN112541105A (en) * 2019-09-20 2021-03-23 福建师范大学地理研究所 Keyword generation method, public opinion monitoring method, device, equipment and medium
CN110705285B (en) * 2019-09-20 2022-11-22 北京市计算中心有限公司 Government affair text subject word library construction method, device, server and readable storage medium
CN110705285A (en) * 2019-09-20 2020-01-17 北京市计算中心 Government affair text subject word bank construction method, device, server and readable storage medium
CN111160025A (en) * 2019-12-12 2020-05-15 日照睿安信息科技有限公司 Method for actively discovering case keywords based on public security text
WO2021139466A1 (en) * 2020-01-06 2021-07-15 北京大米科技有限公司 Topic word determination method for text, device, storage medium, and terminal
CN111753048A (en) * 2020-05-21 2020-10-09 高新兴科技集团股份有限公司 Document retrieval method, device, equipment and storage medium
CN111625578A (en) * 2020-05-26 2020-09-04 辽宁大学 Feature extraction method suitable for time sequence data in cultural science and technology fusion field
CN111625578B (en) * 2020-05-26 2023-12-08 辽宁大学 Feature extraction method suitable for time series data in cultural science and technology fusion field
CN111680131A (en) * 2020-06-22 2020-09-18 平安银行股份有限公司 Document clustering method and system based on semantics and computer equipment
CN111680131B (en) * 2020-06-22 2022-08-12 平安银行股份有限公司 Document clustering method and system based on semantics and computer equipment
CN112464635A (en) * 2020-07-27 2021-03-09 上海汇招信息技术有限公司 Method and system for automatically scoring bid document
CN111949767A (en) * 2020-08-20 2020-11-17 深圳市卡牛科技有限公司 Method, device, equipment and storage medium for searching text keywords
CN112100492A (en) * 2020-09-11 2020-12-18 河北冀联人力资源服务集团有限公司 Batch delivery method and system for resumes of different versions
CN112100317B (en) * 2020-09-24 2022-10-14 南京邮电大学 Feature keyword extraction method based on theme semantic perception
CN112100317A (en) * 2020-09-24 2020-12-18 南京邮电大学 Feature keyword extraction method based on theme semantic perception
CN111950261A (en) * 2020-10-16 2020-11-17 腾讯科技(深圳)有限公司 Method, device and computer readable storage medium for extracting text keywords
CN112199926A (en) * 2020-10-16 2021-01-08 中国地质大学(武汉) Geological report text visualization method based on text mining and natural language processing
CN112199926B (en) * 2020-10-16 2024-05-10 中国地质大学(武汉) Geological report text visualization method based on text mining and natural language processing
CN112530591A (en) * 2020-12-10 2021-03-19 厦门越人健康技术研发有限公司 Method for generating auscultation test vocabulary and storage equipment
CN112507060A (en) * 2020-12-14 2021-03-16 福建正孚软件有限公司 Domain corpus construction method and system
CN112836507A (en) * 2021-01-13 2021-05-25 哈尔滨工程大学 Method for extracting domain text theme
CN112836507B (en) * 2021-01-13 2022-12-09 哈尔滨工程大学 Method for extracting domain text theme
CN112765979B (en) * 2021-01-15 2023-05-09 西华大学 Paper keyword extraction system and method thereof
CN112765979A (en) * 2021-01-15 2021-05-07 西华大学 System and method for extracting thesis keywords
CN113095073A (en) * 2021-03-12 2021-07-09 深圳索信达数据技术有限公司 Corpus tag generation method and device, computer equipment and storage medium
CN112686026B (en) * 2021-03-17 2021-06-18 平安科技(深圳)有限公司 Keyword extraction method, device, equipment and medium based on information entropy
CN112686026A (en) * 2021-03-17 2021-04-20 平安科技(深圳)有限公司 Keyword extraction method, device, equipment and medium based on information entropy
CN112989824A (en) * 2021-05-12 2021-06-18 武汉卓尔数字传媒科技有限公司 Information pushing method and device, electronic equipment and storage medium
CN113312463B (en) * 2021-05-26 2023-07-18 中国平安人寿保险股份有限公司 Intelligent evaluation method and device for voice questions and answers, computer equipment and storage medium
CN113312463A (en) * 2021-05-26 2021-08-27 中国平安人寿保险股份有限公司 Intelligent evaluation method and device for voice question answering, computer equipment and storage medium
CN113191147A (en) * 2021-05-27 2021-07-30 中国人民解放军军事科学院评估论证研究中心 Unsupervised automatic term extraction method, apparatus, device and medium
CN113408286A (en) * 2021-05-28 2021-09-17 浙江工业大学 Chinese entity identification method and system for mechanical and chemical engineering field
CN113808742A (en) * 2021-08-10 2021-12-17 三峡大学 LSTM (localized surface technology) attention mechanism disease prediction method based on text feature dimension reduction
CN113761130A (en) * 2021-08-31 2021-12-07 珠海读书郎软件科技有限公司 System and method for assisting composition writing
CN114065759A (en) * 2021-11-19 2022-02-18 深圳视界信息技术有限公司 Model failure detection method and device, electronic equipment and medium
CN114065759B (en) * 2021-11-19 2023-10-13 深圳数阔信息技术有限公司 Model failure detection method and device, electronic equipment and medium
CN114357996A (en) * 2021-12-06 2022-04-15 北京网宿科技有限公司 Time sequence text feature extraction method and device, electronic equipment and storage medium
CN114491034A (en) * 2022-01-24 2022-05-13 聚好看科技股份有限公司 Text classification method and intelligent device
CN114491034B (en) * 2022-01-24 2024-05-28 聚好看科技股份有限公司 Text classification method and intelligent device
CN115983251A (en) * 2023-02-16 2023-04-18 江苏联著实业股份有限公司 Text topic extraction system and method based on sentence analysis
CN115994527A (en) * 2023-03-23 2023-04-21 广东聚智诚科技有限公司 Machine learning-based PPT automatic generation system
CN116842945A (en) * 2023-07-07 2023-10-03 中国标准化研究院 Digital library data mining method

Similar Documents

Publication Publication Date Title
CN108763213A (en) Theme feature text key word extracting method
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
CN111177365B (en) Unsupervised automatic abstract extraction method based on graph model
CN106997382B (en) Innovative creative tag automatic labeling method and system based on big data
CN104765769B (en) The short text query expansion and search method of a kind of word-based vector
Yu et al. Hierarchical topic modeling of Twitter data for online analytical processing
CN107122413A (en) A kind of keyword extracting method and device based on graph model
CN107193801A (en) A kind of short text characteristic optimization and sentiment analysis method based on depth belief network
CN108595425A (en) Based on theme and semantic dialogue language material keyword abstraction method
Wang et al. Ptr: Phrase-based topical ranking for automatic keyphrase extraction in scientific publications
CN105786991A (en) Chinese emotion new word recognition method and system in combination with user emotion expression ways
CN110134792B (en) Text recognition method and device, electronic equipment and storage medium
CN110134925A (en) A kind of Chinese patent text similarity calculating method
CN112989802B (en) Bullet screen keyword extraction method, bullet screen keyword extraction device, bullet screen keyword extraction equipment and bullet screen keyword extraction medium
Sabuna et al. Summarizing Indonesian text automatically by using sentence scoring and decision tree
CN106599054A (en) Method and system for title classification and push
CN110222172B (en) Multi-source network public opinion theme mining method based on improved hierarchical clustering
CN108170666A (en) A kind of improved method based on TF-IDF keyword extractions
CN109408802A (en) A kind of method, system and storage medium promoting sentence vector semanteme
CN110175221A (en) Utilize the refuse messages recognition methods of term vector combination machine learning
CN116501875B (en) Document processing method and system based on natural language and knowledge graph
CN114997288A (en) Design resource association method
CN116362243A (en) Text key phrase extraction method, storage medium and device integrating incidence relation among sentences
Silvia et al. Summarizing text for indonesian language by using latent dirichlet allocation and genetic algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20181106