CN108763213A - Method for extracting topic-feature text keywords - Google Patents

Info

Publication number: CN108763213A
Application number: CN201810516408.2A
Original language: Chinese (zh)
Inventors: 彭易锦, 代翔, 黄细凤, 王侃, 杨拓
Applicant and current assignee: Southwest Electronic Technology Institute No 10 Institute of Cetc
Authority: CN (China)
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities


Abstract

The invention discloses a method for extracting topic-feature text keywords that yields better extraction results than the traditional TF-IDF method. The technical scheme is as follows. In the training stage, training texts are preprocessed by word segmentation, stop-word removal, part-of-speech filtering, and the like, and the inverse document frequency of each word is counted; at the same time, a topic model method is used to learn a word-topic probability matrix, which is normalized; the topic distribution entropy of each word is then calculated from the word-topic probability matrix, and the global weight of each word is calculated by combining its inverse document frequency with its topic distribution entropy. The global weight calculation results are output to the test stage. After a test text is preprocessed, the normalized term frequency of each word in the test text is counted and combined with the global weights obtained in the training stage to calculate a comprehensive score for each word; the words are ranked by score, and the several highest-scoring words in the ranking are taken as the automatic keyword extraction result for the current test text.

Description

Method for extracting topic-feature text keywords
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a text keyword extraction method based on topic distribution characteristics of words.
Background
Keyword extraction is key to technologies such as information retrieval, text classification and clustering, and automatic abstract generation, and is an important means of rapidly acquiring document topics. Keywords are conventionally defined as a set of words or phrases that summarize the subject matter of a document. Keywords represent the topical and critical content of a document and are the minimum units for expressing the core content of a text. They matter in many fields, such as automatic summarization of documents, extraction of web page information, classification and clustering of documents, and search engines. In most cases, however, a text does not directly give its keywords, so a keyword extraction method must be designed. The purpose of keyword extraction is to extract from a text the characteristic words that reflect its main content and meaning: a typical text keyword extraction method extracts the characteristic words of the text, calculates the weight of each characteristic word according to some rule, and determines from those weights the keywords that reflect the subject content of the text. Because internet resources are constantly updated and Chinese text is growing explosively, extracting keywords manually is time-consuming and somewhat subjective, so methods for automatically extracting keywords from documents need to be researched. Keyword extraction, also known as keyword labeling, is the process of extracting from a text the words or phrases most relevant to the ideas the text expresses; automatic keyword extraction is an automated technique for identifying or labeling such representative words or phrases in a document.
Automatic keyword extraction from text has long been a key problem and research hotspot in the field of natural language processing. With the increasing demands of text data applications, many automatic keyword extraction methods have been proposed in recent years, and some perform well in specific fields; however, general automatic keyword extraction methods independent of language and domain need further research. At present, some keyword extraction systems are realized with a single method, while others combine multiple methods. According to the core method adopted, they can be summarized into the following representative categories:
1) Methods based on topic word lists. These methods establish a topic word list for a specific field and calculate the weight of a word by combining the word list with factors such as word length and word frequency. Such methods are limited by the background lexicon, so the extracted keywords may not be comprehensive.
2) Methods based on word sense. These methods use a rule base or a synonym dictionary to label word senses, then perform word-sense disambiguation and calculate the weight of each word from the disambiguation result. Such methods are directly influenced by the quality of the user-built rule base; in addition, the extra work of word-sense disambiguation and synonym identification makes extraction inefficient.
3) Statistics-based methods. Statistics-based methods are currently the most widely used. They extract the keywords of a document from the statistical information of its words: weights are assigned to words by calculating features such as TF, DF, TF-IDF, and information entropy, combined with positional features such as appearance in titles or at the heads of paragraphs, and keywords are extracted in weight order. These methods are relatively simple and generally need no training data or external knowledge base; a candidate keyword set can be obtained by screening with simple statistical rules such as part-of-speech filtering and word frequency, and the candidates are then evaluated with some statistic to realize keyword extraction. Their disadvantages are a large amount of calculation; extraction results containing semantically incomplete character strings, which lowers accuracy; an inability to extract low-frequency words; and the need for a large amount of original text.
4) Methods based on topic models. A topic model is a probabilistic language model that simulates human writing: a document is formed by mixing several topics, and each topic is a probability distribution over the vocabulary. The more pronounced the topical characteristics of a word in a document, the greater its ability to represent a certain topic. A topic model method is used to calculate the topic weights of the words and obtain a word-topic matrix, and then the several words with the highest weight under each topic are selected as the keywords of that topic.
5) Methods based on complex networks. These are unsupervised methods that first construct a language network graph with the characteristic words as nodes and the relations between them as edges, then analyze the constructed graph to find the words or phrases that play a central role, which are the keywords of the document. Such methods build the words of a document into a network according to a given rule and, by verifying the small-world property of the network, extract as keywords the nodes with the strongest influence on the network's average path length. They cannot explain the relation between the keywords and the variation of the average path length, have difficulty guaranteeing the connectivity of the network, and require a large amount of calculation.
6) Methods based on neural networks. Most current exploration of neural network-based methods builds on word vector representations and rests on two assumptions: the words of a document develop around its keywords, which embody the central idea of the article, and most words in the article are semantically similar to the keywords. Research on neural network-based keyword extraction is still at an early stage. Although keyword extraction technology has developed in recent years, extraction results are still not satisfactory.
Among the above methods, the statistics-based method is the earliest studied and most widely used; it focuses mainly on statistical characteristics, has strong model generalization capability and easy implementation, and is independent of language and field. The most typical such method is the Term Frequency-Inverse Document Frequency (TF-IDF) method. TF-IDF evaluates the importance of a word to a document: TF, the term frequency, measures the word's ability to describe the document content, and IDF, the inverse document frequency, measures the word's ability to distinguish between documents. The guiding idea of TF-IDF rests on the basic assumption that words occurring many times in one text will also occur many times in other similar texts, and vice versa. In addition, the ability of words to distinguish different categories is considered: the lower the document frequency of a word, the greater its ability to distinguish categories. When the TF-IDF algorithm calculates the weight of a word, a word that appears frequently in one document but rarely in other documents has strong discriminating power for that document, and its weight value is correspondingly large. The TF-IDF algorithm has the advantages of simplicity and speed, and its results accord relatively well with the actual situation.
However, the traditional TF-IDF method measures the importance of a word simply by frequency. It uses only word frequency features and does not consider how a word is distributed across the different categories or topics of the document set, nor does it reflect features such as part of speech. As a result, some low-frequency words that cannot represent the text receive very high IDF values, while some high-frequency words that represent the text well receive very low IDF values. IDF is essentially a weighting that tries to suppress noise: it simply assumes that words with a small document frequency are more important and words with a large document frequency are less useful, which is obviously not entirely correct. IDF cannot effectively reflect the importance of words or the distribution of feature words, so it cannot perform the weight-adjusting function well, and the accuracy of the TF-IDF method is therefore not very high.
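For concreteness, the traditional TF-IDF weighting discussed above can be sketched in a few lines. The tokenized toy corpus and the unsmoothed logarithm form are illustrative assumptions, not the exact formulation of any particular system.

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF weights for a list of tokenized documents.

    TF is the within-document normalized term frequency;
    IDF is log(total documents / documents containing the term).
    """
    n_docs = len(docs)
    df = Counter()                       # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n_docs / df_t) for t, df_t in df.items()}
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weights.append({t: (c / total) * idf[t] for t, c in tf.items()})
    return weights

docs = [["topic", "model", "keyword"],
        ["keyword", "extraction", "keyword"],
        ["topic", "entropy"]]
w = tf_idf(docs)
```

Note how "extraction", which appears in only one document, outweighs "keyword" in document 1 despite occurring half as often — exactly the IDF behavior the passage above criticizes for low-frequency words.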
Disclosure of Invention
Aiming at the defects that the traditional TF-IDF method uses only word frequency features and does not consider the category or topic distribution characteristics of words, the invention provides a keyword extraction method that has high extraction efficiency and high accuracy and makes full use of the topic distribution characteristics of words.
In order to achieve the above object, the present invention provides a method for extracting topic-feature text keywords, characterized by comprising the following steps. Text is taken as the carrier of information, and text keywords are extracted according to topic distribution characteristics in two stages, a training stage and a testing stage; the text keyword extraction algorithm model is composed of a training text preprocessing module, an inverse document frequency calculation module, a topic model learning module, a global weight calculation module, a test text preprocessing module, a local weight calculation module, and a comprehensive score calculation and sorting module. In the training stage, the training text preprocessing module sequentially performs Chinese word segmentation, stop-word removal, and part-of-speech filtering on the input training text data, and then inputs the preprocessed training text data to the topic model learning module and the inverse document frequency calculation module. The topic model learning module uses a topic model method to learn the topic distribution characteristics of the words in an unsupervised manner from the preprocessed training text data, obtaining through learning and training a word-topic matrix that reflects the probability distribution characteristics of the words over different topics. The inverse document frequency calculation module uses the preprocessed training text data and, for each word, counts the number of training documents containing that word, calculates the ratio of the total number of training texts to that document count, and takes the logarithm of the ratio as the inverse document frequency. The global weight calculation module calculates the topic distribution entropy of each word according to the inverse document frequency calculation result and the word-topic matrix obtained by the topic model learning module, multiplies the inverse of the topic distribution entropy by the inverse document frequency corresponding to the word to obtain the global weight of each word, and sends the global weight calculation result to the comprehensive score calculation and sorting module of the test stage. In the test stage, the test text preprocessing module sequentially performs Chinese word segmentation, stop-word removal, and part-of-speech filtering on an input test text and inputs the preprocessed test text to the local weight calculation module; the local weight calculation module counts the normalized term frequency of each word in the test text from the preprocessing result and takes it as the local weight of the word; and the comprehensive score calculation and sorting module multiplies, for each word, the global weight obtained in the training stage by the local weight obtained from the local weight calculation module, calculates the comprehensive score of each word and sorts the words, and takes the several highest-scoring words in the ranking as the keyword extraction result for the current test text.
Compared with the prior art, the invention has the following remarkable advantages:
The working efficiency is improved. The invention takes text as the information carrier, automatically learns the global weights of words from training text data, and applies the learning result to automatic keyword extraction from test texts, thereby reducing the influence of the subjective factors involved in manually selecting keywords, reducing manual workload, and improving working efficiency. The training part of the text keyword extraction model is composed of a text preprocessing module, an inverse document frequency calculation module, a topic model learning module, and a global weight calculation module, and is simple to implement and fast to run.
The extraction accuracy is high. In automatically extracting text keywords, information entropy is introduced on the basis of the traditional TF-IDF method: the normalized topic distribution entropy of each word is calculated and combined with the traditional IDF value, realizing keyword extraction of higher accuracy and overcoming the defect that the traditional TF-IDF method does not consider the category or topic distribution characteristics of words. Meanwhile, in text preprocessing, Chinese word segmentation, stop-word removal, and part-of-speech filtering are performed in sequence on the input training text data, and useless words are filtered out by their part-of-speech features, so that high-frequency useless words do not adversely affect the keyword extraction result.
The method has better expandability. In the training stage, when the global weights of the vocabulary are calculated, the topic distribution characteristics of the vocabulary are learned automatically and without supervision using a topic model method; the method needs no additional labeled data, has better expandability, and makes extended application of the keyword extraction model possible.
The invention provides a complete processing method and flow for extracting text keywords; the automatic keyword extraction effect on experimental data is greatly improved compared with the traditional TF-IDF method, so the method has good practicability and strong engineering value.
Drawings
For a more complete understanding of the present invention, reference will now be made to the detailed description of the invention, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of the topic-feature text keyword extraction method of the present invention.
FIG. 2 is a flow diagram of the training text pre-processing module of FIG. 1.
FIG. 3 is a flow diagram of the global weight calculation module of FIG. 1.
The invention will be further explained with reference to the drawings.
Detailed Description
See fig. 1. According to the invention, the method for extracting topic-feature text keywords comprises the following steps. Text is taken as the carrier of information, and text keywords are extracted according to topic distribution characteristics in a training stage and a testing stage; the text keyword extraction algorithm model is composed of a training text preprocessing module, an inverse document frequency calculation module, a topic model learning module, a global weight calculation module, a test text preprocessing module, a local weight calculation module, and a comprehensive score calculation and sorting module. In the training stage, the training text preprocessing module sequentially performs Chinese word segmentation, stop-word removal, and part-of-speech filtering on the input training text data, and then inputs the preprocessed training text data to the topic model learning module and the inverse document frequency calculation module. The topic model learning module uses a topic model method to learn the topic distribution characteristics of the words in an unsupervised manner from the preprocessed training text data, obtaining a word-topic matrix that reflects the probability distribution characteristics of the words over different topics. The inverse document frequency calculation module uses the preprocessed training text data, counts for each word the number of training documents containing it, calculates the ratio of the total number of training texts to that document count, and takes the logarithm of the ratio as the inverse document frequency. The global weight calculation module calculates the topic distribution entropy of each word from the inverse document frequency calculation result and the word-topic matrix obtained by the topic model learning module, multiplies the inverse document frequency corresponding to the word by the inverse of its topic distribution entropy to obtain the global weight of each word, and sends the global weight calculation result to the comprehensive score calculation and sorting module of the test stage. The comprehensive score calculation and sorting module multiplies, for each word, the global weight obtained in the training stage by the local weight obtained from the local weight calculation module, calculates the comprehensive score of each word and sorts the words, and takes the several highest-scoring words in the ranking as the keyword extraction result for the current test text.
The inverse document frequency calculation module and the topic model learning module form a parallel calculation module between the text preprocessing module and the global weight calculation module, and respectively calculate the document distribution characteristics of words and the topic distribution characteristics of words.
When calculating the global weight of the vocabulary, the global weight calculation module introduces information entropy on the basis of the traditional inverse document frequency (IDF) calculation: after normalizing the topic probability distribution of each word, it calculates the word's topic distribution entropy and combines it with the traditional IDF value to obtain the global weight of the word.
The test stage comprises a test text preprocessing module, a local weight calculation module, and a comprehensive score calculation and sorting module. The test text preprocessing module sequentially performs Chinese word segmentation, stop-word removal, and part-of-speech filtering on an input test text and inputs the preprocessed test text to the local weight calculation module. The local weight calculation module counts the normalized term frequency of each word in the test text from the preprocessing result, takes it as the local weight of the word, and sends the local weight calculation result to the comprehensive score calculation and sorting module. The comprehensive score calculation and sorting module multiplies, for each word, the global weight obtained in the training stage by the local weight obtained in the test stage, calculates the comprehensive score of each word and sorts the words, and takes the several highest-scoring words in the ranking as the keyword extraction result for the current test text.
See fig. 2. The training text preprocessing module first performs word segmentation on the several input training texts using a Chinese word segmentation technique, obtaining the segmented vocabulary lists of the training texts together with the part-of-speech tag of each vocabulary item; the open-source toolkit FudanNLP is adopted for the segmentation. Stop-word processing is then applied to the text vocabulary according to a stop-word list: the vocabulary list containing M items is compared against the stop-word list, and any item found in the stop-word list is deleted. Part-of-speech filtering is then applied according to a filtered part-of-speech list: the vocabulary list is compared against the filtered part-of-speech list, and all items whose part of speech appears in that list are deleted. The resulting preprocessed vocabulary list is recorded as word = {word_1, …, word_j, …, word_M} and is output to the inverse document frequency calculation module and the topic model learning module, where j is the index of the j-th item among the M vocabulary items.
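The preprocessing chain described above (segmentation, stop-word removal, part-of-speech filtering) can be sketched as below. The input is assumed to be already segmented into (word, POS-tag) pairs — the patent itself uses the FudanNLP toolkit for the segmentation step — and the stop list, POS tags, and names here are illustrative assumptions, not the patent's actual lists.

```python
STOP_WORDS = {"的", "了", "是"}   # illustrative stop-word list
FILTER_POS = {"u", "p", "c"}      # illustrative POS tags to drop (particles, prepositions, conjunctions)

def preprocess(tagged_words):
    """Keep words that are neither stop words nor of a filtered part of speech.

    tagged_words: list of (word, pos_tag) pairs produced by a segmenter.
    Returns the preprocessed vocabulary list word_1 ... word_M.
    """
    return [w for w, pos in tagged_words
            if w not in STOP_WORDS and pos not in FILTER_POS]

tagged = [("主题", "n"), ("的", "u"), ("模型", "n"), ("是", "v"), ("关键词", "n")]
vocab_list = preprocess(tagged)
```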
The inverse document frequency calculation module counts, from the vocabulary list word = {word_1, …, word_j, …, word_M}, all distinct vocabulary items that appear, forming a dictionary w = (w_1, …, w_i, …, w_N) containing N words. It then takes each word w_i in the dictionary in turn and counts the number of documents df_i in the input training texts that contain w_i; the inverse document frequency idf_i is the logarithm of the ratio between the total number of training documents Num_doc and the number of documents df_i containing w_i, that is, idf_i = log(Num_doc / df_i). The process is repeated until the inverse document frequency of every word in the dictionary has been counted, forming the inverse document frequency matrix idf = (idf_1, …, idf_i, …, idf_N), where w_i is the i-th word in the dictionary, i = 1, 2, …, N, and N is the number of words.
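A minimal sketch of this module, assuming each training document is already a preprocessed token list and using idf_i = log(Num_doc / df_i) as in the text:

```python
import math

def inverse_document_frequency(train_docs):
    """Build the dictionary w_1..w_N and the idf value of each word."""
    dictionary = sorted({w for doc in train_docs for w in doc})
    num_doc = len(train_docs)                 # total number of training documents
    idf = {}
    for w in dictionary:
        df = sum(1 for doc in train_docs if w in doc)  # documents containing w
        idf[w] = math.log(num_doc / df)
    return idf

docs = [["a", "b"], ["b", "c"], ["b", "d"]]
idf = inverse_document_frequency(docs)
```

A word present in every document ("b" above) gets idf = log(1) = 0, which is the noise-suppressing behavior described earlier.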
The topic model learning module takes the vocabulary list word = {word_1, …, word_j, …, word_M} obtained by the training text preprocessing module and applies a common topic model learning method, such as latent Dirichlet allocation (LDA), for learning and training to obtain a probability matrix P between words and topics.
the probability matrix P is a matrix with N columns and K rows and the size of N multiplied by K, wherein the column number N represents the number of words in a dictionary, and the row number K represents the number of topics set artificially. The probability matrix P reflects the probability distribution characteristics of words on different subjects, wherein P1、p2、pNIs the topic probability distribution vector of different words in the probability matrix P.
LDA is a generative topic model for documents, also called a three-layer Bayesian probability model; it is an unsupervised machine learning technique comprising a three-layer structure of words, topics, and documents. LDA can be used to identify latent topic information in large-scale document sets or corpora. It adopts the bag-of-words approach, which treats each document as a word-frequency vector, thereby converting text information into numerical information that is easy to model.
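The bag-of-words step that LDA relies on — turning each document into a word-frequency vector over the dictionary — can be sketched as follows; the fixed dictionary ordering is an assumption of this sketch, not something the patent specifies.

```python
from collections import Counter

def bag_of_words(doc, dictionary):
    """Represent a tokenized document as a word-frequency vector over the dictionary."""
    counts = Counter(doc)
    # One count per dictionary word, in dictionary order; absent words count 0.
    return [counts[w] for w in dictionary]

dictionary = ["entropy", "keyword", "topic"]
vec = bag_of_words(["topic", "keyword", "topic"], dictionary)
```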
See fig. 3. From the word-topic matrix obtained by the topic model learning module, the global weight calculation module takes one row at a time, recorded as p_i = (p_i1, …, p_ij, …, p_iK), where p_i represents the probability distribution of the i-th word over the different topics and p_ij represents the probability value of the i-th word under the j-th topic. The vector p_i is normalized to obtain the normalized topic probability distribution vector p̄_i = (p̄_i1, …, p̄_iK), where each component is calculated by the formula p̄_ij = p_ij / (p_i1 + … + p_iK).
Using the normalized topic probability distribution vector p̄_i, the global weight calculation module computes the information entropy ent_i = −(p̄_i1·log p̄_i1 + … + p̄_iK·log p̄_iK). Information entropy measures the expected information of a random variable: the larger the entropy of a variable, the more diverse its outcomes, that is, the more content it carries. The larger the information entropy ent_i, the more uniform the distribution of the word over the different topics — that is, the word has no obvious topic tendency — and the less likely the word is a keyword; conversely, a smaller information entropy indicates a stronger topic tendency of the word and a greater likelihood that it is a keyword.
Using the information entropy ent_i of word w_i and the inverse document frequency idf_i obtained by the inverse document frequency calculation module, the global weight calculation module computes the global weight of w_i according to the global weight calculation formula g_i = idf_i / ent_i. The above process is repeated until the global weight calculation of all words in the dictionary is completed, yielding the global weight calculation result, where g is the global weight identifier.
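The FIG. 3 computation described above — normalizing a word's topic row, taking its entropy, and dividing the word's IDF by that entropy — can be sketched as below. The natural-logarithm base and the toy numbers are assumptions; a production implementation would also need to guard against a zero entropy for a word concentrated on a single topic.

```python
import math

def global_weight(p_i, idf_i):
    """Global weight of one word: its idf divided by the entropy of its
    normalized topic probability distribution p_i = (p_i1, ..., p_iK)."""
    total = sum(p_i)
    p_norm = [p / total for p in p_i]                      # normalized topic distribution
    ent = -sum(p * math.log(p) for p in p_norm if p > 0)   # topic distribution entropy
    return idf_i / ent

# A word spread evenly over 4 topics (high entropy, weak topic tendency)
# versus a word concentrated on one topic (low entropy, strong tendency).
g_flat = global_weight([0.25, 0.25, 0.25, 0.25], idf_i=1.0)
g_peaked = global_weight([0.97, 0.01, 0.01, 0.01], idf_i=1.0)
```

As the text states, the topic-focused word receives the larger global weight at equal IDF.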
The test text preprocessing module performs, for each input test text, operations similar to those of the training text preprocessing module, including word segmentation, stop-word removal, and part-of-speech filtering, and outputs a preprocessed vocabulary list word = {word_1, …, word_j, …, word_T}, where j = 1, 2, …, T indexes the j-th vocabulary item and the list contains T items in total.
According to the output vocabulary list word = {word_1, …, word_j, …, word_T}, the local weight calculation module counts in turn, for each word w_i in the dictionary w = (w_1, …, w_i, …, w_N), its number of occurrences tf_i in the vocabulary list word, and normalizes tf_i to obtain the local weight tf_i^l = tf_i / T, i = 1, 2, …, N, where T is the total number of words in the vocabulary list word and l is the local weight identifier.
After obtaining the local weight calculation result tf_i^l, the comprehensive score calculation and sorting module combines it with the global weight g_i obtained in the training stage to calculate the comprehensive score of the i-th word in the dictionary, score_i = g_i × tf_i^l, i = 1, 2, …, N. The scores of all words are calculated in turn to obtain the word score matrix score = (score_1, …, score_i, …, score_N). The comprehensive score calculation and sorting module then arranges the comprehensive scores of all words in descending order and takes the Q highest-scoring words as the keyword extraction result for the current test text; the extracted Q words reflect the main content and meaning of the text, and the extraction result for the test text is output, where Q is set manually as required.
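The test-stage scoring described above — normalized term frequency as local weight, multiplied by the trained global weight, ranked, top Q taken — can be sketched end to end. The global-weight values below are made-up inputs standing in for the training-stage result, not learned values.

```python
from collections import Counter

def extract_keywords(test_tokens, global_weights, q):
    """Score each dictionary word by score_i = g_i * (tf_i / T) and return the top-Q words."""
    t = len(test_tokens)                 # total words T in the test vocabulary list
    tf = Counter(test_tokens)
    scores = {w: g * (tf[w] / t) for w, g in global_weights.items()}
    return [w for w, _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:q]]

g = {"topic": 2.0, "model": 1.5, "entropy": 3.0, "the": 0.1}  # assumed trained global weights
tokens = ["topic", "model", "topic", "entropy", "the", "the"]
keywords = extract_keywords(tokens, g, q=2)
```

Note how "the", though as frequent as "topic" in the test text, is suppressed by its tiny global weight — the role the global weight plays in the composite score.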
The foregoing description of the invention and its embodiments is provided to persons skilled in the art and is to be considered illustrative rather than restrictive. An engineer may work according to the idea of the claims and make various changes in form and detail without departing from the spirit and scope of the invention defined by the appended claims; all such changes are considered to be within the scope of the present invention.

Claims (10)

1. A method for extracting keywords of a topic feature text, characterized by comprising the following steps: text is taken as a carrier of information, and text keyword extraction is divided into a training stage and a testing stage according to topic distribution characteristics; the text keyword extraction algorithm model is composed of a training text preprocessing module, an inverse document frequency calculation module, a topic model learning module, a global weight calculation module, a test text preprocessing module, a local weight calculation module and a comprehensive score calculation and sorting module; the training text preprocessing module sequentially performs Chinese word segmentation, stop word removal and part-of-speech filtering on the input training text data, and then inputs the preprocessed training text data to the topic model learning module and the inverse document frequency calculation module; the topic model learning module automatically learns the topic distribution characteristics of words in an unsupervised manner by applying a topic model method to the preprocessed training text data, obtaining through learning and training a word-topic matrix that reflects the probability distribution characteristics of the words over different topics; the inverse document frequency calculation module uses the preprocessed training text data to count, for each word, the number of training documents containing the word, calculates the ratio of the total number of training texts to that document number, and takes the logarithm of the ratio as the inverse document frequency of the word; the global weight calculation module calculates the topic distribution entropy of each word according to the inverse document frequency calculation result and the word-topic matrix obtained by the topic model learning module, multiplies the inverse of the topic distribution entropy by the inverse document frequency of the corresponding word to obtain the global weight of each word, and sends the global weight calculation result to the comprehensive score calculation and sorting module used in the testing stage; and the comprehensive score calculation and sorting module multiplies, for each word, the global weight obtained in the training stage by the local weight obtained from the local weight calculation module, calculates the comprehensive score of each word and sorts the words, and takes the several words with the highest scores in the ranking as the keyword extraction result of the current test text.
2. The method of claim 1, wherein: the testing stage comprises the test text preprocessing module, the local weight calculation module and the comprehensive score calculation and sorting module; the test text preprocessing module sequentially performs Chinese word segmentation, stop word removal and part-of-speech filtering on an input test text, and inputs the preprocessed test text into the local weight calculation module; and the local weight calculation module counts, according to the test text preprocessing result, the normalized word frequency of each word in the test text and takes the normalized word frequency as the local weight calculation result of the word.
3. The method of claim 1, wherein: the training text preprocessing module first performs word segmentation on a plurality of input training texts by using a Chinese word segmentation technique, obtaining the segmented vocabulary list of each training text together with the part-of-speech tagging information corresponding to each vocabulary.
4. The method of claim 1, wherein: the training text preprocessing module adopts the open source toolkit FudanNLP to perform word segmentation; the text vocabularies are then subjected to stop word removal according to a stop word table, in which a vocabulary list containing M vocabularies is compared with the stop word table and any vocabulary found in the stop word table is deleted; the text vocabularies are then subjected to part-of-speech filtering according to a filtering part-of-speech table, in which the vocabulary list is compared with the filtering part-of-speech table and all vocabularies of the corresponding parts of speech are deleted from the vocabulary list; the resulting preprocessed vocabulary list is denoted as word = {word_1, ..., word_j, ..., word_M} and is output to the inverse document frequency calculation module and the topic model learning module, where j is the sequence number of the j-th vocabulary among the M vocabularies.
5. The method of claim 1, wherein: the inverse document frequency calculation module counts all the distinct vocabularies appearing in the vocabulary list word = {word_1, ..., word_j, ..., word_M} to form a dictionary w = (w_1, ..., w_i, ..., w_N) containing N vocabularies; it then takes each word w_i in the dictionary in turn and counts the number df_i of input training documents containing the vocabulary w_i; the inverse document frequency idf_i is the logarithm of the ratio between the total number Num_doc of documents in the training text and the number df_i of documents containing the word w_i, namely idf_i = log(Num_doc / df_i); the process is repeated until the inverse document frequencies of all words in the dictionary have been counted, forming the inverse document frequency matrix idf = (idf_1, ..., idf_i, ..., idf_N), where w_i denotes the i-th vocabulary in the dictionary, i = 1, 2, ..., N, and N is the number of vocabularies.
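The inverse document frequency of claim 5 can be sketched directly from its definition. This is an illustrative sketch, not the patent's implementation; documents are modeled as sets of words, and since the dictionary is built from the training corpus, every dictionary word has df_i ≥ 1:

```python
import math

def inverse_document_frequencies(docs, dictionary):
    """idf_i = log(Num_doc / df_i), where df_i = number of docs containing w_i."""
    num_doc = len(docs)  # Num_doc: total number of training documents
    idf = []
    for w in dictionary:
        df = sum(1 for d in docs if w in d)  # df_i: documents containing w
        idf.append(math.log(num_doc / df))
    return idf

docs = [{"a", "b"}, {"a", "c"}, {"a"}]
idf = inverse_document_frequencies(docs, ["a", "b"])
# "a": df = 3 → log(3/3) = 0;  "b": df = 1 → log(3/1)
```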
6. The method of claim 5, wherein: the topic model learning module, according to the vocabulary list word = {word_1, ..., word_j, ..., word_M} obtained by the training text preprocessing module, performs learning and training with the three-layer Bayesian probability model LDA of the topic model learning method to obtain a probability matrix P between words and topics; the probability matrix P is an N × K matrix with N rows and K columns, where the row number N represents the number of words in the dictionary and the column number K represents the manually set number of topics; the probability matrix P reflects the probability distribution characteristics of the words over different topics, and p_1, p_2, ..., p_N are the topic probability distribution vectors of the different words in the probability matrix P.
7. The method of claim 6, wherein: the global weight calculation module takes one row at a time from the word-topic matrix obtained by the topic model learning module, denoted p_i = (p_i1, ..., p_ij, ..., p_iK), and normalizes p_i to obtain the normalized topic probability distribution vector, each component of which is calculated as p̄_ij = p_ij / Σ_{k=1}^{K} p_ik, where p_i represents the probability distribution of the i-th vocabulary under the different topics and p_ij represents the probability value of the i-th vocabulary under the j-th topic.
8. The method of claim 6, wherein: the global weight calculation module uses the information entropy ent_i of the vocabulary w_i and the inverse document frequency idf_i obtained by the inverse document frequency calculation module, and calculates the global weight of the vocabulary w_i according to the global weight calculation formula w_i^g = idf_i / ent_i; the above process is repeated until the global weight calculation of all the vocabularies in the dictionary is completed, yielding the global weight calculation result, where g is the global weight identifier.
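Claims 7 and 8 together amount to: normalize a word's topic-distribution row, take its entropy, and divide the word's idf by that entropy, so topic-focused words (low entropy) get larger global weights. A minimal Python sketch under two assumptions not fixed by the patent: the natural logarithm is used for the entropy, and the function names are illustrative:

```python
import math

def topic_entropy(p_row):
    """Entropy of one word's normalized topic distribution (claim 7)."""
    s = sum(p_row)
    probs = [p / s for p in p_row]  # normalization p_ij / sum_k p_ik
    return -sum(p * math.log(p) for p in probs if p > 0)

def global_weight(p_row, idf_i):
    """w_i^g = idf_i / ent_i (claim 8): lower topic entropy → higher weight."""
    return idf_i / topic_entropy(p_row)

# A word spread uniformly over 2 topics has entropy log 2
ent = topic_entropy([0.5, 0.5])  # → log(2)
```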
9. The method of claim 5, wherein: the local weight calculation module, according to the output vocabulary list word = {word_1, ..., word_j, ..., word_T}, counts in turn, for every word w_i in the dictionary w = (w_1, ..., w_i, ..., w_N), its number of occurrences tf_i in the vocabulary list word, and normalizes tf_i into the local weight w_i^L = tf_i / T, i = 1, 2, ..., N, where T denotes the total number of vocabularies in the vocabulary list word and L is the local weight identifier.
10. The method of extracting topic feature text keywords as claimed in claim 9, wherein: after obtaining the local weight calculation result w_i^L, the comprehensive score calculation and sorting module combines it with the global weight w_i^g obtained in the training stage and calculates the comprehensive score of the i-th word in the dictionary as score_i = w_i^L × w_i^g; the scores of all words are calculated in turn to obtain the word score matrix score = (score_1, ..., score_i, ..., score_N), where i = 1, 2, ..., N.
CN201810516408.2A 2018-05-25 2018-05-25 Theme feature text key word extracting method Pending CN108763213A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810516408.2A CN108763213A (en) 2018-05-25 2018-05-25 Theme feature text key word extracting method


Publications (1)

Publication Number Publication Date
CN108763213A true CN108763213A (en) 2018-11-06

Family

ID=64006351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810516408.2A Pending CN108763213A (en) 2018-05-25 2018-05-25 Theme feature text key word extracting method

Country Status (1)

Country Link
CN (1) CN108763213A (en)

Cited By (48)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766544A (en) * 2018-12-24 2019-05-17 中国科学院合肥物质科学研究院 Document keyword abstraction method and device based on LDA and term vector
CN109783798A (en) * 2018-12-12 2019-05-21 平安科技(深圳)有限公司 Method, apparatus, terminal and the storage medium of text information addition picture
CN109859291A (en) * 2019-02-21 2019-06-07 北京一品智尚信息科技有限公司 Intelligent LOGO design method, system and storage medium
CN109977399A (en) * 2019-03-05 2019-07-05 国网青海省电力公司 A kind of data analysing method and device based on NLP technology
CN110059311A (en) * 2019-03-27 2019-07-26 银江股份有限公司 A kind of keyword extracting method and system towards judicial style data
CN110134767A (en) * 2019-05-10 2019-08-16 云知声(上海)智能科技有限公司 A kind of screening technique of vocabulary
CN110287481A (en) * 2019-05-29 2019-09-27 西南电子技术研究所(中国电子科技集团公司第十研究所) Name entity corpus labeling training system
CN110298033A (en) * 2019-05-29 2019-10-01 西南电子技术研究所(中国电子科技集团公司第十研究所) Keyword corpus labeling trains extracting tool
CN110377734A (en) * 2019-07-01 2019-10-25 厦门美域中央信息科技有限公司 A kind of file classification method based on support vector machines
CN110413997A (en) * 2019-07-16 2019-11-05 深圳供电局有限公司 New word discovery method, system and readable storage medium for power industry
CN110413863A (en) * 2019-08-01 2019-11-05 信雅达系统工程股份有限公司 A kind of public sentiment news duplicate removal and method for pushing based on deep learning
CN110580279A (en) * 2019-08-19 2019-12-17 湖南正宇软件技术开发有限公司 Information classification method, system, equipment and storage medium
CN110705285A (en) * 2019-09-20 2020-01-17 北京市计算中心 Government affair text subject word bank construction method, device, server and readable storage medium
CN110852085A (en) * 2019-08-19 2020-02-28 湖南正宇软件技术开发有限公司 Hotspot topic mining method and system
CN111160025A (en) * 2019-12-12 2020-05-15 日照睿安信息科技有限公司 Method for actively discovering case keywords based on public security text
WO2020155769A1 (en) * 2019-01-30 2020-08-06 平安科技(深圳)有限公司 Method and device for establishing keyword generation model
CN111625578A (en) * 2020-05-26 2020-09-04 辽宁大学 Feature extraction method suitable for time sequence data in cultural science and technology fusion field
CN111680131A (en) * 2020-06-22 2020-09-18 平安银行股份有限公司 Document clustering method and system based on semantics and computer equipment
CN111753048A (en) * 2020-05-21 2020-10-09 高新兴科技集团股份有限公司 Document retrieval method, device, equipment and storage medium
CN111950261A (en) * 2020-10-16 2020-11-17 腾讯科技(深圳)有限公司 Method, device and computer readable storage medium for extracting text keywords
CN111949767A (en) * 2020-08-20 2020-11-17 深圳市卡牛科技有限公司 Method, device, equipment and storage medium for searching text keywords
CN112100317A (en) * 2020-09-24 2020-12-18 南京邮电大学 Feature keyword extraction method based on theme semantic perception
CN112100492A (en) * 2020-09-11 2020-12-18 河北冀联人力资源服务集团有限公司 Batch delivery method and system for resumes of different versions
CN112199926A (en) * 2020-10-16 2021-01-08 中国地质大学(武汉) Geological report text visualization method based on text mining and natural language processing
CN112464635A (en) * 2020-07-27 2021-03-09 上海汇招信息技术有限公司 Method and system for automatically scoring bid document
CN112507060A (en) * 2020-12-14 2021-03-16 福建正孚软件有限公司 Domain corpus construction method and system
CN112530591A (en) * 2020-12-10 2021-03-19 厦门越人健康技术研发有限公司 Method for generating auscultation test vocabulary and storage equipment
CN112541105A (en) * 2019-09-20 2021-03-23 福建师范大学地理研究所 Keyword generation method, public opinion monitoring method, device, equipment and medium
CN112686026A (en) * 2021-03-17 2021-04-20 平安科技(深圳)有限公司 Keyword extraction method, device, equipment and medium based on information entropy
CN112765979A (en) * 2021-01-15 2021-05-07 西华大学 System and method for extracting thesis keywords
CN112836507A (en) * 2021-01-13 2021-05-25 哈尔滨工程大学 Method for extracting domain text theme
CN112989824A (en) * 2021-05-12 2021-06-18 武汉卓尔数字传媒科技有限公司 Information pushing method and device, electronic equipment and storage medium
CN113095073A (en) * 2021-03-12 2021-07-09 深圳索信达数据技术有限公司 Corpus tag generation method and device, computer equipment and storage medium
WO2021139466A1 (en) * 2020-01-06 2021-07-15 北京大米科技有限公司 Topic word determination method for text, device, storage medium, and terminal
CN113191147A (en) * 2021-05-27 2021-07-30 中国人民解放军军事科学院评估论证研究中心 Unsupervised automatic term extraction method, apparatus, device and medium
CN113312463A (en) * 2021-05-26 2021-08-27 中国平安人寿保险股份有限公司 Intelligent evaluation method and device for voice question answering, computer equipment and storage medium
CN113408286A (en) * 2021-05-28 2021-09-17 浙江工业大学 Chinese entity identification method and system for mechanical and chemical engineering field
CN113761130A (en) * 2021-08-31 2021-12-07 珠海读书郎软件科技有限公司 System and method for assisting composition writing
CN113808742A (en) * 2021-08-10 2021-12-17 三峡大学 LSTM (localized surface technology) attention mechanism disease prediction method based on text feature dimension reduction
CN114065759A (en) * 2021-11-19 2022-02-18 深圳视界信息技术有限公司 Model failure detection method and device, electronic equipment and medium
CN114357996A (en) * 2021-12-06 2022-04-15 北京网宿科技有限公司 Time sequence text feature extraction method and device, electronic equipment and storage medium
CN114491034A (en) * 2022-01-24 2022-05-13 聚好看科技股份有限公司 Text classification method and intelligent device
CN114580557A (en) * 2022-03-10 2022-06-03 北京中知智慧科技有限公司 Document similarity determination method and device based on semantic analysis
CN115983251A (en) * 2023-02-16 2023-04-18 江苏联著实业股份有限公司 Text topic extraction system and method based on sentence analysis
CN115994527A (en) * 2023-03-23 2023-04-21 广东聚智诚科技有限公司 Machine learning-based PPT automatic generation system
CN116074036A (en) * 2022-11-21 2023-05-05 兴业银行股份有限公司 Attack behavior detection and identification method and system based on log features of security equipment
CN116842945A (en) * 2023-07-07 2023-10-03 中国标准化研究院 Digital library data mining method
CN117993392A (en) * 2024-03-05 2024-05-07 北京引智科技有限公司 Comprehensive information analysis method and system based on keyword extraction

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105843795A (en) * 2016-03-21 2016-08-10 华南理工大学 Topic model based document keyword extraction method and system
US20160314191A1 (en) * 2015-04-24 2016-10-27 Linkedin Corporation Topic extraction using clause segmentation and high-frequency words
CN106294863A (en) * 2016-08-23 2017-01-04 电子科技大学 A kind of abstract method for mass text fast understanding


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘啸剑: "基于主题模型的关键词抽取算法研究", 《全国优秀硕士学位论文全文数据库》 *
钱爱兵: "基于改进TF-IDF的中文网页关键词抽取——以新闻网页为例", 《情报理论与实践》 *


Similar Documents

Publication Publication Date Title
CN108763213A (en) Theme feature text key word extracting method
CN111177365B (en) Unsupervised automatic abstract extraction method based on graph model
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN110298033B (en) Keyword corpus labeling training extraction system
CN109101477B (en) Enterprise field classification and enterprise keyword screening method
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN110825877A (en) Semantic similarity analysis method based on text clustering
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN103838833B (en) Text retrieval system based on correlation word semantic analysis
CN109670014B (en) Paper author name disambiguation method based on rule matching and machine learning
CN112989802B (en) Bullet screen keyword extraction method, bullet screen keyword extraction device, bullet screen keyword extraction equipment and bullet screen keyword extraction medium
CN110134792B (en) Text recognition method and device, electronic equipment and storage medium
CN108804595B (en) Short text representation method based on word2vec
CN109885675B (en) Text subtopic discovery method based on improved LDA
CN108038099B (en) Low-frequency keyword identification method based on word clustering
CN111368088A (en) Text emotion classification method based on deep learning
CN110705247A (en) Based on x2-C text similarity calculation method
CN111309916A (en) Abstract extraction method and device, storage medium and electronic device
CN110728144A (en) Extraction type document automatic summarization method based on context semantic perception
Celikyilmaz et al. A graph-based semi-supervised learning for question-answering
Shetty et al. Auto text summarization with categorization and sentiment analysis
CN110728135A (en) Text theme indexing method and device, electronic equipment and computer storage medium
Yan et al. Sentiment Analysis of Short Texts Based on Parallel DenseNet.
CN108804422B (en) Scientific and technological paper text modeling method
CN114298020B (en) Keyword vectorization method based on topic semantic information and application thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20181106