CN108763213A - Method for extracting topic-feature text keywords - Google Patents

Info

Publication number: CN108763213A
Application number: CN201810516408.2A
Original language: Chinese (zh)
Inventors: 彭易锦, 代翔, 黄细凤, 王侃, 杨拓
Applicant and current assignee: Southwest Electronic Technology Institute No 10 Institute of Cetc
Authority: CN (China)
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities


Abstract

The invention discloses a method for extracting topic-feature text keywords that yields better extraction results than the traditional TF-IDF method. The technical scheme is as follows. In the training stage, training texts are preprocessed by word segmentation, stop-word removal, part-of-speech filtering, and the like, and the inverse document frequency of each word is counted; at the same time, a topic model method is used to learn a word-topic probability matrix, which is normalized; the topic distribution entropy of each word is then calculated from the word-topic probability matrix, and the global weight of each word is calculated by combining its inverse document frequency with its topic distribution entropy. The global weight calculation results are output to the test stage. After a test text is preprocessed, the normalized term frequency of each word in the test text is counted and combined with the global weights obtained in the training stage to calculate a comprehensive score for each word; the words are ranked by score, and the several highest-scoring words in the ranking are taken as the automatic keyword extraction result for the current test text.

Description

Method for extracting topic-feature text keywords
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a text keyword extraction method based on topic distribution characteristics of words.
Background
Keyword extraction is key to technologies such as information retrieval, text classification and clustering, and automatic abstract generation, and is an important means of rapidly acquiring document topics. Keywords are conventionally defined as a set of words or phrases that summarize the subject matter of a document. Keywords represent the topical and critical content of a document and are the minimum units for expressing the core content of a text. They matter in many fields, such as automatic summarization of documents, extraction of web page information, classification and clustering of documents, and search engines. In most cases, however, a text does not directly give its keywords, so a keyword extraction method must be designed. The purpose of keyword extraction is to extract from a text the characteristic words that reflect its main content and meaning: a typical text keyword extraction method extracts the characteristic words of the text, calculates the weight of each characteristic word according to some rule, and determines from those weights the keywords that reflect the subject content of the text. Because internet resources are constantly updated and Chinese text is growing explosively, extracting keywords manually is time-consuming and somewhat subjective, so methods for automatically extracting keywords from documents need to be researched. Keyword extraction, also known as keyword labeling, is the process of extracting from a text the words or phrases most relevant to the ideas the text expresses; automatic keyword extraction is an automated technique for identifying or labeling such representative words or phrases in a document.
Automatic keyword extraction from text has long been a key problem and research hotspot in the field of natural language processing. With the increasing demands of text data applications, many automatic keyword extraction methods have been proposed in recent years, and some perform well in specific fields; however, general automatic keyword extraction methods independent of language and domain need further research. At present, some keyword extraction systems are realized with a single method, while others combine multiple methods. According to the core method adopted, they can be summarized into the following representative categories:
1) Methods based on topic word lists. These methods establish a topic word list for a specific field and calculate the weight of a word by combining the word list with factors such as word length and word frequency. Such methods are limited by the background lexicon, so the extracted keywords may not be comprehensive.
2) Methods based on word sense. These methods use a rule base or a synonym dictionary to label word senses, then perform word-sense disambiguation and calculate the weight of each word from the disambiguation result. Such methods are directly influenced by the quality of the user-built rule base; in addition, the extra work of word-sense disambiguation and synonym identification makes extraction inefficient.
3) Statistics-based methods. Statistics-based methods are currently the most widely used. They extract the keywords of a document from the statistical information of its words: weights are assigned to words by calculating features such as TF, DF, TF-IDF, and information entropy, combined with positional features such as appearance in titles or at the heads of paragraphs, and keywords are extracted in weight order. These methods are relatively simple and generally need no training data or external knowledge base; a candidate keyword set can be obtained by screening with simple statistical rules such as part-of-speech filtering and word frequency, and the candidates are then evaluated with some statistic to realize keyword extraction. Their disadvantages are a large amount of calculation; extraction results containing semantically incomplete character strings, which lowers accuracy; an inability to extract low-frequency words; and the need for a large amount of original text.
4) Methods based on topic models. A topic model is a probabilistic language model that simulates human writing: a document is formed by mixing several topics, and each topic is a probability distribution over the vocabulary. The more pronounced the topical characteristics of a word in a document, the greater its ability to represent a certain topic. A topic model method is used to calculate the topic weights of the words and obtain a word-topic matrix, and then the several words with the highest weight under each topic are selected as the keywords of that topic.
5) Methods based on complex networks. These are unsupervised methods that first construct a language network graph with the characteristic words as nodes and the relations between them as edges, then analyze the constructed graph to find the words or phrases that play a central role, which are the keywords of the document. Such methods build the words of a document into a network according to a given rule and, by verifying the small-world property of the network, extract as keywords the nodes with the strongest influence on the network's average path length. They cannot explain the relation between the keywords and the variation of the average path length, have difficulty guaranteeing the connectivity of the network, and require a large amount of calculation.
6) Methods based on neural networks. Most current exploration of neural network-based methods builds on word vector representations and rests on two assumptions: the words of a document develop around its keywords, which embody the central idea of the article, and most words in the article are semantically similar to the keywords. Research on neural network-based keyword extraction is still at an early stage. Although keyword extraction technology has developed in recent years, extraction results are still not satisfactory.
Among the above methods, the statistics-based method is the earliest studied and most widely used; it focuses mainly on statistical characteristics, has strong model generalization capability and easy implementation, and is independent of language and field. The most typical such method is the Term Frequency-Inverse Document Frequency (TF-IDF) method. TF-IDF evaluates the importance of a word to a document: TF, the term frequency, measures the word's ability to describe the document content, and IDF, the inverse document frequency, measures the word's ability to distinguish between documents. The guiding idea of TF-IDF rests on the basic assumption that words occurring many times in one text will also occur many times in other similar texts, and vice versa. In addition, the ability of words to distinguish different categories is considered: the lower the document frequency of a word, the greater its ability to distinguish categories. When the TF-IDF algorithm calculates the weight of a word, a word that appears frequently in one document but rarely in other documents has strong discriminating power for that document, and its weight value is correspondingly large. The TF-IDF algorithm has the advantages of simplicity and speed, and its results accord relatively well with the actual situation.
However, the traditional TF-IDF method measures the importance of a word simply by frequency. It uses only word frequency features and does not consider how a word is distributed across the different categories or topics of the document set, nor does it reflect features such as part of speech. As a result, some low-frequency words that cannot represent the text receive very high IDF values, while some high-frequency words that represent the text well receive very low IDF values. IDF is essentially a weighting that tries to suppress noise: it simply assumes that words with a small document frequency are more important and words with a large document frequency are less useful, which is obviously not entirely correct. IDF cannot effectively reflect the importance of words or the distribution of feature words, so it cannot perform the weight-adjusting function well, and the accuracy of the TF-IDF method is therefore not very high.
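For concreteness, the traditional TF-IDF weighting discussed above can be sketched in a few lines. The tokenized toy corpus and the unsmoothed logarithm form are illustrative assumptions, not the exact formulation of any particular system.

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF weights for a list of tokenized documents.

    TF is the within-document normalized term frequency;
    IDF is log(total documents / documents containing the term).
    """
    n_docs = len(docs)
    df = Counter()                       # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n_docs / df_t) for t, df_t in df.items()}
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weights.append({t: (c / total) * idf[t] for t, c in tf.items()})
    return weights

docs = [["topic", "model", "keyword"],
        ["keyword", "extraction", "keyword"],
        ["topic", "entropy"]]
w = tf_idf(docs)
```

Note how "extraction", which appears in only one document, outweighs "keyword" in document 1 despite occurring half as often — exactly the IDF behavior the passage above criticizes for low-frequency words.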
Disclosure of Invention
Aiming at the defects that the traditional TF-IDF method uses only word frequency features and does not consider the category or topic distribution characteristics of words, the invention provides a keyword extraction method that has high extraction efficiency and high accuracy and makes full use of the topic distribution characteristics of words.
In order to achieve the above object, the present invention provides a method for extracting topic-feature text keywords, characterized by comprising the following steps. Text is taken as the carrier of information, and text keywords are extracted according to topic distribution characteristics in two stages, a training stage and a testing stage; the text keyword extraction algorithm model is composed of a training text preprocessing module, an inverse document frequency calculation module, a topic model learning module, a global weight calculation module, a test text preprocessing module, a local weight calculation module, and a comprehensive score calculation and sorting module. In the training stage, the training text preprocessing module sequentially performs Chinese word segmentation, stop-word removal, and part-of-speech filtering on the input training text data, and then inputs the preprocessed training text data to the topic model learning module and the inverse document frequency calculation module. The topic model learning module uses a topic model method to learn the topic distribution characteristics of the words in an unsupervised manner from the preprocessed training text data, obtaining through learning and training a word-topic matrix that reflects the probability distribution characteristics of the words over different topics. The inverse document frequency calculation module uses the preprocessed training text data and, for each word, counts the number of training documents containing that word, calculates the ratio of the total number of training texts to that document count, and takes the logarithm of the ratio as the inverse document frequency. The global weight calculation module calculates the topic distribution entropy of each word according to the inverse document frequency calculation result and the word-topic matrix obtained by the topic model learning module, multiplies the inverse of the topic distribution entropy by the inverse document frequency corresponding to the word to obtain the global weight of each word, and sends the global weight calculation result to the comprehensive score calculation and sorting module of the test stage. In the test stage, the test text preprocessing module sequentially performs Chinese word segmentation, stop-word removal, and part-of-speech filtering on an input test text and inputs the preprocessed test text to the local weight calculation module; the local weight calculation module counts the normalized term frequency of each word in the test text from the preprocessing result and takes it as the local weight of the word; and the comprehensive score calculation and sorting module multiplies, for each word, the global weight obtained in the training stage by the local weight obtained from the local weight calculation module, calculates the comprehensive score of each word and sorts the words, and takes the several highest-scoring words in the ranking as the keyword extraction result for the current test text.
Compared with the prior art, the invention has the following remarkable advantages:
The working efficiency is improved. The invention takes text as the information carrier, automatically learns the global weights of words from training text data, and applies the learning result to automatic keyword extraction from test texts, thereby reducing the influence of the subjective factors involved in manually selecting keywords, reducing manual workload, and improving working efficiency. The training part of the text keyword extraction model is composed of a text preprocessing module, an inverse document frequency calculation module, a topic model learning module, and a global weight calculation module, and is simple to implement and fast to run.
The extraction accuracy is high. In automatically extracting text keywords, information entropy is introduced on the basis of the traditional TF-IDF method: the normalized topic distribution entropy of each word is calculated and combined with the traditional IDF value, realizing keyword extraction of higher accuracy and overcoming the defect that the traditional TF-IDF method does not consider the category or topic distribution characteristics of words. Meanwhile, in text preprocessing, Chinese word segmentation, stop-word removal, and part-of-speech filtering are performed in sequence on the input training text data, and useless words are filtered out by their part-of-speech features, so that high-frequency useless words do not adversely affect the keyword extraction result.
The method has better expandability. In the training stage, when the global weights of the vocabulary are calculated, the topic distribution characteristics of the vocabulary are learned automatically and without supervision using a topic model method; the method needs no additional labeled data, has better expandability, and makes extended application of the keyword extraction model possible.
The invention provides a complete processing method and flow for extracting text keywords; the automatic keyword extraction effect on experimental data is greatly improved compared with the traditional TF-IDF method, so the method has good practicability and strong engineering value.
Drawings
For a more complete understanding of the present invention, reference will now be made to the detailed description of the invention, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of the topic-feature text keyword extraction method of the present invention.
FIG. 2 is a flow diagram of the training text pre-processing module of FIG. 1.
FIG. 3 is a flow diagram of the global weight calculation module of FIG. 1.
The invention will be further explained with reference to the drawings.
Detailed Description
See fig. 1. According to the invention, the method for extracting topic-feature text keywords comprises the following steps. Text is taken as the carrier of information, and text keywords are extracted according to topic distribution characteristics in a training stage and a testing stage; the text keyword extraction algorithm model is composed of a training text preprocessing module, an inverse document frequency calculation module, a topic model learning module, a global weight calculation module, a test text preprocessing module, a local weight calculation module, and a comprehensive score calculation and sorting module. In the training stage, the training text preprocessing module sequentially performs Chinese word segmentation, stop-word removal, and part-of-speech filtering on the input training text data, and then inputs the preprocessed training text data to the topic model learning module and the inverse document frequency calculation module. The topic model learning module uses a topic model method to learn the topic distribution characteristics of the words in an unsupervised manner from the preprocessed training text data, obtaining a word-topic matrix that reflects the probability distribution characteristics of the words over different topics. The inverse document frequency calculation module uses the preprocessed training text data, counts for each word the number of training documents containing it, calculates the ratio of the total number of training texts to that document count, and takes the logarithm of the ratio as the inverse document frequency. The global weight calculation module calculates the topic distribution entropy of each word from the inverse document frequency calculation result and the word-topic matrix obtained by the topic model learning module, multiplies the inverse document frequency corresponding to the word by the inverse of its topic distribution entropy to obtain the global weight of each word, and sends the global weight calculation result to the comprehensive score calculation and sorting module of the test stage. The comprehensive score calculation and sorting module multiplies, for each word, the global weight obtained in the training stage by the local weight obtained from the local weight calculation module, calculates the comprehensive score of each word and sorts the words, and takes the several highest-scoring words in the ranking as the keyword extraction result for the current test text.
The inverse document frequency calculation module and the topic model learning module form a parallel calculation module between the text preprocessing module and the global weight calculation module, and respectively calculate the document distribution characteristics of words and the topic distribution characteristics of words.
When calculating the global weight of the vocabulary, the global weight calculation module introduces information entropy on the basis of the traditional inverse document frequency (IDF) calculation: after normalizing the topic probability distribution of each word, it calculates the word's topic distribution entropy and combines it with the traditional IDF value to obtain the global weight of the word.
The test stage comprises a test text preprocessing module, a local weight calculation module, and a comprehensive score calculation and sorting module. The test text preprocessing module sequentially performs Chinese word segmentation, stop-word removal, and part-of-speech filtering on an input test text and inputs the preprocessed test text to the local weight calculation module. The local weight calculation module counts the normalized term frequency of each word in the test text from the preprocessing result, takes it as the local weight of the word, and sends the local weight calculation result to the comprehensive score calculation and sorting module. The comprehensive score calculation and sorting module multiplies, for each word, the global weight obtained in the training stage by the local weight obtained in the test stage, calculates the comprehensive score of each word and sorts the words, and takes the several highest-scoring words in the ranking as the keyword extraction result for the current test text.
See fig. 2. The training text preprocessing module first performs word segmentation on the several input training texts using a Chinese word segmentation technique, obtaining the segmented vocabulary lists of the training texts together with the part-of-speech tag of each vocabulary item; the open-source toolkit FudanNLP is adopted for the segmentation. Stop-word processing is then applied to the text vocabulary according to a stop-word list: the vocabulary list containing M items is compared against the stop-word list, and any item found in the stop-word list is deleted. Part-of-speech filtering is then applied according to a filtered part-of-speech list: the vocabulary list is compared against the filtered part-of-speech list, and all items whose part of speech appears in that list are deleted. The resulting preprocessed vocabulary list is recorded as word = {word_1, …, word_j, …, word_M} and is output to the inverse document frequency calculation module and the topic model learning module, where j is the index of the j-th item among the M vocabulary items.
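The preprocessing chain described above (segmentation, stop-word removal, part-of-speech filtering) can be sketched as below. The input is assumed to be already segmented into (word, POS-tag) pairs — the patent itself uses the FudanNLP toolkit for the segmentation step — and the stop list, POS tags, and names here are illustrative assumptions, not the patent's actual lists.

```python
STOP_WORDS = {"的", "了", "是"}   # illustrative stop-word list
FILTER_POS = {"u", "p", "c"}      # illustrative POS tags to drop (particles, prepositions, conjunctions)

def preprocess(tagged_words):
    """Keep words that are neither stop words nor of a filtered part of speech.

    tagged_words: list of (word, pos_tag) pairs produced by a segmenter.
    Returns the preprocessed vocabulary list word_1 ... word_M.
    """
    return [w for w, pos in tagged_words
            if w not in STOP_WORDS and pos not in FILTER_POS]

tagged = [("主题", "n"), ("的", "u"), ("模型", "n"), ("是", "v"), ("关键词", "n")]
vocab_list = preprocess(tagged)
```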
The inverse document frequency calculation module counts, from the vocabulary list word = {word_1, …, word_j, …, word_M}, all distinct vocabulary items that appear, forming a dictionary w = (w_1, …, w_i, …, w_N) containing N words. It then takes each word w_i in the dictionary in turn and counts the number of documents df_i in the input training texts that contain w_i; the inverse document frequency idf_i is the logarithm of the ratio between the total number of training documents Num_doc and the number of documents df_i containing w_i, that is, idf_i = log(Num_doc / df_i). The process is repeated until the inverse document frequency of every word in the dictionary has been counted, forming the inverse document frequency matrix idf = (idf_1, …, idf_i, …, idf_N), where w_i is the i-th word in the dictionary, i = 1, 2, …, N, and N is the number of words.
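A minimal sketch of this module, assuming each training document is already a preprocessed token list and using idf_i = log(Num_doc / df_i) as in the text:

```python
import math

def inverse_document_frequency(train_docs):
    """Build the dictionary w_1..w_N and the idf value of each word."""
    dictionary = sorted({w for doc in train_docs for w in doc})
    num_doc = len(train_docs)                 # total number of training documents
    idf = {}
    for w in dictionary:
        df = sum(1 for doc in train_docs if w in doc)  # documents containing w
        idf[w] = math.log(num_doc / df)
    return idf

docs = [["a", "b"], ["b", "c"], ["b", "d"]]
idf = inverse_document_frequency(docs)
```

A word present in every document ("b" above) gets idf = log(1) = 0, which is the noise-suppressing behavior described earlier.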
The topic model learning module takes the vocabulary list word = {word_1, …, word_j, …, word_M} obtained by the training text preprocessing module and applies a common topic model learning method, such as latent Dirichlet allocation (LDA), for learning and training to obtain a probability matrix P between words and topics.
the probability matrix P is a matrix with N columns and K rows and the size of N multiplied by K, wherein the column number N represents the number of words in a dictionary, and the row number K represents the number of topics set artificially. The probability matrix P reflects the probability distribution characteristics of words on different subjects, wherein P1、p2、pNIs the topic probability distribution vector of different words in the probability matrix P.
LDA is a generative topic model for documents, also called a three-layer Bayesian probability model; it is an unsupervised machine learning technique comprising a three-layer structure of words, topics, and documents. LDA can be used to identify latent topic information in large-scale document sets or corpora. It adopts the bag-of-words approach, which treats each document as a word-frequency vector, thereby converting text information into numerical information that is easy to model.
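The bag-of-words step that LDA relies on — turning each document into a word-frequency vector over the dictionary — can be sketched as follows; the fixed dictionary ordering is an assumption of this sketch, not something the patent specifies.

```python
from collections import Counter

def bag_of_words(doc, dictionary):
    """Represent a tokenized document as a word-frequency vector over the dictionary."""
    counts = Counter(doc)
    # One count per dictionary word, in dictionary order; absent words count 0.
    return [counts[w] for w in dictionary]

dictionary = ["entropy", "keyword", "topic"]
vec = bag_of_words(["topic", "keyword", "topic"], dictionary)
```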
See fig. 3. From the word-topic matrix obtained by the topic model learning module, the global weight calculation module takes one row at a time, recorded as p_i = (p_i1, …, p_ij, …, p_iK), where p_i represents the probability distribution of the i-th word over the different topics and p_ij represents the probability value of the i-th word under the j-th topic. The vector p_i is normalized to obtain the normalized topic probability distribution vector p̄_i = (p̄_i1, …, p̄_iK), where each component is calculated by the formula p̄_ij = p_ij / (p_i1 + … + p_iK).
Using the normalized topic probability distribution vector p̄_i, the global weight calculation module computes the information entropy ent_i = −(p̄_i1·log p̄_i1 + … + p̄_iK·log p̄_iK). Information entropy measures the expected information of a random variable: the larger the entropy of a variable, the more diverse its outcomes, that is, the more content it carries. The larger the information entropy ent_i, the more uniform the distribution of the word over the different topics — that is, the word has no obvious topic tendency — and the less likely the word is a keyword; conversely, a smaller information entropy indicates a stronger topic tendency of the word and a greater likelihood that it is a keyword.
Using the information entropy ent_i of word w_i and the inverse document frequency idf_i obtained by the inverse document frequency calculation module, the global weight calculation module computes the global weight of w_i according to the global weight calculation formula g_i = idf_i / ent_i. The above process is repeated until the global weight calculation of all words in the dictionary is completed, yielding the global weight calculation result, where g is the global weight identifier.
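The FIG. 3 computation described above — normalizing a word's topic row, taking its entropy, and dividing the word's IDF by that entropy — can be sketched as below. The natural-logarithm base and the toy numbers are assumptions; a production implementation would also need to guard against a zero entropy for a word concentrated on a single topic.

```python
import math

def global_weight(p_i, idf_i):
    """Global weight of one word: its idf divided by the entropy of its
    normalized topic probability distribution p_i = (p_i1, ..., p_iK)."""
    total = sum(p_i)
    p_norm = [p / total for p in p_i]                      # normalized topic distribution
    ent = -sum(p * math.log(p) for p in p_norm if p > 0)   # topic distribution entropy
    return idf_i / ent

# A word spread evenly over 4 topics (high entropy, weak topic tendency)
# versus a word concentrated on one topic (low entropy, strong tendency).
g_flat = global_weight([0.25, 0.25, 0.25, 0.25], idf_i=1.0)
g_peaked = global_weight([0.97, 0.01, 0.01, 0.01], idf_i=1.0)
```

As the text states, the topic-focused word receives the larger global weight at equal IDF.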
The test text preprocessing module performs, for each input test text, operations similar to those of the training text preprocessing module, including word segmentation, stop-word removal, and part-of-speech filtering, and outputs a preprocessed vocabulary list word = {word_1, …, word_j, …, word_T}, where j = 1, 2, …, T indexes the j-th vocabulary item and the list contains T items in total.
According to the output vocabulary list word = {word_1, …, word_j, …, word_T}, the local weight calculation module counts in turn, for each word w_i in the dictionary w = (w_1, …, w_i, …, w_N), its number of occurrences tf_i in the vocabulary list word, and normalizes tf_i to obtain the local weight tf_i^l = tf_i / T, i = 1, 2, …, N, where T is the total number of words in the vocabulary list word and l is the local weight identifier.
After obtaining the local weight calculation result tf_i^l, the comprehensive score calculation and sorting module combines it with the global weight g_i obtained in the training stage to calculate the comprehensive score of the i-th word in the dictionary, score_i = g_i × tf_i^l, i = 1, 2, …, N. The scores of all words are calculated in turn to obtain the word score matrix score = (score_1, …, score_i, …, score_N). The comprehensive score calculation and sorting module then arranges the comprehensive scores of all words in descending order and takes the Q highest-scoring words as the keyword extraction result for the current test text; the extracted Q words reflect the main content and meaning of the text, and the extraction result for the test text is output, where Q is set manually as required.
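The test-stage scoring described above — normalized term frequency as local weight, multiplied by the trained global weight, ranked, top Q taken — can be sketched end to end. The global-weight values below are made-up inputs standing in for the training-stage result, not learned values.

```python
from collections import Counter

def extract_keywords(test_tokens, global_weights, q):
    """Score each dictionary word by score_i = g_i * (tf_i / T) and return the top-Q words."""
    t = len(test_tokens)                 # total words T in the test vocabulary list
    tf = Counter(test_tokens)
    scores = {w: g * (tf[w] / t) for w, g in global_weights.items()}
    return [w for w, _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:q]]

g = {"topic": 2.0, "model": 1.5, "entropy": 3.0, "the": 0.1}  # assumed trained global weights
tokens = ["topic", "model", "topic", "entropy", "the", "the"]
keywords = extract_keywords(tokens, g, q=2)
```

Note how "the", though as frequent as "topic" in the test text, is suppressed by its tiny global weight — the role the global weight plays in the composite score.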
The foregoing description of the invention and its embodiments is provided to persons skilled in the art and is to be considered illustrative rather than restrictive. An engineer may work according to the idea of the claims and make various changes in form and detail without departing from the spirit and scope of the invention defined by the appended claims; all such changes are considered to be within the scope of the present invention.

Claims (10)

1. A method for extracting keywords of a topic feature text, characterized by comprising the following steps: text is taken as a carrier of information, and text keyword extraction is divided into a training stage and a testing stage according to topic distribution characteristics; the text keyword extraction algorithm model is composed of a training text preprocessing module, an inverse document frequency calculation module, a topic model learning module, a global weight calculation module, a test text preprocessing module, a local weight calculation module and a comprehensive score calculation and sorting module; the training text preprocessing module sequentially performs Chinese word segmentation, stop word removal and part-of-speech filtering on the input training text data, and then inputs the preprocessed training text data to the topic model learning module and the inverse document frequency calculation module; the topic model learning module automatically learns the topic distribution characteristics of words in an unsupervised manner by applying a topic model method to the preprocessed training text data, obtaining through learning and training a word-topic matrix that reflects the probability distribution characteristics of the words over different topics; the inverse document frequency calculation module uses the preprocessed training text data to count, for each word, the number of training documents containing the word, calculates the ratio of the total number of training texts to that document number, and takes the logarithm of the ratio as the inverse document frequency of the word; the global weight calculation module calculates the topic distribution entropy of each word according to the inverse document frequency calculation result and the word-topic matrix obtained by the topic model learning module, multiplies the inverse of the topic distribution entropy by the inverse document frequency of the corresponding word to obtain the global weight of each word, and sends the global weight calculation result to the comprehensive score calculation and sorting module used in the testing stage; and the comprehensive score calculation and sorting module multiplies, for each word, the global weight obtained in the training stage by the local weight obtained from the local weight calculation module, calculates the comprehensive score of each word and sorts the words, and takes the several words with the highest scores in the ranking as the keyword extraction result of the current test text.
2. The method of claim 1, wherein: the testing stage comprises the test text preprocessing module, the local weight calculation module and the comprehensive score calculation and sorting module; the test text preprocessing module sequentially performs Chinese word segmentation, stop word removal and part-of-speech filtering on an input test text, and inputs the preprocessed test text into the local weight calculation module; and the local weight calculation module counts, according to the test text preprocessing result, the normalized word frequency of each word in the test text and takes the normalized word frequency as the local weight calculation result of the word.
3. The method of claim 1, wherein: the training text preprocessing module first performs word segmentation on a plurality of input training texts by using a Chinese word segmentation technique, obtaining the segmented vocabulary list of each training text together with the part-of-speech tagging information corresponding to each vocabulary.
4. The method of claim 1, wherein: the training text preprocessing module adopts the open source toolkit FudanNLP to perform word segmentation; the text vocabularies are then subjected to stop word removal according to a stop word table, in which a vocabulary list containing M vocabularies is compared with the stop word table and any vocabulary found in the stop word table is deleted; the text vocabularies are then subjected to part-of-speech filtering according to a filtering part-of-speech table, in which the vocabulary list is compared with the filtering part-of-speech table and all vocabularies of the corresponding parts of speech are deleted from the vocabulary list; the resulting preprocessed vocabulary list is denoted as word = {word_1, ..., word_j, ..., word_M} and is output to the inverse document frequency calculation module and the topic model learning module, where j is the sequence number of the j-th vocabulary among the M vocabularies.
5. The method of claim 1, wherein: the inverse document frequency calculation module counts all the distinct vocabularies appearing in the vocabulary list word = {word_1, ..., word_j, ..., word_M} to form a dictionary w = (w_1, ..., w_i, ..., w_N) containing N vocabularies; it then takes each word w_i in the dictionary in turn and counts the number df_i of input training documents containing the vocabulary w_i; the inverse document frequency idf_i is the logarithm of the ratio between the total number Num_doc of documents in the training text and the number df_i of documents containing the word w_i, namely idf_i = log(Num_doc / df_i); the process is repeated until the inverse document frequencies of all words in the dictionary have been counted, forming the inverse document frequency matrix idf = (idf_1, ..., idf_i, ..., idf_N), where w_i denotes the i-th vocabulary in the dictionary, i = 1, 2, ..., N, and N is the number of vocabularies.
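The inverse document frequency of claim 5 can be sketched directly from its definition. This is an illustrative sketch, not the patent's implementation; documents are modeled as sets of words, and since the dictionary is built from the training corpus, every dictionary word has df_i ≥ 1:

```python
import math

def inverse_document_frequencies(docs, dictionary):
    """idf_i = log(Num_doc / df_i), where df_i = number of docs containing w_i."""
    num_doc = len(docs)  # Num_doc: total number of training documents
    idf = []
    for w in dictionary:
        df = sum(1 for d in docs if w in d)  # df_i: documents containing w
        idf.append(math.log(num_doc / df))
    return idf

docs = [{"a", "b"}, {"a", "c"}, {"a"}]
idf = inverse_document_frequencies(docs, ["a", "b"])
# "a": df = 3 → log(3/3) = 0;  "b": df = 1 → log(3/1)
```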
6. The method of claim 5, wherein: the topic model learning module, according to the vocabulary list word = {word_1, ..., word_j, ..., word_M} obtained by the training text preprocessing module, performs learning and training with the three-layer Bayesian probability model LDA of the topic model learning method to obtain a probability matrix P between words and topics; the probability matrix P is an N × K matrix with N rows and K columns, where the row number N represents the number of words in the dictionary and the column number K represents the manually set number of topics; the probability matrix P reflects the probability distribution characteristics of the words over different topics, and p_1, p_2, ..., p_N are the topic probability distribution vectors of the different words in the probability matrix P.
7. The method of claim 6, wherein: the global weight calculation module takes one row at a time from the word-topic matrix obtained by the topic model learning module, denoted p_i = (p_i1, ..., p_ij, ..., p_iK), and normalizes p_i to obtain the normalized topic probability distribution vector, each component of which is calculated as p̄_ij = p_ij / Σ_{k=1}^{K} p_ik, where p_i represents the probability distribution of the i-th vocabulary under the different topics and p_ij represents the probability value of the i-th vocabulary under the j-th topic.
8. The method of claim 6, wherein: the global weight calculation module uses the information entropy ent_i of the vocabulary w_i and the inverse document frequency idf_i obtained by the inverse document frequency calculation module, and calculates the global weight of the vocabulary w_i according to the global weight calculation formula w_i^g = idf_i / ent_i; the above process is repeated until the global weight calculation of all the vocabularies in the dictionary is completed, yielding the global weight calculation result, where g is the global weight identifier.
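Claims 7 and 8 together amount to: normalize a word's topic-distribution row, take its entropy, and divide the word's idf by that entropy, so topic-focused words (low entropy) get larger global weights. A minimal Python sketch under two assumptions not fixed by the patent: the natural logarithm is used for the entropy, and the function names are illustrative:

```python
import math

def topic_entropy(p_row):
    """Entropy of one word's normalized topic distribution (claim 7)."""
    s = sum(p_row)
    probs = [p / s for p in p_row]  # normalization p_ij / sum_k p_ik
    return -sum(p * math.log(p) for p in probs if p > 0)

def global_weight(p_row, idf_i):
    """w_i^g = idf_i / ent_i (claim 8): lower topic entropy → higher weight."""
    return idf_i / topic_entropy(p_row)

# A word spread uniformly over 2 topics has entropy log 2
ent = topic_entropy([0.5, 0.5])  # → log(2)
```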
9. The method of claim 5, wherein: the local weight calculation module, according to the output vocabulary list word = {word_1, ..., word_j, ..., word_T}, counts in turn, for every word w_i in the dictionary w = (w_1, ..., w_i, ..., w_N), its number of occurrences tf_i in the vocabulary list word, and normalizes tf_i into the local weight w_i^L = tf_i / T, i = 1, 2, ..., N, where T denotes the total number of vocabularies in the vocabulary list word and L is the local weight identifier.
10. The method of extracting topic feature text keywords as claimed in claim 9, wherein: after obtaining the local weight calculation result w_i^L, the comprehensive score calculation and sorting module combines it with the global weight w_i^g obtained in the training stage and calculates the comprehensive score of the i-th word in the dictionary as score_i = w_i^L × w_i^g; the scores of all words are calculated in turn to obtain the word score matrix score = (score_1, ..., score_i, ..., score_N), where i = 1, 2, ..., N.
CN201810516408.2A 2018-05-25 2018-05-25 Theme feature text key word extracting method Pending CN108763213A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810516408.2A CN108763213A (en) 2018-05-25 2018-05-25 Theme feature text key word extracting method


Publications (1)

Publication Number Publication Date
CN108763213A true CN108763213A (en) 2018-11-06

Family

ID=64006351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810516408.2A Pending CN108763213A (en) 2018-05-25 2018-05-25 Theme feature text key word extracting method

Country Status (1)

Country Link
CN (1) CN108763213A (en)

Cited By (48)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766544A (en) * 2018-12-24 2019-05-17 中国科学院合肥物质科学研究院 Document keyword abstraction method and device based on LDA and term vector
CN109783798A (en) * 2018-12-12 2019-05-21 平安科技(深圳)有限公司 Method, apparatus, terminal and the storage medium of text information addition picture
CN109859291A (en) * 2019-02-21 2019-06-07 北京一品智尚信息科技有限公司 Intelligent LOGO design method, system and storage medium
CN109977399A (en) * 2019-03-05 2019-07-05 国网青海省电力公司 A kind of data analysing method and device based on NLP technology
CN110059311A (en) * 2019-03-27 2019-07-26 银江股份有限公司 A kind of keyword extracting method and system towards judicial style data
CN110134767A (en) * 2019-05-10 2019-08-16 云知声(上海)智能科技有限公司 A kind of screening technique of vocabulary
CN110287481A (en) * 2019-05-29 2019-09-27 西南电子技术研究所(中国电子科技集团公司第十研究所) Name entity corpus labeling training system
CN110298033A (en) * 2019-05-29 2019-10-01 西南电子技术研究所(中国电子科技集团公司第十研究所) Keyword corpus labeling trains extracting tool
CN110377734A (en) * 2019-07-01 2019-10-25 厦门美域中央信息科技有限公司 A kind of file classification method based on support vector machines
CN110413997A (en) * 2019-07-16 2019-11-05 深圳供电局有限公司 New word discovery method, system and readable storage medium for power industry
CN110413863A (en) * 2019-08-01 2019-11-05 信雅达系统工程股份有限公司 A kind of public sentiment news duplicate removal and method for pushing based on deep learning
CN110580279A (en) * 2019-08-19 2019-12-17 湖南正宇软件技术开发有限公司 Information classification method, system, equipment and storage medium
CN110705285A (en) * 2019-09-20 2020-01-17 北京市计算中心 Government affair text subject word bank construction method, device, server and readable storage medium
CN110852085A (en) * 2019-08-19 2020-02-28 湖南正宇软件技术开发有限公司 Hotspot topic mining method and system
CN111160025A (en) * 2019-12-12 2020-05-15 日照睿安信息科技有限公司 Method for actively discovering case keywords based on public security text
WO2020155769A1 (en) * 2019-01-30 2020-08-06 平安科技(深圳)有限公司 Method and device for establishing keyword generation model
CN111625578A (en) * 2020-05-26 2020-09-04 辽宁大学 Feature extraction method suitable for time sequence data in cultural science and technology fusion field
CN111680131A (en) * 2020-06-22 2020-09-18 平安银行股份有限公司 Document clustering method and system based on semantics and computer equipment
CN111753048A (en) * 2020-05-21 2020-10-09 高新兴科技集团股份有限公司 Document retrieval method, device, equipment and storage medium
CN111950261A (en) * 2020-10-16 2020-11-17 腾讯科技(深圳)有限公司 Method, device and computer readable storage medium for extracting text keywords
CN111949767A (en) * 2020-08-20 2020-11-17 深圳市卡牛科技有限公司 Method, device, equipment and storage medium for searching text keywords
CN112100317A (en) * 2020-09-24 2020-12-18 南京邮电大学 Feature keyword extraction method based on theme semantic perception
CN112100492A (en) * 2020-09-11 2020-12-18 河北冀联人力资源服务集团有限公司 Batch delivery method and system for resumes of different versions
CN112199926A (en) * 2020-10-16 2021-01-08 中国地质大学(武汉) Geological report text visualization method based on text mining and natural language processing
CN112464635A (en) * 2020-07-27 2021-03-09 上海汇招信息技术有限公司 Method and system for automatically scoring bid document
CN112507060A (en) * 2020-12-14 2021-03-16 福建正孚软件有限公司 Domain corpus construction method and system
CN112530591A (en) * 2020-12-10 2021-03-19 厦门越人健康技术研发有限公司 Method for generating auscultation test vocabulary and storage equipment
CN112541105A (en) * 2019-09-20 2021-03-23 福建师范大学地理研究所 Keyword generation method, public opinion monitoring method, device, equipment and medium
CN112686026A (en) * 2021-03-17 2021-04-20 平安科技(深圳)有限公司 Keyword extraction method, device, equipment and medium based on information entropy
CN112765979A (en) * 2021-01-15 2021-05-07 西华大学 System and method for extracting thesis keywords
CN112836507A (en) * 2021-01-13 2021-05-25 哈尔滨工程大学 Method for extracting domain text theme
CN112989824A (en) * 2021-05-12 2021-06-18 武汉卓尔数字传媒科技有限公司 Information pushing method and device, electronic equipment and storage medium
CN113095073A (en) * 2021-03-12 2021-07-09 深圳索信达数据技术有限公司 Corpus tag generation method and device, computer equipment and storage medium
WO2021139466A1 (en) * 2020-01-06 2021-07-15 北京大米科技有限公司 Topic word determination method for text, device, storage medium, and terminal
CN113191147A (en) * 2021-05-27 2021-07-30 中国人民解放军军事科学院评估论证研究中心 Unsupervised automatic term extraction method, apparatus, device and medium
CN113312463A (en) * 2021-05-26 2021-08-27 中国平安人寿保险股份有限公司 Intelligent evaluation method and device for voice question answering, computer equipment and storage medium
CN113408286A (en) * 2021-05-28 2021-09-17 浙江工业大学 Chinese entity identification method and system for mechanical and chemical engineering field
CN113761130A (en) * 2021-08-31 2021-12-07 珠海读书郎软件科技有限公司 System and method for assisting composition writing
CN113808742A (en) * 2021-08-10 2021-12-17 三峡大学 LSTM (localized surface technology) attention mechanism disease prediction method based on text feature dimension reduction
CN114065759A (en) * 2021-11-19 2022-02-18 深圳视界信息技术有限公司 Model failure detection method and device, electronic equipment and medium
CN114357996A (en) * 2021-12-06 2022-04-15 北京网宿科技有限公司 Time sequence text feature extraction method and device, electronic equipment and storage medium
CN114491034A (en) * 2022-01-24 2022-05-13 聚好看科技股份有限公司 Text classification method and intelligent device
CN114580557A (en) * 2022-03-10 2022-06-03 北京中知智慧科技有限公司 Document similarity determination method and device based on semantic analysis
CN115983251A (en) * 2023-02-16 2023-04-18 江苏联著实业股份有限公司 Text topic extraction system and method based on sentence analysis
CN115994527A (en) * 2023-03-23 2023-04-21 广东聚智诚科技有限公司 Machine learning-based PPT automatic generation system
CN116074036A (en) * 2022-11-21 2023-05-05 兴业银行股份有限公司 Attack behavior detection and identification method and system based on log features of security equipment
CN116842945A (en) * 2023-07-07 2023-10-03 中国标准化研究院 Digital library data mining method
CN117993392A (en) * 2024-03-05 2024-05-07 北京引智科技有限公司 Comprehensive information analysis method and system based on keyword extraction

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105843795A (en) * 2016-03-21 2016-08-10 华南理工大学 Topic model based document keyword extraction method and system
US20160314191A1 (en) * 2015-04-24 2016-10-27 Linkedin Corporation Topic extraction using clause segmentation and high-frequency words
CN106294863A (en) * 2016-08-23 2017-01-04 电子科技大学 A kind of abstract method for mass text fast understanding


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘啸剑: "基于主题模型的关键词抽取算法研究", 《全国优秀硕士学位论文全文数据库》 *
钱爱兵: "基于改进TF-IDF的中文网页关键词抽取——以新闻网页为例", 《情报理论与实践》 *


Similar Documents

Publication Publication Date Title
CN108763213A (en) Theme feature text key word extracting method
CN111177365B (en) Unsupervised automatic abstract extraction method based on graph model
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN110298033B (en) Keyword corpus labeling training extraction system
CN109101477B (en) Enterprise field classification and enterprise keyword screening method
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN110825877A (en) Semantic similarity analysis method based on text clustering
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN103838833B (en) Text retrieval system based on correlation word semantic analysis
CN109670014B (en) Paper author name disambiguation method based on rule matching and machine learning
CN112989802B (en) Bullet screen keyword extraction method, bullet screen keyword extraction device, bullet screen keyword extraction equipment and bullet screen keyword extraction medium
CN110134792B (en) Text recognition method and device, electronic equipment and storage medium
CN108804595B (en) Short text representation method based on word2vec
CN109885675B (en) Text subtopic discovery method based on improved LDA
CN108038099B (en) Low-frequency keyword identification method based on word clustering
CN111368088A (en) Text emotion classification method based on deep learning
CN110705247A (en) Based on x2-C text similarity calculation method
CN111309916A (en) Abstract extraction method and device, storage medium and electronic device
CN110728144A (en) Extraction type document automatic summarization method based on context semantic perception
Celikyilmaz et al. A graph-based semi-supervised learning for question-answering
Shetty et al. Auto text summarization with categorization and sentiment analysis
CN110728135A (en) Text theme indexing method and device, electronic equipment and computer storage medium
Yan et al. Sentiment Analysis of Short Texts Based on Parallel DenseNet.
CN108804422B (en) Scientific and technological paper text modeling method
CN114298020B (en) Keyword vectorization method based on topic semantic information and application thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20181106