CN108052593B - Topic keyword extraction method based on topic word vector and network structure - Google Patents

Topic keyword extraction method based on topic word vector and network structure

Info

Publication number
CN108052593B
CN108052593B (application CN201711315360.0A)
Authority
CN
China
Prior art keywords
topic
word
keyword
words
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711315360.0A
Other languages
Chinese (zh)
Other versions
CN108052593A (en)
Inventor
胡晓慧
李超
曾庆田
戴明弟
赵中英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University of Science and Technology
Original Assignee
Shandong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University of Science and Technology filed Critical Shandong University of Science and Technology
Priority to CN201711315360.0A priority Critical patent/CN108052593B/en
Publication of CN108052593A publication Critical patent/CN108052593A/en
Application granted granted Critical
Publication of CN108052593B publication Critical patent/CN108052593B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3334Selection or weighting of terms from queries, including natural language queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a topic keyword extraction method based on topic word vectors and a network structure, and relates in particular to the technical field of extracting keywords from text. The method performs topic clustering on a text corpus based on an LDA topic model and obtains the 100 keywords most relevant to each topic (top-100); word2vec represents each word in the corpus as a word vector, the semantic similarity between every pair of words is computed, and for each keyword the top-5 most semantically similar words are found, the keywords and these top-5 words together forming a new keyword set; a keyword network is then constructed, and the top-20 words of each set are taken as the topic's keywords. The method can extract keywords with high word frequency in a document, and can also effectively discover keywords with low word frequency but a strong relation to the topic.

Description

Topic keyword extraction method based on topic word vector and network structure
Technical Field
The invention relates to the technical field of extracting keywords from texts, in particular to a topic keyword extraction method based on topic word vectors and network structures.
Background
With the wide application of representation learning in natural language processing, word2vec has been applied to the vector representation of words and can capture their semantic and grammatical regularities well, while topic models explain topic aggregation at the document level well. Research on word vector representations that fuse topic models and topic keywords is therefore becoming increasingly widespread.
LDA topic model: among the various topic models proposed, LDA is a generative model that can summarize the distribution of topics. LDA is a three-level hierarchical Bayesian model in which each item of a collection is modeled as a finite mixture over an underlying set of latent topics, and each topic is in turn modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. The modeling process of LDA can be described as finding a topic mixture for each resource (i.e., P(z|d)), with each topic described by another probability distribution (i.e., P(t|z)). This can be formally expressed as:
P(t_i | d) = Σ_{j=1}^{Z} P(t_i | z_j) · P(z_j | d)

where P(t_i|d) is the probability of the i-th term in a given document d, z_j is a latent topic, P(t_i|z_j) is the probability of t_i in topic j, and P(z_j|d) is the probability of topic j in document d. The number of latent topics Z must be defined in advance. LDA uses Dirichlet prior distributions and a fixed topic number to estimate the topic-word distribution P(t|z) and the document-topic distribution P(z|d) from an unlabeled corpus.
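As a concrete illustration of the mixture above, the following sketch computes P(t_i|d) for every vocabulary word from a topic-word matrix and a document-topic vector; the distributions are invented toy values, not data from the invention:

```python
import numpy as np

# Hypothetical model with Z=3 topics over a 5-word vocabulary.
# topic_word[j, i] = P(t_i | z_j); each row sums to 1.
topic_word = np.array([
    [0.5, 0.2, 0.1, 0.1, 0.1],
    [0.1, 0.4, 0.3, 0.1, 0.1],
    [0.2, 0.1, 0.1, 0.3, 0.3],
])
# doc_topic[j] = P(z_j | d) for one document d; sums to 1.
doc_topic = np.array([0.7, 0.2, 0.1])

# P(t_i | d) = sum_j P(t_i | z_j) * P(z_j | d), for all i at once.
p_word_given_doc = doc_topic @ topic_word
print(p_word_given_doc)        # a probability distribution over the 5 words
```

Because both inputs are proper distributions, the result also sums to 1, which is a quick sanity check when wiring up real LDA output.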
LDA is a widely applied topic model, and most other topic models are extensions of it. However, the keywords extracted by LDA are generally too broad and do not reflect the theme of an article well, which motivates the method proposed by the invention.
word embedding: word embedding encodes each word as a continuous vector (word vector) based on syntactic and semantic information so that the distance of similar words on their word vectors is similar. After a language model is counted and established from a natural text and word vectors are obtained, the language model can be used as input of a neural network to perform syntactic analysis, emotion analysis and the like, and can also be used as auxiliary features to expand the existing model. But only the word vectors are unable to identify the topic that the text expects, which must be combined with the topic model.
The existing unsupervised keyword extraction technology mainly comprises the schemes of TF-IDF, Topic model, TextRank and the like. The technical defects are mainly reflected in the following aspects:
TF-IDF is a common weighting technique in information retrieval and data mining that measures the importance of search keywords, and it also performs well when applied to text keyword extraction. However, TF-IDF is based only on word frequency and the cross entropy of the keyword probability distribution: it considers neither the order in which words occur nor the relation between each word in the text and its context.
Widely used topic models such as LDA can mine topics from documents well, but the extracted keywords are too broad: many words with high frequency but no relation to the topic fail to reflect it and are therefore unsuitable as keywords.
The TextRank algorithm is a graph-based ranking algorithm for text: the text is split into sentences, a graph model is built from the contextual co-occurrence relations of words, and keywords are extracted according to their PageRank values in the graph. On the basis of word frequency and word co-occurrence, the algorithm extracts the keywords of a single document simply and effectively, but it cannot identify and cluster topics across multiple documents, so it cannot extract document keywords under a specific topic.
Disclosure of Invention
The invention aims to overcome the above defects and provides a keyword extraction method that combines the LDA topic model with word embedding and extracts keywords from same-topic texts via similarity network propagation; it can extract keywords with high word frequency in a document and also effectively discover keywords with low word frequency but a strong relation to the topic.
The invention specifically adopts the following technical scheme:
as shown in fig. 1, a method for extracting topic keywords based on topic word vectors and network structures specifically includes:
performing word segmentation on an original text corpus;
performing topic clustering on the text corpus based on an LDA topic model, and obtaining for each topic a keyword set Keywordset1 = {k1, ..., k100} of the 100 words most relevant to the topic (top-100);
Using word2vec to represent each word in the text corpus as a word vector, and obtaining semantic similarity between every two words by calculating cosine values between the word vectors;
for each keyword in Keywordset1, respectively calculating its top-5 semantically similar words; the keywords in Keywordset1 and their top-5 similar words together form a new keyword set Keywordset2;
taking each keyword in Keywordset2 as a node and the semantic similarity between words as edge weights, constructing a keyword network, and, according to each node's PageRank value, taking the top-20 words of Keywordset2 as the topic's keywords to form the final keyword set Keywordset_final.
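The top-5 expansion step above can be sketched as follows; the vocabulary, the random vectors, and the use of top-2 instead of top-5 are illustrative assumptions for a toy-sized example:

```python
import numpy as np

# Hypothetical vocabulary with word vectors (rows), normalised so that a
# dot product equals cosine similarity. Vectors are random stand-ins.
vocab = ["course", "lesson", "exam", "grade", "score", "library", "book"]
rng = np.random.default_rng(0)
vecs = rng.normal(size=(len(vocab), 8))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

keywords = ["course", "exam"]   # stands in for Keywordset1 (top-100 in the method)
topn = 2                        # top-5 in the method; 2 here for the toy vocabulary

sim = vecs @ vecs.T             # pairwise cosine similarities
expanded = set(keywords)
for kw in keywords:
    i = vocab.index(kw)
    order = np.argsort(-sim[i])                       # most similar first
    neighbours = [vocab[j] for j in order if j != i][:topn]
    expanded |= set(neighbours)  # Keywordset2 = Keywordset1 ∪ top-n neighbours

print(sorted(expanded))
```

Duplicates disappear automatically because a set is used, matching the de-duplication in step S1 below.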
Preferably, word segmentation divides the acquired original text into word sequences for subsequent topic clustering and keyword extraction; when the segmentation result is used as input to word2vec, special symbols are removed, and when it is used as input to LDA, stop words, place names that cannot serve as topic keywords, and repeated topic-irrelevant prepositions are also removed.
Preferably, topic clustering is performed on the text corpus based on the LDA topic model, and perplexity is used, as in language modeling, to measure how well the model fits: a lower perplexity represents better generalization performance. The perplexity is calculated as:

perplexity = exp( − (1/N) · Σ_{i=1}^{N} log Σ_{j=1}^{K} P(w_i | t_j) · P(t_j | d) )

where P(w_i|t_j) is the distribution of word w_i under topic t_j, P(t_j|d) is the distribution of topic t_j over document d, N is the number of distinct words in the corpus, K is the number of topics, i = 1, ..., N, and j = 1, ..., K.
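A sketch of this perplexity computation for a single document, using invented toy distributions (K=2 topics, N=4 distinct words):

```python
import numpy as np

# Hypothetical LDA output: topic_word[j, i] = P(w_i | t_j), doc_topic[j] = P(t_j | d).
topic_word = np.array([
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.2, 0.3, 0.4],
])
doc_topic = np.array([0.6, 0.4])

# p(w_i | d) = sum_j P(w_i | t_j) * P(t_j | d)
p_w = doc_topic @ topic_word

# perplexity = exp(-(1/N) * sum_i log p(w_i | d))
perp = np.exp(-np.log(p_w).mean())
print(perp)
```

Lower values indicate a better fit; sweeping the topic number K and plotting this value yields the inflection-point curve of fig. 2.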
Preferably, in the word vector generation process, the word-segmentation result of the combined title-and-content text is used as input to obtain a word vector representation model for each word.
Preferably, in the keyword network construction process, the construction steps specifically include:
s1: calculating words with initial keyword semantic similarity top5 obtained in the step of clustering with the same theme by using cosine relationship between word vectors, removing duplication and combining with keyword set Keywordset1Form a new keyword set2
S2: calculating keyword set under each topic2The similarity between every two words in the Chinese language is used as the weight between two points;
s3: setting a threshold value, and filtering edges with similarity lower than the threshold value;
s4: constructing a keyword network of each topic;
s5: extracting the topic key words: after the keyword network is constructed, 20 nodes of top from high to low of PageRank values in each topic network are calculated, and corresponding words are used as keyword sets of the topicsfinal
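Steps S2-S5 can be sketched end to end as follows; the keyword list and similarity matrix are invented, the 0.3 threshold mirrors the choice justified in Table 2 below, and the plain power-iteration PageRank (damping 0.85) is an illustrative stand-in for whatever implementation the patent actually used:

```python
import numpy as np

# Hypothetical Keywordset2 for one topic, with pairwise cosine similarities.
keywords = ["course", "teaching", "exam", "library", "grade"]
sim = np.array([
    [1.0, 0.8, 0.5, 0.1, 0.4],
    [0.8, 1.0, 0.6, 0.2, 0.5],
    [0.5, 0.6, 1.0, 0.1, 0.7],
    [0.1, 0.2, 0.1, 1.0, 0.1],
    [0.4, 0.5, 0.7, 0.1, 1.0],
])

# S3: drop edges whose similarity is below the threshold.
w = np.where(sim >= 0.3, sim, 0.0)
np.fill_diagonal(w, 0.0)                 # no self-loops in the network

# S5: weighted PageRank by plain power iteration, damping factor 0.85.
d, n = 0.85, len(keywords)
out = w.sum(axis=1, keepdims=True)
out[out == 0] = 1.0                      # guard fully isolated nodes
P = w / out                              # row-stochastic transition matrix
r = np.full(n, 1.0 / n)
for _ in range(100):
    r = (1 - d) / n + d * (P.T @ r)

ranked = [keywords[i] for i in np.argsort(-r)]   # keep top-20 in the real method
print(ranked)
```

"library" loses all its edges at the 0.3 threshold, so it ends up ranked last, which is the intended behaviour: weakly connected words are pushed out of the topic's keyword set.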
The invention has the following beneficial effects:
firstly, the text corpus is clustered based on an LDA topic model; secondly, each word in the text corpus is represented as a word vector using word2vec; then, the top-5 most similar words of each keyword in the topic's documents are obtained and together form a new keyword set; finally, a keyword network is constructed with the keywords as nodes and the similarity between words as edge weights, and the core nodes of the network are obtained as the topic's keywords;
the method combines the LDA topic model with word embedding and extracts keywords from same-topic texts via similarity network propagation; it can extract keywords with high word frequency in a document and also effectively discover keywords with low word frequency but a strong relation to the topic;
on the basis of word frequency, the method performs a secondary discovery of keywords according to word vector relations, bringing low-frequency but semantically similar words into the candidate keyword set; this reasonably expands the selection range of keywords and makes the final keywords under the same topic semantically closer;
the method introduces word vectors and constructs the network based on distances between them, so keywords with similar senses under the same topic can be found more accurately, giving a more precise result.
Drawings
FIG. 1 is a flow chart of a topic keyword extraction method based on topic word vectors and network structures;
FIG. 2 is the perplexity curve;
FIG. 3 is the keyword distribution for teaching-type notifications;
FIG. 4 is the keyword distribution for rating-type notifications;
FIG. 5 is the keyword distribution for library-type notifications.
Detailed Description
The following description of the embodiments of the present invention will be made with reference to the accompanying drawings:
as shown in fig. 1, a method for extracting topic keywords based on topic word vectors and network structures specifically includes:
performing word segmentation on an original text corpus;
performing topic clustering on the text corpus based on an LDA topic model, and obtaining for each topic a keyword set Keywordset1 = {k1, ..., k100} of the 100 words most relevant to the topic (top-100);
Using word2vec to represent each word in the text corpus as a word vector, and obtaining semantic similarity between every two words by calculating cosine values between the word vectors;
for each keyword in Keywordset1, respectively calculating its top-5 semantically similar words; the keywords in Keywordset1 and their top-5 similar words together form a new keyword set Keywordset2;
taking each keyword in Keywordset2 as a node and the semantic similarity between words as edge weights, constructing a keyword network, and, according to each node's PageRank value, taking the top-20 words of Keywordset2 as the topic's keywords to form the final keyword set Keywordset_final.
Word segmentation divides the acquired original text into word sequences for subsequent topic clustering and keyword extraction; when the segmentation result is used as input to word2vec, special symbols are removed, and when it is used as input to LDA, stop words, place names that cannot serve as topic keywords, and a large number of repeated topic-irrelevant prepositions are removed.
As shown in fig. 2, topic clustering is performed on the text corpus based on the LDA topic model, and perplexity is used, as in language modeling, to measure how well the model fits: a lower perplexity represents better generalization performance. The perplexity is calculated as:

perplexity = exp( − (1/N) · Σ_{i=1}^{N} log Σ_{j=1}^{K} P(w_i | t_j) · P(t_j | d) )

where P(w_i|t_j) is the distribution of word w_i under topic t_j, P(t_j|d) is the distribution of topic t_j over document d, N is the number of distinct words in the corpus, K is the number of topics, i = 1, ..., N, and j = 1, ..., K. The number of topics is varied, and the optimal number is obtained by calculating the perplexity of the data set under different topic counts.
Selecting the value at the inflection point of the curve keeps the perplexity of the data set small without making the number of topics excessive. The topic distribution of each document and the word distribution under each topic are then obtained, and the 100 top-ranked words by LDA value under each topic are selected as the initial keyword set.
In the word vector generation process, the word-segmentation result of the combined title-and-content text is used as input to obtain a word vector representation model for each word. In this scheme, the CBOW model is selected with a window size of 5 to predict the probability of the current centre word, and a negative sampling algorithm is selected to distinguish the target word from noise-distribution samples via logistic regression. Table 1 (word2vec model training parameter settings) gives a description and default values of the key training parameters.
TABLE 1
(Table 1 is reproduced as an image in the original patent and is not rendered here.)
Finally, high-dimensional vector representation of all words in the text can be obtained, and similarity relation between all words, namely semantic distance, can be obtained by utilizing the word vector model.
In the keyword network construction process, the construction steps specifically comprise:
s1: calculating words with initial keyword semantic similarity top5 obtained in the step of clustering with the same theme by using cosine relationship between word vectors, removing duplication and combining with keyword set Keywordset1Form a new keyword set2
S2: calculating keyword set Kevwordset under each topic2The similarity between every two words in the Chinese language is used as the weight between two points;
s3: setting a threshold value, and filtering edges with similarity lower than the threshold value; the different results for different values of threshold selection are shown in table 2:
TABLE 2
Similarity threshold    Topic similarity
0.05                    0.41
0.1                     0.44
0.15                    0.48
0.2                     0.49
0.25                    0.52
0.3                     0.59
0.35                    0.55
0.4                     0.57
0.45                    0.56
0.5                     0.52
0.55                    0.50
As the table shows, keywords within the same topic aggregate most strongly when the threshold is set to 0.3.
S4: constructing a keyword network of each topic;
S5: extracting the topic keywords: after the keyword network is constructed, the 20 nodes with the highest PageRank values in each topic network are calculated, and the corresponding words are taken as the topic's keywords to form a new keyword set Keywordset_final.
As shown in figs. 3-5, in an experimental validation of the scheme, 9802 news announcements published by a college from 2002 to 2017 were crawled; after word-segmentation preprocessing, topic keywords were extracted through topic mining, word vector calculation and keyword network construction, and the result was compared with the keywords obtained by the traditional LDA topic model.
In the figures, darker words reflect the topic better, lighter words are less relevant to it, and larger words rank higher under the method. It can be seen that, by integrating word frequency and semantics, the method of the invention better extracts keywords that represent the topic.
It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make modifications, alterations, additions or substitutions within the spirit and scope of the present invention.

Claims (4)

1. A topic keyword extraction method based on topic word vectors and network structures is characterized by specifically comprising the following steps:
performing word segmentation on an original text corpus;
performing topic clustering on the text corpus based on an LDA topic model, and obtaining for each topic a keyword set Keywordset1 = {k1, ..., k100} of the 100 words most relevant to the topic (top-100);
Using word2vec to represent each word in the text corpus as a word vector, and obtaining semantic similarity between every two words by calculating cosine values between the word vectors;
for each keyword in Keywordset1, respectively calculating its top-5 semantically similar words; the keywords in Keywordset1 and their top-5 similar words together form a new keyword set Keywordset2;
taking each keyword in Keywordset2 as a node and the semantic similarity between words as edge weights, constructing a keyword network, and, according to each node's PageRank value, taking the top-20 words of Keywordset2 as the topic's keywords to form the final keyword set Keywordset_final;
In the keyword network construction process, the construction steps specifically include:
s1: calculating words with initial keyword semantic similarity top5 obtained in the step of clustering with the same theme by using cosine relationship between word vectors, removing duplication and combining with keyword set Keywordset1Form a new keyword set2
S2: calculating keyword set under each topic2The similarity between every two words in the Chinese language is used as the weight between two points;
s3: setting a threshold value, and filtering edges with similarity lower than the threshold value;
s4: constructing a keyword network of each topic;
s5: extracting the topic key words: after the keyword network is constructed, 20 nodes of top from high to low of PageRank values in each topic network are calculated, and corresponding words are used as keyword sets of the topicsfinal
2. The method as claimed in claim 1, wherein word segmentation divides the acquired original text into word sequences for subsequent topic clustering and keyword extraction; when the segmentation result is used as input to word2vec, special symbols are removed, and when it is used as input to LDA, stop words, place names that cannot serve as topic keywords, and repeated topic-irrelevant prepositions are removed.
3. The method as claimed in claim 1, wherein topic clustering is performed on the text corpus based on the LDA topic model, and perplexity is used, as in language modeling, to measure how well the model fits, a lower perplexity representing better generalization performance; the perplexity is calculated as:

perplexity = exp( − (1/N) · Σ_{i=1}^{N} log Σ_{j=1}^{K} P(w_i | t_j) · P(t_j | d) )

where P(w_i|t_j) is the distribution of word w_i under topic t_j, P(t_j|d) is the distribution of topic t_j over document d, N is the number of distinct words in the corpus, K is the number of topics, i = 1, ..., N, and j = 1, ..., K.
4. The method of claim 1, wherein, in the word vector generation process, the word-segmentation result of the combined title-and-content text is used as input to obtain a word vector representation model for each word.
CN201711315360.0A 2017-12-12 2017-12-12 Topic keyword extraction method based on topic word vector and network structure Active CN108052593B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711315360.0A CN108052593B (en) 2017-12-12 2017-12-12 Topic keyword extraction method based on topic word vector and network structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711315360.0A CN108052593B (en) 2017-12-12 2017-12-12 Topic keyword extraction method based on topic word vector and network structure

Publications (2)

Publication Number Publication Date
CN108052593A CN108052593A (en) 2018-05-18
CN108052593B true CN108052593B (en) 2020-09-22

Family

ID=62124320

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711315360.0A Active CN108052593B (en) 2017-12-12 2017-12-12 Topic keyword extraction method based on topic word vector and network structure

Country Status (1)

Country Link
CN (1) CN108052593B (en)

Families Citing this family (45)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108829822B (en) * 2018-06-12 2023-10-27 腾讯科技(深圳)有限公司 Media content recommendation method and device, storage medium and electronic device
CN108920454A (en) * 2018-06-13 2018-11-30 北京信息科技大学 A kind of theme phrase extraction method
CN108984519B (en) * 2018-06-14 2022-07-05 华东理工大学 Dual-mode-based automatic event corpus construction method and device and storage medium
CN110020034B (en) * 2018-06-29 2023-12-08 程宇镳 Information quotation analysis method and system
CN109086355B (en) * 2018-07-18 2022-05-17 北京航天云路有限公司 Hot-spot association relation analysis method and system based on news subject term
CN109376352B (en) * 2018-08-28 2022-11-29 中山大学 Patent text modeling method based on word2vec and semantic similarity
CN109522928A (en) * 2018-10-15 2019-03-26 北京邮电大学 Theme sentiment analysis method, apparatus, electronic equipment and the storage medium of text
CN109284366A (en) * 2018-10-17 2019-01-29 徐佳慧 A kind of construction method and device of the homogenous network towards investment and financing mechanism
CN109492157B (en) * 2018-10-24 2021-08-31 华侨大学 News recommendation method and theme characterization method based on RNN and attention mechanism
CN109636645A (en) * 2018-12-13 2019-04-16 平安医疗健康管理股份有限公司 Medical insurance monitoring and managing method, unit and computer readable storage medium
CN109710759B (en) * 2018-12-17 2021-06-08 北京百度网讯科技有限公司 Text segmentation method and device, computer equipment and readable storage medium
CN109885831B (en) * 2019-01-30 2023-06-02 广州杰赛科技股份有限公司 Keyword extraction method, device, equipment and computer readable storage medium
CN110442855B (en) * 2019-04-10 2023-11-07 北京捷通华声科技股份有限公司 Voice analysis method and system
CN110046228B (en) * 2019-04-18 2021-06-11 合肥工业大学 Short text topic identification method and system
CN110209941B (en) * 2019-06-03 2021-01-15 北京卡路里信息技术有限公司 Method for maintaining push content pool, push method, device, medium and server
CN110222347B (en) * 2019-06-20 2020-06-23 首都师范大学 Composition separation detection method
CN110287321A (en) * 2019-06-26 2019-09-27 南京邮电大学 A kind of electric power file classification method based on improvement feature selecting
CN110472005B (en) * 2019-06-27 2023-09-15 中山大学 Unsupervised keyword extraction method
CN110427492B (en) * 2019-07-10 2023-08-15 创新先进技术有限公司 Keyword library generation method and device and electronic equipment
CN110717329B (en) * 2019-09-10 2023-06-16 上海开域信息科技有限公司 Method for performing approximate search based on word vector to rapidly extract advertisement text theme
CN110807326B (en) * 2019-10-24 2023-04-28 江汉大学 Short text keyword extraction method combining GPU-DMM and text features
CN111026866B (en) * 2019-10-24 2020-10-23 北京中科闻歌科技股份有限公司 Domain-oriented text information extraction clustering method, device and storage medium
CN110851602A (en) * 2019-11-13 2020-02-28 精硕科技(北京)股份有限公司 Method and device for topic clustering
CN110851570B (en) * 2019-11-14 2023-04-18 中山大学 Unsupervised keyword extraction method based on Embedding technology
CN110991175B (en) * 2019-12-10 2024-04-09 爱驰汽车有限公司 Method, system, equipment and storage medium for generating text in multi-mode
CN111078838B (en) * 2019-12-13 2023-08-18 北京小米智能科技有限公司 Keyword extraction method, keyword extraction device and electronic equipment
CN111079422B (en) * 2019-12-13 2023-07-14 北京小米移动软件有限公司 Keyword extraction method, keyword extraction device and storage medium
CN113139379B (en) * 2020-01-20 2023-12-22 中国电信股份有限公司 Information identification method and system
CN111401040B (en) * 2020-03-17 2021-06-18 上海爱数信息技术股份有限公司 Keyword extraction method suitable for word text
CN111428489B (en) * 2020-03-19 2023-08-29 北京百度网讯科技有限公司 Comment generation method and device, electronic equipment and storage medium
CN111950264B (en) * 2020-08-05 2024-04-26 广东工业大学 Text data enhancement method and knowledge element extraction method
CN112100317B (en) * 2020-09-24 2022-10-14 南京邮电大学 Feature keyword extraction method based on theme semantic perception
CN112270185A (en) * 2020-10-29 2021-01-26 山西大学 Text representation method based on topic model
CN112508376A (en) * 2020-11-30 2021-03-16 中国科学院深圳先进技术研究院 Index system construction method
CN113011133A (en) * 2021-02-23 2021-06-22 吉林大学珠海学院 Single cell correlation technique data analysis method based on natural language processing
CN113051917B (en) * 2021-04-23 2022-11-18 东南大学 Document implicit time inference method based on time window text similarity
CN113407679B (en) * 2021-06-30 2023-10-03 竹间智能科技(上海)有限公司 Text topic mining method and device, electronic equipment and storage medium
CN113378512B (en) * 2021-07-05 2023-05-26 中国科学技术信息研究所 Automatic indexing-based stepless dynamic evolution subject cloud image generation method
CN113486176B (en) * 2021-07-08 2022-11-04 桂林电子科技大学 News classification method based on secondary feature amplification
CN113505581A (en) * 2021-07-27 2021-10-15 北京工商大学 Education big data text analysis method based on APSO-LSTM network
CN113591476A (en) * 2021-08-10 2021-11-02 闪捷信息科技有限公司 Data label recommendation method based on machine learning
CN113673223A (en) * 2021-08-25 2021-11-19 北京智通云联科技有限公司 Keyword extraction method and system based on semantic similarity
CN115168600B (en) * 2022-06-23 2023-07-11 广州大学 Value chain knowledge discovery method under personalized customization
CN116431814B (en) * 2023-06-06 2023-09-05 北京中关村科金技术有限公司 Information extraction method, information extraction device, electronic equipment and readable storage medium
CN116975246B (en) * 2023-08-03 2024-04-26 深圳市博锐高科科技有限公司 Data acquisition method, device, chip and terminal

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778161A (en) * 2015-04-30 2015-07-15 车智互联(北京)科技有限公司 Keyword extracting method based on Word2Vec and Query log
CN105893410A (en) * 2015-11-18 2016-08-24 乐视网信息技术(北京)股份有限公司 Keyword extraction method and apparatus

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8245135B2 (en) * 2009-09-08 2012-08-14 International Business Machines Corporation Producing a visual summarization of text documents

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778161A (en) * 2015-04-30 2015-07-15 车智互联(北京)科技有限公司 Keyword extracting method based on Word2Vec and Query log
CN105893410A (en) * 2015-11-18 2016-08-24 乐视网信息技术(北京)股份有限公司 Keyword extraction method and apparatus

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Research on Keyword extraction based on Word2Vec weighted TextRank;Yujun Wen 等;《2016 2nd IEEE International Conference on Computer and Communications》;20161017 *
Topic keyword extraction method fusing topic word embedding and network structure analysis; Zeng Qingtian et al.; Data Analysis and Knowledge Discovery; 20190725 (No. 7); 52-60 *
Domain keyword extraction: combining LDA and Word2Vec; Wei Qiangshen; China Master's Theses Full-text Database, Information Science and Technology; 20161215 (No. 12); I138-406 *

Also Published As

Publication number Publication date
CN108052593A (en) 2018-05-18

Similar Documents

Publication Publication Date Title
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
CN111177365B (en) Unsupervised automatic abstract extraction method based on graph model
CN109858028B (en) Short text similarity calculation method based on probability model
CN106156204B (en) Text label extraction method and device
CN106372061B (en) Short text similarity calculation method based on semantics
CN105528437B (en) A kind of question answering system construction method extracted based on structured text knowledge
Wen et al. Research on keyword extraction based on word2vec weighted textrank
CN108681557B (en) Short text topic discovery method and system based on self-expansion representation and similar bidirectional constraint
CN109376352B (en) Patent text modeling method based on word2vec and semantic similarity
CN109885675B (en) Text subtopic discovery method based on improved LDA
CN108920482B (en) Microblog short text classification method based on lexical chain feature extension and LDA (latent Dirichlet Allocation) model
CN108052630B (en) Method for extracting expansion words based on Chinese education videos
WO2013049529A1 (en) Method and apparatus for unsupervised learning of multi-resolution user profile from text analysis
CN111382276A (en) Event development venation map generation method
Tiwari et al. Ensemble approach for twitter sentiment analysis
CN116501875B (en) Document processing method and system based on natural language and knowledge graph
CN112836029A (en) Graph-based document retrieval method, system and related components thereof
Chang et al. A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING.
CN116362243A (en) Text key phrase extraction method, storage medium and device integrating incidence relation among sentences
CN110569351A (en) Network media news classification method based on restrictive user preference
Villegas et al. Vector-based word representations for sentiment analysis: a comparative study
Khan et al. Efficient feature selection and domain relevance term weighting method for document classification
CN114298020A (en) Keyword vectorization method based on subject semantic information and application thereof
Gong A personalized recommendation method for short drama videos based on external index features
Tohalino et al. Using virtual edges to extract keywords from texts modeled as complex networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant