CN111222333A - Keyword extraction method based on fusion of network high-order structure and topic model - Google Patents


Publication number
CN111222333A
Authority
CN
China
Prior art keywords
word
network
order structure
topic
weighted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010321185.1A
Other languages
Chinese (zh)
Inventor
朱婷婷
杨瀚
温序铭
王炜
谢超平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Sobey Digital Technology Co Ltd
Original Assignee
Chengdu Sobey Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-04-22
Filing date
2020-04-22
Publication date
2020-06-02
Application filed by Chengdu Sobey Digital Technology Co Ltd
Priority to CN202010321185.1A
Publication of CN111222333A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a keyword extraction method based on the fusion of a network high-order structure and a topic model, comprising the following steps. Step one: segment the news text D into words. Step two: remove stop words from the word segmentation result to generate a word sequence. Step three: construct a word co-occurrence network G from the word sequence. Step four: weight the connecting edges of the word co-occurrence network G based on the network high-order structure to obtain a weighted adjacency matrix M. Step five: compute the topic expression ability, under the target text, of the words in the word co-occurrence network G. Step six: based on the weighted adjacency matrix M obtained in step four and the topic expression ability obtained in step five, compute the final importance score of each word in the word co-occurrence network G, and select the top k words in descending order of final importance score as the keywords of the news text D. The method has low computational complexity and, by incorporating the topics of words, improves the accuracy of keyword extraction for news text.

Description

Keyword extraction method based on fusion of network high-order structure and topic model
Technical Field
The invention belongs to the field of automatic extraction of news keywords, and in particular relates to a keyword extraction method based on the fusion of a network high-order structure and a topic model, suitable for unsupervised automatic extraction of keywords from news text.
Background
The development of network technology and the rise of converged media have led to a dramatic increase in the amount of news information. Major news platforms (such as Toutiao) generate a large volume of news data every day, and enabling audiences to quickly obtain information from news documents that are broad in coverage and large in volume is a considerable challenge.
As two basic tasks of natural language processing, text classification and keyword extraction can both obtain key information about the content of a news document so that audiences can quickly grasp it. Classification assigns the news text to a category in a hierarchy; the classification system is defined in advance and is a closed set. However, a news category is a rather general concept: it usually tells the audience only that the news belongs to, say, sports, politics, or economy. In contrast, keyword extraction yields important words that are more closely related to the content or topic of the news document, and the information these words carry is more specific. For example, two news items may both belong to sports in the category hierarchy while their extracted keywords are basketball and figure skating, respectively. This more specific information helps audiences filter information more effectively and also benefits the intelligent distribution of news data (for example, in recommendation scenarios).
Keyword extraction algorithms fall into two major categories, supervised and unsupervised. Because supervised methods usually require a large amount of manually annotated data, which is costly, the present invention approaches the problem from an unsupervised perspective. An unsupervised keyword extraction method can be regarded as a ranking method that essentially computes the importance of words, and such methods can broadly be divided into two types: statistics-based models and network-based models. The most representative statistical model is TF-IDF, which computes the importance of a word from the frequency with which the word occurs in the target document (TF) and the inverse of the word's document frequency across all documents in the corpus (IDF). TF-IDF measures word importance using simple statistics only and does not consider the semantics or topics of words. Network-based ranking methods, such as the TextRank model, construct a network from the target document: the words appearing in the target document are the nodes, the co-occurrence relations among those words are the connecting edges, and the importance of the nodes (that is, of the words) is then computed by a random walk. Such models still do not take the topics of words into account. To address this, the paper "EntropyRank: key phrase extraction algorithm based on topic entropy" (Yin Hong, Chen Yan, Li Ping. Journal of Chinese Information Processing, 2019, 33(11): 107-114.) uses information entropy to compute the topic expression ability of a word in a specific document and modifies the random walk model accordingly, improving extraction performance to a certain extent.
However, although TextRank and later modified versions such as EntropyRank do not need a large corpus, the algorithm must iterate to convergence to obtain the importance scores of the nodes, which is more expensive to compute than purely statistical models such as TF-IDF. To address the computing cost, the paper "Single document keyword extraction via quantifying higher-order structural features of word co-occurrence graph" redefines word importance indices, KSMT and KSMQ, in terms of the high-order structures (subgraphs) of the network: the model regards a subgraph of the network as a semantic component, and the more semantic components a word participates in, the more important the word is to the article. Although this improves computational efficiency, KSMT and KSMQ are essentially statistical models and do not actually consider the topics of words.
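The TF-IDF baseline described in the background can be illustrated with a minimal sketch. The function name `tf_idf`, the toy corpus, and the smoothed IDF variant used here are assumptions made for illustration; they are not taken from the patent.

```python
import math
from collections import Counter


def tf_idf(target_tokens, corpus):
    """Score each word of the target document by TF * IDF.

    target_tokens: list of words of the target document.
    corpus: list of documents, each a list of words (assumed to
            include the target document itself).
    """
    counts = Counter(target_tokens)
    total = len(target_tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        tf = count / total  # frequency of the word in the target document
        df = sum(1 for doc in corpus if word in doc)  # documents containing it
        # Smoothed IDF, one common variant (an assumption of this sketch).
        idf = math.log(n_docs / (1 + df)) + 1.0
        scores[word] = tf * idf
    return scores


# Hypothetical toy corpus of already segmented documents.
corpus = [
    ["basketball", "basketball", "game", "score"],
    ["skating", "game", "ice"],
    ["election", "vote", "game"],
]
scores = tf_idf(corpus[0], corpus)
```

In this toy corpus "game" occurs in every document, so its IDF is low and "basketball" ranks above it, which is the purely statistical behaviour the background contrasts with topic-aware methods.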
Disclosure of Invention
The technical problem to be solved by the invention is as follows: existing models are limited in applicability and effectiveness because of their high computational cost or their failure to consider semantics or topics; to address this, a keyword extraction method based on the fusion of a network high-order structure and a topic model is provided.
The technical scheme adopted by the invention is as follows: a keyword extraction method based on the fusion of a network high-order structure and a topic model comprises the following steps:
Step one: segment the news text D into words;
Step two: remove stop words from the word segmentation result to generate a word sequence;
Step three: construct a word co-occurrence network G from the word sequence;
Step four: weight the connecting edges of the word co-occurrence network G based on the network high-order structure to obtain a weighted adjacency matrix M; step four comprises the following substeps:
Step 401: select M4 or M13 of the three-node subgraphs as the network high-order structure; M4 indicates that all three word pairs formed by the three nodes co-occur; M13 indicates that one of the three word pairs formed by the three nodes never co-occurs;
Step 402: weight the connecting edges of the word co-occurrence network G based on the network high-order structure M4 or M13 to obtain the weighted adjacency matrix M: for words n_i and n_j, the weight weight_ij of their connecting edge is the number of M4 or M13 structures in which n_i and n_j both participate, which yields the weighted adjacency matrix M based on the network high-order structure M4 or M13, whose elements are M_ij = weight_ij;
Step five: compute the topic expression ability, under the target text, of the words in the word co-occurrence network G;
Step six: based on the weighted adjacency matrix M obtained in step four and the topic expression ability obtained in step five, compute the final importance score of each word in the word co-occurrence network G, and select the top k words in descending order of final importance score as the keywords of the news text D.
Further, the method of step one is: use the Jieba word segmentation tool to segment the given news text D, with the segmentation mode of the Jieba tool set to accurate mode.
Further, a custom dictionary for the news scenario may be added for word segmentation.
Further, the method of step two is: remove stop words from the word segmentation result using a stop word list to generate a word sequence; the stop word list is constructed for the news scenario.
Further, step three comprises the following substeps:
Step 301: set the window size window, the step length stride, and the threshold α;
Step 302: slide over the word sequence according to the window size window and the step length stride, and record each word pair e_ij that appears within the same window together with the number of windows c_ij in which the word pair co-occurs;
Step 303: delete the word pairs e_ij whose co-occurrence window count c_ij is less than the threshold α, obtaining the word pair set E = {e_ij | c_ij ≥ α}, and obtain the word set N = {n_i} from the word pair set E;
Step 304: take the words n_i in the word set N as nodes and add a connecting edge between the two words of each word pair in the word pair set E, thereby constructing the word co-occurrence network G.
Further, step five comprises the following substeps:
Step 501: using a previously trained topic model, obtain the topic distribution of each word n_i in the word co-occurrence network G
[Formula image 001: topic distribution of the word n_i]
and the topic distribution of the news text D
[Formula image 002: topic distribution of the news text D]
where K is the number of topics;
Step 502: for each word n_i, compute its topic distribution under the news text D as follows:
[Formula image 003: topic distribution of the word n_i under the news text D]
where f is the softmax function:
[Formula image 004: definition of the softmax function f]
Step 503: compute the topic expression ability of the word under the target text
[Formula image 005: symbol for the topic expression ability]
by the following formula:
[Formula image 006: formula for the topic expression ability]
where h is the information entropy function:
[Formula image 007: definition of the information entropy function h]
Further, in step six, the importance score is computed as:
[Formula image 008: formula for the importance score]
where Score(n_i) denotes the final importance score of each word n_i in the word co-occurrence network G; M_ij denotes an element of the weighted adjacency matrix M obtained in step four; and
[Formula image 005: symbol for the topic expression ability]
denotes the topic expression ability, under the target text, of the words in the word co-occurrence network G.
In summary, by adopting the above technical scheme, the invention has the following beneficial effects:
The keyword extraction method of the invention has low computational complexity and, by incorporating the topics of words, improves the accuracy of keyword extraction for news text.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a flow chart of a keyword extraction method based on the fusion of a network high-order structure and a topic model.
Fig. 2 is a schematic diagram of all network high-order structures among three-node subgraphs.
Fig. 3 is a schematic diagram of the network high-order structures M4 and M13 according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating obtaining a weighted adjacency matrix based on the network high-order structure M4 according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
The features and properties of the present invention are described in further detail below with reference to examples.
As shown in fig. 1, the keyword extraction method based on the fusion of the network high-order structure and the topic model provided in this embodiment includes the following steps:
Step one: segment the news text D into words.
In step one, the Jieba word segmentation tool may be used to segment the given news text D, with the segmentation mode of the Jieba tool set to accurate mode. In addition, to make the segmentation result more accurate, a custom dictionary for the news scenario can be added for word segmentation.
Step two: remove stop words from the word segmentation result to generate a word sequence.
In step two, stop words in the word segmentation result can be removed using a stop word list to generate the word sequence
[Formula image 009: the word sequence]
Because no existing stop word list is specifically suited to the news scenario, the stop word list can be constructed for the news scenario to make stop word removal more accurate.
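Steps one and two can be sketched as follows. To stay dependency-free, the sketch starts from an already segmented token list; in practice the tokens would come from a segmenter, for example `jieba.lcut(text, cut_all=False)` for accurate mode, optionally after `jieba.load_userdict(path)` for a custom dictionary. The function name `remove_stop_words`, the sample tokens, and the small stop word set are hypothetical.

```python
def remove_stop_words(tokens, stop_words):
    """Drop stop words and whitespace-only tokens, keeping word order."""
    stop = set(stop_words)
    return [t for t in tokens if t.strip() and t not in stop]


# Hypothetical segmentation result for a short news sentence.
tokens = ["今天", "的", "篮球", "比赛", "非常", "精彩", "，"]

# Hypothetical stop word list; the patent builds one for the news scenario.
stop_words = {"今天", "的", "非常", "，"}

word_sequence = remove_stop_words(tokens, stop_words)
```

Keeping the original order matters here because the next step slides a window over the resulting word sequence.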
Step three: word co-occurrence network based on word sequence constructionG
In the third step, traversing the word sequence according to the set window size, counting the times of the word pairs in the word sequence appearing in the window, and filtering the word pairs with lower frequency; and then all the words in the remaining word pairs are used as network nodes, and connecting edges are added to construct a co-occurrence network.
Specifically, step three includes the following substeps:
step 301: setting window sizewindowStep lengthstride(sliding distance) and threshold valueα
Step 302: according to window sizewindowAnd step sizestridePerforming sliding traversal on the word sequence, and counting the word pairs appearing in the same windowe ij (i.e. w) i Andw j i=1,2,…,n,j=1,2,…,n) And the number of windows in which the word pair co-occursc ij
Step 303: deleting co-occurring window numbersc ij Less than thresholdαWord paire ij Obtaining a set of word pairsE={(e ij )|c ij αAnd from the set of word pairsETo obtain a set of wordsN={n i };
Step 304: set wordsNChinese wordn i As nodes and appear in the word pair setEAdding connecting edges between word pairs in the Chinese character to construct a word co-occurrence networkG
Step four: word-pair co-occurrence network based on network high-order structureGThe continuous edges of the adjacent rows are weighted to obtain a weighted adjacent matrixM
The fourth step comprises the following substeps:
step 401: selecting a network high-order structure form as M4 or M13 in the three-node subgraph;
the paper "high-order organization of complex networks" (Austin R. Benson, David F. Gleich and Jure Leskovec, Science, 08 Jul 2016, Vol 353, Issue 6295, pp. 163-166) shows all Higher order structures on three nodes, from M1 to M13, as shown in FIG. 2. Considering constructed word co-occurrence networkGIt is an undirected graph, i.e. the connecting edges between word nodes are not directional (equivalent to the connecting edges between word nodes are bidirectional), so in this application scenario, the high-order structures between three nodes are only M4 and M13, as shown in fig. 3 (where bidirectional illustration is equivalent to undirected illustration). In the application scenario of the patent, M4 indicates that three word pairs formed by three nodes all co-occur; m13 indicates that one of the three word pairs formed by the three nodes never co-occurs.
Step 402: word-pair co-occurrence network based on network high-order structure M4 or M13GThe edges of (a) are weighted.
In particular, for wordsn i Andn j weight of its connected edgesweight ij Is a wordn i Andn j the number of M4 or M13 which are co-occurring, thereby obtaining a weighted adjacency matrix based on the network high-order structure M4 or M13MThe weighted adjacency matrixMElement (1) ofM ij = weight ij . FIG. 4 is a diagram of obtaining a weighted adjacency matrix based on a network high-order structure M4MThe same applies to the network high-order structure M13.
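The edge weighting of step 402 can be sketched as follows, under one reading of the text: for an undirected graph, an M4 instance containing edge (n_i, n_j) is a triangle, so its count is the number of common neighbours, and an M13 instance containing the edge is an open two-edge path, so its count is the number of third nodes adjacent to exactly one endpoint. The function name `motif_weighted_adjacency` and the nested-list matrix are assumptions of this sketch.

```python
def motif_weighted_adjacency(graph, motif="M4"):
    """Return (nodes, M) where M[i][j] counts motif instances on edge (i, j).

    graph: dict mapping each node to the set of its neighbours; assumed
           symmetric, with every node present as a key.
    """
    nodes = sorted(graph)
    index = {n: i for i, n in enumerate(nodes)}
    M = [[0] * len(nodes) for _ in nodes]
    for u in nodes:
        for v in graph[u]:
            others_u = graph[u] - {v}
            others_v = graph[v] - {u}
            if motif == "M4":
                # Triangles through (u, v): neighbours shared by both.
                weight = len(others_u & others_v)
            elif motif == "M13":
                # Open wedges through (u, v): third nodes linked to exactly one.
                weight = len(others_u ^ others_v)
            else:
                raise ValueError("motif must be 'M4' or 'M13'")
            M[index[u]][index[v]] = weight
    return nodes, M


# Toy graph: triangle a-b-c plus a pendant edge c-d.
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
nodes, M4 = motif_weighted_adjacency(graph, "M4")
_, M13 = motif_weighted_adjacency(graph, "M13")
```

Because each weight needs only local neighbour sets, no iteration to convergence is required, which is consistent with the low computational cost the description attributes to the high-order-structure approach.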
Step five: computing word co-occurrence networkGThe words in (1) express the ability of the subject under the target text.
In the fifth step, the weighted words are co-occurred in the networkGThe topic distribution of the words in the target text is calculated by utilizing the topic distribution of the news text and the topic distribution of the words, and the information entropy of the topic distribution is further calculated to serve as the topic expression capacity of the words in the target text.
Specifically, step five includes the following substeps;
step 501: obtaining word co-occurrence network by using previously learned topic modelGEach word inn i Subject distribution of
Figure 260781DEST_PATH_IMAGE001
And news textDSubject distribution of
Figure 989703DEST_PATH_IMAGE002
(ii) a Wherein the content of the first and second substances,Kis a subject number;
the topic model learned in advance is the prior art, for example, the topic model may be trained by using lda topic model in the genesis module of python and news corpus, and the word co-occurrence network may be obtained by using the topic modelGEach word inn i Subject distribution of
Figure 68517DEST_PATH_IMAGE001
And news textDSubject matter of
Figure 628811DEST_PATH_IMAGE002
. Need to make sure thatNote that for default words (i.e., words not covered by the trained topic model), we set their topic distribution to a uniform distribution.
Step 502: for each wordn i Calculate it in the news textDThe following topic distribution is calculated as follows:
Figure 400458DEST_PATH_IMAGE003
whereinfFor the softmax function:
Figure 914878DEST_PATH_IMAGE004
step 503: calculating topic expression capability of words under target text
Figure 797384DEST_PATH_IMAGE005
The calculation formula is as follows:
Figure 149868DEST_PATH_IMAGE006
wherein the content of the first and second substances,hin order to be a function of the entropy of the information,
Figure 154733DEST_PATH_IMAGE007
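The two named building blocks of step five, the softmax function f and the information entropy function h, together with the uniform fallback for words the topic model does not cover, can be sketched in their standard textbook forms. The formula images are not reproduced in this text, so the sketch does not attempt to recover how the patent combines the word and document distributions or how entropy is mapped to topic expression ability; the helper `topic_distribution` and the sample values are hypothetical.

```python
import math


def softmax(values):
    """Standard softmax: exp(x_k) / sum_j exp(x_j), shifted for stability."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(dist):
    """Shannon information entropy -sum_k p_k * log(p_k), natural log."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def topic_distribution(word, word_topics, num_topics):
    """Look up a word's topic distribution; uniform if the word is unknown."""
    return word_topics.get(word, [1.0 / num_topics] * num_topics)


K = 4  # number of topics (hypothetical)
word_topics = {"篮球": [0.85, 0.05, 0.05, 0.05]}  # hypothetical model output

focused = topic_distribution("篮球", word_topics, K)
unknown = topic_distribution("未登录词", word_topics, K)
normalized = softmax([1.0, 2.0, 3.0])
```

A word concentrated on one topic has low entropy, while the uniform fallback reaches the maximum log K, so an entropy-based measure gives out-of-vocabulary words the least topical specificity.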
Step six: based on the weighted adjacency matrix M obtained in step four and the topic expression ability obtained in step five, compute the final importance score of each word in the word co-occurrence network G, and select the top k words in descending order of final importance score as the keywords of the news text D.
The importance score is computed as:
[Formula image 008: formula for the importance score]
where Score(n_i) denotes the final importance score of each word n_i in the word co-occurrence network G; M_ij denotes an element of the weighted adjacency matrix M obtained in step four; and
[Formula image 005: symbol for the topic expression ability]
denotes the topic expression ability, under the target text, of the words in the word co-occurrence network G. It should be noted that other functional forms may also be chosen for computing the importance score Score(n_i).
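The ranking in step six can be sketched as follows. The score formula itself is only available as an image, so the combination used here, topic expression ability multiplied by the row sum of M, is a purely hypothetical stand-in chosen because the description allows other functional forms; the function name `top_k_keywords` and the sample values are likewise assumptions.

```python
def top_k_keywords(nodes, M, topic_ability, k):
    """Rank words by a hypothetical score and return the top k.

    nodes: list of words, aligned with the rows of M.
    M: weighted adjacency matrix as a list of lists.
    topic_ability: dict mapping each word to its topic expression ability.
    """
    scores = {}
    for i, word in enumerate(nodes):
        # Hypothetical form, NOT the patent's formula:
        # topic ability times the total motif weight of the word's edges.
        scores[word] = topic_ability[word] * sum(M[i])
    # Sort by descending score; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:k], scores


nodes = ["a", "b", "c", "d"]
M = [[0, 0, 1, 0], [0, 0, 1, 0], [1, 1, 0, 2], [0, 0, 2, 0]]
topic_ability = {"a": 0.9, "b": 0.2, "c": 0.5, "d": 0.4}  # hypothetical values

keywords, scores = top_k_keywords(nodes, M, topic_ability, k=2)
```

Whatever functional form is used, the selection itself is a single sort over the node scores, with no iterative propagation.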
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (7)

1. A keyword extraction method based on the fusion of a network high-order structure and a topic model, characterized by comprising the following steps:
Step one: segment the news text D into words;
Step two: remove stop words from the word segmentation result to generate a word sequence;
Step three: construct a word co-occurrence network G from the word sequence;
Step four: weight the connecting edges of the word co-occurrence network G based on the network high-order structure to obtain a weighted adjacency matrix M; step four comprises the following substeps:
Step 401: select M4 or M13 of the three-node subgraphs as the network high-order structure; M4 indicates that all three word pairs formed by the three nodes co-occur; M13 indicates that one of the three word pairs formed by the three nodes never co-occurs;
Step 402: weight the connecting edges of the word co-occurrence network G based on the network high-order structure M4 or M13 to obtain the weighted adjacency matrix M: for words n_i and n_j, the weight weight_ij of their connecting edge is the number of M4 or M13 structures in which n_i and n_j both participate, which yields the weighted adjacency matrix M based on the network high-order structure M4 or M13, whose elements are M_ij = weight_ij;
Step five: compute the topic expression ability, under the target text, of the words in the word co-occurrence network G;
Step six: based on the weighted adjacency matrix M obtained in step four and the topic expression ability obtained in step five, compute the final importance score of each word in the word co-occurrence network G, and select the top k words in descending order of final importance score as the keywords of the news text D.
2. The keyword extraction method based on the fusion of a network high-order structure and a topic model according to claim 1, wherein the method of step one is: use the Jieba word segmentation tool to segment the given news text D, with the segmentation mode of the Jieba tool set to accurate mode.
3. The keyword extraction method based on the fusion of a network high-order structure and a topic model according to claim 1 or 2, characterized in that a custom dictionary for the news scenario can be added for word segmentation.
4. The keyword extraction method based on the fusion of a network high-order structure and a topic model according to claim 1, wherein the method of step two is: remove stop words from the word segmentation result using a stop word list to generate a word sequence; the stop word list is constructed for the news scenario.
5. The keyword extraction method based on the fusion of a network high-order structure and a topic model according to claim 1, wherein step three comprises the following substeps:
Step 301: set the window size window, the step length stride, and the threshold α;
Step 302: slide over the word sequence according to the window size window and the step length stride, and record each word pair e_ij that appears within the same window together with the number of windows c_ij in which the word pair co-occurs;
Step 303: delete the word pairs e_ij whose co-occurrence window count c_ij is less than the threshold α, obtaining the word pair set E = {e_ij | c_ij ≥ α}, and obtain the word set N = {n_i} from the word pair set E;
Step 304: take the words n_i in the word set N as nodes and add a connecting edge between the two words of each word pair in the word pair set E, thereby constructing the word co-occurrence network G.
6. The keyword extraction method based on the fusion of a network high-order structure and a topic model according to claim 1, wherein step five comprises the following substeps:
Step 501: using a previously trained topic model, obtain the topic distribution of each word n_i in the word co-occurrence network G
[Formula image 001: topic distribution of the word n_i]
and the topic distribution of the news text D
[Formula image 002: topic distribution of the news text D]
where K is the number of topics;
Step 502: for each word n_i, compute its topic distribution under the news text D as follows:
[Formula image 003: topic distribution of the word n_i under the news text D]
where f is the softmax function:
[Formula image 004: definition of the softmax function f]
Step 503: compute the topic expression ability of the word under the target text
[Formula image 005: symbol for the topic expression ability]
by the following formula:
[Formula image 006: formula for the topic expression ability]
where h is the information entropy function:
[Formula image 007: definition of the information entropy function h]
7. The keyword extraction method based on the fusion of a network high-order structure and a topic model according to claim 1, wherein in step six the importance score is computed as:
[Formula image 008: formula for the importance score]
where Score(n_i) denotes the final importance score of each word n_i in the word co-occurrence network G; M_ij denotes an element of the weighted adjacency matrix M obtained in step four; and
[Formula image 005: symbol for the topic expression ability]
denotes the topic expression ability, under the target text, of the words in the word co-occurrence network G.
CN202010321185.1A 2020-04-22 2020-04-22 Keyword extraction method based on fusion of network high-order structure and topic model Pending CN111222333A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010321185.1A CN111222333A (en) 2020-04-22 2020-04-22 Keyword extraction method based on fusion of network high-order structure and topic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010321185.1A CN111222333A (en) 2020-04-22 2020-04-22 Keyword extraction method based on fusion of network high-order structure and topic model

Publications (1)

Publication Number Publication Date
CN111222333A true CN111222333A (en) 2020-06-02

Family

ID=70828552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010321185.1A Pending CN111222333A (en) 2020-04-22 2020-04-22 Keyword extraction method based on fusion of network high-order structure and topic model

Country Status (1)

Country Link
CN (1) CN111222333A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101398814A (en) * 2007-09-26 2009-04-01 北京大学 Method and system for simultaneously abstracting document summarization and key words
CN109726402A (en) * 2019-01-11 2019-05-07 中国电子科技集团公司第七研究所 A kind of document subject matter word extraction method
KR20190104656A (en) * 2018-03-02 2019-09-11 최성우 Method and apparatus for extracting title on text
CN110263343A (en) * 2019-06-24 2019-09-20 北京理工大学 The keyword abstraction method and system of phrase-based vector
CN110362678A (en) * 2019-06-04 2019-10-22 哈尔滨工业大学(威海) A kind of method and apparatus automatically extracting Chinese text keyword

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
MONALI BORDOLOI 等: "Keyword extraction from micro-blogs using collective weight", 《SOCIAL NETWORK ANALYSIS AND MINING》 *
QIAN CHEN 等: "Chinese Keyword Extraction Using Semantically Weighted Network", 《2014 SIXTH INTERNATIONAL CONFERENCE ON INTELLIGENT HUMAN-MACHINE SYSTEMS AND CYBERNETICS》 *
YAN CHEN 等: "Single document keyword extraction via quantifying higher-order structural features of word co-occurrence graph", 《COMPUTER SPEECH&LANGUAGE》 *
尹红: "EntropyRank:基于主题熵的关键短语提取算法", 《中文信息学报》 *
常鹏: "基于词共现的文本主题挖掘模型和算法研究", 《中国博士学位论文全文数据库》 *
闫光辉 等: "融合高阶信息的社交网络重要节点识别算法", 《通信学报》 *

Similar Documents

Publication Publication Date Title
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
CN108197111B (en) Text automatic summarization method based on fusion semantic clustering
Li et al. Data sets: Word embeddings learned from tweets and general data
CN108763213A (en) Theme feature text key word extracting method
CN109885686A (en) A kind of multilingual file classification method merging subject information and BiLSTM-CNN
RU2618374C1 (en) Identifying collocations in the texts in natural language
CN103207860A (en) Method and device for extracting entity relationships of public sentiment events
WO2024036840A1 (en) Open-domain dialogue reply method and system based on topic enhancement
CN107315734A (en) A kind of method and system for becoming pronouns, general term for nouns, numerals and measure words standardization based on time window and semanteme
CN102779119B (en) A kind of method of extracting keywords and device
CN109446423A (en) A kind of Judgment by emotion system and method for news and text
Ma et al. A time-series based aggregation scheme for topic detection in Weibo short texts
CN111026866B (en) Domain-oriented text information extraction clustering method, device and storage medium
Rathod Extractive text summarization of Marathi news articles
Tomar et al. Probabilistic latent semantic analysis for unsupervised word sense disambiguation
Ao et al. News keywords extraction algorithm based on TextRank and classified TF-IDF
CN116362243A (en) Text key phrase extraction method, storage medium and device integrating incidence relation among sentences
CN114265936A (en) Method for realizing text mining of science and technology project
CN109871429B (en) Short text retrieval method integrating Wikipedia classification and explicit semantic features
Bahloul et al. ArA* summarizer: An Arabic text summarization system based on subtopic segmentation and using an A* algorithm for reduction
CN108427769B (en) Character interest tag extraction method based on social network
CN115617981A (en) Information level abstract extraction method for short text of social network
Tang et al. Text semantic understanding based on knowledge enhancement and multi-granular feature extraction
CN111222333A (en) Keyword extraction method based on fusion of network high-order structure and topic model
Zhou et al. Satirical news detection with semantic feature extraction and game-theoretic rough sets

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20200602