CN111222333A

CN111222333A - Keyword extraction method based on fusion of network high-order structure and topic model

Info

Publication number: CN111222333A
Application number: CN202010321185.1A
Authority: CN
Inventors: 朱婷婷; 杨瀚; 温序铭; 王炜; 谢超平
Original assignee: Chengdu Sobey Digital Technology Co Ltd
Current assignee: Chengdu Sobey Digital Technology Co Ltd
Priority date: 2020-04-22
Filing date: 2020-04-22
Publication date: 2020-06-02

Abstract

The invention discloses a keyword extraction method based on the fusion of a network high-order structure and a topic model, comprising the following steps. Step one: perform word segmentation on the news text D. Step two: remove stop words from the word segmentation result to generate a word sequence. Step three: construct a word co-occurrence network G from the word sequence. Step four: weight the edges of the word co-occurrence network G based on the network high-order structure to obtain a weighted adjacency matrix M. Step five: compute the topic expression ability, under the target text, of the words in the word co-occurrence network G. Step six: based on the weighted adjacency matrix M obtained in step four and the topic expression ability obtained in step five, compute the final importance score of each word in the word co-occurrence network G, sort the words by final importance score in descending order, and select the top k words as the keywords of the news text D. The keyword extraction method has low computational complexity and, by incorporating the topics of words, improves the accuracy of keyword extraction for news text.

Description

Keyword extraction method based on fusion of network high-order structure and topic model

Technical Field

The invention belongs to the field of automatic news keyword extraction, and in particular relates to a keyword extraction method based on the fusion of a network high-order structure and a topic model, suitable for scenarios that require unsupervised automatic extraction of keywords from news text.

Background

The development of network technology and the rise of converged media have led to a dramatic increase in the amount of news information. A great deal of news data is generated every day on the major news platforms (such as Jinri Toutiao, "Today's Headlines"), and enabling audiences to quickly obtain information from news documents that are both broad in coverage and large in volume is a major challenge.

Text classification and keyword extraction are two basic tasks of natural language processing; both can obtain key information about the content of a news document, so that audiences can quickly grasp what it is about. Classification assigns the news text to a category in a hierarchy; the classification system is defined in advance and is a closed set. However, a news category is a rather general concept that only tells the audience roughly which category, such as sports, politics, or economy, a news item belongs to. In contrast, keyword extraction can obtain important words that are more closely related to the content or topic of a news document, and the information these words carry is more specific. For example, two news items may both belong to sports in the category hierarchy, while their keyword extraction results are basketball and figure skating, respectively. This more specific summary information helps audiences filter information more effectively and is also more useful for the intelligent distribution of news data (for example, in a recommendation scenario).

Keyword extraction algorithms fall into two major categories: supervised and unsupervised. Since supervised methods usually require a large amount of manually annotated data, which is costly, the present invention takes an unsupervised approach. Unsupervised keyword extraction can be regarded as a ranking problem that essentially computes the importance of words, and these methods can be broadly divided into two types: statistics-based models and network-based models. The most representative statistical model is TF-IDF, which computes the importance of a word mainly from the frequency of the word in the target document (TF) and the inverse of its frequency across all documents, i.e., the corpus (IDF). TF-IDF measures word importance from simple statistics only and does not consider the semantics or topics of words. Network-based ranking methods such as the TextRank model, on the other hand, construct a network from the target document: the words appearing in the target document are the nodes of the network, the co-occurrence relations among the words in the target document are its edges, and the importance of the nodes (that is, of the words) is then computed by a random walk. Such models still do not take the topics of words into account. To address this, the paper "EntropyRank: Key phrase extraction algorithm based on subject entropy" (Yi hong, Chen Yan, Li Ping. Journal of Chinese Information Processing, 2019, 33(11): 107-114.) uses information entropy to compute the topic expression ability of words in a specific document and modifies the random walk model accordingly, which improves the extraction results to some extent.

However, although TextRank and its later variants such as EntropyRank do not need a large corpus, the algorithm must iterate until convergence to obtain the importance scores of the nodes, which is more expensive to compute than purely statistical models such as TF-IDF. To address the computational cost, the paper "Single Document keyword extraction Via Quantifying Higher-order Structural Features of Word Co-occurrence Graph" redefines word importance indices, KSMT and KSMQ, in terms of the high-order structures (subgraphs) of the network. The model regards a subgraph of the network as a semantic component: the more semantic components a word participates in, the more important the word is to the article. Although computational efficiency is improved, KSMT and KSMQ are essentially statistical models and do not actually consider the topics of words.
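The TF-IDF weighting described above can be illustrated with a minimal sketch. The smoothed IDF variant, the function name `tf_idf`, and the toy corpus are assumptions chosen for illustration; they are not taken from the patent or from any particular library.

```python
import math
from collections import Counter

def tf_idf(target_doc, corpus):
    """Score each word of target_doc by term frequency times inverse document frequency."""
    counts = Counter(target_doc)
    total = len(target_doc)
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        tf = count / total  # frequency of the word in the target document
        df = sum(1 for doc in corpus if word in doc)  # documents containing the word
        idf = math.log(n_docs / (1 + df)) + 1.0  # smoothed IDF (an assumed variant)
        scores[word] = tf * idf
    return scores

# Toy corpus of already segmented documents (hypothetical data).
corpus = [
    ["basketball", "team", "wins", "final"],
    ["skating", "team", "wins", "gold"],
    ["market", "team", "reports", "growth"],
]
scores = tf_idf(corpus[0], corpus)
```

In this toy corpus a word that appears in every document ("team") scores lower than a word specific to the target document ("basketball"), which is the purely statistical behavior the background section contrasts with topic-aware methods.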

Disclosure of Invention

The technical problem to be solved by the invention is as follows: existing models are limited in their applicability and effectiveness, either because of high computational cost or because they do not consider the semantics or topics of words. To address this, a keyword extraction method based on the fusion of a network high-order structure and a topic model is provided.

The technical scheme adopted by the invention is as follows: a keyword extraction method based on the fusion of a network high-order structure and a topic model comprises the following steps:

Step one: perform word segmentation on the news text D;

Step two: remove stop words from the word segmentation result to generate a word sequence;

Step three: construct a word co-occurrence network G from the word sequence;

Step four: weight the edges of the word co-occurrence network G based on the network high-order structure to obtain a weighted adjacency matrix M; step four comprises the following substeps:

Step 401: select M4 or M13 among the three-node subgraphs as the form of the network high-order structure, where M4 indicates that all three word pairs formed by the three nodes co-occur, and M13 indicates that one of the three word pairs formed by the three nodes never co-occurs;

Step 402: weight the edges of the word co-occurrence network G based on the network high-order structure M4 or M13 to obtain the weighted adjacency matrix M: for words $n_i$ and $n_j$, the weight $weight_{ij}$ of the edge connecting them is the number of M4 or M13 instances in which $n_i$ and $n_j$ both take part, which gives the weighted adjacency matrix M based on the network high-order structure M4 or M13, with elements $M_{ij} = weight_{ij}$;

Step five: compute the topic expression ability, under the target text, of the words in the word co-occurrence network G;

Step six: based on the weighted adjacency matrix M obtained in step four and the topic expression ability obtained in step five, compute the final importance score of each word in the word co-occurrence network G, sort the words by final importance score in descending order, and select the top k words as the keywords of the news text D.

Further, the method of step one is as follows: the Jieba word segmentation tool is used to segment the given news text D, with Jieba's accurate mode selected as the segmentation mode.

Further, a custom dictionary for the news scenario may be added for word segmentation.

Further, the method of step two is as follows: stop words are removed from the word segmentation result using a stop word list, thereby generating a word sequence; the stop word list is constructed for the news scenario.

Further, the third step comprises the following substeps:

Step 301: set the window size window, the step length stride, and the threshold α;

Step 302: slide over the word sequence according to the window size window and the step length stride, and count the word pairs $e_{ij}$ appearing in the same window and the number of windows $c_{ij}$ in which each word pair co-occurs;

Step 303: delete the word pairs $e_{ij}$ whose number of co-occurrence windows $c_{ij}$ is less than the threshold α to obtain the word pair set $E = \{e_{ij} \mid c_{ij} \ge \alpha\}$, and obtain the word set $N = \{n_i\}$ from the word pair set E;

Step 304: take the words $n_i$ in the word set N as nodes and add an edge between each word pair appearing in the word pair set E to construct the word co-occurrence network G.
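Steps 301 to 304 can be sketched as follows. This is a minimal illustration under stated assumptions: only full windows are visited, a word pair is counted at most once per window, and the function name `build_cooccurrence_network` and the sample word sequence are hypothetical.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence_network(words, window=3, stride=1, alpha=1):
    """Return (nodes, edges, counts) for the word co-occurrence network G."""
    counts = Counter()
    last_start = max(len(words) - window, 0)
    # Step 302: slide the window and count, per word pair, the windows it appears in.
    for start in range(0, last_start + 1, stride):
        in_window = sorted(set(words[start:start + window]))
        for a, b in combinations(in_window, 2):
            counts[(a, b)] += 1
    # Step 303: keep only pairs that co-occur in at least alpha windows.
    edges = {pair for pair, c in counts.items() if c >= alpha}
    # Step 304: the nodes are the words that survive in some kept pair.
    nodes = {w for pair in edges for w in pair}
    return nodes, edges, counts

# Hypothetical word sequence after segmentation and stop-word removal.
words = ["team", "wins", "final", "team", "coach", "wins"]
nodes, edges, counts = build_cooccurrence_network(words, window=3, stride=1, alpha=2)
```

Each undirected pair is stored once, with its two words in sorted order, so the edge set can be expanded into a symmetric adjacency structure later.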

Further, step five includes the following substeps:

Step 501: use a previously learned topic model to obtain the topic distribution of each word $n_i$ in the word co-occurrence network G and the topic distribution of the news text D, where K is the number of topics;

Step 502: for each word $n_i$, compute its topic distribution under the news text D by a formula in which f is the softmax function, which maps a vector $x = (x_1, \ldots, x_K)$ to

$$f(x)_k = \frac{e^{x_k}}{\sum_{l=1}^{K} e^{x_l}}, \quad k = 1, \ldots, K;$$

Step 503: compute the topic expression ability of each word under the target text by a formula in which h is the information entropy function, which for a distribution $p = (p_1, \ldots, p_K)$ is

$$h(p) = -\sum_{k=1}^{K} p_k \log p_k.$$

Further, in step six, the importance score is computed by a formula in which $Score(n_i)$ denotes the final importance score of each word $n_i$ in the word co-occurrence network G, $M_{ij}$ denotes the elements of the weighted adjacency matrix M obtained in step four, and the remaining term denotes the topic expression ability, under the target text, of the words in the word co-occurrence network G.

In summary, by adopting the above technical scheme, the invention has the following beneficial effects:

The keyword extraction method has low computational complexity and, by incorporating the topics of words, improves the accuracy of keyword extraction for news text.

Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and should therefore not be regarded as limiting its scope; those skilled in the art can derive other related drawings from them without inventive effort.

FIG. 1 is a flow chart of a keyword extraction method based on the fusion of a network high-order structure and a topic model.

FIG. 2 is a schematic diagram of all network high-order structures on three-node subgraphs.

FIG. 3 is a schematic diagram of the network high-order structures M4 and M13 according to an embodiment of the present invention.

FIG. 4 is a diagram illustrating obtaining a weighted adjacency matrix based on the network high-order structure M4 according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

The features and properties of the present invention are described in further detail below with reference to examples.

As shown in fig. 1, the keyword extraction method based on the fusion of the network high-order structure and the topic model provided in this embodiment includes the following steps:

Step one: perform word segmentation on the news text D.

In step one, the Jieba word segmentation tool may be used to segment the given news text D, with Jieba's accurate mode selected as the segmentation mode. In addition, to make the segmentation result more accurate, a custom dictionary for the news scenario may be added for word segmentation.

Step two: remove stop words from the word segmentation result to generate a word sequence.

In step two, stop words may be removed from the word segmentation result using a stop word list, thereby generating the word sequence. Because no existing stop word list is specifically suited to the news scenario, the stop word list may be constructed for the news scenario to make stop-word removal more accurate.

Step three: construct a word co-occurrence network G from the word sequence.

In step three, the word sequence is traversed with a window of the set size, the number of times each word pair in the word sequence appears in a window is counted, and low-frequency word pairs are filtered out; all words in the remaining word pairs are then taken as network nodes, and edges are added between them to construct the co-occurrence network.

Specifically, step three includes the following substeps:

Step 301: set the window size window, the step length stride (sliding distance), and the threshold α;

Step 302: slide over the word sequence according to the window size window and the step length stride, and count the word pairs $e_{ij}$ appearing in the same window (i.e., $w_i$ and $w_j$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n$) and the number of windows $c_{ij}$ in which each word pair co-occurs;

Step four: weight the edges of the word co-occurrence network G based on the network high-order structure to obtain a weighted adjacency matrix M.

Step four comprises the following substeps:

Step 401: select M4 or M13 among the three-node subgraphs as the form of the network high-order structure.

The paper "Higher-order organization of complex networks" (Austin R. Benson, David F. Gleich and Jure Leskovec, Science, 08 Jul 2016, Vol 353, Issue 6295, pp. 163-166) presents all high-order structures on three nodes, M1 to M13, as shown in FIG. 2. The constructed word co-occurrence network G is an undirected graph, i.e., the edges between word nodes have no direction (equivalently, they are bidirectional), so in this application scenario the only high-order structures among three nodes are M4 and M13, as shown in FIG. 3 (where a bidirectional edge is equivalent to an undirected one). In the application scenario of this patent, M4 indicates that all three word pairs formed by the three nodes co-occur; M13 indicates that one of the three word pairs formed by the three nodes never co-occurs.

Step 402: weight the edges of the word co-occurrence network G based on the network high-order structure M4 or M13.

Specifically, for words $n_i$ and $n_j$, the weight $weight_{ij}$ of the edge connecting them is the number of M4 or M13 instances in which $n_i$ and $n_j$ both take part, which gives the weighted adjacency matrix M based on the network high-order structure M4 or M13, with elements $M_{ij} = weight_{ij}$. FIG. 4 illustrates obtaining the weighted adjacency matrix M based on the network high-order structure M4; the same applies to the network high-order structure M13.
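The edge weighting of step 402 can be sketched as follows. This is one reading of the text, stated as an assumption: for an edge between two words, an M4 instance is a third word adjacent to both endpoints (a triangle), and an M13 instance is a third word adjacent to exactly one endpoint (an open wedge containing the edge). The function name `motif_weighted_adjacency` and the example graph are hypothetical.

```python
from collections import defaultdict

def motif_weighted_adjacency(edges, motif="M4"):
    """Weight each edge of an undirected graph by the number of M4 or M13 instances containing it."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    weights = {}
    for a, b in edges:
        common = len(neighbors[a] & neighbors[b])
        if motif == "M4":
            # Triangles: every common neighbor k closes the triangle a-b-k.
            w = common
        elif motif == "M13":
            # Open wedges: k is adjacent to exactly one endpoint of the edge a-b.
            w = (len(neighbors[a]) - 1 - common) + (len(neighbors[b]) - 1 - common)
        else:
            raise ValueError("motif must be 'M4' or 'M13'")
        weights[(a, b)] = weights[(b, a)] = w  # the matrix M is symmetric
    return weights

# Hypothetical network: a triangle a-b-c with a pendant node d attached to c.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
M4 = motif_weighted_adjacency(edges, "M4")
M13 = motif_weighted_adjacency(edges, "M13")
```

Both weights need only neighbor-set intersections and degrees, with no iteration to convergence, which is in line with the low computational cost the method aims for.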

Step five: compute the topic expression ability, under the target text, of the words in the word co-occurrence network G.

In step five, the topic distribution of each word in the weighted word co-occurrence network G under the target text is computed from the topic distribution of the news text and the topic distribution of the word, and the information entropy of that distribution is then computed and used as the topic expression ability of the word under the target text.

Specifically, step five includes the following substeps:

Step 501: use a previously learned topic model to obtain the topic distribution of each word $n_i$ in the word co-occurrence network G and the topic distribution of the news text D.

The previously learned topic model is prior art. For example, a topic model may be trained on a news corpus with the LDA topic model in the Python gensim module, and then used to obtain the topic distribution of each word $n_i$ in the word co-occurrence network G and the topic distribution of the news text D. It should be noted that for missing words (i.e., words not covered by the trained topic model), the topic distribution is set to a uniform distribution.
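The uniform fallback for words that the topic model does not cover can be sketched as below. The dictionary `word_topics` stands in for the output of a trained topic model; it, the function name `topic_distribution`, and the sample values are hypothetical.

```python
def topic_distribution(word, word_topics, num_topics):
    """Return the word's topic distribution, or a uniform one if the topic model does not cover it."""
    if word in word_topics:
        return word_topics[word]
    return [1.0 / num_topics] * num_topics

# Hypothetical per-word topic distributions from a trained topic model with K = 3 topics.
word_topics = {"basketball": [0.7, 0.2, 0.1]}
known = topic_distribution("basketball", word_topics, 3)
unknown = topic_distribution("curling", word_topics, 3)
```

A uniform distribution carries no topical preference, so an uncovered word is treated as topically uninformative rather than being dropped from the network.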

Step 502: for each word $n_i$, compute its topic distribution under the news text D by a formula in which f is the softmax function.

Step 503: compute the topic expression ability of each word under the target text according to the calculation formula.
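The two standard building blocks named in steps 502 and 503, the softmax function and the information entropy, can be sketched as follows. The way the word and document topic distributions are combined before the softmax (an elementwise product in `word_topics_given_doc`) is an assumption made for illustration, since the formula itself is not reproduced in this text; the mapping from entropy to the final topic expression ability is likewise left open, and all sample distributions are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(dist):
    """Shannon entropy with natural logarithm; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def word_topics_given_doc(word_dist, doc_dist):
    # Assumption: combine the two distributions by elementwise product, then normalize with softmax.
    return softmax([w * d for w, d in zip(word_dist, doc_dist)])

# Hypothetical topic distributions with K = 3 topics.
doc = [0.8, 0.1, 0.1]
focused = word_topics_given_doc([0.9, 0.05, 0.05], doc)
generic = word_topics_given_doc([1 / 3, 1 / 3, 1 / 3], doc)
```

Under this assumed combination, a word whose topics concentrate on the document's dominant topic yields a lower entropy than a word with a uniform topic distribution, which is the kind of contrast an entropy-based topic expression ability relies on.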

In step six, the importance score is computed by a formula in which one term denotes the topic expression ability, under the target text, of the words in the word co-occurrence network G. It should be noted that other functional forms may also be chosen for computing the importance score $Score(n_i)$.
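The ranking in step six can be sketched as follows. The scoring form used here, the topic expression ability of a word multiplied by the sum of its row in the weighted adjacency matrix M, is only an assumed functional form chosen for illustration; the text itself notes that other functional forms may be selected. The function name `top_k_keywords` and the sample weights are hypothetical.

```python
def top_k_keywords(weights, topic_ability, k):
    """Rank words by an assumed importance score and return the k best together with all scores."""
    strength = {}
    for (a, _b), w in weights.items():
        strength[a] = strength.get(a, 0) + w  # row sum of M for word a
    # Assumed form: Score(n_i) = topic_ability[n_i] * sum_j M_ij
    scores = {word: topic_ability.get(word, 0.0) * s for word, s in strength.items()}
    # Sort by descending score; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    return ranked[:k], scores

# Hypothetical symmetric weights M_ij and topic expression abilities.
weights = {
    ("a", "b"): 1, ("b", "a"): 1,
    ("a", "c"): 2, ("c", "a"): 2,
    ("c", "d"): 1, ("d", "c"): 1,
}
topic_ability = {"a": 0.5, "b": 0.9, "c": 0.8, "d": 0.2}
keywords, scores = top_k_keywords(weights, topic_ability, k=2)
```

With these sample values the word "c" ranks first, because it combines a large structural weight with a high topic expression ability, followed by "a".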

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A keyword extraction method based on the fusion of a network high-order structure and a topic model is characterized by comprising the following steps:

Step one: perform word segmentation on the news text D;

Step three: construct a word co-occurrence network G from the word sequence;

Step 402: weight the edges of the word co-occurrence network G based on the network high-order structure M4 or M13 to obtain a weighted adjacency matrix M: for words $n_i$ and $n_j$, the weight $weight_{ij}$ of the edge connecting them is the number of M4 or M13 instances in which $n_i$ and $n_j$ both take part, which gives the weighted adjacency matrix M based on the network high-order structure M4 or M13, with elements $M_{ij} = weight_{ij}$;

2. The keyword extraction method based on the fusion of the network high-order structure and the topic model according to claim 1, wherein the method of step one is: the Jieba word segmentation tool is used to segment the given news text D, with Jieba's accurate mode selected as the segmentation mode.

3. The keyword extraction method based on the fusion of the network high-order structure and the topic model according to claim 1 or 2, wherein a custom dictionary for the news scenario may be added for word segmentation.

4. The keyword extraction method based on the fusion of the network high-order structure and the topic model according to claim 1, wherein the method of step two is: stop words are removed from the word segmentation result using a stop word list, thereby generating a word sequence; the stop word list is constructed for the news scenario.

5. The keyword extraction method based on the fusion of the network high-order structure and the topic model as claimed in claim 1, wherein the step three comprises the following substeps:

Step 301: set the window size window, the step length stride, and the threshold α;

6. The keyword extraction method based on the fusion of the network high-order structure and the topic model as claimed in claim 1, wherein the step five comprises the following substeps:

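Step 501: use a previously learned topic model to obtain the topic distribution of each word $n_i$ in the word co-occurrence network G and the topic distribution of the news text D;

Step 502: for each word $n_i$, compute its topic distribution under the news text D by a formula in which f is the softmax function;

Step 503: compute the topic expression ability of each word under the target text according to the calculation formula.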

7. The keyword extraction method based on the fusion of the network high-order structure and the topic model according to claim 1, wherein in step six, the calculation formula of the importance score is as follows: