CN108132927B - Keyword extraction method for combining graph structure and node association - Google Patents

Keyword extraction method for combining graph structure and node association

Info

Publication number
CN108132927B
CN108132927B (application CN201711285868.0A)
Authority
CN
China
Prior art keywords
nodes
association
node
text
item set
Prior art date
Legal status
Expired - Fee Related
Application number
CN201711285868.0A
Other languages
Chinese (zh)
Other versions
CN108132927A (en)
Inventor
马慧芳
王双
李苗
赵卫中
李宁
Current Assignee
Northwest Normal University
Original Assignee
Northwest Normal University
Priority date
Filing date
Publication date
Application filed by Northwest Normal University filed Critical Northwest Normal University
Priority to CN201711285868.0A priority Critical patent/CN108132927B/en
Publication of CN108132927A publication Critical patent/CN108132927A/en
Application granted granted Critical
Publication of CN108132927B publication Critical patent/CN108132927B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a keyword extraction method fusing graph structure and node association. The method comprises the following steps: 1) mining the frequent closed item sets of a document and generating a strong association rule set; 2) taking the non-repeating rule heads and rule bodies of the strong association rule set as nodes, placing an edge between two nodes if and only if a strong association rule exists between them, and taking the association strength of the rule as the edge weight to construct the association graph of the document; 3) randomly walking on the association graph with the GSNA algorithm, iteratively computing the importance score of each node, and sorting the results in descending order; 4) clustering the top-ranked nodes and taking the class centre of each class as the keyword extraction result of the document. The method can find the keywords of a long text without an external corpus, avoids the influence of text length in a domain corpus, examines the importance of a word over a wider context, improves the precision of domain keyword-set extraction, and is applicable to extracting domain keyword sets from large-scale domain corpora.

Description

Keyword extraction method for combining graph structure and node association
Technical Field
The invention relates to a novel keyword extraction method, belongs to the field of text information processing, and particularly relates to a keyword extraction method fusing graph structure and node association.
Background
With the spread of network technology, web news and electronic documents of every kind have quickly permeated people's lives, and text keyword extraction technology is crucial to how users obtain valuable information from massive document collections. Keyword extraction is widely applied to web retrieval, text classification, topic lexicon construction and the like, and helps users quickly grasp the core content of a document. In most text mining tasks, keyword extraction ranks terms by their degree of relevance to the text content, and numerous single-document keyword extraction algorithms have arisen accordingly.
Long-text keyword extraction depends on the context in which terms occur, and keyword selection can be performed with supervised or unsupervised learning. Supervised keyword extraction takes two forms: binary classification and feature selection. The binary classification approach decides whether a candidate word is a keyword, typically with a classical classifier. The feature selection approach focuses on the internal statistical features of terms in the text and their external statistical features in a corpus. Unsupervised keyword extraction has three directions: extraction based on latent semantic analysis, graph-based keyword ranking, and topic-based keyword clustering. Latent semantic analysis depends heavily on a corpus for training, and in practice a corpus often cannot be obtained in time, which creates an opportunity for graph-based keyword extraction techniques that can operate without a corpus. The basic source of graph-based keyword ranking algorithms is the PageRank algorithm. GraphSum holds that association relations between nodes vary in strength, and lowers the PageRank score passed in votes between negatively associated nodes, making node scores more reasonable. Attribute-based approaches point out that node importance must consider not only the graph structure but also external information such as node attributes, and systematically propose 13 node attributes to measure node similarity, giving node ranking higher reliability.
Disclosure of Invention
The invention provides a keyword extraction method fusing Graph Structure and Node Association (GSNA), which comprehensively considers the graph structure attributes of nodes, semantic information, and the association characteristics among nodes, and can find the keywords of long texts without an external corpus.
In order to achieve the above object, the keyword extraction method fusing graph structure and node association according to the present invention comprises the following steps:
1) acquiring a text and performing text preprocessing;
2) mining a frequent closed item set of a document and generating a strong association rule set;
3) taking the non-repeating rule heads and rule bodies of the strong association rule set as nodes, placing an edge between two nodes if and only if a strong association rule exists between them, and taking the association strength of the rule as the edge weight to construct the association graph of the document;
4) randomly walking on the association graph with the GSNA algorithm, iteratively computing the importance score of each node, and sorting the results in descending order;
5) clustering the top-ranked nodes, and taking the class centre of each class as the keyword extraction result of the document.
The text preprocessing in step 1) comprises the following steps (a minimal code sketch follows the list):
a. acquiring a text, wherein the text consists of a plurality of sentences;
b. removing all tables of contents, titles, figures and tables from the text, leaving only body text;
c. segmenting words of the text, and segmenting words of the English text based on simple spaces; for Chinese text, performing word segmentation by using a common word segmentation algorithm;
d. filtering stop words of the text;
e. duplicate words in each sentence are removed.
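As an illustration of steps a-e, a minimal Python sketch follows; the stop-word set, the sentence delimiters and the demo text are illustrative assumptions rather than part of the invention:

```python
import re

STOP_WORDS = {"the", "is", "at", "which", "on", "of", "and", "to", "a", "in"}

def preprocess(text):
    """Split into sentences, tokenize, drop stop words, dedupe per sentence."""
    sentences = re.split(r"[.!?;,\n]+", text)   # step a: sentence split
    transactions = []
    for s in sentences:
        words = s.lower().split()               # step c: space tokenization (English)
        words = [w for w in words if w and w not in STOP_WORDS]  # step d
        if words:
            transactions.append(sorted(set(words)))  # step e: dedupe in sentence
    return transactions

if __name__ == "__main__":
    demo = "Natural selection preserves favourable variations. Variations accumulate in organisms."
    for t in preprocess(demo):
        print(t)
```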
The method for generating the strong association rule set in 2) comprises the following steps:
a. establishing the transaction set and item sets; a transaction is a single sentence of the text T; the transaction set is the whole preprocessed text T; an item set is any subset of the words of T, the full word set being denoted I = {w_1, w_2, ..., w_m}, where w_i is the i-th word;
b. solving the closed item sets; an item set X is closed if and only if S(X) = f(g(X)) = X, where g(X) is the set of transactions containing X, f returns the items common to a set of transactions, and S = f∘g is the closure operation;
c. solving the length of the item set, wherein the number of items in one item set is called the length of the item set, and the item set with the length of k is called a k-item set;
d. solving the support of an item set; for an association rule A → B, where A, B ⊆ I and A ∩ B = ∅, the support sup(A → B) of A → B is the ratio of the number of transactions in T containing A ∪ B to the total number of transactions in T;
e. calculating the association strength; for an association rule A → B, where A, B ⊆ I and A ∩ B = ∅, the association strength of A → B is

lift(A → B) = sup(A ∪ B) / (sup(A) · sup(B))    (1)

Since lift(A → B) = lift(B → A), the lift measure is symmetric. Analysed in terms of probability distributions: when lift(A → B) = 1, A and B follow independent distributions, indicating no association between A and B; when lift(A → B) > 1, the posterior probability of B after A is introduced exceeds the prior probability of B, indicating a positive association; when lift(A → B) < 1, the posterior probability of B after A is introduced is below the prior probability of B, indicating a negative association;
f. solving strong association rules; for an association rule A → B, where A, B ⊆ I and A ∩ B = ∅, A → B is a strong association rule if it satisfies sup(A → B) ≥ minsup and either lift(A → B) ≤ max⁻_lift or lift(A → B) ≥ min⁺_lift; minsup is the minimum support threshold of association rules, max⁻_lift is the maximum negative-association threshold, and min⁺_lift is the minimum positive-association threshold;
g. solving the frequent closed item sets; an item set X is a frequent closed item set if and only if sup(X) ≥ minsup and S(X) = X; the invention mines the frequent closed item sets with the CLOSE algorithm to generate the strong association rule set;
the CLOSE algorithm is realized by the following steps:
step (1): preprocess the text; after deduplication the words of the text form the candidate 1-item sets; scan the transaction set to compute the closure and support of each 1-item set, record the resulting closure-and-support set as FCC1, and then remove from FCC1 the 1-item sets whose support is below minsup or whose closure duplicates another, obtaining FC1 (FC1 is the closure-and-support set obtained from the 1-item sets by the CLOSE algorithm);
step (2): in generating the i-item sets (starting from i = 2), combine pairwise the distinct item sets among the generators of FC(i−1) as the generators of FCCi, compute the closure and support of the i-item sets by scanning the transaction set to obtain FCCi, and remove the item sets whose support is below minsup or whose closure repeats, obtaining FCi;
step (3): increment i and repeat step (2) until the generator set of FCCi is empty, at which point the CLOSE algorithm terminates;
step (4): finally, take out all closures of FC1 through FCi to obtain all closed frequent item sets FC of the transaction set;
h. generating strong association rules; first, the frequent closure sets of transaction set T are L = {L_1, L_2, ..., L_k}, where L_i is the frequent closure set of the i-item sets; second, the closures of different lengths in L and their supports are extracted to form the frequent closed item set FC; then, from FC the frequent item sets of each length and their supports Fre = {f_1, f_2, ..., f_k} are obtained, where f_i is the set of i-frequent item sets with their supports; finally, for each f_i (2 ≤ i ≤ k) association rules AR of different forms are generated, each composed of a rule head that is an m-item set (1 ≤ m ≤ i−1) and a rule body that is the complementary (i−m)-item set, and lift(m-itemset → (i−m)-itemset) is computed from Fre; the rules satisfying lift ≤ max⁻_lift or lift ≥ min⁺_lift are extracted as strong association rules.
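The mining pipeline of steps a-h can be sketched as follows. This is a compact closure-based search in the spirit of the CLOSE algorithm rather than its full level-wise generator bookkeeping, and the thresholds are illustrative:

```python
from itertools import combinations

def g(X, T):
    """Transactions (frozensets) containing item set X."""
    return [t for t in T if X <= t]

def closure(X, T):
    """S(X) = f(g(X)): items common to every transaction supporting X."""
    txs = g(X, T)
    return frozenset.intersection(*txs) if txs else frozenset()

def closed_frequent(T, minsup):
    """Enumerate frequent closed item sets with their supports."""
    items = frozenset().union(*T)
    closed = {}
    frontier = [frozenset([i]) for i in items]
    seen = set(frontier)
    while frontier:
        X = frontier.pop()
        sup = len(g(X, T)) / len(T)
        if sup < minsup:
            continue
        C = closure(X, T)
        closed.setdefault(C, sup)
        for i in items - C:          # expand the closure by one item
            Y = C | {i}
            if Y not in seen:
                seen.add(Y)
                frontier.append(Y)
    return closed

def strong_rules(closed, T, minsup, min_pos, max_neg):
    """Split each closed set into head -> body and keep the strong rules."""
    sup = lambda X: len(g(X, T)) / len(T)
    rules = []
    for C in closed:
        for m in range(1, len(C)):
            for head in map(frozenset, combinations(C, m)):
                body = frozenset(C) - head
                lift = sup(head | body) / (sup(head) * sup(body))
                if sup(head | body) >= minsup and (lift >= min_pos or lift <= max_neg):
                    rules.append((head, body, lift))
    return rules
```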
The association graph in 3) is generated as follows: take out all rule heads and rule bodies of the strong association rule set and put them into one set; the non-repeating item sets of this set become the nodes of the association graph, each strong association between two item sets becomes an edge, and the lift value serves as the edge weight.
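Continuing the sketch, the association graph can be assembled from the (head, body, lift) triples produced above; networkx is an assumed convenience, and any weighted adjacency structure would do:

```python
import networkx as nx

def build_association_graph(rules):
    """Nodes: distinct rule heads/bodies; an edge exists iff a strong rule
    links two item sets; the edge weight is the rule's lift value."""
    G = nx.Graph()
    for head, body, lift in rules:
        G.add_edge(head, body, weight=lift)  # add_edge also creates the nodes
    return G
```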
The GSNA algorithm in 4) comprises the following steps:
a. quantifying the graph structure attributes of nodes; six indicators are selected to quantify the graph structure attributes of a node: atr1, atr2 and atr3 are the numbers of nodes at distances 1, 2 and 3 from the node respectively, atr4 is the ratio of atr1 to the average degree of the node's neighbours, atr5 is the ratio of atr2 to atr1, and atr6 is the ratio of atr3 to atr2; for a node N_i of the association graph, N_i is formalized as N_i = (atr1, atr2, atr3, atr4, atr5, atr6);
b. solving the semantic distance between two nodes; suppose the word set of the preprocessed document is I = {w_1, w_2, ..., w_n}, the co-occurrence probability distribution of node N_i with I is P = {p_1, p_2, ..., p_n}, and that of N_j with I is Q = {q_1, q_2, ..., q_n}; the semantic distance between N_i and N_j is then the Jensen–Shannon divergence

JSD(N_i, N_j) = ½ Σ_{k=1}^{n} p_k log(2p_k / (p_k + q_k)) + ½ Σ_{k=1}^{n} q_k log(2q_k / (p_k + q_k))    (2)
c. calculating the inter-node similarity combining graph structure attributes and semantic information; integrating the graph structure attributes and semantic information of nodes, the similarity measure of N_i and N_j is shown in formula (3): the larger the difference between N_i and N_j in graph structure attributes or semantic distribution, the lower the similarity s_ij; conversely, the closer their structure attributes or semantic distributions, the higher s_ij;

[formula (3): the similarity s_ij of N_i and N_j, combining their graph-structure attribute difference with JSD(N_i, N_j); given in the original formula image]
e. normalizing the similarity of each node in the association graph; assuming the association graph contains N nodes in total, nodes linked by a strong association have a similarity value and nodes without a strong association have similarity 0, so a similarity matrix S_{N×N} of the association graph can be formed; to measure more formally the degree of similarity of one node to all nodes, the similarity of each node in the association graph is normalized as

r_i = Σ_{j=1}^{N} s_ij / Σ_{k=1}^{N} Σ_{j=1}^{N} s_kj    (4)
f. modifying the PageRank formula;
r_i represents the degree of similarity of node N_i to all nodes of the graph on average, and r = (r_1, r_2, ..., r_N) is the fixed probability distribution over the nodes of the association graph determined by graph structure and semantic attributes; using r_i, the PageRank formula is modified so that node jumps no longer follow equal probability, but nodes more similar to the rest of the association graph are selected with higher probability:

PR(N_i) = (1 − d) · r_i + d · Σ_{N_k ∈ e} PR(N_k) / C(N_k)    (5)
since the association graph is undirected, C(N_k) in formula (5) denotes the degree of node N_k, and e denotes the set of neighbours of N_i;
g. strengthening the PR voting score of positive associations and reducing that of negative associations; in the association graph, an edge exists between two nodes if and only if a strong association rule holds between them; to strengthen positively associated votes and weaken negatively associated votes, formula (5) is improved using the lift values as edge weights:

PR(N_i) = (1 − d) · r_i + d · Σ_{N_k ∈ e} lift(N_k → N_i) · PR(N_k) / C(N_k)    (6)
h. randomly walking on the association graph according to formula (6) and iteratively computing the PR value of each node until formula (7) is satisfied, at which point the node scores have converged, where β is the random-walk termination threshold:

max_i | PR^{(t+1)}(N_i) − PR^{(t)}(N_i) | < β    (7)
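A minimal sketch of the GSNA walk of formulas (6) and (7), assuming a networkx graph G whose edge weights are the lift values (as built above) and a precomputed normalized similarity distribution r:

```python
def gsna_scores(G, r, d=0.85, beta=1e-6, max_iter=100):
    """Iterate formula (6) until every PR change is below beta (formula (7));
    returns the nodes sorted by descending importance score."""
    nodes = list(G.nodes)
    pr = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(max_iter):
        new = {}
        for i in nodes:
            # each neighbour k votes PR(k)/C(k), scaled by the edge's lift
            vote = sum(G[i][k]["weight"] * pr[k] / G.degree(k) for k in G[i])
            new[i] = (1 - d) * r[i] + d * vote
        converged = max(abs(new[n] - pr[n]) for n in nodes) < beta
        pr = new
        if converged:
            break
    return sorted(pr.items(), key=lambda kv: -kv[1])
```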
And in 5) the nodes are clustered as follows: perform K-Medoid clustering on the sorted nodes; first, select the top P nodes after descending sorting; choose K (K ≤ P) of them as the K class centres; then compute the semantic distance between each remaining node and each class centre and assign the node to the nearest class; finally, after all nodes are assigned, find the class centre of each class anew and reassign the non-centre nodes; repeat the processes of centre adjustment and node assignment until the difference between two consecutive node assignments reaches the clustering termination threshold, and extract the class centre of each cluster as the final keywords of the text. The basic technical flow is shown in FIG. 1.
The keyword extraction method combining graph structure and node association disclosed by the invention has the following beneficial effects. The invention uses the CLOSE algorithm to mine the strong association rules for constructing the association graph; the strong association rules obtained on the basis of frequent closed item sets are small in scale, fast to compute and lossless, which avoids the overly complex graph structure caused by extracting redundant frequent item sets and improves the overall execution efficiency of the algorithm. Node similarity in the association graph is computed by comprehensively considering the graph structure attributes of nodes and their semantic information, so that node jumps no longer follow an equal-probability distribution and jumps between similar nodes become more likely. During the random walk on the association graph, votes between nodes are scored as positive or negative associations: positively associated votes are strengthened and negatively associated votes are weakened, making node scores more reasonable. The method can find the keywords of a long text without an external corpus; it comprehensively considers the graph structure attributes of nodes, semantic information and the association characteristics between nodes, while avoiding the low execution efficiency that an overly complex association graph would impose on GSNA. First, the frequent closed item sets of the document are mined and the strong association rule set is generated; second, the non-repeating rule heads and rule bodies of the strong association rule set are taken as nodes, an edge is placed between two nodes if and only if a strong association rule exists between them, and the association strength of the rule is taken as the edge weight to construct the association graph of the document; then, the GSNA algorithm walks randomly on the association graph, iteratively computes the importance score of each node, and sorts the results in descending order; finally, to avoid semantic inclusion among the extracted keywords, the top-ranked nodes are clustered and the class centre of each class is taken as the keyword extraction result of the document.
Drawings
FIG. 1 is a basic technical flow diagram of the present invention;
FIG. 2 is a diagram illustrating the steps of generating a closed frequent itemset according to the present invention;
FIG. 3 is a diagram of strong association rule generation of the present invention;
FIG. 4 is a correlation diagram of a document according to the present invention.
Detailed Description
Example 1
As shown in fig. 1-4, the invention provides a keyword extraction method for combining graph structures and node associations, which comprises the following steps:
1) acquiring a text and performing text preprocessing;
a. acquiring a text: Darwin's "On the Origin of Species" is selected as the text, which consists of a number of sentences;
b. removing all tables of contents, titles, figures and tables from the text, leaving only body text;
c. segmenting the text into words: English text can be tokenized using spaces as natural delimiters; in Chinese, only characters, sentences and paragraphs have obvious delimiters, while words themselves have no formal delimiter, so suitable word segmentation software must be chosen for Chinese text; since the selected text is Chinese, it is segmented with the jieba segmenter; for example, a passage of "On the Origin of Species":
the role of natural selection lies entirely in the preservation and accumulation of the various variants which, for each organism, are favourable both under organic and inorganic conditions during their lifetime. The end result of this is that the relationship of various organisms to their external conditions is increasingly improving. Such improvements entail progressive development of the regime of most organisms worldwide. "
After word segmentation (which, in the original Chinese, separates adjacent words with delimiters) this becomes:
the role of natural selection lies entirely in the preservation and accumulation of the various variants which, for each organism, are favourable both under organic and inorganic conditions during their lifetime. The end result of this is that the relationship of various organisms to their external conditions is increasingly improving. Such improvements entail progressive development of the regime of most organisms worldwide. "
d. filtering the stop words of the text; stop words carry no practical meaning: for English text, "the", "is", "at", "which", "on" and the like can be treated as stop words, and for Chinese text words rendered as "a", "here", "where", "I", "though" and the like can be treated as stop words. After stop-word filtering, the text becomes a set of sentences St = {st_1, st_2, ..., st_n} separated by punctuation marks; the stop-word list can be user-defined, and the invention is not limited in this respect. The invention uses the Baidu stop-word list supplemented with some text-specific stop words. The passage given in c, after stop-word filtering, becomes:
"natural selection action is fully preserved and accumulated organic and inorganic conditions of various variant organisms in life period are favorable
The final result is that the biological external condition relation is improved day by day
Improvement incurs gradual progress in the world's biosystems ";
e. removing the repeated words in each sentence; after deduplication, the stop-word-filtered passage becomes:
"natural selection action is fully preserved and accumulated in the life period of various variant organisms under the organic and inorganic conditions
The final result is that the biological external condition relation is improved day by day
Improvement incurs gradual progress in the world's biosystems ";
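Steps c-e can be reproduced with a short sketch; jieba is the segmenter named in step c, while the sample sentence and the stop-word set are illustrative stand-ins for the Baidu list mentioned in step d:

```python
import jieba

STOP_WORDS = {"的", "了", "是", "在", "这", "我", "虽然", "和"}  # placeholder list

sentence = "自然选择的作用完全在于保存和积累各种变异"  # illustrative sentence
tokens = jieba.lcut(sentence)                                      # step c: segmentation
tokens = [w for w in tokens if w.strip() and w not in STOP_WORDS]  # step d: stop words
print(sorted(set(tokens)))                                         # step e: deduplicate
```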
2) mining a frequent closed item set of a document and generating a strong association rule set;
a. establishing the transaction set and item sets; a transaction is a single sentence of the text T; the transaction set is the whole preprocessed text T; an item set is any subset of the words of T, the full word set being denoted I = {w_1, w_2, ..., w_m}, where w_i is the i-th word. For "On the Origin of Species", the transaction set is the whole preprocessed text of the book and a transaction is one preprocessed sentence;
b. solving the closed item sets; an item set X is closed if and only if S(X) = f(g(X)) = X, where S = f∘g is the closure operation. Assume the transaction set contains five transactions with IDs tr1, tr2, tr3, tr4, tr5 and item sets {a, c, d}, {b, c, e}, {a, b, c, e}, {b, e} and {a, b, c, e} respectively. One closed item set is S({b, e}) = f(g({b, e})) = f({tr2, tr3, tr4, tr5}) = {b, e}.
c. The length of the item set is calculated, the item set { a, c, d } is called a 3-item set, and the item set { a, b, c, e } is called a 4-item set;
d. solving the support of an item set; for an item set X, the support sup(X) is the ratio of the number of transactions containing X to the total number of transactions, i.e. sup(X) = |g(X)| / |T|. For an association rule A → B, where A, B ⊆ I and A ∩ B = ∅, the support sup(A → B) is the ratio of the number of transactions in T containing A ∪ B to the total number of transactions in T. In the example of b, {b, e} is contained in tr2, tr3, tr4 and tr5, so sup({b, e}) = 4/5 (these numbers are verified in the sketch after step h);
e. calculating the association strength; for an association rule A → B, where A, B ⊆ I and A ∩ B = ∅, the association strength of A → B is

lift(A → B) = sup(A ∪ B) / (sup(A) · sup(B))    (1)

Since lift(A → B) = lift(B → A), the lift measure is symmetric. Analysed in terms of probability distributions: when lift(A → B) = 1, A and B follow independent distributions, indicating no association between A and B; when lift(A → B) > 1, the posterior probability of B after A is introduced exceeds the prior probability of B, indicating a positive association; when lift(A → B) < 1, the posterior probability of B after A is introduced is below the prior probability of B, indicating a negative association;
f. solving strong association rules; for an association rule A → B, where A, B ⊆ I and A ∩ B = ∅, A → B is a strong association rule if it satisfies sup(A → B) ≥ minsup and either lift(A → B) ≤ max⁻_lift or lift(A → B) ≥ min⁺_lift; minsup is the minimum support threshold of association rules, max⁻_lift is the maximum negative-association threshold, and min⁺_lift is the minimum positive-association threshold;
g. solving the frequent closed item sets; an item set X is a frequent closed item set if and only if sup(X) ≥ minsup and S(X) = X, where minsup is the minimum support threshold over the transaction set.
The conventional association rule mining algorithm Apriori constructs association rules by mining frequent item sets of different lengths; when the transaction set is large, the repeated scanning of the transaction set during rule generation easily yields a large-scale redundant collection of frequent item sets, resulting in low efficiency and a complicated analysis process. In order to quickly extract effective strong association rules from the document transaction set, the invention applies the CLOSE algorithm to mine the frequent closed item sets and thereby generate the strong association rule set; the algorithm uses the closure operation combined with a breadth-first strategy to search the item set space, achieving fast pruning of infrequent and non-closed item sets in the search space.
Assuming minsup = 2/5, for the five-transaction example in b, the CLOSE algorithm first takes the 1-item sets as the generators of FCC1, scans the transaction set to compute the closure and support of each 1-item set to obtain FCC1, and then removes the 1-item sets whose support is below 2/5 or whose closure duplicates another, obtaining FC1. When generating the 2-item sets, only the distinct generators of FC1 are combined pairwise, yielding {a, b}, {a, c} and {b, c}; since {a, c} ⊆ S({a}) = {a, c}, the candidate {a, c} is pruned, leaving {a, b} and {b, c} as the generators of FCC2; scanning the transaction set gives the closures and supports of the 2-item sets, i.e. FCC2, and removing the item sets whose support is below 2/5 or whose closure repeats gives FC2. When generating FCC3 from FC2, the only 3-item set produced by the generators of FC2 is {a, b, c}, but {a, b, c} ⊆ S({a, b}) = {a, b, c, e}, so the pruned 3-item-set candidate list is empty and the CLOSE algorithm terminates. Finally, all closures of FC1 and FC2 are taken out to obtain all closed frequent item sets FC of the transaction set. FIG. 2 illustrates the closed frequent item set generation process of this transaction set in detail;
h. generating strong association rules; first, the frequent closure sets of transaction set T are L = {L_1, L_2, ..., L_k}, where L_i is the frequent closure set of the i-item sets; second, the closures of different lengths in L and their supports are extracted to form the frequent closed item set FC; then, from FC the frequent item sets of each length and their supports Fre = {f_1, f_2, ..., f_k} are obtained, where f_i is the set of i-frequent item sets with their supports; finally, for each f_i (2 ≤ i ≤ k) association rules of different forms are generated, each composed of a rule head that is an m-item set (1 ≤ m ≤ i−1) and a rule body that is the complementary (i−m)-item set, and lift(m-itemset → (i−m)-itemset) is computed from Fre; the rules satisfying lift ≤ max⁻_lift or lift ≥ min⁺_lift are extracted as strong association rules. According to the five-transaction example in b, with minsup = 2/5, the 3-item set {b, c, e} can form rules whose head is a 1-item set and body a 2-item set, or whose head is a 2-item set and body a 1-item set; the lift measure is symmetric, as can also be seen from the figure; the 2-item set {a, c} has only 1-item-set heads and bodies, and lift(a → c) = lift(c → a); the strong association rule generation process is shown in FIG. 3;
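As a check on the worked example of steps b and d, a short sketch with the five transactions:

```python
T = [frozenset(t) for t in ({"a","c","d"}, {"b","c","e"}, {"a","b","c","e"},
                            {"b","e"}, {"a","b","c","e"})]   # tr1..tr5

def g(X):
    """Transactions containing item set X."""
    return [t for t in T if X <= t]

def closure(X):
    """S(X) = f(g(X)): items common to every supporting transaction."""
    txs = g(X)
    return frozenset.intersection(*txs) if txs else frozenset()

X = frozenset({"b", "e"})
print(sorted(closure(X)))   # ['b', 'e'] -> {b, e} is closed
print(len(g(X)) / len(T))   # 0.8 = 4/5 = sup{b, e}
```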
3) taking the non-repeating rule heads and rule bodies of the strong association rule set as nodes, placing an edge between two nodes if and only if a strong association rule exists between them, and taking the association strength of the rule as the edge weight to construct the association graph of the document: take out all rule heads and rule bodies of the strong association rule set, put them into one set, take the non-repeating item sets of this set as the nodes of the association graph, take the strong associations between the points as the edges, and take the lift values as the edge weights. Continuing the example, assume minsup = 0.6, min⁺_lift = 1.2 and max⁻_lift = 0.95 for the preprocessed document transaction set; the CLOSE algorithm yields the frequent closed item sets FC = {{a, c}, {b, e}, {c}, {b, c, e}}, from which the strong association rule set AR is derived (the full rule list appears in the original formula image). Extracting the non-repeating rule heads and rule bodies of AR as the association graph nodes V = {{b, c}, {b, e}, {c, e}, {a}, {b}, {c}, {e}} and using the association strengths of the strong rules in AR as edge weights constructs the undirected weighted association graph shown in FIG. 4;
4) randomly walking on the association graph with the GSNA algorithm, iteratively computing the importance score of each node, and sorting the results in descending order;
text repetition by different nodes in association graphsThe importance is different, the invention considers the existing method for measuring the node importance in the graph, and provides a GSNA algorithm comprehensively considering the similarity of the nodes and the correlation strength between the nodes, and the algorithm is essentially an improvement on the PageRank algorithm. The PageRank algorithm considers that an edge between the nodes A and B can be regarded as voting of the node A to the node B, the higher the sum of the edge weights of the nodes is, the higher the importance of the node is, and the PageRank simulates an internet owner, can randomly walk between the points on the graph and can jump. After a limited number of excursions, each point reaches a steady state and the expected value of each point tends to a constant value. And generating node sequencing once after each walk iteration, wherein the sequencing value is the PageRank score. Suppose the graph contains N nodes in total, NiFor the i-th node in the dependency graph, e denotes NiNumber of edge entries of (2), PR (N)k) Represents NkPageRank score of (C) (N)k) Represents NkD is used to measure the attenuation coefficient of a node migrating to another node, the PageRank score of Ni is defined as follows:
Figure GDA0003427662760000132
In formula (2) the jump between nodes is an equiprobable event with probability 1/N, whereas in theory jumps between similar nodes should be more likely. In addition, nodes may be not only positively but also negatively associated with each other: a positive association means that one node tends to appear together with another, while a negative association means that after one node appears the probability of the other appearing decreases. For negative associations between nodes the PR voting score should be decreased, while for positive associations it should be increased.
The GSNA algorithm improves PageRank in two respects: the graph-structure similarity between nodes and the association between nodes. When measuring node similarity, GSNA considers both the structural attributes of nodes in the association graph and their implicit semantic information; when measuring node association, positive association transmission is strengthened and negative association transmission is weakened, and GSNA scales the PR scores accordingly, making the node ranking more accurate.
For the GSNA algorithm, the steps are as follows:
Step (1): quantifying the graph structure attributes of nodes. Six indicators are selected to quantify the graph structure attributes of a node. atr1, atr2 and atr3 are the numbers of nodes at distances 1, 2 and 3 from the node respectively, intuitively expressing the node's capability to associate with other nodes; atr4 is the ratio of atr1 to the average degree of the node's neighbours, expressing how well the node matches its surrounding neighbours; atr5 is the ratio of atr2 to atr1 and atr6 the ratio of atr3 to atr2, quantifying the node's propagation capability in the association graph. For a node N_i of the association graph, N_i may be formalized as N_i = (atr1, atr2, atr3, atr4, atr5, atr6).
Step (2): solving the semantic distance between two nodes. Suppose the word set of the preprocessed document is I = {w_1, w_2, ..., w_n}; the co-occurrence probability distribution of node N_i with I is P = {p_1, p_2, ..., p_n}, where p_k is the number of sentences in which N_i appears together with the k-th word of I divided by the total number of sentences of the text, so traversing I yields P; likewise the co-occurrence probability distribution of N_j with I is Q = {q_1, q_2, ..., q_n}. The semantic distance between N_i and N_j is then the Jensen–Shannon divergence shown in equation (3); the larger JSD(N_i, N_j), the larger the semantic distance between the two nodes.

JSD(N_i, N_j) = ½ Σ_{k=1}^{n} p_k log(2p_k / (p_k + q_k)) + ½ Σ_{k=1}^{n} q_k log(2q_k / (p_k + q_k))    (3)
For the association graph of the document produced in FIG. 4, the word set is I = {a, b, c, d, e}; the co-occurrence probability distribution of node {b, e} with I is P = {2/5, 4/5, 3/5, 0, 4/5}, and that of node {a} with I is Q = {3/5, 2/5, 3/5, 1/5, 2/5}.
Step (3): calculating the inter-node similarity combining graph structure attributes and semantic information. Integrating the graph structure attributes N_i = (atr1, atr2, atr3, atr4, atr5, atr6) and the semantic information of nodes, the similarity measure of N_i and N_j is shown in equation (4): the larger the difference between N_i and N_j in graph structure attributes or semantic distribution (the larger JSD(N_i, N_j)), the lower the similarity s_ij; conversely, the closer their structure attributes or semantic distributions, the higher s_ij.

[equation (4): the similarity s_ij of N_i and N_j, combining their graph-structure attribute difference with JSD(N_i, N_j); given in the original formula image]
Step (4): normalizing the similarity of each node in the association graph. Assuming the association graph contains N nodes in total, nodes linked by a strong association have a similarity value and nodes without a strong association have similarity 0, so a similarity matrix S_{N×N} of the association graph can be formed; to measure more formally the degree of similarity of one node to all nodes, the similarity of each node in the association graph is normalized as in equation (5):

r_i = Σ_{j=1}^{N} s_ij / Σ_{k=1}^{N} Σ_{j=1}^{N} s_kj    (5)
and (5) correcting the PageRank formula.
riRepresenting a node NiDegree of similarity to the average of all nodes in the graph, r ═ r1,r2,...,rN) I.e. a fixed probability distribution that all nodes in the dependency graph are related to graph structure and semantic attributes. Using riModifying the PageRank formula to ensure that the nodes do not obey equal probability jump any more, but the nodes with higher similarity degree with the association graph are selected with higher probability, wherein the formula is as follows:
Figure GDA0003427662760000162
Since the association graph is undirected, C(N_k) in formula (6) denotes the degree of node N_k, and e denotes the set of neighbours of N_i.
Step (6): strengthening the PR voting score of positive associations and reducing that of negative associations. In the association graph, an edge exists between two nodes if and only if a strong association rule holds between them. To strengthen positively associated votes and weaken negatively associated votes, formula (6) is improved using the lift values as edge weights:

PR(N_i) = (1 − d) · r_i + d · Σ_{N_k ∈ e} lift(N_k → N_i) · PR(N_k) / C(N_k)    (7)
Step (7): randomly walk on the association graph according to formula (7) and iteratively compute the PR value of each node until formula (8) is satisfied, at which point the node scores have converged, where β is the random-walk termination threshold:

max_i | PR^{(t+1)}(N_i) − PR^{(t)}(N_i) | < β    (8)
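A sketch of the building blocks of steps (1) and (2) — the six structure attributes and the Jensen–Shannon distance — using the worked distributions P and Q above; networkx and the natural-log base are assumptions:

```python
import math
import networkx as nx

def structure_attributes(G, node):
    """atr1..atr6 of step (1), from BFS shortest-path lengths up to 3."""
    dist = nx.single_source_shortest_path_length(G, node, cutoff=3)
    atr1 = sum(1 for v in dist.values() if v == 1)
    atr2 = sum(1 for v in dist.values() if v == 2)
    atr3 = sum(1 for v in dist.values() if v == 3)
    nbr_deg = [G.degree(n) for n in G[node]]
    atr4 = atr1 / (sum(nbr_deg) / len(nbr_deg)) if nbr_deg else 0.0
    atr5 = atr2 / atr1 if atr1 else 0.0
    atr6 = atr3 / atr2 if atr2 else 0.0
    return (atr1, atr2, atr3, atr4, atr5, atr6)

def jsd(P, Q):
    """Jensen-Shannon divergence of two co-occurrence distributions."""
    total = 0.0
    for p, q in zip(P, Q):
        m = (p + q) / 2
        if p:
            total += 0.5 * p * math.log(p / m)
        if q:
            total += 0.5 * q * math.log(q / m)
    return total

# Worked distributions of step (2): node {b, e} vs node {a}
P = [2/5, 4/5, 3/5, 0, 4/5]
Q = [3/5, 2/5, 3/5, 1/5, 2/5]
print(jsd(P, Q))
```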
5) Clustering the top-ranked nodes, and taking the class centre of each class as the keyword extraction result of the document;
the nodes of the association graph are formed according to the frequent closed item set taking subsets with different lengths, the subsets of the k-item set contain 2k-1 non-empty subsets in total, and the 1-item set is necessarily the subset of the (k-1) -item set, and the phenomenon that the (k-1) -item set has semantic inclusion relation with the 1-item set can be considered. In order to prevent semantic inclusion relation between important nodes obtained after the nodes are sorted in a descending order, the invention carries out K-Mediod clustering on the sorted nodes.
Because non-key terms could otherwise form clusters of their own, the top P nodes after descending sorting are selected first; K (K ≤ P) of them are chosen as the K class centres; then the semantic distance between each remaining node and each class centre is computed and the node is assigned to the nearest class; finally, after all nodes are assigned, the class centre of each class is found anew and the non-centre nodes are reassigned. The processes of centre adjustment and node assignment are repeated until the difference between two consecutive node assignments reaches the clustering termination threshold, and the class centre of each cluster is extracted as the final keywords of the text. The algorithm is concretely realized by the following steps (a minimal sketch follows the steps):
step (1): select the top P nodes after descending sorting;
step (2): select K (K ≤ P) of the P nodes as the K class centres;
step (3): compute the semantic distance between each remaining node and each class centre, and assign the node to the class with the closest distance;
step (4): for the nodes of each class other than the class centre, find the point minimizing the sum of semantic distances to all other points of the class and take it as the new class centre;
step (5): repeat step (4) until the difference between two consecutive node assignments reaches the clustering termination threshold, and extract the class centre of each cluster as the final keywords of the text.
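A minimal K-Medoid sketch of steps (1)-(5); dist is the pairwise semantic distance (e.g. the JSD above), and seeding with the K highest-ranked nodes is an assumption, since the initialization is not fixed here:

```python
def k_medoid(nodes, dist, K, max_iter=100):
    """nodes: top-P nodes in descending GSNA order; returns final medoids."""
    medoids = nodes[:K]                                   # steps (1)-(2)
    for _ in range(max_iter):
        clusters = {m: [m] for m in medoids}
        for n in nodes:
            if n in medoids:
                continue
            nearest = min(medoids, key=lambda m: dist(n, m))  # step (3)
            clusters[nearest].append(n)
        new_medoids = [min(c, key=lambda x: sum(dist(x, y) for y in c))
                       for c in clusters.values()]        # step (4)
        if set(new_medoids) == set(medoids):              # step (5): stable
            break
        medoids = new_medoids
    return medoids                                        # cluster centres = keywords
```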
The invention denotes the extracted keyword sequence as N_ret and the labelled keyword sequence as N_rel; the evaluation indicators involved are MAP (Mean Average Precision), Recall and F_β. Since the invention expects the retrieved keywords to be accurate and ranked as early as possible, F_β is slightly modified by introducing MAP into its calculation. The formulas are defined as follows:

Recall = |N_ret ∩ N_rel| / |N_rel|

F_β = (1 + β²) · MAP · Recall / (β² · MAP + Recall)

[the MAP formula, computed from index(t_i, N_rel) and g(t_i, N_rel), is given in the original formula image]

wherein the function index(t_i, N_rel) returns the index position of term t_i in the sequence N_rel; g(t_i, N_rel) is an indicator function that returns 1 if term t_i appears in the labelled keyword sequence N_rel and 0 otherwise; compared with recall, the invention focuses more on precision, so β in the F_β indicator is set to 0.5.
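Under the stated convention (β = 0.5, MAP substituted for precision in F_β), the metrics can be sketched as follows; the MAP variant here is standard average precision, an assumption since the exact formula image is not reproduced:

```python
def average_precision(ret, rel):
    """Standard AP over the ranked extraction ret against the label set rel."""
    hits, score = 0, 0.0
    for i, t in enumerate(ret, start=1):
        if t in rel:
            hits += 1
            score += hits / i
    return score / len(rel) if rel else 0.0

def f_beta(map_score, recall, beta=0.5):
    """F_beta with MAP in place of precision, as described in the text."""
    if map_score == 0 and recall == 0:
        return 0.0
    return (1 + beta**2) * map_score * recall / (beta**2 * map_score + recall)

ret = ["natural", "selection", "variation", "organism"]  # illustrative output
rel = {"natural", "selection", "organism"}               # illustrative labels
m = average_precision(ret, rel)
r = len(set(ret) & rel) / len(rel)
print(m, r, f_beta(m, r))
```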
Design of experiments
In order to verify the feasibility of the algorithm, the experimental design comprises four parts: first, experimental data and suitable evaluation indicators are selected; second, the input parameter settings corresponding to the optimal association graph model are sought; then the number of keywords to extract under the optimal association graph is discussed; finally, four keyword extraction algorithms are selected for comparison with the algorithm of the invention, further demonstrating its superiority.
1. Influence of input parameters
Because the model has many input parameters, different parameter settings affect the algorithm result to different degrees. For example, if minsup in the CLOSE algorithm is set too high, a large number of meaningful words are discarded, while if it is set too low, a large number of redundant words are included; changes of min⁺_lift and max⁻_lift directly influence the shape of the association graph; and adjusting the damping coefficient d of the GSNA random walk also influences the keyword extraction effect. The invention sets the variation range of minsup to [0.5%, 10%]; min⁺_lift increases from a base of 0.1 in steps of 0.1; max⁻_lift has an initial value of 5 and increases in steps of 5; d is taken empirically, with its variation range set to [0.15, 0.85]. Suppose the sorted keyword sequence is N_v; to avoid the complexity of analysing the extraction effect under too many parameter variations, the number of extracted keywords is temporarily disregarded, and when tuning the model parameters the first |N_rel| terms of each obtained keyword sequence N_v are extracted as the keyword sequence N_ret and used to compute the MAP.
2. Selection of the number of extracted keywords
Keyword extraction focuses on selecting the words most relevant to the text, so the number of keywords selected matters. The invention introduces an extraction ratio r = |N_ret| / |N_v|; with the parameters found in 1, N_ret = r × N_v determines how many keywords are extracted from the ranked sequence N_v, and the optimal r is computed on the basis of the parameters obtained in 1.
3. Algorithm comparison and analysis
In order to verify the superiority and comparability of the algorithm, the keyword results of TextRank, GraphSum, a keyword extraction method considering only node association and not graph structure (Key Words Extraction Method by Node Association, NA), and a keyword extraction method considering only graph structure and not node association (Key Words Extraction Method by Graph Structure, GS) are compared with those of the proposed algorithm.
The four comparison methods are chosen for the following reasons: (1) TextRank is the traditional algorithm for automatic keyword generation and considers neither graph structure nor node association information; (2) comparing the NA algorithm, which considers only node association information, and the GS algorithm, which considers only graph structure information, with the GSNA algorithm, which considers both, shows the importance of graph structure and node association for keyword extraction; (3) GSNA improves on the GraphSum method, which is the most similar to the method of the invention; since GraphSum is designed to generate text summaries, the sentence-covering step of the original GraphSum algorithm is removed in the experiments so that it yields text keywords.

Claims (3)

1. A keyword extraction method for combining graph structure and node association is characterized in that:
1) acquiring a text and performing text preprocessing; the text preprocessing method comprises the following steps:
a. acquiring a text, wherein the text consists of a plurality of sentences;
b. removing all tables of contents, titles, figures and tables from the text, leaving only body text;
c. segmenting words of the text, and segmenting words of the English text based on simple spaces; for Chinese text, performing word segmentation by using a common word segmentation algorithm;
d. filtering stop words of the text;
e. removing repeated words in each sentence;
2) mining a frequent closed item set of a document and generating a strong association rule set; the method for generating the strong association rule set comprises the following steps:
a. establishing the transaction set and item sets; a transaction is a single sentence of the text T; the transaction set is the whole preprocessed text T; an item set is any subset of the words of T, the full word set being denoted I = {w_1, w_2, ..., w_m}, where w_i is the i-th word;
b. solving the closed item sets; an item set X is closed if and only if S(X) = f(g(X)) = X, where g(X) is the set of transactions containing X, f returns the items common to a set of transactions, and S = f∘g is the closure operation;
c. solving the length of the item set, wherein the number of items in one item set is called the length of the item set, and the item set with the length of k is called a k-item set;
d. solving the support of an item set; for an association rule A → B, where A, B ⊆ I and A ∩ B = ∅, the support sup(A → B) of A → B is the ratio of the number of transactions in T containing A ∪ B to the total number of transactions in T;
e. calculating the association strength; for an association rule A → B, where A, B ⊆ I and A ∩ B = ∅, the association strength of A → B is

lift(A → B) = sup(A ∪ B) / (sup(A) · sup(B))    (1)
f. solving strong association rules; for an association rule A → B, where A, B ⊆ I and A ∩ B = ∅, A → B is a strong association rule if it satisfies sup(A → B) ≥ minsup and either lift(A → B) ≤ max⁻_lift or lift(A → B) ≥ min⁺_lift; minsup is the minimum support threshold of association rules, max⁻_lift is the maximum negative-association threshold, and min⁺_lift is the minimum positive-association threshold;
g. solving the frequent closed item sets; an item set X is a frequent closed item set if and only if sup(X) ≥ minsup and S(X) = X;
h. generating strong association rules; first, the frequent closure sets of transaction set T are L = {L_1, L_2, ..., L_k}, where L_i is the frequent closure set of the i-item sets; second, the closures of different lengths in L and their supports are extracted to form the frequent closed item set FC; then, from FC the frequent item sets of each length and their supports Fre = {f_1, f_2, ..., f_k} are obtained, where f_i is the set of i-frequent item sets with their supports; finally, for each f_i (2 ≤ i ≤ k) association rules AR of different forms are generated, each composed of a rule head that is an m-item set (1 ≤ m ≤ i−1) and a rule body that is the complementary (i−m)-item set, and lift(m-itemset → (i−m)-itemset) is computed from Fre; the rules satisfying lift ≤ max⁻_lift or lift ≥ min⁺_lift are extracted as strong association rules;
3) taking the non-repeating rule heads and rule bodies of the strong association rule set as nodes, placing an edge between two nodes if and only if a strong association rule exists between them, and taking the association strength of the rule as the edge weight to construct the association graph of the document;
4) randomly walking on the association graph with the GSNA algorithm, iteratively computing the importance score of each node, and sorting the results in descending order;
5) clustering the top-ranked nodes, and taking the class centre of each class as the keyword extraction result of the document.
2. The keyword extraction method combining graph structure and node association according to claim 1, characterized in that the GSNA algorithm in 4) comprises the following steps:
a. quantifying the graph structure attributes of nodes; six indicators are selected to quantify the graph structure attributes of a node: atr1, atr2 and atr3 are the numbers of nodes at distances 1, 2 and 3 from the node respectively, atr4 is the ratio of atr1 to the average degree of the node's neighbours, atr5 is the ratio of atr2 to atr1, and atr6 is the ratio of atr3 to atr2; for a node N_i of the association graph, N_i is formalized as N_i = (atr1, atr2, atr3, atr4, atr5, atr6);
b. solving the semantic distance between two nodes; suppose the word set of the preprocessed document is I = {w_1, w_2, ..., w_n}, the co-occurrence probability distribution of node N_i with I is P = {p_1, p_2, ..., p_n}, and that of N_j with I is Q = {q_1, q_2, ..., q_n}; the semantic distance between N_i and N_j is then the Jensen–Shannon divergence

JSD(N_i, N_j) = ½ Σ_{k=1}^{n} p_k log(2p_k / (p_k + q_k)) + ½ Σ_{k=1}^{n} q_k log(2q_k / (p_k + q_k))    (2)
c. calculating the inter-node similarity combining graph structure attributes and semantic information; integrating the graph structure attributes and semantic information of nodes, the similarity measure of N_i and N_j is shown in formula (3): the larger the difference between N_i and N_j in graph structure attributes or semantic distribution, the lower the similarity s_ij; conversely, the closer their structure attributes or semantic distributions, the higher s_ij;

[formula (3): the similarity s_ij of N_i and N_j, combining their graph-structure attribute difference with JSD(N_i, N_j); given in the original formula image]
e. normalizing the similarity of each node in the association graph; assuming the association graph contains N nodes in total, nodes linked by a strong association have a similarity value and nodes without a strong association have similarity 0, so a similarity matrix S_{N×N} of the association graph can be formed; to measure more formally the degree of similarity of one node to all nodes, the similarity of each node in the association graph is normalized as

r_i = Σ_{j=1}^{N} s_ij / Σ_{k=1}^{N} Σ_{j=1}^{N} s_kj    (4)
f. modifying the PageRank formula;
r_i represents the degree of similarity of node N_i to all nodes of the graph on average, and r = (r_1, r_2, ..., r_N) is the fixed probability distribution over the nodes of the association graph determined by graph structure and semantic attributes; using r_i, the PageRank formula is modified so that node jumps no longer follow equal probability, but nodes more similar to the rest of the association graph are selected with higher probability:

PR(N_i) = (1 − d) · r_i + d · Σ_{N_k ∈ e} PR(N_k) / C(N_k)    (5)
since the association graph is undirected, C(N_k) in formula (5) denotes the degree of node N_k, and e denotes the set of neighbours of N_i;
g. strengthening the PR voting score of positive associations and reducing that of negative associations; in the association graph, an edge exists between two nodes if and only if a strong association rule holds between them; to strengthen positively associated votes and weaken negatively associated votes, formula (5) is improved using the lift values as edge weights:

PR(N_i) = (1 − d) · r_i + d · Σ_{N_k ∈ e} lift(N_k → N_i) · PR(N_k) / C(N_k)    (6)
h. randomly walking on the association graph according to formula (6) and iteratively computing the PR value of each node until formula (7) is satisfied, at which point the node scores have converged, where β is the random-walk termination threshold:

max_i | PR^{(t+1)}(N_i) − PR^{(t)}(N_i) | < β    (7)
3. The keyword extraction method combining graph structure and node association according to claim 1, characterized in that the nodes in 5) are clustered as follows: perform K-Medoid clustering on the sorted nodes; first, select the top P nodes after descending sorting; choose K (K ≤ P) of them as the K class centres; then compute the semantic distance between each remaining node and each class centre and assign the node to the nearest class; finally, after all nodes are assigned, find the class centre of each class anew and reassign the non-centre nodes; repeat the processes of centre adjustment and node assignment until the difference between two consecutive node assignments reaches the clustering termination threshold, and extract the class centre of each cluster as the final keywords of the text.
CN201711285868.0A 2017-12-07 2017-12-07 Keyword extraction method for combining graph structure and node association Expired - Fee Related CN108132927B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711285868.0A CN108132927B (en) 2017-12-07 2017-12-07 Keyword extraction method for combining graph structure and node association

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711285868.0A CN108132927B (en) 2017-12-07 2017-12-07 Keyword extraction method for combining graph structure and node association

Publications (2)

Publication Number Publication Date
CN108132927A (en) 2018-06-08
CN108132927B (en) 2022-02-11

Family

ID=62389227

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711285868.0A Expired - Fee Related CN108132927B (en) 2017-12-07 2017-12-07 Keyword extraction method for combining graph structure and node association

Country Status (1)

Country Link
CN (1) CN108132927B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108920660B (en) * 2018-07-04 2020-11-20 中国银行股份有限公司 Keyword weight obtaining method and device, electronic equipment and readable storage medium
CN109614466A (en) * 2018-11-15 2019-04-12 阿里巴巴集团控股有限公司 The keyword rule generating method and its device excavated based on Frequent Set
CN109726402B (en) * 2019-01-11 2022-12-23 中国电子科技集团公司第七研究所 Automatic extraction method for document subject term
CN109783628B (en) * 2019-01-16 2022-06-21 福州大学 Method for searching KSAARM by combining time window and association rule mining
CN111062356B (en) * 2019-12-26 2024-03-26 沈阳理工大学 Method for automatically identifying abnormal human body actions from monitoring video
CN111310436B (en) * 2020-02-11 2022-02-15 腾讯科技(深圳)有限公司 Text processing method and device based on artificial intelligence and electronic equipment
CN111461109B (en) * 2020-02-27 2023-09-15 浙江工业大学 Method for identifying documents based on environment multi-class word stock
CN111401928B (en) * 2020-04-01 2022-04-12 支付宝(杭州)信息技术有限公司 Method and device for determining semantic similarity of text based on graph data
CN112016836B (en) * 2020-08-31 2023-11-03 中国银联股份有限公司 Method and device for determining similarity between objects
CN112561081B (en) * 2020-12-18 2022-05-03 北京百度网讯科技有限公司 Conversion method and device of deep learning model, electronic equipment and storage medium
CN113032551B (en) * 2021-05-24 2021-09-10 北京泽桥传媒科技股份有限公司 Delivery progress calculation method and system based on combination of big data and article title
CN116433837B (en) * 2023-03-31 2023-09-12 浙江大学 Three-dimensional CAD model difference analysis method based on key point matching

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101488861A (en) * 2008-12-19 2009-07-22 中山大学 Keyword extracting method for network unknown application
US7577652B1 (en) * 2008-08-20 2009-08-18 Yahoo! Inc. Measuring topical coherence of keyword sets
CN102043851A (en) * 2010-12-22 2011-05-04 四川大学 Multiple-document automatic abstracting method based on frequent itemset
CN104933032A (en) * 2015-06-29 2015-09-23 电子科技大学 Method for extracting keywords of blog based on complex network
CN105975488A (en) * 2016-04-25 2016-09-28 哈尔滨工程大学 Method for querying keyword based on topic cluster unit in relational database
CN106202042A (en) * 2016-07-06 2016-12-07 中央民族大学 A kind of keyword abstraction method based on figure

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609528B (en) * 2012-02-14 2014-06-18 云南大学 Frequent mode association sorting method based on probabilistic graphical model

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7577652B1 (en) * 2008-08-20 2009-08-18 Yahoo! Inc. Measuring topical coherence of keyword sets
CN101488861A (en) * 2008-12-19 2009-07-22 中山大学 Keyword extracting method for network unknown application
CN102043851A (en) * 2010-12-22 2011-05-04 四川大学 Multiple-document automatic abstracting method based on frequent itemset
CN104933032A (en) * 2015-06-29 2015-09-23 电子科技大学 Method for extracting keywords of blog based on complex network
CN105975488A (en) * 2016-04-25 2016-09-28 哈尔滨工程大学 Method for querying keyword based on topic cluster unit in relational database
CN106202042A (en) * 2016-07-06 2016-12-07 中央民族大学 A kind of keyword abstraction method based on figure

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
News keyword extraction algorithm based on PageRank (基于PageRank的新闻关键词提取算法); 顾亦然 et al.; Journal of University of Electronic Science and Technology of China (电子科技大学学报); 2017-09-30; vol. 46, no. 5; pp. 778-779 *
Chinese keyword extraction algorithm based on frequent pattern mining (基于频繁模式挖掘的中文关键词提取算法); 崔诚煜 et al.; Journal of Terahertz Science and Electronic Information Technology (太赫兹科学与电子信息学报); 2015-04-30; vol. 13, no. 2; pp. 280-282 *

Also Published As

Publication number Publication date
CN108132927A (en) 2018-06-08

Similar Documents

Publication Publication Date Title
CN108132927B (en) Keyword extraction method for combining graph structure and node association
CN111177365B (en) Unsupervised automatic abstract extraction method based on graph model
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
CN104765769B (en) The short text query expansion and search method of a kind of word-based vector
CN109408743B (en) Text link embedding method
CN112507699B (en) Remote supervision relation extraction method based on graph convolution network
CN110543564B (en) Domain label acquisition method based on topic model
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
KR20020049164A (en) The System and Method for Auto - Document - classification by Learning Category using Genetic algorithm and Term cluster
CN112989802A (en) Barrage keyword extraction method, device, equipment and medium
WO2013049529A1 (en) Method and apparatus for unsupervised learning of multi-resolution user profile from text analysis
CN112836029A (en) Graph-based document retrieval method, system and related components thereof
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN111274494B (en) Composite label recommendation method combining deep learning and collaborative filtering technology
CN110851570A (en) Unsupervised keyword extraction method based on Embedding technology
CN116756303A (en) Automatic generation method and system for multi-topic text abstract
CN114611491A (en) Intelligent government affair public opinion analysis research method based on text mining technology
Ghalehtaki et al. A combinational method of fuzzy, particle swarm optimization and cellular learning automata for text summarization
CN115329085A (en) Social robot classification method and system
CN115905487A (en) Document question and answer method, system, electronic equipment and storage medium
Girsang et al. Extractive text summarization for indonesian news article using ant system algorithm
CN114265935A (en) Science and technology project establishment management auxiliary decision-making method and system based on text mining
CN111563361B (en) Text label extraction method and device and storage medium
CN117057346A (en) Domain keyword extraction method based on weighted textRank and K-means
CN115544999A (en) Domain-oriented parallel large-scale text duplicate checking method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220211