CN114065758B - Document keyword extraction method based on hypergraph random walk - Google Patents

Document keyword extraction method based on hypergraph random walk

Info

Publication number
CN114065758B
CN114065758B (application CN202111388299.9A)
Authority
CN
China
Prior art keywords
node
hypergraph
text data
word
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111388299.9A
Other languages
Chinese (zh)
Other versions
CN114065758A (en)
Inventor
张建章
陈思思
詹秀秀
刘闯
张子柯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Normal University
Original Assignee
Hangzhou Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Normal University filed Critical Hangzhou Normal University
Priority to CN202111388299.9A priority Critical patent/CN114065758B/en
Publication of CN114065758A publication Critical patent/CN114065758A/en
Application granted granted Critical
Publication of CN114065758B publication Critical patent/CN114065758B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/268 Morphological analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a document keyword extraction method based on hypergraph random walk. The method cleans the text to be processed, performs word segmentation, part-of-speech tagging and stop-word removal, and obtains a candidate keyword set through three methods (named entity recognition, noun chunk extraction and longest-sequence matching). The document is then structurally modeled with the topology of a hypergraph, the nodes and hyperedges are weighted according to TF-IDF values and position information scores, an importance score is computed for each node by iterative random walk, the candidate keywords are ranked, and the highest-ranked candidates are selected as output. The method captures more complete semantic relations among words, makes better use of full-text information, improves the accuracy of keyword extraction, and is applicable to a variety of application scenarios.

Description

Document keyword extraction method based on hypergraph random walk
Technical Field
The invention belongs to the field of information technology, in particular to information extraction in natural language processing, and relates to a document keyword extraction method based on hypergraph random walk.
Background
With the advance of informatization and digitalization, text data is produced in ever-growing volumes and text mining has become an important, widely studied research field; efficiently grasping the key content of documents from massive text data is therefore particularly important. Automatic keyword extraction aims to automatically select important topic words or phrases from documents, and its results can be widely applied in downstream fields such as document retrieval, text summarization and recommender systems.
Keyword extraction algorithms can be divided into two major categories, supervised and unsupervised, according to whether an annotated corpus is required. Supervised algorithms focus on redefining the task and designing features: the keyword extraction task is converted into a binary classification or sequence labeling problem, various text features are designed, and a model is trained with supervised algorithms such as naive Bayes, support vector machines, conditional random fields or deep neural networks. Unsupervised algorithms usually rank candidate keywords with various scoring indexes and then select the top-ranked candidates as keywords; among unsupervised keyword extraction algorithms, methods based on network graphs are the most important. These methods require no complex natural language preprocessing of the text, generally only word segmentation and part-of-speech tagging, have low dependence on language analysis tools, and exhibit good cross-domain and cross-language applicability.
Designing an unsupervised keyword extraction algorithm on the basis of complex networks makes it possible to fully exploit contextual semantic information in massive texts and to design phrase-similarity measures and related indexes from semantic relations; the document is modeled as a complex network, semantic associations and corpus statistics are fused to construct an edge-weight calculation method, and the final keyword ranking result is optimized.
Unsupervised keyword extraction based on an ordinary graph model first constructs a word network graph of the document and then analyzes it with the network centrality methods of systems science. The key idea is that the importance of a word is represented by the significance of its node: the importance of a node is determined without damaging the integrity of the network, and the words or phrases found to play important roles in the graph are the keywords of the document. The method quantitatively analyzes the topological properties of the network structure from the perspectives of local and global network attributes. In the language network graph, the preprocessed words are taken as nodes, the relations between words are taken as edges, and the edge weights are generally expressed by the semantic relatedness between words. When keywords are obtained from the language network graph, the importance of each node is evaluated, the nodes are ranked by importance, and the words represented by the N most important nodes are selected as keywords.
Current word-graph-based keyword extraction methods build ordinary graphs. This construction considers only binary relations between words: the sliding-window node-edge construction ignores the complete relations among the words of a sentence as well as the topic-level semantic relations carried by the document, and therefore cannot integrate the contextual topic relations well.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a document keyword extraction method based on hypergraph random walk, which improves keyword extraction accuracy and mines keywords that correctly reflect the document topic, while avoiding large-scale manual corpus annotation, reducing the cost of information extraction, improving its efficiency, and achieving efficient and accurate extraction of document keywords.
The method of the invention is as follows:
(1) Text data collection and data preprocessing: text data are collected from major websites, and the collected text data are then cleaned, segmented into words, part-of-speech tagged, and stripped of stop words. The step specifically comprises:
(1-1) text data acquisition, namely acquiring the documents to be analyzed: text data from multiple sources are legally acquired through data mining techniques, with no restriction on the format or topic of the acquired documents;
(1-2) text data cleaning: noise is removed from the input data, and the text data are then converted into a unified format and saved; redundant spaces and special symbols in the text data are first removed, all numbers are then replaced by <digit>, English letters are unified to upper or lower case, the text data are re-encoded as UTF-8, and finally the text data are converted into a unified format and saved;
(1-3) word segmentation: continuous word sequences in the text data are recombined into word sequences according to specifications, and the input text data are segmented with the Stanford CoreNLP toolkit, which supports multiple languages;
(1-4) part-of-speech tagging: the grammatical category of each word in a given sentence is determined according to its meaning and context, and its part of speech is determined and tagged; Chinese and English texts are tagged with jieba and Stanford CoreNLP respectively, an ictclas-compatible tag set is adopted for Chinese part-of-speech tagging, and the Penn Treebank tag set is adopted for English part-of-speech tagging;
(1-5) stop-word removal: the stop words in the word segmentation result are deleted according to a stop-word list.
(2) Extracting candidate keywords to obtain a candidate keyword set: named entity recognition, noun phrase chunk extraction and longest-sequence matching are applied to the preprocessed text to obtain three sets, and the union of the three sets is taken as the candidate keyword set. The step specifically comprises:
(2-1) named entity recognition: all entities in the text data are extracted, and the named entities among them are identified to obtain a named entity set;
(2-2) noun phrase chunk extraction: the text data is first split into several smaller text segments to obtain chunk sets, and chunk analysis is then performed on each chunk set to obtain noun phrases, yielding the phrase chunk set;
(2-3) longest-sequence matching: words are selected with an N-gram sliding window; words are taken from the content of the text data with a sliding window of size N to form text fragment sequences (grams) of length N, and the extracted grams are filtered to form a keyword set.
(3) Constructing weighted hypergraph nodes, constructing weighted hyperedges, and calculating importance scores of candidate keywords using random walk. The step specifically comprises:
(3-1) constructing weighted hypergraph nodes: words whose part of speech is noun, adjective or verb are selected from the preprocessed text data as hypergraph nodes; each word serves as only one hypergraph node, and if a repeated word appears with different parts of speech, the part of speech with the higher frequency of occurrence is retained;
The document set is denoted D = {d_1, d_2, …, d_N}, where d_n represents the n-th input document in the document set, n = 1, 2, …, N, and N is the number of documents in D; d_n = {v_{n,1}, v_{n,2}, …, v_{n,M}}, where v_{n,m} is the m-th word in d_n, m = 1, 2, …, M, and M is the number of words in d_n;
d_n' is the bag of words corresponding to d_n, composed of the distinct words v_{n,k}: d_n' = {v_{n,1}, v_{n,2}, …, v_{n,K}}, where v_{n,k} is the k-th node in d_n', k = 1, 2, …, K, and K is the number of nodes in the bag of words d_n';
for each node v_{n,k} in d_n', the TF-IDF value TFIDF(v_{n,k}) and the position information score POS(v_{n,k}) are calculated to obtain the initial weight of the node, w(v_{n,k}) = α·TFIDF(v_{n,k}) + β·POS(v_{n,k});
(3-2) Constructing weighted hyperedges: the transition probability of each node in the lexical hypergraph is calculated; the transition probability from node v_i to node v_j is T_{i,j} = W_{i,j} / W_i, where W_{i,j} is the number of hyperedges to which node v_i and node v_j jointly belong minus 1, and W_i is the corresponding sum over all nodes reachable from v_i; the transition probability T_{i,j} serves as the hyperedge weight;
(3-3) The nodes and hyperedges form the lexical hypergraph HG_n(V_n, E_n); V_n denotes the node set of all distinct words in HG_n, V_n = {v_{n,1}, v_{n,2}, …, v_{n,K}}; E_n denotes the hyperedge set of HG_n, E_n = {e_{n,1}, e_{n,2}, …, e_{n,L}}, where e_{n,l} is the l-th hyperedge in E_n, l = 1, 2, …, L, and L is the number of hyperedges in E_n; if v_{n,k} ∈ e_{n,l}, the incidence-matrix element H_n(v_{n,k}, e_{n,l}) = 1, otherwise H_n(v_{n,k}, e_{n,l}) = 0; the importance score S(v_i) of node v_i is calculated using the random walk.
(4) Ranking the keywords in the candidate keyword set: the importance scores of all nodes composing a candidate keyword are summed and averaged to obtain the importance score of the candidate keyword; the candidates are sorted from high to low by score, and the top-R candidates are selected as the keywords finally predicted for document d_n.
To overcome the shortcoming of modeling a document with an ordinary graph during keyword extraction, the method models the document with a hypergraph, fully considers the semantic correlations between words and sentences, and maps the topological relations of the ordinary graph onto the hypergraph. The beneficial technical effects of the invention include: the text information to be processed is obtained; after preprocessing (cleaning, word segmentation and part-of-speech tagging), a candidate keyword set is obtained through three methods (named entity recognition, noun chunk extraction and longest-sequence matching); the document is then structurally modeled with the topology of a hypergraph, nodes and hyperedges are weighted according to TF-IDF values and position information scores, a score is computed for each node by random-walk iteration, the weight of each candidate keyword is obtained as the weighted sum of its node scores, the candidates are ranked, and the N keywords with the highest importance scores are selected as the final result. The accuracy of keyword extraction is thus improved, more complete semantic relations among words are captured, full-text information is better exploited, the keywords extracted from the document information to be processed are suitable for a variety of application scenarios, and the user experience is improved.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings and a detailed description of keyword extraction for a single document.
As shown in fig. 1, a document keyword extraction method based on hypergraph random walk specifically comprises the following steps:
(1) Text data collection and data preprocessing: text data are collected from major websites, and the collected text data are then cleaned, segmented into words, part-of-speech tagged, and stripped of stop words. The step specifically comprises:
(1-1) Text data acquisition, namely acquiring the documents to be analyzed: text data from multiple sources are legally obtained through data mining techniques, with no restriction on the format or topic of the acquired documents; the text data include full-text scientific publications, paper abstracts, news documents, microblogs, tweets and other texts. Academic literature websites: CNKI (China National Knowledge Infrastructure), Superstar Discovery, OALib, BASE Search, HighWire Press, etc.; journals: BioMed Central life-science journals, etc.; academic conference websites: ACL, EMNLP, etc.; news: China News Network, etc.; the sources are not limited to these.
(1-2) Text data cleaning: noise is removed from the input data, and the text data are then converted into a unified format and saved. First, redundant spaces and special symbols in the text data, such as "@", are removed; all numbers are then replaced with <digit>; English letters are unified to upper or lower case; the text data are re-encoded as UTF-8 (8-bit Universal Character Set/Unicode Transformation Format); finally the text data are converted into a unified format and saved.
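As an illustration of step (1-2), the following Python sketch performs the described cleaning; the exact set of special symbols and the choice of lower case are assumptions, not fixed by the text.

```python
# A minimal text-cleaning sketch for step (1-2); the symbol set and casing choice are assumptions.
import re
import unicodedata

def clean_text(raw: str, lowercase: bool = True) -> str:
    text = unicodedata.normalize("NFKC", raw)           # normalize full-/half-width forms
    text = re.sub(r"[@#*&^~|<>{}\[\]]", " ", text)      # drop special symbols (assumed set)
    text = re.sub(r"\d+(\.\d+)?", "<digit>", text)      # replace every number with <digit>
    text = text.lower() if lowercase else text.upper()  # unify letter case
    text = re.sub(r"\s+", " ", text).strip()            # collapse redundant whitespace
    return text

cleaned = clean_text("Revenue grew 12.5% in 2021  @HZNU!!")
print(cleaned)   # "revenue grew <digit>% in <digit> hznu!!"
with open("cleaned.txt", "w", encoding="utf-8") as f:   # store re-encoded as UTF-8
    f.write(cleaned)
```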
(1-3) Word segmentation: continuous word sequences in the text data are recombined into word sequences according to specifications, and the input text data are segmented with the Stanford CoreNLP toolkit (https://stanfordnlp.github.io/CoreNLP/), which supports multiple languages.
Word segmentation approaches include, but are not limited to, the following. In the first approach, dictionary-based word segmentation is applied to the sentence content to obtain a word set: character strings in the document are matched one by one against the entries of a dictionary, and if a string is found in the dictionary the match succeeds and the string is segmented out; examples are the dictionary-based forward maximum matching, reverse maximum matching and minimum segmentation methods. In the second approach, the sentence is segmented with a method based on a probabilistic statistical model: whether a character string constitutes a word is decided from its statistical frequency in the corpus; words are combinations of characters, and the more often adjacent characters co-occur, the more likely they form a word; examples of such statistical methods are word-frequency/inverse-document-frequency tokenizers, hidden Markov models and conditional random fields. In the third approach, the sentence content is segmented with a method based on grammar and rules: syntactic and semantic analysis is performed during segmentation, and syntactic and semantic information is used for part-of-speech tagging to resolve segmentation ambiguities.
In a specific implementation, suppose the sentence obtained from the text corpus is "Natural language processing is hailed as the jewel in the crown of artificial intelligence" (the example sentence). After word segmentation of the sentence content, the resulting word set can be expressed as ['natural', 'language', 'processing', 'bei', 'hailed as', 'artificial intelligence', 'crown', 'on', 'of', 'jewel'] (glosses of the Chinese tokens).
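A minimal segmentation sketch for the example sentence, using jieba as a readily available Chinese tokenizer in place of the Stanford CoreNLP pipeline named above; the exact token boundaries depend on the tool and its dictionary.

```python
# Word segmentation sketch for step (1-3) using jieba (dictionary + HMM based).
import jieba

sentence = "自然语言处理被誉为人工智能皇冠上的明珠"
tokens = jieba.lcut(sentence)   # list of word strings
print(tokens)
# expected to resemble: ['自然语言', '处理', '被', '誉为', '人工智能', '皇冠', '上', '的', '明珠']
```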
(1-4) Part-of-speech tagging: the grammatical category of each word in a given sentence is determined according to its meaning and context, and the word is tagged accordingly. Chinese and English texts are tagged with jieba and Stanford CoreNLP respectively; Chinese part-of-speech tagging adopts an ictclas-compatible tag set, and English tagging adopts the Penn Treebank tag set.
Part-of-speech tagging approaches include, but are not limited to, the following. In the first approach, a rule-based part-of-speech tagging method is used to tag the word set and obtain the part-of-speech set; this is the earliest proposed approach, in which part-of-speech disambiguation rules are built from word collocation relations and context, the rules usually being constructed manually in the early stage. In the second approach, a part-of-speech tagging method based on a statistical model is used: tagging is treated as a sequence labeling problem, and given the sequence of words with their respective tags, the most probable part of speech of the next word can be determined. In the third approach, a method combining the statistical and rule-based approaches is used; a typical example is filtering the statistical tagging results and applying the rule-based method only to those results considered suspicious, rather than applying both methods to all cases.
In a specific implementation, assume the word segmentation result is ['natural', 'language', 'processing', 'bei', 'hailed as', 'artificial intelligence', 'crown', 'on', 'of', 'jewel']; after part-of-speech tagging, the result can be expressed as ['NN', 'NN', 'NN', 'SB', 'VV', 'NN', 'NN', 'LC', 'DEG', 'NN'], where NN denotes a noun, SB denotes the short passive construction marker "bei", VV denotes a verb, LC denotes a localizer, and DEG denotes the genitive particle "de".
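A part-of-speech tagging sketch with jieba.posseg, which produces ictclas-compatible tags as mentioned in step (1-4); note that jieba's tag names ('n', 'v', ...) differ from the CTB-style tags (NN, VV, ...) shown in the example above.

```python
# POS tagging sketch for step (1-4) with jieba's ictclas-compatible tagger.
import jieba.posseg as pseg

pairs = pseg.lcut("自然语言处理被誉为人工智能皇冠上的明珠")
for word, flag in pairs:
    print(word, flag)   # e.g. 人工智能 n, 誉为 v, ...
```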
(1-5) Stop-word removal: the stop words in the word segmentation result are deleted according to a stop-word list; this step removes words that carry no effective document information and reduces the number of candidate keywords. For Chinese and English, a common Chinese stop-word list (https://github.com/goto456/stopwords) and the NLTK English stop-word list (https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/stopwords) are used respectively.
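A stop-word filtering sketch for step (1-5); 'stopwords.txt' is a placeholder for either of the downloaded stop-word lists (one word per line).

```python
# Stop-word removal sketch for step (1-5); the file path is a placeholder.
def load_stopwords(path: str) -> set:
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def remove_stopwords(tokens, stopwords):
    return [t for t in tokens if t not in stopwords]

stopwords = load_stopwords("stopwords.txt")
print(remove_stopwords(['自然语言', '处理', '被', '誉为', '人工智能', '皇冠', '上', '的', '明珠'], stopwords))
```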
(2) Extracting candidate keywords to obtain a candidate keyword set: named entity recognition, noun phrase chunk extraction and longest-sequence matching are applied to the preprocessed text to obtain three sets, and the union of the three sets is taken as the candidate keyword set. The step specifically comprises:
(2-1) named entity recognition:
All entities in the text data are extracted, including named entities (mainly persons, organizations and places), common entities (such as animals and plants, books, cultural conventions, materials, etc.) and abstract concepts (concepts without physical form produced by abstract human thinking). The named entities among them are identified to obtain the named entity set.
Named entity recognition is mainly performed in the following ways. In the first way, rule-based named entity recognition is used to obtain the named entity set: the method relies on a manually built rule system combined with a named-entity dictionary, assigns an auxiliary weight to each rule, and then judges the entity type by how well the entity matches the rules; rules usually depend on a specific language domain and text style, and it is difficult for them to cover all language scenarios. In the second way, statistics-based named entity recognition, similar to word segmentation, is used; currently popular methods include hidden Markov models and conditional random field models, which rely on manually annotated corpora and treat named entity recognition as a sequence labeling problem. The third way is a hybrid approach: natural language processing is not an entirely random process, and using statistical methods alone makes the state search space very large, so filtering and pruning with rule knowledge must be performed in advance. In the present method, named entity recognition adopts a hybrid of rule-based and statistics-based named entity recognition.
In a specific implementation, named entity recognition is performed on the example sentence, and the result can be expressed as ['natural language', 'artificial intelligence', 'crown', 'jewel']. After extraction, the candidate keyword set CANDIDATE_A is obtained.
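A named entity recognition sketch for step (2-1) using Stanza, a statistical pipeline related to Stanford CoreNLP; the patent combines rule-based and statistical NER, so this statistics-only sketch is an approximation, and the Chinese model must first be downloaded with stanza.download("zh").

```python
# Statistical NER sketch for step (2-1); OntoNotes-style NER covers persons, organizations,
# places, etc., so the result may be empty for the example sentence.
import stanza

nlp = stanza.Pipeline(lang="zh", processors="tokenize,ner")
doc = nlp("自然语言处理被誉为人工智能皇冠上的明珠")
candidate_a = {ent.text for ent in doc.ents}   # named-entity candidate set CANDIDATE_A
print(candidate_a)
```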
(2-2) Noun phrase chunk extraction: the text data is first split into several smaller text segments to obtain chunk sets, and chunk analysis (chunking) is then performed on each chunk set to obtain noun phrases (NP), yielding the phrase chunk set. Note that text segmentation differs from word segmentation: word segmentation splits a text into words, whereas text segmentation splits a large text into several smaller texts.
In a specific implementation, the noun chunks of the example sentence are first extracted, and the resulting candidate set can be expressed as ['natural language processing', 'crown of artificial intelligence', 'jewel']. The extracted noun chunks are then filtered to obtain the candidate keyword set CANDIDATE_B.
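A noun phrase chunking sketch for step (2-2) using NLTK's RegexpParser on POS-tagged tokens; the chunk grammar below is an illustrative assumption, not the grammar used by the invention.

```python
# NP chunking sketch for step (2-2); the grammar (optional adjectives followed by nouns) is assumed.
import nltk

grammar = "NP: {<JJ>*<NN.*>+}"
parser = nltk.RegexpParser(grammar)

tagged = [("natural", "JJ"), ("language", "NN"), ("processing", "NN"),
          ("is", "VBZ"), ("hailed", "VBN"), ("as", "IN"), ("the", "DT"),
          ("jewel", "NN"), ("in", "IN"), ("the", "DT"), ("crown", "NN"),
          ("of", "IN"), ("artificial", "JJ"), ("intelligence", "NN")]

tree = parser.parse(tagged)
candidate_b = [" ".join(w for w, _ in st.leaves())
               for st in tree.subtrees() if st.label() == "NP"]
print(candidate_b)   # ['natural language processing', 'jewel', 'crown', 'artificial intelligence']
```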
(2-3) Longest-sequence matching: words are extracted with an N-gram sliding window, an algorithm based on a statistical language model. Words are selected from the content of the text data with a sliding window of size N to form text fragment sequences (grams) of length N, and the extracted grams are filtered to form a keyword set.
In a specific implementation, suppose the sentence obtained from the text corpus is "the jewel in the crown"; sliding-window extraction is performed on this text with N = 3, and the resulting candidate set contains every fragment of up to three consecutive tokens, e.g. 'crown', 'in the crown', 'the jewel', 'jewel'. The set is then filtered by part of speech to obtain the candidate keyword set CANDIDATE_C.
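An N-gram sliding-window sketch for step (2-3); the part-of-speech filter (keeping fragments that end in a noun) is an assumption for illustration.

```python
# N-gram candidate extraction sketch for step (2-3).
def ngram_candidates(tagged_tokens, n=3, keep_tags=("NN",)):
    grams = []
    words = [w for w, _ in tagged_tokens]
    tags = [t for _, t in tagged_tokens]
    for size in range(1, n + 1):
        for i in range(len(words) - size + 1):
            if tags[i + size - 1].startswith(keep_tags):   # keep fragments ending in a noun
                grams.append(" ".join(words[i:i + size]))
    return set(grams)

tagged = [("jewel", "NN"), ("in", "IN"), ("the", "DT"), ("crown", "NN")]
candidate_c = ngram_candidates(tagged, n=3)
print(candidate_c)   # {'jewel', 'crown', 'the crown', 'in the crown'}
```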
The generated CANDIDATE_A, CANDIDATE_B and CANDIDATE_C are combined and deduplicated to obtain the candidate keyword set.
(3) Constructing weighted hypergraph nodes, constructing weighted hypergraph edges, and calculating importance scores of candidate keywords by using random walk; the method specifically comprises the following steps:
(3-1) Constructing weighted hypergraph nodes: in the preprocessed text data, words whose part of speech is noun, adjective or verb are selected as hypergraph nodes; each word serves as only one hypergraph node, and if a repeated word appears with different parts of speech, the part of speech with the higher frequency of occurrence is retained.
The document set is denoted D = {d_1, d_2, …, d_N}, where d_n represents the n-th input document, n = 1, 2, …, N, and N is the number of documents in D; d_n = {v_{n,1}, v_{n,2}, …, v_{n,M}}, where v_{n,m} is the m-th word in d_n, m = 1, 2, …, M, and M is the number of words in d_n.
d_n' is the bag of words corresponding to d_n, composed of the distinct words v_{n,k}: d_n' = {v_{n,1}, v_{n,2}, …, v_{n,K}}, where v_{n,k} is the k-th node in d_n', k = 1, 2, …, K, and K is the number of nodes in the bag of words d_n'.
For each node v_{n,k} in d_n', the TF-IDF value and the position information score are calculated to obtain the initial weight of the node:
The TF-IDF value evaluates how important a word is to a document within a document set or corpus: the importance of a node increases in proportion to the number of times the word appears in the document, but decreases with the frequency of the word across the corpus.
In a specific implementation, the term frequency (TF) is the number of occurrences of a term divided by the total number of terms in the document, and the inverse document frequency (IDF) measures the general importance of a word: the IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word and taking the base-10 logarithm of the quotient. In this way, the initial TF-IDF value of each node on the hypergraph is calculated as tfidf_0(v_{n,k}) = tf(v_{n,k}) · idf(v_{n,k}), where the term frequency tf(v_{n,k}) is the number of times v_{n,k} appears in d_n divided by the total number of words in d_n, and the inverse document frequency idf(v_{n,k}) = log10(N / |{d ∈ D : v_{n,k} ∈ d}|), the denominator being the number of documents in D that contain the word v_{n,k}. Normalization over the nodes of d_n' then gives the final TF-IDF value of each node, TFIDF(v_{n,k}).
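A sketch of the node TF-IDF computation just described; the normalization (dividing by the sum over all nodes) is one plausible reading, since the text does not fix the normalization method.

```python
# TF-IDF node weighting sketch; sum-to-1 normalization is an assumption.
import math
from collections import Counter

def tfidf_weights(doc_tokens, corpus):
    """doc_tokens: token list of d_n; corpus: list of token lists, one per document in D."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    n_docs = len(corpus)
    raw = {}
    for word in counts:
        df = sum(1 for d in corpus if word in d)            # documents containing the word
        raw[word] = (counts[word] / total) * math.log10(n_docs / df)
    norm = sum(raw.values()) or 1.0                         # normalization over the nodes of d_n'
    return {w: s / norm for w, s in raw.items()}
```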
The position information score of each node is also calculated; the initial position score of a node is inversely proportional to the positions at which the word represented by the node occurs in the text and directly proportional to its word frequency.
The initial position information score of node v_{n,k} is pos_0(v_{n,k}) = Σ_{p ∈ P(v_{n,k})} 1/p, the sum of the reciprocals of the absolute position indices p at which the word corresponding to the node occurs in d_n; normalization then gives the final position information score of each node, POS(v_{n,k}).
The initial weight of node v_{n,k} is w(v_{n,k}) = α·TFIDF(v_{n,k}) + β·POS(v_{n,k}), where α and β are weighting coefficients.
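A sketch of the position information score and the combined initial node weight; the values α = β = 0.5 are placeholders, as the text does not fix the weighting coefficients.

```python
# Position score and initial node weight sketch; alpha/beta values are assumptions.
def position_weights(doc_tokens):
    raw = {}
    for pos, word in enumerate(doc_tokens, start=1):   # 1-based absolute positions
        raw[word] = raw.get(word, 0.0) + 1.0 / pos
    norm = sum(raw.values()) or 1.0
    return {w: s / norm for w, s in raw.items()}

def initial_node_weights(doc_tokens, corpus, alpha=0.5, beta=0.5):
    tfidf = tfidf_weights(doc_tokens, corpus)           # from the previous sketch
    pos = position_weights(doc_tokens)
    return {w: alpha * tfidf[w] + beta * pos[w] for w in tfidf}
```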
(3-2) Constructing weighted hyperedges: edges are created for the lexical hypergraph and weighted. A hyperedge is created according to whether the words corresponding to two nodes appear in the same sentence: if node v_i and node v_j are in the same sentence, both nodes are assigned to the same hyperedge e. Each sentence is one hyperedge; after preprocessing, a hyperedge contains several words, and different hyperedges may contain the same nodes, so hyperedges can intersect or even overlap, which yields a complete topological structure of sentence-level semantic relations.
The transition probability of each node in the lexical hypergraph is calculated; the probability of node v_i reaching node v_j is determined by the number of hyperedges to which v_i and v_j jointly belong.
The transition probability from node v_i to node v_j is T_{i,j} = W_{i,j} / W_i, where W_{i,j} is the number of hyperedges to which node v_i and node v_j jointly belong minus 1, and W_i is the corresponding sum over all nodes reachable from v_i, i.e. W_i = Σ_j W_{i,j}. The transition probability T_{i,j} serves as the hyperedge weight.
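A sketch of hyperedge construction (one hyperedge per preprocessed sentence) and of the transition probabilities T_{i,j} = W_{i,j} / W_i; the handling of the "minus 1" terms follows one possible reading of the description.

```python
# Hyperedge construction and transition-probability sketch; each sentence is one hyperedge.
from collections import defaultdict

def transition_probabilities(sentences):
    """sentences: list of token lists; each list is one hyperedge."""
    membership = defaultdict(set)                  # node -> set of hyperedge indices
    for idx, sent in enumerate(sentences):
        for word in set(sent):
            membership[word].add(idx)

    T = defaultdict(dict)
    for vi in membership:
        weights = {}
        for vj in membership:
            if vj == vi:
                continue
            shared = len(membership[vi] & membership[vj])
            if shared:
                weights[vj] = shared - 1           # W_{i,j}: shared hyperedges minus 1
        total = sum(weights.values())              # W_i
        if total > 0:
            T[vi] = {vj: w / total for vj, w in weights.items()}
    return T
```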
(3-3) The nodes and hyperedges form the lexical hypergraph HG_n(V_n, E_n); V_n denotes the node set of all distinct words in HG_n, V_n = {v_{n,1}, v_{n,2}, …, v_{n,K}}; E_n denotes the hyperedge set of HG_n, E_n = {e_{n,1}, e_{n,2}, …, e_{n,L}}, where e_{n,l} is the l-th hyperedge in E_n, l = 1, 2, …, L, and L is the number of hyperedges in E_n. If v_{n,k} ∈ e_{n,l}, the incidence-matrix element H_n(v_{n,k}, e_{n,l}) = 1; otherwise H_n(v_{n,k}, e_{n,l}) = 0.
Calculating importance scores of candidate keywords by using random walk, wherein the random walk process is as follows:
A starting node v_i is selected, and a specific hyperedge e_i containing v_i is chosen at random; within e_i, the node that v_i reaches with the highest probability is selected as the transfer node v_{i+1};
the importance score of node v_i is then calculated iteratively as S(v_i) = (1 - λ)·w(v_i) + λ·Σ_{v_j ∈ A(v_i)} (T_{j,i} / ΣT_j)·S(v_j), where (1 - λ)·w(v_i) serves as the initial weight of node v_i, λ is the damping factor, A(v_i) is the set of all other nodes belonging to the same hyperedge as v_i, and ΣT_j is the sum of the transition probabilities of node v_j. The iteration stops when the error between two successive iterations is smaller than the set threshold, yielding the final importance score of node v_i.
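A sketch of the iterative score computation described above, with assumed values λ = 0.85 and convergence threshold 1e-6; init_weights and T come from the previous sketches.

```python
# Random-walk importance-score sketch; lam and eps are assumed values.
def random_walk_scores(init_weights, T, lam=0.85, eps=1e-6, max_iter=100):
    scores = dict(init_weights)
    for _ in range(max_iter):
        new_scores = {}
        for vi in scores:
            acc = 0.0
            for vj, probs in T.items():
                if vi in probs:                          # v_j can transition to v_i
                    sum_tj = sum(probs.values()) or 1.0  # sum of transition probabilities of v_j
                    acc += probs[vi] / sum_tj * scores[vj]
            new_scores[vi] = (1 - lam) * init_weights[vi] + lam * acc
        if max(abs(new_scores[v] - scores[v]) for v in scores) < eps:
            return new_scores                            # converged
        scores = new_scores
    return scores
```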
(4) Ranking the keywords in the candidate keyword set: the importance scores of all nodes composing a candidate keyword are summed and averaged to obtain the importance score of the candidate keyword. The candidates are sorted from high to low by score, and the top-R candidates are selected as the keywords finally predicted for document d_n.
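A sketch of step (4): each candidate keyword is scored by the average importance of its component nodes and the top-R candidates are returned.

```python
# Candidate ranking sketch for step (4); nodes missing from node_scores are assumed to score 0.
def rank_candidates(candidates, node_scores, top_r=5):
    scored = []
    for phrase in candidates:
        nodes = phrase.split()
        avg = sum(node_scores.get(w, 0.0) for w in nodes) / len(nodes)   # average node score
        scored.append((phrase, avg))
    scored.sort(key=lambda x: x[1], reverse=True)
    return [phrase for phrase, _ in scored[:top_r]]
```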
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims (6)

1. A document keyword extraction method based on hypergraph random walk is characterized by comprising the following steps:
(1) Text data collection and data preprocessing: collecting text data on each large website, and then cleaning, word segmentation, part-of-speech tagging and removal of stop words on the collected text data; the method specifically comprises the following steps:
(1-1) text data acquisition, namely acquiring a document to be analyzed: legally acquiring text data of a plurality of sources through a data mining technology under the condition that the format and the theme of the acquired document are not limited;
(1-2) text data cleansing: noise removing is carried out on the input data, and then the formats of the text data are unified and stored; firstly, redundant blank spaces and special symbols in text data are removed, and then all numbers are replaced by < digit >; unifying English letters into upper-case letters or lower-case letters, and recoding text data into UTF-8; finally converting the text data into a unified format, and carrying out formatting and saving;
(1-3) word segmentation operation: recombining continuous word sequences in the text data into word sequences according to specifications, and segmenting the input text data by using Stanford CoreNLP tool kits supporting multiple languages;
(1-4) part-of-speech tagging: determining the grammar category of each word in a given sentence according to the meaning and the context content, determining the part of speech and marking; part-of-speech tagging is carried out on Chinese and English texts by jieba and Stanford CoreNLP respectively, a tag system compatible with ictclas is adopted for Chinese part-of-speech tagging, and a Penn Treebank part-of-speech tag system is adopted for English part-of-speech tagging;
(1-5) removing stop words: deleting the stop words in the word segmentation result according to the stop word list;
(2) Extracting candidate keywords to obtain a candidate keyword set: respectively adopting named entity recognition, noun phrase block extraction and longest sequence matching to the preprocessed text to obtain three sets, and then taking the union of the three sets as a candidate keyword set; the method specifically comprises the following steps:
(2-1) named entity recognition: extracting all entities in the text data, and identifying named entities in the text data to obtain a named entity set;
(2-2) noun phrase block extraction: firstly, dividing a text data into a plurality of small text data segments to obtain block sets, and then carrying out block analysis on each block set to obtain noun phrases and obtain phrase block sets;
(2-3) longest sequence match: selecting words by using an N-gram sliding window, selecting words from the content in the text data according to the sliding window with the size of N to form a text fragment sequence gram with the length of N, and filtering the extracted text fragment sequence gram to form a keyword set;
(3) Constructing weighted hypergraph nodes, constructing weighted hypergraph edges, and calculating importance scores of candidate keywords by using random walk; the method specifically comprises the following steps:
(3-1) constructing weighted hypergraph nodes: selecting words with parts of speech as nouns, adjectives and verbs from the preprocessed text data as hypergraph nodes, wherein each word is only used as one hypergraph node, and if the parts of speech of repeated words are different, retaining the words with high parts of speech occurrence frequency;
The document set is denoted D = {d_1, d_2, …, d_N}, where d_n represents the n-th input document in the document set, n = 1, 2, …, N, and N is the number of documents in D; d_n = {v_{n,1}, v_{n,2}, …, v_{n,M}}, where v_{n,m} is the m-th word in d_n, m = 1, 2, …, M, and M is the number of words in d_n;
d_n' is the bag of words corresponding to d_n, composed of the distinct words v_{n,k}: d_n' = {v_{n,1}, v_{n,2}, …, v_{n,K}}, where v_{n,k} is the k-th node in d_n', k = 1, 2, …, K, and K is the number of nodes in the bag of words d_n';
for each node v_{n,k} in d_n', the TF-IDF value TFIDF(v_{n,k}) and the position information score POS(v_{n,k}) are calculated to obtain the initial weight of the node, w(v_{n,k}) = α·TFIDF(v_{n,k}) + β·POS(v_{n,k});
(3-2) Constructing weighted hyperedges: the transition probability of each node in the lexical hypergraph is calculated; the transition probability from node v_i to node v_j is T_{i,j} = W_{i,j} / W_i, where W_{i,j} is the number of hyperedges to which node v_i and node v_j jointly belong minus 1, and W_i is the corresponding sum over all nodes reachable from v_i; the transition probability T_{i,j} serves as the hyperedge weight;
(3-3) The nodes and hyperedges form the lexical hypergraph HG_n(V_n, E_n); V_n denotes the node set of all distinct words in HG_n, V_n = {v_{n,1}, v_{n,2}, …, v_{n,K}}; E_n denotes the hyperedge set of HG_n, E_n = {e_{n,1}, e_{n,2}, …, e_{n,L}}, where e_{n,l} is the l-th hyperedge in E_n, l = 1, 2, …, L, and L is the number of hyperedges in E_n; if v_{n,k} ∈ e_{n,l}, the incidence-matrix element H_n(v_{n,k}, e_{n,l}) = 1, otherwise H_n(v_{n,k}, e_{n,l}) = 0; the importance score S(v_i) of node v_i is calculated using the random walk;
(4) Ranking the keywords in the candidate keyword set: summing and averaging importance scores of all nodes forming the candidate keywords to obtain importance scores of the candidate keywords; and sorting from high to low according to the scores, and selecting R candidate keywords before sorting as keywords finally predicted by the document d n.
2. The document keyword extraction method based on the hypergraph random walk as claimed in claim 1, wherein: the word segmentation operation mode comprises the following steps: the method comprises the steps of using a dictionary-based word segmentation method to segment sentence contents, using a probability statistical model-based word segmentation method to segment sentences, and using a grammar and rule-based word segmentation method to segment sentence contents.
3. The document keyword extraction method based on the hypergraph random walk as claimed in claim 1, wherein: the part-of-speech tagging method comprises the following steps: part of speech tagging methods based on rules, part of speech tagging methods based on statistical models, part of speech tagging methods based on a combination of statistical methods and rule methods.
4. The document keyword extraction method based on the hypergraph random walk as claimed in claim 1, wherein: the named entity identification adopts a mixed method of named entity identification based on rules and statistics.
5. The document keyword extraction method based on hypergraph random walk as claimed in claim 1, wherein the TF-IDF value is calculated as follows: first, the initial TF-IDF value of node v_{n,k} is calculated as tfidf_0(v_{n,k}) = tf(v_{n,k}) · idf(v_{n,k}), where the term frequency tf(v_{n,k}) represents the number of times v_{n,k} appears in d_n, and the inverse document frequency idf(v_{n,k}) = log10(N / |{d ∈ D : v_{n,k} ∈ d}|), the denominator being the number of documents in D that contain the word v_{n,k}; normalization over the nodes of d_n' then gives the final TF-IDF value of each node, TFIDF(v_{n,k});
the initial position information score of node v_{n,k} is pos_0(v_{n,k}) = Σ_{p ∈ P(v_{n,k})} 1/p, the sum of the reciprocals of the absolute position indices p at which the word corresponding to the node occurs in d_n; normalization then gives the final position information score of each node, POS(v_{n,k});
the initial weight of node v_{n,k} is w(v_{n,k}) = α·TFIDF(v_{n,k}) + β·POS(v_{n,k}), where α and β are weighting coefficients.
6. The document keyword extraction method based on hypergraph random walk as claimed in claim 1, wherein in (3-3) the importance score S(v_i) of node v_i is calculated using the random walk as follows:
a starting node v_i is selected, and a specific hyperedge e_i containing v_i is chosen at random; within e_i, the node that v_i reaches with the highest probability is selected as the transfer node v_{i+1}; the importance score of node v_i is then calculated iteratively as S(v_i) = (1 - λ)·w(v_i) + λ·Σ_{v_j ∈ A(v_i)} (T_{j,i} / ΣT_j)·S(v_j), where (1 - λ)·w(v_i) serves as the initial weight of node v_i, λ is the damping factor, A(v_i) is the set of all other nodes belonging to the same hyperedge as v_i, and ΣT_j is the sum of the transition probabilities of node v_j; the iteration stops when the error between two successive iterations is smaller than the set threshold, giving the final importance score of node v_i.
CN202111388299.9A 2021-11-22 2021-11-22 Document keyword extraction method based on hypergraph random walk Active CN114065758B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111388299.9A CN114065758B (en) 2021-11-22 2021-11-22 Document keyword extraction method based on hypergraph random walk

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111388299.9A CN114065758B (en) 2021-11-22 2021-11-22 Document keyword extraction method based on hypergraph random walk

Publications (2)

Publication Number Publication Date
CN114065758A CN114065758A (en) 2022-02-18
CN114065758B true CN114065758B (en) 2024-04-19

Family

ID=80279203

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111388299.9A Active CN114065758B (en) 2021-11-22 2021-11-22 Document keyword extraction method based on hypergraph random walk

Country Status (1)

Country Link
CN (1) CN114065758B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115221871B (en) * 2022-06-24 2024-02-20 毕开龙 Multi-feature fusion English scientific literature keyword extraction method
CN114912449B (en) * 2022-07-18 2022-09-30 山东大学 Technical feature keyword extraction method and system based on code description text
CN116737745B (en) * 2023-08-16 2023-10-31 杭州州力数据科技有限公司 Method and device for updating entity vector representation in supply chain network diagram
CN116775849B (en) * 2023-08-23 2023-10-24 成都运荔枝科技有限公司 On-line problem processing system and method
CN117112916A (en) * 2023-10-25 2023-11-24 蓝色火焰科技成都有限公司 Financial information query method, device and storage medium based on Internet of vehicles
CN117251685B (en) * 2023-11-20 2024-01-26 中电科大数据研究院有限公司 Knowledge graph-based standardized government affair data construction method and device
CN117332761B (en) * 2023-11-30 2024-02-09 北京一标数字科技有限公司 PDF document intelligent identification marking system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7454430B1 (en) * 2004-06-18 2008-11-18 Glenbrook Networks System and method for facts extraction and domain knowledge repository creation from unstructured and semi-structured documents
WO2017084267A1 (en) * 2015-11-18 2017-05-26 乐视控股(北京)有限公司 Method and device for keyphrase extraction
CN107145519A (en) * 2017-04-10 2017-09-08 浙江大学 A kind of image retrieval and mask method based on hypergraph
CN111680509A (en) * 2020-06-10 2020-09-18 四川九洲电器集团有限责任公司 Method and device for automatically extracting text keywords based on co-occurrence language network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015016784A1 (en) * 2013-08-01 2015-02-05 National University Of Singapore A method and apparatus for tracking microblog messages for relevancy to an entity identifiable by an associated text and an image
US20170139899A1 (en) * 2015-11-18 2017-05-18 Le Holdings (Beijing) Co., Ltd. Keyword extraction method and electronic device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7454430B1 (en) * 2004-06-18 2008-11-18 Glenbrook Networks System and method for facts extraction and domain knowledge repository creation from unstructured and semi-structured documents
WO2017084267A1 (en) * 2015-11-18 2017-05-26 乐视控股(北京)有限公司 Method and device for keyphrase extraction
CN107145519A (en) * 2017-04-10 2017-09-08 浙江大学 A kind of image retrieval and mask method based on hypergraph
CN111680509A (en) * 2020-06-10 2020-09-18 四川九洲电器集团有限责任公司 Method and device for automatically extracting text keywords based on co-occurrence language network

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
S. Anjali; Nair M. Meera. A Graph based Approach for Keyword Extraction from Documents. 2019 Second International Conference on Advanced Computational and Communication Paradigms (ICACCP), 2019, full text. *
Keyword extraction algorithm for documents based on weighted hypergraph random walk; Ma Huifang; Liu Fang; Xia Qin; Hao Zhanjun; Acta Electronica Sinica; 2018-06-15 (No. 06); full text *
Hypergraph-based multi-document news keyword extraction; Fan Zequan; Lai Hua; Computer & Digital Engineering; 2017-12-20 (No. 12); full text *
Research on hypergraph-based collaborative extraction of text summaries and keywords; Mo Peng; Hu Po; Huang Xiangji; He Tingting; Journal of Chinese Information Processing; 2015-11-15 (No. 06); full text *
Research on hypergraph-based keyword extraction from Chinese-Vietnamese news; Fan Zequan; CNKI; 2017-03-01; full text *

Also Published As

Publication number Publication date
CN114065758A (en) 2022-02-18

Similar Documents

Publication Publication Date Title
CN114065758B (en) Document keyword extraction method based on hypergraph random walk
CN108763333B (en) Social media-based event map construction method
Sharma et al. Self-supervised contextual keyword and keyphrase retrieval with self-labelling
CN102253930B (en) A kind of method of text translation and device
Suleiman et al. The use of hidden Markov model in natural ARABIC language processing: a survey
Suleiman et al. Comparative study of word embeddings models and their usage in Arabic language applications
Zu et al. Resume information extraction with a novel text block segmentation algorithm
CN113221559B (en) Method and system for extracting Chinese key phrase in scientific and technological innovation field by utilizing semantic features
Ye et al. Unknown Chinese word extraction based on variety of overlapping strings
CN102214189A (en) Data mining-based word usage knowledge acquisition system and method
Ashna et al. Lexicon based sentiment analysis system for malayalam language
CN114064901B (en) Book comment text classification method based on knowledge graph word meaning disambiguation
Jia et al. A Chinese unknown word recognition method for micro-blog short text based on improved FP-growth
Akther et al. Compilation, analysis and application of a comprehensive Bangla Corpus KUMono
Tonkin Searching the long tail: Hidden structure in social tagging
CN113963748A (en) Protein knowledge map vectorization method
Ahmad et al. Machine and deep learning methods with manual and automatic labelling for news classification in bangla language
Tahrat et al. Text2geo: from textual data to geospatial information
Varghese et al. Lexical and semantic analysis of sacred texts using machine learning and natural language processing
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
Gupta et al. Keyword extraction: a review
Anley et al. Opinion Mining of Tourists' Sentiments: Towards a Comprehensive Service Improvement of Tourism Industry
Abdolahi et al. A new method for sentence vector normalization using word2vec
NASSR et al. Generate a list of stop words in Moroccan dialect from social network data using word embedding
Mazitov et al. Named entity recognition in Russian using Multi-Task LSTM-CRF

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant