CN117057346A - Domain keyword extraction method based on weighted textRank and K-means


Info

Publication number
CN117057346A
Authority
CN
China
Prior art keywords
word
words
keywords
textrank
weighted
Prior art date
Legal status
Pending
Application number
CN202310693345.9A
Other languages
Chinese (zh)
Inventor
王艺霏
汪永伟
张玉臣
周胜男
胡浩
王沁武
刘鹏程
王梅
Current Assignee
Information Engineering University of PLA Strategic Support Force
Original Assignee
Information Engineering University of PLA Strategic Support Force
Priority date
Filing date
Publication date
Application filed by Information Engineering University of PLA Strategic Support Force
Priority to CN202310693345.9A
Publication of CN117057346A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor, of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, with fixed number of clusters, e.g. K-means clustering
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G06F40/268 Morphological analysis


Abstract

The invention discloses a domain keyword extraction method based on weighted textRank and K-means, which comprises the following steps: acquiring data from relevant network platforms and building an external military dictionary; preprocessing the acquired data, including Chinese word segmentation, part-of-speech tagging, and stop word removal; assigning weights to words according to word position and part of speech; obtaining n groups of results through grid search, and taking the weight parameters and algorithm parameters corresponding to the optimal value as the final parameters; computing the score of the final keywords with the obtained parameters; sorting the words by the calculated scores and selecting the top n words as keywords; and performing K-means clustering on the extracted keywords to retain the domain keywords. On the basis of the original TextRank formula, the invention utilizes word position features and part-of-speech features, which greatly enhances word semantic information and helps improve the accuracy of keyword extraction.

Description

Domain keyword extraction method based on weighted textRank and K-means
Technical Field
The invention relates to the technical field of domain keyword extraction, and in particular to a domain keyword extraction method based on weighted TextRank and K-means.
Background
Keyword extraction is the retrieval of keywords or key phrases from a text document. These keywords are selected from phrases in the document and characterize its subject matter. Automatic keyword extraction is a heuristic method of selecting the most common and important words or phrases from a text document, and it is an important area of natural language processing and artificial intelligence.
Keyword extraction technology can save a great deal of time, and keywords can to some extent determine the topic of a text, thereby providing the user with a summary of the main content of an article or document. Faced with massive amounts of information, people must spend considerable time and effort selecting and screening it, and the keywords of a text often convey its gist, allowing readers to quickly grasp its meaning. Keyword extraction algorithms can also automatically build indexes for books and publications. In addition, keyword extraction can support machine learning: keyword extraction algorithms find the most relevant words describing a text and can be used to visualize or automatically classify it.
As research has deepened, keyword extraction technology has matured, but because of characteristics of the military domain such as strong specialization and a lack of corpora, research on keyword extraction for military text is not deep enough, so existing text information cannot be fully mined and deeply exploited. Existing keyword extraction techniques can generally be divided into two categories, supervised and unsupervised. Supervised approaches achieve higher extraction accuracy but require a large amount of annotated data for learning, and the labor cost is relatively high. Compared with supervised keyword extraction, unsupervised approaches do not need manually built and maintained word lists, and their data requirements are much smaller. Classical unsupervised algorithms include TF-IDF, LDA, and TextRank. TF-IDF performs keyword extraction by counting the frequencies of words in the target text and other texts, but it judges word importance by frequency alone and ignores other important factors such as word semantics and word position. The LDA model, which distinguishes keywords by combining text topics, is relatively complex; under certain conditions the boundaries between topics are fuzzy and a topic word may be only weakly associated with the words, which affects the keyword extraction effect. TextRank is one of the most widely used algorithms at present: based on a word graph, it judges the importance of words in a text through the relations between words, and it is used to extract keywords, phrases, and summaries.
To improve the performance of the TextRank algorithm and the keyword extraction effect, existing improvements mainly follow three ideas: first, building a joint extraction model in combination with other keyword extraction algorithms; second, strengthening the influence of semantic relations by combining the algorithm with a word vector pre-training tool; and third, adding influencing factors such as word frequency, part of speech, and word position to build a multi-feature extraction model.
Traditional keyword extraction models are single-featured: few word features are considered, and the importance of features such as part of speech and word position is ignored. Existing improved keyword extraction modules do not consider the influence of word segmentation on the TextRank keyword extraction result, and ignoring segmentation errors means that some keywords cannot be extracted. Furthermore, some improved methods do not perform targeted screening of word features according to the characteristics of the target texts, resulting in redundant word features. The TextRank model has many parameters, and in some improved models certain parameters are assigned only by experience or obtained from a large number of repeated experiments, which makes the procedure cumbersome. In addition, keywords extracted by the prior art still contain common high-frequency words, which are insufficient for reflecting the central idea of the document and extracting its characteristics.
Disclosure of Invention
To enhance the model's accurate utilization of word features and improve the accuracy of keyword extraction, the invention provides a domain keyword extraction method based on weighted TextRank and K-means, built on the TextRank model. TextRank is improved in four respects: an external dictionary is introduced to add military corpus information and reduce the influence of word segmentation on the algorithm; the structural features of military texts are comprehensively considered, word positions and parts of speech are weighted, and the calculation of the TextRank transition probability matrix is optimized through weighted calculation; automatic parameter setting is realized by combining grid search; and after keyword extraction, a word clustering step is added to screen and distinguish common words from domain words, improving keyword extraction quality.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a field keyword extraction method based on weighted textRank and K-means comprises the following steps:
step 1: acquiring data from relevant network platforms, and establishing an external military dictionary;
step 2: preprocessing the acquired data, including Chinese word segmentation, part-of-speech tagging, and stop word removal;
step 3: assigning weights to words according to word position and part of speech;
step 4: obtaining n groups of results through grid search, and taking the weight parameters and algorithm parameters corresponding to the optimal value as the final parameters;
step 5: computing the score of the final keywords with the parameters obtained in step 4;
step 6: sorting the words by the scores calculated in step 5, and selecting the top n words as keywords;
step 7: performing K-means clustering on the extracted keywords, and retaining the domain keywords.
Further, the part-of-speech tagging includes:
the Chinese words after word segmentation are tagged as nouns, verbs, or adjectives.
Further, in step 3, weights are assigned to words according to word position and part of speech using the following formula:
P_ij = α·W_col(v_ij) + β·W_part(v_ij) + γ·W_pos(v_ij)
wherein P_ij is the weight of candidate keyword node v_i; α, β, γ are the proportions of the three weights, with α + β + γ = 1; W_col(v_ij) denotes the co-occurrence influence passed from candidate keyword node v_i to candidate keyword node v_j; W_pos(v_ij) denotes the part-of-speech weight value; and W_part(v_ij) denotes the position influence passed from candidate keyword node v_i to candidate keyword node v_j, where k is a parameter greater than 0.5 and less than 1.
Further, the important locations include a title, a head section, and a tail section.
Further, for nouns, verbs, adjectives, the part-of-speech weight values are 0.65, 0.2, 0.15, respectively.
Further, in step 5, the score of the final keywords is obtained according to the following formula:
IWS(v_i) = (1 − d) + d · Σ_{v_j ∈ In(v_i)} [ P_ji / Σ_{v_k ∈ Out(v_j)} P_jk ] · IWS(v_j)
wherein IWS(v_i) is the comprehensive score of node v_i among all nodes, produced when the iteration reaches the upper limit on the number of iterations or stabilizes; d is the damping coefficient; In(v_i) is the set of nodes pointing to word t_i, and Out(v_i) is the set of nodes that word t_i points to.
Further, the step 7 includes:
using the Word2vec tool to train and obtain the character vector set corresponding to the extracted keywords, obtaining word vectors as the weighted average of the character vectors, and clustering on the basis of the word vectors;
randomly selecting one sample point as a cluster center, traversing the distances from all other sample points to this center and computing their sum; adopting the random-seed idea, taking the data point corresponding to the maximum weight value as the seed point in a distance-weighted manner; and, after the two initialized cluster centers are obtained, dividing the sample points into two classes, general words and domain words, through continued iteration of the cluster centers.
Compared with the prior art, the invention has the beneficial effects that:
keyword extraction is the basic technology of natural language processing today, but traditional keyword extraction technology focuses more on improving and updating with fusion with other TF-IDF, word2vec and other models, and less attention is paid to a method for introducing information features of fusion words in keyword extraction. The invention is based on a classical keyword extraction model TextRank, firstly, a word stock in the military field is introduced as an external dictionary through a third party tool such as jieba and the like, so that a word segmentation result has field adaptability. And then, weighting calculation is carried out on the word positions and the part-of-speech features of the words, so as to obtain the word score of multi-feature fusion. And then, optimizing model parameters through grid search, so that the keyword extraction accuracy is effectively improved.
The invention utilizes the word position characteristics and the part-of-speech characteristics on the basis of the original TextRank formula, greatly enhances the word semantic information and is beneficial to improving the accuracy of keyword extraction. The technical scheme is provided for solving the problem that the keyword extraction model has little attention to the word characteristics or introduces the word characteristic redundancy. In addition, the invention focuses on the keyword quality level, and a clustering algorithm is added after the keyword extraction step, so that the quality and the field characteristics of the keywords extracted by the model can be further improved.
Drawings
FIG. 1 is a flowchart of a method for extracting domain keywords based on weighted textRank and K-means according to an embodiment of the present invention.
Detailed Description
For ease of understanding, some of the terms appearing in the detailed description of the invention are explained below:
1. keyword extraction technology: keyword extraction techniques refer to the automatic extraction of words or phrases from text that reflect the subject matter of the text. The keyword extraction technology plays a positive role in text mining, text clustering, text classification, personalized recommendation and the like.
2. Extracting domain keywords: and analyzing the text by means of keyword extraction, clustering and the like to obtain related keywords or phrases in the field.
Textrank model: the TextRank algorithm is a graph-based ranking algorithm for text. The basic idea is derived from the PageRank algorithm of Google, and by dividing a text into a plurality of constituent units (words and sentences) and establishing a graph model, the important components in the text are ordered by utilizing a voting mechanism, and keyword extraction and abstract can be realized by utilizing the information of a single document. Unlike LDA, HMM and other models, textRank does not require prior learning and training of multiple documents, and is widely used because of its simplicity and effectiveness.
4. Unsupervised learning: the training samples according to the unknown class (not labeled) solve various problems in pattern recognition, known as unsupervised learning. The common non-supervision learning algorithm mainly comprises a principal component analysis method PCA and the like, an equidistant mapping method, a local linear embedding method, a Laplacian characteristic mapping method, a black plug local linear embedding method, a local cut space arrangement method and the like.
5. Keyword: keywords in the present invention refer specifically to words in text that can highly summarize text semantic features.
6. Clustering algorithm: the clustering algorithm is different from the classification algorithm, and is an unsupervised algorithm for unlabeled samples. The data are grouped according to the characteristics among the data, the similarity of sample points in the group is larger through iteration, the similarity among the groups is smaller, and the clustering effect is better. Clustering algorithms currently widely studied are hierarchical-based clustering, density-based clustering, prototype-based clustering, and partition-based clustering.
K-means algorithm: the K-means algorithm was first proposed by Mac in 1967, and is a very classical algorithm, and still one of the "ten-big algorithms" that is very popular at present. The algorithm is widely used by people due to simple principle and high efficiency, but clustering results are also easily affected by other factors, such as determination of K value, outlier points and selection of initial clustering centers, and different points are selected as the initial clustering centers at will to cause different clustering results and possibly even fall into local optimum.
The invention is further illustrated by the following description of specific embodiments in conjunction with the accompanying drawings:
the invention provides a field keyword extraction method based on weighted textRank and K-means, which comprises the steps of text preprocessing, weight parameter setting, word position calculation by grid search, part-of-speech weight calculation and the like, and the specific flow is shown in figure 1.
The method specifically comprises the following steps:
Step 1: acquire data from relevant network platforms and build an external military dictionary.
Step 2: preprocess the acquired data, namely Chinese word segmentation, part-of-speech tagging, and stop word removal.
Step 3: assign weights to important words according to word position and part of speech.
Step 4: obtain n groups of results through grid search, and take the weight parameters and algorithm parameters corresponding to the optimal value as the final parameters.
Step 5: compute the score of the final keywords with the parameters obtained in step 4.
Step 6: sort the words by the scores calculated in step 5 and select the top n words as keywords.
Step 7: perform K-means clustering on the extracted keywords and retain the domain keywords.
1 Text preprocessing
The acquired data need to be preprocessed before keyword extraction, including Chinese word segmentation, part-of-speech tagging, and stop word removal. The TextRank keyword extraction result is greatly influenced by word segmentation and text cleaning and depends heavily on the segmentation result. Military texts mostly concern specific professional fields, use a large number of domain terms, and many professional vocabulary items occur frequently. Longer nouns, such as unit names and weaponry, often appear in the text; if a general dictionary is used directly for segmentation, a new word may be split into several single words, which affects subsequent keyword extraction. Therefore, on the basis of jieba word segmentation, a military-domain word store is introduced as an external dictionary so that the segmentation result has domain adaptability. The external domain dictionary is self-built, collecting 7,492 entries from the 2011 edition of the Chinese People's Liberation Army military terminology, 2,627 military weapon and equipment entries, and 3,936 military-term entries from Baidu Baike, 14,055 words in total. The keywords extracted by the algorithm are also added to the custom dictionary synchronously, further improving keyword extraction precision. Chinese word segmentation is performed on the acquired data, and each word in the custom dictionary is tagged with its part of speech as a noun, verb, or adjective.
After word segmentation, a large number of meaningless stop words still exist in the text data, and they need to be removed from the data set so that effective keywords can be extracted. Stop words are a class of words that occur in text with high frequency and contribute nothing to the semantics of the text, such as modal particles and conjunctions like "the" and "but", numbers, and symbols. Stop words are removed by building a stop word lexicon, traversing the candidate keyword set, and comparing each candidate against the lexicon; if a match is found, the candidate is deleted from the candidate set.
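To make the preprocessing stage concrete, the following is a minimal Python sketch using jieba. The file names military_dict.txt and stopwords.txt, the two-character minimum word length, and the exact part-of-speech filter are assumptions of this example rather than values fixed by the invention.
```python
import jieba
import jieba.posseg as pseg

# Self-built domain dictionary and stop word list; the file names are placeholders.
jieba.load_userdict("military_dict.txt")

with open("stopwords.txt", encoding="utf-8") as f:
    stopwords = {line.strip() for line in f if line.strip()}

KEEP_FLAGS = ("n", "v", "a")  # keep nouns, verbs, adjectives

def preprocess(text):
    """Segment Chinese text, keep nouns/verbs/adjectives, drop stop words."""
    candidates = []
    for token in pseg.cut(text):
        word, flag = token.word, token.flag
        if len(word) < 2 or word in stopwords:   # length filter is an assumption of this sketch
            continue
        if flag.startswith(KEEP_FLAGS):
            candidates.append((word, flag))
    return candidates
```
The returned list of (word, part-of-speech) pairs is the candidate keyword set used by the word graph construction in the next section.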
2 Word graph construction
The classical TextRank model takes candidate keywords as nodes, takes co-occurrence relations among words as edges to form a word graph, calculates the score of each node by using a formula, sorts the scores, and takes the first n words as keywords. The process of constructing the word graph is as follows:
The whole text is first split into sentences, giving the sentence set S = [S_1, S_2, S_3, …, S_n].
Each sentence S_i ∈ S is preprocessed to obtain the candidate keyword set T = [t_1, t_2, t_3, …, t_n]. A word graph G = (V, E) is then constructed, where V is the candidate keyword node set, V = [v_1, v_2, v_3, …, v_n], and E is the set of edges between candidate keywords: if words t_i and t_j co-occur within the window, there is an edge between word nodes v_i and v_j. If word t_i co-occurs before t_j, the edge <v_i, v_j> is generated; if word t_i co-occurs after t_j, the edge <v_j, v_i> is generated. For any node:
In(v_i) = {v_j | <v_j, v_i> ∈ E}   (1)
Out(v_i) = {v_j | <v_i, v_j> ∈ E}   (2)
In(v_i) is the set of nodes pointing to word t_i, and Out(v_i) is the set of nodes that word t_i points to.
The TextRank iteration model supports weighted operation; by adding weights to the word graph, the strength of the connections between word nodes is distinguished. Let w_ij denote the weight of the edge from node v_i to node v_j. The score of node v_i is then calculated as:
WS(v_i) = (1 − d) + d · Σ_{v_j ∈ In(v_i)} [ w_ji / Σ_{v_k ∈ Out(v_j)} w_jk ] · WS(v_j)   (3)
The weights of all nodes are propagated iteratively with this formula until convergence. WS(v_i) denotes the score of node v_i, and d is the damping coefficient, which ranges from 0 to 1 and is typically 0.85. When the TextRank algorithm is used to calculate the scores of the points in the graph, they are computed recursively from arbitrary initial values until convergence, which is reached when the error at any point in the graph is less than a given limit, typically 0.0001.
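The word graph construction and the classic iteration of formula (3) can be sketched as follows. This is an illustrative implementation under simplifying assumptions: edges are keyed by the words themselves, the co-occurrence count within the sliding window serves as the edge weight w_ij, and the default window size is arbitrary.
```python
from collections import defaultdict

def textrank(candidates, window=5, d=0.85, max_iter=100, tol=1e-4):
    """Classic TextRank over a directed co-occurrence word graph (formula (3))."""
    words = [w for w, _ in candidates]
    weight = defaultdict(float)          # w_ij: co-occurrence counts within the window
    in_edges, out_edges = defaultdict(set), defaultdict(set)
    for i, wi in enumerate(words):
        for wj in words[i + 1:i + window]:
            if wi == wj:
                continue
            weight[(wi, wj)] += 1.0      # wi appears before wj -> edge <v_i, v_j>
            out_edges[wi].add(wj)
            in_edges[wj].add(wi)
    nodes = set(words)
    score = {v: 1.0 / len(nodes) for v in nodes}   # initial value 1/N
    for _ in range(max_iter):
        new_score = {}
        for vi in nodes:
            s = sum(weight[(vj, vi)] / sum(weight[(vj, vk)] for vk in out_edges[vj]) * score[vj]
                    for vj in in_edges[vi])
            new_score[vi] = (1 - d) + d * s
        if max(abs(new_score[v] - score[v]) for v in nodes) < tol:
            score = new_score
            break
        score = new_score
    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)
```
Sorting the returned scores and taking the first n words reproduces the classical keyword selection described above.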
3 Formula improvement
The classical TextRank algorithm sets the initial value of each word node to 1/N by default (N is the number of nodes), scores each node using a voting mechanism, evenly distributes scores to adjacent nodes according to the number of connecting edges, constructs a transition probability matrix, and iteratively calculates the updated node scores. In this process, the influence of every node is the same, and the final score is obtained only through iteration over the co-occurrence counts. However, the classical TextRank algorithm ignores differences in the importance of the words themselves: when co-occurring, important words should be given a higher score than general words. Two main factors determine the importance of words in military text, namely the position and the part of speech of the word. Therefore, these two attributes are weighted here to distinguish the influence of different nodes, and on this basis the word node score is obtained through an improved weighted calculation formula.
Compared with other texts, texts in this domain have a certain rigor and authority owing to the distinct characteristics of the military profession, and the language style is usually precise and concise. Military texts have a fixed structure, and the text in positions such as the article title, the first paragraph, and the final paragraph usually contains more summarizing information, which is very important for keyword extraction. Combining these characteristics, different weight information is added to words at different positions in the text, so that the influence of words at important positions is increased and keyword extraction at important positions is strengthened. Based on this principle, the candidate keywords in the title, first paragraph, and last paragraph of a document are assigned weight values. The position weight of word t_i in the text set is defined as follows:
W_part(v_ij) denotes the position influence passed from node v_i to node v_j, and k is a parameter greater than 0.5 and less than 1. The important positions (key positions), namely the title, the first paragraph, and the last paragraph, are assigned different weight values by judging whether the position at which a word first appears is a key position. For example, if the word "cruise ship" appears in the title of an article, that word is an important-position word and its position weight W_part(v_ij) takes the value k. The value of k is derived from a grid search experiment.
Part-of-speech features are a representation of linguistic knowledge. The parts of speech of keywords in Chinese text are concentrated in content words such as nouns, verbs, and adjectives, which account for about 90% of the keywords in the text set, so part of speech is introduced as one of the important features for keyword extraction. As one implementation, the part-of-speech weights of nouns, verbs, and adjectives are set to 0.65, 0.2, and 0.15, respectively, giving the part-of-speech weight value W_pos(v_ij), which denotes the part-of-speech influence of node v_i on node v_j.
W_col(v_ij) denotes the co-occurrence probability in the original formulation, i.e., the co-occurrence influence of node v_i on node v_j.
According to the above, the weight calculation formula for any node v_i is:
P_ij = α·W_col(v_ij) + β·W_part(v_ij) + γ·W_pos(v_ij)   (6)
where P_ij is the weight of node v_i, and α, β, γ are the proportions of the three types of weights, with α + β + γ = 1. A weight transfer matrix M is constructed according to formula (6), in which the j-th column represents the weight assignment when the influence of the j-th candidate keyword node v_j is transferred to the other words.
Combining this with formula (3), the improved formula is:
IWS(v_i) = (1 − d) + d · Σ_{v_j ∈ In(v_i)} [ P_ji / Σ_{v_k ∈ Out(v_j)} P_jk ] · IWS(v_j)
where IWS(v_i) is the comprehensive score of node v_i among all the nodes, produced when the iteration reaches the upper limit on the number of iterations or stabilizes. The comprehensive score of each word is its node influence score in the keyword word graph; the scores of all word nodes are sorted in descending order, and the first n word nodes are selected as the keyword extraction result.
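The weighted scoring can be sketched as follows: P_ij combines the co-occurrence, position, and part-of-speech influences as in formula (6), and the IWS scores are then iterated with P in place of w in formula (3). The default values of α, β, γ, and k, the raw-count form of W_col, and the 1−k weight for non-key positions are assumptions of this sketch; in the invention these parameters are instead chosen by the grid search of the next section.
```python
POS_WEIGHT = {"n": 0.65, "v": 0.2, "a": 0.15}   # noun / verb / adjective weights from the text

def edge_weight(cooc, vi, vj, key_position_words, pos_of,
                k=0.8, alpha=0.5, beta=0.3, gamma=0.2):
    """P_ij = alpha*W_col + beta*W_part + gamma*W_pos (formula (6))."""
    w_col = cooc[(vi, vj)]                             # co-occurrence influence (assumed: raw count)
    w_part = k if vi in key_position_words else 1 - k  # position influence (assumed form)
    w_pos = POS_WEIGHT.get(pos_of.get(vj, "x")[:1], 0.0)
    return alpha * w_col + beta * w_part + gamma * w_pos

def iterate_iws(nodes, in_edges, out_edges, P, d=0.85, max_iter=100, tol=1e-4):
    """Iterate IWS(v_i) = (1-d) + d * sum_j [P_ji / sum_k P_jk] * IWS(v_j)."""
    iws = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(max_iter):
        new = {}
        for vi in nodes:
            s = 0.0
            for vj in in_edges[vi]:
                denom = sum(P[(vj, vk)] for vk in out_edges[vj]) or 1.0
                s += P[(vj, vi)] / denom * iws[vj]
            new[vi] = (1 - d) + d * s
        if max(abs(new[v] - iws[v]) for v in nodes) < tol:
            return new
        iws = new
    return iws
```
Here cooc and P are dictionaries keyed by ordered word pairs; P can be filled by calling edge_weight for every edge of the word graph built in the previous sketch.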
4 Parameter tuning
In the basic algorithm, 5 parameters affect the accuracy of keyword extraction, namely the damping coefficient, the sliding window, the number of iterations, the iteration threshold, and the number of extracted keywords. The damping coefficient is a number smaller than 1, so the algorithm converges after multiple iterations even if the number of iterations and the iteration threshold are not set. However, to shorten the running time, the iteration can be stopped once the desired result is obtained by setting the number of iterations and the iteration threshold. The sliding window is the constraint on word co-occurrence: the co-occurrence probability is calculated between words only when they co-occur within the set window. After the formula improvement adds the weights, the weight coefficients α, β, γ and the word position weight k become further parameters that affect the final experimental result. The parameter settings directly affect the accuracy of the keyword extraction result, and manually tuning them through repeated experiments is laborious and cannot guarantee accuracy. In machine learning, hyperparameter optimization aims to find the hyperparameters that make the algorithm perform best on the validation data set; grid search, random search, and Bayesian optimization are commonly used. Random search requires a sufficiently large set of sample points, and Bayesian optimization easily falls into local optima. Grid search is the most widely used hyperparameter search algorithm. It is an exhaustive parameter optimization algorithm: in essence, the parameter space is divided into a grid, and the model to be trained is optimized by traversing the parameter combinations at all grid intersections. Because the algorithm is simple, efficient, and general, the grid search algorithm is used here for automatic parameter tuning.
Grid search requires a list of hyperparameters, whose Cartesian product yields n sets of hyperparameter combinations. The method here involves 9 algorithm parameters. Their value ranges are listed, the keyword extraction algorithm is run with every parameter combination to obtain evaluation metrics, and the parameter combination corresponding to the best metric is selected to give the optimal parameters.
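An illustrative sketch of this tuning step is given below. The parameter grids, the extract_keywords and evaluate callables, and the α+β+γ=1 filtering are placeholders for illustration; the actual value ranges and the evaluation metric are chosen by the practitioner.
```python
from itertools import product

# Example grids for the 9 parameters; the values are illustrative, not those of the patent.
param_grid = {
    "d": [0.80, 0.85, 0.90],
    "window": [3, 5, 7],
    "max_iter": [100, 200],
    "tol": [1e-4, 1e-5],
    "top_n": [5, 10],
    "alpha": [0.4, 0.5, 0.6],
    "beta": [0.2, 0.3],
    "gamma": [0.1, 0.2],
    "k": [0.6, 0.7, 0.8, 0.9],
}

def grid_search(documents, gold_keywords, extract_keywords, evaluate):
    """Try every combination (Cartesian product) and keep the best-scoring one."""
    names = list(param_grid)
    best_score, best_params = -1.0, None
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        if abs(params["alpha"] + params["beta"] + params["gamma"] - 1.0) > 1e-9:
            continue                                   # enforce alpha + beta + gamma = 1
        predictions = [extract_keywords(doc, **params) for doc in documents]
        score = evaluate(predictions, gold_keywords)   # e.g. F1 over labelled keyword sets
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```
The exhaustive loop is exactly the grid traversal described above; constraint filtering keeps only weight combinations that sum to 1.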
5 Domain word clustering
The invention uses the Word2vec tool to train a character vector set; word vectors are obtained as the weighted average of the character vectors, and clustering is performed on this basis. The K-means clustering algorithm is a simple iterative clustering algorithm. Grouping is based on "compactness" or "similarity": the more similar the objects within a group and the larger the gap between groups, the better. Distance measures include the Euclidean, Manhattan, and Chebyshev distances; a clustering algorithm generally uses the Euclidean distance as the similarity measure and the sum of squared errors as the objective function measuring clustering quality, and by minimizing this objective the data points are divided into k clusters according to their distances from the cluster centers. The K-Means algorithm is simple, fast, and suitable for conventional data sets, but the value of K is difficult to determine, and the choice of the K initial centroids strongly influences the final clustering result and the running time, so suitable centroids must be selected. A completely random choice may make the algorithm converge slowly. The most essential difference between the K-means++ algorithm and K-Means is the initialization of the K cluster centers.
To avoid the above problems, arthur et al propose a K-means++ algorithm that optimizes the method of randomly initializing the centroid for K-Means. The basic principle of the K-means++ algorithm in the process of initializing the cluster centers is to make the mutual distance between the initial cluster centers as far as possible, and the initialization process is as follows:
first, a sample point is randomly selected in the dataset as the first initialized cluster center, and the remaining cluster centers are similarly selected. Next, the distance between each sample point in the sample and the cluster center that has been initialized is calculated and the shortest distance among them is selected, denoted as di. A new data point is selected as a new cluster center, and the selection principle is that: the probability of a point with a larger distance being selected as the cluster center is larger. Repeating the above process until K cluster centers are determined for K initialized cluster centers, and calculating a final cluster center by using a K-Means algorithm.
This idea is adopted here to distinguish general words from domain words. Specifically, one sample point is randomly selected as a cluster center, the distances from all other sample points to this center are traversed, and their sum is computed. To avoid noise, the point with the largest distance is not selected directly as the other center; instead, following the random-seed idea, the data point corresponding to the maximum weight is taken as the seed point in a distance-weighted manner. After the two initialized cluster centers are obtained, the sample points are divided into two classes through continued iteration of the cluster centers. The specific algorithm flow is shown in Table 1.
Table 1: K-means++ clustering algorithm flow with a specified number of clusters
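The following Python sketch illustrates the clustering step described above: character vectors are trained with Word2vec, word vectors are obtained by averaging the character vectors (a plain average is used here instead of the weighted average mentioned above), the second initial center is chosen by distance-weighted random seeding in the K-means++ spirit, and the points are iterated into two clusters (domain words versus general words). The training corpus, vector dimension, and uniform character weighting are assumptions of this sketch.
```python
import numpy as np
from gensim.models import Word2Vec

def train_char_vectors(corpus_texts, dim=100):
    """Train Word2vec on character sequences (each text is split into characters)."""
    sentences = [list(text) for text in corpus_texts]
    return Word2Vec(sentences, vector_size=dim, window=5, min_count=1, sg=1)

def word_vector(word, char_model):
    """Word vector = mean of its character vectors (uniform weights assumed here)."""
    vecs = [char_model.wv[c] for c in word if c in char_model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(char_model.vector_size)

def two_means_plus_plus(X, n_iter=100, seed=0):
    """Split word vectors X into two clusters with a K-means++-style second seed."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]               # first centre: random sample point
    d2 = ((X - centers[0]) ** 2).sum(axis=1)          # squared distances to the first centre
    probs = d2 / d2.sum()                             # distance-weighted seeding
    centers.append(X[rng.choice(len(X), p=probs)])    # second centre
    centers = np.stack(centers)
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        new_centers = np.stack([X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
                                for c in (0, 1)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels
```
After clustering, the cluster whose members score higher on domain-dictionary membership would be kept as the domain keywords; that final selection rule is not specified in detail by the text and is left to the implementer.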
In summary, the key improvement points of the invention are as follows:
1. A keyword extraction technique with a custom dictionary. On the basis of jieba word segmentation, the invention introduces a military-domain word store as an external dictionary so that the segmentation result has domain adaptability. It collects 7,492 entries from the 2011 edition of the Chinese People's Liberation Army military terminology, 2,627 military weapon and equipment entries, and 3,936 military-term entries from Baidu Baike, 14,055 words in total. The keywords extracted by the algorithm are also added to the custom dictionary synchronously, further improving keyword extraction precision.
2. Keyword extraction technology integrating word positions and part-of-speech features. The invention takes part-of-speech and word position weights as basic elements of multi-feature fusion, gives higher weights to words in a document title, a first section and a last section on the basis of original text data, and sets part-of-speech weights as 0.65, 0.2 and 0.15 for noun, verb and adjective importance coefficients respectively. By adding abundant word feature weight representation in the original TextRank model, the utilization of word feature information by the model can be enhanced, and the keyword extraction accuracy can be effectively improved.
3. A keyword extraction technique combined with grid search. The invention automates the parameter selection and experimentation process of the TextRank model, using the grid search algorithm for exhaustive search to complete parameter optimization. The 5 parameters affecting keyword extraction accuracy, namely the damping coefficient, the sliding window, the number of iterations, the iteration threshold, and the number of extracted keywords, are optimized. Automating this process removes the repeated experiments and improves efficiency.
4. A keyword extraction technique with an added K-Means clustering algorithm. The invention performs secondary screening on the extracted keywords and uses word vector features to distinguish domain words from common words, so that the extracted keywords have stronger domain characteristics, the keyword extraction task is fully exploited, and downstream tasks are better served.
The foregoing is merely illustrative of the preferred embodiments of this invention, and it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of this invention, and it is intended to cover such modifications and changes as fall within the true scope of the invention.

Claims (7)

1. A domain keyword extraction method based on weighted textRank and K-means, characterized by comprising the following steps:
step 1: acquiring data from relevant network platforms, and establishing an external military dictionary;
step 2: preprocessing the acquired data, including Chinese word segmentation, part-of-speech tagging, and stop word removal;
step 3: assigning weights to words according to word position and part of speech;
step 4: obtaining n groups of results through grid search, and taking the weight parameters and algorithm parameters corresponding to the optimal value as the final parameters;
step 5: computing the score of the final keywords with the parameters obtained in step 4;
step 6: sorting the words by the scores calculated in step 5, and selecting the top n words as keywords;
step 7: performing K-means clustering on the extracted keywords, and retaining the domain keywords.
2. The method for extracting domain keywords based on weighted TextRank and K-means according to claim 1, wherein the part-of-speech tagging comprises:
the Chinese words after word segmentation are tagged as nouns, verbs, or adjectives.
3. The method for extracting domain keywords based on weighted TextRank and K-means according to claim 1, wherein in the step 3, weights are given to the words according to word positions and parts of speech according to the following formula:
P_ij = α·W_col(v_ij) + β·W_part(v_ij) + γ·W_pos(v_ij)
wherein P_ij is the weight of candidate keyword node v_i; α, β, γ are the proportions of the three types of weights, with α + β + γ = 1; W_col(v_ij) denotes the co-occurrence influence passed from candidate keyword node v_i to candidate keyword node v_j; W_pos(v_ij) denotes the part-of-speech weight value; and W_part(v_ij) denotes the position influence passed from candidate keyword node v_i to candidate keyword node v_j, where k is a parameter greater than 0.5 and less than 1.
4. The method of claim 3, wherein the important locations include a title, a first segment, and a last segment.
5. A method of domain keyword extraction based on weighted TextRank and K-means according to claim 3, wherein the part-of-speech weight values are 0.65, 0.2, 0.15 for nouns, verbs, adjectives, respectively.
6. The method for extracting domain keywords based on weighted TextRank and K-means according to claim 3, wherein in the step 5, the score of the final keywords is obtained according to the following formula:
IWS(v_i) = (1 − d) + d · Σ_{v_j ∈ In(v_i)} [ P_ji / Σ_{v_k ∈ Out(v_j)} P_jk ] · IWS(v_j)
wherein IWS(v_i) is the comprehensive score of node v_i among all nodes, produced when the iteration reaches the upper limit on the number of iterations or stabilizes; d is the damping coefficient; In(v_i) is the set of nodes pointing to word t_i, and Out(v_i) is the set of nodes that word t_i points to.
7. The method for extracting domain keywords based on weighted TextRank and K-means according to claim 1, wherein the step 7 comprises:
using the Word2vec tool to train and obtain the character vector set corresponding to the extracted keywords, obtaining word vectors as the weighted average of the character vectors, and clustering on the basis of the word vectors;
randomly selecting one sample point as a cluster center, traversing the distances from all other sample points to this center and computing their sum; adopting the random-seed idea, taking the data point corresponding to the maximum weight value as the seed point in a distance-weighted manner; and, after the two initialized cluster centers are obtained, dividing the sample points into two classes, general words and domain words, through continued iteration of the cluster centers.
CN202310693345.9A 2023-06-12 2023-06-12 Domain keyword extraction method based on weighted textRank and K-means Pending CN117057346A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310693345.9A CN117057346A (en) 2023-06-12 2023-06-12 Domain keyword extraction method based on weighted textRank and K-means

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310693345.9A CN117057346A (en) 2023-06-12 2023-06-12 Domain keyword extraction method based on weighted textRank and K-means

Publications (1)

Publication Number Publication Date
CN117057346A true CN117057346A (en) 2023-11-14

Family

ID=88663413

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310693345.9A Pending CN117057346A (en) 2023-06-12 2023-06-12 Domain keyword extraction method based on weighted textRank and K-means

Country Status (1)

Country Link
CN (1) CN117057346A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117275655A (en) * 2023-11-15 2023-12-22 中国人民解放军总医院第六医学中心 Medical records statistics and arrangement method and system based on artificial intelligence



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination