CN116414971A - Keyword weight calculation method and keyword extraction method for multi-feature fusion

Info

Publication number: CN116414971A
Application number: CN202310185632.9A
Authority: CN
Inventors: Name withheld at request
Original assignee: Mabo Nanjing Intelligent Technology Co ltd
Current assignee: Mabo Nanjing Intelligent Technology Co ltd
Priority date: 2023-03-01
Filing date: 2023-03-01
Publication date: 2023-07-11

Abstract

The invention discloses a multi-feature-fusion keyword weight calculation method. A document set is first collected for analysis, and each article in the set is segmented and tagged to form a sequence of candidate words. Any candidate word is then taken as the target word, and its normalized word frequency value, word head position value, word tail position value, word average span value, word head-tail span value, word length value, part-of-speech value, TFIDF value and average information entropy are obtained as multidimensional features. Finally, the fusion weight of the target word is calculated with a fusion weight formula. The invention also discloses a multi-feature-fusion keyword extraction method based on the TextRank algorithm, in which the edge weights are obtained from the fusion weight formula and the keywords are extracted accordingly. The notable effects are that the word average span and the word head-tail span, which better express how keywords are distributed, are adopted, and that the fusion weights are solved by Lasso regression, which deletes unimportant features and retains important ones, so that the accuracy of keyword extraction can be improved.

Description

Keyword weight calculation method and keyword extraction method for multi-feature fusion

Technical Field

The invention relates to keyword extraction from documents, and in particular to an automatic TextRank keyword extraction method that fuses multiple word-importance features.

Background

With the rapid development of Internet big data, the volume of unstructured document data has become enormous and users are surrounded by large amounts of irrelevant information. Accurate keyword extraction can effectively classify document data and so makes precise search and query more convenient for users.

Keyword extraction refers to extracting important words or phrases from a text as a summary of its main points. The technique is commonly used in text summarization, document classification, information retrieval and other applications. Traditional keyword extraction methods each have drawbacks: TFIDF, based on word frequency statistics, relies only on frequency information and lacks other important information about words, so its results are not ideal; LDA-based topic models usually have to be trained in advance and depend heavily on the topic distribution of the training document set; and the graph-based TextRank algorithm ignores the importance of the words themselves and performs poorly. A multi-feature-fusion Lasso-TextRank keyword extraction method is therefore proposed to improve on this situation.

Disclosure of Invention

To solve the above technical problems, the invention provides a method for extracting document keywords from a large number of documents. The method first constructs multidimensional features of the keywords and then solves the fusion weight expression of those features by Lasso regression. The fusion weight of each word is then obtained from this expression and used to improve the weights of the initial TextRank word nodes, and the keywords of the document are finally obtained by iterative updating. The scheme takes full account of the importance of words at both the single-document and the multi-document level, and can effectively avoid missed and erroneous keyword extraction.

In order to solve the calculation problem of the fusion weight, the main technical scheme adopted is as follows:

a multi-feature-fusion keyword weight calculation method, comprising the following steps:

step 1.1, collecting documents and selecting n of them as an analysis document set;

step 1.2, performing word segmentation and tagging on each article in the analysis document set, and arranging all candidate words of the same article in order of appearance to form a candidate word sequence;

taking any candidate word as the target word, and acquiring the multidimensional features of each target word;

the multidimensional features comprise the normalized word frequency value w_TF1, the word head position value w_FP, the word tail position value w_PL, the word average span value w_MTS, the word head-tail span value w_PFL, the word length value w_TL, the part-of-speech value w_PSO, the TFIDF value w_TFIDF and the average information entropy w_IH;

step 1.3, calculating the fusion weight y of the target word based on the following formula (1);

α is the weight coefficient of the normalized word frequency value w_TF1 of the target word;

β is the weight coefficient of the word head position value w_FP of the target word;

γ is the weight coefficient of the word tail position value w_PL of the target word;

δ is the weight coefficient of the word average span value w_MTS of the target word;

ε is the weight coefficient of the word head-tail span value w_PFL of the target word;

e is the weight coefficient of the word length value w_TL of the target word;

θ is the weight coefficient of the part-of-speech value w_PSO of the target word;

a further coefficient weights the TFIDF value w_TFIDF of the target word;

μ is the weight coefficient of the average information entropy w_IH of the target word;

w₀ is a constant term.
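Reading formula (1) as a weighted sum of the nine feature values plus the constant term w₀ (an assumption based on the coefficient definitions above), the calculation can be sketched in Python as follows. The function name `fusion_weight`, the dictionary keys and all numeric values are illustrative and do not come from the source.

```python
# Feature keys follow the order of the feature list; the names are illustrative.
FEATURES = ("TF1", "FP", "PL", "MTS", "PFL", "TL", "PSO", "TFIDF", "IH")


def fusion_weight(features, coefficients, constant=0.0):
    """Assumed form of formula (1): constant + sum of coefficient * feature value."""
    missing = [k for k in FEATURES if k not in features or k not in coefficients]
    if missing:
        raise KeyError(f"missing feature or coefficient: {missing}")
    return constant + sum(coefficients[k] * features[k] for k in FEATURES)


# Made-up feature values for one target word.
features = {"TF1": 0.5, "FP": 0.1, "PL": 0.2, "MTS": 0.3, "PFL": 0.7,
            "TL": 0.4, "PSO": 0.8, "TFIDF": 0.05, "IH": 0.6}
# A coefficient of zero (as Lasso regression may produce) drops that feature.
coefficients = {"TF1": 1.0, "FP": 0.0, "PL": 0.0, "MTS": -0.5, "PFL": 0.5,
                "TL": 0.0, "PSO": 1.0, "TFIDF": 2.0, "IH": 0.0}

y = fusion_weight(features, coefficients, constant=0.1)
print(round(y, 6))
```

Keeping the coefficients in a dictionary makes it easy to see which features a fitted model has effectively switched off.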

In order to obtain the keywords of the document, the following main technical scheme is adopted:

a multi-feature-fusion keyword extraction method, comprising the following steps:

step 2.1, obtaining the calculation formula of the fusion weight according to the above method;

step 2.2, dividing the content of the document to be processed into complete sentences in order, then performing word segmentation and part-of-speech tagging on each sentence, filtering out stop words, and retaining words of the specified parts of speech as candidate keywords;

step 2.3, constructing a candidate keyword graph G = (V, E), wherein V is the set of word nodes, composed of the candidate keywords obtained in step 2.2, and E is the set of edges, constructed between any two word nodes using the co-occurrence relation;

an edge exists between two word nodes only when the corresponding words co-occur within a window of length K, wherein K is the window size, that is, at most K words co-occur;

updating the importance weight WS(V_i) of the i-th word node V_i according to the following formula (11):

d is an adjustment coefficient;

In(V_i) is the set of other nodes pointing to word node V_i;

y_j is the fusion weight of the j-th word node V_j pointing to word node V_i; the multidimensional features of word node V_j are acquired first, and y_j is then calculated according to formula (1);

Out(V_j) is the set of other nodes pointed to by word node V_j;

y_jk is the fusion weight of word node V_j pointing to the k-th node it points to;

WS(V_j) is the importance weight of the j-th word node V_j;

iteratively propagating the weight of each node according to formula (11) until convergence;

step 2.4, sorting the importance weights of all word nodes in descending order, and extracting the selected keywords from the candidate keywords in that order.
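The co-occurrence rule of step 2.3 can be illustrated with a short Python sketch. It assumes the candidate keywords are already available as an ordered list and treats every co-occurrence edge as bidirectional; the function name `cooccurrence_graph` and the sample words are hypothetical.

```python
def cooccurrence_graph(candidates, window=5):
    """Link two distinct words when they fall inside the same window of `window` words."""
    neighbors = {}
    for start, word in enumerate(candidates):
        neighbors.setdefault(word, set())  # isolated words still become nodes
        # A window of K words covers the current word and the K - 1 words after it.
        for other in candidates[start + 1:start + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors.setdefault(other, set()).add(word)
    return neighbors


candidates = ["keyword", "extraction", "graph", "ranking", "keyword", "weight"]
g = cooccurrence_graph(candidates, window=3)
for node in sorted(g):
    print(node, sorted(g[node]))
```

A larger window produces a denser graph; the test in Example 3 uses K = 5.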

Detailed Description

The invention is further illustrated by the following examples.

Example 1:

a multi-feature-fusion keyword weight calculation method is carried out according to the following steps:

step 1.1, collecting documents and selecting n of them as an analysis document set;

the documents need to be cleaned in advance, including directly filtering out abnormal documents such as those containing garbled text;

the multidimensional features comprise the normalized word frequency value w_TF1, the word head position value w_FP, the word tail position value w_PL, the word average span value w_MTS, the word head-tail span value w_PFL, the word length value w_TL, the part-of-speech value w_PSO, the TFIDF value w_TFIDF and the average information entropy w_IH;

The multi-dimensional characteristics are calculated as follows:

(1) based on the same document, the word average span value w_MTS of the target word is calculated according to the following formula (3):

S_g(w) is the distance of the g-th span between successive occurrences of the target word in the corresponding document;

h is the number of spans that can be calculated for the target word in the corresponding document, and equals the number of times the target word appears in that document minus one.

The g-th span is the number of other candidate words separating two adjacent occurrences of the same word. Keywords are mentioned frequently in a document, so their spans are generally shorter, while those of non-keywords are longer. If a target word appears only once, its word average span value defaults to the maximum span value of the other candidate words in the corresponding document.
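A minimal Python sketch of the average span follows, assuming formula (3) is the plain mean of the h spans and that no further normalization is applied. The default for single-occurrence words is taken here as the largest average span among the repeated words, which is one possible reading of the rule above; the fallback for a document with no repeated word is an additional assumption.

```python
from collections import defaultdict


def average_spans(words):
    """Map each word to the mean number of other words between its adjacent occurrences."""
    positions = defaultdict(list)
    for index, word in enumerate(words):
        positions[word].append(index)

    spans = {}
    for word, idx in positions.items():
        if len(idx) > 1:
            gaps = [later - earlier - 1 for earlier, later in zip(idx, idx[1:])]
            spans[word] = sum(gaps) / len(gaps)  # len(gaps) equals occurrences - 1

    # Assumption: words seen once take the largest average span in the document;
    # if no word repeats at all, fall back to the document length.
    default = max(spans.values()) if spans else float(len(words))
    return {word: spans.get(word, default) for word in positions}


doc = ["graph", "rank", "graph", "node", "edge", "rank", "graph"]
result = average_spans(doc)
print(result)
```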

(2) based on the same document, the word head-tail span value w_PFL of the target word is calculated according to the following formula (4):

LP1(w) is the total number of other candidate words preceding the last occurrence of the target word in the corresponding document;

FP(w) is the word head position value of the target word;

SumPard(d) is the total number of occurrences of all candidate words in the corresponding document.
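One plausible reading of these definitions is that the head-tail span is the distance between the first and last occurrence of the word, divided by the document length. The Python sketch below implements that reading, (LP1(w) - FP(w)) / SumPard(d), with FP(w) taken as the number of words before the first occurrence; both the formula and the function name `head_tail_span` are assumptions.

```python
def head_tail_span(words, target):
    """Assumed form: (words before last occurrence - words before first) / total words."""
    if target not in words:
        raise ValueError(f"{target!r} does not occur in the document")
    first = words.index(target)                        # words before the first occurrence
    last = len(words) - 1 - words[::-1].index(target)  # words before the last occurrence
    return (last - first) / len(words)


doc = ["graph", "rank", "graph", "node", "edge", "rank", "graph"]
wide = head_tail_span(doc, "graph")  # occurs at both ends of the document
single = head_tail_span(doc, "node")  # occurs once, so the span is zero
print(wide, single)
```

Under this reading a word that appears throughout the document scores close to 1, while a word confined to one passage scores close to 0.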

(3) based on the same document, the normalized word frequency value w_TF1 of the target word is calculated according to the following formula (5):

wherein:

f(w) is the number of times the target word appears in the corresponding document;

min(f(d)) is the minimum number of occurrences of any candidate word in the corresponding document;

max(f(d)) is the maximum number of occurrences of any candidate word in the corresponding document;
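The three quantities above suggest min-max normalization, (f(w) - min(f(d))) / (max(f(d)) - min(f(d))). The Python sketch below assumes that form; the handling of the case where every word has the same count is an added assumption.

```python
from collections import Counter


def normalized_term_frequency(words):
    """Min-max normalize raw counts so that every value lies in [0, 1]."""
    if not words:
        return {}
    counts = Counter(words)
    low, high = min(counts.values()), max(counts.values())
    if high == low:
        # Assumption: with identical counts, return 0.0 to avoid dividing by zero.
        return {word: 0.0 for word in counts}
    return {word: (count - low) / (high - low) for word, count in counts.items()}


doc = ["graph", "rank", "graph", "node", "edge", "rank", "graph"]
tf = normalized_term_frequency(doc)
print(tf)
```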

(4) based on the same document, the word head position value w_FP of the target word is calculated according to the following formula (6):

wherein:

PF(w) is the number of candidate words preceding the first occurrence of the target word in the corresponding document;

(5) based on the same document, the word tail position value w_PL of the target word is calculated according to the following formula (7):

wherein:

LP(w) is the number of candidate words following the last occurrence of the target word in the corresponding document;

(6) based on the same document, the word length value w_TL of the target word is calculated according to the following formula (8):

wherein:

L(w) is the length of the target word;

max(L(d)) is the maximum word length among the candidate words in the corresponding document;

min(L(d)) is the minimum word length among the candidate words in the corresponding document;

(7) the probability that a word with the target word's part of speech is a keyword is recorded as the part-of-speech value w_PSO of the target word; parts of speech include nouns, adjectives, verbs, prepositions, adverbs, auxiliary words and others; nouns, adjectives and verbs are more likely to be keywords, while prepositions, adverbs and auxiliary words are less likely; the probability for each part of speech is defined manually on the basis of experience and long-term statistics;

(8) the TFIDF value w_TFIDF of the target word is calculated according to the following formula (9):

wherein:

f(w) is the number of times the target word appears in the corresponding document;

∑_k f(w) is the total number of occurrences of all candidate words in the same document;

n is the number of documents in the analysis document set;

|{j: w ∈ n_j}| is the number of documents in the analysis document set that contain the target word;
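These four quantities match the usual TF-IDF construction: a term frequency f(w) / ∑_k f(w) multiplied by the logarithm of n over the number of documents containing the word. The Python sketch below assumes that standard form with a natural logarithm and no smoothing; the function name and the toy corpus are illustrative.

```python
import math
from collections import Counter


def tfidf(target, document, corpus):
    """TF = count / document length; IDF = log(n / documents containing the word)."""
    if not document:
        return 0.0
    tf = Counter(document)[target] / len(document)
    containing = sum(1 for doc in corpus if target in doc)
    if containing == 0:
        return 0.0  # assumption: a word absent from the corpus scores zero
    return tf * math.log(len(corpus) / containing)


corpus = [
    ["graph", "rank", "graph"],
    ["node", "edge"],
    ["rank", "edge", "node", "rank"],
]
score_graph = tfidf("graph", corpus[0], corpus)  # frequent here, rare elsewhere
score_rank = tfidf("rank", corpus[0], corpus)    # appears in two of three documents
print(score_graph, score_rank)
```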

(9) the average information entropy w_IH is calculated according to the following formula (10):

wherein:

f_wd is the frequency of the target word in the corresponding document;

f_w is the frequency of the target word across all documents in the analysis document set;

n is the number of documents within the analysis document set.
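The exact form of formula (10) cannot be read off the definitions alone. One formulation consistent with them treats f_wd / f_w as the share of the word's occurrences that falls in document d, takes the Shannon entropy of those shares, and divides by n. The Python sketch below implements that assumed form; it is an illustration, not the confirmed formula.

```python
import math


def average_information_entropy(target, corpus):
    """Assumed form: -(1/n) * sum over documents of p * log(p), with p = f_wd / f_w."""
    per_doc = [doc.count(target) for doc in corpus]
    total = sum(per_doc)
    if total == 0 or not corpus:
        return 0.0
    entropy = sum(-(f / total) * math.log(f / total) for f in per_doc if f > 0)
    return entropy / len(corpus)


corpus = [
    ["graph", "rank", "graph"],
    ["node", "edge"],
    ["rank", "edge", "node", "rank"],
]
spread = average_information_entropy("rank", corpus)         # split across two documents
concentrated = average_information_entropy("graph", corpus)  # confined to one document
print(spread, concentrated)
```

Under this reading a word confined to a single document has zero entropy, and the value grows as the word spreads more evenly over the document set.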

δ is the weight coefficient of the word average span value w_MTS of the target word;

a further coefficient weights the TFIDF value w_TFIDF of the target word;

w₀ is a constant term.

In step 1.3, each of the weight coefficients (α, β, γ, δ, ε, θ, μ and the rest) and the constant term w₀ can be specified manually or calculated according to the following formula (2):

n is the number of documents in the analysis document set;

y_i is the manually labeled fusion weight of the target word;

ŷ_i is the estimated fusion weight of the keyword;

λ is the adjustment coefficient.
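For reference, one common way to write a Lasso objective with the quantities defined above is sketched below, with ŷ_i denoting the estimated fusion weight obtained from formula (1). The 1/(2n) scaling, the summation over i = 1..n and the notation c_1, ..., c_9 for the nine weight coefficients are assumptions made for this sketch and are not confirmed by the source.

```latex
\min_{c_1,\dots,c_9,\,w_0}\;
\frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i-\hat{y}_i\bigr)^{2}
\;+\;\lambda\sum_{k=1}^{9}\lvert c_k\rvert
```

The L1 penalty drives some coefficients exactly to zero as λ grows, which is how Lasso regression removes unimportant features while keeping the important ones.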

Example 2:

step 2.1, obtaining the calculation formula of the fusion weight according to the method of Example 1;

d is an adjustment coefficient;

In(V_i) is the set of other nodes pointing to word node V_i;

y_j is the fusion weight of the j-th word node V_j pointing to word node V_i;

Out(V_j) is the set of other nodes pointed to by word node V_j;

y_jk is the fusion weight of word node V_j pointing to the k-th node it points to;

WS(V_j) is the importance weight of the j-th word node V_j;

The fusion weight y_j of the j-th word node V_j pointing to word node V_i is calculated according to formula (1): the multidimensional features w_TF1, w_FP, w_PL, w_MTS, w_PFL, w_TL, w_PSO, w_TFIDF and w_IH of word node V_j are acquired first, and y_j is then calculated according to formula (1).
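Assuming formula (11) has the usual weighted-TextRank shape, WS(V_i) = (1 - d) + d · Σ_j [y_j / Σ_k y_jk] · WS(V_j), the iteration and the final ranking can be sketched in Python as follows. Edge weights are passed in as a dictionary keyed by (source, target); the graph, the weight values and the default d = 0.85 are illustrative assumptions.

```python
def weighted_textrank(neighbors, edge_weight, d=0.85, tol=1e-6, max_iter=200):
    """Propagate scores along weighted edges until the largest change falls below tol."""
    if not neighbors:
        return {}
    scores = {node: 1.0 for node in neighbors}
    # Total outgoing weight of each node: the denominator of the update.
    out_total = {j: sum(edge_weight[(j, k)] for k in neighbors[j]) for j in neighbors}
    for _ in range(max_iter):
        updated = {}
        for i in neighbors:
            incoming = sum(
                edge_weight[(j, i)] / out_total[j] * scores[j]
                for j in neighbors[i]
                if out_total[j] > 0
            )
            updated[i] = (1 - d) + d * incoming
        change = max(abs(updated[n] - scores[n]) for n in neighbors)
        scores = updated
        if change < tol:
            break
    return scores


# Undirected co-occurrence graph: "a" is linked to both "b" and "c".
neighbors = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
# Hypothetical fusion weights per directed edge (source, target).
edge_weight = {("a", "b"): 0.6, ("a", "c"): 0.2, ("b", "a"): 0.5, ("c", "a"): 0.3}

scores = weighted_textrank(neighbors, edge_weight)
ranked = sorted(scores, key=scores.get, reverse=True)  # descending importance
print(ranked)
```

Node "a" passes three quarters of its score to "b" because that edge carries the larger weight, so "b" ends up ranked above "c".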

Beneficial effects: the method adopts the word average span and the word head-tail span, which better express how keywords are distributed; and because the fusion weights are solved by Lasso regression, which performs feature selection by itself, unimportant features are deleted and important features are retained, so that the accuracy of keyword extraction can be improved.

Example 3:

500 Chinese academic documents were collected from the Internet for keyword extraction. 400 of them were used to train the parameters to be estimated in the fusion weight formula (1) according to the method of Example 1, and the remaining 100 were used to test the keyword extraction performance of the proposed algorithm according to the method of Example 2. The test results of the proposed algorithm compared with the TextRank and TFIDF-TextRank algorithms, with window length K = 5, are recorded in Table 1.

TABLE 1 Test comparison of the proposed algorithm with TextRank and TFIDF-TextRank

Algorithm model	Precision	Recall
TextRank	0.49	0.38
TFIDF-TextRank	0.55	0.49
Example 2 method	0.61	0.57

Precision is defined as: precision = number of correctly extracted keywords / total number of extracted keywords;

Recall is defined as: recall = number of correctly extracted keywords / number of actual keywords.
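The two ratios defined above can be computed for a single document as in the Python sketch below; the first ratio is the quantity usually called precision. The keyword lists are made up for illustration and are unrelated to the figures in Table 1.

```python
def precision_recall(extracted, reference):
    """Return (precision, recall) for one document's extracted keywords."""
    extracted, reference = set(extracted), set(reference)
    correct = len(extracted & reference)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


precision, recall = precision_recall(
    ["keyword", "graph", "lasso", "window"],  # keywords returned by the algorithm
    ["keyword", "lasso", "entropy"],          # manually labeled keywords
)
print(precision, recall)
```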

Finally, it should be noted that the above description is only a preferred embodiment of the present invention, and that many similar changes can be made by those skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A multi-feature-fusion keyword weight calculation method, characterized by comprising the following steps:

step 1.1, collecting documents and selecting n documents as an analysis document set;

wherein:

a further coefficient weights the TFIDF value w_TFIDF of the target word;

w₀ is a constant term.

2. The multi-feature-fusion keyword weight calculation method according to claim 1, characterized in that: in step 1.3, each weight coefficient is calculated according to the following formula (2):

n is the number of documents in the analysis document set;

y_i is the manually labeled fusion weight of the target word;

ŷ_i is the estimated fusion weight of the keyword;

λ is the adjustment coefficient.

3. The multi-feature-fusion keyword weight calculation method according to claim 1, characterized in that: based on the same document, the word average span value w_MTS of the target word is calculated according to the following formula (3):

wherein: S_g(w) is the distance of the g-th span between successive occurrences of the target word in the corresponding document;

h is the number of spans that can be calculated for the target word in the corresponding document, and equals the number of times the target word appears in that document minus one.

4. The multi-feature-fusion keyword weight calculation method according to claim 1, characterized in that: based on the same document, the word head-tail span value w_PFL of the target word is calculated according to the following formula (4):

wherein:

FP(w) is the word head position value of the target word;

5. The multi-feature-fusion keyword weight calculation method according to claim 1, characterized in that: based on the same document, the normalized word frequency value w_TF1 of the target word is calculated according to the following formula (5):

wherein:

based on the same document, the word head position value w_FP of the target word is calculated according to the following formula (6):

wherein:

based on the same document, the word tail position value w_PL of the target word is calculated according to the following formula (7):

wherein:

based on the same document, the word length value w_TL of the target word is calculated according to the following formula (8):

wherein:

L(w) is the length of the target word;

the probability that a word with the target word's part of speech is a keyword is recorded as the part-of-speech value w_PSO of the target word.

6. The multi-feature-fusion keyword weight calculation method according to claim 1, characterized in that: the TFIDF value w_TFIDF of the target word is calculated according to the following formula (9):

wherein:

n is the number of documents in the analysis document set;

|{j: w ∈ n_j}| is the number of documents in the analysis document set that contain the target word;

the average information entropy w_IH is calculated according to the following formula (10):

wherein:

n is the number of documents within the analysis document set.

7. A multi-feature-fusion keyword extraction method, characterized by comprising the following steps:

step 2.1, obtaining the calculation formula of the fusion weight according to the method of any one of claims 1-6;

d is an adjustment coefficient;

In(V_i) is the set of other nodes pointing to word node V_i;

Out(V_j) is the set of other nodes pointed to by word node V_j;

y_jk is the fusion weight of word node V_j pointing to the k-th node it points to;

WS(V_j) is the importance weight of the j-th word node V_j;