CN107102985A - Improved multi-threaded keyword extraction technique for documents - Google Patents

Improved multi-threaded keyword extraction technique for documents

Info

Publication number
CN107102985A
CN107102985A (application CN201710268836.3A)
Authority
CN
China
Prior art keywords
word
text
vocabulary
threaded
above formula
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710268836.3A
Other languages
Chinese (zh)
Inventor
金平艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Yonglian Information Technology Co Ltd
Original Assignee
Sichuan Yonglian Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Yonglian Information Technology Co Ltd filed Critical Sichuan Yonglian Information Technology Co Ltd
Priority to CN201710268836.3A priority Critical patent/CN107102985A/en
Publication of CN107102985A publication Critical patent/CN107102985A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

An improved multi-threaded keyword extraction technique for documents. After a Chinese word-segmentation preprocessing step, a relatedness function is constructed from the difference of two co-occurrence degree functions and converted into a multi-threaded network model. An objective function is then constructed to extract conjunctions between themes, and a fork function merges these conjunctions into the multi-threaded network model, yielding a new model graph; finally the top-ranked vocabulary items are extracted as the text's keywords. The method is highly accurate and of practical value: it computes each vocabulary item's contribution to the text's theme, accounts for multi-threaded structure while still distinguishing different features, and extracts document features with a precise algorithm on the first keyword-extraction pass, laying better groundwork for subsequent document keyword extraction and providing a sound theoretical basis for later text-similarity and text-clustering work.

Description

Improved multi-threaded keyword extraction technique for documents
Technical field
The present invention relates to the field of Semantic Web technology, and in particular to an improved multi-threaded keyword extraction technique for documents.
Background technology
Keywords summarize an article's theme, usually as words or phrases, and are the most condensed expression of a text's subject; they let a reader grasp the gist of an article quickly, saving reading time. Document keywords thus help users rapidly locate the documents they need within large collections. However, apart from scientific papers, most documents carry no keywords, in particular the vast number of web pages on the Internet. Faced with massive text data, manual keyword extraction is laborious and subjective, and poor extraction also harms downstream applications. Traditional keyword extraction algorithms generally ignore document-structure features; the loss of this structural information reduces extraction accuracy and in particular fails to surface the vocabulary that truly reflects the text's content. Existing keyword extraction algorithms based on complex networks or graph models simply use word forms as network nodes when building the text network or graph. Although such algorithms preserve the text's structural information to the greatest extent, the absence of semantic annotation means the extracted keywords lack semantic interpretability and may be ambiguous. To improve the state of text retrieval, researchers are actively studying artificial-intelligence and natural-language-processing techniques, and many scholars have proposed machine-based automatic keyword extraction. Automatic keyword extraction is thus a foundation and core technology of automatic text processing and a key technique for the efficiency and accuracy of information retrieval, since keywords express the text's theme. To meet this demand, the present invention provides an improved multi-threaded keyword extraction technique for documents.
The content of the invention
To find words that are not merely high-frequency but contribute strongly to the theme of a multi-threaded document and use them as keywords, to extract topic words from documents automatically, and to remedy the limited precision of conventional keyword extraction methods, the invention provides an improved multi-threaded keyword extraction technique for documents.
In order to solve the above problems, the present invention is achieved by the following technical solutions:
Step 1: Segment the text using a Chinese word-segmentation algorithm;
Step 2: Remove stop words from the text vocabulary according to the stop-word list, obtaining the word set w;
Step 3: Construct the relatedness function RE(w_i, w_j), sort the word set w in descending order, and take the top n words to form a multi-threaded network model M;
Step 4: Construct the objective function and determine the conjunctions LINK(C) between different themes;
Step 5: Construct the fork function and merge the conjunctions into the multi-threaded network model; the resulting model graph is denoted M′.
The present invention has the following advantages:
1. The method achieves higher accuracy than the keyword sets obtained by the traditional term frequency-inverse document frequency method.
2. By mapping lexical semantic relations into the topic-network model graph, it accounts for multi-threaded structure while distinguishing features between themes, so the extracted keywords better match empirical expectations.
3. It provides a sound theoretical basis for subsequent text-similarity and text-clustering techniques.
4. The algorithm has greater practical value.
5. The method precisely computes each feature word's contribution to the text's theme.
6. On the first keyword-extraction pass, the method obtains more accurate document features with a precise algorithm, laying better groundwork for subsequent document keyword extraction.
Brief description of the drawings
Fig. 1: Flow chart of the improved multi-threaded document keyword extraction technique
Fig. 2: Illustration of the n-gram segmentation method
Fig. 3: Flow chart of the Chinese text preprocessing process
Fig. 4: Multi-threaded network model graph M formed from n words
Fig. 5: Multi-threaded network model graph M′
Embodiment
To find non-high-frequency words that contribute strongly to the theme of a multi-threaded document and use them as keywords, to extract topic words from documents automatically, and to address the limited precision of conventional keyword extraction methods, the invention is described in detail with reference to Figs. 1-5. The specific implementation steps are as follows:
Step 1: Segment the text using a Chinese word-segmentation algorithm. The specific segmentation process is as follows:
Step 1.1: Scan the character string to be segmented in full and look up matches against the《Dictionary for word segmentation》; any string found in the dictionary is marked as a word. If the dictionary contains no match, the single character is split off as a word on its own. Repeat until the character string is empty.
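Step 1.1 is a standard greedy dictionary scan. A minimal sketch in Python, assuming a longest-match-first strategy and a toy dictionary (the contents of the actual《Dictionary for word segmentation》are not specified in the source):

```python
def forward_max_match(sentence, dictionary, max_len=4):
    """Greedy longest-first dictionary matching, as in Step 1.1.

    Scans the string left to right; at each position it tries the
    longest candidate first and falls back to splitting off a single
    character when no dictionary entry matches.
    """
    words = []
    i = 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

# Illustrative toy dictionary; a real segmenter would load the full lexicon.
toy_dict = {"keyword", "extract", "ion"}
print(forward_max_match("keywordextraction", toy_dict, max_len=8))
```

In a Chinese setting the same loop runs over Hanzi characters; the fallback branch is exactly the source's "split off the single character as a word" rule.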
Step 1.2: Using probability statistics, expand the sentence to be segmented into a network structure generating the n possible segmentation combinations. Label the sequential nodes of this structure S, M1, M2, M3, M4, M5, E in turn; the structure is shown in Fig. 2.
Step 1.3: Using an information-theoretic method, assign weights to each edge of the network structure above. The specific calculation is as follows:
According to the《Dictionary for word segmentation》, the i-th path contains n_i words (matched dictionary words plus unmatched single characters); the word counts of the n paths thus form the set (n_1, n_2, ..., n_n).
Take min(·) = min(n_1, n_2, ..., n_n).
For the remaining (n − m) paths, solve for the weight of each adjacent pair on the path.
In the statistics corpus, compute the information content X(C_i) of each word, then the co-occurrence information content X(C_i, C_{i+1}) of adjacent words on a path, using the following formulas:
X(C_i) = |x(C_i)_1 − x(C_i)_2|
where x(C_i)_1 is the information content of word C_i in the text corpus and x(C_i)_2 is the information content of the texts containing C_i.
x(C_i)_1 = −p(C_i)_1 ln p(C_i)_1
where p(C_i)_1 is the probability of C_i in the text corpus and n is the number of corpus texts containing C_i.
x(C_i)_2 = −p(C_i)_2 ln p(C_i)_2
where p(C_i)_2 is the probability that a text contains C_i and N is the total number of texts in the statistics corpus.
Similarly, X(C_i, C_{i+1}) = |x(C_i, C_{i+1})_1 − x(C_i, C_{i+1})_2|
where x(C_i, C_{i+1})_1 is the co-occurrence information content of the pair (C_i, C_{i+1}) in the text corpus and x(C_i, C_{i+1})_2 is the information content of the texts in which the adjacent pair (C_i, C_{i+1}) co-occurs.
Similarly, x(C_i, C_{i+1})_1 = −p(C_i, C_{i+1})_1 ln p(C_i, C_{i+1})_1
where p(C_i, C_{i+1})_1 is the co-occurrence probability of (C_i, C_{i+1}) in the text corpus and m is the number of texts in which the pair co-occurs.
x(C_i, C_{i+1})_2 = −p(C_i, C_{i+1})_2 ln p(C_i, C_{i+1})_2
where p(C_i, C_{i+1})_2 is the probability that a text contains the adjacent pair (C_i, C_{i+1}).
In summary, the weight of each adjacent pair on a path is
w(C_i, C_{i+1}) = X(C_i) + X(C_{i+1}) − 2X(C_i, C_{i+1})
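The information-content and edge-weight formulas above translate directly into code. A minimal sketch, assuming the four probabilities have already been estimated from the statistics corpus (the function names are illustrative, not from the source):

```python
import math

def info_amount(p_corpus, p_docs):
    """X(C) = |x(C)_1 - x(C)_2|, with x = -p ln p for each probability.

    p_corpus: probability of the word in the corpus (the x(C)_1 term);
    p_docs:   probability that a text contains the word (the x(C)_2 term).
    """
    x1 = -p_corpus * math.log(p_corpus)
    x2 = -p_docs * math.log(p_docs)
    return abs(x1 - x2)

def edge_weight(p_i, pd_i, p_j, pd_j, p_ij, pd_ij):
    """w(C_i, C_{i+1}) = X(C_i) + X(C_{i+1}) - 2 X(C_i, C_{i+1})."""
    return (info_amount(p_i, pd_i)
            + info_amount(p_j, pd_j)
            - 2 * info_amount(p_ij, pd_ij))
```

Note that when the corpus probability and the containing-text probability coincide, X(C) vanishes, so the edge weight is driven by how differently the pair behaves from its members.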
Step 1.4: Find the path of maximum weight; it gives the segmentation result of the sentence. The specific calculation is as follows:
There are n paths of differing lengths; let the set of path lengths be (L_1, L_2, ..., L_n).
Suppose that taking the paths with the fewest words eliminates m paths, m < n, leaving (n − m) paths; let their path-length set be (L_{S_1}, L_{S_2}, ..., L_{S_{n−m}}).
The weight of each remaining path is then the sum of the weights of its edges (each computed one by one as in Step 1.3) divided by L_{S_j}, the length of path S_j.
The path of maximum weight is selected.
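A sketch of Step 1.4's path selection, assuming (as the normalisation by L_{S_j} in the description suggests) that a path's score is its summed edge weights divided by its length; the toy weight function below is illustrative only:

```python
def best_segmentation(paths, edge_weight_fn):
    """Pick the candidate segmentation of maximum weight (Step 1.4).

    `paths` is a list of candidate segmentations (lists of words).
    The per-path score is assumed to be the sum of adjacent-pair edge
    weights divided by the path length L_{S_j}.
    """
    def score(path):
        total = sum(edge_weight_fn(a, b) for a, b in zip(path, path[1:]))
        return total / len(path)
    return max(paths, key=score)

# Toy weight: rewards keeping longer words together.
toy_w = lambda a, b: len(a) * len(b)
candidates = [["ab", "cd"], ["a", "b", "cd"], ["a", "b", "c", "d"]]
print(best_segmentation(candidates, toy_w))
```

With a real `edge_weight_fn` built from the corpus statistics of Step 1.3, the winner is the segmentation whose adjacent words carry the most mutual information per node.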
Step 2: Remove stop words from the text vocabulary according to the stop-word list, obtaining the word set w. Specifically:
Stop words are words that occur frequently in the text but contribute little to marking its content. Removing them means comparing each candidate term against the stop-word list and deleting the term on a match.
Combining segmentation and stop-word removal, the Chinese text preprocessing flow is shown in Fig. 3.
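Step 2 reduces to a set-membership filter. A minimal sketch with a hypothetical stop-word list (the source does not specify which list is used):

```python
def remove_stop_words(words, stop_list):
    """Step 2: drop every token that appears in the stop-word list."""
    return [w for w in words if w not in stop_list]

stop = {"the", "of", "is"}
print(remove_stop_words(["extraction", "of", "the", "keyword", "is", "done"], stop))
```

Using a set (rather than a list) for `stop_list` keeps each membership test O(1), which matters when filtering long documents.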
Step 3: Construct the relatedness function RE(w_i, w_j), sort the word set w in descending order, and take the top n words to form the multi-threaded network model M. The specific calculation is as follows:
Relatedness function RE(w_i, w_j):
where ε is a correction factor and d(w_i, w_j) is the difference between the two measures for the pair (w_i, w_j):
d(w_i, w_j) = |R_1(w_i, w_j) − R_2(w_i, w_j)|
Here R_1(w_i, w_j) and R_2(w_i, w_j) are both inter-vocabulary relatedness values; g(w_i/w_j) is the co-occurrence degree of w_i relative to w_j and g(w_j/w_i) that of w_j relative to w_i; n(w_i, w_j) is the number of sentences in which the pair (w_i, w_j) co-occurs, and n(w_i), n(w_j) are the numbers of occurrences of w_i and w_j in the document.
The top n words are extracted as candidate keywords of the text, i.e., the top n by descending RE(w_i, w_j) value.
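The exact RE(w_i, w_j) formula does not survive in this text, but its ingredients do: sentence-level pair counts n(w_i, w_j), per-word counts, and the conditional co-occurrence degrees g. The sketch below counts those quantities and, purely as a stand-in for the missing formula, averages the two conditional degrees into a symmetric score:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_degrees(sentences):
    """Count the quantities RE(w_i, w_j) is built from:
    per-sentence word occurrences and sentence-level pair co-occurrences."""
    word_count = Counter()
    pair_count = Counter()
    for sent in sentences:
        seen = set(sent)                 # count each word once per sentence
        word_count.update(seen)
        for a, b in combinations(sorted(seen), 2):
            pair_count[(a, b)] += 1
    return word_count, pair_count

def relatedness(wi, wj, word_count, pair_count):
    """Symmetric stand-in for RE: averages g(w_i/w_j) = n(w_i, w_j)/n(w_j)
    and g(w_j/w_i) = n(w_i, w_j)/n(w_i). The patent's actual formula also
    involves a correction factor ε and the difference d(w_i, w_j)."""
    n_ij = pair_count[tuple(sorted((wi, wj)))]
    g_ij = n_ij / word_count[wj]
    g_ji = n_ij / word_count[wi]
    return (g_ij + g_ji) / 2
```

Ranking all pairs by this score and keeping the top-n vocabulary reproduces the shape of Step 3, even though the weighting details differ from the unrecoverable original.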
Step 4: Construct the objective function and determine the conjunctions LINK(C) between different themes. The specific calculation is as follows:
Objective function:
where ρ is a correction factor and T_j is the theme influence factor.
Here j indexes the j-th theme, g is the number of themes, and h, a variable whose value differs from theme to theme, is the number of vocabulary items in a theme; N is the number of keyword vocabulary items in the j-th theme; the formula also involves the number of times conjunction C occurs in theme j and the similarity between conjunction C and the theme's vocabulary, which can be computed by conventional methods. α and β are the respective influence coefficients, generally with β > α and α + β = 1; their optimal values can be found by experiment. The theme term expresses the degree of influence of theme Z_j on the document.
According to the objective-function value, select the top m conjunctions LINK(C) in descending order.
Step 5: Construct the fork function, merge the conjunctions into the multi-threaded network model, and denote the resulting model graph M′. The calculation is as follows:
Fork function:
where G(C_i′/w_j′) is the co-occurrence degree of C_i′ relative to w_j′ and G(w_j′/C_i′) that of w_j′ relative to C_i′; M_f is the common-parent-node density of the two vocabulary items' ontology concepts, S_f their common-parent-node depth, n_f the maximum node density of the tree containing the corresponding parent node in the sememe network, and d_f the degree of that tree.
Similarly, n(C_i′, w_j′) is the number of sentences in which conjunction C_i′ and word-set item w_j′ co-occur; N(w_j′) is the number of occurrences of w_j′ in the document and N(C_i′) that of C_i′, with N(C_i′) ≠ N(w_j′) and n(C_i′, w_j′) = n(w_j′, C_i′).
According to the fork-function value, take n − 1 vocabulary pairs in descending order, yielding the document's n keywords.
The pseudo-code for the improved multi-threaded document keyword extraction technique is as follows:
Input: a document
Output: the core keywords extracted from the document.
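The pseudo-code body itself does not survive in this text, so the following is only an illustrative end-to-end skeleton: whitespace splitting stands in for the Step 1 segmenter, plain frequency ranking stands in for Step 3's RE ordering, and Steps 4-5 (theme conjunctions and the fork function) are omitted:

```python
def extract_keywords(document, stop_list, n=10):
    """Illustrative pipeline skeleton for the five steps.

    document: list of sentence strings; stop_list: set of stop words;
    returns the top-n tokens, ties broken alphabetically.
    """
    words = []
    for sentence in document:
        words.extend(sentence.split())          # stand-in for Step 1 segmentation
    words = [w for w in words if w not in stop_list]  # Step 2: stop-word removal
    counts = {}
    for w in words:                             # stand-in for Step 3's RE ranking
        counts[w] = counts.get(w, 0) + 1
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:n]

doc = ["keyword extraction from text", "text keyword ranking"]
print(extract_keywords(doc, {"from"}, n=2))
```

Swapping the frequency ranking for the relatedness, objective, and fork functions of Steps 3-5 would recover the full method.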

Claims (3)

1. An improved multi-threaded keyword extraction technique for documents (the invention relates to the field of Semantic Web technology, and in particular to multi-threaded keyword extraction in documents), characterised by comprising the following steps:
Step 1: Segment the text using a Chinese word-segmentation algorithm; the segmentation process is as follows:
Step 1.1: Scan the character string to be segmented in full and look up matches against the segmentation dictionary; any string found in the dictionary is marked as a word; if the dictionary contains no match, the single character is split off as a word; repeat until the character string is empty.
Step 1.2: Using probability statistics, expand the sentence to be segmented into a network structure generating the possible segmentation combinations; label the sequential nodes of this structure in turn; the structure is shown in Fig. 2.
Step 1.3: Using an information-theoretic method, assign weights to each edge of the network structure; the specific calculation is as follows:
According to the segmentation dictionary, the i-th path contains n_i words (matched dictionary words plus unmatched single characters); the word counts of the n paths form the set (n_1, n_2, ..., n_n), and the minimum is taken.
For the remaining (n − m) paths, solve for the weight of each adjacent pair on the path.
In the statistics corpus, compute the information content of each word and the co-occurrence information content of adjacent words:
X(C_i) = |x(C_i)_1 − x(C_i)_2|, where x(C_i)_1 = −p(C_i)_1 ln p(C_i)_1 is the information content of C_i in the text corpus (p(C_i)_1 its corpus probability, n the number of corpus texts containing C_i), and x(C_i)_2 = −p(C_i)_2 ln p(C_i)_2 is the information content of the texts containing C_i (p(C_i)_2 the probability that a text contains C_i, N the total number of corpus texts).
Similarly X(C_i, C_{i+1}) = |x(C_i, C_{i+1})_1 − x(C_i, C_{i+1})_2|, with x(C_i, C_{i+1})_1 = −p(C_i, C_{i+1})_1 ln p(C_i, C_{i+1})_1 (p(C_i, C_{i+1})_1 the co-occurrence probability of the pair in the text corpus, m the number of texts in which the pair co-occurs) and x(C_i, C_{i+1})_2 = −p(C_i, C_{i+1})_2 ln p(C_i, C_{i+1})_2 (p(C_i, C_{i+1})_2 the probability that a text contains the adjacent pair).
In summary, the weight of each adjacent pair is w(C_i, C_{i+1}) = X(C_i) + X(C_{i+1}) − 2X(C_i, C_{i+1}).
Step 1.4: Find the path of maximum weight; it gives the segmentation result. There are n paths of differing lengths, with length set (L_1, L_2, ..., L_n); suppose that taking the paths with the fewest words eliminates m paths, m < n, leaving (n − m) paths with length set (L_{S_1}, ..., L_{S_{n−m}}); the weight of each remaining path is the sum of its edge weights (computed per Step 1.3) divided by L_{S_j}, the length of path S_j; select the path of maximum weight.
Step 2: Remove stop words from the text vocabulary according to the stop-word list, obtaining the word set. Specifically: stop words occur frequently in the text but contribute little to marking its content; removing them means comparing each candidate term against the stop-word list and deleting the term on a match. Combining segmentation and stop-word removal, the Chinese text preprocessing flow is shown in Fig. 3.
Step 3: Construct the relatedness function RE(w_i, w_j), sort the word set w in descending order, and take the top n words to form a multi-threaded network model M;
Step 4: Construct the objective function and determine the conjunctions LINK(C) between different themes;
Step 5: Construct the fork function and merge the conjunctions into the multi-threaded network model; the model graph is denoted M′; the calculation is as follows:
Fork function:
where G(C_i′/w_j′) is the co-occurrence degree of C_i′ relative to w_j′ and G(w_j′/C_i′) that of w_j′ relative to C_i′; M_f is the common-parent-node density of the two vocabulary items' ontology concepts, S_f their common-parent-node depth, n_f the maximum node density of the tree containing the corresponding parent node in the sememe network, and d_f the degree of that tree.
Similarly, n(C_i′, w_j′) is the number of sentences in which conjunction C_i′ and word-set item w_j′ co-occur; N(w_j′) and N(C_i′) are their occurrence counts in the document, with N(C_i′) ≠ N(w_j′) and n(C_i′, w_j′) = n(w_j′, C_i′).
According to the fork-function value, take n − 1 vocabulary pairs in descending order, yielding the document's n keywords.
2. The improved multi-threaded keyword extraction technique according to claim 1, characterised in that the specific calculation in Step 3 above is as follows:
Step 3: Construct the relatedness function RE(w_i, w_j), sort the word set w in descending order, and take the top n words to form a multi-threaded network model M; the calculation is as follows:
Relatedness function RE(w_i, w_j):
where ε is a correction factor and d(w_i, w_j) = |R_1(w_i, w_j) − R_2(w_i, w_j)| is the difference for the pair (w_i, w_j); R_1 and R_2 are both inter-vocabulary relatedness values; g(w_i/w_j) and g(w_j/w_i) are the co-occurrence degrees of w_i relative to w_j and of w_j relative to w_i; n(w_i, w_j) is the number of sentences in which the pair co-occurs, and n(w_i), n(w_j) are the words' occurrence counts in the document.
The top n words are extracted as the text's keywords, i.e., the top n by descending RE(w_i, w_j) value.
3. The improved multi-threaded keyword extraction technique according to claim 1, characterised in that the specific calculation in Step 4 above is as follows:
Step 4: Construct the objective function and determine the conjunctions between different themes; the calculation is as follows:
Objective function:
where ρ is a correction factor and T_j is the theme influence factor; j indexes the j-th theme, g is the number of themes, and h, a per-theme variable, is the number of vocabulary items in a theme; N is the number of keyword vocabulary items in the j-th theme; the formula also involves the number of times conjunction C occurs in theme j and the similarity between conjunction C and the theme's vocabulary, computable by conventional methods; α and β are the respective influence coefficients, generally with β > α and α + β = 1, their optimal values found by experiment; the theme term expresses theme Z_j's degree of influence on the document.
According to the objective-function value, select the top m conjunctions in descending order.
CN201710268836.3A 2017-04-23 2017-04-23 Multi-threaded keyword extraction techniques in improved document Pending CN107102985A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710268836.3A CN107102985A (en) 2017-04-23 2017-04-23 Multi-threaded keyword extraction techniques in improved document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710268836.3A CN107102985A (en) 2017-04-23 2017-04-23 Multi-threaded keyword extraction techniques in improved document

Publications (1)

Publication Number Publication Date
CN107102985A true CN107102985A (en) 2017-08-29

Family

ID=59657043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710268836.3A Pending CN107102985A (en) 2017-04-23 2017-04-23 Multi-threaded keyword extraction techniques in improved document

Country Status (1)

Country Link
CN (1) CN107102985A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108920660A (en) * 2018-07-04 2018-11-30 中国银行股份有限公司 Keyword weight acquisition methods, device, electronic equipment and readable storage medium storing program for executing
CN109522392A (en) * 2018-10-11 2019-03-26 平安科技(深圳)有限公司 Voice-based search method, server and computer readable storage medium
WO2019165678A1 (en) * 2018-03-02 2019-09-06 广东技术师范学院 Keyword extraction method for mooc
CN110263345A (en) * 2019-06-26 2019-09-20 北京百度网讯科技有限公司 Keyword extracting method, device and storage medium
CN110348133A (en) * 2019-07-15 2019-10-18 西南交通大学 A kind of bullet train three-dimensional objects structure technology effect figure building system and method
CN110717047A (en) * 2019-10-22 2020-01-21 湖南科技大学 Web service classification method based on graph convolution neural network
CN112597776A (en) * 2021-03-08 2021-04-02 中译语通科技股份有限公司 Keyword extraction method and system
CN115713085A (en) * 2022-10-31 2023-02-24 北京市农林科学院 Document theme content analysis method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105243065A (en) * 2014-06-24 2016-01-13 中兴通讯股份有限公司 Material information output method and system
CN105843795A (en) * 2016-03-21 2016-08-10 华南理工大学 Topic model based document keyword extraction method and system
CN106570120A (en) * 2016-11-02 2017-04-19 四川用联信息技术有限公司 Process for realizing searching engine optimization through improved keyword optimization

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105243065A (en) * 2014-06-24 2016-01-13 中兴通讯股份有限公司 Material information output method and system
CN105843795A (en) * 2016-03-21 2016-08-10 华南理工大学 Topic model based document keyword extraction method and system
CN106570120A (en) * 2016-11-02 2017-04-19 四川用联信息技术有限公司 Process for realizing searching engine optimization through improved keyword optimization

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
宁建飞等: "融合Word2vec与TextRank的关键词抽取研究", 《现代图书情报技术》 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019165678A1 (en) * 2018-03-02 2019-09-06 广东技术师范学院 Keyword extraction method for mooc
CN108920660B (en) * 2018-07-04 2020-11-20 中国银行股份有限公司 Keyword weight obtaining method and device, electronic equipment and readable storage medium
CN108920660A (en) * 2018-07-04 2018-11-30 中国银行股份有限公司 Keyword weight acquisition methods, device, electronic equipment and readable storage medium storing program for executing
CN109522392A (en) * 2018-10-11 2019-03-26 平安科技(深圳)有限公司 Voice-based search method, server and computer readable storage medium
CN110263345A (en) * 2019-06-26 2019-09-20 北京百度网讯科技有限公司 Keyword extracting method, device and storage medium
CN110263345B (en) * 2019-06-26 2023-09-05 北京百度网讯科技有限公司 Keyword extraction method, keyword extraction device and storage medium
CN110348133A (en) * 2019-07-15 2019-10-18 西南交通大学 A kind of bullet train three-dimensional objects structure technology effect figure building system and method
CN110348133B (en) * 2019-07-15 2022-08-19 西南交通大学 System and method for constructing high-speed train three-dimensional product structure technical effect diagram
CN110717047A (en) * 2019-10-22 2020-01-21 湖南科技大学 Web service classification method based on graph convolution neural network
CN110717047B (en) * 2019-10-22 2022-06-28 湖南科技大学 Web service classification method based on graph convolution neural network
CN112597776A (en) * 2021-03-08 2021-04-02 中译语通科技股份有限公司 Keyword extraction method and system
CN115713085A (en) * 2022-10-31 2023-02-24 北京市农林科学院 Document theme content analysis method and device
CN115713085B (en) * 2022-10-31 2023-11-07 北京市农林科学院 Method and device for analyzing literature topic content

Similar Documents

Publication Publication Date Title
CN107102985A (en) Multi-threaded keyword extraction techniques in improved document
CN106598940A (en) Text similarity solution algorithm based on global optimization of keyword quality
CN106970910B (en) Keyword extraction method and device based on graph model
CN108681574B (en) Text abstract-based non-fact question-answer selection method and system
Wen et al. Research on keyword extraction based on word2vec weighted textrank
CN108763213A (en) Theme feature text key word extracting method
CN106610951A (en) Improved text similarity solving algorithm based on semantic analysis
CN111680488B (en) Cross-language entity alignment method based on knowledge graph multi-view information
CN106776562A (en) A kind of keyword extracting method and extraction system
CN106611041A (en) New text similarity solution method
CN106570112A (en) Improved ant colony algorithm-based text clustering realization method
CN106598941A (en) Algorithm for globally optimizing quality of text keywords
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
CN106528621A (en) Improved density text clustering algorithm
CN111625622B (en) Domain ontology construction method and device, electronic equipment and storage medium
CN106610954A (en) Text feature word extraction method based on statistics
CN106610952A (en) Mixed text feature word extraction method
CN106610949A (en) Text feature extraction method based on semantic analysis
CN107092595A (en) New keyword extraction techniques
CN106570120A (en) Process for realizing searching engine optimization through improved keyword optimization
CN113268606A (en) Knowledge graph construction method and device
CN106610953A (en) Method for solving text similarity based on Gini index
CN107038155A (en) The extracting method of text feature is realized based on improved small-world network model
CN106528726A (en) Keyword optimization-based search engine optimization realization technology
CN106776678A (en) Search engine optimization technology is realized in new keyword optimization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170829