CN103336806B

CN103336806B - A kind of key word sort method that the inherent of spacing and external pattern entropy difference occur based on word

Info

Publication number: CN103336806B
Application number: CN201310253678.6A
Authority: CN
Inventors: 杨震; 司书勇; 雷建军; 范科峰; 赖英旭
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2013-06-24
Filing date: 2013-06-24
Publication date: 2016-08-10
Anticipated expiration: 2033-06-24
Also published as: CN103336806A

Abstract

The present invention proposes a kind of based on being there is the method that the comentropy difference of the inherent of spacing and external pattern carries out key word sequence by word, belongs to Text extraction field.This method thinks that the appearance of key word is affected by two patterns: (1) inherent pattern, the statistical property of the key word position being described in a topic；(2) external pattern, describes the statistical attribute that in text, topic bunch occurs.On real text, experimental result finds, a word occurs that the interior external schema of spacing and external pattern information entropy difference are the biggest, then he is that the probability of key word is the biggest.

Description

A kind of key word sequence that the inherent of spacing and external pattern entropy difference occur based on word Method

Technical field

The present invention relates to keyword extraction and the sort method of a kind of novel text, belong to Text extraction field.

Background technology

Along with the deep development of the Internet, the quantity of information on network is increasing, and the means simultaneously obtaining information are the most more come The most convenient.But meanwhile Internet user suffers from a difficult problem for information explosion.In order to solve such difficult problem, it would be desirable to energy Enough quickly from magnanimity information, find interested part.This just requires that we can extract key from text message Word.

Traditional method is thought, if a word is identified as key word, then he certainly exists significant statistical nature.H. P. Luhn proposes original keyword extracting method.In the method for Luhn, after rejecting common word and rare word, close Keyword is ranked up by word frequency.Since then, method based on word frequency and improvement thereof are by extensive discussions.But, the method for word frequency But cannot by the most dramatically different close for word frequency word separately.M. Ortuno, J. P. Herrera and P. Carpena carries Go out to utilize the method for word spatial characteristics to detect key word.But the method being equally based on distribution cannot be by spatial distribution phase But the near word that importance is the most dramatically different separates.

The present invention proposes a kind of based on being occurred that by word the comentropy difference of the inherent of spacing and external pattern carries out key word The method of sequence.This method thinks that the appearance of key word is affected by two patterns: (1) inherent pattern, is described in words The statistical property of the key word position in topic；(2) external pattern, describes the statistical attribute that in text, topic bunch occurs.True literary composition In basis, experimental result finds, a word occurs that the interior external schema of spacing and external pattern information entropy difference are the biggest, then he is crucial The probability of word is the biggest.

Summary of the invention

Step (1) obtains text

Obtaining text, text is made up of the sentence of some numbers.

Step (2) Text Pretreatment

Step (2.1) removes all of punctuation mark, and all of letter is converted to small letter.

Step (2.2), for English text, carries out participle based on simple space.English different morphologies are as different It is two different words that word, such as " organ " are treated as with " organs ".

Step (2.3), for Chinese text, uses conventional participle software to carry out participle.

Step (3) word occurs that the inherent of spacing finds with external external schema

There is position in step (3.1) mark word.Assuming that text size is N, a word A occurs i.e. m word the most altogether Frequently, its position is represented by t₁, t₂, t₃... t_m, represent the word A t at text respectively₁, t₂, t₃... t_mPosition occurs.

Step (3.2) calculates word and location gap occurs.Word A occurs that spacing can be expressed as d_i=t_i+1-t_i.μ is expressed as d_i's Meansigma methods, i.e. average headway.

Step (3.3) divides word and the inherent of spacing and external pattern occurs.If d_i≤ μ, then d_iIt is classified as inherent mould Formula.In other words, certain for given word occurs, if it and word spacing d that position occurs next time_iLess than or equal to flat All spacing μ, then just d_iIt is classified as inherent pattern.Similarly, if d_i＞ μ, then d_iIt is classified as external pattern.

Step (3.4) calculates word and the inherent of spacing and external pattern entropy occurs.d^A ={d_i |d_i≤ μ } represent all d_i The set of≤μ.So one word occurs that the entropy of the inherent pattern of spacing is defined as:

H (d^{A}) = - \underset{d &Element; d^{A}}{Σ} P_{d} \log_{2} P_{d} - - - (1)

Here d is the spacing of word, d belong to 1,2,3 ... N}, and P_dRepresent is at d^AIt is general that middle d occurs Rate.At d^AThe word number that middle d occurs is n_d, d^AMiddle data amount check is S_A, P_d= n_d/S_A。

d^B= {d_i |d_i> μ represent all d_i> set of μ.So one word occurs that the entropy of the external pattern of spacing is fixed Justice is:

H (d^{B}) = - \underset{d &Element; d^{B}}{Σ} P_{d} \log_{2} P_{d} - - - (2)

Here d is also spacing, d belong to 1,2,3 ... N}, and P_dRepresent is at d^BIt is general that middle d occurs Rate.At d^BThe word number that middle d occurs is n_d, d^BMiddle data amount check is S_B, P_d= n_d/S_B。

Step (3.5) calculates word and occurs that the inherent of spacing and external pattern entropy are poor

ED^q(d)=(H(d^A))^q-(H(d^B))^q(3)

Wherein, q ∈ N⁺.Q takes positive integer, can be appreciated that can to obtain the effect when q is 2 best by experiment below.

The difference normalization of step (3.6) entropy calculates.Normalized entropy difference ED_norIt is defined as follows:

{ED}_{nor}^{q} (d) = \frac{{ED}^{q} (d)}{| {ED}_{geo}^{q} (d) |} - - - (4)

Wherein,

{ED}_{geo}^{q} (d) = {(- \underset{d \leq N / m}{Σ} \frac{{p (1 - p)}^{d - 1}}{p^{A}} \log \frac{{p (1 - p)}^{d - 1}}{p^{A}})}^{q} - {(- \underset{d &GreaterEqual; N / m}{Σ} \frac{p {(1 - p)}^{d - 1}}{p^{B}} \log \frac{p {(1 - p)}^{d - 1}}{p^{B}})}^{q} - - - (5)

Wherein

p^{A} = \underset{d \leq N / m}{Σ} p {(1 - p)}^{d - 1}, p^{B} = \underset{d > N / m}{Σ} p {(1 - p)}^{d - 1} .

D is that spacing takes positive integer.What N/m represented is The expectation of spacing.What p=m/N represented is word probability in the text, and m is the word frequency of corresponding words, and it is vocabulary sum that N represents.p(1- p)^d-1Be equivalent to d heavily Bernoulli trials.

Normalized purpose is to compare in order to the word of different p can be put under same standard, prevents due to p Difference cause entropy difference to have significantly difference (i.e. in order to eliminate the impact on experiment effect of the p factor).

Formula meaningWord occurs being smaller than the probability sum of average headway in the text.p^B Similar.Represent is under conditions of word is smaller than average headway, and the conditional probability that spacing is d occurs in word.

Vocabulary is ranked up by step (4) according to entropy difference relative size

Accompanying drawing explanation

Fig. 1: in text, word occurs that the internal schema of location gap divides schematic diagram with external schema.

The schematic diagram of Fig. 2: various boundary.A) boundary condition C_-1.Assume that adding word position-1 and N occurs.B) border Condition C₀.Assume that adding word position 0 and N+1 occurs.C) boundary condition C_c.Assume that joining end to end of text, each grid represent one The position of individual word.

Fig. 3:At boundary condition C_-1,C₀And C_cUnder the conditions of work as q=1,2 ... top-n accuracy rate when 5.

Fig. 4: key word detectsAt q=1,2 ... 5 and boundary condition C_-1,C₀And C_cUnder Average Accuracy (AP)。

Detailed description of the invention

Step (1) obtains text

Obtaining text, text is made up of the sentence of some numbers.

Testing material collection is Charles Darwin " The Origin of Species ", uses W.S. Dallas to carry The key word annex of confession is as evaluation and test foundation.

Step (2) Text Pretreatment

Step (2.1) removes all of punctuation mark, and all of letter is converted to small letter；Catalogue in literary composition, vocabulary, And index all removes from text.

Step (2.2), for English text, carries out participle based on simple space.First remove stop words, English different words Shape is as different words, and such as " organ " treats as with " organs " is two different words.Count word frequency m of each word, And the most total word quantity N.Calculate the Probability p=m/N of the appearance of each word.

Step (2.3), for Chinese text, uses conventional participle software to carry out participle.Use general segmentation methods to Chinese Text carries out participle.Count word frequency m of each word, and the most total word quantity N.Calculate the appearance of each word Probability p=m/N.

There is position in step (3.1) mark word.

Assuming that text size is N, a word A occurs m time the most altogether, and its position is represented by t₁, t₂, t₃... t_m, represent the word A t at text respectively₁, t₂, t₃... t_mThere is (as shown in Figure 1) in position.

Step (3.2) calculates word and location gap occurs

Word A occurs that spacing can be expressed as d_i=t_i+1-t_i.Assume that the location tables that in text, word occurs for m time is shown as: t₁, t₂, t₃... t_m.Word occurs in the alternate position spike on adjacent position can be write as such d_i=t_i+1-t_i, word spacing collection is combined into d₁, d₂,……d_m-1.Compare three kinds of different boundary condition C_-1,C₀And C_c(as shown in Figure 2).A) boundary condition C_-1.Assume to add There is position-1 and N in word.Namely assuming to occur in that uncorrelated word on-1 and N, word frequency is not added up in this appearance of twice.B) Boundary condition C₀.Assume that adding word position 0 and N+1 occurs.Namely assuming to occur in that uncorrelated word on-1 and N, this is twice Appearance do not add up word frequency.C) boundary condition C_c.Assume that joining end to end of text, each grid represent the position of a word, i.e. Assume the word " indirectly " (as shown in Figure 2) of first word of beginning and ending.For C_-1Boundary condition, distance set is modified to d₀ ^-1, d₁,……d_m-1,d_m ^-1,For C₀Boundary condition, distance set is modified to d₀ ⁰,d₁,…… d_m-1,d_m ⁰, whereinFor C_cBoundary condition, distance set is modified to d₁,d₂,……d_m-1,d_m ^c AndWherein d₁,……d_m-1, meaning is as represented spacing, t above_mIt is still that the position that word occurs for the m time Putting, N represents text size.

Step (3.3) divides word and the inherent of spacing and external pattern occurs

Location gap set according to each word above, calculates the average value mu of the spacing of word, with this meansigma methods conduct The foundation of external schema in dividing.Such as d={1,2,1,2,3,4,5,50,2,1,3,1,2,3,2,3,100,2,1,3,1,4, The meansigma methods of 3,2,1,2,1,1,1} is μ=7.1379.If d_i≤ μ is so d_iIt is classified as internal schema, d_i> μ is so d_iReturn For external schema.Such as 1 < μ, then 1 is classified as internal schema, and 50 and 100 are more than μ, and they are classified as external schema (such as Fig. 1 institute Show).Thus the spacing of word is divided into the set of inside and outside two patterns.The set of internal schema is designated as d^A, the set of external schema It is designated as d^B。

D in the example of top^A={1,2,1,2,3,4,5,2,1,3,1,2,3,2,3,2,1,3,1,4,3,2,1,2,1,1, 1}, d^B={50,100}。

Step (3.4) calculates word and the inherent of spacing and external pattern entropy occurs

The set d of inherent pattern^A ={d_i |d_i≤ μ } represent all d_iThe set of≤μ.There is spacing in so one word The entropy of inherent pattern be defined as:

H (d^{A}) = - \underset{d &Element; d^{A}}{Σ} P_{d} \log_{2} P_{d} - - - (6)

Here d is also spacing, and that it represents is d^AIn an element, and P_dRepresent is at d^AMiddle d occurs Probability.At d^AThe word number that middle d occurs is n_d, d^AMiddle data amount check is S_A, P_d=n_d/S_A。

The entropy of internal schema is calculated according to formula (6).Such as above example P₁=10/27, P₂=8/27, P₃=6/27, P₄=2/ 27, P₅=1/27.Substitute into formula (6) and obtain H (d^A)=1.98。

The set d of external schema^B= {d_i |d_i> μ represent all d_i> set of μ.So one word occurs outside spacing Entropy in pattern is defined as:

H (d^{B}) = - \underset{d &Element; d^{B}}{Σ} P_{d} \log_{2} P_{d} - - - (7)

Here d is also spacing, and that it represents is d^BIn an element, and P_dRepresent is at d^BIt is general that middle d occurs Rate.At d^BThe word number that middle d occurs is n_d, d^BMiddle data amount check is S_B, P_d=n_d/S_B。

For above example P₅₀=1/2, P₁₀₀=1/2.Substitute into formula (7) and obtain H (d^B)=1。

According to formula (7) even if going out the entropy of external schema.

ED^q(d)=(H(d^A))^q-(H(d^B))^q(8)

Wherein, q ∈ N⁺.Such as q=1,2 ..., 5.The example such as step (3.3) is given, as q=2, ED^q(d)= (1.98)²-(1)²=3.9204。

Step (3.6) calculates entropy difference normalization

Normalized entropy difference ED_norIt is defined as follows:

{ED}_{nor}^{q} (d) = \frac{{ED}^{q} (d)}{| {ED}_{geo}^{q} (d) |} - - - (9)

Wherein,

{ED}_{geo}^{q} (d) = {(- \underset{d \leq N / m}{Σ} \frac{{p (1 - p)}^{d - 1}}{p^{A}} \log \frac{{p (1 - p)}^{d - 1}}{p^{A}})}^{q} - {(- \underset{d &GreaterEqual; N / m}{Σ} \frac{p {(1 - p)}^{d - 1}}{p^{B}} \log \frac{p {(1 - p)}^{d - 1}}{p^{B}})}^{q} - - - (10)

Wherein

p^{A} = \underset{d \leq N / m}{Σ} p {(1 - p)}^{d - 1}, p^{B} = \underset{d > N / m}{Σ} p {(1 - p)}^{d - 1} .

The example such as step (3.3) is given, it is assumed that text size N=1000, when symbol occurrence number m=29, So average word spacing is μ=N/m=34.5, frequency of occurrences p=m/N=0.029 of symbol.

p^{A} = Σ_{d = 1}^{34} 0.029 * {(1 - 0.029)}^{d - 1} = 0.6323,

p^B=1-p^A=0.3677。

{ED}_{geo}^{q} (d) = {(- Σ_{d = 1}^{34} \frac{0.029 {(1 - 0.029)}^{d - 1}}{p^{A}} \log \frac{0.029 {(1 - 0.029)}^{d - 1}}{p^{A}})}^{q}

- {(- Σ_{d = 35}^{1000} \frac{0.029 {(1 - 0.029)}^{d - 1}}{p^{B}} \log \frac{0.029 {(1 - 0.029)}^{d - 1}}{p^{B}})}^{q}

= {(5.0288)}^{q} - {(6.4575)}^{q}

Typically,

{ED}_{geo}^{2} (d) = - 16.4105 .

So

{ED}_{nor}^{2} (d) = \frac{3.9204}{| - 16.4105 |} = 0.2389 .

Vocabulary is ranked up by step (4) according to entropy difference

It is according to the word divided in step (2), poor according to the entropy that the formula (6) of top calculates each word successively to (10), After calculating completes, all words are ranked up according to entropy difference is descending.Fig. 3 gives key word Testing index? Boundary condition C_-1,C₀And C_cUnder the conditions of work as q=1,2 ... top-n accuracy rate when 5.Assume that an algorithm is by marking to article In word sequence, wherein before in n result key word correct numerical statement be shown as key (n), then the accurate calibration of algorithm top-n Justice is p (n)=key (n)/n.Average Accuracy (AP) is defined asWherein p (n) is top-n Accuracy rate, if the word of ranking n-th is key word r (n)=1, if not key word, r (n)=0.L is the number of all words, R It it is the number of key word.Fig. 4 gives key word Testing indexAt q=1,2 ... 5 and boundary condition C_-1,C₀And C_cUnder Average Accuracy (AP).In terms of result, when q is 2, the performance of algorithm is more stable than during other values and performance is more excellent.

Claims

1. the key word sort method that the inherent of spacing and external pattern entropy difference occur based on word, it is characterised in that step is such as Under:

Step (1) obtains text

Obtaining text, text is made up of the sentence of some numbers；

Step (2) Text Pretreatment

Step (2.1) removes all of punctuation mark, and all of letter is converted to small letter；Catalogue in literary composition, vocabulary, and Index all removes from text；

Step (2.2), for English text, carries out participle based on simple space；First removing stop words, English different morphologies are worked as Become different words；Count word frequency m of each word, and the most total word quantity N；Calculate appearance general of each word Rate p=m/N；

Step (2.3), for Chinese text, uses conventional participle software to carry out participle；Use general segmentation methods to Chinese text Carry out participle；Count word frequency m of each word, and the most total word quantity N；Calculate the probability of the appearance of each word P=m/N；

There is position in step (3.1) mark word；

Assuming that text size is N, the word quantity that i.e. full text in step (2) is total, a word A occurs m time the most altogether, i.e. walks Suddenly the word frequency in (2), its positional representation occurred is t₁, t₂, t₃... ..t_m, represent the word A t at text respectively₁, t₂, t₃... ..t_mPosition occurs；

Step (3.2) calculates word and location gap occurs

In text, the location tables of m the appearance of word A is shown as: t₁, t₂, t₃... ..t_m；Wherein d₁,......d_m-1Represent spacing, t_m It is still that the position that word occurs for the m time；Word occurs in alternate position spike d on adjacent position_i=t_i+1-t_i, word spacing collection is combined into d₁,d₂,......d_m-1；For C_-1Boundary condition, it is assumed that text border is in-1 and N the two position, then distance set correction ForFor C₀Boundary condition, it is assumed that text border is in 0 and N+1 the two Position, then text distance set is modified to d₀ ⁰,d₁,......d_m-1,d_m ⁰, whereinFor C_c Boundary condition, it is assumed that joining end to end of text, distance set is modified to d₁,d₂,......d_m-1,d_m ^c,It is ring-type to be that text is linked to be State under, the last distance occurred and occur for the first time of word；And

Step (3.3) divides word and the inherent of spacing and external pattern occurs

Location gap set according to each word above, calculates the average value mu of the spacing of word, by this meansigma methods as division The foundation of interior external schema；If d_i≤ μ is so d_iIt is classified as internal schema, d_i> μ is so d_iIt is classified as external schema；Foundation according to this, this Sample is just divided into the spacing of word the set of inside and outside two patterns；The set of internal schema is designated as d^A, the set of external schema is designated as d^B；

Step (3.4) calculates word and the entropy of the inherent of spacing and external pattern occurs

The set d of inherent pattern^A={ d_i|d_i≤ μ } represent all d_iThe set of≤μ；There is the inherent mould of spacing in so one word The entropy of formula is defined as:

H (d^{A}) = - \underset{d &Element; d^{A}}{Σ} P_{d} \log_{2} P_{d} - - - (6)

Here d is also spacing, d belong to 1,2,3 ... N}, and P_dRepresent is at d^AThe probability that middle d occurs；At d^A The word number that middle d occurs is n_d, d^AMiddle data amount check is S_A, P_d=n_d/S_A；

The entropy of internal schema is calculated according to formula (6)；

The set d of external schema^B={ d_i|d_i> μ represent all d_i> set of μ；There is the external pattern of spacing in so one word Entropy is defined as:

H (d^{B}) = - \underset{d &Element; d^{B}}{Σ} P_{d} \log_{2} P_{d} - - - (7)

Here d is also spacing, d belong to 1,2,3 ... N}, and P_dRepresent is at d^BThe probability that middle d occurs；At d^B The word number that middle d occurs is n_d, d^BMiddle data amount check is S_B, P_d=n_d/S_B；

According to formula (7) even if going out the entropy of external schema；

ED²(d)=(H (d^A))²-(H(d^B))² (8)

Step (3.6) calculates entropy difference normalization

Normalized entropy difference ED_norIt is defined as follows:

{ED}_{n o r}^{q} (d) = \frac{{ED}^{q} (d)}{| {ED}_{g e o}^{q} (d) |} - - - (9)

Wherein,

{ED}_{g e o}^{q} (d) = {(- \underset{d \leq N / m}{Σ} \frac{p {(1 - p)}^{d - 1}}{p^{A}} l o g \frac{p {(1 - p)}^{d - 1}}{p^{A}})}^{q} - {(- \underset{d > N / m}{Σ} \frac{p {(1 - p)}^{d - 1}}{p^{B}} l o g \frac{p {(1 - p)}^{d - 1}}{p^{B}})}^{q} - - - (10)

WhereinIn formula (10), q=2, d are word spacing, represent d^AOr Person d^BIn an element；What N/m represented is the expectation of spacing, the most above average headway value μ；What p=m/N represented is Word probability in the text, m is the word frequency of corresponding words, and it is the most total word quantity that N represents；p(1-p)^d-1Be equivalent to d weight uncle exert Profit test；

Vocabulary is ranked up by step (4) according to entropy difference

According to the word divided in step (2), poor according to the entropy that the formula (6) of top calculates each word successively to (10), calculate After completing, all words are ranked up according to entropy difference is descending.