CN103336806B - Keyword ranking method based on the entropy difference between the internal and external patterns of word occurrence spacing - Google Patents
- Publication number
- CN103336806B (application CN201310253678.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The present invention proposes a method for ranking keywords based on the information-entropy difference between the internal and external patterns of word occurrence spacing, belonging to the field of text information extraction. The method assumes that the occurrences of a keyword are governed by two patterns: (1) an internal pattern, describing the statistical properties of the keyword's positions within a topic; and (2) an external pattern, describing the statistical properties of how topic clusters occur in the text. Experiments on real texts show that the larger the entropy difference between the internal and external patterns of a word's occurrence spacing, the more likely the word is a keyword.
Description
Technical field
The present invention relates to a novel method for extracting and ranking keywords from text, belonging to the field of text information extraction.
Background technology
With the rapid development of the Internet, the amount of information on the network keeps growing, and the means of obtaining it keep becoming more convenient. At the same time, however, Internet users face the problem of information explosion. To solve this problem, we need to be able to quickly find the parts of interest within massive amounts of information, which requires extracting keywords from text.
Traditional methods assume that if a word is to be identified as a keyword, it must exhibit significant statistical features. H. P. Luhn proposed the original keyword extraction method: after removing common and rare words, keywords are ranked by word frequency. Since then, frequency-based methods and their refinements have been widely discussed. However, frequency-based methods cannot separate words of similar frequency whose importance nevertheless differs markedly. M. Ortuno, J. P. Herrera and P. Carpena proposed detecting keywords from the spatial distribution of words, but distribution-based methods likewise cannot separate words with similar spatial distributions but markedly different importance.
The present invention proposes a method for ranking keywords based on the information-entropy difference between the internal and external patterns of word occurrence spacing. The method assumes that the occurrences of a keyword are governed by two patterns: (1) an internal pattern, describing the statistical properties of the keyword's positions within a topic; and (2) an external pattern, describing the statistical properties of how topic clusters occur in the text. Experiments on real texts show that the larger the entropy difference between the internal and external patterns of a word's occurrence spacing, the more likely the word is a keyword.
Summary of the invention
Step (1): Obtain the text
Obtain the text, which consists of a number of sentences.
Step (2): Text preprocessing
Step (2.1): Remove all punctuation marks and convert all letters to lowercase.
Step (2.2): For English text, tokenize simply on whitespace. Different inflected forms are treated as different words; for example, "organ" and "organs" are two different words.
Step (2.3): For Chinese text, use standard word-segmentation software.
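The preprocessing of steps (2.1) and (2.2) can be sketched in Python. This is a minimal sketch; the patent does not prescribe an implementation, and the regular expression and function name are our assumptions:

```python
import re

def preprocess_english(text):
    """Steps (2.1)-(2.2): strip punctuation, lowercase, split on whitespace.

    Inflected forms are deliberately NOT merged: "organ" and "organs"
    remain two different words, as the method requires.
    """
    text = re.sub(r"[^\w\s]", " ", text.lower())  # punctuation -> space
    return text.split()

tokens = preprocess_english("The organ, and the organs; differ.")
print(tokens)  # ['the', 'organ', 'and', 'the', 'organs', 'differ']
```

For Chinese text, the whitespace split would be replaced by a segmenter.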
Step (3): Identify the internal and external patterns of word occurrence spacing
Step (3.1): Mark the occurrence positions of the word. Suppose the text length is N and a word A occurs m times in total (i.e., its word frequency is m), at positions t_1, t_2, t_3, ..., t_m.
Step (3.2): Compute the occurrence spacings of the word. The spacings of word A are d_i = t_{i+1} - t_i. Let μ denote the mean of the d_i, i.e., the average spacing.
Step (3.3): Divide the occurrence spacings into internal and external patterns. If d_i ≤ μ, then d_i is classified into the internal pattern; in other words, for a given occurrence of the word, if the spacing d_i to its next occurrence is at most the average spacing μ, then d_i belongs to the internal pattern. Similarly, if d_i > μ, then d_i is classified into the external pattern.
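Steps (3.2) and (3.3) can be sketched as follows (function names are ours):

```python
def occurrence_spacings(positions):
    """Step (3.2): d_i = t_{i+1} - t_i for consecutive occurrence positions."""
    return [b - a for a, b in zip(positions, positions[1:])]

def split_patterns(spacings):
    """Step (3.3): spacings <= mean form the internal pattern d_A,
    spacings > mean form the external pattern d_B."""
    mu = sum(spacings) / len(spacings)  # average spacing
    d_A = [d for d in spacings if d <= mu]
    d_B = [d for d in spacings if d > mu]
    return mu, d_A, d_B

mu, d_A, d_B = split_patterns(occurrence_spacings([3, 4, 6, 7, 9, 60]))
print(d_B)  # [51]: the single large gap is the external pattern
```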
Step (3.4): Compute the entropies of the internal and external patterns. Let d_A = {d_i | d_i ≤ μ} denote the set of all spacings d_i ≤ μ. The entropy of the internal pattern of the word's occurrence spacing is defined as:

H(d_A) = -Σ_d P_d log_2 P_d (1)

Here d is a spacing, d ∈ {1, 2, 3, ..., N}, and P_d is the probability that d occurs in d_A: if d occurs n_d times in d_A and d_A contains S_A elements, then P_d = n_d / S_A.
Let d_B = {d_i | d_i > μ} denote the set of all spacings d_i > μ. The entropy of the external pattern of the word's occurrence spacing is defined as:

H(d_B) = -Σ_d P_d log_2 P_d (2)

Here d is again a spacing, d ∈ {1, 2, 3, ..., N}, and P_d is the probability that d occurs in d_B: if d occurs n_d times in d_B and d_B contains S_B elements, then P_d = n_d / S_B.
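Formulas (1) and (2) share the same form and differ only in the spacing set they are applied to; the base-2 logarithm is inferred from the worked example in the detailed description (H({50, 100}) = 1). A sketch:

```python
from collections import Counter
from math import log2

def pattern_entropy(spacing_set):
    """H = -sum_d P_d * log2(P_d), with P_d = n_d / S (formulas (1)-(2))."""
    S = len(spacing_set)
    if S == 0:
        return 0.0
    return -sum((n / S) * log2(n / S) for n in Counter(spacing_set).values())

print(pattern_entropy([50, 100]))  # 1.0: two equally likely spacings
```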
Step (3.5): Compute the entropy difference between the internal and external patterns:

ED_q(d) = (H(d_A))^q - (H(d_B))^q (3)

where q ∈ N+ is a positive integer. The experiments below show that q = 2 gives the best results.
Step (3.6): Normalize the entropy difference. The normalized entropy difference ED_nor is defined as:

ED_nor = ED_q(d) / ED_q^ran(d) (4)

where ED_q^ran(d) is the entropy difference expected for a randomly placed word of the same frequency:

ED_q^ran(d) = (H^ran(d_A))^q - (H^ran(d_B))^q, with H^ran(d_A) = -Σ_{d ≤ μ} (p(1-p)^{d-1} / p_A) log_2 (p(1-p)^{d-1} / p_A), p_A = Σ_{d ≤ μ} p(1-p)^{d-1}, and H^ran(d_B) defined analogously over d > μ with p_B = Σ_{d > μ} p(1-p)^{d-1}. (5)

Here d is a spacing taking positive integer values; N/m is the expected spacing; p = m/N is the probability of the word in the text, where m is the word frequency of the word and N is the total number of words; and p(1-p)^{d-1} corresponds to a d-fold Bernoulli trial.
The purpose of normalization is to compare words with different p under the same standard, preventing differences in p from producing spurious differences in the entropy difference (i.e., eliminating the influence of the factor p on the results).
In these formulas, p_A is the total probability that a word's spacing in the text is at most the average spacing; p_B is analogous. The term p(1-p)^{d-1} / p_A is the conditional probability that the spacing equals d, given that the spacing is at most the average spacing.
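The normalization can be sketched numerically. The explicit normalization formulas were lost as images in extraction, so this follows the surrounding description: the spacings of a randomly placed word obey the geometric (Bernoulli) law p(1-p)^(d-1), and the observed entropy difference is divided by the difference expected under that law. Function names and the truncation bound `d_max` are our assumptions:

```python
from math import floor, log2

def geometric_pattern_entropy(p, mu, internal, d_max=10**6):
    """Entropy of the internal (d <= mu) or external (d > mu) pattern when
    spacings follow the geometric law P(d) = p * (1-p)**(d-1)."""
    lo, hi = (1, floor(mu)) if internal else (floor(mu) + 1, d_max)
    probs = [p * (1 - p) ** (d - 1) for d in range(lo, hi + 1)]
    z = sum(probs)  # p_A (internal) or p_B (external)
    return -sum((q / z) * log2(q / z) for q in probs if q > 0.0)

def normalized_entropy_difference(H_A, H_B, p, mu, q=2):
    """ED_nor = ED_q / ED_q^ran: observed entropy difference divided by the
    difference expected for a randomly placed word of the same frequency."""
    Hr_A = geometric_pattern_entropy(p, mu, internal=True)
    Hr_B = geometric_pattern_entropy(p, mu, internal=False)
    return (H_A ** q - H_B ** q) / (Hr_A ** q - Hr_B ** q)
```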
Step (4): Rank the vocabulary by the relative size of the entropy difference.
Brief description of the drawings
Fig. 1: Schematic diagram of the division of word occurrence spacings into internal and external patterns in a text.
Fig. 2: Schematic diagram of the boundary conditions. a) Boundary condition C_{-1}: virtual word occurrences are added at positions -1 and N. b) Boundary condition C_0: virtual word occurrences are added at positions 0 and N+1. c) Boundary condition C_c: the text is joined end to end; each cell represents the position of one word.
Fig. 3: Top-n precision of ED_nor under boundary conditions C_{-1}, C_0 and C_c for q = 1, 2, ..., 5.
Fig. 4: Average precision (AP) of keyword detection with ED_nor for q = 1, 2, ..., 5 under boundary conditions C_{-1}, C_0 and C_c.
Detailed description of the invention
Step (1): Obtain the text
Obtain the text, which consists of a number of sentences.
The test corpus is Charles Darwin's "The Origin of Species"; the keyword index provided by W. S. Dallas is used as the evaluation ground truth.
Step (2): Text preprocessing
Step (2.1): Remove all punctuation marks and convert all letters to lowercase; the table of contents, glossary, and index are removed from the text.
Step (2.2): For English text, tokenize simply on whitespace. First remove stop words; different inflected forms are treated as different words, e.g. "organ" and "organs" are two different words. Count the word frequency m of each word and the total number of words N in the text, and compute each word's probability of occurrence p = m/N.
Step (2.3): For Chinese text, segment with standard word-segmentation software, then count the word frequency m of each word and the total number of words N, and compute each word's probability of occurrence p = m/N.
Step (3): Identify the internal and external patterns of word occurrence spacing
Step (3.1): Mark the occurrence positions of the word.
Suppose the text length is N and a word A occurs m times in total, at positions t_1, t_2, t_3, ..., t_m (as shown in Fig. 1).
Step (3.2): Compute the occurrence spacings of the word
The spacings of word A are d_i = t_{i+1} - t_i. Suppose the m occurrences of the word are at positions t_1, t_2, t_3, ..., t_m; the spacings between adjacent occurrences are d_i = t_{i+1} - t_i, giving the spacing set d_1, d_2, ..., d_{m-1}. Three different boundary conditions C_{-1}, C_0 and C_c are compared (as shown in Fig. 2). a) Boundary condition C_{-1}: virtual word occurrences are assumed at positions -1 and N, i.e., an unrelated occurrence of the word is assumed at -1 and at N; these two occurrences are not counted in the word frequency. b) Boundary condition C_0: virtual word occurrences are assumed at positions 0 and N+1; likewise, these two occurrences are not counted in the word frequency. c) Boundary condition C_c: the text is joined end to end, each cell representing the position of one word, i.e., the first word of the text is assumed to follow "indirectly" after the last word (as shown in Fig. 2). For boundary condition C_{-1}, the spacing set is modified to d_0^{-1}, d_1, ..., d_{m-1}, d_m^{-1}, where d_0^{-1} = t_1 - (-1) and d_m^{-1} = N - t_m. For boundary condition C_0, the spacing set is modified to d_0^0, d_1, ..., d_{m-1}, d_m^0, where d_0^0 = t_1 and d_m^0 = N + 1 - t_m. For boundary condition C_c, the spacing set is modified to d_1, d_2, ..., d_{m-1}, d_m^c, where d_m^c = N - t_m + t_1. Here d_1, ..., d_{m-1} denote the same spacings as above, t_m is still the position of the m-th occurrence of the word, and N is the text length.
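The three boundary corrections can be sketched as follows. The explicit correction terms were lost as images in extraction, so the formulas below follow our reading of the boundary conditions described in the text:

```python
def spacing_sets(positions, N):
    """Spacing sets of a word under the three boundary conditions of Fig. 2.

    positions: occurrence positions t_1..t_m, N: text length.
    """
    t = positions
    base = [b - a for a, b in zip(t, t[1:])]          # d_1 .. d_{m-1}
    c_minus1 = [t[0] - (-1)] + base + [N - t[-1]]     # virtual hits at -1 and N
    c_zero = [t[0] - 0] + base + [N + 1 - t[-1]]      # virtual hits at 0 and N+1
    c_circular = base + [N - t[-1] + t[0]]            # text joined end to end
    return c_minus1, c_zero, c_circular

print(spacing_sets([3, 7], 10))  # ([4, 4, 3], [3, 4, 4], [4, 6])
```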
Step (3.3): Divide the occurrence spacings into internal and external patterns
From the spacing set of each word above, compute the average spacing μ, which serves as the criterion for dividing internal from external patterns. For example, the mean of d = {1,2,1,2,3,4,5,50,2,1,3,1,2,3,2,3,100,2,1,3,1,4,3,2,1,2,1,1,1} is μ = 7.1379. If d_i ≤ μ, then d_i is classified into the internal pattern; if d_i > μ, into the external pattern. For example, 1 < μ, so 1 is classified as internal, while 50 and 100 exceed μ and are classified as external (as shown in Fig. 1). The spacings of the word are thus divided into two sets, internal and external; the internal set is denoted d_A and the external set d_B.
In the example above, d_A = {1,2,1,2,3,4,5,2,1,3,1,2,3,2,3,2,1,3,1,4,3,2,1,2,1,1,1} and d_B = {50,100}.
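This worked example can be checked directly:

```python
d = [1, 2, 1, 2, 3, 4, 5, 50, 2, 1, 3, 1, 2, 3, 2, 3, 100,
     2, 1, 3, 1, 4, 3, 2, 1, 2, 1, 1, 1]
mu = sum(d) / len(d)                 # average spacing
d_A = [x for x in d if x <= mu]      # internal pattern
d_B = [x for x in d if x > mu]       # external pattern
print(round(mu, 4), len(d_A), d_B)   # 7.1379 27 [50, 100]
```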
Step (3.4): Compute the entropies of the internal and external patterns
The internal-pattern set d_A = {d_i | d_i ≤ μ} is the set of all spacings d_i ≤ μ. The entropy of the internal pattern of the word's occurrence spacing is defined as:

H(d_A) = -Σ_d P_d log_2 P_d (6)

Here d is again a spacing, representing an element of d_A, and P_d is the probability that d occurs in d_A: if d occurs n_d times in d_A and d_A contains S_A elements, then P_d = n_d / S_A.
The internal-pattern entropy is computed according to formula (6). In the example above, P_1 = 10/27, P_2 = 8/27, P_3 = 6/27, P_4 = 2/27, P_5 = 1/27; substituting into formula (6) gives H(d_A) = 1.98.
The external-pattern set d_B = {d_i | d_i > μ} is the set of all spacings d_i > μ. The entropy of the external pattern of the word's occurrence spacing is defined as:

H(d_B) = -Σ_d P_d log_2 P_d (7)

Here d is again a spacing, representing an element of d_B, and P_d is the probability that d occurs in d_B: if d occurs n_d times in d_B and d_B contains S_B elements, then P_d = n_d / S_B.
The external-pattern entropy is computed according to formula (7). In the example above, P_50 = 1/2 and P_100 = 1/2; substituting into formula (7) gives H(d_B) = 1.
Step (3.5): Compute the entropy difference between the internal and external patterns:

ED_q(d) = (H(d_A))^q - (H(d_B))^q (8)

where q ∈ N+, e.g. q = 1, 2, ..., 5. For the example given in step (3.3), with q = 2, ED_q(d) = (1.98)^2 - (1)^2 = 2.9204.
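The entropy values in this example can be reproduced numerically with base-2 logarithms. Note that before rounding, H(d_A) = 1.9871, so the unrounded entropy difference is about 2.949; the patent's 1.98 is a truncation:

```python
from collections import Counter
from math import log2

d_A = [1] * 10 + [2] * 8 + [3] * 6 + [4] * 2 + [5]   # spacings <= mu
d_B = [50, 100]                                      # spacings > mu

def H(spacing_set):
    """Pattern entropy with P_d = n_d / S, per formulas (6)-(7)."""
    n = len(spacing_set)
    return -sum((c / n) * log2(c / n) for c in Counter(spacing_set).values())

H_A, H_B = H(d_A), H(d_B)
print(round(H_A, 4), H_B, round(H_A**2 - H_B**2, 3))  # 1.9871 1.0 2.949
```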
Step (3.6): Normalize the entropy difference
The normalized entropy difference ED_nor is defined as:

ED_nor = ED_q(d) / ED_q^ran(d) (9)

where ED_q^ran(d) is the entropy difference expected for a randomly placed word of the same frequency:

ED_q^ran(d) = (H^ran(d_A))^q - (H^ran(d_B))^q, with H^ran(d_A) = -Σ_{d ≤ μ} (p(1-p)^{d-1} / p_A) log_2 (p(1-p)^{d-1} / p_A), p_A = Σ_{d ≤ μ} p(1-p)^{d-1}, and H^ran(d_B) defined analogously over d > μ with p_B = Σ_{d > μ} p(1-p)^{d-1}. (10)

For the example given in step (3.3), assume the text length is N = 1000 and the word occurs m = 29 times; then the expected spacing is μ = N/m = 34.5 and the probability of occurrence of the word is p = m/N = 0.029. Substituting these values into formulas (9) and (10) yields the normalized entropy difference.
Step (4): Rank the vocabulary by entropy difference
For each word obtained in step (2), compute its entropy difference according to formulas (6) to (10) above; after all computations are complete, sort all words in descending order of entropy difference. Fig. 3 gives the top-n precision of the keyword-detection index ED_nor under boundary conditions C_{-1}, C_0 and C_c for q = 1, 2, ..., 5. Suppose an algorithm ranks the words of an article; if key(n) denotes the number of correct keywords among the first n results, the top-n precision of the algorithm is defined as p(n) = key(n)/n. The average precision (AP) is defined as

AP = (1/R) Σ_{n=1}^{L} p(n) r(n)

where p(n) is the top-n precision, r(n) = 1 if the word ranked n-th is a keyword and r(n) = 0 otherwise, L is the number of all words, and R is the number of keywords. Fig. 4 gives the average precision (AP) of the keyword-detection index ED_nor for q = 1, 2, ..., 5 under boundary conditions C_{-1}, C_0 and C_c. The results show that with q = 2 the algorithm performs more stably and better than with other values.
Claims (1)
1. A keyword ranking method based on the entropy difference between the internal and external patterns of word occurrence spacing, characterized by the following steps:
Step (1): Obtain the text
Obtain the text, which consists of a number of sentences;
Step (2): Text preprocessing
Step (2.1): Remove all punctuation marks and convert all letters to lowercase; the table of contents, glossary, and index are removed from the text;
Step (2.2): For English text, tokenize simply on whitespace; first remove stop words, treating different inflected forms as different words; count the word frequency m of each word and the total number of words N in the text; compute each word's probability of occurrence p = m/N;
Step (2.3): For Chinese text, segment with standard word-segmentation software; count the word frequency m of each word and the total number of words N; compute each word's probability of occurrence p = m/N;
Step (3): Identify the internal and external patterns of word occurrence spacing
Step (3.1): Mark the occurrence positions of the word;
Suppose the text length is N, i.e., the total number of words from step (2), and a word A occurs m times in total, m being the word frequency from step (2), at positions t_1, t_2, t_3, ..., t_m;
Step (3.2): Compute the occurrence spacings of the word
The m occurrences of word A are at positions t_1, t_2, t_3, ..., t_m; the spacings between adjacent occurrences are d_i = t_{i+1} - t_i, giving the spacing set d_1, d_2, ..., d_{m-1}, where t_m is still the position of the m-th occurrence of the word; for boundary condition C_{-1}, assuming text boundaries at the two positions -1 and N, the spacing set is modified to d_0^{-1}, d_1, ..., d_{m-1}, d_m^{-1}, where d_0^{-1} = t_1 - (-1) and d_m^{-1} = N - t_m; for boundary condition C_0, assuming text boundaries at the two positions 0 and N+1, the spacing set is modified to d_0^0, d_1, ..., d_{m-1}, d_m^0, where d_0^0 = t_1 and d_m^0 = N + 1 - t_m; for boundary condition C_c, assuming the text is joined end to end, the spacing set is modified to d_1, d_2, ..., d_{m-1}, d_m^c, where d_m^c = N - t_m + t_1 is, with the text joined into a ring, the distance between the last occurrence and the first occurrence of the word;
Step (3.3): Divide the occurrence spacings into internal and external patterns
From the spacing set of each word above, compute the average spacing μ, which serves as the criterion for dividing internal from external patterns; if d_i ≤ μ, then d_i is classified into the internal pattern, and if d_i > μ, into the external pattern; the spacings of the word are thus divided into two sets, internal and external; the internal set is denoted d_A and the external set d_B;
Step (3.4): Compute the entropies of the internal and external patterns
The internal-pattern set d_A = {d_i | d_i ≤ μ} is the set of all spacings d_i ≤ μ; the entropy of the internal pattern of the word's occurrence spacing is defined as:

H(d_A) = -Σ_d P_d log_2 P_d (6)

here d is a spacing, d ∈ {1, 2, 3, ..., N}, and P_d is the probability that d occurs in d_A: if d occurs n_d times in d_A and d_A contains S_A elements, then P_d = n_d / S_A;
The internal-pattern entropy is computed according to formula (6);
The external-pattern set d_B = {d_i | d_i > μ} is the set of all spacings d_i > μ; the entropy of the external pattern of the word's occurrence spacing is defined as:

H(d_B) = -Σ_d P_d log_2 P_d (7)

here d is a spacing, d ∈ {1, 2, 3, ..., N}, and P_d is the probability that d occurs in d_B: if d occurs n_d times in d_B and d_B contains S_B elements, then P_d = n_d / S_B;
The external-pattern entropy is computed according to formula (7);
Step (3.5): Compute the entropy difference between the internal and external patterns:

ED_2(d) = (H(d_A))^2 - (H(d_B))^2 (8)

Step (3.6): Normalize the entropy difference
The normalized entropy difference ED_nor is defined as:

ED_nor = ED_2(d) / ED_2^ran(d) (9)

where ED_2^ran(d) is the entropy difference expected for a randomly placed word of the same frequency:

ED_2^ran(d) = (H^ran(d_A))^2 - (H^ran(d_B))^2, with H^ran(d_A) = -Σ_{d ≤ μ} (p(1-p)^{d-1} / p_A) log_2 (p(1-p)^{d-1} / p_A), p_A = Σ_{d ≤ μ} p(1-p)^{d-1}, and H^ran(d_B) defined analogously over d > μ with p_B = Σ_{d > μ} p(1-p)^{d-1}; (10)

in formula (10), q = 2 and d is a word spacing, representing an element of d_A or d_B; N/m is the expected spacing, i.e., the average spacing value μ above; p = m/N is the probability of the word in the text, where m is the word frequency of the word and N is the total number of words in the text; p(1-p)^{d-1} corresponds to a d-fold Bernoulli trial;
Step (4): Rank the vocabulary by entropy difference
For each word obtained in step (2), compute its entropy difference according to formulas (6) to (10) above; after all computations are complete, sort all words in descending order of entropy difference.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310253678.6A CN103336806B (en) | 2013-06-24 | 2013-06-24 | A kind of key word sort method that the inherent of spacing and external pattern entropy difference occur based on word |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103336806A CN103336806A (en) | 2013-10-02 |
CN103336806B true CN103336806B (en) | 2016-08-10 |
Family
ID=49244971
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310253678.6A Active CN103336806B (en) | 2013-06-24 | 2013-06-24 | A kind of key word sort method that the inherent of spacing and external pattern entropy difference occur based on word |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103336806B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103744900A (en) * | 2013-12-26 | 2014-04-23 | 合一网络技术(北京)有限公司 | Visual discrimination difficulty combined text string weight calculation method and device |
CN109033166B (en) * | 2018-06-20 | 2022-01-07 | 国家计算机网络与信息安全管理中心 | Character attribute extraction training data set construction method |
CN110348497B (en) * | 2019-06-28 | 2021-09-10 | 西安理工大学 | Text representation method constructed based on WT-GloVe word vector |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101963972A (en) * | 2010-07-01 | 2011-02-02 | 深港产学研基地产业发展中心 | Method and system for extracting emotional keywords |
CN102253996A (en) * | 2011-07-08 | 2011-11-23 | 北京航空航天大学 | Multi-visual angle stagewise image clustering method |
CN102662936A (en) * | 2012-04-09 | 2012-09-12 | 复旦大学 | Chinese-English unknown words translating method blending Web excavation, multi-feature and supervised learning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102456058B (en) * | 2010-11-02 | 2014-03-19 | 阿里巴巴集团控股有限公司 | Method and device for providing category information |
- 2013-06-24 CN CN201310253678.6A patent/CN103336806B/en active Active
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |