CN102629266A

CN102629266A - Diagram text structure representation model based on harmonic progression

Info

Publication number: CN102629266A
Application number: CN2012100594049A
Authority: CN
Inventors: 陈雪; 吴超
Original assignee: University of Shanghai for Science and Technology
Current assignee: University of Shanghai for Science and Technology
Priority date: 2012-03-08
Filing date: 2012-03-08
Publication date: 2012-08-08

Abstract

The invention discloses a text graph structure representation model based on harmonic progression. The method comprises the following steps: (1) a single text in a domain text collection is opened; (2) the content of the text is rearranged in descending order of importance; (3) the text is segmented into words and the punctuation marks are retained; (4) the occurrences of keywords and keyword pairs are counted; (5) the keywords are taken as the nodes of a graph, and the keyword pairs whose co-occurrence count is not zero are connected; and (6) a harmonic progression method is used to calculate the weights of the keywords and keyword pairs. The method avoids the loss of text structure information and can calculate the weights of keywords and keyword pairs from the structural information of a single text. It is simple to operate, effective, and can also serve the function of TFIDF (Term Frequency-Inverse Document Frequency).

Description

A text graph structure representation model based on harmonic progression

Technical field

The present invention relates to a text representation model, and specifically to a model that represents a text as a graph structure and uses harmonic progression to calculate the weights of keywords and keyword pairs; that is, a text graph structure representation model based on harmonic progression.

Background technology

Humans are good at handling unstructured text, because unstructured text matches the habits of human language and, more importantly, because humans have strong logical reasoning ability. Machines, by contrast, are good at processing structured text, such as graphs and tables. For human-computer interaction, unstructured text that humans understand must be converted into structured text that machines understand, and this requires a text representation model.

The most widely used text representation model at present is the vector space model. The vector space model represents a text as a weight vector in which each component corresponds to a term, and the weight of each term is determined by the TFIDF method. The TFIDF method uses a term weight formula to calculate how important a term is to a single text in a text collection. The term weight in the TFIDF method is the product of the term frequency TF (Term Frequency) and the inverse document frequency IDF (Inverse Document Frequency). The TFIDF formula is as follows:

TFIDF_i = TF_i × IDF_i = TF_i × log(N / n_i)

where TF_i is the term frequency of term i, that is, the number of times term i occurs in the text; IDF_i is the inverse document frequency of term i, calculated as log(N / n_i); N is the total number of texts in the text set; and n_i is the number of texts in the text set that contain term i.
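The term weighting described above can be sketched in a few lines of Python. This is a minimal illustration rather than any reference implementation: the function name `tfidf_weights`, the toy corpus, and the choice of the natural logarithm (the text does not state the base) are all assumptions.

```python
import math
from collections import Counter


def tfidf_weights(doc_tokens, corpus):
    """Return {term: TF_i * log(N / n_i)} for one tokenized document.

    doc_tokens: list of terms in the document being weighted.
    corpus: list of tokenized documents (the text set); it is assumed to
            contain doc_tokens, so n_i is never zero.
    """
    total_docs = len(corpus)          # N: total number of texts
    tf = Counter(doc_tokens)          # TF_i: raw occurrence counts
    weights = {}
    for term, freq in tf.items():
        # n_i: number of texts in the set that contain the term
        docs_with_term = sum(1 for doc in corpus if term in doc)
        # Natural logarithm is an assumption; the base is not given.
        weights[term] = freq * math.log(total_docs / docs_with_term)
    return weights


# Hypothetical three-document text set.
corpus = [
    ["graph", "text", "graph"],
    ["text", "vector"],
    ["text", "weight"],
]
weights = tfidf_weights(corpus[0], corpus)
```

In this toy text set the term "text" occurs in every document, so its IDF factor is log(3/3) = 0 and its weight vanishes, while "graph" is weighted by both its frequency and its rarity. The dependence on the whole text set is also visible: the function cannot be called with a single text alone.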

However, representing text with the vector space model combined with the TFIDF method has the following deficiencies:

(1) The vector space model treats a text as a set of terms and treats the terms as independent of one another, so a large amount of text structure information is lost.

(2) When calculating the term frequency, the TFIDF method considers only the occurrence count or co-occurrence count of terms and ignores the influence of where they occur on their weights, which is not sufficient to express their actual weights.

(3) When calculating the inverse document frequency of a term, the TFIDF method depends on a domain text set and cannot be applied to a single text.

Summary of the invention

The object of the invention is to address the deficiencies of the vector space model and the TFIDF method by providing a text graph structure representation model based on harmonic progression. The model avoids the loss of text structure information and calculates the weights of keywords and keyword pairs from the structural information of a single text.

To achieve this object, the concept of the invention is as follows: a graph structure model is used to represent a single text, which avoids the loss of text structure information and allows the weights of keywords and keyword pairs to be calculated from the structural information of that text. The graph structure model organizes the keywords of the text and the relations between them as a graph, and the weights are then calculated by the harmonic progression method.

In accordance with this inventive concept, the invention adopts the following technical solution:

A text graph structure representation model based on harmonic progression, characterized by the following specific steps:

(1) Open a single text in the domain text collection;

(2) Rearrange the content of the text in descending order of importance;

(3) Segment the text into words and retain the punctuation marks;

(4) Count the occurrences of keywords and keyword pairs;

(5) Take the keywords as the nodes of the graph, and connect the keyword pairs whose co-occurrence count is not 0;

(6) Use the harmonic progression method to calculate the weights of the keywords and keyword pairs.

The harmonic progression method is denoted HP. Its formula for calculating the weight of a keyword or keyword pair is as follows:

[Formula image: HP weight formula]

where N is the number of occurrences of the keyword or keyword pair, and γ denotes Euler's constant, approximately 0.5772.
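The HP formula itself is an image that did not survive in this text. A plausible reading, offered here only as an assumption, is that HP is the N-th partial sum of the harmonic series together with its standard logarithmic approximation; this would be consistent with the references to Euler's constant and to a logarithm elsewhere in the description. The notation HP(N) and the summation index k are illustrative choices, not taken from the source.

```latex
HP(N) \;=\; \sum_{k=1}^{N} \frac{1}{k} \;\approx\; \ln N + \gamma,
\qquad
\gamma \;=\; \lim_{N \to \infty} \Bigl( \sum_{k=1}^{N} \frac{1}{k} - \ln N \Bigr) \;\approx\; 0.5772
```

Under this reading, a keyword that occurs once receives weight 1, and each further occurrence adds a smaller increment (1/2, 1/3, and so on), so repeated occurrences raise the weight with diminishing returns.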

Compared with the prior art, the text graph structure representation model based on harmonic progression of the present invention has the following notable features and advantages: when no domain text set is available, so that the discriminating power of a keyword within a text set cannot be determined, the weight of a keyword can still be determined from its occurrence count and occurrence positions by scanning a single text; although only occurrence counts are used to estimate the weights, the method is simple to operate and effective; and because the logarithm in the harmonic progression method scales with the order of magnitude of the count, the method can also serve the function of TFIDF while being simpler.
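The remark about the logarithm can be checked numerically. The sketch below assumes, without confirmation from the source, that the weight is the partial sum of the harmonic series and compares it with the standard approximation ln(n) + γ; the function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant (Euler-Mascheroni)


def harmonic_number(n):
    """Exact partial sum 1 + 1/2 + ... + 1/n (returns 0 for n == 0)."""
    return sum(1.0 / k for k in range(1, n + 1))


def log_approximation(n):
    """Standard approximation ln(n) + gamma, defined for n >= 1."""
    return math.log(n) + EULER_GAMMA


# Print the exact sum next to its logarithmic approximation.
for n in (1, 10, 100, 1000):
    print(n, round(harmonic_number(n), 4), round(log_approximation(n), 4))
```

The gap between the two columns shrinks roughly like 1/(2n), which is why a weight of this form grows on a logarithmic scale with the occurrence count, much as the IDF factor of TFIDF does with document counts.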

Description of drawings

Fig. 1 is the flow chart of the text graph structure representation model based on harmonic progression of the present invention.

Embodiment

The embodiments of the invention are further described below with reference to the accompanying drawing.

Embodiment one: Referring to Fig. 1, the text graph structure representation model based on harmonic progression is characterized in that a graph structure model is used to represent a single text, and the harmonic progression method is used to calculate the weights of keywords and keyword pairs.

The graph structure model connects the keywords of the text according to the co-occurrence of keyword pairs within the same sentence.

In the harmonic progression method, the formula for calculating the weight of a keyword or keyword pair is as follows:

[Formula image: HP weight formula]

where N is the number of occurrences of the keyword or keyword pair, and γ denotes Euler's constant, approximately 0.5772.
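The graph structure of Embodiment one, in which keywords are connected when they co-occur in the same sentence, might be held in an adjacency map as sketched below. The function name `build_adjacency`, the sample sentences, and the assumption that the keyword set is supplied in advance are illustrative; the text does not say how keywords are selected.

```python
from collections import defaultdict
from itertools import combinations


def build_adjacency(sentences, keywords):
    """Return {keyword: {neighbor: co-occurrence count}}.

    sentences: list of token lists, one list per sentence.
    keywords: set of terms treated as keywords (assumed to be given).
    """
    adjacency = defaultdict(dict)
    for tokens in sentences:
        # Distinct keywords present in this sentence.
        present = sorted({t for t in tokens if t in keywords})
        # Every unordered pair in the same sentence co-occurs once.
        for a, b in combinations(present, 2):
            adjacency[a][b] = adjacency[a].get(b, 0) + 1
            adjacency[b][a] = adjacency[b].get(a, 0) + 1
    return dict(adjacency)


# Hypothetical pre-tokenized sentences and keyword set.
sentences = [
    ["graph", "model", "represents", "text"],
    ["text", "weight"],
    ["graph", "text"],
]
adjacency = build_adjacency(sentences, {"graph", "text", "weight"})
```

Only pairs that actually co-occur are stored, so an edge exists exactly when its co-occurrence count is nonzero. Here "graph" and "weight" never share a sentence and are therefore not connected.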

Embodiment two: In this embodiment, the text graph structure representation model based on harmonic progression is used to represent texts from 70 papers published in TKDE in 2011 and 2012. As shown in Fig. 1, the steps of the model in this embodiment are as follows:

S1. Open a single text in the domain text collection, for example, a paper from the 2011 volume 24, issue 1.

S2. Rearrange the content of the text in descending order of importance, for example, in the order title, abstract, introduction, conclusion.

S3. Segment the text into words and retain the punctuation marks, for example, the full stops.

S4. Count the occurrences of keywords and keyword pairs, denoted N.

S5. Take the keywords as the nodes of the graph, and connect the keyword pairs whose co-occurrence count is not 0.

S6. Use the harmonic progression method to calculate the weights of the keywords and keyword pairs. The harmonic progression formula is denoted HP, and its formula for calculating the weight of a keyword or keyword pair is as follows:

[Formula image: HP weight formula]

where N is the number of occurrences of the keyword or keyword pair, and γ denotes Euler's constant, approximately 0.5772.
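Steps S2 to S6 can be strung together as in the following sketch. Several points are assumptions rather than details from the text: the section names in `SECTION_ORDER`, the regular-expression tokenizer for English, the pre-supplied keyword set, and above all the weight function `hp_weight`, which uses ln(count) + γ as a stand-in for the HP formula that is missing from this text.

```python
import math
import re
from collections import Counter
from itertools import combinations

EULER_GAMMA = 0.5772156649015329  # Euler's constant

# Assumed section order for step S2.
SECTION_ORDER = ["title", "abstract", "introduction", "conclusion"]


def hp_weight(count):
    """Assumed HP weight: ln(count) + gamma, for count >= 1."""
    return math.log(count) + EULER_GAMMA


def represent(sections, keywords):
    """Return (node_weights, edge_weights) for one document.

    sections: dict mapping a section name to its text.
    keywords: set of lowercase keywords (selection method not specified).
    """
    # S2: concatenate the sections in descending order of importance.
    ordered = " ".join(sections[n] for n in SECTION_ORDER if n in sections)
    # S3: use the retained full stops as sentence boundaries, then tokenize.
    sentences = [re.findall(r"[a-z]+", s.lower())
                 for s in ordered.split(".") if s.strip()]
    # S4: count keywords and same-sentence keyword pairs.
    nodes, edges = Counter(), Counter()
    for tokens in sentences:
        present = [t for t in tokens if t in keywords]
        nodes.update(present)
        for pair in combinations(sorted(set(present)), 2):
            edges[pair] += 1
    # S5: only pairs with a nonzero count exist as edges.
    # S6: weight every node and edge from its count.
    node_weights = {k: hp_weight(c) for k, c in nodes.items()}
    edge_weights = {p: hp_weight(c) for p, c in edges.items()}
    return node_weights, edge_weights


# Hypothetical two-section document.
sections = {
    "abstract": "A graph represents text. Text has weight.",
    "title": "Text graph model.",
}
node_weights, edge_weights = represent(sections, {"text", "graph", "weight"})
```

The function needs only the one document it is given, with no text set, which is the property the description contrasts with TFIDF. Note that in this sketch the reordering in S2 does not change the resulting weights, because only counts enter the assumed formula; how occurrence position affects the weight is not recoverable from this text.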

Claims

1. A text graph structure representation model based on harmonic progression, characterized in that: a graph structure model is used to represent a single text, and a harmonic progression method is used to calculate the weights of keywords and keyword pairs; the graph structure model connects the keywords of the text according to the co-occurrence of keyword pairs within the same sentence; the specific steps are as follows:

(1) Open a single text in the domain text collection;

(2) Rearrange the content of the text in descending order of importance;

(3) Segment the text into words and retain the punctuation marks;

(4) Count the occurrences of keywords and keyword pairs;

(5) Take the keywords as the nodes of the graph, and connect the keyword pairs whose co-occurrence count is not 0;

(6) Use the harmonic progression method to calculate the weights of the keywords and keyword pairs.

2. The text graph structure representation model based on harmonic progression according to claim 1, characterized in that: in the harmonic progression method of step (6), the formula for calculating the weight of a keyword or keyword pair is as follows:

[Formula image: HP weight formula]

where N is the number of occurrences of the keyword or keyword pair, and γ denotes Euler's constant, approximately 0.5772.