CN102629266A - Diagram text structure representation model based on harmonic progression - Google Patents

Info

Publication number
CN102629266A
Authority
CN
China
Prior art keywords
keyword
text
harmonic progression
keywords
graph structure
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012100594049A
Other languages
Chinese (zh)
Inventor
陈雪
吴超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology
Priority to CN2012100594049A
Publication of CN102629266A
Legal status: Pending

Abstract

The invention discloses a text graph structure representation model based on harmonic progression. The method comprises the following steps: (1) open a single text from a domain text collection; (2) rearrange the content of the text in descending order of importance; (3) segment the text into words while retaining punctuation marks; (4) count the occurrences of keywords and of keyword pairs; (5) take the keywords as the nodes of a graph and connect each keyword pair whose co-occurrence count is not zero; and (6) use the harmonic progression method to calculate the weights of the keywords and keyword pairs. The method avoids the loss of text structure information and can calculate the weights of keywords and keyword pairs from the structural information of a single text. It is simple to operate, performs well, and can also fulfil the function of TFIDF (Term Frequency, Inverse Document Frequency).

Description

A text graph structure representation model based on harmonic progression
Technical field
The present invention relates to a text representation model, and specifically to a model that represents a text as a graph structure and uses the harmonic progression to calculate the weights of keywords and keyword pairs; that is, a text graph structure representation model based on harmonic progression.
Background
Humans are good at handling unstructured text, because unstructured text matches the habits of human language and, more importantly, because humans have strong logical reasoning ability. Machines, by contrast, are good at processing structured text such as graphs and tables. For human-computer interaction, unstructured text that humans understand must be converted into structured text that machines can process, and this requires a text representation model.
The most widely used text representation model at present is the vector space model. It represents a text as a weight vector in which each component corresponds to a term, and the weight of each term is determined by the TFIDF method. The TFIDF method uses a term weight formula to calculate how important a term is to a single text within a collection. The TFIDF weight of a term is the product of its term frequency TF (Term Frequency) and its inverse document frequency IDF (Inverse Document Frequency). The TFIDF formula is as follows:
$\mathrm{TFIDF}_i = \mathrm{TF}_i \times \mathrm{IDF}_i$
where $\mathrm{TF}_i$ is the term frequency of term $i$, i.e. the number of times term $i$ occurs in the text; $\mathrm{IDF}_i$ is the inverse document frequency of term $i$, calculated as $\log(N/n_i)$; $N$ is the total number of texts in the text collection; and $n_i$ is the number of texts in the collection that contain term $i$.
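As an illustration of the TFIDF weighting just described, the following minimal Python sketch computes TF_i × log(N/n_i) for a toy collection. The function name `tfidf`, the sample documents, and the use of the natural logarithm are assumptions made for this example; the source does not state the base of the logarithm.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, with weight = TF_i * log(N / n_i)."""
    n_docs = len(docs)  # N: total number of texts in the collection
    # n_i: number of texts that contain term i (each text counted once per term)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # TF_i: raw occurrence count of term i in this text
        weights.append({t: c * math.log(n_docs / doc_freq[t]) for t, c in tf.items()})
    return weights

# Toy collection of already-tokenized texts (hypothetical data).
docs = [
    ["graph", "text", "graph"],
    ["text", "vector"],
    ["text", "keyword"],
]
w = tfidf(docs)
```

A term that occurs in every text, such as "text" here, receives weight zero because log(N/n_i) = log(1) = 0. This also shows why the calculation needs a whole collection rather than a single text.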
However, representing text with the vector space model combined with the TFIDF method has the following shortcomings:
(1) The vector space model treats a text as a set of terms and treats the terms as independent of one another, so a large amount of text structure information is lost.
(2) When calculating term frequency, the TFIDF method does not consider how the position of a term affects its weight; it considers only the occurrence count or co-occurrence count, which is not sufficient to express the actual weight.
(3) When calculating the inverse document frequency of a term, the TFIDF method requires a domain text collection and cannot be applied to a single text.
Summary of the invention
The object of the invention is to address the shortcomings of the vector space model and the TFIDF method by providing a text graph structure representation model based on harmonic progression. The model avoids the loss of text structure information and can calculate the weights of keywords and keyword pairs from the structural information of a single text.
To achieve this object, the concept of the invention is as follows: a graph structure model is used to represent a single text, which avoids the loss of text structure information and allows the weights of keywords and keyword pairs to be calculated from the structural information of a single text. The graph structure model organizes the keywords of a text and the relations between them as a graph, and then calculates weights by the harmonic progression method.
Based on this concept, the invention adopts the following technical solution:
A text graph structure representation model based on harmonic progression, characterized in that its steps are as follows:
(1) Open a single text from a domain text collection;
(2) Rearrange the content of the text in descending order of importance;
(3) Segment the text into words and retain punctuation marks;
(4) Count the occurrences of keywords and of keyword pairs;
(5) Take the keywords as the nodes of the graph and connect each keyword pair whose co-occurrence count is not 0;
(6) Use the harmonic progression method to calculate the weights of keywords and keyword pairs.
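Steps (3) to (5) can be sketched in Python as follows. This is only an illustrative reading of the steps: the function name `build_keyword_graph`, the regular-expression tokenizer (a stand-in for a real word segmenter), and the choice to count each distinct pair once per sentence are assumptions, not details given in the source.

```python
import re
from collections import Counter
from itertools import combinations

def build_keyword_graph(text, keywords):
    """Count keywords and same-sentence keyword pairs; return (node_counts, edge_counts)."""
    keywords = set(keywords)
    node_counts = Counter()  # occurrence count per keyword (graph nodes)
    edge_counts = Counter()  # co-occurrence count per keyword pair (graph edges)
    # Retained sentence-ending punctuation delimits the co-occurrence window.
    for sentence in re.split(r"[.!?。！？]", text):
        tokens = re.findall(r"\w+", sentence.lower())
        present = [t for t in tokens if t in keywords]
        node_counts.update(present)
        # Assumption: one co-occurrence per distinct unordered pair per sentence.
        for a, b in combinations(sorted(set(present)), 2):
            edge_counts[(a, b)] += 1
    return node_counts, edge_counts

text = "Graph models keep text structure. A graph links each keyword. Text has structure."
nodes, edges = build_keyword_graph(text, ["graph", "text", "keyword", "structure"])
```

Only pairs with a non-zero co-occurrence count appear as keys in `edges`, which corresponds to connecting only those keyword pairs in step (5).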
The harmonic progression method is denoted HP. Its weight formula for keywords and keyword pairs is as follows:
[Formula image not reproduced in the source text.]
where n is the occurrence count of a keyword or keyword pair, and the constant appearing in the formula is Euler's constant [symbol and value images not reproduced in the source text].
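The formula image itself is not available in this text. One plausible reading, offered here only as an assumption based on the references to harmonic progression, a logarithm, and Euler's constant, is that HP(n) is the n-th partial sum of the harmonic series together with its standard logarithmic approximation, with γ denoting Euler's constant:

```latex
HP(n) = \sum_{k=1}^{n} \frac{1}{k} \approx \ln n + \gamma,
\qquad
\gamma = \lim_{n \to \infty} \left( \sum_{k=1}^{n} \frac{1}{k} - \ln n \right) \approx 0.5772
```

The identity itself is standard; whether it matches the patent's formula exactly cannot be confirmed from the text.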
Compared with the prior art, the text graph structure representation model based on harmonic progression of the present invention has the following notable features and advantages: when no domain text collection is available, so that the discriminating power of a keyword within a collection cannot be determined, the weight of a keyword can still be determined by scanning a single text and using the keyword's occurrence count and occurrence positions; although only the occurrence count is used to estimate the weight, the method is simple, easy to operate, and effective; and because the logarithm in the harmonic progression method scales by order of magnitude, the method can also serve the function of TFIDF while being simpler.
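The logarithmic behaviour mentioned above can be illustrated with a short Python sketch. It assumes, since the formula image is missing, that the weight of an item occurring n times is the harmonic number 1 + 1/2 + ... + 1/n; the names `hp_exact`, `hp_approx`, and `EULER_GAMMA` are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant, to double precision

def hp_exact(n):
    """Harmonic number H_n = 1 + 1/2 + ... + 1/n (assumed form of the HP weight)."""
    return sum(1.0 / k for k in range(1, n + 1))

def hp_approx(n):
    """Logarithmic approximation ln(n) + gamma, valid for n >= 1."""
    return math.log(n) + EULER_GAMMA if n > 0 else 0.0

# Compare the exact sum with the approximation for a few occurrence counts.
for n in (1, 2, 10, 100):
    print(n, round(hp_exact(n), 4), round(hp_approx(n), 4))
```

Under this assumption each additional occurrence adds less weight than the previous one, so the weight grows roughly with the logarithm of the count, much as IDF damps raw frequency in TFIDF.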
Description of the drawings
Fig. 1 is a flowchart of the text graph structure representation model based on harmonic progression of the present invention.
Embodiments
Embodiments of the invention are further described below with reference to the accompanying drawing.
Embodiment one: referring to Fig. 1, this text graph structure representation model based on harmonic progression is characterized in that a graph structure model is used to represent a single text, and the harmonic progression method is used to calculate the weights of keywords and keyword pairs;
The graph structure model connects the keywords of a text according to the co-occurrence of keyword pairs within the same sentence;
The weight formula of the harmonic progression method for keywords and keyword pairs is as follows:
[Formula image not reproduced in the source text.]
where n is the occurrence count of a keyword or keyword pair, and the constant appearing in the formula is Euler's constant [symbol and value images not reproduced in the source text].
Embodiment two: this text graph structure representation model based on harmonic progression is used to represent texts drawn from 70 papers published in TKDE in 2011 and 2012. As shown in Fig. 1, the steps of the model in this embodiment are as follows:
S1. Open a single text from a domain text collection; for example, open a paper from volume 24, issue 1, of 2011;
S2. Rearrange the content of the text in descending order of importance; for example, in the order title, abstract, introduction, and conclusion;
S3. Segment the text into words and retain punctuation marks; for example, retain full stops.
S4. Count the occurrences of keywords and of keyword pairs, denoted n.
S5. Take the keywords as the nodes of the graph and connect each keyword pair whose co-occurrence count is not 0.
S6. Use the harmonic progression method to calculate the weights of keywords and keyword pairs. The harmonic progression formula is denoted HP, and its weight formula for keywords and keyword pairs is as follows:
[Formula image not reproduced in the source text.]
where n is the occurrence count of a keyword or keyword pair, and the constant appearing in the formula is Euler's constant [symbol and value images not reproduced in the source text].
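Steps S2, S4, and S6 of this embodiment can be sketched as follows. The section labels in `SECTION_ORDER`, the helper names, and the sample paper are hypothetical, and the weight function assumes the harmonic-number form 1 + 1/2 + ... + 1/n because the formula image is not available.

```python
from collections import Counter

# Hypothetical section labels, ordered from most to least important (step S2).
SECTION_ORDER = ["title", "abstract", "introduction", "conclusion"]

def reorder(sections):
    """Return the texts of the sections that are present, in importance order."""
    return [sections[name] for name in SECTION_ORDER if name in sections]

def harmonic(n):
    """Assumed HP weight: the harmonic number 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))

def weigh(counts):
    """Map each keyword (or keyword pair) to the harmonic weight of its count (step S6)."""
    return {item: harmonic(n) for item, n in counts.items()}

# Toy paper with sections given out of order (hypothetical data).
paper = {
    "conclusion": "graph models help",
    "title": "graph text model",
    "abstract": "a graph model of text",
}
ordered = reorder(paper)
# Step S4: count keyword occurrences over the reordered text.
counts = Counter(w for part in ordered for w in part.split() if w in {"graph", "text"})
weights = weigh(counts)
```

In this sketch the reordering does not change the counts; the source does not spell out how occurrence position enters the weight, so that part is left open here.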

Claims (2)

1. A text graph structure representation model based on harmonic progression, characterized in that: a graph structure model is used to represent a single text, and the harmonic progression method is used to calculate the weights of keywords and keyword pairs; the graph structure model connects the keywords of a text according to the co-occurrence of keyword pairs within the same sentence; the steps are as follows:
(1) Open a single text from a domain text collection;
(2) Rearrange the content of the text in descending order of importance;
(3) Segment the text into words and retain punctuation marks;
(4) Count the occurrences of keywords and of keyword pairs;
(5) Take the keywords as the nodes of the graph and connect each keyword pair whose co-occurrence count is not 0;
(6) Use the harmonic progression method to calculate the weights of keywords and keyword pairs.
2. The text graph structure representation model based on harmonic progression according to claim 1, characterized in that the weight formula of the harmonic progression method in step (6) for keywords and keyword pairs is as follows:
[Formula image not reproduced in the source text.]
where n is the occurrence count of a keyword or keyword pair, and the constant appearing in the formula is Euler's constant [symbol and value images not reproduced in the source text].
CN2012100594049A 2012-03-08 2012-03-08 Diagram text structure representation model based on harmonic progression Pending CN102629266A (en)

Publications (1)

Publication Number Publication Date
CN102629266A true CN102629266A (en) 2012-08-08

Family

ID=46587526

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020111941A1 (en) * 2000-12-19 2002-08-15 Xerox Corporation Apparatus and method for information retrieval
CN101067808A (en) * 2007-05-24 2007-11-07 上海大学 Text key word extracting method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘巧凤: "基于图结构的中文文本聚类方法研究", 《万方硕士学位论文》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103744835A (en) * 2014-01-02 2014-04-23 上海大学 Text keyword extracting method based on subject model
CN103744835B (en) * 2014-01-02 2016-12-07 上海大学 A kind of text key word extracting method based on topic model
CN109766408A (en) * 2018-12-04 2019-05-17 上海大学 The text key word weighing computation method of comprehensive word positional factor and word frequency factor
CN114328900A (en) * 2022-03-14 2022-04-12 深圳格隆汇信息科技有限公司 Information abstract extraction method based on key words

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20120808