CN102629266A - Diagram text structure representation model based on harmonic progression - Google Patents
- Publication number: CN102629266A
- Authority
- CN
- China
- Prior art keywords
- keyword
- text
- harmonic progression
- keywords
- graph structure
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a text graph-structure representation model based on harmonic progression. The method comprises the following steps: (1) open a single text from a domain corpus; (2) rearrange the content of the text in descending order of importance; (3) segment the text into words while retaining punctuation marks; (4) count the occurrences of keywords and of keyword pairs; (5) take the keywords as the nodes of a graph and connect the keyword pairs whose co-occurrence count is nonzero; and (6) use the harmonic progression method to calculate the weights of the keywords and keyword pairs. The method avoids the loss of text structure information and calculates the weights of keywords and keyword pairs from the structural information of a single text. It is simple to operate, effective, and can serve the function of TFIDF (Term Frequency-Inverse Document Frequency).
Description
Technical field
The present invention relates to a text representation model, and specifically to a model that represents text with a graph structure and uses harmonic progression to calculate the weights of keywords and keyword pairs; that is, a text graph-structure representation model based on harmonic progression.
Background technology
Humans are good at processing unstructured text, because unstructured text matches the way humans express themselves in language and, more importantly, because humans have strong logical reasoning abilities. Machines, by contrast, are good at processing structured text, such as graphs and tables. In human-computer interaction, unstructured text that humans understand must be converted into structured text that machines understand, and this requires a text representation model.
The most widely used text representation model at present is the vector space model. The vector space model represents a text as a weight vector in which each component corresponds to a term, and the weight of each term is determined by the TFIDF method. The TFIDF method uses a term-weight formula to calculate how important a term is to a single text within a corpus. The TFIDF weight of a term is the product of its term frequency TF (Term Frequency) and its inverse document frequency IDF (Inverse Document Frequency). The specific TFIDF formula is as follows:
$$\mathrm{TFIDF}_i = TF_i \times IDF_i = TF_i \times \log\left(\frac{N}{n_i}\right)$$
where $TF_i$ is the term frequency of term $i$, i.e. the number of times term $i$ occurs in the text; $IDF_i$ is the inverse document frequency of term $i$, calculated as $\log(N/n_i)$; $N$ is the total number of texts in the text set; and $n_i$ is the number of texts in the text set that contain term $i$.
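As a small worked illustration of the TFIDF weight described above, the following sketch computes the product of term frequency and log(N/n_i) for every term of one text. The function name `tfidf_weights`, the toy corpus, and the use of the natural logarithm are assumptions made for this example; the text does not specify the base of the logarithm.

```python
import math
from collections import Counter


def tfidf_weights(doc_tokens, corpus_tokens):
    """Return {term: TF_i * log(N / n_i)} for the terms of one text.

    doc_tokens    -- list of terms of the text being weighted
    corpus_tokens -- list of texts (each a list of terms); N = len(corpus_tokens)
    The natural logarithm is an assumption; the source only writes "log".
    """
    n_texts = len(corpus_tokens)
    term_freq = Counter(doc_tokens)          # TF_i: occurrences in this text
    weights = {}
    for term, freq in term_freq.items():
        # n_i: number of texts in the set that contain the term
        n_i = sum(1 for text in corpus_tokens if term in text)
        if n_i == 0:
            continue                         # IDF is undefined outside the text set
        weights[term] = freq * math.log(n_texts / n_i)
    return weights


# Hypothetical three-text corpus; the first text is the one being weighted.
corpus = [
    ["graph", "text", "graph", "model"],
    ["text", "vector"],
    ["text", "weight"],
]
weights = tfidf_weights(corpus[0], corpus)
print(weights)
```

Note that a term occurring in every text of the set ("text" here) receives weight 0, and that the calculation cannot be done at all without the surrounding text set.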
However, representing text with the vector space model combined with the TFIDF method has the following shortcomings:
(1) The vector space model treats a text as a set of terms and treats the terms as independent of one another, so a large amount of text structure information is lost.
(2) When calculating the term frequency of a term, the TFIDF method considers only the occurrence count or co-occurrence count and ignores the influence of the position at which the term occurs, so it cannot adequately express the term's actual weight.
(3) When calculating the inverse document frequency of a term, the TFIDF method depends on a domain text set and cannot be applied to a single text.
Summary of the invention
The object of the present invention is to address the shortcomings of the vector space model and the TFIDF method by providing a text graph-structure representation model based on harmonic progression. The model avoids the loss of text structure information and calculates the weights of keywords and keyword pairs from the structural information of a single text.
To achieve this object, the concept of the invention is as follows: a graph structure model is used to represent a single text, which avoids the loss of text structure information and allows the weights of keywords and keyword pairs to be calculated from the structural information of that text. The graph structure model organizes the keywords of the text and the relations between them as a graph, and the weights are then calculated with the harmonic progression method.
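One way to picture the graph structure is as an adjacency map in which keywords are nodes and an edge carries the number of sentences in which two keywords co-occur. The sketch below is only an illustration under assumptions: it takes sentences that are already segmented into tokens, assumes the keyword set is given, and uses sentence-level co-occurrence as the relation (as the embodiments describe). The function name `build_cooccurrence_graph` and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(sentences, keywords):
    """Build an undirected keyword graph from tokenized sentences.

    Returns {keyword: {neighbour: co-occurrence count}}. Two keywords are
    connected only if they appear together in at least one sentence, so
    every stored edge has a nonzero count.
    """
    nodes = set()
    edge_counts = defaultdict(int)
    for tokens in sentences:
        # Distinct keywords in this sentence, sorted so each pair has one key
        present = sorted({tok for tok in tokens if tok in keywords})
        nodes.update(present)
        for a, b in combinations(present, 2):
            edge_counts[(a, b)] += 1

    adjacency = {node: {} for node in nodes}
    for (a, b), count in edge_counts.items():
        adjacency[a][b] = count
        adjacency[b][a] = count
    return adjacency


# Hypothetical input: three already-segmented sentences and a keyword set.
sentences = [
    ["graph", "text", "model"],
    ["graph", "weight"],
    ["text", "graph"],
]
graph = build_cooccurrence_graph(sentences, {"graph", "text", "weight"})
print(graph)
```

In this toy input "text" and "weight" never share a sentence, so no edge joins them; that pairwise structure is the information a plain bag of terms would discard.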
Based on this inventive concept, the present invention adopts the following technical solution:
A text graph-structure representation model based on harmonic progression, characterized in that its specific steps are as follows:
(1) Open a single text from a domain corpus;
(2) Rearrange the content of the text in descending order of importance;
(3) Segment the text into words and retain the punctuation marks;
(4) Count the occurrences of keywords and of keyword pairs;
(5) Take the keywords as the nodes of the graph and connect the keyword pairs whose co-occurrence count is not 0;
(6) Use the harmonic progression method to calculate the weights of the keywords and keyword pairs.
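Steps (2) to (4) can be sketched roughly as follows. This is a minimal illustration, not the patented procedure: the section order in `SECTION_ORDER`, the regular-expression tokenizer for English text (Chinese text would need a real word segmenter), the given keyword set, and all function names are assumptions introduced for the example.

```python
import re
from collections import Counter
from itertools import combinations

# Assumed importance order of sections, most important first.
SECTION_ORDER = ["title", "abstract", "introduction", "conclusion"]


def reorder(sections):
    """Step (2): concatenate sections in descending order of importance."""
    return " ".join(sections[name] for name in SECTION_ORDER if name in sections)


def split_sentences(text):
    """Step (3): split on full stops while keeping each full stop."""
    return [s.strip() for s in re.findall(r"[^.]+\.?", text) if s.strip()]


def count_keywords_and_pairs(text, keywords):
    """Step (4): count keyword occurrences and sentence-level pair co-occurrences."""
    keyword_counts = Counter()
    pair_counts = Counter()
    for sentence in split_sentences(text):
        tokens = re.findall(r"[a-z]+", sentence.lower())   # naive tokenizer
        present = [tok for tok in tokens if tok in keywords]
        keyword_counts.update(present)
        for pair in combinations(sorted(set(present)), 2):
            pair_counts[pair] += 1
    return keyword_counts, pair_counts


# Hypothetical text given as unordered sections.
sections = {
    "abstract": "Graph models keep text structure.",
    "title": "Graph text model.",
    "conclusion": "Weights use the graph.",
}
ordered_text = reorder(sections)
keyword_counts, pair_counts = count_keywords_and_pairs(
    ordered_text, {"graph", "text", "weights"}
)
print(ordered_text)
print(keyword_counts, pair_counts)
```

Retaining the full stops is what makes the sentence boundaries recoverable after segmentation, which the pair counts depend on.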
The harmonic progression method is denoted HP; its weight calculation formula for keywords and keyword pairs is as follows:
Compared with the prior art, the text graph-structure representation model based on harmonic progression of the present invention has the following salient features and advantages: when no domain text set is available, so that the discriminative power of a keyword within a text set cannot be determined, the weight of a keyword can be determined by scanning a single text and using the keyword's occurrence count and occurrence positions; although only occurrence counts are used to estimate the weight, the method is simple, easy to operate, and effective; and because the harmonic progression method has a logarithmic, scalable order of magnitude, it can also serve the function of TFIDF while being simpler.
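The HP formula itself is not reproduced in this text, so the following sketch rests on a loudly stated assumption: that the weight for an occurrence count n is the n-th partial sum of the harmonic series, 1 + 1/2 + ... + 1/n. The function name `harmonic_weight` is hypothetical. What the sketch does show with certainty is the known mathematical fact behind the logarithmic remark above: these partial sums grow like ln(n) plus the Euler-Mascheroni constant.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def harmonic_weight(n):
    """ASSUMED HP weight: the n-th harmonic partial sum 1 + 1/2 + ... + 1/n.

    This is an illustrative guess, not the formula from the patent, which
    is missing from the extracted text.
    """
    if n < 0:
        raise ValueError("occurrence count must be non-negative")
    return sum(1.0 / k for k in range(1, n + 1))


# Compare the partial sums with the logarithmic approximation ln(n) + gamma.
for n in (1, 2, 10, 100):
    approx = math.log(n) + EULER_GAMMA
    print(n, round(harmonic_weight(n), 4), round(approx, 4))
```

Under this assumption each additional occurrence adds less weight than the previous one, which gives a damping effect similar to a logarithm without requiring a text set.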
Description of drawings
Fig. 1 is a flow chart of the text graph-structure representation model based on harmonic progression of the present invention.
Embodiment
The embodiments of the present invention are further described below with reference to the accompanying drawing.
Embodiment one: referring to Fig. 1, this text graph-structure representation model based on harmonic progression is characterized in that a graph structure model is used to represent a single text, and the harmonic progression method is used to calculate the weights of keywords and keyword pairs;
The graph structure model connects the keywords of the text according to the co-occurrence of keyword pairs within the same sentence;
The weight calculation formula of the harmonic progression method for keywords and keyword pairs is as follows:
Embodiment two: this text graph-structure representation model based on harmonic progression is used to represent texts drawn from 70 papers published in TKDE in 2011 and 2012. As shown in Fig. 1, the steps of the model in this embodiment are as follows:
S1. Open a single text from the domain corpus, for example a paper from the 2011 volume 24, issue 1;
S2. Rearrange the content of the text in descending order of importance, for example in the order title, abstract, introduction, conclusion;
S3. Segment the text into words and retain punctuation marks, for example full stops;
S4. Count the occurrences of keywords and of keyword pairs, denoted n;
S5. Take the keywords as the nodes of the graph and connect the keyword pairs whose co-occurrence count is not 0;
S6. Use the harmonic progression method to calculate the weights of the keywords and keyword pairs; the harmonic progression formula is denoted HP, and its weight calculation formula for keywords and keyword pairs is as follows:
Claims (2)
1. A text graph-structure representation model based on harmonic progression, characterized in that: a graph structure model is used to represent a single text, and the harmonic progression method is used to calculate the weights of keywords and keyword pairs; the graph structure model connects the keywords of the text according to the co-occurrence of keyword pairs within the same sentence; the specific steps are as follows:
(1) Open a single text from a domain corpus;
(2) Rearrange the content of the text in descending order of importance;
(3) Segment the text into words and retain the punctuation marks;
(4) Count the occurrences of keywords and of keyword pairs;
(5) Take the keywords as the nodes of the graph and connect the keyword pairs whose co-occurrence count is not 0;
(6) Use the harmonic progression method to calculate the weights of the keywords and keyword pairs.
2. The text graph-structure representation model based on harmonic progression according to claim 1, characterized in that the weight calculation formula of the harmonic progression method in step (6) for keywords and keyword pairs is as follows:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2012100594049A CN102629266A (en) | 2012-03-08 | 2012-03-08 | Diagram text structure representation model based on harmonic progression |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102629266A (en) | 2012-08-08 |
Family
ID=46587526
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2012100594049A Pending CN102629266A (en) | 2012-03-08 | 2012-03-08 | Diagram text structure representation model based on harmonic progression |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102629266A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103744835A (en) * | 2014-01-02 | 2014-04-23 | 上海大学 | Text keyword extracting method based on subject model |
CN109766408A (en) * | 2018-12-04 | 2019-05-17 | 上海大学 | The text key word weighing computation method of comprehensive word positional factor and word frequency factor |
CN114328900A (en) * | 2022-03-14 | 2022-04-12 | 深圳格隆汇信息科技有限公司 | Information abstract extraction method based on key words |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020111941A1 (en) * | 2000-12-19 | 2002-08-15 | Xerox Corporation | Apparatus and method for information retrieval |
CN101067808A (en) * | 2007-05-24 | 2007-11-07 | 上海大学 | Text key word extracting method |
Non-Patent Citations (1)
Title |
---|
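Liu Qiaofeng (刘巧凤): "Research on Chinese text clustering methods based on graph structure" (基于图结构的中文文本聚类方法研究), Wanfang Master's Theses (万方硕士学位论文) * |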
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103744835A (en) * | 2014-01-02 | 2014-04-23 | 上海大学 | Text keyword extracting method based on subject model |
CN103744835B (en) * | 2014-01-02 | 2016-12-07 | 上海大学 | A kind of text key word extracting method based on topic model |
CN109766408A (en) * | 2018-12-04 | 2019-05-17 | 上海大学 | The text key word weighing computation method of comprehensive word positional factor and word frequency factor |
CN114328900A (en) * | 2022-03-14 | 2022-04-12 | 深圳格隆汇信息科技有限公司 | Information abstract extraction method based on key words |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101067808B (en) | Text key word extracting method | |
Rychlý | A Lexicographer-Friendly Association Score. | |
CN103207905B (en) | A kind of method of calculating text similarity of based target text | |
CN108536677A (en) | A kind of patent text similarity calculating method | |
CN104102681B (en) | Microblog key event acquiring method and device | |
CN102693279B (en) | Method, device and system for fast calculating comment similarity | |
CN103514213B (en) | Term extraction method and device | |
CN104199846B (en) | Comment key phrases clustering method based on wikipedia | |
CN106372122B (en) | A kind of Document Classification Method and system based on Wiki semantic matches | |
CN109471933A (en) | A kind of generation method of text snippet, storage medium and server | |
CN109376352A (en) | A kind of patent text modeling method based on word2vec and semantic similarity | |
CN109726298A (en) | Knowledge mapping construction method, system, terminal and medium suitable for scientific and technical literature | |
CN102411564A (en) | Electronic homework copying detection method | |
CN103123624A (en) | Method of confirming head word, device of confirming head word, searching method and device | |
Rendl et al. | Constraint models for the container pre-marshaling problem | |
CN105095430A (en) | Method and device for setting up word network and extracting keywords | |
CN103530316A (en) | Science subject extraction method based on multi-view learning | |
CN102629266A (en) | Diagram text structure representation model based on harmonic progression | |
CN104951430A (en) | Product feature tag extraction method and device | |
Pande et al. | Application of natural language processing tools in stemming | |
CN104572736A (en) | Keyword extraction method and device based on social networking services | |
CN102779119A (en) | Method and device for extracting keywords | |
CN103164394A (en) | Text similarity calculation method based on universal gravitation | |
CN102591976A (en) | Text characteristic extracting method and document copy detection system based on sentence level | |
Gupta et al. | Improving unsupervised stemming by using partial lemmatization coupled with data-based heuristics for Hindi |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C12 | Rejection of a patent application after its publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20120808 |