CN101067807A - Text semantic visable representation and obtaining method - Google Patents
- Publication number
- CN101067807A
- Authority
- CN
- China
- Prior art keywords
- text
- keyword
- semantic
- subject
- weight
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a method for visually representing and acquiring text semantics. Text semantics are divided into three levels: a low level formed by a set of discrete keywords, a middle level of text topics formed by paragraphs, and a high level of discourse formed by interlinked topics. Keywords are extracted and a semantic matrix is generated for each text topic by matrix operations on keyword weights; the topics are then linked to form the discourse semantics. The method uses the contextual dependencies within text data to extract semantics, which improves the accuracy of semantic extraction for complex data objects. It decomposes a data object into a multi-layer description with nodes of different granularity, characterizes the topical relationships among nodes by modeling the contextual structure among them, and uses weights to measure the degree of correlation between keywords, between keywords and text topics, and between topics.
Description
Technical field:
The present invention relates to a method by which a computer automatically generates and acquires a representation of text semantics, and more particularly to a method for the visual representation and acquisition of text semantics based on a semantic matrix.
Background technology:
With the development of information and Internet technology, enormous amounts of information covering every field can now be obtained easily through electronic and network media. This so-called information explosion creates an urgent demand for techniques that organize, index and retrieve information resources quickly and effectively. Unstructured information, diverse information categories and the wide range of document content pose great challenges for information organization and retrieval. For example, the Web has become the most important information and knowledge base in fields such as scientific research, education and learning, but the exponential growth of Web information also makes it very difficult for users to exploit it effectively. Digital libraries, which have been built extensively in recent years, are another important source of massive information. A digital library is a digital repository that holds a large amount of structured information. Its resources may be produced by traditional libraries, museums, archives, universities, government departments, professional associations or individuals, and its goal is to let everyone access all human knowledge at any time and place from any digital device connected to the Internet. Taking a book as 300 pages of 1500 characters each, the text of one million books amounts to about 900 GB; with the associated metadata descriptions, the total volume of XML documents exceeds 1 TB. A digital library also contains a large number of multimedia resources, such as video and audio for teaching, research and entertainment.
Through software and service facilities such as search engines and browsers, users can access the information and resources of the Web or of a digital library, but they often need finer-grained knowledge that better matches their needs, rather than piles of information. For example, a user may want to obtain, at the same time, information in different media forms that expresses the same topic (such as web pages and e-books in text form, presentation documents that combine images and text, and multimedia documents in audio-visual form). Therefore, to meet users' demand for diversified, personalized information and knowledge services, Web-based information service systems (for example, network educational resource management systems) and the content management and access systems of digital libraries must be able to perform effective semantic extraction and related analysis on such semi-structured information or data in multiple media forms.
The present invention concerns semantic extraction from text data resources, such as hypertext, Web pages, digital books and educational resources. These data objects consist of unstructured character or data streams, yet they also have internal structure. Existing methods for classifying this kind of data have the following shortcomings:
(1) Purely statistical methods use very little semantic information during semantic extraction, although semantic information is important both for the accuracy of a retrieval system and for meeting user needs;
(2) An important assumption of statistical methods is that all data are entities with the same structure and that the data are independent and identically distributed. Many real data sets, however, have complex internal structure. For example, topic extraction and classification of hypertext can be carried out with traditional text mining methods, in which each document is described by a keyword or term vector and each web page is then classified independently. Such a statistical method ignores the internal structure of the document entirely. In general, every document has internal structure such as paragraphs. Therefore, when processing semi-structured data resources of this kind, the relationships among the data cannot be ignored.
To solve these two problems, new models and methods are needed that use the internal structure of text to perform effective semantic extraction and analysis on semi-structured relational data. The present invention provides such a method for representing and acquiring text semantics. Its core is to model the contextual dependencies of text semantics from the internal structure of the text, to build a text semantic representation model on the basis of an inference rule based on matrix operations (fuzzy cognitive maps), and to visualize that model.
Summary of the invention:
The object of the present invention is to address the problems of the prior art by providing a method for the visual representation and acquisition of text semantics that uses the internal structure of text data to extract semantics more effectively. The method can be applied directly to different semi-structured data resources. Text data here means hypertext, Web pages, digital books, educational resources and the like; these data objects consist of unstructured character or data streams but also have complex internal structure.
To achieve this object, the concept of the invention is to model the internal multi-layer semantic structure of such semi-structured data with a semantic matrix and its graph visualization. The semantic matrix and its graph visualization can model semi-structured data objects with complex internal structure, and can therefore effectively characterize the contextual topic correlations among the internal nodes of a data object.
According to the above inventive concept, the present invention adopts the following technical solution:
A method for the visual representation and acquisition of text semantics, characterized in that text semantics are divided into three levels: low-level text semantics formed by a set of discrete keywords; text topic semantics formed by text paragraphs, as the middle level of the text; and high-level discourse semantics formed by linking the text topics to one another. Keywords are extracted and the semantic matrix of each text topic is generated by matrix operations on keyword weights, and the text topics are then linked to form the discourse semantics. The specific steps are as follows:
(1) Divide the text semantics into three levels: low-level text semantics formed by a set of discrete keywords, text topic semantics formed by text paragraphs as the middle level, and high-level discourse semantics formed by linking the text topics to one another. Keywords are extracted with the TF-IDF formula. The downloaded text is then divided into paragraphs according to its internal structure (for example, natural paragraphs). The title of a text paragraph is represented by an XML tag, and one paragraph represents one text topic. All paragraphs of one text are stored in one XML file, so one XML file represents one text discourse;
(2) Compute the state values of the keywords in each text paragraph and the weights between keywords: count how often each keyword occurs in the paragraph, then compute the keyword state values and the weights between keywords;
(3) Using the keyword state values, the weights between keywords and an inference rule based on matrix multiplication, compute the weight of each keyword with respect to the text topic, normalize all weights in the text topic to the interval [0, 1], and generate the semantic matrix of the text topic;
(4) Form the title of the topic node of the text topic from the title of the text paragraph, or from the sentence in the paragraph whose ratio of keyword count to word count is largest;
(5) Find the keywords shared by the text topics and, from the weights of these shared keywords with respect to their respective text topics, compute the weights between the text topics in the discourse; link the text topics into a discourse and visualize it as a directed graph: all weights in the discourse are normalized to the interval [0, 1], and the text topic semantics and the discourse semantics generated from them are represented as a directed graph of nodes and directed edges, which gives the graph visualization of the discourse semantics;
(6) In the visualized graph of the discourse semantics, prune the keywords exclusive to each text topic.
Compared with the prior art, the present invention has the following prominent substantive features and notable advantages:
(1) The invention can make effective use of various contextual dependencies (both the context among structures of the same granularity inside a data object and the context across granularities) to extract semantics more effectively, and thereby improve the accuracy of semantic extraction for complex data objects.
(2) The method decomposes a data object, according to its internal structure, into a multi-layer description with nodes of different granularity, and characterizes the text topic correlations among nodes by modeling the contextual structure among them.
(3) The method uses weights to measure the degree of correlation between keywords, between keywords and text topics, and between text topics.
(4) The inference rule used to generate the semantic matrix is the inference rule of Fuzzy Cognitive Maps (FCM).
The invention obtains and represents the semantics of a text hierarchically, concisely and efficiently, which makes the text easier for a computer to process and understand.
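Step (1) of the method can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each paragraph is treated as one document for the IDF statistics, the tokenizer is a plain regular expression, and the XML element and attribute names (`discourse`, `topic`, `title`) are hypothetical, since the source does not specify a schema.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter


def split_paragraphs(text):
    """Split raw text into natural paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def tokenize(paragraph):
    """Lowercase word tokenizer (assumption; real text needs a proper segmenter)."""
    return re.findall(r"[a-z]+", paragraph.lower())


def tfidf_keywords(paragraphs, top_k=3):
    """Return the top_k TF-IDF terms of each paragraph.

    Assumption: each paragraph counts as one document for the IDF part.
    """
    token_lists = [tokenize(p) for p in paragraphs]
    n = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    result = []
    for tokens in token_lists:
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n / doc_freq[term])
            for term, count in tf.items()
        }
        # Sort by descending score, then alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:top_k])
    return result


def to_xml(paragraphs, titles):
    """Store one paragraph per <topic> element (hypothetical schema)."""
    root = ET.Element("discourse")
    for title, body in zip(titles, paragraphs):
        topic = ET.SubElement(root, "topic", title=title)
        topic.text = body
    return ET.tostring(root, encoding="unicode")
```

With this layout, one XML string corresponds to one text discourse and each `<topic>` element to one text topic, which mirrors the three-level split described in the steps.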
Description of drawings:
Fig. 1 shows the semantic matrix of a text topic containing 4 keywords, together with its graph visualization.
Fig. 2 shows the semantic matrix and visualized graph of the text topic formed by the text paragraph "Text representation based on fuzzy cognitive maps".
Fig. 3 shows the visualized graph of the text topic formed by the text paragraph "Fuzzy cognitive maps".
Fig. 4 shows the visualized graph of the text topic formed by the text paragraph "Automatic construction of fuzzy cognitive maps".
Fig. 5 shows the visualized graph of the text discourse generated by linking the three text topics.
Fig. 6 shows the visualized graph of the text discourse after pruning.
Fig. 7 shows how the direction of the keyword-to-topic weights is reversed when the weights between text topics are computed.
Embodiment:
A preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings:
Given four keywords $C_1$, $C_2$, $C_3$, $C_4$, the semantic matrix of their text topic and the visualized graph of that topic are shown in Fig. 1.
The specific steps of the method for the visual representation and acquisition of text semantics are as follows:
(1) Divide the semantics of a text into three levels: low-level text semantics formed by a set of discrete keywords, text topic semantics formed by text paragraphs as the middle level, and high-level discourse semantics formed by linking the text topics to one another;
(2) Compute the state value $V_{C_i}$ of each keyword and the weight $w_{ij}$ between keywords;
(3) Using the keyword state values $V_{C_i}$ and the weights $w_{ij}$ between keywords, obtain the weight $V_{C_j}$ of each keyword with respect to the topic node by one inference step of matrix operations, normalize all values, and generate the semantic matrix $E$ of the text topic;
(4) Form the title of the topic node of the text topic from the title of the text paragraph, or from the sentence in the paragraph whose ratio of keyword count to word count is largest;
Here $w_{ij}$ denotes the weight between the $i$-th keyword $C_i$ and the $j$-th keyword $C_j$. It is computed by a formula [formula image not reproduced] in which the paragraph has $m$ sentences in total, and $b_k = 1$ if keywords $C_i$ and $C_j$ co-occur in the $k$-th sentence and $b_k = 0$ otherwise.
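The formula for the keyword-to-keyword weight is not reproduced in the text, so the following Python sketch assumes the simplest reading of the definitions: the weight is the share of the $m$ sentences in which both keywords occur, that is, the mean of the indicators $b_k$. The function name and the representation of a sentence as a set of tokens are illustrative choices, not part of the source.

```python
def cooccurrence_weight(sentences, kw_i, kw_j):
    """Assumed form of w_ij: (1/m) * sum of b_k over the m sentences.

    sentences: list of token sets, one per sentence of the paragraph.
    b_k is 1 when both keywords occur in sentence k, else 0.
    """
    m = len(sentences)
    if m == 0:
        return 0.0
    b = [1 if kw_i in s and kw_j in s else 0 for s in sentences]
    return sum(b) / m


sentences = [
    {"fcm", "reasoning"},
    {"fcm"},
    {"reasoning", "cause"},
    {"fcm", "reasoning", "cause"},
]
w = cooccurrence_weight(sentences, "fcm", "reasoning")  # both occur in 2 of 4 sentences
```

Under this assumption the weight is symmetric in the two keywords and always lies in [0, 1].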
The state value $V_{C_i}$ of the $i$-th keyword $C_i$ in the text is computed by a formula [formula image not reproduced] in which $x_i$ denotes the frequency with which the $i$-th keyword occurs in the text. The weight of keyword $C_i$ with respect to the topic node is obtained by one inference step with the inference formula [formula image not reproduced], where $f(\cdot)$ denotes the normalization function applied to all keyword weights (normalization by the arithmetic sum is used here), the paragraph contains $N$ keywords, $V_{C_i}$ denotes the state value of the $i$-th keyword $C_i$ in the text, and $w_{ij}$ denotes the weight between the $i$-th keyword $C_i$ and the $j$-th keyword $C_j$. The weights of all keywords with respect to the topic node form column $j$ of the semantic matrix $E$ of the text topic;
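Because the state-value and inference formulas are missing from the text, the sketch below fills them in with explicitly assumed forms: the state value is the keyword frequency divided by the total frequency, and the inference step is the usual fuzzy-cognitive-map product of the state vector with the weight matrix, excluding self-influence, followed by normalization by the arithmetic sum. Both choices and all function names are assumptions made for illustration only.

```python
def state_values(freqs):
    """Assumed state value V_Ci: frequency x_i divided by the total frequency."""
    total = sum(freqs)
    return [x / total for x in freqs] if total else [0.0] * len(freqs)


def infer_topic_weights(states, w):
    """One inference step (assumed form): raw_j = sum over i != j of V_Ci * w_ij.

    The normalization f divides every raw value by their arithmetic sum,
    so the returned weights lie in [0, 1] and add up to 1.
    """
    n = len(states)
    raw = [
        sum(states[i] * w[i][j] for i in range(n) if i != j)
        for j in range(n)
    ]
    total = sum(raw)
    return [r / total for r in raw] if total else raw


# Three keywords with frequencies 2, 1, 1 and a symmetric weight matrix.
states = state_values([2, 1, 1])  # [0.5, 0.25, 0.25]
w = [
    [0.0, 1.0, 0.5],
    [1.0, 0.0, 0.0],
    [0.5, 0.0, 0.0],
]
weights = infer_topic_weights(states, w)
```

The normalized values could then be written into the corresponding column of the semantic matrix of the text topic.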
For example, a text has three text paragraphs whose titles are "FCM-based document representation", "Fuzzy Cognitive Maps (FCM)" and "FCM's automatic construction". The semantic matrices of their text topics and the corresponding graph visualizations are shown in Fig. 2, Fig. 3 and Fig. 4 respectively.
In Fig. 2, the text paragraph contains the keywords $C_1$ (fuzzy cognitive map), $C_2$ (semantics), $C_3$ (representation), $C_4$ (reasoning), $C_5$ (cause and effect), $C_6$ (keyword), $C_7$ (template) and $C_{18}$ (text), and the topic node $C_0^0$ (FCM-based document representation); its graph visualization is shown in Fig. 3 (a) and the semantic matrix of the text topic in Fig. 3 (b);
In Fig. 3, the text paragraph contains the keywords $C_4$ (reasoning), $C_5$ (cause and effect), $C_{15}$ (relation), $C_{17}$ (concept) and $C_{33}$ (graph), and the topic node $C_1^0$ (Fuzzy Cognitive Maps);
In Fig. 4, the text paragraph contains the keywords $C_1$ (fuzzy cognitive map), $C_4$ (reasoning), $C_5$ (cause and effect), $C_6$ (keyword), $C_7$ (template) and $C_{16}$ (automatic construction), and the topic node $C_8^0$ (FCM's automatic construction);
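The keyword sets listed for Fig. 2 to Fig. 4 already determine which keywords any two paragraphs share. A minimal Python sketch, using only the keyword labels given above (the dictionary layout and function name are illustrative):

```python
from itertools import combinations

# Keyword labels per text topic, as listed for Fig. 2, Fig. 3 and Fig. 4.
topics = {
    "FCM-based document representation": {"C1", "C2", "C3", "C4", "C5", "C6", "C7", "C18"},
    "Fuzzy Cognitive Maps": {"C4", "C5", "C15", "C17", "C33"},
    "FCM's automatic construction": {"C1", "C4", "C5", "C6", "C7", "C16"},
}


def shared_keywords(topics):
    """Map each pair of topic titles to the keywords both topics contain."""
    return {(a, b): topics[a] & topics[b] for a, b in combinations(topics, 2)}


shared = shared_keywords(topics)
```

Here the first and third paragraphs share five keywords, while the second paragraph shares only C4 and C5 with each of the others.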
(5) Find the keywords $C_k$ shared by the text paragraphs to be linked. From the weights $w_{ki}$ and $w_{kj}$ of each shared keyword, obtain $w_{jk}$ by reversal, and then compute the weight $T_{ji}$ between the topic nodes. Fig. 7 shows the reversal of the direction of the shared keywords' weights with respect to the topic nodes when the weights between text topics are computed, and Fig. 5 is the graph visualization of the discourse semantics after the three text topic semantics are linked;
Here the weight between the topic nodes $C_j^0$ and $C_i^0$ is computed by a formula [formula image not reproduced] in which the two topic nodes share $N_1$ common keywords; $\beta_k$ denotes the reversal coefficient of the $k$-th keyword $C_k$, takes values in $[0, 1]$ and can also be obtained from Bayes' formula; $V_{C_k}$ denotes the state value of keyword $C_k$; $w_{jk}$ denotes the weight of the reversed keyword $C_k$ with respect to topic node $C_j^0$; and $w_{ki}$ denotes the weight of keyword $C_k$ with respect to topic node $C_i^0$;
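The formula for the weight between two topic nodes is not reproduced in the text. Purely as an illustration, the Python sketch below assumes that the weight is the sum, over the $N_1$ shared keywords, of the product of the four quantities defined above: the reversal coefficient, the state value, the reversed weight and the forward weight. This product form, the `SharedKeyword` record and its field names are assumptions, not the patent's stated formula.

```python
from dataclasses import dataclass


@dataclass
class SharedKeyword:
    """Quantities attached to one keyword C_k shared by two topic nodes."""
    beta: float   # reversal coefficient beta_k, in [0, 1]
    state: float  # state value V_Ck
    w_jk: float   # reversed weight of C_k with respect to topic node C_j^0
    w_ki: float   # weight of C_k with respect to topic node C_i^0


def topic_link_weight(shared):
    """Assumed form: T_ji = sum over k of beta_k * V_Ck * w_jk * w_ki."""
    return sum(k.beta * k.state * k.w_jk * k.w_ki for k in shared)


shared = [
    SharedKeyword(beta=1.0, state=0.5, w_jk=0.4, w_ki=0.5),
    SharedKeyword(beta=0.5, state=0.2, w_jk=1.0, w_ki=1.0),
]
t_ji = topic_link_weight(shared)
```

Whatever the exact formula, the resulting link weights would still need to be normalized to [0, 1] across the discourse before being drawn as directed edges.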
(6) Prune the keywords exclusive to each text topic's semantics. Fig. 6 is the visualized graph of the discourse semantics after the keywords exclusive to the three text topic semantics have been pruned.
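The pruning step can be sketched with the keyword labels of the three example paragraphs: a keyword that occurs in only one topic is treated as exclusive and removed, and the remaining keyword-to-topic links are emitted as a directed graph. The Graphviz DOT output and the keyword-to-topic edge direction are illustrative assumptions; the source only states that the result is drawn as a directed graph.

```python
from collections import Counter

# Keyword labels per text topic, as listed for the three example paragraphs.
topics = {
    "FCM-based document representation": {"C1", "C2", "C3", "C4", "C5", "C6", "C7", "C18"},
    "Fuzzy Cognitive Maps": {"C4", "C5", "C15", "C17", "C33"},
    "FCM's automatic construction": {"C1", "C4", "C5", "C6", "C7", "C16"},
}


def prune_exclusive(topics):
    """Drop every keyword that occurs in exactly one topic."""
    counts = Counter(kw for kws in topics.values() for kw in kws)
    return {
        name: {kw for kw in kws if counts[kw] > 1}
        for name, kws in topics.items()
    }


def to_dot(pruned):
    """Emit keyword -> topic edges in Graphviz DOT syntax (assumed output format)."""
    lines = ["digraph discourse {"]
    for topic, kws in pruned.items():
        for kw in sorted(kws):
            lines.append(f'  "{kw}" -> "{topic}";')
    lines.append("}")
    return "\n".join(lines)


pruned = prune_exclusive(topics)
dot = to_dot(pruned)
```

After pruning, only C1, C4, C5, C6 and C7 remain, so the graph keeps exactly the keywords that connect one topic to another.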
Claims (2)
1. A method for the visual representation and acquisition of text semantics, characterized in that text semantics are divided into three levels: low-level text semantics formed by a set of discrete keywords; text topic semantics formed by text paragraphs, as the middle level of the text; and high-level discourse semantics formed by linking the text topics to one another; keywords are extracted and the semantic matrix of each text topic is generated by matrix operations on keyword weights, and the text topics are then linked to form the discourse semantics.
2. The method for the visual representation and acquisition of text semantics according to claim 1, characterized in that the operation steps are as follows:
(1) divide the text semantics into three levels: low-level text semantics formed by a set of discrete keywords, text topic semantics formed by text paragraphs as the middle level of the text, and high-level discourse semantics formed by linking the text topics to one another;
(2) compute the state values of the keywords in each text paragraph and the weights between keywords;
(3) using the weights between keywords, the keyword state values and an inference rule based on matrix multiplication, compute the weight of each keyword with respect to the text topic and generate the semantic matrix of the text topic;
(4) form the title of the topic node of the text topic from the title of the text paragraph, or from the sentence in the paragraph whose ratio of keyword count to word count is largest;
(5) find the keywords shared by the text topics to be linked and, from the weights of these shared keywords with respect to their respective text topics, compute the weights between the text topics; link the text topics into a text discourse and visualize the text discourse as a directed graph;
(6) prune the keywords exclusive to each text topic.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 200710041147 CN101067807A (en) | 2007-05-24 | 2007-05-24 | Text semantic visable representation and obtaining method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 200710041147 CN101067807A (en) | 2007-05-24 | 2007-05-24 | Text semantic visable representation and obtaining method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101067807A true CN101067807A (en) | 2007-11-07 |
Family
ID=38880368
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 200710041147 Pending CN101067807A (en) | 2007-05-24 | 2007-05-24 | Text semantic visable representation and obtaining method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101067807A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C12 | Rejection of a patent application after its publication | ||
RJ01 | Rejection of invention patent application after publication | Open date: 20071107 |