CN101067807A - Text semantic visable representation and obtaining method - Google Patents


Info

Publication number
CN101067807A
Authority
CN
China
Prior art keywords
text
keyword
semantic
subject
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 200710041147
Other languages
Chinese (zh)
Inventor
骆祥峰
方宁
徐炜民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai University
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN 200710041147 priority Critical patent/CN101067807A/en
Publication of CN101067807A publication Critical patent/CN101067807A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a visual representation of text semantics and a method for obtaining it. Text semantics are divided into three levels: a low level formed by a set of discrete keywords, a middle level of text subjects (topics), each formed by a paragraph, and a high level of text-chapter semantics formed by linking the text subjects to one another. Keywords are extracted, a semantic matrix is generated for each text subject by matrix operations on the keyword weights, and the text subjects are then linked to form the text-chapter semantics. The method exploits the contextual dependencies within text data to extract semantics, which improves the accuracy of semantic extraction for complex data objects. It decomposes a data object into a multi-layer description with nodes of different granularity, characterizes the subject correlations between nodes by modeling the context structure among them, and uses weights to measure the degree of correlation between keywords, between keywords and text subjects, and between text subjects.

Description

Visual representation of text semantics and method for obtaining it
Technical field:
The present invention relates to a method by which a computer automatically generates, represents, and obtains text semantics, and more particularly to a visual representation of text semantics based on a semantic matrix and a method for obtaining it.
Background art:
With the development of information and Internet technology, enormous quantities of information resources covering every field are now easy to obtain through electronic and network media. This so-called information explosion has created an urgent need for techniques that organize, index, and retrieve information resources quickly and effectively. Unstructured information, diverse information categories, and the broad range of document content all pose great challenges to information organization and retrieval. For example, the Web has become the most important information and knowledge base in fields such as scientific research, education, and learning, but the exponential growth of Web information also makes it very difficult for users to use it effectively. Digital libraries, which have been widely built in recent years, are another important source of massive information. A digital library is a repository of digitized resources that holds a large amount of structured information. The producers of these digital resources may be traditional libraries, museums, archives, universities, government departments, professional associations, or individuals, and the goal is to let anyone access all human knowledge at any time and from any place with any digital device connected to the Internet. Assuming a book of 300 pages with 1,500 characters per page, the text of one million books totals about 900 GB; with the associated metadata descriptions, the total volume of XML documents exceeds 1 TB. A digital library also contains large amounts of multimedia resources, such as video and audio, for teaching, research, and entertainment. Through software and service facilities such as search engines and browsers, users can access the information and resources of the Web or a digital library, but what users often need is finer-grained knowledge that better matches their needs rather than piles of information. For example, a user may want to obtain, at the same time, information in different media forms that expresses the same subject (for example, web pages and e-books in text form, presentation files that combine images and text, and multimedia documents in audio-visual form). Therefore, to meet users' diverse and personalized needs for information and knowledge services, Web-based information service systems (for example, network educational resource management systems) and the content management and access systems of digital libraries must be able to perform effective semantic extraction and related analysis on such semi-structured information or data in multiple media forms.
The present invention concerns semantic extraction from text data resources such as hypertext, Web pages, digital books, and educational resources. These data objects consist of unstructured characters or data streams, but they also have internal structure. Existing methods for classifying this kind of data have the following shortcomings:
(1) They rely on purely statistical methods during semantic extraction and use little semantic information, although semantic information is important both for the accuracy of a retrieval system and for meeting user needs;
(2) A key assumption of statistical methods is that all data are entities with the same structure and are independent and identically distributed. Many real data sets, however, have complex internal structure. For example, conventional text mining can extract and classify the subjects of hypertext by describing each document with a keyword or term vector and then classifying each web page independently. Such a statistical method completely ignores the internal structure of the document, even though a document usually has internal structure such as paragraphs. When processing semi-structured data resources of this kind, the relationships among the data therefore cannot be ignored.
To solve these two problems, new models and methods are needed that use the internal structure of text to extract and analyze the semantics of semi-structured relational data effectively. The present invention provides such a method for representing and obtaining text semantics. Its core is to model the contextual dependencies of text semantics from the internal structure of the text, to construct a text semantic representation model on the basis of matrix-operation inference rules (fuzzy cognitive maps), and to visualize that model.
Summary of the invention:
The object of the invention is to address the problems of the prior art by providing a visual representation of text semantics and a method for obtaining it. The method uses the internal structure of text data to extract semantics more effectively and can be applied directly to different semi-structured data resources. Text data here means hypertext, Web pages, digital books, educational resources, and the like; these data objects consist of unstructured characters or data streams but also have complex internal structure.
To achieve this object, the concept of the invention is to model the internal multi-layer semantic structure of such semi-structured data with a semantic matrix and its graph visualization. The semantic matrix and its graph visualization can model semi-structured data objects with complex internal structure and can therefore effectively characterize the contextual subject correlations between the internal nodes of a data object.
Based on this inventive concept, the invention adopts the following technical solution:
A visual representation of text semantics and a method for obtaining it, characterized in that text semantics are divided into three levels: low-level text semantics formed by a set of discrete keywords; text-subject semantics formed by text paragraphs, which are the middle level of the text; and high-level text-chapter semantics formed by linking the text subjects to one another. Keywords are extracted, the semantic matrix of each text subject is generated by matrix operations on the keyword weights, and the text-chapter semantics are then formed by linking the text subjects. The specific steps are as follows:
(1) Divide the text semantics into the three levels above. Keywords are extracted with the TF-IDF formula. The downloaded text is then divided into paragraphs according to its internal structure (for example, natural paragraphs). An XML tag represents the title of a text paragraph, one paragraph represents one text subject, all paragraphs of a text are stored in one XML file, and one XML file represents one text chapter;
(2) Calculate the state values of the keywords in each text paragraph and the weights between keywords: in each text paragraph, count the frequency with which each keyword occurs, then calculate the state value of each keyword and the weights between keywords;
(3) Using the state values of the keywords, the weights between keywords, and an inference rule based on matrix multiplication, calculate the weight of each keyword to the text subject, normalize all weights in the text subject to the interval [0, 1], and generate the semantic matrix of the text subject;
(4) Form the title of the theme node of each text subject from the title of the text paragraph, or from the sentence in the paragraph with the largest ratio of keyword count to word count;
(5) Find the keywords common to the text subjects, calculate the weights between text subjects in the text chapter from the weights of the common keywords to their respective text subjects, link the text subjects into a text chapter, and visualize the text chapter as a directed graph: all weights in the text chapter are normalized to the interval [0, 1], and the text-subject semantics and the text-chapter semantics generated from them are represented as a directed graph of nodes and directed edges, which gives the graph visualization of the text-chapter semantics;
(6) Prune, in the visualized graph of the text-chapter semantics, the keywords exclusive to each text subject.
Compared with the prior art, the invention has the following substantive features and notable advantages:
(1) The invention can use various contextual dependencies (including the structural context within one granularity inside a data object and the context between granularities) to extract semantics more effectively, and can thereby improve the accuracy of semantic extraction for complex data objects.
(2) The method decomposes a data object, according to its internal structure, into a multi-layer description with nodes of different granularity, and characterizes the text-subject correlations between nodes by modeling the context structure between them.
(3) The method uses weights to measure the degree of correlation between keywords, between keywords and text subjects, and between text subjects.
(4) The inference rule used to generate the semantic matrix is the inference rule of fuzzy cognitive maps (FCM).
The invention obtains and represents the semantics of text hierarchically, concisely, and efficiently, which makes the text easier for a computer to grasp, understand, and process.
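Step (1) of the method can be illustrated with a minimal Python sketch. It assumes each paragraph is already tokenized into words, uses one common TF-IDF variant (term frequency times the logarithm of inverse paragraph frequency, computed over the paragraphs of a single text), and stores the paragraphs in a hypothetical XML layout with a `chapter` root and `paragraph` elements that carry a `title` attribute. The patent does not specify the TF-IDF variant or the XML schema, so the function names, the `top_k` parameter, and the element names are all assumptions.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def tf_idf_keywords(paragraphs, top_k=3):
    """Return the top_k words of each paragraph ranked by TF-IDF.

    Assumption: `paragraphs` is a list of token lists, and the inverse
    frequency is counted over the paragraphs of this one text.
    """
    n = len(paragraphs)
    # Paragraph frequency: in how many paragraphs each word occurs.
    df = Counter(word for para in paragraphs for word in set(para))
    result = []
    for para in paragraphs:
        if not para:
            result.append([])
            continue
        tf = Counter(para)
        scores = {w: (c / len(para)) * math.log(n / df[w]) for w, c in tf.items()}
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result.append(ranked[:top_k])
    return result


def chapter_to_xml(titled_paragraphs):
    """Serialize one text chapter: one XML element per paragraph.

    The element and attribute names are hypothetical.
    """
    root = ET.Element("chapter")
    for title, text in titled_paragraphs:
        node = ET.SubElement(root, "paragraph", {"title": title})
        node.text = text
    return ET.tostring(root, encoding="unicode")
```

Whitespace tokenization is enough for the sketch; Chinese text would need a word segmenter before this step.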
Description of drawings:
Fig. 1 shows the semantic matrix of a text subject with four keywords and its graph visualization.
Fig. 2 shows the semantic matrix and visualized graph of the text subject formed by the text paragraph "Text representation based on fuzzy cognitive maps".
Fig. 3 shows the visualized graph formed by the text paragraph "Fuzzy cognitive maps".
Fig. 4 shows the visualized graph formed by the text paragraph "Automatic construction of fuzzy cognitive maps".
Fig. 5 shows the visualized graph of the text chapter generated by linking the three text subjects.
Fig. 6 shows the visualized graph of the text chapter after pruning.
Fig. 7 shows how the direction of a keyword-to-text-subject weight is reversed when the weight between text subjects is calculated.
Embodiment:
A preferred embodiment of the invention is described in detail below with reference to the drawings:
For example, given four keywords $C_1$, $C_2$, $C_3$, $C_4$, the semantic matrix of their text subject and the visualized graph of the text subject are shown in Fig. 1.
The specific steps of the visual representation of text semantics and the method for obtaining it are as follows:
(1) Divide the semantics of a text into three levels: low-level text semantics formed by a set of discrete keywords; text-subject semantics formed by text paragraphs, which are the middle level of the text; and high-level text-chapter semantics formed by linking the text subjects to one another;
(2) Calculate the state value $V_{C_i}$ of each keyword and the weight $w_{ij}$ between keywords;
(3) Using the state values $V_{C_i}$ and the weights $w_{ij}$, obtain the weight $V_{C_j}$ of each keyword to the theme node through one inference step of a matrix operation, normalize all values, and generate the semantic matrix $E$ of the text subject;
(4) Form the title of the theme node of the text subject from the title of the text paragraph, or from the sentence in the paragraph with the largest ratio of keyword count to word count;
Here $w_{ij}$ denotes the weight between the i-th keyword $C_i$ and the j-th keyword $C_j$, calculated as
$$w_{ij} = \sum_{k=1}^{m} b_k / m$$
where the paragraph has $m$ sentences in total, and $b_k = 1$ if keywords $C_i$ and $C_j$ co-occur in the k-th sentence and $b_k = 0$ otherwise;
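The co-occurrence weight between two keywords can be sketched in Python as follows. The sketch assumes each sentence is given as a set of tokens; the function name `cooccurrence_weight` and the sample sentences are illustrative and do not come from the patent.

```python
def cooccurrence_weight(sentences, kw_i, kw_j):
    """w_ij: share of the paragraph's m sentences that contain both keywords."""
    m = len(sentences)
    if m == 0:
        return 0.0
    # b_k is 1 when both keywords occur in sentence k, otherwise 0.
    hits = sum(1 for tokens in sentences if kw_i in tokens and kw_j in tokens)
    return hits / m


sentences = [
    {"fcm", "reasoning"},
    {"fcm"},
    {"reasoning", "causality", "fcm"},
    {"template"},
]
weight = cooccurrence_weight(sentences, "fcm", "reasoning")
```

With these four sentences the two keywords co-occur in two of them, so the weight is 2/4. The measure is symmetric in the two keywords.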
The state value of the i-th keyword $C_i$ in the text is calculated as
$$V_{C_i} = \tanh(x_i)$$
where $x_i$ is the frequency with which the i-th keyword occurs in the text. The weight of keyword $C_i$ to the theme node is obtained through one inference step of the inference formula
$$V_{C_j}(t+1) = f\left(\sum_{i=1,\, i \neq j}^{N} V_{C_i}(t)\, w_{ij}\right)$$
where $f(\cdot)$ is a normalization function applied to all keyword weights (here, normalization by the arithmetic sum), the paragraph contains $N$ keywords, $V_{C_i}$ is the state value of the i-th keyword $C_i$ in the text, and $w_{ij}$ is the weight between the i-th keyword $C_i$ and the j-th keyword $C_j$. The weights of all keywords to the theme node form the j-th column of the semantic matrix $E$ of the text subject;
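A minimal Python sketch of the state values and the single inference step is given below. It reads "normalization by the arithmetic sum" as dividing each raw value by the sum of all raw values, which is an interpretation rather than something the text spells out. The function name `infer_theme_weights` and the sample inputs are hypothetical.

```python
import math


def infer_theme_weights(frequencies, w):
    """One inference step: keyword-to-theme-node weights, normalized to sum to 1.

    frequencies[i] is x_i, the frequency of keyword i in the paragraph.
    w[i][j] is the co-occurrence weight w_ij between keywords i and j.
    """
    n = len(frequencies)
    # State values V_Ci = tanh(x_i).
    v = [math.tanh(x) for x in frequencies]
    # Raw activation for each keyword j, skipping the i == j term.
    raw = [sum(v[i] * w[i][j] for i in range(n) if i != j) for j in range(n)]
    total = sum(raw)
    if total == 0:
        return [0.0] * n
    # Assumed form of f(): divide by the arithmetic sum of all raw values.
    return [r / total for r in raw]


weights = infer_theme_weights(
    [1, 1, 1],
    [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.0],
     [0.5, 0.0, 0.0]],
)
```

In this example the first keyword co-occurs with both others, so it receives half of the total weight and the other two receive a quarter each.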
For example, a text has three text paragraphs titled "Text representation based on fuzzy cognitive maps" (FCM-based document representation), "Fuzzy cognitive maps" (FCM), and "Automatic construction of fuzzy cognitive maps" (FCM's automatic construction). The semantic matrices of their text subjects and the corresponding graph visualizations are shown in Fig. 2, Fig. 3, and Fig. 4, respectively.
In Fig. 2, the text paragraph contains keywords C1 (fuzzy cognitive map), C2 (semantics), C3 (representation), C4 (reasoning), C5 (causality), C6 (keyword), C7 (template), and C18 (text), and the theme node $C_0^0$ (text representation based on fuzzy cognitive maps); its graph visualization is shown in Fig. 3(a), and the semantic matrix of the text subject is shown in Fig. 3(b);
In Fig. 3, the text paragraph contains keywords C4 (reasoning), C5 (causality), C15 (relation), C17 (concept), and C33 (graph), and the theme node $C_1^0$ (fuzzy cognitive map);
In Fig. 4, the text paragraph contains keywords C1 (fuzzy cognitive map), C4 (reasoning), C5 (causality), C6 (keyword), C7 (template), and C16 (automatic construction), and the theme node $C_8^0$ (automatic construction of fuzzy cognitive maps);
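The theme node titles used in these examples come from step (4). A minimal Python sketch of that selection rule follows, assuming the rule means: use the paragraph title when there is one, otherwise take the sentence with the highest ratio of keyword count to word count. The function name `theme_node_title` and the whitespace tokenization are assumptions made for illustration.

```python
def theme_node_title(sentences, keywords, paragraph_title=None):
    """Pick a theme node title for one text paragraph."""
    if paragraph_title:
        return paragraph_title

    def keyword_ratio(sentence):
        words = sentence.split()
        if not words:
            return 0.0
        return sum(1 for word in words if word in keywords) / len(words)

    # Fall back to the sentence densest in keywords; empty input gives "".
    return max(sentences, key=keyword_ratio, default="")


sentences = [
    "fuzzy cognitive map supports causal reasoning",
    "it is widely used",
]
keywords = {"fuzzy", "cognitive", "map", "reasoning"}
title = theme_node_title(sentences, keywords)
```

Here the first sentence has four keywords in six words and the second has none, so the first sentence is chosen.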
(5) Find the keywords $C_k$ common to the text paragraphs to be linked. From the weights $w_{ki}$ and $w_{kj}$ of the common keywords, obtain $w_{jk}$ by reversal, and then calculate the weight $T_{ji}$ between theme nodes. Fig. 7 shows the reversal of the direction of the common-keyword-to-theme-node weight when the weight between text subjects is calculated, and Fig. 5 shows the graph visualization of the text-chapter semantics after the three text-subject semantics are linked;
The weight between theme nodes $C_j^0$ and $C_i^0$ is calculated with the formula
$$T_{ji} = \tanh\left(2 \sum_{k=1}^{N_1} \beta_k V_{C_k} w_{jk} w_{ki}\right)$$
where the theme nodes share $N_1$ common keywords; $\beta_k$ is the reversal coefficient of the k-th keyword $C_k$, with range [0, 1], and can also be obtained from the Bayes formula; $V_{C_k}$ is the state value of keyword $C_k$; $w_{jk}$ is the weight of keyword $C_k$ to theme node $C_j^0$ after reversal; and $w_{ki}$ is the weight of keyword $C_k$ to theme node $C_i^0$;
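The weight between two theme nodes can be sketched in Python as below. Each common keyword contributes one tuple of reversal coefficient, state value, reversed weight, and forward weight. The function name `theme_link_weight` and the numeric inputs are illustrative assumptions, not values from the patent.

```python
import math


def theme_link_weight(common_keywords):
    """T_ji = tanh(2 * sum of beta_k * V_Ck * w_jk * w_ki over common keywords).

    common_keywords: iterable of (beta_k, v_k, w_jk, w_ki) tuples.
    """
    total = sum(beta * v * w_jk * w_ki for beta, v, w_jk, w_ki in common_keywords)
    return math.tanh(2 * total)


# One common keyword with beta = 1.0, V = 0.5, w_jk = 0.5, w_ki = 0.5.
t_ji = theme_link_weight([(1.0, 0.5, 0.5, 0.5)])
```

Because every factor is non-negative, the hyperbolic tangent keeps the result inside [0, 1), and two text subjects with no common keyword get a weight of zero.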
(6) Prune the keywords exclusive to each text-subject semantics. Fig. 6 shows the visualized graph of the text-chapter semantics after the keywords exclusive to the three text-subject semantics have been pruned.
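Pruning and the directed-graph view can be sketched together in Python. The sketch treats a keyword as exclusive when it occurs in only one text subject, and it writes the remaining graph as Graphviz DOT text, which is an assumed output format. The function names and all weights in the example are hypothetical.

```python
from collections import Counter


def prune_exclusive_keywords(subject_keywords):
    """Drop keywords that belong to only one text subject."""
    counts = Counter(kw for kws in subject_keywords.values() for kw in kws)
    return {
        subject: {kw: wt for kw, wt in kws.items() if counts[kw] > 1}
        for subject, kws in subject_keywords.items()
    }


def to_dot(subject_keywords, theme_links):
    """Render keyword-to-subject and subject-to-subject edges as DOT text."""
    lines = ["digraph chapter {"]
    for subject, kws in subject_keywords.items():
        for kw, wt in sorted(kws.items()):
            lines.append(f'  "{kw}" -> "{subject}" [label="{wt:.2f}"];')
    for (src, dst), wt in sorted(theme_links.items()):
        lines.append(f'  "{src}" -> "{dst}" [label="{wt:.2f}"];')
    lines.append("}")
    return "\n".join(lines)


subjects = {
    "FCM": {"reasoning": 0.4, "concept": 0.6},
    "FCM construction": {"reasoning": 0.3, "template": 0.7},
}
pruned = prune_exclusive_keywords(subjects)
dot = to_dot(pruned, {("FCM", "FCM construction"): 0.25})
```

Only "reasoning" is shared by both subjects, so it is the only keyword node left in the rendered graph; the subject-to-subject edge is kept unchanged.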

Claims (2)

1. A visual representation of text semantics and a method for obtaining it, characterized in that text semantics are divided into three levels: low-level text semantics formed by a set of discrete keywords; text-subject semantics formed by text paragraphs, which are the middle level of the text; and high-level text-chapter semantics formed by linking the text subjects to one another; keywords are extracted, the semantic matrix of each text subject is generated by matrix operations on the keyword weights, and the text-chapter semantics are then formed by linking the text subjects.
2. The visual representation of text semantics and method for obtaining it according to claim 1, characterized in that the operating steps are as follows:
(1) divide the text semantics into three levels: low-level text semantics formed by a set of discrete keywords; text-subject semantics formed by text paragraphs, which are the middle level of the text; and high-level text-chapter semantics formed by linking the text subjects to one another;
(2) calculate the state values of the keywords in each text paragraph and the weights between keywords;
(3) using the weights between keywords, the state values of the keywords, and an inference rule based on matrix multiplication, calculate the weight of each keyword to the text subject and generate the semantic matrix of the text subject;
(4) form the title of the theme node of each text subject from the title of the text paragraph, or from the sentence in the paragraph with the largest ratio of keyword count to word count;
(5) find the keywords common to the text subjects to be linked, calculate the weights between the text subjects from the weights of the common keywords to their respective text subjects, link the text subjects into a text chapter, and visualize the text chapter as a directed graph;
(6) prune the keywords exclusive to each text subject.
CN 200710041147 2007-05-24 2007-05-24 Text semantic visable representation and obtaining method Pending CN101067807A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200710041147 CN101067807A (en) 2007-05-24 2007-05-24 Text semantic visable representation and obtaining method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 200710041147 CN101067807A (en) 2007-05-24 2007-05-24 Text semantic visable representation and obtaining method

Publications (1)

Publication Number Publication Date
CN101067807A true CN101067807A (en) 2007-11-07

Family

ID=38880368

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200710041147 Pending CN101067807A (en) 2007-05-24 2007-05-24 Text semantic visable representation and obtaining method

Country Status (1)

Country Link
CN (1) CN101067807A (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101887416A (en) * 2010-06-29 2010-11-17 魔极科技(北京)有限公司 Method and system for converting characters into graphs
CN105917326A (en) * 2013-09-10 2016-08-31 微软技术许可有限责任公司 Creating inforgraphics from text data in electronic documents
CN104516904A (en) * 2013-09-29 2015-04-15 北大方正集团有限公司 Key knowledge point recommendation method and system
CN104516904B (en) * 2013-09-29 2018-04-03 北大方正集团有限公司 A kind of Key Points recommend method and its system
CN104462552B (en) * 2014-12-25 2018-07-17 北京奇虎科技有限公司 Question and answer page core word extracting method and device
CN104462552A (en) * 2014-12-25 2015-03-25 北京奇虎科技有限公司 Question and answer page core word extracting method and device
CN107423344B (en) * 2017-05-16 2020-03-13 北京邮电大学 Visualization method and device for state data of power transmission and transformation equipment
CN107423344A (en) * 2017-05-16 2017-12-01 北京邮电大学 A kind of method for visualizing and device of power transmission and transformation equipment state data
CN108090199A (en) * 2017-12-22 2018-05-29 浙江大学 A kind of Semantic features extraction and method for visualizing of large size image set
CN108090199B (en) * 2017-12-22 2020-02-21 浙江大学 Semantic information extraction and visualization method for large-scale image set
CN108415900A (en) * 2018-02-05 2018-08-17 中国科学院信息工程研究所 A kind of visualText INFORMATION DISCOVERY method and system based on multistage cooccurrence relation word figure
CN108572631A (en) * 2018-03-08 2018-09-25 华南理工大学 A kind of intelligence control system and method based on two type Fuzzy Cognitive Maps
CN109992657A (en) * 2019-04-03 2019-07-09 浙江大学 A kind of interactive problem generation method based on reinforcing Dynamic Inference
CN111462741A (en) * 2020-03-02 2020-07-28 北京声智科技有限公司 Voice data processing method, device and storage medium
CN111462741B (en) * 2020-03-02 2024-02-02 北京声智科技有限公司 Voice data processing method, device and storage medium
CN111680516A (en) * 2020-06-04 2020-09-18 宁波浙大联科科技有限公司 PDM system product design requirement information semantic analysis and extraction method and system
CN112989802A (en) * 2021-01-28 2021-06-18 北京信息科技大学 Barrage keyword extraction method, device, equipment and medium
CN112989802B (en) * 2021-01-28 2023-06-20 北京信息科技大学 Bullet screen keyword extraction method, bullet screen keyword extraction device, bullet screen keyword extraction equipment and bullet screen keyword extraction medium
CN113297254A (en) * 2021-06-21 2021-08-24 中国农业银行股份有限公司 Conceptualization query method and device

Similar Documents

Publication Publication Date Title
CN101067807A (en) Text semantic visable representation and obtaining method
CN110941692B (en) Internet political outturn news event extraction method
Jaschke et al. Trias--An algorithm for mining iceberg tri-lattices
CN103473263B (en) News event development process-oriented visual display method
Grobelnik et al. Automated knowledge discovery in advanced knowledge management
Shiri Linked data meets big data: A knowledge organization systems perspective
Iefremova et al. Biographical articles in scientific literature: analysis of articles indexed in Web of Science
Nguyen et al. Digital library research (1990-2010): A knowledge map of core topics and subtopics
Lioma et al. A syntactically-based query reformulation technique for information retrieval
CN1766871A (en) The processing method of the semi-structured data extraction of semantics of based on the context
Osipov et al. Technologies for semantic analysis of scientific publications
Luong et al. Ontology learning using word net lexical expansion and text mining
de Silva SAFS3 algorithm: Frequency statistic and semantic similarity based semantic classification use case
Owoeye et al. Classification of radical web text using a composite-based method
Li Research on an enhanced web information processing technology based on ais text mining
Gregory et al. Visual analysis of weblog content
Zinger et al. Extracting an ontology of portrayable objects from WordNet
Geetha et al. Effectual extraction of Data Relations from unstructured data
Cormode et al. Scienceography: the study of how science is written
Dli et al. Multimodel method of rubriсating the unstructured electronic text documents
Leskinen et al. Biographical and Prosopographical Analyses of Finnish Academic People 1640–1899 Based on Linked Open Data
Kawtrakul et al. A framework of NLP based information tracking and related knowledge organizing with topic maps
Luo et al. Multimedia news exploration and retrieval by integrating keywords, relations and visual features
Neri et al. Stalker, a multilingual text mining search engine for open source intelligence
Ling An anthological review of research utilizing MontyLingua, a python-based end-to-end text processor

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Open date: 20071107