CN101067807A

CN101067807A - Visual representation and acquisition method for text semantics

Info

Publication number: CN101067807A
Application number: CN 200710041147
Authority: CN
Inventors: 骆祥峰; 方宁; 徐炜民
Original assignee: University of Shanghai for Science and Technology
Current assignee: Shanghai University; University of Shanghai for Science and Technology
Priority date: 2007-05-24
Filing date: 2007-05-24
Publication date: 2007-11-07

Abstract

This invention relates to a method for visually representing and acquiring text semantics. The semantics of a text are divided into three levels: a low level made up of sets of discrete keywords, a middle level of text topics, each formed by a paragraph, and a high level of whole-text (discourse) semantics formed by linking the topics to one another. A semantic matrix is generated for each text topic by extracting keywords and applying matrix operations to the keyword weights; the discourse semantics are then formed by linking the text topics. The method uses the contextual dependencies in the text data to extract semantics, which improves the accuracy of semantic extraction for complex data objects. It decomposes a data object into a multi-layer description with nodes of different granularity, characterizes the topic relationships among nodes by modeling the contextual structure among them, and uses weights to measure the degree of correlation between keywords, between keywords and text topics, and between text topics.

Description

Visual representation and acquisition method for text semantics

Technical field:

The present invention relates to a method by which a computer automatically generates a representation of text semantics and acquires them, and more particularly to a visual representation and acquisition method for text semantics based on a semantic matrix.

Background technology:

With the development of information and Internet technology, we can now easily obtain enormous amounts of information covering every field through electronic and network media. This so-called information explosion has created an urgent demand for fast, effective techniques for organizing and indexing information resources and for information retrieval. Unstructured information, diverse information categories, and the wide range of document content all pose great challenges to information organization and retrieval. For example, the Web has become the most important base of information and knowledge in fields such as scientific research, education, and learning, but the exponential growth of Web information has also made it very difficult for users to exploit it effectively. Digital libraries, which have been widely built in recent years, are another important source of massive information. A digital library is a repository of digitized resources that preserves a large amount of structured information. The producers of these digital resources may be traditional libraries, museums, archives, universities, government departments, professional associations, or individuals, and the goal is to let everyone access all human knowledge at any time and place from any digital device connected to the Internet. Assuming 300 pages per book and 1,500 characters per page, the text of one million books amounts to about 900 GB (which corresponds to 2 bytes per character); with the associated metadata descriptions, the total volume of the XML documents exceeds 1 TB. A digital library also contains large numbers of multimedia resources, such as video and audio, for teaching, research, and entertainment.

Through software and service facilities such as search engines and browsers, users can access the information and resources of the Web or a digital library. However, users often need finer-grained knowledge that better matches their needs rather than heaps of information. For example, a user may want to obtain, at the same time, information in different media forms that expresses the same topic (for example, web pages and e-books in text form, presentation files that combine images and text, and multimedia documents in audio or video form). Therefore, to meet users' diverse and personalized demands for information and knowledge services, Web-based information service systems (for example, network educational resource management systems) and the content management and access systems of digital libraries must be able to perform effective semantic extraction and related analysis on this semi-structured information or data in multiple media forms.

The present invention concerns semantic extraction from text data resources, such as hypertext, Web pages, digital books, and educational resources. These data objects consist of unstructured character or data streams, yet they also have internal structure. Existing methods for processing this kind of data have the following shortcomings:

(1) They rely on purely statistical methods during semantic extraction and make little use of semantic information, even though semantic information is important both for the accuracy of a retrieval system and for meeting user needs.

(2) An important assumption of statistical methods is that all data are entities with the same structure and that the data are independent and identically distributed. Many real data sets, however, have complex internal structure. For example, traditional text mining methods can be used for topic extraction and classification of hypertext: each document is described by a keyword or term vector, and each web page is then classified independently on that basis. Such a statistical method completely ignores the internal structure of the document, even though each document usually contains structure such as paragraphs. When processing semi-structured data resources of this kind, therefore, the relationships among the data cannot be ignored.

To solve these two problems, new models and methods are needed that use the internal structure of text to perform effective semantic extraction and analysis on semi-structured relational data. The present invention provides such a method for representing and acquiring text semantics. Its core is to model the contextual dependencies of text semantics from the internal structure of the text, and to construct and visualize a text semantic representation model on the basis of an inference rule based on matrix operations (that of Fuzzy Cognitive Maps).

Summary of the invention:

The object of the present invention is to address the problems of the prior art by providing a visual representation and acquisition method for text semantics that uses the internal structure of text data to extract semantics more effectively. The method can be applied directly to different semi-structured data resources. Text data here means hypertext, Web pages, digital books, educational resources, and the like; these data objects consist of unstructured character or data streams but also have complex internal structure.

To achieve this object, the concept of the invention is to model the internal multi-layer semantic structure of such semi-structured data with a semantic matrix and its graph visualization. The semantic matrix and its graph visualization can model semi-structured data objects with complex internal structure, and can therefore effectively characterize the contextual topic correlations among the internal nodes of a data object.

In accordance with the above inventive concept, the present invention adopts the following technical solution:

A visual representation and acquisition method for text semantics, characterized in that text semantics are divided into three levels: low-level semantics formed by sets of discrete keywords; text topic semantics, formed by text paragraphs, as the middle level; and high-level discourse semantics formed by the links among text topics. The semantic matrix of each text topic is generated by extracting keywords and applying matrix operations based on keyword weights, and the discourse semantics are then formed by linking the text topics. The specific steps are as follows:

(1) Divide the text semantics into three levels: low-level semantics formed by sets of discrete keywords; text topic semantics, formed by text paragraphs, as the middle level; and high-level discourse semantics formed by the links among text topics. Keywords are extracted with the TF-IDF formula. The downloaded text is then divided into paragraphs according to its internal structure (for example, natural paragraphs). An XML tag holds the title of each text paragraph, and each paragraph represents one text topic. All paragraphs of one text are stored in a single XML file, so that one XML file represents one text discourse;

(2) Calculate the state values of the keywords in each text paragraph and the weights between keywords: count how often each keyword occurs in the paragraph, then calculate the state value of each keyword and the weight between each pair of keywords;

(3) From the keyword state values, the weights between keywords, and an inference rule based on matrix multiplication, calculate the weight of each keyword to the text topic; normalize all weights in the text topic to the interval [0, 1]; and generate the semantic matrix of the text topic;

(4) Form the title of the topic node of each text topic from the title of the text paragraph, or from the sentence in the paragraph with the largest ratio of keyword count to word count;

(5) Find the keywords shared between text topics; from the weights of these shared keywords to their respective text topics, calculate the weights between the text topics in the text discourse; link the text topics into the text discourse; and visualize the discourse as a directed graph. All weights in the text discourse are normalized to the interval [0, 1]. Representing the text topic semantics, and the discourse semantics generated from them, as a directed graph of nodes and directed edges gives the graph visualization of the discourse semantics;

(6) Prune, from the visualized graph of the discourse semantics, the keywords that are exclusive to a single text topic.

Compared with the prior art, the present invention has the following prominent substantive features and notable advantages:

(1) The invention can exploit various contextual dependencies (including structural context within one granularity and context across granularities inside a data object) to extract semantics more effectively, and can thus improve the accuracy of semantic extraction for complex data objects.

(2) The method decomposes a data object, according to its internal structure, into a multi-layer description with nodes of different granularity, and characterizes the text topic correlations among nodes by modeling the contextual structure among them.

(3) The method uses weights to measure the degree of correlation between keywords, between keywords and text topics, and between text topics.

(4) In the method, the inference rule used to generate the semantic matrix is the inference rule of Fuzzy Cognitive Maps (FCM).

The invention obtains and represents the semantics of a text hierarchically, concisely, and efficiently, which makes it easier for a computer to grasp and process them.
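Step (1) of the method above, TF-IDF keyword extraction and storing the paragraphs of one text in one XML document, might be sketched as follows. This is only an illustration under stated assumptions: the tokenizer, the particular TF-IDF variant (each paragraph treated as a document), and the element names `text`, `paragraph`, `title`, and `body` are hypothetical choices that the patent does not specify.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase alphabetic words only.
    # Chinese text would need a word segmenter instead.
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(paragraphs, top_k=3):
    """Return the top_k TF-IDF terms of each paragraph.

    Assumption: each paragraph is treated as one document, with
    tf = count / paragraph length and idf = log(n / document frequency).
    """
    docs = [Counter(tokenize(p)) for p in paragraphs]
    n = len(docs)
    df = Counter(term for d in docs for term in d)  # document frequency
    result = []
    for d in docs:
        total = sum(d.values()) or 1  # avoid dividing by zero on empty input
        scores = {t: (c / total) * math.log(n / df[t]) for t, c in d.items()}
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:top_k])
    return result


def to_xml(titled_paragraphs):
    """Store (title, body) pairs under one root element, one per topic."""
    root = ET.Element("text")
    for title, body in titled_paragraphs:
        para = ET.SubElement(root, "paragraph")
        ET.SubElement(para, "title").text = title
        ET.SubElement(para, "body").text = body
    return ET.tostring(root, encoding="unicode")
```

Treating paragraphs as the documents for the IDF term is a convenience of this sketch; a real system could equally compute IDF over a larger corpus.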

Description of drawings:

Fig. 1 shows the semantic matrix, and its graph visualization, of a text topic containing 4 keywords.

Fig. 2 shows the semantic matrix and visualized graph of the text topic formed by the text paragraph "FCM-based document representation".

Fig. 3 shows the visualized graph of the text paragraph "Fuzzy Cognitive Maps".

Fig. 4 shows the visualized graph of the text paragraph "FCM's automatic construction".

Fig. 5 shows the visualized graph of the text discourse generated by linking the three text topics.

Fig. 6 shows the visualized graph of the text discourse after pruning.

Fig. 7 shows how the direction of a keyword's weight to a text topic is reversed when the weights between text topics are calculated.

Embodiment:

A preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings:

Given four keywords C_1, C_2, C_3, and C_4, the semantic matrix of their text topic and the visualized graph of the text topic are as shown in Fig. 1.

The specific steps of the visual representation and acquisition method for text semantics are as follows:

(1) Divide the semantics of a text into three levels: low-level semantics formed by sets of discrete keywords; text topic semantics, formed by text paragraphs, as the middle level; and high-level discourse semantics formed by the links among text topics;

(2) Calculate the state value V_{C_i} of each keyword and the weight w_{ij} between keywords;

(3) From the keyword state values V_{C_i} and the weights w_{ij} between keywords, obtain the weight V_{C_j} of each keyword to the topic node through one inference step of a matrix operation, normalize all values, and generate the semantic matrix E of the text topic;

(4) Form the title of the topic node of the text topic from the title of the text paragraph, or from the sentence in the paragraph with the largest ratio of keyword count to word count;
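Step (4) can be illustrated with a short sketch. It assumes that the ratio in question is keyword count divided by word count, that keywords are single whitespace-delimited words, and that sentences end at common punctuation marks; the function name `pick_topic_title` is hypothetical.

```python
import re


def pick_topic_title(paragraph, keywords, title=None):
    """Return the paragraph title if there is one; otherwise return the
    sentence with the largest keyword-to-word ratio (an assumed reading
    of step (4))."""
    if title:
        return title
    keyword_set = {k.lower() for k in keywords}
    # Split on Western and Chinese sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"[.!?。！？]", paragraph) if s.strip()]
    if not sentences:
        return ""

    def ratio(sentence):
        words = sentence.lower().split()
        return sum(w in keyword_set for w in words) / len(words)

    # max() keeps the first sentence when several share the best ratio.
    return max(sentences, key=ratio)
```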

Here w_{ij} denotes the weight between the i-th keyword C_i and the j-th keyword C_j, calculated as

w_{ij} = \frac{1}{m} \sum_{k = 1}^{m} b_{k}

where the paragraph has m sentences in total, and b_k = 1 if keywords C_i and C_j co-occur in the k-th sentence and b_k = 0 otherwise;
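As a small worked illustration of the co-occurrence weight w_{ij}, the sketch below counts the sentences in which two keywords appear together and divides by the number of sentences m. Representing each sentence as a set of tokens, and leaving the diagonal at zero, are assumptions of this sketch.

```python
def cooccurrence_weights(sentences, keywords):
    """Compute w[i][j] = (number of sentences containing both
    keywords[i] and keywords[j]) / m, for i != j.

    sentences: list of token sets, one per sentence of the paragraph.
    """
    m = len(sentences)
    n = len(keywords)
    w = [[0.0] * n for _ in range(n)]
    if m == 0:
        return w
    # present[k][i] is True when keyword i occurs in sentence k.
    present = [[kw in s for kw in keywords] for s in sentences]
    for i in range(n):
        for j in range(n):
            if i != j:
                b_sum = sum(row[i] and row[j] for row in present)
                w[i][j] = b_sum / m
    return w
```

Because co-occurrence is symmetric, the resulting matrix is symmetric as well.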

The state value of the i-th keyword C_i in the text is calculated as

V_{C_{i}} = \tanh(x_{i})

where x_i denotes the frequency with which the i-th keyword occurs in the text. The weight of each keyword to the topic node is obtained through one inference step with the inference formula

V_{C_{j}}(t + 1) = f\left( \sum_{i = 1, i \neq j}^{N} V_{C_{i}}(t) w_{ij} \right)

where f(·) is a normalization function over all keyword weights (here, normalization by the arithmetic sum), N is the number of keywords in the paragraph, V_{C_i} is the state value of the i-th keyword C_i in the text, and w_{ij} is the weight between the i-th keyword C_i and the j-th keyword C_j. The weights of all keywords to the topic node form column j of the semantic matrix E of the text topic;
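The state value and the single inference step can be sketched as follows. The sketch reads "normalization by the arithmetic sum" as dividing each raw value by the sum of all raw values, which is one plausible interpretation rather than a confirmed detail; the function name `keyword_topic_weights` is hypothetical.

```python
import math


def keyword_topic_weights(freqs, w):
    """One inference step: V_j(t+1) = f(sum over i != j of V_i(t) * w[i][j]).

    freqs: occurrence frequency x_i of each keyword.
    w:     keyword-to-keyword weight matrix (list of lists).
    f is assumed here to divide by the sum of all raw values.
    """
    v = [math.tanh(x) for x in freqs]  # state values V_{C_i}
    n = len(v)
    raw = [sum(v[i] * w[i][j] for i in range(n) if i != j) for j in range(n)]
    total = sum(raw)
    if total == 0:
        return [0.0] * n
    return [r / total for r in raw]
```

With non-negative frequencies and weights, every normalized value falls in [0, 1] and the values sum to 1.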

For example, a text has three text paragraphs, titled "FCM-based document representation", "Fuzzy Cognitive Maps (FCM)", and "FCM's automatic construction". The semantic matrices of their text topics and the corresponding graph visualizations are shown in Fig. 2, Fig. 3, and Fig. 4, respectively.

In Fig. 2, the text paragraph contains the keywords C1 (Fuzzy Cognitive Map), C2 (semantics), C3 (representation), C4 (reasoning), C5 (causality), C6 (keyword), C7 (template), and C18 (text), and the topic node C_0^0 (FCM-based document representation); its graph visualization is shown in Fig. 3 (a), and the semantic matrix of the text topic is shown in Fig. 3 (b);

In Fig. 3, the text paragraph contains the keywords C4 (reasoning), C5 (causality), C15 (relation), C17 (concept), and C33 (graph), and the topic node C_1^0 (Fuzzy Cognitive Maps);

In Fig. 4, the text paragraph contains the keywords C1 (Fuzzy Cognitive Map), C4 (reasoning), C5 (causality), C6 (keyword), C7 (template), and C16 (automatic construction), and the topic node C_8^0 (FCM's automatic construction);

(5) Find the keywords C_k shared between the text paragraphs to be linked. From the weights w_{ki} and w_{kj} of each shared keyword, obtain w_{jk} by reversal, and then calculate the weight T_{ji} between the topic nodes. Fig. 7 shows how the direction of a shared keyword's weight to a topic node is reversed when the weights between text topics are calculated, and Fig. 5 shows the graph visualization of the discourse semantics after the semantics of the three text topics are linked;

The weight between topic nodes C_j^0 and C_i^0 is calculated as

T_{ji} = \tanh\left( 2 \sum_{k = 1}^{N_{1}} \beta_{k} V_{C_{k}} w_{jk} w_{ki} \right)

where the topic nodes share N_1 keywords; \beta_k is the reversal coefficient of the k-th keyword C_k, which takes values in [0, 1] and can also be obtained from the Bayesian formula; V_{C_k} is the state value of keyword C_k; w_{jk} is the reversed weight of keyword C_k with respect to topic node C_j^0; and w_{ki} is the weight of keyword C_k to topic node C_i^0;
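The topic-to-topic weight T_{ji} can be sketched directly from the formula. The text does not spell out how the reversed weight w_{jk} is derived from w_{kj}, so this sketch takes the reversed weights as given input; defaulting the reversal coefficient \beta_k to 1.0 when none is supplied is likewise an assumption, and the function name is hypothetical.

```python
import math


def topic_link_weight(rev_w_j, w_i, state, beta=None):
    """T_ji = tanh(2 * sum over shared keywords k of
                   beta_k * V_k * w_jk * w_ki).

    rev_w_j: {keyword: reversed weight w_jk with respect to topic j}
    w_i:     {keyword: weight w_ki of the keyword to topic i}
    state:   {keyword: state value V_k}
    beta:    optional {keyword: reversal coefficient in [0, 1]};
             missing entries default to 1.0 (an assumption).
    """
    beta = beta or {}
    shared = rev_w_j.keys() & w_i.keys()  # keywords common to both topics
    total = sum(
        beta.get(k, 1.0) * state[k] * rev_w_j[k] * w_i[k] for k in shared
    )
    return math.tanh(2 * total)
```

Two topics with no shared keyword get a weight of 0, so no edge would link them in the directed graph.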

(6) Prune the keywords exclusive to each text topic. Fig. 6 shows the visualized graph of the discourse semantics after the keywords exclusive to each of the three text topics have been pruned.
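The pruning in step (6) can be sketched as removing every keyword that occurs in only one text topic. The dictionary layout and the function name `prune_exclusive` are hypothetical, and whether the result matches Fig. 6 exactly cannot be confirmed from the text alone.

```python
from collections import Counter


def prune_exclusive(topic_keywords):
    """Drop keywords that belong to exactly one topic.

    topic_keywords: {topic name: iterable of keyword labels}
    Returns a new dict mapping each topic to its shared keywords only.
    """
    # Count in how many topics each keyword appears.
    counts = Counter(k for kws in topic_keywords.values() for k in set(kws))
    return {
        topic: {k for k in kws if counts[k] > 1}
        for topic, kws in topic_keywords.items()
    }
```

Applied to the keyword sets listed for Fig. 2, Fig. 3, and Fig. 4, this rule would remove C2, C3, C18, C15, C17, C33, and C16, leaving C1, C4, C5, C6, and C7 as the shared keywords.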

Claims

1. A visual representation and acquisition method for text semantics, characterized in that text semantics are divided into three levels: low-level semantics formed by sets of discrete keywords; text topic semantics, formed by text paragraphs, as the middle level; and high-level discourse semantics formed by the links among text topics; the semantic matrix of each text topic is generated by extracting keywords and applying matrix operations based on keyword weights, and the discourse semantics are then formed by linking the text topics.

2. The visual representation and acquisition method for text semantics according to claim 1, characterized in that the operating steps are as follows:

(1) Divide the text semantics into three levels: low-level semantics formed by sets of discrete keywords; text topic semantics, formed by text paragraphs, as the middle level; and high-level discourse semantics formed by the links among text topics;

(2) Calculate the state values of the keywords in each text paragraph and the weights between keywords;

(3) From the weights between keywords, the keyword state values, and an inference rule based on matrix multiplication, calculate the weight of each keyword to the text topic, and generate the semantic matrix of the text topic;

(5) Find the keywords shared between the text topics to be linked; from the weights of these shared keywords to their respective text topics, calculate the weights between the text topics; link the text topics into the text discourse; and visualize the text discourse as a directed graph;

(6) Prune the keywords exclusive to each text topic.