CN101308498A - Text collection visualization system

Info

Publication number: CN101308498A
Application number: CNA2008100401459A
Authority: CN
Inventors: 马颖华; 苏贵洋; 李建华; 冯薇; 李文婷
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2008-07-03
Filing date: 2008-07-03
Publication date: 2008-11-19
Anticipated expiration: 2028-07-03
Also published as: CN100595762C

Abstract

A text collection visualization system in the field of computer application technology comprises a text collection module, a Chinese word segmentation module, a term weight calculation module, an XML file organization module, and a visual graphical interface module. The text collection module, the Chinese word segmentation module, the term weight calculation module, and the XML file organization module first build a local database. The visual graphical interface module then interacts with the local database and graphically displays the results for the user's search keywords. The results are presented using the visualization, provided by the invention, of the degree of correlation between a document and multiple keywords, and the user can drag keywords on the graphical interface to extend the definition of the semantic relationships among the keywords and obtain better results.

Description

Text collection visualization system

Technical field

The present invention relates to a system in the field of computer application technology, and specifically to a text collection visualization system.

Background art

With the broad and deep development of web technology, the channels through which people obtain information keep widening. Whenever people use a search engine, a single keyword may return a massive amount of data. How to extract from these results the information that reflects the user's needs has become an increasingly important problem, and text visualization is one good solution.

Presenting information visually can speed up how quickly people process it. The human eye can take in more than 5 megabits of information per second, while the brain can comprehend only about 500 bits per second. In ordinary keyword search, people usually have to read the text and gain a rough understanding of an article before they can judge whether it is the document they need. Even if they decide whether to read an article by looking only at its title, the amount of article content left to check is still large, and valuable texts are sometimes missed. Even so, users are often overwhelmed by the large number of articles. Text collection visualization exploits the high capacity of visual processing: it shows the distribution of information graphically, so that the user can judge by sight whether a text is close to their needs and can work through a large number of retrieval results faster.

The purpose of text collection visualization is to present the content of texts in a structured way using two-dimensional or three-dimensional images, in order to support various kinds of analysis, retrieval, and text mining. In general, a mathematical model is first built for the text, and the model is then shown as a two-dimensional or three-dimensional image so that the characteristics of the text are conveyed intuitively and visually. Building the model also helps in understanding the structure of the text.

Articles and paragraphs are flat, linear structures that are ill suited to analysis and processing. Without a suitable structure and model, a text can only be analyzed by a person reading it word by word and sentence by sentence. Such manual inspection is extremely inefficient when a large amount of text data must be processed and analyzed. Moreover, current general-purpose search engines such as Google and Yahoo return results only by checking whether an article contains the input keywords. They do not further classify or otherwise process the text set, so they often return several different web pages with the same text content, which increases the user's workload when checking texts.

A search of the prior-art literature shows that many Chinese patents concern a "visualization system", for example 200510086559.1 ("remote visualization systems of computing grid") and 03121859.8 ("modular assistant visualizing system"). Although these patented technologies achieve visualization, they do not display results graphically and cannot be applied to text visualization for search engines.

Summary of the invention

To address the above shortcomings of the prior art, the present invention provides a text collection visualization system. It measures keywords and texts by building a text data model, classifies text content according to its structure (that is, the constituents of the text), and displays the results graphically. Faced with massive data, the user can not only filter out the parts that may be of interest, but also sees the search results presented graphically, in a way that intuitively reflects how closely each part of the results matches the purpose of the search, which gives the user direct guidance.

The present invention is achieved by the following technical solutions:

The present invention comprises a text collection module, a Chinese word segmentation module, a term weight calculation module, an XML file organization module, and a visual graphical interface module, wherein:

The text collection module collects web page text from the Internet and passes the collected text to the Chinese word segmentation module as the raw data source;

The Chinese word segmentation module segments the text content obtained by the text collection module into a corpus whose unit is the word, counts word frequencies, and saves them as local text files from which the term weight calculation module subsequently reads the word frequency information to calculate weights;

The term weight calculation module performs feature extraction on the segmentation results, that is, it calculates the weights of the feature words, and passes the results, together with the corresponding feature words and the title of the text they belong to, to the XML file organization module;

The XML file organization module organizes the data passed in by the term weight calculation module into XML files with a set data structure, keeps them on the local computer, and provides the visual graphical interface module with the results of reading the processed text data structures;

The visual graphical interface module uses the result data kept locally by the XML file organization module as its basic corpus, obtains user commands through interaction with the user, and displays the results.

The text collection module comprises a download submodule and a storage submodule. Starting from a set root URL, the download submodule follows the web page links provided on the root URL and fetches web page source files down to a set number of levels, removing non-body content such as HTML tags and script code to obtain the initial text content. The module then calls the storage submodule, which saves the text in a set local directory. Before saving each text, it first checks whether a text from the same source (judged by URL) already exists in that directory; if so, the text is not saved. When collection is finished, the Chinese word segmentation module is called to segment the saved initial text content.
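The markup removal and the URL-based duplicate check can be illustrated with a minimal Python sketch. The function names `strip_markup` and `save_if_new`, the regular expressions, and the use of a SHA-1 hash of the URL as the file name are all assumptions made for illustration; the patent does not specify how the storage submodule names files or compares sources.

```python
import hashlib
import re
import tempfile
from pathlib import Path


def strip_markup(html: str) -> str:
    """Drop script/style blocks and remaining tags, then collapse whitespace."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()


def save_if_new(directory: Path, url: str, text: str) -> bool:
    """Save text under a name derived from its URL; skip it if already saved."""
    # Assumption: one file per source URL, named by the hash of the URL.
    path = directory / (hashlib.sha1(url.encode("utf-8")).hexdigest() + ".txt")
    if path.exists():
        return False
    path.write_text(text, encoding="utf-8")
    return True


page = "<html><script>var a = 1;</script><p>Hello <b>world</b></p></html>"
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    first = save_if_new(folder, "http://example.com/a", strip_markup(page))
    second = save_if_new(folder, "http://example.com/a", strip_markup(page))
```

In this sketch the second call for the same URL leaves the directory unchanged, which mirrors the rule that a text from an already stored source is not saved again.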

The Chinese word segmentation module segments the large number of collected Chinese texts, removes stop words so that the remaining words become independent feature words, obtains the frequency with which each word appears in each article, and passes the result to the term weight calculation module. Stop words are prepositions, modal particles, and other very common words, such as "with" and "about". Because these words carry no specific meaning and usually occur with high frequency throughout a text, they would distort the content analysis of the text, so they are excluded from the weight calculation.
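Stop-word removal and per-article frequency counting can be sketched as follows, assuming the text has already been segmented into a list of words. The stop-word set shown is a small illustrative assumption, not the list used by the invention.

```python
from collections import Counter

# Assumption: a tiny illustrative stop-word set; a real list would be much longer.
STOP_WORDS = {"的", "和", "关于"}


def term_frequencies(tokens, stop_words=STOP_WORDS):
    """Count how often each non-stop word occurs in one segmented article."""
    return Counter(token for token in tokens if token not in stop_words)


# Hypothetical output of a word segmenter for one short article.
tokens = ["文本", "的", "可视化", "和", "文本"]
frequencies = term_frequencies(tokens)
```

A `Counter` is used here because it maps each feature word directly to its frequency, which is the form the weight calculation needs.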

The term weight calculation module comprises a file reading submodule and a weight calculation submodule. The file reading submodule reads the segmented words and their word frequency information from the files saved by the Chinese word segmentation module. The weight calculation submodule derives the weights of the feature words and calls the XML file organization module, which organizes the feature words and weight data of each document into a tree structure and saves it as the XML feature database.

The XML file organization module defines the data organization format for text titles, feature words, and their weights, and provides a write-XML-file function and a read-XML-file function. The term weight calculation module uses the former to organize its output into XML files, and the visual graphical interface module uses the latter to read the data set from the local XML files.

The visual graphical interface module comprises a control submodule and a graphical interface submodule. The control submodule accepts user commands, obtains the input keywords and operating parameters, retrieves from the locally saved XML database the documents that contain the keywords together with the corresponding weights, and displays the results through the graphical interface submodule. The graphical interface submodule receives the parameters entered by the user from the control submodule, obtains the data related to those parameters from the XML database, converts the data into a graphical representation, and displays it on a panel. It also allows the user to drag keywords to adjust the displayed results.

The operating parameters that can be set in the control submodule include fuzzy or exact matching, the display scale, and viewing the retrieval results as text.

Fuzzy matching means that a document is counted as a retrieval result as long as it contains any one of the keywords, that is, an "or" relation.

Exact matching means that a retrieval result must contain all of the entered keywords, that is, an "and" relation.
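The two matching modes can be sketched over an in-memory index. The index layout (document title mapped to feature-word weights), the function name `search`, and the sample data are assumptions for illustration only.

```python
def search(index, keywords, exact=False):
    """Return the keyword weights of every matching document.

    index maps a document title to a {feature word: weight} dictionary.
    exact=False is fuzzy matching ("or"); exact=True is exact matching ("and").
    """
    combine = all if exact else any
    return {
        title: {k: weights.get(k, 0.0) for k in keywords}
        for title, weights in index.items()
        if combine(k in weights for k in keywords)
    }


# Hypothetical feature database with three documents.
index = {
    "A": {"text": 0.5, "visualization": 0.3},
    "B": {"text": 0.2},
    "C": {"search": 0.9},
}
fuzzy = search(index, ["text", "visualization"])
strict = search(index, ["text", "visualization"], exact=True)
```

Keywords missing from a fuzzily matched document are given weight 0.0 in this sketch, so that every result carries one weight per entered keyword.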

In the exploration of image presentation technology, developers increasingly hope that personal participation, through continuous interaction and feedback, can make retrieval results better match individual needs. The basic goal of visualizing retrieval results is to design an interface through which the user can browse and manipulate them.

The present invention obtains better search results by adding user participation. The interactive factors include the operating parameters accepted by the control submodule of the visual graphical interface module described above. In the results display of the graphical interface submodule, the invention implements a method for representing graphically the degree of correlation between a document and multiple keywords. In addition, the user can select from the displayed results, as needed, the texts that place more emphasis on one or several of the keywords. For example, when searching with the keyword set "text"-"visualization", a user who wants documents with more content on visualization technology can select text centers of gravity positioned closer to the keyword "visualization", while a user who wants articles on visualization technology specific to text can select centers of gravity positioned closer to "text". From the distribution of the whole text set, the user can also see that some articles focus on text visualization technology, while others only mention text visualization as a branch when discussing visualization technology.

In addition, the user can drag keywords to adjust their semantic relationship. For example, a user who enters the keyword set "static link library"-"dynamic link library"-"difference" wants "static link library" and "dynamic link library" to appear together, so the relation between them should be expressed as close. Dragging these two keywords close together yields the retrieval results for the combination of the two.
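The effect of dragging two keywords together can be sketched numerically. The sketch assumes that a document is placed at the average of its weight points (radius equal to the weight, angle equal to the keyword's angle on the unit circle); the document weights and the angles before and after dragging are hypothetical values.

```python
import math


def center_of_gravity(weights, angles):
    """Average the weight points (radius = weight, angle = keyword angle)."""
    points = [
        (weights.get(k, 0.0) * math.cos(a), weights.get(k, 0.0) * math.sin(a))
        for k, a in angles.items()
    ]
    return (
        sum(x for x, _ in points) / len(points),
        sum(y for _, y in points) / len(points),
    )


# Hypothetical normalized weights of one document for the three keywords.
doc = {"static link library": 0.6, "dynamic link library": 0.6, "difference": 0.0}

# Keywords evenly spaced on the unit circle before dragging.
before = {
    "static link library": 0.0,
    "dynamic link library": 2 * math.pi / 3,
    "difference": 4 * math.pi / 3,
}
# "dynamic link library" dragged next to "static link library".
after = {**before, "dynamic link library": math.pi / 6}

r_before = math.hypot(*center_of_gravity(doc, before))
r_after = math.hypot(*center_of_gravity(doc, after))
```

Under these assumptions, a document that weights both library keywords highly moves away from the center and toward the pair once the two keywords are dragged together, so it stands out more clearly from documents that contain only one of them.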

Compared with prior art, the present invention has following beneficial effect:

1. Texts with identical content that come from different web pages have the same data structure (the weights of the keywords in the text), so their positions on the graphical interface overlap. The user can skip these identical texts and pay more attention to texts with different content, which avoids repeated reading.

2. The invention distributes texts on the graphical interface according to their data structure. This classification of texts lets the user choose, as needed, to view texts within a set of similar texts or a set of dissimilar texts.

3. The interface interaction added by the invention extends the freedom to customize a search. It provides the same "or" and "and" search functions as ordinary search engines, and it also lets the user drag keywords to adjust the distance between them on the visualization interface, which simulates the semantic relationship between the keywords and adjusts the results dynamically.

Description of drawings

Fig. 1 is a system architecture diagram of the present invention;

Fig. 2 is a state flow diagram of the text collection module.

Embodiment

The embodiments of the present invention are described in detail below with reference to the accompanying drawings. This embodiment is implemented on the basis of the technical solution of the invention and gives a detailed implementation and a concrete operating process, but the scope of protection of the invention is not limited to the following embodiment.

As shown in Fig. 1, this embodiment comprises a text collection module, a Chinese word segmentation module, a term weight calculation module, an XML file organization module, and a visual graphical interface module, wherein:

The text collection module collects web text from the Internet and passes the large volume of collected text to the Chinese word segmentation module;

The Chinese word segmentation module segments the large number of collected Chinese texts, removes stop words so that the remaining words become independent keywords, obtains the frequency with which each word appears in each article, and passes the result to the term weight calculation module;

The term weight calculation module performs feature extraction on the segmentation results, that is, it calculates the feature word weights, and passes the results, together with the corresponding feature words and document information, to the XML file organization module;

The XML file organization module organizes the data passed in by the term weight calculation module into XML files kept locally, and provides the visual graphical interface module with the results of reading the processed text data structures.

The text collection module first builds the initial corpus in the local database.

As shown in Fig. 2, the text collection module comprises a download submodule and a storage submodule. The download submodule starts from the set root URL, uses regular expressions to find the link information on each web page and follow it level by level, and fetches web page source files down to the set level. Regular expressions are then used to remove the HTML tags and script code from the web page source files, and the resulting Chinese text strings are inserted into a list. The storage submodule then checks whether the inserted text content is a duplicate and, if it is not, saves it as a local text file.
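The level-by-level link discovery can be sketched as a breadth-first traversal with a depth limit. The link pattern `LINK_RE`, the function `crawl`, and the `fetch` callback are assumptions for illustration; here `fetch` reads from an in-memory dictionary instead of the network so that the sketch runs on its own.

```python
import re
from collections import deque
from typing import Callable, Dict

# Assumption: only absolute http(s) links in quoted href attributes are followed.
LINK_RE = re.compile(r'href\s*=\s*["\'](http[^"\']+)["\']', re.IGNORECASE)


def crawl(root: str, max_depth: int, fetch: Callable[[str], str]) -> Dict[str, str]:
    """Fetch pages breadth-first from root, following links max_depth levels deep."""
    pages: Dict[str, str] = {}
    queue = deque([(root, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in pages:
            continue  # each URL is fetched at most once
        html = fetch(url)
        pages[url] = html
        if depth < max_depth:
            for link in LINK_RE.findall(html):
                if link not in pages:
                    queue.append((link, depth + 1))
    return pages


# Hypothetical site used in place of real HTTP requests.
site = {
    "http://a": '<a href="http://b">b</a> <a href="http://c">c</a>',
    "http://b": '<a href="http://d">d</a>',
    "http://c": "",
    "http://d": "",
}
one_level = crawl("http://a", 1, lambda url: site.get(url, ""))
two_levels = crawl("http://a", 2, lambda url: site.get(url, ""))
```

Passing the fetch function as a parameter keeps the traversal logic separate from network access, so a real downloader could be substituted without changing the depth handling.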

The Chinese word segmentation module segments the large number of collected Chinese texts, removes stop words so that the remaining words become independent feature words, obtains the frequency with which each word appears in each article, and passes the result to the term weight calculation module. Stop words are prepositions, modal particles, and other very common words.

Chinese word segmentation differs somewhat from English tokenization. English text is a string of words over a small character set, whereas Chinese text is a string of characters over a large character set. Chinese sentences also have no explicit delimiters between words; they are continuous strings of characters. For a computer to segment text intelligently, it must therefore understand the meaning of each word in the sentence and cut the words out correctly, so that no errors of understanding arise.

At present, Chinese word segmentation algorithms fall broadly into three classes: mechanical segmentation (also called dictionary-based segmentation), semantic segmentation, and artificial intelligence methods (also called understanding-based segmentation). Mechanical segmentation matches the text against a dictionary. Semantic segmentation introduces semantic analysis and makes more use of the linguistic information of the natural language itself. Artificial intelligence methods process information intelligently and take two main forms. One is symbolic processing based on psychology, which simulates the functions of the human brain; an expert system, for example, aims to simulate the brain's functions by constructing an inference network and, through symbol transformation, performing interpretive processing. The other is simulation based on physiology; a neural network aims to simulate the operating mechanism of the human nervous system in order to realize a given function.

The segmentation method adopted in the present invention is mechanical segmentation. Segmentation is performed by the FreeICTCLAS word segmentation system developed by the Chinese Academy of Sciences. The frequency with which each separated feature word occurs in the text is calculated, and the feature words of each text and their word frequencies are saved in a text file in vector form. This module is also responsible for removing stop words, namely prepositions, modal particles, and other very common words such as "with" and "about".

The weight calculation module reads the text files of feature words and word frequencies output by the Chinese word segmentation module and applies the TFIDF algorithm. TFIDF is a relatively mature method in the field of information retrieval and a weighting algorithm frequently used in research on information retrieval and text feature extraction. For any given word w, the basic formula for its TFIDF value is:

Weight(w) = TF(w) × IDF(w) = f(w) × log(N/n)

Here f(w) is the number of times the word occurs in a given article (usually called the foreground corpus), N is the total number of texts in the whole experimental corpus (usually called the background corpus), and n is the number of texts in the background corpus in which the word occurs. The occurrences of w are counted per article: however many times w occurs in an article, it is counted once for that article. TF gives the frequency of the word in the foreground corpus, which corresponds to the word's absolute contribution to the foreground corpus. In general, the higher the TF value, the more representative the word is of the foreground corpus. IDF is the logarithm of the ratio of the number of background texts N to the number of background texts n in which the word occurs, and describes how widely the word is used. When a word occurs frequently in the background corpus (n is very large), it is regarded as a common word across a wide range that is not representative of any particular field, and its IDF value becomes very small because n is large. Multiplying TF by IDF gives higher weights to the words that truly represent the characteristics of the foreground corpus and suppresses commonly used general words.
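The basic formula can be checked on a small example. The sketch below assumes the natural logarithm (the base is not stated in the text) and uses a hypothetical three-text corpus in which each text is already reduced to word frequencies.

```python
import math
from collections import Counter


def document_frequency(corpus):
    """Count, for each word, the number of texts it occurs in (once per text)."""
    df = Counter()
    for tf in corpus:
        df.update(set(tf))
    return df


def tfidf(tf, df, n_texts):
    """Weight(w) = f(w) * log(N / n) for every word of one text."""
    return {w: f * math.log(n_texts / df[w]) for w, f in tf.items()}


# Hypothetical background corpus: word frequencies of three texts.
corpus = [
    {"text": 3, "visualization": 2},
    {"text": 1, "search": 4},
    {"text": 2, "engine": 1},
]
df = document_frequency(corpus)
weights = tfidf(corpus[0], df, len(corpus))
```

In this example "text" occurs in all three texts, so N/n = 1 and its weight is zero despite its high frequency, while "visualization" occurs in only one text and keeps a positive weight.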

In practice, many variants of the TFIDF algorithm have been derived from the basic formula to suit different environments and needs. The present invention uses one such variant, which adds normalization to the basic formula. The formula is as follows:

W(t, d) = \frac{tf(t, d) \times \log(N / n_{t} + 0.01)}{\sqrt{\sum_{t \in d} [tf(t, d) \times \log(N / n_{t} + 0.01)]^{2}}}

This formula adds a normalization factor to the basic formula. Here t denotes a keyword, d denotes a document, tf(t, d) is the frequency with which keyword t occurs in document d, N is the total number of texts, and n_{t} is the number of texts in which t occurs (n in the basic formula). The purpose of the formula is to reduce the suppressing effect of individual high-frequency words on the other feature words and to normalize each component.
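A minimal sketch of the normalized weighting follows, again assuming the natural logarithm. The function name `normalized_tfidf` and the sample frequencies are hypothetical.

```python
import math


def normalized_tfidf(tf, doc_freq, n_texts):
    """Return W(t, d) for every feature word t of one document d.

    tf maps each word to its frequency in d; doc_freq maps each word to the
    number of texts that contain it; n_texts is the total number of texts.
    """
    raw = {t: f * math.log(n_texts / doc_freq[t] + 0.01) for t, f in tf.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    if norm == 0.0:
        return {t: 0.0 for t in raw}
    return {t: value / norm for t, value in raw.items()}


# Hypothetical document and corpus statistics.
tf = {"text": 3, "visualization": 1}
doc_freq = {"text": 5, "visualization": 2}
weights = normalized_tfidf(tf, doc_freq, n_texts=10)
```

After normalization the squared weights of a document sum to 1, so every weight lies between 0 and 1 and can be used directly as a radius inside the unit circle.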

The content produced by the feature word weight calculation module (the document name, the feature words in the document, and the weight of each feature word) is passed to the XML file organization module and organized into XML with a defined data structure. The visual graphical interface module can receive input data in two ways: directly from memory, in real time, after the data source has been collected and analyzed, or by reading from the locally saved XML database. To keep the read interface of the visual graphical interface module consistent, the content produced by the weight calculation and the content read from the XML files are organized in the same form. The output format for reading XML data is therefore the same as the input format for writing the XML files.
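The symmetric write and read functions can be sketched with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`corpus`, `document`, `feature`, `title`, `word`, `weight`) are assumptions; the patent does not disclose the actual XML schema.

```python
import xml.etree.ElementTree as ET


def to_xml(docs):
    """Serialize {title: {feature word: weight}} into an XML string."""
    root = ET.Element("corpus")
    for title, weights in docs.items():
        doc = ET.SubElement(root, "document", title=title)
        for word, weight in weights.items():
            ET.SubElement(doc, "feature", word=word, weight=repr(weight))
    return ET.tostring(root, encoding="unicode")


def from_xml(xml_text):
    """Parse the XML string back into the same dictionary structure."""
    root = ET.fromstring(xml_text)
    return {
        doc.get("title"): {
            f.get("word"): float(f.get("weight")) for f in doc.findall("feature")
        }
        for doc in root.findall("document")
    }


# Hypothetical result of the weight calculation for one document.
docs = {"Doc A": {"文本": 0.8, "可视化": 0.6}}
restored = from_xml(to_xml(docs))
```

Because reading returns exactly the structure that writing accepts, the interface module can consume data from memory or from a saved file through the same code path.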

The visual graphical interface module comprises a control submodule and a graphical interface submodule. The control submodule accepts user commands, obtains the input keywords and operating parameters, retrieves from the locally saved XML database the documents that contain the keywords together with the corresponding weights, and displays the results through the graphical interface submodule. In the results display of the graphical interface submodule, the invention implements a method for representing graphically the degree of correlation between a document and multiple keywords.

The graphical representation of the degree of correlation between a document and multiple keywords is implemented in the graphical interface submodule as follows:

1. The results are displayed in a polar coordinate system within the unit circle, and the entered keywords are distributed along the edge of the unit circle;

2. Within the unit circle, each keyword has one weight point for each document that contains it (the weights have been normalized). The angular coordinate of the point is the same as the angular coordinate of the keyword, and its radial coordinate is the weight of the keyword in that document. The weight points of the input keywords in the same document are connected to form the polygon for that text and the chosen keywords;

3. The center of gravity of the polygon is calculated and displayed on the panel of the graphical interface submodule. The position of the center of gravity relative to each keyword indicates how closely the document is related to each keyword;

4. The user may drag keywords along the circumference to change their distribution, thereby adjusting how the search leans toward each keyword and expressing the semantic relationship between the keywords.
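Steps 1 to 3 can be sketched as follows. The text does not say how the center of gravity is computed, so this sketch assumes the average of the polygon's vertices, which also works when only two keywords are entered; an area-weighted centroid would be an alternative reading. The function names and the initial even spacing of keywords are likewise assumptions.

```python
import math


def keyword_angles(keywords):
    """Step 1: spread the keywords evenly along the edge of the unit circle."""
    step = 2 * math.pi / len(keywords)
    return {keyword: i * step for i, keyword in enumerate(keywords)}


def weight_points(weights, angles):
    """Step 2: one point per keyword, radius = weight, angle = keyword angle."""
    return [
        (weights.get(k, 0.0) * math.cos(a), weights.get(k, 0.0) * math.sin(a))
        for k, a in angles.items()
    ]


def center_of_gravity(points):
    """Step 3 (assumed form): the average of the polygon's vertices."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


angles = keyword_angles(["text", "visualization"])  # angles 0 and pi
only_text = center_of_gravity(weight_points({"text": 1.0}, angles))
balanced = center_of_gravity(
    weight_points({"text": 0.5, "visualization": 0.5}, angles)
)
```

A document that mentions only "text" lands on the "text" side of the circle, while a document that weights both keywords equally lands at the center, which matches the reading of the center of gravity given in step 3.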

In this way, when several target keywords have the same weights in several documents, that is, when the texts are structurally similar with respect to the topic, the polygon centers of gravity that represent these texts in the graphical interface cluster together or coincide. After checking one of them, the user can decide, from a clustering point of view, whether to discard the whole cluster of texts or to examine it in detail. The user can select from the displayed results, as needed, the texts that place more emphasis on one or several of the keywords, and the interactive operation of dragging keywords on the graphical interface lets the user extend the definition of the semantic relationships between keywords and obtain better results.
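Coinciding centers of gravity can be detected by grouping documents whose display positions agree after rounding. This is only an illustrative sketch: the function `group_overlapping`, the rounding precision, and the sample positions are assumptions, and the patent itself relies on the visual overlap rather than on an explicit grouping step.

```python
from collections import defaultdict


def group_overlapping(points, precision=3):
    """Group document titles whose centers of gravity coincide after rounding."""
    groups = defaultdict(list)
    for title, (x, y) in points.items():
        groups[(round(x, precision), round(y, precision))].append(title)
    return [sorted(titles) for titles in groups.values() if len(titles) > 1]


# Hypothetical display positions; "mirror" is a copy of "page1" on another site.
points = {
    "page1": (0.25, 0.1),
    "mirror": (0.25, 0.1),
    "other": (0.6, -0.2),
}
clusters = group_overlapping(points)
```

Each returned group corresponds to one visible cluster, so a user interface could mark such groups and let the user skip or open them together.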

Claims

1. A text collection visualization system, characterized in that it comprises: a text collection module, a Chinese word segmentation module, a term weight calculation module, an XML file organization module, and a visual graphical interface module, wherein:

2. The text collection visualization system according to claim 1, characterized in that the text collection module comprises a download submodule and a storage submodule, wherein: the download submodule starts from a set root URL, follows the web page links provided on the root URL, fetches web page source files down to a set number of levels, and removes non-body content such as HTML tags and script code to obtain the initial text content; the module then calls the storage submodule, which saves the text in a set local directory and, before saving each text, first checks whether a text from the same source already exists in that directory and, if so, does not save the text; and when collection is finished, the Chinese word segmentation module is called to segment the saved initial text content.

3. The text collection visualization system according to claim 1, characterized in that the Chinese word segmentation module segments the large number of collected Chinese texts, removes stop words so that the remaining words become independent feature words, obtains the frequency with which each word appears in each article, and passes the result to the term weight calculation module, wherein stop words are prepositions, modal particles, and other very common words.

4. The text collection visualization system according to claim 1, characterized in that the term weight calculation module comprises a file reading submodule and a weight calculation submodule, wherein: the file reading submodule reads the segmented words and their word frequency information from the files saved by the Chinese word segmentation module; and the weight calculation submodule derives the weights of the feature words and calls the XML file organization module, which organizes the feature words and weight data of each document into a tree structure and saves it as the XML feature database.

5. The text collection visualization system according to claim 1, characterized in that the XML file organization module defines the data organization format for text titles, feature words, and their weights, and provides a write-XML-file function and a read-XML-file function, with which, respectively, the term weight calculation module organizes its output into XML files and the visual graphical interface module reads the data set from the local XML files.

6. The text collection visualization system according to claim 1, characterized in that the visual graphical interface module comprises a control submodule and a graphical interface submodule, wherein: the control submodule accepts user commands, obtains the input keywords and operating parameters, retrieves from the locally saved XML database the documents that contain the keywords together with the corresponding weights, and displays the results through the graphical interface submodule.

7. The text collection visualization system according to claim 6, characterized in that the operating parameters that can be set in the control submodule include fuzzy or exact matching, the display scale, and viewing the retrieval results as text, wherein: fuzzy matching means that a document is counted as a retrieval result as long as it contains any one of the keywords, that is, an "or" relation; and exact matching means that a retrieval result must contain all of the entered keywords, that is, an "and" relation.