CN103049454A - Chinese and English search result visualization system based on multi-label classification

Info

Publication number: CN103049454A
Application number: CN2011103126629A
Authority: CN
Inventors: 卫志华; 苗夺谦
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2011-10-16
Filing date: 2011-10-16
Publication date: 2013-04-17
Anticipated expiration: 2031-10-16
Also published as: CN103049454B

Abstract

The invention relates to a Chinese and English search result visualization system based on multi-label classification. The system comprises a display module, a search module, a classification module and a visualization module. The display module displays the user interface and the search results. The search module calls a search engine API to search according to the user's query statement, obtains the search results, and integrates the Chinese and English search results separately. The classification module performs Chinese and English multi-label classification on the results obtained by the search module and integrates the classification results. The visualization module implements the Web user interface design for the integrated classification results and outputs them through the display module. Compared with the prior art, the system draws on granular computing theory and uses a multi-label classification method based on Bayesian theory to classify and integrate search results effectively. A visualization system designed with this method can display search results by category according to user needs without losing any search results, which improves the user's browsing efficiency and browsing experience.

Description

A Chinese and English search result visualization system based on multi-label classification

Technical field

The present invention relates to the field of information technology, and in particular to a Chinese and English search result visualization system based on multi-label classification.

Background technology

At present, online electronic documents are growing rapidly, and a large number of documents are uploaded to the Internet every day. Search engines, as an important means of obtaining knowledge from the network, are used more and more widely. However, a search engine often returns a large number of search results, which tends to drown the user in an ocean of information. Current mainstream search engines return search results ranked according to the user's keywords. To find the information of interest, the user has to browse the search results one by one.

To address this problem, some have begun to explore more advanced information retrieval methods. Generally there are two approaches. One is semantics-based information retrieval, which tries to use semantic analysis techniques to understand documents and user query statements. The other is based on machine learning, which uses a model learned from historical data to classify or cluster the documents in the search results. The present invention focuses on improving information retrieval results with machine-learning-based methods.

Visualization of web search results refers to the process of presenting search results to the user in a clearer and more organized way according to their content. Its purpose is to improve search efficiency and the user's browsing experience. For this task, most current research adopts techniques based on text clustering, that is, it treats the visualization task as an unsupervised classification problem. Following the methodology of pattern classification, features are first extracted from the text to represent it, and the text is then assigned to the cluster with which it has the highest similarity. Search engines based on clustering techniques include Vivisimo and Grokker. In this approach, the cluster names are usually generated automatically by the system from feature words. However, automatically obtained cluster names often fail to express the main content of the cluster. This makes it hard for users to locate the information they are interested in from the cluster names given by the system, so the visualization has little effect.

Unlike traditional pattern classification tasks, in which an object corresponds to a single class label, in multi-label classification an object may be associated with several labels. For example, a document may be related to economics and at the same time to computers, so the document is relevant to both the economics category and the computer category. Multi-label classification originated from the needs of text categorization: every document in the training set is associated with a label set, and the classification task is to train a model of the relationship between documents and their known label sets, and then to use this model to output a label set for each document whose labels are unknown.
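The difference from single-label classification can be illustrated with a small sketch. The document identifiers and labels below are hypothetical and serve only to show that each training document carries a set of labels rather than exactly one class.

```python
from collections import defaultdict

# Hypothetical toy training set: each document maps to a SET of labels,
# not to a single class as in traditional pattern classification.
training_set = {
    "doc1": {"economy", "computer"},
    "doc2": {"sports"},
    "doc3": {"economy"},
}

def documents_by_label(labelled_docs):
    """Invert the document -> labels mapping into label -> documents."""
    index = defaultdict(set)
    for doc_id, labels in labelled_docs.items():
        for label in labels:
            index[label].add(doc_id)
    return dict(index)

label_index = documents_by_label(training_set)
```

Here "doc1" appears under both "economy" and "computer", which a single-label scheme could not express.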

Summary of the invention

The purpose of the present invention is to overcome the defects of the prior art described above by providing a multi-label-based method for classifying Chinese and English search result information, and a Chinese and English search result visualization system that uses this method. Drawing on the ideas of granular computing, the system can display search results by category according to the user's needs, improving the user's browsing efficiency and browsing experience.

Purpose of the present invention can be achieved through the following technical solutions:

A Chinese and English search result visualization system based on multi-label classification, the system comprising:

a display module, used for displaying the user interface and the search results; a search module, used for calling a search engine API to search according to the user's query statement, obtaining the search results, and integrating the Chinese and English search results separately; a classification module, used for performing Chinese and English multi-label classification on the results obtained by the search module and integrating the classification results; and a visualization module, used for implementing the Web user interface design for the integrated classification results and outputting them through the display module.
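The way the four modules hand data to one another can be sketched as follows. This is only an illustration under stated assumptions: the function name `run_pipeline`, the language keys "zh" and "en", and the stub search and classification functions are all hypothetical, since the text does not name a specific search engine API or data format.

```python
def run_pipeline(query, searchers, classifiers):
    """searchers / classifiers: dicts keyed by a language code ("zh", "en")."""
    integrated = {}
    for lang, search in searchers.items():
        for result in search(query):                 # search module
            for label in classifiers[lang](result):  # classification module
                integrated.setdefault(label, []).append(result)
    return integrated                                # input to the visualization module

# Stub components standing in for a real search engine API and trained classifiers.
searchers = {
    "zh": lambda q: ["股市上涨"],
    "en": lambda q: ["Stock markets rally on chip news"],
}
classifiers = {
    "zh": lambda text: {"economy"},
    "en": lambda text: {"economy", "technology"},
}
integrated = run_pipeline("stock", searchers, classifiers)
```

A result with several labels is listed under each of them, so grouping by category does not drop any search result.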

The classification module comprises: a classifier, used for performing Chinese and English multi-label classification on the results obtained by the search module and integrating the classification results; and a classification corpus, which is an imbalanced corpus comprising multi-label corpora of several categories and is used for training the classifier.

The classification corpus comprises a Chinese classification corpus and an English classification corpus.

The classification module classifies using a Chinese and English multi-label classification method based on Bayesian theory, which comprises the following steps:

1) constructing the Chinese and English classification corpora;

2) the classifier performs offline learning on the classification corpus;

3) the classifier classifies the Chinese and English search results separately and performs online learning at the same time;

4) integrating the classification results.

Step 2) specifically comprises the following steps:

A) traversing the training texts in the classification corpus;

B) pre-processing the training texts;

C) scanning the training texts, recording the word frequency information of each feature word, and adding it to a HashMap;

D) calculating the conditional probability of each feature word from the word frequency statistics in the HashMap, and saving the results to a file.
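Steps A) to D) can be sketched as below. Several details are assumptions not stated in the text: the conditional probability is taken to be P(word | category) as in a multinomial naive Bayes model, add-alpha smoothing is applied, the texts are assumed to be tokenized already, and JSON is used as the file format. A Python dict plays the role of the HashMap.

```python
import json
import os
import tempfile
from collections import defaultdict

def train_offline(corpus, alpha=1.0):
    """corpus: iterable of (tokens, labels) pairs; tokens are already pre-processed."""
    # word_counts[word][category] -> frequency of the word in texts of that category
    word_counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)   # number of feature-word occurrences per category
    vocabulary = set()
    for tokens, labels in corpus:
        for label in labels:    # a multi-label text counts toward every one of its labels
            for word in tokens:
                word_counts[word][label] += 1
                totals[label] += 1
                vocabulary.add(word)
    # P(word | category) with add-alpha smoothing (the smoothing is an assumption).
    return {
        word: {
            c: (word_counts[word][c] + alpha) / (totals[c] + alpha * len(vocabulary))
            for c in totals
        }
        for word in vocabulary
    }

def save_model(cond_prob, path):
    """Persist the conditional probabilities so the classifier can load them later."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cond_prob, f, ensure_ascii=False)

# Hypothetical two-document corpus.
corpus = [
    (["stock", "market", "chip"], {"economy", "technology"}),
    (["match", "goal"], {"sports"}),
]
model = train_offline(corpus)
model_path = os.path.join(tempfile.gettempdir(), "cond_prob_demo.json")
save_model(model, model_path)
```

With smoothing, the probabilities of all vocabulary words within one category sum to 1, and words never seen in a category still receive a small non-zero probability.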

Step 3) specifically comprises the following steps:

A) reading the feature words and their statistics from the file generated by the training process, and adding them to a HashMap;

B) pre-processing the unknown text to generate a set of feature words;

C) traversing all feature words, and looking up in the HashMap generated in step A) the conditional probability of each feature word for each category;

D) calculating the joint probability of the unknown text for each category from the conditional probabilities of the feature words for each category;

E) calculating the probability threshold from all the joint probabilities obtained;

F) assigning to the unknown text all category labels whose joint probability is not less than the probability threshold, and outputting the labels;

G) revising, in the HashMap, the conditional probabilities of the feature words of the unknown text in the categories given by the classification result;

H) ending the classification process.
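The lookup, joint-probability, threshold and label-assignment steps can be sketched as follows. This is a minimal illustration rather than the patented implementation: uniform priors, log-space accumulation, skipping of unseen words, and normalization of the joint probabilities into posteriors are assumptions, and the threshold is taken as the arithmetic mean of the posteriors. The probability table is hypothetical.

```python
import math

def classify(tokens, cond_prob, priors):
    """Return (labels, posteriors) for one pre-processed text."""
    # Joint probability per category, accumulated in log space to avoid underflow.
    log_joint = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for word in tokens:
            if word in cond_prob:          # words unseen in training are skipped
                score += math.log(cond_prob[word][c])
        log_joint[c] = score
    # Normalize the joint probabilities into posteriors P(C_j | d_i).
    peak = max(log_joint.values())
    unnorm = {c: math.exp(s - peak) for c, s in log_joint.items()}
    z = sum(unnorm.values())
    posteriors = {c: v / z for c, v in unnorm.items()}
    # Threshold = arithmetic mean of the posteriors; keep categories at or above it.
    threshold = sum(posteriors.values()) / len(posteriors)
    labels = {c for c, p in posteriors.items() if p >= threshold}
    return labels, posteriors

# Hypothetical conditional probabilities P(word | category).
cond_prob = {
    "stock": {"economy": 0.4, "technology": 0.3, "sports": 0.1},
    "goal":  {"economy": 0.1, "technology": 0.1, "sports": 0.6},
}
priors = {"economy": 1 / 3, "technology": 1 / 3, "sports": 1 / 3}
labels, posteriors = classify(["stock"], cond_prob, priors)
```

For the single word "stock" the posteriors are 0.5, 0.375 and 0.125, the mean is 1/3, and the text therefore receives the two labels "economy" and "technology".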

The probability threshold P_{thres} is the arithmetic mean of the posterior probabilities of the unknown text d_{i} over all known categories:

P_{thres} = \frac{1}{n} \sum_{j=1}^{n} P(C_{j} \mid d_{i})

where P(C_{j} \mid d_{i}) is the probability that the unknown text d_{i} belongs to category C_{j}, and n is the number of categories. If P(C_{j} \mid d_{i}) \ge P_{thres}, d_{i} is given the label of category C_{j}. The number of labels n_{d} of d_{i} satisfies 1 \le n_{d} \le n.
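A short worked example may help. The posterior values below are hypothetical and chosen only for illustration, with n = 4 categories.

```latex
\begin{align*}
% Hypothetical posteriors for n = 4 categories
P(C_1 \mid d_i) &= 0.50, \quad P(C_2 \mid d_i) = 0.30, \quad
P(C_3 \mid d_i) = 0.15, \quad P(C_4 \mid d_i) = 0.05 \\
P_{thres} &= \tfrac{1}{4}\,(0.50 + 0.30 + 0.15 + 0.05) = 0.25 \\
% Only C_1 and C_2 reach the threshold, so n_d = 2
\{\, C_j : P(C_j \mid d_i) \ge P_{thres} \,\} &= \{ C_1, C_2 \}, \quad n_d = 2 \\
% The largest value is never below the mean, hence at least one label
\max_{j} P(C_j \mid d_i) &\ge \frac{1}{n} \sum_{j=1}^{n} P(C_j \mid d_i) = P_{thres}
\;\Longrightarrow\; n_d \ge 1
\end{align*}
```

The last line shows why the lower bound on n_{d} holds: the maximum of a set of numbers is never smaller than their mean, so at least one category always reaches the threshold.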

Compared with the prior art, the present invention draws on the ideas of granular computing and adopts a multi-label classification method based on Bayesian theory, so that search results can be classified and integrated effectively. A visualization system designed with this method can display search results by category according to the user's needs, improving the user's browsing efficiency and browsing experience.

Description of drawings

Fig. 1 is a structural diagram of the present invention;

Fig. 2 is a schematic diagram of the classification algorithm of the present invention;

Fig. 3 is a flowchart of the Chinese and English multi-label classification method based on Bayesian theory adopted by the classification module of the present invention;

Fig. 4 is a flowchart of offline learning in the Chinese and English multi-label classification method based on Bayesian theory;

Fig. 5 is a flowchart of classification and online learning in the Chinese and English multi-label classification method based on Bayesian theory.

Detailed description

The present invention is described in detail below in conjunction with the drawings and specific embodiments.

Embodiment

As shown in Fig. 1, a Chinese and English search result visualization system based on multi-label classification comprises a display module 1, a search module 2, a classification module 3 and a visualization module 4. The display module 1 displays the user interface and the search results. The search module 2 calls a search engine API to search according to the user's query statement, obtains the search results, and integrates the Chinese and English search results separately. The classification module 3 performs Chinese and English multi-label classification on the results obtained by the search module and integrates the classification results. The visualization module 4 implements the Web user interface design for the integrated classification results and outputs them through the display module 1.

First, the user enters a search query statement in the display module 1. The search module 2 calls the search engine to search, obtains the search results, and integrates the Chinese and English search results separately. The classification module 3 then applies the information classification method described above to perform Chinese and English multi-label classification on the results obtained by the search module 2, and integrates the classification results. Finally, the visualization module 4 implements the Web user interface design for the integrated classification results and outputs them to the user through the display module 1. Struts 2 may be used as the MVC framework, and the combination of Apache Geronimo 2.x and Jetty 6 is chosen as the container, which meets the usage requirements while reducing software costs at deployment. AJAX is used on the web front end to dynamically update the search results under the category selected by the user. The algorithm of the whole system is shown in Fig. 2.
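The dynamic update of results for a selected category could be served by a small handler like the one sketched below. The embodiment itself uses Struts 2 on a Java container; Python is used here only for illustration, and the function name, the JSON field names and the sample data are hypothetical.

```python
import json

def results_for_category(integrated, category):
    """Build the JSON body a front end could request when the user picks a category."""
    items = integrated.get(category, [])
    return json.dumps(
        {"category": category, "count": len(items), "results": items},
        ensure_ascii=False,   # keep Chinese titles readable in the payload
    )

# Hypothetical integrated classification results (category -> result titles).
integrated = {
    "economy": ["Stock markets rally", "股市上涨"],
    "technology": ["New chip announced", "Stock markets rally"],
}
payload = results_for_category(integrated, "economy")
```

Returning only the results of the chosen category keeps each asynchronous response small, while an unknown category simply yields an empty list.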

The classification module of the present invention classifies using a Chinese and English multi-label classification method based on Bayesian theory. As shown in Fig. 3, the method comprises the following steps: 1) constructing the classification corpus; 2) the classifier performs offline learning on the classification corpus; 3) the classifier classifies the Chinese and English search results separately and performs online learning at the same time; 4) integrating the classification results.

To implement the method, Chinese and English multi-label classification corpora must be constructed. A Chinese news corpus is first built, and the existing English Reuters corpus is used, for training the classifier. Here, starting from the news corpus, part of the news texts were manually screened and annotated to construct a multi-class multi-label corpus of 9 categories. The corpus covers the 9 categories of economy, military, sports, entertainment, science and technology, society, commerce, education and travel, with 5084 texts in total. The corpus is imbalanced: the distribution of the number of texts across categories was determined by considering the amount of information of each category in real life.

The classifier is trained by combining offline learning with online learning. First, offline learning is carried out on the news classification corpus, that is, a Bayesian multi-class multi-label classifier is trained. Then, whenever new search results (texts) arrive during actual operation of the system, the previously learned model is continually revised and improved.

As shown in Fig. 4, offline learning comprises the following steps: A) traversing the training texts in the classification corpus; B) pre-processing the training texts; C) scanning the training texts, recording the word frequency information of each feature word, and adding it to a HashMap; D) calculating the conditional probability of each feature word from the word frequency statistics in the HashMap, and saving the results to a file.

The HashMap stores the feature words of the training texts and their statistics. With a HashMap, looking up a feature word or modifying the statistics of a feature word can be completed in constant time complexity.
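The role of the HashMap can be illustrated with a nested hash table. In this sketch a Python dict stands in for the HashMap, and the layout (feature word mapped to per-category frequencies) and the helper names are assumptions, since the text does not specify the stored structure.

```python
# Python's dict is a hash table, used here in place of the HashMap named in the text.
feature_stats = {}   # feature word -> {category -> word frequency}

def add_occurrence(stats, word, category, count=1):
    """Insert or update the statistics of one feature word (O(1) on average)."""
    per_category = stats.setdefault(word, {})
    per_category[category] = per_category.get(category, 0) + count

def lookup(stats, word, category):
    """Query a feature word's frequency in a category (O(1) on average); 0 if absent."""
    return stats.get(word, {}).get(category, 0)

add_occurrence(feature_stats, "market", "economy")
add_occurrence(feature_stats, "market", "economy", 2)
add_occurrence(feature_stats, "market", "commerce")
```

Both operations hash the word once and the category once, so their cost does not grow with the size of the vocabulary.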

Online learning of the classifier is carried out at the same time as classification. As shown in Fig. 5, the specific steps comprise: A) reading the feature words and their statistics from the file generated by the training process, and adding them to a HashMap; B) pre-processing the unknown text to generate a set of feature words; C) traversing all feature words, and looking up in the HashMap generated in step A) the conditional probability of each feature word for each category; D) calculating the joint probability of the unknown text for each category from the conditional probabilities of the feature words for each category; E) calculating the probability threshold from all the joint probabilities obtained; F) assigning to the unknown text all category labels whose joint probability is not less than the probability threshold, and outputting the labels; G) revising, in the HashMap, the conditional probabilities of the feature words of the unknown text in the categories given by the classification result; H) ending the classification process.
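The revision in step G) can be sketched as an incremental count update. The text does not state the update rule, so this is one plausible reading under stated assumptions: the model keeps raw word counts, the newly classified text is treated as a training example for each assigned label, and add-alpha smoothing with a fixed vocabulary size is used. All names and sample values are hypothetical.

```python
def online_update(word_counts, totals, tokens, labels, vocab_size, alpha=1.0):
    """Add one classified text to the counts and return the revised P(word | label)."""
    # Treat the text as a new training example for every label it was assigned.
    for label in labels:
        for word in tokens:
            per_category = word_counts.setdefault(word, {})
            per_category[label] = per_category.get(label, 0) + 1
            totals[label] = totals.get(label, 0) + 1
    # Recompute the probabilities only for this text's words under the assigned labels.
    revised = {}
    for label in labels:
        for word in set(tokens):
            count = word_counts[word][label]
            revised[(word, label)] = (count + alpha) / (totals[label] + alpha * vocab_size)
    return revised

# Hypothetical model state before the new text arrives.
word_counts = {"stock": {"economy": 1}, "goal": {"sports": 1}}
totals = {"economy": 1, "sports": 1}
revised = online_update(word_counts, totals, ["stock", "rally"], {"economy"}, vocab_size=3)
```

Categories that were not assigned to the text, such as "sports" here, are left untouched, which matches the wording that only the categories given by the classification result are revised.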

Here, the probability threshold P_{thres} is the arithmetic mean of the posterior probabilities of the unknown text d_{i} over all known categories:

P_{thres} = \frac{1}{n} \sum_{j=1}^{n} P(C_{j} \mid d_{i})

Claims

1. A Chinese and English search result visualization system based on multi-label classification, characterized in that the system comprises:

a display module, used for displaying the user interface and the search results;

a search module, used for calling a search engine API to search according to the user's query statement, obtaining the search results, and integrating the Chinese and English search results separately;

a classification module, used for performing Chinese and English multi-label classification on the results obtained by the search module, and integrating the classification results;

a visualization module, used for implementing the Web user interface design for the integrated classification results, and outputting them through the display module.

2. The Chinese and English search result visualization system based on multi-label classification according to claim 1, characterized in that the classification module comprises:

a classifier, used for performing Chinese and English multi-label classification on the results obtained by the search module, and integrating the classification results;

a classification corpus, which is an imbalanced corpus comprising multi-label corpora of several categories and is used for training the classifier.

3. The Chinese and English search result visualization system based on multi-label classification according to claim 2, characterized in that the classification corpus comprises a Chinese classification corpus and an English classification corpus.

4. The Chinese and English search result visualization system based on multi-label classification according to claim 1, characterized in that the classification module classifies using a Chinese and English multi-label classification method based on Bayesian theory, which specifically comprises the following steps:

1) constructing the Chinese and English classification corpora;

2) the classifier performs offline learning on the classification corpus;

4) integrating the classification results.

5. The Chinese and English search result visualization system based on multi-label classification according to claim 4, characterized in that step 2) specifically comprises the following steps:

A) traversing the training texts in the classification corpus;

B) pre-processing the training texts;

6. The Chinese and English search result visualization system based on multi-label classification according to claim 4, characterized in that step 3) specifically comprises the following steps:

H) ending the classification process.

7. The Chinese and English search result visualization system based on multi-label classification according to claim 6, characterized in that the probability threshold P_{thres} is the arithmetic mean of the posterior probabilities of the unknown text d_{i} over all known categories:

P_{thres} = \frac{1}{n} \sum_{j=1}^{n} P(C_{j} \mid d_{i})