CN103049454B

CN103049454B - A Chinese and English search result visualization system based on multi-label classification

Info

Publication number: CN103049454B
Application number: CN201110312662.9A
Authority: CN
Inventors: 卫志华; 苗夺谦
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2011-10-16
Filing date: 2011-10-16
Publication date: 2016-04-20
Anticipated expiration: 2031-10-16
Also published as: CN103049454A

Abstract

The present invention relates to a Chinese and English search result visualization system based on multi-label classification. The system comprises: a display module for showing the user interface and the search results; a search module for calling a search engine API according to the user's query statement, obtaining the search results, and integrating the Chinese and the English search results separately; a classification module for performing Chinese and English multi-label classification on the results obtained by the search module and integrating the classification results; and a visualization module for implementing the Web user interface design for the integrated classification results and outputting it through the display module. Compared with the prior art, the invention draws on the idea of granular computing and adopts a multi-label classification method based on Bayesian theory, so that search results can be effectively multi-label classified and integrated. A visualization system designed around this method can display search results by category according to the user's needs while losing as few search results as possible, which improves the user's browsing efficiency and browsing experience.

Description

A Chinese and English search result visualization system based on multi-label classification

Technical field

The present invention relates to the field of information technology, and in particular to a Chinese and English search result visualization system based on multi-label classification.

Background art

At present, the number of online electronic documents is growing rapidly, and a large number of documents are uploaded to the Internet every day. Search engines, as an important means of acquiring knowledge from the network, are used more and more widely. However, a search engine often returns a very large number of results, so users are easily overwhelmed by information. Current mainstream search engines return results ranked against the user's keywords. To find the information of interest, the user has to browse the results one by one.

To address this problem, work has begun on more advanced information retrieval methods. Broadly, there are two approaches. The first is semantics-based information retrieval, which uses semantic analysis techniques to understand both the documents and the user's query statement. The second is based on machine learning, which uses a model learned from historical data to classify or cluster the documents in the search results. The present invention is concerned with improving information retrieval results by means of machine learning.

Visualization of Web search results refers to the process of presenting the search results to the user in a clearer and better organized way according to their content. Its purpose is to improve search efficiency and the user's browsing experience. Most current research on this task adopts techniques based on text clustering and treats visualization as an unsupervised classification problem. Following the usual pattern classification procedure, features are first extracted from each text to represent it, and the text is then assigned to the cluster with which it has the highest similarity. Search engines based on clustering include Vivisimo and Grokker. In this approach, the cluster names are usually generated automatically by the system from feature words. However, automatically generated cluster names often fail to convey the main content of the cluster. Users then find it hard to locate the information they are interested in from the cluster names the system provides, and the benefit of the visualization is limited.

In traditional pattern classification, each object corresponds to a single class label. In multi-label classification, by contrast, an object may be associated with several labels. For example, a document may be related to economics and also to computers, in which case it belongs to both the economy category and the computer category. Multi-label classification originated from the needs of text categorization. Each document in the training set is associated with a set of labels; the task is to train a model of the relationship between documents and the known label sets, and to use this model to output a label set for each document whose labels are unknown.
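As an illustration of the multi-label setting described above, the following minimal Python sketch represents a training set in which each document maps to a set of labels rather than to a single label. The document identifiers, the category names, and the helper `documents_per_label` are hypothetical and serve only to show the data shape.

```python
# Hypothetical toy training set: each document is associated with a SET of labels.
training_set = {
    "doc_001": {"economy", "computer"},  # one document, two categories
    "doc_002": {"sports"},
    "doc_003": {"economy"},
}


def documents_per_label(dataset):
    """Invert the mapping document -> labels into label -> documents."""
    index = {}
    for doc_id, labels in dataset.items():
        for label in labels:
            index.setdefault(label, set()).add(doc_id)
    return index


label_index = documents_per_label(training_set)
```

In this sketch `doc_001` appears under both `economy` and `computer`, which is exactly what a single-label representation cannot express.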

Summary of the invention

The object of the present invention is to overcome the above defects of the prior art by providing a multi-label Chinese and English search result classification method and a Chinese and English search result visualization system that applies it. Drawing on the idea of granular computing, the system can display search results by category according to the user's needs, improving the user's browsing efficiency and browsing experience.

The object of the present invention is achieved by the following technical solution:

A Chinese and English search result visualization system based on multi-label classification, the system comprising:

a display module for showing the user interface and the search results; a search module for calling a search engine API according to the user's query statement, obtaining the search results, and integrating the Chinese and the English search results separately; a classification module for performing Chinese and English multi-label classification on the results obtained by the search module and integrating the classification results; and a visualization module for implementing the Web user interface design for the integrated classification results and outputting it through the display module.

The classification module comprises: a classifier for performing Chinese and English multi-label classification on the results obtained by the search module and integrating the classification results; and a classification corpus, which is an imbalanced corpus comprising multi-label corpora of several categories and is used to train the classifier.

The classification corpus comprises a Chinese classification corpus and an English classification corpus.

The classification module classifies using a Chinese and English multi-label classification method based on Bayesian theory, which comprises the following steps:

1) building the Chinese and English classification corpora;

2) the classifier performs offline learning on the classification corpora;

3) the classifier classifies the Chinese and the English search results separately, performing online learning at the same time;

4) integrating the classification results.

Step 2) specifically comprises the following steps:

A) traversing the training texts in the classification corpus;

B) pre-processing the training texts;

C) scanning the training texts, recording the word frequency of each feature word, and adding it to a HashMap;

D) calculating the conditional probability of each feature word from the word frequency statistics in the HashMap, and saving the results to a file.
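The offline learning steps A) to D) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patented implementation: a Python dict stands in for the HashMap, `tokenize` is a placeholder for the unspecified pre-processing, add-one (Laplace) smoothing is assumed because the text does not say how the conditional probabilities are estimated, and JSON is an assumed file format. All function names are hypothetical.

```python
import json
from collections import Counter, defaultdict


def tokenize(text):
    # Placeholder pre-processing (assumption): lowercase and split on whitespace.
    return text.lower().split()


def train_offline(corpus):
    """corpus: list of (text, set_of_labels). Returns (word_counts, cond_prob)."""
    word_counts = defaultdict(Counter)  # category -> feature word -> frequency
    vocabulary = set()
    for text, labels in corpus:          # step A: traverse the training texts
        words = tokenize(text)           # step B: pre-process
        vocabulary.update(words)
        for label in labels:             # step C: a multi-label text counts toward each of its categories
            word_counts[label].update(words)
    v = len(vocabulary)
    cond_prob = {}
    for label, counts in word_counts.items():  # step D: P(word | category)
        total = sum(counts.values())
        # Add-one smoothing is an assumption made for this sketch.
        cond_prob[label] = {w: (counts[w] + 1) / (total + v) for w in vocabulary}
    return word_counts, cond_prob


def save_model(cond_prob, path):
    # Persist the probabilities so the online stage can load them later.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cond_prob, f, ensure_ascii=False)
```

With smoothing of this kind, the probabilities of all vocabulary words within one category sum to 1, which gives a simple sanity check on the trained table.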

Step 3) specifically comprises the following steps:

A) reading the feature words and their statistics from the file generated by the training process, and adding them to a HashMap;

B) pre-processing the unknown text to generate a set of feature words;

C) traversing all feature words, and looking up, in the HashMap generated in step A), the conditional probability of each feature word for each category;

D) calculating the joint probability of the unknown text for every category from the conditional probabilities of the feature words for each category;

E) calculating the probability threshold from all the joint probabilities obtained;

F) assigning to the unknown text every class label whose joint probability is not less than the probability threshold, and outputting the labels;

G) revising, in the HashMap, the conditional probabilities of the feature words of the unknown text for the categories given by the classification result;

H) ending the classification process.
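Steps C) and D) above can be illustrated with a short Python sketch. It assumes a naive Bayes model, in which the feature words are treated as conditionally independent given the category, and it works in log space to avoid numerical underflow; neither detail is stated in the text. The `prior` argument (category prior probabilities), the rule of skipping words that were not seen in training, and the function name `posteriors` are likewise assumptions of this sketch.

```python
import math


def posteriors(words, cond_prob, prior):
    """Return P(C_j | d_i) for every category, normalized to sum to 1.

    words:     feature words of the unknown text
    cond_prob: category -> {feature word -> P(word | category)}
    prior:     category -> P(category)  (assumed input, not from the source)
    """
    log_joint = {}
    for label, word_probs in cond_prob.items():
        score = math.log(prior[label])
        for w in words:
            if w in word_probs:  # words unseen in training are skipped (assumption)
                score += math.log(word_probs[w])
        log_joint[label] = score
    # Subtract the largest log score before exponentiating, for numerical stability.
    peak = max(log_joint.values())
    unnormalized = {c: math.exp(s - peak) for c, s in log_joint.items()}
    z = sum(unnormalized.values())
    return {c: u / z for c, u in unnormalized.items()}
```

The normalized values returned here are the quantities over which the probability threshold of step E) is averaged.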

The probability threshold P_{thres} is the arithmetic mean of the posterior probabilities of the unknown text d_{i} over all known categories:

P_{thres} = \frac{1}{n} \sum_{j=1}^{n} P(C_{j} | d_{i})

where P(C_{j} | d_{i}) is the probability that the unknown text d_{i} belongs to category C_{j}, and n is the number of categories. If P(C_{j} | d_{i}) ≥ P_{thres}, then d_{i} is given the label of category C_{j}. The number of labels n_{d} of d_{i} satisfies 1 ≤ n_{d} ≤ n.
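The thresholding rule can be sketched in a few lines of Python. The function name `assign_labels` and the example probabilities are hypothetical. The sketch also makes the lower bound on the number of labels easy to see: the largest posterior can never be smaller than the mean, so at least one label is always assigned.

```python
def assign_labels(posterior):
    """posterior: category -> P(C_j | d_i). Returns (assigned_labels, threshold)."""
    # Threshold = arithmetic mean of the posteriors over all n categories.
    threshold = sum(posterior.values()) / len(posterior)
    # Keep every category whose posterior is not less than the threshold.
    labels = {c for c, p in posterior.items() if p >= threshold}
    return labels, threshold


# Hypothetical posteriors for one search result over four categories.
example = {"economy": 0.5, "technology": 0.3, "sports": 0.15, "travel": 0.05}
labels, threshold = assign_labels(example)
```

For these example values the mean is 0.25, so the text receives the `economy` and `technology` labels, while `sports` and `travel` fall below the threshold.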

Compared with the prior art, the present invention draws on the idea of granular computing and adopts a multi-label classification method based on Bayesian theory, so that search results can be effectively classified and integrated. A visualization system designed around this method can display search results by category according to the user's needs, improving the user's browsing efficiency and browsing experience.

Brief description of the drawings

Fig. 1 is a schematic structural diagram of the present invention;

Fig. 2 is a schematic diagram of the classification algorithm of the present invention;

Fig. 3 is a flowchart of the Chinese and English multi-label classification method based on Bayesian theory adopted by the classification module of the present invention;

Fig. 4 is a flowchart of offline learning in the Chinese and English multi-label classification method based on Bayesian theory;

Fig. 5 is a flowchart of classification and online learning in the Chinese and English multi-label classification method based on Bayesian theory.

Detailed description of the embodiments

The present invention is described in detail below with reference to the drawings and a specific embodiment.

Embodiment

As shown in Fig. 1, a Chinese and English search result visualization system based on multi-label classification comprises a display module 1, a search module 2, a classification module 3 and a visualization module 4. The display module 1 shows the user interface and the search results. The search module 2 calls a search engine API according to the user's query statement, obtains the search results, and integrates the Chinese and the English search results separately. The classification module 3 performs Chinese and English multi-label classification on the results obtained by the search module 2 and integrates the classification results. The visualization module 4 implements the Web user interface design for the integrated classification results and outputs it through the display module 1.

First, the user enters a search query statement in the display module 1. The search module 2 calls the search engine to search, obtains the search results, and integrates the Chinese and the English search results separately. The classification module 3 then performs Chinese and English multi-label classification on the results obtained by the search module 2 using the above classification method, and integrates the classification results. Finally, the visualization module 4 carries out the Web user interface design for the integrated classification results and outputs it to the user through the display module 1. Struts2 may be used as the MVC framework, and the combination of Apache Geronimo 2.x and Jetty 6 is chosen as the container, which meets the usage requirements while reducing the software overhead at deployment. AJAX is used in the Web front end to dynamically update the search results under the category the user selects. The algorithm of the whole system is shown in Fig. 2.

The classification module of the present invention classifies using a Chinese and English multi-label classification method based on Bayesian theory. As shown in Fig. 3, the method comprises the following steps: 1) building the classification corpora; 2) the classifier performs offline learning on the classification corpora; 3) the classifier classifies the Chinese and the English search results separately, performing online learning at the same time; 4) integrating the classification results.
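The text does not detail how step 4) integrates the classification results. One possible reading, sketched below in Python purely as an assumption, is that the separately classified Chinese and English results are merged into a single mapping from category to results, so that every result remains reachable under each of its labels. The function name `integrate`, the language tags, and the requirement that both languages share the same category names are hypothetical.

```python
def integrate(chinese_results, english_results):
    """Merge per-language classified results into category -> [(language, title)].

    Each input is a list of (title, set_of_labels). A result with several
    labels is listed under every one of them, so no result is dropped.
    """
    merged = {}
    for language, results in (("zh", chinese_results), ("en", english_results)):
        for title, labels in results:
            for label in sorted(labels):  # sorted only to make the iteration order deterministic
                merged.setdefault(label, []).append((language, title))
    return merged


# Hypothetical classified results for one query.
zh = [("股市上涨", {"economy"})]
en = [("Chip stocks rally", {"economy", "technology"})]
by_category = integrate(zh, en)
```

A front end could then render one panel per key of `by_category` and update it when the user selects a category.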

To implement this method, Chinese and English multi-label classification corpora need to be built. A Chinese news corpus is first established, and the existing English Reuters corpus is used, to train the classifier. Starting from the news corpus, part of the news texts were manually screened and annotated, yielding a multi-class multi-label corpus with 9 categories. This corpus contains 5084 texts in the 9 categories of economy, military, sports, entertainment, science and technology, society, business, education, and travel. It is an imbalanced corpus: the number of texts in each category was set with regard to the amount of information in each category in real life.

The classifier is trained by combining offline learning with online learning. First, a news corpus is built and used for offline learning, that is, for training the Bayesian multi-class multi-label classifier. Then, as new search results (texts) arrive during actual operation of the system, the previously learned model is continually revised and improved.

The steps of offline learning are shown in Fig. 4 and comprise: A) traversing the training texts in the classification corpus; B) pre-processing the training texts; C) scanning the training texts, recording the word frequency of each feature word, and adding it to a HashMap; D) calculating the conditional probability of each feature word from the word frequency statistics in the HashMap, and saving the results to a file.

Here, the HashMap stores the feature words of the training texts together with their statistics. With a HashMap, looking up a feature word or modifying the statistics of a feature word can be completed in constant time complexity.

Online learning and classification by the classifier are carried out at the same time. The specific steps are shown in Fig. 5 and comprise: A) reading the feature words and their statistics from the file generated by the training process, and adding them to a HashMap; B) pre-processing the unknown text to generate a set of feature words; C) traversing all feature words, and looking up, in the HashMap generated in step A), the conditional probability of each feature word for each category; D) calculating the joint probability of the unknown text for every category from the conditional probabilities of the feature words for each category; E) calculating the probability threshold from all the joint probabilities obtained; F) assigning to the unknown text every class label whose joint probability is not less than the probability threshold, and outputting the labels; G) revising, in the HashMap, the conditional probabilities of the feature words of the unknown text for the categories given by the classification result; H) ending the classification process.
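The online revision in step G) could be realized in several ways, and the text does not say which. The Python sketch below assumes that the raw word frequencies are kept alongside the probabilities, that a newly classified text is folded into the counts of each category it was assigned to, and that the conditional probabilities of only those categories are then recomputed with add-one smoothing. Words outside the known vocabulary are ignored to keep the sketch simple. All names are hypothetical.

```python
from collections import Counter


def online_update(word_counts, vocabulary, words, assigned_labels):
    """Fold one classified text into the model and refresh the affected categories.

    word_counts:     category -> Counter of feature word frequencies (mutated in place)
    vocabulary:      set of known feature words (kept fixed in this sketch)
    words:           feature words of the newly classified text
    assigned_labels: labels produced by the classification step
    Returns category -> {feature word -> updated P(word | category)}.
    """
    known = [w for w in words if w in vocabulary]  # simplifying assumption
    v = len(vocabulary)
    updated = {}
    for label in assigned_labels:  # only the assigned categories are revised
        counts = word_counts.setdefault(label, Counter())
        counts.update(known)
        total = sum(counts.values())
        updated[label] = {w: (counts[w] + 1) / (total + v) for w in vocabulary}
    return updated
```

Because categories that were not assigned are left untouched, the cost of each update grows with the number of assigned labels rather than with the total number of categories.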

Here, the probability threshold P_{thres} is the arithmetic mean of the posterior probabilities of the unknown text d_{i} over all known categories:

P_{thres} = \frac{1}{n} \sum_{j=1}^{n} P(C_{j} | d_{i})

Claims

1. A Chinese and English search result visualization system based on multi-label classification, characterized in that the system comprises:

a display module for showing the user interface and the search results;

a search module for calling a search engine API according to the user's query statement, obtaining the search results, and integrating the Chinese and the English search results separately;

a classification module for performing Chinese and English multi-label classification on the results obtained by the search module and integrating the classification results;

a visualization module for implementing the Web user interface design for the integrated classification results and outputting it through the display module, the visualization module using Struts2 as the MVC framework and the combination of Apache Geronimo 2.x and Jetty 6 as the container;

1) building the Chinese and English classification corpora, the Chinese and English classification corpora being multi-class multi-label corpora;

2) the classifier performs offline learning on the classification corpora;

4) integrating the classification results.

2. The Chinese and English search result visualization system based on multi-label classification according to claim 1, characterized in that the classification module comprises:

a classifier for performing Chinese and English multi-label classification on the results obtained by the search module and integrating the classification results;

a classification corpus, which is an imbalanced corpus comprising multi-label corpora of several categories and is used to train the classifier.

3. The Chinese and English search result visualization system based on multi-label classification according to claim 2, characterized in that the classification corpus comprises a Chinese classification corpus and an English classification corpus.

4. The Chinese and English search result visualization system based on multi-label classification according to claim 1, characterized in that step 3) specifically comprises the following steps:

B) pre-processing the unknown text to generate a set of feature words;

H) ending the classification process.

5. The Chinese and English search result visualization system based on multi-label classification according to claim 4, characterized in that the probability threshold P_{thres} is the arithmetic mean of the posterior probabilities of the unknown text d_{i} over all known categories:

P_{thres} = \frac{1}{n} \sum_{j=1}^{n} P(C_{j} | d_{i})