CN103049454A - Chinese and English search result visualization system based on multi-label classification - Google Patents

Publication number
CN103049454A
Authority
CN
China
Legal status
Granted
Application number
CN2011103126629A
Other languages
Chinese (zh)
Other versions
CN103049454B (en)
Inventor
卫志华
苗夺谦
Current Assignee
Tongji University
Original Assignee
Tongji University
Priority date
2011-10-16
Filing date
2011-10-16
Publication date
2013-04-17
Application filed by Tongji University
Priority to CN201110312662.9A
Publication of CN103049454A
Application granted
Publication of CN103049454B
Expired - Fee Related

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a Chinese and English search result visualization system based on multi-label classification. The system comprises a display module, a search module, a classification module and a visualization module. The display module displays the user interface and the search results. The search module calls a search engine API with the user's query, obtains the search results, and consolidates the Chinese and the English results separately. The classification module performs Chinese and English multi-label classification on the results obtained by the search module and integrates the classification results. The visualization module renders the integrated classification results in a web user interface and outputs them through the display module. Compared with the prior art, the system draws on granular computing theory and uses a multi-label classification method based on Bayesian theory to classify and integrate search results effectively. A visualization system designed around this method can display search results by category according to user requirements without discarding any search results, improving browsing efficiency and the browsing experience.

Description

A Chinese and English search result visualization system based on multi-label classification
Technical field
The present invention relates to the field of information technology, and in particular to a Chinese and English search result visualization system based on multi-label classification.
Background art
Online electronic documents are growing rapidly, with large numbers uploaded to the web every day. Search engines, an important means of acquiring knowledge from the web, are used ever more widely. However, a search engine often returns a very large number of results, and users are easily overwhelmed by the volume of information. Current mainstream search engines return results ranked against the user's keywords, so to find the information of interest the user has to browse the results one by one.
To address this problem, researchers have begun to explore more advanced information retrieval methods. Broadly there are two approaches. The first is semantics-based retrieval, which uses semantic analysis to understand documents and the user's query. The second is machine-learning-based retrieval, which uses a model learned from historical data to classify or cluster the documents in the search results. The present invention is concerned with improving retrieval results through the machine-learning approach.
Web search result visualization is the process of presenting search results to the user, according to their content, in a clearer and better-organized way. Its purpose is to improve search efficiency and the user's browsing experience. Most existing work relies on text clustering, treating visualization as an unsupervised classification problem. Following the usual pattern classification approach, features are first extracted from the text to represent it, and the text is then assigned to the cluster to which it is most similar. Search engines based on clustering include Vivisimo and Groker. In this approach the cluster names are usually generated automatically by the system from feature words. However, automatically generated names often fail to convey the main content of a cluster, so users find it hard to locate the information they want from the cluster names, and the visualization is of limited benefit.
In traditional pattern classification each object corresponds to a single class label. In multi-label classification, by contrast, an object may be associated with several labels: a document may be related to economics and also to computers, in which case it belongs to both the economy category and the computer category. Multi-label classification originated in text categorization, where each document in the training set is associated with a set of labels. The task is to train a model of the relationship between documents and their known label sets, and then to use that model to output a label set for each document whose labels are unknown.
Summary of the invention
The purpose of the present invention is to overcome the above defects of the prior art by providing a multi-label method for classifying Chinese and English search results, together with a Chinese and English search result visualization system that uses this method. Drawing on ideas from granular computing, the system can display search results by category according to the user's needs, improving browsing efficiency and the browsing experience.
The purpose of the present invention can be achieved through the following technical solution:
A Chinese and English search result visualization system based on multi-label classification, comprising:
a display module, for displaying the user interface and the search results;
a search module, for calling a search engine API with the user's query, obtaining the search results, and consolidating the Chinese and the English results separately;
a classification module, for performing Chinese and English multi-label classification on the results obtained by the search module and integrating the classification results;
a visualization module, for rendering the integrated classification results in a web user interface and outputting them through the display module.
The classification module comprises: a classifier, for performing Chinese and English multi-label classification on the results obtained by the search module and integrating the classification results; and a classification corpus, which is an imbalanced corpus containing multi-label corpora for several categories and is used to train the classifier.
The classification corpus comprises a Chinese classification corpus and an English classification corpus.
The classification module classifies using a Chinese and English multi-label classification method based on Bayesian theory, which comprises the following steps:
1) build the Chinese and English classification corpora;
2) the classifier performs offline learning on the classification corpus;
3) the classifier classifies the Chinese and the English search results separately, performing online learning at the same time;
4) the classification results are integrated.
Step 2) comprises the following steps:
A) traverse the training texts in the classification corpus;
B) pre-process the training texts;
C) scan the training texts, record the word frequency of each feature word, and add it to a HashMap;
D) compute the conditional probability of each feature word from the word frequency statistics in the HashMap, and save the results to a file.
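Steps A) to D) can be sketched as follows. This is a minimal illustration in Python, not the patented implementation: a dict-backed `Counter` stands in for the HashMap, and the function names, the JSON file format and the add-alpha smoothing are all assumptions made for the example. Pre-processing (step B) is assumed to have been done already, so each training text arrives as a list of feature words.

```python
import json
from collections import Counter, defaultdict


def count_words(corpus):
    """Steps A-C: traverse the corpus and record word frequencies per category.

    `corpus` is a list of (tokens, labels) pairs; tokens are assumed to be
    already pre-processed (segmented, stop words removed).
    """
    counts = defaultdict(Counter)  # category -> Counter of feature words
    for tokens, labels in corpus:
        for label in labels:  # a multi-label text counts toward each of its labels
            counts[label].update(tokens)
    return counts


def conditional_probabilities(counts, alpha=1.0):
    """Step D: estimate P(word | category); add-alpha smoothing is an assumption."""
    vocab = {word for counter in counts.values() for word in counter}
    model = {}
    for label, counter in counts.items():
        total = sum(counter.values()) + alpha * len(vocab)
        model[label] = {word: (counter[word] + alpha) / total for word in vocab}
    return model


def save_model(model, path):
    """Step D: persist the result; JSON is a hypothetical choice of file format."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(model, handle, ensure_ascii=False)
```

A text labelled with two categories contributes its words to both, which is one simple way to train a per-category model from multi-label data.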
Step 3) comprises the following steps:
a) read the feature words and their statistics from the file generated by training, and add them to a HashMap;
b) pre-process the unknown text to generate its set of feature words;
c) traverse all the feature words and look up, in the HashMap built in step a), the conditional probability of each feature word for each category;
d) from the conditional probabilities of the feature words for each category, compute the joint probability of the unknown text for every category;
e) compute the probability threshold from all the joint probabilities obtained;
f) assign to the unknown text every category label whose joint probability is not less than the probability threshold, and output the labels;
g) in the HashMap, revise the conditional probabilities of the feature words of the unknown text under the categories given by the classification result;
h) the classification process ends.
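The classification and online-learning loop can be sketched roughly as below. This is a hedged illustration rather than the patent's own code: it assumes a naive Bayes model with uniform category priors and add-alpha smoothing, keeps raw word counts in memory so that step g) can be done by incrementing counts, and uses hypothetical names (`classify`, `online_update`). Probabilities are combined in log space to avoid underflow.

```python
import math
from collections import Counter


def classify(tokens, counts, alpha=1.0):
    """Steps c-f: score every category and keep those at or above the mean.

    `counts` maps category -> Counter of feature-word frequencies.
    """
    vocab = {word for counter in counts.values() for word in counter}
    log_joint = {}
    for label, counter in counts.items():
        total = sum(counter.values()) + alpha * len(vocab)
        # Words never seen in training are skipped (an assumption).
        log_joint[label] = sum(
            math.log((counter[tok] + alpha) / total) for tok in tokens if tok in vocab
        )
    # Normalise the joint probabilities into posteriors that sum to 1.
    peak = max(log_joint.values())
    weights = {label: math.exp(value - peak) for label, value in log_joint.items()}
    norm = sum(weights.values())
    posteriors = {label: weight / norm for label, weight in weights.items()}
    # Threshold = arithmetic mean of the posteriors (1/n once normalised).
    threshold = sum(posteriors.values()) / len(posteriors)
    labels = [label for label, p in posteriors.items() if p >= threshold]
    return labels, posteriors


def online_update(tokens, labels, counts):
    """Step g: fold the newly classified text back into the statistics."""
    for label in labels:
        counts[label].update(tokens)
```

Updating counts rather than stored probabilities is a design choice made here for simplicity; the conditional probabilities are then recomputed from the revised counts on the next call.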
Described probability threshold value P ThresBe unknown text d iArithmetical mean for the posterior probability of all known class:
P thres = 1 n Σ j = 1 n P ( C j | d i )
P (C j| d i) be unknown text d iBelong to certain classification C jProbability, n is the classification number, if P (C j| d i) 〉=P Thres, d iGive classification C jLabel, d iNumber of labels n dSatisfy 1≤n d≤ n.
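As a small worked illustration of the threshold rule (the numbers are invented for the example and do not come from the patent), suppose there are n = 3 categories and the posteriors of a text d_i are 0.6, 0.3 and 0.1:

```latex
\begin{aligned}
P_{thres} &= \tfrac{1}{3}\,(0.6 + 0.3 + 0.1) = \tfrac{1}{3} \approx 0.333,\\
P(C_1 \mid d_i) &= 0.6 \ge P_{thres} \;\Rightarrow\; \text{label } C_1 \text{ is assigned},\\
P(C_2 \mid d_i) &= 0.3 < P_{thres}, \qquad P(C_3 \mid d_i) = 0.1 < P_{thres},\\
n_d &= 1 .
\end{aligned}
```

The lower bound on the number of labels follows because the largest of n values can never be smaller than their mean, so at least one category always meets the threshold and no text is left unlabelled.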
Compared with the prior art, the present invention draws on ideas from granular computing and adopts a multi-label classification method based on Bayesian theory, so that search results can be classified and integrated effectively. A visualization system designed around this method can display search results by category according to the user's needs, improving browsing efficiency and the browsing experience.
Description of drawings
Fig. 1 is a structural diagram of the present invention;
Fig. 2 is a schematic diagram of the classification algorithm of the present invention;
Fig. 3 is a flow chart of the Chinese and English multi-label classification method based on Bayesian theory used by the classification module of the present invention;
Fig. 4 is a flow chart of offline learning in the Chinese and English multi-label classification method based on Bayesian theory;
Fig. 5 is a flow chart of classification and online learning in the Chinese and English multi-label classification method based on Bayesian theory.
Detailed description
The present invention is described in detail below with reference to the drawings and a specific embodiment.
Embodiment
As shown in Fig. 1, a Chinese and English search result visualization system based on multi-label classification comprises a display module 1, a search module 2, a classification module 3 and a visualization module 4. The display module 1 displays the user interface and the search results. The search module 2 calls a search engine API with the user's query, obtains the search results, and consolidates the Chinese and the English results separately. The classification module 3 performs Chinese and English multi-label classification on the results obtained by the search module and integrates the classification results. The visualization module 4 renders the integrated classification results in a web user interface and outputs them through the display module 1.
The user first enters a search query in the display module 1. The search module 2 calls the search engine, obtains the search results, and consolidates the Chinese and the English results separately. The classification module 3 then applies the classification method described above to perform Chinese and English multi-label classification on the results obtained by the search module 2, and integrates the classification results. Finally, the visualization module 4 renders the integrated classification results in a web user interface and outputs them to the user through the display module 1. Struts 2 can be used as the MVC framework, with a combination of Apache Geronimo 2.x and Jetty 6 as the container, which meets the usage requirements while reducing software costs at deployment. The web front end uses AJAX to dynamically update the search results under the category the user selects. The algorithm of the whole system is shown in Fig. 2.
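The flow from query to categorized view can be sketched as three small functions. This is only an illustrative skeleton under stated assumptions: the search backends and per-language classifiers are passed in as plain callables (hypothetical stand-ins for a real search engine API and the trained classifier), and the "view model" returned at the end is an invented structure that a web front end could render.

```python
from collections import defaultdict


def search_module(query, engines):
    """Query every backend and pool the hits per language ("zh" / "en")."""
    pooled = defaultdict(list)
    for language, engine in engines:
        pooled[language].extend(engine(query))
    return dict(pooled)


def classification_module(pooled, classifiers):
    """Label each hit with its language's classifier, then merge by category."""
    by_category = defaultdict(list)
    for language, hits in pooled.items():
        for hit in hits:
            for label in classifiers[language](hit):
                by_category[label].append(hit)
    return dict(by_category)


def visualization_module(by_category):
    """Produce a simple view model listing the results under each category."""
    return [
        {"category": name, "count": len(hits), "results": hits}
        for name, hits in sorted(by_category.items())
    ]
```

Because a hit is appended under every label it receives, a multi-label result appears in each of its categories, and as long as each classifier returns at least one label no result is dropped during integration.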
The classification module of the present invention classifies using a Chinese and English multi-label classification method based on Bayesian theory. As shown in Fig. 3, the method comprises the following steps: 1) build the classification corpus; 2) the classifier performs offline learning on the classification corpus; 3) the classifier classifies the Chinese and the English search results separately, performing online learning at the same time; 4) the classification results are integrated.
To implement the method, Chinese and English multi-label classification corpora are needed for training the classifier. A Chinese news corpus is built first, and the existing English Reuters corpus is used. Taking a news corpus as the basis, a portion of the news texts was manually selected and annotated to build a multi-class, multi-label corpus of 9 categories: economy, military, sports, entertainment, science and technology, society, business, education and travel, 5084 texts in total. The corpus is imbalanced: the number of texts in each category was chosen with reference to the amount of information available for that category in real life.
The classifier is trained by combining offline learning and online learning. Offline learning is carried out on the news classification corpus, that is, a Bayesian multi-class multi-label classifier is trained. Then, while the system is running, the previously learned model is continually revised and improved as new search results (texts) arrive.
As shown in Fig. 4, offline learning comprises the following steps: A) traverse the training texts in the classification corpus; B) pre-process the training texts; C) scan the training texts, record the word frequency of each feature word, and add it to a HashMap; D) compute the conditional probability of each feature word from the word frequency statistics in the HashMap, and save the results to a file.
The HashMap stores the feature words of the training texts and their statistics. With a HashMap, looking up a feature word or modifying the statistics of a feature word can be completed in constant time complexity.
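The lookup and update pattern described here can be illustrated with a short sketch. It assumes Python's built-in dict, which is a hash table and so plays the role of the HashMap; the nested layout (feature word, then category, then frequency) and the helper names `record` and `lookup` are hypothetical choices for the example.

```python
from collections import defaultdict

# feature word -> {category: frequency}; both levels are hash tables
stats = defaultdict(lambda: defaultdict(int))


def record(word, category, n=1):
    """Add n occurrences of a feature word under a category."""
    stats[word][category] += n


def lookup(word, category):
    """Return the stored frequency, or 0 for an unseen word or category.

    Using .get avoids creating empty entries for words that were never recorded.
    """
    return stats.get(word, {}).get(category, 0)
```

Both operations are average-case constant time in the number of stored feature words, which is what keeps the online update during classification cheap.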
Online learning and classification are carried out at the same time. As shown in Fig. 5, the steps are: a) read the feature words and their statistics from the file generated by training, and add them to a HashMap; b) pre-process the unknown text to generate its set of feature words; c) traverse all the feature words and look up, in the HashMap built in step a), the conditional probability of each feature word for each category; d) from the conditional probabilities of the feature words for each category, compute the joint probability of the unknown text for every category; e) compute the probability threshold from all the joint probabilities obtained; f) assign to the unknown text every category label whose joint probability is not less than the probability threshold, and output the labels; g) in the HashMap, revise the conditional probabilities of the feature words of the unknown text under the categories given by the classification result; h) the classification process ends.
Here the probability threshold $P_{thres}$ is the arithmetic mean of the posterior probabilities of the unknown text $d_i$ over all known categories:
$$P_{thres} = \frac{1}{n} \sum_{j=1}^{n} P(C_j \mid d_i)$$
where $P(C_j \mid d_i)$ is the probability that the unknown text $d_i$ belongs to category $C_j$, and $n$ is the number of categories. If $P(C_j \mid d_i) \ge P_{thres}$, $d_i$ is given the label of category $C_j$. The number of labels $n_d$ assigned to $d_i$ satisfies $1 \le n_d \le n$.

Claims (7)

1. A Chinese and English search result visualization system based on multi-label classification, characterized in that the system comprises:
a display module, for displaying the user interface and the search results;
a search module, for calling a search engine API with the user's query, obtaining the search results, and consolidating the Chinese and the English results separately;
a classification module, for performing Chinese and English multi-label classification on the results obtained by the search module and integrating the classification results;
a visualization module, for rendering the integrated classification results in a web user interface and outputting them through the display module.
2. The Chinese and English search result visualization system based on multi-label classification according to claim 1, characterized in that the classification module comprises:
a classifier, for performing Chinese and English multi-label classification on the results obtained by the search module and integrating the classification results;
a classification corpus, which is an imbalanced corpus containing multi-label corpora for several categories and is used to train the classifier.
3. The Chinese and English search result visualization system based on multi-label classification according to claim 2, characterized in that the classification corpus comprises a Chinese classification corpus and an English classification corpus.
4. The Chinese and English search result visualization system based on multi-label classification according to claim 1, characterized in that the classification module classifies using a Chinese and English multi-label classification method based on Bayesian theory, the method comprising the following steps:
1) build the Chinese and English classification corpora;
2) the classifier performs offline learning on the classification corpus;
3) the classifier classifies the Chinese and the English search results separately, performing online learning at the same time;
4) the classification results are integrated.
5. The Chinese and English search result visualization system based on multi-label classification according to claim 4, characterized in that step 2) comprises the following steps:
A) traverse the training texts in the classification corpus;
B) pre-process the training texts;
C) scan the training texts, record the word frequency of each feature word, and add it to a HashMap;
D) compute the conditional probability of each feature word from the word frequency statistics in the HashMap, and save the results to a file.
6. The Chinese and English search result visualization system based on multi-label classification according to claim 4, characterized in that step 3) comprises the following steps:
a) read the feature words and their statistics from the file generated by training, and add them to a HashMap;
b) pre-process the unknown text to generate its set of feature words;
c) traverse all the feature words and look up, in the HashMap built in step a), the conditional probability of each feature word for each category;
d) from the conditional probabilities of the feature words for each category, compute the joint probability of the unknown text for every category;
e) compute the probability threshold from all the joint probabilities obtained;
f) assign to the unknown text every category label whose joint probability is not less than the probability threshold, and output the labels;
g) in the HashMap, revise the conditional probabilities of the feature words of the unknown text under the categories given by the classification result;
h) the classification process ends.
7. The Chinese and English search result visualization system based on multi-label classification according to claim 6, characterized in that the probability threshold $P_{thres}$ is the arithmetic mean of the posterior probabilities of the unknown text $d_i$ over all known categories:
$$P_{thres} = \frac{1}{n} \sum_{j=1}^{n} P(C_j \mid d_i)$$
where $P(C_j \mid d_i)$ is the probability that the unknown text $d_i$ belongs to category $C_j$, and $n$ is the number of categories. If $P(C_j \mid d_i) \ge P_{thres}$, $d_i$ is given the label of category $C_j$. The number of labels $n_d$ assigned to $d_i$ satisfies $1 \le n_d \le n$.
CN201110312662.9A 2011-10-16 2011-10-16 Chinese and English search result visualization system based on multi-label classification Expired - Fee Related CN103049454B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110312662.9A CN103049454B (en) 2011-10-16 2011-10-16 A kind of Chinese and English Search Results visualization system based on many labelings

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110312662.9A CN103049454B (en) 2011-10-16 2011-10-16 A kind of Chinese and English Search Results visualization system based on many labelings

Publications (2)

Publication Number Publication Date
CN103049454A (en) 2013-04-17
CN103049454B (en) 2016-04-20

Family

ID=48062097

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110312662.9A Expired - Fee Related CN103049454B (en) 2011-10-16 2011-10-16 Chinese and English search result visualization system based on multi-label classification

Country Status (1)

Country Link
CN (1) CN103049454B (en)

Also Published As

Publication number Publication date
CN103049454B (en) 2016-04-20

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160420

Termination date: 20171016