CN103049454A - Chinese and English search result visualization system based on multi-label classification - Google Patents
- Publication number
- CN103049454A
- Authority
- CN
- China
- Prior art keywords
- classification
- chinese
- english
- search
- search results
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a Chinese and English search result visualization system based on multi-label classification. The system comprises a display module, a search module, a classification module and a visualization module. The display module displays the user interface and the search results. The search module calls a search engine API with the user's query statement, obtains the search results, and integrates the Chinese and the English search results separately. The classification module performs Chinese and English multi-label classification on the results obtained by the search module and integrates the classification results. The visualization module renders the integrated classification results in a Web user interface and outputs them through the display module. Compared with the prior art, the system draws on granular computing theory within a multi-label classification method based on Bayesian theory, so that search results can be classified and integrated effectively. A visualization system designed around this method can display search results by category according to user needs without discarding any search result, which improves browsing efficiency and the browsing experience.
Description
Technical field
The present invention relates to the field of information technology, and in particular to a Chinese and English search result visualization system based on multi-label classification.
Background art
At present, the number of online electronic documents is growing rapidly, and a large number of documents are uploaded to the network every day. Search engines, as an important means of obtaining knowledge from the network, are used more and more widely. However, a search engine often returns a large number of search results, which tends to drown the user in an ocean of information. Current mainstream search engines return search results ranked according to the user's keywords. To find the information of interest, the user has to browse the search results one by one.
To address this problem, some researchers have begun to explore more advanced information retrieval methods. Generally there are two approaches. One is semantics-based information retrieval, which tries to use semantic analysis to understand the documents and the user's query statement. The other is based on machine learning: a model learned from historical data is used to classify or cluster the documents in the search results. The present invention concerns improving information retrieval results with the machine learning approach.
Web search result visualization is the process of presenting search results to the user in a clearer and better organized way, according to their content. Its purpose is to improve search efficiency and the user's browsing experience. For this task, most current research adopts text clustering techniques, that is, it treats visualization as an unsupervised classification problem. Following the usual pattern classification methodology, features are first extracted from the text to represent it, and the text is then assigned to the cluster with which it has the highest similarity. Search engines based on clustering include Vivisimo and Grokker. In this approach, the cluster names are usually generated automatically by the system from feature words. However, automatically derived cluster names often fail to express the main content of the cluster. The user therefore has difficulty locating the information of interest from the cluster names the system provides, and the visualization brings little benefit.
Unlike traditional pattern classification tasks, where each object corresponds to a single class label, in multi-label classification an object may be associated with several labels. For example, a document may be related to economics and at the same time to computers, so it belongs to both the economics and the computer categories. Multi-label classification originated from the needs of text categorization: each document in the training set is associated with a set of labels, and the task is to train a model of the relationship between documents and their known label sets, and then to output a label set for each document whose labels are unknown.
Summary of the invention
The purpose of the present invention is to overcome the defects of the prior art described above by providing a multi-label method for classifying Chinese and English search result information, and a Chinese and English search result visualization system that uses this method. Drawing on the ideas of granular computing, the system can display search results by category according to the user's needs, improving browsing efficiency and the browsing experience.
The purpose of the present invention can be achieved through the following technical solution:
A Chinese and English search result visualization system based on multi-label classification, the system comprising:
a display module, used for displaying the user interface and the search results; a search module, used for calling a search engine API with the user's query statement, obtaining the search results, and integrating the Chinese and the English search results separately; a classification module, used for performing Chinese and English multi-label classification on the results obtained by the search module and integrating the classification results; and a visualization module, used for rendering the integrated classification results in a Web user interface and outputting them through the display module.
The classification module comprises: a classifier, used for performing Chinese and English multi-label classification on the results obtained by the search module and integrating the classification results; and a classification corpus, which is an imbalanced corpus containing multi-label corpora of several categories and is used for training the classifier.
The classification corpus comprises a Chinese classification corpus and an English classification corpus.
The classification module classifies using a Chinese and English multi-label classification method based on Bayesian theory, which comprises the following steps:
1) build the Chinese and English classification corpora;
2) the classifier performs offline learning on the classification corpora;
3) the classifier classifies the Chinese and the English search results separately, performing online learning at the same time;
4) the classification results are integrated.
Step 2) comprises the following steps:
A) traverse the training texts in the classification corpus;
B) preprocess the training texts;
C) scan the training texts, record the word frequency of each feature word, and add it to a HashMap;
D) compute the conditional probability of each feature word from the word frequency statistics in the HashMap, and save the results to a file.
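Steps A) to D) can be sketched as follows. This is a minimal illustration, not the patented implementation: the whitespace tokenizer stands in for the unspecified preprocessing step, add-one smoothing is an assumed estimator, a Python dict plays the role of the HashMap, and the names `tokenize`, `train_offline` and `save_model` are hypothetical.

```python
import json
from collections import defaultdict

def tokenize(text):
    # Stand-in for step B); real Chinese text would need a word segmenter (assumption).
    return text.lower().split()

def train_offline(corpus):
    """corpus: list of (text, set_of_labels). Returns {category: {word: P(word | category)}}."""
    counts = defaultdict(lambda: defaultdict(int))   # category -> word -> frequency
    vocab = set()
    for text, labels in corpus:                      # step A): traverse the training texts
        for word in tokenize(text):                  # steps B) and C): preprocess and count
            vocab.add(word)
            for label in labels:                     # a multi-label text counts toward each label
                counts[label][word] += 1
    cond_prob = {}
    for label, word_counts in counts.items():        # step D): frequencies -> probabilities
        total = sum(word_counts.values())
        # Add-one smoothing is an assumption; the source does not name the estimator.
        cond_prob[label] = {
            w: (word_counts.get(w, 0) + 1) / (total + len(vocab)) for w in vocab
        }
    return cond_prob

def save_model(model, path):
    # Step D) also persists the result; JSON is an assumed file format.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(model, f)

corpus = [
    ("stock market bank", {"economy"}),
    ("chip market software", {"economy", "technology"}),
    ("software chip compiler", {"technology"}),
]
model = train_offline(corpus)
```

Calling `save_model(model, "model.json")` would write the table to disk so that the classification stage can reload it; the file name is only an example.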
Step 3) comprises the following steps:
A) read the feature words and their statistics from the file generated by the training process, and add them to a HashMap;
B) preprocess the unknown text and generate its set of feature words;
C) traverse all feature words and look up, in the HashMap built in step A), the conditional probability of each feature word for each category;
D) from the conditional probabilities of the feature words for each category, compute the joint probability of the unknown text for every category;
E) compute the probability threshold from all the joint probabilities obtained;
F) assign to the unknown text every category label whose joint probability is not less than the probability threshold, and output the labels;
G) in the HashMap, revise the conditional probabilities of the feature words of the unknown text for the categories given by the classification result;
H) the classification process ends.
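The lookup, joint-probability and thresholding steps can be sketched as below. This is an illustration under stated assumptions: the conditional probabilities and uniform priors are made-up values, words missing from the table are skipped, and the joint probabilities are normalized into posteriors before the mean is taken (normalization rescales every score and the threshold by the same factor, so the label decision is unchanged). The function name `classify` is hypothetical.

```python
import math

def classify(words, cond_prob, priors):
    """Return (labels, posteriors) for a tokenized text under a naive Bayes model."""
    log_joint = {}
    for category, word_probs in cond_prob.items():
        score = math.log(priors[category])
        for w in words:
            if w in word_probs:                  # unseen words are skipped (assumption)
                score += math.log(word_probs[w])
        log_joint[category] = score
    # Normalize into posteriors P(C_j | d_i); subtracting the max avoids underflow.
    top = max(log_joint.values())
    unnorm = {c: math.exp(s - top) for c, s in log_joint.items()}
    z = sum(unnorm.values())
    posterior = {c: v / z for c, v in unnorm.items()}
    # Threshold: arithmetic mean of the posteriors over all categories.
    threshold = sum(posterior.values()) / len(posterior)
    labels = [c for c, p in posterior.items() if p >= threshold]
    return labels, posterior

# Illustrative (assumed) conditional probabilities P(word | category).
cond_prob = {
    "economy":    {"market": 0.4, "bank": 0.3, "chip": 0.2, "goal": 0.1},
    "technology": {"market": 0.2, "bank": 0.1, "chip": 0.6, "goal": 0.1},
    "sports":     {"market": 0.1, "bank": 0.1, "chip": 0.1, "goal": 0.7},
}
priors = {"economy": 1 / 3, "technology": 1 / 3, "sports": 1 / 3}

labels, posterior = classify(["chip", "market"], cond_prob, priors)
```

With these values the text receives both the economy and the technology labels, while sports falls below the mean and is dropped.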
The probability threshold $P_{Thres}$ is the arithmetic mean of the posterior probabilities of the unknown text $d_i$ over all known categories:
$$P_{Thres} = \frac{1}{n} \sum_{j=1}^{n} P(C_j \mid d_i)$$
where $P(C_j \mid d_i)$ is the probability that the unknown text $d_i$ belongs to category $C_j$, and $n$ is the number of categories. If $P(C_j \mid d_i) \ge P_{Thres}$, $d_i$ is given the label of category $C_j$. The number of labels $n_d$ of $d_i$ satisfies $1 \le n_d \le n$.
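A short note on why the lower bound on the number of labels holds, added as an illustration: the threshold is the arithmetic mean of the posterior probabilities over the $n$ categories, and a mean can never exceed the largest of the values it averages, so at least one category always reaches the threshold. The numeric values in the second line are made up for the example.

```latex
\[
P_{Thres} = \frac{1}{n}\sum_{j=1}^{n} P(C_j \mid d_i)
\;\le\; \max_{1 \le j \le n} P(C_j \mid d_i)
\quad\Longrightarrow\quad n_d \ge 1 .
\]
% Illustrative values (assumed), n = 3:
\[
P(C_1 \mid d_i) = 0.5,\quad P(C_2 \mid d_i) = 0.4,\quad P(C_3 \mid d_i) = 0.1,
\qquad P_{Thres} = \frac{0.5 + 0.4 + 0.1}{3} = \frac{1}{3}.
\]
```

In this example $C_1$ and $C_2$ reach the threshold and $C_3$ does not, so $n_d = 2$.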
Compared with the prior art, the present invention draws on the ideas of granular computing. By adopting a multi-label classification method based on Bayesian theory, it can classify and integrate search results effectively. A visualization system designed with this method can display search results by category according to the user's needs, improving browsing efficiency and the browsing experience.
Description of drawings
Fig. 1 is a structural diagram of the present invention;
Fig. 2 is a schematic diagram of the classification algorithm of the present invention;
Fig. 3 is a flowchart of the Chinese and English multi-label classification method based on Bayesian theory used by the classification module of the present invention;
Fig. 4 is a flowchart of offline learning in the Chinese and English multi-label classification method based on Bayesian theory;
Fig. 5 is a flowchart of classification and online learning in the Chinese and English multi-label classification method based on Bayesian theory.
Detailed description
The present invention is described in detail below in conjunction with the drawings and specific embodiments.
Embodiment
As shown in Fig. 1, a Chinese and English search result visualization system based on multi-label classification comprises a display module 1, a search module 2, a classification module 3 and a visualization module 4. The display module 1 displays the user interface and the search results. The search module 2 calls a search engine API with the user's query statement, obtains the search results, and integrates the Chinese and the English search results separately. The classification module 3 performs Chinese and English multi-label classification on the results obtained by the search module and integrates the classification results. The visualization module 4 renders the integrated classification results in a Web user interface and outputs them through the display module 1.
First, the user enters a search query in the display module 1. The search module 2 calls the search engine, obtains the search results, and integrates the Chinese and the English search results separately. The classification module 3 then applies the information classification method described above to perform Chinese and English multi-label classification on the results obtained by the search module 2, and integrates the classification results. Finally, the visualization module 4 renders the integrated classification results in a Web user interface and outputs them to the user through the display module 1. Struts 2 can be used as the MVC framework, with a combination of Apache Geronimo 2.x and Jetty 6 as the container, which meets the usage requirements while reducing software costs at deployment. The web front end uses AJAX to dynamically update the search results under the category the user selects. The algorithm of the whole system is shown in Fig. 2.
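The integration of classified Chinese and English results into per-category groups might look like the following sketch. The source does not specify the data layout, so the record fields (`title`, `url`, `labels`), the function name `integrate` and the example URLs are all hypothetical. The point illustrated is that a result carrying several labels is listed under each of them and that no result is dropped.

```python
from collections import defaultdict

def integrate(zh_results, en_results):
    """Group multi-labeled results from both languages under each category."""
    by_category = defaultdict(list)
    for lang, results in (("zh", zh_results), ("en", en_results)):
        for item in results:
            for label in item["labels"]:         # one entry per assigned label
                by_category[label].append(
                    {"lang": lang, "title": item["title"], "url": item["url"]}
                )
    return dict(by_category)

# Hypothetical classified results; titles and URLs are placeholders.
zh_results = [
    {"title": "zh result 1", "url": "https://example.com/zh/1", "labels": ["economy"]},
    {"title": "zh result 2", "url": "https://example.com/zh/2", "labels": ["economy", "technology"]},
]
en_results = [
    {"title": "en result 1", "url": "https://example.com/en/1", "labels": ["economy", "technology"]},
    {"title": "en result 2", "url": "https://example.com/en/2", "labels": ["sports"]},
]

grouped = integrate(zh_results, en_results)
```

A structure like `grouped` could be serialized to JSON and returned to the page when the user selects a category, which is one way the AJAX update described above could be fed.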
The classification module of the present invention classifies using a Chinese and English multi-label classification method based on Bayesian theory. As shown in Fig. 3, the method comprises the following steps: 1) build the classification corpora; 2) the classifier performs offline learning on the classification corpora; 3) the classifier classifies the Chinese and the English search results separately, performing online learning at the same time; 4) the classification results are integrated.
To implement the method, Chinese and English multi-label classification corpora are needed. A Chinese news corpus is built first, and the existing English Reuters corpus is used, to train the classifier. Starting from a news corpus, part of the news texts were manually selected and annotated to build a multi-class, multi-label corpus of 9 categories. The corpus contains 5084 texts in total across the 9 categories: economy, military, sports, entertainment, science and technology, society, business, education and travel. It is an imbalanced corpus: the distribution of texts across categories was chosen with regard to the amount of information in each category in real life.
The classifier is trained by combining offline learning with online learning. A news classification corpus is built first and used for offline learning, that is, for training a Bayesian multi-class, multi-label classifier. Then, while the system is running, the previously learned model is continually revised and improved as new search results (texts) arrive.
Offline learning, shown in Fig. 4, comprises the following steps: A) traverse the training texts in the classification corpus; B) preprocess the training texts; C) scan the training texts, record the word frequency of each feature word, and add it to a HashMap; D) compute the conditional probability of each feature word from the word frequency statistics in the HashMap, and save the results to a file.
The HashMap stores the feature words of the training texts and their statistics. With a HashMap, looking up a feature word or modifying the statistics of a feature word can be done in constant time.
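A minimal sketch of such a table, with a Python dict standing in for the HashMap. The key layout `(word, category)` and the class and method names are assumptions made for illustration; the source only states that feature words and their statistics are stored. Both lookup and update are single hash operations, which take constant time on average.

```python
class FeatureTable:
    """Feature-word statistics keyed by (word, category); a dict stands in for the HashMap."""

    def __init__(self):
        self.freq = {}      # (word, category) -> word frequency
        self.total = {}     # category -> total number of word occurrences

    def add(self, word, category, n=1):
        # One hash lookup and one store per table: O(1) on average.
        key = (word, category)
        self.freq[key] = self.freq.get(key, 0) + n
        self.total[category] = self.total.get(category, 0) + n

    def frequency(self, word, category):
        return self.freq.get((word, category), 0)

    def relative_frequency(self, word, category):
        # Unsmoothed estimate of P(word | category); returns 0.0 for an empty category.
        total = self.total.get(category, 0)
        return self.frequency(word, category) / total if total else 0.0

table = FeatureTable()
for word in ["market", "bank", "market"]:
    table.add(word, "economy")
table.add("chip", "technology")
```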
Online learning and classification are carried out at the same time. As shown in Fig. 5, the steps are: a) read the feature words and their statistics from the file generated by the training process, and add them to a HashMap; b) preprocess the unknown text and generate its set of feature words; c) traverse all feature words and look up, in the HashMap built in step a), the conditional probability of each feature word for each category; d) from the conditional probabilities of the feature words for each category, compute the joint probability of the unknown text for every category; e) compute the probability threshold from all the joint probabilities obtained; f) assign to the unknown text every category label whose joint probability is not less than the probability threshold, and output the labels; g) in the HashMap, revise the conditional probabilities of the feature words of the unknown text for the categories given by the classification result; h) the classification process ends.
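The source does not say how the conditional probabilities are revised in step g). One plausible reading, sketched here as an assumption, is to add the word counts of the newly classified text to every category it was assigned and to recompute probabilities from the updated counts. Add-one smoothing, the fixed vocabulary size and the function names are likewise assumptions.

```python
def conditional_probability(counts, totals, vocab_size, word, category):
    # Add-one smoothing is an assumption; the source does not name an estimator.
    return (counts[category].get(word, 0) + 1) / (totals[category] + vocab_size)

def online_update(counts, totals, words, labels):
    """Fold a newly classified text into the statistics of each assigned category."""
    for category in labels:
        for word in words:
            counts[category][word] = counts[category].get(word, 0) + 1
        totals[category] += len(words)

# Illustrative statistics left by offline learning (assumed values).
counts = {"economy": {"market": 2, "bank": 1}, "technology": {"chip": 3}}
totals = {"economy": 3, "technology": 3}
vocab_size = 3   # market, bank, chip

before = conditional_probability(counts, totals, vocab_size, "chip", "economy")
# A new text with the words "market" and "chip" was labeled "economy" by the classifier.
online_update(counts, totals, ["market", "chip"], ["economy"])
after = conditional_probability(counts, totals, vocab_size, "chip", "economy")
```

Only the categories assigned to the text change, so statistics for the other categories stay as they were. Because the labels come from the classifier itself, an update of this kind can reinforce earlier mistakes; that trade-off applies to this sketch and is not a claim about the patented system.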
Here, the probability threshold $P_{Thres}$ is the arithmetic mean of the posterior probabilities of the unknown text $d_i$ over all known categories:
$$P_{Thres} = \frac{1}{n} \sum_{j=1}^{n} P(C_j \mid d_i)$$
where $P(C_j \mid d_i)$ is the probability that the unknown text $d_i$ belongs to category $C_j$, and $n$ is the number of categories. If $P(C_j \mid d_i) \ge P_{Thres}$, $d_i$ is given the label of category $C_j$. The number of labels $n_d$ of $d_i$ satisfies $1 \le n_d \le n$.
Claims (7)
1. A Chinese and English search result visualization system based on multi-label classification, characterized in that the system comprises:
a display module, used for displaying the user interface and the search results;
a search module, used for calling a search engine API with the user's query statement, obtaining the search results, and integrating the Chinese and the English search results separately;
a classification module, used for performing Chinese and English multi-label classification on the results obtained by the search module and integrating the classification results;
a visualization module, used for rendering the integrated classification results in a Web user interface and outputting them through the display module.
2. The Chinese and English search result visualization system based on multi-label classification according to claim 1, characterized in that the classification module comprises:
a classifier, used for performing Chinese and English multi-label classification on the results obtained by the search module and integrating the classification results;
a classification corpus, which is an imbalanced corpus containing multi-label corpora of several categories and is used for training the classifier.
3. The Chinese and English search result visualization system based on multi-label classification according to claim 2, characterized in that the classification corpus comprises a Chinese classification corpus and an English classification corpus.
4. The Chinese and English search result visualization system based on multi-label classification according to claim 1, characterized in that the classification module classifies using a Chinese and English multi-label classification method based on Bayesian theory, the method comprising the following steps:
1) build the Chinese and English classification corpora;
2) the classifier performs offline learning on the classification corpora;
3) the classifier classifies the Chinese and the English search results separately, performing online learning at the same time;
4) the classification results are integrated.
5. The Chinese and English search result visualization system based on multi-label classification according to claim 4, characterized in that step 2) comprises the following steps:
A) traverse the training texts in the classification corpus;
B) preprocess the training texts;
C) scan the training texts, record the word frequency of each feature word, and add it to a HashMap;
D) compute the conditional probability of each feature word from the word frequency statistics in the HashMap, and save the results to a file.
6. The Chinese and English search result visualization system based on multi-label classification according to claim 4, characterized in that step 3) comprises the following steps:
A) read the feature words and their statistics from the file generated by the training process, and add them to a HashMap;
B) preprocess the unknown text and generate its set of feature words;
C) traverse all feature words and look up, in the HashMap built in step A), the conditional probability of each feature word for each category;
D) from the conditional probabilities of the feature words for each category, compute the joint probability of the unknown text for every category;
E) compute the probability threshold from all the joint probabilities obtained;
F) assign to the unknown text every category label whose joint probability is not less than the probability threshold, and output the labels;
G) in the HashMap, revise the conditional probabilities of the feature words of the unknown text for the categories given by the classification result;
H) the classification process ends.
7. The Chinese and English search result visualization system based on multi-label classification according to claim 6, characterized in that the probability threshold $P_{Thres}$ is the arithmetic mean of the posterior probabilities of the unknown text $d_i$ over all known categories:
$$P_{Thres} = \frac{1}{n} \sum_{j=1}^{n} P(C_j \mid d_i)$$
where $P(C_j \mid d_i)$ is the probability that the unknown text $d_i$ belongs to category $C_j$, and $n$ is the number of categories; if $P(C_j \mid d_i) \ge P_{Thres}$, $d_i$ is given the label of category $C_j$; the number of labels $n_d$ of $d_i$ satisfies $1 \le n_d \le n$.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110312662.9A CN103049454B (en) | 2011-10-16 | 2011-10-16 | Chinese and English search result visualization system based on multi-label classification |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110312662.9A CN103049454B (en) | 2011-10-16 | 2011-10-16 | Chinese and English search result visualization system based on multi-label classification |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103049454A (en) | 2013-04-17 |
CN103049454B (en) | 2016-04-20 |
Family
ID=48062097
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110312662.9A Expired - Fee Related CN103049454B (en) | 2011-10-16 | 2011-10-16 | Chinese and English search result visualization system based on multi-label classification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103049454B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108287911A (en) * | 2018-02-01 | 2018-07-17 | 浙江大学 | Relation extraction method based on constrained remote supervision |
CN108287848A (en) * | 2017-01-10 | 2018-07-17 | 中国移动通信集团贵州有限公司 | Method and system for semanteme parsing |
CN109388479A (en) * | 2018-11-01 | 2019-02-26 | 郑州云海信息技术有限公司 | The output method and device of deep learning data in mxnet system |
CN110633365A (en) * | 2019-07-25 | 2019-12-31 | 北京国信利斯特科技有限公司 | Word vector-based hierarchical multi-label text classification method and system |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106611015B (en) * | 2015-10-27 | 2020-08-28 | 北京百度网讯科技有限公司 | Label processing method and device |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101079048A (en) * | 2006-05-24 | 2007-11-28 | 上海万纬信息技术有限公司 | Internet information search engine and method based on software robot exclusion standard |
CN101763424A (en) * | 2009-12-14 | 2010-06-30 | 刘二中 | Method for determining characteristic words and searching according to file content |
CN101784022A (en) * | 2009-01-16 | 2010-07-21 | 北京炎黄新星网络科技有限公司 | Method and system for filtering and classifying short messages |
CN101903878A (en) * | 2007-10-11 | 2010-12-01 | 谷歌公司 | Methods and systems for classifying search results to determine page elements |
CN101908071A (en) * | 2010-08-10 | 2010-12-08 | 厦门市美亚柏科信息股份有限公司 | Method and device thereof for improving search efficiency of search engine |
CN101963966A (en) * | 2009-07-24 | 2011-02-02 | 李占胜 | Method for sorting search results by adding labels into search results |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101079048A (en) * | 2006-05-24 | 2007-11-28 | 上海万纬信息技术有限公司 | Internet information search engine and method based on software robot exclusion standard |
CN101903878A (en) * | 2007-10-11 | 2010-12-01 | 谷歌公司 | Methods and systems for classifying search results to determine page elements |
CN101784022A (en) * | 2009-01-16 | 2010-07-21 | 北京炎黄新星网络科技有限公司 | Method and system for filtering and classifying short messages |
CN101963966A (en) * | 2009-07-24 | 2011-02-02 | 李占胜 | Method for sorting search results by adding labels into search results |
CN101763424A (en) * | 2009-12-14 | 2010-06-30 | 刘二中 | Method for determining characteristic words and searching according to file content |
CN101908071A (en) * | 2010-08-10 | 2010-12-08 | 厦门市美亚柏科信息股份有限公司 | Method and device thereof for improving search efficiency of search engine |
Non-Patent Citations (1)
Title |
---|
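Zhao Yi (赵毅): "Research and Development of a Spam Filtering System Based on the Bayesian Algorithm" (《基于贝叶斯算法的垃圾邮件过滤系统的研究与开发》), China Master's Theses Full-text Database (Information Science and Technology) * |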
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108287848A (en) * | 2017-01-10 | 2018-07-17 | 中国移动通信集团贵州有限公司 | Method and system for semanteme parsing |
CN108287911A (en) * | 2018-02-01 | 2018-07-17 | 浙江大学 | Relation extraction method based on constrained remote supervision |
CN108287911B (en) * | 2018-02-01 | 2020-04-24 | 浙江大学 | Relation extraction method based on constrained remote supervision |
CN109388479A (en) * | 2018-11-01 | 2019-02-26 | 郑州云海信息技术有限公司 | The output method and device of deep learning data in mxnet system |
CN110633365A (en) * | 2019-07-25 | 2019-12-31 | 北京国信利斯特科技有限公司 | Word vector-based hierarchical multi-label text classification method and system |
Also Published As
Publication number | Publication date |
---|---|
CN103049454B (en) | 2016-04-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11580104B2 (en) | Method, apparatus, device, and storage medium for intention recommendation | |
CN111428053B (en) | Construction method of tax field-oriented knowledge graph | |
Weismayer et al. | Identifying emerging research fields: a longitudinal latent semantic keyword analysis | |
CN101794311B (en) | Fuzzy data mining based automatic classification method of Chinese web pages | |
Do et al. | Multiview deep learning for predicting twitter users' location | |
US20090307213A1 (en) | Suffix Tree Similarity Measure for Document Clustering | |
CN106447066A (en) | Big data feature extraction method and device | |
CN102495892A (en) | Webpage information extraction method | |
CN110059181A (en) | Short text stamp methods, system, device towards extensive classification system | |
CN105183833A (en) | User model based microblogging text recommendation method and recommendation apparatus thereof | |
CN104965905A (en) | Web page classifying method and apparatus | |
CN104484431A (en) | Multi-source individualized news webpage recommending method based on field body | |
CN105740468A (en) | Individuation recommendation method and system combined with content publisher information | |
CN103049454B (en) | Chinese and English search result visualization system based on multi-label classification | |
CN102428467A (en) | Similarity-Based Feature Set Supplementation For Classification | |
CN103886020A (en) | Quick search method of real estate information | |
CN103778206A (en) | Method for providing network service resources | |
CN111754208A (en) | Automatic screening method for recruitment resumes | |
CN108876643A (en) | It is a kind of social activity plan exhibition network on acquire(Pin)Multimodal presentation method | |
Aung et al. | Random forest classifier for multi-category classification of web pages | |
CN103761246A (en) | Link network based user domain identifying method and device | |
Alsammak et al. | An enhanced performance of K-nearest neighbor (K-NN) classifier to meet new big data necessities | |
Anastasopoulos et al. | Computational text analysis for public management research: An annotated application to county budgets | |
Wu et al. | An unsupervised framework for extracting multilane roads from OpenStreetMap | |
CN113239179A (en) | Scientific research technology interest field recognition model training method, scientific and technological resource query method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20160420; Termination date: 20171016 |