CN103049454B - Chinese and English search result visualization system based on multi-label classification - Google Patents

Chinese and English search result visualization system based on multi-label classification

Info

Publication number
CN103049454B
CN103049454B CN201110312662.9A CN201110312662A CN103049454B CN 103049454 B CN103049454 B CN 103049454B CN 201110312662 A CN201110312662 A CN 201110312662A CN 103049454 B CN103049454 B CN 103049454B
Authority
CN
China
Prior art keywords
classification
chinese
english
search results
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201110312662.9A
Other languages
Chinese (zh)
Other versions
CN103049454A (en
Inventor
卫志华
苗夺谦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University
Priority to CN201110312662.9A
Publication of CN103049454A
Application granted
Publication of CN103049454B
Expired - Fee Related
Anticipated expiration

Abstract

The present invention relates to a Chinese and English search result visualization system based on multi-label classification. The system comprises: a display module for displaying the user interface and the search results; a search module for calling a search engine API according to the user's query, obtaining the search results, and integrating the Chinese and English search results separately; a classification module for performing Chinese and English multi-label classification on the results obtained by the search module and integrating the classification results; and a visualization module for implementing the web user interface design for the integrated classification results, which are output through the display module. Compared with the prior art, the invention draws on the ideas of granular computing and adopts a multi-label classification method based on Bayesian theory, so that search results can be effectively classified with multiple labels and integrated. A visualization system designed with this method can display search results by category according to the user's needs while losing as few search results as possible, which improves the user's browsing efficiency and browsing experience.

Description

Chinese and English search result visualization system based on multi-label classification
Technical field
The present invention relates to the field of information technology, and in particular to a Chinese and English search result visualization system based on multi-label classification.
Background
At present, the number of online electronic documents is growing rapidly, and a large number of documents are uploaded to the Internet every day. Search engines, as an important means of obtaining knowledge from the network, are being used more and more widely. However, a search engine often returns a large number of results, so that users are frequently overwhelmed by information. Current mainstream search engines return results ranked according to the user's keywords, and to find the information of interest the user has to browse the results one by one.
To address this problem, researchers have begun to explore more advanced information retrieval methods. Broadly, there are two approaches. One is semantics-based information retrieval, which tries to use semantic analysis techniques to understand both the documents and the user's query. The other is based on machine learning, which uses a model learned from historical data to classify or cluster the documents in the search results. The present invention is concerned with improving information retrieval results using machine learning methods.
Visualization of web search results refers to the process of presenting search results to the user in a clearer and better organized way according to their content. Its purpose is to improve search efficiency and the user's browsing experience. For this task, most current research adopts techniques based on text clustering and treats visualization as an unsupervised classification problem. Following the usual pattern classification approach, features are first extracted from the text to represent it, and the text is then assigned to the cluster with which it has the highest similarity. Search engines based on clustering include Vivisimo and Groker. In this approach, the cluster names are usually generated automatically by the system from feature words. However, such automatically obtained cluster names often fail to express the main content of the cluster. As a result, users find it hard to locate the information they are interested in from the cluster names given by the system, and the visualization is of limited benefit.
Unlike traditional pattern classification tasks, in which each object corresponds to a single class label, in multi-label classification an object may be associated with several labels. For example, a document may be related to economics and at the same time to computers, so it belongs to both the economy and the computer categories. Multi-label classification originated from the needs of text categorization: every document in the training set is associated with a set of labels, and the task is to train a model of the relationship between documents and their known label sets and then use it to output a label set for each document whose labels are unknown.
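As a small illustration of the difference between single-label and multi-label data, the following Python sketch stores a set of labels with each document. The documents, label names and helper functions are invented for illustration and are not taken from the patent's corpus.

```python
# Hypothetical toy data: each training document maps to a SET of labels,
# not to a single class label as in traditional classification.
training_set = [
    ("central bank raises interest rates", {"economy"}),
    ("chip maker reports record quarterly profit", {"economy", "computer"}),
    ("new compiler release speeds up builds", {"computer"}),
]


def label_cardinality(dataset):
    """Average number of labels per document."""
    return sum(len(labels) for _, labels in dataset) / len(dataset)


def documents_with_label(dataset, label):
    """Return the texts of all documents whose label set contains `label`."""
    return [text for text, labels in dataset if label in labels]
```

The second document appears under both "economy" and "computer", which is exactly the situation a single-label classifier cannot represent.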
Summary of the invention
The object of the present invention is to overcome the above shortcomings of the prior art by providing a multi-label-based method for classifying Chinese and English search result information, together with a Chinese and English search result visualization system that applies this method. Drawing on the ideas of granular computing, the system can display search results by category according to the user's needs, improving the user's browsing efficiency and browsing experience.
The object of the present invention can be achieved through the following technical solution:
A Chinese and English search result visualization system based on multi-label classification, the system comprising:
a display module for displaying the user interface and the search results; a search module for calling a search engine API according to the user's query, obtaining the search results, and integrating the Chinese and English search results separately; a classification module for performing Chinese and English multi-label classification on the results obtained by the search module and integrating the classification results; and a visualization module for implementing the web user interface design for the integrated classification results, which are output through the display module.
The classification module comprises: a classifier for performing Chinese and English multi-label classification on the results obtained by the search module and integrating the classification results; and a classification corpus, which is an unbalanced corpus comprising multi-label corpora of several categories and is used to train the classifier.
The classification corpus comprises a Chinese classification corpus and an English classification corpus.
The classification module classifies using a Chinese and English multi-label classification method based on Bayesian theory, which comprises the following steps:
1) build the Chinese and English classification corpora;
2) the classifier performs offline learning on the classification corpus;
3) the classifier classifies the Chinese and English search results separately while performing online learning;
4) the classification results are integrated.
Step 2) comprises the following steps:
A) traverse the training texts in the classification corpus;
B) preprocess the training texts;
C) scan the training texts, record the word frequency of each feature word, and add it to a HashMap;
D) compute the conditional probability of each feature word from the word frequency statistics in the HashMap, and save the results to a file.
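Steps A) to D) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the tokenizer, the add-one (Laplace) smoothing, the choice to count a word once for every label of its document, and the JSON file format are all assumptions, since the text does not specify them. A Python dict plays the role of the HashMap. Chinese text would additionally need a word segmenter, which is not shown.

```python
import json
import re
from collections import defaultdict


def preprocess(text):
    """Assumed preprocessing: lowercase and split on non-word characters."""
    return [w for w in re.split(r"\W+", text.lower()) if w]


def train_offline(corpus):
    """corpus: iterable of (text, set_of_labels) pairs.

    Returns (counts, cond_prob), where counts[word][category] is the word
    frequency and cond_prob[word][category] estimates P(word | category).
    """
    counts = defaultdict(lambda: defaultdict(int))  # plays the role of the HashMap
    totals = defaultdict(int)                       # total words seen per category
    for text, labels in corpus:                     # step A: traverse training texts
        for word in preprocess(text):               # step B: preprocess
            for category in labels:                 # step C: record word frequencies
                counts[word][category] += 1
                totals[category] += 1
    vocab_size = len(counts)
    cond_prob = {}
    for word, per_category in counts.items():       # step D: conditional probabilities
        cond_prob[word] = {
            c: (per_category.get(c, 0) + 1) / (totals[c] + vocab_size)  # add-one smoothing (assumed)
            for c in totals
        }
    return counts, cond_prob


def save_model(cond_prob, path):
    """Persist the conditional probabilities (JSON is an assumed format)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cond_prob, f, ensure_ascii=False)
```

With add-one smoothing, the estimates for each category sum to 1 over the vocabulary, and a word never seen in a category still receives a small non-zero probability.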
Step 3) comprises the following steps:
a) read the feature words and their statistics from the file generated by the training process, and add them to a HashMap;
b) preprocess the unknown text to generate its set of feature words;
c) traverse all feature words and look up, in the HashMap built in step a), the conditional probability of each feature word for each category;
d) from the conditional probabilities of the feature words for each category, compute the joint probability of the unknown text for every category;
e) compute the probability threshold from all the joint probabilities obtained;
f) assign to the unknown text every category label whose joint probability is not less than the probability threshold, and output the labels;
g) in the HashMap, update the conditional probabilities of the feature words of this unknown text for the categories given by the classification result;
h) the classification process ends.
Described probability threshold value P thresfor unknown text d iarithmetical mean for the posterior probability of all known class:
P thres = 1 n Σ j = 1 n P ( C j | d i )
P (C j| d i) be unknown text d ibelong to certain classification C jprobability, n is classification number, if P (C j| d i)>=P thres, d igive classification C jlabel, d inumber of labels n dmeet 1≤n d≤ n.
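To make the threshold rule concrete, here is a short Python sketch that applies it to one set of posterior probabilities. The function name and the example numbers are hypothetical and chosen only for illustration.

```python
def assign_labels(posteriors):
    """posteriors: dict mapping category name -> P(C_j | d_i).

    Returns (threshold, labels), where threshold is the arithmetic mean of
    the posteriors and labels are the categories at or above that mean.
    """
    threshold = sum(posteriors.values()) / len(posteriors)
    labels = [c for c, p in posteriors.items() if p >= threshold]
    return threshold, labels


# Hypothetical posteriors for one search result over four categories.
example = {"economy": 0.5, "technology": 0.25, "sports": 0.125, "travel": 0.125}
threshold, labels = assign_labels(example)
# threshold is 0.25; "economy" and "technology" are kept
```

Because the largest posterior can never be below the mean, at least one label is always assigned, which matches the bound $1 \le n_d$. If the posteriors are normalized to sum to 1, the threshold is simply $1/n$.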
Compared with the prior art, the invention draws on the ideas of granular computing and adopts a multi-label classification method based on Bayesian theory, so that search results can be effectively classified and integrated. A visualization system designed with this method displays search results by category according to the user's needs, improving the user's browsing efficiency and browsing experience.
Brief description of the drawings
Fig. 1 is a schematic structural diagram of the present invention;
Fig. 2 is a schematic diagram of the classification algorithm of the present invention;
Fig. 3 is a flowchart of the Bayesian Chinese and English multi-label classification method used by the classification module of the present invention;
Fig. 4 is a flowchart of offline learning in the Bayesian Chinese and English multi-label classification method;
Fig. 5 is a flowchart of classification and online learning in the Bayesian Chinese and English multi-label classification method.
Detailed description
The present invention is described in detail below with reference to the drawings and a specific embodiment.
Embodiment
As shown in Fig. 1, a Chinese and English search result visualization system based on multi-label classification comprises a display module 1, a search module 2, a classification module 3 and a visualization module 4. The display module 1 displays the user interface and the search results; the search module 2 calls a search engine API according to the user's query, obtains the search results, and integrates the Chinese and English search results separately; the classification module 3 performs Chinese and English multi-label classification on the results obtained by the search module and integrates the classification results; the visualization module 4 implements the web user interface design for the integrated classification results, which are output through the display module 1.
First, the user enters a search query in the display module 1. The search module 2 calls the search engine, obtains the search results, and integrates the Chinese and English results separately. The classification module 3 then applies the above information classification method to perform Chinese and English multi-label classification on the results obtained by the search module 2, and integrates the classification results. Finally, the visualization module 4 produces the web user interface for the integrated classification results, which is output to the user through the display module 1. Struts2 can be used as the framework for the MVC view, with Apache Geronimo 2.x + Jetty 6 as the container combination, which meets the usage requirements while reducing the software overhead at deployment. AJAX is used in the web front end to dynamically update the search results under the category selected by the user. The algorithm of the whole system is shown in Fig. 2.
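The data flow between the modules can be sketched in a few lines of Python. Everything here is a stand-in: `fake_search` replaces the real search engine API call, `fake_classify` replaces the Bayesian multi-label classifier with keyword rules, and the canned results are invented. The sketch only shows how results from both languages are pooled and then grouped by label.

```python
from collections import defaultdict


def fake_search(query, language):
    """Stand-in for a search engine API call (hypothetical; returns canned results)."""
    canned = {
        "zh": [{"title": "苹果发布新手机", "lang": "zh"}],
        "en": [{"title": "Apple stock rises after earnings", "lang": "en"}],
    }
    return canned[language]


def fake_classify(result):
    """Stand-in for the multi-label classifier (hypothetical keyword rules)."""
    labels = set()
    if "stock" in result["title"] or "earnings" in result["title"]:
        labels.add("economy")
    if "手机" in result["title"] or "Apple" in result["title"]:
        labels.add("technology")
    return labels


def run_pipeline(query):
    # Search module: query each language separately, then pool the results.
    results = fake_search(query, "zh") + fake_search(query, "en")
    # Classification module: label every result, then integrate by category.
    by_category = defaultdict(list)
    for result in results:
        for label in fake_classify(result):
            by_category[label].append(result["title"])
    # The visualization and display modules would render this mapping.
    return dict(by_category)
```

A result that receives several labels is listed under each of its categories, so grouping by category does not drop any search result.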
The classification module of the present invention classifies using a Chinese and English multi-label classification method based on Bayesian theory. As shown in Fig. 3, the method comprises the following steps: 1) build the classification corpus; 2) the classifier performs offline learning on the classification corpus; 3) the classifier classifies the Chinese and English search results separately while performing online learning; 4) the classification results are integrated.
Implementing the method requires Chinese and English multi-label classification corpora. A Chinese news corpus is first built, and the existing English Reuters corpus is used, to train the classifier. Starting from the news corpus, part of the news texts were manually selected and annotated to construct a multi-class multi-label corpus with 9 categories. The corpus covers the 9 categories economy, military, sports, entertainment, science and technology, society, business, education and travel, with 5084 texts in total. It is an unbalanced corpus: the distribution of texts across categories was chosen to reflect the amount of information in each category in real life.
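One simple way to inspect how unbalanced a multi-label corpus is, sketched in Python below. The helper names are hypothetical, and the patent does not publish its per-category counts, so no real figures are shown.

```python
from collections import Counter


def category_distribution(corpus):
    """corpus: iterable of (text, set_of_labels). Counts documents per category."""
    counts = Counter()
    for _, labels in corpus:
        counts.update(labels)  # a document counts once for each of its labels
    return counts


def imbalance_ratio(counts):
    """Size of the largest category divided by the size of the smallest."""
    return max(counts.values()) / min(counts.values())
```

Because a document may carry several labels, the per-category counts can add up to more than the number of texts in the corpus.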
The classifier is trained by combining offline and online learning. A news classification corpus is first built and used for offline learning, that is, to train the Bayesian multi-class multi-label classifier. Then, whenever new search results (texts) arrive while the system is running, the previously learned model is continually revised and improved.
The offline learning steps, shown in Fig. 4, are: A) traverse the training texts in the classification corpus; B) preprocess the training texts; C) scan the training texts, record the word frequency of each feature word, and add it to a HashMap; D) compute the conditional probability of each feature word from the word frequency statistics in the HashMap, and save the results to a file.
The HashMap stores the feature words of the training texts and their statistics. With a HashMap, looking up a feature word or modifying the statistics of a feature word can be completed in a fixed (on average constant) time complexity.
The online learning and the classification performed by the classifier take place at the same time. The steps, shown in Fig. 5, are: a) read the feature words and their statistics from the file generated by the training process, and add them to a HashMap; b) preprocess the unknown text to generate its set of feature words; c) traverse all feature words and look up, in the HashMap built in step a), the conditional probability of each feature word for each category; d) from the conditional probabilities of the feature words for each category, compute the joint probability of the unknown text for every category; e) compute the probability threshold from all the joint probabilities obtained; f) assign to the unknown text every category label whose joint probability is not less than the probability threshold, and output the labels; g) in the HashMap, update the conditional probabilities of the feature words of this unknown text for the categories given by the classification result; h) the classification process ends.
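Steps d) to g) can be sketched in Python as below. This is a sketch under several assumptions that the text does not state: a uniform category prior, add-one smoothing, log-space arithmetic to avoid underflow, and an online update that increments the stored word counts (from which the conditional probabilities are recomputed) rather than editing probabilities directly. Plain dicts stand in for the HashMap.

```python
import math


def classify_and_update(words, counts, totals, vocab_size):
    """words: feature words of one unknown text.

    counts[word][category] and totals[category] are the word-frequency
    statistics from offline learning; both are updated in place.
    Returns (labels, posteriors).
    """
    categories = list(totals)
    # Step d: log joint probability per category (uniform prior assumed).
    log_joint = {}
    for c in categories:
        log_joint[c] = sum(
            math.log((counts.get(w, {}).get(c, 0) + 1) / (totals[c] + vocab_size))
            for w in words
        )
    # Normalize to posteriors P(C_j | d_i); subtract the max for stability.
    peak = max(log_joint.values())
    unnorm = {c: math.exp(v - peak) for c, v in log_joint.items()}
    z = sum(unnorm.values())
    posteriors = {c: v / z for c, v in unnorm.items()}
    # Steps e-f: the threshold is the mean posterior; keep categories at or above it.
    threshold = sum(posteriors.values()) / len(posteriors)
    labels = [c for c in categories if posteriors[c] >= threshold]
    # Step g: fold the newly labelled text back into the statistics.
    for c in labels:
        for w in words:
            counts.setdefault(w, {})
            counts[w][c] = counts[w].get(c, 0) + 1
            totals[c] += 1
    return labels, posteriors
```

Each classified search result therefore nudges the statistics of the categories it was assigned to, so later results are scored against a model that reflects the texts seen so far.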
Wherein, probability threshold value P thresfor unknown text d iarithmetical mean for the posterior probability of all known class:
P thres = 1 n Σ j = 1 n P ( C j | d i )
P (C j| d i) be unknown text d ibelong to certain classification C jprobability, n is classification number, if P (C j| d i)>=P thres, d igive classification C jlabel, d inumber of labels n dmeet 1≤n d≤ n.

Claims (5)

1. A Chinese and English search result visualization system based on multi-label classification, characterized in that the system comprises:
a display module for displaying the user interface and the search results;
a search module for calling a search engine API according to the user's query, obtaining the search results, and integrating the Chinese and English search results separately;
a classification module for performing Chinese and English multi-label classification on the results obtained by the search module and integrating the classification results;
a visualization module for implementing the web user interface design for the integrated classification results, which are output through the display module, the visualization module using Struts2 as the framework for the MVC view and Apache Geronimo 2.x + Jetty 6 as the container combination;
the classification module classifies using a Chinese and English multi-label classification method based on Bayesian theory, which comprises the following steps:
1) build the Chinese and English classification corpora, the Chinese and English classification corpora being multi-class multi-label corpora;
2) the classifier performs offline learning on the classification corpus;
3) the classifier classifies the Chinese and English search results separately while performing online learning;
4) the classification results are integrated.
2. The Chinese and English search result visualization system based on multi-label classification according to claim 1, characterized in that the classification module comprises:
a classifier for performing Chinese and English multi-label classification on the results obtained by the search module and integrating the classification results;
a classification corpus, which is an unbalanced corpus comprising multi-label corpora of several categories and is used to train the classifier.
3. The Chinese and English search result visualization system based on multi-label classification according to claim 2, characterized in that the classification corpus comprises a Chinese classification corpus and an English classification corpus.
4. The Chinese and English search result visualization system based on multi-label classification according to claim 1, characterized in that step 3) comprises the following steps:
a) read the feature words and their statistics from the file generated by the training process, and add them to a HashMap;
b) preprocess the unknown text to generate its set of feature words;
c) traverse all feature words and look up, in the HashMap built in step a), the conditional probability of each feature word for each category;
d) from the conditional probabilities of the feature words for each category, compute the joint probability of the unknown text for every category;
e) compute the probability threshold from all the joint probabilities obtained;
f) assign to the unknown text every category label whose joint probability is not less than the probability threshold, and output the labels;
g) in the HashMap, update the conditional probabilities of the feature words of this unknown text for the categories given by the classification result;
h) the classification process ends.
5. The Chinese and English search result visualization system based on multi-label classification according to claim 4, characterized in that the probability threshold $P_{thres}$ is the arithmetic mean of the posterior probabilities of the unknown text $d_i$ over all known categories:
$$P_{thres} = \frac{1}{n} \sum_{j=1}^{n} P(C_j \mid d_i)$$
where $P(C_j \mid d_i)$ is the probability that the unknown text $d_i$ belongs to category $C_j$, and $n$ is the number of categories. If $P(C_j \mid d_i) \ge P_{thres}$, $d_i$ is given the label of category $C_j$. The number of labels $n_d$ of $d_i$ satisfies $1 \le n_d \le n$.
CN201110312662.9A 2011-10-16 2011-10-16 Chinese and English search result visualization system based on multi-label classification Expired - Fee Related CN103049454B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110312662.9A CN103049454B (en) 2011-10-16 2011-10-16 Chinese and English search result visualization system based on multi-label classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110312662.9A CN103049454B (en) 2011-10-16 2011-10-16 Chinese and English search result visualization system based on multi-label classification

Publications (2)

Publication Number Publication Date
CN103049454A CN103049454A (en) 2013-04-17
CN103049454B true CN103049454B (en) 2016-04-20

Family

ID=48062097

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110312662.9A Expired - Fee Related CN103049454B (en) 2011-10-16 2011-10-16 Chinese and English search result visualization system based on multi-label classification

Country Status (1)

Country Link
CN (1) CN103049454B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106611015A (en) * 2015-10-27 2017-05-03 北京百度网讯科技有限公司 Tag processing method and apparatus

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108287848B (en) * 2017-01-10 2020-09-04 中国移动通信集团贵州有限公司 Method and system for semantic parsing
CN108287911B (en) * 2018-02-01 2020-04-24 浙江大学 Relation extraction method based on constrained remote supervision
CN109388479A (en) * 2018-11-01 2019-02-26 郑州云海信息技术有限公司 The output method and device of deep learning data in mxnet system
CN110633365A (en) * 2019-07-25 2019-12-31 北京国信利斯特科技有限公司 Word vector-based hierarchical multi-label text classification method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079048A (en) * 2006-05-24 2007-11-28 上海万纬信息技术有限公司 Internet information search engine and method based on software robot exclusion standard
CN101763424A (en) * 2009-12-14 2010-06-30 刘二中 Method for determining characteristic words and searching according to file content
CN101903878A (en) * 2007-10-11 2010-12-01 谷歌公司 Methods and systems for classifying search results to determine page elements
CN101963966A (en) * 2009-07-24 2011-02-02 李占胜 Method for sorting search results by adding labels into search results

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101784022A (en) * 2009-01-16 2010-07-21 北京炎黄新星网络科技有限公司 Method and system for filtering and classifying short messages
CN101908071B (en) * 2010-08-10 2012-09-05 厦门市美亚柏科信息股份有限公司 Method and device thereof for improving search efficiency of search engine

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079048A (en) * 2006-05-24 2007-11-28 上海万纬信息技术有限公司 Internet information search engine and method based on software robot exclusion standard
CN101903878A (en) * 2007-10-11 2010-12-01 谷歌公司 Methods and systems for classifying search results to determine page elements
CN101963966A (en) * 2009-07-24 2011-02-02 李占胜 Method for sorting search results by adding labels into search results
CN101763424A (en) * 2009-12-14 2010-06-30 刘二中 Method for determining characteristic words and searching according to file content

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
《基于贝叶斯算法的垃圾邮件过滤系统的研究与开发》;赵毅;《中国优秀硕士学位论文全文数据库(信息科技辑)》;20101115(第11期);第I139-69页 *

Also Published As

Publication number Publication date
CN103049454A (en) 2013-04-17

Similar Documents

Publication Publication Date Title
Weismayer et al. Identifying emerging research fields: a longitudinal latent semantic keyword analysis
US10565233B2 (en) Suffix tree similarity measure for document clustering
CN102508859B (en) Advertisement classification method and device based on webpage characteristic
CN107368614A (en) Image search method and device based on deep learning
CN107169001A (en) A kind of textual classification model optimization method based on mass-rent feedback and Active Learning
CN101794311A (en) Fuzzy data mining based automatic classification method of Chinese web pages
CN110059181A (en) Short text stamp methods, system, device towards extensive classification system
CN102194013A (en) Domain-knowledge-based short text classification method and text classification system
CN107885793A (en) A kind of hot microblog topic analyzing and predicting method and system
CN103123633A (en) Generation method of evaluation parameters and information searching method based on evaluation parameters
CN106874410A (en) Chinese microblogging text mood sorting technique and its system based on convolutional neural networks
CN103049454B (en) Chinese and English search result visualization system based on multi-label classification
CN104484431A (en) Multi-source individualized news webpage recommending method based on field body
WO2013049529A1 (en) Method and apparatus for unsupervised learning of multi-resolution user profile from text analysis
CN103778206A (en) Method for providing network service resources
CN112199508A (en) Parameter adaptive agricultural knowledge graph recommendation method based on remote supervision
MidhunChakkaravarthy Evolutionary and incremental text document classifier using deep learning
CN111754208A (en) Automatic screening method for recruitment resumes
CN110990670B (en) Growth incentive book recommendation method and recommendation system
CN115329085A (en) Social robot classification method and system
Aung et al. Random forest classifier for multi-category classification of web pages
CN116882414B (en) Automatic comment generation method and related device based on large-scale language model
CN113239179B (en) Scientific research technology interest field recognition model training method, scientific and technological resource query method and device
CN115510269A (en) Video recommendation method, device, equipment and storage medium
Evangeline et al. Text categorization techniques: A survey

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160420

Termination date: 20171016