CN102831192A - News searching device and method based on topics - Google Patents

Info

Publication number
CN102831192A
Authority
CN
China
Prior art keywords
topic
web page
news
news web
index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012102747655A
Other languages
Chinese (zh)
Inventor
李德聪
方庆安
杨青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PEOPLE SEARCH NETWORK AG
Original Assignee
PEOPLE SEARCH NETWORK AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2012-08-03
Filing date
2012-08-03
Publication date
2012-12-19
Application filed by PEOPLE SEARCH NETWORK AG
Priority to CN2012102747655A
Publication of CN102831192A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a topic-based news retrieval device and method. The device comprises an acquisition and analysis module, a clustering module, an index building module, a query ranking module and a result output module. The acquisition and analysis module collects news web pages and extracts their features; the clustering module clusters the news web pages to generate topics and their feature vectors; the index building module builds indexes for the topics and the news web pages; the query ranking module calculates ranking scores for the topics and news web pages in response to a user query; and the result output module sorts and outputs the retrieval results. With this device and method, when a user searches for news, the relevance between each topic and the user query is evaluated directly at the topic level, and the retrieved topics are displayed together with conventional news web pages.

Description

Topic-based news retrieval device and method
Technical field
The present invention relates to text clustering and information retrieval techniques in the field of Internet information processing, and in particular to a topic-based news retrieval device and method.
Background
With the rapid development of information technology, represented by the Internet, the volume of information keeps growing, it spreads ever faster, and it reaches ever further. Against this background, Internet news is also becoming increasingly diverse. For a single topic there are likely to be many news articles that take different angles, come from different media, and hold different viewpoints. If, when a user searches for news, the results are organized by topic, showing the news, background material and other content related to each topic, then compared with the traditional display that simply lists individual news articles, the user can grasp at a glance the state of public opinion related to the query, and the user experience improves.
At present, news retrieval is offered mainly by the official websites of professional news media, by portal sites, and by the vertical news search channels of search engines. The news retrieval of these sites still leaves room for improvement. It relies mainly on indexing individual news articles: after the user enters a query, the relevance between the query and each individual article is evaluated. Some sites can only show results article by article. Some merely group duplicate articles together. Some can, by certain techniques, determine which articles belong to the same topic, but when processing a user query they first evaluate the relevance between the query and individual articles and only then display the group of news related to each article. None of them evaluates the relevance between the user query and each topic directly at the topic level.
Summary of the invention
In view of this, the main object of the present invention is to provide a topic-based news retrieval device and method that, when a user searches for news, evaluates the relevance between each topic and the user query directly at the topic level, and displays the retrieved topics together with conventional news web pages.
To achieve the above object, the technical solution of the present invention is realized as follows:
A topic-based news retrieval device mainly comprises an acquisition and analysis module, a clustering module, an index building module, a query ranking module and a result output module, wherein:
The acquisition and analysis module is used to collect news web pages and extract their features;
The clustering module is used to cluster the news web pages and produce topics and their feature vectors;
The index building module is used to build indexes for the topics and the news web pages;
The query ranking module is used to calculate, in response to a user query, the ranking score of each topic and news web page;
The result output module is used to sort and output the retrieval results.
A topic-based news retrieval method mainly comprises:
A. a step of collecting news web pages, analyzing them, and extracting features;
B. a step of clustering the news web pages to produce topics and their feature vectors;
C. a step of building indexes for the topics and the news web pages;
D. a step of calculating, in response to a user query, the ranking score of each topic and news web page; and
E. a step of sorting and outputting the retrieval results.
Collecting news web pages and extracting their features in step A comprises:
A1. Collecting news web pages with a web crawler;
A2. Processing the news web pages by word segmentation, part-of-speech tagging and named-entity recognition, and constructing a feature vector whose elements are words or phrases (tokens) together with their weights.
Clustering the news web pages and producing topics and their feature vectors in step B comprises:
Clustering the news web pages using the extracted features; each cluster result serves as a topic; each cluster result has a centroid vector whose elements are tokens and their associated information, and this vector serves as the feature vector of the topic; the IDs of the news web pages contained in the topic are recorded.
Building indexes for the topics and the news web pages in step C mainly comprises:
C1. Building an index for the topics by constructing an inverted index; for each topic, the tokens in the feature vector produced in step B serve as index terms; for each token, the posting list stores the IDs of all topics that contain the token, the weight of the token in each such topic, and other information;
C2. Building an index for the news web pages; for each news web page, the index is built from the feature vector produced in step A; the index terms are the tokens in the feature vector.
Calculating the ranking score of each topic and news web page in response to a user query in step D comprises:
D1. After the user enters a query, processing the query by word segmentation, weighting of the segmentation results, and the like, to produce a query vector whose elements are tokens;
D2. Calculating relevance: for topics and conventional news web pages, calculating, through the topic index and the news web page index respectively, the cosine similarity between the query feature vector and the feature vector of each topic or news web page, which gives the relevance between the query and the topic or news web page;
D3. Combining the relevance with other factors to calculate the ranking score of each topic or news web page, while ensuring during the calculation that the ranking scores of topics and news web pages are comparable.
For the sorting of the retrieval results in step E, because the ranking scores of topics and news web pages are comparable, the results can be sorted together or separately. When the retrieval results are displayed, a topic links to a new page that contains all the news in the topic and other information about the topic.
The topic-based news retrieval device and method provided by the present invention have the following advantages:
The present invention treats topics as retrieval objects and builds an index for them, so that the relevance between each topic and the user query is evaluated directly at the topic level. The ranking scores of topics and of conventional news web pages are comparable, which makes it easy to sort and display them together.
Brief description of the drawings
Fig. 1 is an overall flowchart of the topic-based news retrieval method of the present invention;
Fig. 2 is a flowchart of collecting news web pages with a web crawler in the present invention;
Fig. 3 is a flowchart of building indexes for topics and news web pages in the present invention;
Fig. 4 is a flowchart of calculating the ranking score of each topic and news web page in response to a user query in the present invention;
Fig. 5 is a schematic structural diagram of the topic-based news retrieval device of the present invention.
Detailed description of the embodiments
The device and method of the present invention are described in further detail below with reference to the accompanying drawings and embodiments of the invention.
Fig. 1 is an overall flowchart of the topic-based news retrieval method of the present invention. The method operates on the topic-based news retrieval device shown in Fig. 5, which mainly comprises an acquisition and analysis module, a clustering module, an index building module, a query ranking module and a result output module, wherein:
The acquisition and analysis module is used to collect news web pages and extract their features.
The clustering module is used to cluster the news web pages and produce topics and their feature vectors.
The index building module is used to build indexes for the topics and the news web pages.
The query ranking module is used to calculate, in response to a user query, the ranking score of each topic and news web page.
The result output module is used to sort and output the retrieval results.
As shown in Fig. 1, the topic-based news retrieval method mainly comprises the following steps:
Step S1: collect news web pages and analyze them to extract features;
Step S2: cluster the news web pages to produce topics and their feature vectors. Specifically:
The news web pages are clustered using the features produced in step S1. Because Internet news is produced continuously, the clustering algorithm is an online hierarchical clustering. In each round, the algorithm first removes topics that have not been updated for a long time. It then applies centroid-based clustering (UPGMC, Unweighted Pair-Group Method using Centroids) to the news added in this round, producing a batch of new topics. It then merges new topics with existing topics where they are eligible for merging. Finally, it applies UPGMC clustering once more to all existing topics. Similarity during clustering is computed as cosine similarity. Each cluster result serves as a topic, and each topic has a centroid vector whose elements are tokens and their associated information; this vector serves as the feature vector of the topic. The IDs of the news web pages contained in the topic are also recorded.
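The UPGMC step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it covers a single clustering pass only (not the removal of stale topics or the merge with existing topics), and the function names, the dictionary layout of a topic, and the `threshold` stopping value are assumptions introduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {token: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Unweighted mean of a list of sparse vectors."""
    total = {}
    for vec in vectors:
        for t, w in vec.items():
            total[t] = total.get(t, 0.0) + w
    n = len(vectors)
    return {t: w / n for t, w in total.items()}

def upgmc(pages, threshold=0.5):
    """Centroid-linkage agglomerative clustering.

    pages: {page_id: feature_vector}. Repeatedly merges the two clusters whose
    centroids are most similar until no pair reaches `threshold` (assumed value).
    Returns a list of topics, each with its member page IDs and centroid vector.
    """
    clusters = [{"pages": [pid], "centroid": dict(vec)} for pid, vec in pages.items()]
    while len(clusters) > 1:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(clusters[i]["centroid"], clusters[j]["centroid"])
                if s >= best:
                    best, pair = s, (i, j)
        if pair is None:
            break  # no pair is similar enough to merge
        i, j = pair
        members = clusters[i]["pages"] + clusters[j]["pages"]
        merged = {"pages": members, "centroid": centroid([pages[p] for p in members])}
        clusters = [c for k, c in enumerate(clusters) if k not in pair] + [merged]
    return clusters
```

The pairwise search is quadratic per merge, which is acceptable for a sketch; a production system would likely restrict candidate pairs, for example through the inverted index.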
Step S3: build indexes for the topics and the news web pages;
Step S4: in response to a user query, calculate the ranking score of each topic and news web page; and
Step S5: sort and output the retrieval results.
Here, because the ranking scores of topics and news web pages are comparable, the results can be sorted together or separately. When the retrieval results are displayed, a topic can link to a new page that contains all the news in the topic and other information about the topic (a coverage trend chart, comments, pictures, videos, and so on).
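The two ordering options can be illustrated with a short sketch. The tuple layout `(kind, id, score)` and the function name are assumptions made for this example; the only property taken from the text is that topic and page scores are on a comparable scale.

```python
def order_results(results, mixed=True):
    """Order retrieval results by descending ranking score.

    results: list of (kind, item_id, score), where kind is "topic" or "page".
    mixed=True interleaves topics and pages in one list; mixed=False lists
    all topics first and then all pages, each group sorted on its own.
    """
    def by_score(r):
        return -r[2]

    if mixed:
        return sorted(results, key=by_score)
    topics = sorted((r for r in results if r[0] == "topic"), key=by_score)
    pages = sorted((r for r in results if r[0] == "page"), key=by_score)
    return topics + pages
```

Mixed ordering is only meaningful because the scores are comparable; if they were on different scales, only the separate ordering would be safe.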
Fig. 2 is a flowchart of collecting news web pages with a web crawler in the present invention. As shown in Fig. 2, collecting news web pages and extracting their features comprises the following steps:
Step S11: collect news web pages with a web crawler;
Step S12: process the news web pages by word segmentation, part-of-speech tagging, stop-word removal, named-entity recognition, synonym merging, and the like, and construct a feature vector whose elements are words or phrases (tokens) together with their weights.
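A minimal sketch of the feature-vector construction in step S12 follows. It is illustrative only: the regular-expression tokenizer stands in for a real word segmenter, part-of-speech tagging, named-entity recognition and synonym merging are omitted, and the stop-word list and the term-frequency weighting are assumptions, since the text does not say how weights are assigned.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a full lexicon.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is"}

def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]

def feature_vector(text):
    """Return {token: weight}, using term-frequency weights scaled to unit L2 norm."""
    counts = Counter(tokenize(text))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {token: c / norm for token, c in counts.items()}
```

Scaling to unit length at this stage is a convenience: it means the weights can be stored in the index already normalized.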
Fig. 3 is a flowchart of building indexes for topics and news web pages in the present invention. As shown in Fig. 3, building the indexes mainly comprises the following steps:
Step S31: build an index for the topics by constructing an inverted index. For each topic, the tokens in the feature vector produced in step S2 serve as index terms. For each token, each element of the posting list stores the ID of a topic that contains the token, the normalized weight of the token in that topic, and other information.
Step S32: build an index for the news web pages. For each news web page, the index is built from the feature vector produced in step S1. As with the topic index, the index terms are the tokens in the feature vector, and the posting list stores the ID of each news web page that contains the token, the normalized weight of the token in that page, and other information.
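The index structure in steps S31 and S32 can be sketched as a mapping from each token to its posting list. The function name and the `(id, weight)` tuple layout are assumptions for illustration; the "other information" mentioned in the text is left out. The same function serves both the topic index and the news web page index, since both are built from feature vectors.

```python
from collections import defaultdict

def build_inverted_index(vectors):
    """Build an inverted index from feature vectors.

    vectors: {doc_id: {token: normalized_weight}}, where doc_id is a topic ID
    or a news web page ID. Returns {token: [(doc_id, weight), ...]}, that is,
    one posting list per token.
    """
    index = defaultdict(list)
    for doc_id, vec in vectors.items():
        for token, weight in vec.items():
            index[token].append((doc_id, weight))
    return dict(index)
```

For example, indexing two topics `T1` and `T2` that both contain the token `flood` yields a single posting list for `flood` with two entries.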
Fig. 4 is a flowchart of calculating the ranking score of each topic and news web page in response to a user query in the present invention. As shown in Fig. 4, the calculation comprises the following steps:
Step S41: after the user enters a query, process the query by word segmentation, weighting of the segmentation results, and the like, to produce a query vector whose elements are likewise tokens.
Step S42: calculate the relevance between each topic or news web page and the query. For topics and conventional news web pages, the cosine similarity between the query feature vector and the feature vector of each topic or news web page is calculated through the topic index and the news web page index respectively. The formula is as follows:
$$\mathrm{Sim}(Q, P) = \sum_{t} w_{t,Q} \, w_{t,P}$$
where Q is the query feature vector, P is the feature vector of a news web page or topic, and $w_{t,Q}$ and $w_{t,P}$ are the weights of token t in Q and P. Because the norm of Q has no effect on the final ordering, and P is already normalized when the index is built, the norms of the two vectors do not appear as a denominator in the formula.
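The reason the denominator can be dropped may be easier to see written out. The following short derivation is an illustration; the symbols $w_{t,Q}$ and $w_{t,P}$ (the weight of token t in the query and in the indexed vector) are notation assumed here.

```latex
\cos(Q, P) = \frac{Q \cdot P}{\lVert Q \rVert \, \lVert P \rVert}
           = \frac{1}{\lVert Q \rVert} \sum_{t} w_{t,Q} \, w_{t,P}
           \qquad \text{since } \lVert P \rVert = 1 .
```

For a fixed query, $1 / \lVert Q \rVert$ is the same positive constant for every candidate P, so sorting candidates by the sum alone gives the same order as sorting them by the full cosine.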
In the implementation, the cosine similarity is computed in token-at-a-time mode, as follows:
1) Take the posting list corresponding to one token in the query.
2) For each element in this posting list, multiply the weight of the token in the topic or news document that the element refers to by the weight of the token in the query, and add the product to the running score of that topic or news document.
3) If tokens remain in the query, return to 1) and continue. Otherwise, the accumulated results are the relevance of the query to each topic or news web page.
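The three steps above can be sketched as a short accumulation loop. The function name and the index layout (`{token: [(doc_id, weight), ...]}`) are assumptions made for this illustration.

```python
from collections import defaultdict

def score_token_at_a_time(query_vector, inverted_index):
    """Accumulate relevance scores one query token at a time.

    query_vector: {token: weight_in_query}
    inverted_index: {token: [(doc_id, normalized_weight), ...]}
    Returns {doc_id: accumulated dot product with the query}.
    """
    scores = defaultdict(float)
    for token, q_weight in query_vector.items():               # step 1: next posting list
        for doc_id, d_weight in inverted_index.get(token, []):
            scores[doc_id] += q_weight * d_weight              # step 2: accumulate product
    return dict(scores)                                        # step 3: all tokens processed
```

Only topics or pages that share at least one token with the query ever receive a score, so the cost depends on the lengths of the posting lists touched rather than on the size of the whole collection.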
Step S43: starting from the relevance, combine other factors to calculate the ranking score of each topic or news web page, while ensuring during the calculation that the ranking scores of topics and news web pages are comparable. These factors include whether the item is a topic (topics receive a certain bonus), the quality of the news web page or topic itself, its timeliness, and so on.
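The text names the factors but not how they are combined, so the following is only one possible sketch. The multiplicative form, the additive `topic_bonus`, the exponential freshness decay and every default value are assumptions chosen for illustration; the one property taken from the text is that topics and pages go through the same formula, which keeps their scores comparable.

```python
def ranking_score(relevance, is_topic, quality, age_hours,
                  topic_bonus=0.1, half_life_hours=48.0):
    """Combine relevance with quality, timeliness and a topic bonus.

    relevance: similarity between the query and the topic or page.
    quality:   value in [0, 1] describing the item's own quality (hypothetical scale).
    age_hours: time since the item was published or last updated.
    """
    freshness = 0.5 ** (age_hours / half_life_hours)   # 1.0 when new, halves every half-life
    score = relevance * (0.5 + 0.5 * quality) * (0.5 + 0.5 * freshness)
    return score + (topic_bonus if is_topic else 0.0)
```

With identical inputs, a topic scores exactly `topic_bonus` higher than a page, which is one simple way to express the bonus for topics.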
The above is merely a preferred embodiment of the present invention and is not intended to limit the scope of protection of the present invention.

Claims (8)

1. A topic-based news retrieval device, characterized in that it mainly comprises an acquisition and analysis module, a clustering module, an index building module, a query ranking module and a result output module, wherein:
The acquisition and analysis module is used to collect news web pages and extract their features;
The clustering module is used to cluster the news web pages and produce topics and their feature vectors;
The index building module is used to build indexes for the topics and the news web pages;
The query ranking module is used to calculate, in response to a user query, the ranking score of each topic and news web page;
The result output module is used to sort and output the retrieval results.
2. A topic-based news retrieval method, characterized in that it mainly comprises:
A. a step of collecting news web pages, analyzing them, and extracting features;
B. a step of clustering the news web pages to produce topics and their feature vectors;
C. a step of building indexes for the topics and the news web pages;
D. a step of calculating, in response to a user query, the ranking score of each topic and news web page; and
E. a step of sorting and outputting the retrieval results.
3. The topic-based news retrieval method according to claim 2, characterized in that collecting news web pages and extracting their features in step A comprises:
A1. Collecting news web pages with a web crawler;
A2. Processing the news web pages by word segmentation, part-of-speech tagging and named-entity recognition, and constructing a feature vector whose elements are words or phrases (tokens) together with their weights.
4. The topic-based news retrieval method according to claim 2, characterized in that clustering the news web pages and producing topics and their feature vectors in step B comprises:
Clustering the news web pages using the extracted features; each cluster result serves as a topic; each cluster result has a centroid vector whose elements are tokens and their associated information, and this vector serves as the feature vector of the topic; the IDs of the news web pages contained in the topic are recorded.
5. The topic-based news retrieval method according to claim 2, characterized in that building indexes for the topics and the news web pages in step C mainly comprises:
C1. Building an index for the topics by constructing an inverted index; for each topic, the tokens in the feature vector produced in step B serve as index terms; for each token, the posting list stores the IDs of all topics that contain the token, the weight of the token in each such topic, and other information;
C2. Building an index for the news web pages; for each news web page, the index is built from the feature vector produced in step A; the index terms are the tokens in the feature vector.
6. The topic-based news retrieval method according to claim 2, characterized in that calculating the ranking score of each topic and news web page in response to a user query in step D comprises:
D1. After the user enters a query, processing the query by word segmentation, weighting of the segmentation results, and the like, to produce a query vector whose elements are tokens.
7. D2. Calculating relevance: for topics and conventional news web pages, calculating, through the topic index and the news web page index respectively, the cosine similarity between the query feature vector and the feature vector of each topic or news web page, which gives the relevance between the query and the topic or news web page;
D3. Combining the relevance with other factors to calculate the ranking score of each topic or news web page, while ensuring during the calculation that the ranking scores of topics and news web pages are comparable.
8. The topic-based news retrieval method according to claim 2, characterized in that, for the sorting of the retrieval results in step E, because the ranking scores of topics and news web pages are comparable, the results can be sorted together or separately; when the retrieval results are displayed, a topic links to a new page that contains all the news in the topic and other information about the topic.
CN2012102747655A 2012-08-03 2012-08-03 News searching device and method based on topics Pending CN102831192A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012102747655A CN102831192A (en) 2012-08-03 2012-08-03 News searching device and method based on topics

Publications (1)

Publication Number Publication Date
CN102831192A (en) 2012-12-19

Family

ID=47334329

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012102747655A Pending CN102831192A (en) 2012-08-03 2012-08-03 News searching device and method based on topics

Country Status (1)

Country Link
CN (1) CN102831192A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103077190A (en) * 2012-12-20 2013-05-01 人民搜索网络股份公司 Hot event ranking method based on order learning technology
CN106021351A (en) * 2016-05-10 2016-10-12 深圳职业技术学院 An aggregation extraction method and device for news events
CN106874292A (en) * 2015-12-11 2017-06-20 北京国双科技有限公司 Topic processing method and processing device
CN109902230A (en) * 2019-02-13 2019-06-18 北京航空航天大学 A kind of processing method and processing device of news data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070088720A1 (en) * 2005-10-17 2007-04-19 Siemens Aktiengesellschaft Method for detecting discrepancies between a user's perception of web sites and an author's intention of these web sites
CN102236710A (en) * 2011-06-30 2011-11-09 百度在线网络技术(北京)有限公司 Method and equipment for displaying news information in query result
CN102495872A (en) * 2011-11-30 2012-06-13 中国科学技术大学 Method and device for conducting personalized news recommendation to mobile device users

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103077190A (en) * 2012-12-20 2013-05-01 人民搜索网络股份公司 Hot event ranking method based on order learning technology
CN106874292A (en) * 2015-12-11 2017-06-20 北京国双科技有限公司 Topic processing method and processing device
CN106874292B (en) * 2015-12-11 2020-05-05 北京国双科技有限公司 Topic processing method and device
CN106021351A (en) * 2016-05-10 2016-10-12 深圳职业技术学院 An aggregation extraction method and device for news events
CN106021351B (en) * 2016-05-10 2019-04-12 深圳职业技术学院 For the polymerization extracting method and device of media event
CN109902230A (en) * 2019-02-13 2019-06-18 北京航空航天大学 A kind of processing method and processing device of news data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20121219