CN102890715A - Device and method for automatically organizing specific domain information - Google Patents


Info

Publication number
CN102890715A
Authority
CN
China
Prior art keywords
news
specific area
information
module
index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012103575482A
Other languages
Chinese (zh)
Inventor
李德聪
杨青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PEOPLE SEARCH NETWORK AG
Original Assignee
PEOPLE SEARCH NETWORK AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by PEOPLE SEARCH NETWORK AG
Priority to CN2012103575482A
Publication of CN102890715A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a device and a method for automatically organizing specific domain information. The device mainly comprises: a news collecting module for collecting network news; a news screening module for screening news within a specific domain from the collected news; a news topic detecting module for detecting the topics of the news within the specific domain; a background information caching module for caching the specific-domain news organized by topic, so that a front-end module can access it at any time; a specific domain information collecting module for collecting information within the specific domain from set websites; an indexing module for building an index of the news and the information within the specific domain; and a searching module for processing the query input by a user, querying the index, and arranging the query results. With this device, machine classification, clustering, and retrieval of information can be realized, so that information within a specific domain can be screened automatically from the mass of information on the Internet, with effective organizing and searching functions.

Description

A device and method for automatically organizing specific domain information
Technical field
The present invention relates to machine learning and information retrieval technology, and in particular to a device and method for automatically organizing specific domain information.
Background art
With the rapid development of the Internet, network information has become increasingly abundant and diverse. This also means, however, that a user who wants to obtain a certain class of information comprehensively and systematically has to spend more time and effort screening it out of the ocean of information, and then organizing and sorting it manually.
Some Internet information providers have made attempts in this direction. For example, the major portal websites provide news divided into channels and publish special reports on major events. These products, however, depend to a large extent on manual screening and editing, and they usually present individual news items, pictures, and the like, so the form of presentation is rather limited.
In recent years, machine learning techniques (including classification and clustering) and information retrieval techniques have developed rapidly, and computing power has continued to improve. Together these make it technically feasible to screen the information of a specific domain automatically and to provide effective organization and search functions.
Summary of the invention
In view of this, the main purpose of the present invention is to provide a device and method for automatically organizing specific domain information, so that a machine can classify, cluster, and retrieve information, automatically screen out the information of a specific domain from the mass of information on the Internet, and provide effective organization and search functions.
To achieve the above purpose, the technical solution of the present invention is realized as follows:
A device for automatically organizing specific domain information mainly comprises a news collection module, a news screening module, a news topic detection module, a background information cache module, a specific domain information collection module, an index module, and a retrieval module, wherein:
The news collection module is used to collect network news;
The news screening module screens out the news of the specific domain from the collected news;
The news topic detection module performs topic detection on the news of the specific domain;
The background information cache module caches the specific-domain news organized by topic, so that the front-end module can access it at any time;
The specific domain information collection module collects information of the specific domain from set websites;
The index module builds an index of the news and of the specific domain information; and
The retrieval module processes the query input by the user, queries the index, and arranges the retrieval results.
The device further comprises a front-end module for presenting information directly to the user and receiving the user's requests.
The specific domain comprises any field of information that the user wishes to collect via the Internet.
The information collected by the specific domain information collection module is specifically information on substandard food collected from the set websites.
A method for automatically organizing specific domain information mainly comprises the following steps:
A. an information collection step: collecting news from the network and structured information from specific websites;
B. an information screening step: automatically screening the collected news to obtain the news of the specific domain;
C. a topic detection step: clustering the news of the specific domain and organizing it into topics for presentation;
D. an indexing step: building an index of the news of the specific domain and of the structured information from the specific websites, for retrieval.
Step A mainly comprises:
Collecting network news, that is, using a web crawler to collect news from various news websites and converting it into structured information; and
Collecting the structured information of specific websites, that is, collecting information of the specific domain from specific websites and likewise converting it into structured information.
The automatic screening of the collected news in step B mainly uses a Naive Bayes classifier trained in advance for this purpose, which takes features extracted from the title, body text, and URL of a web page and, in combination with related rules, judges whether a newly collected news item belongs to the specific domain.
Step C mainly comprises:
C1. removing topics that have not changed for a long time;
C2. extracting features from each news item that arrived during the current cycle, and constructing a feature vector based on the vector space model to describe that news item;
C3. performing hierarchical clustering on the batch of feature vectors thus generated, wherein the clustering algorithm is the unweighted pair-group method using centroids (UPGMC), each set in the clustering result, that is, each cluster, has a center vector, and similarity is computed as cosine similarity;
C4. for each cluster, finding the topic with the greatest similarity to that cluster; if this similarity is greater than a preset threshold, merging the cluster into that topic and revising the topic's center vector and update time; otherwise, treating the cluster as a new topic whose creation time and update time are both the current system time;
C5. performing UPGMC hierarchical clustering once more on all topics, wherein the clusters of the clustering result are the complete set of topics at the end of the current cycle.
Step D comprises: building an index of the news of the specific domain and of the information of the specific domain.
The device and method for automatically organizing specific domain information provided by the present invention have the following advantages:
From the mass of information on the Internet, the information of a specific domain is screened out automatically, then organized and presented by topic, and search functions are provided for both conventional text information and specially structured information. The user is thus spared the trouble of screening, sorting, and searching for specific domain information.
Description of the drawings
Fig. 1 is a structural diagram of the device for automatically organizing food safety information according to an embodiment of the present invention;
Fig. 2 is an overall flow diagram of the method for automatically organizing specific domain information according to the present invention;
Fig. 3 is a flow chart of information collection according to the present invention;
Fig. 4 is a flow chart of news topic detection according to the present invention;
Fig. 5 is a flow chart of index building according to the present invention.
Embodiments
The device and method of the present invention are described in further detail below with reference to the accompanying drawings and embodiments.
For commercial reasons such as the needs of an actual product, the present invention takes only the processing of food safety information as an example to describe the structure of the device for automatically organizing specific domain information and the specific implementation of its method. Because the device and method do not depend on the choice of domain, they can equally process information of other domains. With slight changes to suit different requirements, the device and method can therefore automate the processing of other classes of information, for example information that users wish to collect via the Internet in fields such as computer hardware technology development or computer software testing technology.
Fig. 1 is a structural diagram of the device for automatically organizing food safety information according to an embodiment of the present invention. As shown in Fig. 1, the device mainly comprises:
News collection module: responsible for collecting network news.
Specific domain information collection module: collects information of the specific domain from set websites. Here, for example, it can be a substandard food information collection module that collects substandard food information from set websites.
News screening module: screens out the news of the specific domain from the collected news. Here, for example, this can be food safety news.
News topic detection module: performs topic detection on the news of the specific domain, for example on food safety news.
Background information cache module: caches the specific-domain news organized by topic, so that the front-end module can access it at any time. For example, it caches food safety news organized by topic.
Index module: builds an index of the news and of the specific domain information, for example of the news and the substandard food information.
Retrieval module: responsible for processing the query input by the user, querying the index, and arranging the retrieval results.
Front-end module: mainly presents information directly to the user and receives the user's requests.
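The way these modules hand data to one another can be sketched as follows. This is only an illustrative wiring under stated assumptions: the function `run_cycle`, its parameters, and the use of a plain dict as the cache and a plain list as the index are hypothetical and do not come from the patent text.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def run_cycle(
    collect_news: Callable[[], List[Record]],
    collect_domain_info: Callable[[], List[Record]],
    is_in_domain: Callable[[Record], bool],
    detect_topics: Callable[[List[Record]], Dict[str, List[Record]]],
    cache: Dict[str, List[Record]],
    index: List[Record],
) -> None:
    """Run one collection cycle across the modules (hypothetical wiring)."""
    news = collect_news()                               # news collection module
    domain_news = [n for n in news if is_in_domain(n)]  # news screening module
    topics = detect_topics(domain_news)                 # news topic detection module
    cache.clear()
    cache.update(topics)                                # background information cache module
    index.extend(domain_news)                           # index module: domain news
    index.extend(collect_domain_info())                 # index module: domain records
```

The retrieval and front-end modules would read from `index` and `cache` respectively; they are left out of this sketch.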
Fig. 2 is an overall flow diagram of the method for automatically organizing specific domain information according to the present invention. The flow can be executed periodically, and each execution cycle mainly comprises the following steps (food safety information is used here only as an example):
Step S1: the information collection step. News is collected from the network, and structured information is collected from specific websites.
Here, the information collection step, as shown in Fig. 3, specifically comprises:
Step S11: Collect network news, that is, use a web crawler to collect news from various news websites and convert it into structured information for further processing. Structured information here means standardized information comprising items such as title, body text, author, source, and time.
Step S12: Collect the structured information of specific websites, that is, collect the substandard food information published on specific government websites. Most of this information is already structured and is provided in the form of tables. After collection, it is likewise converted into the self-defined structured information.
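As a rough illustration of the two kinds of structured information described in steps S11 and S12, the sketch below defines a news record with the listed items and converts table rows into keyed records. The names `NewsRecord` and `rows_to_records`, and the example column names, are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NewsRecord:
    """Structured form of one crawled news item (fields named in step S11)."""
    title: str
    body: str
    author: str
    source: str
    published: str  # publication time as text; ISO 8601 is assumed here


def rows_to_records(header: List[str], rows: List[List[str]]) -> List[Dict[str, str]]:
    """Convert the rows of a published table into records keyed by column name."""
    return [dict(zip(header, (cell.strip() for cell in row))) for row in rows]
```

For example, `rows_to_records(["food_name", "brand"], [[" Milk powder ", "Acme"]])` yields one record with whitespace trimmed from each cell.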
Step S2: the information screening step. The collected news is screened automatically to obtain the news of the food safety field.
Here, a Naive Bayes classifier trained in advance for this purpose is mainly used. Features such as the title, body text, and URL of the web page are extracted and, in combination with some rules, used to judge whether a newly collected news item is food safety news. If so, it is further determined which subclass the item belongs to, and the item is classified accordingly.
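A minimal sketch of such a classifier is given below, assuming a multinomial Naive Bayes model with add-one smoothing over tokens taken from the title, body text, and URL. The tokenization, the field prefixes, and the class labels are illustrative assumptions; the supplementary rules and the subclass decision mentioned above are omitted, and real Chinese text would first need word segmentation.

```python
import math
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def features(title: str, body: str, url: str) -> List[str]:
    """Tokenize each field and prefix tokens with the field they came from."""
    fields = (("title", title), ("body", body), ("url", url))
    return [f"{name}:{tok}" for name, text in fields
            for tok in re.findall(r"\w+", text.lower())]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self) -> None:
        self.class_docs: Counter = Counter()
        self.token_counts: Dict[str, Counter] = defaultdict(Counter)
        self.vocab: set = set()

    def fit(self, samples: Iterable[Tuple[List[str], str]]) -> "NaiveBayes":
        for tokens, label in samples:
            self.class_docs[label] += 1
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens: List[str]) -> str:
        total_docs = sum(self.class_docs.values())
        best_label, best_score = "", -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)          # log prior
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)  # log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Prefixing tokens with their field lets the model treat a word in the URL differently from the same word in the body, which is one simple way to use all three feature sources.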
Step S3: the topic detection step. The news of the food safety field is clustered and organized into topics for presentation.
Here, the screened food safety news is processed by periodically carrying out the following sub-steps, as shown in Fig. 4:
Step S31: Remove topics that have not changed for a long time. This both reduces the amount of data in subsequent clustering and avoids the interference that outdated topics may cause in clustering.
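Step S31 can be sketched as a simple filter on each topic's update time. The `Topic` fields and the `max_idle` parameter are assumptions; the text does not state how long "a long time" is.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Topic:
    topic_id: int
    born: float     # creation time, seconds since the epoch
    updated: float  # time of the last update


def prune_stale(topics: List[Topic], now: float, max_idle: float) -> List[Topic]:
    """Keep only topics updated within the last `max_idle` seconds."""
    return [t for t in topics if now - t.updated <= max_idle]
```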
Step S32: Extract features from each news item that arrived during the current cycle. First, the title and body text of the news item undergo word segmentation, part-of-speech tagging, stop-word removal, proper-name recognition, synonym merging, and similar steps. The result is a sequence of words or phrases, each called a token. For each token, its TF.IWF score is computed as the base weight, which is then combined with information such as the token's position in the text, its part of speech, and its proper-name type to determine its final weight. The tokens and their weights are then assembled into a feature vector based on the vector space model, which describes the news item.
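The weighting in step S32 can be sketched as follows. The text does not define TF.IWF, so this sketch assumes a commonly given form: term frequency within the item multiplied by the logarithm of the total token count of the batch divided by the token's count in the batch. The title boost stands in for the position-based adjustment, and its factor is purely illustrative.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def tf_iwf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Build one sparse feature vector per tokenized document.

    Assumed weighting: tf(t, d) * log(total tokens in batch / count of t in batch).
    """
    corpus_counts = Counter(tok for doc in docs for tok in doc)
    total = sum(corpus_counts.values())
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            tok: (n / len(doc)) * math.log(total / corpus_counts[tok])
            for tok, n in tf.items()
        })
    return vectors


def boost_title_tokens(vector: Dict[str, float], title_tokens: Set[str],
                       factor: float = 1.5) -> Dict[str, float]:
    """Raise the weight of tokens that also occur in the title (assumed rule)."""
    return {tok: w * (factor if tok in title_tokens else 1.0)
            for tok, w in vector.items()}
```

The input to `tf_iwf_vectors` is assumed to be already segmented and filtered, so segmentation, stop-word removal, and synonym merging are outside this sketch.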
Step S33: Perform hierarchical clustering on the batch of feature vectors generated in step S32. The clustering algorithm is the Unweighted Pair-Group Method using Centroids (UPGMC). In this algorithm, each set in the clustering result (called a cluster) has a center vector. Similarity is computed as the cosine similarity of the center vectors of two clusters.
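A small sketch of centroid-based agglomerative clustering with cosine similarity is shown below. It repeatedly merges the two clusters whose center vectors are most similar and recomputes the merged center as the size-weighted mean, as in UPGMC. The stopping rule (stop once the best similarity is no greater than `threshold`) is an assumption, since the text does not say when merging ends, and the quadratic pair search is written for clarity rather than speed.

```python
import math
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def merge_centroids(a: Vector, na: int, b: Vector, nb: int) -> Vector:
    """Centroid of the union of two clusters with na and nb members."""
    keys = set(a) | set(b)
    return {k: (a.get(k, 0.0) * na + b.get(k, 0.0) * nb) / (na + nb) for k in keys}


def upgmc(vectors: List[Vector], threshold: float) -> List[Tuple[Vector, List[int]]]:
    """Return (center vector, member indices) for each resulting cluster."""
    clusters = [(dict(v), [i]) for i, v in enumerate(vectors)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(clusters[i][0], clusters[j][0])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        if s <= threshold:  # assumed stopping rule
            break
        (ci, mi), (cj, mj) = clusters[i], clusters[j]
        merged = (merge_centroids(ci, len(mi), cj, len(mj)), mi + mj)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters
```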
Step S34: For each cluster produced in step S33, find the topic with the greatest similarity to that cluster. Similarity is again computed as cosine similarity. If this similarity is greater than a preset threshold, the cluster is merged into that topic, and the topic's center vector and update time are revised. Otherwise, the cluster is treated as a new topic, whose creation time and update time are both set to the current system time.
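The decision in step S34 can be sketched as below. The `Topic` fields and the function `absorb` are hypothetical, and the size-weighted mean used to revise the center vector is an assumption; the text only says that the center vector is revised.

```python
import math
from dataclasses import dataclass
from typing import Dict, List

Vector = Dict[str, float]


@dataclass
class Topic:
    centroid: Vector
    size: int       # number of news items in the topic
    born: float     # creation time
    updated: float  # last update time


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def absorb(centroid: Vector, size: int, topics: List[Topic],
           now: float, threshold: float) -> Topic:
    """Merge a new cluster into its most similar topic, or start a new topic."""
    best = max(topics, key=lambda t: cosine(t.centroid, centroid), default=None)
    if best is not None and cosine(best.centroid, centroid) > threshold:
        total = best.size + size
        keys = set(best.centroid) | set(centroid)
        # Assumed update rule: size-weighted mean of the two center vectors.
        best.centroid = {
            k: (best.centroid.get(k, 0.0) * best.size + centroid.get(k, 0.0) * size) / total
            for k in keys
        }
        best.size = total
        best.updated = now
        return best
    topic = Topic(dict(centroid), size, born=now, updated=now)
    topics.append(topic)
    return topic
```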
Step S35: Perform UPGMC hierarchical clustering once more on all topics. The clusters of the clustering result are the complete set of topics at the end of the current cycle. This process likewise uses cosine similarity to compute the similarity of clusters. If a topic is produced by merging several topics, its update time is also set to the current system time.
Step S4: the indexing step. An index is built of the food safety news and of the structured information from the specific websites, for retrieval.
Here, the process of building an index of the food safety news and the substandard food information, as shown in Fig. 5, mainly comprises:
Step S41: Build an index of the food safety news. The index fields include title, body text, and so on, and sorting of retrieval results by factors such as time is supported.
Step S42: Build an index of the substandard food information. The information is first converted into the same data format as the news to be retrieved, so that it can share one retrieval system with the news index. The index fields include food name, brand name, category, and so on, and filtering by category is supported.
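The idea of keeping news and substandard food records in one retrieval system can be sketched with a small in-memory inverted index. The class `UnifiedIndex`, the `type`, `time`, and `category` keys, and the whitespace-style tokenization are illustrative assumptions; a production system would use a full-text search engine and a Chinese word segmenter.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional, Set


class UnifiedIndex:
    """One inverted index holding both news and structured records."""

    def __init__(self) -> None:
        self.docs: List[Dict[str, str]] = []
        self.postings: Dict[str, Set[int]] = defaultdict(set)

    def add(self, doc: Dict[str, str], fields: List[str]) -> int:
        """Store a record and index the tokens of the chosen fields."""
        doc_id = len(self.docs)
        self.docs.append(doc)
        for field in fields:
            for tok in re.findall(r"\w+", doc.get(field, "").lower()):
                self.postings[tok].add(doc_id)
        return doc_id

    def search(self, query: str, doc_type: Optional[str] = None,
               category: Optional[str] = None) -> List[Dict[str, str]]:
        """Return records matching all query tokens, newest first."""
        tokens = re.findall(r"\w+", query.lower())
        if not tokens:
            return []
        ids = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        hits = [self.docs[i] for i in ids]
        if doc_type is not None:
            hits = [d for d in hits if d.get("type") == doc_type]
        if category is not None:
            hits = [d for d in hits if d.get("category") == category]
        return sorted(hits, key=lambda d: d.get("time", ""), reverse=True)
```

Sorting on the `time` string works here only because ISO 8601 dates are assumed; records without a time sort last.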
It should be noted that, because new data is constantly produced on the network, all of the above steps are executed periodically.
Demonstration and practice show that the above device and method can effectively screen out information in the food safety field automatically, organize and present it by topic, and also provide search functions for food safety news and substandard food information.
Owing to the needs of an actual product and to certain non-technical considerations, the present invention mainly deals with food safety information. However, because the method of the present invention does not depend on the choice of a particular domain, any processing of information in other specific domains (for example, we have also implemented a similar product for information on exposed product quality defects) that adopts similar screening, topic-based organization and presentation, and unified retrieval of text and structured information shall be regarded as falling within the protection scope of the present invention.
The above are merely preferred embodiments of the present invention and are not intended to limit its protection scope.

Claims (9)

1. A device for automatically organizing specific domain information, characterized in that the device mainly comprises a news collection module, a news screening module, a news topic detection module, a background information cache module, a specific domain information collection module, an index module, and a retrieval module, wherein:
The news collection module is used to collect network news;
The news screening module screens out the news of the specific domain from the collected news;
The news topic detection module performs topic detection on the news of the specific domain;
The background information cache module caches the specific-domain news organized by topic, so that the front-end module can access it at any time;
The specific domain information collection module collects information of the specific domain from set websites;
The index module builds an index of the news and of the specific domain information; and
The retrieval module processes the query input by the user, queries the index, and arranges the retrieval results.
2. The device for automatically organizing specific domain information according to claim 1, characterized in that the device further comprises a front-end module for presenting information directly to the user and receiving the user's requests.
3. The device for automatically organizing specific domain information according to claim 1, characterized in that the specific domain comprises any field of information that the user wishes to collect via the Internet.
4. The device for automatically organizing specific domain information according to claim 1, characterized in that the information collected by the specific domain information collection module is specifically information on substandard food collected from the set websites.
5. the method for a specific area information automation tissue is characterized in that, mainly comprises the steps:
The step of A, information acquisition is from the structured message of network collection news and specific website;
The step of B, information sifting to the news automatic screening that gathers, draws the news of specific area;
The step of C, topic detection is carried out cluster to the news of specific area, is organized into topic and shows;
D, set up the step of index, the news of specific area and the structured message of specific website are set up index, for retrieval.
6. the method for described specific area information automation tissue according to claim 5 is characterized in that steps A mainly comprises:
Collection network news, namely the Adoption Network reptile gathers the news of all kinds of news websites, and is translated into structured message; And
Gather the structured message of specific website, namely gather the information of specific area from specific website, also be translated into structured message.
7. the method for specific area information automation tissue according to claim 5, it is characterized in that, the described news automatic screening to gathering of step B, mainly adopt the in advance Naive Bayes Classifier of specialized training, feature with the title that extracts webpage, text, url, and in conjunction with dependency rule, judge whether the news that newly collects belongs to the news of specific area class.
8. the method for specific area information automation tissue according to claim 5 is characterized in that step C mainly comprises:
C1, removal do not have vicissitudinous topic for a long time;
C2, each the bar news that enters in this cycle is extracted feature, and be configured for describing the proper vector based on vector space model of this news;
C3, a collection of proper vector of described generation is carried out hierarchical clustering, clustering algorithm adopts non-set of weights center UPGMC algorithm, each set in the cluster result namely bunch is all had a center vector, and calculate the cosine similarity;
C4, to described each bunch, find out the topic with the similarity maximum of this bunch; If this similarity is greater than reservation threshold, should bunch merges in this topic, and revise its center vector and update time; Otherwise, should bunch be considered as a New Topics, its birth time and update time are the current time in system;
C5, all topics are carried out the UPGMC hierarchical clustering again one time, all of cluster result bunch be this cycle finish after this whole topics.
9. the method for specific area information automation tissue according to claim 1 is characterized in that step D comprises: the news of described specific area and the information of described specific area class are set up index.
CN2012103575482A 2012-09-24 2012-09-24 Device and method for automatically organizing specific domain information Pending CN102890715A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012103575482A CN102890715A (en) 2012-09-24 2012-09-24 Device and method for automatically organizing specific domain information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012103575482A CN102890715A (en) 2012-09-24 2012-09-24 Device and method for automatically organizing specific domain information

Publications (1)

Publication Number Publication Date
CN102890715A true CN102890715A (en) 2013-01-23

Family

ID=47534217

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012103575482A Pending CN102890715A (en) 2012-09-24 2012-09-24 Device and method for automatically organizing specific domain information

Country Status (1)

Country Link
CN (1) CN102890715A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101158963A (en) * 2007-10-31 2008-04-09 中兴通讯股份有限公司 Information acquisition processing and retrieval system
CN101751458A (en) * 2009-12-31 2010-06-23 暨南大学 Network public sentiment monitoring system and method
CN102495872A (en) * 2011-11-30 2012-06-13 中国科学技术大学 Method and device for conducting personalized news recommendation to mobile device users
CN102831193A (en) * 2012-08-03 2012-12-19 人民搜索网络股份公司 Topic detecting device and topic detecting method based on distributed multistage cluster

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
程葳等: "面向互联网新闻的在线话题检测算法", 《计算机工程》, vol. 35, no. 18, 30 September 2009 (2009-09-30), pages 28 - 30 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095398A (en) * 2016-05-10 2016-11-09 深圳前海信息技术有限公司 Big data mining application process based on DSL and device
CN106095398B (en) * 2016-05-10 2019-07-02 深圳前海信息技术有限公司 Big data development and application method and device based on DSL
CN110633406A (en) * 2018-06-06 2019-12-31 北京百度网讯科技有限公司 Event topic generation method and device, storage medium and terminal equipment

Similar Documents

Publication Publication Date Title
CN102831199B (en) Method and device for establishing interest model
CN103177090B (en) A kind of topic detection method and device based on big data
CN111708740A (en) Mass search query log calculation analysis system based on cloud platform
CN106383887A (en) Environment-friendly news data acquisition and recommendation display method and system
CN103365924A (en) Method, device and terminal for searching information
CN101788988B (en) Information extraction method
CN110705288A (en) Big data-based public opinion analysis system
CN101751458A (en) Network public sentiment monitoring system and method
CN102831220A (en) Subject-oriented customized news information extraction system
CN101593200A (en) Chinese Web page classification method based on the keyword frequency analysis
CN102567494B (en) Website classification method and device
CN104978332B (en) User-generated content label data generation method, device and correlation technique and device
US10467255B2 (en) Methods and systems for analyzing reading logs and documents thereof
CN111538931A (en) Big data-based public opinion monitoring method and device, computer equipment and medium
Nikhil et al. A survey on text mining and sentiment analysis for unstructured web data
CN103942268A (en) Method and device for combining search and application and application interface
Li Research on technology, algorithm and application of web mining
CN112818200A (en) Data crawling and event analyzing method and system based on static website
CN103823847A (en) Keyword extension method and device
KR100557874B1 (en) Method of scientific information analysis and media that can record computer program thereof
CN111859108A (en) Public opinion system search word recommendation system
CN102890715A (en) Device and method for automatically organizing specific domain information
CN107622125B (en) Information crawling method and device and electronic equipment
CN109948015B (en) Meta search list result extraction method and system
CN109033133A (en) Event detection and tracking based on Feature item weighting growth trend

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130123
