CN101206653A - System and method for automatically collecting network information - Google Patents


Info

Publication number
CN101206653A
Authority
CN
China
Prior art keywords
search
network information
webpage
module
link
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2006101707848A
Other languages
Chinese (zh)
Inventor
邱全成
叶建华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inventec Corp
Original Assignee
Inventec Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inventec Corp
Priority to CNA2006101707848A
Publication of CN101206653A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a system and method for automatically collecting network information. Search websites are classified into categories corresponding to individual interests and hobbies. The content of files stored by a user, or of the web pages behind links the user has stored, is analyzed to generate keywords that reflect the user's interests and hobbies. Once the category of a keyword has been determined, the search websites in that category are used to search for links related to the keyword, and the web pages those links point to are downloaded.

Description

System and method for automatically collecting network information
Technical field
The present invention relates to a system and method for automatically collecting network information, and in particular to a system and method that automatically analyze a user's interests and hobbies and collect network information according to the results of that analysis.
Background art
With the rise of the Internet, more and more data is published as web pages. Yet however much data is available online, a user who does not know the address of a page still cannot obtain the data it holds. Search websites arose to fill this gap: the user enters a keyword, the search website returns the addresses of related web pages, and the user downloads those pages to obtain the needed data.
In the past, data was collected entirely by hand. Manual collection does yield data that meets the user's needs, but the volume collected is small and the user must set aside a good deal of extra time. To cut that time, data-collection programs came into use. Typically the user enters a keyword, the program sends a request containing that keyword to a specific search website, the search website returns links related to the keyword, and the program downloads the web pages behind those links to complete the collection.
In summary, current data-collection programs have two drawbacks. First, they require the user to enter keywords, so whenever the user wants data that has not yet been collected, a new keyword must be set, which is inconvenient. Second, they search through one specific search website, so the relevance of the collected data varies with the quality of that website; when a large volume of data is collected, the user is burdened with irrelevant results.
Summary of the invention
The technical problem to be solved by the present invention is to provide a system and method for automatically collecting network information. Keywords corresponding to the user's interests and hobbies are generated by analyzing the content of files the user has stored, or of the web pages behind links the user has stored. Search websites related to those interests and hobbies are then used to search for links related to the keywords. Data closely associated with the user's interests and hobbies can thus be collected, which solves the problems of current data-collection programs described above.
To achieve the above object, the present invention is realized as both a system and a method. The system provided by the present invention comprises a storage module, a classification module, an analysis module, a search module, and a download module.
The method disclosed by the present invention comprises the following steps: storing at least one search website; classifying the search websites into categories corresponding to individual interests and hobbies; analyzing at least one item of data stored by the user to generate at least one keyword corresponding to the user's interests and hobbies; determining the category to which the keyword corresponds; searching, in the search websites included in that category, for at least one link related to the keyword; and downloading the web page corresponding to the link.
The detailed features and practice of the present invention are described below in the embodiments with reference to the drawings. The description is sufficient for anyone skilled in the relevant art to understand and implement the technical content of the invention, and from the content and drawings disclosed in this specification such a person can readily understand the related objects and advantages of the invention.
Brief description of the drawings
Fig. 1 is a system architecture diagram of the automatic network information collection system of the present invention;
Fig. 2 is a flowchart of the automatic network information collection method of the present invention.
The reference numerals are:
100 electronic device
110 storage module
120 classification module
130 analysis module
140 search module
150 download module
190 detection module
Step 210: store and classify the search websites
Step 220: generate keywords corresponding to the user's interests and hobbies
Step 230: classify the keywords
Step 240: determine whether to search
Step 250: search for related links on the search websites included in the category corresponding to the keyword
Step 260: download the web page corresponding to the link
Detailed description of the embodiments
The operation of the system is first explained with reference to the system architecture diagram of Fig. 1. As shown in Fig. 1, the system comprises a storage module 110, a classification module 120, an analysis module 130, a search module 140, and a download module 150. Storage module 110 stores at least one search website. Classification module 120 classifies the search websites stored in the storage module into the categories corresponding to the various interests and hobbies. Analysis module 130 analyzes the user's interests and hobbies from at least one item of data stored by the user, where the data comprises files or links, generates at least one corresponding keyword, and determines the interest-and-hobby category to which each generated keyword corresponds. Search module 140 searches, in the search websites included in the keyword's category, for at least one link related to the keyword. Download module 150 downloads the web page corresponding to the link.
The operation of the system and method is explained next by way of an embodiment, with reference to the method flowchart of Fig. 2.
Before collecting data, the invention must first classify the search websites according to the various interests and hobbies. The search websites to be classified may be stored in storage module 110 in advance, or may be entered by the user and then saved to storage module 110.
Suppose storage module 110 holds a first search website and a second search website. Classification module 120 then classifies the two websites into the categories corresponding to the various interests and hobbies (step 210). One possible method is to query the first and second search websites with a few specific test words and classify each website by the results it returns, although the invention is not limited to this method. In this embodiment, classification module 120 assigns the first search website to the programming category and the second search website to the games category. These are two of the many categories that different interests and hobbies can give rise to; which categories exist is not the focus of the invention and is not described further.
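The test-word classification of step 210 might be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the test results are scored, so the sketch simply counts how many results each site returns for each category's test words and files the site under the category with the highest count. The names `classify_sites`, `PROBES`, and the two stand-in sites are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical model: a search website is a function from a query to result links.
SearchSite = Callable[[str], List[str]]


def classify_sites(sites: Dict[str, SearchSite],
                   probes: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """File each site under the category whose test words return the most results."""
    categories: Dict[str, List[str]] = {name: [] for name in probes}
    for site_name, search in sites.items():
        # Assumed scoring rule: total number of results over a category's test words.
        scores = {cat: sum(len(search(word)) for word in words)
                  for cat, words in probes.items()}
        best = max(scores, key=scores.get)
        categories[best].append(site_name)
    return categories


# Stand-in sites with canned results, so the sketch runs without network access.
def first_site(query: str) -> List[str]:
    index = {"compiler": ["p1", "p2"], "debugger": ["p3"]}
    return index.get(query, [])


def second_site(query: str) -> List[str]:
    index = {"rpg": ["g1", "g2", "g3"]}
    return index.get(query, [])


SITES = {"first": first_site, "second": second_site}
PROBES = {"programming": ["compiler", "debugger"], "games": ["rpg"]}

result = classify_sites(SITES, PROBES)
```

With these stand-ins the first site lands in the programming category and the second in the games category, matching the embodiment.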
When the user uses the invention to collect data, analysis module 130 analyzes the data the user has already stored in order to derive keywords corresponding to the user's interests and hobbies (step 220). The stored data comprises the files in a particular directory or the links saved in a particular directory. Analysis module 130 reads the content of the files, or the content of the web pages the links point to, and analyzes it, for example with an existing article classifier, to obtain at least one keyword. The invention is not limited to analysis with an article classifier.
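As a rough illustration of step 220, the sketch below derives keywords from stored text by plain term frequency. This is a swapped-in technique, not the article classifier mentioned in the text, and the function name `extract_keywords`, the stop-word list, and the sample documents are all assumptions made for the example.

```python
import re
from collections import Counter
from typing import Iterable, List

# Assumed, deliberately tiny stop-word list; a real analysis would need a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "is", "on", "with"}


def extract_keywords(documents: Iterable[str], top_n: int = 3) -> List[str]:
    """Return the top_n most frequent non-stop-words across the stored documents."""
    counts: Counter = Counter()
    for text in documents:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]


# Stand-in for file contents or the text of pages behind saved links.
docs = ["Dungeon walkthrough for the dungeon boss", "Boss guide: dungeon map"]
keywords = extract_keywords(docs, top_n=2)
```

Term frequency is used here only because it is easy to follow; any method that turns stored content into representative keywords would fill the same role.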
Analysis module 130 then establishes the correspondence between the keywords obtained and the individual interests and hobbies, that is, it determines the category to which each keyword corresponds (step 230). For example, if the files the user has stored are game walkthroughs, or most of the links saved in the user's "My Favorites" directory point to game discussion websites, analysis module 130 derives the keywords and determines that their category is "games". Because the keywords are derived from articles related to the user's interests and hobbies, they match those interests and hobbies to a considerable degree; in other words, the category of a keyword is a category of the user's interests and hobbies.
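One way to picture step 230 is a majority vote: each keyword takes the category of the stored items it most often comes from. The text does not specify the mechanism, so this is an assumption, as are the name `categorize_keywords` and the idea that each stored item already carries a category label.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def categorize_keywords(labelled_docs: List[Tuple[str, List[str]]]) -> Dict[str, str]:
    """Map each keyword to the category of the documents it most often appears in.

    labelled_docs holds (category, keywords found in that document) pairs;
    the category labels are assumed to come from an earlier analysis step.
    """
    votes: Dict[str, Counter] = defaultdict(Counter)
    for category, keywords in labelled_docs:
        for keyword in keywords:
            votes[keyword][category] += 1
    return {kw: counter.most_common(1)[0][0] for kw, counter in votes.items()}


# Stand-in data: two game walkthroughs and one programming article.
stored = [
    ("games", ["dungeon", "boss"]),
    ("games", ["dungeon"]),
    ("programming", ["compiler", "dungeon"]),
]
mapping = categorize_keywords(stored)
```

Here "dungeon" appears in two game documents and one programming document, so it is filed under games.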
In the analysis and keyword-classification steps (steps 220 and 230), if the user's interests and hobbies are broad, the keywords obtained will be spread across several different categories. If the user has only recently added one or two links on a newly acquired interest to the "My Favorites" directory, analysis module 130 still generates the corresponding keywords, because the content of those few items differs markedly from the rest of the stored content.
After the keywords have been analyzed and classified (steps 220 and 230), search module 140 uses each keyword to search the second search website, which belongs to the keyword's category, "games" (step 250). The search returns links related to the keyword, and download module 150 downloads the content of the web pages those links point to (step 260). The data matching the user's interests and hobbies is thereby collected automatically.
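Steps 250 and 260 can be sketched as a search restricted to the sites of one category, followed by a download of each returned link. The function `collect`, the stand-in site, and the `fake_fetch` downloader are hypothetical; a real downloader (for instance one built on `urllib.request.urlopen`) would be passed in place of `fake_fetch`.

```python
from typing import Callable, Dict, List

SearchSite = Callable[[str], List[str]]  # hypothetical: keyword -> result links
Fetch = Callable[[str], str]             # hypothetical: link -> page content


def collect(keyword: str, category: str,
            sites_by_category: Dict[str, List[SearchSite]],
            fetch: Fetch) -> Dict[str, str]:
    """Search only the sites filed under `category`, then download each link once."""
    pages: Dict[str, str] = {}
    for search in sites_by_category.get(category, []):  # step 250
        for link in search(keyword):
            if link not in pages:
                pages[link] = fetch(link)               # step 260
    return pages


# Stand-ins so the sketch runs without network access.
def second_site(keyword: str) -> List[str]:
    return [f"http://games.example/{keyword}/1",
            f"http://games.example/{keyword}/2"]


def fake_fetch(link: str) -> str:
    return f"<html>content of {link}</html>"


pages = collect("dungeon", "games", {"games": [second_site]}, fake_fetch)
```

Passing the downloader in as a parameter keeps the search logic testable offline; a keyword whose category has no registered site simply yields no pages.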
To keep searching and downloading from consuming large amounts of hardware resources or network bandwidth, the invention further includes a detection module 190, which detects the state of the electronic device 100 on which the invention runs. Only when the electronic device is in a specific state is search module 140 enabled to search for links related to the keyword (step 240). Examples of the specific state are that no data is being input or that the processor utilization is below a particular value, although the enabling conditions are not limited to these two states.
When data is being input, the user is operating electronic device 100, so detection module 190 suspends search module 140 to avoid interfering with normal use. When no data is being input, the user is not currently using electronic device 100, and the operation of the invention does not affect the user. Detection module 190 enables the search module when processor utilization is low for the same reason.
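The gating decision of step 240 might look like the sketch below. The text names two example conditions without saying how they combine, so the sketch treats them as alternatives by default and offers a `require_both` switch; that switch, the function names, and the threshold values (60 seconds of input idleness, 20 percent processor utilization) are assumptions. How the idle time and utilization are measured is platform-specific and left to the caller.

```python
def is_input_idle(seconds_since_input: float, idle_after: float = 60.0) -> bool:
    """True when no input has arrived for at least `idle_after` seconds (assumed cutoff)."""
    return seconds_since_input >= idle_after


def is_cpu_low(cpu_percent: float, threshold: float = 20.0) -> bool:
    """True when processor utilization is below the (assumed) threshold."""
    return cpu_percent < threshold


def search_enabled(seconds_since_input: float, cpu_percent: float,
                   *, require_both: bool = False) -> bool:
    """Decide whether the search module may run (step 240)."""
    idle = is_input_idle(seconds_since_input)
    low = is_cpu_low(cpu_percent)
    return (idle and low) if require_both else (idle or low)
```

For example, `search_enabled(120, 90)` allows a search because the user has been idle for two minutes, while `search_enabled(5, 90)` does not because the user is typing and the processor is busy.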
In addition, to avoid collecting the same data repeatedly, download module 150 compares the update time of a web page with the time the page was last downloaded before downloading its content. If the update time is later than the last download time, the page has been updated and must be downloaded again. If the update time is earlier than the last download time, the page has already been downloaded and need not be downloaded again.
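The time comparison can be sketched as below. The text does not say where the update time comes from; the example assumes it is read from an HTTP `Last-Modified` header, parsed with the standard-library `email.utils.parsedate_to_datetime`. The function name `needs_download` and the rule that a never-downloaded page is always fetched are also assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def needs_download(last_modified: datetime,
                   last_downloaded: Optional[datetime]) -> bool:
    """Download only if the page changed after the previous download."""
    if last_downloaded is None:
        # Assumption: a page that was never downloaded is always fetched.
        return True
    return last_modified > last_downloaded


# Assumed source of the update time: an HTTP Last-Modified header value.
modified = parsedate_to_datetime("Wed, 21 Oct 2015 07:28:00 GMT")

stale_copy = datetime(2015, 10, 20, tzinfo=timezone.utc)  # downloaded before the update
fresh_copy = datetime(2015, 10, 22, tzinfo=timezone.utc)  # downloaded after the update
```

With these values, `needs_download(modified, stale_copy)` is true and `needs_download(modified, fresh_copy)` is false. Both timestamps should be timezone-aware, since Python refuses to compare aware and naive datetimes.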
Furthermore, the method of the invention may be implemented in hardware, in software, or in a combination of the two, and may be realized either in a centralized manner in a single computer system or in a distributed manner in which different components are spread across several interconnected computer systems.
Although the invention has been disclosed above by way of the foregoing preferred embodiment, that embodiment is not intended to limit the invention. Those of ordinary skill in the art may make various corresponding changes and modifications without departing from the spirit and essence of the invention, and all such changes and modifications fall within the scope of protection of the appended claims.

Claims (10)

1. A method for automatically collecting network information, applied to an electronic device, characterized in that the method comprises the following steps:
Storing at least one search website;
Classifying the search website into a category corresponding to each interest and hobby;
Analyzing at least one item of data stored by a user to generate at least one keyword corresponding to the user's interests and hobbies;
Determining the category to which the keyword corresponds;
Searching, in the search website included in the category, for at least one link related to the keyword; and
Downloading the web page corresponding to the link.
2. The method for automatically collecting network information according to claim 1, characterized in that the step of searching for the link further comprises searching for the link upon determining that no data is being input.
3. The method for automatically collecting network information according to claim 1, characterized in that the step of searching for the link further comprises searching for the link upon determining that a processor utilization of the electronic device is below a particular value.
4. The method for automatically collecting network information according to claim 1, characterized in that the step of downloading the web page further comprises downloading the web page upon determining that the update time of the web page is later than the time the web page was last downloaded.
5. A system for automatically collecting network information, applied to an electronic device, characterized in that the system comprises:
A storage module for storing at least one search website;
A classification module for classifying the search website into a category corresponding to each interest and hobby;
An analysis module for analyzing at least one item of data stored by a user to generate at least one keyword corresponding to the user's interests and hobbies, and for determining the category to which the keyword corresponds;
A search module for searching, in the search website included in the category, for at least one link related to the keyword; and
A download module for downloading the web page corresponding to the link.
6. The system for automatically collecting network information according to claim 5, characterized in that the data comprises at least one file.
7. The system for automatically collecting network information according to claim 5, characterized in that the data comprises the web page corresponding to at least one link.
8. The system for automatically collecting network information according to claim 5, characterized in that the system further comprises a detection module for enabling the search module upon detecting that no data is being input.
9. The system for automatically collecting network information according to claim 5, characterized in that the detection module also enables the search module upon detecting that the utilization of a processor of the electronic device is below a particular value.
10. The system for automatically collecting network information according to claim 5, characterized in that the download module also downloads the web page upon determining that the update time of the web page is later than the time the web page was last downloaded.
CNA2006101707848A 2006-12-22 2006-12-22 System and method for automatically collecting network information Pending CN101206653A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2006101707848A CN101206653A (en) 2006-12-22 2006-12-22 System and method for automatically collecting network information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2006101707848A CN101206653A (en) 2006-12-22 2006-12-22 System and method for automatically collecting network information

Publications (1)

Publication Number Publication Date
CN101206653A (en) 2008-06-25

Family

ID=39566861

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2006101707848A Pending CN101206653A (en) 2006-12-22 2006-12-22 System and method for automatically collecting network information

Country Status (1)

Country Link
CN (1) CN101206653A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102073678A (en) * 2010-12-03 2011-05-25 厦门市美亚柏科信息股份有限公司 System and method for analyzing information of websites
CN102073678B (en) * 2010-12-03 2013-02-27 厦门市美亚柏科信息股份有限公司 System and method for analyzing information of websites
CN103577478A (en) * 2012-08-06 2014-02-12 腾讯科技(深圳)有限公司 Web page pushing method and system
CN103577478B (en) * 2012-08-06 2015-07-29 腾讯科技(深圳)有限公司 Web page push method and system
WO2015149533A1 (en) * 2014-03-31 2015-10-08 北京奇虎科技有限公司 Method and device for word segmentation processing on basis of webpage content classification


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20080625