CN101206653A - System and method for automatically collecting network information - Google Patents
- Publication number
- CN101206653A
- Authority
- CN
- China
- Prior art keywords
- search
- network information
- webpage
- module
- link
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a system and method for automatically collecting network information. The method comprises the following steps: search websites are classified into categories corresponding to each interest and hobby; the content of files stored by a user, or the content of the web pages corresponding to stored links, is analyzed to generate keywords corresponding to the interests and hobbies of the user; after the category of a keyword is determined, links related to the keyword are searched for using the search websites included in the category corresponding to the keyword; and the web pages corresponding to the links found are downloaded.
Description
Technical field
The present invention relates to a system and method for automatically collecting network information, and in particular to a system and method that automatically analyze a user's interests and hobbies and collect network information according to the result of the analysis.
Background technology
With the rise of the network, more and more data is published on the network in the form of web pages. However, even with so much data published, a user who does not know the address of a page still cannot obtain the data he or she needs. Search websites were therefore created: by entering only a keyword, the user can find the addresses of related web pages, and can then download the pages found through the search website to obtain the needed data.
In the past, data was collected manually by the user. Manual collection yields data that meets the user's needs, but precisely because it is manual, the amount of data collected is small and the user must spend a great deal of extra time collecting it. To reduce this time, data-collection programs came into use. Such a program typically takes a keyword entered by the user and sends a request containing that keyword to a specific search website, so that the search website finds links related to the keyword. After obtaining the links, the program downloads the web pages corresponding to them, completing the collection of data.
In summary, because existing data-collection programs require the user to enter keywords, a user who wants to collect data that has not yet been collected must set new keywords each time, which is inconvenient. In addition, because existing data-collection programs search with a specific search website, the relevance of the collected data varies with the quality of that search website; when a large amount of data is collected, the irrelevant data becomes a nuisance to the user.
Summary of the invention
The technical problem to be solved by the present invention is to provide a system and method for automatically collecting network information. Keywords corresponding to the user's interests and hobbies are produced by analyzing the content of files stored by the user or the content of the web pages corresponding to stored links, and links related to the keywords are searched for using search websites related to those interests and hobbies. Data highly relevant to the user's interests and hobbies can thus be collected, which solves the problems of collecting data with existing data-collection programs.
To achieve the above object, the present invention is realized as both a system and a method. The system provided by the present invention includes a storage module, a sort module, an analysis module, a search module, and a download module.
The method disclosed by the present invention includes the following steps: storing at least one search website; classifying each search website into the category corresponding to an interest or hobby; analyzing at least one item of data stored by the user to produce at least one keyword corresponding to the user's interests and hobbies; determining the category corresponding to the keyword; searching, in the search websites included in that category, for at least one link related to the keyword; and downloading the web page corresponding to the link.
The detailed features and implementation of the present invention are described below in the embodiment with reference to the drawings. The description is sufficient for anyone skilled in the relevant art to understand the technical content of the invention and to implement it, and from the content and drawings disclosed in this specification, anyone skilled in the relevant art can readily understand the objects and advantages of the invention.
Description of drawings
Fig. 1 is a system architecture diagram of the present invention for automatically collecting network information;
Fig. 2 is a flow chart of the method of the present invention for automatically collecting network information.
The reference numerals are as follows:
100 electronic device
110 storage module
120 sort module
130 analysis module
140 search module
150 download module
190 detection module
Step 240 determine whether to search
Step 260 download the web page corresponding to the link
Embodiment
The operation of the system of the present invention is first explained with reference to the system architecture diagram of Fig. 1. As shown in Fig. 1, the system includes a storage module 110, a sort module 120, an analysis module 130, a search module 140, and a download module 150. The storage module 110 stores at least one search website. The sort module 120 classifies the search websites stored in the storage module into the categories corresponding to various interests and hobbies. The analysis module 130 analyzes at least one item of data stored by the user, where the stored data includes files or links, to determine the user's interests and hobbies and produce at least one corresponding keyword; it also determines the interest and hobby category corresponding to each keyword it produces. The search module 140 searches, in the search websites included in the category corresponding to a keyword, for at least one link related to that keyword. The download module 150 downloads the web page corresponding to the link.
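The division of work among the modules can be pictured with a minimal Python sketch. It is an illustration only, not the implementation described by the patent: the class name `Collector`, its field names, and the callables passed in are all hypothetical stand-ins for modules 110 to 150.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Collector:
    """Hypothetical wiring of modules 110-150; every callable is a stand-in."""

    # Storage module 110 after sorting by module 120: category -> search websites.
    sites_by_category: Dict[str, List[str]]
    # Analysis module 130: stored user data -> {keyword: category}.
    analyze: Callable[[List[str]], Dict[str, str]]
    # Search module 140: (search website, keyword) -> related links.
    search: Callable[[str, str], List[str]]
    # Download module 150: link -> page content.
    download: Callable[[str], str]

    def collect(self, user_data: List[str]) -> Dict[str, str]:
        """Run analysis, category lookup, search, and download in order."""
        pages: Dict[str, str] = {}
        for keyword, category in self.analyze(user_data).items():
            # Only search websites sorted into the keyword's category are used.
            for site in self.sites_by_category.get(category, []):
                for link in self.search(site, keyword):
                    pages[link] = self.download(link)
        return pages
```

Passing the modules in as plain callables keeps the sketch free of any real network access, so each stage can be replaced by a stub when trying it out.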
The operation of the system and method of the present invention is explained next with an embodiment; please refer to the method flow chart of Fig. 2.
Before collecting data, the present invention must first classify the search websites according to various interests and hobbies. The search websites to be classified may be stored in advance in the storage module 110, or may be entered by the user and saved in the storage module 110.
Suppose a first search website and a second search website are stored in the storage module 110. The sort module 120 classifies the first and second search websites into the categories corresponding to the various interests and hobbies (step 210). One possible method is to query the first and second search websites with several specific test words and classify each website according to the results it returns, but the method of classifying search websites is not limited to this. In the present embodiment, the sort module 120 classifies the first search website into the programming category and the second search website into the games category. The programming and games categories are categories arising from different interests and hobbies. Because interests and hobbies can be divided into a great many categories, and which categories are used is not the focus of the invention, they are not described further.
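The test-word approach to step 210 can be sketched as follows. This is only one way to read the description, under stated assumptions: the function `classify_sites`, the per-category `probe_words`, and the `count_results` callable (which would report how many results a site returns for a word) are hypothetical, and scoring by result count is an assumption rather than something the text specifies.

```python
from typing import Callable, Dict, List


def classify_sites(
    sites: List[str],
    probe_words: Dict[str, List[str]],
    count_results: Callable[[str, str], int],
) -> Dict[str, List[str]]:
    """Assign each site to the category whose test words return the most results."""
    by_category: Dict[str, List[str]] = {category: [] for category in probe_words}
    for site in sites:
        # Score every category by the total number of results for its test words.
        scores = {
            category: sum(count_results(site, word) for word in words)
            for category, words in probe_words.items()
        }
        best = max(scores, key=scores.get)
        by_category[best].append(site)
    return by_category


# Made-up result counts standing in for real queries to the two search websites.
fake_counts = {
    ("first", "compiler"): 40,
    ("first", "walkthrough"): 2,
    ("second", "compiler"): 1,
    ("second", "walkthrough"): 30,
}
categories = classify_sites(
    ["first", "second"],
    {"programming": ["compiler"], "games": ["walkthrough"]},
    lambda site, word: fake_counts.get((site, word), 0),
)
```

With these made-up counts the first search website lands in the programming category and the second in the games category, matching the embodiment.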
When the user uses the present invention to collect data, the analysis module 130 analyzes data the user has stored in order to derive keywords corresponding to the user's interests and hobbies (step 220). The stored data includes files in a particular directory or links stored in a particular directory. The analysis module 130 reads the content of the files, or the content of the web pages the links point to, and analyzes it, for example with an existing article classifier, to obtain at least one keyword. The invention is not limited to analysis with an article classifier.
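As a rough illustration of step 220, the sketch below derives keywords by simple word frequency. This is a deliberately simplified substitute for the article classifier mentioned in the text, not the classifier itself; the function `extract_keywords`, the small stop-word list, and the `top_n` parameter are assumptions made for the example.

```python
import re
from collections import Counter
from typing import Iterable, List

# A tiny illustrative stop-word list; a real analysis would need a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "with"}


def extract_keywords(documents: Iterable[str], top_n: int = 3) -> List[str]:
    """Return the most frequent non-trivial words across the stored documents."""
    counts: Counter = Counter()
    for text in documents:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]


stored = [
    "Boss walkthrough for the dungeon",
    "Dungeon map and boss guide",
    "boss tips",
]
keywords = extract_keywords(stored, top_n=2)
```

Here `stored` plays the role of file contents or linked page contents that the analysis module has already read.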
The analysis module 130 then establishes the correspondence between the keywords obtained and each interest or hobby, that is, it determines the category corresponding to each keyword (step 230). For example, if the analysis module 130 finds that the user has stored game walkthrough files, or that most of the links stored in the user's "My Favorites" directory point to game discussion sites, it obtains keywords whose corresponding category is "games". Because the keywords are derived from articles related to the user's interests and hobbies, they match those interests and hobbies to a considerable degree; in other words, the category corresponding to a keyword is a category corresponding to the user's interests and hobbies.
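One way to picture step 230 is a majority vote: the category of a keyword is taken to be the most common category among the stored documents that contain it. The text does not prescribe this rule, so the sketch below is an assumption; `keyword_category` and the `label_document` callable (which would assign a category to a single document) are hypothetical names.

```python
from collections import Counter
from typing import Callable, List, Optional


def keyword_category(
    keyword: str,
    documents: List[str],
    label_document: Callable[[str], str],
) -> Optional[str]:
    """Pick the most common category among documents containing the keyword."""
    votes = Counter(
        label_document(doc) for doc in documents if keyword in doc.lower()
    )
    # No document mentions the keyword: no category can be determined.
    return votes.most_common(1)[0][0] if votes else None


docs = [
    "Boss walkthrough for level 3",
    "Forum thread: best boss strategy",
    "Notes on sorting algorithms",
]


def toy_label(doc: str) -> str:
    """Toy labeler: anything mentioning algorithms is programming, else games."""
    return "programming" if "algorithm" in doc.lower() else "games"


category = keyword_category("boss", docs, toy_label)
```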
In the analysis and classification steps (steps 220 and 230), if the user's interests and hobbies are broad, the keywords obtained will be spread across different categories. If the user has only recently added one or two links on a new interest to the "My Favorites" directory, the analysis module 130 can still produce corresponding keywords, because the content of those few links differs strongly from the other content.
After the keywords have been analyzed and classified (steps 220 and 230), the search module 140 uses a keyword to search the second search website, which belongs to the keyword's category "games" (step 250). The search of the second search website yields links related to the keyword, and the download module 150 then downloads the content of the web pages corresponding to those links (step 260), thereby automatically collecting data on the user's interests and hobbies.
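Obtaining links from a search result page, as in step 250, might look like the following sketch. It assumes the result page is ordinary HTML whose results appear as anchor elements; the class `LinkExtractor`, the function `extract_links`, and the example URLs are illustrative only. The sketch uses `html.parser.HTMLParser` and `urllib.parse.urljoin` from the Python standard library and performs no network access.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every anchor tag, resolved against a base URL."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Relative links are resolved against the result page address.
                self.links.append(urljoin(self.base_url, href))


def extract_links(result_html: str, base_url: str) -> List[str]:
    """Return all links found in a search result page, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(result_html)
    return parser.links
```

The download module would then fetch each returned link; the fetch itself is left out so that the sketch stays self-contained.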
To keep searching and downloading from occupying a large amount of hardware resources or network bandwidth, the present invention further includes a detection module 190, which detects the state of the electronic device 100 on which the invention runs. Only when the electronic device is in a specific state does it enable the search module 140 to search for links related to the keyword (step 240). The specific state is, for example, that no data is being input or that processor utilization is below a particular value, but the enabling conditions are not limited to these two states.
When data is being input, the user is operating the electronic device 100; so as not to interfere with normal use, the detection module 190 suspends the search module 140. When no data is being input, the user is temporarily not using the electronic device 100, so running the invention does not affect the user. The detection module 190 enables the search module when processor utilization is low for the same reason.
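The gating performed by the detection module 190 in step 240 can be sketched as a small predicate. The text names the two conditions only as separate examples, so requiring both at once, as done here, is a conservative assumption; the names `DeviceState` and `search_enabled` and the default thresholds of 60 seconds and 20 percent are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceState:
    """Snapshot of the electronic device as seen by the detection module."""

    seconds_since_last_input: float
    cpu_percent: float


def search_enabled(
    state: DeviceState,
    idle_after: float = 60.0,
    cpu_limit: float = 20.0,
) -> bool:
    """Enable searching only when the user is idle and the processor is quiet."""
    no_input = state.seconds_since_last_input >= idle_after
    cpu_low = state.cpu_percent < cpu_limit
    return no_input and cpu_low
```

How the snapshot is obtained (keyboard and mouse hooks, an operating system load counter) is platform specific and is left outside the sketch.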
In addition, to avoid collecting the same data repeatedly, the download module 150 compares the update time of a web page with the time the page was last downloaded before downloading its content. If the update time is later than the last download time, the page has been updated and needs to be downloaded again. If the update time is earlier than the last download time, the current version of the page has already been downloaded and does not need to be downloaded again.
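The comparison of update time and last download time reduces to a single check, sketched below. The function name `needs_download` is hypothetical, and treating a page that has never been downloaded as needing download is an assumption the text does not state explicitly.

```python
from datetime import datetime, timezone
from typing import Optional


def needs_download(page_updated: datetime, last_downloaded: Optional[datetime]) -> bool:
    """Download when the page was never fetched or changed after the last fetch."""
    if last_downloaded is None:
        return True
    return page_updated > last_downloaded


# Example timestamps, all in UTC so that they compare consistently.
last_fetch = datetime(2006, 12, 1, tzinfo=timezone.utc)
updated_later = datetime(2006, 12, 20, tzinfo=timezone.utc)
updated_earlier = datetime(2006, 11, 15, tzinfo=timezone.utc)
```

In practice the update time might come from a timestamp reported by the web server, but where it comes from is not specified in the text.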
Furthermore, the method of the present invention for automatically collecting network information can be implemented in hardware, software, or a combination of hardware and software, and can be realized in a centralized manner in a single computer system or in a distributed manner in which different components are spread across several interconnected computer systems.
Although the present invention is disclosed above by way of the foregoing preferred embodiment, the embodiment is not intended to limit the invention. Those of ordinary skill in the art may make various corresponding changes and modifications without departing from the spirit and essence of the invention, and all such changes and modifications fall within the scope of protection of the appended claims.
Claims (10)
1. A method for automatically collecting network information, applied on an electronic device, characterized in that the method comprises the following steps:
storing at least one search website;
classifying the search website into a category corresponding to each interest and hobby;
analyzing at least one item of data stored by a user to produce at least one keyword corresponding to the interests and hobbies of the user;
determining the category corresponding to the keyword;
searching, in the search website included in the category, for at least one link related to the keyword; and
downloading the web page corresponding to the link.
2. The method for automatically collecting network information according to claim 1, characterized in that the step of searching for the link further comprises the step of searching for the link upon determining that no data is being input.
3. The method for automatically collecting network information according to claim 1, characterized in that the step of searching for the link further comprises the step of searching for the link upon determining that the utilization of a processor of the electronic device is below a particular value.
4. The method for automatically collecting network information according to claim 1, characterized in that the step of downloading the web page further comprises downloading the web page upon determining that the update time of the web page is later than the time the web page was last downloaded.
5. A system for automatically collecting network information, applied on an electronic device, characterized in that the system comprises:
a storage module for storing at least one search website;
a sort module for classifying the search website into a category corresponding to each interest and hobby;
an analysis module for analyzing at least one item of data stored by a user to produce at least one keyword corresponding to the interests and hobbies of the user, and for determining the category corresponding to the keyword;
a search module for searching, in the search website included in the category, for at least one link related to the keyword; and
a download module for downloading the web page corresponding to the link.
6. The system for automatically collecting network information according to claim 5, characterized in that the data comprises at least one file.
7. The system for automatically collecting network information according to claim 5, characterized in that the data comprises the web page corresponding to at least one link.
8. The system for automatically collecting network information according to claim 5, characterized in that the system further comprises a detection module for enabling the search module upon detecting that no data is being input.
9. The system for automatically collecting network information according to claim 5, characterized in that the detection module is further used to enable the search module upon detecting that the utilization of a processor of the electronic device is below a particular value.
10. The system for automatically collecting network information according to claim 5, characterized in that the download module is further used to download the web page upon determining that the update time of the web page is later than the time the web page was last downloaded.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2006101707848A CN101206653A (en) | 2006-12-22 | 2006-12-22 | System and method for automatically collecting network information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2006101707848A CN101206653A (en) | 2006-12-22 | 2006-12-22 | System and method for automatically collecting network information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101206653A true CN101206653A (en) | 2008-06-25 |
Family
ID=39566861
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNA2006101707848A Pending CN101206653A (en) | 2006-12-22 | 2006-12-22 | System and method for automatically collecting network information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101206653A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102073678A (en) * | 2010-12-03 | 2011-05-25 | 厦门市美亚柏科信息股份有限公司 | System and method for analyzing information of websites |
CN102073678B (en) * | 2010-12-03 | 2013-02-27 | 厦门市美亚柏科信息股份有限公司 | System and method for analyzing information of websites |
CN103577478A (en) * | 2012-08-06 | 2014-02-12 | 腾讯科技(深圳)有限公司 | Web page pushing method and system |
CN103577478B (en) * | 2012-08-06 | 2015-07-29 | 腾讯科技(深圳)有限公司 | Web page push method and system |
WO2015149533A1 (en) * | 2014-03-31 | 2015-10-08 | 北京奇虎科技有限公司 | Method and device for word segmentation processing on basis of webpage content classification |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Szomszor et al. | Semantic modelling of user interests based on cross-folksonomy analysis | |
US7827191B2 (en) | Discovering web-based multimedia using search toolbar data | |
Drost et al. | Thwarting the nigritude ultramarine: Learning to identify link spam | |
CN103544188B (en) | The user preference method for pushing of mobile Internet content and device | |
CN101853300B (en) | Method and system for identifying and evaluating video downloading service website | |
US20090204617A1 (en) | Content acquisition system and method of implementation | |
US20040260695A1 (en) | Systems and methods to tune a general-purpose search engine for a search entry point | |
US20020138525A1 (en) | Computer method and apparatus for determining content types of web pages | |
CN103823883A (en) | Analysis method and system for website user access path | |
US9053186B2 (en) | Method and apparatus for detecting and explaining bursty stream events in targeted groups | |
Achsan et al. | A fast distributed focused-web crawling | |
CN103455758A (en) | Method and device for identifying malicious website | |
US20100161599A1 (en) | Computer Method and Apparatus of Information Management and Navigation | |
CN107766234A (en) | A kind of assessment method, the apparatus and system of the webpage health degree based on mobile device | |
US20100031178A1 (en) | Computer system, information collection support device, and method for supporting information collection | |
CN103902579A (en) | Method and device for acquiring information | |
CN102902790A (en) | Web page classification system and method | |
CN101206653A (en) | System and method for automatically collecting network information | |
CN102902794A (en) | Web page classification system and method | |
CN107704494B (en) | User information collection method and system based on application software | |
KR100557874B1 (en) | Method of scientific information analysis and media that can record computer program thereof | |
Jayanetti et al. | Robots still outnumber humans in web archives, but less than before | |
CN105245394A (en) | Method and equipment for analyzing network access log based on layered approach | |
CN102129441B (en) | Web page information identifying and processing method and device | |
KR20200119534A (en) | Ontology-based multilingual url filtering apparatus |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Open date: 20080625 |