CN101676907A - Method and system of directionally acquiring Internet resources - Google Patents


Publication number
CN101676907A
Authority
CN
China
Prior art keywords
information
webpage
extracting
resource
website
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN200810222306A
Other languages
Chinese (zh)
Inventor
刘锦山
崔凤雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Leisu Technology Co Ltd
Original Assignee
Beijing Leisu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Leisu Technology Co Ltd
Priority to CN200810222306A
Publication of CN101676907A
Legal status: Pending

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to a method of directionally acquiring Internet resources, comprising the following steps: determining the scope of the websites to be crawled, the resource information to be acquired and the category of those resources; obtaining, on each website and through human-computer interaction, the effective webpages that correspond to the resource category; generating configuration information for the resource information to be acquired from the uniform resource locators (URLs) of the websites and their effective webpages, the webpage structure, and the resource information to be acquired; capturing and saving the text information on the websites that matches the configuration information; deeply indexing the captured text information through human-computer interaction; and creating an index over the deeply indexed text information so that users can retrieve it. The system for directionally acquiring Internet resources comprises an Internet resource directed acquisition unit and a text information deep indexing unit. By acquiring Internet resources directionally, the invention addresses the problems caused by the commonly used search-engine method of acquiring Internet resources: large amounts of junk information, duplicated resources, disorganized resources, and invalid webpage snapshots.

Description

Method and system for directed acquisition of Internet resources
Technical field
The invention relates to the field of Internet search engines, and in particular to a method and system for directed acquisition of Internet resources.
Background art
A search engine is a computer system that collects information on the Internet according to a certain strategy and, after organizing and processing that information, provides a network information service to users. Its main purpose is to help users quickly and efficiently obtain high-quality information in the Internet information environment that meets their needs.
At present, a general-purpose search engine comprises three parts: information search, information organization, and user query. The information search part crawls information on the Internet and stores it on a data server; the information organization part uses an indexer to organize the crawled information so that a searcher can then query it for the user; the user query part provides the user search interface.
At present, the information search part of search engine technology has the following main limitations:
1) Internet resources are acquired by open-ended crawling. For example, when a web crawler captures pages, it starts from a number of specified pages, parses the hyperlinks they contain, downloads the pages those hyperlinks point to, and repeats the process, so that in theory every page on the Internet would be downloaded. However, the websites worth crawling are not determined beforehand, and the websites to be crawled are not analyzed directionally so that only specific pages are captured. The crawl is therefore roaming and non-directed, the captured content is full of junk and useless information, and both the subsequent processing cost and the user's cost of use increase greatly;
2) The captured resources are not edited in depth, which leads to a large amount of duplicated resources;
3) The captured resources are not deeply indexed. Knowledge points such as the subject, topic, author, affiliation and abstract of each item are not provided, so resource management has no complete knowledge system to support it, the resource organization is scattered and unstructured, and in-depth mining and use is very difficult. For example, when a search engine provider captures pages in the information search part, pages of different categories are pooled together without being clustered by industry, subject or topic. When a user then searches with related terms, both subsequent resource consolidation and in-depth use by the user become very difficult.
4) Webpage snapshots become invalid. The snapshots of existing crawling technology do not archive the content, layout and color information of a page locally in a holographic manner, so snapshots end up incomplete or invalid.
Summary of the invention
The purpose of the invention is to provide a method and system for directed acquisition of Internet resources that solve the problems caused by commonly used search engine methods in the prior art: large amounts of junk information, duplicated resources, unorganized resources, and invalid webpage snapshots.
To achieve this purpose, the invention adopts the following technical solution:
A method for directed acquisition of Internet resources, comprising the following steps:
Determining in advance the scope of websites to crawl, the resource information to be acquired, and the resource category it belongs to;
According to the resource category, obtaining on each crawled website, through human-computer interaction, the effective webpages corresponding to that category;
Generating configuration information for the resource information to be acquired, based on the URLs and webpage structure of the crawled websites and of their linked effective webpages, and on the resource information to be acquired;
Capturing and saving the information on the crawled websites that matches the configuration information;
Deeply indexing the captured information through human-computer interaction: organizing it into a unified format, adjusting its classification, and deleting junk information unrelated to the resource information to be acquired;
Building an index over the deeply indexed information for user retrieval.
Wherein, after the index over the deeply indexed information has been built for user retrieval, the method further comprises the step of archiving in original state ("ecosystem" filing) the webpages corresponding to the deeply indexed information, so that when a piece of information cannot be opened at retrieval time, the corresponding archived webpage is called up for the user.
Wherein, during crawling, the method further comprises recording the webpage position at which the previous capture finished, so that the next crawl starts again from that position.
Wherein, during crawling, the method further comprises the step of comparing the information about to be captured with the information already captured and, if the two are identical, not capturing it.
Wherein, the information captured on the crawled websites that matches the configuration information is plain text with source code and advertising removed, and comprises the article's title, author, affiliation, keywords, abstract, body text, URL, crawl time, and category.
The invention also provides a system for directed acquisition of Internet resources, comprising:
An initial information acquiring unit, configured to determine in advance the scope of websites to crawl, the resource information to be acquired, and the resource category it belongs to;
An effective webpage acquiring unit, configured to obtain on each crawled website, according to the resource category and through human-computer interaction, the effective webpages corresponding to that category;
A configuration information generating unit, configured to generate configuration information for the resource information to be acquired, based on the URLs and webpage structure of the websites and of their linked effective webpages, and on the resource information to be acquired;
A directed acquiring unit, configured to capture and save the information on the crawled websites that matches the configuration information;
A deep indexing unit, configured to deeply index the captured information through human-computer interaction: organizing it into a unified format, adjusting its classification, and deleting junk information unrelated to the resource information to be acquired;
A retrieval unit, configured to build an index over the deeply indexed information for user retrieval.
Wherein, the system further comprises an original-state ("ecosystem") archiving unit, configured to archive in original state the webpages corresponding to the deeply indexed information, so that when a piece of information cannot be opened at retrieval time, the corresponding archived webpage is called up for the user.
Wherein, the system further comprises a download position recording unit, configured to record, while the directed acquiring unit is crawling, the webpage position at which the previous capture finished, which provides the starting point for the next crawl.
Wherein, the system further comprises a comparing unit, configured to compare, while the directed acquiring unit is crawling, the information about to be captured with the information already captured and, if the two are identical, not to capture it.
The method and system for directed acquisition of Internet resources according to the invention have the following advantages:
1) Only the specific webpages that match the configuration information are downloaded and webpages that do not meet the conditions are skipped, which effectively reduces the junk and useless information that commonly used search engine technology produces;
2) Recording the download position and comparing the information to be downloaded during the download process avoids the resource duplication that commonly used search engine technology causes;
3) Deep indexing of the acquired resources solves the organization problem, turning the acquired resources into organized knowledge and making clustered retrieval easier;
4) Webpages are archived holographically when they are saved, which achieves permanent local archiving.
Description of the drawings
Fig. 1 is a flowchart of the method for directed acquisition of Internet resources according to the invention;
Fig. 2 is a structural diagram of the system for directed acquisition of Internet resources according to the invention;
Figs. 3 and 4 are schematic diagrams of directionally acquired information in the embodiment;
Figs. 5 and 6 are schematic diagrams of webpages archived in original state ("ecosystem" filing) in the embodiment.
Detailed description
The method and system for directed acquisition of Internet resources proposed by the invention are described below with reference to the accompanying drawings and an embodiment.
Embodiment
Fig. 1 shows the flowchart of the method for directed acquisition of Internet resources according to the invention. The method comprises the following steps:
S101, determining the required basic information: the scope of websites to crawl, the resource information to be acquired, and the resource category it belongs to. Retrieval generally downloads information from commonly used websites, which serve as the crawled websites. The resource information to be acquired is the type that the retrieval targets; for example, when badminton information within sports is to be acquired, the category it belongs to is sports;
S102, according to the resource category, obtaining on each crawled website, through human-computer interaction, the effective webpages corresponding to that category. An effective webpage here is one that is strongly related to the resource category to be acquired, or that is directly labeled as that category. This step is carried out through human-computer interaction: for example, an operator can log in to Sohu or another website, manually open the "Sports" column page corresponding to the resource category sports, and take that page as an effective webpage, or browse other information and also take pages closely related to sports as effective webpages;
S103, generating the configuration information for the resource information to be acquired by directed analysis. The effective webpages were determined in the previous step; in this step one representative effective webpage is analyzed to determine the scope of effective webpages in terms of form and content. The resources of every website on the Internet are organized according to a certain structure, which shows up in three ways: in the uniform resource locator (URL) address, in the structured internal elements of each webpage, and in the content features of each webpage. By analysis, the characteristics of the URL and webpage structure of the webpages (the effective webpages) of the category that the resource information belongs to are extracted for each website within the crawl scope, together with the resource information to be acquired, and configuration information specific to that category of webpages on the site is generated. The configuration information records the URL information, the webpage structure information, and the content features (that is, the resource information that must be present, such as "badminton") of the webpages containing the resource information to be acquired, which amounts to recording where such webpages are located within the whole website;
S104, acquiring resources directionally according to the result of the directed analysis, namely the configuration information. Specifically, for each website within the crawl scope, the URL and webpage structure recorded in the configuration information for the category of webpages that the resource information belongs to (the effective webpages) are matched against the website to locate those webpages, which determines the scope of pages to crawl. Then, according to the resource information to be acquired, the text of the pages linked from such webpages that contain that resource information is captured, with junk such as source code and advertising removed. This step can remove the invalid information in a webpage because a webpage is structured: effective information and invalid information, such as advertisements and source code, occupy different positions in the page. The directed analysis records only the positions of effective information in the configuration information, so only effective information is obtained; the positions of invalid information are not recorded, so invalid information is not obtained. Saved together with the effective information are the page's URL, the page title, the subject classification, the directed acquisition time, and similar information. The captured text information generally comprises the article's title, author, affiliation, keywords, abstract, body text, URL, crawl time, category, and so on. Of these, the title, body text, URL, crawl time and category must be captured for every article; the author, affiliation, keyword and abstract fields are captured only if the original page contains them. This step effectively reduces the junk and useless information that commonly used search engine technology produces;
Step s105, performing human-computer interactive deep editing and indexing on the captured text content. For every article in the text content, the knowledge points URL, title, keywords, abstract, author, affiliation, full text and subject classification are obtained; fields such as author, affiliation and keywords that are missing are filled in from the content of the article; and the captured information is organized into a unified format to make later index building easier. The classification can also be adjusted according to the content of the article: an article that does not belong to the category determined in step s101 is moved to another category. In addition, junk information and junk records unrelated to the resource information to be acquired are removed, which further optimizes the structure of the captured information and makes it more concise;
Step s106, building an index over the deeply indexed information for user retrieval;
Step s107, performing original-state archiving ("ecosystem" filing) of the webpages corresponding to the information deeply indexed in step s105. Original-state archiving means that the downloaded webpages are saved automatically, true to the original, in a manner similar to taking a photograph. It is holographic archiving: it keeps the full content of the page, including text, layout, pictures, related documents, site logo and address. During archiving, a correspondence is established between each archived file and the webpage URL, and the file name of each webpage corresponds to its URL.
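The field rule stated in step S104 (title, body text, URL, crawl time and category are always captured; author, affiliation, keywords and abstract only when the page has them) can be pictured as a small record type. The sketch below is illustrative only: the names `CapturedRecord`, `missing_required` and `fields_to_fill` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical field names; the patent lists the fields but not their encoding.
REQUIRED_FIELDS = ("title", "text", "url", "crawl_time", "category")
OPTIONAL_FIELDS = ("author", "affiliation", "keywords", "abstract")


@dataclass
class CapturedRecord:
    # Fields that every captured article must carry.
    title: str
    text: str
    url: str
    crawl_time: str
    category: str
    # Fields captured only when the source page provides them.
    author: Optional[str] = None
    affiliation: Optional[str] = None
    keywords: Optional[str] = None
    abstract: Optional[str] = None


def missing_required(record: CapturedRecord) -> List[str]:
    """Return the names of required fields that are empty."""
    return [name for name in REQUIRED_FIELDS if not getattr(record, name)]


def fields_to_fill(record: CapturedRecord) -> List[str]:
    """Return the optional fields left for the deep-indexing step to fill in."""
    return [name for name in OPTIONAL_FIELDS if getattr(record, name) is None]
```

A record with an empty required field would be rejected at capture time, while the list returned by `fields_to_fill` is what an editor would complete during step s105.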
After the above steps, the user retrieves through the index: according to the category of the information sought, the search runs against the database that stores the plain-text content in clusters, and each result is associated by URL with the online webpage of the crawled website. If the online webpage cannot be opened, the corresponding original-state ("ecosystem") archived file is called up so that the related content can still be read.
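The fallback described here, trying the live page first and then the archived copy, could look roughly like the following. This is a sketch under assumptions: `fetch_online` is a hypothetical callback that returns the page content or `None` when the page cannot be opened, and the archive is modeled as a plain dictionary keyed by URL.

```python
from typing import Callable, Dict, Optional


def open_for_reader(url: str,
                    fetch_online: Callable[[str], Optional[str]],
                    archive: Dict[str, str]) -> Optional[str]:
    """Return the live page if it opens, otherwise the archived copy for the same URL."""
    page = fetch_online(url)
    if page is not None:
        return page
    # The live page is unavailable: fall back to the archived file, if one exists.
    return archive.get(url)
```

Keying the archive by URL mirrors the correspondence between archived files and webpage URLs that the description requires.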
In this embodiment, two measures apply during downloading. First, the page position of the webpage at which the previous capture finished is recorded, and the next crawl starts downloading from that position. Second, each webpage is compared in form and key content with the webpages already captured, and a page found to be identical or highly similar is not captured. This avoids both repeated crawling and duplicated resources.
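A minimal sketch of these two measures follows. It assumes the candidate pages arrive as an ordered list of (URL, text) pairs and uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure; the patent does not say how similarity is computed, and the 0.9 threshold and all function names are illustrative assumptions.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def is_duplicate(text: str, captured: List[str], threshold: float = 0.9) -> bool:
    """True if text is identical or highly similar to any already captured text."""
    return any(SequenceMatcher(None, text, old).ratio() >= threshold
               for old in captured)


def crawl_incrementally(pages: List[Tuple[str, str]],
                        last_position: Optional[str],
                        captured: List[str]) -> Tuple[List[str], Optional[str]]:
    """Resume after last_position and skip pages that duplicate captured texts.

    Returns the URLs captured in this run and the new resume position.
    """
    urls = [url for url, _ in pages]
    # Resume just after the page where the previous run stopped, if it is known.
    start = urls.index(last_position) + 1 if last_position in urls else 0
    new_urls = []
    for url, text in pages[start:]:
        last_position = url  # record progress even when the page is skipped
        if is_duplicate(text, captured):
            continue
        captured.append(text)
        new_urls.append(url)
    return new_urls, last_position
```

The resume position prevents pages from being fetched twice across runs, while the similarity check catches the same content republished under a different URL.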
In carrying out the above steps, step s107 only needs to follow step s105; it is not restricted to coming after step s106.
A concrete example of steps s101 to s107 is given below, taking directed acquisition from the website of the National Planning Office of Philosophy and Social Sciences (http://www.npopss-cn.gov.cn/) as the example.
In step s101, the resource information to be acquired is philosophy, the category it belongs to is philosophy, and one of the websites to be crawled is http://www.npopss-cn.gov.cn/;
In step s102, the philosophy section of the selected-achievements column, http://www.npopss-cn.gov.cn/chgxj/zx/zx.html, is reached through human-computer interaction, and a representative effective webpage for the philosophy category is chosen under it. After selection and comparison, the chosen webpage is http://www.npopss-cn.gov.cn/chgxj/zx/zxw31_20080523.htm, titled "metaphysics and boundary research";
In step s103, analysis of the webpage titled "metaphysics and boundary research" (http://www.npopss-cn.gov.cn/chgxj/zx/zxw31_20080523.htm) determines the characteristics of the webpages it represents: the URL characteristics (pages of this kind share the same leading URL structure, http://www.npopss-cn.gov.cn/chgxj/zx......), the webpage structure characteristics (each has a title, subtitle, abstract and body text, and the source code structure at the corresponding positions is identical), and the information feature (the word "philosophy" appears in the content; note that the information-feature condition is optional and may be left unset). A corresponding configuration file is then generated, which determines the location of each such webpage within the website;
In step s104, all qualifying webpages are captured according to the configuration information generated in step s103. The content captured from each webpage is text with junk such as source code and advertising removed, and comprises the article's title, author, affiliation, keywords, abstract, body text, URL, crawl time, category, and so on. Of these, the title, body text, URL, crawl time and category must be captured for every article; the author, affiliation, keyword and abstract fields are captured only if the original page contains them. As shown in Fig. 3, for the article titled "metaphysics and boundary research" the author, affiliation and keyword fields could not be captured. It can be seen that the content captured from each webpage is plain text, free of the unwanted information (collectively, junk information) that the page originally carried, such as source code, banners, colors, footers and advertisements;
Through the above steps, the following qualifying webpages are obtained:
The Contemporary Significance research of " Das Kapital " three big manuscript conceptions of history
http://www.npopss-cn.gov.cn/chgxj/zx/zxw32_20080523.htm
" metaphysics and boundary research "
http://www.npopss-cn.gov.cn/chgxj/zx/zxw31_20080523.htm
" research of west post-modernist philosophy of history "
http://www.npopss-cn.gov.cn/chgxj/zx/zxw30_20080523.htm
......
Webpages that do not meet the conditions are not captured. For example, the following webpage does not contain the word "philosophy" and is therefore skipped:
Marxian environmental ethics thought and Contemporary Value research thereof
http://www.npopss-cn.gov.cn/chgxj/zx/zxw29_20080523.htm
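The matching in this example combines three conditions: a shared leading URL, a common page structure, and an optional content keyword. The sketch below shows one hedged way to express that. `CrawlConfig` and `matches` are hypothetical names, the page structure is reduced to a list of fields that must be present, and the patent does not specify the actual format of its configuration file.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class CrawlConfig:
    url_prefix: str                         # shared leading part of the URL
    required_fields: Tuple[str, ...]        # stand-in for the page structure
    content_keyword: Optional[str] = None   # optional information feature


def matches(config: CrawlConfig, url: str, fields: Dict[str, str]) -> bool:
    """True if a parsed page satisfies every condition recorded in the config."""
    if not url.startswith(config.url_prefix):
        return False
    if any(not fields.get(name) for name in config.required_fields):
        return False
    if config.content_keyword is not None:
        return config.content_keyword in fields.get("text", "")
    return True
```

With `url_prefix="http://www.npopss-cn.gov.cn/chgxj/zx"` and `content_keyword="philosophy"`, a page under that prefix whose text lacks the keyword is rejected, which corresponds to the skipped page above. Leaving `content_keyword` unset reproduces the case where the information-feature condition is not configured.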
At the same time, the position of the last webpage obtained in this crawl, "the Contemporary Significance researchs of "Das Kapital" three big manuscript conceptions of history", is recorded so that the next crawl does not obtain the same webpages again.
In step s105, the text content obtained in s104 undergoes human-computer interactive deep editing and indexing. For example, the original page of "metaphysics and boundary research" has no author, affiliation or keyword fields, so none were obtained during capture. In step s105 the author (Lu Jierong; Kingdom's richness; Liu Hongjiu; Ma Zhiguo), affiliation (Liaoning University) and keywords (metaphysics; boundary) are filled in, and the classification is adjusted as appropriate, here from "philosophy" to "ontology". Junk information and junk records can also be removed: for example, if the webpage "research of west post-modernist philosophy of history" is judged not to be needed, it can be deleted. The text after deep indexing is shown in Fig. 4;
After the webpage "research of west post-modernist philosophy of history" is deleted, the following webpages remain:
The Contemporary Significance research of " Das Kapital " three big manuscript conceptions of history
http://www.npopss-cn.gov.cn/chgxj/zx/zxw32_20080523.htm
" metaphysics and boundary research "
http://www.npopss-cn.gov.cn/chgxj/zx/zxw31_20080523.htm
......
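The editorial decisions made in step s105 (filling in missing fields, changing the classification, deleting unwanted records) can be thought of as a set of per-URL overrides applied to the captured records. The following is a minimal sketch under that assumption; `deep_index` and the dictionary layout are hypothetical, and in the patent these decisions are made interactively by a person rather than supplied as data.

```python
from typing import Dict, Iterable, List


def deep_index(records: List[dict],
               edits: Dict[str, dict],
               removed_urls: Iterable[str]) -> List[dict]:
    """Apply editor decisions: override or fill fields, drop records judged irrelevant."""
    removed = set(removed_urls)
    result = []
    for record in records:
        if record["url"] in removed:
            continue  # the editor judged this record to be junk
        updated = dict(record)  # copy so the captured record stays untouched
        updated.update(edits.get(record["url"], {}))
        result.append(updated)
    return result
```

For the example above, the edit for the zxw31 page would set the affiliation to "Liaoning University", the keywords to "metaphysics; boundary" and the category to "ontology", while the URL of the zxw30 page would appear in `removed_urls`.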
In step s106, the webpage content (that is, the text content) deeply indexed in step s105 is indexed, and an index file is built for readers to search;
In step s107, the webpages deeply indexed in step s105 are archived in original state ("ecosystem" filing). Only the webpages listed below are archived; junk webpages that were not captured or were deleted (for example, "research of west post-modernist philosophy of history") are not archived:
The Contemporary Significance research of " Das Kapital " three big manuscript conceptions of history
http://www.npopss-cn.gov.cn/chgxj/zx/zxw32_20080523.htm
" metaphysics and boundary research "
http://www.npopss-cn.gov.cn/chgxj/zx/zxw31_20080523.htm
......
The archived webpages are shown in Fig. 5 and Fig. 6 (the examples show only the top of each page); the file name of each webpage corresponds to its URL.
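The patent states only that each archived file name corresponds to the page URL, not how the name is derived. One possible scheme, shown purely as an assumption, is to percent-encode the whole URL so that it becomes a single reversible file name; the `.html` suffix and both function names are illustrative.

```python
from urllib.parse import quote, unquote

SUFFIX = ".html"  # assumed suffix for archived files


def archive_filename(url: str) -> str:
    """Percent-encode the whole URL so it can serve as one file name."""
    return quote(url, safe="") + SUFFIX


def url_from_filename(name: str) -> str:
    """Recover the original URL from an archive file name."""
    return unquote(name[:-len(SUFFIX)])
```

Because the encoding is reversible, the archive can be looked up from a URL and the URL can be recovered from a file name without a separate mapping table.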
Fig. 2 shows the structural diagram of the system for directed acquisition of Internet resources in this embodiment. The system comprises:
An initial information acquiring unit, configured to determine in advance the scope of websites to crawl, the resource information to be acquired, and the resource category it belongs to;
An effective webpage acquiring unit, configured to obtain on each crawled website, according to the resource category and through human-computer interaction, the effective webpages corresponding to that category;
A configuration information generating unit, configured to generate configuration information for the resource information to be acquired, based on the URLs and webpage structure of the websites and of their linked effective webpages, and on the resource information to be acquired;
A directed acquiring unit, configured to capture and save the information on the crawled websites that matches the configuration information;
A deep indexing unit, configured to deeply index the captured information through human-computer interaction: organizing it into a unified format, adjusting its classification, and deleting junk information unrelated to the resource information to be acquired;
A retrieval unit, configured to build an index over the deeply indexed information for user retrieval.
The system further comprises an original-state ("ecosystem") archiving unit, configured to archive in original state the webpages corresponding to the deeply indexed information, so that when a piece of information cannot be opened at retrieval time, the corresponding archived webpage is called up for the user.
The system further comprises a download position recording unit, configured to record, while the directed acquiring unit is crawling, the webpage position at which the previous capture finished, which provides the starting point for the next crawl.
As the above description shows, this embodiment does not capture every webpage linked from the crawled websites; it acquires only the webpages that match the configuration information, that is, it acquires selectively. For example, a sports website may have many columns, the layout of each column's pages may differ, and the topics of individual pages are not necessarily the same. To capture the pages about volleyball, the configuration information generating unit first analyzes, before crawling, what characteristics the URL, webpage structure and content topic of the volleyball pages have. Extracting these common characteristics yields the configuration information, and the directed acquiring unit then obtains the required webpages by matching against it.
The invention establishes a subject knowledge organization system and a subject-term (descriptor) knowledge organization system, both suited to Internet resource management and covering every subject, industry and field. For the effective webpage information obtained by the directed acquiring unit, knowledge points such as the URL, title, keywords, abstract, author, affiliation, full text, subject classification and directed acquisition time of the corresponding webpage are obtained. Interactive means can then be used to index these knowledge points more deeply, in particular to adjust the subject classification, which affects subject clustering, topic clustering and industry clustering, so that it becomes more complete and correct. After the final human-computer interactive processing, the acquired webpages form structured index data comprising knowledge points such as URL, title, keywords, subject classification, author, affiliation, abstract, full text and directed acquisition time, which is then made available to readers through the retrieval system. Resources can thus be retrieved not only by term but also by subject, industry and topic cluster, which makes in-depth integration, mining and use of the resources easier.
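To make the difference between term retrieval and clustered retrieval concrete, the sketch below builds two simple indexes over the structured records: one by subject classification and one by keyword. It is an illustration only. The record layout (`url`, `subject`, `keywords`) and the function names are assumptions, and the patent does not describe the internal form of its index files.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple


def build_indexes(records: List[dict]) -> Tuple[Dict[str, List[str]], Dict[str, Set[str]]]:
    """Build a subject cluster index and a keyword (term) index over the records."""
    by_subject: Dict[str, List[str]] = defaultdict(list)
    by_term: Dict[str, Set[str]] = defaultdict(set)
    for rec in records:
        by_subject[rec["subject"]].append(rec["url"])
        for term in rec["keywords"]:
            by_term[term].add(rec["url"])
    return by_subject, by_term


def search(by_subject: Dict[str, List[str]],
           by_term: Dict[str, Set[str]],
           term: str,
           subject: Optional[str] = None) -> List[str]:
    """Term retrieval, optionally restricted to one subject cluster."""
    hits = set(by_term.get(term, set()))
    if subject is not None:
        hits &= set(by_subject.get(subject, []))
    return sorted(hits)
```

Restricting a term search to one subject cluster is only possible because the subject classification was assigned and corrected during deep indexing, which is the dependency the paragraph above describes.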
The above embodiment only illustrates the invention and does not limit it. A person of ordinary skill in the relevant technical field can make various changes and modifications without departing from the spirit and scope of the invention, so all equivalent technical solutions also fall within the scope of the invention, and the scope of patent protection of the invention is defined by the claims.

Claims (9)

1. A method for the directed acquisition of Internet resources, characterized in that the method comprises the following steps:
determining in advance the scope of target websites to capture from, the resource information to be acquired, and the resource category to which it belongs;
according to the resource category, obtaining through human-computer interaction, on each target website, the effective web pages corresponding to the resource category;
generating configuration information for the resource information to be acquired, based on the URLs and web page structures of the target websites and of the effective web pages they link to, together with the resource information to be acquired;
capturing, on the target websites, the information that matches the configuration information, and saving it;
depth-indexing the captured information through human-computer interaction by normalizing it to a unified format, adjusting its classification, and deleting junk information unrelated to the resource information to be acquired;
building an index over the depth-indexed information for user retrieval.
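As an illustration of the configuration and capture steps of claim 1, the sketch below treats the configuration information as a URL pattern plus one extraction rule per wanted field. This is an assumption made for the example: the claim does not specify a configuration format, and SITE_CONFIG, its keys, the example.org URL, and the extract function are all hypothetical.

```python
import re

# Hypothetical configuration for one target website: which URLs count as
# effective pages, and where each wanted field sits in the page structure.
SITE_CONFIG = {
    "url_pattern": r"^https://example\.org/articles/\d+\.html$",
    "fields": {
        "title": r"<h1[^>]*>(.*?)</h1>",
        "author": r'<span class="author">(.*?)</span>',
    },
}


def extract(url, html, config):
    """Return the configured fields for a matching page, or None otherwise."""
    if not re.match(config["url_pattern"], url):
        return None  # not an effective page for this resource category
    record = {"url": url}
    for name, pattern in config["fields"].items():
        match = re.search(pattern, html, re.S)
        record[name] = match.group(1).strip() if match else ""
    return record
```

Regular expressions keep the sketch dependency-free; a real configuration might equally use CSS selectors or XPath expressions against a parsed page.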
2. The method for the directed acquisition of Internet resources according to claim 1, characterized in that, after the index is built over the depth-indexed information for user retrieval, the method further comprises the step of: archiving the web page corresponding to the depth-indexed information in its original state and, when the information cannot be opened at retrieval time, calling up the corresponding archived original web page for the user.
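The fallback behavior of claim 2 can be sketched as follows. The function open_page, the fetch_live callable, and the dictionary used as the archive are hypothetical stand-ins; the claim does not say how the original-state archive is stored or how a failed page load is detected, so an OSError from the fetcher is assumed here.

```python
def open_page(url, fetch_live, archive):
    """Return (content, source): the live page if reachable, else the archived copy.

    fetch_live is any callable that returns page content or raises OSError;
    archive maps URLs to the original page content saved at capture time.
    """
    try:
        return fetch_live(url), "live"
    except OSError:
        if url in archive:
            return archive[url], "archive"
        raise  # neither the live page nor an archived copy is available
```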
3. The method for the directed acquisition of Internet resources according to claim 1, characterized in that the capture process further comprises recording the web page position at which the previous capture finished, so that the next capture starts from that position.
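A minimal sketch of the position recording in claim 3, assuming the pages of a site form an ordered list and the position is an index stored in a small JSON file. The file layout and the names load_position, save_position, and capture_run are hypothetical.

```python
import json
import os


def load_position(path):
    """Return the index of the last page fully captured, or -1 if none."""
    if not os.path.exists(path):
        return -1
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)["last_index"]


def save_position(path, index):
    """Persist the index of the page that was just captured."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"last_index": index}, fh)


def capture_run(page_urls, path, fetch, limit):
    """Capture at most `limit` pages, resuming after the recorded position."""
    start = load_position(path) + 1
    captured = []
    for index in range(start, min(start + limit, len(page_urls))):
        captured.append(fetch(page_urls[index]))
        save_position(path, index)  # record progress after every page
    return captured
```

Saving after every page means an interrupted run loses at most the page in flight, at the cost of one small file write per page.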
4. The method for the directed acquisition of Internet resources according to claim 1, characterized in that the capture process further comprises the step of comparing the information to be captured with the information already captured and, if they are identical, not capturing that information.
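The comparison in claim 4 could be implemented in many ways. The sketch below assumes a content hash over whitespace-normalized text, which is a swapped-in technique chosen for brevity and not something the claim prescribes; the class name SeenStore is hypothetical.

```python
import hashlib


class SeenStore:
    """Remembers digests of captured text so identical items are skipped."""

    def __init__(self):
        self._digests = set()

    def is_new(self, text):
        """Return True and remember the text if it was not captured before."""
        normalized = " ".join(text.split())  # collapse runs of whitespace
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in self._digests:
            return False
        self._digests.add(digest)
        return True
```

An exact hash only detects identical content after normalization; near-duplicate pages would need a similarity measure instead.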
5. The method for the directed acquisition of Internet resources according to claim 1, characterized in that the information captured on the target websites that matches the configuration information is plain text content from which source code and advertising information have been removed, and comprises the title, author, affiliation, keywords, abstract, body text, URL, capture time, and classification of the article.
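To illustrate the plain text requirement of claim 5, the sketch below strips markup, scripts, and style sheets using the Python standard library's html.parser module. It covers only the removal of source code; identifying advertising blocks depends on the site-specific configuration and is not attempted here. The names PlainTextExtractor and to_plain_text are hypothetical.

```python
from html.parser import HTMLParser


class PlainTextExtractor(HTMLParser):
    """Collects visible text and ignores the contents of script and style tags."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def to_plain_text(html):
    """Return the visible text of an HTML document as a single string."""
    parser = PlainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```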
6. A system for the directed acquisition of Internet resources, characterized in that the system comprises:
an initial information acquisition unit, configured to determine in advance the scope of target websites to capture from, the resource information to be acquired, and the resource category to which it belongs;
an effective web page acquisition unit, configured to obtain through human-computer interaction, according to the resource category, the effective web pages corresponding to the resource category on each target website;
a configuration information generation unit, configured to generate configuration information for the resource information to be acquired, based on the URLs and web page structures of the target websites and of the effective web pages they link to, together with the resource information to be acquired;
a directed acquisition unit, configured to capture, on the target websites, the information that matches the configuration information, and to save it;
a depth indexing unit, configured to depth-index the captured information through human-computer interaction by normalizing it to a unified format, adjusting its classification, and deleting junk information unrelated to the resource information to be acquired;
a retrieval unit, configured to build an index over the depth-indexed information for user retrieval.
7. The system for the directed acquisition of Internet resources according to claim 6, characterized in that the system further comprises an original-state archiving unit, configured to archive the web page corresponding to the depth-indexed information in its original state and, when the information cannot be opened at retrieval time, to call up the corresponding archived original web page for the user.
8. The system for the directed acquisition of Internet resources according to claim 6, characterized in that the system further comprises a download position recording unit, configured to record, during capture by the directed acquisition unit, the web page position at which the previous capture finished, thereby providing the starting point for the next capture.
9. The system for the directed acquisition of Internet resources according to claim 6, characterized in that the system further comprises a comparison unit, configured to compare, during capture by the directed acquisition unit, the information to be captured with the information already captured and, if they are identical, not to capture that information.
CN200810222306A 2008-09-16 2008-09-16 Method and system of directionally acquiring Internet resources Pending CN101676907A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200810222306A CN101676907A (en) 2008-09-16 2008-09-16 Method and system of directionally acquiring Internet resources

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200810222306A CN101676907A (en) 2008-09-16 2008-09-16 Method and system of directionally acquiring Internet resources

Publications (1)

Publication Number Publication Date
CN101676907A true CN101676907A (en) 2010-03-24

Family

ID=42029476

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200810222306A Pending CN101676907A (en) 2008-09-16 2008-09-16 Method and system of directionally acquiring Internet resources

Country Status (1)

Country Link
CN (1) CN101676907A (en)

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102750299A (en) * 2011-11-30 2012-10-24 新奥特(北京)视频技术有限公司 Method for converging information on internet
CN102750299B (en) * 2011-11-30 2018-03-16 新奥特(北京)视频技术有限公司 A kind of method of network information convergence
CN103455492B (en) * 2012-05-29 2018-10-30 腾讯科技(深圳)有限公司 A kind of method and apparatus of search and webpage
CN103455492A (en) * 2012-05-29 2013-12-18 腾讯科技(深圳)有限公司 Method and device for searching web pages
CN103064892B (en) * 2012-12-13 2016-11-16 北京海量融通软件技术有限公司 A kind of network patch literary composition indexing system and indexing method
CN103064892A (en) * 2012-12-13 2013-04-24 北京海量融通软件技术有限公司 Network post indexing system and method
CN103136360A (en) * 2013-03-07 2013-06-05 北京宽连十方数字技术有限公司 Internet behavior markup engine and behavior markup method corresponding to same
CN103136360B (en) * 2013-03-07 2016-09-07 北京宽连十方数字技术有限公司 A kind of internet behavior markup engine and to should the behavior mask method of engine
CN106991117B (en) * 2013-11-08 2020-08-14 北京奇虎科技有限公司 Snapshot processing method, snapshot display method, server, browser and system
CN106991117A (en) * 2013-11-08 2017-07-28 北京奇虎科技有限公司 Snap processing method, snapshot display method, server, browser and system
CN104125268A (en) * 2014-06-26 2014-10-29 小米科技有限责任公司 File downloading method and device, routing device and terminal device
CN104125268B (en) * 2014-06-26 2018-05-08 小米科技有限责任公司 Document down loading method, device, routing device and terminal device
CN104333462A (en) * 2014-10-27 2015-02-04 深圳市云猫信息技术有限公司 Method, system and mobile terminal for configuring optical network unit
CN104506493A (en) * 2014-12-04 2015-04-08 武汉市烽视威科技有限公司 HLS content source returning and caching realization method
CN104506493B (en) * 2014-12-04 2018-02-27 武汉市烽视威科技有限公司 A kind of method for realizing HLS contents Hui Yuan and caching
CN104516956A (en) * 2014-12-16 2015-04-15 中国科学院声学研究所 Incremental crawling method for website information
CN104516956B (en) * 2014-12-16 2017-12-01 中国科学院声学研究所 A kind of site information increment crawling method
CN104899281B (en) * 2015-06-01 2018-07-27 百度在线网络技术(北京)有限公司 The search processing method and device of academic article processing method and academic article
CN104899281A (en) * 2015-06-01 2015-09-09 百度在线网络技术(北京)有限公司 Academic article processing method and search processing method and apparatus for academic articles
CN105095402A (en) * 2015-07-08 2015-11-25 广西天海信息科技有限公司 Method for searching WeChat material
CN105631007A (en) * 2015-12-29 2016-06-01 云南电网有限责任公司电力科学研究院 Industry technical information collecting method and system
CN106446068B (en) * 2016-09-06 2020-02-07 北京邮电大学 Directory database generation and query method and device
CN106446068A (en) * 2016-09-06 2017-02-22 北京邮电大学 Directory database generation and query methods and apparatuses
CN108270812A (en) * 2016-12-30 2018-07-10 深圳市青果乐园网络科技有限公司 For obtaining method and system of the article publication with situation of sharing
CN108270812B (en) * 2016-12-30 2021-03-23 深圳市青果乐园网络科技有限公司 Method and system for acquiring article publishing and sharing conditions
CN107704515A (en) * 2017-09-01 2018-02-16 安徽简道科技有限公司 Data grab method based on internet data grasping system
CN107885820A (en) * 2017-11-07 2018-04-06 北京小度互娱科技有限公司 Breadth traversal orientation grasping means based on crawler system
CN108446076A (en) * 2018-01-30 2018-08-24 上海天旦网络科技发展有限公司 Index creation method and system based on web feed data
CN108415959A (en) * 2018-02-06 2018-08-17 北京捷通华声科技股份有限公司 A kind of file classification method and device
CN109902220A (en) * 2019-02-27 2019-06-18 腾讯科技(深圳)有限公司 Webpage information acquisition methods, device and computer readable storage medium
CN109902220B (en) * 2019-02-27 2023-11-24 腾讯科技(深圳)有限公司 Webpage information acquisition method, device and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN101676907A (en) Method and system of directionally acquiring Internet resources
US9659278B2 (en) Methods, systems, and computer program products for displaying tag words for selection by users engaged in social tagging of content
CN102930059B (en) Method for designing focused crawler
US8533199B2 (en) Intelligent bookmarks and information management system based on the same
US7953752B2 (en) Methods for merging text snippets for context classification
CN101604324B (en) Method and system for searching video service websites based on meta search
US8122069B2 (en) Methods for pairing text snippets to file activity
Xie et al. Efficient browsing of web search results on mobile devices based on block importance model
CN102682082B (en) Network Flash searching system and network Flash searching method based on content structure characteristics
JP2009500719A (en) Query search by image (query-by-imagesearch) and search system
US6694302B2 (en) System, method and article of manufacture for personal catalog and knowledge management
US20190235721A1 (en) Flexible content organization and retrieval
JP2017535860A (en) Method and apparatus for providing multimedia content
CN103020322A (en) Query method
CN102622402B (en) Server, method and system for providing information search service by using sheaf of pages
JP2009026249A (en) Browsing-history-editing terminal, program, and its method
Davison et al. Finding Relevant Website Queries.
CN106326236A (en) Webpage content identification method and system
CN101639840A (en) Method and device for identifying semantic structure of network information
WO2015074455A1 (en) Method and apparatus for computing url pattern of associated webpage
US20080208831A1 (en) Controlling search indexing
CN104881453A (en) Method and device for indentifying type of webpage
CN102819594B (en) A kind of method and apparatus of organization website information
Kaur et al. Research on the application of web mining technique based on XML for unstructured web data using LINQ
CN104063453A (en) Method for extracting key words of marketing based on URL (uniform resource locator) analysis

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Open date: 20100324