CN101676907A

CN101676907A - Method and system of directionally acquiring Internet resources

Info

Publication number: CN101676907A
Application number: CN200810222306A
Authority: CN
Inventors: 刘锦山; 崔凤雷
Original assignee: Beijing Leisu Technology Co Ltd
Current assignee: Beijing Leisu Technology Co Ltd
Priority date: 2008-09-16
Filing date: 2008-09-16
Publication date: 2010-03-24

Abstract

The invention relates to a method for directed acquisition of Internet resources, comprising the following steps: determining the range of websites to be crawled, the resource information to be acquired, and the resource category to which it belongs; obtaining, through human-computer interaction, the effective web pages on each website that correspond to the resource category; generating configuration information for the resource information to be acquired from the uniform resource locators (URLs) of the websites and their effective web pages, the web page structure, and the resource information to be acquired; capturing and saving the text information on the websites that matches the configuration information; performing in-depth indexing of the captured text information through human-computer interaction; and building an index over the indexed text information for user retrieval. The system for directed acquisition of Internet resources comprises a directed acquisition unit and an in-depth indexing unit for text information. Directed acquisition solves the problems caused by the way general search engines commonly acquire Internet resources: large amounts of junk information, duplicated resources, disorganized resources, and invalid web page snapshots.

Description

Method and system for directed acquisition of Internet resources

Technical field

The present invention relates to the field of Internet search engines, and in particular to a method and system for directed acquisition of Internet resources.

Background technology

A search engine is a computer system that collects information on the Internet according to a certain strategy and, after organizing and processing it, provides a network information service to users. Its main function is to help users quickly and efficiently obtain high-quality information in the Internet environment that meets their needs.

At present, a general-purpose search engine comprises three parts: information gathering, information organization, and user query. The information gathering part crawls information on the Internet and stores it on a data server; the information organization part uses an indexer to organize the captured information, which a searcher then makes available for user queries; the user query part provides the search interface.

At present, the information gathering part of search engine technology has the following main limitations:

1) Internet resources are acquired by unbounded crawling. For example, when a web crawler is used, it starts from a set of specified web pages, parses the hyperlinks they contain, downloads the pages those hyperlinks point to, and repeats this process, so that in theory every page on the Internet would be downloaded. However, the websites worth crawling are not determined beforehand, and the websites to be crawled are not analyzed in a directed way so that only specific pages are captured. The crawl is therefore roaming and undirected, the captured content is full of junk and useless information, and the cost of subsequent processing and of use by the user rises sharply;

2) The captured resources are not edited in depth, which leads to a large amount of duplication;

3) The captured resources are not indexed in depth: knowledge points such as the discipline, subject, author, affiliation, and abstract of each item are not supplied, so resource management is not supported by a sound knowledge system, the organization of resources is scattered and unstructured, and in-depth mining and use are very difficult. For example, when a search engine provider crawls pages in the information gathering part, pages of different categories are lumped together and are not clustered by industry, discipline, or subject; when users search by keyword, this creates serious problems both for subsequent resource consolidation and for in-depth use by the user;

4) Web page snapshots become invalid. Existing capture technology does not archive the content, layout, and color information of a page locally in a holographic manner, so snapshots are incomplete or invalid.

Summary of the invention

The purpose of the invention is to provide a method and system for directed acquisition of Internet resources that solves the problems caused by commonly used search engine methods in the prior art: large amounts of junk information, duplicated resources, disorganized resources, and invalid web page snapshots.

To achieve this purpose, the invention adopts the following technical solution:

A method for directed acquisition of Internet resources, comprising the following steps:

Determining in advance the range of websites to be crawled, the resource information to be acquired, and the resource category to which it belongs;

According to the resource category, obtaining through human-computer interaction the effective web pages on each crawled website that correspond to the resource category;

Generating configuration information for the resource information to be acquired from the URLs of the crawled website and of the effective web pages it links to, the web page structure, and the resource information to be acquired;

Capturing and saving the information on the crawled website that matches the configuration information;

Performing in-depth indexing of the captured information through human-computer interaction: organizing it into a unified format, adjusting its classification, and deleting junk information unrelated to the resource information to be acquired;

Building an index over the indexed information for user retrieval.

After the index is built for user retrieval, the method further comprises the step of archiving, in their original state, the web pages corresponding to the indexed information; when a page cannot be opened during retrieval, the corresponding original-state archived page is called up for the user.

During capture, the method further comprises recording the web page position at which the previous capture finished, so that the next capture starts from that position.

During capture, the method further comprises comparing the information to be captured with the information already captured and, if they are identical, not capturing it.

The information captured from the website that matches the configuration information is plain text content with source code and advertising removed, comprising the article's title, author, affiliation, keywords, abstract, body text, URL, capture time, and category.
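The captured fields listed above can be pictured as a simple record. The following is only an illustrative sketch, not the patent's implementation: the class name `CapturedArticle`, the field names, and the helper method are assumptions introduced here. The split into required and optional fields follows the embodiment, which states that title, body text, URL, capture time, and category are captured for every article, while author, affiliation, keywords, and abstract are captured only when the original page has them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CapturedArticle:
    # Fields the embodiment says must be captured for every article.
    title: str
    body: str
    url: str
    capture_time: str
    category: str
    # Fields captured only when the original page supplies them.
    author: Optional[str] = None
    affiliation: Optional[str] = None
    keywords: Optional[str] = None
    abstract: Optional[str] = None

    def missing_optional_fields(self) -> List[str]:
        """Names of optional fields still empty (hypothetical helper)."""
        optional = ("author", "affiliation", "keywords", "abstract")
        return [name for name in optional if getattr(self, name) is None]
```

A record with empty optional fields is a natural candidate for the manual in-depth indexing step, where an operator fills them in from the article content.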

The invention also provides a system for directed acquisition of Internet resources, comprising:

An initial information acquisition unit, for determining in advance the range of websites to be crawled, the resource information to be acquired, and the resource category to which it belongs;

An effective web page acquisition unit, for obtaining through human-computer interaction, according to the resource category, the effective web pages on each crawled website that correspond to the resource category;

A configuration information generation unit, for generating configuration information for the resource information to be acquired from the URLs of the website and of the effective web pages it links to, the web page structure, and the resource information to be acquired;

A directed acquisition unit, for capturing and saving the information on the crawled website that matches the configuration information;

An in-depth indexing unit, for performing in-depth indexing of the captured information through human-computer interaction: organizing it into a unified format, adjusting its classification, and deleting junk information unrelated to the resource information to be acquired;

A retrieval unit, for building an index over the indexed information for user retrieval.

The system further comprises an original-state archiving unit, for archiving in their original state the web pages corresponding to the indexed information; when a page cannot be opened during retrieval, the corresponding original-state archived page is called up for the user.

The system further comprises a download position recording unit, for recording, while the directed acquisition unit is capturing, the web page position at which the previous capture finished, providing the starting point for the next capture.

The system further comprises a comparison unit, for comparing, while the directed acquisition unit is capturing, the information to be captured with the information already captured and, if they are identical, not capturing it.

The method and system of the invention have the following advantages:

1) Only the specific web pages that match the configuration information are downloaded, and pages that do not meet the conditions are not, which effectively reduces the junk and useless information produced by commonly used search engine technology;

2) Recording the download position and comparing the information to be downloaded during downloading avoids the resource duplication caused by commonly used search engine technology;

3) In-depth indexing of the acquired resources solves the organization problem, turning the acquired resources into knowledge and making clustered retrieval easier;

4) Web pages are archived holographically when saved, achieving permanent local archiving.

Description of drawings

Fig. 1 is a flow chart of the directed acquisition method for Internet resources according to the invention;

Fig. 2 is a structural diagram of the directed acquisition system for Internet resources according to the invention;

Figs. 3 and 4 are schematic diagrams of the information obtained by directed acquisition in the embodiment;

Figs. 5 and 6 are schematic diagrams of the web pages archived in their original state in the embodiment.

Detailed description

The method and system for directed acquisition of Internet resources proposed by the invention are described below with reference to the accompanying drawings and an embodiment.

Embodiment

Fig. 1 shows the flow chart of the directed acquisition method for Internet resources according to the invention. The method comprises the following steps:

S101: determine the required basic information, which comprises the range of websites to be crawled, the resource information to be acquired, and the resource category to which it belongs. Retrieval generally uses commonly used websites as the crawled websites from which information is downloaded. The resource information to be acquired is the type determined for retrieval; for example, to acquire badminton information within sports, the category it belongs to is sports;

S102: according to the resource category, obtain through human-computer interaction the effective web pages on each crawled website that correspond to the resource category. An effective web page here is a page that is highly relevant to the resource category to be acquired, or that is directly labeled as belonging to it. This step requires human-computer interaction: for example, an operator can log in to Sohu or another website and manually open the page of the "Sports" column corresponding to the resource category sports, taking that page as an effective web page, or can browse other information and also take pages closely related to sports as effective web pages;

S103: use directed analysis to generate the configuration information for the resource information to be acquired. With the effective web pages determined in the previous step, this step analyzes one representative effective web page and determines the range of effective web pages in terms of form and content. The resources of every website on the Internet are organized according to some structure. This structure shows in the uniform resource locator (URL) addresses, in the structured internal elements of each page, and in the content features of each page. The analysis extracts, for a website within the crawl range, the URL characteristics of the pages of the category to which the resource information belongs (the effective web pages), the characteristics of the page structure, and the resource information to be acquired, and generates configuration information specific to that category of pages on that website. The configuration information records the URL information, the page structure information, and the content features (that is, the resource information that must be present, such as "badminton") of the pages containing the category of resource information to be acquired, which amounts to recording where such pages are located within the whole website;

S104: perform directed acquisition of resources according to the result of the directed analysis, that is, the configuration information. Specifically, for a website within the crawl range, the URLs and page structure recorded in the configuration information for the pages of the category to which the resource information belongs (the effective web pages) are matched to locate such pages on the website, which determines the range of pages to be captured. Then, according to the resource information to be acquired, the text of the pages linked under such pages that contain that resource information is captured, with junk such as source code and advertising removed. This step removes the invalid information in a page: because a page is structured, valid information and invalid information (such as advertisements and source code) occupy different positions in it. The directed analysis records the positions of valid information in the configuration information, so only valid information is obtained during acquisition; the positions of invalid information are not recorded, so invalid information is not obtained. Information such as the page's URL, page title, discipline classification, and directed acquisition time is saved together with the valid information. The captured text information generally comprises the article's title, author, affiliation, keywords, abstract, body text, URL, capture time, category, and so on. Of these, the title, body text, URL, capture time, and category must be captured for every article; the author, affiliation, keywords, and abstract fields are captured if the original has them and skipped if it does not. This step effectively reduces the junk and useless information produced by commonly used search engine technology;

S105: perform in-depth editing and indexing of the captured text content through human-computer interaction. From the knowledge points obtained for every article in the text content (URL, title, keywords, abstract, author, affiliation, full text, and discipline classification), fields such as author, affiliation, and keywords that are missing are filled in from the content of the article, and the captured information is organized into a unified format to make later index building easier. The classification can also be adjusted according to the content of the article: an article that does not belong to the category determined in step S101 is moved to another category. In addition, junk information and junk records unrelated to the resource information to be acquired are further removed, which further optimizes the structure of the captured information and condenses it;

S106: build an index over the information that has undergone in-depth indexing, for user retrieval;

S107: archive in their original state the web pages corresponding to the information indexed in step S105. Original-state archiving means that downloaded pages are saved automatically and faithfully, in a manner similar to taking a photograph. It is holographic archiving: everything on the page is kept, including text, layout, pictures, related documents, website logos, and addresses. During archiving, each archived file is associated with the page's URL, and the file name of each page corresponds to its URL.

After the above steps, a user's retrieval goes through the index: according to the category of the information sought, the search runs in the database that stores the plain text content in clusters; each record is linked by its URL to the online page of the crawled website, and if the online page cannot be opened, the corresponding original-state archived file is called up so that the content can be read.
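The correspondence between archived file names and URLs, and the fallback to the archive when the online page cannot be opened, can be sketched as follows. This is a minimal illustration under stated assumptions: the patent does not say how the file name is derived from the URL, so percent-encoding is one possible choice made here, and `archive_path`, `resolve_page`, and the `is_online` callback are hypothetical names.

```python
from pathlib import Path
from urllib.parse import quote


def archive_path(url: str, archive_dir: Path) -> Path:
    """Map a URL to a file name that corresponds one-to-one with it."""
    # Assumption: percent-encode every reserved character so the whole
    # URL can serve as a file name and be decoded back unambiguously.
    return archive_dir / (quote(url, safe="") + ".html")


def resolve_page(url: str, archive_dir: Path, is_online) -> str:
    """Return the live URL if it opens, otherwise the archived copy.

    `is_online` is a caller-supplied function (hypothetical) that reports
    whether the online page can currently be opened.
    """
    if is_online(url):
        return url
    return str(archive_path(url, archive_dir))
```

Passing the reachability check in as a function keeps the sketch free of network code; a real system would substitute an actual HTTP request there.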

In this embodiment, two measures are taken when downloading. First, the web page position at which the previous capture finished is recorded, and the next capture starts downloading from that position. Second, each page is compared in form and key content with the pages already captured, and a page found to be identical or highly similar is not captured. This avoids repeated capture and hence duplicated resources.
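The two measures described above, resuming from the recorded position and skipping identical or highly similar pages, can be sketched as follows. The patent does not specify how similarity is measured; this sketch assumes a text-similarity ratio from Python's standard `difflib` module and an arbitrary threshold of 0.9, and the class name `CaptureState` is hypothetical.

```python
import difflib


class CaptureState:
    """Remembers where the last run stopped and what has been captured."""

    def __init__(self, similarity_threshold: float = 0.9):
        self.last_position = 0      # index into the ordered list of pages
        self.captured_texts = []    # key content of pages already captured
        self.similarity_threshold = similarity_threshold  # assumed value

    def is_duplicate(self, text: str) -> bool:
        # Identical or highly similar content counts as already captured.
        return any(
            difflib.SequenceMatcher(None, text, seen).ratio()
            >= self.similarity_threshold
            for seen in self.captured_texts
        )

    def capture(self, pages: list) -> list:
        """Capture new pages, starting from the recorded position."""
        kept = []
        for index in range(self.last_position, len(pages)):
            text = pages[index]
            if not self.is_duplicate(text):
                self.captured_texts.append(text)
                kept.append(text)
            self.last_position = index + 1
        return kept
```

Comparing every new page against all captured pages is quadratic in the number of pages; for a directed crawl limited to one column of one website this is likely acceptable, but a larger corpus would call for hashing or shingling instead.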

In carrying out the above steps, step S107 need only follow step S105; it is not restricted to following step S106.

A concrete example of steps S101 to S107 is given below, taking directed acquisition from the website of the National Philosophy and Social Science Planning Office (http://www.npopss-cn.gov.cn/) as an example.

In step S101, the resource information to be acquired is philosophy, the category it belongs to is philosophy, and one of the websites to be crawled is http://www.npopss-cn.gov.cn/;

In step S102, through human-computer interaction, the operator follows the link to the philosophy section of the crawled website's selected achievements column, http://www.npopss-cn.gov.cn/chgxj/zx/zx.html, and chooses under this page a representative effective web page for the philosophy category. After selection and comparison, the page chosen is http://www.npopss-cn.gov.cn/chgxj/zx/zxw31_20080523.htm, whose title is "metaphysics and boundary research";

In step S103, analysis of the page titled "metaphysics and boundary research" (http://www.npopss-cn.gov.cn/chgxj/zx/zxw31_20080523.htm) determines the characteristics of the pages it represents: the URL characteristics (pages of this kind share the same leading URL structure, all beginning with http://www.npopss-cn.gov.cn/chgxj/zx......), the page structure characteristics (all have a title, subtitle, abstract, and body text, and the source code structure at the corresponding positions is identical), and the content feature (the content contains the word "philosophy"; note that this content condition is optional and may be left unset). The corresponding configuration file is then generated. This determines where each such page is located in the website;

In step S104, all pages that meet the conditions are captured according to the configuration information generated in step S103. What is captured from each page is text content with junk such as source code and advertising removed, comprising the article's title, author, affiliation, keywords, abstract, body text, URL, capture time, category, and so on. Of these, the title, body text, URL, capture time, and category must be captured for every article; the author, affiliation, keywords, and abstract fields are captured if the original has them and skipped if it does not. For the article titled "metaphysics and boundary research" shown in Fig. 3, for instance, the author, affiliation, and keyword fields could not be captured. As can be seen, what is captured from each page is plain text, without the unwanted information the page originally carried, such as source code, banners, colors, footers, and advertisements (collectively, junk information);

The above steps yield the following qualifying web pages:

The Contemporary Significance research of " Das Kapital " three big manuscript conceptions of history

http://www.npopss-cn.gov.cn/chgxj/zx/zxw32_20080523.htm

" metaphysics and boundary research "

http://www.npopss-cn.gov.cn/chgxj/zx/zxw31_20080523.htm

" research of west post-modernist philosophy of history "

http://www.npopss-cn.gov.cn/chgxj/zx/zxw30_20080523.htm

......

Pages that do not meet the conditions are not captured. For example, the following page does not contain the word "philosophy":

Marxian environmental ethics thought and Contemporary Value research thereof

http://www.npopss-cn.gov.cn/chgxj/zx/zxw29_20080523.htm

During capture, the position of the last page obtained, "The Contemporary Significance research of "Das Kapital" three big manuscript conceptions of history", is also recorded, to avoid acquiring it again next time.
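The filtering in steps s103 and s104 of this example can be sketched as a configuration record plus a matching test. This is an illustrative sketch only: the patent does not disclose the format of its configuration file, so the class `CrawlConfig`, its attribute names, and the representation of a page as a dictionary of extracted fields are all assumptions. The URL prefix and the keyword "philosophy" come from the example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrawlConfig:
    url_prefix: str                        # shared leading part of matching URLs
    required_fields: tuple                 # structural elements every page must have
    content_keyword: Optional[str] = None  # optional content condition

    def matches(self, url: str, fields: dict, text: str) -> bool:
        """Check URL, structure, and (if set) content against the config."""
        if not url.startswith(self.url_prefix):
            return False
        if any(name not in fields for name in self.required_fields):
            return False
        # The content condition is optional; skip it when unset.
        return self.content_keyword is None or self.content_keyword in text


# Configuration corresponding to the philosophy example (field names assumed).
config = CrawlConfig(
    url_prefix="http://www.npopss-cn.gov.cn/chgxj/zx",
    required_fields=("title", "subtitle", "abstract", "body"),
    content_keyword="philosophy",
)
```

With this configuration, a page under the same URL prefix whose text lacks the word "philosophy" is rejected, which mirrors why the page zxw29_20080523.htm is not captured in the example.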

In step S105, the text content obtained in S104 undergoes in-depth editing and indexing through human-computer interaction. For example, for the text "metaphysics and boundary research", the author, affiliation, and keyword fields are absent from the original and so were not obtained during acquisition; in step S105 they are filled in: author (Lu Jierong; Kingdom's richness; Liu Hongjiu; Ma Zhiguo), affiliation (Liaoning University), keywords (metaphysics; boundary). The classification is adjusted as appropriate, here from "philosophy" to "ontology". Junk information and junk records can also be further removed: for example, if the page "research of west post-modernist philosophy of history" is judged not to be needed, it can be deleted. The text after in-depth indexing is shown in Fig. 4;

After the page "research of west post-modernist philosophy of history" is deleted, only the following pages remain:

http://www.npopss-cn.gov.cn/chgxj/zx/zxw32_20080523.htm

" metaphysics and boundary research "

http://www.npopss-cn.gov.cn/chgxj/zx/zxw31_20080523.htm

......

In step S106, the page content (that is, the text content) indexed in depth in step S105 is indexed and an index file is built for reader retrieval;

In step S107, the pages that have undergone in-depth indexing in step S105 are archived in their original state. Only the following pages are archived; junk pages that were not captured or were deleted (for example, the page "research of west post-modernist philosophy of history") are not archived:

http://www.npopss-cn.gov.cn/chgxj/zx/zxw32_20080523.htm

" metaphysics and boundary research "

http://www.npopss-cn.gov.cn/chgxj/zx/zxw31_20080523.htm

......

The archived pages are shown in Fig. 5 and Fig. 6 (the examples show only the top of each page); the file name of each page corresponds to its URL.

Fig. 2 shows the structural diagram of the directed acquisition system for Internet resources in this embodiment. The system comprises:

The system further comprises an original-state archiving unit, for archiving in their original state the web pages corresponding to the indexed information; when a page cannot be opened during retrieval, the corresponding original-state archived page is called up for the user.

The system further comprises a download position recording unit, for recording, while the directed acquisition unit is capturing, the web page position at which the previous capture finished, providing the starting point for the next capture.

As the above description shows, this embodiment does not follow links to every page on the crawled website; it obtains only the pages that match the configuration information, so acquisition is selective. For example, a sports website may have many columns, the page layout may differ from column to column, and the topics of the pages are not necessarily the same. To capture the pages about volleyball, the configuration information generation unit must analyze, before capture, what characteristics the URLs, page structure, and content topics of the volleyball pages have; extracting these common characteristics forms the configuration information, and the directed acquisition unit then matches against it to obtain the pages needed.
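Extracting the common characteristics of a few sample pages, as the configuration information generation unit is described as doing, can be sketched as follows. In the patent this analysis is carried out on a representative page with human involvement; the automatic derivation below is only an illustration, and the function names and the representation of a page as a dictionary of fields are assumptions.

```python
import os


def common_url_prefix(sample_urls: list) -> str:
    """Longest shared leading part of the sample URLs, cut at a path boundary."""
    prefix = os.path.commonprefix(sample_urls)
    # commonprefix compares character by character, so trim back to the
    # last '/' to avoid ending in the middle of a file name.
    return prefix[: prefix.rfind("/") + 1]


def common_fields(sample_pages: list) -> set:
    """Field names present in every sample page (list must be non-empty)."""
    return set.intersection(*(set(page) for page in sample_pages))
```

Applied to the three philosophy page URLs from the earlier example, `common_url_prefix` returns `http://www.npopss-cn.gov.cn/chgxj/zx/`, which agrees with the URL characteristic identified in step s103.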

The invention establishes a discipline-based knowledge organization system and a subject-heading knowledge organization system, each suited to managing Internet resources, covering every discipline, industry, and field. The valid page information obtained by the directed acquisition unit yields the knowledge points of the corresponding page: URL, title, keywords, abstract, author, affiliation, full text, discipline classification, directed acquisition time, and so on. These knowledge points can be indexed in further depth by interactive means; in particular, the discipline classification of the page, which affects clustering by discipline, subject, and industry, can be adjusted further to make it more complete and accurate. After the final human-computer interaction, the acquired pages form structured index data comprising knowledge points such as URL, title, keywords, discipline classification, author, affiliation, abstract, full text, and directed acquisition time, which the retrieval system then makes available to readers. Resources can thus be retrieved not only by keyword but also by discipline, industry, and subject cluster, which makes in-depth integration, mining, and use of the resources easy.
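Retrieval by keyword combined with retrieval by cluster, as described above, can be sketched with two small indexes over the structured records. This is a simplified illustration, not the retrieval system of the patent: the record layout (dictionaries with `url`, `category`, and `keywords` entries) and the function names are assumptions.

```python
from collections import defaultdict


def build_indexes(records: list) -> tuple:
    """Build a keyword index and a category (cluster) index over records."""
    by_keyword = defaultdict(set)
    by_category = defaultdict(set)
    for record in records:
        by_category[record["category"]].add(record["url"])
        for keyword in record.get("keywords", []):
            by_keyword[keyword].add(record["url"])
    return by_keyword, by_category


def search(by_keyword, by_category, keyword=None, category=None) -> set:
    """Return URLs matching the keyword, the category, or both if given."""
    results = None
    if keyword is not None:
        results = set(by_keyword.get(keyword, set()))
    if category is not None:
        hits = by_category.get(category, set())
        results = set(hits) if results is None else results & hits
    return results or set()
```

Because the category index is built from the manually adjusted classification, a correction made during in-depth indexing (for example moving an article from "philosophy" to "ontology") changes which cluster the article is returned under.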

The above embodiment merely illustrates the invention and does not limit it. A person of ordinary skill in the relevant art can make various changes and modifications without departing from the spirit and scope of the invention; all equivalent technical solutions therefore also fall within the scope of the invention, and the scope of patent protection of the invention is defined by the claims.

Claims

1. A method for directed acquisition of Internet resources, characterized in that the method comprises the following steps:

2. The method for directed acquisition of Internet resources according to claim 1, characterized in that, after the index is built over the indexed information for user retrieval, the method further comprises the step of archiving in their original state the web pages corresponding to the indexed information; when a page cannot be opened during retrieval, the corresponding original-state archived page is called up for the user.

3. The method for directed acquisition of Internet resources according to claim 1, characterized in that, during capture, the method further comprises recording the web page position at which the previous capture finished, so that the next capture starts from that position.

4. The method for directed acquisition of Internet resources according to claim 1, characterized in that, during capture, the method further comprises the step of comparing the information to be captured with the information already captured and, if they are identical, not capturing it.

5. The method for directed acquisition of Internet resources according to claim 1, characterized in that the information captured from the website that matches the configuration information is plain text content with source code and advertising removed, comprising the article's title, author, affiliation, keywords, abstract, body text, URL, capture time, and category.

6. A system for directed acquisition of Internet resources, characterized in that the system comprises:

7. The system for directed acquisition of Internet resources according to claim 6, characterized in that the system further comprises an original-state archiving unit, for archiving in their original state the web pages corresponding to the indexed information; when a page cannot be opened during retrieval, the corresponding original-state archived page is called up for the user.

8. The system for directed acquisition of Internet resources according to claim 6, characterized in that the system further comprises a download position recording unit, for recording, while the directed acquisition unit is capturing, the web page position at which the previous capture finished, providing the starting point for the next capture.

9. The system for directed acquisition of Internet resources according to claim 6, characterized in that the system further comprises a comparison unit, for comparing, while the directed acquisition unit is capturing, the information to be captured with the information already captured and, if they are identical, not capturing it.