WO2010041517A1 - Information collection device, search engine, information collection method, and program - Google Patents
- Publication number
- WO2010041517A1 (PCT/JP2009/064362)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- link destination
- collection
- score
- information
- link
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Definitions
- The present invention relates to an information collection technique and, more specifically, to an information collection apparatus, an information collection method, and a program for efficiently collecting information from information resources on a network, and to a search engine that searches the information resources from which information has been collected.
- A web crawler is a software component configured to periodically traverse networks such as the Internet and enterprise networks, following web links to collect web pages.
- Normally, the web crawler holds, as configuration information, the URLs of the information resources that serve as collection base points and URL patterns that limit the range of URLs to be collected.
- the administrator sets the URL pattern as a collection rule in consideration of the configuration of the target web site, and explicitly specifies the URL permitted or prohibited as the collection target.
- Non-patent document 1 Non-patent document 2
- the web crawler collects the web page by following the link according to the collection rule set by the administrator while determining whether the link destination URL included in the acquired web page is permitted.
- Web crawlers periodically circulate and update the database and index.
- In this way, a conventional web crawler can limit the collection target range by using the URLs of the information resources that serve as collection base points and URL patterns that limit the range of URLs to be collected.
- As a technique for limiting the range of information resources on the network, a technique based on the number of links or hops along the link path is also known.
- Patent Document 1 discloses a technique intended to rate and filter target pages efficiently and accurately. Hyperlink information consisting of link path information, which links the URL of each page from a reference page, is stored in a DB section; a path search section searches the stored link path information for paths to the target page; and a page score calculation section scores the target page.
- The web crawler circulates through information resources, collects web pages within the scope specified by the collection rules, and updates the database and index so that they remain usable by end users. However, if a link or transfer to a web page not explicitly specified by the collection rules occurs, the administrator has to recognize the occurrence of the link or transfer and then manually set a collection rule for collecting the destination page (for example, as shown in FIG.), which increases the administrator's burden of maintaining the collection rules.
- In FIG. 13, the web site contains a web page including a frame that directly displays a web page on another server.
- To set the information resource on this separate server as a collection target, the administrator must obtain the URL of each frame in order to set the collection rule.
- The address bar of the browser shows only the URL of the frame set, so to set additional collection rules the administrator had to browse the source of the web page or perform a communication analysis, which was burdensome.
- As a method for defining the range of information resources on the network, there is the technique disclosed in Patent Document 1 described above, but it records the links between all pages and determines the filtering target using the number of pages traversed or the number of links needed to reach the target page.
- Because reachable pages are determined based only on the number of links or hops, domain structure such as that of an intra-organization network cannot be taken into account. In addition, the recorded link structure must be maintained, which requires many resources, so the technique is not sufficient as a method for defining the target range of information resources from the viewpoint of processing efficiency.
- The development of a web crawler that can expand the collection range flexibly to an appropriate range, and that can cope with changes, such as site configuration changes, that alter which information resources should be included in the collection target, has therefore been desired.
- The present invention has been made in view of the above problems. Its object is to provide an information collection device, an information collection method, and a program that, without complicating the administrator's setting of collection rules and while limiting any reduction in collection efficiency for the explicitly specified collection range, can flexibly expand the collection range to an appropriate range and can respond to environmental changes, such as site configuration changes, that greatly change the relationships between information resources. Another object of the present invention is to provide a search engine that searches the information resources from which information has been collected.
- In the present invention, link destination addresses included in data acquired from an information resource via a network are extracted, and for each extracted link destination address a score is calculated against a given collection rule that describes a set of addresses eligible for collection.
- The score reflects the distance between the link destination information resource indicated by the link destination address and the set. Then, according to the score calculated for the link destination information resource, it is determined whether the link destination information resource is included in the collection target.
- addresses that deviate from the provisions of the collection rule that describes a set of addresses that are eligible for collection that are explicitly set by the administrator can be handled with a score that reflects the distance from that set.
- the information resource can be included in the collection target, and the collection range can be expanded to an appropriate range according to the relationship between the information resources, thereby realizing efficient information collection.
- it can be configured to collect pages that are highly relevant to explicitly specified sites without being recognized by the administrator, so it is possible to efficiently manage the collection target without increasing the collection rules to be set. This makes it easy for the administrator to set and manage the collection rules.
- In the present invention, the score for a link destination can be calculated from the score calculated for the link source information resource by applying a difference determined according to the degree of matching between the address expression included in the collection rule and the link destination address.
- Because the difference between the scores of the link source and the link destination is determined according to the degree of matching between the address expression included in the collection rule and the link destination address, scoring that corresponds to the site characteristics of the link destination reflected in the address is possible, and the collection range can be expanded to better match the administrator's intention.
- an expiration date can be set for the calculated score, and the maximum effective score can be adopted when the score has already been calculated for the link destination information resource.
- With this configuration, the shortest distance along a currently valid link path between the link destination information resource and the set can be reflected in the score. That is, even if the previously shortest valid route is cut off, for example by deletion of an information resource, an appropriate score can be calculated along the next valid route to determine whether to include the resource in the collection target. If all routes from the set are cut off, the resource can be excluded from the collection target once the expiration date passes. It is thus possible to respond to changes over time in the relationships between information resources.
- When the score calculated for the link destination information resource, or its expiration date, falls outside the range for inclusion in the collection targets or collection target candidates, the link destination information resource can be excluded from the collection targets or candidates and the associated resources released.
- the information resource that has become weakly related is automatically determined from the score or its expiration date, and the resources allocated for the collection are released. Therefore, it is possible to suitably prevent a delay in collecting information with respect to other necessary information resources.
- The difference can be determined according to at least one of: the degree of matching between the domain name included in the link destination address and the domain name included in an element of the eligible address set; the degree of matching between the path part included in the link destination address and the path part of an element of the eligible address set; the number of links from the link source information resource; and whether or not the link destination address is on the intra-organization network.
- For example, the score difference can be made smaller for a link destination information resource on the same server as, or on a server in a domain neighboring, the set of eligible addresses explicitly set by the administrator, and larger for a link destination information resource on a server outside the intra-organization network. The difference can also be increased or decreased according to the degree of matching of the path part of the address and the number of links from the link source information resource.
- According to this configuration, the collection range can be expanded in a way that better reflects the user's intention, according to the link destination site characteristics expressed in the address itself and the characteristics of the link source information resource.
- By adjusting the amount of increase or decrease, the expansion range of collection can be managed flexibly.
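The way a score difference might depend on the link destination address can be sketched as follows. This is only an illustration under stated assumptions: the function name `score_difference`, the helper names, and every numeric weight are hypothetical and do not come from the source, which gives only the qualitative ordering (same server or neighboring domain: small difference; outside the intra-organization network: large difference).

```python
from urllib.parse import urlsplit

# All numeric weights are illustrative assumptions, not values from the source.
SAME_HOST_COST = 5        # same server as an explicitly permitted address
SIBLING_DOMAIN_COST = 15  # neighboring domain (shares the parent domain)
INTRANET_COST = 25        # elsewhere on the intra-organization network
EXTERNAL_COST = 50        # outside the intra-organization network
PATH_DISCOUNT = 1         # reduction per matching leading path segment


def parent_domain(host: str) -> str:
    """Return the host name with its first label removed."""
    return host.split(".", 1)[1] if "." in host else host


def common_path_depth(a: str, b: str) -> int:
    """Count the leading path segments shared by two URL paths."""
    depth = 0
    for x, y in zip(a.strip("/").split("/"), b.strip("/").split("/")):
        if x != y or not x:
            break
        depth += 1
    return depth


def score_difference(link_url: str, permitted_prefix: str, intranet_suffix: str) -> int:
    """Hypothetical difference to subtract from the link source score."""
    link, rule = urlsplit(link_url), urlsplit(permitted_prefix)
    link_host = link.hostname or ""
    rule_host = rule.hostname or ""
    if link_host == rule_host:
        # Same server: the closer the path match, the smaller the difference.
        cost = SAME_HOST_COST - PATH_DISCOUNT * common_path_depth(link.path, rule.path)
        return max(cost, 1)
    if parent_domain(link_host) == parent_domain(rule_host):
        return SIBLING_DOMAIN_COST
    if link_host.endswith(intranet_suffix):
        return INTRANET_COST
    return EXTERNAL_COST
```

Changing the constants is one way to widen or narrow how far collection expands beyond the explicitly permitted set.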
- In the present invention, when it is determined that a link destination information resource not included in the set described in the collection rule is to be included in the collection target, at least part of the domain name and path included in the link destination address of that resource can be retained as a candidate for an additional collection rule.
- additional collection rules for highly relevant sites that were not initially recognized by the administrator are retained as candidates, so that the administrator can easily recognize the site and collect it. It is possible to easily change the rule settings.
- In the search engine of the present invention, in response to a search request from a client, the information resources included in the result set of the query are ranked using the scores calculated for them, and the search result is returned.
- the distance from a set of addresses qualified as collection targets explicitly set by an administrator can be reflected in the rank of the search result.
- In the present invention, the distance of the link destination information resource from the set can be the sum, along the links traversed, of link lengths that each correspond to the degree of matching between the address expression included in the collection rule and the extracted link destination address.
- the collection rules may include a permitted address expression that specifies a qualified address or a prohibited address expression that specifies an ineligible address.
- FIG. 1 is a schematic diagram of a search system including a search server according to a first embodiment of the present invention.
- In this embodiment, a search server 20 that collects web pages from information resources on a network and indexes them for search, while responding to search requests from a client computer 18 (hereinafter referred to as a client), is described as an example.
- FIG. 1 shows a schematic diagram of a search system 10 including a search server 20 according to the first embodiment of the present invention.
- a search system 10 shown in FIG. 1 includes a search server 20 connected to an intra-organization network 12.
- the intra-organization network 12 is configured as, for example, a local area network (LAN) based on TCP / IP and Ethernet (registered trademark), a VPN (Virtual Private Network), a WAN (Wide Area Network) using a dedicated line, and the like.
- the search server 20 collects web pages from information resources on the network while following a link based on a designated web page according to a given collection rule.
- The collected web pages are parsed and indexed so that search requests from clients can be satisfied, and the resulting search index is stored in a storage unit 22 (hereinafter referred to as the search index storage unit).
- The search server 20 stores a circulation destination table, in which the addresses of information resources that are candidates for collection are registered, in the storage unit 24 (hereinafter referred to as the circulation destination table storage unit), and registers in it new collection candidates found according to the collection rule.
- This circulation destination table functions as a waiting queue for collection.
- This address can be a URI (Uniform Resource Identifier) indicating an information resource on the network, more specifically, a URL (Uniform Resource Locator) or a URN (Uniform Resource Name).
- HTML HyperText Markup Language
- The data collected from information resources can be data in a format that may include hyperlinks pointing to other data, for example XML documents described in XML (eXtensible Markup Language) and XLink (XML Linking Language), and documents, spreadsheets, presentations, and mail documents that contain hypertext links.
- the data collected from the information resource may be a multimedia file such as an image, audio, or video.
- The search server 20 is generally configured as a general-purpose computer device such as a personal computer, a workstation, a midrange, or a mainframe. More specifically, the search server 20 includes a central processing unit (CPU) such as a single-core or multi-core processor, a cache memory, a RAM, a network interface card (NIC), and a storage device connected via a storage interface.
- the NIC connects the search server 20 at the physical layer level and the link layer level to the intra-organization network 12 using an appropriate communication protocol such as TCP / IP.
- the storage device provides a storage area for storing various data required by the search server 20.
- the search server 20 is controlled by an operating system (hereinafter referred to as OS) such as WINDOWS (registered trademark) 200X, UNIX (registered trademark), LINUX (registered trademark), and z / OS (registered trademark).
- DBMS database management system
- DB2 registered trademark
- Oracle registered trademark
- Microsoft SQL Server registered trademark
- a circulation destination table storage unit 24 and a search index storage unit 22 are realized as a database.
- The circulation destination table and the search index are stored in the database in a computer-accessible format.
- The web server 16 is composed of Apache HTTP Server, Microsoft (registered trademark) Internet Information Services, or the like, and provides information resources that are potential collection targets. Each web server 16 is given a unique domain name whose parent domain is the domain of the intra-organization network. The web server 16 responds to data acquisition requests for the information resource specified by the path portion or the query string of the URL.
- the Internet 14 also has servers (not shown) having the same configuration as the web server 16, and information resources on these servers are also potential collection targets.
- the web server 16 can also be configured as a general-purpose computer device similar to the search server 20.
- The search server 20 implements server programs such as CGI (Common Gateway Interface), SSI (Server Side Include), servlets, and web applications, processes search requests from the client 18 over HTTP, and returns search results.
- The client 18 can be configured as a general-purpose computer device that implements a web browser, a plug-in, or the like, or as a mobile terminal device such as a PDA or a mobile phone, and issues search requests to the search server 20 to acquire search results.
- FIG. 2 shows functional blocks implemented on the search server 20 according to the first embodiment of the present invention.
- Each functional unit included in the search server 20 is realized by reading a program from a computer-readable recording medium, loading it into memory, and executing it to control the operation of each hardware resource.
- the intra-organization network 12 and the Internet 14, which are components outside the search server 20, are surrounded by broken lines.
- the search server 20 includes a crawler unit 30 that collects web pages from information resources on the network.
- the crawler unit 30 sequentially reads URLs registered in the circulation destination table stored in the circulation destination table storage unit 24, accesses the information resource indicated by the collection target URL, and acquires the web page.
- The crawler unit 30 identifies hyperlinks in the acquired web page, determines which information resources should be candidates for collection according to the preset collection rules and evaluation method, and registers them in the circulation destination table.
- FIG. 3 shows a data structure of the collection rule setting data 100 held by the search server 20 according to the first embodiment of the present invention.
- The collection rule setting data 100, held as setting information of the crawler unit 30, includes a base URL list 100a in which URLs serving as base points are registered and a rule item list 100b in which collection rules describing a set of URLs eligible for collection are registered.
- the URL registered as the base point is first registered in the circulation destination table of the circulation destination table storage unit 24.
- Each item of the collection rule describes a set of URLs that are eligible for collection and can include a permitted address pattern (allow) that is explicitly permitted as a collection target or a prohibited address pattern (forbid) that is explicitly prohibited.
- As the address pattern in each item, for example, a prefix of an address described with the HTTP or HTTPS scheme, a domain or IP address, or a character string expressed with wildcards, range specifications, or regular expressions may be used, but the pattern is not particularly limited to these.
- So that the plurality of items uniquely defines the set of URLs eligible for collection, and so that any URL is unambiguously either explicitly permitted, explicitly prohibited, or not specified, conventions can be set as appropriate for the order in which items are applied, the precedence given to more specific address patterns, and the like.
- the collection rule setting data 100 can include designation of extensions to be collected and extensions to be excluded.
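As a rough illustration of the collection rule setting data 100, the sketch below holds a base URL list, allow and forbid patterns, and the two thresholds, and classifies a URL as permitted, forbidden, or unspecified. The class and function names, the wildcard syntax (Python's `fnmatch`), the threshold values, and the convention that a forbid pattern takes precedence over an allow pattern are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class CollectionRules:
    base_urls: list[str]                             # base URL list 100a
    allow: list[str] = field(default_factory=list)   # permitted address patterns
    forbid: list[str] = field(default_factory=list)  # prohibited address patterns
    collect_threshold: int = 50   # threshold 100c (value is an assumption)
    store_threshold: int = 25     # threshold 100d (value is an assumption)


def classify(url: str, rules: CollectionRules) -> str:
    """Return 'forbidden', 'permitted', or 'unspecified' for a URL.

    Assumed convention: forbid patterns are checked first, so they carve
    exceptions out of the permitted set.
    """
    if any(fnmatchcase(url, pattern) for pattern in rules.forbid):
        return "forbidden"
    if any(fnmatchcase(url, pattern) for pattern in rules.allow):
        return "permitted"
    return "unspecified"


# Hypothetical example rules.
example_rules = CollectionRules(
    base_urls=["http://www.example.com/"],
    allow=["http://www.example.com/*"],
    forbid=["http://www.example.com/private/*"],
)
```

URLs classified as unspecified are the ones for which an intermediate score is calculated instead of a plain accept or reject decision.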
- the crawler unit 30 includes a page processing unit 32 and a link destination evaluation unit 34 as submodules.
- The page processing unit 32 acquires a web page from the information resource to be collected, performs HTML syntax analysis on the page, identifies the hyperlinks embedded in the page, extracts the link destination URLs, and passes them to the link destination evaluation unit 34.
- the acquired web page is stored in the page storage unit 26 for indexing.
- the page processing unit 32 functions as an extraction unit of this embodiment.
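The extraction step performed by the page processing unit 32 can be sketched with Python's standard HTML parser. This is a minimal sketch, not the implementation described in the source: the class name `LinkExtractor`, the choice to treat `frame` and `iframe` sources as links, and the removal of fragments are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute link destination URLs from one HTML page."""

    # Tag -> attribute holding the link (frames included as an assumption).
    LINK_ATTRS = {"a": "href", "frame": "src", "iframe": "src"}

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        wanted = self.LINK_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative URLs and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute not in self.links:
                    self.links.append(absolute)


def extract_links(base_url: str, html: str) -> list[str]:
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Each URL returned by `extract_links` would then be handed to the link destination evaluation for scoring.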
- The link destination evaluation unit 34 checks each extracted link destination URL against each item of the collection rule defined by the collection rule setting data 100 and calculates a score for the information resource indicated by the link destination URL according to a predetermined evaluation method (the details of the score evaluation method are described later).
- the link destination evaluation unit 34 functions as a calculation unit of the present embodiment.
- the collection rule setting data 100 further includes a threshold 100c for inclusion in the collection target and a threshold 100d for inclusion in the collection target candidate for the calculated score.
- The link destination evaluation unit 34 illustrated in FIG. 2 compares the calculated score with the threshold 100d for each extracted link destination URL and determines whether to include the information resource indicated by the link destination URL as a collection target candidate. It then determines the expiration date of the score for each candidate information resource and registers the URL, the score, and the expiration date in the circulation destination table.
- The expiration date set for the score is preferably a date and time that adds a predetermined margin to the next scheduled collection date and time of the link source web page.
- FIG. 4A shows a data structure of the circulation destination table 110 held by the search server 20 according to the first embodiment of the present invention.
- The circulation destination table 110 shown in FIG. 4A includes a field 110a in which the URLs of information resources that are candidates for collection are entered, a field 110b in which the calculated score is entered, and a field 110c in which the expiration date of the score is entered.
- the circulation destination table 110 may be sorted in the score field 110b so that information resources having a larger score are preferentially circulated.
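A small sketch of how the circulation destination table 110 might be represented: each record carries the URL, score, and expiration date, registration is gated by the threshold 100d, and records are read back in descending score order. The names `Destination`, `register`, and `circulation_order`, the threshold value, the one-day margin, and the rule that a higher score overwrites an existing record are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Destination:
    url: str           # field 110a
    score: int         # field 110b
    expires: datetime  # field 110c


STORE_THRESHOLD = 25        # threshold 100d (value is an assumption)
MARGIN = timedelta(days=1)  # margin added to the source's next collection time (assumption)


def register(table: dict[str, Destination], url: str, score: int,
             source_next_collection: datetime) -> bool:
    """Register a link destination as a collection candidate.

    Returns False when the score is below the candidate threshold.
    """
    if score < STORE_THRESHOLD:
        return False
    expires = source_next_collection + MARGIN
    current = table.get(url)
    if current is None or score >= current.score:
        table[url] = Destination(url, score, expires)
    return True


def circulation_order(table: dict[str, Destination]) -> list[Destination]:
    """Return records so that higher-scored resources are visited first."""
    return sorted(table.values(), key=lambda d: d.score, reverse=True)
```

Using the link source's next scheduled collection time as the starting point means a score lapses unless the source page is collected again and still links to the destination.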
- The threshold 100c shown in FIG. 3 is applied when the crawler unit 30, functioning as the determination unit of this embodiment, reads a record registered in the circulation destination table 110. The crawler unit 30 sequentially reads the records in the circulation destination table 110 shown in FIG. 4A and, referring to the score and the expiration date, determines whether to acquire the web page by treating the information resource indicated by the URL as a collection target.
- the acquired web page is stored in the page storage unit 26.
- The score is also stored in the page storage unit 26 in association with the web page so that it can be used in search.
- FIG. 5 schematically shows a score evaluation method for linked information resources.
- FIG. 5 shows a plurality of web pages (hereinafter simply referred to as pages) A to J as information resources indicated by the URL.
- Each of pages A to J lies in one of the following regions: the permitted set region defined by the permitted address patterns in the collection rule setting data 100, the region excluded from the permitted set by the prohibited address patterns, the prohibited set region specified by the prohibited address patterns, or the non-specified region.
- Each of pages A to J is linked to other pages by hyperlinks, indicated by solid lines. The crawler unit 30 follows the hyperlinks starting from page A, indicated by the base URL, and calculates a score for each linked page in turn.
- Pages A to D, which are elements of the permitted set, are assigned the maximum score, indicated by “100”.
- Pages E and F, which are included in the set defined by the prohibited address pattern, are assigned the minimum score, indicated by “0”.
- a set obtained by excluding a set specified by the prohibited address pattern from the permitted set constitutes a set of URLs that are eligible for collection specified by the rule item list 100b.
- The non-specified page I is linked from both page D, which is an element of the permitted set, and the non-specified page G. In this case, the score “75” calculated via the direct link from page D is higher than the score “50” calculated via page G, so the former takes priority.
- That is, the score is reduced by a predetermined subtraction amount for each link traversed from a page in the set of URLs eligible for collection to the evaluation target page.
- If the predetermined subtraction amount is regarded as a link length, the score reflects the distance defined as the sum of the link lengths of the links traversed from the set of eligible URLs to the evaluation target page.
- In this example the link length, that is, the subtraction amount, is a fixed value. However, as described later, it can also be a value that depends on the characteristics of the link destination site.
- the permitted address pattern and the prohibited address pattern are described in the collection rule, and the maximum score is assigned to the permitted set, and the minimum score is assigned to the prohibited set.
- the correspondence between the collection rule and the score is not particularly limited, and a method of directly specifying the score in the address pattern of the collection rule can be adopted.
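The scoring described for FIG. 5 can be sketched as a best-score propagation over a link graph. The scores 100, 0, 75, and 50 and the fixed subtraction of 25 per link follow from the text; the concrete link structure below is a hypothetical reconstruction, since the figure itself is not reproduced here, and the function name `score_pages` is an assumption.

```python
import heapq

MAX_SCORE, MIN_SCORE, STEP = 100, 0, 25  # STEP: fixed subtraction per link


def score_pages(links, permitted, forbidden, base):
    """Propagate scores from the base page, keeping the best score per page."""
    scores = {base: MAX_SCORE}
    heap = [(-MAX_SCORE, base)]  # max-heap via negated scores
    while heap:
        neg, page = heapq.heappop(heap)
        if -neg < scores.get(page, MIN_SCORE):
            continue  # stale entry: a better score was found later
        for dest in links.get(page, ()):
            if dest in forbidden:
                candidate = MIN_SCORE
            elif dest in permitted:
                candidate = MAX_SCORE
            else:
                candidate = max(-neg - STEP, MIN_SCORE)
            if candidate > scores.get(dest, -1):
                scores[dest] = candidate
                if candidate > MIN_SCORE:
                    # Links are followed only from pages with a positive score.
                    heapq.heappush(heap, (-candidate, dest))
    return scores


# Hypothetical link structure consistent with the scores quoted in the text.
example_links = {
    "A": ["B", "C", "E"],
    "B": ["D"],
    "C": ["G", "F"],
    "D": ["I"],
    "G": ["H", "I"],
    "I": ["J"],
}
example_scores = score_pages(
    example_links, permitted={"A", "B", "C", "D"}, forbidden={"E", "F"}, base="A"
)
```

With this structure, page I receives 75 through the direct link from D rather than 50 through G, matching the priority rule in the text.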
- FIG. 6 schematically shows a method for updating the score calculated for the linked information resource.
- FIG. 6 illustrates the case where, after the crawler unit 30 has made a round of the pages A to J shown in FIG. 5, page D has been deleted, for example because of a change in the site configuration, by the time the next collection process starts.
- the link from page D to page I is also broken.
- The score of page I is recalculated when page G is collected, and the expiration date is also updated. The score of page I is therefore updated to the score “50” calculated along the route via page G, and its expiration date is preferably set with the next scheduled update date and time of page G as the starting point.
- In this way, because the score is recalculated each time a link source is collected, the score can reflect the shortest distance along a currently valid link route to the linked page. That is, changes over time in the link structure between pages can be accommodated.
- In the embodiment described above, the score reflecting the shortest distance along a currently valid link path is recalculated each time cyclic collection is performed. In other embodiments, a plurality of pairs of score and expiration date can be held, and when one expiration date passes, the largest score that is still valid can be adopted.
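The alternative of holding several score and expiration date pairs can be sketched as below. The function name `effective_score` and the example dates are hypothetical; the scores 75 (via page D) and 50 (via page G) are those used for page I in the text.

```python
from datetime import datetime


def effective_score(entries, now):
    """Return the largest score whose expiration date has not passed.

    entries: list of (score, expires) pairs held for one information resource.
    Returns None when every pair has expired, meaning the resource would
    drop out of the collection targets.
    """
    valid = [score for score, expires in entries if expires > now]
    return max(valid) if valid else None


# Page I: 75 via the direct link from D, 50 via G (dates are made up).
page_i_entries = [
    (75, datetime(2024, 1, 8)),
    (50, datetime(2024, 1, 10)),
]
```

Once the pair obtained through page D expires without being renewed, the pair obtained through page G takes over without any recalculation.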
- the search server 20 further includes a parser unit 40, an indexer unit 50, and a search engine unit 60.
- The parser unit 40 reads the web pages collected in the page storage unit 26 by the crawler unit 30, performs tag removal and similar processing, performs character string analysis such as morphological analysis, and passes the analysis result together with the calculated score to the indexer unit 50.
- the indexer unit 50 creates a search index by indexing using the passed analysis result, and stores it in the search index storage unit 22.
- FIG. 4B shows the data structure of the search index 120 held by the search server 20 according to the first embodiment of the present invention.
- The search index 120 shown in FIG. 4B includes a field 120a in which the URL of the information resource to be searched is entered, a field 120b in which the index information created by the indexing is entered, and a field 120c in which the calculated score is entered.
- The search index actually used for search processing is preferably built as a data structure in which the above score is added as attached information to an inverted index that includes information indicating the positions at which each word appears in the web page.
- the search engine unit 60 processes a search request from a client with reference to a search index including the score as attached information.
- the information resources included in the search results returned to the client are ranked so as to decrease in rank as they move away from the explicitly specified set of eligible URLs using the score.
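One way the collection score could enter the ranking is sketched below. The source states only that results rank lower the farther they are from the explicitly specified set; combining a text relevance value with the score through a weighted sum, and the particular weights, are assumptions of this example.

```python
def rank_results(hits, relevance_weight=1.0, score_weight=0.01):
    """Order search hits by relevance combined with the collection score.

    hits: list of (url, relevance, collection_score) tuples, where
    collection_score is the 0 to 100 score stored with the index entry.
    """
    return sorted(
        hits,
        key=lambda hit: relevance_weight * hit[1] + score_weight * hit[2],
        reverse=True,
    )


# Hypothetical hits: equal relevance is broken in favor of the higher score.
example_hits = [
    ("http://x.example.com/", 0.80, 50),
    ("http://y.example.com/", 0.80, 100),
    ("http://z.example.com/", 0.90, 0),
]
ranked = [url for url, _relevance, _score in rank_results(example_hits)]
```

A smaller `score_weight` would let textual relevance dominate, while a larger one pushes pages outside the permitted set further down.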
- the search server 20 shown in FIG. 2 functions as a crawler that collects information from information resources on the web server 16 and the Internet 14 through the cooperation of hardware and software, and an indexer that indexes the collected information. And a function as a search engine that returns a search result in response to a search request from a client.
- the crawler function can be configured separately from other functions, and is not particularly limited.
- FIG. 7 shows a flowchart of the collection process executed by the crawler unit 30 according to the first embodiment of the present invention.
- The process shown in FIG. 7 starts from step S100, for example in response to an external command from an administrator or the like, or on the arrival of a time defined by a preset schedule or a preset interval.
- a schedule method for cyclic collection is not particularly limited.
- For example, the scheduled collection date and time can be set within a configured collection frequency range, reflecting the update information of each page, so that collection is performed continuously.
- It can also be configured so that URLs included in the set specified in the collection rule setting data 100 are collected preferentially at a higher frequency, and URLs assigned an intermediate score are collected at a lower frequency.
- In step S101, the crawler unit 30 sequentially acquires records from the circulation destination table 110 and obtains each candidate URL, its calculated score, and the expiration date set for that score.
- In step S102, the crawler unit 30 compares the obtained score with the threshold (collection) 100c for inclusion in the collection target and determines whether the information resource indicated by the URL of the acquired record is to be collected. If it is determined in step S102 that the resource is to be collected (YES), the process proceeds to step S103.
- In step S103, the current time is compared with the expiration date set for the score to determine whether the score of the information resource is still valid, that is, whether the resource still qualifies as a collection target.
- In step S104, the crawler unit 30 calls the page processing unit 32 and hands over processing with the obtained URL and score as arguments.
- In step S105, the crawler unit 30 determines whether any unprocessed records remain.
- If it is determined in step S105 that unprocessed records remain (YES), the process loops back to step S101 and repeats until all records have been processed. If it is determined in step S105 that no unprocessed records remain (NO), the process proceeds to step S106 and the collection process ends.
- If it is determined in step S102 that the score is below the threshold (collection) 100c and the resource is not to be collected (S102: NO), or if it is determined in step S103 that the current time is past the expiration date and the score is invalid (S103: NO), the process proceeds to step S107.
- In step S107, page deletion is performed for the information resource corresponding to the record. In this page deletion, if the web page has been collected in the past, the crawler unit 30 deletes the page data from the page storage unit 26 or marks it as excluded from indexing. Preferably, a record whose score has expired is deleted from the circulation destination table 110.
- URLs that are collection candidates, with scores at or above the threshold (storage) 100d, are registered as records, and when a record is read, those with scores at or above the threshold (collection) 100c are determined to be collection targets. However, in other embodiments, only URLs to be collected, with scores at or above the threshold (collection) 100c, may be registered in the circulation destination table 110, so that only the expiration date is checked when a record is read.
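The collection loop of steps S101 to S107 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Record` fields, the `run_collection` function, and the two callbacks standing in for the page processing unit 32 and the page deletion step are all hypothetical names, and the table is modelled as a plain dictionary.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, List


@dataclass
class Record:
    """One row of the circulation destination table (field names are assumptions)."""
    url: str
    score: int
    expires_at: datetime


def run_collection(
    table: Dict[str, Record],
    collect_threshold: int,
    now: datetime,
    process_page: Callable[[str, int], None],
    delete_page: Callable[[str], None],
) -> List[str]:
    """Visit every record once and return the URLs handed to page processing."""
    collected: List[str] = []
    for url in list(table):                            # S101: snapshot, the table may change
        record = table[url]
        above = record.score >= collect_threshold      # S102: threshold (collection)
        valid = now <= record.expires_at               # S103: score still within its expiration date
        if above and valid:
            process_page(record.url, record.score)     # S104: hand over URL and score
            collected.append(record.url)
        else:
            delete_page(record.url)                    # S107: drop stored page data
            if not valid:
                del table[url]                         # expired records are removed
    return collected                                   # S105/S106: no records left
```

Iterating over a snapshot of the keys is a design choice of this sketch: it lets page processing add or update records in the same table while the loop is running.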
- FIG. 8 shows a flowchart of page processing executed by the page processing unit 32 according to the first embodiment of the present invention.
- The process shown in FIG. 8 is called from the crawler unit 30 in step S104 of the collection process shown in FIG. 7 and starts from step S200.
- The page processing unit 32 issues an acquisition request to the passed URL and acquires a web page from the information resource.
- The page processing unit 32 identifies the hyperlinks embedded in the web page by HTML syntax analysis and extracts the link destination URLs.
- In step S203, the page processing unit 32 determines whether there is an unprocessed link. If it is determined in step S203 that there is an unprocessed link (YES), the process proceeds to step S204.
- In step S204, the page processing unit 32 calls the link destination evaluation unit 34 and hands over processing with the score of the link source web page as an argument. When control returns from the link destination evaluation unit 34, the process loops back to step S203 and repeats for all extracted hyperlinks.
- In step S205, the page processing ends and control returns to the collection process shown in FIG. 7.
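The page processing described above can be sketched with the Python standard library. This is an illustrative sketch under stated assumptions: the page is passed in as an already fetched HTML string so that the example runs offline, and `LinkExtractor`, `process_page`, and the `evaluate_link` callback (standing in for the link destination evaluation unit 34) are hypothetical names.

```python
from html.parser import HTMLParser
from typing import Callable, List
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Relative links become absolute link destination URLs.
                self.links.append(urljoin(self.base_url, value))


def process_page(
    page_url: str,
    page_score: int,
    html: str,
    evaluate_link: Callable[[str, int], None],
) -> List[str]:
    """Extract link destinations and evaluate each with the link source score."""
    parser = LinkExtractor(page_url)
    parser.feed(html)                       # HTML syntax analysis
    for link in parser.links:               # S203: loop while links remain
        evaluate_link(link, page_score)     # S204: pass the link source score
    return parser.links                     # S205: return to the caller
```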
- FIG. 9 shows a flowchart of the link destination evaluation process executed by the link destination evaluation unit 34 according to the first embodiment of the present invention.
- The process shown in FIG. 9 is called from the page processing unit 32 in step S204 of the page process shown in FIG. 8 and starts from step S300.
- In step S301, the link destination evaluation unit 34 checks the passed URL against each item in the collection rule list explicitly specified in the collection rule setting data 100.
- In step S302, the link destination evaluation unit 34 determines whether any item in the collection rule list matches the URL.
- In step S307, the link destination evaluation unit 34 determines whether the URL matches a permitted address pattern and collection is explicitly permitted. If it is determined in step S307 that collection is explicitly permitted (YES), the link destination evaluation unit 34 proceeds to step S308, assigns the maximum score “100” to the information resource indicated by the link destination URL, and proceeds to step S310. If it is determined in step S307 that collection is explicitly prohibited (NO), the process proceeds to step S309, the minimum score “0” is assigned, and the process proceeds to step S310.
- If it is determined in step S302 that no item in the collection rule list matches (NO), then in step S303 the link destination evaluation unit 34 calculates a score for the information resource indicated by the link destination URL under evaluation by subtracting a predetermined subtraction amount from the score assigned to the link source web page.
- In step S304, the link destination evaluation unit 34 compares the calculated score with the threshold (storage) 100d for inclusion among the collection target candidates. If it is determined in step S304 that the calculated score is greater than or equal to the threshold (storage) 100d (YES), the process proceeds to step S305.
- In step S305, the link destination evaluation unit 34 refers to the circulation destination table 110, attempts to acquire the score corresponding to the link destination URL and its expiration date, and determines whether the calculated score is greater than or equal to the stored value of any valid score that may already exist. If so (YES), the process proceeds to step S310.
- If it is determined in step S304 that the calculated score is less than the threshold (storage) 100d (S304: NO), or if it is determined in step S305 that the calculated score is less than the stored value of a valid score (S305: NO), the process proceeds to step S306, where the link destination evaluation unit 34 discards the score calculated in this flow, ends the link destination evaluation process, and returns to the page process shown in FIG. 8.
- In step S310, an expiration date is set, the score calculated in step S308, step S309, or step S303 is finalized, the corresponding record in the circulation destination table 110 is added or updated as appropriate, and the process proceeds to step S306.
- In step S306, the link destination evaluation process ends and control returns to the page process shown in FIG. 8.
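The evaluation flow of steps S301 to S310 can be sketched as below. Several details are assumptions made for illustration only: address patterns are matched as shell-style wildcards with `fnmatchcase`, a prohibited pattern takes precedence over a permitted one, and the default subtraction amount, storage threshold, and time-to-live are placeholder values. The names `Entry` and `evaluate_link` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from fnmatch import fnmatchcase
from typing import Dict, List, Optional

MAX_SCORE, MIN_SCORE = 100, 0


@dataclass
class Entry:
    """Stored score and its expiration date (field names are assumptions)."""
    score: int
    expires_at: datetime


def evaluate_link(
    url: str,
    source_score: int,
    permitted: List[str],
    prohibited: List[str],
    table: Dict[str, Entry],
    now: datetime,
    subtraction: int = 25,                  # assumed fixed subtraction amount
    store_threshold: int = 30,              # assumed threshold (storage)
    ttl: timedelta = timedelta(days=7),     # assumed validity period
) -> Optional[int]:
    """Return the stored score, or None when the calculated score is discarded."""
    if any(fnmatchcase(url, p) for p in prohibited):      # S302, S307: NO
        score = MIN_SCORE                                 # S309
    elif any(fnmatchcase(url, p) for p in permitted):     # S302, S307: YES
        score = MAX_SCORE                                 # S308
    else:
        score = source_score - subtraction                # S303
        if score < store_threshold:                       # S304: NO
            return None                                   # S306: discard
        existing = table.get(url)
        if existing and existing.expires_at >= now and score < existing.score:
            return None                                   # S305: NO, keep the larger valid score
    table[url] = Entry(score, now + ttl)                  # S310: set expiry, add or update
    return score
```

Because each hop subtracts from the link source score, a page two links away from the permitted set ends up with a lower score than a page one link away, which is how the score reflects distance from the set.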
- With the processing of the first embodiment, even for a URL that is not covered by the collection rule explicitly set by the administrator, the corresponding web page can be included in the collection target according to a score reflecting its distance from the set of addresses eligible for collection specified by the collection rule, so the collection range can be expanded to an appropriate extent according to the relatedness expressed by links between web pages.
- The collection range can be managed and controlled, and the administrator's work of setting and maintaining the collection rule becomes easier.
- When a URL that is a collection candidate but falls outside the collection rules is found, an address pattern containing at least part of the domain name and path of that URL can be held as a candidate additional item for the collection rule, so that it can be presented to the administrator when the collection rule setting data 100 is changed manually. For example, if “http://www.docs.example.com/form/required.html” is found as a collection candidate, “http://www.docs.example.com/form/*” and “http://www.docs.example.com/*” can be retained and suggested as additional candidates for the permitted address pattern.
- The permitted address patterns held as additional candidates can later be displayed on a management graphical user interface or the like and proposed when the administrator edits the settings manually.
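One way to derive such candidate patterns is to walk up the directory levels of the URL path. The sketch below is only an illustration of that idea; the function name `candidate_patterns` and the rule of dropping the final path segment are assumptions, chosen so that the example URL from the text yields the two patterns quoted there.

```python
from typing import List
from urllib.parse import urlsplit


def candidate_patterns(url: str) -> List[str]:
    """Return wildcard patterns for the URL, from most to least specific."""
    parts = urlsplit(url)
    root = f"{parts.scheme}://{parts.netloc}"
    # Directory layers only: the last path segment (the file name) is dropped.
    dirs = [seg for seg in parts.path.split("/")[:-1] if seg]
    patterns: List[str] = []
    for depth in range(len(dirs), -1, -1):
        prefix = "/".join(dirs[:depth])
        patterns.append(f"{root}/{prefix}/*" if prefix else f"{root}/*")
    return patterns


print(candidate_patterns("http://www.docs.example.com/form/required.html"))
# ['http://www.docs.example.com/form/*', 'http://www.docs.example.com/*']
```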
- FIG. 10 schematically shows a score evaluation method for linked information resources according to the second embodiment.
- A plurality of pages A to M are shown as information resources indicated by URLs.
- Each of pages A to M is placed in one of several regions, including a permitted-set region, an excluded-set region, a prohibited-set region, and a region of URLs on the same server as a URL included in the permitted set.
- Pages A to D, which are elements of the permitted set, are assigned the maximum score “100”, while pages E and M, which are included in the set defined by the prohibited address pattern, are assigned the minimum score “0”.
- The set obtained by excluding the set specified by the prohibited address pattern from the set specified by the permitted address pattern constitutes the set of URLs eligible for collection that is explicitly specified by the collection rule.
- An intermediate value is calculated by subtracting a predetermined subtraction amount from the score of a page in the permitted set.
- A plurality of regions corresponding to site characteristics exist outside the region defined as the set of URLs eligible for collection.
- One such region contains pages (pages F and G in the example of FIG. 10) that are information resources on the same server as a URL of the permitted set explicitly targeted for collection. When the score is evaluated with such a page as the link destination, the subtraction amount can be reduced (for example, to “10”).
- For a page on the same server, there may be a permitted address pattern that matches not only the server domain name but also part of the path. In that case, the subtraction amount can be reduced further (for example, to “5” when the match extends to the first path layer).
- Another category consists of pages that are information resources on servers in the proximity domain of the server of a URL included in the permitted set. When the score is evaluated with such a page as the link destination, the subtraction amount can be reduced (for example, to “15”).
- The method of determining the proximity domain is not particularly limited: the degree of matching can be determined against the entire parent domain or against the parent domain excluding the top-level domain, and the subtraction amount may be changed according to the degree to which the parent domain parts match.
- There are also pages (pages J and K in the example of FIG. 10) on external servers, where the parent domain of the server hosting the information resource does not match the parent domain of the search server 20. Because pages in this category are information resources on servers outside the intra-organization network 12 to which the search server 20 belongs, the subtraction amount can be increased (for example, to “30”) when the score is evaluated with such a page as the link destination.
- The subtraction amount can be set to a default value (for example, “25”). Furthermore, although not shown in FIG. 10, in the second embodiment the subtraction amount can also be changed according to the number of links included in the link source web page.
- The processing flows of the collection process, page process, and link destination evaluation process according to the second embodiment can be substantially the same as the flows shown in FIGS. 7 to 9 for the first embodiment.
- The second embodiment differs from the first embodiment in that the score calculation process shown in FIG. 11 is called in step S303 shown in FIG. 9.
- FIG. 11 is a flowchart of the score calculation process executed by the link destination evaluation unit 34 according to the second embodiment of the present invention.
- The process shown in FIG. 11 is called from step S303 shown in FIG. 9 and starts from step S400.
- In step S401, the link destination evaluation unit 34 sets the subtraction amount to a default value (for example, “25”).
- In step S402, the link destination evaluation unit 34 checks the domain name of the server included in the link destination URL against the permitted address patterns included in the collection rule list.
- In step S403, it is determined whether a matching server domain name exists in the collection rule list.
- In step S404, the link destination evaluation unit 34 treats the server included in the link destination URL as the same as a permitted server and reduces the subtraction amount (for example, to 40%).
- In step S405, the link destination evaluation unit 34 checks the path portion of the link destination URL against the matching permitted address pattern, and in step S406 first determines whether the first layer of the path portion matches.
- If it is determined in step S406 that the first layer matches (YES), the process proceeds to step S407, where the link destination evaluation unit 34 further reduces the subtraction amount (for example, to 50%), and then loops back to step S405 to compare the next layer; the loop continues for as long as the layers match (while step S406: YES).
- If it is determined in step S406 that the layers do not match (NO), the process proceeds to step S410.
- If it is determined in step S403 that no matching server domain name exists in the collection rule list (NO), the process proceeds to step S408.
- In step S408, the link destination evaluation unit 34 determines whether the server included in the link destination URL is in the proximity domain of a server in the collection rule list.
- If it is determined in step S408 that the server is in the proximity domain (YES), the link destination evaluation unit 34 reduces the subtraction amount (for example, to 60%) in step S409 and proceeds to step S410. If it is determined in step S408 that the server is not in the proximity domain (NO), the process proceeds directly to step S410.
- In step S410, the link destination evaluation unit 34 further compares the parent domain name of the server included in the link destination URL with the parent domain assigned to the search server 20, and in step S411 determines whether the server included in the link destination URL is outside the intra-organization network 12. If it is determined in step S411 that the server is an external server (YES), the process proceeds to step S412, the link destination evaluation unit 34 increases the subtraction amount (for example, by 20%), and the process proceeds to step S413. If it is determined in step S411 that the server belongs to the intra-organization network 12 (NO), the process proceeds directly to step S413.
- In step S413, the subtraction amount is then increased according to the number of links L included in the link source web page (for example, increased by the number of links L). This is so that link destinations reached from web pages such as bookmark pages or link collections are scored low and are not readily included in the collection target.
- In step S414, the final score is calculated by subtracting the subtraction amount obtained in steps S400 to S413 from the score of the link source web page, and control returns to the link destination evaluation process shown in FIG. 9.
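The subtraction logic of FIG. 11 can be sketched as follows, using the example values given in the text (default 25, reduction to 40%, to 50% per matching path layer, to 60% for a proximity domain, and an increase of 20% for an external server). The helper names, the rule that the parent domain is the host name without its first label, and the assumption that permitted patterns are absolute URLs with a literal host name are illustrative choices, not details taken from the source.

```python
from typing import List
from urllib.parse import urlsplit


def parent_domain(host: str) -> str:
    """Drop the first label: 'www.docs.example.com' -> 'docs.example.com' (assumed rule)."""
    return host.split(".", 1)[1] if "." in host else host


def path_layers(path: str) -> List[str]:
    """Split a URL path into layers, ignoring empty segments and a '*' wildcard."""
    return [seg for seg in path.split("/") if seg and seg != "*"]


def shared_depth(a: List[str], b: List[str]) -> int:
    """Number of leading layers that two paths have in common."""
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return depth


def subtraction_amount(link_url: str, permitted_patterns: List[str],
                       org_domain: str, link_count: int,
                       default: float = 25.0) -> float:
    amount = default                                             # S401
    link = urlsplit(link_url)
    host = link.hostname or ""
    rules = [urlsplit(p) for p in permitted_patterns]
    same_server = [r for r in rules if r.hostname == host]       # S402, S403
    if same_server:
        amount *= 0.4                                            # S404
        layers = path_layers(link.path)
        depth = max(shared_depth(layers, path_layers(r.path)) for r in same_server)
        amount *= 0.5 ** depth                                   # S405 to S407, once per matching layer
    elif any(parent_domain(r.hostname or "") == parent_domain(host) for r in rules):
        amount *= 0.6                                            # S408, S409: proximity domain
    internal = host == org_domain or host.endswith("." + org_domain)
    if not internal:                                             # S410, S411
        amount *= 1.2                                            # S412: external server
    return amount + link_count                                   # S413


def link_score(source_score: float, amount: float) -> float:
    """S414: link destination score = link source score minus the subtraction amount."""
    return source_score - amount
```

With a permitted pattern of `http://www.docs.example.com/form/*` and no links counted, these multipliers give subtraction amounts of about 10 for another page on the same server, 5 for a page under `/form/`, 15 for a server in the proximity domain, and 30 for an external server, in line with the example amounts described for FIG. 10.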
- The score is obtained by subtracting, for each link traversed from a page included in the set of URLs eligible for collection to the page under evaluation, the corresponding subtraction amount.
- The link length, that is, the subtraction amount, can be varied according to characteristics of the link destination site expressed in the URL string, and by adjusting the settings that increase or decrease the subtraction amount, the collection range can be expanded to better match the administrator's intent.
- An information collection device, an information collection method, and a program can be provided that flexibly extend the collection range to an appropriate extent, without complicating the administrator's setting of collection rules and while limiting any loss of collection efficiency for the explicitly specified collection range, and that can respond to environmental changes, such as site reconfigurations, that greatly change the relationships between information resources.
- The above-described functions of the present invention can be implemented by a device-executable program written in an object-oriented programming language such as C++, Java (registered trademark), JavaBeans (registered trademark), Java (registered trademark) Applet, JavaScript (registered trademark), Perl, or Ruby, or in a database language such as SQL, and the program can be stored on a device-readable recording medium for distribution or distributed by transmission.
- Reference signs: 10: search system; 12: intra-organization network; 14: Internet; 16: web server; 18: client; 20: search server; 22: search index storage unit; 24: circulation destination table storage unit; 26: page storage unit; 30: crawler unit; 32: page processing unit; 34: link destination evaluation unit; 40: parser unit; 50: indexer unit; 60: search engine unit; 100: collection rule setting data; 110: circulation destination table; 120: search index
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
Abstract
Description
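FIG. 1 shows a schematic diagram of a search system 10 that includes a search server 20 according to the first embodiment of the present invention. The search system 10 shown in FIG. 1 includes the search server 20 connected to an intra-organization network 12. The intra-organization network 12 is configured as, for example, a local area network (LAN) using TCP/IP and Ethernet (registered trademark), or a WAN (Wide Area Network) using a VPN (Virtual Private Network) or leased lines, and is connected to, for example, the Internet 14 and web servers 16a, b.
In the first embodiment described so far, the subtraction amount used when calculating the score was constant. The following describes a second embodiment in which the subtraction amount is varied according to, for example, the site characteristics of the link destination URL, so that the collection range is extended more flexibly. Because the search system 10 and the search server 20 of the second embodiment have largely the same configuration as in the first embodiment, the description focuses on the differences.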
Claims (16)
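- An information collection apparatus that collects information from information resources on a network, comprising:
an extraction unit that acquires data from an information resource via the network and extracts link destination addresses contained in the data;
a calculation unit that, for each link destination address, checks the address against a collection rule describing a set of addresses eligible for collection and calculates a score reflecting the distance, from that set, of the link destination information resource indicated by the link destination address; and
a determination unit that determines, according to the score calculated for the link destination information resource, whether to include the link destination information resource in the collection target.
- The information collection apparatus according to claim 1, wherein the calculation unit determines, relative to the score calculated for the link source information resource, a difference corresponding to the degree of match between an address expression contained in the collection rule and the link destination address, and calculates the score for the link destination information resource.
- The information collection apparatus according to claim 2, wherein the calculation unit sets an expiration date for the score and, when a score has already been calculated for the link destination information resource, adopts the largest valid score.
- The information collection apparatus according to claim 3, wherein, when the score calculated for the link destination information resource or its expiration date falls outside the range for inclusion in the collection target or the collection target candidates, the determination unit excludes the link destination information resource from the collection target or the collection target candidates and releases resources.
- The information collection apparatus according to claim 4, wherein the calculation unit determines the difference according to the degree of match between the domain name contained in the link destination address and a domain name contained in an element of the set of eligible addresses, the degree of match between the path portion contained in the link destination address and a path portion contained in an element of the set of eligible addresses, the number of links from the link source information resource, and whether the link destination address is on the intra-organization network, or according to at least one of these.
- The information collection apparatus according to claim 5, wherein, when the determination unit determines that a link destination information resource not included in the set described in the collection rule is to be included in the collection target, the determination unit holds an address expression containing at least part of the domain name and path contained in the link destination address of that link destination information resource as a candidate for an additional collection rule.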
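- A search engine that refers to a search index in which data collected from information resources on a network by the information collection apparatus according to claim 1 is indexed, the search engine comprising:
a search processing unit that, in response to a search request from a client, ranks the information resources included in the result set of the search request using the scores calculated for them, and returns search results.
- A method for collecting information from information resources on a network, in which a computer executes the steps of:
acquiring data from an information resource via the network;
extracting link destination addresses contained in the data;
for each link destination address, checking the address against a collection rule describing a set of addresses eligible for collection and calculating a score reflecting the distance, from that set, of the link destination information resource indicated by the link destination address; and
determining, according to the score calculated for the link destination information resource, whether to include the link destination information resource in the collection target.
- The information collection method according to claim 8, wherein the calculating step includes a sub-step of determining a difference corresponding to the degree of match between an address expression contained in the collection rule and the link destination address, and a sub-step of calculating the score for the link destination information resource from the difference, relative to the score calculated for the link source information resource.
- The information collection method according to claim 9, wherein the computer further executes a step of setting an expiration date for the calculated score, and in the calculating step, when a score has already been calculated for the link destination information resource, the highest valid score is adopted.
- The information collection method according to claim 10, wherein the computer further executes a step of excluding the link destination information resource from the collection target or the collection target candidates and releasing resources when the score calculated for the link destination information resource or its expiration date falls outside the range for inclusion in the collection target or the collection target candidates.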
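- A computer-executable program for causing a computer to function as an information collection apparatus that collects information from information resources on a network, the program causing the information collection apparatus to function as:
an extraction unit that acquires data from an information resource via the network and extracts link destination addresses contained in the data;
a calculation unit that, for each link destination address, checks the address against a collection rule describing a set of addresses eligible for collection and calculates a score reflecting the distance, from that set, of the link destination information resource indicated by the link destination address; and
a determination unit that determines, according to the score calculated for the link destination information resource, whether to include the link destination information resource in the collection target.
- The program according to claim 12, wherein the calculation unit determines, relative to the score calculated for the link source information resource, a difference corresponding to the degree of match between an address expression contained in the collection rule and the link destination address, and calculates the score for the link destination information resource.
- The program according to claim 13, wherein the calculation unit sets an expiration date for the score and, when a score has already been calculated for the link destination information resource, adopts the largest valid score.
- The program according to claim 14, wherein, when the score calculated for the link destination information resource or its expiration date falls outside the range for inclusion in the collection target or the collection target candidates, the determination unit excludes the link destination information resource from the collection target or the collection target candidates and releases resources.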
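- An information collection apparatus that collects information from information resources on a network, comprising:
an extraction unit that acquires data from an information resource via the network and extracts link destination addresses contained in the data;
a calculation unit that, for each link destination address, checks the address against a collection rule describing a set of addresses eligible for collection and calculates a score reflecting the distance, from that set, of the link destination information resource indicated by the link destination address; and
a determination unit that determines, according to the score calculated for the link destination information resource, whether to include the link destination information resource in the collection target,
wherein:
the calculation unit determines, relative to the score calculated for the link source information resource, a difference corresponding to the degree of match between an address expression contained in the collection rule and the link destination address, and calculates the score for the link destination information resource;
the calculation unit sets an expiration date for the score and, when a score has already been calculated for the link destination information resource, adopts the largest valid score;
the determination unit, when the score calculated for the link destination information resource falls outside the range for inclusion in the collection target or the collection target candidates, excludes the link destination information resource from the collection target or the collection target candidates and releases resources;
the calculation unit determines the difference according to the degree of match between the domain name contained in the link destination address and a domain name contained in an element of the set of eligible addresses, the degree of match between the path portion contained in the link destination address and a path portion contained in an element of the set of eligible addresses, the number of links from the link source information resource, and whether the link destination address is on the intra-organization network, or according to at least one of these; and
the determination unit, when it determines that a link destination information resource not included in the set described in the collection rule is to be included in the collection target, holds an address expression containing at least part of the domain name and path contained in the link destination address of that link destination information resource as a candidate for an additional collection rule.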
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2010532857A JP5325229B2 (ja) | 2008-10-08 | 2009-08-14 | Information collection apparatus, search engine, information collection method, and program |
US13/003,875 US8676782B2 (en) | 2008-10-08 | 2009-08-14 | Information collection apparatus, search engine, information collection method, and program |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2008-261848 | 2008-10-08 | ||
JP2008261848 | 2008-10-08 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2010041517A1 true WO2010041517A1 (ja) | 2010-04-15 |
Family
ID=42100473
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2009/064362 WO2010041517A1 (ja) | Information collection apparatus, search engine, information collection method, and program | 2008-10-08 | 2009-08-14 |
Country Status (3)
Country | Link |
---|---|
US (1) | US8676782B2 (ja) |
JP (1) | JP5325229B2 (ja) |
WO (1) | WO2010041517A1 (ja) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
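JP2014528136A (ja) * | 2011-12-13 | 2014-10-23 | 北大方正集▲団▼有限公司Peking University Founder Group Co., Ltd | Method and system for collecting network data |
JP2017173910A (ja) * | 2016-03-18 | 2017-09-28 | Jcc株式会社 | Search server, search system, search information distribution system, search program, and search information distribution program |
JP2019020958A (ja) * | 2017-07-14 | 2019-02-07 | 株式会社日立製作所 | Information collection support device and information collection support method |
JP2020140722A (ja) * | 2020-04-20 | 2020-09-03 | ヤフー株式会社 | Content collection device, content collection method, and content collection program |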
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130110815A1 (en) * | 2011-10-28 | 2013-05-02 | Microsoft Corporation | Generating and presenting deep links |
US20130215101A1 (en) * | 2012-02-21 | 2013-08-22 | Motorola Solutions, Inc. | Anamorphic display |
US8832088B1 (en) | 2012-07-30 | 2014-09-09 | Google Inc. | Freshness-based ranking |
US20160247193A1 (en) * | 2015-02-19 | 2016-08-25 | Troy Group, Inc. | System and method of dynamically targeting information to product users |
RU2660593C2 (ru) * | 2016-04-07 | 2018-07-06 | Общество С Ограниченной Ответственностью "Яндекс" | Method and server for determining an original link to an original object |
KR101873147B1 (ko) * | 2017-02-01 | 2018-07-23 | 주식회사 크레도웨이 | Method and system for simulating insurance claims using an online network |
WO2019108740A1 (en) * | 2017-12-01 | 2019-06-06 | The Regents Of The University Of Colorado, A Body Corporate | Systems and methods for crawling web pages and parsing relevant information stored in web pages |
KR102398521B1 (ko) * | 2020-01-10 | 2022-05-16 | 김광년 | Loan service brokerage system and method |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH09218876A (ja) * | 1996-02-08 | 1997-08-19 | Nec Corp | Node-link search device |
JP2001084258A (ja) * | 1999-09-13 | 2001-03-30 | Oki Electric Ind Co Ltd | Parallel-translation information collection device |
JP2002259407A (ja) * | 2000-12-27 | 2002-09-13 | Fujitsu Ltd | Document collection apparatus for specific use, method therefor, and program to be executed by a computer |
JP2004199365A (ja) * | 2002-12-18 | 2004-07-15 | Canon Inc | Document collection device |
JP2005301759A (ja) * | 2004-04-13 | 2005-10-27 | Vodafone Kk | Search device |
JP2006235729A (ja) * | 2005-02-22 | 2006-09-07 | Mitsubishi Electric Corp | Selective web information collection device |
JP2007149057A (ja) * | 2005-04-29 | 2007-06-14 | Palo Alto Research Center Inc | System and method for personalized search |
Family Cites Families (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6539430B1 (en) * | 1997-03-25 | 2003-03-25 | Symantec Corporation | System and method for filtering data received by a computer system |
US6411952B1 (en) * | 1998-06-24 | 2002-06-25 | Compaq Information Technologies Group, Lp | Method for learning character patterns to interactively control the scope of a web crawler |
US6418433B1 (en) * | 1999-01-28 | 2002-07-09 | International Business Machines Corporation | System and method for focussed web crawling |
US6778986B1 (en) * | 2000-07-31 | 2004-08-17 | Eliyon Technologies Corporation | Computer method and apparatus for determining site type of a web site |
US7203673B2 (en) * | 2000-12-27 | 2007-04-10 | Fujitsu Limited | Document collection apparatus and method for specific use, and storage medium storing program used to direct computer to collect documents |
JP4021681B2 (ja) | 2002-02-22 | 2007-12-12 | 日本電信電話株式会社 | Page rating/filtering method and apparatus, page rating/filtering program, and computer-readable recording medium storing the program |
JP4093012B2 (ja) * | 2002-10-17 | 2008-05-28 | 日本電気株式会社 | Hypertext inspection apparatus, method, and program |
US8145710B2 (en) * | 2003-06-18 | 2012-03-27 | Symantec Corporation | System and method for filtering spam messages utilizing URL filtering module |
US20050015626A1 (en) * | 2003-07-15 | 2005-01-20 | Chasin C. Scott | System and method for identifying and filtering junk e-mail messages or spam based on URL content |
US7552109B2 (en) * | 2003-10-15 | 2009-06-23 | International Business Machines Corporation | System, method, and service for collaborative focused crawling of documents on a network |
US20080256065A1 (en) * | 2005-10-14 | 2008-10-16 | Jonathan Baxter | Information Extraction System |
US8943035B2 (en) * | 2005-11-14 | 2015-01-27 | Patrick J. Ferrel | Distributing web applications across a pre-existing web |
US7672943B2 (en) * | 2006-10-26 | 2010-03-02 | Microsoft Corporation | Calculating a downloading priority for the uniform resource locator in response to the domain density score, the anchor text score, the URL string score, the category need score, and the link proximity score for targeted web crawling |
US7941740B2 (en) * | 2007-07-10 | 2011-05-10 | Yahoo! Inc. | Automatically fetching web content with user assistance |
US8965865B2 (en) * | 2008-02-15 | 2015-02-24 | The University Of Utah Research Foundation | Method and system for adaptive discovery of content on a network |
US8136029B2 (en) * | 2008-07-25 | 2012-03-13 | Hewlett-Packard Development Company, L.P. | Method and system for characterising a web site by sampling |
-
2009
- 2009-08-14 WO PCT/JP2009/064362 patent/WO2010041517A1/ja active Application Filing
- 2009-08-14 US US13/003,875 patent/US8676782B2/en active Active
- 2009-08-14 JP JP2010532857A patent/JP5325229B2/ja active Active
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
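JP2014528136A (ja) * | 2011-12-13 | 2014-10-23 | 北大方正集▲団▼有限公司Peking University Founder Group Co., Ltd | Method and system for collecting network data |
JP2017173910A (ja) * | 2016-03-18 | 2017-09-28 | Jcc株式会社 | Search server, search system, search information distribution system, search program, and search information distribution program |
JP2019020958A (ja) * | 2017-07-14 | 2019-02-07 | 株式会社日立製作所 | Information collection support device and information collection support method |
JP2020140722A (ja) * | 2020-04-20 | 2020-09-03 | ヤフー株式会社 | Content collection device, content collection method, and content collection program |
JP6991265B2 (ja) | 2020-04-20 | 2022-01-12 | ヤフー株式会社 | Content collection device, content collection method, and content collection program |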
Also Published As
Publication number | Publication date |
---|---|
US8676782B2 (en) | 2014-03-18 |
JPWO2010041517A1 (ja) | 2012-03-08 |
JP5325229B2 (ja) | 2013-10-23 |
US20110119263A1 (en) | 2011-05-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5325229B2 (ja) | Information collection apparatus, search engine, information collection method, and program | |
US10210256B2 (en) | Anchor tag indexing in a web crawler system | |
US6718365B1 (en) | Method, system, and program for ordering search results using an importance weighting | |
US8458163B2 (en) | System and method for enabling website owner to manage crawl rate in a website indexing system | |
US8515954B2 (en) | Displaying autocompletion of partial search query with predicted search results | |
US8271546B2 (en) | Method and system for URL autocompletion using ranked results | |
EP2352103B1 (en) | Information processing apparatus, document retrieval system, document retrieval method, and program | |
JP5106045B2 (ja) | Search-engine-linked file sharing system | |
US20090049171A1 (en) | System and computer-readable medium for controlling access in a distributed data processing system | |
BRPI0113882B1 (pt) | Method for searching and analysing the traffic content at access points in data networks | |
JP2000357176A (ja) | Content indexing search system and search result providing method | |
US8156227B2 (en) | System and method for managing multiple domain names for a website in a website indexing system | |
Dixit et al. | A novel approach to priority based focused crawler | |
CN101211340A (zh) | Dynamic web crawler based on a client/server architecture | |
WO2020024903A1 (zh) | Method, device, and computer-readable storage medium for searching blockchain data | |
US10860697B2 (en) | Private content in search engine results | |
JP4606548B2 (ja) | Search system maintenance method and search system | |
Leng et al. | PyBot: an algorithm for web crawling | |
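KR100756421B1 (ko) | System and method for collecting, indexing, and extracting overseas science and technology electronic full texts | |
JP3586272B2 (ja) | Search engine, retrieval system, and storage medium | |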
Sun et al. | Botseer: An automated information system for analyzing web robots | |
US20110208717A1 (en) | Chaffing search engines to obscure user activity and interests | |
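JPH11184862A (ja) | Group-adaptive information retrieval device | |
JP2003186901A (ja) | Web site search method and system, program for executing the method, and recording medium storing the program | |
JP2004185303A (ja) | WWW site history search apparatus, method, and program |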
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 09819053; Country of ref document: EP; Kind code of ref document: A1 |
 | WWE | Wipo information: entry into national phase | Ref document number: 13003875; Country of ref document: US |
 | WWE | Wipo information: entry into national phase | Ref document number: 2010532857; Country of ref document: JP |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 09819053; Country of ref document: EP; Kind code of ref document: A1 |