WO2002044946A1 - Search engine - Google Patents

Search engine

Info

Publication number
WO2002044946A1
Authority
WO
WIPO (PCT)
Prior art keywords
page
index page
database
update date
index
Prior art date
Application number
PCT/JP2000/008430
Other languages
English (en)
Japanese (ja)
Inventor
Motoharu Mizutani
Original Assignee
Kabushiki Kaisha Toshiba
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kabushiki Kaisha Toshiba
Priority to KR10-2004-7019523A (patent KR20050004274A)
Priority to JP2002508887A (patent JP3586272B2)
Priority to PCT/JP2000/008430 (patent WO2002044946A1)
Priority to KR10-2002-7006827A (patent KR100496384B1)
Publication of WO2002044946A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • the present invention relates to a search engine for searching data distributed on a network, a search system, a database creation method in the search system, and a storage medium.
  • the robot searches the network for text written in HTML (Hyper Text Markup Language) and follows the link destinations described in that text.
  • HTML Hyper Text Markup Language
  • the above databases may be distributed because of their large volume. However, this is simply a division to cope with the amount of data; the data is not partitioned according to any other criterion.
  • search for the word that you would like to include in the text you want to find.
  • a mirror site can be set up to decentralize access to popular sites and reduce traffic.
  • the I_Server http://www.pointcast.com/products/iserver.html
  • PCN Point Cast Network
  • the robot patrols the specified domain or URL and extracts URLs as it patrols.
  • when the search keyword is extracted from the retrieved file, the update date is obtained at the same time. The newness of the file is then determined from the obtained update date, and the display of search results is prioritized accordingly.
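The prioritization by update date described above can be sketched as a simple sort. This is a minimal illustration, not the patent's implementation: the function name `rank_by_newness`, the `(URL, update date)` tuple layout, and the example URLs are all assumptions made here.

```python
from datetime import date

# Hypothetical search hits as (URL, update date) pairs; not data from the patent.
hits = [
    ("http://example.com/a.html", date(2000, 3, 14)),
    ("http://example.com/b.html", date(2000, 8, 8)),
    ("http://example.com/c.html", date(2000, 2, 14)),
]


def rank_by_newness(results):
    """Return the results ordered from most recently updated to oldest."""
    return sorted(results, key=lambda hit: hit[1], reverse=True)


ranked = rank_by_newness(hits)
```

Sorting on the stored update date is only meaningful if that date reflects the real state of the page, which is the problem the following paragraphs address for frame-based index pages.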
  • in the case of index pages configured with frame tags, the index page itself is not updated even when the linked page in each frame is updated. Unless the index page is updated, its update date remains old and the search results do not match the content. In addition, in a system that excludes infrequently updated pages from the search target, frame-based pages are placed at a particular disadvantage.
  • an object of this invention is to provide a search engine, a search system, a database creation method in the search system, and a storage medium that, for the huge amount of search target data scattered across a network, obtain accurate update frequency information by changing the update date of an index page to the latest update date among the index page and its linked pages.
  • another object of this invention is to provide a search engine, a search system, a database creation method in the search system, and a storage medium in which the keywords of a linked page can be added to the keywords of the index page stored in the database.
  • the search engine of the present invention comprises: a database that stores index pages of information on the network, each including at least a URL (Uniform Resource Locator) or a domain, an update date, and keywords; and a traveling robot that traverses the database based on a specified domain or URL, obtains the update date of the index page and the update dates of the pages linked from the index page, and uses the latest update date as the update date of the index page.
  • URL Uniform Resource Locator
  • the search engine of the present invention comprises: a database that stores index pages on a network, each including at least a URL (Uniform Resource Locator) or a domain and keywords; and a traveling robot that traverses the database based on a specified domain or URL, acquires the keywords of the pages linked from the index page, and adds the acquired keywords to the keywords of the index page.
  • the search system of the present invention comprises: a database that stores index pages of information on a network, each including at least a URL (Uniform Resource Locator) or a domain, an update date, and keywords; a traveling robot that traverses the database based on a specified domain or URL, obtains the update date of the index page and the update dates of the pages linked from the index page, and sets the most recent update date as the update date of the index page; and a search engine that searches the database based on a specified keyword.
  • the search system of the present invention comprises: a database that stores index pages on a network, each including at least a URL (Uniform Resource Locator) or a domain and keywords; a traveling robot that traverses the database based on a specified domain or URL, obtains the keywords of the pages linked from the index page, and adds the obtained keywords to the keywords of the index page; and a search engine that searches the database based on a specified keyword.
  • a database creation method in a search system that has a database storing index pages of information on a network, each including at least a URL (Uniform Resource Locator) or a domain, an update date, and keywords, characterized in that the database is traversed based on a specified domain or URL, the update date of the index page and the update dates of the pages linked from the index page are obtained, and the latest of the obtained update dates is set as the update date of the index page.
  • a database creation method in a search system that has a database storing index pages of information on a network, each including at least a URL or a domain, an update date, and keywords, in which the database is traversed based on a specified domain or URL, the keywords of the pages linked from the index page are obtained, and the obtained keywords are added to the keywords of the index page.
  • a storage medium storing a program for causing a computer to create a database in a search system that has a database storing index pages, each including at least a URL (Uniform Resource Locator) or a domain, and that performs a database search in response to a search request.
  • a storage medium storing a program for causing a computer to create a database in a search system that has a database storing index pages of information on a network, each including at least a URL (Uniform Resource Locator) or a domain, an update date, and keywords, and that performs a database search in response to a search request.
  • database patrol is performed for the same domain as the index page.
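Restricting the patrol to the index page's own domain can be sketched by comparing host names. This is an illustrative assumption about how "same domain" might be tested; the helper names `same_domain` and `links_to_patrol` and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def same_domain(index_url, link_url):
    """True when both absolute URLs have the same host name.

    urlsplit(...).hostname is already lower-cased, so the comparison
    ignores letter case in the host part.
    """
    return urlsplit(index_url).hostname == urlsplit(link_url).hostname


def links_to_patrol(index_url, links):
    """Keep only the link destinations on the index page's own domain."""
    return [link for link in links if same_domain(index_url, link)]


selected = links_to_patrol(
    "http://www.example.com/index.html",
    [
        "http://WWW.EXAMPLE.COM/menu.html",
        "http://other.example.org/page.html",
    ],
)
```

A stricter or looser rule (for instance, treating subdomains as the same site) would be an equally plausible reading of the text.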
  • the index page and the link destination pages are composed using frame tags, and the latest update date of the pages in the frames is set as the update date of the index page.
  • the update date of the index page acquired by the traveling robot is compared with the update date of the linked page, and if the update date of the linked page is newer, the update date of the index page is replaced with the update date of the linked page.
  • the keyword extracted from the link destination page is added to the index page keyword extracted by the traveling robot.
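Adding the linked page's keywords to the index page's keywords can be sketched as a list merge. The text only says the keywords are added; skipping keywords that are already present is an assumption made here, and `merge_keywords` is a hypothetical name.

```python
def merge_keywords(index_keywords, linked_keywords):
    """Append linked-page keywords to the index page's keyword list.

    Keywords already present are skipped (an assumption; the text only
    says that the linked page's keywords are added).
    """
    merged = list(index_keywords)
    seen = set(merged)
    for keyword in linked_keywords:
        if keyword not in seen:
            merged.append(keyword)
            seen.add(keyword)
    return merged


index_page_keywords = merge_keywords(["toshiba", "search"], ["search", "robot"])
```

The original order of the index page's own keywords is preserved, so they stay ahead of the keywords inherited from linked pages.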
  • the above invention also holds as a machine-readable medium storing a program for causing a computer to execute the corresponding procedures or means.
  • with frame-based pages, it is mainly the linked pages in each frame that are updated, not the index page itself, so the index page is treated as if it were updated very infrequently. According to the present invention, even frame-based pages can be searched in the same way as non-frame pages.
  • the larger the database capacity, the more pages can be searched, so both the amount of information and the hit rate increase. However, if the number of registrations is increased indefinitely, the number of pages returned for one keyword also increases, and it becomes more difficult for searchers to extract the information they need from among them.
  • because search information can be collected in an index page, efficient search becomes possible. A brief description of the drawings follows.
  • FIG. 1 is a diagram showing an example configuration of a search engine according to an embodiment of the present invention.
  • FIG. 2 is a diagram showing the structure of the index page.
  • FIG. 3 is a flowchart showing the operation of the embodiment of the present invention.
  • FIG. 4 is a flowchart showing the operations of the traveling robot, web server, and user.
  • FIG. 5 is a diagram showing an example of a screen for inputting a domain or URL to be registered.
  • FIG. 6 is a diagram showing an example of a registered URL screen.
  • FIG. 7 is a diagram illustrating a screen example when a keyword is input.
  • FIG. 8 is a diagram illustrating a screen example of a search result obtained by a search engine.
  • a page shall mean a single piece of hypertext.
  • one page has a unique URL.
  • URL (Uniform Resource Locator) is the notation necessary for accessing page data.
  • URL includes protocol, domain name, port number, and path name information.
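The four parts of a URL named above can be seen by splitting one with Python's standard `urllib.parse.urlsplit`. The URL itself is a made-up example, not one from the patent.

```python
from urllib.parse import urlsplit

# A made-up URL used only to show the four components named in the text.
parts = urlsplit("http://www.example.com:8080/docs/index.html")

protocol = parts.scheme      # "http"
domain_name = parts.hostname  # "www.example.com"
port_number = parts.port      # 8080
path_name = parts.path        # "/docs/index.html"
```

When a URL omits the port, `parts.port` is `None` and the protocol's default port applies.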
  • a robot reads documents written as hypertext, such as Hyper Text Markup Language (HTML) and Standard Generalized Markup Language (SGML), and collects documents on the network while mechanically extracting the links written in them. It is realized in software. Robots are sometimes also called spiders or wanderers.
  • HTML Hyper Text Markup Language
  • SGML Standard Generalized Markup Language
  • the basic operation of the robot is as follows.
  • Step 1 Register the specified home page in the visiting list.
  • Step 2 The robot acquires a page according to the visiting list.
  • Step 3 Analyze the acquired page and extract URLs.
  • Step 4 Add the extracted URL to the visiting list (however, do not duplicate the URL).
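Steps 1 to 4 can be sketched as a small crawl loop. This is a minimal sketch under stated assumptions: pages are fetched through a caller-supplied `fetch` function (here backed by an in-memory dictionary, `FAKE_WEB`, instead of a real network), links are taken from `<a href>` and `<frame src>` attributes, and all names are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects link destinations from <a href=...> and <frame src=...>."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            is_link = (tag == "a" and name == "href") or (
                tag == "frame" and name == "src"
            )
            if is_link and value:
                self.links.append(value)


def crawl(start_url, fetch):
    """Run Steps 1-4; fetch(url) returns HTML text or None if unavailable."""
    visiting_list = deque([start_url])  # Step 1: register the start page
    registered = {start_url}
    pages = {}
    while visiting_list:
        url = visiting_list.popleft()  # Step 2: acquire the next page
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        extractor = LinkExtractor()  # Step 3: analyze and extract URLs
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)
            if absolute not in registered:  # Step 4: add without duplicates
                registered.add(absolute)
                visiting_list.append(absolute)
    return pages


# In-memory stand-in for the network (hypothetical pages).
FAKE_WEB = {
    "http://example.com/index.html": '<a href="a.html">A</a><a href="b.html">B</a>',
    "http://example.com/a.html": '<a href="index.html">home</a>',
    "http://example.com/b.html": "<p>no links</p>",
}

pages = crawl("http://example.com/index.html", FAKE_WEB.get)
```

The `registered` set is what enforces the "do not duplicate the URL" condition of Step 4: the link from `a.html` back to `index.html` is seen but not queued again.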
  • the acquisition frequency of the page may be determined according to the frequency of updating the page.
  • a page is treated as an example of data distributed in a network.
  • FIG. 1 shows a configuration diagram of an entire search system including a search engine of the present invention.
  • the network 1 is connected to web servers 9 and 11, a user PC 13, a search server 19 and a search engine 21.
  • the search engine 21 is composed of a traveling robot 3, a database 5, and an engine 17.
  • the traveling robot 3 accesses the registered domain and URL, obtains the update date, and extracts keywords. It also accesses the linked pages, obtains their update dates, and extracts keywords. It registers the acquired update date and extracted keywords in the database 5.
  • the database 5 stores the index pages and the visiting list.
  • the index page includes a URL, keywords, and attribute information.
  • the attribute information includes an update date.
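The stored record described above can be sketched with dataclasses. The class and field names (`Attributes`, `IndexPage`, `Database`) are illustrative assumptions; the text only states which items the index page and the database hold.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class Attributes:
    """Attribute information; the text names only the update date."""
    update_date: Optional[date] = None


@dataclass
class IndexPage:
    """One index page record: URL, keywords, and attribute information."""
    url: str
    keywords: List[str] = field(default_factory=list)
    attributes: Attributes = field(default_factory=Attributes)


@dataclass
class Database:
    """Holds the index pages (keyed by URL here) and the visiting list."""
    index_pages: Dict[str, IndexPage] = field(default_factory=dict)
    visiting_list: List[str] = field(default_factory=list)


db = Database()
db.visiting_list.append("http://example.com/index.html")
db.index_pages["http://example.com/index.html"] = IndexPage(
    url="http://example.com/index.html",
    keywords=["example"],
    attributes=Attributes(update_date=date(2000, 3, 14)),
)
```

Keying the index pages by URL is a choice made for this sketch so that a later patrol can overwrite the same record.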
  • Engine 17 searches database 5 based on the specified keyword.
  • the search server 19 is, for example, a search server typified by Infoseek.
  • in step S1 of FIG. 3, the user registers a domain or URL. That is, a domain or URL input screen (the registration screen of the search engine) such as that shown in FIG. 5 is displayed on the screen of the user PC 13. The user enters the domain or URL to be searched and selects the registration button 15. As a result, as shown in FIG. 4, the traveling robot 3 registers the domain or URL entered by the user in the visiting list in the database 5.
  • in step S3 of FIG. 3, the index page is accessed. That is, as shown in FIG. 4, the traveling robot 3 sends the registered domain or URL to the web server 11, and the web server 11 accesses the index page based on the received domain or URL and sends it to the traveling robot 3.
  • the traveling robot 3 obtains the update date A of the index page transmitted from the web server 11. Next, in step S7 of FIG. 3, keywords registered in the index page are extracted.
  • in step S9 of FIG. 3, the link destination is accessed. That is, as shown in FIG. 4, the traveling robot 3 transmits a link destination address included in the index page to the web server 9 (11).
  • the web server 9 (11) accesses the link destination page on the web server 9 (11) based on the link destination address, and transmits the page to the traveling robot 3.
  • in step S11 of FIG. 3, the update date B is obtained. That is, as shown in FIG. 4, the traveling robot 3 obtains the update date B of the link destination page, and further extracts keywords.
  • in step S13 of FIG. 3, the update dates A and B are compared, and in step S15, the update date is updated. That is, as shown in Fig.
  • in step S21, it is determined whether or not the patrol has been completed. If the patrol has not been completed, the process returns to step S9, and steps S9 to S21 are repeated.
  • if it is determined in step S21 that the patrol has been completed, the traveling robot 3 registers the obtained update date and keywords in the database 5 in step S23.
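The flow from accessing the index page to registering the result can be sketched as one function. This is a sketch under stated assumptions, not the patented implementation: `fetch` is a hypothetical callable returning `(update_date, keywords, links)` for a URL, the site data is invented, and the step labels in the comments follow the step numbers quoted in the text (the step number for adding keywords is not given there).

```python
from datetime import date


def patrol_index_page(index_url, fetch):
    """Patrol one index page and return the record to register.

    fetch(url) -> (update_date, keywords, links) is a hypothetical
    stand-in for the exchange with the web server.
    """
    update_a, keywords, links = fetch(index_url)  # S3: access index page; S7: keywords
    keywords = list(keywords)
    for link in links:  # S9: access each link destination
        update_b, link_keywords, _ = fetch(link)  # S11: update date B and keywords
        if update_b > update_a:  # S13: compare A and B
            update_a = update_b  # S15: adopt the newer date
        for keyword in link_keywords:  # keyword addition (step number not stated)
            if keyword not in keywords:
                keywords.append(keyword)
    # Leaving the loop corresponds to S21 (patrol completed);
    # the returned record is what S23 registers in the database.
    return {"url": index_url, "update_date": update_a, "keywords": keywords}


# Invented site data for illustration only.
SITE = {
    "http://example.com/index.html": (
        date(2001, 1, 10),
        ["home"],
        ["http://example.com/news.html", "http://example.com/about.html"],
    ),
    "http://example.com/news.html": (date(2001, 5, 2), ["news", "home"], []),
    "http://example.com/about.html": (date(2000, 12, 1), ["company"], []),
}

record = patrol_index_page("http://example.com/index.html", SITE.__getitem__)
```

Because the comparison is repeated for every link destination, the value left in `update_a` at the end is the latest date among the index page and all of its linked pages.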
  • FIG. 6 is a diagram illustrating an example in which the traveling robot 3 uses the latest update date of the pages in the frames as the update date of the index page. That is, it is assumed that the user has registered .domain.com/index.html using the domain or URL registration screen shown in FIG. 5. It is also assumed that the current update date of the index page is March 14, 2000. The index page consists of a link destination page title.html with an update date of February 14, 2000, a link destination page menu.html with an update date of August 1, 2000, and a link destination page welcom.html with an update date of August 8, 2000. The traveling robot 3 obtains the update dates of these linked pages, compares them, and sets the latest update date, August 8, 2000, as the update date of the index page.
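The example above can be checked with a few lines of date arithmetic. The dates are the ones given in the text; the file names are written in their presumed form (title.html, menu.html, welcom.html), and the variable names are illustrative.

```python
from datetime import date

# Dates from the example in the text.
index_update = date(2000, 3, 14)
frame_updates = {
    "title.html": date(2000, 2, 14),
    "menu.html": date(2000, 8, 1),
    "welcom.html": date(2000, 8, 8),
}

# The latest date among the index page and its frame pages becomes
# the index page's update date.
latest = max([index_update, *frame_updates.values()])
print(latest.isoformat())  # 2000-08-08
```

Note that title.html is older than the index page itself, so including the index page's own date in the comparison matters whenever every frame page is older.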
  • the search is provided, for example, on a page of the search server 19 (for example, a home page provided by FreshEye, Infoseek, or the like). For example, a keyword is entered from a keyword input screen for searching.
  • when the search button 17 is selected after a keyword is entered, a keyword search is performed by the engine 17 shown in FIG. 1, and a search result such as that shown in FIG. 8 is displayed.
  • the present invention is applicable to a search system on a network using a robot.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The update date of an index page acquired by a traveling robot (3) is compared with that of a linked web page. If the update date of the linked page is more recent, the update date of the index page is replaced with that of the linked page, and a keyword extracted from the linked page is added to the keywords of the index page extracted by the traveling robot.
PCT/JP2000/008430 2000-11-29 2000-11-29 Search engine WO2002044946A1 (fr)

Priority Applications (4)

Application Number Priority Date Filing Date Title
KR10-2004-7019523A KR20050004274A (ko) 2000-11-29 2000-11-29 Search engine, search system, database creation method in the search system, and storage medium
JP2002508887A JP3586272B2 (ja) 2000-11-29 2000-11-29 Search engine, search system, and storage medium
PCT/JP2000/008430 WO2002044946A1 (fr) 2000-11-29 2000-11-29 Search engine
KR10-2002-7006827A KR100496384B1 (ko) 2000-11-29 2000-11-29 Search engine, search system, database creation method in the search system, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2000/008430 WO2002044946A1 (fr) 2000-11-29 2000-11-29 Search engine

Publications (1)

Publication Number Publication Date
WO2002044946A1 (fr) 2002-06-06

Family

ID=11736729

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2000/008430 WO2002044946A1 (fr) 2000-11-29 2000-11-29 Search engine

Country Status (3)

Country Link
JP (1) JP3586272B2 (fr)
KR (2) KR20050004274A (fr)
WO (1) WO2002044946A1 (fr)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007157132A (ja) * 2005-12-08 2007-06-21 Internatl Business Mach Corp <Ibm> Method and program for managing document-based information and uniform resource locators (URLs)
JP2008293384A (ja) * 2007-05-25 2008-12-04 Fuji Xerox Co Ltd Information processing apparatus and control program
JP2008299788A (ja) * 2007-06-04 2008-12-11 Fujitsu Ltd Web server apparatus, web server program, and method for managing a web server apparatus
JP2011223283A (ja) * 2010-04-09 2011-11-04 Funai Electric Co Ltd Television apparatus
JP2020197876A (ja) * 2019-05-31 2020-12-10 Gmo Tech株式会社 Information processing system, program, and information processing method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH03174653A (ja) * 1989-12-01 1991-07-29 Matsushita Electric Ind Co Ltd Keyword management method and apparatus therefor
JPH117449A (ja) * 1997-06-16 1999-01-12 Hitachi Ltd Hypertext information collection method
JPH11212852A (ja) * 1998-01-28 1999-08-06 Nec Software Chubu Ltd TCP/IP communication home page reading method, apparatus therefor, and information recording medium
JPH11296463A (ja) * 1998-04-10 1999-10-29 Nec Software Ltd Marking and redisplay method for home pages that use frames
JPH11296428A (ja) * 1998-04-14 1999-10-29 Nec Home Electron Ltd Home page update check method and apparatus, and readable recording medium storing a control program for update checking

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Internet hiza kurige", KABUSHIKI KAISHA ASCII, vol. 20, no. 11, 1 November 1996 (1996-11-01), JAPAN, pages 400 - 403, XP002937860 *

Also Published As

Publication number Publication date
JP3586272B2 (ja) 2004-11-10
KR100496384B1 (ko) 2005-06-21
JPWO2002044946A1 (ja) 2004-04-02
KR20020070293A (ko) 2002-09-05
KR20050004274A (ko) 2005-01-12

Similar Documents

Publication Publication Date Title
US9305100B2 (en) Object oriented data and metadata based search
US7979427B2 (en) Method and system for updating a search engine
US7552109B2 (en) System, method, and service for collaborative focused crawling of documents on a network
US6321228B1 (en) Internet search system for retrieving selected results from a previous search
JP5474038B2 (ja) Mobile sitemap
US7383299B1 (en) System and method for providing service for searching web site addresses
US7539669B2 (en) Methods and systems for providing guided navigation
US20070271255A1 (en) Reverse search-engine
JP2016181306A (ja) System and method for narrowing a search using index keys
Dixit et al. A novel approach to priority based focused crawler
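CN107291940A (zh) Page content management method, apparatus, and related server
JP4769822B2 (ja) Server, method, and system for providing an information retrieval service using page groups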
Berger et al. Mapping the Blogosphere--Towards a universal and scalable Blog-Crawler
US20120317091A1 (en) System and method for users to get newly updates
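JP2004206492A (ja) Document display method and gateway apparatus with link destination selection function using the same
WO2002044946A1 (fr) Search engine
JP2005056371A (ja) Web search information management method, management system, and computer software program
JP2004206629A (ja) Integrated search server system for heterogeneous data sources
KR20000017909A (ko) Information retrieval apparatus on the Internet and information retrieval method using the same
JP3632354B2 (ja) Information retrieval apparatus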
US10061859B2 (en) Computer implemented systems and methods for dynamic and heuristically-generated search returns of particular relevance
Saranya et al. A Study on Competent Crawling Algorithm (CCA) for Web Search to Enhance Efficiency of Information Retrieval
Aliyu et al. Google query optimization tool
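JP5559725B2 (ja) Method for providing an information retrieval service using web pages divided into multiple information blocks
JP5525424B2 (ja) Document search apparatus, document search method, and document search program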

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref country code: JP

Ref document number: 2002 508887

Kind code of ref document: A

Format of ref document f/p: F

WWE Wipo information: entry into national phase

Ref document number: 1020027006827

Country of ref document: KR

AK Designated states

Kind code of ref document: A1

Designated state(s): CN JP KR SG US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR

WWP Wipo information: published in national office

Ref document number: 1020027006827

Country of ref document: KR

121 EP: the EPO has been informed by WIPO that EP was designated in this application
122 EP: PCT application non-entry in European phase
WWG Wipo information: grant in national office

Ref document number: 1020027006827

Country of ref document: KR