WO2004088542A1 - A method of managing registered web sites in search engine and a system thereof - Google Patents

A method of managing registered web sites in search engine and a system thereof

Info

Publication number
WO2004088542A1
Authority
WO
WIPO (PCT)
Prior art keywords
web site
adult
site
predetermined
web
Prior art date
Application number
PCT/KR2004/000665
Other languages
French (fr)
Inventor
Hyun Jung Lee
Original Assignee
Nhn Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nhn Corporation
Publication of WO2004088542A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • the present invention relates to a search engine for providing information about web sites on the Internet, and more particularly to a method for managing web sites registered in a search engine, wherein information about the web sites registered in the search engine is analyzed to prevent the provision of search results different from essential contents contained in the web sites.
  • Lycos http://www.lycos.com
  • Yahoo http://www.yahoo.com
  • a search engine such as Lycos generally includes a database for classifying, storing and managing web site information based on a predetermined rule;
  • a search robot, embodied as software, for constantly traveling over the web and automatically collecting new web site information; and
  • search engine software for storing the collected data in the database and allowing a user of the search engine to search for desired information in the database.
  • Fig. 1a is a block diagram showing an entire system for providing the search engine service.
  • a user connects to a search engine server 150 over the Internet via a user terminal 110. If the user enters search terms, the search engine server 150 queries search engine software 140 about web site information corresponding to the entered search terms, and the search engine software 140 searches a database 130 to notify the user of retrieved web site information.
  • a search robot 120 is an entity embodied as software for constantly traveling over the web and automatically collecting new web site information from a web server 160, as described above. The search robot 120 searches for HTML (Hypertext Markup Language) documents on a network and parses links described in the HTML documents and then collects data from a number of web sites existing on the network.
  • the data collected by the search robot 120 is databased.
  • the term "databased" refers to a series of processes of performing morphological analysis of information located on a web site and producing a corresponding index table and storing it in the database 130.
  • the database 130 is provided to store all web site information collected by the search robot 120.
  • the search engine software 140 functions to show search results to users. This software searches a large number of pages stored in the database 130 and lists search results by relevance to the search term.
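
The indexing and lookup described above can be illustrated with a toy inverted index. This is only a sketch under stated assumptions: whitespace tokenization stands in for the morphological analysis, term frequency stands in for "relevance", and the names `build_index` and `search` and the example URLs are hypothetical.

```python
from collections import defaultdict


def build_index(pages: dict[str, str]) -> dict[str, dict[str, int]]:
    """Build a word -> {url: occurrence count} index table."""
    index: dict[str, dict[str, int]] = defaultdict(dict)
    for url, text in pages.items():
        # Assumption: lowercase whitespace tokens replace morphological analysis.
        for word in text.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return dict(index)


def search(index: dict[str, dict[str, int]], term: str) -> list[str]:
    """Return URLs containing the term, most occurrences first."""
    postings = index.get(term.lower(), {})
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(postings, key=lambda url: (-postings[url], url))


pages = {
    "http://a.example": "world cup world cup tickets",
    "http://b.example": "cup of tea",
}
index = build_index(pages)
results = search(index, "cup")
```

A real search engine would use a far richer relevance measure; the point here is only the split between building the index table and querying it.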
  • the conventional search engine as described above registers information about a web site in a search engine and provides the information to users in the following ways.
  • Information of a web site is collected using the search robot as described above, and the web site information is registered in the search engine after being reviewed by expert surfers.
  • a category corresponding to the subject of a web site to be registered is selected from a directory of categories classified by subject, and it is requested that the web site be registered in the selected category, and then the web site is registered in the search engine after being reviewed by expert surfers.
  • Web sites registered in the search engine in the above method are provided to a user who is looking for desired information after they are searched for in various ways, such as integrated web search and directory search, based on search terms entered by the user.
  • the integrated web search is also called "word-based search", in which Uniform Resource Locators (URLs) of all web sites are stored in a database and desired information is searched for based on a specific keyword entered by the user.
  • the directory search is also called "subject-based search", in which web sites are organized into subject-based categories and if a user links to a desired category, the user can view detailed items thereof. In this manner, the subject-based search allows the user to continue to link to the detailed items and retrieve desired information.
  • Fig. 1b is an example screenshot of the directory search method. As shown in this figure, directory search results with search terms "world cup" are three categories
  • a typical search engine based on the integrated web search method is Lycos (http://lycos.cs.cmu.edu) developed by Michael L. Mauldin at Carnegie-Mellon University, and a typical search engine based on the directory search method is Yahoo (http://www.yahoo.com).
  • Many current search engines provide hybrid search services based on a combination of the different search methods described above.
  • the conventional method for registering web sites in the search engine and searching for the registered web sites has the following problems.
  • (1) Some web sites (hereinafter "deceptive sites") contain content of no use to users and insert popular keywords in their web pages in various ways. For example, if a user enters the popular keyword "Pikachu" to search for information about Pikachu, information of all registered web sites that contain the word "Pikachu" in their web pages is provided to the user.
  • the web sites provided to the user may include web sites that contain adult or sexual contents and insert the word "Pikachu" in some places in their web pages in various ways (with ill intention in most cases).
  • This popular keyword insertion causes a wide age range of users to be exposed to the information of the web sites that contain adult or sexual contents.
  • (2) Content contained in a web site at the time when it is registered in the search engine may be different from content contained therein after the registration.
  • altered sites may cause, at any time, unexpected damage to a number of users who perform the directory search.
  • Fig. 1c is a diagram illustrating an example web site, the content of which is altered after its registration. This figure shows a search results screen for a specific search term.
  • a web site providing information about arcade machines is listed on the search results screen.
  • adult content is displayed on the screen, instead of the information about the arcade machines.
  • the altered site causes unexpected damage to users who are searching for information on arcade machines.
  • the conventional method for overcoming the problems described above requires complaint reports by users or requires specialists such as expert surfers to constantly monitor the registered web sites.
  • the conventional method obviously cannot be an ultimate solution to the problems. If an algorithm automatically executed on the Internet to solve the problems can be provided, it will be a useful means to solve the problems all at once.
  • the present invention has been made in view of the above problems, and it is an object of the present invention to provide a method for managing web sites registered in a search engine, in which an algorithm is used to automatically detect deceptive sites or altered sites, thereby allowing users of the search engine to correctly search for their desired information. It is another object of the present invention to provide a method for managing web sites registered in a search engine, in which deceptive sites or altered sites are automatically detected, and punitive measures are automatically imposed on operators of the detected deceptive sites or altered sites, thereby reinforcing self-purification of the web sites registered in the search engine.
  • a method for managing a web site registered in a search engine comprising the steps of: classifying web site information of the web site by predetermined fields and recording the classified web site information in a database; maintaining predetermined adult keywords in an adult keyword database; reading a source file constituting a web page of the web site; analyzing the read source file; and determining, based on a predetermined basis, whether or not the web site is an adult site.
  • a method for managing a web site registered in a search engine comprising the steps of: classifying web site information of the web site by predetermined fields and recording the classified web site information in a database; maintaining predetermined adult keywords in an adult keyword database; maintaining predetermined popular keywords in a popular keyword database; reading a source file constituting a web page of the web site; analyzing the read source file; determining, based on a first predetermined basis, whether or not the web site is an adult site; determining, based on a second predetermined basis, whether or not the web site is a deceptive site, if the web site is determined to be an adult site; and performing a control operation to perform predetermined processing on the web site if the web site is determined to be a deceptive site.
  • The term "deceptive site" used in the present specification refers to a web site that inserts specific keywords in a source file of its web page in various ways and contains contents entirely different from those to be searched for based on the specific keywords.
  • the deceptive site is an adult site that inserts specific popular keywords in its web page and may be provided as a search result, irrespective of its essential content.
  • The term "altered site" refers to a web site whose subject after registration in a search engine differs from its subject when it was initially registered.
  • the altered site is a web site that has been initially registered in the search engine by requesting registration as a general site, but has changed its content to adult content after the registration.
  • The term "adult site" refers to a web site that contains content very harmful to minors under 19. Research results show that most deceptive sites and most altered sites are highly likely to be adult sites.
  • The term "popular keywords" refers to search words that appear very frequently among the search words entered by Internet users.
  • the popular keywords may continually vary depending on the Internet users' tendency and social situations of the time.
  • the popular keywords may include harmful keywords containing socially harmful content, and some examples thereof are "suicide", "reject",
  • The term "adult keywords" refers to search terms used to search for various adult contents contained in adult sites.
  • the adult keywords mostly rank high in the popular keywords provided by search engine providers, which indicates that the adult keywords have some connection with the popular keywords.
  • Fig. 1a is a block diagram showing the configuration of a conventional system for providing web site search engine services
  • Fig. 1b is an example screenshot of a directory search method that is one of the web site search methods provided by search engines;
  • Fig. 1c is a diagram showing an example web site that is altered after the registration
  • Fig. 2 is a block diagram showing the configuration of a system for managing web sites registered in a search engine according to a preferred embodiment of the present invention
  • Fig. 3a is a flow chart showing a method for detecting adult sites in order to manage web sites registered in a search engine according to an embodiment of the present invention
  • Fig. 3b is a flow chart showing a method for detecting deceptive or altered sites in order to manage web sites registered in a search engine according to an embodiment of the present invention
  • Fig. 4a is a flow chart showing a method for selecting adult keywords in the method for managing web sites registered in the search engine according to an embodiment of the present invention
  • Fig. 4b is a flow chart showing a method for selecting popular keywords in the method for managing web sites registered in the search engine according to an embodiment of the present invention
  • Figs. 5a to 5e are diagrams illustrating search result information types of adult sites that the search robot has obtained by searching the web in the method for managing the web sites registered in the search engine according to the preferred embodiment of the present invention
  • Fig. 6 is a flow chart showing a method for taking a predetermined punitive measure against a registrant of a web site determined to be a deceptive or altered site in the method for managing the web sites registered in the search engine according to the preferred embodiment of the present invention.
  • Fig. 7 is a block diagram showing the internal configuration of a general computer system that can be used in managing the web sites registered in the search engine according to the present invention.
  • Fig. 2 is a block diagram showing the configuration of a system for managing web sites registered in a search engine according to an embodiment of the present invention.
  • the system according to the embodiment of the present invention includes an interface module 201, a web site registration module 202, a web site management module 203, a web site information database 204, a web site analysis module 205, a keyword management module 206 which includes a keyword analyzer 208 and an adult keyword extractor 209, a search robot 207, a popular keyword database 210, and an adult keyword database 211.
  • the system for managing web sites registered in the search engine may include a mail server 212 or a SMS server 213 for sending a predetermined message to a registrant of a registered web site.
  • the mail server 212 and the SMS server 213 may be provided in a system for providing search engine services or may be located in a system operated by a third party.
  • Although the interface module 201, other various modules, and the mail server 212 or the SMS server 213 are illustrated in Fig. 2 as separate entities, this illustration has been made only for easier explanation, and they may be the same entity.
  • the elements shown in Fig. 2 may also be physically located at the same place, or alternatively they may be physically located apart from each other according to another embodiment of the present invention.
  • the interface module 201 functions to support data transmission between the search engine registration management system and a computer terminal provided to a registrant who desires to register a predetermined web site in the search engine, and also functions to interface between physical transmission equipment.
  • the web site registration module 202 functions to receive a request to register the predetermined web site from the registrant, and also to collect and classify information/data about the web site contained in the web site registration request.
  • the web site registration module 202 may further include a billing module (not shown) for charging predetermined fees for the web site registration.
  • the billing module may operate to charge different fees for a web site desired to be registered, depending on the type of the web site (i.e., depending on whether it is a general site containing general content or an adult site containing adult content).
  • the web site management module 203 is a module for overall registration management of web sites according to the present invention. Based on information of the web sites collected by the search robot 207, the web site management module 203 determines whether the web sites are operated in conformity with the standard under which their registration was permitted. If it is determined that a web site is operated inappropriately (i.e., it is a deceptive site or an altered site), the web site management module 203 automatically takes a predetermined measure against the registrant of the web site.
  • the web site management module 203 can interwork with the mail server 212 or the SMS server 213 to send an email to the registrant of the deceptive or altered site or to send an SMS message to a mobile terminal of the registrant, thereby warning the registrant about the inappropriate operation of the site.
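
As a rough illustration of the warning step, the following sketch composes (but does not send) a warning email with Python's standard `email.message.EmailMessage`. The sender address, subject wording and the function name `build_warning_email` are assumptions made for the example; the specification does not define the message contents.

```python
from email.message import EmailMessage


def build_warning_email(registrant_email: str, site_url: str, reason: str,
                        sender: str = "noreply@search.example") -> EmailMessage:
    """Compose a warning to the registrant of a deceptive or altered site."""
    msg = EmailMessage()
    msg["From"] = sender            # hypothetical sender address
    msg["To"] = registrant_email
    msg["Subject"] = f"Warning about registered web site {site_url}"
    msg.set_content(
        f"Your registered web site {site_url} was determined to be "
        f"a {reason} site. Please correct its operation."
    )
    return msg


warning = build_warning_email("owner@example.com", "http://site.example", "deceptive")
```

Handing the message to a mail server could then be done with `smtplib.SMTP(...).send_message(warning)`; an SMS gateway would need its own, provider-specific client.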
  • the web site information database 204 functions to classify and record information of the registered web sites.
  • Various web site information, such as the URL (Uniform Resource Locator) of a web site, web site category information, keywords, registrant information (registrant's name, address, email address, mobile terminal number, etc.), directory information, and the like, may be classified by information field and stored in the web site information database 204.
  • the web site category information of a web site indicates whether the web site is registered as a general web site or an adult web site.
  • Information of a web site stored in the web site information database 204 may be modified by a registrant of the web site and by a system manager.
  • the web site information database 204 may automatically update information of the web site stored therein, based on analysis results (for example, based on a new keyword corresponding to a URL of the web site) of data collected by the search robot 207 even though a registrant of the web site does not directly modify the stored information of the web site.
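
One possible shape for such a record is sketched below. The class and field names (`SiteRecord`, `Registrant`, `SiteInfoDatabase`, and so on) are hypothetical, and a dictionary stands in for the actual database; only the listed information fields come from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class SiteCategory(Enum):
    GENERAL = "general"
    ADULT = "adult"


@dataclass
class Registrant:
    name: str
    address: str
    email: str
    mobile_number: str


@dataclass
class SiteRecord:
    url: str
    category: SiteCategory      # category declared at registration
    registrant: Registrant
    directory: str              # directory (subject category) information
    keywords: list[str] = field(default_factory=list)


class SiteInfoDatabase:
    """In-memory stand-in for the web site information database 204."""

    def __init__(self) -> None:
        self._records: dict[str, SiteRecord] = {}

    def register(self, record: SiteRecord) -> None:
        self._records[record.url] = record

    def get(self, url: str) -> Optional[SiteRecord]:
        return self._records.get(url)

    def add_keyword(self, url: str, keyword: str) -> None:
        # Automatic update driven by crawl analysis, not by the registrant.
        record = self._records[url]
        if keyword not in record.keywords:
            record.keywords.append(keyword)


db = SiteInfoDatabase()
db.register(SiteRecord(
    url="http://arcade.example",
    category=SiteCategory.GENERAL,
    registrant=Registrant("Name", "Address", "owner@example.com", "000-0000"),
    directory="Games/Arcade",
))
db.add_keyword("http://arcade.example", "arcade")
```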
  • the web site analysis module 205 functions to analyze information of web sites collected by the search robot 207.
  • the type of data collected by the search robot 207 and a method for analyzing the collected data will be described below in detail with reference to Fig. 3.
  • the keyword management module 206 functions to manage keywords that may be a basis for determining whether or not a specific web site is a deceptive or altered site. In other words, the keyword management module 206 selects and manages adult and popular keywords that may be a basis for the deceptive site determination, and selects and manages adult keywords that may be a basis for the altered site determination.
  • the keyword management module 206 of the system for managing the registered web sites according to the embodiment of the present invention includes a keyword analyzer 208 for selecting popular keywords and an adult keyword extractor 209 for selecting adult keywords.
  • the keyword analyzer 208 records search words that a plurality of users have entered to the search engine, and records the number of times the entered search words appear.
  • the keyword analyzer 208 can select popular keywords from the entered search words by sorting them in order of the number of appearance times at intervals of a predetermined period.
  • the adult keyword extractor 209 can select one or more adult sites, analyze source files of web pages of the selected adult sites, and extract character strings that appear frequently on the web pages to select the extracted character strings as adult keywords.
  • the selected popular keywords are stored and managed in a popular keyword database 210, whereas the selected adult keywords are stored and managed in an adult keyword database 211. These databases 210 and 211 may operate to be automatically updated when a new popular or adult keyword is detected.
  • Fig. 3a is a flow chart showing a method for managing web sites registered in a search engine according to a preferred embodiment of the present invention.
  • a method for managing web sites registered in a search engine according to a preferred embodiment of the present invention will now be described in detail with reference to Fig. 3a in conjunction with Figs. 5a to 5e.
  • the web site registration management method is performed in the following manner, as shown in Fig. 3a.
  • a registrant who desires to register a predetermined web site in the search engine, makes a request to register the web site with information of the web site (305).
  • the information of the web site is classified by information fields (registrant's name, address, email address, mobile phone number, etc.) and recorded in a web site information database (310), and the web site is registered in the search engine (315).
  • This registration step 315 may be performed in several ways. For example, in one way, a web site is registered in the search engine upon request of a manager of the web site as described above.
  • In another way, a web site is registered in the search engine based on information of the web site obtained by the search robot that randomly travels over the web.
  • The registrant (i.e., the manager) of the web site can request that the web site be registered in a category closest to a subject (for example, "Pikachu" and "patent attorney exam") thereof decided by the registrant.
  • the requested web site can be registered in the search engine if it is determined that the requested web site satisfies predetermined requirements (for example, quality of the web site or noncommercial site requirements in case no registration fee is paid).
  • the method for managing web sites registered in the search engine according to the present invention will be described, limited to the case where the web site is registered in the search engine upon request of the registrant of the web site.
  • the method and system for managing web sites registered in the search engine according to the present invention can also be applied to other various ways in which the web site is registered in the search engine.
  • a specific adult keyword is stored and managed in the adult keyword database (320).
  • the search engine operates to analyze a source file contained in a web page of the web site.
  • the search engine may operate to crawl a web site and generate search result data thereof divided into elements #TITLE, #BODY or #ANCHOR.
  • the search engine reads a source file constituting a web page (325), analyzes the read source file (330), and determines, based on a predetermined basis, whether the web site is an adult site (335).
  • An example of the search result data will be described below in detail with reference to Figs. 5a to 5e.
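
The division of a crawled page into title, body and anchor elements might look like the sketch below, built on Python's standard `html.parser.HTMLParser`. The class name `PageSplitter`, the function `split_page` and the decision to keep anchor text out of the body bucket are assumptions for illustration, not the format actually produced by the search robot.

```python
from html.parser import HTMLParser


class PageSplitter(HTMLParser):
    """Collects page text into 'TITLE', 'BODY' and 'ANCHOR' buckets."""

    BUCKETS = {"title": "TITLE", "body": "BODY", "a": "ANCHOR"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: dict[str, list[str]] = {"TITLE": [], "BODY": [], "ANCHOR": []}
        self._stack: list[str] = []   # currently open tags of interest

    def handle_starttag(self, tag, attrs):
        if tag in self.BUCKETS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self._stack:
            # Pop up to and including the matching open tag.
            while self._stack and self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            # Text is assigned to the innermost open element of interest.
            self.parts[self.BUCKETS[self._stack[-1]]].append(text)


def split_page(source: str) -> dict[str, str]:
    parser = PageSplitter()
    parser.feed(source)
    parser.close()
    return {name: " ".join(chunks) for name, chunks in parser.parts.items()}


source = ("<html><head><title>Pikachu</title></head>"
          "<body><p>Hello world</p><a href='x'>link text</a> tail</body></html>")
parts = split_page(source)
```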
  • Fig. 4a shows a method for selecting the adult keywords.
  • the method for selecting the adult keywords as a basis for determining whether a web site is an adult site includes the following steps. First, one or more adult sites are selected (405). The adult site can be selected directly by a manager of the web site registration management system according to the present invention. Alternatively, one or more of the web sites registered as adult sites can be automatically selected by searching a web site category information field in the database of the system according to the present invention. Character strings contained in a web page of the selected adult site are extracted (410), and respective frequencies of the extracted character strings are recorded (415). To record the frequencies, the extracted character strings can be recorded in the form of a table and the value of a frequency field of the table can be increased by one each time a corresponding character string is extracted.
  • the recorded character strings, as detected by the analysis, are sorted in order of the frequencies at intervals of a predetermined period (daily, weekly or monthly) (420).
  • High-ranked character strings are extracted from the recorded character strings, and the extracted character strings are selected as adult keywords, which are then stored in the adult keyword database (425).
  • Alternatively, all of the detected character strings can be selected as adult keywords without the sorting.
  • In this case, non-adult character strings may be selected as adult keywords, but the increase in system load that the sorting would cause when selecting the adult keywords is avoided.
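
The selection steps above (extract strings, count frequencies, sort, keep the high-ranked ones) can be sketched as follows. The regular-expression tokenizer, the class name `FrequencyTable` and the placeholder tokens are assumptions; a real system would extract character strings from page source files of selected adult sites.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\w+")


def extract_strings(page_text: str) -> list[str]:
    """Extract character strings from a page (assumption: word-like tokens)."""
    return TOKEN.findall(page_text.lower())


class FrequencyTable:
    """Table with one frequency field per extracted character string."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, strings: list[str]) -> None:
        for s in strings:
            self.counts[s] += 1   # increase the frequency field by one

    def top(self, n: int) -> list[str]:
        # Sort by descending frequency; ties are broken alphabetically.
        ranked = sorted(self.counts, key=lambda s: (-self.counts[s], s))
        return ranked[:n]


table = FrequencyTable()
for page in ["alpha beta alpha", "alpha gamma beta"]:   # placeholder page texts
    table.record(extract_strings(page))
candidate_keywords = table.top(2)
```

Skipping the call to `top` and taking every key of `table.counts` corresponds to the unsorted variant, which trades precision for lower load.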
  • At step 335, it is determined, based on a predetermined basis, whether the web site is an adult site.
  • a method for determining whether a web site is an adult site, according to a preferred embodiment of the present invention, will now be described with reference to Figs. 5a to 5e.
  • Figs. 5a to 5e are diagrams illustrating search result information types of adult sites that the search robot has obtained by searching the web in the method for managing the web sites registered in the search engine according to the preferred embodiment of the present invention.
  • Specific types of web sites which contain a number of adult keywords as shown in Figs. 5a to 5e, may be determined to be adult sites.
  • a web document corresponding to the search result information type <Type 1> of adult sites includes a number of URLs and a predetermined number of adult keywords.
  • In <Type 1>, since morphological analysis of the search result can be performed using an indexer used in the search engine, it is generally easy to determine the number of keywords, which match adult keywords, among the character strings included in a web page of the web document.
  • For the search result information type <Type 1>, it can be determined whether the web site is an adult site by determining whether the ratio of the length of character strings that match adult keywords to the length of the total character strings (i.e., the length of character strings matching adult keywords / the length of the total character strings included in the web page × 100) is more than a predetermined value.
  • Alternatively, it can be determined whether the web site is an adult site based on the length of the character strings that match adult keywords. For example, if the length of character strings in a web page that match adult keywords is 200 bytes or more, the web site including the web page can be determined to be an adult site.
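
The ratio and length criteria can be written out as a short sketch. It assumes the page has already been split into character strings (for example by an indexer), that lengths are measured in UTF-8 bytes, and that the ratio threshold is supplied by the caller; only the 200-byte figure comes from the example above, and the function names are hypothetical.

```python
def byte_length(s: str) -> int:
    # Assumption: "length" means the UTF-8 encoded size in bytes.
    return len(s.encode("utf-8"))


def matched_length(strings: list[str], adult_keywords: set[str]) -> int:
    """Total length of the strings that match an adult keyword."""
    return sum(byte_length(s) for s in strings if s in adult_keywords)


def adult_length_ratio(strings: list[str], adult_keywords: set[str]) -> float:
    """matched length / total length x 100, or 0.0 for an empty page."""
    total = sum(byte_length(s) for s in strings)
    if total == 0:
        return 0.0
    return matched_length(strings, adult_keywords) / total * 100


def is_adult_by_ratio(strings, adult_keywords, threshold_percent: float) -> bool:
    return adult_length_ratio(strings, adult_keywords) > threshold_percent


def is_adult_by_length(strings, adult_keywords, min_bytes: int = 200) -> bool:
    return matched_length(strings, adult_keywords) >= min_bytes


strings = ["kw1", "hello", "kw2", "world!"]   # placeholder page strings
adult = {"kw1", "kw2"}                        # placeholder adult keywords
ratio = adult_length_ratio(strings, adult)    # 6 matched bytes out of 17
```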
  • a web document corresponding to the search result information type <Type 2> of adult sites includes a number of adult keywords with no space therebetween.
  • In <Type 2>, since an indexer used in the search engine cannot be used, it can be determined whether a web site corresponding to the web document is an adult site, based on the number of character strings appearing in a web page of the web document, which match adult keywords (i.e., based on the number of adult keywords appearing in the web page).
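
When keywords are run together without spaces, tokenization is of no help, so one simple option is to count substring occurrences directly. The sketch below assumes non-overlapping counting via `str.count` and a caller-supplied threshold; the function names and placeholder keywords are hypothetical.

```python
def count_keyword_occurrences(text: str, adult_keywords: list[str]) -> int:
    """Count non-overlapping occurrences of each keyword in unspaced text."""
    total = 0
    for keyword in adult_keywords:
        if keyword:                      # skip empty strings
            total += text.count(keyword)
    return total


def is_adult_by_count(text: str, adult_keywords: list[str], threshold: int) -> bool:
    # "More than a predetermined number" is read as a strict comparison.
    return count_keyword_occurrences(text, adult_keywords) > threshold


text = "kwakwbkwaxyz"                    # placeholder keywords with no spaces
occurrences = count_keyword_occurrences(text, ["kwa", "kwb"])
```

The same counting can be applied to anchor text alone when adult keywords appear only there.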
  • a web document corresponding to the search result information type <Type 3> of adult sites includes popular keywords in the title of the web document and a number of adult keywords in the body thereof.
  • In <Type 3>, it can be determined whether the web site is an adult site by using an indexer according to the above method in Fig. 5a, or by measuring the length of character strings that match adult keywords and determining whether the measured length is more than a predetermined value.
  • a web document corresponding to the search result information type <Type 4> of adult sites does not include popular and adult keywords in a title and a body of the web document, but includes a number of adult keywords in anchor text of the web document.
  • Adult sites of <Type 4> exploit the fact that the search robot searches web sites and returns search results thereof based on titles and bodies thereof.
  • the adult site determination can be performed based on the analysis of the anchor text of the web site. In other words, it can be determined whether the web site is an adult site, by analyzing character strings included in the anchor text and determining whether the number of character strings, which match adult keywords, among the analyzed character strings is more than a predetermined number.
  • a web document corresponding to the search result information type <Type 5> of adult sites does not include adult keywords but includes a large number of the same popular words in the title of the web document.
  • a web site of the type <Type 5> uses no adult keyword in an initial screen for entering the web site. Many recent adult sites belong to this type.
  • the initial screen contains only a window for entering a social security number, which is entirely unrelated to adult keywords, under the pretext of adult verification.
  • the adult site determination is difficult with only the extraction of keywords from the body of the web page. Thus, it may be most effective to determine the type <Type 5> to be an adult site type.
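
A heuristic for recognizing this pattern could check for one popular keyword repeated many times in the title while no adult keyword appears on the page. The sketch below is an assumption-laden illustration: the repeat threshold, the function names and the placeholder keywords are hypothetical, and the specification does not state how the type is detected in practice.

```python
from collections import Counter


def max_popular_repeats(title: str, popular_keywords: set[str]) -> int:
    """Largest number of times a single popular keyword repeats in the title."""
    counts = Counter(w for w in title.lower().split() if w in popular_keywords)
    return max(counts.values(), default=0)


def looks_like_type5(title: str, body: str, popular_keywords: set[str],
                     adult_keywords: set[str], min_repeats: int = 5) -> bool:
    page_text = f"{title} {body}".lower()
    has_adult_keyword = any(kw in page_text for kw in adult_keywords)
    repeats = max_popular_repeats(title, popular_keywords)
    return (not has_adult_keyword) and repeats >= min_repeats


title = "pikachu pikachu pikachu pikachu pikachu games"
flagged = looks_like_type5(title, "enter your id number",
                           popular_keywords={"pikachu", "starcraft"},
                           adult_keywords={"kw1"})
```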
  • the search robot crawls a web site, and records search result data of a web page of the web site after dividing the data into elements #TITLE, #BODY or #ANCHOR, and then determines, based on a predetermined basis, whether the web site is an adult site.
  • the ratio of character strings, which match adult keywords, to the total character strings included in the web page of the web site can be calculated, and the web site can be determined to be an adult site if the calculated ratio is more than a predetermined ratio.
  • the adult keyword ratio can be calculated using an indexer.
  • the ratio of the length of adult keywords to the length of total character strings of the web document is calculated, and it is determined whether the web site is an adult site, based on whether the calculated ratio is more than a predetermined reference value.
  • the adult site determination can be made based not only on the ratio but also on the number of character strings included in the web page, which match adult keywords, as described above.
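  • In this second determination, if the number of character strings in a web page that match adult keywords is more than a predetermined number, a web site including the web page is determined to be an adult site.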
  • the second adult site determination can be useful particularly when the indexer cannot be used (i.e., in the case such as <Type 2>).
  • If it is determined, at step 335, that the web site is an adult site, the procedure moves to step 350 of Fig. 3b.
  • At step 350, it is determined whether the web site determined to be an adult site is a deceptive site (branch (1)) or an altered site (branch (2)), according to the web site management method of the present invention.
  • Fig. 4b shows a method for selecting the popular keywords.
  • the method for selecting the popular keywords as a basis for determining whether a web site is a deceptive site includes the following steps. First, search words are entered by a plurality of search engine users (455). Respective frequencies of the entered search words are recorded (460). To record the frequencies, the entered search words can be recorded in the form of a table and the value of a frequency field of the table can be increased by one each time a corresponding search word is entered. The entered search words, as obtained by the analysis, are sorted in order of the frequencies at intervals of a predetermined period (daily, weekly or monthly) (465).
  • High-ranked search words are extracted from the sorted words, and the extracted search words are selected as popular keywords (470), which are then stored in the popular keyword database (475). It is preferable that search words constantly popular over the medium and long term, rather than search words regarding social issues in the short term, be selected as popular keywords. This is because search words (for example, "Starcraft" and "Zolaman"), which have generally gained constant popularity, are highly likely to be included as popular keywords in character strings of web page source files of deceptive sites, in terms of characteristics of the present invention.
  • the predetermined basis for determining whether a web site is a deceptive site may be whether or not predetermined popular keywords are contained in a source file that constitutes a web page included in the web site (355).
  • Examples of the deceptive site include an adult site of <Type 3> shown in Fig. 5c or an adult site of <Type 5> shown in Fig. 5e. If it is determined, at step 355, that a source file of the web site determined to be an adult site contains popular keywords, the web site may be determined to be a deceptive site (360). If the web site is determined to be a deceptive site, the procedure moves to step 605 of Fig. 6 via step 380, so as to take a predetermined measure against the web site.
  • registration information of the web site determined to be an adult site is searched for to determine whether the web site is an altered site (365).
  • the reason for searching for the registration information is that the altered site is a web site that was registered as a general site and then altered to an adult site after the registration, as described above.
  • Web site category information of the web site stored in the web site information database is searched to determine whether the web site has been registered as an adult site (370). If the web site has not been registered as an adult site, the web site can be determined to be an altered site (375). If the web site is determined to be a deceptive or altered site at step 360 or 375, the procedure moves to step 605 of Fig. 6 via step 380, so as to take a predetermined measure against the web site.
  • a web site management module searches a web site information database to obtain information of a registrant of the web site (610), and the web site management module receives the registrant information (620 and 650).
  • the web site management module extracts contact information of the registrant, such as an email address or a mobile terminal number thereof, from the received registrant information (630), and controls a mail server or an SMS server to transmit a predetermined message to a location corresponding to the contact information (640).
  • the web site management module extracts information of other registered web sites of the registrant from the registrant information (660), and then performs a control operation to automatically analyze the other web sites registered under the same registrant name (670). This is because the other web sites registered under the same registrant name are highly likely to be deceptive or altered sites operated based on the same or similar method. In this embodiment, if, based on the analysis of the other registered web sites, it is determined that they are deceptive or altered sites, step 610 of Fig. 6 may be repeated.
  • the system for managing the registered web sites may operate to automatically send an email, an SMS message or the like to a registrant of the web site to point out problems of the web site and then request that the registrant of the web site correct the problems within a grace period.
  • the system may be set to automatically perform the analysis and determination processes after the grace period. If the problems of the web site have not been corrected even after the grace period, a punitive measure, such as cancellation of the registration of the web site, may be taken against the registrant thereof.
  • a punitive measure such as a complicated registration procedure may be imposed on the registrant of the web site when the registrant requests registration of another web site at a later time.
  • Embodiments of the present invention further relate to computer readable media that include program instructions for performing various computer-implemented operations.
  • the media may also include, alone or in combination with the program instructions, data files, data structures, tables, and the like.
  • the media and program instructions may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind well known and available to those having skill in the computer software arts.
  • Examples of computer-readable media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM disks; magneto-optical media such as floptical disks; and hardware devices that are specially configured to store and perform program instructions, such as read-only memory devices (ROM) and random access memory (RAM).
  • the media may also be a transmission medium such as optical or metallic lines, wave guides, etc. including a carrier wave transmitting signals specifying the program instructions, data structures, etc.
  • Examples of program instructions include both machine code, such as produced by a compiler, and files containing higher level code that may be executed by the computer using an interpreter.
  • Fig. 7 is a block diagram showing the internal configuration of a general computer system that can be used in managing web pages registered in the search engine according to the present invention.
  • the computer system includes any number of processors 740 (also referred to as central processing units, or CPUs) that are coupled to storage devices including primary storage 760 (typically a random access memory, or "RAM") and primary storage 770 (typically a read only memory, or "ROM").
  • primary storage 770 acts to transfer data and instructions uni-directionally to the CPU, and primary storage 760 is typically used to transfer data and instructions in a bi-directional manner. Both of these primary storage devices may include any suitable type of the computer-readable media described above.
  • a mass storage device 710 is also coupled bi-directionally to CPU 740 and provides additional data storage capacity and may include any of the computer-readable media described above.
  • the mass storage device 710 may be used to store programs, data and the like and is typically a secondary storage medium such as a hard disk that is slower than primary storage.
  • a specific mass storage device such as a CD-ROM 720 may also pass data uni-directionally to the CPU.
  • Processor 740 is also coupled to an interface 730 that includes one or more input/output devices such as video monitors, track balls, mice, keyboards, microphones, touch-sensitive displays, transducer card readers, magnetic or paper tape readers, tablets, styluses, voice or handwriting recognizers, or other well-known input devices such as, of course, other computers.
  • processor 740 optionally may be coupled to a computer or telecommunications network using a network connection as shown generally at 750. With such a network connection, it is contemplated that the CPU might receive information from the network, or might output information to the network in the course of performing the above-described method steps.
  • the above-described devices and materials will be familiar to those of skill in the computer hardware and software arts.
  • the hardware elements described above may be configured (usually temporarily) to act as one or more software modules for performing the operations of this invention.
  • a method for managing web sites registered in a search engine in which an algorithm is used to automatically detect deceptive sites or altered sites and automatically take punitive measures such as warning against the detected sites, thereby saving a large amount of human resources that may otherwise have been wasted to detect the deceptive sites.
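
As a non-limiting illustration, the ratio-based adult site determination and the subsequent branching at step 350 described above might be sketched as follows. The function names, the word-level tokenization, the placeholder keywords, and the threshold value are assumptions made for this sketch only; the specification states only that a "predetermined ratio" and predetermined keyword databases are used. Returning both the "deceptive" and "altered" labels for a single site is likewise a design choice of the sketch.

```python
import re

# Hypothetical threshold; the specification only refers to "a predetermined ratio".
ADULT_RATIO_THRESHOLD = 0.05


def tokenize(text):
    """Split page text into lowercase character strings (a simplification)."""
    return re.findall(r"\w+", text.lower())


def adult_keyword_ratio(text, adult_keywords):
    """Ratio of character strings matching adult keywords to all character strings."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    matches = sum(1 for token in tokens if token in adult_keywords)
    return matches / len(tokens)


def adult_length_ratio(text, adult_keywords):
    """Ratio of the length of matched adult keywords to the length of all strings."""
    tokens = tokenize(text)
    total_length = sum(len(token) for token in tokens)
    if total_length == 0:
        return 0.0
    matched_length = sum(len(token) for token in tokens if token in adult_keywords)
    return matched_length / total_length


def classify_site(text, registered_as_adult, adult_keywords, popular_keywords,
                  threshold=ADULT_RATIO_THRESHOLD):
    """Return a set of labels: 'adult', 'deceptive', 'altered' (possibly empty)."""
    labels = set()
    # Adult site determination: the ratio must exceed the predetermined ratio.
    if adult_keyword_ratio(text, adult_keywords) <= threshold:
        return labels
    labels.add("adult")
    # Branch (1): an adult site whose source contains popular keywords is deceptive.
    if set(tokenize(text)) & set(popular_keywords):
        labels.add("deceptive")
    # Branch (2): an adult site that was not registered as adult is altered.
    if not registered_as_adult:
        labels.add("altered")
    return labels
```

In this sketch, a site that yields a non-empty result containing "deceptive" or "altered" would be handed to the measure-taking procedure of Fig. 6; multi-word keywords and the count-based criterion are not modeled.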

Abstract

Disclosed are a method and system for managing web sites registered in a search engine that provides information about web sites on the Internet, wherein information about the web sites registered in the search engine is analyzed to prevent the provision of search results different from essential contents contained in the web sites. In the method, information of the registered web site is received and recorded in a database after being classified by predetermined fields. Predetermined adult keywords are maintained in an adult keyword database. A source file constituting a web page of the registered web site is read and analyzed. It is determined based on a predetermined basis whether or not the registered web site is an adult site.

Description

A METHOD OF MANAGING REGISTERED WEB SITES IN SEARCH ENGINE AND A SYSTEM THEREOF
Technical Field
The present invention relates to a search engine for providing information about web sites on the Internet, and more particularly to a method for managing web sites registered in a search engine, wherein information about the web sites registered in the search engine is analyzed to prevent the provision of search results different from essential contents contained in the web sites.
Background Art
A conventional search engine, such as Altavista (http://www.altavista.com), Lycos (http://www.lycos.com) or Yahoo (http://www.yahoo.com), generally includes a database for classifying, storing and managing web site information based on a predetermined rule, a search robot, embodied as software, for constantly traveling over the web and automatically collecting new web site information, and search engine software for storing the collected data in a database and allowing a user of the search engine to search for desired information in the database.
Fig. 1a is a block diagram showing an entire system for providing the search engine service. As shown in Fig. 1a, a user connects to a search engine server 150 over the Internet via a user terminal 110. If the user enters search terms, the search engine server 150 queries search engine software 140 about web site information corresponding to the entered search terms, and the search engine software 140 searches a database 130 to notify the user of retrieved web site information. A search robot 120 is an entity embodied as software for constantly traveling over the web and automatically collecting new web site information from a web server 160, as described above. The search robot 120 searches for HTML (Hypertext Markup Language) documents on a network, parses links described in the HTML documents, and then collects data from a number of web sites existing on the network. The data collected by the search robot 120 is databased. The term "databased" refers to a series of processes of performing morphological analysis of information located on a web site, producing a corresponding index table, and storing it in the database 130. The database 130 is provided to store all web site information collected by the search robot 120. The search engine software 140 functions to show search results to users. This software searches a large number of pages stored in the database 130 and lists search results by relevance to the search term.
The conventional search engine as described above registers information about a web site in a search engine and provides the information to users in the following ways.
(1) Information of a web site is collected using the search robot as described above, and the web site information is registered in the search engine after being reviewed by expert surfers.
(2) A category corresponding to the subject of a web site to be registered is selected from a directory of categories classified by subject, a request is made to register the web site in the selected category, and the web site is then registered in the search engine after being reviewed by expert surfers. Some search engines provide a fee-based directory registration service that shortens the time required to register a web site in their directory in return for a registration fee.
Web sites registered in the search engine in the above method are provided to a user who is looking for desired information after they are searched for in various ways, such as integrated web search and directory search, based on search terms entered by the user. The integrated web search is also called "word-based search", in which Uniform Resource Locators (URLs) of all web sites are stored in a database and desired information is searched for based on a specific keyword entered by the user. The directory search is also called "subject-based search", in which web sites are organized into subject-based categories and if a user links to a desired category, the user can view detailed items thereof. In this manner, the subject-based search allows the user to continue to link to the detailed items and retrieve desired information. For example, if a user desires to find Korean team match scores in the 2002 Korea-Japan World Cup, the user can search for them via categories such as Sports → Ball Sports → Soccer → FIFA World Cup → 2002 Korea-Japan World Cup → Korean team match scores. Fig. 1b is an example screenshot of the directory search method. As shown in this figure, directory search results with search terms "world cup" are three categories "World Cup", "2002 FIFA Korea-Japan World Cup" and "History of the World Cup", and the user can search for desired information by moving to one of the three categories in which the desired information is most likely to be placed. A typical search engine based on the integrated web search method is Lycos (http://lycos.cs.cmu.edu) developed by Michael L. Mauldin at Carnegie-Mellon University, and a typical search engine based on the directory search method is Yahoo (http://www.yahoo.com). Many current search engines provide hybrid search services based on a combination of the different search methods described above.
The conventional method for registering web sites in the search engine and searching for the registered web sites has the following problems.
(1) As the number of Internet users has rapidly increased, the number of users who desire to search for specific information has rapidly increased and the number of types of information for which they desire to search has increased. As the number of such users and the types of such information have increased, some search terms appear very frequently, which will also be referred to as "popular keywords". This causes a problem in that users, who desire to search for information based on the popular keywords, may receive information of web sites (hereinafter also referred to as "deceptive sites") that contain contents of no use to the users and insert the popular keywords in their web pages in various ways. For example, if a user enters a popular keyword "Pikachu" to search for information about Pikachu, information of all registered web sites that contain the word "Pikachu" in their web pages is provided to the user. The web sites provided to the user may include web sites that contain adult or sexual contents and insert the word "Pikachu" in some places in their web pages in various ways (with ill intention in most cases). This popular keyword insertion causes a wide age range of users to be exposed to the information of the web sites that contain adult or sexual contents.
(2) Content contained in a web site at the time when it is registered in the search engine may be different from content contained therein after the registration. For example, let us assume that a web site having a domain name "http://www.worldcup.com" is registered in a directory at a sub-category thereof that is specified as Sports → Ball Sports → Soccer → FIFA World Cup. After the registration, the content of the web site specified by the domain name may be altered from World Cup-related content to adult-related content by the change of the owner of the domain name or the like. Such web sites (hereinafter referred to as "altered sites") may cause, at any time, unexpected damage to a number of users who perform the directory search.
(3) Most search engine providers charge different registration fees for adult web sites containing keywords relating to adult content and general web sites containing common keywords. This is because the search engine operators bear a heavier burden of managing the registration of the adult web sites since the adult web sites are likely to violate the positive law, compared to the general web sites. An abuser may register a web site in the search engine with general content and keywords, and change the content of the web site to adult content to provide adult services after the registration. This web site may be considered an altered site as described above. It is very difficult to detect the altered site without complaint reports by users of the search engine or without manual searches by expert surfers.
Fig. 1c is a diagram illustrating an example web site, the content of which is altered after its registration. This figure shows a search results screen for a specific search term. A web site providing information about arcade machines is listed on the search results screen. However, if a user clicks on the information of the web site and moves to the web site, adult content is displayed on the screen, instead of the information about the arcade machines. In this manner, the altered site causes unexpected damage to users who are searching for information on arcade machines.
The conventional method for overcoming the problems described above requires complaint reports by users or requires specialists such as expert surfers to constantly monitor the registered web sites. However, the conventional method obviously cannot be an ultimate solution to the problems. If an algorithm automatically executed on the Internet to solve the problems can be provided, it will be a useful means to solve the problems all at once.
Disclosure of the Invention
Therefore, the present invention has been made in view of the above problems, and it is an object of the present invention to provide a method for managing web sites registered in a search engine, in which an algorithm is used to automatically detect deceptive sites or altered sites, thereby allowing users of the search engine to correctly search for their desired information. It is another object of the present invention to provide a method for managing web sites registered in a search engine, in which deceptive sites or altered sites are automatically detected, and punitive measures are automatically imposed on operators of the detected deceptive sites or altered sites, thereby reinforcing self-purification of the web sites registered in the search engine.
It is yet another object of the present invention to provide a method for managing web sites registered in a search engine, in which an algorithm is used to automatically detect deceptive sites or altered sites and automatically take punitive measures such as warning against the detected sites, thereby saving a large amount of human resources that may otherwise have been wasted to detect the deceptive sites.
According to a preferred embodiment of the present invention, there is provided a method for managing a web site registered in a search engine, comprising the steps of: classifying web site information of the web site by predetermined fields and recording the classified web site information in a database; maintaining predetermined adult keywords in an adult keyword database; reading a source file constituting a web page of the web site; analyzing the read source file; and determining, based on a predetermined basis, whether or not the web site is an adult site.
According to a preferred embodiment of the present invention, there is provided a method for managing a web site registered in a search engine, comprising the steps of: classifying web site information of the web site by predetermined fields and recording the classified web site information in a database; maintaining predetermined adult keywords in an adult keyword database; maintaining predetermined popular keywords in a popular keyword database; reading a source file constituting a web page of the web site; analyzing the read source file; determining, based on a first predetermined basis, whether or not the web site is an adult site; determining, based on a second predetermined basis, whether or not the web site is a deceptive site, if the web site is determined to be an adult site; and performing a control operation to perform predetermined processing on the web site if the web site is determined to be a deceptive site.
The term "deceptive site" used in the present specification refers to a web site that inserts specific keywords in a source file of its web page in various ways and contains contents entirely different from those to be searched for based on the specific keywords. For example, the deceptive site is an adult site that inserts specific popular keywords in its web page and may be provided as a search result, irrespective of its essential content.
The term "altered site" refers to a web site that has a different subject after the registration thereof in a search engine from a subject when it was initially registered in the search engine. For example, the altered site is a web site that has been initially registered in the search engine by requesting registration as a general site, but has changed its content to adult content after the registration.
The term "adult site" refers to a web site that contains content very harmful to young boys and girls under 19. Research results show that most deceptive sites are highly likely to be adult sites and most altered sites are highly likely to be adult sites.
The term "popular keywords" refers to search words that appear very frequently, among search words entered by Internet users. The popular keywords may continually vary depending on the Internet users' tendency and social situations of the time. The popular keywords may include harmful keywords containing socially harmful content, and some examples thereof are "suicide", "reject", "gambling" and "conspiracy".
The term "adult keywords" refers to search terms used to search for various adult contents contained in adult sites. The adult keywords mostly rank high in the popular keywords provided by search engine providers, which indicates that the adult keywords have some connection with the popular keywords.
Brief Description of the Drawings
The above and other objects, features and other advantages of the present invention will be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings, in which:
Fig. 1a is a block diagram showing the configuration of a conventional system for providing web site search engine services;
Fig. 1b is an example screenshot of a directory search method that is one of the web site search methods provided by search engines;
Fig. 1c is a diagram showing an example web site that is altered after the registration;
Fig. 2 is a block diagram showing the configuration of a system for managing web sites registered in a search engine according to a preferred embodiment of the present invention;
Fig. 3a is a flow chart showing a method for detecting adult sites in order to manage web sites registered in a search engine according to an embodiment of the present invention;
Fig. 3b is a flow chart showing a method for detecting deceptive or altered sites in order to manage web sites registered in a search engine according to an embodiment of the present invention;
Fig. 4a is a flow chart showing a method for selecting adult keywords in the method for managing web sites registered in the search engine according to an embodiment of the present invention;
Fig. 4b is a flow chart showing a method for selecting popular keywords in the method for managing web sites registered in the search engine according to an embodiment of the present invention;
Figs. 5a to 5e are diagrams illustrating search result information types of adult sites that the search robot has obtained by searching the web in the method for managing the web sites registered in the search engine according to the preferred embodiment of the present invention;
Fig. 6 is a flow chart showing a method for taking a predetermined punitive measure against a registrant of a web site determined to be a deceptive or altered site in the method for managing the web sites registered in the search engine according to the preferred embodiment of the present invention; and
Fig. 7 is a block diagram showing the internal configuration of a general computer system that can be used in managing the web sites registered in the search engine according to the present invention.
Best Mode for Carrying Out the Invention
A method for managing web sites registered in a search engine according to preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
Fig. 2 is a block diagram showing the configuration of a system for managing web sites registered in a search engine according to an embodiment of the present invention. As shown in Fig. 2, the system according to the embodiment of the present invention includes an interface module 201, a web site registration module 202, a web site management module 203, a web site information database 204, a web site analysis module 205, a keyword management module 206 which includes a search word analyzer 208 and an adult keyword extractor 209, a search robot 207, a popular keyword database 210, and an adult keyword database 211. According to the embodiment of the present invention, the system for managing web sites registered in the search engine may include a mail server 212 or an SMS server 213 for sending a predetermined message to a registrant of a registered web site. The mail server 212 and the SMS server 213 may be provided in a system for providing search engine services or may be located in a system operated by a third party. Though the interface module 201, other various modules, and the mail server 212 or the SMS server 213 are illustrated in Fig. 2 as separate entities, this illustration has been made only for easier explanation, and they may be the same entity. The elements shown in Fig. 2 may also be physically located at the same place, or alternatively they may be physically located apart from each other according to another embodiment of the present invention.
First, the interface module 201 functions to support data transmission between the search engine registration management system and a computer terminal provided to a registrant who desires to register a predetermined web site in the search engine, and also functions to interface between physical transmission equipment. The web site registration module 202 functions to receive a request to register the predetermined web site from the registrant, and also to collect and classify information/data about the web site contained in the web site registration request. The web site registration module 202 may further include a billing module (not shown) for charging predetermined fees for the web site registration. The billing module may operate to charge different fees for a web site desired to be registered, depending on the type of the web site (i.e., depending on whether it is a general site containing general content or an adult site containing adult content).
The web site management module 203 is a module for overall registration management of web sites according to the present invention. Based on information of the web sites collected by the search robot 207, the web site management module 203 determines whether the web sites are in operation in conformity with a standard based on which their registration has been permitted. If it is determined that a web site is in inappropriate operation (i.e., it is a deceptive or altered site), the web site management module 203 automatically takes a predetermined measure against a registrant of the web site. The web site management module 203 can interwork with the mail server 212 or the SMS server 213 to send an email to the registrant of the deceptive or altered site or to send an SMS message to a mobile terminal of the registrant, thereby warning the registrant about the inappropriate operation of the site.
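As a non-limiting illustration, the warning operation of the web site management module 203 might be sketched as follows. The function name, the dictionary keys, and the two sender callables standing in for the mail server 212 and the SMS server 213 are hypothetical; the specification does not define such an interface. The sketch uses every contact channel that is available, which is an assumption, since the specification speaks of the mail server or the SMS server.

```python
def notify_registrant(registrant, message, send_email, send_sms):
    """Send a warning through each contact channel found in the registrant info.

    registrant: dict that may contain 'email' and 'mobile' entries (hypothetical keys).
    send_email, send_sms: callables taking (destination, message); they stand in
    for the mail server and the SMS server and are assumptions of this sketch.
    Returns the list of channels that were used.
    """
    channels = []
    email = registrant.get("email")
    if email:
        send_email(email, message)
        channels.append("email")
    mobile = registrant.get("mobile")
    if mobile:
        send_sms(mobile, message)
        channels.append("sms")
    return channels
```

Passing the senders as parameters keeps the sketch independent of any particular mail or SMS system, consistent with the statement that these servers may be operated by a third party.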
The web site information database 204 functions to classify and record information of the registered web sites. Various web site information, such as a URL (Uniform Resource Locator) of a web site, web site category information, keywords, registrant information (registrant's name, address, email address, mobile terminal number, etc.), directory information, and the like, may be classified by information field and stored in the web site information database 204. The web site category information of a web site indicates whether the web site is registered as a general web site or an adult web site. Information of a web site stored in the web site information database 204 may be modified by a registrant of the web site and by a system manager. When content of a web site is changed, the web site information database 204 may automatically update information of the web site stored therein, based on analysis results (for example, based on a new keyword corresponding to a URL of the web site) of data collected by the search robot 207 even though a registrant of the web site does not directly modify the stored information of the web site.
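As a non-limiting illustration, one record of the web site information database 204 might be modeled as follows. The class names, field names, and the "general"/"adult" string values are assumptions made for this sketch; the specification lists the kinds of information stored but does not prescribe a schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Registrant:
    """Registrant information fields named in the description."""
    name: str
    address: str
    email: str
    mobile_number: str


@dataclass
class WebSiteRecord:
    """One registered web site, classified by information field (hypothetical schema)."""
    url: str
    category: str  # assumed values: "general" or "adult"
    registrant: Registrant
    keywords: List[str] = field(default_factory=list)
    directory: str = ""  # e.g. a category path chosen at registration

    def registered_as_adult(self) -> bool:
        """True if the site was registered in the adult category."""
        return self.category == "adult"
```

The `registered_as_adult` check corresponds to the category lookup used when deciding whether an adult site is an altered site.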
The web site analysis module 205 functions to analyze information of web sites collected by the search robot 207. The type of data collected by the search robot 207 and a method for analyzing the collected data will be described below in detail with reference to Fig. 3.
The keyword management module 206 functions to manage keywords that may be a basis for determining whether or not a specific web site is a deceptive or altered site. In other words, the keyword management module 206 selects and manages adult and popular keywords that may be a basis for the deceptive site determination, and selects and manages adult keywords that may be a basis for the altered site determination. The keyword management module 206 of the system for managing the registered web sites according to the embodiment of the present invention includes a keyword analyzer 208 for selecting popular keywords and an adult keyword extractor 209 for selecting adult keywords. The keyword analyzer 208 records search words that a plurality of users have entered to the search engine, and records the number of times the entered search words appear. The keyword analyzer 208 can select popular keywords from the entered search words by sorting them in order of the number of appearance times at intervals of a predetermined period. The adult keyword extractor 209 can select one or more adult sites, analyze source files of web pages of the selected adult sites, and extract character strings that appear frequently on the web pages to select the extracted character strings as adult keywords. The selected popular keywords are stored and managed in a popular keyword database 210, whereas the selected adult keywords are stored and managed in an adult keyword database 211. These databases 210 and 211 may operate to be automatically updated when a new popular or adult keyword is detected. The above elements of the system for managing web sites registered in the search engine according to the embodiment of the present invention are divided simply according to their functions for easier explanation, and the functional division of the elements has nothing to do with actual physical locations thereof.
It is obvious to those skilled in the art that the above modules may be embodied not only as hardware but also as software using a specific code.
Fig. 3a is a flow chart showing a method for managing web sites registered in a search engine according to a preferred embodiment of the present invention. A method for managing web sites registered in a search engine according to a preferred embodiment of the present invention will now be described in detail with reference to Fig. 3a in conjunction with Figs. 5a to 5e.
The web site registration management method according to the preferred embodiment of the present invention is performed in the following manner, as shown in Fig. 3a. A registrant, who desires to register a predetermined web site in the search engine, makes a request to register the web site with information of the web site (305). The information of the web site is classified by information fields (registrant's name, address, email address, mobile phone number, etc.) and recorded in a web site information database (310), and the web site is registered in the search engine (315). This registration step 315 may be performed in several ways. For example, in one way, a web site is registered in the search engine upon request of a manager of the web site as described above. In another way, a web site is registered in the search engine based on information of the web site obtained by the search robot that randomly travels over the web. In the former case, the registrant (i.e., the manager) of the web site can request that the web site be registered in a category closest to a subject (for example, "Pikachu" and "patent attorney exam") thereof decided by the registrant. After being reviewed by expert surfers, the requested web site can be registered in the search engine if it is determined that the requested web site satisfies predetermined requirements (for example, quality of the web site or noncommercial site requirements in case no registration fee is paid). The method for managing web sites registered in the search engine according to the present invention will be described, limited to the case where the web site is registered in the search engine upon request of the registrant of the web site. However, the method and system for managing web sites registered in the search engine according to the present invention can also be applied to other various ways in which the web site is registered in the search engine.
Next, a specific adult keyword is stored and managed in the adult keyword database (320).
According to the embodiment of the present invention, the search engine operates to analyze a source file contained in a web page of the web site. According to a preferred embodiment of the present invention, the search engine may operate to crawl a web site and generate search result data thereof divided into elements #TITLE, #BODY or #ANCHOR. In other words, the search engine reads a source file constituting a web page (325), analyzes the read source file (330), and determines, based on a predetermined basis, whether the web site is an adult site (335). An example of the search result data will be described below in detail with reference to Figs. 5a to 5e.
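The division of a page into #TITLE, #BODY and #ANCHOR elements can be illustrated with a minimal sketch. It assumes the source file is HTML and uses the parser from the Python standard library; the class name `ElementSplitter`, the function `split_elements` and the rule that all text outside `<title>` and `<a>` tags counts as body text are assumptions made for illustration and are not taken from the specification.

```python
from html.parser import HTMLParser


class ElementSplitter(HTMLParser):
    """Collects the text of a page into TITLE, BODY and ANCHOR buckets."""

    def __init__(self):
        super().__init__()
        self.parts = {"TITLE": [], "BODY": [], "ANCHOR": []}
        self._stack = []  # currently open <title> / <a> tags

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "a"):
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        current = self._stack[-1] if self._stack else None
        if current == "title":
            self.parts["TITLE"].append(text)
        elif current == "a":
            self.parts["ANCHOR"].append(text)
        else:
            # Assumption: everything else is treated as body text.
            self.parts["BODY"].append(text)


def split_elements(source: str) -> dict:
    """Return the page text grouped by element name."""
    parser = ElementSplitter()
    parser.feed(source)
    parser.close()
    return {name: " ".join(chunks) for name, chunks in parser.parts.items()}
```

Each of the three resulting strings can then be analyzed separately, which is what the determination bases described for Figs. 5a to 5e rely on.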
Fig. 4a shows a method for selecting the adult keywords.
As shown in Fig. 4a, the method for selecting the adult keywords as a basis for determining whether a web site is an adult site includes the following steps. First, one or more adult sites are selected (405). The adult site can be selected directly by a manager of the web site registration management system according to the present invention. Alternatively, one or more of the web sites registered as adult sites can be automatically selected by searching a web site category information field in the database of the system according to the present invention. Character strings contained in a web page of the selected adult site are extracted (410), and respective frequencies of the extracted character strings are recorded (415). To record the frequencies, the extracted character strings can be recorded in the form of a table and the value of a frequency field of the table can be increased by one each time a corresponding character string is extracted. The recorded character strings, as detected by the analysis, are sorted in order of the frequencies at intervals of a predetermined period (daily, weekly or monthly) (420). High-ranked character strings are extracted from the recorded character strings, and the extracted character strings are selected as adult keywords, which are then stored in the adult keyword database (425). According to another embodiment of the present invention, all of the detected character strings can be selected as adult keywords without the sorting. In this case, non-adult character strings may be selected as adult keywords, but it can prevent an increase in the system load required to select the adult keywords, which is caused by the sorting.
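Steps 410 to 425 amount to a frequency table followed by a ranked cut-off. The following is a minimal sketch under stated assumptions: the page texts are already extracted, character strings are approximated by lower-cased word tokens (a real system for Korean text would more likely use a morphological analyzer), and the function name `select_adult_keywords` and the parameter `top_n` are hypothetical.

```python
import re
from collections import Counter


def select_adult_keywords(page_texts, top_n):
    """Count character strings over the selected adult pages and keep the top_n."""
    freq = Counter()
    for text in page_texts:
        # Assumption: a "character string" is a run of word characters.
        freq.update(re.findall(r"\w+", text.lower()))
    # most_common() sorts by descending frequency (step 420);
    # the slice keeps the high-ranked strings (step 425).
    return [word for word, _ in freq.most_common(top_n)]
```

For the alternative embodiment without sorting, `list(freq)` would return every detected string instead of the ranked subset.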
At step 335, it is determined, based on a predetermined basis, whether the web site is an adult site. A method for determining whether a web site is an adult site, according to a preferred embodiment of the present invention, will now be described with reference to Figs. 5a to 5e.
Figs. 5a to 5e are diagrams illustrating search result information types of adult sites that the search robot has obtained by searching the web in the method for managing the web sites registered in the search engine according to the preferred embodiment of the present invention. Specific types of web sites, which contain a number of adult keywords as shown in Figs. 5a to 5e, may be determined to be adult sites.
As shown in Fig. 5a, a web document corresponding to the search result information type <Type 1> of adult sites includes a number of URLs and a predetermined number of adult keywords. In the case of <Type 1>, since morphological analysis of the search result can be performed using an indexer used in the search engine, it is generally easy to determine the number of keywords, which match adult keywords, among the character strings included in a web page of the web document. Accordingly, in the search result information type <Type 1>, it can be determined whether the web site is an adult site, by determining whether the ratio of the length of character strings, which match adult keywords, to the length of the total character strings (i.e., the length of character strings matching adult keywords / the length of the total character strings included in the web page × 100) is more than a predetermined value. Alternatively, based on only the length of the character strings, which match adult keywords, among the total character strings, it can be determined whether the web site is an adult site. For example, if the length of character strings in a web page, which match adult keywords, is 200 bytes or more, it can be determined that a web site including the web page is an adult site.
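The two <Type 1> bases (length ratio and absolute matched length) can be sketched as follows. This is an illustration only: it assumes the indexer has already produced a list of tokens, measures length in UTF-8 bytes (the specification does not name an encoding), and uses hypothetical names `adult_length_stats`, `is_adult_type1` and `min_ratio`. The 200-byte default is the example value given in the text.

```python
def byte_len(s):
    """Length of a string in bytes (assumption: UTF-8 encoding)."""
    return len(s.encode("utf-8"))


def adult_length_stats(tokens, adult_keywords):
    """Return (matched_bytes, ratio_percent) for a tokenized page."""
    total = sum(byte_len(t) for t in tokens)
    matched = sum(byte_len(t) for t in tokens if t in adult_keywords)
    ratio = 100 * matched / total if total else 0.0
    return matched, ratio


def is_adult_type1(tokens, adult_keywords, min_ratio, min_bytes=200):
    """Adult if the ratio exceeds min_ratio or the matched length reaches min_bytes."""
    matched, ratio = adult_length_stats(tokens, adult_keywords)
    return ratio > min_ratio or matched >= min_bytes
```

Combining the two bases with a logical OR is a design choice of this sketch; the text presents them as alternatives, so either test could also be used on its own.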
As shown in Fig. 5b, a web document corresponding to the search result information type <Type 2> of adult sites includes a number of adult keywords with no space therebetween. In the case of <Type 2>, since an indexer used in the search engine cannot be used, it can be determined whether a web site corresponding to the web document is an adult site, based on the number of character strings appearing in a web page of the web document, which match adult keywords (i.e., based on the number of adult keywords appearing in the web page).
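Because a <Type 2> page has no spaces to tokenize on, one plausible way to count matches is a plain substring search. The sketch below assumes that approach; the function name `count_keyword_occurrences` and the threshold parameter `min_count` are hypothetical.

```python
def count_keyword_occurrences(text, adult_keywords):
    """Count non-overlapping occurrences of each adult keyword in unspaced text."""
    return sum(text.count(keyword) for keyword in adult_keywords if keyword)


def is_adult_type2(text, adult_keywords, min_count):
    """Adult if more than min_count keyword occurrences are found."""
    return count_keyword_occurrences(text, adult_keywords) > min_count
```

Note that if one keyword is a substring of another, both are counted at the same position, so the keyword list may need de-duplication before use.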
As shown in Fig. 5c, a web document corresponding to the search result information type <Type 3> of adult sites includes popular keywords in the title of the web document and a number of adult keywords in the body thereof. In the case of <Type 3>, it can be determined whether the web site is an adult site, by using an indexer according to the above method in Fig. 5a or measuring the length of character strings that match adult keywords and determining whether the measured length is more than a predetermined value.
As shown in Fig. 5d, a web document corresponding to the search result information type <Type 4> of adult sites does not include popular and adult keywords in a title and a body of the web document, but includes a number of adult keywords in anchor text of the web document. Adult sites of <Type 4> use the fact that the search robot searches web sites and returns search results thereof based on titles and bodies thereof. In the case of the search result information type <Type 4>, the adult site determination can be performed based on the analysis of the anchor text of the web site. In other words, it can be determined whether the web site is an adult site, by analyzing character strings included in the anchor text and determining whether the number of character strings, which match adult keywords, among the analyzed character strings is more than a predetermined number.
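The anchor-text basis for <Type 4> can be sketched in a few lines. The sketch assumes the anchor text has already been extracted as a single string and that whitespace splitting is an adequate stand-in for the analysis of character strings; the names `count_anchor_matches`, `is_adult_by_anchor` and `min_matches` are hypothetical.

```python
def count_anchor_matches(anchor_text, adult_keywords):
    """Number of anchor-text words that match an adult keyword."""
    return sum(1 for word in anchor_text.lower().split() if word in adult_keywords)


def is_adult_by_anchor(anchor_text, adult_keywords, min_matches):
    """Adult if the anchor text holds more than min_matches matching strings."""
    return count_anchor_matches(anchor_text, adult_keywords) > min_matches
```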
As shown in Fig. 5e, a web document corresponding to the search result information type <Type 5> of adult sites does not include adult keywords but includes a large number of the same popular words in the title of the web document. A web site of the type <Type 5> uses no adult keyword in an initial screen for entering the web site. Many recent adult sites belong to this type. In the case of <Type 5>, the initial screen contains only a window for entering a social security number, which is entirely unrelated to adult keywords, under the pretext of adult verification. In this case, the adult site determination is difficult with only the extraction of keywords from the body of the web page. Thus, it may be most effective to determine the type <Type 5> to be an adult site type.
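Since a <Type 5> page is recognizable mainly by the same popular word being repeated in its title, one way to detect it is to count repetitions per popular keyword. The following sketch assumes a whitespace-tokenized title and lower-cased keywords; `repeated_popular_words` and `min_repeats` are hypothetical names, and the threshold value is left to the operator.

```python
from collections import Counter


def repeated_popular_words(title, popular_keywords, min_repeats):
    """Return popular keywords that occur more than min_repeats times in the title."""
    counts = Counter(w for w in title.lower().split() if w in popular_keywords)
    return {word: n for word, n in counts.items() if n > min_repeats}
```

A non-empty result would mark the title as matching the <Type 5> pattern.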
In a first adult site determination method according to the preferred embodiment of the present invention, the search robot crawls a web site, and records search result data of a web page of the web site after dividing the data into elements #TITLE, #BODY or #ANCHOR, and then determines, based on a predetermined basis, whether the web site is an adult site. Preferably, the ratio of character strings, which match adult keywords, to the total character strings included in the web page of the web site can be calculated, and the web site can be determined to be an adult site if the calculated ratio is more than a predetermined ratio. In the case of <Type 1>, the adult keyword ratio can be calculated using an indexer. According to an embodiment of the present invention, in the case of <Type 1>, the ratio of the length of adult keywords to the length of total character strings of the web document is calculated, and it is determined whether the web site is an adult site, based on whether the calculated ratio is more than a predetermined reference value. In this case, the adult site determination can be made based not only on the ratio but also on the number of character strings included in the web page, which match adult keywords, as described above.
In a second adult site determination method according to the preferred embodiment of the present invention, if the number of character strings included in a web page, which match adult keywords, is more than a predetermined number, a web site including the web page is determined to be an adult site. The second adult site determination can be useful particularly when the indexer cannot be used (i.e., in the case such as <Type 2>).
If it is determined, at step 335, that the web site is an adult site, the procedure moves to step 350 of Fig. 3b. At step 350, it is determined whether the web site determined to be an adult site is a deceptive site (branch (1)) or an altered site (branch (2)), in accordance with the web site management method of the present invention.
At branch (1), to determine whether the web site is a deceptive site, it is determined whether a web page included in the web site contains predetermined popular keywords (355).
Fig. 4b shows a method for selecting the popular keywords. As shown in Fig. 4b, the method for selecting the popular keywords as a basis for determining whether a web site is a deceptive site includes the following steps. First, search words are entered by a plurality of search engine users (455). Respective frequencies of the entered search words are recorded (460). To record the frequencies, the entered search words can be recorded in the form of a table and the value of a frequency field of the table can be increased by one each time a corresponding search word is entered. The entered search words, as obtained by the analysis, are sorted in order of the frequencies at intervals of a predetermined period (daily, weekly or monthly) (465). High-ranked search words are extracted from the sorted words, and the extracted search words are selected as popular keywords (470), which are then stored in the popular keyword database (475). It is preferable that search words constantly popular over the medium and long term, rather than search words regarding social issues in the short term, be selected as popular keywords. This is because search words (for example, "Starcraft" and "Zolaman"), which have generally gained constant popularity, are highly likely to be included as popular keywords in character strings of web page source files of deceptive sites, in terms of characteristics of the present invention.
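The preference for search words that stay popular over the medium and long term can be sketched as an intersection of per-period rankings. This is one possible interpretation, not a procedure stated in the specification: the sketch keeps only words that rank in the top `top_n` of every period supplied. The names `top_words`, `stable_popular_keywords` and `top_n` are hypothetical.

```python
from collections import Counter


def top_words(queries, top_n):
    """Most frequently entered search words within one period (steps 460 to 470)."""
    return {word for word, _ in Counter(queries).most_common(top_n)}


def stable_popular_keywords(queries_by_period, top_n):
    """Words that rank in the top_n of every period, filtering short-lived issues."""
    tops = [top_words(queries, top_n) for queries in queries_by_period]
    return set.intersection(*tops) if tops else set()
```

With weekly logs as input, a word tied to a one-week news event drops out, while a word such as "Starcraft" that ranks highly every week is retained.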
The predetermined basis for determining whether a web site is a deceptive site may be whether or not predetermined popular keywords are contained in a source file that constitutes a web page included in the web site (355). Examples of the deceptive site include an adult site of <Type 3> shown in Fig. 5c or an adult site of <Type 5> shown in Fig. 5e. If it is determined, at step 355, that a source file of the web site determined to be an adult site contains popular keywords, the web site may be determined to be a deceptive site (360). If the web site is determined to be a deceptive site, the procedure moves to step 605 of Fig. 6 via step 380, so as to take a predetermined measure against the web site.
At branch (2), registration information of the web site determined to be an adult site is searched for to determine whether the web site is an altered site (365). The reason for searching for the registration information is that the altered site is a web site that was registered as a general site and then altered to an adult site after the registration, as described above.
Web site category information of the web site stored in the web site information database is searched to determine whether the web site has been registered as an adult site (370). If the web site has not been registered as an adult site, the web site can be determined to be an altered site (375). If the web site is determined to be a deceptive or altered site at step 360 or 375, the procedure moves to step 605 of Fig. 6 via step 380, so as to take a predetermined measure against the web site.
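The two determinations of Fig. 3b, applied to a site already found to be an adult site, can be summarized in a short sketch. It assumes the popular keywords are stored in lower case, that the registered category is available as a string, and that the label `"adult"` marks a site registered as an adult site; these names and the function names are hypothetical. Whether both branches are evaluated for every site is not stated in the text, so the sketch simply reports every finding that applies.

```python
def is_deceptive(source_text, popular_keywords):
    """Step 355: the source file contains at least one popular keyword."""
    text = source_text.lower()
    return any(keyword in text for keyword in popular_keywords)


def is_altered(registered_category):
    """Step 370: the site was not registered as an adult site."""
    return registered_category != "adult"  # assumption: hypothetical category label


def review_adult_site(source_text, popular_keywords, registered_category):
    """Return the findings (steps 360 and 375) for a site already judged adult."""
    findings = []
    if is_deceptive(source_text, popular_keywords):
        findings.append("deceptive")
    if is_altered(registered_category):
        findings.append("altered")
    return findings
```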
With reference to Fig. 6, a description will now be given of how a punitive measure is automatically taken against a web site when it is determined to be a deceptive or altered site at step 360 or 375 of Fig. 3b. If the web site is determined to be a deceptive or altered site, a web site management module searches a web site information database to obtain information of a registrant of the web site (610), and the web site management module receives the registrant information (620 and 650). According to an embodiment of the present invention, the web site management module extracts contact information of the registrant, such as an email address or a mobile terminal number thereof, from the received registrant information (630), and controls a mail server or an SMS server to transmit a predetermined message to a location corresponding to the contact information (640).
According to another embodiment of the present invention, the web site management module extracts information of other registered web sites of the registrant from the registrant information (660), and then performs a control operation to automatically analyze the other web sites registered under the same registrant name (670). This is because the other web sites registered under the same registrant name are highly likely to be deceptive or altered sites operated based on the same or similar method. In this embodiment, if, based on the analysis of the other registered web sites, it is determined that they are deceptive or altered sites, step 610 of Fig. 6 may be repeated.
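Steps 630 to 670 can be sketched as a single planning function that returns the notices to send and the other sites to analyze. The record layout below is an assumption; the field names `email`, `mobile` and `sites`, the class `Registrant` and the function `plan_followup` are hypothetical, and actually sending mail or SMS messages is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Registrant:
    """Hypothetical registrant record from the web site information database."""
    name: str
    email: str = ""
    mobile: str = ""
    sites: list = field(default_factory=list)  # URLs registered under this name


def plan_followup(registrant, flagged_url):
    """Return (notices, sites_to_analyze) for a flagged web site."""
    notices = []
    if registrant.email:
        notices.append(("email", registrant.email))  # for the mail server
    if registrant.mobile:
        notices.append(("sms", registrant.mobile))   # for the SMS server
    # Other sites of the same registrant are queued for the same analysis.
    to_analyze = [url for url in registrant.sites if url != flagged_url]
    return notices, to_analyze
```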
According to a preferred embodiment of the present invention, if a web site is determined to be a deceptive site based on the analysis and determination methods, the system for managing the registered web sites may operate to automatically send an email, an SMS message or the like to a registrant of the web site to point out problems of the web site and then request that the registrant of the web site correct the problems within a grace period. In addition, the system may be set to automatically perform the analysis and determination processes after the grace period. If the problems of the web site have not been corrected even after the grace period, a punitive measure, such as cancellation of the registration of the web site, may be taken against the registrant thereof. According to another embodiment of the present invention, a punitive measure such as a complicated registration procedure may be imposed on the registrant of the web site when the registrant requests registration of another web site at a later time.
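The grace-period handling can be sketched as a small decision function. It assumes the notification date is stored, that the grace period is expressed in days, and that the result of the repeated analysis is passed in as a boolean; the function name `next_action` and the returned labels are hypothetical.

```python
from datetime import date, timedelta


def next_action(notified_on, today, grace_days, still_violating):
    """Decide what to do with a notified site on a given day."""
    deadline = notified_on + timedelta(days=grace_days)
    if today <= deadline:
        return "wait"  # grace period has not expired yet
    # After the grace period the site is analyzed again; the outcome of that
    # analysis is supplied by the caller as still_violating.
    return "cancel_registration" if still_violating else "close_case"
```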
Embodiments of the present invention further relate to computer readable media that include program instructions for performing various computer-implemented operations. The media may also include, alone or in combination with the program instructions, data files, data structures, tables, and the like. The media and program instructions may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind well known and available to those having skill in the computer software arts. Examples of computer-readable media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM disks; magneto-optical media such as floptical disks; and hardware devices that are specially configured to store and perform program instructions, such as read-only memory devices (ROM) and random access memory (RAM). The media may also be a transmission medium such as optical or metallic lines, wave guides, etc. including a carrier wave transmitting signals specifying the program instructions, data structures, etc. Examples of program instructions include both machine code, such as produced by a compiler, and files containing higher level code that may be executed by the computer using an interpreter.
Fig. 7 is a block diagram showing the internal configuration of a general computer system that can be used in managing web pages registered in the search engine according to the present invention.
The computer system includes any number of processors 740 (also referred to as central processing units, or CPUs) that are coupled to storage devices including primary storage 760 (typically a random access memory, or "RAM") and primary storage 770 (typically a read only memory, or "ROM"). As is well known in the art, primary storage 770 acts to transfer data and instructions uni-directionally to the CPU, and primary storage 760 is typically used to transfer data and instructions in a bi-directional manner. Both of these primary storage devices may include any suitable type of the computer-readable media described above. A mass storage device 710 is also coupled bi-directionally to CPU 740 and provides additional data storage capacity and may include any of the computer-readable media described above. The mass storage device 710 may be used to store programs, data and the like and is typically a secondary storage medium such as a hard disk that is slower than primary storage. A specific mass storage device such as a CD-ROM 720 may also pass data uni-directionally to the CPU. Processor 740 is also coupled to an interface 730 that includes one or more input/output devices such as video monitors, track balls, mice, keyboards, microphones, touch-sensitive displays, transducer card readers, magnetic or paper tape readers, tablets, styluses, voice or handwriting recognizers, or other well-known input devices such as, of course, other computers. Finally, processor 740 optionally may be coupled to a computer or telecommunications network using a network connection as shown generally at 750. With such a network connection, it is contemplated that the CPU might receive information from the network, or might output information to the network in the course of performing the above-described method steps. The above-described devices and materials will be familiar to those of skill in the computer hardware and software arts. The hardware elements described above may be configured (usually temporarily) to act as one or more software modules for performing the operations of this invention.
Industrial Applicability
According to the present invention, there is provided a method for managing web sites registered in a search engine, in which an algorithm is used to automatically detect deceptive sites or altered sites, thereby allowing users of the search engine to correctly search for their desired information.
According to the present invention, there is also provided a method for managing web sites registered in a search engine, in which deceptive sites or altered sites are automatically detected, and punitive measures are automatically imposed on operators of the detected deceptive sites or altered sites, thereby reinforcing self-purification of the web sites registered in the search engine.
According to the present invention, there is further provided a method for managing web sites registered in a search engine, in which an algorithm is used to automatically detect deceptive sites or altered sites and automatically take punitive measures, such as issuing a warning, against the detected sites, thereby saving a large amount of human resources that would otherwise have been spent detecting the deceptive sites.
The foregoing descriptions of specific embodiments of the present invention have been presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed, and obviously many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims appended hereto and their equivalents.

Claims

1. A method for managing a web site registered in a search engine, comprising the steps of: classifying web site information of the web site by predetermined fields and recording the classified web site information in a database; maintaining predetermined adult keywords in an adult keyword database; reading a source file constituting a web page of the web site; analyzing the read source file; and determining, based on a predetermined basis, whether or not the web site is an adult site.
2. The method according to claim 1, wherein the predetermined basis is whether or not the sum of lengths of character strings included in the source file, said character strings matching the adult keywords, is more than a predetermined number of bytes.
3. The method according to claim 1, wherein the predetermined basis is whether or not the number of character strings included in the source file, said character strings matching the adult keywords, is more than a predetermined number.
4. The method according to claim 1, wherein the predetermined basis is whether or not the ratio of the sum of lengths of character strings included in the source file, said character strings matching the adult keywords, to the length of total character strings included in the source file is more than a predetermined value.
5. The method according to claim 1, wherein the predetermined basis is whether or not the number of URLs included in the source file is more than a first number and the number of character strings included in the source file, said character strings matching the adult keywords, is more than a second number.
6. The method according to claim 1, wherein the predetermined basis is whether or not the length of character strings included in the source file, said character strings having no space therebetween, is more than a predetermined number of bytes and the number of character strings included in the source file, said character strings matching the adult keywords, is more than a predetermined number.
7. The method according to claim 1, further comprising the step of maintaining predetermined popular keywords in a popular keyword database, wherein the predetermined basis is whether or not the number of character strings included in the source file, said character strings matching the popular keywords, is more than a predetermined number.
8. The method according to claim 1, wherein said step of analyzing the read source file comprises the step of extracting anchor text from the source file, and wherein the predetermined basis is whether or not the number of character strings included in the anchor text, said character strings matching the adult keywords, is more than a predetermined number.
9. The method according to claim 1, further comprising the step of maintaining predetermined popular keywords in a popular keyword database, wherein the predetermined basis is whether or not character strings included in the source file, said character strings matching the popular keywords, repeatedly appear more than a predetermined number of times.
10. The method according to claim 1, wherein said step of analyzing the read source file comprises the step of extracting a title tag from the source file, and wherein the predetermined basis is whether or not the sum of lengths of character strings included in the title tag of the source file, said character strings matching the adult keywords, is more than a predetermined number of bytes.
11. A method for managing a web site registered in a search engine, comprising the steps of: classifying web site information of the web site by predetermined fields and recording the classified web site information in a database; maintaining predetermined adult keywords in an adult keyword database; maintaining predetermined popular keywords in a popular keyword database; reading a source file constituting a web page of the web site; analyzing the read source file; determining, based on a first predetermined basis, whether or not the web site is an adult site; determining, based on a second predetermined basis, whether or not the web site is a deceptive site, if the web site is determined to be an adult site; and performing a control operation to perform predetermined processing on the web site if the web site is determined to be a deceptive site.
12. The method according to claim 11, wherein the second predetermined basis is whether or not the number of character strings included in the source file, said character strings matching the popular keywords, is more than a predetermined number.
13. The method according to claim 11, wherein the second predetermined basis is whether or not character strings included in the source file, said character strings matching the popular keywords, repeatedly appear more than a predetermined number of times.
14. A method for managing a web site registered in a search engine, comprising the steps of: classifying web site information of the web site by predetermined fields and recording the classified web site information in a database; maintaining predetermined adult keywords in an adult keyword database; reading a source file constituting a web page of the web site; analyzing the read source file; determining, based on a first predetermined basis, whether or not the web site is an adult site; determining, based on a second predetermined basis, whether or not the web site is an altered site, if the web site is determined to be an adult site; and performing a control operation to perform predetermined processing on the web site if the web site is determined to be an altered site.
15. The method according to claim 14, wherein said database includes a web site category information field, and wherein said step of determining whether or not the web site is an altered site comprises the steps of: obtaining web site category information of the web site by searching the web site category information field of said database; determining whether the web site has been registered as a general web site by analyzing the obtained web site category information; and determining the web site to be an altered site if it is determined that the web site has been registered as a general web site.
16. The method according to any one of claims 1, 11, and 14, wherein said step of maintaining the predetermined adult keywords in the adult keyword database comprises the steps of: receiving web site information of at least one web site, determined by a manager to be an adult site, from said database; extracting character strings included in a web page of said at least one web site; recording respective frequencies of the extracted character strings; sorting the extracted character strings in order of the frequencies; extracting a predetermined number of character strings as adult keywords from the sorted character strings; and storing the extracted adult keywords in the adult keyword database.
17. The method according to any one of claims 7, 9, and 11, wherein said step of maintaining the predetermined popular keywords in the popular keyword database comprises the steps of: receiving search words entered by a plurality of users; recording respective frequencies of the entered search words; sorting the entered search words in order of the frequencies; extracting a predetermined number of search words as popular keywords from the sorted search words; and storing the extracted popular keywords in the popular keyword database.
18. The method according to claim 11 or 14, wherein said database includes a web site registrant field, and wherein said step of performing the control operation to perform the predetermined processing comprises the steps of: obtaining registrant information of a registrant of the web site by searching a web site registrant field of said database; extracting contact information of the registrant from the web site registrant information; and transmitting a message to a location corresponding to the extracted contact information.
19. The method according to claim 18, wherein the contact information is an email address or a mobile terminal number of the registrant of the web site, and wherein said step of transmitting the message comprises the step of controlling an email server to send an email to the email address or the step of controlling an SMS server to send an SMS message to the mobile terminal number.
20. The method according to claim 11 or 14, wherein said database includes a web site registrant field, and wherein said step of performing the control operation to perform the predetermined processing includes the steps of: obtaining registrant information of a registrant of the web site by searching the web site registrant field of the database; extracting a URL of a different web site registered by the registrant from the registrant information of the web site; and reading a source file constituting a web page of the web site; analyzing the read source file; determining, based on a first predetermined basis, whether or not the web site is an adult site; and determining, based on a second predetermined basis, whether or not the web site is an altered or deceptive site if the web site is determined to be an adult site.
21. A computer-readable recording medium in which a program for performing the method defined in any one of claims 1 to 20 is recorded.
22. A system for managing a web site registered in a search engine, the system comprising: an interface module for performing data communication with at least one terminal; a web site registration module for receiving a web site registration request including web site information of a predetermined web site from said at least one terminal and classifying the web site information by predetermined fields; a database for classifying and storing a predetermined keyword and the web site information; a web site analysis module for extracting a source file constituting a web page of the web site, and analyzing the extracted source file; and a web site management module for determining, based on a predetermined basis, whether or not the web site is a deceptive or altered site.
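As an illustration only, the popular-keyword steps of claim 17 and the notification steps of claims 18 and 19 might be sketched in Python as follows. The function names, the in-memory dictionary standing in for the database, the lower-casing of search words, the preference for email over SMS, and the callbacks standing in for the email and SMS servers are all assumptions made for this sketch and are not taken from the claims.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List


def extract_popular_keywords(search_words: Iterable[str], top_n: int) -> List[str]:
    """Claim 17 sketch: count entered search words, sort by frequency, keep top_n."""
    # Normalising with strip() and lower() is an assumption, not part of the claim.
    counts = Counter(w.strip().lower() for w in search_words if w.strip())
    # most_common() returns entries in descending order of frequency.
    return [word for word, _ in counts.most_common(top_n)]


def notify_registrant(
    site_db: Dict[str, dict],
    url: str,
    message: str,
    send_email: Callable[[str, str], None],
    send_sms: Callable[[str, str], None],
) -> str:
    """Claims 18-19 sketch: look up the registrant and message them.

    site_db is a hypothetical stand-in for the database with a registrant
    field; send_email and send_sms stand in for the email and SMS servers.
    Returns the channel used: "email", "sms", or "none".
    """
    registrant = site_db[url]["registrant"]
    # Preferring email when both contacts exist is an assumption of this sketch.
    if registrant.get("email"):
        send_email(registrant["email"], message)
        return "email"
    if registrant.get("mobile"):
        send_sms(registrant["mobile"], message)
        return "sms"
    return "none"
```

In this sketch the extracted keywords would then be written to the popular keyword database, and the message passed to `notify_registrant` would carry whatever notice the predetermined processing calls for.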
PCT/KR2004/000665 2003-04-04 2004-03-25 A method of managing registered web sites in search engine and a system thereof WO2004088542A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2003-0021483 2003-04-04
KR1020030021483A KR100610775B1 (en) 2003-04-04 2003-04-04 A method of managing registered web sites in search engine and a system thereof

Publications (1)

Publication Number Publication Date
WO2004088542A1 (en) 2004-10-14

Family

Family ID: 33128960

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2004/000665 WO2004088542A1 (en) 2003-04-04 2004-03-25 A method of managing registered web sites in search engine and a system thereof

Country Status (2)

Country Link
KR (1) KR100610775B1 (en)
WO (1) WO2004088542A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100670789B1 (en) * 2004-12-03 2007-01-17 한국전자통신연구원 Method for multi-level text filtering for blocking harmful web-sites
KR100823388B1 (en) * 2006-08-11 2008-04-17 주식회사 케익소프트 Method for providing web accessibility service and system thereof
KR101140263B1 (en) * 2010-07-07 2012-06-13 엔에이치엔(주) Method, system and computer readable recording medium for refining web based documents using text pattern extraction
KR102269954B1 (en) * 2019-02-28 2021-06-25 안상필 System for collecting status of web site

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR19990025595A (en) * 1997-09-13 1999-04-06 전주범 Preventing adult site open on televisions with Internet capabilities
KR20010025209A (en) * 2000-10-20 2001-04-06 고진선 Business method for providing harmful information intercept service using network and computer readable medium having stored thereon computer executable instruction for performing the method
KR20010105960A (en) * 2000-05-19 2001-11-29 이동진 Interception system of noxious- information in internet
KR20020081774A (en) * 2001-04-19 2002-10-30 주식회사 플랜티넷 Apparatus and method for uholesome site database saving

Also Published As

Publication number Publication date
KR20060038486A (en) 2006-05-04
KR100610775B1 (en) 2006-08-09

Similar Documents

Publication Publication Date Title
US8326818B2 (en) Method of managing websites registered in search engine and a system thereof
US8117208B2 (en) System for entity search and a method for entity scoring in a linked document database
US8972371B2 (en) Search engine and indexing technique
US5895470A (en) System for categorizing documents in a linked collection of documents
US5835905A (en) System for predicting documents relevant to focus documents by spreading activation through network representations of a linked collection of documents
EP0886822B1 (en) System and method for locating resources on a network using resource evaluations derived from electronic messages
KR100462292B1 (en) A method for providing search results list based on importance information and a system thereof
US7421416B2 (en) Method of managing web sites registered in search engine and a system thereof
US20120179667A1 (en) Searching through content which is accessible through web-based forms
US20030208482A1 (en) Systems and methods of retrieving relevant information
Jepsen et al. Characteristics of scientific Web publications: Preliminary data gathering and analysis
KR100557874B1 (en) Method of scientific information analysis and media that can record computer program thereof
WO2004088542A1 (en) A method of managing registered web sites in search engine and a system thereof
KR20040098889A (en) A method of providing website searching service and a system thereof
KR100931772B1 (en) A method of providing website searching service and a system thereof
JP2003173351A (en) Method, device, program and storage medium for analysis, collection and retrieval of information
Clarke Search engines for the World Wide Web: an evaluation of recent developments
KR100458458B1 (en) A method of managing web sites registered in search engine and a system thereof
KR101048590B1 (en) A method of managing web sites registered in search engine and a system thereof
KR100994326B1 (en) A method for providing search results list based on importance information and a system thereof
KR20040086733A (en) A method of managing registered web sites in search engine and a system thereof
Broder et al. Information Retrieval on the Web.
KR100931775B1 (en) A method of providing website searching service and a system thereof
JP2006508466A (en) Method for registering website information in search engine and website search service method using the same
KR20040103763A (en) A method of managing web sites registered in search engine

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): BW GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase