KR20040086732A

KR20040086732A - A method of managing web sites registered in search engine and a system thereof

Info

Publication number: KR20040086732A
Application number: KR1020040003113A
Authority: KR
Inventors: 나선균; 이현정; 최기정
Original assignee: 엔에이치엔(주)
Priority date: 2004-01-15
Filing date: 2004-01-15
Publication date: 2004-10-12

Abstract

PURPOSE: A method and a system for managing websites registered in a search engine are provided so that search engine users can accurately find the information they want, by automatically detecting malicious sites through an algorithm. CONSTITUTION: An interface module (201) acts as the interface, for data transmission and physical transmission equipment, between the computer terminal of a registrant registering a website in the search engine and the registration management system of the search engine. A website registration module (202) receives a registration request for the website from the registrant, and collects and classifies the information contained in the request. A website management module (203) judges, based on the website information collected by a search robot (207), whether the website is operated in conformity with a selected standard, and automatically takes action against the registrant if the website is malicious. A website information database (204) classifies and stores information about the registered websites. A website analysis module (205) analyzes the website information collected by the search robot.

Description

METHOD OF MANAGING WEB SITES REGISTERED IN SEARCH ENGINE AND A SYSTEM THEREOF

The present invention relates to a search engine that provides information about websites on the Internet. More specifically, it relates to a method of analyzing information about a website registered in a search engine and managing the registered website so that search results that differ from the actual content of the website are not provided.

Conventional search engines such as AltaVista (http://www.altavista.com), Lycos (http://www.lycos.com), and Yahoo (http://www.yahoo.com) typically consist of a database for classifying, storing, and managing website information according to predetermined criteria; a search robot, implemented in software, that continuously traverses the web and mechanically collects new website information; and search engine software that builds the collected data into a database so that users of the search engine can search it.

FIG. 1A shows a block diagram of the overall system for providing the search engine service described above. Referring to FIG. 1A, a user connects to the search engine server 150 over the Internet through a user terminal 110. When the user enters a search term, the search engine server 150 queries the search engine software 140 for website information on that term, and the search engine software 140 searches the database 130 and returns the matching website information. As described above, the search robot 120 is an entity implemented in software that continuously traverses the web and mechanically collects new website information from web servers 160. The search robot 120 retrieves documents written in HyperText Markup Language (HTML) on the network, parses the link targets they contain, and collects data from the many websites on the network. The data collected by the search robot 120 is then built into a database; here, this means the sequence of performing morphological analysis on the information located on a website, creating an index table, and storing it in the database 130. The database 130 stores all website information collected by the search robot 120. The search engine software 140 presents search results to the user: it searches the many pages stored in the database 130 and lists the results in order of how closely they match the search terms. A conventional search engine of this kind registers website information and provides it to users in the following ways.
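By way of illustration only, the collection and indexing sequence described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the implementation of any actual search engine: the names `LinkExtractor` and `build_index` are hypothetical, pages are supplied as an in-memory mapping rather than fetched from web servers, and whitespace splitting stands in for morphological analysis.

```python
from collections import defaultdict
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects link targets and text from one HTML page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []   # href values found in <a> tags
        self.words = []   # lower-cased words found in text nodes

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Assumption: whitespace splitting stands in for morphological analysis.
        self.words.extend(data.lower().split())


def build_index(pages):
    """Build a simple index table: word -> set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, html in pages.items():
        parser = LinkExtractor()
        parser.feed(html)
        parser.close()
        for word in parser.words:
            index[word].add(url)
    return index
```

A real search robot would also follow the extracted links to discover further pages; that traversal is omitted here to keep the sketch short.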

(1) As described above, information is collected using a search robot, and the website is registered in the search engine after the collected information has been reviewed by a professional surfer.

(2) The registrant selects a directory, classified by subject, that matches the website to be registered and applies to register the website in that directory; the website is registered in the search engine after review by a professional surfer. For directory registration of this kind, some search engines offer a service that shortens the time required for registration in return for a registration fee.

A website registered in a search engine by one of the above methods is retrieved and provided to the user, in response to the search term the user enters, through one of various search methods such as integrated web search or directory search. Integrated web search is also called "search by word": the uniform resource locators (URLs) of all websites are stored in a database, and the user finds the desired information by entering a specific keyword. Directory search is also called "search by subject": websites are classified by field, and following the link for a field shows the detailed items in that field. By continuing to follow links to more detailed items, the user can find the desired information. For example, to find the match scores of the Korean team in the 2002 Korea World Cup, the user can search along the path Sports -> Ball games -> Soccer -> World Cup -> Korea/Japan 2002 World Cup -> Korean team match scores. FIG. 1B shows a screen displaying an example of such a directory search. For the search term "World Cup", directories such as "World Cup", "2002 FIFA World Cup Korea Japan", and "History of the World Cup" are output, and the user can search by moving to the directory most likely to contain the desired information. A representative search engine using the integrated web search method is Lycos (http://lycos.cs.cmu.edu), developed by Michael L. Mauldin of Carnegie Mellon University, and a representative search engine using the directory search method is Yahoo (http://www.yahoo.com). Many search engines currently provide a hybrid service that offers both search methods together.
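The directory search described above can be pictured as walking a tree of categories. The following Python sketch is illustrative only: the tree contents, the URL `http://scores.example/korea-2002`, and the function names `browse` and `subcategories` are all hypothetical and are not taken from any actual search engine.

```python
# Hypothetical directory tree; dicts are categories, lists hold registered URLs.
DIRECTORY = {
    "Sports": {
        "Ball games": {
            "Soccer": {
                "World Cup": {
                    "Korea/Japan 2002 World Cup": {
                        "Korean team match scores": [
                            "http://scores.example/korea-2002",
                        ],
                    },
                },
            },
        },
    },
}


def browse(tree, path):
    """Follow a sequence of category names and return what is found there."""
    node = tree
    for name in path:
        node = node[name]  # raises KeyError if the category does not exist
    return node


def subcategories(tree, path):
    """List the category names one level below the given path."""
    node = browse(tree, path)
    return sorted(node) if isinstance(node, dict) else []
```

Each call to `subcategories` corresponds to one screen of links that the user follows until a leaf holding website entries is reached.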

The conventional methods described above for registering a website in a search engine and for searching registered websites have the following problems.

As the number of Internet users grows rapidly, so do the number of users searching for information and the variety of information they search for. As a result, some search terms come to appear very frequently (hereinafter, "popular keywords"). By inserting such popular keywords into a website in various ways, a site operator can cause information about a website whose content is entirely useless to the searcher (hereinafter, a "deceptive site") to be provided to users who search with those terms. For example, when a user looking for information about "Pikachu", one of the popular keywords, enters "Pikachu" as a search term, information on every registered website that contains "Pikachu" is provided to the user. Among these there may be websites whose content is adult material but which have inserted the text "Pikachu" throughout the site in various (in most cases malicious) ways, so that, because of the inserted popular keyword, information about adult websites risks being exposed to users of all ages.

Addressing these problems requires user reports or continuous monitoring of registered websites by specialized personnel such as professional surfers, but such conventional approaches plainly cannot be an ultimate solution. If a method could be devised that handles the problem automatically on the Internet by means of a predetermined algorithm, it would be a useful means of resolving the problems described above.

The method for managing the registration of websites in a search engine according to the present invention is intended to solve the above problems of the prior art. Its object is to provide a search engine in which deceptive sites are detected automatically using a predetermined algorithm, so that search engine users can accurately find the information they are looking for.

A further object of the method for managing the registration of websites in a search engine according to the present invention is to strengthen the self-regulation of the websites registered in the search engine, by automatically detecting deceptive sites and imposing sanctions on the operators of the detected sites.

A further object of the method for managing the registration of websites in a search engine according to the present invention is to save the considerable human resources that deceptive-site detection would otherwise require, by having the detection of deceptive sites and sanctions such as warnings to the detected sites carried out automatically by a predetermined algorithm.

FIG. 1A is a block diagram illustrating a conventional system for providing a website search engine service.

FIG. 1B is a diagram illustrating an example of the directory search method among website search engine service methods.

FIG. 2 is a block diagram illustrating a system for managing registered websites in a search engine according to a preferred embodiment of the present invention.

FIG. 3 is a flowchart illustrating a method for managing registered websites in a search engine according to an embodiment of the present invention.

FIGS. 4A to 4K are diagrams illustrating the types of deceptive-site information that the search robot reads by traversing websites, in the method for managing registered websites in a search engine according to a preferred embodiment of the present invention.

FIG. 5 is a flowchart illustrating a method of applying a predetermined sanction to the registrant of a website determined to be a deceptive site, in the method for managing registered websites in a search engine according to a preferred embodiment of the present invention.

FIG. 6 is an internal block diagram of a general-purpose computer system that can be used to manage the registration of websites in a search engine according to the present invention.

<Explanation of reference numerals for the main parts of the drawings>

201: interface module 202: website registration module

203: website management module 204: website information DB

205: website analysis module 207: search robot

208: mail server 209: SMS server

A method for managing the registration of a website in a search engine according to a preferred embodiment of the present invention comprises: receiving information about the registered website and recording the website information in database means, classified by predetermined fields; controlling a search robot to read the source file constituting a web page of the registered website; analyzing the read source file; determining, according to a predetermined criterion, whether the website is a deceptive site; and, if the website is determined to be a deceptive site, controlling the performance of predetermined processing on the registered website. Preferably, the source file may be a HyperText Markup Language (HTML) document.
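The sequence of steps recited above can be sketched as a single function that wires the steps together. This is only an illustrative sketch: the function name `manage_registered_site`, its parameters, and the use of a plain dictionary as the database are assumptions made for the example, and the analysis and judgment steps are passed in as callables because the criteria themselves are described separately.

```python
def manage_registered_site(site_info, database, read_source,
                           analyze, is_deceptive, apply_sanction):
    """Run the recited steps once for one registered website.

    All parameters are hypothetical stand-ins:
      site_info      - dict of registration fields, including "url"
      database       - dict acting as the website information database
      read_source    - callable(url) -> source file text (the search robot)
      analyze        - callable(source) -> analysis findings
      is_deceptive   - callable(site_info, findings) -> bool
      apply_sanction - callable(site_info), the predetermined processing
    """
    # Record the website information, classified by field.
    database[site_info["url"]] = dict(site_info)

    # Have the search robot read the source file of the web page.
    source = read_source(site_info["url"])

    # Analyze the source file that was read.
    findings = analyze(source)

    # Judge against the predetermined criterion.
    deceptive = is_deceptive(site_info, findings)

    # Perform the predetermined processing only for deceptive sites.
    if deceptive:
        apply_sanction(site_info)
    return deceptive
```

Passing the steps in as callables keeps the sketch independent of any particular detection criterion or sanction.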

A system for managing registered websites in a search engine according to a preferred embodiment of the present invention comprises: an interface module for performing data communication with one or more terminals; a website registration module for receiving, from the one or more terminals, a website registration request containing information about a website and for classifying the website information by predetermined fields; database means for classifying and storing the website information and predetermined keywords corresponding to the website; a website analysis module for extracting the source file constituting a web page of the website and analyzing the extracted source file; and a website management module for determining, according to a predetermined criterion, whether the website is a deceptive site.

As described above, a deceptive site in this specification means a website in which predetermined keywords have been inserted in various ways, for example into the source file of a web page, so that the content found through a search term is completely different from the content the website actually contains. According to an embodiment of the present invention, the keywords inserted into the source file of the web page may be popular keywords.

A popular keyword in this specification means a search term that appears very frequently among the search terms entered by Internet users; popular keywords can change continually with the social circumstances of the time and the tastes of Internet users. Popular keywords may include harmful keywords, that is, keywords with socially harmful connotations; examples include "suicide", "bullying", "gambling", and "plotting a crime".

A method for managing the registration of websites in a search engine according to a preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings.

FIG. 2 is a block diagram illustrating a system for managing the registration of websites in a search engine according to a preferred embodiment of the present invention. Referring to FIG. 2, the system may comprise an interface module 201, a website registration module 202, a website management module 203, a website information database 204, a website analysis module 205, and a search robot 207. According to a preferred embodiment of the present invention, the system may also include a mail server 208 or an SMS server 209 for sending messages to the registrants of registered websites. The mail server 208 and the SMS server 209 may be part of the search engine service providing system or may be located in a system operated by a third party. Although FIG. 2 shows the interface module 201, the various modules, and the mail server 208 or SMS server 209 as separate entities, this is only for convenience of description; they may be the same entity. The components shown in FIG. 2 may be located in the same physical place or, in another embodiment, may be physically separated.

First, the interface module 201 acts as the interface, for both data transmission and the physical transmission equipment, between the computer terminal of a registrant who wishes to register a website in the search engine and the registration management system of the search engine.

The website registration module 202 receives a registration request for a website from the registrant, and collects and classifies the information about the website contained in the request. The website registration module 202 may further include a billing module (not shown) that charges for website registration, and the billing module may apply different charges depending on the type of website to be registered (a general site carrying general content, or an adult site carrying adult content).

The website management module 203 oversees the registration management of websites according to the present invention. Based on the website information collected by the search robot 207, it determines whether a website is being operated in conformity with the selected criteria and, if the website is determined to be operated abnormally, that is, to be a deceptive site, it controls the system so that predetermined action is taken automatically against the registrant. By working with the mail server 208 or the short message service (SMS) server 209, the website management module 203 can also warn of the improper operation of the website by sending an email to the registrant of the deceptive site or an SMS to the registrant's mobile terminal.
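The warning step can be sketched as follows. This is an illustrative sketch only: `warn_registrant`, the message wording, and the dictionary keys `email` and `mobile` are hypothetical, and the mail server and SMS server are represented by callables supplied by the caller. The sketch also assumes that a warning goes out on every channel for which contact details exist, which the description leaves open.

```python
def warn_registrant(registrant, send_email, send_sms):
    """Warn the registrant of a deceptive site by email and/or SMS.

    registrant - dict that may hold "email" and "mobile" (hypothetical keys)
    send_email - callable(address, message), stands in for the mail server
    send_sms   - callable(number, message), stands in for the SMS server
    Returns the list of channels that were used.
    """
    message = ("Warning: your registered website appears to violate "
               "the registration criteria.")
    channels = []
    if registrant.get("email"):
        send_email(registrant["email"], message)
        channels.append("email")
    if registrant.get("mobile"):
        send_sms(registrant["mobile"], message)
        channels.append("sms")
    return channels
```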

The website information database 204 classifies and records information about registered websites. It may store, classified by field, various information such as the website's uniform resource locator (URL), the website's keywords, the website's registrant information (registrant name, address, email address, mobile terminal number, and so on), and the website's directory information.

The information stored in the website information database 204 according to the present invention can be modified by the system administrator and by the registrant of the website. When the content of a website changes, the information can also be updated automatically, without the registrant editing it, according to the results of analyzing the data collected by the search robot 207 (for example, new keywords corresponding to the website's URL).
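One possible shape for a record in the website information database, together with the automatic keyword update, is sketched below. The class names `Registrant` and `WebsiteRecord`, the field names, and the function `refresh_keywords` are hypothetical; the description names the kinds of information stored but does not prescribe a schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Registrant:
    """Registrant information fields named in the description."""
    name: str
    address: str
    email: str
    mobile_number: str


@dataclass
class WebsiteRecord:
    """One registered website, stored field by field (hypothetical schema)."""
    url: str
    directory: str
    registrant: Registrant
    keywords: List[str] = field(default_factory=list)


def refresh_keywords(record, analyzed_keywords):
    """Replace the stored keywords with those found by the latest analysis.

    Returns True if the stored keywords changed, False otherwise.
    """
    new_keywords = sorted(set(analyzed_keywords))
    changed = new_keywords != sorted(set(record.keywords))
    record.keywords = new_keywords
    return changed
```

Returning whether anything changed lets a caller decide if the site needs to be analyzed again after its content was updated.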

The website analysis module 205 analyzes the website information collected by the search robot 207. The types of data collected by the search robot 207 and the analysis method are described in detail below together with FIG. 3.

The components of the system for managing the registration of websites in a search engine according to the embodiment described above are divided only functionally, for convenience of description, and this division is independent of each component's actual physical location. It will also be apparent to those skilled in the art that the modules described above may be implemented in hardware or as software using specific code.

FIG. 3 is a flowchart illustrating a method for managing the registration of websites in a search engine according to a preferred embodiment of the present invention. The method shown in FIG. 3 is described in detail below with reference to FIGS. 4A to 4K and FIG. 6.

The method for managing the registration of websites in a search engine according to the preferred embodiment shown in FIG. 3 proceeds as follows. A registrant who wishes to register a website in the search engine submits a website registration request together with information about the website (step 305). The website information is recorded in the website information database, classified by information field (registrant name, address, email address, mobile terminal number, and so on) (step 310), and the website is registered in the search engine (step 315). The registration step (step 315) can take place by several routes: as described above, a website administrator may request that the search engine register the website, or a website may be registered using website information that the search robot gathered while roaming the web at random. In the former case, the registrant chooses the subject of the website (for example, "Pikachu" or "patent attorney examination") and applies to register the website in the category closest to that subject; the website can then be registered in the search engine if, after review by a professional surfer, it is judged to satisfy predetermined conditions (such as the completeness of the website and, where no registration fee is paid, whether it meets the requirements for a non-commercial site). Although the method is described here only for the route of registration at the registrant's request, the method and system for managing the registration of websites in a search engine according to the present invention apply equally to the various other ways in which a website may be registered in a search engine.

Once the website is registered, the search engine controls the search robot to read the source file constituting a web page of the registered website, and analyzes the read source file (step 320).

The analysis method according to an embodiment of the present invention analyzes HyperText Markup Language (HTML) documents. More specifically, by analyzing the tags in a website's HTML document, it can be determined whether the website is one into whose HTML documents frequently appearing search terms, that is, popular keywords, have been inserted, in other words a deceptive site. As those skilled in the art know, an HTML document is written with commands called tags; web designers and others who build websites use these tags to construct a website and to include in it the content they wish to offer.
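As a rough illustration of tag-level analysis, the Python sketch below records, for each popular keyword found in an HTML document, the tag that encloses it and that tag's attributes, so that later checks can inspect how the keyword was inserted. It is a sketch under assumptions, not the analysis method of the invention: `KeywordTagScanner`, `scan`, and the `VOID_TAGS` set are hypothetical names, and keyword matching is a plain substring test.

```python
from html.parser import HTMLParser

# Tags that never have an end tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class KeywordTagScanner(HTMLParser):
    """Records which tag encloses each occurrence of a popular keyword."""

    def __init__(self, popular_keywords):
        super().__init__()
        self.popular_keywords = list(popular_keywords)
        self.open_tags = []  # stack of (tag name, attribute dict)
        self.hits = []       # (keyword, enclosing tag, attribute dict)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching start tag, if there is one.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                del self.open_tags[i:]
                break

    def handle_data(self, data):
        if not self.open_tags:
            return
        tag, attrs = self.open_tags[-1]
        for keyword in self.popular_keywords:
            if keyword in data:
                self.hits.append((keyword, tag, attrs))


def scan(html, popular_keywords):
    """Return (keyword, enclosing tag, attributes) for each keyword found."""
    scanner = KeywordTagScanner(popular_keywords)
    scanner.feed(html)
    scanner.close()
    return scanner.hits
```

The returned attributes (for example a `color` value) are the raw material for deciding whether a keyword was inserted in a way that hides it from the visitor.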

FIGS. 4A to 4K illustrate various examples in which analysis of a website's HTML document, performed in step 320 of FIG. 3, finds that the tags in the document contain character strings that could mark the site as deceptive; more specifically, they show the various types of deceptive site detected by analyzing the tags of a website's HTML document. How the analysis of HTML documents is carried out in the method for managing the registration of websites in a search engine according to the present invention is described in detail below with reference to FIGS. 4A to 4K.

(1) Deceptive site using a string of the same color as the background color

FIG. 4A illustrates an example of a deceptive site that contains text in the same color as the website's background. The left side of the figure shows the website as displayed to the user, and the right side shows the HTML source file of that website. In the source file at the top of FIG. 4A, the background color (bgcolor) is set to #FFFFFF and the text color is also set to #FFFFFF, so the text "StarCraft" and "Zolaman" is not visible on the website's screen. Likewise, in the example at the bottom of FIG. 4A, the background color is set to #FFFFFF, which denotes white, and the text color is set to white, so the text "StarCraft" and "Zolaman" is again invisible. As those skilled in the art will appreciate, the <body> tag in the source file of FIG. 4A sets various attributes of the background and text displayed in the web browser. Tags fall broadly into two groups: tags made up of a start tag and an end tag (such as the <body></body> and <font></font> tags shown in FIG. 4A) and standalone tags that need no end tag. A website can be constructed in many ways using these tags. If the background color and the text color are the same as described above, the website panders to certain popular keywords and can appear on the search results screen even though its content is unrelated to those keywords.
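The check described for FIG. 4A can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used by the system: the class and function names are hypothetical, only the bgcolor attribute of <body> and the color attribute of <font> are examined, and the color-name table is deliberately tiny.

```python
from html.parser import HTMLParser

# Hypothetical alias table; a real checker would need a full color-name list.
COLOR_ALIASES = {"white": "#ffffff", "black": "#000000"}


def normalize_color(value):
    """Lower-case a color value and map a few color names to hex."""
    value = value.strip().lower()
    return COLOR_ALIASES.get(value, value)


class HiddenTextFinder(HTMLParser):
    """Collects text inside <font> tags whose color equals the body bgcolor."""

    def __init__(self):
        super().__init__()
        self.bgcolor = None
        self.font_stack = []   # one flag per open <font>: color == bgcolor?
        self.hidden_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body" and attrs.get("bgcolor"):
            self.bgcolor = normalize_color(attrs["bgcolor"])
        elif tag == "font":
            color = attrs.get("color")
            same = (color is not None and self.bgcolor is not None
                    and normalize_color(color) == self.bgcolor)
            self.font_stack.append(same)

    def handle_endtag(self, tag):
        if tag == "font" and self.font_stack:
            self.font_stack.pop()

    def handle_data(self, data):
        if any(self.font_stack) and data.strip():
            self.hidden_text.append(data.strip())


def find_hidden_text(html):
    """Return the strings rendered in the same color as the background."""
    finder = HiddenTextFinder()
    finder.feed(html)
    finder.close()
    return finder.hidden_text
```

The length of the returned strings could then be compared against whatever threshold the operator chooses.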

(2) Deceptive site using strings included in a redirection page

FIG. 4B illustrates an example of a deceptive site that uses strings included in a redirection page. The left side of the figure shows the website as displayed to the user, and the right side shows the HTML source file of that website. As those skilled in the art will appreciate, redirection sets up a move from the accessed website to a new website, and it can be implemented in the forms shown in the source files on the right side of FIG. 4B. The example at the top right uses the http-equiv attribute of the meta tag. This meta tag is normally used to move automatically to another document after a set time (the time specified in the content attribute in FIG. 4B); typically, when the address of a home page changes, it shows users who reach the old address a notice of the change and then takes them to the new address automatically once the set time has elapsed. The examples in the middle and at the bottom right of FIG. 4B likewise redirect the current web page to "http://www.naver.com", using self.location and location.replace respectively inside a script element.

In the deceptive-site example of FIG. 4B, popular keywords ("StarCraft", "Zolaman") are inserted after the redirection instruction in the meta-tag case at the top right, and after the </script> tag in the middle and bottom cases.

In such a redirection page the tags themselves instruct the browser to move to another website, so text added after them plays no role for the user. The search robot, however, determines its results from how often strings occur in the website, so it may judge the subject of the website to be different from its real subject. Therefore, if a redirection page contains strings as described above, the website panders to popular keywords and can appear on the search results screen even though its content is unrelated to those keywords.
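A minimal sketch of this check follows, assuming that a page counts as a redirection page when it has a meta refresh tag or a script that assigns to or calls location. The regular expression and all names are illustrative assumptions and would miss many real-world redirect idioms.

```python
import re
from html.parser import HTMLParser

# Matches e.g. self.location="...", location.href = "...", location.replace("...")
JS_REDIRECT = re.compile(r"(?:self\.)?location(?:\.href|\.replace)?\s*[=(]")


class RedirectTextFinder(HTMLParser):
    """Flags redirect instructions and collects text outside <script>."""

    def __init__(self):
        super().__init__()
        self.redirects = False
        self.in_script = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("http-equiv") or "").lower() == "refresh":
            self.redirects = True
        elif tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            if JS_REDIRECT.search(data):
                self.redirects = True
        elif data.strip():
            self.text.append(data.strip())


def redirect_page_text(html):
    """Return the text found on a redirecting page, or [] if it does not redirect."""
    finder = RedirectTextFinder()
    finder.feed(html)
    finder.close()  # flush any trailing text that follows the last tag
    return finder.text if finder.redirects else []
```

Any non-empty result is text that a redirected visitor never sees but a crawler may index.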

(3) Deceptive site using strings in the title tag

FIG. 4C illustrates an example of a deceptive site that uses strings included in the title tag. The left side of the figure shows the website as displayed to the user, and the right side shows the HTML source file of that website. As those skilled in the art will appreciate, the title tag is used to display the subject of a website briefly at the top of the web browser, and it can be implemented in the forms shown in the source files on the right side of FIG. 4C. In the example at the top right, a single title tag contains many popular keywords such as "StarCraft" and "Zolaman", and the browser renders it as shown on the left. The example at the bottom of FIG. 4C uses multiple title tags; its source file contains many popular keywords such as "Hiddink", "StarCraft", and "Zolaman" between the <title> and </title> tags.

However many strings a title tag contains, they are not displayed in the page shown by the web browser. The search robot, however, determines its results from how often strings occur in the website, so the strings in the title tag may lead it to judge the subject of the website to be different from its real subject. Therefore, if the length of the string in the title tag is equal to or greater than a predetermined value, or if there is more than one title tag, the website panders to popular keywords and can appear on the search results screen even though its content is unrelated to those keywords.
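The two title-tag conditions (more than one title tag, or a title at least as long as a predetermined value) can be sketched as below. The 100-character default is a hypothetical stand-in for the "predetermined value", which the text does not specify, and the names are illustrative.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collects the text of every <title> element in a document."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles[-1] += data


def title_is_suspicious(html, max_chars=100):
    """True if there are several title tags or one title is too long.

    max_chars is an assumed threshold, not a value given in the text.
    """
    collector = TitleCollector()
    collector.feed(html)
    collector.close()
    titles = collector.titles
    return len(titles) > 1 or any(len(t) >= max_chars for t in titles)
```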

(4) Deceptive site using strings included in a meta tag

FIG. 4D illustrates an example of a deceptive site that uses strings included in a meta tag. The left side of the figure shows the website as displayed to the user, and the right side shows the HTML source file of that website.

As those skilled in the art will appreciate, the meta tag is used to express general information about an HTML document that is not displayed in the body of the web browser, such as the document's author, creation date, and keywords. In the right-hand source file of FIG. 4D, the name attribute of the meta tag is description and its content attribute contains many popular keywords such as "StarCraft" and "Zolaman". In this case the popular keywords in the meta tag are not displayed in the web browser, whereas the search robot determines its results from how often strings occur in the website, so it may judge the subject of the website to be different from its real subject. Therefore, if a meta tag contains a string as described above and the length of that string is equal to or greater than a predetermined value, the website panders to popular keywords and can appear on the search results screen even though its content is unrelated to those keywords.
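As a rough sketch of the meta-tag check, the snippet below collects the content of each named meta tag and reports those at or above a length threshold. The 200-character default and the function name are assumptions made for illustration only.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Maps each meta name (lower-cased) to its content string."""

    def __init__(self):
        super().__init__()
        self.contents = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            name = (attrs.get("name") or "").lower()
            if name:
                self.contents[name] = attrs.get("content") or ""


def long_meta_fields(html, max_chars=200):
    """Return the names of meta tags whose content is at least max_chars long."""
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return sorted(name for name, content in collector.contents.items()
                  if len(content) >= max_chars)
```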

(5) Deceptive site using strings located in a frame tag

FIG. 4E illustrates an example of a deceptive site that uses strings located in a frame tag. The left side of the figure shows the website as displayed to the user, and the right side shows the HTML source file of that website. As those skilled in the art will appreciate, frame tags are used to divide the screen displayed in the web browser into two or more parts. In the right-hand source file of FIG. 4E, the frame tag <FRAMESET ROWS=" "> divides the screen horizontally, and the allocation information for the division is placed inside the quotation marks. The string placed after the closing </FRAMESET> tag contains many popular keywords such as "StarCraft" and "Zolaman". In this case the popular keywords placed after the closing frame tag have nothing to do with dividing the browser screen, whereas the search robot determines its results from how often strings occur in the website, so it may judge the subject of the website to be different from its original subject. Therefore, if a string is located at the frame tag as described above and its length is equal to or greater than a predetermined value, the website panders to popular keywords and can appear on the search results screen even though its content is unrelated to those keywords.

(6) Deceptive site using strings included in a form tag

FIG. 4F illustrates an example of a deceptive site that uses strings included in a form tag. The left side of the figure shows the website as displayed to the user, and the right side shows the HTML source file of that website. As those skilled in the art will appreciate, the form tag is used to define a form that is output in the web browser. As the right-hand source file of FIG. 4F shows, a form tag can take the form <form><input type="button type" value="displayed text"></form>. In the source file on the right side of FIG. 4F the button type is "hidden", so no text can be shown on the button, and the "displayed text", which is in fact never displayed in the web browser, contains many popular keywords such as "StarCraft" and "Zolaman". In this case the popular keywords in the form tag have nothing to do with defining a form for the web browser, whereas the search robot determines its results from how often strings occur in the website, so it may judge the subject of the website to be different from its real subject. Therefore, if the length of the string in the form tag is equal to or greater than a predetermined value as described above, the website panders to popular keywords and can appear on the search results screen even though its content is unrelated to those keywords.
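The hidden-field case of FIG. 4F could be measured roughly as follows. The function name is hypothetical, and the total is left for the caller to compare against a threshold of its own choosing, since the text does not give one.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collects the value attribute of every <input type="hidden">."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attrs = dict(attrs)
            if (attrs.get("type") or "").lower() == "hidden":
                self.values.append(attrs.get("value") or "")


def hidden_input_length(html):
    """Total number of characters stored in hidden input values."""
    collector = HiddenInputCollector()
    collector.feed(html)
    collector.close()
    return sum(len(value) for value in collector.values)
```

Legitimate pages also use hidden fields (for example for session tokens), which is one reason a length threshold rather than mere presence is the natural test here.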

(7) Deceptive site using strings included in a div tag

FIG. 4G illustrates an example of a deceptive site that uses strings included in a div tag. The left side of the figure shows the website as displayed to the user, and the right side shows the HTML source file of that website. As those skilled in the art will appreciate, the div tag is normally used together with style sheets through its ID and class attributes. In the right-hand source file of FIG. 4G, the div tag is written as <div style="display:none; …">. Because the style of the output is set to display nothing (display: none), the strings that follow are not shown in the web browser. In this case the popular keywords in the div tag have nothing to do with what the browser displays, whereas the search robot determines its results from how often strings occur in the website, so it may judge the subject of the website to be different from its real subject. Therefore, if the length of the string in the div tag is equal to or greater than a predetermined value as described above, the website panders to popular keywords and can appear on the search results screen even though its content is unrelated to those keywords.
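One way to measure the text hidden by display:none is sketched below. It is an assumption-laden illustration: only inline style attributes on <div> are inspected (external style sheets are ignored), and the names are hypothetical.

```python
import re
from html.parser import HTMLParser

DISPLAY_NONE = re.compile(r"display\s*:\s*none", re.IGNORECASE)


class HiddenDivText(HTMLParser):
    """Counts characters of text nested inside any display:none <div>."""

    def __init__(self):
        super().__init__()
        self.div_stack = []    # one flag per open <div>: is it display:none?
        self.hidden_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            style = dict(attrs).get("style") or ""
            self.div_stack.append(bool(DISPLAY_NONE.search(style)))

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()

    def handle_data(self, data):
        if any(self.div_stack):
            self.hidden_chars += len(data.strip())


def hidden_div_text_length(html):
    """Number of characters inside <div> elements styled display:none."""
    parser = HiddenDivText()
    parser.feed(html)
    parser.close()
    return parser.hidden_chars
```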

(8) Deceptive site using strings included in an a href tag

FIG. 4H illustrates an example of a deceptive site that uses strings included in an a href tag. The left side of the figure shows the website as displayed to the user, and the right side shows the HTML source file of that website. As those skilled in the art will appreciate, the a href tag attaches a destination address to particular text or an image in a document so that the user can move easily within the document, to another document, or to another website. As the right-hand source file of FIG. 4H shows, an a href tag can take the form <a href="#location or address to move to"> link target </a>. In the source file on the right side of FIG. 4H, neither the destination nor the link target is specified, so this a href tag is neither displayed nor executed in the web browser. The string contained in this non-functional tag includes many popular keywords such as "StarCraft" and "Zolaman". In this case the popular keywords in the a href tag have nothing to do with what the browser displays or with following a link, whereas the search robot determines its results from how often strings occur in the website, so it may judge the subject of the website to be different from its real subject. Therefore, if the length of the string in the a href tag is equal to or greater than a predetermined value as described above, there is a risk that the website, pandering to popular keywords, will appear on the search results screen even though its content is unrelated to those keywords.

(9) Deceptive site using a link farm

FIG. 4I illustrates an example of a deceptive site that uses a href tags, here in the form of a link farm. As those skilled in the art will appreciate, a link farm links the web pages within a website to one another so that the search robot keeps crawling the website, and it is often used as a means of ultimately raising the ranking of the web pages. Such a link farm can be implemented using the a href tag described above.

A website that uses a link farm is difficult to judge outright as a deceptive site. However, if it uses an excessive number of such links, above a certain count, so that the search engine keeps crawling the popular keywords in its web pages, it is likely to be a deceptive site and therefore needs to be detected.
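Counting the links that point back into the same website is one plausible input for this check. The sketch below treats a link as internal when it resolves to the same host as the page itself; that definition, like the function name, is an assumption rather than something the text specifies.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def count_internal_links(html, page_url):
    """Count links that resolve to the same host as page_url."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    host = urlparse(page_url).netloc
    return sum(1 for href in collector.hrefs
               if urlparse(urljoin(page_url, href)).netloc == host)
```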

(10) Deceptive site using strings included in a font tag

FIG. 4J illustrates an example of a deceptive site that uses strings included in a font tag. The left side of the figure shows the website as displayed to the user, and the right side shows the HTML source file of that website.

As those skilled in the art will appreciate, the font tag is used to specify properties such as the size of the text displayed in the web browser, and in the source file shown in FIG. 4J the font size is set to 0. The strings contained in this font tag are therefore not displayed in the web browser at all. When strings that are not displayed because their font size is 0 contain many popular keywords such as "StarCraft" and "Zolaman", those keywords have nothing to do with what the browser displays, whereas the search robot determines its results from how often strings occur in the website, so it may judge the subject of the website to be different from its real subject. Therefore, if the size of the string in the font tag is 0 as described above, the website panders to popular keywords and can appear on the search results screen even though its content is unrelated to those keywords.
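A sketch of measuring zero-size text follows. It reports the amount of hidden text in bytes; counting UTF-8 bytes is an assumption of this sketch, since the text does not say which encoding the byte count refers to, and the names are hypothetical.

```python
from html.parser import HTMLParser


class ZeroSizeText(HTMLParser):
    """Counts bytes of text nested inside any <font size="0">."""

    def __init__(self):
        super().__init__()
        self.font_stack = []   # one flag per open <font>: is size 0?
        self.hidden_bytes = 0

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            size = (dict(attrs).get("size") or "").strip()
            self.font_stack.append(size == "0")

    def handle_endtag(self, tag):
        if tag == "font" and self.font_stack:
            self.font_stack.pop()

    def handle_data(self, data):
        if any(self.font_stack):
            # UTF-8 is assumed here; other encodings give other byte counts.
            self.hidden_bytes += len(data.strip().encode("utf-8"))


def zero_size_text_bytes(html):
    """Number of UTF-8 bytes of text rendered with font size 0."""
    parser = ZeroSizeText()
    parser.feed(html)
    parser.close()
    return parser.hidden_bytes
```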

(11) Deceptive site using strings included in an image tag

FIG. 4K illustrates an example of a deceptive site that uses strings included in an img tag. The left side of the figure shows the website as displayed to the user, and the right side shows the HTML source file of that website.

As those skilled in the art will appreciate, the img tag is used to insert an image into a document, and in the source file shown in FIG. 4K the image file to be inserted is specified as "a.gif". In an img tag, the image to be inserted is specified first, followed by attributes such as the position and alignment of the image; in FIG. 4K these attributes are given as strings of text. When the image is displayed in the web browser, attributes given as such strings have no effect on how it is displayed. When strings that do not affect the image's attributes contain many popular keywords such as "StarCraft" and "Zolaman", those keywords have nothing to do with what the browser displays, whereas the search robot determines its results from how often strings occur in the website, so it may judge the subject of the website to be different from its real subject. Therefore, if the length of the string in the img tag is equal to or greater than a predetermined value as described above, the website panders to popular keywords and can appear on the search results screen even though its content is unrelated to those keywords.

Step 320 analyzes the tags and other elements in the HTML document, as in the embodiments above, and measures quantities such as the length of the strings they contain. Based on these measurements, step 325 determines, according to predetermined criteria, whether the website is a deceptive site.

Examples of the predetermined criteria used in step 325 to determine whether the analyzed website is a deceptive site are those described above with reference to FIGS. 4A to 4K. For example, the criterion may be whether the HTML document contains a string in the same color as the background of the web page, or whether a redirection tag in the HTML document is accompanied by a string.

According to a preferred embodiment of the present invention, the criterion of step 325 may apply the analyses described for deceptive-site types (1) to (11) above in a hybrid manner, judging a website to be a deceptive site when a combined score exceeds a predetermined threshold. For example, once the title tag contains two or more title strings, 10 points are added per string, up to a maximum of 70 points. If a redirection page contains strings, 70 points are added regardless of how many there are. For a link farm, 4 points are added per 50 links, up to a maximum of 80 points. If there are strings of size 0, 5 points are added per 100 bytes of string length, up to a maximum of 70 points. The source files that make up the web page are analyzed in this way, and if the total, calculated from the points for each criterion and their weights, exceeds 100 points, the website can be judged to be a deceptive site. A judgment based on a single type (for example, judging a site deceptive because its title tag contains 50 strings) is likely to be wrong, so it may be preferable to decide whether a site is deceptive through a comprehensive judgment of this kind.
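The point scheme in the example above can be written out as a short calculation. This is a sketch under stated assumptions: the function and parameter names are hypothetical, all weights are taken as 1, and the title rule is read as "10 points for each title string beyond the first", which is one possible reading of the text.

```python
def deception_score(title_count, redirect_has_text, link_count, zero_size_bytes):
    """Combine the four example criteria into one score.

    Assumed reading of the title rule: each title string beyond the
    first adds 10 points. All weights are assumed to be 1.
    """
    title_points = min(70, 10 * max(0, title_count - 1))
    redirect_points = 70 if redirect_has_text else 0
    link_points = min(80, 4 * (link_count // 50))
    zero_size_points = min(70, 5 * (zero_size_bytes // 100))
    return title_points + redirect_points + link_points + zero_size_points


def is_deceptive(score, threshold=100):
    """A site is judged deceptive when the total exceeds the threshold."""
    return score > threshold


# Three title strings, text on a redirect page, 120 links, 250 hidden bytes:
# 20 + 70 + 8 + 10 = 108 points, which exceeds 100.
example = deception_score(3, True, 120, 250)
print(example, is_deceptive(example))
```

Because each criterion is capped below the threshold, no single criterion can push a site over 100 points on its own; a page with 50 title strings and nothing else scores only 70.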

In addition, according to a preferred embodiment of the present invention, the predetermined criteria may be applied differently to websites registered through web crawling by the search robot and to websites registered through a directory listing. For example, if a website registered in the former manner is judged to be a deceptive site when analysis of its source files matches three of the eleven types described above, a website registered in the latter manner may be judged to be a deceptive site even when it matches only one type. The stated reason is that, for the latter form of registration, most search engine operators receive a registration fee from the website's registrant, so some degree of favor needs to be shown compared with the former case, in which registration is free.

If the website is judged to be a deceptive site in step 325, the website registrant field of the database is searched to obtain the registrant information for the website (step 330). The registrant's contact information is extracted from that registrant information (step 335), and a predetermined warning action, such as sending an e-mail or a short text message to the registrant, is carried out using the extracted contact information (step 340). These warning actions are described in detail below in connection with FIG. 5.

According to yet another embodiment of the present invention, what is analyzed in step 320 may be an image referenced by a tag. For example, the pixels of the image are analyzed to extract their RGB components, and if the pixels of a particular color (such as yellow) reach a predetermined criterion (for example, 50% or more of the total number of pixels), it can be presumed prima facie that the site is publishing pornographic content, and whether it is a deceptive site can be judged on that basis.
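The pixel-ratio test can be illustrated on already-decoded pixel data. Everything specific here is an assumption: the RGB bounds that define "yellowish" are invented for the example, decoding the image file into (R, G, B) tuples is left to the caller, and the function names are hypothetical.

```python
def is_yellowish(pixel):
    """Hypothetical color test; the RGB bounds are illustrative only."""
    r, g, b = pixel
    return r >= 180 and g >= 150 and b <= 120


def color_ratio(pixels, predicate=is_yellowish):
    """Fraction of pixels for which predicate is true (0.0 for no pixels)."""
    pixels = list(pixels)
    if not pixels:
        return 0.0
    return sum(1 for p in pixels if predicate(p)) / len(pixels)


def exceeds_color_criterion(pixels, min_ratio=0.5):
    """True when at least min_ratio of the pixels match the color test."""
    return color_ratio(pixels) >= min_ratio
```

A single color ratio is a coarse signal, so in practice it would more plausibly feed into a combined judgment than decide the outcome alone.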

FIG. 5 is a flowchart illustrating, in the method for managing website registration in a search engine according to a preferred embodiment of the present invention, how predetermined sanctions are applied to the registrant of a website that has been determined to be a deceptive or altered site.

FIG. 5 shows the automatic sanctions applied when a website is judged to be a deceptive site in step 325 of FIG. 3 described above. When a website is judged to be a deceptive site, the website management module searches the website information database to obtain the information of the website's registrant (step 510), and the website management module receives that registrant information (steps 520 and 550). According to one embodiment of the present invention, the website management module extracts contact information, such as the registrant's e-mail address or mobile terminal number, from the registrant information (step 530) and controls a mail server or SMS server so that a predetermined message is sent to that contact information (step 540).

According to yet another embodiment of the present invention, the website management module extracts, from the registrant information, information on the other websites registered by the registrant (step 560), and controls so that an analysis of the other websites registered under the same registrant's name (step 570) is performed automatically. This is because websites registered under the same registrant's name are likely to be operated as deceptive sites in the same or a similar manner. In this embodiment, if the analysis determines that one of the other websites is a deceptive site, the process may be repeated from step 510 of FIG. 5.
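The flow of steps 510 through 570 can be sketched roughly as below. This is a sketch under stated assumptions: the `Registrant` record, the `registrant_of` mapping standing in for the website information database, and the `is_deceptive` and `send_message` callables standing in for the analysis module and the mail/SMS server are all hypothetical names introduced for illustration. The `analyzed` set is an added safeguard so that repeating step 510 for a newly flagged site does not re-analyze the same sites.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Registrant:
    """Hypothetical registrant record held in the website information database."""
    name: str
    email: Optional[str] = None
    mobile: Optional[str] = None
    sites: List[str] = field(default_factory=list)


def handle_deceptive_site(
    site: str,
    registrant_of: Dict[str, Registrant],
    is_deceptive: Callable[[str], bool],
    send_message: Callable[[str, str], None],
) -> List[str]:
    """Notify the registrant and analyze the registrant's other sites.

    Returns every site that ended up flagged, starting with `site`.
    """
    flagged: List[str] = []
    analyzed = {site}
    queue = [site]
    while queue:
        current = queue.pop(0)
        flagged.append(current)
        registrant = registrant_of[current]            # steps 510-520: look up registrant
        for contact in (registrant.email, registrant.mobile):
            if contact:                                # step 530: extract contact info
                send_message(                          # step 540: send the notice
                    contact,
                    f"The site {current} was judged deceptive; please correct it.",
                )
        for other in registrant.sites:                 # step 560: other registered sites
            if other not in analyzed:
                analyzed.add(other)
                if is_deceptive(other):                # step 570: analyze each one
                    queue.append(other)                # repeat from step 510
    return flagged
```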

According to a preferred embodiment of the present invention, when a given website is determined to be a deceptive site by the analysis and determination method described above, the system may automatically send an e-mail, a short text message, or the like to the registrant of the website, pointing out the problem with the website and requesting correction within a certain grace period. The analysis and determination process may also be set to run again automatically after the grace period has elapsed, and if the problem still has not been corrected at that point, a sanction such as cancellation of the registration may be imposed. In addition, according to another embodiment of the present invention, the registrant of the website may be sanctioned by, for example, being subjected to a stricter registration procedure when later attempting to register another web page.
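As a rough illustration of the grace-period handling, the follow-up decision might look like the sketch below. The function name, the action labels, and the 14-day default are assumptions made for the example; the description does not specify a grace period length. The re-analysis is passed in as a callable so that it runs only once the grace period has elapsed.

```python
from datetime import date, timedelta
from typing import Callable


def follow_up_action(
    notified_on: date,
    today: date,
    still_deceptive: Callable[[], bool],
    grace_days: int = 14,  # hypothetical grace period length
) -> str:
    """Decide what to do with a site whose registrant was already notified."""
    if today < notified_on + timedelta(days=grace_days):
        return "wait"  # grace period still running; no re-analysis yet
    # Grace period elapsed: re-run the analysis and sanction if not corrected.
    return "cancel_registration" if still_deceptive() else "keep_registration"
```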

Embodiments of the present invention include a computer-readable medium containing program instructions for performing various computer-implemented operations. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The medium or the program instructions may be specially designed and constructed for the present invention, or may be of a kind known and available to those skilled in the computer software arts. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. The medium may also be a transmission medium, such as an optical or metal wire or a waveguide, that includes a carrier wave transmitting a signal specifying program instructions, data structures, and the like. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

FIG. 6 is an internal block diagram of a general-purpose computer system that may be used to manage registered web pages in a search engine according to the present invention.

The computer system includes one or more processors 640 connected to a main memory that includes random access memory (RAM) 660 and read only memory (ROM) 670. The processor 640 is also referred to as a central processing unit (CPU). As is well known in the art, the ROM 670 transfers data and instructions to the CPU unidirectionally, while the RAM 660 is typically used to transfer data and instructions bidirectionally. The RAM 660 and the ROM 670 may include any suitable form of computer-readable medium. A mass storage device 610 is bidirectionally connected to the processor 640 to provide additional data storage capacity, and may be any of the computer-readable recording media described above. The mass storage device 610 is used to store programs, data, and the like, and is typically a secondary storage device, such as a hard disk, that is slower than the main memory. A specific mass storage device such as a CD-ROM 620 may also be used. The processor 640 is connected to one or more input/output interfaces 630, such as a video monitor, trackball, mouse, keyboard, microphone, touchscreen display, card reader, magnetic or paper tape reader, voice or handwriting recognizer, joystick, or other known computer input/output device. Finally, the processor 640 may be connected to a wired or wireless communication network through a network interface 650. The procedures of the method described above can be performed through this network connection. The devices and tools described above are well known to those skilled in the computer hardware and software arts.

The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the present invention.

According to the method for managing the registration of websites in a search engine according to the present invention, the deceptive sites described above can be detected automatically using a predetermined algorithm, which yields the technical effect of providing a search engine that lets its users accurately find the information they are looking for.

In addition, according to the method for managing the registration of websites in a search engine according to the present invention, deceptive sites are detected automatically and sanctions are imposed on the operators of the detected deceptive sites, which yields the technical effect of strengthening the self-regulation efforts of the websites registered in the search engine.

In addition, according to the method for managing the registration of websites in a search engine according to the present invention, the detection of deceptive sites and sanctions such as warnings to the detected sites are performed automatically by a predetermined algorithm, which yields the technical effect of saving the substantial human resources previously required to detect such deceptive sites.

Although the present invention has been described above with reference to limited embodiments and drawings, the present invention is not limited to those embodiments, and it is evident that those of ordinary skill in the art to which the present invention pertains can make various modifications and variations from this description. Accordingly, the spirit of the present invention should be determined only by the claims set forth below, and all equal or equivalent modifications thereof fall within the scope of the spirit of the present invention.

Claims

A computer-readable recording medium having recorded thereon a program for executing a method for managing a website registered in a search engine, the method comprising:

receiving information on the registered website, and classifying and recording the website information in a database by predetermined fields;

reading a source file constituting a web page of the registered website;

analyzing the read source file;

determining whether the website is a deceptive site according to predetermined criteria; and

if it is determined that the website is a deceptive site, controlling so that a predetermined process is performed on the registered website,

wherein the determining of whether the website is a deceptive site comprises:

maintaining a predetermined weight for each of the criteria;

calculating a point for each of the criteria according to a point calculation method defined for that criterion;

calculating an intermediate value for each of the criteria by multiplying the point calculated for the criterion by the weight corresponding to that criterion;

summing the intermediate values calculated for the criteria; and

determining whether the summed value is greater than or equal to a predetermined value, and determining the website to be a deceptive site if the summed value is greater than or equal to the predetermined value.
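
The weighted determination recited in the claim can be illustrated with the short sketch below. It is an illustration only: the criterion names, weights, points, and threshold are hypothetical values chosen for the example, and the claim does not prescribe any particular criteria or scales. Each per-criterion product corresponds to the claim's intermediate value.

```python
from typing import Dict


def deception_score(points: Dict[str, int], weights: Dict[str, int]) -> int:
    """Sum the per-criterion intermediate values (point x weight)."""
    intermediate = {name: points[name] * weights[name] for name in weights}
    return sum(intermediate.values())


def is_deceptive_site(points: Dict[str, int], weights: Dict[str, int], threshold: int) -> bool:
    """Flag the site when the summed value is greater than or equal to the threshold."""
    return deception_score(points, weights) >= threshold


# Hypothetical criteria, weights, and points for a single website.
example_weights = {"keyword_mismatch": 5, "image_color": 3, "redirect": 2}
example_points = {"keyword_mismatch": 8, "image_color": 6, "redirect": 0}
```

With these example values the score is 8 x 5 + 6 x 3 + 0 x 2 = 58, so the site would be flagged at a threshold of 50 but not at a threshold of 60.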