KR20080052097A

KR20080052097A - Harmful web site filtering method and apparatus using web structural information

Info

Publication number: KR20080052097A
Application number: KR1020060124148A
Authority: KR
Inventors: 이승민; 이호균; 남택용
Original assignee: 한국전자통신연구원
Priority date: 2006-12-07
Filing date: 2006-12-07
Publication date: 2008-06-11
Also published as: KR100848319B1

Abstract

A method and an apparatus for filtering harmful sites using web structure information are provided to filter harmful sites, such as obscene sites, more effectively and reliably and to prevent children and juveniles from accessing them. The apparatus includes a web document receiver (720), a basic filter (730), and a web structure information analyzer (740). The web document receiver receives web documents. The basic filter first filters the text and images of the received web documents and sends the filtered documents to the web structure information analyzer. The web structure information analyzer includes a learning model comparator (750) and a harmful URL database comparator (760), and determines whether a web document is harmful by a learning model comparison or a link URL comparison. If the web document is determined to be harmless, the analyzer has a web document transmitter (770) transmit it to a display unit (780). The harmful URL database comparator decides whether a web document is filtered based on the link URLs in the document. The learning model comparator (750) includes a web structure information extractor (752), a file creator (754), a learning model creator (756), and a file comparator (758). The web structure information extractor extracts the web structure information of a document passed from the web document receiver or the basic filter (730), and the file creator collects the structure information extracted from harmful and harmless sites into a single file.

Description

Harmful web site filtering method and apparatus using web structural information

FIG. 1 is a diagram showing a method of blocking harmful sites according to the existing technology.

FIG. 2 is a diagram showing a method of blocking harmful sites using web structure information according to the present invention, in addition to the existing technology.

FIG. 3 is a diagram showing how a filtering module according to the existing technology and a web-structure-information-based filtering module according to the present invention interact for inbound and outbound packets.

FIG. 4 shows an embodiment that uses the URL filtering of the existing technology together with a harmful URL database as the filtering based on web structure information according to the present invention.

FIG. 5 shows an embodiment that uses the URL filtering of the existing technology together with both a harmful URL database lookup and a learning model comparison as the filtering based on web structure information according to the present invention.

FIG. 6 shows the process of generating a learning model according to the present invention and the process of determining harmful sites using the generated learning model.

FIG. 7 shows an apparatus for blocking harmful sites using web structure information according to the present invention.

The spread of the Internet, a network that connects the whole world, has brought rapid change to modern life. We have moved beyond an era in which people struggled to obtain information; what matters now is how to select useful information from the flood of information held in easily accessible repositories such as the Internet. It is undeniable that sharing information over the Internet has contributed remarkably to technological progress and the sharing of knowledge, but it is equally true that this easy accessibility and the rapid spread of information bring many side effects.

Major side effects of the Internet include the leakage of personal information caused by inadequate security, and the flood of pornography that results from the combination of open access for anyone and the ability to use the Internet commercially. The distribution of pornography over the Internet has become a serious social problem, to the point that some statistics indicate that pornography is the most commercially successful business on the Internet.

People of every age who can use a personal computer without difficulty, from young children to adults, are indiscriminately exposed to such harmful sites, and many hardware devices and software programs have been developed to keep harmful sites from reaching computer terminals at home or at work. However, as technologies for blocking harmful sites have advanced, the operators who distribute such sites have, for their own economic gain, developed and used methods that cleverly defeat these technologies.

The present invention relates to a method and an apparatus for blocking a site accessed over a public network such as the Internet from a network-capable terminal such as a computer, when that site is a harmful site such as an obscene site.

Conventionally, harmful sites have been blocked by judging harmfulness from the URL information entered by the user in the case of outbound packets leaving the network user, or from the text or image information of the web document content in the case of inbound packets received by the network user.

More specifically, when a network terminal user enters a URL to access a site, the URL is analyzed to judge whether it is related to a harmful site, and this decides whether the site is loaded. In the simplest case, if the URL contains words that appear mainly on harmful sites, for example sex or hardcore, the request is filtered and the website is prevented from loading.

If this initial filtering fails, a preliminary check is performed before the requested web document is output on the screen. That is, the blocking software analyzes the text or image information of the web document in advance. If the text contains harmful expressions, or if analysis of the image information (the color information displayed on the screen) shows that overly suggestive colors such as pink or red occupy a large portion, the site is judged harmful and blocked.

The conventional approaches, blocking by URL information for outbound packets and blocking by text and/or image information for inbound packets, have run into several limitations. With URL-based blocking, a harmful site that is not in the already-built URL database can easily evade the block, and the block can also be defeated by automatically changing the URL.

Analyzing the content of inbound web documents is prone to over-blocking and mis-blocking. More specifically, in order to evade blocking based on inbound web document analysis, web documents are now often composed by rendering text as images or by using various techniques that make image processing difficult.

To solve these technical problems, an object of the present invention is to provide a more effective and more accurate method and apparatus for blocking harmful sites. The invention uses, as structure information of a web document, the link URLs in the HTML source, the number and attributes of pop-up windows, title tags, fonts, frame information, and the like, either alone or together with the prior art, thereby compensating for the limitations of the prior art described above.

More specifically, the present invention provides a harmful site blocking method that includes generating a learning model from the web structure information of a plurality of harmful sites and harmless sites, selectively performing URL filtering, text filtering, and/or image filtering on an inbound web document, comparing the selectively filtered web document with the generated learning model, and deciding whether to block based on the result of the comparison.

To achieve the above technical object, the present invention provides a harmful site blocking method using web structure information, comprising: generating a learning model from the web structure information of a plurality of harmful sites and harmless sites; comparing an inbound web document with the generated learning model; and deciding whether to block according to the result of the comparison with the learning model.

Preferably, the harmful site blocking method using web structure information further comprises, after generating the learning model, performing URL filtering, text filtering, and/or image filtering on the inbound web document.

Preferably, in the harmful site blocking method using web structure information, generating the learning model comprises: extracting, from the web structure information of the plurality of harmful sites and harmless sites, values for the number of pop-up windows, the background color, the number of frames, the frame color, the number of images, and/or the number of text images; writing the extracted values into a single file; and generating a learning model from the generated file using a pattern classification technique.
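The extraction and file-creation steps can be sketched as follows. This is only an illustration under stated assumptions, not the implementation described in the specification: the class and function names are hypothetical, pop-up windows are approximated by counting `window.open(` calls in inline scripts, the background color is read from the `bgcolor` attribute of `<body>`, and the frame color and text-image count are left out because the text does not say how they are measured.

```python
import csv
from html.parser import HTMLParser


class StructureExtractor(HTMLParser):
    """Counts a few structural features of one HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.popups = 0     # assumption: pop-ups ~ window.open( calls in inline scripts
        self.frames = 0     # <frame> and <iframe> elements
        self.images = 0     # <img> elements
        self.bgcolor = ""   # assumption: background color ~ bgcolor attribute of <body>
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag in ("frame", "iframe"):
            self.frames += 1
        elif tag == "img":
            self.images += 1
        elif tag == "body":
            self.bgcolor = (dict(attrs).get("bgcolor") or "").lower()
        elif tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.popups += data.count("window.open(")


def extract_features(html):
    """Return [popups, frames, images, bgcolor] for one document."""
    parser = StructureExtractor()
    parser.feed(html)
    parser.close()
    return [parser.popups, parser.frames, parser.images, parser.bgcolor]


def write_feature_file(labeled_pages, out):
    """Write one CSV row per (label, html) pair into the file-like object `out`."""
    writer = csv.writer(out)
    writer.writerow(["label", "popups", "frames", "images", "bgcolor"])
    for label, html in labeled_pages:
        writer.writerow([label] + extract_features(html))
```

A single file with one labeled row per site is one plausible reading of "writing the extracted values into a single file"; the CSV layout itself is an assumption.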

More preferably, the pattern classification technique uses an SVM (Support Vector Machine), clustering, an SOM (Self-Organizing Map), or the like.
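To make the idea of a learning model built from structural feature values concrete, the sketch below uses a nearest-centroid classifier. This is a deliberately simple stand-in chosen so the example needs no external library; it is not an SVM or an SOM, and the specification does not prescribe it. The function names are hypothetical, and the feature vectors are assumed to be purely numeric (for example pop-up count, frame count, image count).

```python
def train_centroids(rows):
    """rows: iterable of (label, [numeric features]). Returns {label: centroid}."""
    sums, counts = {}, {}
    for label, features in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def classify(model, features):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(model, key=lambda label: sq_dist(model[label]))
```

In practice the features would likely need scaling so that a large image count does not dominate the distance, and an SVM from an established library would be the closer match to the technique named in the text.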

Also for the purposes of the present invention, a harmful site blocking method using web structure information when accessing a website comprises: building a harmful URL database containing known harmful websites and links; comparing the link URLs in the filtered web document with the URL data in the harmful URL database; and deciding whether to block the web document according to the result of the comparison.

Preferably, the harmful site blocking method further comprises, after building the harmful URL database, performing URL filtering, text filtering, and image filtering on the inbound web document.

Preferably, the method further comprises deciding whether to block by comparing the web document that has undergone URL filtering, text filtering, and image filtering with a generated learning model, wherein building the harmful URL database further includes generating the learning model from the web structure information of a plurality of harmful sites and harmless sites.

Preferably, generating the learning model comprises: extracting, from the web structure information of the plurality of harmful sites and harmless sites, values for the number of pop-up windows, the background color, the number of frames, the frame color, the number of images, and/or the number of text images; writing the extracted values into a single file; and generating a learning model from the generated file using a pattern classification technique.

More preferably, the pattern classification technique uses an SVM (Support Vector Machine), clustering, an SOM (Self-Organizing Map), or the like.

Also for the purposes of the present invention, a harmful site blocking apparatus using web structure information comprises: a web document receiver that receives a web document transmitted from outside; a web structure information analyzer that analyzes the structure information of the web document transmitted from outside; and a web document transmitter that transmits a web document determined by the web structure information analyzer to be from a harmless site to a monitoring device on which the user can view it.

Preferably, the apparatus further comprises a basic filter that performs URL filtering, text filtering, and image filtering on the received web document.

Preferably, the web structure information analyzer comprises a learning model comparator and/or a harmful URL database comparator.

More preferably, the learning model comparator comprises: a web structure information extractor that extracts, from a web document, values for the number of pop-up windows, the background color, the number of frames, the frame color, the number of images, and/or the number of text images; a file creator that writes the extracted web structure information into a single file; a learning model creator that generates a learning model from the generated file using a pattern classification technique; and a file comparator that compares the learning model with the file generated by the file creator.

Hereinafter, the method and apparatus for blocking harmful sites using web structure information according to the present invention are described in more detail with reference to the accompanying drawings. In describing the present invention, detailed descriptions of related known technologies or configurations are omitted where they could unnecessarily obscure the gist of the invention. The terms used below are defined in consideration of their functions in the present invention and may vary according to the intention or custom of clients, operators, or users. Their definitions should therefore be based on the content of this specification as a whole.

FIG. 1 shows a method of blocking harmful sites according to the existing technology. When a particular website is accessed, various web documents (110) arrive at the network terminal.

The existing method analyzes these web documents and filters them accordingly. The filtered web documents are classified into harmful web documents (130) and harmless web documents (140). The filtering block (120) has various filtering units, typically URL filtering (122), text filtering (124), and/or image filtering (126).

When a website address is entered at the network terminal, the URL filtering unit compares the address with the URLs in a database of known harmful site URLs. If the address contains words identical or similar to the harmful site URL information, the request is filtered so that the website cannot be accessed. Strictly speaking, then, filtering is performed not on the web documents (110) arriving at the network terminal but on the URL information entered by the user.

A particular website is normally identified by a series of numbers, for example 244.128.72.2, but such combinations of numbers are not intuitive to ordinary people and are hard to remember. Websites therefore use addresses that the public can easily recognize: addresses that are easier for ordinary users to recognize and remember, that can serve website operators as a kind of trademark or trade name, or that match the operator's well-known business name if the operator is a business. Using addresses such as www.carsale.co.kr or www.auction.com, which let ordinary users anticipate a site's function or content, benefits both the website operator and the user seeking information on the web. Such a website address is converted into the numeric address mentioned above, and the website content is fetched from the server at that address.

In more detail, a URL (uniform resource locator) usually takes a form such as http://www.msn.com/index.html or ftp://ftp.tb.ac.kr/pub/public.zip. Here http and ftp denote the protocol used for the connection, and the parts that follow, www.msn.com and ftp.tb.ac.kr, denote the address of the server that holds the data the user is looking for. The index.html or pub/public.zip that comes next is the path at which the file is located.
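The decomposition described above can be reproduced with Python's standard `urllib.parse.urlsplit`. The helper name `split_url` is hypothetical and exists only for this illustration.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Return (protocol, server address, file path) for a URL."""
    parts = urlsplit(url)
    return parts.scheme, parts.hostname, parts.path


for url in ("http://www.msn.com/index.html",
            "ftp://ftp.tb.ac.kr/pub/public.zip"):
    print(split_url(url))
# ('http', 'www.msn.com', '/index.html')
# ('ftp', 'ftp.tb.ac.kr', '/pub/public.zip')
```

A URL filter would typically inspect the server address, and possibly the path, rather than the protocol.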

In this case, the URL filtering unit stores in advance, in a database (not shown), words that are likely to indicate an obscene site. When the URL entered by the network terminal user contains such a word, the unit compares the user-entered URL with the stored information to decide whether to block.

Unlike URL filtering, text filtering applies when the outbound packet has not been filtered by URL filtering and the requested web documents are about to be displayed on the network terminal according to the URL information. It filters the text of the web documents in these inbound packets. That is, text filtering analyzes the text in a web document and, when the text uses words that commonly appear on harmful sites, decides whether to block by comparing it with the words in a text database (not shown), according to the frequency and suggestiveness of those words.
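A frequency-based text filter of this kind might look like the sketch below. The function names and the 0.2 threshold are assumptions made for illustration; the specification gives neither a formula nor a threshold, and the weighting by suggestiveness is not modeled here. The example word list reuses the words mentioned earlier in the text (sex, hardcore).

```python
import re


def harmful_word_ratio(text, harmful_words):
    """Fraction of the words in `text` that appear in `harmful_words`."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in harmful_words)
    return hits / len(words)


def text_filter_blocks(text, harmful_words, threshold=0.2):
    """Block when harmful words make up at least `threshold` of the text (assumed rule)."""
    return harmful_word_ratio(text, harmful_words) >= threshold
```

Matching whole words rather than substrings keeps a page that mentions "Sussex" from being counted as containing "sex", which is one source of the over-blocking discussed in this document.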

Like text filtering, image filtering handles web documents in inbound packets. It uses the image information (mainly color information) in a web document to judge whether the document is obscene and filters accordingly.

However, the URL filtering, text filtering, and image filtering techniques shown in FIG. 1 are prone to over-blocking and mis-blocking. Moreover, operators of obscene sites have recently begun to evade these methods by automatically changing URLs, by rendering text as images to evade text filtering, and by composing web documents with various techniques that make image processing difficult, so the existing methods are limited in their ability to block harmful sites. A new method is therefore needed that suffers less from mis-blocking and over-blocking and that blocks more reliably despite the various evasion methods used by the operators of such sites.

FIG. 2 is a diagram showing a method of blocking harmful sites using web structure information according to the present invention.

As shown in the figure, the present invention can provide more effective filtering when the filtering block (220) of the existing technology presented in FIG. 1 is added. However, the filtering block (220) is not essential; the structure-information-based filtering block (230) proposed by the present invention can filter harmful sites quite effectively even when used alone.

As in FIG. 1, URL filtering (222) is performed on the initial outbound URL information, and the inbound web document (210) is filtered first by text filtering (224), image filtering (226), and the like. It then passes through the web-structure-based filtering (234) proposed by the present invention, which uses the structure information of the web document: the inbound web document is compared with the learning model (232) and checked against the harmful URL database (236). The web document is thereby classified as a harmful site (240) or a harmless site (250).
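The two-stage flow of FIG. 2 can be outlined as a small pipeline. This is a sketch under stated assumptions: `filter_request` and its parameters are hypothetical names, each filter is modeled as a callable that returns True when it judges its input harmful, and `fetch` stands in for whatever retrieves the web document.

```python
def filter_request(url, fetch, url_filter, content_filters, structure_filters):
    """Return "harmful" or "harmless" for one request (illustrative outline only)."""
    # Outbound stage: URL filtering (222) runs before anything is fetched.
    if url_filter(url):
        return "harmful"

    html = fetch(url)  # the inbound web document (210)

    # First inbound stage: text filtering (224), image filtering (226), and so on.
    for is_harmful in content_filters:
        if is_harmful(html):
            return "harmful"

    # Second inbound stage: web-structure-based filtering (234), for example a
    # learning model comparison (232) and a harmful URL database lookup (236).
    for is_harmful in structure_filters:
        if is_harmful(html):
            return "harmful"

    return "harmless"
```

Because the text says the conventional block (220) is optional, passing an empty `content_filters` list and a `url_filter` that always returns False leaves only the structure-based stage active.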

A harmful site is of course blocked and is not displayed on the user display device (not shown), such as a computer monitor, whereas a harmless site (250) is displayed on the user display device so that the user can carry out the desired task.

The web-structure-based filtering method is described in more detail below.

FIG. 3 is a diagram showing the specific manner in which a filtering module according to the existing technology and a web-structure-information-based filtering module according to the present invention interact for inbound and outbound packets.

First, consider the initial filtering performed by the URL-based filtering module (330) on the outbound path (370). The URL-based filtering module is the module that performs URL filtering in the basic filtering blocks (120, 220) presented in FIGS. 1 and 2.

On a network terminal running a web application (310), the user enters access information in order to reach the desired web application. This is usually done through an Internet web browser such as Internet Explorer or Netscape.

This access information, the URL, is extracted while it is outbound to fetch the corresponding web document, and the URL-based filtering module compares the entered URL against an already-built database (not shown) to decide whether to block. For example, if the user tries to access a website such as www.sexygirl.com, then before the connection is made to the server hosting that site's web documents, "sexygirl" is compared with the harmful URL information in the existing database. If the word "sexy" is in the database and the comparison produces a match, the connection to the server itself is blocked.
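The outbound check in this example amounts to a substring match between the host name and a keyword list. The sketch below assumes that reading; `url_is_blocked` is a hypothetical name, and the keyword set stands in for the database that the figure does not show.

```python
from urllib.parse import urlsplit


def url_is_blocked(url, harmful_keywords):
    """True if the host name of `url` contains any keyword (substring match)."""
    # urlsplit only recognizes the host when a scheme is present, so add one if missing.
    if "://" not in url:
        url = "http://" + url
    host = (urlsplit(url).hostname or "").lower()
    return any(keyword in host for keyword in harmful_keywords)


print(url_is_blocked("www.sexygirl.com", {"sexy"}))  # True
print(url_is_blocked("www.auction.com", {"sexy"}))   # False
```

Substring matching is what lets "sexy" catch "sexygirl", but it is also why a keyword list of this kind tends to over-block unrelated addresses that happen to contain the same letters.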

If this method does not catch suspicious information in the outbound packet, filtering is next performed on the web document (inbound, 380) delivered from the server from which the information was requested.

First, as described above, filtering is performed by the text- and image-based filtering module (350) of the existing technology.

Text filtering examines the text in the web document delivered from the server and, likewise, blocks the document if words associated with obscenity, or combinations of such words, are found in it. Because certain obscenity-related words are commonly used on harmful sites, the harmfulness of a web document can be judged in this way.

Image filtering, like text filtering, can block a site by using, for example, the color information of frames to determine how much of the color information commonly used on harmful sites appears in the document.

However, as noted above, these methods alone have proved insufficiently effective at blocking, and filtering of inbound packets using the web-structure-information-based filtering module (360) is therefore proposed as a more effective and efficient method. The filtering scheme that uses this module is described in more detail below.

예를 들어 도 4에서 나타난 바와 같이 사용자가 www.newxxx.com 이라는 유해 웹 사이트 접속을 위해 상기의 웹 사이트 URL(412)을 입력한다고 가정해보자. 상기 www.newxxx.com 은 유해 URL 데이터베이스(440)에 존재하지 않는 새로운 유해 사이트이고 이는 기존의 URL 필터링에서는 차단할 수 없게 된다. For example, suppose that a user enters the web site URL 412 to access a harmful web site called www.newxxx.com, as shown in FIG. The www.newxxx.com is a new harmful site that does not exist in the harmful URL database 440, which cannot be blocked by existing URL filtering.

The web documents of www.newxxx.com are therefore received (inbound). These documents contain a link URL, for example www.xxx.com, that makes it easy to move to another related harmful site. The URL www.xxx.com does exist in the harmful URL database, so the web structure-based filtering 430 finds this harmful link URL in the web document, judges that www.newxxx.com is also likely to be a harmful site, and blocks the web document instead of showing it on the user display device 410.
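As a rough illustration of the link URL comparison in FIG. 4, the following Python sketch collects the hosts of all anchor links in an inbound document and checks them against a set of known harmful hosts. The class and function names are assumptions made for this example, and a plain in-memory set stands in for the harmful URL database 440.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the host names of all <a href="..."> links in a document."""

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                host = urlparse(value).hostname  # None for relative links
                if host:
                    self.hosts.add(host.lower())


def links_to_known_harmful(html: str, harmful_hosts: set) -> bool:
    """Return True when any link in the document points at a known harmful host."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return bool(collector.hosts & harmful_hosts)
```

In this sketch a page from www.newxxx.com that links to www.xxx.com would be flagged even though www.newxxx.com itself is not in the set.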

This shows that the blocking rate can be raised by also searching every link URL in the web structure information and comparing it against the harmful URL database. In practice, however, recent harmful sites display text rendered as images in order to evade text-based filtering. That is, what appears to the user on the display device as text is often, in terms of the web document's attributes, an image.

To address this, research on TIE (Text Information Extraction) technology has recently been conducted, but it remains at the research stage, and its maturity and accuracy are too low for commercialization. Moreover, although a web document may contain many harmful images, they are small in size and therefore are not filtered out even when image analysis techniques are applied. Finally, when neither the web URL entered by the user nor the link URLs exist in the harmful URL database, harmful site blocking reaches its limit.

URL filtering 520 and web structure-based filtering 530 refer to the harmful URL database 540 and operate in the same manner as described for FIG. 4. Image filtering 560 and text filtering 570 are likewise as described above.

The difference in FIG. 5 is that a learning model 550, generated in advance from the structure information of various inbound web documents, is compared with the structure information of the current www.newxxx.com web document to decide whether to block it. Adding learning model-based filtering to web structure-based filtering further strengthens the efficiency and effectiveness of harmful site blocking. A specific embodiment that generates a learning model and blocks harmful sites is described below with reference to FIG. 6.

FIG. 6 shows the learning model generation process according to the present invention and the harmful site determination process that uses the generated learning model.

First, consider the learning model generation process 600. Data sets prepared in advance, namely a number (here, by way of example, N) of harmful sites 610 and a number (here, by way of example, M) of harmless sites, are subjected to per-site HTML analysis 630 in order to generate the learning model. The analysis extracts the features that harmful sites have in common and the features characteristic of harmless sites, and generates (640) a feature file (for example, feature.dat).

The web structure information used here includes the number of pop-up windows, the background color, the number of frames, the frame color, the number of images, and/or the number of text images. The feature file generated from the extracted values holds information about each site: as shown at 640, the web structure information is expressed as a series of numbers, and the resulting harmful/harmless judgment value is written as 1/-1 at the beginning of each structure information entry 642 in the feature file.
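The feature extraction and the labeled feature line can be sketched in Python as follows. This is only an illustration under several assumptions that the specification does not state: pop-ups are counted as occurrences of `window.open(` in the page source, the background color is read from the `bgcolor` attribute of `<body>` and encoded as an integer, and the line layout (label first, then space-separated numbers) is a hypothetical rendering of the entry 642. Frame color and text-image count are omitted because the source does not say how they are measured.

```python
from html.parser import HTMLParser


class StructureExtractor(HTMLParser):
    """Counts frames and images and records the body background color."""

    def __init__(self):
        super().__init__()
        self.frames = 0
        self.images = 0
        self.bgcolor = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("frame", "iframe"):
            self.frames += 1
        elif tag == "img":
            self.images += 1
        elif tag == "body":
            value = dict(attrs).get("bgcolor") or ""
            if value.startswith("#") and len(value) == 7:
                try:
                    self.bgcolor = int(value[1:], 16)
                except ValueError:
                    pass  # keep the default when the color is malformed


def extract_features(html: str) -> list:
    """Return [pop-up count, background color, frame count, image count]."""
    popups = html.count("window.open(")  # assumed proxy for pop-up windows
    parser = StructureExtractor()
    parser.feed(html)
    parser.close()
    return [popups, parser.bgcolor, parser.frames, parser.images]


def feature_line(features: list, label=None) -> str:
    """Format one entry; the 1/-1 label is written first when it is known."""
    prefix = [] if label is None else [str(label)]
    return " ".join(prefix + [str(v) for v in features])
```

Calling `feature_line(features, 1)` produces a training entry for a harmful site, while `feature_line(features)` produces an unlabeled entry for a site that has not yet been judged.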

The generated feature file is turned into a learning model through machine learning 650 based on a pattern classification technique. FIG. 6 uses an SVM (Support Vector Machine) as the pattern classification technique; clustering, a SOM (Self Organizing Map), and the like may also be used, but the invention is not limited to these, and any available pattern classification technique may be used to generate the learning model.

The pattern classification technique generates (660) a model file, for example model.dat, which provides the reference file for learning model-based comparison.

Next, consider the harmful site determination process that uses the learning model thus generated. Suppose a harmful site is input as at 682. HTML analysis 684 identifies the structure information of the input site, and features are extracted from that structure information as at 686. Because no harmful/harmless judgment has been made at this point, no value (1 or -1) is assigned at the beginning of the feature file.

As in learning model generation, the extracted features are written to a file for ease of comparison (not shown). The extracted features (file) are compared with the model file of the learning model to judge harmfulness (688), producing a result 690 that indicates whether to block. Using such varied web structure information for harmful site blocking can be expected to yield more precise and effective blocking.
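The two phases of FIG. 6, model generation from labeled features and judgment of an unlabeled site, can be sketched in Python. To keep the example dependency-free, a simple perceptron is used here as a stand-in linear classifier; it is not the SVM of FIG. 6, and the feature values, function names, and the three-number feature layout are invented for illustration.

```python
def train_perceptron(rows, labels, max_epochs=100):
    """Learn weights and a bias from feature rows labeled 1 (harmful) or -1 (harmless)."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(rows, labels):
            score = sum(w * v for w, v in zip(weights, x)) + bias
            if y * score <= 0:  # misclassified: move the boundary toward x
                weights = [w + y * v for w, v in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:  # every training row is classified correctly
            break
    return weights, bias


def judge(model, features):
    """Return 1 (harmful, block) or -1 (harmless, pass) for one unlabeled site."""
    weights, bias = model
    score = sum(w * v for w, v in zip(weights, features)) + bias
    return 1 if score > 0 else -1


# Hypothetical training data: [pop-up count, frame count, image count] per site.
training_rows = [[5, 3, 20], [0, 0, 3], [4, 2, 15], [1, 0, 5]]
training_labels = [1, -1, 1, -1]
model = train_perceptron(training_rows, training_labels)
```

Here `model` plays the role of the model file, and `judge(model, features)` corresponds to the comparison at 688 that yields the blocking result 690.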

FIG. 7 shows a harmful site blocking apparatus 700 using web structure information according to the present invention.

When a web document 710 is transmitted to the network terminal, it is received by the web document receiving unit 720.

The web document receiving unit 720 sends the received web documents 710 to the basic filtering unit 730, which first performs text filtering and/or image filtering. The basic filtering unit is optional: used together with the web structure information analyzing unit 740 proposed by the present invention it can yield more effective filtering results, but it is not an essential part. In another embodiment without a basic filtering unit, inbound web documents are passed directly to the web structure information analyzing unit.

In the web structure information analyzing unit 740, harmfulness is judged through the learning model comparison according to the present invention or through the harmful URL database comparison unit 760, which performs link URL comparison. When the site is determined not to be harmful, the web document is transmitted through the web document transmission unit 770 to a user display device 780, such as a monitor on which the user can view it. The user display device includes all kinds of display devices (LCD, CRT, built-in laptop computer monitor, PDA screen, mobile phone screen, etc.).

The web structure information analyzing unit 740 is broadly divided into a learning model comparison unit 750 and a harmful URL database comparison unit 760; both comparison units may be present together, or only one of them.

First, the harmful URL database comparison unit 760 examines the link URLs in the inbound web document 710, as described above, to decide whether to block it. In doing so it refers to an already established harmful URL database (not shown), as described in detail above.

The learning model comparison unit 750 consists of a web structure information extraction unit 752, a file generation unit 754, a learning model generation unit 756, and a file comparison unit 758.

The web structure information extraction unit 752 extracts the web structure information of the file transmitted from the web document receiving unit 720 (or, where applicable, the basic filtering unit 730).

The file generation unit 754, when operating to determine whether a site is harmful or harmless, writes the structure information extracted by the web structure information extraction unit 752 into a single file. The file comparison unit 758 compares the generated file with the learning model to decide whether to block.

The learning model generation unit 756 is used when the learning model is first generated: features are extracted by the web structure information extraction unit 752 and written to a file by the file generation unit 754, and the learning model generation unit then applies a pattern classification technique to that file to produce a model file. This model file is the reference file for the harmful/harmless comparison. Naturally, after the file comparison unit 758 has made its judgment on each input web document, the newly added web document can in turn be used for learning model generation.
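The arrangement of the apparatus 700, with an optional basic filtering unit and an analyzing unit that holds one or both comparison units, can be outlined in Python. The class name, function name, and callable-based interfaces below are assumptions made for this sketch; each comparator is modeled as a function that returns True when it considers a document harmful.

```python
from typing import Callable, Optional

# A comparator returns True when it judges the document to be harmful.
Comparator = Callable[[str], bool]


class WebStructureAnalyzer:
    """Holds a learning model comparator, a harmful URL comparator, or both."""

    def __init__(self,
                 model_comparator: Optional[Comparator] = None,
                 url_db_comparator: Optional[Comparator] = None):
        self.comparators = [c for c in (url_db_comparator, model_comparator)
                            if c is not None]
        if not self.comparators:
            raise ValueError("at least one comparison unit is required")

    def is_harmful(self, document: str) -> bool:
        return any(compare(document) for compare in self.comparators)


def deliver(document: str,
            analyzer: WebStructureAnalyzer,
            basic_filter: Optional[Comparator] = None) -> Optional[str]:
    """Return the document for display, or None when it is blocked."""
    if basic_filter is not None and basic_filter(document):
        return None  # blocked by optional text/image filtering
    if analyzer.is_harmful(document):
        return None  # blocked by web structure analysis
    return document  # passed on toward the user display device
```

Treating a document as harmful when any configured comparator flags it is a design choice of this sketch; the specification only states that either or both comparison units may be present.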

Although the present invention has been described above on the basis of preferred embodiments, these embodiments are intended to illustrate the invention, not to limit it, and those skilled in the art to which the invention pertains will be able to make various changes, modifications, or adjustments to the above embodiments without departing from its technical spirit. The scope of protection of the invention is therefore limited only by the appended claims and should be construed as covering all such changes, modifications, and adjustments.

As described above, the present invention is applied to a terminal connected to a network and, when that terminal accesses an Internet web site, uses the structure information of the web site to judge accurately whether it is obscene and to block harmful sites. The invention makes it possible to block harmful sites, such as unwanted obscene sites, more accurately and effectively, and to more easily control access to such obscene and harmful sites not only by adults but also by young children and adolescents, who are vulnerable to them.

Claims

A method of blocking harmful sites using web structure information, the method comprising:

generating a learning model from web structure information of a plurality of harmful sites and harmless sites;

comparing an inbound web document with the generated learning model; and

determining whether to block the web document based on a result of the comparison with the learning model.

The method of claim 1, further comprising, after generating the learning model,

performing URL filtering on URL information input by a user and performing text filtering and/or image filtering on the inbound web document.

The method of claim 1, wherein generating the learning model comprises:

extracting values for the number of pop-up windows, the background color, the number of frames, the frame color, the number of images, and/or the number of text images from the web structure information of the plurality of harmful and harmless sites;

making the extracted values into a file; and

generating the learning model from the generated file using a pattern classification technique.

The method of claim 3, wherein the pattern classification technique uses a support vector machine (SVM), clustering, or a self organizing map (SOM).

A method of blocking harmful sites using web structure information when accessing a web site, the method comprising:

establishing a harmful URL database containing known harmful web sites and links;

comparing a link URL in an inbound web document with the URL data in the established harmful URL database; and

determining whether to block the web document according to the comparison result.

The method of claim 5, further comprising, after the harmful URL database is established,

performing URL filtering on URL information input by a user and performing text filtering and image filtering on the inbound web document.

The method of claim 5,

further comprising, after the comparison with the URL data in the harmful URL database, determining whether to block by comparing the inbound web document with an already generated learning model, wherein the step of establishing the harmful URL database further comprises generating the learning model from web structure information of a plurality of harmful sites and harmless sites.

The method of claim 7, wherein generating the learning model,

Making the extracted value into a file; And

The method of claim 8, wherein the pattern classification technique uses a support vector machine (SVM), clustering, or a self organizing map (SOM).

A harmful site blocking device using web structure information, the device comprising:

a web document receiving unit which receives a web document transmitted from the outside;

a web structure information analyzing unit which analyzes structure information of the web document transmitted from the outside; and

a web document transmission unit which transmits a web document determined by the web structure information analyzing unit to belong to a harmless site to a monitoring device on which a user can view it.

The harmful site blocking device of claim 10, further comprising a basic filtering unit configured to perform URL filtering on URL information input by a user and to perform text filtering and image filtering on the received web document.

The harmful site blocking device of claim 10,

wherein the web structure information analyzing unit includes a learning model comparison unit and/or a harmful URL database comparison unit.

The harmful site blocking device of claim 12, wherein the learning model comparison unit comprises:

a web structure information extraction unit which extracts values for the number of pop-up windows, the background color, the number of frames, the frame color, the number of images, and/or the number of text images in the web document;

a file generation unit which makes the extracted web structure information into one file;

a learning model generation unit which generates a learning model from the generated file using a pattern classification technique; and

a file comparison unit which compares the learning model with the file generated by the file generation unit.