KR20210097408A

KR20210097408A - Device updating harmful website information and method thereof

Info

Publication number: KR20210097408A
Application number: KR1020200011068A
Authority: KR
Inventors: 김종환; 이남경
Original assignee: 한국전자통신연구원
Priority date: 2020-01-30
Filing date: 2020-01-30
Publication date: 2021-08-09

Abstract

Disclosed are an apparatus for updating harmful website information and a method thereof, capable of automatically collecting the addresses of harmful websites. The apparatus includes a crawling unit, a web content analysis unit, and an update unit. The crawling unit searches for websites through a search engine, based on search data related to feature data obtained by feature analysis of harmful sites previously stored in a harmful site database unit, and periodically and automatically acquires, from each website found, the address of the website together with object information related to its content. The web content analysis unit analyzes feature data extracted from the object information, determines whether the calculated harmfulness exceeds a harmfulness threshold, determines that the website is a harmful site when the threshold is exceeded, and generates harmful analysis information including at least the address of the website and the feature data. The update unit automatically updates the harmful site database unit based on the harmful analysis information.

Description

Device and method for updating harmful website information {DEVICE UPDATING HARMFUL WEBSITE INFORMATION AND METHOD THEREOF}

The present disclosure relates to an apparatus and method for updating harmful website information. More specifically, it relates to an apparatus and method that automatically analyze topics, keywords, images, and the like on websites without human intervention to determine whether each is a harmful site, and that automatically collect the addresses of the harmful sites so as to provide useful information to external services that identify and block harmful sites.

With the recent development of media and information and communication technology, harmful media information is increasingly being produced at a rate that humans cannot process. Amid this flood of harmful media information, various methods have been proposed for information consumers or sellers to discover and block harmful media. A typical approach is for a person to find the addresses of harmful sites, which are then blocked at a firewall or on the Internet backbone.

Conventionally, a person visited each site, judged whether it was harmful, and collected its address. It would be more efficient for a program to pre-screen and find harmful site addresses automatically and to update that information automatically, but no methodology for doing so has been available.

A technical problem of the present disclosure is to provide an apparatus and method for updating harmful website information that automatically analyze topics, keywords, images, and the like on websites without human intervention to determine whether each is a harmful site, and that automatically collect the addresses of the harmful sites so as to provide useful information to external services that identify and block harmful sites.

Another technical problem of the present disclosure is to provide an apparatus and method for updating harmful website information that periodically determine whether a website is a harmful site and keep the harmful website information up to date, even when circumstances change, such as when a harmful website changes its address or is shut down.

The technical problems to be solved by the present disclosure are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood from the following description by those of ordinary skill in the art to which the present disclosure belongs.

According to one aspect of the present disclosure, an apparatus for updating harmful website information may be provided. The apparatus includes: a crawling unit that searches for websites through a search engine based on search data related to feature data obtained by feature analysis of harmful sites previously stored in a harmful site database unit, and that periodically and automatically acquires, from each website found, the website address together with object information related to the content of the website; a web content analysis unit that analyzes feature data extracted from the object information, determines whether the calculated harmfulness exceeds a harmfulness threshold, determines the website to be a harmful site if the threshold is exceeded, and generates harmful analysis information including at least the address of the website and the feature data; and an update unit that automatically updates the harmful site database unit with the harmful analysis information.

According to another embodiment of the present disclosure, the content consists of at least one of text, video data, and image data, and the search data may be at least one of keywords, images, and videos constituting the previously stored feature data of harmful sites.

In addition, when the content is text, the feature data is the keywords together with words extracted by referring to harmful words stored in the harmful site database unit; when the content is video data or image data, the feature data may be feature elements estimated to be harmful shapes.

Furthermore, when the content is text, the web content analysis unit may calculate the harmfulness using word2vec, doc2vec, or a Latent Dirichlet Allocation (LDA) topic analysis algorithm, based on the keywords and the extracted words that constitute the feature data extracted from the object information.
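The idea of reducing text to a single harmfulness value can be illustrated with a deliberately simple stand-in. The sketch below does not implement word2vec, doc2vec, or LDA; it assumes a plain harmful-word ratio as the score, and the function name `text_harmfulness` and its parameters are hypothetical.

```python
import re


def text_harmfulness(text, harmful_words):
    """Return the fraction of tokens that appear in the harmful-word list.

    This is a stand-in scorer for illustration only; the source names
    word2vec, doc2vec, and LDA, none of which is implemented here.
    """
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0  # empty text carries no evidence of harm
    vocab = {w.lower() for w in harmful_words}
    hits = sum(1 for t in tokens if t in vocab)
    return hits / len(tokens)
```

A score produced this way lies between 0 and 1 and can be compared against a harmfulness threshold in the same way as a score produced by a topic model.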

In addition, when the content is video data or image data, the web content analysis unit may calculate the harmfulness by applying machine learning or deep learning to the feature elements of the feature data extracted from the object information.

According to another embodiment of the present disclosure, the crawling unit may further acquire object collection information including metadata related to the object information and classify the object collection information by website.

In addition, when the content is text, the harmful analysis information may further include at least one of the object collection information, a topic, and a context result value between the words of the text; when the content is video data or image data, the harmful analysis information may further include one or more of the object collection information, a topic, and the data constituting the object information.

In addition, the keywords, words, and feature elements estimated to be harmful shapes that are extracted in relation to the feature data may be managed in at least one of the web content analysis unit and the harmful site database unit.

According to another embodiment of the present disclosure, the web content analysis unit is connected, through a harmfulness detection connection unit, to a harmful media external detector configured as an external module; the harmful media external detector may analyze the feature data extracted from the object information, determine whether the calculated harmfulness exceeds the harmfulness threshold, and notify the web content analysis unit of the result.

According to another embodiment of the present disclosure, the harmful analysis information, including the address of the website determined to be a harmful site and the search data, may be expressed as ontology-based data and stored in the harmful site database unit.

According to another embodiment of the present disclosure, the update unit may update the harmful site database unit with the harmful analysis information when the object information and the harmful analysis information of the website determined to be a harmful site differ from the information previously stored in the harmful site database unit.

According to another aspect of the present disclosure, a method of updating harmful website information may be provided. The method includes: a crawling step of searching for websites through a search engine based on search data related to feature data obtained by feature analysis of harmful sites previously stored in a harmful site database unit, and periodically and automatically acquiring, from each website found, the website address together with object information related to the content of the website; a web content analysis step of analyzing feature data extracted from the object information, determining whether the calculated harmfulness exceeds a harmfulness threshold, determining the website to be a harmful site if the threshold is exceeded, and generating harmful analysis information including at least the address of the website and the feature data; and an update step of automatically updating the harmful site database unit with the harmful analysis information.

The features briefly summarized above are merely exemplary aspects of the detailed description that follows and do not limit the scope of the present disclosure.

According to the present disclosure, it is possible to provide an apparatus and method for updating harmful website information that automatically collect and update the addresses of harmful websites without human intervention, and that collect object information, that is, the media of harmful websites, so as to provide useful information to external services that identify and block harmful sites.

According to the present disclosure, it is possible to provide an apparatus and method for updating harmful website information that periodically determine whether a website is a harmful site and keep the harmful website information up to date, even when circumstances change, such as when a harmful website changes its address or is shut down.

The effects obtainable from the present disclosure are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the following description by those of ordinary skill in the art to which the present disclosure belongs.

FIG. 1 is a block diagram illustrating an apparatus for updating harmful website information according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating the operation flow of the crawling unit.
FIG. 3 is a flowchart illustrating the operation flow of the web content analysis unit.
FIG. 4 is a flowchart illustrating the operation flow of the update unit.

Hereinafter, embodiments of the present disclosure are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present disclosure pertains can readily implement them. However, the present disclosure may be implemented in many different forms and is not limited to the embodiments described herein.

In describing the embodiments of the present disclosure, detailed descriptions of well-known configurations or functions are omitted where they could obscure the gist of the present disclosure. In the drawings, parts unrelated to the description of the present disclosure are omitted, and similar parts are given similar reference numerals.

In the present disclosure, when a component is said to be "connected", "coupled", or "linked" to another component, this covers not only a direct connection but also an indirect connection in which a further component is present between them. In addition, when a component is said to "include" or "have" another component, this means that it may further include other components, not that other components are excluded, unless specifically stated otherwise.

In the present disclosure, terms such as first and second are used only to distinguish one component from another and, unless otherwise specified, do not limit the order or importance of the components. Accordingly, within the scope of the present disclosure, a first component in one embodiment may be referred to as a second component in another embodiment, and likewise a second component in one embodiment may be referred to as a first component in another embodiment.

In the present disclosure, components are distinguished from one another in order to describe their respective features clearly; this does not mean that the components are necessarily separate. That is, a plurality of components may be integrated into one hardware or software unit, or one component may be distributed across a plurality of hardware or software units. Accordingly, such integrated or distributed embodiments are included in the scope of the present disclosure even if not specifically mentioned.

In the present disclosure, the components described in the various embodiments are not necessarily essential components, and some may be optional. Accordingly, an embodiment consisting of a subset of the components described in one embodiment is also included in the scope of the present disclosure, as is an embodiment that includes other components in addition to those described in the various embodiments.

Hereinafter, embodiments of the present disclosure are described with reference to the accompanying drawings.

An apparatus for updating harmful website information according to an embodiment of the present invention is described with reference to FIG. 1.

FIG. 1 is a block diagram illustrating an apparatus for updating harmful website information according to an embodiment of the present invention.

The apparatus 100 for updating harmful website information automatically collects and updates the addresses of harmful websites and collects object information, that is, the media of harmful websites.

Specifically, the apparatus 100 for updating harmful website information may include a crawling unit 102, a web content analysis unit 104, a harmfulness detection connection unit 106, a harmful media external detector 202, an update unit 108, and a harmful site database unit 112. The harmful media external detector 202 and the harmful media blocking external service module 204 function as components separate from the apparatus 100.

The crawling unit 102 searches for websites through a search engine based on search data related to feature data obtained by feature analysis of harmful sites previously stored in the harmful site database unit 112, and periodically and automatically acquires, from each website found, the website address together with object information related to the content of the website. The crawling unit 102 may also acquire object collection information including metadata related to the object information and classify the object collection information by website.

A page provided by a website may offer only a single type of content, such as text, video data (for example, a video or video clip), or image data (for example, a photograph), but it usually consists of a combination of several content types. Accordingly, the content may consist of at least one of text, video data, and image data.

Even when the content a user accesses through a web page is video data or image data, the search data may be keywords composed of words, or may be an image or video clip itself. Conversely, even when the accessed content is text, the search data may be not only keywords but also images or videos from which the text can be inferred. Accordingly, the search data may be at least one of keywords, images, and videos constituting the previously stored feature data of harmful sites. The website search by the crawling unit 102 is performed on the basis of the search data, using either a search engine provided in the apparatus 100 itself or a commercially available search engine service.

The object information is obtained by extracting at least part of each item of data from one or more of the text, video data, and image data posted on the website, and this extraction may be performed by various known techniques. The metadata of the object collection information may include the title, author, source, and creation time of the content, as well as other descriptions.
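As a rough illustration of the object collection information and its per-website classification, the following sketch assumes a simple record type and groups records by host name. The class `ObjectCollectionInfo`, its field names, and `group_by_website` are hypothetical; the source lists only the kinds of metadata, not a data layout.

```python
from collections import defaultdict
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class ObjectCollectionInfo:
    """One crawled object plus its metadata (field names are assumptions)."""
    page_url: str
    content_type: str      # "text", "image", or "video"
    payload: str           # extracted text, or a reference to a media file
    title: str = ""
    author: str = ""
    source: str = ""
    created: str = ""
    description: str = ""


def group_by_website(items):
    """Classify object collection information per website, keyed by host."""
    grouped = defaultdict(list)
    for item in items:
        host = urlparse(item.page_url).netloc
        grouped[host].append(item)
    return dict(grouped)
```

Keying on the host name is only one possible choice; a deployment that treats sub-paths or subdomains as separate sites would need a different key.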

The web content analysis unit 104 analyzes feature data extracted from the object information, determines whether the calculated harmfulness exceeds the harmfulness threshold, determines the website to be a harmful site if the threshold is exceeded, and generates harmful analysis information including at least the address of the website and the feature data extracted from the object information of the website found.

In this embodiment, the web content analysis unit 104 extracts the feature data from the object information and generates the harmful analysis information according to the harmfulness determination result, while the harmful media external detector 202 connected to the web content analysis unit 104 determines the harmfulness of the website. Specifically, the harmful media external detector 202, configured as an external module, is connected to the web content analysis unit 104 through the harmfulness detection connection unit 106, determines whether the harmfulness calculated by analyzing the feature data exceeds the harmfulness threshold, and notifies the web content analysis unit 104 of the result. This embodiment distributes processing between the apparatus 100 and the external module in order to reduce the data processing load and to accommodate a variety of harmfulness determination logic; in other embodiments, the extraction of feature data, the harmfulness determination, and the generation of harmful analysis information may all be implemented in the web content analysis unit 104.
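The threshold decision and the option of delegating scoring to an external detector can be sketched as follows. This is a minimal illustration under stated assumptions: the threshold value, the `HarmAnalysisInfo` record, and the `analyze` function are hypothetical, and the scorer is passed in as a callable so that it may stand for either internal logic or an external module.

```python
from dataclasses import dataclass

HARM_THRESHOLD = 0.5  # assumed value; the source does not specify a threshold


@dataclass
class HarmAnalysisInfo:
    """Minimal harmful analysis information: address plus feature data."""
    address: str
    feature_data: list
    harmfulness: float


def analyze(address, feature_data, scorer, threshold=HARM_THRESHOLD):
    """Score the feature data and report the site only if it exceeds the threshold.

    `scorer` is any callable mapping feature data to a number; it may wrap
    an external detector or an in-process model.
    """
    score = scorer(feature_data)
    if score > threshold:  # "exceeds" is read as strictly greater than
        return HarmAnalysisInfo(address, list(feature_data), score)
    return None
```

Passing the scorer as a parameter mirrors the distributed arrangement described above, in which the determination logic can be replaced without changing the analysis unit.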

The feature data, whether existing or newly extracted from the object information of a website found by the search, may be configured as follows. When the content is text, the feature data may be the keywords used as search data together with words extracted by referring to harmful words stored in the harmful site database unit 112. When the content is video data or image data, the feature data may be feature elements estimated to be harmful shapes, and the estimation may refer to feature elements already extracted from harmful image and harmful video data and stored in the harmful site database unit 112. A feature element identical to a stored feature element may be extracted from the object information, but because previously stored harmful object information and newly extracted object information usually differ, feature elements presumed harmful are extracted when they fall within a predetermined similarity to the stored feature elements, measured for example by Euclidean distance.
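The similarity test against stored feature elements can be sketched with Euclidean distance over numeric feature vectors. The vector representation, the function name `is_similar_to_stored`, and the `max_distance` parameter are assumptions for illustration; the source does not state how feature elements are encoded.

```python
import math


def is_similar_to_stored(candidate, stored_features, max_distance):
    """Return True if the candidate vector lies within max_distance
    (Euclidean) of at least one stored harmful feature vector.

    All vectors are assumed to have the same dimension.
    """
    return any(math.dist(candidate, s) <= max_distance for s in stored_features)
```

For example, a candidate `[0.0, 0.0]` is at distance 5.0 from a stored vector `[3.0, 4.0]`, so it is accepted with `max_distance=5.0` and rejected with `max_distance=4.9`.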

Accordingly, the feature data may be identical to the feature data of harmful sites previously stored in the harmful site database unit 112, or may be newly extracted from a harmful site found using search data related to the existing feature data. Both existing feature data and newly extracted feature data, the latter inferred from the existing feature data, are managed in at least one of the web content analysis unit 104 and the harmful site database unit 112 as described above, and the new feature data can subsequently be used as search data when searching for websites.

The harmfulness determination and the harmful analysis information are described later, in connection with the method of updating harmful website information using the apparatus 100.

The update unit 108 automatically updates the harmful site database unit 112 with the harmful analysis information of a website that the web content analysis unit 104 has determined to be a harmful site on the basis of its feature data. In this update, the harmful analysis information, which includes at least the address of the website determined to be a harmful site and the feature data extracted from the object information of that site, may be expressed as ontology-based data, a form of knowledge representation, and stored in the harmful site database unit 112.
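One simple way to picture ontology-style storage is as subject-predicate-object triples. The sketch below assumes that representation and an in-memory dictionary as the database; the predicate names `isA` and `hasFeature`, the class label `HarmfulSite`, and both function names are hypothetical, and the source does not specify an ontology language or schema. The update writes only when the stored triples differ from the new ones.

```python
def to_triples(address, feature_data):
    """Express harmful analysis information as (subject, predicate, object) triples."""
    triples = {(address, "isA", "HarmfulSite")}
    triples.update((address, "hasFeature", f) for f in feature_data)
    return triples


def update_database(db, address, feature_data):
    """Store the triples for a site; return True only if the database changed."""
    new = to_triples(address, feature_data)
    if db.get(address) == new:
        return False  # identical to what is already stored, so nothing to do
    db[address] = new
    return True
```

Comparing sets of triples makes the change check independent of the order in which features were extracted.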

The harmful site database unit 112 stores the harmful analysis information extracted from websites determined to be harmful and can deliver it in response to an external request. It may also transmit the harmful analysis information to the harmful media blocking external service module 204, an external module that decides whether to block a website accessed by a user.

The harmful analysis information may include not only the web address of the harmful site and the search data used to find it but also, as described above, the harmful words and the feature elements extracted from harmful image and harmful video data that can be referred to when extracting feature data.

Hereinafter, a method of updating harmful website information according to another embodiment of the present invention is described with reference to FIGS. 1 to 4.

FIG. 2 is a flowchart illustrating the operation flow of the crawling unit. FIG. 3 is a flowchart illustrating the operation flow of the web content analysis unit. FIG. 4 is a flowchart illustrating the operation flow of the update unit.

Referring first to FIG. 2, when the automatic execution cycle of the apparatus 100 for updating harmful website information arrives (Y in S205), the crawling unit 102 searches for websites through a search engine based on search data related to feature data obtained by feature analysis of harmful sites previously stored in the harmful site database unit (S210).
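The cycle check in S205 followed by the search in S210 can be sketched as a small scheduling helper. The names `cycle_reached` and `run_if_due`, and the use of a fixed period, are assumptions; the source states only that execution is periodic and automatic.

```python
from datetime import datetime, timedelta


def cycle_reached(last_run, now, period):
    """Return True when the automatic execution cycle has arrived (Y in S205)."""
    return last_run is None or now - last_run >= period


def run_if_due(last_run, now, period, crawl):
    """Run the crawl (S210) if the cycle has arrived; return the last run time."""
    if not cycle_reached(last_run, now, period):
        return last_run
    crawl()
    return now
```

A caller would invoke `run_if_due` from a timer or loop, passing the current time and keeping the returned timestamp for the next check.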

The content may consist of at least one of text, video data, and image data.

Even when the content a user accesses through a web page is video data or image data, the search data may be keywords composed of words, or may be an image or video clip itself. Conversely, even when the accessed content is text, the search data may be not only keywords but also images or videos from which the text can be inferred. That is, the search data may be at least one of keywords, images, and videos constituting the previously stored feature data of harmful sites.

Here, the previously stored feature data of harmful sites is existing feature data extracted from the object information of websites that the feature analysis of the web content analysis unit 104 has already determined to be harmful sites, and it may be managed by at least one of the web content analysis unit 104 and the harmful site database unit 110. The crawling unit 102 receives the managed feature data and processes it into search data. In this embodiment, websites are searched automatically using the existing feature data of harmful sites as search data, so websites expected to be highly harmful can be found easily.

The website search performed by the crawling unit 102 on the basis of the search data may use a search engine provided in the device 100 itself or an already commercialized search engine service.

Accordingly, in addition to the harmful sites related to the feature data already stored in the harmful site database unit 112, the changed addresses of existing harmful sites and newly opened, potentially harmful sites can also be found by using the search data processed from the feature data.
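The way stored feature data can be processed into search data may be sketched as follows. This is only an illustrative sketch, not the disclosed implementation: the `FeatureRecord` type, the `build_search_queries` function, and the `max_terms` parameter are hypothetical names, and only keyword-type search data is covered (image or video search data is not).

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class FeatureRecord:
    """Hypothetical record of feature data stored for one known harmful site."""
    url: str
    keywords: Tuple[str, ...]


def build_search_queries(records: Iterable[FeatureRecord], max_terms: int = 3) -> List[str]:
    """Turn stored keywords into de-duplicated search-engine query strings."""
    queries: List[str] = []
    seen = set()
    for record in records:
        terms = record.keywords[:max_terms]  # keep each query short
        if not terms:
            continue  # a record without keywords yields no query
        query = " ".join(terms)
        if query not in seen:
            seen.add(query)
            queries.append(query)
    return queries
```

Because the queries are derived from feature data rather than from stored addresses, a site that has moved to a new address could still be returned by the search engine.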

Next, the crawling unit 102 acquires the address of each searched website and the object information related to its content (S215).

The object information is obtained by extracting at least part of each data item from one or more of the text, video data, and image data posted on the website, and this extraction may be performed by various known techniques.

Subsequently, the crawling unit 102 acquires object collection information from the acquired object information (S220).

The object collection information includes metadata related to the object information, and the metadata may include the title, author, source, and creation time of the content, as well as other descriptions. The object collection information is classified by website; it later becomes part of the harmful analysis information and is stored and managed in knowledge-information form when the database is updated.
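A minimal sketch of classifying object collection information by website is shown below. The dictionary layout and the field names (`site`, `title`, `author`, `source`, `created`) are assumptions made for illustration; the source only states which kinds of metadata may be included.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_website(items: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group metadata records by their 'site' field (hypothetical field name)."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item["site"]].append(item)
    return dict(grouped)


# Hypothetical metadata records for crawled objects.
collected = [
    {"site": "http://a.example", "title": "t1", "author": "u1",
     "source": "page1", "created": "2020-01-01"},
    {"site": "http://b.example", "title": "t2", "author": "u2",
     "source": "page2", "created": "2020-01-02"},
    {"site": "http://a.example", "title": "t3", "author": "u1",
     "source": "page3", "created": "2020-01-03"},
]
by_site = group_by_website(collected)
```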

Next, referring to FIG. 3, when the object information has been generated (Y in S305), the web content analysis unit 104 analyzes the object information of the searched website and extracts feature data related to any one of text, image data, and video data (S310).

When the content is text, the feature data may be the keywords used as search data together with words extracted by referring to the harmful words stored in the harmful site database unit 112. When the content is video data or image data, the feature data may be feature elements estimated to be harmful shapes, and the estimation may refer to feature elements that were extracted from harmful image and harmful video data and already stored in the harmful site database unit 112. For video data, the feature data may be, for example, predetermined feature elements extracted based on the MPEG-7 standard. A feature element identical to a stored one may be extracted from the object information, but previously stored harmful object information and newly extracted object information usually differ, so a feature element is extracted as presumably harmful when it falls within a predetermined similarity, such as a Euclidean distance, of a stored feature element.
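The similarity test described above can be sketched with a plain Euclidean distance between feature vectors. This is an illustrative sketch under assumptions: feature elements are represented as equal-length numeric vectors, and the function names and the `max_distance` parameter are hypothetical.

```python
import math
from typing import Iterable, Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_presumed_harmful(candidate: Sequence[float],
                        stored_features: Iterable[Sequence[float]],
                        max_distance: float) -> bool:
    """True when the candidate lies within max_distance of any stored feature."""
    return any(euclidean_distance(candidate, stored) <= max_distance
               for stored in stored_features)
```

A candidate does not have to match a stored feature element exactly; it only has to fall within the chosen distance of one of them.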

Next, the harmful media external detector 202, which is connected to the web content analysis unit 104 through the harmfulness detection connection unit 106, determines whether the harmfulness based on at least one of the feature data related to text, image data, and video data exceeds a harmfulness threshold (S315).

For text, the harmful media external detector 202 may calculate the harmfulness from the keywords and extracted words constituting the feature data, using word2vec, doc2vec, or an LDA (Latent Dirichlet Allocation) topic analysis algorithm.

Based on a dictionary of harmful words, phrases, and sentences related to sexual language, violent situations, and the like, and on the context correlation inferred from combinations of words, the harmful media external detector 202 analyzes the harmfulness of a website by performing sentiment evaluation of the keywords and words extracted from the searched website and of the entire text they make up. For example, the harmfulness analysis calculates the proportion of feature-data words among all the words of the text, or, even when feature-data words appear with high frequency, analyzes whether the topic is sexual or violent in terms of the overall context.
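The first of the two example analyses, the proportion of feature-data words among all words of the text, can be sketched as below. The tokenization by a simple regular expression and the function name `harmful_word_ratio` are assumptions for illustration; the context analysis (word2vec, doc2vec, or LDA) is not covered by this sketch.

```python
import re
from typing import Iterable


def harmful_word_ratio(text: str, harmful_words: Iterable[str]) -> float:
    """Fraction of tokens in text that appear in the harmful-word set."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0  # empty text contains no harmful words
    vocabulary = {word.lower() for word in harmful_words}
    hits = sum(1 for token in tokens if token in vocabulary)
    return hits / len(tokens)
```

A ratio of this kind ignores context, which is why the description pairs it with a topic-level analysis of the whole text.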

For video data and image data, the harmful media external detector 202 calculates the harmfulness by applying machine learning or deep learning to the feature elements.

Specifically, a large volume of feature data from the video and image data of harmful sites already stored in the harmful site database unit 112 is learned, and the harmfulness is analyzed by determining whether newly input feature data represents a harmful image or video.

If the harmfulness analysis shows that the harmfulness exceeds the harmfulness threshold (Y in S315), the harmful media external detector 202 notifies the web content analysis unit 104 of the result of analyzing the harmfulness of the feature data, and the web content analysis unit 104 determines that the website related to the notified result is a harmful site and generates harmful analysis information related to it (S320).

When the content type is text, the harmful analysis information includes the address of the website determined to be harmful and the feature data of the searched website, and may further include at least one of the search data, the object collection information, the topic of the text calculated by the harmfulness analysis, the words related to the feature data, and the context result value between the words of the text.

When the content type is video data or image data, the harmful analysis information includes the website address and the feature data, and may further include one or more of the search data, the object collection information, the topic of the video data and/or image data calculated by the harmfulness analysis, the feature elements related to the feature data and estimated to be harmful shapes, and at least part of the data constituting the object information.

The keywords, words, and feature elements estimated to be harmful shapes that are extracted in relation to the feature data are managed by at least one of the web content analysis unit 104 and the harmful site database unit 110. The managed feature data is provided to the crawling unit 102 during the periodic website search and used as search data.

The data constituting the object information may be at least one of the video data and image data extracted from the website by the crawling unit 102. The object information data is stored in the harmful site database unit 112 and learned, so that it can be used to determine feature data and harmfulness when the video and image data of subsequently input websites are analyzed.

Because the harmful analysis information is not limited to the address and feature data of the harmful site but further includes at least one of the other data items described above, it can be used to obtain more accurate results when extracting object information and feature data from newly input websites and analyzing their harmfulness.

Meanwhile, if the harmfulness analysis shows that the harmfulness does not exceed the harmfulness threshold (N in S315), the website is determined not to be a harmful site.
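The decision in S315 and the generation of harmful analysis information in S320 can be sketched as a single function. This is a simplified illustration, assuming the harmfulness has already been calculated as a number; the function name `analyze_site` and the dictionary keys are hypothetical, and only the two mandatory items (address and feature data) plus the score are recorded.

```python
from typing import List, Optional


def analyze_site(url: str, feature_data: List[str],
                 harmfulness: float, threshold: float) -> Optional[dict]:
    """Return harmful analysis information, or None when the site is not harmful.

    A site counts as harmful only when harmfulness strictly exceeds threshold.
    """
    if harmfulness > threshold:
        return {
            "url": url,                          # address of the harmful site
            "feature_data": list(feature_data),  # features that led to the verdict
            "harmfulness": harmfulness,
        }
    return None
```

A score exactly equal to the threshold does not exceed it, so such a site is treated as not harmful in this sketch.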

Next, referring to FIG. 4, when harmful analysis information has been generated (Y in S405), the update unit 108 reads the object information of the website determined to be a harmful site and the harmful analysis information (S410).

Subsequently, the update unit 108 converts the object information and the harmful analysis information into ontology form (S415).

When the data in the harmful site database unit 112 is stored and managed as knowledge information so that data collected through the web can be easily shared and linked, the update unit 108 converts the related information into ontology form to match that knowledge-information format. As a result, the information on harmful sites stored in the harmful site database unit 112 can be easily provided to external services over the web in a standardized manner.

More specifically, the ontology may be data that uses semantic web technology. The semantic web is a framework and technology that expresses information about resources (web documents, various files, services, etc.) and the relationship and meaning information (semanteme) between resources in a distributed environment such as today's Internet in the form of an ontology that machines, that is, computers, can process, and that has automated machines process it.

An ontology is a model that abstracts and shares how people think about things; it is a formalized description in which the types of concepts and the constraints on their use are explicitly defined. In computer science, an ontology is a data model representing a specific domain and is defined as a set of formal vocabulary that describes the concepts belonging to that domain and the relationships between them. In particular, an ontology is a tool for implementing the semantic web and for semantically connecting knowledge concepts, and it allows a computer to process the concepts people hold about things in the form of a kind of database.

The semantic web may be expressed in a semantic markup language based on XML (Extensible Markup Language). In the semantic web, a concept is expressed as a triple of subject, predicate, and object, and each subject, predicate, and object may in turn be expressed as a URI (Uniform Resource Identifier). Languages currently used to describe semantic web ontologies include RDF (Resource Description Framework), RDF Schema, OWL (Web Ontology Language), and SWRL (Semantic Web Rule Language), proposed through the W3C (World Wide Web Consortium), and Topic Maps, proposed by ISO.
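The subject-predicate-object form can be illustrated with plain Python tuples. This sketch does not use any RDF library and does not reproduce the ontology of the disclosure: the `BASE` namespace, the predicate names `isA` and `hasKeyword`, and the class name `HarmfulSite` are all hypothetical, and keywords are kept as plain literal strings rather than URIs.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical namespace for the example vocabulary.
BASE = "http://example.org/harmful-site#"


def to_triples(site_url: str, keywords: List[str]) -> List[Triple]:
    """Express one harmful site and its keywords as (subject, predicate, object) triples."""
    triples: List[Triple] = [(site_url, BASE + "isA", BASE + "HarmfulSite")]
    for keyword in keywords:
        # The object here is a literal string, not a URI.
        triples.append((site_url, BASE + "hasKeyword", keyword))
    return triples


def objects_of(triples: List[Triple], subject: str, predicate: str) -> List[str]:
    """Return every object that matches the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]
```

Storing each fact as a separate triple is what lets an external service query a single relation, such as the keywords of one site, without knowing the rest of the record layout.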

Next, when the object information and harmful analysis information of the website differ from the information previously stored in the harmful site database unit 112 (N in S420), the update unit 108 updates the harmful site database unit 112 with the keywords, image data, and/or video data used as search data and the address of the harmful website, together with at least some of the above-described data of the harmful analysis information (S425).

Conversely, when the object information and harmful analysis information of the website are the same as the information previously stored in the harmful site database unit 112 (Y in S420), the update unit 108 keeps the previously stored information in order to avoid unnecessary processing.
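The comparison in S420 and the conditional update in S425 can be sketched as follows. The in-memory dictionary stands in for the harmful site database unit purely for illustration, and the function name `update_database` is hypothetical.

```python
from typing import Dict


def update_database(database: Dict[str, dict], url: str, analysis_info: dict) -> bool:
    """Store analysis_info under url only when it differs from the stored entry.

    Returns True when the database was changed and False when the stored
    information was kept as-is.
    """
    if database.get(url) == analysis_info:
        return False  # identical information: skip the unnecessary write
    database[url] = analysis_info
    return True
```

The boolean return value makes it easy for a caller to count how many entries a periodic run actually changed.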

According to this embodiment, the addresses of harmful websites can be collected and updated automatically, without human involvement, so as to provide useful information to an external service that identifies and blocks harmful sites, for example the harmful media blocking external service module 204, and the object information, that is, the media of harmful websites, can also be collected.

In addition, according to this embodiment, even when circumstances change, for example when a harmful website changes its address or is shut down, the information on harmful websites can be kept up to date by periodically determining whether websites are harmful.

The exemplary methods of the present disclosure are expressed as a series of operations for clarity of description, but this is not intended to limit the order in which the steps are performed; where necessary, the steps may be performed simultaneously or in a different order. To implement a method according to the present disclosure, other steps may be added to the illustrated steps, some steps may be omitted while the remaining steps are kept, or some steps may be omitted and other steps added.

The various embodiments of the present disclosure do not list every possible combination but are intended to describe representative aspects of the present disclosure, and the matters described in the various embodiments may be applied independently or in combinations of two or more.

In addition, the various embodiments of the present disclosure may be implemented by hardware, firmware, software, or a combination thereof. In the case of a hardware implementation, they may be implemented by one or more ASICs (Application Specific Integrated Circuits), DSPs (Digital Signal Processors), DSPDs (Digital Signal Processing Devices), PLDs (Programmable Logic Devices), FPGAs (Field Programmable Gate Arrays), general-purpose processors, controllers, microcontrollers, microprocessors, and the like.

The scope of the present disclosure includes software or machine-executable instructions (for example, an operating system, applications, firmware, programs, etc.) that cause operations according to the methods of the various embodiments to be executed on a device or computer, and a non-transitory computer-readable medium on which such software or instructions are stored and from which they can be executed on a device or computer.

100: device for updating harmful website information
102: crawling unit 104: web content analysis unit
106: harmfulness detection connection unit 108: update unit
110: harmful site database unit

Claims

A device for updating harmful website information, comprising:
a crawling unit that searches for a website through a search engine, based on search data related to feature data obtained by feature analysis of harmful sites and previously stored in a harmful site database unit, and that periodically and automatically acquires, from the searched website, the address of the website and object information related to the content of the website;
a web content analysis unit that analyzes feature data extracted from the object information to determine whether a calculated harmfulness exceeds a harmfulness threshold, determines that the website is a harmful site when the threshold is exceeded, and generates harmful analysis information including at least the address of the website and the feature data; and
an update unit that automatically updates the harmful site database unit with the harmful analysis information.

The device of claim 1,
wherein the content consists of at least one of text, video data, and image data, and the search data is at least one of keywords, images, and videos constituting the previously stored feature data of the harmful sites.

The device of claim 2,
wherein, when the content is the text, the feature data comprises the keyword together with words extracted by referring to harmful words stored in the harmful site database unit.

The device of claim 3,
wherein, when the content is the text, the web content analysis unit calculates the harmfulness using word2vec, doc2vec, or an LDA (Latent Dirichlet Allocation) topic analysis algorithm, based on the keyword and the extracted words constituting the feature data extracted from the object information.

The device of claim 3,
wherein, for the video data and the image data, the web content analysis unit calculates the harmfulness by applying machine learning or deep learning to the feature elements of the feature data extracted from the object information.

The device of claim 1,
wherein the crawling unit acquires object collection information including metadata related to the object information and classifies the object collection information by website.

The device of claim 6,
wherein, when the content is text, the harmful analysis information further includes at least one of the object collection information, a topic, and a context result value between words of the text, and when the content is video data or image data, the harmful analysis information further includes one or more of the object collection information, a topic, and the data constituting the object information.

The device of claim 3,
wherein the keywords, words, and feature elements estimated to be harmful shapes that are extracted in relation to the feature data are managed by at least one of the web content analysis unit and the harmful site database unit.

The device of claim 1,
wherein the web content analysis unit is connected, through a harmfulness detection connection unit, to a harmful media external detector configured as an external module, and the harmful media external detector analyzes the feature data extracted from the object information, determines whether the calculated harmfulness exceeds the harmfulness threshold, and notifies the web content analysis unit of the determination result.

The device of claim 1,
wherein the update unit expresses the harmful analysis information, including the address of the website determined to be the harmful site and the search data, as data using an ontology and stores it in the harmful site database unit.

The device of claim 1,
wherein the update unit updates the harmful site database unit with the harmful analysis information when the object information and the harmful analysis information of the website determined to be the harmful site differ from the information previously stored in the harmful site database unit.

A method of updating harmful website information, comprising:
a crawling step of searching for a website through a search engine, based on search data related to feature data obtained by feature analysis of harmful sites and previously stored in a harmful site database unit, and periodically and automatically acquiring, from the searched website, the address of the website and object information related to the content of the website;
a web content analysis step of analyzing feature data extracted from the object information to determine whether a calculated harmfulness exceeds a harmfulness threshold, determining that the website is a harmful site when the threshold is exceeded, and generating harmful analysis information including at least the address of the website and the feature data; and
an update step of automatically updating the harmful site database unit with the harmful analysis information.

The method of claim 12,
wherein the content consists of at least one of text, video data, and image data, and the search data is at least one of keywords, images, and videos constituting the previously stored feature data of the harmful sites.

The method of claim 13,
wherein, when the content is the text, the feature data comprises the keyword together with words extracted by referring to harmful words stored in the harmful site database unit.

The method of claim 14,
wherein, in the web content analysis step, for the text, the harmfulness is calculated using word2vec, doc2vec, or an LDA (Latent Dirichlet Allocation) topic analysis algorithm, based on the keyword and the extracted words constituting the feature data extracted from the object information, and for the video data and the image data, the harmfulness is calculated by applying machine learning or deep learning to the feature elements of the feature data extracted from the object information.

The method of claim 12,
wherein the crawling step further includes acquiring object collection information including metadata related to the object information and classifying the object collection information by website.

The method of claim 16,
wherein, when the content type is text, the harmful analysis information further includes at least one of the object collection information, a topic, and a context result value between words of the text, and when the content type is video data or image data, the harmful analysis information further includes one or more of the object collection information, a topic, and the data constituting the object information.

The method of claim 12,
wherein, in the web content analysis step, a harmful media external detector configured as an external module analyzes the feature data extracted from the object information and determines whether the calculated harmfulness exceeds the harmfulness threshold.

The method of claim 12,
wherein, in the update step, the harmful analysis information, including the address of the website determined to be the harmful site and the search data, is expressed as data using an ontology and stored in the harmful site database unit.

The method of claim 12,
wherein, in the update step, the harmful site database unit is updated with the harmful analysis information when the object information and the harmful analysis information of the website determined to be the harmful site differ from the information previously stored in the harmful site database unit.