KR20190107830A

KR20190107830A - Spam-tag based blog spam detection apparatus and method, storage media storing the same

Info

Publication number: KR20190107830A
Application number: KR1020180029039A
Authority: KR
Inventors: 김남규; 현윤진; 정해강; 부현경
Original assignee: 국민대학교산학협력단
Priority date: 2018-03-13
Filing date: 2018-03-13
Publication date: 2019-09-23
Also published as: KR102053636B1

Abstract

The present invention relates to an apparatus and a method for detecting blog spam based on spam tags. The apparatus comprises: a spam tag identification unit that identifies spam tags among a plurality of tags included in at least one post published on a blog, based on the body of the at least one post; a blog reliability calculation unit that calculates the reliability of the blog based on all posts published on the blog and the spam posts that contain the spam tags; and a blog spam detection unit that detects whether the blog is spam based on the reliability and a tag spam degree corresponding to the spam tag ratio of the spam posts. The apparatus can therefore detect blog spam by identifying the spam tags included in posts.

Description

Spam tag-based blog spam detection device and method, and recording media {SPAM-TAG BASED BLOG SPAM DETECTION APPARATUS AND METHOD, STORAGE MEDIA STORING THE SAME}

The present invention relates to spam tag-based blog spam detection technology and, more particularly, to a spam tag-based blog spam detection device and method that can detect blog spam by identifying spam tags included in posts.

A spam tag can be defined as a mismatched tag, that is, a tag that is not directly related to the subject of the text.

Korean Registered Patent No. 10-0902475 (2009.06.04) relates to a system and method for determining spam documents. By continuously monitoring spam document determination results, it can assign a different spam index to each spam detection technique according to the precision of a plurality of spam detection techniques, which can raise the precision of the spam document determination system along with its recall. Because a spam index is assigned per spam detection technique, a newly emerging technique can be applied to the system immediately simply by assigning it a spam index, which also improves the extensibility of the system.

Korean Registered Patent No. 10-0902475 (2009.06.04)

One embodiment of the present invention seeks to provide a spam tag-based blog spam detection device and method that can detect blog spam by identifying spam tags included in posts.

One embodiment of the present invention seeks to provide a spam tag-based blog spam detection device and method that can calculate the reliability of a blog based on all posts published on the blog and the spam posts that contain spam tags.

One embodiment of the present invention seeks to provide a spam tag-based blog spam detection device and method that can detect whether a blog is spam based on a weighted sum of the blog reliability and the tag spam degree.

Among the embodiments, a spam tag-based blog spam detection device includes: a spam tag identification unit that identifies spam tags among a plurality of tags included in at least one post published on a blog, based on the body of the at least one post; a blog reliability calculation unit that calculates the reliability of the blog based on all posts published on the blog and the spam posts that contain the spam tags; and a blog spam detection unit that detects whether the blog is spam based on the reliability and a tag spam degree corresponding to the spam tag ratio of the spam posts.

The spam tag identification unit may include: a post group generation module that extracts post key terms through text parsing of the at least one post and generates post groups by clustering based on the post key terms; a tag analysis module that extracts group key terms for a post group and performs a comparative analysis between the group key terms and the group tags included in the post group; and a spam tag identification module that identifies spam tags using the result of the comparative analysis.

Through the comparative analysis, the tag analysis module may derive, for each group tag, the appearance frequency, the number of appearance posts, and the appearance ratio within the post group.

The spam tag identification module may identify a group tag whose appearance ratio is below a specific threshold as a spam tag.

The blog reliability calculation unit may calculate, as the reliability of the blog, the number of spam posts divided by the total number of posts.

The blog spam detection unit may determine that the blog is spam when the weighted sum of the reliability and the tag spam degree exceeds a specific threshold.

Among the embodiments, a spam tag-based blog spam detection method includes: (a) identifying spam tags among a plurality of tags included in at least one post published on a blog, based on the body of the at least one post; (b) calculating the reliability of the blog based on all posts published on the blog and the spam posts that contain the spam tags; and (c) detecting whether the blog is spam based on the reliability and a tag spam degree corresponding to the spam tag ratio of the spam posts.

Step (a) may include: (a1) extracting post key terms through text parsing of the at least one post and generating post groups by clustering based on the post key terms; (a2) extracting group key terms for a post group and performing a comparative analysis between the group key terms and the group tags included in the post group; and (a3) identifying spam tags using the result of the comparative analysis.

Step (a2) may be a step of deriving, for each group tag, the appearance frequency, the number of appearance posts, and the appearance ratio within the post group through the comparative analysis.

Step (a3) may be a step of identifying a group tag whose appearance ratio is below a specific threshold as a spam tag.

Step (b) may be a step of calculating, as the reliability of the blog, the number of spam posts divided by the total number of posts.

Step (c) may be a step of determining that the blog is spam when the weighted sum of the reliability and the tag spam degree exceeds a specific threshold.

Among the embodiments, a computer-executable recording medium includes: a process of identifying spam tags among a plurality of tags included in at least one post published on a blog, based on the body of the at least one post; a process of calculating the reliability of the blog based on all posts published on the blog and the spam posts that contain the spam tags; and a process of detecting whether the blog is spam based on the reliability and a tag spam degree corresponding to the spam tag ratio of the spam posts.

The disclosed technology may have the following effects. However, this does not mean that a specific embodiment must include all of the following effects or only the following effects, so the scope of the disclosed technology should not be understood as limited thereby.

The spam tag-based blog spam detection device and method according to an embodiment of the present invention can calculate the reliability of a blog based on all posts published on the blog and the spam posts that contain spam tags.

The spam tag-based blog spam detection device and method according to an embodiment of the present invention can detect whether a blog is spam based on a weighted sum of the blog reliability and the tag spam degree.

FIG. 1 is a diagram illustrating a spam tag-based blog spam detection system according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating the blog spam detection device of FIG. 1.
FIG. 3 is a block diagram illustrating the spam tag identification unit of FIG. 2.
FIG. 4 is a flowchart illustrating the blog spam detection process performed by the blog spam detection device of FIG. 1.
FIG. 5 is a flowchart illustrating the spam tag identification process performed by the spam tag identification unit of FIG. 2.
FIG. 6 is an exemplary view illustrating the process of identifying spam tags performed by the spam tag identification unit of FIG. 2.

The description of the present invention is merely an embodiment for structural or functional explanation, and the scope of the present invention should not be construed as limited by the embodiments described herein. That is, because the embodiments may be modified in various ways and may take various forms, the scope of the present invention should be understood to include equivalents capable of realizing its technical idea. In addition, the objects or effects presented in the present invention do not mean that a specific embodiment must include all of them or only them, so the scope of the present invention should not be understood as limited thereby.

Meanwhile, the terms used in this application should be understood as follows.

Terms such as "first" and "second" are intended to distinguish one component from another, and the scope of rights should not be limited by these terms. For example, a first component may be named a second component, and similarly, a second component may be named a first component.

When a component is referred to as being "connected" to another component, it may be directly connected to that other component, but it should be understood that other components may exist in between. In contrast, when a component is referred to as being "directly connected" to another component, it should be understood that no other component exists in between. Other expressions describing the relationship between components, such as "between" and "immediately between" or "neighboring" and "directly neighboring", should be interpreted in the same way.

Singular expressions should be understood to include plural expressions unless the context clearly indicates otherwise. Terms such as "include" or "have" are intended to specify that a stated feature, number, step, operation, component, part, or combination thereof exists, and should be understood not to exclude in advance the existence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Identification codes for the steps (for example, a, b, c) are used for convenience of description and do not describe the order of the steps. Unless a specific order is clearly stated in context, the steps may occur in an order different from the one specified. That is, the steps may occur in the specified order, may be performed substantially simultaneously, or may be performed in the reverse order.

The present invention can be implemented as computer-readable code on a computer-readable recording medium, and the computer-readable recording medium includes all kinds of recording devices that store data readable by a computer system. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. The computer-readable recording medium can also be distributed over network-connected computer systems so that the computer-readable code is stored and executed in a distributed manner.

Unless otherwise defined, all terms used herein have the same meaning as commonly understood by a person of ordinary skill in the field to which the present invention belongs. Terms defined in commonly used dictionaries should be interpreted as consistent with their meaning in the context of the related art, and should not be interpreted as having an ideal or excessively formal meaning unless clearly defined in this application.

FIG. 1 is a diagram illustrating a spam tag-based blog spam detection system according to an embodiment of the present invention.

Referring to FIG. 1, the spam tag-based blog spam detection system (hereinafter, the blog spam detection system) 100 may include a user terminal 110, a blog spam detection device 130, and a database 150.

The user terminal 110 may be a computing device that can access a specific blog and view its contents. It may be implemented as a smartphone, a notebook, or a computer, but is not limited to these and may also be implemented as various other devices such as a tablet PC. The user terminal 110 may be connected to the blog spam detection device 130 through a network, and a plurality of user terminals 110 may be connected to the blog spam detection device 130 at the same time.

In one embodiment, the user terminal 110 may request spam detection for a specific blog from the blog spam detection device 130, and may receive and check the blog spam detection result from the blog spam detection device 130. In another embodiment, when the user terminal 110 accesses a post page of a specific blog composed of a plurality of posts, it may automatically request the blog spam detection device 130 to perform spam detection for the blog containing that post page, and may display the blog spam detection result based on the blog spam detection information received from the blog spam detection device 130.

The blog spam detection device 130 may be implemented as a server, corresponding to a computer or a program, that detects whether a specific blog is spam in response to a blog spam detection request received from the user terminal 110 and provides the detection result to the user terminal 110. The blog spam detection device 130 may be wirelessly connected to the user terminal 110 through Bluetooth, WiFi, or the like, and may exchange data with the user terminal 110 through a network.

The blog spam detection device 130 may be implemented to include the database 150, or may be implemented independently of the database 150. When implemented independently, the blog spam detection device 130 may be connected to the database 150 by wire or wirelessly to exchange data.

The database 150 is a storage device that can store the various information needed for blog spam detection. The database 150 may store blog information related to the spam detection requests received from the user terminal 110, and may store the post information of a blog and the tag information of each post used for blog spam detection. It is not limited to these and may store information collected or processed in various forms in the course of performing spam detection for a specific blog.

The database 150 may consist of at least one independent sub-database, each storing information belonging to a specific range, or may be an integrated database in which the at least one independent sub-database is merged into one. When it consists of independent sub-databases, the sub-databases may be wirelessly connected to each other through Bluetooth, WiFi, or the like, and may exchange data with each other through a network. When the database 150 is configured as an integrated database, it may include a control unit that integrates the sub-databases into one and manages the data exchange and control flow between them.

FIG. 2 is a block diagram illustrating the blog spam detection device of FIG. 1.

Referring to FIG. 2, the blog spam detection device 130 may include a spam tag identification unit 210, a blog reliability calculation unit 230, a blog spam detection unit 250, and a control unit 270.

The spam tag identification unit 210 may identify spam tags among the plurality of tags included in a post, based on the body of at least one post published on the blog. The spam tag identification unit 210 may identify a spam tag based on the relevance between the content of the post body and the tag. For example, it may judge the relevance of a tag from how frequently the tag appears in the post body and identify spam tags accordingly. The spam tag identification unit 210 is described in more detail with reference to FIG. 3.

The blog reliability calculation unit 230 may calculate the reliability of the blog based on all posts published on the blog and the spam posts that contain spam tags. Here, the reliability of a blog may be a numerical measure of how far the blog can be trusted. The more the posts published on a blog contain spam tags unrelated to the content of their bodies, the lower the reliability of the blog may be measured.

In one embodiment, the blog reliability calculation unit 230 may calculate, as the reliability of the blog, the number of spam posts divided by the total number of posts. The reliability of the blog may correspond to the spam tag usage index of the blog, and the spam tag usage index may correspond to the ratio of the number of spam posts to the total number of posts published on the blog. More specifically, the spam tag usage index may rise as the number of posts containing spam tags among the posts published on the blog increases, and the higher the spam tag usage index, the lower the reliability of the blog may be. In other words, the lower the spam tag usage index, the higher the reliability of the blog may be.
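The ratio described above can be sketched in a few lines of Python. This is a minimal illustration, not the device's actual implementation: the function name `blog_reliability`, the representation of a post as a list of tag strings, and the use of a single blog-wide set of spam tags are all assumptions made for the example.

```python
from typing import Iterable, List


def blog_reliability(posts: List[List[str]], spam_tags: Iterable[str]) -> float:
    """Return (number of spam posts) / (total number of posts).

    Assumption: each post is given as its list of tags, and a post counts
    as a spam post when it carries at least one tag from `spam_tags`.
    """
    if not posts:
        return 0.0  # assumed convention for an empty blog
    spam = set(spam_tags)
    spam_posts = sum(1 for tags in posts if spam.intersection(tags))
    return spam_posts / len(posts)


# Hypothetical blog with four posts, two of which carry a spam tag.
example_posts = [["pasta", "loan"], ["pasta"], ["casino", "oven"], ["recipe"]]
print(blog_reliability(example_posts, {"loan", "casino"}))
```

Note that the value computed here is the spam tag usage index, so a larger value indicates heavier use of spam tags.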

The blog spam detection unit 250 may detect whether the blog is spam based on the reliability and a tag spam degree corresponding to the spam tag ratio of the spam posts. In one embodiment, the tag spam degree may correspond to the ratio of the number of spam tags to the total number of tags. In another embodiment, the tag spam degree may correspond to the ratio of the total number of spam tags to the total number of tags across all posts in a blog that contain spam tags. In yet another embodiment, the tag spam degree may correspond, for the posts that contain spam tags, to the ratio of the average number of spam tags per post to the average number of tags per post.
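The three variants of the tag spam degree can be compared side by side in a short Python sketch. The function name `tag_spam_degrees`, the list-of-tag-lists input, and the reading of the first variant as a blog-wide ratio are assumptions for illustration; the text does not fix these details.

```python
from typing import Iterable, List, Tuple


def tag_spam_degrees(posts: List[List[str]],
                     spam_tags: Iterable[str]) -> Tuple[float, float, float]:
    """Return the three tag spam degree variants as (overall, within, per_post)."""
    spam = set(spam_tags)
    # Variant 1 (assumed blog-wide): spam tags / all tags over every post.
    all_tags = sum(len(tags) for tags in posts)
    all_spam = sum(1 for tags in posts for t in tags if t in spam)
    # Variants 2 and 3 only look at posts that contain a spam tag.
    spam_posts = [tags for tags in posts if spam.intersection(tags)]
    sp_tags = sum(len(tags) for tags in spam_posts)
    sp_spam = sum(1 for tags in spam_posts for t in tags if t in spam)
    n = len(spam_posts)
    overall = all_spam / all_tags if all_tags else 0.0
    within = sp_spam / sp_tags if sp_tags else 0.0
    # Variant 3: average spam tags per post / average tags per post.
    per_post = (sp_spam / n) / (sp_tags / n) if n and sp_tags else 0.0
    return overall, within, per_post


example_posts = [["pasta", "loan", "casino"], ["pasta"],
                 ["casino", "oven"], ["recipe", "sauce"]]
print(tag_spam_degrees(example_posts, {"loan", "casino"}))
```

Under these assumptions the second and third variants produce the same value, because both averages are taken over the same set of spam posts and the post count cancels out.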

The blog spam detection unit 250 may identify the posts that contain spam tags based on the spam tags identified by the spam tag identification unit 210, and may calculate the tag spam degree based on the identified posts. The blog spam detection unit 250 may determine the method for calculating the tag spam degree and detect whether the blog is spam using the tag spam degree calculated by that method.

In one embodiment, the blog spam detection unit 250 may determine that a blog is spam when the weighted sum of the blog's reliability and the tag spam degree exceeds a specific threshold. The blog spam detection unit 250 may judge whether the blog is spam by comparing, against a specific threshold, the weighted sum of the blog reliability calculated by the blog reliability calculation unit 230 and the tag spam degree corresponding to the spam tag ratio per post. Here, the specific threshold may be set in advance through the blog spam detection device 130.

For example, the blog spam detection unit 250 may judge whether the blog is spam by comparing the value calculated through the following equation against a specific threshold.

[Equation]

Here, W₀ and W₁ are weights, and W₀ + W₁ = 1. In addition, the average number of tags per post and the average number of spam tags per post may be calculated over only the posts in which spam tags are used.
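The equation itself is not reproduced in this text. From the surrounding definitions (a weighted sum of the blog reliability and the per-post tag spam degree, with weights summing to 1), it plausibly has the following form. Which weight attaches to which term is an assumption, and the symbols other than W₀ and W₁ are introduced here for illustration only.

```latex
\[
  \text{Score}
  = W_0 \cdot \frac{N_{\text{spam posts}}}{N_{\text{all posts}}}
  + W_1 \cdot \frac{\bar{t}_{\text{spam}}}{\bar{t}_{\text{all}}},
  \qquad W_0 + W_1 = 1
\]
```

Here \(N_{\text{spam posts}}\) and \(N_{\text{all posts}}\) are the number of spam posts and the total number of posts, and \(\bar{t}_{\text{spam}}\) and \(\bar{t}_{\text{all}}\) are the average number of spam tags per post and the average number of tags per post, both taken over the posts that use spam tags.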

The higher the blog's reliability value, the higher the spam tag usage index, and the higher the tag spam degree, the higher the spam tag ratio per post; therefore, the higher the weighted sum of the two, the more likely the blog is spam. Consequently, by uniformly judging a blog to be spam whenever the weighted sum of the reliability and the tag spam degree exceeds a specific threshold, the blog spam detection unit 250 can detect blog spam quickly.
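The threshold decision can be sketched as follows. The function name `is_spam_blog`, the default weights of 0.5 each, and the default threshold of 0.5 are hypothetical values chosen for the example; the text only requires that the two weights sum to 1 and that the threshold be set in advance.

```python
from typing import Tuple


def is_spam_blog(reliability: float, tag_spam_degree: float,
                 w0: float = 0.5, w1: float = 0.5,
                 threshold: float = 0.5) -> Tuple[bool, float]:
    """Return (is_spam, score) for the weighted-sum rule.

    `reliability` is the ratio of spam posts to all posts, so a larger
    value means heavier spam tag usage. Weights must sum to 1.
    """
    if abs(w0 + w1 - 1.0) > 1e-9:
        raise ValueError("w0 and w1 must sum to 1")
    score = w0 * reliability + w1 * tag_spam_degree
    # The blog is judged spam only when the score strictly exceeds the threshold.
    return score > threshold, score


flag, score = is_spam_blog(0.5, 0.6)
print(flag, round(score, 2))
```

Because the rule is a single comparison, the cost of the final decision is constant once the two input ratios are known.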

The control unit 270 may control the overall operation of the blog spam detection device 130 and manage the control flow or data flow among the spam tag identification unit 210, the blog reliability calculation unit 230, and the blog spam detection unit 250.

FIG. 3 is a block diagram illustrating the spam tag identification unit of FIG. 2.

Referring to FIG. 3, the spam tag identification unit 210 may include a post group generation module 310, a tag analysis module 330, a spam tag identification module 350, and a control module 370.

The post group generation module 310 may extract post key terms through text parsing of at least one post and generate post groups by clustering based on the post key terms. Here, because the post key terms are extracted from all the text included in the post body, they may overlap with the post tags included in the post, but they may also differ from the post tags.
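The text does not specify which parsing or clustering algorithm is used. As a rough illustration only, the sketch below substitutes a simple stand-in: key terms are the most frequent non-stopword tokens, and posts are grouped greedily by Jaccard similarity of their key-term sets. The names `key_terms`, `jaccard`, `group_posts`, the stopword list, and the similarity threshold are all hypothetical, and the tokenizer handles lowercase Latin text only (Korean text would need a morphological analyzer).

```python
import re
from collections import Counter
from typing import List, Tuple

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "with", "on"}


def key_terms(body: str, top_k: int = 5) -> List[str]:
    """Return up to top_k most frequent non-stopword tokens of a post body."""
    words = [w for w in re.findall(r"[a-z]+", body.lower()) if w not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(top_k)]


def jaccard(a: List[str], b: List[str]) -> float:
    """Jaccard similarity of two term lists, treated as sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def group_posts(bodies: List[str],
                min_similarity: float = 0.3) -> Tuple[List[List[int]], List[List[str]]]:
    """Greedy grouping: a post joins the first group whose seed post is similar enough."""
    terms = [key_terms(b) for b in bodies]
    groups: List[List[int]] = []  # each group is a list of post indices
    for i, t in enumerate(terms):
        for g in groups:
            if jaccard(t, terms[g[0]]) >= min_similarity:
                g.append(i)
                break
        else:
            groups.append([i])  # no similar group found, start a new one
    return groups, terms


bodies = ["Pasta recipe with tomato sauce and pasta",
          "Tomato sauce recipe for pasta",
          "Hiking trail in the mountains"]
print(group_posts(bodies)[0])
```

A production system would more likely use TF-IDF weighting and a standard clustering algorithm; the greedy scheme is used here only to keep the example self-contained.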

The tag analysis module 330 may extract group key terms for a post group and perform a comparative analysis between the group key terms and the group tags included in the post group. The tag analysis module 330 may use the post key terms extracted by the post group generation module 310 as they are. The tag analysis module 330 may merge the key terms of all posts belonging to a post group and use them as the key terms of the post group. For each group tag, that is, each tag included in the post group, the tag analysis module 330 may extract the various information needed for tag analysis by comparing the tag against the set of group key terms.

In one embodiment, the tag analysis module 330 may derive, for each group tag, the appearance frequency, the number of appearance posts, and the appearance ratio within the post group through the comparative analysis. The appearance frequency within the post group may correspond to the number of times the group tag appears across all posts belonging to the post group. The number of appearance posts within the post group may correspond to the number of posts, among those belonging to the post group, in which the group tag appears. The appearance ratio within the post group may correspond to the ratio of the number of posts in which the group tag appears to the number of all posts belonging to the post group.
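The three per-tag statistics can be sketched as below. The sketch assumes that a tag "appears" in a post when it occurs in that post's list of terms, with repeated occurrences counted toward the frequency. The names `TagStats` and `analyze_group_tags` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class TagStats:
    frequency: int   # total occurrences of the tag across the group
    post_count: int  # number of posts in which the tag occurs at least once
    ratio: float     # post_count / number of posts in the group


def analyze_group_tags(group_terms: List[List[str]],
                       group_tags: Iterable[str]) -> Dict[str, TagStats]:
    """Compute appearance frequency, appearance posts, and appearance ratio per tag.

    group_terms holds one term list per post of a single post group.
    """
    n_posts = len(group_terms)
    counters = [Counter(terms) for terms in group_terms]
    stats: Dict[str, TagStats] = {}
    for tag in set(group_tags):
        freq = sum(c[tag] for c in counters)
        posts = sum(1 for c in counters if c[tag] > 0)
        stats[tag] = TagStats(freq, posts, posts / n_posts if n_posts else 0.0)
    return stats


terms = [["pasta", "recipe", "pasta"], ["pasta", "sauce"], ["recipe", "oven"]]
for tag, s in sorted(analyze_group_tags(terms, ["pasta", "loan"]).items()):
    print(tag, s)
```

A tag such as "loan" that never occurs in the group's terms ends up with a ratio of zero, which is what makes it a candidate spam tag in the next step.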

The spam tag identification module 350 may detect spam tags using the result of the comparative analysis. The spam tag identification module 350 may determine whether a group tag is a spam tag based on the per-group-tag statistics derived by the tag analysis module 330, such as the appearance frequency, the number of appearance posts, and the appearance ratio.

In one embodiment, the spam tag identification module 350 may identify a group tag whose appearance ratio is below a specific threshold as a spam tag. For each group tag, when the appearance ratio in the post group derived by the tag analysis module 330 is below the specific threshold, the spam tag identification module 350 may uniformly determine the group tag to be a spam tag. The specific threshold may be set in advance by the blog spam detection device 130.
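The threshold rule can be sketched in a few lines. The function name `identify_spam_tags` and the default threshold of 0.2 are hypothetical; the description only states that the threshold is preset and that tags below it are treated as spam tags.

```python
def identify_spam_tags(appearance_ratio: dict[str, float],
                       threshold: float = 0.2) -> set[str]:
    """Return the group tags whose in-group appearance ratio is strictly below the threshold.

    appearance_ratio maps each group tag to the share of posts in its post
    group in which the tag appears. The default threshold is an assumed value.
    """
    return {tag for tag, ratio in appearance_ratio.items() if ratio < threshold}
```

Because the comparison is strict, a tag whose ratio equals the threshold is not flagged, which matches the "below the threshold" wording.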

The control module 370 may control the overall operation of the spam tag identification unit 210 and manage the control flow or data flow among the post group generation module 310, the tag analysis module 330, and the spam tag identification module 350.

FIG. 4 is a flowchart illustrating the blog spam detection process performed by the blog spam detection device of FIG. 1.

Referring to FIG. 4, the blog spam detection device 130 may identify, through the spam tag identification unit 210, a spam tag among the plurality of tags included in at least one post posted on a blog, based on the body of the at least one post (step S410). The blog spam detection device 130 may calculate, through the blog reliability calculation unit 230, the reliability of the blog based on all posts posted on the blog and the spam posts that include a spam tag (step S430). The blog spam detection device 130 may detect, through the blog spam detection unit 250, whether the blog is spam based on the reliability and the tag spam degree, which corresponds to the spam tag ratio of the spam posts (step S450).
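Steps S430 and S450 can be sketched as follows, using the formulas stated in the claims: the reliability is the number of spam posts divided by the total number of posts, and the blog is judged to be spam when a weighted sum of the reliability and the tag spam degree exceeds a threshold. Several details are assumptions made for illustration: the tag spam degree is computed as the mean share of spam tags among the tags of each spam post, and the weights, the threshold, and the function name `detect_blog_spam` are hypothetical.

```python
def detect_blog_spam(posts: list[list[str]], spam_tags: set[str],
                     w_reliability: float = 0.5, w_tag_spam: float = 0.5,
                     threshold: float = 0.5) -> bool:
    """Decide whether a blog is spam from the tags of its posts.

    posts: one list of tags per post on the blog.
    spam_tags: tags already identified as spam tags (step S410).
    """
    if not posts:
        return False
    # A spam post is a post that carries at least one spam tag.
    spam_posts = [tags for tags in posts if any(t in spam_tags for t in tags)]
    # Reliability as worded in the claims: spam posts divided by all posts.
    reliability = len(spam_posts) / len(posts)
    # Assumed aggregation: mean share of spam tags per spam post.
    ratios = [sum(t in spam_tags for t in tags) / len(tags) for tags in spam_posts]
    tag_spam_degree = sum(ratios) / len(ratios) if ratios else 0.0
    score = w_reliability * reliability + w_tag_spam * tag_spam_degree
    return score > threshold
```

Note that, as defined in the claims, a higher "reliability" value means a larger share of spam posts, so both terms of the weighted sum grow as the blog looks more like spam.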

FIG. 5 is a flowchart illustrating the spam tag identification process performed by the spam tag identification unit of FIG. 2. The spam tag identification unit 210 may, through the post group generation module 310, extract post key terms through text parsing of at least one post and generate post groups by performing clustering based on the post key terms (step S510). The spam tag identification unit 210 may, through the tag analysis module 330, extract group key terms for a post group and perform a comparative analysis between the group key terms and the group tags included in the post group (step S530). The spam tag identification unit 210 may, through the spam tag identification module 350, detect spam tags using the result of the comparative analysis (step S550).

FIG. 6 is an exemplary diagram illustrating the process of identifying spam tags performed by the spam tag identification unit of FIG. 2.

Referring to FIG. 6, the spam tag identification unit 210 may generate post groups through the post group generation module 310 and extract key terms for each post group through the tag analysis module 330. Table 1 (610) shows an example of extracting the group tags (Post Tag) for each post group (PG_ID). Here, the group tags may correspond to the set of tags used in the posts belonging to a post group. Freq.(inPost) corresponds to the frequency with which each tag appears in the body of each post belonging to a particular post group.

The spam tag identification unit 210 may perform a comparative analysis of the key terms and the post tags for each post group through the tag analysis module 330. Table 2 (630) shows, for each group tag and post, the appearance frequency in the post body (Freq.(in Post)), the number of posts in which the group tag appears within the post group to which it belongs (Num. of Post), the total number of posts in that post group (Num. of Post(for each PG)), and the ratio of the number of posts in which the group tag appears to the total number of posts in that post group (Ratio of Appearance(in PG)). The spam tag identification unit 210 may identify spam tags according to the value of Ratio of Appearance(in PG) through the spam tag identification module 350.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that the invention may be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the claims below.

100: spam tag-based blog spam detection system
110: user terminal 130: blog spam detection device
150: database
210: spam tag identification unit 230: blog reliability calculation unit
250: blog spam detection unit 270: control unit
310: post group generation module 330: tag analysis module
350: spam tag identification module 370: control module
610: Table 1 630: Table 2

Claims

A spam tag-based blog spam detection device comprising:
a spam tag identification unit configured to identify a spam tag among a plurality of tags included in at least one post posted on a blog, based on a body of the at least one post;
a blog reliability calculation unit configured to calculate a reliability of the blog based on all posts posted on the blog and spam posts that include the spam tag; and
a blog spam detection unit configured to detect whether the blog is spam based on the reliability and a tag spam degree corresponding to a spam tag ratio of the spam posts.

The device of claim 1, wherein the spam tag identification unit comprises:
a post group generation module configured to extract post key terms through text parsing of the at least one post and to generate a post group by performing clustering based on the post key terms;
a tag analysis module configured to extract group key terms for the post group and to perform a comparative analysis between the group key terms and group tags included in the post group; and
a spam tag identification module configured to identify the spam tag using a result of the comparative analysis.

The device of claim 2, wherein the tag analysis module derives, for each group tag, an appearance frequency in the post group, a number of appearance posts, and an appearance ratio through the comparative analysis.

The device of claim 3, wherein the spam tag identification module identifies a group tag whose appearance ratio is below a specific threshold as a spam tag.

The device of claim 1, wherein the blog reliability calculation unit calculates, as the reliability of the blog, a value obtained by dividing the number of the spam posts by the total number of posts.

The device of claim 1, wherein the blog spam detection unit determines the blog to be spam when a weighted sum of the reliability and the tag spam degree exceeds a specific threshold.

A blog spam detection method performed by a spam tag-based blog spam detection device, the method comprising:
(a) identifying a spam tag among a plurality of tags included in at least one post posted on a blog, based on a body of the at least one post;
(b) calculating a reliability of the blog based on all posts posted on the blog and spam posts that include the spam tag; and
(c) detecting whether the blog is spam based on the reliability and a tag spam degree corresponding to a spam tag ratio of the spam posts.

The method of claim 7, wherein step (a) comprises:
(a1) extracting post key terms through text parsing of the at least one post and generating a post group by performing clustering based on the post key terms;
(a2) extracting group key terms for the post group and performing a comparative analysis between the group key terms and group tags included in the post group; and
(a3) identifying the spam tag using a result of the comparative analysis.

The method of claim 8, wherein step (a2) comprises deriving, for each group tag, an appearance frequency in the post group, a number of appearance posts, and an appearance ratio through the comparative analysis.

The method of claim 9, wherein step (a3) comprises identifying a group tag whose appearance ratio is below a specific threshold as a spam tag.

The method of claim 7, wherein step (b) comprises calculating, as the reliability of the blog, a value obtained by dividing the number of the spam posts by the total number of posts.

The method of claim 7, wherein step (c) comprises determining the blog to be spam when a weighted sum of the reliability and the tag spam degree exceeds a specific threshold.

A computer-executable recording medium recording a blog spam detection method performed by a spam tag-based blog spam detection device, the method comprising:
identifying a spam tag among a plurality of tags included in at least one post posted on a blog, based on a body of the at least one post;
calculating a reliability of the blog based on all posts posted on the blog and spam posts that include the spam tag; and
detecting whether the blog is spam based on the reliability and a tag spam degree corresponding to a spam tag ratio of the spam posts.