KR20200072724A

KR20200072724A - An apparatus for detecting spam news with spam phrases, a method thereof and computer recordable medium storing program to perform the method

Info

Publication number: KR20200072724A
Application number: KR1020180160630A
Authority: KR
Inventors: 김창기; 박경수
Original assignee: 줌인터넷 주식회사
Priority date: 2018-12-13
Filing date: 2018-12-13
Publication date: 2020-06-23
Also published as: KR102149917B1

Abstract

The present invention relates to an apparatus for detecting spam news containing spam phrases, a method therefor, and a computer-readable recording medium on which a program for performing the method is recorded. The apparatus includes: a category classification unit that derives the probability that each of a plurality of paragraphs included in the news belongs to each of a plurality of categories; a correlation calculation unit that calculates, through correlation analysis based on those probabilities, the correlation between each paragraph and the other paragraphs; and a spam news discrimination unit that identifies the sentence having the lowest correlation among the plurality of paragraphs as a suspected spam paragraph and determines whether the news is real or fake according to the correlation of the suspected spam paragraph.

Description

An apparatus for detecting spam news with spam phrases, a method thereof, and a computer-readable recording medium storing a program to perform the method

The present invention relates to spam news detection technology and, more particularly, to an apparatus for detecting spam news containing spam phrases, a method therefor, and a computer-readable recording medium on which a program performing the method is recorded.

Many news articles on the Internet insert sentences or paragraphs unrelated to the body of the article in order to make readers read spam with specific content.

[Prior Art Documents]

[Patent Document] Korean Registered Patent No. 1864439, registered June 1, 2018 (Title: Fake news discrimination system having a post graphical user interface window capable of discriminating fake news)

An object of the present invention is to provide an apparatus for detecting spam news in which spam phrases unrelated to the body of the article have been inserted, a method therefor, and a computer-readable recording medium on which a program performing the method is recorded.

To achieve the above object, an apparatus for detecting spam news according to a preferred embodiment of the present invention includes: a category classification unit that derives the probability that each of a plurality of paragraphs included in the news belongs to each of a plurality of categories; a correlation calculation unit that calculates the correlation between each paragraph and the other paragraphs through correlation analysis based on the probabilities of belonging to the plurality of categories; and a spam news discrimination unit that identifies the sentence having the lowest correlation value among the plurality of paragraphs as a suspected spam paragraph and determines whether the news is real or fake according to the correlation of the suspected spam paragraph.

The spam news discrimination unit derives, based on the distributions of correlation for fake news and for real news, a fake news probability model that yields the probability of being fake news and a real news probability model that yields the probability of being real news; substitutes the correlation of the suspected spam paragraph into the fake news probability model and the real news probability model to calculate the probability that the news is real and the probability that the news is fake; and determines whether the news is real or fake according to those two probabilities.

The spam news discrimination unit determines that the news is fake if the correlation of the suspected spam paragraph is less than a predetermined value.

The correlation calculation unit calculates, for each paragraph, the correlation between that paragraph's per-category probabilities and the average of the per-category probabilities of the other paragraphs.

The category classification unit is trained so that, when a paragraph is input, it outputs as numerical values the probabilities that the input paragraph corresponds to each of the plurality of categories.

To achieve the above object, a method for detecting spam news according to a preferred embodiment of the present invention includes: deriving, by a category classification unit, the probability that each of a plurality of paragraphs included in the news belongs to each of a plurality of categories; calculating, by a correlation calculation unit, the correlation between each paragraph and the other paragraphs through correlation analysis based on the probabilities of belonging to the plurality of categories; and identifying, by a spam news discrimination unit, the sentence having the lowest correlation value among the plurality of paragraphs as a suspected spam paragraph and determining whether the news is real or fake according to the correlation of the suspected spam paragraph.

The step of determining whether the news is real or fake includes: deriving, by the spam news discrimination unit, based on the distributions of correlation for fake news and for real news, a fake news probability model that yields the probability of being fake news and a real news probability model that yields the probability of being real news; calculating, by the spam news discrimination unit, the probability that the news is real and the probability that the news is fake by substituting the correlation of the suspected spam paragraph into the fake news probability model and the real news probability model; and determining, by the spam news discrimination unit, whether the news is real or fake according to the probability of being real news and the probability of being fake news.

In the step of determining whether the news is real or fake, the spam news discrimination unit determines that the news is fake if the correlation of the suspected spam paragraph is less than a predetermined value.

In the step of calculating the correlation, the correlation is calculated, for each paragraph, between that paragraph's per-category probabilities and the average of the per-category probabilities of the other paragraphs.

The method further includes, before the step of deriving the probabilities, training the category classifier so that, when a paragraph is input, it outputs as numerical values the probabilities that the input paragraph corresponds to each of the plurality of categories.

According to another aspect of the present invention, a computer-readable recording medium may be provided on which a program for performing the method for detecting spam news according to a preferred embodiment of the present invention is recorded.

According to the present invention, spam news can be detected in advance, saving the time otherwise wasted reading spam news. This can provide users with a new user experience (UX).

FIG. 1 is a block diagram illustrating the configuration of an apparatus for detecting spam news according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating the configuration of the control unit of a spam detection device according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating a method for detecting spam news according to an embodiment of the present invention.
FIGS. 4 to 6 are diagrams illustrating a method for detecting spam news according to an embodiment of the present invention.

Before the detailed description of the present invention, note that the terms and words used in this specification and the claims should not be construed as limited to their ordinary or dictionary meanings. They should be interpreted with meanings and concepts consistent with the technical idea of the present invention, on the principle that an inventor may appropriately define the concepts of terms in order to describe the invention in the best way. Accordingly, the embodiments described in this specification and the configurations shown in the drawings are merely the most preferred embodiments of the present invention and do not represent all of its technical ideas, so it should be understood that various equivalents and modifications that could replace them may exist at the time of filing.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. Note that the same components are denoted by the same reference numerals wherever possible in the drawings. Detailed descriptions of well-known functions and configurations that could obscure the gist of the present invention are omitted. For the same reason, some components in the drawings are exaggerated, omitted, or shown schematically, and the size of each component does not entirely reflect its actual size.

First, the configuration of an apparatus for detecting spam news according to an embodiment of the present invention is described. FIG. 1 is a block diagram illustrating the configuration of the apparatus for detecting spam news according to an embodiment of the present invention. FIG. 2 is a block diagram illustrating the configuration of the control unit of the spam detection device according to an embodiment of the present invention.

Referring first to FIG. 1, the apparatus for detecting spam news (100; hereinafter abbreviated as the "spam detection device") includes a communication unit 110, an input unit 120, a display unit 130, a storage unit 140, and a control unit 200.

The communication unit 110 communicates with other devices over a network. It may include a radio frequency (RF) transmitter (Tx) that up-converts and amplifies the frequency of a transmitted signal and an RF receiver (Rx) that low-noise amplifies a received signal and down-converts its frequency. The communication unit 110 may also include a modem that modulates transmitted signals and demodulates received signals. For example, under the control of the control unit 200, the communication unit 110 may connect to a server (not shown) that provides Internet news and download news.

The input unit 120 receives the user's key operations for controlling the spam detection device 100, generates an input signal, and passes it to the control unit 200. The input unit 120 may include various keys for controlling the spam detection device 100. When the display unit 130 is a touch screen, the functions of these keys may be performed on the display unit 130, and if all functions can be performed with the touch screen alone, the input unit 120 may be omitted.

The display unit 130 visually presents the menus of the spam detection device 100, input data, function setting information, and various other information to the user. The display unit 130 outputs screens such as the boot screen, standby screen, and menu screen of the spam detection device 100. The display unit 130 may be formed of a liquid crystal display (LCD), organic light emitting diodes (OLED), active matrix organic light emitting diodes (AMOLED), or the like. The display unit 130 may also be implemented as a touch screen, in which case it includes a touch sensor that detects the user's touch input. The touch sensor may be a touch-sensing sensor of the capacitive overlay, pressure, resistive overlay, or infrared beam type, or may be a pressure sensor. Besides these, any kind of sensor device capable of sensing the contact or pressure of an object may be used as the touch sensor of the present invention. The touch sensor detects the user's touch input, generates a detection signal, and transmits it to the control unit 200. In particular, when the display unit 130 is a touch screen, some or all of the functions of the input unit 120 may be performed through the display unit 130.

The storage unit 140 stores the programs and data necessary for the operation of the spam detection device 100. In particular, the storage unit 140 stores a dictionary of synonyms and antonyms, a news database containing a plurality of Internet news articles, and the like. The data stored in the storage unit 140 may be deleted, changed, or added to according to the user's operation.

The control unit 200 may control the overall operation of the spam detection device 100 and the signal flow between its internal blocks (110 to 140), and may perform a data processing function. The control unit 200 may be a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), or the like.

Referring to FIG. 2, the control unit 200 includes a category classification unit 210, a correlation calculation unit 220, and a spam news discrimination unit 230.

The category classification unit 210 derives the probabilities that the paragraphs included in the news belong to a plurality of categories. The category classification unit 210 is an artificial neural network trained so that, when a paragraph is input, it outputs as numerical values the probabilities that the input paragraph corresponds to each of the plurality of categories. For example, when a paragraph such as "The KOSPI index fell sharply again after just one day, dropping below the 2,400 level" is input, the unit is trained to derive the probabilities {0.42, 0.12, 0.97, 0.24, 0.01, 0.01} that the paragraph belongs to each of the categories {society, politics, economy, international, entertainment, sports}.
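The example probabilities above sum to 1.77 rather than 1, so they are not a single distribution over categories. One reading, offered here only as an assumption because the source does not describe the network's output layer, is that each category gets an independent score squashed to the range (0, 1). The sketch below illustrates that reading with made-up raw scores; `category_probabilities` and `example_logits` are hypothetical names, not part of the patent.

```python
import math

CATEGORIES = ["society", "politics", "economy", "international", "entertainment", "sports"]


def sigmoid(z: float) -> float:
    """Squash a raw score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def category_probabilities(logits):
    """Map one raw score per category to an independent probability.

    Assumption: the classifier ends in one sigmoid per category, so the
    outputs need not sum to 1 (as in the example {0.42, 0.12, 0.97, ...}).
    """
    if len(logits) != len(CATEGORIES):
        raise ValueError("expected one score per category")
    return {name: sigmoid(z) for name, z in zip(CATEGORIES, logits)}


# Hypothetical raw scores for a paragraph about the stock market.
example_logits = [-0.3, -2.0, 3.5, -1.2, -4.6, -4.6]
probs = category_probabilities(example_logits)
```

Whatever the real architecture is, the later steps only need one vector of per-category probabilities per paragraph.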

The correlation calculation unit 220 calculates the correlation between each paragraph and the other paragraphs through correlation analysis based on the category probabilities derived by the category classification unit 210.

The spam news discrimination unit 230 identifies the sentence having the lowest correlation value among the plurality of paragraphs as a suspected spam paragraph and determines whether the news is real or fake according to the correlation of the suspected spam paragraph.

The operation of the control unit 200, including the category classification unit 210, the correlation calculation unit 220, and the spam news discrimination unit 230, is described in more detail below.

Next, a method for detecting spam news according to an embodiment of the present invention is described. FIG. 3 is a flowchart illustrating the method for detecting spam news according to an embodiment of the present invention. FIGS. 4 to 6 are diagrams illustrating the method for detecting spam news according to an embodiment of the present invention.

Before the spam news detection process begins, it is assumed that the category classification unit 210 has already been trained, using a deep learning technique, so that when a paragraph is input, it outputs as numerical values the probabilities that the input paragraph corresponds to each of the plurality of categories. For example, when a paragraph such as "The KOSPI index fell sharply again after just one day, dropping below the 2,400 level" is input, the category classification unit 210 is trained to derive the probabilities {0.42, 0.12, 0.97, 0.24, 0.01, 0.01} that the paragraph belongs to each of the categories {society, politics, economy, international, entertainment, sports}.

Referring to FIG. 3, in step S110 the control unit 200 of the spam detection device 100 may connect, through the communication unit 110, to a web server operating a web page that provides news articles and download the news contained in that web page.

In step S120, the category classification unit 210 of the control unit 200 derives, as numerical values, the probabilities that the paragraphs included in the news belong to the plurality of categories. For example, assuming that the news contains four paragraphs (41) as shown in FIG. 4, the category classification unit 210 derives the probabilities (42) that the four paragraphs belong to categories such as society, politics, economy, international, entertainment, and sports, as in Table 1 below.

[Table 1]

Next, in step S130, the correlation calculation unit 220 of the control unit 200 calculates the correlation between each paragraph and the other paragraphs through correlation analysis based on the probabilities that the paragraphs belong to the plurality of categories. Specifically, for each paragraph, the correlation calculation unit 220 calculates the correlation between that paragraph's per-category probabilities and the average of the per-category probabilities of the other paragraphs.

For example, based on the probability values in Table 1, the correlation calculation unit 220 may calculate the correlation between paragraph 1 and the other paragraphs, that is, paragraphs 2, 3, and 4. Specifically, the correlation calculation unit 220 first averages the per-category probabilities of paragraphs 2, 3, and 4, which gives {0.30, 0.41, 0.04, 0.65, 0.32, 0.01}. It then calculates the correlation between the per-category probabilities of paragraph 1, {0.56, 0.64, 0.21, 0.98, 0.02, 0.01}, and that average, {0.30, 0.41, 0.04, 0.65, 0.32, 0.01}.
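The leave-one-out calculation described above can be sketched as follows. The source says only "correlation analysis" and does not name a coefficient; the Pearson coefficient is assumed here because the later examples use negative values such as -0.2. The function names are hypothetical, and only the two probability vectors at the bottom are taken from the text.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant vector as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def leave_one_out_correlations(probs):
    """For each paragraph, correlate its category probabilities with the
    average category probabilities of all the other paragraphs."""
    result = []
    for i, row in enumerate(probs):
        others = [r for j, r in enumerate(probs) if j != i]
        mean_others = [sum(col) / len(others) for col in zip(*others)]
        result.append(pearson(row, mean_others))
    return result


# Values quoted in the description: paragraph 1 and the average of paragraphs 2, 3, 4.
paragraph_1 = [0.56, 0.64, 0.21, 0.98, 0.02, 0.01]
mean_of_2_3_4 = [0.30, 0.41, 0.04, 0.65, 0.32, 0.01]
r1 = pearson(paragraph_1, mean_of_2_3_4)
```

Under the Pearson assumption, `r1` comes out strongly positive, which matches the intuition that paragraph 1 shares its dominant categories with the rest of the article.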

In this way, the correlation calculation unit 220 may calculate the correlations (43) shown in Table 2 below.

[Table 2]

In step S140, the spam news discrimination unit 230 identifies the sentence having the lowest correlation (42) value among the plurality of paragraphs as the suspected spam paragraph. For example, according to Table 2, paragraph 3 has the lowest correlation, so the spam news discrimination unit 230 identifies paragraph 3 as the suspected spam paragraph.
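Step S140 amounts to picking the minimum of the per-paragraph correlations. A minimal sketch follows; the correlation values are hypothetical placeholders (the values of Table 2 are not reproduced here), chosen only so that paragraph 3 is lowest, as in the example. `find_suspected_paragraph` is an illustrative name.

```python
def find_suspected_paragraph(correlations):
    """Return (1-based paragraph number, correlation) of the lowest correlation."""
    if not correlations:
        raise ValueError("at least one paragraph is required")
    index = min(range(len(correlations)), key=lambda i: correlations[i])
    return index + 1, correlations[index]


# Hypothetical correlations for paragraphs 1 to 4; not the patent's Table 2.
hypothetical_correlations = [0.83, 0.75, -0.20, 0.61]
suspect, value = find_suspected_paragraph(hypothetical_correlations)
```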

Then, in step S150, the spam news discrimination unit 230 determines whether the news is real or fake according to the correlation of the suspected spam paragraph.

According to one embodiment, as shown in FIGS. 5 and 6, the spam news discrimination unit 230 derives, based on a pre-stored correlation distribution for fake news (A1) and correlation distribution for real news (B1), a fake news probability model (A2) that yields the probability of being fake news and a real news probability model (B2) that yields the probability of being real news. The spam news discrimination unit 230 then substitutes the correlation of the suspected spam paragraph into the fake news probability model (A2) and the real news probability model (B2) to calculate the probability that the news is real and the probability that the news is fake. Finally, the spam news discrimination unit 230 determines whether the news is real or fake according to which of the two probabilities is larger.
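The source does not state what form the probability models (A2) and (B2) take. As one possible reading, the sketch below fits a normal distribution to each class's observed correlations and compares the two densities at the suspected paragraph's correlation. The Gaussian form, the sample values, and the function names are all assumptions made for illustration.

```python
from statistics import NormalDist


def fit_models(fake_correlations, real_correlations):
    """Fit one normal distribution per class (assumed model form)."""
    fake_model = NormalDist.from_samples(fake_correlations)
    real_model = NormalDist.from_samples(real_correlations)
    return fake_model, real_model


def classify(suspect_correlation, fake_model, real_model):
    """Compare the two class densities at the suspect's correlation.

    The returned scores are probability densities used as likelihoods,
    not calibrated probabilities.
    """
    p_fake = fake_model.pdf(suspect_correlation)
    p_real = real_model.pdf(suspect_correlation)
    label = "fake" if p_fake > p_real else "real"
    return label, p_fake, p_real


# Hypothetical lowest-paragraph correlations from labelled articles.
fake_samples = [-0.45, -0.30, -0.25, -0.10, -0.35]
real_samples = [0.40, 0.55, 0.70, 0.62, 0.35]
fake_model, real_model = fit_models(fake_samples, real_samples)
label, p_fake, p_real = classify(-0.2, fake_model, real_model)
```

With two such curves, the point where they cross is a natural candidate for a fixed reference value, which is one way to arrive at a single cutoff on the correlation.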

According to another embodiment, the spam news discrimination unit 230 determines that the news is fake if the correlation of the suspected spam paragraph is less than a predetermined value.

For example, assume that the news includes paragraphs 1 to 4 and that the correlation calculation unit 220 has calculated the correlations with paragraphs 2, 3, and 4 as shown in Table 3 below. Also assume that the predetermined correlation value used as the criterion for judging the suspected spam paragraph is -0.17. For example, as shown in FIG. 6, -0.17 may be set as the reference value for judging the suspected spam paragraph based on the fake news probability model (A2) and the real news probability model (B2).

[Table 3]

In this case, because the correlation of paragraph 3 is -0.2, which is less than -0.17, the spam news discrimination unit 230 determines that the news containing paragraph 3 is fake news.
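The threshold rule in this example reduces to a single comparison. The sketch below uses the two numbers from the text (-0.17 as the reference value and -0.2 as the correlation of paragraph 3); the function name is hypothetical.

```python
REFERENCE_VALUE = -0.17  # reference value from the example in the text


def is_fake_by_threshold(suspect_correlation, reference=REFERENCE_VALUE):
    """Flag the article as fake when the suspected paragraph's
    correlation is strictly below the reference value."""
    return suspect_correlation < reference


verdict = is_fake_by_threshold(-0.2)  # correlation of paragraph 3 in the example
```

The comparison is strict, following the wording "less than a predetermined value", so an article whose lowest correlation equals the reference value is not flagged.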

Meanwhile, the method according to the embodiment of the present invention described above may be implemented as a program readable by various computer means and recorded on a computer-readable recording medium. The recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the recording medium may be specially designed and configured for the present invention, or may be known to and usable by those skilled in computer software. Examples of the recording medium include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine language code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. Such hardware devices may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

The present invention has been described above using several preferred embodiments, but these embodiments are illustrative and not limiting. Those of ordinary skill in the art to which the present invention pertains will understand that various changes and modifications can be made under the doctrine of equivalents without departing from the spirit of the present invention and the scope of rights set forth in the appended claims.

100: spam detection device
110: communication unit
120: input unit
130: display unit
140: storage unit
200: control unit
210: category classification unit
220: correlation calculation unit
230: spam news discrimination unit

Claims

An apparatus for detecting spam news, comprising:
a category classification unit that derives, for each of a plurality of paragraphs included in news, a probability that the paragraph belongs to each of a plurality of categories;
a correlation calculation unit that calculates a correlation between each paragraph and the other paragraphs through correlation analysis based on the probabilities of belonging to the plurality of categories; and
a spam news discrimination unit that identifies the paragraph having the lowest correlation among the plurality of paragraphs as a suspected spam paragraph and determines whether the news is real or fake according to the correlation of the suspected spam paragraph.
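
As a non-limiting illustration of the discrimination step recited in this claim, the following Python sketch selects the paragraph with the lowest correlation and passes that value to a decision rule. The function names `find_suspect` and `classify_news`, the sample correlation values, and the decision rule `c < 0.0` are hypothetical and are not taken from the specification.

```python
def find_suspect(correlations):
    """Return (index, value) of the paragraph with the lowest correlation."""
    index = min(range(len(correlations)), key=lambda i: correlations[i])
    return index, correlations[index]


def classify_news(correlations, is_fake):
    """Apply a caller-supplied decision rule to the suspected spam paragraph."""
    index, value = find_suspect(correlations)
    return {
        "suspect_index": index,
        "suspect_correlation": value,
        "fake": is_fake(value),
    }


# Hypothetical per-paragraph correlations for a four-paragraph article.
# The rule "negative correlation means fake" is an assumption for illustration.
example = classify_news([0.97, 0.97, 0.98, -0.59], is_fake=lambda c: c < 0.0)
```

Passing the decision rule as a parameter keeps the selection of the suspected paragraph separate from the model-based or threshold-based determination recited in the dependent claims.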

The apparatus of claim 1, wherein the spam news discrimination unit:
derives, based on distributions of the correlation for fake news and for real news, a fake news probability model that yields a probability of being fake news and a real news probability model that yields a probability of being real news;
calculates a probability that the news is real news and a probability that the news is fake news by substituting the correlation of the suspected spam paragraph into the fake news probability model and the real news probability model; and
determines whether the news is real or fake based on the probability of being real news and the probability of being fake news.
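
One possible reading of this claim can be sketched as follows, under the assumption that each probability model is a normal distribution fitted to observed suspect-paragraph correlations. The claim does not name a distribution family; the Gaussian form, the sample values, the equal prior, and the name `news_probabilities` are all assumptions made for illustration.

```python
from statistics import NormalDist

# Hypothetical suspect-paragraph correlations observed in labeled news.
fake_samples = [-0.60, -0.35, -0.10, 0.05, 0.20]
real_samples = [0.70, 0.80, 0.85, 0.90, 0.95]

# Assumption: model each distribution of correlations as a Gaussian.
fake_model = NormalDist.from_samples(fake_samples)
real_model = NormalDist.from_samples(real_samples)


def news_probabilities(suspect_correlation, prior_fake=0.5):
    """Return (P(fake), P(real)) for a suspect correlation via Bayes' rule."""
    like_fake = fake_model.pdf(suspect_correlation) * prior_fake
    like_real = real_model.pdf(suspect_correlation) * (1.0 - prior_fake)
    total = like_fake + like_real
    if total == 0.0:
        return 0.5, 0.5  # both densities underflowed; no evidence either way
    return like_fake / total, like_real / total


p_fake, p_real = news_probabilities(-0.2)
verdict = "fake" if p_fake > p_real else "real"
```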

The apparatus of claim 1, wherein the spam news discrimination unit:
derives, based on distributions of the correlation for fake news and for real news, a fake news probability model that yields a probability of being fake news and a real news probability model that yields a probability of being real news; and
determines that the news is fake if the correlation of the suspected spam paragraph is less than a reference value set based on the fake news probability model and the real news probability model.
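
The claim does not state how the reference value is set from the two models. As one hedged possibility, the sketch below assumes Gaussian models with hypothetical parameters and takes the reference value to be the correlation at which the two densities are equal, located by bisection between the two means. The names `reference_value` and `is_fake` are illustrative only.

```python
from statistics import NormalDist

# Hypothetical model parameters; in practice they would be fitted to data.
fake_model = NormalDist(mu=-0.16, sigma=0.32)
real_model = NormalDist(mu=0.84, sigma=0.10)


def reference_value(fake, real, iterations=60):
    """Find the correlation between the two means where the densities cross.

    Assumes fake.mean < real.mean, so the fake density dominates at the
    lower end of the interval and the real density at the upper end.
    """
    lo, hi = fake.mean, real.mean
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if fake.pdf(mid) > real.pdf(mid):
            lo = mid  # crossing point lies to the right of mid
        else:
            hi = mid  # crossing point lies at or to the left of mid
    return (lo + hi) / 2.0


threshold = reference_value(fake_model, real_model)


def is_fake(suspect_correlation):
    """Judge the news fake when the suspect correlation is below the threshold."""
    return suspect_correlation < threshold
```

With a precomputed reference value, the determination at detection time reduces to a single comparison, which may be preferable when the models change rarely.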

The apparatus of claim 1, wherein the correlation calculation unit calculates, for each paragraph, a correlation between the per-category probabilities of that paragraph and the average per-category probabilities of the remaining paragraphs.
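
The calculation recited in this claim can be illustrated with the following Python sketch, which assumes the correlation is the Pearson correlation coefficient (the claim says only "correlation"). The function names, the sample probabilities, and the choice to return 0.0 for a constant vector are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant vector as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def leave_one_out_correlations(probs):
    """probs[i][k] is the probability that paragraph i belongs to category k."""
    n = len(probs)
    num_categories = len(probs[0])
    result = []
    for i, row in enumerate(probs):
        others = [probs[j] for j in range(n) if j != i]
        # Average per-category probability over all paragraphs except i.
        avg = [sum(o[k] for o in others) / len(others) for k in range(num_categories)]
        result.append(pearson(row, avg))
    return result


# Hypothetical article: three on-topic paragraphs and one off-topic paragraph.
probs = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.75, 0.15, 0.10],
    [0.10, 0.10, 0.80],
]
correlations = leave_one_out_correlations(probs)
```

In this sample, the fourth paragraph concentrates its probability in a different category from the others, so its correlation with the average of the remaining paragraphs is the lowest.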

The apparatus of claim 1, wherein the category classification unit is trained to output, when a paragraph is input, the probability that the input paragraph corresponds to each of the plurality of categories as a numerical value.
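
The claim does not specify the classifier's architecture. Purely to illustrate what a numerical per-category probability output can look like, the sketch below assumes a classifier whose final stage applies a softmax to raw category scores; the category names and scores are hypothetical, and no training procedure is shown.

```python
import math


def softmax(scores):
    """Convert raw category scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical categories and raw scores for a single paragraph.
categories = ["politics", "sports", "finance"]
probabilities = softmax([2.0, 0.5, -1.0])
by_category = dict(zip(categories, probabilities))
```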

A method for detecting spam news, comprising:
deriving, by a category classification unit, a probability that each of a plurality of paragraphs included in news belongs to each of a plurality of categories;
calculating, by a correlation calculation unit, a correlation between each paragraph and the other paragraphs through correlation analysis based on the probabilities of belonging to the plurality of categories; and
identifying, by a spam news discrimination unit, the paragraph having the lowest correlation among the plurality of paragraphs as a suspected spam paragraph, and determining whether the news is real or fake according to the correlation of the suspected spam paragraph.

The method of claim 6, wherein the step of determining whether the news is real or fake comprises:
deriving, by the spam news discrimination unit, based on distributions of the correlation for fake news and for real news, a fake news probability model that yields a probability of being fake news and a real news probability model that yields a probability of being real news;
calculating a probability that the news is real news and a probability that the news is fake news by substituting the correlation of the suspected spam paragraph into the fake news probability model and the real news probability model; and
determining, by the spam news discrimination unit, whether the news is real or fake according to the probability of being real news and the probability of being fake news.

The method of claim 6, wherein the step of determining whether the news is real or fake comprises:
deriving, by the spam news discrimination unit, based on distributions of the correlation for fake news and for real news, a fake news probability model that yields a probability of being fake news and a real news probability model that yields a probability of being real news; and
determining that the news is fake if the correlation of the suspected spam paragraph is less than a reference value set based on the fake news probability model and the real news probability model.

The method of claim 6, wherein the step of calculating the correlation comprises calculating, for each paragraph, a correlation between the per-category probabilities of that paragraph and the average per-category probabilities of the remaining paragraphs.

The method of claim 6, further comprising, before the step of deriving the probability, training the category classification unit to output, when a paragraph is input, the probability that the input paragraph corresponds to each of the plurality of categories as a numerical value.

A computer-readable recording medium in which a program for performing a method for detecting spam news according to any one of claims 6 to 10 is recorded.