KR100857124B1

KR100857124B1 - Filtering System for Harmful Message and Method Thereof and Recording Medium Thereof

Info

Publication number: KR100857124B1
Application number: KR1020060058218A
Authority: KR
Inventors: 최형기; 김범배
Original assignee: 성균관대학교산학협력단
Priority date: 2006-06-27
Filing date: 2006-06-27
Publication date: 2008-09-05
Also published as: KR20080000416A

Abstract

The present invention relates to a harmful message filtering system, a filtering method thereof, and a recording medium recording the method, which effectively filter harmful messages in Internet communities such as bulletin boards and subdivide them into various subcategories rather than simply classifying messages as normal or harmful. In a system that filters harmful messages and has a database for storing messages received from clients connected through a network, the invention provides a configuration comprising message receiving means for receiving the message, word extracting means for extracting a plurality of words from the received message, and evaluating means for determining, using the extracted words, whether the message is a harmful message and storing the result in the database.

By using the harmful message filtering system, the filtering method, and the recording medium described above, filtered harmful messages are reclassified into fine-grained subcategories, which enables smooth communication in Internet communities without harm from harmful messages.

Harmful message, spam, filtration, filtering, bulletin board, community

Description

Filtering System for Harmful Message and Method Thereof and Recording Medium Thereof

FIG. 1 is a diagram schematically showing a harmful message filtering process according to an embodiment of the present invention;

FIG. 2 is a block diagram showing a harmful message filtering system according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating a harmful message filtering method according to an embodiment of the present invention;

FIG. 4 is a diagram showing in detail the learning process and the evaluation process of the harmful message filtering method according to an embodiment of the present invention;

FIG. 5 is a diagram showing a message posted to a community according to an embodiment of the present invention.

* Explanation of reference numerals for the main parts of the drawings *

10, 11: client
20: message filtering system
30: database
201: message receiving means
202: word extracting means
203: control means
204: learning means
205: evaluating means

The present invention relates to a technique for processing harmful messages, and more particularly to a harmful message filtering system, a filtering method thereof, and a recording medium recording the method, which effectively filter harmful messages in Internet communities such as bulletin boards and subdivide them into various subcategories rather than simply classifying messages as normal or harmful.

With the rapid increase in Internet use, messages such as commercial advertisements that seek unfair profit regardless of the recipient's wishes, material harmful to minors, slander, and profanity are surging. Such messages are also called 'spam' or 'junk messages' and are referred to below as 'harmful messages'. They are a representative Internet issue: they hinder communication, waste resources and manpower, and raise controversy over Internet ethics. The e-mail service, which has been actively used as a communication medium since the early days of the Internet, is a representative victim; the damage is severe enough that more than 40% of the e-mail received today is said to consist of harmful messages. The short message service (SMS) of mobile communication, which is rapidly gaining prominence, is also suffering heavily from harmful messages. Accordingly, a variety of harmful message blocking techniques have been proposed for e-mail services and for mobile carriers' text services, and as senders' transmission techniques grow more sophisticated, blocking techniques are advancing to cope with them.

An example of such spam mail processing technology is disclosed in Korean Patent Registration No. 0460322 (registered November 26, 2004, "Spam mail prevention system and method").

The technology disclosed in Korean Patent Registration No. 0460322 consists of: a spam mail information collection server including a mail receiving unit that receives mail sent to fake mail addresses and new mail sent to mail addresses actually in use, an information extracting unit that extracts spam judgment basis information from the mail sent to each fake mail address and received through the mail receiving unit, a database unit that stores the spam judgment basis information extracted by the information extracting unit, and a spam mail information transmitting unit that propagates the spam judgment basis information stored in the database unit over the network; an update unit that receives and stores the spam judgment basis information from the spam mail information collection server at predetermined intervals; a database unit that stores the spam judgment basis information updated by the update unit; a header information analysis unit that analyzes the header information of new mail received by the mail receiving unit and stores the header analysis information in the database unit; and a spam mail filtering unit. The spam mail filtering unit searches the database unit using the analyzed header information to determine whether the received new mail is spam and, if so, blocks its reception. It also searches the database unit, once its spam judgment basis information has been updated, using the stored header information of previously received mail to determine whether that mail is spam and, if so, deletes it from the mail server.

In operation, the spam mail information collection server extracts spam judgment basis information from the header information of spam mail received at the fake mail addresses and compiles it into a database. The mail server receives this information and stores it in its database unit. When the mail server receives new mail, it searches the database unit using the analyzed header information and blocks reception of the new mail if it is judged to be spam. When the database unit is updated, the mail server searches it using the header information of previously received mail and deletes any mail judged to be spam. The publication thus describes a spam mail prevention system and method that enable efficient spam prevention without burdening the mail server.

An example of spam message processing technology for mobile communication terminals is disclosed in Korean Patent Registration No. 0576037 (registered April 25, 2006, "Method for blocking spam messages of a mobile communication terminal").

The technology disclosed in Korean Patent Registration No. 0576037 consists of the following steps: registering in the mobile communication terminal spam blocking information, including spam words and spam phone numbers, for performing spam message blocking; when a text message is received at the terminal, extracting the sender number from the sender information of the received message; searching the spam blocking information and, if a spam phone number identical to the sender number exists, recognizing the received message as spam and discarding it; if no identical spam phone number exists, analyzing the received message to check whether it contains a word matching a registered spam word; if such a word exists, checking whether the sender number appears in the phone number list stored in the terminal; and, if the sender number appears in that list, performing normal reception processing so that the user can respond to the message, or, if it does not, moving the message to a spam storage unit.

Using the spam words and phone numbers registered by the terminal user, one round of spam blocking is performed on each received text message; for messages treated as spam, the phone number list stored in the terminal is then searched automatically using the sender number, and spam blocking is performed once more. This prevents non-spam messages from being filtered and discarded, which can happen when spam blocking relies only on spam sender numbers or specific words registered at the SMS center. The publication thus describes a method for blocking spam messages of a mobile communication terminal that can provide a more stable and efficient spam blocking service to individual terminal users.

However, the conventional techniques, including those disclosed in the above publications, are limited to e-mail services and mobile text services. Beyond media such as e-mail and mobile communication, the Internet serves as a channel of communication between users through various other media, such as Internet communities (bulletin boards) provided by large portal sites and newsgroups. Internet communities in particular are services whose use has surged recently as users' social participation and activity have increased. Examples include group communities provided by large portal services and comment services for news articles and posts. As with e-mail and mobile services, the volume of harmful messages in Internet communities is surging and the damage is widespread. Yet no technique for blocking harmful messages in Internet communities has been devised.

Moreover, even if the various harmful message filtering techniques used for e-mail or mobile text services were introduced into Internet communities, filtering accuracy would drop because of the characteristics of such communities. Unlike e-mail and text services, an Internet community delivers a single message to many users at once. In an e-mail or text service, a sent message is received by, and affects, only the one user who holds the account or terminal. A message posted in an Internet community, by contrast, affects the unspecified number of Internet users who read it. Consequently, techniques that define and filter harmful messages from an individual's point of view, as existing filtering techniques do, cannot satisfy the demands of the diverse users of an Internet community and suffer from low accuracy.

An object of the present invention is to solve the problems described above by providing a harmful message filtering system, a filtering method thereof, and a recording medium recording the method, which improve on the various conventional harmful message filtering techniques so that harmful messages in Internet communities can be filtered.

Another object of the present invention is to provide a harmful message filtering system, a filtering method thereof, and a recording medium recording the method, which reclassify filtered harmful messages into fine-grained subcategories and thereby enable smooth communication in Internet communities without harm from harmful messages. This addresses the drop in accuracy caused by differences of opinion among users about what counts as harmful, a problem that arises only when filtering harmful messages in Internet communities.

Still another object of the present invention is to provide a harmful message filtering system, a filtering method thereof, and a recording medium recording the method, which effectively block and classify harmful messages appearing in Internet communities and minimize the misclassification rate that arises from disagreement between community administrators and users over the definition of a harmful message.

To achieve the above objects, a harmful message filtering system according to the present invention is a system that filters harmful messages and has a database for storing messages received from clients connected through a network, the system comprising message receiving means for receiving the message, word extracting means for extracting a plurality of words from the received message, and evaluating means for determining, using the extracted words, whether the message is a harmful message and storing the result in the database, wherein the evaluating means assigns to the extracted words the word evaluation values stored in the database, computes on the assigned word evaluation values to determine whether the message is a harmful message, and subdivides the message into one of a plurality of harmful message categories.

Further, in the harmful message filtering system according to the present invention, the system further comprises learning means for storing in the database the plurality of words and the word evaluation value assigned to each of the words.

Further, in the harmful message filtering system according to the present invention, the word evaluation value is assigned for each of the plurality of harmful message categories and is the proportion of messages containing the word among all messages in that harmful message category.
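
The per-category word evaluation value described above can be illustrated with a minimal Python sketch. It assumes that training data arrives as (category, word list) pairs and that a word is counted once per message; the function name `word_evaluation_values` and the sample data are hypothetical and not taken from the patent.

```python
from collections import defaultdict


def word_evaluation_values(labeled_messages):
    """Map category -> {word: share of that category's messages containing the word}.

    labeled_messages is an iterable of (category, words) pairs; this input
    shape is an assumption made for the example.
    """
    totals = defaultdict(int)  # number of messages per category
    containing = defaultdict(lambda: defaultdict(int))  # category -> word -> message count
    for category, words in labeled_messages:
        totals[category] += 1
        for word in set(words):  # count each word once per message
            containing[category][word] += 1
    return {
        category: {word: count / totals[category] for word, count in counts.items()}
        for category, counts in containing.items()
    }


# Hypothetical pre-classified messages.
training = [
    ("advertisement", ["cheap", "loan", "now"]),
    ("advertisement", ["cheap", "phone"]),
    ("abuse", ["idiot", "now"]),
]
values = word_evaluation_values(training)
print(values["advertisement"]["cheap"])  # both advertisement messages contain "cheap"
```

Counting a word once per message keeps the value a proportion of messages, as the text defines it, rather than a term frequency.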

Further, in the harmful message filtering system according to the present invention, the evaluating means further extracts a plurality of optimal words from the extracted words for each of the harmful message categories, the optimal words being extracted in order, starting with the word whose assigned word evaluation value lies farthest from 0.5 toward 0 or 1.
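
One way to read this selection rule is as a ranking by distance from 0.5. The sketch below assumes that words with no stored value are treated as neutral (0.5) and that ties are broken alphabetically; both choices, along with the name `select_optimal_words` and the parameter `k`, are assumptions made for illustration.

```python
def select_optimal_words(words, category_values, k=3, neutral=0.5):
    """Return up to k words whose evaluation value lies farthest from 0.5.

    category_values maps word -> evaluation value for one category.
    Unknown words get the neutral value (an assumption, not from the patent).
    """
    scored = {word: category_values.get(word, neutral) for word in set(words)}
    # Sort by descending distance from 0.5, then alphabetically for stable output.
    ranked = sorted(scored, key=lambda word: (-abs(scored[word] - 0.5), word))
    return ranked[:k]


# Hypothetical evaluation values for a single category.
category_values = {"cheap": 0.9, "hello": 0.2, "loan": 0.6, "the": 0.5}
print(select_optimal_words(["cheap", "the", "loan", "hello"], category_values, k=2))
```

The selection would be repeated once per harmful message category, since each category has its own set of word evaluation values.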

Further, in the harmful message filtering system according to the present invention, the evaluating means further computes, for each of the plurality of harmful message categories, the probability that the message belongs to that category, assigns the largest of the computed values as the message evaluation value, and classifies the message into the corresponding harmful message category.

Further, in the harmful message filtering system according to the present invention, the probability that the message belongs to the harmful message category is computed by the expression

P(spam₁|words) = P(words|spam₁) · P(spam₁) / P(words)

where words denotes the extracted optimal words, spam₁ denotes the harmful message category, P(words|spam₁) is the product of the word evaluation values assigned to the optimal words for the harmful message category spam₁, P(spam₁) is the proportion of all messages that belong to the harmful message category spam₁, and P(words) is the proportion of all messages that contain all of the extracted optimal words.
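
As a worked illustration, the following sketch assumes the probability is computed as P(words|spam₁) · P(spam₁) / P(words), in line with the three quantities defined above. The function and parameter names are hypothetical, and the numbers are made up.

```python
from math import prod


def category_probability(word_values, category_share, words_share):
    """Return P(words|category) * P(category) / P(words).

    word_values:    evaluation values of the optimal words for this category
    category_share: proportion of all messages that belong to this category
    words_share:    proportion of all messages containing every optimal word
    """
    if words_share <= 0:
        raise ValueError("words_share must be positive")
    likelihood = prod(word_values)  # product of the word evaluation values
    return likelihood * category_share / words_share


# Two optimal words with values 0.8 and 0.5, a category covering 25% of
# messages, and 20% of messages containing both words (all hypothetical).
print(category_probability([0.8, 0.5], category_share=0.25, words_share=0.2))
```

Because the likelihood is a plain product of per-word values, the result is only a true probability if the words are treated as independent, which is the usual naive Bayes simplification.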

Further, in the harmful message filtering system according to the present invention, when the largest of the computed values is below a specified threshold, the message is classified as a normal message.
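
The decision rule (take the category with the largest computed value, and fall back to a normal message when that value is below the threshold) might look like the sketch below. The threshold of 0.9, the label "normal", and the category names are assumptions; the patent does not give a specific threshold in this passage.

```python
def classify(category_probabilities, threshold=0.9):
    """Return (label, message_evaluation_value) for one message.

    category_probabilities maps each harmful message category to its
    computed probability. The threshold default is an assumed value.
    """
    if not category_probabilities:
        return "normal", 0.0
    best = max(category_probabilities, key=category_probabilities.get)
    value = category_probabilities[best]
    if value < threshold:
        return "normal", value  # no category is convincing enough
    return best, value


# Hypothetical per-category results for two messages.
print(classify({"adult": 0.10, "advertisement": 0.95, "abuse": 0.30}))
print(classify({"adult": 0.10, "advertisement": 0.40, "abuse": 0.30}))
```

Returning the largest value alongside the label mirrors the text, which stores that value as the message evaluation value.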

Further, in the harmful message filtering system according to the present invention, the message is received from the client through a community provided by the harmful message filtering system.

Further, in the harmful message filtering system according to the present invention, the message is the body text of a post published in the community.

Further, to achieve the above objects, a harmful message filtering method according to the present invention is a method of filtering harmful messages with a harmful message filtering system having a database for storing messages received from clients connected through a network, the method comprising: (a) receiving a message from the client; (b) extracting a plurality of words from the received message; (c) determining whether the current process is a learning process or an evaluation process; and (d) if step (c) determines that it is an evaluation process, determining, using the extracted words, whether the message is a harmful message, wherein step (d) uses the word evaluation values stored in the database to determine whether the message is a harmful message and subdivides the message into one of a plurality of harmful message categories.

Further, in the harmful message filtering method according to the present invention, the method further comprises, if step (c) determines that it is a learning process: (e1) assigning to each extracted word a word evaluation value for each of the plurality of harmful message categories; and (e2) storing the words and the word evaluation values in the database.

Further, in the harmful message filtering method according to the present invention, the word evaluation value is the proportion of messages containing the word among all messages in the harmful message category.

Further, in the harmful message filtering method according to the present invention, step (d) further comprises: (d1) extracting a plurality of optimal words from the extracted words for each of the plurality of harmful message categories; (d2) computing, for each of the plurality of harmful message categories, the probability that the message belongs to that category; (d3) assigning the largest of the computed values as the message evaluation value and classifying the message into the corresponding harmful message category; and (d4) storing in the database the message, as classified into the harmful message category, together with the message evaluation value.

Further, in the harmful message filtering method according to the present invention, step (d1) extracts optimal words in order, starting with the word whose assigned word evaluation value lies farthest from 0.5 toward 0 or 1.

Further, in the harmful message filtering method according to the present invention, the probability in step (d2) that the message belongs to the harmful message category is computed by the expression

P(spam₁|words) = P(words|spam₁) · P(spam₁) / P(words)

where words denotes the extracted optimal words, spam₁ denotes the harmful message category, P(words|spam₁) is the product of the word evaluation values assigned to the optimal words for the harmful message category spam₁, P(spam₁) is the proportion of all messages that belong to the harmful message category spam₁, and P(words) is the proportion of all messages that contain all of the extracted optimal words.

Further, in the harmful message filtering method according to the present invention, when the largest of the computed values is below a specified threshold, the message is classified as a normal message.

Further, in the harmful message filtering method according to the present invention, the message is received from the client through a community provided by the harmful message filtering system.

Further, in the harmful message filtering method according to the present invention, the message is the body text of a post published in the community.

Further, to achieve the above objects, a computer-readable recording medium according to the present invention records a method of filtering harmful messages with a harmful message filtering system having a database for storing messages received from clients connected through a network, the method comprising: (a) receiving a message from the client; (b) extracting a plurality of words from the received message; (c) determining whether the current process is a learning process or an evaluation process; and (d) if step (c) determines that it is an evaluation process, determining, using the extracted words, whether the message is a harmful message, wherein step (d) uses the word evaluation values stored in the database to determine whether the message is a harmful message and subdivides the message into one of a plurality of harmful message categories.

Hereinafter, the most preferred embodiment of the present invention is described in detail with reference to the accompanying drawings, in sufficient detail that a person of ordinary skill in the art to which the invention pertains can readily practice it. In describing the invention, like parts are given like reference numerals and repeated description is omitted.

FIG. 1 is a diagram schematically showing a harmful message filtering process according to an embodiment of the present invention.

As shown in FIG. 1, achieving a high harmful message filtering rate requires that learning supervised by an administrator be carried out first. Once learning has reached a certain level, it is applied to the Internet community to determine whether messages actually posted are harmful. To this end, the message filtering system according to the present invention includes learning means, which identifies the characteristics of harmful messages from a set of pre-classified messages, and evaluating means, which uses the results of learning to classify received messages into normal and harmful messages and to subdivide harmful messages into fine-grained subcategories.

Next, the harmful message filtering system according to the present invention is described with reference to FIG. 2.

FIG. 2 is a block diagram showing a harmful message filtering system according to an embodiment of the present invention.

도 2에서 도시한 바와 같이, 본 발명에 따른 유해 메시지 여과 시스템(20)은 다수의 클라이언트(10, 11)로부터 메시지를 수신하는 메시지 수신수단(201), 메시지 수신수단(201)으로 수신된 메시지에서 다수의 단어를 추출하는 단어 추출수단(202), 학습과정을 시행하고자 하는지 평가과정을 시행하고자 하는지 판단하는 제어수단(203), 단어 추출수단(202)에서 추출된 단어마다 다수의 단어 평가값을 할당하여 데이터베이스(30)에 저장하는 학습수단(204), 단어 추출수단(202)에서 추출된 다수의 단어를 이용하여 메시지 수신수단(201)에서 수신한 메시지가 유해 메시지인지의 여부를 판단하여 그 결과를 데이터베이스(30)에 저장하는 평가수단(205)을 구비한다.As shown in Figure 2, the harmful message filtering system 20 according to the present invention is a message receiving means 201, a message receiving means 201, a message receiving means 201 for receiving messages from a plurality of clients (10, 11) Word extraction means 202 for extracting a plurality of words in the control unit; control means 203 for determining whether to implement a learning process or an evaluation process; and a plurality of word evaluation values for each word extracted by the word extraction means 202. Using the plurality of words extracted from the learning means 204 and the word extracting means 202 to allocate and store in the database 30 to determine whether or not the message received by the message receiving means 201 is a harmful message. Evaluation means 205 for storing the result in the database 30 is provided.

The message receiving means (201) shown in FIG. 2 receives messages from the clients (10, 11) through the community provided by the harmful message filtering system (20); a received message is the body text of a post registered in the community. That is, only the content actually intended to be conveyed is considered, excluding the post title, author information, post list, and the like.

The word extraction means (202) defines as a word, and extracts from the message, any combination of one or more characters delimited by spaces or punctuation marks, regardless of whether the text is Korean or English.

The learning means (204) assigns to each word extracted by the word extraction means (202) a word evaluation value for each harmful message category and stores these values in the database (30); the evaluation means (205) uses the stored word evaluation values to determine whether a message is harmful. A message determined to be harmful rather than normal is subdivided into one of a plurality of harmful message categories. In the present invention, messages are broadly classified into normal messages and harmful messages, and harmful messages are further subdivided into subcategories such as adult posts, advertisement posts, and slander and abusive posts. Accordingly, the system first determines which harmful message category a message belongs to, and classifies the message as normal if it belongs to none. The harmful message categories can of course be defined in various ways by the administrator who manages the message filtering system.

Next, the harmful message filtering method according to the present invention is described with reference to FIGS. 3 to 5.

FIG. 3 is a flowchart illustrating a harmful message filtering method according to an embodiment of the present invention, FIG. 4 shows the learning process and the evaluation process of the method in detail, and FIG. 5 shows messages posted to a community according to an embodiment of the present invention.

As shown in FIG. 3, the message receiving means (201) receives a message from the clients (10, 11) (ST3010). The word extraction means (202) extracts a plurality of words from the message received by the message receiving means (201) (ST3020). Once the words are extracted, the control means (203) determines whether the process to be performed is the learning process or the evaluation process (ST3030). If the learning process is selected in step ST3030, the learning means (204) assigns to each of the words extracted by the word extraction means (202) a word evaluation value for each of the plurality of harmful message categories (ST3040). After assigning the word evaluation values, the learning means (204) stores the words extracted in step ST3020 and the word evaluation values assigned in step ST3040 in the database (30) (ST3041). The words and word evaluation values accumulated through this learning process are later used in the evaluation process to determine whether a message is harmful.
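
The branch at ST3030 can be pictured as a small dispatcher. The sketch below is only an illustration: the function name `handle_message`, the string-valued `mode` argument, and the callables passed in are hypothetical and do not come from the patent.

```python
from typing import Callable, List, Optional


def handle_message(
    body: str,
    mode: str,
    extract: Callable[[str], List[str]],
    learn: Callable[[List[str]], None],
    evaluate: Callable[[List[str]], str],
) -> Optional[str]:
    """Route one received message (ST3010) to learning or evaluation."""
    words = extract(body)          # ST3020: word extraction
    if mode == "learn":            # ST3030: the control means picks the process
        learn(words)               # ST3040/ST3041: assign and store values
        return None
    if mode == "evaluate":
        return evaluate(words)     # ST3050 onward: classify the message
    raise ValueError(f"unknown mode: {mode!r}")


# Usage with trivial stand-ins for the real means (all hypothetical).
learned: List[List[str]] = []
result = handle_message(
    "cheap viagra",
    "evaluate",
    extract=str.split,
    learn=learned.append,
    evaluate=lambda words: "adult" if "viagra" in words else "normal",
)
handle_message("hello world", "learn", str.split, learned.append, lambda w: "normal")
```

Passing the three means in as callables keeps the sketch independent of how extraction, learning, and evaluation are actually implemented.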

If the evaluation process is selected in step ST3030, the plurality of words extracted by the word extraction means (202) is used to determine whether the message received by the message receiving means (201) is harmful. To this end, the evaluation means (205) uses the word evaluation values stored in the database (30) to determine whether the message is harmful and to subdivide it into one of the plurality of harmful message categories. The process is as follows. First, the evaluation means (205) extracts, for each harmful message category, a plurality of optimal words from the words extracted by the word extraction means (202) (ST3050). Using the optimal words, the evaluation means (205) then computes, for each harmful message category, the probability that the received message belongs to that category (ST3051). The evaluation means (205) assigns the largest of the ST3051 results as the message evaluation value (ST3053) and classifies the received message into the corresponding harmful message category (ST3054). If the largest of the ST3051 results is below a predetermined threshold, the received message is classified as a normal message (ST3060). The evaluation means (205) stores the message, as classified into a harmful message category, together with its message evaluation value in the database (30) (ST3055).

The harmful message filtering method according to the present invention can be broadly divided into a learning process and an evaluation process, each of which is described in detail below.

A. Learning process

As shown in FIG. 4a, the administrator who manages the harmful message filtering system according to the present invention carries out the learning process in order to capture the characteristics of harmful messages. First, the administrator collects actual messages posted to the Internet community and sorts them by their characteristics into the harmful message categories. For example, the administrator of a typical Internet community may collect messages sorted into normal messages, adult posts, advertisement posts, slander and abusive posts, and so on. Because the correct classification of the collected message set greatly affects how well the present invention works, the learning process is a very important task.

When the message receiving means (201) has received and collected the messages (ST3010) and the messages have been sorted according to fixed rules, the learning means (204) carries out learning through the following process.

The word extraction (ST3020) shown in FIG. 4a is the process of extracting every word from the message set collected by the message receiving means (201), namely the normal messages, adult posts, advertisement posts, slander and abusive posts, and so on. The messages used for extraction are posts published to an Internet community, as shown in FIG. 5. That is, a user enters text in each field as shown in FIGS. 5a and 5b and clicks the 'OK' button, and the post is published to the community as shown in FIG. 5c. Clicking the post listed in FIG. 5c displays it as shown in FIG. 5d. Words are not extracted from the title, meta tags, and the like of a published post; they are extracted only from the body text, which carries the content actually intended to be conveyed. An extracted word is a combination of characters in the message that can be separated by spaces, punctuation marks, and the like. In addition, for more effective word extraction, a combination of one character or fewer is not treated as a word. For example, if a message contains the sentence '본 발명은 인터넷 커뮤니티의 유해 메시지 여과를 목표로 한다.' ('The present invention aims to filter harmful messages in Internet communities.'), then '발명은', '인터넷', '커뮤니티의', '유해', '메시지', '여과를', '목표로', and '한다' are extracted as words; the single-character '본' is not.
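
A minimal sketch of the word extraction step (ST3020) follows. It assumes that "punctuation" can be approximated by every non-word character (plus the underscore) and that a word needs at least two characters, as in the example sentence; the function name `extract_words` and the regular expression are illustrative choices, not part of the patent.

```python
import re
from typing import List

# Assumption: any run of whitespace, punctuation, or underscores separates words.
_SEPARATORS = re.compile(r"[\W_]+")


def extract_words(body: str, min_length: int = 2) -> List[str]:
    """Split a post body into words and drop tokens shorter than min_length."""
    tokens = _SEPARATORS.split(body)
    return [token for token in tokens if len(token) >= min_length]


words = extract_words("본 발명은 인터넷 커뮤니티의 유해 메시지 여과를 목표로 한다.")
print(words)
```

Because Python's `\W` is Unicode-aware for `str` patterns, the same splitter handles Korean and English bodies alike.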

This word extraction is performed for every word contained in the collected messages, and the learning means (204) stores each extracted word in the database (30) together with the categories of the messages that contain it and, for each category, the number of messages containing it (ST3041). For example, if the word 'Internet' appears in 100 normal messages, 10 adult posts, 5 advertisement posts, and 0 slander and abusive posts, then 'Internet' is stored together with each category and the number of messages in that category in which it appears.
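
One way to picture what is stored in ST3041 is a nested mapping from word to category to message count. The sketch below uses an in-memory dictionary in place of the database (30); the function name and the category labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def count_messages_per_word(
    labeled_messages: Iterable[Tuple[str, List[str]]],
) -> Dict[str, Dict[str, int]]:
    """Return {word: {category: number of messages in that category containing it}}."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for category, words in labeled_messages:
        for word in set(words):  # a message counts once, however often the word repeats
            counts[word][category] += 1
    return counts


# Each item is (category chosen by the administrator, words extracted from the body).
training = [
    ("normal", ["internet", "cafe"]),
    ("adult", ["internet", "internet", "viagra"]),
    ("advertisement", ["sale", "internet"]),
]
counts = count_messages_per_word(training)
```

Counting messages rather than occurrences matches the description, which stores the number of messages that contain the word.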

The learning means (204) performs the learning process for every word in the collected messages, and what is stored may differ depending on how the administrator classifies the messages. If the administrator classifies messages into finer categories, more data is stored for each word.

During the learning process, each extracted word is evaluated for how well it reflects the characteristics of each category. That is, a word is evaluated using the stored data and the total number of collected messages. The word evaluation proceeds in detail as follows.

First, for any given extracted word, the number of messages containing that word is counted for each harmful message category, based on the information stored for the word; this count has already been taken during word extraction. Each count is then converted into a ratio relative to the total number of messages in the corresponding category. For example, if the word 'viagra' appears in 80 of 100 collected adult posts, 50 of 100 advertisement posts, and 2 of 100 slander and abusive posts, then 'viagra' is computed as 0.8 for adult posts, 0.5 for advertisement posts, and 0.02 for slander and abusive posts. These values are word evaluation values, each being the probability that a message containing the word appears in the given harmful message category, and the learning means (204) assigns them to the word (ST3040). This computation closely resembles conventional Bayesian filtering, but by giving each word multiple evaluation values the present invention determines not only whether a message is harmful but also which harmful category it belongs to. The normal message category has no word evaluation values of its own. Because normal messages cover a very broad range of content, they are of little help in filtering harmful messages. The normal category is therefore left out of word evaluation value assignment, and classification in the evaluation process is performed without word evaluation values for it.
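
The ratio computed in ST3040 can be sketched in a few lines. The function name, argument names, and category labels below are illustrative assumptions; the figures are the 'viagra' example from the text.

```python
from typing import Dict


def word_evaluation_values(
    containing: Dict[str, int], totals: Dict[str, int]
) -> Dict[str, float]:
    """Per harmful category: messages containing the word / all messages in the category."""
    return {
        category: containing.get(category, 0) / total
        for category, total in totals.items()
        if total > 0  # skip categories with no training messages
    }


# 'viagra' appears in 80 of 100 adult posts, 50 of 100 ads, 2 of 100 slander posts.
viagra = word_evaluation_values(
    {"adult": 80, "advertisement": 50, "slander_abuse": 2},
    {"adult": 100, "advertisement": 100, "slander_abuse": 100},
)
print(viagra)
```

The normal category is simply left out of `totals`, mirroring the statement that it receives no word evaluation values.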

Word evaluation values are assigned in this way for every word and stored in the database (30) together with the word (ST3041). This whole procedure is called the learning process, and its extent is decided by the administrator. The larger the message set the administrator collects and learns from, the greater the effectiveness and accuracy. In addition, periodically learning from the changing content and tricks of harmful messages allows the system to respond flexibly.

B. Evaluation process

The evaluation process takes place after the learning process is complete and harmful message filtering has been applied to the Internet community; it determines whether posts actually submitted by users are harmful. The evaluation process uses only the actual content of a post as the basis for judging it, and proceeds on the assumption that this content has been provided. Techniques for extracting the actual content of a message posted to an Internet community are well known and commonly used in the art, so a detailed description is omitted.

As shown in FIG. 4b, the evaluation process consists broadly of 1) word evaluation value assignment, 2) optimal word extraction, and 3) message evaluation.

First, when the message receiving means (201) receives a new message (ST3010), the word extraction means (202) extracts a plurality of words from it (ST3020). In the evaluation process, too, a word is a combination of one or more characters that can be delimited by spaces or punctuation marks. Next, the evaluation means (205) extracts optimal words from the extracted words (ST3050). To do so, it first assigns word evaluation values to each extracted word, looking them up in the database (30), where they were stored in step ST3041 of the learning process. For example, when a post is submitted to a bulletin board, each word in the message is assigned the word evaluation value for each subcategory. That is, if the received post contains the word 'viagra', the evaluation means (205) refers to the stored word evaluation values and assigns 0.8 for adult posts, 0.5 for advertisement posts, and 0.1 for slander and abusive posts, and does the same for all other words. It then extracts a set number of optimal words for evaluating the message and uses them to compute a message evaluation value for each subcategory.

Word evaluation values are assigned to every word in the received message. A new word that did not appear during learning, and therefore has no word evaluation value, is uniformly assigned an arbitrary word evaluation value. This value can be changed by the administrator and should be set in advance to the value that yields the best accuracy. In the present embodiment, such a word is assigned a word evaluation value of 0.4 for every category by default.
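
The lookup with a uniform fallback for unseen words might look like the following sketch. The dictionary `learned` stands in for the database (30), and the names `assign_word_values` and `DEFAULT_WORD_VALUE` are hypothetical; only the default of 0.4 comes from the embodiment.

```python
from typing import Dict, Iterable, List

DEFAULT_WORD_VALUE = 0.4  # embodiment default for words never seen during learning


def assign_word_values(
    words: Iterable[str],
    learned: Dict[str, Dict[str, float]],
    categories: List[str],
    default: float = DEFAULT_WORD_VALUE,
) -> Dict[str, Dict[str, float]]:
    """Give every word of a message one evaluation value per harmful category."""
    assigned: Dict[str, Dict[str, float]] = {}
    for word in words:
        if word in learned:
            assigned[word] = dict(learned[word])  # values stored in ST3041
        else:
            assigned[word] = {category: default for category in categories}
    return assigned


categories = ["adult", "advertisement", "slander_abuse"]
learned = {"viagra": {"adult": 0.8, "advertisement": 0.5, "slander_abuse": 0.1}}
assigned = assign_word_values(["viagra", "zxqv"], learned, categories)
```

Exposing `default` as a parameter reflects the statement that the administrator can tune this value.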

Once word evaluation values have been assigned for the received message, the optimal words to be used for message evaluation are extracted (ST3050). Optimal word extraction (ST3050) selects from the received message an appropriate number of words that best reflect its characteristics: words are extracted in order, starting with those whose word evaluation values lie farthest from 0.5 toward either 0 or 1. For example, if words A, B, and C in a message have word evaluation values of 0.8, 0.6, and 0.2 respectively, they are extracted in the order A, C, B or C, A, B, so that words whose values differ most from 0.5 come first.

Moreover, because a single word has a word evaluation value for each of the harmful message categories, words are extracted independently for each category. For example, the word 'viagra', assigned a word evaluation value of 0.8 for adult posts, is likely to be extracted early for that category, whereas for advertisement posts, where it is assigned 0.5, it is likely to be extracted last. As a separate illustration with different values, suppose a message such as the one in FIG. 5d consists of the five words 'viagra', 'sex', 'school', 'the', and 'love', and that their word evaluation values for adult posts and advertisement posts are, respectively, 0.1 and 0.8 for 'viagra', 0.3 and 0.7 for 'sex', 0.8 and 0.4 for 'school', 0.5 and 0.5 for 'the', and 0.6 and 0.9 for 'love'. The optimal words for the adult post category are then extracted in the order 'viagra', 'school', 'sex', 'love', 'the', and for the advertisement post category in the order 'love', 'viagra', 'sex', 'school', 'the'. The evaluation means (205) extracts words in this manner for every harmful message category, and the number of words extracted can be set freely by the administrator. Because message length varies with the characteristics of the Internet community, the present embodiment extracts 15 optimal words by default, a value chosen for the best accuracy.
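
The per-category ordering by distance from 0.5 can be sketched as a sort. The function name `select_optimal_words` and the category labels are assumptions made for illustration; the word values are those of the five-word example, and `limit=15` is the embodiment default.

```python
from typing import Dict, List


def select_optimal_words(
    word_values: Dict[str, Dict[str, float]], category: str, limit: int = 15
) -> List[str]:
    """Rank words by how far their value for `category` lies from 0.5, farthest first."""
    ranked = sorted(
        word_values,
        key=lambda word: abs(word_values[word][category] - 0.5),
        reverse=True,
    )
    return ranked[:limit]


word_values = {
    "viagra": {"adult": 0.1, "advertisement": 0.8},
    "sex": {"adult": 0.3, "advertisement": 0.7},
    "school": {"adult": 0.8, "advertisement": 0.4},
    "the": {"adult": 0.5, "advertisement": 0.5},
    "love": {"adult": 0.6, "advertisement": 0.9},
}
adult_order = select_optimal_words(word_values, "adult")
ad_order = select_optimal_words(word_values, "advertisement")
```

When two words are equally far from 0.5, this sketch keeps them in their original order; the text allows either order in that case.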

After extracting the optimal words, the evaluation means (205) performs message evaluation, which determines the category to which the received message belongs. Message evaluation is computed quickly using the widely used Bayes' theorem: the received message is evaluated by applying Bayes' theorem, Equation 1, repeatedly to obtain the message evaluation value for each category (ST3051).

P(spam₁|words) = P(words|spam₁) · P(spam₁) / P(words)    (Equation 1)

Equation 1 computes P(spam₁|words), the probability that a message belongs to a given harmful message category (spam₁) when the extracted words are contained in the message; words denotes the extracted optimal words. The evaluation means (205) uses the probability of category spam₁, given that the message contains all the extracted words, as the message evaluation value. P(words|spam₁) in the numerator is the product of the word evaluation values of the individual words for the category spam₁. For example, for the adult post category the words are extracted from the received message in the order 'viagra', 'school', 'sex', 'love', 'the', and the product of their word evaluation values (0.1 × 0.8 × 0.3 × 0.6 × 0.5) is 0.0072.
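
The numerator term P(words|spam₁) is a plain product of word evaluation values. The sketch below reproduces the 0.0072 figure for the adult post category; the function name `likelihood` and the data layout are illustrative assumptions.

```python
import math
from typing import Dict, Iterable


def likelihood(
    words: Iterable[str], word_values: Dict[str, Dict[str, float]], category: str
) -> float:
    """P(words | category): product of the words' evaluation values for that category."""
    return math.prod(word_values[word][category] for word in words)


# Adult-post values from the five-word example.
word_values = {
    "viagra": {"adult": 0.1},
    "school": {"adult": 0.8},
    "sex": {"adult": 0.3},
    "love": {"adult": 0.6},
    "the": {"adult": 0.5},
}
p_words_given_adult = likelihood(
    ["viagra", "school", "sex", "love", "the"], word_values, "adult"
)
print(round(p_words_given_adult, 4))
```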

P(spam₁) is the ratio of the number of messages belonging to category spam₁ to the total number of messages collected for learning, that is, the share of all training messages that fall in spam₁. Here spam₁ is any one of the harmful message subcategories: adult posts, advertisement posts, or slander and abusive posts. For example, suppose 100 messages in total were used for learning, consisting of 40 adult posts, 30 advertisement posts, 20 slander and abusive posts, and 10 normal messages. If spam₁ denotes adult posts, P(spam₁) = 40/100 = 0.4; if spam₁ denotes advertisement posts, P(spam₁) = 30/100 = 0.3; and if it denotes slander and abusive posts, P(spam₁) = 20/100 = 0.2. As explained above, normal messages have no evaluation values and are therefore not used in this computation.
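
The category shares in this example can be checked with a short sketch. It assumes, as the example does, that normal messages count toward the total of 100 but receive no prior of their own; the function name and category labels are hypothetical.

```python
from typing import Dict, List


def category_priors(
    message_counts: Dict[str, int], harmful_categories: List[str]
) -> Dict[str, float]:
    """P(category): share of all training messages that belong to each harmful category."""
    total = sum(message_counts.values())  # includes normal messages
    return {category: message_counts[category] / total for category in harmful_categories}


counts = {"adult": 40, "advertisement": 30, "slander_abuse": 20, "normal": 10}
priors = category_priors(counts, ["adult", "advertisement", "slander_abuse"])
print(priors)
```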

P(words) in the denominator is the probability that a message containing all of the extracted optimal words appears; it is computed over all training messages, not only those in the category spam₁ being evaluated. More specifically, P(words) is the probability that a message containing all of the extracted optimal words 'viagra', 'school', 'sex', 'love', and 'the' appears in the message set used for learning. For example, if 100 messages were used for learning and 5 of them contain all of the words 'viagra', 'school', 'sex', 'love', and 'the', in any order, then P(words) is 0.05. The subcategory is irrelevant here; the ratio is taken over the entire set of training messages. Although the normal message category has no word evaluation values, this computation is still possible because the message counts stored during learning are available.
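
A sketch of the denominator and of how the three terms of Bayes' theorem combine is given below. The function names are hypothetical, the training set is synthetic, and returning 0.0 when no training message contains all the words is an assumption of this sketch; the source does not say how that case is handled.

```python
from typing import Iterable, List, Sequence


def evidence(optimal_words: Iterable[str], training_messages: Sequence[List[str]]) -> float:
    """P(words): share of training messages containing every optimal word, in any order."""
    required = set(optimal_words)
    hits = sum(1 for words in training_messages if required <= set(words))
    return hits / len(training_messages)  # assumes at least one training message


def message_evaluation_value(likelihood: float, prior: float, evidence_value: float) -> float:
    """Bayes' theorem: P(words|category) * P(category) / P(words)."""
    if evidence_value == 0:
        return 0.0  # assumption: not specified in the source
    return likelihood * prior / evidence_value


optimal = ["viagra", "school", "sex", "love", "the"]
# Synthetic training set: 5 of 100 messages contain all five words.
training = [list(reversed(optimal))] * 5 + [["hello", "world"]] * 95
p_words = evidence(optimal, training)

# Combining the text's illustrative figures for adult posts: 0.0072, 0.4, and 0.05.
adult_value = message_evaluation_value(0.0072, 0.4, p_words)
```

With these figures the adult post value comes out at roughly 0.0576; the same call would be repeated for each harmful category.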

After message evaluation has been carried out for every category, the message evaluation values are compared with one another and the message is classified into the category with the largest message evaluation value (ST3053, ST3054). The largest message evaluation value is also compared with a threshold (ST3052); if it does not exceed the threshold, the message is regarded as belonging to no harmful message category and is classified as a normal message (ST3060). For example, suppose the evaluation means (205) evaluates a registered post at 0.7 for adult posts, 0.4 for advertisement posts, and 0.3 for slander and abusive posts; comparing these values, it tentatively classifies the post as an adult post, which has the highest value of 0.7. However, because this highest value is not greater than the preset threshold of 0.8, the post is classified as a normal message rather than a harmful one. The threshold thus serves to set the cut-off point for classification. The present embodiment uses a default of 0.8, but like the word evaluation value for unlearned words and the number of optimal words extracted, the threshold can be set by the administrator to the value that gives the best accuracy.
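
The final decision (ST3052 to ST3054, ST3060) reduces to a maximum followed by a threshold test. In the sketch below a value equal to the threshold is treated as normal, following the "not greater than" wording of the example; the function name, the `NORMAL` label, and the category labels are illustrative assumptions.

```python
from typing import Dict, Tuple

NORMAL = "normal"


def classify(message_values: Dict[str, float], threshold: float = 0.8) -> Tuple[str, float]:
    """Pick the category with the largest message evaluation value, or NORMAL."""
    best_category = max(message_values, key=message_values.get)
    best_value = message_values[best_category]
    if best_value <= threshold:  # does not exceed the threshold: not harmful
        return NORMAL, best_value
    return best_category, best_value


# Example from the text: the best value 0.7 does not exceed 0.8, so the post is normal.
label, value = classify({"adult": 0.7, "advertisement": 0.4, "slander_abuse": 0.3})
print(label, value)
```

Lowering `threshold` makes the filter stricter, which is the kind of tuning the text leaves to the administrator.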

In this way, the message evaluation values for the individual harmful message categories are compared for a message, and whether the largest of them exceeds the threshold determines whether the message is harmful; a harmful message can further be classified into a finer subcategory.

The invention made by the present inventors has been described in detail above with reference to the embodiment, but the present invention is not limited to that embodiment and can of course be modified in various ways without departing from its gist.

That is, although the above embodiment is described using a bulletin board, the invention is not limited to this and can of course be applied in any field that uses messages whose harmfulness needs to be determined.

As described above, the harmful message filtering system, its filtering method, and the recording medium on which the method is recorded according to the present invention improve on various conventional harmful message filtering techniques and make it possible to filter harmful messages in Internet communities.

In addition, the harmful message filtering system, its filtering method, and the recording medium on which the method is recorded according to the present invention reclassify filtered harmful messages into finer subcategories, enabling smooth communication in Internet communities without harm from harmful messages.

In addition, the harmful message filtering system, its filtering method, and the recording medium on which the method is recorded according to the present invention provide effective blocking and classification of harmful messages appearing in Internet communities and minimize the misclassification rate that arises when community administrators and users define harmful messages differently.

In addition, the harmful message filtering system, its filtering method, and the recording medium on which the method is recorded according to the present invention provide intelligent spam filtering technology for Internet communities, an area with no prior research and development within the rapidly growing anti-spam market, thereby securing high marketability and enabling derivative research.

Claims

A system for filtering harmful messages, comprising a database for storing messages received from clients connected over a network, the system comprising:

message receiving means for receiving the message;

word extraction means for extracting a plurality of words from the received message; and

evaluating means for determining whether the message is harmful by using the extracted plurality of words and storing it in the database,

wherein the evaluating means assigns a word evaluation value stored in the database to each extracted word to determine whether the message is harmful, and subdivides the message into one of a plurality of harmful message classifications, and

wherein the word evaluation value is assigned for each of the plurality of harmful message classifications and is the ratio of messages containing the word to all messages in that harmful message classification.
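The word evaluation value defined in this claim can be illustrated with a minimal sketch. It assumes that training data is available as (classification, words) pairs; the function name, the sample classifications, and the sample words are hypothetical and do not come from the patent.

```python
from collections import defaultdict


def word_evaluation_values(labeled_messages):
    """Return {classification: {word: ratio}}.

    ratio = messages in the classification that contain the word
            / all messages in the classification.
    """
    totals = defaultdict(int)                             # messages per classification
    containing = defaultdict(lambda: defaultdict(int))    # messages containing each word
    for classification, words in labeled_messages:
        totals[classification] += 1
        for word in set(words):  # a message counts once per word, however often it repeats
            containing[classification][word] += 1
    return {
        c: {w: n / totals[c] for w, n in counts.items()}
        for c, counts in containing.items()
    }


# Hypothetical sample data, for illustration only.
sample = [
    ("advertising", ["cheap", "loan", "click"]),
    ("advertising", ["cheap", "pills"]),
    ("abuse", ["idiot", "click"]),
]
values = word_evaluation_values(sample)
```

Here "cheap" appears in both advertising messages and therefore receives the value 1.0 for that classification, while "loan" appears in one of two and receives 0.5.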

The system of claim 1,

further comprising learning means for storing, in the database, the plurality of words and the word evaluation value assigned to each of the plurality of words.

delete

The system of claim 1,

wherein the evaluating means further extracts a plurality of optimal words from the extracted plurality of words for each of the plurality of harmful message classifications, and

wherein the optimal words are extracted sequentially, starting from the words whose assigned word evaluation values are farthest from 0.5 toward 0 or 1.
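The selection rule in this claim amounts to ranking words by their distance from 0.5. The sketch below assumes that the word evaluation values for one classification are held in a dictionary and that the number of optimal words is a caller-supplied parameter; the claim does not state how many words are extracted, and the function name and sample values are hypothetical.

```python
def select_optimal_words(word_values, count):
    """Return the `count` words whose evaluation value is farthest from 0.5."""
    ranked = sorted(
        word_values,
        key=lambda word: abs(word_values[word] - 0.5),
        reverse=True,  # largest distance from 0.5 first
    )
    return ranked[:count]


# Hypothetical values for a single harmful message classification.
values = {"cheap": 0.97, "hello": 0.48, "loan": 0.9, "meeting": 0.05, "today": 0.6}
best = select_optimal_words(values, 3)
```

Words near 0.5 such as "hello" say little about the classification, so they are ranked last, while words near either 0 or 1 are kept.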

The system of claim 4,

wherein the evaluating means further calculates, for each of the plurality of harmful message classifications, the probability that the message belongs to that harmful message classification, assigns the largest of the calculated values as the message evaluation value, and subdivides the message into the corresponding harmful message classification.

The system of claim 5,

wherein the probability that the message belongs to the harmful message classification is computed by the following formula

[formula not reproduced in the source text]

where words denotes the extracted optimal words, spam₁ denotes the harmful message classification, P(words | spam₁) is the word evaluation value assigned to the plurality of optimal words for the harmful message classification spam₁, P(spam₁) is the ratio of messages in the harmful message classification spam₁ to all messages, and P(words) is the ratio of messages containing all of the extracted optimal words to all messages.
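The formula referred to in this claim is not present in this text. The three quantities the claim defines correspond to the likelihood, prior, and evidence terms of Bayes' theorem, so the intended expression is presumably of the following form. This is a reconstruction from the definitions, not the formula as filed.

```latex
P(\mathrm{spam}_1 \mid \mathrm{words}) =
  \frac{P(\mathrm{words} \mid \mathrm{spam}_1)\, P(\mathrm{spam}_1)}{P(\mathrm{words})}
```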

The system of claim 6,

wherein the message is classified as a normal message when the largest of the calculated values is less than a specified threshold.
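The decision rule in this claim can be sketched as follows. The sketch assumes that the per-classification probability takes the Bayes form (likelihood times prior, divided by evidence), which the three defined terms suggest but which is not confirmed by the text. The function name, the "normal" label, and all numeric values are hypothetical.

```python
def classify(likelihoods, priors, evidence, threshold):
    """Pick the most probable harmful classification, or "normal" below the threshold.

    likelihoods: {classification: P(words | classification)}
    priors:      {classification: P(classification)}
    evidence:    P(words), shared by all classifications
    Returns (label, message_evaluation_value).
    """
    # Assumed Bayes form: posterior = likelihood * prior / evidence.
    posteriors = {c: likelihoods[c] * priors[c] / evidence for c in likelihoods}
    best = max(posteriors, key=posteriors.get)
    score = posteriors[best]  # the largest value becomes the message evaluation value
    if score < threshold:
        return "normal", score
    return best, score


# Hypothetical inputs, for illustration only.
likelihoods = {"advertising": 0.8, "abuse": 0.1}
priors = {"advertising": 0.3, "abuse": 0.2}
label, score = classify(likelihoods, priors, evidence=0.3, threshold=0.5)
strict_label, _ = classify(likelihoods, priors, evidence=0.3, threshold=0.9)
```

With the lower threshold the message is assigned to "advertising"; with the stricter threshold the same message falls back to "normal", which shows how the threshold trades missed harmful messages against false alarms.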

The system according to any one of claims 1, 2 and 4 to 7,

wherein the message is received from the client through a community provided by the harmful message filtering system.

The system of claim 8,

wherein the message is the content of a post published on the community.

A method for filtering harmful messages with a harmful message filtering system having a database storing messages received from clients connected through a network, the method comprising:

(a) receiving a message from the client,

(b) extracting a plurality of words from the received message,

(c) determining whether the current process is a learning process or an evaluation process, and

(d) determining whether the message is harmful by using the extracted plurality of words when the evaluation process is determined in step (c),

wherein, in step (d), a word evaluation value stored in the database is assigned to each word extracted in step (b) to determine whether the message is harmful, and the message is subdivided into one of a plurality of harmful message classifications, and

wherein the word evaluation value is the ratio of messages containing the word to all messages in the harmful message classification.

The method of claim 10,

wherein the method for filtering harmful messages further comprises,

when the learning process is determined in step (c),

(e1) assigning, to each extracted word, the word evaluation value for each of the plurality of harmful message classifications, and

(e2) storing the word and the word evaluation value in the database.
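The branch between the learning process and the evaluation process can be sketched as below. The sketch assumes whitespace tokenization, an in-memory dictionary standing in for the database, and a caller-supplied mode flag and classification label; none of these details are specified in the claims, and all names are hypothetical.

```python
def extract_words(text):
    # Assumption: lowercase whitespace tokenization; the claims do not say how words are extracted.
    return text.lower().split()


def handle_message(database, text, mode, classification=None):
    """Steps (b), (c) and, in learning mode, (e1)-(e2)."""
    words = set(extract_words(text))                       # step (b)
    if mode == "learning":                                 # step (c)
        entry = database.setdefault(
            classification, {"messages": 0, "counts": {}, "values": {}}
        )
        entry["messages"] += 1
        for word in words:
            entry["counts"][word] = entry["counts"].get(word, 0) + 1
        # (e1) value = messages containing the word / all messages in the classification
        entry["values"] = {
            w: n / entry["messages"] for w, n in entry["counts"].items()
        }
        return entry["values"]                             # (e2) kept in `database`
    # Evaluation mode: look up stored values for the extracted words.
    # Assumption: words never seen during learning are skipped.
    return {
        c: {w: entry["values"][w] for w in words if w in entry["values"]}
        for c, entry in database.items()
    }


database = {}
handle_message(database, "Cheap loan click here", "learning", "advertising")
handle_message(database, "cheap pills", "learning", "advertising")
looked_up = handle_message(database, "cheap flights today", "evaluation")
```

After two learning messages, "cheap" has the value 1.0 and "loan" the value 0.5 for the hypothetical "advertising" classification; the evaluation call only reads those stored values and leaves the database unchanged.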

delete

The method of claim 10,

wherein step (d) comprises:

(d1) extracting a plurality of optimal words from the extracted plurality of words for each of the plurality of harmful message classifications;

(d2) calculating, for each of the plurality of harmful message classifications, the probability that the message belongs to that harmful message classification;

(d3) subdividing the message into a corresponding harmful message classification by allocating the largest value among the calculation results as a message evaluation value;

and (d4) storing the message classified into the harmful message classification and the message evaluation value in the database.

The method of claim 13,

wherein, in step (d1),

words are extracted sequentially as the optimal words, starting from the words whose assigned word evaluation values are farthest from 0.5 toward 0 or 1.

The method of claim 13,

wherein, in step (d2),

the probability that the message belongs to the harmful message classification is computed by the following formula

[formula not reproduced in the source text]

where words denotes the extracted optimal words, spam₁ denotes the harmful message classification, P(words | spam₁) is the word evaluation value assigned to the plurality of optimal words for the harmful message classification spam₁, P(spam₁) is the ratio of messages in the harmful message classification spam₁ to all messages, and P(words) is the ratio of messages containing all of the extracted optimal words to all messages.

The method of claim 15,

wherein the message is classified as a normal message when the largest of the calculated values is less than a specified threshold.

The method according to any one of claims 10, 11, 13 to 16,

The method of claim 17,

wherein the message is the content of a post published on the community.

A computer-readable recording medium having recorded thereon a program for executing a method of filtering harmful messages with a harmful message filtering system having a database storing messages received from clients connected through a network, the method comprising:

(a) receiving a message from the client,

(b) extracting a plurality of words from the received message,

(c) determining whether it is a learning process or an evaluation process,

wherein the word evaluation value is the ratio of messages containing the word to all messages in the harmful message classification.