KR100426341B1 - System for searching an appointed web site

Info

Publication number: KR100426341B1
Application number: KR10-2001-0010147A
Authority: KR
Inventors: 김동우; 유진용
Original assignee: 김동우
Priority date: 2001-02-27
Filing date: 2001-02-27
Publication date: 2004-04-08
Also published as: KR20020069762A

Abstract

The present invention relates to a designated web site search and reporting system, and a method therefor, which searches documents posted on designated web sites, classifies the search results by category, and reports them. The system includes a search agency server and a text mining server.

At each search period, the search agency server connects to a designated service provider server using URL information registered in advance, receives documents from the connected server down to a designated level of links, and forwards them to the text mining server.

The text mining server classifies each document forwarded by the search agency server into a category according to a document classification function generated from the learning documents assigned to each category, and reports the classified documents, grouped by category, to the search requester's terminal through the search agency server.

Description

Web site document search reporting system {SYSTEM FOR SEARCHING AN APPOINTED WEB SITE}

The present invention relates to an automatic document search system, and more particularly to a designated web site document search reporting system that searches documents posted on designated web sites, classifies the search results by category, and reports them.

With the spread of personal computers (PCs) and the rapid growth in the number of Internet users, the Internet has emerged as an information medium that is growing faster than any existing medium. For this reason, companies devote great effort to advertising their products and managing their image on the Internet. In particular, they use monitoring personnel to actively collect the opinions that Internet users post on various sites and reflect them in business management.

However, on the Internet, often called a "sea of information," it is nearly impossible to monitor every document favorable to a company (for example, an article in the business section of a newspaper site reporting that the company's operating profit has been maximized by improved exports), every unfavorable document (for example, a call to boycott the company's products posted on a particular site), and every article or posting that spreads rumors for the purpose of stock price manipulation. One reason is that current automatic document search systems generally list all material related to a search term, and the desired material must then be picked out from the listed documents, which lengthens the search time. In addition, current automatic document search systems have no function for reporting the results of monitoring, so monitoring personnel must re-edit the contents of the retrieved documents in order to report them. Finally, when monitoring personnel search for documents manually, each person must visit many sites one by one and look through the posted content, which is inefficient in terms of both time and staffing.

Accordingly, an object of the present invention is to provide a designated web site document search reporting system that can fetch documents from designated web sites, analyze them automatically, and automatically report the analysis results to the party who requested the search, and in which the system for searching documents posted on the designated web sites is designed in multi-threaded form so that the system load can be reduced.

Another object of the present invention is to provide a designated web site document search reporting system that can raise the accuracy of document classification by automatically and consistently recommending learning documents suited to each category.

FIG. 1 is a block diagram of a designated web site document search reporting system according to a preferred embodiment of the present invention.

FIG. 2 is a program structure diagram of the designated web site document search reporting system and an administrator terminal according to a preferred embodiment of the present invention.

FIG. 3 is a flowchart of a designated web site document search reporting process according to a preferred embodiment of the present invention.

FIG. 4 shows an example of a URL registration screen according to a preferred embodiment of the present invention.

FIG. 5 shows an example of a search period setting screen according to a preferred embodiment of the present invention.

FIG. 6 shows an example of a report management screen according to a preferred embodiment of the present invention.

FIG. 7 shows an example of a document view screen in report management according to a preferred embodiment of the present invention.

FIG. 8 shows an example of an automatically assigned document learning screen according to a preferred embodiment of the present invention.

FIG. 9 shows an example of an administrator registration screen according to a preferred embodiment of the present invention.

To achieve the above objects, a designated web site document search reporting system according to one aspect of the present invention includes at least a search agency server and a text mining server.

A designated web site document search reporting method according to one aspect of the present invention is executed by at least one server that can connect over a network to a plurality of service provider servers, and comprises the steps of:

connecting to a designated service provider server at each search period;

reading web page documents from the connected server down to a designated level of links;

analyzing the documents that were read and classifying each into one of the categories established in advance by learning; and

reporting the documents, classified by category, to the service provider server that requested the search.

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. In describing the present invention, detailed descriptions of related well-known functions or configurations are omitted where they could unnecessarily obscure the subject matter of the invention.

FIG. 1 is a block diagram of a designated web site document search reporting system according to a preferred embodiment of the present invention. The designated web site document search reporting system (200) according to this embodiment consists mainly of a search agency server (Censor Agent Server) (210) and a text mining server (TMS) (220), and is connected to a plurality of Internet service provider servers (100, 110, 120) through the Internet (130).

At each search period, the search agency server (210) connects to a designated Internet service provider server using URL information registered in advance in the TMS (220), reads the documents (HTML documents) from the connected server down to the designated level together with the URL of the connected server, and forwards them to the TMS (220). As shown in FIG. 1, the search agency server (210) includes a web robot (212), which connects to the Internet service provider server and fetches the documents down to the designated level along with the URL of that server, and a parser (214), which parses the fetched documents, removes their tags, and outputs the result. Because the web robot (212) can be designed and operated in multi-threaded form, it can reduce the system load and also shorten the document search time.
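The tag-removal role of the parser (214) can be illustrated with a minimal sketch. It assumes Python and its standard `html.parser` module; the names `TagStripper` and `strip_tags` are hypothetical and do not come from the patent, which does not specify how the parser is implemented.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects visible text and discards tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style content.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def strip_tags(html):
    """Return the text of an HTML document with all tags removed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Skipping script and style content is a design choice of this sketch, on the assumption that only human-readable text is useful to the text mining server.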

The TMS (220), which is the text mining server, creates categories for document screening based on learning documents, analyzes the documents forwarded by the search agency server (210), classifies each into one of the defined categories, and reports the documents, grouped by category, through the search agency server (210) to the service provider server that requested the search. The TMS (220) is connected to databases (222, 224) that store the URLs of the designated web sites and the reporting results.

To aid understanding of the invention, the text mining process performed in the TMS (220) is described in more detail. The text mining server (220) according to one embodiment of the present invention classifies documents automatically through the following steps.

1. Define categories that fit the characteristics of the document set to be classified.

2. Assign learning documents that fit the nature of each defined category.

3. Generate a document classification function for automatic classification from the assigned learning documents.

4. Classify documents automatically according to the generated document classification function.

5. Define additional new categories if necessary.

6. Update or add to the learning documents assigned to each category to raise classification accuracy.

That is, the TMS (220) according to one embodiment of the present invention first analyzes the learning documents assigned to a specific category. For each category, the learning documents are examined for whether particular feature words appear, or the number of times they appear is counted. Feature words are nouns or verbs that are well suited to document classification; they may be chosen by an administrator or by statistical analysis, and may be limited to a suitable number.

Each learning document is then assigned coordinates according to the frequency, or the presence, of the feature words. These coordinates can be regarded as coordinates in a vector space spanned by the feature words. For example, if the feature words consist only of "computer," "keyboard," and "program," and a learning document contains "computer" three times, "keyboard" once, and "program" twice, the coordinates of that document in the feature-word vector space are (3, 1, 2). These coordinates are normalized to a vector of length 1 so that each document can be represented on the surface of a sphere of radius 1 in the n-dimensional space (assuming there are n feature words).
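The coordinate assignment and normalization described above can be sketched as follows. This is a minimal illustration in Python that assumes the document has already been split into tokens; the function names `feature_vector` and `normalize` are hypothetical.

```python
import math
from collections import Counter


def feature_vector(tokens, feature_words):
    """Count how often each feature word occurs in a tokenized document."""
    counts = Counter(tokens)
    return [counts[w] for w in feature_words]


def normalize(vec):
    """Scale a vector to length 1 so it lies on the unit sphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        # A document with no feature words has no direction; keep it at the origin.
        return [0.0] * len(vec)
    return [x / norm for x in vec]


features = ["computer", "keyboard", "program"]
tokens = ["computer"] * 3 + ["keyboard"] + ["program"] * 2 + ["mouse"]
vec = feature_vector(tokens, features)  # the (3, 1, 2) example from the text
unit = normalize(vec)
```

Words that are not feature words, such as "mouse" here, are simply ignored when the coordinates are computed.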

Next, for all learning documents in a specific category, the occurrence probability of each feature word extracted from the input document is calculated by Equation 1 below.

In Equation 1, the left side is the probability that feature word t_i appears across all documents in a specific category. On the right side, tf(t, c) is the frequency of feature word t in category c, and V is the size of the feature-word space, which is 1 when normalized. In other words, the right side is the ratio of the frequency of feature word t_i to the frequency of all feature words in category c_i, and it serves as an approximation for computing the left side. Adding 1 in the numerator is a technique that keeps the term from becoming zero, which would otherwise force the probability computed by the later multiplication to zero.
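The equation itself is not reproduced in this text, so the following Python sketch assumes the common add-one (Laplace) smoothed form, P(t | c) = (1 + tf(t, c)) / (|V| + sum of tf over all feature words in c), which matches the description of a 1 added in the numerator. Taking |V| as the number of feature words, and the name `feature_probabilities`, are assumptions of this sketch rather than details confirmed by the patent.

```python
from collections import Counter


def feature_probabilities(category_docs, feature_words):
    """Estimate P(t | c) for each feature word t from one category's learning documents.

    category_docs: list of tokenized learning documents assigned to the category.
    Assumes add-one smoothing with |V| equal to the number of feature words.
    """
    vocab = set(feature_words)
    tf = Counter()
    for tokens in category_docs:
        for t in tokens:
            if t in vocab:
                tf[t] += 1
    total = sum(tf.values())   # frequency of all feature words in the category
    v = len(feature_words)     # assumed size of the feature-word space
    return {w: (1 + tf[w]) / (v + total) for w in feature_words}


docs = [["computer", "computer", "keyboard"], ["computer", "program"]]
features = ["computer", "keyboard", "program", "mouse"]
probs = feature_probabilities(docs, features)
```

With this smoothing, "mouse" never appears in the learning documents yet still receives a small non-zero probability, which is the effect the text attributes to the added 1.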

Once this value has been obtained, everything is in place to decide which category an input document belongs to. For all feature words in input document d, the probability given by Equation 2 below is calculated.

The probability on the left side is the probability of occurrence of document d given that it belongs to a specific category, that is, among all documents belonging to that category, the probability of occurrence of a document having the same coordinates in the feature-word vector space as document d. Each factor in the product on the right side is obtained from Equation 1. This calculation rests on the assumption that the probabilities corresponding to the factors on the right side are mutually independent.

Next, the probability that a given document belongs to category c_i is calculated by Equation 3 below.

This value corresponds to the document classification function mentioned earlier. Here P(c_i) is the category occurrence probability, calculated as the number of documents belonging to the category out of the total number of documents, and P(d) is the probability of occurrence of a document having the same coordinates as document d in the feature-word vector space. Because the document classification function is not used to obtain an absolute value but only to compare categories with one another, the denominator P(d), which has the same value in every case, need not be calculated. The resulting document classification function is therefore used to compute the relative probability of document d for each category. In other words, this value is computed for every category and document d is assigned to the category for which it is largest, so that the documents are classified by category.
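The classification rule described above (the category prior multiplied by the product of per-feature-word probabilities, with P(d) dropped and the largest value chosen) can be sketched as follows. The sketch is in Python, works with logarithms to avoid numeric underflow, and assumes add-one smoothing and at least one learning document per category. The names `train` and `classify` and the sample data are hypothetical.

```python
import math
from collections import Counter


def train(labeled_docs, feature_words):
    """labeled_docs maps a category name to its list of tokenized learning documents."""
    vocab = set(feature_words)
    n_docs = sum(len(docs) for docs in labeled_docs.values())
    v = len(feature_words)
    model = {}
    for cat, docs in labeled_docs.items():
        tf = Counter(t for tokens in docs for t in tokens if t in vocab)
        total = sum(tf.values())
        model[cat] = {
            # P(c): share of all learning documents that belong to this category.
            "log_prior": math.log(len(docs) / n_docs),
            # P(t | c) with add-one smoothing (an assumption of this sketch).
            "log_like": {w: math.log((1 + tf[w]) / (v + total)) for w in feature_words},
        }
    return model


def classify(tokens, model):
    """Return the category with the largest log P(c) + sum of log P(t | c)."""
    best_cat, best_score = None, -math.inf
    for cat, params in model.items():
        score = params["log_prior"]
        for t in tokens:
            if t in params["log_like"]:  # tokens that are not feature words are ignored
                score += params["log_like"][t]
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat


training = {
    "friendly": [["profit", "export", "growth"], ["profit", "record"]],
    "unfriendly": [["boycott", "defect"], ["boycott", "complaint", "defect"]],
}
features = sorted({t for docs in training.values() for doc in docs for t in doc})
model = train(training, features)
```

Calling `classify(["profit", "growth"], model)` then picks the category whose learning documents use those words most often.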

Meanwhile, the accuracy and reliability of automatic classification in the text mining server (220) depend heavily on which documents are selected as learning documents. In the text mining server (220) according to one embodiment of the present invention, an administrator analyzes the current situation and updates the learning documents assigned to each category as appropriate. This imposes the labor of collecting documents one by one. Moreover, when several people do the work, or even when the same person does it at different times, consistency in the selection of learning documents can be lost, with the result that document classification is not stable.

To address this, another embodiment of the present invention automatically recommends suitable learning documents for each category from the classified document set, so that documents can be classified more accurately. More specifically:

For document classification to perform well, the learning documents must be selected so that the vector space spanned by the feature words forms a good coordinate system. That is, the learning documents should be selected so that the frequency characteristics of the feature words differ clearly from category to category.

In more detail, the frequency distribution curve of the feature words w in each category c_i is computed for every category. Learning documents are then selected so that the relative distances between these per-category frequency distribution curves are as large as possible.

In the present invention, the distance between these frequency distribution curves is defined as in Equation 4.

The term that appears in Equation 4 is defined by Equation 5 below.

Equation 4 defines the distance between two frequency distribution curves, and Equation 5 has already been explained in connection with Equation 1. In Equation 5, the value of λ is set to 1.

This distance function is, however, asymmetric: the distance from a to b differs from the distance from b to a. To remove this asymmetry, a weighted average is calculated and used as the actual distance function, defined as in Equation 6.

This KLdist (Kullback-Leibler distance) function represents the entropy of a document when it is assigned to a particular category relative to when it is assigned to another category. For the automatic classification function to perform well, learning documents should be selected so that this entropy value becomes larger. That is, to make KLdist as large as possible, the TMS (220) according to this further embodiment takes an unclassified document d that is a candidate learning document, assumes that it has been assigned to a specific category c_i, and computes the distance KLdist to each of the other categories. If this distance is larger than a predetermined threshold, the document is assigned as a learning document of category c_i, which allows documents to be classified automatically with greater accuracy than in a general text mining system. For reference, the threshold may be determined, for example, by taking the average KLdist value among the existing learning documents and using that value, or that value reduced by a small margin.
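Equations 4 and 6 are not reproduced in this text, so the following Python sketch rests on assumptions: it takes the asymmetric distance to be the standard Kullback-Leibler divergence and symmetrizes it with a weighted average whose default weight of 0.5 is chosen here for illustration only. The function names and the threshold check are likewise hypothetical.

```python
import math


def kl(p, q):
    """Standard Kullback-Leibler divergence from p to q.

    p and q map each feature word to a strictly positive probability,
    which add-one smoothing guarantees. Assumed form, not the patent's own.
    """
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def kl_dist(p, q, weight=0.5):
    """Weighted average of both directions; symmetric when weight is 0.5."""
    return weight * kl(p, q) + (1 - weight) * kl(q, p)


def is_recommended(candidate_dist, other_category_dists, threshold):
    """Accept a candidate learning document if it is far from every other category."""
    return all(kl_dist(candidate_dist, other) > threshold
               for other in other_category_dists)


p = {"profit": 0.5, "boycott": 0.5}
q = {"profit": 0.9, "boycott": 0.1}
```

Here `candidate_dist` stands for the feature-word distribution of the category with the candidate document added, which is one possible reading of "assuming the document has been assigned to the category."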

FIG. 2 shows the program structure of the designated web site document search reporting system (200) and the administrator terminal (270) according to a preferred embodiment of the present invention. In this embodiment the search agency server (210) and the TMS (220) are implemented on a single physical server, but they may also be operated as separate servers.

Referring to FIG. 2, the server (200) contains several layered programs. At the base is an operating system (260), which may be any of various systems such as Windows NT/2000, Linux, or UNIX. Above it sits a web server (250) that depends on the operating system, and above the web server (250) is a JVM (Java Virtual Machine) (240), which is likewise provided in an operating-system-dependent form. The application programs (Java Servlet / JSP / Java Application) (230), which depend on the JVM (240), sit on top. The administrator terminal (270) similarly runs an operating system such as Windows 98/2000 or Linux, with a web browser and a robot control and management page layered above it in that order, and exchanges data with the server (200) according to a defined communication protocol.

The operation of the designated web site document search reporting system (200) configured as described above is now explained in detail with reference to FIGS. 3 to 9.

FIG. 3 is a flowchart of the designated web site search reporting process according to a preferred embodiment of the present invention, and FIGS. 4 to 9 show examples of screens displayed while the designated web site search reporting system (200) is in operation.

Referring to FIG. 3, in order to search a desired web site periodically down to a desired depth, the URL of the web site to be searched, the search period, and the depth must first be set (step 300). The administrator selects environment settings on the initial screen shown in FIG. 4, chooses "URL management" from the settings menu, and enters and selects the URL of the web site to be searched, whereupon the entered URL is registered in the URL DB (222) of the designated web site document search reporting system (200). As shown in FIG. 5, the administrator also chooses "search management" from the settings menu and sets a search period in units of months, weeks, or days together with a depth, so that the designated web site search reporting system (200) can search the designated web sites down to the designated level at each search period. The "depth" setting indicates how many levels of links should be followed when the designated web sites are visited.
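A depth-limited, multi-threaded crawl of the kind configured here can be sketched as follows. The sketch assumes Python's standard `concurrent.futures` module and a caller-supplied `fetch(url)` function that returns the page text and its outgoing links; `fetch`, `crawl`, and the convention that the registered URL itself is depth 0 are all assumptions for illustration, not details taken from the patent.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl(start_urls, max_depth, fetch, max_workers=4):
    """Fetch pages level by level, following links up to max_depth.

    fetch(url) -> (document_text, list_of_linked_urls) is a hypothetical,
    caller-supplied function. Registered URLs count as depth 0 here.
    Returns a dict mapping each visited URL to (depth, document_text).
    """
    frontier = list(dict.fromkeys(start_urls))  # de-duplicate, keep order
    seen = set(frontier)
    collected = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for depth in range(max_depth + 1):
            if not frontier:
                break
            # Pages on the same level are fetched concurrently by worker threads.
            results = list(pool.map(fetch, frontier))
            next_frontier = []
            for url, (text, links) in zip(frontier, results):
                collected[url] = (depth, text)
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return collected


# A tiny in-memory "site" stands in for real HTTP requests in this example.
site = {
    "a": ("page A", ["b", "c"]),
    "b": ("page B", ["d"]),
    "c": ("page C", ["a"]),
    "d": ("page D", []),
}
pages = crawl(["a"], max_depth=1, fetch=lambda url: site[url])
```

Tracking visited URLs in `seen` keeps link cycles, such as the link from "c" back to "a", from causing repeated fetches.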

Once the administrator has set the URL, search period, and depth as described above, the web robot (212) of the search agency server (210) checks whether the set search period has arrived (step 310) and, at each search period, attempts to connect to the designated web site (step 320). If the site requests user authentication during the connection attempt (step 330), the web robot responds by sending the ID and password for authentication (step 340). The ID and password are registered together with the URL at registration time, as shown in FIG. 4. If no user authentication is required, the web robot (212) reads the URL and the HTML documents from the connected web site (step 350) and passes them to the parser (214). The parser (214) parses the HTML documents and sends the documents, with tags removed, to the TMS (220) (steps 360 and 370).

The TMS (220) compares the depth of the current tag-stripped document with the designated depth registered in step 300 (step 380) and stores tag-stripped documents (step 390) until the depth of the current document exceeds the designated depth. When the depth of the current document exceeds the designated depth, the TMS performs morphological analysis on the stored documents (step 400), extracts their features (step 410), and automatically classifies the stored documents by category according to the document classification function generated from the assigned learning documents (step 420). An example screen of documents automatically classified by category is shown in FIG. 6. The screen in FIG. 6 is one screen of the "Report Management" item, and the category name is "Samsung-friendly." The view of documents automatically classified under the "Samsung-friendly" category consists mainly of the title, the collected document, the edited document, and the search time. On this screen, the edited document is a document that the administrator has modified and that is reported to the search requester. The unmodified collected document may also be used for reporting. For the search requester, receiving the collected documents themselves is preferable for gathering accurate information or opinions.

If a document that was read cannot be classified into any of the existing categories, it is stored separately as an unclassified document so that a new category can be set up, and it is then classified under the new category after the category setting and learning steps.

FIG. 7 shows an example of the document screen displayed when one of the documents in the "Samsung-friendly" category shown in FIG. 6 is selected. Documents collected from the designated web sites, or edited on the basis of them, are stored in the reporting DB (224), read out at the interval set at the request of the search requester, and reported through the search agency server (210) to the search requester's client terminal or service provider server (step 430).

The client who requested the document search can therefore receive reports of documents favorable or unfavorable to the company without directly monitoring the many designated web sites. The search requester can also build the reported documents into a database and use it as material for improving customer service in the future.

도 8은 본 발명의 바람직한 실시예에 따른 학습 문서 설정 화면 예시도를 도시한 것으로 "삼성우호" 카테고리명을 가지는 경우의 화면 예시도이다. 도 8에 도시한 바와 같이 특정 카테고리명으로 자동 분류된 문서의 우측에는 상관도가 기재되어 있다. 상관도란 분류된 문서와 학습 문서들간의 상관성을 수치로 나타낸 것으로서, "1"에 가까운 값을 가질수록 상관도가 높다 할 수 있다.8 is a diagram illustrating an example of a learning document setting screen according to a preferred embodiment of the present invention. As shown in Fig. 8, a correlation diagram is described on the right side of a document automatically classified under a specific category name. Correlation is a numerical representation of the correlation between the classified document and the learning document, the higher the value is closer to "1" can be said.

The administrator can therefore improve the document classification performance of the designated web site document search reporting system (200) by selecting documents with a high correlation value and instructing the system to update the learning documents assigned to each category with them.
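The correlation-based update described above could be sketched as follows. This is a minimal Python illustration under loud assumptions: the text here does not define how the correlation value is computed, so cosine similarity between word-count vectors is used purely as a stand-in that also approaches 1 for closely related documents. The names `term_counts`, `cosine`, `rank_by_correlation`, and `update_learning_documents` are hypothetical.

```python
import math
from collections import Counter


def term_counts(text):
    """Count whitespace-separated, lowercased words in a document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity of two word-count vectors (assumed stand-in measure)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_by_correlation(learning_docs, classified_docs):
    """Return (document, correlation) pairs, highest correlation first."""
    profile = Counter()
    for text in learning_docs:
        profile.update(term_counts(text))
    scored = [(d, cosine(term_counts(d), profile)) for d in classified_docs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def update_learning_documents(learning_docs, selected_docs):
    """Add the administrator's selected documents, skipping duplicates."""
    return learning_docs + [d for d in selected_docs if d not in learning_docs]
```

The ranking only orders the candidates for display; as in the description, the choice of which documents to add remains with the administrator.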

FIG. 9 shows an example administrator registration screen according to a preferred embodiment of the present invention, displayed when administrator registration is selected from the settings menu under the "Environment Settings" item. Through this screen, administrators can be registered to edit classified documents, and administrators can be designated to monitor unclassified documents.

As described above, because the present invention fetches documents from the designated web sites, analyzes them automatically, and automatically reports the analysis results to the document search requester, it has the advantage of reducing the financial cost and time that must be invested in document searching.

In addition, because the system for searching documents posted on the designated web sites can be designed in multithreaded form, the present invention has the advantage of reducing the load on the system and minimizing document search time. And because the monitoring results can be received periodically and built into a database, they can also be used as material indicating the future direction of customer service.
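The multithreaded retrieval mentioned above could be sketched as follows. This is a minimal Python illustration, not the disclosed design: it uses the standard library's `ThreadPoolExecutor`, and the function name `fetch_all`, the caller-supplied `fetch` callable, and the `max_workers` default are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Fetch every URL on a pool of worker threads.

    fetch(url) is an assumed caller-supplied function that returns the
    document for one URL. Returns {url: document}, with None for any URL
    whose fetch raised an exception.
    """
    urls = list(urls)

    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception:
            return None  # one failing site should not abort the whole cycle

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs
        return dict(zip(urls, pool.map(safe_fetch, urls)))
```

Passing the fetch function in as a parameter keeps the sketch independent of any particular HTTP library and lets the threading logic be exercised without network access.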

Furthermore, by classifying the content posted on the designated web sites into categories refined through learning and reporting it by category, the present invention can provide the requester with accurate posted information. And because it consistently and automatically recommends learning documents suited to each category, it can classify documents more accurately than a general text mining system.

Although the present invention has been described with reference to the embodiments shown in the drawings, these are merely exemplary, and those of ordinary skill in the art will understand that various modifications and equivalent other embodiments are possible. Accordingly, the true technical scope of protection of the present invention should be determined only by the appended claims.

Claims

delete

In a designated web site document search reporting system that can connect to a plurality of service provider servers via a network,

the designated web site document search reporting system comprises a search agency server and a text mining server;

the search agency server connects, in each search cycle, to a designated service provider server using pre-registered URL information, receives documents up to a designated level from the connected server, and passes them to the text mining server; and

the text mining server classifies each document passed from the search agency server into a category according to a document classification function generated from the learning documents assigned to each category, and reports the documents, classified by category, to the search requester's terminal through the search agency server, wherein the text mining server calculates the distribution of the number of feature words (w) for each category, calculates the distance between the case in which a document d, an as-yet-unclassified candidate learning document, is assumed to be assigned to a specific category and the case in which it is assumed to be assigned to another category, and, if the distance value is larger than a preset threshold, updates the learning documents by assigning the document d as a learning document.

The designated web site document search reporting system of claim 8, wherein the threshold is the average of the distance values, calculated according to Equation (7), between the previously assigned learning documents.