KR20030039334A

KR20030039334A - search engine and method using the semantic relationship information for document or keyword

Info

Publication number: KR20030039334A
Application number: KR1020030008329A
Authority: KR
Inventors: 이태희
Original assignee: 이태희
Priority date: 2003-02-10
Filing date: 2003-02-10
Publication date: 2003-05-17

Abstract

PURPOSE: A device and a method for searching data using semantic relation data of a search term are provided to extract plural semantic relation data for a document by automating, through an intelligent inference engine, the process of classifying and indexing an input document. CONSTITUTION: After an input news document is received, the news morphemes of the input document are extracted by morphological analysis and stopword processing (S15). A combination vector comprising the isomorphic morphemes and their corresponding classification codes is extracted by comparing the news morphemes with dictionary database I using inference engine I, and the core morpheme combination data is extracted by applying an inference rule to the combination vector (S30-S40). The semantic connection code data for the core morpheme combination data is automatically extracted from dictionary database II using inference engine II (S50-S60). The core morpheme combination data combined with the connection code data is stored in connection dictionary database III, the core morpheme combination data is attached to the input document, and the input document is stored in a database (S65-S75).

Description

Search engine and method using the semantic relationship information for document or keyword

The present invention relates to an information retrieval apparatus and method that enable meaning-based rather than word-based retrieval by using semantic connection code information, automatically generated for a given document by an intelligent inference engine, as the matching condition for a database search. More specifically, the search term entered by the user is not itself used as the matching condition. Instead, the intelligent inference engine extracts the core morphemes of the original document and automatically assigns semantic connection code information to them, and this information serves as a new matching condition. Documents inferred to be semantically related are therefore returned as search results even if they do not contain the same morpheme as the search term.

Information retrieval methods in current use can be classified into four types.

First, word matching takes a single word as the search criterion and extracts every document containing that word. A document that does not contain the identical word is not returned, and any document that contains the word is returned regardless of its meaning, so the error rate of the search results is high.

For example, the price of oil is commonly abbreviated in Korean as '유가' (yuga). A search on that string also retrieves documents containing '--한 이유가--' ("the reason that ..."), '유가족 모임' ("meeting of bereaved families"), or '유가증권' ("securities"), none of which has any semantic connection to the search term, so the search error rate is high.
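The false-positive problem described above can be reproduced with a few lines of Python. This is only an illustrative sketch of plain substring matching; the function name `word_match` and the first and third headlines are hypothetical and do not come from the patent.

```python
def word_match(query: str, documents: list[str]) -> list[str]:
    """Return every document whose text contains the query as a substring."""
    return [doc for doc in documents if query in doc]


documents = [
    "국제 유가 상승",       # hypothetical headline: "international oil prices rise"
    "유가족 모임",          # "meeting of bereaved families" (example from the text)
    "유가증권 시장 동향",    # hypothetical headline: "securities market trends"
]

# All three documents match '유가', although only the first concerns oil prices.
hits = word_match("유가", documents)
```

Only one of the three hits is relevant, which is the error-rate problem the text attributes to word matching.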

Second, a word is subdivided into subordinate words, each is assigned a code, and a distance-vector measure over these words is used to extract other documents of the closest type. Because the subdivision is strictly top-down, words at the same level as the search term, or at a higher level, cannot be searched.

Third, multiple words are used to find a single optimal search category. Only documents belonging to that one category are extracted, which reduces the diversity of the results, and the results vary widely depending on how each word is weighted.

Fourth, multiple words are used to find multiple optimal search categories. Here too, the results vary widely depending on how each word is weighted.

All four methods rely on word matching, in which a one-to-one comparison between the search term and the morphemes of the compared document is the basic condition of the database search. Consequently only documents that contain the search term are eligible as results, and the search error rate of word matching noted above remains a serious problem.

In the existing methods, moreover, classification-code indexing of the original document depends on direct manual input, so the range of classification is narrow and many documents remain unclassified. For example, at a news organization that produces more than 500 news items a day, the reporter enters the index for each item by hand. As a result, only a major classification of 20 or fewer categories (political news, economic news, social news, and so on) is feasible; directly indexing intermediate classes, detailed classes, and semantic linkage codes is practically impossible.

The present invention was devised to solve the above problems in classifying and indexing information and in word-matching search. Its first object is to automate the classification and indexing of the original document with an intelligent inference engine, so that multiple pieces of semantically related information can be extracted for the document and the corresponding connection codes can be indexed quickly and accurately.

A further object is to use this connection code index at user search time, so that a document inferred to be semantically related is presented as a valid result even if it contains no morpheme identical to the search term, thereby also addressing the search error rate of conventional word matching.

FIG. 1 illustrates the step-by-step procedure for automatically generating semantic connection code information for a given document according to the present invention.

FIG. 2 is a block diagram of the apparatus for automatically generating connection code information according to the present invention.

FIG. 3 is a block diagram of inference engine I according to the present invention.

FIG. 4 is a block diagram of inference engine II according to the present invention.

FIG. 5 illustrates the step-by-step procedure for extracting meaning-based, rather than word-based, related documents for an input search term or a selected document according to the present invention.

FIG. 6 is a block diagram of the apparatus for extracting meaning-based related documents for an input search term or a selected document according to the present invention.

FIG. 7 is a block diagram of inference engine III according to the present invention.

FIG. 8 is a simple example of dictionary DB I according to the present invention.

FIG. 9 is a simple example of dictionary DB II according to the present invention.

FIG. 10 is a simple example of connection dictionary DB III according to the present invention.

To achieve these objects, the present invention broadly comprises three steps: applying intelligent inference engine I to extract core morpheme information from the original document; applying intelligent inference engine II to automatically index connection code information for that core morpheme information; and using intelligent inference engine III to extract the connection code information corresponding to the user's input search term, retrieve the documents corresponding to that code information from the Database, and transmit them to the user.

A preferred embodiment of the present invention is described below with reference to the accompanying drawings.

First, several terms defined and used in the present invention are explained.

(1) Original document (input document): text-based information; the initial document information produced directly by an author, internally or externally.

(2) News morpheme: the set of morphemes of a news item (one kind of original document) produced by morphological analysis and stopword processing.

(3) Dictionary morpheme (database morpheme): any morpheme, of any type, contained in the dictionary DB compiled by the administrator.

(4) Isomorphic morpheme: a news morpheme that is also contained in the dictionary morphemes, that is, one with the same pattern.

(5) Core morpheme: among the isomorphic morphemes, the one or more morphemes that form the core subject term of the original document, extracted automatically by inference engine I of the present invention.

(6) Search term (input keyword): a word or combination of words that the user enters directly in order to retrieve a particular document from the database.

FIG. 1 illustrates the step-by-step procedure for automatically generating semantic connection code information for an original document according to the present invention. It consists of the following steps: receiving news information (S10); extracting news morphemes (S15); determining whether each news morpheme exists in dictionary DB I (S20); storing news morphemes not found in dictionary DB I in the morpheme DB (S25); for news morphemes found in dictionary DB I, extracting the isomorphic morpheme and its classification code from dictionary DB I (S30); generating a combination vector of the isomorphic morphemes and classification codes (S35); extracting the core morpheme combination from the combination vector using inference engine I (S40); determining whether the classification code of the core morpheme combination is 1 (S45); if it is not 1, searching dictionary DB II for the isomorphic morpheme in the code row that matches the classification code (S50); extracting from dictionary DB II the stock-item morphemes and industry codes corresponding to that isomorphic morpheme (S55); using inference engine II to extract from dictionary DB II the connection code information corresponding to the S55 or S40 information (S60); combining the results of S40 and S60 and storing them in connection dictionary DB III (S65); attaching the result of S40 to the news information of S10 (S70); and storing the result of S70 in the Database and terminating (S75 to S80).

For ease of understanding, this embodiment is described in terms of the processing of news information generated in real time (more than 500 news items are produced per day on average, that is, roughly one every three minutes). For convenience, the information producer is defined as a reporter at a news organization, the receiver as the news organization's DBMS or an external DBMS, and the user as a consumer of Internet news over a wired or wireless network.

With reference to FIG. 1, a conventional general retrieval apparatus handles the receipt and storage of data in only three steps: S10, S75, and S80.

That is, the original document is simply stored in the Database in chronological order without any processing, and when a search term is later entered, matching documents are retrieved from the Database by word matching against that term.

The news information receiving step (S10) of FIG. 1 corresponds to the ordinary data transmission and reception system of a communication system, and the news morpheme extraction step (S15) corresponds to the role of a morphological analyzer, a basic component of an information retrieval apparatus. Neither is newly proposed by the present invention, so a detailed description is omitted. Note only that the morphological analyzer splits a given phrase or sentence into morphemes, the smallest units of a word that carry meaning.

Step S20 determines whether each news morpheme from S15 exists in dictionary DB I, an example of which is shown in FIG. 8. Dictionary DB I organizes the morphemes that appear in news by assigning each a classification code according to types predefined by the administrator. In terms of set inclusion, a morpheme located further to the left, by column, is at a higher level, and one located further to the right is at a lower level.

As a simple example, applying the above steps to the news item "Samyangsa, flour price cut" gives the following.

(S10) Samyangsa, flour price cut

(S15) Samyangsa, flour (밀가루), price, cut, wheat (밀), powder (가루)

(S20) Samyangsa ; flour ; price ; cut

Here S20 keeps, from the morphemes of S15, only those that exist in dictionary DB I, that is, the isomorphic morphemes. Assigning classification codes to the isomorphic morphemes of S20 using dictionary DB I of FIG. 8 gives the following.

(S30) (Samyangsa, 1), (flour, 3), (price, 5), (cut, 6)

In dictionary DB I of FIG. 8, the classification code is a per-type category code for each morpheme. The administrator defines the classification criteria in advance, for example assigning 1 to an individual company (stock-item name), 2 to an industry, and 3 to an industry factor that affects an industry, and the code is what allows each morpheme to be classified according to those criteria.
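Steps S20 to S30 amount to a dictionary lookup, which can be sketched as follows. The mapping `DICTIONARY_DB_1` is a hypothetical slice of dictionary DB I holding only the codes quoted in the examples of this description, and the function name `split_news_morphemes` is likewise an assumption; the actual layout of FIG. 8 is not reproduced here.

```python
# Hypothetical slice of dictionary DB I: morpheme -> classification code.
DICTIONARY_DB_1 = {
    "Samyangsa": 1,
    "flour": 3,
    "inter-Korean economic cooperation": 4,
    "price": 5,
    "cut": 6,
    "issue": 6,
}


def split_news_morphemes(news_morphemes, dictionary):
    """Separate news morphemes into isomorphic (morpheme, code) pairs (S20/S30)
    and morphemes absent from the dictionary, which S25 sends to the morpheme DB."""
    matched = [(m, dictionary[m]) for m in news_morphemes if m in dictionary]
    unmatched = [m for m in news_morphemes if m not in dictionary]
    return matched, unmatched


news = ["Samyangsa", "flour", "price", "cut", "wheat", "powder"]
pairs, for_morpheme_db = split_news_morphemes(news, DICTIONARY_DB_1)
```

With this sample data, `pairs` reproduces the (S30) line above, and "wheat" and "powder" are the morphemes left over for the morpheme DB.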

Next, the combination vector of classification codes and morphemes is obtained according to S35, as follows.

(S35) (1356 ; Samyangsa, flour, price, cut)

The combination vector is built by first listing all classification codes from S30 in order, with no spaces or delimiters, then inserting a delimiter, and then listing all morphemes from S30 in order, separated by delimiters.
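The construction rule just described can be written as a short function. This is a minimal sketch: the function name and the exact delimiter strings are assumptions chosen to reproduce the (S35) line, and it assumes single-digit classification codes as in the examples.

```python
def build_combination_vector(pairs, code_sep=" ; ", morpheme_sep=", "):
    """Concatenate all codes without separators, then append the morphemes (S35).

    `pairs` is a list of (morpheme, classification_code) tuples in document order.
    """
    codes = "".join(str(code) for _, code in pairs)
    morphemes = morpheme_sep.join(morpheme for morpheme, _ in pairs)
    return f"({codes}{code_sep}{morphemes})"


s30 = [("Samyangsa", 1), ("flour", 3), ("price", 5), ("cut", 6)]
vector = build_combination_vector(s30)
```

Keeping the codes and the morphemes in the same order matters, because the later inference step selects the core morpheme by its position in the code string.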

In S40, inference engine I extracts the core morpheme from the combination vector on the basis of the combination of classification codes. The result is as follows.

(S40) (1, Samyangsa)

Inference engine I is a computing device in which stepwise inference rules are coded as a computer program. From the morpheme combination vector of S35 it derives a combination vector consisting of the core morpheme, that is, the subject term of the original document, and its classification code. The inference rules, for example '13**=1', '2****=2', and '31**=3', must define the operations that first derive a representative classification code from the combination of classification codes in S35 and then take as the core morpheme the morpheme at the same position as that representative code; details are given in the description of FIG. 3. Applying these rules to the combination vector of S35 yields '1356 = 1', so the core morpheme combination for the original document is (1, Samyangsa), as in S40.
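One way to read rules such as '13**=1' is as fixed-length patterns in which '*' stands for any single code digit. The sketch below works under that assumption, which the text does not state explicitly; the function names are hypothetical, and only the three rules quoted above are taken from the source.

```python
# Rules quoted in the description, as (pattern, representative code) pairs.
RULES = [("13**", 1), ("2****", 2), ("31**", 3)]


def representative_code(codes, rules):
    """Return the representative code of the first rule whose pattern matches.

    Assumption: '*' matches exactly one digit, so lengths must be equal.
    """
    for pattern, result in rules:
        if len(pattern) == len(codes) and all(
            p in ("*", c) for p, c in zip(pattern, codes)
        ):
            return result
    return None


def core_morpheme(codes, morphemes, rules):
    """Pick the morpheme at the position of the representative code (S40)."""
    rep = representative_code(codes, rules)
    if rep is None:
        return None
    position = codes.find(str(rep))
    if position < 0:
        return None
    return rep, morphemes[position]


result = core_morpheme("1356", ["Samyangsa", "flour", "price", "cut"], RULES)
```

Here `result` is `(1, "Samyangsa")`. The three quoted rules do not cover a three-digit code string such as '346'; a further rule (for instance a hypothetical '34*=3') would have to be present in the full rule set for that case.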

As another example, analyzing the news item "Flour, an important issue in inter-Korean economic cooperation" in the same way gives the following.

(S10) Flour, an important issue in inter-Korean economic cooperation

(S15) flour (밀가루), inter-Korean economic cooperation (남북경협), important, issue, wheat (밀), powder (가루), inter-Korean (남북), economic cooperation (경협)

(S20) flour ; inter-Korean economic cooperation ; issue

(S30) (flour, 3), (inter-Korean economic cooperation, 4), (issue, 6)

(S35) (346 ; flour, inter-Korean economic cooperation, issue)

(S40) (3, flour)

Step S45 determines the type of the classification code in the core morpheme combination of S40. This matters for preventing retrieval errors when semantic connection code information for the core morpheme of S40 is automatically extracted from dictionary DB II of FIG. 9. The two examples above are used to examine this in detail.

1) When the classification code is 1

In the case of "Samyangsa, flour price cut", the core morpheme combination is (1, Samyangsa), so the classification code is 1 at S45 and the procedure moves to S60, where inference engine II of FIG. 4 extracts the connection code information for 'Samyangsa' from dictionary DB II. As the partial structure in FIG. 9 shows, dictionary DB II is a database in which the administrator has predefined, for each company, every type of company information with a business relationship to it, such as its industries, raw materials and products, competitors and investee companies, and foreign affiliates. Inference engine II of FIG. 4 is a computing device with built-in operation rules that uses dictionary DB II to automatically extract semantic connection code information for the core morpheme of S40. In this example, S60 produces the following connection code information.

(S60) (1, Samyangsa ; food, feed, chemicals, textiles ; raw sugar, corn, TPA, EG ; sugar, etc.)

That is, when the classification code of the morpheme combination obtained in S40 is 1, dictionary DB II is searched for the classification code and stock-item morpheme that match the S40 morpheme combination, and all morphemes belonging to the lower levels of that stock-item morpheme are extracted.
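For the code-1 case, the lookup can be sketched as a single read of the company's entry. The nested dictionary below is a hypothetical stand-in for dictionary DB II; its field names (`industries`, `raw_materials`, `products`) are assumptions based on the categories the text lists, and the product list holds only the one product the source names before "etc.".

```python
# Hypothetical slice of dictionary DB II, keyed by stock-item (company) morpheme.
DICTIONARY_DB_2 = {
    "Samyangsa": {
        "industries": ["food", "feed", "chemicals", "textiles"],
        "raw_materials": ["raw sugar", "corn", "TPA", "EG"],
        "products": ["sugar"],  # the source lists "sugar, etc."
    },
}


def connection_codes_for_company(company, db):
    """Code-1 branch of S60: return every lower-level morpheme of the company."""
    entry = db.get(company)
    if entry is None:
        return None
    return {
        "core": (1, company),
        "industries": list(entry["industries"]),
        "raw_materials": list(entry["raw_materials"]),
        "products": list(entry["products"]),
    }


s60 = connection_codes_for_company("Samyangsa", DICTIONARY_DB_2)
```

The returned structure carries the same groups as the (S60) line above, just as named fields rather than a delimiter-separated string.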

As the result of S60 shows, the present invention has the following advantages. First, the core morpheme 'Samyangsa' is extracted from the original document, and every morpheme semantically related to it (from a business standpoint) can be indexed automatically. Second, when a user enters the search term 'Samyangsa', the result of S60 is used as the matching condition for the database search, so news items that do not contain the morpheme 'Samyangsa' but concern, for example, the food, feed, or chemical industries listed in S60 can also be retrieved, which broadens the range of results. Third, the database search for a term only needs to compare against the core morpheme combination information stored at a specific position in the S60 result, which improves search speed and accuracy.

When the classification code in S40 is not 1, however, inference proceeds through a more complex procedure.

2) When the classification code is not 1

In the case of "Flour, an important issue in inter-Korean economic cooperation", the core morpheme combination at S40 is (3, flour) and the classification code is not 1. Therefore, per S50, dictionary DB II must be searched, in the column of the code row that matches that classification code, for the isomorphic morpheme that matches the core morpheme. Then, per S55, the stock-item morphemes and industry codes that contain that isomorphic morpheme must be extracted from dictionary DB II. The result is as follows.

(S55) (Samyangsa ; 1), (Cheil Jedang ; 1), (Daehan Jedang ; 1)

The '1' in (Samyangsa ; 1) is an industry-code identifier: for example, it denotes 1 among Samyangsa's several industry codes, that is, its first industry, the food industry.

Using the result of S55 as the inference criterion for inference engine II and extracting the relevant connection code information from dictionary DB II (S60) then gives the following.

(S60) (3, flour ; Samyangsa, 1 ; raw sugar, corn ; sugar, cooking oil, flour : Cheil Jedang, 1 ; raw sugar, corn, starch ; sugar, cooking oil, flour : Daehan Jedang, 1 ; raw sugar, corn, starch ; sugar, cooking oil, flour)

Here too, ',', ';', and ':' serve as delimiter operators.

As in case 1), first, morphemes semantically related (from a business standpoint) to the core morpheme 'flour' can be indexed automatically. Second, when a user enters the search term 'flour', the result of S60 is used as the matching condition for the database search, so news items that do not contain the morpheme 'flour' but concern, for example, Samyangsa, food, raw sugar, or corn can also be retrieved. Third, the database search only needs to compare against the information stored at a specific position in the S60 result, which improves search speed and accuracy.

Comparing the results of 1) and 2) reveals the following difference.

In case 1), the classification code at S40 is 1, so all lower-level morphemes of the core morpheme in dictionary DB II are automatically extracted as connection codes. In case 2), the classification code at S40 is 3, that is, a product or raw-material morpheme, so S55 must extract every higher-level stock-item morpheme and industry morpheme that contains that morpheme, and the other lower-level morphemes directly linked to that industry morpheme must then be extracted as connection code information.

The reason for applying different criteria per classification code is as follows. Suppose that, for the morpheme 'flour', the related stock-item morphemes were extracted from dictionary DB II and then, as in case 1), all lower-level morphemes of those stock-item morphemes were simply extracted as connection information. The 'chemical industry' and its lower-level morphemes, which have no real connection to flour, would then also be extracted, increasing the search error rate.
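The industry-restricted lookup for the non-1 case can be sketched as below. The per-industry layout of `DICTIONARY_DB_2`, the assignment of TPA and EG to a chemicals entry, and the function name are all assumptions made for illustration; only the food-industry morphemes are taken from the (S60) example.

```python
# Hypothetical dictionary DB II, grouped per company and per industry.
DICTIONARY_DB_2 = {
    "Samyangsa": [
        {"industry": "food", "raw_materials": ["raw sugar", "corn"],
         "products": ["sugar", "cooking oil", "flour"]},
        {"industry": "chemicals", "raw_materials": ["TPA", "EG"], "products": []},
    ],
    "Cheil Jedang": [
        {"industry": "food", "raw_materials": ["raw sugar", "corn", "starch"],
         "products": ["sugar", "cooking oil", "flour"]},
    ],
}


def connection_codes_for_item(morpheme, db):
    """Non-1 branch (S50 to S60): find each company and industry that contains
    the morpheme, and return only that industry's lower-level morphemes."""
    matches = []
    for company, industries in db.items():
        for index, entry in enumerate(industries, start=1):
            if morpheme in entry["raw_materials"] or morpheme in entry["products"]:
                matches.append(
                    (company, index, entry["raw_materials"], entry["products"])
                )
    return matches


links = connection_codes_for_item("flour", DICTIONARY_DB_2)
```

Because only the matching industry entry is returned, Samyangsa's chemicals entry (and with it TPA and EG) is excluded from the links for 'flour', which is the point of treating the two cases differently.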

In this embodiment, the connection code information extracted at S60 for cases 1) and 2) is expressed as a combination of classification codes and morphemes, but if a separate code number exists for each morpheme, it can also be expressed as a combination of classification codes and code numbers.

Step S65 stores the connection code information obtained in S60 in connection dictionary DB III, of the form shown in FIG. 10, either by appending sequentially or by overwriting. Step S70 attaches the result of S40 at a specific position in the original document. The result of S40, rather than that of S60, is attached to the S10 document because, when a user later enters a search term, the corresponding connection code information is extracted from connection dictionary DB III and only needs to be compared with the S40 information attached to each original document, which improves Database search speed and accuracy.
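The search-time comparison described above might look roughly like the following. This is a loose sketch: the flat set stored per search term in `CONNECTION_DB_3`, the document dictionaries, and the second and third headlines are hypothetical, and the actual search procedure (FIG. 5) is not covered in this part of the description.

```python
# Hypothetical connection dictionary DB III: search term -> related morphemes (from S60).
CONNECTION_DB_3 = {
    "flour": {"flour", "Samyangsa", "Cheil Jedang", "raw sugar", "corn",
              "sugar", "cooking oil"},
}

# Stored documents, each carrying the (code, core morpheme) tag attached in S70.
documents = [
    {"headline": "Samyangsa, flour price cut", "core": (1, "Samyangsa")},
    {"headline": "Cheil Jedang expands sugar plant", "core": (1, "Cheil Jedang")},  # hypothetical
    {"headline": "Steel exports rise", "core": (2, "steel")},  # hypothetical
]


def semantic_search(term, connection_db, docs):
    """Return documents whose attached core morpheme is related to the term.

    Falls back to the term itself when the term has no DB III entry.
    """
    related = connection_db.get(term, {term})
    return [doc for doc in docs if doc["core"][1] in related]


results = semantic_search("flour", CONNECTION_DB_3, documents)
```

The second document is returned even though its headline never mentions flour, and each comparison touches only the small S40 tag rather than the document body.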

Below, (1) shows the data structure of the original document and (2) shows the data structure after the processing of S70; the position at which the S40 information is inserted can be chosen arbitrarily.

(1) Data structure of the original document

Major classification code | Date and time information | Headline | Body

(2) Modified original document after the processing of S70

Major classification code | Date and time information | S40 information | Headline | Body
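The two record layouts can be modelled as one record type with an optional field. This is a sketch only: the class name, field names, and sample values are assumptions, and the S40 field is placed last purely because Python dataclass defaults must follow required fields (the text says its position is arbitrary).

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class NewsDocument:
    major_code: str                                   # major classification code
    timestamp: str                                    # date and time information
    headline: str
    body: str
    core_morpheme: Optional[Tuple[int, str]] = None   # S40 information, set by S70


def attach_core_morpheme(doc: NewsDocument, core: Tuple[int, str]) -> NewsDocument:
    """S70: return a copy of the document with the S40 result attached."""
    return replace(doc, core_morpheme=core)


# Sample values below are hypothetical.
original = NewsDocument("economy", "2003-02-10 09:00",
                        "Samyangsa, flour price cut", "(body text omitted)")
tagged = attach_core_morpheme(original, (1, "Samyangsa"))
```

`original` corresponds to layout (1) and `tagged` to layout (2); the other fields are unchanged by S70.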

Steps S75 to S80 store the result of S70 in the Database and then terminate.

도 2는 본 발명에서 제시한 의미중심의 연결코드정보 자동 생성장치에 대한 구성도의 실시예로서, 자료수신부(100), 형태소분석부(150), 추론엔진 Ⅰ(200), 형태소 DB(10), 사전DB Ⅰ(20), 추론엔진Ⅱ(300), 사전DB Ⅱ (30), 관리Tool(40), 형태소분류부(400), 연결사전DB Ⅲ(50), 및 Database(500)로 구성된다.Figure 2 is an embodiment of a configuration diagram for a device for automatically generating a connection-oriented information of the semantic center presented in the present invention, the data receiver 100, the morpheme analysis unit 150, the inference engine I (200), the morpheme DB (10) ), Dictionary DB I (20), Reasoning Engine II (300), Dictionary DB II (30), Administration Tool (40), Morphological Classification (400), Linked Dictionary DB III (50), and Database (500). It is composed.

The data receiving unit (100) is a general data receiving device capable of receiving original documents generated externally or internally. The morpheme analysis unit (150) extracts news morphemes from the original document; this requires a morphological analyzer and a stopword processor, but because these are well-established techniques, a detailed description is omitted. Inference engine I (200), described in detail with reference to FIG. 3, compares the news morphemes extracted by the morpheme analysis unit (150) with the dictionary morphemes of dictionary DB I (20) to extract the identical morphemes and their classification codes, generates combination vectors of the identical morphemes and classification codes, and then extracts the core morpheme combination for the original document according to the inference rules.

Dictionary DB I (20), described in detail with reference to FIG. 8, classifies every morpheme that has appeared in news information according to predefined criteria and assigns a classification code to each. It follows a general database structure in which morphemes can be added, deleted, or modified through the management tool (40).

Inference engine II (300) automatically extracts semantic connection code information from dictionary DB II (30) according to the classification code type of the core morpheme combination. When the classification code is '1', the item morpheme identical to the core morpheme is extracted from dictionary DB II, and all morphemes at its lower level are extracted as the connection code information; when the classification code is not '1', steps S50 to S60 of FIG. 1 are followed. Dictionary DB II (30), described in detail with reference to FIG. 9, is a data structure that organizes, for each company, all morphemes with a business relationship to it, such as products, raw materials, competitors, and industry sectors, grouped by type. Although no screen example of the management tool (40) is presented in the present invention, it is a device for editing and updating the contents of dictionary DB I and II, and it must cover the contents of the morpheme DB (10), dictionary DB I (20), and dictionary DB II (30).

The semantic connection code information is generated and attached when the original document is received, as proposed in the present invention, in order to improve document search speed. For example, original documents are generated about once per minute on average, whereas under heavy traffic 10 to 100 users connect per second. If the connection code information for a document were generated only at the time of a user request, before searching and returning results, providing the information would take a long time and the server load would also be high. In some cases, however, processing in this reverse order is also possible.

The morpheme classification unit (400) combines the core morpheme information with the connection code information and stores it in connection dictionary DB III (50), and at the same time adds the core morpheme combination information to a specific position in the original document. Connection dictionary DB III (50), described with reference to FIG. 10, is a database structure that sequentially stores the results derived by inference engines I (200) and II (300).

The Database (500) is a general storage device capable of storing and managing data.

FIG. 3 is a detailed block diagram of inference engine I (200) of FIG. 2. It consists of a data receiving unit (210), a morpheme comparison operation unit (220), the morpheme DB (10), dictionary DB I (20), the management tool (40), a classification code extraction unit (230), a combination vector generation unit (240), a core morpheme extraction unit (250), and inference rules (60).

The data receiving unit (210) receives each news morpheme extracted by the morpheme analysis unit (150) of FIG. 2, and the morpheme comparison operation unit (220) compares each news morpheme with the dictionary morphemes of dictionary DB I (20) to determine whether an identical morpheme exists. If the comparison operation unit (220) finds an identical morpheme, the classification code extraction unit (230) extracts the corresponding classification code from dictionary DB I (20); a news morpheme with no identical morpheme is instead stored in the morpheme DB (10). The morpheme DB (10) is thus a storage device that temporarily holds news morphemes not found in dictionary DB I. The administrator uses the management tool (40) to distribute the morphemes stored there into dictionary DB I and II, which forms a cycle for managing new morphemes.

The combination vector generation unit (240) generates a combination vector of (morpheme, classification code) pairs from the morphemes and classification codes extracted by units (220) and (230).
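The interplay of units (220), (230), and (240) can be sketched as follows. This is a minimal illustration that assumes dictionary DB I can be modeled as a mapping from morpheme to classification code; the function name `build_combination_vector` and the dictionary contents are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def build_combination_vector(
    news_morphemes: List[str],
    dictionary_db1: Dict[str, int],
    morpheme_db: Set[str],
) -> List[Tuple[str, int]]:
    """Pair each known morpheme with its classification code.

    Morphemes missing from dictionary DB I are parked in the morpheme DB
    so that an administrator can review them later.
    """
    vector: List[Tuple[str, int]] = []
    for morpheme in news_morphemes:
        code = dictionary_db1.get(morpheme)   # comparison unit (220)
        if code is None:
            morpheme_db.add(morpheme)         # unknown morpheme: hold temporarily
        else:
            vector.append((morpheme, code))   # units (230) and (240)
    return vector


# Hypothetical dictionary contents, for illustration only.
dictionary_db1 = {"Samyang Corporation": 1, "food": 2, "cooking oil": 3}
morpheme_db: Set[str] = set()
vector = build_combination_vector(
    ["Samyang Corporation", "cooking oil", "biodiesel"], dictionary_db1, morpheme_db
)
print(vector)
print(morpheme_db)
```

Here "biodiesel" is absent from the sample dictionary, so it lands in the morpheme DB rather than in the combination vector.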

The core morpheme extraction unit (250) applies the inference rules (60) defined in the present invention to the morpheme combination vector and extracts combination information consisting of one or more core morphemes that represent the original document together with their classification codes. The inference rules operate in particular on the pattern of classification codes in this information. In Korean, the nominative noun at the beginning of a sentence is often the core morpheme because of the language's lexical structure. As a simple example, consider inference rules that extract the core morphemes from the headline of a news item. Once stopwords such as postpositions are removed, a headline usually consists of five or fewer morphemes. If dictionary DB I divides all morphemes into five types with classification codes (1, 2, 3, 4, 5), every headline can be represented as a sequence of five digits from 1 to 5. Because repetition is allowed, the total number of cases is 5*5*5*5*5 = 3125. The syntactic characteristics of these 3125 cases are analyzed in advance to find regularities such as 12324 = 1, 13145 = 11, and 35433 = 3, which are then coded into a computer program as follows.

    for i = 1 to 5
        if code(i) = 1
            if code(i+1) = 2
                if code(i+2) = 3
                    .........................
                else
                    core_code = 1
                    .........................
                end
            end
        end
    end
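The same idea can be sketched as a lookup table rather than nested conditionals. The sketch below rests on an assumption the text does not spell out: the right-hand side of a regularity such as 13145 = 11 is read as the classification codes of the core morphemes, which are then picked from the headline left to right. Only the three regularities quoted above are included; `RULES` and `extract_core` are hypothetical names.

```python
from itertools import product
from typing import Dict, List, Tuple

# Every code pattern for five morphemes drawn from five classification codes.
ALL_PATTERNS = ["".join(p) for p in product("12345", repeat=5)]

# Only the three regularities quoted in the text; the remaining rules are not given.
RULES: Dict[str, str] = {"12324": "1", "13145": "11", "35433": "3"}


def extract_core(vector: List[Tuple[str, int]]) -> List[Tuple[int, str]]:
    """Return the (code, morpheme) pairs selected by the rule for this pattern.

    Assumption: each digit on the rule's right-hand side names the code of one
    core morpheme, matched against the headline from left to right.
    """
    pattern = "".join(str(code) for _, code in vector)
    core_codes = RULES.get(pattern, "")
    used = set()
    core: List[Tuple[int, str]] = []
    for wanted in core_codes:
        for idx, (morpheme, code) in enumerate(vector):
            if idx not in used and str(code) == wanted:
                used.add(idx)
                core.append((code, morpheme))
                break
    return core


# Placeholder morphemes m1..m5 with the code pattern 13145.
headline = [("m1", 1), ("m2", 3), ("m3", 1), ("m4", 4), ("m5", 5)]
print(len(ALL_PATTERNS))
print(extract_core(headline))
```

A table keeps the 3125 cases as data that an administrator could extend, whereas the nested form fixes them in program logic; which of the two the inventor intended is not stated.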

FIG. 4 is a detailed block diagram of inference engine II (300) of FIG. 2. It consists of a data receiving unit (310), a classification code comparison operation unit (320), a morpheme extraction unit (330), dictionary DB II (30), the management tool (40), and connection code extraction unit I (340).

The data receiving unit (310) receives the result of the core morpheme extraction unit (250) of FIG. 3, and the classification code comparison operation unit (320) examines the classification code type of the core morpheme combination information received by the data receiving unit (310). As in the structure of dictionary DB I (20) shown in FIG. 8, there may be N classification codes (1, 2, 3, ..., N). If the classification code type of the core morpheme combination information is '1', the unit sends the result to connection code extraction unit I (340); otherwise it sends the result to the morpheme extraction unit (330).

The morpheme extraction unit (330) runs only when the result of the classification code comparison operation unit (320) is not '1'. In that case it extracts the identical morpheme from among the dictionary morphemes in the column of dictionary DB II (30) whose code-row entry matches the classification code. That is, if the classification code is '3' rather than '1', the identical morpheme is extracted from the column whose code-row value is 3; if it is 'N', from the column whose code-row value is N.

Connection code extraction unit I (340) plays one of two roles. If the classification code comparison operation unit (320) finds that the classification code is 1, it extracts from dictionary DB II the dictionary morpheme identical to the core morpheme and then extracts all morphemes at its lower level as the connection code information. If the classification code is not 1, it extracts from dictionary DB II the 'item name' and 'industry code' of each entry containing the identical morpheme extracted by the morpheme extraction unit (330), and then extracts the lower-level morpheme information of those entries.

For example, if the data receiving unit (310) receives the core morpheme (3, cooking oil), the classification code comparison operation unit (320) finds the classification code '3', so the morpheme extraction unit (330) extracts the identical morpheme from the column of dictionary DB II (30) that corresponds to code-row value '3'. The morpheme 'cooking oil' is found in three places, and connection code extraction unit I (340) accordingly extracts the item morphemes and industry codes that contain it, with the following result.

(Samyang Corporation; 1), (Cheil Jedang; 1), (Dongbang Yuryang; 2)

Using this result, connection code extraction unit I (340) obtains the following connection code information.

(1) (Samyang Corporation; 1) - (1, Samyang Corporation; food; raw sugar, corn; sugar, cooking oil, flour, ..)

(2) (Cheil Jedang; 1) - (1, Cheil Jedang; food; corn, oils and fats; cooking oil, flour, ..)

(3) (Dongbang Yuryang; 2) - (1, Dongbang Yuryang; food; corn, soybean; cooking oil, sesame oil, ..)
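The cooking oil example can be traced with a small sketch of the two branches of inference engine II. The data comes from the example above, but the in-memory layout is an assumption: each entry keeps its columns under a code-row value, and the values 2 (industry) and 4 (raw materials) are hypothetical, since only code 3 for the column containing 'cooking oil' follows from the text. Lists that the source truncates with '..' contain only the items actually listed.

```python
from typing import Any, Dict, List, Tuple

# Assumed layout of dictionary DB II: columns keyed by code-row value.
# Codes 2 and 4 are hypothetical; code 3 is taken from the (3, cooking oil) example.
DICTIONARY_DB2: List[Dict[str, Any]] = [
    {"item": "Samyang Corporation", "industry_code": 1,
     "columns": {2: ["food"], 4: ["raw sugar", "corn"],
                 3: ["sugar", "cooking oil", "flour"]}},
    {"item": "Cheil Jedang", "industry_code": 1,
     "columns": {2: ["food"], 4: ["corn", "oils and fats"],
                 3: ["cooking oil", "flour"]}},
    {"item": "Dongbang Yuryang", "industry_code": 2,
     "columns": {2: ["food"], 4: ["corn", "soybean"],
                 3: ["cooking oil", "sesame oil"]}},
]


def find_items(code: int, morpheme: str) -> List[Tuple[str, int]]:
    """Morpheme extraction unit (330): scan only the column tagged with `code`."""
    return [(e["item"], e["industry_code"]) for e in DICTIONARY_DB2
            if morpheme in e["columns"].get(code, [])]


def connection_codes(code: int, morpheme: str) -> List[tuple]:
    """Connection code extraction unit I (340), covering both branches."""
    if code == 1:   # the core morpheme is itself an item (company) name
        entries = [e for e in DICTIONARY_DB2 if e["item"] == morpheme]
    else:           # the core morpheme sits in one of the lower-level columns
        entries = [e for e in DICTIONARY_DB2
                   if morpheme in e["columns"].get(code, [])]
    return [(1, e["item"], *(tuple(col) for col in e["columns"].values()))
            for e in entries]


print(find_items(3, "cooking oil"))
for row in connection_codes(3, "cooking oil"):
    print(row)
```

Restricting the scan to one column is what the code row is meant to buy: a query with code 3 never has to look at the industry or raw material columns.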

FIG. 5 illustrates the step-by-step procedure according to the present invention for retrieving documents related to a user's search term by meaning rather than by word. It consists of: the user entering or selecting a search term (S1010); receiving the input search term (S1020); determining whether the search term exists in dictionary DB I (S1030); extracting the classification code of the search term from dictionary DB I (S1040); extracting from the Database the classification code and morpheme of the news item selected in S1010 (S1050); using inference engine III to extract from connection dictionary DB III the most recent connection code information matching the classification code and search term (or morpheme) (S1060); searching the Database for documents matching the connection code information (S1070); and displaying the search results in a screen frame preset for each classification code and then terminating (S1080 to S1090).

S1010 is the step in which the user enters a search term or selects the headline of a news item on the service screen, and S1020 is the step in which the user input is received through the web server. S1030 determines whether the search term entered by the user exists in dictionary DB I (20); if it does not, control returns to the initial input step (S1010), and if it does, the result is passed to the next step. S1040 looks up the search term in dictionary DB I (20) and extracts its classification code.

S1050 applies when the user selected a news headline in S1010: the news item is retrieved from the Database (500) through the web server, and the classification code and core morpheme, that is, the S40 information attached to the original news item in S70 of FIG. 1, are extracted from it.

S1060 uses inference engine III to extract, from the information stored in connection dictionary DB III (50) in S65 of FIG. 1, the connection code information that matches the classification code and search term (or core morpheme). The most recent entry in connection dictionary DB III must be used because dictionary DB I and II are continually updated through the management tool (40), and connection dictionary DB III is updated along with them.

S1070 searches the Database for document information corresponding to the connection code information extracted in the previous step. Because the search criterion is the connection code information from S1060, even though the user enters or selects only a specific keyword in S1010, the search returns not only documents that contain the search term but also documents that do not contain it yet are judged to be semantically related.

S1080 and S1090 display the results retrieved in the previous step in the optimal screen frame preset for each classification code and then terminate. The screen frame differs by classification code in order to optimize the display of information for each type of search term.

FIG. 6 is a block diagram of the apparatus that automatically retrieves semantically related documents for the input search term as described for FIG. 5. It consists of a user system (600), a communication network (650), a web server (700), a search term analysis unit (800), dictionary DB I (20), the management tool (40), inference engine III (900), connection dictionary DB III (50), and the Database (500).

The user system (600) is any device, such as a PC or portable terminal, that is capable of wired or wireless computer communication and can display the information provider's service screen. The communication network (650) is a wired or wireless computer network, including the Internet. The web server (700) is a general operating system that can exchange information with the user system, stores basic information about users, and can exchange information with the internal Database.

The search term analysis unit (800) extracts, from dictionary DB I (20) and the Database (500), the identical morpheme (or core morpheme) and its classification code for the content the user entered or selected. Inference engine III (900), described in detail with reference to FIG. 7, uses this morpheme and classification code information to extract the matching connection code information from connection dictionary DB III and then composes an SQL statement for retrieving from the Database the documents that correspond to that connection code information.

The management tool (40), connection dictionary DB III (50), and Database (500) perform the same functions as described above, so a detailed description is omitted.

FIG. 7 is a block diagram of inference engine III proposed in the present invention. It consists of a data receiving unit (910), connection code extraction unit II (920), connection dictionary DB III (50), an SQL generation unit (930), and a data transmission unit (940).

The data receiving unit (910) receives the result of the search term analysis unit (800) of FIG. 6. Connection code extraction unit II (920) compares the input search term (or core morpheme) and classification code information received by the data receiving unit (910) with the information in connection dictionary DB III (50) and extracts the corresponding connection code information. Connection code extraction unit I (340) of FIG. 4 and connection code extraction unit II (920) of FIG. 7 compare as follows.

Connection code extraction unit I (340) generates the semantic connection code information for the first time, from dictionary DB II (30), using the core morpheme and classification code information produced by the core morpheme extraction unit (250) of FIG. 3. Connection code extraction unit II (920), by contrast, works on the information that connection code extraction unit I (340) has already generated and stored in connection dictionary DB III (50), and extracts the semantic connection code information corresponding to the morpheme and classification code information extracted by the search term analysis unit (800) of FIG. 6.

Connection dictionary DB III (50) is as described above and is explained in detail with reference to FIG. 10, so further description is omitted. The SQL generation unit (930) automatically generates the database query, that is, the SQL statement, needed to retrieve from the Database (500) all document information corresponding to the connection code information extracted by connection code extraction unit II (920).

The data transmission unit (940) passes the SQL statement composed by the SQL generation unit (930) to the Database (500).
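How an SQL generation unit of this kind might turn connection code information into a query can be sketched with Python's standard `sqlite3` module. The patent does not specify a schema or an SQL dialect, so the tables `documents` and `core_info`, their columns, the sample rows, and the function `build_query` are all hypothetical.

```python
import sqlite3
from typing import List, Tuple


def build_query(morphemes: List[str]) -> Tuple[str, List[str]]:
    """SQL generation unit (930), sketched: one placeholder per morpheme."""
    placeholders = ", ".join("?" for _ in morphemes)
    sql = (
        "SELECT DISTINCT d.id, d.headline FROM documents d "
        "JOIN core_info c ON c.doc_id = d.id "
        f"WHERE c.morpheme IN ({placeholders}) ORDER BY d.id"
    )
    return sql, list(morphemes)


# Hypothetical schema and rows: core_info holds the S40 data attached to each document.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (id INTEGER PRIMARY KEY, headline TEXT);
    CREATE TABLE core_info (doc_id INTEGER, code INTEGER, morpheme TEXT);
    INSERT INTO documents VALUES (1, 'Samyang Corporation posts record profit'),
                                 (2, 'Corn futures climb'),
                                 (3, 'Shipbuilders win new orders');
    INSERT INTO core_info VALUES (1, 1, 'Samyang Corporation'),
                                 (2, 3, 'corn'),
                                 (3, 1, 'shipbuilder');
""")

# A subset of the connection code morphemes from the 'cooking oil' example.
connection_morphemes = ["cooking oil", "Samyang Corporation", "Cheil Jedang",
                        "Dongbang Yuryang", "corn"]
sql, params = build_query(connection_morphemes)
rows = conn.execute(sql, params).fetchall()   # data transmission unit (940)
print(rows)
```

Neither matching headline contains the words "cooking oil", which is the behavior the description claims for meaning-based retrieval. Passing the morphemes as bound parameters rather than concatenating them into the statement is a choice of this sketch, not something the patent prescribes.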

FIG. 8 is a simple example of dictionary DB I (20) according to the present invention. It is a database structure in which every type of news morpheme appearing in news information is first classified according to criteria defined by the administrator, and each type is given a classification code. The classification codes may be defined in various ways, such as 1, 2, 3, ..., N, depending on the classification criteria. In the present invention the classification criteria target the stock market: morphemes for individual stocks and their synonyms are assigned '1', industry morphemes '2', industry-factor morphemes '3', and so on, so that a larger classification code denotes a higher-level set. Dictionary DB I may be organized by row, as in this example, or by column, but its morphemes must be identical to those of dictionary DB II so that results are consistent and the information corresponds one-to-one. Dictionary DB I must also be accompanied by a management tool (40) that makes it easy to add and change newly occurring morphemes, since this is what keeps search results accurate. Although the present invention gives no concrete illustration of the management tool (40), for efficient management it should cover the contents of dictionary DB I (20), dictionary DB II (30), and the morpheme DB (10).

FIG. 9 is a simple example of dictionary DB II (30) according to the present invention. Taking individual stocks on the stock market as the basis, all morphemes related to a stock from a business standpoint are classified according to predefined criteria (industry sector, raw materials, products, affiliated companies, and so on) and given corresponding code information. The importance of this code information for accurate retrieval was explained for connection code extraction unit I (340) of FIG. 4 and connection code extraction unit II (920) of FIG. 7; the present invention includes two types, the classification code and the industry code. As explained for dictionary DB I (20), the classification code is '1' throughout because each entry is an individual company. The industry codes are simply numbers assigned sequentially after listing every industry to which the companies belong.

To shorten the processing time of the morpheme extraction unit (330) in inference engine II (300), the first row of dictionary DB II must be a code row. When related information is extracted from dictionary DB II using the classification code of the core morpheme combination extracted in S40 of FIG. 1, this code row shortens the search. For this to work, the code types in the code row must follow the same classification criteria as those defined in dictionary DB I (20). Likewise, so that results are consistent and the information corresponds one-to-one, the morpheme information in dictionary DB I and II must be identical: every morpheme in dictionary DB I must also be in dictionary DB II, and vice versa.
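The requirement that both dictionaries contain exactly the same morphemes lends itself to a simple set comparison, which a management tool could run after each update. This is an illustrative sketch only; `check_consistency` and the sample morphemes are hypothetical.

```python
from typing import Iterable, Set, Tuple


def check_consistency(
    db1_morphemes: Iterable[str], db2_morphemes: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (missing from DB II, missing from DB I); both empty means consistent."""
    db1, db2 = set(db1_morphemes), set(db2_morphemes)
    return db1 - db2, db2 - db1


# Hypothetical contents: 'sesame oil' has not yet been added to dictionary DB I.
db1 = ["Samyang Corporation", "food", "cooking oil"]
db2 = ["Samyang Corporation", "food", "cooking oil", "sesame oil"]
missing_in_db2, missing_in_db1 = check_consistency(db1, db2)
print(missing_in_db2, missing_in_db1)
```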

FIG. 10 is a simple example of connection dictionary DB III (50) according to the present invention, a database structure that combines the results of S40 and S60 of FIG. 1 and stores them sequentially. An existing entry in connection dictionary DB III whose information is identical to newly generated S40 information is overwritten.
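The store-or-overwrite behavior can be sketched with an ordinary dictionary keyed by the S40 result. The class `ConnectionDictionary`, its methods, and the sample values are hypothetical; the patent only states that entries are stored sequentially and that an entry with identical S40 information is overwritten.

```python
from typing import Dict, List, Optional, Tuple

# S40 result used as the key: (classification code, core morpheme).
Key = Tuple[int, str]


class ConnectionDictionary:
    """In-memory stand-in for connection dictionary DB III."""

    def __init__(self) -> None:
        self._entries: Dict[Key, List[str]] = {}

    def store(self, key: Key, connection_codes: List[str]) -> None:
        # Remove any older entry first so the new one is appended at the end,
        # which keeps iteration order chronological.
        self._entries.pop(key, None)
        self._entries[key] = list(connection_codes)

    def lookup(self, key: Key) -> Optional[List[str]]:
        """Return the most recent connection codes for the key, if any."""
        return self._entries.get(key)

    def __len__(self) -> int:
        return len(self._entries)


db3 = ConnectionDictionary()
db3.store((3, "cooking oil"), ["Samyang Corporation", "Cheil Jedang"])
# A later dictionary update adds a third company; the old entry is overwritten.
db3.store((3, "cooking oil"), ["Samyang Corporation", "Cheil Jedang", "Dongbang Yuryang"])
print(len(db3), db3.lookup((3, "cooking oil")))
```

Because each key holds a single entry, a lookup always returns the most recent information, which is the property step S1060 depends on.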

The present invention, composed of the apparatus and method described above, can be expected to have the following effects.

First, in existing search systems an administrator classified and indexed input documents by hand, which limited the classification and made it error-prone. The present invention uses an intelligent inference engine to extract semantically related information for a document and to index its connection codes automatically, quickly, and accurately, and it also improves search speed in response to user requests.

Second, existing search systems use word matching, retrieving only documents that contain the search term entered by the user, and therefore have a high search error rate. The present invention uses the semantic connection code index information assigned automatically by the inference engine, so a document that does not contain the search term but is inferred to be semantically related is presented as a valid search result. This increases the variety of search results and also addresses the search error rate.

Claims

A first step of receiving an original news document generated externally or internally and extracting the news morphemes of the original document through morphological analysis and stopword processing;

A second step of using inference engine I to compare the news morphemes with dictionary DB I and extract a combination vector consisting of the identical morphemes and their classification codes, and applying inference rules to the combination vector to extract combination information for one or more core morphemes;

A third step of automatically extracting, from dictionary DB II using inference engine II, semantic connection code information according to the core morpheme combination information; And

A fourth step of combining the core morpheme combination information with the connection code information and storing it in connection dictionary DB III, adding the core morpheme combination information to the original document, storing the document in the Database, and terminating; an information retrieval method using semantic relevance information of a search term, characterized by comprising the above steps.

A data receiving unit for receiving an original news document generated externally or internally;

A morpheme analysis unit for extracting news morphemes by performing morphological analysis and stopword processing on the original document;

An inference engine I for comparing the news morphemes with the dictionary morphemes of dictionary DB I, extracting the identical morphemes and their classification codes, generating a combination vector from them, and then extracting one or more core morpheme combinations from the combination vector;

An inference engine II for automatically extracting, from dictionary DB II, semantic connection code information for the core morpheme combination;

A morpheme classification unit for combining the core morpheme combination information with the connection code information, storing the result in connection dictionary DB III, attaching the core morpheme combination information to the original document, and storing the document in the database; and

A database for storing the original document and the core morpheme combination information.

The information retrieval apparatus using semantic relevance information of a search word of claim 2, wherein the inference engine I comprises:

A data receiving unit for receiving the news morphemes extracted by the morpheme analysis unit;

A morpheme comparison operation unit for extracting identical morphemes by comparing the news morphemes with the dictionary morphemes of dictionary DB I;

A classification code extraction unit for extracting the classification code of each identical morpheme from dictionary DB I;

A combination vector generation unit for combining each identical morpheme with its classification code; and

A core morpheme extraction unit for extracting one or more pieces of core morpheme information by applying an inference rule to the combination vector.

The information retrieval apparatus using semantic relevance information of a search word of claim 2, wherein the inference engine II comprises:

A data receiving unit for receiving the core morpheme combination information;

A classification code comparison operation unit for determining the classification code type of the core morpheme information;

A morpheme extraction unit for extracting, from dictionary DB II, the morphemes identical to the core morphemes according to the classification code type; and

A connection code extraction unit I for automatically extracting semantic connection code information of the identical morphemes from dictionary DB II.
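
The units of inference engine II listed in this claim can be illustrated with a short sketch. It assumes, purely for illustration, that dictionary DB II is partitioned by classification code type; the partition names, morphemes, and connection codes below are hypothetical.

```python
# Hypothetical dictionary DB II, partitioned by classification code type.
# Each partition maps a morpheme to its semantic connection code.
DICT_DB_II = {
    "PERSON": {"lee": "P-017"},
    "ORG": {"samsung": "O-204"},
}


def connection_codes(core_morphemes):
    """Return {morpheme: connection code} for (morpheme, classification code) pairs."""
    result = {}
    for morpheme, class_code in core_morphemes:
        # Classification code comparison: pick the partition for this code type.
        partition = DICT_DB_II.get(class_code)
        if partition is None:
            continue  # unknown code type, so there is no partition to search
        # Morpheme extraction and connection code extraction from that partition.
        code = partition.get(morpheme)
        if code is not None:
            result[morpheme] = code
    return result
```

Core morphemes whose classification code type or spelling is absent from the assumed dictionary simply yield no connection code.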

A first step in which a user inputs a search word or selects specific document information on a service screen provided through a wired or wireless communication network;

A second step of extracting, when document information is selected in the first step, the classification code and core morpheme information from the original document stored in the database, and extracting, when the user inputs a search word, the identical morpheme and its classification code from dictionary DB I for that search word;

A third step of extracting, using inference engine III, the latest connection code information corresponding to the classification code and the morpheme from connection dictionary DB III;

A fourth step of extracting document information corresponding to the connection code information from the database; and

A fifth step of displaying the database search results in a predetermined user screen frame for each classification code, and then ending.
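
The search-word branch of this method can be sketched as follows. The data layout is an assumption made for illustration: connection dictionary DB III is modeled as rows carrying a storage date so that the latest connection code can be chosen, each stored document carries the connection codes attached at indexing time, and all names and values are hypothetical. Note that the matching document does not contain the search word itself.

```python
from collections import defaultdict

# Hypothetical data. DB I maps a morpheme to a classification code; DB III holds
# (morpheme, classification code, connection code, stored_at) rows.
DICT_DB_I = {"samsung": "ORG"}
LINK_DB_III = [
    ("samsung", "ORG", "O-100", "2003-01-05"),
    ("samsung", "ORG", "O-204", "2003-02-01"),
]
DATABASE = [
    {"title": "Chip maker posts record profit", "codes": {"O-204"}, "class": "ORG"},
    {"title": "Weather outlook", "codes": {"W-001"}, "class": "TOPIC"},
]


def latest_connection_code(morpheme, class_code):
    """Step 3: pick the most recently stored connection code (ISO dates sort as text)."""
    rows = [r for r in LINK_DB_III if r[0] == morpheme and r[1] == class_code]
    return max(rows, key=lambda r: r[3])[2] if rows else None


def search(search_word):
    """Steps 2 to 5: classify the word, find its code, fetch and group documents."""
    word = search_word.lower()
    class_code = DICT_DB_I.get(word)  # step 2: lookup in dictionary DB I
    if class_code is None:
        return {}
    code = latest_connection_code(word, class_code)
    grouped = defaultdict(list)
    for doc in DATABASE:  # step 4: documents tagged with the connection code
        if code in doc["codes"]:
            grouped[doc["class"]].append(doc["title"])  # step 5: group per code
    return dict(grouped)
```

Grouping the titles by classification code mirrors the per-code screen frames of the fifth step; how the frames are rendered is outside this sketch.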

A user system capable of inputting a specific search word or selecting specific document information on a service providing screen; a wired or wireless computer network; a web server;

A search word analysis unit for extracting the corresponding classification code from dictionary DB I for a search word input by the user, and for extracting the classification code and core morpheme information of the corresponding document from the database for document information selected by the user;

An inference engine III for extracting connection code information corresponding to the classification code and morpheme information from connection dictionary DB III; and

An information retrieval apparatus using semantic relevance information of a search word, further comprising a database for storing the original document and the core morpheme combination information.

The information retrieval apparatus using semantic relevance information of a search word of claim 6, wherein the inference engine III comprises:

A connection code extraction unit II for automatically extracting semantic connection code information by comparing the classification code and morpheme information from the search word analysis unit with the information in connection dictionary DB III;

An SQL generator for automatically generating an SQL statement from the connection code information; and

A data transmission unit.
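
One way the SQL generator of this claim might work is sketched below using Python's standard `sqlite3` module. The table and column names (`documents`, `document_codes`, `connection_code`) and the sample rows are assumptions; the claims do not specify a schema. The statement is parameterized so that connection codes are passed as bound values rather than spliced into the SQL text.

```python
import sqlite3


def build_query(connection_codes):
    """Return (sql, params) selecting documents tagged with any of the given codes."""
    if not connection_codes:
        raise ValueError("at least one connection code is required")
    # One "?" placeholder per code; values are bound separately by the driver.
    placeholders = ", ".join("?" for _ in connection_codes)
    sql = (
        "SELECT DISTINCT d.id, d.title FROM documents d "
        "JOIN document_codes c ON c.doc_id = d.id "
        f"WHERE c.connection_code IN ({placeholders}) ORDER BY d.id"
    )
    return sql, list(connection_codes)


def demo():
    """Run the generated statement against a small in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE document_codes (doc_id INTEGER, connection_code TEXT);
        INSERT INTO documents VALUES (1, 'Chip maker posts record profit');
        INSERT INTO documents VALUES (2, 'Weather outlook');
        INSERT INTO document_codes VALUES (1, 'O-204');
        INSERT INTO document_codes VALUES (2, 'W-001');
        """
    )
    sql, params = build_query(["O-204"])
    rows = conn.execute(sql, params).fetchall()
    conn.close()
    return rows
```

In this sketch the rows returned by `demo()` are what the data transmission unit would pass on for display.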