KR20000049333A

KR20000049333A - Engine for comparatively searching product of internet shopping mall with intelligence type

Info

Publication number: KR20000049333A
Application number: KR1019990047316A
Authority: KR
Inventors: 이경진
Original assignee: 한상천; 주식회사 인터하우스
Priority date: 1999-10-28
Filing date: 1999-10-28
Publication date: 2000-08-05
Also published as: KR100296500B1

Abstract

PURPOSE: An intelligent engine for comparatively searching products of Internet shopping malls is provided to increase the recognition rate and thereby reduce errors, by determining whether information of an uncertain type exists and by analyzing and classifying the content of that information in order to obtain separate, valuable data from it. CONSTITUTION: A web searching system detects changes in the structure of a web page and of a home page, updates data periodically, and sets the update period to reduce load. The web searching system includes a Meta Robot and a Search Robot. The Meta Robot searches all web pages randomly and distinguishes web pages that have been searched from web pages that are yet to be searched. The Meta Robot supports distributed processing and transfers data to the Search Robot if the desired data is included in the searched web page. The Search Robot searches only web pages containing the desired data. The Search Robot includes a first mode for checking the web page structure, a second mode for supporting a quick search, and a third mode for detecting changes in the web page structure.

Description

Intelligent Internet Shopping Mall Product Comparison Search Engine

Rather than requiring any separate processing from shopping malls on the Internet, product information is automatically extracted and classified from the documents they provide in web form and offered to users as a directory-style service.

Internet intelligent agent and search robot

Existing approaches merely compared the words in a document to decide whether they matched. They therefore could not handle cases such as product information, where a product name, price, manufacturer, and similar items appear together and the content itself must be judged before the product can be presented properly. Similar software previously developed for this field has also required considerable manual work for shopping-mall registration and processing and for product search, or has failed to provide information smoothly because of excessively low recognition rates and errors.

By determining whether information of an unclear form is present, and by analyzing and classifying its content so that it can be distinguished as separate, valuable information, the invention solves the problems of existing intelligent agent software: low recognition rates, increasing errors, and excessive management demands.

A. Web search system

(1) Introduction

The web search system automatically retrieves data from websites that hold specific information, updates that data periodically, and identifies new information. Even when the structure within a web page or the structure of the homepage itself changes, the system detects this automatically and keeps updating the data periodically. It also sets the update cycle automatically to keep the load to a minimum.

(2) Functional configuration

The web search system consists of two systems built on robot technology: Meta Robot and Search Robot.

A. Meta Robot

Meta Robot searches the entire web at random. It builds its own DB of all websites and distinguishes websites it has searched from those it has not. It does not start with a base website DB; once a particular website is designated as the starting site, it searches continuously by following the spider-web-like link structure of the whole web.

Because it runs as multiple processes, distributed processing is possible, and its Multi Performance (the degree of parallelism) can be set.

While searching websites, it works with the information analysis system to determine whether specific information is present and, when it is, passes the data to Search Robot.

When a keyword corresponding to the specific information clearly exists, Meta Robot can use existing domestic and foreign search engines as a filter, which can be faster than searching the entire web.

B. Search Robot

Search Robot searches only those websites that Meta Robot has determined to contain the specific information. It has three modes: a first mode that identifies how the specific information is presented and how the website is structured, a second mode that searches quickly on the basis of the structure so identified, and a third mode that detects when the website's structure has changed or pages have been added or deleted.

(i) First Mode

First Mode is used the first time a particular site is searched; it determines the structure of the website and how the specific information is presented. It accurately isolates and registers only those pages within the website that contain the specific information. When large amounts of unneeded information exist, as in bulletin boards or guest books, it recognizes this and excludes them from the search, which improves search speed. By default it searches every page of the website. To keep Search Robot from extending its search beyond the website, it sets a boundary on its own; this boundary prevents the search target from shifting to another site.

(ii) Fast Mode

Fast Mode searches using only the information identified in First Mode, so it can update data quickly. That is, it searches only the pages where information exists, and it compares their structure against the data recorded in First Mode. It can detect data that has been newly added, deleted, or changed, and because it can identify differences from the existing data, it retrieves the information completely.

A separate server program runs Fast Mode periodically. It includes its own scheduling function, so the interval between periodic searches can be set. Even if the interval is initially set to a default value, repeated searches reveal how often the website's information changes, and the load on the remote web server can therefore be kept to a minimum.

After fetching a web page that contains information, the robot works with the information analysis system to extract exactly the target data. Because this step runs alongside fetching, requests to the remote web server are automatically spaced out in time. Fast Mode cannot itself detect a change in the website's structure; in that case its error handling invokes the third mode, Restart Mode, to resolve the situation.
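The hand-off from Fast Mode to Restart Mode through error handling can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the patent's implementation: `StructureChanged`, `fast_mode`, and `refresh_site` are hypothetical names, and the `fetch`, `has_target_info`, and `restart_mode` callables stand in for the HTTP Client, the information analysis system, and Restart Mode respectively.

```python
class StructureChanged(Exception):
    """Raised when a page known to hold target information no longer does."""


def fast_mode(known_pages, fetch, has_target_info):
    """Re-fetch only the pages recorded as containing information."""
    refreshed = {}
    for url in known_pages:
        html = fetch(url)
        # A missing page or missing information is treated as a sign that
        # the site structure changed; Fast Mode does not try to repair it.
        if html is None or not has_target_info(html):
            raise StructureChanged(url)
        refreshed[url] = html
    return refreshed


def refresh_site(known_pages, fetch, has_target_info, restart_mode):
    """Try Fast Mode first and fall back to Restart Mode on an error."""
    try:
        return "fast", fast_mode(known_pages, fetch, has_target_info)
    except StructureChanged:
        return "restart", restart_mode()


# Hypothetical in-memory site used instead of real HTTP requests.
SITE = {"http://shop.example/goods/1.html": "TV price 300,000"}
PAGES = list(SITE)
```

Calling `refresh_site(PAGES, SITE.get, lambda html: "price" in html, restart_mode)` stays in Fast Mode while the known pages still carry the information, and switches to the supplied `restart_mode` callable as soon as one of them does not.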

(iii) Restart Mode

Restart Mode can be seen as a hybrid of First Mode and Fast Mode. It retrieves information only from the pages where specific information exists while also surveying the entire website to identify its new structure. The biggest difference from First Mode is that it does not overwrite information that an administrator has changed manually; the results of manual work are preserved. It can also update data at the high speed of Fast Mode.

(3) Implementation method

A. Meta Robot

Meta Robot is divided into three main parts: an HTTP Client that can fetch HTML files and CGI output, a part that parses URLs, and a part that caches URLs.

The HTTP Client is implemented in C or Java using conventional network programming. It sets the request headers itself so that it works correctly with virtual web hosting sites, and it handles the robots.txt file, the convention for excluding web robots.
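The two HTTP Client concerns named above, the header needed for virtual hosting and robots.txt handling, can be illustrated with Python's standard library. This is a sketch only: the patent describes a C or Java client, and the user-agent string, the example host, and the robots.txt content below are assumptions made for the illustration. No network access is performed; the robots.txt rules are parsed from a string.

```python
from urllib import request, robotparser

USER_AGENT = "MetaRobotSketch/0.1"  # hypothetical robot name

# Hypothetical robots.txt content that a shopping mall might serve.
EXAMPLE_ROBOTS = """User-agent: *
Disallow: /board/
"""


def load_robots(robots_txt: str) -> robotparser.RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def build_request(url: str) -> request.Request:
    """Build a request; urllib derives the Host header from the URL,
    which is what name-based virtual hosting relies on."""
    return request.Request(url, headers={"User-Agent": USER_AGENT})


rules = load_robots(EXAMPLE_ROBOTS)
allowed = rules.can_fetch(USER_AGENT, "http://shop.example/goods/1.html")
blocked = rules.can_fetch(USER_AGENT, "http://shop.example/board/list.cgi")
req = build_request("http://shop.example/goods/1.html")
```

A robot would call `can_fetch` before every request and skip any URL for which it returns `False`.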

The URL-parsing part works from two main arguments. Its main purpose is to obtain absolute URLs from the link information in HTML tags.

The current URL is stored in current_url and the relative path found in the link is stored in reference_url, and the two are combined into a single URL. If a website contains a malformed link, it is treated as an error. The implementation is basically the same as the way a web browser derives an absolute URL. The relationship between the paths of the two URLs is worked out using the '/' character that separates physical path components. This is described in more detail in the URL Parser section.
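Combining current_url and reference_url the way a browser does can be sketched with `urllib.parse.urljoin` from Python's standard library. The function name `to_absolute` and the rule that only http/https results count as valid links are assumptions of this sketch; the patent does not say which links it treats as errors.

```python
from urllib.parse import urljoin, urlsplit


def to_absolute(current_url, reference_url):
    """Resolve reference_url against current_url.

    Returns the absolute URL, or None when the result is not a
    fetchable http(s) URL (treated here as a bad link).
    """
    absolute = urljoin(current_url, reference_url)
    parts = urlsplit(absolute)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    return absolute


current_url = "http://shop.example/goods/list.html"
print(to_absolute(current_url, "item.html"))
print(to_absolute(current_url, "../cart/view.cgi?id=3"))
print(to_absolute(current_url, "mailto:owner@shop.example"))
```

The three calls cover a bare file name, a relative path that climbs one directory, and a link that cannot be crawled.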

URLs are cached so that the same website is not searched again and so that the robot does not loop endlessly across multiple websites. The cache distinguishes sites that have been searched from those that have not, and for sites that contain the specific information it passes the data on so that Search Robot can search them. The cache uses a DB with three fields: URL, Contents, and SEARCH_TIME. URL stores the absolute path, Contents holds the content status, and SEARCH_TIME records when the site was searched.

Contents can take one of three values: NOT, END, or DATA. NOT means the site has not yet been searched, END means the search has finished, and DATA means that specific information was found and the data was sent to Search Robot. SEARCH_TIME stores the time of the last search so that the website can be searched again periodically.
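The URL cache described above can be sketched with an in-memory SQLite table. The field names URL, Contents, and SEARCH_TIME and the values NOT, END, and DATA come from the text; the table name `url_cache`, the helper functions, and the choice of SQLite are assumptions of this sketch, since the patent does not name a database product.

```python
import sqlite3
import time


def open_cache():
    """Create an in-memory cache table with the three described fields."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE url_cache ("
        " URL TEXT PRIMARY KEY,"
        " Contents TEXT NOT NULL DEFAULT 'NOT'"
        "   CHECK (Contents IN ('NOT', 'END', 'DATA')),"
        " SEARCH_TIME REAL)"
    )
    return db


def enqueue(db, url):
    """Add a URL as NOT searched; an already known URL is left untouched,
    which is what prevents re-searching and endless loops."""
    db.execute("INSERT OR IGNORE INTO url_cache (URL) VALUES (?)", (url,))


def next_unsearched(db):
    """Return one URL still in the NOT state, or None when all are done."""
    row = db.execute(
        "SELECT URL FROM url_cache WHERE Contents = 'NOT' LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def mark(db, url, has_target_info, now=None):
    """Record the outcome: DATA if target information was found, else END."""
    state = "DATA" if has_target_info else "END"
    stamp = time.time() if now is None else now
    db.execute(
        "UPDATE url_cache SET Contents = ?, SEARCH_TIME = ? WHERE URL = ?",
        (state, stamp, url),
    )
```

A crawl loop would repeatedly call `next_unsearched`, fetch the page, `enqueue` every link it finds, and then `mark` the page.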

B. Search Robot

Search Robot refers to two DB tables, whose structure is described first.

(i) DB

To operate, Search Robot must remember two kinds of information: which pages within a website contain information, and how that information is presented.

The pages that contain information and the way it is presented can change within a website, and Restart Mode can detect this. To update information quickly in Fast Mode, however, they need to be stored as a cache. This table has four main fields: URL, CONTENT, STATE, and TIME. URL stores the absolute path, and CONTENT records what kind of information the page holds, that is, whether or not it contains the specific information, so that Fast Mode can search only the pages that do.

The STATE field indicates the current status: whether the page is currently being searched, has been searched, or has not yet been searched. This prevents duplicate searches. TIME stores the most recent time a search finished, so that the page can be searched again at a given interval.

Storing the presentation format means that the information analysis system identifies the format in which the specific information is expressed and saves it; this is explained under the information analysis system.

(ii) First Mode

First Mode consists of the following steps.

ㆍFind target website - Using the information passed from Meta Robot, find a website that contains the specific information and needs to be searched in First Mode.

ㆍCheck current status - Record that Search Robot is currently searching this website, so that it is not searched twice when running as multiple processes.

ㆍDelete existing data - In First Mode there is normally no existing data, but any information that may remain is deleted, for example when the same site has a different URL or a previous First Mode run terminated abnormally.

ㆍChange STATE - Adjust STATE so that every URL belonging to the website becomes a search target. This is unnecessary the first time a website is searched, but a previous First Mode run may have terminated abnormally and be retried, so every STATE is set to NOT.

ㆍDetermine URL to search - Call the unsearched URLs one at a time, recursively. Each page is marked as searched once it has been processed, so simply calling them one by one ends up searching every page in order.

ㆍFetch web page - Fetch the page at the URL using the HTTP Client.

ㆍDetermine whether information exists - Determine whether the fetched page contains the specific information. If it does, the page is classified as a page with information; otherwise it is stored as a page without information.

ㆍStore information - For a page that contains the specific information, extract that information in cooperation with the information analysis system and store it. The data is not written directly to the DB; a temporary DB is used so that data can be recovered if an error occurs. Once the whole website has been searched, the contents of the temporary DB are copied to the main DB.

ㆍRun URL Parser - Use the URL-parsing program to identify every link from the fetched page to other pages. In this way the whole website is searched. The implementation of the URL Parser is described in detail later.

ㆍCheck URLs - Of the URLs produced by parsing, discard all except those that fall within the website's boundary.

ㆍCheck current status - Record that the search has finished and update the last-searched time.

ㆍDB Update - Move the data from the temporary DB used during the search to the actual DB. This completes the registration process.
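The First Mode steps above can be sketched as a single loop. This is an illustration under several assumptions, not the patent's code: the site is a hypothetical in-memory dictionary instead of real HTTP, `has_target_info` is a stand-in for the information analysis system, the boundary is taken to be the host name of the start URL, the state name SEARCHING is invented (the text only names NOT), and the loop is iterative where the text speaks of recursive calls.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def first_mode(start_url, fetch, has_target_info):
    boundary = urlsplit(start_url).netloc      # stay inside this site
    state = {start_url: "NOT"}                 # STATE per URL
    temp_db = {}                               # temporary DB
    while True:
        url = next((u for u, s in state.items() if s == "NOT"), None)
        if url is None:
            break
        state[url] = "SEARCHING"
        html = fetch(url)
        if html is not None:
            if has_target_info(html):
                temp_db[url] = html            # page with information
            collector = LinkCollector()
            collector.feed(html)
            for href in collector.hrefs:
                absolute = urljoin(url, href)
                if urlsplit(absolute).netloc == boundary:
                    state.setdefault(absolute, "NOT")
        state[url] = "END"
    main_db = dict(temp_db)                    # copy temp DB to main DB
    return main_db, state


# Hypothetical three-page site with one outbound link.
SITE = {
    "http://shop.example/": '<a href="goods/1.html">TV</a> '
                            '<a href="board/">Board</a> '
                            '<a href="http://other.example/">Partner</a>',
    "http://shop.example/goods/1.html": '<td>TV</td><td>price 300,000</td> '
                                        '<a href="/">Home</a>',
    "http://shop.example/board/": "<p>guest book</p>",
}
main_db, state = first_mode(
    "http://shop.example/", SITE.get, lambda html: "price" in html
)
```

Only the product page ends up in `main_db`, and the link to the other host never enters `state` because it falls outside the boundary.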

(iii) Fast Mode

ㆍFind target website - Find a website that needs to be searched in Fast Mode. The decision is based on the search interval, comparing the time of the last search with the current time.

ㆍCopy existing information - Because changes are recorded relative to the existing information, the existing information is not deleted; instead its contents are copied from the main DB to the temporary DB.

ㆍChange STATE - Adjust STATE so that every URL belonging to the website becomes a search target.

ㆍDetermine URL to search - Call the unsearched URLs one at a time, recursively. Each page is marked as searched once it has been processed, so simply calling them one by one ends up searching every page in order. Because this is Fast Mode, only pages recorded in the existing URL DB as containing information are searched.

ㆍDetermine whether information exists - Determine whether the fetched page contains the specific information. Fast Mode searches only pages known to contain information, but if the website's structure has changed the information may be missing; in that case the structure is judged to have changed and Search Robot is run in Restart Mode.

ㆍStore information - For a page that contains the specific information, extract that information in cooperation with the information analysis system and store it. The data is not written directly to the DB; a temporary DB is used so that data can be recovered if an error occurs. Once the whole website has been searched, the contents of the temporary DB are copied to the main DB. If the same information already exists, only the changes are identified and stored. Where comparison with the existing information is needed, the DB is set up to store the changes in a separate field.

ㆍSet interval - If the newly retrieved information differs greatly from the existing information, the search interval is shortened; if the two match, the interval is lengthened. By learning the website's change cycle empirically in this way, the load on the remote web server is reduced.
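The change detection and interval adjustment in the steps above can be sketched as follows. The text only says that the interval shrinks when the data differs greatly and grows when it matches; the threshold, the doubling and halving factor, the bounds, and all function names here are assumptions chosen for the illustration.

```python
def diff_records(old, new):
    """Return (added, removed, changed) keys between two snapshots,
    each a dict mapping a product key to its stored value."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return added, removed, changed


def changed_ratio(old, new):
    """Fraction of all known records that were added, removed, or changed."""
    added, removed, changed = diff_records(old, new)
    total = len(old.keys() | new.keys())
    if total == 0:
        return 0.0
    return (len(added) + len(removed) + len(changed)) / total


def next_interval(current_hours, ratio, *, threshold=0.2, factor=2.0,
                  min_hours=1.0, max_hours=168.0):
    """Shorten the interval after large changes, lengthen it after none."""
    if ratio > threshold:
        proposed = current_hours / factor
    elif ratio == 0.0:
        proposed = current_hours * factor
    else:
        proposed = current_hours
    return max(min_hours, min(max_hours, proposed))


# Hypothetical price snapshots from two consecutive Fast Mode runs.
old = {"tv": 300000, "radio": 50000, "vcr": 120000, "fan": 30000}
new = {"tv": 280000, "radio": 50000, "vcr": 120000, "fan": 30000, "pc": 900000}
interval = next_interval(24.0, changed_ratio(old, new))
```

With one price change and one new product out of five records, the ratio exceeds the assumed threshold and the 24-hour interval is halved; a run with no differences would double it instead, up to the assumed one-week cap.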

(iv) Restart Mode

Restart Mode consists of the following steps.

ㆍFind target website - Using the information passed from Fast Mode, determine which website needs Restart Mode. Restart Mode runs in two cases: when Fast Mode has detected a change in the website's structure, or when the time interval for running Restart Mode has elapsed. This interval is set automatically.

ㆍIdentify web page structure - If the previously identified way in which a web page presents information has changed, identify it again and store it.

ㆍSet interval - If the newly identified website structure differs from the existing one, the Restart search interval is shortened; if the two are the same, it is lengthened. By learning the website's change cycle empirically in this way, the load on the remote web server is reduced.

C. URL Parser

The URL Parser works from three arguments: the current URL, the website name, and the HTML file name. Using these, it proceeds in the following order.

ㆍIdentify link tags - Find all HTML tags that represent links to other pages. These are mainly '<a href' tags and links implemented with JavaScript.

ㆍIdentify link text - If a text string carries a link, record the string. The web search system itself does not need it, but the string can be highly meaningful when the parser works together with the classification analysis system.

ㆍDetermine current URL - Analyze the current URL, chiefly by using the '/' characters that divide the physical structure to find the depth of the current directory.

ㆍAnalyze link URL - Analyze the link URL, which may be an absolute path, a relative path, or just a file name. An absolute path can be used as is; a relative path or file name must be combined with the current URL to produce a new URL.
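The URL Parser steps above can be sketched with Python's standard `html.parser` and `urllib.parse` modules. The sketch takes the current URL and the page's HTML text rather than the three arguments named in the text, handles only `<a href>` links (JavaScript links are left out), and uses the hypothetical names `AnchorParser`, `directory_depth`, and `parse_links`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class AnchorParser(HTMLParser):
    """Collects (href, link text) pairs from <a href=...> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def directory_depth(url):
    """Count the directories in the URL path by its '/' separators."""
    path = urlsplit(url).path
    return max(path.count("/") - 1, 0)


def parse_links(current_url, html):
    """Return (absolute URL, link text) for every <a href> in the page."""
    parser = AnchorParser()
    parser.feed(html)
    parser.close()
    return [(urljoin(current_url, href), text) for href, text in parser.links]


current_url = "http://shop.example/goods/tv/list.html"
page = ('<a href="/cart.cgi">Cart</a> '
        '<a href="../audio/index.html">Audio</a> '
        '<a href="item7.html">Item 7</a>')
links = parse_links(current_url, page)
```

The link text is kept next to each URL because, as noted above, it can matter to the classification analysis system even though the crawler itself does not need it.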

B. Information analysis system

(1) Introduction

The main purpose of the information analysis system is to read external document information, analyze its content, and build a consistent body of information. In the current overall system, it is used to read HTML documents obtained from external sites on the Internet and to extract the product-related information contained in them.

The information analysis system adopts a human-centered way of thinking in reading and processing external information. Humans rely most heavily on visual information when analyzing information, and what distinguishes visual information from other kinds is that it carries positional information. Starting from this observation, a vision-centered information recognition method was developed and applied that includes positional information and distinguishes information of a similar level, so that results comparable to human analysis can be obtained. This concept is applied throughout the system: the parts that work out the structure of information on their own, or that discern how similar information is organized, rest on the same human-centered thinking.

(2) Main functions

Taken as a whole, the information analysis system has four main characteristics. These are properties rarely found in similar analysis systems and are the main factors that define the information analysis system.

The main functions of the information analysis system are described below.

A) Vision-centered information recognition

The overall system currently collects information of varied content and format from many sites over the Internet. The collected information is therefore extremely diverse in kind and format, and it is hard to find any consistent structure or concept in it. Because every step of a conventional computer program operates logically, such programs share a common problem: they cannot process illogical or excessively varied data in a uniform way. Solving this requires a more specialized technique.

To solve this problem, the information analysis system adopts a vision-centered information recognition method. This method analyzes a given document and reconstructs its information in a form that also carries visual relations such as above, below, left, and right, so that the information can be analyzed in terms such as "items similar to this one are listed to the right" or "items are lined up from bottom to top". Applying this technique extends the attributes of a piece of information from plain text to visual properties, which makes it possible to locate the desired information accurately on the basis of richer evidence.
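One way to picture "text plus position" is to place the cells of an HTML table on a row and column grid and then ask what lies to the right of or below a given cell. The patent does not disclose its actual data structures, so everything in this sketch is an assumption: the names `GridParser`, `neighbor`, and `column_below`, the restriction to simple tables without rowspan, colspan, or nesting, and the example table.

```python
from html.parser import HTMLParser


class GridParser(HTMLParser):
    """Maps each table cell to a (row, column) position."""

    def __init__(self):
        super().__init__()
        self.cells = {}          # (row, col) -> cell text
        self._row = -1
        self._col = -1
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row += 1
            self._col = -1
        elif tag in ("td", "th"):
            self._col += 1
            self._in_cell = True
            self.cells[(self._row, self._col)] = ""

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[(self._row, self._col)] += data.strip()


OFFSETS = {"right": (0, 1), "left": (0, -1), "below": (1, 0), "above": (-1, 0)}


def neighbor(cells, pos, direction):
    """Return the text of the adjacent cell in the given direction, if any."""
    d_row, d_col = OFFSETS[direction]
    return cells.get((pos[0] + d_row, pos[1] + d_col))


def column_below(cells, pos):
    """Return every cell text lined up below the given position."""
    row, col = pos[0] + 1, pos[1]
    values = []
    while (row, col) in cells:
        values.append(cells[(row, col)])
        row += 1
    return values


EXAMPLE = """
<table>
<tr><th>Product</th><th>Price</th></tr>
<tr><td>TV</td><td>300,000</td></tr>
<tr><td>Radio</td><td>50,000</td></tr>
</table>
"""
grid = GridParser()
grid.feed(EXAMPLE)
price_header = next(p for p, text in grid.cells.items() if text == "Price")
prices = column_below(grid.cells, price_header)
```

Once the cells carry positions, statements like "the values under the Price header" or "the cell to the right of TV" become simple lookups, which is the kind of positional attribute the paragraph above describes.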

With vision-centered information recognition, many kinds of information besides the product information extracted by the current system can be extracted, and the priority and hierarchy of information can also be identified, so the method can be applied generally to all kinds of information, among other advantages.

B) Automatic extraction of key information

One of the most important problems in handling varied information that one has not defined oneself is discerning which items are the information actually wanted. This matters even more when, as in the current system, thousands or tens of thousands of similar items must be extracted precisely on the basis of simple information, because correctly identifying the wanted information is what reduces the errors that can arise in extraction. Until now, however, this has usually been done with simple dictionaries or by relying entirely on human help. In that case the share of manual work rises sharply and the desired level of automation cannot be reached.

The information analysis system, by contrast, has an automatic extraction function for key information and can extract the desired information without human help. In addition to analysis by word-level comparison, it uses a separate analyzer that analyzes information according to its type and format, and it also takes the content of the information into account. As a result, a much wider range of information can be analyzed than with existing methods, and the depth and accuracy of the information are also very high. Automating the acquisition of information makes it possible to automate every stage of information analysis and greatly reduces the administrator's workload.
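As a rough illustration of analyzing a field by its type and format rather than by word matching, a price can be recognized from its shape alone. The patent does not publish its analyzer's rules, so the pattern and the labels below are purely hypothetical.

```python
import re

# Hypothetical format rule: digits, optionally grouped by commas,
# optionally followed by a currency word.
PRICE_RE = re.compile(r"^(?:\d{1,3}(?:,\d{3})+|\d+)\s*(?:원|won)?$", re.IGNORECASE)


def classify_field(text):
    """Label a field as 'price', 'text', or 'empty' from its format alone."""
    value = text.strip()
    if not value:
        return "empty"
    if PRICE_RE.match(value):
        return "price"
    return "text"


labels = [classify_field(v) for v in ("300,000원", "50000 won", "Color TV 29 inch", "  ")]
```

A bare number such as "29" would also be labeled a price by this rule, which suggests why format alone is not enough and why the system described above also weighs the content of the information.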

다) 지능형 정보분석C) Intelligent information analysis

원하는 정보의 범위를 상당수 좁혔다 하더라도 원하는 정보를 정확히 알아내는 것과는 상당한 차이가 있다. 정보의 형태를 나타낼 수 있는 구조적 형태를 분석해 냈다 하더라도 단편적인 분석정보를 바탕으로 하여 전체 유사정보를 일괄적으로 분석해 내는 것은 여전히 난해한 문제로 남게 된다. 이러한 부분을 해결하기 위하여 기존에는 대개 완벽하게 일치하는 몇몇 경우에 대해서만 정보를 자동으로 추출해 내는 방식을 활용하였고, 따라서 영원히 모든 정보를 완벽하게 획득할 수는 없게 되었다. 인터넷 상에 존재하는 무한한 수의 정보를 대상으로 하는 검색엔진의 입장에서 볼 때, 이러한 원론적인 제약은 대단히 큰 장애가 되며, 반드시 극복해야만 하는 문제라 할 수 있다.Even if you narrow down the scope of information you want, there is a significant difference from knowing exactly what you want. Even if the structural form that can represent the information is analyzed, it is still difficult to analyze all the similar information on the basis of fragmentary analysis information. In order to solve this problem, conventionally, a method of automatically extracting information only for a few cases of perfect matches is not available, so that all information cannot be obtained forever. From the perspective of search engines targeting an infinite number of pieces of information on the Internet, these fundamental constraints are a huge obstacle and must be overcome.

To solve this problem, the information analysis system has been given an intelligent analysis capability: even when the structure or content of two pieces of information does not match exactly, it reasons about them and determines that they are the same information in a different form. Information as a whole can never exist consistently in a single structure, and no such assumption should be made. The information analysis system makes no assumption from the outset about which forms may exist, and it is conceptually quite distinct from approaches that enumerate possible cases. It knows only the basic rules about which information is key information and how such pieces of information relate to one another, and it holds no database of any kind about the layouts of shopping malls. For this reason it can analyze information even in cases outside those normally anticipated, and it can resolve on its own unpredictable situations that arise later. When no relevant information exists, it can even conclude that none exists, so that case is handled as well. Through this intelligent analysis capability the information analysis system fundamentally resolves the problems that can arise in acquiring the relevant information, and it remains durable in that it can keep reasoning in the same way indefinitely.

D) Convenient management structure

Although the information analysis system was developed to be fully automatic and independent of human supervision, this does not mean that human management is entirely unnecessary. Even intelligent software naturally becomes more accurate with human help, and given the possibility of unpredictable errors, a structure for human management must also be properly considered and prepared.

The information analysis system allows human management to intervene only at the beginning and at the end of information acquisition. Human participation is excluded entirely from the acquisition process itself; checks are possible only before and after the information is obtained. The administrator therefore only needs to supply supporting information about the acquisition in advance and briefly verify the result afterwards, and is freed from the excessive work that acquisition would otherwise involve. As a result, the staff needed to manage the information analysis system can be reduced to about one person, which cuts management costs dramatically, and because the administrator intervenes so rarely, the scope for errors caused by human mistakes is also sharply reduced.

C. Classification Analysis System

(1) Introduction

The classification analysis system automatically analyzes and determines the classification of external data when that data is entered into an internal DB that carries classification information. It is therefore useful in any system that loads external data into its own DB automatically.

When applied to websites, the classification analysis system includes a function that determines the classification of a web page as a whole and a function that determines a separate classification for each piece of information within the page. In addition, when another website has its own classification whose items are broader or more detailed than those of the internal classification, the system includes a function that matches the external classification to the internal one.

(2) Functional diagram

The system consists of two functions: one that determines the classification of an information group containing many pieces of information, and one that determines the classification of each individual piece of information. Classification using both functions in combination is also possible. Each function is described in detail below.

(A) Classification of an information group

This function assigns a single classification to an entire group of information. Cases in which an information group belongs to several classifications are excluded here, because they are handled by the combined function described in (C).

Before the classification of an information group is determined, the relationships among the information groups are identified. The relationships between groups are reconstructed as logical rather than physical relationships, and the hierarchical, containment, and parallel relationships among the groups are determined so that several information groups can be assembled into one classification.

The resulting logical classification is then matched against the internal classification. This matching falls into the following three cases.

1. The internal classification and the classification of the external information coincide

Because the classifications match exactly, the classification of the external information can be used as it is, with no further work.

2. The classification of the external information is more detailed than the internal classification

In this case a single item of the internal classification covers several external classification items, so the classification is determined by grouping the detailed external classification items and matching the group to the internal item.

3. The internal classification is more detailed than the classification of the external information

In this case the group-level classification function alone cannot determine the classification of each piece of information. The information group is therefore split into several groups by means of the per-item classification function described in (B), and the classification is then determined.

(B) Classification of a single piece of information

This function determines the classification of each piece of information individually, based on its content, and it requires a separate dictionary DB. The dictionary DB is not entered manually by the administrator; it is generated automatically from data whose classification has already been determined or from internal information data. To prevent classification errors, the dictionary DB does not simply match information to classifications 1:1. Instead it links information to a classification within each classification, and this is implemented with recursive calls so that it works regardless of the depth of the classification.

(C) Combined classification

This function combines the two classification functions described in (A) and (B). It is used when the classification of the external information is broader than the internal classification, or when the information group itself contains information belonging to several classifications.

The classification described in (A) is determined first; it does not match a lowest-level classification but resolves to an upper classification that contains lower ones. Within that upper classification, the dictionary DB mentioned in (B) is then used to determine the classification of each piece of information. The error in which the classification of an individual item falls outside the classification determined for its information group does not occur, because the dictionary DB is organized recursively so that it can be applied within an information group.
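The combined procedure can be illustrated with a minimal Python sketch. The classification tree, the sample dictionary contents, and the function names below are hypothetical and are not taken from the specification; the sketch only shows how a lookup that starts from the group-level classification and recurses into its sub-classifications can never return a classification outside that group.

```python
from typing import Optional

# Hypothetical internal classification tree: classification -> sub-classifications.
CHILDREN = {
    "computer": ["input devices", "monitor"],
    "input devices": ["mouse", "keyboard"],
    "monitor": [],
    "mouse": [],
    "keyboard": [],
}

# Hypothetical dictionary DB: product names already known to belong
# directly to each classification (built from previously classified data).
ITEMS_IN = {
    "mouse": {"logitech wheel"},
    "keyboard": {"natural keyboard"},
    "monitor": {'15" flat'},
}


def classify_item(name: str, classification: str) -> Optional[str]:
    """Return the most specific classification under `classification` listing `name`."""
    for child in CHILDREN.get(classification, []):
        found = classify_item(name, child)  # recurse to any depth
        if found is not None:
            return found
    if name.lower() in ITEMS_IN.get(classification, set()):
        return classification
    return None


def classify_group(items, group_classification):
    """Classify each item inside the classification already determined for the group."""
    return {
        item: classify_item(item, group_classification) or group_classification
        for item in items
    }
```

For example, `classify_group(["Logitech Wheel", "Unknown item"], "computer")` places the first product in `mouse` and leaves the second at the group-level classification `computer`.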

(3) Implementation method

Internally, the classification analysis system is divided into two main parts: classification of information groups and classification of individual pieces of information. Matching the classification determined for external information against the internal classification is done separately in each of these cases, so it is not a separate part.

Each implementation method is described in terms of the present overall system, the shopping mall comparison search engine. Accordingly, a piece of information means a product sold in a shopping mall, and an information group means a web page that contains multiple products.

When a shopping mall already classifies the products it sells, the program must identify the relationships among those classifications and determine the logical classification scheme. Because the data consists of web pages, this is done by interpreting and analyzing the characteristics of the HTML tags themselves.

1. Determining the relationships between pages

Relationships between web pages are identified primarily from the HTML tags that can act as links. Web pages, however, have no physical parent/child structure; they are connected like a spider's web, so the important task is to interpret these connections as parent/child relationships. To do so, the core link among the many links on each page must be determined. This is done by storing, for every page, the first link and the immediately preceding link, and then reconstructing the structure from them. Links that form a parallel structure are excluded at this stage, for example when a product listing is split across several pages joined by 'previous' and 'next' links. Parallel structures are described again below.

As explained above, turning the web-like page structure into a logical one requires finding the core link among many links. Because the system can determine whether a page contains product information, it reconstructs the links using only the pages that do, which makes it possible to isolate the core links.

For each link, all the information judged to be necessary is collected and the core link is found on that basis; this is processed recursively. That is, the classification of a given product page is sought on the assumption that the classification of the level above it has already been determined and matched to the internal classification structure.
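One way to picture the reconstruction of a parent/child structure from web-like links is the following Python sketch. It is an illustration under stated assumptions, not the method of the specification: it keeps, for every page, only the first link through which the page is reached in a breadth-first traversal, and it treats links whose text is 'previous' or 'next' as parallel links. The function name, the link representation, and the pagination word list are all hypothetical, and the filtering of pages that contain no product information is omitted.

```python
from collections import deque

# Assumed link texts that mark parallel (pagination) links.
PAGINATION_TEXT = {"previous", "next", "이전", "다음"}


def build_hierarchy(links, root):
    """Reconstruct a child -> parent map from a web-like link graph.

    links: dict mapping a page to a list of (link_text, target_page) pairs.
    """
    parent = {root: None}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for text, target in links.get(page, []):
            if text.strip().lower() in PAGINATION_TEXT:
                continue  # parallel link, not a parent/child link
            if target not in parent:  # keep only the first link reaching a page
                parent[target] = page
                queue.append(target)
    return parent
```

Back links such as a 'Home' link on a lower page are ignored automatically, because the target page already has a recorded parent.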

The information needed from an ordinary web page link is the link text, which is compared against a separate dictionary. This dictionary, called link_dictionary, is organized so that the classification corresponding to a link text can be determined. It basically holds strings that match product classifications 1:1, together with strings for more detailed items and synonyms. A separate synonym dictionary is also applied, and it is the same synonym dictionary that is used in ordinary searches.
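A minimal Python sketch of such a lookup follows. Only the name link_dictionary comes from the text; the sample entries, the classification codes, and the normalization rule are assumptions made for illustration.

```python
import re

# Hypothetical synonym dictionary: synonym -> representative word.
SYNONYMS = {"mice": "mouse", "pointingdevice": "mouse", "displays": "monitor"}

# Hypothetical link_dictionary: link text (representative word) -> classification code.
LINK_DICTIONARY = {"mouse": "C0102", "monitor": "C0103"}


def normalize(text):
    """Lower-case the text and drop whitespace and punctuation."""
    return re.sub(r"[\W_]+", "", text).lower()


def classify_link(link_text):
    """Return the classification code for a link text, or None if unknown."""
    key = normalize(link_text)
    key = SYNONYMS.get(key, key)  # map a synonym to its representative word
    return LINK_DICTIONARY.get(key)
```

Under these sample entries, `classify_link("Pointing Device")` and `classify_link(" Mouse ")` both resolve to the same code, while a navigation link such as "Contact us" yields `None` and has to be handled by one of the other structure-identification methods.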

Classification is still possible when a link has no text, because string-based processing is only one of several methods used to identify the structure.

Even when nothing is known about a product page as a whole and only its product list can be obtained, the product classification can be determined, because a function exists that determines the classification of each product. With this function, products can be classified accurately even on a page such as a 'hit products' page that mixes products from several classifications.

2. Identifying parallel structures

Pages of product information that exist in parallel are normally joined by page links, and this common pattern can be recognized. Even without page links, however, the classification of a page can be determined from the products it contains, so consecutive pages that resolve to the same classification are treated as a parallel structure. Moreover, even if the pages are not actually laid out in parallel, the decision is based on the information itself rather than on the physical links, so they appear as a parallel structure in the reconstructed form.
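The rule that consecutive pages with the same classification form a parallel structure can be sketched in Python as follows. The page list and its representation are hypothetical; the classification of each page is assumed to have been determined already from the products it contains.

```python
from itertools import groupby


def group_parallel_pages(pages):
    """Group consecutive pages that share a classification.

    pages: ordered list of (url, classification) pairs.
    Returns a list of (classification, [urls]) runs.
    """
    return [
        (classification, [url for url, _ in run])
        for classification, run in groupby(pages, key=lambda page: page[1])
    ]
```

Each run with more than one URL corresponds to a set of parallel pages, regardless of whether the pages are physically joined by page links.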

(B) Classification of each piece of information

Determining the classification of each piece of product information requires a separate dictionary called ITEM_D, which is generated automatically from the product information currently held. Products with the same name that exist in different classifications could cause errors, but this problem is avoided because the ITEM_D DB does not match product names to classifications 1:1; its structure includes the internal classification itself.

D. Intelligent Search System

(1) Introduction

The intelligent search system is a system for finding data effectively in a DB that has a classification structure. To return only the results the user wants from a large volume of data, it analyzes the meaning of the search terms and identifies their relationship to the classifications. It supports multi-word queries and synonym search.

(2) Functional diagram

The intelligent search system distinguishes three kinds of search: single-word, two-word, and compound-word. In each case it identifies the relationship between the query and the classification structure.

(A) Single-word search

The simplest case, a one-word query, falls into two broad cases. First, the synonym dictionary is consulted to find the representative word and all synonyms of the search term. If the term has no synonyms, the search term itself becomes the representative word and is treated as its own single synonym.

Next, the classification corresponding to the representative word is looked up in the internal classification information. A classification either exists or does not. If none exists, an ordinary synonym search is performed. If one exists, the information in that classification is shown; if the classification is an upper classification rather than a lowest-level one, its sub-classifications are shown instead, so that a user who entered a broad term is led to choose a more specific classification.
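The single-word flow can be summarized in a short Python sketch. All data and names below are hypothetical sample values; the sketch assumes an in-memory synonym dictionary, classification tree, and product list in place of the DBs described in the text.

```python
# Hypothetical sample data: representative word -> synonyms.
SYNONYMS = {"mouse": ["mice", "pointing device"]}
# Classification -> sub-classifications (an empty list marks a lowest-level one).
TREE = {"computer": ["mouse", "monitor"], "mouse": [], "monitor": []}
# (product name, classification) pairs.
PRODUCTS = [("Logitech Wheel", "mouse"), ('15" flat', "monitor"), ("Mouse pad", "accessory")]


def resolve(term):
    """Return (representative word, all words to match) for a search term."""
    t = term.lower()
    for representative, synonyms in SYNONYMS.items():
        if t == representative or t in synonyms:
            return representative, [representative, *synonyms]
    return t, [t]  # unregistered term: its own representative and only synonym


def search_one(term):
    representative, words = resolve(term)
    if representative not in TREE:
        # No classification: ordinary synonym search on product names.
        hits = [name for name, _ in PRODUCTS if any(w in name.lower() for w in words)]
        return ("products", hits)
    if TREE[representative]:
        # Upper classification: offer its sub-classifications instead.
        return ("subclassifications", TREE[representative])
    return ("products", [name for name, c in PRODUCTS if c == representative])
```

With this sample data, `search_one("mice")` returns the products of the `mouse` classification even though no product name contains the word, and `search_one("computer")` returns the sub-classifications to choose from.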

(B) Two-word search

A two-word search examines the relationship between the two words. First, the representative word and synonyms of each word are found. As in a single-word search, the classification corresponding to each representative word is then looked up, and it may or may not exist. Since each of the two terms may or may not have a classification, there are three cases. There are three rather than four because the order of the two words does not matter: even though a broad word is usually followed by a more specific one, the relationship between the words is evaluated afresh, so the order carries no meaning.

(i) Neither search term has a classification

An ordinary AND and OR search is performed. Because the representative words and synonyms were already determined in the previous step, synonym search is included automatically.

(ii) Only one search term has a classification

A search within the classification is performed automatically: the information matching the other search term is sought inside that classification. The synonyms determined earlier are applied here as well.

(iii) Both search terms have a classification

The hierarchical and containment relationship between the two classifications is examined, and the case divides further according to that relationship. In the first sub-case one classification contains the other, and the information in the lower classification is shown.

In the second sub-case neither classification contains the other. The second search term is then searched for within the first classification, the first search term is searched for within the second classification, and the results are output.
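The case analysis for two-word searches can be written down as a small dispatcher. The Python sketch below is illustrative only: the classification tree, the action names, and the assumption that both words have already been replaced by their representative words are not part of the specification.

```python
# Hypothetical classification tree: classification -> parent (None for top level).
PARENT = {
    "computer": None,
    "mouse": "computer",
    "monitor": "computer",
    "furniture": None,
    "desk": "furniture",
}


def is_ancestor(a, b):
    """True if classification `a` is `b` itself or lies above `b`."""
    while b is not None:
        if a == b:
            return True
        b = PARENT[b]
    return False


def plan_two_word_search(w1, w2):
    """Return the list of search actions for two representative words."""
    has1, has2 = w1 in PARENT, w2 in PARENT
    if not has1 and not has2:
        return [("text_and_or", w1, w2)]          # case (i)
    if has1 != has2:
        classification, text = (w1, w2) if has1 else (w2, w1)
        return [("search_in", classification, text)]  # case (ii)
    if is_ancestor(w1, w2):
        return [("list", w2)]                     # case (iii), containment
    if is_ancestor(w2, w1):
        return [("list", w1)]
    # Case (iii), no containment: search each term inside the other classification.
    return [("search_in", w1, w2), ("search_in", w2, w1)]
```

Because the relationship is evaluated from the tree rather than from word order, `plan_two_word_search("mouse", "computer")` and `plan_two_word_search("computer", "mouse")` produce the same plan.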

(C) Compound-word search

A compound query is one of three or more words. It is handled in basically the same way as a two-word search, but because it cannot be divided into three cases as a two-word search can, the search terms must be reduced recursively: when two or more of the search terms correspond to classifications and one classification contains the other, the term for the upper classification is removed. If this leaves two words or fewer, the search functions described in (A) and (B) are executed; if three or more words remain, the results are produced by an AND and OR search.
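The reduction step can be sketched in Python as follows. The classification tree is hypothetical sample data, and the sketch removes all redundant upper-classification terms in a single pass, which gives the same result as removing them one at a time recursively.

```python
# Hypothetical classification tree: classification -> parent (None for top level).
PARENT = {"computer": None, "mouse": "computer", "furniture": None, "desk": "furniture"}


def ancestors(classification):
    """Return every classification strictly above the given one."""
    found = set()
    current = PARENT.get(classification)
    while current is not None:
        found.add(current)
        current = PARENT.get(current)
    return found


def reduce_terms(terms):
    """Drop terms whose classification contains another term's classification."""
    redundant = set()
    for term in terms:
        if term in PARENT:
            redundant |= ancestors(term)
    return [term for term in terms if term not in redundant]
```

For instance, `reduce_terms(["computer", "mouse", "wireless"])` leaves two terms, so the two-word search applies, whereas `["computer", "desk", "wood"]` cannot be reduced and falls through to the AND and OR search.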

(3) Implementation method

Before the implementation is described, the databases needed to build the intelligent search system are introduced. Compound queries of three or more words are implemented in the same way as two-word searches, differing only in that they use a recursive implementation internally, so they are not discussed separately. Each implementation method is described in terms of the present overall system, the shopping mall comparison search engine. Accordingly, a classification means a product classification, and information means product information.

(A) Synonym dictionary DB

The synonym dictionary is used only for search terms, so its structure can be very simple: a list of representative words, each with its synonyms. Multi-word queries are evaluated one word at a time, so it is assumed that no representative word exists for a multi-word phrase. A separate function that removes spaces and other special characters is written to perform this step.

The synonym dictionary is not fixed and can change over time, so a tool for adding, editing, and deleting entries is needed. To make registration effective, the log of users' search terms is analyzed so that synonyms for frequently searched words can be registered easily.

(B) Single-word search

The first step in a single-word search is to find the representative word and its synonyms. This uses the synonym dictionary DB described above and is easily done with an SQL select statement. Multiple synonyms are handled with a list tree structure.

The second step is to find the product classification corresponding to the representative word. Only the representative word is used, not the synonyms, because when a set of synonyms is registered, the synonym that coincides with the product classification is registered as the representative word.

When a product classification exists, the products in that classification are returned regardless of the search term, so a product can be found even if its name does not contain the search term, as long as it belongs to the classification. The internal search sequence is as follows.

Search for 'mouse' -> '마우스' (the Korean word for mouse) is determined to be the representative word -> 'mouse' exists as a synonym -> the product classification for '마우스' is looked up -> the products in the '마우스' classification are output -> a product such as 'Logitech Wheel' is found even though 'mouse' does not appear in its name

(C) Two-word search

A two-word search requires finding the classification that corresponds to each search term.

The full query is split into tokens using the space as the delimiter. Any remaining spaces are then stripped from each token, because the synonym dictionary DB contains no spaces: a space would already imply two or more words, so no entry in the synonym dictionary contains one.
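The tokenization step might look like the following Python sketch. Which special characters are stripped is not stated in the text; here it is assumed that everything except letters, digits, underscores, and the double quote (so that a size such as 15" survives) is removed, and that tokens are lower-cased.

```python
import re


def tokenize(query):
    """Split a query on whitespace and strip special characters from each token."""
    cleaned = (re.sub(r'[^\w"]', "", token) for token in query.split())
    return [token.lower() for token in cleaned if token]
```

Each resulting token can then be looked up in the synonym dictionary on its own.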

For each token, the representative word and synonyms are found through the synonym dictionary, exactly as in the single-word case described in (B).

Both words always have a representative word, because a search term that is not registered in the synonym dictionary is used as its own representative word.

The representative word is used to find the corresponding classification in the classification DB; in the classification dictionary, each classification name is linked to a representative word of the synonym dictionary. This works because, when synonyms are newly registered, the synonym that corresponds to a classification is registered as the representative word.

Once the classifications have been looked up, there are three cases.

If neither search term has a classification, an ordinary AND and OR search is performed. Synonyms are included in this case too, and it is implemented with SQL queries as follows.

DICTIONARY and 'representative-word Field' denote the synonym dictionary table and the field that holds the representative word, respectively.

select * from [DICTIONARY] where [representative-word Field]='representative word'

The SQL query basically uses a regular expression, because there are multiple synonyms. All the synonyms are combined with OR, and since there can be any number of them, this is implemented recursively. The complete query therefore takes the following form.

DBName and model denote an arbitrarily named DB table and the field that holds the product name.

select * from [DBName] where model regexp [representative1 OR synonym1 OR synonym2 OR ...] AND [representative1' OR synonym1' OR synonym2' OR ...]

For an OR search between the two search terms, the SQL query takes the following form.

select * from [DBName] where model regexp [representative1 OR synonym1 OR synonym2 OR ...] OR [representative1' OR synonym1' OR synonym2' OR ...]
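The schematic AND and OR queries can be made concrete with a small runnable sketch. The example below is an assumption-laden illustration, not the implementation of the specification: it uses Python with an in-memory SQLite database, registers a user-defined REGEXP function (SQLite has no built-in one), and uses a hypothetical `products` table and sample rows. Each bracketed synonym group becomes one regular-expression alternation.

```python
import re
import sqlite3


def alternation(words):
    """Build a regex that matches any of the words (one bracketed OR group)."""
    return "(" + "|".join(re.escape(word) for word in words) + ")"


def build_query(group1, group2, mode="AND"):
    """Return (sql, params) combining two synonym groups with AND or OR."""
    if mode not in ("AND", "OR"):
        raise ValueError(mode)
    sql = f"SELECT model FROM products WHERE model REGEXP ? {mode} model REGEXP ?"
    return sql, (alternation(group1), alternation(group2))


conn = sqlite3.connect(":memory:")
# SQLite evaluates "X REGEXP Y" by calling the user function regexp(Y, X).
conn.create_function(
    "REGEXP", 2,
    lambda pattern, value: int(re.search(pattern, value, re.IGNORECASE) is not None),
)
conn.execute("CREATE TABLE products (model TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?)",
    [("Wireless Mouse",), ("Wired mice 2-pack",), ("Wireless keyboard",)],
)

sql, params = build_query(["mouse", "mice"], ["wireless", "cordless"], "AND")
and_hits = [row[0] for row in conn.execute(sql, params)]

sql, params = build_query(["mouse", "mice"], ["wireless", "cordless"], "OR")
or_hits = sorted(row[0] for row in conn.execute(sql, params))
```

Passing the patterns as bound parameters rather than concatenating them into the SQL text avoids quoting problems when a synonym contains characters such as quotation marks.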

If only one of the two search terms has a classification, a search within that classification is performed, and synonym search is included as part of it.

The DB that holds the classification information carries a separate classification code, which is used to construct the SQL query as follows.

select * from [DBName] where [classification-code Field]='classification code' AND model regexp [representative1 OR synonym1 OR synonym2 OR ...]

Each product carries the classification it belongs to, stored in the 'classification-code Field' above. The 'classification code' is the code derived from the search term.

An example is the query '15-inch monitor'. Since a category corresponding to 'monitor' exists, the string '15-inch' is searched for within the monitor category. Moreover, because the synonym dictionary contains both '15-inch' and '15"', a product such as '15" flat' can also be found.
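The '15-inch monitor' example can be tried out with a small in-memory database. The sketch below is hypothetical throughout: the table layout, the category code `MON`, the dictionary contents, and the sample rows are assumptions made for illustration. It uses SQLite, which only evaluates `REGEXP` once the application registers a `regexp` function, so one is supplied through Python's `re` module.

```python
import re
import sqlite3

# Hypothetical synonym dictionary: representative term -> all equivalent spellings.
SYNONYMS = {"15-inch": ["15-inch", '15"']}
# Hypothetical category dictionary: search term -> category code.
CATEGORY_CODES = {"monitor": "MON"}


def regexp(pattern, value):
    """Backs SQLite's `value REGEXP pattern`, which calls regexp(pattern, value)."""
    return 1 if value is not None and re.search(pattern, value) else 0


def demo_connection():
    """Create an in-memory product table filled with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("regexp", 2, regexp)
    conn.execute("CREATE TABLE products (model TEXT, category_code TEXT)")
    conn.executemany(
        "INSERT INTO products VALUES (?, ?)",
        [
            ('Model A 15" flat', "MON"),
            ("Model B 15-inch", "MON"),
            ("Model C 17-inch", "MON"),
            ("Model D 15-inch notebook", "NB"),
        ],
    )
    return conn


def search_in_category(conn, category_term, keyword):
    """Search for the keyword and its synonyms inside one category only."""
    code = CATEGORY_CODES[category_term]
    words = SYNONYMS.get(keyword, [keyword])
    pattern = "|".join(re.escape(word) for word in words)
    rows = conn.execute(
        "SELECT model FROM products "
        "WHERE category_code = ? AND model REGEXP ? ORDER BY model",
        (code, pattern),
    )
    return [row[0] for row in rows]
```

Calling `search_in_category(demo_connection(), "monitor", "15-inch")` returns both 15-inch monitors, including the one listed only as `15"`, while the 15-inch notebook is excluded because it belongs to another category.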

If both search terms correspond to categories, the handling is further divided according to the inclusion relationship between the two categories. In the first case one category includes the other; in the second case neither includes the other.

For a query such as 'computer mouse', a category corresponding to 'computer' exists, a category corresponding to 'mouse' exists, and the two categories stand in an inclusion relationship. In this case the products in the 'mouse' category are shown. The search succeeds even if the string 'computer mouse' does not appear in the product information: a product whose product information is entirely different, such as 'Logitech Wheel', still appears in the results, which is what makes the search intelligent.
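The inclusion check can be sketched as a walk up a category tree. The tree below and the names `PARENT`, `includes`, and `narrower_category` are hypothetical; the source does not say how the category hierarchy is stored, so a simple child-to-parent map is assumed.

```python
# Hypothetical category tree: child category -> parent category (None = top level).
PARENT = {
    "computer": None,
    "mouse": "computer",
    "furniture": None,
    "desk": "furniture",
}


def includes(ancestor, category):
    """Return True if `ancestor` is `category` itself or lies above it in the tree."""
    while category is not None:
        if category == ancestor:
            return True
        category = PARENT.get(category)  # move one level up
    return False


def narrower_category(cat_a, cat_b):
    """Return the more specific of two categories, or None if neither includes the other."""
    if includes(cat_a, cat_b):
        return cat_b
    if includes(cat_b, cat_a):
        return cat_a
    return None
```

For 'computer' and 'mouse' this yields 'mouse', so every product in that category is listed regardless of the wording of its product information. A `None` result signals that the two categories are unrelated and a different search strategy is needed.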

The second case is a query such as 'computer desk', where a category corresponding to 'computer' exists and the category corresponding to 'desk' lies within the 'furniture' category. Here a compound AND search is performed: 'desk' is searched for within the 'computer' category and, at the same time, the product 'computer' is searched for within the 'desk' category. If the 'computer' category contains a 'desk' and the 'desk' category contains a 'computer desk', both can be shown in the search results.

In simplified form, the SQL query is as follows.

select * from [DBName] where ([category code field]='category code 1' AND model regexp [representative term 1 OR synonym 1 OR synonym 2 OR ...]) OR ([category code field]='category code 2' AND model regexp [representative term 1' OR synonym 1' OR synonym 2' OR ...])

Mapped onto the example above, category code 1 is the category corresponding to 'computer' and representative term 1 is 'desk'; category code 2 is the category corresponding to 'desk' and representative term 1' is 'computer'. Synonym search through the synonym dictionary is available in this case as well.
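As a rough sketch, the compound query for two unrelated categories could be assembled as below. The function name `compound_query`, the column name `category_code`, and the codes `COMP` and `DESK` are hypothetical, and the table and column names are assumed to be trusted identifiers. Each pair couples one category code with the other term's word list, and the two parenthesized clauses are joined with OR as in the query form above.

```python
import re


def compound_query(table, code_field, pairs):
    """Return (sql, params) for a search across categories with no inclusion relation.

    `pairs` is a list of (category_code, words), where `words` holds the
    representative term followed by its synonyms.
    """
    clause = f"({code_field} = ? AND model REGEXP ?)"
    sql = f"SELECT * FROM {table} WHERE " + " OR ".join(clause for _ in pairs)
    params = []
    for code, words in pairs:
        params.append(code)
        # Synonyms of one term are alternatives, so join them with '|'.
        params.append("|".join(re.escape(word) for word in words))
    return sql, params
```

For 'computer desk', the call would be `compound_query("products", "category_code", [("COMP", ["desk"]), ("DESK", ["computer"])])`: 'desk' is looked up inside the computer category and 'computer' inside the desk category, and rows matching either clause are returned.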

Almost all information on the Internet is registered haphazardly, without being organized or linked, so its volume already exceeds what an individual can digest, and it can no longer serve as a meaningful collection of valuable information. Overcoming this situation requires high-level software that grasps the meaning of information, analyzes it, and supplies an appropriate level of information when it is needed, and the present software is of this kind. By automating the three stages of recognizing, analyzing, and providing such ambiguous information, the technologies used in this software can be said to set an important milestone for the future development of Internet-related intelligent agent software.

Claims

Intelligent Information Retrieval and Analysis Agent