KR101806452B1

KR101806452B1 - Method and system for managing total financial information

Info

Publication number: KR101806452B1
Application number: KR1020160048820A
Authority: KR
Inventors: 신명일; 이용현; 임홍준
Original assignee: (주)원제로소프트
Priority date: 2016-04-21
Filing date: 2016-04-21
Publication date: 2017-12-08
Also published as: KR20170120389A

Abstract

A method for automatic product mapping based on text mining may include: receiving, by an automatic product mapping apparatus, technical document data for a plurality of products from a database; extracting, by the apparatus, a plurality of words from the technical document data; generating, by the apparatus, a keyword dictionary for a company based on the extracted words; calculating, by the apparatus and based on the keyword dictionary, TF-IDF weights that reflect association analysis for each of an ordered product and existing products; calculating, by the apparatus, the cosine similarity between the ordered product and the existing products based on the TF-IDF weights; and generating, by the apparatus, a search result based on the cosine similarity.

Description

Method and apparatus for automatic product mapping based on text mining

The present invention relates to a method and apparatus for providing product information and, more particularly, to a method and apparatus for automatic product mapping based on text mining.

Text mining is a technique that converts unstructured natural-language data into structured data and then analyzes it to obtain useful information. It can be applied to Internet search engines and reading-room search systems, and it can classify documents according to predefined categories. When no prior information about the documents exists, document clustering can group similar documents so that their meaning, or their author, can be identified. Document clustering is also applied to studies of the intellectual structure of academic fields. For example, the intellectual structure of secretarial studies can be identified through document clustering based on text mining, and the documents can be represented in two dimensions using multidimensional scaling.

Because companies can use text mining to obtain more detailed information about their customers, they can draw on a wider range of information when making decisions. Researchers can obtain new findings by easily searching for or analyzing documents of interest. Text mining can also be used for preliminary research and analysis when preparing a questionnaire: using text mining results makes the questionnaire more realistic and thorough, and the results obtained through it become more objective data.

KR 10-2009-0124469

One aspect of the present invention provides a method for automatic product mapping based on text mining.

Another aspect of the present invention provides an apparatus that performs the method for automatic product mapping based on text mining.

According to one aspect of the present invention, a method for automatic product mapping based on text mining includes: receiving, by an automatic product mapping apparatus, technical document data for a plurality of products from a database; extracting, by the automatic product mapping apparatus, a plurality of words from the technical document data; generating, by the automatic product mapping apparatus, a keyword dictionary for a company based on the extracted plurality of words; calculating, by the automatic product mapping apparatus and based on the keyword dictionary, TF (term frequency)-IDF (inverse document frequency) weights reflecting association analysis for each of an ordered product and existing products; calculating, by the automatic product mapping apparatus, the cosine similarity between the ordered product and the existing products based on the TF-IDF weights; and generating, by the automatic product mapping apparatus, a search result based on the cosine similarity.

The keyword dictionary is generated through morphological analysis of the technical document data for the plurality of products related to a specific company. In the association analysis, when each item of technical document data is one transaction unit, a lift value is determined based on the probability that a first word and a second word are both present in the technical document data, the probability that the first word is present in the technical document data, and the probability that the second word is present in the technical document data.

The TF-IDF weights reflecting the association analysis are determined by: generating a first document-keyword matrix for each of the ordered product and the existing products, from which unnecessary keywords have been filtered out based on the keyword dictionary; extracting itemsets from the first document-keyword matrix; generating a second document-keyword matrix reflecting the association analysis using Equation 2 below, which reflects a weight for the keywords constituting each itemset and the frequency of the itemset; and generating, based on the second document-keyword matrix, a third document-keyword matrix containing the TF-IDF weights using Equation 1 below, which reflects the association analysis.

<Equation 1>

[Equation image not available]

Here N is the number of existing products, DF_t is the number of existing products that contain the word t, and AW_{t,d} is determined based on Equation 2 below.

<Equation 2>

[Equation image not available]

ATF_{t,d} is determined based on Equation 3, which sums the association analysis weight and the TF weight.

<Equation 3>

$$ATF_{t,d} = A + TF_{t,d}$$

Here A is the association analysis weight, TF_{t,d} is the TF weight, and TF is the frequency with which a specific word occurs in each of the plurality of products.


In addition, the lift value is determined based on the following equation:

<Equation>

$$\mathrm{lift}(X \rightarrow Y) = \frac{\mathrm{Supp}(X \cup Y)}{\mathrm{Supp}(X)\,\mathrm{Supp}(Y)}$$

Here Supp(X∪Y) is the probability that the first word and the second word are both present in the technical document data, Supp(X) is the probability that the first word is present in the technical document data, and Supp(Y) is the probability that the second word is present in the technical document data.


According to another aspect of the present invention, an apparatus for automatic product mapping based on text mining includes: a document collection module configured to receive technical document data for a plurality of products from a database; a word extraction module configured to extract a plurality of words from the technical document data; a data mining module configured to generate a keyword dictionary for a company based on the extracted plurality of words; a text mining module configured to calculate, based on the keyword dictionary, TF (term frequency)-IDF (inverse document frequency) weights reflecting association analysis for each of an ordered product and existing products; and a similarity measurement module configured to calculate the cosine similarity between the ordered product and the existing products based on the TF-IDF weights in order to generate a search result.

The keyword dictionary is generated through morphological analysis of the technical document data for the plurality of products related to a specific company. In the association analysis, when each item of technical document data is one transaction unit, a lift value is determined based on the probability that a first word and a second word are both present in the technical document data, the probability that the first word is present in the technical document data, and the probability that the second word is present in the technical document data.

The TF-IDF weights reflecting the association analysis are determined by: generating a first document-keyword matrix for each of the ordered product and the existing products, from which unnecessary keywords have been filtered out based on the keyword dictionary; extracting itemsets from the first document-keyword matrix; generating a second document-keyword matrix reflecting the association analysis using Equation 2, which reflects a weight for the keywords constituting each itemset and the frequency of the itemset; and generating, based on the second document-keyword matrix, a third document-keyword matrix containing the TF-IDF weights using Equation 1 below, which reflects the association analysis.

<Equation 1>

[Equation image not available]

Here A is the association analysis weight, TF_{t,d} is the TF weight, and TF is the frequency with which a specific word occurs in each of the plurality of products.


The method and apparatus for automatic product mapping based on text mining according to embodiments of the present invention offer convenience by giving users a more intuitive experience of search results, and they let users check search results quickly, which supports a smooth working environment. Extended to automatic product mapping, they allow a user to map ordered products for many products automatically with a simple command. Instead of the conventional approach of repeatedly searching one space-separated word at a time, the user enters the whole product name in the search box; the product is then analyzed automatically and the products the user is likely to want are shown, which can speed up the work.

FIG. 1 is a conceptual diagram of an automatic product mapping apparatus according to an embodiment of the present invention.
FIG. 2 is a table of support, confidence, and lift values, which measure the association between words contained in documents, according to an embodiment of the present invention.
FIGS. 3 and 4 are conceptual diagrams showing similarity measurement results according to an embodiment of the present invention.
FIG. 5 is a flowchart of a method for automatic product mapping based on text mining according to an embodiment of the present invention.

The following detailed description of the invention refers to the accompanying drawings, which show, by way of illustration, specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. It should be understood that the various embodiments of the invention, although different, are not necessarily mutually exclusive. For example, a particular shape, structure, or characteristic described here in connection with one embodiment may be implemented in other embodiments without departing from the spirit and scope of the invention. It should also be understood that the position or arrangement of individual components within each disclosed embodiment may be changed without departing from the spirit and scope of the invention. The following detailed description is therefore not to be taken in a limiting sense, and the scope of the invention, if properly described, is limited only by the appended claims together with the full range of equivalents to which those claims are entitled. In the drawings, like reference numerals refer to the same or similar functions throughout the several views.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the drawings.

An integrated online/offline sales inventory management system (hereinafter, inventory management system) can map products ordered online, for example through Coupang or Auction, to the existing products it manages. This mapping to existing products is a function for automating inventory management and is commonly called order product mapping. In most cases, however, order product mapping splits the text on spaces or special characters and simply compares the resulting words. With this approach, products that have a similar meaning but a different wording cannot be found, and a typo or a missing word can sharply degrade system performance. In addition, because the approach cannot reflect the naming characteristics that differ from company to company, the performance of the inventory management function inevitably declines over time. For example, an IT company manages product names by product code, so the English (alphabetic) part of the name may be the keyword that distinguishes products, whereas for other companies the English part of a product name may carry little importance. In a big data environment that manages product data in so many forms, conventional methods that do not take a meaning-based approach through data analysis are limited in how quickly they can process the work.

Embodiments of the present invention move away from order product mapping based on simple word search and automate order product mapping by applying text mining to the products. First, a company-specific keyword dictionary can be built with a machine learning model that accumulates the company's existing product data and learns its patterns and flow. In addition, association rule analysis, a data mining technique, is applied to document-type data to analyze relationships and patterns in the data, so that keywords can be extracted and weighted. Text mining techniques are then used to structure products that contain unstructured text, so that the similarity between products can be measured. Presenting products in descending order of similarity allows the system to serve as a recommendation system and, further, to perform automatic mapping. In this way, embodiments of the present invention propose a meaning-based product mapping system that relies on a probabilistic and statistical approach rather than simple character comparison.

FIG. 1 is a conceptual diagram of an automatic product mapping apparatus according to an embodiment of the present invention.

FIG. 1 shows in detail the operation of each module (or component) of the automatic product mapping apparatus.

Referring to FIG. 1, the automatic product mapping apparatus may include a document collection module (100), a word extraction module (110), a data mining module (120), a text mining module (130), and a similarity measurement module (140).

The document collection module (100) may be implemented to receive or download data on technical documents from a database. The technical documents may include information (product name, product specification, and so on) about the existing products and ordered products registered by companies.

The word extraction module (110) may first normalize each of the documents. Normalization may include at least one of removing special characters, removing stop words, and converting uppercase letters to lowercase. The normalized document is then the input to morphological analysis, which is performed to extract words from the document. Morphological analysis yields the words extracted from each document and the frequency of each word. Through this process, a document-word matrix can be generated whose entries are the term frequency (TF) of each word in each document. For example, the document-word matrix can be expressed as Equation 1 below.

<Equation 1>

$$X = \begin{pmatrix} tf_{1,1} & \cdots & tf_{1,n} \\ \vdots & \ddots & \vdots \\ tf_{m,1} & \cdots & tf_{m,n} \end{pmatrix}$$

In Equation 1, tf_{i,j} denotes the frequency of the j-th word in the i-th document, and m (a natural number of 1 or more) may be the total number of existing products and ordered products. That is, the word extraction module can perform morphological analysis on m products. Likewise, n (a natural number of 1 or more) may be the number of morphemes, the smallest units of meaning in a sentence, obtained from the morphological analysis. Identical words whose morphemes have different parts of speech can be treated as distinct. The document-word matrix (X) can therefore be expressed as an m×n matrix.
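The normalization and matrix construction described above can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the stop-word list, the helper names, and the sample product names are hypothetical, and plain whitespace splitting stands in for the morphological analyzer.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "of"}

def normalize(text):
    """Lowercase the text and replace special characters with spaces."""
    return re.sub(r"[^0-9a-z가-힣\s]", " ", text.lower())

def tokenize(text):
    """Split on whitespace (a stand-in for morphological analysis)
    and drop stop words."""
    return [w for w in normalize(text).split() if w not in STOP_WORDS]

def document_word_matrix(docs):
    """Return the vocabulary (n words) and the m x n matrix of term frequencies."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[w] for w in vocab] for c in counts]
    return vocab, matrix

# Hypothetical product names (m = 3 documents).
products = ["Container 500ml (clear)", "Container 1000 ml", "Lid for the container"]
vocab, X = document_word_matrix(products)
```

Each row of `X` corresponds to one product and each column to one word of `vocab`, matching the m×n layout described in the text.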

The data mining module (120) can perform data mining on the existing products registered by a company. First, a keyword dictionary for the company can be built from the large volume of product data the company has accumulated. That is, the word lists of the company's products analyzed by the word extraction module (110) are accumulated, so that data of a different form is stored for each company. Building the company keyword dictionary therefore yields the words that can appear in the company's products and, for each word, the number of products (documents) in which it appears (DF: document frequency). This shows which words the company mainly uses. In other words, analyzing the existing data reveals the flow of the data, such as which words are common across products and which are the company's core words.

The normalized word lists extracted by morphological analysis from the company's products are accumulated, so that a keyword dictionary of a different form can be built for each company. Building the company keyword dictionary thus yields the words that can appear in the company's products and the DF value of each word.
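As an illustration of the keyword dictionary, the sketch below counts, for each word, the number of products in which it appears (its DF). The function name and the sample token lists are hypothetical; the source does not specify how the dictionary is stored.

```python
from collections import Counter

def build_keyword_dictionary(tokenized_products):
    """Map each word to its document frequency (DF): the number of
    products whose word list contains the word at least once."""
    df = Counter()
    for tokens in tokenized_products:
        df.update(set(tokens))  # count each word once per product
    return dict(df)

# Hypothetical word lists for a company's existing products.
existing = [
    ["container", "500", "ml"],
    ["container", "1000", "ml"],
    ["lid", "container"],
]
keyword_dict = build_keyword_dictionary(existing)
```

Words with a high DF, such as `container` here, are the common words of the company; words with a low DF are the rarer ones.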

Next, the data mining module (120) can perform association rule analysis. In data mining, an association relationship means a similarity or pattern that exists among items. The basic idea of an association rule is as follows. Suppose a database contains n transactions, and let I denote the set of all items appearing in those transactions. For non-empty itemsets X and Y with X ⊂ I, Y ⊂ I, and X ∩ Y = ∅, the rule "when event X (the condition) occurs, event Y occurs" is written X → Y. Applied to documents, if the two terms {a} and {b} appear together in many documents, {a} and {b} can be regarded as associated terms. A transaction is the unit in which generated data is stored and can hold several items; applied to a document collection, one document becomes one transaction. The Apriori algorithm can be run to perform this association rule analysis. From its results, support, confidence, and lift can be used as measures of the association between words.

Support is the proportion of transactions in which a given itemset appears and can be calculated with Equation 2 below.

<Equation 2>

$$\mathrm{Supp}(X) = \frac{\text{number of transactions containing } X}{T}$$

In Equation 2, X is an itemset and T is the number of transactions. Confidence is the proportion of the transactions containing X that also contain Y. It indicates the accuracy of the rule X → Y, and a high confidence enables accurate prediction. Confidence satisfies Equation 3 below.

<Equation 3>

$$\mathrm{conf}(X \rightarrow Y) = \frac{\mathrm{Supp}(X \cup Y)}{\mathrm{Supp}(X)}$$

Finally, lift is a measure of the correlation between X and Y and is given by Equation 4 below.

<Equation 4>

$$\mathrm{lift}(X \rightarrow Y) = \frac{\mathrm{Supp}(X \cup Y)}{\mathrm{Supp}(X)\,\mathrm{Supp}(Y)}$$

The numerator is the probability that X and Y are both present in a transaction, and the denominator is the probability that both would be present if X and Y occurred independently. A lift of 1 means that X and Y are independent of each other, a lift greater than 1 means that X and Y are positively correlated, and a lift less than 1 means that they are negatively correlated.

Specifically, lift(x→y) is the ratio of how often the word y appears together with the word x to how often y would appear in a product name at random. In other words, it measures how much the rule improves the prediction of the item on its right-hand side.

Accordingly, even when support is low, a rule with high confidence is likely to be a useful association rule. In short, confidence indicates how often the consequent holds given the antecedent, whereas support indicates how reliable the rule itself is. Using these measures, highly associated keywords can be extracted from the products and given weights.
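The three measures can be computed directly from their definitions. The following is a small sketch with hypothetical transactions (each transaction being the set of words in one product name); the function names are illustrative.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Supp(X ∪ Y) / Supp(X): how often Y appears when X appears."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

def lift(x, y, transactions):
    """Supp(X ∪ Y) / (Supp(X) * Supp(Y)): 1 means X and Y are independent."""
    joint = support(set(x) | set(y), transactions)
    return joint / (support(x, transactions) * support(y, transactions))

# Hypothetical transactions: one set of words per product name.
transactions = [
    {"500", "container", "ml"},
    {"500", "container", "ml"},
    {"1000", "ml"},
    {"lid"},
]
x, y = {"500", "container"}, {"ml"}
s = support(x | y, transactions)
c = confidence(x, y, transactions)
l = lift(x, y, transactions)
```

In this toy data the rule {500, container} → {ml} has support 0.5 and confidence 1.0, and its lift is above 1, indicating a positive association.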

Depending on the embodiment, the document-word matrix generated earlier can be converted into a Boolean model to produce the transaction matrix data used by the algorithm. Running the Apriori algorithm with a user-defined minimum support then extracts the association rules, that is, the associated words.
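A compact sketch of this step is shown below: the term-frequency matrix is reduced to Boolean transactions, and a basic Apriori pass collects every itemset that meets the minimum support. It is a textbook-style illustration with hypothetical data and function names, not the implementation used in the patent.

```python
from itertools import combinations

def to_transactions(vocab, matrix):
    """Boolean model: a word belongs to a transaction if its TF is positive."""
    return [frozenset(w for w, tf in zip(vocab, row) if tf > 0) for row in matrix]

def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    n = len(transactions)

    def supp(items):
        return sum(items <= t for t in transactions) / n

    singles = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in singles if supp(frozenset([i])) >= min_support]
    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = supp(s)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset must already be frequent.
        current = [
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
            and supp(c) >= min_support
        ]
    return frequent

# Hypothetical vocabulary and term-frequency matrix.
vocab = ["500", "container", "lid", "ml"]
matrix = [[1, 1, 0, 1], [1, 1, 0, 2], [0, 1, 1, 0], [0, 0, 0, 1]]
frequent = apriori(to_transactions(vocab, matrix), min_support=0.5)
```

With a minimum support of 0.5, `lid` is dropped at the first level, so no candidate containing it is ever generated; this pruning is what keeps Apriori tractable on large vocabularies.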

FIG. 2 is a table of support, confidence, and lift values, which measure the association between words contained in documents, according to an embodiment of the present invention.

Referring to FIG. 2, for example, for the word set '500', 'container' and the word 'ml', the support value may be 0.097142857, the confidence value 1, and the lift value 2.611940299.

Specifically, the itemset combines the antecedent and the consequent of the rule (which share no common items); here it is the antecedent '500', 'container' together with the consequent 'ml'. The association rule means that when the words '500' and 'container' appear, the word 'ml' appears with them. The measures that quantify this itemset are support, confidence, and lift.
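The three values in this example are tied together by the definitions of the measures, so they can be cross-checked. The short derivation below is added only as a consistency check; it uses nothing beyond the FIG. 2 values quoted above, with X = {'500', 'container'} and Y = {'ml'}.

```latex
\mathrm{conf}(X \rightarrow Y) = \frac{\mathrm{Supp}(X \cup Y)}{\mathrm{Supp}(X)} = 1
\;\Rightarrow\; \mathrm{Supp}(X) = \mathrm{Supp}(X \cup Y) = 0.097142857

\mathrm{lift}(X \rightarrow Y) = \frac{\mathrm{Supp}(X \cup Y)}{\mathrm{Supp}(X)\,\mathrm{Supp}(Y)}
= \frac{\mathrm{conf}(X \rightarrow Y)}{\mathrm{Supp}(Y)}
\;\Rightarrow\; \mathrm{Supp}(Y) = \frac{1}{2.611940299} \approx 0.3829
```

Read this way, every product name that contains both '500' and 'container' also contains 'ml', while 'ml' on its own appears in roughly 38% of the product names.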

The text mining module (130) may first perform keyword selection. Keyword selection picks only the core keywords from a product (document) that morphological analysis has split into minimal units of meaning. That is, the document-keyword matrix can be generated from the document-word matrix by consulting the company keyword dictionary built by the data mining module (120). The keyword dictionary has already been trained for the company on its accumulated data. A word that is not in the keyword dictionary is therefore judged to be meaningless, and its weight in the document-word matrix can be set to 0 to exclude it. Keyword selection can thus be seen as removing the elements (words) that interfere with product mapping. In this way, a document-keyword matrix (X1), with non-keywords filtered out, can be generated from the document-word matrix (X).
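Keyword selection amounts to zeroing the columns of the document-word matrix whose words are absent from the keyword dictionary. A minimal sketch follows, with a hypothetical vocabulary, matrix, and dictionary:

```python
def select_keywords(vocab, matrix, keyword_dict):
    """Set the weight of every word not found in the keyword dictionary to 0,
    turning the document-word matrix X into the document-keyword matrix X1."""
    keep = [w in keyword_dict for w in vocab]
    return [[tf if k else 0 for tf, k in zip(row, keep)] for row in matrix]

# Hypothetical data: "free" is not in the company keyword dictionary.
vocab = ["500", "container", "free", "ml"]
X = [[1, 1, 1, 1], [0, 1, 0, 2]]
keyword_dict = {"500": 2, "container": 3, "ml": 2}
X1 = select_keywords(vocab, X, keyword_dict)
```

The matrix keeps its shape, so later steps can index X1 with the same vocabulary as X.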

텍스트마이닝 모듈(130)은 다음으로 연관 규칙 분석을 이용하여 문서 안에서 연관된 단어들로 가중치를 부여할 수 있다. 먼저 사용자가 특정 support 값을 정의하여 연관 분석 알고리즘을 수행할 수 있다. 그 결과 최소 support를 만족하는 복수개의 itemset을 획득할 수 있다. 그 후 각각의 itemset을 구성하는 키워드에 대해 가중치를 부여할 수 있다. 이렇게 함으로써 선택된 키워드에 가중치를 부여하여 더욱 더 실제 의미를 반영하여 상품에 대한 분석이 수행될 수 있다. 본 발명의 실시예에서는 연관 분석 가중치(A)가 해당 키워드의 itemset frequency로 정의되었다. The text mining module 130 may then weight the associated words in the document using association rule analysis. First, the user can define a specific support value and execute the association analysis algorithm. As a result, a plurality of itemsets satisfying the minimum support can be obtained. Then weights can be given to the keywords that make up each itemset. By doing so, weights are given to the selected keywords, and the analysis of the products can be performed more realistically. In the embodiment of the present invention, the association analysis weight (A) is defined as the itemset frequency of the keyword.

Next, the text mining module 130 may reflect the association analysis weight in the TF weight. Here, TF (term frequency) may mean the number of times a specific word occurs in each product. The larger the TF value, the more important the word can be judged to be. The association analysis weight is likewise a value that marks a word as important among many words, so, like TF, it can be said to be proportional to the importance of the word. In actual product data, TF values tend to be Boolean, which can weaken the influence of TF. In such a data environment, association analysis reinforces the TF value, so weights can be assigned at a finer granularity. Accordingly, the TF weight reflecting the association analysis can be calculated as in Equation (4) below.

<Equation 5>

In Equation (5), A may denote the association analysis weight.

<Equation 6>

In Equation (6), the resulting value is the TF in which the association analysis result is reflected, and a logarithm may be taken for scale adjustment. Accordingly, a document-keyword matrix (X2) reflecting the association analysis result can be generated from the document-keyword matrix (X1).
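
The text states only that the association weight A is reflected in TF and that a logarithm is taken for scaling; the exact expressions of Equations (5) and (6) are not reproduced here. The sketch below therefore assumes one possible form, log(1 + TF · (1 + A)), chosen so that a keyword with no association weight still keeps its TF contribution. Both the formula and the name `association_tf` are assumptions, not the patented equations.

```python
import math


def association_tf(tf, a):
    """Assumed form only: log(1 + TF * (1 + A)).
    TF = 0 always yields 0, and a larger A raises the weight."""
    return math.log(1 + tf * (1 + a))


# Hypothetical X1 (Boolean TF values) and per-column association weights A.
X1 = [
    [1, 1, 0, 1],
    [1, 1, 0, 0],
]
A = [1, 1, 0, 0]

X2 = [[association_tf(tf, a) for tf, a in zip(row, A)] for row in X1]
print(X2)
```

Under this assumption, two keywords that both have TF = 1 no longer receive the same weight once one of them belongs to a frequent itemset, which is the finer granularity the paragraph above describes.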

Finally, the TF-IDF weight reflecting the association analysis can be calculated. Here, DF (document frequency) means the number of documents in which a specific word appears, and the inverse of DF can be defined as IDF (inverse document frequency). The larger the DF value, the more common the word can be judged to be, so its importance decreases. In other words, the larger the IDF value, the more the corresponding word can be regarded as important. Equation (7) below gives the IDF.

<Equation 7>
$$\mathrm{IDF}_t = \log\frac{N}{df_t}$$

In Equation (7), N is the number of documents of the company under analysis, that is, the number of existing products, and df_t (the DF of the word t) is the number of documents that contain the word t. A logarithm may also be taken of the IDF for scale adjustment. Consequently, the TF-IDF weight reflecting the association analysis assigns a high value to a word that appears often in a particular product but is not common across all products, and it also assigns a high value to the outcome of the association analysis, that is, to highly associated words that are mentioned together in many products. The TF-IDF weight reflecting the association analysis, ATI (association rule TF-IDF), can therefore satisfy the equation below.

<Equation 8>

Using Equation (8), a document-keyword matrix (X3) carrying the ATI weights can be generated from the document-keyword matrix (X2). Through this process, the keywords of each product are extracted and a weight is assigned to each keyword. This is how unstructured data becomes structured, and the resulting values can be used to measure the similarity between products.
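
The step from X2 to X3 can be sketched as below. The sketch assumes the usual log(N / df) form for IDF and assumes that ATI is the product of the association-reflected TF and the IDF, as in ordinary TF-IDF; the exact form of Equation (8) is not reproduced in the text. The function names and sample matrix are hypothetical.

```python
import math


def idf_vector(x2):
    """IDF per keyword column: log(N / df). A keyword that occurs in no
    document gets 0.0 here to avoid division by zero (an assumption)."""
    n_docs = len(x2)
    idf = []
    for column in zip(*x2):
        df = sum(1 for weight in column if weight > 0)
        idf.append(math.log(n_docs / df) if df else 0.0)
    return idf


def ati_matrix(x2):
    """X3: each association-reflected TF weight times its keyword's IDF
    (assumed product form)."""
    idf = idf_vector(x2)
    return [[weight * i for weight, i in zip(row, idf)] for row in x2]


# Hypothetical X2: four existing products (rows), three keywords (columns).
X2 = [
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 2.0],
    [1.0, 1.0, 0.0],
]
X3 = ati_matrix(X2)
print(X3)
```

The first keyword appears in every product, so its IDF is 0 and it contributes nothing to X3, whereas the third keyword, which appears in a single product, receives the largest weight.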

The similarity measurement module 140 may measure the similarity between an order product and the existing products. Cosine similarity, one of the similarity measures of the vector space model, may be used for this. The reason is that cosine similarity can measure distance in a space of many dimensions as long as the condition of a positive space is satisfied. Given the vectors of attributes q and d, the cosine similarity can be expressed in terms of the scalar product and the magnitudes of the vectors, as in Equation (9) below.

<Equation 9>
$$\cos(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert} = \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2}\,\sqrt{\sum_i d_i^2}}$$

In Equation (9), the vectors q and d represent the order product and an existing product, respectively, and their components are the keyword weights in the corresponding document. That is, q_i is the weight that the keyword in column i of the document-keyword matrix (X3) has in that product. Cosine similarity can also be seen as one way of normalizing document length when documents are compared. Because a keyword weight cannot be negative, the similarity calculated in this way takes a value from 0 to 1, and the angle between the two keyword weight vectors cannot exceed 90 degrees. A value of 0 means the two are independent of each other, and 1 means they are exactly the same.
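
A minimal sketch of the cosine similarity between two rows of X3 follows. The function name and the sample vectors are hypothetical, and returning 0.0 for an all-zero vector is a convention chosen here, not something the text specifies.

```python
import math


def cosine_similarity(q, d):
    """Scalar product of q and d divided by the product of their magnitudes."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # assumed convention for a product with no keywords
    return dot / (norm_q * norm_d)


# Hypothetical keyword weight vectors.
same_direction = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
no_overlap = cosine_similarity([1.0, 0.0], [0.0, 3.0])
partial = cosine_similarity([1.0, 1.0, 0.0], [1.0, 0.0, 1.0])
print(same_direction, no_overlap, partial)
```

The first pair differs only in scale, so its similarity is 1 up to rounding, which illustrates the length normalization mentioned above. The second pair shares no keyword and scores 0, and the third pair falls in between.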

Table 1 below is an example of keyword weights and their normalized values.

<Table 1>

In Table 1, the keyword weights of the order product and of the existing product are each normalized, and the inner product of the two vectors, whose components are the individual words, can then be calculated as in Equation (10).

<Equation 10>

The resulting value of 0.8 in Equation (10) can be taken as the similarity of the two products. By structuring the order product and the existing products in this way, the similarity can be measured and the products the user is likely to want can be recommended in descending order of similarity.
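
The normalize-then-multiply route can be sketched as follows. The weights are hypothetical values picked so that the result is also about 0.8; they are not the values of Table 1, which is not reproduced in the text.

```python
import math


def normalize(weights):
    """Scale a weight vector to unit length; an all-zero vector is
    returned unchanged."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else list(weights)


def inner_product(u, v):
    """Sum of component-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical keyword weights (not taken from Table 1).
order_product = [2.0, 0.0]
existing_product = [4.0, 3.0]

similarity = inner_product(normalize(order_product), normalize(existing_product))
print(similarity)
```

Normalizing each vector first means the inner product alone gives the cosine similarity, so normalized weights of existing products can be stored once and reused for every new order product.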

FIGS. 3 and 4 are conceptual diagrams showing similarity measurement results according to an embodiment of the present invention.

FIG. 3 shows a case in which the keywords of the order product and the existing product are all identical. The keyword string of the order product is '차단제500스프레이 500ml 용기' ("Blocker 500 Spray, 500 ml container"), and that of the existing product is the same string. The common keyword frequency of the order product and the existing product is therefore 6, and the similarity may have a value of 1.

FIG. 4 shows a case in which the keywords of the order product and the existing product differ. The keyword string of the order product may be '아토세이프 청소용품 모음전, 선택5)코코넛 만능 클리너 420g*1개' ("Atosafe cleaning supplies collection, option 5) coconut all-purpose cleaner 420g*1"). That of the existing product may be '코코넛만능크리너420' ("coconut all-purpose cleaner 420"). In this case, the common keyword frequency of the order product and the existing product is 3, and the similarity may have a value of 0.4617...

Because the process of the apparatus performing the text-mining-based automatic product mapping method computes a numerical value for each product, the products can be sorted before being shown to the user. Since the products in a search result can be ranked, the method can serve as a recommendation system: it gives the user a more intuitive view of the search results, which makes the system more convenient, and it lets the user check results quickly, which makes for a smoother working environment. Extended further to automatic product mapping, it lets the user map order products for many products automatically with a simple command. Instead of the conventional approach of repeatedly searching one space-separated word at a time, the user enters the whole product name in the search box; the system then analyzes the product automatically and shows the products the user is likely to want, so the work can be done faster.

FIG. 5 is a flowchart of a text-mining-based automatic product mapping method according to an embodiment of the present invention.

Referring to FIG. 5, the automatic product mapping apparatus may receive technical document data for a plurality of products from a database (step S500).

As described above, the document collection module may receive the technical document data for the plurality of products.

The automatic product mapping apparatus may extract a plurality of words from the technical document data (step S510).

As described above, the word extraction module may standardize the technical document data and perform morphological analysis on its sentences to extract the words.

The automatic product mapping apparatus may generate the company's keyword dictionary based on the extracted words (step S520).

The keyword dictionary is generated based on the associations among the plurality of words, and the association between words may be determined from the lift value between a first word and a second word among them. When one item of technical document data is treated as one transaction, the lift value may be determined from the probability that the first word and the second word are both present in the technical document data, the probability that the first word is present in the technical document data, and the probability that the second word is present in the technical document data.
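
The lift value between two words can be sketched as below, using the standard definition lift(X, Y) = Supp(X∪Y) / (Supp(X) · Supp(Y)) with each document treated as one transaction. The function name, the handling of words that never occur, and the sample transactions are assumptions made for illustration.

```python
def lift(transactions, x, y):
    """Lift of words x and y over a list of transactions (sets of words).
    Returns 0.0 if either word never occurs (assumed convention)."""
    n = len(transactions)
    supp_x = sum(1 for t in transactions if x in t) / n
    supp_y = sum(1 for t in transactions if y in t) / n
    supp_xy = sum(1 for t in transactions if x in t and y in t) / n
    if supp_x == 0 or supp_y == 0:
        return 0.0
    return supp_xy / (supp_x * supp_y)


# Hypothetical transactions: one set of extracted words per document.
transactions = [
    {"coconut", "cleaner"},
    {"coconut", "cleaner", "420g"},
    {"spray", "500ml"},
    {"spray", "cleaner"},
]
print(lift(transactions, "coconut", "cleaner"))
print(lift(transactions, "coconut", "spray"))
```

A lift above 1 means the two words co-occur more often than independence would predict, which makes the pair a candidate for the keyword dictionary. A lift of 0 means they never appear together.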

The automatic product mapping apparatus may calculate, based on the keyword dictionary, a TF (term frequency)-IDF (inverse document frequency) weight reflecting the association analysis for each of the order product and the existing products (step S530).

The TF-IDF weight reflecting the association analysis may be determined by the following procedure: a first document-keyword matrix is generated for each of the order product and the existing products, with unnecessary keywords filtered out on the basis of the keyword dictionary; itemsets are extracted from the first document-keyword matrix; a second document-keyword matrix reflecting the association analysis is generated by applying, to the keywords that make up each itemset, a weight and the frequency of the itemset; and a third document-keyword matrix containing the TF-IDF weight reflecting the association analysis is generated from the second document-keyword matrix.

The automatic product mapping apparatus may calculate the cosine similarity of the order product and the existing products based on the TF-IDF weight (step S540).

The cosine similarity may be calculated using Equation (9) described above.

The automatic product mapping apparatus may generate a search result based on the cosine similarity (step S550).

Information on existing products whose cosine similarity is equal to or greater than a certain value may be provided as the search result.
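
Generating the search result from the similarities can be sketched as follows. The threshold value, the product identifiers, and the function name are hypothetical; sorting in descending order follows the ranking described earlier in the description.

```python
def search_results(similarities, threshold):
    """Keep existing products whose similarity meets the threshold and
    return them as (product, similarity) pairs, most similar first."""
    hits = [(name, s) for name, s in similarities.items() if s >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical cosine similarities between one order product and
# three existing products.
similarities = {"product-A": 0.46, "product-B": 1.0, "product-C": 0.10}
results = search_results(similarities, threshold=0.4)
print(results)
```

Lowering the threshold widens the candidate list at the cost of showing weaker matches, so the threshold would in practice be tuned per company.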

The text-mining-based automatic product mapping method described above may be implemented as an application, or in the form of program instructions executable by various computer components, and recorded on a computer-readable recording medium. The computer-readable recording medium may contain program instructions, data files, data structures, and the like, alone or in combination.

The program instructions recorded on the computer-readable recording medium may be specially designed and configured for the present invention, or may be known to and usable by those skilled in the field of computer software.

Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that a computer can execute using an interpreter or the like. The hardware device may be configured to operate as one or more software modules to perform the processing according to the present invention, and vice versa.

Although the invention has been described above with reference to embodiments, those skilled in the art will understand that the present invention can be modified and changed in various ways without departing from the spirit and scope of the invention set forth in the claims below.

Claims

An automatic product mapping method based on text mining, the method comprising:
receiving, by an automatic product mapping apparatus, technical document data for a plurality of products from a database;
extracting, by the automatic product mapping apparatus, a plurality of words from the technical document data;
generating, by the automatic product mapping apparatus, a keyword dictionary of a company based on the extracted plurality of words;
calculating, by the automatic product mapping apparatus, a TF (term frequency)-IDF (inverse document frequency) weight reflecting association analysis for each of an order product and an existing product based on the keyword dictionary;
calculating, by the automatic product mapping apparatus, a cosine similarity of the order product and the existing product based on the TF-IDF weight; and
generating, by the automatic product mapping apparatus, a search result based on the cosine similarity,
wherein the keyword dictionary is generated through morphological analysis of the technical document data of the plurality of products related to a specific company, and the association analysis is performed, when the technical document data is one transaction unit, based on a probability that a first word and a second word are both present in the technical document data, a probability that the first word is present in the technical document data, and a probability that the second word is present in the technical document data,
wherein the TF-IDF weight reflecting the association analysis is determined by:
generating a first document-keyword matrix for each of the order product and the existing product, with unnecessary keywords filtered out based on the keyword dictionary;
extracting itemsets from the first document-keyword matrix and generating, using Equation 2, a second document-keyword matrix reflecting the association analysis by applying a weight to the keywords constituting each itemset and the frequency of the itemset; and
generating, based on the second document-keyword matrix, a third document-keyword matrix including the TF-IDF weight reflecting the association analysis, using Equation 1 below.
<Equation 1>

delete

The method according to claim 1,
wherein the lift value is determined based on the following equation,
<Equation>
$$\mathrm{lift}(X, Y) = \frac{\mathrm{Supp}(X \cup Y)}{\mathrm{Supp}(X)\,\mathrm{Supp}(Y)}$$

where Supp(X∪Y) is the probability that the first word and the second word are both present in the technical document data, Supp(X) is the probability that the first word is present in the technical document data, and Supp(Y) is the probability that the second word is present in the technical document data.

delete

An automatic product mapping apparatus based on text mining, the apparatus comprising:
a document collection module configured to receive technical document data for a plurality of products from a database;
a word extraction module configured to extract a plurality of words from the technical document data;
a data mining module configured to generate a keyword dictionary of a company based on the extracted plurality of words;
a text mining module configured to calculate a TF (term frequency)-IDF (inverse document frequency) weight reflecting association analysis for each of an order product and an existing product based on the keyword dictionary; and
a similarity measurement module configured to calculate a cosine similarity of the order product and the existing product based on the TF-IDF weight and to generate a search result,
wherein the keyword dictionary is generated through morphological analysis of the technical document data of the plurality of products related to a specific company, and the association analysis is performed, when the technical document data is one transaction unit, based on a probability that a first word and a second word are both present in the technical document data, a probability that the first word is present in the technical document data, and a probability that the second word is present in the technical document data,
wherein the TF-IDF weight reflecting the association analysis is determined by:
generating a first document-keyword matrix for each of the order product and the existing product, with unnecessary keywords filtered out based on the keyword dictionary;
extracting itemsets from the first document-keyword matrix and generating, using Equation 2, a second document-keyword matrix reflecting the association analysis by applying a weight to the keywords constituting each itemset and the frequency of the itemset; and
generating, based on the second document-keyword matrix, a third document-keyword matrix including the TF-IDF weight reflecting the association analysis, using Equation 1 below.
<Equation 1>

delete

The apparatus according to claim 6,
wherein the lift value is determined based on the following equation,
<Equation>

delete