KR101035038B1

KR101035038B1 - System and method for automatic generation of a classifier for large data using dynamic combination of classifiers

Info

Publication number: KR101035038B1
Application number: KR1020100099164A
Authority: KR
Inventors: 정도헌; 성원경; 정한민; 조민희; 홍순찬
Original assignee: 한국과학기술정보연구원
Priority date: 2010-10-12
Filing date: 2010-10-12
Publication date: 2011-05-19
Also published as: WO2012050252A1

Abstract

PURPOSE: A system and method for automatically generating a large-capacity classifier by dynamically combining classifiers are provided, so that an integrated classifier that has learned all the feature information of the databases to be combined can be generated freely by dynamically combining a plurality of such databases. CONSTITUTION: Databases (100a~100c) classify and store documents. Classifier generating devices (200a~200n) each generate an individual classifier. Each classifier generating device calculates a weight for each feature by obtaining the similarity between a category and each feature extracted from the training documents of its database. It then generates a feature characteristic matrix and term vectors that include the weights, producing an individual classifier that determines the category of documents to be classified. A classifier dynamic combining device (300) generates an integrated classifier that has learned all the feature information of the databases to be combined.

Description

System and method for automatic generation of a classifier for large data using dynamic combination of classifiers

The present invention relates to a system and method for automatically generating a large-capacity classifier by dynamically combining classifiers. More specifically, a classifier generating device extracts features from the training documents of each database, obtains the similarity between each extracted feature and each category, and calculates a weight for each feature. It then generates a feature characteristic matrix and term vectors that include these weights, producing an individual classifier that determines the category of newly collected documents to be classified. A classifier dynamic combining device lists the code information of the features that appear in a plurality of feature characteristic matrices to be combined, sums the category characteristic values of each feature to generate an integrated matrix, and uses the integrated matrix to freely generate an integrated classifier that has learned all the feature information of the databases to be combined.

When various academic information databases are built and served, automatic classification of individual documents and the application of an integrated classification and retrieval scheme become important technical elements for improving the service.

However, two problems must be solved before automatic categorization can be applied to a real service. First, a large-capacity classifier generation technique capable of learning from a large volume of documents is needed. Second, a stable, general-purpose technique must be developed that can be used more widely than mining techniques based on domain knowledge. In other words, what is needed is an automatic categorization technique that is based on large-scale learning, is applicable to large data environments, and can be used for general purposes.

In addition, applying automatic categorization to a real service sometimes requires processing several million or more information resources. This calls for techniques that select documents representative of a category, or that select features representative of a document and reduce the feature set, so it is important to develop a classifier that is relatively insensitive to the feature reduction technique. Here, a feature means a keyword or term.

Furthermore, an integrated information service must automatically classify multiple resources according to the standard subject classification scheme of the service. However, cross-database automatic classification between heterogeneous databases, such as academic papers and patents or academic papers and research reports, performs markedly worse, which makes it technically difficult to generalize a classifier in a large-capacity environment.

The present invention was made to solve the above problems. Its object is to provide a system and method for automatically generating a large-capacity classifier by dynamically combining classifiers, in which a plurality of databases to be combined are dynamically combined to generate a large-capacity classifier that can learn from a large volume of documents and can be applied generally to any database.

To achieve the above objects, one aspect of the present invention provides a system for automatically generating a large-capacity classifier by dynamically combining classifiers, the system comprising: databases in which a plurality of documents are classified and stored according to document characteristics; a classifier generating device that extracts features from the training documents of each database, obtains the similarity between each extracted feature and each category to calculate a weight for each feature, and then generates a feature characteristic matrix and term vectors that include the weights, thereby generating an individual classifier that determines the category of newly collected documents to be classified; and a classifier dynamic combining device that lists the code information of the features appearing in a plurality of feature characteristic matrices to be combined, sums the category characteristic values of each feature to generate an integrated matrix, and uses the integrated matrix to generate an integrated classifier that has learned all the feature information of the databases to be combined.

A classifier generating device is provided for each database.

The classifier generating device linearly combines the term vectors of all the features that make up the training documents and determines the category with the highest voting result as the category of the document to be classified.

The classifier dynamic combining device divides the databases to be combined into partitions of a predetermined size, generates an integrated matrix for each partition, and uses the integrated matrices to generate an integrated classifier that has learned all the feature information of the databases to be combined.

In addition, the classifier dynamic combining device repeatedly merges a generated integrated matrix with an individual feature characteristic matrix or with another integrated matrix to generate a single new integrated matrix.

According to another aspect of the present invention, a classifier generating device is provided, comprising: a feature extraction unit that extracts features from training documents; a weight calculation unit that obtains the similarity between each extracted feature and each category and uses the similarity to obtain a weight for each feature; a feature characteristic matrix generation unit that generates a feature characteristic matrix containing, for each feature of the training documents, the weight obtained by the weight calculation unit; a term vector generation unit that generates, for each feature of the training documents, a term vector carrying the weights; and a category determination unit that, for the features extracted from a newly collected document to be classified, combines the matching term vectors generated by the term vector generation unit, tallies them as votes, and determines the category with the maximum value as the final category of the document.

The feature extraction unit extracts features by at least one of the following methods: using the keyword field of a training document; extracting information from the unstructured text of a title or abstract based on a corpus dictionary; and natural language processing using stemming or morphological analysis.

The weight calculation unit obtains the similarity using at least one similarity coefficient among the cosine, Dice, Jaccard, and log odds ratio coefficients, or using any of various distance coefficients.
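As an illustration of the coefficients named above, the following minimal sketch computes them for one feature and one category from binary co-occurrence counts. The text does not give the formulas, so the standard binary-data definitions are assumed here. The function name `similarity_coefficients`, the count names `a`, `b`, `c`, `d` (feature present and in category, feature absent and in category, feature present and not in category, neither), and the `smoothing` constant added to avoid a zero in the log odds ratio are all hypothetical.

```python
import math


def similarity_coefficients(a, b, c, d, smoothing=0.5):
    """Similarity between one feature and one category from 2x2 counts.

    a: documents that contain the feature and belong to the category
    b: documents that lack the feature but belong to the category
    c: documents that contain the feature but are outside the category
    d: documents that lack the feature and are outside the category
    The formulas are the usual binary-data definitions (an assumption;
    the source text names the coefficients without defining them).
    """
    # With a == 0 the three overlap coefficients are 0 by definition,
    # which also avoids dividing by zero when b and c are 0 as well.
    cosine = a / math.sqrt((a + b) * (a + c)) if a else 0.0
    dice = 2 * a / (2 * a + b + c) if a else 0.0
    jaccard = a / (a + b + c) if a else 0.0
    # Additive smoothing keeps the ratio finite when any cell is 0.
    log_odds = math.log(
        ((a + smoothing) * (d + smoothing)) / ((b + smoothing) * (c + smoothing))
    )
    return {
        "cosine": cosine,
        "dice": dice,
        "jaccard": jaccard,
        "log_odds_ratio": log_odds,
    }
```

A positive log odds ratio indicates that the feature occurs more often inside the category than outside it; a negative value indicates the opposite.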

The weight calculation unit obtains the final weight of each feature ([symbol missing in source]) using an expression ([expression missing in source]) that adds the inverse document frequency (IDF) to the feature weight.

The term vector generation unit generates, for each feature, a term vector composed of "category, weight" pairs.

For the features that make up the newly collected document to be classified, the category determination unit looks up the term vectors learned and stored in the classifier, sums the weight values in those vectors for each subject category, and determines the category with the maximum value in this vote-style tally as the final category.
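The vote-style tally described above can be sketched as follows. This is a minimal illustration, assuming that term vectors are held in a dictionary from feature to a list of (category, weight) pairs; the names `determine_category` and `TERM_VECTORS` and the sample weights are hypothetical and not taken from the source.

```python
from collections import defaultdict


def determine_category(document_features, term_vectors):
    """Sum the stored weights per category and return the top category.

    document_features: iterable of features found in the new document
    term_vectors: dict mapping feature -> list of (category, weight) pairs
    Returns None when no feature of the document is known to the classifier.
    """
    totals = defaultdict(float)
    for feature in document_features:
        # Features that were never learned contribute nothing to the vote.
        for category, weight in term_vectors.get(feature, []):
            totals[category] += weight
    if not totals:
        return None
    return max(totals, key=totals.get)


# Hypothetical learned term vectors for three features.
TERM_VECTORS = {
    "classifier": [("computing", 0.9), ("statistics", 0.4)],
    "matrix": [("statistics", 0.7), ("computing", 0.3)],
    "enzyme": [("biology", 1.2)],
}
```

With these sample vectors, a document containing "classifier" and "matrix" accumulates roughly 1.2 for "computing" and 1.1 for "statistics", so "computing" is chosen.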

The feature characteristic matrix contains, for each feature, a document number, a category code, a weight, and category characteristic values. The category characteristic values comprise four frequencies: the number of cases in which the feature appears and the document belongs to the category; the number in which the feature does not appear but the document belongs to the category; the number in which the feature appears but the document does not belong to the category; and the number in which the feature does not appear and the document does not belong to the category.
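The four category characteristic values can be counted directly from labeled training documents. The sketch below assumes, for simplicity, that each document carries exactly one category and is represented as a (set of features, category) pair; the function name `category_characteristics` and the labels `a` to `d` are hypothetical.

```python
def category_characteristics(documents, feature, category):
    """Count the four frequencies for one (feature, category) pair.

    documents: list of (set_of_features, category) pairs
    Returns (a, b, c, d):
      a: feature appears, document is in the category
      b: feature does not appear, document is in the category
      c: feature appears, document is not in the category
      d: feature does not appear, document is not in the category
    """
    a = b = c = d = 0
    for features, doc_category in documents:
        has_feature = feature in features
        in_category = doc_category == category
        if has_feature and in_category:
            a += 1
        elif in_category:
            b += 1
        elif has_feature:
            c += 1
        else:
            d += 1
    return a, b, c, d
```

The four values always sum to the number of training documents, which gives a simple consistency check on a stored matrix row.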

According to yet another aspect of the present invention, a classifier dynamic combining device is provided, comprising: a feature extraction unit that extracts features from the feature characteristic matrices in a plurality of databases to be combined; an integrated matrix generation unit that lists the code information of the extracted features and sums the category characteristic values of each feature to generate an integrated matrix; a weight calculation unit that obtains the similarity between each feature in the integrated matrix and each category and uses the similarity to obtain a weight for each feature; and a term vector generation unit that generates, for each feature in the integrated matrix, a term vector carrying the weights, thereby generating an integrated classifier that has learned all the feature information of the databases to be combined.

The classifier dynamic combining device may further comprise a category determination unit that, for the features extracted from a newly collected document to be classified, combines the matching term vectors generated by the term vector generation unit, tallies them as votes, and determines the category with the maximum value as the final category of the document.

The classifier dynamic combining device may further comprise a capacity partitioning unit that divides the databases to be combined into partitions of a predetermined size.

The integrated matrix generation unit generates the integrated matrices for the partitions produced by the capacity partitioning unit simultaneously, in parallel.

The classifier dynamic combining device may further comprise an integrated matrix generation management unit that repeatedly merges, in a pyramid fashion, an integrated matrix generated by the integrated matrix generation unit with an individual feature characteristic matrix or with another integrated matrix to generate a new integrated matrix.

The integrated matrix generation unit builds the complete set of features that appear in the feature characteristic matrices and sums the category characteristic values of each feature to generate the integrated matrix.

In addition, for each feature, the integrated matrix generation unit automatically computes any category missing from the full category list and dynamically creates a field for that category in memory. It then sums the category characteristic values over all categories for each feature to generate the integrated matrix and stores the result in a database.

In addition, the integrated matrix generation unit builds a list of the unique features extracted by the feature extraction unit and builds a category code list by extracting all category codes from the feature characteristic matrices to be combined. Then, where an individual table holds information on a given feature, it extracts the category characteristic values for every category code, calculating and creating the values for any category code that is not present.

In addition, where an individual table does not contain a given feature, the integrated matrix generation unit creates that feature and generates category characteristic values for every category code in the category code list.
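The merging procedure described above can be sketched as follows. The text does not state how the values for a missing category code or a missing feature are calculated, so this sketch makes explicit assumptions: each table records its total document count (`n_docs`) and per-category document counts (`category_sizes`), and a (feature, category) row is absent from a table only when the feature never co-occurs with that category there. Under those assumptions the missing row is (0, category size, document frequency of the feature, remainder). The names `fill_row`, `merge_tables`, and the table layout are hypothetical.

```python
def fill_row(table, feature, category):
    """Return (a, b, c, d) for one table, computing the row if it is absent."""
    rows = table["rows"].get(feature, {})
    if category in rows:
        return rows[category]
    # Document frequency of the feature in this table: a + c of any stored
    # row for the feature, or 0 when the table does not contain the feature.
    df = 0
    if rows:
        a, _, c, _ = next(iter(rows.values()))
        df = a + c
    n_cat = table["category_sizes"].get(category, 0)
    # Assumed: an absent row means the pair never co-occurs, so a = 0.
    return (0, n_cat, df, table["n_docs"] - n_cat - df)


def merge_tables(tables):
    """Sum category characteristic values over all features and categories."""
    features = sorted({f for t in tables for f in t["rows"]})
    categories = sorted({k for t in tables for k in t["category_sizes"]})
    merged = {}
    for feature in features:
        merged[feature] = {}
        for category in categories:
            parts = [fill_row(t, feature, category) for t in tables]
            # Element-wise sum of the (a, b, c, d) tuples.
            merged[feature][category] = tuple(sum(col) for col in zip(*parts))
    return merged


# Two hypothetical tables: A holds 4 documents, B holds 3.
TABLE_A = {
    "n_docs": 4,
    "category_sizes": {"bio": 2, "math": 2},
    "rows": {"gene": {"bio": (2, 0, 0, 2)}, "matrix": {"math": (2, 0, 0, 2)}},
}
TABLE_B = {
    "n_docs": 3,
    "category_sizes": {"math": 1, "chem": 2},
    "rows": {
        "matrix": {"math": (1, 0, 1, 1), "chem": (1, 1, 1, 0)},
        "acid": {"chem": (2, 0, 0, 1)},
    },
}
```

In the merged result every row sums to 7, the combined number of documents, even for pairs such as ("gene", "chem") that appear in neither input table.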

According to yet another aspect of the present invention, a method for automatically generating a large-capacity classifier by dynamically combining classifiers is provided, comprising: (a) extracting, by each of a plurality of classifier generating devices, features from the training documents of its database; (b) obtaining, by each classifier generating device, the similarity between each extracted feature and each category and calculating a weight for each feature; (c) generating, by each classifier generating device, a feature characteristic matrix and term vectors that include the calculated weights for each feature of the training documents, thereby generating an individual classifier that determines the category of newly collected documents to be classified; (d) listing, by a classifier dynamic combining device, the code information of the features appearing in a plurality of feature characteristic matrices to be combined and summing the category characteristic values of each feature to generate an integrated matrix; and (e) generating, by the classifier dynamic combining device, an integrated classifier that has learned all the feature information of the databases to be combined using the generated integrated matrix.

Step (c) comprises: generating a feature characteristic matrix that includes the calculated weight for each feature of the training documents; generating, for each feature of the training documents, a term vector carrying the weights; and generating an individual classifier that, for the features extracted from a newly collected document to be classified, combines the matching term vectors, tallies them as votes, and determines the category with the maximum value as the final category of the document.

According to yet another aspect of the present invention, a method is provided in which a classifier dynamic combining device automatically generates a large-capacity classifier by dynamically combining classifiers, the method comprising: (a) extracting features from the feature characteristic matrices in the databases to be combined; (b) listing the code information of the extracted features and summing the category characteristic values of each feature to generate an integrated matrix; (c) obtaining the similarity between each feature in the integrated matrix and each category and using the similarity to obtain a weight for each feature; and (d) generating, for each feature in the integrated matrix, a term vector that includes the weights, thereby generating an integrated classifier that has learned all the feature information of the databases to be combined.

The method may further comprise, for the features extracted from a newly collected document to be classified, combining the matching term vectors generated by the term vector generation unit, tallying them as votes, and determining the category with the maximum value as the final category of the document.

The method may further comprise, after step (b), repeatedly merging, in a pyramid fashion, the generated integrated matrix with an individual feature characteristic matrix or with another integrated matrix to generate a new, very large integrated matrix.

In step (b), the complete set of features that appear in the feature characteristic matrices is built and the category characteristic values of each feature are summed to generate the integrated matrix. For each feature, any category missing from the full category list is computed automatically and a field for that category is created dynamically in memory; the category characteristic values over all categories are then summed for each feature to generate the integrated matrix, and the result is stored in a database.

Also in step (b), a list of the unique features extracted from the feature characteristic matrices is built, and a category code list is built by extracting all category codes. Where an individual table holds information on a given feature, the category characteristic values for every category code are extracted, with the values for any category code that is not present being calculated and created. Where an individual table does not contain a given feature, the feature is created and category characteristic values are generated for every category code in the category code list.

According to yet another aspect of the present invention, a method is provided in which a classifier dynamic combining device automatically generates a large-capacity classifier by dynamically combining classifiers, the method comprising: (a) dividing the databases to be combined into partitions of a predetermined size; (b) extracting features from the feature characteristic matrices in each partition; (c) listing the code information of the extracted features and summing the category characteristic values of each feature to generate an integrated matrix; (d) obtaining the similarity between each feature in the integrated matrix and each category and using the similarity to obtain a weight for each feature; and (e) generating, for each feature in the integrated matrix, a term vector that includes the weights, thereby generating an integrated classifier that has learned all the feature information of the databases to be combined.

As described above, according to the present invention, a plurality of databases to be combined are dynamically combined to generate a large-capacity classifier that can learn from a large volume of documents, so the classifier can be applied generally to any database.

In addition, because multiple feature characteristic matrices are generated and freely composed dynamically, there is no numerical difference at all between generating one large matrix directly and generating many small matrices and combining them dynamically.
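The claim of numerical equivalence follows from the fact that the category characteristic values are plain counts, and counts over disjoint sets of documents add up. The following minimal sketch checks this on toy data, assuming one category per document and a feature list and category list shared by all partitions; `count_cells`, `DOCS`, `FEATURES`, and `CATEGORIES` are hypothetical names and data.

```python
from collections import Counter


def count_cells(documents, features, categories):
    """Count the four category characteristic cells for every pair.

    Keys are (feature, category, cell) with cell in {"a", "b", "c", "d"}:
    a = feature present and in category, b = absent and in category,
    c = present and not in category, d = absent and not in category.
    """
    counts = Counter()
    for doc_features, doc_category in documents:
        for feature in features:
            for category in categories:
                present = feature in doc_features
                member = category == doc_category
                if present and member:
                    cell = "a"
                elif member:
                    cell = "b"
                elif present:
                    cell = "c"
                else:
                    cell = "d"
                counts[(feature, category, cell)] += 1
    return counts


DOCS = [
    ({"gene"}, "bio"),
    ({"gene", "matrix"}, "math"),
    ({"matrix"}, "math"),
    ({"protein"}, "bio"),
    ({"gene"}, "bio"),
]
FEATURES = ["gene", "matrix", "protein"]
CATEGORIES = ["bio", "math"]

# One large matrix versus two small matrices combined by addition.
whole = count_cells(DOCS, FEATURES, CATEGORIES)
combined = (count_cells(DOCS[:2], FEATURES, CATEGORIES)
            + count_cells(DOCS[2:], FEATURES, CATEGORIES))
```

Here `whole` and `combined` hold identical counts, so any weight derived from them is identical as well.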

FIG. 1 is a diagram showing a system for automatically generating a large-capacity classifier by dynamically combining classifiers according to an embodiment of the present invention.
FIG. 2 is a block diagram schematically showing the configuration of a classifier generating device according to an embodiment of the present invention.
FIG. 3 is a block diagram schematically showing the configuration of a classifier dynamic combining device according to the present invention.
FIG. 4 is a flowchart showing a method for automatically generating a large-capacity classifier by dynamically combining classifiers according to an embodiment of the present invention.
FIGS. 5 and 6 are flowcharts showing a method by which a classifier dynamic combining device according to an embodiment of the present invention automatically generates a large-capacity classifier by dynamically combining individual classifiers.

The above objects and technical configuration of the present invention, and the resulting operational effects, will be understood more clearly from the following detailed description, which refers to the drawings attached to this specification.

In the following description, a feature means a keyword or term.

FIG. 1 is a diagram showing a system for automatically generating a large-capacity classifier by dynamically combining classifiers according to an embodiment of the present invention.

Referring to FIG. 1, the system for automatically generating a large-capacity classifier by dynamically combining classifiers comprises a plurality of databases (100a, 100b, ..., 100n; hereinafter, 100) in which a plurality of documents are classified and stored according to document characteristics, classifier generating devices (200a, 200b, ..., 200n; hereinafter, 200) provided one per database, and a classifier dynamic combining device (300).

The database (100) is a storage space in which a plurality of documents are classified and stored according to document characteristics, including the document classification scheme and term attributes. Examples include the GTB database, which stores science and technology trends; the SOC database, which stores domestic academic papers; the NDS database, which stores overseas academic papers; and the GNS database, which integrates the three databases above.

The classifier generating device (200) extracts features from the training documents of each database (100), obtains the similarity between each extracted feature and each category to calculate a weight for each feature, and then generates a feature characteristic matrix and term vectors that include the weights, thereby generating, for each database (100), an individual classifier that determines the category of newly collected documents to be classified. The classifier generating device (200) linearly combines the term vectors of all the features that make up the training documents and determines the category with the highest voting result as the category of the document to be classified. Each term vector is composed of "category, weight" pairs for its feature.

The classifier generating device (200) that performs the above role is described in detail with reference to FIG. 2.

The classifier dynamic combining device (300) lists the code information of the features appearing in a plurality of feature characteristic matrices to be combined, sums the category characteristic values of each feature to generate an integrated matrix, and then uses the integrated matrix to generate an integrated classifier that has learned all the feature information of the databases to be combined. The generated integrated classifier may be a large-capacity classifier.

In addition, the classifier dynamic combining device (300) divides the databases to be combined into partitions of a predetermined size, generates an integrated matrix for each partition, and uses the integrated matrices to generate an integrated classifier that has learned all the feature information of the databases to be combined. The integrated matrices for the partitions are generated simultaneously, in parallel.
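The partition-then-build-in-parallel step might be sketched as below. This is an illustration only: the matrix is simplified to (feature, category) co-occurrence counts, and a thread pool from the Python standard library stands in for whatever parallel facility an actual system would use (in CPython, a process pool or separate machines would typically be needed for real CPU parallelism). The names `partition`, `build_matrix`, `build_all`, and `DOCS` are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_matrix(documents):
    """Count (feature, category) co-occurrences for one partition."""
    counts = Counter()
    for features, category in documents:
        for feature in features:
            counts[(feature, category)] += 1
    return counts


def build_all(documents, size, workers=4):
    """Build one matrix per partition, submitting the partitions to a pool."""
    parts = partition(documents, size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of the partitions in its results.
        return list(pool.map(build_matrix, parts))


DOCS = [
    ({"gene"}, "bio"),
    ({"gene", "matrix"}, "math"),
    ({"matrix"}, "math"),
    ({"protein"}, "bio"),
    ({"gene"}, "bio"),
]
```

Because each partition is processed independently, the per-partition matrices can later be summed in any order to obtain the matrix for the whole collection.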

In addition, the classifier dynamic combining device (300) repeatedly merges a generated integrated matrix with an individual feature characteristic matrix or with another integrated matrix to generate a new, larger integrated matrix. That is, just as it generates an integrated matrix from various combinations of individual feature characteristic matrices, the device can combine an integrated matrix with individual feature characteristic matrices or other integrated matrices in various ways to generate a new, very large integrated matrix.
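One way to picture the repeated merging is a level-by-level pairwise reduction, where matrices are merged two at a time until a single matrix remains. The sketch below assumes that a matrix can be represented as a `Counter` of (feature, category) counts and that merging is element-wise addition; `merge_pair` and `pyramid_merge` are hypothetical names, and the source does not prescribe this particular merge order.

```python
from collections import Counter


def merge_pair(left, right):
    """Merge two count matrices by adding the counts key by key."""
    return left + right


def pyramid_merge(matrices):
    """Merge matrices pairwise, level by level, until one remains."""
    if not matrices:
        return Counter()
    level = list(matrices)
    while len(level) > 1:
        next_level = []
        # Merge neighbours (0, 1), (2, 3), ... on the current level.
        for i in range(0, len(level) - 1, 2):
            next_level.append(merge_pair(level[i], level[i + 1]))
        # An odd matrix out is carried up to the next level unchanged.
        if len(level) % 2:
            next_level.append(level[-1])
        level = next_level
    return level[0]
```

Since addition of counts is associative and commutative, the result does not depend on the order in which matrices are paired, which is what allows individual and integrated matrices to be mixed freely.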

The classifier dynamic combining device 300 described above performs the dynamic combination of classifiers by combining the feature property matrices generated by the classifier generating devices 200.

In addition, when there are many documents to learn from, the classifier dynamic combining device 300 divides the databases to be combined into partitions of a suitable size and then dynamically combines the results to regenerate one very large integrated matrix. In doing so, there is no need to shuffle the training documents randomly or to consider feature reduction.

The classifier dynamic combining device 300 that performs these functions is described in detail with reference to FIG. 3.

FIG. 2 is a block diagram schematically illustrating the configuration of a classifier generating device according to an embodiment of the present invention.

Referring to FIG. 2, the classifier generating device 200 includes a feature extraction unit 210, a weight calculation unit 220, a feature property matrix generation unit 225, a term vector generation unit 230, a category determination unit 240, and a storage unit 250.

The feature extraction unit 210 extracts features from the training documents. Specifically, the feature extraction unit 210 extracts features by at least one of the following: using the keyword field of a training document; extracting information from the unstructured text of the title or abstract based on a corpus dictionary; or applying natural language processing such as stemming or morphological analysis.
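The first of the extraction options above, reading features from a keyword field, can be sketched as follows. This is only an illustration: the record layout, the field name `keywords`, and the `;` separator are assumptions and are not specified in the description.

```python
def extract_features(record):
    """Split a hypothetical ';'-separated 'keywords' field into features.

    The field name and separator are illustrative assumptions.
    """
    raw = record.get("keywords", "")
    # Normalize case and whitespace, drop empty entries and duplicates.
    return sorted({kw.strip().lower() for kw in raw.split(";") if kw.strip()})


features = extract_features({"keywords": "Gene; protein ;gene;"})
```

Returning a sorted list of unique, lower-cased keywords keeps the later counting steps deterministic; a real extractor based on a corpus dictionary or a morphological analyzer would replace this function.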

The weight calculation unit 220 obtains the similarity between each feature extracted by the feature extraction unit 210 and each category, and uses that similarity to obtain a weight for each feature. Here, the weight of a feature expresses the degree of association between the feature (a keyword) and a category (a subject field).

The weight calculation unit 220 obtains the similarity using a similarity coefficient such as the cosine, Dice, or Jaccard coefficient or the log odds ratio, or using any of various distance coefficients.

For example, let the set of n word features appearing in the training documents and the set of m candidate categories be F = {f₁, f₂, f₃, ..., f_n} and C = {c₁, c₂, c₃, ..., c_m}, respectively, and let vs(f_i, c_j) denote the weight that feature f_i has for category c_j.

The relationship between a feature f, which corresponds to a keyword, and a category c, which denotes the subject field to which the keyword belongs, is shown in Table 1.

Table 1
                              Belongs to category c_j    Does not belong to category c_j
Feature f_i appears           TP                         TN
Feature f_i does not appear   FP                         FN

To assign a weight to each feature, the weight calculation unit 220 obtains the cosine similarity coefficient cos(f_i, c_j) using Equation 1, and then obtains the per-feature weight vs(f_i, c_j) using Equation 2, which takes the cosine similarity coefficient as input.

Here, TP is the number of cases in which feature f_i appears and the document belongs to category c_j; FP is the number of cases in which feature f_i does not appear but the document belongs to category c_j; TN is the number of cases in which feature f_i appears but the document does not belong to category c_j; and FN is the number of cases in which feature f_i does not appear and the document does not belong to category c_j.
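The four counts defined above can be tallied directly from labeled training documents. The sketch below is illustrative only: the `Counts` container, the `contingency` function, and the representation of a document as a (feature set, category) pair are assumptions. Note that TP, TN, FP, and FN follow the labeling of Table 1, which differs from the usual confusion-matrix convention.

```python
from collections import namedtuple

# Hypothetical container for the four counts, using Table 1's labeling.
Counts = namedtuple("Counts", "tp tn fp fn")


def contingency(docs, feature, category):
    """docs: list of (set_of_features, category_label) pairs (assumed layout)."""
    tp = tn = fp = fn = 0
    for features, label in docs:
        present = feature in features
        member = label == category
        if present and member:
            tp += 1  # feature appears, document belongs to the category
        elif present:
            tn += 1  # feature appears, document does not belong
        elif member:
            fp += 1  # feature absent, document belongs to the category
        else:
            fn += 1  # feature absent, document does not belong
    return Counts(tp, tn, fp, fn)


docs = [
    ({"gene", "protein"}, "bio"),
    ({"gene", "reaction"}, "chem"),
    ({"protein"}, "bio"),
    ({"matrix"}, "math"),
]
counts = contingency(docs, "gene", "bio")
```

For any feature and category the four counts sum to the number of documents, which is what later makes it possible to derive missing entries when matrices are combined.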

Here, f_i is a feature, c_j is a category, tf is the term frequency, df is the document frequency, and cos(f_i, c_j) is the cosine similarity coefficient.

That is, the weight calculation unit 220 obtains the final per-feature weight using Equation 2, which adds the inverse document frequency (IDF) to the feature weight.
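Equations 1 and 2 are not reproduced in this text, so the following is only a sketch under stated assumptions. `cosine_coefficient` uses the standard cosine coefficient for two binary attributes, written with the Table 1 counts. `feature_weight` assumes the IDF term is log(N / df) multiplied onto the cosine coefficient; the exact form of Equation 2, including how tf enters, may differ.

```python
import math


def cosine_coefficient(tp, tn, fp):
    """Standard cosine coefficient between 'feature appears' and 'in category'.

    With Table 1's labels: TP + TN documents contain the feature and
    TP + FP documents belong to the category.
    """
    denom = math.sqrt((tp + tn) * (tp + fp))
    return tp / denom if denom else 0.0


def feature_weight(tp, tn, fp, fn):
    """Assumed form of the final weight: cosine coefficient times log(N / df)."""
    n_docs = tp + tn + fp + fn  # N: total number of documents
    df = tp + tn                # df: documents that contain the feature
    if df == 0:
        return 0.0
    return cosine_coefficient(tp, tn, fp) * math.log(n_docs / df)


weight = feature_weight(8, 2, 8, 82)
```

Because both functions depend only on the four counts, the weights can be recomputed at any time from a stored or merged matrix without going back to the documents.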

The weight calculation unit 220 may also obtain the similarity using the log odds ratio lor(f_i, c_j) given by Equation 3.
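Equation 3 is likewise not reproduced here. As a hedged illustration, the standard log odds ratio of a 2x2 table can be written with the Table 1 counts as log((TP x FN) / (TN x FP)). The additive `smoothing` parameter below is an assumption introduced to avoid division by zero and is not taken from the description.

```python
import math


def log_odds_ratio(tp, tn, fp, fn, smoothing=0.5):
    """Standard 2x2 log odds ratio with Table 1's labels.

    'smoothing' is an illustrative assumption added to every cell so that
    empty cells do not cause division by zero or log(0).
    """
    numerator = (tp + smoothing) * (fn + smoothing)
    denominator = (tn + smoothing) * (fp + smoothing)
    return math.log(numerator / denominator)


lor = log_odds_ratio(8, 2, 8, 82, smoothing=0)
```

A positive value indicates that the feature and the category co-occur more often than independence would suggest; a value of zero indicates no association.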

For a database composed of documents d each having n features, the weight calculation unit 220 can use the calculated per-feature weights to represent a document as the feature value vector d of Equation 4.

Here, each vs(f_i, c_j) that makes up the document vector d is the weight of feature f_i in document d, calculated using a similarity coefficient such as the cosine, Dice, or Jaccard coefficient or the log odds ratio, or using any of various distance coefficients.

The feature property matrix generation unit 225 generates a feature property matrix that contains, for each feature of the training documents, the weight obtained by the weight calculation unit 220. That is, for each individual feature of the training documents, the feature property matrix generation unit 225 generates a feature property matrix whose fields include the document number, category code, weight, and category property values. Here, the category property values are the number of cases in which the feature appears and the document belongs to a given category, the number of cases in which the feature does not appear but the document belongs to the category, the number of cases in which the feature appears but the document does not belong to the category, and the number of cases in which the feature does not appear and the document does not belong to the category.

The feature property matrix is used when the classifier dynamic combining device dynamically combines a plurality of classifiers.

The term vector generation unit 230 generates, for each feature of the training documents, a term vector annotated with weights. For each feature, the generated term vector consists of "category, weight" pairs.
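A term vector of "category, weight" pairs can be derived by grouping matrix rows by feature. The sketch below assumes a simplified row layout of (feature, category, weight); the sample values and the function name `build_term_vectors` are illustrative and not taken from the description.

```python
from collections import defaultdict

# Hypothetical, simplified feature property matrix rows:
# (feature, category code, weight). The values are made up for illustration.
matrix_rows = [
    ("gene", "bio", 0.82),
    ("gene", "chem", 0.10),
    ("reaction", "chem", 0.91),
]


def build_term_vectors(rows):
    """Group rows by feature into lists of (category, weight) pairs."""
    vectors = defaultdict(list)
    for feature, category, weight in rows:
        vectors[feature].append((category, weight))
    return dict(vectors)


term_vectors = build_term_vectors(matrix_rows)
```

Keying the result by feature makes the lookup at classification time a single dictionary access per feature of the incoming document.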

For each feature that matches a feature extracted from a newly collected document to be classified, the category determination unit 240 combines the term vectors generated by the term vector generation unit 230, tallies them as votes, and determines the category with the maximum value as the final category of the document.

That is, the category determination unit 240 matches the features of the newly collected document against the term vectors learned and stored in the classifier, sums the per-category weight values in the matched vectors, and determines the category with the highest vote total as the final category.

Accordingly, the category determination unit 240 determines the category of the document to be classified using Equation 5.

Here, f_i is a feature, c_j is a category, and vs(f_i, c_j) is the per-feature weight.

For example, given a test document d = {f₁, f₂, f₃, ..., f_n} and subject categories C = {c₁, c₂, c₃, ..., c_m}, and letting vs(f_i, c_j) be the weight that feature f_i has for category c_j, the feature-value voting classifier assigns to the document the category c_j that satisfies Equation 5.
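The voting step described above, summing vs(f_i, c_j) over the features of a document for each category and taking the category with the largest total, can be sketched as follows. The function name `classify`, the sample weights, and the choice to return `None` when no feature is known are assumptions made for illustration.

```python
from collections import defaultdict


def classify(doc_features, term_vectors):
    """Sum per-category weights over the document's features; return the argmax."""
    votes = defaultdict(float)
    for feature in doc_features:
        # Features never seen in training contribute no votes.
        for category, weight in term_vectors.get(feature, []):
            votes[category] += weight
    if not votes:
        return None  # assumption: leave the document unclassified
    return max(votes, key=votes.get)


# Hypothetical learned term vectors: feature -> [(category, weight), ...]
term_vectors = {
    "gene": [("bio", 0.8), ("chem", 0.1)],
    "reaction": [("chem", 0.5)],
    "protein": [("bio", 0.7)],
}
label = classify(["gene", "reaction", "protein", "unknown"], term_vectors)
```

In this example "bio" collects 0.8 + 0.7 and "chem" collects 0.1 + 0.5, so the document is assigned to "bio".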

The storage unit 250 stores the feature property matrix, which records the category code, weight, and category property values for each feature.

FIG. 3 is a block diagram schematically illustrating the configuration of a classifier dynamic combining device according to the present invention.

Referring to FIG. 3, the classifier dynamic combining device 300 includes a feature extraction unit 310, an integrated matrix generation unit 320, a weight calculation unit 330, a term vector generation unit 340, and a category determination unit 350.

The feature extraction unit 310 extracts features from the plurality of feature property matrices to be combined. The features are extracted from the feature property matrices that the classifier generating devices generated for each feature.

The integrated matrix generation unit 320 lists the code information of the extracted features and sums the category property values for each feature to generate an integrated matrix. That is, the integrated matrix generation unit 320 builds the complete set of features that appear in the feature property matrices and sums the category property values for each feature to generate the integrated matrix.

For each feature, the integrated matrix generation unit 320 automatically computes any category that is missing from the full category list, dynamically creates a field for that category in memory, sums the category property values over all categories for each feature to generate the integrated matrix, and stores the result in a database.

The integrated matrix generation unit 320 also builds a list of the unique features extracted by the feature extraction unit 310 and builds a category code list by extracting all category codes from the feature property matrices to be combined. Then, when an individual table contains information on a given feature, the integrated matrix generation unit 320 extracts the category property values for all category codes, and for category codes that are not present it calculates and creates the category property values.

When a given feature does not exist in an individual table, the integrated matrix generation unit 320 creates the feature and generates category property values for every category code in the category code list.

That is, because a feature does not appear in every individual table, the integrated matrix generation unit 320 dynamically computes the aggregate information of the feature property matrices, such as the number of features and the total number of documents, and recalculates category property values such as TP, TN, FP, FN, the similarity, and the inverse document frequency (IDF).

In other words, the integrated matrix generation unit 320 sums TP, TN, FP, and FN per category for each feature. For each feature, any category missing from the full category list is computed automatically and a field for that category is created.

For example, suppose the preset category codes are the four categories "bio, chemistry, geography, mathematics" and a given feature has no entry for "mathematics". The integrated matrix generation unit 320 then obtains the overall totals from the values of the other features, computes FP and FN for all categories, and creates the "mathematics" entry for that feature.
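One way to read the merging and gap-filling steps above is sketched below. It is an interpretation under assumptions, not the patented procedure itself: each individual table is modeled as a dictionary holding its document count (`n_docs`), per-category document counts (`cat_sizes`), per-feature document frequencies (`df`), and the stored (TP, TN, FP, FN) rows. When a (feature, category) row is missing from a table, the feature and category never co-occur there, so TP = 0, TN = df, FP = category size, and FN = N - df - category size.

```python
def counts_for(table, feature, category):
    """Return (TP, TN, FP, FN) for one table, deriving the row if it is missing."""
    row = table["rows"].get((feature, category))
    if row is not None:
        return row
    df = table["df"].get(feature, 0)            # documents containing the feature
    size = table["cat_sizes"].get(category, 0)  # documents in the category
    # Missing row: feature and category never co-occur in this table.
    return (0, df, size, table["n_docs"] - df - size)


def merge(tables):
    """Sum the four counts over all tables for every feature and category."""
    features = {f for t in tables for f in t["df"]}
    categories = {c for t in tables for c in t["cat_sizes"]}
    merged = {}
    for f in sorted(features):
        for c in sorted(categories):
            parts = [counts_for(t, f, c) for t in tables]
            merged[(f, c)] = tuple(sum(col) for col in zip(*parts))
    return merged


# Two hypothetical individual tables with different category sets.
table_a = {"n_docs": 4, "cat_sizes": {"bio": 2, "chem": 2},
           "df": {"gene": 2}, "rows": {("gene", "bio"): (2, 0, 0, 2)}}
table_b = {"n_docs": 3, "cat_sizes": {"math": 3},
           "df": {"gene": 1}, "rows": {("gene", "math"): (1, 0, 2, 0)}}
merged = merge([table_a, table_b])
```

Every merged row sums to the combined document count (7 here), so similarities and IDF can be recomputed from the integrated matrix alone.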

The weight calculation unit 330 obtains the similarity between each feature and each category in the integrated matrix generated by the integrated matrix generation unit 320, and uses that similarity to obtain a weight for each feature. The per-feature weights are calculated using Equations 1 to 3.

The term vector generation unit 340 generates, for each feature in the integrated matrix, a term vector annotated with weights, thereby generating an integrated classifier that has learned all the feature information of the databases to be combined. For each feature, the generated term vector consists of "category, weight" pairs.

For each feature that matches a feature extracted from a newly collected document to be classified, the category determination unit 350 combines the term vectors generated by the term vector generation unit 340, tallies them as votes, and determines the category with the maximum value as the final category of the document.

That is, the category determination unit 350 matches the features of the newly collected document against the term vectors learned and stored in the classifier, sums the per-category weight values in the matched vectors, and determines the category with the highest vote total as the final category. The category determination unit 350 determines the final category of the document to be classified using Equation 5.

The classifier dynamic combining device 300 may further include a capacity division unit (not shown) that divides the databases to be combined into partitions of a predetermined size. The integrated matrix generation unit 320 then generates the integrated matrices for the partitions produced by the capacity division unit simultaneously, in parallel.

When the size of the heterogeneous databases is at or above a given threshold, the capacity division unit (not shown) splits them so that large volumes of data can be processed.
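The split-then-build-in-parallel idea can be sketched with the Python standard library. This is an illustrative simplification: the partial "matrix" here is just a count of (feature, category) co-occurrences, the chunk size is arbitrary, and all function names are assumptions. A thread pool is used to keep the example short; a process pool would be the more natural choice for CPU-bound work.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(docs, chunk_size):
    """Cut the document list into consecutive chunks; no shuffling is needed."""
    return [docs[i:i + chunk_size] for i in range(0, len(docs), chunk_size)]


def build_partial(chunk):
    """Count (feature, category) co-occurrences for one chunk."""
    counts = Counter()
    for features, category in chunk:
        for feature in features:
            counts[(feature, category)] += 1
    return counts


def build_parallel(docs, chunk_size):
    """Build one partial count per chunk concurrently, then add them up."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(build_partial, split(docs, chunk_size)))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return total


docs = [
    ({"gene"}, "bio"),
    ({"gene", "protein"}, "bio"),
    ({"reaction"}, "chem"),
    ({"gene"}, "chem"),
    ({"matrix"}, "math"),
]
total = build_parallel(docs, chunk_size=2)
```

Because the partial counts are simply added, the result does not depend on how the documents are distributed across chunks.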

The classifier dynamic combining device 300 may also further include an integrated matrix generation management unit 370. The integrated matrix generation management unit 370 generates a new integrated matrix by repeatedly merging, in a pyramid fashion, the integrated matrix generated by the integrated matrix generation unit 320 with an individual feature property matrix or with another integrated matrix.

That is, just as integrated matrices are generated by combining individual feature property matrices in various ways, the integrated matrix generation management unit 370 generates a new, very large integrated matrix by combining the integrated matrix generated by the integrated matrix generation unit 320 with individual feature property matrices or other integrated matrices in various ways.
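The pyramid-style repetition described above can be pictured as pairwise merging, level by level, until a single matrix remains. The sketch below assumes that each matrix is a `Counter` of positive counts keyed by (feature, category) and that merging is plain addition; both are simplifying assumptions for illustration.

```python
from collections import Counter


def merge_pair(a, b):
    """Add two count matrices; Counter addition sums matching keys."""
    return a + b


def pyramid_merge(matrices):
    """Merge pairwise, level by level, until one matrix remains."""
    level = list(matrices)
    while len(level) > 1:
        merged = [merge_pair(level[i], level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an odd matrix out moves up to the next level
            merged.append(level[-1])
        level = merged
    return level[0] if level else Counter()


matrices = [
    Counter({("gene", "bio"): 1}),
    Counter({("gene", "bio"): 2, ("acid", "chem"): 1}),
    Counter({("acid", "chem"): 4}),
]
combined = pyramid_merge(matrices)
```

Since addition is associative, an already integrated matrix can be fed back in alongside individual matrices and the final totals are the same regardless of merge order.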

FIG. 4 is a flowchart illustrating a method for automatically generating a large-capacity classifier by dynamic combination of classifiers according to an embodiment of the present invention.

Referring to FIG. 4, the classifier generating device extracts features from the training documents of its database (S400), and obtains the similarity between the extracted features and the categories to calculate a weight for each feature (S402).

The classifier generating device then generates a feature property matrix and term vectors that contain the calculated weight for each feature of the training documents (S404). The feature property matrix records the feature, category, similarity, weight, category property values, and so on.

After S404, the classifier generating device uses the generated term vectors to determine the category of a newly collected document to be classified (S406). That is, for each feature that matches a feature extracted from the newly collected document, the classifier generating device combines the generated term vectors, tallies them as votes, and determines the category with the maximum value as the final category of the document.

After S406, the classifier generating device provides the feature property matrix generated in S404 to the classifier dynamic combining device (S408).

The classifier dynamic combining device lists the code information of the features that appear in the plurality of feature property matrices received from the classifier generating devices, and sums the category property values for each feature to generate an integrated matrix (S410). That is, the classifier dynamic combining device builds the complete set of features that appear in the feature property matrices and sums the category property values for each feature to generate the integrated matrix. In doing so, for each feature, any category missing from the full category list is computed automatically and a field for it is created dynamically in memory; the category property values over all categories are summed for each feature to generate the integrated matrix; and the result is stored in a database.

After S410, the classifier dynamic combining device uses the generated integrated matrix to generate an integrated classifier that has learned all the feature information of the databases to be combined (S412).

FIG. 5 is a flowchart illustrating a method by which the classifier dynamic combining device automatically generates a large-capacity classifier by dynamically combining individual classifiers according to an embodiment of the present invention.

Referring to FIG. 5, the classifier dynamic combining device extracts features from the feature property matrices in the databases to be combined (S500), lists the code information of the extracted features, and sums the category property values for each feature to generate an integrated matrix (S502). If, based on the preset category code information, there is a feature for which at least one category is absent, the classifier dynamic combining device automatically creates a field for that category from the full category list for that feature when generating the integrated matrix.

The classifier dynamic combining device may also generate a new, very large integrated matrix by repeatedly merging, in a pyramid fashion, the generated integrated matrix with an individual feature property matrix or with another integrated matrix.

After S502, the classifier dynamic combining device obtains the similarity between each feature and each category in the integrated matrix (S504), and uses that similarity to obtain a weight for each feature (S506). The similarity is obtained using a similarity coefficient such as the cosine, Dice, or Jaccard coefficient or the log odds ratio, or using any of various distance coefficients.

After S506, the classifier dynamic combining device generates, for each feature in the integrated matrix, a term vector containing the weights, thereby generating an integrated classifier that has learned all the feature information of the databases to be combined (S508).

Then, for each feature that matches a feature extracted from a newly collected document to be classified, the classifier dynamic combining device combines the generated term vectors, tallies them as votes, and determines the category with the maximum value as the final category of the document (S510).

FIG. 6 is a flowchart illustrating a method by which the classifier dynamic combining device automatically generates a large-capacity classifier by dynamically combining individual classifiers according to an embodiment of the present invention.

Referring to FIG. 6, the classifier dynamic combining device divides the databases to be combined into partitions of a predetermined size (S600).

The classifier dynamic combining device then extracts features from the feature property matrices in each partition (S602), lists the code information of the extracted features, and sums the category property values for each feature to generate an integrated matrix (S604).

Steps S606 to S612 correspond to steps S504 to S510 of FIG. 5, so their description is omitted.

By the method described above, the classifier dynamic combining device can automatically split a large volume of training documents and combine them dynamically.

As described above, those skilled in the art to which the present invention pertains will understand that the invention may be embodied in other specific forms without changing its technical idea or essential features. The embodiments described above are therefore to be understood as illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims that follow rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents are to be construed as falling within the scope of the present invention.

100: database 200: classifier generating device
210, 310: feature extraction unit 220, 330: weight calculation unit
225: feature property matrix generation unit 230, 340: term vector generation unit
240, 350: category determination unit 250, 360: storage unit
370: integrated matrix generation management unit 300: classifier dynamic combining device
320: integrated matrix generation unit

Claims

A system for automatically generating a large-capacity classifier by dynamic combination of classifiers, the system comprising:
databases in which a plurality of documents are classified and stored according to the characteristics of the documents;
a classifier generating device that extracts features from the training documents of each database, obtains the similarity between the extracted features and categories to calculate a weight for each feature, and generates a feature property matrix and term vectors containing the weight for each feature, thereby generating an individual classifier that determines the category of a newly collected document to be classified; and
a classifier dynamic combining device that lists the code information of the features appearing in a plurality of feature property matrices to be combined, sums the category property values for each feature to generate an integrated matrix, and uses the integrated matrix to generate an integrated classifier that has learned all the feature information of the databases to be combined.

The system of claim 1,
wherein a classifier generating device is provided for each database.

The system of claim 1,
wherein the classifier generating device linearly combines the term vectors for all features of the document to be classified and determines the category with the highest vote total as the category of that document.

The system of claim 1,
wherein the classifier dynamic combining device divides the databases to be combined into partitions of a predetermined size, generates an integrated matrix for each partition, and uses the integrated matrices to generate an integrated classifier that has learned all the feature information of the databases to be combined.

The system of claim 1,
wherein the classifier dynamic combining device generates a new integrated matrix by repeatedly merging the generated integrated matrix with an individual feature property matrix or with another integrated matrix.

A classifier generating device comprising:
a feature extraction unit that extracts features from training documents;
a weight calculation unit that obtains the similarity between the extracted features and categories and uses the similarity to obtain a weight for each feature;
a feature property matrix generation unit that generates a feature property matrix containing, for each feature of the training documents, the weight obtained by the weight calculation unit;
a term vector generation unit that generates, for each feature of the training documents, a term vector annotated with weights; and
a category determination unit that, for each feature matching a feature extracted from a newly collected document to be classified, combines the term vectors generated by the term vector generation unit, tallies them as votes, and determines the category with the maximum value as the final category of the document to be classified.

The classifier generating device of claim 6,
wherein the feature extraction unit extracts features by at least one of: using the keyword field of a training document; extracting information from the unstructured text of the title or abstract based on a corpus dictionary; and applying natural language processing using stemming or morphological analysis.

The classifier generating device of claim 6,
wherein the weight calculation unit obtains the similarity using at least one similarity coefficient among the cosine coefficient, the Dice coefficient, the Jaccard coefficient, and the log odds ratio, or using various distance coefficients.

The classifier generating device of claim 6,
wherein the weight calculation unit obtains the final per-feature weight by an equation that adds the inverse document frequency to the feature weight,
where vs is the weight, f_i is a feature, c_j is a category, tf is the term frequency, N is the total number of documents, and df is the document frequency.

The device of claim 6,
wherein the term vector generator generates, for each feature, a term vector composed of "category, weight" pairs.

The device of claim 6,
wherein the category determiner matches the features constituting the newly collected classification target document against the term vectors learned and stored in the classifier, sums all the weights for each subject category in the matched vector information, and determines the category having the maximum value of the voting result as the final category.
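
The voting step described above can be sketched as follows. This is a minimal illustration, assuming the stored term vectors are held as a mapping from each feature to its list of "category, weight" pairs; the function name `determine_category`, the category codes, and the weights are hypothetical.

```python
from collections import defaultdict


def determine_category(doc_features, term_vectors):
    """Sum the weights per category over all matched features and
    return the category with the largest total, plus all totals."""
    votes = defaultdict(float)
    for feature in doc_features:
        # Features with no stored term vector contribute no votes.
        for category, weight in term_vectors.get(feature, []):
            votes[category] += weight
    if not votes:
        return None, {}
    best = max(votes, key=votes.get)
    return best, dict(votes)


# Hypothetical term vectors: feature -> list of (category, weight) pairs.
term_vectors = {
    "neural": [("CS", 0.9), ("BIO", 0.4)],
    "network": [("CS", 0.7), ("EE", 0.5)],
    "protein": [("BIO", 0.8)],
}

category, votes = determine_category(["neural", "network", "unknown"], term_vectors)
```

Here the document's features vote for "CS" twice, so `category` is "CS"; a document with no matching features yields `None`.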

The device of claim 6,
wherein the feature characteristic matrix includes a document number, a category code, a weight, and a category characteristic value for each feature, and
the category characteristic value includes: the frequency with which the feature appears and the document belongs to a specific category; the frequency with which the feature does not appear but the document belongs to the specific category; the frequency with which the feature appears but the document does not belong to the specific category; and the frequency with which the feature does not appear and the document does not belong to the specific category.
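
The four frequencies in this claim form a 2x2 contingency table for each feature/category pair. The sketch below counts them from a small labeled collection. It assumes each document carries a set of features and exactly one category; the function name `category_characteristic_values` and the sample documents are hypothetical.

```python
def category_characteristic_values(documents, feature, category):
    """Return (a, b, c, d) for one feature/category pair.

    a: feature appears and document belongs to the category
    b: feature absent but document belongs to the category
    c: feature appears but document does not belong to the category
    d: feature absent and document does not belong to the category
    """
    a = b = c = d = 0
    for features, doc_category in documents:
        has_feature = feature in features
        in_category = doc_category == category
        if has_feature and in_category:
            a += 1
        elif in_category:
            b += 1
        elif has_feature:
            c += 1
        else:
            d += 1
    return a, b, c, d


# Hypothetical learning target documents: (set of features, category code).
documents = [
    ({"neural", "network"}, "CS"),
    ({"neural", "protein"}, "BIO"),
    ({"network"}, "CS"),
    ({"protein"}, "BIO"),
]

counts = category_characteristic_values(documents, "neural", "CS")
```

Because the four values are plain counts, values from different databases can be added together, which is what makes the later integration step possible.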

A classifier dynamic combining device comprising:
a feature extraction unit that extracts features from the feature characteristic matrices of a plurality of combination target databases;
an integrated matrix generator that lists the code information of the extracted features and generates an integrated matrix by summing the category characteristic values for each feature;
a weight calculator that obtains the similarity between each feature and each category constituting the integrated matrix, and obtains a weight for each feature using the similarity; and
a term vector generator that generates a term vector in which a weight is indicated for each feature of the integrated matrix, thereby generating an integrated classifier that has learned all the feature information of the combination target databases.

The device of claim 13,
further comprising a category determiner that combines, in a voting manner, the plurality of term vectors generated by the term vector generator for the matching features extracted from a newly collected classification target document, and determines the category having the maximum value as the final category of the classification target document.

The device of claim 13,
further comprising a capacity divider that divides the combination target database into partitions of a predetermined size.

The device of claim 13,
wherein the integrated matrix generator generates an integrated matrix for each partition produced by the capacity divider, simultaneously and in parallel.

The device of claim 13,
further comprising an integrated matrix generation manager that generates a new integrated matrix by repeatedly integrating, in a pyramid manner, the integrated matrix generated by the integrated matrix generator with an individual feature characteristic matrix or with another integrated matrix.
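
One way to read "pyramid manner" is as pairwise merging in rounds, where each round halves the number of matrices until one remains. The sketch below follows that reading. The names `merge_pair` and `pyramid_merge` are hypothetical, and each matrix is simplified to a mapping from feature to per-category counts (a single count stands in for the category characteristic values).

```python
from collections import Counter


def merge_pair(left, right):
    """Merge two matrices: union of features, counts summed per category."""
    merged = {feature: Counter(cats) for feature, cats in left.items()}
    for feature, cats in right.items():
        merged.setdefault(feature, Counter()).update(cats)
    return merged


def pyramid_merge(matrices):
    """Merge matrices pairwise, level by level, until one remains."""
    if not matrices:
        return {}
    level = list(matrices)
    while len(level) > 1:
        next_level = [merge_pair(level[i], level[i + 1])
                      for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            # An odd matrix out is carried up to the next level unchanged.
            next_level.append(level[-1])
        level = next_level
    return level[0]


# Hypothetical individual matrices from three databases.
m1 = {"neural": {"CS": 3}}
m2 = {"neural": {"CS": 2, "BIO": 1}}
m3 = {"protein": {"BIO": 4}}

result = pyramid_merge([m1, m2, m3])
```

Because summation is associative, the result does not depend on how the matrices are paired, so intermediate integrated matrices can be kept and reused in later merges.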

The device of claim 13,
wherein the integrated matrix generator generates the integrated matrix by forming the union of the features appearing in the feature characteristic matrices and summing the category characteristic values for each feature.
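
The union-and-sum operation can be sketched as below. Each matrix is assumed to map a feature to its per-category characteristic values, stored as a four-element tuple of frequencies; the function name `integrate_matrices` and the sample values are hypothetical. As a simplification, an entry that is absent from one matrix is treated as contributing zero.

```python
def integrate_matrices(matrices):
    """Union of all features; category characteristic values summed element-wise."""
    integrated = {}
    for matrix in matrices:
        for feature, categories in matrix.items():
            row = integrated.setdefault(feature, {})
            for category, values in categories.items():
                current = row.get(category, (0, 0, 0, 0))
                row[category] = tuple(x + y for x, y in zip(current, values))
    return integrated


# Hypothetical feature characteristic matrices from two databases.
db1 = {"neural": {"CS": (8, 2, 4, 86)}}
db2 = {"neural": {"CS": (3, 1, 2, 44)}, "protein": {"BIO": (5, 0, 1, 44)}}

integrated = integrate_matrices([db1, db2])
```

The weights for the integrated classifier would then be recomputed from the summed values rather than by averaging the weights of the individual classifiers.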

The device of claim 18,
wherein the integrated matrix generator automatically computes, for each feature, the categories missing from the full category list, dynamically creates fields for those categories in memory, generates the integrated matrix by summing all the category characteristic values for each feature, and stores the result in a database.

The device of claim 13,
wherein the integrated matrix generator creates a list of the unique features extracted by the feature extraction unit, extracts all category codes from the feature characteristic matrices to be combined to create a category code list, and, if information on a specific feature exists in an individual table, extracts the category characteristic values for all category codes,
while calculating and generating the category characteristic values for category codes that do not exist.

The device of claim 20,
wherein, if a specific feature does not exist in an individual table, the integrated matrix generator creates the feature and generates a category characteristic value for every category code in the category code list.
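
The claims do not state how the values for a missing category code or a missing feature are calculated. The sketch below shows one plausible interpretation, assuming a missing entry means the feature and the category never co-occur in that table, so the four frequencies can be derived from the table's own totals. The function name `complete_table` and the sample data are hypothetical, and the rule itself is an assumption rather than the patented calculation.

```python
def complete_table(table, all_features, all_categories):
    """Fill in entries for features and category codes absent from one table.

    table: feature -> category -> (a, b, c, d) frequency tuple.
    Assumed rule for a missing entry (zero co-occurrence):
      (0, docs_in_category, docs_with_feature, remaining_docs)
    """
    total_docs = 0
    category_docs = {}
    feature_docs = {}
    for feature, categories in table.items():
        for category, (a, b, c, d) in categories.items():
            total_docs = a + b + c + d          # same total for every entry
            category_docs[category] = a + b     # documents in the category
            feature_docs[feature] = a + c       # documents with the feature

    completed = {}
    for feature in all_features:
        existing = table.get(feature, {})
        n_f = feature_docs.get(feature, 0)
        row = {}
        for category in all_categories:
            if category in existing:
                row[category] = existing[category]
            else:
                n_c = category_docs.get(category, 0)
                row[category] = (0, n_c, n_f, total_docs - n_c - n_f)
        completed[feature] = row
    return completed


# Hypothetical individual table covering 100 documents.
db1 = {"neural": {"CS": (8, 2, 4, 86)}}
completed = complete_table(db1, ["neural", "protein"], ["CS", "BIO"])
```

After every table has been completed this way, each feature has a value for every category code, so the tables can be summed entry by entry.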

A method for automatically generating a large-capacity classifier by dynamic combination of classifiers, the method comprising:
(a) extracting, by a plurality of classifier generating devices, features from the learning target documents of each database;
(b) calculating, by the plurality of classifier generating devices, a weight for each feature by obtaining the similarity between each extracted feature and each category;
(c) generating, by each of the plurality of classifier generating devices, an individual classifier that determines the category of a newly collected classification target document, by generating a feature characteristic matrix and a term vector including the calculated weight for each feature of the learning target document;
(d) listing, by a classifier dynamic combining device, the code information of the features appearing in the plurality of feature characteristic matrices to be combined, and generating an integrated matrix by summing the category characteristic values for each feature; and
(e) generating, by the classifier dynamic combining device, an integrated classifier that has learned all the feature information of the combination target databases using the generated integrated matrix.

The method of claim 22,
wherein step (c) comprises:
generating a feature characteristic matrix including the calculated weight for each feature of the learning target document;
generating a term vector in which a weight is indicated for each feature constituting the learning target document; and
combining, in a voting manner, the plurality of generated term vectors for the matching features extracted from the newly collected classification target document, and determining the category having the maximum value as the final category of the classification target document.

A method by which a classifier dynamic combining device automatically generates a large-capacity classifier by dynamic combination of classifiers, the method comprising:
(a) extracting features from the feature characteristic matrices in the combination target databases;
(b) listing the code information of the extracted features, and generating an integrated matrix by summing the category characteristic values for each feature;
(c) obtaining the similarity between each feature and each category constituting the integrated matrix, and obtaining a weight for each feature using the similarity; and
(d) generating an integrated classifier that has learned all the feature information of the combination target databases by generating a term vector including the weight for each feature of the integrated matrix.

The method of claim 24,
further comprising combining, in a voting manner, the generated term vectors for the matching features extracted from a newly collected classification target document, and determining the category having the maximum value as the final category of the classification target document.

The method of claim 24,
wherein, after step (b), the process of integrating the generated integrated matrix with an individual feature characteristic matrix or with another integrated matrix is repeated in a pyramid manner to generate a new large-scale integrated matrix.

The method of claim 24,
wherein, in step (b), the integrated matrix is generated by forming the union of the features appearing in the feature characteristic matrices and summing the category characteristic values for each feature, and
the categories missing from the full category list for each feature are automatically computed, fields for those categories are dynamically created in memory, the integrated matrix is generated by summing all the category characteristic values for each feature, and the result is stored in a database.

The method of claim 24,
wherein, in step (b), a list of the unique features extracted from the feature characteristic matrices is created, all category codes are extracted to create a category code list, and, if information on a specific feature exists in an individual table, the category characteristic values for all category codes are extracted, while the category characteristic values for category codes that do not exist are calculated and generated, and
if a specific feature does not exist in an individual table, the feature is created and a category characteristic value is generated for every category code in the category code list.

A method by which a classifier dynamic combining device automatically generates a large-capacity classifier by dynamic combination of classifiers, the method comprising:
(a) dividing the combination target database into partitions of a predetermined size;
(b) extracting features from the feature characteristic matrix in each partition;
(c) listing the code information of the extracted features, and generating an integrated matrix by summing the category characteristic values for each feature;
(d) obtaining the similarity between each feature and each category in the integrated matrix, and obtaining a weight for each feature using the similarity; and
(e) generating an integrated classifier that has learned all the feature information of the combination target database by generating a term vector including the weight for each feature in the integrated matrix.
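
Steps (a) through (c) of this method can be sketched as a split, a per-partition build, and a final sum. The sketch assumes the feature characteristic matrix is available as rows of (feature, category code, count), with a single count standing in for the category characteristic values; the function names, the thread pool, and the sample rows are hypothetical choices for illustration. Steps (d) and (e) are not shown.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, partition_size):
    """Step (a): divide the rows into partitions of a fixed size."""
    return [records[i:i + partition_size]
            for i in range(0, len(records), partition_size)]


def build_partial_matrix(partition):
    """Steps (b)-(c) for one partition: list features, sum counts per category."""
    matrix = {}
    for feature, category, count in partition:
        matrix.setdefault(feature, Counter())[category] += count
    return matrix


def build_integrated_matrix(records, partition_size, max_workers=4):
    """Build one partial matrix per partition in parallel, then sum them."""
    partitions = split_into_partitions(records, partition_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(build_partial_matrix, partitions))
    integrated = {}
    for partial in partials:
        for feature, categories in partial.items():
            integrated.setdefault(feature, Counter()).update(categories)
    return integrated


# Hypothetical rows: (feature, category code, count).
records = [
    ("neural", "CS", 3),
    ("neural", "BIO", 1),
    ("protein", "BIO", 4),
    ("neural", "CS", 2),
]

integrated = build_integrated_matrix(records, partition_size=2)
```

Since the partial matrices are only ever added together, the partition size affects memory use and parallelism but not the resulting integrated matrix.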