KR102025280B1

KR102025280B1 - Method and apparatus for selecting feature in classifying multi-label pattern

Info

Publication number: KR102025280B1
Application number: KR1020180067316A
Authority: KR
Inventors: 이재성; 서왕덕; 김대원
Original assignee: 중앙대학교 산학협력단
Priority date: 2018-06-12
Filing date: 2018-06-12
Publication date: 2019-09-25

Abstract

Disclosed are a method of selecting features for classifying multi-label patterns and an apparatus therefor. According to one embodiment of the present invention, the feature selection apparatus performs a feature selection method comprising the following steps: generating an initial feature set whose components are first features selected, according to predetermined constraints, from a full feature set that includes the plurality of features constituting each of a plurality of patterns classifiable with multiple labels; generating a first feature subset on the basis of the first features of the initial feature set; calculating a relevance evaluation value for each of a plurality of second features in the full feature set that do not belong to the first feature subset, and generating a second feature subset on the basis of the relevance evaluation values; generating a third feature subset on the basis of the first feature subset and the second feature subset; and calculating a fitness value for the third feature subset and generating a final feature subset on the basis of the fitness value. According to the present invention, the labels of the patterns are classified using a feature subset composed only of selected features, which can improve the accuracy of multi-label pattern classification.

Description

A feature selection method and apparatus for classifying multiple label patterns {METHOD AND APPARATUS FOR SELECTING FEATURE IN CLASSIFYING MULTI-LABEL PATTERN}

The present invention relates to a feature selection method and apparatus for multi-label pattern classification, and more particularly to a feature selection method and apparatus that can classify multiple labels effectively by taking into account constraints on the feature subset.

Multi-label data has recently been the subject of much research. Multi-label data is data in which a single pattern has one or more labels; it arises and is studied in many fields, such as document classification, real-time image classification, genetic information classification, and user emotion classification.

A representative example of multi-label data is the tag information of web documents. To classify web documents, each document carries tag information, on the basis of which it is assigned to categories; many documents belong not to a single category but to several. For example, an article about the film "The Da Vinci Code", which deals with questions of religious belief, can belong to both the film category and the religion category.

In this regard, research on selecting features that are highly correlated with the labels in multi-label data (patterns) has been actively pursued. However, because the multi-label problem requires several labels to be considered when calculating the importance of a feature, it is difficult to infer an accurate correlation with high-dimensional labels.

Accordingly, there is a growing need for technology that improves the classification accuracy of multi-label data.

A related document is Korean Patent No. 10-1656604, entitled "Method and apparatus for selecting a feature set used to classify multiple labels".

The technical problem to be solved by the present invention is to provide a feature selection method and apparatus for multi-label pattern classification that improve the classification accuracy of multi-label patterns.

The technical problems of the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the following description.

According to one embodiment of the present invention for solving the above technical problem, a method by which a multi-label feature selection apparatus selects features for multi-label pattern classification comprises: generating an initial feature set whose components are first features selected, according to predetermined constraints, from a full feature set that includes the plurality of features constituting each of a plurality of patterns classifiable with multiple labels; generating a first feature subset based on the first features of the initial feature set; calculating a relevance evaluation value for each of a plurality of second features in the full feature set that do not belong to the first feature subset, and generating a second feature subset based on the relevance evaluation values; generating a third feature subset based on the first feature subset and the second feature subset; and calculating a fitness value for the third feature subset and generating m final feature subsets based on the fitness value.

Preferably, the constraints include a maximum number n of features to be included in each feature set or feature subset (n is a natural number) and a total number m of feature sets or feature subsets (m is a natural number), and generating the initial feature set includes generating m initial feature sets, each including at most n first features, and evaluating each initial feature set.

Preferably, the first feature subset may be generated by applying a genetic operator to the initial feature set.

Preferably, generating the second feature subset may be performed based on a feature correlation function obtained by subtracting a second correlation function, which uses the first mutual information measure to define the correlation between the first feature and the second feature, from a first correlation function, which uses the first mutual information measure to define the correlation between the second feature and a first label.

Preferably, the feature correlation function may be defined by the following equation.

[Equation]

Here, l is a label, L is the label set, M is the mutual information measure expressing the correlation between its input variables, f_i is the second feature, and f is the first feature. The first term of the equation is the first correlation function, and the second term is the second correlation function.

Preferably, generating the third feature subset may further include evaluating the third feature subset using a number of fitness function calls (FFCs) equal to 2 × (the number of first feature subsets).

Preferably, after generating the final feature subset, the method may further include repeating the steps of generating the second feature subset, generating the third feature subset, and generating the final feature subset until all of the preset FFCs have been used.

According to another embodiment of the present invention for solving the above technical problem, a feature selection apparatus for multi-label pattern classification comprises: a feature set generator that generates an initial feature set whose components are first features selected, according to predetermined constraints, from a full feature set that includes the plurality of features constituting each of a plurality of patterns classifiable with multiple labels; a first feature subset generator that generates a first feature subset based on the first features of the initial feature set; a second feature subset generator that calculates a relevance evaluation value for each of a plurality of second features in the full feature set that do not belong to the first feature subset, and generates a second feature subset based on the relevance evaluation values; a third feature subset generator that generates a third feature subset based on the first feature subset and the second feature subset; and a final feature subset generator that calculates a fitness value for the third feature subset and generates m final feature subsets based on the fitness value.

According to the present invention, the performance of a population-based search such as a genetic algorithm (GA) for constrained multi-label feature selection can be improved by introducing new feature subsets that add promising but unselected features to the individuals.

In addition, since the labels of a pattern are classified using a feature subset composed only of the features selected as optimal, the classification accuracy of multi-label patterns is improved.

In addition, the amount of computation and the computation time needed to calculate the feature evaluation values are reduced, and even when the number of patterns is not sufficiently large, the feature evaluation values are calculated more accurately than with conventional methods, so that the accuracy of multi-label pattern classification is improved over the prior art.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is a diagram for explaining a feature selection method for classifying multiple labels according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining a multi-label feature selection apparatus according to an embodiment of the present invention.
FIG. 3 is a flowchart for explaining a feature selection method for classifying multiple labels according to an embodiment of the present invention.
FIG. 4 is an algorithm for explaining a feature selection method for classifying multiple labels according to an embodiment of the present invention.
FIG. 5 is an algorithm for explaining an exploration operator according to an embodiment of the present invention.
FIG. 6 is a graph comparing the convergence of the GA and the present method, in terms of multi-label accuracy, as a function of the number of FFCs used.
FIG. 7 shows box plots of the fitness values provided by the two offspring sets on the Genbase data set.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. However, this is not intended to limit the present invention to the specific embodiments, and it should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention. In describing the drawings, like reference numerals are used for like elements.

Terms such as first, second, A, and B may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be termed a second component, and similarly a second component may be termed a first component. The term "and/or" includes any combination of a plurality of related listed items or any one of the plurality of related listed items.

When a component is referred to as being "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that intervening components may be present. In contrast, when a component is referred to as being "directly connected" or "directly coupled" to another component, it should be understood that no intervening components are present.

The terminology used in this application is used only to describe particular embodiments and is not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless expressly so defined in this application.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.

The present invention improves the performance of a population-based search such as a GA for constrained multi-label feature selection by introducing new feature subsets that add promising but unselected features to the individuals.

FIG. 1 is a diagram for explaining a feature selection method for classifying multiple labels according to an embodiment of the present invention.

FIG. 1 shows several key issues to consider when introducing new features into an evolutionary-search-based multi-label feature selection process under a budget constraint.

First, given the full feature set F as in (a), chromosomes (feature sets) that satisfy the constraints are generated using a genetic algorithm as in (b). Here, the constraints may include the maximum number n of features to be included in each chromosome (feature set) and the total number m of chromosomes. The feature set F may contain a subset of important features that are strongly associated with the multiple labels and that give the multi-label classifier good discriminative power when they are included in the final feature subset. Once the random initialization process of the genetic algorithm is complete, an important feature such as f₁ may not have been selected by any chromosome (feature set). This is because each chromosome (feature set) handles no more than n features, n being the constraint on the number of features in a feature subset. Even if all chromosomes were forced to select disjoint feature sets, at least |F|/n chromosomes (feature sets) would have to be evaluated in order to consider every feature, which incurs a high computational cost. Instead, the present invention can identify promising features with the help of an employed filter, without explicitly evaluating candidate feature subsets.

Next, as in (c), genetic operators such as crossover and mutation are applied to the population to create new chromosomes. However, because new chromosomes are created by exchanging the alleles of their ancestors, important features that have not been selected may never be considered: a feature that the ancestors did not select is not selected by their offspring either. The only opportunity to add an unselected feature during offspring generation is the mutation operation. However, this is computationally inefficient, because the mutation operation selects features at random and the mutation rate is set to a small value in order to achieve convergence. A large number of iterations, or generations, must therefore be spent before important features are introduced into the population by chance.

Accordingly, as in (d), the present invention applies an exploration operator to each new offspring to generate a new feature subset containing promising features that the original offspring did not consider. During each exploration operation, the dependency of the unselected features on the multiple labels (l₁, l₂, ..., l₈) is calculated. After the rank of each feature has been calculated (e.g., f₁ → f₄₄ → f₃₂ → f₃ → ···), the most promising features are selected to generate a new feature subset.

Finally, the feature subsets produced by the exploration operator and by the genetic operators are merged into a single population.

FIG. 2 is a diagram for explaining a multi-label feature selection apparatus according to an embodiment of the present invention.

Referring to FIG. 2, the multi-label feature selection apparatus 100 according to an embodiment of the present invention includes a feature set generator 110, a first feature subset generator 120, a second feature subset generator 130, a third feature subset generator 140, and a final feature subset generator 150.

The feature set generator 110 generates m initial feature sets (m is a natural number) whose components are first features selected from the full feature set, which includes as components all of the features constituting each of a plurality of patterns classifiable with multiple labels. The feature set generator 110 generates m feature sets, each including at most n first features. Here, n and m may be the constraints. Each of the plurality of patterns can be classified with multiple labels, and each pattern may be composed of a plurality of features. A pattern, the object of classification, may be a document or the like; a label may be a category (genre); and a feature may be a word. A feature subset is the data used when classifying each pattern under a specific label. If each pattern were classified using all of the features constituting the patterns, features that are irrelevant or redundant for classification would also be used, which would degrade multi-label classification performance. The present invention therefore classifies the labels of each pattern using a feature subset composed of highly important features, thereby improving multi-label classification performance.

The feature set generator 110 may generate the feature sets at random. The feature sets generated in this way may also be referred to as the initial population.

For example, when the population size is set to m, the maximum number of features to n, and the maximum number of FFCs to v, the feature set generator 110 may generate m feature sets, each including at most n features.
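The random initialization under the constraints n and m can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `init_population`, the use of integer indices for features, and the choice to draw each set size uniformly between 1 and n are all assumptions.

```python
import random


def init_population(num_features, m, n, seed=0):
    """Return m random feature sets, each holding between 1 and n feature indices.

    Hypothetical helper: features are represented by their indices 0..num_features-1.
    """
    rng = random.Random(seed)
    population = []
    for _ in range(m):
        # Budget constraint: a feature set never holds more than n features.
        size = rng.randint(1, min(n, num_features))
        population.append(frozenset(rng.sample(range(num_features), size)))
    return population


# Example: 8 feature sets over 50 features, at most 5 features each.
population = init_population(num_features=50, m=8, n=5)
```

With |F| = 50 and n = 5, even eight disjoint sets could cover at most 40 features, which illustrates why some important features can remain unselected after initialization.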

Once the feature sets satisfying the constraints have been generated, the feature set generator 110 evaluates each feature set. The feature set generator 110 may evaluate each feature set using the preset FFCs.

The first feature subset generator 120 generates first feature subsets based on the first features of the feature sets. The first feature subset generator 120 may generate the first feature subsets by applying genetic operators to the feature sets.
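The text does not specify which genetic operators are used, so the following is only a sketch under assumptions: a crossover that draws the child from the union of two parents while respecting the limit n, and a mutation that swaps a selected feature for a random unselected one. The names `crossover` and `mutate` and the parameter `rate` are hypothetical.

```python
import random


def crossover(parent_a, parent_b, n, rng):
    """Build a child from features held by either parent, keeping at most n."""
    pool = sorted(parent_a | parent_b)
    size = min(n, len(pool), max(len(parent_a), len(parent_b)))
    return set(rng.sample(pool, size))


def mutate(child, num_features, rate, rng):
    """With probability `rate` per feature, swap it for a random unselected one."""
    result = set(child)
    for f in sorted(child):
        unselected = [g for g in range(num_features) if g not in result]
        if unselected and rng.random() < rate:
            result.remove(f)
            result.add(rng.choice(unselected))
    return result


rng = random.Random(1)
child = crossover({1, 2, 3}, {3, 4, 5, 6}, n=4, rng=rng)
mutated = mutate(child, num_features=20, rate=0.5, rng=rng)
```

Note that the crossover can only return features that one of the parents already holds; in this sketch, mutation is the only path by which a new feature enters the offspring.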

The second feature subset generator 130 calculates a relevance evaluation value for each of a plurality of second features in the full feature set that do not belong to a first feature subset, and generates a second feature subset based on the relevance evaluation values. Because a first feature subset selects only a small number of first features, at most n, and most features remain unselected, an exploration operator is needed to explore the unexplored features. Accordingly, the second feature subset generator 130 may generate the second feature subsets by applying the exploration operator to the first feature subsets (the offspring set).

Specifically, for each first feature subset (offspring) c generated by the genetic operators, the second feature subset generator 130 repeatedly selects the relevant feature that maximizes the objective function from among the features not selected by offspring c, until the size of the new feature subset reaches |S_c|, where |S_c| is the size of the feature subset of c. The exploration operator therefore requires no additional parameter to determine the number of features to select. To perform the exploration operation, an effective filter method called SCLS (scalable criterion for large label set) is used as the objective function Q(f⁺, L), where L is the label set. The i-th feature is selected from the feature set F \ {S_c ∪ Z} by identifying the feature f_i that maximizes the relevance evaluation value. Here, Z may be the feature subset holding the i−1 features already chosen when the i-th feature is selected.
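The greedy selection performed by the exploration operator can be sketched as below. The function name `explore` and the callback `score(candidate, selected)` are hypothetical; `score` stands in for the relevance evaluation, and the toy scores in the usage lines are made up for illustration only.

```python
def explore(offspring, all_features, score):
    """Greedily pick |offspring| features that the offspring did not select.

    `score(candidate, selected)` is a placeholder for the relevance evaluation;
    `selected` plays the role of Z, the features chosen so far.
    """
    selected = []
    candidates = set(all_features) - set(offspring)
    while len(selected) < len(offspring) and candidates:
        # Choose the unselected feature with the highest relevance value.
        best = max(sorted(candidates), key=lambda f: score(f, selected))
        selected.append(best)
        candidates.remove(best)
    return set(selected)


# Toy relevance values per feature index (illustrative numbers only).
toy_relevance = {0: 0.9, 1: 0.8, 2: 0.1, 3: 0.7, 4: 0.6, 5: 0.2}
offspring = {0, 1}
explored = explore(offspring, range(6), lambda f, selected: toy_relevance[f])
```

Because the loop stops at the size of the offspring, no extra parameter is needed to decide how many features to pick.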

Accordingly, the second feature subset generator 130 performs the relevance evaluation using Equation 1 below.

[Equation 1]

Here, the first term is the dependency of f_i on L, the second term is the dependency of f_i on the features already selected in Z, and f_i denotes the i-th feature.

Equation 1 can be rewritten as Equation 2 below.

[Equation 2]

Here, l is a label, L is the label set, M(x; y) is the mutual information between the variables x and y, and f_i is the i-th feature. The mutual information is computed from the joint entropy of the probability functions p(x), p(y), and p(x, y). Accordingly, D(f₂) can be expressed from Equation 2 as Equation 3 below.

[Equation 3]

Also, from Equation 2, the remaining term, denoted R(f₂) in what follows, can be expressed as Equation 4 below.

[Equation 4]

To compute R(f₂) in a way that adapts to the scale of D(f₂) while avoiding the repeated calculation over f ∈ S and l ∈ L, R(f₂) can be expressed as Equation 5 below.

[Equation 5]

Here, α, with 0 ≤ α ≤ 1, must be estimated; it determines how much the relevance of f₂ is reduced relative to D(f₂) while avoiding the repeated calculation for each label. Accordingly, α can be approximated as in Equation 6 below.

[Equation 6]

In conclusion, the relevance evaluation for f₂ can be calculated using Equation 7 below.

[Equation 7]

Equation 7 shows how the relevance evaluation is performed when i = 2. Taking into account the features previously selected in Z, the final relevance evaluation can be expressed as Equation 8 below.

[Equation 8]

In Equation 8, the first term is the first correlation function and the second term is the second correlation function.

Equation 8 may be the objective function used to calculate the relevance evaluation value of each of the plurality of second features that do not belong to the first feature subset and to select, based on those values, the features to be included in each feature subset.
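The equation images are not reproduced in this text. One reading that is consistent with the surrounding definitions (a label-dependency term D, a redundancy term R scaled by a factor α between 0 and 1, and mutual information M) is sketched below. The symbol J for the relevance evaluation value and the specific form of α are assumptions, taken from the SCLS criterion as it is commonly stated, and are not confirmed by this document.

```latex
\begin{align}
D(f_i) &= \sum_{l \in L} M(f_i; l), \\
M(x; y) &= H(x) + H(y) - H(x, y), \qquad H(x) = -\sum_{x} p(x) \log p(x), \\
R(f_2) &\approx \alpha \, D(f_2), \qquad \alpha \approx \frac{M(f_1; f_2)}{H(f_2)}, \\
J(f_i) &= D(f_i) - \sum_{f \in Z} \frac{M(f_i; f)}{H(f_i)} \, D(f_i).
\end{align}
```

Under this reading, α lies in [0, 1] because M(f₁; f₂) can never exceed H(f₂), and the redundancy term needs only one mutual information value per already selected feature rather than one per feature and label pair.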

Accordingly, the second feature subset generation unit 130 may generate the second feature subset based on a feature correlation function obtained by subtracting a second correlation function from a first correlation function. The first correlation function uses the first mutual information measure to define the correlation between the second feature and the first label, and the second correlation function uses the first mutual information measure to define the correlation between the first feature and the second feature. The feature correlation function may be Equation 8. The size of the second feature subset may be equal to the size of the first feature subset.
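As a rough illustration of a feature correlation function of this shape, the Python sketch below scores a candidate feature as its summed mutual information with the labels minus its summed mutual information with the already selected features. Equation 8 is not reproduced in this text, so the plain unweighted subtraction, the plug-in entropy estimate, and the names `entropy`, `mutual_information`, and `relevance_score` are assumptions made for illustration and may differ from the claimed formula.

```python
import math
from collections import Counter

def entropy(*columns):
    """Plug-in Shannon entropy (bits) of one or more discrete columns taken jointly."""
    rows = list(zip(*columns))
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in Counter(rows).values())

def mutual_information(x, y):
    """M(x; y) = H(x) + H(y) - H(x, y) for two discrete columns."""
    return entropy(x) + entropy(y) - entropy(x, y)

def relevance_score(candidate, labels, selected):
    """First correlation (candidate vs. labels) minus second (candidate vs. selected).

    Assumption: both terms are simple sums; any scaling used by Equation 8 is omitted.
    """
    first = sum(mutual_information(candidate, label) for label in labels)
    second = sum(mutual_information(candidate, feature) for feature in selected)
    return first - second
```

A candidate that duplicates an already selected feature scores no higher than its label relevance minus its own entropy, which is the behavior the subtraction is meant to capture.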

The third feature subset generation unit 140 generates a third feature subset based on the first feature subset and the second feature subset. In doing so, the third feature subset generation unit 140 may evaluate the third feature subset using a number of FFCs equal to 2 × (the number of first feature subsets).

The final feature subset generation unit 150 calculates a fitness value for the third feature subset and generates m final feature subsets based on that fitness value. That is, the final feature subset generation unit 150 adds the third feature subset to the initial feature set and selects the m feature subsets with the highest evaluation values as the final feature subsets.

FIG. 3 is a flowchart illustrating a feature selection method for classifying multiple labels according to an embodiment of the present invention, FIG. 4 shows an algorithm for the feature selection method for classifying multiple labels according to an embodiment of the present invention, and FIG. 5 shows an algorithm for the exploration operator according to an embodiment of the present invention.

Referring to FIG. 3, the device generates m (m is a natural number) initial feature sets P(t), each having as its components first features selected from the full feature set (S310), and evaluates the initial feature sets P(t) (S320). That is, the device generates the m initial feature sets P(t) by randomly assigning binary bits so that at most n features of the full feature set are selected. Here, n is the maximum number of features allowed in a feature subset S_c. The device then evaluates the initial feature sets P(t) using a fitness function, for which it uses the multi-label classification error. Because each of the m initial feature sets P(t) must be evaluated to obtain a fitness value, m fitness function calls (FFCs) are used.
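The initialization in S310 can be sketched as follows. This is a minimal Python illustration, assuming that each chromosome is a binary list over all features, that the subset size is drawn uniformly between 1 and n, and that the function name `init_population` and the example sizes are hypothetical.

```python
import random

def init_population(num_features, m, n, rng):
    """Create m binary chromosomes, each with between 1 and n bits set."""
    population = []
    for _ in range(m):
        k = rng.randint(1, n)  # subset size within the budget (assumed uniform)
        chosen = set(rng.sample(range(num_features), k))
        population.append([1 if i in chosen else 0 for i in range(num_features)])
    return population

# Example: 20 chromosomes over 50 features with at most 10 features each.
population = init_population(num_features=50, m=20, n=10, rng=random.Random(0))
```

Evaluating each of these chromosomes once with the fitness function accounts for the m FFCs of the initialization step.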

When the initialization above is complete, the device generates a first feature subset (offspring set) G(t) using genetic operators (S330). Here, the device generates the first feature subset G(t) using a crossover operator, a mutation operator, and the like that control the number of selected features.

Meanwhile, because the first feature subset selects only a small number of features, no more than n, most features remain unselected. An exploration operator is therefore needed to search the unexplored features.

Accordingly, once step S330 has been performed, the device applies an exploration operator to the first feature subset (offspring set) G(t) to generate a second feature subset E(t) (S340). For each offspring produced by the genetic operators, the device repeatedly selects relevant features that the offspring did not select, maximizing the objective function, until the size of the feature subset reaches |S_c|, where |S_c| may be the subset size. To balance the genetic operators and the exploration operator, the size of the second feature subset E(t) may be equal to the size of the first feature subset G(t); because E(t) must also be evaluated to determine its fitness, the size of E(t) is set equal to that of G(t). The algorithm for the exploration operator may be as shown in FIG. 5. Referring to FIG. 5, for each offspring c in the first feature subset (offspring set) generated by the genetic operators, relevant features not selected by c are repeatedly selected so as to maximize the objective function until the size of the feature subset reaches |S_c|, where |S_c| is the subset size of c. The exploration operator therefore needs no additional parameter to determine the number of features to select. To perform the exploration, an effective filter method called SCLS (scalable criterion for large label set) is used as the objective function Q(f⁺, L), where L is the label set. The i-th feature is selected from the feature set F \ {S_c ∪ Z} by identifying the f_i that maximizes the relevance evaluation value. Here, Z may be the feature subset holding the i−1 features already chosen when the i-th feature is selected.
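The greedy selection described for the exploration operator can be sketched in Python as below. The function name `explore` and the `score(feature, Z)` callable are hypothetical stand-ins for the algorithm of FIG. 5 and the objective function Q, which are not reproduced in this text; features are assumed to be integer indices.

```python
def explore(offspring, all_features, score):
    """Build a new subset of size |S_c| from features the offspring did not select.

    offspring    -- set of feature indices S_c chosen by the genetic operators
    all_features -- iterable of every feature index in F
    score        -- callable score(feature, Z) giving the relevance of `feature`
                    given the list Z of features already chosen (assumed interface)
    """
    candidates = set(all_features) - set(offspring)
    chosen = []  # Z: features picked so far, in selection order
    while len(chosen) < len(offspring) and candidates:
        # Pick the unselected feature with the highest relevance given Z.
        best = max(sorted(candidates), key=lambda f: score(f, chosen))
        chosen.append(best)
        candidates.remove(best)
    return chosen
```

Because the loop stops at the offspring's own size, no extra parameter for the number of features is needed, which mirrors the point made in the text.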

Once step S340 has been performed, the device combines the first feature subset and the second feature subset to generate a third feature subset N(t) (S350) and evaluates the third feature subset (S360). That is, the device generates the third feature subset N(t) of the t-th population and evaluates it using a fixed number of FFCs. The device may use 2·|G(t)| FFCs per generation.

Once step S360 has been performed, N(t) is added to P(t), and the m feature subsets with the highest fitness values are selected (S370).

Steps S330 through S370 are repeated until all permitted FFCs have been used (S380). Through this process, an optimal feature subset may be selected.
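Steps S320 through S380 can be summarized by the following Python skeleton. It is a sketch under stated assumptions: `fitness` is treated as higher-is-better, `make_offspring` and `explore` are caller-supplied callables standing in for the genetic and exploration operators, and the last generation is clipped so the FFC budget is not exceeded. None of these names come from the source.

```python
def evolve(population, fitness, make_offspring, explore, max_ffc):
    """Budgeted evolutionary search; returns (scored population, FFCs used)."""
    m = len(population)
    scored = [(fitness(s), s) for s in population]            # S320: m FFCs
    used = m
    while used < max_ffc:                                      # S380: stop at budget
        g = make_offspring([s for _, s in scored])            # S330: G(t)
        e = [explore(c) for c in g]                            # S340: E(t), same size as G(t)
        n_t = (g + e)[: max_ffc - used]                        # S350: N(t), clipped to budget
        if not n_t:
            break                                              # no offspring produced
        scored += [(fitness(s), s) for s in n_t]               # S360: up to 2*|G(t)| FFCs
        used += len(n_t)
        scored.sort(key=lambda pair: pair[0], reverse=True)   # S370: keep the best m
        scored = scored[:m]
    return scored, used
```

Each pass through the loop spends at most 2·|G(t)| fitness evaluations, matching the per-generation count given above.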

Hereinafter, the effects of the present invention are described through experimental results.

The present method was tested on 20 different data sets from various domains. The Birds data set is audio data containing examples of multiple bird calls. The Emotions data set is music data classified into six emotional clusters. The Enron, Language Log (LLog), and Slashdot data sets were generated from text mining applications; each feature corresponds to the occurrence of a word, and each label indicates the relevance of a text pattern to a specific topic. The Genbase and Yeast data sets come from the biological domain and contain information about the functions of genes and proteins. The Mediamill data set is video data from an automatic detection system. The Medical data set was sampled from a large corpus of suicide letters obtained through natural language processing of clinical free text.

The Scene data set relates to the semantic indexing of still scenes, where each scene may contain multiple objects. The TMC2007 data set contains safety reports on complex space systems. The remaining nine data sets come from the Yahoo data set collection. Unsupervised dimensionality reduction was performed on the text data sets that consist of more than 10,000 features, including TMC2007 and the Yahoo collection. Specifically, the top 2% and the top 5% of features with the highest document frequency were retained for the TMC2007 and Yahoo data sets, respectively. In the text mining domain, prior studies report that classification performance is not significantly affected when only 1% of the features are retained based on document frequency.
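Retaining the features with the highest document frequency can be sketched as follows. This Python example assumes that each document is a list of word tokens and that ties are broken alphabetically; the function name `top_by_document_frequency` is hypothetical, and the exact preprocessing used in the experiments is not described in the text.

```python
def top_by_document_frequency(documents, keep_fraction):
    """Return the words that occur in the most documents, best first."""
    df = {}
    for doc in documents:
        for word in set(doc):  # count each word once per document
            df[word] = df.get(word, 0) + 1
    keep = max(1, int(len(df) * keep_fraction))
    ranked = sorted(df, key=lambda w: (-df[w], w))
    return ranked[:keep]
```

With `keep_fraction=0.02` or `keep_fraction=0.05` this corresponds to the 2% and 5% settings mentioned above.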

Table 1 shows standard statistics for the multi-label data sets used in the experiments.

[Table 1]

Table 1 lists, for each data set, the number of patterns |W|, the number of features |F|, the feature type, and the number of labels |L|. When the feature type is numeric, the features are discretized using the supervised discretization method of the multi-label naive Bayes classifier (MLNB). Specifically, each observed numeric value is assigned to one of several bins, which are determined automatically by the discretization method.

The label cardinality, Card, is the average number of labels per instance. The label density, Den, is the label cardinality divided by the total number of labels. Distinct is the number of distinct label subsets in L. Domain indicates the application from which each data set was extracted.
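These three statistics follow directly from their definitions, as the short Python sketch below shows. The function name `label_statistics` and the toy data are illustrative only and are not taken from Table 1.

```python
def label_statistics(label_sets, num_labels):
    """Return (Card, Den, Distinct) for a list of per-instance label sets."""
    card = sum(len(s) for s in label_sets) / len(label_sets)  # average labels per instance
    den = card / num_labels                                   # cardinality over |L|
    distinct = len({frozenset(s) for s in label_sets})        # unique label subsets
    return card, den, distinct

# Toy example with four instances and |L| = 4.
card, den, distinct = label_statistics([{0, 1}, {1}, {0, 1}, {2}], num_labels=4)
```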

The average size of the selected feature subset was measured for the present method and for existing multi-label feature selection methods (RGA, which uses restricted genetic operators, NSGA-II, and MPSOFS). Each method may select at most 10 features. To support reproducibility, the detailed parameter settings are given below.

RGA creates m = 20 initial solutions by randomly selecting, for each chromosome, fewer than n = 10 features. Each solution of the initial population P(t), with t = 0, is evaluated using a multi-label classifier. RGA then generates an offspring set N(t) using genetic operators. To apply the crossover operator, two solutions are randomly selected from P(t) and paired. One solution of P(t) is then randomly selected and mutated. Restricted crossover and restricted mutation operators were used here, with both the crossover rate and the mutation rate set to 1.0. Thus, in each iteration RGA creates three new solutions, which make up N(t). Each newly created solution is evaluated using the multi-label classifier. To create P(t + 1), N(t) is added to P(t) and the 20 solutions with the highest fitness values are selected. This procedure is repeated until RGA has used 100 FFCs.

NSGA-II randomly creates m = 20 initial solutions, the same number as RGA. The maximum number of allowed features is set to |F|, because NSGA-II naturally minimizes the number of selected features. Each solution of P(t) is evaluated using the employed multi-label classifier and the number of features. NSGA-II then generates N(t) with |N(t)| = 3, the same setting as RGA. To create P(t + 1), N(t) is added to P(t), and the superiority of each solution is determined by the non-dominated sorting method. Once the ranking among the solutions in {P(t) ∪ N(t)} has been determined, the top 20 solutions are selected to form P(t + 1). This procedure is repeated until NSGA-II has used 100 FFCs.

MPSOFS randomly creates 20 initial solutions, the same number as RGA. Each solution of P(t) is evaluated using the employed multi-label classifier and the number of features, and is ranked using the non-dominated sorting method. MPSOFS then preserves the best solution of P(t), called the global best solution. The best solution experienced by each chromosome is also preserved; this is called the individual best solution, so there are 20 individual best solutions. MPSOFS then updates the representation of each chromosome based on the global best solution and its own individual best solution, using a velocity with an inertia weight of 0.7298 and two acceleration coefficients of 1.4962. After all chromosomes of P(t) have been modified, they are evaluated and regarded as P(t + 1). This procedure is repeated until MPSOFS has used 100 FFCs.
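For orientation, the canonical particle swarm velocity update with these coefficients looks as follows in Python. This is the textbook update rule, shown as an assumption; how MPSOFS encodes feature subsets as positions and handles its multiple objectives is not described in this text, and the function name `pso_step` is hypothetical.

```python
import random

# Inertia weight and the two acceleration coefficients quoted in the text.
W, C1, C2 = 0.7298, 1.4962, 1.4962

def pso_step(position, velocity, personal_best, global_best, rng):
    """One canonical PSO update of a single particle.

    Returns (new_position, new_velocity) as lists of floats.
    """
    new_velocity = []
    for x, v, p, g in zip(position, velocity, personal_best, global_best):
        r1, r2 = rng.random(), rng.random()  # fresh random factors per dimension
        new_velocity.append(W * v + C1 * r1 * (p - x) + C2 * r2 * (g - x))
    new_position = [x + v for x, v in zip(position, new_velocity)]
    return new_position, new_velocity
```

When a particle already sits on both its individual best and the global best, only the inertia term remains and the velocity shrinks by the factor 0.7298.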

Although other parameter settings might improve performance, for a fair comparison the population size m was fixed at 20 and the number of FFCs used, v, was fixed at 100 for all methods. To assess the quality of the feature subsets obtained by each method, the MLNB classifier was used, because it outputs the predicted label subset based on the intrinsic characteristics of the given data set without a complex parameter tuning process that could affect the final multi-label classification performance. For fairness, hold-out cross-validation was used in each experiment: 80% of the samples in a given data set were randomly selected as the training set for multi-label feature selection and classifier training, and the remaining 20% were used as the test set to measure multi-label classification performance. For both RGA and the proposed method, the population size was set to 20 and the maximum number of permitted FFCs was set to 100. Each experiment was repeated 10 times, and the average value was used to represent the classification performance of each feature selection method.
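The repeated 80/20 hold-out protocol can be sketched in Python as below. The `evaluate(train, test)` callable is a hypothetical placeholder for feature selection, classifier training, and scoring; the seed handling and function names are assumptions made for the example.

```python
import random

def holdout_split(num_samples, train_fraction, rng):
    """Shuffle sample indices and split them into (train, test) lists."""
    indices = list(range(num_samples))
    rng.shuffle(indices)
    cut = int(round(num_samples * train_fraction))
    return indices[:cut], indices[cut:]

def repeated_holdout(num_samples, evaluate, repeats=10, train_fraction=0.8, seed=0):
    """Average evaluate(train, test) over several random hold-out splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        train, test = holdout_split(num_samples, train_fraction, rng)
        scores.append(evaluate(train, test))
    return sum(scores) / len(scores)
```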

Four evaluation measures were used: Hamming loss, multi-label accuracy, ranking loss, and normalized coverage. Let T = {(T_i, λ_i) | 1 ≤ i ≤ |T|} be the given test set, where λ_i ⊆ L is the correct label subset. For a given test sample T_i, a classifier such as MLNB outputs a set of confidence values 0 ≤ ψ_{i,l} ≤ 1, one for each label l ∈ L. If the confidence value ψ_{i,l} is greater than a predefined threshold such as 0.5, the label l is included in the predicted label subset Y_i. Based on the ground truth λ_i, the confidence values ψ_{i,l}, and the predicted label subset Y_i, multi-label classification performance can be measured with each evaluation measure.
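The thresholding rule that turns confidence values into a predicted label subset is simple enough to state in a few lines of Python. The function name `predict_label_subset` and the dictionary representation of the confidences are assumptions for illustration.

```python
def predict_label_subset(confidences, threshold=0.5):
    """Return Y_i: the labels whose confidence is strictly above the threshold.

    confidences -- dict mapping each label l to its confidence value in [0, 1]
    """
    return {label for label, psi in confidences.items() if psi > threshold}
```

A label whose confidence equals the threshold exactly is left out, since the text requires the value to be greater than the threshold.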

Multi-label accuracy is calculated using Equation 9 below.

[Equation 9]

Hamming loss is calculated using Equation 10 below.

[Equation 10]

Here, T is the given test set, λ is the correct label subset, and Δ denotes the symmetric difference between two sets.
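Equations 9 and 10 are not reproduced in this text. The Python sketch below uses the commonly used definitions, assumed here to match: multi-label accuracy as the average ratio |λ_i ∩ Y_i| / |λ_i ∪ Y_i|, and Hamming loss as the average of |λ_i Δ Y_i| / |L|. Scoring an instance as 1.0 when both sets are empty is a further assumption.

```python
def multilabel_accuracy(true_sets, pred_sets):
    """Average Jaccard overlap between correct and predicted label subsets."""
    total = 0.0
    for truth, pred in zip(true_sets, pred_sets):
        union = truth | pred
        # Assumption: an instance with no true and no predicted labels counts as correct.
        total += len(truth & pred) / len(union) if union else 1.0
    return total / len(true_sets)

def hamming_loss(true_sets, pred_sets, num_labels):
    """Average size of the symmetric difference, normalized by the number of labels."""
    total = sum(len(truth ^ pred) / num_labels for truth, pred in zip(true_sets, pred_sets))
    return total / len(true_sets)
```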

Ranking loss is calculated using Equation 11 below.

[Equation 11]

Here, λ̄_i is the complementary set of λ_i. Ranking loss measures, over all pairs of a relevant and an irrelevant label, the average fraction of pairs (a, b) that are ordered incorrectly, that is, with ψ_{i,a} ≤ ψ_{i,b}.
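A Python sketch of ranking loss under the usual definition is given below; Equation 11 itself is not reproduced in this text, so the details are assumptions. In particular, an instance with no relevant or no irrelevant labels contributes zero while still counting toward the average, and confidences are passed as one dictionary per instance.

```python
def ranking_loss(true_sets, confidences, all_labels):
    """Average fraction of (relevant, irrelevant) label pairs ranked in the wrong order.

    true_sets   -- list of correct label subsets, one per test instance
    confidences -- list of dicts mapping every label to its confidence value
    all_labels  -- iterable of all labels in L
    """
    total = 0.0
    for truth, psi in zip(true_sets, confidences):
        irrelevant = set(all_labels) - truth
        if not truth or not irrelevant:
            continue  # no pair exists; assumed to contribute 0
        wrong = sum(1 for a in truth for b in irrelevant if psi[a] <= psi[b])
        total += wrong / (len(truth) * len(irrelevant))
    return total / len(true_sets)
```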

Finally, normalized coverage is calculated using Equation 12 below.

[Equation 12]

Here, rank(·) returns the rank of the corresponding label l ∈ λ_i when the labels are sorted in non-increasing order of ψ_{i,l}. Normalized coverage thus measures how many labels must be marked as positive for all relevant labels to be positive. Higher multi-label accuracy and lower Hamming loss, ranking loss, and normalized coverage indicate better classification performance.
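The following Python sketch illustrates one way to compute normalized coverage. Equation 12 is not reproduced in this text, so the exact normalization is an assumption: here the worst 1-based rank among the relevant labels is divided by |L|, ties receive the best available rank, and instances without relevant labels contribute zero. Some definitions subtract 1 from the rank before normalizing.

```python
def normalized_coverage(true_sets, confidences, all_labels):
    """Average of (worst rank among relevant labels) / |L| over the test instances."""
    labels = list(all_labels)
    total = 0.0
    for truth, psi in zip(true_sets, confidences):
        if not truth:
            continue  # nothing to cover; assumed to contribute 0
        # Rank 1 is the highest confidence; a label's rank is 1 plus the number
        # of labels with strictly higher confidence.
        ranks = [1 + sum(1 for other in labels if psi[other] > psi[l]) for l in truth]
        total += max(ranks) / len(labels)
    return total / len(true_sets)
```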

In addition, to verify the performance of the present method, a Wilcoxon signed-rank test is performed. Let d_i be the performance difference between the two methods on the i-th data set. The differences are ranked by absolute value, with the smallest |d_i| receiving rank 1; in case of ties, the average rank is assigned. R⁺ may be the sum of ranks for the data sets on which the compared method outperforms the present method, defined as in Equation 13 below.

[Equation 13]

Let R⁻ be the sum of ranks for the data sets on which the present method outperforms the compared method. Then, based on the critical value of the Wilcoxon test for a significance level of α = 0.05 and N = 20, the difference between the compared methods is significant if min(R⁺, R⁻) is less than or equal to 8. In this case, the null hypothesis of equal performance is rejected.
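The rank sums can be computed as in the Python sketch below. Equation 13 is not reproduced in this text; the sketch assumes the common convention in which ranks of zero differences are split evenly between R⁺ and R⁻, and it leaves the sign convention of d_i to the caller. The function name `signed_rank_sums` is hypothetical.

```python
def signed_rank_sums(differences):
    """Return (R_plus, R_minus) for a list of paired performance differences d_i."""
    order = sorted(range(len(differences)), key=lambda i: abs(differences[i]))
    ranks = [0.0] * len(differences)
    pos = 0
    while pos < len(order):
        # Find the run of tied absolute values starting at `pos`.
        end = pos
        while (end + 1 < len(order)
               and abs(differences[order[end + 1]]) == abs(differences[order[pos]])):
            end += 1
        average = (pos + end) / 2 + 1  # mean of the 1-based ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = average
        pos = end + 1
    r_plus = sum(r for r, d in zip(ranks, differences) if d > 0)
    r_minus = sum(r for r, d in zip(ranks, differences) if d < 0)
    zero = sum(r for r, d in zip(ranks, differences) if d == 0)
    # Assumption: ranks of zero differences are shared equally by both sums.
    return r_plus + zero / 2, r_minus + zero / 2
```

The smaller of the two returned sums is the statistic that the text compares against the critical value.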

The experimental results obtained with the settings above are as follows.

First, the comparison results are described.

Table 2 shows the mean size and standard deviation of the feature subsets selected by the present method and by the existing multi-label feature selection methods when the evaluation measure is multi-label accuracy.

[Table 2]

In Table 2, the symbol X marks a method that failed to satisfy the given constraint for the corresponding data set. The present method and RGA selected fewer than 10 features for all data sets. Although NSGA-II and MPSOFS have an objective function that minimizes the feature subset size, they failed to select fewer than 10 features for every data set except the Mediamill data set under NSGA-II. Because NSGA-II and MPSOFS failed to select fewer than 10 features on most data sets, the present method was compared with RGA in the subsequent experiments. Here, N denotes a setting larger than 10, such as 30 or 50. The results in Table 2 show that NSGA-II and MPSOFS fail to satisfy the given constraint because the final feature subsets they output consist of tens or hundreds of features in most experiments.

Tables 3 and 4 contain the experimental results of the present method and RGA on the 20 multi-label data sets, presented as the average performance over hold-out cross-validation with the corresponding standard deviation. Table 3 contains the results for multi-label accuracy and Hamming loss, and Table 4 contains the results for ranking loss and normalized coverage.

[Table 3]

[Table 4]

In Tables 3 and 4, the better performance of the two methods is marked in bold and with the symbol √. Finally, Table 5 shows the results of the Wilcoxon signed-rank test of the present method against RGA on the Genbase data set, with a significance threshold of α = 0.05.

[Table 5]

In Table 5, for each evaluation measure the winner of each comparison is shown in bold, and the corresponding rank sum R⁺ of the better-performing method and the p-value are shown in parentheses. Similar tendencies can be observed for the same experiment on the other multi-label data sets.

As Tables 3 and 4 show, the present method outperforms RGA on most multi-label data sets. Specifically, the present method achieves the best performance on 90% of the data sets for multi-label accuracy, 95% for Hamming loss, 95% for ranking loss, and 100% for normalized coverage. The present method therefore substantially outperforms RGA on all evaluation measures. This is also evident from the results in Table 5, which indicate that the present method is statistically superior to RGA.

Next, the experimental results of the present invention are analyzed.

FIG. 6 is a graph comparing the convergence of RGA and the present method as a function of the number of FFCs used (u), in terms of multi-label accuracy. In FIG. 6, the horizontal axis is u and the vertical axis is multi-label performance. Because convergence can differ from run to run owing to the stochastic nature of population-based search, both algorithms were started from the same initial population, and the multi-label accuracy of the top elite in the population was averaged over 10 runs. FIG. 6 shows that multi-label accuracy improves monotonically with u. The initialization step uses 20 FFCs, and because both methods start from the same randomly generated population, both gradually improve multi-label accuracy. However, the results show that the multi-label accuracy of the present method improves dramatically once u reaches 20 or more, because the exploration operator is applied to the population after initialization. FIG. 6 thus indicates that the present method can efficiently locate good feature subsets among the unselected features.

The goal of the exploration operator is to introduce new, promising features that can effectively improve multi-label classification performance. To verify the effectiveness of the exploration operator, an additional experiment was performed comparing the fitness values of the offspring set generated by the exploration operator with those generated by a random operator. Specifically, 50 chromosomes, denoted G, each selecting 10 or fewer features, were created with the same initialization procedure as RGA, and the exploration operator was applied to each chromosome of G to generate 50 new chromosomes as the first offspring set. For comparison, a second offspring set was then generated by randomly selecting and introducing new features for each chromosome of G. Finally, the fitness values of the first and second offspring sets were measured with the four performance measures.

FIG. 7 shows box plots of the fitness values of the two offspring sets on the Genbase data set. According to the results, the fitness values of the first offspring set (proposed) are much higher than those of the second offspring set (random) in terms of all measures, which confirms that the search capability of the exploration operator is much higher than that of random search.

In conclusion, the present invention uses an effective evolutionary-search-based feature selection method under a budget constraint for multi-label classification. Because a feature subset selects only a small number of features, within the maximum allowed number, and most features therefore remain unselected under the budget constraint, a new exploration operator is used to find relevant features among the unselected ones. Experiments on 20 real data sets showed that the exploration operator successfully improves the search capability of the genetic search, resulting in improved multi-label classification. They also showed that the present method successfully finds feature subsets that do not violate the budget constraint. Statistical tests showed that the present method outperforms the existing method on all four performance measures.

Meanwhile, embodiments of the present invention may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of an embodiment of the present invention, and vice versa.

The present invention has been described above with reference to its preferred embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be implemented in modified forms without departing from its essential characteristics. Therefore, the disclosed embodiments should be considered in a descriptive sense rather than a restrictive one. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

100: multi-label feature selection apparatus
110: feature set generation unit
120: first feature subset generation unit
130: second feature subset generation unit
140: third feature subset generation unit
150: final feature subset generation unit

Claims

A method of selecting a feature for multi-label pattern classification, performed by a multi-label feature selection apparatus, the method comprising:
generating, by a feature set generation unit, an initial feature set including, as an element, a first feature selected according to a preset constraint from a full feature set including a plurality of features constituting a plurality of patterns that can be classified with multiple labels;
generating, by a first feature subset generation unit, a first feature subset based on the first features of the initial feature set;
calculating, by a second feature subset generation unit, a relevance evaluation value for each of a plurality of second features that belong to the full feature set but not to the first feature subset, and generating a second feature subset based on the relevance evaluation values;
generating, by a third feature subset generation unit, a third feature subset based on the first feature subset and the second feature subset; and
calculating, by a final feature subset generation unit, a fitness value for the third feature subset, and generating a final feature subset based on the fitness value,
wherein the constraint includes a maximum number n of features to be included in each feature set or feature subset (n is a natural number) and a total number m of feature sets or feature subsets (m is a natural number), and
wherein generating the initial feature set comprises:
generating m initial feature sets each including n or fewer first features; and
evaluating each of the initial feature sets.
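
The initialization step recited above can be sketched as follows. This is a minimal illustration only, assuming features are identified by integer indices and that subset sizes are drawn uniformly at random; the function names `generate_initial_feature_sets` and `evaluate_feature_sets` and the placeholder fitness are hypothetical and do not come from the specification.

```python
import random

def generate_initial_feature_sets(num_features, n, m, seed=None):
    """Return m feature sets, each holding between 1 and n distinct feature indices."""
    if not 1 <= n <= num_features:
        raise ValueError("n must be between 1 and num_features")
    rng = random.Random(seed)
    population = []
    for _ in range(m):
        size = rng.randint(1, n)  # size never exceeds the budget n
        population.append(frozenset(rng.sample(range(num_features), size)))
    return population

def evaluate_feature_sets(population, fitness):
    """Evaluate every initial feature set with a caller-supplied fitness function."""
    return [fitness(subset) for subset in population]

# Illustrative usage: 100 features, at most n = 10 per set, m = 20 sets.
population = generate_initial_feature_sets(num_features=100, n=10, m=20, seed=0)
# Placeholder fitness (assumption): a real one would score multi-label classification quality.
scores = evaluate_feature_sets(population, fitness=lambda s: -len(s))
```

Drawing the size first and then sampling without replacement guarantees by construction that no initial feature set violates the maximum n.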

(Deleted)

The method of claim 1,
wherein the first feature subset is generated by applying a genetic operator to the initial feature set.

A method of selecting a feature for multi-label pattern classification, performed by a multi-label feature selection apparatus, the method comprising:
generating, by a feature set generation unit, an initial feature set including, as an element, a first feature selected according to a preset constraint from a full feature set including a plurality of features constituting a plurality of patterns that can be classified with multiple labels;
generating, by a first feature subset generation unit, a first feature subset based on the first features of the initial feature set;
calculating, by a second feature subset generation unit, a relevance evaluation value for each of a plurality of second features that belong to the full feature set but not to the first feature subset, and generating a second feature subset based on the relevance evaluation values;
generating, by a third feature subset generation unit, a third feature subset based on the first feature subset and the second feature subset; and
calculating, by a final feature subset generation unit, a fitness value for the third feature subset, and generating a final feature subset based on the fitness value,
wherein generating the second feature subset is performed based on a feature correlation function obtained by subtracting a second correlation function, which defines the correlation between the first feature and the second feature using a mutual information measure, from a first correlation function, which defines the correlation between the second feature and a label using the mutual information measure.
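
The relevance evaluation described in this claim can be sketched in Python as follows. This is a hedged illustration, not the patented implementation: it assumes discrete-valued features and labels, estimates mutual information from empirical counts, and sums the label term over all labels and the redundancy term over all already-selected features. The names `mutual_information` and `relevance_scores` are hypothetical.

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete 1-D arrays."""
    _, x_idx = np.unique(np.asarray(x).ravel(), return_inverse=True)
    _, y_idx = np.unique(np.asarray(y).ravel(), return_inverse=True)
    joint = np.zeros((x_idx.max() + 1, y_idx.max() + 1))
    np.add.at(joint, (x_idx, y_idx), 1.0)   # joint count table
    joint /= joint.sum()                    # joint probabilities
    px = joint.sum(axis=1, keepdims=True)   # marginal of x
    py = joint.sum(axis=0, keepdims=True)   # marginal of y
    nz = joint > 0                          # skip empty cells (0 * log 0 = 0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

def relevance_scores(X, Y, selected):
    """Score each unselected feature: relevance to the labels minus redundancy with selected features."""
    scores = {}
    for i in range(X.shape[1]):
        if i in selected:
            continue
        relevance = sum(mutual_information(X[:, i], Y[:, l]) for l in range(Y.shape[1]))
        redundancy = sum(mutual_information(X[:, i], X[:, f]) for f in selected)
        scores[i] = relevance - redundancy
    return scores

# Illustrative data (assumption): labels are copies of features 3 and 4.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))
Y = X[:, [3, 4]].copy()
scores = relevance_scores(X, Y, selected={0})
best = max(scores, key=scores.get)
```

Features with the highest scores would be the natural candidates for the second feature subset, since they are informative about the labels while adding little that the first feature subset already captures.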

The method of claim 4,
wherein the feature correlation function is defined by the following equation:
[Equation]
where l is a label, L is the label set, M is the mutual information measure, which quantifies the correlation between its input variables, f_i is the second feature, and f is the first feature; the term from which the subtraction is made is the first correlation function, and the subtracted term is the second correlation function.
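
For orientation, one form that is consistent with the subtraction recited in claim 4 and with the symbols defined in this claim is given below. It is an assumption offered for illustration, not the equation of the original publication: the function name J and the set S (the first feature subset over which f ranges) are hypothetical symbols introduced here.

```latex
J(f_i) \;=\; \sum_{l \in L} M(f_i;\, l) \;-\; \sum_{f \in S} M(f_i;\, f)
```

Under this reading, the first sum plays the role of the first correlation function (relevance of the second feature to the labels) and the second sum plays the role of the second correlation function (redundancy with the first features).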

The method of claim 1,
wherein generating the third feature subset comprises evaluating the third feature subset using a number of FFCs corresponding to (number of second feature subsets).

The method of claim 1,
further comprising, after generating the final feature subset, repeating the generating of the second feature subset, the generating of the third feature subset, and the generating of the final feature subset until all of the preset FFCs have been used.
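
The repetition recited in this claim can be sketched as a loop with an evaluation budget. The sketch assumes that FFC denotes fitness function calls, the usual meaning in evolutionary search, and that each evaluation of a feature subset consumes one call. The names `search_with_ffc_budget` and `toy_candidates`, and the toy fitness `len`, are hypothetical stand-ins for the actual operators.

```python
def search_with_ffc_budget(population, fitness, make_candidates, max_ffc):
    """Repeat candidate generation and selection until the fitness-call budget is spent."""
    if len(population) > max_ffc:
        raise ValueError("budget is too small to evaluate the initial population")
    scored = [(fitness(s), s) for s in population]
    used = len(scored)              # each evaluation consumes one FFC
    m = len(population)             # population size stays fixed
    while used < max_ffc:
        # Never generate more candidates than the remaining budget can evaluate.
        candidates = make_candidates([s for _, s in scored])[: max_ffc - used]
        if not candidates:
            break
        scored += [(fitness(c), c) for c in candidates]
        used += len(candidates)
        scored.sort(key=lambda pair: pair[0], reverse=True)
        scored = scored[:m]         # keep the m fittest subsets
    return scored, used

def toy_candidates(subsets, n=3, num_features=6):
    """Hypothetical stand-in: extend each subset with its lowest missing index, within budget n."""
    out = []
    for s in subsets:
        missing = [i for i in range(num_features) if i not in s]
        if len(s) < n and missing:
            out.append(frozenset(s | {missing[0]}))
    return out

final, used = search_with_ffc_budget(
    [frozenset({0}), frozenset({1})],
    fitness=len,                    # toy fitness (assumption): larger subsets score higher
    make_candidates=toy_candidates,
    max_ffc=10,
)
```

Truncating the candidate list to the remaining budget ensures that the total number of fitness evaluations never exceeds the preset limit, whatever the candidate generator returns.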

A multi-label feature selection apparatus comprising:
a feature set generation unit configured to generate an initial feature set including, as an element, a first feature selected according to a preset constraint from a full feature set including a plurality of features constituting a plurality of patterns that can be classified with multiple labels;
a first feature subset generation unit configured to generate a first feature subset based on the first features of the initial feature set;
a second feature subset generation unit configured to calculate a relevance evaluation value for each of a plurality of second features that belong to the full feature set but not to the first feature subset, and to generate a second feature subset based on the relevance evaluation values;
a third feature subset generation unit configured to generate a third feature subset based on the first feature subset and the second feature subset; and
a final feature subset generation unit configured to calculate a fitness value for the third feature subset and to generate a final feature subset based on the fitness value,
wherein the constraint includes a maximum number n of features to be included in each feature set or feature subset (n is a natural number) and a total number m of feature sets or feature subsets (m is a natural number), and
wherein the feature set generation unit generates m initial feature sets each including n or fewer first features, and evaluates each of the initial feature sets.