KR20100097951A

KR20100097951A - Method and apparatus for classifying multivariate stream data

Info

Publication number: KR20100097951A
Application number: KR1020090016855A
Authority: KR
Inventors: 강재우; 서성보
Original assignee: 고려대학교 산학협력단
Priority date: 2009-02-27
Filing date: 2009-02-27
Publication date: 2010-09-06
Also published as: KR101064617B1

Abstract

PURPOSE: A method and an apparatus for classifying multivariate stream data are provided to classify multivariate stream data accurately by extracting and using the important features of stream data input from sensors. CONSTITUTION: A data converter (310) uses symbols to convert input multivariate stream data into one character string. A substring generator (320) applies an n-gram scheme to the converted character string to create a set of substrings each having n syllables. A motif extractor (330) extracts, from the set of substrings, substrings that can serve as a motif for each class. A data classifier (360) classifies the multivariate stream data into one of a set of classes based on the extracted substrings.

Description

METHOD AND APPARATUS FOR CLASSIFYING MULTIVARIATE STREAM DATA

The present invention relates to a method and apparatus for classifying multivariate stream data that take into account the temporal relationships between motifs.

Stream data classification assigns new stream data to one of a predefined set of classes. Most existing classification techniques select the most similar class by considering the numerical distance and statistical distribution of the data. However, multivariate stream data contains patterns with strong associations between attributes and with sequential characteristics, so classification techniques that rely only on simple distance comparison or on the similarity of statistical distributions are limited.

Stream data pattern classification is very useful for discovering patterns in historical data collected by sensors and for classifying new data using those patterns. Unlike traditional time-series data such as stock, weather, and population data, real-world data can now be collected in real time thanks to advances in sensor and wireless network technology. As a result, user demand has expanded from simply monitoring sensor values to classifying the characteristics of current data and predicting future situations.

FIG. 1 is a diagram illustrating an example of stream data.

For example, a mobile robot fitted with several sensors may, while performing a task, transmit the measurements collected by each sensor to a central server at regular time intervals. While working, the robot may encounter situations such as 'rotation', 'grasping', 'collision', and 'obstacle'. Depending on the situation, each sensor may record values that rise sharply, fall sharply, or remain unchanged over time. If a user could look at the multivariate stream data obtained from the sensors and accurately classify the behavior pattern of the remote robot, the user could control the robot or predict its future state.

Existing stream data classification techniques fall broadly into three groups: distance-based techniques, which use a distance measure to select the nearest object; statistical techniques, which use statistical information to select the most similar object; and structural techniques, which use structural information.

A distance-based technique applies a distance measure to the numerical vector of each attribute sequence and selects the nearest object. The most common distance measures are Euclidean distance and Dynamic Time Warping (DTW); the class with the smallest total distance is selected.

A statistical technique, such as a Bayesian classifier, HMM, or RNN, relies on probability theory and the distributional characteristics of the data. These techniques use probability values and distribution characteristics learned in advance to select the most likely (maximum likelihood) class.

Finally, a structural pattern classification technique generates rules, represents the characteristics of the data as a tree or graph structure, and selects the class whose structure is most similar.

However, in sensor network applications the stream data collected through various sensors has data characteristics unique to each class, and strong temporal causal relationships exist between the attributes of the stream data. Classification techniques that use only simple distance, probability, or structure are therefore not suitable for classifying multivariate stream data.

An embodiment of the present invention provides a method and apparatus that extract, from stream data input from a plurality of sensors, important features that distinguish it from stream data of other classes, and that classify multivariate stream data more accurately by considering both the importance of the extracted features within a class and the temporal relationship patterns between the features.

As a technical means for achieving the above technical object, a first aspect of the present invention provides an apparatus for classifying multivariate stream data into one of a predefined set of classes, the apparatus comprising: a data converter that converts input multivariate stream data into one character string using symbols; a substring generator that applies an n-gram method to the converted string to generate a set of substrings each having n syllables (n being a natural number); a motif extractor that extracts, from the generated set of substrings, substrings that can serve as a motif for each class; and a data classifier that classifies the multivariate stream data into one of the classes based on the extracted candidate motif substrings.

A second aspect of the present invention provides a method for classifying multivariate stream data into one of a predefined set of classes, the method comprising: (a) converting the input multivariate stream data into one character string using symbols; (b) applying an n-gram method to the converted string to generate a set of substrings each having n syllables (n being a natural number); (c) extracting, from the generated set of substrings, substrings that can serve as a motif for each class; and (d) classifying the multivariate stream data into one of the classes based on the extracted candidate motif substrings.

According to the above means, numerical multivariate stream data is simplified using symbols tied to the degree of change in the data, which reduces the complexity of the algorithm that interprets the stream data. Because the symbols reflect the degree of change in the data, the generated rules can also be interpreted approximately.

In addition, the use of the n-gram technique solves a problem of existing stream data classification techniques, in which similarity could be compared only when the training data and the test data had the same length.

According to another of the above means, only selected motifs are fed to the classification algorithm, so space and time costs are lower than when the entire data is considered.

According to another of the above means, motifs are selected not simply because they occur frequently in one class but on the basis of a TFIDF value that also takes into account how often they occur in other classes, so stream data can be classified more accurately.

According to yet another of the above means, the probability of a motif occurring, the TFIDF value, and the mutual information value are used to consider the probability values and the structural patterns of the data together, so the rules can be interpreted and stream data can be classified more accurately.

According to yet another of the above means, the invention can be applied to behavior analysis using robot data, sign language recognition, event analysis and classification using biometric data, and analysis of multivariate stream data collected in the RFID and USN fields.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings so that a person of ordinary skill in the art to which the invention pertains can readily practice it. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described here. In the drawings, parts unrelated to the description are omitted for clarity, and like reference numerals designate like parts throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this covers not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element in between. When a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

The present invention is described in detail below with reference to the accompanying drawings.

FIG. 2 is a schematic diagram of a multivariate stream data classification system according to an embodiment of the present invention.

This specification uses as its running example the multivariate stream data obtained when a person expresses the word 'YES' in sign language while wearing, on both hands, gloves each fitted with 11 sensors that detect motions such as roll, pitch, and yaw in addition to the x, y, and z directions.

When the person wearing the gloves expresses the word 'YES' in sign language, each sensor collects data at regular time intervals and transmits it to the multivariate stream data classification apparatus 300.

The multivariate stream data transmitted to the multivariate stream data classification apparatus 300 may consist of values that rise sharply, fall sharply, or remain unchanged over time while the word 'YES' is being expressed. Because data is collected from 22 sensors, there are 22 streams. That is, 22 streams of data are obtained while the class 'YES' is expressed in sign language.

The multivariate stream data classification apparatus 300 may convert the numerical multivariate stream data for each class into one character string using symbols. Each symbol is tied to the degree of increase or decrease in the data between two time points of the multivariate stream data.

For example, the multivariate stream data classification apparatus 300 may convert the multivariate stream data into one character string using five symbols: U, u, D, d, and S. Here U denotes a sharp rise in the data between two time points, u a gradual rise, D a sharp fall, d a gradual fall, and S stabilization.

The multivariate stream data classification apparatus 300 may also generate a set of substrings by applying an n-gram method to the converted string. The n-gram method divides one long string into units of n adjacent syllables while preserving their order. For example, applying the 2-gram method to '대한민국' yields the substrings '대한', '한민', and '민국'.
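
The n-gram split described above can be sketched in a few lines of Python. This is only an illustration: the function name `ngrams` is hypothetical, and one character is treated as one syllable, which holds for precomposed Hangul and for the single-letter symbols used in this description.

```python
def ngrams(text: str, n: int) -> list[str]:
    """Return every run of n adjacent characters in text, in original order."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # A sliding window of width n moves one character at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(ngrams("대한민국", 2))  # ['대한', '한민', '민국']
print(ngrams("UsDUs", 3))    # ['UsD', 'sDU', 'DUs']
```

A string shorter than n simply produces an empty list, so streams of different lengths can be processed without special handling.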

The multivariate stream data classification apparatus 300 may also extract, from the generated set of substrings, substrings that can serve as a motif for each class. A motif here is an event that can be an important feature representing a class. For example, from the substrings of the multivariate stream data obtained when the word 'YES' is expressed in sign language, the apparatus 300 may extract the strings that appear characteristically only when the word 'YES' is expressed.

The multivariate stream data classification apparatus 300 may also generate temporal relationship information between any two of the extracted candidate motif substrings that fall within a preset time range, and may calculate the weight that this temporal relationship information carries in each class.

As described above, the multivariate stream data classification apparatus 300 can detect specific events, or classify states and behaviors using the detected events, not only in sign language recognition but in any field that uses stream data collected as multiple variables, such as behavior analysis using robot data, event analysis and classification using biometric data, RFID (Radio Frequency Identification), and USN (Ubiquitous Sensor Network).

The detailed configuration of the multivariate stream data classification apparatus 300 and the function of each component are described below with reference to FIG. 3.

FIG. 3 is a detailed block diagram of a multivariate stream data classification apparatus according to an embodiment of the present invention.

As shown, the multivariate stream data classification apparatus according to an embodiment of the present invention includes a data converter 310, a substring generator 320, a motif extractor 330, a temporal information generator 340, a mutual information generator 350, and a data classifier 360.

The data converter 310 may convert the numerical multivariate stream data for each class into one character string, using symbols tied to the degree of increase or decrease in the data between two time points.

To convert the data into a string of symbols, the data converter 310 may first normalize the numerical multivariate stream data. Normalization maps values whose ranges differ from attribute to attribute onto values between -1 and 1. It makes values comparable across attributes and shows how far each successive attribute value lies from the mean.

The data converter 310 may also compute the cumulative probability distribution of the differences between the normalized values at consecutive time points and determine breakpoints from that distribution. A breakpoint is a boundary that decides which symbol is assigned.
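
One way to derive breakpoints from a cumulative distribution is to pick the cut points that divide it into equally probable regions. The sketch below assumes, purely for illustration, that the differences follow a standard normal distribution; the description does not state which distribution is used, and the function name `equiprobable_breakpoints` is hypothetical. Under that assumption, five symbols give cut points of about ±0.25 and ±0.84, which coincide with the boundary values used in this description.

```python
from statistics import NormalDist


def equiprobable_breakpoints(num_symbols: int) -> list[float]:
    """Cut points that split a standard normal distribution into equal-probability bins.

    Assumption: a standard normal model stands in for the cumulative
    distribution of the normalized differences.
    """
    dist = NormalDist()  # mean 0, standard deviation 1
    return [dist.inv_cdf(k / num_symbols) for k in range(1, num_symbols)]


print([round(b, 2) for b in equiprobable_breakpoints(5)])
# [-0.84, -0.25, 0.25, 0.84]
```

If the empirical distribution of the differences were used instead, the same idea would apply with sample quantiles in place of `inv_cdf`.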

For example, the symbols may be U, u, D, d, and S (written in lowercase, s, in the example strings of this description), where U denotes a sharp rise in the data between two time points, u a gradual rise, D a sharp fall, d a gradual fall, and S stabilization. Specifically, S is used when the difference between the normalized values at two consecutive time points is between -0.25 and +0.25, u when it is between +0.25 and +0.84, U when it is +0.84 or more, d when it is between -0.25 and -0.84, and D when it is -0.84 or less.
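
The threshold rule above can be written as a small lookup. The following is a minimal sketch, not the patented implementation: the function names are hypothetical, the input is assumed to be already normalized, and the treatment of a difference of exactly ±0.25 (assigned to S here) is an assumption, since the text gives the ranges without saying which side owns that boundary.

```python
def symbol_for(diff: float) -> str:
    """Map one difference between consecutive normalized values to a symbol."""
    if diff >= 0.84:
        return "U"  # sharp rise (+0.84 or more)
    if diff > 0.25:
        return "u"  # gradual rise
    if diff >= -0.25:
        return "S"  # stabilization; exact +/-0.25 lands here by assumption
    if diff > -0.84:
        return "d"  # gradual fall
    return "D"      # sharp fall (-0.84 or less)


def to_symbol_string(values: list[float]) -> str:
    """Encode a normalized stream as one symbol per pair of consecutive samples."""
    return "".join(symbol_for(b - a) for a, b in zip(values, values[1:]))


print(to_symbol_string([0.0, 1.0, 1.1, 0.6, -0.5]))  # USdD
```

A stream of k samples yields a string of k - 1 symbols, because each symbol describes the change between two neighboring time points.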

FIG. 4 illustrates multivariate stream data converted into character strings according to an embodiment of the present invention.

As shown, each stream can be converted into a character string of symbols after normalization. That is, the data converter 310 may convert one stream into one long string using the five symbols.

For example, the converted string may be '16DsDU,…,sss'. The number 16 identifies the attribute of the 16th of the 22 sensors, which is the measurement collected when the palm is waved in one direction. 'UsDU,…,sss' means that, as time passes, the data value changes as 'sharp rise (U) → stabilization (s) → sharp fall (D) → sharp rise (U), …, → stabilization (s)'.

Returning to FIG. 3, the substring generator 320 may apply an n-gram method to the string generated by the data converter 310 to generate a set of substrings each having n syllables (n being a natural number). For example, the substring generator 320 may generate one or more of the following: the set of substrings with two syllables, the set with three syllables, the set with four syllables, and the set with five syllables.

Because stream data exhibits locality between consecutive values, a set of several substrings can represent the characteristics of the data better than one long string. In addition, dividing one ordered long string into substrings that preserve the order makes it possible to compare substrings within long strings.

FIG. 5 illustrates a set of substrings according to an embodiment of the present invention.

For example, as shown, a set of substrings may be generated by applying the 1-gram through 5-gram methods to the 22 strings converted using the five symbols. In other words, the substring generator 320 generates the set by applying the 1-gram through 5-gram methods to the 22 strings received from the data converter 310.
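
As a rough illustration of this step, the sketch below collects the 1-gram through 5-gram substrings of one sensor's symbol string and prefixes each with the sensor number, following the naming seen in this description (for example '22uud' for sensor 22). The function name `substring_counts` and the choice to keep occurrence counts in a `Counter` are assumptions made for the example.

```python
from collections import Counter


def substring_counts(sensor_id: int, symbols: str, max_n: int = 5) -> Counter:
    """Count every 1-gram to max_n-gram of one sensor's string, tagged with the sensor number."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(symbols) - n + 1):
            counts[f"{sensor_id}{symbols[i:i + n]}"] += 1
    return counts


counts = substring_counts(22, "uudsu")
print(counts["22u"], counts["22uud"], len(counts))  # 3 1 13
```

Keeping the counts rather than a plain set is convenient because the occurrence frequency of each substring is needed later when weights are computed.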

Returning to FIG. 3, the motif extractor 330 may extract, from the set of substrings generated by the substring generator 320, substrings that can serve as a motif for each class. Using motifs, the important features that represent a class, is essential for improving the quality and speed of data classification. Classification techniques that process very large data face two problems. First, they require long running times because there is so much data to process. Second, it is difficult to determine rigorously which features represent each class. When analyzing a huge document containing hundreds of features, or sensor data collected simultaneously from dozens of sensors, selecting and analyzing only the important features greatly improves both the accuracy and the speed of classification.

The motif extractor 330 may use a TFIDF (Term Frequency and Inverse Document Frequency) value to extract substrings that can serve as motifs. TFIDF is a technique for determining the weight of each substring for each class, and the TFIDF value is determined by Equation 1:

$$\mathrm{tfidf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_{t} \qquad (1)$$

In Equation 1, $\mathrm{tf}_{t,d}$ is the term frequency, the number of times the word (t) occurs in the class or document (d). $\mathrm{idf}_{t}$ is the inverse document frequency, which is determined by Equation 2:

$$\mathrm{idf}_{t} = \log \frac{N}{\mathrm{df}_{t}} \qquad (2)$$

In Equation 2, $\mathrm{df}_{t}$ is the document frequency, the number of documents in which the word (t) occurs at least once, and $N$ is the total number of documents. Therefore, the inverse document frequency is low if many documents contain the word (t) and high if only particular documents contain it.

Because the TFIDF value is the product of the term frequency and the inverse document frequency, it is higher the more often a substring appears in one class and the less often it appears in the other classes. A substring that appears often in a particular class but rarely in other classes can therefore be a motif of that class.
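
The weighting can be sketched as follows, treating each class as a "document" and each substring as a "word". The logarithmic form of the inverse document frequency, log(N / df), is the common textbook definition and is an assumption here, as are the function name `tfidf_table` and the occurrence counts, which are invented for illustration only.

```python
import math
from collections import Counter


def tfidf_table(class_counts: dict[str, Counter]) -> dict[str, dict[str, float]]:
    """Weight every substring in every class by tf * log(N / df)."""
    n_classes = len(class_counts)
    df = Counter()
    for counts in class_counts.values():
        df.update(counts.keys())  # each class containing the substring adds 1
    return {
        label: {term: tf * math.log(n_classes / df[term]) for term, tf in counts.items()}
        for label, counts in class_counts.items()
    }


# Illustrative counts only; they are not taken from the patent's figures.
counts = {
    "boy": Counter({"22uud": 4, "5s": 10}),
    "come": Counter({"22uud": 3, "5s": 8}),
    "yes": Counter({"16UsD": 5, "5s": 9}),
    "no": Counter({"5s": 7}),
}
weights = tfidf_table(counts)
print(weights["boy"]["5s"])               # 0.0
print(round(weights["boy"]["22uud"], 2))  # 2.77
```

In this toy data '5s' occurs in every class, so its weight is 0 everywhere, while '22uud' occurs in only two of the four classes and keeps a positive weight in both.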

FIG. 6 is a table of TFIDF values for substrings according to an embodiment of the present invention.

As shown, '22uud' has a high weight only in the 'boy' and 'come' classes, so it can be a motif representing those two classes. That is, when sensor 22 shows gradual rise (u) → gradual rise (u) → gradual fall (d), it can be predicted that the word 'boy' or 'come' was signed.

However, a substring such as '5dsud', whose weight is high in every class or is 0, cannot be a motif representing a class.
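
The selection rule in the two paragraphs above can be illustrated with a simple filter. This is a sketch under stated assumptions: the description does not say how "high" is decided, so the positive cutoff `threshold` is hypothetical, as are the function name `select_motifs` and the weights, which are invented and not read from FIG. 6.

```python
def select_motifs(weights: dict[str, dict[str, float]], threshold: float) -> dict[str, list[str]]:
    """Keep, per class, substrings whose weight is high there but not high in every class."""
    terms = {t for per_class in weights.values() for t in per_class}
    motifs = {label: [] for label in weights}
    for term in sorted(terms):
        strong = [c for c, per_class in weights.items() if per_class.get(term, 0.0) >= threshold]
        # Reject substrings that are strong nowhere (weight 0 or low) or strong everywhere.
        if strong and len(strong) < len(weights):
            for c in strong:
                motifs[c].append(term)
    return motifs


# Illustrative weights only.
weights = {
    "boy": {"22uud": 2.8, "5dsud": 3.1},
    "come": {"22uud": 2.1, "5dsud": 2.9},
    "yes": {"22uud": 0.0, "5dsud": 3.0},
}
print(select_motifs(weights, threshold=1.0))
# {'boy': ['22uud'], 'come': ['22uud'], 'yes': []}
```

With these numbers '22uud' is kept for 'boy' and 'come', while '5dsud' is dropped because it is strong in every class.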

When TFIDF values are computed for all substrings generated by the substring generator 320, the values for substrings produced by the 1-gram method are close to 0, so 1-gram substrings cannot be motifs. Substrings produced by the 2-gram through 5-gram methods, by contrast, are the most likely to become motifs.

Returning to FIG. 3, the temporal information generator 340 may generate temporal relationship information between any two of the candidate motif substrings extracted by the motif extractor 330.

FIG. 7 is a table of temporal relationship information according to an embodiment of the present invention.

As shown, a motif is time-interval data with a start point and an end point.

The temporal relationship between two motifs may be Finish, During, Start, Overlap, Meet, or Before. Finish means that the first and second motifs end at the same time. During means that the first motif is contained within the second motif; the second motif may equally be contained within the first. Start means that the first and second motifs start at the same time. Overlap means that the first and second motifs coincide for part of their duration. Meet means that the second motif starts as soon as the first motif ends; the first motif may equally start as soon as the second ends. Finally, Before means that the second motif starts after the first motif ends, with a gap between them.

The temporal information generator 340 may denote these relationships by the six letters B, M, O, S, D, and F. For example, it may write the temporal relationship between two motifs in the form {first motif (relationship letter) second motif}.
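
The six relationships can be decided from the start and end positions of two motifs. The sketch below is one possible reading, with several assumptions labeled in the comments: intervals are half-open `(start, end)` pairs, every relationship is treated as independent of which motif is listed first, and two identical intervals are reported as Start. The names `relation` and `describe` are hypothetical.

```python
Interval = tuple[int, int]  # (start, end) positions; end is exclusive (assumption)


def relation(a: Interval, b: Interval) -> str:
    """Return one of B, M, O, S, D, F for two motif intervals."""
    if a[0] == b[0]:
        return "S"  # same start (identical intervals also land here, by assumption)
    if a[1] == b[1]:
        return "F"  # same end
    first, second = (a, b) if a[0] < b[0] else (b, a)  # order by start time
    if first[1] < second[0]:
        return "B"  # gap between the two motifs
    if first[1] == second[0]:
        return "M"  # one starts exactly where the other ends
    if second[1] < first[1]:
        return "D"  # the later-starting motif lies inside the other
    return "O"      # partial overlap


def describe(name_a: str, a: Interval, name_b: str, b: Interval) -> str:
    """Format a pair as {first motif(relationship letter)second motif}."""
    return f"{{{name_a}({relation(a, b)}){name_b}}}"


print(describe("5uuud", (0, 5), "4ud", (2, 4)))  # {5uuud(D)4ud}
```

The interval positions in the usage line are invented; only the motif names and the output format follow the examples in this description.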

The mutual information generator 350 may calculate the weight value that each item of time relationship information within a preset time range, among the time relationship information between motifs generated by the time information generator 340, carries in each class.

Although the time information generator 340 generates time relationship information between all motifs over the entire time span, strong associations exist mainly between closely neighboring motifs, so the mutual information generator 350 may consider only the time relationship information between motifs within a certain time range (d).
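Restricting attention to neighboring motifs can be sketched as a filter over motif pairs. The text does not define how the time range (d) is measured, so this sketch assumes it is the gap between the end point of the earlier motif and the start point of the later one; the function name and the tuple layout `(label, start, end)` are likewise assumptions.

```python
from itertools import combinations


def pairs_within_range(motifs, d):
    """Yield motif pairs whose gap is at most d.

    motifs: iterable of (label, start, end) tuples (hypothetical layout).
    d: maximum allowed gap between the earlier motif's end point and the
       later motif's start point (an assumed reading of "time range d").
    """
    ordered = sorted(motifs, key=lambda m: m[1])  # sort by start point
    for earlier, later in combinations(ordered, 2):
        gap = later[1] - earlier[2]  # zero or negative when the intervals touch or overlap
        if gap <= d:
            yield earlier, later
```

Under this reading, overlapping or touching motifs always qualify, and distant pairs are dropped before any weight is calculated.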

FIG. 8 is a diagram illustrating time relationship information between motifs according to an embodiment of the present invention.

For example, the time information generator 340 may detect that the motif '5uuud' contains the motif '4ud' and generate the time relationship information {5uuud(D)4ud}. The time information generator 340 may also find that the motif '5uuud' and the motif '3Ddsd' overlap and generate the time relationship information {5uuud(O)3Ddsd}. Likewise, it may find that the motif '4ud' and the motif '3Ddsd' overlap and generate the time relationship information {4ud(O)3Ddsd}. The time information generator 340 may further generate time relationship information such as {3Ddsd(M)2Ddsd}, {4Ddsd(S)3ds}, {2Ddsd(D)4Ddsd}, and {2Ddsd(D)3ds}.

However, the mutual information generator 350 may consider only the time relationship information between motifs within a certain time range (d).

The mutual information generator 350 may use a mutual information value to calculate the weight value that the time relationship information between two motifs carries in each class. In other words, the mutual information generator 350 may calculate the weight value that the time relationship information formed by two substrings carries in each class by using the mutual information value between any two substrings, within a preset time range, among the substrings extracted by the motif extractor 330 and the time relationship information formed by those two substrings.

The mutual information value is determined by the following equation.

Here, the two single-motif terms are the proportions that the first motif and the second motif, respectively, occupy among the motifs of a specific class. The joint term is the proportion that the time relationship information formed by the first motif and the second motif within the preset time range (d) occupies among the motifs of the specific class.
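One common form that is consistent with the three proportions defined above is pointwise mutual information, the logarithm of the joint proportion divided by the product of the two single-motif proportions. Whether the embodiment uses exactly this form, and which logarithm base it uses, is an assumption; the function and parameter names below are hypothetical.

```python
import math


def mutual_information(p_first: float, p_second: float, p_joint: float) -> float:
    """Pointwise mutual information log2(p_joint / (p_first * p_second)).

    p_first, p_second: proportions of each motif among a class's motifs.
    p_joint: proportion of the time relationship formed by the two motifs
             within the preset time range, among the same class's motifs.
    The base-2 logarithm is an assumption made for this sketch.
    """
    if min(p_first, p_second, p_joint) <= 0.0:
        raise ValueError("all proportions must be positive")
    return math.log2(p_joint / (p_first * p_second))
```

A value above zero means the two motifs form the relationship more often than their individual frequencies would suggest, which is what makes the value usable as a per-class weight.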

FIG. 9 is a table of mutual information values for time relationship information according to an embodiment of the present invention.

As shown, the time relationship information (M) formed by the motifs '2uud' and '5uds' has a high mutual information value, that is, a high weight value, for the class 'girl'.

In other words, for example, if sensor 5 shows a monotonous rise → monotonous decrease → stabilization immediately after sensor 2 shows a monotonous rise → monotonous rise → monotonous decrease, it can be predicted that 'girl' was expressed in sign language.

In addition, the time relationship information (O) formed by the motifs '2udu' and '5uds' has a high weight value for the class 'hello'.

In other words, if sensor 5 shows a monotonous rise → monotonous decrease → stabilization while sensor 2 shows a monotonous rise → monotonous decrease → monotonous rise, it can be predicted that 'hello' was expressed in sign language.
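The motif labels used in these examples can be read mechanically: a leading sensor number followed by one trend symbol per step. The following sketch expands a label into words. The helper name and the output format are assumptions, and the symbol meanings follow the definitions recited in the claims.

```python
# Meaning of each trend symbol, as recited in the claims
TREND_NAMES = {
    "U": "sharp rise",
    "u": "monotonous rise",
    "D": "sharp decrease",
    "d": "monotonous decrease",
    "s": "stabilization",
}


def describe_motif(motif: str) -> str:
    """Expand a motif label such as '2uud' into a readable description."""
    # Split the leading sensor number from the trailing trend symbols
    i = 0
    while i < len(motif) and motif[i].isdigit():
        i += 1
    sensor, symbols = motif[:i], motif[i:]
    if not sensor or not symbols:
        raise ValueError(f"unexpected motif label: {motif!r}")
    trends = " -> ".join(TREND_NAMES[symbol] for symbol in symbols)
    return f"sensor {sensor}: {trends}"
```

Calling `describe_motif("5uds")` gives `sensor 5: monotonous rise -> monotonous decrease -> stabilization`, matching the reading of the 'hello' and 'girl' examples.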

Returning to FIG. 3, the data classifier 360 may classify the multivariate stream data into one of a predefined set of classes based on the substrings, extracted by the motif extractor 330, that can serve as motifs. Furthermore, the data classifier 360 may classify the multivariate stream data into one of the predefined set of classes in consideration of the time relationship information between substrings generated by the time information generator 340. The data classifier 360 may also classify the multivariate stream data into one of the predefined set of classes based on the weight value, calculated by the mutual information generator 350, that the time relationship information carries in each class.
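How the weight values drive the final decision can be sketched as a scoring step. The text does not state how the weights are combined, so this sketch assumes each class's score is the sum of the weights of the time relationships observed in the input, with the highest score winning. The function name and the data layout are hypothetical, and the weights in the usage note are made-up numbers, not values from FIG. 9.

```python
from collections import defaultdict


def classify(observed_relations, weights):
    """Pick the class with the largest summed weight.

    observed_relations: iterable of relationship strings such as "2uud(M)5uds".
    weights: dict mapping (relationship string, class name) to a weight value.
    Summing the weights is an assumed aggregation rule for this sketch.
    """
    observed = set(observed_relations)
    scores = defaultdict(float)
    for (relation, class_name), weight in weights.items():
        if relation in observed:
            scores[class_name] += weight
    # Return None when nothing observed carries a weight for any class
    return max(scores, key=scores.get) if scores else None
```

With illustrative weights `{("2uud(M)5uds", "girl"): 2.1, ("2uud(M)5uds", "hello"): 0.3, ("2udu(O)5uds", "hello"): 1.9}`, an input containing only `2uud(M)5uds` is assigned to `girl`.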

As described above, the apparatus for classifying multivariate stream data according to an embodiment of the present invention discovers the important motifs that represent each class and considers the time relationships between them, thereby enabling more accurate data interpretation and prediction.

FIG. 10 is a flowchart of a method for classifying multivariate stream data according to an embodiment of the present invention.

In step S1000, the data converter 310 may convert the input multivariate stream data into a single character string using symbols associated with the degree of increase or decrease in the data between two time points of the multivariate stream data. For example, the symbols may be U, u, D, d, and s.

In addition, in step S1000, the data converter 310 may normalize the numerical multivariate stream data in order to convert it into a character string using the symbols.

In addition, in step S1000, the data converter 310 may calculate a cumulative probability distribution of the differences between the normalized values at two consecutive time points of the numerical multivariate stream data, and determine breakpoints based on the calculated cumulative probability distribution. Here, a breakpoint is a boundary point used to determine which symbol applies.
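The conversion in step S1000 can be sketched for a single sensor as follows. Instead of deriving breakpoints from a cumulative probability distribution, this sketch uses the fixed breakpoints recited in the claims (-0.84, -0.25, +0.25, +0.84). The use of z-score normalization, the handling of values that fall exactly on a breakpoint, and the function names are assumptions.

```python
import statistics


def symbol_for(diff: float) -> str:
    """Map a difference of normalized values to a trend symbol."""
    # Ties at a breakpoint are resolved toward U, s, and D (an assumption)
    if diff >= 0.84:
        return "U"  # sharp rise
    if diff > 0.25:
        return "u"  # monotonous rise
    if diff >= -0.25:
        return "s"  # stabilization
    if diff > -0.84:
        return "d"  # monotonous decrease
    return "D"      # sharp decrease


def to_symbols(values) -> str:
    """Convert one sensor's numeric stream into a string of trend symbols."""
    values = list(values)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant stream never changes, so every step is a stabilization
        return "s" * (len(values) - 1)
    normalized = [(v - mean) / std for v in values]  # z-score normalization (assumed)
    return "".join(
        symbol_for(cur - prev) for prev, cur in zip(normalized, normalized[1:])
    )
```

A stream of k values produces k - 1 symbols, one per pair of consecutive time points.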

In step S1020, the substring generator 320 applies an n-gram method to the character string converted in step S1000 to generate a set of substrings having n syllables (n is a natural number). For example, in step S1020, the substring generator 320 may generate one or more of a set of substrings having two syllables, a set of substrings having three syllables, a set of substrings having four syllables, and a set of substrings having five syllables.
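The n-gram step can be sketched as a sliding window over the symbol string. The function names are hypothetical, and the sketch works on one sensor's symbol string without the sensor-number prefix that appears in motif labels.

```python
def ngrams(symbols: str, n: int) -> list:
    """Return every contiguous substring of length n, in order of appearance."""
    return [symbols[i:i + n] for i in range(len(symbols) - n + 1)]


def substring_sets(symbols: str, sizes=(2, 3, 4, 5)) -> dict:
    """Build one substring list per length, keyed by that length."""
    return {n: ngrams(symbols, n) for n in sizes}
```

For the string `uuds`, the two-syllable substrings are `uu`, `ud`, and `ds`, and a length longer than the string yields an empty list.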

In step S1040, the motif extractor 330 extracts, from the substring set generated in step S1020, the substrings that can serve as motifs for each class.

In step S1040, the motif extractor 330 may use a TFIDF (Term Frequency and Inverse Document Frequency) value to extract the substrings that can serve as motifs. The TFIDF value is higher the more often a particular substring appears in a particular class and the less often it appears in other classes. Accordingly, a substring that appears often in a particular class but rarely in other classes can be a motif of that class. The formula for the TFIDF value is the same as described above and is therefore omitted here.
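The behavior described for the TFIDF value can be illustrated with a textbook form, in which each class is treated as one document. The formula below (relative frequency within the class times the logarithm of the number of classes over the number of classes containing the substring) is an assumption and may differ from the formula given earlier in the description; the function name and data layout are hypothetical.

```python
import math
from collections import Counter


def tfidf_by_class(substrings_by_class):
    """Score each substring per class with a textbook TF-IDF (assumed form).

    substrings_by_class: dict mapping class name to a list of substrings.
    Returns a dict mapping class name to {substring: score}.
    """
    n_classes = len(substrings_by_class)
    counts = {c: Counter(subs) for c, subs in substrings_by_class.items()}
    # Number of classes in which each substring appears at least once
    class_frequency = Counter()
    for counter in counts.values():
        class_frequency.update(counter.keys())
    scores = {}
    for class_name, counter in counts.items():
        total = sum(counter.values())
        scores[class_name] = {
            sub: (k / total) * math.log(n_classes / class_frequency[sub])
            for sub, k in counter.items()
        }
    return scores
```

Under this form a substring that occurs in every class scores zero, so only class-specific substrings remain as motif candidates.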

In step S1060, the time information generator 340 generates time relationship information between any two of the motifs extracted in step S1040.

For example, the time relationship information may be Before, Meet, Overlap, Start, During, or Finish, and each may be represented by one of the six characters B, M, O, S, D, and F.

In step S1080, the mutual information generator 350 calculates the weight value that each item of time relationship information within a preset time range, among the time relationship information between motifs generated in step S1060, carries in each class.

Even though the time information generator 340 generates time relationship information between all motifs over the entire time span in step S1060, the mutual information generator 350 may consider only the time relationship information between motifs within a certain time range (d).

In step S1080, the mutual information generator 350 may use a mutual information value to calculate the weight value that the time relationship information between two motifs carries in each class. The formula for the mutual information value is the same as described above and is therefore omitted here.

In step S1100, the data classifier 360 classifies the input multivariate stream data into one of a predefined set of classes. For example, in step S1100, the data classifier 360 may classify the multivariate stream data into one of the predefined set of classes based on the substrings, extracted in step S1040, that can serve as motifs. Furthermore, in step S1100, the data classifier 360 may classify the multivariate stream data into one of the predefined set of classes in consideration of the time relationship information between substrings generated in step S1060. The data classifier 360 may also classify the multivariate stream data into one of the predefined set of classes based on the weight value, calculated in step S1080, that the time relationship information carries in each class.

The apparatus and method for classifying multivariate stream data according to an embodiment of the present invention can be used not only for sign language recognition using stream data but also in any field that uses stream data collected as multiple variables, such as behavior analysis using robot data, event analysis and classification using biometric data, RFID, and USN, in order to detect specific events or to classify states and behaviors using the detected events.

In addition, because the apparatus and method for classifying multivariate stream data according to an embodiment of the present invention consider only motifs, they save spatial and temporal costs compared with considering the entire data. Moreover, because the temporal relationships between motifs are reflected in the rules, the classification rules can be interpreted clearly, and because probability and temporal structure information are considered together, data can be classified more accurately than with existing techniques that simply consider data distance or data distribution.

An embodiment of the present invention may also be implemented in the form of a recording medium containing instructions executable by a computer, such as a program module executed by the computer. Computer-readable media can be any available media that can be accessed by a computer and include both volatile and nonvolatile media, and both removable and non-removable media. In addition, computer-readable media may include both computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Communication media typically include computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave, or another transmission mechanism, and include any information delivery media.

The foregoing description of the present invention is provided for illustration, and those of ordinary skill in the art to which the present invention pertains will understand that it can easily be modified into other specific forms without changing the technical spirit or essential features of the present invention. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present invention is defined by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

FIG. 1 is a diagram illustrating an example of stream data.

FIG. 2 is a schematic diagram of a multivariate stream data classification system according to an embodiment of the present invention.

FIG. 3 is a detailed block diagram of an apparatus for classifying multivariate stream data according to an embodiment of the present invention.

FIG. 4 is a diagram illustrating multivariate stream data converted into character strings according to an embodiment of the present invention.

FIG. 5 is a diagram illustrating a substring set according to an embodiment of the present invention.

FIG. 6 is a table of TFIDF values for substrings according to an embodiment of the present invention.

FIG. 7 is a time relationship information table according to an embodiment of the present invention.

FIG. 8 is a diagram illustrating time relationship information between motifs according to an embodiment of the present invention.

FIG. 9 is a table of mutual information values for time relationship information according to an embodiment of the present invention.

FIG. 10 is a flowchart of a method for classifying multivariate stream data according to an embodiment of the present invention.

Claims

An apparatus for classifying multivariate stream data into one of a predefined set of classes, the apparatus comprising:

a data converter that converts input multivariate stream data into a single character string using symbols;

a substring generator that generates a set of substrings having n syllables (n is a natural number) by applying an n-gram method to the converted character string;

a motif extractor that extracts, from the generated substring set, a substring that can be a motif for each class; and

a data classifier that classifies the multivariate stream data into one of the classes based on the extracted substring.

The apparatus of claim 1, further comprising:

a time information generator that generates time relationship information between any two of the extracted substrings; and

a mutual information generator that calculates a weight value that the time relationship information within a preset time range carries in each class,

wherein the data classifier classifies the multivariate stream data into one of the classes based on the time relationship information and the weight value.

The apparatus of claim 1,

wherein the data converter normalizes the multivariate stream data, calculates a cumulative probability distribution of the differences between the normalized values at two consecutive time points of the multivariate stream data, determines a breakpoint based on the calculated cumulative probability distribution, and determines the symbol based on the breakpoint.

The apparatus of claim 1,

wherein the substring generator generates at least one of a set of substrings having two syllables, a set of substrings having three syllables, a set of substrings having four syllables, and a set of substrings having five syllables.

The apparatus of claim 3,

wherein the symbol is associated with a degree of increase or decrease in the data between two time points of the multivariate stream data.

The apparatus of claim 3,

wherein the symbols include U, u, D, d, and s,

and wherein, for the data between two time points of the multivariate stream data, U denotes a sharp rise, u denotes a monotonous rise, D denotes a sharp decrease, d denotes a monotonous decrease, and s denotes stabilization.

The apparatus of claim 3,

wherein the symbols include U, u, D, d, and s,

and wherein s is used when the difference between the normalized values at the two consecutive time points is from -0.25 to +0.25, u is used when the difference is from +0.25 to +0.84, U is used when the difference is +0.84 or more, d is used when the difference is from -0.25 to -0.84, and D is used when the difference is -0.84 or less.

The apparatus of claim 1,

wherein the motif extractor uses a TFIDF (Term Frequency and Inverse Document Frequency) value that determines a weight of each substring for each class.

The apparatus of claim 2,

wherein the time relationship information includes Before, Meet, Overlap, Start, During, and Finish.

The apparatus of claim 9,

wherein the mutual information generator uses a mutual information value between any two substrings, within a preset time range, among the extracted substrings and the time relationship information formed by the two substrings.

A method of classifying multivariate stream data into one of a predefined set of classes, the method comprising:

(a) converting input multivariate stream data into a single character string using symbols;

(b) generating a set of substrings having n syllables (n is a natural number) by applying an n-gram method to the converted character string;

(c) extracting, from the generated substring set, a substring that can be a motif for each class; and

(d) classifying the multivariate stream data into one of the classes based on the extracted substring that can be a motif.

The method of claim 11, further comprising:

(e) generating time relationship information between any two of the extracted substrings; and

(f) calculating a weight value that the time relationship information within a preset time range carries in each class,

wherein step (d) classifies the multivariate stream data into one of the classes based on the time relationship information and the weight value.

The method of claim 11,

wherein step (a) comprises:

(a1) normalizing the multivariate stream data;

(a2) calculating a cumulative probability distribution of the differences between the normalized values at two consecutive time points of the multivariate stream data; and

(a3) determining a breakpoint based on the calculated cumulative probability distribution.

The method of claim 11,

wherein step (b) comprises generating at least one of a set of substrings having two syllables, a set of substrings having three syllables, a set of substrings having four syllables, and a set of substrings having five syllables.

The method of claim 13,

wherein the symbols include U, u, D, d, and s,

and wherein, for the data between two time points of the multivariate stream data, U denotes a sharp rise, u denotes a monotonous rise, D denotes a sharp decrease, d denotes a monotonous decrease, and s denotes stabilization.

The method of claim 13,

wherein the symbols include U, u, D, d, and s,

and wherein s is used when the difference between the normalized values at the two consecutive time points is from -0.25 to +0.25, u is used when the difference is from +0.25 to +0.84, U is used when the difference is +0.84 or more, d is used when the difference is from -0.25 to -0.84, and D is used when the difference is -0.84 or less.

The method of claim 11,

wherein step (c) uses a TFIDF (Term Frequency and Inverse Document Frequency) value that determines a weight of each substring for each class.

The method of claim 12,

wherein the time relationship information includes Before, Meet, Overlap, Start, During, and Finish.

The method of claim 19,

wherein step (f) uses a mutual information value between any two substrings, within a preset time range, among the extracted substrings and the time relationship information formed by the two substrings.

A computer-readable recording medium having recorded thereon a program for performing the steps according to any one of claims 11 to 20.

An application device for performing an application operation using multivariate stream data classified according to any one of claims 1 to 10.

An application apparatus for performing an application operation using multivariate stream data classified according to any one of claims 11 to 20.