KR101178068B1

KR101178068B1 - Text category classification apparatus and its method

Info

Publication number: KR101178068B1
Application number: KR1020050063893A
Authority: KR
Inventors: 심유진; 구명완; 김재인
Original assignee: 주식회사 케이티; 서치캐스트 주식회사; 중앙대학교 산학협력단; 한국산업기술평가관리원; 주식회사 솔트룩스
Priority date: 2005-07-14
Filing date: 2005-07-14
Publication date: 2012-08-30
Also published as: KR20070008991A

Abstract

The present invention relates to an apparatus and method for classifying text into categories. An input vector is extracted from input text (a sentence), either directly or using a normalization scheme or a mapping table; a classification matrix is generated; and a similarity is generated to classify the category of the text.

To this end, the present invention provides a text category classification apparatus comprising: input vector extraction means for extracting an input vector from input text (a sentence) using a normalization scheme or a mapping table; classification matrix generation means for generating a classification matrix using the input vector extracted by the input vector extraction means; similarity generation means for generating a similarity by decomposing the classification matrix generated by the classification matrix generation means using singular vectors; and category classification means for classifying the category of the input text using the similarity generated by the similarity generation means.

Text, Category Classification, Input Vector Extraction, Classification Matrix Generation, Similarity Generation, Normalization, Mapping Table, Speech Recognition, Automatic Call Classification

Description

Text category classification apparatus and its method

FIG. 1 is a block diagram of an embodiment of a text category classification apparatus according to the present invention;

FIG. 2 is a flowchart of an embodiment of a text category classification method according to the present invention;

FIG. 3 shows an example of normalization according to the present invention; and

FIG. 4 shows an example of a mapping table according to the present invention.

* Description of reference numerals for the main parts of the drawings

10: input vector extraction unit
11: first normalization unit
12: word classification unit
13: input vector generation unit
20: classification matrix generation unit
21: accumulated count vector generation unit
22: second normalization unit
23: weighting unit
24: matrix generation unit
30: similarity generation unit
31: matrix decomposition unit
32: pseudo-document matrix generation unit
33: similarity acquisition unit
40: category classification unit

The present invention relates to an apparatus and method for classifying text into categories. More particularly, it relates to a text category classification apparatus and method that extract an input vector from input text (a sentence), either directly or using a normalization scheme or a mapping table, generate a classification matrix, and generate a similarity in order to classify the category of the text.

As speech recognition has become widespread, people have begun to use voice for tasks that previously required buttons or typed text. When using such speech recognition services, people want the machine to understand even sentences that are spoken naturally.

In that context, interest has grown in building machines that can take a freely spoken sentence, grasp the speaker's intent, and respond to it.

One such application is a call classification system: a user calls and states what they want, and the machine (the call classification system) determines what the caller wants and forwards (connects) the call to the department that can handle it.

At present, this call classification role is performed mainly by human agents, for example in call centers and information services. An automated system could conveniently take over that role, with benefits in labor cost and user convenience.

Various attempts are currently being made to improve the performance of such call classification services.

In one example, several classifiers were combined to raise classification accuracy, and the combination was evaluated on a bank's call classification system. In another, a method was proposed that uses phoneme sequences instead of element words as the basic unit and uses a different language model and classifier for each category.

In addition, classifiers generally treat all classes as equally important, whereas in actual call classification some categories are clearly more important. Methods that give more weight to important categories during training have therefore also been proposed.

However, because these conventional methods must combine several classifiers or use a different language model and classifier for each category, their algorithms are overly complex and they are difficult to implement in practice.

Meanwhile, when such a method is applied to, for example, an automatic call classification system for a telecommunications company's customer center, the range of input data is wide while training data is difficult to obtain. A text category classification method that performs well even with little data is therefore required.

The present invention has been proposed to solve the above problems. Its object is to provide a text category classification apparatus and method that classify the category of text by extracting an input vector from input text (a sentence), generating a classification matrix, and generating a similarity.

Another object of the present invention, proposed to solve the above problems and meet the above need, is to provide a text category classification apparatus and method that classify the category of text by extracting an input vector from input text (a sentence) using a normalization scheme, generating a classification matrix, and generating a similarity.

Yet another object of the present invention, proposed to solve the above problems and meet the above need, is to provide a text category classification apparatus and method that classify the category of text by extracting an input vector from input text (a sentence) using a mapping table, generating a classification matrix, and generating a similarity.

Other objects and advantages of the present invention can be understood from the following description and will become clearer from the embodiments of the present invention. It will also be readily appreciated that the objects and advantages of the present invention can be realized by the means indicated in the claims and combinations thereof.

To achieve the above object, the apparatus of the present invention is a text category classification apparatus comprising: input vector extraction means for extracting an input vector from input text (a sentence) using a normalization scheme or a mapping table; classification matrix generation means for generating a classification matrix using the input vector extracted by the input vector extraction means; similarity generation means for generating a similarity by decomposing the classification matrix generated by the classification matrix generation means using singular vectors; and category classification means for classifying the category of the input text using the similarity generated by the similarity generation means.

To achieve the other object, the input vector extraction means comprises: first normalization means for normalizing the words of the input text (sentence) according to the semantic information they carry in context; and word classification means for classifying the input vector element words by arranging the text normalized by the first normalization means in word units.

To achieve the further object, the input vector extraction means comprises: word classification means for classifying the input vector element words by arranging the input text (sentence) in word units; and input vector generation means for generating an input vector for the input text using the mapping table for key element words.

Meanwhile, to achieve the above object, the method of the present invention is a text category classification method comprising: an input vector extraction step of extracting an input vector from input text (a sentence) using a normalization scheme or a mapping table; a classification matrix generation step of generating a classification matrix using the extracted input vector; a similarity generation step of generating a similarity by decomposing the generated classification matrix using singular vectors; and a category classification step of classifying the category of the input text using the generated similarity.

To achieve the other object, the input vector extraction step comprises: a first normalization step of normalizing the words of the input text (sentence) according to the semantic information they carry in context; and a word classification step of classifying the input vector element words by arranging the text normalized in the first normalization step in word units.

To achieve the further object, the input vector extraction step comprises: a word classification step of classifying the input vector element words by arranging the input text (sentence) in word units; and an input vector generation step of generating an input vector for the input text using the mapping table for key element words.


As described, among text category classification algorithms the present invention focuses on a vector-based approach. That is, the present invention extracts an input vector from an input sentence, generates a classification matrix, and generates a similarity to classify the category of the text. However, in most text category classification systems the range of input data is wide while training data is difficult to obtain. An approach that performs well even with little data is therefore needed, and two such approaches are presented. The first normalizes words that have similar meanings in context during input vector extraction. The second uses a mapping table for key element words during input vector extraction.

The above objects, features, and advantages will become more apparent from the following detailed description taken in conjunction with the accompanying drawings, so that a person of ordinary skill in the art to which the present invention pertains can readily practice its technical idea. In describing the present invention, detailed descriptions of related known techniques are omitted where they would unnecessarily obscure the gist of the invention. A preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings.

FIG. 1 is a block diagram of an embodiment of a text category classification apparatus according to the present invention.

As shown in FIG. 1, the text category classification apparatus according to the present invention includes: an input vector extraction unit 10 for extracting an input vector from input text (a sentence); a classification matrix generation unit 20 for generating a classification matrix using the input vector extracted by the input vector extraction unit 10; a similarity generation unit 30 for generating a similarity using the classification matrix generated by the classification matrix generation unit 20; and a category classification unit 40 for classifying the category of the input text using the similarity generated by the similarity generation unit 30.

Here, the input text (sentence) may be text obtained by speech recognition of the user's utterance, or it may be entered directly as text.

The input vector extraction unit 10 includes a first normalization unit 11 for normalizing the words of the input text (sentence) according to the semantic information they carry in context, and a word classification unit 12 for classifying the input vector element words by arranging the text normalized by the first normalization unit 11 in word units and deleting words that do not help classification.

Alternatively, the input vector extraction unit 10 includes a word classification unit 12 for classifying the input vector element words by arranging the input text (sentence) in word units and deleting words that do not help classification, and an input vector generation unit 13 for generating an input vector for the input text using a mapping table for key element words.

The classification matrix generation unit 20 includes: an accumulated count vector generation unit 21 for generating an accumulated count vector using the input vectors extracted by the input vector extraction unit 10; a second normalization unit 22 for normalizing the accumulated count vector generated by the accumulated count vector generation unit 21, in order to compensate for the excess weight given to frequently occurring element words; a weighting unit 23 for compensating for the discriminative power of element words specific to a particular category; and a matrix generation unit 24 for generating the classification matrix from the outputs of the second normalization unit 22 and the weighting unit 23.

The similarity generation unit 30 includes: a matrix decomposition unit 31 for decomposing the classification matrix generated by the matrix generation unit 24; a pseudo-document matrix generation unit 32 for generating a pseudo-document matrix using a matrix obtained by the matrix decomposition unit 31 and the input vector; and a similarity acquisition unit 33 for obtaining the similarity using the pseudo-document matrix generated by the pseudo-document matrix generation unit 32 and the scaled document matrix obtained by the matrix decomposition unit 31. The similarity acquisition unit 33 calculates the cosine score between the pseudo-document matrix and the scaled document matrix to obtain the similarity between the input vector and each category.

The detailed operation of each component and specific embodiments are described below with reference to FIGS. 2 to 4.

FIG. 2 is a flowchart of an embodiment of a text category classification method according to the present invention.

The text category classification method according to the present invention can be broadly divided into three steps.

The first is the input vector (query vector) extraction step. For example, the input vector may be extracted using a predetermined list of words that can affect the category classification of the text, or it may be extracted using a normalization process that normalizes the words of the input text (sentence) according to the semantic information they carry in context.

The second is the classification matrix generation step, in which, for example, a classification matrix is generated by performing accumulated count vector generation, normalization, weighting, and matrix generation.

The third is the similarity generation step, which generates a confidence score used to determine the category of the text.

Each step is described in detail below with reference to FIG. 3.

First, the input vector (query vector) extraction step is described.

It is important that the input vector be discriminative: it should cause the input to be classified into the correct category while being rejected for the other categories. The text (sentence) used as input is usually the result of speech recognition (although text that is not a speech recognition result may also be input), so it cannot be guaranteed to match exactly what the person said. The input vector must therefore be robust across various conditions.

Accordingly, rather than using every input word as an element of the input vector, only the words that affect category classification should be used as elements.

In the present invention, the input text (sentence) is arranged in word units, and words that do not help category classification, such as "예" ("yes"), "저" ("um"), and "해주세요" ("please do"), are deleted. The element words (relevant terms) of the input vector are classified in this way (202).
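The word classification step (202) can be sketched as a simple stop-list filter. This is a minimal illustration, not the patented implementation: it assumes that splitting on whitespace is an adequate stand-in for word-unit segmentation, and the names `STOP_WORDS` and `classify_words` are hypothetical. Only the three stop words cited in the text are used.

```python
# Hypothetical stop list: the description cites only these three examples.
STOP_WORDS = {"예", "저", "해주세요"}


def classify_words(sentence):
    """Split a sentence into word units and drop words that do not help classification.

    Assumption: whitespace splitting approximates word-unit segmentation.
    """
    return [word for word in sentence.split() if word not in STOP_WORDS]


# Example: the filler words are removed and the content words remain.
print(classify_words("예 저 요금 조회 해주세요"))
```

A real system would likely need a larger stop list tuned on the training data.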

In addition, where necessary, a normalization process (201) is performed before the word separation process (202), so that the semantic information a word can carry in context is used instead of the word itself. That is, the words of the input text (sentence) are normalized according to the semantic information they carry in context. Some examples of this normalization process are shown in FIG. 3.
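One way to picture the normalization process (201) is as a set of rewrite rules that replace surface forms with a class token. The sketch below is purely illustrative: FIG. 3 is not reproduced in this text, so the two rules (month expressions and currency amounts) and the names `RULES` and `normalize` are assumptions rather than the examples given in the patent.

```python
import re

# Hypothetical rules, NOT taken from FIG. 3: each pattern is replaced by a class token.
RULES = [
    (re.compile(r"\d+월"), "<MONTH>"),   # e.g. "3월" (March)
    (re.compile(r"\d+원"), "<AMOUNT>"),  # e.g. "25000원" (25,000 won)
]


def normalize(sentence):
    """Replace words with a token for the semantic class they carry in context."""
    for pattern, token in RULES:
        sentence = pattern.sub(token, sentence)
    return sentence


print(normalize("3월 요금 25000원"))
```

Collapsing many surface forms into one token reduces the vocabulary, which is consistent with the stated goal of working with limited training data.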

The number of element words remaining after the normalization process (201) is the vocabulary size used by the text category classification apparatus, and it is the dimension of the input vector (query vector).

Meanwhile, words that are similar in form and share the same morphological meaning can be made to correspond to the same element word by creating and using a mapping table such as the one shown in FIG. 4. That is, the input vector for the input text is generated using the mapping table for key element words (203).
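A mapping table of this kind can be sketched as a dictionary from surface forms to a key element word. FIG. 4 is not reproduced in this text, so the entries below and the names `MAPPING_TABLE` and `map_words` are hypothetical.

```python
# Hypothetical entries, NOT taken from FIG. 4: inflected forms map to one key element word.
MAPPING_TABLE = {
    "요금이": "요금",
    "요금을": "요금",
    "청구서를": "청구서",
}


def map_words(words):
    """Map each word to its key element word; unmapped words pass through unchanged."""
    return [MAPPING_TABLE.get(word, word) for word in words]


print(map_words(["요금을", "조회"]))
```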

Here, an element word that appears once in the input text has the value 1, one that appears twice has the value 2, and element words that do not appear have the value 0. The resulting input vector is used as input for training and testing.
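The count-based input vector described here can be sketched as follows. The vocabulary and the name `query_vector` are hypothetical; the only behavior taken from the text is that each element holds the number of times its element word appears, with 0 for absent words.

```python
from collections import Counter

# Hypothetical vocabulary of element words; its length is the vector dimension.
VOCABULARY = ["요금", "조회", "해지"]


def query_vector(words, vocabulary=VOCABULARY):
    """Return one count per vocabulary entry, in vocabulary order."""
    counts = Counter(words)
    return [counts[term] for term in vocabulary]


# "요금" appears twice, "조회" once, "해지" not at all.
print(query_vector(["요금", "요금", "조회"]))
```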

Among the above processes, the normalization process (201) and the mapping table process (203) are additional components for improving the category classification rate with limited data.

Next, the classification matrix (routing matrix) generation step is described.

Once an input vector has been generated from the element words, it carries with it the information on its corresponding category.

An accumulated count vector is then generated using the extracted input vectors (204). For example, the input vectors belonging to the n-th category are collected and summed to generate the accumulated count vector A_n (204). If there are N categories and the dimension of the input vector, that is, the number of element words, is M, an M x N matrix of accumulated counts can be generated, in which the index t denotes an element word and d denotes a category.
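The accumulation step (204) can be sketched as a sum of labeled input vectors into an M x N table. The function name `accumulate` and the representation of training data as `(vector, category_index)` pairs are assumptions made for illustration.

```python
def accumulate(labeled_vectors, num_terms, num_categories):
    """Build the M x N accumulated count matrix A, indexed as A[t][d].

    labeled_vectors: iterable of (input_vector, category_index) pairs (assumed format).
    """
    A = [[0] * num_categories for _ in range(num_terms)]
    for vector, category in labeled_vectors:
        for t, count in enumerate(vector):
            A[t][category] += count
    return A


training = [([2, 1, 0], 0), ([1, 0, 0], 0), ([0, 0, 1], 1)]
print(accumulate(training, num_terms=3, num_categories=2))
```

Column d of the result is the accumulated count vector for category d.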

In addition, if the raw frequencies of the element words were used as they are, words that occur often in the training data would be given too much weight. To compensate, the counts are normalized by the frequency with which they appear in each category, as in [Equation 1]. That is, the generated accumulated count vector is normalized to compensate for the excess weight given to frequently occurring element words (205).
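[Equation 1] is not reproduced in this text, so its exact form is unknown. The sketch below assumes one normalization commonly used in vector-based call routing: each term's row of the accumulated count matrix is scaled to unit Euclidean length. This is an assumption, not the confirmed formula of the patent, and the name `normalize_rows` is hypothetical.

```python
import math


def normalize_rows(A):
    """Scale each row (one element word across all categories) to unit length.

    Assumed form: B[t][d] = A[t][d] / sqrt(sum over e of A[t][e] ** 2).
    Rows that are all zero are left as zeros.
    """
    B = []
    for row in A:
        norm = math.sqrt(sum(x * x for x in row))
        B.append([x / norm if norm else 0.0 for x in row])
    return B


print(normalize_rows([[3, 4], [0, 0]]))
```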

Also, an element word that belongs to only one category is more discriminative for call classification than one that belongs to several categories. The vector value for each element word is therefore divided by the number of categories in which that element word occurs, as in [Equation 2]. That is, the discriminative power of element words specific to a particular category is compensated (206).

Here, t is a term (the basic unit of processing), meaning an element word; n is the number of input texts (sentences) in the training data; and d(t) is the number of input texts (sentences) that contain the element word t.
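[Equation 2] is likewise not reproduced here. Given the variables n and d(t) defined above, a plausible reading is an inverse document frequency weight of the form log2(n / d(t)); this form is an assumption and should be checked against the original publication. The name `idf` is hypothetical.

```python
import math


def idf(n, d_t):
    """Assumed weight for an element word: log2(n / d(t)).

    n:   number of input texts (sentences) in the training data
    d_t: number of input texts (sentences) that contain the element word t
    """
    return math.log2(n / d_t)


# A word found in 2 of 8 sentences gets more weight than one found in all 8.
print(idf(8, 2), idf(8, 8))
```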

A classification matrix is generated using the normalization result (205) and the discriminative power compensation result (206) (207). That is, multiplying each row of the normalized matrix by the corresponding weight yields the classification matrix C, as in [Equation 3].
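The row-wise multiplication that produces the classification matrix (207) can be sketched as follows. [Equation 3] is not reproduced in this text; the sketch assumes that row t of the normalized matrix is multiplied by a single scalar weight for element word t. The names `classification_matrix`, `B`, and `weights` are hypothetical.

```python
def classification_matrix(B, weights):
    """Multiply each row of the normalized matrix B by that row's term weight.

    Assumed form: C[t][d] = weights[t] * B[t][d].
    """
    return [[w * x for x in row] for row, w in zip(B, weights)]


B = [[0.5, 0.5], [1.0, 0.0]]
weights = [2.0, 1.0]
print(classification_matrix(B, weights))
```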

Next, the similarity generation step is described.

It is now necessary to determine which category the input vector corresponds to.

A singular vector decomposition is applied in order to obtain a uniform representation of element word vectors and category vectors and, at the same time, to reduce the dimensionality. That is, the generated classification matrix is decomposed (208). For example, if the number of element words is m and the number of categories is n, the m x n classification matrix C can be decomposed as in [Equation 4].

Here, when r is the rank of the matrix C, U is an m x r orthonormal matrix, V is an r x n orthonormal matrix, and S is an r x r diagonal matrix whose nonzero, positive diagonal entries are arranged in descending order.

The i-th row of the matrix U is an r-dimensional vector representing the i-th element word, and the j-th row of the matrix V is an r-dimensional vector representing the j-th category.
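[Equation 4] is not reproduced in this text. The shapes and properties stated above match the standard singular value decomposition, so a hedged reconstruction, assuming that is what is meant, is:

```latex
C_{m \times n} \;=\; U_{m \times r}\, S_{r \times r}\, \left(V^{\mathsf{T}}\right)_{r \times n},
\qquad
S = \operatorname{diag}(\sigma_1, \dots, \sigma_r),
\qquad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0
```

In this reconstruction the r x n factor is written as the transpose of an n x r matrix V, which is an assumption made so that the j-th row of V is the r-dimensional vector for the j-th category, as the description states.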

Therefore, if the m x 1 input vector is denoted Q, a pseudo-document matrix D can be generated as in [Equation 5]. That is, the pseudo-document matrix is generated using the decomposed matrix and the input vector (209).
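[Equation 5] is not reproduced in this text. A common way to form a pseudo-document in this kind of latent-space method is to project the input vector onto the columns of U, that is, D = Q^T U; the sketch below assumes that form. The name `pseudo_document` is hypothetical, and U is taken as given from some decomposition routine.

```python
def pseudo_document(Q, U):
    """Project the m-dimensional input vector Q into r dimensions (assumed D = Q^T U).

    Q: list of m counts
    U: m x r matrix as a list of rows
    """
    r = len(U[0])
    return [sum(Q[i] * U[i][k] for i in range(len(Q))) for k in range(r)]


U = [[1, 0], [0, 1], [0, 0]]  # toy 3 x 2 matrix with orthonormal columns
print(pseudo_document([2, 1, 0], U))
```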

Then the similarity is obtained using the generated pseudo-document matrix and the scaled document matrix obtained from the decomposition (210). That is, the similarity can be obtained as the cosine score between the pseudo-document matrix D obtained above and the scaled document matrix, and the cosine score of the two matrices is calculated as in [Equation 6].

The similarity given by [Equation 6] is an n-dimensional vector whose j-th value represents the similarity between the input vector and the j-th category.
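The cosine scoring step (210) can be sketched as follows. [Equation 6] and the symbol for the scaled document matrix are not reproduced in this text, so the sketch assumes that the scaled document vector for category j is the j-th row of V with each component multiplied by the corresponding singular value. The names `cosine` and `similarities` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarities(D, V, singular_values):
    """Return one cosine score per category (assumed scaled document rows: V times S)."""
    scaled = [[v * s for v, s in zip(row, singular_values)] for row in V]
    return [cosine(D, row) for row in scaled]


D = [1.0, 0.0]
V = [[1.0, 0.0], [0.0, 1.0]]  # one row per category
print(similarities(D, V, [2.0, 1.0]))
```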

The category of the input text is then classified using the generated similarity (211): the category with the largest value is taken as the category of the given input.
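The final decision (211) is a selection of the highest-scoring category. A minimal sketch, with hypothetical category names and a hypothetical function name `classify`:

```python
def classify(scores, categories):
    """Return the category whose similarity score is largest."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return categories[best]


# Hypothetical department names for a call classification setting.
print(classify([0.1, 0.9, 0.3], ["billing", "cancellation", "other"]))
```

If several categories tie for the highest score, this sketch returns the first of them; the text does not say how ties should be handled.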

상기와 같은 본 발명은 음성 인식을 이용한 고객센터 자동 호 분류 시스템 등에 적용될 수 있어, 사용자가 전화를 걸어 자신의 의도를 말하면 기계(시스템)가 그 사람이 원하는 바가 무엇인지를 파악하여 그런 일을 처리할 수 있는 부서로 전화 호를 전달(연결)해 주는데 응용될 수 있다.The present invention as described above can be applied to a customer center automatic call classification system using voice recognition, and when a user makes a phone call and speaks his intention, the machine (system) grasps what the person wants and handles such a task. It can be applied to transfer (connect) a telephone call to a department that can.

상술한 바와 같은 본 발명의 방법은 프로그램으로 구현되어 컴퓨터로 읽을 수 있는 형태로 기록매체(씨디롬, 롬, 램, 플로피 디스크, 하드 디스크, 광자기 디스크 등)에 저장될 수 있다. 이러한 과정은 본 발명이 속하는 기술 분야에서 통상의 지식을 가진 자가 용이하게 실시할 수 있으므로 더 이상 상세히 설명하지 않기로 한다.As described above, the method of the present invention may be implemented as a program and stored in a recording medium (CD-ROM, ROM, RAM, floppy disk, hard disk, magneto-optical disk, etc.) in a computer-readable form. Since this process can be easily implemented by those skilled in the art will not be described in more detail.

이상에서 설명한 본 발명은, 본 발명이 속하는 기술분야에서 통상의 지식을 가진 자에게 있어 본 발명의 기술적 사상을 벗어나지 않는 범위 내에서 여러 가지 치환, 변형 및 변경이 가능하므로 전술한 실시예 및 첨부된 도면에 의해 한정되는 것이 아니다.The present invention described above is capable of various substitutions, modifications, and changes without departing from the technical spirit of the present invention for those skilled in the art to which the present invention pertains. It is not limited by the drawings.

As described above, the present invention can classify the category of a text with a simple algorithm that extracts an input vector from the input text (sentence), generates a classification matrix, and generates a similarity, and it is therefore easy to implement.
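The three stages summarized above can be sketched end to end in Python. This is a minimal sketch under stated assumptions, not the patent's implementation: it assumes the classification matrix has one row per element word and one column per category, that the decomposition is a singular value decomposition W = U S V^T computed with NumPy, that the pseudo-document is the input vector multiplied by U, and that the scaled document matrix is V S. The function names, the example words, the counts, and the category labels are all hypothetical.

```python
import numpy as np

def build_classifier(term_category_matrix, rank):
    """Decompose a terms x categories matrix and keep the top `rank` components."""
    W = np.asarray(term_category_matrix, dtype=float)
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_k = U[:, :rank]                         # terms x rank
    scaled_docs = Vt[:rank, :].T * s[:rank]   # categories x rank, i.e. V_k S_k
    return U_k, scaled_docs

def score_categories(input_vector, U_k, scaled_docs):
    """Cosine score between the pseudo-document and each scaled category row."""
    pseudo_doc = np.asarray(input_vector, dtype=float) @ U_k
    dots = scaled_docs @ pseudo_doc
    denom = np.linalg.norm(scaled_docs, axis=1) * np.linalg.norm(pseudo_doc)
    # A zero denominator yields a score of 0 instead of a division error.
    return np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)

# Hypothetical training data: rows are the element words
# ["bill", "payment", "broken", "repair"]; columns are the categories below.
CATEGORIES = ["billing", "repair"]
W_EXAMPLE = [[3, 0],
             [2, 0],
             [0, 4],
             [0, 1]]

U_K, SCALED_DOCS = build_classifier(W_EXAMPLE, rank=2)
SCORES = score_categories([1, 0, 0, 0], U_K, SCALED_DOCS)  # input containing "bill"
PREDICTED = CATEGORIES[int(np.argmax(SCORES))]
```

In this toy example the input contains only the word "bill", which occurs only in the first category, so `PREDICTED` is `"billing"`. Choosing a `rank` smaller than the number of categories would trade exactness for a lower-dimensional comparison space; the value 2 here is arbitrary.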

In addition, by extracting the input vector from the input text (sentence) using a normalization scheme or a mapping table, the present invention can improve the category classification rate even with limited data.
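A mapping-table-based extraction of the input vector might look like the following Python sketch. The table contents, the vocabulary, the English example words, and the function name `extract_input_vector` are hypothetical and chosen only for illustration; the patent's actual mapping table and normalization rules are not shown in this section. The idea illustrated is that several surface words are mapped to one element word, so that limited training data covers more input variations.

```python
import re
from collections import Counter

# Hypothetical mapping table: surface word -> element word.
MAPPING_TABLE = {
    "invoice": "bill",
    "bills": "bill",
    "fix": "repair",
    "repairs": "repair",
}

# Hypothetical ordered list of element words; one vector position per word.
VOCABULARY = ["bill", "payment", "broken", "repair"]

def extract_input_vector(text, mapping=MAPPING_TABLE, vocabulary=VOCABULARY):
    """Split text into words, map them to element words, and count each one."""
    words = re.findall(r"[a-z]+", text.lower())
    mapped = [mapping.get(word, word) for word in words]
    counts = Counter(mapped)
    # Words outside the vocabulary are ignored; missing words count as 0.
    return [counts[term] for term in vocabulary]
```

For the sentence "Please fix my broken phone, and send the invoice", the words "fix" and "invoice" are mapped to "repair" and "bill", so the resulting vector is `[1, 0, 1, 1]`.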

In addition, the present invention can be applied to, for example, an automatic call classification system for a customer center using speech recognition.

Claims

A category classification apparatus for text, comprising:

input vector extraction means for extracting an input vector from an input text (sentence) using a normalization scheme or a mapping table;

classification matrix generating means for generating a classification matrix using the input vector extracted by the input vector extraction means;

similarity generating means for generating a similarity by decomposing the classification matrix generated by the classification matrix generating means using a singular vector; and

category classification means for classifying a category of the input text using the similarity generated by the similarity generating means.

The apparatus of claim 1, wherein the input vector extraction means comprises:

first normalization means for normalizing the words of the input text (sentence) according to their semantic information in context; and

word classification means for classifying input vector element words by sorting the text normalized by the first normalization means by word unit.

The apparatus of claim 1, wherein the input vector extraction means comprises:

word classification means for classifying input vector element words by sorting the input text (sentence) by word unit; and

input vector generation means for generating an input vector for the input text using the mapping table for key element words.

delete

The apparatus according to any one of claims 1 to 3, wherein the classification matrix generating means comprises:

scaled frequency vector generating means for generating a scaled frequency vector (accumulated count vector) using a specific input vector extracted by the input vector extraction means;

second normalization means for normalizing the scaled frequency vector generated by the scaled frequency vector generating means to compensate for the weighting of a plurality of element words;

weighting means for compensating for the discriminating power of a specific element word corresponding to a specific category; and

matrix generating means for generating the classification matrix using the processing results of the second normalization means and the weighting means.

6. The apparatus of claim 5, wherein the similarity generating means comprises:

matrix decomposition means for decomposing the classification matrix generated by the classification matrix generating means using the singular vector;

pseudo-document matrix generation means for generating a pseudo-document matrix using a specific matrix decomposed by the matrix decomposition means and the input vector; and

similarity acquiring means for acquiring the similarity using the pseudo-document matrix generated by the pseudo-document matrix generation means and the scaled document matrix decomposed by the matrix decomposition means.

The apparatus of claim 6, wherein the similarity acquiring means obtains the similarity between the input vector and the corresponding category by calculating a cosine score between the pseudo-document matrix generated by the pseudo-document matrix generation means and the scaled document matrix decomposed by the matrix decomposition means.

The apparatus of claim 7, wherein the input text (sentence) is text obtained by recognizing speech from a telephone call and converting it into text.

A category classification method for text, comprising:

an input vector extraction step of extracting an input vector from an input text (sentence) using a normalization scheme or a mapping table;

a classification matrix generation step of generating a classification matrix using the extracted input vector;

a similarity generation step of generating a similarity by decomposing the generated classification matrix using a singular vector; and

a category classification step of classifying a category of the input text using the generated similarity.

The method of claim 9, wherein the input vector extraction step comprises:

a first normalization step of normalizing the words of the input text (sentence) according to their semantic information in context; and

a word classification step of classifying input vector element words by sorting the text normalized in the first normalization step by word unit.

The method of claim 9, wherein the input vector extraction step comprises:

a word classification step of classifying input vector element words by sorting the input text (sentence) by word unit; and

an input vector generation step of generating an input vector for the input text using the mapping table for key element words.

delete

The method according to any one of claims 9 to 11, wherein the classification matrix generation step comprises:

a scaled frequency vector generation step of generating a scaled frequency vector (accumulated count vector) using the extracted input vector;

a second normalization step of normalizing the generated scaled frequency vector to compensate for the weighting of a plurality of element words;

a weighting step of compensating for the discriminating power of a specific element word corresponding to a specific category; and

a matrix generation step of generating the classification matrix using the processing results of the second normalization step and the weighting step.

The method of claim 13, wherein the similarity generation step comprises:

a matrix decomposition step of decomposing the generated classification matrix using the singular vector;

a pseudo-document matrix generation step of generating a pseudo-document matrix using a specific matrix decomposed in the matrix decomposition step and the input vector; and

a similarity obtaining step of obtaining the similarity using the generated pseudo-document matrix and the scaled document matrix decomposed in the matrix decomposition step.

15. The method of claim 14, wherein the similarity obtaining step obtains the similarity between the input vector and the corresponding category by calculating a cosine score between the pseudo-document matrix and the scaled document matrix.

16. The method of claim 15, wherein the input text (sentence) is text obtained by recognizing speech from a telephone call and converting it into text.

delete