KR101841615B1

KR101841615B1 - Apparatus and method for computing noun similarities using semantic contexts

Info

Publication number: KR101841615B1
Application number: KR1020160064750A
Authority: KR
Inventors: 맹성현; 강준영; 김부근
Original assignee: 한국과학기술원
Priority date: 2016-02-05
Filing date: 2016-05-26
Publication date: 2018-03-26
Also published as: KR20170094063A

Abstract

An attribute vector map is identified for a plurality of adjectives respectively included in a plurality of noun phrases extracted from a sentence or document, and a plurality of first attribute vector sets are generated on the basis of the nouns respectively included in those noun phrases. A plurality of second attribute vector sets are generated from the first attribute vector sets on the basis of the identified attribute vector map, and an adjective similarity is calculated using the two second attribute vector sets that correspond to two nouns. The semantic similarity of the two nouns is then calculated on the basis of the adjective similarity.

Description

Apparatus and method for computing noun similarities using semantic contexts

The present invention relates to an apparatus and a method for computing semantic-based noun similarity.

As the amount of text data on the Web grows, techniques for understanding and analyzing documents are attracting increasing attention. Accordingly, preliminary studies and techniques for text data are being developed, and semantic similarity is one of them.

Semantic similarity is a measure used to characterize semantic entities such as words or sentences; in particular, semantic similarity between nouns is a measure of the semantic distance between two nouns. Because semantic similarity can be used in natural language processing, information retrieval, question answering, and similar tasks, it is being studied actively. In fields that exploit meaning extracted from text, such as the detection of plagiarized citations, semantic similarity plays a very important role because the relationships between different semantic entities are indispensable.

However, conventional techniques for determining semantic similarity are difficult to apply to natural language processing because most of the target words are nouns. Various approaches are therefore being studied to improve the effectiveness of semantic similarity; to judge semantic similarity better, an object must be described sufficiently and its important features must be identified.

However, with the approaches currently being studied, the two nouns to be compared must co-occur, directly or indirectly, before their semantic similarity can be determined. In addition, unnecessary words appearing in the context are also used as information representing the meaning of the target nouns, which makes it difficult to determine semantic similarity accurately.

Accordingly, the present invention provides an apparatus and a method for computing noun similarity using adjectives as the semantic context.

According to one aspect of the present invention for achieving the above technical object, a method of computing noun similarity on a semantic basis includes:

identifying an attribute vector map for a plurality of adjectives respectively included in a plurality of noun phrases extracted from a sentence or document; generating a plurality of first attribute vector sets on the basis of a plurality of nouns respectively included in the plurality of noun phrases; generating a plurality of second attribute vector sets from the generated first attribute vector sets on the basis of the identified attribute vector map; calculating an adjective similarity, on the basis of the generated second attribute vector sets, using the two second attribute vector sets for two nouns; and calculating a semantic similarity of the two nouns on the basis of the calculated adjective similarity.

The generating of the first attribute vector sets may include: identifying, with reference to the nouns included in the extracted noun phrases, at least one adjective that is included in the noun phrases and modifies a given noun; and generating the at least one adjective as the first attribute vector set for that noun.

The generating of the second attribute vector sets may include: identifying, on the basis of the attribute vector map, the attribute vector corresponding to each of the at least one adjective included in a first attribute vector set; and generating a second attribute vector set by converting each adjective included in the first attribute vector set into its identified attribute vector.

After the generating of the second attribute vector sets, the method may further include filtering the plurality of attribute vectors included in a second attribute vector set so that only attribute vectors having a probability value equal to or greater than a preset threshold remain in the second attribute vector set.

After the generating of the second attribute vector sets, the method may further include: classifying the attribute vectors in a second attribute vector set into groups of similar attribute vectors; and selecting, from among the classified groups, a group that contains a large number of attribute vectors.

The calculating of the adjective similarity may include: for each of the two second attribute vector sets corresponding to the two nouns, forming the second attribute vector set such that attribute vectors having the same value are consolidated; making the two consolidated second attribute vector sets equal in size; generating at least one attribute vector pair by one-to-one matching between the attribute vectors in the two equal-size second attribute vector sets; calculating a similarity for each of the at least one attribute vector pair; and calculating an attribute vector similarity for the two second attribute vector sets on the basis of the similarities of the attribute vector pairs.

The forming of the second attribute vector set such that attribute vectors having the same value are consolidated may include: checking whether the same attribute vector is included repeatedly in the second attribute vector set; and, if an attribute vector is included repeatedly, keeping only one instance of that attribute vector and assigning it a weight corresponding to the number of instances removed.

The making of the two second attribute vector sets equal in size may include: checking a first set size and a second set size, where the size of a second attribute vector set is the number of attribute vectors it contains; if the first set size and the second set size differ, merging attribute vectors in the larger second attribute vector set so that its size becomes that of the smaller second attribute vector set; and sorting the attribute vectors in the equal-size second attribute vector sets according to their weights.

According to another aspect of the present invention for achieving the above technical object, an apparatus for computing noun similarity on a semantic basis includes:

a noun phrase extraction unit that extracts at least one adjective-noun noun phrase from a sentence or document; an attribute vector storage unit that identifies, in a pre-stored attribute vector map for adjectives, the attribute vector map entries corresponding to the adjectives included in the noun phrases extracted by the noun phrase extraction unit; an adjective set similarity calculation unit that generates attribute vector sets for the at least one noun phrase and calculates an adjective similarity using the two attribute vector sets for the two nouns whose semantic similarity is to be calculated; and a semantic similarity calculation unit that calculates the semantic similarity of the two nouns on the basis of the adjective similarity calculated by the adjective set similarity calculation unit.

According to the present invention, the influence of adjectives on the meaning of a target noun can be identified, so the apparatus for computing noun semantic similarity can achieve better performance in determining semantic similarity.

In addition, by using adjectives, context information can be extracted and used efficiently regardless of whether the two nouns being compared co-occur directly or indirectly.

FIG. 1 is a structural diagram of a semantic-based noun similarity calculation apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart of a noun similarity calculation method according to an embodiment of the present invention.
FIG. 3 is an exemplary diagram of an attribute vector according to an embodiment of the present invention.
FIG. 4 is an exemplary diagram of grouped adjective categories according to an embodiment of the present invention.
FIG. 5 is a flowchart of an attribute similarity calculation method according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the invention pertains can readily practice it. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described here. To describe the invention clearly, parts unrelated to the description are omitted from the drawings, and similar parts are denoted by similar reference numerals throughout the specification.

Throughout the specification, when a part is said to "include" an element, this means that it may further include other elements, rather than excluding them, unless specifically stated otherwise.

An apparatus and a method for computing semantic-based noun similarity according to an embodiment of the present invention are described below with reference to the drawings.

FIG. 1 is a structural diagram of a semantic-based noun similarity calculation apparatus according to an embodiment of the present invention.

As shown in FIG. 1, the similarity calculation apparatus 100 includes a noun phrase extraction unit 110, an attribute vector storage unit 120, an adjective set similarity calculation unit 130, and a semantic similarity calculation unit 140.

The noun phrase extraction unit 110 extracts noun phrases from an input sentence or document. Here, a noun phrase is a word pair in which an adjective and a noun appear consecutively, that is, a form in which an adjective modifies a noun. Noun phrases can be extracted from the input sentence or document in various ways, and the embodiment is not limited to any one method.

The attribute vector storage unit 120 stores and manages, for all adjectives, adjective categories that define groups of similar adjectives and an attribute vector map that indicates the probability that a given adjective belongs to each adjective category. It also looks up, in the pre-stored attribute vector map, the entries corresponding to each of the adjectives included in the noun phrases extracted by the noun phrase extraction unit 110.

The attribute vector map is used in this embodiment because it allows the specific attribute type that an adjective expresses to be inferred directly with respect to the noun. However, the invention is not limited to an attribute vector map for adjectives; any form from which the attribute type of an adjective can be inferred may be used. An example of the attribute vector map is described first with reference to FIG. 3.

FIG. 3 is an exemplary diagram of an attribute vector map according to an embodiment of the present invention.

Adjectives are classified in advance, through some classification process, into grouped adjective categories. Through this classification process, each adjective is assigned a probability of belonging to each grouped adjective category. A vector is generated from the probabilities that a given adjective belongs to each adjective category, and this vector is referred to as an "attribute vector".

In this embodiment, the attribute vector is described using the example of 13 adjective categories: a 13-dimensional vector is generated whose elements are the probabilities that the adjective belongs to each adjective category (supersense), and this vector is used as the adjective's representative vector, that is, its attribute vector. In this embodiment the adjective categories, each of which is a single word covering several sub-categories (subsenses), are the 13 adjective categories provided by GermaNet; the number of elements in the attribute vector may change with the number of adjective categories.

In the table of the attribute vector map shown in FIG. 3, the first column is the adjective and the second column lists the top-4 adjective categories by type classification. Because an adjective category can be regarded as a semantic class, it can represent the coarse-grained meaning of the word to which it is attached.

Here, top-4 refers to the type classification accuracy obtained through adjective classification training: for top-k (k being an integer), k = 4 gives a type classification accuracy of 91%, whereas k = 1 (top-1) gives 54%. The adjective classification training used to derive the type classification accuracy is already known, so a detailed description is omitted in this embodiment.

The third column shows the probability distribution over the adjective category classes as a graph. This distribution can be obtained in various ways, so a detailed description is omitted in this embodiment.

The remaining columns, the 4th through the 16th, give the probability, for each adjective category, that the adjective actually belongs to that category. Taking the adjective 'deaf' listed in the table as an example, 'deaf' most likely belongs to the 'BODY' class. The computed probabilities include 0.02 for the 'BEHAVIOR' class and 0.106 for the 'FEELING' class, and the largest of the 13 category probabilities is 0.652, for the 'BODY' class. These 13 probability values are the elements of the attribute vector for the adjective.
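As a minimal sketch of how the dominant category and the top-k categories could be read off such a row, the following uses only the three probabilities quoted above for 'deaf'; the other ten category values are not given in this passage, so they are left out. The function name `top_k_categories` is hypothetical and not taken from the patent.

```python
# Only the three probabilities quoted in the text for 'deaf' are used;
# the remaining ten category values are not stated in this passage.
deaf_known = {"BEHAVIOR": 0.02, "FEELING": 0.106, "BODY": 0.652}


def top_k_categories(probabilities, k):
    """Return the k category names with the highest probability (hypothetical helper)."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


print(top_k_categories(deaf_known, 1))  # ['BODY']
```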

An example of the adjective categories stored in the attribute vector storage unit 120 is described first with reference to FIG. 4.

FIG. 4 is an exemplary diagram of grouped adjective categories according to an embodiment of the present invention.

In this embodiment, adjectives are described as being formed and grouped into adjective groups on the basis of GermaNet, although the invention is not limited to this. For example, the adjective category "WEATHER" contains sub-senses such as rainy, balmy, foggy, hazy, and humid. Here, adjectives are assumed to be similar not only when their meanings are similar but also when, even though their meanings are opposite, they describe the same property of the noun they modify.

For example, the adjectives 'beautiful' and 'pretty' have similar meanings and therefore give a noun similar characteristics. As another example, the adjectives 'fanged' and 'fascinated' are not similar in meaning, but they are assumed to be similar because, through the outward facts they describe, they give a noun similar characteristics. As yet another example, the adjectives 'hot' and 'cool' are opposite in meaning, but they are assumed to be similar because both describe the temperature property of the noun they modify.

Returning to FIG. 1, the adjective set similarity calculation unit 130 collects, from the plurality of noun phrases, the adjectives that modify the same noun and thereby generates a plurality of first attribute vector sets. It then uses the attribute vector map to convert each adjective in the first attribute vector sets into a 13-dimensional vector, generating a plurality of second attribute vector sets.

The adjective set similarity calculation unit 130 may also filter the adjectives included in a second attribute vector set. That is, it filters the set so that only attribute vectors with a probability value at or above a preset threshold remain in the second attribute vector set.
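The filtering step might be sketched as follows. The text does not say which probability value is compared with the threshold, so this sketch assumes it is the largest element of each attribute vector. The function name, the threshold of 0.5, and the 3-element toy vectors are illustrative assumptions only (the embodiment uses 13 elements).

```python
def filter_by_threshold(attribute_vectors, threshold):
    """Keep vectors whose largest category probability is >= threshold.

    Assumption: the 'probability value' compared with the threshold is the
    maximum element of the vector; the source does not specify this.
    """
    return [vec for vec in attribute_vectors if max(vec) >= threshold]


# Toy 3-element vectors for illustration only.
vectors = [
    [0.7, 0.2, 0.1],  # concentrated on one category
    [0.4, 0.3, 0.3],  # spread across categories
]
kept = filter_by_threshold(vectors, threshold=0.5)
print(kept)  # [[0.7, 0.2, 0.1]]
```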

Alternatively, the adjectives in a second attribute vector set may be classified into groups of similar adjectives, and a group containing a large number of adjectives may be selected and used. Because some nouns are ambiguous, this selects a single sense for the noun. The adjective filtering by the adjective set similarity calculation unit 130 is optional.
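One way to picture this alternative is shown below. The grouping criterion is not specified in the text, so the sketch assumes that vectors are "similar" when they share the same highest-probability category, and it selects the single largest group. Both choices, and the name `largest_category_group`, are assumptions made for illustration.

```python
from collections import defaultdict


def largest_category_group(attribute_vectors):
    """Group vectors by their dominant category index and return the largest group.

    Assumption: similarity is judged by the index of the maximum element.
    """
    groups = defaultdict(list)
    for vec in attribute_vectors:
        dominant = max(range(len(vec)), key=lambda i: vec[i])
        groups[dominant].append(vec)
    return max(groups.values(), key=len) if groups else []


# Toy 3-element vectors for illustration only.
sample = [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]
print(largest_category_group(sample))  # [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1]]
```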

The adjective set similarity calculation unit 130 also modifies the second attribute vector set so that attribute vectors having the same value are consolidated. That is, if the attribute vector of an adjective appears more than once in the second attribute vector set, the set is changed so that the duplicated attribute vector is included only once, and the adjective is given a weight equal to the number of times it appeared.

For example, when an adjective is included only once in the second attribute vector set, the adjective set similarity calculation unit 130 sets its weight to 1 without removing anything. If, however, a particular adjective is included twice in the second attribute vector set, the adjective set similarity calculation unit 130 removes the duplicate and sets the weight to 2.
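The deduplication with weights described above can be sketched with a counter, where the weight is the number of occurrences (1 for a single occurrence, 2 for two, as in the example). The function name and the 2-element toy vectors are illustrative assumptions.

```python
from collections import Counter


def deduplicate_with_weights(attribute_vectors):
    """Collapse identical vectors into one entry and record how often each occurred."""
    counts = Counter(tuple(vec) for vec in attribute_vectors)
    # Counter keeps first-seen order, so the output order follows the input.
    return [(list(vec), weight) for vec, weight in counts.items()]


# Toy 2-element vectors for illustration only.
weighted = deduplicate_with_weights([[0.1, 0.9], [0.5, 0.5], [0.1, 0.9]])
print(weighted)  # [([0.1, 0.9], 2), ([0.5, 0.5], 1)]
```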

The adjective set similarity calculation unit 130 also makes the two second attribute vector sets for the two nouns being compared equal in size. Here, the size of a second attribute vector set is described as being determined by the number of attribute vectors it contains.

Suppose the second attribute vector set for the first noun has size 10 and that for the second noun has size 5. The adjective set similarity calculation unit 130 then adjusts the differently sized sets to a common size. In this embodiment, the larger attribute vector set is reduced to the size of the smaller one, as described later.

The adjective set similarity calculation unit 130 also calculates the similarity of each of the at least one attribute vector pair and then takes the average. In doing so, it considers only the attribute vector pairs that form a bipartite matching when calculating the attribute vector similarity of the two second attribute vector sets.

This means that the embodiment considers only the similarity between vector pairs matched one-to-one. For example, suppose attribute vector a of the first second attribute vector set has been matched one-to-one with attribute vector k of the other second attribute vector set. Then only the similarity between a and k is calculated; no similarity is calculated between a and any other attribute vector of the other set.
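A minimal sketch of averaging over one-to-one matched pairs follows. Two points are assumptions rather than statements from the text: the per-pair similarity is taken to be cosine similarity, and the matching is taken to be positional (the i-th vector of one set with the i-th vector of the other, for example after both sets have been sorted). The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed pair similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def matched_set_similarity(set_a, set_b):
    """Average similarity over positionally matched pairs of two equal-size sets."""
    if len(set_a) != len(set_b):
        raise ValueError("sets must be equalized in size before matching")
    if not set_a:
        return 0.0
    # Each vector is compared only with its matched partner, never with others.
    return sum(cosine(u, v) for u, v in zip(set_a, set_b)) / len(set_a)


print(matched_set_similarity([[1, 0], [0, 1]], [[1, 0], [1, 0]]))  # 0.5
```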

The semantic similarity calculation unit 140 calculates the semantic similarity between two nouns on the basis of the attribute similarity between the two second attribute vector sets calculated by the adjective set similarity calculation unit 130. When calculating the semantic similarity between two nouns, it takes into account both the similarity between the word representations of the nouns and the similarity between their attribute vector sets.
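The passage says both similarities are taken into account but does not give the combining formula. Purely as an illustration, the sketch below assumes a linear interpolation with a hypothetical mixing parameter `alpha`; neither the formula nor the parameter comes from the patent text.

```python
def noun_semantic_similarity(word_sim, attribute_set_sim, alpha=0.5):
    """Blend word-representation similarity with attribute-set similarity.

    Assumption: a weighted average with mixing parameter alpha in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * word_sim + (1.0 - alpha) * attribute_set_sim


score = noun_semantic_similarity(0.8, 0.4, alpha=0.5)
```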

A method of calculating the similarity between two nouns using the semantic-based noun similarity calculation apparatus 100 described above is now described with reference to FIG. 2.

FIG. 2 is a flowchart of a noun similarity calculation method according to an embodiment of the present invention.

As shown in FIG. 2, the noun phrase extraction unit 110 extracts noun phrases of the form (adjective, noun) from the input sentence or document (S100). Among the words, the noun phrase extraction unit 110 extracts as a noun phrase each case in which an adjective and a noun appear immediately adjacent to each other. A sentence or document may contain one or more such noun phrases, and all of them are extracted.

To extract noun phrases from a sentence or document, the noun phrase extraction unit 110 first tags every word in the sentence or document with its part of speech. It then uses the tags to extract, as noun phrases, the words in which an adjective and a noun appear next to each other. Noun phrases can also be extracted by various other methods, so the embodiment is not limited to any one of them.
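The extraction over tagged words can be sketched as a scan for an adjective immediately followed by a noun. The sketch assumes the tokens have already been tagged by some part-of-speech tagger and that the tags are the strings "ADJ" and "NOUN"; the tag names and the function name are assumptions, not part of the source.

```python
def extract_adjective_noun_pairs(tagged_tokens):
    """Return (adjective, noun) pairs where the adjective directly precedes the noun.

    tagged_tokens is a list of (word, tag) tuples; the tag names are assumed.
    """
    pairs = []
    for (word, tag), (next_word, next_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if tag == "ADJ" and next_tag == "NOUN":
            pairs.append((word, next_word))
    return pairs


tagged = [
    ("the", "DET"),
    ("humid", "ADJ"),
    ("weather", "NOUN"),
    ("was", "VERB"),
    ("unpleasant", "ADJ"),
]
print(extract_adjective_noun_pairs(tagged))  # [('humid', 'weather')]
```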

When the noun phrase extraction unit 110 has extracted a plurality of noun phrases in step S100, the attribute vector storage unit 120 looks up, in the stored attribute vector map, the entries corresponding to the adjectives included in the noun phrases (S110). In this embodiment, the lookup is described using the example of an attribute vector map in which the classification result of an adjective category tagger (supersense tagger) for each adjective in the extracted noun phrases, that is, the attribute vector illustrated in FIG. 3, is matched one-to-one with that adjective; however, the invention is not limited to this.

From the plurality of noun phrases of the form "adjective-noun", the adjective set similarity calculation unit 130 collects the adjectives that modify the same noun and forms one group per noun, thereby generating a first attribute vector set (S120). For example, suppose that, in the noun phrases extracted from some sentence or document, the noun "obscurantism" was modified by "wild, religious, deliberate, religious, deliberate, religious". The adjective set similarity calculation unit 130 then associates the collection containing each of these adjectives as an element with obscurantism, grouping them as (obscurantism - wild religious deliberate religious deliberate religious), to generate the first attribute vector set.
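The grouping of step S120 can be sketched as below. This is an illustrative sketch under the assumption that the extracted noun phrases are available as (adjective, noun) tuples; the function name group_adjectives_by_noun is hypothetical. Repeated adjectives are deliberately kept, as in the obscurantism example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_adjectives_by_noun(pairs: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collect, for every noun, the adjectives that modified it (duplicates kept)."""
    groups = defaultdict(list)
    for adjective, noun in pairs:
        groups[noun].append(adjective)
    return dict(groups)


# The obscurantism example from the description, as (adjective, noun) tuples.
pairs = [("wild", "obscurantism"), ("religious", "obscurantism"),
         ("deliberate", "obscurantism"), ("religious", "obscurantism"),
         ("deliberate", "obscurantism"), ("religious", "obscurantism")]
first_sets = group_adjectives_by_noun(pairs)
```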

After generating the first attribute vector set per noun in this way, the adjective set similarity calculation unit 130 changes the representation of the adjectives so that the adjectives in the first attribute vector set can be used. In the embodiment of the present invention, the attribute vector map identified by the attribute vector storage unit 120 in step S110 is used to convert each adjective into its corresponding attribute vector.

That is, each adjective is converted into a 13-dimensional attribute vector based on the attribute vector map partially shown in FIG. 3. The adjectives in the first attribute vector set are then replaced with these attribute vectors to generate a second attribute vector set (S130).
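The conversion of step S130 amounts to a dictionary lookup, sketched below. The vector values are made up for illustration (the actual values would come from the attribute vector map of FIG. 3), the helper toy_vector is hypothetical, and skipping adjectives that are missing from the map is an assumption, since the description does not say how such adjectives are handled.

```python
from typing import Dict, List

DIM = 13  # dimensionality of the attribute vectors stated in the description


def toy_vector(category_index: int, peak: float = 0.76) -> List[float]:
    """Build a made-up 13-dimensional probability vector peaked at one category."""
    rest = (1.0 - peak) / (DIM - 1)
    return [peak if k == category_index else rest for k in range(DIM)]


def to_attribute_vectors(adjectives: List[str],
                         attribute_map: Dict[str, List[float]]) -> List[List[float]]:
    """Replace each adjective with its attribute vector from the map."""
    vectors = []
    for adjective in adjectives:
        vector = attribute_map.get(adjective)
        if vector is None:
            continue  # assumption: adjectives missing from the map are skipped
        if len(vector) != DIM:
            raise ValueError(f"expected {DIM} dimensions for {adjective!r}")
        vectors.append(list(vector))
    return vectors


# Hypothetical map entries; "deliberate" is intentionally absent from the map.
attribute_map = {"wild": toy_vector(0), "religious": toy_vector(5)}
second_set = to_attribute_vectors(["wild", "religious", "deliberate"], attribute_map)
```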

When the second attribute vector set has been generated in step S130, the adjective set similarity calculation unit 130 filters the adjectives in the second attribute vector set in order to resolve the ambiguity of the noun, generating a filtered second attribute vector set (S140). Although word sense ambiguity is one of the important problems in measuring word-level semantic similarity, conventional approaches have not taken it into account.

Accordingly, the embodiment of the present invention addresses this problem using adjective information. That is, the adjective set similarity calculation unit 130 filters the adjectives corresponding to the attribute vectors based on the attribute vectors in the second attribute vector set. In other words, the set is filtered so that only attribute vectors having a probability value at or above a preset reference value remain in the second attribute vector set.
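A threshold filter of this kind might look like the sketch below. The description does not state which probability value of an attribute vector is compared with the reference value, so this sketch assumes it is the largest component of the vector; the function name filter_by_probability and the threshold of 0.5 are likewise illustrative. Short 3-dimensional vectors are used only to keep the example readable.

```python
from typing import List


def filter_by_probability(vectors: List[List[float]],
                          threshold: float = 0.5) -> List[List[float]]:
    """Keep only vectors whose largest probability reaches the threshold."""
    # Assumption: the "probability value" of a vector is its maximum component.
    return [v for v in vectors if v and max(v) >= threshold]


vectors = [[0.7, 0.2, 0.1], [0.4, 0.3, 0.3], [0.5, 0.25, 0.25]]
kept = filter_by_probability(vectors, threshold=0.5)
```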

Alternatively, the attribute vectors in the second attribute vector set are classified into groups of similar adjectives, and a group containing a large number of adjectives may be selected from among the classified adjective groups and used.

Taking apple as an example, suppose that sweet, red, green, sour, acid, and the like are extracted as adjectives for apple. These are then classified into adjectives that can be grouped as taste and adjectives that can be grouped as color. Because a given noun may be modified by both taste adjectives and color adjectives, only the adjectives of the taste group, which appear more frequently, may be extracted and used.
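The group-based alternative can be sketched with the apple example. The category labels in category_of are supplied by hand for illustration; in practice they might be derived from the attribute vectors, for instance from the most probable category, which is an assumption and not something the description specifies. The function name keep_dominant_group is hypothetical.

```python
from collections import Counter
from typing import Dict, List


def keep_dominant_group(adjectives: List[str], category_of: Dict[str, str]) -> List[str]:
    """Keep only the adjectives belonging to the most frequent category."""
    if not adjectives:
        return []
    counts = Counter(category_of[a] for a in adjectives)
    dominant, _ = counts.most_common(1)[0]
    return [a for a in adjectives if category_of[a] == dominant]


# Hand-assigned categories for the apple example (illustrative only).
category_of = {"sweet": "taste", "sour": "taste", "acid": "taste",
               "red": "color", "green": "color"}
apple_adjectives = ["sweet", "red", "green", "sour", "acid"]
filtered = keep_dominant_group(apple_adjectives, category_of)
```

With three taste adjectives against two color adjectives, the taste group is retained, matching the example in the text.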

Adjectives can be filtered in various ways, and the embodiment of the present invention is not limited to any one method. Depending on the circumstances, the adjective filtering of step S140 may be skipped.

When a plurality of second attribute vector sets with filtered adjectives have been generated through step S140, the adjective set similarity calculation unit 130 calculates the attribute similarity between the two second attribute vector sets of the two nouns whose semantic similarity is to be calculated (S150). Because the embodiment of the present invention handles adjectives in the form of 13-dimensional attribute vectors, the characteristics of the nouns can be assessed by calculating the attribute similarity between the two second attribute vector sets.

In the embodiment of the present invention, the following procedure is performed to calculate the attribute similarity between the second attribute vector sets. It is described first with reference to FIG. 5.

FIG. 5 is a flowchart of a method of calculating attribute similarity according to an embodiment of the present invention.

As shown in FIG. 5, the adjective set similarity calculation unit 130 first forms the two second attribute vector sets by modifying each set so that attribute vectors having the same value are represented by a single attribute vector carrying a weight (S151). In this description, the weight is determined by the number of attribute vectors collected.

For example, suppose the adjective set for some word is {beautiful, pretty, long, short, short, pretty, heavy, small}. Each adjective that appears twice (pretty and short) has a weight of 2, while the other adjectives have a weight of 1. Accordingly, the adjectives that are repeated are adjusted to appear only once, with their weights recorded separately, and the attribute vector set is formed as {beautiful, pretty, long, short, heavy, small}.
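The collapsing of repeated entries into weighted entries (S151) can be sketched with a counter. The sketch works on adjective strings for readability; the same idea would apply to attribute vectors converted to tuples. The function name collapse_duplicates is hypothetical. Note that in the example set both pretty and short occur twice, so both receive a weight of 2.

```python
from collections import Counter
from typing import List, Tuple


def collapse_duplicates(adjectives: List[str]) -> List[Tuple[str, int]]:
    """Return each distinct adjective once, paired with its occurrence count as weight."""
    counts = Counter(adjectives)
    # Counter preserves first-seen order (Python 3.7+), so the output order is stable.
    return list(counts.items())


adjective_set = ["beautiful", "pretty", "long", "short", "short", "pretty", "heavy", "small"]
weighted = collapse_duplicates(adjective_set)
```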

After forming the second attribute vector set for each of the two nouns, the adjective set similarity calculation unit 130 makes the two second attribute vector sets the same size (S152). Here, the size of a second attribute vector set is the number of attribute vectors it contains after the set formation of the first procedure.

The number of attribute vectors in the second attribute vector set of the first word may differ from that of the second word. The numbers of attribute vectors in the two sets are therefore balanced to produce two second attribute vector sets of the same size.

The embodiment of the present invention is described using the example of reducing the larger second attribute vector set to the size of the smaller one, but is not necessarily limited to this. For instance, suppose the second attribute vector set of the first word contains 10 attribute vectors and that of the second word contains 5.

In that case, the size of the first word's second attribute vector set is adjusted to 5. To do so, attribute vectors that are close to each other among the 10 are merged until only 5 attribute vectors remain. Methods of measuring the distance between vectors and of finding the nearest attribute vector are well known, so a detailed description is omitted in the embodiment of the present invention.

After the two second attribute vector sets have been adjusted to the same size, the adjective set similarity calculation unit 130 sorts the attribute vectors in descending order of weight. Among the attribute vectors that are merged, one may have a weight of 1 and another a weight of 2; when two such attribute vectors are merged, the weight of the result is assumed to be 3. Because the 5 attribute vectors may each have a different weight, sorting by weight identifies the attribute vector with the highest weight.
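The size reduction and weight sort can be sketched as follows. The description leaves the distance measure and the merging rule open, so several choices here are assumptions: Euclidean distance, a weighted mean as the merged vector, and repeated merging of the single closest pair. Summing the weights follows the example in the text, where weights 1 and 2 merge to 3. The function name shrink_and_sort is hypothetical.

```python
import math
from itertools import combinations
from typing import List, Tuple

Weighted = Tuple[List[float], int]  # (attribute vector, weight)


def shrink_and_sort(items: List[Weighted], target_size: int) -> List[Weighted]:
    """Merge the closest vectors until target_size remain, then sort by weight (descending)."""
    items = [(list(v), w) for v, w in items]
    while len(items) > max(target_size, 1):
        # Find the pair of vectors with the smallest Euclidean distance (assumed measure).
        i, j = min(combinations(range(len(items)), 2),
                   key=lambda p: math.dist(items[p[0]][0], items[p[1]][0]))
        (vi, wi), (vj, wj) = items[i], items[j]
        # Assumed merge rule: weighted mean of the two vectors; weights are summed.
        merged = [(a * wi + b * wj) / (wi + wj) for a, b in zip(vi, vj)]
        items = [it for k, it in enumerate(items) if k not in (i, j)]
        items.append((merged, wi + wj))
    return sorted(items, key=lambda it: it[1], reverse=True)


# Three 2-dimensional toy vectors reduced to two; the first two are closest.
toy_items = [([0.0, 0.0], 1), ([0.1, 0.0], 2), ([1.0, 1.0], 1)]
shrunk = shrink_and_sort(toy_items, target_size=2)
```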

The adjective set similarity calculation unit 130 then matches the attribute vectors in the two second attribute vector sets one-to-one to generate at least one attribute vector pair (S153). In the embodiment of the present invention, the attribute sets are treated as nodes on a graph, and the two attribute vector sets are matched by the following matching method.

First, let the two words be A and B, and suppose that the second attribute vector set for word A and the second attribute vector set for word B each contain 5 attribute vectors.

After the weight sort, the first attribute vector of word A is the one with the highest weight. Starting from this first attribute vector, an attribute vector similar to it is sought among the attribute vectors of word B. If a similar attribute vector exists among those of word B, the two attribute vectors are connected. The attribute vector of word B that is similar to the first attribute vector of word A is found by measuring the distance between attribute vectors. Methods of measuring the distance between vectors and of finding the nearest attribute vector are well known, so a detailed description is omitted in the embodiment of the present invention.

Once attribute vectors of the two words are connected, no other attribute vector is connected to them, so that the matching remains one-to-one. This procedure is repeated to connect the attribute vectors of word A and word B into attribute vector pairs, and the similarity between the attribute vector sets of words A and B is then calculated using the weights of the attribute vector pairs. The similarity between the attribute vector sets is calculated by Equation 1 below.
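The matching procedure described above can be sketched as a greedy one-to-one assignment. This is a sketch under stated assumptions: both sets are already sorted by descending weight, similarity is judged by Euclidean distance, and each vector of word A is always connected to the nearest still-unmatched vector of word B (the description does not say whether a distance cutoff applies). The function name greedy_match is hypothetical.

```python
import math
from typing import List, Tuple

Weighted = Tuple[List[float], int]  # (attribute vector, weight)


def greedy_match(set_a: List[Weighted], set_b: List[Weighted]) -> List[Tuple[int, int]]:
    """Pair each vector of A (highest weight first) with the nearest unmatched vector of B."""
    unmatched_b = list(range(len(set_b)))
    pairs = []
    for i, (vec_a, _) in enumerate(set_a):
        if not unmatched_b:
            break
        # Nearest remaining vector of B by Euclidean distance (assumed measure).
        j = min(unmatched_b, key=lambda k: math.dist(vec_a, set_b[k][0]))
        pairs.append((i, j))
        unmatched_b.remove(j)  # a matched vector may not be connected again
    return pairs


set_a = [([1.0, 0.0], 3), ([0.0, 1.0], 1)]
set_b = [([0.0, 0.9], 2), ([0.9, 0.1], 1)]
matches = greedy_match(set_a, set_b)
```

A greedy pass is order-dependent, which is why the weight sort matters: the highest-weight vector of word A gets first choice among the vectors of word B.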

[Equation 1 is not reproduced in this text.]

Here, $\mathrm{sim}_{av}(i, j)$ denotes the similarity between attribute vectors i and j. It is calculated by an expression that is likewise not reproduced in this text, in which $w_i$ and $w_j$ denote the weights of attribute vectors i and j.

When all the attribute vectors of the two words have been connected one-to-one in this way to form at least one attribute vector pair, the similarity of each of the one-to-one connected attribute vector pairs is calculated and the average is taken (S154).
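The averaging of step S154 can be sketched as below. Because the expression for the pair similarity is not reproduced in this text, the sketch substitutes plain cosine similarity and ignores the weights; this is an assumption made only to show the averaging, not the weighted formula of the embodiment. The function names cosine and average_pair_similarity are hypothetical.

```python
import math
from typing import List, Tuple

Weighted = Tuple[List[float], int]  # (attribute vector, weight)


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_pair_similarity(set_a: List[Weighted], set_b: List[Weighted],
                            pairs: List[Tuple[int, int]]) -> float:
    """Average the (assumed cosine) similarity over the matched vector pairs."""
    if not pairs:
        return 0.0
    sims = [cosine(set_a[i][0], set_b[j][0]) for i, j in pairs]
    return sum(sims) / len(sims)


set_a = [([1.0, 0.0], 3), ([0.0, 1.0], 1)]
set_b = [([0.0, 1.0], 2), ([1.0, 0.0], 1)]
aligned = average_pair_similarity(set_a, set_b, [(0, 1), (1, 0)])
crossed = average_pair_similarity(set_a, set_b, [(0, 0), (1, 1)])
```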

Finally, the attribute vector similarity for the two second attribute vector sets is calculated considering only the attribute vector pairs that form a complete bipartite matching between the two sets (S155). The attribute vector similarity in step S155 can be calculated by various methods, so a detailed description is omitted in the embodiment of the present invention.

Returning to FIG. 2, when the attribute similarity between the two second attribute vector sets has been calculated in step S150, the semantic similarity calculation unit 140 calculates the semantic similarity between the two nouns (S160). The semantic similarity between two nouns is calculated by considering not only the similarity between the word embeddings of the nouns but also the similarity between their attribute vector sets.

To calculate the semantic similarity, the semantic similarity calculation unit 140 uses Equation 2 below.

$$\mathrm{sim}_{word}(A, B) = \mathrm{sim}_{WE}(A, B) + \alpha \cdot \mathrm{sim}_{attr}(A, B)$$

Here, $\mathrm{sim}_{WE}(A, B)$ denotes the cosine similarity between words A and B in the vector-space word representation, and $\alpha$ denotes an empirically determined coefficient.
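The combination stated in the claims, sim_word(A, B) = sim_WE(A, B) + α * sim_attr(A, B), can be sketched as below. The sketch assumes that sim_attr is the attribute similarity already obtained in step S150 and is passed in as a number; the value alpha=0.5 is an arbitrary placeholder, since the text says only that the coefficient is determined empirically. The function names are hypothetical.

```python
import math
from typing import List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two word embeddings; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_similarity(emb_a: List[float], emb_b: List[float],
                        sim_attr: float, alpha: float = 0.5) -> float:
    """sim_word = sim_WE + alpha * sim_attr, with sim_WE as embedding cosine similarity."""
    return cosine(emb_a, emb_b) + alpha * sim_attr


# Toy 2-dimensional embeddings and an attribute similarity of 0.4 (illustrative values).
score = semantic_similarity([1.0, 0.0], [1.0, 0.0], sim_attr=0.4, alpha=0.5)
```

Because the two terms are added rather than averaged, the combined score is not confined to the range of a cosine similarity; any rescaling would be a further design choice not covered by the text.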

In this way, using adjectives to measure the semantic similarity between nouns yields better performance in determining the semantic similarity between words.

Although embodiments of the present invention have been described in detail above, the scope of the present invention is not limited thereto. Various modifications and improvements made by those skilled in the art using the basic concept of the present invention as defined in the following claims also fall within the scope of the present invention.

Claims

1. A method of calculating semantic-based noun similarity, the method comprising:
identifying an attribute vector map for a plurality of adjectives respectively included in a plurality of noun phrases extracted from a sentence or document;
generating a plurality of first attribute vector sets based on a plurality of nouns respectively included in the plurality of noun phrases;
generating a plurality of second attribute vector sets by looking up, in the attribute vector map, the attribute vector corresponding to each of at least one adjective included in the generated first attribute vector sets, and representing the adjectives included in the first attribute vector sets as their respective attribute vectors;
calculating an adjective similarity using the two second attribute vector sets for two nouns, based on the generated second attribute vector sets; and
calculating a semantic similarity of the two nouns based on the calculated adjective similarity.

2. The method of claim 1, wherein generating the first attribute vector sets comprises:
identifying, based on the nouns included in the plurality of extracted noun phrases, at least one adjective that modifies an arbitrary noun in the plurality of noun phrases; and
generating the identified at least one adjective as the first attribute vector set for the arbitrary noun.

3. (Deleted)

4. The method of claim 1, wherein the attribute vector map includes an adjective, an adjective category for the adjective, a probability distribution graph for the adjective category, and a plurality of probability values.

5. The method of claim 4, wherein a vector value for the adjective is formed based on the plurality of probability values.

6. The method of claim 1, further comprising, after generating the second attribute vector sets:
filtering the plurality of attribute vectors included in a second attribute vector set so that only attribute vectors having a probability value equal to or greater than a preset reference value are included in the second attribute vector set.

7. The method of claim 6, further comprising, after generating the second attribute vector sets:
classifying the attribute vectors in a second attribute vector set into groups of attribute vectors of similar form; and
selecting, from among the classified attribute vector groups, an attribute vector group that includes a large number of attribute vectors.

8. The method of claim 1, wherein calculating the adjective similarity comprises:
forming, for each of the two second attribute vector sets of the two nouns, the second attribute vector set so that the plurality of attribute vectors in the second attribute vector set have the same value;
making the two second attribute vector sets, in which the attribute vectors have the same value, equal in size;
generating at least one attribute vector pair by one-to-one matching between the attribute vectors in the two second attribute vector sets of equal size;
calculating a similarity for the at least one attribute vector pair; and
calculating an attribute vector similarity for the two second attribute vector sets based on the similarities of the attribute vector pairs.

9. The method of claim 8, wherein forming the second attribute vector set so that the attribute vectors have the same value comprises:
checking whether the same attribute vector is included repeatedly in the second attribute vector set; and
if an attribute vector is included repeatedly, keeping only one instance of that attribute vector and weighting it by the number of deleted attribute vectors.

10. The method of claim 9, wherein making the two second attribute vector sets equal in size comprises:
identifying the size of each of the two second attribute vector sets, the size of a second attribute vector set being the number of attribute vectors included in it;
if the two sizes differ, merging attribute vectors so that the larger second attribute vector set has the size of the smaller second attribute vector set; and
sorting the attribute vectors in the second attribute vector sets of equal size according to their weights.

11. The method of claim 8, wherein the attribute vector similarity for the two second attribute vector sets is calculated by
[equation not reproduced in this text]
where $\mathrm{sim}_{av}(i, j)$ denotes the similarity between attribute vectors i and j, given by
[equation not reproduced in this text]
and $w_i$ and $w_j$ are the weights of attribute vectors i and j.

12. The method of claim 1, wherein the semantic similarity of the two nouns is calculated by
$$\mathrm{sim}_{word}(A, B) = \mathrm{sim}_{WE}(A, B) + \alpha \cdot \mathrm{sim}_{attr}(A, B)$$
where $\mathrm{sim}_{WE}(A, B)$ denotes the cosine similarity between words A and B in the vector-space word representation, and $\alpha$ denotes an empirically determined coefficient.

13. An apparatus for calculating semantic-based noun similarity, the apparatus comprising:
a noun phrase extraction unit that extracts at least one noun phrase consisting of an adjective and a noun from a sentence or document;
an attribute vector storage unit that identifies, among prestored attribute vector maps of adjectives, the attribute vector maps corresponding to a plurality of adjectives included in the noun phrases extracted by the noun phrase extraction unit;
an adjective set similarity calculation unit that generates, from the at least one noun phrase, an attribute vector set by collecting the attribute vectors of the adjectives modifying the same noun, and calculates an adjective similarity using the two attribute vector sets for two nouns; and
a semantic similarity calculation unit that calculates a semantic similarity of the two nouns based on the adjective similarity calculated by the adjective set similarity calculation unit.

14. The apparatus of claim 13, wherein the adjective set similarity calculation unit converts the adjectives of a first attribute vector set into attribute vectors according to the attribute vector maps identified by the attribute vector storage unit, and generates the attribute vector sets.

15. The apparatus of claim 14, wherein the adjective set similarity calculation unit includes in the attribute vector set only those attribute vectors, among the adjectives in the attribute vector set, that have a probability value equal to or greater than a preset reference value.

16. The apparatus of claim 14, wherein the adjective set similarity calculation unit sets a weight for each attribute vector so that the attribute vectors in the two attribute vector sets of the two nouns whose similarity is to be calculated have the same value, makes the two attribute vector sets containing the weighted attribute vectors equal in size, and generates at least one attribute vector pair by matching the attribute vectors in the two attribute vector sets of equal size.

17. The apparatus of claim 16, wherein the adjective set similarity calculation unit calculates the similarity of each of the at least one attribute vector pair and then calculates the average.

18. The apparatus of claim 16, wherein the attribute vector similarity for the two attribute vector sets is calculated by
[equation not reproduced in this text]
where $w_i$ and $w_j$ are the weights of attribute vectors i and j.

19. The apparatus of claim 13, wherein the semantic similarity calculation unit calculates the semantic similarity of the two nouns by
$$\mathrm{sim}_{word}(A, B) = \mathrm{sim}_{WE}(A, B) + \alpha \cdot \mathrm{sim}_{attr}(A, B)$$
where $\mathrm{sim}_{WE}(A, B)$ denotes the cosine similarity between words A and B in the vector-space word representation, and $\alpha$ denotes an empirically determined coefficient.